Review for NeurIPS paper: Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection

NeurIPS 2020

Review 1

Summary and Contributions: This paper identifies two problems in existing detectors. First, the current classification loss does not take the quality of a prediction (whether it sufficiently overlaps with the ground truth) into account. Second, the current regression loss does not allow for ambiguity and uncertainty in complex scenes because of the inflexible Dirac delta distribution assumption, which admits only one correct value. The authors propose an IoU loss function that trains the network to estimate how well a prediction overlaps with the ground truth. They also propose a regression function that does away with the Dirac delta assumption and allows the network to learn the probabilities of values around the ground-truth value. The authors design both loss functions by extending the focal loss.
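For concreteness, here is a minimal sketch of what extending focal loss to a continuous quality target can look like. It assumes the quality focal loss has the form -|y - σ|^β ((1 - y) log(1 - σ) + y log σ), where σ is the predicted score and y is the IoU target (0 for negatives). The function name, the default `beta=2.0`, and the `eps` guard are illustrative assumptions, not the authors' implementation.

```python
import math

def quality_focal_loss(sigma, y, beta=2.0, eps=1e-12):
    """Per-sample loss for a predicted score sigma in (0, 1) and a soft target y in [0, 1]."""
    # Cross-entropy against a continuous (soft) label y instead of a hard 0/1 label.
    ce = -((1.0 - y) * math.log(1.0 - sigma + eps) + y * math.log(sigma + eps))
    # Modulating factor: shrinks as the prediction approaches the target.
    return abs(y - sigma) ** beta * ce

# A negative sample (y = 0) and a positive whose box has IoU 0.8 with its ground truth.
print(quality_focal_loss(0.1, 0.0))
print(quality_focal_loss(0.3, 0.8))
print(quality_focal_loss(0.8, 0.8))  # prediction equals target, so the factor is zero
```

With a hard label y = 1 and beta = 2 this sketch reduces to the usual focal term (1 - σ)^2 · (-log σ), which is the sense in which it generalizes the original loss.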

Strengths: I like the idea of training the network to directly predict the overlaps between the bounding boxes and the ground truths. This better fits post-processing steps such as non-maximum suppression, which are common in modern state-of-the-art object detectors. The authors clearly show how they derive the generalized version of focal loss used to train their network. The authors show that the proposed loss function improves the performance of ATSS from 47.7% to 48.2% on the COCO test-dev dataset.

Weaknesses: The idea of predicting IoUs between predictions and ground truths has already been explored by other works. It seems to me that the main differences between this and prior works are: 1) Prior works such as IoU-Net still have a separate branch for classification, but this one does not. 2) This work uses a new loss function, a generalized version of focal loss, to train the network to predict IoUs. The contribution of each modification to the final performance is unclear. Does the separate branch for classification hurt the performance? What if we simply drop the separate classification branch in prior works? Is the new loss function essential to the success of the proposed approach? Can the network be trained with another loss, such as a simple regression loss (setting the regression target to zero for negative samples)? Understanding the contribution of each modification would allow us to better understand the significance of this work. Otherwise, the proposed modification seems to be merely incremental.
The quality of the predicted IoUs is crucial to the performance of the detector. The authors should provide an analysis of how well the predicted IoUs match the actual IoUs between the predicted boxes and the ground truth. They can do so by calculating the correlation between the predicted and the actual IoUs.
Some details of the localization loss function (i.e. DFL) are unclear. What does discretizing the range with even intervals delta (lines 166-167) mean? My understanding of DFL is that the network predicts n values and a likelihood for each value. It then calculates the expected value (eq. 5). What role does the delta play in this equation? Also, what does the regression target mean in Fig 5c? How do the authors obtain the regression target?
The second loss function seems to allow the network to fit the distribution of the annotations more closely, and hence the network may not generalize well to data from an unseen distribution. Having said that, I do not consider this to be a major weakness of the approach, and it is not reflected in my rating. I just hope the authors can characterize the proposed loss function further if possible.
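To make the question about delta concrete, the following is a minimal sketch of one possible reading of the discretized representation: the range [y_min, y_max] is cut into n even intervals of width delta, the network outputs a probability for each of the n + 1 grid values, and the box offset is the expectation over that grid. The range 0 to 16, n = 16, and the helper names are assumptions chosen for illustration, not details confirmed by the paper.

```python
def expectation(probs, y_min=0.0, y_max=16.0):
    """Expected value of a discrete distribution over n + 1 evenly spaced grid values."""
    n = len(probs) - 1
    delta = (y_max - y_min) / n  # even interval between neighbouring grid values
    return sum(p * (y_min + i * delta) for i, p in enumerate(probs))

def two_bin_distribution(y, y_min=0.0, y_max=16.0, n=16):
    """Put all mass on the two grid values nearest to y, weighted by linear interpolation."""
    delta = (y_max - y_min) / n
    i = min(int((y - y_min) // delta), n - 1)  # index of the left neighbour
    y_left = y_min + i * delta
    probs = [0.0] * (n + 1)
    probs[i + 1] = (y - y_left) / delta
    probs[i] = 1.0 - probs[i + 1]
    return probs

# A sharp distribution around a target of 7.3 ...
sharp = two_bin_distribution(7.3)
# ... and a flatter, hand-picked one with the same mean.
spread = [0.0] * 17
spread[6:10] = [0.1, 0.55, 0.3, 0.05]

print(expectation(sharp), expectation(spread))
```

Under this reading, delta only fixes where the grid values sit. Both distributions above have the same expectation, so a loss on the expected value alone cannot tell them apart, which may be why a separate term on the distribution's shape is introduced.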

Correctness: Yes. Please see strengths.

Clarity: The part about localization can be improved. The definitions of some of the terms are unclear, which makes the paper difficult to follow.

Relation to Prior Work: No. Please see weaknesses.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: In this paper, the authors propose to predict the IoU instead of a pre-assigned class label and to regress the bbox parameters in a classification manner. The authors also unify the two kinds of loss functions into one form. Overall, the idea is novel and the experiments are sufficient. The weakness of this paper is that the improvements over several baselines are marginal. But considering that there is almost no extra cost, this is still acceptable.

Strengths: Important findings with an acceptable solution and sufficient experiments.

Weaknesses: I have some concerns about the paper:
1. Why is focal loss used in regression tasks? Focal loss is known for addressing the class imbalance problem. It has lower gradients on easy samples, which is a good property for classification. But when regressing the IoU, a lower weight for easy samples may lead to inaccurate predictions. The paper gives me the feeling that the authors only wanted a unified form and did not consider the difference between the classification and regression tasks.
2. In [1], the predicted variance of the bbox parameters is used for NMS. The algorithm in this paper also produces a bbox confidence (the sum of two neighbouring probabilities). Could it benefit the NMS?
3. The DFL is very similar to softargmax, which is widely used in keypoint detection. However, citations of this line of research are lacking. Please give some credit to the authors of keypoint detection papers such as [2] and others. A problem with softargmax is gradient imbalance. The form of softargmax is \sum p(x)x. The gradient for p(x) with x=10 is 10 times larger than the gradient at x=1. This means that the DFL puts more weight on big objects, while the difficult problem in detection is usually the small objects.
Overall, the idea of this paper is good but some of the details are still coarse.
[1] He Y, Zhang X, Savvides M, et al. Softer-NMS: Rethinking bounding box regression for accurate object detection. arXiv preprint arXiv:1809.08545, 2018.
[2] Nibali A, He Z, Morgan S, et al. Numerical coordinate regression with convolutional neural networks. arXiv preprint arXiv:1801.07372, 2018.
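The arithmetic behind the gradient-imbalance remark in point 3 can be checked numerically. The sketch below treats the probabilities as free variables, ignoring the softmax normalization a real network would apply, and estimates the gradient of \sum p(x)x with respect to each p(x) by finite differences. The 17-value grid and the function names are assumptions for illustration only, and the sketch says nothing about how the paper's loss actually uses the expectation.

```python
def soft_argmax(probs, values):
    """Expectation sum_x p(x) * x over a fixed grid of values."""
    return sum(p * x for p, x in zip(probs, values))

def grad_wrt_probs(probs, values, h=1e-6):
    """Central finite-difference gradient of soft_argmax with respect to each p(x)."""
    grads = []
    for i in range(len(probs)):
        up = list(probs)
        up[i] += h
        down = list(probs)
        down[i] -= h
        grads.append((soft_argmax(up, values) - soft_argmax(down, values)) / (2 * h))
    return grads

values = [float(x) for x in range(17)]  # grid 0, 1, ..., 16
probs = [1.0 / 17] * 17                 # uniform starting distribution
grads = grad_wrt_probs(probs, values)

# The gradient for p(x) is x itself, so the entry at x=10 is ten times the entry at x=1.
print(grads[1], grads[10])
```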

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: See above
===============================
The rebuttal addresses part of my concerns, but not fully. I will still keep my score borderline towards accept on this paper.

Review 3

Summary and Contributions: This paper works on improving the existing one-stage detectors FCOS and ATSS. Two contributions are proposed: 1. merging the centerness head and the classification head with a continuous focal loss; 2. changing the regression representation from a single float to an integral over 16 bins. Both contributions bring ~0.6 mAP improvements on COCO under different settings, and the best performance is improved from 47.7 (ATSS) to 48.2.

Strengths:
+ The quality focal loss idea is neat. It removes a redundant component from FCOS and ATSS for free with a slight performance improvement (+0.3 mAP, 3rd and last rows of Table 1(a)).
+ The overall performance (48.2 mAP on COCO) is strong and healthy.
+ The paper is well written and easy to follow. All figures are clear with comprehensive captions.
+ The reviewer appreciates that the authors included code in the submission, even though the code contains Google Drive links that expose the identity of one of the authors.

Weaknesses:
- While the reviewer appreciates the contribution of the quality focal loss, the contribution of the distribution focal loss looks far less exciting. It seems to just change the regression representation from a simple float to a complex integral over bins. First of all, this is not new and has been studied in the human pose estimation community (e.g., Integral human pose estimation / LCRNet). Also, it makes a simple, straightforward representation complex and slower, and has nothing to do with the focal loss idea. The reviewer is not convinced enough to adopt this integral representation in his detector, given the minor improvement (+0.3~0.6 mAP) and its costs.
- It is interesting that changing from FCOS centerness to IoU brings a 0.2~0.4 mAP improvement. However, this is not highlighted in the paper and the details are hidden in the supplementary. The reviewer finds Table 3 a bit misleading, as it suggests that the improvement from QFL is 0.7 mAP. However, 0.4 of that comes from changing the FCOS centerness to IoU, which is not the core idea of the proposed continuous focal loss.
- Is Figure 3 a real output or just an illustration? The reviewer doubts that the real outputs will be as clean as shown.
- The speed-accuracy trade-off improvement in Figure 8 and Table 4 is unclear to the reviewer. E.g., why is X-101-DCN (10 FPS) faster than ATSS (6.9 FPS)? Should it not be slower than ATSS or FCOS due to the distribution focal loss? The removal of centerness should have a minor effect on runtime as it is only a single layer, if understood correctly.
- It would be beneficial to show multi-scale testing results as well, to show that the proposed method can really push the state-of-the-art number.

Correctness: There are some slightly inaccurate claims, as mentioned in the weaknesses. The rest of the conclusions are fair as far as the reviewer can assess.

Clarity: Yes.

Relation to Prior Work: The reviewer feels the proposed quality focal loss based on the IoU map has some connections to the modified focal loss used in CornerNet[14] and CenterNet[6]. It would be interesting to have a discussion of that.

Reproducibility: Yes

Additional Feedback: Here is how the reviewer justifies the rating: on the one hand, the reviewer likes the continuous focal loss idea and the final performance. On the other hand, the "valuable improvement" is only 0.3 mAP, as discussed in the weaknesses. The reviewer feels this is too marginal for a NeurIPS publication. However, the reviewer is happy to raise the score if the authors find a considerable misunderstanding in the review.
====================
The rebuttal resolved some of my concerns and confusions: the run time, Fig. 3, and the multi-scale testing results. However, my complaint about the cumbersomeness of DFL remains, and the overall improvements are still not exciting. Overall, I don't have a strong objection to accepting the paper (I would choose a neutral borderline if there were such an option). If the paper is accepted, I highly recommend that the authors make the speed claim clear and add the analysis of the IoU head to the main paper.

Review 4

Summary and Contributions: The paper proposes an approach for object class detection based on existing neural network architectures, but using a novel loss function (generalized focal loss) that combines classification and localization score in a particular way and furthermore represents localization uncertainty explicitly in a non-parametric distribution estimate. Experiments are conducted on the MS COCO detection task and encompass comparison to state of the art results as well as a number of ablations and applications of the proposed loss to existing detection architectures.

Strengths:
+ The proposed loss function is relatively well motivated and intuitive. It can be applied as an add-on to existing detection architectures.
+ The proposed loss function formulation has a moderate degree of novelty.
+ The experimental results indicate that the proposed loss function and its ablations are indeed effective and outperform the respective baselines by moderate margins of up to 1.5 percentage points in AP.
+ The experiments contain a plethora of ablations and applications of the proposed loss to existing detectors.
+ The paper makes an effort to illustrate the introduced ideas with diagrams and visualizations.

Weaknesses:
- The presentation of the paper could certainly be improved. It contains a multitude of unusual formulations and grammatical oddities (e.g., the very first sentence in the introduction; understanding, however, is not hampered too much by this).
- The title is quite misleading: 'qualified and distributed bounding boxes', at least for me, set expectations quite different from what the paper turned out to be about, which is estimating both the quality of and distributions over (bounding box) localization.
- The abstract contains too many details of the proposed method and could be shortened.

Correctness: The introduced loss seems plausible on a high level, even though I did not verify every equation explicitly.

Clarity: The paper has some issues in presentation that should be addressed in case a final version were to be prepared.

Relation to Prior Work: The paper gives mostly adequate references to prior work, even though the related work Sect. 2 is on the shorter end of the spectrum.

Reproducibility: Yes

Additional Feedback: I have read the authors' rebuttal, which addresses my concerns about the chosen title. I have hence decided to stick with my original rating '6: marginally above ...'.