Reviews: FreeAnchor: Learning to Match Anchors for Visual Object Detection

UPDATED post rebuttal: Thanks to the authors for addressing all my points. I am raising my score to seven. The authors begin by noting that many existing object detection pipelines include a step on 'anchor assignment', where from a large set of candidate bounding boxes (or "anchors") in a generic image frame, the one that best matches the ground truth bounding box, as measure by IoU, is chosen to be the one that is used for training, ie the object detection and bounding box regression outputs for that anchor will be pushed towards the ground truth. The authors note that for objects which don't fill the anchor well (slim objects oriented diagonally, objects with holes, or occluded objects) the best anchor according to this IoU comparison may be actively bad for training as a whole. The authors propose "learning to match", ie producing a custom likelihood which promotes both precision and recall of the final result (making reference to terms from the traditional loss function). For each ground truth bounding box, a 'bag of anchors' is selected by ranking IoU and picking the best n. During training, a different bounding box is selected from this bag for each object, for each backwards pass. Which one is chosen depends on the current state of training - as demonstrated in Figure 4, the confidence gradually increases for certain anchors. The Mean-Max function means that at the start of training, many of the anchors in the bag will be selected, but over time a single best one will come to dominate. I do not have the relevant background to confidently assess originality / quality / significance in this subfield. The results seem impressive, to do a drop-in replacement of the loss function and get multiple percent increases on difficult categories (figure 5 right) with negligible impact on other classes is a good result. Figure 6 is a nice result as well. All necessary experimental details seem to be present. The paper seems to be well written, and for those working with these kind of models I'm fairly confident that trying out these changes would be simple, given the information in the paper. Minor points: * L16: the citations given for algorithms which incorporate anchor boxes includes [5] and [6], which are R-CNN and Fast R-CNN - neither of these papers includes the term 'anchor', I believe that first came into that line of work as part of Faster R-CNN. * In the figure 1 caption, (top) and (bottom) would read better to me than (up) and (down). * L145: "The anchor with the highest confidence is not be suitable for detector training" - remove the extra word "be" * Algorithm 1 - the anchor bag construction is implied to happen on every forwards pass - but if it just depends on the IoU between the anchor and the ground truth bounding box, presumably this could be done once before the training loop and cached? * Algorithm 1 - "super-parameter" - I'm pretty sure I have not come across this term before, and google search says "did you mean hyperparameter"... * Algorithm 1 - Backward propagation - I think it's way more standard to do something like $\theta^{t+1} = \theta^t - \lambda \nabla_{\theta^t}L(\theta^t)$. You are presumably not solving the exact argmin problem, and it takes approximately as much space to write. * Figure 4 - Firstly, I would recommend noting in the caption that the laptop is relatively low contrast and encouraging readers to zoom in. I am viewing in color and it was on the 3rd pass through the paper that I realised the green bounding box does actually capture the laptop very well - I basically didn't notice half the laptop and it's easy to assume that this is some kind of failure case where it's incorrectly focused on the cat. Given this realisation, I'm still not sure if this image shows as much as it could - obviously there is progression to the right as the center anchors get redder, but I don't really know what actual spatial extent those anchors represent. It's also unclear why more than these 16 were not represented (the minimum anchor bag size that is mentioned is 40) - presumably the confidence goes really low much further away, but could we then see this? Perhaps it might be more interesting to show, in separate images, the actual anchor extents for the final most confident and least confident. My intuition is that one of them would clearly have a better IoU with the true bounding box, but the eventual higher confidence one would focus more on the pixels that are a laptop - am I right about this? * L172 - I am intrigued as to why this value for a bias initialization (presumably the rest of the convolutional biases are initialized at zero as normal). Can you provide some justification as to why this formula is used? I would also recommend not using $\pi$, as that already has a well known scalar interpretation.

Overall, the introduced method is interesting and novel, however, the paper is not yet in a state to be acceptable for publication. Some parts of the paper appear too ad-hoc and it is not not always clear what the contribution of the introduced method is. More details below: - The third and fourth paragraph of Section 1 appear convoluted and are difficult to follow. It should be more clear what the motivation of the introduced technique is. - Section 2: it is not sufficiently explained how the presented approach differs from existing methods, e.g. such as FoveaBox or MetaAnchor. This should clarified. - The text states 'The proposed approach breaks the IoU restriction, allowing objects to flexibly select anchors ...". However, in Section 3, it then states 'a hand-crafted criterion based on IoU is used.' This is confusing. What for was the IoU criterion used? What is similar/different to existing approaches. - a^{loc}_j is defined to be in R4. Which dimensions are defined here? It later becomes clear from the context, but it should be more formally defined. - The definitions in lines 101, 102 are ambiguous and should be clarified. How is SmoothL1 defined? - 'bg' as e.g. in \mathcal(\theta)^{bg} is not defined. - Why do \mathcal{P}(\theta)^{cls}_{ij} \mathcal{P}(\theta)^{bg}_{j} together define class confidence? - Lines 110 - 113: If Eq. 2 strictly considers the optimization of classification and localization of anchors but it ignores how to learn the matching matrix C what is the conclusion here? Is this a limitation? - L121: 'it requires to guarantee...'. What requires to guarantee? - Some equations (e.g. Eq. 4) are not necessary to explain the contribution of the paper and should be moved to the Appendix. This would help to sharpen the presentation of the contribution. Similarly for Eq. 5 and 6. - It is not clear what is the relationship of the Mean-max function and insufficient training. Why will it be close to the Mean when training is not sufficient? What is X? It becomes clear later in the text, but as presented this appears too ad-hoc. - Figure 2 and 3 could be moved to the Appendix to make room for more important figures. - What is meant by 'inheriting the parameters' (L156)? - L200: Is is not clear how a 'bag' of anchors was constructed and how they where selected. - The quantitative evaluation discussed in Section 4.2 (and the qualitative results in Supplementary Material) is interesting. However, it would have been more informative to also provide the average performance across all categories. Also, it is not clear why 'couch' is considered a slender object. - The results presented in Table 3 are interesting but the presented approach only marginally improve upon existing work or not at all. The cases for which it works well and does not are not sufficiently discussed. - No limitations are discussed. In which situations does the approach fail, also compared to existing work?

Paper ID:	81
Title:	FreeAnchor: Learning to Match Anchors for Visual Object Detection

Reviewer 1

Reviewer 2

Reviewer 3