Review for NeurIPS paper: HyNet: Learning Local Descriptor with Hybrid Similarity Measure and Triplet Loss

NeurIPS 2020

HyNet: Learning Local Descriptor with Hybrid Similarity Measure and Triplet Loss

Review 1

Summary and Contributions: This work addresses learning local patch descriptors with deep networks. The area has been explored extensively in recent years and has several well-established, well-performing methods. Compared to those methods, the authors propose a few incremental contributions: (i) a hybrid similarity measure in place of the inner-product or L2-distance-based similarity; (ii) a regularization term; (iii) a novel network architecture. The resulting deep local descriptors achieve state-of-the-art performance on all well-established evaluation benchmarks.

Strengths:
+ I really like the "Gradient Analysis" section, which properly motivates the incremental contributions of this work. In fact, I find this part to be the strongest contribution of the paper, and it should be listed as a main one.
+ Experiments are performed on all well-established benchmarks, with comparisons against the relevant related work and state-of-the-art approaches.
+ The ablation study gives a really nice overview of the performance of each contribution/part, and beyond.
+ This work clearly sets a new state of the art.

Weaknesses:
- The claimed contributions are somewhat weak. In fact, I think the main contribution is the gradient analysis (as I mentioned in Strengths), and all three claimed contributions can be bundled into one: "incremental improvements of previous works". Namely:
  C1: L2 regularization is interesting, but the ablation study shows it has the smallest effect on performance;
  C2: the hybrid similarity measure is a simple combination of two established similarity measures, and it adds another hyper-parameter (alpha) that seems to be very sensitive to the setup (Fig. 5(a) shows that a lower alpha reduces performance by 1 mAP, and a higher one by 0.5 mAP);
  C3: the novel architecture is almost identical to the architecture of [21,22] with the addition of the FRN block from [36] (the ablation study shows this gives the largest performance increase), so it is more a practical combination of previous work than an actual contribution.
Minor:
- Color-code Tab. 2 with the 1st, 2nd and 3rd best result for each column to make it easier for the reader to follow; it is a pretty big and hard-to-read table.

Correctness: Yes, the claims and the experimental section seem to be in order. I have some remarks about the contribution claims, which I detail in Strengths and Weaknesses.

Clarity: Paper is well written and easy to read.

Relation to Prior Work: To a careful reader, the connection to prior work is clearly stated.

Reproducibility: Yes

Additional Feedback: I have some problems with the paper's contributions, as they seem too incremental and too weak. However, I find the analysis that supports the design choices very interesting, and the results speak for themselves. Because of that, I feel the paper is of interest to the general public and the local descriptor learning community, and I am leaning towards acceptance. ** After reading the rebuttal and all reviews, I am comfortable with my original rating and will keep it. **

Review 2

Summary and Contributions: This paper deals with the problem of learning local image descriptors using deep networks. The paper advocates using 1) L2 normalization for the final descriptors; 2) a hybrid similarity formed as a weighted combination of the L2 distance and the cosine similarity; 3) filter response normalization (FRN) after each layer of the CNN instead of batch normalization or instance normalization. A triplet loss function is then adopted for learning the descriptors. While the hybrid similarity is partially motivated by a careful analysis of the gradients of both similarities/distances, the other settings are more empirical or standard. Empirical experiments are conducted on both patch datasets (UBC and HPatches) and 3D reconstruction datasets (ETH).
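To make the second and fourth ingredients above concrete, here is a minimal NumPy sketch of a triplet loss built on a weighted combination of cosine similarity and L2 distance. The exact functional form, the function names, and the default `alpha` and `margin` values are assumptions made for illustration; they are not taken from the paper and may differ from its formulation (for example, in normalization or in how hard negatives are mined).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def hybrid_dissimilarity(a, b, alpha=1.0):
    """Assumed illustrative form: alpha * (1 - cosine) + L2 distance.

    Expects unit-length descriptors; lower values mean more similar.
    """
    cos = np.sum(a * b, axis=-1)          # cosine similarity of unit vectors
    l2 = np.linalg.norm(a - b, axis=-1)   # Euclidean distance
    return alpha * (1.0 - cos) + l2

def triplet_loss(anchor, positive, negative, alpha=1.0, margin=1.0):
    """Mean hinge triplet loss computed on the hybrid dissimilarity."""
    d_pos = hybrid_dissimilarity(anchor, positive, alpha)
    d_neg = hybrid_dissimilarity(anchor, negative, alpha)
    return float(np.mean(np.maximum(0.0, margin + d_pos - d_neg)))

# Usage with random descriptors (128-D is a common choice, assumed here).
rng = np.random.default_rng(0)
anc, pos, neg = (l2_normalize(rng.normal(size=(8, 128))) for _ in range(3))
loss = triplet_loss(anc, pos, neg, alpha=2.0)
```

In this sketch `alpha` sets the weight of the cosine term relative to the L2 term, which is the role the reviews attribute to the paper's alpha hyper-parameter.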

Strengths:
+ The paper is clearly presented, and the motivation for using a hybrid similarity to balance learning between hard and easy samples is well developed.
+ The experimental results look solid on all three datasets.
+ The paper is a solid execution that combines largely known pieces for the task of learning local image descriptors.

Weaknesses: - The only weakness is that not much exciting new knowledge, in terms of learning, is revealed.

Correctness: The claims made are correct, and empirical methodology followed conventions on the adopted datasets.

Clarity: The paper is easy to read and well written.

Relation to Prior Work: The relation to prior work is clearly discussed.

Reproducibility: Yes

Additional Feedback: Overall, the paper is a solid execution, but I am afraid the new knowledge advanced here is rather limited. After reading the rebuttal, my rating remains the same. The paper is a good execution overall; it is just that the knowledge advancement is limited.

Review 3

Summary and Contributions: This paper proposes HyNet for learning local feature descriptors. The novelties are 1) a hybrid similarity measure, 2) L2 regularization, and 3) an architecture in which all intermediate feature maps are L2-normalized. These choices are justified by an analysis of L2 normalization in gradient-based learning. Results are presented for patch matching, image matching, and 3D reconstruction.

Strengths:
+ Detailed analysis of the effects of L2 normalization in gradient-based feature learning. The proposed loss is a natural follow-up to the analysis.
+ The empirical results are impressive; I believe these are the current SOTA results for local feature descriptors.
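As background for the L2-normalization analysis mentioned above, the following sketch illustrates a standard calculus fact rather than the paper's own derivation: backpropagating through y = x / ||x|| removes the gradient component parallel to x and scales the remainder by 1 / ||x||. The function name is hypothetical.

```python
import numpy as np

def grad_wrt_unnormalized(x, g):
    """Backpropagate g (gradient w.r.t. y) through y = x / ||x||.

    The Jacobian of the normalization is (I - y y^T) / ||x||, so the result
    is the part of g orthogonal to x, divided by the norm of x.
    """
    n = np.linalg.norm(x)
    y = x / n
    return (g - np.dot(g, y) * y) / n

x = np.array([3.0, 4.0])          # unnormalized descriptor, norm 5
g = np.array([1.0, 0.0])          # upstream gradient w.r.t. the unit vector
gx = grad_wrt_unnormalized(x, g)

# The gradient is orthogonal to x, and doubling x halves its magnitude.
orthogonal = abs(np.dot(gx, x)) < 1e-12
halved = np.allclose(grad_wrt_unnormalized(2 * x, g), gx / 2)
```

A consequence is that a gradient step on the unnormalized vector cannot shrink its norm to first order, and that descriptors with larger norms receive smaller updates.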

Weaknesses:
- Although the paper strongly advocates the novel loss design, the network architecture and FRN layers appear to play an important role. Which contributes more? I don't think the ablation study in Section 5 is quite sufficient to paint the whole picture. From Table 3, presumably (BN + s_H + R_L2) corresponds to using the L2Net architecture with the new loss, which achieves 52.04 mAP - marginally better than the 51.62 from SOSNet. I think a key entry missing from all the SOTA comparisons is the baseline L2Net architecture (or one that's as close as possible) + the proposed loss.
- Conversely, another interesting question is how the baselines (esp. HardNet and SOSNet) would perform if equipped with FRN normalization.
- L161: "... based on the analysis in Sec. 2.2 that, similarly to the output descriptors, L2 normalisation needs to be applied to the intermediate feature maps". I checked Sec. 2.2 and didn't find an explicit motivation for L2-normalizing all intermediate feature maps. Also, since FRN is employed anyway, is L2 normalization of the feature maps still really necessary?
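The last question can be illustrated with a short sketch of filter response normalization as it is usually defined: each channel is divided by the root mean square of its activations over the spatial dimensions. This is not the paper's implementation. The scalar `gamma`, `beta` and `tau` are a simplification (they are learned per-channel parameters in the published layer), and the (N, C, H, W) layout is an assumption.

```python
import numpy as np

def frn(x, gamma=1.0, beta=0.0, tau=None, eps=1e-6):
    """Filter response normalization on an (N, C, H, W) array."""
    # Mean squared activation per sample and channel over H and W.
    nu2 = np.mean(np.square(x), axis=(2, 3), keepdims=True)
    y = gamma * x / np.sqrt(nu2 + eps) + beta
    if tau is not None:
        # Optional thresholded linear unit: max(y, tau).
        y = np.maximum(y, tau)
    return y

rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 3, 4, 4))
out = frn(feat)

# With gamma=1 and beta=0, every channel ends up with L2 norm ~ sqrt(H * W).
channel_norms = np.sqrt(np.sum(np.square(out), axis=(2, 3)))
```

Under these assumptions, FRN already fixes the L2 norm of each channel up to a constant factor, which is one way to read the question of whether a separate L2 normalization of the feature maps adds anything.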

Correctness: The math in the analysis of L2 normalization is correct - my complaint is that calling both cosine similarity and L2 distance "similarity measures", and using the same kind of shorthand (s) for them, makes things confusing and should probably be avoided. As I commented in Weaknesses, the empirical methodology of presenting the final product that combines a novel loss and a novel architecture makes things less interpretable. It may not be "incorrect", but it renders the core contributions unclear.

Clarity: Regarding writing (and writing only), yes.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: The authors' response alleviated some of my concerns, but not all. First, confirming my suspicion, the additional evidence suggests FRN can improve all baselines (although the proposed loss achieves the most additional improvement on top). Thus it seems a fair comparison should be against SOSNet+FRN, HardNet+FRN, etc. in all the tables. Reporting the "final product" is fine for, e.g., a competition, but I was expecting more from a NeurIPS publication. Second, regarding the motivation for L2-normalizing _all_ intermediate feature maps, the authors simply referred back to Sec. 2.2, which is unsatisfactory. From the gradient analysis in the paper, I can clearly see an argument for L2-normalizing the _descriptors_, but generalizing to all feature maps felt like a "leap of faith". It's a tough decision as there's much to like about this paper, but my final score would be 5 (down from 6).