NeurIPS 2020

Fewer is More: A Deep Graph Metric Learning Perspective Using Fewer Proxies

Review 1

Summary and Contributions: Post Rebuttal: The authors addressed my main concerns, so I would like to upgrade my score. ------------------ The authors introduce graph classification into DML. The experiments show improvement on several benchmark tasks.

Strengths: The results show improvement over recent baseline models. The experiment including the ablation study is sound on several datasets.

Weaknesses: Using ground-truth labels to compute S^{POS} will lead to a mismatch between training and testing, since at test time no ground-truth labels are available. Even though the authors claim to introduce graph classification into DML, the actual use of graphs is unclear. The current version seems only to use top-k neighbors, which yields a trivial graph; there is no message passing between nodes. The difference between the learnable proxies and simpler methods such as clustering is also unclear.

Correctness: How S^{POS} is computed at test time is unclear.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: 1. My biggest concern is the use of S^{POS}. 2. Another is that the use of a graph is unnecessary: there seems to be no need to build the graph beyond taking the top-k neighbors. 3. The proxies are initialized randomly, so different proxies may converge to the same representation. Is there any method to hard-constrain against this, or is it not an issue in practice? A similar problem arises when directly using k-NN for machine learning.
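To make the top-k-neighbor point concrete, here is a minimal NumPy sketch of the kind of sparse graph the review describes: each sample is simply linked to its k most similar proxies, with no message passing. The function name, array shapes, and k value are illustrative, not taken from the paper.

```python
import numpy as np

def topk_subgraph(sim, k):
    """Keep only each row's k largest similarities, zeroing the rest.

    This is the trivial graph the review refers to: every sample is
    connected only to its top-k most similar proxies.
    """
    n = sim.shape[0]
    mask = np.zeros_like(sim)
    # indices of the k largest entries per row
    idx = np.argpartition(sim, -k, axis=1)[:, -k:]
    mask[np.arange(n)[:, None], idx] = 1.0
    return sim * mask

rng = np.random.default_rng(0)
sim = rng.random((4, 6))       # similarities between 4 samples and 6 proxies
sub = topk_subgraph(sim, k=2)  # each row keeps exactly 2 nonzero edges
assert (sub > 0).sum(axis=1).tolist() == [2, 2, 2, 2]
```

Since the graph is built per row with no propagation step, nothing flows between nodes, which is the basis of the reviewer's "trivial graph" objection.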

Review 2

Summary and Contributions: This paper introduces a new deep metric learning method, ProxyGML, which integrates graph classification into embedding-space learning. ProxyGML globally represents each class in the data using several proxies, and selects the few most informative ones to construct a series of sub-graphs that encode the local similarity relations between proxies and samples. By classifying these sub-graphs with the ground-truth labels, a favorable embedding space can be learned through the proposed reverse label propagation. The method is evaluated on image retrieval and clustering tasks on the CUB-200-2011, Cars196, and Stanford Online Products datasets. Ablation studies and comparisons with the state of the art demonstrate its effectiveness.

Strengths: - Deep metric learning via graph classification is an interesting perspective, in which the proposed "reverse label propagation" algorithm is able to adaptively adjust the embedding manifold through a classification loss. I am pleased to see such a novel idea in this area, which I assume is also of interest to the NeurIPS community. - The paper is well structured, all the contributions are well motivated and clearly discussed, and all the information required to reproduce the results seems to be in place. - Experiments are comprehensive. The proposed method is compared against a series of state-of-the-art methods and evaluated in terms of time and memory cost. The parameters are also well examined through ablation studies, which is convincing.

Weaknesses: - There are two important parameters in the proposed method, namely N and r. The presented ablation studies are conducted only on the Cars196 dataset; do the other two datasets follow the same pattern as Cars196? Also, how can one choose an optimal combination of these parameters for a new dataset without exhaustive grid search? Is there any empirical guidance? The authors are encouraged to provide further analysis. - The proposed method performs less favorably on the Stanford Online Products dataset than on the other two datasets. The authors argue in lines 297--299 that low intra-class variance works against the advantage of the proposed method. I understand that there is no free lunch, but is there any supplementary way to fix this, or is the method simply not designed for this type of problem? At least, the authors should consider this an important direction for future work.

Correctness: Yes. As far as I am concerned, I didn't spot any incorrectness in the method itself.

Clarity: Yes. The paper is well structured and easy to read. All the information needed seems to be in place.

Relation to Prior Work: Yes. The connections and differences between this work and previous works are well discussed.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper presents a supervised metric learning approach built on proxies, which are used to better characterize the distribution of the entire dataset and thereby address the sampling problem. The method organizes the proxies in a graph structure and introduces a novel reverse label propagation algorithm. The authors conduct extensive experiments, including in the supplementary material, and demonstrate superiority over numerous baselines.

Strengths: The authors explain their method in detail and provide numerous results demonstrating the benefit of using proxies in their approach. For example, they show that not only is retrieval / clustering performance improved, but their approach also converges faster while obtaining higher recall. The proposed method features some technical novelty. Specifically, it leverages a similarity graph over the N proxies in order to trade off the number of proxies (higher overhead / overfitting) against too little information. Similarly, the authors use the similarity graph to construct sub-graphs that better capture local relationships around points. These sub-graphs then leverage a novel reverse label propagation technique. Label propagation is usually used to predict missing labels from data in a learned semantic space, but in the authors' case, they show how complete label information can be used to change the space (i.e., reverse propagation). While many metric learning approaches use labels to change the manifold, using labels this way in sub-graphs to change the space seems to carry some novelty. Each component of the proposed method is explained in sufficient detail. Extensive experiments are provided for each parameter showing its effect on recall. Similarly, ablation studies demonstrate the importance of the masks and masked softmax function, as well as the proxy regularization component. The method is compared with numerous other state-of-the-art metric learning approaches. Given that the numbers are fairly close in many of the tables, highlighting based on statistical significance could have been helpful to determine whether results are significant for some methods (e.g., 86.3 vs 86.4 / 96.2 vs 96.0 / 90.6 vs 90.5). On CUB and Cars, however, the method does seem to perform better.
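The standard (forward) label propagation that the review contrasts with the paper's reverse variant can be sketched generically as follows. This is the classic closed-form propagation, not the authors' exact formulation; in the reverse variant, a classification loss on the propagated labels would instead be backpropagated through the similarity matrix to move the embeddings. All names and values here are illustrative.

```python
import numpy as np

def propagate_labels(S, Y, alpha=0.9):
    """Closed-form label propagation: F = (1 - alpha) * (I - alpha*S)^{-1} Y.

    S : (n, n) row-normalized similarity matrix over nodes
    Y : (n, c) one-hot labels for labeled rows, zeros for unlabeled rows
    Returns soft label scores F for every node.
    """
    n = S.shape[0]
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)

# toy example: two labeled nodes with an unlabeled node between them
S = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
Y = np.array([[1.0, 0.0],   # node 0 labeled class 0
              [0.0, 0.0],   # node 1 unlabeled
              [0.0, 1.0]])  # node 2 labeled class 1
F = propagate_labels(S, Y)
assert np.isclose(F[1, 0], F[1, 1])  # middle node pulled equally toward both classes
```

The symmetric toy graph shows labels diffusing along similarity edges; the review's point is that the paper runs this machinery in the opposite direction, using known labels to reshape the space.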

Weaknesses: I am a bit concerned about some of the claims in the paper being too strong. For example, the authors state at L67 that, "To our best knowledge, this is the first work that introduce graph classification into DML." However, I would be very careful making claims like this. There are many approaches in metric learning now that do similar things. For example: Li, Xiaocui, et al. "Semi-supervised clustering with deep metric learning and graph embedding." World Wide Web 23.2 (2020): 781-798. This approach is actually quite similar to yours: they use graphs as well as label propagation for deep metric learning, somewhat along the lines of what you do. It would have been nice to compare against this paper. At a minimum, however, you could cite it and tone down the claims of being the first to use graph classification for DML. The writing quality needs improvement (see below). The method is supervised in that it requires class labels for the data in order to learn the subgraph maps / proxies / etc. It would be nice to see how this method could be applied in "classless" settings, such as vision+language settings where there is no concept of a class. This is more an idea for future work; still, it limits the applicability of your method when class labels are not available. I would have liked to see how well this method works on larger-scale datasets. As the authors note, in the SOP dataset there are sometimes around 5 instances per class; it would be nice to see how this could be expanded to much larger scales. I noticed that you use Inception pretrained on ImageNet. Do all the baselines also use that backbone? If not, it is possible some of your performance gain comes from a more robust backbone architecture. It would be nice to indicate the backbone for each method and, if the backbone differs for some methods, to run your method with that backbone, especially for the most competitive ones.
I noticed that the authors use the numbers from prior papers - while this is fine, if one is changing the backbone architecture, it is possible that the performance gains come not so much from your method as from the change of architecture. Have the authors considered imposing a diversity constraint among the proxies? For example, it is possible that the proxies collapse to a single point, or that the proxies are not diverse across the class. Could a diversity constraint be imposed to enforce that the proxies do not overlap too much? Perhaps this was the purpose of the proxy regularization section, but if so, it is not clear to me why this occurs; I recommend the authors clarify that and make it explicit if that is the purpose. Despite being thorough, I am a bit concerned about the experimental evaluation. It would have been good to compare against recent competitive approaches that use proxies. For example: Aziere, Nicolas, and Sinisa Todorovic. "Ensemble deep manifold similarity learning using hard proxies." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. Also, this next one was published in CVPR this year, so after the submission deadline, but you can still cite it: Kim, Sungyeon, et al. "Proxy Anchor Loss for Deep Metric Learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. I would note that this latter method - Proxy Anchor loss - is a drop-in loss and has been available on GitHub for 4 months. Citing it is not required since it was not yet published, but I recommend the authors add stronger proxy baselines. ProxyNCA (2017) is widely known as a weak method nowadays, and other baselines such as clustering [13] (2017), lifted structure loss [14] (2016), and semihard [18] (2015) are dated. Your method leverages graph networks, which may introduce additional parameters into the learning process.
Thus, even if a prior baseline method uses the Inception backbone, your method has additional parameters - and is thus potentially more powerful. It would have been nice to tease this out. I have seen other papers control for parameter size, limit new parameters, or introduce additional layers into baseline methods to compensate. It is thus difficult to determine whether the gains come from your method, the increased parameters, or the switch in backbone. The gain on Stanford Online Products is questionable, if any exists. Statistical tests are necessary / useful to determine any increase in performance. ----------------------------------- POST REBUTTAL ----------------------------------- Please see my comments in the feedback section, which address the weaknesses described above.
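The diversity constraint suggested above could take the form of a penalty on pairwise cosine similarity among a class's proxies. The sketch below is a hypothetical regularizer in NumPy, not the paper's proxy-regularization term; the function name and toy inputs are invented for illustration.

```python
import numpy as np

def diversity_penalty(proxies):
    """Mean pairwise cosine similarity among a class's proxies.

    Adding this as a loss term would push proxies apart and
    discourage them from collapsing to a single point.
    """
    P = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    G = P @ P.T                       # cosine-similarity Gram matrix
    m = len(P)
    off_diag = G.sum() - np.trace(G)  # exclude each proxy's self-similarity
    return off_diag / (m * (m - 1))

collapsed = np.ones((3, 4))  # three identical proxies -> penalty 1.0
spread = np.eye(3, 4)        # three orthogonal proxies -> penalty 0.0
assert diversity_penalty(collapsed) > diversity_penalty(spread)
```

A term like this, weighted into the training objective, would directly address the collapse concern raised above, whether or not it matches what the paper's regularization section intends.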

Correctness: The ablations / evaluations experiment with each component of the method and demonstrate improved performance. Each component of the method is demonstrated to result in increased performance. The experimental methodologies are reasonable and what one would expect. However, I am a bit concerned about a possible shift in baselines.

Clarity: There were also concerns about the writing quality throughout the paper. For example, even in the abstract there are grammatical mistakes (e.g., "which uses fewer proxies while achieves better performance."). I would recommend the authors do a thorough read-through of the paper to correct such writing errors. I was also lost at times during the explanation of the method; it was not immediately clear to me what the proxy regularization loss was doing. It seems to just be a classification loss? This should be clarified in the text.

Relation to Prior Work: The authors include related work and some discussion of the differences between their work and other proxy methods. However, given the overly strong claims made, I believe the authors may not have been aware of other graph methods for metric learning. I recommend the authors clearly differentiate their work from: Li, Xiaocui, et al. "Semi-supervised clustering with deep metric learning and graph embedding." World Wide Web 23.2 (2020): 781-798. I also recommend they include a discussion of the papers cited above.

Reproducibility: Yes

Additional Feedback: Overall, I think this is a good submission - but there are significant concerns about the experimental evaluation. Most significantly, I am concerned about the lack of comparison to recent proxy-based methods. The significance of the method on the SOP dataset is questionable, if any. Similarly, it is difficult to assess the contribution of the method due to the change in backbone / extra parameters / etc. If the authors can verify that the baselines for SoftTriple and MS are the same, and perhaps add the CVPR 2020 or CVPR 2019 paper cited above as an additional proxy-based baseline, I feel the paper would be strong enough for acceptance. Even though the CVPR 2020 paper was not yet published at the time, the CVPR 2019 one was. As it stands, there are many proxy-based deep metric learning approaches to which the authors have failed to compare. ----------------------------------- POST REBUTTAL ----------------------------------- I was happy to see that the authors have taken the feedback above into account. In particular, they added a comparison to a CVPR 2020 paper that also uses proxies, as I suggested; their method significantly outperforms it, as shown in their rebuttal. They also confirmed that the backbones of the baselines they compared against were the same. They state they will include this in their final submission, which I strongly encourage (and which is fairly typical nowadays), as it ensures a fair comparison. They also discussed their performance on SOP to my satisfaction. At this time, my concerns are satisfied. This paper presents a solid contribution as a NeurIPS paper and I believe it should be accepted.