NeurIPS 2020

Unsupervised Representation Learning by Invariance Propagation

Review 1

Summary and Contributions: This is an interesting work showing that mining hard positives and hard negatives, identified by measuring the dot-product similarity between vector representations, improves over using only the current instance as the positive and sampling negatives uniformly. This is the main contribution of the paper, along with experiments showing that the proposed positive and negative selection improves over other ways of choosing positives and negatives for contrastive learning.

Strengths: This work shows improvements over the state of the art on a variety of image recognition tasks typically used to measure the performance of self-supervised learning algorithms. The improvement comes mainly from mining hard positives and hard negatives during the self-supervised stage of training. The authors make a convincing case (via experimental results) that such a scheme does, in fact, improve the learned representations compared to other popular approaches such as MoCo and SimCLR.

Weaknesses: The work does not make it clear how exactly the kNN graph (shown in Figure 1) is constructed. Is it built by computing pairwise distances between all samples using the similarity metric from Equation 1? Is this the "graph distance" mentioned in the paper? Elaborating on this would go a long way toward improving the quality of the paper. In addition, while many of these architectures are well known, it would be nice to have a rough description of the architecture used in the paper (at least the "mlp" part of resnet50-mlp). It would also be nice to have a more extensive ablation study measuring a) the impact of the number of positives/negatives to mine, and b) what happens if the instance positive loss component is disabled after the invP loss component is enabled.
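To make the question about graph construction concrete, the following is a minimal sketch of one possible reading: a kNN graph built from pairwise dot products between L2-normalised embeddings. The function name `build_knn_graph`, the normalisation step, and the exclusion of self-matches are assumptions made for illustration; the paper may construct its graph differently.

```python
import numpy as np

def build_knn_graph(features: np.ndarray, k: int) -> np.ndarray:
    """Return an (n, k) array holding each sample's k most similar samples.

    Hypothetical helper: similarity is the dot product between
    L2-normalised rows, which is an assumption and not taken from the paper.
    """
    # Normalise rows so the dot product equals cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Dense pairwise similarity matrix of shape (n, n).
    sim = normed @ normed.T
    # A sample should not be listed as its own neighbour.
    np.fill_diagonal(sim, -np.inf)
    # Sort each row by descending similarity and keep the first k indices.
    return np.argsort(-sim, axis=1)[:, :k]

if __name__ == "__main__":
    toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    print(build_knn_graph(toy, k=1))
```

A dense similarity matrix is only practical for small datasets; at ImageNet scale an approximate nearest-neighbour search would likely be needed, which is exactly the kind of detail the paper should spell out.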

Correctness: The claims and the empirical methodology used to evaluate them appear correct. However, as mentioned earlier, greater clarity on how the graph used to select hard positives and negatives is actually constructed would help in evaluating the correctness of the claims beyond the empirical results.

Clarity: The paper is pretty well written; however, there are a number of typos: line 126 ("uses"); lines 174-175 seem to have errors ("If not More"); line 223 ("impact"); line 251 ("easy&hard").

Relation to Prior Work: This work should make more of an effort to discuss differences from prior work. For example, the Deep(er) Cluster works use clustering methods for grouping negatives and positives, while other works have also mined hard positives and negatives (e.g. "Smart Mining for Deep Metric Learning").

Reproducibility: Yes

Additional Feedback: I've read the author rebuttal and thank the authors for their clarifications. I believe my rating is still appropriate for this work.

Review 2

Summary and Contributions: This paper proposes a new self-supervised learning method that aims to cluster similar (positive) examples and push apart dissimilar (negative) examples in the representation space. The experimental results show that the proposed approach is compelling.

Strengths: 1. Well-written paper with clear intuition for the proposed methodology. 2. Noticeable improvement over previous work (for example [39]). 3. The results are pretty decent.

Weaknesses: The paper would be improved by some theoretical analysis, but the extensive experiments and analysis certainly make up for that.

Correctness: Yes.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper proposes an invariance propagation approach for unsupervised learning of visual representations. It builds on the assumption that abstract embeddings that lie close to each other should share high-level semantic labels. The method discovers positive samples in an iterative, diffusion-like manner. The training signal contrasts hard positives against hard negative background samples. Improvements are observed on several benchmarks.
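The iterative discovery of positives summarised above can be sketched as a breadth-first walk over a kNN graph, followed by selection of hard samples. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `propagate_positives` and `select_hard_samples` are hypothetical, a "hard" positive is assumed to be a discovered positive with low similarity to the anchor, and a "hard" negative is assumed to be a highly similar sample outside the positive set.

```python
import numpy as np

def propagate_positives(knn: np.ndarray, anchor: int, steps: int) -> set:
    """Collect samples reachable from `anchor` within `steps` hops of the kNN graph."""
    positives = set()
    frontier = {anchor}
    for _ in range(steps):
        next_frontier = set()
        for node in frontier:
            for nb in knn[node]:
                nb = int(nb)
                # Skip the anchor itself and anything already discovered.
                if nb != anchor and nb not in positives:
                    positives.add(nb)
                    next_frontier.add(nb)
        frontier = next_frontier
    return positives

def select_hard_samples(sim_row, positives, anchor, num_pos, num_neg):
    """Pick the least similar positives and the most similar non-positives."""
    # Assumed definition: hard positives have the lowest similarity to the anchor.
    hard_pos = sorted(positives, key=lambda j: sim_row[j])[:num_pos]
    candidates = [j for j in range(len(sim_row))
                  if j != anchor and j not in positives]
    # Assumed definition: hard negatives have the highest similarity to the anchor.
    hard_neg = sorted(candidates, key=lambda j: -sim_row[j])[:num_neg]
    return hard_pos, hard_neg

if __name__ == "__main__":
    knn = np.array([[1], [2], [3], [2]])  # toy graph with one neighbour per node
    positives = propagate_positives(knn, anchor=0, steps=2)
    print(positives, select_hard_samples([1.0, 0.9, 0.5, 0.4], positives, 0, 1, 1))
```

The sketch places no upper bound on the size of the positive set and makes a single pass; any cap or repeated re-estimation during training would need to be layered on top.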

Strengths: * A novel clustering-based approach for unsupervised learning. It mines the nearest neighbors to form local clusters. * The mining approach, which iteratively propagates similarity, seems interesting. The diffusion is empirical and deserves more discussion in relation to spectral clustering methods.

Weaknesses: * While the paper presents a set of ablation experiments, it still lacks analysis that explains the idea. For example, I am very curious about the accuracy of the predicted positive and negative samples. How does the quality of the invariance propagation affect learning? * The paper builds on top of the InstDisc method. Since that is no longer a strong baseline, I wonder whether the approach can be applied to MoCo as well. How is it likely to perform? * While the paper conducts a number of empirical experiments, it is not clear which baseline this approach should be compared against, or how large the improvement is. * The method seems to rely on a number of hyper-parameters: k, l, P.
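The accuracy of the mined positives asked about in the first point could be measured, when ground-truth labels are available for analysis, as the fraction of mined pairs that share a label. A minimal sketch follows; the function name `positive_precision` and the dictionary layout of `mined_positives` are assumptions made for illustration.

```python
def positive_precision(labels, mined_positives):
    """Fraction of mined (anchor, positive) pairs whose ground-truth labels agree.

    `labels` is a sequence of class labels; `mined_positives` maps an
    anchor index to the indices mined as its positives (hypothetical layout).
    """
    total = 0
    correct = 0
    for anchor, positives in mined_positives.items():
        for j in positives:
            total += 1
            correct += int(labels[anchor] == labels[j])
    # Return 0.0 when nothing was mined, to avoid dividing by zero.
    return correct / total if total else 0.0

if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    mined = {0: [1, 2], 2: [3]}
    print(positive_precision(labels, mined))
```

Tracking this quantity over training epochs would show whether the propagated positives become more reliable as the representation improves.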

Correctness: I find no significant wrong claims in the paper.

Clarity: The paper is easy to follow and the figures are easy to understand.

Relation to Prior Work: * The paper lacks a thorough comparison with, and explanation of its differences from, the local aggregation approach, which tries to solve a similar problem. Although the local aggregation paper is mentioned in the related work, the connections between the two methods and how they differ are not discussed.

Reproducibility: Yes

Additional Feedback: My concerns are well addressed in the rebuttal.

Review 4

Summary and Contributions: This paper focuses on the unsupervised representation learning task. Unlike previous methods, which focus on image-level variations, the authors focus on category-level variations. The proposed method recursively discovers semantically similar image samples as neighbors and tries to maximize the agreement between images from the same category. The results on ImageNet classification and related downstream tasks look promising.

Strengths: 1. Category-level variations are more representative than image-level variations. 2. The hard sampling strategy for finding good positive and negative samples is reasonable and effective. 3. The evaluation results are good on both the classification task and the related downstream tasks.

Weaknesses: 1. The experimental comparisons are insufficient. Some methods, such as MoCo and SimCLR, also report results with wider backbones such as ResNet50 (2×) and ResNet50 (4×). It would be interesting to see the results of the proposed InvP with these wider backbones. 2. Some methods pretrain for 200 epochs, while the reported InvP uses 800 epochs. What are the results of InvP with 200 epochs? Adding these results to the tables would make the comparison clearer. 3. The proposed method adopts a memory bank to update vi, as detailed at the beginning of Sec. 3. What would the results be with a momentum queue or the current batch of features instead? As the results of SimCLR and MoCo are better than those of InsDis, it would be nice to have those results.

Correctness: Yes

Clarity: The paper is well written and well organized.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: