Review for NeurIPS paper: Parametric Instance Classification for Unsupervised Visual Feature Learning

NeurIPS 2020

Review 1

Summary and Contributions: This paper presents a novel one-branch parametric instance classification (PIC) method. Additionally, to address infrequent instance visiting and high training cost, the authors introduce two novel techniques: a sliding-window data scheduler, and negative instance class sampling combined with weight update correction.

Strengths: Novelty: This paper makes a solid contribution; its methods are reasonable and show solid improvements. The intuition behind the proposed methods is easy to understand. The method may serve as a simple baseline in the future. Experiments: The ablation study is quite extensive and supports the effectiveness of the proposed methods. Writing: The writing is good and easy to read. The figures are illustrative and help readers understand the proposed methods. The equations are clear and sufficiently explained.

Weaknesses: The proposed methods involve many implementation details, so it would be better to release the code as soon as possible.

Correctness: The claims and method in this paper are technically correct, which is also supported by extensive experiments.

Clarity: The paper is well prepared and written.

Relation to Prior Work: The differences from previous work are clearly stated.

Reproducibility: Yes

Additional Feedback: After reading the rebuttal, I find that the authors have addressed all my concerns. I maintain my score.

Review 2

Summary and Contributions: The paper proposes an updated implementation of instance discrimination for self-supervised learning called parametric instance classification (PIC). The main contribution is that, by using implementation tricks from recent unsupervised frameworks, PIC obtains performance similar to that of recent approaches such as SimCLR while being a simpler model.

Strengths: The paper gives a good analysis of the implementation components of PIC. The ablation studies are thorough and show that each part of the implementation needs to be tuned to achieve good performance. I think there is value in showing that simple approaches to self-supervised learning like PIC can attain performance similar to that of more complex approaches like SimCLR.

Weaknesses: The main weakness of this paper is novelty. Most of the implementation tricks used in this paper were introduced in other papers (as the authors correctly point out). The novel contributions are the data scheduler and the negative sampling strategy, which in my opinion do not constitute a contribution worthy of publication at NeurIPS. Another weakness of this paper is on the empirical side, especially the comparison with SimCLR. As a reviewer, I understand that computational resources play a big role in this comparison, since we cannot expect all institutions to have access to the same compute resources. Table 5 is the main example of this situation. The proposed approach seems to outperform SimCLR when trained for 200 epochs. However, when SimCLR is trained for 1000 epochs, it outperforms the proposed method. From these results I cannot say whether the proposed approach outperforms SimCLR: it could be that the proposed approach benefits from the cross-level discrimination earlier in training and thus outperforms SimCLR at epoch 200, but that SimCLR then catches up and outperforms the proposed approach at convergence.

Correctness: The submission is empirically and methodologically correct. However, I am concerned that due to computational resources, a fair comparison between the proposed method and SimCLR cannot be obtained.

Clarity: The paper is clear, but many notation details are not properly introduced or used. For example, in line 72, W \in R^{D \times N}: what are D and N?

Relation to Prior Work: Related work is properly discussed, and how this submission differs from previous contributions is clear. However, a few references on cross-level relationship modeling are missing: Bautista, Miguel A., et al. "CliqueCNN: Deep unsupervised exemplar learning." Advances in Neural Information Processing Systems. 2016. Milbich, Timo, et al. "Unsupervised video understanding by reconciliation of posture similarities." Proceedings of the IEEE International Conference on Computer Vision. 2017. Bautista, Miguel A., Artsiom Sanakoyeu, and Bjorn Ommer. "Deep unsupervised similarity learning using partially ordered sets." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The work tackles the problem of self-supervised representation learning. It shows how to "fix" the parametric approach of [12] such that it performs better or on par with the most recent dual-branch contrastive frameworks while the training procedure is efficient. To this end, the authors propose a sliding-window sampling strategy to alleviate infrequent instance class visits. For improving training efficiency, the paper proposes to sample fewer negative classes and to correct classification weights only when they are used in forward pass, instead of updating them on every iteration.
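
The negative instance class sampling summarized above can be illustrated with a minimal sketch. It assumes one weight vector per instance class, dot-product logits scaled by a temperature, and uniform sampling of negatives without replacement; the function name `sampled_instance_loss`, its parameters, and the temperature value are hypothetical and are not taken from the paper.

```python
import math
import random


def sampled_instance_loss(feature, weights, positive, num_negatives, rng, temperature=0.2):
    """Cross-entropy over the positive class plus a random subset of negative classes.

    Assumption: negatives are drawn uniformly without replacement; the
    sampling scheme in the paper may differ.
    """
    candidates = [c for c in range(len(weights)) if c != positive]
    negatives = rng.sample(candidates, num_negatives)
    classes = [positive] + negatives  # the positive class sits at index 0

    # Logits are computed only for the selected classes, not for all instances.
    logits = [
        sum(f * w for f, w in zip(feature, weights[c])) / temperature
        for c in classes
    ]

    # Numerically stable log-sum-exp, then negative log-probability of the positive.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[0], classes


rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1000)]
feature = weights[42]  # pretend the encoder output matches instance 42
loss, used_classes = sampled_instance_loss(feature, weights, 42, 16, rng)
```

Only the 17 rows listed in `used_classes` take part in this step, which is where a weight correction applied "only when they are used in forward pass" would come into play; how that correction is computed is not specified in this review and is not modeled here.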

Strengths:
1. This work addresses an important problem of self-supervised feature learning with a simpler solution than the current SoTA methods while delivering comparable results. The proposed solution pays a lot of attention to small implementation details, and I appreciate that.
2. The experimental section is rich in results and ablation studies. The authors provide enough evidence that their method achieves results better than the current SoTA.
3. The paper is well written and overall clear.
4. The related work is well organized and mentions relevant articles.

Weaknesses:
1. The authors should state more clearly in the introduction that the proposed solution is a "fix" of [12] rather than a new PIC approach, as is suggested in lines 29-30: "... This paper presents a framework which solves instance discrimination by direct parametric instance classification (PIC)". This framework was already proposed in [12], and the authors must mention it.
2. It is not clear to me why exactly the sliding-window data sampler improves training. My understanding is that with the sliding-window sampler, an instance is visited several (something like B/S) times in a row and then not visited for a very long time (something like B * N / S). This means that, in expectation, a single instance class is visited as often as it would be with epoch-based training. Does this mean that the improvement in training comes only from being able to "learn well" a single instance class before moving on to another one? What about the opposite effect, such as forgetting this instance class [1*], since the network does not see it for a much longer period after it has been repeatedly visited? The paper lacks a clear explanation of this phenomenon, and hence the sliding-window sampling is not well motivated.
3. While it is nice to have Section 5, the feature visualization technique used there is not limited to models with parametric classifiers. Therefore, it would be much more valuable to see a comparison of the visualizations and statistics (Figure 3) with other methods such as MoCo and SimCLR. Simply stating the facts for PIC alone, without any comparison, is not very informative.
[1*] Toneva et al., "An Empirical Study of Example Forgetting during Deep Neural Network Learning".
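
The visiting pattern discussed in point 2 can be made concrete with a toy simulation. It assumes a deliberately simplified scheduler: a window of B consecutive indices over a fixed ordering of N instances is used as the batch and advanced by stride S at each step. This is an illustrative assumption, not the scheduler defined in the paper, and the helper names are hypothetical.

```python
def sliding_window_batches(num_instances, batch_size, stride, num_steps):
    """Return the batches produced by a window sliding over a fixed instance order."""
    batches = []
    start = 0
    for _ in range(num_steps):
        # The window wraps around once it reaches the end of the ordering.
        batches.append([(start + k) % num_instances for k in range(batch_size)])
        start = (start + stride) % num_instances
    return batches


def visit_steps(batches, instance):
    """Return the step indices at which `instance` appears in a batch."""
    return [step for step, batch in enumerate(batches) if instance in batch]


# Toy sizes (assumptions): N = 12 instances, B = 4, S = 2, 12 steps = 48 samples.
batches = sliding_window_batches(num_instances=12, batch_size=4, stride=2, num_steps=12)
steps = visit_steps(batches, instance=5)
print(steps)  # [1, 2, 7, 8]
```

In this toy setting, instance 5 is visited in runs of B/S = 2 consecutive steps separated by 4 steps without a visit, and it is seen 4 times over 48 processed samples, the same count as 4 epochs of epoch-based training over 12 instances. Whether the actual scheduler behaves this way depends on details that this review does not state.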

Correctness: The claims in the paper seem to be correct.

Clarity: The paper is clear and well written.

Relation to Prior Work: In general, yes. See my comment about [12] above.

Reproducibility: Yes

Additional Feedback: 1. I do not fully understand how the authors count training epochs, for example in line 232, given that the sliding-window sampler does not have a clear notion of epochs. Could the authors please elaborate on that? Overall, I like the work and think that it is an important direction. I am willing to adjust my score if the authors address the comments above.
---------------------------- After Rebuttal ---------------------------
The authors addressed my comments and provided more insights into the method, which is very valuable. I suggest that the authors include their clarifications in the final version of the paper. I recommend the paper for acceptance.

Review 4

Summary and Contributions: This paper proposes a simple framework for unsupervised feature learning. Instead of using the currently popular dual-branch non-parametric instance classification setting, the proposed method adopts a single-branch parametric setting. The proposed sliding-window data scheduler and the negative sampling with weight correction make the whole framework efficient and practical.

Strengths:
1. Experiments
- This paper provides a detailed ablation study to validate the effectiveness of the proposed techniques
- The proposed method achieves a new state-of-the-art performance on several downstream visual tasks
- The authors also analyze the connection between the proposed method and the supervised training method, which I found interesting and I believe can give the community some insights
2. Significance and Novelty
The core contribution of this paper lies in the two training techniques (sliding-window data scheduler and negative sampling with weight correcting). Unlike current state-of-the-art unsupervised feature learning methods that require special handling to avoid data leakage, the proposed method can be easily implemented. I think this could be a good contribution to the community.

Weaknesses:
1. Experiments
It would be interesting to know whether the proposed method can achieve good performance with larger backbone networks (e.g., deeper or wider).
2. Other questions
1) I wonder whether the projection head is dropped in the downstream tasks.
2) Can the authors comment on why cosine softmax brings such a significant improvement in the current setting?
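
For readers unfamiliar with the cosine softmax mentioned in the last question, a minimal sketch follows. It assumes the usual formulation in which both the feature and each class weight are L2-normalized before a temperature-scaled softmax; the function name and the temperature value are hypothetical and are not taken from the paper.

```python
import math


def cosine_softmax(feature, class_weights, temperature=0.2):
    """Softmax over temperature-scaled cosine similarities.

    Assumes no input vector is all zeros (normalization would divide by zero).
    """

    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    unit_feature = normalize(feature)
    logits = [
        sum(a * b for a, b in zip(unit_feature, normalize(w))) / temperature
        for w in class_weights
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = cosine_softmax([3.0, 1.0], class_weights)
scaled = cosine_softmax([30.0, 10.0], class_weights)  # same direction, 10x the norm
```

Because only directions matter, `probs` and `scaled` agree up to floating-point error, and every logit is bounded in magnitude by 1/temperature regardless of the feature norm. This illustrates one property of the formulation; it is not offered as the explanation for the reported improvement.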

Correctness: The extensive experiments well support and validate the claims made by the authors.

Clarity: Yes. Overall, this paper is easy to read and understand.

Relation to Prior Work: Yes. Instead of using a dual-branch non-parametric setting, the proposed method is a revisit of the single-branch parametric setting. The core contribution lies in the two proposed techniques.

Reproducibility: Yes

Additional Feedback: Update: I have read the comments from the other reviewers and the rebuttal. I think the rebuttal addresses most of the raised issues well. Hence, I keep my original rating and recommend acceptance.