NeurIPS 2020

Hard Negative Mixing for Contrastive Learning


Review 1

Summary and Contributions: This paper does something similar to mixup (mixing two samples in order to generate a new one) in the context of MoCo. It mixes the hardest negatives in the MoCo queue in order to surround the query and force cleaner separation. It further mixes the hard negatives with the query itself, showing experimentally this makes the task even harder. This technique offers modest, but consistent, improvements.
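For concreteness, a minimal PyTorch sketch of the kind of feature-level mixing described above (for a single query) might look as follows; the function name, sizes, and mixing coefficients are illustrative assumptions on my part and are not taken from the paper:

import torch
import torch.nn.functional as F

def mix_hard_negatives(query, queue, n_hard=64, n_synth=16, n_query_mix=8):
    # query: (d,) L2-normalized query embedding
    # queue: (K, d) L2-normalized negative embeddings from the memory queue
    # Rank negatives by similarity to the query; the most similar are the "hardest".
    sims = queue @ query                          # (K,) similarities
    hard = queue[sims.topk(n_hard).indices]       # (n_hard, d) hardest negatives

    # (i) Mix random pairs of hard negatives to create synthetic hard negatives.
    i = torch.randint(n_hard, (n_synth,))
    j = torch.randint(n_hard, (n_synth,))
    alpha = torch.rand(n_synth, 1)
    mixed = F.normalize(alpha * hard[i] + (1 - alpha) * hard[j], dim=1)

    # (ii) Mix the query itself with hard negatives to create even harder ones;
    # the query's coefficient is kept below 0.5 so the result stays a negative.
    k = torch.randint(n_hard, (n_query_mix,))
    beta = 0.5 * torch.rand(n_query_mix, 1)
    query_mixed = F.normalize(beta * query + (1 - beta) * hard[k], dim=1)

    # The synthetic points are appended to the real negatives in the contrastive loss.
    return torch.cat([mixed, query_mixed], dim=0)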

Strengths:
- The method seems to offer a consistent improvement over MoCo, with minimal extra computational cost (which is made up for by better and faster learning).
- Using the oracle gives an interesting sample point.
- Varied experiments. I also appreciate the effort to include error bars, and I acknowledge that this is always a resource problem.
- Good to consider the broader impact, although the hope that this technique can contribute to reduced dataset bias is highly speculative, and I wouldn't want to promote a trend where every algorithm starts to claim this, drowning out important work that actually tackles this problem and demonstrates it experimentally.

Weaknesses:
- It would be great if you could give the reader more intuition about the query-mixing technique. It seems that it would simply rescale the loss in a certain way. If you actually plug the new h into (1) and expand it, the squared L2 norm of q shows up. Since the paper states that "all embeddings are L2-normalized", the L2 norm of q should be 1, so instead of "q^T n" we get something like "beta + (1 - beta) q^T n". The beta can be taken out of the sum. If you follow this thought, I think you can see another way of formulating this that boils down to a change in the shape of the loss, or perhaps a "2-parameter temperature"; a worked version of this expansion is sketched after this list. I list this as a weakness because I think this should be understood and communicated to the reader in a bit more depth, since it will facilitate future work.
- The paper keeps saying the method is faster, but maybe I'm missing where this is thoroughly demonstrated. Fig. (a) does show a bigger gap after 100 epochs than after 200 epochs, which suggests this. However, a few more sample points and a plot that gives a sense of the shape of the curves would strengthen this argument. MoCHi may give better results faster, but does it also reach a plateau earlier than MoCo-v2? If not, many people would still want to train for the same amount, which would not lead to faster training in practice (better results, though).
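To spell out the expansion referenced in the first weakness above (ignoring, for simplicity, any re-normalization of the mixed feature): with an L2-normalized query q and a query-mixed negative h = \beta q + (1 - \beta) n, the logit in (1) becomes

q^T h = q^T (\beta q + (1 - \beta) n) = \beta \|q\|_2^2 + (1 - \beta) q^T n = \beta + (1 - \beta) q^T n.

Divided by the temperature \tau, this is \beta / \tau + (1 - \beta) q^T n / \tau: a constant offset on the logits of the query-mixed negatives plus a rescaled similarity, i.e., an effective temperature of \tau / (1 - \beta) for those terms. This is the "change in the shape of the loss" / "2-parameter temperature" view suggested above.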

Correctness: Yes, the method is correct. I wonder if the mixing with the query technique can be reformulated in a more intuitive way though, as mentioned under weaknesses.

Clarity: Overall, yes, although the figures could be made more stand-alone from the text and a bit easier to interpret. More on this in a separate field.

Relation to Prior Work: Yes, and I think it was good that mixup was mentioned.

Reproducibility: Yes

Additional Feedback:
- Figures 2(b) and 2(c) were hard to parse. Some descriptions are only in the text and not in the caption or the legend. The dashed lines with triangles do not include synthetic features, so isn't that simply MoCo-v2 then? Since they are not included, the green and red MoCHi should be identical, right? So this is just two runs of the baseline, which is why they are so close? I'm also confused by the notion that the green line is faster.
- Fig. 3(c): the difference is not '%' but 'percentage points'.


Review 2

Summary and Contributions: This paper focuses on hard negative mining (or, more accurately, mixing) for self-supervised contrastive learning. Rather than searching for hard negative samples, the authors create hard negatives by mixing pre-computed features, which does not incur significant computational overhead.
=== Update === I appreciate the results on MS COCO, and I also think it is more important to see improvements on transfer learning tasks than on ImageNet linear evaluation. In my experience, the improvement shown in the rebuttal on MS COCO is non-trivial, so I upgrade my score from 6 to 7.

Strengths: The way this paper creates hard negatives is by mixing hard negatives at the feature level, for each query point. This relates to the Manifold Mixup paper [29], which originally targets supervised learning. So hard negative mining, in the form the authors propose, is a new contribution to contrastive learning (concerns will be explained later). The empirical analysis of the evolution of hard negatives during contrastive learning is genuinely interesting, and it may inspire the way future researchers look at hard negative mining.

Weaknesses: In general, my biggest concern is that the efficacy of the proposed method is not very significant.
- For example, it only improves over MoCo-v2 by 0.8% on ImageNet-100 and by 0.1 percentage points on ImageNet-1K (67.9% vs. 68.0%).
- The improvement on PASCAL VOC detection is also not very significant.
- I also wonder whether the proposed approach is compatible with more advanced data augmentation that sets a higher baseline. For example, can the proposed module be dropped in directly to improve [25] and [28]?
- The explanation in lines 233-234 is a bit counter-intuitive. I understand this is possible if the inter-class distance increases more than the intra-class variance, but can you plot it to justify it better? Otherwise, increasing intra-class variance should lead to lower accuracy.
- In Figure 3(b), why are s' and s not comparable? For example, the last column shows that increasing s' actually hurts. It would be great to have a deeper understanding of how s and s' really manipulate the latent space (or the logits).
- In Section 4 the authors repeatedly argue for lightweight computation; I wonder how large the difference in running time actually is (e.g., relatively x% slower)? Ranking the logits and mixing features from non-contiguous GPU memory (the top-N negatives are not contiguous in the queue) should take some time; a rough way to measure this is sketched after this list.
- Lines 141-150 need to be revised. The pretext-task accuracy drop between MoCo and MoCo-v2, as shown in Figure 2(b), mostly results from the non-linear projection head rather than from data augmentation, though augmentation also contributes a bit. Correct me if I am wrong.
- How can p go beyond 1 in Figure 2(a)?
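As a concrete way to check the runtime concern above, a rough measurement of the extra per-batch work (ranking the queue and gathering/mixing the top-N rows) could look like the following; all tensor sizes are my assumptions (approximately MoCo-v2 defaults), not numbers reported in the paper:

import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
K, d, B, N = 65536, 128, 256, 1024              # queue size, feature dim, batch size, top-N (assumed)
queue = torch.randn(K, d, device=device)
queries = torch.randn(B, d, device=device)

if device == "cuda":
    torch.cuda.synchronize()
t0 = time.perf_counter()

logits = queries @ queue.t()                    # (B, K) query-to-queue similarities
top_idx = logits.topk(N, dim=1).indices         # hardest N negatives per query
hard = queue[top_idx]                           # (B, N, d) gather of non-contiguous queue rows
perm = torch.randperm(N, device=device)
alpha = torch.rand(B, N, 1, device=device)
synth = alpha * hard + (1 - alpha) * hard[:, perm]   # pairwise mixing of hard negatives

if device == "cuda":
    torch.cuda.synchronize()
print(f"extra mixing time per batch: {(time.perf_counter() - t0) * 1e3:.2f} ms")

Reporting a relative slowdown measured in this way would make the lightweight-computation claim concrete.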

Correctness: Looks reasonable to me.

Clarity: Not very well written. It slowed down my reviewing speed significantly; I found myself needing to read back and forth multiple times to understand the organization. Generally the writing should be improved.
Figure 2 is messy:
- (a) the probability should not go beyond 1??
- the legend in (b) is missing some items, which confused me when reading the text describing it
- the colors and shapes in (c) are also unclear.
Figure 3:
- (a) maybe label the x-axis as N to be clearer?
- (b) maybe label s and s' in the table?
Figure 4:
- Caption: "results copied from [13]". I did not find these results in [13]; are they from somewhere else?

Relation to Prior Work: The authors are up to date and include even recent related works. Prior work is clearly discussed and the differences are clear.

Reproducibility: Yes

Additional Feedback: Maybe also include MS-COCO results? I am giving marginally above threshold, as I think the hard-negative analysis is interesting, but the gains from the proposed mixing method are, unfortunately, limited.


Review 3

Summary and Contributions: This paper considers the problem of self-supervised learning. Specifically, the authors construct their model based on MoCo and argue that hard negative samples are important for improving performance. Therefore, the authors present a strategy to generate virtual hard negative samples by mixing top-ranked samples. Experiments show the effectiveness of the proposed method in the self-supervised learning setting.

Strengths:
1. This paper considers a specific problem in MoCo, where the authors aim to generate new hard negative samples for learning more discriminative features. The proposed method is simple but consistently improves the results.
2. The authors provide a good analysis showing the importance of hard negative samples in Sections 3.2 and 3.3. This helps readers better understand the motivation of the paper.
3. Thorough experiments are provided to demonstrate the effectiveness of the proposed method. Consistent improvements are obtained by the proposed mixing methods.

Weaknesses:
1. Although this is the first work (if I am right) that considers hard negative samples in self-supervised learning, the proposed method is somewhat simple and trivial. The main framework is MoCo, and the proposed hard negative generation is based on the well-known mixup. This is my major concern.
2. From Fig. 3(b), I found that the two proposed mixing strategies are not well complementary to each other. Using only one can achieve the highest results. Therefore, I don't think it is necessary to jointly use these two strategies.
3. From Fig. 3(c), iMix [25] can also improve the performance of MoCo. What are the results when applying iMix to MoCo-v2? Can it achieve an improvement similar to the proposed MoCHi? In addition, I think the proposed MoCHi is very similar to iMix [25]. This paper does not introduce many new techniques compared to iMix.

Correctness: Yes. It is simple and clearly correct.

Clarity: Yes. This paper is well written and easy to follow.

Relation to Prior Work: Yes. This paper provides its own motivation and discusses its differences from other works. One additional mix-based method could be discussed in the related work: [A] OpenMix: Reviving Known Knowledge for Discovering Novel Visual Categories in an Open World. arXiv 2020.

Reproducibility: Yes

Additional Feedback: Post rebuttal: I have read the comments of the other reviewers and the rebuttal. The authors have addressed my concerns. Although the proposed method is simple, the authors provide good intuition and analysis of why negative samples matter in self-supervised learning. The proposed method achieves consistent improvements on downstream tasks. I also agree with R2 that the improvement on downstream tasks is important. To this end, I would like to upgrade my rating to 6.


Review 4

Summary and Contributions: This paper proposes to generate synthetic hard negatives for contrastive learning by mixing the real hard negatives. The authors also provide an in-depth analysis of hard negative sampling in contrastive learning tasks, and justify why sampling/generating harder negatives is needed. The experiments follow standard self-supervised learning benchmarks and implementations; however, the relative accuracy improvement is not very high.

Strengths: Simple but interesting method to generate more synthetic hard negatives for a given set of anchor and negative points. The authors also provide a detailed analysis with the oracle in order to justify the need for their method. The paper is well written and is easy to understand.

Weaknesses: The major weakness of the paper is the limited relative accuracy improvement in the experiments. The proposed method improves the baseline by only 1% accuracy on ImageNet-100, and by less than 1% on ImageNet-1K. The paper also misses similar existing work in metric learning: [1, 2] (below) generate hard negatives in the context of supervised metric learning, and [2] in particular presents a similar solution by mixing the embeddings of the anchor and negatives.
[1] Duan et al. Deep Adversarial Metric Learning. CVPR 2018.
[2] Zheng et al. Hardness-Aware Deep Metric Learning. CVPR 2019.
After rebuttal: I appreciate the additional experiments on the MS COCO dataset, where the improvements are more significant. I'm improving my rating from 5 to 6 after the rebuttal.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Discussion about some prior work is missing. Please see the Weaknesses.

Reproducibility: Yes

Additional Feedback: