Summary and Contributions: This paper proposes a new approach to self-supervised learning. It differs from previous methods in two major ways: it removes negative pairs, and it uses online and target branches. Both designs are well ablated. The final accuracy is very high, reaching 74.3 with R50.
Strengths: ++ This paper works on a very important problem, unsupervised pretraining. ++ The finding that no negative pairs are required during training is amazing. ++ The accuracy is very high, improving on the previous best approach (InfoMin) by about 1.3 points.
Weaknesses: No good explanation is provided for why positive pairs alone are enough for pretraining. I would further increase my rating if a good explanation were given.
Correctness: seems sound
Clarity: well written
Relation to Prior Work: clear
Summary and Contributions: I thank the authors for a very considerate rebuttal that addressed a few comments I made, the relation to prior work, and the question about non-collapse. I have increased my score. This paper explores using a target network to perform contrastive learning without negative examples. This approach outperforms standard contrastive methods and is more robust to smaller batch sizes and image augmentations.
Strengths: The model achieves state of the art in unsupervised learning on ImageNet. Despite much recent progress in this field, the authors show that without negative samples models can become even better at learning from simple augmentations. The empirical evaluation is thorough and convincing.
Weaknesses: Why does this approach work instead of collapsing to a trivial solution? Have the authors explored any theoretical or practical justifications, or experiments, that help explain why this approach works?
Correctness: The presented methods are theoretically and empirically very clean and raise no concerns. One nit: is there a specific reason why equation 2 and the preceding sentences speak about normalizing and then measuring an L2 distance rather than just saying cosine similarity?
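For reference, the equivalence behind this nit can be checked numerically: for L2-normalized vectors, the squared L2 distance equals 2 minus twice the cosine similarity. The sketch below is only an illustration of that identity, not the paper's code; the function names and the example vectors are made up for this check.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def normalized_sq_l2(u, v):
    """Squared L2 distance between the L2-normalized versions of u and v."""
    u_hat, v_hat = l2_normalize(u), l2_normalize(v)
    return sum((a - b) ** 2 for a, b in zip(u_hat, v_hat))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Arbitrary example vectors (hypothetical, chosen only for the check).
u = [1.0, 2.0, 3.0]
v = [-2.0, 0.5, 4.0]

lhs = normalized_sq_l2(u, v)
rhs = 2.0 - 2.0 * cosine_similarity(u, v)
assert math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-12)
```

Minimizing one quantity is therefore the same as maximizing the other, which is why the question is purely about presentation.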
Clarity: The paper is extremely well written and easy to follow.
Relation to Prior Work: There are two prior works that appear to be extremely similar: Laine & Aila (2016) Temporal Ensembling for Semi-Supervised Learning. Tarvainen & Valpola (2017) Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Could the authors please comment on why these are not cited and how the current work is not incremental over them?
Additional Feedback: I will consider raising my score if the authors address the two main concerns: a) relation to prior work, b) a (tentative) theoretical and/or empirical explanation of why this approach does not collapse. Why are the numbers in Tab. 2b lower than the corresponding entries in Tab. 1b? Should the semi-supervised setting not perform better? line 222: is there a verb missing? 224-228: are these changes relevant? 241-251: SimCLR is also maximizing some cosine similarity between positive pairs. Can the authors be more explicit about what the difference is here? footnote 4: this seems to be ignoring the effect of gradient noise, which is larger for smaller batches? Tab5: given this table, how is tau chosen for the main results? Supp. material: 516-517: these lines appear again below? 559: is that 'local validation set' part of the 1% of data used for training? eq. 5: this is not the standard InfoNCE formulation, can you please derive this e.g. from the formulation in (Poole et al., 2019, On Variational Bounds of Mutual Information)? 745-756: this is going some way towards understanding why the approach works and does not collapse, maybe the authors could expand?
Summary and Contributions: This paper proposes a new method for self-supervised learning that doesn't rely on negative pairs, which are required by most contrastive self-supervised learning techniques. Two networks are built, and the online model tries to predict the outputs of the target model, whose parameters are an exponential moving average of the online model's parameters, similar to Mean Teacher and MoCo. BYOL achieves very good performance.
Strengths: -BYOL gets rid of negative pairs in contrastive learning, which eases the self-supervised learning problem. -The proposed method is very simple, almost too simple to be true. -Very comprehensive evaluation of BYOL, and in most cases BYOL outperforms the SOTA SimCLR and MoCo.
Weaknesses: -As mentioned in the paper, the proposed method has a trivial solution, in which both models output 0's. Although empirically it doesn't collapse, is there any theoretical support? -It doesn't say the code will be released. To me, the method is too simple to be true. I tried to reimplement it, but without success. It is highly recommended to open-source the code for reproducible research. -The detection results are weird. Why so low? Frozen representation? How can you learn detection with a frozen representation? Please use the standard settings, e.g. as in MoCo. The results are not convincing! Some questions: -L38, I understand negative pairs could be a limit of contrastive learning, e.g. batch size. But why would removing negative pairs improve robustness? -Why does the online network need a predictor? Actually, the predictor is just another embedding layer. Any ablation on removing the predictor from the online model and adding a predictor to the target model? -Fig. 3(a), if the method is not coupled with batch size, why is there still an accuracy drop, especially from 256 to 128? For example, when you train a classification network, the accuracies should be almost the same for 256 and 128. Isn't 128 enough for BN? A batch size of 32 should be enough to get reliable BN statistics. The explanation is not convincing.
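To make the target update concrete, here is a minimal sketch of an exponential moving average step as I understand it from the summary above. It is not the authors' implementation: the name `ema_update`, the use of flat Python lists in place of network parameters, the name `tau` for the decay rate, and the value 0.9 are all assumptions made for illustration.

```python
def ema_update(target_params, online_params, tau):
    """Return tau * target + (1 - tau) * online, element-wise.

    Flat lists of floats stand in for network parameters (an assumption
    made to keep the sketch self-contained).
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]


# Illustrative values only: the online parameters are held fixed so the
# drift of the target towards them is easy to see.
target = [0.0, 0.0]
online = [1.0, -1.0]
for _ in range(3):
    target = ema_update(target, online, tau=0.9)
```

With `tau` close to 1 the target changes slowly; with `tau = 0` it simply copies the online parameters at every step.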
Correctness: Seems mostly correct
Clarity: Yes, mostly
Relation to Prior Work: To me, (1) is more closely related to MoCo and Mean-teacher than . It seems this paper tries to weaken the relations to MoCo and MT on purpose.
Additional Feedback: The rebuttal has resolved most of my concerns. A few suggestions for the camera-ready. - Release the code for reproduction, and show some results for shorter training, e.g. 200 epochs. Not everyone in the community has the resources to run experiments of 1000 epochs, especially in universities. - Footnote 4 is confusing and does not give the true reason. Please make it as clear as in the rebuttal. - Fig. 5b shows some interesting ablations, but I missed most of them due to the lack of description. They are also not just ablations of contrastive methods: when \beta=0, they are simply ablations on BYOL. Please make this clear. - Be honest about the relation to prior work, e.g. MoCo and Mean Teacher.