This is an interesting paper that brings a sober perspective on recent self-supervised learning progress on ImageNet. It shows that there are still opportunities in going beyond image-level self-supervised tasks, achieving slightly better results than MoCo and other baselines on semantic segmentation, region retrieval and tracking, while simplifying some aspects (though it requires a pre-trained edge detector). All the reviewers agreed on acceptance, and so do I. Please follow the reviewers' suggestions for the camera-ready version. Definitely update the ImageNet numbers. It is also worth making sure the semantic segmentation comparisons are as apples-to-apples as possible in terms of the heads put on top of the ResNet-50 backbone.