NeurIPS 2020

Self-supervised learning through the eyes of a child

Review 1

Summary and Contributions: The authors show that training models with self-supervision on a biologically realistic dataset yields useful features for downstream classification tasks.

Strengths: This is a really cool idea. The experiments are clear and the baselines are relatively strong.

Weaknesses: I wish there were more analysis of when things go wrong. What invariances are missing from the representations? I think thorough hard-example mining could shed some light here. There might even be insights suggesting that some invariances have to be baked in by "nature" (e.g., imagine finding out that all cats seen from the front are misclassified; does that mean that robustness to 3D rotation is not learned?). I hope you'll consider this in future research.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: The paper is framed from a nature vs. nurture perspective; however, I believe the reader would benefit from more context from the computer vision literature. In particular, there are a large number of methods that look at self-supervision from videos. For example: "Deep learning from temporal coherence in video", "Trading robust representations for sample complexity through self-supervised visual experience", "Deep learning of invariant features via simulated fixations in video", "Unsupervised learning of spatiotemporally coherent metrics", "Unsupervised learning of visual representations using videos", "Slow and steady feature analysis: Higher order temporal coherence in video", and "Transitive invariance for self-supervised visual representation learning". Additionally, there are other surrogate tasks that have been investigated to construct useful representations. For example: "Colorization as a proxy task for visual understanding" and "Unsupervised visual representation learning by context prediction".

Reproducibility: Yes

Additional Feedback: Thank you for sharing these cool ideas and results. All the best!

========= After Author Feedback ===========

Thank you for taking the time to write a rebuttal. After a brief discussion with the other reviewers and reading the rebuttal, I stand by my assessment that this paper should be accepted. All the best.

Review 2

Summary and Contributions: In this manuscript, the authors explore whether meaningful representations for object categorization can be learned through self-supervised training. More importantly, the authors extend the scope of computational modeling to child development by adopting egocentric video of infants. The authors show that the proposed self-supervised learning method generalizes well to object categorization tasks for two datasets, and the learned representations explain the findings well.

Strengths: Understanding the developmental process of human beings through computational modeling is an important research area, and is of interest to a broad audience in the community. The work naturally combines progress in self-supervised learning algorithms with the SAYCam dataset of infant development recordings to help push the boundary of understanding human capabilities for object categorization. To me this is a novel work and is of significance to the research field. The methodologies and experiments are technically sound as well.

Weaknesses:
* The authors need to provide more details for the temporal classification method. For example, why and how is the self-supervised approach trained using a standard classification setup? Why not train the temporal model in an unsupervised way (for example, combine features across different frames and use single-frame contrastive learning losses, which could serve as a baseline)? The authors should provide clear explanations and comparisons for the classification setting vs. the self-supervised setting.
* The authors provide many feature/attention map visualizations. However, it is not clear to me how these single-neuron/class-level analyses link to the core idea of the manuscript about general object recognition.
* Line 193: I am a bit confused about the “exemplar split” setting. It seems that using 90% for testing should help with the performance? More explanation is needed.
* Figure 3: Why do ImageNet-pretrained models perform worse on Labeled S than on Toybox?

=======Post-rebuttal======

The authors have addressed most of my concerns, including details of temporal classification, the unsupervised training baseline, visualizations, etc. I believe the most valuable contribution of the submission is the egocentric infant video dataset, which helps with the understanding of child development and the corresponding perception capabilities. This definitely enables further research opportunities in the field. I increase the score accordingly.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper tackles an interesting question: what can be learned from the naturalistic visual experience children receive during their development? Based on a recent longitudinal, egocentric video dataset captured from children, a temporal classification method is proposed, which takes temporal invariance into account during model training. The learned model achieves competitive transfer learning results on some datasets, compared to MoCo_v2 and even ImageNet supervised pretrained models.

Strengths:
+ The problem of learning visual representations from a longitudinal, egocentric video dataset is very interesting. I think this problem is quite natural.
+ The paper is well written.
+ Systematic experimental exploration of the problem of interest.

Weaknesses:
- I expected the linear evaluation performance on ImageNet to be impressive. However, it is a pity that this transfer learning performance is poor, with TC-S reaching only 20.9% at best. This seriously limits the impact of this work. If the model can only perform well on some easy datasets that are close to SAYCam, we cannot gain much from learning on such datasets, especially with access to so many big datasets. Maybe the authors can replace SAYCam with other standard video datasets (e.g., Charades) and see whether they obtain good transfer learning performance.
- I expected to see more intuitive discoveries from learning on these egocentric videos, but the discovery based on attention is not surprising. The paper uses the method of "Class Activation Maps (CAM)" (Learning Deep Features for Discriminative Localization, CVPR 2016). Recent contrastive self-supervised methods also have such properties on images alone, without considering temporal information. When temporal information is considered, there should be more related cues that may be learned, such as object grouping or optical flow, even without using CAM. I suggest the authors read "PsyNet: Self-supervised Approach to Object Localization Using Point Symmetric Transformation", AAAI 2020.
- Missing reference: The authors use CAM to visualize attention but do not cite the paper: Learning Deep Features for Discriminative Localization, CVPR 2016.
- If the transfer learning performance on standard datasets such as PASCAL, CIFAR, etc. is good, we can say the proposed method learns good and generalizable representations. Thus it may be a good idea to report these results.

----Update------

I read the rebuttal and the other reviewers' comments and would like to raise the score. Definitely, the problem of learning from longitudinal, egocentric video is quite interesting for both the computer vision and psychology communities. I may be biased toward the practical side. But in the future, I hope the authors can show more of what current self-supervised learning methods (proposed for non-egocentric data) cannot do on egocentric data. I also think all the discoveries in the paper are not surprising and would still hold for non-egocentric videos. Thus these discoveries may come from the algorithm side rather than from the data side. I think the authors do not fully leverage the characteristics of egocentric videos. If these are utilized wisely, we may gain a better understanding and surpass the current methods proposed for non-egocentric videos/images. But as a first attempt, the paper is worthy of publication.

Correctness: Correct.

Clarity: Yes

Relation to Prior Work: Some important missing references.

Reproducibility: Yes

Additional Feedback: Although I think the problem is interesting, I still think this paper can be improved before publication.

Review 4

Summary and Contributions: This paper applies recently proposed unsupervised learning algorithms to a video dataset that is close to what infants receive during their development and shows that self-supervised algorithms can lead to powerful representations that yield strong downstream task performance.

--update--

After reading the rebuttal, I will increase my score. I think this work shows a promising new direction for both machine learning and psychology: how to have good machine learning algorithms that can leverage the same data humans get during development. Therefore, I think it’s worthwhile to have the results of the TS algorithm on other Internet datasets. They will show whether different algorithms are needed to handle different datasets.

Strengths: This work is claimed to be the first to show that a good representation can be learned from a child-view dataset using a self-supervised learning algorithm. The authors also propose a new self-supervised learning algorithm, the temporal classification algorithm, that works better than the SOTA algorithm on this dataset. The representation learned by this algorithm even surpasses an ImageNet-pretrained representation when evaluated on this dataset. The authors also provide analyses of the learned representations.

Weaknesses: Although the representation learned by the TS algorithm is better than the ImageNet-pretrained representation on the labeled S task, it is much worse than the latter on a third dataset, Toybox. This indicates that the better performance on the labeled S task is mainly due to the domain advantage this representation has, as it is evaluated on the same dataset that it is learned from. Although the temporal classification algorithm achieves better performance than MoCoV2, the authors could have addressed another interesting question: can this algorithm be applied to other video datasets and learn powerful representations from unlabeled videos that are widely available on the Internet? The ablation results show that a segment length of 288s is better than the other settings; however, it would be better to see results for segment lengths larger than this.
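
For context on the segment-length ablation mentioned above, the following is a minimal sketch of how temporal classification targets could be formed. It assumes the video is split into consecutive fixed-length segments and that each frame's class label is the index of the segment it falls in. The function names and the timestamp-based interface are hypothetical illustrations, not the authors' code.

```python
import math


def temporal_class_labels(frame_times_s, segment_length_s=288.0):
    """Map each frame timestamp (in seconds) to the index of its segment.

    Assumption: segments are consecutive, non-overlapping, and all of
    length `segment_length_s`; the segment index serves as the class label.
    """
    if segment_length_s <= 0:
        raise ValueError("segment_length_s must be positive")
    return [int(t // segment_length_s) for t in frame_times_s]


def num_temporal_classes(total_duration_s, segment_length_s=288.0):
    """Number of segment classes needed to cover a video of the given length."""
    if segment_length_s <= 0:
        raise ValueError("segment_length_s must be positive")
    return math.ceil(total_duration_s / segment_length_s)
```

Under this assumption, a longer segment length yields fewer classes with more frames per class, which is the trade-off that results for segment lengths above 288s would probe.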

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: