Review for NeurIPS paper: What Do Neural Networks Learn When Trained With Random Labels?

NeurIPS 2020

What Do Neural Networks Learn When Trained With Random Labels?

Review 1

Summary and Contributions: Training deep neural networks with entirely random labels exhibits some interesting phenomena. Previous work usually studies these phenomena from the perspective of memorization and generalization of neural networks. In contrast, this paper studies the observation that pre-training on a randomly labelled dataset can *sometimes* still help transfer learning. The paper shows that networks can learn weights whose principal components are aligned with those of the data, which may explain this positive effect. Through theoretical analysis and empirical study, the authors verify this hypothesis. Furthermore, the paper also gives a possible explanation of cases where pre-training on random labels is harmful.

Strengths: - This paper is well motivated by an interesting phenomenon that arises when training networks with random labels. - The story and mathematical derivations are easy to follow.

Weaknesses: I’m mainly concerned about the experiment section. In particular, I’m a bit confused about the experimental setting of Section 2.5. How many layers does the network have in this setting when trained with random labels? As I understand it, when applying the network to downstream tasks, only the first layer’s weights are learned, and the deeper layers’ weights are computed analytically. Is that the case? ========= after rebuttal ========= I read the authors' rebuttal and have decided to maintain my original score.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: ==========Update after rebuttal========== I have read the authors' response and the other reviews. The rebuttal addressed most of my concerns, although I still do not fully understand the link to critical periods. I look forward to the expanded discussion of this relationship in the camera-ready version. I leave my score unchanged and maintain that the paper should be accepted. =================================== This work is motivated by the observation that pretraining a deep network on randomly labelled images can have either a positive or a negative transfer effect on a downstream task, depending on the scale of the initialization and the number of downstream classes. The authors explain this observation by analyzing what deep networks learn when trained on random labels, finding that the first layer learns to align the covariance of its weights with the input covariance.
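To make the alignment claim concrete, here is a minimal sketch of one way to check whether a weight covariance shares its eigenvectors with the input covariance. This is a simple proxy written for illustration and is not the misalignment score defined in the paper; the function name offdiag_fraction and the synthetic covariances are assumptions.

```python
import numpy as np

def offdiag_fraction(data_cov, weight_cov):
    """Fraction of the squared Frobenius norm of weight_cov that lies off the
    diagonal once weight_cov is expressed in the eigenbasis of data_cov.
    A value near 0 means the two matrices share eigenvectors (illustrative proxy)."""
    _, eigvecs = np.linalg.eigh(data_cov)
    rotated = eigvecs.T @ weight_cov @ eigvecs
    total = np.sum(rotated ** 2)
    diag = np.sum(np.diag(rotated) ** 2)
    return float((total - diag) / total)

rng = np.random.default_rng(0)
d = 8

# Synthetic stand-in for the input covariance.
a = rng.standard_normal((d, d))
data_cov = a @ a.T / d
_, u = np.linalg.eigh(data_cov)

# Hypothetical weight spectrum; any non-constant positive values would do.
spectrum = np.linspace(1.0, 2.0, d)

# Weight covariance built in the data eigenbasis (aligned by construction).
aligned_cov = u @ np.diag(spectrum) @ u.T

# Same spectrum, but in an unrelated random orthogonal basis.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
misaligned_cov = q @ np.diag(spectrum) @ q.T

aligned_score = offdiag_fraction(data_cov, aligned_cov)
misaligned_score = offdiag_fraction(data_cov, misaligned_cov)
```

In practice, weight_cov would be estimated from the first-layer filters of a trained network and data_cov from the input patches those filters see.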

Strengths: * I think this paper is highly significant and will be of mass appeal to the NeurIPS community. * The ideas are clearly communicated. Figure 3 is especially great. * The experimental details are thoroughly described in the supplemental material.

Weaknesses: * The formal analysis focuses only on the first layer. (This is not a major weakness.) In Section 2.5, experiments with deeper layers are discussed. This discussion assumes that what is going on at the first layer is representative of deeper layers as well. However, the paper also discusses the task specification that occurs at deeper levels. It seems there will be a trade-off between the bottom-up eigenspace alignment and the top-down task-specificity. The theoretical relationship between these two forces is left for future work. * The authors propose no broader societal or ethical impacts of their work. I think an argument can be made that work on understanding deep learning can contribute to interpretable and explainable AI, which has the potential to help identify sources of bias, for example. I encourage the authors to try to write something more meaningful in that section.

Correctness: * It is claimed that the present results help to explain critical learning stages in DNNs. This claim is made in both the introduction and the discussion but it is not justified clearly anywhere in the text.

Clarity: The paper is well written and I did not find any typos or ungrammatical sentences. Minor points * Figure 4: use subtitles for each subfigure to make it more readable, e.g. synthetic, CIFAR10-random, CIFAR10-real. The labels are also too small to read.

Relation to Prior Work: * The authors fail to mention previous work on unsupervised pretraining. Back in the Deep Belief Network days, we thought unsupervised pretraining was necessary for deep learning to work. How do these results on what neural networks learn when trained on random labels mesh with our understanding of how unsupervised pretraining helped DBN training? * Erhan, D., Courville, A., & Vincent, P. (2010). Why Does Unsupervised Pre-training Help Deep Learning? Journal of Machine Learning Research, 11, 625–660. * Section 2.5 sounds similar to the procedure described in Dehmamy, N., Rohani, N., & Katsaggelos, A. K. (2019). Direct Estimation of Weights and Efficient Training of Deep Neural Networks without SGD. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3232–3236). * The submission describes a large body of work as aimed at "identifying fundamental differences between real and random labels". This seems like a strange way of summarizing the goals of those works, which are more about identifying fundamental differences between networks that generalize and networks that memorize. But I understand the intended distinction between that work and the present submission, namely that previous work treats memorization as a negative thing whereas this submission also looks at positive transfer.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper aims to understand what a neural network learns when it is trained on random labels. As a motivating factor, the authors show that pre-training on random labels can speed up training on true labels. They explain this speed-up with 'alignment' between the input data and the first-layer weights.

Strengths: 1. The paper studies an interesting question - understanding what the network learns even when it is trained on random labels can give us more insight into the implicit biases of our architectures and training procedures - a largely open question. 2. Though Proposition 1 has been proven for a narrow case (Gaussian inputs), the conclusion seems to hold broadly as demonstrated by the experiments on sampling the weights from the Gaussian approximation. I found it surprising that you can gain speed ups only from the second order statistics of the weights. The empirical definition for misalignment also seems well thought out. 3. The discussion on specialization in the neurons in the later layers is also interesting and may have implications for transfer learning in general

Weaknesses: The main weakness of this paper is the organization and clarity. More detailed comments below. The paper also makes some unsubstantiated claims in some places. For instance, the paper provides some reasoning for the shape of f(\sigma) which could be tested but it is not easy to follow - why is it reasonable to expect that the large eigenvalues will dominate the output of the layers or what does it mean for backprop to capture this signal? ___ Post rebuttal ___ I am happy with the author response and would keep my score as is

Correctness: 1. Proposition 1 and the accompanying proof are correct 2. Experimental methodology is also correct

Clarity: The paper conveys the main points but can be organized better. The paper conducts a few different types of experiments and the results feel a bit scattered at times. Some more concrete problems: 1. The existence of positive and negative transfer has been mentioned a few times, but the precise cases in which they arise have not been specified. 2. f(\sigma) is not properly defined. The experimental setup for the synthetic case is not properly described. 3. Line 213 "Suppose that instead of pre-training on random labels, we sample from the Gaussian approximation of the filters in the first layer that were trained on random labels." It is hard to understand what the exact procedure is here.
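For point 3, one possible reading of the procedure can be sketched as follows: fit a Gaussian to the first-layer filters learned on random labels, then draw fresh filters from that Gaussian instead of reusing the learned ones. This is an illustrative guess at the procedure, not the authors' code. The helper name sample_filters_like, the use of the empirical mean (rather than a zero mean), and the random stand-in for the trained filters are all assumptions.

```python
import numpy as np

def sample_filters_like(trained_filters, num_samples, rng):
    """Draw new filters from a Gaussian fitted to trained_filters.
    trained_filters has shape (num_filters, ...); each filter is flattened
    to a vector before the mean and covariance are estimated."""
    flat = trained_filters.reshape(trained_filters.shape[0], -1)
    mean = flat.mean(axis=0)
    cov = np.cov(flat, rowvar=False)  # (dim, dim) covariance across filters
    samples = rng.multivariate_normal(mean, cov, size=num_samples)
    return samples.reshape((num_samples,) + trained_filters.shape[1:])

rng = np.random.default_rng(0)

# Stand-in for filters learned on random labels: 256 filters of shape 3x3x3.
# In a real experiment these would be read from the pre-trained network.
trained = rng.standard_normal((256, 3, 3, 3))

# New first-layer initialization that keeps only first- and second-order statistics.
new_filters = sample_filters_like(trained, 64, rng)
```

Stating explicitly in the paper which statistics are fitted, and over which set of filters, would remove the ambiguity.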

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper analyzes the effect of training with random (fixed) labels on the weights of a neural network. The authors show theoretically and empirically that the principal components of the weights align to those of the data. The experiments are carried out on three different architectures using the CIFAR-10 dataset.

Strengths: Understanding the effects of training with random labels is an important problem that is worth investigating. This paper makes an interesting observation both theoretically and practically. The theoretical findings are evaluated thoroughly using different architectures. The paper is well written and easy to follow.

Weaknesses: Even though the authors dismiss performing experiments with image augmentations (L 50) as it would introduce a supervisory signal, it could be beneficial to investigate it in the paper. Even though augmentations do add a prior on the expected data distribution, it could be worthwhile to investigate the effect. This is of course another step away from the i.i.d. assumption in Proposition 1, but since neighboring patches of the same image are already correlated, the effect in practice could be interesting to observe. Along the same lines, I would expect that with increasing kernel size of the convolutions, the correlation between patches increases and with that potentially the misalignment score. If this understanding is correct I would also expect that the experiment in Fig. 6 would look very different if only the last layers were transferred instead of the first layers. This would mean that random labels are a meaningful proxy to learn early layers but not the later layers. This would also align with the specialization observation in Sec. 3. +++++++ Post Rebuttal +++++++ After reading the other reviews and the authors’ feedback, I find most concerns addressed and will keep my recommendation. However, I would still recommend investigating the effect of augmentations on the training, since these are crucial in almost all applications and will affect the independence assumption between patches.

Correctness: The claims of the paper are theoretically sound and are empirically validated in several experiments. Due to the randomness involved in NN training (weights, batches) results are reported as averages over multiple runs in most places.

Clarity: The paper is well written and the structure makes sense, unraveling different observations and experiments one at a time. Section 3, however, is not fully self-contained, and a lot of its content has been moved to Appendix F. Similarly, the paper contains only CIFAR-10 evaluations; all ImageNet experiments are in the appendix.

Relation to Prior Work: The related work section is well structured and contains most relevant literature. One angle that could be included is the area of research that investigates what early layers learn by reducing the amount of training data that is used (e.g., one image in [Asano20], handcrafted Conv 1+2 via scattering transforms in [Oyallon18]). This direction is interesting as it shows that the early layers of a CNN do not seem to depend much on the actual training data and can be learned from little data or directly crafted. The paper here comes to a similar conclusion from the theoretical side. References: [Asano20] Yuki M. Asano, Christian Rupprecht, Andrea Vedaldi. A critical analysis of self-supervision, or what we can learn from a single image. ICLR 2020. [Oyallon18] Edouard Oyallon, Sergey Zagoruyko, Gabriel Huang, Nikos Komodakis, Simon Lacoste-Julien, Matthew Blaschko, and Eugene Belilovsky. Scattering networks for hybrid representation learning. TPAMI 2018.

Reproducibility: Yes

Additional Feedback: With the findings in the paper it should be possible to construct a set of weights that aligns with the observed data distribution. Would this constructed set of weights be enough to accelerate downstream training? Typos and other minor things: Fig.4 “per epochs” -> “per epoch” Text in Figs. 3&4 is very small
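The construction asked about above could look roughly like the following sketch: take the eigenvectors of the data covariance, assign each direction a variance through some chosen function of the data eigenvalue, and sample filters from the resulting zero-mean Gaussian. The function sample_aligned_weights and the placeholder spectrum_fn are hypothetical; the actual mapping from data eigenvalues to weight variances is not specified here, so the one below is only a stand-in.

```python
import numpy as np

def sample_aligned_weights(patches, spectrum_fn, num_filters, rng):
    """Sample filters whose covariance shares eigenvectors with the data.
    patches: (num_patches, dim) array of flattened input patches.
    spectrum_fn: maps data eigenvalues to per-direction weight variances
    (a placeholder choice, assumed for illustration)."""
    centered = patches - patches.mean(axis=0)
    data_cov = centered.T @ centered / (len(patches) - 1)
    eigvals, eigvecs = np.linalg.eigh(data_cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    weight_var = spectrum_fn(eigvals)
    # Each row is U diag(sqrt(var)) z with z ~ N(0, I), so its covariance
    # is U diag(var) U^T, which is aligned with data_cov by construction.
    z = rng.standard_normal((num_filters, patches.shape[1]))
    return (z * np.sqrt(weight_var)) @ eigvecs.T

rng = np.random.default_rng(0)

# Synthetic correlated "patches" standing in for real image patches.
mixing = rng.standard_normal((12, 12))
patches = rng.standard_normal((1000, 12)) @ mixing

# Placeholder spectrum: variance proportional to the data eigenvalue.
weights = sample_aligned_weights(patches, lambda s: s / s.max(), 32, rng)
```

Comparing downstream training speed from such an initialization against the random-label pre-trained weights would answer the question directly.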