Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper empirically shows that SGD learns functions of increasing complexity, through experiments on real and synthetic datasets. Specifically, in the initial phase the learned function can be almost entirely explained by a linear function; in the later phase the accuracy increases, but the correlation between the learned function and the linear function does not decrease (so the linear function is not "forgotten"). The measure used to quantify these correlations is based on mutual information. This observation is extended beyond linear functions to the case of CNNs of increasing depth, although the phenomenon is less clear there. Finally, the paper provides a theoretical result for a linear model in which overfitting does not hurt generalization.

Originality: Measuring function complexity using mutual information appears to be new in the study of deep learning.

Quality: The experimental results are solid and the phenomenon is evident from the plots. The theoretical result appears to be correct, although I did not check the proof in the appendix.

Clarity: The paper is well written and easy to follow.

Significance: Understanding the implicit regularization in training neural networks is an important problem. A key step is how to formulate the notion of implicit regularization, i.e., how to identify the regularization as something concrete rather than merely "promoting generalization". This paper uses mutual information, which is able to capture the interesting property of SGD identified in the paper. I find this observation interesting, and it could be useful for further research on this topic. The downside is that the paper is purely empirical (the theoretical result in Sec 5 is only a toy example and not very relevant to the other parts of the paper), and there isn't any intuitive explanation of why the observed phenomenon might occur.

==== update ====
I have read the authors' feedback and am going to keep my evaluation.
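For concreteness, the mutual-information quantity the reviews refer to can be estimated directly from the discrete predictions of two classifiers. The sketch below is only an illustration of the plain I(F, L) baseline that one review compares the paper's measure against; it does not reproduce the paper's actual \mu_Y(F, L) definition (which additionally involves the labels Y), and the function name and interface are hypothetical.

```python
import numpy as np

def mutual_information(f_preds, l_preds):
    """Empirical mutual information I(F; L), in bits, between two
    discrete prediction vectors, estimated from their joint
    contingency table."""
    f_preds = np.asarray(f_preds)
    l_preds = np.asarray(l_preds)
    # Map each classifier's predicted labels to contiguous indices.
    f_vals, f_idx = np.unique(f_preds, return_inverse=True)
    l_vals, l_idx = np.unique(l_preds, return_inverse=True)
    # Build the empirical joint distribution over (F, L) predictions.
    joint = np.zeros((len(f_vals), len(l_vals)))
    for i, j in zip(f_idx, l_idx):
        joint[i, j] += 1.0
    joint /= joint.sum()
    # Marginals, then sum p(f,l) * log2( p(f,l) / (p(f) p(l)) ).
    pf = joint.sum(axis=1, keepdims=True)
    pl = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pf @ pl)[nz])).sum())
```

On perfectly agreeing binary predictions this returns 1 bit; on independent balanced predictions it returns 0, which is the degenerate case the review's point about I(F, L) = 0 concerns.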
Update: I have read the authors' response and am keeping my score.

===========

Originality: The idea that SGD biases networks towards simple classifiers has been discussed extensively in the community, but has lacked a crisp formalization. This paper proposes a formal description of this conjecture (and also extends it -- Conjecture 1 and Claim 2) using a rich yet simple information-theoretic framework. This work is roughly related to recent work on understanding the inductive bias of SGD in function space, e.g. Valle-Pérez et al. and Savarese et al.: both only analyze the solution returned by SGD (under different assumptions), not the intermediate iterates, as this paper does. Defining the 'simplicity' of a neural network solely through the mutual information between its predictions and those of another classifier leads to objects that are invariant to reparameterization of the network and depend only on the function it implements. Such a measure is highly novel and might prove to be a good candidate for studying the complexity of networks in function space. The characterization of the two regimes (linear and 'post-linear') is also novel and interesting.

--------------------

Quality: The claims are clear and backed up by experimental results. Since the paper aims to support the conjecture/claims with experiments, these could be more extensive. For example, plots of \mu_Y(F_t, L), I(F_t, L), and I(F_t, Y_S) over time (as in Fig 4) in settings where the conjectured behavior does not manifest itself:
- Bad initialization, such as in Figure 8b in the appendix
- A different optimizer, which can be simulated with SGD + L2 regularization with a small negative penalty lambda; by forcing the weights to grow, any potential implicit L2 regularization of SGD is cancelled
- Non-standard network parameterization

In other words, it would be interesting to have some idea of what exactly is necessary for the conjectured behavior to occur.
--------------------

Clarity: The paper is very well written, with clear definitions, experiments and discussion.

--------------------

Significance: The formalization of the conjecture, the clear experimental results backing it up, and the proposed complexity measure \mu_Y(F, L) are all strong contributions, and very timely given the recent discussion on the inductive bias of SGD in the function space of neural nets. The proposed framework is well suited to this discussion, as it is completely agnostic to how the network is parameterized.

References:
- Guillermo Valle-Pérez, Chico Q. Camargo, Ard A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions.
- Pedro Savarese, Itay Evron, Daniel Soudry, Nathan Srebro. How do infinite width bounded norm networks look in function space?
The main claim of the paper is that, when training a deep network, SGD initially learns a function fully explainable by a linear classifier. This, and other observations, are based on a metric that captures how similar the predictions of two models are. The paper on the whole is very clear and well written. Importantly, the topic of the implicit bias induced by SGD and other optimizers is of great interest and significance to the community. However, the first two claims of the paper are not novel based on my understanding of the field. I also have some theoretical issues with the proposed metric.

More detailed comments:

1. The novelty of the first two claims of the paper is not clear to me. The third claim seems novel and intriguing.

1a) The first (main) claim does not seem novel. "Understanding training and generalization in deep learning by Fourier analysis" and the cited "On the spectral bias of neural networks" try to show, theoretically and empirically, that SGD training a deep network learns sequentially from the lowest frequencies to the highest frequencies in the input space. More precisely, if one decomposes the function computed by the network using the Fourier transform, then the coefficients of the lowest-frequency components are learned first. This seems to undermine the novelty of the paper, because if the claim of these papers is correct, then it implies that a network is initially similar to a linear model.

1b) The claim that the network learns functions of increasing complexity is not very formalized. Considering the level to which it is formalized, it does not seem novel; see for instance https://arxiv.org/pdf/1610.01644.pdf. This makes the outcome of the experiment unsurprising.

2. I have two issues with the metric used.

2a) Some novelty is attributed to the proposed metric. But is there any benefit to using the proposed metric compared to the straightforward mutual information between the network predictions (F) and the linear classifier (L), i.e. I(F, L)?
Perhaps it is trivial (apologies if I missed it), but in any case it should be discussed. It seems to me that if I(F, L) = 0, then the proposed metric is also 0.

2b) While I roughly agree with the main claim, I also wonder whether the metric actually proves the main claim that the initial performance is fully explained by a linear model. Consider a hypothetical scenario: say the model initially learns to classify correctly all the examples that *are not* correctly classified by the linear model. Isn't the metric invariant to such a situation, i.e. wouldn't the \mu curve look similar? If it is, then we cannot claim, based solely on the metric, that the model initially learns the same things as a linear model.

Other comments:

3. There is a group of highly related papers on the analysis of deep linear models. There it is relatively easy to show that the network learns from the simplest correlations between input and output to the most complex (see for instance "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks"), and later work has shown theoretically that this implies that SGD learns increasingly complex functions: https://arxiv.org/abs/1904.13262. I think it would be nice to cite these papers. Another group of related papers that might be worth citing shows that DCNNs are highly sensitive to perturbations in the Fourier domain of the image, e.g. https://arxiv.org/pdf/1809.04098.pdf. This implies that there are directions in the input space to which the model reacts linearly.

4. I would like to mention again that I found the claim that the network remains sensitive to things learned at the beginning of training very intriguing. It might be worth emphasizing this in the revised version. There is also a related work I am aware of that has discussed this phenomenon: https://openreview.net/forum?id=HJepJh0qKX.

5. I am not sure the claims are sufficiently formal to be stated as claims.
This might be personal taste, but I feel the paper's clarity would benefit from a cleaner separation of formal and informal arguments.

Update: Thank you for the well written rebuttal. I am very much convinced by your explanations of the metric's properties, as well as your contextualization of the linear-probe paper. Indeed, the metric is better than mutual information; it might be useful to include this discussion in the paper. I would also like to clarify that I was wrong in my review to say that Fourier-based analyses of deep network training (such as the "On the spectral (...)" paper) explain, strictly speaking, your empirical observations. Having said that, I am still concerned about the novelty of the empirical observations. This novelty is key, because the theoretical contribution of the submission is arguably limited (as also mentioned by another reviewer). As further evidence that the empirical observations are of limited novelty, see Figure 1 of https://arxiv.org/pdf/1807.01251.pdf, which shows that low frequencies on MNIST and CIFAR-10 are learned first. This implies that a model close to linear (in some strict sense) exists at the beginning of training. Based on this I would like to keep my score.