NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:3778
Title:Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup

Reviewer 1

Please see comments in item 1 above. Some other comments: The statement of Thm 1.1. can be greatly improved. I suggest separating the assumptions in a previous statement, in an itemized manner. Then start the statement of the theorem with: Under Assumptions?, let $T > 0$ and $\alpha > 0$. Then, the macroscopic state ... Also, just say "where $m(\alpha)$ is the unique solution of ..." (you don't need the "deterministic function") Furthermore, is $f$ Lipschitz continous? If not, the ODE (8) is not guaranteed to have a unique solution. The approximation in Eq. (10) is for SCM. Does this provide any insights into more general settings? I believe this situation is unrealistic in general NN, for instance the results of Sec. 3 points out to the complete opposite when training both layers. Line 149: It is being mentioned that the ODE is going to be integrated numerically. So how this gives analytical predictions? Minor comments: Please proof read the paper (there were quite some english mistakes and poorly formed phrases). For instance: Line 16: I find "behind" a little weird here. Maybe besides? Line 36: too -> to Line 39: oft-used -> often used Line 46: particular -> particularly Line 48: a a -> a Line 72: coupled ODE -> coupled ODEs Line 82: this last phrase seems quite weird to me. There are solutions of SGD dynamics that SGD does not find?? I don't know what is implied here. If a dynamical system has a set of admissable solutions, how it does not "find" it? Eqn. (1): Write the sum correctly: \sum_{m=1}^M Lines 118-119: I don't understand the "overlap" ... measure of the overlap Line 123: "we give write full expression"? Line: 124: What is a closed set of ODEs? Do you mean a set of coupled ODEs with closed form solutions? Line 147: "... SGD finds solutions with drastically different performance..." ========= post rebuttal. I read the author's response and I keep my score. I think it's a good paper.

Reviewer 2

This paper studies the learning dynamics of two-layer neural networks in the teacher-student scenario under the assumptions that the input is i.i.d. from zero-mean unit-variance Gaussian and its dimension is very large. The dynamics is set to be the online algorithm or the stochastic gradient descent (SGD) with mini-batch of single sample, and the dataset size is also assumed to be sufficiently large so that the parameters have no correlation with forthcoming samples. Thanks to these assumptions, the dynamics is governed only by the covariances of connections of the student and teacher, and the closed-form macroscopic dynamics of those covariances can be derived from the SGD dynamics itself. Using this macroscopic dynamics, the generalization error which is also characterized by the covariances only, can be accurately calculated. In the soft committee machine (SCM) setting (i.e. weight from the hidden layer to the output is constant), when only the first layer of the student is leant, it is shown that the overparameterization (the student has more nodes in the hidden layer than the teacher) generally degrades the generalization ability irrespectively of the choice of activation functions (here sigmoid (erf), linear, and ReLu are treated). Meanwhile when both layers of the student are leant, the generalization ability strongly depends on the choice of the activation function: For the sigmoid activation, the generalization error ``decreases'' as the overparameterization level ``increases'' while for the other activations the generalization error almost stays constant with respect to it. A further interesting observation is that there exist another fixed point of SGD exhibiting better generalization but cannot be found from random initialization for the linear and ReLu activations. These observations reveal the complicated interplay among the activation function, the network structure, and the algorithm, which largely affects the generalization ability. This study is thus important because it reveals the need for a more careful consideration about the generalization and the learning dynamics. Originality: Analysis of two-layer neural networks is recently drawing much attentions of the researchers in the community, but this type of analysis has never been done and the problem setting is interesting. The novelty is clear. Quality: The submission is technically sound and the theoretical prediction is well supported by experiments. The quality is high enough. Clarity: The paper is well organized and equipped with nice appendices well summarizing the detailed computations and experiments, though I think the introduction is slightly redundant. No need of large modifications. I only give the list of typos and uneasy-to-understand expressions which I found: Line 36: they have too -> they have to Line 48: a a surge -> a surge Line 56: kernel regime -> kernel regimes Line 94: (x^{\mu},y^{\mu}_{B}). What the subscript B? Line 609: (S34ff) -> (S34) Equations S26-S30: These expressions are for sigmoid activation, but it is not explicitly mentioned in Appendix B. This point should be addressed. Significance: I think this work is an important one because it provides an insight about the learning dynamics and the generalization. Explicit demonstration of the superiority of the overparameterization in the generalization error in the sigmoid activation is interesting and suggesting. Additional comments after feedback: Given the author feedback, I am impressed with the result for larger K and the difference from the mean-field result in the infinite width limit, and accordingly I increased the score. I think the fact that the authors found the different convergent point of SGD from that oft the mean-field theory is very interesting and should be stressed in the Discussion section more in the revised manuscript.

Reviewer 3

--- Added after feedback --- All reviews are in agreement that this is a strong contribution and given that consensus and the convincing nature of the author response, which carefully addressed all of my concerns, I have increased my score to a 9. I firmly believe that the technical developments in this paper will enable further research on the detailed dynamics of neural networks within and beyond the student-teacher paradigm. --- Original review ---- In this compelling and well-written paper, the author(s) derive a closed-form and tractable system of ODEs to study the dynamical evolution of neural networks using a "student-teacher" paradigm. While the student-teacher set-up, in which one attempts to learn a neural network with a neural network of similar architecture, is perhaps not the most practically significant case, this framework also for a precise formulation of notions that are often left vague, most notably "over-parameterization". Typically, a less meaningful comparison between the cardinality of the data set and the number of parameters in a neural network is used. In this case, it is much more straightforward, simply being the discrepancy between the number of student vs teacher hidden units. I found the analysis to be satisfyingly complete, deriving asymptotic expressions for sigmoidal, linear, and ReLU networks. For the soft committee machines, numerical experiments seem to agree extremely well with the asymptotic expressions, to the extent I left wondering what magnitude of output noise is required in order for the asymptotic expressions to fail. Perhaps my only major complaint about the paper is that it does not describe the numerical experiments in detail. There is a promise of eventual github links, but I would have liked to know, at least at a high level in the appendix, how they were set-up, how long they were run, what was used as a stopping criterion, etc. The paper contrasts the case of the soft committee machine in which the weight of every neuron is unity with the more general and more typical set-up of training both the pre-activation and the post-activation parameters. The results are different in the two scenarios---to a surprising degree. Some of the work on mean-field neural networks that the authors cite is related to this phenomenon, I believe: the dynamical accessibility of optimal solutions hinges on the evolution of the linear coefficients of the neurons. Regarding clarity and quality: While I have a number of minor comments on some of the logical steps in the proof, as well as the more significant point about the lack of information with regard to the numerics, on the whole the paper is sufficiently clear. Overall, I found the paper to be of a high caliber, both timely and complete. I have a few comments about the proof of the paper's main theorem: - The meaning of $\mathbb{E}_{\mu} $ is not clearly stated anywhere that I could find, but I took it mean an expectation over the initial conditions taken at time-step $\mu$ - The use of $m$ and $q$ is fairly confusing---$q$ only contains the time-dependent order parameters that involve the student, whereas $m$ contains both the student and teacher order parameters. Is it necessary to separate the two rather than just using $m$ throughout? - The step going from S11 to S12 is not well-explained. While the assumption that the order parameters are uncorrelated at $\mu=0$ seems perfectly fine, this does not necessarily imply $q^0$ is independent of $f_q(m^0, x^0)$. As a result, there appears to be an extra expectation in the first line of S11. If as written the equation is correct, I think a more detailed explanation is merited. - The discussion of the coupling trick glosses over the functions $d$ and $g$ and what properties they require *Specific comments on originality* I would not say that the question being investigated is particularly original, nor is the basic framework of using order parameters and deriving ODEs for the evolution of these parameters. As the authors acknowledge, this basic approach has existed in the statistical mechanics literature for many years. However, I would emphasize that using this approach to resolve questions regarding implicit regularization in SGD and quantify over-parameterization directly is a novelty. Below are some entirely inconsequential typos that I noticed which the author(s) may want to correct: Line 156: "case studied most commonly so far" seems like a hard thing to know Line 196: misplaced "*" Line 255: "fulfil" -> fulfill Line 285: "exemplary" is a bit awkward due to the connotation of "exceptional" rather than "typical" Fig S2 is highly pixelated Fig S5 caption "sigma" -> "\sigma" SI 548: "namely namely" SI 640: "to first order \sigma^2"