Review for NeurIPS paper: An analytic theory of shallow networks dynamics for hinge loss classification

NeurIPS 2020

Review 1

Summary and Contributions: The authors identify a case in which the gradient descent dynamics of a two-layer neural network can be described analytically, in very explicit form, in the mean-field limit (i.e. a wide hidden layer) with infinitely many examples.

Strengths: The investigation of the transition from the rich to the lazy regime via the introduction of the parameter \alpha is very interesting. The paper makes assumptions to arrive at an explicit solution, but the numerical and discussion sections assess very nicely what happens beyond these assumptions, comparing against toy but realistic simulations. The paper thus provides a very nice combination of mathematical results and evidence of their relevance beyond the assumptions made.

Weaknesses: The dataset used is rather simplistic. The authors justify the choice of dataset by analytic solvability, which is fine and I accept it. But can they also give a hint of where the solution would fail for other models of data? Many results are already known for this same dataset with other learning models, e.g. without hidden units or with just a few hidden units, as in Refs. [11,13]. The authors could put their work in the context of these existing results.

Correctness: As far as I could tell the paper is correct.

Clarity: The paper is written clearly.

Relation to Prior Work: The related work section could be improved. For instance, Ref. [28] is cited for the use of a similar dataset, but isn't this dataset just the teacher perceptron considered in dozens of cases in the statistical physics literature, including e.g. Ref. [11]? A discussion of this relation could be included.

Reproducibility: Yes

Additional Feedback: I would suggest the authors state explicitly, already in the abstract, that their theory applies to the case of an infinite dataset, as this is an important limiting factor. It is discussed nicely in the paper, including the discussion of the limitations, but one needs to read rather far into the paper to even realize this is the setting. ------- I have read the authors' feedback and it addresses my suggestions. I am confident this paper will make a nice contribution to NeurIPS 2020.

Review 2

Summary and Contributions: The authors solve exactly the gradient dynamics of two-layer binary classification networks with a linear hinge loss. The problem is treated in a mean-field limit (with a parameter interpolating between lazy and rich learning, as proposed in the recent literature) and reduced to an effective single-node problem. The data is separable with spherical symmetry, and the solution is analytical. The results are also tested numerically, which is useful for getting an idea of the finite-size effects that kick in at long times.
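
For readers who want a concrete picture of the setting summarized above, the following is a minimal finite-size sketch, not the authors' implementation. Everything specific in it is an assumption made for illustration: labels are given by the sign of the first coordinate of points on the unit sphere, the hidden units are ReLU, the output is averaged over the width and multiplied by a scale `alpha` (the paper's exact parametrization of the lazy/rich parameter may differ), the loss is the standard hinge loss with margin 1, and the function names and default values are hypothetical.

```python
import numpy as np


def make_data(n, d, rng):
    """Points uniform on the unit sphere, labeled by the sign of the first coordinate (assumed data model)."""
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y = np.where(x[:, 0] >= 0, 1.0, -1.0)
    return x, y


def forward(a, w, x, alpha):
    """Two-layer ReLU network with mean-field (1/width) normalization and assumed output scale alpha."""
    h = np.maximum(x @ w.T, 0.0)          # hidden activations, shape (n, width)
    return alpha * (h @ a) / a.size, h


def hinge_loss(f, y):
    """Standard hinge loss with margin 1 (an assumption; the paper's linear hinge loss may differ)."""
    return np.mean(np.maximum(0.0, 1.0 - y * f))


def train(n=2000, d=20, width=200, alpha=1.0, lr=0.5, steps=300, seed=0):
    """Full-batch gradient descent on both layers; returns the loss history and final training accuracy."""
    rng = np.random.default_rng(seed)
    x, y = make_data(n, d, rng)
    a = rng.standard_normal(width)        # second-layer weights
    w = rng.standard_normal((width, d))   # first-layer weights
    losses = []
    for _ in range(steps):
        f, h = forward(a, w, x, alpha)
        losses.append(hinge_loss(f, y))
        active = (y * f < 1.0).astype(float)   # samples still inside the margin
        g = -active * y / n                     # dL/df for each sample
        grad_a = alpha * (h.T @ g) / width
        grad_w = alpha * a[:, None] * (((h > 0) * g[:, None]).T @ x) / width
        # Multiply the step by the width so that each neuron moves by O(1),
        # the usual time rescaling in mean-field treatments.
        a -= lr * width * grad_a
        w -= lr * width * grad_w
    f, _ = forward(a, w, x, alpha)
    accuracy = np.mean(np.sign(f) == y)
    return losses, accuracy


if __name__ == "__main__":
    losses, accuracy = train()
    print("initial loss:", losses[0])
    print("final loss:", losses[-1])
    print("training accuracy:", accuracy)
```

Rerunning `train` with different `width` and `n` gives a rough feel for the finite-size effects mentioned above, and varying `alpha` gives a crude probe of the lazy/rich interpolation under the assumed scaling.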

Strengths: An analytical solution of a special model of (shallow) neural networks. This is a useful benchmark. Interpolation between lazy and rich learning is also considered. It is also interesting that finite size effects are tested numerically.

Weaknesses: The mean-field limit is not controlled rigorously. The authors should be more precise about when an approximation is used. For example, the passage from equation (1) to equation (2) is formal, and this should be stated more explicitly.

Correctness: Correct, as far as I can tell.

Clarity: I found the paper clearly written. In Section 3.1, I didn't follow some of the details about the Gaussian distributions of the initial conditions a(0), w_\parallel(0), and w_\perp(0). Are these assumptions or not? This section was, generally speaking, less clear.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This is an elegant paper in which advanced statistical physics methods are used to study the dynamics of learning in nontrivial NN architectures. Of particular interest is the hydrodynamic treatment of the second layer. The authors specialize to the case of linearly separable data and a linear hinge loss, which allows them to solve the dynamics analytically. Several phenomena, such as the slowing down of the learning dynamics, rich and lazy learning, and overfitting, can be observed in this simple setting.

Strengths: Though the results are mainly limited to the mean-field (MF), infinite-data, and infinite-width regimes, the paper provides an original view of the problem, which can hopefully open the way to improving on statistical physics MF techniques. The paper also serves to connect different languages and communities.

Weaknesses: The infinite-data regime is perhaps the most relevant limiting factor. However, the statistical physics methodology should be of interest to the ML community. The corrections to the MF limit are interesting, though not conclusive. ************** I'm satisfied with the authors' response.

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: In the manuscript, the authors study in detail the training dynamics of a simple type of neural network: a network with a single hidden layer trained to perform a classification task. The authors develop a mean-field theory by modeling the dataset as linearly separable data on a hypersphere. Using the theory, several phenomena are revealed, such as the slowing down of training dynamics, the crossover between rich and lazy learning, and overfitting. The authors also verify the theory using the MNIST dataset.

Strengths: This paper is excellent in that it provides a theoretical analysis of learning dynamics using mean-field theory for the classification problem, and is able to explain phenomena such as the slowing down of training dynamics, the crossover between rich and lazy learning, and overfitting.

Weaknesses: It is unclear whether the assumption of a linearly separable spherical input distribution holds for practical data such as natural image datasets. The authors verify the theory using the MNIST dataset as practical data. But MNIST is known to have a low-dimensional submanifold structure, so it seems relatively easy for it to satisfy this assumption. The authors need to verify the theory using more practical data, such as the ImageNet dataset, or add a discussion of the validity of this data assumption.

Correctness: As mentioned above, I am concerned that MNIST is the only practical dataset used in the experiments.

Clarity: The paper is well written.

Relation to Prior Work: Theoretical studies on learning dynamics other than mean-field theory should be cited and compared.

Reproducibility: Yes

Additional Feedback: ==== UPDATE AFTER AUTHOR RESPONSE ===== Thank you for your very careful reply. I have decided to raise my rating of the manuscript because the authors have added experiments on the applicability of the theory to real data, such as the ImageNet dataset. Thank you very much for your time-consuming experiments.