NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 1688
Title: On Lazy Training in Differentiable Programming

Reviewer 1

This work provides a unified framework, called lazy training, to explain some recent successes in deep learning theory. In general, it shows that with proper scaling, many real-world machine learning applications, including two-layer neural networks, enjoy the properties of lazy training, which makes them easier to train. Experiments back up their theory. Overall, this is a good submission, and I recommend accepting this paper. My comments are listed as follows.

- It seems that this paper considers both the empirical loss and the population loss within a general loss function. I suggest the authors highlight this in their problem setting.

####################################

I have read all the reviews and the authors' response.
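To make the scaling mechanism discussed above concrete, here is a minimal NumPy sketch (not taken from the paper or its code; the two-layer tanh network, the scale factor alpha, the step size, and all variable names are illustrative assumptions). It trains the rescaled model alpha * h(w, x) on the correspondingly rescaled squared loss and records how far the parameters move from initialization: as alpha grows, the parameters barely move while the training loss still drops, which is the lazy regime referred to in the reviews.

```python
# Minimal sketch of lazy training induced by output scaling (illustrative only;
# the architecture, step size, and constants are assumptions, not the paper's setup).
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 5, 100                       # samples, input dimension, hidden width
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))         # arbitrary smooth regression target

def net(W, a, X):
    """Two-layer model h(w, x) = a^T tanh(W x) / sqrt(m)."""
    return np.tanh(X @ W.T) @ a / np.sqrt(m)

def train(alpha, steps=2000, lr=0.1):
    """Gradient descent on the rescaled loss (1 / alpha^2) * mean((alpha*h - y)^2) / 2."""
    W, a = rng.normal(size=(m, d)), rng.normal(size=m)
    W0, a0 = W.copy(), a.copy()
    for _ in range(steps):
        H = np.tanh(X @ W.T)                          # hidden activations, shape (n, m)
        r = (alpha * (H @ a) / np.sqrt(m) - y) / n    # residuals of the scaled model
        grad_a = (H.T @ r) / (alpha * np.sqrt(m))
        grad_W = ((np.outer(r, a / np.sqrt(m)) * (1 - H**2)).T @ X) / alpha
        a -= lr * grad_a
        W -= lr * grad_W
    move = np.sqrt(np.linalg.norm(W - W0)**2 + np.linalg.norm(a - a0)**2)
    loss = np.mean((alpha * net(W, a, X) - y)**2)
    return move, loss

for alpha in (1.0, 10.0, 100.0):
    move, loss = train(alpha)
    print(f"alpha={alpha:6.1f}  parameter displacement={move:.4f}  train loss={loss:.4f}")
# Expected behaviour: the displacement shrinks as alpha grows while the loss
# still decreases, i.e. the heavily scaled model trains near its initialization.
```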

Reviewer 2

The paper provided some interesting understanding, but it is not significant enough to explain the interesting issues in deep learning.

The paper showed that lazy training can be caused by parameter scaling and is not specific to the overparameterization of neural networks. What does this tell us about overparameterized neural networks? Does this result imply that the lazy regime of overparameterized neural networks is necessarily due to parameter scaling? If not, the lazy regime of overparameterized neural networks cannot be explained simply by parameter scaling. I would like to understand the logic of the paper here. What exactly does the paper want to convey?

The paper provided experiments to demonstrate that lazy training does not necessarily yield good performance. This is a good observation. However, beyond this, does it tell us anything about overparameterized neural networks? I feel it does not imply that the lazy training regime that overparameterized neural networks enter fails to provide good performance. I find it more meaningful to compare the lazy training of overparameterized neural networks with the lazy training induced by a large scaling parameter in underparameterized neural networks. I wonder if any of the experiments in the paper imply any property of such a comparison.

----------

After authors' response: The authors answered my questions satisfactorily. I appreciate their efforts in running extra experiments to provide further understanding. Thus, I raise my score to 6.

Reviewer 3

The paper builds on the existing idea of lazy training; the authors give new insights into the existence of an implicit scale that controls this phenomenon. This is an interesting idea; nevertheless, it feels that they should expand more on it. Technically, the paper is not strong; it feels more like an experimental paper. The idea is novel, but I am not sure about its importance at this point.

==================================================

After rebuttal: I am still not convinced about the significance of this contribution (which is certainly not a technical one). I keep my score as it is.