| Paper ID | 4884 |
|---|---|
| Title | Limitations of Lazy Training of Two-layers Neural Network |

See above. This work seems to be original and well written, and offers an interesting theoretical and experimental exploration of a simple version of several methods often proposed for computation-constrained applications. Edit: The authors have provided avenues for extending the results in the rebuttal. More importantly, they have provided evidence that the trends they observe hold for non-quadratic activations. I have thus incremented my score.

The analysis shows that the error of RF is always bounded away from zero unless the number of neurons N goes to infinity. Both NT and NN achieve zero error if N is greater than or equal to the dimension d. It also shows that NN always achieves a smaller error than NT, because NN learns to fit the most significant direction, while NT can only fit the subspace spanned by random directions. The results of the paper are quite intuitive, but non-trivial in my view. The paper provides clear evidence that, even for simple target functions, a neural network can hold an advantage over random features models. This fact is easy to overlook, because for many simple tasks random features work as well as neural networks. I think the paper can become a reference point for future work comparing MLPs with random features. The comparison between RF and NT is not very meaningful. It is true that NT can achieve zero error with a finite number of neurons while RF cannot, but that only holds for specific target functions (quadratic and mixture of Gaussians), not to mention that NT has many more parameters to learn than RF. That said, the comparison between RF/NT and NN is the main contribution of this paper.
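The RF versus NT gap discussed in this review can be illustrated with a small numerical sketch. This is not the authors' code. It assumes a quadratic activation, a random quadratic target, first-layer weights drawn as scaled Gaussians, and a finite-sample least-squares fit as a stand-in for the population risk. The function names (`rf_features`, `nt_features`, `compare`) are hypothetical, and the fully trained NN regime is not covered.

```python
import numpy as np


def rf_features(X, W):
    """RF regime: first layer W is frozen, features are (w_i . x)^2 (N features)."""
    return (X @ W.T) ** 2  # shape (n, N)


def nt_features(X, W):
    """NT regime: features are gradients of (w_i . x)^2 w.r.t. w_i, i.e. 2 (w_i . x) x."""
    Z = X @ W.T  # shape (n, N)
    # Outer product per sample gives N * d features.
    return (2.0 * Z[:, :, None] * X[:, None, :]).reshape(len(X), -1)


def relative_test_error(F_train, y_train, F_test, y_test):
    """Least-squares fit with an intercept; returns test MSE divided by Var(y_test)."""
    A_train = np.column_stack([F_train, np.ones(len(y_train))])
    A_test = np.column_stack([F_test, np.ones(len(y_test))])
    # lstsq returns the minimum-norm solution, so redundant NT features are fine.
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = A_test @ coef - y_test
    return float(np.mean(residual ** 2) / np.var(y_test))


def compare(d=10, N=10, n_train=4000, n_test=2000, seed=0):
    """Return (rf_error, nt_error) for the target f(x) = x^T B x with Gaussian x."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((d, d))
    B = (G + G.T) / 2.0  # random symmetric target matrix (assumption)
    W = rng.standard_normal((N, d)) / np.sqrt(d)  # frozen random first layer (assumption)
    X_train = rng.standard_normal((n_train, d))
    X_test = rng.standard_normal((n_test, d))
    y_train = np.einsum("ni,ij,nj->n", X_train, B, X_train)
    y_test = np.einsum("ni,ij,nj->n", X_test, B, X_test)
    rf_err = relative_test_error(
        rf_features(X_train, W), y_train, rf_features(X_test, W), y_test
    )
    nt_err = relative_test_error(
        nt_features(X_train, W), y_train, nt_features(X_test, W), y_test
    )
    return rf_err, nt_err


if __name__ == "__main__":
    rf_err, nt_err = compare(d=10, N=10)
    print(f"RF relative error: {rf_err:.3f} (N trainable parameters)")
    print(f"NT relative error: {nt_err:.2e} (N * d trainable parameters)")
```

With N = d, the NT features span every quadratic monomial, so the fit is exact up to numerical precision, whereas the N RF features cover only part of the d(d+1)/2-dimensional space of quadratic forms. The sketch also makes the parameter-count caveat concrete: RF fits N coefficients while NT fits N * d.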

In this article, the authors analyze the performance of a single-hidden-layer neural network model under the random feature (RF) regime, the neural tangent (NT) regime, and the fully trained neural network (NN) regime. By considering the tasks of 1) learning a quadratic function of d-dimensional Gaussian data, and 2) classifying a two-class d-dimensional Gaussian mixture, the authors show that, in the high-dimensional regime where the number of neurons N and the data dimension d are both large and comparable, one has NN > NT > RF in terms of prediction performance, in the infinite-data limit. In this vein, this article improves and generalizes the analyses in [25] by covering the neural tangent model, which is a more involved model of greater practical interest. This article provides solid analyses of an interesting problem and is already quite polished. I strongly recommend it for publication. Below are a few minor comments that the authors might consider addressing before publication. **After rebuttal**: This is solid work on an interesting topic; I vote for acceptance.