NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2668 The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies

### Reviewer 1

Originality: The analysis is original to the best of my knowledge. Quality: Seems good, except a few issues that require clarifications: 1) In section 4.1 it is claimed that the eigenvalues for sine functions are zero, but in Figure 3 we see that the leading (non-zero) eigenvalue belongs to a sine funcion. How is this possible? Also, does figure 5 include sine functions? 2) In figure 5, what is the reason for the mismatch at the low frequncies? Clarity: Mostly clear except 1) What was the stopping critertion (\delta) in Figures 5 an 6? 2) In figure 5 and 6 the phrasing is a bit a obscure - it took me a while to understand what is going on with the odd frequencies (that they do not appear in the graph since they didn't reach the stopping critertion after many iterations). It is was not very clear that dots on top of left panel correspond to these non-converging odd runs. I would suggest adding a few example runs in the appendix. 3) I feel the authors should add the basic explanation why bias it is necessary for the network to approximate an odd function. It is rather simple, as I explain next. 2-layer ReLU without bias are positively homogeneous functions of x. Therefore, they cannot approximate odd functions in x. Specifically, in S^1, we must require cos(k(\theta+pi)=cos(k(\theta), so k must be odd. Significance: A nice paper with some interesting observations. Assuming my comments will be answered satisfactorily, I recommend acceptance. Minor issues: - line 149: shouldn't cos(\theta) have some normalizing scale factor? - line 198: "makes sense" not very clear. Please use more precise terms. - The equation below 198, S should be in \mathbb{}. - line 213 x^{\ell} should not be in bold. - eq. 16 "w'x" in indicator function should be "t". %%% After Author Feedback %%% The authors have answered my comments, and I'm voting for acceptance. I think these convergence results give a nice and informative characterization of the convergence rates for different frequencies.

### Reviewer 2

What functions do NNs learn (approximate a function) and how fast are central questions in the study of the dynamics of (D)NNs. A common conception behind this problem is that if one trains a network longer than necessary, then the model might overfit. However, the definition of overfitting appears to vary from paper to paper. Moreover, overfitting is intimately linked with another hot topic in the area: over-parametrization. Please refer to "Advani & Saxe 2017 High Dimensional Dynamics of Gen Error for NNs" for a modern take on this link. Keeping in mind this link, we focus on fixed-size networks. If one monitors the validation performance during training, it is well established that early stopping yields better performance. Recently, Nahaman et. al. On the Spectral bias of NNs shows that NNs learn (in time) functions with increasing complexities. The present paper, "The Convergence Rate of NNs for Learned Functions of Different Properties" builds on this idea of learning functions with increasingly high frequencies. It is a well-written paper that demonstrates the core decomposition. Moreover, the decomposition yields convergence times for each component. As a major plus, the proposed rates seem to match well with the empirical observations. Even though the paper is an important step forward, it has a few shortcomings: - The mathematical description of the learning problem is limited to 1 hidden layer networks and mean square loss. It may be hard to extend proofs further. But the empirical evidence is not enough to justify its real practical value. - Authors didn't provide the code. It didn't affect the score I assigned. But it would have been a great chance to see how the curves change with different ways of scaling the meta-parameters and change the loss. - Derivations depend on (it heavily relies on [3] in the references of the paper) a particular scaling that puts the dynamics into a linear regime. There is no discussion of over-parametrization and active and lazy learning regimes. For a review and further references, please refer to Chizat & Bach 2018 On Lazy Training in Differentiable Programming. I think with the inclusions of the above discussions, code, and further experiments, it would be a very nice paper. Overall, the argument is sound and the text is clear (except for a few typos and sentence constructions). For a further point: Convergence times scale pretty bad with the dimension of the input data. But then it may depend on the intrinsic dimension of the data for real-world problems. I think it is another exciting direction that is not directly within the scope of the present work. But it would be a useful direction of research for future studies. ======= update after author response ======== Thanks very much for considering suggested changes. I hope to be able to see the code available as well!

### Reviewer 3

The paper presents a theoretical analysis of the convergence of overparameterized shallow neural networks trained on regression problems. The main claim is that low frequency components of the regression landscape are learned faster than high frequency components and that this phenomenon, when combined with early stopping, leads to higher generalization performance. Major concerns: The analysis presented in the paper is definitly insightful but it is not original. The bulk of the results presented in the paper were obtained in (Sanjeev, 2019) and (Xie2017). These results include the derivation of the Gram matrix and the connection with Fourier analysis on n-dimensional hyper-spheres (Xie2016) and the asymptitic convergence results (Sanjeev, 2019) . As far as I understand, the only contribution of the present paper is in the inclusion of a bias term in the theoretical analysis which leads to a minor modification of the Gram matrix. The absence of the bias term apparently leads to the impossible to learn odd frequency components. introducing the bias term leads to a behavior for the odd frequency components that is asymptotically equivalent to the previously known behavior of the even frequency components. Consequently, the fix does not provide new theoretical insights but it simply confirms what it was previously known. Most of the conclusions drawn in the paper follows from the previous work and are therefore not original. Conclusion: The original contributions of the paper are very limited. Consequently, I believe that the paper should be rejected. References: Xie, Bo, Yingyu Liang, and Le Song. "Diverse neural network learns true target functions." arXiv preprint arXiv:1611.03131 (2016). Arora, Sanjeev, et al. "Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks." arXiv preprint arXiv:1901.08584 (2019).