Paper ID: | 5782 |
---|---|

Title: | Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks |

This paper studies the task of training over-parameterized deep ReLU networks via SGD, and proves a generalization bound which is more general or sharper than several recent ones. The generalization bound has the additional nice property which can distinguish between learning from noisy labels and true ones. The analysis is intuitive and seems to take a slightly different route from most previous works I am aware of, which may benefit the future work on this topic. Therefore, I feel that the paper has enough originality and significance. The results appear technically sound to me, as the proofs all look reasonable, although I did not check them very carefully. The paper is well written and easy to follow in general. * After author response: I have read the response, and I am satisfied with it. I still keep my score as “Marginally above the acceptance threshold” and vote for accepting the paper.

Originality: To the best of my knowledge, the results are novel and provide important extensions/improvements over the previous art. Quality: I did a high level check of the proofs and it seems sound to me. Clarity: the paper is a joy to read. The problem definition, assumptions, the algorithm, and statement of results are very well presented. Significance: the results provide several extensions and improvements over the previous work, including training deeper models, training all layers, training with SGD (rather than GD), and smaller required overparameterization. More importantly, the proofs are simpler compared to the related work. My only concern are the followings. 1) the width requirement is still very stringent, and not realistic; and 2) It is not clear how to compare such a bound with other results. In particular, it would have been nice to see how small the NTRK error (the first term in the right hand side of Thm 3.3.) can get in practice for real models. ******************************************************************** *******************after author feedback********************** ******************************************************************** I have read the authors feedback and am happy to increase my rating from 6 to 8. I have also increased my confidence score from 3 to 4 after authors clarified some part of the proof. Overall, I think this paper is well qualified to be published in NeurIPS.

The authors describe the proof for expected 0-1 error of deep ReLU networks trained with SGD using random initialization (from He. et. al.), that results in algorithm dependent generalization error bound that is independent of network width if the data can be classified by the purposed NTRF features with small enough error. The result is generalizable (two layer and larger function class) and when reduced to two layer setting is sharper. A more general and tighter bound is also derived, similar to Arora [3] to show similarity to neural tangent kernel from Jacot [17]. The purposed NTRF is shown to be richer (contains gradient from first layer) and generalization bound sharper ( compared to 31 and 13). He initialization enables removing exponential dependence on depth in kernel matrix, where the kernel matrix takes all layers into consideration (as opposed to last hidden layer in [10]). The results are also compared and shown as generalization of [3] with sharper bounds.