Paper ID: 1217

Title: An Improved Analysis of Training Over-parameterized Deep Neural Networks

While this paper makes a nice contribution to an important problem, I am not sure it is significant enough for the conference. The overall outline of the analysis follows closely that of [2], and the main new component is the improved gradient lower bound, which builds largely on previous bounds in [2] and [16]. Although the improved analysis provides new insight and I find it useful, I do not feel it will have a big impact. The other technical contribution, the improved trajectory length, is also nice, but again I feel it is somewhat incremental. The results seem technically sound; the proofs all look reasonable, although I did not verify them thoroughly. The paper is well written and easy to follow in general, and the problem is an important one in machine learning.

After author response: I am satisfied with the response, but it does not change my opinion. Although the idea in the improved analysis is nice, I still feel that it is hard to push the idea much further, and very different ideas may be needed to improve the over-parameterization requirement significantly.

Originality: The main technicalities, including the assumptions, the initialization, and the main proof technique, are very similar to the previous art. However, the improved upper bound in Lemma 4.2 is insightful and novel, to the best of my knowledge.

Quality: I only checked the proofs at a high level, but the submission seems sound to me, and the comparisons with the previous art seem legitimate.

Clarity: The submission is very well written and well organized. The proof sketch section is especially well done.

Significance: The results are important because they improve the previous results on the over-parameterization requirements; however, the requirements are still far from being realistic.

After feedback: I have read and considered the authors' feedback in my final evaluation. I vote for accepting this paper.

The work describes the proof for establishing milder over-parameterization and better iteration complexity using GD or SGD, under the assumption of Gaussian random initialization for each layer, similar to [2], and the results greatly improve both. The assumptions and setup are similar to [2] (as referenced), with the main difference being the exploitation of the gradient regions for all the data points, which allows for a larger region, tighter bounds, and a linear trajectory length. The complexity increase from this additional computation is stated to be O(n) for n data points. This needs to be factored into the complexity calculations, and it is not apparent whether this cost is properly accounted for under 4.1. Further, the remark under 3.11 could be better explained for the case of two-layer ReLU networks (rather than referring to the appendix). It would also have been great to include practical results for an implementation on, e.g., standard data sets to support the theory. Overall, this paper is a positive contribution to advancing the state of the art, with the above caveats.
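To make the setup the review refers to concrete, here is a minimal sketch of the regime being analyzed: a two-layer ReLU network with Gaussian random initialization of the trainable layer, trained by plain GD on a squared loss. All sizes, the step size, and the fixed output layer are illustrative assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (hypothetical, not taken from the paper):
n, d, m = 20, 5, 1000        # n data points, input dimension d, hidden width m

# Synthetic data: unit-norm inputs, arbitrary real labels.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)

# Gaussian random initialization of the hidden layer, w_k ~ N(0, 2*I),
# with a fixed +-1/sqrt(m) output layer (a common simplification).
W = rng.standard_normal((m, d)) * np.sqrt(2.0)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def loss(W):
    out = np.maximum(X @ W.T, 0.0) @ a   # two-layer ReLU network outputs
    r = out - y
    return 0.5 * np.dot(r, r)

eta = 0.05                               # step size (illustrative)
losses = [loss(W)]
for _ in range(200):
    Z = X @ W.T                          # pre-activations, shape (n, m)
    r = np.maximum(Z, 0.0) @ a - y       # residuals
    mask = (Z > 0).astype(float)         # active ReLU pattern
    # Gradient of the squared loss w.r.t. the hidden-layer weights:
    # G[k] = sum_i r_i * a_k * 1[z_ik > 0] * x_i
    G = ((mask * a) * r[:, None]).T @ X  # shape (m, d)
    W -= eta * G
    losses.append(loss(W))

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

With sufficient width m, gradient descent steadily reduces the training loss from random initialization; the analyses under discussion quantify how large m must be for convergence to be provable, which is exactly the over-parameterization requirement the reviews debate.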