Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
It appears to me that this paper is closely related to the paper by Zhang et al. The setting differs slightly: a slightly different type of residual network and a different loss function are considered there. Due to these differences, there are new issues that the current paper needs to deal with. However, the analysis basically follows the same outline, and the high-level ideas are very similar. Therefore, the optimization result of this paper seems somewhat incremental relative to that prior work. The generalization result seems new to me, but it is not surprising, as it is based on a standard approach via Rademacher complexity. The results seem technically sound, as the proofs all look reasonable, although I did not check all of them carefully. The paper is generally well written, and the problem is important. * After author response: I have read the response, but it does not change my evaluation. I still feel that the contribution of this paper, given the results of Zhang et al., is not substantial enough: changing the loss does not seem to make a huge difference, and deriving the generalization result from the optimization result does not seem to require very novel ideas.
Originality: The assumptions and the overall proof ideas are similar to a previous work, as mentioned several times by the authors. However, extending the analysis to residual networks is non-trivial and novel, as far as I know. Quality: The claims and the theory seem sound; however, I have not checked all proofs in detail. Clarity: I enjoyed reading the paper for the most part. The overall presentation is clear; in particular, the related work section is well done. However, lines 13-14 of the abstract require more clarification. Is this statement precise? Is there a result showing that the super-logarithmic dependence on depth is unavoidable in deep fully connected neural networks? Significance: the results are interesting, in particular because they show a non-trivial improvement in the generalization error when skip connections are used. However, it is not clear whether this improvement stems from a sub-optimal analysis of deep feed-forward networks. ######## I have read the author feedback and considered it in my final evaluation. Overall, I think this paper is qualified to be published in NeurIPS.
The paper presents a proof characterizing the optimization and generalization properties of fully connected deep ReLU residual networks trained with gradient descent from Gaussian initialization. Building on prior work, the authors demonstrate that over-parameterization forces gradient descent-trained networks to stay in a small neighborhood of initialization, where the learned networks are guaranteed to achieve small surrogate training error and come from a sufficiently small hypothesis class to guarantee a small gap between training and test errors. This allows test error guarantees to be derived. The proofs show that, with sufficient over-parameterization, gradient descent is guaranteed to find networks with arbitrarily high classification accuracy. In comparison with the results of Cao and Gu, the width m, number of samples n, step size η, and number of iterates K required for the guarantees for residual networks given in Theorem 3.5 and Corollary 3.7 all have (at most) logarithmic dependence on the depth L. Additionally, the step size and number of iterations required for the guarantees are independent of the depth. Finally, the presence of skip connections in the network architecture removes the depth-related complications that traditionally arise in the analysis of non-residual architectures, and provides an explanation for why residual networks can be trained easily at all depths, unlike non-residual ones.
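To make the architecture under review concrete, the following is a minimal sketch of a fully connected ReLU residual network with Gaussian initialization, where each block computes h ← h + ReLU(W h). The initialization scales here (including the 1/L factor on the residual blocks) are illustrative assumptions of mine, not necessarily the choices made in the paper; the sketch only demonstrates that the forward signal stays well controlled even at large depth.

```python
import numpy as np

def init_params(d, m, L, rng):
    """Gaussian initialization. The variance scales below are illustrative
    assumptions, not necessarily those used in the paper."""
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))     # input layer
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m) / L, size=(m, m))
          for _ in range(L)]                               # residual blocks, scaled by 1/L
    b = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m,))       # output layer
    return A, Ws, b

def forward(params, x):
    A, Ws, b = params
    h = np.maximum(A @ x, 0.0)          # input layer with ReLU
    for W in Ws:
        h = h + np.maximum(W @ h, 0.0)  # skip connection: h <- h + ReLU(W h)
    return float(b @ h)

rng = np.random.default_rng(0)
x = np.ones(4) / 2.0                    # unit-norm toy input
out_shallow = forward(init_params(d=4, m=64, L=10, rng=rng), x)
out_deep = forward(init_params(d=4, m=64, L=1000, rng=rng), x)
# With the 1/L scaling, each block perturbs h only slightly, so the output
# remains of moderate size even at depth 1000.
```

Under this (assumed) scaling, each residual block changes h by roughly a factor of O(1/L), so the hidden representation, and hence the output, stays bounded independently of depth, which is consistent with the review's point that skip connections remove the depth-related complications of non-residual architectures.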