Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper tries to answer the question: can the results from [Shamir, 2018] be extended to multi-block ResNets? The authors formally show that, under the conditions in Line 206 and Line 207, any local minimum is either "good" (meaning it is better than linear predictors), or the minimum eigenvalue of its Hessian is strictly negative. These results are new as far as I know. My main concern is the usefulness of the result. In particular, the assumption in Line 207 seems strong compared to the results in [Shamir, 2018]. Also, the result in Section 3.2 seems like a trivial application of [Shamir, 2018]. Overall the paper is easy to follow. I feel there is room for improvement in some sections. Some of them are listed below.
The topic is timely and the presentation is crisp and clear. The results shed some light on ResNets, and this is a good technical advance.
The paper theoretically investigates deep residual networks, extending a result by Shamir (citation in the paper). First, two motivating examples are given, showing that a linear predictor might outperform all the local minima of a fully connected network, while at the same time a ResNet has strictly better minima. The second example shows that gradually adding residual blocks won't necessarily improve local minima monotonically. Then, the authors show that the critical points of deep ResNets either have a better objective value than a linear predictor, or have a Hessian with a strictly negative eigenvalue (hence they are not local minima). This is done under a 'representation coverage condition' and a 'parameter coverage condition', where the latter seems rather limiting. Lastly, an analysis of deep ResNets is given in the 'near-identity' regions, where the residual block output is small. An upper bound on the objective of such critical points is given, as well as a bound on the Rademacher complexity of the ResNet. The near-identity assumption seems to limit these results, although it is arguably a reasonable assumption given known theory about this setting. My main concern regarding this paper is that the proofs for deep ResNets, which are mainly based on a recursive analysis of each residual block, do not seem to provide deeper insights into the nature of ResNets, as far as I can tell. Therefore, the techniques used might not lead to future improvements that build on them. Moreover, the second assumption made in Theorem 2 seems rather limiting: to the best of my knowledge, there is no reason why it should hold for a 'general' local minimum when the condition in Corollary 3 isn't met, and that condition itself seems like a rather strong assumption on the architecture.
The examples provided make the paper easier to read and understand, and provide motivation for what follows, but are nevertheless simple, tailored examples which I do not view as standalone results (nor is that their purpose in the paper). Additionally, Section 5 relies strongly on the near-identity assumption, which also limits the results in it. While I found these results interesting, they seem, in my humble opinion, like incremental improvements. Lastly, in line 119 it is claimed that augmenting one dimension per layer extends the results to networks with bias terms; however, I'm rather convinced that this trick only works for the first layer. Setting the augmented dimension in the data to all ones simulates bias terms for the first layer, but this does not extend further. Did I miss anything? Can you please elaborate on this trick? Overall, although the paper is clear and easy to follow, and as much as it is exciting to see theoretical works analyzing deep ResNets, this alone does not merit acceptance, as the collection of results presented in this paper seems incremental for the most part.
Minor comments:
Line 78: examples shows -> example shows
Line 207: I'm not sure whether or where col(...) was defined. If I didn't miss it, please add a clarification for this definition.
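To make the question about the bias trick in line 119 concrete, here is a minimal NumPy sketch. It assumes a plain two-layer fully connected ReLU network rather than the paper's ResNet, and all names and sizes are hypothetical. It checks that appending a constant 1 to the input reproduces the first-layer bias, and that the second-layer bias is only reproduced if a constant coordinate is introduced again at the hidden layer. Whether the paper's construction carries such a constant through every layer is not something this sketch can settle.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(z, 0.0)


# Hypothetical sizes: 5 samples, 3 inputs, 4 hidden units, 2 outputs.
X = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Reference network with explicit bias terms in both layers.
H = relu(X @ W1.T + b1)
Y = H @ W2.T + b2

# First layer: append a constant 1 to the data and fold b1 into the weights.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
W1_aug = np.hstack([W1, b1[:, None]])
first_layer_matches = np.allclose(H, relu(X_aug @ W1_aug.T))

# Second layer: the hidden activations H have no constant coordinate,
# so a bias-free map applied to H alone does not reproduce b2 here.
second_layer_without_constant = np.allclose(Y, H @ W2.T)

# Re-appending a constant 1 at the hidden layer restores the bias term.
H_aug = np.hstack([H, np.ones((H.shape[0], 1))])
W2_aug = np.hstack([W2, b2[:, None]])
second_layer_with_constant = np.allclose(Y, H_aug @ W2_aug.T)
```

In this sketch the augmentation has to be repeated at each layer. Augmenting the data alone covers only the first layer, which is the concern raised above.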