| Paper ID | 4313 |
|---|---|
| Title | Towards Understanding the Importance of Shortcut Connections in Residual Networks |

Originality: The methods described in the paper have been studied vigorously over the past few years. The main contribution of the paper is stated directly in the form of Theorem 2. The paper highlights related work that recognizes the problem statement, i.e., "GD gets trapped in spurious local minima in specific networks with shortcut connections", but is very brief in describing the line of work that aims to answer the question.

Clarity: The methods and results described in the paper are extremely well written and easy to follow. However, the organization of the sections can be greatly improved. The related work is scattered across various sections, and only one small paragraph is dedicated entirely to it. This style of writing may make it hard for first-time readers to draw a connection between related work and the methods presented in the paper. Additionally, it may be better to split the last section into two sections: future work and discussion.

Significance: The main result of the paper is only applicable to small networks with an average pooling shortcut connection, a very specific weight initialization, a specific step size, etc. These results may not extend directly to other architectures based on shortcut connections. Thus, it may be helpful to have an in-depth discussion on extending the presented work to other cases, e.g., would SGD show similar two-stage convergence behavior? Will the convergence behavior extend to deeper networks? How should one perform initialization and choose the step size for deep networks?

The paper investigates the outcome of training a one-hidden-layer convolutional residual network using gradient descent when the input is sampled from a standard Gaussian distribution. As a follow-up to a similar analysis by Du et al. (2017) for CNNs, this paper shows that for ResNets there exist two fixed points of the teacher-student loss function (the network architecture is the same for both teacher and student). One is a global minimum; the other is a spurious fixed point. The authors then derive *sufficient* conditions on the parameter initialization and learning rates such that training happens in two phases:

1. A first phase in which the hidden layer weights (w) remain away from the spurious fixed point (due to a sufficiently small learning rate) while the last layer weights (a) approach the optimal value and eventually enter the region where the inner product satisfies a'a* > 0.
2. A second phase in which both parameters approach the global minimum, and the learning rate for w can be larger, allowing faster convergence.

I find this paper very interesting, as it provides novel insights into the optimization process of ResNets, even though in a very restricted setting. The authors also point out that the success probability of reaching the global minimum using their prescribed optimization hyper-parameters is 1 for ResNets while, as shown in previous literature, this probability is only a constant less than 1 for CNNs.

However, I have the following concerns:

1. I do not see the condition w_0=0 being used in the proof of Lemma 5. It seems unnecessary.
2. The conditions on initialization and learning rate derived in the paper are sufficient conditions for optimization to reach the global minimum. The conditions under which optimization may reach the spurious fixed point have not been discussed. This is why, in Table 1, the probability of optimization landing in the global minimum when not using step size warmup is between 0 and 1, and not 0. Some discussion around this would be useful.
3. From the proof strategy, it seems that another learning rate schedule would also allow optimization with the proposed initialization to reach the global minimum: set the learning rate of w to 0 in the first phase while training a with the proposed learning rate until it converges to a*, then train w while fixing the parameters a in the second phase until it converges. In this way, the network can be trained in a layerwise manner. I encourage the authors to add a discussion around this.
4. The objective has a unique global minimum and a unique spurious minimum due to the nature of the architecture and input distribution used in the analysis. What insights can we get from this analysis about the case where the fixed points are not unique (e.g., due to a different input distribution or architecture)?
5. In Table 1, it can be seen that the success rate increases with increasing input dimensionality. While this is acknowledged in line 254, there is no discussion of why this happens. Can the authors provide an explanation?
6. In lines 304-311, the authors try to establish a relation between the learning rate warmup strategy used in Goyal et al. (2017) and their own analysis of the learning rate schedule, which suggests using a small learning rate initially to avoid the spurious fixed point. However, in the realistic case (e.g., Goyal et al. 2017 and many other papers that follow), using a large learning rate initially does not prevent the training loss from decreasing. It is the generalization that becomes worse. This is a separate issue from the one the authors of this paper are addressing, so I would recommend that the authors revise their claim.
7. The authors again claim to establish a relationship between their analysis of using w=0 as initialization and the proposal in Fixup (Zhang et al. 2019). Fixup proposes an initialization for ResNets without any normalization, while the analysis in this paper is for a ResNet with normalization. So it is misleading to compare the two.

Minor corrections:

1. Line 180: The dissipative region is for the gradient of the loss w.r.t. a, not w.
2. Line 230: It should be stage II, not I.
3. Line 263: It is the first row, not the first column.

##### Post Rebuttal:

Thank you for the detailed explanations. Looking forward to the revised version with the recommended changes.
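For concreteness, the layerwise schedule suggested in concern 3 of this review can be sketched as follows. This is only an illustration under assumptions that are not taken from the paper: the model is assumed to be f(Z) = a · relu(Z u) with u = (v + w)/||v + w||, the shortcut v is a fixed unit vector, and the sample size, dimensions, step sizes, and step counts are all hypothetical. It uses full-batch gradient descent on a fixed Gaussian sample rather than the population loss.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def loss_and_grads(Z, y, v, w, a):
    """Squared loss and gradients for the ASSUMED model
    f(Z) = a . relu(Z u), with u = (v + w) / ||v + w||."""
    d = v + w
    norm = np.linalg.norm(d)
    u = d / norm
    pre = Z @ u                              # (n, k) patch pre-activations
    hidden = relu(pre)
    err = hidden @ a - y                     # (n,) residuals
    n = y.shape[0]
    loss = 0.5 * np.mean(err ** 2)
    grad_a = hidden.T @ err / n              # (k,)
    gate = (pre > 0) * a                     # (n, k) ReLU gates scaled by a
    g_u = np.einsum("n,nk,nkp->p", err, gate, Z) / n
    grad_w = (g_u - u * (u @ g_u)) / norm    # chain rule through the normalization
    return loss, grad_w, grad_a

def run_gd(Z, y, v, w, a, lr_w, lr_a, steps):
    """Full-batch gradient descent with separate step sizes for w and a."""
    for _ in range(steps):
        _, grad_w, grad_a = loss_and_grads(Z, y, v, w, a)
        w = w - lr_w * grad_w
        a = a - lr_a * grad_a
    return w, a

# Hypothetical problem sizes: n samples, k non-overlapping patches of dimension p.
rng = np.random.default_rng(0)
n, k, p = 2000, 4, 8
Z = rng.standard_normal((n, k, p))           # standard Gaussian input patches
v = np.ones(p) / np.sqrt(p)                  # assumed unit-norm shortcut direction

# Teacher with the same architecture (parameters chosen arbitrarily here).
w_star = 0.3 * rng.standard_normal(p)
a_star = rng.standard_normal(k)
u_star = (v + w_star) / np.linalg.norm(v + w_star)
y = relu(Z @ u_star) @ a_star

# Student: w starts at zero, a starts at a small random value.
w0 = np.zeros(p)
a0 = 0.1 * rng.standard_normal(k)
loss0 = loss_and_grads(Z, y, v, w0, a0)[0]

# Phase 1: freeze w (learning rate 0) and train only the last layer a.
w1, a1 = run_gd(Z, y, v, w0, a0, lr_w=0.0, lr_a=0.1, steps=300)
loss1 = loss_and_grads(Z, y, v, w1, a1)[0]

# Phase 2: freeze a and train only the hidden-layer weights w.
w2, a2 = run_gd(Z, y, v, w1, a1, lr_w=0.1, lr_a=0.0, steps=300)
loss2 = loss_and_grads(Z, y, v, w2, a2)[0]

print("loss at init / after phase 1 / after phase 2:", loss0, loss1, loss2)
```

With w frozen, phase 1 is a least-squares problem in a, so it cannot reach the teacher's a* exactly unless w already matches the teacher. Whether phase 2 then reaches the global minimum is precisely the question the requested discussion would need to address.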

Originality: The paper gives a new insight into why residual networks work in practice. It follows several prior works in its theoretical analysis, but the understanding is new.

Clarity: The paper is clearly written and well organized.

Significance: In terms of understanding, the paper is somewhat valuable. But why is this understanding based only on a two-layer non-overlapping convolutional neural network? The analysis assumes ||v||=1 and uses ReLU as the nonlinear activation function; what happens if these assumptions are dropped? I doubt that the understanding generalizes. For the convergence analysis, the paper provides some bounds on the convergence, but no comparisons are provided, so it is hard to judge the tightness of the results.