Paper ID: 5830

Title: How to Initialize your Network? Robust Initialization for WeightNorm & ResNets

# Strengths

- This submission addresses a problem of general importance: the optimization of weight-normalized networks. Coming up with the right initialization strategy can be critical in that case, and the theoretical analysis suggests an initialization strategy that avoids vanishing/exploding gradients in very deep networks.
- The draft is clear, reads nicely, and the research is well motivated. Improving the generalization capability of weight-normalized networks is an important research direction.
- The proposed experiments seem to show the effectiveness of the initialization scheme.
- The experiments reported in Table 1 and Table 2 are rather convincing.

# Weaknesses / questions

- First of all, why is the assumption of an asymptotic setting with infinite width needed? Where in the proof is it used? In the parametrization proposed here, it means we work with n -> infinity. Yet in the experiments, the networks have rather small width compared to their depth. Following up on that, could this assumption be validated empirically? Is the width assumption important?
- What is the exact setup of the experiment presented in Fig. 2 (right)? The text of the paper fails to describe the experimental setup precisely. What is the depth of the network considered there?
- I find the experiments with ResNets contestable. First, the range of depths considered is very surprising: a ResNet with 10,000 blocks per stage is rather uncommon. Second, this experiment is run for "one epoch of training" (l. 215), which in my opinion shows nothing more than "PyTorch default init fails completely with such deep WN ResNets".
- The formulation on l. 243 is rather surprising: "Unlike previous works which use the test set for hyperparameter tuning". I agree that proper hyperparameter tuning is essential in ML, but the wording is a bit harsh in my opinion. Moreover, the "smaller training set" (l. 245) could be avoided by retraining the network with the optimal hyperparameters on the complete train+val set.
- In Table 1, I would expect a baseline WN model to be presented for architectures other than ResNet-110.

Overall, this is a good paper that addresses an interesting problem and is executed properly. I would like the authors to respond to my negative comments/questions, and I await the discussion period to make my final decision.

# Rebuttal

After reading the other reviews and the authors' response, I stick to my accept rating of 7. The authors' response answered most of my concerns and doubts.
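To make the failure mode above concrete ("PyTorch default init fails completely with such deep WN ResNets"), here is a minimal numpy sketch of signal propagation in a deep weight-normalized ReLU MLP. It is my own illustration under simplifying assumptions (fully connected layers, unit-norm weight rows, a single scale parameter `gain` standing in for g at init), not the paper's exact scheme: with g = 1 each ReLU layer halves the expected squared norm of the signal, so it collapses exponentially with depth, while rescaling g by sqrt(2) keeps it stable.

```python
import numpy as np

def wn_forward(x, depth, gain, rng):
    """Signal propagation through a deep weight-normalized ReLU MLP.

    Each layer computes h <- relu(gain * (V_hat @ h)), where V_hat has
    unit-norm rows (the weight-norm direction vectors) and `gain` plays
    the role of the scale parameter g at initialization.
    """
    n = x.shape[0]
    h = x
    for _ in range(depth):
        V = rng.standard_normal((n, n))
        V_hat = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-norm rows
        h = np.maximum(gain * (V_hat @ h), 0.0)               # ReLU
    return h

rng = np.random.default_rng(0)
x = rng.standard_normal(512)
# With gain = 1, each ReLU layer halves E||h||^2, so the signal collapses;
# gain = sqrt(2) compensates and keeps the norm roughly constant.
shrink = np.linalg.norm(wn_forward(x, 20, 1.0, rng)) / np.linalg.norm(x)
keep = np.linalg.norm(wn_forward(x, 20, np.sqrt(2.0), rng)) / np.linalg.norm(x)
print(f"g=1: {shrink:.2e}   g=sqrt(2): {keep:.2e}")
```

Over 20 layers, the g = 1 run shrinks the norm by roughly 2^{-10}, which is consistent with default initialization failing at the much larger depths used in the paper's ResNet experiments.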

This paper proposes a new initialization strategy for weight-normalized neural networks, in which the weight matrix of each layer is normalized. The key ingredient is to appropriately rescale the weights by a factor depending on the width. A theoretical analysis shows that the proposed initialization prevents vanishing/exploding gradients in expectation. Extensive experiments show that the proposed initialization does help training and outperforms standard initialization strategies. Overall, I am very positive about the paper, although I have several minor concerns; my comments follow.

1. It is mentioned several times in the paper that the initialization scheme is developed in the regime where the network width tends to infinity. I am a bit confused by this, since none of the analysis explicitly uses the infinite-width condition. Maybe I am missing something; please clarify.
2. There is a closed-form formula for the surface area of the unit sphere (see, e.g., Wikipedia); it would be better to prove that K_n = 1 rather than check it empirically.
3. In the analysis of the backward pass, the gradients are evaluated with respect to the intermediate pre-activations a^l. However, a^l is not a parameter that we are optimizing; what is the motivation for taking derivatives with respect to a^l instead of W^l?
4. In the synthetic example in Figure 1, is the width of the fully connected network constant? In particular, the rescaling parameters in Thm. 1 and Thm. 2 match each other when the width is constant, so this is an extreme case. It would be good to run experiments with varying width and see whether there is a difference.
5. If n_l > n_{l-1}, there are more rows than columns; how do we set the rows of the weight matrix to be orthogonal?
6. Is it possible to combine weight normalization with batch normalization? As far as I can see, batch normalization does not affect the proposed strategy; it might be worth trying some experiments heuristically.

=== Edit after rebuttal ===

I thank the authors for the clarification. I believe the initialization strategy introduced will be helpful for training deep networks in practice.
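Regarding point 5: when n_l > n_{l-1}, the rank is capped at n_{l-1}, so at most n_{l-1} rows can be mutually orthogonal. A common fallback (used, e.g., by `torch.nn.init.orthogonal_`) is a semi-orthogonal matrix: orthonormal rows in the wide case, orthonormal columns in the tall case. A minimal numpy sketch (the function name `semi_orthogonal` is mine, for illustration):

```python
import numpy as np

def semi_orthogonal(n_out, n_in, rng):
    """Semi-orthogonal init: orthonormal rows when n_out <= n_in,
    orthonormal columns otherwise (with n_out > n_in the rank is
    capped at n_in, so not all rows can be mutually orthogonal)."""
    A = rng.standard_normal((n_out, n_in))
    if n_out <= n_in:
        Q, _ = np.linalg.qr(A.T)  # Q: (n_in, n_out), orthonormal columns
        return Q.T                # -> orthonormal rows
    Q, _ = np.linalg.qr(A)        # Q: (n_out, n_in), orthonormal columns
    return Q

rng = np.random.default_rng(0)
W = semi_orthogonal(3, 5, rng)                 # wide: rows orthonormal
print(np.allclose(W @ W.T, np.eye(3)))         # True
W2 = semi_orthogonal(5, 3, rng)                # tall: only columns orthonormal
print(np.allclose(W2.T @ W2, np.eye(3)))       # True
```

In the tall case W2 @ W2.T is a rank-3 projection rather than the 5x5 identity, which is exactly the ambiguity the question points at.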

Summary
-------
The paper proposes several initializations for weight-normalized deep neural networks. The initializations are justified by simple theoretical considerations and then tested on numerous benchmarks, from which several conclusions are drawn.

Originality and Significance
----------------------------
While the justification of the initializations is simple and the approach is not new, the experiments provide solid material for further exploration of weight-normalized deep neural networks. In particular, the proposed scheme seems to reduce the performance gap between weight-normalized and batch-normalized networks. This paves the way to better optimization of other structures, such as the reinforcement learning applications provided in the appendix.

Quality and Clarity
-------------------
The paper is well written and organized; in particular, the experiments are well presented.
- It is ultimately not clear which initialization is chosen for the residual networks: forward or backward?
- The authors could better emphasize that the plots separate the initialization of the scaling from the initialization of the weights.
- Numerous plots are superimposed, which makes them difficult to read.

Conclusion
----------
Overall, the paper proposes a dense experimental study of initialization techniques for weight normalization. This provides interesting material for future research.

After rebuttal
--------------
The authors answered my concern about the warm-up supplementary boosts and clearly showed the benefits of the approach. Overall, this paper provides very neat experimental material; the code is well written (I quickly skimmed through it) and could easily be used for future research. The paper itself is well presented and usable by the community. Overall, I think the paper deserves publication. I hope it will also open discussions relating weight normalization to the kernel perspective in the future.