Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The authors propose a new normalization scheme called RMSNorm. They provide an interesting (although somewhat straightforward) theoretical analysis, and empirically show the benefits of the approach. Detailed comments follow:

- The paper is quite nicely written, and it is in general easy to follow the authors' line of thought.
- Figure 1 doesn't seem to indicate that there is a large problem with LayerNorm, as the difference is quite small. Maybe choose a different example that better illustrates this issue.
- In the related work, the authors mention "internal covariate shift" as a fact, while there is a lot of recent work that questions this explanation. Some of these the authors do mention (such as ), but some are not (e.g., Bjorck et al., "Understanding batch normalization", NeurIPS 2018). It would be good to add a discussion on ICS that reflects these recent works, rather than treating ICS as a fact.
- In (2), should the second part use vector notation as done in (1)? Or should it be vice versa, with (1) using scalar notation instead?
- Please never use "obviously" in any of your papers (or "clearly", or any other similar word). What is obvious to you may not be obvious to others. Rephrase this sentence, or you could even add a short equation to explain it (although that is not really necessary).
- In Fig. 3, is the x-axis really in "%"? It doesn't seem to be; is the max value considered really just "1%"?
- What is the difference between the various versions of OE and OE+LayerNorm in Table 7? Several rows have the same caption. This should be explained and elaborated; it is quite confusing at the moment.
- Typos: "cloze", "shorting".

** Comments after the author response **

Thank you for a very detailed response! I am staying with my `accept` recommendation.
This is mostly an engineering/empirical paper which simply explores an architecture modification. The paper is clearly written and the idea is fairly straightforward: the authors propose to use LayerNorm without mean centering. As such, the originality is limited. Still, there has recently been a proliferation of different network architectures and especially normalization techniques, so any possible simplification with only minimal performance losses should be welcomed by the community (in the spirit of Occam's razor).

I'm not very familiar with the particular tasks the authors use to compare their method. There is a focus on language models, and this is likely because LayerNorm happens to provide good performance on these types of tasks. But since the paper brings only minimal theoretical contributions to the table, it would be helpful to compare performance on tasks which involve different types of architectures, as well as data (such as images, sound, video, etc.).

EDIT: I thank the authors for their helpful clarifications. I'm still somewhat on the fence about this paper, as it mainly represents a fix to an earlier method (it appears the authors of LayerNorm were missing an ablation experiment). However, overall the paper represents a net benefit to the community, which shouldn't be ignored. I'll update my score to 7.
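The modification under review is simple enough to sketch directly. A minimal numpy version of both normalizations (gain/bias terms omitted for brevity; function names are illustrative, not the paper's code):

```python
import numpy as np

def layer_norm(x, eps=1e-8):
    # LayerNorm: re-center by the mean, then re-scale by the standard deviation.
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / (x.std(axis=-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-8):
    # RMSNorm: skip the re-centering step and divide by the root mean square only.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True))
    return x / (rms + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, size=(4, 16))        # activations with a shifted mean
x0 = x - x.mean(axis=-1, keepdims=True)      # perfectly zero-mean activations

# When the activations are already zero-mean, std equals RMS and the two
# normalizations coincide; with a shifted mean they disagree, since the RMS
# statistic also absorbs the mean.
agree = np.allclose(layer_norm(x0), rms_norm(x0))
disagree = not np.allclose(layer_norm(x), rms_norm(x))
```

The zero-mean case also makes concrete the initialization caveat raised in the review further down: networks whose activations are already roughly centered will see little difference between the two schemes.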
ORIGINALITY:
+ The proposed normalization technique is original in the sense that existing normalization techniques (batch, layer, group, instance, ...) differ only in the dimensions over which the activations are normalized. This paper instead proposes removing one of the typical steps in the normalization process in order to speed up training, which has been less well studied.
- This work proposes dividing by the RMS statistic instead of the standard deviation without hurting accuracy. Other works (for example, Santurkar et al.) experiment with scaling by different statistics, such as various l_p norms, without a loss in training accuracy. This work is therefore not the first to suggest scaling the activations by a different statistic.

QUALITY:
+ The authors tested their technique in multiple deep learning frameworks (TensorFlow, PyTorch, Theano), which gives more support to their empirical results, as different implementations can have very different timing results.
+ The authors tested their technique on multiple tasks and neural network architectures.
- The main hypothesis is that the re-centering step in Layer Normalization is dispensable; this is backed only by experimental results and could be a lot stronger with some theoretical justification.
- While the few experimental results show that there is no degradation of accuracy from not centering the activations, I am still not fully convinced that the centering step can be deemed unnecessary. For example, it is likely that the weights/biases of the networks in the paper are initialized such that the activations are already roughly centered around zero, and therefore the mean-centering step can be removed without seeing much of a difference in performance. An advantage of existing techniques such as Batch Normalization and Layer Normalization is that they are more robust to hyperparameter selection such as the learning rate, but more importantly in this case, the weight/bias initialization.
It’s possible that the proposed technique may not work as well as existing techniques for arbitrary weight/bias initializations.
- It would be good to also compare to Weight Normalization, which is motivated by reducing computational overhead compared to Batch Normalization.

CLARITY:
+ The paper is generally clearly written and each section follows logically from the previous one.
- Some minor details about the experiments are missing: for example, the learning rate/optimizer used for training is given for Daily Mail Reading and Image-Caption Retrieval but not for Machine Translation. Weight initialization methods are missing entirely and would be useful to know for anyone trying to reimplement these results. Mention of the type of nonlinearity used in each network is also absent.

SIGNIFICANCE:
+ Drop-in replacement for networks with Layer Normalization that can speed up training without reducing accuracy. Accuracy with RMSNorm/pRMSNorm is slightly better than with Layer Normalization, but these differences are small and it is unclear whether they are within noise.
+ Easy to implement in existing models, so easy for practitioners to use.
- Depending on the circumstances, practitioners often care about final accuracy and inference speed more than training speed. The proposed technique does not convincingly improve asymptotic accuracy. It would improve inference speed over Layer Normalization, but not over Batch Normalization if the inference-time parameters are ‘folded in’ to the preceding layer’s weights.
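The folding alluded to in the last point works because BatchNorm's inference-time statistics are fixed constants, unlike the per-example statistics of LayerNorm or RMSNorm. A minimal numpy sketch of absorbing an inference-mode BatchNorm into the preceding linear layer (all variable names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 5, 3
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

# Fixed inference-time BatchNorm parameters: running stats plus learned affine.
mu = rng.normal(size=d_out)
var = rng.uniform(0.5, 2.0, size=d_out)
gamma, beta = rng.normal(size=d_out), rng.normal(size=d_out)
eps = 1e-5

def linear_then_bn(x):
    # Linear layer followed by BatchNorm in inference mode.
    y = x @ W + b
    return (y - mu) / np.sqrt(var + eps) * gamma + beta

# Fold BN into the linear layer: the same function, computed as one matmul.
s = gamma / np.sqrt(var + eps)
W_fold = W * s                  # scales each output column of W
b_fold = (b - mu) * s + beta

x = rng.normal(size=(7, d_in))
y_bn = linear_then_bn(x)
y_fold = x @ W_fold + b_fold    # numerically identical to y_bn
```

The same trick is unavailable for Layer Normalization or RMSNorm, whose statistics depend on each individual input, which is exactly the asymmetry this point highlights.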