{"title": "Root Mean Square Layer Normalization", "book": "Advances in Neural Information Processing Systems", "page_first": 12381, "page_last": 12392, "abstract": "Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network, e.g. RNN in particular. In this paper, we hypothesize that re-centering invariance in LayerNorm is dispensable and propose root mean square layer normalization, or RMSNorm. RMSNorm regularizes the summed inputs to a neuron in one layer according to root mean square (RMS), giving the model re-scaling invariance property and implicit learning rate adaptation ability. RMSNorm is computationally simpler and thus more efficient than LayerNorm. We also present partial RMSNorm, or pRMSNorm where the RMS is estimated from p% of the summed inputs without breaking the above properties. Extensive experiments on several tasks using diverse network architectures show that RMSNorm achieves comparable performance against LayerNorm but reduces the running time by 7%~64% on different models. Source code is available at https://github.com/bzhangGo/rmsnorm.", "full_text": "Root Mean Square Layer Normalization\n\nBiao Zhang1 Rico Sennrich2,1\n\n1School of Informatics, University of Edinburgh\n\n2Institute of Computational Linguistics, University of Zurich\n\nB.Zhang@ed.ac.uk, sennrich@cl.uzh.ch\n\nAbstract\n\nLayer normalization (LayerNorm) has been successfully applied to various deep\nneural networks to help stabilize training and boost model convergence because\nof its capability in handling re-centering and re-scaling of both inputs and weight\nmatrix. 
However, the computational overhead introduced by LayerNorm makes\nthese improvements expensive and signi\ufb01cantly slows the underlying network, e.g.\nRNN in particular. In this paper, we hypothesize that re-centering invariance in\nLayerNorm is dispensable and propose root mean square layer normalization, or\nRMSNorm. RMSNorm regularizes the summed inputs to a neuron in one layer ac-\ncording to root mean square (RMS), giving the model re-scaling invariance property\nand implicit learning rate adaptation ability. RMSNorm is computationally simpler\nand thus more ef\ufb01cient than LayerNorm. We also present partial RMSNorm, or\npRMSNorm where the RMS is estimated from p% of the summed inputs without\nbreaking the above properties. Extensive experiments on several tasks using di-\nverse network architectures show that RMSNorm achieves comparable performance\nagainst LayerNorm but reduces the running time by 7%\u223c64% on different models.\nSource code is available at https://github.com/bzhangGo/rmsnorm.\n\n1\n\nIntroduction\n\nHow to train deep neural networks ef\ufb01ciently is a long-standing challenge. To accelerate model\nconvergence, Ba et al. [3] propose the layer normalization (LayerNorm) which stabilizes the training\nof deep neural networks by regularizing neuron dynamics within one layer via mean and variance\nstatistics. Due to its simplicity and requiring no dependencies among training cases, LayerNorm\nhas been widely applied to different neural architectures, which enables remarkable success on\nvarious tasks ranging from computer vision [19, 26], speech recognition [37] to natural language\nprocessing [31, 35]. In some cases, LayerNorm was found to be essential for successfully training a\nmodel [6]. 
Besides, being decoupled from batch-based samples gives LayerNorm an advantage over batch normalization (BatchNorm) [12] in handling variable-length sequences with RNNs.
Unfortunately, the incorporation of LayerNorm raises computational overhead. Although this is negligible for small and shallow neural models with few normalization layers, the problem becomes severe when the underlying networks grow larger and deeper. As a result, the efficiency gain from faster and more stable training (in terms of the number of training steps) is counter-balanced by an increased computational cost per training step, which diminishes the net efficiency, as shown in Figure 1. One major feature of LayerNorm that is widely regarded as contributing to this stabilization is its re-centering invariance property: the summed inputs after LayerNorm remain intact when the inputs or the weight matrix are shifted by some amount of noise. We argue that this mean normalization does not reduce the variance of hidden states or model gradients, and hypothesize that it has little impact on the success of LayerNorm.
In this paper, we propose root mean square layer normalization (RMSNorm), which regularizes the summed inputs to a neuron in one layer with the root mean square (RMS) statistic alone.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Training procedure of a GRU-based RNNSearch [4] for the first 10k training steps. (a) Training loss vs. training steps. (b) Training loss vs. training time. Baseline means the original model without any normalization. When the Baseline training loss arrives at 7.0, the loss of LayerNorm reaches 5.4 after the same number of training steps (1(a)), but only 5.9 after the same training time (1(b)).

RMSNorm reduces the amount of computation and increases efficiency over LayerNorm. 
Despite the\nsimpler formulation, the RMS normalizer helps stabilize the magnitude of layer activations, ensuring\ninvariance to the re-scaling of both weights and datasets. We also show the possibility of estimating\nRMS on a subset of the summed inputs, maintaining this invariance property. Assuming that the\nsummed inputs have an independent identically distributed structure, we propose partial RMSNorm,\nwhere only the \ufb01rst p% summed inputs are utilized for RMS estimation.\nWe thoroughly examine our model on various tasks, including machine translation, image classi\ufb01ca-\ntion, image-caption retrieval and question answering. Experimental results show that across different\nmodels, RMSNorm yields comparable performance against LayerNorm but shows superiority in\nterms of running speed with a speed-up of 7%\u223c64%. When estimating the RMS with partial (6.25%)\nsummed inputs, pRMSNorm achieves competitive performance compared to RMSNorm.\n\n2 Related Work\n\nOne bottleneck deep neural networks have been hypothesized to suffer from is the internal covariate\nshift issue [27], where a layer\u2019s input distribution changes as previous layers are updated, which\nsigni\ufb01cantly slows the training.1 One promising direction to solve this problem is normalization.\nIoffe and Szegedy [12] introduce batch normalization (BatchNorm) to stabilize activations based on\nmean and variance statistics estimated from each training mini-batch. Unfortunately, the reliance\nacross training cases deprives BatchNorm of the capability in handling variable-length sequences,\nthough several researchers develop different strategies to enable it in RNNs [16, 8]. Instead, Salimans\nand Kingma [22] propose weight normalization (WeightNorm) to reparameterize weight matrix\nso as to decouple the length of weight vectors from their directions. Ba et al. 
[3] propose layer normalization, which differs from BatchNorm in that statistics are directly estimated from the same layer without accessing other training cases. Due to its simplicity and effectiveness, LayerNorm has been successfully applied to various deep neural models, and achieves state-of-the-art performance on different tasks [19, 37, 31, 6].
These studies pioneered the research direction that integrates normalization as a part of the model architecture. This paradigm achieves encouraging performance by shortening model convergence, but at the cost of consuming more time for each running step. To improve efficiency, Arpit et al. [2] employ a data-independent method to approximately estimate the mean and variance statistics, thus avoiding calculating batch statistics. Ioffe [11] proposes batch renormalization so as to reduce the dependence on mini-batches in BatchNorm. Ulyanov et al. [30] replace batch normalization with instance normalization for image generation. Hoffer et al. [10] and Wu et al. [33] observe that the l1-norm can act as an alternative to the variance in BatchNorm, with the benefit of fewer nonlinear operations and higher computational efficiency. Nevertheless, all these works still follow the original normalization structure and use a mean statistic, estimated from the whole of the summed inputs, to handle re-centering invariance.

1Note that the internal covariate shift is given as motivation by [12, 3]. Recent studies have proposed alternative explanations for the success of normalization, such as the uncontrollable growth of layer activations in unnormalized deep networks [5].

Different from these related works, the proposed RMSNorm modifies the normalization structure by removing the re-centering operation and regularizing the summed inputs with RMS alone. 
Our model only maintains the re-scaling invariance property, which we find can be inherited even when the RMS is estimated from only a subset of the summed inputs, partially inspired by group normalization [34]. As a side effect, our model reduces the computational overhead and increases efficiency. Recently, Zhang et al. [36] show that, with careful initialization, residual networks can be trained as stably as those with normalization. However, that approach mainly aims at improving residual networks and cannot be switched in freely without modifying all initialization layers. Besides, it is not trivial to adapt to other general neural networks, such as RNNs, where model depth expands along the variable sequence length. By contrast, our model is simple, effective, and can be used as a drop-in replacement for LayerNorm.

3 Background

We briefly review LayerNorm in this section based on a standard feed-forward neural network. Given an input vector x ∈ R^m, a feed-forward network projects it into an output vector y ∈ R^n through a linear transformation followed by a non-linear activation as follows:

    a_i = Σ_{j=1}^{m} w_{ij} x_j,    y_i = f(a_i + b_i),    (1)

where w_i is the weight vector of the i-th output neuron, b_i is a bias scalar that is usually initialized to 0, and f(·) is an element-wise non-linear function. a ∈ R^n denotes the weight-summed inputs to the neurons, which are also the target of normalization.
This vanilla network might suffer from the internal covariate shift issue [12], where a layer's input distribution changes as previous layers are updated. This could negatively affect the stability of the parameters' gradients, delaying model convergence. 
To reduce this shift, LayerNorm normalizes the summed inputs so as to fix their mean and variance as follows:

    ā_i = (a_i − µ)/σ · g_i,    y_i = f(ā_i + b_i),    (2)

where ā_i is the i-th value of vector ā ∈ R^n, which acts as the normalized alternative of a_i for layer activation. g ∈ R^n is the gain parameter used to re-scale the standardized summed inputs, and is set to 1 at the beginning. µ and σ are the mean and standard deviation statistics, respectively, estimated from the raw summed inputs a:

    µ = (1/n) Σ_{i=1}^{n} a_i,    σ = sqrt( (1/n) Σ_{i=1}^{n} (a_i − µ)² ).    (3)

Thus, LayerNorm forces the norm of neurons to be decoupled from the inputs and weight matrix.

4 RMSNorm

A well-known explanation of the success of LayerNorm is its re-centering and re-scaling invariance property. The former enables the model to be insensitive to shift noises on both inputs and weights, and the latter keeps the output representations intact when both inputs and weights are randomly scaled. In this paper, we hypothesize that the re-scaling invariance, rather than the re-centering invariance, is the reason for the success of LayerNorm.
We propose RMSNorm, which focuses solely on re-scaling invariance and regularizes the summed inputs simply according to the root mean square (RMS) statistic:

    ā_i = a_i / RMS(a) · g_i,    where    RMS(a) = sqrt( (1/n) Σ_{i=1}^{n} a_i² ).    (4)

Intuitively, RMSNorm simplifies LayerNorm by entirely removing the mean statistic in Eq. (3), at the cost of sacrificing the invariance that mean normalization affords. When the mean of the summed inputs is zero, RMSNorm is exactly equal to LayerNorm. 
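As a concrete reference, Eqs. (2)–(4) can be written in a few lines of NumPy. This is an illustrative sketch rather than the authors' released code (the function names are ours), and it checks the claim just made: on zero-mean summed inputs the two normalizers coincide.

```python
import numpy as np

def layer_norm(a, g):
    # Eq. (2)-(3): standardize by mean and std, then re-scale by the gain g.
    mu = a.mean()
    sigma = np.sqrt(((a - mu) ** 2).mean())
    return (a - mu) / sigma * g

def rms_norm(a, g):
    # Eq. (4): divide by the root mean square only; no re-centering.
    rms = np.sqrt((a ** 2).mean())
    return a / rms * g

rng = np.random.default_rng(0)
g = np.ones(8)                  # gain initialized to 1, as in the paper
a = rng.normal(size=8)
a_zero_mean = a - a.mean()      # force the mean of the summed inputs to zero
# With zero mean, sigma equals the RMS, so the two outputs are identical:
assert np.allclose(layer_norm(a_zero_mean, g), rms_norm(a_zero_mean, g))
```

With a nonzero mean the two outputs differ, which is exactly the invariance RMSNorm gives up.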
Although RMSNorm does not re-center the summed inputs as in LayerNorm, we demonstrate through experiments that this property is not fundamental to the success of LayerNorm, and that RMSNorm is similarly or more effective.

Table 1: Invariance properties of different normalization methods. "✓" indicates invariant, while "✗" denotes the opposite.

            Weight matrix  Weight matrix  Weight vector  Dataset     Dataset       Single training case
            re-scaling     re-centering   re-scaling     re-scaling  re-centering  re-scaling
BatchNorm   ✓              ✗              ✓              ✓           ✓             ✗
WeightNorm  ✓              ✗              ✓              ✗           ✗             ✗
LayerNorm   ✓              ✓              ✗              ✓           ✗             ✓
RMSNorm     ✓              ✗              ✗              ✓           ✗             ✓
pRMSNorm    ✓              ✗              ✗              ✓           ✗             ✓

RMS measures the quadratic mean of the inputs, which in RMSNorm forces the summed inputs onto a √n-scaled unit sphere. By doing so, the output distribution remains unchanged regardless of the scaling of the input and weight distributions, benefiting the stability of layer activations. Although the Euclidean norm, which differs from RMS only by a factor of √n, has been successfully explored [22], we empirically find that it does not work for layer normalization. We hypothesize that scaling the sphere with the size of the input vector is important because it makes the normalization more robust across vectors of different sizes. As far as we know, the idea of employing RMS for neural network normalization has not been investigated before.

4.1 Invariance Analysis

Invariance measures the degree to which the model output after normalization changes along with its input and weight matrix. Ba et al. 
[3] show that different normalization methods exhibit different invariance properties, which contribute considerably to the model's robustness. In this section, we theoretically examine the invariance properties of RMSNorm.
We consider the following general form of RMSNorm:

    y = f( (Wx)/RMS(a) ⊙ g + b ),    (5)

where ⊙ denotes element-wise multiplication. Our main results are summarized in Table 1. RMSNorm is invariant to both weight matrix and input re-scaling, because of the following linearity property of RMS:

    RMS(αx) = α·RMS(x),    (6)

where α is a (positive) scale value. Suppose the weight matrix is scaled by a factor of δ, i.e. W′ = δW; then this change does not affect the final layer output:

    y′ = f( (W′x)/RMS(a′) ⊙ g + b ) = f( (δWx)/(δ·RMS(a)) ⊙ g + b ) = y.    (7)

By contrast, if the scaling is only performed on individual weight vectors, this property does not hold anymore, as the different scaling factors break the linearity property of RMS. Similarly, if we enforce a scale on the input with a factor of δ, i.e. x′ = δx, the output of RMSNorm remains unchanged, by an analysis analogous to that in Eq. (7). We can easily extend this equality to batch-based inputs as well as the whole dataset. Therefore, RMSNorm is invariant to the scaling of its inputs.
The main difference to LayerNorm is that RMSNorm is not re-centered and thus does not show a similar linearity property for variable shifting. It is not invariant to all re-centering operations.

4.2 Gradient Analysis

The above analysis only considers the effect of scaling the inputs and the weight matrix on the layer output. 
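These output-level invariances are easy to verify numerically. The sketch below is our own illustration (f is taken to be tanh and the sizes are arbitrary): it scales the weight matrix and the input by a positive δ and checks that the RMSNorm output of Eq. (5) is unchanged, while a shift of the input is not absorbed.

```python
import numpy as np

def rmsnorm_layer(W, x, g, b):
    # Eq. (5): y = f((Wx / RMS(a)) ⊙ g + b), with f = tanh for illustration.
    a = W @ x
    rms = np.sqrt((a ** 2).mean())
    return np.tanh(a / rms * g + b)

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 3))
x = rng.normal(size=3)
g, b = rng.normal(size=5), rng.normal(size=5)
y = rmsnorm_layer(W, x, g, b)
delta = 3.7
assert np.allclose(rmsnorm_layer(delta * W, x, g, b), y)    # weight re-scaling: invariant
assert np.allclose(rmsnorm_layer(W, delta * x, g, b), y)    # input re-scaling: invariant
assert not np.allclose(rmsnorm_layer(W, x + 0.5, g, b), y)  # re-centering: not invariant
```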
In a general setting, however, an RMSNorm-enhanced neural network is trained via a standard stochastic gradient descent approach, where the robustness of the model gradient is crucial to the parameters' update and model convergence (see also Santurkar et al. [23], who argue that the success of normalization methods does not come from the added stability of layer inputs, but from the increased smoothness of the optimization landscape). In this section, we investigate the properties of model gradients in RMSNorm.
Given a loss function L, we perform back-propagation through Eq. (4) to obtain the gradients with respect to the parameters g, b as follows:

    ∂L/∂b = ∂L/∂v,    ∂L/∂g = ∂L/∂v ⊙ (Wx)/RMS(a),    (8)

where v is short for the whole expression inside f(·) in Eq. (4), and ∂L/∂v is the gradient back-propagated from L to v. Both gradients ∂L/∂b and ∂L/∂g are invariant to the scaling of the inputs x and the weight matrix W (in the case of ∂L/∂g, because of the linearity property in Eq. (6)). Besides, the gradient of g is proportional to the normalized summed inputs, rather than to the raw inputs. This improves the stability of the magnitude of g.
Unlike these vector parameters, the gradient of the weight matrix W is more complicated due to the quadratic computation in RMS. Formally,

    ∂L/∂W = Σ_{i=1}^{n} [ xᵀ ⊗ (diag(g ⊙ ∂L/∂v) × R)_i ],  where  R = (1/RMS(a)) · ( I − (Wx)(Wx)ᵀ / (n·RMS(a)²) ),    (9)

where diag(·) denotes the diagonal matrix of its input, ⊗ denotes the Kronecker product, and "I" indicates the identity matrix. For clarity, we explicitly use "×" to represent matrix multiplication. 
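These gradient properties can be spot-checked with finite differences before analyzing R further. The sketch below is our own illustration, with f = identity and a squared-sum loss (both illustrative choices): it verifies that ∂L/∂g is invariant to re-scaling of x and W, and that ∂L/∂W is invariant to input re-scaling while picking up a 1/δ factor under weight re-scaling, the implicit learning-rate adaptation discussed below.

```python
import numpy as np

def loss(W, x, g, b):
    # Scalar loss on top of the RMSNorm layer of Eq. (4),
    # with f = identity and a squared-sum loss (illustrative choices).
    a = W @ x
    v = a / np.sqrt((a ** 2).mean()) * g + b
    return float(np.sum(v ** 2))

def num_grad(f, P, eps=1e-6):
    # Central finite-difference gradient of the scalar function f at P.
    G = np.zeros_like(P)
    for i in np.ndindex(*P.shape):
        Pp, Pm = P.copy(), P.copy()
        Pp[i] += eps
        Pm[i] -= eps
        G[i] = (f(Pp) - f(Pm)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
W, x = rng.normal(size=(4, 3)), rng.normal(size=3)
g, b = rng.normal(size=4), rng.normal(size=4)
delta = 2.0

# dL/dg is unchanged when either the input or the weight matrix is re-scaled:
dg = num_grad(lambda q: loss(W, x, q, b), g)
assert np.allclose(num_grad(lambda q: loss(W, delta * x, q, b), g), dg, atol=1e-5)
assert np.allclose(num_grad(lambda q: loss(delta * W, x, q, b), g), dg, atol=1e-5)

# dL/dW is invariant to input re-scaling, but picks up a 1/delta factor
# when the weights themselves are scaled by delta:
dW = num_grad(lambda Q: loss(Q, x, g, b), W)
assert np.allclose(num_grad(lambda Q: loss(Q, delta * x, g, b), W), dW, atol=1e-5)
assert np.allclose(num_grad(lambda Q: loss(Q, x, g, b), delta * W), dW / delta, atol=1e-5)
```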
The matrix term R associates the gradient of W with both the inputs x and the weight matrix W. With a thorough analysis, we can demonstrate that this term is negatively correlated with both input and weight matrix scaling. After assigning a scale of δ to either the input x (x′ = δx) or the weight matrix (W′ = δW), we have

    R′ = (1/(δ·RMS(a))) · ( I − (δWx)(δWx)ᵀ / (n·δ²·RMS(a)²) ) = (1/δ)·R.    (10)

If we put the scaled term R′ back into Eq. (9), we can easily prove that the gradient ∂L/∂W is invariant to input scaling, but keeps the negative correlation with weight matrix scaling. Reducing the sensitivity of the gradient ∂L/∂W to the scaling of the inputs ensures its smoothness and improves the stability of learning. On the other hand, the negative correlation acts as an implicit learning rate adaptor and dynamically controls the norm of the gradients, which avoids large-norm weight matrices and improves model convergence.

5 pRMSNorm

The re-scaling invariance property of RMSNorm is due to the linearity property of RMS. Considering that the neurons in one layer often have an independent identically distributed structure, we argue that the RMS can be estimated on a subset of these neurons rather than all of them. We propose partial RMSNorm (pRMSNorm). Given the unnormalized input a, pRMSNorm infers the RMS statistic from the first p% of the elements of a: RMS̄(a) = sqrt( (1/k) Σ_{i=1}^{k} a_i² ), where k = ⌈n·p⌉ denotes the number of elements used for RMS estimation. The linearity property still holds for the partial RMS as in Eq. (6), which indicates that pRMSNorm shares the same invariance properties as RMSNorm, as shown in Table 1.
The partial RMS̄ is a biased estimate of the RMS over all summed inputs, and is often inaccurate. Though pRMSNorm theoretically approximates RMSNorm, we observe gradient instability, where the gradient tends to explode when the partial ratio p is small. 
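In code, pRMSNorm only changes how the normalizer is computed. A sketch (our own, with p as a fraction rather than a percentage) also confirms that the linearity of Eq. (6) survives the partial estimate, so re-scaling invariance is preserved:

```python
import numpy as np

def prms_norm(a, g, p=0.0625):
    # Partial RMS: estimate the RMS from the first k = ceil(n*p) elements only.
    n = a.shape[-1]
    k = int(np.ceil(n * p))
    partial_rms = np.sqrt((a[..., :k] ** 2).mean(axis=-1, keepdims=True))
    return a / partial_rms * g

rng = np.random.default_rng(3)
a = rng.normal(size=64)
g = np.ones(64)
y = prms_norm(a, g)          # p = 6.25% -> k = 4 of 64 elements
delta = 5.0
# Re-scaling invariance is preserved by the partial estimate:
assert np.allclose(prms_norm(delta * a, g), y)
```

With p = 1 this reduces to RMSNorm; smaller p uses fewer elements and hence a noisier estimate.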
In practice, however, models with pRMSNorm can achieve satisfactory convergence with a partial ratio of 6.25%.

6 Experiments

To test the efficiency of layer normalization across different implementations, we perform experiments with Tensorflow [1], PyTorch [20] and Theano [29]. We add RMSNorm to different models, comparing against an unnormalized baseline and LayerNorm. These models are based on diverse architectures, covering different RNN variants, convolutional and self-attentional models, and various activations (such as sigmoid, tanh, and softmax), with initializations ranging over uniform, normal, and orthogonal schemes with different ranges or variances. Unless otherwise noted, all speed-related statistics are measured on one TITAN X (Pascal). Reported time is averaged over 3 runs. We also list the standard deviation of these three runs.

Table 2: SacreBLEU score on newstest2014 (Test14) and newstest2017 (Test17) for RNNSearch using Tensorflow-version Nematus. "Time": the time in seconds per 1k training steps. We set p to 6.25%. We highlight the best results in bold, and show the speedup of RMSNorm against LayerNorm in brackets.

Model      Test14  Test17  Time
Baseline   21.7    23.4    399±3.40s
LayerNorm  22.6    23.6    665±32.5s
L2-Norm    20.7    22.0    482±19.7s
RMSNorm    22.4    23.7    501±11.8s (24.7%)
pRMSNorm   22.6    23.1    493±10.7s (25.9%)

Figure 2: SacreBLEU score on newstest2013 for the RNNSearch. Models are implemented according to Nematus [25] in Tensorflow.

6.1 Machine Translation

Machine translation aims at transforming a sentence from one (source) language to another (target) language. We focus on neural machine translation based on an attention-enhanced encoder-decoder framework. 
We train two different models, a GRU-based RNNSearch [4] and a self-attention\nbased neural Transformer [31] on WMT14 English-German translation task. More details about the\nexperimental settings as well as comparison with WeightNorm are listed in Appendix A.1\nWe \ufb01rst experiment with RNNSearch. Normalization is added to the recurrent connections and\nfeedforward layers. Apart from RNNSearch without any normalization (Baseline) and with Layer-\nNorm, we also compare against the same model equipped with L2-Norm (i.e. replacing RMS with\nL2-Norm), which has been observed to improve lexical selection [18].\nFigure 2 illustrates the evolution of BLEU score on our development set after every 30k training\nsteps, and Table 2 summarizes the test results. In short, both LayerNorm and RMSNorm outperform\nthe Baseline by accelerating model convergence: they reduce the number of training steps until con-\nvergence by about 50%, and improve test accuracy, with RMSNorm being comparable to LayerNorm.\nThis supports our hypothesis that re-scaling invariance is the core property of LayerNorm, and that\nRMSNorm is an effective substitute. Our results with L2-Norm show that it fails to improve the\nmodel.2 Results in Table 2 highlight the challenge that RNN with LayerNorm in Tensor\ufb02ow suffers\nfrom serious computational inef\ufb01ciency, where LayerNorm is slower than the Baseline by about 67%.\nIn this respect, RMSNorm performs signi\ufb01cantly better, improving upon LayerNorm by \u223c25%.\nTable 3 further lists translation results of different models implemented in Theano and Pytorch.\nOverall, RMSNorm yields comparable translation quality compared with LayerNorm but incurs less\ncomputational overhead, outperforming LayerNorm with speedups ranging from 11%\u223c34%. In\naddition, we observe that though in theory the amount of computation in pRMSNorm is less than\nthat in RMSNorm, pRMSNorm (p = 6.25%) sometimes tends to be slower. 
We ascribe this to the\nnon-optimal implementation of tensor slicing operation in these computational frameworks, which\ncan be improved with speci\ufb01c low-level coding.\nIn pRMSNorm, the partial ratio p directly controls the accuracy of estimated RMS, thereby affecting\nthe stability of model training. Figure 3 shows the effect of p on model performance. Surprisingly, we\n\ufb01nd that the scale of p has little in\ufb02uence on the \ufb01nal translation quality in RNNSearch: using a small\nratio does not signi\ufb01cantly degenerate BLEU score. We set p to 6.25% for all following experiments.\nWe also experiment with Transformer, which is based on self-attention, avoiding recurrent connec-\ntions and allowing a higher degree of parallelization. Still, layer normalization is an important part of\nthe architecture. We use an in-house Tensor\ufb02ow implementation of the Transformer, and employ the\nbase setting as in [31] with all models trained for 300K steps. We treat Transformer with no normal-\nization as our Baseline, and compare RMSNorm-enhanced Transformer with LayerNorm-equipped\nTransformer. Table 4 shows the results, from which we observe the importance of normalization\nfor Transformer, without which training fails. RMSNorm achieves BLEU scores comparable to\nLayerNorm, and yields a speedup of 7%\u223c9%. Compared with RNNSearch, the relative cost of\n\n2We note that Nguyen and Chiang [18] only applied L2-Norm to the last layer, and treat the scaling factor as\na hyperparameter. 
While not a replication of their experiment, we still found it worth testing L2-Norm as an alternative to LayerNorm.

Table 3: SacreBLEU score on newstest2014 (Test14) and newstest2017 (Test17) for RNNSearch. "Th": Theano-version Nematus, "Py": an in-house PyTorch-based RNNSearch.

     Model      Test14  Test17  Time
Th   Baseline   21.8    22.9    596±20.8s
     LayerNorm  22.3    23.8    988±1.10s
     RMSNorm    22.5    23.2    652±24.1s (34.0%)
     pRMSNorm   22.7    24.0    658±17.9s (33.4%)
Py   Baseline   22.7    24.7    427±6.50s
     LayerNorm  23.2    24.3    857±17.2s
     RMSNorm    22.9    24.5    763±16.2s (11.0%)
     pRMSNorm   23.2    24.6    754±36.1s (12.0%)

Figure 3: SacreBLEU score on newstest2013 (devset) for the RNNSearch with pRMSNorm. We use Tensorflow-version Nematus, and change p with a step size of 10%.

Table 5: Mean (M) and standard deviation (S) statistics estimated on the hidden-to-hidden mapping of the decoder-part GRU cell in the RNNSearch model. We use the newstest2013 dataset. ALL: the statistics averaged across all token positions. Numbers 1, 2, 3, 4 indicate the statistics estimated at specific token positions.

Table 4: SacreBLEU score on newstest2014 (Test14) and newstest2017 (Test17) for the Transformer. "Time": the time in seconds per 1k training steps, which is measured using a Tesla V100.

Model      Test14  Test17  Time
Baseline   -       -       210±0.23s
LayerNorm  27.7    26.6    248±1.31s
RMSNorm    27.7    26.8    231±0.04s (6.9%)
pRMSNorm   27.8    26.5    225±1.63s (9.3%)

"-" indicates that we fail to train this model and the BLEU score is 0.

Table 5 (statistics; columns are token positions 1-4 and ALL):

                1      2      3      4      ALL
Baseline   M   -2.60  -1.19  -1.43  -1.53  -1.60
           S    7.35   2.33   2.61   2.73   3.04
LayerNorm  M   -0.43  -0.48  -0.50  -0.50  -0.51
           S    1.19   1.51   1.51   1.51   1.51
RMSNorm    M   -0.40  -0.60  -0.69  -0.74  -0.73
           S    1.27   1.51   1.50   1.49   1.50

normalization is lower because there are significantly fewer sequential normalization operations in the Transformer.
Effect of Normalization on Mean and Standard Deviation Table 5 shows the distribution of the mean and standard deviation of hidden representations across token positions for an RNNSearch model. Mean and standard deviation are unstable in the baseline, as observed by Ba et al. [3]. Due to their normalization properties, both RMSNorm and LayerNorm stabilize the standard deviation. Although the mean in RMSNorm is not normalized, in practice it is more stable than the mean of the baseline. This supports our hypothesis that RMSNorm stabilizes recurrent activations without the need to explicitly normalize the mean.
On the Robustness of RMSNorm One remaining question is whether the re-centering operation in LayerNorm (which RMSNorm abandons) makes models more robust towards arbitrary weight/bias initializations. 
We perform an experiment on RNNSearch with Nematus in Tensorflow, and change the center of the weight initialization to 0.2. Results in Figure 4 show that LayerNorm becomes very unstable with this abnormal initialization, but RMSNorm is more robust (both underperform the original initialization). Our empirical evidence so far suggests that RMSNorm is at least as robust as LayerNorm.

Figure 4: SacreBLEU score curve of LayerNorm and RMSNorm on newstest2013 (devset) when the initialization center is 0.2.

6.2 CNN/Daily Mail Reading Comprehension

This reading comprehension task is a cloze-style question answering task, where models are required to answer a question regarding a passage, and the answer is an anonymized entity from the passage [9]. We train a bidirectional attentive reader model proposed by Hermann et al. [9] on the CNN corpus. More details about the experimental settings are given in Appendix A.2. We compare RMSNorm with both LayerNorm and BatchNorm.
Figure 5 and Table 6 show the results. After normalizing the RNN with BatchNorm, using separate statistics for each time step in a sequence, both BatchNorm-LSTM and BatchNorm-Everywhere help speed up the convergence of the training process. 
By contrast, LayerNorm and RMSNorm not only converge faster than BatchNorm, but also reach a lower validation error rate, though pRMSNorm performs slightly worse than RMSNorm. Although in Figure 5 the performance of RMSNorm and LayerNorm is comparable, RMSNorm is around 15% faster than LayerNorm, as shown in Table 6.³

Table 6: Time in seconds per 0.1k training steps for the attentive reader model.

Model                 Time
Baseline              315±6.30s
BatchNorm-Everywhere  348±10.5s
BatchNorm-LSTM        345±11.2s
LayerNorm             392±5.70s
RMSNorm               333±5.20s (15.1%)
pRMSNorm              330±5.50s (15.8%)

Figure 5: Error rate on the validation set for the attentive reader model.

Figure 6: Recall@K values on the validation set for the order-embedding models. (a) Recall@1, (b) Recall@5, (c) Recall@10.

6.3 Image-Caption Retrieval

Image-caption retrieval is a cross-modal task aiming at learning a joint embedding space of images and sentences, which consists of two sub-tasks: image retrieval and caption retrieval. The former ranks a set of images according to a query caption, and the latter ranks a set of captions based on a query image. We train an order-embedding model (OE) proposed by Vendrov et al. [32] on the Microsoft COCO dataset [17] using their public source code in Theano. More details about the experimental settings are provided in Appendix A.3. We compare RMSNorm with two models: one without any normalization (Baseline) and one with LayerNorm.
Figure 6 shows the R@K curve on the validation set after every 300 training steps, and Table 7 lists the final test results. Across all these metrics, RMSNorm and LayerNorm consistently outperform the Baseline in terms of model convergence, as shown in Figure 6. 
We observe that on the validation set, RMSNorm slightly exceeds LayerNorm with respect to recall values. For the final test results, shown in Table 7, both RMSNorm and LayerNorm improve the model performance, reaching higher recall values (except LayerNorm on R@5) and lower mean rank, though RMSNorm shows better generalization than LayerNorm. Besides, results in Table 8 show that RMSNorm accelerates training by 40%~64% compared with LayerNorm, highlighting the better efficiency of pRMSNorm.

Table 8: Time in seconds per 0.1k training steps for the order-embedding model.

Model      Time
Baseline   2.11±0.047s
LayerNorm  12.02±0.191s
RMSNorm    7.12±0.207s (40.8%)
pRMSNorm   4.34±0.168s (63.9%)

6.4 CIFAR-10 Classification

CIFAR-10 is a supervised image classification task, with 10 different classes. We train a modified version of the ConvPool-CNN-C architecture [15], and follow the same experimental protocol as Salimans and Kingma [22]. BatchNorm, LayerNorm, and WeightNorm are included for comparison. Training details are given in Appendix A.4.
Table 9 and Table 10 show the results. Models enhanced with a normalization technique converge faster than the Baseline, among which BatchNorm performs the best. 
Similar to previous observations [3], we also find that layer normalization works worse than BatchNorm and WeightNorm for image processing. Though LayerNorm outperforms the Baseline by shortening model convergence, it fails to generalize to the test set, degrading the test error by 1.53%. In contrast, RMSNorm shows better generalization, surpassing the Baseline by 0.13% and saving about 20.5% training time compared to LayerNorm. pRMSNorm gains a further speedup of 2.6%, albeit at the cost of sacrificing 1.54% test accuracy.

³Notice that the implementation of BatchNorm is cuDNN-based, so the time cost of BatchNorm in Table 6 cannot be directly compared with the others.

                          Caption Retrieval                 Image Retrieval
Model                     R@1   R@5   R@10  Mean r    R@1   R@5   R@10  Mean r
Existing Work
Sym [32]                  45.4   -    88.7   5.8      36.3   -    85.8   9.0
OE + Baseline [32]†       46.7   -    88.9   5.7      37.9   -    85.9   8.1
OE + Baseline [3]‡        46.6  79.3  89.1   5.2      37.8  73.6  85.7   7.9
OE + LayerNorm [3]        48.5  80.6  89.8   5.1      38.9  74.3  86.3   7.6
This Work
OE + Baseline             45.8  79.7  88.8   5.4      37.6  73.6  85.8   7.7
OE + LayerNorm            47.9  79.5  89.2   5.3      38.4  74.6  86.7   7.5
OE + RMSNorm              48.7  79.7  89.5   5.3      39.0  74.8  86.3   7.5
OE + pRMSNorm             46.8  79.8  90.3   5.2      39.0  74.5  86.3   7.4

Table 7: Average R@K values across 5 test sets from Microsoft COCO. R@K: Recall @ K, higher is better. Mean r: mean rank, lower is better. ‡ denotes the reproduced results of †.

Table 9: Training error rate for the ConvPool-CNN-C model.

Model       Test Error   Time
Baseline    8.96%        21±0.0s
BatchNorm   8.25%        38±0.0s
WeightNorm  8.28%        23±0.0s
LayerNorm   10.49%       39±0.4s
RMSNorm     8.83%        31±0.5s (20.5%)
pRMSNorm    10.37%       30±0.4s (23.1%)

Table 10: Test error rate and time in seconds per training epoch for the ConvPool-CNN-C model. Time is measured with a GeForce RTX 2080 Ti.

7 Conclusion and Future Work

This paper presents RMSNorm, a novel normalization approach that normalizes the summed inputs according to the RMS. RMSNorm preserves the re-scaling invariance property of LayerNorm but eschews the re-centering invariance property, which contributes less to model training. Compared with LayerNorm, models with RMSNorm incur less computational overhead. RMSNorm can be easily applied to different model architectures as a drop-in replacement for LayerNorm. Experiments on several NLP tasks show that RMSNorm is comparable to LayerNorm in quality, but accelerates the running speed. Actual speed improvements depend on the framework, hardware, neural network architecture, and the relative computational cost of other components; we empirically observed speedups of 7%∼64% across different models and implementations. Our efficiency improvements come from simplifying the computation, and we thus expect them to be orthogonal to other means of increasing training speed, such as low-precision arithmetic and GPU kernel fusion. We also experimented with pRMSNorm, which estimates the RMS on a subset of the summed inputs.
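The two normalizers are simple enough to sketch in a few lines of NumPy (a minimal illustration under our own naming, not the released implementation; g denotes the learnable gain and p the partial ratio, expressed here as a fraction):

```python
import numpy as np

def rms_norm(x, g, eps=1e-8):
    """RMSNorm: rescale the summed inputs x by their root mean square.

    Unlike LayerNorm, no mean is subtracted (no re-centering), which is
    where the computational saving comes from.
    """
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return x / rms * g

def p_rms_norm(x, g, p=0.0625, eps=1e-8):
    """pRMSNorm: estimate the RMS from only the first fraction p of the inputs."""
    k = max(1, int(np.ceil(x.shape[-1] * p)))
    rms = np.sqrt(np.mean(x[..., :k] ** 2) + eps)
    return x / rms * g
```

With g = 1 the output of rms_norm always has unit RMS, and p_rms_norm with p = 1 reduces to rms_norm.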
While theoretically faster, we did not consistently observe empirical speed improvements for pRMSNorm. We leave it to future work to investigate whether its performance can be improved via code optimization. In the future, we would like to further analyze the reasons behind the success of RMSNorm. Inspired by the recent success of the l1-norm for BatchNorm, we will explore different norms for RMSNorm, and simplify other normalization techniques such as BatchNorm.

Acknowledgments

We thank the reviewers for their insightful comments, and Antonio Valerio Miceli Barone for his support with weight normalization for MT. This project has received funding from the grant H2020-ICT-2018-2-825460 (ELITR) by the European Union. Biao Zhang also acknowledges the support of the Baidu Scholarship. This work has been performed using resources provided by the Cambridge Tier-2 system operated by the University of Cambridge Research Computing Service (http://www.hpc.cam.ac.uk) funded by EPSRC Tier-2 capital grant EP/P020259/1.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 265–283, 2016. ISBN 978-1-931971-33-1.

[2] Devansh Arpit, Yingbo Zhou, Bhargava U Kota, and Venu Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. arXiv preprint arXiv:1603.01431, 2016.

[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton.
Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv e-prints, abs/1409.0473, September 2014.

[5] Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding batch normalization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7694–7705. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7996-understanding-batch-normalization.pdf.

[6] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86. Association for Computational Linguistics, 2018.

[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[8] Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.

[9] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701, 2015.

[10] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry.
Norm matters: efficient and accurate normalization schemes in deep networks. arXiv preprint arXiv:1803.01814, 2018.

[11] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pages 1945–1953, 2017.

[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 448–456, 2015.

[13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014.

[15] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

[16] César Laurent, Gabriel Pereyra, Philémon Brakel, Ying Zhang, and Yoshua Bengio. Batch normalized recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2657–2661. IEEE, 2016.

[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[18] Toan Q Nguyen and David Chiang. Improving lexical choice in neural machine translation. arXiv preprint arXiv:1710.01329, 2017.

[19] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, and Alexander Ku. Image transformer.
arXiv preprint arXiv:1802.05751, 2018.

[20] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[21] Matt Post. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771, 2018.

[22] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems 29, pages 901–909. 2016.

[23] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In Advances in Neural Information Processing Systems 31, pages 2488–2498. 2018.

[24] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

[25] Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain, April 2017.

[26] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.

[27] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

[28] Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[29] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, May 2016.

[30] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, 2016.

[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. 2017.

[32] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015.

[33] Shuang Wu, Guoqi Li, Lei Deng, Liu Liu, Dong Wu, Yuan Xie, and Luping Shi. L1-norm batch normalization for efficient training of deep neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2018.

[34] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.

[35] Biao Zhang and Rico Sennrich. A lightweight recurrent network for sequence modeling. arXiv preprint arXiv:1905.13324, 2019.

[36] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Residual learning without normalization via better initialization. In International Conference on Learning Representations, 2019.

[37] Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu. Syllable-based sequence-to-sequence speech recognition with the transformer in Mandarin Chinese. arXiv preprint arXiv:1804.10752, 2018.