{"title": "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 901, "page_last": 909, "abstract": "We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.", "full_text": "Weight Normalization: A Simple Reparameterization\n\nto Accelerate Training of Deep Neural Networks\n\nTim Salimans\n\nOpenAI\n\ntim@openai.com\n\nDiederik P. Kingma\n\nOpenAI\n\ndpkingma@openai.com\n\nAbstract\n\nWe present weight normalization: a reparameterization of the weight vectors\nin a neural network that decouples the length of those weight vectors from their\ndirection. By reparameterizing the weights in this way we improve the conditioning\nof the optimization problem and we speed up convergence of stochastic gradient\ndescent. 
Our reparameterization is inspired by batch normalization but does not\nintroduce any dependencies between the examples in a minibatch. This means\nthat our method can also be applied successfully to recurrent models such as\nLSTMs and to noise-sensitive applications such as deep reinforcement learning\nor generative models, for which batch normalization is less well suited. Although\nour method is much simpler, it still provides much of the speed-up of full batch\nnormalization. In addition, the computational overhead of our method is lower,\npermitting more optimization steps to be taken in the same amount of time. We\ndemonstrate the usefulness of our method on applications in supervised image\nrecognition, generative modelling, and deep reinforcement learning.\n\n1 Introduction\n\nRecent successes in deep learning have shown that neural networks trained by \ufb01rst-order gradient\nbased optimization are capable of achieving amazing results in diverse domains like computer vision,\nspeech recognition, and language modelling [7]. However, it is also well known that the practical\nsuccess of \ufb01rst-order gradient based optimization is highly dependent on the curvature of the objective\nthat is optimized. If the condition number of the Hessian matrix of the objective at the optimum is\nhigh, the problem is said to exhibit pathological curvature, and \ufb01rst-order gradient descent will have\ntrouble making progress [22, 32]. The amount of curvature, and thus the success of our optimization,\nis not invariant to reparameterization [1]: there may be multiple equivalent ways of parameterizing\nthe same model, some of which are much easier to optimize than others. 
Finding good ways of\nparameterizing neural networks is thus an important problem in deep learning.\nWhile the architectures of neural networks differ widely across applications, they are typically mostly\ncomposed of conceptually simple computational building blocks sometimes called neurons: each such\nneuron computes a weighted sum over its inputs and adds a bias term, followed by the application of\nan elementwise nonlinear transformation. Improving the general optimizability of deep networks is a\nchallenging task [6], but since many neural architectures share these basic building blocks, improving\nthese building blocks improves the performance of a very wide range of model architectures and\ncould thus be very useful.\nSeveral authors have recently developed methods to improve the conditioning of the cost gradient for\ngeneral neural network architectures. One approach is to explicitly left multiply the cost gradient\nwith an approximate inverse of the Fisher information matrix, thereby obtaining an approximately\nwhitened natural gradient. Such an approximate inverse can for example be obtained by using a\nKronecker factored approximation to the Fisher matrix and inverting it (KFAC, [23]), by using an\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fapproximate Cholesky factorization of the inverse Fisher matrix (FANG, [10]), or by whitening the\ninput of each layer in the neural network (PRONG, [5]).\nAlternatively, we can use standard \ufb01rst order gradient descent without preconditioning, but change\nthe parameterization of our model to give gradients that are more like the whitened natural gradients\nof these methods. For example, Raiko et al. [27] propose to transform the outputs of each neuron\nto have zero output and zero slope on average. 
They show that this transformation approximately\ndiagonalizes the Fisher information matrix, thereby whitening the gradient, and that this leads to\nimproved optimization performance. Another approach in this direction is batch normalization [14],\na method where the output of each neuron (before application of the nonlinearity) is normalized by\nthe mean and standard deviation of the outputs calculated over the examples in the minibatch. This\nreduces covariate shift of the neuron outputs and the authors suggest it also brings the Fisher matrix\ncloser to the identity matrix.\nFollowing this second approach to approximate natural gradient optimization, we propose a simple\nbut general method, called weight normalization, for improving the optimizability of the weights\nof neural network models. The method is inspired by batch normalization, but it is a deterministic\nmethod that does not share batch normalization\u2019s property of adding noise to the gradients. In\naddition, the overhead imposed by our method is lower: no additional memory is required and the\nadditional computation is negligible. The method shows encouraging results on a wide range of deep\nlearning applications.\n\n2 Weight Normalization\n\nWe consider standard arti\ufb01cial neural networks where the computation of each neuron consists of\ntaking a weighted sum of input features, followed by an elementwise nonlinearity:\n\ny = \u03c6(w \u00b7 x + b),    (1)\n\nwhere w is a k-dimensional weight vector, b is a scalar bias term, x is a k-dimensional vector of input\nfeatures, \u03c6(.) denotes an elementwise nonlinearity such as the recti\ufb01er max(., 0), and y denotes the\nscalar output of the neuron.\nAfter associating a loss function to one or more neuron outputs, such a neural network is commonly\ntrained by stochastic gradient descent in the parameters w, b of each neuron. 
In an effort to speed up\nthe convergence of this optimization procedure, we propose to reparameterize each weight vector w\nin terms of a parameter vector v and a scalar parameter g and to perform stochastic gradient descent\nwith respect to those parameters instead. We do so by expressing the weight vectors in terms of the\nnew parameters using\n\nw = (g/||v||) v,    (2)\n\nwhere v is a k-dimensional vector, g is a scalar, and ||v|| denotes the Euclidean norm of v. This\nreparameterization has the effect of \ufb01xing the Euclidean norm of the weight vector w: we now\nhave ||w|| = g, independent of the parameters v. We therefore call this reparameterization weight\nnormalization.\nThe idea of normalizing the weight vector has been proposed before (e.g. [31, 33]) but earlier work\ntypically still performed optimization in the w-parameterization, only applying the normalization\nafter each step of stochastic gradient descent. This is fundamentally different from our approach: we\npropose to explicitly reparameterize the model and to perform stochastic gradient descent in the new\nparameters v, g directly. Doing so improves the conditioning of the gradient and leads to improved\nconvergence of the optimization procedure: By decoupling the norm of the weight vector (g) from the\ndirection of the weight vector (v/||v||), we speed up convergence of our stochastic gradient descent\noptimization, as we show experimentally in section 5.\nInstead of working with g directly, we may also use an exponential parameterization for the scale, i.e.\ng = e^s, where s is a log-scale parameter to learn by stochastic gradient descent. Parameterizing the g\nparameter in the log-scale is more intuitive and more easily allows g to span a wide range of different\nmagnitudes. Empirically, however, we did not \ufb01nd this to be an advantage. 
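As an illustrative NumPy sketch (ours, not the paper's reference implementation; the function names are hypothetical), equations (1) and (2) amount to:

```python
import numpy as np

def weight_norm(v, g):
    """Effective weight vector w = (g / ||v||) v of equation (2)."""
    return (g / np.linalg.norm(v)) * v

def neuron(v, g, b, x, phi=lambda t: np.maximum(t, 0.0)):
    """Neuron output y = phi(w . x + b) of equation (1), with w expressed
    through the new parameters v and g; phi defaults to the rectifier."""
    return phi(np.dot(weight_norm(v, g), x) + b)

# The norm of w equals g for any v, which is what decouples the length
# of the weight vector from its direction:
v = np.array([3.0, 4.0])
w = weight_norm(v, g=2.0)
assert np.isclose(np.linalg.norm(w), 2.0)
assert np.allclose(w, [1.2, 1.6])
```

Gradient descent then updates v and g (or s) instead of w; the forward computation is otherwise unchanged.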
In our experiments,\nthe eventual test-set performance was not signi\ufb01cantly better or worse than the results with directly\nlearning g in its original parameterization, and optimization was slightly slower.\n\n2.1 Gradients\n\nTraining a neural network in the new parameterization is done using standard stochastic gradient\ndescent methods. Here we differentiate through (2) to obtain the gradient of a loss function L with\nrespect to the new parameters v, g. Doing so gives\n\n\u2207gL = (\u2207wL \u00b7 v)/||v||,    \u2207vL = (g/||v||) \u2207wL \u2212 (g \u2207gL/||v||^2) v,    (3)\n\nwhere \u2207wL is the gradient with respect to the weights w as used normally.\nBackpropagation using weight normalization thus only requires a minor modi\ufb01cation to the usual\nbackpropagation equations, and is easily implemented using standard neural network software, either\nby directly specifying the network in terms of the v, g parameters and relying on auto-differentiation,\nor by applying (3) in a post-processing step. We provide reference implementations using both\napproaches for Theano, Tensor\ufb02ow and Keras at https://github.com/openai/weightnorm.\nUnlike with batch normalization, the expressions above are independent of the minibatch size and\nthus cause only minimal computational overhead.\nAn alternative way to write the gradient is\n\n\u2207vL = (g/||v||) Mw \u2207wL,    with Mw = I \u2212 ww\u2032/||w||^2,    (4)\n\nwhere Mw is a projection matrix that projects onto the complement of the w vector. This shows\nthat weight normalization accomplishes two things: it scales the weight gradient by g/||v||, and it\nprojects the gradient away from the current weight vector. Both effects help to bring the covariance\nmatrix of the gradient closer to identity and bene\ufb01t optimization, as we explain below.\nDue to projecting away from w, the norm of v grows monotonically with the number of weight\nupdates when learning a neural network with weight normalization using standard gradient descent\nwithout momentum: Let v\u2032 = v + \u2206v denote our parameter update, with \u2206v \u221d \u2207vL (steepest\nascent/descent), then \u2206v is necessarily orthogonal to the current weight vector w since we project\naway from it when calculating \u2207vL (equation 4). Since v is proportional to w, the update is thus also\northogonal to v and increases its norm by the Pythagorean theorem. Speci\ufb01cally, if ||\u2206v||/||v|| = c,\nthe new weight vector will have norm ||v\u2032|| = \u221a(||v||^2 + c^2||v||^2) = \u221a(1 + c^2) ||v|| \u2265 ||v||. The rate\nof increase will depend on the variance of the weight gradient. If our gradients are noisy, c will be\nhigh and the norm of v will quickly increase, which in turn will decrease the scaling factor g/||v||.\nIf the norm of the gradients is small, we get \u221a(1 + c^2) \u2248 1, and the norm of v will stop increasing.\nUsing this mechanism, the scaled gradient self-stabilizes its norm. This property does not strictly\nhold for optimizers that use separate learning rates for individual parameters, like Adam [15] which\nwe use in experiments, or when using momentum. However, qualitatively we still \ufb01nd the same effect\nto hold.\nEmpirically, we \ufb01nd that the ability to grow the norm ||v|| makes optimization of neural networks\nwith weight normalization very robust to the value of the learning rate: If the learning rate is too\nlarge, the norm of the unnormalized weights grows quickly until an appropriate effective learning rate\nis reached. Once the norm of the weights has grown large with respect to the norm of the updates, the\neffective learning rate stabilizes. 
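To make the gradient mapping of equations (3) and (4) concrete, here is a small NumPy sketch (ours, for illustration; the function name is hypothetical):

```python
import numpy as np

def weight_norm_grads(v, g, grad_w):
    """Map the ordinary weight gradient grad_w = dL/dw to the gradients
    with respect to the new parameters v and g (equation 3)."""
    norm_v = np.linalg.norm(v)
    grad_g = np.dot(grad_w, v) / norm_v
    grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v**2) * v
    return grad_v, grad_g

# Equation (4) rewrites grad_v via the projection M_w = I - w w'/||w||^2:
rng = np.random.default_rng(0)
v, grad_w, g = rng.normal(size=4), rng.normal(size=4), 1.7
w = (g / np.linalg.norm(v)) * v
M_w = np.eye(4) - np.outer(w, w) / np.dot(w, w)
grad_v, grad_g = weight_norm_grads(v, g, grad_w)
assert np.allclose(grad_v, (g / np.linalg.norm(v)) * (M_w @ grad_w))
assert np.isclose(np.dot(grad_v, v), 0.0)  # update is orthogonal to v
```

The second assertion checks the property discussed above: a steepest-descent step never points along v, so (without momentum) the norm of v can only grow.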
Neural networks with weight normalization therefore work well\nwith a much wider range of learning rates than when using the normal parameterization. It has been\nobserved that neural networks with batch normalization also have this property [14], which can also\nbe explained by this analysis.\nBy projecting the gradient away from the weight vector w, we also eliminate the noise in that\ndirection. If the covariance matrix of the gradient with respect to w is given by C, the covariance\nmatrix of the gradient in v is given by D = (g^2/||v||^2) Mw C Mw. Empirically, we \ufb01nd that w is\noften (close to) a dominant eigenvector of the covariance matrix C: removing that eigenvector then\ngives a new covariance matrix D that is closer to the identity matrix, which may further speed up\nlearning.\n\n2.2 Relation to batch normalization\n\nAn important source of inspiration for this reparameterization is batch normalization [14], which\nnormalizes the statistics of the pre-activation t for each minibatch as\n\nt\u2032 = (t \u2212 \u00b5[t])/\u03c3[t],\n\nwith \u00b5[t], \u03c3[t] the mean and standard deviation of the pre-activations t = v \u00b7 x. For the special\ncase where our network only has a single layer, and the input features x for that layer are whitened\n(independently distributed with zero mean and unit variance), these statistics are given by \u00b5[t] = 0\nand \u03c3[t] = ||v||. In that case, normalizing the pre-activations using batch normalization is equivalent\nto normalizing the weights using weight normalization.\nConvolutional neural networks usually have much fewer weights than pre-activations, so normalizing\nthe weights is often much cheaper computationally. In addition, the norm of v is non-stochastic, while\nthe minibatch mean \u00b5[t] and variance \u03c3^2[t] can in general have high variance for small minibatch\nsize. Weight normalization can thus be viewed as a cheaper and less noisy approximation to batch\nnormalization. 
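The single-layer equivalence above is easy to check numerically. The following illustrative sketch (ours) uses a large sample of whitened inputs so that the minibatch statistics are close to their expectations:

```python
import numpy as np

# For whitened inputs x (independent features, zero mean, unit variance),
# the batch statistics of t = v . x approach mu[t] = 0 and sigma[t] = ||v||,
# so batch-normalizing t coincides with weight-normalizing v.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 3))   # a large "whitened minibatch"
v = np.array([1.0, -2.0, 0.5])
t = X @ v

assert abs(t.mean()) < 0.01                      # mu[t] ~ 0
assert abs(t.std() - np.linalg.norm(v)) < 0.01   # sigma[t] ~ ||v||
```

For realistic minibatch sizes these two statistics are noisy estimates, which is exactly the noise that weight normalization avoids.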
Although exact equivalence does not usually hold for deeper architectures, we still\n\ufb01nd that our weight normalization method provides much of the speed-up of full batch normalization.\nIn addition, its deterministic nature and independence of the minibatch input also mean that our\nmethod can be applied more easily to models like RNNs and LSTMs, as well as noise-sensitive\napplications like reinforcement learning.\n\n3 Data-Dependent Initialization of Parameters\n\nBesides a reparameterization effect, batch normalization also has the bene\ufb01t of \ufb01xing the scale of the\nfeatures generated by each layer of the neural network. This makes the optimization robust against\nparameter initializations for which these scales vary across layers. Since weight normalization lacks\nthis property, we \ufb01nd it is important to properly initialize our parameters. We propose to sample the\nelements of v from a simple distribution with a \ufb01xed scale, which is in our experiments a normal\ndistribution with mean zero and standard deviation 0.05. Before starting training, we then initialize\nthe b and g parameters to \ufb01x the minibatch statistics of all pre-activations in our network, just like\nin batch normalization, but only for a single minibatch of data and only during initialization. This\ncan be done ef\ufb01ciently by performing an initial feedforward pass through our network for a single\nminibatch of data X, using the following computation at each neuron:\n\nt = (v \u00b7 x)/||v||,    and    y = \u03c6((t \u2212 \u00b5[t])/\u03c3[t]),    (5)\n\nwhere \u00b5[t] and \u03c3[t] are the mean and standard deviation of the pre-activation t over the examples in\nthe minibatch. We can then initialize the neuron\u2019s bias b and scale g as\n\ng \u2190 1/\u03c3[t],    b \u2190 \u2212\u00b5[t]/\u03c3[t],    (6)\n\nso that y = \u03c6(w \u00b7 x + b). 
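For a single dense neuron, the initialization in equations (5) and (6) can be sketched in a few lines of NumPy (an illustrative sketch; the function name and ReLU-free pre-activation check are ours):

```python
import numpy as np

def init_weight_norm_neuron(X, init_scale=0.05, seed=0):
    """Data-dependent initialization of one weight-normalized neuron
    (equations 5 and 6). X holds one minibatch, shape (batch, inputs)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(0.0, init_scale, size=X.shape[1])  # fixed-scale init of v
    t = X @ v / np.linalg.norm(v)                     # pre-activation, eq. (5)
    mu, sigma = t.mean(), t.std()
    g = 1.0 / sigma                                   # eq. (6)
    b = -mu / sigma
    return v, g, b

# On the initialization minibatch, the resulting pre-activations w.x + b
# have zero mean and unit variance:
X = np.random.default_rng(1).normal(size=(100, 8))
v, g, b = init_weight_norm_neuron(X)
t = g * (X @ v) / np.linalg.norm(v) + b
assert abs(t.mean()) < 1e-8 and np.isclose(t.std(), 1.0)
```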
Like batch normalization, this method ensures that all features initially have\nzero mean and unit variance before application of the nonlinearity. With our method this only holds\nfor the minibatch we use for initialization, and subsequent minibatches may have slightly different\nstatistics, but experimentally we \ufb01nd this initialization method to work well. The method can also be\napplied to networks without weight normalization, simply by doing stochastic gradient optimization\non the parameters w directly, after initialization in terms of v and g: this is what we compare to in\nsection 5. Independently from our work, this type of initialization was recently proposed by different\nauthors [24, 18] who found such data-based initialization to work well for use with the standard\nparameterization in terms of w.\nThe downside of this initialization method is that it can only be applied in similar cases as where\nbatch normalization is applicable. For models with recursion, such as RNNs and LSTMs, we will\nhave to resort to standard initialization methods.\n\n4 Mean-only Batch Normalization\n\nWeight normalization, as introduced in section 2, makes the scale of neuron activations approximately\nindependent of the parameters v. Unlike with batch normalization, however, the means of the neuron\nactivations still depend on v. We therefore also explore the idea of combining weight normalization\nwith a special version of batch normalization, which we call mean-only batch normalization: With\nthis normalization method, we subtract out the minibatch means like with full batch normalization,\nbut we do not divide by the minibatch standard deviations. That is, we compute neuron activations\nusing\n\nt = w \u00b7 x,    \u02dct = t \u2212 \u00b5[t] + b,    y = \u03c6(\u02dct),    (7)\n\nwhere w is the weight vector, parameterized using weight normalization, and \u00b5[t] is the minibatch\nmean of the pre-activation t. During training, we keep a running average of the minibatch mean\nwhich we substitute in for \u00b5[t] at test time.\nThe gradient of the loss with respect to the pre-activation t is calculated as\n\n\u2207tL = \u2207\u02dctL \u2212 \u00b5[\u2207\u02dctL],    (8)\n\nwhere \u00b5[.] denotes once again the operation of taking the minibatch mean. Mean-only batch\nnormalization thus has the effect of centering the gradients that are backpropagated. This is a\ncomparatively cheap operation, and the computational overhead of mean-only batch normalization is\nthus lower than for full batch normalization. In addition, this method causes less noise during training,\nand the noise that is caused is more gentle as the law of large numbers ensures that \u00b5[t] and \u00b5[\u2207\u02dct] are\napproximately normally distributed. Thus, the added noise has much lighter tails than the highly\nkurtotic noise caused by the minibatch estimate of the variance used in full batch normalization. As\nwe show in section 5.1, this leads to improved accuracy at test time.\n\n5 Experiments\n\nWe experimentally validate the usefulness of our method using four different models for varied\napplications in supervised image recognition, generative modelling, and deep reinforcement learning.\n\n5.1 Supervised Classi\ufb01cation: CIFAR-10\n\nTo test our reparameterization method for the application of supervised classi\ufb01cation, we consider the\nCIFAR-10 data set of natural images [19]. The model we are using is based on the ConvPool-CNN-C\narchitecture of [30], with some small modi\ufb01cations: we replace the \ufb01rst dropout layer by a layer\nthat adds Gaussian noise, we expand the last hidden layer from 10 units to 192 units, and we use\n2 \u00d7 2 max-pooling, rather than 3 \u00d7 3. 
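A minimal NumPy sketch of the forward rule (7) and the gradient-centering backward rule (8); the names and the choice of ReLU are ours, for illustration only:

```python
import numpy as np

def mean_only_bn_forward(X, v, g, b, phi=lambda t: np.maximum(t, 0.0)):
    """Mean-only batch normalization on top of a weight-normalized neuron:
    subtract the minibatch mean of the pre-activation, but do not divide
    by its standard deviation (equation 7)."""
    w = (g / np.linalg.norm(v)) * v    # weight normalization, equation (2)
    t = X @ w                          # pre-activations over the minibatch
    t_tilde = t - t.mean() + b
    return phi(t_tilde)

def mean_only_bn_backward(grad_t_tilde):
    """Equation (8): the backpropagated gradients are centered."""
    return grad_t_tilde - grad_t_tilde.mean()

y = mean_only_bn_forward(np.eye(2), np.array([1.0, 0.0]), 1.0, 0.0)
assert np.allclose(y, [0.5, 0.0])
assert np.allclose(mean_only_bn_backward(np.array([1.0, 2.0, 3.0])),
                   [-1.0, 0.0, 1.0])
```

At test time the minibatch mean in the forward pass would be replaced by the running average described above.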
The only hyperparameter that we actively optimized (the\nstandard deviation of the Gaussian noise) was chosen to maximize the performance of the network\non a holdout set of 10000 examples, using the standard parameterization (no weight normalization\nor batch normalization). A full description of the resulting architecture is given in table A in the\nsupplementary material.\nWe train our network for CIFAR-10 using Adam [15] for 200 epochs, with a \ufb01xed learning rate and\nmomentum of 0.9 for the \ufb01rst 100 epochs. For the last 100 epochs we set the momentum to 0.5 and\nlinearly decay the learning rate to zero. We use a minibatch size of 100. We evaluate 5 different\nparameterizations of the network: 1) the standard parameterization, 2) using batch normalization,\n3) using weight normalization, 4) using weight normalization combined with mean-only batch\nnormalization, 5) using mean-only batch normalization with the normal parameterization. The\nnetwork parameters are initialized using the scheme of section 3 such that all five cases have identical\nparameters starting out. For each case we pick the optimal learning rate in {0.0003, 0.001, 0.003,\n0.01}. The resulting error curves during training can be found in \ufb01gure 1: both weight normalization\nand batch normalization provide a signi\ufb01cant speed-up over the standard parameterization. Batch\nnormalization makes slightly more progress per epoch than weight normalization early on, although\nthis is partly offset by the higher computational cost: with our implementation, training with batch\nnormalization was about 16% slower compared to the standard parameterization. In contrast, weight\nnormalization was not noticeably slower. During the later stage of training, weight normalization and\nbatch normalization seem to optimize at about the same speed, with the normal parameterization\n(with or without mean-only batch normalization) still lagging behind. 
Model                         Test Error\nMaxout [8]                    11.68%\nNetwork in Network [21]       10.41%\nDeeply Supervised [20]        9.6%\nConvPool-CNN-C [30]           9.31%\nALL-CNN-C [30]                9.08%\nour CNN, mean-only B.N.       8.52%\nour CNN, weight norm.         8.46%\nour CNN, normal param.        8.43%\nour CNN, batch norm.          8.05%\nours, W.N. + mean-only B.N.   7.31%\nDenseNet [13]                 5.77%\n\nFigure 2: Classi\ufb01cation results on CIFAR-10 without data augmentation.\n\nFigure 1: Training error for CIFAR-10 using different parameterizations. For weight normalization,\nbatch normalization, and mean-only batch normalization we show results using Adam with a learning\nrate of 0.003. For the normal parameterization we instead use 0.0003 which works best in this case.\nFor the last 100 epochs the learning rate is linearly decayed to zero.\n\nAfter optimizing the network for 200 epochs using the different parameterizations, we evaluate their\nperformance on the CIFAR-10 test set. The results are summarized in table 2: weight normalization,\nthe normal parameterization, and mean-only batch normalization have similar test accuracy (\u2248 8.5%\nerror). Batch normalization does signi\ufb01cantly better at 8.05% error. Mean-only batch normalization\ncombined with weight normalization has the best performance at 7.31% test error, and interestingly\ndoes much better than mean-only batch normalization combined with the normal parameterization:\nThis suggests that the noise added by batch normalization can be useful for regularizing the network,\nbut that the reparameterization provided by weight normalization or full batch normalization is also\nneeded for optimal results. We hypothesize that the substantial improvement by mean-only B.N. 
with weight normalization over\nregular batch normalization is due to the distribution of the noise caused by the normalization method\nduring training: for mean-only batch normalization the minibatch mean has a distribution that is\napproximately Gaussian, while the noise added by full batch normalization during training has much\nhigher kurtosis. The result with mean-only batch normalization combined with weight normalization\nrepresented the state-of-the-art for CIFAR-10 among methods that do not use data augmentation,\nuntil it was recently surpassed by DenseNets [13].\n\n5.2 Generative Modelling: Convolutional VAE\n\nNext, we test the effect of weight normalization applied to deep convolutional variational auto-\nencoders (CVAEs) [16, 28, 29], trained on the MNIST data set of images of handwritten digits and\nthe CIFAR-10 data set of small natural images.\nVariational auto-encoders are generative models that explain the data vector x as arising from a set of\nlatent variables z, through a joint distribution of the form p(z, x) = p(z)p(x|z), where the decoder\np(x|z) is speci\ufb01ed using a neural network. A lower bound on the log marginal likelihood log p(x)\ncan be obtained by approximately inferring the latent variables z from the observed data x using\nan encoder distribution q(z|x) that is also speci\ufb01ed as a neural network. This lower bound is then\noptimized to \ufb01t the model to the data.\nWe follow a similar implementation of the CVAE as in [29] with some modi\ufb01cations, mainly that the\nencoder and decoder are parameterized with ResNet [11] blocks, and that the diagonal posterior is\nreplaced with a more \ufb02exible speci\ufb01cation based on inverse autoregressive \ufb02ow. A further developed\nversion of this model is presented in [17], where the architecture is explained in detail.\nFor MNIST, the encoder consists of 3 sequences of two ResNet blocks each, the \ufb01rst sequence acting\non 16 feature maps, the others on 32 feature maps. 
The \ufb01rst two sequences are followed by a 2-times\nsubsampling operation implemented using 2 \u00d7 2 stride, while the third sequence is followed by\na fully connected layer with 450 units. The decoder has a similar architecture, but with reversed\ndirection. For CIFAR-10, we used a neural architecture with ResNet units and multiple intermediate\nstochastic layers. We used Adamax [15] with \u03b1 = 0.002 for optimization, in combination with\nPolyak averaging [26] in the form of an exponential moving average that averages parameters over\napproximately 10 epochs.\nIn \ufb01gure 3, we plot the test-set lower bound as a function of number of training epochs, including\nerror bars based on multiple different random seeds for initializing parameters. As can be seen, the\nparameterization with weight normalization has lower variance and converges to a better optimum.\nWe observe similar results across different hyper-parameter settings.\n\nFigure 3: Marginal log likelihood lower bound on the MNIST (top) and CIFAR-10 (bottom) test\nsets for a convolutional VAE during training, for both the standard implementation as well as our\nmodi\ufb01cation with weight normalization. For MNIST, we provide standard error bars to indicate\nvariance based on different initial random seeds.\n\n5.3 Generative Modelling: DRAW\n\nNext, we consider DRAW, a recurrent generative model by [9]. DRAW is a variational auto-encoder\nwith generative model p(z)p(x|z) and encoder q(z|x), similar to the model in section 5.2, but with\nboth the encoder and decoder consisting of a recurrent neural network comprised of Long Short-Term\nMemory (LSTM) [12] units. 
LSTM units consist of a memory cell with additive dynamics, combined\nwith input, forget, and output gates that determine which information \ufb02ows in and out of the memory.\nThe additive dynamics enables learning of long-range dependencies in the data.\nAt each time step of the model, DRAW uses the same set of weight vectors to update the cell states\nof the LSTM units in its encoder and decoder. Because of the recurrent nature of this process it is not\ntrivial to apply batch normalization here: Normalizing the cell states diminishes their ability to pass\nthrough information. Fortunately, weight normalization can easily be applied to the weight vectors\nof each LSTM unit, and we \ufb01nd this to work well empirically. Some other potential solutions were\nrecently proposed in [4, 2].\n\nFigure 4: Marginal log likelihood lower bound on the MNIST test set for DRAW during training, for\nboth the standard implementation as well as our modi\ufb01cation with weight normalization. 100 epochs\nis not suf\ufb01cient for convergence for this model, but the implementation using weight normalization\nclearly makes progress much more quickly than with the standard parameterization.\n\nWe take the Theano implementation of DRAW provided at https://github.com/jbornschein/draw\nand use it to model the MNIST data set of handwritten digits. We then make a single modi\ufb01cation\nto the model: we apply weight normalization to all weight vectors. 
As can be seen in \ufb01gure 4,\nthis signi\ufb01cantly speeds up convergence of the optimization procedure, even without modifying the\ninitialization method and learning rate that were tuned for use with the normal parameterization.\n\n5.4 Reinforcement Learning: DQN\n\nNext we apply weight normalization to the problem of Reinforcement Learning for playing games on\nthe Atari Learning Environment [3]. The approach we use is the Deep Q-Network (DQN) proposed\nby [25]. This is an application for which batch normalization is not well suited: the noise introduced\nby estimating the minibatch statistics destabilizes the learning process. We were not able to get\nbatch normalization to work for DQN without using an impractically large minibatch size. In\ncontrast, weight normalization is easy to apply in this context, as is the initialization method of\nsection 3. Stochastic gradient learning is performed using Adamax [15] with momentum of 0.5. We\nsearch for optimal learning rates in {0.0001, 0.0003, 0.001, 0.003}, generally \ufb01nding 0.0003 to work\nwell with weight normalization and 0.0001 to work well for the normal parameterization. We also\nuse a larger minibatch size (64) which we found to be more ef\ufb01cient on our hardware (Amazon\nElastic Compute Cloud g2.2xlarge GPU instance). Apart from these changes we follow [25]\nas closely as possible in terms of parameter settings and evaluation methods. However, we use a\nPython/Theano/Lasagne reimplementation of their work, adapted from the implementation available\nat https://github.com/spragunr/deep_q_rl, so there may be small additional differences in\nimplementation.\nFigure 5 shows the training curves obtained using DQN with the standard parameterization and\nwith weight normalization on Space Invaders. Using weight normalization the algorithm progresses\nmore quickly and reaches a better \ufb01nal result. 
Table 6 shows the \ufb01nal evaluation scores obtained\nby DQN with weight normalization for four games: on average weight normalization improves the\nperformance of DQN.\n\nGame             normal   weightnorm   Mnih\nBreakout         410      403          401\nEnduro           1,250    1,448        302\nSeaquest         7,188    7,375        5,286\nSpace Invaders   1,779    2,179        1,975\n\nFigure 6: Maximum evaluation scores obtained by DQN, using either the normal parameterization or\nusing weight normalization. The scores indicated by Mnih et al. are those reported by [25]: Our\nnormal parameterization is approximately equivalent to their method. Differences in scores may be\ncaused by small differences in our implementation. Speci\ufb01cally, the difference in our score on Enduro\nand that reported by [25] might be due to us not using a play-time limit during evaluation.\n\nFigure 5: Evaluation scores for Space Invaders obtained by DQN after each epoch of training, for\nboth the standard parameterization and using weight normalization. Learning rates for both cases\nwere selected to maximize the highest achieved test score.\n\n6 Conclusion\n\nWe have presented weight normalization, a simple reparameterization of the weight vectors in a\nneural network that accelerates the convergence of stochastic gradient descent optimization. Weight\nnormalization was applied to four different models in supervised image recognition, generative\nmodelling, and deep reinforcement learning, showing a consistent advantage across applications. The\nreparameterization method is easy to apply, has low computational overhead, and does not introduce\ndependencies between the examples in a minibatch, making it our default choice in the development\nof new deep learning architectures.\n\nReferences\n\n[1] S. Amari. Neural learning in structured parameter spaces - natural Riemannian gradient. 
In Advances in Neural Information Processing Systems, pages 127-133. MIT Press, 1997.
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.
[4] T. Cooijmans, N. Ballas, C. Laurent, and A. Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.
[5] G. Desjardins, K. Simonyan, R. Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2062-2070, 2015.
[6] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.
[7] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2016.
[8] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[9] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[10] R. Grosse and R. Salakhutdinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In ICML, pages 2304-2313, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[13] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
In ICML, 2015.
[15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2013.
[17] D. P. Kingma, T. Salimans, and M. Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.
[18] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
[19] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.
[20] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Deep Learning and Representation Learning Workshop, NIPS, 2014.
[21] M. Lin, C. Qiang, and S. Yan. Network in network. In ICLR: Conference Track, 2014.
[22] J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735-742, 2010.
[23] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. arXiv preprint arXiv:1503.05671, 2015.
[24] D. Mishkin and J. Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
[25] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[26] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
[27] T. Raiko, H. Valpola, and Y. LeCun.
Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, pages 924-932, 2012.
[28] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278-1286, 2014.
[29] T. Salimans, D. P. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, 2015.
[30] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In ICLR Workshop Track, 2015.
[31] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pages 545-560, 2005.
[32] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, pages 1139-1147, 2013.
[33] S. Zhang, H. Jiang, S. Wei, and L.-R. Dai. Rectified linear neural networks with tied-scalar regularization for LVCSR. In INTERSPEECH, pages 2635-2639, 2015.