{"title": "How to Initialize your Network? Robust Initialization for WeightNorm & ResNets", "book": "Advances in Neural Information Processing Systems", "page_first": 10902, "page_last": 10911, "abstract": "Residual networks (ResNet) and weight normalization play an important role in various deep learning applications. However, parameter initialization strategies have not been studied previously for weight normalized networks and, in practice, initialization methods designed for un-normalized networks are used as a proxy. Similarly, initialization for ResNets have also been studied for un-normalized networks and often under simplified settings ignoring the shortcut connection. To address these issues, we propose a novel parameter initialization strategy that avoids explosion/vanishment of information across layers for weight normalized networks with and without residual connections. The proposed strategy is based on a theoretical analysis using mean field approximation. We run over 2,500 experiments and evaluate our proposal on image datasets showing that the proposed initialization outperforms existing initialization methods in terms of generalization performance, robustness to hyper-parameter values and variance between seeds, especially when networks get deeper in which case existing methods fail to even start training. 
Finally, we show that using our initialization in conjunction with learning rate warmup is able to reduce the gap between the performance of weight normalized and batch normalized networks.", "full_text": "How to Initialize your Network?\n\nRobust Initialization for WeightNorm & ResNets\n\nDevansh Arpit\u2217\u2020, V\u00edctor Campos\u2217\u2021, Yoshua Bengio\u00a7\n\u2020Salesforce Research, \u2021Barcelona Supercomputing Center,\n\n\u00a7Montr\u00e9al Institute for Learning Algorithms, Universit\u00e9 de Montr\u00e9al, CIFAR Senior Fellow\n\ndevansharpit@gmail.com, victor.campos@bsc.es\n\nAbstract\n\nResidual networks (ResNet) and weight normalization play an important role in\nvarious deep learning applications. However, parameter initialization strategies\nhave not been studied previously for weight normalized networks and, in practice,\ninitialization methods designed for un-normalized networks are used as a proxy.\nSimilarly, initialization for ResNets have also been studied for un-normalized\nnetworks and often under simpli\ufb01ed settings ignoring the shortcut connection.\nTo address these issues, we propose a novel parameter initialization strategy that\navoids explosion/vanishment of information across layers for weight normalized\nnetworks with and without residual connections. The proposed strategy is based\non a theoretical analysis using mean \ufb01eld approximation. We run over 2,500\nexperiments and evaluate our proposal on image datasets showing that the proposed\ninitialization outperforms existing initialization methods in terms of generalization\nperformance, robustness to hyper-parameter values and variance between seeds,\nespecially when networks get deeper in which case existing methods fail to even\nstart training. 
Finally, we show that using our initialization in conjunction with\nlearning rate warmup is able to reduce the gap between the performance of weight\nnormalized and batch normalized networks.\n\n1\n\nIntroduction\n\nParameter initialization is an important aspect of deep network optimization and plays a crucial role\nin determining the quality of the \ufb01nal model. In order for deep networks to learn successfully using\ngradient descent based methods, information must \ufb02ow smoothly in both forward and backward\ndirections [5, 10, 8, 7]. Too large or too small parameter scale leads to information exploding or\nvanishing across hidden layers in both directions. This could lead to loss being stuck at initialization\nor quickly diverging at the beginning of training. Beyond these characteristics near the point\nof initialization itself, we argue that the choice of initialization also has an impact on the \ufb01nal\ngeneralization performance. This non-trivial relationship between initialization and \ufb01nal performance\nemerges because good initializations allow the use of larger learning rates which have been shown in\nexisting literature to correlate with better generalization [12, 26, 27].\nWeight normalization [23] accelerates convergence of stochastic gradient descent optimization by\nre-parameterizing weight vectors in neural networks. However, previous works have not studied\ninitialization strategies for weight normalization and it is a common practice to use initialization\nschemes designed for un-normalized networks as a proxy. We study initialization conditions for\nweight normalized ReLU networks, and propose a new initialization strategy for both plain and\nresidual architectures.\n\n\u2217Equal contribution. 
Work done while V\u00edctor Campos was an intern at Salesforce Research.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThe main contribution of this work is the theoretical derivation of a novel initialization strategy\nfor weight normalized ReLU networks, with and without residual connections, that prevents in-\nformation \ufb02ow from exploding/vanishing in forward and backward pass. Extensive experimental\nevaluation shows that the proposed initialization increases robustness to network depth, choice of\nhyper-parameters and seed. When combining the proposed initialization with learning rate warmup,\nwe are able to use learning rates as large as the ones used with batch normalization [11] and signif-\nicantly reduce the generalization gap between weight and batch normalized networks reported in\nthe literature [4, 25]. Further analysis reveals that our proposal initializes networks in regions of\nthe parameter space that have low curvature, thus allowing the use of large learning rates which are\nknown to correlate with better generalization [12, 26, 27].\n\n2 Background and Existing Work\n\nWeight Normalization: previous works have considered re-parameterizations that normalize weights\nin neural networks as means to accelerate convergence. In Arpit et al. [1], the pre- and post-activations\nare scaled/summed with constants depending on the activation function, ensuring that the hidden\nactivations have 0 mean and unit variance, especially at initialization. However, their work makes\nassumptions on the distribution of input and pre-activations of the hidden layers in order to make\nthese guarantees. Weight normalization [23] is a simpler alternative, and the authors propose to use a\ndata-dependent initialization [19, 15] for the introduced re-parameterization. 
This operation improves\nthe \ufb02ow of information, but its dependence on statistics computed from a batch of data may make it\nsensitive to the samples used to estimate the initial values.\nResidual Network Architecture: residual networks (ResNets) [10] have become a cornerstone of\ndeep learning due to their state-of-the-art performance in various applications. However, when using\nresidual networks with weight normalization instead of batch normalization [11], they have been\nshown to have signi\ufb01cantly worse generalization performance. For instance, Gitman and Ginsburg\n[4] and Shang et al. [25] have shown that ResNets with weight normalization suffer from severe\nover-\ufb01tting and have concluded that batch normalization has an implicit regularization effect.\nInitialization strategies: there exists extensive literature studying initialization schemes for un-\nnormalized plain networks (c.f. Glorot and Bengio [5], He et al. [9], Saxe et al. [24], Poole et al.\n[22], Pennington et al. [20, 21], to name some of the most prominent ones). Similarly, previous\nworks have studied initialization strategies for un-normalized ResNets [8, 28, 29], but they lack\nlarge scale experiments demonstrating the effectiveness of the proposed approaches and consider a\nsimpli\ufb01ed ResNet setup where shortcut connections are ignored, even though they play an important\nrole [13]. Zhang et al. [33] propose an initialization scheme for un-normalized ResNets which\ninvolves initializing the different types of layers individually using carefully designed schemes. They\nprovide large scale experiments on various datasets, and show that the generalization gap between\nbatch normalized ResNets and un-normalized ResNets can be reduced when using their initialization\nalong with additional domain-speci\ufb01c regularization techniques like cutout [3] and mixup [32]. 
All\nthe aforementioned works consider un-normalized networks and, to the best of our knowledge, there\nhas been no formal analysis of initialization strategies for weight normalized networks that allow a\nsmooth \ufb02ow of information in the forward and backward pass.\n\n3 Weight Normalized ReLU Networks\n\nFigure 1: Experiments on weight normalized networks using synthetic data to con\ufb01rm theoretical\npredictions. Top: feed forward networks. Bottom: residual networks. We report results for networks\nof width \u223c U(150, 250) (solid lines) and width \u223c U(950, 1050) (dashed lines). The proposed\ninitialization prevents explosion/vanishing of the norm of hidden activations (left) and gradients\n(right) across layers at initialization. For ResNets, norm growth is O(1) for an arbitrary depth network.\nNaively initializing g = 1 results in vanishing/exploding signals.\n\nWe derive initialization schemes for weight normalized networks in the asymptotic setting where\nnetwork width tends to in\ufb01nity, similarly to previous analysis for un-normalized networks [5, 10]. We\nde\ufb01ne an L layer weight normalized ReLU network f_\u03b8(x) = h^L recursively, where the l-th hidden\nlayer\u2019s activation is given by,\n\nh^l := ReLU(a^l)\na^l := g^l \u2299 W\u0302^l h^{l\u22121} + b^l,    l \u2208 {1, 2, \u2026, L}    (1)\n\nwhere a^l are the pre-activations, h^l \u2208 R^{n_l} are the hidden activations, h^0 = x is the input to the\nnetwork, W^l \u2208 R^{n_l \u00d7 n_{l\u22121}} are the weight matrices, b^l \u2208 R^{n_l} are the bias vectors, and g^l \u2208 R^{n_l} is a\nscale factor. We denote the set of all learnable parameters as \u03b8 = {(W^l, g^l, b^l)}_{l=1}^{L}. Notation W\u0302^l\nimplies that each row vector of W\u0302^l has unit norm, i.e.,\n\nW\u0302^l_i = W^l_i / \u2016W^l_i\u2016_2    \u2200i    (2)\n\nthus g^l_i controls the norm of each weight vector, whereas W\u0302^l_i controls its direction. 
Finally, we will\nmake use of the notation \u2113(f_\u03b8(x), y) to represent a differentiable loss function over the output of the\nnetwork.\nForward pass: we \ufb01rst study the forward pass and derive an initialization scheme such that for any\ngiven input, the norm of hidden activation of any layer and input norm are asymptotically equal.\nFailure to do so prevents training from starting, as studied by Hanin and Rolnick [8] for vanilla deep\nfeedforward networks. The theorem below shows that a normalized linear transformation followed\nby ReLU non-linearity is a norm preserving transform in expectation when proper scaling is used.\n\nTheorem 1 Let v = ReLU(\u221a(2n/m) \u00b7 R\u0302u), where u \u2208 R^n and R\u0302 \u2208 R^{m\u00d7n}. If R_i \u223c P i.i.d.,\nwhere P is any isotropic distribution in R^n, or alternatively R\u0302 is a randomly generated matrix with\northogonal rows, then for any \ufb01xed vector u, E[\u2016v\u2016^2] = K_n \u00b7 \u2016u\u2016^2 where,\n\nK_n = (2 S_{n\u22121} / S_n) \u00b7 (2/3 \u00b7 4/5 \u22ef (n\u22122)/(n\u22121))          if n is even,\nK_n = (2 S_{n\u22121} / S_n) \u00b7 (1/2 \u00b7 3/4 \u22ef (n\u22122)/(n\u22121)) \u00b7 \u03c0/2    otherwise,    (3)\n\nand S_n is the surface area of a unit n-dimensional sphere.\n\nThe constant K_n seems hard to evaluate analytically, but remarkably, we empirically \ufb01nd that K_n = 1\nfor all integers n > 1. Thus applying the above theorem to Eq. 1 implies that every hidden layer in a\nweight normalized ReLU network is norm preserving for an in\ufb01nitely wide network if the elements\nof g^l are initialized with \u221a(2 n_{l\u22121}/n_l). Therefore, we can recursively apply the above argument to\neach layer in a normalized deep ReLU network starting from the input to the last layer and have that\nthe network output norm is approximately equal to the input norm, i.e. \u2016f_\u03b8(x)\u2016 \u2248 \u2016x\u2016. 
Figure 1 (top\nleft) shows a synthetic experiment with a 20 layer weight normalized MLP that empirically con\ufb01rms\nthe above theory. Details for this experiment can be found in the supplementary material.\n\n[Figure 1 plots. x-axes: Layer Index (i) (top), Block Index (b) (bottom). y-axes: ||h^i||/||x||, ||h^i||/||h^L||, ||h^b||/||x||, ||h^b||/||h^B||. Legend: He (g=1), Proposed (orthogonal), Proposed (He).]\n\nBackward pass: the goal of studying the backward pass is to derive conditions for which gradients\nneither explode nor vanish, which is essential for gradient descent based training. Therefore, we\nare interested in the value of \u2016\u2202\u2113(f_\u03b8(x), y)/\u2202a^l\u2016 for different layers, which are indexed by l. To prevent\nexploding/vanishing gradients, the value of this term should be similar for all layers. We begin by\nwriting the recursive relation between the value of this derivative for consecutive layers,\n\n\u2202\u2113(f_\u03b8(x), y)/\u2202a^l = (\u2202a^{l+1}/\u2202a^l) \u00b7 \u2202\u2113(f_\u03b8(x), y)/\u2202a^{l+1}    (4)\n= g^{l+1} \u2299 1(a^l) \u2299 ((W\u0302^{l+1})^T \u2202\u2113(f_\u03b8(x), y)/\u2202a^{l+1})    (5)\n\nWe note that conditioned on a \ufb01xed h^{l\u22121}, each dimension of 1(a^l) in the above equation follows an\ni.i.d. sampling from Bernoulli distribution with probability 0.5 at initialization. This is formalized in\nLemma 1 in the supplementary material. We now consider the following theorem,\n\nTheorem 2 Let v = \u221a2 \u00b7 z \u2299 (R\u0302^T u), where u \u2208 R^m, R \u2208 R^{m\u00d7n} and z \u2208 R^n. If each R_i \u223c P i.i.d.,\nwhere P is any isotropic distribution in R^n or alternatively R\u0302 is a randomly generated matrix with\northogonal rows and z_i \u223c Bernoulli(0.5) i.i.d., then for any \ufb01xed vector u, E[\u2016v\u2016^2] = \u2016u\u2016^2.\n\nIn order to apply the above theorem to Eq. 5, we assume that u := \u2202\u2113(f_\u03b8(x), y)/\u2202a^{l+1} is independent of the\nother terms, similar to He et al. [10]. This simpli\ufb01es the analysis by allowing us to treat \u2202\u2113(f_\u03b8(x), y)/\u2202a^{l+1}\nas \ufb01xed and take expectation w.r.t. the other terms, over W^l and W^{l+1}. Thus \u2016\u2202\u2113(f_\u03b8(x), y)/\u2202a^l\u2016 =\n\u2016\u2202\u2113(f_\u03b8(x), y)/\u2202a^{l+1}\u2016 \u2200l if we initialize g^l = \u221a2 \u00b7 1. This also shows that \u2202a^{l+1}/\u2202a^l is a norm preserving transform.\nHence applying this theorem recursively to Eq. 5 for all l yields that \u2016\u2202\u2113(f_\u03b8(x), y)/\u2202a^l\u2016 \u2248 \u2016\u2202\u2113(f_\u03b8(x), y)/\u2202a^L\u2016\n\u2200l thereby avoiding gradient explosion/vanishment. Note that the above result is strictly better for\northogonal weight matrices compared with other isotropic distributions (see proof). Figure 1 (top\nright) shows a synthetic experiment with a 20 layer weight normalized MLP to con\ufb01rm the above\ntheory. The details for this experiment are provided in the supplementary material.\nWe also point out that the \u221a2 factor that appears in theorems 1 and 2 is due to the presence of ReLU\nactivation. In the absence of ReLU, this factor should be 1. We will use this fact in the next section\nwith the ResNet architecture.\nImplementation details: since there is a discrepancy between the initialization required by the\nforward and backward pass, we tested both (and combinations of them) in our preliminary experiments\nand found the one proposed for the forward pass to be superior. We therefore propose to initialize\nweight matrices W^l to be orthogonal (footnote 2), b^l = 0, and g^l = \u221a(2 n_{l\u22121}/n_l) \u00b7 1, where n_{l\u22121} and n_l\nrepresent the fan-in and fan-out of the l-th layer respectively. 
Our results apply to both fully-connected\nand convolutional (footnote 3) networks.\n\n4 Residual Networks\n\nSimilar to the previous section, we derive an initialization strategy for ResNets in the in\ufb01nite width\nsetting. We de\ufb01ne a residual network R({F_b(\u00b7)}_{b=0}^{B\u22121}, \u03b8, \u03b1) with B residual blocks and parameters \u03b8\nwhose output is denoted as f_\u03b8(\u00b7) = h^B, and the hidden states are de\ufb01ned recursively as,\n\nh^{b+1} := h^b + \u03b1 F_b(h^b),    b \u2208 {0, 1, \u2026, B \u2212 1}    (6)\n\nwhere h^0 = x is the input, h^b denotes the hidden representation after applying b residual blocks and\n\u03b1 is a scalar that scales the output of each residual block. The b-th residual\nblock F_b(\u00b7), b \u2208 {0, 1, \u2026, B \u2212 1}, is a feed-forward ReLU network. We discuss how to deal with shortcut connections during\ninitialization separately. We use the notation <\u00b7,\u00b7> to denote dot product between the argument\nvectors.\n\nFootnote 2: We note that Saxe et al. [24] propose to initialize weights of un-normalized deep ReLU networks to be\northogonal with scale \u221a2. Our derivation and proposal is for weight normalized ReLU networks where we study\nboth Gaussian and orthogonal initialization and show the latter is superior.\n\nFootnote 3: For convolutional layers with kernel size k and c channels, we de\ufb01ne n_{l\u22121} = k^2 c_{l\u22121} and n_l = k^2 c_l [9].\n\nForward pass: here we derive an initialization strategy for residual networks that prevents information\nin the forward pass from exploding/vanishing independent of the number of residual blocks,\nassuming that each residual block is initialized such that it preserves information in the forward pass.\nTheorem 3 Let R({F_b(\u00b7)}_{b=0}^{B\u22121}, \u03b8, \u03b1) be a residual network with output f_\u03b8(\u00b7). Assume that each\nresidual block F_b(\u00b7) 
(\u2200b) is designed such that at initialization, \u2016F_b(h)\u2016 = \u2016h\u2016 for any input h to\nthe residual block, and <u, F_b(u)> \u2248 0. If we set \u03b1 = 1/\u221aB, then \u2203c \u2208 [\u221a2, \u221ae], such that,\n\n\u2016f_\u03b8(x)\u2016 \u2248 c \u00b7 \u2016x\u2016    (7)\n\nThe assumption <u, F_b(u)> \u2248 0 is reasonable because at initialization, F_b(u) is a random trans-\nformation in a high dimensional space which will likely rotate a vector to be orthogonal to itself.\nTo understand the rationale behind the second assumption, \u2016F_b(h)\u2016 = \u2016h\u2016, recall that F_b(\u00b7) is\nessentially a non-residual network. Therefore we can initialize each such block using the scheme\ndeveloped in Section 3 which due to Theorem 1 (see discussion below it) will guarantee that the\nnorm of the output of F_b(\u00b7) equals the norm of the input to the block. Figure 1 (bottom left) shows a\nsynthetic experiment with a 40 block weight normalized ResNet to con\ufb01rm the above theory. The\nratio of norms of output to input lies in [\u221a2, \u221ae] independent of the number of residual blocks exactly\nas predicted by the theory. The details for this experiment can be found in the supplementary material.\nBackward pass: we now study the backward pass for residual networks.\nTheorem 4 Let R({F_b(\u00b7)}_{b=0}^{B\u22121}, \u03b8, \u03b1) be a residual network with output f_\u03b8(\u00b7). Assume that each\nresidual block F_b(\u00b7) (\u2200b) is designed such that at initialization, \u2016(\u2202F_b(h^b)/\u2202h^b) u\u2016 = \u2016u\u2016 for any \ufb01xed\ninput u of appropriate dimensions, and <\u2202\u2113/\u2202h^b, (\u2202F_{b\u22121}/\u2202h^{b\u22121}) \u00b7 \u2202\u2113/\u2202h^b> \u2248 0. If \u03b1 = 1/\u221aB, then \u2203c \u2208 [\u221a2, \u221ae],\nsuch that,\n\n\u2016\u2202\u2113/\u2202h^1\u2016 \u2248 c \u00b7 \u2016\u2202\u2113/\u2202h^B\u2016    (8)\n\nThe above theorem shows that scaling the output of the residual block with 1/\u221aB prevents explosion/vanishing\nof gradients irrespective of the number of residual blocks. The rationale behind the\nassumptions is similar to that given for the forward pass above. Figure 1 (bottom right) shows a\nsynthetic experiment with a 40 block weight normalized ResNet to con\ufb01rm the above theory. Once\nagain, the ratio of norms of gradient w.r.t. input to output lies in [\u221a2, \u221ae] independent of the number\nof residual blocks exactly as predicted by the theory. The details can be found in the supplementary\nmaterial.\nShortcut connections: a ResNet often has K stages [10], where each stage is characterized by\none shortcut connection and B_k residual blocks, leading to a total of \u2211_{k=1}^{K} B_k blocks. In order\nto account for shortcut connections, we need to ensure that the input and output of each stage in a\nResNet are at the same scale; the same argument applies during the backward pass. To achieve this,\nwe scale the output of the residual blocks in each stage using the total number of residual blocks in\nthat stage. Then theorems 3 and 4 treat each stage of the network as a ResNet and normalize the \ufb02ow\nof information in both directions to be independent of the number of residual blocks.\nImplementation details: we consider ResNets with shortcut connections and architecture de-\nsign similar to that proposed in [10] with the exception that our residual block structure is\nConv\u2192ReLU\u2192Conv, similar to B(3, 3) blocks in [31], as illustrated in the supplementary material (footnote 4). 
Weights of all layers in the network are initialized to be orthogonal and biases are set to zero.\nThe gain parameter of weight normalization is initialized to be g = \u221a(\u03b3 \u00b7 fan-in/fan-out) \u00b7 1. We set\n\u03b3 = 1/B_k for the last convolutional layer of each residual block in the k-th stage (footnote 5). For the rest of\nlayers we follow the strategy derived in Section 3, with \u03b3 = 2 when the layer is followed by ReLU,\nand \u03b3 = 1 otherwise.\n\nFootnote 4: More generally, our residual block design principle is D\u00d7[Conv\u2192ReLU\u2192]Conv, where D \u2208 Z.\nFootnote 5: We therefore absorb \u03b1 in Eq. 6 into the gain parameter g.\n\nFigure 2: Results for MLPs on MNIST. Dashed lines denote train accuracy, and solid lines denote\ntest accuracy. The accuracy of diverged runs is set to 0. Left: Accuracy as a function of depth. A\nheld-out validation set is used to select the best model for each con\ufb01guration. Right: Accuracy for\neach job in our hyperparameter sweep, depicting robustness to hyperparameter con\ufb01gurations.\n\n5 Experiments\n\nWe study the impact of initialization on weight normalized networks across a wide variety of\ncon\ufb01gurations. Among others, we compare against the data-dependent initialization proposed by\nSalimans and Kingma [23], which initializes g and b so that all pre-activations in the network have\nzero mean and unit variance based on estimates collected from a single minibatch of data.\nCode is publicly available at https://github.com/victorcampos7/weightnorm-init. We\nrefer the reader to the supplementary material for detailed description of the hyperparameter settings\nfor each experiment, as well as for initial reinforcement learning results.\n\n5.1 Robustness Analysis of Initialization methods\u2013 Depth, Hyper-parameters and Seed\n\nThe dif\ufb01culty of training due to exploding and vanishing gradients increases with network depth. 
In\npractice, depth often complicates the search for hyperparameters that enable successful optimization,\nif any. This section presents a thorough evaluation of the impact of initialization on different network\narchitectures for increasing depths, as well as their robustness to hyperparameter con\ufb01gurations. We\nbenchmark fully-connected networks on MNIST [17], whereas CIFAR-10 [16] is considered for\nconvolutional and residual networks. We tune hyperparameters individually for each network depth\nand initialization strategy on a set of held-out examples, and report results on the test set. We refer\nthe reader to the supplementary material for a detailed description of the considered hyperparameters.\nFully-connected networks: results in Figure 2 (left) show that the data-dependent initialization can\nbe used to train networks of up to depth 20, but training diverges for deeper nets even when using\nvery small learning rates, e.g. 10\u22125. On the other hand, we managed to successfully train very deep\nnetworks with up to 200 layers using the proposed initialization. When analyzing all runs in the\ngrid search, we observe that the proposed initialization is more robust to the particular choice of\nhyperparameters (Figure 2, right). In particular, the proposed initialization allows using learning rates\nup to 10\u00d7 larger for most depths.\nConvolutional networks: we adopt a similar architecture to that in Xiao et al. [30], where all layers\nhave 3 \u00d7 3 kernels and a \ufb01xed width. The two \ufb01rst layers use a stride of 2 in order to reduce the\nmemory footprint. Results are depicted in Figure 3 (left) and show a similar trend to that observed for\nfully-connected nets, with the data-dependent initialization failing at optimizing very deep networks.\nResidual networks: we construct residual networks of varying depths by controlling the number\nof residual blocks per stage in the wide residual network (WRN) architecture with k = 1. 
Training\nnetworks with thousands of layers is computationally intensive, so we measure the test accuracy\nafter a single epoch of training [33]. We consider two additional baselines for these experiments:\n(1) the default initialization in PyTorch (footnote 6), which initializes g_i = \u2016W_i\u2016_2, and (2) a modi\ufb01cation\nof the initialization proposed by Hanin and Rolnick [8] to fairly adapt it to weight normalized\nmulti-stage ResNets. For the k-th stage with B_k blocks, the stage-wise Hanin scheme initializes\nthe gain of the last convolutional layer in each block as g = 0.9^b \u00b7 1, where b \u2208 {1, \u2026, B_k} refers\nto the block number within a stage. All other parameters are initialized in a way identical to our\nproposal, so that information across the layers within residual blocks remains preserved. We report\nresults over 5 random seeds for each con\ufb01guration in Figure 3 (right), which shows that the proposed\ninitialization achieves similar accuracy rates across the wide range of evaluated depths.\n\nFootnote 6: https://pytorch.org/docs/stable/_modules/torch/nn/utils/weight_norm.html\n\n[Figure 2 plots. Title: MNIST MLP + WN. x-axes: Depth (left), Hyperparameter Combination (right). y-axis: Accuracy. Legend: Data-dependent (He), Data-dependent (orthogonal), Proposed (He), Proposed (orthogonal).]\n\nFigure 3: Accuracy as a function of depth on CIFAR-10 for CNNs (left), and WRNs (right). Dashed\nlines denote train accuracy, and solid lines denote validation accuracy. Note that WRNs are trained\nfor a single epoch due to the computational burden of training extremely deep networks.\n\nFigure 4: Robustness to seed of different initialization schemes on WRN-40-10. We launch 20\ntraining runs for every con\ufb01guration, and measure the percentage of runs that reach epoch 3 without\ndiverging. Weight normalized ResNets bene\ufb01t from learning rate warmup, which enables the usage of\nhigher learning rates. The proposed initialization is the most robust scheme across all con\ufb01gurations.\n\n
PyTorch\u2019s\ndefault initialization diverges for most depths, and the data-dependent baseline converges signi\ufb01cantly\nslower for deeper networks due to the small learning rates used in order to avoid divergence. Although\nthe stage-wise Hanin strategy and the proposed initialization achieve similar accuracy rates, we were\nable to use an order of magnitude larger learning rates with the latter, which indicates increased\nrobustness against hyperparameter con\ufb01gurations.\nTo further evaluate the robustness of each initialization strategy, we train WRN-40-10 networks for\n3 epochs with different learning rates, with and without learning rate warmup [6]. We repeat each\nexperiment 20 times using different random seeds, and report the percentage of runs that successfully\ncompleted all 3 epochs without diverging in Figure 4. We observed that learning rate warmup greatly\nimproved the range of learning rates that work well for all initializations, but the proposed strategy\nmanages to train more robustly across all tested con\ufb01gurations.\n\n5.2 Comparison with Batch Normalization\n\nExisting literature has pointed towards an implicit regularization effect of batch normalization [18],\nwhich prevented weight normalized models from matching the \ufb01nal performance of batch normalized\nones [4]. On the other hand, previous works have shown that larger learning rates facilitate \ufb01nding\nwider minima which correlate with better generalization performance [14, 12, 26, 27], and the\nproposed initialization and learning rate warmup have proven very effective in stabilizing training\nfor high learning rates. This section aims at evaluating the \ufb01nal performance of weight normalized\nnetworks trained with high learning rates, and comparing them with batch normalized networks.\nWe evaluate models on CIFAR-10 and CIFAR-100. 
We set aside 10% of the training data for\nhyperparameter tuning, whereas some previous works use the test set for such purpose [10, 31]. This\ndifference in the experimental setup explains why the achieved error rates are slightly larger than\nthose reported in the literature. For each architecture we use the default hyperparameters reported in\nliterature for batch normalized networks, and tune only the initial learning rate for weight normalized\nmodels.\n\n[Figure 3 plots. Panels: CIFAR-10 CNN+WN (x-axis: Depth), CIFAR-10 WRN+WN (x-axis: Number of layers). y-axis: Accuracy. Legend: Data-dependent (He), Data-dependent (orthogonal), PyTorch default (orthogonal), Proposed (He), Proposed (orthogonal), Stage-wise Hanin (orthogonal).]\n[Figure 4 plots. Panels: Without learning rate warmup, With learning rate warmup. x-axis: Learning rate (0.001, 0.005, 0.01, 0.05, 0.1). y-axis ticks: 50%, 100%. Legend: Data-dependent (orthogonal), PyTorch default (orthogonal), Proposed (orthogonal), Stage-wise Hanin (orthogonal).]\n\nResults in Table 1 show that the proposed initialization scheme, when combined with learning rate\nwarmup, allows weight normalized residual networks to achieve comparable error rates to their batch\nnormalized counterparts. We note that previous works reported a large generalization gap between\nweight and batch normalized networks [25, 4]. The only architecture for which the batch normalized\nvariant achieves a superior performance is WRN-40-10, for which the weight normalized version is\nnot able to completely \ufb01t the training set before reaching the epoch limit. This phenomenon differs\nfrom the generalization gap reported in previous works, and might be caused by sub-optimal learning\nrate schedules that were tailored for networks with batch normalization.\nTable 1: Comparison between Weight Normalization with proposed initialization and Batch Normalization. 
Results are reported as mean and std over 5 runs.\n\nDataset | Architecture | Method | Test Error (%)\nCIFAR-10 | ResNet-56 | WN w/ datadep init | 9.19 \u00b1 0.24\nCIFAR-10 | ResNet-56 | WN w/ proposed init | 7.87 \u00b1 0.14\nCIFAR-10 | ResNet-56 | WN w/ proposed init + warmup | 7.20 \u00b1 0.12\nCIFAR-10 | ResNet-56 | BN (He et al. [10]) | 6.97\nCIFAR-10 | ResNet-110 | WN w/ datadep init | 9.33 \u00b1 0.10\nCIFAR-10 | ResNet-110 | WN w/ proposed init | 7.71 \u00b1 0.14\nCIFAR-10 | ResNet-110 | WN w/ proposed init + warmup | 6.69 \u00b1 0.11\nCIFAR-10 | ResNet-110 | WN (Shang et al. [25]) | 7.46\nCIFAR-10 | ResNet-110 | BN (He et al. [10]) | 6.61 \u00b1 0.16\nCIFAR-10 | WRN-40-10 | WN w/ datadep init + cutout | 6.10 \u00b1 0.23\nCIFAR-10 | WRN-40-10 | WN w/ proposed init + cutout | 4.74 \u00b1 0.14\nCIFAR-10 | WRN-40-10 | WN w/ proposed init + cutout + warmup | 4.75 \u00b1 0.08\nCIFAR-10 | WRN-40-10 | BN w/ orthogonal init + cutout | 3.53 \u00b1 0.38\nCIFAR-100 | ResNet-164 | WN w/ datadep init + cutout | 30.26 \u00b1 0.51\nCIFAR-100 | ResNet-164 | WN w/ proposed init + cutout | 27.30 \u00b1 0.49\nCIFAR-100 | ResNet-164 | WN w/ proposed init + cutout + warmup | 25.31 \u00b1 0.26\nCIFAR-100 | ResNet-164 | BN w/ orthogonal init + cutout | 25.52 \u00b1 0.17\n\n5.3 Initialization Method and Generalization Gap\n\nThe motivation behind designing good parameter initialization is mainly for better optimization\nat the beginning of training, and it is not apparent why our initialization is able to reduce the\ngeneralization gap between weight normalized and batch normalized networks [4, 25]. On this note\nwe point out that a number of papers have shown how using stochastic gradient descent (SGD)\nwith larger learning rates facilitates \ufb01nding wider minima which correlate with better generalization\nperformance [14, 12, 26, 27]. Additionally, it is often not possible to use large learning rates with\nweight normalization with traditional initializations. 
Therefore, we believe that the use of large\nlearning rates allowed by our initialization played an important role in this respect.\nIn order to understand why our initialization allows using larger learning rates than existing ones,\nwe compute the (log) spectral norm of the Hessian at initialization (using the power method) for the\nvarious initialization methods considered in our experiments using 10% of the training samples. The results\nare shown in Table 2. We find that the local curvature (spectral norm) is smallest for the proposed\ninitialization. These results are complementary to the seed robustness experiment shown in Figure 4.\nTable 2: Log (base 10) spectral norm of Hessian at initialization for different initializations. Smaller\nvalues imply lower curvature. N/A means that the computation diverged. The proposed strategy\ninitializes at a point with the lowest curvature, which explains why larger learning rates can be used.\n\nDataset | Model | PyTorch default | Data-dependent | Stage-wise Hanin | Proposed\nCIFAR-10 | WRN-40-10 | 4.68 \u00b1 0.60 | 3.01 \u00b1 0.02 | 7.14 \u00b1 0.72 | 1.31 \u00b1 0.12\nCIFAR-100 | ResNet-164 | 9.56 \u00b1 0.54 | 2.68 \u00b1 0.09 | N/A | 1.56 \u00b1 0.18\n\n6 Conclusion and Future Work\n\nWeight normalization (WN) is frequently used in different network architectures due to its simplicity.\nHowever, the lack of existing theory on parameter initialization of weight normalized networks has\nled practitioners to arbitrarily pick existing initializations designed for un-normalized networks. To\naddress this issue, we derived parameter initialization schemes for weight normalized networks, with\nand without residual connections, that avoid explosion/vanishment of information in the forward and\nbackward passes. To the best of our knowledge, no prior work has formally studied this setting. 
Through\nthorough empirical evaluation, we showed that the proposed initialization increases robustness to\nnetwork depth, choice of hyper-parameters and seed compared to existing initialization methods\nthat are not designed specifically for weight normalized networks. We found that the proposed\nscheme initializes networks in low curvature regions, which enables the use of large learning rates.\nBy doing so, we were able to significantly reduce the performance gap between batch and weight\nnormalized networks which had been previously reported in the literature. Therefore, we hope that\nour proposal replaces the current practice of choosing arbitrary initialization schemes for weight\nnormalized networks.\nWe believe our proposal can also help in achieving better performance using WN in settings which\nare not well-suited for batch normalization. One such scenario is the training of recurrent networks in\nbackpropagation through time settings, which often suffer from exploding/vanishing gradients, and\nwhere batch statistics are timestep-dependent [2]. The current analysis was done for feedforward networks,\nand we plan to extend it to the recurrent setting. Another application where batch normalization often\nfails is reinforcement learning, as good estimates of activation statistics are not available due to the\nonline nature of some of these algorithms. We confirmed the benefits of our proposal in preliminary\nreinforcement learning experiments, which can be found in the supplementary material.\n\nAcknowledgments\n\nWe would like to thank Giancarlo Kerg for proofreading the paper. DA was supported by IVADO\nduring his time at Mila where part of this work was done.\n\nReferences\n[1] Devansh Arpit, Yingbo Zhou, Bhargava U Kota, and Venu Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. 
In ICML, 2016.\n\n[2] Tim Cooijmans, Nicolas Ballas, C\u00e9sar Laurent, \u00c7a\u011flar G\u00fcl\u00e7ehre, and Aaron Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.\n\n[3] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.\n\n[4] Igor Gitman and Boris Ginsburg. Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. arXiv preprint arXiv:1709.08145, 2017.\n\n[5] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.\n\n[6] Priya Goyal, Piotr Doll\u00e1r, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.\n\n[7] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In NeurIPS, 2018.\n\n[8] Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. In NeurIPS, 2018.\n\n[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.\n\n[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.\n\n[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.\n\n[12] Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.\n\n[13] Stanis\u0142aw Jastrz\u0119bski, Devansh Arpit, Nicolas Ballas, Vikas Verma, Tong Che, and Yoshua Bengio. Residual connections encourage iterative inference. In ICLR, 2018.\n\n[14] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.\n\n[15] Philipp Kr\u00e4henb\u00fchl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.\n\n[16] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.\n\n[17] Yann Lecun and Corinna Cortes. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist/.\n\n[18] Ping Luo, Xinjiang Wang, Wenqi Shao, and Zhanglin Peng. Towards understanding regularization in batch normalization. In ICLR, 2019.\n\n[19] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.\n\n[20] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In NIPS, 2017.\n\n[21] Jeffrey Pennington, Samuel S Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In AISTATS, 2018.\n\n[22] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In NIPS, 2016.\n\n[23] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.\n\n[24] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. 
In ICLR, 2014.\n\n[25] Wenling Shang, Justin Chiu, and Kihyuk Sohn. Exploring normalization in deep residual networks with concatenated rectified linear units. In AAAI, 2017.\n\n[26] Samuel L Smith and Quoc V Le. A Bayesian perspective on generalization and stochastic gradient descent. In ICLR, 2018.\n\n[27] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don\u2019t decay the learning rate, increase the batch size. In ICLR, 2018.\n\n[28] Masato Taki. Deep residual networks and weight initialization. arXiv preprint arXiv:1709.02956, 2017.\n\n[29] Wojciech Tarnowski, Piotr Warcho\u0142, Stanis\u0142aw Jastrz\u0119bski, Jacek Tabor, and Maciej A Nowak. Dynamical isometry is achieved in residual networks in a universal way for any activation function. arXiv preprint arXiv:1809.08848, 2018.\n\n[30] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In ICML, 2018.\n\n[31] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.\n\n[32] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.\n\n[33] Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. In ICLR, 2019.\n", "award": [], "sourceid": 5830, "authors": [{"given_name": "Devansh", "family_name": "Arpit", "institution": "Salesforce/MILA"}, {"given_name": "V\u00edctor", "family_name": "Campos", "institution": "Barcelona Supercomputing Center"}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": "Mila - University of Montreal"}]}