{"title": "How to Start Training: The Effect of Initialization and Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 571, "page_last": 581, "abstract": "We identify and study two common failure modes for early training in deep ReLU nets. For each, we give a rigorous proof of when it occurs and how to avoid it, for fully connected, convolutional, and residual architectures. We show that the first failure mode, exploding or vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly scaling the residual modules. We prove that the second failure mode, exponentially large variance of activation length, never occurs in residual nets once the first failure mode is avoided. In contrast, for fully connected nets, we prove that this failure mode can happen and is avoided by keeping constant the sum of the reciprocals of layer widths. We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allows much deeper networks to be trained.", "full_text": "How to Start Training:\n\nThe Effect of Initialization and Architecture\n\nBoris Hanin\n\nDepartment of Mathematics\n\nTexas A& M University\nCollege Station, TX, USA\nbhanin@math.tamu.edu\n\nDepartment of Mathematics\n\nMassachusetts Institute of Technology\n\nDavid Rolnick\n\nCambridge, MA, USA\ndrolnick@mit.edu\n\nAbstract\n\nWe identify and study two common failure modes for early training in deep ReLU\nnets. For each, we give a rigorous proof of when it occurs and how to avoid it, for\nfully connected, convolutional, and residual architectures. 
We show that the first failure mode, exploding or vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly scaling the residual modules. We prove that the second failure mode, exponentially large variance of activation length, never occurs in residual nets once the first failure mode is avoided. In contrast, for fully connected nets, we prove that this failure mode can happen and is avoided by keeping constant the sum of the reciprocals of layer widths. We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allow much deeper networks to be trained.

1 Introduction

Despite the growing number of practical uses for deep learning, training deep neural networks remains a challenge. Among the many possible obstacles to training, it is natural to distinguish two kinds: problems that prevent a given neural network from ever achieving better-than-chance performance, and problems that have to do with later stages of training, such as escaping flat regions and saddle points [12, 27], reaching spurious local minima [5, 15], and overfitting [2, 34]. This paper focuses specifically on two failure modes related to the first kind of difficulty:

(FM1): The mean length scale in the final layer increases/decreases exponentially with the depth.
(FM2): The empirical variance of length scales across layers grows exponentially with the depth.

Our main contributions and conclusions are:

• The mean and variance of activations in a neural network are both important in determining whether training begins. 
If both failure modes FM1 and FM2 are avoided, then a deeper network need not take longer to start training than a shallower network.

• FM1 is dependent on weight initialization. Initializing weights with the correct variance (in fully connected and convolutional networks) and correctly weighting residual modules (in residual networks) prevents the mean size of activations from becoming exponentially large or small as a function of the depth, allowing training to start for deeper architectures.

• For fully connected and convolutional networks, FM2 is dependent on architecture. Wider layers prevent FM2, again allowing training to start for deeper architectures. In the case of constant-width networks, the width should grow approximately linearly with the depth to avoid FM2.

• For residual networks, FM2 is largely independent of the architecture. Provided that residual modules are weighted to avoid FM1, FM2 can never occur. This qualitative difference between fully connected and residual networks can help to explain the empirical success of the latter, allowing deep and relatively narrow networks to be trained more readily.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

FM1 for fully connected networks has been previously studied [8, 21, 25]. In this failure mode, training may fail to start since the difference between network outputs may exceed machine precision even for moderate depth $d$. For ReLU activations, FM1 has been observed to be overcome by the initializations of He et al. [8]. We prove this fact rigorously (see Theorems 5 and 6). We find empirically that for poor initializations, training fails more frequently as networks become deeper (see Figures 1 and 4). Aside from [29], there appears to be less literature studying FM1 for residual networks (ResNets) [9]. 
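The variance criterion above can be checked numerically. The following is a minimal NumPy sketch (our illustration, not the authors' code) that estimates the mean normalized output length $M_d$ under two weight variances; Gaussian weights, zero biases, and the particular widths, depth, and trial count are simplifying assumptions made here for illustration.

```python
import numpy as np

def forward_md(rng, widths, var_scale, x, trials=200):
    """Estimate E[M_d], where M_d = ||act^(d)||^2 / n_d, by averaging over
    `trials` random initializations. Weights in each layer are Gaussian with
    variance var_scale / fan-in; biases are set to zero."""
    samples = []
    for _ in range(trials):
        a = x
        for n_in, n_out in zip(widths[:-1], widths[1:]):
            W = rng.normal(0.0, np.sqrt(var_scale / n_in), size=(n_out, n_in))
            a = np.maximum(W @ a, 0.0)  # ReLU
        samples.append(np.dot(a, a) / widths[-1])
    return float(np.mean(samples))

rng = np.random.default_rng(0)
widths = [100] * 21             # a depth-20 net of constant width 100
x = np.ones(100) / 10.0         # unit-norm input, so M_0 = 1/100
m_he    = forward_md(rng, widths, 2.0, x)  # variance 2/fan-in (He): M_d preserved
m_lecun = forward_md(rng, widths, 1.0, x)  # variance 1/fan-in: decays like 2^(-d)
```

With variance 2/fan-in the estimate stays near $M_0 = 0.01$, while variance 1/fan-in shrinks it by roughly a factor of $2^{-20}$, consistent with the exponential decay described above.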
We prove that the key to avoiding FM1 in ResNets is to correctly rescale the contributions of individual residual modules (see Theorems 5 and 6). Without this, we find empirically that training fails for deeper ResNets (see Figure 2).

FM2 is more subtle and does not seem to have been widely studied (see [16] for a notable exception). We find that FM2 indeed impedes early training (see Figure 3). Let us mention two possible explanations. First, if the variance between activations at different layers is very large, then some layers may have very large or small activations that exceed machine precision. Second, the backpropagated SGD update for a weight $w$ in a given layer includes a factor that corresponds to the size of activations at the previous layer. A very small update of $w$ essentially keeps it at its randomly initialized value. A very large update, on the other hand, essentially re-randomizes $w$. Thus, we conjecture that FM2 causes the stochasticity of parameter updates to outweigh the effect of the training loss.

Our analysis of FM2 reveals an interesting difference between fully connected and residual networks. Namely, for fully connected and convolutional networks, FM2 is a function of architecture, rather than just of initialization, and can occur even if FM1 does not. For residual networks, we prove by contrast that FM2 never occurs once FM1 is avoided (see Corollary 2 and Theorem 6).

2 Related Work

Closely related to this article is the work of Taki [29] on initializing ResNets, as well as the work of Yang and Schoenholz [32, 33]. The article [29] gives heuristic computations related to the mean squared activation in a depth-$L$ ResNet and suggests taking the scales $\eta_\ell$ of the residual modules to all be equal to $1/L$ (see §3.2). 
The articles [32, 33], in contrast, use mean field theory to derive the dependence of both forward and backward dynamics in a randomly initialized ResNet on the residual module weights.

Also related to this work is that of He et al. [8], already mentioned above, as well as [14, 22, 23, 25]. The authors in the latter group show that information can be propagated in infinitely wide ReLU nets so long as weights are initialized independently according to an appropriately normalized distribution (see condition (ii) in Definition 1). One notable difference between this collection of papers and the present work is that we are concerned with a rigorous computation of finite width effects.

These finite size corrections were also studied by Schoenholz et al. [26], which gives exact formulas for the distribution of pre-activations in the case when the weights and biases are Gaussian. For more on the Gaussian case, we also point the reader to Giryes et al. [6]. The idea that controlling the means and variances of activations at various hidden layers in a deep network can help with the start of training was previously considered in Klambauer et al. [16]. This work introduced the scaled exponential linear unit (SELU) activation, which is shown to cause the mean values of neuron activations to converge to 0 and the average squared length to converge to 1. A different approach to this kind of self-normalizing behavior was suggested in Wu et al. [31]. There, the authors suggest adding a linear hidden layer that has no learnable parameters but directly normalizes activations to have mean 0 and variance 1. Activation lengths can also be controlled by constraining weight matrices to be orthogonal or unitary (see e.g. 
[1, 10, 13, 18, 24]).

We would also like to point out a previous appearance, in Hanin [7], of the sum of reciprocals of layer widths, which we here show determines the variance of the sizes of the activations (see Theorem 5) in randomly initialized fully connected ReLU nets. The article [7] found that this same sum of reciprocals is also related to the problem of vanishing and exploding gradients, indicating commonalities between this failure mode and those we consider here. Finally, we point the reader to the discussion around Figure 6 in [2], which also finds that time to convergence is better for wider networks.

3 Results

In this section, we will (1) provide an intuitive motivation and explanation of our mathematical results, (2) verify empirically that our predictions hold, and (3) show by experiment the implications for training neural networks.

Figure 1: Comparison of the behavior of differently initialized fully connected networks as depth increases. Width is equal to depth throughout. Note that in He normal (truncated), the normal distribution is truncated at two standard deviations from the mean, as implemented e.g. in Keras and PyTorch. For 2x He normal, weights are drawn from a normal distribution with twice the variance of He normal. (a) Mean square length $M_d$ (log scale), demonstrating exponential decay or explosion unless variance is set at 2/fan-in, as in He normal and He uniform initializations; (b) average number of epochs required to obtain 20% test accuracy when training on vectorized MNIST, showing that exponential decay or explosion of $M_d$ is associated with reduced ability to begin training.

3.1 Avoiding FM1 for Fully Connected Networks: Variance of Weights

Consider a depth-$d$, fully connected ReLU net $\mathcal{N}$ with hidden layer widths $n_j$, $j = 0, \ldots, d$, and random weights and biases (see Definition 1 for the details of the initialization). 
As $\mathcal{N}$ propagates an input vector $\mathrm{act}^{(0)} \in \mathbb{R}^{n_0}$ from one layer to the next, the lengths of the resulting vectors of activations $\mathrm{act}^{(j)} \in \mathbb{R}^{n_j}$ change in some manner, eventually producing an output vector whose length is potentially very different from that of the input. These changes in length are summarized by
$$M_j := \frac{1}{n_j}\,\big\|\mathrm{act}^{(j)}\big\|^2,$$
where here and throughout the squared norm of a vector is the sum of the squares of its entries. We prove in Theorem 5 that the mean of the normalized output length $M_d$, which controls whether failure mode FM1 occurs, is determined by the variance of the distribution used to initialize weights. We emphasize that all our results hold for any fixed input, which need not be random; we average only over the weights and the biases. Thus, FM1 cannot be directly solved by batch normalization [11], which renormalizes by averaging over inputs to $\mathcal{N}$, rather than averaging over initializations of $\mathcal{N}$.

Theorem 1 (FM1 for fully connected networks (informal)). The mean $\mathbb{E}[M_d]$ of the normalized output length is equal to the input length if network weights are drawn independently from a symmetric distribution with variance 2/fan-in. For higher variance, the mean $\mathbb{E}[M_d]$ grows exponentially in the depth $d$, while for lower variance, it decays exponentially.

In Figure 1, we compare the effects of different initializations in networks with varying depth, where the width is equal to the depth (this is done to prevent FM2, see §3.3). Figure 1(a) shows that, as predicted, initializations for which the variance of weights is smaller than the critical value of 2/fan-in lead to a dramatic decrease in output length, while variance larger than this value causes the output length to explode. Figure 1(b) compares the ability of differently initialized networks to start training; it shows the average number of epochs required to achieve 20% test accuracy on MNIST [19]. 
It is clear that those initializations which preserve output length are also those which allow for fast initial training; in fact, we see that it is faster to train a suitably initialized depth-100 network than it is to train a depth-10 network. Datapoints in (a) represent statistics over random unit inputs for 1,000 independently initialized networks, while (b) shows the number of epochs required to achieve 20% accuracy on vectorized MNIST, averaged over 5 training runs with independent initializations, where networks were trained using stochastic gradient descent with a fixed learning rate of 0.01 and batch size of 1024, for up to 100 epochs. Note that changing the learning rate depending on depth could be used to compensate for FM1; choosing the right initialization is equivalent and much simpler.

While $\mathbb{E}[M_d]$ and its connection to the variance of weights at initialization have been previously noted (we refer the reader especially to [8] and to §2 for other references), the implications for choosing a good initialization appear not to have been fully recognized. Many established initializations for ReLU networks draw weights i.i.d. from a distribution whose variance is not normalized to preserve output lengths. As we shall see, such initializations hamper training of very deep networks. For instance, as implemented in the Keras deep learning Python library [4], the only default initialization to have the critical variance 2/fan-in is He uniform. By contrast, LeCun uniform and LeCun normal have variance 1/fan-in, while Glorot uniform (also known as Xavier uniform) and Glorot normal (Xavier normal) have variance 2/(fan-in + fan-out). Finally, the initialization He normal comes close to having the correct variance, but, at the time of writing, the Keras implementation truncates the normal distribution at two standard deviations from the mean (the implementation in PyTorch [20] does likewise). 
This leads to a decrease in the variance of the resulting weight distribution, causing a catastrophic decay in the lengths of output activations (see Figure 1). We note this to emphasize both the sensitivity of initialization and the popularity of initializers that can lead to FM1.

It is worth noting that the 2 in our optimal variance 2/fan-in arises from the ReLU, which zeros out symmetrically distributed input with probability 1/2, thereby effectively halving the variance at each layer. (For linear activations, the 2 would disappear.) The initializations described above may preserve output lengths for activation functions other than ReLU. However, ReLU is one of the most common activation functions for feed-forward networks, and various initializations are commonly used blindly with ReLUs without recognizing the effect upon ease of training. An interesting systematic approach to predicting the correct multiplicative constant in the variance of weights as a function of the non-linearity is proposed in [22, 25] (e.g., the definition around (7) in Poole et al. [22]). For non-linearities other than ReLU, however, this constant seems difficult to compute directly.

Figure 2: Comparison of the behavior of differently scaled ResNets as the number of modules increases. Each residual module here is a single layer of width 5. (a) Mean length scale $M^{\mathrm{res}}_L$, which grows exponentially in the sum of scales $\eta_\ell$; (b) average number of epochs to 20% test accuracy when training on MNIST, showing that $M^{\mathrm{res}}_L$ is a good predictor of initial training performance.

3.2 Avoiding FM1 for Residual Networks: Weights of Residual Modules

To state our results about FM1 for ResNets, we must set some notation (based on the framework presented e.g. in Veit et al. [30]). For a sequence $\eta_\ell$, $\ell = 1, 2, \ldots$ of positive real numbers and a sequence of fully connected ReLU nets $\mathcal{N}_1, \mathcal{N}_2, \ldots$, we define a residual network $\mathcal{N}^{\mathrm{res}}_L$ with residual modules $\mathcal{N}_1, \ldots, \mathcal{N}_L$ and scales $\eta_1, \ldots, \eta_L$ by the recursion
$$\mathcal{N}^{\mathrm{res}}_0(x) = x, \qquad \mathcal{N}^{\mathrm{res}}_\ell(x) = \mathcal{N}^{\mathrm{res}}_{\ell-1}(x) + \eta_\ell\,\mathcal{N}_\ell\big(\mathcal{N}^{\mathrm{res}}_{\ell-1}(x)\big), \quad \ell = 1, \ldots, L. \tag{1}$$
Explicitly,
$$\mathcal{N}^{\mathrm{res}}_L(x) = x + \eta_1\mathcal{N}_1(x) + \eta_2\mathcal{N}_2\big(x + \eta_1\mathcal{N}_1(x)\big) + \eta_3\mathcal{N}_3\big(x + \eta_1\mathcal{N}_1(x) + \eta_2\mathcal{N}_2(x + \eta_1\mathcal{N}_1(x))\big) + \cdots.$$
Intuitively, the scale $\eta_\ell$ controls the size of the correction to $\mathcal{N}^{\mathrm{res}}_{\ell-1}$ computed by the residual module $\mathcal{N}_\ell$. Since we implicitly assume that the depths and widths of the residual modules $\mathcal{N}_\ell$ are uniformly bounded (e.g., the modules may have a common architecture), failure mode FM1 comes down to determining for which sequences $\{\eta_\ell\}$ of scales there exist $c, C > 0$ so that
$$c \le \sup_{L \ge 1} \mathbb{E}\big[M^{\mathrm{res}}_L\big] \le C, \tag{2}$$
where we write $M^{\mathrm{res}}_L = \|\mathcal{N}^{\mathrm{res}}_L(x)\|^2$ and $x$ is a unit norm input to $\mathcal{N}^{\mathrm{res}}_L$. The expectation in (2) is over the weights and biases in the fully connected residual modules $\mathcal{N}_\ell$, which we initialize as in Definition 1, except that we set biases to zero for simplicity (this does not affect the results below). Part of our main theoretical result on residual networks, Theorem 6, can be summarized as follows.

Theorem 2 (FM1 for ResNets (informal)). Consider a randomly initialized ResNet with $L$ residual modules, scales $\eta_1, \ldots, \eta_L$, and weights drawn independently from a symmetric distribution with variance 2/fan-in. The mean $\mathbb{E}[M^{\mathrm{res}}_L]$ of the normalized output length grows exponentially with the sum of the scales $\sum_{\ell=1}^{L} \eta_\ell$.

We empirically verify the predictive power of the quantity $\sum_\ell \eta_\ell$ in the performance of ResNets. In Figure 2(a), we initialize ResNets with constant $\eta_\ell = 1$ as well as geometrically decaying $\eta_\ell = b^\ell$ for $b = 0.9, 0.75, 0.5$. All modules are single hidden layers with width 5. 
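The recursion (1) and the role of the scales $\eta_\ell$ can be simulated directly. Below is a minimal NumPy sketch (ours, not the authors' code) in which each residual module is a single He-initialized ReLU layer with zero biases, as in the experiments above; the module width, number of modules, and trial count are illustrative choices.

```python
import numpy as np

def resnet_md(rng, n, scales, trials=100):
    """Estimate E[M^res_L] = E[||N^res_L(x)||^2] for a unit-norm input, where
    each residual module is one ReLU layer of width n with He-initialized
    (variance 2/n) Gaussian weights and zero biases."""
    x = np.zeros(n)
    x[0] = 1.0                               # unit-norm input
    samples = []
    for _ in range(trials):
        h = x.copy()
        for eta in scales:                   # recursion (1)
            W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
            h = h + eta * np.maximum(W @ h, 0.0)
        samples.append(np.dot(h, h))
    return float(np.mean(samples))

rng = np.random.default_rng(0)
L = 30
m_const = resnet_md(rng, 5, [1.0] * L)                         # sum of scales = L
m_decay = resnet_md(rng, 5, [0.5 ** l for l in range(1, L + 1)])  # convergent sum
```

Constant scales make the sum $\sum_\ell \eta_\ell$ grow linearly in $L$ and the mean length explode, while the geometrically decaying scales keep it bounded, as Theorem 2 predicts.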
We observe that, as predicted by Theorem 2, $\eta_\ell = 1$ leads to an exponentially growing length scale $M^{\mathrm{res}}_L$, while $\eta_\ell = b^\ell$ leads the mean of $M^{\mathrm{res}}_L$ to grow until it reaches a plateau (the value of which depends on $b$), since $\sum_\ell \eta_\ell$ is finite. In Figure 2(b), we show that the mean of $M^{\mathrm{res}}_L$ well predicts the ease with which ResNets of different depths are trained. Note the large gap between $b = 0.9$ and $0.75$, which is explained by noting that the approximation $\eta^2 \ll \eta$ which we use in the proof holds for $\eta \ll 1$, leading to a larger constant multiple of $\sum_\ell \eta_\ell$ in the exponent for $b$ closer to 1. Each datapoint is averaged over 100 training runs with independent initializations, with training parameters as in Figure 1.

3.3 FM2 for Fully Connected Networks: The Effect of Architecture

In the notation of §3.1, failure mode FM2 is characterized by a large expected value for
$$\widehat{\mathrm{Var}}[M] := \frac{1}{d}\sum_{j=1}^{d} M_j^2 - \Bigg(\frac{1}{d}\sum_{j=1}^{d} M_j\Bigg)^2,$$
the empirical variance of the normalized squared lengths of activations among all the hidden layers in $\mathcal{N}$. Our main theoretical result about FM2 for fully connected networks is the following.

Theorem 3 (FM2 for fully connected networks (informal)). The mean $\mathbb{E}[\widehat{\mathrm{Var}}[M]]$ of the empirical variance for the lengths of activations in a fully connected ReLU net is exponential in $\sum_{j=1}^{d-1} 1/n_j$, the sum of the reciprocals of the widths of the hidden layers.

For a formal statement see Theorem 5. It is well known that deep but narrow networks are hard to train, and this result provides theoretical justification, since for such nets $\sum_j 1/n_j$ is large. More than that, this sum of reciprocals gives a definite way to quantify the effect of "deep but narrow" architectures on the volatility of the scale of activations at various layers within the network. We note that this result also implies that for a given depth and fixed budget of neurons or parameters, constant width is optimal, since by the Power Mean Inequality, $\sum_j 1/n_j$ is minimized for all $n_j$ equal if $\sum_j n_j$ (number of neurons) or $\sum_j n_j^2$ (approximate number of parameters) is held fixed.

Figure 3: Comparison of ease with which different fully connected architectures may be trained. (a) Mean epochs required to obtain 20% test accuracy when training on MNIST, as a function of network depth; (b) same y-axis, with x-axis showing the sum of reciprocals of layer widths. Training efficiency is shown to be predicted closely by this sum of reciprocals, independent of other details of network architecture. Note that all networks are initialized with He normal weights to avoid FM1.

We experimentally verify that the sum of reciprocals of layer widths is an astonishingly good predictor of the speed of early training. Figure 3(a) compares training performance on MNIST for fully connected networks of five types:

i. Alternating layers of width 30 and 10,
ii. The first half of the layers of width 30, then the second half of width 10,
iii. The first half of the layers of width 10, then the second half of width 30,
iv. Constant width 15,
v. Constant width 20.

Note that types (i)-(iii) have the same layer widths, but differently permuted. As predicted by our theory, the order of layer widths does not affect FM2. 
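The quantity $\widehat{\mathrm{Var}}[M]$ in Theorem 3 can also be estimated by simulation. The sketch below (ours, not the authors' code) compares a narrow and a wide constant-width net at the same depth under He initialization with zero biases; the widths, depth, and trial count are illustrative assumptions, and the inputs are scaled so that $M_0 = 1$ in both cases.

```python
import numpy as np

def mean_empirical_var(rng, widths, trials=300):
    """Estimate E[Var-hat[M]]: the empirical variance over layers of
    M_j = ||act^(j)||^2 / n_j, averaged over random He initializations
    (Gaussian weights of variance 2/fan-in, zero biases)."""
    x = np.ones(widths[0])          # ||x||^2 = n_0, so M_0 = 1
    vals = []
    for _ in range(trials):
        a, ms = x, []
        for n_in, n_out in zip(widths[:-1], widths[1:]):
            W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
            a = np.maximum(W @ a, 0.0)
            ms.append(np.dot(a, a) / n_out)
        vals.append(np.var(ms))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
d = 40
v_narrow = mean_empirical_var(rng, [10] * (d + 1))   # sum of 1/n_j = 4.0
v_wide = mean_empirical_var(rng, [100] * (d + 1))    # sum of 1/n_j = 0.4
```

The narrow net, with a ten times larger sum of reciprocal widths, shows a dramatically larger mean empirical variance, consistent with the exponential dependence in Theorem 3.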
We emphasize that
$$\frac{1}{30} + \frac{1}{10} = \frac{1}{15} + \frac{1}{15},$$
so type (iv) networks have, for each depth, the same sum of reciprocals of layer widths as types (i)-(iii). As predicted, early training dynamics for networks of types (i)-(iv) were similar for each fixed depth. By contrast, networks of type (v), which had a lower sum of reciprocals of layer widths, trained faster for every depth than the corresponding networks of types (i)-(iv).

In all cases, training becomes harder with greater depth, since $\sum_j 1/n_j$ increases with depth for constant-width networks. In Figure 3(b), we plot the same data with $\sum_j 1/n_j$ on the x-axis, showing this quantity's power in predicting the effectiveness of early training, irrespective of the particular details of the network architecture in question.

Each datapoint is averaged over 100 independently initialized training runs, with training parameters as in Figure 1. All networks are initialized with He normal weights to prevent FM1.

3.4 FM2 for Residual Networks

In the notation of §3.2, failure mode FM2 is equivalent to a large expected value for the empirical variance
$$\widehat{\mathrm{Var}}[M^{\mathrm{res}}] := \frac{1}{L}\sum_{\ell=1}^{L} \big(M^{\mathrm{res}}_\ell\big)^2 - \Bigg(\frac{1}{L}\sum_{\ell=1}^{L} M^{\mathrm{res}}_\ell\Bigg)^2$$
of the normalized squared lengths of activations among the residual modules in $\mathcal{N}$. Our main theoretical result about FM2 for ResNets is the following (see Theorem 6 for the precise statement).

Theorem 4 (FM2 for ResNets (informal)). The mean $\mathbb{E}[\widehat{\mathrm{Var}}[M^{\mathrm{res}}]]$ of the empirical variance for the lengths of activations in a residual ReLU net with $L$ residual modules and scales $\eta_\ell$ is exponential in $\sum_{\ell=1}^{L} \eta_\ell$. By Theorem 2, this means that in ResNets, if failure mode FM1 does not occur, then neither does FM2 (assuming FM2 does not occur in individual residual modules).

3.5 Convolutional Architectures

Our above results were stated for fully connected networks, but the logic of our proofs carries over to other architectures. In particular, similar statements hold for convolutional neural networks (ConvNets). Note that the fan-in for a convolutional layer is not given by the width of the preceding layer, but instead is equal to the number of features multiplied by the kernel size.

Figure 4: Comparison of the behavior of differently initialized ConvNets as depth increases, with the number of features at each layer proportional to the overall network depth. The mean output length over different random initializations is observed to follow the same patterns as in Figure 1 for fully connected networks. Weight distributions with variance 2/fan-in preserve output length, while other distributions lead to exponential growth or decay. The input image from CIFAR-10 is shown.

In Figure 4, we show that the output length behavior we observed in fully connected networks also holds in ConvNets. Namely, mean output length equals input length for weights drawn i.i.d. from a symmetric distribution of variance 2/fan-in, while other variances lead to exploding or vanishing output lengths as the depth increases. In our experiments, networks were purely convolutional, with no pooling or fully connected layers. By analogy to Figure 1, the fan-in was set to approximately the depth of the network by fixing kernel size 3 × 3 and setting the number of features at each layer to one tenth of the network's total depth. 
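The convolutional fan-in rule just stated can be made concrete in a few lines. This is a small sketch of ours, not the authors' code; the channel count and kernel size are taken from the 3 × 3 setup described above.

```python
import math

def conv_he_std(in_channels, kernel_h, kernel_w):
    """He standard deviation for a conv layer. The fan-in is not the width of
    the previous layer but in_channels * kernel_h * kernel_w, so the critical
    variance 2/fan-in gives std = sqrt(2 / (in_channels * kernel_h * kernel_w))."""
    fan_in = in_channels * kernel_h * kernel_w
    return math.sqrt(2.0 / fan_in)

# e.g. 3x3 kernels with 10 input feature maps: fan-in = 10 * 3 * 3 = 90
std = conv_he_std(10, 3, 3)
```

Using the previous layer's channel count alone (fan-in 10 instead of 90) would overestimate the correct variance by the kernel size, a factor of 9 here.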
For each datapoint, the network was allowed to vary over 1,000 independent initializations, with input a fixed image from the dataset CIFAR-10 [17].

4 Notation

To state our results formally, we first give the precise definition of the networks we study and introduce some notation. For every $d \ge 1$ and $\mathbf{n} = (n_i)_{i=0}^{d} \in \mathbb{Z}_+^{d+1}$, we define
$$\mathfrak{N}(\mathbf{n}, d) = \big\{\text{fully connected feed-forward nets with ReLU activations and depth } d, \text{ whose } j\text{th hidden layer has width } n_j\big\}.$$
Note that $n_0$ is the dimension of the input. Given $\mathcal{N} \in \mathfrak{N}(\mathbf{n}, d)$, the function $f_{\mathcal{N}}$ it computes is determined by its weights and biases
$$\big\{w^{(j)}_{\alpha,\beta},\, b^{(j)}_{\beta},\quad 1 \le \alpha \le n_j,\ 1 \le \beta \le n_{j+1},\ 0 \le j \le d-1\big\}.$$
For every input $\mathrm{act}^{(0)} = \big(\mathrm{act}^{(0)}_\alpha\big)_{\alpha=1}^{n_0} \in \mathbb{R}^{n_0}$ to $\mathcal{N}$, we write for all $j = 1, \ldots, d$
$$\mathrm{preact}^{(j)}_\beta = b^{(j)}_\beta + \sum_{\alpha=1}^{n_{j-1}} \mathrm{act}^{(j-1)}_\alpha\, w^{(j)}_{\alpha,\beta}, \qquad \mathrm{act}^{(j)}_\beta = \mathrm{ReLU}\big(\mathrm{preact}^{(j)}_\beta\big), \qquad 1 \le \beta \le n_j. \tag{3}$$
The vectors $\mathrm{preact}^{(j)}, \mathrm{act}^{(j)}$ are thus the inputs and outputs of the nonlinearities in the $j$th layer of $\mathcal{N}$.

Definition 1 (Random Nets). Fix $d \ge 1$, positive integers $\mathbf{n} = (n_0, \ldots, n_d) \in \mathbb{Z}_+^{d+1}$, and two collections of probability measures $\mu = \big(\mu^{(1)}, \ldots, \mu^{(d)}\big)$ and $\nu = \big(\nu^{(1)}, \ldots, \nu^{(d)}\big)$ on $\mathbb{R}$ such that $\mu^{(j)}, \nu^{(j)}$ are symmetric around 0 for every $1 \le j \le d$, and such that the variance of $\mu^{(j)}$ is $2/n_{j-1}$. A random network $\mathcal{N} \in \mathfrak{N}_{\mu,\nu}(\mathbf{n}, d)$ is obtained by requiring that the weights and biases for neurons at layer $j$ are drawn independently from $\mu^{(j)}, \nu^{(j)}$, respectively.

5 Formal statements

We begin by stating our results about fully connected networks. Given a random network $\mathcal{N} \in \mathfrak{N}_{\mu,\nu}(\mathbf{n}, d)$ and an input $\mathrm{act}^{(0)}$ to $\mathcal{N}$, we write, as in §3.1, $M_j$ for the normalized squared length of activations $\frac{1}{n_j}\|\mathrm{act}^{(j)}\|^2$ at layer $j$. 
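The random nets of Definition 1 and the forward recursion (3) can be sketched directly in code. The following is our illustrative NumPy version, not the authors' code; it uses Gaussian measures for both $\mu^{(j)}$ and $\nu^{(j)}$ (Definition 1 allows any symmetric distributions), and the widths and bias standard deviations are arbitrary.

```python
import numpy as np

def sample_net(rng, widths, bias_std):
    """Draw a random net per Definition 1: weights into layer j have variance
    2/n_{j-1} (Gaussian here); biases are symmetric with std bias_std[j-1]."""
    params = []
    for j in range(1, len(widths)):
        W = rng.normal(0.0, np.sqrt(2.0 / widths[j - 1]),
                       size=(widths[j], widths[j - 1]))
        b = rng.normal(0.0, bias_std[j - 1], size=widths[j])
        params.append((W, b))
    return params

def forward(params, x):
    """Recursion (3): preact^(j) = b^(j) + W^(j) act^(j-1),
    act^(j) = ReLU(preact^(j)). Returns all layer activations."""
    acts = [x]
    for W, b in params:
        acts.append(np.maximum(b + W @ acts[-1], 0.0))
    return acts

rng = np.random.default_rng(0)
widths = [4, 8, 8, 3]                     # n_0 = 4 input dims, d = 3 layers
net = sample_net(rng, widths, bias_std=[0.1] * 3)
acts = forward(net, np.ones(4) / 2.0)     # unit-norm input
```

From the returned activations, the normalized lengths $M_j = \frac{1}{n_j}\|\mathrm{act}^{(j)}\|^2$ studied below are immediate to compute.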
Our first theoretical result, Theorem 5, concerns both the mean and variance of $M_d$. To state it, we denote for any probability measure $\lambda$ on $\mathbb{R}$ its moments by
$$\lambda_k := \int_{\mathbb{R}} x^k \, d\lambda(x).$$

Theorem 5. For each $j \ge 0$, fix $n_j \in \mathbb{Z}_+$. For each $d \ge 1$, let $\mathcal{N} \in \mathfrak{N}_{\mu,\nu}(\mathbf{n}, d)$ be a fully connected ReLU net with depth $d$, hidden layer widths $\mathbf{n} = (n_j)_{j=0}^{d}$, as well as random weights and biases as in Definition 1. Fix also an input $\mathrm{act}^{(0)} \in \mathbb{R}^{n_0}$ to $\mathcal{N}$ with $\|\mathrm{act}^{(0)}\| = 1$. We have almost surely
$$\limsup_{d \to \infty} M_d < \infty \iff \sum_{j \ge 1} \nu^{(j)}_2 < \infty. \tag{4}$$
Moreover, if (4) holds, then there exists a random variable $M_\infty$ (that is almost surely finite) such that $M_d \to M_\infty$ as $d \to \infty$ pointwise almost surely. Further, suppose $\mu^{(j)}_4 < \infty$ for all $j \ge 1$ and that $\sum_{j=1}^{\infty} \big(\nu^{(j)}_2\big)^2 < \infty$. Then
$$\exp\Bigg[\frac{1}{2}\sum_{j=1}^{d} \frac{1}{n_j}\Bigg] \le \mathbb{E}\big[M_d^2\big] \le C_1 \exp\Bigg[C_2 \sum_{j=1}^{d} \frac{1}{n_j}\Bigg], \tag{5}$$
where $C_1$ and $C_2$ are finite constants depending only on $M_0$ and on the moments $\nu^{(j)}_2$, $\nu^{(j)}_4$, and $\mu^{(j)}_4$ of the bias and weight distributions. If $M_0 = 1$, $\sum_{j=1}^{\infty} \nu^{(j)}_2 \le 1$, $\nu^{(d)}_4 \le 1$, and $\mu^{(j)}$ is Gaussian for all $j \ge 1$, then we can take $C_1 = C_2 = 4$.

In particular, $\mathrm{Var}[M_d]$ is exponential in $\sum_{j=1}^{d} 1/n_j$, and if $\sum_j 1/n_j < \infty$, then the convergence of $M_d$ to $M_\infty$ is in $L^2$ and $\mathrm{Var}[M_\infty] < \infty$.

The proof of Theorem 5 is deferred to the Supplementary Material. Although we state our results only for fully connected feed-forward ReLU nets, the proof techniques carry over essentially verbatim to any feed-forward network in which only weights in the same hidden layer are tied. In particular, our results apply to convolutional networks in which the kernel sizes are uniformly bounded. 
In this case, the constants in Theorem 5 depend on the bound for the kernel dimensions, and $n_j$ denotes the fan-in for neurons in the $(j+1)$st hidden layer (i.e. the number of channels in layer $j$ multiplied by the size of the appropriate kernel). We also point out the following corollary, which follows immediately from the proof of Theorem 5.

Corollary 1 (FM1 for Fully Connected Networks). With notation as in Theorem 5, suppose that for all $j = 1, \ldots, d$, the weights in layer $j$ of $\mathcal{N}_d$ have variance $\kappa \cdot 2/n_j$ for some $\kappa > 0$. Then the average squared size $M_d$ of activations at layer $d$ will grow or decay exponentially unless $\kappa = 1$:
$$\mathbb{E}\bigg[\frac{1}{n_d}\big\|\mathrm{act}^{(d)}\big\|^2\bigg] = \frac{\kappa^d}{n_0}\big\|\mathrm{act}^{(0)}\big\|^2 + \sum_{j=1}^{d} \kappa^{d-j}\, \nu^{(j)}_2.$$

Our final result about fully connected networks is a corollary of Theorem 5, which explains precisely when failure mode FM2 occurs (see §3.3). It is proved in the Supplementary Material.

Corollary 2 (FM2 for Fully Connected Networks). Take the same notation as in Theorem 5. There exist $c, C > 0$ so that
$$\frac{c}{d}\sum_{j=1}^{d} \exp\Bigg(c \sum_{k=j}^{d-1} \frac{1}{n_k}\Bigg) \le \mathbb{E}\big[\widehat{\mathrm{Var}}[M]\big] \le \frac{C}{d}\sum_{j=1}^{d} \exp\Bigg(C \sum_{k=j}^{d-1} \frac{1}{n_k}\Bigg). \tag{6}$$
In particular, suppose the hidden layer widths are all equal, $n_j = n$. Then $\mathbb{E}[\widehat{\mathrm{Var}}[M]]$ is exponential in $\beta := \sum_j 1/n_j = (d-1)/n$ in the sense that there exist $c, C > 0$ so that
$$c \exp(c\beta) \le \mathbb{E}\big[\widehat{\mathrm{Var}}[M]\big] \le C \exp(C\beta). \tag{7}$$

Finally, our main result about residual networks is the following:

Theorem 6. Take the notation from §3.2 and §3.4. The mean of squared activations in $\mathcal{N}^{\mathrm{res}}_L$ is uniformly bounded in the number of modules $L$ if and only if the scales $\eta_\ell$ form a convergent series:
$$0 < \sup_{L \ge 1,\ \|x\| = 1} M^{\mathrm{res}}_L < \infty \iff \sum_{\ell=1}^{\infty} \eta_\ell < \infty. \tag{8}$$
Moreover, for any sequence of scales $\eta_\ell$ for which $\sup_\ell \eta_\ell < \infty$ and for every $K, L \ge 1$, we have
$$\mathbb{E}\Big[\big(M^{\mathrm{res}}_L\big)^K\Big] = \exp\Bigg(O\bigg(\sum_{\ell=1}^{L} \eta_\ell\bigg)\Bigg),$$
where the implied constant depends on $K$ but not on $L$. Hence, once the condition in (8) holds, both the moments $\mathbb{E}[(M^{\mathrm{res}}_L)^K]$ and the mean of the empirical variance of $\{M^{\mathrm{res}}_\ell,\ \ell = 1, \ldots, L\}$ are uniformly bounded in $L$.

6 Conclusion

In this article, we give a rigorous analysis of the layerwise length scales in fully connected, convolutional, and residual ReLU networks at initialization. We find that a careful choice of initial weights is needed for well-behaved mean length scales. For fully connected and convolutional networks, this entails a critical variance for i.i.d. weights, while for residual nets it entails appropriately rescaling the residual modules. For fully connected nets, we prove that controlling not merely the mean but also the variance of layerwise length scales requires choosing a sufficiently wide architecture, while for residual nets nothing further is required. We also demonstrate empirically that both the mean and variance of length scales are strong predictors of early training dynamics.

Our results hold for any fixed input, representing a reasonable (and tractable) approximation to distributions of interest in which inputs are drawn from discrete clusters. In the case of classification problems, for example, high output variance for a single fixed input could indicate high variance over the set of similar inputs of that class. Theoretical analysis over an entire specified input distribution is tricky (and likely depends on the particular distribution in question), though in future work we will attempt to extend the present results to the case of joint distributions over two or more inputs. We also plan to extend our analysis to other (e.g. 
sigmoidal) activations, recurrent networks, and weight initializations beyond i.i.d., such as orthogonal weights.

References

[1] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.

[2] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. arXiv preprint arXiv:1706.05394, 2017.

[3] Patrick Billingsley. Probability and Measure. John Wiley & Sons, 2008.

[4] François Chollet et al. Keras. https://github.com/keras-team/keras, 2018.

[5] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.

[6] Raja Giryes, Guillermo Sapiro, and Alexander M Bronstein. Deep neural networks with random Gaussian weights: a universal classification strategy? IEEE Trans. Signal Processing, 64(13):3444–3457, 2016.

[7] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In Advances in Neural Information Processing Systems, 2018.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[10] Mikael Henaff, Arthur Szlam, and Yann LeCun. Recurrent orthogonal networks and long-memory tasks.
In International Conference on Machine Learning, pages 2034–2042, 2016.

[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[12] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.

[13] Li Jing, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark, and Marin Soljačić. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In International Conference on Machine Learning, pages 1733–1741, 2017.

[14] Jonathan Kadmon and Haim Sompolinsky. Transition to chaos in random neuronal networks. Physical Review X, 5(4):041030, 2015.

[15] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

[16] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 972–981, 2017.

[17] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

[18] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.

[19] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[20] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[21] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice.
In Advances in Neural Information Processing Systems, pages 4788–4798, 2017.

[22] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.

[23] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2847–2854, 2017.

[24] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

[25] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.

[26] Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. A correspondence between random neural networks and statistical field theory. arXiv preprint arXiv:1710.06570, 2017.

[27] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3067–3075, 2017.

[28] Ramalingam Shanmugam and Rajan Chattamvelli. Statistics for Scientists and Engineers. Wiley Online Library, 2015.

[29] Masato Taki. Deep residual networks and weight initialization. arXiv preprint arXiv:1709.02956, 2017.

[30] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pages 550–558, 2016.

[31] Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. Second-order optimization for deep reinforcement learning using Kronecker-factored approximation.
In Advances in Neural Information Processing Systems 30, pages 5285–5294, 2017.

[32] Greg Yang and Samuel S Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems, pages 7103–7114, 2017.

[33] Greg Yang and Samuel S Schoenholz. Deep mean field theory: Layerwise variance and width variation as methods to control gradient explosion. In Workshop of the International Conference on Learning Representations, 2018.

[34] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
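As a simple illustration of Corollary 1, the following NumPy sketch (ours, not code or an experiment from the paper) estimates the mean squared activation length E[M_d] of a fully connected ReLU net at initialization, with i.i.d. Gaussian weights of variance κ · 2/fan-in and zero biases. At the critical scale κ = 1 the estimate stays near M_0, while for κ ≠ 1 it decays or grows like κ^d:

```python
# Monte Carlo sanity check of Corollary 1 (illustrative sketch, not paper code).
# Weights are i.i.d. N(0, kappa * 2 / fan_in), biases zero, so the mean of
# M_d = ||act(d)||^2 / n_d should scale like kappa^d relative to M_0.
import numpy as np

def mean_sq_length(depth, width, kappa, trials=300, seed=0):
    """Estimate E[M_d] over random initializations for a fixed unit-norm input."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    x /= np.linalg.norm(x)                 # fixed input with ||x|| = 1, so M_0 = 1/width
    total = 0.0
    for _ in range(trials):
        a = x
        for _ in range(depth):
            w = rng.standard_normal((width, width)) * np.sqrt(kappa * 2.0 / width)
            a = np.maximum(w @ a, 0.0)     # ReLU layer with zero biases
        total += (a @ a) / width           # M_d for this initialization
    return total / trials

width, depth = 100, 20
m0 = 1.0 / width
m_crit = mean_sq_length(depth, width, kappa=1.0)    # critical: E[M_d] = M_0
m_shrink = mean_sq_length(depth, width, kappa=0.8)  # decays like 0.8**depth
m_grow = mean_sq_length(depth, width, kappa=1.2)    # grows like 1.2**depth
print(m_crit / m0, m_shrink / m0, m_grow / m0)
```

With these (hypothetical) settings, the three printed ratios should sit near 1, far below 1, and far above 1, respectively, matching the κ^d behavior in Corollary 1.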