{"title": "Initialization of ReLUs for Dynamical Isometry", "book": "Advances in Neural Information Processing Systems", "page_first": 2385, "page_last": 2395, "abstract": "Deep learning relies on good initialization schemes and hyperparameter choices prior to training a neural network. Random weight initializations induce random network ensembles, which give rise to the trainability, training speed, and sometimes also generalization ability of an instance. In addition, such ensembles provide theoretical insights into the space of candidate models of which one is selected during training. The results obtained so far rely on mean field approximations that assume infinite layer width and that study average squared signals. We derive the joint signal output distribution exactly, without mean field assumptions, for fully-connected networks with Gaussian weights and biases, and analyze deviations from the mean field results. For rectified linear units, we further discuss limitations of the standard initialization scheme, such as its lack of dynamical isometry, and propose a simple alternative that overcomes these by initial parameter sharing.", "full_text": "Initialization of ReLUs for Dynamical Isometry\n\nRebekka Burkholz\n\nDepartment of Biostatistics\n\nHarvard T.H. Chan School of Public Health\n655 Huntington Avenue, Boston, MA 02115\n\nrburkholz@hsph.harvard.edu\n\nAlina Dubatovka\n\nDepartment of Computer Science\n\nETH Zurich\n\nUniversit\u00e4tstrasse 6, 8092 Zurich\nalina.dubatovka@inf.ethz.ch\n\nAbstract\n\nDeep learning relies on good initialization schemes and hyperparameter choices\nprior to training a neural network. Random weight initializations induce random\nnetwork ensembles, which give rise to the trainability, training speed, and some-\ntimes also generalization ability of an instance. In addition, such ensembles provide\ntheoretical insights into the space of candidate models of which one is selected\nduring training. 
The results obtained so far rely on mean \ufb01eld approximations\nthat assume in\ufb01nite layer width and that study average squared signals. We derive\nthe joint signal output distribution exactly, without mean \ufb01eld assumptions, for\nfully-connected networks with Gaussian weights and biases, and analyze deviations\nfrom the mean \ufb01eld results. For recti\ufb01ed linear units, we further discuss limitations\nof the standard initialization scheme, such as its lack of dynamical isometry, and\npropose a simple alternative that overcomes these by initial parameter sharing.\n\n1\n\nIntroduction\n\nDeep learning relies critically on good parameter initialization prior to training. Two approaches are\ncommonly employed: random network initialization [4, 7, 14] and transfer learning [26] (including\nunsupervised pre-training), where a network that was trained for a different task or a part of it\nis retrained and extended by additional network layers. While the latter can speed up training\nconsiderably and also improve the generalization ability of the new model, its bias towards the\noriginal task can also hinder successful training if the learned features barely relate to the new task.\nRandom initialization of parameters, meanwhile, requires careful tuning of the distributions from\nwhich neural network weights and biases are drawn. While heterogeneity of network parameters is\nneeded to produce meaningful output, a too big variance can also dilute the original signal. To avoid\nexploding or vanishing gradients, the distributions can be adjusted to preserve signal variance from\nlayer to layer. This enables the training of very deep networks by simple stochastic gradient descent\n(SGD) without the need of computationally intensive corrections as batch normalization [8] or variants\nthereof [12]. This approach is justi\ufb01ed by the similar update rules of gradient back-propagation and\nsignal forward propagation [20]. 
In addition to trainability, good parameter initializations also seem\nto support the generalization ability of the trained, overparametrized network. According to [3], the\nparameter values remain close to the initialized ones, which has a regularization effect.\nAn early example of approximate signal variance preservation is proposed in [4] for fully connected\nfeed forward neural networks, an important building block of most common neural architectures.\nInspired by those derivations, He et al. [7] found that for recti\ufb01ed linear units (ReLUs) and Gaussian\nweight initialization w \u223c N (\u00b5, \u03c32) the optimal choice is zero mean \u00b5 = 0, variance \u03c32 = 2/N and\nzero bias b = 0, where N refers to the number of neurons in a layer. These \ufb01ndings are con\ufb01rmed by\nmean \ufb01eld theory, which assumes in\ufb01nitely wide network layers to employ the central limit theorem\nand focus on normal distributions. Similar results have been obtained for tanh [16, 18, 20], residual\nnetworks with different activation functions [24], and convolutional neural networks [23]. The same\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fderivations also lead to the insight that in\ufb01nitely wide fully-connected neural networks approximately\nlearn the kernel of a Gaussian process [11]. According to these works, not only the signal variance\nbut also correlations between signals corresponding to different inputs need to be preserved to ensure\ngood trainability of initialized neural networks. This way, the average eigenvalue of the signal input-\noutput Jacobian in mean \ufb01eld neural networks is steered towards 1. Furthermore, a high concentration\nof the full spectral density of the Jacobian close to 1 seems to support higher training speeds [14, 15].\nThis property is called dynamical isometry and is better realized by orthogonal weight initializations\n[19]. 
So far, these insights rely on the mean \ufb01eld assumption of in\ufb01nite layer width. [6, 5] have\nderived \ufb01nite size corrections for the average squared signal norm and answered the question when\nthe mean \ufb01eld assumption holds.\nIn this article, we determine the exact signal output distribution without requiring mean \ufb01eld ap-\nproximations. For fully-connected network ensembles with Gaussian weights and biases for general\nnonlinear activation functions, we \ufb01nd that the output distribution only depends on the scalar products\nbetween different inputs. We therefore focus on their propagation through a network ensemble. In\nparticular, we study a linear transition operator that advances the signal distribution layer-wise. We\nconjecture that the spectral properties of this operator can be more informative of trainability than\nthe average spectral density of the input-output Jacobian. Additionally, the distribution of the cosine\nsimilarity indicates how well an initialized network can distinguish different inputs.\nWe further discuss when network layers of \ufb01nite width are well represented by mean \ufb01eld analysis and\nwhen they are not. Furthermore, we highlight important differences in the analysis. By specializing\nour derivations to ReLUs, we \ufb01nd variants of the He initialization [7] that ful\ufb01ll the same criteria but\nalso suffer from the same lack of dynamical isometry [14]. In consequence, such initialized neural\nnetworks cannot be trained effectively without batch normalization for high depth. To overcome\nthis problem, we propose a simple initialization scheme for ReLU layers that guarantees perfect\ndynamical isometry. A subset of the weights can still be drawn from Gaussian distributions or\nchosen as orthogonal while the remaining ones are designed to ensure full signal propagation. 
Both consistently outperform the He initialization in our experiments on MNIST and CIFAR-10.

2 Signal propagation through Gaussian neural network ensembles

2.1 Background and notation

We study fully-connected neural network ensembles with zero mean Gaussian weights and biases. We thus make the following assumption: An ensemble $\{G\}_{L, N_l, \phi, \sigma_w, \sigma_b}$ of fully-connected feed forward neural networks consists of networks with depth $L$, widths $N_l$, $l = 0, \ldots, L$, independently normally distributed weights and biases with $w^{(l)}_{ij} \sim \mathcal{N}(0, \sigma^2_{w,l})$, $b^{(l)}_i \sim \mathcal{N}(0, \sigma^2_{b,l})$, and non-decreasing activation function $\phi: \mathbb{R} \to \mathbb{R}$. Starting from the input vector $x^{(0)}$, the signal $x^{(l)}$ propagates through the network, as usual, as

$$h^{(l)} = W^{(l)} x^{(l-1)} + b^{(l)}, \quad x^{(l)} = \phi\big(h^{(l)}\big), \qquad h^{(l)}_i = \sum_{j=1}^{N_{l-1}} w^{(l)}_{ij} x^{(l-1)}_j + b^{(l)}_i, \quad x^{(l)}_i = \phi\big(h^{(l)}_i\big),$$

for $l = 1, \ldots, L$, where $h^{(l)}$ is the pre-activation at layer $l$, $W^{(l)}$ is the weight matrix, and $b^{(l)}$ is the bias vector. If not indicated otherwise, 1-dimensional functions applied to vectors are applied to each component separately. To ease notation, we follow the convention to suppress the superscript $(l)$ and write, for instance, $x_i$ instead of $x^{(l)}_i$ and $b_i$ instead of $b^{(l)}_i$, $x_i$ instead of $x^{(l-1)}_i$, and $x_i$ instead of $x^{(l+1)}_i$, when the layer reference is clear from the context.

Ideally, the initialized network is close to the trained one with high probability and can be reached fast in a small number of training steps. Hence, our first goal is to understand the ensemble above and the trainability of an initialized network without requiring mean field approximations of infinite $N_l$. In particular, we derive the probability distribution of the output $x^{(L)}$. Within this framework, our second goal is to learn how to improve on the He initialization, i.e., the choice $\sigma_{w,l} = \sqrt{2/N_l}$ and $b^{(l)}_i = 0$. Even though it preserves the variance for ReLUs, i.e., $\phi(x) = \max\{0, x\}$, as activation functions [7], neither this parameter choice nor orthogonal weights lead to dynamical isometry [14]. Thus, the average spectrum of the input-output Jacobian is not concentrated around 1 for higher depths and infinite width. In consequence, ReLUs are argued to be an inferior choice compared to sigmoids [14]. Thus, our third goal is to provide an initialization scheme for ReLUs that overcomes the resulting problems and provides dynamical isometry.

We start with our results about signal propagation for general activation functions. The proofs of all theorems are given in the supplementary material. As we show, the signal output distribution depends on the input distribution only via scalar products of the inputs. Higher order terms do not propagate through a network ensemble at initialization. In consequence, we can focus on the distribution of such scalar products later on to derive meaningful criteria for the trainability of initialized deep neural networks.

2.2 General activation functions

Let's first assume that the signal $x$ of the previous layer is given. Then, each pre-activation component $h_i$ of the current layer is normally distributed as $h_i = \sum_{j=1}^{N_l} w_{ij} x_j + b_i \sim \mathcal{N}\big(0, \sigma_w^2 \sum_j x_j^2 + \sigma_b^2\big)$, since the weights and bias are independently normally distributed with zero mean. The non-linear monotonically increasing transformation $x_i = \phi(h_i)$ is distributed as $x_i \sim \Phi\big(\phi^{-1}(\cdot)/\sigma\big)$, where $\phi^{-1}$ denotes the generalized inverse of $\phi$, i.e.
$\phi^{-1}(x) := \inf\{y \in \mathbb{R} \mid \phi(y) \geq x\}$, $\Phi$ the cumulative distribution function (cdf) of a standard normal random variable, and $\sigma^2 = \sigma_w^2 |x|^2 + \sigma_b^2$. Thus, we only need to know the distribution of $|x|^2$ as input to compute the distribution of $x_i$. The signal propagation is thus reduced to a 1-dimensional problem. Note that the assumption of equal $\sigma_w^2$ for all incoming edges into a neuron is crucial for this result. Otherwise, $h_i \sim \mathcal{N}\big(0, \sum_j \sigma_{w,j}^2 x_j^2 + \sigma_{b,i}^2\big)$ would require knowledge of the distribution of $\sum_j \sigma_{w,j}^2 x_j^2$, which depends on the parameters $\sigma_{w,j}^2$. Based on $\sigma_{w,j}^2 = \sigma_w^2$, however, we can compute the probability distribution of outputs.

Proposition 1. Let the probability density $p_0(z)$ of the squared input norm $|x^{(0)}|^2 = \sum_{i=1}^{N_0} \big(x^{(0)}_i\big)^2$ be known. Then, the distribution $p_l(z)$ of the squared signal norm $|x^{(l)}|^2$ depends only on the distribution $p_{l-1}(z)$ of the previous layer as transformed by a linear operator $T_l : L^1(\mathbb{R}^+) \to L^1(\mathbb{R}^+)$ so that $p_l = T_l(p_{l-1})$. $T_l$ is defined as

$$T_l(p)[z] = \int_0^\infty k_l(y, z)\, p(y)\, dy, \qquad (1)$$

where $k_l(y, z)$ is the distribution of the squared signal $z$ at layer $l$ given the squared signal $y$ at the previous layer, so that $k_l(y, z) = p^{*N_l}_{\phi(h_y)^2}(z)$, where $*$ stands for convolution and $p_{\phi(h_y)^2}(z)$ denotes the distribution of the squared transformed pre-activation $h_y$, which is normally distributed as $h_y \sim \mathcal{N}\big(0, \sigma_w^2 y + \sigma_b^2\big)$. This distribution serves to compute the cumulative distribution function (cdf) of each signal component $x^{(l)}_i$ as

$$F_{x^{(l)}_i}(x) = \int_0^\infty dz\, p_{l-1}(z)\, \Phi\!\left(\frac{\phi^{-1}(x)}{\sqrt{\sigma_w^2 z + \sigma_b^2}}\right), \qquad (2)$$

where $\phi^{-1}$ denotes the generalized inverse of $\phi$ and $\Phi$ the cdf of a standard normal random variable. Accordingly, the components are jointly distributed as

$$F_{x^{(l)}_1, \ldots, x^{(l)}_{N_l}}(x) = \int_0^\infty dz\, p_{l-1}(z)\, \Pi_{i=1}^{N_l} \Phi\!\left(\frac{\phi^{-1}(x_i)}{\sigma_z}\right), \qquad (3)$$

where we use the abbreviation $\sigma_z = \sqrt{\sigma_w^2 z + \sigma_b^2}$. As common, the $N$-fold convolution of a function $f \in L^1(\mathbb{R}^+)$ is defined as repeated convolution with $f$, i.e., by induction, $f^{*N}(z) = f * f^{*(N-1)}(z) = \int_0^z f(x) f^{*(N-1)}(z - x)\, dx$.

In Prop. 1, we note the radial symmetry of the output distribution. It only depends on the squared norm of the input. For a single input $x^{(0)}$, $p_0(z)$ is given by the indicator function $p_0(z) = 1_{|x^{(0)}|^2}(z)$. Interestingly, mean field analysis also focuses on the average of the squared signal, which is likewise updated layer-wise.

Figure 1: Layer-wise transition of the squared signal norm distribution for ReLUs with He initialization parameters $\sigma_w = \sqrt{2/N_l}$, $\sigma_b = 0$. (a) Squared signal norm distribution at different depths for $N_l = 200$; the initial distribution ($L = 0$) is defined by MNIST. (b) Eigenvalues $\lambda$ corresponding to eigenfunctions $y^m$ of $T_l$; $N_l = 10$ (black circles), $N_l = 20$ (blue triangles), $N_l = 100$ (red +).

Prop. 1 explains and justifies the focus of mean field theory on the squared signal norm. More information is not transmitted from layer to layer to determine the state (distribution) of a single neuron. The difference to mean field theory here is that we regard the full distribution $p_{l-1}$ of the previous layer instead of only its average on infinitely large layers. The linear operator $T_l$ governs this distribution: $p_{x^{(L)}} = \prod_{l=1}^{L} T_l\, p_{x^{(0)}}$, where the product is defined by function composition. Hence, the linear operator $\prod_{l=1}^{L} T_l$ can also be interpreted as the Jacobian corresponding to the (linear) function that maps the squared input norm distribution to the squared output norm distribution. $T_l$ is different from the signal input-output Jacobian studied in mean field random matrix theory; yet, its spectral properties can also inform us about the trainability of the network ensemble. Conveniently, we only have to study one spectrum and not a distribution of eigenvalues that are potentially coupled as in random matrix theory. For any nonlinear activation function, $T_l$ can be approximated numerically on an equidistant grid. The convolution in the kernel definition can be computed efficiently with the help of Fast Fourier Transformations. The eigenvalues of the matrix approximating $T_l$ define the approximate signal propagation along the eigendirections.

However, we only receive the full picture when we extend our study to the joint output distribution, i.e., the outputs corresponding to different inputs.

Proposition 2.
The same component of pre-activations of signals $h_1, \ldots, h_D$ corresponding to different inputs $x^{(0)}_1, \ldots, x^{(0)}_D$ is jointly normally distributed with zero mean and covariance matrix $V$ defined by

$$v_{ij} = \mathrm{Cov}(h_i, h_j) = \sigma_w^2 \langle x_i, x_j \rangle + \sigma_b^2 \qquad (4)$$

for $i, j = 1, \ldots, D$, conditional on the signals $x_i$ of the previous layer corresponding to $x^{(0)}_i$, where $D$ denotes the number of data points.

After non-linear activation, the signals are not jointly normally distributed anymore. But their distribution is a function only of the squared norms and scalar products between signals of the previous layer. Thus, it is sufficient to propagate the joint distribution of the three variables $|x_1|^2$, $|x_2|^2$, $\langle x_1, x_2 \rangle$ through the layers to determine the joint output distribution of two signals $x_1$ and $x_2$ corresponding to different inputs. No other information about the joint distribution of inputs, e.g., higher moments, can influence the ensemble output distribution and thus our choice of weight and bias parameters. In consequence, the focus on the corresponding quantities in mean field theory is justified for Gaussian parameter initialization and does not require any approximation. Yet, the mean field assumption that pre-activation signals are exactly normally distributed, and not only conditionally on the previous signal, is approximate. Accordingly, the output distribution for finite neural networks does not follow a Gaussian process with average covariance matrix $V$ as in mean field theory [11].
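Eq. (4) lends itself to a direct numerical check. The following sketch (layer sizes, parameter values, and the seed are illustrative choices, not taken from the paper's experiments) draws many Gaussian weight and bias realizations, computes the same pre-activation component for two fixed inputs, and compares the empirical covariance with $\sigma_w^2 \langle x_i, x_j \rangle + \sigma_b^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and parameters (not the paper's experimental values).
N_prev = 100
sigma_w2, sigma_b2 = 2.0 / N_prev, 0.1

# Two fixed signals of the previous layer.
x1 = rng.normal(size=N_prev)
x2 = rng.normal(size=N_prev)

# Predicted covariance of the same pre-activation component, Eq. (4):
# v_ij = sigma_w^2 <x_i, x_j> + sigma_b^2.
V_pred = sigma_w2 * np.array([[x1 @ x1, x1 @ x2],
                              [x1 @ x2, x2 @ x2]]) + sigma_b2

# Empirical covariance over many independent weight/bias realizations.
n_samples = 50_000
W = rng.normal(0.0, np.sqrt(sigma_w2), size=(n_samples, N_prev))
b = rng.normal(0.0, np.sqrt(sigma_b2), size=n_samples)
h1, h2 = W @ x1 + b, W @ x2 + b
V_emp = np.cov(np.stack([h1, h2]))

print(np.abs(V_emp - V_pred).max())  # close to zero
```

Note that, as stated above, this Gaussianity holds only conditionally on the previous layer's signals, which is exactly where finite networks depart from the mean field picture.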
V follows a probability distribution that is determined by the previous layers. For the initialization scheme for ReLUs that we propose later, we can state the distribution of $V$ explicitly. First, however, we analyze ReLUs in the standard framework and specialize the above theorems.

2.3 Rectified Linear Units (ReLUs)

The minimum initialization criterion to avoid vanishing or exploding gradients is to preserve the expected squared signal norm. For finite networks, this is given as follows.

Corollary 3. For ReLUs, the expectation value of the squared signal norm conditional on the squared signal norm of the previous layer is given by:

$$\mathbb{E}\big(|x^{(l)}|^2 \,\big|\, |x^{(l-1)}|^2 = y\big) = \big(\sigma_{w,l}^2\, y + \sigma_{b,l}^2\big)\, \frac{N_l}{2}. \qquad (5)$$

Consequently, the expectation of the final squared signal norm depends on the initial input as:

$$\mathbb{E}\big(|x^{(L)}|^2 \,\big|\, |x^{(0)}|^2\big) = |x^{(0)}|^2\, \Pi_{l=1}^{L} \frac{N_l \sigma_{w,l}^2}{2} + \sigma_{b,L}^2 \frac{N_L}{2} + \sum_{l=1}^{L-1} \sigma_{b,l}^2 \frac{N_l}{2}\, \Pi_{n=l+1}^{L} \frac{N_n \sigma_{w,n}^2}{2}. \qquad (6)$$

Similar relations for the expected signal components and their variance follow from Eq. (6) and are covered in the supplementary material. [6] has derived a simpler version of Eq. (6) for equal $\sigma_{w,l}$ and $N_l$ across layers.

A straightforward way to preserve the average squared signal, or the squared output signal norm distribution, is exactly the He initialization $\sigma_{b,l} = 0$ and $\sigma_{w,l} = \sqrt{2/N_l}$ [7], which is also confirmed by mean field analysis. Yet, we have many more choices even when $\sigma_{b,l}^2 = 0$. We only need to fulfill one condition, i.e., $0.5^L\, \Pi_{l=1}^{L} N_l \sigma_{w,l}^2 \approx 1$. In case that we normalize the input so that $|x^{(0)}|^2 = 1$, $\sigma_{b,l}^2 \neq 0$ is also a valid option, and we have $2L - 1$ degrees of freedom.

There remains the question whether there exist further criteria to be fulfilled that improve the trainability of the initial network ensemble. The whole output distribution could provide those, and its derivation is given in the supplementary material. According to Prop. 2, it is guided by the layer-wise joint distribution of the variables $\big(|x_1|^2, |x_2|^2, \langle x_1, x_2 \rangle\big)$ given $\big(|x_1|^2, |x_2|^2, \langle x_1, x_2 \rangle\big)$ of the previous layer. As this is computationally intensive to obtain, we focus on marginals, i.e., the distributions of $|x|^2$ and $\langle x_1, x_2 \rangle$. These are sufficient to highlight several drawbacks of the initialization approach and provide us with insights to propose an alternative that overcomes these shortcomings.

First, we focus on $|x^{(l)}|^2$, derive a closed-form solution for the integral kernel $k_l(y, z)$ of $T_l$ in Prop. 1, and analyze some of its spectral properties for ReLUs. This allows us to reason about the shape of the stationary distribution of $T_l$, i.e., the limit output distribution for networks of increasing depth.

Proposition 4. For ReLUs, the linear operator $T_l$ in Prop. 1 is defined by

$$k_l(y, z) = 0.5^{N_l} \left( \delta_0(z) + \sum_{k=1}^{N_l} \binom{N_l}{k} \frac{1}{\sigma_y^2}\, p_{\chi^2_k}\!\left(\frac{z}{\sigma_y^2}\right) \right), \qquad (7)$$

with $\sigma_y = \sqrt{\sigma_w^2 y + \sigma_b^2}$, where $\delta_0(z)$ denotes the $\delta$-distribution peaked at 0 and $p_{\chi^2_k}$ the density of the $\chi^2$ distribution with $k$ degrees of freedom. For $\sigma_b = 0$, the functions $f_m(y) = y^m\, 1_{]0,\infty]}(y)$ are eigenfunctions of $T_l$ for any $m \in \mathbb{R}$ (even though they are not elements of $L^1(\mathbb{R}^+)$ and thus not normalizable as probability measures) with corresponding eigenvalue $\lambda_{l,m} \in \mathbb{R}$: $T_l f_m = \lambda_{l,m} f_m$ with

$$\lambda_{l,m} = 0.5^{N_l + m + 1}\, \frac{1}{\sigma_w^{2m+2}} \sum_{k=1}^{N_l} \binom{N_l}{k} \frac{\Gamma(k/2 - m - 1)}{\Gamma(k/2)}. \qquad (8)$$

Note that, for $\sigma_b = 0$, the eigenfunctions $y^m$ cannot be normalized on $\mathbb{R}^+$, as the antiderivative diverges at zero for $m \leq -1$. However, if we discretize $T_l$ in numerical experiments, they can be normalized, and the real eigenvectors representing probability distributions attain shapes $\approx y^m$. Fig. 1a provides an example of the output distribution for 9 layers, each consisting of $N_l = 200$ neurons with He initialization parameters. The average squared signal norm is indeed preserved, but the distribution becomes more right-tailed for deeper layers. Fig. 1b shows the corresponding eigenvalues of $T_l$ as in Prop. 4. In summary, we observe a window $m_{\mathrm{crit}} < m \leq -1$ with eigenvalues $\lambda_{l,m} < 1$. Specifically, for the He values $\sigma_{b,l} = 0$ and $\sigma_{w,l} = \sqrt{2/N_l}$, numerical experiments reveal the relation $m_{\mathrm{crit}} \approx -3.2559793 - 1.6207083\, N_l$. Signal parts within this window are damped down in deeper layers, while the remaining parts explode. Only $y^{m_{\mathrm{crit}}}$ is preserved through the layers and depends on the choice of $\sigma_{w,l}$. Interestingly, for $m = -1$, $\lambda_{l,m}$ is independent of $\sigma_{w,l}$ and given by $\lambda_{l,-1} = 1 - 0.5^{N_l}$. Thus, it approaches $\lambda_{l,-1} = 1$ for increasing $N_l$. For the He initialization, $y^{m_{\mathrm{crit}}}$ converges to $\delta_0(y)$ for increasing $N_l$. In contrast to mean field analysis, not the whole space of eigenfunctions corresponds to eigenvalue 1 for the He initialization.
In particular, eigenvalues bigger than one exist that can be problematic for exploding gradients. To reduce their number, broader layers promise better protection, as do smaller values of $\sigma_{w,l}$. Ultimately, we care about the product of layer-wise eigenvalues, i.e., the eigenvalues of $\Pi_l T_l$. Again, setting these to 1 imposes a constraint only on the product $\Pi_l \sigma_{w,l}^2$, like in Cor. 3. Hence, we gain no additional constraint on our initial parameters and have no means to prevent eigenvalues larger than 1.

The biggest challenge for trainability, however, is the ability to differentiate similar signals. We therefore study the evolution of the cosine similarity $\langle x^{(l)}_1, x^{(l)}_2 \rangle$ of two inputs $x^{(0)}_1$ and $x^{(0)}_2$, or of the unnormalized scalar product, through layers $l$.

Theorem 5. For ReLUs, let $x_1 = \phi(h_1)$, $x_2 = \phi(h_2)$ be the same signal component, i.e., neuron, where each corresponds to a different input $x^{(0)}_1$ or $x^{(0)}_2$. Let the correlation $\rho = v_{12}/\sqrt{v_{11} v_{22}}$ of the pre-activations $h_1, h_2$ be given, where $V$ denotes the 2-dimensional covariance matrix as defined in Prop. 2. Then, the correlation after non-linear activation is

$$\mathrm{Cor}(x_1, x_2) = \frac{\sqrt{1 - \rho^2} - 1 + 2\pi \rho\, g(\rho)}{\pi - 1}, \qquad (9)$$

where $g(\rho)$ is defined as $g(\rho) = \frac{1}{\sqrt{2\pi}} \int_0^\infty \Phi\big(u \rho / \sqrt{1 - \rho^2}\big) \exp\big(-\tfrac{1}{2} u^2\big)\, du$ for $|\rho| \neq 1$, and $g(-1) = 0$, $g(1) = 0.5$. The average of the sum of all components $\mathbb{E}(\langle x_1, x_2 \rangle)$ conditional on the previous layer is:

$$\mathbb{E}(\langle x_1, x_2 \rangle \mid \rho) = N_l \sqrt{v_{11} v_{22}} \left( g(\rho)\, \rho + \frac{\sqrt{1 - \rho^2}}{2\pi} \right) \approx \frac{N_l}{4} \sqrt{v_{11} v_{22}}\, (\rho + 1). \qquad (10)$$

Furthermore, conditional on the signals of the previous layer, $\langle x_1, x_2 \rangle$ is distributed as $f_{\mathrm{prod}}^{*N_l}(t)$, where $f_{\mathrm{prod}}(y) = (1 - g(\rho))\, \delta_0(y) + \frac{\sqrt{v_{11} v_{22}}}{2\pi \det(V)} \exp\!\big(\frac{v_{12}\, y}{\det(V)}\big)\, K_0\!\big(\frac{\sqrt{v_{11} v_{22}}\, y}{\det(V)}\big)$ for $y > 0$, and $K_0(w) = \int_0^\infty \cos(w \sinh(t))\, dt$ denotes the modified Bessel function of the second kind.

Note that [2] studies a similar integral, but in the mean field limit. The correlation of the signal components only depends on $\rho$ (and is always smaller than $\rho$). Analogous to the c-map in mean field approaches [18], the actual quantity of interest would be the distribution of the correlation $\rho = \frac{\sigma_w^2 \langle x_1, x_2 \rangle + \sigma_b^2}{\sqrt{(\sigma_w^2 |x_1|^2 + \sigma_b^2)(\sigma_w^2 |x_2|^2 + \sigma_b^2)}}$. Interestingly, for $\sigma_b = 0$, $\rho = \frac{\langle x_1, x_2 \rangle}{|x_1| |x_2|}$ coincides with the cosine similarity of the two signals. The preservation of this quantity on average has been shown to be the most indicative criterion for trainability of ReLU residual networks [24]. We therefore take a closer look at its distribution. To save computational time and space, we sample $N_l$ components iid from a 2-dimensional normal distribution as introduced in Prop. 2, transform the components by ReLUs to obtain two vectors $x_1$ and $x_2$, and calculate their cosine similarity. Repeating this procedure $10^6$ times results in Fig.
2. First, we note that correlations can only be positive after the first layer, since all signal components are positive (or zero) after transformation by ReLUs. Negative cosine similarities cannot be propagated through Gaussian ReLU ensembles. Data transformation to obtain positive inputs can mitigate this issue. Yet, Eq. (10) highlights an unavoidable problem for deep models: the average cosine similarity increases from layer to layer until it reaches 1 at high depth. Then, all signals become parallel and thus indistinguishable.

While this effect cannot be mitigated completely within our initialization scheme, a slightly smaller choice of $\sigma_w$ than the He initialization reduces the average cosine similarity, and a smaller number of neurons in one layer increases the variance of the cosine similarity, as shown in Fig. 2. A higher variance increases the probability that smaller values of the cosine similarity can be propagated. We hypothesize that this effect contributes to the better trainability of ReLU neural networks with DropOut or DropConnect [22], since both reduce the effective number of neurons $N_l$. Yet, a smaller $\sigma_w$ and DropOut (or DropConnect) introduce a risk of vanishing gradients in deep neural networks [17]. An adjustment of $\sigma_w$ by the DropOut rate to avoid this effect [17] would also destroy possible beneficial effects on the cosine similarity.

For smaller $\sigma^2 = 1/N$, [1] observes a phenomenon related to the cosine similarity, i.e., shattered gradients. However, in this setting, the effects of vanishing gradients and increasing correlations are indistinguishable. In fact, the authors observe decreasing correlations, while we highlight the problem of increasing ones for the He initialization. Interestingly, in the He case ($\sigma^2 = 2/N$), [1] finds that also batch normalization cannot provide better trainability.
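The sampling procedure behind Fig. 2 is easy to reproduce in a few lines (the width, repetition budget, and seed below are our own illustrative choices). Consistent with Eq. (10), the mean cosine similarity after one ReLU layer exceeds the correlation $\rho$ of the pre-activations:

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine_similarity_samples(rho, N, n_rep=20_000):
    """Sample the cosine similarity of two ReLU-transformed signals whose
    pre-activation components are iid bivariate normal with correlation rho
    (the procedure described for Fig. 2; n_rep is an illustrative budget)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    h = rng.multivariate_normal(np.zeros(2), cov, size=(n_rep, N))
    x1, x2 = np.maximum(h[..., 0], 0.0), np.maximum(h[..., 1], 0.0)  # ReLU
    num = np.sum(x1 * x2, axis=1)
    den = np.linalg.norm(x1, axis=1) * np.linalg.norm(x2, axis=1)
    return num / den

# Mean cosine similarity after one layer exceeds the input correlation rho:
for rho in (0.0, 0.25, 0.5):
    print(rho, cosine_similarity_samples(rho, N=100).mean())
```

All sampled similarities are non-negative, illustrating that negative cosine similarities cannot survive a Gaussian ReLU layer.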
For "typical" inputs that are shown to be common in networks with batch normalization (but not in networks without, which we study here), the covariance between outputs corresponding to different inputs decays exponentially.

Figure 2: Probability distribution of the cosine similarity conditional on the previous layer with $|x_1| = |x_2| = 1$ and $\langle x_1, x_2 \rangle$ equal to 0 (circles), 0.25 (squares), 0.5 (diamonds), and 0.75 (triangles) for $N_l = 100$ (dashed lines and filled symbols) and $N_l = 200$ (lines and unfilled symbols) neurons.

We therefore discuss an alternative solution that [1] proposes also for convolutional and residual neural networks and that has first been introduced by [21].

3 Alternative initialization of fully-connected ReLU deep neural networks

The issues of training deep neural networks with ReLUs are caused by the fact that negative signal can never propagate through the network and a neuron's state is zero in half of the cases. Hence, we discuss an initialization where the full signal is transmitted. We set the bias vector $b^{(l)}_i = 0$, and the weight matrices $W^{(l)} \in \mathbb{R}^{N_{l-1} \times N_l}$ are initially determined by a submatrix $W^{(l)}_0 \in \mathbb{R}^{\frac{N_{l-1}}{2} \times \frac{N_l}{2}}$ as

$$W^{(l)} = \begin{bmatrix} W^{(l)}_0 & -W^{(l)}_0 \\ -W^{(l)}_0 & W^{(l)}_0 \end{bmatrix}.$$

Regardless of the choice of $W^{(l)}_0$, we receive a signal vector $x^{(l)}$ where half of the entries correspond to the positive part of the pre-activations and the second half to the negative part, i.e., if $i \leq N_l/2$ and $h^{(l)}_i = \sum_j w^{(l)}_{ij} x^{(l-1)}_j > 0$, we have $x^{(l)}_i = h^{(l)}_i$ and $x^{(l)}_{i + N_l/2} = 0$, or the other way around for $h^{(l)}_i < 0$. This way, effectively $h^{(l)}_i = \sum_{j=1}^{N_l/2} w^{(l)}_{0,ij} h^{(l-1)}_j$ is propagated so that we have initially
Note that we still have to train the full N_{l−1} N_l parameters of W^{(l)} and can learn non-linear functions. [21] observed that convolutional neural networks even trained the first layers to resemble linear neural networks, which inspired this choice of initialization.

In this setting, we have several good choices of W_0^{(l)}. First, we assume iid entries w_{0,ij}^{(l)} ∼ N(0, σ_{w,l}²) as before. We call this variant Gaussian submatrix (GSM) initialization. In this case, our assumptions from the previous sections are met, and we can repeat the analysis for networks of width N_l/2 and φ(h) = h, i.e., set the activation function to the identity. The same parameter choice as in Cor. 3, e.g., σ_{w,l}² = 2/N_l, preserves the variance and now also the cosine distance between signals corresponding to different inputs. The analysis is rather straightforward, and the output distribution is defined by the distribution of ∏_{l=0}^{L} W_0^{(l)} x^{(0)}; the product ∏_{l=0}^{L} W_0^{(l)} follows a product Wishart distribution with known spectrum [13, 14].

According to [14], however, dynamical isometry leads to better training results and speed. This demands an input-output Jacobian close to the identity, or a spectrum of the signal propagation operator T_l that is highly concentrated around 1. Previously, this was believed to be better achievable by tanh or sigmoid rather than by ReLU activation functions [14]. Yet, in the parameter sharing setting above, perfect dynamical isometry for h can be achieved by orthogonal W_0^{(l)}, i.e., W_0^{(l)} is drawn uniformly at random from all matrices W fulfilling W^T W = σ_{w,l}² I with σ_{w,l}² = 1.
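A Haar-distributed orthogonal W_0 can be drawn, for instance, via a QR decomposition of a Gaussian matrix (a standard construction; the sketch below is our illustration, not code from the paper). Because the initialized network acts linearly on h, the input-output Jacobian of the effective map is the product of the W_0^{(l)}, and for orthogonal factors its entire singular value spectrum sits at 1:

```python
import numpy as np

def random_orthogonal(n, rng):
    """Haar-uniform orthogonal matrix via QR of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))  # fix column signs to make the draw Haar-uniform

rng = np.random.default_rng(0)
n, depth = 50, 10
J = np.eye(n)
for _ in range(depth):
    J = random_orthogonal(n, rng) @ J  # Jacobian of the linear-acting network

s = np.linalg.svd(J, compute_uv=False)
# All singular values equal 1: the effective map satisfies dynamical isometry.
```

In contrast, a product of iid Gaussian matrices (the GSM case) has a spread-out singular value spectrum, which is why only the orthogonal variant achieves perfect dynamical isometry.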
This is our second initialization proposal in addition to GSM.

Alternatively, [25] recommends shifting the signal h_i^{(l)} by a non-zero bias b_i^{(l)} to enable negative signal to pass through a ReLU activation, instead of the proposed parameter sharing solution. We also considered a similar approach but decided in favor of the proposed one, as it is point-symmetric and therefore guarantees perfect dynamical isometry, is computationally less intensive, since it does not require computing a data-dependent bias b_i^{(l)} as in batch normalization, and is more convenient for theoretical analysis, which can rely on a rich literature on linear deep neural networks.

4 Experiments for different initialization schemes

Figure 3: Classification test accuracy on MNIST for different widths N, depths L, and weight initialization with parameters σ_b = 0, σ_w = √(2/N) for He and GSM initialization, and σ_w = 1 for orthogonal W_0, after 10^4 SGD steps. We report the average of 100 realizations and the corresponding 0.95 confidence interval. The right plot is a section of the left. Note that the legends apply to both plots.

We train fully-connected ReLU feed-forward networks of different depths, consisting of L = 1, ..., 10 hidden layers with the same number of neurons N_l = N = 100, 300, 500 and an additional softmax classification layer, on MNIST [10] and CIFAR-10 [9] to compare three different initialization schemes: the standard He initialization and our two proposals from Sec. 3, i.e., GSM and orthogonal weights. We focus on minimizing the cross-entropy by stochastic gradient descent (SGD) without batch normalization or any data augmentation techniques.
Hence, our goal is not to outperform the classification state of the art but to compare the initialization schemes under similar realistic conditions. Since deep networks normally require a smaller learning rate than networks with a small number of hidden layers, as in Ref. [14], we adapt the learning rate to (0.0001 + 0.003 · exp(−step/10^4))/L for MNIST and (0.00001 + 0.0005 · exp(−step/10^4))/L for CIFAR-10, for 10^4 SGD steps with a batch size of 100 in all cases. To reduce the number of parameters and speed up computations, we cropped the original CIFAR-10 images to size 28 × 28. For each configuration, we train 100 instances on MNIST and 30 instances on CIFAR-10 and report the average accuracy with a 0.95 confidence interval in Fig. 3 and Fig. 4, respectively. Each experiment on MNIST was run on 1 Nvidia GTX 1080 Ti GPU, while each experiment on CIFAR-10 was performed on 4 Nvidia GTX 1080 Ti GPUs.

Note that the accuracy on CIFAR-10 is lower than for convolutional architectures, as we restrict ourselves to deep fully-connected networks to focus on their trainability.

Figure 4: Classification test accuracy on CIFAR-10 for different widths N, depths L, and weight initialization with parameters σ_b = 0, σ_w = √(2/N) for He and GSM initialization, and σ_w = 1 for orthogonal W_0, after 10^4 SGD steps. We report the average of 30 realizations and the corresponding 0.95 confidence interval. The right plot is a section of the left.

[1] shows that a similar orthogonal initialization improves training results also for convolutional and residual neural networks. As suggested by our theoretical analysis, both proposed initialization schemes consistently outperform the He initialization and show stable training results, in particular for deeper network architectures, where the He-initialized networks decrease in accuracy.
GSM and orthogonal W_0 both perform better for higher widths N, while orthogonal W_0 seems to be the most reliable choice.

5 Discussion

We have introduced a framework for the analysis of deep fully-connected feed-forward neural networks at initialization with zero-mean normally distributed weights and biases. It is exact, does not rely on mean field approximations, provides distributional information about single and joint output signals, and applies to networks with arbitrary layer widths. It has led to the insight that only the scalar products between inputs determine the shape of the output distribution, which is not influenced by higher-order interaction terms.

Hence, for ReLUs, we have analyzed the propagation of these quantities through the deep neural network ensemble. While mean field analysis singles out only the He initialization for good training results, we have extended the number of possible parameter choices that avoid vanishing or exploding gradients. However, no parameter choice can avoid the tendency of signals to become more aligned with increasing depth. Deep ReLU Gaussian neural network ensembles cannot distinguish different input correlations and are therefore not well trainable without batch normalization. Even batch normalization does not guarantee the transmission of correlations between different inputs.

As a solution to this problem, we have discussed an alternative but simple initialization scheme that relies on initial parameter sharing. One variant guarantees perfect dynamical isometry. Experiments on MNIST and CIFAR-10 demonstrate that deeper fully-connected ReLU networks become better trainable with the proposed schemes than with the standard approach.

Acknowledgments

We would like to thank Joachim M. Buhmann and Alkis Gotovos for their valuable feedback on the manuscript and the reviewers for their insightful comments. This work was partially funded by the Swiss Heart Failure Network, PHRT122/2018DRI14 (J. M.
Buhmann, PI). RB was supported by a grant from the US National Cancer Institute (1R35CA220523).

References

[1] David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 342–350, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[2] Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 342–350. Curran Associates, Inc., 2009.

[3] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.

[4] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pages 249–256, 2010.

[5] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 582–591. Curran Associates, Inc., 2018.

[6] Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 571–581.
Curran Associates, Inc., 2018.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 1026–1034, Washington, DC, USA, 2015. IEEE Computer Society.

[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.

[9] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). Technical report, 2009.

[10] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

[11] Jaehoon Lee, Jascha Sohl-dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.

[12] Dmytro Mishkin and Jiri Matas. All you need is a good init. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

[13] Thorsten Neuschel. Plancherel–Rotach formulae for average characteristic polynomials of products of Ginibre random matrices and the Fuss–Catalan distribution. Random Matrices: Theory and Applications, 3(1), 2014.

[14] Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice.
In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4788–4798, 2017.

[15] Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1924–1932, 2018.

[16] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3360–3368. Curran Associates, Inc., 2016.

[17] Arnu Pretorius, Elan van Biljon, Steve Kroon, and Herman Kamper. Critical initialisation for deep signal propagation in noisy rectifier neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5717–5726. Curran Associates, Inc., 2018.

[18] Maithra Raghu, Ben Poole, Jon M. Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2847–2854, 2017.

[19] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[20] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation.
In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

[21] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, pages 2217–2225. JMLR.org, 2016.

[22] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1058–1066, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

[23] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5393–5402, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[24] Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 7103–7114. Curran Associates, Inc., 2017.

[25] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. A mean field theory of batch normalization. In International Conference on Learning Representations, 2019.

[26] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks?
In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3320–3328. Curran Associates, Inc., 2014.