{"title": "Self-Normalizing Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 971, "page_last": 980, "abstract": "Deep Learning has revolutionized vision via convolutional neural networks (CNNs) and natural language processing via recurrent neural networks (RNNs). However, success stories of Deep Learning with standard feed-forward neural networks (FNNs) are rare. FNNs that perform well are typically shallow and, therefore, cannot exploit many levels of abstract representations. We introduce self-normalizing neural networks (SNNs) to enable high-level abstract representations. While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation function of SNNs is the \"scaled exponential linear unit\" (SELU), which induces self-normalizing properties. Using the Banach fixed-point theorem, we prove that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance -- even in the presence of noise and perturbations. This convergence property of SNNs allows us to (1) train deep networks with many layers, (2) employ strong regularization, and (3) make learning highly robust. Furthermore, for activations not close to unit variance, we prove an upper and lower bound on the variance; thus, vanishing and exploding gradients are impossible. We compared SNNs on (a) 121 tasks from the UCI machine learning repository, on (b) drug discovery benchmarks, and on (c) astronomy tasks with standard FNNs and other machine learning methods such as random forests and support vector machines. For FNNs we considered (i) ReLU networks without normalization, (ii) batch normalization, (iii) layer normalization, (iv) weight normalization, (v) highway networks, and (vi) residual networks. 
SNNs significantly outperformed all competing FNN methods at 121 UCI tasks, outperformed all competing methods at the Tox21 dataset, and set a new record at an astronomy data set. The winning SNN architectures are often very deep.", "full_text": "Self-Normalizing Neural Networks\n\nGünter Klambauer\n\nThomas Unterthiner\n\nAndreas Mayr\n\nSepp Hochreiter\n\nLIT AI Lab & Institute of Bioinformatics,\n\nJohannes Kepler University Linz\n\n{klambauer,unterthiner,mayr,hochreit}@bioinf.jku.at\n\nA-4040 Linz, Austria\n\nAbstract\n\nDeep Learning has revolutionized vision via convolutional neural networks (CNNs) and natural language processing via recurrent neural networks (RNNs). However, success stories of Deep Learning with standard feed-forward neural networks (FNNs) are rare. FNNs that perform well are typically shallow and, therefore, cannot exploit many levels of abstract representations. We introduce self-normalizing neural networks (SNNs) to enable high-level abstract representations. While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation function of SNNs is the “scaled exponential linear unit” (SELU), which induces self-normalizing properties. Using the Banach fixed-point theorem, we prove that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance — even in the presence of noise and perturbations. This convergence property of SNNs allows us to (1) train deep networks with many layers, (2) employ strong regularization schemes, and (3) make learning highly robust. Furthermore, for activations not close to unit variance, we prove an upper and lower bound on the variance; thus, vanishing and exploding gradients are impossible. 
We compared SNNs on (a) 121 tasks from the UCI machine learning repository, on (b) drug discovery benchmarks, and on (c) astronomy tasks with standard FNNs and other machine learning methods such as random forests and support vector machines. For FNNs we considered (i) ReLU networks without normalization, (ii) batch normalization, (iii) layer normalization, (iv) weight normalization, (v) highway networks, and (vi) residual networks. SNNs significantly outperformed all competing FNN methods at 121 UCI tasks, outperformed all competing methods at the Tox21 dataset, and set a new record at an astronomy data set. The winning SNN architectures are often very deep.\n\n1 Introduction\n\nDeep Learning has set new records at different benchmarks and led to various commercial applications [21, 26]. Recurrent neural networks (RNNs) [15] achieved new levels at speech and natural language processing, for example at the TIMIT benchmark [10] or at language translation [29], and are already employed in mobile devices [24]. RNNs have won handwriting recognition challenges (Chinese and Arabic handwriting) [26, 11, 4] and Kaggle challenges, such as the “Grasp-and-Lift EEG” competition. Their counterparts, convolutional neural networks (CNNs) [20], excel at vision and video tasks. CNNs are on par with human dermatologists at the visual detection of skin cancer [8]. The visual processing for self-driving cars is based on CNNs [16], as is the visual input to AlphaGo, which has beaten one\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: The left panel and the right panel show the training error (y-axis) for feed-forward neural networks (FNNs) with batch normalization (BatchNorm) and self-normalizing networks (SNN) across update steps (x-axis) on the MNIST dataset and the CIFAR10 dataset, respectively. We tested networks with 8, 16, and 32 layers and learning rate 1e-5. 
FNNs with batch normalization exhibit high variance due to perturbations. In contrast, SNNs do not suffer from high variance, as they are more robust to perturbations and learn faster.\n\nof the best human Go players [27]. At vision challenges, CNNs are constantly winning, for example at the large ImageNet competition [19, 13], but also almost all Kaggle vision challenges, such as the “Diabetic Retinopathy” and the “Right Whale” challenges [7, 12].\n\nHowever, looking at Kaggle challenges that are not related to vision or sequential tasks, gradient boosting, random forests, or support vector machines (SVMs) are winning most of the competitions. Deep Learning is notably absent, and for the few cases where FNNs won, they are shallow. For example, the HIGGS challenge, the Merck Molecular Activity challenge, and the Tox21 Data challenge were all won by FNNs with at most four hidden layers. Surprisingly, it is hard to find success stories with FNNs that have many hidden layers, though they would allow for different levels of abstract representations of the input [2].\n\nTo robustly train very deep CNNs, batch normalization evolved into a standard to normalize neuron activations to zero mean and unit variance [17]. Layer normalization [1] also ensures zero mean and unit variance, while weight normalization [25] ensures zero mean and unit variance if the activations in the previous layer have zero mean and unit variance. Natural neural networks [6] also aim at normalizing the variance of activations by reparametrization of the weights. However, training with normalization techniques is perturbed by stochastic gradient descent (SGD), stochastic regularization (like dropout), and the estimation of the normalization parameters. Both RNNs and CNNs can stabilize learning via weight sharing; therefore, they are less prone to these perturbations. 
In contrast, FNNs trained with normalization techniques suffer from these perturbations and have high variance in the training error (see Figure 1). This high variance hinders learning and slows it down. Furthermore, strong regularization, such as dropout, is not possible, as it would further increase the variance, which in turn would lead to divergence of the learning process. We believe that this sensitivity to perturbations is the reason that FNNs are less successful than RNNs and CNNs.\n\nSelf-normalizing neural networks (SNNs) are robust to perturbations and do not have high variance in their training errors (see Figure 1). SNNs push neuron activations to zero mean and unit variance, thereby leading to the same effect as batch normalization, which enables robust learning across many layers. SNNs are based on scaled exponential linear units (“SELUs”), which induce self-normalizing properties like variance stabilization, which in turn avoids exploding and vanishing gradients.\n\n2 Self-normalizing Neural Networks (SNNs)\n\nNormalization and SNNs. For a neural network with activation function f, we consider two consecutive layers that are connected by a weight matrix W. Since the input to a neural network is a random variable, the activations x in the lower layer, the network inputs z = Wx, and the activations y = f(z) in the higher layer are random variables as well. We assume that all activations xi of the lower layer have mean µ := E(xi) and variance ν := Var(xi). An activation y in the higher layer has mean µ̃ := E(y) and variance ν̃ := Var(y). Here E(.)
denotes the expectation and Var(.) the variance of a random variable. A single activation y = f(z) has net input z = w^T x. For n units with activations xi, 1 ≤ i ≤ n, in the lower layer, we define n times the mean of the weight vector w ∈ R^n as ω := Σ_{i=1}^{n} wi and n times the second moment as τ := Σ_{i=1}^{n} wi^2.\n\nWe consider the mapping g that maps mean and variance of the activations from one layer to mean and variance of the activations in the next layer, g : (µ, ν) ↦ (µ̃, ν̃). Normalization techniques like batch, layer, or weight normalization ensure a mapping g that keeps (µ, ν) and (µ̃, ν̃) close to predefined values, typically (0, 1).\n\nDefinition 1 (Self-normalizing neural net). A neural network is self-normalizing if it possesses a mapping g : Ω → Ω for each activation y that maps mean and variance from one layer to the next and has a stable and attracting fixed point depending on (ω, τ) in Ω. Furthermore, the mean and the variance remain in the domain Ω, that is, g(Ω) ⊆ Ω, where Ω = {(µ, ν) | µ ∈ [µmin, µmax], ν ∈ [νmin, νmax]}. When iteratively applying the mapping g, each point within Ω converges to this fixed point.\n\nTherefore, we consider activations of a neural network to be normalized if both their mean and their variance across samples are within predefined intervals. If mean and variance of x are already within these intervals, then mean and variance of y also remain in these intervals, i.e., the normalization is transitive across layers. Within these intervals, the mean and variance both converge to a fixed point if the mapping g is applied iteratively.\n\nTherefore, SNNs keep the normalization of activations when propagating them through layers of the network. 
The normalization effect is observed across layers of a network: in each layer the activations get closer to the fixed point. The normalization effect can also be observed for two fixed layers across learning steps: perturbations of lower-layer activations or weights are damped in the higher layer by drawing the activations towards the fixed point. If, for all y in the higher layer, ω and τ of the corresponding weight vector are the same, then the fixed points are also the same. In this case we have a unique fixed point for all activations y. Otherwise, in the more general case, ω and τ differ for different y, but the mean activations are drawn into [µmin, µmax] and the variances are drawn into [νmin, νmax].\n\nConstructing Self-Normalizing Neural Networks. We aim at constructing self-normalizing neural networks by adjusting the properties of the function g. Only two design choices are available for the function g: (1) the activation function and (2) the initialization of the weights.\n\nFor the activation function, we propose “scaled exponential linear units” (SELUs) to render an FNN self-normalizing. The SELU activation function is given by\n\nselu(x) = λ { x if x > 0 ; α e^x − α if x ≤ 0 } .  (1)\n\nSELUs allow us to construct a mapping g with properties that lead to SNNs. SNNs cannot be derived with (scaled) rectified linear units (ReLUs), sigmoid units, tanh units, or leaky ReLUs. The activation function is required to have (1) negative and positive values for controlling the mean, (2) saturation regions (derivatives approaching zero) to dampen the variance if it is too large in the lower layer, (3) a slope larger than one to increase the variance if it is too small in the lower layer, and (4) a continuous curve. The latter ensures a fixed point, where variance damping is equalized by variance increasing. 
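The piecewise definition in Eq. (1) translates directly into a few lines of code. A minimal NumPy sketch (our own, not the authors' reference implementation), using the high-precision values of λ01 ≈ 1.0507 and α01 ≈ 1.6733 that the paper derives below for the fixed point (0, 1):

```python
import numpy as np

# Parameter values for the fixed point (0, 1), as derived in the text
# (alpha_01 ~ 1.6733, lambda_01 ~ 1.0507).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """Scaled exponential linear unit, Eq. (1)."""
    x = np.asarray(x, dtype=float)
    return LAMBDA * np.where(x > 0, x, ALPHA * np.exp(x) - ALPHA)
```

For large negative inputs, selu saturates at −λα ≈ −1.7581, the value that becomes relevant for the "alpha dropout" technique described later.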
We met these properties of the activation function by multiplying the exponential linear unit (ELU) [5] with λ > 1 to ensure a slope larger than one for positive net inputs.\n\nFor the weight initialization, we propose ω = 0 and τ = 1 for all units in the higher layer. The next paragraphs will show the advantages of this initialization. Of course, during learning these assumptions on the weight vector will be violated. However, we can prove the self-normalizing property even for weight vectors that are not normalized; therefore, the self-normalizing property can be kept during learning and weight changes.\n\nDeriving the Mean and Variance Mapping Function g. We assume that the xi are independent of each other but share the same mean µ and variance ν. Of course, the independence assumption is not fulfilled in general. We will elaborate on the independence assumption below. The network input z in the higher layer is z = w^T x, for which we can infer the following moments: E(z) = Σ_{i=1}^{n} wi E(xi) = µω and Var(z) = Var(Σ_{i=1}^{n} wi xi) = ντ, where we used the independence of the xi. The net input z is a weighted sum of independent, but not necessarily identically distributed, variables xi, for which the central limit theorem (CLT) states that z approaches a normal distribution: z ∼ N(µω, √(ντ)) with density pN(z; µω, √(ντ)). According to the CLT, the larger n, the closer z is to a normal distribution. For Deep Learning, broad layers with hundreds of neurons xi are common. Therefore, the assumption that z is normally distributed is met well for most currently used neural networks (see Supplementary Figure S7). 
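The two moment identities E(z) = µω and Var(z) = ντ are easy to check empirically. A small sketch under the stated assumptions (independent xi sharing mean µ and variance ν; all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256                        # a moderately wide layer
mu, nu = 0.1, 1.2              # mean and variance of lower-layer activations
w = rng.normal(0.0, np.sqrt(1.0 / n), size=n)
omega, tau = w.sum(), (w ** 2).sum()

# Draw many samples of the lower layer and form the net input z = w^T x.
x = rng.normal(mu, np.sqrt(nu), size=(20_000, n))
z = x @ w

print(z.mean(), mu * omega)    # empirical mean vs. mu * omega
print(z.var(), nu * tau)       # empirical variance vs. nu * tau
```

A histogram of z would also look very close to Gaussian here, illustrating the CLT argument for wide layers.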
The function g maps the mean and variance of activations in the lower layer to the mean µ̃ = E(y) and variance ν̃ = Var(y) of the activations y in the next layer:\n\ng : (µ, ν) ↦ (µ̃, ν̃) :  µ̃(µ, ω, ν, τ) = ∫_{−∞}^{∞} selu(z) pN(z; µω, √(ντ)) dz ,  ν̃(µ, ω, ν, τ) = ∫_{−∞}^{∞} selu(z)^2 pN(z; µω, √(ντ)) dz − µ̃^2 .  (2)\n\nThese integrals can be analytically computed and lead to the following mappings of the moments:\n\nµ̃ = (λ/2) ( µω + µω erf( µω / √(2ντ) ) + √(2/π) √(ντ) e^{−(µω)^2/(2ντ)} + α e^{µω + ντ/2} erfc( (µω + ντ) / √(2ντ) ) − α erfc( µω / √(2ντ) ) )  (3)\n\nν̃ = (λ^2/2) ( ((µω)^2 + ντ) (2 − erfc( µω / √(2ντ) )) + √(2/π) µω √(ντ) e^{−(µω)^2/(2ντ)} + α^2 ( −2 e^{µω + ντ/2} erfc( (µω + ντ) / √(2ντ) ) + e^{2(µω + ντ)} erfc( (µω + 2ντ) / √(2ντ) ) + erfc( µω / √(2ντ) ) ) ) − µ̃^2  (4)\n\nStable and Attracting Fixed Point (0, 1) for Normalized Weights. We assume a normalized weight vector w with ω = 0 and τ = 1. Given a fixed point (µ, ν), we can solve equations Eq. (3) and Eq. (4) for α and λ. We chose the fixed point (µ, ν) = (0, 1), which is typical for activation normalization. We obtain the fixed-point equations µ̃ = µ = 0 and ν̃ = ν = 1, which we solve for α and λ, and obtain the solutions α01 ≈ 1.6733 and λ01 ≈ 1.0507, where the subscript 01 indicates that these are the parameters for the fixed point (0, 1). The analytical expressions for α01 and λ01 are given in Supplementary Eq. (8). We are interested in whether the fixed point (µ, ν) = (0, 1) is stable and attracting. If the Jacobian of g has a norm smaller than 1 at the fixed point, then g is a contraction mapping and the fixed point is stable. The (2×2)-Jacobian J(µ, ν) of g : (µ, ν) ↦ (µ̃, ν̃) evaluated at the fixed point (0, 1) with α01 and λ01 is J(0, 1) = ((0.0, 0.088834), (0.0, 0.782648)). The spectral norm of J(0, 1) (its largest singular value) is 0.7877 < 1. That means g is a contraction mapping around the fixed point (0, 1) (the mapping is depicted in Figure 2). Therefore, (0, 1) is a stable fixed point of the mapping g. The norm of the Jacobian also determines the convergence rate as a consequence of the Banach fixed-point theorem. The convergence rate around the fixed point (0, 1) is about 0.78. In general, the convergence rate depends on ω, µ, ν, τ and is between 0.78 and 1.\n\nStable and Attracting Fixed Points for Unnormalized Weights. A normalized weight vector w cannot be ensured during learning. 
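The fixed point and the contraction property can be checked numerically from the closed forms in Eqs. (3) and (4). The following sketch (helper names are ours; erf/erfc come from Python's standard library) iterates the mapping g for ω = 0, τ = 1, starting away from (0, 1):

```python
from math import erf, erfc, exp, pi, sqrt

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805

def g(mu, nu, omega=0.0, tau=1.0, lam=LAMBDA, alpha=ALPHA):
    """One application of the moment mapping, Eqs. (3) and (4)."""
    m, s2 = mu * omega, nu * tau          # mean and variance of the net input z
    s = sqrt(s2)
    mu_t = 0.5 * lam * (
        m + m * erf(m / (sqrt(2) * s))
        + sqrt(2 / pi) * s * exp(-m * m / (2 * s2))
        + alpha * exp(m + s2 / 2) * erfc((m + s2) / (sqrt(2) * s))
        - alpha * erfc(m / (sqrt(2) * s))
    )
    nu_t = 0.5 * lam ** 2 * (
        (m * m + s2) * (2 - erfc(m / (sqrt(2) * s)))
        + sqrt(2 / pi) * m * s * exp(-m * m / (2 * s2))
        + alpha ** 2 * (
            -2 * exp(m + s2 / 2) * erfc((m + s2) / (sqrt(2) * s))
            + exp(2 * (m + s2)) * erfc((m + 2 * s2) / (sqrt(2) * s))
            + erfc(m / (sqrt(2) * s))
        )
    ) - mu_t ** 2
    return mu_t, nu_t

mu, nu = 0.5, 2.0                         # start away from the fixed point
for _ in range(100):
    mu, nu = g(mu, nu)
print(mu, nu)                             # approaches the fixed point (0, 1)
```

Each iteration shrinks the deviation by roughly the factor 0.78 quoted in the text, so the iterate is driven to (0, 1).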
For SELU parameters α = α01 and λ = λ01, we show in the next theorem that if (ω, τ) is close to (0, 1), then g still has an attracting and stable fixed point that is close to (0, 1). Thus, in the general case there still exists a stable fixed point which, however, depends on (ω, τ). If we restrict (µ, ν, ω, τ) to certain intervals, then we can show that (µ, ν) is mapped to the respective intervals. Next we present the central theorem of this paper, from which it follows that SELU networks are self-normalizing under mild conditions on the weights.\n\nTheorem 1 (Stable and Attracting Fixed Points). We assume α = α01 and λ = λ01. We restrict the range of the variables to the intervals µ ∈ [−0.1, 0.1], ω ∈ [−0.1, 0.1], ν ∈ [0.8, 1.5], and τ ∈ [0.95, 1.1], which define the function's domain Ω. For ω = 0 and τ = 1, the mapping Eq. (2) has the stable fixed point (µ, ν) = (0, 1), whereas for other ω and τ the mapping Eq. (2) has a stable and attracting fixed point depending on (ω, τ) in the (µ, ν)-domain: µ ∈ [−0.03106, 0.06773] and ν ∈ [0.80009, 1.48617]. All points within the (µ, ν)-domain converge to this fixed point when iteratively applying the mapping Eq. (2).\n\nFigure 2: For ω = 0 and τ = 1, the mapping g of mean µ (x-axis) and variance ν (y-axis) to the next layer's mean µ̃ and variance ν̃ is depicted. Arrows show in which direction (µ, ν) is mapped by g : (µ, ν) ↦ (µ̃, ν̃). The fixed point of the mapping g is (0, 1).\n\nProof. We provide a proof sketch (see the detailed proof in the Supplementary Material). 
With the Banach fixed-point theorem we show that there exists a unique attracting and stable fixed point. To this end, we have to prove that (a) g is a contraction mapping and (b) the mapping stays in the domain, that is, g(Ω) ⊆ Ω. The spectral norm of the Jacobian of g can be obtained via an explicit formula for the largest singular value of a 2 × 2 matrix. g is a contraction mapping if its spectral norm is smaller than 1. We perform a computer-assisted proof to evaluate the largest singular value on a fine grid and ensure the precision of the computer evaluation by an error-propagation analysis of the implemented algorithms on the hardware used. Singular values between grid points are upper bounded by the mean value theorem. To this end, we bound the derivatives of the formula for the largest singular value with respect to ω, τ, µ, ν. Then we apply the mean value theorem to pairs of points, where one is on the grid and the other is off the grid. This shows that for all values of ω, τ, µ, ν in the domain Ω, the spectral norm of the Jacobian of g is smaller than one. Therefore, g is a contraction mapping on the domain Ω. Finally, we show that the mapping g stays in the domain Ω by deriving bounds on µ̃ and ν̃. Hence, the Banach fixed-point theorem holds and there exists a unique fixed point in Ω that is attained.\n\nConsequently, feed-forward neural networks with many units in each layer and with the SELU activation function are self-normalizing (see Definition 1), which readily follows from Theorem 1. To give an intuition, the main property of SELUs is that they damp the variance for negative net inputs and increase the variance for positive net inputs. The variance damping is stronger if net inputs are further away from zero, while the variance increase is stronger if net inputs are close to zero. 
Thus, for large variance of the activations in the lower layer the damping effect is dominant and the variance decreases in the higher layer. Vice versa, for small variance the variance increase is dominant and the variance increases in the higher layer.\n\nHowever, we cannot guarantee that mean and variance remain in the domain Ω. Therefore, we next treat the case where (µ, ν) is outside Ω. It is especially crucial to consider ν, because this variable has a much stronger influence than µ. Mapping ν across layers to a high value corresponds to an exploding gradient, since the Jacobian of the activations of high layers with respect to activations in lower layers has large singular values. Analogously, mapping ν across layers to a low value corresponds to a vanishing gradient. Bounding the mapping of ν from above and below avoids both exploding and vanishing gradients. Theorem 2 states that the variance of neuron activations of SNNs is bounded from above, and therefore ensures that SNNs learn robustly and do not suffer from exploding gradients.\n\nTheorem 2 (Decreasing ν). For λ = λ01, α = α01 and the domain Ω+: −1 ≤ µ ≤ 1, −0.1 ≤ ω ≤ 0.1, 3 ≤ ν ≤ 16, and 0.8 ≤ τ ≤ 1.25, we have for the mapping of the variance ν̃(µ, ω, ν, τ, λ, α) given in Eq. (4): ν̃(µ, ω, ν, τ, λ01, α01) < ν.\n\nThe proof can be found in the Supplementary Material. Thus, when mapped across many layers, the variance in the interval [3, 16] is mapped to a value below 3. Consequently, all fixed points (µ, ν) of the mapping g (Eq. (2)) have ν < 3. 
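Theorem 2 can be probed with a simple Monte Carlo estimate of the variance mapping: sample z ∼ N(µω, √(ντ)), apply selu, and compare the resulting variance with ν. A sketch at one point of the domain Ω+ (parameter choices and names are ours):

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * np.exp(x) - ALPHA)

rng = np.random.default_rng(1)
mu, omega, tau = 0.0, 0.0, 1.0          # a point inside the domain Omega+

results = []
for nu in [3.0, 8.0, 16.0]:
    # z ~ N(mu*omega, sqrt(nu*tau)); nu_tilde is a Monte Carlo estimate of Eq. (4)
    z = rng.normal(mu * omega, np.sqrt(nu * tau), size=1_000_000)
    nu_tilde = float(selu(z).var())
    results.append((nu, nu_tilde))
print(results)                          # each nu_tilde stays below nu (Theorem 2)
```

The same experiment with small ν from the domains of Theorem 3 would show the variance increasing instead.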
Analogously, Theorem 3 states that the variance of neuron activations of SNNs is bounded from below, and therefore ensures that SNNs do not suffer from vanishing gradients.\n\nTheorem 3 (Increasing ν). We consider λ = λ01, α = α01 and the domain Ω−: −0.1 ≤ µ ≤ 0.1 and −0.1 ≤ ω ≤ 0.1. For the domain 0.02 ≤ ν ≤ 0.16 and 0.8 ≤ τ ≤ 1.25, as well as for the domain 0.02 ≤ ν ≤ 0.24 and 0.9 ≤ τ ≤ 1.25, the mapping of the variance ν̃(µ, ω, ν, τ, λ, α) given in Eq. (4) increases: ν̃(µ, ω, ν, τ, λ01, α01) > ν.\n\nThe proof can be found in the Supplementary Material. All fixed points (µ, ν) of the mapping g (Eq. (2)) ensure for 0.8 ≤ τ that ν̃ > 0.16 and for 0.9 ≤ τ that ν̃ > 0.24. Consequently, the variance mapping Eq. (4) ensures a lower bound on the variance ν. Therefore, SELU networks control the variance of the activations and push it into an interval, after which the mean and variance move toward the fixed point. Thus, SELU networks steadily normalize the variance and subsequently the mean, too. In all experiments, we observed that self-normalizing neural networks push the mean and variance of activations into the domain Ω.\n\nInitialization. Since SNNs have a fixed point at zero mean and unit variance for normalized weights with ω = Σ_{i=1}^{n} wi = 0 and τ = Σ_{i=1}^{n} wi^2 = 1 (see above), we initialize SNNs such that these constraints are fulfilled in expectation. We draw the weights from a Gaussian distribution with E(wi) = 0 and variance Var(wi) = 1/n. Uniform and truncated Gaussian distributions with these moments led to networks with similar behavior. 
The “MSRA initialization” is similar, since it uses zero mean and variance 2/n to initialize the weights [14]. The additional factor 2 counters the effect of rectified linear units.\n\nNew Dropout Technique. Standard dropout randomly sets an activation x to zero with probability 1 − q, for 0 < q ≤ 1. In order to preserve the mean, the activations are scaled by 1/q during training. If x has mean E(x) = µ and variance Var(x) = ν, and the dropout variable d follows a binomial distribution B(1, q), then the mean E((1/q) d x) = µ is kept. Dropout fits well with rectified linear units, since zero is in the low-variance region and corresponds to the default value. For scaled exponential linear units, the default and low-variance value is lim_{x→−∞} selu(x) = −λα = α′. Therefore, we propose “alpha dropout”, which randomly sets inputs to α′. The new mean and new variance are E(x d + α′(1 − d)) = qµ + (1 − q)α′ and Var(x d + α′(1 − d)) = q((1 − q)(α′ − µ)^2 + ν). We aim at keeping mean and variance at their original values after “alpha dropout”, in order to ensure the self-normalizing property even for “alpha dropout”. The affine transformation a(x d + α′(1 − d)) + b allows us to determine parameters a and b such that mean and variance are kept at their values: E(a(x d + α′(1 − d)) + b) = µ and Var(a(x d + α′(1 − d)) + b) = ν. In contrast to dropout, a and b will depend on µ and ν; however, our SNNs converge to activations with zero mean and unit variance. With µ = 0 and ν = 1, we obtain a = (q + α′^2 q(1 − q))^{−1/2} and b = −(q + α′^2 q(1 − q))^{−1/2} (1 − q)α′. 
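The transformation above can be sketched in a few lines of NumPy (training-time only; the function and variable names are ours, and this is a sketch rather than the authors' reference implementation):

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805
ALPHA_P = -LAMBDA * ALPHA              # alpha' ~ -1.7581, the saturation value

def alpha_dropout(x, rate=0.05, rng=None):
    """Alpha dropout for SELU nets: drop activations to alpha', then apply
    the affine correction a*(...) + b that restores mean 0 and variance 1."""
    rng = rng or np.random.default_rng()
    q = 1.0 - rate                     # keep probability
    d = rng.random(x.shape) < q        # keep mask, d ~ B(1, q)
    a = (q + ALPHA_P ** 2 * q * (1 - q)) ** -0.5
    b = -a * (1 - q) * ALPHA_P
    return a * np.where(d, x, ALPHA_P) + b

x = np.random.default_rng(2).normal(0.0, 1.0, size=1_000_000)
y = alpha_dropout(x, rate=0.1, rng=np.random.default_rng(3))
print(y.mean(), y.var())               # both remain close to (0, 1)
```

After the affine correction with a and b, the dropped-out activations keep mean 0 and variance 1, so the self-normalizing property is preserved.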
The parameters a and b only depend on the dropout rate 1 − q and the most negative activation α′. Empirically, we found that dropout rates 1 − q = 0.05 or 0.10 lead to models with good performance. “Alpha dropout” fits scaled exponential linear units well by randomly setting activations to the negative saturation value.\n\nApplicability of the central limit theorem and independence assumption. In the derivation of the mapping (Eq. (2)), we used the central limit theorem (CLT) to approximate the network inputs z = Σ_{i=1}^{n} wi xi with a normal distribution. We justified normality because network inputs represent a weighted sum of the inputs xi, where n is typically large for Deep Learning. The Berry-Esseen theorem states that the convergence rate to normality is n^{−1/2} [18]. In the classical version of the CLT, the random variables have to be independent and identically distributed, which typically does not hold for neural networks. However, the Lyapunov CLT does not require the variables to be identically distributed anymore. Furthermore, even under weak dependence, sums of random variables converge in distribution to a Gaussian distribution [3].\n\nOptimizers. Empirically, we found that SGD, momentum, Adadelta, and Adamax worked well for training SNNs, whereas for Adam we had to adjust the parameters (β2 = 0.99, ε = 0.01) to obtain proficient networks.\n\n3 Experiments\n\nWe compare SNNs to other deep networks at different benchmarks. Hyperparameters such as number of layers (blocks), neurons per layer, learning rate, and dropout rate are adjusted by grid search for each dataset on a separate validation set (see Supplementary Section S4). We compare the following FNN methods: (1) “MSRAinit”: FNNs without normalization and with ReLU activations and “Microsoft weight initialization” [14]. 
(2) “BatchNorm”: FNNs with batch normalization [17]. (3) “LayerNorm”: FNNs with layer normalization [1]. (4) “WeightNorm”: FNNs with weight normalization [25]. (5) “Highway”: Highway networks [28]. (6) “ResNet”: Residual networks [13] adapted to FNNs using residual blocks with 2 or 3 layers with rectangular or diavolo shape. (7) “SNNs”: Self-normalizing networks with SELUs, with α = α01 and λ = λ01, and the proposed dropout technique and initialization strategy.\n\n121 UCI Machine Learning Repository datasets. The benchmark comprises 121 classification datasets from the UCI Machine Learning repository [9] from diverse application areas, such as physics, geology, or biology. The size of the datasets ranges between 10 and 130,000 data points, and the number of features ranges from 4 to 250. In the above-mentioned work [9], there were methodological mistakes [30], which we avoided here. Each compared FNN method was optimized with respect to its architecture and hyperparameters on a validation set that was then removed from the subsequent analysis. The selected hyperparameters served to evaluate the methods in terms of accuracy on the pre-defined test sets. The accuracies are reported in Supplementary Table S8. We ranked the methods by their accuracy for each prediction task and compared their average ranks. SNNs significantly outperform all competing networks in pairwise comparisons (paired Wilcoxon test across datasets), as reported in Table 1 (left panel).\n\nTable 1: Left: Comparison of seven FNNs on 121 UCI tasks. We consider the average rank difference to rank 4, which is the average rank of seven methods with random predictions. The first column gives the method, the second the average rank difference, and the last the p-value of a paired Wilcoxon test of whether the difference to the best performing method is significant. 
SNNs significantly outperform all other methods. Right: Comparison of 24 machine learning (ML) methods on the UCI datasets with more than 1000 data points. The first column gives the method, the second the average rank difference to rank 12.5, and the last the p-value of a paired Wilcoxon test of whether the difference to the best performing method is significant. Methods that were significantly worse than the best method are marked with “*”. SNNs outperform all competing methods.

    FNN method comparison                        ML method comparison
    Method       avg. rank diff.  p-value        Method        avg. rank diff.  p-value
    SNN              −0.756                      SNN               −6.7
    MSRAinit         −0.240*      2.7e-02        SVM               −6.4         5.8e-01
    LayerNorm        −0.198*      1.5e-02        RandomForest      −5.9         2.1e-01
    Highway           0.021*      1.9e-03        MSRAinit          −5.4*        4.5e-03
    ResNet            0.273*      5.4e-04        LayerNorm         −5.3         7.1e-02
    WeightNorm        0.397*      7.8e-07        Highway           −4.6*        1.7e-03
    BatchNorm         0.504*      3.5e-06        . . .             . . .        . . .

We further included 17 machine learning methods representing diverse method groups [9] in the comparison and grouped the datasets into “small” and “large” data sets (for details see Supplementary Section S4.2). On 75 small datasets with less than 1000 data points, random forests and SVMs outperform SNNs and other FNNs. On 46 larger datasets with at least 1000 data points, SNNs show the highest performance, followed by SVMs and random forests (see right panel of Table 1; for complete results see Supplementary Tables S9 and S10). Overall, SNNs have outperformed state-of-the-art machine learning methods on UCI datasets with more than 1,000 data points.
Typically, hyperparameter selection chose SNN architectures that were much deeper than the selected architectures of other FNNs, with an average depth of 10.8 layers, compared to average depths of 6.0 for BatchNorm, 3.8 for WeightNorm, 7.0 for LayerNorm, 5.9 for Highway, and 7.1 for MSRAinit networks.
For ResNet, the average number of blocks was 6.35. SNNs with many more than 4 layers often provide the best predictive accuracies across all neural networks.

Drug discovery: The Tox21 challenge dataset. The Tox21 challenge dataset comprises about 12,000 chemical compounds whose twelve toxic effects have to be predicted based on their chemical structure. We used the validation sets of the challenge winners for hyperparameter selection (see Supplementary Section S4) and the challenge test set for performance comparison. We repeated the whole evaluation procedure 5 times to obtain error bars. The results in terms of average AUC are given in Table 2. In 2015, the challenge organized by the US NIH was won by an ensemble of shallow ReLU FNNs which achieved an AUC of 0.846 [23]. Besides FNNs, this ensemble also contained random forests and SVMs. Single SNNs came close with an AUC of 0.845 ± 0.003. The best performing SNNs have 8 layers, compared to the runner-up ReLU networks with layer normalization with 2 and 3 layers. Also, BatchNorm and WeightNorm networks typically perform best with shallow networks of 2 to 4 layers (Table 2). The deeper the networks, the larger the difference in performance between SNNs and other methods (see columns 5–8 of Table 2). The best performing method is an SNN with 8 layers.

Table 2: Comparison of FNNs at the Tox21 challenge dataset in terms of AUC. The rows represent different methods and the columns different network depths, and for ResNets the number of residual blocks (6 and 32 blocks were omitted due to computational constraints). The deeper the networks, the more prominent is the advantage of SNNs.
The best networks are SNNs with 8 layers.

                              #layers / #blocks
    method        2           3           4           6           8           16          32
    SNN           83.7 ± 0.3  84.4 ± 0.5  84.2 ± 0.4  83.9 ± 0.5  84.5 ± 0.2  83.5 ± 0.5  82.5 ± 0.7
    BatchNorm     80.0 ± 0.5  79.8 ± 1.6  77.2 ± 1.1  77.0 ± 1.7  75.0 ± 0.9  73.7 ± 2.0  76.0 ± 1.1
    WeightNorm    83.7 ± 0.8  82.9 ± 0.8  82.2 ± 0.9  82.5 ± 0.6  81.9 ± 1.2  78.1 ± 1.3  56.6 ± 2.6
    LayerNorm     84.3 ± 0.3  84.3 ± 0.5  84.0 ± 0.2  82.5 ± 0.8  80.9 ± 1.8  78.7 ± 2.3  78.8 ± 0.8
    Highway       83.3 ± 0.9  83.0 ± 0.5  82.6 ± 0.9  82.4 ± 0.8  80.3 ± 1.4  80.3 ± 2.4  79.6 ± 0.8
    MSRAinit      82.7 ± 0.4  81.6 ± 0.9  81.1 ± 1.7  80.6 ± 0.6  80.9 ± 1.1  80.2 ± 1.1  80.4 ± 1.9
    ResNet        82.2 ± 1.1  80.0 ± 2.0  80.5 ± 1.2  81.2 ± 0.7  81.8 ± 0.6  81.2 ± 0.6  na

Astronomy: Prediction of pulsars in the HTRU2 dataset. For a decade, machine learning methods have been used to identify pulsars in radio wave signals [22]. Recently, the High Time Resolution Universe Survey (HTRU2) dataset has been released, with 1,639 real pulsars and 16,259 spurious signals. Currently, the highest AUC value of a 10-fold cross-validation is 0.976, which has been achieved by Naive Bayes classifiers, followed by decision tree C4.5 with 0.949 and SVMs with 0.929. We used eight features constructed by the PulsarFeatureLab, as used previously [22]. We assessed the performance of FNNs using 10-fold nested cross-validation, where the hyperparameters were selected in the inner loop on a validation set (for details see Supplementary Section S4). Table 3 reports the results in terms of AUC.
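The significance tests used throughout this section are paired Wilcoxon tests, pairing methods by dataset or by cross-validation fold. As a minimal sketch of that protocol (our illustration, not the paper's code, with made-up per-fold AUC values), one can run the test with SciPy:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold AUCs for two methods evaluated on the same 10 CV folds
# (made-up numbers for illustration; not the paper's results).
auc_method_a = np.array([0.981, 0.979, 0.983, 0.978, 0.980,
                         0.982, 0.977, 0.984, 0.979, 0.981])
auc_method_b = np.array([0.975, 0.974, 0.979, 0.972, 0.976,
                         0.977, 0.973, 0.978, 0.974, 0.975])

# Paired Wilcoxon signed-rank test: differences are taken fold-by-fold,
# so each pair of AUCs comes from the same data split.
stat, p_value = wilcoxon(auc_method_a, auc_method_b)
print(f"W = {stat}, p = {p_value:.4f}")
```

The pairing matters: comparing unpaired AUC distributions would discard the fact that both methods saw identical folds, and would generally give a much weaker test.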
SNNs outperform all other methods and have pushed the state of the art to an AUC of 0.98.

Table 3: Comparison of FNNs and reference methods at HTRU2 in terms of AUC. Each row gives the method and the AUC averaged over 10 cross-validation folds; for the FNN methods, the last column gives the p-value of a paired Wilcoxon test of the AUCs against the best performing method across the 10 folds. FNNs achieve better results than Naive Bayes (NB), C4.5, and SVM. SNNs exhibit the best performance and set a new record.

    FNN methods
    method       AUC               p-value
    SNN          0.9803 ± 0.010
    MSRAinit     0.9791 ± 0.010    3.5e-01
    WeightNorm   0.9786* ± 0.010   2.4e-02
    Highway      0.9766* ± 0.009   9.8e-03
    LayerNorm    0.9762* ± 0.011   1.4e-02
    BatchNorm    0.9760 ± 0.013    6.5e-02
    ResNet       0.9753* ± 0.010   6.8e-03

    reference methods
    method       AUC
    NB           0.976
    C4.5         0.946
    SVM          0.929

SNNs and convolutional neural networks. In initial experiments with CNNs, we found that SELU activations work well at image classification tasks: On MNIST, SNN-CNNs (2x Conv, MaxPool, 2x fully-connected, 30 epochs) reach 99.2 ± 0.1% accuracy (ReLU: 99.2 ± 0.1%) and on CIFAR10 (2x Conv, MaxPool, 2x Conv, MaxPool, 2x fully-connected, 200 epochs) SNN-CNNs reach 82.5 ± 0.8% accuracy (ReLU: 76.1 ± 1.0%).
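The self-normalizing behavior underlying these experiments is easy to check numerically. The following NumPy sketch (our illustration, not code from the paper) propagates activations that start far from zero mean and unit variance through a deep stack of random fully-connected SELU layers, using the SELU parameters λ ≈ 1.0507, α ≈ 1.6733 and the variance-1/n weight initialization stated in the conclusion:

```python
import numpy as np

# SELU parameters from the paper: lambda ~ 1.0507, alpha ~ 1.6733.
LAMBDA, ALPHA = 1.0507, 1.6733

def selu(x):
    # Scaled exponential linear unit: lambda*x for x > 0,
    # lambda*alpha*(exp(x) - 1) for x <= 0.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
n = 256                                   # units per layer
x = rng.normal(2.0, 3.0, size=(512, n))   # start far from zero mean / unit variance
for _ in range(16):
    # Gaussian weights with variance 1/n ("LeCun normal" style initialization).
    w = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
    x = selu(x @ w)

print(f"mean = {x.mean():.3f}, std = {x.std():.3f}")  # drawn towards (0, 1)
```

Even starting from mean 2 and standard deviation 3, the activation statistics are pulled towards the fixed point of zero mean and unit variance over depth, which is the convergence property the paper proves via the Banach fixed-point theorem.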
This finding is unsurprising, since even standard ELUs without the self-normalizing property have been shown to improve CNN training and accuracy [5].

4 Conclusion

To summarize, self-normalizing networks work well with the following configuration:

• SELU activation with parameters λ ≈ 1.0507 and α ≈ 1.6733,
• inputs normalized to zero mean and unit variance,
• network weights initialized with variance 1/n, and
• regularization with “alpha-dropout”.

We have introduced self-normalizing neural networks, for which we have proved that neuron activations are pushed towards zero mean and unit variance when propagated through the network. Additionally, for activations not close to unit variance, we have proved an upper and lower bound on the variance mapping. Consequently, SNNs do not face vanishing and exploding gradient problems. Therefore, SNNs work well for architectures with many layers, allow the introduction of a novel regularization scheme, and learn very robustly. On 121 UCI benchmark datasets, SNNs have outperformed other FNNs with and without normalization techniques, such as batch, layer, and weight normalization, as well as specialized architectures such as Highway or Residual networks. SNNs also yielded the best results on drug discovery and astronomy tasks. The best performing SNN architectures are typically very deep, in contrast to other FNNs.

References

[1] Ba, J. L., Kiros, J. R., and Hinton, G. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

[2] Bengio, Y. (2013). Deep learning of representations: Looking forward. In Proceedings of the First International Conference on Statistical Language and Speech Processing, pages 1–37, Berlin, Heidelberg.

[3] Bradley, R. C. (1981). Central limit theorems under weak dependence. Journal of Multivariate Analysis, 11(1):1–16.

[4] Cireşan, D. and Meier, U. (2015).
Multi-column deep neural networks for offline handwritten Chinese character classification. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–6. IEEE.

[5] Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). 5th International Conference on Learning Representations, arXiv:1511.07289.

[6] Desjardins, G., Simonyan, K., Pascanu, R., et al. (2015). Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079.

[7] Dugan, P., Clark, C., LeCun, Y., and Van Parijs, S. (2016). Phase 4: DCL system using deep learning approaches for land-based or ship-based real-time recognition and localization of marine mammals - distributed processing and big data applications. arXiv preprint arXiv:1605.00982.

[8] Esteva, A., Kuprel, B., Novoa, R., Ko, J., Swetter, S., Blau, H., and Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118.

[9] Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181.

[10] Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649.

[11] Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, pages 545–552.

[12] Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., et al. (2016).
Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22):2402–2410.

[13] He, K., Zhang, X., Ren, S., and Sun, J. (2015a). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] He, K., Zhang, X., Ren, S., and Sun, J. (2015b). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034.

[15] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

[16] Huval, B., Wang, T., Tandon, S., et al. (2015). An empirical evaluation of deep learning on highway driving. arXiv preprint arXiv:1504.01716.

[17] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pages 448–456.

[18] Korolev, V. and Shevtsova, I. (2012). An improvement of the Berry–Esseen inequality with applications to Poisson and mixed Poisson random sums. Scandinavian Actuarial Journal, 2012(2):81–105.

[19] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

[20] LeCun, Y. and Bengio, Y. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995.

[21] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

[22] Lyon, R., Stappers, B., Cooper, S., Brooke, J., and Knowles, J. (2016).
Fifty years of pulsar candidate selection: From simple filters to a new principled real-time classification approach. Monthly Notices of the Royal Astronomical Society, 459(1):1104–1123.

[23] Mayr, A., Klambauer, G., Unterthiner, T., and Hochreiter, S. (2016). DeepTox: Toxicity prediction using deep learning. Frontiers in Environmental Science, 3:80.

[24] Sak, H., Senior, A., Rao, K., and Beaufays, F. (2015). Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947.

[25] Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909.

[26] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61:85–117.

[27] Silver, D., Huang, A., Maddison, C., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.

[28] Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015). Training very deep networks. In Advances in Neural Information Processing Systems, pages 2377–2385.

[29] Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

[30] Wainberg, M., Alipanahi, B., and Frey, B. J. (2016). Are random forests truly the best classifiers?
Journal of Machine Learning Research, 17(110):1–5.