{"title": "Training Very Deep Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2377, "page_last": 2385, "abstract": "Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and training of very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. Our so-called highway networks allow unimpeded information flow across many layers on information highways. They are inspired by Long Short-Term Memory recurrent networks and use adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly through simple gradient descent. This enables the study of extremely deep and efficient architectures.", "full_text": "Training Very Deep Networks\n\nRupesh Kumar Srivastava Klaus Greff\n\nJ\u00fcrgen Schmidhuber\n\nThe Swiss AI Lab IDSIA / USI / SUPSI\n\n{rupesh, klaus, juergen}@idsia.ch\n\nAbstract\n\nTheoretical and empirical evidence indicates that the depth of neural networks\nis crucial for their success. However, training becomes more dif\ufb01cult as depth\nincreases, and training of very deep networks remains an open problem. Here we\nintroduce a new architecture designed to overcome this. Our so-called highway\nnetworks allow unimpeded information \ufb02ow across many layers on information\nhighways. They are inspired by Long Short-Term Memory recurrent networks and\nuse adaptive gating units to regulate the information \ufb02ow. Even with hundreds of\nlayers, highway networks can be trained directly through simple gradient descent.\nThis enables the study of extremely deep and ef\ufb01cient architectures.\n\n1 Introduction & Previous Work\n\nMany recent empirical breakthroughs in supervised machine learning have been achieved through\nlarge and deep neural networks. 
Network depth (the number of successive computational layers) has\nplayed perhaps the most important role in these successes. For instance, within just a few years, the\ntop-5 image classi\ufb01cation accuracy on the 1000-class ImageNet dataset has increased from \u223c84%\n[1] to \u223c95% [2, 3] using deeper networks with rather small receptive \ufb01elds [4, 5]. Other results on\npractical machine learning problems have also underscored the superiority of deeper networks [6]\nin terms of accuracy and/or performance.\nIn fact, deep networks can represent certain function classes far more ef\ufb01ciently than shallow ones.\nThis is perhaps most obvious for recurrent nets, the deepest of them all. For example, the n bit\nparity problem can in principle be learned by a large feedforward net with n binary input units, 1\noutput unit, and a single but large hidden layer. But the natural solution for arbitrary n is a recurrent\nnet with only 3 units and 5 weights, reading the input bit string one bit at a time, making a single\nrecurrent hidden unit \ufb02ip its state whenever a new 1 is observed [7]. Related observations hold for\nBoolean circuits [8, 9] and modern neural networks [10, 11, 12].\nTo deal with the dif\ufb01culties of training deep networks, some researchers have focused on developing\nbetter optimizers (e.g. [13, 14, 15]). Well-designed initialization strategies, in particular the nor-\nmalized variance-preserving initialization for certain activation functions [16, 17], have been widely\nadopted for training moderately deep networks. Other similarly motivated strategies have shown\npromising results in preliminary experiments [18, 19]. Experiments showed that certain activation\nfunctions based on local competition [20, 21] may help to train deeper networks. 
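The 3-unit recurrent solution for parity described above can be illustrated in a few lines of plain Python (a minimal sketch of the flip-on-1 strategy, not a trained network):

```python
def parity(bits):
    """Parity of a bit string, computed the way the small recurrent net
    described above would: a single hidden state flips whenever a 1 is read."""
    state = 0
    for b in bits:
        if b == 1:
            state = 1 - state  # flip the hidden unit's state on every observed 1
    return state

print(parity([1, 0, 1, 1]))  # odd number of 1s -> 1
```

The same loop handles arbitrary input lengths, which is exactly what makes the recurrent formulation so much more compact than a feedforward one.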
Skip connec-\ntions between layers or to output layers (where error is \u201cinjected\u201d) have long been used in neural\nnetworks, more recently with the explicit aim to improve the \ufb02ow of information [22, 23, 2, 24].\nA related recent technique is based on using soft targets from a shallow teacher network to aid in\ntraining deeper student networks in multiple stages [25], similar to the neural history compressor\nfor sequences, where a slowly ticking teacher recurrent net is \u201cdistilled\u201d into a quickly ticking stu-\ndent recurrent net by forcing the latter to predict the hidden units of the former [26]. Finally, deep\nnetworks can be trained layer-wise to help in credit assignment [26, 27], but this approach is less\nattractive compared to direct training.\n\n1\n\n\fVery deep network training still faces problems, albeit perhaps less fundamental ones than the prob-\nlem of vanishing gradients in standard recurrent networks [28]. The stacking of several non-linear\ntransformations in conventional feed-forward network architectures typically results in poor propa-\ngation of activations and gradients. Hence it remains hard to investigate the bene\ufb01ts of very deep\nnetworks for a variety of problems.\nTo overcome this, we take inspiration from Long Short Term Memory (LSTM) recurrent networks\n[29, 30]. We propose to modify the architecture of very deep feedforward networks such that infor-\nmation \ufb02ow across layers becomes much easier. This is accomplished through an LSTM-inspired\nadaptive gating mechanism that allows for computation paths along which information can \ufb02ow\nacross many layers without attenuation. We call such paths information highways. 
They yield highway networks, as opposed to traditional \u2018plain\u2019 networks.1\nOur primary contribution is to show that extremely deep highway networks can be trained directly\nusing stochastic gradient descent (SGD), in contrast to plain networks which become hard to opti-\nmize as depth increases (Section 3.1). Deep networks with limited computational budget (for which\na two-stage training procedure mentioned above was recently proposed [25]) can also be directly\ntrained in a single stage when converted to highway networks. Their ease of training is supported\nby experimental results demonstrating that highway networks also generalize well to unseen data.\n\n2 Highway Networks\n\nNotation We use boldface letters for vectors and matrices, and italicized capital letters to denote\ntransformation functions. 0 and 1 denote vectors of zeros and ones respectively, and I denotes an\nidentity matrix. The function \u03c3(x) is de\ufb01ned as \u03c3(x) = 1/(1 + e^{\u2212x}), x \u2208 R. The dot operator (\u00b7) is used\nto denote element-wise multiplication.\nA plain feedforward neural network typically consists of L layers where the lth layer (l \u2208\n{1, 2, ..., L}) applies a non-linear transformation H (parameterized by WH,l) on its input xl to\nproduce its output yl. Thus, x1 is the input to the network and yL is the network\u2019s output. Omitting\nthe layer index and biases for clarity,\n\ny = H(x, WH).\n\n(1)\n\nH is usually an af\ufb01ne transform followed by a non-linear activation function, but in general it may\ntake other forms, possibly convolutional or recurrent. For a highway network, we additionally de\ufb01ne\ntwo non-linear transforms T (x, WT) and C(x, WC) such that\n\ny = H(x, WH) \u00b7 T (x, WT) + x \u00b7 C(x, WC).\n\n(2)\n\nWe refer to T as the transform gate and C as the carry gate, since they express how much of the\noutput is produced by transforming the input and carrying it, respectively. 
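The gating rule of Equation 2 can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the paper's code: it uses the coupled carry gate C = 1 \u2212 T adopted below, and takes H to be an affine transform followed by tanh (one of several choices the paper allows).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One fully connected highway layer with the coupled carry gate C = 1 - T.

    The transform gate T is a sigmoid, so its output always lies in (0, 1);
    the block state H here is an affine transform followed by tanh.
    """
    H = np.tanh(W_H @ x + b_H)        # block state
    T = sigmoid(W_T @ x + b_T)        # transform gate
    return H * T + x * (1.0 - T)      # carry whatever is not transformed

# With a strongly negative gate bias, the layer starts out near the identity.
rng = np.random.default_rng(0)
d = 5
x = rng.standard_normal(d)
W_H = 0.1 * rng.standard_normal((d, d))
W_T = 0.1 * rng.standard_normal((d, d))
y = highway_layer(x, W_H, np.zeros(d), W_T, np.full(d, -10.0))  # y is close to x
```

Because all matrices are square, x, y, H and T share the same dimensionality, as the equations below require.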
For simplicity, in this\npaper we set C = 1 \u2212 T , giving\n\ny = H(x, WH) \u00b7 T (x, WT) + x \u00b7 (1 \u2212 T (x, WT)).\n\n(3)\n\nThe dimensionality of x, y, H(x, WH) and T (x, WT) must be the same for Equation 3 to be valid.\nNote that this layer transformation is much more \ufb02exible than Equation 1. In particular, observe that\nfor particular values of T ,\n\ny = x if T (x, WT) = 0, and y = H(x, WH) if T (x, WT) = 1.\n\n(4)\n\nSimilarly, for the Jacobian of the layer transform,\n\ndy/dx = I if T (x, WT) = 0, and dy/dx = H'(x, WH) if T (x, WT) = 1.\n\n(5)\n\n1This paper expands upon a shorter report on Highway Networks [31]. More recently, a similar LSTM-\ninspired model was also proposed [32].\n\n\fFigure 1: Comparison of optimization of plain networks and highway networks of various depths.\nLeft: The training curves for the best hyperparameter settings obtained for each network depth.\nRight: Mean performance of top 10 (out of 100) hyperparameter settings. Plain networks become\nmuch harder to optimize with increasing depth, while highway networks with up to 100 layers can\nstill be optimized well. Best viewed on screen (larger version included in Supplementary Material).\n\nThus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior\nbetween that of H and that of a layer which simply passes its inputs through. Just as a plain layer\nconsists of multiple computing units such that the ith unit computes yi = Hi(x), a highway network\nconsists of multiple blocks such that the ith block computes a block state Hi(x) and transform\ngate output Ti(x). Finally, it produces the block output yi = Hi(x) \u00b7 Ti(x) + xi \u00b7 (1 \u2212 Ti(x)),\nwhich is connected to the next layer.2\n\n2.1 Constructing Highway Networks\n\nAs mentioned earlier, Equation 3 requires that\nthe dimensionality of x, y, H(x, WH) and\nT (x, WT) be the same. 
To change the size of the intermediate representation, one can replace\nx with \u02c6x obtained by suitably sub-sampling or zero-padding x. Another alternative is to use a plain\nlayer (without highways) to change dimensionality, which is the strategy we use in this study.\nConvolutional highway layers utilize weight-sharing and local receptive \ufb01elds for both H and T\ntransforms. We used the same-sized receptive \ufb01elds for both, and zero-padding to ensure that the\nblock state and transform gate feature maps match the input size.\n\n2.2 Training Deep Highway Networks\n\nWe use the transform gate de\ufb01ned as T (x) = \u03c3(WT^T x + bT), where WT is the weight matrix\nand bT the bias vector for the transform gates. This suggests a simple initialization scheme which\nis independent of the nature of H: bT can be initialized with a negative value (e.g. -1, -3 etc.) such\nthat the network is initially biased towards carry behavior. This scheme is strongly inspired by the\nproposal [30] to initially bias the gates in an LSTM network, to help bridge long-term temporal\ndependencies early in learning. Note that \u03c3(x) \u2208 (0, 1), \u2200x \u2208 R, so the conditions in Equation 4\ncan never be met exactly.\nIn our experiments, we found that a negative bias initialization for the transform gates was suf\ufb01cient\nfor training to proceed in very deep networks for various zero-mean initial distributions of WH\nand different activation functions used by H. In pilot experiments, SGD did not stall for networks\nwith more than 1000 layers. Although the initial bias is best treated as a hyperparameter, as a\ngeneral guideline we suggest values of -1, -2 and -3 for convolutional highway networks of depth\napproximately 10, 20 and 30.\n\n2Our pilot experiments on training very deep networks were successful with a more complex block design\nclosely resembling an LSTM block \u201cunrolled in time\u201d. 
Here we report results only for a much simpli\ufb01ed form.\n\n\fNetwork | No. of parameters | Test Accuracy (in %)\nHighway 10-layer (width 16) | 39 K | 99.43 (99.4\u00b10.03)\nHighway 10-layer (width 32) | 151 K | 99.55 (99.54\u00b10.02)\nMaxout [20] | 420 K | 99.55\nDSN [24] | 350 K | 99.61\n\nTable 1: Test set classi\ufb01cation accuracy for pilot experiments on the MNIST dataset.\n\nNetwork | No. of Layers | No. of Parameters | Accuracy (in %)\nFitnet results (reported by Romero et al. [25]):\nTeacher | 5 | \u223c9M | 90.18\nFitnet A | 11 | \u223c250K | 89.01\nFitnet B | 19 | \u223c2.5M | 91.61\nHighway networks:\nHighway A (Fitnet A) | 11 | \u223c236K | 89.18\nHighway B (Fitnet B) | 19 | \u223c2.3M | 92.46 (92.28\u00b10.16)\nHighway C | 32 | \u223c1.25M | 91.20\n\nTable 2: CIFAR-10 test set accuracy of convolutional highway networks. Architectures tested were\nbased on \ufb01tnets trained by Romero et al. [25] using two-stage hint-based training. Highway net-\nworks were trained in a single stage without hints, matching or exceeding the performance of \ufb01tnets.\n\n3 Experiments\n\nAll networks were trained using SGD with momentum. An exponentially decaying learning rate was\nused in Section 3.1. For the rest of the experiments, a simpler, commonly used strategy was employed\nwhere the learning rate starts at a value \u03bb and decays according to a \ufb01xed schedule by a factor \u03b3.\n\u03bb, \u03b3 and the schedule were selected once based on validation set performance on the CIFAR-10\ndataset, and kept \ufb01xed for all experiments. All convolutional highway networks utilize the recti\ufb01ed\nlinear activation function [16] to compute the block state H. To provide a better estimate of the\nvariability of classi\ufb01cation results due to random initialization, we report our results in the format\nBest (mean \u00b1 std.dev.) based on 5 runs wherever available. 
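The fixed decay schedule described above can be sketched as follows. The starting value, decay factor, and milestone epochs here are illustrative placeholders, not the values selected in the paper:

```python
def step_decay(lr_init, gamma, milestones, epoch):
    """Return the learning rate for a given epoch under a fixed step schedule:
    start at lr_init and multiply by gamma at each milestone epoch reached."""
    lr = lr_init
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Illustrative values only: lambda = 0.1, gamma = 0.1, decay at epochs 80 and 120.
print(step_decay(0.1, 0.1, [80, 120], 100))
```

Selecting the schedule once on a validation set and then freezing it, as done here, keeps the comparison across datasets and architectures fair.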
Experiments were conducted using\nCaffe [33] and Brainstorm (https://github.com/IDSIA/brainstorm) frameworks. Source\ncode, hyperparameter search results and related scripts are publicly available at http://people.\nidsia.ch/~rupesh/very_deep_learning/.\n\n3.1 Optimization\n\nTo support the hypothesis that highway networks do not suffer from increasing depth, we conducted\na series of rigorous optimization experiments, comparing them to plain networks with normalized\ninitialization [16, 17].\nWe trained both plain and highway networks of varying depths on the MNIST digit clas-\nsi\ufb01cation dataset. All networks are thin: each layer has 50 blocks for highway networks and 71\nunits for plain networks, yielding roughly identical numbers of parameters (\u22485000) per layer. In\nall networks, the \ufb01rst layer is a fully connected plain layer followed by 9, 19, 49, or 99 fully con-\nnected plain or highway layers. Finally, the network output is produced by a softmax layer. We\nperformed a random search of 100 runs for both plain and highway networks to \ufb01nd good settings\nfor the following hyperparameters: initial learning rate, momentum, learning rate exponential decay\nfactor, and activation function (either recti\ufb01ed linear or tanh). For highway networks, an additional\nhyperparameter was the initial value for the transform gate bias (between -1 and -10). Other weights\nwere initialized using the same normalized initialization as plain networks.\nThe training curves for the best-performing networks for each depth are shown in Figure 1. As ex-\npected, 10- and 20-layer plain networks exhibit very good performance (mean loss < 1e\u22124), which\nsigni\ufb01cantly degrades as depth increases, even though network capacity increases. Highway net-\nworks do not suffer from an increase in depth, and 50/100-layer highway networks perform similarly\nto 10/20-layer networks. 
The 100-layer highway network performed more than 2 orders of magni-\ntude better than a similarly-sized plain network. It was also observed that highway networks\nconsistently converged signi\ufb01cantly faster than plain ones.\n\n\fNetwork | CIFAR-10 Accuracy (in %) | CIFAR-100 Accuracy (in %)\nMaxout [20] | 90.62 | 61.42\ndasNet [36] | 90.78 | 66.22\nNiN [35] | 91.19 | 64.32\nDSN [24] | 92.03 | 65.43\nAll-CNN [37] | 92.75 | 66.29\nHighway Network | 92.40 (92.31\u00b10.12) | 67.76 (67.61\u00b10.15)\n\nTable 3: Test set accuracy of convolutional highway networks on the CIFAR-10 and CIFAR-100\nobject recognition datasets with typical data augmentation. For comparison, we list the accuracy\nreported by recent studies in similar experimental settings.\n\n3.2 Pilot Experiments on MNIST Digit Classi\ufb01cation\n\nAs a sanity check for the generalization capability of highway networks, we trained 10-layer con-\nvolutional highway networks on MNIST, using two architectures, each with 9 convolutional layers\nfollowed by a softmax output. The number of \ufb01lter maps (width) was set to 16 and 32 for all the\nlayers. We obtained test set performance competitive with state-of-the-art methods with far fewer\nparameters, as shown in Table 1.\n\n3.3 Experiments on CIFAR-10 and CIFAR-100 Object Recognition\n\n3.3.1 Comparison to Fitnets\n\nFitnet training: Maxout networks can cope much better with increased depth than those with tra-\nditional activation functions [20]. However, Romero et al. [25] recently reported that training on\nCIFAR-10 through plain backpropagation was only possible for maxout networks with a depth up\nto 5 layers when the number of parameters was limited to \u223c250K and the number of multiplications\nto \u223c30M. Similar limitations were observed for higher computational budgets. 
Training of deeper\nnetworks was only possible through the use of a two-stage training procedure and the addition of soft\ntargets produced from a pre-trained shallow teacher network (hint-based training).\nWe found that it was easy to train highway networks with numbers of parameters and operations\ncomparable to those of \ufb01tnets in a single stage using SGD. As shown in Table 2, Highway A and\nHighway B, which are based on the architectures of Fitnet A and Fitnet B, respectively, obtain\nsimilar or higher accuracy on the test set. We were also able to train thinner and deeper networks:\nfor example, a 32-layer highway network consisting of alternating receptive \ufb01elds of size 3x3 and\n1x1 with \u223c1.25M parameters performs better than the earlier teacher network [20].\n\n3.3.2 Comparison to State-of-the-art Methods\n\nIt is possible to obtain high performance on the CIFAR-10 and CIFAR-100 datasets by utilizing\nvery large networks and extensive data augmentation. This approach was popularized by Ciresan\net al. [5] and recently extended by Graham [34]. Since our aim is only to demonstrate that deeper\nnetworks can be trained without sacri\ufb01cing ease of training or generalization ability, we only per-\nformed experiments in the more common setting of global contrast normalization, small translations\nand mirroring of images. Following Lin et al. [35], we replaced the fully connected layer used\nin the networks in the previous section with a convolutional layer with a receptive \ufb01eld of size one\nand a global average pooling layer. The hyperparameters from the last section were re-used for both\nCIFAR-10 and CIFAR-100; it is therefore quite possible to obtain much better results with better\narchitectures/hyperparameters. The results are tabulated in Table 3.\n\n4 Analysis\n\nFigure 2 illustrates the inner workings of the best3 50-hidden-layer fully-connected highway net-\nworks trained on MNIST (top row) and CIFAR-100 (bottom row). 
The \ufb01rst three columns show\n\n3obtained via random search over hyperparameters to minimize the best training set error achieved using\n\neach con\ufb01guration\n\n5\n\n\fFigure 2: Visualization of best 50 hidden-layer highway networks trained on MNIST (top row) and\nCIFAR-100 (bottom row). The \ufb01rst hidden layer is a plain layer which changes the dimensionality\nof the representation to 50. Each of the 49 highway layers (y-axis) consists of 50 blocks (x-axis).\nThe \ufb01rst column shows the transform gate biases, which were initialized to -2 and -4 respectively.\nIn the second column the mean output of the transform gate over all training examples is depicted.\nThe third and fourth columns show the output of the transform gates and the block outputs (both\nnetworks using tanh) for a single random training sample. Best viewed in color.\n\nthe bias, the mean activity over all training samples, and the activity for a single random sample for\neach transform gate respectively. Block outputs for the same single sample are displayed in the last\ncolumn.\nThe transform gate biases of the two networks were initialized to -2 and -4 respectively. It is inter-\nesting to note that contrary to our expectations most biases decreased further during training. For\nthe CIFAR-100 network the biases increase with depth forming a gradient. Curiously this gradient\nis inversely correlated with the average activity of the transform gates, as seen in the second column.\nThis indicates that the strong negative biases at low depths are not used to shut down the gates, but to\nmake them more selective. This behavior is also suggested by the fact that the transform gate activity\nfor a single example (column 3) is very sparse. The effect is more pronounced for the CIFAR-100\nnetwork, but can also be observed to a lesser extent in the MNIST network.\nThe last column of Figure 2 displays the block outputs and visualizes the concept of \u201cinformation\nhighways\u201d. 
Most of the outputs stay constant over many layers, forming a pattern of stripes. Most of\nthe change in outputs happens in the early layers (\u2248 15 for MNIST and \u2248 40 for CIFAR-100).\n\n4.1 Routing of Information\n\nOne possible advantage of the highway architecture over hard-wired shortcut connections is that the\nnetwork can learn to dynamically adjust the routing of the information based on the current input.\nThis raises the question: does this behaviour manifest itself in trained networks, or do they just learn\na static routing that applies to all inputs similarly? A partial answer can be found by looking at the\nmean transform gate activity (second column) and the single-example transform gate outputs (third\ncolumn) in Figure 2. Especially for the CIFAR-100 case, most transform gates are active on average,\nwhile they show very selective activity for the single example. This implies that for each sample\nonly a few blocks perform transformation, but different blocks are utilized by different samples.\nThis data-dependent routing mechanism is further investigated in Figure 3. In each of the columns\nwe show how the average over all samples of one speci\ufb01c class differs from the total average shown\nin the second column of Figure 2. For MNIST digits 0 and 7 substantial differences can be seen\n\n\fFigure 3: Visualization showing the extent to which the mean transform gate activity for certain\nclasses differs from the mean activity over all training samples. Generated using the same 50-layer\nhighway networks on MNIST and on CIFAR-100 as Figure 2. Best viewed in color.\n\nwithin the \ufb01rst 15 layers, while for CIFAR class numbers 0 and 1 the differences are sparser and\nspread out over all layers. In both cases it is clear that the mean activity pattern differs between\nclasses. 
The gating system acts not just as a mechanism to ease training, but also as an important\npart of the computation in a trained network.\n\n4.2 Layer Importance\n\nSince we bias all the transform gates towards being closed, in the beginning every layer mostly\ncopies the activations of the previous layer. Does training indeed change this behaviour, or is the\n\ufb01nal network still essentially equivalent to a network with far fewer layers? To shed light on this\nissue, we investigated the extent to which lesioning a single layer affects the total performance of\ntrained networks from Section 3.1. By lesioning, we mean manually setting all the transform gates\nof a layer to 0, forcing it to simply copy its inputs. For each layer, we evaluated the network on the\nfull training set with the gates of that layer closed. The resulting performance as a function of the\nlesioned layer is shown in Figure 4.\nFor MNIST (left) it can be seen that the error rises signi\ufb01cantly if any one of the early layers is\nremoved, but layers 15\u201345 seem to have close to no effect on the \ufb01nal performance. About 60% of\nthe layers don\u2019t learn to contribute to the \ufb01nal result, likely because MNIST is a simple dataset that\ndoesn\u2019t require much depth.\nWe see a different picture for the CIFAR-100 dataset (right), with performance degrading noticeably\nwhen removing any of the \ufb01rst \u2248 40 layers. This suggests that for complex problems a highway\nnetwork can learn to utilize all of its layers, while for simpler problems like MNIST it will keep\nmany of the unneeded layers idle. Such behavior is desirable for deep networks in general, but\nappears dif\ufb01cult to obtain using plain networks.\n\n5 Discussion\n\nAlternative approaches to counter the dif\ufb01culties posed by depth mentioned in Section 1 often have\nseveral limitations. 
Learning to route information through neural networks with the help of com-\npetitive interactions has helped to scale up their application to challenging problems by improving\ncredit assignment [38], but they still suffer when depth increases beyond \u224820 even with careful ini-\ntialization [17]. Effective initialization methods can be dif\ufb01cult to derive for a variety of activation\nfunctions. Deep supervision [24] has been shown to hurt performance of thin deep networks [25].\nVery deep highway networks, on the other hand, can directly be trained with simple gradient de-\nscent methods due to their speci\ufb01c architecture. This property does not rely on speci\ufb01c non-linear\ntransformations, which may be complex convolutional or recurrent transforms, and derivation of\na suitable initialization scheme is not essential. The additional parameters required by the gating\nmechanism help in routing information through the use of multiplicative connections, responding\ndifferently to different inputs, unlike \ufb01xed \u201cskip\u201d connections.\n\n7\n\n\fFigure 4: Lesioned training set performance (y-axis) of the best 50-layer highway networks on\nMNIST (left) and CIFAR-100 (right), as a function of the lesioned layer (x-axis). Evaluated on\nthe full training set while forcefully closing all the transform gates of a single layer at a time. The\nnon-lesioned performance is indicated as a dashed line at the bottom.\n\nA possible objection is that many layers might remain unused if the transform gates stay closed.\nOur experiments show that this possibility does not affect networks adversely\u2014deep and narrow\nhighway networks can match/exceed the accuracy of wide and shallow maxout networks, which\nwould not be possible if layers did not perform useful computations. 
Additionally, we can exploit\nthe structure of highways to directly evaluate the contribution of each layer, as shown in Figure 4.\nFor the \ufb01rst time, highway networks allow us to examine how much computation depth is needed\nfor a given problem, which cannot easily be done with plain networks.\n\nAcknowledgments\n\nWe thank NVIDIA Corporation for their donation of GPUs and acknowledge funding from the\nEU project NASCENCE (FP7-ICT-317662). We are grateful to Sepp Hochreiter and Thomas Un-\nterthiner for helpful comments and Jan Koutn\u00edk for help in conducting experiments.\n\nReferences\n[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classi\ufb01cation with deep convolutional\n\nneural networks. In Advances in Neural Information Processing Systems, 2012.\n\n[2] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru\nErhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv:1409.4842\n[cs], September 2014.\n\n[3] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recog-\n\nnition. arXiv:1409.1556 [cs], September 2014.\n\n[4] DC Ciresan, Ueli Meier, Jonathan Masci, Luca M Gambardella, and J\u00fcrgen Schmidhuber. Flexible, high\n\nperformance convolutional neural networks for image classi\ufb01cation. In IJCAI, 2011.\n\n[5] Dan Ciresan, Ueli Meier, and J\u00fcrgen Schmidhuber. Multi-column deep neural networks for image classi-\n\n\ufb01cation. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.\n\n[6] Dong Yu, Michael L. Seltzer, Jinyu Li, Jui-Ting Huang, and Frank Seide. Feature learning in deep neural\n\nnetworks-studies on speech recognition tasks. arXiv preprint arXiv:1301.3605, 2013.\n\n[7] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Bridging long time lags by weight guessing and \u201clong short-\n\nterm memory\u201d. 
Spatiotemporal models in biological and arti\ufb01cial systems, 37:65\u201372, 1996.\n\n[8] Johan H\u02daastad. Computational limitations of small-depth circuits. MIT press, 1987.\n[9] Johan H\u02daastad and Mikael Goldmann. On the power of small-depth threshold circuits. Computational\n\nComplexity, 1(2):113\u2013129, 1991.\n\n[10] Monica Bianchini and Franco Scarselli. On the complexity of neural network classi\ufb01ers: A comparison\n\nbetween shallow and deep architectures. IEEE Transactions on Neural Networks, 2014.\n\n8\n\n\f[11] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear\n\nregions of deep neural networks. In Advances in Neural Information Processing Systems. 2014.\n\n[12] James Martens and Venkatesh Medabalimi. On the expressive ef\ufb01ciency of sum product networks.\n\narXiv:1411.7717 [cs, stat], November 2014.\n\n[13] James Martens and Ilya Sutskever. Training deep and recurrent networks with hessian-free optimization.\n\nNeural Networks: Tricks of the Trade, pages 1\u201358, 2012.\n\n[14] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization\n\nand momentum in deep learning. pages 1139\u20131147, 2013.\n\n[15] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Ben-\ngio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In\nAdvances in Neural Information Processing Systems 27, pages 2933\u20132941. 2014.\n\n[16] Xavier Glorot and Yoshua Bengio. Understanding the dif\ufb01culty of training deep feedforward neural\n\nnetworks. In International Conference on Arti\ufb01cial Intelligence and Statistics, pages 249\u2013256, 2010.\n\n[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into recti\ufb01ers: Surpassing\n\nhuman-level performance on ImageNet classi\ufb01cation. arXiv:1502.01852 [cs], February 2015.\n\n[18] David Sussillo and L. F. Abbott. 
Random walk initialization for training very deep feedforward networks.\n\narXiv:1412.6558 [cs, stat], December 2014.\n\n[19] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of\n\nlearning in deep linear neural networks. arXiv:1312.6120 [cond-mat, q-bio, stat], December 2013.\n\n[20] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout\n\nnetworks. arXiv:1302.4389 [cs, stat], February 2013.\n\n[21] Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, and J\u00a8urgen Schmidhuber.\n\nCompete to compute. In Advances in Neural Information Processing Systems, pages 2310\u20132318, 2013.\n\n[22] Tapani Raiko, Harri Valpola, and Yann LeCun. Deep learning made easier by linear transformations in\nperceptrons. In International Conference on Arti\ufb01cial Intelligence and Statistics, pages 924\u2013932, 2012.\n\n[23] Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.\n[24] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised\n\nnets. pages 562\u2013570, 2015.\n\n[25] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua\n\nBengio. FitNets: Hints for thin deep nets. arXiv:1412.6550 [cs], December 2014.\n\n[26] J\u00a8urgen Schmidhuber. Learning complex, extended sequences using the principle of history compression.\n\nNeural Computation, 4(2):234\u2013242, March 1992.\n\n[27] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets.\n\nNeural computation, 18(7):1527\u20131554, 2006.\n\n[28] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Masters thesis, Technische Uni-\n\nversit\u00a8at M\u00a8unchen, M\u00a8unchen, 1991.\n\n[29] Sepp Hochreiter and J\u00a8urgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u2013\n\n1780, November 1997.\n\n[30] Felix A. 
Gers, J\u00a8urgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with\n\nLSTM. In ICANN, volume 2, pages 850\u2013855, 1999.\n\n[31] Rupesh Kumar Srivastava, Klaus Greff, and J\u00a8urgen Schmidhuber. Highway networks. arXiv:1505.00387\n\n[cs], May 2015.\n\n[32] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long Short-Term memory. arXiv:1507.01526\n\n[cs], July 2015.\n\n[33] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Ser-\ngio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding.\narXiv:1408.5093 [cs], 2014.\n\n[34] Benjamin Graham. Spatially-sparse convolutional neural networks. arXiv:1409.6070, September 2014.\n[35] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv:1312.4400, 2014.\n[36] Marijn F Stollenga, Jonathan Masci, Faustino Gomez, and J\u00a8urgen Schmidhuber. Deep networks with\n\ninternal selective attention through feedback connections. In NIPS. 2014.\n\n[37] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for sim-\n\nplicity: The all convolutional net. arXiv:1412.6806 [cs], December 2014.\n\n[38] Rupesh Kumar Srivastava, Jonathan Masci, Faustino Gomez, and J\u00a8urgen Schmidhuber. Understanding\n\nlocally competitive networks. In International Conference on Learning Representations, 2015.\n\n9\n\n\f", "award": [], "sourceid": 1397, "authors": [{"given_name": "Rupesh", "family_name": "Srivastava", "institution": "IDSIA"}, {"given_name": "Klaus", "family_name": "Greff", "institution": "IDSIA"}, {"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}]}