{"title": "Dilated Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 77, "page_last": 87, "abstract": "Learning with recurrent neural networks (RNNs) on long sequences is a notoriously difficult task.  There are three major challenges: 1) complex dependencies, 2) vanishing and exploding gradients, and 3) efficient parallelization. In this paper, we introduce a simple yet effective RNN connection structure, the DilatedRNN, which simultaneously tackles all of these challenges.  The proposed architecture is characterized by multi-resolution dilated recurrent skip connections and can be combined flexibly with diverse RNN cells.  Moreover, the DilatedRNN reduces the number of parameters needed and enhances training efficiency significantly, while matching state-of-the-art performance (even with standard RNN cells) in tasks involving very long-term dependencies.  To provide a theory-based quantification of the architecture's advantages, we introduce a memory capacity measure, the mean recurrent length, which is more suitable for RNNs with long skip connections than existing measures.  We rigorously prove the advantages of the DilatedRNN over other recurrent neural architectures.  The code for our method is publicly available at https://github.com/code-terminator/DilatedRNN.", "full_text": "Dilated Recurrent Neural Networks\n\nShiyu Chang1\u21e4, Yang Zhang1\u21e4, Wei Han2\u21e4, Mo Yu1, Xiaoxiao Guo1, Wei Tan1,\n\nXiaodong Cui1, Michael Witbrock1, Mark Hasegawa-Johnson2, Thomas S. Huang2\n\n{yum, wtan, cuix, witbrock}@us.ibm.com,\n\n{weihan3, jhasegaw, t-huang1}@illinois.edu\n\n1IBM Thomas J. Watson Research Center, Yorktown, NY 10598, USA\n2University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA\n\n{shiyu.chang, yang.zhang2, xiaoxiao.guo}@ibm.com,\n\nAbstract\n\nLearning with recurrent neural networks (RNNs) on long sequences is a notori-\nously dif\ufb01cult task. 
There are three major challenges: 1) complex dependencies, 2)\nvanishing and exploding gradients, and 3) ef\ufb01cient parallelization. In this paper,\nwe introduce a simple yet effective RNN connection structure, the DILATEDRNN,\nwhich simultaneously tackles all of these challenges. The proposed architecture is\ncharacterized by multi-resolution dilated recurrent skip connections, and can be\ncombined \ufb02exibly with diverse RNN cells. Moreover, the DILATEDRNN reduces\nthe number of parameters needed and enhances training ef\ufb01ciency signi\ufb01cantly,\nwhile matching state-of-the-art performance (even with standard RNN cells) in\ntasks involving very long-term dependencies. To provide a theory-based quanti\ufb01-\ncation of the architecture\u2019s advantages, we introduce a memory capacity measure,\nthe mean recurrent length, which is more suitable for RNNs with long skip\nconnections than existing measures. We rigorously prove the advantages of the\nDILATEDRNN over other recurrent neural architectures. The code for our method\nis publicly available1.\n\n1\n\nIntroduction\n\nRecurrent neural networks (RNNs) have been shown to have remarkable performance on many\nsequential learning problems. However, long sequence learning with RNNs remains a challenging\nproblem for the following reasons: \ufb01rst, memorizing extremely long-term dependencies while\nmaintaining mid- and short-term memory is dif\ufb01cult; second, training RNNs using back-propagation-\nthrough-time is impeded by vanishing and exploding gradients; And lastly, both forward- and\nback-propagation are performed in a sequential manner, which makes the training time-consuming.\nMany attempts have been made to overcome these dif\ufb01culties using specialized neural structures, cells,\nand optimization techniques. Long short-term memory (LSTM) [10] and gated recurrent units (GRU)\n[6] powerfully model complex data dependencies. 
Recent attempts have focused on multi-timescale designs, including clockwork RNNs [12], phased LSTM [17], hierarchical multi-scale RNNs [5], etc. The problem of vanishing and exploding gradients is mitigated by LSTM and GRU memory gates; other partial solutions include gradient clipping [18], orthogonal and unitary weight optimization [2, 14, 24], and skip connections across multiple timestamps [8, 30]. For efficient sequential training, WaveNet [22] abandoned RNN structures, proposing instead the dilated causal convolutional neural network (CNN) architecture, which provides significant advantages in working directly with raw audio waveforms. However, the length of the dependencies captured by a dilated CNN is limited by its kernel size, whereas an RNN's autoregressive modeling can, in theory, capture potentially infinitely long dependencies with a small number of parameters. Recently, Yu et al. [27] proposed learning-based RNNs with the ability to jump (skim input text) after seeing a few timestamps' worth of data; although the authors showed that the modified LSTM with jumping provides up to a six-fold speed increase, the efficiency gain is mainly in the testing phase.

*Denotes equal contribution.
1 https://github.com/code-terminator/DilatedRNN

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: (left) A single-layer RNN with recurrent skip connections. (mid) A single-layer RNN with dilated recurrent skip connections. (right) A computation structure equivalent to the second graph, which reduces the sequence length by four times.

In this paper, we introduce the DILATEDRNN, a neural connection architecture analogous to the dilated CNN [22, 28], but under a recurrent setting. 
Our approach provides a simple yet useful\nsolution that tries to alleviate all challenges simultaneously. The DILATEDRNN is a multi-layer, and\ncell-independent architecture characterized by multi-resolution dilated recurrent skip connections.\nThe main contributions of this work are as follows. 1) We introduce a new dilated recurrent skip\nconnection as the key building block of the proposed architecture. These alleviate gradient problems\nand extend the range of temporal dependencies like conventional recurrent skip connections, but in\nthe dilated version require fewer parameters and signi\ufb01cantly enhance computational ef\ufb01ciency. 2)\nWe stack multiple dilated recurrent layers with hierarchical dilations to construct a DILATEDRNN,\nwhich learns temporal dependencies of different scales at different layers. 3) We present the mean\nrecurrent length as a new neural memory capacity measure that reveals the performance difference\nbetween the previously developed recurrent skip-connections and the dilated version. We also verify\nthe optimality of the exponentially increasing dilation distribution used in the proposed architecture.\nIt is worth mentioning that, the recent proposed Dilated LSTM [23] can be viewed as a special case\nof our model, which contains only one dilated recurrent layer with \ufb01xed dilation. The main purpose\nof their model is to reduce the temporal resolution on time-sensitive tasks. Thus, the Dilated LSTM\nis not a general solution for modeling at multiple temporal resolutions.\nWe empirically validate the DILATEDRNN in multiple RNN settings on a variety of sequential\nlearning tasks, including long-term memorization, pixel-by-pixel classi\ufb01cation of handwritten digits\n(with permutation and noise), character-level language modeling, and speaker identi\ufb01cation with\nraw audio waveforms. The DILATEDRNN improves signi\ufb01cantly on the performance of a regular\nRNN, LSTM, or GRU with far fewer parameters. 
Many studies [6, 14] have shown that vanilla RNN cells perform poorly in these learning tasks. However, within the proposed structure, even vanilla RNN cells outperform more sophisticated designs and match the state of the art. We believe that the DILATEDRNN provides a simple and generic approach to learning on very long sequences.

2 Dilated Recurrent Neural Networks

The main ingredients of the DILATEDRNN are its dilated recurrent skip connection and its use of exponentially increasing dilation; these will be discussed in the following two subsections respectively.

2.1 Dilated Recurrent Skip Connection

Denote c^{(l)}_t as the cell in layer l at time t. The dilated skip connection can be represented as

c^{(l)}_t = f(x^{(l)}_t, c^{(l)}_{t−s^{(l)}}).   (1)

This is similar to the regular skip connection [8, 30], which can be represented as

c^{(l)}_t = f(x^{(l)}_t, c^{(l)}_{t−1}, c^{(l)}_{t−s^{(l)}}).   (2)

s^{(l)} is referred to as the skip length, or the dilation of layer l; x^{(l)}_t is the input to layer l at time t; and f(·) denotes any RNN cell and output operations, e.g. vanilla RNN cell, LSTM, GRU, etc. Both skip connections allow information to travel along fewer edges.

Figure 2: (left) An example of a three-layer DILATEDRNN with dilation 1, 2, and 4. (right) An example of a two-layer DILATEDRNN, with dilation 2 in the first layer. In such a case, extra embedding connections are required (red arrows) to compensate for missing data dependencies.

The difference between the dilated and the regular skip connection is that the dependency on c^{(l)}_{t−1} is removed in the dilated skip connection. The left and middle graphs in figure 1 illustrate the differences between the two architectures with dilation or skip length s^{(l)} = 4, where W'_r is removed in the middle graph. 
This reduces the number of parameters. More importantly, the computational efficiency of a parallel implementation (e.g., using GPUs) can be greatly improved by parallelizing operations that, in a regular RNN, would be impossible. The middle and right graphs in figure 1 illustrate the idea with s^{(l)} = 4 as an example. The input subsequences {x^{(l)}_{4t}}, {x^{(l)}_{4t+1}}, {x^{(l)}_{4t+2}} and {x^{(l)}_{4t+3}} are given four different colors. The four cell chains, {c^{(l)}_{4t}}, {c^{(l)}_{4t+1}}, {c^{(l)}_{4t+2}} and {c^{(l)}_{4t+3}}, can be computed in parallel by feeding the four subsequences into a regular RNN, as shown on the right of figure 1. The output can then be obtained by interweaving the four output chains. The degree of parallelization is increased by s^{(l)} times.

2.2 Exponentially Increasing Dilation

To extract complex data dependencies, we stack dilated recurrent layers to construct the DILATEDRNN. Similar to the settings introduced in WaveNet [22], the dilation increases exponentially across layers. Denote s^{(l)} as the dilation of the l-th layer. Then,

s^{(l)} = M^{l−1}, l = 1,···,L.   (3)

The left side of figure 2 depicts an example of a DILATEDRNN with L = 3 and M = 2. On one hand, stacking multiple dilated recurrent layers increases the model capacity. On the other hand, exponentially increasing dilation brings two benefits. First, it makes different layers focus on different temporal resolutions. Second, it reduces the average length of paths between nodes at different timestamps, which improves the ability of RNNs to extract long-term dependencies and prevents vanishing and exploding gradients. A formal proof of this statement will be given in section 3.

To improve overall computational efficiency, a generalization of our standard DILATEDRNN is also proposed. The dilation in the generalized DILATEDRNN does not start at one, but at M^{l0}. Formally,

s^{(l)} = M^{(l−1+l0)}, l = 1,···,L and l0 ≥ 0,   (4)

where M^{l0} is called the starting dilation. 
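The dilated connection of equation (1) and the phase-splitting parallelization of section 2.1 can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the authors' released implementation: the tanh vanilla cell and the function names are ours.

```python
import numpy as np

def vanilla_cell(x, c, Wx, Wc, b):
    # One vanilla RNN step: c_t = tanh(Wx x_t + Wc c_prev + b)
    return np.tanh(Wx @ x + Wc @ c + b)

def dilated_rnn_layer(xs, s, Wx, Wc, b):
    """Run c_t = f(x_t, c_{t-s}) (equation (1)) over a list of input vectors.

    Because the only recurrent edge has length s, the sequence decomposes
    into s independent interleaved chains; a parallel implementation can
    therefore process the s downsampled subsequences as a batch with an
    ordinary dilation-1 RNN, as in the right graph of figure 1.
    """
    h = Wc.shape[0]
    cs = []
    for t, x in enumerate(xs):
        c_prev = cs[t - s] if t >= s else np.zeros(h)
        cs.append(vanilla_cell(x, c_prev, Wx, Wc, b))
    return cs
```

Running the layer with dilation s on the full sequence produces exactly the states obtained by running it with dilation 1 on each of the s interleaved subsequences and interweaving the results, which is the equivalence the parallelization argument relies on.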
To compensate for the missing dependencies shorter than M^{l0}, a 1-by-M^{l0} convolutional layer is appended as the final layer. The right side of figure 2 illustrates an example with l0 = 1, i.e. dilations starting at two. Without the red edges, there would be no edges connecting nodes at odd and even time stamps. As discussed in section 2.1, the computational efficiency can be increased by a factor of M^{l0} by breaking the input sequence into M^{l0} downsampled subsequences, and feeding each into an (L − l0)-layer regular DILATEDRNN with shared weights.

3 The Memory Capacity of DILATEDRNN

In this section, we extend the analysis framework in [30] to establish better measures of memory capacity and parameter efficiency, which will be discussed in the following two sections respectively.

3.1 Memory Capacity

To facilitate theoretical analysis, we apply the cyclic graph G_C notation introduced in [30].

Definition 3.1 (Cyclic Graph). The cyclic graph representation of an RNN structure is a directed multi-graph, G_C = (V_C, E_C). Each edge is labeled as e = (u, v, σ) ∈ E_C, where u is the origin node, v is the destination node, and σ is the number of time steps the edge travels. Each node is labeled as v = (i, p) ∈ V_C, where i is the time index of the node modulo m, m is the period of the graph, and p is the node index. G_C must contain at least one directed cycle. Along the edges of any directed cycle, the summation of σ must not be zero.

Define d_i(n) as the length of the shortest path from any input node at time i to any output node at time i + n. In [30], a measure of the memory capacity is proposed that essentially only looks at d_i(m), where m is the period of the graph. This is reasonable when the period is small. However, when the period is large, the entire distribution of d_i(n), ∀n ≤ m, makes a difference, not just the one at span m. 
As a concrete example, suppose there is an RNN architecture of period m = 10,000, implemented using equation (2) with skip length s^{(l)} = m, so that d_i(n) = n for n = 1,···,9,999 and d_i(m) = 1. This network rapidly memorizes the dependence of the outputs at time i + m = i + 10,000 on the inputs at time i, but the shorter dependencies 2 ≤ n ≤ 9,999 are much harder to learn. Motivated by this, we propose the following additional measure of memory capacity.

Definition 3.2 (Mean Recurrent Length). For an RNN with cycle m, the mean recurrent length is

d̄ = max_{i ∈ V} (1/m) Σ_{n=1}^{m} d_i(n).   (5)

The mean recurrent length studies the average dilation across different time spans within a cycle. An architecture with good memory capacity should generally have a small recurrent length for all time spans; otherwise the network can only selectively memorize information at a few time spans. Also, we take the maximum over i, so as to penalize networks that have a small recurrent length only for a few starting times, and can therefore only memorize well information originating from those specific times. The proposed mean recurrent length has an interesting reciprocal relation with the short-term memory (STM) measure proposed in [11], but the mean recurrent length places more emphasis on long-term memory capacity, which is more suitable for our intended applications.

With this, we are ready to illustrate the memory advantage of the DILATEDRNN. Consider two RNN architectures. One is the proposed DILATEDRNN structure with d layers and M = 2 (equation (1)). The other is a regular d-layer RNN with skip connections (equation (2)). If the skip connections are of skip s^{(l)} = 2^{l−1}, then the connections in the RNN are a strict superset of those in the DILATEDRNN, and the RNN accomplishes exactly the same d̄ as the DILATEDRNN, but with twice the number of trainable parameters (see section 3.2). 
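The contrast drawn by Definition 3.2 can be checked numerically. The sketch below is our own illustration with hypothetical function names; it counts only recurrent hops needed to cover a span, ignoring the feed-forward edges between layers, so the constants differ from the exact closed forms derived in this section, but the linear-versus-logarithmic contrast survives.

```python
def mean_span_skip_rnn(m):
    # Skip length m on top of unit recurrences (the example above):
    # d_i(n) = n for n = 1..m-1, and d_i(m) = 1 via the length-m skip.
    return (sum(range(1, m)) + 1) / m

def mean_span_dilated(m):
    # Exponential dilations {1, 2, 4, ..., m}: a span n is covered greedily
    # by its binary expansion, one hop per set bit of n, so the average
    # number of hops grows like log2(m) / 2.
    return sum(bin(n).count("1") for n in range(1, m + 1)) / m
```

For m = 10,000 the skip-RNN average is roughly (m − 1)/2, growing linearly with m, while the dilated average stays near log₂(m)/2, mirroring the gap between the linear and logarithmic growth rates of d̄ for the two architectures.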
Suppose one were to give every layer in the RNN the largest possible skip for any graph with a period of m = 2^{d−1}: set s^{(l)} = 2^{d−1} in every layer, which is the regular skip RNN setting. This apparent advantage turns out to be a disadvantage, because time spans of 2 ≤ n < m suffer from increased path lengths, and therefore

d̄ = (m − 1)/2 + log₂ m + 1/m + 1,   (6)

which grows linearly with m. On the other hand, for the proposed DILATEDRNN,

d̄ = ((3m − 1)/2m) log₂ m + 1/m + 1,   (7)

where d̄ grows only logarithmically with m, which is much smaller than that of the regular skip RNN. It implies that the information in the past on average travels along far fewer edges, and thus undergoes far less attenuation. The derivation is given in appendix A in the supplementary materials.

3.2 Parameter Efficiency

The advantage of the DILATEDRNN lies not only in its memory capacity but also in the number of parameters with which it achieves that memory capacity. To quantify the analysis, the following measure is introduced.

Definition 3.3 (Number of Recurrent Edges per Node). Denote Card{·} as the set cardinality. For an RNN represented as G_C = (V_C, E_C), the number of recurrent edges per node, N_r, is defined as

N_r = Card{e = (u, v, σ) ∈ E_C : σ ≠ 0} / Card{V_C}.   (8)

Ideally, we would want a network that has large recurrent skips while maintaining a small number of recurrent weights. It is easy to show that N_r is 1 for the DILATEDRNN and 2 for RNNs with regular skip connections. The DILATEDRNN has half the recurrent complexity of the RNN with regular skip connections because of the removal of the direct recurrent edge. The following theorem states that the DILATEDRNN is able to achieve the best memory capacity among a class of connection structures with N_r = 1, and is thus among the most parameter-efficient RNN architectures.

Theorem 3.1 (Parameter Efficiency of DILATEDRNN). 
Consider a subset of d-layer RNNs with period m = M^{d−1} that consists purely of dilated skip connections (hence N_r = 1). For the RNNs in this subset, there are d different dilations, 1 = s_1 ≤ s_2 ≤ ··· ≤ s_d = m, and

s_i = n_i s_{i−1},   (9)

where n_i is an arbitrary positive integer. Among this subset, the d-layer DILATEDRNN with dilation rates {M^0,···,M^{d−1}} achieves the smallest d̄.

The proof is motivated by [4], and is given in appendix B.

3.3 Comparing with Dilated CNN

Since the DILATEDRNN is motivated by the dilated CNN [22, 28], it is useful to compare their memory capacities. Although the cyclic graph, the mean recurrent length, and the number of recurrent edges per node are designed for recurrent structures, they happen to be applicable to the dilated CNN as well. What is more, it can easily be shown that, compared to a DILATEDRNN with the same number of layers and the same dilation rate at each layer, a dilated CNN has exactly the same number of recurrent edges per node, and a slightly smaller (by log₂ m) mean recurrent length. Hence both architectures have the same model complexity, and the dilated CNN appears to have a slightly better memory capacity.

However, the mean recurrent length only measures the memory capacity within a cycle. When going beyond a cycle, it has already been shown [30] that the recurrent length grows linearly with the number of cycles for RNN structures, including the DILATEDRNN, whereas for a dilated CNN the receptive field size is always finite (thus the mean recurrent length goes to infinity beyond the receptive field size). For example, with dilations 2^{l−1} for layers l = 1,···,d, a dilated CNN has a receptive field size of 2^d, which is two cycles. On the other hand, the memory of a DILATEDRNN can go far beyond two cycles, particularly with sophisticated units like GRU and LSTM. 
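The finite receptive field of the dilated CNN is easy to make concrete with a short helper (our own sketch; the kernel size of 2 is an assumption consistent with WaveNet-style dilated causal convolutions):

```python
def dilated_cnn_receptive_field(d, kernel_size=2):
    # A stack of d causal convolutions with dilations 1, 2, ..., 2^(d-1):
    # each layer l widens the receptive field by (kernel_size - 1) * 2^(l-1).
    return 1 + sum((kernel_size - 1) * 2 ** (l - 1) for l in range(1, d + 1))
```

With kernel size 2 this gives exactly 2^d, i.e. two cycles of the period m = 2^{d−1}; it also matches the receptive field sizes of 512 (nine layers) and 1,024 (ten layers) quoted for the dilated CNN baseline in section 4.2. Beyond this horizon the CNN sees nothing, whereas the recurrence of a DILATEDRNN keeps propagating state.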
Hence the memory\ncapacity advantage of DILATEDRNN over a dilated CNN is obvious.\n\n4 Experiments\n\nIn this section, we evaluate the performance of DILATEDRNN on four different tasks, which include\nlong-term memorization, pixel-by-pixel MNIST classi\ufb01cation [15], character-level language modeling\non the Penn Treebank [16], and speaker identi\ufb01cation with raw waveforms on VCTK [26]. We also\ninvestigate the effect of dilation on performance and computational ef\ufb01ciency.\nUnless speci\ufb01ed otherwise, all the models are implemented with Tensor\ufb02ow [1]. We use the default\nnonlinearities and RMSProp optimizer [21] with learning rate 0.001 and decay rate of 0.9. All\nweight matrices are initialized by the standard normal distribution. The batch size is set to 128.\nFurthermore, in all the experiments, we apply the sequence classi\ufb01cation setting [25], where the\noutput layer only adds at the end of the sequence. Results are reported for trained models that achieve\nthe best validation loss. Unless stated otherwise, no tricks, such as gradient clipping [18], learning\nrate annealing, recurrent weight dropout [20], recurrent batch norm [20], layer norm [3], etc., are\napplied. All the tasks are sequence level classi\ufb01cation tasks, and therefore the \u201cgridding\u201d problem\n[29] is irrelevant. No \u201cdegridded\u201d module is needed.\nThree RNN cells, Vanilla, LSTM and GRU cells, are combined with the DILATEDRNN , which\nwe refer to as dilated Vanilla, dilated LSTM and dilated GRU, respectively. The common baselines\ninclude single-layer RNNs (denoted as Vanilla RNN, LSTM, and GRU), multi-layer RNNs (denoted\nas stack Vanilla, stack LSTM, and stack GRU), and Vanilla RNN with regular skip connections\n(denoted as Skip Vanilla). Additional baselines will be speci\ufb01ed in the corresponding subsections.\n\n4.1 Copy memory problem\n\nThis task tests the ability of recurrent models in memorizing long-term information. 
We follow a setup similar to [2, 24, 10]. Each input sequence is of length T + 20. The first ten values are randomly generated from the integers 0 to 7; the next T − 1 values are all 8; the last 11 values are all 9, where the first 9 signals that the model needs to start outputting the first 10 values of the inputs. Different from the settings in [2, 24], the average cross-entropy loss is only measured at the last 10 timestamps. Therefore, random guessing yields an expected average cross-entropy of ln(8) ≈ 2.079.

Figure 3: Results of the copy memory problem with T = 500 (left) and T = 1000 (right). The DILATEDRNN converges quickly to the perfect solution. Except for RNNs with dilated skip connections, all other methods are unable to improve over random guessing.

The DILATEDRNN uses 9 layers with a hidden state size of 10. The dilation starts at one and reaches 256 at the last hidden layer. The single-layer baselines have 256 hidden units. The multi-layer baselines use the same number of layers and hidden state size as the DILATEDRNN. The skip Vanilla has 9 layers, and the skip length at each layer is 256, which matches the maximum dilation of the DILATEDRNN.

The convergence curves for the two settings, T = 500 and T = 1,000, are shown in figure 3. In both settings, the DILATEDRNN with vanilla cells converges to a good optimum after about 1,000 training iterations, whereas the dilated LSTM and GRU converge more slowly. This might be because the LSTM and GRU cells are much more complex than the vanilla unit. Except for the proposed models, all the other models are unable to do better than random guessing, including the skip Vanilla. These results suggest that the proposed structure, as a simple renovation, is very useful for problems requiring very long memory.

4.2 Pixel-by-pixel MNIST

Sequential classification on the MNIST digits [15] is commonly used to test the performance of RNNs. We first implement two settings. 
In the first setting, called the unpermuted setting, we follow the same setup as in [2, 13, 14, 24, 30] by serializing each image into a 784 x 1 sequence. The second setting, called the permuted setting, rearranges the input sequence with a fixed permutation. The training, validation and testing sets are the default ones in Tensorflow. Hyperparameters and results are reported in table 1. In addition to the baselines already described, we also implement the dilated CNN. However, the receptive field size of a nine-layer dilated CNN is 512, which is insufficient to cover the sequence length of 784. Therefore, we added one more layer to the dilated CNN, which enlarges its receptive field size to 1,024. This also gives the dilated CNN a slight advantage over the DILATEDRNN structures.

In the unpermuted setting, the dilated GRU achieves the best evaluation accuracy of 99.2. However, the performance improvements of the dilated GRU and LSTM over both the single- and multi-layer baselines are marginal, which might be because the task is too simple. Further, we observe a significant performance difference between stack Vanilla and skip Vanilla, which is consistent with the findings in [30] that RNNs can better model long-term dependencies and achieve good results when recurrent skip connections are added. Nevertheless, the dilated Vanilla has yet another significant performance gain over the skip Vanilla, which is consistent with our argument in section 3 that the DILATEDRNN has a much more balanced memory over a wide range of time periods than RNNs with regular skips. The performance of the dilated CNN is dominated by the dilated LSTM and GRU, even when the latter have fewer parameters (in the 20-hidden-unit case) than the former (in the 50-hidden-unit case).

In the permuted setting, almost all performances are lower. However, the DILATEDRNN models maintain very high evaluation accuracies. 
In particular, the dilated Vanilla outperforms the previous RNN-based state of the art, Zoneout [13], with a comparable number of parameters. It achieves a test accuracy of 96.1 with only 44k parameters. Note that the previous state of the art utilizes recurrent batch normalization; the version without it has a much lower performance than all the dilated models. We believe the consistently high performance of the DILATEDRNN across different permutations is due to its hierarchical multi-resolution dilations. In addition, the dilated CNN is able to achieve the best performance, which is in accordance with our claim in section 3.3 that the dilated CNN has a slightly shorter mean recurrent length than DILATEDRNN architectures when the sequence length falls within its receptive field size. However, note that this is achieved by adding one additional layer to expand its receptive field size compared to the RNN counterparts. When the useful information lies outside its receptive field, the dilated CNN might fail completely.

Figure 4: Results of the noisy MNIST task with T = 1000 (left) and 2000 (right). RNN models without skip connections fail. The DILATEDRNN significantly outperforms regular recurrent skips and is on par with the dilated CNN.

Table 1: Results for unpermuted and permuted pixel-by-pixel MNIST. Italic numbers indicate results copied from the original paper. 
The best results are in bold.

Method | # layers | hidden / layer | # parameters (≈, k) | max dilations | unpermuted test accuracy | permuted test accuracy
Vanilla RNN | 1 / 9 | 256 / 20 | 68 / 7 | 1 | - / 49.1 | 71.6 / 88.5
LSTM [24] | 1 / 9 | 256 / 20 | 270 / 28 | 1 | 98.2 / 98.7 | 91.7 / 89.5
GRU | 1 / 9 | 256 / 20 | 200 / 21 | 1 | 99.1 / 98.8 | 94.1 / 91.3
IRNN [14] | 1 | 100 | 12 | 1 | 97.0 | ≈82.0
Full uRNN [24] | 1 | 512 | 270 | 1 | 97.5 | 94.1
Skipped RNN [30] | 1 / 9 | 95 / 20 | 16 / 11 | 21 / 256 | 98.1 / 85.4 | 94.0 / 91.8
Zoneout [13] | 1 | 100 | 42 | 1 | - | 93.1 / 95.9²
Dilated CNN [22] | 10 | 20 / 50 | 7 / 46 | 512 | 98.0 / 98.3 | 95.7 / 96.7
Dilated Vanilla | 9 | 20 / 50 | 7 / 44 | 256 | 97.7 / 98.0 | 95.5 / 96.1
Dilated LSTM | 9 | 20 / 50 | 28 / 173 | 256 | 98.9 / 98.9 | 94.2 / 95.4
Dilated GRU | 9 | 20 / 50 | 21 / 130 | 256 | 99.0 / 99.2 | 94.4 / 94.6

In addition to these two settings, we propose a more challenging task called noisy MNIST, where we pad the unpermuted pixel sequences with [0, 1] uniform random noise to a length of T. The results for the two setups, T = 1,000 and T = 2,000, are shown in figure 4. The dilated recurrent models and the skip RNN have 9 layers and 20 hidden units per layer. The number of skips at each layer of the skip RNN is 256. The dilated CNN has 10 layers for T = 1,000 and 11 layers for T = 2,000, which expands its receptive field size to the entire input sequence. The number of filters per layer is 20. It is worth mentioning that, in the case of T = 2,000, a 10-layer dilated CNN would only produce random guesses, because its output node would only see the last 1,024 input samples, which do not contain any informative data. All the other reported models have the same hyperparameters as shown in the first three rows of table 1. We found that none of the models without skip connections is able to learn. 
Although the skip Vanilla still learns, its performance drops compared to the first, unpermuted setup. In contrast, the DILATEDRNN and dilated CNN models obtain almost the same performances as before. It is also worth mentioning that in all three experiments, the DILATEDRNN models are able to achieve comparable results with an extremely small number of parameters.

4.3 Language modeling

We further investigate the task of predicting the next character on the Penn Treebank dataset [16]. We follow the data splitting rule, with a sequence length of 100, that is commonly used in previous studies. This corpus contains 1 million words, which is small and prone to over-fitting, so model regularization methods have been shown to be effective on the validation and test set performances. Unlike many existing approaches, we apply no regularization other than dropout on the input layer. Instead, we focus on investigating the regularization effect of the dilated structure itself. Results are shown in table 2. Although Zoneout, LayerNorm HM-LSTM and HyperNetworks outperform the DILATEDRNN models, they apply batch or layer normalization as regularization. 
To the best of our knowledge, the dilated GRU with 1.27 BPC achieves the best result among models of similar sizes without layer normalization. Also, the dilated models outperform their regular counterparts, Vanilla (which did not converge and is omitted), LSTM and GRU, without increasing the model complexity.

² With recurrent batch norm [20].

Table 2: Character-level language modeling on the Penn Treebank dataset.

Method | # layers | hidden / layer | # parameters (≈, M) | max dilations | evaluation BPC
LSTM | 1 / 5 | 1k / 256 | 4.25 / 1.9 | 1 | 1.31 / 1.33
GRU | 1 / 5 | 1k / 256 | 3.19 / 1.42 | 1 | 1.32 / 1.33
Recurrent BN-LSTM [7] | 1 | 1k | 4.25 | 1 | 1.32
Recurrent dropout LSTM [20] | 1 | 1k | 4.25 | 1 | 1.30
Zoneout [13] | 1 | 1k | - | 1 | 1.27
LayerNorm HM-LSTM [5] | 3 | 512 | - | 1 | 1.24
HyperNetworks [9] | 1 / 2 | 1k | 4.91 / 14.41 | 1 | 1.26 / 1.22³
Dilated Vanilla | 5 | 256 | 0.6 | 64 | 1.37
Dilated LSTM | 5 | 256 | 1.9 | 64 | 1.31
Dilated GRU | 5 | 256 | 1.42 | 64 | 1.27

Table 3: Speaker identification on the VCTK dataset.

Method | # layers | hidden / layer | # parameters (≈, k) | min dilations | max dilations | evaluation accuracy
MFCC GRU | 5 / 1 | 20 / 128 | 16 / 68 | 1 | 1 | 0.66 / 0.77
Raw Fused GRU | 1 | 256 | 225 | 32 / 8 | 32 / 8 | 0.45 / 0.65
Raw Dilated GRU | 6 / 8 | 50 | 103 / 133 | 32 / 8 | 1024 | 0.64 / 0.74

4.4 Speaker identification from raw waveform

We also perform the speaker identification task using the corpus from VCTK [26]. Learning audio models directly from the raw waveform poses a difficult challenge for recurrent models because of the vastly long-term dependencies. Recently the CLDNN family of models [19] managed to match or surpass the log mel-frequency features in several speech problems using waveforms. 
However,\nCLDNNs coarsen the temporal granularity by pooling the \ufb01rst-layer CNN output before feeding it\ninto the subsequent RNN layers, so as to solve the memory challenge. Instead, the DILATEDRNN\ndirectly works on the raw waveform without pooling, which is considered more dif\ufb01cult.\nTo achieve a feasible training time, we adopt the ef\ufb01cient generalization of the DILATEDRNN as\nproposed in equation (4) with l0 = 3 and l0 = 5 . As mentioned before, if the dilations do not start\nat one, the model is equivalent to multiple shared-weight networks, each working on partial inputs,\nand the predictions are made by fusing the information using a 1-by-M l0 convolutional layer. Our\nbaseline GRU model follows the same setting with various resolutions (referred to as fused-GRU),\nwith dilation starting at 8. This baseline has 8 share-weight GRU networks, and each subnetwork\nworks on 1/8 of the subsampled sequences. The same fusion layer is used to obtain the \ufb01nal prediction.\nSince most other regular baselines failed to converge, we also implemented the MFCC-based models\non the same task setting for reference. The 13-dimensional log-mel frequency features are computed\nwith 25ms window and 5ms shift. The inputs of MFCC models are of length 100 to match the\ninput duration in the waveform-based models. The MFCC feature has two natural advantages: 1)\nno information loss from operating on subsequences; 2) shorter sequence length. Nevertheless, our\ndilated models operating directly on the waveform still offer a competitive performance (Table 3).\n\n4.5 Discussion\nIn this subsection, we \ufb01rst investigate the relationship between performance and the number of\ndilations. We compare the DILATEDRNN models with different numbers of layers on the noisy\nMNIST T = 1, 000 task. All models use vanilla RNN cells with hidden state size 20. The number of\ndilations starts at one. 
In \ufb01gure 5, we observe that the classi\ufb01cation accuracy and rate of convergence\nincreases as the models become deeper. Recall the maximum skip is exponential in the number of\nlayers. Thus, the deeper model has a larger maximum skip and mean recurrent length.\nSecond, we consider maintaining a large maximum skip with a smaller number of layers, by increasing\nthe dilation at the bottom layer of DILATEDRNN . First, we construct a nine-layer DILATEDRNN\n\n3with layer normalization [3].\n\n8\n\n\fFigure 5: Results for dilated vanilla with different numbers of layers on the noisy MNIST dataset.\nThe performance and convergent speed increase as the number of layers increases.\n\nFigure 6: Training time (left) and evaluation performance (right) for dilated vanilla that starts at\ndifferent numbers of dilations at the bottom layer. The maximum dilations for all models are 256.\n\nmodel with vanilla RNN cells. The number of dilations starts at 1, and hidden state size is 20. This\narchitecture is referred to as \u201cstarts at 1\u201d in \ufb01gure 6. Then, we remove the bottom hidden layers\none-by-one to construct seven new models. The last created model has three layers, and the number\nof dilations starts at 64. Figure 6 demonstrates both the wall time and evaluation accuracy for 50,000\ntraining iterations of noisy MNIST dataset. The training time reduces by roughly 50% for every\ndropped layer (for every doubling of the minimum dilation). Although the testing performance\ndecreases when the dilation does not start at one, the effect is marginal with s(0) = 2, and small with\n4 \uf8ff s(0) \uf8ff 16. 
Notably, the model with dilation starting at 64 trains within 17 minutes on a single NVIDIA P100 GPU while maintaining a 93.5% test accuracy.

5 Conclusion

Our experiments with the DILATEDRNN provide strong evidence that this simple multi-timescale architectural choice can reliably improve the ability of recurrent models to learn long-term dependencies in problems from different domains. We found that the DILATEDRNN trains faster, requires less hyperparameter tuning, and needs fewer parameters to achieve state-of-the-art performance. Complementing the experimental results, we have provided a theoretical analysis showing the advantages of the DILATEDRNN and proved its optimality under a meaningful architectural measure of RNNs.

Acknowledgement

The authors would like to thank Tom Le Paine (paine1@illinois.edu) and Ryan Musa (ramusa@us.ibm.com) for their insightful discussions.

References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Eduardo R. Caianiello, Gaetano Scarpetta, and Giovanna Simoncelli. A systemic study of monetary systems. International Journal of General Systems, 8(2):81–92, 1982.
[5] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling.
arXiv preprint arXiv:1412.3555, 2014.
[7] Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.
[8] Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In NIPS, volume 409, 1995.
[9] David Ha, Andrew Dai, and Quoc V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] Herbert Jaeger. Short Term Memory in Echo State Networks, volume 5. GMD-Forschungszentrum Informationstechnik, 2001.
[12] Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork RNN. arXiv preprint arXiv:1402.3511, 2014.
[13] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron Courville, et al. Zoneout: Regularizing RNNs by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.
[14] Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
[15] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[16] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[17] Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. arXiv preprint arXiv:1610.09513, 2016.
[18] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio.
On the difficulty of training recurrent neural networks. ICML (3), 28:1310–1318, 2013.
[19] Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, and Oriol Vinyals. Learning the speech front-end with raw waveform CLDNNs. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[20] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. arXiv preprint arXiv:1603.05118, 2016.
[21] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
[22] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.
[23] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
[24] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4880–4888, 2016.
[25] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter, 12(1):40–48, 2010.
[26] Junichi Yamagishi. English multi-speaker corpus for CSTR voice cloning toolkit. http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html, 2012.
[27] Adams W. Yu, Hongrae Lee, and Quoc V. Le. Learning to skim text. arXiv preprint arXiv:1704.06877, 2017.
[28] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[29] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser.
Dilated residual networks. arXiv preprint arXiv:1705.09914, 2017.
[30] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan R. Salakhutdinov, and Yoshua Bengio. Architectural complexity measures of recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1822–1830, 2016.