{"title": "Fast-Slow Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 5915, "page_last": 5924, "abstract": "Processing sequential data of variable length is a major challenge in a wide range of applications, such as speech recognition, language modeling, generative image modeling and machine translation. Here, we address this challenge by proposing a novel recurrent neural network (RNN) architecture, the Fast-Slow RNN (FS-RNN). The FS-RNN incorporates the strengths of both multiscale RNNs and deep transition RNNs as it processes sequential data on different timescales and learns complex transition functions from one time step to the next. We evaluate the FS-RNN on two character based language modeling data sets, Penn Treebank and Hutter Prize Wikipedia, where we improve state of the art results to 1.19 and 1.25 bits-per-character (BPC), respectively. In addition, an ensemble of two FS-RNNs achieves 1.20 BPC on Hutter Prize Wikipedia outperforming the best known compression algorithm with respect to the BPC measure. We also present an empirical investigation of the learning and network dynamics of the FS-RNN, which explains the improved performance compared to other RNN architectures. Our approach is general as any kind of RNN cell is a possible building block for the FS-RNN architecture, and thus can be flexibly applied to different tasks.", "full_text": "Fast-Slow Recurrent Neural Networks\n\nAsier Mujika\n\nDepartment of Computer Science\n\nETH Z\u00fcrich, Switzerland\n\nasierm@ethz.ch\n\nFlorian Meier\n\nDepartment of Computer Science\n\nETH Z\u00fcrich, Switzerland\nmeierflo@inf.ethz.ch\n\nAngelika Steger\n\nDepartment of Computer Science\n\nETH Z\u00fcrich, Switzerland\nsteger@inf.ethz.ch\n\nAbstract\n\nProcessing sequential data of variable length is a major challenge in a wide range\nof applications, such as speech recognition, language modeling, generative image\nmodeling and machine translation. 
Here, we address this challenge by proposing\na novel recurrent neural network (RNN) architecture, the Fast-Slow RNN (FS-RNN).\nThe FS-RNN incorporates the strengths of both multiscale RNNs and\ndeep transition RNNs as it processes sequential data on different timescales and\nlearns complex transition functions from one time step to the next. We evaluate\nthe FS-RNN on two character level language modeling data sets, Penn Treebank\nand Hutter Prize Wikipedia, where we improve state of the art results to 1.19\nand 1.25 bits-per-character (BPC), respectively. In addition, an ensemble of two\nFS-RNNs achieves 1.20 BPC on Hutter Prize Wikipedia, outperforming the best\nknown compression algorithm with respect to the BPC measure. We also present\nan empirical investigation of the learning and network dynamics of the FS-RNN,\nwhich explains the improved performance compared to other RNN architectures.\nOur approach is general as any kind of RNN cell is a possible building block for\nthe FS-RNN architecture, and thus can be flexibly applied to different tasks.\n\n1 Introduction\n\nProcessing, modeling and predicting sequential data of variable length is a major challenge in the\nfield of machine learning. In recent years, recurrent neural networks (RNNs) [34, 32, 39, 41] have\nbeen the most popular tool to approach this challenge. RNNs have been successfully applied to\nimprove state of the art results in complex tasks like language modeling and speech recognition. A\npopular variant of RNNs is the long short-term memory (LSTM) [18], which was proposed\nto address the vanishing gradient problem [16, 5, 17]. 
LSTMs maintain constant error flow and thus\nare better suited to learning long-term dependencies than standard RNNs.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nOur work contributes to the ongoing debate on how to interconnect several RNN cells with the goals\nof promoting the learning of long-term dependencies, favoring efficient hierarchical representations of\ninformation, exploiting the computational advantages of deep over shallow networks and increasing\ncomputational efficiency of training and testing. In deep RNN architectures, RNNs or LSTMs\nare stacked layer-wise on top of each other [9, 20, 11]. The additional layers enable the network\nto learn complex input-to-output relations and encourage an efficient hierarchical representation\nof information. In these architectures, the hidden states of all the hierarchical layers are updated\nonce per time step (by one time step we refer to the time between two consecutive input elements).\nIn multiscale RNN architectures [35, 9, 25, 6], the operation on different timescales is enforced\nby updating the higher layers less frequently, which further encourages an efficient hierarchical\nrepresentation of information. Updating higher layers in fewer time steps leads to computationally\nefficient implementations and gives rise to short gradient paths that favor the learning of long-term\ndependencies. In deep transition RNN architectures, intermediate sequentially connected layers are\ninterposed between two consecutive hidden states in order to increase the depth of the transition\nfunction from one time step to the next, as for example in deep transition networks [31] or Recurrent\nHighway Networks (RHN) [43]. The intermediate layers enable the network to learn complex\nnon-linear transition functions. 
Thus, the model exploits the fact that deep models can represent some\nfunctions exponentially more efficiently than shallow models [4]. We interpret these networks as\nseveral RNN cells that update a single hidden state sequentially. Observe that any RNN cell can be\nused to build a deep transition RNN by connecting several of these cells sequentially.\nHere, we propose the Fast-Slow RNN (FS-RNN) architecture, a novel way of interconnecting RNN\ncells that combines advantages of multiscale RNNs and deep transition RNNs. The architecture\nconsists of k sequentially connected RNN cells in the lower hierarchical layer and one RNN cell in\nthe higher hierarchical layer, see Figure 1 and Section 3. Therefore, the hidden state of the lower\nlayer is updated k times per time step, whereas the hidden state of the higher layer is updated only\nonce per time step. We evaluate the FS-RNN on two standard character level language modeling data\nsets, namely Penn Treebank and Hutter Prize Wikipedia. Additionally, following [31], we present an\nempirical analysis that reveals advantages of the FS-RNN architecture over other RNN architectures.\nThe main contributions of this paper are:\n\n\u2022 We propose the FS-RNN as a novel RNN architecture.\n\u2022 We improve state of the art results on the Penn Treebank and Hutter Prize Wikipedia data sets.\n\u2022 We surpass the BPC performance of the best known text compression algorithm evaluated\non Hutter Prize Wikipedia by using an ensemble of two FS-RNNs.\n\u2022 We show empirically that the FS-RNN incorporates strengths of both multiscale RNNs and\ndeep transition RNNs, as it stores long-term dependencies efficiently and it adapts quickly\nto unexpected input.\n\u2022 We provide our code at the following URL: https://github.com/amujika/Fast-Slow-LSTM.\n\n2 Related work\n\nIn the following, we review the work that relates to our approach in more detail. 
First, we focus\non deep transition RNNs and multiscale RNNs since these two architectures are the main sources\nof inspiration for the FS-RNN architecture. Then, we discuss how our approach differs from these\ntwo architectures. Finally, we review other approaches that address the issue of learning long-term\ndependencies when processing sequential data.\nPascanu et al. [31] investigated how an RNN can be converted into a deep RNN. In standard RNNs,\nthe transition function from one hidden state to the next is shallow, that is, the function can be\nwritten as one linear transformation composed with a pointwise non-linearity. The authors added\nintermediate layers to increase the depth of the transition function, and they found empirically that\nsuch deeper architectures boost performance. Since deeper architectures are more difficult to train,\nthey equipped the network with skip connections, which give rise to shorter gradient paths (DT(S)-RNN,\nsee [31]). Following a similar line of research, Zilly et al. [43] further increased the transition depth\nbetween two consecutive hidden states. They used highway layers [38] to address the issue of training\ndeep architectures. The resulting RHN [43] achieved state of the art results on the Penn Treebank and\nHutter Prize Wikipedia data sets. Furthermore, a vague similarity to deep transition networks can be\nseen in adaptive computation time [12], where an LSTM cell learns how many times it should update its\nstate after receiving the input to produce the next output.\nMultiscale RNNs are obtained by stacking multiple RNNs with decreasing order of update frequencies\non top of each other. Early attempts proposed such architectures for sequential data compression\n[35], where the higher layer is only updated in case of prediction errors of the lower layer, and for\nsequence classification [9], where the higher layers are updated with a fixed smaller frequency. More\nrecently, Koutn\u00edk et al. 
[25] proposed the Clockwork RNN, in which the hidden units are divided into\nseveral modules, of which the i-th module is only updated every $2^i$-th time-step. General advantages\nof this multiscale RNN architecture are improved computational efficiency, efficient propagation\nof long-term dependencies and flexibility in allocating resources (units) to the hierarchical layers.\n\nFigure 1: Diagram of a Fast-Slow RNN with k Fast cells. Observe that only the second Fast cell\nreceives the input from the Slow cell.\n\nMultiscale RNNs have been applied for speech recognition in [3], where the slower operating RNN\npools information over time and the timescales are fixed hyperparameters as in Clockwork RNNs. In\n[36], multiscale RNNs are applied to make context-aware query suggestions. In this case, explicit\nhierarchical boundary information is provided. Chung et al. [6] presented a hierarchical multiscale\nRNN (HM-RNN) that discovers the latent hierarchical structure of the sequence without explicitly\ngiven boundary information. 
If a parametrized boundary detector indicates the end of a segment, then\na summarized representation of the segment is fed to the upper layer and the state of the lower layer\nis reset [6].\nOur FS-RNN architecture borrows elements from both deep transition RNNs and multiscale RNNs.\nThe lower hierarchical layer is a deep transition RNN that updates the hidden state several times per\ntime step, whereas the higher hierarchical layer updates the hidden state only once per time step.\nMany approaches aim at solving the problem of learning long-term dependencies in sequential data.\nA very popular one is to use external memory cells that can be accessed and modified by the network,\nsee Neural Turing Machines [13], Memory Networks [40] and the Differentiable Neural Computer [14].\nOther approaches focus on different optimization techniques rather than network architectures. One\nattempt is Hessian-free optimization [29], a second order training method that achieved good results\non RNNs. The use of different optimization techniques can improve learning in a wide range of RNN\narchitectures and therefore, the FS-RNN may also benefit from them.\n\n3 Fast-Slow RNN\n\nWe propose the FS-RNN architecture, see Figure 1. It consists of k sequentially connected RNN\ncells $F_1, \\ldots, F_k$ on the lower hierarchical layer and one RNN cell $S$ on the higher hierarchical layer.\nWe call $F_1, \\ldots, F_k$ the Fast cells, $S$ the Slow cell and the corresponding hierarchical layers the Fast\nand Slow layer, respectively. $S$ receives input from $F_1$ and feeds its state to $F_2$. $F_1$ receives the\nsequential input data $x_t$, and $F_k$ outputs the predicted probability distribution $y_t$ of the next element\nof the sequence.\nIntuitively, the Fast cells are able to learn complex transition functions from one time step to the\nnext one. The Slow cell gives rise to shorter gradient paths between sequential inputs that are distant\nin time, and thus, it facilitates the learning of long-term dependencies. 
Therefore, the FS-RNN\narchitecture incorporates advantages of deep transition RNNs and of multiscale RNNs, see Section 2.\nSince any kind of RNN cell can be used as a building block for the FS-RNN architecture, we state\nthe formal update rules of the FS-RNN for arbitrary RNN cells. We define an RNN cell $Q$ to be a\ndifferentiable function $f^Q(h, x)$ that maps a hidden state $h$ and an additional input $x$ to a new hidden\nstate. Note that $x$ can be input data or input from a cell in a higher or lower hierarchical layer. If a\ncell does not receive an additional input, then we will omit $x$. The following equations define the\nFS-RNN architecture for arbitrary RNN cells $F_1, \\ldots, F_k$ and $S$.\n\n$$h_t^{F_1} = f^{F_1}(h_{t-1}^{F_k}, x_t)$$\n$$h_t^{S} = f^{S}(h_{t-1}^{S}, h_t^{F_1})$$\n$$h_t^{F_2} = f^{F_2}(h_t^{F_1}, h_t^{S})$$\n$$h_t^{F_i} = f^{F_i}(h_t^{F_{i-1}}) \\quad \\text{for } 3 \\leq i \\leq k$$\n\nThe output $y_t$ is computed as an affine transformation of $h_t^{F_k}$. It is possible to extend the FS-RNN\narchitecture in order to further facilitate the learning of long-term dependencies by adding hierarchical\nlayers, each of which operates on a slower timescale than the ones below, resembling Clockwork\nRNNs [25]. However, for the tasks considered in Section 4, we observed that this led to overfitting\nthe training data even when applying regularization techniques and reduced the performance at test\ntime. Therefore, we will not further investigate this extension of the model in this paper, even though\nit might be beneficial for other tasks or larger data sets.\nIn the experiments in Section 4, we use LSTM cells as building blocks for the FS-RNN architecture.\nFor completeness, we state the update function $f^Q$ for an LSTM $Q$. The state of an LSTM is a pair\n$(h_t, c_t)$, consisting of the hidden state and the cell state. 
The function $f^Q$ maps the previous state and\ninput $(h_{t-1}, c_{t-1}, x_t)$ to the next state $(h_t, c_t)$ according to\n\n$$\\begin{pmatrix} f_t \\\\ i_t \\\\ o_t \\\\ g_t \\end{pmatrix} = W_h^Q h_{t-1} + W_x^Q x_t + b^Q$$\n$$c_t = \\sigma(f_t) \\odot c_{t-1} + \\sigma(i_t) \\odot \\tanh(g_t)$$\n$$h_t = \\sigma(o_t) \\odot \\tanh(c_t),$$\n\nwhere $f_t$, $i_t$ and $o_t$ are commonly referred to as forget, input and output gates, and $g_t$ are the new\ncandidate cell states. Moreover, $W_h^Q$, $W_x^Q$ and $b^Q$ are the learnable parameters, $\\sigma$ denotes the sigmoid\nfunction, and $\\odot$ denotes the element-wise multiplication.\n\n4 Experiments\n\nFor the experiments, we consider the Fast-Slow LSTM (FS-LSTM), which is an FS-RNN in which each\nRNN cell is an LSTM cell. The FS-LSTM is evaluated on two character level language modeling data\nsets, namely Penn Treebank and Hutter Prize Wikipedia, which will be referred to as enwik8 in this\nsection. The task consists of predicting the probability distribution of the next character given all the\nprevious ones. In Section 4.1, we compare the performance of the FS-LSTM with other approaches.\nIn Section 4.2, we empirically compare the network dynamics of different RNN architectures and\nshow that the FS-LSTM combines the benefits of both deep transition RNNs and multiscale RNNs.\n\n4.1 Performance on Penn Treebank and Hutter Prize Wikipedia\n\nThe FS-LSTM achieves 1.19 BPC and 1.25 BPC on the Penn Treebank and enwik8 data sets,\nrespectively. These results are compared to other approaches in Table 1 and Table 2 (the baseline\nLSTM results without citations are taken from [44] for Penn Treebank and from [15] for enwik8).\nFor the Penn Treebank, the FS-LSTM outperforms all previous approaches with significantly fewer\nparameters than the previous top approaches. We did not observe any improvement when increasing\nthe model size, probably due to overfitting. In the enwik8 data set, the FS-LSTM surpasses all other\nneural approaches. 
Following [13], we compare the results with text compression algorithms using\nthe BPC measure. An ensemble of two FS-LSTM models (1.20 BPC) outperforms cmix (1.23 BPC)\n[24], the current best text compression algorithm on enwik8 [27]. However, a fair comparison is\ndifficult. Compression algorithms are usually evaluated by the final size of the compressed data set\nincluding the decompressor size. For character prediction models, the network size is usually not\ntaken into account and the performance is measured on the test set. We remark that as the FS-LSTM is\nevaluated on the test set, it should achieve similar performance on any part of the English Wikipedia.\n\nTable 1: BPC on Penn Treebank\n\nModel                   BPC    Param Count\nZoneout LSTM [2]        1.27   -\n2-Layers LSTM           1.243  6.6M\nHM-LSTM [6]             1.24   -\nHyperLSTM - small [15]  1.233  5.1M\nHyperLSTM [15]          1.219  14.4M\nNASCell - small [44]    1.228  6.6M\nNASCell [44]            1.214  16.3M\nFS-LSTM-2 (ours)        1.190  7.2M\nFS-LSTM-4 (ours)        1.193  6.5M\n\nThe FS-LSTM-2 and FS-LSTM-4 models consist of two and four cells in the Fast layer, respectively.\nOn enwik8 (Table 2), the FS-LSTM-4 model outperforms the FS-LSTM-2 model, but its processing time for one time step\nis 25% higher than that of the FS-LSTM-2. Adding more cells to the Fast layer could further\nimprove the performance as observed for RHN [43], but would increase the processing time, because\nthe cell states are computed sequentially. Therefore, we did not further increase the number of Fast\ncells.\nThe model is trained to minimize the cross-entropy loss between the predictions and the training\ndata. Formally, the loss function is defined as $L = -\\frac{1}{n} \\sum_{i=1}^{n} \\log p_\\theta(x_i | x_1, \\ldots, x_{i-1})$, where\n$p_\\theta(x_i | x_1, \\ldots, x_{i-1})$ is the probability that a model with parameters $\\theta$ assigns to the next character\n$x_i$ given all the previous ones. The model is evaluated by the BPC measure, which uses the binary\nlogarithm instead of the natural logarithm in the loss function. All the hyperparameters used for the\nexperiments are summarized in Table 3. We regularize the FS-LSTM with dropout [37]. In each\ntime step, a different dropout mask is applied for the non-recurrent connections [42], and Zoneout\n[2] is applied for the recurrent connections. The network is trained with minibatch gradient descent\nusing the Adam optimizer [23]. If the gradient norm exceeds 1, the gradients are rescaled to norm 1.\nTruncated backpropagation through time (TBPTT) [34, 10] is used to approximate the gradients,\nand the final hidden state is passed to the next sequence. The learning rate is divided by a factor of 10\nfor the last 20 epochs in the Penn Treebank experiments, and it is divided by a factor of 10 whenever\nthe validation error does not improve in two consecutive epochs in the enwik8 experiments. The\nforget bias of every LSTM cell is initialized to 1, and all weight matrices are initialized to orthogonal\nmatrices. Layer normalization [1] is applied to the cell and to each gate separately. The network with\nthe smallest validation error is evaluated on the test set. The two data sets that we use for evaluation\nare:\n\nPenn Treebank [28] The dataset is a collection of Wall Street Journal articles written in English.\nIt only contains 10000 different words, all written in lower-case, and rare words are replaced with\n\"< unk >\". Following [30], we split the data set into train, validation and test sets consisting of\n5.1M, 400K and 450K characters, respectively.\n\nHutter Prize Wikipedia [19] This dataset is also known as enwik8 and it consists of \"raw\"\nWikipedia data, that is, English articles, tables, XML data, hyperlinks and special characters. The\ndata set contains 100M characters with 205 unique tokens. 
Following [7], we split the data set into\ntrain, validation and test sets consisting of 90M, 5M and 5M characters, respectively.\n\nTable 2: BPC on enwik8\n\nModel                          BPC    Param Count\nLSTM, 2000 units               1.461  18M\nLayer Norm LSTM, 1800 units    1.402  14M\nHyperLSTM [15]                 1.340  27M\nHM-LSTM [6]                    1.32   35M\nSurprisal-driven Zoneout [33]  1.31   64M\nByteNet [22]                   1.31   -\nRHN - depth 5 [43]             1.31   23M\nRHN - depth 10 [43]            1.30   21M\nLarge RHN - depth 10 [43]      1.27   46M\nFS-LSTM-2 (ours)               1.290  27M\nFS-LSTM-4 (ours)               1.277  27M\nLarge FS-LSTM-4 (ours)         1.245  47M\n2 \u00d7 Large FS-LSTM-4 (ours)     1.198  2 \u00d7 47M\ncmix v13 [24]                  1.225  -\n\nTable 3: Hyperparameters for the character-level language model experiments.\n\n                       Penn Treebank         enwik8\n                       FS-LSTM-2  FS-LSTM-4  FS-LSTM-2  FS-LSTM-4  Large FS-LSTM-4\nNon-recurrent dropout  0.35       0.35       0.2        0.2        0.25\nCell zoneout           0.5        0.5        0.3        0.3        0.3\nHidden zoneout         0.1        0.1        0.05       0.05       0.05\nFast cell size         700        500        900        730        1200\nSlow cell size         400        400        1500       1500       1500\nTBPTT length           150        150        150        150        100\nMinibatch size         128        128        128        128        128\nInput embedding size   128        128        256        256        256\nInitial learning rate  0.002      0.002      0.001      0.001      0.001\nEpochs                 200        200        35         35         50\n\n4.2 Comparison of network dynamics of different architectures\n\nWe compare the FS-LSTM architecture with the stacked-LSTM and the sequential-LSTM architectures,\ndepicted in Figure 2, by investigating the network dynamics. In order to conduct a fair\ncomparison, we chose the number of parameters to roughly be the same for all three models. The\nFS-LSTM consists of one Slow and four Fast LSTM cells of 450 units each. The stacked-LSTM\nconsists of five LSTM cells of 375 units each, stacked on top of each other, which will be\nreferred to as Stacked-1, ..., Stacked-5, from bottom to top. The sequential-LSTM consists of five\nsequentially connected LSTM cells of 500 units each. 
All three models require roughly the same time\nto process one time step. The models are trained on enwik8 for 20 epochs with minibatch gradient\ndescent using the Adam optimizer [23] without any regularization, but layer normalization [1] is\napplied on the cell states of the LSTMs. The hyperparameters are not optimized for any of the three\nmodels. We repeat each experiment 5 times and report the mean and standard deviation.\nThe experiments suggest that the FS-LSTM architecture favors the learning of long-term dependencies\n(Figure 3), enforces hidden cell states to change at different rates (Figure 4) and facilitates a quick\nadaptation to unexpected inputs (Figure 5). Moreover, the FS-LSTM achieves a mean performance\nof 1.49 BPC with a standard deviation of 0.007 BPC and outperforms the stacked-LSTM (mean 1.60\nBPC, standard deviation 0.022 BPC) and the sequential-LSTM (mean 1.58 BPC, standard deviation\n0.008 BPC).\n\nFigure 2: Diagram of (a) stacked-LSTM and (b) sequential-LSTM with 5 cells each.\n\nFigure 3: Long-term effect of the cell states on the loss function. The average value of $\\left\\| \\frac{\\partial L_t}{\\partial c_{t-k}} \\right\\|$,\nwhich is the effect of the cell state at time $t - k$ on the loss function at time $t$, is plotted against k for\nthe different layers in the three RNN architectures. The shaded area shows the standard deviation.\nFor the sequential-LSTM only the first cell is considered.\n\nIn Figure 3, we assess the ability to capture long-term dependencies by investigating the effect of\nthe cell state on the loss at later time points, following [2]. We measure the effect of the cell state\nat time $t - k$ on the loss at time $t$ by the gradient norm $\\left\\| \\frac{\\partial L_t}{\\partial c_{t-k}} \\right\\|$. This gradient is the largest for the Slow\nLSTM, and it is small and steeply decaying as k increases for the Fast LSTM. Evidently, the Slow\ncell captures long-term dependencies, whereas the Fast cell only stores short-term information. In the\nstacked-LSTM, the gradients decrease from the top layer to the bottom layer, which can be explained\nby the vanishing gradient problem. The small, steeply decaying gradients of the sequential-LSTM\nindicate that it is less capable of learning long-term dependencies than the other two models.\nFigure 4 gives further evidence that the FS-LSTM stores long-term dependencies efficiently in the\nSlow LSTM cell. It shows that among all the layers of the three RNN architectures, the cell states of\nthe Slow LSTM change the least from one time step to the next. The highest change is observed for\nthe cells of the sequential model followed by the Fast LSTM cells.\nIn Figure 5, we investigate whether the FS-LSTM quickly adapts to unexpected characters, that is,\nwhether it performs well on the subsequent ones. In text modeling, the initial character of a word\nhas the highest entropy, whereas later characters in a word are usually less ambiguous [10]. Since\nthe first character of a word is the most difficult one to predict, the performance at the following\npositions should reflect the ability to adapt to unexpected inputs. While the prediction qualities at\nthe first position are rather close for all three models, the FS-LSTM outperforms the stacked-LSTM\nand sequential-LSTM significantly on subsequent positions. 
It is possible that new information is\nincorporated quickly in the Fast layer, because it only stores short-term information, see Figure 3.\n\nFigure 4: Rate of change of the cell states from one time step to the next. We plot $\\frac{1}{n} \\sum_{i=1}^{n} (c_{t,i} - c_{t-1,i})^2$\naveraged over all time steps, where $c_{t,i}$ is the value of the i-th unit at time step $t$, for the\ndifferent layers of the three RNN architectures. The error bars show the standard deviation. For the\nsequential-LSTM only the first cell is considered.\n\nFigure 5: Bits-per-character at each character position. The left panel shows the average\nbits-per-character at each character position in the test set. The right panel shows the average relative loss\nwith respect to the stacked-LSTM at each character position. The shaded area shows the standard\ndeviation. For this Figure, a word is considered to be a sequence of lower-case letters of length at\nleast 2 between two spaces.\n\n5 Conclusion\n\nIn this paper, we have proposed the FS-RNN architecture. To our knowledge, it is the first\narchitecture that incorporates ideas of both multiscale and deep transition RNNs. The FS-RNN\narchitecture improved state of the art results on character level language modeling evaluated on\nthe Penn Treebank and Hutter Prize Wikipedia data sets. An ensemble of two FS-RNNs achieves\nbetter BPC performance than the best known compression algorithm. Further experiments provided\nevidence that the Slow cell enables the network to learn long-term dependencies, while the Fast cells\nenable the network to quickly adapt to unexpected inputs and learn complex transition functions from\none time step to the next.\nOur FS-RNN architecture provides a general framework for connecting RNN cells as any type of\nRNN cell can be used as a building block. 
Thus, there is a lot of flexibility in applying the architecture\nto different tasks. For instance, using RNN cells with good long-term memory, like EURNNs [21]\nor NARX RNNs [26, 8], for the Slow cell might boost the long-term memory of the FS-RNN\narchitecture. Therefore, the FS-RNN architecture might improve performance in many different\napplications.\n\nAcknowledgments\n\nWe thank Julian Zilly for many helpful discussions.\n\nReferences\n[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.\n[2] David Krueger, Tegan Maharaj, J\u00e1nos Kram\u00e1r, et al. Zoneout: Regularizing RNNs by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.\n[3] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end attention-based large vocabulary speech recognition. Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference, 2016.\n[4] Yoshua Bengio et al. Learning deep architectures for AI. Foundations and Trends\u00ae in Machine Learning, 2(1):1\u2013127, 2009.\n[5] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157\u2013166, 1994.\n[6] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.\n[7] Junyoung Chung, Caglar G\u00fcl\u00e7ehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In ICML, pages 2067\u20132075, 2015.\n[8] Robert DiPietro, Nassir Navab, and Gregory D. Hager. 
Revisiting NARX recurrent neural networks for long-term dependencies, 2017.\n[9] Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In NIPS, volume 409, 1995.\n[10] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14:179\u2013211, 1990.\n[11] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.\n[12] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.\n[13] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.\n[14] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwi\u0144ska, Sergio G\u00f3mez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471\u2013476, 2016.\n[15] David Ha, Andrew Dai, and Quoc V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.\n[16] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f\u00fcr Informatik, Lehrstuhl Prof. Brauer, Technische Universit\u00e4t M\u00fcnchen, 1991.\n[17] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107\u2013116, 1998.\n[18] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n[19] Marcus Hutter. The human knowledge compression contest. http://prize.hutter1.net, 2012.\n[20] Herbert Jaeger. Discovering multiscale dynamical features with hierarchical echo state networks. 
Technical report, Jacobs University Bremen, 2007.\n[21] Li Jing, Yichen Shen, Tena Dub\u010dek, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark, and Marin Solja\u010di\u0107. Tunable efficient unitary neural networks (EUNN) and their application to RNNs, 2016.\n[22] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.\n[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n[24] Byron Knoll. Cmix. http://www.byronknoll.com/cmix.html, 2017. Accessed: 2017-05-18.\n[25] Jan Koutn\u00edk, Klaus Greff, Faustino Gomez, and J\u00fcrgen Schmidhuber. A clockwork RNN. arXiv preprint arXiv:1402.3511, 2014.\n[26] Tsungnan Lin, Bill G Horne, Peter Tino, and C Lee Giles. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329\u20131338, 1996.\n[27] Matt Mahoney. Large text compression benchmark. http://mattmahoney.net/dc/text.html, 2017. Accessed: 2017-05-18.\n[28] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist., 19(2):313\u2013330, June 1993.\n[29] James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033\u20131040, 2011.\n[30] Tom\u00e1\u0161 Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, Stefan Kombrink, and Jan \u010cernock\u00fd. Subword language modeling with neural networks. preprint: http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf, 2012.\n[31] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. 
arXiv preprint arXiv:1312.6026, 2013.\n\n[32] AJ Robinson and Frank Fallside. The utility driven dynamic error propagation network. University of\n\nCambridge Department of Engineering, 1987.\n\n[33] Kamil Rocki, Tomasz Kornuta, and Tegan Maharaj. Surprisal-driven zoneout. arXiv preprint\n\narXiv:1610.07675, 2016.\n\n[34] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-\n\npropagating errors. Cognitive modeling, 5(3):1, 1988.\n\n[35] J\u00fcrgen Schmidhuber. Learning complex, extended sequences using the principle of history compression.\n\nNeural Computation, 4(2):234\u2013242, 1992.\n\n[36] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-\nYun Nie. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In\nProceedings of the 24th ACM International on Conference on Information and Knowledge Management,\npages 553\u2013562. ACM, 2015.\n\n[37] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout:\nA simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929\u20131958, January\n2014.\n\n[38] Rupesh Kumar Srivastava, Klaus Greff, and J\u00fcrgen Schmidhuber. Highway networks. arXiv preprint\n\narXiv:1505.00387, 2015.\n\n[39] Paul J Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural\n\nnetworks, 1(4):339\u2013356, 1988.\n\n[40] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916,\n\n2014.\n\n[41] Ronald J Williams. Complexity of exact gradient computation algorithms for recurrent neural networks.\nTechnical Report NU-CCS-89-27, Boston: Northeastern University,\nCollege of Computer Science, 1989.\n\n[42] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. 
arXiv\n\npreprint arXiv:1409.2329, 2014.\n\n[43] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutn\u00edk, and J\u00fcrgen Schmidhuber. Recurrent highway\n\nnetworks. arXiv preprint arXiv:1607.03474, 2016.\n\n[44] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint\n\narXiv:1611.01578, 2016.", "award": [], "sourceid": 3020, "authors": [{"given_name": "Asier", "family_name": "Mujika", "institution": "ETH Z\u00fcrich"}, {"given_name": "Florian", "family_name": "Meier", "institution": "ETH Zurich"}, {"given_name": "Angelika", "family_name": "Steger", "institution": "ETH Zurich"}]}