{"title": "Architectural Complexity Measures of Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1822, "page_last": 1830, "abstract": "In this paper, we systematically analyze the connecting architectures of recurrent neural networks (RNNs). Our main contribution is twofold: first, we present a rigorous graph-theoretic framework describing the connecting architectures of RNNs in general. Second, we propose three architecture complexity measures of RNNs: (a) the recurrent depth, which captures the RNN\u2019s over-time nonlinear complexity, (b) the feedforward depth, which captures the local input-output nonlinearity (similar to the \u201cdepth\u201d in feedforward neural networks (FNNs)), and (c) the recurrent skip coefficient which captures how rapidly the information propagates over time. We rigorously prove each measure\u2019s existence and computability. Our experimental results show that RNNs might benefit from larger recurrent depth and feedforward depth. We further demonstrate that increasing recurrent skip coefficient offers performance boosts on long term dependency problems.", "full_text": "Architectural Complexity Measures of\n\nRecurrent Neural Networks\n\nSaizheng Zhang1,\u2217, Yuhuai Wu2,\u2217, Tong Che4, Zhouhan Lin1,\n\nRoland Memisevic1,5, Ruslan Salakhutdinov3,5 and Yoshua Bengio1,5\n\n1MILA, Universit\u00e9 de Montr\u00e9al, 2University of Toronto, 3Carnegie Mellon University,\n\n4Institut des Hautes \u00c9tudes Scienti\ufb01ques, France, 5CIFAR\n\nAbstract\n\nIn this paper, we systematically analyze the connecting architectures of recurrent\nneural networks (RNNs). Our main contribution is twofold: \ufb01rst, we present a\nrigorous graph-theoretic framework describing the connecting architectures of\nRNNs in general. 
Second, we propose three architecture complexity measures of\nRNNs: (a) the recurrent depth, which captures the RNN\u2019s over-time nonlinear\ncomplexity, (b) the feedforward depth, which captures the local input-output non-\nlinearity (similar to the \u201cdepth\u201d in feedforward neural networks (FNNs)), and (c)\nthe recurrent skip coef\ufb01cient which captures how rapidly the information propa-\ngates over time. We rigorously prove each measure\u2019s existence and computability.\nOur experimental results show that RNNs might bene\ufb01t from larger recurrent depth\nand feedforward depth. We further demonstrate that increasing recurrent skip\ncoef\ufb01cient offers performance boosts on long term dependency problems.\n\nIntroduction\n\n1\nRecurrent neural networks (RNNs) have been shown to achieve promising results on many dif\ufb01cult\nsequential learning problems [1, 2, 3, 4, 5]. There is also much work attempting to reveal the\nprinciples behind the challenges and successes of RNNs, including optimization issues [6, 7], gradient\nvanishing/exploding related problems [8, 9], analysing/designing new RNN transition functional\nunits like LSTMs, GRUs and their variants [10, 11, 12, 13].\nThis paper focuses on another important theoretical aspect of RNNs: the connecting architecture.\nEver since [14, 15] introduced different forms of \u201cstacked RNNs\u201d, researchers have taken architecture\ndesign for granted and have paid less attention to the exploration of other connecting architectures.\nSome examples include [16, 1, 17] who explored the use of skip connections; [18] who pointed out\nthe distinction of constructing a \u201cdeep\u201d RNN from the view of the recurrent paths and the view of the\ninput-to-hidden and hidden-to-output maps. However, they did not rigorously formalize the notion\nof \u201cdepth\u201d and its implications in \u201cdeep\u201d RNNs. 
Besides \u201cdeep\u201d RNNs, there still remains a vastly\nunexplored \ufb01eld of connecting architectures. We argue that one barrier for better understanding the\narchitectural complexity is the lack of a general de\ufb01nition of the connecting architecture. This forced\nprevious researchers to mostly consider the simple cases while neglecting other possible connecting\nvariations. Another barrier is the lack of quantitative measurements of the complexity of different\nRNN connecting architectures: even the concept of \u201cdepth\u201d is not clear with current RNNs.\nIn this paper, we try to address these two barriers. We \ufb01rst introduce a general formulation of\nRNN connecting architectures, using a well-de\ufb01ned graph representation. Observing that the RNN\nundergoes multiple transformations not only feedforwardly (from input to output within a time step)\nbut also recurrently (across multiple time steps), we carry out a quantitative analysis of the number of\ntransformations in these two orthogonal directions, which results in the de\ufb01nitions of recurrent depth\n\n\u2217Equal contribution.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fand feedforward depth. These two depths can be viewed as general extensions of the work of [18]. We\nalso explore a quantity called the recurrent skip coef\ufb01cient which measures how quickly information\npropagates over time. This quantity is strongly related to vanishing/exploding gradient issues, and\nhelps deal with long term dependency problems. Skip connections crossing different timescales have\nalso been studied by [19, 15, 20, 21]. 
Rather than designing specific architectures, we focus on analyzing the graph-theoretic properties of recurrent skip coefficients, revealing the fundamental difference between regular skip connections and the ones that truly increase the recurrent skip coefficient. We rigorously prove each measure's existence and computability under the general framework.\nWe empirically evaluate models with different recurrent/feedforward depths and recurrent skip coefficients on various sequential modelling tasks, and our experimental results further validate the usefulness of the proposed definitions.\n\n2 General Formulations of RNN Connecting Architectures\nRNNs are learning machines that recursively compute new states by applying transition functions to previous states and inputs. An RNN's connecting architecture describes how information flows between different nodes. In this section, we formalize the concept of the connecting architecture by extending the traditional graph-based illustration to a more general definition with a finite directed multigraph and its unfolded version. Let us first define the notion of the RNN cyclic graph Gc, which can be viewed as a cyclic graphical representation of RNNs. We attach \u201cweights\u201d to the edges in the cyclic graph Gc that represent time delay differences between the source and destination nodes in the unfolded graph.\nDefinition 2.1. Let Gc = (Vc, Ec) be a weighted directed multigraph2, in which Vc = Vin \u222a Vout \u222a Vhid is a finite nonempty set of nodes and Ec \u2282 Vc \u00d7 Vc \u00d7 Z is a finite set of directed edges. Each e = (u, v, \u03c3) \u2208 Ec denotes a directed weighted edge pointing from node u to node v with an integer weight \u03c3. Each node v \u2208 Vc is labelled by an integer tuple (i, p). 
i \u2208 {0, 1, \u00b7\u00b7\u00b7 , m \u2212 1} denotes the time index of the given node, where m is the period number of the RNN, and p \u2208 S, where S is a finite set of node labels. We call the weighted directed multigraph Gc = (Vc, Ec) an RNN cyclic graph if: (1) for every edge e = (u, v, \u03c3) \u2208 Ec, letting iu and iv denote the time indices of nodes u and v, we have \u03c3 = iv \u2212 iu + k \u00b7 m for some k \u2208 Z; (2) there exists at least one directed cycle3 in Gc; (3) for any closed walk \u03c9, the sum of all the \u03c3 along \u03c9 is not zero.\n\nCondition (1) assures that we get a periodic graph (repeating pattern) when unfolding the RNN through time. Condition (2) excludes feedforward neural networks from the definition by requiring at least one cycle in the cyclic graph. Condition (3) simply avoids cycles after unfolding. The cyclic representation can be seen as a time-folded representation of RNNs, as shown in Figure 1(a).\nGiven an RNN cyclic graph Gc, we unfold Gc over time t \u2208 Z by the following procedure:\nDefinition 2.2 (Unfolding). Given an RNN cyclic graph Gc = (Vc, Ec), we define a new infinite set of nodes Vun = {(i + km, p) | (i, p) \u2208 Vc, k \u2208 Z}. The new set of edges Eun \u2282 Vun \u00d7 Vun is constructed as follows: ((t, p), (t\u2032, p\u2032)) \u2208 Eun if and only if there is an edge e = ((i, p), (i\u2032, p\u2032), \u03c3) \u2208 Ec such that t\u2032 \u2212 t = \u03c3 and t \u2261 i (mod m). The new directed graph Gun = (Vun, Eun) is called the unfolding of Gc. Any infinite directed graph that can be constructed from an RNN cyclic graph through unfolding is called an RNN unfolded graph.\nLemma 2.1. The unfolding Gun of any RNN cyclic graph Gc is a directed acyclic graph (DAG).\nFigure 1(a) shows an example of the two graph representations Gun and Gc of a given RNN. Consider the edge from node (1, 7) going to node (0, 3) in Gc. 
The fact that it has weight 1 indicates that the corresponding edge in Gun travels one time step, ((t + 1, 7), (t + 2, 3)). Note that node (0, 3) also has a loop with weight 2. This loop corresponds to the edge ((t, 3), (t + 2, 3)). The two kinds of graph representations we presented above have a one-to-one correspondence. Also, any graph structure \u03b8 on Gun is naturally mapped into a graph structure \u00af\u03b8 on Gc. Given an edge tuple \u00afe = (u, v, \u03c3) in Gc, \u03c3 stands for the number of time steps crossed by \u00afe's covering edges in Eun, i.e., every corresponding edge e \u2208 Gun must go from some time index t to t + \u03c3. Hence \u03c3 corresponds to the \u201ctime delay\u201d associated with e. In addition, the period number m in Definition 2.1 can be interpreted as the time length of the entire non-repeated recurrent structure in the unfolded RNN graph Gun. In other words, shifting Gun through time by km time steps results in a DAG which is identical to Gun, and m is the smallest number with this property for Gun.\n\n2A directed multigraph is a directed graph that allows multiple directed edges connecting two nodes.\n3A directed cycle is a closed walk with no repetitions of edges.\n\nFigure 1: (a) An example of an RNN's Gc and Gun. Vin is denoted by square, Vhid by circle and Vout by diamond. In Gc, the number on each edge is its corresponding \u03c3. The longest path is colored in red. The longest input-output path is colored in yellow and the shortest path is colored in blue. The values of the three measures are dr = 3/2, df = 7/2 and s = 2. (b) 5 more examples: (1) and (2) have dr = 2, 3/2, (3) has df = 5, and (4) and (5) have s = 2, 3/2. 
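Definitions 2.1 and 2.2 can be made concrete in a few lines of code. The sketch below is ours, not the authors' (the toy edge list, node names and window size are illustrative assumptions): it encodes a small Gc as weighted edge tuples and unfolds it over a finite window, checking that the result is acyclic, as Lemma 2.1 states.

```python
# Illustrative sketch (not the paper's code) of Definitions 2.1/2.2.
# Nodes of Gc are (time index i, label p); edges carry an integer weight sigma.
# Toy Gc with period m = 1: input 'x', hidden 'h' with the usual one-step
# recurrence, output 'y'.
Gc_edges = [((0, 'x'), (0, 'h'), 0),   # input -> hidden, same time step
            ((0, 'h'), (0, 'h'), 1),   # hidden -> hidden, one-step delay
            ((0, 'h'), (0, 'y'), 0)]   # hidden -> output, same time step

def unfold(edges, m, t_max):
    """Edges of the unfolding Gun restricted to the window [0, t_max].

    Following Definition 2.2: ((t, p), (t + sigma, q)) is an edge of Gun
    whenever ((i, p), (j, q), sigma) is an edge of Gc and t == i (mod m).
    """
    gun = []
    for (i, p), (_j, q), sigma in edges:
        for t in range(t_max + 1):
            if t % m == i % m and 0 <= t + sigma <= t_max:
                gun.append(((t, p), (t + sigma, q)))
    return gun

gun = unfold(Gc_edges, m=1, t_max=3)
# Condition (3) of Definition 2.1 guarantees Gun is a DAG; here every
# unfolded edge stays within or moves forward in time.
assert all(u[0] <= v[0] for u, v in gun)
```

For a period-2 example like the one in Figure 1(a), one would instead list edges whose source time indices range over {0, 1} and set m = 2; the same routine applies unchanged.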
Most traditional RNNs have m = 1, while some special structures like hierarchical or clockwork RNNs [15, 21] have m > 1. For example, Figure 1(a) shows that the period number of this specific RNN is 2.\nThe connecting architecture describes how information flows among RNN units. For a node \u00afv \u2208 Vc in Gc, let In(\u00afv) denote the set of incoming nodes of \u00afv, In(\u00afv) = {\u00afu | (\u00afu, \u00afv) \u2208 Ec}. In the forward pass of the RNN, the transition function F\u00afv takes the outputs of the nodes in In(\u00afv) as inputs and computes a new output. For example, vanilla RNN units with different activation functions, LSTMs and GRUs can all be viewed as units with specific transition functions. We now give the general definition of an RNN:\nDefinition 2.3. An RNN is a tuple (Gc, Gun, {F\u00afv}\u00afv\u2208Vc), in which Gun = (Vun, Eun) is the unfolding of the RNN cyclic graph Gc, and {F\u00afv}\u00afv\u2208Vc is the set of transition functions. In the forward pass, for each hidden and output node v \u2208 Vun, the transition function F\u00afv takes the outputs of all incoming nodes of v as input to compute its output.\n\nAn RNN is homogeneous if all the hidden nodes share the same form of the transition function.\n\n3 Measures of Architectural Complexity\nIn this section, we develop different measures of RNNs' architectural complexity, focusing mostly on the graph-theoretic properties of RNNs. To analyze an RNN solely from its architectural aspect, we make the mild assumption that the RNN is homogeneous. We further assume the RNN to be unidirectional. For a bidirectional RNN, it is more natural to measure the complexities of its unidirectional components.\n\n3.1 Recurrent Depth\nUnlike feedforward models, where computations are done within one time frame, RNNs map inputs to outputs over multiple time steps. In some sense, an RNN undergoes transformations along both the feedforward and recurrent dimensions. 
This fact suggests that we should investigate its architectural complexity from these two different perspectives. We first consider the recurrent perspective.\nThe conventional definition of depth is the maximum number of nonlinear transformations from inputs to outputs. Observe that a directed path in an unfolded graph representation Gun corresponds to a sequence of nonlinear transformations. Given an unfolded RNN graph Gun, \u2200i, n \u2208 Z, let Di(n) be the length of the longest path from any node at starting time i to any node at time i + n. From the recurrent perspective, it is natural to investigate how Di(n) changes over time. Generally speaking, Di(n) increases as n increases for all i. Such an increase is caused by the recurrent structure of the RNN, which keeps adding new nonlinearities over time. Since Di(n) approaches \u221e as n approaches \u221e,4 to measure the complexity of Di(n) we consider its asymptotic behaviour, i.e., the limit of Di(n)/n as n \u2192 \u221e. Under a mild assumption, this limit exists. The following theorem proves the limit's computability and well-definedness:\nTheorem 3.2 (Recurrent Depth). Given an RNN and its two graph representations Gun and Gc, we denote by C(Gc) the set of directed cycles in Gc. For \u03d1 \u2208 C(Gc), let l(\u03d1) denote the length of \u03d1 and \u03c3s(\u03d1) denote the sum of edge weights \u03c3 along \u03d1. Under a mild assumption5,\n\ndr = lim_{n\u2192+\u221e} Di(n)/n = max_{\u03d1\u2208C(Gc)} l(\u03d1)/\u03c3s(\u03d1).   (1)\n\n4Without loss of generality, we assume the unidirectional RNN approaches positive infinity.\n\nMore intuitively, dr is a measure of the average maximum number of nonlinear transformations per time step as n gets large. Thus, we call it the recurrent depth:\nDefinition 3.1 (Recurrent Depth). 
Given an RNN and its two graph representations Gun and Gc, we call dr, defined in Eq.(1), the recurrent depth of the RNN.\nIn Figure 1(a), one can easily verify that Dt(1) = 5, Dt(2) = 6, Dt(3) = 8, Dt(4) = 9, . . . Thus Dt(1)/1 = 5, Dt(2)/2 = 3, Dt(3)/3 = 8/3, Dt(4)/4 = 9/4, . . ., which eventually converges to 3/2 as n \u2192 \u221e. As n increases, most parts of the longest path coincide with the path colored in red. As a result, dr coincides with the number of nodes the red path goes through per time step. Similarly in Gc, observe that the red cycle achieves the maximum (3/2) in Eq.(1). Usually, one can directly calculate dr from Gun. It is easy to verify that simple RNNs and stacked RNNs share the same recurrent depth, which is equal to 1. This reveals the fact that their nonlinearities increase at the same rate, which suggests that they will behave similarly in the long run. This fact is often neglected, since one would typically consider the number of layers as a measure of depth, and think of stacked RNNs as \u201cdeep\u201d and simple RNNs as \u201cshallow\u201d, even though their discrepancies are not due to recurrent depth (which regards time) but due to feedforward depth, defined next.\n3.3 Feedforward Depth\nRecurrent depth does not fully characterize the nature of nonlinearity of an RNN. As previous work suggests [3], stacked RNNs do outperform shallow ones with the same hidden size on problems where a more immediate input and output process is modeled. This is not surprising, since the growth rate of Di(n) only captures the number of nonlinear transformations in the time direction, not in the feedforward direction. The perspective of feedforward computation puts more emphasis on the specific paths connecting inputs to outputs. Given an RNN unfolded graph Gun, let D\u2217i(n) be the length of the longest path from any input node at time step i to any output node at time step i + n. Clearly, when n is small, the recurrent depth cannot serve as a good description for D\u2217i(n). In fact, 
it heavily depends on another quantity, which we call feedforward depth. The following proposition guarantees the existence of such a quantity and demonstrates the role of both measures in quantifying the nonlinearity of an RNN.\nProposition 3.3.1 (Input-Output Length Least Upper Bound). Given an RNN with recurrent depth dr, we denote df = sup_{i,n\u2208Z} D\u2217i(n) \u2212 n \u00b7 dr. The supremum df exists, and thus we have the following upper bound for D\u2217i(n):\n\nD\u2217i(n) \u2264 n \u00b7 dr + df .\n\nThe above upper bound explicitly shows the interplay between recurrent depth and feedforward depth: when n is small, D\u2217i(n) is largely bounded by df ; when n is large, dr captures the nature of the bound (\u2248 n \u00b7 dr). These two measures are equally important, as they separately capture the maximum number of nonlinear transformations of an RNN in the long run and in the short run.\nDefinition 3.2 (Feedforward Depth). Given an RNN with recurrent depth dr and its two graph representations Gun and Gc, we call df , defined in Proposition 3.3.1, the feedforward depth6 of the RNN.\nThe following theorem proves df 's computability:\nTheorem 3.4 (Feedforward Depth). Given an RNN and its two graph representations Gun and Gc, we denote by \u03be(Gc) the set of directed paths that start at an input node and end at an output node in Gc. For \u03b3 \u2208 \u03be(Gc), denote by l(\u03b3) the length and by \u03c3s(\u03b3) the sum of \u03c3 along \u03b3. Then we have:\n\ndf = sup_{i,n\u2208Z} D\u2217i(n) \u2212 n \u00b7 dr = max_{\u03b3\u2208\u03be(Gc)} [l(\u03b3) \u2212 \u03c3s(\u03b3) \u00b7 dr],\n\nwhere m is the period number and dr is the recurrent depth of the RNN.\nFor example, in Figure 1(a), one can easily verify that df = D\u2217t(0) = 3. 
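The two depths can be checked numerically. The sketch below is our own illustration (not the authors' code; the "td"-style toy graph is an assumption made for the example): it estimates dr as D_0(n)/n on a finite unfolding via longest-path dynamic programming, and reads off df = max_n [D*_0(n) - n*dr], matching the closed forms in Theorem 3.2 and Theorem 3.4.

```python
from collections import defaultdict, deque

# Toy cyclic graph (period m = 1), td-like: two stacked hidden layers plus a
# top-down edge h2 -> h1.  The cycle h1 -> h2 -> h1 has length 2 and total
# weight 1, so dr = 2; the input-output path x -> h1 -> h2 -> y gives df = 3.
Gc = [('x', 'h1', 0), ('h1', 'h2', 0), ('h2', 'h1', 1),
      ('h1', 'h1', 1), ('h2', 'h2', 1), ('h2', 'y', 0)]
N = 50  # unfolding window [0, N]
edges = [((t, u), (t + w, v)) for u, v, w in Gc
         for t in range(N + 1) if t + w <= N]

def longest_from(sources):
    """Longest-path lengths (in edges) from `sources` over the unfolded DAG."""
    adj, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in edges:
        adj[u].append(v); indeg[v] += 1; nodes.update((u, v))
    dist = {v: (0 if v in sources else float('-inf')) for v in nodes}
    q = deque(v for v in nodes if indeg[v] == 0)
    while q:  # Kahn's topological order with edge relaxation
        u = q.popleft()
        for v in adj[u]:
            if dist[u] > float('-inf'):
                dist[v] = max(dist[v], dist[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    return dist

d = longest_from({n for e in edges for n in e if n[0] == 0})
D0N = max(val for (t, _p), val in d.items() if t == N)  # D_0(N)
dr_est = D0N / N           # approaches dr = 2 as N grows

d_io = longest_from({(0, 'x')})                         # input node at time 0
df_est = max(d_io[(n, 'y')] - n * 2 for n in range(N + 1))
assert round(dr_est) == 2 and df_est == 3
```

Running the same procedure on the period-2 graph of Figure 1(a) would recover the caption's fractional values dr = 3/2 and df = 7/2; fractional df arises because the bound subtracts n times a fractional dr.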
Most commonly, df is the same as D\u2217t(0), i.e., the maximum length of a path from an input to its current output.\n\n5See a full treatment of the limit in general cases in Theorem A.1 and Proposition A.1.1 in the Appendix.\n6Conventionally, an architecture with depth 1 is a three-layer architecture containing one hidden layer. But in our definition, since it goes through two transformations, we count the depth as 2 instead of 1. This should be particularly noted with the concept of feedforward depth, which can be thought of as the conventional depth plus 1.\n\n3.5 Recurrent Skip Coefficient\nDepth provides a measure of the complexity of the model, but such a measure is not sufficient to characterize behavior on long-term dependency tasks. In particular, since models with large recurrent depths have more nonlinearities through time, gradients can explode or vanish more easily. On the other hand, it is known that adding skip connections across multiple time steps may help improve the performance on long-term dependency problems [19, 20]. To measure such a \u201cskipping\u201d effect, we should instead pay attention to the length of the shortest path from time i to time i + n. In Gun, \u2200i, n \u2208 Z, let di(n) be the length of the shortest path. Similarly to the recurrent depth, we consider the growth rate of di(n).\nTheorem 3.6 (Recurrent Skip Coefficient). Given an RNN and its two graph representations Gun and Gc, under mild assumptions7,\n\nj = lim_{n\u2192+\u221e} di(n)/n = min_{\u03d1\u2208C(Gc)} l(\u03d1)/\u03c3s(\u03d1).   (2)\n\nSince it is often the case that j is smaller than or equal to 1, it is more intuitive to consider its reciprocal.\nDefinition 3.3 (Recurrent Skip Coefficient)8. Given an RNN and the corresponding Gun and Gc, we define s = 1/j, whose reciprocal is defined in Eq.(2), as the recurrent skip coefficient of the RNN.\nWith a larger recurrent skip coefficient, the number of transformations per time step is smaller. 
As a result, the nodes in the RNN are more capable of \u201cskipping\u201d across the network, allowing unimpeded information flow across multiple time steps and thus alleviating the problem of learning long-term dependencies. In particular, such an effect is more prominent in the long run, due to the network's recurrent structure. Also note that not all types of skip connections can increase the recurrent skip coefficient. We will consider specific examples in our experimental results section.\n4 Experiments and Results\nIn this section we conduct a series of experiments to investigate the following questions: (1) Is recurrent depth a trivial measure? (2) Can increasing depth yield performance improvements? (3) Can increasing the recurrent skip coefficient improve the performance on long-term dependency tasks? (4) Does the recurrent skip coefficient suggest something more than simply adding skip connections? We show our evaluations on both tanh RNNs and LSTMs.\n4.1 Tasks and Training Settings\nPennTreebank dataset: We evaluate our models on character-level language modelling using the PennTreebank dataset [22]. It contains 5059k characters for training, 396k for validation and 446k for test, and has an alphabet size of 50. We set each training sequence to have a length of 50. Quality of fit is evaluated by the bits-per-character (BPC) metric, which is log2 of perplexity.\ntext8 dataset: Another dataset used for character-level language modelling is the text8 dataset9, which contains 100M characters from Wikipedia with an alphabet size of 27. We follow the setting from [23], and each training sequence has a length of 180.\nadding problem: The adding problem (and the following copying memory problem) was introduced in [10]. 
For the adding problem, each input consists of two sequences of length T, where the first sequence contains numbers sampled from uniform[0, 1] and the second sequence is all zeros except for two elements, which indicate the positions of the two entries in the first sequence that should be summed together. The output is the sum. We follow the most recent results and experimental settings in [24] (same for copying memory).\ncopying memory problem: Each input sequence has length T + 20, where the first 10 values are random integers between 1 and 8. The model should remember them after T steps. The rest of the sequence is all zeros, except for the last 11 entries, which start with a 9 as a marker indicating that the model should begin to output its memorized values. The model is expected to give zero outputs at every time step except the last 10 entries, where it should generate (copy) the 10 values in the same order as it saw them at the beginning of the sequence. The goal is to minimize the average cross entropy of the category predictions at each time step.\n\n7See Proposition A.3.1 in the Appendix.\n8One would find this definition very similar to the definition of the recurrent depth. Therefore, we refer readers to the examples in Figure 1 for some illustrations.\n9http://mattmahoney.net/dc/textdata.\n\nFigure 2: Left: (a) The architectures for sh, st, bu and td, with their (dr, df ) equal to (1, 2), (1, 3), (1, 3) and (2, 3), respectively. The longest path in td is colored in red. (b) The 9 architectures denoted by their (df , dr) with dr = 1, 2, 3 and df = 2, 3, 4. In both (a) and (b), we only plot hidden states at two adjacent time steps and the connections between them (the period number is 1). Right: (a) Various architectures that we consider in Section 4.4. From top to bottom are the baseline s = 1, s = 2 and s = 3. (b) Proposed architectures that we consider in Section 4.5, where we take k = 3 as an example. 
The shortest paths in (a) and (b) that correspond to the recurrent skip coefficients are colored in blue.\n\nDATASET        MODELS\\ARCHS      sh     st     bu     td\nPENNTREEBANK   tanh RNN          1.54   1.59   1.54   1.49\nTEXT8          tanh RNN-SMALL    1.80   1.82   1.80   1.77\n               tanh RNN-LARGE    1.69   1.67   1.64   1.59\n               LSTM-SMALL        1.65   1.66   1.65   1.63\n               LSTM-LARGE        1.52   1.53   1.52   1.49\n\ndf\\dr    dr = 1   dr = 2   dr = 3\ndf = 2   1.88     1.86     1.86\ndf = 3   1.86     1.84     1.86\ndf = 4   1.86     1.85     1.88\n\nTable 1: Left: Test BPCs of sh, st, bu, td for tanh RNNs and LSTMs. Right: Test BPCs of tanh RNNs with recurrent depth dr = 1, 2, 3 and feedforward depth df = 2, 3, 4, respectively.\nsequential MNIST dataset: Each MNIST image is reshaped into a 784 \u00d7 1 sequence, turning the digit classification task into a sequence classification one with long-term dependencies [25, 24]. A slight modification of the dataset is to permute the image sequences by a fixed random order beforehand (permuted MNIST). Results in [25] have shown that both tanh RNNs and LSTMs did not achieve satisfying performance on it, which also highlights the difficulty of this task.\nFor all of our experiments we use Adam [26] for optimization, and conduct a grid search on the learning rate in {10\u22122, 10\u22123, 10\u22124, 10\u22125}. For tanh RNNs, the parameters are initialized with samples from a uniform distribution. For LSTM networks we adopt a similar initialization scheme, while the forget gate biases are chosen by grid search on {\u22125, \u22123, \u22121, 0, 1, 3, 5}. We employ early stopping, and the batch size was set to 50.\n4.2 Recurrent Depth is Non-trivial\nTo investigate the first question, we compare 4 similar connecting architectures: 1-layer (shallow) \u201csh\u201d, 2-layer stacked \u201cst\u201d, 2-layer stacked with an extra bottom-up connection \u201cbu\u201d, and 2-layer stacked with an extra top-down connection \u201ctd\u201d, as shown in Figure 2(a), left panel. 
Although the four architectures look quite similar, they have different recurrent depths: sh, st and bu have dr = 1, while td has dr = 2. Note that the specific construction of the extra nonlinear transformations in td is not conventional. Instead of simply adding intermediate layers in the hidden-to-hidden connection, as reported in [18], more nonlinearities are gained by a recurrent flow from the first layer to the second layer and then back to the first layer at each time step (see the red path in Figure 2(a), left panel).\nWe first evaluate our architectures using tanh RNNs on PennTreebank, where sh has a hidden-layer size of 1600. Next, we evaluate four different models for text8, namely tanh RNN-small, tanh RNN-large, LSTM-small and LSTM-large, where the models' sh architectures have hidden-layer sizes of 512, 2048, 512 and 1024, respectively. Given the architecture of the sh model, we set the remaining three architectures to have the same number of parameters. Table 1, left panel, shows that the td architecture outperforms all the other architectures for all the different models. Specifically, td in the tanh RNN achieves a test BPC of 1.49 on PennTreebank, which is comparable to the BPC of 1.48 reported in [27] using stabilization techniques. Similar improvements are shown for LSTMs, where the td architecture in LSTM-large achieves a BPC of 1.49 on text8, outperforming the BPC of 1.54 reported in [23] with the Multiplicative RNN (MRNN). It is also interesting to note the improvement we obtain when switching from bu to td. The only difference between these two architectures lies in changing the direction of one connection (see Figure 2(a)), which also increases the recurrent depth. 
Such a fundamental difference is by no means self-evident, but this result highlights the necessity of the concept of recurrent depth.\n4.3 Comparing Depths\nFrom the previous experiment, we found some evidence that performance might improve with larger recurrent depth. To further investigate various implications of the depths, we carry out a systematic analysis of both recurrent depth dr and feedforward depth df on the text8 and sequential MNIST datasets. We build 9 models in total with dr = 1, 2, 3 and df = 2, 3, 4, respectively (as shown in Figure 2(b)). We ensure that all the models have roughly the same number of parameters (e.g., the model with dr = 1 and df = 2 has a hidden-layer size of 360).\nTable 1, right panel, displays the results on the text8 dataset. We observed that when fixing feedforward depth df = 2, 3 (or fixing recurrent depth dr = 1, 2), increasing recurrent depth dr from 1 to 2 (or increasing feedforward depth df from 2 to 3) does improve the model performance. The best test BPC is achieved by the architecture with df = 3, dr = 2. This suggests that reasonably increasing dr and df can aid in better capturing the over-time nonlinearity of the input sequence. However, for too large a dr (or df ), like dr = 3 or df = 4, increasing df (or dr) only hurts model performance. This can potentially be attributed to optimization issues when modelling large input-to-output dependencies (see Appendix B.4 for more details). With the sequential MNIST dataset, we next examined the effects of df and dr when modelling long-term dependencies (more in Appendix B.4). In particular, we observed that increasing df does not bring any improvement to the model performance, and increasing dr might even be detrimental for training. Indeed, it appears that df only captures the local nonlinearity and has less effect on the long-term prediction. 
This result seems to contradict previous claims [17] that stacked RNNs (df > 1, dr = 1) could capture information at different time scales and would thus be more capable of dealing with learning long-term dependencies. On the other hand, a large dr indicates multiple transformations per time step, resulting in greater gradient vanishing/exploding issues [18], which suggests that dr should be neither too small nor too large.\n4.4 Recurrent Skip Coefficients\nTo investigate whether increasing the recurrent skip coefficient s improves model performance on long-term dependency tasks, we compare models with increasing s on the adding problem, the copying memory problem and the sequential MNIST problem (without/with permutation, denoted as MNIST and pMNIST). Our baseline model is the shallow architecture proposed in [25]. To increase the recurrent skip coefficient s, we add connections from time step t to time step t + k for some fixed integer k, as shown in Figure 2(a), right panel. With this specific construction, the recurrent skip coefficient increases from 1 (i.e., the baseline) to k, and the new model with the extra connection has 2 hidden matrices (one from t to t + 1 and the other from t to t + k).\nFor the adding problem, we follow the same setting as in [24]. We evaluate the baseline LSTM with 128 hidden units and an LSTM with s = 30 and 90 hidden units (roughly the same number of parameters as the baseline). The results are quite encouraging: as suggested in [24], the baseline LSTM works well for input sequence lengths T = 100, 200, 400 but fails when T = 750. On the other hand, we observe that the LSTM with s = 30 learns perfectly when T = 750, and even if we increase T to 1000, the LSTM with s = 30 still works well and the loss reaches zero.\nFor the copying memory problem, we use a single-layer RNN with 724 hidden units as our basic model, and one with 512 hidden units and skip connections. 
These two models thus have roughly the same number of parameters. Models with a higher recurrent skip coefficient outperform those without skip connections by a large margin. When T = 200, the basic model only reaches a test set cross entropy (CE) of 0.2409, whereas the model with s = 40 reaches a test set CE of 0.0975. When T = 300, a model with s = 30 yields a test set CE of 0.1328, while its baseline only reaches 0.2025. We varied the sequence length T and the recurrent skip coefficient s over a wide range (T from 100 up to 300, and s from 10 up to 50), and found that this kind of improvement persists.
For the sequential MNIST problem, the hidden-layer size of the baseline model is set to 90, and models with s > 1 have hidden-layer sizes of 64. The results in Table 2, top-left panel, show that tanh RNNs with a recurrent skip coefficient s larger than 1 improve the model performance dramatically. Within a reasonable range of s, test accuracy increases quickly as s becomes larger. We note that our model is the first tanh RNN model that achieves good performance on this task, even improving upon the method proposed in [25]. In addition, we also formally compare with the previous results reported in [25, 24], where our model (referred to as stanh) has a hidden-layer size of 95, which gives about the same number of parameters as the tanh model of [24].
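The skip construction used throughout these experiments (Section 4.4) amounts to a plain tanh RNN with a second recurrent weight matrix connecting the hidden state at step t to the hidden state at step t + k. A minimal numpy sketch follows; the function and variable names are illustrative, not taken from the authors' code, and gates (as in the LSTM variants) are omitted for clarity:

```python
import numpy as np

def skip_rnn_forward(x, W_x, U_1, U_k, b, k):
    """Forward pass of a tanh RNN whose state at step t receives input from
    both h[t-1] (through U_1) and h[t-k] (through U_k). Adding the second
    recurrent matrix raises the recurrent skip coefficient from 1 to k.

    x: (T, input_dim) input sequence; returns (T, hidden_dim) hidden states.
    """
    T = x.shape[0]
    n = U_1.shape[0]
    h = np.zeros((T + 1, n))            # h[0] is the (zero) initial state
    for t in range(1, T + 1):
        pre = W_x @ x[t - 1] + U_1 @ h[t - 1] + b
        if t - k >= 0:                  # skip connection from k steps back
            pre = pre + U_k @ h[t - k]
        h[t] = np.tanh(pre)
    return h[1:]
```

Note how the gradient path from h[t] back to h[t-k] crosses a single nonlinearity through U_k, rather than k of them through U_1, which is the intuition behind why larger s helps on long term dependency tasks.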
Table 2, bottom-left panel, shows that our simple architecture improves upon the uRNN by 2.6% on pMNIST, and achieves almost the same performance as the LSTM on the MNIST dataset with only 25% of its parameters [24]. Note that obtaining good performance on sequential MNIST requires a larger s than for pMNIST (see Appendix B.4 for more details). LSTMs also showed a performance boost and much faster convergence when using larger s, as displayed in Table 2, top-right panel. The LSTM with s = 3 already performs quite well on MNIST, and increasing s further did not result in any significant improvement, while on pMNIST the performance gradually improves as s increases from 4 to 6. We also observed that the LSTM network performed worse on permuted MNIST than a tanh RNN; a similar result was also reported in [25].

Table 2: Results for MNIST/pMNIST. Top-left: test accuracies with different s for the tanh RNN. Top-right: test accuracies with different s for the LSTM. Bottom-left: comparison with previous results. Bottom-right: test accuracies for architectures (1), (2), (3) and (4) for the tanh RNN.

Top-left (stanh):
          s = 1   s = 5   s = 9   s = 13  s = 21
MNIST     34.9    46.9    74.9    85.4    87.8
          s = 1   s = 3   s = 5   s = 7   s = 9
pMNIST    49.8    79.1    84.3    88.9    88.0

Top-right (LSTM):
          s = 1   s = 3   s = 5   s = 7   s = 9
MNIST     56.2    87.2    86.4    86.4    84.8
          s = 1   s = 3   s = 4   s = 5   s = 6
pMNIST    28.5    25.0    60.8    62.2    65.9

Bottom-left:
Model                MNIST    pMNIST
iRNN [25]            97.0     ≈82.0
uRNN [24]            95.1     91.4
LSTM [24]            98.2     88.0
RNN(tanh) [25]       ≈35.0    ≈35.0
stanh (s = 21, 11)   98.1     94.0

Bottom-right:
Architecture, s      (1), 1   (2), 1   (3), k/2   (4), k
MNIST    k = 17      39.5     39.4     54.2       77.8
         k = 21      39.5     39.9     74.7       71.8
pMNIST   k = 5       55.5     66.6     69.6       81.2
         k = 9       55.5     71.1     78.6       86.9

4.5 Recurrent Skip Coefficients vs. Skip Connections
We also investigated whether the recurrent skip coefficient can suggest something more than simply adding skip connections.
We design 4 specific architectures, shown in Figure 2(b), right panel. (1) is the baseline model with a 2-layer stacked architecture, while the other three models add extra skip connections in different ways. Note that these extra skip connections all cross the same time length k. In particular, (2) and (3) share quite similar architectures. However, the way in which the skip connections are allocated makes a big difference to their recurrent skip coefficients: (2) has s = 1, (3) has s = k/2 and (4) has s = k. Therefore, even though (2), (3) and (4) all add extra skip connections, the fact that their recurrent skip coefficients differ might result in different performance.
We evaluated these architectures on the sequential MNIST and pMNIST datasets. The results show that differences in s indeed cause big performance gaps, regardless of the fact that they all have skip connections (see Table 2, bottom-right panel). Given the same k, the model with a larger s performs better. In particular, model (3) is better than model (2) even though they differ only in the direction of the skip connections. It is interesting to see that for MNIST (unpermuted), the extra skip connection in model (2) (which does not actually increase the recurrent skip coefficient) brings almost no benefit, as models (1) and (2) have almost the same results. This observation highlights the following point: when addressing long term dependency problems with skip connections, instead of only considering the time interval crossed by the skip connection, one should also consider the model's recurrent skip coefficient, which can serve as a guide for introducing more powerful skip connections.

5 Conclusion
In this paper, we first introduced a general formulation of RNN architectures, which provides a solid framework for architectural complexity analysis.
We then proposed three architectural complexity measures: the recurrent depth, the feedforward depth, and the recurrent skip coefficient, capturing both short term and long term properties of RNNs. We also found empirical evidence that increasing recurrent depth and feedforward depth might yield performance improvements, that increasing feedforward depth might not help on long term dependency tasks, and that increasing the recurrent skip coefficient can largely improve performance on long term dependency tasks. These measures and results can provide guidance for the design of new recurrent architectures for particular learning tasks.

Acknowledgments
The authors acknowledge the following agencies for funding and support: NSERC, Canada Research Chairs, CIFAR, Calcul Quebec, Compute Canada, Samsung, ONR Grant N000141310721, ONR Grant N000141512791 and IARPA Raytheon BBN Contract No. D11PC20071. The authors thank the developers of Theano [28] and Keras [29], and also thank Nicolas Ballas, Tim Cooijmans, Ryan Lowe, Mohammad Pezeshki, Roger Grosse and Alex Schwing for their insightful comments.

References
[1] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[3] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[4] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
[5] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. In NIPS, 2015.
[6] James Martens and Ilya Sutskever.
Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033–1040, 2011.
[7] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, pages 1310–1318, 2013.
[8] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München, 1991.
[9] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.
[12] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[13] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2342–2350, 2015.
[14] Jürgen Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.
[15] Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems, pages 493–499, 1996.
[16] Tapani Raiko, Harri Valpola, and Yann LeCun.
Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, pages 924–932, 2012.
[17] Michiel Hermans and Benjamin Schrauwen. Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems, pages 190–198, 2013.
[18] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013.
[19] T. Lin, B. G. Horne, P. Tino, and C. L. Giles. Learning long-term dependencies is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338, November 1996.
[20] Ilya Sutskever and Geoffrey Hinton. Temporal-kernel recurrent neural networks. Neural Networks, 23(2):239–243, 2010.
[21] Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork RNN. In Proceedings of the 31st International Conference on Machine Learning, pages 1863–1871, 2014.
[22] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[23] Tomáš Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, and Stefan Kombrink. Subword language modeling with neural networks. Preprint, http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf, 2012.
[24] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. arXiv preprint arXiv:1511.06464, 2015.
[25] Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
[26] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] David Krueger and Roland Memisevic. Regularizing RNNs by stabilizing activations. arXiv preprint arXiv:1511.08400, 2015.
[28] The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
[29] François Chollet. Keras. GitHub repository: https://github.com/fchollet/keras, 2015.