{"title": "Non-normal Recurrent Neural Network (nnRNN): learning long time dependencies while improving expressivity with transient dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 13613, "page_last": 13623, "abstract": "A recent strategy to circumvent the exploding and vanishing gradient problem in RNNs, and to allow the stable propagation of signals over long time scales, is to constrain recurrent connectivity matrices to be orthogonal or unitary. This ensures eigenvalues with unit norm and thus stable dynamics and training. However this comes at the cost of reduced expressivity due to the limited variety of orthogonal transformations. We propose a novel connectivity structure based on the Schur decomposition and a splitting of the Schur form into normal and non-normal parts. \nThis allows to parametrize matrices with unit-norm eigenspectra without orthogonality constraints on eigenbases. The resulting architecture ensures access to a larger space of spectrally constrained matrices, of which orthogonal matrices are a subset. \nThis crucial difference retains the stability advantages and training speed of orthogonal RNNs while enhancing expressivity, especially on tasks that require computations over ongoing input sequences.", "full_text": "Non-normal Recurrent Neural Network (nnRNN):\nlearning long time dependencies while improving\n\nexpressivity with transient dynamics\n\nGiancarlo Kerg1,2, \u2217\n\nKyle Goyette1,2,3,\u2217\n\nMaximilian Puelma Touzel1,4\n\nGauthier Gidel1,2\n\nEugene Vorontsov1,5\n\nYoshua Bengio1,2,6\n\nGuillaume Lajoie1,7\n\nAbstract\n\nA recent strategy to circumvent the exploding and vanishing gradient problem in\nRNNs, and to allow the stable propagation of signals over long time scales, is to\nconstrain recurrent connectivity matrices to be orthogonal or unitary. This ensures\neigenvalues with unit norm and thus stable dynamics and training. 
However this\ncomes at the cost of reduced expressivity due to the limited variety of orthogonal\ntransformations. We propose a novel connectivity structure based on the Schur\ndecomposition, though we avoid computing it explicitly, and a splitting of the\nSchur form into normal and non-normal parts. This allows to parametrize matrices\nwith unit-norm eigenspectra without orthogonality constraints on eigenbases. The\nresulting architecture ensures access to a larger space of spectrally constrained\nmatrices, of which orthogonal matrices are a subset. This crucial difference retains\nthe stability advantages and training speed of orthogonal RNNs while enhancing\nexpressivity, especially on tasks that require computations over ongoing input\nsequences.\n\n1\n\nIntroduction\n\nTraining recurrent neural networks (RNN) to process temporal inputs over long timescales is notori-\nously dif\ufb01cult. A central factor is the exploding and vanishing gradient problem (EVGP) [13, 4, 27],\nwhich stems from the compounding effects of propagating signals over many iterates of recurrent\ninteractions. Several approaches have been developed to mitigate this issue, including the introduction\nof gating mechanisms (e.g. [15, 13]), purposely using non-saturating activation functions [5], and\nmanipulating the propagation path of gradients [3]. Another way is to constrain connectivity matrices\nto be orthogonal (and more generally, unitary) leading to a class of models we refer to as orthogo-\nnal RNNs [24, 21, 19, 33, 16, 32, 10, 2]. Orthogonal RNNs have eigen- and singular-spectra with\nunit norm, therefore helping to prevent exponential growth or decay in long products of Jacobians\nassociated with EVGP. They perform exceptionally well on tasks requiring memorization of inputs\nover long time-scales [11] (outperforming gated networks) but struggle on tasks involving continued\n\n\u2217Indicates \ufb01rst authors. 
Ordering determined by coin \ufb02ip.\n1: Mila - Quebec AI Institute, Canada\n2: Universit\u00e9 de Montr\u00e9al, D\u00e9partement d\u2019Informatique et Recherche Op\u00e9rationelle, Montreal, Canada\n3: Universit\u00e9 de Montr\u00e9al, CIRRELT, Montreal, Canada\n4: IVADO post-doctoral fellow\n5: Ecole Polytechnique de Montr\u00e9al, Montreal,Canada\n6: CIFAR senior fellow\n7: Universit\u00e9 de Montr\u00e9al, D\u00e9partement de Math\u00e9matiques et Statistiques, Montreal, Canada\n\nCorrespondence to: \n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fcomputations across timescales. A contributing factor to this limitation is the mutually orthogonal\nnature of connectivity eigendirections which substantially limits the space of solutions available to\northogonal RNNs.\n\nIn this paper, we propose a \ufb01rst step toward a solution to this expressivity problem in orthogonal\nRNNs by allowing non-orthogonal eigenbases while retaining control of eigenvalues\u2019 norms. We\nachieve this by leveraging the Schur decomposition of the connectivity matrix, though we avoid\nthe need to compute this costly factorization explicitly. This provides a separation into \"diagonal\"\nand \"feed-forward\" parts, with their own optimization constraints. Mathematically, our contribution\namounts to adding \"non-normal\" connectivity, and we call our novel architecture non-normal RNN\n(nnRNN). In linear algebra, a matrix is called normal if its eigenbasis is orthogonal, and non-normal if\nnot. Orthogonal matrices are normal, with eigenvalues of norm 1 (i.e. on the unit circle). In recurrent\nnetworks, normal connectivity produces dynamics solely characterized by the eigenspectrum while\nnon-normal connectivity allows transient expansion and compression. Transient dynamics have\nknown computational advantages [12, 8, 9], but orthogonal RNNs cannot produce them. 
The added flexibility in nnRNN allows such transients, and we show analytically how they afford additional expressivity to better encode complex inputs, while at the same time retaining efficient signal propagation to learn long-term dependencies. Through a series of numerical experiments, we show that the nnRNN provides two main advantages:\n\n1. On tasks well suited for orthogonal RNNs, nnRNN learns orthogonal (normal) connectivity and matches state-of-the-art performance while training as fast as orthogonal RNNs.\n\n2. On tasks requiring additional expressivity, non-normal connectivity emerges from training and nnRNN outperforms orthogonal RNNs.\n\nFrom a parametric standpoint, this advantage can be attributed to the fact that the nnRNN has access to all matrices with unit-norm eigenspectra, of which orthogonal ones are only a subset.\n\n2 Background\n\n2.1 Unitary RNNs and constrained optimization\n\nFirst outlined in [2] and inspired by [30, 34, 18], RNNs whose recurrent connectivity is determined by an orthogonal or unitary matrix are a direct answer to the EVGP, since their eigenspectra and singular spectra lie exactly on the complex unit circle. The same mechanism was invoked in a series of theoretical studies of deep and recurrent networks in the large-size limit, showing that the ideal regimes for effective network performance are those initialized with such spectral attributes [29, 28, 7]. By construction, orthogonal matrices and their complex-valued counterparts, unitary matrices, are isometric operators and do not expand or contract space, which helps to mitigate the EVGP. A central challenge in training unitary RNNs is to ensure that parameter updates are restricted to the manifolds satisfying orthogonality constraints, known as Stiefel manifolds (see review in [14]). This is an active area of optimization research, and several techniques have been used for orthogonal or unitary RNN training.
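As a concrete example of such a construction, an orthogonal matrix can be parametrized as the matrix exponential of a skew-symmetric matrix (the Lie-algebra idea behind expRNN, surveyed below). The following NumPy/SciPy sketch is illustrative only, not the authors' implementation:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
N = 6

# Any skew-symmetric A (A.T == -A) maps to an orthogonal matrix via the
# matrix exponential. Optimization can then act on the unconstrained
# entries of A while W = expm(A) stays on the orthogonal manifold.
raw = rng.standard_normal((N, N))
A = raw - raw.T              # skew-symmetric
W = expm(A)                  # orthogonal by construction

assert np.allclose(W @ W.T, np.eye(N))                  # isometry
assert np.allclose(np.abs(np.linalg.eigvals(W)), 1.0)   # unit-norm spectrum
```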
In [2], the authors construct connectivity matrices with long products of rotation\nmatrices leveraging fast Fourier transforms. In [33, 32, 10], the Cayley transform is used, which\nparametrizes weight matrices using skew-symmetric matrices that need to be inverted (cf. [6] for an\nRNN implementation directly using skew-symmetric matrices). Another approach uses Householder\nre\ufb02ections [24]. Recent studies also adapt some of these methods to the quaternion domain [26].\nThe methods listed above have their advantages by either being fast, or memory ef\ufb01cient, but suffer\nfrom only parametrizing a subset of all orthogonal (unitary) matrices. A novel approach considering\nthe group of unitary matrices as a Lie group and leveraging a parametrization via the exponential\nmap applied to its Lie algebra, addresses this problem and currently outperforms the rest on many\ntasks [19]. Still, of all matrices with unit-norm eigenvalues, unitary matrices are only a small subset\nand remain limited in their expressivity since they are restricted to isometric transformations [11].\nThis is why orthogonal RNNs, while performing better than a conventional RNN or LSTM at some\ntasks (e.g. copy task [13], or sequential MNIST [18]), struggle at more complex tasks requiring\ncomputations across multiple timescales.\n\n2.2 Non-normal connectivity\n\nAny diagonalizable matrix V can be expressed as V = P \u0398P \u22121 where P \u2019s columns are V \u2019s\neigenvectors and \u0398 is a diagonal matrix containing its eigenvalues. V is said to be normal if its\n\n2\n\n\fFigure 1: Bene\ufb01ts of non-normal dynamics.\n(a) The Schur decomposition provides the lower-\ntriangular Schur form (top). A feed-forward interaction coupling among Schur modes underlies\nnon-normal dynamics (bottom). (b) Lower triangle generates stronger transients. Trajectories of\nstandard deviation across hidden units (top) and norm of hidden state vector (bottom) obtained from\nthe dynamics of Eq. (1). 
Lines and shading are average and standard deviation, respectively, over 10^3 initial conditions uniformly distributed on the unit hypersphere. Parameters: d = 0. (c) Fisher memory curves across α and β (see legend in (b)) as computed by Eq. 7. Parameters: d = 0 (squares), d = 0.2 (triangles). N = 100 for (b) and (c).\n\neigenbasis is orthogonal and thus P^−1 = P⊤ and V = P Θ P⊤. Orthogonal matrices are normal matrices with eigenvalues on the unit circle. When a matrix is non-normal, it is diagonalized with a non-orthogonal basis. However, it is still possible to express it using an orthogonal basis at the cost of adding (lower) triangular structure to Θ. This is known as the Schur decomposition: for any matrix V, we have V = P (Λ + T) P⊤ with P an orthogonal matrix, Λ a diagonal matrix containing the eigenvalues, and T a strictly lower-triangular matrix.2 In short, T contains the interactions between the orthogonal column vectors of P (called Schur modes). P and T are obtained by orthogonalizing the non-orthogonal eigenbasis of V, and do not affect the eigenspectrum. As a recurrent connectivity matrix, T represents purely feed-forward structure that produces strictly transient dynamics impossible to produce with normal (orthogonal) matrices. In other words, if a normal and a non-normal matrix share exactly the same eigenspectrum, the iterative propagation of an input will be equivalent in the long term, but can differ greatly in the short term. We revisit this distinction in §3. This was exploited by [9, 12] to analyze the decomposition of the activity of recurrent networks (in continuous time) into a normal part responsible for slow fluctuations, and a non-normal part producing fast, transient ones. How this mechanism propagates information was studied in [8] for stochastic linear dynamics.
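This splitting can be checked numerically. A small sketch using SciPy's real Schur routine (which returns the upper quasi-triangular convention, the transpose of the lower-triangular form used here; variable names are ours):

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
N = 6
V = rng.standard_normal((N, N))

# Real Schur decomposition: V = P @ Theta @ P.T with P orthogonal.
Theta, P = schur(V, output='real')
assert np.allclose(P @ Theta @ P.T, V)
assert np.allclose(P.T @ P, np.eye(N))

# The triangular part does not affect the spectrum: V and Theta share
# eigenvalues, which sit in the (block-)diagonal of Theta.
assert np.allclose(np.sort(np.linalg.eigvals(V)),
                   np.sort(np.linalg.eigvals(Theta)))

# For a normal matrix (e.g. orthogonal) the triangular part vanishes:
# Theta reduces to 2x2 blocks on the diagonal.
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
Theta_Q, _ = schur(Q, output='real')
assert np.abs(np.triu(Theta_Q, 2)).max() < 1e-10
```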
The authors show analytically that non-normal dynamics can lead to extensive memory traces, as measured by the Fisher information of the distribution of hidden state trajectories parametrized by the input signal. To the best of our knowledge, an explicit demonstration and explanation of the benefits of non-normal dynamics for learning in RNNs is lacking, though see [25] for similar ideas used for initialization.\n\n3 Non-normal matrices are more expressive and propagate information more robustly than orthogonal matrices\n\nWe now outline the role of the non-normal connectivity we exploit for recurrent network parametrization. To provide mathematically grounded intuition for the benefit it provides for learning, we first consider generic RNN dynamics,\n\nh_{t+1} = φ(V h_t + U x_{t+1} + b),  V = P Θ P⊤,  Θ = Λ + T,  (1)\n\nwhere h_t ∈ R^N is the time-varying hidden state vector, φ is a nonlinear function, x_t is the input sequence projected into the dynamics via the matrix U, and b is a bias (we omit the output for brevity).\n\n2When eigenvalues and eigenvectors are complex, P is unitary and P⊤ corresponds to conjugate transposition. However, for any real V it is possible to find an orthogonal (real) P with Λ being block-diagonal with 2 × 2 blocks instead of complex-conjugate eigenvalues.\n\nV is the matrix of recurrent weights, which in line with §2.2 we decompose into its lower-triangular Schur form Θ in Eq. (1), with P orthogonal and Θ lower triangular.
Θ has two parts: a (block) diagonal part Λ, and a strictly lower-triangular part T.3 The Schur decomposition maps the hard problem of controlling the directions of a non-orthogonal basis to the easier problem of specifying interactions between fixed orthogonal modes. It is important to highlight that an orthonormalization of the eigenbasis is just a change in representation and thus has no effect on the spectrum of V, which still lies on the diagonal of Λ. The triangular part T can thus be modified independently of the constraint (employed in orthogonal RNN approaches) that the spectrum have norms equal to or near 1.\n\nThe ability to encode complex signals and then selectively recall past inputs is a basic requirement needed to solve many sequence-based tasks. Intuitively, the two features that allow systems to perform well in such tasks are:\n\n1. High-dimensional activity, to better encode complex input.\n\n2. Efficient signal propagation, to better learn long-term dependencies.\n\nTo illustrate how non-normal dynamics controlled by the entries in the lower triangle of T contribute to these two features, we consider a simplified linear case where φ(h_t) = h_t and Θ is parametrized as follows (illustrated in Fig. 1(a)):\n\n(Θ)_{i,j} = d δ_{i,j} + α δ_{i,j+1} + β Σ_{2≤k≤i} δ_{i,j+k} .  (2)\n\nHere, diagonal entries are set to d, sub-diagonal entries to α, and the remaining entries in the lower triangle to β. By varying α and β we will show how the lower triangle in T enhances expressivity and information propagation.\n\n3.1 Non-normality drives expressive transients\n\nRNNs can be made more expressive with stronger fluctuations of hidden state dynamics. The dependence of hidden state variance on the values of Θ was studied in depth in [12]. Here, we present experiments where the RNN parametrized by Eqs.
(1), (2) exemplifies some of those results. We numerically compute a set of trajectories over a sampled ensemble of inputs with x_t > 0 for t = 0 and 0 otherwise. Without loss of generality we assume a form of U and distribution of x_0 that leads to input-dependent initial conditions on the unit hypersphere in the space of h_t. For α = 0.95, 1.0, 1.05, β = 0, 0.005 and d = 0, we see that trajectories of single units exhibit increasingly large transients with increasing α and β, which abruptly end at t = N (Fig. 1(b)). The latter is a result of the nilpotent property of a strictly triangular matrix: each iteration removes the top entries in each column until Θ^N = 0. Computing ensemble statistics, we find that α contributes significantly to the strength of the exponential amplification, while β structures the shape of the transient. This ability of T to both exhibit amplification and to control its shape is what endows the Schur form Θ with expressivity (see §5.3 for empirical evidence in trained nnRNNs).\n\n3.2 Non-normality allows for efficient information propagation\n\nPropagation of information in a network requires feed-forward interactions. Perhaps the simplest example of a feed-forward structure is the local feed-forward chain (also called a delay line [8]), where each mode feeds its signal only to the next mode in the chain (α > 0, β = 0, d = 0; see Fig. 1(a)). In this case, we denote Θ by Θ_delay. As a consequence, signals feeding the first entry of Θ_delay propagate down the chain and are amplified or attenuated according to the values of these non-zero entries. Moreover, inputs from different time steps do not interact with each other thanks to this ordered propagation down the line. In contrast, the signal is not propagated across modes for dynamics given by a purely (block) diagonal Θ.
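These transient effects are easy to reproduce with a few lines of linear algebra; a sketch of the linear dynamics h_{t+1} = Θ h_t under the parametrization of Eq. (2), with illustrative parameter values:

```python
import numpy as np

def make_theta(N, d, alpha, beta):
    # Eq. (2): diagonal entries d, sub-diagonal entries alpha, and the
    # remaining lower-triangular entries beta.
    return (d * np.eye(N)
            + alpha * np.eye(N, k=-1)
            + beta * np.tril(np.ones((N, N)), -2))

N = 100
h0 = np.zeros(N)
h0[0] = 1.0                      # unit input injected into the first mode

norms = {}
for alpha, beta in [(1.05, 0.0), (1.05, 0.005)]:
    Theta = make_theta(N, d=0.0, alpha=alpha, beta=beta)
    h, traj = h0.copy(), []
    for _ in range(N + 10):
        h = Theta @ h
        traj.append(np.linalg.norm(h))
    norms[(alpha, beta)] = traj

# Transient amplification: the hidden-state norm grows far above 1 ...
assert max(norms[(1.05, 0.0)]) > 10
# ... but a strictly lower-triangular Theta (d = 0) is nilpotent,
# Theta^N = 0, so every trajectory collapses exactly to zero at t = N.
assert norms[(1.05, 0.0)][N + 5] == 0.0
assert norms[(1.05, 0.005)][N + 5] == 0.0
```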
It instead simply decays within the mode into which it was injected, on the timescale intrinsic to that mode, which can be much less than the O(N) timescale of the chain.\n\nTo quantify the efficiency with which an RNN can store inputs, we follow and extend the approach of [8]. For a given scalar-valued input sequence, x_t = s_t + ξ_t, t ∈ N, composed of signal s_t and injected noise ξ_t, the noise ensemble induces the conditional distribution, P(h_{:t}|s_{:t}), over trajectories\n\n3We use the real Schur decomposition but a similar treatment can be derived for the complex case.\n\nof hidden states, h_{:t}, given the received input, s_{:t}, where the :t subscript is shorthand for (k : k ≤ t). Taking the signal sequence s_{:t} as a set of parameters of a model, and P(h_{:t}|s_{:t}) as this model's likelihood, the corresponding Fisher information matrix that captures how P(h_{:t}|s_{:t}) changes with the input s_{:t} is,\n\nJ_{k,l}(s_{:t}) = ⟨ −∂² log P(h_{:t}|s_{:t}) / ∂s_k ∂s_l ⟩_{P(h_{:t}|s_{:t})},  k, l ≤ t .  (3)\n\nThe diagonal of this matrix, J(t) := J_{t,t}, is called the Fisher memory curve (FMC) and has a simple interpretation: if a single signal s_0 is injected into the network at time 0, then J(t) is the Fisher information that h_t retains about this single signal.\n\nIn [8], the authors proved that the delay line Θ_delay achieves the highest possible values for the FMC when k ≤ N: J(k) = α^k (α − 1)/(α^{k+1} − 1). However, we show (proof in SM§B) that any strictly lower-triangular matrix may approach the performance of a delay line:\n\nProposition 1. Let Θ ∈ R^{N×N} be any strictly lower-triangular matrix with √α on the lower diagonal and let T_Gram ∈ R^{N×N} be the triangular matrix associated with the Gram–Schmidt orthogonalization process of the columns of Θ (thus with only 1 on the diagonal).
Then,\n\nJ(k) ≥ (α^k / σ_max^{2(N−1)}) · (α − 1)/(α^{k+1} − 1),  (4)\n\nwhere σ_max is the maximum singular value of T_Gram.\n\nNote that σ_max ≥ 1; it is equal to 1 for a delay line and close to 1 when Θ is close to a delay line. In Fig. 1(a) we present a class of matrices providing feed-forward interaction and compute the FMC of some matrices of this class in Fig. 1(c). The delay line from [8] with α > 1 (shown to be optimal for t ≤ N) retains the most Fisher information across time up to time step N, when the nilpotency of Θ erases all information. As expected from Prop. 1, non-zero β, which endows the dynamics with expressivity (Fig. 1(b)), does not significantly degrade the information propagation of the delay line. Interestingly, the addition of diagonal terms (d > 0), i.e. Λ non-zero, helps to maintain almost optimal values of the FMC for t < N, while extending the memory beyond t = N, and thus outperforming the delay line with regard to the area under the FMC (see Table 3 in the supplemental materials (SM)).\n\nTogether with the last section, these results demonstrate that non-normal dynamics, as parametrized through the entries in the lower triangle of Θ, provide significant benefits to expressivity and information propagation. What remains to show is how these benefits translate into enhanced performance of our nnRNN on actual tasks.\n\n3.3 Non-normal matrix spectra and gradient propagation\n\nWhile eigenvalues control the exponential growth and decay of matrix iterates, the spectral norm of these iterates may behave differently [4]. This norm is dominated by the modulus of the largest singular value of the matrix, and can thus differ from the eigenvalues' moduli. This is a subtle difference influencing gradient growth rates, and is explicitly revealed by different spectral constraints on RNNs.
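The distinction drawn here between eigenvalue moduli and singular values can be checked directly: with unit-modulus eigenvalues but a non-zero triangular part, matrix powers grow only polynomially in t, whereas eigenvalues of modulus greater than one give exponential growth. A small NumPy sketch (sizes and entries chosen only for illustration):

```python
import numpy as np

N = 5
T = np.tril(np.full((N, N), 0.5), -1)   # strictly lower-triangular part

A_unit = np.eye(N) + T        # all eigenvalues exactly 1 (unit modulus)
A_exp = 1.05 * np.eye(N) + T  # eigenvalues of modulus 1.05

# Unit-modulus eigenvalues do not make A_unit an isometry: the non-normal
# triangular part pushes its largest singular value above 1.
assert np.allclose(np.linalg.eigvals(A_unit), 1.0)
assert np.linalg.svd(A_unit, compute_uv=False)[0] > 1.0

spec_norm = lambda M, t: np.linalg.norm(np.linalg.matrix_power(M, t), 2)

# Doubling t multiplies ||A_unit^t|| by a bounded factor (polynomial growth
# of degree < N), but multiplies ||A_exp^t|| by roughly a 1.05**200 factor.
r_unit = spec_norm(A_unit, 400) / spec_norm(A_unit, 200)
r_exp = spec_norm(A_exp, 400) / spec_norm(A_exp, 200)
assert r_unit < 32            # ~ 2**(N-1) = 16 for degree-4 growth
assert r_exp > 1000           # exponential growth
```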
For comparison, a singular value decomposition (SVD) is presented in [35] with the same motivation as our Schur decomposition: to maintain expressivity whilst controlling a spectrum (both using regularization). First note that, while constraining the eigenspectrum to the unit circle, non-normality implies having the largest singular value (and thus the spectral norm of the Jacobian) greater than 1. Hence, our approach mitigates gradient vanishing, but not necessarily gradient explosion. In this case, however, gradients explode polynomially in time rather than exponentially [27, 2]. We provide a result (proof in SM§C) establishing this for triangular matrices.\n\nProposition 2. Let A ∈ R^{n×n} be a matrix such that A_{ii} = 1, A_{ij} = x for i < j, and A_{ij} = 0 otherwise. Then for all integers t ≥ 1 and j > i, we have (A^t)_{ij} = p^{(t)}_{j−i}(x), where p^{(t)}_{j−i} is a polynomial in x of degree at most j − i, whose coefficient of x^0 is zero and whose coefficient of x^l is O((t choose l)) for l = 1, 2, . . . , j − i (which is polynomial in t of degree at most l).\n\nThis reveals that gradient explosion in nnRNN with a unit-norm eigenspectrum, if present, is polynomial and thus not as severe as the case where eigenvalues are larger than one (in which case the gradient explosion is exponential). In §5.3, we illustrate that relaxing unit-norm requirements for eigenvalues using regularization allows the optimizer to find a task-dependent trade-off, thus balancing control over exponential vanishing and polynomial exploding gradients, respectively. See also SM§F for gradient propagation measurements.\n\n4 Implementing a non-normal RNN\n\nThe nnRNN is a standard RNN model where we parametrize the recurrent connectivity matrix V using its real Schur decomposition4 as in Eq. (1), yielding the form:\n\nV = P (Λ + T) P⊤,  Λ = blockdiag(R_1, R_2, . . . , R_{N/2}),  T strictly lower triangular with entries t_{i,j} (i > j),  (5)\n\nwith\n\nR_i(γ_i, θ_i) := γ_i [[cos θ_i, −sin θ_i], [sin θ_i, cos θ_i]],\n\nwhere P is constrained to be an N × N orthogonal matrix. Each parameter above (including the entries of P) is subject to optimization, as well as to specific constraints outlined below. We note that although this parametrization uses the Schur form, we never explicitly compute Schur decompositions, which would be expensive and have stability issues.5 Note that Eq. 5 can express any matrix V with a set of complex-conjugate pairs of eigenvalues.\n\nDuring training, the orthogonal matrix P is optimized using the expRNN algorithm [19], a Riemannian gradient descent-like algorithm operating inside the Stiefel manifold of orthogonal matrices. We note that other suitable orthogonality-preserving algorithms could be used here (see §2) but we found expRNN to be the fastest and most stable. Instead of rigidly enforcing that eigenvalues be of unit norm, we found relaxing this constraint to be helpful. We therefore allow the γ_i to be optimized but add a strong L2 regularization term, δ Σ_k |1 − γ_k|², to encourage them to be close to 1. The hyperparameter δ is tuned differently for each task (see SM§A) but remains high overall, indicating only mild departure from unit-norm eigenvalues. Both θ_i and t_{ij} are freely optimized via automatic differentiation. The non-linearity we use is modReLU, as defined in [2, 10].
We initialize P as in [19] using the Henaff or Cayley initialization schemes [11], θ_i from a uniform distribution between 0 and 2π, and the γ_i initialized at 1.\n\nWe reiterate that the set of orthogonal matrices is a subset of all the connectivity matrices covered by nnRNN, obtained by setting all γ's to 1 and T = 0. Consequently, the connectivity matrix in nnRNN has more parameters than an orthogonal matrix: N(N − 1)/2 for T, and N/2 γ_i's, which in total gives roughly N²/2 more parameters than orthogonal RNNs.\n\nThe forward pass of the nnRNN has the same complexity as that of a vanilla RNN, that is O(Tn²), for a hidden state of size n and a sequence of length T. The backward pass is similarly O(Tn²) plus the update cost of P, in addition to a once-per-update cost of O(n³) to combine the Schur parametrization via matrix multiplication. Importantly, the nnRNN can leverage any orthogonal/unitary optimizer for P; these have complexities ranging from O(n log n) to O(n³) at each update, with their own advantages and caveats (see §2.1). We chose the expRNN scheme, which is O(n³) in the worst case, but has fast run-time in practice.\n\n5 Numerical experiments\n\nIn this section, we test the performance of our nnRNN on various sequential processing tasks. We have two goals:\n\n1. Establish the nnRNN's ability to perform as well as orthogonal RNNs on tasks with pathologically long-term dependencies: the copy task and the permuted sequential MNIST task.\n\n2.
Demonstrate improved performance over orthogonal RNNs on a more realistic task requiring ongoing computation and output: the Penn Tree Bank character-level benchmark.\n\n4See discussion for more details about complex-valued implementations.\n5See SM§D for a discussion.\n\nWe compare our nnRNN model to the following architectures: vanilla RNN (RNN), the orthogonally initialized RNN (RNN-orth) [11], the Efficient Unitary RNN (EURNN) [24], and the Exponential RNN (expRNN) [19]. Our goal is to establish performance for non-gated models, but we include the LSTM [13] for reference. For comparison, models are separately matched in the number of hidden units and the number of parameters. Every training run was tuned with a thorough optimization hyperparameter search. Model training and task setup are detailed in SM§A.\n\nFigure 2: Holding the number N of hidden units constant, model performance is plotted for the copy task (T=200, left; cross-entropy loss; N ∼ 128) and for the permuted sequential MNIST task (right; accuracy; N ∼ 512). Shading indicates one standard error of the mean.\n\n5.1 Copy task & Permuted sequential MNIST\n\nThe copy task, introduced in [13], requires that a model read a sequence of inputs, wait for some delay T (here we use T = 200), and then output the same sequence. Fig. 2 shows the cross entropy of each tested model with N = 128 hidden units. We see little difference if we instead match the number of parameters at ∼ 18.9K (see Fig. 4 in SM§A.2). For reference, a model that simply predicts a constant set of output tokens for every input sequence is expected to achieve a baseline loss of 0.095. As shown in [11], an orthogonal RNN is an optimal solution for the copy task. Indeed, the LSTM struggled to solve the task and the RNN failed completely, unlike all orthogonal RNNs, which quickly learn to solve it with very high performance.
The proposed nnRNN matched the performance of orthogonal RNNs, as well as the best training timescales.\n\nSequential MNIST [18] requires a model to classify an MNIST digit after reading the digit image one pixel at a time. The pixels are permuted in order to increase the time delay between inter-dependent pixels, making the task harder. Fig. 2 shows the mean validation accuracy of each tested model with N = 512 hidden units (see Fig. 4 in SM§A.2 for the parameter-matched comparison). As with the copy task, the nnRNN matches orthogonal RNNs in performance, whereas the RNN and LSTM perform worse.\n\n5.2 Penn Tree Bank (PTB) character-level prediction\n\nCharacter-level language modelling with the Penn Treebank Corpus (PTB) [22] consists of predicting the next character at each position in a sequence of text (see SM§A.3 for test accuracy). We compare the performance of different models on this task in Table 1 in terms of test mean bits per character (BPC), where lower BPC indicates better performance. We compare truncated backpropagation through time over 150 time steps and over 300 time steps.\n\nIn contrast to the copy and psMNIST tasks (see §5.1), the PTB task requires online computation across several inputs received in the past. Furthermore, it is a task that demands an output from the network at each time step, as opposed to a prompted one. These ingredients are not particularly well suited for orthogonal transformations: it is not enough to simply keep inputs in memory or integrate input paths to a classification outcome; the network must transform past inputs to compute a probability distribution. Gated networks are well suited for such tasks, and we could get an LSTM with N = 1024 hidden units to achieve 1.37 ± 0.003 BPC (see §6 for a discussion).\n\nImportantly, without the use of gating mechanisms, our nnRNN outperformed all other models we tested.
To our knowledge, it also surpasses all reported performances for other non-gated models of comparable size (see also [20]).\n\nTest bits per character (BPC):\n\nModel    | Fixed # params (∼1.32M)        | Fixed # hidden units (N = 1024)\n         | T_PTB = 150  | T_PTB = 300     | T_PTB = 150  | T_PTB = 300\nRNN      | 2.89 ± 0.002 | 2.90 ± 0.0016   | 2.89 ± 0.002 | 2.90 ± 0.002\nRNN-orth | 1.62 ± 0.004 | 1.66 ± 0.006    | 1.62 ± 0.004 | 1.66 ± 0.006\nEURNN    | 1.61 ± 0.001 | 1.62 ± 0.001    | 1.69 ± 0.001 | 1.68 ± 0.001\nexpRNN   | 1.49 ± 0.008 | 1.52 ± 0.001    | 1.51 ± 0.005 | 1.55 ± 0.001\nnnRNN    | 1.47 ± 0.003 | 1.49 ± 0.002    | 1.47 ± 0.003 | 1.49 ± 0.002\n\nTable 1: PTB test performance: bits per character (BPC), for sequence lengths T_PTB = 150, 300. Two comparisons across models are shown: fixed number of parameters (left), and fixed number of hidden units (right). Error range indicates standard error of the mean.\n\nWhile the performance gap to expRNN (the state-of-the-art orthogonal RNN) is modest for an equal number of parameters and the shorter time scale (T_PTB = 150), it appreciably improves for T_PTB = 300. Where the nnRNN shines is for equal numbers of hidden units, where the performance gap to expRNN is much greater. This suggests two things: (i) the nnRNN improves propagation of meaningful signals over longer time scales, and (ii) its connectivity structure provides superior expressivity for a fixed number of neurons, a desirable feature for efficient model deployment.
In the next section, we explore the structure of trained nnRNN weights to illustrate that the mechanisms responsible for this performance gain are consistent with the arguments presented in §3.

5.3 Analysis of learned connectivity structure

Figure 3: Learned Θs show decomposition into Λ and T. Entries of the learned Θ matrix for the copy task (a) are concentrated on the diagonal, and distributed in the lower triangle for the PTB task (b). Insets in (a) and (b) show the distribution of eigenvalue angles θi (cf. Eq. 4). (c) The mean magnitude of entries along the kth sub-diagonal of the lower triangle in (b) shows both a delay-line and a lower-triangle component. Inset: the distribution of entry magnitudes along the delay line is bimodal from its two contributions: the cosine of uniformly distributed angles, and the relatively small, but significant, pure delay-line entries.

To validate the theoretical arguments in favor of non-normal dynamics presented in §3, we take a look at the connectivity structure that emerges from our training procedure (see §4). Fig. 3 (and 5 in SM§E) shows the triangular Schur form Θ = Λ + T of the recurrent connectivity matrix V = PΘP⊤ at the end of training. For the copy task, Θ is practically composed of 2 × 2 rotation blocks along its diagonal (i.e. T = 0). This indicates that the learned dynamics are normal and orthogonal.
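To make the decomposition concrete, the following NumPy sketch (ours, not the paper's implementation) builds a matrix V = PΘP⊤ with Θ = Λ + T as above. Because Θ is block lower triangular, its eigenvalues come from the 2 × 2 rotation blocks alone and stay on the unit circle, no matter how large the strictly lower-triangular part T is, even though V itself is then far from orthogonal.

```python
import numpy as np

def nonnormal_connectivity(n, thetas, t_scale, seed=0):
    """Sketch of V = P (Lambda + T) P^T.
    Lambda: block-diagonal 2x2 rotations -> unit-norm eigenvalues e^{+-i theta}.
    T: strictly lower triangular, zeroed inside the 2x2 blocks so the
       block-triangular structure leaves the spectrum untouched.
    P: an arbitrary orthogonal basis (here from a QR decomposition)."""
    assert n % 2 == 0 and len(thetas) == n // 2
    rng = np.random.default_rng(seed)
    Lam = np.zeros((n, n))
    T = np.tril(rng.standard_normal((n, n)), k=-1)
    for b, th in enumerate(thetas):
        i = 2 * b
        c, s = np.cos(th), np.sin(th)
        Lam[i:i+2, i:i+2] = [[c, -s], [s, c]]
        T[i+1, i] = 0.0  # keep the rotation blocks intact
    T *= t_scale
    P, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return P @ (Lam + T) @ P.T

V = nonnormal_connectivity(8, thetas=np.linspace(0.1, 3.0, 4), t_scale=0.5)
eig = np.linalg.eigvals(V)
print(np.allclose(np.abs(eig), 1.0))  # True: unit-norm eigenspectrum
print(np.allclose(V @ V.T, V.T @ V))  # False: V is non-normal
```

With t_scale = 0 this construction reduces to an orthogonal matrix, illustrating that orthogonal matrices are the T = 0 subset of this family.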
In contrast, for the PTB task we find that the lower triangular part T shows considerable structure, indicating that non-normal transient dynamics are used to solve the prediction task. The distributions of elements of T away from the diagonal highlight the nature of the tasks. The network distributes the angles roughly uniformly in the case of the copy task, consistent with the explicit optimal solution that involves such a distribution of rotations [11]. For the PTB task however, the angles strongly align, promoting the delay-line motif in Θ, shown in §3 to be optimal for the information propagation useful for character prediction. This is more clearly demonstrated by the mean absolute value of entries away from the diagonal, shown in Fig. 3. The rest of the triangle also shows structure, consistent with our proof that the lower triangle and delay line can jointly contribute to information propagation.

In summary, these findings indicate that when tasks are well-suited for isometric transformations (e.g. storing things in memory for later recall), the nnRNN easily learns to eliminate non-normal dynamics and restricts itself to the set of orthogonal matrices. Moreover, it does so without any penalty on learning speed, as shown in Fig. 2. However, when tasks require online computations, non-normal dynamics come into play and enable transient activity to be shaped for computations.

Lastly, as already discussed in §3.3, the expressivity afforded by non-normality must come with a trade-off between maintaining the eigen- and singular spectra "close" to the unit circle, balancing control over exponential vanishing and polynomial exploding gradients respectively.
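A small toy example (ours, not from the paper) makes this trade-off concrete: a triangular matrix with unit eigenvalues but a delay line below the diagonal has singular values that straddle 1, so repeated application amplifies signals transiently and polynomially, whereas an orthogonal matrix is an exact isometry at every power.

```python
import numpy as np

n = 10
# A is lower triangular, so its eigenvalues are its diagonal entries: all
# exactly 1 (unit norm). The sub-diagonal "delay line" makes A non-normal.
A = np.eye(n) + 0.9 * np.diag(np.ones(n - 1), k=-1)

s = np.linalg.svd(A, compute_uv=False)
print(s.max(), s.min())  # singular values straddle 1 (their product is det A = 1)

# Powers of A grow polynomially in k (transient amplification), whereas an
# orthogonal matrix Q would satisfy ||Q^k|| = 1 for every k:
norms = [np.linalg.norm(np.linalg.matrix_power(A, k), 2) for k in (1, 5, 20, 80)]
print(norms)
```

Gradients propagated through such a matrix can therefore grow polynomially over a window of time steps even though nothing grows or decays exponentially, which is the balance the regularization discussed next is meant to control.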
This fact remains true for any parametrization of non-normal matrices, including the SVD used in spectral RNN [35]. The nnRNN is naturally suited to target this balance by explicitly allowing regularization over the normal and non-normal parts of a matrix, and enabling the optimizer to find that trade-off. This explains why we find that allowing eigenvalues to deviate slightly from the unit circle throughout training (regularization on γ), along with weight decay for the non-normal part, yields the best results with the most stable training. Further evidence of this balancing mechanism is found in trained matrices (see Fig. 3). For the PTB task, non-normal structure emerges and the mean eigenvalue norm is balanced at ¯γ ∼ 0.958. In contrast, for the copy task, matrices remain normal and ¯γ ∼ 1. See SM§F for additional experiments with fixed γ further outlining its role in this trade-off.

6 Discussion

With the nnRNN, we showed that augmenting orthogonal recurrent connectivity matrices with non-normal terms increases the flexibility of a recurrent network. We compared the nnRNN's performance to several other recurrent models on distinct tasks: some that are well suited for orthogonal RNNs, and another that targets their limitations. We find that non-normal structure affords two distinct improvements for nnRNNs:

1. Preservation of advantages from purely orthogonal RNNs (long-term gradient propagation; fast learning on tasks involving long-term memory)

2.
Compared to orthogonal RNNs, increased expressivity on tasks requiring online computations, thanks to transient dynamics.

To better understand why this is the case, we derived analytical expressions that outline the role of non-normal dynamics, which were corroborated by an analysis of nnRNN connectivity structure after training. Importantly, the nnRNN leverages existing optimization algorithms for orthogonal matrices with increased scope, all the while retaining learning speed.

The principal contribution of this paper is not to report major gains in performance as measured by tests, but rather to convincingly outline a promising novel direction for spectrally constrained RNNs. This direction spans the expressivity and ability to handle long-term dependencies of orthogonal RNNs on one hand, and of completely unconstrained RNNs on the other. The nnRNN is a first step toward a trainable RNN parametrization where regularization over the eigenspectrum is readily available while conserving the flexibility of arbitrary eigenbases. This allows explicit control over quantities with direct impact on gradient propagation and expressivity, providing a promising RNN toolbox. Unlike the orthogonal RNNs present in our tests, which have benefited over the years from a series of algorithmic improvements, our nnRNN is basic in its implementation, and presents a number of areas for direct improvement. These include (i) using a complex-valued parametrization as in [2], (ii) exploring better initializations, and (iii) identifying helpful regularization schemes for the non-normal part. Beyond these, we should mention that the Schur decomposition presents implicit instabilities which can jeopardize training when eigenbases become degenerate (see SM§D).
Simple perturbation schemes to prevent this should greatly improve performance.

Finally, we acknowledge that on a number of time-dependent tasks, gated recurrent networks such as the LSTM or the GRU [15] have clear advantages (see also [31] for a derivation of gated dynamics from first principles). Building on these, there is promising evidence that combining orthogonal connectivity with gates can greatly help learning [15]. This further motivates the development of spectrally constrained recurrent architectures to be combined with gating, thereby optimizing the efficiency of gradient propagation and expressivity with both explicit mechanisms and implicit structure. Work in this direction is under way, leveraging our nnRNN findings.

Acknowledgments

We would like to thank Tim Cooijmans, Sarath Chandar, Jonathan Binas, Anirudh Goyal, César Laurent and Tianyu Li for useful discussions. YB acknowledges support from CIFAR, Microsoft and NSERC. MPT acknowledges IVADO support. GL is funded by an NSERC Discovery Grant (RGPIN-2018-04821), an FRQNT Young Investigator Startup Program (2019-NC-253251), and an FRQS Research Scholar Award, Junior 1 (LAJGU0401-253188).

References

[1] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, Jack J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, and D. Sorensen. LAPACK Users' Guide (Third Ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1999. ISBN 0-89871-447-8.

[2] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 1120–1128. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045509.

[3] Devansh Arpit, Bhargav Kanuparthi, Giancarlo Kerg, Nan Rosemary Ke, Ioannis Mitliagkas, and Yoshua Bengio.
h-detach: Modifying the LSTM Gradient Towards Better Optimization.\nICLR, 2019.\n\n[4] Y Bengio, P Simard, and P Frasconi. Learning long-term dependencies with gradient descent is\n\ndif\ufb01cult. IEEE Transactions on Neural Networks, 5(2):157\u2013166, 1994.\n\n[5] Sarath Chandar, Chinnadhurai Sankar, Eugene Vorontsov, Samira Ebrahimi Kahou, and Yoshua\nBengio. Towards Non-saturating Recurrent Units for Modelling Long-term Dependencies.\nAAAI, 2019.\n\n[6] Bo Chang, Minmin Chen, Eldad Haber, and Ed H Chi. AntisymmetricRNN: A Dynamical\n\nSystem View on Recurrent Neural Networks. ICLR, 2019.\n\n[7] Minmin Chen, Jeffrey Pennington, and Samuel S Schoenholz. Dynamical Isometry and a Mean\nField Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks.\nICML, 2018.\n\n[8] Surya Ganguli, Dongsung Huh, and Haim Sompolinsky. Memory traces in dynamical systems.\nProceedings of the National Academy of Sciences of the United States of America, 105(48):\n18970\u201318975, December 2008.\n\n[9] Mark S Goldman. Memory without Feedback in a Neural Network. Neuron, 61(4):621\u2013634,\n\nFebruary 2009.\n\n[10] Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal Recurrent Neural Networks with\n\nScaled Cayley Transform. ICML, 2018.\n\n[11] Mikael Henaff, Arthur Szlam, and Yann LeCun. Recurrent orthogonal networks and long-\nmemory tasks. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The\n33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine\nLearning Research, pages 2034\u20132042, New York, New York, USA, 20\u201322 Jun 2016. PMLR.\nURL http://proceedings.mlr.press/v48/henaff16.html.\n\n[12] Guillaume Hennequin, Tim P. Vogels, and Wulfram Gerstner. Non-normal ampli\ufb01cation in\nrandom balanced neuronal networks. Phys. Rev. E, 86:011909, Jul 2012. doi: 10.1103/PhysRevE.\n86.011909. URL https://link.aps.org/doi/10.1103/PhysRevE.86.011909.\n\n[13] Sepp Hochreiter and J\u00fcrgen Schmidhuber. 
Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.

[14] Bo Jiang and Yu-Hong Dai. A framework of constraint preserving update schemes for optimization on Stiefel manifold. Mathematical Programming, 153(2):1–41, April 2015.

[15] Li Jing, Caglar Gulcehre, John Peurifoy, Yichen Shen, Max Tegmark, Marin Soljacic, and Yoshua Bengio. Gated orthogonal recurrent units: On learning to forget. Neural Computation, 2019.

[16] Li Jing, Yichen Shen, John Peurifoy, Scott Skirlo, Yann LeCun, and Max Tegmark. Tunable Efficient Unitary Neural Networks (EUNN) and their application to RNNs. ICML, 2017.

[17] Peter D. Lax. Linear Algebra and Its Applications. John Wiley & Sons, 2007.

[18] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv, 2015.

[19] Mario Lezcano-Casado and David Martínez-Rubio. Cheap Orthogonal Constraints in Neural Networks: A Simple Parametrization of the Orthogonal and Unitary Group. ICML, 2019.

[20] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. Independently recurrent neural network (indrnn): Building a longer and deeper rnn. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[21] Kehelwala D G Maduranga, Kyle E Helfrich, and Qiang Ye. Complex Unitary Recurrent Neural Networks using Scaled Cayley Transform. AAAI, 2019.

[22] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist., 19(2):313–330, June 1993. ISSN 0891-2017. URL http://dl.acm.org/citation.cfm?id=972470.972475.

[23] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An Analysis of Neural Language Modeling at Multiple Scales.
arXiv preprint arXiv:1803.08240, 2018.\n\n[24] Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. Ef\ufb01cient Orthog-\nonal Parametrisation of Recurrent Neural Networks Using Householder Re\ufb02ections. ICML,\n2017.\n\n[25] A Emin Orhan and Xaq Pitkow. Improved memory in recurrent neural networks with sequential\n\nnon-normal dynamics. arXiv.org, May 2019.\n\n[26] Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linar\u00e8s, Chiheb Trabelsi,\n\nRenato De Mori, and Yoshua Bengio. Quaternion Recurrent Neural Networks. ICLR, 2019.\n\n[27] R Pascanu, T Mikolov, and Y Bengio. On the dif\ufb01culty of training recurrent neural networks.\n\nICML, 2013.\n\n[28] Jeffrey Pennington, Samuel S Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep\nlearning through dynamical isometry: theory and practice. Advances in Neural Information\nProcessing Systems, 2017.\n\n[29] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein. On the\n\nexpressive power of deep neural networks. In ICML, 2017.\n\n[30] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear\n\ndynamics of learning in deep linear neural networks. ICLR, 2014.\n\n[31] Corentin Tallec and Yann Ollivier. Can recurrent neural networks warp time? ICLR, 2018.\n\n[32] Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Christopher Pal. On orthogonality\n\nand learning rnn with long term dependencies. ICML, 2017.\n\n[33] Scott Wisdom, Thomas Powers, John R Hershey, Jonathan Le Roux, and Les Atlas. Full-\nCapacity Unitary Recurrent Neural Networks. Advances in Neural Information Processing\nSystems, 2016.\n\n[34] Z. Yang, M. Moczulski, M. Denil, N. d. Freitas, A. Smola, L. Song, and Z. Wang. Deep fried\n\nconvnets. In ICCV, 2015.\n\n[35] Jiong Zhang, Qi Lei, and Inderjit Dhillon. Stabilizing Gradients for Deep Neural Networks via\nEf\ufb01cient SVD Parameterization. 
In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, pages 5806–5814, Stockholmsmässan, Stockholm, Sweden, July 2018. PMLR.