{"title": "Approximating Real-Time Recurrent Learning with Random Kronecker Factors", "book": "Advances in Neural Information Processing Systems", "page_first": 6594, "page_last": 6603, "abstract": "Despite all the impressive advances of recurrent neural networks, sequential data is still in need of better modelling. Truncated backpropagation through time (TBPTT), the learning algorithm most widely used in practice, suffers from the truncation bias, which drastically limits its ability to learn long-term dependencies.The Real Time Recurrent Learning algorithm (RTRL) addresses this issue, but its high computational requirements make it infeasible in practice. The Unbiased Online Recurrent Optimization algorithm (UORO) approximates RTRL with a smaller runtime and memory cost, but with the disadvantage of obtaining noisy gradients that also limit its practical applicability. In this paper we propose the Kronecker Factored RTRL (KF-RTRL) algorithm that uses a Kronecker product decomposition to approximate the gradients for a large class of RNNs. We show that KF-RTRL is an unbiased and memory efficient online learning algorithm. Our theoretical analysis shows that, under reasonable assumptions, the noise introduced by our algorithm is not only stable over time but also asymptotically much smaller than the one of the UORO algorithm. We also confirm these theoretical results experimentally. Further, we show empirically that the KF-RTRL algorithm captures long-term dependencies and almost matches the performance of TBPTT on real world tasks by training Recurrent Highway Networks on a synthetic string memorization task and on the Penn TreeBank task, respectively. 
These results indicate that RTRL based approaches might be a promising future alternative to TBPTT.", "full_text": "Approximating Real-Time Recurrent Learning with Random Kronecker Factors

Asier Mujika*
Department of Computer Science
ETH Zürich, Switzerland
asierm@inf.ethz.ch

Florian Meier
Department of Computer Science
ETH Zürich, Switzerland
meierflo@inf.ethz.ch

Angelika Steger
Department of Computer Science
ETH Zürich, Switzerland
steger@inf.ethz.ch

Abstract

Despite all the impressive advances of recurrent neural networks, sequential data is still in need of better modelling. Truncated backpropagation through time (TBPTT), the learning algorithm most widely used in practice, suffers from the truncation bias, which drastically limits its ability to learn long-term dependencies. The Real-Time Recurrent Learning algorithm (RTRL) addresses this issue, but its high computational requirements make it infeasible in practice. The Unbiased Online Recurrent Optimization algorithm (UORO) approximates RTRL with a smaller runtime and memory cost, but with the disadvantage of obtaining noisy gradients that also limit its practical applicability. In this paper we propose the Kronecker Factored RTRL (KF-RTRL) algorithm that uses a Kronecker product decomposition to approximate the gradients for a large class of RNNs. We show that KF-RTRL is an unbiased and memory efficient online learning algorithm. Our theoretical analysis shows that, under reasonable assumptions, the noise introduced by our algorithm is not only stable over time but also asymptotically much smaller than the one of the UORO algorithm. We also confirm these theoretical results experimentally.
Further, we show empirically that the KF-RTRL algorithm captures long-term dependencies and almost matches the performance of TBPTT on real world tasks by training Recurrent Highway Networks on a synthetic string memorization task and on the Penn TreeBank task, respectively. These results indicate that RTRL based approaches might be a promising future alternative to TBPTT.

1 Introduction

Processing sequential data is a central problem in the field of machine learning. In recent years, Recurrent Neural Networks (RNN) have achieved great success, outperforming all other approaches in many different sequential tasks like machine translation, language modeling, reinforcement learning and more.

Despite this success, it remains unclear how to train such models. The standard algorithm, Truncated Back Propagation Through Time (TBPTT) [18], considers the RNN as a feed-forward model over time with shared parameters. While this approach works extremely well in the range of a few hundred time-steps, it scales very poorly to longer time dependencies. As the time horizon is increased, the parameters are updated less frequently and more memory is required to store all past states. This makes TBPTT ill-suited for learning long-term dependencies in sequential tasks.

* Author was supported by grant no. CRSII5_173721 of the Swiss National Science Foundation.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

An appealing alternative to TBPTT is Real-Time Recurrent Learning (RTRL) [19]. This algorithm allows online updates of the parameters and learning arbitrarily long-term dependencies by exploiting the recurrent structure of the network for forward propagation of the gradient.
Despite its impressive theoretical properties, RTRL is impractical for decently sized RNNs because run-time and memory costs scale poorly with network size.

As a remedy to this issue, Tallec and Ollivier [16] proposed the Unbiased Online Recurrent Optimization algorithm (UORO). This algorithm unbiasedly approximates the gradients, which reduces the run-time and memory costs such that they are similar to the costs required to run the RNN forward. Unbiasedness is of central importance since it guarantees convergence to a local optimum. Still, the variance of the gradients slows down learning.

Here we propose the Kronecker Factored RTRL (KF-RTRL) algorithm. This algorithm builds on the ideas of the UORO algorithm, but uses Kronecker factors for the RTRL approximation. We show both theoretically and empirically that this drastically reduces the noise in the approximation and greatly improves learning. However, this comes at the cost of requiring more computation and only being applicable to a class of RNNs.
Still, this class of RNNs is very general and includes Tanh-RNN and Recurrent Highway Networks [20] among others.

The main contributions of this paper are:

• We propose the KF-RTRL online learning algorithm.
• We theoretically prove that our algorithm is unbiased and that, under reasonable assumptions, the noise is stable over time and asymptotically by a factor n smaller than the noise of UORO.
• We test KF-RTRL on a binary string memorization task, where our networks can learn dependencies of up to 36 steps.
• We evaluate KF-RTRL on character-level Penn TreeBank, where its performance almost matches the one of TBPTT for 25 steps.
• We empirically confirm that the variance of KF-RTRL is stable over time and that increasing the number of units does not increase the noise significantly.

2 Related Work

Training Recurrent Neural Networks for finite length sequences is currently almost exclusively done using BackPropagation Through Time [15] (BPTT). The network is "unrolled" over time and is considered as a feed-forward model with shared parameters (the same parameters are used at each time step). This makes it easy to do backpropagation and exactly calculate the gradients in order to do gradient descent.

However, this approach does not scale well to very long sequences, as the whole sequence needs to be processed before calculating the gradients, which makes training extremely slow and very memory intensive. In fact, BPTT cannot be applied to an online stream of data. In order to circumvent this issue, Truncated BackPropagation Through Time [18] (TBPTT) is generally used. The RNN is only "unrolled" for a fixed number of steps (the truncation horizon) and gradients beyond these steps are ignored.
Therefore, if the truncation horizon is smaller than the length of the dependencies needed to solve a task, the network cannot learn it.

Several approaches have been proposed to deal with the truncation horizon. Anticipated Reweighted Truncated Backpropagation [17] samples different truncation horizons and weights the calculated gradients such that the expected gradient is that of the whole sequence. Jaderberg et al. [5] proposed Decoupled Neural Interfaces, where the network learns to predict incoming gradients from the future. Then, it uses these predictions for learning. The main assumption of this model is that all future gradients can be computed as a function of the current hidden state.

A more extreme proposal is calculating the gradients forward and not doing any kind of BPTT. This is known as Real-Time Recurrent Learning [19] (RTRL). RTRL allows updating the model parameters online after observing each input/output pair; we explain it in detail in Section 3. However, its large running time of order O(n^4) and memory requirements of order O(n^3), where n is the number of units of a fully connected RNN, make it impractical for large networks. To fix this, Tallec and Ollivier [16] presented the Unbiased Online Recurrent Optimization (UORO) algorithm. This algorithm approximates RTRL using a low rank matrix. This makes the run-time of the algorithm of the same order as a single forward pass in an RNN, O(n^2). However, the low rank approximation introduces a lot of variance, which negatively affects learning, as we show in Section 5.

Other alternatives are Reservoir computing approaches [8] like Echo State Networks [6] or Liquid State Machines [9]. In these approaches, the recurrent weights are fixed and only the output connections are learned. This allows online learning, as gradients do not need to be propagated back in time.
However, it prevents any kind of learning in the recurrent connections, which makes the RNN computationally much less powerful.

3 Real-Time Recurrent Learning and UORO

RTRL [19] is an online learning algorithm for RNNs. Contrary to TBPTT, no previous inputs or network states need to be stored. At any time-step t, RTRL only requires the hidden state h_t, the input x_t and dh_{t-1}/dθ in order to compute dh_t/dθ. With dh_t/dθ at hand, dL_t/dθ = dL_t/dh_t · dh_t/dθ is obtained by applying the chain rule. Thus, the parameters can be updated online, that is, for each observed input/output pair one parameter update can be performed.

In order to present the RTRL update precisely, let us first define an RNN formally. An RNN is a differentiable function f that maps an input x_t, a hidden state h_{t-1} and parameters θ to the next hidden state h_t = f(x_t, h_{t-1}, θ). At any time-step t, RTRL computes dh_t/dθ by applying the chain rule:

dh_t/dθ = ∂h_t/∂h_{t-1} · dh_{t-1}/dθ + ∂h_t/∂x_t · dx_t/dθ + ∂h_t/∂θ    (1)
        = ∂h_t/∂h_{t-1} · dh_{t-1}/dθ + ∂h_t/∂θ ,    (2)

where the middle term vanishes because we assume that the inputs do not depend on the parameters. For notational simplicity, define G_t := dh_t/dθ, H_t := ∂h_t/∂h_{t-1} and F_t := ∂h_t/∂θ, which reduces the above equation to

dh_t/dθ = G_t = H_t G_{t-1} + F_t .    (3)

Both F_t and H_t are straight-forward to compute for RNNs. We assume h_0 to be fixed, which implies G_0 = 0. With all this, RTRL obtains the exact gradient G_t for each time-step and enables online updates of the parameters. However, updating the parameters means that G_t is only exact in the limit where the learning rate is arbitrarily small.
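To make the recursion of Equation 3 concrete, the following minimal sketch (our own illustration, not the paper's implementation) runs exact RTRL for a vanilla tanh-RNN, where the factored form of F_t anticipates Lemma 1:

```python
import numpy as np

def rtrl_step(W, h_prev, x, G_prev):
    """One exact RTRL step for a tanh-RNN h_t = tanh(W @ [h_{t-1}; x_t]).

    G has shape (n, n*(n+m)) and stores dh_t/dvec(W), with vec(W) taken
    column-major, so parameter W[k, i] has flat index i*n + k.
    """
    n = h_prev.shape[0]
    hx = np.concatenate([h_prev, x])       # concatenated state and input
    h = np.tanh(W @ hx)                    # new hidden state
    d = 1.0 - h ** 2                       # pointwise derivative of tanh
    H = d[:, None] * W[:, :n]              # H_t = dh_t/dh_{t-1}
    F = np.kron(hx[None, :], np.diag(d))   # F_t = dh_t/dtheta: a Kronecker product
    G = H @ G_prev + F                     # Equation (3): G_t = H_t G_{t-1} + F_t
    return h, G
```

Storing G costs O(n^3) memory and the product H @ G_prev costs O(n^4) time, which is exactly the scaling that makes full RTRL impractical.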
This is because the θ that was used for computing G_t is different from the θ after the parameter update. In practice learning rates are sufficiently small such that this is not an issue.

The downside of RTRL is that for a fully connected RNN with n units the matrices H_t and G_t have size n × n and n × n^2, respectively. Therefore, computing Equation 3 takes O(n^4) operations and requires O(n^3) storage, which makes RTRL impractical for large networks.

The UORO algorithm [16] addresses this issue and reduces run-time and memory requirements to O(n^2) at the cost of obtaining an unbiased but noisy estimate of G_t. More precisely, the UORO algorithm keeps an unbiased rank-one estimate of G_t by approximating G_t as the outer product v w^T of two vectors of size n and size n^2, respectively. At any time t, the UORO update consists of two approximation steps. Assume that the unbiased approximation Ĝ_{t-1} = v w^T of G_{t-1} is given. First, F_t is approximated by a rank-one matrix. Second, the sum of two rank-one matrices H_t Ĝ_{t-1} + F_t is approximated by a rank-one matrix, yielding the estimate Ĝ_t of G_t. The estimate Ĝ_t is provably unbiased and the UORO update requires the same run-time and memory as updating the RNN [16].

4 Kronecker Factored RTRL

Our proposed Kronecker Factored RTRL algorithm (KF-RTRL) is an online learning algorithm for RNNs, which does not require storing any previous inputs or network states. KF-RTRL approximates G_t, which is the derivative of the internal state with respect to the parameters (see Section 3), by a Kronecker product. The following theorem shows that the KF-RTRL algorithm satisfies various desirable properties.

Theorem 1. For the class of RNNs defined in Lemma 1, the estimate G′_t obtained by the KF-RTRL algorithm satisfies

1. G′_t is an unbiased estimate of G_t, that is, E[G′_t] = G_t, and

2.
assuming that the spectral norm of H_t is at most 1 − ε for some arbitrarily small ε > 0, the mean of the variances of the entries of G′_t is, at any time t, of order O(n^{-1}).

Moreover, one time-step of the KF-RTRL algorithm requires O(n^3) operations and O(n^2) memory.

We remark that the class of RNNs defined in Lemma 1 contains many widely used RNN architectures like Recurrent Highway Networks and Tanh-RNNs, and can be extended to include LSTMs [4], see Section A.6. Further, the assumption that the spectral norm of H_t is at most 1 − ε is reasonable, as otherwise gradients might grow exponentially, as noted by Bengio et al. [2]. Lastly, the bottleneck of the algorithm is a matrix multiplication, and thus for sufficiently large matrices an algorithm with a better run time than O(n^3) may be practical.

In the remainder of this section, we explain the main ideas behind the KF-RTRL algorithm (formal proofs are given in the appendix). In the subsequent Section 5 we show that these theoretical properties carry over into practical application. KF-RTRL is well suited for learning long-term dependencies (see Section 5.1) and almost matches the performance of TBPTT on a complex real world task, that is, character level language modeling (see Section 5.2). Moreover, we confirm empirically that the variance of the KF-RTRL estimate is stable over time and scales well as the network size increases (see Section 5.3).

Before giving the theoretical background and motivating the key ideas of KF-RTRL, we give a brief overview of the KF-RTRL algorithm. At any time-step t, KF-RTRL maintains a vector u_t and a matrix A_t, such that G′_t = u_t ⊗ A_t satisfies E[G′_t] = G_t. Both H_t G′_{t-1} and F_t are factored as a Kronecker product, and then the sum of these two Kronecker products is approximated by one Kronecker product.
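A minimal numeric sketch of this merge step (our own illustrative code with hypothetical helper names; the sign variables and scaling follow the construction given later in Equations 7 and 8) is:

```python
import numpy as np

def merge_kronecker_sum(u, HA, h_hat, D, rng):
    """Collapse u ⊗ HA + h_hat ⊗ D into a single Kronecker pair (u_t, A_t).

    The returned pair is a random estimate whose expectation over the
    independent signs c1, c2 equals the exact sum of Kronecker products.
    """
    # Variance-minimizing scalings, cf. the choice p_i = sqrt(||B_i|| / ||A_i||)
    p1 = np.sqrt(np.linalg.norm(HA) / np.linalg.norm(u))
    p2 = np.sqrt(np.linalg.norm(D) / np.linalg.norm(h_hat))
    c1, c2 = rng.choice([-1.0, 1.0], size=2)   # independent random signs
    u_t = c1 * p1 * u + c2 * p2 * h_hat
    A_t = (c1 / p1) * HA + (c2 / p2) * D
    return u_t, A_t
```

Averaging np.kron(u_t, A_t) over many independent draws recovers np.kron(u, HA) + np.kron(h_hat, D), since the cross terms cancel in expectation.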
This approximation step (see Lemma 2) works analogously to the second approximation step of the UORO algorithm (see rank-one trick, Proposition 1 in [16]). The detailed algorithmic steps of KF-RTRL are presented in Algorithm 1 and motivated below.

Theoretical motivation of the KF-RTRL algorithm

The key observation that motivates our algorithm is that for many popular RNN architectures F_t can be exactly decomposed as the Kronecker product of a vector and a diagonal matrix, see Lemma 1. In the following Lemma, we show that this property is satisfied by a large class of RNNs that includes many popular RNN architectures like Tanh-RNN and Recurrent Highway Networks. Intuitively, an RNN of this class computes several linear transformations (corresponding to the matrices W^1, ..., W^r) and merges the resulting vectors through pointwise non-linearities. For example, in the case of RHNs, there are two linear transformations that compute the new candidate cell and the forget gate, which then are merged through pointwise non-linearities to generate the new hidden state.

Lemma 1. Assume the learnable parameters θ are a set of matrices W^1, ..., W^r, let ĥ_{t-1} be the hidden state h_{t-1} concatenated with the input x_t, and let z^k = ĥ_{t-1} W^k for k = 1, ..., r. Assume that h_t is obtained by point-wise operations over the z^k's, that is, (h_t)_j = f(z^1_j, ..., z^r_j), where f is such that ∂f/∂z^k_j is bounded by a constant. Let D^k ∈ R^{n×n} be the diagonal matrix defined by D^k_{jj} = ∂(h_t)_j/∂z^k_j, and let D = (D^1 | ... | D^r). Then, it holds that ∂h_t/∂θ = ĥ_{t-1} ⊗ D.

Further, we observe that the sum of two Kronecker products can be approximated by a single Kronecker product in expectation.
The following lemma, which is the analogue of Proposition 1 in [14] for Kronecker products, states how this is achieved.

Algorithm 1: One step of KF-RTRL (from time t − 1 to t)

Inputs:
- input x_t, target y_t, previous recurrent state h_{t-1} and parameters θ
- u_{t-1} and A_{t-1} such that E[u_{t-1} ⊗ A_{t-1}] = G_{t-1}
- SGDopt and η_{t+1}: stochastic optimizer and its learning rate

Outputs:
- new recurrent state h_t and updated parameters θ
- u_t and A_t such that E[u_t ⊗ A_t] = G_t

/* Run one step of the RNN and compute the necessary matrices */
z^k_j ← compute linear transformations using x_t, h_{t-1} and θ
h_t ← compute h_t using point-wise operations over the z^k_j
ĥ_{t-1} ← concatenate h_{t-1} and x_t
D^k_{jj} ← ∂(h_t)_j/∂z^k_j
H ← ∂h_t/∂h_{t-1}
H′ ← H · A_{t-1}
/* Compute variance minimization and random multipliers */
p_1 ← sqrt(||H′||_HS / ||u_{t-1}||_HS)
p_2 ← sqrt(||D||_HS / ||ĥ_{t-1}||_HS)
c_1, c_2 ← independent random signs
/* Compute next approximation */
u_t ← c_1 p_1 u_{t-1} + c_2 p_2 ĥ_{t-1}
A_t ← c_1 (1/p_1) H′ + c_2 (1/p_2) D
/* Compute gradients and update parameters */
L_grad ← u_t ⊗ (∂L(y_t, h_t)/∂h_t · A_t)
SGDopt(L_grad, η_{t+1}, θ)

Lemma 2. Let C = A_1 ⊗ B_1 + A_2 ⊗ B_2, where the matrix A_1 has the same size as the matrix A_2 and B_1 has the same size as B_2. Let c_1 and c_2 be chosen independently and uniformly at random from {−1, 1} and let p_1, p_2 > 0 be positive reals. Define A′ = c_1 p_1 A_1 + c_2 p_2 A_2 and B′ = c_1 (1/p_1) B_1 + c_2 (1/p_2) B_2.
Then, A′ ⊗ B′ is an unbiased approximation of C, that is, E[A′ ⊗ B′] = C. Moreover, the variance of this approximation is minimized by setting the free parameters p_i = sqrt(||B_i|| / ||A_i||).

Lastly, we show by induction that random vectors u_t and random matrices A_t exist, such that G′_t = u_t ⊗ A_t satisfies E[G′_t] = G_t. Assume that G′_{t-1} = u_{t-1} ⊗ A_{t-1} satisfies E[G′_{t-1}] = G_{t-1}. Equation 3 and Lemma 1 imply that

G_t = H_t E[G′_{t-1}] + F_t = H_t E[u_{t-1} ⊗ A_{t-1}] + ĥ_t ⊗ D_t .    (4)

Next, by linearity of expectation and since the first dimension of u_{t-1} is 1, it follows

G_t = E[H_t (u_{t-1} ⊗ A_{t-1}) + ĥ_t ⊗ D_t] = E[u_{t-1} ⊗ (H_t A_{t-1}) + ĥ_t ⊗ D_t] .    (5)

Finally, we obtain by Lemma 2 for any p_1, p_2 > 0

G_t = E[(c_1 p_1 u_{t-1} + c_2 p_2 ĥ_t) ⊗ (c_1 (1/p_1) (H_t A_{t-1}) + c_2 (1/p_2) D_t)] ,    (6)

where the expectation is taken over the probability distribution of u_{t-1}, A_{t-1}, c_1 and c_2.

With these observations at hand, we are ready to present the KF-RTRL algorithm. At any time-step t we receive the estimates u_{t-1} and A_{t-1} from the previous time-step. First, compute h_t, D_t and H_t. Then, choose c_1 and c_2 uniformly at random from {−1, +1} and compute

u_t = c_1 p_1 u_{t-1} + c_2 p_2 ĥ_t ,    (7)

A_t = c_1 (1/p_1) (H_t A_{t-1}) + c_2 (1/p_2) D_t ,    (8)

where p_1 = sqrt(||H_t A_{t-1}|| / ||u_{t-1}||) and p_2 = sqrt(||D_t|| / ||ĥ_t||). Lastly, our algorithm computes dL_t/dh_t · G′_t, which is used for optimizing the parameters. For a detailed pseudo-code of the KF-RTRL algorithm
In order to see that dLt\ndht\n\n\u00b7 G\u2032\n\nlinearity of expectation: Eh dLt\n\ndht\n\n\u00b7 G\u2032\n\nti = dLt\n\ndht\n\nt is an unbiased estimate of dLt\n\u00b7 E [G\u2032\n\n\u00b7 Gt = dLt\nd\u03b8 .\n\nt] = dLt\ndht\n\nd\u03b8 , we apply once more\n\nOne KF-RTRL update has run-time O(n3) and requires O(n2) memory. In order to see the statement\nfor the memory requirement, note that all involved matrices and vectors have O(n2) elements,\nexcept G\u2032\nd\u03b8 , because\ndLt\nAt) can be evaluated in this order. In order to see the statement\ndht\nfor the run-time, note that Ht and At\u22121 have both size O(n) \u00d7 O(n). Therefore, computing HtAt\u22121\nrequires O(n3) operations. All other arithmetic operations trivially require run-time O(n2).\n\nt. However, we do not need to explicitly compute G\u2032\n\nt in order to obtain dLt\n\n\u00b7 ut \u2297 At = ut \u2297 ( dLt\ndht\n\nt = dLt\ndht\n\n\u00b7 G\u2032\n\nThe proofs of Lemmas 1 and 2 and of the second statement of Theorem 1 are given in the appendix.\n\nComparison of the KF-RTRL with the UORO algorithm\n\nSince the variance of the gradient estimate is directly linked to convergence speed and performance,\nlet us \ufb01rst compare the variance of the two algorithms. Theorem 1 states that the mean of the variances\nt are of order O(n\u22121). In the appendix, we show a slightly stronger statement, that\nof the entries of G\u2032\nis, if kFtk \u2264 C for all t, then the mean of the variances of the entries is of order O( C 2\nn3 ), where n3 is\nthe number of elements of Gt. The bound O(n\u22121) is obtained by a trivial bound on the size of the\nentries of ht and Dt and using khtkkDtk = kFtk. In the appendix, we show further that already the\n\ufb01rst step of the UORO approximation, in which Ft is approximated by a rank-one matrix, introduces\nnoise of order (n \u2212 1)kFtk. 
Assuming that all further approximations would not add any noise, the same trivial bounds on ||F_t|| yield a mean variance of O(1). We conclude that the variance of KF-RTRL is asymptotically by (at least) a factor n smaller than the variance of UORO.

Next, let us compare the generality of the algorithms when applying them to different network architectures. The KF-RTRL algorithm requires that in one time-step each parameter only affects one element of the next hidden state (see Lemma 1). Although many widely used RNN architectures satisfy this requirement, seen from this angle, the UORO algorithm is favorable, as it is applicable to arbitrary RNN architectures.

Finally, let us compare memory requirements and runtime of KF-RTRL and UORO. In terms of memory requirements, both algorithms require O(n^2) storage and perform equally well. In terms of run-time, KF-RTRL requires O(n^3), while UORO requires O(n^2) operations. However, the faster run-time of UORO comes at the cost of higher variance and therefore worse performance. In order to reduce the variance of UORO by a factor n, one would need n independent samples of G′_t. This could be achieved by reducing the learning rate by a factor of n, which would then require n times more data, or by sampling G′_t n times in parallel, which would require n times more memory. Still, our empirical investigation shows that KF-RTRL outperforms UORO, even when taking n UORO samples of G_t to reduce the variance (see Figure 3). Moreover, for sufficiently large networks the scaling of the KF-RTRL run-time improves by using fast matrix multiplication algorithms.

5 Experiments

In this section, we quantify the effect that the reduced variance of KF-RTRL, compared to the one of UORO, has on learning. First, we evaluate the ability of learning long-term dependencies on a deterministic binary string memorization task.
Since most real world problems are more complex and of stochastic nature, we secondly evaluate the performance of the learning algorithms on a character level language modeling task, which is a more realistic benchmark. For these two tasks, we also compare to learning with Truncated BPTT and measure the performance of the considered algorithms based on 'data time', i.e. the amount of data observed by the algorithm. Finally, we investigate the variance of KF-RTRL and UORO by comparing to their exact counterpart, RTRL. For all experiments, we use a single-layer Recurrent Highway Network [20].²

²For implementation simplicity, we use 2·sigmoid(x) − 1 instead of tanh(x) as non-linearity function. Both functions have very similar properties, and therefore, we do not believe this has any significant effect.

Input: #01101––––––        Output: ––––––#01101
Input: #11010––––––        Output: ––––––#11010
Input: #00100––––––        Output: ––––––#00100

Figure 1: Copy Task: Figure 1(a) shows the learning curves of UORO, KF-RTRL and TBPTT with truncation horizon of 25 steps. We plot the mean and standard deviation (shaded area) over 5 trials. Figure 1(b) shows three input and output examples with T = 5.

5.1 Copy Task

In the copy task experiment, a binary string is presented sequentially to an RNN. Once the full string has been presented, it should reconstruct the original string without any further information. Figure 1(b) shows several input-output pairs. We refer to the length of the string as T. Figure 1(a) summarizes the results. The smaller variance of KF-RTRL greatly helps learning faster and capturing longer dependencies. KF-RTRL and UORO manage to solve the task on average up to T = 36 and T = 13, respectively.
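For concreteness, the input/output format of Figure 1(b) can be produced by a short generator (a hypothetical sketch of the data format only, not the training pipeline; an ASCII '-' stands in for the blank marker):

```python
import random

def copy_task_sample(T, rng):
    """Generate one copy-task pair: present '#' plus a binary string,
    then expect the same '#' plus string back after T+1 blank steps.

    Input and target both have length 2*(T+1); '-' marks positions where
    no symbol is presented or no output is required.
    """
    s = "".join(rng.choice("01") for _ in range(T))
    blanks = "-" * (T + 1)
    inp = "#" + s + blanks
    out = blanks + "#" + s
    return inp, out
```

Called with T = 5, this reproduces the shape of the examples above, e.g. an input like "#01101------" paired with the output "------#01101".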
As expected, TBPTT cannot learn dependencies that are longer than the truncation horizon.

In this experiment, we start with T = 1 and when the RNN error drops below 0.15 bits/char, we increase T by one. After each sequence, the hidden state is reset to all zeros. To improve performance, the length of each sample is picked uniformly at random from T − 5 to T. This forces the network to learn a general algorithm for the task, rather than just learning to solve sequences of length T. We use an RHN with 256 units and a batch size of 256. We optimize the log-likelihood using the Adam optimizer [7] with default TensorFlow [1] parameters, β_1 = 0.9 and β_2 = 0.999. For each model we pick the optimal learning rate from {10^-2.5, 10^-3, 10^-3.5, 10^-4}. We repeat each experiment 5 times.

5.2 Character level language modeling on the Penn Treebank dataset

A standard test for RNNs is character level language modeling. The network receives a text sequentially, character by character, and at each time-step it must predict the next character. This task is very challenging, as it requires both long and short term dependencies. Additionally, it is highly stochastic, i.e. for the same input sequence there are many possible continuations, but only one is observed at each training step. Figure 2 and Table 1 summarize the results. Truncated BPTT outperforms both online learning algorithms, but KF-RTRL almost reaches its performance and considerably outperforms UORO. For this task, the noise introduced in the approximation is more harmful than the truncation bias. This is probably the case because the short term dependencies dominate the loss, as indicated by the small difference between TBPTT with truncation horizon 5 and 25.

For this experiment we use the Penn TreeBank [10] dataset, which is a collection of Wall Street Journal articles. The text is lower cased and the vocabulary is restricted to 10K words.
Out of vocabulary words are replaced by "" and numbers by "N". We split the data following Mikolov et al. [13]. The experimental setup is the same as in the Copy task, and we pick the optimal learning rate from the same range. Apart from that, we reset the hidden state to the all zeros state with probability 0.01 at each time step. This technique was introduced by Melis et al. [11] to improve the performance on the validation set, as the initial state for the validation is the all zeros state. Additionally, this helps the online learning algorithms, as it resets the gradient approximation, getting rid of stale gradients. Similar techniques have been shown [3] to also improve RTRL.

Table 1: Results on Penn TreeBank. Merity et al. [12] is currently the state of the art (trained with TBPTT). For simplicity we do not report standard deviations, as all of them are smaller than 0.03.

Name                 Validation   Test    #params
KF-RTRL              1.77         1.72    133K
UORO                 2.63         2.61    133K
TBPTT-5              1.64         1.58    133K
TBPTT-25             1.61         1.56    133K
Merity et al. [12]   -            1.18    13.8M

Figure 2: Validation performance on Penn TreeBank in bits per character (BPC). The small variance of the KF-RTRL approximation considerably improves the performance compared to UORO.

Figure 3: Variance analysis: We compare the cosine of the angle between the approximated and the true value of dL/dθ. A cosine of 1 implies that the approximation and the true value are exactly aligned, while a random vector gets a cosine of 0 in expectation. Figure 3(a) shows that the variance is stable over time for the three algorithms.
Figure 3(b) shows that the variance of KF-RTRL and UORO-AVG is almost unaffected by the number of units, while UORO degrades more quickly as the network size increases.

5.3 Variance Analysis

With our final set of experiments, we empirically measure how the noise evolves over time and how it is affected by the number of units n. Here, we also compare to UORO-AVG, which computes n independent samples of UORO and averages them to reduce the variance. The computation costs of UORO-AVG are on par with those of KF-RTRL, O(n^3); however, its memory costs of O(n^3) are higher than the O(n^2) of KF-RTRL. For each experiment, we compute the angle φ between the gradient estimate and the exact gradient of the loss with respect to the parameters. Intuitively, φ measures how aligned the gradients are, even if the magnitude is different. Figure 3(a) shows that φ is stable over time and the noise does not accumulate for any of the three algorithms. Figure 3(b) shows that KF-RTRL and UORO-AVG have similar performance as the number of units increases. This observation is in line with the theoretical prediction in Section A.5 that the variance of UORO is by a factor n larger than the KF-RTRL variance (averaging n samples as done in UORO-AVG reduces the variance by a factor n).

In the first experiment, we run several untrained RHNs with 256 units over the first 10000 characters of Penn TreeBank. In the second experiment, we compute φ after running RHNs with different numbers of units for 100 steps on Penn TreeBank. We perform 100 repetitions per experiment and plot the mean and standard deviation.

6 Conclusion

In this paper, we have presented the KF-RTRL online learning algorithm.
We have proven that it approximates RTRL in an unbiased way, and that under reasonable assumptions the noise is stable over time and much smaller than that of UORO, the only other previously known unbiased RTRL approximation algorithm. Additionally, we have empirically verified that the reduced variance of our algorithm greatly improves learning on the two tested tasks. In the first task, an RHN trained with KF-RTRL effectively captures long-term dependencies (it learns to memorize binary strings of length up to 36). In the second task, it almost matches the performance of TBPTT on a standard RNN benchmark, character-level language modeling on Penn TreeBank.

More importantly, our work opens up interesting directions for future work, as even minor reductions of the noise could make the approach a viable alternative to TBPTT, especially for tasks with inherent long-term dependencies. For example, constraining the weights, constraining the activations, or using some form of regularization could reduce the noise. Further, it may be possible to design architectures that make the approximation less noisy. Moreover, one might attempt to improve the runtime of KF-RTRL by using approximate matrix multiplication algorithms or by inducing properties on the H_t matrix that allow for fast matrix multiplication, such as sparsity or low rank.

This work advances the understanding of how unbiased gradients can be computed, which is of central importance, as unbiasedness is essential for theoretical convergence guarantees. Since RTRL-based approaches satisfy this key assumption, it is of interest to develop them further.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al.
TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265-283, 2016.

[2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

[3] T. Catfolis. A method for improving the real-time recurrent learning algorithm. Neural Networks, 6(6):807-821, 1993.

[4] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

[5] M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, and K. Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.

[6] H. Jaeger. The "echo state" approach to analysing and training recurrent neural networks - with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34):13, 2001.

[7] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[8] M. Lukoševičius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127-149, 2009.

[9] W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531-2560, 2002.

[10] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

[11] G. Melis, C. Dyer, and P. Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.

[12] S. Merity, N. S. Keskar, and R. Socher. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240, 2018.

[13] T. Mikolov, I.
Sutskever, A. Deoras, H.-S. Le, S. Kombrink, and J. Cernocky. Subword language modeling with neural networks. Preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf), 2012.

[14] Y. Ollivier, C. Tallec, and G. Charpiat. Training recurrent networks online without backtracking. arXiv preprint arXiv:1507.07680, 2015.

[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

[16] C. Tallec and Y. Ollivier. Unbiased online recurrent optimization. arXiv preprint arXiv:1702.05043, 2017.

[17] C. Tallec and Y. Ollivier. Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209, 2017.

[18] R. J. Williams and J. Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2:490-501, 1990.

[19] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989.

[20] J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.