{"title": "Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 15570, "page_last": 15579, "abstract": "We propose a novel memory cell for recurrent neural networks that dynamically maintains information across long windows of time using relatively few resources. The Legendre Memory Unit~(LMU) is mathematically derived to orthogonalize its continuous-time history -- doing so by solving $d$ coupled ordinary differential equations~(ODEs), whose phase space linearly maps onto sliding windows of time via the Legendre polynomials up to degree $d - 1$. Backpropagation across LMUs outperforms equivalently-sized LSTMs on a chaotic time-series prediction task, improves memory capacity by two orders of magnitude, and significantly reduces training and inference times. LMUs can efficiently handle temporal dependencies spanning $100\\text{,}000$ time-steps, converge rapidly, and use few internal state-variables to learn complex functions spanning long windows of time -- exceeding state-of-the-art performance among RNNs on permuted sequential MNIST. These results are due to the network's disposition to learn scale-invariant features independently of step size. Backpropagation through the ODE solver allows each layer to adapt its internal time-step, enabling the network to learn task-relevant time-scales. We demonstrate that LMU memory cells can be implemented using $m$ recurrently-connected Poisson spiking neurons, $\\mathcal{O}( m )$ time and memory, with error scaling as $\\mathcal{O}( d / \\sqrt{m} )$. We discuss implementations of LMUs on analog and digital neuromorphic hardware.", "full_text": "Legendre Memory Units: Continuous-Time\nRepresentation in Recurrent Neural Networks\n\nAaron R. 
Voelker1,2

Ivana Kajić1

Chris Eliasmith1,2

1Centre for Theoretical Neuroscience, Waterloo, ON   2Applied Brain Research, Inc.

{arvoelke, i2kajic, celiasmith}@uwaterloo.ca

Abstract

We propose a novel memory cell for recurrent neural networks that dynamically maintains information across long windows of time using relatively few resources. The Legendre Memory Unit (LMU) is mathematically derived to orthogonalize its continuous-time history – doing so by solving d coupled ordinary differential equations (ODEs), whose phase space linearly maps onto sliding windows of time via the Legendre polynomials up to degree d − 1. Backpropagation across LMUs outperforms equivalently-sized LSTMs on a chaotic time-series prediction task, improves memory capacity by two orders of magnitude, and significantly reduces training and inference times. LMUs can efficiently handle temporal dependencies spanning 100,000 time-steps, converge rapidly, and use few internal state-variables to learn complex functions spanning long windows of time – exceeding state-of-the-art performance among RNNs on permuted sequential MNIST. These results are due to the network's disposition to learn scale-invariant features independently of step size. Backpropagation through the ODE solver allows each layer to adapt its internal time-step, enabling the network to learn task-relevant time-scales. We demonstrate that LMU memory cells can be implemented using m recurrently-connected Poisson spiking neurons, O(m) time and memory, with error scaling as O(d/√m). We discuss implementations of LMUs on analog and digital neuromorphic hardware.

1 Introduction

A variety of recurrent neural network (RNN) architectures have been used for tasks that require learning long-range temporal dependencies, including machine translation [3, 26, 34], image caption generation [36, 39], and speech recognition [10, 16].
An architecture that has been especially successful in modelling complex temporal relationships is the LSTM [18], which owes its superior performance to a combination of memory cells and gating mechanisms that maintain and nonlinearly mix information over time.

LSTMs are designed to help alleviate the issue of vanishing and exploding gradients commonly associated with training RNNs [5]. However, they are still prone to unstable gradients and saturation effects for sequences of length T > 100 [2, 22]. To combat this problem, extensive hyperparameter searches, gradient clipping strategies, layer normalization, and many other RNN training "tricks" are commonly employed [21].

Although standard LSTMs with saturating units have recently been found to have a memory of about T = 500–1,000 time-steps [25], non-saturating units in RNNs can improve gradient flow and scale to 2,000–5,000 time-steps before encountering instabilities [7, 25]. However, signals in realistic natural environments are continuous in time, and it is unclear how existing RNNs can cope with conditions as T → ∞. This is particularly relevant for models that must leverage long-range dependencies within an ongoing stream of continuous-time data, and run in real time given limited memory.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Interestingly, biological nervous systems naturally come equipped with mechanisms that allow them to solve problems relating to the processing of continuous-time information – both from a learning and representational perspective. Neurons in the brain transmit information using spikes, and filter those spikes continuously over time through synaptic connections. A spiking neural network called the Delay Network [38] embraces these mechanisms to approximate an ideal delay line by converting it into a finite number of ODEs integrated over time.
This model reproduces properties of "time cells" observed in the hippocampus, striatum, and cortex [13, 38], and has been deployed on ultra low-power [6] analog and digital neuromorphic hardware including Braindrop [28] and Loihi [12, 37].

This paper applies the memory model from [38] to the domain of deep learning. In particular, we propose the Legendre Memory Unit (LMU), a new recurrent architecture and method of weight initialization that provides theoretical guarantees for learning long-range dependencies, even as the discrete time-step, Δt, approaches zero. This enables the gradient to flow across the continuous history of internal feature representations. We compare the efficiency and accuracy of this approach to state-of-the-art results on a number of benchmarks designed to stress-test the ability of recurrent architectures to learn temporal relationships spanning long intervals of time.

2 Legendre Memory Unit

Memory Cell Dynamics   The main component of the Legendre Memory Unit (LMU) is a memory cell that orthogonalizes the continuous-time history of its input signal, u(t) ∈ R, across a sliding window of length θ ∈ R>0. The cell is derived from the linear transfer function for a continuous-time delay, F(s) = e^(−θs), which is best-approximated by d coupled ordinary differential equations (ODEs):

$$\theta \dot{m}(t) = A m(t) + B u(t) \qquad (1)$$

where m(t) ∈ R^d is a state-vector with d dimensions.
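As a concrete sketch (ours, not the authors' released code), the state-space matrices (A, B) defined by equation 2 below can be built directly from their closed-form entries, and equation 1 can be advanced with a simple Euler update as in equation 5:

```python
# Sketch (not the paper's code): build the LMU state-space matrices (A, B)
# from equation 2 and advance theta*m'(t) = A m(t) + B u(t) with Euler's method.

def lmu_matrices(d):
    """A[i][j] = (2i+1) * (-1 if i<j else (-1)^(i-j+1));  B[i] = (2i+1)*(-1)^i."""
    A = [[(2 * i + 1) * (-1.0 if i < j else (-1.0) ** (i - j + 1))
          for j in range(d)]
         for i in range(d)]
    B = [(2 * i + 1) * (-1.0) ** i for i in range(d)]
    return A, B

def euler_step(m, u, A, B, theta, dt):
    """One step of equation 1: m += (dt/theta) * (A m + B u)."""
    d = len(m)
    return [m[i] + (dt / theta) * (sum(A[i][j] * m[j] for j in range(d)) + B[i] * u)
            for i in range(d)]

A, B = lmu_matrices(4)
m = [0.0] * 4
for _ in range(1000):                      # feed a constant input u = 1
    m = euler_step(m, 1.0, A, B, theta=1.0, dt=0.001)
```

The toy dimensionality (d = 4) and the constant test input are illustrative choices; the experiments in this paper use much larger d and ZOH rather than Euler discretization.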
The ideal state-space matrices, (A, B), are derived through the use of Padé [30] approximants [37]:

$$A = [a]_{ij} \in \mathbb{R}^{d \times d}, \quad a_{ij} = (2i+1)\begin{cases} -1 & i < j \\ (-1)^{i-j+1} & i \ge j \end{cases}$$
$$B = [b]_i \in \mathbb{R}^{d \times 1}, \quad b_i = (2i+1)(-1)^i, \qquad i, j \in [0, d-1]. \qquad (2)$$

The key property of this dynamical system is that m represents sliding windows of u via the Legendre [24] polynomials up to degree d − 1:

$$u(t - \theta') \approx \sum_{i=0}^{d-1} P_i\!\left(\frac{\theta'}{\theta}\right) m_i(t), \quad 0 \le \theta' \le \theta, \quad P_i(r) = (-1)^i \sum_{j=0}^{i} \binom{i}{j}\binom{i+j}{j}(-r)^j \qquad (3)$$

where P_i(r) is the ith shifted Legendre polynomial [32]. This gives a unique and optimal decomposition, wherein functions of m correspond to computations across windows of length θ, projected onto d orthogonal basis functions.

Discretization   We map these equations onto the memory of a recurrent neural network, m_t ∈ R^d, given some input u_t ∈ R, indexed at discrete moments in time, t ∈ N:

$$m_t = \bar{A} m_{t-1} + \bar{B} u_t \qquad (4)$$

where (Ā, B̄) are the discretized matrices provided by the ODE solver for some time-step Δt relative to the window length θ. For instance, Euler's method supposes Δt is sufficiently small:

$$\bar{A} = (\Delta t / \theta) A + I, \qquad \bar{B} = (\Delta t / \theta) B. \qquad (5)$$

We also consider discretization methods such as zero-order hold (ZOH) as well as those that can adapt their internal time-steps [9].

Figure 1: Shifted Legendre polynomials (d = 12). The memory of the LMU represents the entire sliding window of input history as a linear combination of these scale-invariant polynomials.
Increasing the number of dimensions supports the storage of higher-frequency inputs relative to the time-scale.

Approximation Error   When d = 1, the memory is analogous to a single-unit LSTM without any gating mechanisms (i.e., a leaky integrator with time-constant θ). As d increases, so does its memory capacity relative to frequency content. In particular, the approximation error in equation 3 scales as O(θω/d), where ω is the frequency of the input u that is to be committed to memory [38].

Layer Design   The LMU takes an input vector, x_t, and generates a hidden state, h_t ∈ R^n. Each layer maintains its own hidden state and memory vector. The state mutually interacts with the memory, m_t ∈ R^d, in order to compute nonlinear functions across time, while dynamically writing to memory. Similar to the NRU [7], the state is a function of the input, previous state, and current memory:

$$h_t = f(W_x x_t + W_h h_{t-1} + W_m m_t) \qquad (6)$$

where f is some chosen nonlinearity (e.g., tanh) and W_x, W_h, W_m are learned kernels. Note this decouples the size of the layer's hidden state (n) from the size of the layer's memory (d), and requires holding n + d variables in memory between time-steps. The input signal that writes to the memory (via equation 4) is:

$$u_t = e_x^{\mathsf{T}} x_t + e_h^{\mathsf{T}} h_{t-1} + e_m^{\mathsf{T}} m_{t-1} \qquad (7)$$

where e_x, e_h, e_m are learned encoding vectors. Intuitively, the kernels (W) learn to compute nonlinear functions across the memory, while the encoders (e) learn to project the relevant information into the memory.

Figure 2: Time-unrolled LMU layer. An n-dimensional state-vector (h_t) is dynamically coupled with a d-dimensional memory vector (m_t). The memory represents a sliding window of u_t, projected onto the first d Legendre polynomials.
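One full time-step of the layer combines equations 4, 6, and 7. The following is a minimal sketch of that step in plain Python; the toy sizes and hand-picked weight values are our own illustrative assumptions (a trained layer would learn the kernels and encoders, and Ā, B̄ would come from equation 5 or ZOH):

```python
# Hypothetical single time-step of an LMU layer (equations 4, 6, 7).
# Weight names (Wx, Wh, Wm, ex, eh, em) follow the paper; the toy sizes
# and placeholder values below are ours, for illustration only.
import math

n, d, x_dim = 3, 4, 2          # hidden units, memory dimensions, input size

def matvec(W, v):
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lmu_step(x, h, m, Wx, Wh, Wm, ex, eh, em, Abar, Bbar):
    u = dot(ex, x) + dot(eh, h) + dot(em, m)                  # equation 7
    m = [dot(Abar[i], m) + Bbar[i] * u for i in range(d)]     # equation 4
    h = [math.tanh(a + b + c)                                 # equation 6
         for a, b, c in zip(matvec(Wx, x), matvec(Wh, h), matvec(Wm, m))]
    return h, m

Abar = [[float(i == j) for j in range(d)] for i in range(d)]  # placeholder dynamics
Bbar = [0.1] * d
Wx = [[0.5] * x_dim for _ in range(n)]
Wh = [[0.0] * n for _ in range(n)]
Wm = [[0.1] * d for _ in range(n)]
ex, eh, em = [1.0, 0.0], [0.0] * n, [0.0] * d                 # e_m = 0, as in Section 3

h, m = lmu_step([1.0, -1.0], [0.0] * n, [0.0] * d,
                Wx, Wh, Wm, ex, eh, em, Abar, Bbar)
```

Note that equation 6 reads the current memory m_t, so the sketch updates m before computing h, and the recurrence only needs to carry the n + d values (h, m) between steps.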
The parameters of the memory (Ā, B̄, θ) may be trained to adapt their time-scales by backpropagating through the ODE solver [9], although we do not require this in our experiments. This is the simplest design that we found to perform well across all tasks explored below, but variants in the form of gating u_t, forgetting m_t, and bias terms may also be considered for more challenging tasks. Our focus here is to demonstrate the advantages of learning the coupling between an optimal linear dynamical memory and a nonlinear function.

3 Experiments

Tasks were selected with the goals of validating the LMU's derivation while succinctly highlighting its key advantages: it can learn temporal dependencies spanning T = 100,000 time-steps, converge rapidly due to the use of non-saturating memory units, and use relatively few internal state-variables to compute nonlinear functions across long windows of time. The source code for the LMU and our experiments are published on GitHub.1

Proper weight initialization is central to the performance of the LMU, as the architecture is indeed a specific way of configuring a more general RNN in order to learn across continuous-time representations. Equation 2 and ZOH are used to initialize the weights of the memory cell (Ā, B̄). The time-scale θ is initialized based on prior knowledge of the task. We find that Ā, B̄, θ do not require training for these tasks since they can be appropriately initialized. We recommend equation 3 as an option to initialize W_m that can improve training times. The memory's feedback encoders are initialized to e_m = 0 to ensure stability. Remaining kernels (W_x, W_h) are initialized to Xavier normal [15] and the remaining encoders (e_x, e_h) are initialized to LeCun uniform [23]. The activation function f is set to tanh. All models are implemented with Keras and the TensorFlow backend [1] and run on CPUs and GPUs.
We use the Adam optimizer [20] with default hyperparameters, monitor the validation loss to save the best model, and train until convergence or 500 epochs. We note that our method does not require layer normalization, gradient clipping, or other regularization techniques.

1https://github.com/abr/neurips2019

3.1 Capacity Task

The copy memory task [18] is a synthetic task that stress-tests the ability of an RNN to store a fixed amount of data (typically 10 values) and persist them for a long interval of time. We consider a variant of this task that we dub the capacity task. It is designed to test the network's ability to maintain T values in memory for large values of T relative to the size of the network. We do so in a controlled way by setting T = 1/Δt and then scaling Δt → 0 while keeping the underlying data distribution fixed. Specifically, we randomly sample from a continuous-time white noise process, band-limited to ω = 10 Hz. Each sequence iterates through 2.5 seconds of this process with a time-step of Δt = 1/T. At each step, the network must recall the input values from ⌊iT/(k − 1)⌋ time-steps ago for i ∈ {0, 1, ..., k − 1}. Thus, the task evaluates the network's ability to maintain k points along a sliding window of length T, using only its internal state-variables to persist information between time-steps. We compare an LSTM to a simplified LMU, for k = 5, while scaling T.

Isolating the Memory Cell   To validate the function of the LMU memory in isolation, we disable the kernels W_x and W_h, as well as the encoders e_h and e_m, set e_x = 1, and set the hidden activation f to the identity. We use d = 100 dimensions for the memory. This simplifies the architecture to 500 parameters that connect a linear memory cell to a linear output layer.
This is done to demonstrate that the LMU is initially disposed to remember sequences of length θ (set to T steps).

Model Complexity   We compare the isolated LMU memory cell to an LSTM with 100 units connected to a 5-dimensional linear output layer. This model contains ~41k parameters, and 200 internal state-variables (100 for the hidden state, and 100 for the "carry" state) that can be leveraged to maintain information between time-steps. Thus, this task is theoretically trivial in terms of internal memory storage for T ≤ 200. The LMU has n = 5 hidden units and d = 100 dimensions for the memory (equation 4). Thus, the LMU is using significantly fewer computational resources than the LSTM: 500 vs 41k parameters, and 105 vs 200 state-variables.

Figure 3: Comparing LSTMs to LMUs while scaling the number of time-steps between the input and output. Each curve corresponds to a model trained at a different window length, T, and evaluated at 5 different delay lengths across the window. The LMU successfully persists information across 10^5 time-steps using only 105 internal state-variables, and without any training. It is able to do so by maintaining a compressed representation of the 10 Hz band-limited input signal processed with a time-step of Δt = 1/T.

Results and Discussion   Figure 3 summarizes the test results of each model, trained at different values of T, by reporting the MSE for each of the k outputs separately. We find that the LSTM can solve this task when T < 400, but struggles for T ≥ 400 due to the lack of hyperparameter optimization, consistent with [22, 25]. The LMU solves this task near-perfectly, since θω = 10 ≪ d [38]. In fact, the LMU does not even need to be trained for this task; testing is performed on the initial state of the network, without any training data.
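This untrained behaviour follows directly from equations 1–3 and can be reproduced in a few lines. The sketch below is our own (not the released experiment code): it simulates the memory cell with Euler's method on a constant input and decodes the window at several delays θ′ via the shifted Legendre polynomials; every decoded delay converges to the input value, with no learned parameters involved:

```python
# Our minimal reproduction sketch: an untrained LMU memory cell recalls its
# input window. Feed u(t) = 1 and decode u(t - theta') from m(t) via eq. 3.
from math import comb

d, theta, dt = 8, 1.0, 1e-3

# Equation 2: ideal state-space matrices.
A = [[(2*i + 1) * (-1.0 if i < j else (-1.0)**(i - j + 1)) for j in range(d)]
     for i in range(d)]
B = [(2*i + 1) * (-1.0)**i for i in range(d)]

# Equation 3: i-th shifted Legendre polynomial at r = theta'/theta.
def P(i, r):
    return (-1)**i * sum(comb(i, j) * comb(i + j, j) * (-r)**j
                         for j in range(i + 1))

m = [0.0] * d
for _ in range(int(8 * theta / dt)):        # simulate 8 window-lengths (Euler)
    m = [m[i] + (dt/theta) * (sum(A[i][j]*m[j] for j in range(d)) + B[i]*1.0)
         for i in range(d)]

# Decode the delayed input at the start, middle, and end of the window.
decoded = [sum(P(i, r) * m[i] for i in range(d)) for r in (0.0, 0.5, 1.0)]
```

For a constant input the window is constant, so the steady state collapses onto the zeroth Legendre coefficient and every decoded delay approaches 1; band-limited inputs behave analogously up to the O(θω/d) approximation error.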
LMU performance continues to improve with training (not shown), but that is not required for the task. We note that performance improves as T → ∞ because this yields discretized numerics that more closely follow the continuous-time descriptions of equations 1 and 3. The next task demonstrates that the representation of the memory generalizes to internally-generated sequences that are not described by band-limited white noise processes, and are learned rapidly through mutual interactions with the hidden units.

3.2 Permuted Sequential MNIST

The permuted sequential MNIST (psMNIST) digit classification task [22] is commonly used to assess the ability of RNN models to learn complex temporal relationships [2, 7, 8, 21, 25]. Each 28 × 28 image is flattened into a one-dimensional pixel array and permuted by a fixed permutation matrix. Elements of the array are then provided to the network one pixel at a time. Permutation distorts the temporal structure in the image sequence, resulting in a task that is significantly more difficult than the unpermuted version.

State-of-the-art   Current state-of-the-art results on psMNIST for RNNs include Zoneout [21] with 95.9% test accuracy, indRNN [25] with 96.0%, and the Dilated RNN [8] with 96.1%. One must be careful when comparing across studies, as each tends to use a different permutation seed, which can impact the overall difficulty of the task. More importantly, to allow for a fair comparison of computational resources utilized by models, it is necessary to consider the number of state-variables that must be modified in memory as the input is streamed online during inference.
In particular, if a network has access to more than 28² = 784 variables to store information between time-steps, then there is very little point in attempting this task with an RNN [4, 7]. That is, it becomes trivial to store all 784 pixels in a buffer, and then apply a feed-forward network to achieve state-of-the-art. For example, the Dilated RNN uses an internal memory of size 50 · (2⁹ − 1) ≈ 25k (i.e., 30x greater than 784), due to the geometric progression of dilated connections that must buffer hidden states in order to skip them in time. Parameter counts for RNNs are ultimately poor measures of resource efficiency if a solution has write-access to more internal memory than there are elements in the input sequence.

LMU Model   Our model uses n = 212 hidden units and d = 256 dimensions for the memory, thus maintaining n + d = 468 variables in memory between time-steps. The hidden state is projected to an output softmax layer. This is equivalent to the NRU [7] in terms of state-variables, and similar in computational resources, while the LMU has ~102k trainable parameters compared to ~165k for the NRU. We set θ = 784 s with Δt = 1 s, and initialize e_h = e_m = W_x = W_h = 0 to test the ability of the network to learn these parameters. Training is stopped after 10 epochs, as we observe that validation loss is already minimized by this point.

Results   Table 1 is reproduced from Chandar et al. [7] with the following adjustments. First, the EURNN (94.50%) has been removed since it uses 1024 state-variables. Second, we have added the phased LSTM [29] with matched parameter counts and α = 10⁻⁴. Third, a feed-forward baseline is included, which simply projects the flattened input sequence to a softmax output layer. This informs us of how well we should expect a model to perform (92.65%) supposing it linearly memorizes the input and projects it to the output softmax layer.
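The resource accounting above can be reproduced with a few lines of arithmetic. This bookkeeping sketch is our own; which pieces count as trainable is our assumption, based on equations 4, 6, and 7 with a dense 10-way softmax readout and with Ā, B̄, and θ frozen:

```python
# Hypothetical parameter/state accounting for the psMNIST LMU
# (n = 212, d = 256, scalar pixel input, 10-class softmax readout).
n, d, x_dim, classes = 212, 256, 1, 10

params = {
    "Wx": n * x_dim,                     # input kernel (equation 6)
    "Wh": n * n,                         # hidden kernel
    "Wm": n * d,                         # memory kernel
    "ex": x_dim,                         # encoders (equation 7)
    "eh": n,
    "em": d,
    "softmax": n * classes + classes,    # readout weights + biases
}
total_params = sum(params.values())      # ~102k trainable parameters
state_variables = n + d                  # h_t and m_t persist between steps; 468 < 784
```

Under these assumptions the total comes out near the ~102k quoted above, and the persistent state (468 variables) stays below the 784-pixel threshold that would trivialize the task.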
For the LMU and feed-forward baseline, we extended the code from Chandar et al. [7] in order to ensure that the training, validation, and test data were identical, with the same permutation seed and batch size. All other results in Table 1 use ~165k parameters (LMU uses ~102k, and FF-baseline uses ~8k).

Table 1: Validation and test set accuracy for psMNIST (extended from [7])

Model          Validation   Test
RNN-orth       88.70        89.26
RNN-id         85.98        86.13
LSTM           89.86        90.01
LSTM-chrono    88.10        88.43
GRU            92.16        92.39
JANET          92.50        91.94
SRU            92.79        92.49
GORU           87.00        86.90
NRU            95.46        95.38
Phased LSTM    88.76        89.61
LMU            96.97        97.15
FF-baseline    92.37        92.65

Discussion   The LMU surpasses state-of-the-art by achieving 97.15% test accuracy, despite using only 468 internal state-variables and ~102k parameters. We make three important observations regarding the LMU's performance on this task: (1) it learns quickly, exceeding state-of-the-art in 10 epochs (the results from Chandar et al. [7] use 100 epochs for comparison); (2) it is doing more than simply memorizing the input (by outperforming the baseline it must be leveraging the hidden nonlinearities to perform some useful computations across the memory); and (3) since d = 256 is significantly less than 784, and the input sequence is highly discontinuous in time, it must necessarily be learning a strategy for writing features to the memory cell (equation 7) that minimizes the information loss from compression of the window onto the Legendre polynomials.

3.3 Mackey-Glass Prediction

The Mackey-Glass (MG) data set [27] is a time-series prediction task that tests the ability of a network to model chaotic dynamical systems.
MG is commonly used to evaluate the nonlinear dynamical processing of reservoir computers [17]. In this task, a sequence of one-dimensional observations (generated by solving the MG differential equations) is streamed as input, and the network is tasked with predicting the next value in the sequence. We use a parameterized version of the data set from the Deep Learning Summer School (2015) held in Montreal, where we predict 15 time-steps into the future with an MG time-constant of 17 steps.

Task Difficulty   Due to the "butterfly effect" in chaotic strange attractors (i.e., the effect that arbitrarily small perturbations to the state cause future trajectories to exponentially diverge), this task is both theoretically and practically challenging. Essentially, the network must use its observations to estimate the underlying dynamical state of the attractor, and then internally simulate its dynamics forward some number of steps. Takens' theorem [35] guarantees that this can be accomplished by representing a window of the input sequence and then applying a static nonlinear transformation to this delay embedding. Nevertheless, since any perturbations to the estimate of the underlying state diverge exponentially over time (at a rate given by its Lyapunov exponent), the time-horizon of predictions with bounded error scales only logarithmically with the precision of the observer [33].

Model Specification   We compare three architectures: one using LSTMs; one using LMUs; and a hybrid that is half LSTMs and half LMUs in alternating layers. Each model stacks 4 layers and contains ~18k parameters. To balance the number of parameters, each LSTM layer contains 25 units, while each LMU layer contains n = 49 units and d = 4 memory dimensions. We set θ = 4 time-steps, and did not try any other values of θ or d. All other settings are kept at their defaults. Lastly, since our LMU lacks any explicit gating mechanisms, we evaluated a hybrid approach that interleaves two LMU layers of 40 units with two LSTM layers of 25 units.

Evaluation Metric   We report the normalized root mean squared error (NRMSE):

$$\text{NRMSE} = \sqrt{\frac{E\left[(Y - \hat{Y})^2\right]}{E\left[Y^2\right]}} \qquad (8)$$

where Y is the ideal target and Ŷ is the prediction, such that a baseline solution that always predicts 0 obtains an NRMSE of 1. For this data set, the identity function (predicting the future output to be equal to the input) obtains an NRMSE of ~1.623.

Table 2: Mackey-Glass results

Model    Test NRMSE   Training Time (s/epoch)
LSTM     0.079        20.34 s
LMU      0.054        12.89 s
Hybrid   0.050        16.21 s

4 Characteristics of the LMU

Linear-Nonlinear Processing   Linear units maximize the information capacity of dynamical systems, while nonlinearities are required to compute useful functions across this information [19]. The LMU formalizes this linear-nonlinear trade-off by decoupling the functional role of d linear memory units from that of n nonlinear hidden units, and then using backpropagation to learn their coupling.

Parameter-State Trade-offs   One can increase d to improve the linear memory (m) capacity at the cost of a linear increase in the size of the encoding parameters, or increase n to improve the complexity of nonlinear interactions with the memory (h) at the expense of a quadratic increase in the size of the recurrent kernels. Thus, d and n can be set independently to trade storage for parameters while balancing linear memory capacity with hidden nonlinear processing.

Optimality and Uniqueness   The memory cell is optimal in the sense of being derived from the Padé [30] approximants of the delay line expanded about the zeroth frequency [38]. These approximants have been proven optimal for this purpose.
Moreover, the phase space of the memory maps onto the unique set of orthogonal polynomials over [0, θ] (the shifted Legendre polynomials) up to a constant scaling factor [24]. Thus the LMU is provably optimal with respect to its continuous-time memory capacity, which provides a nontrivial starting point for backpropagation. To validate this characteristic, we reran the psMNIST benchmark with the diagonals of Ā perturbed by ε ∈ {−0.01, −0.001, 0.001, 0.01}. Despite retraining the network for each ε, this achieved sub-optimal test performance in each case, and resulted in chance-level performance for larger |ε|.

Figure 4: LMU memory (d = 10,240) given u_t = 1.

Figure 5: O(d/√m) scaling.

Scalability   Equations 1 and 2 have been scaled to d = 10,240 to accurately maintain information across θ = 512,000,000 time-steps, as shown in Figure 4 [37]. This implements the dynamical system using O(d) time and memory, by exploiting the structure of (A, B) as shown in Figure 6. We find that the most difficult sequences to remember are pure white noise signals, which require O(d) dimensions to accurately maintain a window of d time-steps.

5 Spiking Implementation

The LMU can be implemented with a spiking neural network [38], and on neuromorphic hardware [28], while consuming several orders of magnitude less energy than traditional computing architectures [6, 37]. Here we review these findings and their implications for neuromorphic deep learning.

The challenge in this section pertains to the substitution of static nonlinearities that emit multi-bit activities every time-step (i.e., "rate" neurons) with spiking neurons that emit temporally sparse 1-bit events. We consider Poisson neurons since they are stateless, and thus serve as an inexpensive mechanism for sparsifying signals over time – but this is not a strict requirement.
More generally, by reducing the amount of communication and converting weight multiplies into additions, spikes can trade precision for energy-efficiency on neuromorphic hardware [6, 12]. Moreover, this can be accomplished while preserving the optimizations afforded by deep learning [31].

Neural Precision   The dynamical system for the memory cell can be implemented by mapping each state-variable onto the postsynaptic currents of d individual populations of p Poisson spiking neurons with fixed heterogeneous tuning curves [14, 38]. We consider the error between the ideal input to the original rate neuron representing some dimension, versus the weighted summation of spike events representing the same dimension. Theorem 3.2.1 from [37] proves that this error has a variance of O(1/p). By the variance sum law, repeating this for d independent populations yields an overall RMSE of O(√(d/p)). Letting m = pd be the total number of neurons, we find that the error scales as O(d/√m), as validated in Figure 5. This grants access to a free parameter that trades precision for energy-efficiency, while scaling to the original network in the limit of large m [37].

Neuromorphic Implementation   This spiking neural network has been implemented on neuromorphic hardware including Braindrop [28] and Loihi [37]. Each population is coupled to one another to implement equation 1 by converting the postsynaptic filters into integrators [38]. This results in a specific connectivity pattern, shown in Figure 6, that exploits the alternating structure of equation 2. An ideal implementation of this system requires m nonlinearities, O(m) additions, and d state-variables. Spiking neurons may also be used to implement the hidden state of the LMU by nonlinearly encoding the memory vector [38].
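The per-dimension O(1/√p) scaling can be illustrated numerically. The following Monte-Carlo sketch is ours (not the proof in [37]), and it substitutes Bernoulli spikes for Poisson spikes as a one-time-step stand-in: decoding a scalar from the summed events of p stochastic 1-bit neurons gives an RMSE that shrinks roughly tenfold when p grows a hundredfold:

```python
# Our Monte-Carlo sketch of the O(1/sqrt(p)) precision argument: decode a
# scalar from p stochastic 1-bit "spiking" neurons and measure the RMSE.
import random

def rmse_of_decode(p, signal=0.6, trials=2000, seed=0):
    rng = random.Random(seed)
    err2 = 0.0
    for _ in range(trials):
        # Each neuron emits a spike with probability equal to the signal
        # (a Bernoulli stand-in for a Poisson rate within one time-step).
        spikes = sum(1 for _ in range(p) if rng.random() < signal)
        err2 += (spikes / p - signal) ** 2
    return (err2 / trials) ** 0.5

r10, r1000 = rmse_of_decode(10), rmse_of_decode(1000)
# Expect roughly a 10x reduction in RMSE for a 100x increase in neurons.
```

Summing this per-dimension variance over d independent populations then gives the O(√(d/p)) = O(d/√m) scaling stated above.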
Since this scales linearly in time and memory, with square-root precision, the LMU offers a promising architecture for low-power RNNs.

Figure 6: Connection structure (d = 6) adapted from [37]. Forward arrow heads indicate addition, circular heads indicate subtraction. The ith state-variable continuously integrates its input with a gain of (2i + 1)θ⁻¹.

6 Discussion

Advanced regularization techniques, such as recurrent batch normalization [11] and Zoneout [21], are able to improve standard RNNs to perform near state-of-the-art on the recent psMNIST benchmark (95.9%). Without relying on such techniques, our recurrent architecture surpasses state-of-the-art by a full percent (97.15%), while using fewer internal units and state-variables (468) than image pixels (784). Nevertheless, recurrent batch normalization and Zoneout are both fully compatible with our architecture, and the potential benefits should be explored.

To our knowledge, the LMU is the first recurrent architecture capable of handling temporal dependencies across 100,000 time-steps. Its strong performance is attributed to the cell structure being derived from first principles to project continuous-time signals onto d orthogonal dimensions. The mathematical derivation of the LMU is critical for its success. Specifically, we find that the LMU's dynamical system, when coupled with a nonlinear function, endows the RNN with several non-trivial advantages in terms of learning long-range dependencies, training quickly, and representing task-relevant information within sequences whose length exceeds the size of the network.

The LMU is a rare example of deriving RNN dynamics from first principles to have some desired characteristics, showing that neural activity is consistent with such a derivation, and demonstrating state-of-the-art performance on a machine learning task.
As such, it serves as a reminder of the value in pursuing diverse perspectives on neural computation, and in combining tools from mathematics, signal processing, and deep learning.

The basic design of our layer, which consists of a nonlinear hidden state and a linear memory cell, presents several opportunities for extension. Preliminary work in this direction has indicated that introducing an input gate is beneficial for problems that require latching onto individual values (as required by the adding task [18]). Likewise, a forget gate yields improved performance on problems where it is helpful to selectively reset the memory (such as language modelling). Lastly, we have proposed a single memory cell per layer, but in theory one can have multiple independent memory cells, each coupled to the same hidden state with a different set of encoders and kernels. Multiple memories would enable the same hidden units to write to multiple streams in parallel, each along different time-scales, and to compute across all of them simultaneously.

Acknowledgments

We thank the reviewers for improving our work by identifying areas in need of clarification and suggesting additional points of validation. This work was supported by CFI and OIT infrastructure funding, the Canada Research Chairs program, NSERC Discovery grant 261453, ONR grant N000141310419, AFOSR grant FA8655-13-1-3084, OGS, and NSERC CGS-D.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[2] Martin Arjovsky, Amar Shah, and Yoshua Bengio.
Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.

[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[4] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

[5] Yoshua Bengio, Patrice Simard, Paolo Frasconi, et al. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[6] Peter Blouw, Xuan Choo, Eric Hunsberger, and Chris Eliasmith. Benchmarking keyword spotting efficiency on neuromorphic hardware. arXiv preprint arXiv:1812.01739, 2018.

[7] Sarath Chandar, Chinnadhurai Sankar, Eugene Vorontsov, Samira Ebrahimi Kahou, and Yoshua Bengio. Towards non-saturating recurrent units for modelling long-term dependencies. arXiv preprint arXiv:1902.06704, 2019.

[8] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A Hasegawa-Johnson, and Thomas S Huang. Dilated Recurrent Neural Networks. In Advances in Neural Information Processing Systems, pages 77–87, 2017.

[9] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.

[10] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585, 2015.

[11] Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. Recurrent batch normalization.
arXiv preprint arXiv:1603.09025, 2016.

[12] Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82–99, 2018.

[13] Joost de Jong, Aaron R. Voelker, Hedderik van Rijn, Terrence C. Stewart, and Chris Eliasmith. Flexible timing with delay networks – The scalar property and neural scaling. In International Conference on Cognitive Modelling, 2019.

[14] Chris Eliasmith and Charles H Anderson. Neural engineering: Computation, representation, and dynamics in neurobiological systems. MIT Press, 2003.

[15] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[16] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE, 2013.

[17] Lyudmila Grigoryeva, Julie Henriques, Laurent Larger, and Juan-Pablo Ortega. Optimal nonlinear information processing capacity in delay-based reservoir computers. Scientific Reports, 5:12858, 2015.

[18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[19] Masanobu Inubushi and Kazuyuki Yoshimura. Reservoir computing beyond memory-nonlinearity trade-off. Scientific Reports, 7(1):10199, 2017.

[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[21] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal.
Zoneout: Regularizing RNNs by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.

[22] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.

[23] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.

[24] Adrien-Marie Legendre. Recherches sur l'attraction des sphéroïdes homogènes. Mémoires de Mathématiques et de Physique, présentés à l'Académie Royale des Sciences, pages 411–435, 1782.

[25] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5457–5466, 2018.

[26] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

[27] Michael C Mackey and Leon Glass. Oscillation and chaos in physiological control systems. Science, 197(4300):287–289, 1977.

[28] Alexander Neckar, Sam Fok, Ben V Benjamin, Terrence C Stewart, Nick N Oza, Aaron R Voelker, Chris Eliasmith, Rajit Manohar, and Kwabena Boahen. Braindrop: A mixed-signal neuromorphic architecture with a dynamical systems-based programming model. Proceedings of the IEEE, 107(1):144–164, 2019.

[29] Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pages 3882–3890, 2016.

[30] H. Padé. Sur la représentation approchée d'une fonction par des fractions rationnelles.
Annales scientifiques de l'École Normale Supérieure, 9:3–93, 1892.

[31] Daniel Rasmussen. NengoDL: Combining deep learning and neuromorphic modelling methods. Neuroinformatics, pages 1–18, 2019.

[32] Olinde Rodrigues. De l'attraction des sphéroïdes, Correspondence sur l'École Impériale Polytechnique. PhD thesis, Faculty of Science of the University of Paris, 1816.

[33] Steven H Strogatz. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. CRC Press, 2015.

[34] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[35] Floris Takens. Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence, Warwick 1980, pages 366–381. Springer, 1981.

[36] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

[37] Aaron R. Voelker. Dynamical Systems in Spiking Neuromorphic Hardware. PhD thesis, University of Waterloo, 2019. URL http://hdl.handle.net/10012/14625.

[38] Aaron R. Voelker and Chris Eliasmith. Improving spiking dynamical networks: Accurate delays, higher-order synapses, and time cells. Neural Computation, 30(3):569–609, 2018.

[39] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio.
Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.