{"title": "Learning to Transduce with Unbounded Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 1828, "page_last": 1836, "abstract": "Recently, strong results have been demonstrated by Deep Recurrent Neural Networks on natural language transduction problems. In this paper we explore the representational power of these models using synthetic grammars designed to exhibit phenomena similar to those found in real transduction problems such as machine translation. These experiments lead us to propose new memory-based recurrent networks that implement continuously differentiable analogues of traditional data structures such as Stacks, Queues, and DeQues. We show that these architectures exhibit superior generalisation performance to Deep RNNs and are often able to learn the underlying generating algorithms in our transduction experiments.", "full_text": "Learning to Transduce with Unbounded Memory\n\nEdward Grefenstette\nGoogle DeepMind\netg@google.com\n\nKarl Moritz Hermann\nGoogle DeepMind\nkmh@google.com\n\nMustafa Suleyman\nGoogle DeepMind\nmustafasul@google.com\n\nPhil Blunsom\nGoogle DeepMind and Oxford University\npblunsom@google.com\n\nAbstract\n\nRecently, strong results have been demonstrated by Deep Recurrent Neural Networks\non natural language transduction problems. In this paper we explore the\nrepresentational power of these models using synthetic grammars designed to exhibit\nphenomena similar to those found in real transduction problems such as machine\ntranslation. These experiments lead us to propose new memory-based recurrent\nnetworks that implement continuously differentiable analogues of traditional\ndata structures such as Stacks, Queues, and DeQues. 
We show that these architectures\nexhibit superior generalisation performance to Deep RNNs and are often able\nto learn the underlying generating algorithms in our transduction experiments.\n\n1 Introduction\n\nRecurrent neural networks (RNNs) offer a compelling tool for processing natural language input in\na straightforward sequential manner. Many natural language processing (NLP) tasks can be viewed\nas transduction problems, that is, learning to convert one string into another. Machine translation is\na prototypical example of transduction and recent results indicate that Deep RNNs have the ability\nto encode long source strings and produce coherent translations [1, 2]. While elegant, the application\nof RNNs to transduction tasks requires hidden layers large enough to store representations\nof the longest strings likely to be encountered, implying wastage on shorter strings and a strong\ndependency between the number of parameters in the model and its memory.\nIn this paper we use a number of linguistically-inspired synthetic transduction tasks to explore the\nability of RNNs to learn long-range reorderings and substitutions. Further, inspired by prior work on\nneural network implementations of stack data structures [3], we propose and evaluate transduction\nmodels based on Neural Stacks, Queues, and DeQues (double-ended queues). Stack algorithms are\nwell-suited to processing the hierarchical structures observed in natural language and we hypothesise\nthat their neural analogues will provide an effective and learnable transduction tool. 
Our models\nprovide a middle ground between simple RNNs and the recently proposed Neural Turing Machine\n(NTM) [4] which implements a powerful random access memory with read and write operations.\nNeural Stacks, Queues, and DeQues also provide a logically unbounded memory while permitting\nefficient constant time push and pop operations.\nOur results indicate that the models proposed in this work, and in particular the Neural DeQue, are\nable to consistently learn a range of challenging transductions. While Deep RNNs based on long\nshort-term memory (LSTM) cells [1, 5] can learn some transductions when tested on inputs of the\nsame length as seen in training, they fail to consistently generalise to longer strings. In contrast,\nour sequential memory-based algorithms are able to learn to reproduce the generating transduction\nalgorithms, often generalising perfectly to inputs well beyond those encountered in training.\n\n2 Related Work\n\nString transduction is central to many applications in NLP, from name transliteration and spelling\ncorrection, to inflectional morphology and machine translation. The most common approach leverages\nsymbolic finite state transducers [6, 7], with approaches based on context free representations\nalso being popular [8]. RNNs offer an attractive alternative to symbolic transducers due to their simple\nalgorithms and expressive representations [9]. 
However, as we show in this work, such models\nare limited in their ability to generalise beyond their training data and have a memory capacity that\nscales with the number of their trainable parameters.\nPrevious work has touched on the topic of rendering discrete data structures such as stacks continuous,\nespecially within the context of modelling pushdown automata with neural networks [10, 11, 3].\nWe were inspired by the continuous pop and push operations of these architectures and the idea of\nan RNN controlling the data structure when developing our own models. The key difference is that\nour work adapts these operations to work within a recurrent continuous Stack/Queue/DeQue-like\nstructure, the dynamics of which are fully decoupled from those of the RNN controlling it. In our\nmodels, the backwards dynamics are easily analysable in order to obtain the exact partial derivatives\nfor use in error propagation, rather than having to approximate them as done in previous work.\nIn a parallel effort to ours, researchers are exploring the addition of memory to recurrent networks.\nThe NTM and Memory Networks [4, 12, 13] provide powerful random access memory operations,\nwhereas we focus on a more efficient and restricted class of models which we believe are sufficient\nfor natural language transduction tasks. More closely related to our work, Joulin and Mikolov [14] have sought to\ndevelop a continuous stack controlled by an RNN. Note that this model, unlike the work proposed\nhere, renders discrete push and pop operations continuous by \u201cmixing\u201d information across levels of\nthe stack at each time step according to scalar push/pop action values. 
This means the model ends up\ncompressing information in the stack, thereby limiting its use, as it effectively loses the unbounded\nmemory nature of traditional symbolic models.\n\n3 Models\n\nIn this section, we present an extensible memory enhancement to recurrent layers which can be set\nup to act as a continuous version of a classical Stack, Queue, or DeQue (double-ended queue). We\nbegin by describing the operations and dynamics of a neural Stack, before showing how to modify\nit to act as a Queue, and extend it to act as a DeQue.\n\n3.1 Neural Stack\n\nLet a Neural Stack be a differentiable structure onto and from which continuous vectors are pushed\nand popped. Inspired by the neural pushdown automaton of [3], we render these traditionally discrete\noperations continuous by letting push and pop operations be real values in the interval (0, 1).\nIntuitively, we can interpret these values as the degree of certainty with which some controller wishes\nto push a vector v onto the stack, or pop the top of the stack.\n\n$$V_t[i] = \\begin{cases} V_{t-1}[i] & \\text{if } 1 \\leq i < t \\\\ v_t & \\text{if } i = t \\end{cases} \\qquad \\text{(Note that } V_t[i] = v_i \\text{ for all } i \\leq t\\text{)} \\tag{1}$$\n\n$$s_t[i] = \\begin{cases} \\max\\left(0,\\, s_{t-1}[i] - \\max\\left(0,\\, u_t - \\sum_{j=i+1}^{t-1} s_{t-1}[j]\\right)\\right) & \\text{if } 1 \\leq i < t \\\\ d_t & \\text{if } i = t \\end{cases} \\tag{2}$$\n\n$$r_t = \\sum_{i=1}^{t} \\left(\\min\\left(s_t[i],\\, \\max\\left(0,\\, 1 - \\sum_{j=i+1}^{t} s_t[j]\\right)\\right)\\right) \\cdot V_t[i] \\tag{3}$$\n\nFormally, a Neural Stack, fully parametrised by an embedding size $m$, is described at some timestep\n$t$ by a $t \\times m$ value matrix $V_t$ and a strength vector $s_t \\in \\mathbb{R}^t$. These form the core of a recurrent layer\nwhich is acted upon by a controller by receiving, from the controller, a value $v_t \\in \\mathbb{R}^m$, a pop signal\n$u_t \\in (0, 1)$, and a push signal $d_t \\in (0, 1)$. It outputs a read vector $r_t \\in \\mathbb{R}^m$. The recurrence of this\nlayer comes from the fact that it will receive as previous state of the stack the pair $(V_{t-1}, s_{t-1})$, and\nproduce as next state the pair $(V_t, s_t)$ following the dynamics described in Equations 1\u20133. 
Here, $V_t[i]$ represents\nthe $i$th row (an $m$-dimensional vector) of $V_t$ and $s_t[i]$ represents the $i$th value of $s_t$.\nEquation 1 shows the update of the value component of the recurrent layer state represented as a\nmatrix, the number of rows of which grows with time, maintaining a record of the values pushed to\nthe stack at each timestep (whether or not they are still logically on the stack). Values are appended\nto the bottom of the matrix (top of the stack) and never changed.\nEquation 2 shows the effect of the push and pop signal in updating the strength vector $s_{t-1}$ to\nproduce $s_t$. First, the pop operation removes objects from the stack. We can think of the pop value\n$u_t$ as the initial deletion quantity for the operation. We traverse the strength vector $s_{t-1}$ from the\nhighest index to the lowest. If the next strength scalar is less than the remaining deletion quantity, it\nis subtracted from the remaining quantity and its value is set to 0. If the remaining deletion quantity\nis less than the next strength scalar, the remaining deletion quantity is subtracted from that scalar and\ndeletion stops. Next, the push value is set as the strength for the value added in the current timestep.\nEquation 3 shows the dynamics of the read operation, which are similar to the pop operation. A\nfixed initial read quantity of 1 is set at the top of a temporary copy of the strength vector $s_t$ which\nis traversed from the highest index to the lowest. If the next strength scalar is smaller than the\nremaining read quantity, its value is preserved for this operation and subtracted from the remaining\nread quantity. If not, it is temporarily set to the remaining read quantity, and the strength scalars of\nall lower indices are temporarily set to 0. The output $r_t$ of the read operation is the weighted sum\nof the rows of $V_t$, scaled by the temporary scalar values created during the traversal. 
An example\nof the stack read calculations across three timesteps, after pushes and pops as described above, is\nillustrated in Figure 1a. The third step shows how setting the strength $s_3[2]$ to 0 for $V_3[2]$ logically\nremoves $v_2$ from the stack, and how it is ignored during the read.\nThis completes the description of the forward dynamics of a neural Stack, cast as a recurrent layer,\nas illustrated in Figure 1b. All operations described in this section are differentiable (see footnote 1). The equations\ndescribing the backwards dynamics are provided in Appendix A of the supplementary materials.\n\nFootnote 1: The max(x, y) and min(x, y) functions are technically not differentiable for x = y. Following the work\non rectified linear units [15], we arbitrarily take the partial differentiation of the left argument in these cases.\n\n[Figure 1: Illustrating a Neural Stack\u2019s Operations, Recurrent Structure, and Control. Panels: (a) Example Operation of a Continuous Neural Stack; (b) Neural Stack as a Recurrent Layer; (c) RNN Controlling a Stack. Panel (a) traces three timesteps, with the stack growing upwards: at $t = 1$ ($u_1 = 0$, $d_1 = 0.8$) the strength of $v_1$ is 0.8 and $r_1 = 0.8 \\cdot v_1$; at $t = 2$ ($u_2 = 0.1$, $d_2 = 0.5$) the strengths of $v_2$ and $v_1$ are 0.5 and 0.7, and $r_2 = 0.5 \\cdot v_2 + 0.5 \\cdot v_1$; at $t = 3$ ($u_3 = 0.9$, $d_3 = 0.9$) the strengths of $v_3$, $v_2$, and $v_1$ are 0.9, 0, and 0.3, so $v_2$ is removed from the stack and $r_3 = 0.9 \\cdot v_3 + 0 \\cdot v_2 + 0.1 \\cdot v_1$.]\n\n3.2 Neural Queue\n\nA neural Queue operates the same way as a neural Stack, with the exception that the pop operation\nreads the lowest index of the strength vector $s_t$, rather than the highest. This represents popping and\nreading from the front of the Queue rather than the top of the stack. These operations are described\nin Equations 4\u20135.\n\n$$s_t[i] = \\begin{cases} \\max\\left(0,\\, s_{t-1}[i] - \\max\\left(0,\\, u_t - \\sum_{j=1}^{i-1} s_{t-1}[j]\\right)\\right) & \\text{if } 1 \\leq i < t \\\\ d_t & \\text{if } i = t \\end{cases} \\tag{4}$$\n\n$$r_t = \\sum_{i=1}^{t} \\left(\\min\\left(s_t[i],\\, \\max\\left(0,\\, 1 - \\sum_{j=1}^{i-1} s_t[j]\\right)\\right)\\right) \\cdot V_t[i] \\tag{5}$$\n\n3.3 Neural DeQue\n\nA neural DeQue operates like a neural Stack, except it takes a push, pop, and value as input for\nboth \u201cends\u201d of the structure (which we call top and bot), and outputs a read for both ends. We write\n$u_t^{\\mathrm{top}}$ and $u_t^{\\mathrm{bot}}$ instead of $u_t$, $v_t^{\\mathrm{top}}$ and $v_t^{\\mathrm{bot}}$ instead of $v_t$, and so on. The state components $V_t$ and $s_t$ are now\na $2t \\times m$-dimensional matrix and a $2t$-dimensional vector, respectively. At each timestep, a pop\nfrom the top is followed by a pop from the bottom of the DeQue, followed by the pushes and reads.\nThe dynamics of a DeQue, which unlike a neural Stack or Queue \u201cgrows\u201d in two directions, are\ndescribed in Equations 6\u201311, below. 
Equations 7\u20139 decompose the strength vector update into three\nsteps purely for notational clarity.\n\n$$V_t[i] = \\begin{cases} v_t^{\\mathrm{bot}} & \\text{if } i = 1 \\\\ v_t^{\\mathrm{top}} & \\text{if } i = 2t \\\\ V_{t-1}[i-1] & \\text{if } 1 < i < 2t \\end{cases} \\tag{6}$$\n\n$$s_t^{\\mathrm{top}}[i] = \\max\\left(0,\\, s_{t-1}[i] - \\max\\left(0,\\, u_t^{\\mathrm{top}} - \\sum_{j=i+1}^{2(t-1)-1} s_{t-1}[j]\\right)\\right) \\quad \\text{if } 1 \\leq i < 2(t-1) \\tag{7}$$\n\n$$s_t^{\\mathrm{both}}[i] = \\max\\left(0,\\, s_t^{\\mathrm{top}}[i] - \\max\\left(0,\\, u_t^{\\mathrm{bot}} - \\sum_{j=1}^{i-1} s_t^{\\mathrm{top}}[j]\\right)\\right) \\quad \\text{if } 1 \\leq i < 2(t-1) \\tag{8}$$\n\n$$s_t[i] = \\begin{cases} s_t^{\\mathrm{both}}[i-1] & \\text{if } 1 < i < 2t \\\\ d_t^{\\mathrm{bot}} & \\text{if } i = 1 \\\\ d_t^{\\mathrm{top}} & \\text{if } i = 2t \\end{cases} \\tag{9}$$\n\n$$r_t^{\\mathrm{top}} = \\sum_{i=1}^{2t} \\left(\\min\\left(s_t[i],\\, \\max\\left(0,\\, 1 - \\sum_{j=i+1}^{2t} s_t[j]\\right)\\right)\\right) \\cdot V_t[i] \\tag{10}$$\n\n$$r_t^{\\mathrm{bot}} = \\sum_{i=1}^{2t} \\left(\\min\\left(s_t[i],\\, \\max\\left(0,\\, 1 - \\sum_{j=1}^{i-1} s_t[j]\\right)\\right)\\right) \\cdot V_t[i] \\tag{11}$$\n\nTo summarise, a neural DeQue acts like two neural Stacks operated on in tandem, except that the\npushes and pops from one end may eventually affect pops and reads on the other, and vice versa.\n\n3.4 Interaction with a Controller\n\nWhile the three memory modules described can be seen as recurrent layers, with the operations being\nused to produce the next state and output from the input and previous state being fully differentiable,\nthey contain no tunable parameters to optimise during training. As such, they need to be attached\nto a controller in order to be used for any practical purposes. In exchange, they offer an extensible\nmemory, the logical size of which is unbounded and decoupled from both the nature and parameters\nof the controller, and from the size of the problem they are applied to. Here, we describe how any\nRNN controller may be enhanced by a neural Stack, Queue or DeQue.\nWe begin by giving the case where the memory is a neural Stack, as illustrated in Figure 1c. 
Here\nwe wish to replicate the overall \u2018interface\u2019 of a recurrent layer (as seen from outside the dotted\nlines), which takes the previous recurrent state $H_{t-1}$ and an input vector $i_t$, and transforms them\nto return the next recurrent state $H_t$ and an output vector $o_t$. In our setup, the previous state $H_{t-1}$\nof the recurrent layer will be the tuple $(h_{t-1}, r_{t-1}, (V_{t-1}, s_{t-1}))$, where $h_{t-1}$ is the previous state\nof the RNN, $r_{t-1}$ is the previous stack read, and $(V_{t-1}, s_{t-1})$ is the previous state of the stack\nas described above. With the exception of $h_0$, which is initialised randomly and optimised during\ntraining, all other initial states, $r_0$ and $(V_0, s_0)$, are set to 0-valued vectors/matrices and not updated\nduring training.\nThe overall input $i_t$ is concatenated with the previous read $r_{t-1}$ and passed to the RNN controller as\ninput along with the previous controller state $h_{t-1}$. The controller outputs its next state $h_t$ and a\ncontroller output $o'_t$, from which we obtain the push and pop scalars $d_t$ and $u_t$ and the value vector\n$v_t$, which are passed to the stack, as well as the network output $o_t$:\n\n$$d_t = \\mathrm{sigmoid}(W_d o'_t + b_d) \\qquad u_t = \\mathrm{sigmoid}(W_u o'_t + b_u)$$\n$$v_t = \\tanh(W_v o'_t + b_v) \\qquad o_t = \\tanh(W_o o'_t + b_o)$$\n\nwhere $W_d$ and $W_u$ are vector-to-scalar projection matrices, and $b_d$ and $b_u$ are their scalar biases;\n$W_v$ and $W_o$ are vector-to-vector projections, and $b_v$ and $b_o$ are their vector biases, all randomly\ninitialised and then tuned during training. Along with the previous stack state $(V_{t-1}, s_{t-1})$, the stack\noperations $d_t$ and $u_t$ and the value $v_t$ are passed to the neural stack to obtain the next read $r_t$ and\nnext stack state $(V_t, s_t)$, which are packed into a tuple with the controller state $h_t$ to form the next\nstate $H_t$ of the overall recurrent layer. The output vector $o_t$ serves as the overall output of the\nrecurrent layer. 
The structure described here can be adapted to control a neural Queue instead of a\nstack by substituting one memory module for the other.\nThe only additional trainable parameters in either configuration, relative to a non-enhanced RNN,\nare the projections for the input concatenated with the previous read into the RNN controller, and the\nprojections from the controller output into the various Stack/Queue inputs, described above. In the\ncase of a DeQue, both the top read $r^{\\mathrm{top}}$ and bottom read $r^{\\mathrm{bot}}$ must be preserved in the overall state.\nThey are both concatenated with the input to form the input to the RNN controller. The output of the\ncontroller must have additional projections to output push/pop operations and values for the bottom\nof the DeQue. This roughly doubles the number of additional tunable parameters \u201cwrapping\u201d the\nRNN controller, compared to the Stack/Queue case.\n\n4 Experiments\n\nIn every experiment, integer-encoded source and target sequence pairs are presented to the candidate\nmodel as a batch of single joint sequences. The joint sequence starts with a start-of-sequence (SOS)\nsymbol, and ends with an end-of-sequence (EOS) symbol, with a separator symbol separating the\nsource and target sequences. Integer-encoded symbols are converted to 64-dimensional embeddings\nvia an embedding matrix, which is randomly initialised and tuned during training. Separate word-to-index\nmappings are used for source and target vocabularies. Separate embedding matrices are\nused to encode input and output (predicted) embeddings.\n\n4.1 Synthetic Transduction Tasks\n\nThe aim of each of the following tasks is to read an input sequence, and generate as target sequence a\ntransformed version of the source sequence, followed by an EOS symbol. Source sequences are randomly\ngenerated from a vocabulary of 128 meaningless symbols. 
The length of each training source\nsequence is uniformly sampled from unif {8, 64}, and each symbol in the sequence is drawn with\nreplacement from a uniform distribution over the source vocabulary (ignoring SOS and separator).\nA deterministic task-specific transformation, described for each task below, is applied to the source\nsequence to yield the target sequence. As the training sequences are entirely determined by the\nsource sequence, there are close to $10^{135}$ training sequences for each task, and training examples\nare sampled from this space due to the random generation of source sequences. The following steps\nare followed before each training and test sequence is presented to the models: the SOS symbol\n($\\langle s \\rangle$) is prepended to the source sequence, which is concatenated with a separator symbol (|||) and\nthe target sequence, to which the EOS symbol ($\\langle /s \\rangle$) is appended.\n\nSequence Copying The source sequence is copied to form the target sequence. Sequences have\nthe form:\n\n$$\\langle s \\rangle\\, a_1 \\ldots a_k \\;|||\\; a_1 \\ldots a_k\\, \\langle /s \\rangle$$\n\nSequence Reversal The source sequence is deterministically reversed to produce the target sequence.\nSequences have the form:\n\n$$\\langle s \\rangle\\, a_1 a_2 \\ldots a_k \\;|||\\; a_k \\ldots a_2 a_1\\, \\langle /s \\rangle$$\n\nBigram Flipping The source side is restricted to even-length sequences. The target is produced\nby swapping, for all odd source sequence indices $i \\in [1, |\\mathrm{seq}|] \\wedge \\mathrm{odd}(i)$, the $i$th symbol with the\n$(i + 1)$th symbol. Sequences have the form:\n\n$$\\langle s \\rangle\\, a_1 a_2 a_3 a_4 \\ldots a_{k-1} a_k \\;|||\\; a_2 a_1 a_4 a_3 \\ldots a_k a_{k-1}\\, \\langle /s \\rangle$$\n\n4.2 ITG Transduction Tasks\n\nThe following tasks examine how well models can approach sequence transduction problems where\nthe source and target sequence are jointly generated by Inversion Transduction Grammars (ITG) [8],\na subclass of Synchronous Context-Free Grammars [16] often used in machine translation [17]. We\npresent two simple ITG-based datasets with interesting linguistic properties and their underlying\ngrammars. 
We show these grammars in Table 1, in Appendix C of the supplementary materials. For\neach synchronised non-terminal, an expansion is chosen according to the probability distribution\nspecified by the rule probability p at the beginning of each rule. For each grammar, \u2018A\u2019 is always the\nroot of the ITG tree.\nWe tuned the generative probabilities for recursive rules by hand so that the grammars generate left\nand right sequences of lengths 8 to 128 with relatively uniform distribution. We generate training\ndata by rejecting samples that are outside of the range [8, 64], and testing data by rejecting samples\noutside of the range [65, 128]. For terminal symbol-generating rules, we balance the classes so\nthat for k terminal-generating symbols in the grammar, each terminal-generating non-terminal \u2018X\u2019\ngenerates a vocabulary of approximately 128/k words, and each vocabulary word under that class is\nequiprobable. These design choices were made to maximise the similarity between the experimental\nsettings of the ITG tasks described here and the synthetic tasks described above.\n\nSubj\u2013Verb\u2013Obj to Subj\u2013Obj\u2013Verb A persistent challenge in machine translation is to learn to\nfaithfully reproduce high-level syntactic divergences between languages. For instance, when translating\nan English sentence with a non-finite verb into German, a transducer must locate and move\nthe verb over the object to the final position. We simulate this phenomenon with a synchronous\ngrammar which generates strings exhibiting verb movements. 
To add an extra challenge, we also\nsimulate simple relative clause embeddings to test the models\u2019 ability to transduce in the presence\nof unbounded recursive structures.\nA sample output of the grammar is presented here, with spaces between words being included for\nstylistic purposes, and where s, o, and v indicate subject, object, and verb terminals respectively, i\nand o mark input and output, and rp indicates a relative pronoun:\n\nsi1 vi28 oi5 oi7 si15 rpi si19 vi16 oi10 oi24 ||| so1 oo5 oo7 so15 rpo so19 vo16 oo10 oo24 vo28\n\nGenderless to gendered grammar We design a small grammar to simulate translations from a\nlanguage with gender-free articles to one with gender-specific definite and indefinite articles. A\nreal world example of such a translation would be from English (the, a) to German (der/die/das,\nein/eine/ein).\nThe grammar simulates sentences in (NP/(V/NP)) or (NP/V) form, where every noun phrase\ncan become an infinite sequence of nouns joined by a conjunction. Each noun in the source language\nhas a neutral definite or indefinite article. The matching word in the target language then needs to be\npreceded by its appropriate article. A sample output of the grammar is presented here, with spaces\nbetween words being included for stylistic purposes:\n\nwe11 the en19 and the em17 ||| wg11 das gn19 und der gm17\n\n4.3 Evaluation\n\nFor each task, test data is generated through the same procedure as training data, with the key difference\nthat the length of the source sequence is sampled from unif {65, 128}. As a result of this\nchange, we not only are assured that the models cannot observe any test sequences during training,\nbut are also measuring how well the sequence transduction capabilities of the evaluated models generalise\nbeyond the sequence lengths observed during training. 
To control for generalisation ability,\nwe also report accuracy scores on sequences separately sampled from the training set, which given\nthe size of the sample space are unlikely to have ever been observed during actual model training.\nFor each round of testing, we sample 1000 sequences from the appropriate test set. For each sequence,\nthe model reads in the source sequence and separator symbol, and begins generating the\nnext symbol by taking the maximally likely symbol from the softmax distribution over target symbols\nproduced by the model at each step. Based on this process, we give each model a coarse\naccuracy score, corresponding to the proportion of test sequences correctly predicted from beginning\nuntil end (EOS symbol) without error, as well as a fine accuracy score, corresponding to the\naverage proportion of each sequence correctly generated before the first error. Formally, we have:\n\n$$\\mathit{coarse} = \\frac{\\#\\mathit{correct}}{\\#\\mathit{seqs}} \\qquad \\mathit{fine} = \\frac{1}{\\#\\mathit{seqs}} \\sum_{i=1}^{\\#\\mathit{seqs}} \\frac{\\#\\mathit{correct}_i}{|\\mathit{target}_i|}$$\n\nwhere $\\#\\mathit{correct}$ and $\\#\\mathit{seqs}$ are the number of correctly predicted sequences (end-to-end) and the\ntotal number of sequences in the test batch (1000 in this experiment), respectively; $\\#\\mathit{correct}_i$ is the\nnumber of correctly predicted symbols before the first error in the $i$th sequence of the test batch, and\n$|\\mathit{target}_i|$ is the length of the target segment of that sequence (including EOS symbol).\n\n4.4 Models Compared and Experimental Setup\n\nFor each task, we use as benchmarks the Deep LSTMs described in [1], with 1, 2, 4, and 8 layers.\nAgainst these benchmarks, we evaluate neural Stack-, Queue-, and DeQue-enhanced LSTMs. When\nrunning experiments, we trained and tested a version of each model where all LSTMs in each model\nhave a hidden layer size of 256, and one for a hidden layer size of 512. The Stack/Queue/DeQue\nembedding size was arbitrarily set to 256, half the maximum hidden size. 
The number of parameters\nfor each architecture is reported in Table 2 of the appendix. Concretely, the neural\nStack-, Queue-, and DeQue-enhanced LSTMs have the same number of trainable parameters as a\ntwo-layer Deep LSTM. These all come from the extra connections to and from the memory module,\nwhich itself has no trainable parameters, regardless of its logical size.\nModels are trained with minibatch RMSProp [18], with a batch size of 10. We grid-searched learning\nrates across the set $\\{5 \\times 10^{-3}, 1 \\times 10^{-3}, 5 \\times 10^{-4}, 1 \\times 10^{-4}, 5 \\times 10^{-5}\\}$. We used gradient clipping\n[19], clipping all gradients above 1. Average training perplexity was calculated every 100 batches.\nTraining and test set accuracies were recorded every 1000 batches.\n\n5 Results and Discussion\n\nBecause of the impossibility of overfitting the datasets, we let the models train an unbounded number\nof steps, and report results at convergence. We present in Figure 2a the coarse- and fine-grained\naccuracies, for each task, of the best model of each architecture described in this paper alongside\nthe best performing Deep LSTM benchmark. The best models were automatically selected based on\naverage training perplexity. The LSTM benchmarks performed similarly across the range of random\ninitialisations, so the effect of this procedure is primarily to try and select the better performing\nStack/Queue/DeQue-enhanced LSTM. In most cases, this procedure does not yield the actual best-performing\nmodel, and in practice a more sophisticated procedure such as ensembling [20] should\nproduce better results.\nFor all experiments, the Neural Stack or Queue outperforms the Deep LSTM benchmarks, often by\na significant margin. For most experiments, if a Neural Stack- or Queue-enhanced LSTM learns\nto partially or consistently solve the problem, then so does the Neural DeQue. 
For experiments\nwhere the enhanced LSTMs solve the problem completely (consistent accuracy of 1) in training,\nthe accuracy persists in longer sequences in the test set, whereas benchmark accuracies drop for\nall experiments except the SVO to SOV and Gender Conjugation ITG transduction tasks.\n\nExperiment | Model | Training Coarse | Training Fine | Testing Coarse | Testing Fine\nSequence Copying | 4-layer LSTM | 0.98 | 0.98 | 0.01 | 0.50\nSequence Copying | Stack-LSTM | 0.89 | 0.94 | 0.00 | 0.22\nSequence Copying | Queue-LSTM | 1.00 | 1.00 | 1.00 | 1.00\nSequence Copying | DeQue-LSTM | 1.00 | 1.00 | 1.00 | 1.00\nSequence Reversal | 8-layer LSTM | 0.95 | 0.98 | 0.04 | 0.13\nSequence Reversal | Stack-LSTM | 1.00 | 1.00 | 1.00 | 1.00\nSequence Reversal | Queue-LSTM | 0.44 | 0.61 | 0.00 | 0.01\nSequence Reversal | DeQue-LSTM | 1.00 | 1.00 | 1.00 | 1.00\nBigram Flipping | 2-layer LSTM | 0.54 | 0.93 | 0.02 | 0.52\nBigram Flipping | Stack-LSTM | 0.44 | 0.90 | 0.00 | 0.48\nBigram Flipping | Queue-LSTM | 0.55 | 0.94 | 0.55 | 0.98\nBigram Flipping | DeQue-LSTM | 0.55 | 0.94 | 0.53 | 0.98\nSVO to SOV | 8-layer LSTM | 0.98 | 0.99 | 0.98 | 0.99\nSVO to SOV | Stack-LSTM | 1.00 | 1.00 | 1.00 | 1.00\nSVO to SOV | Queue-LSTM | 1.00 | 1.00 | 1.00 | 1.00\nSVO to SOV | DeQue-LSTM | 1.00 | 1.00 | 1.00 | 1.00\nGender Conjugation | 8-layer LSTM | 0.98 | 0.99 | 0.99 | 0.99\nGender Conjugation | Stack-LSTM | 0.93 | 0.97 | 0.93 | 0.97\nGender Conjugation | Queue-LSTM | 1.00 | 1.00 | 1.00 | 1.00\nGender Conjugation | DeQue-LSTM | 1.00 | 1.00 | 1.00 | 1.00\n\n(a) Comparing Enhanced LSTMs to Best Benchmarks\n\n(b) Comparison of Model Convergence during Training\n\nFigure 2: Results on the transduction tasks and convergence properties\n\n
Across\nall tasks which the enhanced LSTMs solve, the convergence on the top accuracy happens orders of\nmagnitude earlier for enhanced LSTMs than for benchmark LSTMs, as exemplified in Figure 2b.\nThe results for the sequence inversion and copying tasks serve as unit tests for our models, as the\ncontroller mainly needs to learn to push the appropriate number of times and then pop continuously.\nNonetheless, the failure of Deep LSTMs to learn such a regular pattern and generalise is itself\nindicative of the limitations of the benchmarks presented here, and of the relative expressive power\nof our models. Their ability to generalise perfectly to sequences up to twice as long as those attested\nduring training is also notable, and also attested in the other experiments. Finally, this pair of\nexperiments illustrates how while the neural Queue solves copying and the Stack solves reversal, a\nsimple LSTM controller can learn to operate a DeQue as either structure, and solve both tasks.\nThe results of the Bigram Flipping task for all models are consistent with the failure to consistently\ncorrectly generate the last two symbols of the sequence. We hypothesise that both Deep LSTMs and\nour models economically learn to pairwise flip the sequence tokens, and attempt to do so half the\ntime when reaching the EOS token. 
For the two ITG tasks, the success of Deep LSTM benchmarks\nrelative to their performance in other tasks can be explained by their ability to exploit short local\ndependencies dominating the longer dependencies in these particular grammars.\nOverall, the rapid convergence, where possible, on a general solution to a transduction problem\nin a manner which propagates to longer sequences without loss of accuracy is indicative that an\nunbounded memory-enhanced controller can learn to solve these problems procedurally, rather than\nmemorising the underlying distribution of the data.\n\n6 Conclusions\n\nThe experiments performed in this paper demonstrate that single-layer LSTMs enhanced by an unbounded\ndifferentiable memory capable of acting, in the limit, like a classical Stack, Queue, or\nDeQue, are capable of solving sequence-to-sequence transduction tasks for which Deep LSTMs\nfalter. Even in tasks for which benchmarks obtain high accuracies, the memory-enhanced LSTMs\nconverge earlier, and to higher accuracies, while requiring considerably fewer parameters than all\nbut the simplest of Deep LSTMs. We therefore believe these constitute a crucial addition to our neural\nnetwork toolbox, and that more complex linguistic transduction tasks such as machine translation\nor parsing will be rendered more tractable by their inclusion.\n\nReferences\n\n[1] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural\nnetworks. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,\neditors, Advances in Neural Information Processing Systems 27, pages 3104\u20133112. Curran\nAssociates, Inc., 2014.\n\n[2] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,\nand Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical\nmachine translation. arXiv preprint arXiv:1406.1078, 2014.\n\n[3] GZ Sun, C Lee Giles, HH Chen, and YC Lee. 
The neural network pushdown automaton:\nModel, stack and learning simulations. 1998.\n\n[4] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401,\n2014.\n\n[5] Alex Graves. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385 of\nStudies in Computational Intelligence. Springer, 2012.\n\n[6] Markus Dreyer, Jason R. Smith, and Jason Eisner. Latent-variable modeling of string trans-\nductions with finite-state methods. In Proceedings of the Conference on Empirical Methods in\nNatural Language Processing, EMNLP \u201908, pages 1080\u20131089, Stroudsburg, PA, USA, 2008.\nAssociation for Computational Linguistics.\n\n[7] Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri.\nOpenFST: A general and efficient weighted finite-state transducer library. In Implementation and\nApplication of Automata, volume 4783 of Lecture Notes in Computer Science, pages 11\u201323.\nSpringer Berlin Heidelberg, 2007.\n\n[8] Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel cor-\npora. Computational linguistics, 23(3):377\u2013403, 1997.\n\n[9] Alex Graves. Sequence transduction with recurrent neural networks. In Representation\nLearning Workshop, ICML. 2012.\n\n[10] Sreerupa Das, C Lee Giles, and Guo-Zheng Sun. Learning context-free grammars: Capabilities\nand limitations of a recurrent neural network with an external stack memory. In Proceedings\nof The Fourteenth Annual Conference of Cognitive Science Society. Indiana University, 1992.\n\n[11] Sreerupa Das, C Lee Giles, and Guo-Zheng Sun. Using prior knowledge in a NNPDA to\nlearn context-free languages. Advances in neural information processing systems, 1993.\n\n[12] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. Weakly supervised mem-\nory networks. CoRR, abs/1503.08895, 2015.\n\n[13] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. 
arXiv preprint arXiv:1505.00521, 2015.\n\n[14] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented re-\ncurrent nets. arXiv preprint arXiv:1503.01007, 2015.\n\n[15] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann ma-\nchines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10),\npages 807\u2013814, 2010.\n\n[16] Alfred V Aho and Jeffrey D Ullman. The theory of parsing, translation, and compiling.\nPrentice-Hall, Inc., 1972.\n\n[17] Dekai Wu and Hongsing Wong. Machine translation with a stochastic grammatical channel.\nIn Proceedings of the 17th international conference on Computational linguistics-Volume 2,\npages 1408\u20131415. Association for Computational Linguistics, 1998.\n\n[18] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running\naverage of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4,\n2012.\n\n[19] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. Understanding the exploding gradient\nproblem. Computing Research Repository (CoRR) abs/1211.5063, 2012.\n\n[20] Zhi-Hua Zhou, Jianxin Wu, and Wei Tang. Ensembling neural networks: many could be better\nthan all. Artificial intelligence, 137(1):239\u2013263, 2002.\n", "award": [], "sourceid": 1108, "authors": [{"given_name": "Edward", "family_name": "Grefenstette", "institution": "Google DeepMind"}, {"given_name": "Karl Moritz", "family_name": "Hermann", "institution": "Google DeepMind"}, {"given_name": "Mustafa", "family_name": "Suleyman", "institution": "Google DeepMind"}, {"given_name": "Phil", "family_name": "Blunsom", "institution": "Google DeepMind"}]}