{"title": "Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 198, "abstract": "Despite the recent achievements in machine learning, we are still very far from achieving real artificial intelligence. In this paper, we discuss the limitations of standard deep learning approaches and show that some of these limitations can be overcome by learning how to grow the complexity of a model in a structured way. Specifically, we study the simplest sequence prediction problems that are beyond the scope of what is learnable with standard recurrent networks, algorithmically generated sequences which can only be learned by models which have the capacity to count and to memorize sequences. We show that some basic algorithms can be learned from sequential data using a recurrent network associated with a trainable memory.", "full_text": "Inferring Algorithmic Patterns with\nStack-Augmented Recurrent Nets\n\nArmand Joulin\n\nFacebook AI Research\n\n770 Broadway, New York, USA.\n\najoulin@fb.com\n\nTomas Mikolov\n\nFacebook AI Research\n\n770 Broadway, New York, USA.\n\ntmikolov@fb.com\n\nAbstract\n\nDespite the recent achievements in machine learning, we are still very far from\nachieving real arti\ufb01cial intelligence. In this paper, we discuss the limitations of\nstandard deep learning approaches and show that some of these limitations can be\novercome by learning how to grow the complexity of a model in a structured way.\nSpeci\ufb01cally, we study the simplest sequence prediction problems that are beyond\nthe scope of what is learnable with standard recurrent networks, algorithmically\ngenerated sequences which can only be learned by models which have the capacity\nto count and to memorize sequences. 
We show that some basic algorithms can be learned from sequential data using a recurrent network associated with a trainable memory.

1 Introduction

Machine learning aims to find regularities in data to perform various tasks. Historically there have been two major sources of breakthroughs: scaling up the existing approaches to larger datasets, and the development of novel approaches [5, 14, 22, 30]. In recent years, a lot of progress has been made in scaling up learning algorithms, by either using alternative hardware such as GPUs [9] or by taking advantage of large clusters [28]. While improving the computational efficiency of the existing methods is important to deploy the models in real world applications [4], it is crucial for the research community to continue exploring novel approaches able to tackle new problems.
Recently, deep neural networks have become very successful at various tasks, leading to a shift in the computer vision [21] and speech recognition communities [11]. This breakthrough is commonly attributed to two aspects of deep networks: their similarity to the hierarchical, recurrent structure of the neocortex and the theoretical justification that certain patterns are more efficiently represented by functions employing multiple non-linearities instead of a single one [1, 25].
This paper investigates which patterns are difficult to represent and learn with the current state of the art methods. This would hopefully give us hints about how to design new approaches which will advance machine learning research further. In the past, this approach has led to crucial breakthrough results: the well-known XOR problem is an example of a trivial classification problem that cannot be solved using linear classifiers, but can be solved with a non-linear one. This popularized the use of non-linear hidden layers [30] and kernel methods [2].
Another well-known example is the parity problem described by Papert and Minsky [25]: it demonstrates that while a single non-linear hidden layer is sufficient to represent any function, it is not guaranteed to represent it efficiently, and in some cases can even require exponentially many more parameters (and thus, also training data) than what is sufficient for a deeper model. This led to the use of architectures that have several layers of non-linearities, currently known as deep learning models.

Sequence generator | Example
{anbn | n > 0} | aabbaaabbbabaaaaabbbbb
{anbncn | n > 0} | aaabbbcccabcaaaaabbbbbccccc
{anbncndn | n > 0} | aabbccddaaabbbcccdddabcd
{anb2n | n > 0} | aabbbbaaabbbbbbabb
{anbmcn+m | n, m > 0} | aabcccaaabbcccccabcc
n ∈ [1, k], X → nXn, X → = (k = 2) | 12=212122=221211121=12111

Table 1: Examples generated from the algorithms studied in this paper. In bold, the characters which can be predicted deterministically. During training, we do not have access to this information and at test time, we evaluate only on deterministically predictable characters.

Following this line of work, we study basic patterns which are difficult to represent and learn for standard deep models. In particular, we study learning regularities in sequences of symbols generated by simple algorithms. Interestingly, we find that these regularities are difficult to learn even for some advanced deep learning methods, such as recurrent networks. We attempt to increase the learning capabilities of recurrent nets by allowing them to learn how to control an infinite structured memory.
We explore two basic topologies of the structured memory: a pushdown stack and a list. Our structured memory is defined by constraining part of the recurrent matrix in a recurrent net [24]. We use multiplicative gating mechanisms as learnable controllers over the memory [8, 19] and show that this allows our network to operate as if it was performing simple read and write operations, such as PUSH or POP for a stack.
Among recent work with similar motivation, we are aware of the Neural Turing Machine [17] and Memory Networks [33]. However, our work can be considered more as a follow-up of the research done in the early nineties, when similar types of memory-augmented neural networks were studied [12, 26, 27, 37].

2 Algorithmic Patterns

We focus on sequences generated by simple, short algorithms. The goal is to learn regularities in these sequences by building predictive models. We are mostly interested in discrete patterns related to those that occur in the real world, such as various forms of long-term memory.
More precisely, we suppose that during training we have access only to a stream of data which is obtained by concatenating sequences generated by a given algorithm. We do not have access to the boundary of any sequence nor to sequences which are not generated by the algorithm. We denote the regularities in these sequences of symbols as algorithmic patterns. In this paper, we focus on algorithmic patterns which involve some form of counting and memorization. Examples of these patterns are presented in Table 1. For simplicity, we mostly focus on the unary and binary numeral systems to represent patterns.
This allows us to focus on designing a model which can learn these algorithms when the input is given in its simplest form.
Some algorithms can be described as context-free grammars; however, we are interested in the more general case of sequential patterns that have a short description length in some general Turing-complete computational system. Of particular interest are patterns relevant to developing a better language understanding. Finally, this study is limited to patterns whose symbols can be predicted in a single computational step, leaving out algorithms such as sorting or dynamic programming.

3 Related work

Some of the algorithmic patterns we study in this paper are closely related to context-free and context-sensitive grammars which were widely studied in the past. Some works used recurrent networks with hardwired symbolic structures [10, 15, 18]. These networks are continuous implementations of symbolic systems, and can deal with recursive patterns in computational linguistics. While these approaches are interesting for understanding the link between symbolic and sub-symbolic systems such as neural networks, they are often hand-designed for each specific grammar.
Wiles and Elman [34] show that simple recurrent networks are able to learn sequences of the form anbn and generalize on a limited range of n. While this is a promising result, their model does not truly learn how to count but instead relies mostly on memorization of the patterns seen in the training data. Rodriguez et al. [29] further studied the behavior of this network. Grünwald [18] designs a hardwired second-order recurrent network to tackle similar sequences. Christiansen and Chater [7] extended these results to grammars with larger vocabularies. This work shows that this type of architecture can learn complex internal representations of the symbols but cannot generalize to longer sequences generated by the same algorithm.
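As a concrete illustration, training streams for patterns such as anbn (Section 2, Table 1) are short to generate: sequences are sampled and concatenated without boundary markers. The sketch below is our own illustration, not the authors' data pipeline; the function name and interface are hypothetical.

```python
import random

def anbn_stream(num_sequences, max_n, seed=0):
    """Concatenate sequences of the form a^n b^n with random n in [1, max_n],
    without any boundary markers, as described in Section 2."""
    rng = random.Random(seed)
    chunks = []
    for _ in range(num_sequences):
        n = rng.randint(1, max_n)
        chunks.append("a" * n + "b" * n)
    return "".join(chunks)
```

At test time, only the deterministically predictable characters (the b's and the first a of the next sequence) would be evaluated.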
Besides using simple recurrent networks, other structures have been used to deal with recursive patterns, such as pushdown dynamical automata [31] or sequential cascaded networks [3, 27].
Hochreiter and Schmidhuber [19] introduced the Long Short Term Memory network (LSTM) architecture. While this model was originally developed to address the vanishing and exploding gradient problems, LSTM is also able to learn simple context-free and context-sensitive grammars [16, 36]. This is possible because its hidden units can choose through a multiplicative gating mechanism to be either linear or non-linear. The linear units allow the network to potentially count (one can easily add and subtract constants) and store a finite amount of information for a long period of time. These mechanisms are also used in the Gated Recurrent Unit network [8]. In our work we investigate the use of a similar mechanism in a context where the memory is unbounded and structured. As opposed to previous work, we do not need to "erase" our memory to store a new unit. More recently, Graves et al. [17] have extended LSTM with an attention mechanism to build a model which roughly resembles a Turing machine with limited tape. Their memory controller works with a fixed-size memory and it is not clear if its complexity is necessary for the simple problems they study.
Finally, many works have also used external memory modules with a recurrent network, such as stacks [12, 13, 20, 26, 37]. Zheng et al. [37] use a discrete external stack which may be hard to learn on long sequences. Das et al. [12] learn a continuous stack which has some similarities with ours. The mechanisms used in their work are quite different from ours: their memory cells are associated with weights to allow a continuous representation of the stack, in order to train it with a continuous optimization scheme.
On the other hand, our solution is closer to a standard RNN with special connectivities which simulate a stack with unbounded capacity. We tackle problems which are closely related to the ones addressed in these works and try to go further by exploring more challenging problems such as binary addition.

4 Model

4.1 Simple recurrent network

We consider sequential data that comes in the form of discrete tokens, such as characters or words. The goal is to design a model able to predict the next symbol in a stream of data. Our approach is based on a standard model called the recurrent neural network (RNN), popularized by Elman [14]. An RNN consists of an input layer, a hidden layer with a recurrent time-delayed connection and an output layer. The recurrent connection allows the propagation of information through time. Given a sequence of tokens, the RNN takes as input the one-hot encoding xt of the current token and predicts the probability yt of the next symbol. The hidden layer with m units stores additional information about the previous tokens seen in the sequence. More precisely, at each time t, the state of the hidden layer ht is updated based on its previous state ht-1 and the encoding xt of the current token, according to the following equation:

ht = σ(U xt + R ht-1),    (1)

where σ(x) = 1/(1 + exp(-x)) is the sigmoid activation function applied coordinate-wise, U is the d × m token embedding matrix and R is the m × m matrix of recurrent weights. Given the state of these hidden units, the network then outputs the probability vector yt of the next token, according to the following equation:

yt = f(V ht),    (2)

where f is the softmax function [6] and V is the m × d output matrix, where d is the number of different tokens. This architecture is able to learn relatively complex patterns similar in nature to the ones captured by N-grams.
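Equations (1) and (2) amount to a few lines of NumPy. This is a minimal sketch for illustration (we store U as an m × d array so that U xt is an m-vector, transposing the d × m convention above; function and variable names are ours):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, R, V):
    """One step of the simple recurrent network of Eq. (1)-(2).

    x_t: one-hot input of size d, h_prev: previous hidden state of size m.
    U: m x d embedding, R: m x m recurrent, V: d x m output matrix.
    """
    h_t = 1.0 / (1.0 + np.exp(-(U @ x_t + R @ h_prev)))  # Eq. (1), sigmoid
    z = V @ h_t
    y_t = np.exp(z - z.max())
    y_t /= y_t.sum()                                     # Eq. (2), softmax
    return h_t, y_t
```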
While this has made RNNs interesting for language modeling [23], they may not have the capacity to learn how algorithmic patterns are generated. In the next section, we show how to add an external memory to RNNs which has the theoretical capability to learn simple algorithmic patterns.

Figure 1: (a) Neural network extended with a push-down stack and a controlling mechanism that learns what action (among PUSH, POP and NO-OP) to perform. (b) The same model extended with a doubly-linked list with actions INSERT, LEFT, RIGHT and NO-OP.

4.2 Pushdown network

In this section, we describe a simple structured memory inspired by the pushdown automaton, i.e., an automaton which employs a stack. We train our network to learn how to operate this memory with standard optimization tools.
A stack is a type of persistent memory which can only be accessed through its topmost element. Three basic operations can be performed with a stack: POP removes the top element, PUSH adds a new element on top of the stack and NO-OP does nothing. For simplicity, we first consider a simplified version where the model can only choose between a PUSH or a POP at each time step. We suppose that this decision is made by a 2-dimensional variable at which depends on the state of the hidden variable ht:

at = f(A ht),    (3)

where A is a 2 × m matrix (m is the size of the hidden layer) and f is a softmax function. We denote by at[PUSH] the probability of the PUSH action, and by at[POP] the probability of the POP action. We suppose that the stack is stored at time t in a vector st of size p. Note that p could be increased on demand and does not have to be fixed, which allows the capacity of the model to grow. The top element is stored at position 0, with value st[0]:

st[0] = at[PUSH] σ(D ht) + at[POP] st-1[1],    (4)

where D is a 1 × m matrix.
If at[POP] is equal to 1, the top element is replaced by the value below (all values are moved by one position up in the stack structure). If at[PUSH] is equal to 1, we move all values down in the stack and add a value on top of the stack. Similarly, for an element stored at a depth i > 0 in the stack, we have the following update rule:

st[i] = at[PUSH] st-1[i - 1] + at[POP] st-1[i + 1].    (5)

We use the stack to carry information to the hidden layer at the next time step. When the stack is empty, st is set to -1. The hidden layer ht is now updated as:

ht = σ(U xt + R ht-1 + P s^k_{t-1}),    (6)

where P is an m × k recurrent matrix and s^k_{t-1} are the k top-most elements of the stack at time t - 1. In our experiments, we set k to 2. We call this model Stack RNN, and show it in Figure 1-a without the recurrent matrix R for clarity.
Stack with a no-operation. Adding the NO-OP action allows the stack to keep the same value on top by a minor change of the stack update rule. Eq. (4) is replaced by:

st[0] = at[PUSH] σ(D ht) + at[POP] st-1[1] + at[NO-OP] st-1[0].

Extension to multiple stacks. Using a single stack has serious limitations, especially considering that at each time step, only one action can be performed. We increase the capacity of the model by using multiple stacks in parallel. The stacks can interact through the hidden layer, allowing them to process more challenging patterns.

method | anbn | anbncn | anbncndn | anb2n | anbmcn+m
RNN | 25% | 23.3% | 13.3% | 23.3% | 33.3%
LSTM | 100% | 100% | 68.3% | 75% | 100%
List RNN 40+5 | 100% | 33.3% | 100% | 100% | 100%
Stack RNN 40+10 | 100% | 100% | 100% | 100% | 43.3%
Stack RNN 40+10 + rounding | 100% | 100% | 100% | 100% | 100%

Table 2: Comparison with RNN and LSTM on sequences generated by counting algorithms.
The sequences seen during training are such that n < 20 (and n + m < 20), and we test on sequences up to n = 60. We report the percentage of n for which the model was able to correctly predict the sequences. Performance above 33.3% means it is able to generalize to sequence lengths never seen during training.
Doubly-linked lists. While in this paper we mostly focus on an infinite memory based on stacks, it is straightforward to extend the model to other forms of infinite memory, for example, the doubly-linked list. A list is a one-dimensional memory where each node is connected to its left and right neighbors. There is a read/write head associated with the list. The head can move between nearby nodes and insert a new node at its current position. More precisely, we consider three different actions: INSERT, which inserts an element at the current position of the head, LEFT, which moves the head to the left, and RIGHT, which moves it to the right. Given a list L and a fixed head position HEAD, the updates are:

Lt[i] = at[RIGHT] Lt-1[i + 1] + at[LEFT] Lt-1[i - 1] + at[INSERT] σ(D ht)      if i = HEAD,
Lt[i] = at[RIGHT] Lt-1[i + 1] + at[LEFT] Lt-1[i - 1] + at[INSERT] Lt-1[i + 1]  if i < HEAD,
Lt[i] = at[RIGHT] Lt-1[i + 1] + at[LEFT] Lt-1[i - 1] + at[INSERT] Lt-1[i]      if i > HEAD.

Note that we can add a NO-OP operation as well. We call this model List RNN, and show it in Figure 1-b without the recurrent matrix R for clarity.
Optimization. The models presented above are continuous and can thus be trained with the stochastic gradient descent (SGD) method and back-propagation through time [30, 32, 35]. As patterns become more complex, more complex memory controllers must be learned. In practice, we observe that these more complex controllers are harder to learn with SGD. Using several random restarts seems to solve the problem in our case.
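Concretely, the continuous stack update of Eqs. (4)-(5), including the NO-OP extension, can be written in a few lines. This is our own illustrative sketch rather than the released implementation; a fixed-length array stands in for the unbounded stack, and the names are hypothetical.

```python
import numpy as np

def stack_update(s_prev, a, new_top):
    """Continuous stack update of Eq. (4)-(5), extended with NO-OP.

    s_prev: stack values (empty cells hold -1),
    a: action probabilities {"PUSH", "POP", "NO-OP"} summing to 1,
    new_top: sigma(D h_t), the candidate value pushed on top.
    """
    above = np.insert(s_prev[:-1], 0, -1.0)  # s_{t-1}[i - 1]
    below = np.append(s_prev[1:], -1.0)      # s_{t-1}[i + 1], -1 past the end
    # Eq. (5) for depths i > 0:
    s = a["PUSH"] * above + a["POP"] * below + a["NO-OP"] * s_prev
    # Eq. (4) for the top cell i = 0:
    s[0] = a["PUSH"] * new_top + a["POP"] * s_prev[1] + a["NO-OP"] * s_prev[0]
    return s
```

With a hard PUSH (probability 1) every value shifts one cell down and new_top lands on top; with a hard POP everything shifts one cell up.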
We have also explored other types of search-based procedures, as discussed in the supplementary material.
Rounding. Continuous operators on stacks introduce small imprecisions leading to numerical issues on very long sequences. While simply discretizing the controllers partially solves this problem, we design a more robust rounding procedure tailored to our model. We slowly make the controllers converge to discrete values by multiplying their weights by a constant which slowly goes to infinity. We finetune the weights of our network as this multiplicative variable increases, leading to a smoother rounding of our network. Finally, we remove unused stacks by exploring models which use only a subset of the stacks. While brute force would be exponential in the number of stacks, we can do it efficiently by building a tree of removable stacks and exploring it with depth-first search.

5 Experiments and results

First, we consider various sequences generated by simple algorithms, where the goal is to learn their generation rule [3, 12, 29]. We hope to understand the scope of algorithmic patterns each model can capture. We also evaluate the models on a standard language modeling dataset, Penn Treebank.
Implementation details. Stack and List RNNs are trained with SGD and backpropagation through time with 50 steps [32], a hard clipping of 15 to prevent gradient explosions [23], and an initial learning rate of 0.1. The learning rate is divided by 2 each time the entropy on the validation set is not decreasing. The depth k defined in Eq. (6) is set to 2. The free parameters are the number of hidden units, the number of stacks and the use of NO-OP. The baselines are RNNs with 40, 100 and 500 units, and LSTMs with 1 and 2 layers with 50, 100 and 200 units.
The hyper-parameters of the baselines are selected on the validation sets.

5.1 Learning simple algorithmic patterns

Given an algorithm with a short description length, we generate sequences and concatenate them into longer sequences. This is an unsupervised task, since the boundaries of each generated sequence are not known.

current | next | prediction | proba(next) | action (stack1) | action (stack2) | stack1[top] | stack2[top]
b | a | a | 0.99 | POP | POP | -1 | 0.53
a | a | a | 0.99 | PUSH | POP | 0.01 | 0.97
a | a | a | 0.95 | PUSH | PUSH | 0.18 | 0.99
a | a | a | 0.93 | PUSH | PUSH | 0.32 | 0.98
a | a | a | 0.91 | PUSH | PUSH | 0.40 | 0.97
a | a | a | 0.90 | PUSH | PUSH | 0.46 | 0.97
a | b | a | 0.10 | PUSH | PUSH | 0.52 | 0.97
b | b | b | 0.99 | PUSH | PUSH | 0.57 | 0.97
b | b | b | 1.00 | POP | PUSH | 0.52 | 0.56
b | b | b | 1.00 | POP | PUSH | 0.46 | 0.01
b | b | b | 1.00 | POP | PUSH | 0.40 | 0.00
b | b | b | 1.00 | POP | PUSH | 0.32 | 0.00
b | b | b | 1.00 | POP | PUSH | 0.18 | 0.00
b | b | b | 0.99 | POP | PUSH | 0.01 | 0.00
b | b | b | 0.99 | POP | POP | -1 | 0.00
b | b | b | 0.99 | POP | POP | -1 | 0.00
b | b | b | 0.99 | POP | POP | -1 | 0.00
b | b | b | 0.99 | POP | POP | -1 | 0.01
b | a | a | 0.99 | POP | POP | -1 | 0.56

Table 3: Example of the Stack RNN with 20 hidden units and 2 stacks on a sequence anb2n with n = 6. -1 means that the stack is empty. The depth k is set to 1 for clarity. We see that the first stack pushes an element every time it sees a and pops when it sees b. The second stack pushes when it sees a. When it sees b, it pushes if the first stack is not empty and pops otherwise. This shows how the two stacks interact to correctly predict the deterministic part of the sequence (shown in bold).

Figure 2: Comparison of RNN, LSTM, List RNN and Stack RNN on memorization (left) and the performance of Stack RNN on binary addition (right). The accuracy is the proportion of correctly predicted sequences generated with a given n. We use 100 hidden units and 10 stacks.
We study patterns related to counting and memorization, as shown in Table 1. To evaluate if a model has the capacity to understand the generation rule used to produce the sequences, it is tested on sequences it has not seen during training. Our experimental setting is the following: the training and validation sets are composed of sequences generated with n up to N < 20 while the test set is composed of sequences generated with n up to 60. During training, we incrementally increase the parameter n every few epochs until it reaches some N. At test time, we measure the performance by counting the number of correctly predicted sequences. A sequence is considered correctly predicted if we correctly predict its deterministic part, shown in bold in Table 1. On these toy examples, the recurrent matrix R defined in Eq. (1) is set to 0 to isolate the mechanisms that the stack and the list can capture.
Counting. Results on patterns generated by "counting" algorithms are shown in Table 2. We report the percentage of sequence lengths for which a method is able to correctly predict sequences of that length. List RNN and Stack RNN have 40 hidden units and either 5 lists or 10 stacks. For these tasks, the NO-OP operation is not used. Table 2 shows that RNNs are unable to generalize to longer sequences, and they only correctly predict sequences seen during training. LSTM is able to generalize to longer sequences, which shows that it is able to count since the hidden units in an LSTM can be linear [16]. With a finer hyper-parameter search, the LSTM should be able to achieve 100% on all of these tasks. Despite the absence of linear units, these models are also able to generalize. For anbmcn+m, rounding is required to obtain the best performance.
Table 3 shows an example of the actions performed by a Stack RNN with two stacks on a sequence of the form anb2n.
For clarity, we show a sequence generated with n equal to 6, and we use discretization. The Stack RNN pushes an element on both stacks when it sees a. The first stack pops elements when the input is b, and the second stack starts popping only when the first one is empty. Note that the second stack pushes a special value to keep track of the sequence length, i.e., 0.56.
Memorization. Figure 2 shows results on memorization for a dictionary with two elements. Stack RNN has 100 units and 10 stacks, and List RNN has 10 lists. We use random restarts and we repeat this process multiple times. Stack RNN and List RNN are able to learn memorization, while RNN and LSTM do not seem to generalize. In practice, List RNN is more unstable than Stack RNN and overfits on the training set more frequently. This instability may be explained by the higher number of actions the controller can choose from (4 versus 3). For this reason, we focus on Stack RNN in the rest of the experiments.

Figure 3: An example of a learned Stack RNN that performs binary addition. The last column is our interpretation of the functionality learned by the different stacks. The color code is: green means PUSH, red means POP and grey means actions equivalent to NO-OP. We show the current (discretized) value on top of each stack at each given time. The sequence is read from left to right, one character at a time. In bold is the part of the sequence which has to be predicted. Note that the result is written in reverse.

Binary addition. Given a sequence representing a binary addition, e.g., "101+1=", the goal is to predict the result, e.g., "110." where "." represents the end of the sequence. As opposed to the previous tasks, this task is supervised, i.e., the location of the deterministic tokens is provided. The result of the addition is asked for in reverse order, e.g., "011." in the previous example.
As previously, we train on short sequences and test on longer ones. The length of the two input numbers is chosen such that the sum of their lengths is equal to n (less than 20 during training and up to 60 at test time). Their most significant digit is always set to 1. Stack RNN has 100 hidden units with 10 stacks. The right panel of Figure 2 shows the results averaged over multiple runs (with random restarts). While Stack RNN generalizes to longer numbers, it overfits for some runs on the validation set, leading to a larger error bar than in the previous experiments.
Figure 3 shows an example of a model which generalizes to long sequences of binary addition. This example illustrates the moderately complex behavior that the Stack RNN learns to solve this task: the first stack keeps track of where we are in the sequence, i.e., either reading the first number, reading the second number or writing the result. Stack 6 keeps the first number in memory. Interestingly, the first number is first captured by stacks 3 and 5 and then copied to stack 6. The second number is stored on stack 3, while its length is captured on stack 4 (by pushing a one and then a set of zeros). When producing the result, the values stored on these three stacks are popped. Finally, stack 5 takes care of the carry: it switches between two states (0 or 1) which explicitly indicate whether there is a carry or not. While this use of stacks is not optimal in the sense of minimal description length, it is able to generalize to sequences never seen before.

5.2 Language modeling.

Model | Ngram | Ngram + Cache | RNN | LSTM | SRCN [24] | Stack RNN
Validation perplexity | - | - | 137 | 120 | 120 | 124
Test perplexity | 141 | 125 | 129 | 115 | 115 | 118

Table 4: Comparison of RNN, LSTM, SRCN [24] and Stack RNN on the Penn Treebank Corpus.
We\nuse the recurrent matrix R in Stack RNN as well as 100 hidden units and 60 stacks.\nWe compare Stack RNN with RNN, LSTM and SRCN [24] on the standard language modeling\ndataset Penn Treebank Corpus. SRCN is a standard RNN with additional self-connected linear\nunits which capture long term dependencies similar to bag of words. The models have only one\nhidden layer with 100 hidden units. Table 4 shows that Stack RNN performs better than RNN with\na comparable number of parameters, but not as well as LSTM and SRCN. Empirically, we observe\nthat Stack RNN learns to store exponentially decaying bag of words similar in nature to the memory\nof SRCN.\n6 Discussion and future work\nContinuous versus discrete model and search. Certain simple algorithmic patterns can be ef\ufb01-\nciently learned using a continuous optimization approach (stochastic gradient descent) applied to a\ncontinuous model representation (in our case RNN). Note that Stack RNN works better than prior\nwork based on RNN from the nineties [12, 34, 37].\nIt seems also simpler than many other ap-\nproaches designed for these tasks [3, 17, 31]. However, it is not clear if a continuous representation\nis completely appropriate for learning algorithmic patterns. It may be more natural to attempt to\nsolve these problems with a discrete model. This motivates us to try to combine continuous and\ndiscrete optimization. It is possible that the future of learning of algorithmic patterns will involve\nsuch combination of discrete and continuous optimization.\nLong-term memory. While in theory using multiple stacks for representing memory is as powerful\nas a Turing complete computational system, intricate interactions between stacks need to be learned\nto capture more complex algorithmic patterns. Stack RNN also requires the input and output se-\nquences to be in the right format (e.g., memorization is in reversed order). 
It would be interesting\nto consider in the future other forms of memory which may be more \ufb02exible, as well as additional\nmechanisms which allow to perform multiple steps with the memory, such as loop or random access.\nFinally, complex algorithmic patterns can be more easily learned by composing simpler algorithms.\nDesigning a model which possesses a mechanism to compose algorithms automatically and training\nit on incrementally harder tasks is a very important research direction.\n\n7 Conclusion\nWe have shown that certain dif\ufb01cult pattern recognition problems can be solved by augmenting a\nrecurrent network with structured, growing (potentially unlimited) memory. We studied very simple\nmemory structures such as a stack and a list, but, the same approach can be used to learn how to\noperate more complex ones (for example a multi-dimensional tape). While currently the topology\nof the long term memory is \ufb01xed, we think that it should be learned from the data as well.\nAcknowledgment. We would like to thank Arthur Szlam, Keith Adams, Jason Weston, Yann LeCun\nand the rest of the Facebook AI Research team for their useful comments.\nReferences\n[1] Y. Bengio and Y. LeCun. Scaling learning algorithms towards ai. Large-scale kernel machines, 2007.\n[2] C. M. Bishop. Pattern recognition and machine learning. springer New York, 2006.\n[3] M. Bod\u00b4en and J. Wiles. Context-free and context-sensitive dynamics in recurrent neural networks. Con-\n\nnection Science, 2000.\n\nThe code is available at https://github.com/facebook/Stack-RNN\n\n8\n\n\f[4] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT. Springer, 2010.\n[5] L. Breiman. Random forests. Machine learning, 45(1):5\u201332, 2001.\n[6] J. S. Bridle. Probabilistic interpretation of feedforward classi\ufb01cation network outputs, with relationships\n\nto statistical pattern recognition. In Neurocomputing, pages 227\u2013236. Springer, 1990.\n\n[7] M. H. 
Christiansen and N. Chater. Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2):157–205, 1999.
[8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. arXiv, 2015.
[9] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint, 2011.
[10] M. W. Crocker. Mechanisms for sentence processing. University of Edinburgh, 1996.
[11] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, 20(1):30–42, 2012.
[12] S. Das, C. Giles, and G. Sun. Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In ACCSS, 1992.
[13] S. Das, C. Giles, and G. Sun. Using prior knowledge in a NNPDA to learn context-free languages. NIPS, 1993.
[14] J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[15] M. Fanty. Context-free parsing in connectionist networks. Parallel natural language processing, 1994.
[16] F. A. Gers and J. Schmidhuber. LSTM recurrent networks learn simple context-free and context-sensitive languages. Transactions on Neural Networks, 12(6):1333–1340, 2001.
[17] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint, 2014.
[18] P. Grünwald. A recurrent network that performs a context-sensitive prediction task. In ACCSS, 1996.
[19] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[20] S. Holldobler, Y. Kalinke, and H. Lehmann. Designing a counter: Another case study of dynamics and activation landscapes in recurrent networks. In Advances in Artificial Intelligence, 1997.
[21] A. Krizhevsky, I. Sutskever, and G. Hinton.
ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. 1998.
[23] T. Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.
[24] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. A. Ranzato. Learning longer memory in recurrent neural networks. arXiv preprint, 2014.
[25] M. Minsky and S. Papert. Perceptrons. MIT Press, 1969.
[26] M. C. Mozer and S. Das. A connectionist symbol manipulator that discovers the structure of context-free languages. NIPS, 1993.
[27] J. B. Pollack. The induction of dynamical recognizers. Machine Learning, 7(2-3):227–252, 1991.
[28] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
[29] P. Rodriguez, J. Wiles, and J. L. Elman. A recurrent neural network that learns to count. Connection Science, 1999.
[30] D. E. Rumelhart, G. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, DTIC Document, 1985.
[31] W. Tabor. Fractal encoding of context-free grammars in connectionist networks. Expert Systems, 2000.
[32] P. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.
[33] J. Weston, S. Chopra, and A. Bordes. Memory networks. In ICLR, 2015.
[34] J. Wiles and J. Elman. Learning to count without a counter: A case study of dynamics and activation landscapes in recurrent networks. In ACCSS, 1995.
[35] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. Back-propagation: Theory, architectures and applications, pages 433–486, 1995.
[36] W. Zaremba and I. Sutskever. Learning to execute.
arXiv preprint, 2014.
[37] Z. Zeng, R. M. Goodman, and P. Smyth. Discrete recurrent neural networks for grammatical inference. Transactions on Neural Networks, 5(2):320–330, 1994.