{"title": "End-To-End Memory Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2440, "page_last": 2448, "abstract": "We introduce a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of Memory Network (Weston et al., 2015) but unlike the model in that work, it is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings. It can also be seen as an extension of RNNsearch to the case where multiple computational steps (hops) are performed per output symbol. The flexibility of the model allows us to apply it to tasks as diverse as (synthetic) question answering and to language modeling. For the former our approach is competitive with Memory Networks, but with less supervision. For the latter, on the Penn TreeBank and Text8 datasets our approach demonstrates comparable performance to RNNs and LSTMs. In both cases we show that the key concept of multiple computational hops yields improved results.", "full_text": "End-To-End Memory Networks\n\nSainbayar Sukhbaatar\nDept. of Computer Science\n\nCourant Institute, New York University\n\nsainbar@cs.nyu.edu\n\nArthur Szlam\n\nJason Weston\n\nRob Fergus\n\nFacebook AI Research\n\n{aszlam,jase,robfergus}@fb.com\n\nNew York\n\nAbstract\n\nWe introduce a neural network with a recurrent attention model over a possibly\nlarge external memory. The architecture is a form of Memory Network [23]\nbut unlike the model in that work, it is trained end-to-end, and hence requires\nsigni\ufb01cantly less supervision during training, making it more generally applicable\nin realistic settings. 
It can also be seen as an extension of RNNsearch [2] to the\ncase where multiple computational steps (hops) are performed per output symbol.\nThe \ufb02exibility of the model allows us to apply it to tasks as diverse as (synthetic)\nquestion answering [22] and to language modeling. For the former our approach\nis competitive with Memory Networks, but with less supervision. For the latter,\non the Penn TreeBank and Text8 datasets our approach demonstrates comparable\nperformance to RNNs and LSTMs. In both cases we show that the key concept\nof multiple computational hops yields improved results.\n\nIntroduction\n\n1\nTwo grand challenges in arti\ufb01cial intelligence research have been to build models that can make\nmultiple computational steps in the service of answering a question or completing a task, and\nmodels that can describe long term dependencies in sequential data.\nRecently there has been a resurgence in models of computation using explicit storage and a notion\nof attention [23, 8, 2]; manipulating such a storage offers an approach to both of these challenges.\nIn [23, 8, 2], the storage is endowed with a continuous representation; reads from and writes to the\nstorage, as well as other processing steps, are modeled by the actions of neural networks.\nIn this work, we present a novel recurrent neural network (RNN) architecture where the recurrence\nreads from a possibly large external memory multiple times before outputting a symbol. Our model\ncan be considered a continuous form of the Memory Network implemented in [23]. The model in\nthat work was not easy to train via backpropagation, and required supervision at each layer of the\nnetwork. The continuity of the model we present here means that it can be trained end-to-end from\ninput-output pairs, and so is applicable to more tasks, i.e. tasks where such supervision is not avail-\nable, such as in language modeling or realistically supervised question answering tasks. 
Our model can also be seen as a version of RNNsearch [2] with multiple computational steps (which we term "hops") per output symbol. We will show experimentally that the multiple hops over the long-term memory are crucial to good performance of our model on these tasks, and that training the memory representation can be integrated in a scalable manner into our end-to-end neural network model.

2 Approach
Our model takes a discrete set of inputs x1, ..., xn that are to be stored in the memory, a query q, and outputs an answer a. Each of the xi, q, and a contains symbols coming from a dictionary with V words. The model writes all x to the memory up to a fixed buffer size, and then finds a continuous representation for the x and q. The continuous representation is then processed via multiple hops to output a. This allows backpropagation of the error signal through multiple memory accesses back to the input during training.

2.1 Single Layer
We start by describing our model in the single layer case, which implements a single memory hop operation. We then show it can be stacked to give multiple hops in memory.
Input memory representation: Suppose we are given an input set x1, ..., xi to be stored in memory. The entire set of {xi} is converted into memory vectors {mi} of dimension d, computed by embedding each xi in a continuous space, in the simplest case using an embedding matrix A (of size d×V). The query q is also embedded (again, in the simplest case via another embedding matrix B with the same dimensions as A) to obtain an internal state u. In the embedding space, we compute the match between u and each memory mi by taking the inner product followed by a softmax:

    p_i = Softmax(u^T m_i),    (1)

where Softmax(z_i) = e^{z_i} / Σ_j e^{z_j}. 
Defined in this way, p is a probability vector over the inputs.
Output memory representation: Each xi has a corresponding output vector ci (given in the simplest case by another embedding matrix C). The response vector from the memory o is then a sum over the transformed inputs ci, weighted by the probability vector from the input:

    o = Σ_i p_i c_i.    (2)

Because the function from input to output is smooth, we can easily compute gradients and backpropagate through it. Other recently proposed forms of memory or attention take this approach, notably Bahdanau et al. [2] and Graves et al. [8]; see also [9].
Generating the final prediction: In the single layer case, the sum of the output vector o and the input embedding u is then passed through a final weight matrix W (of size V×d) and a softmax to produce the predicted label:

    â = Softmax(W(o + u)).    (3)

The overall model is shown in Fig. 1(a). During training, all three embedding matrices A, B and C, as well as W, are jointly learned by minimizing a standard cross-entropy loss between â and the true label a. Training is performed using stochastic gradient descent (see Section 4.2 for more details).

Figure 1: (a): A single layer version of our model. (b): A three layer version of our model. In practice, we can constrain several of the embedding matrices to be the same (see Section 2.2).

2.2 Multiple Layers
We now extend our model to handle K hop operations. 
The memory layers are stacked in the following way:
• The input to layers above the first is the sum of the output o^k and the input u^k from layer k (different ways to combine o^k and u^k are proposed later):

    u^{k+1} = u^k + o^k.    (4)

• Each layer has its own embedding matrices A^k, C^k, used to embed the inputs {xi}. However, as discussed below, they are constrained to ease training and reduce the number of parameters.
• At the top of the network, the input to W also combines the input and the output of the top memory layer: â = Softmax(W u^{K+1}) = Softmax(W(o^K + u^K)).

We explore two types of weight tying within the model:

1. Adjacent: the output embedding for one layer is the input embedding for the one above, i.e. A^{k+1} = C^k. We also constrain (a) the answer prediction matrix to be the same as the final output embedding, i.e. W^T = C^K, and (b) the question embedding to match the input embedding of the first layer, i.e. B = A^1.

2. Layer-wise (RNN-like): the input and output embeddings are the same across different layers, i.e. A^1 = A^2 = ... = A^K and C^1 = C^2 = ... = C^K. We have found it useful to add a linear mapping H to the update of u between hops; that is, u^{k+1} = H u^k + o^k. This mapping is learnt along with the rest of the parameters and used throughout our experiments for layer-wise weight tying.

A three-layer version of our memory model is shown in Fig. 1(b). 
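As a concrete illustration, the forward pass of Eqs. (1)-(4) with adjacent weight tying can be sketched in a few lines of numpy. This is a minimal sketch, not the released implementation: the function and variable names are ours, the weights are passed in rather than learned, and sentences are assumed pre-encoded as bag-of-words vectors (W is kept as a free parameter here, whereas the adjacent scheme in the paper additionally ties W^T = C^K).

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def memn2n_forward(story_bows, query_bow, embeddings, W, K=3):
    """One forward pass with K hops and adjacent weight tying.

    story_bows : (n, V) bag-of-words rows, one per sentence x_i
    query_bow  : (V,)   bag-of-words for the question q
    embeddings : list of K+1 matrices of shape (d, V); adjacent tying
                 means A^k = embeddings[k-1], C^k = embeddings[k],
                 and B = A^1 = embeddings[0]
    W          : (V, d) final prediction matrix
    """
    u = embeddings[0] @ query_bow              # u^1 = B q
    for k in range(K):
        m = story_bows @ embeddings[k].T       # memory vectors m_i = A^k x_i
        c = story_bows @ embeddings[k + 1].T   # output vectors c_i = C^k x_i
        p = softmax(m @ u)                     # p_i = Softmax(u^T m_i)   (Eq. 1)
        o = p @ c                              # o = sum_i p_i c_i        (Eq. 2)
        u = u + o                              # u^{k+1} = u^k + o^k      (Eq. 4)
    return softmax(W @ u)                      # â = Softmax(W u^{K+1})   (Eq. 3)
```

Because every step is a smooth function of the parameters, the whole K-hop computation is differentiable end-to-end, which is exactly what permits training from input-output pairs alone.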
Overall, it is similar to the\nMemory Network model in [23], except that the hard max operations within each layer have been\nreplaced with a continuous weighting from the softmax.\nNote that if we use the layer-wise weight tying scheme, our model can be cast as a traditional\nRNN where we divide the outputs of the RNN into internal and external outputs. Emitting an\ninternal output corresponds to considering a memory, and emitting an external output corresponds\nto predicting a label. From the RNN point of view, u in Fig. 1(b) and Eqn. 4 is a hidden state, and\nthe model generates an internal output p (attention weights in Fig. 1(a)) using A. The model then\ningests p using C, updates the hidden state, and so on1. Here, unlike a standard RNN, we explicitly\ncondition on the outputs stored in memory during the K hops, and we keep these outputs soft,\nrather than sampling them. Thus our model makes several computational steps before producing an\noutput meant to be seen by the \u201coutside world\u201d.\n3 Related Work\nA number of recent efforts have explored ways to capture long-term structure within sequences\nusing RNNs or LSTM-based models [4, 7, 12, 15, 10, 1]. The memory in these models is the state\nof the network, which is latent and inherently unstable over long timescales. The LSTM-based\nmodels address this through local memory cells which lock in the network state from the past. In\npractice, the performance gains over carefully trained RNNs are modest (see Mikolov et al. 
[15]).\nOur model differs from these in that it uses a global memory, with shared read and write functions.\nHowever, with layer-wise weight tying our model can be viewed as a form of RNN which only\nproduces an output after a \ufb01xed number of time steps (corresponding to the number of hops), with\nthe intermediary steps involving memory input/output operations that update the internal state.\nSome of the very early work on neural networks by Steinbuch and Piske[19] and Taylor [21] con-\nsidered a memory that performed nearest-neighbor operations on stored input vectors and then \ufb01t\nparametric models to the retrieved sets. This has similarities to a single layer version of our model.\nSubsequent work in the 1990\u2019s explored other types of memory [18, 5, 16]. For example, Das\net al. [5] and Mozer et al. [16] introduced an explicit stack with push and pop operations which has\nbeen revisited recently by [11] in the context of an RNN model.\nClosely related to our model is the Neural Turing Machine of Graves et al. [8], which also uses\na continuous memory representation. The NTM memory uses both content and address-based\naccess, unlike ours which only explicitly allows the former, although the temporal features that we\nwill introduce in Section 4.1 allow a kind of address-based access. However, in part because we\nalways write each memory sequentially, our model is somewhat simpler, not requiring operations\nlike sharpening. Furthermore, we apply our memory model to textual reasoning tasks, which\nqualitatively differ from the more abstract operations of sorting and recall tackled by the NTM.\n\n1Note that in this view, the terminology of input and output from Fig. 1 is \ufb02ipped - when viewed as a\ntraditional RNN with this special conditioning of outputs, A becomes part of the output embedding of the\nRNN and C becomes the input embedding.\n\n3\n\n\fOur model is also related to Bahdanau et al. [2]. 
In that work, a bidirectional RNN-based encoder and a gated RNN-based decoder were used for machine translation. The decoder uses an attention model that finds which hidden states from the encoding are most useful for outputting the next translated word; the attention model uses a small neural network that takes as input a concatenation of the current hidden state of the decoder and each of the encoder's hidden states. A similar attention model is also used in Xu et al. [24] for generating image captions. Our "memory" is analogous to their attention mechanism, although the attention in [2] extends only over a single sentence rather than many, as in our case. Furthermore, our model makes several hops on the memory before making an output; we will see below that this is important for good performance. There are also differences in the architecture of the small network used to score the memories compared to our scoring approach; we use a simple linear layer, whereas they use a more sophisticated gated architecture.
We will apply our model to language modeling, an extensively studied task. Goodman [6] showed simple but effective approaches which combine n-grams with a cache. Bengio et al. [3] ignited interest in using neural network based models for the task, with RNNs [14] and LSTMs [10, 20] showing clear performance gains over traditional methods. Indeed, the current state-of-the-art is held by variants of these models, for example very large LSTMs with Dropout [25] or RNNs with diagonal constraints on the weight matrix [15]. 
With appropriate weight tying, our model can be regarded as a modified form of RNN, where the recurrence is indexed by memory lookups to the word sequence rather than indexed by the sequence itself.

4 Synthetic Question and Answering Experiments
We perform experiments on the synthetic QA tasks defined in [22] (using version 1.1 of the dataset). A given QA task consists of a set of statements, followed by a question whose answer is typically a single word (in a few tasks, answers are a set of words). The answer is available to the model at training time, but must be predicted at test time. There are a total of 20 different types of tasks that probe different forms of reasoning and deduction. Here are samples of three of the tasks:

Sam walks into the kitchen.
Sam picks up an apple.
Sam walks into the bedroom.
Sam drops the apple.
Q: Where is the apple?
A. Bedroom

Mary journeyed to the den.
Mary went back to the kitchen.
John journeyed to the bedroom.
Mary discarded the milk.
Q: Where was the milk before the den?
A. Hallway

Brian is a lion.
Julius is a lion.
Julius is white.
Bernhard is green.
Q: What color is Brian?
A. White

Note that for each question, only some subset of the statements contain information needed for the answer, and the others are essentially irrelevant distractors (e.g. the first sentence in the first example). In the Memory Networks of Weston et al. [22], this supporting subset was explicitly indicated to the model during training, and the key difference between that work and this one is that this information is no longer provided. Hence, the model must deduce for itself at training and test time which sentences are relevant and which are not.
Formally, for one of the 20 QA tasks, we are given example problems, each having a set of I sentences {xi} where I ≤ 320, a question sentence q and an answer a. Let the jth word of sentence i be xij, represented by a one-hot vector of length V (where the vocabulary is of size V = 177, reflecting the simplistic nature of the QA language). The same representation is used for the question q and answer a. Two versions of the data are used, one that has 1000 training problems per task and a second larger one with 10,000 per task.

4.1 Model Details
Unless otherwise stated, all experiments used a K = 3 hops model with the adjacent weight sharing scheme. For all tasks that output lists (i.e. the answers are multiple words), we take each possible combination of possible outputs and record them as a separate answer vocabulary word.
Sentence Representation: In our experiments we explore two different representations for the sentences. The first is the bag-of-words (BoW) representation that takes the sentence xi = {xi1, xi2, ..., xin}, embeds each word and sums the resulting vectors: e.g. mi = Σ_j A xij and ci = Σ_j C xij. The input vector u representing the question is also embedded as a bag of words: u = Σ_j B qj. This has the drawback that it cannot capture the order of the words in the sentence, which is important for some tasks.
We therefore propose a second representation that encodes the position of words within the sentence. This takes the form mi = Σ_j l_j · A xij, where · is an element-wise multiplication. l_j is a column vector with the structure l_kj = (1 − j/J) − (k/d)(1 − 2j/J) (assuming 1-based indexing), with J being the number of words in the sentence, and d the dimension of the embedding. This sentence representation, which we call position encoding (PE), means that the order of the words now affects mi. 
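The position-encoding weights can be sketched directly from the formula above. The helper names below are ours; the sketch assumes word embeddings A x_ij are supplied as the rows of a (J, d) array.

```python
import numpy as np

def position_encoding(J, d):
    """Matrix L of shape (J, d) with L[j-1, k-1] = (1 - j/J) - (k/d)(1 - 2j/J),
    using the paper's 1-based indices j (word position) and k (embedding dim)."""
    j = np.arange(1, J + 1)[:, None]   # word positions 1..J (column)
    k = np.arange(1, d + 1)[None, :]   # embedding dims 1..d (row)
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

def encode_sentence(word_embs):
    """m_i = sum_j l_j * (A x_ij) for one sentence; word_embs is (J, d),
    with row j holding the embedding A x_ij of the jth word."""
    J, d = word_embs.shape
    return (position_encoding(J, d) * word_embs).sum(axis=0)
```

Unlike a plain bag-of-words sum, permuting the rows of `word_embs` changes the result, which is the whole point of the encoding.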
The same representation is used for questions, memory inputs and memory outputs.
Temporal Encoding: Many of the QA tasks require some notion of temporal context, i.e. in the first example of Section 4, the model needs to understand that Sam is in the bedroom after he is in the kitchen. To enable our model to address them, we modify the memory vector so that mi = Σ_j A xij + TA(i), where TA(i) is the ith row of a special matrix TA that encodes temporal information. The output embedding is augmented in the same way with a matrix TC (e.g. ci = Σ_j C xij + TC(i)). Both TA and TC are learned during training. They are also subject to the same sharing constraints as A and C. Note that sentences are indexed in reverse order, reflecting their relative distance from the question, so that x1 is the last sentence of the story.
Learning time invariance by injecting random noise: we have found it helpful to add "dummy" memories to regularize TA. That is, at training time we can randomly add 10% of empty memories to the stories. We refer to this approach as random noise (RN).
4.2 Training Details
10% of the bAbI training set was held out to form a validation set, which was used to select the optimal model architecture and hyperparameters. Our models were trained using a learning rate of η = 0.01, annealed by η/2 every 25 epochs until 100 epochs were reached. No momentum or weight decay was used. The weights were initialized randomly from a Gaussian distribution with zero mean and σ = 0.1. When trained on all tasks simultaneously with 1k training samples (10k training samples), 60 epochs (20 epochs) were used, with learning rate anneals of η/2 every 15 epochs (5 epochs). All training uses a batch size of 32 (but cost is not averaged over a batch), and gradients with an ℓ2 norm larger than 40 are divided by a scalar to have norm 40. 
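The gradient rescaling just described amounts to dividing by a common scalar whenever a gradient's ℓ2 norm exceeds the threshold. A minimal sketch (our own helper name; applied per weight matrix, as in the QA setup):

```python
import numpy as np

def rescale_gradient(g, max_norm=40.0):
    """If the l2 norm of gradient array g exceeds max_norm, divide g by a
    scalar so that its norm becomes exactly max_norm; otherwise leave it."""
    norm = float(np.sqrt((g ** 2).sum()))
    return g * (max_norm / norm) if norm > max_norm else g
```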
In some of our\nexperiments, we explored commencing training with the softmax in each memory layer removed,\nmaking the model entirely linear except for the \ufb01nal softmax for answer prediction. When the\nvalidation loss stopped decreasing, the softmax layers were re-inserted and training recommenced.\nWe refer to this as linear start (LS) training.\nIn LS training, the initial learning rate is set to\n\u03b7 = 0.005. The capacity of memory is restricted to the most recent 50 sentences. Since the number\nof sentences and the number of words per sentence varied between problems, a null symbol was\nused to pad them all to a \ufb01xed size. The embedding of the null symbol was constrained to be zero.\nOn some tasks, we observed a large variance in the performance of our model (i.e. sometimes failing\nbadly, other times not, depending on the initialization). To remedy this, we repeated each training\n10 times with different random initializations, and picked the one with the lowest training error.\n4.3 Baselines\nWe compare our approach2 (abbreviated to MemN2N) to a range of alternate models:\n\u2022 MemNN: The strongly supervised AM+NG+NL Memory Networks approach, proposed in [22].\nThis is the best reported approach in that paper. It uses a max operation (rather than softmax) at\neach layer which is trained directly with supporting facts (strong supervision). It employs n-gram\nmodeling, nonlinear layers and an adaptive number of hops per query.\n\n\u2022 MemNN-WSH: A weakly supervised heuristic version of MemNN where the supporting sen-\ntence labels are not used in training. Since we are unable to backpropagate through the max\noperations in each layer, we enforce that the \ufb01rst memory hop should share at least one word with\nthe question, and that the second memory hop should share at least one word with the \ufb01rst hop and\nat least one word with the answer. 
All those memories that conform are called valid memories, and the goal during training is to rank them higher than invalid memories using the same ranking criteria as during strongly supervised training.
• LSTM: A standard LSTM model, trained using question / answer pairs only (i.e. also weakly supervised). For more detail, see [22].

2MemN2N source code is available at https://github.com/facebook/MemNN.

4.4 Results
We report a variety of design choices: (i) BoW vs Position Encoding (PE) sentence representation; (ii) training on all 20 tasks independently vs jointly training (joint training used an embedding dimension of d = 50, while independent training used d = 20); (iii) two-phase training: linear start (LS), where softmaxes are removed initially, vs training with softmaxes from the start; (iv) varying memory hops from 1 to 3.
The results across all 20 tasks are given in Table 1 for the 1k training set, along with the mean performance for the 10k training set3. They show a number of interesting points:
• The best MemN2N models are reasonably close to the supervised models (e.g. 1k: 6.7% for MemNN vs 12.6% for MemN2N with position encoding + linear start + random noise, jointly trained; and 10k: 3.2% for MemNN vs 4.2% for MemN2N with position encoding + linear start + random noise + non-linearity4), although the supervised models are still superior.
• All variants of our proposed model comfortably beat the weakly supervised baseline methods.
• The position encoding (PE) representation improves over bag-of-words (BoW), as demonstrated by clear improvements on tasks 4, 5, 15 and 18, where word ordering is particularly important.
• Linear start (LS) training seems to help avoid local minima. 
See task 16 in Table 1, where PE alone gets 53.6% error, while using LS reduces it to 1.6%.
• Jittering the time index with random empty memories (RN) as described in Section 4.1 gives a small but consistent boost in performance, especially for the smaller 1k training set.
• Joint training on all tasks helps.
• Importantly, more computational hops give improved performance. We give examples of the hops performed (via the values of eq. (1)) over some illustrative examples in Fig. 2 and in the supplementary material.

Columns: (1) MemNN [22] (strongly supervised); (2) LSTM [22]; (3) MemNN-WSH; MemN2N variants: (4) BoW; (5) PE; (6) PE LS; (7) PE LS RN; (8) 1 hop PE LS joint; (9) 2 hops PE LS joint; (10) 3 hops PE LS joint; (11) PE LS RN joint; (12) PE LS LW joint.

Task                        (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)   (10)   (11)   (12)
1: 1 supporting fact        0.0   50.0    0.1    0.6    0.1    0.2    0.0    0.8    0.0    0.1    0.0    0.1
2: 2 supporting facts       0.0   80.0   42.8   17.6   21.6   12.8    8.3   62.0   15.6   14.0   11.4   18.8
3: 3 supporting facts       0.0   80.0   76.4   71.0   64.2   58.8   40.3   76.9   31.6   33.1   21.9   31.7
4: 2 argument relations     0.0   39.0   40.3   32.0    3.8   11.6    2.8   22.8    2.2    5.7   13.4   17.5
5: 3 argument relations     2.0   30.0   16.3   18.3   14.1   15.7   13.1   11.0   13.4   14.8   14.4   12.9
6: yes/no questions         0.0   52.0   51.0    8.7    7.9    8.7    7.6    7.2    2.3    3.3    2.8    2.0
7: counting                15.0   51.0   36.1   23.5   21.6   20.3   17.3   15.9   25.4   17.9   18.3   10.1
8: lists/sets               9.0   55.0   37.8   11.4   12.6   12.7   10.0   13.2   11.7   10.1    9.3    6.1
9: simple negation          0.0   36.0   35.9   21.1   23.3   17.0   13.2    5.1    2.0    3.1    1.9    1.5
10: indefinite knowledge    2.0   56.0   68.7   22.8   17.4   18.6   15.1   10.6    5.0    6.6    6.5    2.6
11: basic coreference       0.0   38.0   30.0    4.1    4.3    0.0    0.9    8.4    1.2    0.9    0.3    3.3
12: conjunction             0.0   26.0   10.1    0.3    0.3    0.1    0.2    0.4    0.0    0.3    0.1    0.0
13: compound coreference    0.0    6.0   19.7   10.5    9.9    0.3    0.4    6.3    0.2    1.4    0.2    0.5
14: time reasoning          1.0   73.0   18.3    1.3    1.8    2.0    1.7   36.9    8.1    8.2    6.9    2.0
15: basic deduction         0.0   79.0   64.8   24.3    0.0    0.0    0.0   46.4    0.5    0.0    0.0    1.8
16: basic induction         0.0   77.0   50.5   52.0   52.1    1.6    1.3   47.4   51.3    3.5    2.7   51.0
17: positional reasoning   35.0   49.0   50.9   45.4   50.1   49.0   51.0   44.4   41.2   44.5   40.4   42.6
18: size reasoning          5.0   48.0   51.3   48.1   13.6   10.1   11.1    9.6   10.3    9.2    9.4    9.2
19: path finding           64.0   92.0  100.0   89.7   87.4   85.6   82.8   90.7   89.9   90.2   88.0   90.6
20: agent's motivation      0.0    9.0    3.6    0.1    0.0    0.0    0.0    0.0    0.1    0.0    0.0    0.2
Mean error (%)              6.7   51.3   40.2   25.1   20.3   16.3   13.9   25.8   15.6   13.3   12.4   15.2
Failed tasks (err. > 5%)    4     20     18     15     13     12     11     17     11     11     11     10
On 10k training data
Mean error (%)              3.2   36.4   39.2   15.4    9.4    7.2    6.6   24.5   10.9    7.9    7.5   11.0
Failed tasks (err. > 5%)    2     16     17      9      6      4      4     16      7      6      6      6

Table 1: Test error rates (%) on the 20 QA tasks for models using 1k training examples (mean test errors for 10k training examples are shown at the bottom). 
Key: BoW = bag-of-words representation; PE = position encoding representation; LS = linear start training; RN = random injection of time index noise; LW = RNN-style layer-wise weight tying (if not stated, adjacent weight tying is used); joint = joint training on all tasks (as opposed to per-task training).

5 Language Modeling Experiments
The goal in language modeling is to predict the next word in a text sequence given the previous words x. We now explain how our model can easily be applied to this task.

3More detailed results for the 10k training set can be found in the supplementary material.
4Following [17] we found adding more non-linearity solves tasks 17 and 19; see the supplementary material.

Figure 2: Example predictions on the QA tasks of [22]. We show the labeled supporting facts (support) from the dataset, which MemN2N does not use during training, and the probabilities p of each hop used by the model during inference. MemN2N successfully learns to focus on the correct supporting sentences.

                           Penn Treebank                      Text8
Model        # hidden  hops  mem. size  valid  test | # hidden  hops  mem. size  valid  test
RNN [15]        300      -       -       133    129 |    500      -       -        -     184
LSTM [15]       100      -       -       120    115 |    500      -       -       122    154
SCRN [15]       100      -       -       120    115 |    500      -       -        -     161
MemN2N          150      2     100       128    121 |    500      2     100       152    187
                150      3     100       129    122 |    500      3     100       142    178
                150      4     100       127    120 |    500      4     100       129    162
                150      5     100       127    118 |    500      5     100       123    154
                150      6     100       122    115 |    500      6     100       124    155
                150      7     100       120    114 |    500      7     100       118    147
                150      6      25       125    118 |    500      6      25       131    163
                150      6      50       121    114 |    500      6      50       132    166
                150      6      75       122    114 |    500      6      75       126    158
                150      6     100       122    115 |    500      6     100       124    155
                150      6     125       120    112 |    500      6     125       125    157
                150      6     150       121    114 |    500      6     150       123    154
                150      7     200       118    111 |     -       -       -        -      -

Table 2: The perplexity on the test sets of Penn Treebank and Text8 corpora. Note that increasing the number of memory hops improves performance.

Figure 3: Average activation weight of memory positions during 6 memory hops. White color indicates where the model is attending during the kth hop. For clarity, each row is normalized to have maximum value of 1. A model is trained on (left) Penn Treebank and (right) Text8 dataset.

We now operate at the word level, as opposed to the sentence level. Thus the previous N words in the sequence (including the current) are embedded into memory separately. Each memory cell holds only a single word, so there is no need for the BoW or linear mapping representations used in the QA tasks. We employ the temporal embedding approach of Section 4.1.
Since there is no longer any question, q in Fig. 1 is fixed to a constant vector 0.1 (without embedding). The output softmax predicts which word in the vocabulary (of size V) is next in the sequence. A cross-entropy loss is used to train the model by backpropagating the error through multiple memory layers, in the same manner as the QA tasks. To aid training, we apply ReLU operations to half of the units in each layer. We use layer-wise (RNN-like) weight sharing, i.e. the query weights of each layer are the same; the output weights of each layer are the same. 
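The language-model hops just described can be sketched as follows. This is our own sketch under stated assumptions: the memory already holds the embedded previous words, H is the layer-wise linear mapping of Section 2.2, and exactly which half of the units receive the ReLU is our assumption for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def lm_hops(mem_A, mem_C, H, d, K=6, q_val=0.1):
    """K memory hops of the language-model variant.

    mem_A, mem_C : (N, d) input/output embeddings of the previous N words
                   (layer-wise tying: the same A and C are used at every hop)
    H            : (d, d) linear mapping applied to u between hops
    """
    u = np.full(d, q_val)                    # fixed query: constant vector 0.1
    for _ in range(K):
        p = softmax(mem_A @ u)               # attention over the N word memories
        o = p @ mem_C                        # weighted response from memory
        u = H @ u + o                        # u^{k+1} = H u^k + o^k
        u[: d // 2] = np.maximum(u[: d // 2], 0.0)  # ReLU on half of the units
    return u                                 # fed to the final softmax over V words
```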
As noted in Section 2.2, this makes our architecture closely related to an RNN which is traditionally used for language modeling tasks; however, here the "sequence" over which the network is recurrent is not in the text, but in the memory hops. Furthermore, the weight tying restricts the number of parameters in the model, helping generalization for the deeper models which we find to be effective for this task.

Figure 2 (content):

Story (1: 1 supporting fact)            Support  Hop 1  Hop 2  Hop 3
Daniel went to the bathroom.                      0.00   0.00   0.03
Mary travelled to the hallway.                    0.00   0.00   0.00
John went to the bedroom.                         0.37   0.02   0.00
John travelled to the bathroom.         yes       0.60   0.98   0.96
Mary went to the office.                          0.01   0.00   0.00
Where is John? Answer: bathroom  Prediction: bathroom

Story (2: 2 supporting facts)           Support  Hop 1  Hop 2  Hop 3
John dropped the milk.                            0.06   0.00   0.00
John took the milk there.               yes       0.88   1.00   0.00
Sandra went back to the bathroom.                 0.00   0.00   0.00
John moved to the hallway.              yes       0.00   0.00   1.00
Mary went back to the bedroom.                    0.00   0.00   0.00
Where is the milk? Answer: hallway  Prediction: hallway

Story (16: basic induction)             Support  Hop 1  Hop 2  Hop 3
Brian is a frog.                        yes       0.00   0.98   0.00
Lily is gray.                                     0.07   0.00   0.00
Brian is yellow.                        yes       0.07   0.00   1.00
Julius is green.                                  0.06   0.00   0.00
Greg is a frog.                         yes       0.76   0.02   0.00
What color is Greg? Answer: yellow  Prediction: yellow

Story (18: size reasoning)              Support  Hop 1  Hop 2  Hop 3
The suitcase is bigger than the chest.  yes       0.00   0.88   0.00
The box is bigger than the chocolate.             0.04   0.05   0.10
The chest is bigger than the chocolate. yes       0.17   0.07   0.90
The chest fits inside the container.              0.00   0.00   0.00
The chest fits inside the box.                    0.00   0.00   0.00
Does the suitcase fit in the chocolate? Answer: no  Prediction: no

We use two different datasets:
Penn Tree Bank [13]: This consists of 929k/73k/82k train/validation/test words, distributed over a vocabulary of 10k words. The same preprocessing as [25] was used.
Text8 [15]: This is a pre-processed version of the first 100M characters dumped from Wikipedia. This is split into 93.3M/5.7M/1M character train/validation/test sets. All words occurring fewer than 5 times are replaced with a single out-of-vocabulary token, resulting in a vocabulary size of ∼44k.
5.1 Training Details
The training procedure we use is the same as for the QA tasks, except for the following. For each mini-batch update, the ℓ2 norm of the whole gradient of all parameters is measured5 and if larger than L = 50, then it is scaled down to have norm L. This was crucial for good performance. We use the learning rate annealing schedule from [15], namely, if the validation cost has not decreased after one epoch, then the learning rate is scaled down by a factor 1.5. Training terminates when the learning rate drops below 10^-5, i.e. after 50 epochs or so. Weights are initialized using N(0, 0.05) and batch size is set to 128. On the Penn Treebank dataset, we repeat each training 10 times with different random initializations and pick the one with smallest validation cost. However, we have done only a single training run on the Text8 dataset due to time constraints.
5.2 Results
Table 2 compares our model to RNN, LSTM and Structurally Constrained Recurrent Nets (SCRN) [15] baselines on the two benchmark datasets. Note that the baseline architectures were tuned in [15] to give optimal perplexity6. Our MemN2N approach achieves lower perplexity on both datasets (111 vs 115 for RNN/SCRN on Penn and 147 vs 154 for LSTM on Text8). Note that MemN2N has ∼1.5x more parameters than RNNs with the same number of hidden units, while LSTM has ∼4x more parameters. We also vary the number of hops and memory size of our MemN2N, showing the contribution of both to performance; note in particular that increasing the number of hops helps. In Fig. 3, we show how MemN2N operates on memory with multiple hops. It shows the average weight of the activation of each memory position over the test set. 
We can see that some hops concentrate only on recent words, while other hops have broader attention over all memory locations, which is consistent with the idea that successful language models consist of a smoothed n-gram model and a cache [15]. Interestingly, these two types of hops tend to alternate. Also note that, unlike in a traditional RNN, the cache does not decay exponentially: it has roughly the same average activation across the entire memory. This may be the source of the observed improvement in language modeling.

6 Conclusions and Future Work

In this work we showed that a neural network with an explicit memory and a recurrent attention mechanism for reading the memory can be successfully trained via backpropagation on diverse tasks, from question answering to language modeling. Compared to the Memory Network implementation of [23], there is no supervision of supporting facts, so our model can be used in a wider range of settings. Our model approaches the performance of that model, and is significantly better than other baselines with the same level of supervision. On language modeling tasks, it slightly outperforms tuned RNNs and LSTMs of comparable complexity. On both tasks we can see that increasing the number of memory hops improves performance.

However, there is still much to do. Our model is still unable to exactly match the performance of the memory networks trained with strong supervision, and both fail on several of the 1k QA tasks. Furthermore, smooth lookups may not scale well to the case where a larger memory is required.
For these settings, we plan to explore multiscale notions of attention or hashing, as proposed in [23].

Acknowledgments

The authors would like to thank Armand Joulin, Tomas Mikolov, Antoine Bordes and Sumit Chopra for useful comments and valuable discussions, and also the FAIR Infrastructure team for their help and support.

Footnote 5: In the QA tasks, the gradient of each weight matrix is measured separately.
Footnote 6: They tuned the hyper-parameters on Penn Treebank and used them on Text8 without additional tuning, except for the number of hidden units. See [15] for more detail.

References

[1] C. G. Atkeson and S. Schaal. Memory-based neural networks for robot learning. Neurocomputing, 9:243-269, 1995.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.
[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137-1155, Mar. 2003.
[4] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint: 1412.3555, 2014.
[5] S. Das, C. L. Giles, and G.-Z. Sun. Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, 1992.
[6] J. Goodman. A bit of progress in language modeling. CoRR, cs.CL/0108005, 2001.
[7] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint: 1308.0850, 2013.
[8] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint: 1410.5401, 2014.
[9] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[11] A. Joulin and T. Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, 2015.
[12] J. Koutník, K. Greff, F. J. Gomez, and J. Schmidhuber. A clockwork RNN. In ICML, 2014.
[13] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist., 19(2):313-330, June 1993.
[14] T. Mikolov. Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology, 2012.
[15] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato. Learning longer memory in recurrent neural networks. arXiv preprint: 1412.7753, 2014.
[16] M. C. Mozer and S. Das. A connectionist symbol manipulator that discovers the structure of context-free languages. NIPS, pages 863-863, 1993.
[17] B. Peng, Z. Lu, H. Li, and K. Wong. Towards neural network-based reasoning. arXiv preprint: 1508.05508, 2015.
[18] J. Pollack. The induction of dynamical recognizers. Machine Learning, 7(2-3):227-252, 1991.
[19] K. Steinbuch and U. Piske. Learning matrices and their applications. IEEE Transactions on Electronic Computers, 12:846-862, 1963.
[20] M. Sundermeyer, R. Schlüter, and H. Ney. LSTM neural networks for language modeling. In Interspeech, pages 194-197, 2012.
[21] W. K. Taylor. Pattern recognition by means of automatic analogue apparatus. Proceedings of the Institution of Electrical Engineers, 106:198-209, 1959.
[22] J. Weston, A. Bordes, S. Chopra, and T. Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint: 1502.05698, 2015.
[23] J. Weston, S. Chopra, and A. Bordes. Memory networks. In International Conference on Learning Representations (ICLR), 2015.
[24] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint: 1502.03044, 2015.
[25] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint: 1409.2329, 2014.