{"title": "Learning to Reason with Third Order Tensor Products", "book": "Advances in Neural Information Processing Systems", "page_first": 9981, "page_last": 9993, "abstract": "We combine Recurrent Neural Networks with Tensor Product Representations to learn combinatorial representations of sequential data. This improves symbolic interpretation and systematic generalisation. Our architecture is trained end-to-end through gradient descent on a variety of simple natural language reasoning tasks, significantly outperforming the latest state-of-the-art models in single-task and all-tasks settings. We also augment a subset of the data such that training and test data exhibit large systematic differences and show that our approach generalises better than the previous state-of-the-art.", "full_text": "Learning to Reason with Third-Order Tensor Products

Imanol Schlag
imanol@idsia.ch
The Swiss AI Lab IDSIA / USI / SUPSI

Jürgen Schmidhuber
juergen@idsia.ch
The Swiss AI Lab IDSIA / USI / SUPSI

Abstract

We combine Recurrent Neural Networks with Tensor Product Representations to learn combinatorial representations of sequential data. This improves symbolic interpretation and systematic generalisation. Our architecture is trained end-to-end through gradient descent on a variety of simple natural language reasoning tasks, significantly outperforming the latest state-of-the-art models in single-task and all-tasks settings. We also augment a subset of the data such that training and test data exhibit large systematic differences and show that our approach generalises better than the previous state-of-the-art.

1 Introduction

Certain connectionist architectures based on Recurrent Neural Networks (RNNs) [1-3] such as the Long Short-Term Memory (LSTM) [4, 5] are general computers, e.g., [6]. LSTM-based systems achieved breakthroughs in various speech and Natural Language Processing tasks [7-9].
Unlike\nhumans, however, current RNNs cannot easily extract symbolic rules from experience and apply\nthem to novel instances in a systematic way [10, 11]. They are catastrophically affected by systematic\n[10, 11] differences between training and test data [12\u201315].\nIn particular, standard RNNs have performed poorly at natural language reasoning (NLR) [16] where\nsystematic generalisation (such as rule-like extrapolation) is essential. Consider a network trained\non a variety of NLR tasks involving short stories about multiple entities. One task could be about\ntracking entity locations ([...] Mary went to the of\ufb01ce. [...] Where is Mary?), another about tracking\nobjects that people are holding ([...] Daniel picks up the milk. [...] What is Daniel holding?). If\nevery person is able to perform every task, this will open up a large number of possible person-task\npairs. Now suppose that during training we only have stories from a small subset of all possible pairs.\nMore speci\ufb01cally, let us assume Mary is never seen picking up or dropping any item. Unlike during\ntraining, we want to test on tasks such as [...] Mary picks up the milk. [...] What is Mary carrying?.\nIn this case, the training and test data exhibit systematic differences. Nevertheless, a systematic\nmodel should be able to infer milk because it has adopted a rule-like, entity-independent reasoning\npattern that generalises beyond the training distribution. RNNs, however, tend to fail to learn such\npatterns if the train and test data exhibit such differences.\nHere we aim at improving systematic generalisation by learning to deconstruct natural language\nstatements into combinatorial representations [17]. 
We propose a new architecture based on the Tensor Product Representation (TPR) [18], a general method for embedding symbolic structures in a vector space.
Previous work already showed that TPRs allow for powerful symbolic processing with distributed representations [18], given certain manual assignments of the vector space embedding. However, TPRs have commonly not been trained from data through gradient descent. Here we combine gradient-based RNNs with third-order TPRs to learn combinatorial representations from natural language, training the entire system on NLR tasks via error backpropagation [19-21]. We point out similarities to systems with Fast Weights [22-24], in particular, end-to-end-differentiable Fast Weight systems [25-27]. In experiments, we achieve state-of-the-art results on the bAbI dataset [16], obtaining better systematic generalisation than other methods. We also analyse the emerging combinatorial and, to some extent, interpretable representations. The code we used to train and evaluate our models is available at github.com/ischlag/TPR-RNN.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Review of the Tensor Product Representation and Notation

The TPR method is a mechanism to create a vector-space embedding of symbolic structures. To illustrate, consider the relation implicit in the short sentences "Kitty the cat" and "Mary the person". In order to store this structure into a TPR of order 2, each sentence has to be decomposed into two components by choosing a so-called filler symbol f ∈ F and a role symbol r ∈ R. Now a possible set of fillers and roles for a unique role/filler decomposition could be F = {Kitty, Mary} and R = {Cat, Person}. The two relations are then described by the set of filler/role bindings: {Kitty/Cat, Mary/Person}. 
Let d, n, j, k denote positive integers. A distributed representation is then achieved by encoding each filler symbol f by a filler vector f in a vector space V_F and each role symbol r by a role vector r in a vector space V_R. In this work, every vector space is over R^d, d > 1. The TPR of the symbolic structures is defined as the tensor T in a vector space V_F ⊗ V_R where ⊗ is the tensor product operator. In this example the tensor is of order 2, a matrix, which allows us to write the equation of our example using matrix multiplication:

T = f_Kitty ⊗ r_Cat + f_Mary ⊗ r_Person = f_Kitty r_Cat^⊤ + f_Mary r_Person^⊤   (1)

Here, the tensor product — or generalised outer product — acts as a variable binding operator. The final TPR representation is a superposition of all bindings via the element-wise addition.
In the TPR method the so-called unbinding operator consists of the tensor inner product, which is used to exactly reconstruct previously stored variables from T using an unbinding vector. Recall that the dot product of two vectors f = (f_1; f_2; ...; f_n) and r = (r_1; r_2; ...; r_n) is defined as the sum of the pairwise products of their elements. Equivalently, the tensor inner product •_jk can be expressed through the order-increasing tensor product followed by the sum of the pairwise products of the elements of the j-th and k-th order:

f •_12 r = Σ_{i=1}^{n} (f ⊗ r)_{ii} = Σ_{i=1}^{n} (f r^⊤)_{ii} = Σ_{i=1}^{n} f_i r_i = f · r   (2)

Given now the unbinding vector u_Cat, we can then retrieve the stored filler f_Kitty. In the simplest case, if the role vectors are orthonormal, the unbinding vector u_Cat equals r_Cat. Again, for a TPR of order 2 the unbinding operation can also be expressed using matrix multiplication:

T •_23 u_Cat = T u_Cat = f_Kitty   (3)

Note how the dot product and matrix multiplication are special cases of the tensor inner product. 
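The binding and unbinding of Eqs. (1)-(3) can be sketched in a few lines of NumPy. The dimensionality and the use of one-hot role vectors are illustrative assumptions (they merely make the roles orthonormal, so unbinding is exact and the unbinding vector equals the role vector):

```python
import numpy as np

# A minimal sketch of Eqs. (1)-(3); vectors are illustrative, not the paper's setup.
d = 4
r_cat = np.array([1., 0., 0., 0.])      # orthonormal role vectors, so the
r_person = np.array([0., 1., 0., 0.])   # unbinding vector equals the role vector
f_kitty = np.array([0.2, 0.7, -1.1, 0.4])
f_mary = np.array([1.3, -0.5, 0.0, 0.8])

# Binding (Eq. 1): superposition of outer products f r^T.
T = np.outer(f_kitty, r_cat) + np.outer(f_mary, r_person)

# Unbinding (Eq. 3): the matrix-vector product T u recovers the bound filler.
assert np.allclose(T @ r_cat, f_kitty)
assert np.allclose(T @ r_person, f_mary)
```

With non-orthogonal roles the recovered filler would only be approximate, which is why the unbinding vectors in later sections need only be similar, not identical, to the stored roles.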
We will later use the tensor inner product •_34, which can be used with a tensor of order 3 (a cube) and a tensor of order 1 (a vector) such that they result in a tensor of order 2 (a matrix). Other aspects of the TPR method are not essential for this paper. For further details, we refer to Smolensky's work [18, 28, 29].

3 The TPR as a Structural Bias for Combinatorial Representations

A drawback of Smolensky's TPR method is that the decomposition of the symbolic structures into structural elements — e.g. f and r in our previous example — is not learned but externally defined. Similarly, the distributed representations f and r are assigned manually instead of being learned from data, yielding arguments against the TPR as a connectionist theory of cognition [30].
Here we aim at overcoming these limitations by recognising the TPR as a form of Fast Weight memory which uses multi-layer perceptron (MLP) based neural networks trained end-to-end by stochastic gradient descent. Previous outer product-based Fast Weights [26], which share strong similarities to TPRs of order 2, have been shown to be powerful associative memory mechanisms [31, 27]. Inspired by this capability, we use a graph interpretation of the memory where the representations of
For instance, consider the following raw input:\n\"Mary went to the kitchen.\". A possible three-way task-speci\ufb01c decomposition could be fMary, ris-at,\nand tkitchen. At a later point in time, a question like \"Where is Mary?\" would have to be decomposed\ninto the vector representations nMary \u2208 VEntity and lwhere-is \u2208 VRelation. The vectors nMary and lwhere-is\nhave to be similar to the true unbinding vectors nMary \u2248 uMary and lwhere-is \u2248 uis-at in order to retrieve\nthe previously stored but possibly noisy tkitchen.\nWe chose a graph interpretation of the memory due to its generality as it can be found implicitly in\nthe data of many problems. Another important property of a graph inspired neural memory is the\ncombinatorial nature of entities and relations in the sense that any entity can be connected through\nany relation to any other entity. If the MLPs can disentangle entity-like information from relation-like\ninformation, the TPR will provide a simple mechanism to combine them in arbitrary ways. This\nmeans that if there is enough data for the network to learn speci\ufb01c entity representations such as\nfJohn \u2208 VEntity then it should not require any more data or training to combine fJohn with any of the\nlearned vectors embedded in VRelation even though such examples have never been covered by the\ntraining data. In Section 7 we analyse a trained model and present results which indicate that it\nindeed seems to learn representations in line with this perspective.\n\n4 Proposed Method\n\nRNNs can implement algorithms which map input sequences to output sequences. A traditional\nRNN uses one or several tensors of order 1 (i.e. a vector usually referred to as the hidden state) to\nencode the information of the past sequence elements necessary to infer the correct current and future\noutputs. 
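The graph-style storage and retrieval described in Section 3 can be sketched with an order-3 tensor in plain NumPy. The vectors, names, and contraction order below are illustrative assumptions; the unbinding vectors are scaled copies of the stored vectors so that retrieval is exact in this toy case:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Illustrative decomposition of "Mary went to the kitchen.":
f_mary = rng.standard_normal(d)       # source entity
r_is_at = rng.standard_normal(d)      # relation
t_kitchen = rng.standard_normal(d)    # target entity

# Bind the triple (Mary, is-at, kitchen) into an order-3 TPR.
F = np.einsum('i,j,k->ijk', f_mary, r_is_at, t_kitchen)

# Unbinding vectors scaled so the inner products evaluate to exactly 1.
u_mary = f_mary / np.dot(f_mary, f_mary)
u_is_at = r_is_at / np.dot(r_is_at, r_is_at)

# Graph traversal: contract the entity and relation modes to reach the neighbour.
recovered = np.einsum('ijk,i,j->k', F, u_mary, u_is_at)
assert np.allclose(recovered, t_kitchen)
```

With learned, merely similar unbinding vectors (as in the model), the recovered target would be a noisy version of t_kitchen rather than an exact copy.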
Our architecture is a non-traditional RNN encoding relevant information from the preceding sequence elements in a TPR F of order 3.
At discrete time t, 0 < t ≤ T, in the input sequence of varying length T, the previous state F_{t-1} is updated by the element-wise addition of an update representation ΔF_t:

F_t ← F_{t-1} + ΔF_t   (4)

The proposed architecture is separated into three parts: an input, update, and inference module. The update module produces ΔF_t while the inference module uses F_t as parameters (Fast Weights) to compute the output ŷ_t of the model given a question as input. F_0 is the zero tensor.

Input Module  Similar to previous work, our model also iterates over a sequence of sentences and uses an input module to learn a sentence representation from a sequence of words [32]. Let the input to the architecture at time t be a sentence of k words with learned embeddings {d_1, ..., d_k}. The sequence is then compressed into a vector representation s_t by

s_t = Σ_{i=1}^{k} d_i ⊙ p_i,   (5)

where {p_1, ..., p_k} are learned position vectors that are shared across all input sequences and ⊙ is the Hadamard product. The vectors s and p are in the vector space V_Symbol.

Update Module  The TPR update ΔF_t is defined as the element-wise sum of the tensors produced by a write, move, and backlink function. 
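The position-weighted sentence encoding of Eq. (5) can be sketched as follows. The vocabulary size, dimensionality, and random initialisation are assumptions for illustration; in the model, both the word embeddings and the position vectors are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, max_words = 50, 8, 12

E = rng.standard_normal((vocab_size, d))  # word embeddings d_i (learned)
P = rng.standard_normal((max_words, d))   # position vectors p_i (learned, shared)

def encode_sentence(word_ids):
    # Eq. (5): s_t = sum_i d_i ⊙ p_i (Hadamard product, summed over positions).
    D = E[np.asarray(word_ids)]           # (k, d) embeddings of the k words
    return (D * P[:len(word_ids)]).sum(axis=0)

s_t = encode_sentence([3, 17, 42])        # hypothetical word ids for one sentence
assert s_t.shape == (d,)
```

Unlike a plain bag of words, the Hadamard product with position vectors makes the encoding order-sensitive while keeping it a single fixed-size vector.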
We abbreviate the respective tensors as W, M, and B and refer to them as memory operations.

ΔF_t = W_t + M_t + B_t   (6)

To this end, two entity and three relation representations are computed from the sentence representation s_t using five separate networks such that

e(i)_t = f_e(i)(s_t; θ_e(i)), 1 ≤ i < 3   (7)
r(j)_t = f_r(j)(s_t; θ_r(j)), 1 ≤ j < 4   (8)

where f is an MLP network and θ its weights.
The write operation allows for the storage of a new node-edge-node association (e(1)_t, r(1)_t, e(2)_t) using the tensor product, where e(1)_t represents the source entity, e(2)_t represents the target entity, and r(1)_t the relation connecting them. To avoid superimposing the new association onto a possibly already existing association (e(1)_t, r(1)_t, ŵ_t), the previous target entity ŵ_t has to be retrieved and subtracted from the TPR. If no such association exists, then ŵ_t will ideally be the zero vector.

ŵ_t = (F_t •_34 e(1)_t) •_23 r(1)_t   (9)
W_t = -(e(1)_t ⊗ r(1)_t ⊗ ŵ_t) + (e(1)_t ⊗ r(1)_t ⊗ e(2)_t)   (10)

While the write operation removes the previous target entity representation ŵ_t, the move operation allows ŵ_t to be rewritten into the TPR with a different relation r(2)_t. Similar to the write operation, we have to retrieve and remove the previous target entity m̂_t that would otherwise interfere.

m̂_t = (F_t •_34 e(1)_t) •_23 r(2)_t   (11)
M_t = -(e(1)_t ⊗ r(2)_t ⊗ m̂_t) + (e(1)_t ⊗ r(2)_t ⊗ ŵ_t)   (12)

The final operation is the backlink. It switches source and target entities and connects them with yet another relation r(3)_t. This allows for the associative retrieval of the neighbouring entity starting from either one but with different relations (e.g. John is left of Mary and Mary is right of John).

b̂_t = (F_t •_34 e(2)_t) •_23 r(3)_t   (13)
B_t = -(e(2)_t ⊗ r(3)_t ⊗ b̂_t) + (e(2)_t ⊗ r(3)_t ⊗ e(1)_t)   (14)

Figure 1: Illustration of our memory operations for a single time-step given some previous state. Each arrow is represented by a tensor of order 3. The superposition of multiple tensors defines the current graph. Red arrows are subtracted from the state while green arrows are added. In this illustration, ŵ exists but m̂ and b̂ do not yet — they are zero vectors. Hence, the two constructed third-order tensors that are subtracted according to the move and backlink operation will both be zero tensors as well. Note that the associations are not necessarily as discrete as illustrated. Best viewed in color.

Inference Module  One of our experiments requires a single prediction after the last element of an observed sequence (i.e. the last sentence). This final element is the question sentence representation s_Q. Since the inference module does not edit the TPR memory, it is sufficient to compute the prediction only when necessary. Hence we drop the index t in the following equations. Similar to the update module, first an entity n and a set of relations l(j) are extracted from the current sentence using four different networks.

n = f_n(s_Q; θ_n)   (15)
l(j) = f_l(j)(s_Q; θ_l(j)), 1 ≤ j < 4   (16)

The extracted representations are used to retrieve one or several previously stored associations by providing the necessary unbinding vectors. The values of the TPR can be thought of as context-specific weights which are not trained by gradient descent but constructed incrementally during inference. 
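The retrieve-remove-add pattern shared by the write, move, and backlink operations (Eqs. 9-14) can be sketched as below. The contraction order of the tensor inner products is an assumption chosen to invert the binding order used here, and the orthonormal toy vectors make retrieval exact:

```python
import numpy as np

d = 16

def bind(e, r, t):
    # Order-3 binding e ⊗ r ⊗ t.
    return np.einsum('i,j,k->ijk', e, r, t)

def unbind(F, e, r):
    # Retrieve the target stored under (entity, relation).
    return np.einsum('ijk,i,j->k', F, e, r)

def write(F, e1, r1, e2):
    # Eqs. (9)-(10): remove the previous target w_hat (a zero vector if no
    # association exists), then add the new binding e1 ⊗ r1 ⊗ e2.
    w_hat = unbind(F, e1, r1)
    return F - bind(e1, r1, w_hat) + bind(e1, r1, e2), w_hat

def move(F, e1, r2, w_hat):
    # Eqs. (11)-(12): rewrite the displaced w_hat under a different relation r2.
    m_hat = unbind(F, e1, r2)
    return F - bind(e1, r2, m_hat) + bind(e1, r2, w_hat)

# Toy usage with orthonormal vectors.
e_mary, r_is_at, t_kitchen, t_garden = np.eye(d)[:4]
F = np.zeros((d, d, d))
F, _ = write(F, e_mary, r_is_at, t_kitchen)      # Mary is-at kitchen
F, w_hat = write(F, e_mary, r_is_at, t_garden)   # overwrite: Mary is-at garden
assert np.allclose(unbind(F, e_mary, r_is_at), t_garden)
assert np.allclose(w_hat, t_kitchen)             # the displaced previous target
```

The second write cleanly replaces the old target instead of superimposing both, which is exactly the interference the subtraction terms in Eqs. (10), (12), and (14) are there to prevent.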
They define a function that takes the entity n and relations l(i) as an input. A simple illustration of this process is shown in Figure 2.
The most basic retrieval requires one source entity n and one relation l(1) to extract the first target entity. We refer to this retrieval as a one-step inference î(1) and use the additional extracted relations to compute multi-step inferences. Here LN refers to layer normalization [33] which includes a learned scaling and shifting scalar. As in other Fast Weight work, LN improves our training procedure, possibly due to making the optimization landscape smoother [34].

î(1) = LN((F_t •_34 n) •_23 l(1))   (17)
î(2) = LN((F_t •_34 î(1)) •_23 l(2))   (18)
î(3) = LN((F_t •_34 î(2)) •_23 l(3))   (19)

Figure 2: Illustration of the inference procedure. Given an entity and three relations (blue) we can extract the inferred representations î(1:3) (yellow).

Finally, the output ŷ of our architecture consists of the sum of the three previous inference steps followed by a linear projection Z into the symbol space V_Symbol, where a softmax transforms the activations into a probability distribution over all words from the vocabulary of the current task.

ŷ = softmax(Z Σ_{i=1}^{3} î(i))   (20)

5 Related Work

To our knowledge, the proposed method is the first Fast Weight architecture with a TPR or tensor of order 3 trained on raw data by backpropagation [19-21]. It is inspired by an earlier, adaptive, backpropagation-trained, end-to-end-differentiable, outer product-based, Fast Weight RNN architecture with a tensor of order 2 (1993) [26]. The latter in turn was partially inspired by previous ideas, most notably Hebbian learning [35]. 
Variations of such outer product-based Fast Weights were able to generalise in a variety of small but complex sequence problems where standard RNNs tend to perform poorly [31, 27, 36]. Compare also early work on differentiable control of Fast Weights [37].
Previous work also utilised TPRs of order 2 for simpler associations in the context of image-caption generation [38], question-answering [39], and general NLP [40] challenges with a gradient-based optimizer similar to ours.
Given the sequence of sentences of one sample, our method produces a final tensor of order 3 that represents the current task-relevant state of the story. Unfolded across time, the MLP representations can to some extent be related to components of a canonical polyadic decomposition (CPD, [41]). Over recent years, CPD and various other tensor decomposition methods have been shown to be a powerful tool for a variety of Machine Learning problems [42]. Consider, e.g., recent work which applies the tensor-train decomposition to RNNs [43, 44].
RNNs are popular choices for modelling natural language. Despite ongoing research in RNN architectures, the good old LSTM [4] has been shown to outperform more recent variants [45] on standard language modelling datasets. However, such networks do not perform well in NLR tasks such as question answering [16]. Recent progress came through the addition of memory and attention components to RNNs. In the context of question answering, a popular line of research is memory networks [46-50]. But because the internal representations of trained models are difficult to interpret, it remains unclear whether their mistakes arise from imperfect logical reasoning, knowledge representation, or insufficient data [51].
Some early memory-augmented RNNs focused primarily on improving the ratio of the number of trainable parameters to memory size [26]. 
The Neural Turing Machine [52] was among the \ufb01rst\nmodels with an attention mechanism over external memory that outperformed standard LSTM on\ntasks such as copying and sorting. The Differentiable Neural Computer (DNC) further re\ufb01ned this\napproach [53, 54], yielding strong performance also on question-answering problems.\n\n6 Experiments\n\nWe evaluate our architecture on bAbI tasks,\na set of 20 different synthetic question-\nanswering tasks designed to evaluate NLR\nsystems such as intelligent dialogue agents\n[16]. Every task addresses a different form\nof reasoning. It consists of the story - a se-\nquence of sentences - followed by a question\nsentence with a single word answer. We used\nthe train/validation/test split as it was intro-\nduced in v1.2 for the 10k samples version of\nthe dataset. We ignored the provided support-\ning facts that simplify the problem by point-\ning out sentences relevant to the question. We\nonly show story sentences once and before\nthe query sentence, with no additional super-\nvision signal apart from the prediction error.\nWe experiment with two models. The single-\ntask model is only trained and tested on the\ndata from one task but uses the same computa-\ntional graph and hyper-parameters for all. The\nall-tasks model is a scaled up version trained\nand tested on all tasks simultaneously, using only the default hyper-parameters. More details such as\nspeci\ufb01c hyper-parameters can be found in Appendix A.\nIn Table 1 and 2 we compare our model to various state-of-the-art models in the literature. We\nadded best results for a better comparison to earlier work which did not provide statistics generated\nfrom multiple runs. Our system outperforms the state-of-the-art in both settings. We also seem to\noutperform the DNC in convergence speed as shown in Figure 3.\n\nFigure 3: Training accuracy on all bAbI tasks over\nthe \ufb01rst 600k iterations. All our all-tasks models\nachieve <5% error in 48 hours (i.e. 
250k steps). We stopped training our own implementation of the DNC [53] after roughly 7 days (600k steps) and instead compare accuracy in Table 1 using previously published results.

Table 1: Mean and variance of the test error for the all-tasks setting. We perform early stopping according to the validation set. Our statistics are generated from 10 runs.

Task          | REN [55]  | DNC [53]   | SDNC [54] | TPR-RNN (ours)
Avg Error     | 9.7 ± 2.6 | 12.8 ± 4.7 | 6.4 ± 2.5 | 1.34 ± 0.52
Failure (>5%) | 5 ± 1.2   | 8.2 ± 2.5  | 4.1 ± 1.6 | 0.86 ± 1.11

Table 2: Mean and variance of the test error for the single-task setting. We perform early stopping according to the validation set. Statistics are generated from 5 runs. We added best results for comparison with previous work. Note that only our results for task 19 are unstable, where different seeds either converge with perfect accuracy or fall into a local minimum. It is not clear how much previous work is affected by such issues.

Task          | LSTM [48] best | N2N [48] best | DMN+ [50] best | REN [55] best | TPR-RNN best | TPR-RNN mean
1             | 0.0  | 0.0  | 0.0  | 0.0 | 0.0  | 0.02 ± 0.05
2             | 81.9 | 0.3  | 0.3  | 0.1 | 0.0  | 0.06 ± 0.09
3             | 83.1 | 2.1  | 1.1  | 4.1 | 1.2  | 1.78 ± 0.58
4             | 0.2  | 0.0  | 0.0  | 0.0 | 0.0  | 0.02 ± 0.04
5             | 1.2  | 0.8  | 0.5  | 0.3 | 0.5  | 0.61 ± 0.17
6             | 51.8 | 0.1  | 0.0  | 0.2 | 0.0  | 0.22 ± 0.19
7             | 24.9 | 2.0  | 2.4  | 0.0 | 0.5  | 2.78 ± 1.81
8             | 34.1 | 0.9  | 0.0  | 0.5 | 0.1  | 0.47 ± 0.45
9             | 20.2 | 0.3  | 0.0  | 0.1 | 0.0  | 0.14 ± 0.13
10            | 30.1 | 0.0  | 0.0  | 0.6 | 0.3  | 1.24 ± 1.30
11            | 10.3 | 0.0  | 0.0  | 0.3 | 0.0  | 0.14 ± 0.11
12            | 23.4 | 0.0  | 0.0  | 0.0 | 0.0  | 0.04 ± 0.05
13            | 6.1  | 0.0  | 0.0  | 1.3 | 0.3  | 0.42 ± 0.11
14            | 81.0 | 0.2  | 0.2  | 0.0 | 0.0  | 0.24 ± 0.29
15            | 78.7 | 0.0  | 0.0  | 0.0 | 0.0  | 0.0 ± 0.0
16            | 51.9 | 51.8 | 45.3 | 0.2 | 0.0  | 0.02 ± 0.045
17            | 50.1 | 18.6 | 4.2  | 0.5 | 0.4  | 0.9 ± 0.69
18            | 6.8  | 5.3  | 2.1  | 0.3 | 0.1  | 0.64 ± 0.33
19            | 31.9 | 2.3  | 0.0  | 2.3 | 0.0  | 12.64 ± 17.39
20            | 0.0  | 0.0  | 0.0  | 0.0 | 0.0  | 0.0 ± 0.00
Avg Error     | 36.4 | 4.2  | 2.8  | 0.5 | 0.17 | 1.12 ± 1.19
Failure (>5%) | 16   | 3    | 1    | 0   | 0    | 0.4 ± 0.55

Table 3: Summary results of the ablation experiments. We experimented with 3 variations of memory operations in order to analyse their necessity with regards to single-task performance. The results indicate that the move operation is in general less important than the backlink operation.

Operations | Failed tasks (err > 5%)
W          | 3, 6, 9, 10, 12, 13, 17, 19
W + M      | 9, 10, 13, 17
W + B      | 3

Ablation Study  We ran ablation experiments on every task to assess the necessity of the three memory operations. The experimental results in Table 3 indicate that a majority of the tasks can be solved by the write operation alone. This is surprising at first because for some of those tasks the symbolic operations that a person might think of as ideal typically require more complex steps than what the write operation allows for. However, the optimizer seems to be able to find representations that overcome the limitations of the architecture. That said, more complex tasks do benefit from the additional operations without affecting the performance on simpler tasks.

7 Analysis

Here we analyse the representations produced by the MLPs of the update module. We collect the set of unique sentences across all stories from the validation set of a task and compute their respective entity and relation representations e(1), e(2), r(1), r(2), and r(3). For each representation we then hierarchically cluster all sentences based on their cosine similarity. In Figure 4 we show such similarity matrices for a model trained on task 3. The image based on e(1) shows 4 distinct clusters which indicate that learned representations are almost perfectly orthogonal. 
By comparing the sentences from different clusters it becomes apparent that they represent the four entities independent of other factors. Note that the dimensionality of this vector space is 15, which seems larger than necessary for this task.

Figure 4: The hierarchically clustered similarity matrices of all unique sentences of the validation set of task 3. We compute one similarity matrix for each representation produced by the update module using the cosine similarity measure for clustering.

In the case of r(1) we observe that sentences seem to group into three, albeit less distinct, clusters. In this task, the structure in the data implies three important events for any entity: moving to any location, binding with any object, and unbinding from a previously bound object; all three represented by a variety of possible words and phrases. By comparing sentences from different clusters, we can clearly associate them with the three general types of events.
We observed clusters of similar discreteness in all tasks, often with a semantic meaning that becomes apparent when we compare sentences that belong to different clusters. We also noticed that even though there are often clean clusters they are not always perfectly combinatorial, e.g., in e(2) as seen in Figure 4, we found two very orthogonal clusters for the target entity symbols {t_Kitchen, t_Bathroom} and {t_Garden, t_Hallway}.

Systematic Generalisation  We conduct an additional experiment to empirically analyse the model's capability to generalise in a systematic way [10, 11]. For this purpose, we join together all tasks which use the same four entity names with at least one entity appearing in the question (i.e. tasks 1, 6, 7, 8, 9, 11, 12, 13). We then augment this data with five new entities such that the train and test data exhibit systematic differences. 
The stories for a new entity are generated by randomly sampling 500 story/question pairs from a task such that in 20% of the generated stories the new entity is also contained in the question. We then add generated stories from all possible 40 combinations of new entities and tasks to the test set. To the training set, however, we only add stories from a subset of all tasks.

Figure 5: Average accuracy over the generated test sets of each task. The novel entities that we add to the training data were not trained on all tasks. For a model that generalises systematically, the test accuracy should not drop for entities with only partial training data.

More specifically, the new entities are Alex, Glenn, Jordan, Mike, and Logan, for which we generate training set stories from 8/8, 6/8, 4/8, 2/8, 1/8 of the tasks respectively. We summarize the results in Figure 5 by averaging over tasks. After the network has been trained, we find that our model achieves high accuracy on entity/task pairs on which it has not been trained. This indicates its systematic generalisation capability due to the disentanglement of entities and relations.
Our analysis and the additional experiment indicate that the model seems to learn combinatorial representations, resulting in interpretable distributed representations and data efficiency due to rule-like generalisation.

8 Limitations

To compute the correct gradients, an RNN with external memory trained by backpropagation through time must store all values of all temporary variables at every time step of a sequence. Since outer product-based Fast Weights [26, 27] and our TPR system have many more time-varying variables per learnable parameter than a classic RNN such as LSTM, this makes them less scalable in terms of memory requirements. The problem can be overcome through RTRL [2, 3], but only at the expense of greater time complexity. 
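To make the memory argument concrete, here is a back-of-the-envelope sketch (sizes are illustrative assumptions) of the per-time-step state that backpropagation through time must retain for an order-3 TPR memory versus a classic vector hidden state:

```python
# Illustrative count of time-varying state stored per step by BPTT.
d = 32
vector_state = d        # classic RNN hidden state (order-1 tensor)
tpr_state = d ** 3      # order-3 TPR Fast Weight memory

# The TPR state is larger by a factor of d^2 at equal width d.
assert tpr_state // vector_state == d * d
```

In practice d differs between the two settings, but the cubic growth of the stored state is what drives the memory cost discussed above.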
Nevertheless, our results illustrate how the advantages of TPRs can outweigh such disadvantages for problems of combinatorial nature.
One difficulty of our Fast Weight-like memory is the well-known vanishing gradient problem [56]. Due to the multiplicative interaction of Fast Weights with RNN activations, forward and backward propagation is unstable and can result in vanishing or exploding activations and error signals. A similar effect may affect the forward pass if the values of the activations are not bounded by some activation function. Nevertheless, in our experiments, we abandoned bounded TPR values as they significantly slowed down learning with little benefit. Although our current sub-optimal initialization may occasionally lead to exploding activations and NaN values after the first few iterations of gradient descent, we did not observe any extreme cases after a few dozen successful steps, and therefore simply reinitialize the model in such cases.
A direct comparison with the DNC is somewhat inconclusive for the following reasons. Our architecture uses a sentence encoding layer similar to how many memory networks encode their input. This slightly facilitates the problem since the network doesn't have to learn which words belong to the same sentence. Most memory networks also iterate over sentence representations; this is less general than iterating over the word level, as the DNC does, which in turn is less general than iterating over the character level. In preliminary experiments, a word-level variation of our architecture solved many tasks, but it may require non-trivial changes to solve all of them.

9 Conclusion

Our novel RNN-TPR combination learns to decompose natural language sentences into combinatorial components useful for reasoning. It outperforms previous models on the bAbI tasks through attentional control of memory. 
Our approach is related to Fast Weight architectures, another way of increasing the memory capacity of RNNs. An analysis of a trained model suggests straightforward interpretability of the learned representations. Our model generalises better than a previous state-of-the-art model when there are strong systematic differences between training and test data.

Acknowledgments

We thank Paulo Rauber, Klaus Greff, and Filipe Mutz for helpful comments and helping hands. We are also grateful to NVIDIA Corporation for donating a DGX-1 as part of the Pioneers of AI Research Award and to IBM for donating a Minsky machine. This research was supported by a European Research Council Advanced Grant (no. 742870).

References

[1] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 1988.

[2] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.

[3] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.

[4] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997. Based on TR FKI-207-95, TUM (1995).

[5] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.

[6] H. T. Siegelmann and E. D. Sontag. Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80, 1991.

[7] Hasim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays, and Johan Schalkwyk. Google voice search: faster and more accurate. Google Research Blog, 2015, http://googleresearch.blogspot.ch/2015/09/google-voice-search-faster-and-more.html.

[8] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. Preprint arXiv:1609.08144, 2016.

[9] J. M. Pino, A. Sidorov, and N. F. Ayan. Transitioning entirely to neural machine translation. Facebook Research Blog, 2017, https://code.facebook.com/posts/289921871474277/transitioning-entirely-to-neural-machine-translation/.

[10] Jerry A Fodor and Zenon W Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71, 1988.

[11] Robert F Hadley. Systematicity in connectionist language learning. Mind & Language, 9(3):247–272, 1994.

[12] Brenden M. Lake and Marco Baroni. Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks. CoRR, abs/1711.00350, 2017.

[13] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

[14] Yuval Atzmon, Jonathan Berant, Vahid Kezami, Amir Globerson, and Gal Chechik. Learning to generalize to new compositions in image understanding. arXiv preprint arXiv:1608.07639, 2016.

[15] Steven Andrew Phillips. Connectionism and the problem of systematicity. PhD thesis, University of Queensland, 1995.

[16] Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov.
Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698, 2015.

[17] Olivier J. Brousse. Generativity and Systematicity in Neural Network Combinatorial Learning. PhD thesis, Boulder, CO, USA, 1992. UMI Order No. GAX92-20396.

[18] Paul Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159–216, 1990.

[19] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, Univ. Helsinki, 1970.

[20] H. J. Kelley. Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954, 1960.

[21] Paul J. Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

[22] C. von der Malsburg. The Correlation Theory of Brain Function. Internal report, Department of Neurobiology, Max-Planck-Institute for Biophysical Chemistry, 1981.

[23] Jerome A Feldman. Dynamic connections in neural networks. Biological Cybernetics, 46(1):27–39, 1982.

[24] Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pages 177–186, 1987.

[25] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131–139, 1992.

[26] J. Schmidhuber. On decreasing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460–463. Springer, 1993.

[27] Imanol Schlag and Jürgen Schmidhuber. Gated fast weights for on-the-fly neural program generation. In NIPS Metalearning Workshop, 2017.

[28] Paul Smolensky.
Symbolic functions from neural computation. Phil. Trans. R. Soc. A, 370(1971):3543–3569, 2012.

[29] Paul Smolensky, Moontae Lee, Xiaodong He, Wen-tau Yih, Jianfeng Gao, and Li Deng. Basic reasoning with tensor product representations. CoRR, abs/1601.02745, 2016.

[30] Jerry Fodor and Brian P McLaughlin. Connectionism and the problem of systematicity: Why Smolensky's solution doesn't work. Cognition, 35(2):183–204, 1990.

[31] Jimmy Ba, Geoffrey E Hinton, Volodymyr Mnih, Joel Z Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems, pages 4331–4339, 2016.

[32] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015.

[33] Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.

[34] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift). ArXiv e-prints, May 2018.

[35] D. O. Hebb. The Organization of Behavior. Wiley, New York, 1949.

[36] Thomas Miconi, Jeff Clune, and Kenneth O Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. arXiv preprint arXiv:1804.02464, 2018.

[37] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Technical Report FKI-147-91, Institut für Informatik, Technische Universität München, March 1991.

[38] Qiuyuan Huang, Paul Smolensky, Xiaodong He, Li Deng, and Dapeng Oliver Wu. Tensor product generation networks. CoRR, abs/1709.09118, 2017.

[39] Hamid Palangi, Paul Smolensky, Xiaodong He, and Li Deng.
Deep learning of grammatically-interpretable representations through question-answering. CoRR, abs/1705.08432, 2017.

[40] Qiuyuan Huang, Li Deng, Dapeng Wu, Chang Liu, and Xiaodong He. Attentive tensor product learning for language generation and grammar parsing. arXiv preprint arXiv:1802.07089, 2018.

[41] Frank L Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189, 1927.

[42] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models (a survey for ALT). In International Conference on Algorithmic Learning Theory, pages 19–38. Springer, 2015.

[43] Yinchong Yang, Denis Krompass, and Volker Tresp. Tensor-train recurrent neural networks for video classification. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3891–3900, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[44] Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecasting using tensor-train RNNs. arXiv preprint arXiv:1711.00073, 2017.

[45] Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. CoRR, abs/1707.05589, 2017.

[46] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916, 2014.

[47] Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. CoRR, abs/1506.07285, 2015.

[48] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. Weakly supervised memory networks. CoRR, abs/1503.08895, 2015.

[49] Julien Perez and Fei Liu. Gated end-to-end memory networks.
CoRR, abs/1610.04211, 2016.

[50] Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. CoRR, abs/1603.01417, 2016.

[51] Emmanuel Dupoux. Deconstructing AI-complete question-answering: going beyond toy tasks, 2015.

[52] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014.

[53] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

[54] Jack W. Rae, Jonathan J. Hunt, Tim Harley, Ivo Danihelka, Andrew W. Senior, Greg Wayne, Alex Graves, and Timothy P. Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. CoRR, abs/1610.09027, 2016.

[55] Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the world state with recurrent entity networks. In International Conference on Learning Representations (ICLR2017). Preprint arXiv:1612.03969.

[56] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. Advisor: J. Schmidhuber.

[57] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010.
PMLR.

[58] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[59] Timothy Dozat. Incorporating Nesterov momentum into Adam. In International Conference on Learning Representations (ICLR2016). CBLS, April 2016. OpenReview.net ID: OM0jvwB8jIp57ZJjtNEZ.