{"title": "Relational recurrent neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7299, "page_last": 7310, "abstract": "Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember. Here, we first confirm our intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected -- i.e., tasks involving relational reasoning. We then improve upon these deficits by using a new memory module -- a Relational Memory Core (RMC) -- which employs multi-head dot product attention to allow memories to interact. Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (BoxWorld & Mini PacMan), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.", "full_text": "Relational recurrent neural networks

Adam Santoro*α, Ryan Faulkner*α, David Raposo*α, Jack Raeαβ, Mike Chrzanowskiα,
Théophane Weberα, Daan Wierstraα, Oriol Vinyalsα, Razvan Pascanuα, Timothy Lillicrapαβ

*Equal Contribution
αDeepMind, London, United Kingdom
βCoMPLEX, Computer Science, University College London, London, United Kingdom
{adamsantoro; rfaulk; draposo; jwrae; chrzanowskim; theophane; weirstra; vinyals; razp; countzero}@google.com

Abstract

Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember. 
Here, we first confirm our intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected – i.e., tasks involving relational reasoning. We then improve upon these deficits by using a new memory module – a Relational Memory Core (RMC) – which employs multi-head dot product attention to allow memories to interact. Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (e.g. Mini PacMan), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.

1 Introduction

Humans use sophisticated memory systems to access and reason about important information regardless of when it was initially perceived [1, 2]. In neural network research, many successful approaches to modeling sequential data also use memory systems, such as LSTMs [3] and memory-augmented neural networks generally [4–7]. Bolstered by augmented memory capacities, bounded computational costs over time, and an ability to deal with vanishing gradients, these networks learn to correlate events across time and become proficient at storing and retrieving information.

Here we propose that it is fruitful to consider memory interactions along with storage and retrieval. Although current models can learn to compartmentalize and relate distributed, vectorized memories, they are not biased towards doing so explicitly. We hypothesize that such a bias may allow a model to better understand how memories are related, and hence may give it a better capacity for relational reasoning over time. We begin by demonstrating that current models do indeed struggle in this domain by developing a toy task to stress relational reasoning of sequential information. 
Using a new Relational Memory Core (RMC), which uses multi-head dot product attention to allow memories to interact with each other, we solve and analyze this toy problem. We then apply the RMC to a suite of tasks that may profit from more explicit memory-memory interactions, and hence, a potentially increased capacity for relational reasoning across time: partially observed reinforcement learning tasks, program evaluation, and language modeling on the WikiText-103, Project Gutenberg, and GigaWord datasets.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Relational reasoning

We take relational reasoning to be the process of understanding the ways in which entities are connected and using this understanding to accomplish some higher order goal [8]. For example, consider sorting the distances of various trees to a park bench: the relations (distances) between the entities (trees and bench) are compared and contrasted to produce the solution, which could not be reached if one reasoned about the properties (positions) of each individual entity in isolation.

Since we can often quite fluidly define what constitutes an "entity" or a "relation", one can imagine a spectrum of neural network inductive biases that can be cast in the language of relational reasoning¹. For example, a convolutional kernel can be said to compute a relation (linear combination) of the entities (pixels) within a receptive field. Some previous approaches make the relational inductive bias more explicit: in message passing neural networks [e.g. 9–12], the nodes comprise the entities and relations are computed using learnable functions applied to nodes connected with an edge, or sometimes by reducing the relational function to a weighted sum of the source entities [e.g. 13, 14]. 
In Relation Networks [15–17], entities are obtained by exploiting spatial locality in the input image, and the model focuses on computing binary relations between each entity pair. Even further, some approaches emphasize that more capable reasoning may be possible by employing simple computational principles; by recognizing that relations might not always be tied to proximity in space, non-local computations may be better able to capture the relations between entities located far away from each other [18, 19].

In the temporal domain, relational reasoning could comprise a capacity to compare and contrast information seen at different points in time [20]. Here, attention mechanisms [e.g. 21, 22] implicitly perform some form of relational reasoning; if previous hidden states are interpreted as entities, then computing a weighted sum of entities using attention helps to remove the locality bias present in vanilla RNNs, allowing embeddings to be better related using content rather than proximity.

Since our current architectures solve complicated temporal tasks, they must have some capacity for temporal relational reasoning. However, it is unclear whether their inductive biases are limiting, and whether these limitations can be exposed with tasks demanding particular types of temporal relational reasoning. For example, memory-augmented neural networks [4–7] solve a compartmentalization problem with a slot-based memory matrix, but may have a harder time allowing memories to interact, or relate, with one another once they are encoded. LSTMs [3, 23], on the other hand, pack all information into a common hidden memory vector, potentially making compartmentalization and relational reasoning more difficult.

3 Model

Our guiding design principle is to provide an architectural backbone upon which a model can learn to compartmentalize information, and learn to compute interactions between compartmentalized information. 
To accomplish this we assemble building blocks from LSTMs, memory-augmented neural networks, and non-local networks (in particular, the Transformer seq2seq model [22]). Similar to memory-augmented architectures, we consider a fixed set of memory slots; however, we allow for interactions between memory slots using an attention mechanism. As we will describe, in contrast to previous work we apply attention between memories at a single time step, and not across all previous representations computed from all previous observations.

3.1 Allowing memories to interact using multi-head dot product attention

We will first assume that we do not need to consider memory encoding; that is, that we already have some stored memories in matrix M, with row-wise compartmentalized memories m_i. To allow memories to interact we employ multi-head dot product attention (MHDPA) [22], also known as self-attention.

¹Indeed, in the broadest sense any multivariable function must be considered "relational."

Figure 1: Relational Memory Core. (a) The RMC receives a previous memory matrix and input vector as inputs, which are passed to the MHDPA module labeled with an "A". (b) Linear projections are computed for each memory slot, and input vector, using row-wise shared weights W^q for the queries, W^k for the keys, and W^v for the values. (c) The queries, keys, and values are then compiled into matrices and softmax(QK^T)V is computed. The output of this computation is a new memory where information is blended across memories based on their attention weights. An MLP is applied row-wise to the output of the MHDPA module (a), and the resultant memory matrix is gated, and passed on as the core output or next memory state.
Using MHDPA, each memory will attend over all of the other memories, and will update its content based on the attended information.

First, a simple linear projection is used to construct queries (Q = MW^q), keys (K = MW^k), and values (V = MW^v) for each memory (i.e. row m_i) in matrix M. Next, we use the queries, Q, to perform a scaled dot-product attention over the keys, K. The returned scalars can be put through a softmax function to produce a set of weights, which can then be used to return a weighted average of the values as A(Q, K, V) = softmax(QK^T / √d_k)V, where d_k is the dimensionality of the key vectors, used as a scaling factor. Equivalently:

A_θ(M) = softmax(MW^q(MW^k)^T / √d_k) MW^v,  where θ = (W^q, W^k, W^v)    (1)

The output of A_θ(M), which we will denote as M̃, is a matrix with the same dimensionality as M. M̃ can be interpreted as a proposed update to M, with each m̃_i comprising information from memories m_j. Thus, in one step of attention each memory is updated with information originating from other memories, and it is up to the model to learn (via parameters W^q, W^k, and W^v) how to shuttle information from memory to memory.

As implied by the name, MHDPA uses multiple heads. We implement this by producing h sets of queries, keys, and values, using unique parameters to compute a linear projection from the original memory for each head h. We then independently apply an attention operation for each head. 
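Equation 1 can be sketched in a few lines of NumPy (a minimal, single-head illustration; the shapes and variable names are made up for the example, and the released Sonnet implementation differs in detail):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(M, Wq, Wk, Wv):
    """Single-head dot-product attention over a memory matrix M (equation 1).

    M: (N, F) memory matrix; Wq, Wk, Wv: (F, d) projection weights.
    Returns the proposed memory update M_tilde, one row per memory.
    """
    Q, K, V = M @ Wq, M @ Wk, M @ Wv
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (N, N) attention weights
    return weights @ V

rng = np.random.default_rng(0)
N, F, d = 4, 8, 8  # made-up sizes for the example
M = rng.standard_normal((N, F))
Wq, Wk, Wv = (rng.standard_normal((F, d)) for _ in range(3))
M_tilde = attend(M, Wq, Wk, Wv)
assert M_tilde.shape == (N, d)
```

Row i of the softmaxed weight matrix says how much memory i draws from every memory (including itself) when forming its proposed update.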
For example, if M is an N × F dimensional matrix and we employ two attention heads, then we compute M̃^1 = A_θ(M) and M̃^2 = A_φ(M), where M̃^1 and M̃^2 are N × F/2 matrices, θ and φ denote unique parameters for the linear projections used to produce the queries, keys, and values, and M̃ = [M̃^1 : M̃^2], where [:] denotes column-wise concatenation. Intuitively, heads could be useful for letting a memory share different information, to different targets, using each head.

3.2 Encoding new memories

We assumed that we already had a matrix of memories M. Of course, memories instead need to be encoded as new inputs are received. Suppose then that M is some randomly initialised memory. We can efficiently incorporate new information x into M with a simple modification to equation 1:

M̃ = softmax(MW^q([M; x]W^k)^T / √d_k) [M; x]W^v,    (2)

where we use [M; x] to denote the row-wise concatenation of M and x. Since we use [M; x] when computing the keys and values, and only M when computing the queries, M̃ is a matrix with the same dimensionality as M. Thus, equation 2 is a memory-size preserving attention operation that includes attention over the memories and the new observations. Notably, we use the same attention operation to efficiently compute memory interactions and to incorporate new information.

We also note the possible utility of this operation when the memory consists of a single vector rather than a matrix. 
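Equation 2 differs from equation 1 only in where the keys and values come from. A minimal NumPy sketch (single head, single input row; all shapes and names are assumptions for the example):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_input(M, x, Wq, Wk, Wv):
    """Equation 2: queries come from M only; keys and values from [M; x].

    M: (N, F) memories, x: (B, F) new inputs. The output has N rows, like M,
    so the memory size is preserved while attending over memories and inputs.
    """
    Mx = np.concatenate([M, x], axis=0)                # row-wise [M; x]
    Q, K, V = M @ Wq, Mx @ Wk, Mx @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (N, N+B)
    return weights @ V                                 # same row count as M

rng = np.random.default_rng(1)
N, B, F = 4, 1, 8  # made-up sizes for the example
M, x = rng.standard_normal((N, F)), rng.standard_normal((B, F))
Wq, Wk, Wv = (rng.standard_normal((F, F)) for _ in range(3))
M_tilde = attend_with_input(M, x, Wq, Wk, Wv)
assert M_tilde.shape == M.shape
```

Because only the keys and values see x, each memory row can blend in as much or as little of the new observation as its attention weights dictate.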
In this case the model may learn to pick and choose which information from the input should be written into the vector memory state by learning how to attend to the input, conditioned on what is contained in the memory already. This is possible in LSTMs via the gates, though at a different granularity. We return to this idea, and the possible compartmentalization that can occur via the heads even in the single-memory-slot case, in the discussion.

3.3 Introducing recurrence and embedding into an LSTM

Suppose we have a temporal dimension with new observations at each timestep, x_t. Since M and M̃ are the same dimensionality, we can naively introduce recurrence by first randomly initialising M and then updating it with M̃ at each timestep. We chose to do this by embedding this update into an LSTM. Suppose memory matrix M can be interpreted as a matrix of cell states, usually denoted as C, for a 2-dimensional LSTM. We can make the operations of individual memories m_i nearly identical to those in a normal LSTM cell state as follows (subscripts are overloaded to denote the row from a matrix, and timestep; e.g., m_{i,t} is the ith row from M at time t):

s_{i,t} = (h_{i,t-1}, m_{i,t-1})    (3)
f_{i,t} = W^f x_t + U^f h_{i,t-1} + b^f    (4)
i_{i,t} = W^i x_t + U^i h_{i,t-1} + b^i    (5)
o_{i,t} = W^o x_t + U^o h_{i,t-1} + b^o    (6)
m_{i,t} = σ(f_{i,t} + b̃^f) ◦ m_{i,t-1} + σ(i_{i,t}) ◦ g_ψ(m̃_{i,t})    (7)
h_{i,t} = σ(o_{i,t}) ◦ tanh(m_{i,t})    (8)
s_{i,t+1} = (m_{i,t}, h_{i,t})    (9)

The final term of equation 7, σ(i_{i,t}) ◦ g_ψ(m̃_{i,t}), is the modification to a standard LSTM. 
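The per-row gated update in equations 4–8 can be sketched as follows. This is an illustrative NumPy sketch under assumptions: the weights are shared across rows as described, but g_ψ is stood in for by a single tanh layer rather than the paper's choice, and all sizes are invented:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rmc_lstm_step(x_t, h_prev, m_prev, m_tilde, p):
    """One step of equations 4-8 applied to all memory rows at once.

    x_t: (F,) input; h_prev, m_prev, m_tilde: (N, D); p: dict of weights
    shared across rows. g_psi is stood in for by a tanh layer (assumption).
    """
    f = x_t @ p["Wf"] + h_prev @ p["Uf"] + p["bf"]  # forget gate (eq. 4)
    i = x_t @ p["Wi"] + h_prev @ p["Ui"] + p["bi"]  # input gate (eq. 5)
    o = x_t @ p["Wo"] + h_prev @ p["Uo"] + p["bo"]  # output gate (eq. 6)
    g = np.tanh(m_tilde @ p["Wg"] + p["bg"])        # stand-in for g_psi
    m = sigmoid(f + p["bf_offset"]) * m_prev + sigmoid(i) * g  # eq. 7
    h = sigmoid(o) * np.tanh(m)                                # eq. 8
    return h, m

rng = np.random.default_rng(2)
N, F, D = 4, 8, 8  # memories, input size, memory size (made-up values)
p = {k: 0.1 * rng.standard_normal((F, D)) for k in ("Wf", "Wi", "Wo")}
p |= {k: 0.1 * rng.standard_normal((D, D)) for k in ("Uf", "Ui", "Uo", "Wg")}
p |= {k: np.zeros(D) for k in ("bf", "bi", "bo", "bg")}
p["bf_offset"] = 1.0  # the b~f forget-bias offset in equation 7
h, m = rmc_lstm_step(rng.standard_normal(F), np.zeros((N, D)),
                     np.zeros((N, D)), rng.standard_normal((N, D)), p)
```

Because every W and U is shared across rows, changing the number of memory rows N leaves the parameter count untouched, which is what makes the number of memories a free capacity knob.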
In practice we did not find output gates necessary – please see the URL in the footnote for our TensorFlow implementation of this model in the Sonnet library², and for the exact formulation we used, including our choice for the g_ψ function (briefly, we found a row/memory-wise MLP with layer normalisation to work best). There is also an interesting opportunity to introduce a different kind of gating, which we call 'memory' gating, which resembles previous gating ideas [24, 3]. Instead of producing scalar gates for each individual unit ('unit' gating), we can produce scalar gates for each memory row by converting W^f, W^i, W^o, U^f, U^i, and U^o from weight matrices into weight vectors, and by replacing the element-wise product in the gating equations with scalar-vector multiplication.

Since parameters W^f, W^i, W^o, U^f, U^i, U^o, and ψ are shared for each m_i, we can modify the number of memories without affecting the number of parameters. Thus, tuning the number of memories and the size of each memory can be used to balance the overall storage capacity (equal to the total number of units, or elements, in M) and the number of parameters (proportional to the dimensionality of m_i). We find in our experiments that some tasks require more, but not necessarily larger, memories, and others such as language modeling require fewer, larger memories.

²https://github.com/deepmind/sonnet/blob/master/sonnet/python/modules/relational_memory.py

Figure 2: Tasks. We tested the RMC on a suite of supervised and reinforcement learning tasks. Notable are the Nth Farthest toy task and language modeling. In the former, the solution requires explicit relational reasoning since the model must sort distance relations between vectors, and not the vectors themselves. 
The latter tests the model on a large quantity of natural data and allows us to compare performance to well-tuned models.

Thus, we have a number of tunable parameters: the number of memories, the size of each memory, the number of attention heads, the number of steps of attention, the gating method, and the post-attention processor g_ψ. In the appendix we list the exact configurations for each task.

4 Experiments

Here we briefly outline the tasks on which we applied the RMC, and direct the reader to the appendix for full details on each task and details on hyperparameter settings for the model.

4.1 Illustrative supervised tasks

Nth Farthest   The Nth Farthest task is designed to stress a capacity for relational reasoning across time. Inputs are a sequence of randomly sampled vectors, and targets are answers to a question of the form: "What is the nth farthest vector (in Euclidean distance) from vector m?", where the vector values, their IDs, n, and m are randomly sampled per sequence. It is not enough to simply encode and retrieve information as in a copy task. Instead, a model must compute all pairwise distance relations to the reference vector m, which might also lie in memory, or might not have even been provided as input yet. It must then implicitly sort these distances to produce the answer. We emphasize that the model must sort distance relations between vectors, and not the vectors themselves.

Program Evaluation   The Learning to Execute (LTE) dataset [25] consists of algorithmic snippets from a Turing-complete programming language of pseudo-code, and is broken down into three categories: addition, control, and full program. Inputs are a sequence of characters over an alphanumeric vocabulary representing such snippets, and the target is a numeric sequence of characters that is the execution output for the given programmatic input. 
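As a concrete illustration, a hypothetical input/target pair in the style of the LTE dataset might look as follows (this exact snippet is invented for the example, not drawn from the dataset; the model receives the program text character by character and must emit the characters of its output):

```python
# Hypothetical LTE-style program, written here as plain Python; the dataset
# itself uses a similar pseudo-code (e.g. "for [19]:" for a 19-iteration loop).
x = 339
for _ in range(19):
    x += 597
print(x)  # the target is the character sequence "11682"
```

Producing the target requires tracking the evolving value of x, i.e. maintaining the relation between the operator, its operands, and the loop count across the whole sequence.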
Given that the snippets involve symbolic manipulation of variables, we felt this task could strain a model's capacity for relational reasoning; since symbolic operators can be interpreted as defining a relation over the operands, successful learning could reflect an understanding of this relation. To assess model performance on classical sequence tasks, we also evaluated on memorization tasks, in which the output is simply a permuted form of the input rather than an evaluation from a set of operational instructions. See the appendix for further experimental details.

4.2 Reinforcement learning

Mini Pacman with viewport   We follow the formulation of Mini Pacman from [26]. Briefly, the agent navigates a maze to collect food while being chased by ghosts. However, we implement this task with a viewport: a 5 × 5 window surrounding the agent that comprises the perceptual input. The task is therefore partially observable, since the agent must navigate the space and take in information through this viewport. Thus, the agent must predict the dynamics of the ghosts in memory, and plan its navigation accordingly, also based on remembered information about which food has already been picked up. We also point the reader to the appendix for a description and results of another RL task called BoxWorld, which demands relational reasoning in memory space.

4.3 Language Modeling

Finally, we investigate the task of word-based language modeling. 
We model the conditional probability p(w_t | w