{"title": "Metalearned Neural Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 13331, "page_last": 13342, "abstract": "We augment recurrent neural networks with an external memory mechanism that builds upon recent progress in metalearning. We conceptualize this memory as a rapidly adaptable function that we parameterize as a deep neural network. Reading from the neural memory function amounts to pushing an input (the key vector) through the function to produce an output (the value vector). Writing to memory means changing the function; specifically, updating the parameters of the neural network to encode desired information. We leverage training and algorithmic techniques from metalearning to update the neural memory function in one shot. The proposed memory-augmented model achieves strong performance on a variety of learning problems, from supervised question answering to reinforcement learning.", "full_text": "Metalearned Neural Memory\n\nTsendsuren Munkhdalai, Alessandro Sordoni, Tong Wang, Adam Trischler\n\ntsendsuren.munkhdalai@microsoft.com\n\nMicrosoft Research\n\nMontr\u00e9al, Qu\u00e9bec, Canada\n\nAbstract\n\nWe augment recurrent neural networks with an external memory mechanism that\nbuilds upon recent progress in metalearning. We conceptualize this memory as a\nrapidly adaptable function that we parameterize as a deep neural network. Reading\nfrom the neural memory function amounts to pushing an input (the key vector)\nthrough the function to produce an output (the value vector). Writing to memory\nmeans changing the function; speci\ufb01cally, updating the parameters of the neural\nnetwork to encode desired information. We leverage training and algorithmic tech-\nniques from metalearning to update the neural memory function in one shot. 
The\nproposed memory-augmented model achieves strong performance on a variety of\nlearning problems, from supervised question answering to reinforcement learning.\n\n1\n\nIntroduction\n\nMany information processing tasks require memory, from sequential decision making to structured\nprediction. As such, a host of past and recent research has focused on augmenting statistical learning\nalgorithms with memory modules that rapidly record task-relevant information [38, 8, 41]. A core\ndesideratum for a memory module is the ability to store information such that it can be recalled\nfrom the same cue at later times; this reliability property has been called self-consistency [41].\nFurthermore, a memory should exhibit some degree of gen-\neralization, by recalling useful information for cues that\nhave not been encountered before, or by recalling informa-\ntion associated with what was originally stored (the degree\nof association may depend on downstream tasks). Memory\nstructures should also be ef\ufb01cient, by scaling gracefully\nwith the quantity of information stored and by enabling\nfast read-write operations.\nIn the context of neural networks, one widely successful\nmemory module is the soft look-up table [8, 46, 3]. This\nmodule stores high-dimensional key and value vectors in\ntabular format and is typically accessed by a controlled\nattention mechanism [3]. While broadly adopted, the soft\nlook-up table has several shortcomings. Look-up tables\nare ef\ufb01cient to write, but they may grow without bound if\ninformation is stored na\u00efvely in additional slots. 
Usually,\ntheir size is kept \ufb01xed and a more ef\ufb01cient writing mechanism\nis either learnt [8] or implemented heuristically [35].\nThe read operation common to most table-augmented models, which is based on soft attention, does\nnot scale well in terms of the number of slots used or in the dimensionality of stored information [32].\nFurthermore, soft look-up tables generalize only via convex combinations of stored values. This\nburdens the controller with estimating useful key and value representations.\n\nFigure 1: Schematic illustration of the MNM model. Green and blue arrows indicate data \ufb02ows for writing and reading operations, respectively. k^r_t, k^w_t, v^w_t and \u03b2_t denote the read-in key, write-in key, target value and update rate vectors.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this paper, we seek to unify few-shot metalearning and memory. We introduce an external memory\nmodule that we conceptualize as a rapidly adaptable function parameterized as a deep neural network.\nReading from this \u201cneural memory\u201d amounts to a forward pass through the network: we push an\ninput (the key vector) through the function to produce an output (the value vector). Writing to memory\nmeans updating the parameters of the neural network to encode desired information. 
We hypothesize\nthat modelling memory as a neural function will offer compression and generalization beyond that\nof soft look-up tables: deep neural networks are powerful function approximators capable of both\nstrong generalization and memorization [50, 12, 11], and their space overhead is constant.\nFor a neural network to operate as a useful memory module, it must be possible to record memories\nrapidly, i.e., to update the network in one shot based on a single datum. We address this challenge by\nborrowing techniques from few-shot learning through metalearning [1, 33, 6, 27]. Recent progress\nin this domain has shown how models can learn to implement data-ef\ufb01cient, gradient-descent-like\nupdate procedures that optimize their own parameters. To store information rapidly in memory, we\npropose a novel, layer-wise learned update rule. This update modi\ufb01es the memory parameters to\nminimize the difference between the neural function\u2019s predicted output (in response to a key) and a\ntarget value. We \ufb01nd our novel update rule to offer faster convergence than gradient-based update\nrules used commonly in the metalearning literature [7].\nWe combine our proposed neural memory module with an RNN controller and train the full model\nend-to-end (Figure 1). Our model learns to remember: we meta-train its reading and writing\nmechanisms to store information rapidly and reliably. Meta-training also promotes incremental\nstorage, as discussed in \u00a73.5. We demonstrate the effectiveness of our metalearned neural memory\n(MNM) on a diverse set of learning problems, which includes several algorithmic tasks, synthetic\nquestion answering on the bAbI dataset, and maze exploration via reinforcement learning. Our model\nachieves strong performance on all of these benchmarks.\n\n2 Related Work\n\nSeveral neural architectures have been proposed recently that combine a controller and a memory\nmodule. 
Neural Turing Machines (NTMs) extend recurrent neural networks (RNNs) with an external\nmemory matrix [8]. The RNN controller interacts with this matrix using read and write heads. Despite\nNTM\u2019s sophisticated architecture, it is unstable and dif\ufb01cult to train. A possible explanation is that its\n\ufb01xed-size matrix and the lack of a deallocation mechanism lead to information collisions in memory.\nNeural Semantic Encoders [28] address this drawback by way of a variable-size memory matrix\nand by writing new content to the most recently read memory entry. The Differentiable Neural\nComputer [9] maintains memory usage statistics to prevent information collision while still relying on\na \ufb01xed-size memory matrix. Memory Networks [46, 40] circumvent capacity issues with a read-only\nmemory matrix that scales with the number of inputs. The read-out functions of all these models\nand related variants (e.g., [20]) are based on a differentiable attention step [3] that takes a convex\ncombination of all stored memory vectors. Unfortunately, this attention step does not scale well to\nlarge memory arrays [32].\nAnother line of work incorporates dynamic, so-called \u201cfast\u201d weights [14, 38, 37] into the recurrent\nconnections of RNNs to serve as writable memory. For instance, the Hebbian learning rule [13] has\nbeen explored extensively in learning fast-weight matrices for memory [38, 2, 23, 26]. HyperNetworks\ngenerate context dependent weights as dynamic scaling terms for the weights of an RNN [10] and are\nclosely related to conditional normalization techniques [21, 5, 31].\nGradient-based fast weights have also been studied in the context of metalearning. Meta Networks\n[27] de\ufb01ne fast and slow weight branches in a single layer and train a meta-learner that generates fast\nweights, while conditionally shifted neurons [29] map loss gradients to fast biases, in both cases for\none-shot adaptation of a classi\ufb01cation network. 
Our proposed memory controller adapts its neural\nmemory model through a set of input and output pairs (called interaction vectors below) without\ndirectly interacting with the memory weights. Another related approach from the metalearning\nliterature [16, 43, 35, 33, 24] is MAML [6]. MAML discovers a parameter initialization from which\na few steps of gradient descent rapidly adapt a model to several related tasks.\nRecently, [47, 48] extended the Sparse Distributed Memory (SDM) of [18] as a generative memory\nmechanism [49], wherein the content matrix is parameterized as a linear Gaussian model. Memory\naccess then corresponds to an iterative inference procedure. Memory mechanisms based on iterative\nand/or neural functions, as in [47, 48] and this work, are also related to frameworks that cast memory\nas dynamical systems of attractors (for some background, see [42]).\n\n3 Proposed Method\n\nAt a high level, our proposed memory-augmented model operates as follows. At each time step, the\ncontroller RNN receives an external input. Based on this input and its internal state, the controller\nproduces a set of memory interaction vectors.\nIn the process of reading, the controller passes a subset of these vectors, the read-in keys, to the neural\nmemory function. The memory function outputs a read-out value (i.e., a memory recall) in response.\nIn the process of writing, the controller updates the memory function based on the remaining subset\nof interaction vectors: the write-in keys and target values.\nWe investigate two ways to bind keys to values in the neural memory: (i) by applying one step of\nmodulated gradient descent to the memory function\u2019s parameters (\u00a73.3); or (ii) by applying a learned\nlocal update rule to the parameters (\u00a73.4). The parameter update reduces the error between the\nmemory function\u2019s predicted output in response to the write-in key and the target value. 
It may be\nused to create a new association or strengthen existing associations in memory.\nFinally, based on its internal state and the memory read-out, the controller produces an output vector\nfor use in some downstream task. We meta-train the controller and the neural memory end-to-end to\nlearn effective memory access procedures (\u00a73.5), and call the proposed model the Metalearned Neural\nMemory (MNM). In the sections below we describe its components in detail. Figure 1 illustrates the\nMNM model schematically.\n\n3.1 The Controller\n\nThe controller is a function g_\u03b8 with parameters \u03b8 = {W, b}. It uses the LSTM architecture [15] as\nits core. At each time step it takes in the current external input, x_t \u2208 \u211d^{d_i}, along with the previous\nmemory read-out value v^r_{t\u22121} \u2208 \u211d^{d_v} and hidden state h_{t\u22121} \u2208 \u211d^{d_h}. It outputs a new hidden state:\nh_t = LSTM(x_t, v^r_{t\u22121}, h_{t\u22121}). The controller also produces an output vector to pass to external\nmodules (e.g., a classi\ufb01cation layer) for use in a downstream task. The output is computed as\ny_t = W_y[h_t; v^r_t] + b_y \u2208 \u211d^{d_o} and depends on the memory read-out vector v^r_t. The read-out vector is\ncomputed by the memory function, as described in \u00a73.2.\nFrom the controller\u2019s hidden state h_t we obtain a set of interaction vectors for reading from and\nwriting to the memory function. These include read-in keys k^r_t \u2208 \u211d^{d_k}, write-in keys k^w_t \u2208 \u211d^{d_k},\ntarget values v^w_t \u2208 \u211d^{d_v}, and a rate vector \u03b2'_t \u2208 \u211d^{d_k}:\n\n[k^r_{t,1}; ...; k^r_{t,H}; k^w_{t,1}; ...; k^w_{t,H}; v^w_{t,1}; ...; v^w_{t,H}; \u03b2'_t] = tanh(W_v h_t + b_v).    (1)\n\nThe controller outputs H vectors of each interaction type, where H is the number of parallel\ninteraction heads. The single rate vector is further projected down to a scalar and squashed into [0, 1]:\n\u03b2_t = sigmoid(W_\u03b2 \u03b2'_t + b_\u03b2). 
Rate \u03b2_t controls the strength with which the corresponding (key, value)\npairs should be stored in memory. The write-in keys and target values determine the content to be\nstored, whereas the read-in keys are used to retrieve content from the memory. We use separate keys\nand values for reading and writing because the model interacts with its memory in two distinct modes\nat each time step: it reads information stored there at previous time steps that it deems useful to the\ntask at hand, and it writes information related to the current input that it deems will be useful in the\nfuture. The rate \u03b2_t enables the controller to in\ufb02uence the dynamics of the gradient-based and local\nupdate procedures that encode information in memory (\u00a73.3 and \u00a73.4).\n\n3.2 The Memory Function\n\nWe model external memory as an adaptive function, f_{\u03c6_t}, parameterized as a feed-forward neural\nnetwork with weights \u03c6_t = {M^l}. Note that these weights, unlike the controller parameters \u03b8,\nchange rapidly from time step to time step and store associative bindings as the model encodes\ninformation. Reading from the memory corresponds to feeding the set of read-in keys through the\nmemory function to generate a set of read-out values, {v^r_{t,i}} = f_{\u03c6_t}({k^r_{t,i}}).\nAt hidden layer l, the memory function\u2019s forward computation is de\ufb01ned as z^l = \u03c3(M^l z^{l\u22121}), where\n\u03c3 is a nonlinear activation function, z^l \u2208 \u211d^{D_l} is the layer\u2019s activation vector, and we have dropped\nthe time-step index. We execute this computation in parallel on the set of H read-in keys k^r_{t,i} at each\ntime step, yielding H read-out value vectors. 
We take their mean to construct the single read-out\nvalue v^r_t that feeds back into the controller and the output computation for y_t.\n\n3.3 Writing to Memory with Gradient Descent\n\nA write to the memory consists in rapidly binding the write-in keys {k^w_{t,i}} to the target values\n{v^w_{t,i}} in the parameters of the neural memory. One way to do this is by updating the memory\nparameters to optimize an objective that encourages binding. We denote this memory objective by\nL^{up}_t and in this work implement it as a simple mean-squared error:\n\nL^{up}_t = (1/H) \u2211_{i=1}^{H} ||f_{\u03c6_{t\u22121}}(k^w_{t,i}) \u2212 v^w_{t,i}||_2^2    (2)\n\nWe aim to encode the target values {v^w_{t,i}} obtained from the controller by optimizing Eq. 2. We obtain\na set of memory prediction values, {v\u0302^w_{t,i}} = f_{\u03c6_{t\u22121}}({k^w_{t,i}}), by feeding the controller\u2019s write-in keys\nthrough the memory function as parameterized at the previous time step (by \u03c6_{t\u22121}). The model binds\nthe write-in keys to the target values at time t by taking a gradient step to minimize L^{up}_t:\n\n\u03c6_t \u2190 \u03c6_{t\u22121} \u2212 \u03b2_t \u2207_{\u03c6_{t\u22121}} L^{up}_t.    (3)\n\nHere, \u03b2_t is the update rate obtained from the controller, which modulates the size of the gradient step.\nIn principle, by diminishing the update rate, the controller can effectively avoid writing into memory,\nachieving an effect similar to a gated memory update [8].\nIn experiments we \ufb01nd that the mean-squared error is an effective memory objective: minimizing it\nencodes the target values in the memory function, in the sense that they can be read out (approximately)\nby passing in the corresponding keys.\n\n3.4 Writing to Memory with a Learned Local Update\n\nWriting to memory with gradient descent poses challenges. Writing a new item to a look-up\ntable can be as simple as adding an element to the corresponding array. 
In a neural memory, by\ncontrast, multiple costly gradient steps may be required to store a key-value vector pair reliably\nin the parameters. Memory parameter updates are expensive because, in end-to-end training, they\nrequire computation of higher-order gradients (see \u00a73.5; this issue is common in gradient-based\nmetalearning). Sequential back-propagation of the memory error through the layers of the memory\nmodel also adds a computational bottleneck.\nOther researchers have recognized this problem and proposed possible solutions. For example,\nthe direct feedback alignment algorithm [30], a variant of the feedback alignment method [22],\nenables weight updates via error alignment with random skip connections. Because these feedback\nconnections are neither computed nor used sequentially, updates can be parallelized for speed. However,\n\ufb01xed random feedback connections may be inef\ufb01cient. The synthetic gradient methods [17] train a\nmodel to locally predict the error gradient, which requires the true gradient as a target.\nWe propose a memory writing mechanism that is fast and gradient-free. The key idea is to represent\neach neural memory layer with decoupled forward computation and backward feedback prediction\nfunctions (BFPF) and perform local updates to the memory layers. Unlike feedback alignment\nmethods, the BFPF and the local update rules are then meta-trained, jointly. Concretely, for neural\nmemory layer l, the BFPF is a function q^l_\u03c8 with trainable parameter \u03c8^l that makes a prediction for an\nexpected activation as: z'^l = q^l_\u03c8(v^w_t). We then adopt the perceptron learning rule [34] to update the\nlayer locally:\n\nM^l_t \u2190 M^l_{t\u22121} \u2212 \u03b2^l_t (z^l \u2212 z'^l)(z^{l\u22121})^T    (4)\n\nwhere \u03b2^l_t is the local update rate that can be learned for each layer with the controller or separately\nwith the BFPF q^l_\u03c8. The perceptron update rule uses the predicted activation as a true target and\napproximates the loss gradient w.r.t. M^l_{t\u22121} via \u03b2^l_t (z^l \u2212 z'^l)(z^{l\u22121})^T. 
Therefore, the (approximate) gradient\nis near zero when z^l \u2248 z'^l and there are no changes to the weights. But if the predicted and the\nforward activations don\u2019t match, we update the weights such that the predicted activations can be\nreconstructed from the weights given the activations z^{l\u22121} from the previous layer l \u2212 1.\nTherefore, the BFPF module \ufb01rst proposes a regression problem locally for each layer and then the\nperceptron learning mechanism here solves the local regression problem. One can use a different\nlocal update mechanism, rather than the perceptron method. Note that it is helpful to choose an\nupdate mechanism that is differentiable w.r.t. its solution to the problem, since the BFPF module is\ntrained to propose problems whose solutions minimize the meta and task loss (\u00a73.5).\nWith the proposed local and gradient-free update rule, the neural memory writes to its weights in\nparallel and its computation graph need not be tracked during writing. This makes it straightforward\nto add complex structural biases, such as recurrence, into the neural memory itself. The proposed\napproach can readily be applied in the few-shot learning setup as well. For example, we can utilize\nthe learned local update method as an inner-loop adaptation mechanism in a model agnostic way. We\nleave this to future work.\n\n3.5 End-to-end Training via Meta and Task Objectives\n\nIt is important to note that the memory objective function, Eq. 2, and the memory updates, Eqs. 3, 4,\nmodify the neural memory function; these updates occur even at test time, as the model processes\nand records information (they require no external labels). 
We train the controller parameters \u03b8,\nthe memory initialization \u03c6_0, and the BFPF parameters \u03c8 end-to-end through the combination of a\ntask objective and a meta objective. The former, denoted by L^{task}, is speci\ufb01c to the task the model\nperforms, e.g., classi\ufb01cation or regression. The meta objective, L^{meta}, encourages the controller\nto learn effective update strategies for the neural memory that work well across tasks. These two\nobjectives take account of the model\u2019s behavior over an episode, i.e., a sequence of time steps,\nt = 1, ..., T, in which the model performs some temporally extended task (like question answering\nor maze navigation).\nFor the meta objective, we make use again of the mean squared error between the memory prediction\nvalues and the target values, as in Eq. 2. For the meta objective, however, we obtain the prediction\nvalues from the updated neural function at each time step, f_{\u03c6_t}, after the update step in Eq. 3 or Eq. 4\nhas been applied. We also introduce a recall delay parameter, \u03c4:\n\nL^{meta} = (1/(T H)) \u2211_{\u03c4=0}^{\ud835\udcaf} \u2211_{t=1}^{T} \u2211_{i=1}^{H} \u03bb_\u03c4 ||f_{\u03c6_t}(k^w_{t\u2212\u03c4,i}) \u2212 v^w_{t\u2212\u03c4,i}||_2^2    (5)\n\nwhere \ud835\udcaf denotes the maximum recall delay. The sum over \u03c4 can be used to reinforce older memory values, and \u03bb_\u03c4 is a decay weight for the\nimportance of older (key, value) pairs. The latter can be used, for instance, to implement a kind of\nexponential decay. 
We found that \u03bb_\u03c4 is task speci\ufb01c; in this work, we always set the maximum recall\ndelay to \ud835\udcaf = 0 to focus on reliable storage of new information.\nAt the end of each episode, we take gradients of the meta objective and the task objective with respect\nto the controller parameters, \u03b8, and use these for training the controller:\n\n\u03b8 \u2190 \u03b8 \u2212 \u2207_\u03b8 L^{task} \u2212 \u2207_\u03b8 L^{meta}.    (6)\n\nThese gradients propagate back through the memory-function updates (requiring higher-order gradients\nif using Eq. 3) so that the controller learns how to modify the memory parameters via the\ninteraction vectors (Eq. 1) and the gradient steps or local updates (Eq. 3 or Eq. 4, respectively).\nWe attempted to learn the memory initialization (similar to MAML) by updating the initial parameters\n\u03c6_0 w.r.t. the meta loss. We found that this led to severe over\ufb01tting. Therefore, we initialize the\nmemory function tabula rasa at each task episode from a \ufb01xed random parameter set.\nOptimizing the episodic task objective will often require the memory to recall information stored\nmany time steps back, after newer information has also been written. This requirement, along\nwith the fact that the optimization involves propagating gradients back through the memory update\nsteps, promotes incremental learning in the memory function, because overwriting previously stored\ninformation would harm task performance.\n\nFigure 2: Training curves on the dictionary inference task.\n\n4 Experimental Evaluation and Analysis\n\n4.1 Algorithmic Tasks\n\nWe \ufb01rst introduce a synthetic dictionary inference task to test MNM\u2019s capacity to store and recall\nassociated information. This can be considered a toy translation problem. To construct it, we\nrandomly partition the 26 letters of the English alphabet evenly into a source and a target vocabulary,\nand de\ufb01ne a random, bijective mapping F between the two sets. 
Following the few-shot learning\nsetup, we then construct a support set of k source sequences with their corresponding target sequences.\nEach source sequence consists of l letters randomly sampled (with replacement) from the source\nvocabulary, which are then mapped to the corresponding target sequence using F . The objective\nof the task is to predict the target given a previously unseen source sequence whose composing\nletters have been observed in the support set. For example, after observing the support examples\nabc\u2192def;tla\u2192qzd, the model is expected to translate input sequence tca to the output qfd.\nThe dif\ufb01culty of the task varies depending on the sequence length l and the size of the support set\nk. Longer sequences introduce long-term dependencies, whereas a larger number of observations\nrequires ef\ufb01cient memory structure and compression. We constructed four different task instances\nwith support set size of 4, 8, 12, and 16 and sequence length of 1, 4, 8 and 12.\nWe trained the MNM models with both gradient-based (MNM-g) and local memory updates (MNM-\np). The models have a controller network with 100 hidden units and a three-layer feed-forward\nmemory with 100 hidden units and tanh activation. We compare against two baseline models: a\nvanilla LSTM model and a memory-augmented model with the soft-attention look-up table as memory\n(LSTM+SALU). In the LSTM+SALU model, we replace the feed-forward neural memory with the\nlook-up table, providing an unbounded memory to the LSTM controller. The training setup is given\nin Appendix A.\nFigure 2 shows the results for our dictionary inference task (averaged over 5 runs). All models solved\nthe \ufb01rst task instance and all memory-augmented models converged for the second case. 
As the task\ndif\ufb01culty increased for the last two cases, only the MNM models converged and solved the task with\nzero error.\nIn Figure 3, we compared the training wallclock time and the memory size of MNM(-g, -p) against\nLSTM+SALU models on these task runs. When the input length is small, the LSTM+SALU model\nis faster than MNM-g and similar in speed to MNM-p, and has a smaller memory footprint than\nboth. However, as soon as the input length exceeds the number of hidden units in the MNM memory,\nLSTM+SALU becomes less ef\ufb01cient. It exhibits quadratic growth whereas the MNM models grow\napproximately linearly in the wallclock time. Figure 3\u2019s left plot also demonstrates that learned\nlocal updates (MNM-p) confer signi\ufb01cant speed bene\ufb01ts over gradient-based updates (MNM-g),\nsince the former can be applied in parallel.\n\nFigure 3: Model comparison over varying input lengths (x-axes). Left: wallclock time (LSTM+SALU, MNM-g, MNM-p). Right: memory size (LSTM+SALU, MNM).\n\nFigure 4: Training curves on programming tasks.\n\nTable 1: Results on bAbI question answering.\n\nModel | Input level | Mean Error | Failed Tasks (> 5% error)\nEntNet | Sentence-level | 9.7 \u00b1 2.6 | 5 \u00b1 1.2\nTPR-RNN | Sentence-level | 1.34 \u00b1 0.52 | 0.86 \u00b1 1.11\nDNC | Word-level | 12.8 \u00b1 4.7 | 8.2 \u00b1 2.5\nSDNC | Word-level | 6.4 \u00b1 2.5 | 4.1 \u00b1 1.6\nMNM-g | Word-level | 3.2 \u00b1 0.5 | 1.3 \u00b1 0.8\nMNM-p | Word-level | 0.55 \u00b1 0.74 | 0.08 \u00b1 0.28\n\nFigure 5: Visualization of learned memory operations. (a) At each question word (y-axis), the model recalls\nmemory contents written for entities in the story (x-axis) that are most closely related to the question type (e.g.,\nlocations for where questions). 
(b) Towards the end of a story (y-axis), the model learns to access and update\nthe memory conditioned on structured information (e.g., location and character) memorized earlier in the story\n(x-axis).\n\nWe further evaluated these models on two standard memorization tasks: double copy and priority\nsort. As shown in Figure 4, the MNM models quickly solve the double copy task with input length\n50. On the priority sort problem, the LSTM+SALU model demonstrated the strongest result. This is\nsurprising, since the task was previously shown to be hard to solve [8]. It suggests that the unbounded\nmemory table and the look-up operation are especially useful for sorting. The MNM models\u2019 strong\nperformance across the suite of algorithmic tasks, which require precise recall of past information,\nindicates that MNM can store information reliably.\n\n4.2 bAbI Question Answering\n\nbAbI is a synthetic question-answering benchmark that has been widely adopted to evaluate long-term\nmemory and reasoning [45]. It presents 20 reasoning tasks in total. We aim to solve all of them with\none generic model. Previous memory-augmented models addressed the problem with sentence-level\n[40, 20] or word-level inputs [9, 32]. Solving the task based on word-level input is harder, but more\ngeneral [36]. We trained word-level MNM models following the setup for the DNC [9].\nThe results are summarized in Table 1. The MNM model with the learned local update solved all\nthe bAbI tasks with near zero error, outperforming the result of TPR-RNN [36] (previously the best\nmodel operating at sentence-level). It also outperformed the DNC by around 12 percentage points and the Sparse\nDNC [9] by around 6 percentage points in terms of mean error. We report the best results and all 12 runs of MNM-p\nin Appendix B. 
MNM-g with the gradient-based update solved 19 tasks, failing to solve only the\nbasic induction task.\nWe also attempted to train a word-level LSTM+SALU baseline as described in the previous section.\nHowever, multiple LSTM+SALU runs did not solve any task and converged to 77.5% mean error\nafter 62K training iterations. With the same number of iterations, the MNM runs converged to a 9.5%\nmean error and solved 16.5 tasks on average. This suggests the importance of a deep neural memory\nand a learned memory access mechanism for reasoning.\nAnalyzing Learned Memory Operations: To understand the MNM more deeply, we analyzed its\ntemporal dynamics through the similarity between the keys and values of its read/write operations\nas it processed bAbI. Here cosine distance is used as a similarity metric. We found that the neural\nmemory exhibits readily interpretable structures as well as ef\ufb01cient self-organization.\n\nFigure 5 panel axis labels: x-axis: target values, y-axis: read-out (three panels); x- and y-axes: target values; x- and y-axes: write-in keys.\n\nIntuitively, keys and values correspond to the locations and contents of memory read/write operations.\nConsequently the temporal comparison of various combinations of these vectors can have meaningful\ninterpretations. For example, given two time steps t_1 < t_2, when comparing v^r_{t_2} against v^w_{t_1}, higher\nsimilarity (brighter colors in Figure 5) indicates that the content stored at t_1 is being retrieved at\nt_2. We \ufb01rst applied this comparison to a bAbI story (x-axis in Figure 5a) and the corresponding\nquestion (y-axis), since they are read consecutively by the model. Upon reading the question word\n\u201cwhere\u201d, the model successfully retrieves the location-related memories. When the model reads in\nthe character names, the retrieval is then \u201c\ufb01ltered\u201d down to only the locations of the characters in\nquestion. 
Furthermore, the retrieval also appears to be closer to more recent locations, effectively\nmodeling this strong prior in the data distribution of bAbI.\nSimilarly, we analyzed the memory operations towards the end of a story (y-axis) and examined\nhow the model uses the memory developed earlier (x-axis). Again, comparing v^r_{t_2} and v^w_{t_1} (row\n1 in Figure 5b), the bright vertical stripe at \u201challway\u201d indicates that the memory retrieval focuses\nmore on Daniel\u2019s most recent location (while ignoring both his previous locations and locations of\nother characters). In addition, v^w_{t_2} and v^w_{t_1} are compared in row 2, Figure 5b, where the dark vertical\nstripes indicate that the memory is being updated aggressively with new contents whenever a new\nlocation is mentioned, potentially establishing new associations between locations and characters.\nIn the comparison between k^w_{t_2} and k^w_{t_1} (row 3 in Figure 5b), two bright diagonals are observed in the\nsentences related to the matching character Daniel, suggesting that (a) the model has likely acquired\nan entity-based structure and (b) it is capable of leveraging this structure for ef\ufb01cient memory reuse.\nMore examples can be found in the appendix. Overall, the patterns above are salient and consistent,\nindicating our model\u2019s ability to disentangle objects and their roles from a story, and to use that\ninformation to dynamically establish and update associations in a structured, ef\ufb01cient manner, all\nof which are key to neural-symbolic reasoning [39] and effective generalization.\n\n4.3 Maze Exploration by Reinforcement Learning\n\nExternal memory may be helpful or even necessary for\nagents operating in partially observable environments,\nwhere current decision making depends on a sequence\nof past observations. 
However, memory mechanisms also\nadd complexity to an agent\u2019s learning process [19], since\nthe agent must learn to access its memory during\nexperience. In our final set of experiments, we train an RL agent\naugmented with our metalearned neural memory. We wish\nto discover whether an MNM agent can perform well in\na sequential decision making problem and use its memory\nto improve performance or sample efficiency.\nWe train MNM agents on a maze exploration task from the\nliterature on meta-reinforcement learning [4, 44]. Specifically,\nwe adopted the grid world setup from [23]. In this\ntask, for each episode, the agent explores a grid-based maze and attempts to reach a goal position.\nReaching the goal earns the agent a reward of 10 and relocates it to a random position. The agent\nmust then return to the goal to collect more reward before the episode terminates. To the agent, the\ngoal is invisible in the maze and its position is chosen randomly at the beginning of each episode.\nInputs to the agent are its surrounding 3 \u00d7 3 cells, the time step, and the reward from the previous\ntime step. The agent receives -0.1 reward for hitting a wall and 0 reward otherwise. A 9 \u00d7 9 grid\nmaze is shown in Figure 7 (Appendix A) for illustration.\nWe trained agents on a 9 \u00d7 9 maze following the setup of [23] to provide a direct comparison with\nthe differentiable plasticity agents of that work. We used the Advantage Actor-Critic (A2C) algorithm\nfor optimization, a non-asynchronous variant of the A3C method [25]. The MNM agent has a neural\nmemory and controller with 100 and 200 hidden units, respectively. The training curve for the 9 \u00d7 9\nmaze (averaged over 10 runs) is plotted in Figure 6, along with results from [23]. As can be seen, the\nagents with differentiable plasticity (denoted Plastic and Homogenous Plastic) converge to a reward\nof 175 after training on nearly 1M episodes. 
MNM, on the other hand, reaches the same reward in\nonly 250K episodes. It also obtains a significantly higher final reward after 1M episodes. This result shows\nthat the MNM fosters improved performance and sample efficiency in a sequential decision making\nscenario and, promisingly, it can be trained in conjunction with an RL policy.\n\nFigure 6: Training curves on the maze exploration task.\n\n5 Conclusion\n\nWe cast external memory for neural models as a rapidly adaptable function, itself parameterized as a\ndeep neural network. Our goal was for this memory mechanism to confer the benefits of deep neural\nnetworks\u2019 expressiveness, generalization, and constant space overhead. In order to write to a neural\nnetwork memory rapidly, in one shot, and incrementally, such that newly stored information does not\ndistort existing information, we adopted training and algorithmic techniques from metalearning. The\nproposed memory-augmented model, MNM, was shown to achieve strong performance on a wide\nvariety of learning problems, from supervised question answering to reinforcement learning. Our\nlearned local update algorithm can also be applied in settings other than memory. In future work,\nwe will investigate different neural architectures for metalearned memory and the effects of recall\ndelays in the meta objective.\n\nAcknowledgements\n\nWe thank Thomas Miconi for sharing data. We thank Geoff Gordon for helpful comments and\nsuggestions.\n\nReferences\n[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom\nSchaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by\ngradient descent. In Advances in Neural Information Processing Systems, pages 3981\u20133989,\n2016.\n\n[2] Jimmy Ba, Geoffrey E Hinton, Volodymyr Mnih, Joel Z Leibo, and Catalin Ionescu. Using fast\nweights to attend to the recent past. 
In Advances in Neural Information Processing Systems,\npages 4331\u20134339, 2016.\n\n[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\nlearning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[4] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2:\nFast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779,\n2016.\n\n[5] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for\nartistic style. CoRR, abs/1610.07629, 2(4):5, 2016.\n\n[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast\nadaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th\nInternational Conference on Machine Learning, volume 70 of Proceedings of Machine Learning\nResearch, pages 1126\u20131135, International Convention Centre, Sydney, Australia, 06\u201311 Aug\n2017. PMLR.\n\n[7] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual\nimitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.\n\n[8] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint\narXiv:1410.5401, 2014.\n\n[9] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwi\u0144ska,\nSergio G\u00f3mez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou,\net al. Hybrid computing using a neural network with dynamic external memory. Nature,\n538(7626):471, 2016.\n\n[10] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. In ICLR 2017, 2017.\n\n[11] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections\nfor efficient neural network. In Advances in neural information processing systems, pages\n1135\u20131143, 2015.\n\n[12] Babak Hassibi and David G Stork. 
Second order derivatives for network pruning: Optimal brain\nsurgeon. In Advances in neural information processing systems, pages 164\u2013171, 1993.\n\n[13] Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology\nPress, 1949.\n\n[14] Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. In Proceedings\nof the ninth annual conference of the Cognitive Science Society, pages 177\u2013186, 1987.\n\n[15] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation,\n9(8):1735\u20131780, 1997.\n\n[16] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient\ndescent. In International Conference on Artificial Neural Networks, pages 87\u201394. Springer,\n2001.\n\n[17] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves,\nDavid Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients.\nIn Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages\n1627\u20131635. JMLR.org, 2017.\n\n[18] Pentti Kanerva. Sparse distributed memory. MIT press, 1988.\n\n[19] Arbaaz Khan, Clark Zhang, Nikolay Atanasov, Konstantinos Karydis, Vijay Kumar, and\nDaniel D Lee. Memory augmented control networks. arXiv preprint arXiv:1709.05706,\n2017.\n\n[20] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani,\nVictor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory\nnetworks for natural language processing. In International Conference on Machine Learning,\npages 1378\u20131387, 2016.\n\n[21] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting deep zero-shot convolutional\nneural networks using textual descriptions. In Proceedings of the IEEE International Conference\non Computer Vision, pages 4247\u20134255, 2015.\n\n[22] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. 
Random synaptic\nfeedback weights support error backpropagation for deep learning. Nature communications,\n7, 2016.\n\n[23] Thomas Miconi, Jeff Clune, and Kenneth O Stanley. Differentiable plasticity: training plastic\nneural networks with backpropagation. arXiv preprint arXiv:1804.02464, 2018.\n\n[24] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive\nmeta-learner. 2018.\n\n[25] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap,\nTim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep\nreinforcement learning. In International conference on machine learning, pages 1928\u20131937,\n2016.\n\n[26] Tsendsuren Munkhdalai and Adam Trischler. Metalearning with Hebbian fast weights. arXiv\npreprint arXiv:1807.05076, 2018.\n\n[27] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In Doina Precup and Yee Whye Teh,\neditors, Proceedings of the 34th International Conference on Machine Learning, volume 70\nof Proceedings of Machine Learning Research, pages 2554\u20132563, International Convention\nCentre, Sydney, Australia, 06\u201311 Aug 2017. PMLR.\n\n[28] Tsendsuren Munkhdalai and Hong Yu. Neural semantic encoders. In Proceedings of the 15th\nConference of the European Chapter of the Association for Computational Linguistics: Volume\n1, Long Papers, pages 397\u2013407. Association for Computational Linguistics, 2017.\n\n[29] Tsendsuren Munkhdalai, Xingdi Yuan, Soroush Mehri, and Adam Trischler. Rapid adaptation\nwith conditionally shifted neurons. In International Conference on Machine Learning, pages\n3661\u20133670, 2018.\n\n[30] Arild N\u00f8kland. Direct feedback alignment provides learning in deep neural networks. In\nAdvances in Neural Information Processing Systems, pages 1037\u20131045, 2016.\n\n[31] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM:\nVisual reasoning with a general conditioning layer. 
arXiv preprint arXiv:1709.07871, 2017.\n\n[32] Jack Rae, Jonathan J Hunt, Ivo Danihelka, Timothy Harley, Andrew W Senior, Gregory Wayne,\nAlex Graves, and Timothy Lillicrap. Scaling memory-augmented neural networks with sparse\nreads and writes. In Advances in Neural Information Processing Systems, pages 3621\u20133629,\n2016.\n\n[33] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR\n2017, 2017.\n\n[34] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and\norganization in the brain. Psychological review, 65(6):386, 1958.\n\n[35] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap.\nMeta-learning with memory-augmented neural networks. In International conference on\nmachine learning, pages 1842\u20131850, 2016.\n\n[36] Imanol Schlag and J\u00fcrgen Schmidhuber. Learning to reason with third order tensor products. In\nAdvances in Neural Information Processing Systems, pages 10003\u201310014, 2018.\n\n[37] J Schmidhuber. Reducing the ratio between learning complexity and number of time varying\nvariables in fully recurrent nets. In International Conference on Artificial Neural Networks,\npages 460\u2013463. Springer, 1993.\n\n[38] J\u00fcrgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic\nrecurrent networks. Neural Computation, 4(1):131\u2013139, 1992.\n\n[39] Paul Smolensky. Tensor product variable binding and the representation of symbolic structures\nin connectionist systems. Artificial intelligence, 46(1-2):159\u2013216, 1990.\n\n[40] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In\nAdvances in neural information processing systems, pages 2440\u20132448, 2015.\n\n[41] Wen Sun, Alina Beygelzimer, Hal Daum\u00e9 III, John Langford, and Paul Mineiro. Contextual\nmemory trees. arXiv preprint arXiv:1807.06473, 2018.\n\n[42] Adam Trischler. 
A Computational Model for Episodic Memory Inspired by the Brain. PhD\nthesis, 2016.\n\n[43] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one\nshot learning. In Advances in Neural Information Processing Systems, pages 3630\u20133638, 2016.\n\n[44] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos,\nCharles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn.\narXiv preprint arXiv:1611.05763, 2016.\n\n[45] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merri\u00ebnboer,\nArmand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of\nprerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.\n\n[46] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916,\n2014.\n\n[47] Yan Wu, Greg Wayne, Alex Graves, and Timothy Lillicrap. The kanerva machine: A generative\ndistributed memory. ICLR 2018, 2018.\n\n[48] Yan Wu, Gregory Wayne, Karol Gregor, and Timothy Lillicrap. Learning attractor dynamics for\ngenerative memory. In Advances in Neural Information Processing Systems, pages 9401\u20139410,\n2018.\n\n[49] Richard S Zemel and Michael C Mozer. A generative model for attractor dynamics. In Advances\nin neural information processing systems, pages 80\u201388, 2000.\n\n[50] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding\ndeep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.\n", "award": [], "sourceid": 7322, "authors": [{"given_name": "Tsendsuren", "family_name": "Munkhdalai", "institution": "Microsoft Research"}, {"given_name": "Alessandro", "family_name": "Sordoni", "institution": "Microsoft Research Montreal"}, {"given_name": "TONG", "family_name": "WANG", "institution": "Microsoft Research Montreal"}, {"given_name": "Adam", "family_name": "Trischler", "institution": "Microsoft"}]}