{"title": "Using Fast Weights to Attend to the Recent Past", "book": "Advances in Neural Information Processing Systems", "page_first": 4331, "page_last": 4339, "abstract": "Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These ``fast weights'' can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proven helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.", "full_text": "Using Fast Weights to Attend to the Recent Past\n\nJimmy Ba\nUniversity of Toronto\njimmy@psi.toronto.edu\n\nGeoffrey Hinton\nUniversity of Toronto and Google Brain\ngeoffhinton@google.com\n\nVolodymyr Mnih\nGoogle DeepMind\nvmnih@google.com\n\nJoel Z. Leibo\nGoogle DeepMind\njzl@google.com\n\nCatalin Ionescu\nGoogle DeepMind\ncdi@google.com\n\nAbstract\n\nUntil recently, research on artificial neural networks was largely restricted to systems\nwith only two types of variable: Neural activities that represent the current\nor recent input and weights that learn to capture regularities among inputs, outputs\nand payoffs. There is no good reason for this restriction. Synapses have dynamics\nat many different time-scales and this suggests that artificial neural networks\nmight benefit from variables that change slower than activities but much faster\nthan the standard weights. 
These \u201cfast weights\u201d can be used to store temporary\nmemories of the recent past and they provide a neurally plausible way of imple-\nmenting the type of attention to the past that has recently proved very helpful in\nsequence-to-sequence models. By using fast weights we can avoid the need to\nstore copies of neural activity patterns.\n\n1\n\nIntroduction\n\nOrdinary recurrent neural networks typically have two types of memory that have very different time\nscales, very different capacities and very different computational roles. The history of the sequence\ncurrently being processed is stored in the hidden activity vector, which acts as a short-term memory\nthat is updated at every time step. The capacity of this memory is O(H) where H is the number\nof hidden units. Long-term memory about how to convert the current input and hidden vectors into\nthe next hidden vector and a predicted output vector is stored in the weight matrices connecting the\nhidden units to themselves and to the inputs and outputs. These matrices are typically updated at the\nend of a sequence and their capacity is O(H 2) + O(IH) + O(HO) where I and O are the numbers\nof input and output units.\nLong short-term memory networks [Hochreiter and Schmidhuber, 1997] are a more complicated\ntype of RNN that work better for discovering long-range structure in sequences for two main reasons:\nFirst, they compute increments to the hidden activity vector at each time step rather than recomputing\nthe full vector1. This encourages information in the hidden states to persist for much longer. Second,\nthey allow the hidden activities to determine the states of gates that scale the effects of the weights.\nThese multiplicative interactions allow the effective weights to be dynamically adjusted by the input\nor hidden activities via the gates. 
However, LSTMs are still limited to a short-term memory capacity\nof O(H) for the history of the current sequence.\nUntil recently, there was surprisingly little practical investigation of other forms of memory in recurrent\nnets despite strong psychological evidence that it exists and obvious computational reasons why\nit was needed. There were occasional suggestions that neural networks could benefit from a third\nform of memory that has much higher storage capacity than the neural activities but much faster\ndynamics than the standard slow weights. This memory could store information specific to the history\nof the current sequence so that this information is available to influence the ongoing processing\nwithout using up the memory capacity of the hidden activities. Hinton and Plaut [1987] suggested\nthat fast weights could be used to allow true recursion in a neural network and Schmidhuber [1993]\npointed out that a system of this kind could be trained end-to-end using backpropagation, but neither\nof these papers actually implemented this method of achieving recursion.\n\nFootnote 1: This assumes the \u201cremember gates\u201d of the LSTM memory cells are set to one.\n\n2 Evidence from physiology that temporary memory may not be stored as neural activities\n\nProcesses like working memory, attention, and priming operate on timescales of 100ms to minutes.\nThis is simultaneously too slow to be mediated by neural activations without dynamical attractor\nstates (10ms timescale) and too fast for long-term synaptic plasticity mechanisms to kick in (minutes\nto hours). 
While artificial neural network research has typically focused on methods to maintain\ntemporary state in activation dynamics, that focus may be inconsistent with evidence that the brain\nalso, or perhaps primarily, maintains temporary state information by short-term synaptic plasticity\nmechanisms [Tsodyks et al., 1998, Abbott and Regehr, 2004, Barak and Tsodyks, 2007].\nThe brain implements a variety of short-term plasticity mechanisms that operate on intermediate\ntimescales. For example, short-term facilitation is implemented by leftover [Ca2+] in the axon terminal\nafter depolarization while short-term depression is implemented by presynaptic neurotransmitter\ndepletion [Zucker and Regehr, 2002]. Spike-time dependent plasticity can also be invoked on this\ntimescale [Markram et al., 1997, Bi and Poo, 1998]. These plasticity mechanisms are all synapse-specific.\nThus they are more accurately modeled by a memory with O(H^2) capacity than the O(H)\nof standard recurrent artificial neural nets and LSTMs.\n\n3 Fast Associative Memory\n\nOne of the main preoccupations of neural network research in the 1970s and early 1980s [Willshaw\net al., 1969, Kohonen, 1972, Anderson and Hinton, 1981, Hopfield, 1982] was the idea that memories\nwere not stored by somehow keeping copies of patterns of neural activity. Instead, these patterns\nwere reconstructed when needed from information stored in the weights of an associative network\nand the very same weights could store many different memories. An auto-associative memory that\nhas N^2 weights cannot be expected to store more than N real-valued vectors with N components\neach. How close we can come to this upper bound depends on which storage rule we use. Hopfield\nnets use a simple, one-shot, outer-product storage rule and achieve a capacity of approximately\n0.15N binary vectors using weights that require log(N) bits each. 
Much more ef\ufb01cient use can\nbe made of the weights by using an iterative, error correction storage rule to learn weights that can\nretrieve each bit of a pattern from all the other bits [Gardner, 1988], but for our purposes maximizing\nthe capacity is less important than having a simple, non-iterative storage rule, so we will use an outer\nproduct rule to store hidden activity vectors in fast weights that decay rapidly. The usual weights in\nan RNN will be called slow weights and they will learn by stochastic gradient descent in an objective\nfunction taking into account the fact that changes in the slow weights will lead to changes in what\ngets stored automatically in the fast associative memory.\nA fast associative memory has several advantages when compared with the type of memory assumed\nby a Neural Turing Machine (NTM) [Graves et al., 2014], Neural Stack [Grefenstette et al., 2015], or\nMemory Network [Weston et al., 2014]. First, it is not at all clear how a real brain would implement\nthe more exotic structures in these models e.g., the tape of the NTM, whereas it is clear that the brain\ncould implement a fast associative memory in synapses with the appropriate dynamics. Second, in\na fast associative memory there is no need to decide where or when to write to memory and where\nor when to read from memory. The fast memory is updated all the time and the writes are all\nsuperimposed on the same fast changing component of the strength of each synapse. Every time the\ninput changes there is a transition to a new hidden state which is determined by a combination of\nthree sources of information: The new input via the slow input-to-hidden weights, C, the previous\nhidden state via the slow transition weights, W , and the recent history of hidden state vectors via\nthe fast weights, A. 
The effect of the first two sources of information on the new hidden state can be\ncomputed once and then maintained as a sustained boundary condition for a brief iterative settling\nprocess which allows the fast weights to influence the new hidden state. Assuming that the fast\nweights decay exponentially, we now show that the effect of the fast weights on the hidden vector\nduring an iterative settling phase is to provide an additional input that is proportional to the sum over\nall recent hidden activity vectors of the scalar product of that recent hidden vector with the current\nhidden activity vector, with each term in this sum being weighted by the decay rate raised to the\npower of how long ago that hidden vector occurred. So fast weights act like a kind of attention to\nthe recent past but with the strength of the attention being determined by the scalar product between\nthe current hidden vector and the earlier hidden vector rather than being determined by a separate\nparameterized computation of the type used in neural machine translation models [Bahdanau et al.,\n2015].\n\nFigure 1: The fast associative memory model.\n\nThe update rule for the fast memory weight matrix, A, is simply to multiply the current fast weights\nby a decay rate, \u03bb, and add the outer product of the hidden state vector, h(t), multiplied by a learning\nrate, \u03b7:\n\nA(t) = \u03bbA(t \u2212 1) + \u03b7h(t)h(t)^T    (1)\n\nThe next vector of hidden activities, h(t + 1), is computed in two steps. The \u201cpreliminary\u201d vector\nh_0(t + 1) is determined by the combined effects of the input vector x(t) and the previous hidden\nvector: h_0(t + 1) = f(W h(t) + Cx(t)), where W and C are slow weight matrices and f(.)\nis the nonlinearity used by the hidden units. 
The preliminary vector is then used to initiate an\n\u201cinner loop\u201d iterative process which runs for S steps and progressively changes the hidden state into\nh(t + 1) = h_S(t + 1):\n\nh_{s+1}(t + 1) = f([W h(t) + Cx(t)] + A(t)h_s(t + 1)),    (2)\n\nwhere the terms in square brackets are the sustained boundary conditions. In a real neural net,\nA could be implemented by rapidly changing synapses but in a computer simulation that uses sequences\nwhich have fewer time steps than the dimensionality of h, A will be of less than full rank\nand it is more efficient to compute the term A(t)h_s(t + 1) without ever computing the full fast weight\nmatrix, A. Assuming A is 0 at the beginning of the sequence,\n\nA(t) = \u03b7 \u2211_{\u03c4=1}^{t} \u03bb^{t\u2212\u03c4} h(\u03c4)h(\u03c4)^T    (3)\n\nA(t)h_s(t + 1) = \u03b7 \u2211_{\u03c4=1}^{t} \u03bb^{t\u2212\u03c4} h(\u03c4)[h(\u03c4)^T h_s(t + 1)]    (4)\n\nThe term in square brackets is just the scalar product of an earlier hidden state vector, h(\u03c4), with the\ncurrent hidden state vector, h_s(t + 1), during the iterative inner loop. So at each iteration of the inner\nloop, the fast weight matrix is exactly equivalent to attending to past hidden vectors in proportion\nto their scalar product with the current hidden vector, weighted by a decay factor. During the inner\nloop iterations, attention will become more focussed on past hidden states that manage to attract the\ncurrent hidden state.\nThe equivalence between using a fast weight matrix and comparing with a set of stored hidden state\nvectors is very helpful for computer simulations. It allows us to explore what can be done with fast\nweights without incurring the huge penalty of having to abandon the use of mini-batches during\ntraining. 
At first sight, mini-batches cannot be used because the fast weight matrix is different for\nevery sequence, but comparing with a set of stored hidden vectors does allow mini-batches.\n\n3.1 Layer normalized fast weights\n\nA potential problem with fast associative memory is that the scalar product of two hidden vectors\ncould vanish or explode depending on the norm of the hidden vectors. Recently, layer normalization\n[Ba et al., 2016] has been shown to be very effective at stabilizing the hidden state dynamics in RNNs\nand reducing training time. Layer normalization is applied to the vector of summed inputs to all the\nrecurrent units at a particular time step. It uses the mean and variance of the components of this\nvector to re-center and re-scale those summed inputs. Then, before applying the nonlinearity, it includes\na learned, neuron-specific bias and gain. We apply layer normalization to the fast associative\nmemory as follows:\n\nh_{s+1}(t + 1) = f(LN[W h(t) + Cx(t) + A(t)h_s(t + 1)])    (5)\n\nwhere LN[.] denotes layer normalization. We found that applying layer normalization on each\niteration of the inner loop makes the fast associative memory more robust to the choice of learning\nrate and decay hyper-parameters. For the rest of the paper, fast weight models are trained using\nlayer normalization and the outer product learning rule with a fast learning rate of 0.5 and a decay rate\nof 0.95, unless otherwise noted.\n\n4 Experimental results\n\nTo demonstrate the effectiveness of the fast associative memory, we first investigated the problems\nof associative retrieval (section 4.1) and MNIST classification (section 4.2). We compared fast\nweight models to regular RNNs and LSTM variants. We then applied the proposed fast weights\nto a facial expression recognition task using a fast associative memory model to store the results\nof processing at one level while examining a sequence of details at a finer level (section 4.3). 
The\nhyper-parameters of the experiments were selected through grid search on the validation set. All\nthe models were trained using mini-batches of size 128 and the Adam optimizer [Kingma and Ba,\n2014]. A description of the training protocols and the hyper-parameter settings we used can be\nfound in the Appendix. Lastly, we show that fast weights can also be used effectively to implement\nreinforcement learning agents with memory (section 4.4).\n\n4.1 Associative retrieval\n\nWe start by demonstrating that the method we propose for storing and retrieving temporary memories\nworks effectively for a toy task to which it is very well suited. Consider a task where multiple\nkey-value pairs are presented in a sequence. At the end of the sequence, one of the keys is presented\nand the model must predict the value that was temporarily associated with the key. We used strings\nthat contained characters from the English alphabet, together with the digits 0 to 9. To construct a training\nsequence, we first randomly sample a character from the alphabet without replacement. This is\nthe first key. Then a single digit is sampled as the associated value for that key. After generating a\nsequence of K character-digit pairs, one of the K different characters is selected at random as the\nquery and the network must predict the associated digit. Some examples of such string sequences\nand their targets are shown below:\n\nInput string  Target\nc9k8j3f1??c   9\nj0a5s5z2??a   5\n\nwhere \u2018?\u2019 is the token to separate the query from the key-value pairs. We generated 100,000 training\nexamples, 10,000 validation examples and 20,000 test examples. To solve this task, a standard RNN\nhas to end up with hidden activities that somehow store all of the key-value pairs after the keys and\nvalues are presented sequentially. This makes it a significant challenge for models only using slow\nweights.\nWe used a neural network with a single recurrent layer for this experiment. 
The recurrent network\nprocesses the input sequence one character at a time. The input character is first converted into a\nlearned 100-dimensional embedding vector which then provides input to the recurrent layer [footnote 2]. The\noutput of the recurrent layer at the end of the sequence is then processed by another hidden layer\nof 100 ReLUs before the final softmax layer. We augment the ReLU RNN with a fast associative\nmemory and compare it to an LSTM model with the same architecture. Although the original\nLSTMs do not have explicit long-term storage capacity, recent work from Danihelka et al. [2016]\nextended LSTMs by adding complex associative memory. In our experiments, we compared fast\nassociative memory to both LSTM variants.\n\nModel         R=20    R=50    R=100\nIRNN          62.11%  60.23%  0.34%\nLSTM          60.81%  1.85%   0%\nA-LSTM        60.13%  1.62%   0%\nFast weights  1.81%   0%      0%\n\nTable 1: Classification error rate comparison on the associative retrieval task.\n\nFigure 2: Comparison of the test log likelihood on the associative retrieval task with 50 recurrent hidden units.\n\nFigure 2 and Table 1 show that when the number of recurrent units is small, the fast associative\nmemory significantly outperforms the LSTMs with the same number of recurrent units. The result\nfits with our hypothesis that the fast associative memory allows the RNN to use its recurrent units\nmore effectively. In addition to having higher retrieval accuracy, the model with fast weights also\nconverges faster than the LSTM models.\n\n4.2 Integrating glimpses in visual attention models\n\nDespite their many successes, convolutional neural networks are computationally expensive and the\nrepresentations they learn can be hard to interpret. Recently, visual attention models [Mnih et al.,\n2014, Ba et al., 2015, Xu et al., 2015] have been shown to overcome some of the limitations in\nConvNets. 
One can understand what signals the algorithm is using by seeing where the model is\nlooking. Also, the visual attention model is able to selectively focus on important parts of visual\nspace and thus avoid any detailed processing of much of the background clutter. In this section,\nwe show that visual attention models can use fast weights to store information about object parts,\nthough we use a very restricted set of glimpses that do not correspond to natural parts of the objects.\nGiven an input image, a visual attention model computes a sequence of glimpses over regions of the\nimage. The model not only has to determine where to look next, but also has to remember what it has\nseen so far in its working memory so that it can make the correct classi\ufb01cation later. Visual attention\nmodels can learn to \ufb01nd multiple objects in a large static input image and classify them correctly,\nbut the learnt glimpse policies are typically over-simplistic: They only use a single scale of glimpses\nand they tend to scan over the image in a rigid way. Human eye movements and \ufb01xations are far\nmore complex. The ability to focus on different parts of a whole object at different scales allows\nhumans to apply the very same knowledge in the weights of the network at many different scales,\nbut it requires some form of temporary memory to allow the network to integrate what it discovered\nin a set of glimpses. Improving the model\u2019s ability to remember recent glimpses should help the\nvisual attention model to discover non-trivial glimpse policies. 
Because the fast weights can store\nall the glimpse information in the sequence, the hidden activity vector is freed up to learn how to\nintelligently integrate visual information and retrieve the appropriate memory content for the final\nclassifier.\nTo explicitly verify that larger memory capacity is beneficial to visual attention-based models, we\nsimplify the learning process in the following way: First, we provide a pre-defined glimpse control\nsignal so the model knows where to attend rather than having to learn the control policy through\nreinforcement learning. Second, we introduce an additional control signal to the memory cells so\nthe attention model knows when to store the glimpse information. A typical visual attention model is\ncomplex and has high variance in its performance due to the need to learn the policy network and the\nclassifier at the same time. Our simplified learning procedure enables us to discern the performance\nimprovement contributed by using fast weights to remember the recent past.\n\nFootnote 2: To make the architecture for this task more similar to the architecture for the next task we first compute a\n50-dimensional embedding vector and then expand this to a 100-dimensional embedding.\n\n[Figure 2 plot: negative log likelihood against updates (x 5000) for A-LSTM 50, IRNN 50, LSTM 50 and FW 50.]\n\nFigure 3: The multi-level fast associative memory model.\n\nModel         50 features  100 features  200 features\nIRNN          12.95%       1.95%         1.42%\nLSTM          12%          1.55%         1.10%\nConvNet       1.81%        1.00%         0.9%\nFast weights  7.21%        1.30%         0.85%\n\nTable 2: Classification error rates on MNIST.\n\nWe consider a simple recurrent visual attention model that has a similar architecture to the RNN from\nthe previous experiment. It does not predict where to attend but rather is given a fixed sequence of\nlocations: the static input image is broken down into four non-overlapping quadrants recursively\nwith two scale levels. 
The four coarse regions, down-sampled to 7 \u00d7 7, along with their four\n7 \u00d7 7 quadrants are presented in a single sequence as shown in Figure 1. Notice that the two glimpse\nscales form a two-level hierarchy in the visual space. In order to solve this task successfully, the\nattention model needs to integrate the glimpse information from different levels of the hierarchy.\nOne solution is to use the model\u2019s hidden states to both store and integrate the glimpses of different\nscales. A much more efficient solution is to use a temporary \u201ccache\u201d to store any unfinished\nglimpse computation when processing the glimpses from a finer scale in the hierarchy. Once the\ncomputation is finished at that scale, the results can be integrated with the partial results at the\nhigher level by \u201cpopping\u201d the previous result from the \u201ccache\u201d. Fast weights, therefore, can act as\na neurally plausible \u201ccache\u201d for storing partial results. The slow weights of the same model can\nthen specialize in integrating glimpses at the same scale. Because the slow weights are shared for\nall glimpse scales, the model should be able to store the partial results at several levels in the same\nset of fast weights, though we have only demonstrated the use of fast weights for storage at a single\nlevel.\nWe evaluated the multi-level visual attention model on the MNIST handwritten digit dataset. MNIST\nis a well-studied problem on which many other techniques have been benchmarked. It contains the\nten classes of handwritten digits, ranging from 0 to 9. The task is to predict the class label of an\nisolated and roughly normalized 28 \u00d7 28 image of a digit. The glimpse sequence, in this case, consists\nof 24 patches of 7 \u00d7 7 pixels.\nTable 2 compares classification results for a ReLU RNN with a multi-level fast associative memory\nagainst an LSTM that gets the same sequence of glimpses. 
Again the result shows that when\nthe number of hidden units is limited, fast weights give a significant improvement over the other\nmodels. As we increase the memory capacities, the multi-level fast associative memory consistently\noutperforms the LSTM in classification accuracy.\n\nFigure 4: Examples of the near frontal faces from the MultiPIE dataset.\n\nModel          IRNN   LSTM   ConvNet  Fast Weights\nTest accuracy  81.11  81.32  88.23    86.34\n\nTable 3: Classification accuracy comparison on the facial expression recognition task.\n\nUnlike models that must integrate a sequence of glimpses, convolutional neural networks process all\nthe glimpses in parallel and use layers of hidden units to hold all their intermediate computational\nresults. We further demonstrate the effectiveness of the fast weights by comparing to a three-layer\nconvolutional neural network that uses the same patches as the glimpses presented to the visual\nattention model. From Table 2, we see that the multi-level model with fast weights reaches a very\nsimilar performance to the ConvNet model without requiring any biologically implausible weight\nsharing.\n\n4.3 Facial expression recognition\n\nTo further investigate the benefits of using fast weights in the multi-level visual attention model, we\nperformed facial expression recognition tasks on the CMU Multi-PIE face database [Gross et al.,\n2010]. The dataset was preprocessed to align each face by eyes and nose fiducial points. It was\ndownsampled to 48 \u00d7 48 greyscale. The full dataset contains 15 photos taken from cameras with\ndifferent viewpoints for each illumination \u00d7 expression \u00d7 identity \u00d7 session condition. 
We used\nonly the images taken from the three central cameras corresponding to \u221215\u00b0, 0\u00b0, 15\u00b0 views since\nfacial expressions were not discernible from the more extreme viewpoints. The resulting dataset\ncontained > 100,000 images. 317 identities appeared in the training set with the remaining 20\nidentities in the test set.\nGiven the input face image, the goal is to classify the subject\u2019s facial expression into one of the six\ndifferent categories: neutral, smile, surprise, squint, disgust and scream. The task is more realistic\nand challenging than the previous MNIST experiments. Not only does the dataset have unbalanced\nnumbers of labels, but some of the expressions, for example squint and disgust, are very hard to distinguish.\nIn order to perform well on this task, the models need to generalize over different lighting\nconditions and viewpoints. We used the same multi-level attention model as in the MNIST experiments\nwith 200 recurrent hidden units. The model sequentially attends to non-overlapping 12 \u00d7 12\npixel patches at two different scales and there are, in total, 24 glimpses. Similarly, we designed a\ntwo-layer ConvNet that has 12 \u00d7 12 receptive fields.\nFrom Table 3, we see that the multi-level fast weights model that knows when to store information\noutperforms the LSTM and the IRNN. The results are consistent with previous MNIST experiments.\nHowever, ConvNet is able to perform better than the multi-level attention model on this near frontal\nface dataset. We think the efficient weight-sharing and architectural engineering in the ConvNet\ncombined with the simultaneous availability of all the information at each level of processing allows\nthe ConvNet to generalize better in this task. 
Our use of a rigid and predetermined policy for where\nto glimpse eliminates one of the main potential advantages of the multi-level attention model: It can\nprocess informative details at high resolution whilst ignoring most of the irrelevant details. To realize\nthis advantage we will need to combine the use of fast weights with the learning of complicated\npolicies.\n\nFigure 5: a) Sample screen from the game \u201cCatch\u201d. b) Performance curves for Catch with N = 16, M = 3. c) Performance curves for Catch with N = 24, M = 5.\n\n4.4 Agents with memory\n\nWhile different kinds of memory and attention have been studied extensively in the supervised\nlearning setting [Graves, 2014, Mnih et al., 2014, Bahdanau et al., 2015], the use of such models for\nlearning long range dependencies in reinforcement learning has received less attention.\nWe compare different memory architectures on a partially observable variant of the game \u201cCatch\u201d\ndescribed in [Mnih et al., 2014]. The game is played on an N \u00d7 N screen of binary pixels and each\nepisode consists of N frames. Each trial begins with a single pixel, representing a ball, appearing\nsomewhere in the first row of the column and a two-pixel \u201cpaddle\u201d controlled by the agent in the\nbottom row. After observing a frame, the agent gets to either keep the paddle stationary or move it\nright or left by one pixel. The ball descends by a single pixel after each frame. The episode ends\nwhen the ball pixel reaches the bottom row and the agent receives a reward of +1 if the paddle\ntouches the ball and a reward of \u22121 if it doesn\u2019t. Solving the fully observable task is straightforward\nand requires the agent to move the paddle to the column with the ball. We make the task partially-observable\nby providing the agent blank observations after the Mth frame. 
Solving the partially-\nobservable version of the game requires remembering the position of the paddle and ball after M\nframes and moving the paddle to the correct position using the stored information.\nWe used the recently proposed asynchronous advantage actor-critic method [Mnih et al., 2016] to\ntrain agents with three types of memory on different sizes of the partially observable Catch task. The\nthree agents included a ReLU RNN, an LSTM, and a fast weights RNN. Figure 5 shows learning\nprogress of the different agents on two variants of the game N = 16, M = 3 and N = 24, M = 5.\nThe agent using the fast weights architecture as its policy representation (shown in green) is able to\nlearn faster than the agents using ReLU RNN or LSTM to represent the policy. The improvement\nobtained by fast weights is also more signi\ufb01cant on the larger version of the game which requires\nmore memory.\n\n5 Conclusion\n\nThis paper contributes to machine learning by showing that the performance of RNNs on a variety\nof different tasks can be improved by introducing a mechanism that allows each new state of the\nhidden units to be attracted towards recent hidden states in proportion to their scalar products with\nthe current state. Layer normalization makes this kind of attention work much better. This is a form\nof attention to the recent past that is somewhat similar to the attention mechanism that has recently\nbeen used to dramatically improve the sequence-to-sequence RNNs used in machine translation.\nThe paper has interesting implications for computational neuroscience and cognitive science. The\nability of people to recursively apply the very same knowledge and processing apparatus to a whole\nsentence and to an embedded clause within that sentence or to a complex object and to a major part\nof that object has long been used to argue that neural networks are not a good model of higher-level\ncognitive abilities. 
By using fast weights to implement an associative memory for the recent past,\nwe have shown how the states of neurons could be freed up so that the knowledge in the connections\nof a neural network can be applied recursively. This overcomes the objection that these models can\nonly do recursion by storing copies of neural activity vectors, which is biologically implausible.\n\n[Figure 5 plots (b) and (c): average reward against steps for RNN, RNN+FW and LSTM.]\n\nReferences\n\nSepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997.\n\nGeoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. In Proceedings of the ninth annual conference of the Cognitive Science Society, pages 177\u2013186. Erlbaum, 1987.\n\nJ Schmidhuber. Reducing the ratio between learning complexity and number of time varying variables in fully recurrent nets. In ICANN93, pages 460\u2013463. Springer, 1993.\n\nMisha Tsodyks, Klaus Pawelzik, and Henry Markram. Neural networks with dynamic synapses. Neural computation, 10(4):821\u2013835, 1998.\n\nLF Abbott and Wade G Regehr. Synaptic computation. Nature, 431(7010):796\u2013803, 2004.\n\nOmri Barak and Misha Tsodyks. Persistent activity in neural networks with dynamic synapses. PLoS Comput Biol, 3(2):e35, 2007.\n\nRobert S Zucker and Wade G Regehr. Short-term synaptic plasticity. Annual review of physiology, 64(1):355\u2013405, 2002.\n\nHenry Markram, Joachim L\u00fcbke, Michael Frotscher, and Bert Sakmann. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297):213\u2013215, 1997.\n\nGuo-qiang Bi and Mu-ming Poo. Synaptic modifications in cultured hippocampal neurons: dependence on\nspike timing, synaptic strength, and postsynaptic cell type. 
The Journal of neuroscience, 18(24):10464\u201310472, 1998.\nDavid J Willshaw, O Peter Buneman, and Hugh Christopher Longuet-Higgins. Non-holographic associative memory. Nature, 1969.\nTeuvo Kohonen. Correlation matrix memories. Computers, IEEE Transactions on, 100(4):353\u2013359, 1972.\nJames A Anderson and Geoffrey E Hinton. Models of information processing in the brain. Parallel models of associative memory, pages 9\u201348, 1981.\nJohn J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554\u20132558, 1982.\nElizabeth Gardner. The space of interactions in neural network models. Journal of physics A: Mathematical and general, 21(1):257, 1988.\nAlex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.\nEdward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1819\u20131827, 2015.\nJason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.\nD. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.\nJ. Ba, R. Kiros, and G. Hinton. Layer normalization. arXiv:1607.06450, 2016.\nD. Kingma and J. L. Ba. Adam: a method for stochastic optimization. arXiv:1412.6980, 2014.\nIvo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long short-term memory. arXiv preprint arXiv:1602.03032, 2016.\nV. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Neural Information Processing Systems, 2014.\nJ. Ba, V. Mnih, and K. Kavukcuoglu. 
Multiple object recognition with visual attention. In International Conference on Learning Representations, 2015.\nKelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2015.\nRalph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-pie. Image and Vision Computing, 28(5):807\u2013813, 2010.\nA. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2014.\nVolodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.\n", "award": [], "sourceid": 2143, "authors": [{"given_name": "Jimmy", "family_name": "Ba", "institution": "University of Toronto"}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": "Google"}, {"given_name": "Volodymyr", "family_name": "Mnih", "institution": "Google DeepMind"}, {"given_name": "Joel", "family_name": "Leibo", "institution": "Google DeepMind"}, {"given_name": "Catalin", "family_name": "Ionescu", "institution": "Google"}]}