NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 2143
Title: Using Fast Weights to Attend to the Recent Past

Reviewer 1

Summary

This paper presents a new recurrent neural network architecture which contains ‘fast weights’: parameters that change on a timescale intermediate between the fast fluctuations of hidden activity patterns and the slow changes of weights due to gradient descent. The fast weights are updated automatically with a Hebbian learning rule, thereby embedding an associative memory directly into the dynamics of the recurrent network. Experiments show that this scheme learns a variety of (mostly visual) tasks more quickly than LSTMs or LSTMs augmented with external memories, and often achieves lower error as well.
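For concreteness, the fast-weight update is a decaying Hebbian (outer-product) rule; as I understand the paper, with decay rate \lambda and fast learning rate \eta it reads

    A(t) = \lambda \, A(t-1) + \eta \, h(t) \, h(t)^\top

so that A(t) is a decaying sum of outer products of the recent hidden state vectors.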

Qualitative Assessment

Major comments: This paper contains a nice idea, namely a weight matrix which is architecturally constrained to use a particular learning rule and to update itself at various points during processing. This general scheme seems likely to lead to many variants in the future. The performance on the tasks considered is solid and makes the technique worthy of further consideration.

This paper makes a solid contribution to machine learning, but the results do not support the claim in the conclusion that “the main contribution is to computational neuroscience and cognitive science.” The paper makes no contact with experimental data, whether neural or psychological. It engages with only a small subset of the relevant computational neuroscience and cognitive science literature (see, e.g., Buonomano, D.V. and Maass, W. State-dependent Computations: Spatiotemporal Processing in Cortical Networks. Nat. Rev. Neurosci. 10:113-125, 2009; or the extensive literature on neural nets and recursive processing in human linguistic abilities, e.g., Prince, A. and Smolensky, P. (1997). Optimality: From Neural Networks to Universal Grammar. Science, 275, 1604–1610). It is plausible that this paper could contribute to understanding how (quasi)recursive computations might be implemented in the brain, and this is an extremely worthy goal, but much more work is required to substantiate the claim. If the claim is that fast weights enable recursive, compositional processing, then a task where this is a key component would be the most convincing grounds for a demonstration. At a minimum, an application to the domain of natural language processing would improve the paper greatly; but to address cognitive concerns, it must demonstrate more than merely good performance. It must, for instance, reproduce the patterns of errors made by human subjects on complex ‘garden path’ sentences. As it stands, the paper is squarely in the engineering tradition, with ‘success’ defined by improved performance on benchmark tasks. Also, from a cognitive perspective, we do not obtain multi-scale views of objects and their parts: we cannot ‘zoom in’ with our eyes, only with our attention.

I found it difficult to follow exactly what each model for each experiment was. The figures depicting the different architecture variants are subtly inconsistent (in Fig. 1, ‘sustained’ transitions are represented by red arrows; in Fig. 2, ‘integration’ transitions are represented by red arrows), and the ‘integration transition’ is never clearly defined.

Minor comments: Table 3: is this classification accuracy? Few details of the RL agent and the MNIST experiments are given. The paper mentions an ‘appendix’, but none was available in the submission material; it could greatly aid the reproducibility of the results (though I think more details could also be placed in the main text). The ‘asynchronous advantage actor-critic’ method should be cited.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 2

Summary

This paper proposes to use an additional recurrent weight matrix with fast synaptic dynamics as a non-selective short-term memory. It specifies a simple update rule for the fast weights with exponential decay. Experiments on a key-value retrieval task, sequential MNIST and MultiPIE, as well as an RL task, highlight the efficiency of the proposed memory in sequence tasks that do not require selective forms of memory.

Qualitative Assessment

UPDATE: Thanks to the authors for addressing my concerns in their rebuttal. With the suggested edits I am very happy to promote this paper for acceptance.
---
The paper is very well written, the method is well described, and the experiments are fairly exhaustive. Still, the current version of the manuscript leaves me with the following questions:
1) This type of short-term memory is argued to be biologically plausible. However, implementing the inner loop, eq. (2), would not be easy in a biological setting, since Wh(t) would need to be cached during the update of the new hidden state, and the new state might take many iterations to converge (a sketch of my reading of this loop is given at the end of this review). Can the authors speculate on the biological mechanisms that might underlie this mechanism in real neural networks? Also, how sensitive is the mechanism to the number of inner-loop iterations?
2) Line 136 refers to the Appendix, but I could not find one.
3) In the simple key-value task, how does the performance of the network evolve with the length of the sequence?
4) Since the memory is non-selective (in what it stores), I would expect reduced performance in tasks where the network needs to learn to store only certain inputs and not others. This scenario is not really tested here (the RL task looks similar, but selectivity is trivially induced by the task structure). Is there a simple scenario in which LSTMs are clearly better than the fast-weight short-term memory? Such scenarios could be interesting in highlighting the costs and benefits of the new method.
5) Did the authors compare with the key-value memory by Facebook? It looks like the scenarios tested here should be covered much better by that type of memory than by LSTMs.
6) Lines 114/115: there is a confusion (at least for the reader) between mini-batches and sequences.
7) Is there code online for this network, or are the authors planning to make the code available?
8) Lines 209-211: when you state that “the results can be integrated with the partial results at the higher level by popping the previously stored result from the cache”, what do you refer to as “popping”? Is that implemented manually, or do you expect the network to implement “popping” itself? If the latter, can you elaborate on how the network can learn it? This is non-obvious, since the memory simply decays; to actively pop, one would need to subtract the right hidden state, which would eliminate not only the memory but also any additional information in the hidden unit activity.
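To make question 1 concrete, here is a minimal numpy sketch of how I read the inner loop of eq. (2). The function names, the initialization of the inner state, and the layer-norm details are my own guesses; S is the number of inner iterations whose sensitivity I am asking about.

    import numpy as np

    def layer_norm(z, eps=1e-5):
        # Plain layer normalization (no learned gain or bias shown here).
        return (z - z.mean()) / (z.std() + eps)

    def next_hidden(h_prev, x, W, C, A, S=1, f=np.tanh):
        # The 'cached' preliminary drive W h(t) + C x(t) is computed once...
        preliminary = W @ h_prev + C @ x
        # ...while the fast-weight term A(t) h_s(t+1) is re-applied S times.
        h_s = f(layer_norm(preliminary))            # my guess for h_0(t+1)
        for _ in range(S):
            h_s = f(layer_norm(preliminary + A @ h_s))
        return h_s

It is exactly the need to hold the preliminary term fixed while h_s settles that the biological-plausibility concern in question 1 refers to.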

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

The authors introduce an intermediate timescale of weight changes into recurrent neural networks and demonstrate that it improves performance on several tasks. The new addition effectively adds an input corresponding to the correlation of the current network state with past states.

Qualitative Assessment

This is a nice and timely piece of work. The ideas are well presented and motivated, and the simulations are appropriate.
Novelty – intermediate timescales were introduced in the context of reservoir computing, but not in more standard machine learning tasks as done here. Furthermore, the specific form of the addition is novel.
The paper claims that the main contribution is to computational neuroscience. From this perspective, there are a few puzzles. How plausible is the proposed normalization procedure? Why are the matrices W and A in equation 2 applied to different activity vectors (what is the biophysical equivalent of the inner loop)? See the equation reproduced at the end of this review.
Specific comments:
Size of mini-batches – this seems to limit memory, but the issue is not discussed.
Table 3 – should be percent correct, not error.
Line 277 – reference missing.
It seems that the advantage of fast weights is mostly for small networks – this could be an indication of some larger problem, but it is hard to judge from the available results.
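For reference, the equation in question (eq. (2) as I read it, with f the nonlinearity and LN the layer normalization whose plausibility is being questioned):

    h_{s+1}(t+1) = f\left( \mathrm{LN}\left[ W h(t) + C x(t) + A(t) \, h_s(t+1) \right] \right)

The slow weights W act on the previous outer-loop state h(t), while the fast weights A(t) act on the evolving inner-loop state h_s(t+1); it is this asymmetry that makes the biophysical interpretation of the inner loop unclear to me.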

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 4

Summary

This paper presents a method for using a rapidly changing secondary matrix of temporary weight values, referred to as "fast weights", in recurrent neural networks to maintain a trace of past inputs and network states. The experiments presented show that incorporating these fast weights into recurrent neural networks improves performance. RNNs with fast weights are shown to outperform other RNN variants with equivalent numbers of recurrent units on associative retrieval of characters from strings and on image recognition tasks (MNIST and facial expression recognition).

Qualitative Assessment

The presented method for incorporating fast weights into recurrent neural networks is elegant and produces noticeable improvements in performance in the experiments presented. The paper is very well written, for the most part. A small amount of related prior work exists on fast weights; however, it seems that research on this topic never caught on. This paper will hopefully change that; fast weights should be of interest to the machine learning community. Additionally, the developments and results presented in this paper have a biological inspiration and may inspire work in computational neuroscience and cognitive science.
Some feedback, questions and suggestions arose while reading:
The performance improvements shown in the experimental results indicate that work on fast weights has significant promise. It would have been interesting to see some analysis or an example of the low-level outcomes of fast weight operation, such as a resulting fast weight matrix (in preference to the description of a second image recognition task, perhaps).
An appendix describing implementation details was mentioned in the paper but was not submitted with the paper for review.
The contents of the Conclusion section might be better described as, and converted into, a Discussion section (an additional Conclusion section might not be necessary). Suggestions and comments on the Conclusion(/Discussion) follow:
1) The authors may disagree and disregard this suggestion; however, rather than outright state that the paper contributes to a field of study, my preference is to refer to the specific developments as "contributions" and then describe how these contributions are significant or relevant to the fields of study. For example, in the case of the paper under review, the contributions are a method for adapting fast weights in recurrent neural networks and the experimental results demonstrating improved RNN performance. The demonstration of improved RNN performance is evidence of the significance of the contributions to machine learning. These contributions are also relevant to computational neuroscience and cognitive science as a model and evidence that fast mechanisms of synaptic plasticity may contribute to sequential processing of stimuli and working memory.
2) Line 288 has the statement "Layer normalization makes this kind of attention work better", which is mentioned earlier but not actually shown in the paper, possibly making the statement unsuitable as a conclusion. A statement reminding the reader of the use of layer normalization might be more suitable.
3) References are made to "sequence-to-sequence RNNs used in machine translation" (line 290) and "[t]he ability of people to recursively apply the very same knowledge and processing apparatus to a whole sentence and to an embedded clause within that sentence has long been used as an argument against neural networks as a model of higher-level cognitive abilities" (lines 294-297). Ideally, both of these references would be accompanied by a citation.
A few suggestions for presentation and typographical corrections follow:
1) The black-and-white printability of the paper could be improved by choosing thicker line weights for Figures 1, 3, and 5, and by using different line styles or arrow heads to distinguish signals. Figure 5 could benefit from a larger font size in its plots.
2) Line 203: remove "their". Page 8 has a number of instances of an opening quotation mark rendered as a closing quotation mark. Line 287: "unis" was probably meant to be "units". References [8] and [25] have initialisms that should be capitalised.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 5

Summary

Ordinary recurrent neural networks have two types of variables: synapses, which are updated at the end of the input sequence and capture regularities among inputs, and neural activities, which represent short-term memory. In this paper a second type of synapse is considered, called "fast weights". These change more slowly than neural activities but faster than the usual synapses (called "slow weights"). The idea is motivated by physiological evidence that synapses in the brain are modulated at different time scales. The fast weights are used in an associative network to store memories; their update rule is a Hebbian-like learning rule that remembers the recent hidden activities. This associative network acts like attention to the recent past, in a way similar to the attention mechanisms that have recently been used to improve sequence-to-sequence RNNs. The difference here is that the strength of the attention to a given past hidden activity is not specified by a new set of parameters, but by the scalar product of that past hidden activity and the current hidden activity. The effectiveness of the proposed algorithm is shown on a variety of tasks.
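In symbols (my reading of the paper, assuming the fast weights start at zero), the fast-weight contribution unrolls into exactly this scalar-product attention over the stored hidden activities:

    A(t) \, h_s(t+1) = \eta \sum_{\tau=1}^{t} \lambda^{t-\tau} \, h(\tau) \left[ h(\tau)^\top h_s(t+1) \right]

Each past hidden vector h(\tau) is retrieved with a weight given by its inner product with the current inner-loop state, discounted by the decay factor \lambda^{t-\tau}.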

Qualitative Assessment

Arguments from math (memory capacity of different models) and neuroscience (different time scales of synaptic plasticity) are given to support the idea of introducing fast weights. The proposed algorithm establishes links with previous work on memory mechanisms (e.g. NTM and Memory Networks) and attention mechanisms, while being clearly much more biologically plausible than all previous models. This looks like an important step towards bridging the gap with computational neuroscience and cognition.
From a computational point of view, there is a trick to avoid computing the full fast weight matrix A. As explained in lines 100-104, it is sufficient to store the hidden activities, which makes the algorithm computationally much more efficient. Moreover, thanks to this trick, the algorithm also applies to mini-batches (lines 111-115). A small sketch of what I mean is given at the end of this review.
One downside: as far as I could see, very few details are given regarding how the slow weights W and C are trained (it is only mentioned that the Adam optimizer is used). I think this deserves a bit more explanation. It would also be beneficial to add a detailed figure of the computational graph (more detailed than Figure 1) to show how exactly automatic differentiation is done. The fact that there is an "inner loop" (for the computation of h_s(t)) makes the graph less standard and harder for the reader to visualize, I feel. One question in particular: do we backpropagate the gradients through A(t) (which is a function of the h(tau), and thus a function of W and C), or are these variables considered constant when computing the gradients in the computational graph?
Other minor remark: typo on line 287: "units".
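To illustrate the trick referred to above (lines 100-104), here is a small numpy sketch of my understanding; the function name and the default values of lam and eta are mine, not the paper's. The fast-weight contribution is computed directly from the stored hidden activities of the current sequence, so A(t) is never formed explicitly, and the same computation batches naturally over sequences.

    import numpy as np

    def fast_weight_term(H_past, h_s, lam=0.95, eta=0.5):
        # H_past: (t, d) array holding h(1), ..., h(t) for one sequence.
        # h_s:    current inner-loop state, shape (d,).
        # Returns A(t) @ h_s, where A(t) = eta * sum_tau lam**(t-tau) h(tau) h(tau)^T,
        # without ever materializing the d x d matrix A(t).
        t = H_past.shape[0]
        decay = lam ** np.arange(t - 1, -1, -1)   # lam**(t - tau) for tau = 1..t
        scores = H_past @ h_s                     # scalar products h(tau)^T h_s
        return eta * (decay * scores) @ H_past    # decayed, weighted sum of past h(tau)

Storing the hidden activities costs O(t d) per sequence rather than O(d^2) for A(t) (a saving whenever the sequence is shorter than the hidden dimension), which, if I understand correctly, is also what makes the mini-batch implementation discussed in lines 111-115 practical.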

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)