NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 5394
Title: FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network

Reviewer 1


This paper presents an adaptation of the peep-hole connection from [22] to create fast, accurate, and small RNNs. Two variants are introduced: (1) FastRNN, which adds two extra learnable parameters to the vanilla RNN to regulate the computation flow between the hidden state and the nonlinear mapping of inputs and states, and (2) FastGRNN, which replaces the two parameters with gating functions that share the input/state matrices with the nonlinear projection but have separate biases. Furthermore, FastGRNN uses a low-rank, sparse representation of these matrices, with constraints, which helps compress the model size. The theoretical analysis studies the convergence and stability of these models versus standard RNNs, while the extensive experimental setup shows that they achieve results comparable to state-of-the-art or comparable models (e.g. LSTMs) with much smaller network sizes. I really like the paper and the results reported, and I am sure this work will have a great impact in the field.

I can suggest a few improvements that could make the paper even better:
- While the performance of these models has been studied in comparison to competing RNN formulations, the effect of regularisation has not been analysed. For example, the results reported on PTB are far from the state of the art, and recent work on that task shows that heavy regularisation of LSTMs can yield massive improvements in performance ("Regularizing and Optimizing LSTM Language Models", Merity et al. 2017). How would these models behave under heavy regularisation if one wants to attain high performance?
- There is also other work in this area, such as "Quasi-Recurrent Neural Networks" (Bradbury et al. 2017), which likewise aims to provide faster alternatives to LSTMs. How does FastGRNN perform against these methods?
- While the balance between \alpha and \beta is discussed briefly in the paper, further experimental results on it would be welcome. Fig. 4 in the appendix is a good start, but what about \beta: how does it behave? Do we often see \beta = 1 - \alpha, as speculated in the paper? How does the ratio \alpha / \beta change with sequence length or across datasets? There are many interesting results here that would be worth reporting.
- The paper has considerable redundancy between the introduction (Sec. 1) and the related work (Sec. 2), which could be condensed. This would provide space to bring forward some of the interesting findings and results from the appendix, such as Fig. 4.
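For concreteness, here is a minimal NumPy sketch of the two cell updates as I understand them from the paper; the variable names (W, U, b, b_z, b_h, zeta, nu) are mine, and the low-rank sparse factorisation of the matrices is omitted.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fastrnn_step(x_t, h_prev, W, U, b, alpha, beta):
    """One FastRNN update: a candidate state tanh(W x + U h + b) is mixed
    with the previous hidden state via two trainable scalars alpha and beta."""
    h_tilde = np.tanh(W @ x_t + U @ h_prev + b)
    return alpha * h_tilde + beta * h_prev

def fastgrnn_step(x_t, h_prev, W, U, b_z, b_h, zeta, nu):
    """One FastGRNN update: the scalars are replaced by a gate z that reuses
    the same W and U as the candidate; only the biases differ. zeta and nu
    are additional trainable scalars, per my reading of the paper."""
    pre = W @ x_t + U @ h_prev            # pre-activation shared by gate and candidate
    z = sigmoid(pre + b_z)                # gate
    h_tilde = np.tanh(pre + b_h)          # candidate state
    return (zeta * (1.0 - z) + nu) * h_tilde + z * h_prev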

Reviewer 2


This work presents a variant of recurrent neural networks, including a gated version. The key idea is a so-called "peephole" connection which mixes the previous hidden unit with the network's hidden output. The proposed method is well supported by theoretical findings and strong empirical evidence. The resulting networks are very compact and can be used in embedded products. Several things need to be clarified:
1) It is unclear how large the learned beta is. Figure 4 in the appendix only gives the alpha results. How did you set beta? Was it close to 1 - alpha?
2) In the non-learning setting, alpha>0 and 0
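For illustration, a small sketch of the kind of report I have in mind for point 1), assuming (hypothetically) that the trained FastRNN cell exposes unconstrained scalars alpha_raw and beta_raw that are mapped to (0, 1) with a sigmoid:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def report_mixing_scalars(alpha_raw, beta_raw):
    """Print the learned mixing scalars of a trained FastRNN cell and check
    how far beta is from 1 - alpha; the raw parameters and the sigmoid
    mapping are assumptions made for this sketch."""
    alpha = sigmoid(alpha_raw)
    beta = sigmoid(beta_raw)
    print(f"alpha = {alpha:.4f}, beta = {beta:.4f}, 1 - alpha = {1 - alpha:.4f}")
    print(f"|beta - (1 - alpha)| = {abs(beta - (1 - alpha)):.4f}")

# Made-up values, for illustration only:
report_mixing_scalars(alpha_raw=-3.0, beta_raw=3.0)

Reporting these numbers for each dataset would make the discussion of beta versus 1 - alpha concrete.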