__ Summary and Contributions__: This paper proposes a gradient descent view of RNN dynamics, and shows how to incorporate momentum for accelerating gradient dynamics. The proposed method can address the vanishing gradient problem.
=============================
After Author Response:
=============================
After reading the author response and discussing with the other reviewers, I am keeping my score.
The main issue I have with this paper is that it makes speculative claims that are not adequately supported, but tries to pass them off as conclusive. Some examples include:
(*) Phrases such as
- "we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD)" (Abstract)
- "MomentumRNN is principled with theoretical guarantees provided by the momentum-accelerated dynamical system for optimization and sampling" (Contributions)
- and several more throughout the paper
are examples of misleading language in my opinion.
(*) I mentioned concerns with the related work and background in the original review, which were not addressed in the author response.
(*) The paper frequently mentions demonstrating that "based on our experiments, the acceleration from momentum is preserved". This was *not* demonstrated -- empirically faster convergence cannot be equated with acceleration in the optimization sense. It is better to use precise language to separate what you have shown (an architecture with faster convergence) from what you are hypothesizing (that it is related to momentum acceleration). As presented, I am not convinced of the latter connection.
I think it's fine to say that your method is *inspired* by momentum, but in my opinion the paper implies a much stronger connection that is not substantiated by the theoretical and empirical results.
I currently remain unconvinced that the proposed method's improvements are related to momentum at all. There are plenty of simpler explanations, as also offered by Reviewers 3 and 4, which should at least be discussed and ideally ablated.
Ultimately, I see this paper as posing an interesting possible connection, but one that is currently speculative and not ready for publication.
Aside from the overall writing, I have a few more detailed suggestions for improving the paper.
(*) Although the paper positions itself in the space of "addressing the vanishing gradient issue in RNNs", this area is notably absent from the related work. For example, there are numerous such papers which operate on similar principled foundations, are much simpler, and perform better than the complicated models such as AdamLSTM proposed here. See e.g. [1, 2, 3] on PMNIST. These models are not difficult to implement, and it would be more convincing to use a more SOTA model instead of, or in addition to, the dtriv model used here. Other reviewers have also provided other connections and references.
[1] Tallec, Ollivier. Can RNNs Warp Time?
[2] Gu et al. Improving the Gating Mechanism of RNNs
[3] Voelker et al. Legendre Memory Units
(*) Because I am not convinced that the benefits of the MomentumRNN cell come from momentum per se, the other cells like RMSPropLSTM/AdamLSTM seem very convoluted to me. The experimental results for them are not very convincing: the cell is so complicated that there are many uncontrolled architecture changes, which would benefit from an ablation study.
(*) The authors claim that they "theoretically prove that MomentumRNNs alleviate the vanishing gradient issue" (Abstract), but there is no formal statement, much less a proof. As said in my initial review, the theory section merely concludes that "a \mu exists" without justification. Since \mu is the important momentum hyperparameter -- the main addition that this work proposes -- there is an opportunity to flesh out the theory and connect it to experiments. The fact that 2 out of 3 of the datasets are best when \mu is set to 0 is suspicious, and deserves further explanation.
Regardless of whether this paper is accepted, I sincerely hope this feedback is useful to the authors and helps them refine the paper.

__ Strengths__: - The explanation of the method is clear
- The experimental results, including table and figures, are easy to understand
- The method shows decent performance on the benchmarks, especially on the TIMIT task

__ Weaknesses__: I have substantial concerns with the stated motivation and claims of the paper.
Overall, the motivation and pitch of the paper imply a deeper connection to optimization that seems misguided at best, and the analysis, experiments, and baselines could be improved.
- The authors' main justification for the method is the claim that it is principled, based on analogies to optimization methods. However, the analogies do not seem appropriate in this context in many ways, from the setting to the implementation details. As just one example, the additional linear and non-linear transformations (multiplication by U and a $\sigma$ nonlinearity, as in line 104) applied after the "momentum step" drastically alter the dynamics compared to momentum in optimization. The motivation, and even the name of the proposed method, feel misleading, supported only by a tenuous connection without deeper analysis.
- The theory of the method is quite weak. The theoretical (Section 2.3) and empirical (Figure 2) analysis is with respect to a vanilla RNN, which is a poor baseline and as the authors note there are countless other works that address vanishing gradients. The analysis concludes that "there is an appropriate choice of \mu that can alleviate vanishing gradients", but this seems unsubstantiated; given the (implicit) constraint that $\mu$ is between 0 and 1, and the fact that $\Sigma_k$ can vary per timestep, it is not obvious that an appropriate choice of $\mu$ exists.
- Additionally, the datasets used are quite toy, and lack any of the numerous baselines in the RNN family of methods besides the LSTM. For example, there are many similar RNN extensions that outperform this method on the MNIST/PMNIST benchmarks.
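To illustrate why the "an appropriate $\mu$ exists" claim needs more care: through the recurrence $v_t = \mu v_{t-1} + s W x_t$ alone, the Jacobian $\partial v_T / \partial v_0$ is $\mu^T I$, which still decays geometrically for any fixed $\mu \in (0, 1)$. A back-of-the-envelope check (mine, not from the paper):

```python
# Back-of-the-envelope check (mine, not from the paper): since v_t
# depends on v_{t-1} only through the term mu * v_{t-1}, the Jacobian
# dv_T/dv_0 along the momentum state is mu**T * I. For any fixed
# mu in (0, 1) this factor still vanishes geometrically in the
# sequence length T; whether the remaining terms (involving Sigma_k)
# compensate is exactly the gap in the paper's argument.
def momentum_gradient_factor(mu, T):
    return mu ** T

for mu in (0.5, 0.9, 0.99):
    print(mu, momentum_gradient_factor(mu, T=200))
```

Decaying more slowly than a contractive tanh recurrence is plausible, but that alone does not establish that vanishing gradients are alleviated for all relevant horizons.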

__ Correctness__: The methods and experimental protocol seem correct. However, the claimed connection to optimization methods is not supported.

__ Clarity__: The paper is overall well-written and easy to understand

__ Relation to Prior Work__: The paper brings up many lines of related work. However, some of them seem less relevant (e.g. Langevin/Hamiltonian Monte Carlo and other optimization theory), while many of the more directly relevant works on improving long-term dependencies in RNNs are not discussed.

__ Reproducibility__: Yes

__ Additional Feedback__: It seems clear to me that the improvements are not from a magical connection to momentum in optimization, but simply from the linear combination of updates $v_t = \mu v_{t-1} + s W x_t$, which adds additive ("residual") connections to the state $v_t$ that allow gradients to backpropagate more easily. This is exactly analogous to the motivation of the gated update $c_t = f c_{t-1} + i h_t$ of the cell state of LSTMs, where $\mu, s$ take the role of the forget and input gates $f, i$. In fact, the main momentum cell (equation (8) or (10)) looks very similar to an LSTM with slight rewiring between the cell/hidden state, where $v_t$ plays the role of the cell state; the improvements of MomentumLSTM can be explained by its usage of multiple layers of linear "cell" states. These sorts of gating techniques and variants (e.g. a plethora of linear dynamics incorporated into RNNs) and their benefits for vanishing gradients have been well studied, but the paper lacks a discussion and basic comparisons against such closely related and well-known methods. I believe such connections are obfuscated by the analogies to optimization. Simply incorporating a standard linear recurrence $v_t = \mu v_{t-1} + s W x_t$ does not beget a connection to momentum or other optimization principles.
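To make the structural analogy concrete, here is a minimal scalar sketch (toy values of mine, not the authors' code) showing that with constant gates, the LSTM cell-state update and the momentum update are the same additive recurrence:

```python
# Scalar sketch with hypothetical toy values (not the paper's code):
# with constant gates, the momentum update and the LSTM cell-state
# update are the same additive map.
mu, s = 0.9, 0.1  # fixed momentum / step-size hyperparameters

def momentum_step(v_prev, wx):
    # v_t = mu * v_{t-1} + s * (W x_t); wx stands in for W x_t
    return mu * v_prev + s * wx

f, i = 0.9, 0.1  # constant stand-ins for the forget / input gates

def gated_step(c_prev, cand):
    # c_t = f * c_{t-1} + i * h_t (LSTM cell-state update)
    return f * c_prev + i * cand

# When f = mu and i = s, the two recurrences coincide, i.e. mu and s
# play exactly the role of fixed forget / input gates.
v = c = 0.0
for x in [1.0, -0.5, 2.0]:
    v = momentum_step(v, x)
    c = gated_step(c, x)
assert abs(v - c) < 1e-12
```

The only real difference is that the LSTM's gates are data-dependent while $\mu, s$ are fixed hyperparameters, which is why I view the momentum cell as a restricted gated update rather than as an optimization method.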

__ Summary and Contributions__: The paper presents a novel framework termed MomentumRNN, which incorporates momentum into the connection between hidden-state dynamics and gradient descent. The new model is presented in various forms and benchmarked on 3 different datasets, where it shows faster training, more robustness, and higher performance.

__ Strengths__: - The paper does a thorough analysis on how to integrate various forms of momentum into different RNN types, and provides a simple yet powerful framework to do this.
- The method and the mathematical basis to motivate these adaptations are provided.
- The results show an increase in learning speed, more robustness against vanishing/exploding gradients, and improved final test accuracy.
- Code examples allow the fast implementation of the framework into existing models and benchmarks.
- The widespread use of RNNs for all kinds of tasks makes an optimization like MomentumRNN quite relevant.

__ Weaknesses__: The new method requires additional parameters and fine-tuning of hyperparameters. Why this framework provides the aforementioned improvements is not clear.
(Both of these points are mentioned in the conclusions by the authors as well)

__ Correctness__: The claims seem correct, and the empirical methodology for conducting and describing the experiments is sound. An ablation study was also done for all 3 benchmarks.

__ Clarity__: The clarity of the paper and its structure are good.

__ Relation to Prior Work__: Previous contributions, including state-of-the-art approaches, are correctly mentioned and described, and the differences from this new framework are highlighted.

__ Reproducibility__: Yes

__ Additional Feedback__: An interesting metric would be the computation (e.g. a rough FLOPs approximation) needed by this new method (which requires additional steps & parameters) to reach a given accuracy, compared to a model without momentum that has to train for longer.
Update: I changed the final score.

__ Summary and Contributions__: This paper makes an interesting connection between gradient updates in gradient descent and hidden state integration in RNNs. By making this connection, authors extend the idea of momentum in optimization to improve gradient flow in RNNs. This idea of momentum can be integrated into any RNN architecture and the authors show performance improvements in LSTMs and DTRIV by integrating momentum.

__ Strengths__: 1. The paper makes a very novel connection between two ideas which is very thought-provoking!
2. The experiments show that the proposed MomentumRNN improves performance.

__ Weaknesses__: 1. The authors could do a better job in motivating the model than simply saying this is a more principled way. I will give other motivations in the additional comments section.

__ Correctness__: I do not see any incorrect claims. The experiments could be more controlled. Please refer to additional comments for suggestions.

__ Clarity__: The paper is well written and I enjoyed reading this paper.

__ Relation to Prior Work__: I appreciate the extensive coverage of 58 references. However, it would be nice if the authors also covered the line of work that discusses the issues with saturation due to the gating mechanism in LSTMs (for example, JANET (van der Westhuizen and Lasenby 2018), NRU (Chandar et al 2019)).

__ Reproducibility__: Yes

__ Additional Feedback__: Important comments:
1. One way to motivate this work is as follows: In an LSTM, the cell state is trying to do additive integration of the input so that gradients do not vanish. However, the sigmoid and tanh gates used in the LSTM make gradients vanish. MomentumRNN is trying to do additive integration of the input without any additional gates and hence has a much better gradient flow. Using MomentumRNN, RMSPropRNN, or AdamRNN is just equivalent to trying different integration schemes, and their performance might vary depending on the nature of the data.
2. While momentum is similar to what the LSTM is trying to do with cell-state integration, I am not sure why RMSPropLSTM should work: RMSProp is not a momentum method but an adaptive gradient method that automatically adapts the learning rate. Can you explain your intuition for why RMSPropLSTM should work?
3. Is there a reason you used different optimizers for different tasks? I am not sure comparing your method to LSTM+SGD is fair, since your model has some momentum integrated into it while LSTM+SGD does not. Using Adam for all your models and all your tasks might be a more controlled setup. I am inclined to accept the paper, so can you please give me these results in the rebuttal so that I can make up my mind?
Other comments:
1. Please change the first 2 lines of the abstract. This work is not about overcoming an expensive search; you can simply say you found this novel connection.
2. Figure 2 is not meaningful without seeing the corresponding loss plots. The gradients could go to zero even when the model has solved the problem.
3. Line 123: I think MomentumLSTM should be explained in the main text since it gives an idea of how to integrate momentum to a non-vanilla RNN.
4. Equation 12 - shouldn’t the first + be - ?
5. Why did you use only the MomentumRNN in language modelling task? Is it because other models did not do well? I would like to see the performance of other models for reference.
6. Figure 6 is troubling. Based on the ablation, it looks like momentum = 0 is often the best choice. Please elaborate more on Figure 6. You need to give me a strong justification for Figure 6 to convince me to accept the paper.
7. Line 456 - why is forget gate initialized to -4? I have never seen this in any other work. Do you have a reference for others doing this initialization?
Minor comments:
1. Line 45 - fix grammar “We then proposed to ..”
2. Line 115 - mention that mu and s are the hyperparameters.
3. Line 135 - fix “dominates”
After Rebuttal: I am happy with the answers.

__ Summary and Contributions__: This paper introduces a new cell design for recurrent neural nets, which is inspired by momentum in stochastic gradient descent-based optimization. The proposed method, called MomentumRNN, is able to alleviate the vanishing gradient problem and can be universally applied to a wide range of RNN structures. MomentumRNN is evaluated through several sequence modeling tasks.

__ Strengths__: - The topic is of interest to the majority of the machine learning community. Since the vanishing gradient problem is still a major issue in applying RNNs, it is good to see another work dedicated to exploring a solution.
- The perspective of the proposed method is interesting. Connecting temporal dynamics with momentum in SGD is a novel and reasonable idea, and the paper gives a good comparison of the similarities between hidden states and SGD steps.

__ Weaknesses__: - Although the paper puts an emphasis on the relation between SGD and temporal hidden states, I find the integrated momentum quite similar in form to the forget/input gating mechanism, which helps update hidden states by controlling the forgetting/memorizing of information in a conventional LSTM. The only difference is that forget/input gating is totally data-driven, while the proposed momentum is hand-crafted. So in my opinion, MomentumRNN introduces additional hyperparameters and could be very sensitive to parameter selection, leading to a painful tuning process and limited scalability. This may be partially verified by Figure 6, which I consider a disadvantage relative to the conventional LSTM.
- Important experiments are missing. For example, since the paper claims MomentumRNN can alleviate the vanishing gradient problem, it should be tested on related tasks such as the copy/adding problems, which are commonly used in the majority of RNN evaluations. Besides, only conventional baselines and several of its own variants are evaluated on the reported tasks, while other closely related baselines such as [39] [58] are missing. These previous works should also be included for comparison.

__ Correctness__: The proposed method is technically sound, and the empirical settings look valid.

__ Clarity__: This paper is clearly written and easy to follow.

__ Relation to Prior Work__: Several previous works are missing and need to be compared with the proposed method. Please refer to the weaknesses section.

__ Reproducibility__: Yes

__ Additional Feedback__: --------------- After rebuttal ----------------
After reading the other reviews and the authors' feedback, I share Reviewer 1's concerns and believe they should not be overlooked. Since the authors' feedback did not convince me, I will keep my original score unchanged.