Review for NeurIPS paper: HiPPO: Recurrent Memory with Optimal Polynomial Projections

NeurIPS 2020

HiPPO: Recurrent Memory with Optimal Polynomial Projections

Review 1

Summary and Contributions: POST REBUTTAL The authors' rebuttal is acceptable and my score remains as it is. +++++++ The authors describe a novel framework, building on established methods, which tackles the problem of online function approximation in the presence of sequentially introduced data. This problem is framed as a memory problem to learn sequential dependencies in data to perform a given task. The main contribution of this work is to leverage tools from signal processing theory whereby function approximation using orthogonal polynomial bases are formulated with respect to time dependent measures on the input domain. The authors present an online algorithm based on ordinary differential equations that govern time-dependent projection coefficients which minimize a measure-specific L2 norm. A complete theoretical description of the algorithm and its approximation guarantees is presented, and a number of numerical experiments comparing the method to popular RNN architectures is conducted.

Strengths: The paper presents a strong and clear theory that unifies past work using orthogonal function bases to approximate sequential dependencies in machine learning. The advantages of this method are well explained and supported by theorems and experiments. While this work build on a large body of work from signal processing and approximation theory, as well as recent work from the ML community (namely Legendre Memory Units), it provides important generalizations and novel instances in the form of scaled Legendre approximations which show interesting results. I believe this paper presents significant novel results and is relevant to the NeurIPS community.

Weaknesses: An important point of comparison that is missing in the paper is a clear discussion about the role of cost functions in the comparison of RNN baselines with the HiPPO methods. If I understand correctly, HiPPO computes optimal polynomial coefficients with respect to a measure-specific L2 norm. However, it is not clear if the cost function used by the baseline comparison models was the same L2 norm, or something different. Moreover, it is not clear how this method could be adapted to cases where loss cannot be expressed with an L2 norm. A discussion about this important point is lacking. For example, in the sequential MNIST case, cross entropy loss is often used but it is not clear what is implemented here by HiPPO nor the other RNN baselines.

Correctness: All claims seem correct to me. See additional feedback comment for minor suggestions and questions about some derivations.

Clarity: The paper is very well written.

Relation to Prior Work: Prior work is clearly described, and relation between distinct lines of related work is well articulated.

Reproducibility: No

Additional Feedback: About reproducibility: have some concerns about experiments. Perhaps I missed it but cost functions used to train RNN baselines are not specified. This relates to the major weakness I described above. This is easily fixed and I am expecting to change my answer about reproducibility upon seeing this information. Here are other minor suggestions that would improve the paper: - A discussion about cost functions for RNNs in relation to HiPPO. - For section starting at line 89, a clarification that the function space should be of integrable functions would add formalism. - (Important) Line 132: it is stated that the coefficients c(t) follow a particular form of ODE. This statement appears in a definition, but it is highly non trivial and depends on involved derivations found in the appendix. To improve reading, this statement should not be part of a definition, but rather should be presented as a consequence of the formulation. A clear reference to the appendix where the derivation takes place should be added. - Details about network size and polynomial order N in experiments should be brought to the main text. These are important comparison points and are now buried in the appendix. - Appendix B.2: I believe there is a mistake in the statement of the general Leibniz Integral Rule. Multipliers d\alpha/dt and d\beta/dt should be added to the f(\alpha(t),t) and f(\beta(t),t) terms respectively. - In appendix F, line 1095. In "HiPPO variants tie the memory size N to the hidden state dimension d.", what does "tie" mean. Does it mean equal ? This is an important point of comparison and should be made clear. When the N for RNNs is specified in experiments, should we assume that the order of polynomials is the same ? - line 1137: Is the N used to designate the length of input sequences the same as the polynomial order ? If yes, this is problematic. If no, then another variable name should be used.

Review 2

Summary and Contributions: This paper develops a general framework called HIPPO for sequential data learning. Its core idea is that projecting continuous signals or discrete time series onto orthogonal polynomial bases in an online approximation way. One of its variants HiPPO-LegS present the advantages of the scaled measure and it can efficiently handle dependencies across millions of time steps.

Strengths: This framework provides an unified perspective on memory mechanisms and it can be incorporated into existing recurrent models such that all history can be remembered.

Weaknesses: Since recurrent neural networks have been studied as universal approximators for modeling dynamic systems, can you discuss more about the relationship between the HiPPO online approximation framework and recurrent networks? Is it possible directly using HiPPO to fit timeseries in the input space? It is not clear which complex dependencies in the hidden space of RNN need to be approximated by polynomial approximation. Can these fitted dependencies be visualized? It would be interesting to discuss more about the memory dynamics learned/approximated by HiPPO. Since the HiPPO can easily fit long-range function, how to adapt to short-term time series forecasting tasks? Please discuss what types of time series tasks the HiPPO can do well and which ones it can’t work well. How to find the number of polynomial bases?

Correctness: This work is studied based on the approximation theory. Details for each part are provided clearly in appendix.

Clarity: Yes, this work is well written.

Relation to Prior Work: Yes, related work is well covered.

Reproducibility: Yes

Additional Feedback: I thank the authors for their response, and am maintaining my score.

Review 3

Summary and Contributions: The paper presents a general framework (HiPPO) for the online compression of continuous signals and discrete time series by projection onto polynomial bases. The authors introduced the proposed HiPPO memory in RNN based models and evaluate the effectiveness of their models on the MNIST classification task and on a character classification task.

Strengths: The paper is well written and the idea is novel. The appendices give enough information to let the reader to follow the paper in good conditions. The paper also give different examples for a non expert reader.

Weaknesses: The paper needs more experiments to better support the proposed model. Indeed, the authors list different AI related area potentially interested by such as novel memory management in RNN based models such as speech processing, NLP, etc.

Correctness: The paper seems correct. I have not checked all equations and indices.

Clarity: Yes

Relation to Prior Work: Yes, but some comparaison to other RNN based cells that manage memory in an efficient way will be interesting such as MemRNN, Minimal RNN, PMU, etc.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: Update after author feedback: The authors addressed my concerns and I'll raise my score to 8. _________________________________________________________________ This paper proposes a framework for compressing and representing the cumulative history of a continuous or discrete signal. The representation is obtained as coefficients by formulating the problem as optimal function approximation under certain base and measure. The dynamics of the coefficients are modeled as an ODE which can be efficiently computed with fixed matrices.

Strengths: The proposed representation can be efficiently computed and combined into conventional RNN framework. With proper measure, the representation is able to capture and recover long history due to its nature of function approximation. The modeled dynamic is able to handle certain distribution shift like unmatched sampling rate of training and testing data. It also provides a general view that unify some previous recurrent structures like vanilla LSTM and GRU as well as LMU.

Weaknesses: The performance may rely on the measure and task. The tasks in the experiments seem to favor a measure that faithfully and uniformly memorize the history like LegS, while it remains to be seen for tasks where certain local segments of history plays a more important role.

Correctness: For the speed comparison with LMU, as the dynamic is mostly similar for the representation, why is there a large gap between the proposed method and LMU?

Clarity: The paper is well written.

Relation to Prior Work: It might be better if the authors could discuss more about the speed difference between the proposed method and LMU.

Reproducibility: Yes

Additional Feedback: