NeurIPS 2020

Ode to an ODE

Review 1

Summary and Contributions: Post Rebuttal: The authors have addressed my concerns, and in light of this I am increasing my score. Here are my brief responses. (1) ANODEV2: The choice of calling the method of [56] (which is closely related to the present paper) HyperNets rather than ANODEV2 was curious, not to mention confusing. I had previously assumed that by HyperNets the authors meant the method of [26], even though they cite [56] in one place and [26] in another. I have taken the authors' claim at face value. (2) I retain my reservations about the theoretical result. The paper provides methods to constrain the parameter matrix of (coupled) neural ODEs to be orthogonal. It shows theoretically that doing so alleviates the exploding-vanishing gradient problem often associated with training deep models. It also shows experimental results for reinforcement and supervised learning.

Strengths: The paper seems to fit the template "apply a pre-existing idea A, used in setting B, to a new variant setting C." In this case, A = orthogonality constraints on the parameters, B = deep/recurrent neural network training, and C = Neural ODEs (more precisely, coupled neural ODEs as in ANODEV2). However, the application of A to C is not trivial: it requires transporting methods from other areas of mathematics (differential equations, optimization on manifolds, flows on manifolds), and some expertise seems necessary to carry this out. Lemma 4.1, which proves that gradients do not vanish or explode, also seems interesting. The method seems to do well in experiments.

Weaknesses: While the orthogonality constraint helps with the vanishing-exploding gradient issue, as shown in Lemma 4.1, it is not clear to me whether it introduces new issues. For example, by reducing the flexibility in how W can change, training might become slow or fail to converge to a point with near-minimum loss. Theorem 1 proves convergence to a stationary point, which of course need not have small loss. It is worth noting that Lemma 4.1 is not used (as far as I can tell) in any other theoretical result. Thus, at least theoretically, the results in the paper cannot be regarded as providing support for orthogonality constraints. I would like to know what the authors think about these issues. (The remark on line 34 seems to be related: "Fortunately, there exist several efficient parameterizations of the subgroups of the orthogonal group O(d) that, even though in principle reduce representational capacity, in practice produce high-quality models and bypass Riemannian optimization [36, 40, 34].") Experiments: The paper should justify its choice of baselines. The experiments could have included results with ANODEV2 (code is available online), which is closely related to the present paper, and possibly other recent neural ODE work (e.g. augmented NODE). Apart from BaseODE, the other NODE benchmark chosen in the paper is NANODE. I had a quick look at that paper, and it does not seem to compare with ANODEV2 either. In supervised learning, the neural net baselines should include ResNets (which are used as baselines in the reinforcement learning experiments), not just fully-connected feedforward networks. I am not qualified to judge the importance, within the broader landscape of reinforcement learning methods, of the improvements to the Evolution Strategies method shown in this paper.

Correctness: While I didn't verify the proofs in detail, I did not find any inaccuracies.

Clarity: The paper is reasonably well-written.

Relation to Prior Work: There is a good discussion of prior work. One thing I would like to know that is not discussed: is there a counterpart of Lemma 4.1 for neural network training in the previous literature?

Reproducibility: No

Additional Feedback: "Such Neural ODE constructions enable deeper models than would not otherwise be possible with a fixed computation budget; however, it has been noted that training instabilities and the problem of vanishing/exploding gradients can arise during the learning of very deep systems [43, 4, 23]." Unless I misunderstood, this whole sentence should be rewritten. In particular, part of it should probably read "…than would otherwise be possible…" or "…that would not otherwise be possible…". The second part of the sentence seems to support the first, but "however" suggests otherwise.

Review 2

Summary and Contributions: This paper presents ODEtoODE, a new version of Neural ODE in which not only the state vector x but also the parameter matrix W evolves according to its own ODE (Eq. 2). A key idea is to restrict the parameters to a compact matrix manifold \Sigma. As a result, by Lemma 4.1, the proposed method avoids the vanishing/exploding gradient problem when \Sigma is the orthogonal group O(d). (It would be more convincing if this were visualized empirically in the Experiments section.) The trainable parameters are not W itself but the \psi in a map b_\psi that takes a point W and returns a tangent vector to \Sigma at W. The authors propose two implementations of b_\psi: ISO-ODEtoODE and Gated-ODEtoODE. In Theorem 1, the authors claim that the training procedure of the proposed method can ‘strongly converge’ in a certain reinforcement learning setup. (To be exact, they show norm convergence of the gradients.) The Experiments section lists several SOTA results on reinforcement learning.
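
To make the construction in this summary concrete, here is a minimal sketch, not the authors' implementation. It takes \Sigma = O(d) and relies on the fact that tangent vectors to O(d) at W have the form W \Omega with \Omega skew-symmetric. The state dynamics dx/dt = tanh(W x), the constant field, the Euler and Cayley discretizations, the names `omega_psi`, `cayley_step` and `integrate`, and all sizes are assumptions made for illustration.

```python
import numpy as np

def omega_psi(w, psi):
    """Hypothetical stand-in for b_psi: returns a skew-symmetric generator Omega.

    The tangent vector at W is then W @ Omega. This simplest choice ignores W.
    """
    return 0.5 * (psi - psi.T)

def cayley_step(w, omega, h):
    """W <- W (I - h/2 Omega)^{-1} (I + h/2 Omega).

    For skew-symmetric Omega the Cayley factor is orthogonal, so W stays in O(d).
    """
    eye = np.eye(w.shape[0])
    return w @ np.linalg.solve(eye - 0.5 * h * omega, eye + 0.5 * h * omega)

def integrate(x0, w0, psi, steps=100, h=0.01):
    """Jointly evolve the state x and the parameter matrix W (assumed dynamics)."""
    x, w = x0.copy(), w0.copy()
    for _ in range(steps):
        x = x + h * np.tanh(w @ x)                     # explicit Euler on the state
        w = cayley_step(w, omega_psi(w, psi), h)       # structure-preserving step on W
    return x, w

rng = np.random.default_rng(0)
d = 4
w0, _ = np.linalg.qr(rng.standard_normal((d, d)))      # orthogonal initial W
psi = rng.standard_normal((d, d))                      # trainable parameters (here random)
x0 = rng.standard_normal(d)

x1, w1 = integrate(x0, w0, psi)
orth_error = np.linalg.norm(w1.T @ w1 - np.eye(d))     # distance of W from O(d)
```

In this sketch only `psi` would be trained; W is never updated directly, which matches the reviewer's description that the trainable parameters are the \psi in b_\psi rather than W itself.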

Strengths: The idea to restrict parameters to O(d) is simple but tractable and effective both in theory (Lemma 4.1) and practice (Experiments).

Weaknesses: It would be more convincing if the gradients during training are visualized in the Experiments section.
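
One way such a visualization could be produced is sketched below. It is an illustration under simplifying assumptions, not the setting of Lemma 4.1: the layers are purely linear, the unconstrained weights are Gaussian with an arbitrarily chosen scale, and `backprop_norms` is a hypothetical helper. It records the gradient norm after each backward step, which could then be plotted against depth.

```python
import numpy as np

def backprop_norms(weights, grad_out):
    """Norm of the gradient after it is pulled back through each linear layer."""
    norms, g = [], grad_out
    for w in reversed(weights):
        g = w.T @ g                      # backward pass of y = w @ x
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.default_rng(0)
d, depth = 16, 50
grad_out = rng.standard_normal(d)

# Orthogonal layers: Q factors of random Gaussian matrices.
orthogonal = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(depth)]
# Unconstrained layers: Gaussian entries, scale chosen so that norms tend to shrink.
unconstrained = [0.5 * rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

orth_norms = backprop_norms(orthogonal, grad_out)      # stays at ||grad_out||
free_norms = backprop_norms(unconstrained, grad_out)   # decays with depth
```

Because an orthogonal matrix preserves Euclidean norms, `orth_norms` is constant across depth, while `free_norms` shrinks by many orders of magnitude. A curve of this kind, computed on the trained models, is what the comment above asks for.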

Correctness: Correct overall, but I found some flaws in the mathematical terminology. For example, at ll.16-17, what is called 'a solution' is not a solution but an integral equation. At ll.10, 54 and 149 the authors assert 'strong convergence', but Theorem 1 only shows norm convergence of the gradients.
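
Both distinctions can be written out. The block below assumes the standard Neural ODE form; the symbols f, \theta, F and \theta_k are illustrative and are not copied from the paper.

```latex
% An initial value problem and its equivalent integral equation.
% A "solution" is a function x satisfying either form, not the equation itself.
\frac{dx_t}{dt} = f(x_t, t, \theta), \quad x_0 \ \text{given}
\qquad\Longleftrightarrow\qquad
x_t = x_0 + \int_0^t f(x_s, s, \theta)\, ds .

% Convergence of gradient norms does not by itself imply convergence of the iterates:
\lim_{k \to \infty} \bigl\| \nabla F(\theta_k) \bigr\| = 0
\quad\not\Longrightarrow\quad
\theta_k \to \theta^{*} .
```

For instance, with F(\theta) = e^{-\theta} the gradient norm vanishes along \theta_k = k while the iterates diverge.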

Clarity: Average. Considering the simplicity of the main idea, Sections 1-4 could be more straightforward, simpler and clearer. For example, at the end of the Introduction the authors summarize the contributions, but I could not make sense of what ISO/Gated-ODEtoODE are, what they solve, and how. At l.102 the authors introduce ‘a parameterized function b_\psi...’ without mentioning what the parameters are.

Relation to Prior Work: Not exhaustive, but covers a wide range of viewpoints. Some citations are too coarse, for example [23] at l.28 and [35] at l.104, because these are entire books. I would like to read a more detailed discussion in Section 5.2.3, as this is an empirical study.

Reproducibility: No

Additional Feedback: === UPDATE AFTER AUTHOR FEEDBACK === I have read the authors' responses and the other reviewers' comments, and I will keep my score. I do not agree with the authors' claim that HyperNets is ANODEV2. It is simply missing from the experiments.

Review 3

Summary and Contributions: This paper proposes a neural ODE framework that allows its parameters to evolve dynamically through time. The paper also shows theoretically that the proposed method, called ODEtoODE, can achieve stable and effective training by constraining the parameter flow to compact manifolds. The proposed method is evaluated on Reinforcement Learning tasks.

Strengths:
- The topic is of interest to the research community. NeuralODE is a relatively new topic and has been drawing more and more attention recently.
- The contribution is novel. Formulating a parameter flow and constraining it to orthogonal groups could be very helpful for understanding NeuralODE.

Weaknesses:
- The paper is a little hard to follow. There is plenty of content discussing the validity of ODEtoODE but little on how to put the proposed method to use. Since the parameters become dynamical, training ODEtoODE presumably requires additional procedures and may differ from the training of conventional models, so including an algorithmic description would be a good option.
- Although the paper gives some analysis of how ODEtoODE can help alleviate the gradient vanishing/explosion problem, there are few experimental results to support this claim, especially for supervised learning in Section 5.2. Without results such as convergence curves, it is less convincing that the proposed method really improves stability and effectiveness.

Correctness: The proposed method is technically sound, and the empirical settings look valid. However, certain experimental results are missing. Please refer to the Weaknesses section.

Clarity: The technical presentation would benefit from more elaboration on how to apply the proposed method to solve problems.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: --------------- After rebuttal ---------------- I have decided to keep my score unchanged after reading the other reviews and the rebuttal.