Review for NeurIPS paper: STEER : Simple Temporal Regularization For Neural ODE

NeurIPS 2020

Review 1

Summary and Contributions: In this paper, the authors propose a simple regularization for neural ODEs. The idea is to add uniformly distributed random noise to the final time of the initial value problem during training.
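
The idea summarized above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`sample_end_time`, `euler_solve`, `forward`), the scalar state, the fixed-step Euler solver, and the constraint b < t_1 - t_0 are all choices made for this sketch.

```python
import random


def sample_end_time(t1, b, rng=random):
    """Draw T = t1 + delta with delta ~ Uniform(-b, b) (hypothetical helper)."""
    return t1 + rng.uniform(-b, b)


def euler_solve(f, z0, t0, t_end, n_steps=100):
    """Fixed-step explicit Euler for dz/dt = f(t, z) with a scalar state.

    A stand-in for whatever ODE solver is actually used.
    """
    h = (t_end - t0) / n_steps
    t, z = t0, z0
    for _ in range(n_steps):
        z = z + h * f(t, z)
        t = t + h
    return z


def forward(f, z0, t0, t1, b, training, rng=random):
    """Integrate to a randomly perturbed end time in training, to t1 otherwise."""
    # Assumption of this sketch: b is small enough that T stays after t0.
    assert 0 <= b < t1 - t0
    t_end = sample_end_time(t1, b, rng) if training else t1
    return euler_solve(f, z0, t0, t_end)
```

At evaluation time (`training=False`) the sketch integrates over the unperturbed interval [t_0, t_1], so the noise only affects training.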

Strengths: The proposed model has the merit of simplicity. It's easy to understand and implement. The experimental study is well designed and extensive.

Weaknesses: I fail to fully understand why the proposed regularization helps. In many applications of neural ODEs, for example in generative models, the final time t_1 is somewhat arbitrary. The proposed regularization effectively changes the final time from t_1 to (t_1 - b). This is also supported by Figures 1, 2, and 3. I understand that it is unfair to ask for a comparison with Finlay et al. (2020), since it was published only recently. But I think their method is much more principled and theoretically founded.

Correctness: The claims and method are correct. The empirical methodology is consistent with how existing models are evaluated.

Clarity: The paper is well written and easy to understand.

Relation to Prior Work: As mentioned above, it would be great if the authors could further elaborate on how their method differs from Finlay et al. (2020).

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: In this paper, the authors propose a novel regularization technique for Neural ODEs, in which the end time of the ODE solve is sampled randomly during training. They show the validity of this approach based on the existence of unique solutions. They also empirically validate that the proposed technique works as expected on normalizing flows, time series models, and image recognition tasks.

Strengths: The paper proposes a (minor but) novel technique that is potentially useful, and for which the existence of unique solutions is assured theoretically. In the experiments, the proposed technique is shown to give some improvement in test error in several experimental setups, in comparison with representative recent models.

Weaknesses: First of all, I think the efficacy of the proposed technique in practice is limited. As the authors claim, the end time of the ODE solve during training would somehow affect computational cost and test accuracy. I agree with this point intuitively (although I have some doubt that the computational cost is reduced much from this perspective). However, from the current description in the paper, it is difficult to judge whether their technique, in which the end time of the ODE solve is sampled randomly during training, improves these quantities in principle by exploiting this effect. I would appreciate it if the authors addressed how this regularization positively affects test accuracy (through an improvement in stiffness?).

Correctness: Although I did not go into the details, the derivations of the equations and the related reasoning in the paper appear to be correct.

Clarity: The clarity of the paper is OK. It contains necessary information to understand the contents.

Relation to Prior Work: The paper clearly describes its relation to existing work and its contributions beyond that work.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper presents a regularization technique for training Neural ODE models. The paper presents a theorem, tests the methodology on standard problems, and discusses some connection to stiff problems.

Strengths: The topic is of very high relevance to the community as there is currently a spike in interest in Neural ODE models. The empirical results are promising.

Weaknesses: I have trouble following the rationale of the theory, both for the theorem as well as the discussion of stiffness.

Correctness:
- I don't understand the necessity of the theorem: either the solution to the Neural ODE is well-defined over [t_0, t_1 + b], and there is nothing to prove as we fall back to the deterministic case, or it isn't, but in that case there shouldn't exist a meaningful gradient in the first place.
- I also don't understand the proof of the theorem: why is delta_k drawn in each step of the Picard iteration? In training, T is only drawn after a theta-parameter update has been conducted. So within each theta iteration, the integration interval is fixed, no?
- I have reservations regarding the stiffness discussion. This might be slightly off, as I haven't been able to fully follow the synthetic study design (see below). That being said: the problem of stiffness has a rich history in the study of numerical ODE solvers. Sect. 3.4 is a bold simplification of the subject matter without any external reference to the numerical analysis literature. Furthermore, no rationale is presented as to why the proposed method should be better able to deal with stiffness. I'd even argue that the paper's intuition could be used to argue why STEER shouldn't be able to handle stiffness: STEER encourages finding solutions that reach the final state earlier, which necessitates faster transients, while occasionally asking for solutions at t_1 + b asks for slow transients. Thus, I'd expect stiff behavior of the found solutions in the interval [t_1 - b, t_1 + b].
- Furthermore: if I understood the stiffness experiment halfway correctly, the Neural ODE is used to fit a particular trajectory. However, in all Neural ODE applications that I am aware of so far, the model is always ill-posed in the sense that there is never a unique vector field solution to the problem, but rather many viable dynamics that could fit the data. Thus, I don't think that the experiment gives a clear indication of what is happening with respect to stiffness in Neural ODE training.
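
For readers less familiar with the notion of stiffness discussed above, the textbook linear test equation y' = -lambda * y gives a minimal illustration. The sketch below is not taken from the paper or the reviews; the function names and the chosen values of lambda and h are assumptions for illustration only. Explicit Euler is stable on this problem only when |1 - h * lambda| <= 1, whereas implicit Euler decays for any positive step size.

```python
def explicit_euler(lam, h, n_steps, y0=1.0):
    """Explicit Euler on y' = -lam * y: y_{n+1} = (1 - h * lam) * y_n."""
    y = y0
    for _ in range(n_steps):
        y = y + h * (-lam * y)
    return y


def implicit_euler(lam, h, n_steps, y0=1.0):
    """Implicit Euler on y' = -lam * y: y_{n+1} = y_n / (1 + h * lam)."""
    y = y0
    for _ in range(n_steps):
        y = y / (1.0 + h * lam)
    return y


lam = 1000.0  # fast decay rate, chosen only to make the effect visible

# h * lam = 10 > 2: the explicit iterates grow in magnitude instead of decaying.
unstable = explicit_euler(lam, h=0.01, n_steps=20)

# h * lam = 1.5 < 2: the explicit iterates decay, at the cost of a smaller step.
stable = explicit_euler(lam, h=0.0015, n_steps=20)

# Implicit Euler decays even with the large step.
implicit = implicit_euler(lam, h=0.01, n_steps=20)
```

The point of the comparison is that a fast transient forces an explicit method to take very small steps even though the solution itself is almost constant after the transient has died out.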

Clarity: I haven't been able to understand the problem setup of Sect. 4.4, neither from the main text nor from the appendix. I believe that the exact training and evaluation method is presented in the paragraph starting at line 267, but I have not understood the description. As a reference, a 990-page textbook is cited without a pointer to any specific subsection.

Relation to Prior Work: The discussion of related work with respect to Neural ODEs is adequate. Regarding the discussion of stiffness, the authors should consider referring to the discussion in https://arxiv.org/abs/0910.3780, https://rd.springer.com/article/10.1007/s10543-014-0503-3, https://books.google.de/books/about/Solving_Ordinary_Differential_Equations.html?id=m7c8nNLPwaIC&redir_esc=y or similar work (but also see comment below).

Reproducibility: No

Additional Feedback: While I have raised severe objections, I still believe that the method itself may have strong merits. Please consider the following questions and suggestions:
a) If the authors deem the theorem really necessary, it needs to be made clearer why the regular Picard-Lindelöf theorem does not apply. Maybe I misunderstood something on a very fundamental level. If not, simply consider removing the section.
b) The discussion, rationale and experiments regarding stiffness do not present a coherent picture to me, and I have severe objections to acceptance. If the authors choose to keep this content, I would require significant changes and very convincing explanations during the rebuttal. However, I don't think the method necessarily requires a stiffness discussion.
c) If either or both sections are removed, may I suggest adding more experimental details that would certainly be of interest to readers: how are training times affected for time series and feedforward models? Is there a dependence between suitable ranges of the parameter b and different solvers or model architectures? Have you also tried applying the stochastic term during evaluation? Maybe the authors could even try to include fixed-step solvers (ResNets with stochastic depth)?
From my background in numerical analysis, it currently seems as if the authors felt the need to include more theory to warrant publication, but these parts raise more concerns than they bring valuable insights. On the other hand, a more nuanced empirical view of the method would probably find a big audience, and I would gladly support such a submission, even if theoretical questions are left for future work.
Post-rebuttal update: I thank the authors for their feedback and I'm happy to hear that I could provide good hints. Big changes have been promised (the removal of Sect. 3.3 and at least a somewhat more nuanced description of Sects. 3.4 and 4.4, with the remaining space being filled by more empirical evaluations). I am willing to give the authors the benefit of the doubt and raise no more objections, but I will also not actively seek acceptance at this time. Thus, I raised my overall score, but lowered my confidence.

Review 4

Summary and Contributions: The paper proposes a regularization technique for Neural ODE models, which speeds up the notoriously slow training of these models, while simultaneously improving their performance on a range of tasks. The technique is simple and complementary to existing approaches.

Strengths: The work proposes a simple and effective regularization technique that addresses an important problem in practical applications of Neural ODE models - their computational complexity. While not providing an ultimate solution - even with the proposed technique these models are still considerably slower to train than fixed-depth neural networks - this work makes a step towards solving this issue, and would be interesting to a wide audience. Effectiveness of the technique is showcased on a number of tasks (density modeling, image classification and time series modeling) with convincing results. Analysis of toy Gaussian densities and the stiff ODE example further strengthen the results and give insight into the effect of this regularization technique. Finally, the paper also provides theoretical grounding for the existence of a solution to a neural ODE with their regularization technique applied.

Weaknesses: It is not entirely clear from the paper how difficult choosing the right value of the regularization term b is in practice. It would be helpful if the authors could share results of such experiments for the density modeling or classification experiments. Such results are important to gauge the practical value of the proposed regularization technique.

Correctness: Yes.

Clarity: The paper is well-written and well-motivated. One thing I missed was the comparison of the Neural ODE results to results of fixed-depth methods. Including these would make it easier for the readers to understand the value of using Neural ODEs for certain tasks compared to other types of models. Also a few minor comments/questions below:
* In equation 5 are eigenvectors meant to be complex or real?
* Please mention dataset details in Tables 1 and 5 (i.e. CIFAR10, ImageNet32).
* What are the units of time in Table 1?
* Please check that the following sentence is correct: “Indeed these datasets break the basic assumptions of a standard RNN, namely that the data are not necessarily collected at regular time intervals”
* In Table 4 add space before “STEER” in rows 5 and 6

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: ======= POST-REBUTTAL UPDATE ======= I would like to thank the authors for their response, for providing additional details and committing to making some changes to the manuscript. I believe the manuscript will improve as a result.