__ Summary and Contributions__: This paper proposes a method to learn a generalized Hamiltonian decomposition (learning an energy function). This method applies to more general scenarios than previous work.

__ Strengths__: Since this method applies to more general problems than previous work, it might inspire more research in this more general direction. They test it on noisy & chaotic systems. They are able to incorporate physical priors via special parametrizations of neural networks. I appreciate the clear distinction between "weak form regression," "strong regression," and "state regression." I have done some work with the latter two, but not the former. As mentioned by the authors, this may be the first paper to approach the problem as "weak form regression" in a deep learning context.

__ Weaknesses__: As I'll detail below, I found the paper hard to follow. I think it would be difficult to reproduce without further clarifying details. I also found the experiments unconvincing.

__ Correctness__: There aren't any explicit references to held out (test) data. Is that what's meant in Appendix D (ODE Model Comparison Metrics)? Are those 50 initial conditions different than the ones used for training? If so, are all reported metrics & figures about results from these 50 held-out initial conditions?
My understanding is that Section 4.1 compares ways to learn a pendulum model with a fully connected neural network that is not concerned with learning an energy function. This section focuses on comparing three approaches to learning the dynamics model: weak form regression, derivative regression, and state regression. We see that weak form regression has significantly lower error than the other two options. However, I was surprised that all of the errors were so high. Similarly, the later sections, which add in the aspect of learning the generalized Hamiltonian, have significant error. Are there any papers we can compare to with similar examples at a similar level of noise? See, for example, Rudy et al., "Deep learning of dynamics and signal-noise decomposition with time-stepping constraints," JCP, 2019, for an example of seemingly much more accurate results on similar problems. Also, looking at Figure 2 in [17], their state error seems lower on their real pendulum than your state error on the noisy pendulum. However, perhaps your problems are much harder in a way that is not obvious to me.
I think it would help to report relative error & plot what the noisy data looks like, as is done in the Rudy, et al. paper. If the baselines in this paper are not very representative of how accurate other methods are, then that greatly weakens the claim that your method is an improvement.
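To be concrete about what I mean by relative error, a minimal sketch follows. The helper name `relative_l2_error` and the toy numbers are my own placeholders and are not taken from the paper; the idea is only that the error norm is divided by the norm of the reference signal, so that results can be compared across systems with different scales.

```python
import math


def relative_l2_error(predicted, reference):
    """Return ||predicted - reference||_2 / ||reference||_2."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    diff_norm = math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)))
    ref_norm = math.sqrt(sum(r ** 2 for r in reference))
    if ref_norm == 0.0:
        raise ValueError("reference has zero norm; relative error is undefined")
    return diff_norm / ref_norm


# Hypothetical toy values, for illustration only.
reference = [3.0, 4.0]   # stands in for a noise-free trajectory
predicted = [3.0, 4.5]   # stands in for a model prediction
error = relative_l2_error(predicted, reference)
```

Reporting this quantity alongside the absolute error would make it easier to judge whether an error is large relative to the signal itself.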
I added one more "correctness" comment under "additional feedback" since I ran out of space in this box.

__ Clarity__: In general, I found this paper hard to read. However, I realize that I may be missing some important background pieces.
Here are some suggestions:
I found it hard to follow where the neural network comes in for the weak form learning (Equation 16).
Since inference for an input convex network is itself a convex optimization problem, how does using input concave neural networks for parameterizations affect the speed of inference? (I know I asked about training time already, but is inference time also important to consider here?)
In Appendix D, explaining the metrics, it says "We then integrate these same initial conditions forward in time using the models and perform the following computations..." Are the training losses the log likelihoods in Section 3, but then to compare the models, you use this procedure described in Appendix D? When you talk about integrating forward using the models, does that mean using an ODE integrator? I.e., for derivative regression, were the derivatives from the neural network then plugged into an ODE integrator? Was it the same integrator as described in Section 4 that you used for state regression? Similarly, how were the derivatives estimated for the sake of these metrics (other than for the derivative regression problem)?
Could you clarify how to check if you have learned an energy function? I know for a Hamiltonian, I would check if it is conserved along trajectories, but for an unknown generalized Hamiltonian, I'm not sure how to check if I learned one well.
I'm confused about the "setting energy flux rate" prior. This method is used to learn a scalar energy function. If you don't already know the energy function, then how do you know the energy flux rate?
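For reference, the conservation check I mentioned for an ordinary Hamiltonian looks roughly like the sketch below. It assumes an ideal unit pendulum with H(q, p) = p^2/2 - cos(q) and a hand-written RK4 integrator; every name here is my own and nothing is taken from the paper. My question is what the analogous diagnostic would be for a learned generalized Hamiltonian.

```python
import math


def hamiltonian(q, p):
    """Energy of an ideal unit pendulum (assumed system, not the paper's)."""
    return 0.5 * p * p - math.cos(q)


def vector_field(q, p):
    """Hamilton's equations: dq/dt = dH/dp, dp/dt = -dH/dq."""
    return p, -math.sin(q)


def rk4_step(q, p, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1q, k1p = vector_field(q, p)
    k2q, k2p = vector_field(q + 0.5 * dt * k1q, p + 0.5 * dt * k1p)
    k3q, k3p = vector_field(q + 0.5 * dt * k2q, p + 0.5 * dt * k2p)
    k4q, k4p = vector_field(q + dt * k3q, p + dt * k3p)
    q_next = q + dt * (k1q + 2.0 * k2q + 2.0 * k3q + k4q) / 6.0
    p_next = p + dt * (k1p + 2.0 * k2p + 2.0 * k3p + k4p) / 6.0
    return q_next, p_next


def max_energy_drift(q0, p0, dt, n_steps):
    """Largest |H(t) - H(0)| observed along the integrated trajectory."""
    h0 = hamiltonian(q0, p0)
    q, p = q0, p0
    drift = 0.0
    for _ in range(n_steps):
        q, p = rk4_step(q, p, dt)
        drift = max(drift, abs(hamiltonian(q, p) - h0))
    return drift


drift = max_energy_drift(q0=1.0, p0=0.0, dt=0.01, n_steps=1000)
```

For a conservative system the drift stays near integrator error; a learned energy that is not conserved along the true trajectories would show up as a much larger value.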

__ Relation to Prior Work__: I found it hard to tell in Section 2.1 what was required to extend the generalized Hamiltonian decomposition in [22] from R^3 to R^n. Is the main change replacing the Helmholtz decomposition with the Helmholtz-Hodge decomposition?

__ Reproducibility__: No

__ Additional Feedback__: Last correctness comment:
In Section 4.2, it's stated that "In all cases, GHNNs perform at least as well as the other state-of-the-art continuous time models while simultaneously learning, generalized Hamiltonian energy function and the energy cycle for the system." State-of-the-art is too strong a claim. There are decades of research on using neural networks to predict dynamical systems. You compare against two models: a fully-connected neural network that may or may not have been tuned, and a Hamiltonian Neural Network which was actually designed to learn Hamiltonians. (If you train a HNN on data that doesn't have a Hamiltonian to conserve, are you putting it at a disadvantage?) Worse, your model is only occasionally the most accurate of the 3 in Table 2, so "at least as well as" is not correct.
UPDATE: Thank you for addressing several of my concerns in the response. I updated my score from a 4 to a 5.

__ Summary and Contributions__: The paper proposes two significant contributions (generalised Hamiltonians + weak-form optimisation), and either one alone would have been sufficient for publication. The paper presents novel concepts and pushes the boundary in neural ODEs.

__ Strengths__: The work is sound, significant, very novel and relevant.

__ Weaknesses__: The experiments are limited, and the connection to realistic use cases / applications could have been stronger.

__ Correctness__: Everything looks correct.

__ Clarity__: Clearly written. The material is technical but the paper does a good job presenting the concepts. Some additional cartoon visualisations would have still helped.

__ Relation to Prior Work__: Very well described and positioned

__ Reproducibility__: Yes

__ Additional Feedback__: The paper proposes Hamiltonian ODE systems via Helmholtz-type decompositions into curl and div-based fields. They propose a neural network parameterisation to learn such systems, and propose a range of priors and model choices to more realistically model and represent ODE systems. Second, they propose a new weak-form optimisation technique that seems very promising.
Both contributions are very significant and will certainly lead to high impact in deep learning / diffeqs. The paper is excellently written, I enjoyed reading it a lot.
The main drawback of the paper is the limited experiments. The weak-form optimisation is an excellent contribution, but its superiority has not yet been convincingly demonstrated based on these very limited experiments. Likewise, Table 2 shows that the new model is not dramatically better than earlier approaches. The paper would have improved considerably with realistic applications or benchmarks instead of the toy’ish systems.
Despite these drawbacks, this is a fantastic paper that I’m sure will interest the NIPS community, and I’d be happy to see it published.
Minor comments
o Theorem 1 seems incomplete. The theorem is true for any skew-symmetric J’s, but the connection to g() is vague. With arbitrary g, J is not guaranteed to be skew-symmetric. Parameterising only one triangle of J with g should work. The proof is correct.
o It's unclear if eq 7 means that all g_ij’s are the same network, or if they are different networks.
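To illustrate the one-triangle suggestion in the Theorem 1 comment: filling only the strict upper triangle and mirroring it with a sign flip gives skew-symmetry by construction. This is a sketch under my own assumptions; `skew_from_upper` is a hypothetical helper, and the plain numbers stand in for whatever the outputs g_ij(x) would be.

```python
def skew_from_upper(upper_entries, n):
    """Build an n x n skew-symmetric matrix from its strict upper triangle.

    upper_entries lists the n*(n-1)/2 entries J[i][j] with i < j, row by row.
    """
    expected = n * (n - 1) // 2
    if len(upper_entries) != expected:
        raise ValueError(f"expected {expected} entries, got {len(upper_entries)}")
    J = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            J[i][j] = upper_entries[k]
            J[j][i] = -upper_entries[k]  # mirrored with opposite sign
            k += 1
    return J


def quadratic_form(J, v):
    """Compute v^T J v."""
    n = len(v)
    return sum(v[i] * J[i][j] * v[j] for i in range(n) for j in range(n))


# Placeholder values standing in for network outputs (hypothetical).
J = skew_from_upper([1.0, -2.0, 0.5], 3)
v = [0.3, -1.2, 2.0]
form_value = quadratic_form(J, v)
```

Because J^T = -J holds for every input, v^T J v vanishes for any v (up to rounding), which is the standard property of skew-symmetric matrices that a free-form g would not guarantee.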

__ Summary and Contributions__: The authors provide a new method for learning an ODE that describes time series data, offering both a new (potentially more interpretable) technique for parametrizing the fitted ODE and a new technique for training the parameters.

__ Strengths__: This is a very well-written paper. The problem that it wants to solve is quite clear, and the descriptions of the two methodological advances are easy to follow.

__ Weaknesses__: The numerical experiments are convincing, but it would have been nice (if the authors had more room) to include an interpretation of the learned ODE parameters and to include a more real-world, less simulated example. This might have illustrated some limitations of the method, as when some data is hidden, learning a first order ODE may perform badly. Ultimately, the failure of the authors to provide a real-world example led me to dock a point in the final review.

__ Correctness__: As far as I can tell, the claims and the methods are correct.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes; this is particularly well-done.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The authors discuss the generalized Hamiltonian decomposition of ODEs and demonstrate its use in estimating the vector field in ODEs.

__ Strengths__: The generalized Hamiltonian decomposition offers a physically intuitive way to describe the dynamics.

__ Weaknesses__: As it pertains to parameter estimation, some things are not explained well. Additionally, some aspects of the proposed weak form approach are not discussed in sufficient detail. I elaborate below.

__ Correctness__: Appears so.

__ Clarity__: So-So.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: I considered the author reply; however, I remain convinced that the proposed weak form regression would require some closer examination.
1) I guess strictly speaking, it should be \circ Identity map(x) rather than \circ x in the definition of N(x).
4) It seems that one drawback of the weak form in Eq. (15) is that, while quadrature can indeed be used, you are restricted to low-order quadratures since you cannot evaluate x(t) at arbitrary t? Also, how sensitive is this approach to noise in comparison to other methods?
5) The parentheses in Eq. (16) around the argument of p could be made bigger.
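To make the quadrature concern in point 4 concrete, here is a sketch of a generic weak form residual evaluated only at sampled times. It assumes the textbook identity -∫ φ'(t) x(t) dt = ∫ φ(t) f(x(t)) dt for a test function φ vanishing at both endpoints, which may differ in detail from the paper's Eq. (15); the test function, the toy system dx/dt = -x, and all names are my own choices.

```python
import math


def trapezoid(values, times):
    """Composite trapezoid rule using only the sampled time points."""
    total = 0.0
    for k in range(len(times) - 1):
        total += 0.5 * (values[k] + values[k + 1]) * (times[k + 1] - times[k])
    return total


def weak_form_residual(times, states, f, phi, dphi):
    """Residual of -int phi'(t) x(t) dt = int phi(t) f(x(t)) dt on sampled data."""
    lhs = -trapezoid([dphi(t) * x for t, x in zip(times, states)], times)
    rhs = trapezoid([phi(t) * f(x) for t, x in zip(times, states)], times)
    return lhs - rhs


def phi(t):
    """Test function sin^2(pi t), which vanishes at t = 0 and t = 1."""
    return math.sin(math.pi * t) ** 2


def dphi(t):
    """Derivative of phi: pi * sin(2 pi t)."""
    return math.pi * math.sin(2.0 * math.pi * t)


# Samples of x(t) = exp(-t), which solves dx/dt = -x (toy data, assumed).
n = 201
times = [k / (n - 1) for k in range(n)]
states = [math.exp(-t) for t in times]

residual_true = weak_form_residual(times, states, lambda x: -x, phi, dphi)
residual_wrong = weak_form_residual(times, states, lambda x: x, phi, dphi)
```

No derivative of x is ever estimated, but the integrals can only use the times at which x was observed, so the accuracy is bounded by the trapezoid rule on that grid. The correct vector field gives a residual near zero, while the sign-flipped one does not.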