Summary and Contributions: The paper investigates optimization algorithms through the lense of dynamical systems. This connection allows to transfer known results from dynamical systems to optimization. In particular, the paper investigates two known algorithms (heavy ball and Nesterov's accelerated descent) and proposes a new one coined "relativistic gradient descent". The paper investigates these algorithms in terms of order approximation, conformal simplecticity and dissipation. In addition, 4 benchmark examples show that the new proposed algorithm converges faster than the other two algorithms, even in the case when all parameters have been tuned "optimally".
Strengths: The paper is interesting, largely sound, a novel contribution to the literature and of relevance to the NeurIPS community. Based on the presented information, I believe the theoretical results are correct. The numerical results are interesting since they cover a range of relevant problems and fair since all parameters have been tuned "optimally" using Bayesian optimization. The connection of optimization with dynamical systems is very relevant to the NeurIPS community since understanding and enhancing optimization is fundamentally important. The proposed algorithm is novel.
Weaknesses: Although the paper has a number of strengths, it also has some weaknesses. While the theoretical results of the paper are sound and important from a dynamical systems perspective, it remains questionable how important these are in an optimization context. In any case the authors fail to convince me in this regard. For the numerical results, all parameters were tuned "optimally" using Bayesian optimization. This is a good and fair approach. However, the details here are vague and it is not clear for instance what the metric was under which the parameters have been tuned. That being said, I trust that the authors chose a relevant metric.
Correctness: The claims in the paper are largely correct. One exception is for instance the claim in page 2 that "RGD never diverges". This claim is based on the fact that the difference of certain iterates stay globally bounded. There is absolutely no reason why such conclusion should be true. In fact, the authors themselves observed on page 7 that "CM and NAG diverged more often than RGD" implying that RGD did diverge in a couple of their numerical examples.
Clarity: Yes, the paper is extremely well written. It is well structured and can be easily digested.
Relation to Prior Work: Yes, prior work is appropriately referenced.
Additional Feedback: After reading the other reviews, the authors's feedback and discussion with other reviewers, I have decided to change my score to "8" and have updated my review accordingly. My main concern has been clearly resolved.
Summary and Contributions: The authors propose to apply the conformal symplectic method to compute the Nesterov’s accelerated gradient and Polyaks’s heavy ball. In particular, they study the possibility of realistic damped Hamiltonian flow. They propose a new algorithm that normalizes the momentum and may result in more stable/faster optimization Here the realistic Hamiltonian helps in the simple update of momentum with a little additional computational cost.
Strengths: The accelerated gradient flow is important in optimization. The paper addresses the problem in using damped Hamiltonian flows with conformal symplectic method.
Weaknesses: 1. This work study several toy examples in purely optimization problems. It may be interesting to study machine learning optimization problems, or Bayesian sampling problem. 2. Is there a theoretical justification that shows the realistic Hamitlonian is better than Euclidean Hamiltonian? Does it provide a new convexity which helps the optimization convergence for some particular functions or non-convex functions?
Correctness: The claim is correct by using the damped Hamiltonian flow with conformal Symplectic discretization.
Clarity: The paper is well written.
Relation to Prior Work: There are also related work on applying damped Hamiltonian flows in probability density space, which works for Bayesian sampling. Yifei Wang, Wuchen Li. Accelerated Information Gradient flow, arXiv:1909.02102.
Additional Feedback: For numerical expreiments, more machine learning examples may be needed. In addition, a toy example which explains why the proposed method works may be useful. It could explain why the realistic Hamiltonian is better than the classical quadratic kinetic energy in these problems. I have read the authors' response. They carefully address my questions. This is a good attempt in the generalization of Nesterov algorithm, which could have border applications in sampling. I increase my score from 6 to 7.
Summary and Contributions: This paper brings ideas from the physics literature to study and development of gradient-based optimization methods. By considering specific structure-preserving discretization of continuous time dynamical systems, the authors shed a new light on the theoretical grounding of classical momentum (CM) and Nesterov accelerated gradient (NAG). Moreover, they derive a new interesting optimization algorithm called relativistic gradient descent (RGD) interpolating the behavior of CM and NAG. Once properly tuned, the authors show empirical evidence of the better performance of their method over CM and NAG.
Strengths: This work sheds a new light on the theoretical grounding of CM and NAG. A new interesting optimization algorithm RGD is proposed. It interpolates between the structural properties of both CM and NAG. Crucial structural and theoretical properties of RGD along with CM and NAG are derived and certainly provide a better understanding of the methods. The experimental section clearly supports the theoretical claims. I am a strong advocate of the claim l319 "cross-fertilization between dynamical systems theory, physics, machine learning and optimization". The pedagogical efforts made by the authors is much appreciated.
Weaknesses: There are 2 extra hyper-parameters to tune compared to CM and NAG. It would have been great to be able to play with some supplementary code, please consider releasing the corresponding implementation.
Correctness: I cannot judge the correctness of the claims and proofs. The empirical methodology seems clean, parameter tuning is performed via Bayesian optimization and is well interpreted. Could you give the computational infrastructure and programming language which allows you to monitor 10^-34 variations?
Clarity: The paper is very well written, it is a pedagogical work to bring up physical jargon to the ML audience.
Relation to Prior Work: Yes it is. Nevertheless, I don't understand why this work is listed in the Deep Learning track and not the Optimization one.
Additional Feedback: l 136 t_k notation is only implicitly defined Figure 3 in appendix might be used in the main text to illustrate the stability properties of each procedures and to make the body less dense. At least, l 69 add reference to Figure 3 Please make sure the hyperlinks associated to citations, equations, etc. are clickable in the main text. # EDIT: The authors' response and the discussion with the other reviewers have convinced me to support even more this contribution despite my low expertise on the topic.