Paper ID: 5108

Title: Understanding the Role of Momentum in Stochastic Gradient Methods

In this paper, the authors use a general formulation of QHM and derive a unified analysis for several popular methods.

Originality: The topic of the paper is, in my opinion, interesting. The paper presents analysis and insights for several methods that are used in practice without theoretical guarantees.

Quality: The overall quality of the paper is very good. The paper is purely theoretical. Some parts are hard to follow, and some of the theoretical results are hard to parse (e.g., Theorem 3).

Clarity: The paper is well written and well motivated.

Significance: In my opinion, even though the paper is purely theoretical, it does provide insights that may aid the use of stochastic momentum methods in practice.

Issues/Questions/Comments/Suggestions:
- Contributions: This section could be written more clearly, and the importance of the results could be better explained.
- Contributions: What do the authors mean by "stochastic noise"? Stochastic noise in the gradient approximations?
- Theorem 2: The bounded-noise assumption seems very strong; the authors should comment on this. On a similar note, it appears that the result \nu_k \beta_k -> 1 is derived under this strong assumption, whereas the \beta_k -> 0 result is not; the authors should comment on this in the paper. Moreover, why is the \nu_k \beta_k -> 1 result interesting? The authors should comment on this as well.
- Theorem 3: "0 <, \mu \leq ..." -- is this correct?
- Theorem 3: What is \epsilon_k?
- Theorem 3: This result excludes the case \beta = 1; the authors should comment on this in the paper.
- Figures 1, 2, and 3 are very interesting. In my opinion, the authors should expand the discussion of the results shown in these figures (e.g., the three regimes shown in Figure 1c).
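For context on the algorithm the reviews discuss: a minimal sketch of the standard QHM (quasi-hyperbolic momentum) update, which interpolates between plain SGD (nu = 0) and normalized momentum SGD (nu = 1). The parameter values below are illustrative defaults, not taken from the paper under review.

```python
def qhm_step(theta, buf, grad, lr=0.1, beta=0.9, nu=0.7):
    """One QHM step on parameter theta.

    buf is an exponential moving average of gradients; the update mixes
    the instantaneous gradient and the momentum buffer with weight nu.
    nu = 0 recovers plain SGD; nu = 1 recovers (normalized) momentum SGD.
    """
    buf = beta * buf + (1.0 - beta) * grad               # momentum buffer
    theta = theta - lr * ((1.0 - nu) * grad + nu * buf)  # quasi-hyperbolic mix
    return theta, buf

# Toy usage: minimize f(x) = 0.5 * x^2, whose gradient is x.
theta, buf = 5.0, 0.0
for _ in range(200):
    theta, buf = qhm_step(theta, buf, theta)
```

On this deterministic quadratic the iterates contract toward 0; the stochastic behavior (the regimes discussed around Figures 1-3) depends on how lr, beta, and nu interact with the gradient noise.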

INDIVIDUAL COMMENTS / QUESTIONS
1) I really appreciate how the paper ties up loose ends by unifying the analysis of several momentum-based methods in the stochastic setting.
2) I am sorely missing a literature review. I am not very closely familiar with the literature analyzing momentum methods, but there is a lot of work out there (e.g., the line of research studying momentum methods in the continuous-time limit). A brief review would be very helpful to position the paper within the existing work.
3) On Section 4:
   a) You should prominently cite Mandt et al. [16], who show similar results for SGD.
   b) In the beginning, it says "we use quadratic functions for ease of analysis". To me this implies that the analysis would go through for more general functions, which I do not find obvious. This should either be justified or the more general analysis should be presented.
4) I would recommend stating the general version of Theorem 3 (i.e., Theorem 6 in the appendix) in the main paper. Specializing to quadratic functions does not, imho, add any extra insight here, so I would state the more general result.

TYPOS / STYLE
- Line 70: Section title should be lower case.
- Line 138: Should be "driven by i.i.d. noise".
- Line 193: Should be "[...] then the QHM algorithm [...]".
- Line 216: Should be "visualization" for consistent American English.
- Line 264: "evidence" does not have a plural form.
- Consistently capitalize "Section" when referencing specific sections, e.g., line 210.
- The references could use some cleaning: capitalization [e.g., 1, 13, 15, 16]; the URL in [13].

RATING
- Quality: To the best of my knowledge, the paper is mathematically sound, but I followed the proofs in the appendix only superficially due to time restrictions.
- Clarity: The paper is well written and all ideas are explained very clearly. I am missing a brief literature review to set the scene and position the paper within the existing work.
- Significance: Momentum methods are widely used but not very well understood in the stochastic setting.
- Originality: This is not a paper with grand new ideas, but it ties up some loose ends by unifying the analysis of several momentum-based methods.
- Reproducibility: No code has been provided, and the description of the experimental setup is, in my opinion, insufficient for an outsider to reproduce the results.

Overall, this is a very solid paper and I recommend acceptance.

The paper partially fills the gap in the theoretical analysis of the original QHM paper and is well written. Since momentum-based SGD is frequently used in deep learning, the proposed theory could help practitioners tune parameters. I vote for acceptance.