NeurIPS 2020

Towards Theoretically Understanding Why SGD Generalizes Better Than Adam in Deep Learning


Review 1

Summary and Contributions: This paper analyzes how SGD and Adam escape from local minima, using their corresponding Levy-driven stochastic differential equations (SDEs). By defining the “flatness” and “sharpness” of local minima via a Radon measure, the paper theoretically shows that (1) both SGD and Adam prefer flat local minima, i.e., flat minima usually have a larger escaping time than sharp ones, and (2) compared to Adam, SGD escapes from local basins more easily.
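For readers' convenience, in my own notation (which may differ from the paper's; the Adam case adds adaptation terms), the Levy-driven SDE for SGD and the escaping time in question take roughly the form

d\theta_t = -\nabla F(\theta_t)\, dt + \epsilon\, dL_t^{\alpha}, \qquad \Gamma = \inf\{\, t \ge 0 : \theta_t \notin \Omega \,\},

where L_t^{\alpha} is an \alpha-stable Levy process with tail index \alpha \in (0, 2), \epsilon scales with the learning rate, and \Omega denotes the local basin around the minimum.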

Strengths: The theory established in the paper is solid and provides a good explanation for the generalization performance of SGD and Adam, and for the gap between them. Note that there has long been a debate about the connection between generalization performance and the flatness of local minima. The proposed definition of flatness based on a Radon measure may help bring this debate to an end.

Weaknesses: The experiments mainly verify the generalization gap between SGD and Adam, but do not directly verify the main theory established in the paper, such as the relation between the Radon measure and the escaping time, or the validity of the flatness definition.

Correctness: As far as I have checked, the claims appear to be correct.

Clarity: The paper is clearly written and easy to read.

Relation to Prior Work: Related works are properly mentioned.

Reproducibility: Yes

Additional Feedback: [After author feedback] The feedback addresses some of my concerns. Hence, I keep my score.


Review 2

Summary and Contributions: This work analyzes the convergence behavior of SGD and Adam through their Levy-driven stochastic differential equations. The theoretical results show that the escaping time from a basin is smaller for SGD than for Adam. Moreover, compared to Adam, SGD is more likely to finally converge to a flat or asymmetric basin, which often yields better generalization performance.

Strengths: This work helps us theoretically understand why SGD generalizes better than Adam in deep learning.

Weaknesses:
1) In deep learning, the algorithm commonly referred to as "SGD" is SGD with momentum rather than vanilla SGD, yet this work does not discuss SGD with momentum. Compared to the SDE of vanilla SGD, the SDE of SGD with momentum is more similar to that of Adam. Hence, it is unclear whether the proposed theory can still explain why SGD with momentum generalizes better than Adam.
2) One of the contributions is replacing the Gaussian assumption on the gradient noise of SGD and Adam with an alpha-stable distribution, but the paper does not state how to estimate the key tail index \alpha (a possible estimator is sketched below). From Figure 1, we can conclude that \alpha in practice is usually smaller than 1. This raises a new problem: according to Theorem 1, the first escaping time \Gamma depends negatively on the learning rate \eta, meaning that a larger learning rate helps escape from a basin. However, in practice a smaller learning rate is commonly found to be more helpful for escaping a basin. Hence, the theory is somewhat inconsistent with the phenomenon observed in practice.
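Regarding point 2), one simple option is a Hill-type estimator applied to the pooled stochastic-gradient noise; this is my own sketch, not necessarily what the authors intend, and the function name and synthetic test below are only illustrative.

import numpy as np

def hill_tail_index(samples, k=1000):
    # Hill estimator of the tail index from the k largest |samples|;
    # for alpha-stable noise with alpha < 2 the tail is asymptotically Pareto(alpha).
    x = np.sort(np.abs(np.asarray(samples)))[::-1]   # descending order statistics
    log_spacings = np.log(x[:k]) - np.log(x[k])      # log-excesses over the (k+1)-th largest value
    return 1.0 / np.mean(log_spacings)

# Sanity check on synthetic heavy-tailed "gradient noise" with tail index 1.5.
rng = np.random.default_rng(0)
noise = rng.pareto(1.5, size=200_000) * rng.choice([-1.0, 1.0], size=200_000)
print(hill_tail_index(noise))   # roughly 1.5

In practice one would collect the per-minibatch gradient noise (minibatch gradient minus full gradient) at a given iterate and feed its entries to such an estimator; stating the exact procedure in the paper would make the empirical claim about \alpha easier to reproduce.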

Correctness: The theory seems partly inconsistent with the phenomenon observed in practice (see Weaknesses).

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: The responses partly address my concerns, so I raise my score. However, the theoretical analysis of SGD with momentum (SGD-M) is still insufficient; I would like to see clearer justification for SGD-M in the updated revision.


Review 3

Summary and Contributions: This paper studies the escaping behavior of SGD and Adam through Levy-driven SDEs. The theoretical results show that the time needed to escape a local basin depends on the Radon measure of the basin and the heaviness of the gradient-noise tails. This also suggests that Adam has a larger escaping time due to its adaptation. Experiments are provided to support the results.

Strengths: The noise assumption considered in this paper is more realistic than the Gaussian noise assumption made in several previous works. The results suggest a new measure of “flat” minima, the Radon measure of the local basin, which gives another interesting characterization of “flat/sharp” minima.

Weaknesses: The analysis of Adam's dynamics is simplified by Assumption 2. While an intuitive explanation is provided, the results would be strengthened if this assumption could be proved rigorously.

Correctness: The claims and method seem to be reasonable.

Clarity: The paper is well-written and easy to follow.

Relation to Prior Work: The authors discuss related works and explain several differences between prior works and the current paper.

Reproducibility: Yes

Additional Feedback: We know that for over-parameterized neural networks, the local minima usually have an ill-conditioned or even degenerate Hessian. I therefore wonder whether the results would still hold in such situations, e.g., the non-strongly-convex case \mu = 0 in Assumption 1. For the right plots of Figures 2(a) and 2(b), I wonder why the loss between the two points first decreases and then increases. If there is a barrier between these two points, shouldn't the loss first increase and then decrease (see the toy interpolation sketch at the end of this section)?

After rebuttal: The authors' feedback addressed my questions.
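Toy interpolation sketch referenced above: this is my own illustration with a hypothetical double-well loss, not the paper's setup, just to show why I would expect a barrier to make the interpolated loss rise and then fall.

import numpy as np

def interpolation_losses(loss_fn, theta_a, theta_b, n=51):
    # Evaluate the loss along the segment (1 - t) * theta_a + t * theta_b for t in [0, 1].
    ts = np.linspace(0.0, 1.0, n)
    return ts, np.array([loss_fn((1.0 - t) * theta_a + t * theta_b) for t in ts])

# Toy double-well loss with minima at w = -1 and w = +1 in every coordinate;
# the loss along the segment rises to a barrier at the midpoint and then falls.
loss = lambda w: np.sum((w ** 2 - 1.0) ** 2)
theta_a, theta_b = -np.ones(10), np.ones(10)
ts, losses = interpolation_losses(loss, theta_a, theta_b)
print(ts[losses.argmax()])   # 0.5: the barrier sits midway between the two minima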