Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The authors studied the implicit regularization effect of (accelerated) gradient algorithms for least squares. The paper is well-written and clear. The main results are the population risk analysis of (accelerated) gradient methods for least squares, with the conclusion that accelerated methods achieve the same accuracy of vanilla gradient descent but with much fewer iterations. The assumptions are clean and standard. I like that the convergence bounds are proved in a very "unified" way based on spectral filtering technique. The authors also showed that their theoretical findings are empirically observed (and matched almost perfectly!) in the experiments, which is very nice. One drawback of the paper is that there are only upper bounds without any corresponding lower bounds. In other words, it is possible that their analysis are not tight, although that seems unlikely to me according to the experiments. Also, the least square setting is a bit too simplified (is it possible to extend to general convex setting?), but I am OK with just least squares at the current state of research. Overall, I think this is a good theory paper which connects stats and optimization. I suggest acceptance.
**********After author response*********** I thank the authors for answering my questions. In particular, my concern about novelty was addressed. I decide to raise my score. ************************************************ Originality: There are many previous results about the spectral filtering functions of AGD and the authors are not very clear about what's novel. See entry 1 in the "improvements" part. Quality: I haven't checked the proofs in the appendices, but the theoretical insight is convincingly verified by experiments. Clarity: The paper is clearly written in general, although there are minor issues that I mention in entries 2, 3 and 4 in the "improvements" part. Significance: The generalization bounds for AGD in this work will enhance the statistical understanding of optimization methods. While there are previous works showing that AGD is less stable than GD, the finding that AGD achieves the same optimal excess risk as GD is interesting.
The paper reviews Gradient Decent and accelerated methods and provide novel analysis for the accelerated methods. Originality - the analysis techniques using in the paper are novel. The paper combines known techniques to conclude improved analysis for the known methods. The related work is very well cited and explained. Clarity. The paper is very well written and organized. Very good review of the subject. The technical parts were left to the appendix. The theorems in the paper are a direct conclusion from "Theorem 4" which is not stated in the main part of the paper. The appendix is harder to understand, The mathematical analysis seems to be correct. Quality - The authors made a complete work, providing claims analysis and empirical demonstrations. Most of the work is provided in the appendix. The paper also provided strength and weaknesses for their work. A further analysis of the unstable behavior would be interesting. Significance. The paper set interesting starting point for further study. It is hard to evaluate the significance of this paper. The analysis techniques are novel and led to improved analysis for the problem. But the paper does not focus on the techniques or how to use them in other scenarios. The results are also novel, but also shown to be "unstable", as the test error tend to explode past the minimum point. It is unclear from practical point of view when should accelerated method should be used.