NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:4604
Title:A Universally Optimal Multistage Accelerated Stochastic Gradient Method

Reviewer 1

**********After author response*************** I appreciated that the authors took the time to try to relax their noise assumption. The idea seems promising but I cannot change my score without seeing a real and complete proof. Yet the fact remains that, even if I will not fight for this paper, I think that it deserves a submission. *************************************************** Originality: the aim of the paper is very general but very important: how to accelerate SGD ? The algorithm is based on the accelerated gradient descent from Nesterov together with a multistage approach that does not seem new neither. Yet they reach optimal rates for the algorithm which is the principal aim of the paper. Quality and clarity: The aim of the paper is clear and as robustness of the acceleration of gradient descent with respect to noise is important, the result stated is quite strong. Moreover, even if the topic is quite technical and several attempts in the past have been made, the topic is well reviewed and documented. However, the paper lacks from some clarity, especially in Sections 2 and 3, where this is rather a succession of lemmas and theorems than a clear discussion about the difference with the previous works or the method employed. Significance: as I said in the previous Section (contribution), the result was already proven in more specific settings, and the significance of this paper may be more methodological than theoretical or experimental.

Reviewer 2

Originality: This paper provides a clear and deep analysis of a multi-stage accelerated SGD algorithm. The results show that the expected function value gap is bounded by an exponential decay term plus a sublinear decay term related to noise. They recover the deterministic case in the single stage and zero noise special case, while reaching the lower bound O(\sigma^2/n) in the noise term. The paper contains sufficient novel results and is competitive comparing with related work. In particular, the main results reveal how to choose the right time to switch from constant stepsize to decaying stepsize, a crucial choice for the overall performance of stochastic algorithms. Quality/Clarity: The paper is written with care. The proofs are correct, clear and concise. The results are well presented and highlighted. A detailed comparison is made with related papers. For example, the authors showed that AC-SA can be seen as M-SAG with specific varying stepsize rule. The technique of exponential decaying stepsize allows to avoid the requirement on the knowledge of \sigma. Significance: The paper brings new algorithm and new results in accelerated stochastic gradient methods. The results match the lower bound without requiring the knowledge of noise level and provide insight into parameter choice of the algorithm. The new algorithm also perform asymptotically the best in the reported experiments. Comments: 1, Should the function value gap in 132 be changed to the Lyapunov function? Also, where is Corollary 3.2 needed in the whole paper? 2, Typo in Equation (32). I read the response.

Reviewer 3

Compared with existing work [8], the contribution of this work is incremental. The key idea of proposed M-ASG is to use a exponentially decreased step size and increased number of iterations for each stage. In fact, this idea is similar to the results presented in [8]. More importantly, the proposed results only work for strongly convex objective. As discussed in [8], compared with generally convex objective, the accelerated method is much less sensitive to noisy gradient estimate if the objective is strongly convex. Therefore, more challenging and important problem is how to design an accelerated algorithm so that it is robust to noisy gradient and able to achieve the optimal rate. Compared with u-AGD [8], as showed in Figure 2 and 3, he proposed algorithm does not show clear difference and improvement in terms of empirical performance. After rebuttal: the authors' feedback partially address my questions and I have changed my score.