NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 1973
Title: Tight Dimension Independent Lower Bound on the Expected Convergence Rate for Diminishing Step Sizes in SGD

Reviewer 1

Pros:
+ The paper is clearly written and easy to follow.
+ The contributions are clearly stated and a sufficient literature review is provided.
+ The results seem novel and significant, as they give important insight into the impossibility of improving SGD-type methods in the strongly convex case.
+ The analysis appears to be novel as well, and might be useful for constructing other lower bounds under different assumptions and frameworks.

Cons:
- The analysis seems too narrow and covers only a small subset of algorithms.

I have read the authors' feedback.

Reviewer 2

Originality: The lower bound seems to be new for the considered class of problems and stochastic algorithms. However, it is not clear how the proof of the lower bound differs from previous similar works.

Quality: The authors should mention more related work on stochastic strongly convex optimization with optimal convergence rates.

Clarity: There are a couple of papers on stochastic strongly convex optimization that provide a 1/t convergence rate for the last iterate; it is not clear why the authors particularly refer to [8, 18]. It is also not clear why lower bounds on Y_t are more interesting than bounds on the objective gap.

Significance: It is not clear how the results in this paper advance our understanding of stochastic strongly convex optimization, given the many existing studies on upper and lower bounds.

I have read the rebuttal.

Reviewer 3

Update: Thank you for the feedback; I have read it as well as the other reviews.

Compared with the vast literature on upper bounds for the convergence rates of stochastic convex optimization problems, less work has been done on deriving the corresponding lower bounds that capture the fundamental hardness of these problems. This paper aims to fill this gap and proposes a general framework for comparing upper and lower bounds; the framework also suggests potential future research directions for obtaining better convergence rates. In addition, the paper proves, for every round t, a lower bound on the expected convergence rate of SGD over any diminishing step-size sequence when applied to strongly convex problems. This bound shows that the step-size schemes proposed in recent work (Gower et al. 2019; Nguyen et al. 2018) are optimal up to a dimension-independent constant factor. By restricting its scope to SGD only, it improves upon the general dimension-dependent lower bound of Agarwal et al. (2010), which holds for all stochastic first-order algorithms, including SGD.

The proposed framework gives a clear comparison between existing results and the paper's own result by accounting for both the problem setting studied in each work and the family of algorithms each method belongs to. It formalizes the idea that the hardest objective function in a larger class of functions is no easier than the hardest objective function in a smaller class, and that the best algorithm from a larger class of algorithms is no worse than the best algorithm from a smaller class. Thus, to get a better bound, one must either consider a more restricted problem setting, which provides more information, or allow a larger set of algorithms to choose from. This framework should be very useful for later research that seeks to further improve upper bounds or to prove lower bounds for specific problems.
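As a side note, the Θ(1/t) behavior under a diminishing step-size schedule of the form α_t ≈ 2/(μ·t) can be illustrated with a small simulation. The one-dimensional quadratic objective, the shift constant E in the schedule, and all other choices below are my own illustrative assumptions, not the paper's construction:

```python
import random

# Minimal SGD sketch on a 1-D strongly convex objective
# f(w) = E_xi[ 0.5 * mu * (w - xi)^2 ],  xi ~ N(0, sigma^2),
# whose minimizer is w* = 0. The diminishing schedule
# alpha_t = 2 / (mu * (t + E)) is illustrative only
# (the shift E keeps mu * alpha_t <= 1/4 for stability).

def sgd_quadratic(mu=1.0, sigma=1.0, steps=1000, seed=0, E=8):
    rng = random.Random(seed)
    w = 0.0
    for t in range(steps):
        xi = rng.gauss(0.0, sigma)
        grad = mu * (w - xi)              # stochastic gradient of f(w; xi)
        alpha = 2.0 / (mu * (t + E))      # diminishing step size
        w -= alpha * grad
    return w

# Averaging w_t^2 over many seeds estimates E[Y_t] = E[(w_t - w*)^2],
# which should shrink roughly like O(1/t) under this schedule.
```

Averaging the squared distance over independent runs at, say, t = 200 and t = 2000 shows the expected roughly tenfold decrease, consistent with a Θ(1/t) rate.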
The proved lower bound is novel: it not only improves upon existing work but also confirms that the recently proposed step-size schemes are optimal. The proof of the theorem is very clever. The construction of the function class to be studied when only the strong-convexity and smoothness parameters are given is creative, and the way the optimal step-size sequence is derived is elegant.

However, I have a minor concern: in Line 344, you first draw \xi and then choose w_0 as the minimizer of f(w;\xi) so that Y_0 equals Tr(\Sigma). To me, this means either that we are studying a smaller algorithm class in which the initialization is restricted, or that we need a few steps to initialize, which should be counted in the total number of steps t. In the former case, the algorithm class does not include classic SGD, in which we can choose any w_0; in the latter case, the lower bound should be modified to account for those initial steps. Could you explain why this does not matter?

Typos:
1. "into what extend" --> "into what extent"
2. Line 39: "smaller" --> "larger"
3. "strongly objective functions" --> "strongly convex objective functions"
4. Line 105: "in [8] are equivalent..."
5. Line 376: "proof generalizes"