NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
POST-AUTHOR FEEDBACK: I thank the authors for their explanations, which addresses my concerns. I look forward to seeing the camera-ready version with improved results and the clarifications. ======= The paper is well-written, the technical results are precise and correct as far as I can tell, and the results are novel and interesting. However, I am slightly puzzled over a few points, some of which pertain to the interpretation of the results. I believe the paper would significantly benefit from clarifying the following points: * The authors refer to mirror descent with (fixed) weighted Euclidean distance as an adaptive gradient algorithm, but this connection is not clear to me. Adaptive gradient algorithms would imply a time variant conditioning matrix A_t, instead of a fixed A. Unless I am missing something, this distinction should be clarified. * The authors should bring out the intuition behind the importance of the quadratically convex geometry. The notion seems central to the results, but from the main text of the paper it is difficult to understand why it is so. The one place I can see it being used critically is in the proof of Proposition 4, in switching the order of inf and sup. If this is the critical step, this should be emphasized more clearly. * Am I correct to interpret that the switching of inf and sup in Proposition 4 would mean that an adaptive gradient algorithm (time-variant conditioning matrix A) is equivalent to a fixed one from minimax regret point of view, since the right-hand side allows for the optimization of the weights (\lambda) based on the parameter (\theta)? * For the results on arbitrary gradient norms, it is not clear whether Corollary 3 follows from Theorem 2 or Corollary 3. The conditions in the Corollary are looser than those in Corollary 2, but the language surrounding the corollary makes it sound like it follows from Corollary 2. * It is hard to interpret the bounds in Corollary 3. It would be illustrative to compare the upper and lower bounds for specific cases, and demonstrate the highest and lowest possible gaps between them, in order to get a sense of what the result means. * In the definition of minimax stochastic risk, the authors define F_P(\theta) as the expectation with respect to P, i.e., the data distribution. Right after the definition, they also state that the additional expectation in the definition is with respect to the data, which is distributed by P. This seems confusing to me, F_P(\theta) should already be a deterministic quantity. If this is a typo, it should be fixed, and if it is not, the differences of the two expectations should be clarified. * On p.3, in the definitions of M_n^S(\Theta, \gamma) and M_n^R(\Theta, \gamma), the supremum over the input set \mathcal{X} is taken *after* the infimum w.r.t. \theta_i. Since this is minimax regret, it is not clear why the supremum is taken before the infimum.
Reviewer 2
My concerns were addressed in the rebuttal. Look forward to seeing the accepted version of this paper. ------------------------------------------- This paper rigorously studies the interaction between the geometry of the constraint set and the geometry of the gradients of online and stochastic convex optimization methods. A complete characterization of necessary conditions for adaptive gradient methods with a diagonal re-scaling to be minimax rate-optimal was presented. The results offer concrete recommendations for when one should (not) use adaptive gradient methods to train her/his machine learning models. This paper is very well-written and well-motivated. It is of both theoretical importance as well as practical merits. Specifically, the established minimax rate-optimal results for stochastic/adaptive gradient methods are novel and interesting, and they certainly shed light on understanding the role of adaptive gradient methods in training machine (deep) learning models. As training deep neural models involves solving (highly) nonconvex optimization, it would be great to comment on the possibility of establishing such convergence guarantees for adaptive gradient methods in optimizing non-convex functions. Overall, this submission constitutes a very strong work connecting theory with machine/deep learning practice. It would be inspiring if numerical tests validating the theoretical claims can be included. Below are some minor comments. 1) Line 46. 'conclusions'--->conclusion. 2) Line 60. Typo in D_h((,x),y). 3) Please also define the notation like X_1^n and x_1^n. 4) Line 246. `]1,2]' should be [1,2].
Reviewer 3
============= After Author Response ==================== I have read the authors' rebuttal, my evaluation remains unchanged. =================================================== This paper shows the constraint set and gradient geometry matters a lot in the convergence (regret) of stochastic gradient methods, and provide concrete cases of when to use adaptive gradient methods in terms of constraint set geometry. The main results is 1) for quadratically convex sets, the adaptive gradient methods are minimax optimal; 2) for non-quadratically convex sets (e.g. Lp norm balls with p in [1,2]), any linear mirror descent methods are far from optimal. Such a sharp characteristics of minimax rates of stochastic gradient methods in terms of constraints geometry provide useful insights in understanding the practical behavior of adaptive gradient methods.