Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
In the theory of Gaussian estimation, a result of Donoho shows that if the parameter set is "quadratically convex" (QC), then linear estimators are minimax optimal. Conversely, there exist non-QC sets for which linear estimators fail badly to be minimax optimal. This paper exhibits a similar dichotomy (to use the authors' own language) for Adaptive Gradient (AG) methods that "rescale" the gradient with a fixed linear transformation. If the constraint set is QC (this includes \ell_2 balls and hyper-rectangles), then AG methods are minimax optimal. On the other hand, for scaled \ell_p balls with p < 2 in very high dimensions, AG methods can be very far from minimax optimal. This paper is about a class of widely used methods in ML, is nicely written, and connects theory and practice. It is a very strong contribution. The authors should take reviewer comments into account and make the changes promised in their rebuttal.
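To make the class of methods under discussion concrete, here is a minimal sketch (not taken from the paper) of an AG-style update that rescales the gradient with a fixed linear transformation H, i.e. x_{t+1} = x_t - eta * H * grad f(x_t). The objective, preconditioner, and step size below are hypothetical choices for illustration only.

```python
import numpy as np

def preconditioned_gd(grad, x0, H, eta=0.1, steps=100):
    """Gradient descent rescaled by a fixed linear transformation H.

    This is the generic template an AG method with a fixed
    preconditioner follows; H = I recovers plain gradient descent.
    """
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * H @ grad(x)
    return x

# Illustrative example: minimize f(x) = 0.5 * x^T A x on R^2,
# with a diagonal rescaling chosen to undo the ill-conditioning of A.
A = np.diag([1.0, 10.0])           # ill-conditioned quadratic
grad_f = lambda x: A @ x
H = np.diag(1.0 / np.diag(A))      # fixed diagonal preconditioner
x_final = preconditioned_gd(grad_f, np.array([1.0, 1.0]), H, eta=0.5, steps=50)
```

With this choice H @ A = I, so each step contracts the iterate by a factor 1 - eta and the method converges rapidly to the minimizer at the origin; the paper's point is that whether such rescaled methods are minimax optimal over a constraint set hinges on the quadratic convexity of that set.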