NeurIPS 2020

Why are Adaptive Methods Good for Attention Models?

Meta Review

There was a reasonable amount of discussion about this paper. The author feedback clarified a variety of issues, which caused some reviewers to increase their scores, while some of the discussion caused other reviewers to decrease their scores. Although there was one holdout, the majority of the reviewers leaned towards rejecting the paper. However, I believe this is one of the rare cases where the AC should recommend that the PC accept the paper against the recommendations of the reviewers. (And to be clear, I'm not doing this lightly: among the reviewers recommending rejection are optimization experts/stars whom I greatly respect.)

The main reason I'm recommending acceptance is the broader context and potential impact of the paper. The paper gives new insights into one of the most important and most misunderstood issues in the *practice* of optimization for machine learning. It also provides theory under non-standard assumptions to explain empirical observations that were not explained by previous theory, and in my opinion gives a creative solution to the observed issues (even if the new algorithm isn't dramatically different from existing methods). The number of people who could be helped by a better understanding of why Adam-related and clipping-related methods can/should be used to train language models is simply *much* bigger than the optimization community (more classes around the world teach LSTMs than teach strong convexity). I do not think that the paper should be rejected because of minor issues related to things we view as important in the optimization community, given the large potential impact of the work.

That being said, if the PCs agree with me, then there are two things I want to see from the authors in preparation of the camera-ready version (in addition to addressing the other review comments):

1. Fix the issues associated with making the "strong-convexity and gradient bounded assumption", which the reviewers correctly point out cannot hold over an entire real space. The standard/preferred correction is to add to the algorithm a projection step onto a compact, convex set. Another fix is to assume or prove that the iterates all stay inside such a set. Proving this would be strongly preferred, since I'm not sure we should make this assumption under the heavy-tailed noise assumption. (The reasons I don't view this as a crucial issue: this combination of assumptions is unfortunately pretty standard, and I think the paper would have been reviewed more favourably if the results relying on this combination simply hadn't been included.)

2. Two of the reviewers pointed out potential mistakes in the analysis, which were addressed in the author response. However, the authors should be careful to rewrite the analysis as clearly as possible. Further, I think the authors should share their work with several optimization-expert colleagues and ask them to check the analysis carefully. If critical issues are found, I hope that the authors do the right thing and withdraw the submission.
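
To make the projection fix in the first item concrete, here is a minimal sketch of what a clipped gradient step followed by a projection could look like. It is not the algorithm from the paper: the L2-ball constraint set, the function names, the constant step size, and the toy quadratic objective are all assumptions chosen for illustration. The point is only that projecting after every step keeps the iterates in a compact, convex set, so strong convexity and bounded gradients can hold together on that set.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip(g, tau):
    """Rescale g so that its L2 norm is at most tau (gradient clipping)."""
    n = l2_norm(g)
    if n <= tau:
        return list(g)
    return [x * tau / n for x in g]

def project_ball(x, radius):
    """Euclidean projection onto the closed L2 ball of the given radius
    centred at the origin (an assumed, illustrative constraint set)."""
    n = l2_norm(x)
    if n <= radius:
        return list(x)
    return [xi * radius / n for xi in x]

def projected_clipped_gd(grad_fn, x0, radius, tau, step_size, n_steps):
    """Clipped gradient steps, each followed by a projection onto the ball."""
    x = project_ball(x0, radius)
    for _ in range(n_steps):
        g = clip(grad_fn(x), tau)
        x = project_ball([xi - step_size * gi for xi, gi in zip(x, g)], radius)
    return x

# Toy strongly convex objective f(x) = 0.5 * ||x - c||^2 with c outside the ball.
# A deterministic gradient is used here; a stochastic (possibly heavy-tailed)
# gradient oracle could be passed as grad_fn instead.
c = [3.0, 4.0]
grad_fn = lambda x: [xi - ci for xi, ci in zip(x, c)]

x_final = projected_clipped_gd(grad_fn, [0.0, 0.0], radius=1.0, tau=1.0,
                               step_size=0.1, n_steps=500)
```

In this toy problem the constrained minimizer is the point of the unit ball closest to `c`, namely `c / ||c||`, and every iterate stays inside the ball by construction. Whether such a projection is compatible with the paper's heavy-tailed analysis is exactly what the authors would need to verify.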