NeurIPS 2020

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

Meta Review

Thank you for submitting your work to NeurIPS. All four reviewers were enthusiastic about the paper, and I am happy to accept it. In the final revision, please address the reviewers' feedback. In particular, please make sure to address Reviewer 2's remark: "authors argue that their results indicate that large learning rates do not generalize well, but a better presentation would be to say that they show that large effective learning rates generalize well." Indeed, it is something of a strawman argument to say that other researchers claim that a small LR never generalizes. Rather, the claim is that when comparing the effect of a large vs. a small LR (*keeping other hyperparameters fixed*), the network trained with the larger learning rate tends to generalize better. Finally, please also discuss a bit more clearly the relation to Please note that the authors also claim (without a clear theoretical argument) that LR*weight decay steers learning dynamics. It would also be useful for the reader to discuss (as concurrent work)