Summary: This paper provides new theoretical insights into non-convex problems for the Adam optimizer and for Yogi (the proposed optimizer). In particular, the non-convergence of Adam is addressed with an increasing batch size, and Yogi provides more control than Adam over the effective learning rate. The authors also present experiments comparing Yogi with Adam. The paper is very clearly written. The Yogi idea is original but not sufficiently motivated, and the work is not very significant due to certain weaknesses (see below).

Strengths:
1) The theoretical results for non-convex problems for Adam (actually RMSProp) and Yogi are interesting.
2) Many experiments are presented, though they are not sufficient (the results can be biased by hyperparameter selection).

Weaknesses:
1) Considering only g_t rather than m_t in the update step (in the Appendix) makes the update step closer to RMSProp; in fact Adam is exactly RMSProp when beta_1 = 0 (excluding the debiasing step), as spelled out in the sketch after this review. So there is a mismatch between theory and experiments, and I would prefer that the algorithm be called RMSProp rather than Adam in the theoretical results. In the discussion the authors mention that they set beta_1 = 0.9 inspired by the theoretical analysis, but in the theory they set beta_1 = 0. This needs to be clarified and made consistent.
2) There is no proper motivation for the update step; the explanation given is very brief. It would be better if the authors delved further into what makes the algorithm special, for instance by explaining the different cases of v_{t-1} versus g_t^2 in which Adam fails and Yogi does better (see the sketch after this review). Since the modification to the update step is minor compared to RMSProp, this is a critical issue. The formula itself does not look like a natural choice, so it needs to be justified.
3) In the theoretical results (for both Adam and Yogi), there is an O(G^2/\epsilon^2) term as the coefficient of the variance, which can be huge when epsilon is small, and epsilon is usually very small. It would be nice to see whether the dependence on epsilon can be reduced; in this respect there is no improvement over the literature.
4) The experiments section is weak: there is no principled rule for the choice of the learning rate and \epsilon. One should choose the tuned hyperparameters from a (not necessarily large) set; otherwise one cannot be sure whether the performance is due to hyperparameter selection or to genuine gains.

Improvements:
1) What about the debiasing results? At least brief comments on the resulting changes in the theory.
2) Some theoretical insights when beta_1 \neq 0, for both Adam and Yogi.
3) Relation to Adagrad, if v_{t-1}
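For reference, here is a minimal sketch of the updates as I read them (my paraphrase of Algorithms 1 and 2, not the authors' exact statement); this is what weaknesses 1) and 2) refer to:

```latex
% Minimal sketch (requires amsmath); my paraphrase of the updates in
% Algorithms 1 and 2, not a quote from the paper.
\begin{align*}
\text{Adam, } \beta_1 = 0:\quad
  & m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t = g_t,\qquad
    x_{t+1} = x_t - \eta_t\,\frac{g_t}{\sqrt{v_t}+\epsilon}
    \quad\text{(exactly RMSProp)}\\
\text{Adam:}\quad
  & v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
        = v_{t-1} - (1-\beta_2)\bigl(v_{t-1} - g_t^2\bigr)\\
\text{Yogi:}\quad
  & v_t = v_{t-1} - (1-\beta_2)\operatorname{sign}\bigl(v_{t-1} - g_t^2\bigr)\, g_t^2
\end{align*}
% Case v_{t-1} > g_t^2: Adam shrinks v_t in proportion to the (possibly large)
% gap v_{t-1} - g_t^2, so the effective learning rate \eta_t/(\sqrt{v_t}+\epsilon)
% can grow quickly, whereas Yogi shrinks v_t by at most (1-\beta_2) g_t^2.
% Case v_{t-1} < g_t^2: both increase v_t, Adam by (1-\beta_2)(g_t^2 - v_{t-1})
% and Yogi by (1-\beta_2) g_t^2.
```

Spelling out this case analysis in the paper would go a long way toward motivating the Yogi update.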
Reviewer 3
This paper proposes a convergence analysis of ADAM under certain parameter settings and proves convergence to a stationary point for nonconvex problems. Moreover, the authors propose an adaptive SGD method (YOGI) and evaluate it on different learning models. Generally, the studied problem is interesting and the proposed algorithm demonstrates good performance on the problems tested. However, some important issues are not clearly addressed in the current manuscript:
1. The paper states that YOGI controls the increase in the effective learning rate, but it is not clear how the update actually impacts the learning rate (a small numerical sketch is given after this review).
2. The convergence proof for ADAM only considers the case $\beta_1 = 0$. It is therefore unclear whether the proof extends to $\beta_1 \in (0,1)$.
3. There are some typos and improper presentations in the Appendix. For example, in the first line of page 12, the update rule for $v_{t,i}$ of YOGI does not match the update rule in Algorithm 2.
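To illustrate point 1, here is a toy numerical sketch (my own, not from the paper; the constants beta2, eps, eta, g and the update formulas are assumptions based on my reading of Algorithms 1 and 2) of how the effective step size eta / (sqrt(v_t) + eps) behaves once gradients become small:

```python
import numpy as np

# Toy sketch (not the authors' code): compare how the effective step size
# eta / (sqrt(v_t) + eps) evolves for Adam vs. Yogi after gradients shrink,
# assuming the second-moment updates
#   Adam: v_t = v_{t-1} - (1 - beta2) * (v_{t-1} - g_t^2)
#   Yogi: v_t = v_{t-1} - (1 - beta2) * sign(v_{t-1} - g_t^2) * g_t^2
beta2, eps, eta = 0.999, 1e-3, 1e-2
v_adam = v_yogi = 1.0   # large second-moment estimate left over from large gradients
g = 1e-3                # gradients are now small

for t in range(1, 5001):
    g2 = g ** 2
    v_adam -= (1 - beta2) * (v_adam - g2)
    v_yogi -= (1 - beta2) * np.sign(v_yogi - g2) * g2
    if t % 1000 == 0:
        print(t,
              eta / (np.sqrt(v_adam) + eps),   # grows quickly: v decays geometrically
              eta / (np.sqrt(v_yogi) + eps))   # nearly constant: v decays additively
```

Under this reading, Adam's v_t decays geometrically toward g_t^2, so its effective learning rate can blow up, while Yogi's v_t changes by at most (1 - beta2) g_t^2 per step; a sentence or two along these lines in the main text would answer the question.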
Reviewer 4
This paper studies adaptive gradient methods (e.g., ADAM) for nonconvex stochastic problems. Specifically, the authors prove convergence to a stationary point using an increasing batch size. They also propose a new algorithm, YOGI, and achieve the same convergence result. I have some concerns about the "increasing minibatch size".

1. The authors claim that an increasing batch size leads to convergence. This may be ambiguous: from Corollaries 2, 3, 4 and 5, we can see that the authors use a large, fixed batch size of order O(T), where T is the number of iterations. In my opinion, an increasing batch size means a batch size that grows during the iterations, which may be more practical than a fixed, large batch size. Of course, it may also be more challenging to prove convergence with a genuinely increasing batch size.

Comment after author rebuttal: the authors said that "the same theoretical guarantee can be obtained for increasing minibatch b_t = Θ(t)". I am not convinced of this. I hope the authors will include a detailed proof for the case b_t = Θ(t) in the final version. If the analysis only applies to the case b_t = Θ(T), please state this clearly and do not mislead the readers.

2. More seriously, the theory in this paper requires the batch size to be $b = O(T)$. In practice, we may run the algorithm for several epochs, e.g., 100 epochs. Let N be the sample size; then T = O(100N). However, the batch size we often use is much smaller than N, e.g., b = 100, in which case b = O(T) is an impractical requirement. I think this issue is not due to a weakness of ADAM, but to a weakness of the proof in this paper.

Comment after author rebuttal: I made a mistake in the previous comment: T should be O(100N/b), and b = O(T) then leads to b = O(10\sqrt{N}). This setting is acceptable in practice.

From Table 1 in "Neon2: Finding Local Minima via First-Order Oracles", we know that SGD needs O(1/\delta^4) iterations to find x such that $||\nabla f(x)|| \leq \delta$. The complexity proved in Corollary 3 is O(1/\delta^2) iterations to reach $||\nabla f(x)|| \leq \delta$. I do not think ADAM has a faster theoretical convergence rate than SGD, so this may indicate that the paper has made some unreasonable assumptions.

Comment after author rebuttal: The authors persuaded me. Since the minibatch is O(T) and the SFO complexity measures the total number of stochastic oracle calls, the final complexity of ADAM is O(T^2) = O(1/\delta^4) (see the short calculation after this review). I hope the authors will make this clear in the final version and not mislead readers into thinking that ADAM is theoretically faster than SGD (they have the same theoretical complexity in the convex case).

Corollary 1 in this paper is also ambiguous. I have read the introduction of [8]. [8] proved that their method needs O(1/\epsilon^2) iterations to find x such that $||\nabla f(x)||^2 \leq \epsilon$. However, from Corollary 2, we know that ADAM needs O(1/\epsilon) iterations to find x such that $||\nabla f(x)||^2 \leq \epsilon$. Thus, the comparison in the section "Discussion about Theoretical Results" is not correct.

Comment after author rebuttal: The authors persuaded me; please see the comment above.

In conclusion, the study in this paper may be significant in practice, but the theoretical analysis is not satisfactory.

Final comment after author rebuttal: I am pleased with the theoretical analysis now. I do not evaluate the experiments due to lack of experience.
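For concreteness, here is the complexity bookkeeping behind my rebuttal comment above, a minimal sketch under my reading that b = Θ(T) and that Corollary 3 gives T = O(1/\delta^2) iterations:

```latex
% Minimal sketch (requires amsmath); my own bookkeeping, assuming b = \Theta(T)
% and the T = O(1/\delta^2) iteration bound of Corollary 3.
\begin{align*}
\text{SFO complexity}
  \;=\; b \cdot T
  \;=\; \Theta(T) \cdot T
  \;=\; \Theta(T^2)
  \;=\; O\!\left(\frac{1}{\delta^4}\right)
  \quad\text{to reach } \|\nabla f(x)\| \le \delta,
\end{align*}
% i.e., the same total number of stochastic gradient evaluations as SGD's
% O(1/\delta^4); the improvement is only in the number of iterations.
```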