Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper studied the local updates with periodic averaging to solve distributed non-convex optimization problems, and provided an improved bound over the number of required local updates. The theories are validated through experimental results. Overall, I feel the paper contains new results and insights, and is well-organized. I have comments in the following three aspects. Originality: 1. The upper bound of the required local updates is new, which improves the existing one by a constant order. And it is good that the paper also explained why this improvement can be achieved. 2. The adaptive communication rule also seems to be new, but it needs further explanation. Specifically, only the convergence is proved under the adaptive communication rule, which is not that surprising. Is there any theoretical improvement? 3. How does the adaptive communication rule relate to/differ from the adaptive communications rules in the following work? Chen, Tianyi, Georgios Giannakis, Tao Sun, and Wotao Yin. "LAG: Lazily aggregated gradient for communication-efficient distributed learning." In Advances in Neural Information Processing Systems, pp. 5050-5060. 2018. Wang, Jianyu, and Gauri Joshi. ``Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD." arXiv preprint:1810.08313, Oct. 2018. 4. It is mentioned that under the current analysis, the logarithmic number of communications may be not possible, it is encouraged to comment on whether the bound is tight or match its lower bound. ==== Rebuttal ==== I have carefully read the rebuttal. Thanks.
The paper tightens the analysis of local SGD that periodically averages the models at different nodes. It improves the bound on the communication rounds suffice to achieve linear speedups and relaxes some of the assumptions used in previous works. The authors also develop an adaptive scheme to choose the communication rounds based on the intuition from their theoretical results. They also provided empirical results on logistic regression problem with epsilon dataset to support the theoretical results. Comments: The paper is well written and east to follow. Although, I haven't checked all the proofs in detail, I believe one can definitely relax the assumptions such as bounded gradients and variances in the analysis which is standard in analyzing centralized sgd and distributed gradient descent methods (smooth problems case). While the authors consider non-convex problems, they restrict the analysis to the ones that satisfy P-L condition which is very similar to that of strong convexity (considered in previous works) and the analysis for both these cases would be exactly same if same assumptions are used. The main theoretical result on communication rounds is better than the previous result in terms of the dependence on total iterations (T) and the number of workers (p). While the motivation for the adaptive scheme is convincing, the practicality of such a scheme after approximations is questionable. For example, given that this is a stochastic problem, do we have access to evaluate functions F in eq 10. If one follows a different scheme, then what are the criteria on which one can select $tau$'s? Like on line 245-246 on page 6? If so, is there an optimal strategy? The empirical results are promising; however, I believe a more thorough numerical investigation would help the theoretical claims. For example, while the communication plots in figure 2 are understandable, it is surprising that syncSGD and LUPA have similar performance in terms of iterations. The theoretical rates might be similar for both of them, but the constants are different. This raises the question of whether or not the learning rate is tuned separately for each of them. If not, then it would not be a fair comparison. Also, more experiments on the same problem, and other problems would strengthen the paper. On another note, there is a lot of work on efficient communication strategies in distributed network settings. One way to look at this problem is as a particular case of fully connected networks with delayed communications. After Authors' Response: I have carefully read the response. Thanks
This work presents a much stronger result than the state-of-the-art analysis , and should be interesting to stochastic optimization community. The idea is simple and easy to follow. I have only one minor concern: It is desirable to discuss or empirically verify whether the established bound is optimal. Typos: line 423 & 425 `left-hand side’à `right-hand side’