Paper ID: | 1862 |
---|---|

Title: | Communication-Efficient Distributed Learning via Lazily Aggregated Quantized Gradients |

============= After Author Response ==================== I have read the authors' rebuttal, my evaluation remains unchanged. =================================================== This paper proposed lazily aggregated quantized gradient methods for communication-efficient distributed learning. The main idea is to combine lazily aggregated gradient and quantized gradient into a single framework to further save communication cost in distributed learning problems. The proposed method looks reasonable, and convergence analysis is provided showing that linear convergence is achieved under certain conditions for strongly convex and smooth functions. Empirical results were provided showing that the proposed approach improves both quantized gradient methods or lazily aggregated methods. My main concern on the theoretical analysis, which is done only on strongly convex and smooth functions, also the dependencies on the condition numbers (smoothness and strongly convex) are hidden in the main paper. By looking at the supplemental materials, it appears that the condition number dependency are quite bad and being much worse than standard gradient methods.

The paper extends the lazily aggregated gradient (LAG) approach by applying quantization to further reduce communication. In the original LAG approach, workers only communicate their gradient with the central coordinator if it is significantly different from its previous one. In this paper, the gradients are compressed using quantization and workers skip communication if their quantized gradient does not differ substantially from previous ones. For strongly convex objectives, the paper proves linear convergence. The paper is very well written and the approach is clearly motivated, easy to understand, and discussed in the context of related work. Although the proposed approach is a straight-forward extension of the LAG approach, the idea is sound and theoretically evaluated. The theoretical analysis is performed using Lyapunov functions that measure not only the loss difference between an iterate and the optimal model, but also include past gradient differences and the quantization error. The relation of this measure to the simple (expected) risk should be discussed in more detail. Also, this analysis does not apply to neural networks (for the proofs, the authors assume strong convexity). The empirical analysis shows a substantial communication reduction over gradient descent, quantized gradient descent and LAG, while retaining the same test accuracy. My only concern is that the experiments have been only conducted on MNIST. There, comparable accuracies can be reached easily (e.g., by averaging aggregates only once at the end, as in [1]). Thus, it is unclear how the approach performs in practice. Overall the paper is a nice extension to the LAG approach with a solid theoretical analysis. Despite my concerns about the empirical evaluation, I think the paper is a nice contribution to the conference. Detailed comments: - the approach is motivated by the need to reduce the number of worker-to-server uplink communication, because of the latency through sequential uploads. However, in [4] it was shown that asynchronous updates work in practice. It would be interesting to discuss this approach with respect to quantization, since this might lead to more collisions and thus render [4] infeasible. - it would be very interesting to see how LAQ compares to randomly selecting workers that upload (as in [2]). In this light, it would be also interesting to discuss the advantages of LAQ over approaches that dynamically silence all workers (as in [3]). - the proof of theorem 1 was quite difficult to understand and the Lyapunov-functions are not very intuitive to me. Thus it would be great if the authors could add more intuitive explanations to the proof in the supplementary material. - the authors claim in the reproducibility checklist that their code is available but I couldn't find a link to it in the paper. - typo in line 66: "see also" [1] Zinkevich, et al., "Parallel Stochastic Gradient Descent", NIPS 2010 [2] McMahan et al. "Communication-Efficient Learning of Deep Networks from Decentralized Data", AISTATS 2017 [3] Kamp, et al. "Efficient Decentralized Deep Learning by Dynamic Model Averaging", ECMLPKDD 2018 [4] Recht, et al., "Hogwild: A lock-free approach to parallelizing stochastic gradient descent", NIPS 2011 The authors responded to all my comments and I remain convinced that this paper would be a good contribution to the conference.

The authors consider the problem of reducing the amount of communication for learning in a parameter-server setting. The approach is based on a combination of quantization of new gradient information and skipping updates when innovations are not considered informative enough. In particular, it combines the idea of lazily aggregating gradients (NeurIPS 2018) with quantization of the gradient components. The combination of quantization of elements and skipping updates is interesting and potentially useful. However, both the paper and the scheme itself has some limitations which would be nice to have addressed. 1. The scheme requires additional memory, both at the server (to compute update direction in absence of new information) and at the workers (e.g. to evaluate the transmission triggering conditions). Please quantify and comment on these clearly in the paper. 2. Many critical tuning parameters, such as the step-size alpha and the Lyapunov parameters xi are quite complex to tune. Although the authors provide simple parameter selections that satisfy the complex formulas, these parameters are not used in the numerical experiments. Please use the theoretically justified parameters in your experiments (at least as a complement to the present one) and comment on whether the parameters used in the numerical experiments satisfy the theoretical bounds. It would also be nice with some intuition of how you select D. In addition, it would be useful to have some discussion about how sensitive your numerical results are to the different non-trivial parameter choices. 3. It would be nice to discuss the increase in total iteration counts (as a surrogate of wall-clock time) vs the decrease in total communication load. Although this is difficult in general, with your simplified step-size expression, you should also get a simple value for the contraction modulus. You could then compare this one with the one we get for (compressed or uncompressed gradient descent). 4. In Theorem 1, you prove that V decays, but what about f or gradient norm? Can you say something at all about these? 5. Since your quantization scheme is memory-based, I think that you should also compare it against error-compensated gradient quantization schemes (e.g. Alistarh et al., NeurIPS 2018). 6. Only the strongly convex case is analyzed, so the results do not in general apply to Neural Network training. Could you also analyze the non-convex case? If not, I am not sure that it makes much sense to have the simple one-layer ReLU network in the numerical experiments. 7. Some equation references point to supplementary (this is probably a simple LaTeX compilation error, but please revise!) ** Update after rebuttals **. Thank you for your reply. You have addressed most of my concerns and I believe that the numerical experiments consistent with the theory and your refined theoretical analysis will make the final version of the manuscript even better.