NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019 at the Vancouver Convention Center
Paper ID: 8295
Title: Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification and Local Computations

Reviewer 1

UPDATE: I thank the authors for their detailed response to reviewer feedback. Overall I believe this is a quality paper, and I would be happy to see it published at NeurIPS. This is reflected in my current assessment, which remains unchanged.

Previous studies have contributed three classes of methods for avoiding gradient communication bottlenecks in distributed SGD: (1) communicating updates less frequently, (2) communicating fewer updates (random-k or top-k), and (3) communicating quantized updates. There is also a known method for keeping track of the induced error in order to compensate for it later. The authors build upon these previous works by combining all three methods into a unified algorithm (QLSGD), showing that their benefits can be combined. They additionally show that the error-compensation mechanism is more broadly applicable than (2), the main context in which it has previously been used. To demonstrate the convergence properties of QLSGD, the authors further needed to generalize some tools and statements from the prior literature, in some cases with improvements (e.g., removing an unnecessary technical assumption from a NeurIPS paper from last year). These generalizations are not stand-out contributions in isolation, but they do add to the overall strength and quality of the work.

Pros: The authors are to be commended for combining a strong theoretical analysis with experimental results on a practical distributed SGD problem (ResNet-50 on ImageNet). As far as I can tell, their results represent a new state of the art for bits transmitted in a master-based distributed SGD setup.

Cons (all minor points): The authors focus entirely on worker -> master communication complexity, but never discuss the reverse direction (broadcasting updated network weights). Tackling this issue directly is not necessarily within the scope of this study, but it certainly warrants discussion. Even a 100x improvement for half the communications is "only" a 50% speed-up.
Additionally, the authors should take note of the related work from ICML this year, "Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication". It appears to already address two of the three components the authors propose in this study, which detracts slightly from the overall novelty. I further refer the authors to Ofer Dekel's SysML 2019 keynote, where he describes at length why claims of the form "we achieve X% cost reduction for only Y% loss in accuracy" are flawed and very difficult to interpret. I largely agree with his opinion on the matter.
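The combination the review describes (top-k sparsification, sign quantization, and error compensation carried across rounds) can be illustrated with a minimal NumPy sketch of a single worker's compressed update. The function names and the particular scaled-sign compressor below are illustrative assumptions on my part, not the paper's exact algorithm:

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def sign_quantize(v):
    """Scaled sign quantization: transmit sign(v) plus one scaling factor."""
    return np.sign(v) * np.mean(np.abs(v))

def compressed_step(grad, memory, k):
    """Error-compensated compression of one worker's update: compress
    (grad + accumulated error) and keep the residual in memory."""
    corrected = grad + memory
    update = sign_quantize(top_k(corrected, k))
    memory = corrected - update  # residual carried into the next round
    return update, memory
```

By construction `update + memory` always equals the error-corrected gradient, which is the invariant that makes the compensation mechanism work regardless of which compressor is plugged in.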

Reviewer 2

*author feedback* I thank the authors for addressing my comments. I am willing to increase my score to 5, and I hope the authors will do a thorough pass over the paper for the final version, if accepted. A better description and discussion of the experimental results must be added, and I would also recommend simplifying the presentation and adding more intuition about the challenges faced in the analysis to the main part of the paper, in order to put more focus on its contributions.

*summary* The authors propose and analyze a new algorithm, called Qsparse-local-SGD. It combines three known concepts for achieving communication efficiency, namely sparsification, quantization and local computation, into a unified algorithm. The proposed algorithm is analyzed theoretically and evaluated for training ResNet-50 on the ImageNet dataset.

*comments*

[Presentation] The paper is generally well written and easy to follow. I like the introduction with the related work, where the differences to (closely) related work are explained in detail and effort is put into defining the contributions and limitations of the proposed method. The presentation of the paper is a bit dense, and a lot of equations and enumerations are put inline. This could be improved by adding a bit more structure. An additional 9th content page could help.

[Novelty] The novelty of the paper is the combination of existing techniques into one algorithm. This comes with some theoretical challenges, as the authors say. Since this is the main contribution of the paper, it would be nice to have more technical detail on what these challenges are in particular and how they are solved. While the combination of existing methods is a valid contribution, I would like to see more thorough experimental results showing the merit of the individual components w.r.t. the overall performance. For example, it would be interesting to see what the gain of using quantization on top of sparsification is.
How do the quantization level s and the sparsification level k play together? How would one tune these values?

[Experiments] The experimental section is difficult to follow and important details are missing. This needs to be improved, in particular:
- You use sign quantization and top-k sparsification, but what is the synchronization schedule you used to distribute the work? How many local steps are performed on each worker before synchronization?
- The figures and the reference schemes need to be explained better. What do the individual curves correspond to? What do the numbers in the legend mean? Which curve corresponds to local SGD without quantization or sparsification? This has to be stated in the main part.
- Did you tune the parameters of the individual methods to make the comparison fair?
- Please point the reader to the appendix where additional experiments or supporting material can be found.

Minor comments:
- The definition of b is hidden in the algorithm; it should be somewhere in the text.
- In line 176 you refer to assumptions (i), (ii), whereas you use (i) for enumeration in multiple places. Use (A1), (A2), or something else so they stand out.

*Summary* The paper is generally well written, apart from the experimental section, which needs to be improved. The contribution consists of the combination of existing techniques and their analyses. This is fine, but for the contribution to be strong enough, I think the technical challenges should be emphasized and explained better, and the merit of the individual components, as well as their tradeoffs, should be evaluated more carefully.
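The interaction between k and the quantization that the review asks about can at least be sketched on the communication side with a back-of-the-envelope bit count, comparing top-k alone against sign quantization applied on top of top-k. The 32-bit float values, one shared scaling factor, and naive log2(d) index encoding are my assumptions here, not the paper's encoding:

```python
import math

def bits_topk(d, k):
    """Top-k alone: k indices (ceil(log2 d) bits each) + k 32-bit values."""
    return k * (math.ceil(math.log2(d)) + 32)

def bits_sign_topk(d, k):
    """Sign quantization on top of top-k: k indices, k sign bits,
    plus one shared 32-bit scaling factor."""
    return k * (math.ceil(math.log2(d)) + 1) + 32

# Hypothetical ResNet-50-scale model with 1% sparsity.
d, k = 25_000_000, 250_000
print(bits_topk(d, k) / bits_sign_topk(d, k))
```

Under these assumptions, adding sign quantization on top of top-k cuts the per-round payload by roughly 2x, since the per-coordinate cost drops from index-plus-float to index-plus-one-bit; the index bits then dominate, which is why further gains would require a cheaper index encoding rather than a coarser quantizer.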

Reviewer 3

I would like to thank the authors for their feedback. I hope you can add more details about your experiments in the revised version. The score will be unchanged.

---------------------------------------------------------------------------------------------------------

This is a good work and the authors are quite familiar with this research area, but the experimental evaluation is not adequate. I have several questions:
1. In Algorithm 1, why are x_0 and \hat{x}_0^{(r)} both initialized to 0? x_0 and \hat{x}_0^{(r)} are model parameters; setting them to 0 is unreasonable.
2. In the experimental part, what does SignTopK_4L mean? You should describe your experiments in more detail.
3. More experimental results should be given, such as a sync vs. async comparison, and the effect of different sparsification strategies or quantization methods.

Reviewer 4

a) My first concern is about novelty. While the paper is a good combination of existing techniques, no individual part is interesting enough on its own.
- The motivation to combine these techniques is straightforward; the authors do not offer many insights or interesting observations on this point.
- Thus, I expected more interesting things to be discovered from the mathematical analysis: for example, whether the authors can propose some new analysis methods/tools, or stronger results compared with existing ones.
- While the authors have done a really solid analysis of the proposed work, I fail to find the desired interesting points mentioned above.

b) The presentation can be improved.
- I do not see many new things in Sections 3.2 and 4.1; these sections are not necessary.
- The authors give a summary in Appendix E; I hope they can move this table to the main text. Besides, it would be better to add a comparison between the related methods used in the experiments. This would help readers understand how tight the given bound is compared with other distributed SGD methods.
- Appendix D should be moved into the main text in Section 5; these results are necessary for ablation studies. Deep models are indeed popular, but they are not a good testbed for verifying the obtained theoretical results, as the assumptions can hardly be satisfied.