NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
The paper is clearly written, and the method seems to work very nicely. The authors compare their method only to signSGD and signum, which appear to lose a lot of accuracy compared to plain SGD and SGDM. However, looking at the QSGD paper, there does not seem to be such a loss in accuracy against pure SGD. Can the authors compare their method to QSGD?

The error-feedback mechanism also appeared in [14] for a distributed setting. There is also the recent work below, which applies a very similar error-correction mechanism to QSGD:

J. Wu, W. Huang, J. Huang, and T. Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. In ICML, 2018.

These works seem to apply the same error-correction mechanisms. Can the authors distinguish their work from this one? Compare to it?

Post rebuttal: the authors have addressed most of my concerns. In light of the rebuttal and the other reviews, I upgrade my score to 7.
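For readers unfamiliar with the mechanism the reviewer refers to, below is a minimal single-worker sketch of the error-feedback (error-compensation) idea that [14], Wu et al., and the paper under review all build on: the residual left behind by compression is stored and added back to the next update before compressing, so nothing is permanently discarded. This is an illustrative reconstruction under assumed choices (a scaled-sign compressor, a plain SGD update), not the paper's exact dist-EF-SGD algorithm.

```python
import numpy as np

def compress(v):
    # Illustrative scaled-sign compressor (in the spirit of signSGD-style
    # compressors); any contraction-type compressor could be used instead.
    return np.sign(v) * np.mean(np.abs(v))

def ef_sgd_step(w, grad, error, lr):
    """One error-feedback SGD step (single worker, for illustration only)."""
    corrected = lr * grad + error   # add back last step's compression error
    delta = compress(corrected)     # only this compressed update is communicated
    error = corrected - delta       # keep the new residual locally
    w = w - delta                   # apply the compressed update
    return w, error
```

Roughly speaking, the distributed variants in these works differ in which compressor is used and where (worker and/or server) the residual is maintained.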
Reviewer 2
Originality: The application of compression and momentum to the parameter-server setting is fairly original. The theorems established could be further adapted to other communication settings, including non-server ones such as binary-tree allreduce or ring allreduce.

Quality: Ideas and theorems are thoroughly well explained. Experiments are done at a reasonable scale, though the number of repeat trials is too small to establish reproducibility, and it would have been nice to see more deep networks and datasets. Overall above-average quality, but the experiments could be more thorough.

Clarity: The paper is well written, very clear, and easy to follow.

Significance: As mentioned in my earlier comment about originality, I think this paper lays stepping stones towards analyzing how compression and momentum can benefit communication settings beyond the parameter server. High significance.
Reviewer 3
The novelty of this paper is limited, and some important experiments needed to support your claims are missing.

1. Some citations in this paper are not appropriate, while some important citations are missing. For example, when you review DNNs' success in different fields, why is [22] cited? You should also cite the following relevant papers:

[1] D. Povey, X. Zhang, and S. Khudanpur. Parallel training of DNNs with natural gradient and parameter averaging. arXiv preprint arXiv:1410.7455, 2014.
[2] K. Chen and Q. Huo. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In ICASSP, 2016.
[3] J. Chen, X. Pan, R. Monga, et al. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.
[4] C.-Y. Chen, J. Choi, D. Brand, et al. AdaComp: Adaptive residual gradient compression for data-parallel distributed training. In AAAI, 2018.

2. Is it necessary to compress the gradients passed from the server to the workers? Since dist-EF-SGD is a synchronous algorithm, all_reduce can be used instead of reduce & broadcast. Leveraging the all_reduce operation, every worker can play the role of the server, and there is no longer any need to send gradients from the server to the workers. Have you compared the performance of these two implementations? (The two communication patterns are sketched after this review.)

3. Your blockwise compression is based on the claim that "...gradients in a deep network typically have similar magnitudes in each layer". Can you provide experimental evidence for this claim?

4. Dist-EF-BlockSGD should be compared with dist-EF-SGD to show the effect of the blockwise compressor.

5. The introduction of Nesterov momentum is one of your claims, but the experimental results show that dist-EF-SGD benefits little from momentum. Can you give an explanation?
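For reference, here is a minimal sketch of the two communication patterns contrasted in point 2, written against torch.distributed purely for concreteness; the paper itself may use a different framework, and the function names and averaging choices below are illustrative assumptions rather than the authors' implementation.

```python
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called on every worker.

def reduce_then_broadcast(grad, server_rank=0):
    """Parameter-server style: reduce to one rank, then broadcast back.
    The server-to-worker compression questioned in point 2 would sit
    between the reduce and the broadcast."""
    dist.reduce(grad, dst=server_rank, op=dist.ReduceOp.SUM)
    if dist.get_rank() == server_rank:
        grad /= dist.get_world_size()      # only the server holds the valid sum
    dist.broadcast(grad, src=server_rank)  # send the aggregated result back
    return grad

def allreduce_average(grad):
    """all_reduce alternative: every worker ends up with the averaged
    gradient, so there is no separate server-to-worker step left to compress."""
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()
    return grad
```

For point 3, one way to gather the requested evidence would be a per-layer pass over the gradients after a backward step; the loop below is only a sketch, where `model` is assumed to be any PyTorch module on which `loss.backward()` has been called.

```python
def per_layer_grad_stats(model):
    """Report the spread of |grad| within each parameter block, to test the
    'similar magnitudes in each layer' claim empirically."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            g = p.grad.detach().abs()
            print(f"{name}: mean={g.mean().item():.3e}  "
                  f"std={g.std().item():.3e}  max={g.max().item():.3e}")
```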