Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper is well written, and the method seems to be working nicely. The ideas and methodology are good. The main issue is that the experiments are somewhat weak. There are no tests done on common networks, and use a small somewhat "engineered" fully connected network only. Indeed, QSVRG showed similar test problems, but that paper is from 2013. Also, in QSVRG the authors considered two data sets – CIFAR10 and MNIST, and not only MNIST. After rebuttal: the authors have somewhat addressed my concern about the weak experiments. Trying Resnet18 on CIFAR10 is nice, but is considered a toy example. Still, this much improves the quality of the results. In light of this, and the other reviews I upgrade my score to 7.
Originality: Algorithms proposed are not completely new, though they are novel combinations of well-known techniques. The theoretical analysis is an important contribution, but it was unclear to me that there were technical breakthroughs instead of mainly building on approaches in prior work. Update after reading author feedback: - Thanks for running / reporting the experiments that I requested. I was a little disappointed that the scalability was underwhelming -- doubling the number of workers from 4 to 8 only improved running time by ~20%. - After also taking into account the other reviewers' comments, I would still support accepting the paper, but at a slightly lower score. Quality: Presentation of multiple algorithms with proper theoretical analyses is pretty much complete. Empirical evaluations could be more extensive. Clarity: Overall the writing was clear, with an easy-to-follow presentation of algorithms that build on top of the previous algorithm. I did get a little lost with some of the notations, but that is understandable for a theory-focused paper. Significance: I rate the overall significance of this paper as high, for demonstrating both theoretically and empirically that asynchrony, quantization, sparsification, and acceleration can be combined together into a single distributed optimization algorithm. My only complaints are with the empirical evaluations. 1. The greatest issue I have is that there is only 1 evaluation / plot (Figure 3b) on total training time. My understanding is that the motivation for all this work is to reduce the training time to epsilon loss, by reducing communication while not impacting convergence rate. With only 1 plot on total training time for 1 run using 1 dataset on 1 model / problem, I am unable to conclude that the proposed algorithms can robustly accomplish the goal of reducing total training time. 2. I also thought that the experiments did not address the question of scalability. Only one setting for number-of-workers was used. I do expect that the proposed algorithms will scale well; nevertheless, I would have expected this evaluation to be done for a paper on *distributed* optimization. 3. Finally, more extension experiments could have been done -- more runs to get error bars, more datasets, more problems, etc. -- to more convincingly establish the robustness of proposed approaches over competitor algorithms.
The paper addresses the communication bottleneck in distributed optimization and proposes double quantization for mitigating this. The approach is not particularly novel as many papers have proposed quantization or compression methods. Here are some detailed points: - The technique is not new and indeed there are important references missing. For example, there should be comparison with the paper "signSGD: Compressed Optimisation for Non-Convex Problems" by Bernstein et al. which considers exactly the same problem. - The main theoretical issue is that the results are based on an unbiased quantizer and this is indeed crucial for the proofs. The low-precision quantizer is only unbiased in the domain dom(delta,b). - The motivation behind quantizing from master to workers is less clear. Open-MPI has a broadcast function and the cost is not the same as N unicasts. - The numerical results on 6 servers are quite limited and the achieved gain is not impressive. - To achieve convergence, model parameter quantization should have a vanishing error which means less and less quantization as iterations increase. What happens to total number of bits then for reaching a desired accuracy? The relationship of parameter mu and number of bits is unclear. To conclude, after the rebuttal I am more convinced about the contributions of this paper. I still find the theoretical novelty of the work incremental; however, the algorithmic novelty and numerical results are quite good, so I increase my score to 6.