Summary and Contributions: The paper proposes adaptive on-the-fly quantization of gradients to reduce communication in distributed neural network training. Unlike previous work, which quantizes to fixed discrete values, the paper optimizes those discrete values during training based on the evolving statistics of the gradient distribution. The problem formulation is sound and the method is well studied theoretically. However, the experiments and comparisons should be improved to strengthen the paper.
Strengths: 1. the method is well motivated (optimized quantization values and dynamic update of those values when the gradient distribution shifts); 2. theoretical study is sound.
Weaknesses: The most significant limitation of this work is its experimental section. The following points should be addressed to support the claims: 1. Why is there a large gap between SGD and SuperSGD? Generally, the accuracy should be close whether the model is trained on a single GPU or on multiple GPUs. Was it because different total mini-batch sizes were used? 2. Clarify if/how weights are communicated through the network, and how gradients are exchanged. 3. In Figure 7(a), the validation accuracy drops when the bucket size is very small. This is counterintuitive. When the bucket size is smaller, the variance is smaller, and thus the accuracy should be closer to that of full-precision SGD. In the extreme case of bucket size 1, quantized SGD is floating-point SGD. In Figure 7(b), why is the TRN line flat as the bit-width increases? 4. Putting TRN in the context of "3 quantization bits" is misleading. TRN only uses "ternary levels" (three levels), as stated in Line 209. Please clarify this in Table 1, Figure 3, Figure 4 and Figure 7(b), for example wherever it is stated that "All methods use 3 quantization bits". Also, please clarify whether the "gradient clipping" in TRN was used in the implementation. "Gradient clipping" is also an adaptive procedure that reduces the variance of gradients by limiting the maximum norm.
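To make the bucket-size argument in point 3 concrete, here is a minimal sketch of bucketed, unbiased stochastic quantization. It is not the authors' implementation: the uniform level grid, the per-bucket L2 normalization, and all function names are my assumptions, chosen only to show how the quantization error should behave as the bucket size changes.

```python
import math
import random


def quantize_bucket(values, num_levels, rng):
    """Unbiased stochastic quantization of one bucket (assumed scheme).

    Each magnitude is divided by the bucket's L2 norm, rounded at random to a
    neighbouring point of a uniform grid on [0, 1], then scaled back.
    """
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        return list(values)
    out = []
    for v in values:
        r = abs(v) / norm * (num_levels - 1)  # position on the level grid
        low = math.floor(r)
        # Round up with probability equal to the fractional part (unbiased).
        level = low + (1 if rng.random() < r - low else 0)
        out.append(math.copysign(level / (num_levels - 1) * norm, v))
    return out


def quantize(grad, bucket_size, num_levels, rng):
    """Quantize a flat gradient bucket by bucket."""
    out = []
    for start in range(0, len(grad), bucket_size):
        bucket = grad[start:start + bucket_size]
        out.extend(quantize_bucket(bucket, num_levels, rng))
    return out


def mean_squared_error(grad, bucket_size, num_levels, seed=0, trials=20):
    """Average squared error per coordinate over several random roundings.

    Because the rounding is unbiased, this estimates the quantization variance.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        q = quantize(grad, bucket_size, num_levels, rng)
        total += sum((a - b) ** 2 for a, b in zip(grad, q))
    return total / (trials * len(grad))


if __name__ == "__main__":
    data_rng = random.Random(1)
    grad = [data_rng.gauss(0.0, 1.0) for _ in range(1024)]  # synthetic gradient
    for bucket_size in (1, 8, 64, 512):
        print(bucket_size, mean_squared_error(grad, bucket_size, num_levels=4))
```

Under these assumptions, a bucket of size 1 has a norm equal to the value's own magnitude, so the value is reproduced (up to floating-point rounding), and the error grows with the bucket size. This is why the drop in Figure 7(a) at small bucket sizes needs an explanation.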
Correctness: See Weaknesses.
Clarity: 1. Explain P_L in equation (6). 2. Explain the x- and y-axes in Figure 6. 3. Some markers are missing in Figure 7(b), such as the markers at 3 bits.
Relation to Prior Work: It is generally clearly discussed. "To the best of our knowledge, matching the validation loss of SuperSGD has not been achieved in any previous work using only 3 bits": please clarify the conditions under which this holds. At the least, TRN "using only 3-level gradients achieves 0.92% top-1 accuracy improvement for AlexNet".
Additional Feedback: 1. Evaluating the method on a larger-scale distributed system with many more nodes (>>4 nodes) may increase the gain of this method, as the gradient variance at each node is large when the batch size per node is small. 2. Consider removing the markers in Figure 7 for a better view. ---------------------------- The rebuttal resolved part of my concerns regarding the experiments, but not the SGD vs SuperSGD issue. Each baseline should have been run with well-tuned hyper-parameters for comparison. This issue, plus the reported bug in the code, raises/reinforces my concerns about the experimental section. Considering the sound theoretical part, my final score would be 6.5 if that were possible, provided all details in the rebuttal are correct and will be incorporated into the final version.
Summary and Contributions: The paper proposes a new adaptive quantization scheme to reduce the communication cost during parallel training.
Strengths: The paper is well written and also provides step-by-step math computations.
Weaknesses: -- Why is the performance of SGD and SuperSGD so different? -- Figure 7 uses ResNet-8, which seems odd to me since it is never used in previous examples in the paper. -- Lines 243-250: the computation overhead is based on bucket size 64, whereas all main results are based on bucket sizes of 8k/16k. -- Though the method is proposed to speed up parallel training, the number of nodes (or simulated nodes) is only 4. -- For AMQ, what p are you using? Values other than p=1/2 should be hard to implement efficiently in practice. -- The theoretical results are useful but trivial to obtain. The authors should consider shortening that part. -- Could you compare your results with other quantization methods that aim to use a better quantization scheme, such as LQNet?
Relation to Prior Work: Yes.
Summary and Contributions: The paper studies quantizing gradients in a distributed learning scenario in order to reduce communication burden. Specifically, they learn the quantization levels for both cases of uniform and exponential quantization by minimizing the variance between the quantized and the original values in expectation. They further prove the theoretical bounds for this variance. The authors further experimentally show they can quantize the gradients down to 3 bits without loss of accuracy.
Strengths: The proposed method is established using theoretical guarantees of convergence and derives the necessary communication bounds which can be used for the purpose of comparison with other gradient compression methods.
Weaknesses: Some aspects of the evaluation methodology have not been fully explained. It might be better to provide overall communication load requirements instead of quantization bitwidth for comparison with previous works as they may not compress all gradient values with a uniform bit-width (e.g. pruning).
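As a rough illustration of what I mean by overall communication load, the sketch below counts the payload bits one worker sends per step. The function name, the 32-bit per-bucket norm, and the 32-bit index per transmitted value for sparse methods are all my assumptions, and entropy coding is ignored. The numbers are hypothetical and not taken from the paper.

```python
import math


def payload_bits(num_values, bits_per_value, bucket_size=None,
                 norm_bits=32, index_bits=0):
    """Approximate bits sent by one worker for one gradient exchange.

    num_values     : number of gradient entries actually transmitted
    bits_per_value : bit-width of each transmitted entry
    bucket_size    : entries sharing one norm/scale (None means no scales sent)
    norm_bits      : bits used for each per-bucket scale (assumed 32)
    index_bits     : bits per entry for its position, needed by sparse methods
    """
    bits = num_values * (bits_per_value + index_bits)
    if bucket_size is not None:
        bits += math.ceil(num_values / bucket_size) * norm_bits
    return bits


if __name__ == "__main__":
    n = 1_000_000  # hypothetical model size
    full_precision = payload_bits(n, 32)
    quantized_3bit = payload_bits(n, 3, bucket_size=8192)
    # Hypothetical pruning baseline: keep 1% of entries at 32 bits plus indices.
    pruned = payload_bits(n // 100, 32, index_bits=32)
    for name, bits in [("full precision", full_precision),
                       ("3-bit, bucket 8192", quantized_3bit),
                       ("1% pruned", pruned)]:
        print(name, bits, round(full_precision / bits, 1))
```

A comparison in these terms would let quantization and pruning methods be placed on the same axis, which bit-width alone does not allow.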
Correctness: To the best of my understanding the proposed bounds are correct. More details for evaluation of the experimental results would help judge their correctness. It is mentioned that evaluation is done by simulating running on 4 GPUs using a single GPU. Can the authors provide more details about how this simulation was carried out?
Clarity: Yes. The paper is clear and easy to follow.
Relation to Prior Work: The authors have performed comparisons with several previous works using different DNN models and image classification datasets.
Summary and Contributions: The paper proposes two adaptive quantization techniques for efficient communication in distributed SGD setups. Quantization levels are chosen to minimize the expected (normalized) variance of the gradient and can either be: 1. Exponentially spaced levels controlled by a single multiplier 2. Adaptable individual levels chosen to minimize the variance. The paper provides theoretical worst-case guarantees for the variance and code-length.
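For readers of this review, my understanding of the first variant can be sketched as follows. The level layout (0, then powers of a multiplier p up to 1), the function names, and the grid search over p are my assumptions for illustration. The paper's actual level layout and optimization procedure may differ.

```python
import bisect


def exponential_levels(p, num_levels):
    """Sorted levels 0, p**(n-2), ..., p, 1 for 0 < p < 1 (assumed layout)."""
    return [0.0] + [p ** k for k in range(num_levels - 2, -1, -1)]


def quantization_variance(r, levels):
    """Variance of unbiased stochastic rounding of r in [0, 1] to sorted levels.

    r is rounded to the level below (lo) or above (hi) with probabilities that
    keep the expectation equal to r, which gives variance (hi - r) * (r - lo).
    """
    i = bisect.bisect_right(levels, r) - 1
    if i >= len(levels) - 1:  # r sits on the top level
        return 0.0
    lo, hi = levels[i], levels[i + 1]
    return (hi - r) * (r - lo)


def expected_variance(samples, levels):
    """Empirical mean variance over normalized gradient magnitudes."""
    return sum(quantization_variance(r, levels) for r in samples) / len(samples)


def best_multiplier(samples, num_levels, candidates):
    """Pick the multiplier with the lowest empirical variance.

    A plain grid search, used here only as a stand-in for whatever optimizer
    the paper uses.
    """
    return min(
        candidates,
        key=lambda p: expected_variance(samples, exponential_levels(p, num_levels)),
    )


if __name__ == "__main__":
    samples = [0.01, 0.02, 0.015, 0.3]  # hypothetical normalized magnitudes
    print(best_multiplier(samples, num_levels=4, candidates=[0.1, 0.3, 0.5, 0.7]))
```

The second variant would replace the single multiplier with one free parameter per level, minimizing the same objective.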
Strengths: - Establishes an upper bound on variance and code length. - Empirical results support the claims made in the paper
Weaknesses: - The claim "robust to all values of hyperparameters" seems hard to defend, since the experiments only vary the bucket size and the number of bits. What about momentum, batch size, etc.?
Correctness: I did not read the proofs for the theorems so cannot comment on these. However, the empirical methodology supports the main claims.
Clarity: Yes. Grammar: performs nearly as good as adaptive methods -> performs nearly as well as adaptive methods
Relation to Prior Work: There is discussion of prior work, but how this paper differs from previous work has to be inferred by the reader.