NeurIPS 2020
### Adaptive Gradient Quantization for Data-Parallel SGD

### Meta Review

All of the reviews are relatively short and focus largely on questions about the experimental results, for which the author rebuttal gives reasonable explanations. I think the main strength of the paper is its analytical approach to adaptive quantization based on the changing statistics of the stochastic gradients, which some reviewers noted but did not comment on or discuss in substance.
There is a large concurrent literature on using gradient quantization in distributed training of large machine learning models. The adaptive quantization methods presented in this paper are well motivated, and the technical treatment is novel, with sound theoretical analysis of the variance bound and code length (Theorems 1-3). The implications for the optimization algorithm, namely AQSGD, follow directly from the variance bounds, and the convergence results in Theorem 4 look standard. The empirical study is well designed, although there are concerns about particular evaluations and the way they are conducted (simulated on a single GPU instead of true distributed computation). Overall, I believe this work has sufficient novelty, with a fresh theoretical contribution to quantization-based distributed optimization, and it may have practical impact as well. Therefore I recommend acceptance.
The writing of the paper is clear in general. However, there are a few places where the notation is not well introduced, or where the main text relies on critical formulas that appear only in the supplementary material. Here are a few examples:
* Theorem 2 and Theorem 3 mention "under L^q normalization", a term that is not defined anywhere in the main paper and remains vague even in the appendix where the theorems are proved.
* Theorem 3 refers to "L is a random variable with probability mass function given by Proposition 6," but that proposition is in the appendix. I understand that there are space limitations, but such omissions leave the statement meaningless without referring to the appendix.
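To illustrate the kind of definition the main text is missing, here is a minimal sketch of what "L^q normalization" is commonly taken to mean in stochastic gradient quantization: the gradient is divided by its L^q norm so that every coordinate magnitude lies in [0, 1], and each magnitude is then randomly rounded to one of a set of quantization levels so that the result is unbiased. This is an assumption about the intended meaning, not the paper's own definition; the function name `quantize_lq` and the example levels are hypothetical.

```python
import numpy as np

def quantize_lq(v, levels, q=2, rng=None):
    """Unbiased stochastic quantization of v after L^q normalization (sketch).

    Assumes q >= 1 and that `levels` is a sorted 1-D array with
    levels[0] == 0 and levels[-1] == 1 (hypothetical conventions).
    """
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(v, dtype=float)
    levels = np.asarray(levels, dtype=float)

    norm = np.linalg.norm(v, ord=q)      # L^q norm of the gradient
    if norm == 0:
        return np.zeros_like(v)

    r = np.abs(v) / norm                 # normalized magnitudes, in [0, 1] for q >= 1

    # Find the bracketing levels lo <= r <= hi for each coordinate.
    idx = np.searchsorted(levels, r, side="right") - 1
    idx = np.clip(idx, 0, len(levels) - 2)
    lo, hi = levels[idx], levels[idx + 1]

    # Round up with probability proportional to the distance from lo,
    # so that the expected quantized magnitude equals r.
    p_up = (r - lo) / (hi - lo)
    round_up = rng.random(v.shape) < p_up

    return norm * np.sign(v) * np.where(round_up, hi, lo)
```

In this sketch the levels are fixed by the caller. An adaptive scheme of the kind the paper analyzes would instead update the levels during training as the gradient statistics change, and that update step is not shown here.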
The authors are expected to make careful revisions to address these issues and the questions raised by the reviewers on the empirical results.