Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper combines model compression and gradient compression, an interesting and relevant topic, and pairs them with asynchronous SGD updates and momentum. While the reviewers uniformly liked the main contributions, they also agreed that the current literature overview is insufficient and that the scaling experiments are not impressive enough: the time savings when going from 4 to 8 nodes are limited, and results have so far been presented only for small networks. These concerns were partially addressed in the rebuttal. We strongly encourage the authors to improve the related work section and to address the other issues raised in the reviews and during the rebuttal phase. Additional relevant work includes, for example, https://arxiv.org/abs/1905.10936 (appearing simultaneously), https://arxiv.org/abs/1901.09847, and the line of work around https://epubs.siam.org/doi/pdf/10.1137/18M1166134 and the references therein.