Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
Overall, the paper develops a new and practically useful algorithm for distributed SGD, with theoretical convergence guarantees and empirical evidence of its performance. The paper could be improved in clarity by expanding on the main theoretical advantages of the algorithm over the Stich et al. paper and by addressing some concerns about the Count Sketch, detailed below.
The authors propose sketching gradients to reduce the communication overhead of large-scale distributed SGD. In particular, they show a reduction from O(d) to O(log d), where d is the number of model parameters. A Count Sketch is used to estimate the top-k elements of the gradient by first identifying the heavy hitters and then retrieving their values from the worker nodes. Extensive experimental results on a wide variety of tasks show up to a 40x improvement in communication cost.

Comments: Overall, the paper is well written, with clear remarks on how it differs from related work such as Stich et al., on which it heavily relies. The results are important because SGD is a workhorse for a wide range of deep learning models, and speeding up communication can help train better models/architectures in less time.
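To make the heavy-hitter mechanism concrete, here is a minimal, illustrative Count Sketch in NumPy (not the paper's implementation; dimensions, hash simulation, and the test vector are my own choices). Each coordinate is hashed to one bucket per row with a random sign; a coordinate's value is recovered as the median of its signed bucket contents, which is accurate for heavy coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: d-dimensional vector compressed into a rows x cols table.
d, rows, cols = 10_000, 5, 200
bucket = rng.integers(0, cols, size=(rows, d))  # simulated hash functions
sign = rng.choice([-1, 1], size=(rows, d))      # simulated sign functions

def sketch(g):
    """Compress a length-d gradient g into a rows x cols Count Sketch table."""
    S = np.zeros((rows, cols))
    for r in range(rows):
        np.add.at(S[r], bucket[r], sign[r] * g)  # scatter-add into buckets
    return S

def estimate(S, i):
    """Estimate coordinate i: median of its signed bucket across rows."""
    return np.median([sign[r, i] * S[r, bucket[r, i]] for r in range(rows)])

g = 0.01 * rng.standard_normal(d)  # small noise everywhere...
g[42] = 5.0                        # ...plus one heavy coordinate
S = sketch(g)
# The heavy coordinate is recovered to within the collision noise.
assert abs(estimate(S, 42) - 5.0) < 1.0
```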
The authors have addressed my questions in the rebuttal, and so I am increasing my score from a 6 to a 7.

This work, to my knowledge, appears to be original. The quality is adequate; the authors show familiarity with, and build on ideas from, the relevant literature. The experimental setup (image classification and NMT) is also relevant. The work is very clear and well written. The proposed method could provide a significant reduction in training time for practitioners and researchers but, in my opinion, needs some additional empirical validation.

Criticisms:
1. The bounded second moment and bounded variance assumptions, taken together, are quite strong. There has been a lot of work recently advocating against the use of these combined assumptions.
2. Communication complexity per iteration is "k log(dT/delta)W", but k log(dT/delta) > d for sufficiently large T. Does this mean that one can only use gradient sketches for the first few iterations of training?
3. In related work, it is stated that other gradient compression methods that "achieve high compression rates when the workers transmit gradients to the parameter server" still incur O(d) communication overhead when the model parameters are communicated back. But this seems like a technicality of the exposition and not of the method. Couldn't one just send back the compressed gradient in this case and have the workers apply updates locally, or, alternatively, only return the updated coordinates?
4. In line 8 of Algorithm 2, you only send back the k elements that were modified, right? This is worth stating in the algorithm description itself.
5. You left a "TODO" in Table 1.
6. To competitively train a Transformer on WMT En-De, one typically uses adaptive step-size methods like Adam, not vanilla SGD. How does the gradient sketching approach in the parameter-server setting extend to methods like Adam, whose momentum buffers are non-linear in the stochastic gradients? The linearity property of the sketches seemed essential, so it is not clear to me that this method would work with methods like Adam.
7. Does "vanilla SGD" in the experiments also include some form of momentum?
8. Can you explain why momentum factor masking "interferes with momentum" such that increasing k (reducing compression error) increases the training error?
9. Given that your method achieves significant communication reduction at the expense of some model accuracy, can you train for more iterations and obtain better models while still communicating less overall?
10. How does this method compare, empirically, with other compression schemes?
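On criticism 2, the crossover point in T can be worked out directly. With some illustrative (not from the paper) values of d, k, and delta, one can solve k·log(dT/delta) > d for T:

```python
import math

# Illustrative values only: d = 1e7 parameters, k = 5e4, delta = 1e-3.
d, k, delta = 1e7, 5e4, 1e-3

# k * log(d*T/delta) > d  <=>  T > (delta/d) * exp(d/k)
T_star = (delta / d) * math.exp(d / k)
print(f"sketch bound exceeds d after T > {T_star:.2e} iterations")
```

For these values the threshold is exp(200)·1e-10, an astronomically large iteration count, so how quickly the bound degrades depends entirely on the ratio d/k.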
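The linearity concern in criticism 6 can be stated precisely: a Count Sketch is a linear map, so the sum of the workers' sketches equals the sketch of the summed gradient, which is what lets the parameter server aggregate without seeing full gradients. A minimal check of this property, under the same illustrative hash simulation as above (not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d, rows, cols = 1_000, 3, 50
bucket = rng.integers(0, cols, size=(rows, d))  # simulated hash functions
sign = rng.choice([-1, 1], size=(rows, d))      # simulated sign functions

def sketch(g):
    """Linear map from a length-d vector to a rows x cols table."""
    S = np.zeros((rows, cols))
    for r in range(rows):
        np.add.at(S[r], bucket[r], sign[r] * g)
    return S

g1, g2 = rng.standard_normal(d), rng.standard_normal(d)
# Linearity: sketch(g1 + g2) == sketch(g1) + sketch(g2), so a server can
# merge workers' sketches. Adam's update is non-linear in the gradients,
# which is exactly why this merging argument does not carry over directly.
assert np.allclose(sketch(g1 + g2), sketch(g1) + sketch(g2))
```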