NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 59
Title: GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Reviewer 1

Originality:
* Their proposed algorithm has little in the way of surprising conceptual insights, but in this case that is a good thing: the parallelism algorithm is simple, intuitive, and achieves a nearly linear throughput increase in the number of accelerators used (hard to expect much more). It's surprising that this hasn't been done before, given that it's such a simple trick that gives such a large speed-up for training distributed models.
* The authors propose a couple of smaller tricks to make the parallelism algorithm work, or work better (re-computing forward-pass activations to reduce memory requirements, and doing so while waiting for backward-pass gradients). To train their very deep models, the authors also use a few other smaller tricks (e.g., clipping logits to mitigate bad gradients).

Significance & Quality:
* General-purpose model parallelism algorithm: the proposed algorithm is applicable to almost any neural network architecture without modification; the authors demonstrate this feature by scaling up state-of-the-art architectures in both computer vision and NLP settings.
* Empirical results with scaled-up models: the large-scale models enabled by TensorPipe show empirical gains on highly competitive benchmarks in computer vision (ImageNet) and NLP (machine translation). In machine translation for low-resource languages, these gains seem quite substantial.

Clarity: Clearly written, aided by the simplicity of the algorithm. Figure 2 is a clear, comprehensive overview of the approach. The writing could be made more concise/dense in some places, especially to make more space for the helpful analysis of the wall-clock time breakdown referenced in the Appendix (2.2). The authors clearly describe the relation of their work to other parallelism algorithms, plus the broader literature on deeper/wider models and generalization. The authors may also wish to relate their work to the literature on developing model architectures that are explicitly easy to distribute (e.g., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer"). It may also be helpful to briefly overview (or show in a figure) the architectures of AmoebaNet and the Transformer.
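
For concreteness, the re-materialization trick referenced above can be sketched as follows. This is a minimal illustration assuming PyTorch; the Partition wrapper is hypothetical and not the authors' implementation. Only a partition's input and output are stored during the forward pass, and the interior activations are recomputed when gradients are needed.

# Minimal sketch of re-materialization (gradient checkpointing), assuming
# PyTorch; `Partition` is a hypothetical wrapper, not the authors' code.
import torch
from torch.utils.checkpoint import checkpoint

class Partition(torch.nn.Module):
    def __init__(self, *layers):
        super().__init__()
        self.layers = torch.nn.Sequential(*layers)

    def forward(self, x):
        # Recompute self.layers(x) during the backward pass instead of
        # caching its intermediate activations.
        return checkpoint(self.layers, x, use_reentrant=False)

part = Partition(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))
y = part(torch.randn(8, 16, requires_grad=True))
y.sum().backward()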

Reviewer 2

There exists a strong correlation between model size and accuracy on large benchmark datasets. Accordingly, it is valuable to push the limits of scale and see if/when the gains from increasing model size saturate. One of the leading approaches for scaling model sizes is model parallelism, but model-parallel training is complex and often much slower than standard training. The authors propose a framework for combining *pipeline* model parallelism with gradient checkpointing ("re-materialization"). Compared to other model-parallel approaches that split individual layers across accelerators, the proposed TensorPipe approach partitions the network into groups of whole layers placed on separate devices, using pipelining to improve overall efficiency and gradient checkpointing to reduce memory requirements.

The proposed approach works, and the authors demonstrate strong results scaling model sizes by an order of magnitude for image classification and machine translation models. The paper is well written, the code is open source, and the results are compelling. This work is likely to have a large influence on future model scaling efforts and serves as a good reference for future adoption/implementation in other frameworks.
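
As a rough sketch of the micro-batch schedule described above (illustrative Python/PyTorch only, not the authors' implementation): a mini-batch is split into micro-batches, gradients are accumulated across them, and a single synchronous update is applied. Both partitions are kept on one device here for brevity; in the actual system each partition would occupy its own accelerator and micro-batches would overlap across partitions.

# Hedged sketch: micro-batching with one synchronous update per mini-batch.
# All names are illustrative.
import torch

partitions = torch.nn.ModuleList([
    torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU()),
    torch.nn.Sequential(torch.nn.Linear(64, 4)),
])
opt = torch.optim.SGD(partitions.parameters(), lr=0.1)

def train_step(inputs, targets, num_microbatches=4):
    opt.zero_grad()
    for mb, tgt in zip(inputs.chunk(num_microbatches),
                       targets.chunk(num_microbatches)):
        x = mb
        for part in partitions:              # forward through each partition in turn
            x = part(x)
        loss = torch.nn.functional.mse_loss(x, tgt) / num_microbatches
        loss.backward()                      # gradients accumulate across micro-batches
    opt.step()                               # one synchronous update per mini-batch

train_step(torch.randn(32, 16), torch.randn(32, 4))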

Reviewer 3

The article is well-written and easy to read. It addresses an important challenge: achieving model parallelism without being architecture- or task-specific. The solution is simple and straightforward; the key contribution lies in the implementation and empirical evaluation of the system.

The scheduling algorithm should be elaborated: it is not clear how load imbalance is addressed or how possible node failures are taken into account. The details of the micro-batching algorithm are not presented in the paper, and the evaluation does not include a benchmark system for comparison. The major limitations of the article are the missing algorithmic details on scheduling/micro-batching and the lack of an empirical benchmark system. On the positive side, the system configuration parameters are explored in the empirical setup, and the system has been contributed to open source.

Author response: the authors have addressed the above points by providing more explanation of the scheduler (which is also open sourced) and more details on the micro-batching algorithm.
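
On the load-imbalance question raised above, one simple way a partitioner could balance stages is to split the layer sequence into contiguous groups of roughly equal estimated cost. The greedy sketch below (plain Python, with made-up cost estimates) is illustrative only and is not the authors' scheduling or partitioning algorithm.

# Hedged sketch: greedy contiguous partitioning of layers by estimated cost
# (e.g., per-layer FLOPs or measured step time). Illustrative only.
def balanced_partition(costs, k):
    """Split per-layer cost estimates into k contiguous groups, closing a
    group once its running cost reaches roughly total/k."""
    target = sum(costs) / k
    groups, current, running = [], [], 0.0
    for c in costs:
        current.append(c)
        running += c
        if running >= target and len(groups) < k - 1:
            groups.append(current)
            current, running = [], 0.0
    groups.append(current)
    return groups

print(balanced_partition([1, 1, 4, 2, 2, 1, 3, 2], k=3))
# -> [[1, 1, 4], [2, 2, 1, 3], [2]]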