Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper studies the problem of parallelising large transformer-based language models. It goes beyond data parallelism in that it focuses on splitting the model when it does not fit in the memory of a single GPU. The idea is to segment the model into groups such that GPUs do not sit idle waiting for others to pass gradients (as happens in layer-wise parallel solutions, where each layer is on its own GPU). The method then allows backpropagation to use stale gradients between groups. An L-layer network is split into K modules so that the weights of the network are divided into K groups, and each group is placed on a GPU. In the experiments, a Transformer-based language model is split into K modules that are allocated sequentially onto K GPUs. Theoretical guarantees for convergence are presented, and experiments show (modest) improvements in training speed without hurting task performance. The paper would be a good addition to the conference, and there is support for its inclusion.
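As a rough illustration of the partitioning summarised above, the sketch below splits the indices of an L-layer network into K contiguous modules and assigns each module to one device. The even-split rule, the function name `split_layers`, and the `gpu:k` labels are assumptions made for this example only; the paper may balance its modules differently, and the stale-gradient update itself is not shown.

```python
from typing import Dict, List


def split_layers(num_layers: int, num_modules: int) -> List[List[int]]:
    """Partition layer indices 0..num_layers-1 into contiguous groups.

    Hypothetical helper: sizes differ by at most one layer, with the
    earlier groups taking any remainder. This is an assumed rule, not
    necessarily the one used in the paper.
    """
    if not 1 <= num_modules <= num_layers:
        raise ValueError("need 1 <= num_modules <= num_layers")
    base, extra = divmod(num_layers, num_modules)
    groups: List[List[int]] = []
    start = 0
    for k in range(num_modules):
        size = base + (1 if k < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


# Sequential placement: module k goes to device k (labels are illustrative).
placement: Dict[str, List[int]] = {
    f"gpu:{k}": layers for k, layers in enumerate(split_layers(12, 4))
}
```

With 12 layers and K = 4, each device holds three consecutive layers, so only the activations and gradients at the three module boundaries need to cross devices.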