{"title": "Ouroboros: On Accelerating Training of Transformer-Based Language Models", "book": "Advances in Neural Information Processing Systems", "page_first": 5519, "page_last": 5529, "abstract": "Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language model with over a billion parameters, verifying the benefits of model size. Model parallelism is required if a model is too large to fit in a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds the training of Transformer-based language models. We also prove that our proposed algorithm is guaranteed to converge to critical points for non-convex problems. Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm obtains a much faster speedup beyond data parallelism, with comparable or better accuracy. Code to reproduce experiments is to be found at \\url{https://github.com/LaraQianYang/Ouroboros}.", "full_text": "Ouroboros: On Accelerating Training of\nTransformer-Based Language Models\n\nQian Yang1\u2217, Zhouyuan Huo2, Wenlin Wang1, Heng Huang2, Lawrence Carin1\n\nDepartment of Electrical and Computer Engineering\n\n1 Duke University\n\n2 University of Pittsburgh\n\nqian.yang@duke.edu\n\nAbstract\n\nLanguage models are essential for natural language processing (NLP) tasks, such\nas machine translation and text summarization. Remarkable performance has\nbeen demonstrated recently across many NLP domains via a Transformer-based\nlanguage model with over a billion parameters, verifying the bene\ufb01ts of model size.\nModel parallelism is required if a model is too large to \ufb01t in a single computing\ndevice. 
Current methods for model parallelism either suffer from backward locking\nin backpropagation or are not applicable to language models. We propose the first\nmodel-parallel algorithm that speeds the training of Transformer-based language\nmodels. We also prove that our proposed algorithm is guaranteed to converge to\ncritical points for non-convex problems. Extensive experiments on Transformer\nand Transformer-XL language models demonstrate that the proposed algorithm\nobtains a much faster speedup beyond data parallelism, with comparable or better\naccuracy. Code to reproduce experiments is to be found at\nhttps://github.com/LaraQianYang/Ouroboros.\n\n1 Introduction\n\nNatural language processing (NLP) tasks, such as machine translation [1, 2, 3, 4, 5], text\nsummarization [6, 7, 8, 9, 10], or paraphrase generation [11, 12, 13] have achieved great success with\nthe development of neural networks. It has been demonstrated recently that Transformer networks\nobtain superior performance [14, 15, 16] relative to recurrent neural networks or convolutional neural\nnetworks. BERT [17] trains a deep bidirectional Transformer with 340M parameters and obtains\nstate-of-the-art results on 11 NLP tasks. Recently, OpenAI GPT-2 [18], which is a Transformer-based\nlanguage model with 1.5B parameters, achieves state-of-the-art results on 7 out of 8 tested language\nmodeling datasets, presenting impressive performance across many domains and datasets. Empirical\nresults demonstrate the superiority of Transformer networks and show that a larger model tends to\nyield better performance. However, when a model is so large that it has to be allocated on multiple\nGPUs, data parallelism over these GPUs is not applicable because it requires each GPU to have one\ncopy of the whole model. 
Meanwhile, model parallelization is still an open question when the model\nis too large to fit on a single device during training.\nWhen a model becomes too large to fit on a single computing device, the simplest solution is to\ndistribute model layers across multiple devices. In [19], the authors parallelize the model by splitting\nfilters or parameters of a layer across multiple GPUs. However, both of these methods suffer from\nbackward locking of the backpropagation algorithm, and cannot parallelize the computations between\nlayers. Backward locking denotes that the backpropagation algorithm requires gradients to be\ncomputed from top layers to bottom layers sequentially. When networks are very deep, all other\ndevices are idle while the backpropagation computation is performed on one device. Jaderberg et al.\n[20] propose Decoupled Neural Interfaces to remove backward locking, by employing additional\nneural networks to approximate error gradients. However, this approach works poorly on deep neural\nnetworks [21]. In [21], the authors use stale gradients from previous computations and successfully\naccelerate the training of deep networks like ResNet110. Subsequently, Huo et al. [22] address\nthe memory issue in [21] and obtain better generalization error. Both of these methods work only\non feed-forward networks that are separable between layers. Neither can\nparallelize Transformer-based language models, because the shared embeddings make the networks\nnon-separable.\n\n\u2217 Corresponding author\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nTo address the above challenges, we make the following contributions. (i) We present the first\nmodel-parallel algorithm to parallelize the training of Transformer-based language models, going\nbeyond data parallelism. 
(ii) We analyze the convergence rate of the proposed algorithm and prove\nthat it is guaranteed to converge to critical points for non-convex problems. (iii) We evaluate\nthe proposed algorithm in training two Transformer-based language models, and experimental results\nverify our theoretical analysis, demonstrating convergence much faster than previous methods with\ncomparable or better accuracy. The source code will be made publicly accessible to encourage further\nresearch.\n\n2 Preliminary and Related Works\n\nSelf-attention architectures like the Transformer [14] have recently become popular for language\nmodeling [15, 16, 17, 18]. Consider training a Transformer-based language model with $L$ layers. We\nmay represent the computations in the network as follows:\n\n$$h_1 = F_1(h_0; w_1, V_i), \\qquad (1)$$\n$$h_l = F_l(h_{l-1}; w_l), \\quad \\forall l \\in \\{2, ..., L-1\\}, \\qquad (2)$$\n$$h_L = F_L(h_{L-1}; w_L, V_o), \\qquad (3)$$\n\nwhere $h_{l-1}$ denotes the input of layer $l$, $F_l(\\cdot; w_l)$ denotes the computation of layer $l$ with weight\n$w_l$, $V_i$ is the input embedding, and $V_o$ is the output projection. In particular, $h_0$ denotes the input\ndata $x$, and $h_L = F(x; \\tilde{w})$ represents the output of the network. For the sake of performance, $V_i$\nand $V_o$ are typically tied in language modeling or machine translation tasks, so that $V = V_i = V_o$\n[23, 24]. Defining network weights $w = [w_1, w_2, ..., w_L]$, embedding layer $V$ and $\\tilde{w} = [w, V]$, the\nloss function for language modeling can be represented as:\n\n$$\\min_{\\tilde{w}} f(F(x; \\tilde{w}), y), \\qquad (4)$$\n\nwhere $y$ denotes the target. In the following context, we use $f(\\tilde{w})$ for simplicity.\n\n2.1 Gradient-Based Method\n\nGradient-based methods are widely employed for training deep neural networks, with stochastic\ngradient descent (SGD) [25] and important variants such as AdaGrad [26], RMSProp [27], Adam\n[28] and AdamW [29]. 
With SGD, the weights of the network are updated as:\n\n$$w_l^{t+1} = w_l^t - \\gamma_t \\nabla f_{l,x_{i(t)}}(\\tilde{w}^t) \\quad \\text{and} \\quad V^{t+1} = V^t - \\gamma_t \\nabla f_{V,x_{i(t)}}(\\tilde{w}^t), \\qquad (5)$$\n\nfor any $l \\in \\{1, ..., L\\}$, where $\\gamma_t$ is the stepsize, $i(t)$ represents the data index at iteration $t$, and\n$\\nabla f_{l,x_{i(t)}}(\\tilde{w}^t)$ is the gradient of the loss function (4) with respect to the weights at layer $l$ and\ndata sample $x_{i(t)}$.\n\n2.2 Backpropagation\n\nIf the loss functions are differentiable, the gradients of network parameters can be computed using\nthe backpropagation algorithm [30]. The backpropagation algorithm consists of two passes of the\nnetwork, forward computation and backward computation. In the forward computation, activations\nof all layers are calculated from $l = 1$ to $L$ following equations (1), (2) and (3). In the backward\ncomputation, we apply the chain rule and propagate error gradients repeatedly through the network,\nfrom the output layer $l = L$ to the input layer $l = 1$:\n\n$$\\frac{\\partial f(\\tilde{w}^t)}{\\partial w_l^t} = \\frac{\\partial f(\\tilde{w}^t)}{\\partial h_l^t} \\times \\frac{\\partial h_l^t}{\\partial w_l^t} \\quad \\text{and} \\quad \\frac{\\partial f(\\tilde{w}^t)}{\\partial h_{l-1}^t} = \\frac{\\partial f(\\tilde{w}^t)}{\\partial h_l^t} \\times \\frac{\\partial h_l^t}{\\partial h_{l-1}^t}, \\qquad (6)$$\n\nwhere $\\tilde{w} = [w, V]$, and $\\nabla f_{l,x_{i(t)}}(\\tilde{w}^t) = \\frac{\\partial f(\\tilde{w}^t)}{\\partial w_l^t}$. For Transformer-based language models, the\ngradients with respect to the input embedding and output projection layer are computed as:\n\n$$\\frac{\\partial f(\\tilde{w}^t)}{\\partial V_i} = \\frac{\\partial f(\\tilde{w}^t)}{\\partial h_1^t} \\times \\frac{\\partial h_1^t}{\\partial V_i} \\quad \\text{and} \\quad \\frac{\\partial f(\\tilde{w}^t)}{\\partial V_o} = \\frac{\\partial f(\\tilde{w}^t)}{\\partial h_L^t} \\times \\frac{\\partial h_L^t}{\\partial V_o}. \\qquad (7)$$\n\nFrom (6), it is evident that the computation in layer $l$ is dependent on the error gradient $\\frac{\\partial f(\\tilde{w}^t)}{\\partial h_l^t}$ from\nlayer $l + 1$. Therefore, the sequential chain rule constrains all layers from updating before receiving\nerror gradients from the dependent layers. When Transformer-based networks are very deep and\ncomputations in each layer are significant, breaking such a sequential chain rule to accelerate the\ntraining presents a challenge.\n\nAlgorithm 1 Ouroboros + SGD\nRequire:\n  Initial weights $w^0 = [w^0_{G(1)}, ..., w^0_{G(K)}]$;\n  Initial word embedding $V_i^0 = V_o^0$;\n  Stepsize sequence $\\{\\gamma_t\\}$;\n1: for $t = 0, 1, 2, ..., T-1$ do\n2:   for $k = 1, ..., K$ in parallel do\n3:     Compute delayed gradient $g_k^t$ for module $k$ following (8);\n4:     Compute mixed gradient $g_V^t$ for embedding layer following (9);\n5:     Update weights and embedding layer following SGD:\n         $w_{G(k)}^{t+1} = w_{G(k)}^t - \\gamma_t \\cdot g_k^t$;\n         $V_i^{t+1} = V_o^{t+1} = V_i^t - \\gamma_t \\cdot g_V^t$;\n6:   end for\n7: end for\n8: Output $w^s$, $V_i^s$ and $V_o^s$ randomly from $\\{w^t\\}_{t=0}^{T-1}$, $\\{V_i^t\\}_{t=0}^{T-1}$ and $\\{V_o^t\\}_{t=0}^{T-1}$.\n\nFigure 1: Communication between GPUs of the proposed Ouroboros algorithm. The first and last\nmodules of a Transformer-based language model are located on the same device.\n\n3 Accelerating Training of Transformer-Based Language Models\n\nWe propose the first model-parallel algorithm that can speed up the training of Transformer-based\nlanguage models. We then take stochastic gradient descent as an example to show that our algorithm\ncan be combined easily with any gradient-based method.\n\n3.1 Ouroboros Algorithm\n\nWe split an $L$-layer network into $K$ modules so that the weights of the network are divided into $K$\ngroups and each group is placed on a GPU. 
Therefore, we have $w = [w_{G(1)}, w_{G(2)}, ..., w_{G(K)}]$ where\n$G(k)$ denotes layer indices in group $k$. We again denote $V_i$ and $V_o$ as the input embedding and output\nprojection, at the first and last module of the network. In [23], it is shown that sharing the embedding\nalways performs better than not sharing it for language modeling and machine translation, where\n$V_i$ and $V_o$ are tied and $V_i = V_o$. In the following context, we let $V = [V_i, V_o]$. Because of this, the\nfirst module and the last module must be placed on the same device, as visualized in Figure 1. Our\nmodel is connected end-to-end and shrinks like a snake when grouping, so we name it \u201cOuroboros.\u201d\n\n[Figure 1 diagram labels: T layer 1 and T layer 4 on GPU 1, T layer 2 on GPU 2, T layer 3 on GPU 3.]\n\nFigure 2: Forward and backward computation of the proposed algorithm. We split a Transformer-based\nlanguage model into four modules and allocate them into three GPUs, where the first and the\nlast module are placed on the same GPU. In the figure, h denotes activations, w denotes weights, and\nV represents embedding layers. T Layer represents a Transformer layer. The input embedding and\noutput projection are tied together.\n\nIn the backward computation of the backpropagation algorithm, the computations of Module 1 are\ndependent on the computations of the later modules. In our Ouroboros algorithm, at each iteration all\nmodules are independent of each other, by using delayed gradients. Letting $\\tilde{w} = [w, V]$, the gradient of\nthe weights in $G(k)$ is\n\n$$\\nabla f_{G(k),x_{i(t-K+k)}}(\\tilde{w}^{t-K+k}) = \\sum_{l \\in G(k)} \\frac{\\partial f_{x_{i(t-K+k)}}(\\tilde{w}^{t-K+k})}{\\partial w_l^{t-K+k}}, \\quad \\text{if } t - K + k \\ge 0, \\qquad (8)$$\n\nor 0 otherwise, for any $k \\in \\{1, 2, ..., K\\}$. The gradient of $V$ is the average of the gradients of the output\nprojection and the input embedding:\n\n$$\\nabla f_{V,x_{i(t)}}(\\tilde{w}^t) = \\frac{1}{2} \\nabla f_{V_o,x_{i(t)}}(\\tilde{w}^t) + \\frac{1}{2} \\nabla f_{V_i,x_{i(t-K+1)}}(\\tilde{w}^{t-K+1}) = \\frac{1}{2} \\left( \\frac{\\partial f(\\tilde{w}^t)}{\\partial V_o^t} + \\frac{\\partial f(\\tilde{w}^{t-K+1})}{\\partial V_i^{t-K+1}} \\right), \\qquad (9)$$\n\notherwise 0 if $t - K + 1 < 0$. In the proposed algorithm, the backward computation in module\n$k$ is always one time step behind module $k + 1$. Therefore, the computations in all modules can\nbe parallelized. In Figure 2, we visualize the procedure of the Ouroboros algorithm, optimizing a\nTransformer-based language model with four modules.\nMemory Consumption. In the Ouroboros algorithm, we need to store stale gradients of all layers,\nwhich may be memory demanding. We follow [31] and only store the input of each GPU. Required\nactivations and gradients are recomputed in the backward pass. Therefore, the extra memory\nconsumption is negligible and depends only on the number of GPUs.\n\n3.2 Gradient-Based Method with Ouroboros\n\nAfter obtaining gradients of the loss function with respect to the weights of the model, we can apply\nthese gradients to gradient-based methods. We consider the procedures of SGD as an example.\nLetting $g_k^t$ and $g_V^t$ represent the gradients of module $k$ and embedding $V$ at iteration $t$, we can update\nmodel weights and embeddings following SGD:\n\n$$w_{G(k)}^{t+1} = w_{G(k)}^t - \\gamma_t \\cdot g_k^t; \\qquad (10)$$\n$$V_i^{t+1} = V_o^{t+1} = V_i^t - \\gamma_t \\cdot g_V^t, \\qquad (11)$$\n\nwhere $\\gamma_t$ denotes the stepsize. We summarize Ouroboros with SGD in Algorithm 1. In the next\nsection, we analyze the convergence rate of Algorithm 1, which is the basis of analysis for other\nvariants of SGD.\n\n4 Convergence Analysis\n\nWe prove Algorithm 1 is guaranteed to converge to critical points for non-convex problems. 
Results\nshow that it admits a similar convergence rate to vanilla SGD. Detailed proofs are in the supplementary\nmaterial. First, we make two commonly used assumptions following [32]:\n\nAssumption 1 (Lipschitz-continuous gradient) The gradient of $f(w)$ is Lipschitz continuous with\nLipschitz constant $L > 0$, such that for any $w, v$, it is satisfied that:\n\n$$\\|\\nabla f(w) - \\nabla f(v)\\|_2 \\le L \\|w - v\\|_2. \\qquad (12)$$\n\nAssumption 2 (Bounded variance) We assume the second moment of the stochastic gradient is\nupper bounded, such that there exists a constant $M \\ge 0$, for any sample $x_i$ and for any $w$:\n\n$$\\|\\nabla f_{x_i}(w)\\|_2^2 \\le M. \\qquad (13)$$\n\nBecause of the variance equation $\\mathbb{E}\\|\\nabla f_{x_i}(w) - \\nabla f(w)\\|_2^2 = \\mathbb{E}\\|\\nabla f_{x_i}(w)\\|_2^2 - \\|\\nabla f(w)\\|_2^2$, the\nfollowing inequality is also satisfied:\n\n$$\\|\\nabla f_{x_i}(w) - \\mathbb{E}[\\nabla f_{x_i}(w)]\\|_2^2 \\le M. \\qquad (14)$$\n\nUnder Assumptions 1 and 2, we obtain Lemma 1 about iterations of the objective functions.\n\nLemma 1 With Assumptions 1 and 2, let $\\sigma := \\max_t \\frac{\\gamma_{\\max\\{0, t-K+1\\}}}{\\gamma_t}$ and $M_K = (K + \\frac{3}{4})M + \\sigma(\\frac{K^2}{2} + K^3)(K + 4)M$. For all $t \\in \\mathbb{N}$, the iterations in Algorithm 1 satisfy the inequality\n\n$$\\mathbb{E}[f(w^{t+1})] - f(w^t) \\le -\\frac{\\gamma_t}{2} \\|\\nabla f(w^t)\\|_2^2 + \\gamma_t^2 L M_K. \\qquad (15)$$\n\nFrom Lemma 1, we observe that the expected decrease of the objective function is controlled by the\nstepsize $\\gamma_t$ and $M_K$. 
Therefore, we can guarantee that the values of objective functions are decreasing\nas long as the stepsizes $\\gamma_t$ are small enough, such that the right-hand side of (15) is less than zero.\nBased on Lemma 1, we analyze the convergence guarantee of Algorithm 1.\n\n4.1 Fixed Stepsize $\\gamma_t$\n\nWe first analyze the convergence for Algorithm 1 when $\\gamma_t$ is fixed, and prove that the learned model\nwill converge sub-linearly to the neighborhood of the critical points.\nTheorem 1 With Assumptions 1 and 2, and the fixed stepsize sequence $\\{\\gamma_t\\}$ satisfying $\\gamma_t = \\gamma$ and\n$\\gamma L \\le 1$, $\\forall t \\in \\{0, 1, ..., T-1\\}$, let $w^*$ be the optimal solution to $f(w)$. The output of Algorithm 1\nsatisfies:\n\n$$\\frac{1}{T} \\sum_{t=0}^{T-1} \\mathbb{E}\\|\\nabla f(w^t)\\|_2^2 \\le \\frac{2(f(w^0) - f(w^*))}{\\gamma T} + 2\\gamma L M_K, \\qquad (16)$$\n\nand $M_K = (K + \\frac{3}{4})M + (\\frac{K^2}{2} + K^3)(K + 4)M$.\n\nAccording to Theorem 1, the average norm of gradients can converge to the neighborhood of critical\npoints. 
As $T \\to \\infty$, it is also upper bounded by $2\\gamma L M_K$.\n\nRemark 1 With Assumptions 1 and 2, and following the notation in Theorem 1, let $\\gamma = \\sqrt{\\frac{f(w^0) - f(w^*)}{T L M_K}}$. Then\n\n$$\\frac{1}{T} \\sum_{t=0}^{T-1} \\mathbb{E}\\|\\nabla f(w^t)\\|_2^2 \\le 4 \\sqrt{\\frac{(f(w^0) - f(w^*)) L M_K}{T}}.$$\n\nAccording to the above analysis, Algorithm 1 admits a convergence rate of $O(1/\\sqrt{T})$ for\nnon-convex problems, which is similar to the result of SGD [32].\n\n4.2 Diminishing Stepsize $\\gamma_t$\n\nWe prove that Algorithm 1 with diminishing stepsizes can guarantee the convergence to critical points\nfor non-convex problems.\nTheorem 2 With Assumptions 1 and 2, and the diminishing stepsize sequence $\\{\\gamma_t\\}$ satisfying $\\gamma_t = \\frac{\\gamma_0}{1+t}$, $\\gamma_t L \\le 1$, $\\forall t \\in \\{0, 1, ..., T-1\\}$, assume $w^*$ to be the optimal solution to $f(w)$, and let $\\sigma = K$\nsuch that $M_K = (K + \\frac{3}{4})M + (\\frac{K^3}{2} + K^4)(K + 4)M$. Setting $\\Gamma_T = \\sum_{t=0}^{T-1} \\gamma_t$, then the output of\nAlgorithm 1 satisfies\n\n$$\\frac{1}{\\Gamma_T} \\sum_{t=0}^{T-1} \\gamma_t \\mathbb{E}\\|\\nabla f(w^t)\\|_2^2 \\le \\frac{2(f(w^0) - f(w^*))}{\\Gamma_T} + \\frac{2 \\sum_{t=0}^{T-1} \\gamma_t^2 L M_K}{\\Gamma_T}. \\qquad (17)$$\n\nSince $\\gamma_t = \\frac{\\gamma_0}{t+1}$, the following conditions are satisfied:\n\n$$\\lim_{T \\to \\infty} \\sum_{t=0}^{T-1} \\gamma_t = \\infty \\quad \\text{and} \\quad \\lim_{T \\to \\infty} \\sum_{t=0}^{T-1} \\gamma_t^2 < \\infty.$$\n\nTherefore, according to Theorem 2, when $T \\to \\infty$, the right-hand side of (17) converges to 0.\nRemark 2 Suppose $w^s$ is chosen randomly from $\\{w^t\\}_{t=0}^{T-1}$ with probabilities proportional to\n$\\{\\gamma_t\\}_{t=0}^{T-1}$. According to Theorem 2, we can prove that Algorithm 1 guarantees convergence to\ncritical points for the non-convex problem: $\\lim_{s \\to \\infty} \\mathbb{E}\\|\\nabla f(w^s)\\|_2^2 = 0$.\n\n5 Experimental Setup\n\nWe evaluate the proposed method by training two Transformer-based language models. When the\nmodel is too large to fit on a single GPU, its layers have to be distributed across multiple GPUs. In\nthis case, data parallelism over multiple GPUs does not work because it requires that each GPU has\none copy of the whole model. Mini-batch computation in one GPU is regarded as the data parallelism\nin this paper. To simulate this case, we distribute layers of a model across $K$ GPUs. Experimental\nresults demonstrate that the proposed method obtains further speedup beyond data parallelism.\n\n5.1 Datasets\n\nFollowing [16], three publicly available datasets are used for training and evaluation: (i) enwiki8,\ncontaining 100M bytes of unprocessed Wikipedia text [33]; (ii) text8, containing 100M processed\nlower-case Wikipedia characters, with every character other than the 26 letters a through z and\nspace removed [33]; and (iii) WikiText-103, the largest available word-level language modeling benchmark\nwith long-term dependency [34]. All training datasets are preprocessed following [16].\n\n5.2 Training Details\n\nOur implementation is based on Transformer-XL2 using PyTorch. All experiments are performed\non a machine with 4\u00d7TESLA V100 GPUs. Parallelization between modules is handled via the\nsubprocess library in Python3. We use two language models in the paper: a 12-layer Transformer\n(44M parameters) [15] and Transformer-XL (41M parameters) [16]. In all experiments, we split a\nTransformer-based language model into $K$ modules and allocate them sequentially onto $K$ GPUs\n(backpropagation algorithm) or $K - 1$ GPUs (Ouroboros). Due to limited resources, we validate\nour proposed algorithm by varying $K$ from 3 to 5. Following [16], we use the Adam optimizer,\nwhere $\\beta_1 = 0.9$, $\\beta_2 = 0.999$ and $\\epsilon = 10^{-8}$ [28]. 
For comparison, we use Ouroboros+Adam (see\nAppendix) in the experiments. The learning rate is set to 0.00025 and it decreases following a\ncosine learning rate schedule [35].\n\n2 https://github.com/kimiyoung/transformer-xl/tree/master/pytorch\n\nWarm-up Training. In the early stages of training, stale gradient information may affect the\nconvergence of Ouroboros. Following [36], we use a gradual warm-up approach in all experiments.\nThis avoids a sudden increase of the learning rate, decreasing the error caused by stale gradients. In\nall experiments, we set the number of warm-up steps to 5000. After the warm-up, we use the cosine learning\nrate schedule.\nRepeatable Dropout. According to [37], dropout ignores weights in a fully-connected layer\nindependently and randomly, with a given probability. Ouroboros allows modules to compute gradients in\nparallel, at different time-stamps. To compute gradients with the input from time-stamp $t - K + k$,\nwe need to recompute activations $h_{G(k)}^{t-K+k}$ in module $k$. However, randomness in the dropout layer\nprevents recovering previous activations accurately. Consequently, we propose to store the input of\neach module as well as a random seed. Therefore, before computing activations, we initialize the\nrandom number generator in GPU with the stored seed.\n\n5.3 Evaluation Metric\n\nTo evaluate the convergence rate of the proposed algorithm, we compare training loss regarding steps\nand computational time. We evaluate the final performance of the trained model by computing the\nbpc score on the test data of the enwiki8 and text8 datasets, and the PPL score on the test data of WikiText-103.\n\n6 Experimental Results\n\nWe show that our Ouroboros algorithm parallelizes the previous sequential backpropagation, and\nobtains a substantial speedup beyond data parallelism, without loss of accuracy. We also perform\nablation studies to analyze the necessity of the proposed training techniques. All figures are plotted\nusing the average of 5 runs.\n\n[Figure 3 plots: training loss versus steps and versus time (s) on enwiki8, text8 and WikiText-103, for Transformer-XL (Adam), Transformer (Adam), Transformer-XL (Our method) and Transformer (Our method).]\n\nFigure 3: Convergence of the methods, regarding steps and computational time. We evaluate our\nalgorithm on both Transformer and Transformer-XL language models.\n\nDataset | Transformer, Adam | Transformer, Ouroboros + Adam | Transformer-XL, Adam | Transformer-XL, Ouroboros + Adam\nenwiki8 | 1.11 | 1.12 | 1.06 | 1.05\ntext8 | 1.18 | 1.18 | 1.15 | 1.13\nWikiText-103 | 28.32 | 28.29 | 24.00 | 24.10\n\nTable 1: Comparison of Test bpc (Bit per Character) or Test PPL. We use the metric bpc on the\nenwiki8 and text8 datasets, and PPL on the WikiText-103 dataset. Our algorithm can achieve speedup\nwith comparable or better performance.\n\nFigure 4: Convergence of training loss regarding steps and computational time, when we vary\nthe number of modules K. The right figure shows the speedup of computational time per batch. Experiments are performed\nto train Transformer-XL on the enwiki8 dataset.\n\n(a) Warm-up\n\n(b) Repeatable dropout\n\nFigure 5: Ablation study on the effect of warm-up (Figure 5a) and repeatable dropout (Figure 5b).\nExperiments are performed to train Transformer-XL on the enwiki8 dataset.\n\n6.1 Convergence Comparisons\n\nThe proposed method is evaluated by optimizing two Transformer-based language models,\nTransformer [15] and Transformer-XL [16]. For the enwiki8 and text8 datasets, we use 12-layer models,\nand for the WikiText-103 dataset, we use 16-layer models. We visualize the convergence of training\nloss regarding steps and computational time in Figure 3. First, the convergence rates of our algorithm\nand the alternative methods are very close regarding steps. This is consistent with our theoretical analysis that the proposed\nalgorithm converges to critical points with a rate of $O(1/\\sqrt{T})$. Second, our algorithm is much faster\nthan the alternative methods regarding computational time. In Table 1, we compare PPL or bpc of the methods. Experimental results\nshow that our algorithm obtains comparable or sometimes better performance.\n\n6.2 Distributed Speedup\n\nWe further evaluate our algorithm by varying $K$ from 3 to 5 and visualize experimental results in\nFigure 4. We allocate $K$ modules on $K - 1$ GPUs. Note that (i) increasing the number of modules\nmay affect the convergence regarding steps, consistent with our theoretical analysis; and (ii) more\nspeedup will be obtained when the networks are deeper. Linear speedup, using K\u00d7 machines\nto achieve K\u00d7 speedup regarding time, is the ideal case. However, it is impossible to achieve\neven for data parallelism. The goal of our method is to guarantee that no machine is idle during\nthe training and to fully utilize all computing resources. 
Besides, it is also easy to combine our method\nwith data parallelism to obtain further speedup.\n\n[Figure 4 plots: Effect of K, training loss versus steps and versus time (s), for Adam, K=3, K=4 and K=5; Speedup of Time Per Batch, speedup versus K. Figure 5 plots: Effect of Warm-up steps, training loss versus steps, for warm-up step = 0, 50, 500 and 5000; Effect of repeatable dropout, training loss versus steps, for nonrepeatable and repeatable dropout.]\n\n6.3 Ablation Studies\n\nThe Effect of Warm-Up. As mentioned in Section 5, the proposed algorithm is vulnerable to noise\nat early steps, and stale gradients may affect convergence. We compare the convergence of training\nloss when the warm-up step is selected from {0, 50, 500, 5000}. As illustrated in Figure 5a,\nwe observe that the algorithm may diverge if there is no warm-up at the early stages of training.\nThe Effect of Repeatable Dropout. We also find that the randomness in the dropout layer affects\nthe convergence of the proposed algorithm. In Figure 5b, we evaluate the effectiveness of\nrepeatable dropout. It is clear that the convergence is affected if there is no repeatable dropout.\n\n7 Conclusions\n\nWe have considered accelerating the training of Transformer-based language models, and have\nintroduced a novel \u201cOuroboros\u201d algorithm. We prove Ouroboros is guaranteed to converge to critical\npoints for non-convex problems, and has a convergence rate similar to that of standard SGD. 
We conduct\nexperiments on training Transformer-based language models, and experimental results verify that the\nproposed algorithm can yield a signi\ufb01cant speedup without loss of accuracy.\n\nAcknowledgments\n\nThis research was supported in part by DARPA, DOE, NIH, ONR and NSF.\n\nReferences\n[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\n\nlearning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[2] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural\n\nnetworks. In Advances in neural information processing systems, pages 3104\u20133112, 2014.\n\n[3] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-\n\nbased neural machine translation. arXiv preprint arXiv:1508.04025, 2015.\n\n[4] Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Wei Xu. Joint training for pivot-based\nneural machine translation. In Joint Training for Neural Machine Translation, pages 41\u201354.\nSpringer, 2019.\n\n[5] Yong Cheng, Yang Liu, Qian Yang, Maosong Sun, and Wei Xu. Neural machine translation\n\nwith pivot languages. arXiv preprint arXiv:1611.04928, 2016.\n\n[6] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. Abstractive text sum-\nmarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023,\n2016.\n\n[7] Sumit Chopra, Michael Auli, and Alexander M Rush. Abstractive sentence summarization with\nattentive recurrent neural networks. In Proceedings of the 2016 Conference of the North Ameri-\ncan Chapter of the Association for Computational Linguistics: Human Language Technologies,\npages 93\u201398, 2016.\n\n[8] Qian Yang, Rebecca J Passonneau, and Gerard De Melo. Peak: Pyramid evaluation via\nautomated knowledge extraction. In Thirtieth AAAI Conference on Arti\ufb01cial Intelligence, 2016.\n[9] Qian Yang, Gerard de Melo, Yong Cheng, and Sen Wang. Hitext: text reading with dynamic\nsalience marking. 
In Proceedings of the 26th International Conference on World Wide Web\nCompanion, pages 311\u2013319. International World Wide Web Conferences Steering Committee,\n2017.\n\n[10] Rebecca J Passonneau, Ananya Poddar, Gaurav Gite, Alisa Krivokapic, Qian Yang, and Dolores\nPerin. Wise crowd content assessment and educational rubrics. International Journal of\nArtificial Intelligence in Education, 28(1):29\u201355, 2018.\n\n[11] Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. Paraphrase generation with deep\nreinforcement learning. arXiv preprint arXiv:1711.00279, 2017.\n\n[12] Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. A deep generative framework\nfor paraphrase generation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.\n\n[13] Qian Yang, Zhouyuan Huo, Dinghan Shen, Yong Cheng, Wenlin Wang, Guoyin Wang, and\nLawrence Carin. An end-to-end generative architecture for paraphrase generation. In EMNLP,\n2019.\n\n[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,\n\u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural\nInformation Processing Systems, pages 5998\u20136008, 2017.\n\n[15] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level\nlanguage modeling with deeper self-attention. arXiv preprint arXiv:1808.04444, 2018.\n\n[16] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le,\nand Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length\ncontext. arXiv preprint arXiv:1901.02860, 2019.\n\n[17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of\ndeep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,\n2018.\n\n[18] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.\nLanguage models are unsupervised multitask learners. 
OpenAI Blog, 1:8, 2019.\n\n[19] Omry Yadan, Keith Adams, Yaniv Taigman, and Marc\u2019Aurelio Ranzato. Multi-GPU training of convnets. arXiv preprint arXiv:1312.5853, 2013.\n\n[20] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1627\u20131635. JMLR.org, 2017.\n\n[21] Zhouyuan Huo, Bin Gu, Qian Yang, and Heng Huang. Decoupled parallel backpropagation with convergence guarantee. arXiv preprint arXiv:1804.10574, 2018.\n\n[22] Zhouyuan Huo, Bin Gu, and Heng Huang. Training neural networks using features replay. In Advances in Neural Information Processing Systems, pages 6659\u20136668, 2018.\n\n[23] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.\n\n[24] Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.\n\n[25] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400\u2013407, 1951.\n\n[26] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121\u20132159, 2011.\n\n[27] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Lecture 6a: Overview of mini-batch gradient descent. Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture, [Online], 2012.\n\n[28] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[29] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. 
arXiv preprint arXiv:1711.05101, 2017.\n\n[30] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.\n\n[31] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.\n\n[32] L\u00e9on Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223\u2013311, 2018.\n\n[33] Matt Mahoney. Large text compression benchmark. URL: http://www.mattmahoney.net/text/text.html, 2011.\n\n[34] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.\n\n[35] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.\n\n[36] Priya Goyal, Piotr Doll\u00e1r, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.\n\n[37] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929\u20131958, 2014.", "award": [], "sourceid": 2952, "authors": [{"given_name": "Qian", "family_name": "Yang", "institution": "Duke University"}, {"given_name": "Zhouyuan", "family_name": "Huo", "institution": "University of Pittsburgh"}, {"given_name": "Wenlin", "family_name": "Wang", "institution": "Duke University"}, {"given_name": "Lawrence", "family_name": "Carin", "institution": "Duke University"}]}