{"title": "Mesh-TensorFlow: Deep Learning for Supercomputers", "book": "Advances in Neural Information Processing Systems", "page_first": 10414, "page_last": 10423, "abstract": "Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the \"batch\" dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing SOTA results on WMT'14 English-to-French translation task and the one-billion-word Language modeling benchmark. 
Mesh-Tensorflow is available at https://github.com/tensorflow/mesh", "full_text": "Mesh-TensorFlow:\n\nDeep Learning for Supercomputers\n\nNoam Shazeer, Youlong Cheng, Niki Parmar,\n\nDustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee\n\nMingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman\n\n{noam, ylc, nikip, trandustin, avaswani, penporn, phawkins,\nhyouklee, hongm, cliffy, rsepassi, blakehechtman}@google.com\n\nGoogle Brain\n\nAbstract\n\nBatch-splitting (data-parallelism) is the dominant distributed Deep Neural Network\n(DNN) training strategy, due to its universal applicability and its amenability to\nSingle-Program-Multiple-Data (SPMD) programming. However, batch-splitting\nsuffers from problems including the inability to train very large models (due to\nmemory constraints), high latency, and inef\ufb01ciency at small batch sizes. All of\nthese can be solved by more general distribution strategies (model-parallelism).\nUnfortunately, ef\ufb01cient model-parallel algorithms tend to be complicated to dis-\ncover, describe, and to implement, particularly on large clusters. We introduce\nMesh-TensorFlow, a language for specifying a general class of distributed tensor\ncomputations. Where data-parallelism can be viewed as splitting tensors and op-\nerations along the \"batch\" dimension, in Mesh-TensorFlow, the user can specify\nany tensor-dimensions to be split across any dimensions of a multi-dimensional\nmesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program\nconsisting of parallel operations coupled with collective communication prim-\nitives such as Allreduce. We use Mesh-TensorFlow to implement an ef\ufb01cient\ndata-parallel, model-parallel version of the Transformer [16] sequence-to-sequence\nmodel. 
Using TPU meshes of up to 512 cores, we train Transformer models with\nup to 5 billion parameters, surpassing state of the art results on WMT\u201914 English-\nto-French translation task and the one-billion-word language modeling benchmark.\nMesh-Tensor\ufb02ow is available at https://github.com/tensor\ufb02ow/mesh .\n\n1\n\nIntroduction\n\nBatch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) train-\ning strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data\n(SPMD) programming. However, batch-splitting suffers from several major problems when training\nvery large models. The memory required to store parameters and/or activations and the time neces-\nsary to synchronize parameters can make purely-data-parallel algorithms impossible or inef\ufb01cient.\nDifferent distribution strategies (model-parallelism [7]) can solve these issues, but specifying these\nstrategies can be complicated, and the current MIMD implementations generate very large programs\nwhich can be dif\ufb01cult to compile and to optimize.\nWe solve this problem by introducing Mesh-TensorFlow, a language for specifying a general class\nof distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and\noperations along the \"batch\" dimension, in Mesh-TensorFlow, the user can specify any tensor-\ndimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-\nTensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fcollective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement\nan ef\ufb01cient data-parallel, model-parallel version of the Transformer [16] sequence-to-sequence\nmodel. 
Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion\nparameters, surpassing state-of-the-art results on WMT\u201914 English-to-French translation task and the\none-billion-word Language modeling benchmark.\n\n2 Hardware Assumptions\n\nWhile much work deals with heterogeneous and/or unreliable hardware, we focus on clusters of\nidentical, reliable processors, each with a local memory. We de\ufb01ne a mesh as an n-dimensional array\nof such processors. The mesh is only a naming abstraction and does not imply a physical network\ntopology. As such, different meshes can be de\ufb01ned over the same set of physical processors. For\nexample, a 512-core TPU cluster with a 16x16x2 toroidal network interconnect could be represented\nby a 3-dimensional mesh with shape [16, 16, 2], a two-dimensional mesh with shape [32, 16], a\none-dimensional mesh with shape [512], etc. The physical network topology does affect performance;\nparticularly important is the performance of MPI Allreduce, grouped by splitting the mesh by a subset\nof the dimensions, which can be very ef\ufb01cient [4] [5] if each such group is physically connected.\n\n3\n\nInspiration: Single-Program-Multiple-Data (SPMD) Batch-Splitting\n\nWe \ufb01rst review a commonly-used variant of synchronous data-parallelism where each processor keeps\nan identical copy of all parameters (Algorithm 1). For each step, the batch of training examples is\nsplit into sub-batches, one for each processor. Each processor computes the forward and backward\npasses on its sub-batch, resulting in gradients on the model parameters. These gradients are then\nsummed across all processors and the results broadcast to all processors (MPI-allreduce). Finally,\neach processor updates its own copy of the parameters.\n\nAlgorithm 1 Synchronous data-parallelism with replicated parameters. Each processor maintains a\ncomplete copy of all weights W (t). 
The batch $b^{(t)}$ of training examples for timestep t is partitioned\namong the set P of processors: $b^{(t)} = \\dot{\\bigcup}_{p \\in P} b_p^{(t)}$. Below is the computation performed on one\nprocessor $p \\in P$.\n1: Compute partial parameter gradients $\\nabla Q(W^{(t)}, b_p^{(t)})$    \u25b7 Local computation\n2: $\\nabla Q(W^{(t)}, b^{(t)}) = \\sum_{p' \\in P} \\nabla Q(W^{(t)}, b_{p'}^{(t)})$    \u25b7 Allreduce\n3: $W^{(t+1)} = \\mathrm{Update}(W^{(t)}, \\nabla Q(W^{(t)}, b^{(t)}))$    \u25b7 Local computation\n\nThis algorithm is typically implemented using Single-Program-Multiple-Data (SPMD) programming,\nwith every processor running the same program of local operations and MPI-allreduce primitives.\nOne way to see this algorithm is that every tensor and every operation in the computation is either\nsplit across all processors (if it has a \"batch\" dimension), or fully replicated across all processors\n(if it does not have a \"batch\" dimension). Operations which reduce out the \"batch\" dimension\nrequire an additional MPI-allreduce to produce the correct result. We can describe this as splitting\nthe computation across the \"batch\" dimension. Mesh-TensorFlow generalizes this idea to splitting\ncomputations across arbitrary dimensions.\n\n4 Mesh-TensorFlow: Beyond Batch Splitting\n\nMesh-TensorFlow generalizes from the batch-splitting algorithm described in section 3 to allow for\nsplitting across different Tensor dimensions. The similarities are as follows:\n\n\u2022 Each tensor in the computation is represented by one (not-necessarily-distinct) slice of the\ntensor on each processor.\n\u2022 Each operation in the computation is implemented as one operation on each processor. Most\noperations require no communication, with each processor producing its slice of the output\nfrom its slices of the inputs. 
Some operations additionally require collective communication\nprimitives such as MPI-allreduce.\n\n2\n\n\f\u2022 Based on the above, the computation can be implemented as a SPMD program.\n\nThe new elements in Mesh-TensorFlow are as follows:\n\n\u2022 Tensors have named dimensions. This allows for the idea of a logical dimension (like\n\"batch\") which will be split in the same way for different tensors and operations. It is illegal\nfor a tensor to have two identically-named dimensions.\n\n\u2022 Rather than an unstructured set of processors, Mesh-Tensor\ufb02ow allows for an n-dimensional\n\nmesh of processors (section 2). The mesh also has named dimensions.\n\n\u2022 A global \"computation layout\" is a partial map from tensor-dimension to mesh-dimension\nspecifying which tensor-dimensions are split across which dimensions of the processor-\nmesh. For example, batch-splitting (data-parallelism) would be expressed by using a\none-dimensional mesh with dimension \"all_processors\" and using the computation\nlayout [(\"batch\", \"all_processors\")]. This means that all tensors with a \"batch\"\ndimension are split along that dimension across all processors, while all other tensors are\nfully replicated.\n\n5 Tensor Representations\n\nA tensor is represented as one slice of the tensor per processor. The layout of a tensor is an injective\npartial map from the tensor\u2019s dimensions to dimensions of the mesh, and is computed as the restriction\nof the global computation layout to that tensor\u2019s dimensions. It is illegal for two dimensions of the\nsame tensor to map to the same mesh dimension. If a tensor\u2019s layout is empty, it is fully replicated on\neach processor. For every (tensor-dimension, mesh-dimension) pair in the tensor\u2019s layout, the slice\non a processor is restricted along that tensor-dimension to a stripe corresponding to that processor\u2019s\ncoordinate along that mesh-dimension. 
The current implementation of Mesh-TensorFlow requires the\nsize of the tensor-dimension to be evenly divisible by the size of the mesh-dimension.\n\n6 Operation Implementation\n\nEach operation is implemented by parallel computation on every processor, and sometimes collective\ncommunication. We describe the implementations of some important operations here:\n\nComponent-wise Operations Mesh-TensorFlow supports component-wise operations where the\nshapes (and hence the layouts) of the input and output tensors are identical. These are trivially\nimplemented by parallel operations on each processor to compute that processor\u2019s slice of the output\nfrom that processor\u2019s slice(s) of the input(s).\n\nReduction (reduce_sum(), reduce_max(), etc.) Mesh-TensorFlow supports reductions where the\noutput dimensions are a subset of the input dimensions. These can be implemented by local reductions\nof each slice, followed by MPI-allreduce across any mesh dimensions corresponding to reduced-out\nTensor dimensions. The allreduce operation is necessary because the local reduction only sums across\na subset of the split tensor-dimension. Bandwidth-ef\ufb01cient implementations of allreduce exist when\nthe processors for each group are connected in any type of tree. [4] [5]\n\nEinstein Summation (matrix multiplication, etc.) Einstein-summation (einsum) notation (as\nde\ufb01ned in numpy, TensorFlow, etc.) is a way of expressing a class of operations including (batch)\nmatrix multiplication, reductions and broadcasts, where the operation is de\ufb01ned by the names of the\ndimensions of the input and output tensors. Mesh-TensorFlow\u2019s use of named dimensions makes\nusing einsum particularly convenient. Einsum can be de\ufb01ned as broadcasting all inputs to a shape\nconsisting the union of all their dimensions, multiplying them component-wise, then reducing out all\ndimensions not in the speci\ufb01ed output shape. 
Einsum is implemented by parallel einsum operations\non each processor of that processor\u2019s input slices, followed by MPI-allreduce across any mesh\ndimensions corresponding to reduced-out Tensor dimensions.\n\n3\n\n\f6.1 Reshape\n\nWhile reshape is simple in the non-distributed case, Mesh-TensorFlow reshape can require network\ncommunication, since the layout of the output tensor may differ from that of the input tensor. Even\nkeeping the same dimension sizes, changing the dimension names (and hence the layout) can result in\nseveral different communication patterns: If a dimension is split in the input but not in the output, the\nimplementation involves MPI-allgather communication across the corresponding mesh-dimension. If\na dimension is split in the output but not in the input, the implementation involves no communication,\njust slicing on each processor. MPI-alltoall is used in the case where different dimensions in the\ninput and the output are split across the same mesh dimension, as might be the case when switching\nbetween data-parallelism and model-parallelism for different layers of the same model, as in [15].\n\n7 Mesh-TensorFlow syntax\n\nThe Mesh-TensorFlow language is nearly identical to TensorFlow [12], with the familiar notions of\ngraphs, tensors, operations, variables, devices (called meshes), and automatic gradient computation.\nThe principal difference is that in Mesh-TensorFlow, tensor-dimensions have a name as well as a\nsize. The shape of each tensor is a statically-known tuple of such dimensions. Shapes are inferred\nautomatically when possible, as they are in TensorFlow. Binary component-wise operations like\naddition employ implicit broadcasting in the case where the shape of one operand is a subset of the\nshape of the other.\nThe initial implementation of Mesh-TensorFlow is a Python library. The user builds a Mesh-\nTensorFlow graph in python, which the library \"lowers\" to generate part of a TensorFlow graph. 
As\nof the writing of this paper, implementations exist for generating SPMD TensorFlow code for TPUs,\nor MIMD code (using device placement) for multi-CPU/GPU configurations.\n\n8 Example: Two Fully-Connected Layers\n\nWe consider a simple example of two fully-connected layers in the middle of a neural network. The\ninput layer x and the output layer y each have $d_{io}$ units, and the hidden layer h has $d_h$ units. The\nhidden layer also has a bias and Relu activation.\n\ny = Relu(xw + bias)v    (1)\n\nThis Mesh-TensorFlow code fragment runs these layers on a batch x of batch_size = b inputs.\n\n...\nbatch = mtf.Dimension(\"batch\", b)\nio = mtf.Dimension(\"io\", d_io)\nhidden = mtf.Dimension(\"hidden\", d_h)\n# x.shape == [batch, io]\nw = mtf.get_variable(\"w\", shape=[io, hidden])\nbias = mtf.get_variable(\"bias\", shape=[hidden])\nv = mtf.get_variable(\"v\", shape=[hidden, io])\nh = mtf.relu(mtf.einsum(x, w, output_shape=[batch, hidden]) + bias)\ny = mtf.einsum(h, v, output_shape=[batch, io])\n...\n\nThe code above defines only the mathematical model. We now discuss several different computation\nlayouts. Each will produce identical results, but will have different performance characteristics. We\nalso provide illustrations of the layouts in the supplementary materials (Section S.1).\n\n8.1 Data-Parallel Layout\n\nTo train the above model in data-parallel mode on a mesh of n processors, we would define:\n\nmesh_shape = [(\"all\", n)]\ncomputation_layout = [(\"batch\", \"all\")]\n\nWhen the Mesh-TensorFlow graph is compiled with this layout, the parameter tensors w, v, and bias\nare replicated on all processors, but the activation matrices x, h, y, etc. are split across the batch\ndimension. For example, each processor keeps a slice of x with shape $[b/n, d_{io}]$.\nThere is no inter-processor communication in the forward pass. 
However, the gradient computations\nfor the parameters are mtf.einsum operations which reduce out the batch dimension, and hence\nproduce Allreduce operations when they are compiled. The number of values allreduced per\nprocessor is equal to the number of parameters, approximately $2 d_{io} d_h$.\n\n8.2 Model-Parallel Layout\n\nRather than splitting the batch, we can split the units in the hidden layer:\n\nmesh_shape = [(\"all\", n)]\ncomputation_layout = [(\"hidden\", \"all\")]\n\nWhen the Mesh-TensorFlow graph is compiled with this layout, the input and output layers x and y\nare replicated on all processors, but the hidden activations h and the parameter tensors w, v and bias\nare all split across the hidden dimension. For example, each processor keeps a slice of w with shape\n$[d_{io}, d_h/n]$ and a slice of v with shape $[d_h/n, d_{io}]$.\nWhen computing y, the split hidden dimension is reduced out. Consequently, the results of that\ncomputation get allreduced across all processors. A similar allreduce happens in computing the\ngradients on x. In all, the number of values allreduced per processor is $2 b d_{io}$.\n\n8.3 Data-Parallel, Model-Parallel Layouts\n\nOn a two-dimensional mesh of r \u00d7 c processors, we can employ both data-parallelism and\nmodel-parallelism:\n\nmesh_shape = [(\"rows\", r), (\"cols\", c)]\ncomputation_layout = [(\"batch\", \"rows\"), (\"hidden\", \"cols\")]\n\nIn this layout, each row of processors handles a fraction of the batch, while each column of processors\nhandles a fraction of the hidden units. Each processor keeps a slice of x with shape $[b/r, d_{io}]$, with\nprocessors in the same row having identical slices. The hidden activation tensor h is tiled in two\ndimensions, with each processor keeping a slice with shape $[b/r, d_h/c]$.\nThis layout causes partitioned-allreduce operations in several places. 
For example, in computing y,\nwe reduce out the hidden dimension, which is split over the cols dimension of the mesh, so the\nresults of the operation need to be summed up by processor-column, as opposed to over the entire\nmesh. In all, the number of values allreduced per processor is $2 b d_{io}/r + 2 d_{io} d_h/c$.\nIf we have a three-dimensional mesh of processors, we can even split the computation in three\ndimensions:\n\nmesh_shape = [(\"rows\", r), (\"cols\", c), (\"planes\", p)]\ncomputation_layout = [\n    (\"batch\", \"rows\"), (\"hidden\", \"cols\"), (\"io\", \"planes\")]\n\nIn this case, every matrix in the computation is tiled across two mesh dimensions and replicated in\nthe third, and every einsum requires an allreduce across one mesh dimension.\n\n8.4 Inefficient Layouts\n\nFor a computation layout to be efficient, all expensive operations need to be split (as opposed to\nreplicated) across all mesh dimensions. For example, the empty layout below produces correct results,\nbut since it replicates all computation on every processor, it saves no time or memory. A general rule\nis that any expensive einsum operation should have one input dimension that is split across each\nmesh dimension.\n\nmesh_shape = [(\"all\", n)]\ncomputation_layout = []\n\n8.5 Illegal Layouts\n\nThe computation layout below is illegal, because it causes the tensor h to have two dimensions which\nare split across the same dimension of the mesh.\n\nmesh_shape = [(\"all\", n)]\ncomputation_layout = [(\"batch\", \"all\"), (\"hidden\", \"all\")]\n\n8.6 Performance Comparison\n\n
Layout | Comp. Time | Comm. Time | communication/computation | Memory/Processor\n[] | $b d_{io} d_h$ | 0 | 0 | $b d_{io} + b d_h + d_{io} d_h$\n[(\"batch\", \"all\")] | $b d_{io} d_h / n$ | $d_{io} d_h$ | $n/b$ | $(b/n) d_{io} + (b/n) d_h + d_{io} d_h$\n[(\"hidden\", \"all\")] | $b d_{io} d_h / n$ | $d_{io} b$ | $n/d_h$ | $b d_{io} + b d_h/n + d_{io} d_h/n$\n[(\"batch\", \"rows\"), (\"hidden\", \"cols\")] | $b d_{io} d_h / (rc)$ | $d_{io} (b/r + d_h/c)$ | $c/d_h + r/b$ | $(b/r) d_{io} + (b/r)(d_h/c) + d_{io} (d_h/c)$\n[(\"batch\", \"rows\"), (\"hidden\", \"cols\"), (\"io\", \"planes\")] | $b d_{io} d_h / (rcp)$ | $(b/r)(d_{io}/p) + (b/r)(d_h/c) + (d_{io}/p)(d_h/c)$ | $c/d_h + p/d_{io} + r/b$ | $(b/r)(d_{io}/p) + (b/r)(d_h/c) + (d_{io}/p)(d_h/c)$\n\nTable 1: Computation, communication and memory costs for different layouts of the computation in\nAlgorithm 1. Constant factors and lower-order terms are dropped.\n\nTable 1 shows the computational costs associated with our example computation layouts. The\ncomputation time is dominated by that of einsum operations. The communication time comes from\nthe Allreduce operations, which are necessary whenever the inner dimension of einsum is split.\nAssuming that the mesh has physical links between all pairs of logically adjacent processors, each\nAllreduce operation can be done in time proportional to the size of one slice divided by the per-link\nnetwork bandwidth [5].\nThe network-boundedness of the computation is proportional to the value shown in the table column\nmarked communication/computation, with the constant of proportionality depending on the ratio of communication\nand computation speeds on the given hardware. In the data-parallel layout, the value is $n/b$, the inverse\nof the per-processor batch size. Performance suffers if the per-processor batch is too small. In\nthe model-parallel layout, the value is $n/d_h$, the inverse of the number of hidden units per processor.\nPerformance suffers if the hidden layer is sliced too finely. For good performance, batch size is\nirrelevant, but we need the hidden layer to get larger as we increase the number of processors. In the\nfirst data-parallel, model-parallel layout, the value is $c/d_h + r/b$. 
In this layout, we can quadratically\nincrease the number of processors while only linearly increasing the batch size and hidden layer\nsizes necessary to maintain good efficiency. The final layout lets us cubically increase the number of\nprocessors in a 3-dimensional mesh, while only linearly increasing the batch size and the layer sizes.\n\n9 Model-Parallel \"Transformer\"\n\nWe implemented a model-parallel layout of the Transformer attention-based sequence-to-sequence\nmodel described in [16]. The complete implementation is available in the tensor2tensor library on\ngithub. The layout is given by:\n\nmesh_shape = [(\"all\", n)]\ncomputation_layout = [\n    (\"vocab\", \"all\"), (\"d_ff\", \"all\"), (\"heads\", \"all\")]\n\nThat is, the dimensions representing the vocabulary size, the size of the feed-forward hidden layer,\nand the number of attention heads are each split across all processors. This layout works because\nevery expensive operation in the model has exactly one of these dimensions, and no tensor in the\nmodel has more than one. Similarly to the model-parallel layout for our example network (Section\n8.2), network-boundedness and memory usage per processor remain constant if we scale all of these\ndimensions proportionally to the number of processors. We did just this, training transformer models\nwith ever larger hidden layers and numbers of attention heads on ever larger TPU clusters (we did not\nincrease the vocabulary size). As expected, we saw very similar performance characteristics between\nthe models. 
This scaling turns out to be highly beneficial to model quality (Section 9.1).\nTo use even more processors, we combined this model-parallelism with data parallelism, splitting\nthe batch across one dimension of a 2-dimensional TPU mesh and the dimensions described above\nacross the other dimension of the mesh:\n\nmesh_shape = [(\"rows\", r), (\"cols\", c)]\ncomputation_layout = [(\"batch\", \"rows\"), (\"vocab\", \"cols\"),\n(\"d_ff\", \"cols\"), (\"heads\", \"cols\")]\n\nThis layout maintains constant performance if the batch size is scaled proportionally to r and the\nmentioned model dimensions are scaled proportionally to c. Using this layout, we trained Transformer\nmodels with feed-forward hidden dimensions up to 262144 and up to 256 attention heads on\n2-dimensional TPUv2 meshes of up to 16x32=512 cores, maintaining computational efficiency of over\n50% (6 PFLOP/s out of a maximum 11.5 PFLOP/s) on the largest models.\n\n9.1 Experiments and Results\n\nTo examine the benefit of scaling the Transformer model in the manner suggested by the previous\nsection, we trained such models on machine translation and language modeling tasks. Results are\ngiven in Tables 2 and 3.\nFor the billion-word language modeling benchmark, we trained the models for 10 epochs. The\nlargest model (4.9B parameters) took 13 hours to train on a 512-core TPUv2 cluster. Batch size\nfor all models was 256 sequences of 256 tokens each (each sequence was the concatenation of\nmultiple training sentences). The batch was split along the mesh dimension of size 16 and the model\ndimensions were split along the mesh dimension of size 32. Per-word dev-perplexity for the largest\nmodel was 24.0, but dropped to 23.5 when the model was evaluated with the logits multiplied by\n0.9 (likely due to overfitting). This represents the best published result on this dataset. As expected,\nperplexity was lower for larger models. 
We have included random samples from these models in the\nsupplementary materials (Section S.3). On the languagemodel_wiki_noref_v128k_l1k dataset\nfrom the Tensor2Tensor library\u00b9, consisting of over 5 billion tokens of text from Wikipedia, perplexity\ncontinued to improve significantly with a model size of 5 billion parameters.\nOn the WMT14 En-Fr translation task (Table 3), we trained the models for 3 epochs. The largest model\n(2.9B parameters) was trained for 22 hours on a 128-core TPUv2 cluster. Quality improved with\nmodel size, with the largest model achieving a BLEU score of 43.9 (evaluated using sacrebleu), the\nbest published result to date. For the WMT14 En-De dataset, gains from model size were smaller,\npresumably due to the small size of the training data.\nAdditional details about the configurations for these experiments are available as part of the\ntensor2tensor library on github.\n\nTable 2: Transformer-Decoder Language Models: $d_{model} = 1024$, $d_k = d_v = 256$\n\n$d_{ff}$ | heads | Parameters (Billions) | Billion-Word Benchmark Word-Perplexity | Wikipedia Subword-Perplexity\n4096 | 4 | 0.14 | 35.0 | 8.74\n8192 | 8 | 0.22 | 31.7 | 8.03\n16384 | 16 | 0.37 | 28.9 | 7.44\n32768 | 32 | 0.67 | 26.8 | 6.99\n65516 | 64 | 1.28 | 25.1 | 6.55\n131072 | 128 | 2.48 | 24.1 | 6.24\n262144 | 256 | 4.90 | 24.0 (23.5) | 6.01\nPrev Best DNN [15] | | 6.5 | 28.0 |\nBest DNN Ensemble [13] | | > 100 | 26.1 |\nBest Ensemble (different methods) [13] | | | 23.7 |\n\n\u00b9 No published results exist for this dataset.\n\nTable 3: Transformer Machine-Translation Results. 
$d_{model} = 1024$\n\n$d_{ff}$ | heads | $d_k, d_v$ | Parameters (Billions) | WMT14 EN-DE BLEU | WMT14 EN-FR BLEU\n2048 | 4 | 128 | 0.15 | 25.5 | 41.8\n4096 | 8 | 128 | 0.24 | 26.5 | 42.5\n8192 | 16 | 128 | 0.42 | 27.1 | 43.3\n16384 | 32 | 128 | 0.77 | 27.5 | 43.5\n32768 | 64 | 128 | 1.48 | 27.5 | 43.8\n65536 | 128 | 128 | 2.89 | 26.7 | 43.9\n4096 | 16 | 64 | 0.21 | 28.4 | 41.8 | [16]\n\n10 Related Work\n\nA large part of deep learning computations is a series of matrix multiplications and tensor contractions\n(Einsums). Distributed matrix multiplication is a well-studied problem in high performance computing.\nEfficient algorithms partition the computational space, instead of partitioning work by the output\nmatrix/tensor (owners compute), to minimize communication. This technique is sometimes called\niteration space tiling [2], replication [6], or task parallelism [11]. Mesh-TensorFlow can express\na wide range of uniform partitionings of the iteration space and therefore can adopt many best\nknown mappings, e.g., 3D [3, 1] and 2.5D [6] algorithms for square matrices, CARMA [9] for\nrectangular matrices, 1.5D [14] algorithm for matrices with different sparsities, best tile sizes for\ndirect convolutions [17], etc., although sometimes with higher memory requirements. Furthermore,\nin most existing work, when multiple multiplications are composed together, the user has to specify\nthe data layout for each matrix separately [22]. Mesh-TensorFlow lets the user name the dimension\nto split, simplifying the process and allowing for much easier mapping explorations. Feature-wise,\nMesh-TensorFlow shares many similarities with the Cyclops Tensor Framework [10], a distributed\ntensor contraction library originally developed for quantum chemistry applications, which also\nsupports replication and arbitrary mappings.\nIn the context of deep learning, partitioning the iteration space, e.g., interpolating between data and\nmodel parallelism, is relatively new. Gholami et al. 
[18] analytically showed that using both data and\nmodel parallelism at the same time can be more bene\ufb01cial than using just one of them. Building on\ntop of 1.5D matrix multiplication algorithms, their algorithm can support replication and arbitrary\nprocessor grid shapes. However, they only explored the parallelization of AlexNet [8] and they\nhave not implemented the algorithm. Jia et al. [20, 19] implemented a framework that uses cost\nmodeling to pick the best parallelization strategy, including how to partition work for each operation.\nTheir parallelizable dimensions are de\ufb01ned as the set of all divisible dimensions in the output tensor\n(owners compute), and therefore their mapping can be suboptimal in terms of communication. We\nexpand on this in the supplementary materials (Section S.2).\n\n11 Future Work\n\nThe Mesh-TensorFlow library is available at https://github.com/tensorflow/mesh and is\nunder active development. Some potential areas for development are:\n\n\u2022 Automated search for optimal computation layout.\n\u2022 Implementations of different models and operations. For example, convolutions on spatially-\npartitioned tensors will require the communication of \"halo\" regions, as described in [21].\n\u2022 Implementation of SPMD programming on CPU/GPU clusters.\n\n12 Conclusion\n\nIn this paper, we introduce the Mesh-TensorFlow language, facilitating a broad class of SPMD\ndistributed tensor computations. Applying Mesh-TensorFlow to the Transformer model, we are able\nto train models with 5 billion parameters on up to 512-core clusters, establishing new state-of-the-art\nresults for WMT14 En-Fr translation task and the One Billion Word language modeling benchmark.\n\n8\n\n\fReferences\n\n[1]\n\nJarle Berntsen. \u201cCommunication ef\ufb01cient matrix multiplication on hypercubes\u201d. In: Parallel\ncomputing 12.3 (1989), pp. 335\u2013342.\n\n[2] M. Wolfe. \u201cMore Iteration Space Tiling\u201d. 
In: Proceedings of the 1989 ACM/IEEE Conference\non Supercomputing. Supercomputing \u201989. Reno, Nevada, USA: ACM, 1989, pp. 655\u2013664.\nISBN: 0-89791-341-8. DOI: 10.1145/76263.76337. URL: http://doi.acm.org/10.1145/76263.76337.\n\n[3] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. \u201cCommunication Complexity of PRAMs\u201d.\nIn: Theor. Comput. Sci. 71.1 (Mar. 1990), pp. 3\u201328. ISSN: 0304-3975. DOI: 10.1016/0304-3975(90)90188-N.\nURL: http://dx.doi.org/10.1016/0304-3975(90)90188-N.\n\n[4] Pitch Patarasuk and Xin Yuan. \u201cBandwidth optimal all-reduce algorithms for clusters of\nworkstations\u201d. In: 69 (Feb. 2009), pp. 117\u2013124.\n\n[5] Nikhil Jain and Yogish Sabharwal. \u201cOptimal bucket algorithms for large mpi collectives\non torus interconnects\u201d. In: Proceedings of the 24th ACM International Conference on\nSupercomputing. 2010, pp. 27\u201336.\n\n[6] Edgar Solomonik and James Demmel. \u201cCommunication-Optimal Parallel 2.5D Matrix Multi-\nplication and LU Factorization Algorithms\u201d. In: Euro-Par. Vol. 6853. 2011, pp. 90\u2013109. ISBN:\n978-3-642-23396-8.\n\n[7] Jeffrey Dean et al. \u201cLarge Scale Distributed Deep Networks\u201d. In: Proceedings of the 25th\nInternational Conference on Neural Information Processing Systems - Volume 1. NIPS\u201912.\nLake Tahoe, Nevada: Curran Associates Inc., 2012, pp. 1223\u20131231.\nURL: http://dl.acm.org/citation.cfm?id=2999134.2999271.\n\n[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. \u201cImagenet classification with deep\nconvolutional neural networks\u201d. In: Advances in neural information processing systems. 2012,\npp. 1097\u20131105.\n\n[9] James Demmel et al. \u201cCommunication-optimal parallel recursive rectangular matrix multi-\nplication\u201d. In: Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International\nSymposium on. IEEE. 2013, pp. 261\u2013272.\n\n[10] Edgar Solomonik et al. 
\u201cA massively parallel tensor contraction framework for coupled-cluster\ncomputations\u201d. In: Journal of Parallel and Distributed Computing 74.12 (2014), pp. 3176\u20133190.\n\n[11] Justus A. Calvin, Cannada A. Lewis, and Edward F. Valeev. \u201cScalable Task-based Algorithm\nfor Multiplication of Block-rank-sparse Matrices\u201d. In: Proceedings of the 5th Workshop\non Irregular Applications: Architectures and Algorithms. IA3 \u201915. Austin, Texas: ACM,\n2015, 4:1\u20134:8. ISBN: 978-1-4503-4001-4. DOI: 10.1145/2833179.2833186.\nURL: http://doi.acm.org/10.1145/2833179.2833186.\n\n[12] Martin Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.\nSoftware available from tensorflow.org. 2015. URL: https://www.tensorflow.org/.\n\n[13] Rafal Jozefowicz et al. \u201cExploring the Limits of Language Modeling\u201d. In: CoRR\nabs/1602.02410 (2016). arXiv: 1602.02410. URL: http://arxiv.org/abs/1602.02410.\n\n[14] Penporn Koanantakool et al. \u201cCommunication-Avoiding Parallel Sparse-Dense Matrix-Matrix\nMultiplication\u201d. In: 2016 IEEE International Parallel and Distributed Processing Symposium\n(IPDPS). May 2016, pp. 842\u2013853. DOI: 10.1109/IPDPS.2016.117.\n\n[15] Noam Shazeer et al. \u201cOutrageously Large Neural Networks: The Sparsely-Gated Mixture-of-\nExperts Layer\u201d. In: CoRR abs/1701.06538 (2017). arXiv: 1701.06538.\nURL: http://arxiv.org/abs/1701.06538.\n\n[16] Ashish Vaswani et al. \u201cAttention Is All You Need\u201d. In: CoRR abs/1706.03762 (2017). arXiv:\n1706.03762. URL: http://arxiv.org/abs/1706.03762.\n\n[17] James Demmel and Grace Dinh. \u201cCommunication-Optimal Convolutional Neural Nets\u201d. In:\narXiv preprint arXiv:1802.06905 (2018).\n\n[18] Amir Gholami et al. \u201cIntegrated Model, Batch, and Domain Parallelism in Training Neural\nNetworks\u201d. In: SPAA\u201918: 30th ACM Symposium on Parallelism in Algorithms and Architectures.\n2018. 
URL: http://eecs.berkeley.edu/~aydin/integrateddnn_spaa2018.pdf.\n\n[19] Zhihao Jia, Matei Zaharia, and Alex Aiken. \u201cBeyond Data and Model Parallelism for Deep Neural Networks\u201d. In: arXiv preprint arXiv:1807.05358 (2018).\n\n[20] Zhihao Jia et al. \u201cExploring Hidden Dimensions in Parallelizing Convolutional Neural Networks\u201d. In: arXiv preprint arXiv:1802.04924 (2018).\n\n[21] Peter Jin, Boris Ginsburg, and Kurt Keutzer. Spatially Parallel Convolutions. 2018. URL: https://openreview.net/forum?id=S1Yt0d1vG.\n\n[22] Penporn Koanantakool et al. \u201cCommunication-Avoiding Optimization Methods for Distributed Massive-Scale Sparse Inverse Covariance Estimation\u201d. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics. Vol. 84. Playa Blanca, Lanzarote, Canary Islands: PMLR, Apr. 2018, pp. 1376\u20131386. URL: http://proceedings.mlr.press/v84/koanantakool18a.html.", "award": [], "sourceid": 6669, "authors": [{"given_name": "Noam", "family_name": "Shazeer", "institution": "Google"}, {"given_name": "Youlong", "family_name": "Cheng", "institution": "Google"}, {"given_name": "Niki", "family_name": "Parmar", "institution": "Google"}, {"given_name": "Dustin", "family_name": "Tran", "institution": "Google Brain"}, {"given_name": "Ashish", "family_name": "Vaswani", "institution": "Google Brain"}, {"given_name": "Penporn", "family_name": "Koanantakool", "institution": "Google"}, {"given_name": "Peter", "family_name": "Hawkins", "institution": "google.com"}, {"given_name": "HyoukJoong", "family_name": "Lee", "institution": "Google"}, {"given_name": "Mingsheng", "family_name": "Hong", "institution": "google.com"}, {"given_name": "Cliff", "family_name": "Young", "institution": "google.com"}, {"given_name": "Ryan", "family_name": "Sepassi", "institution": "Google"}, {"given_name": "Blake", "family_name": "Hechtman", "institution": "Google"}]}