{"title": "Blockwise Parallel Decoding for Deep Autoregressive Models", "book": "Advances in Neural Information Processing Systems", "page_first": 10086, "page_last": 10095, "abstract": "Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make different trade-offs between the amount of computation needed per layer and the length of the critical path at training time, generation still remains an inherently sequential process. To overcome this limitation, we propose a novel blockwise parallel decoding scheme in which we make predictions for multiple time steps in parallel then back off to the longest prefix validated by a scoring model. This allows for substantial theoretical improvements in generation speed when applied to architectures that can process output sequences in parallel. We verify our approach empirically through a series of experiments using state-of-the-art self-attention models for machine translation and image super-resolution, achieving iteration reductions of up to 2x over a baseline greedy decoder with no loss in quality, or up to 7x in exchange for a slight decrease in performance. In terms of wall-clock time, our fastest models exhibit real-time speedups of up to 4x over standard greedy decoding.", "full_text": "Blockwise Parallel Decoding for\n\nDeep Autoregressive Models\n\nMitchell Stern\u2217\n\nUniversity of California, Berkeley\n\nmitchell@berkeley.edu\n\nJakob Uszkoreit\n\nGoogle Brain\n\nusz@google.com\n\nNoam Shazeer\nGoogle Brain\n\nnoam@google.com\n\nAbstract\n\nDeep autoregressive sequence-to-sequence models have demonstrated impressive\nperformance across a wide variety of tasks in recent years. 
While common architecture classes such as recurrent, convolutional, and self-attention networks make different trade-offs between the amount of computation needed per layer and the length of the critical path at training time, generation still remains an inherently sequential process. To overcome this limitation, we propose a novel blockwise parallel decoding scheme in which we make predictions for multiple time steps in parallel then back off to the longest prefix validated by a scoring model. This allows for substantial theoretical improvements in generation speed when applied to architectures that can process output sequences in parallel. We verify our approach empirically through a series of experiments using state-of-the-art self-attention models for machine translation and image super-resolution, achieving iteration reductions of up to 2x over a baseline greedy decoder with no loss in quality, or up to 7x in exchange for a slight decrease in performance. In terms of wall-clock time, our fastest models exhibit real-time speedups of up to 4x over standard greedy decoding.

1 Introduction

Neural autoregressive sequence-to-sequence models have become the de facto standard for a wide variety of tasks, including machine translation, summarization, and speech synthesis (Vaswani et al., 2017; Rush et al., 2015; van den Oord et al., 2016). One common feature among recent architectures such as the Transformer and convolutional sequence-to-sequence models is an increased capacity for parallel computation, making them a better fit for today’s massively parallel hardware accelerators (Vaswani et al., 2017; Gehring et al., 2017). 
While advances in this direction have allowed for significantly faster training, outputs are still generated one token at a time during inference, posing a substantial challenge for many practical applications (Oord et al., 2017).

In light of this limitation, a growing body of work is concerned with different approaches to accelerating generation for autoregressive models. Some general-purpose methods include probability density distillation (Oord et al., 2017), subscaling (Kalchbrenner et al., 2018), and decomposing the problem into the autoregressive generation of a short sequence of discrete latent variables followed by a parallel generation step conditioned on the discrete latents (Kaiser et al., 2018). Other techniques are more application-specific, such as the non-autoregressive Transformer for machine translation (Gu et al., 2018). While speedups of multiple orders of magnitude have been achieved on tasks with high output locality like speech synthesis, to the best of our knowledge, published improvements in machine translation either show much more modest speedups or come at a significant cost in quality.

∗Work performed while the author was an intern at Google Brain.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this work, we propose a simple algorithmic technique that exploits the ability of some architectures, such as the Transformer (Vaswani et al., 2017), to score all output positions in parallel. We train variants of the autoregressive model to make predictions for multiple future positions beyond the next position modeled by the base model. At test time, we employ these proposal models to independently and in parallel make predictions for the next several positions. We then determine the longest prefix of these predictions that would have been generated under greedy decoding by scoring each position in parallel using the base model. 
If the length of this prefix is greater than one, we are able to skip one or more iterations of the greedy decoding loop.

In our experiments, our technique approximately doubles generation speed at no loss in quality relative to greedy decoding from an autoregressive model. Together with knowledge distillation and approximate decoding strategies, we can increase the speedup in terms of decoding iterations to up to five-fold at a modest sacrifice in quality for machine translation and seven-fold for image super-resolution. These correspond to wall-clock speedups of three-fold and four-fold, respectively. In contrast to the other previously mentioned techniques for improving generation speed, our approach can furthermore be implemented on top of existing models with minimal modifications. Our code is publicly available in the open-source Tensor2Tensor library (Vaswani et al., 2018).

2 Greedy Decoding

In a sequence-to-sequence problem, we are given an input sequence x = (x1, . . . , xn), and we would like to predict the corresponding output sequence y = (y1, . . . , ym). These sequences might be source and target sentences in the case of machine translation, or low-resolution and high-resolution images in the case of image super-resolution. One common approach to this problem is to learn an autoregressive scoring model p(y | x) that decomposes according to the left-to-right factorization

log p(y | x) = ∑_{j=0}^{m−1} log p(yj+1 | y≤j, x).

The inference problem is then to find y∗ = argmax_y p(y | x). Since the output space is exponentially large, exact search is intractable. As an approximation, we can perform greedy decoding to obtain a prediction ŷ as follows. Starting with an empty sequence ŷ and j = 0, we repeatedly extend our prediction with the highest-scoring token ŷj+1 = argmax_{yj+1} p(yj+1 | ŷ≤j, x) and set j ← j + 1 until a termination condition is met. 
For language generation problems, we typically stop once a special end-of-sequence token has been generated. For image generation problems, we simply decode for a fixed number of steps.

3 Blockwise Parallel Decoding

Standard greedy decoding takes m steps to produce an output of length m, even for models that can efficiently score sequences using a constant number of sequential operations. While brute-force enumeration of output extensions longer than one token is intractable when the size of the vocabulary is large, we can still attempt to exploit parallelism within the model by training a set of auxiliary models to propose candidate extensions.

Let the original model be p1 = p, and suppose that we have also learned a collection of auxiliary models p2, . . . , pk for which pi(yj+i | y≤j, x) is the probability of the (j + i)th token being yj+i given the first j tokens. We propose the following blockwise parallel decoding algorithm (illustrated in Figure 1), which is guaranteed to produce the same prediction ŷ that would be found under greedy decoding but uses as few as m/k steps. As before, we start with an empty prediction ŷ and set j = 0. Then we repeat the following three substeps until the termination condition is met:

• Predict: Get the block predictions ŷj+i = argmax_{yj+i} pi(yj+i | ŷ≤j, x) for i = 1, . . . , k.
• Verify: Find the largest k̂ such that ŷj+i = argmax_{yj+i} p1(yj+i | ŷ≤j+i−1, x) for all 1 ≤ i ≤ k̂. Note that k̂ ≥ 1 by the definition of ŷj+1.
• Accept: Extend ŷ with ŷj+1, . . . 
, ŷj+k̂ and set j ← j + k̂.

Figure 1: The three substeps of blockwise parallel decoding. In the predict substep, the greedy model and two proposal models independently and in parallel predict “in”, “the”, and “bus”. In the verify substep, the greedy model scores each of the three independent predictions, conditioning on the previous independent predictions where applicable. When using a Transformer or convolutional sequence-to-sequence model, these three computations can be done in parallel. The highest-probability prediction for the third position is “car”, which differs from the independently predicted “bus”. In the accept substep, ŷ is hence extended to include only “in” and “the” before making the next k independent predictions.

In the predict substep, we find the local greedy predictions of our base scoring model p1 and the auxiliary proposal models p2, . . . , pk. Since these are disjoint models, each prediction can be computed in parallel, so there should be little time lost compared to a single greedy prediction.

Next, in the verify substep, we find the longest prefix of the proposed length-k extension that would have otherwise been produced by p1. If the scoring model can process this sequence of k tokens in fewer than k steps, this substep will help save time overall provided more than one token is correct.

Lastly, in the accept substep, we extend our hypothesis with the verified prefix. 
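To make the loop concrete, here is a minimal Python sketch of the three substeps. This is an illustration rather than our actual implementation: the models are stand-in functions that return an argmax token given the input and the current prefix (the naming is ours), and the predict and verify computations, which run in parallel in practice, are written as ordinary loops.

```python
def blockwise_parallel_decode(models, x, max_len):
    """Greedy blockwise parallel decoding.

    models[i] plays the role of p_{i+1}: given the input x and the current
    prefix, it returns the argmax token for the (i+1)th position after the
    prefix.  The output is guaranteed to match plain greedy decoding with
    models[0].
    """
    y = []
    while len(y) < max_len:
        k = min(len(models), max_len - len(y))
        # Predict: block predictions from p_1, ..., p_k (independent,
        # so parallelizable in a real implementation).
        block = [models[i](x, y) for i in range(k)]
        # Verify: largest k_hat such that p_1 agrees with each proposed
        # token, conditioning on the previously proposed tokens.  k_hat >= 1
        # always, since block[0] is p_1's own greedy prediction.
        k_hat = 1
        while k_hat < k and block[k_hat] == models[0](x, y + block[:k_hat]):
            k_hat += 1
        # Accept: extend the hypothesis with the verified prefix.
        y.extend(block[:k_hat])
    return y
```

Because verification conditions each position on the proposed tokens before it, any disagreement truncates the block, so the output always matches a plain greedy decode with p1.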
By stopping early if the base model and the proposal models start to diverge in their predictions, we ensure that we will recover the same output that would have been produced by running greedy decoding with p1.

The potential of this scheme to improve decoding performance hinges crucially on the ability of the base model p1 to execute all predictions made in the verify substep in parallel. In our experiments we use the Transformer model (Vaswani et al., 2017). While the total number of operations performed during decoding is quadratic in the number of predictions, the number of necessarily sequential operations is constant regardless of output length. This allows us to execute the verify substep for a number of positions in parallel without spending additional wall-clock time.

4 Combined Scoring and Proposal Model

When using a Transformer for scoring, the version of our algorithm presented in Section 3 requires two model invocations per step: one parallel invocation of p1, . . . , pk in the prediction substep, and an invocation of p1 in the verification substep. This means that even with perfect auxiliary models, we will only reduce the number of model invocations from m to 2m/k instead of the desired m/k. As it turns out, we can further reduce the number of model invocations from 2m/k to m/k + 1 if we assume a combined scoring and proposal model, in which case the nth verification substep can be merged with the (n + 1)st prediction substep.

More specifically, suppose we have a single Transformer model which during the verification substep computes pi(yj+i′+i | ŷ≤j+i′, x) for all i = 1, . . . , k and i′ = 1, . . . , k in a constant number of operations. This can be implemented for instance by increasing the dimensionality of the final projection layer by a factor of k and computing k separate softmaxes per position. 
Invoking the model after plugging in the k future predictions from the prediction substep yields the desired outputs.

Figure 2: Combining the scoring and proposal models allows us to merge the previous verification substep with the next prediction substep. This makes it feasible to call the model just once per iteration rather than twice, halving the number of model invocations required for decoding.

Under this setup, after k̂ has been computed during verification, we will have already computed pi(yj+k̂+i | y≤j+k̂, x) for all i = 1, . . . , k, which is exactly what is required for the prediction substep in the next iteration of decoding. Hence these substeps can be merged together, reducing the number of model invocations by a factor of two for all but the very first iteration.

Figure 2 illustrates the process. Note that while proposals have to be computed for every position during the verification substep, all predictions can still be made in parallel.

5 Approximate Inference

The approach to blockwise parallel decoding we have described so far produces the same output as a standard greedy decode. By relaxing the criterion used during verification, we can allow for additional speedups at the cost of potentially deviating from the greedy output.

5.1 Top-k Selection

Rather than requiring that a prediction exactly matches the scoring model’s prediction, we can instead ask that it lie within the top k items. 
To accomplish this, we replace the verification criterion with

ŷj+i ∈ top-k_{yj+i} p1(yj+i | ŷ≤j+i−1, x).

5.2 Distance-Based Selection

In problems where the output space admits a natural distance metric d, we can replace the exact match against the highest-scoring element with an approximate match:

d(ŷj+i, argmax_{yj+i} p1(yj+i | ŷ≤j+i−1, x)) ≤ ε.

In the case of image generation, we let d(u, v) = |u − v| be the absolute difference between intensities u and v within a given color channel.

5.3 Minimum Block Size

It is possible that the first non-greedy prediction within a given step is incorrect, in which case only a single token would be added to the hypothesis. To ensure a minimum speedup, we could require that at least 1 < ℓ ≤ k tokens be added during each decoding step. Setting ℓ = k would correspond to parallel decoding with blocks of fixed size k.

Figure 3: The modification we make to a Transformer to obtain a combined scoring and prediction model. To make predictions for the next k positions instead of one position, we insert a multi-output feedforward layer with residual connections after the original decoder output layer, then apply the original vocabulary projection to all outputs.

6 Implementation and Training

We implement the combined scoring and proposal model described in Section 4 for our experiments. Given a baseline Transformer model pre-trained for a given task, we insert a single feedforward layer with hidden size k × dhidden and output size k × dmodel between the decoder output and the final projection layer, where dhidden and dmodel are the same layer dimensions used in the rest of the network. 
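Shape-wise, the inserted layer can be sketched as follows. This is a toy NumPy illustration under our own naming, using a ReLU feedforward layer and a single dense output matrix; the exact wiring in the Tensor2Tensor implementation may differ.

```python
import numpy as np

def k_position_logits(dec_out, w_hidden, w_out, w_vocab, k):
    """Map final decoder states [T, d_model] to logits [T, k, V].

    w_hidden: [d_model, k * d_hidden]    inserted feedforward layer
    w_out:    [k * d_hidden, k * d_model]
    w_vocab:  [d_model, V]               original (shared) vocabulary projection
    A residual connection ties the decoder state to each of the k outputs.
    """
    t, d_model = dec_out.shape
    hidden = np.maximum(dec_out @ w_hidden, 0.0)       # [T, k * d_hidden]
    outputs = (hidden @ w_out).reshape(t, k, d_model)  # k output vectors per position
    outputs = outputs + dec_out[:, None, :]            # residual to each of the k outputs
    return outputs @ w_vocab                           # [T, k, V]: k softmaxes per position
```

With the inserted weights at zero, the residual path alone reproduces the base model's logits in every head, which is a natural starting point before the proposal heads are trained.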
A residual connection between the input and each of the k outputs is included. The original projection layer is identically applied to each of the k outputs to obtain the logits for p1, . . . , pk. See Figure 3 for an illustration.

Due to memory constraints at training time, we are unable to use the mean of the k cross entropy losses corresponding to p1, . . . , pk as the overall loss. Instead, we select one of these sub-losses uniformly at random for each minibatch to obtain an unbiased estimate of the full loss. At inference time, all logits can be computed in parallel with marginal cost relative to the base model.

6.1 Fine Tuning

An important question is whether or not the original parameters of the pre-trained model should be fine tuned for the modified joint prediction task. If they are kept frozen, we ensure that the quality of the original model is retained, perhaps at the cost of less accurate future prediction. If they are fine tuned, we might improve the model’s internal consistency but suffer a loss in terms of final performance. We investigate both options in our experiments.

6.2 Knowledge Distillation

The practice of knowledge distillation (Hinton et al., 2015; Kim and Rush, 2016), in which one trains a model on the outputs of another model, has been shown to improve performance on a variety of tasks, potentially even when teacher and student models have the same architecture and model size (Furlanello et al., 2017). We posit that sequence-level distillation could be especially useful for blockwise parallel decoding, as it tends to result in a training set with greater predictability due to consistent mode breaking from the teacher model. For our language task, we perform experiments using both the original training data and distilled training data to determine the extent of the effect. 
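The randomized sub-loss selection described in Section 6 is simple to state in code. The sketch below (our naming, not the Tensor2Tensor code) also shows why it is unbiased: the expectation of a uniformly sampled head loss equals the mean over all k heads, which is the full loss we would otherwise average.

```python
import random

def full_block_loss(head_losses):
    """Reference loss: the mean of the k per-head cross entropies (memory-heavy)."""
    return sum(head_losses) / len(head_losses)

def sampled_block_loss(head_losses, rng):
    """Memory-friendly surrogate: one head sampled uniformly per minibatch.
    Its expectation over minibatches equals full_block_loss(head_losses)."""
    return rng.choice(head_losses)
```

Averaging the sampled loss over many minibatches converges to the full loss, at roughly 1/k of the memory cost per step.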
The distilled data is produced via beam decoding using a pre-trained model with the same hyperparameters as the baseline but a different random seed. The beam search hyperparameters are those from Vaswani et al. (2017).

7 Experiments

We implement all our experiments using the open-source Tensor2Tensor framework (Vaswani et al., 2018). Our code is publicly available within this same library.

7.1 Machine Translation

For our machine translation experiments, we use the WMT 2014 English-German translation dataset. Our baseline model is a Transformer trained for 1,000,000 steps on 8 P100 GPUs using the transformer_base hyperparameter set in Tensor2Tensor. Using greedy decoding, it attains a BLEU score of 25.56 on the newstest2013 development set.

On top of this, we train a collection of combined scoring and proposal Transformer models for various block sizes k; see Section 6 for implementation details. Each model is trained for an additional 1,000,000 steps on the same hardware, either on the original training data or on distilled data obtained from beam search predictions from a separate baseline run. Optimizer accumulators for running averages of first and second moments of the gradient are reset for the new training runs, as is the learning rate schedule.

We measure the BLEU score and the mean accepted block size k̂ on the development set under a variety of settings. 
Results are reported in Table 1.¹

k | Regular | Distillation | Fine Tuning | Both
1 | 26.00 / 1.00 | 26.41 / 1.00 | |
2 | 25.81 / 1.51 | 26.52 / 1.55 | 25.74 / 1.78 | 26.58 / 1.88
4 | 25.84 / 1.73 | 26.31 / 1.85 | 25.05 / 2.69 | 26.36 / 3.27
6 | 26.08 / 1.76 | 26.26 / 1.90 | 24.69 / 2.98 | 26.18 / 4.18
8 | 25.82 / 1.76 | 26.25 / 1.91 | 24.27 / 3.01 | 26.11 / 4.69
10 | 25.69 / 1.74 | 26.34 / 1.90 | 23.51 / 2.87 | 25.60 / 4.95

Table 1: Results on the newstest2013 development set for English-German translation. Each cell lists BLEU score and mean accepted block size. Larger BLEU scores indicate higher translation quality, and larger mean accepted block sizes indicate fewer decoding iterations. The data from the table is also visually depicted in a scatter plot on the right.

From these results, we make several observations. For the regular setup with gold training data and frozen baseline model parameters, the mean block size reaches a peak of 1.76, showing that speed can be improved without sacrificing model quality. When we instead use distilled data, the BLEU score at the same block size increases by 0.43 and the mean block size reaches 1.91, showing slight improvements on both metrics. Next, comparing the results in the first two columns to their counterparts with parameter fine tuning in the last two columns, we see large increases in mean block size, albeit at the expense of some performance for larger k. The use of distilled data lessens the severity of the performance drop and allows for more accurate forward prediction, lending credence to our earlier intuition. The model with the highest mean block size of 4.95 is only 0.81 BLEU points worse than the initial model trained on distilled data.

We visualize the trade-off between BLEU score and mean block size in the plot next to Table 1. 
For both the original and the distilled training data, one can select a setting that optimizes for highest quality, fastest speed, or something in between. Quality degradation for larger k is much less pronounced when distilled data is used. The smooth frontier in both cases gives practitioners the option to choose a setting that best suits their needs.

We also repeat the experiments from the last column of Table 1 using the top-k approximate selection criterion of Section 5.1. For top-2 approximate decoding, we obtain the results k = 2: 26.49 / 1.92, k = 4: 26.22 / 3.47, k = 6: 25.90 / 4.59, k = 8: 25.71 / 5.34, k = 10: 25.04 / 5.67, demonstrating additional gains in accepted block size at the cost of further decrease in BLEU. Results for top-3 approximate decoding follow a similar trend: k = 2: 26.41 / 1.93, k = 4: 26.14 / 3.52, k = 6: 25.56 / 4.69, k = 8: 25.41 / 5.52, k = 10: 24.68 / 5.91. On the other hand, experiments using a minimum block size of k = 2 or k = 3 as described in Section 5.3 exhibit much larger drops in BLEU score with only minor improvements in mean accepted block size, suggesting that the ability to accept just one token on occasion is important and that a hard lower bound is somewhat less effective.

¹ The BLEU scores in the first two columns vary slightly with k. This is because the final decoder layer is processed by a learned transformation for all predictions p1, p2, . . . , pk in our implementation rather than just p2, . . . , pk. Using an identity transformation for p1 instead would result in identical BLEU scores.

7.2 Image Super-Resolution

For our super-resolution experiments, we use the training and development data from the CelebA dataset (Liu et al., 2015). Our task is to generate a 32 × 32 pixel output image from an 8 × 8 pixel input. 
Our baseline model is an Image Transformer (Parmar et al., 2018) with 1D local attention trained for 1,000,000 steps on 8 P100 GPUs using the img2img_transformer_b3 hyperparameter set. As with our machine translation experiments, we train a collection of additional models with warm-started parameters for various block sizes k, both with and without fine tuning of the base model’s parameters. Here we train for an additional 250,000 steps.

We measure the mean accepted block size on the development set for each model. For the Image Transformer, an image is decomposed into a sequence of red, green, and blue intensities for each pixel in raster scan order, so each output token is an integer between 0 and 255. During inference, we either require an exact match with the greedy model or allow for an approximate match using the distance-based selection criterion from Section 5.2 with ε = 2. Our results are shown in Table 2.

k | Regular | Approximate | Fine Tuning | Both
1 | 1.00 | | |
2 | 1.07 | 1.24 | 1.59 | 1.96
4 | 1.08 | 1.36 | 2.11 | 3.75
6 | 1.09 | 1.38 | 2.23 | 5.25
8 | 1.09 | 1.49 | 2.17 | 6.36
10 | 1.10 | 1.40 | 2.04 | 6.79

Table 2: Results on the CelebA development set. Each cell lists the mean accepted block size during decoding; larger values indicate fewer decoding iterations.

We find that exact-match decoding for the models trained with frozen base parameters is perhaps overly stringent, barely allowing for any speedup for even the largest block size. Relaxing the acceptance criterion helps a small amount, though the mean accepted block size remains below 1.5 in all cases. The models with fine-tuned parameters fare somewhat better when exact-match decoding is used, achieving a mean block size of slightly over 2.2 in the best case. Finally, combining approximate decoding with fine tuning yields results that are substantially better than when either modification is applied on its own. 
For the smaller block sizes, we see mean accepted block sizes very close to the maximum achievable bound of k. For the largest block size of 10, the mean accepted block size reaches an impressive 6.79, indicating a nearly 7x reduction in decoding iterations.

To evaluate the quality of our results, we also ran a human evaluation in which workers on Mechanical Turk were shown pairs of decoder outputs for examples from the development set and were asked to pick which one they thought was more likely to have been taken by a camera. Within each pair, one image was produced from the model trained with k = 1 and frozen base parameters, and one image was produced from a model trained with k > 1 and fine-tuned base parameters. The images within each pair were generated from the same underlying input, and were randomly permuted to avoid bias. Results are given in Table 3.

Method 1 | Method 2 | 1 > 2 | Confidence Interval
Fine tuning, exact, k = 2 | Regular, exact, k = 1 | 52.8% | (50.8%, 54.9%)
Fine tuning, exact, k = 4 | Regular, exact, k = 1 | 54.4% | (52.5%, 56.3%)
Fine tuning, exact, k = 6 | Regular, exact, k = 1 | 53.2% | (51.3%, 55.0%)
Fine tuning, exact, k = 8 | Regular, exact, k = 1 | 55.1% | (53.3%, 56.8%)
Fine tuning, exact, k = 10 | Regular, exact, k = 1 | 54.5% | (53.1%, 56.0%)
Fine tuning, approximate, k = 2 | Regular, exact, k = 1 | 50.0% | (48.4%, 51.5%)
Fine tuning, approximate, k = 4 | Regular, exact, k = 1 | 53.3% | (51.7%, 55.0%)
Fine tuning, approximate, k = 6 | Regular, exact, k = 1 | 56.8% | (55.4%, 58.2%)
Fine tuning, approximate, k = 8 | Regular, exact, k = 1 | 55.2% | (53.5%, 56.7%)
Fine tuning, approximate, k = 10 | Regular, exact, k = 1 | 50.3% | (48.9%, 51.8%)

Table 3: Human evaluation results on the CelebA development set. 
In each row, we report the percentage of votes cast in favor of the output from Method 1 over that of Method 2, along with a 90% bootstrap confidence interval.

In all cases we obtain preference percentages close to 50%, indicating little difference in perceived quality. In fact, subjects generally showed a weak preference toward images generated using the fine-tuned models, with images coming from a fine-tuned model with approximate decoding and a medium block size of k = 6 obtaining the highest scores overall. We believe that the more difficult training task and approximate acceptance criterion both helped lead to outputs with slightly more noise and variation, giving them a more natural appearance when compared to the smoothed outputs that result from the baseline. See Section 7.4 for examples.

7.3 Wall-Clock Speedup

So far we have framed our results in terms of the mean accepted block size, which is reflective of the speedup achieved relative to greedy decoding in terms of number of decoding iterations. Another metric of interest is actual wall-clock speedup relative to greedy decoding, which takes into account the additional overhead required for blockwise parallel prediction. We plot these two quantities against each other for the best translation and super-resolution settings in Figure 4.

Figure 4: A plot of the relative wall-clock speedup achieved for various mean accepted block sizes, where the latter measure a reduction in iterations required for decoding. 
This data comes from the final column of Table 1 (translation results using fine tuning and distillation) and the final column of Table 2 (super-resolution results using fine tuning and approximate decoding).

For translation, the wall-clock speedup peaks at 3.3x, corresponding to the setting with k = 8 and mean accepted block size 4.7. For super-resolution, the wall-clock speedup reaches 4.0x, corresponding to the setting with k = 6 and mean accepted block size 5.3. In both cases, larger block sizes k continue to improve in terms of iteration count, but start to decline in terms of wall-clock improvement due to their higher computational cost.

Using our best settings for machine translation (distilled data and fine-tuned models), we also ran a test set evaluation on the newstest2014 dataset. These results along with others from related approaches are summarized in Table 4. Our technique exhibits much less quality degradation relative to our baseline when compared with other approaches, demonstrating its efficacy for faster decoding with minimal impact on end performance.

7.4 Examples

Machine translation. Here we show the generation process for a typical machine translation example. 
Generation occurs at the level of subwords, with underscores indicating word boundaries.

Input: The James Webb Space Telescope (JWST) will be launched into space on board an Ariane5 rocket by 2018 at the earliest.

Output: Das James Webb Space Teleskop (JWST) wird bis spätestens 2018 an Bord einer Ariane5-Rakete in den Weltraum gestartet.

Step 1 | 10 tokens | [Das_, James_, Web, b_, Space_, Tele, sko, p_, (_, J]
Step 2 | 5 tokens | [W, ST_, )_, wird_, bis_]
Step 3 | 4 tokens | [späte, stens_, 2018_, an_]
Step 4 | 10 tokens | [Bord_, einer_, Ari, ane, 5_, -_, Rak, ete_, in_, den_]
Step 5 | 2 tokens | [Weltraum, _]
Step 6 | 3 tokens | [gestartet_, ._, ]

Model | Source | BLEU | Wall-Clock Speedup
Transformer (beam size 4) | Vaswani et al. (2017) | 28.4 |
Transformer (beam size 1) | Gu et al. (2018) | 22.71 | 1.20x
Transformer (beam size 4) | Gu et al. (2018) | 23.45 | 1.00x
Non-autoregressive Transformer | Gu et al. (2018) | 17.35 | 11.39x
Non-autoregressive Transformer (+FT) | Gu et al. (2018) | 17.69 | 8.77x
Non-autoregressive Transformer (+FT + NPD s = 10) | Gu et al. (2018) | 18.66 | 3.11x
Non-autoregressive Transformer (+FT + NPD s = 100) | Gu et al. (2018) | 19.17 | 2.01x
Transformer (beam size 1) | Lee et al. (2018) | 23.77 | 2.39x
Transformer (beam size 4) | Lee et al. (2018) | 24.57 |
Iterative refinement Transformer (idec = 1) | Lee et al. (2018) | 13.91 |
Iterative refinement Transformer (idec = 2) | Lee et al. (2018) | 16.95 |
Iterative refinement Transformer (idec = 5) | Lee et al. (2018) | 20.26 |
Iterative refinement Transformer (idec = 10) | Lee et al. (2018) | 21.61 |
Iterative refinement Transformer (Adaptive) | Lee et al. (2018) | 21.54 |
Latent Transformer without rescoring | Kaiser et al. (2018) | 19.8 |
Latent Transformer rescoring top-10 | Kaiser et al. (2018) | 21.0 |
Latent Transformer rescoring top-100 | Kaiser et al. (2018) | 22.5 |
Transformer with distillation (greedy, k = 1) | This work | 29.11 | 1.00x
Blockwise parallel decoding for Transformer (k = 2) | This work | 28.95 | 1.72x
Blockwise parallel decoding for Transformer (k = 4) | This work | 28.54 | 2.69x
Blockwise parallel decoding for Transformer (k = 6) | This work | 28.11 | 3.10x
Blockwise parallel decoding for Transformer (k = 8) | This work | 27.88 | 3.31x
Blockwise parallel decoding for Transformer (k = 10) | This work | 27.40 | 3.04x

Table 4: A comparison of results on the newstest2014 test set for English-German translation. The reported speedups are for wall-clock time for single-sentence decoding averaged over the test set. Our approach exhibits relatively little loss in quality compared to prior work. We achieve a BLEU score within 0.29 of the original Transformer with a real-time speedup over our baseline exceeding 3x.

Super-resolution. Here we provide a selection of typical examples from the development set. As suggested by the human evaluations in Section 7.2, the blockwise parallel decodes are largely comparable in quality to the standard greedy decodes. For each triple, the left image is the low-resolution input, the middle image is the standard greedy decode, and the right image is the approximate greedy decode using the fine-tuned model with block size k = 10.

8 Conclusion

In this work, we proposed blockwise parallel decoding as a simple and generic technique for improving decoding performance in deep autoregressive models whose architectures allow for parallelization of scoring across output positions. 
It is comparatively straightforward to add to existing models, and we demonstrated significant improvements in decoding speed on machine translation and a conditional image generation task with no loss or only small losses in quality. In future work, we plan to investigate combining this technique with potentially orthogonal approaches such as those based on sequences of discrete latent variables (Kaiser et al., 2018).

Acknowledgments

We thank Arvind Neelakantan, Niki Parmar, and Ashish Vaswani for their generous assistance in setting up the Image Transformer and distillation experiments.

References

Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In NIPS Workshop on Meta Learning, 2017.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. CoRR, abs/1705.03122, 2017. URL http://arxiv.org/abs/1705.03122.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1l8BtlCb.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, and Noam Shazeer. Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382, 2018.

Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435, 2018.

Yoon Kim and Alexander M. Rush.
Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.

Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. CoRR, abs/1802.06901, 2018. URL http://arxiv.org/abs/1802.06901.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander Ku. Image Transformer. CoRR, abs/1802.05751, 2018. URL http://arxiv.org/abs/1802.05751.

Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016. URL http://arxiv.org/abs/1609.03499.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. Tensor2Tensor for neural machine translation.
CoRR, abs/1803.07416, 2018. URL http://arxiv.org/abs/1803.07416.