{"title": "Fast Structured Decoding for Sequence Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3016, "page_last": 3026, "abstract": "Autoregressive sequence models achieve state-of-the-art performance in domains like machine translation. However, due to the autoregressive factorization nature, these models suffer from heavy latency during inference. Recently, non-autoregressive sequence models were proposed to speed up the inference time. However, these models assume that the decoding process of each token is conditionally independent of others. Such a generation process sometimes makes the output sentence inconsistent, and thus the learned non-autoregressive models could only achieve inferior accuracy compared to their autoregressive counterparts. To improve the decoding consistency and reduce the inference cost at the same time, we propose to incorporate a structured inference module into the non-autoregressive models. Specifically, we design an efficient approximation for Conditional Random Fields (CRF) for non-autoregressive sequence models, and further propose a dynamic transition technique to model positional contexts in the CRF. Experiments in machine translation show that while increasing little latency (8~14ms), our model could achieve significantly better translation performance than previous non-autoregressive models on different translation datasets. 
In particular, for the WMT14 En-De dataset, our model obtains a BLEU score of 26.80, which largely outperforms the previous non-autoregressive baselines and is only 0.61 lower in BLEU than purely autoregressive models.", "full_text": "Fast Structured Decoding for Sequence Models\n\nZhiqing Sun1,\u2217 Zhuohan Li2,\u2217 Haoqing Wang3 Di He3 Zi Lin3 Zhi-Hong Deng3\n1Carnegie Mellon University 2University of California, Berkeley 3Peking University\n\nzhiqings@cs.cmu.edu\nzhuohan@cs.berkeley.edu\n{wanghaoqing,di_he,zi.lin,zhdeng}@pku.edu.cn\n\nAbstract\n\nAutoregressive sequence models achieve state-of-the-art performance in domains\nlike machine translation. However, due to the autoregressive factorization na-\nture, these models suffer from heavy latency during inference. Recently, non-\nautoregressive sequence models were proposed to reduce the inference time. How-\never, these models assume that the decoding process of each token is conditionally\nindependent of others. Such a generation process sometimes makes the output\nsentence inconsistent, and thus the learned non-autoregressive models could only\nachieve inferior accuracy compared to their autoregressive counterparts. To im-\nprove the decoding consistency and reduce the inference cost at the same time, we\npropose to incorporate a structured inference module into the non-autoregressive\nmodels. Speci\ufb01cally, we design an ef\ufb01cient approximation for Conditional Ran-\ndom Fields (CRF) for non-autoregressive sequence models, and further propose a\ndynamic transition technique to model positional contexts in the CRF. Experiments\nin machine translation show that while increasing little latency (8\u223c14ms), our\nmodel could achieve signi\ufb01cantly better translation performance than previous\nnon-autoregressive models on different translation datasets. 
In particular, for the\nWMT14 En-De dataset, our model obtains a BLEU score of 26.80, which largely\noutperforms the previous non-autoregressive baselines and is only 0.61 lower in\nBLEU than purely autoregressive models.1\n\n1\n\nIntroduction\n\nAutoregressive sequence models achieve great success in domains like machine translation and have\nbeen deployed in real applications [1, 2, 3, 4, 5]. However, these models suffer from high inference\nlatency [1, 2], which is sometimes unaffordable for real-time industrial applications. This is mainly\nattributed to the autoregressive factorization nature of the models: Considering a general conditional\nsequence generation framework, given a context sequence x = (x1, ..., xT) and a target sequence\ny = (y1, ..., yT'), autoregressive sequence models are based on a chain of conditional probabilities\nwith a left-to-right causal structure:\n\np(y|x) = \u220f_{i=1}^{T'} p(yi | y<i, x)  (1)\n\n[Figure 1: (a) Autoregressive Transformer; (b) Non-Autoregressive Transformer with Conditional Random Fields.]\n\nFigure 2: Illustration of the decoding inconsistency problem in non-autoregressive decoding and how\na CRF-based structured inference module solves it.\n\n3.2 Structured inference module\n\nIn this paper, we propose to incorporate a structured inference module in the decoder part to directly\nmodel multimodality in NART models. Figure 2 shows how a CRF-based structured inference\nmodule works. In principle, this module can be any structured prediction model such as Conditional\nRandom Fields (CRF) [10] or Maximum Entropy Markov Model (MEMM) [24]. 
Here we focus on\nlinear-chain CRF, which is the most widely applied model in the sequence labeling literature. In\nthe context of machine translation, we use \u201clabel\u201d and \u201ctoken\u201d (vocabulary) interchangeably for the\ndecoder output.\n\nConditional random fields CRF is a framework for building probabilistic models to segment and\nlabel sequence data. Given the sequence data x = (x1, ..., xn) and the corresponding label sequence\ny = (y1, ..., yn), the likelihood of y given x is defined as:\n\nP(y|x) = (1/Z(x)) exp( \u2211_{i=1}^{n} s(yi, x, i) + \u2211_{i=2}^{n} t(yi\u22121, yi, x, i) )  (6)\n\nwhere Z(x) is the normalizing factor, s(yi, x, i) is the label score of yi at the position i, and\nt(yi\u22121, yi, x, i) is the transition score from yi\u22121 to yi. The CRF module can be end-to-end jointly\ntrained with neural networks using the negative log-likelihood loss LCRF = \u2212log P(y|x). Note\nthat when omitting the transition score t(yi\u22121, yi, x, i), Equation 6 is the same as vanilla non-autoregressive\nmodels (Equation 2).\n\nIncorporating CRF into NART model For the label score, a linear transformation of the NART\ndecoder\u2019s output hi: s(yi, x, i) = (hiW + b)yi works well, where W \u2208 R^{dmodel\u00d7|V|} and b \u2208 R^{|V|} are\nthe weights and bias of the linear transformation. However, for the transition score, naive methods\nrequire a |V| \u00d7 |V| matrix to model t(yi\u22121, yi, x, i) = Myi\u22121,yi. Also, according to the widely-used\nforward-backward algorithm [10], the likelihood computation and the decoding process require\nO(n|V|^2) complexity through dynamic programming [10, 25, 26], which is infeasible for practical\nusage (e.g. a 32k vocabulary).\n\nLow-rank approximation for transition matrix A solution for the above issue is to use a low-rank\nmatrix to approximate the full-rank transition matrix. 
In particular, we introduce two transition\nembeddings E1, E2 \u2208 R^{|V|\u00d7dt} to approximate the transition matrix:\n\nM = E1 E2^T  (7)\n\nwhere dt is the dimension of the transition embedding.\n\n2The transition is calculated in the log space. See https://github.com/tensorflow/tensorflow/\ntree/master/tensorflow/contrib/crf for detailed implementation.\n\n5\n\n[Figure 2: candidate translations \u201cDanke schon.\u201d and \u201cVielen Dank.\u201d versus the inconsistent non-autoregressive output \u201cDanke Dank.\u201d, which the CRF-based module resolves.]\n\nBeam approximation for CRF Low-rank approximation allows us to calculate the unnormalized\nterm in Equation 6 efficiently. However, due to numerical accuracy issues2, both the normalizing\nfactor Z(x) and the decoding process require the full transition matrix, which is still unaffordable.\nTherefore, we further propose beam approximation to make CRF tractable for NART models.\nIn particular, for each position i, we heuristically truncate all |V| candidates to a pre-defined beam\nsize k. We keep the k candidates with the highest label scores s(\u00b7, x, i) for each position i, and accordingly\ncrop the transition matrix between each pair of i \u2212 1 and i. The forward-backward algorithm is then\napplied on the truncated beam to get either the normalizing factor or the decoding result. In this way, the time\ncomplexity is reduced from O(n|V|^2) to O(nk^2) (e.g., for the normalizing factor, instead\nof a sum over |V|^n possible paths, we sum over the k^n paths in the beam). 
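The beam-approximated forward pass described above can be sketched numerically. The snippet below is our own illustrative NumPy sketch, not the paper's implementation: the name `beam_forward_logZ`, the array shapes, and selecting the beam purely by label score are assumptions. It computes an approximate log Z(x) over the top-k labels per position with low-rank transitions M = E1 E2^T.

```python
import numpy as np

def beam_forward_logZ(scores, E1, E2, k):
    # Approximate log Z(x) of a linear-chain CRF whose transition
    # matrix is the low-rank product M = E1 @ E2.T, keeping only the
    # top-k labels per position (the beam approximation).
    #   scores: (n, V) label scores s(y_i, x, i)
    #   E1, E2: (V, d_t) transition embeddings
    n, V = scores.shape
    beams = np.argsort(-scores, axis=1)[:, :k]     # top-k labels per position
    alpha = scores[0, beams[0]]                    # forward log-potentials, (k,)
    for i in range(1, n):
        trans = E1[beams[i - 1]] @ E2[beams[i]].T  # cropped (k, k) transition block
        m = alpha[:, None] + trans                 # prev beam label -> next beam label
        mx = m.max(axis=0)
        alpha = scores[i, beams[i]] + mx + np.log(np.exp(m - mx).sum(axis=0))
    mx = alpha.max()
    return mx + np.log(np.exp(alpha - mx).sum())   # log-sum-exp over final beam
```

Each step costs O(k^2 + k d_t) instead of O(|V|^2), and since the beam sums over a subset of all paths, the approximate log Z(x) never exceeds the exact one.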
Besides, when calculating the\nnormalizing factor for a training pair (x, y), we explicitly include each yi in the beam to ensure that\nthe approximated normalizing factor Z(x) is larger than the unnormalized path score of y.\nThe intuition of the beam approximation is that for the normalizing factor, the sum of path scores in\nsuch a beam (the approximated Z(x)) is able to dominate the actual value of Z(x), while it is\nalso reasonable to assume that the beam includes each label of the best path y\u2217.\n\nDynamic CRF transition In the traditional definition, the transition matrix M is fixed for each\nposition i. A dynamic transition matrix that depends on the positional context could improve the\nrepresentation power of CRF. Here we use a simple but effective way to get a dynamic transition\nmatrix by inserting a dynamic matrix between the product of the transition embeddings E1 and E2:\n\nM^i_dynamic = f([hi\u22121, hi]),  (8)\nM^i = E1 M^i_dynamic E2^T,  (9)\nt(yi\u22121, yi, x, i) = M^i_{yi\u22121,yi},  (10)\n\nwhere [hi\u22121, hi] is the concatenation of two adjacent decoder outputs, and f : R^{2dmodel} \u2192 R^{dt\u00d7dt} is\na two-layer Feed-Forward Network (FFN).\n\nLatency of CRF decoding Unlike vanilla non-autoregressive decoding, the CRF decoding can no\nlonger be parallelized. However, due to our beam approximation, the O(nk^2) computation of the linear-chain\nCRF is in theory still much faster than autoregressive decoding. As shown in Table 2, in\npractice, the overhead is only 8\u223c14ms.\n\nExact Decoding for Machine Translation Despite fast decoding, another promise of this approach\nis that it provides an exact decoding framework for machine translation, while the de facto standard\nbeam search algorithm for ART models cannot provide such a guarantee. The CRF-based structured\ninference module can solve the label bias problem [10], while locally normalized models (e.g. 
beam\nsearch) often have a very weak ability to revise earlier decisions [22].\n\nJoint training with vanilla non-autoregressive loss\nIn practice, we \ufb01nd that it is bene\ufb01cial to\ninclude the original NART loss to help the training of the NART-CRF model. Therefore, our \ufb01nal\ntraining loss L is a weighted sum of the CRF negative log-likelihood loss (Equation 3) and the\nNon-AutoRegressive (NAR) negative log-likelihood loss (Equation 2):\n\nL = LCRF + \u03bbLNAR,\n\n(11)\n\nwhere \u03bb is the hyperparameter controlling the weight of different loss terms.\n\n4 Experiments\n\n4.1 Experimental settings\n\nWe use several widely adopted benchmark tasks to evaluate the effectiveness of our proposed\nmodels: IWSLT143 German-to-English translation (IWSLT14 De-En) and WMT144 English-to-\nGerman/German-to-English translation (WMT14 En-De/De-En). For the WMT14 dataset, we use\nNewstest2014 as test data and Newstest2013 as validation data. For each dataset, we split word\ntokens into subword units following [2], forming a 32k word-piece vocabulary shared by source and\ntarget languages.\n\n3https://wit3.fbk.eu/\n4http://statmt.org/wmt14/translation-task.html\n\n6\n\n\fTable 2: Performance of BLEU score on WMT14 En-De/De-En and IWSLT14 De-En tasks. The\nnumber in the parentheses denotes the performance gap between NART models and their ART\nteachers. \u201d/\u201d denotes that the results are not reported. 
LSTM-based results are from [2, 27];\nCNN-based results are from [5, 28]; Transformer [1] results are based on our own reproduction.6\n\nModels\nAutoregressive models\nLSTM-based [2]\nCNN-based [5]\nTransformer [1] (beam size = 4)\nNon-autoregressive models\nFT [6]\nFT [6] (rescoring 10)\nFT [6] (rescoring 100)\nIR [9] (adaptive re\ufb01nement)\nLT [15]\nLT [15] (rescoring 10)\nLT [15] (rescoring 100)\nCTC [13]\nENAT-P [29]\nENAT-P [29] (rescoring 9)\nENAT-E [29]\nENAT-E [29] (rescoring 9)\nNAT-REG [8]\nNAT-REG [8] (rescoring 9)\nVQ-VAE [16] (compress 8\u00d7)\nVQ-VAE [16] (compress 16\u00d7)\nNon-autoregressive models (Ours)\nNART\nNART (rescoring 9)\nNART (rescoring 19)\nNART-CRF\nNART-CRF (rescoring 9)\nNART-CRF (rescoring 19)\nNART-DCRF\nNART-DCRF (rescoring 9)\nNART-DCRF (rescoring 19)\n\nWMT14\n\nDe-En\n\n/\n/\n\n31.29\n\nEn-De\n\n24.60\n26.43\n27.41\n\nIWSLT14\n\nDe-En\n\n28.53\n32.84\n33.26\n\n17.69 (5.76)\n18.66 (4.79)\n19.17 (4.28)\n21.54 (3.03)\n19.80 (7.50)\n21.00 (6.30)\n22.50 (4.80)\n17.68 (5.77)\n20.26 (7.15)\n23.22 (4.19)\n20.65 (6.76)\n24.28 (3.13)\n20.65 (6.65)\n24.61 (2.69)\n26.70 (1.40)\n25.40 (2.70)\n\n20.27 (7.14)\n24.22 (3.19)\n24.99 (2.42)\n23.32 (4.09)\n26.04 (1.37)\n26.68 (0.73)\n23.44 (3.97)\n26.07 (1.34)\n26.80 (0.61)\n\n21.47 (5.55)\n22.41 (4.61)\n23.20 (3.82)\n25.43 (3.04)\n\n19.80 (7.22)\n23.23 (8.06)\n26.67 (4.62)\n23.02 (8.27)\n26.10 (5.19)\n24.77 (6.52)\n28.90 (2.39)\n\n/\n/\n/\n\n/\n/\n\n/\n/\n/\n/\n/\n/\n/\n/\n\n/\n/\n\n25.09 (7.46)\n28.60 (3.95)\n24.13 (8.42)\n27.30 (5.25)\n23.89 (9.63)\n28.04 (5.48)\n\n22.02 (9.27)\n26.21 (5.08)\n26.60 (4.69)\n25.75 (5.54)\n28.88 (2.41)\n29.26 (2.03)\n27.22 (4.07)\n29.68 (1.61)\n30.04 (1.25)\n\n23.04 (10.22)\n26.79 (6.47)\n27.36 (5.90)\n26.39 (6.87)\n29.21 (4.05)\n29.55 (3.71)\n27.44 (5.82)\n29.99 (3.27)\n30.36 
(2.90)\n\nLatency\n\nSpeedup\n\n387ms\u2021\n39ms\u2020\n79ms\u2020\n257ms\u2020\n105ms\u2020\n\n/\n\n/\n/\n\n/\n/\n/\n\n25ms\u2020\n50ms\u2020\n24ms\u2020\n49ms\u2020\n22ms\u2020\n40ms\u2020\n81ms\u2020\n58ms\u2020\n26ms\u2021\n50ms\u2021\n74ms\u2021\n35ms\u2021\n60ms\u2021\n87ms\u2021\n37ms\u2021\n63ms\u2021\n88ms\u2021\n\n/\n/\n\n/\n/\n/\n\n1.00\u00d7\n15.6\u00d7\u2020\n7.68\u00d7\u2020\n2.36\u00d7\u2020\n2.39\u00d7\u2020\n\n3.42\u00d7\u2020\n24.3\u00d7\u2020\n12.1\u00d7\u2020\n25.3\u00d7\u2020\n12.4\u00d7\u2020\n27.6\u00d7\u2020\n15.1\u00d7\u2020\n4.08\u00d7\u2020\n5.71\u00d7\u2020\n14.9\u00d7\u2021\n7.74\u00d7\u2021\n5.22\u00d7\u2021\n11.1\u00d7\u2021\n6.45\u00d7\u2021\n4.45\u00d7\u2021\n10.4\u00d7\u2021\n6.14\u00d7\u2021\n4.39\u00d7\u2021\n\nFor the WMT14 dataset, we use the default network architecture of the original base Transformer\n[1], which consists of a 6-layer encoder and 6-layer decoder. The size of hidden states dmodel is\nset to 512. Considering that IWSLT14 is a relatively smaller dataset comparing to WMT14, we\nuse a smaller architecture for IWSLT14, which consists of a 5-layer encoder, and a 5-layer decoder.\nThe size of hidden states dmodel is set to 256, and the number of heads is set to 4. For all datasets,\nwe set the size of transition embedding dt to 32 and the beam size k of beam approximation to 64.\nHyperparameter \u03bb is set to 0.5 to balance the scale of two loss components.\nFollowing previous works [6], we use sequence-level knowledge distillation [12] during training.\nSpeci\ufb01cally, we train our models on translations produced by a Transformer teacher model. It has\nbeen shown to be an effective way to alleviate the multimodality problem in training [6].\nSince the CRF-based structured inference module is not parallelizable in training, we initialize our\nNART-CRF models by warming up from their vanilla NART counterparts to speed up training. 
We use the\nAdam [30] optimizer and employ label smoothing of value \u03b5ls = 0.1 [31] in all experiments. Models\nfor the WMT14/IWSLT14 tasks are trained on 4/1 NVIDIA P40 GPUs, respectively. We implement our\nmodels based on the open-sourced tensor2tensor library [23].\n\n6In Table 2, \u2021 and \u2020 indicate that the latency and speedup rate are measured on our own platform or by\nprevious works, respectively. Please note that both of them may be evaluated under different hardware settings\nand it may not be fair to directly compare them.\n\n7\n\nTable 3: BLEU scores of the beam approximation ablation study on WMT En-De.\n\nCRF beam size k           1      2      4      8      16     32     64     128    256\nNART-CRF                  15.10  20.67  22.54  23.04  23.22  23.26  23.32  23.33  23.38\nNART-CRF (rescoring 9)    19.61  23.93  25.48  25.86  25.93  26.01  26.04  26.09  26.08\nNART-CRF (rescoring 19)   20.02  25.00  26.28  26.56  26.57  26.65  26.68  26.71  26.66\n\n4.2 Inference\n\nDuring training, the target sentence is given, so we do not need to predict the target length T'.\nHowever, during inference, we have to predict the length of the target sentence for each source\nsentence. Specifically, in this paper, we use the simplest form of target length T', which is a linear\nfunction of the source length T defined as T' = T + C, where C is a constant bias term that can be\nset according to the overall length statistics of the training data. We also try different target lengths\nranging from (T + C) \u2212 B to (T + C) + B, where B is the half-width, and obtain multiple translation results with different\nlengths, and then use the ART Transformer as the teacher model to select\nthe best translation from the multiple candidate translations during inference.\nWe set the constant bias term C to 2, -2, 2 for the WMT14 En-De, De-En and IWSLT14 De-En datasets\nrespectively, according to the average lengths of different languages in the training sets. 
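The length-candidate scheme above is simple enough to sketch directly. The helper below is our own illustration (the name `candidate_lengths` and the clipping to a minimum length of 1 are our choices, not the paper's code): it enumerates the 2B + 1 target lengths centered at T + C.

```python
def candidate_lengths(src_len, c=2, b=4):
    # Candidate target lengths (T + C) - B ... (T + C) + B, clipped to
    # be at least 1. With B = 4 this yields the 9 candidates used for
    # teacher rescoring; B = 9 yields 19 candidates.
    center = src_len + c
    return [max(1, center + d) for d in range(-b, b + 1)]
```

For a 10-token source with C = 2 and B = 4, this produces the lengths 8 through 16, one translation per length, from which the teacher model picks the best.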
We set B to\n4/9 and get 9/19 candidate translations for each sentence. For each dataset, we evaluate our model\nperformance with the BLEU score [32]. Following previous works [6, 9, 29, 8], we evaluate the\naverage per-sentence decoding latency on the WMT14 En-De test set with batch size 1 on a single\nNVIDIA Tesla P100 GPU for the Transformer model and the NART models to measure the speedup\nof our models. The latencies are obtained by taking the average of five runs.\n\n4.3 Results and analysis\n\nWe evaluate7 three models described in Section 3: the Non-AutoRegressive Transformer baseline\n(NART), NART with static-transition Conditional Random Fields (NART-CRF), and NART with\nDynamic-transition Conditional Random Fields (NART-DCRF). We also compare the proposed\nmodels with other ART or NART models, where the LSTM-based model [2, 27], CNN-based model\n[5, 28], and Transformer [1] are autoregressive models; the FerTility-based (FT) NART model [6],\ndeterministic Iterative Refinement (IR) model [9], Latent Transformer (LT) [15], NART model\nwith Connectionist Temporal Classification (CTC) [13], Enhanced Non-Autoregressive Transformer\n(ENAT) [29], Regularized Non-Autoregressive Transformer (NAT-REG) [8], and Vector Quantized\nVariational AutoEncoders (VQ-VAE) [16] are non-autoregressive models.\nTable 2 shows the BLEU scores on different datasets and the inference latency of our models and the\nbaselines. 
The proposed NART-CRF/NART-DCRF models achieve state-of-the-art performance with\nsignificant improvements over previously proposed non-autoregressive models across various datasets,\nand even outperform two strong autoregressive models (LSTM-based and CNN-based) on the WMT\nEn-De dataset.\nSpecifically, the NART-DCRF model outperforms the fertility-based NART model with 5.75/7.41\nand 5.75/7.27 BLEU score improvements on the WMT En-De and De-En tasks in similar settings,\nand outperforms our own NART baseline with 3.17/1.85/1.81 and 5.20/3.47/3.44 BLEU score\nimprovements on the WMT En-De and De-En tasks in the same settings. It is even comparable to its ART\nTransformer teacher model. To the best of our knowledge, it is the first time that the performance\ngap between ART and NART models has been narrowed to 0.61 BLEU on the WMT En-De task. Apart from the translation\naccuracy, our NART-CRF/NART-DCRF model achieves a speedup of 11.1\u00d7/10.4\u00d7 (greedy decoding) or\n4.45\u00d7/4.39\u00d7 (teacher rescoring) over the ART counterpart.\nThe proposed dynamic transition technique boosts the performance of the NART-CRF model by\n0.12/0.03/0.12, 1.47/0.80/0.78, and 1.05/0.78/0.81 BLEU score on the WMT En-De, De-En and IWSLT\nDe-En tasks respectively. We can see that the gain is smaller on the En-De translation task. This may\nbe due to language-specific properties of German and English.\n\n7We follow common practice in previous works to make a fair comparison. Specifically, we use tokenized\ncase-sensitive BLEU for WMT datasets and case-insensitive BLEU for IWSLT datasets.\n\n8\n\nAn interesting question in our model design is how well the beam approximation fits the full CRF\ntransition matrix. We conduct an ablation study of our NART-CRF model on the WMT En-De task and\nthe results are shown in Table 3. The model is trained with CRF beam size k = 64 and evaluated\nwith different CRF beam sizes and numbers of rescoring candidates. 
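The decoding side of the beam approximation evaluated in Table 3 is a standard Viterbi pass restricted to the k kept labels per position. The sketch below is our own illustration (the name `beam_viterbi`, the shapes, and the score-based beam selection are assumptions, not the paper's code):

```python
import numpy as np

def beam_viterbi(scores, E1, E2, k):
    # Viterbi decoding restricted to the top-k labels per position,
    # with low-rank transitions M = E1 @ E2.T (beam approximation).
    #   scores: (n, V) label scores; E1, E2: (V, d_t) transition embeddings
    n, V = scores.shape
    beams = np.argsort(-scores, axis=1)[:, :k]      # top-k labels per position
    delta = scores[0, beams[0]]                     # best path score ending at each beam label
    back = np.zeros((n, k), dtype=int)              # backpointers into the previous beam
    for i in range(1, n):
        trans = E1[beams[i - 1]] @ E2[beams[i]].T   # cropped (k, k) transition block
        cand = delta[:, None] + trans               # prev beam label -> next beam label
        back[i] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + scores[i, beams[i]]
    j = int(delta.argmax())                         # best final beam index
    path = [j]
    for i in range(n - 1, 0, -1):                   # trace backpointers
        j = int(back[i][j])
        path.append(j)
    path.reverse()
    return [int(beams[i][j]) for i, j in enumerate(path)]
```

With k equal to the full vocabulary size this reduces to exact Viterbi decoding; shrinking k trades a little accuracy for the O(nk^2) cost reported in the ablation.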
We can see that k = 16 already provides\na quite good approximation, as further increasing k does not bring much gain. This validates the\neffectiveness of our proposed beam approximation technique.\n\n5 Conclusion and Future Work\n\nNon-autoregressive sequence models have achieved impressive inference speedup but suffer from\nthe decoding inconsistency problem, and thus perform poorly compared to autoregressive sequence\nmodels. In this paper, we propose a novel framework to bridge the performance gap between non-autoregressive\nand autoregressive sequence models. Specifically, we use linear-chain Conditional\nRandom Fields (CRF) to model the co-occurrence relationship between adjacent words during\ndecoding. We design two effective approximation methods to tackle the issue of the large vocabulary\nsize, and further propose a dynamic transition technique to model positional contexts in the CRF. The\nresults significantly outperform previous non-autoregressive baselines on the WMT14 En-De and De-En\ndatasets and achieve comparable performance to the autoregressive counterparts.\nIn the future, we plan to utilize other existing techniques for our NART-CRF models to further bridge\nthe gap between non-autoregressive and autoregressive sequence models. Besides, although the\nrescoring process is also parallelized, it severely increases the inference latency, as can be seen in\nTable 2. An additional module that can accurately predict the target length might be useful. As our\nmajor contribution in this paper is to model richer structural dependency in the non-autoregressive\ndecoder, we leave this for future work.\n\nReferences\n\n[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,\n\u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. 
In Advances in neural information\nprocessing systems, pages 5998\u20136008, 2017.\n\n[2] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang\nMacherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google\u2019s neural machine\ntranslation system: Bridging the gap between human and machine translation. arXiv preprint\narXiv:1609.08144, 2016.\n\n[3] Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,\nHolger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-\ndecoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.\n\n[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\n\nlearning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[5] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional\nsequence to sequence learning. In Proceedings of the 34th International Conference on Machine\nLearning-Volume 70, pages 1243\u20131252. JMLR. org, 2017.\n\n[6] Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-\n\nautoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.\n\n[7] Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu. Hint-based\n\ntraining for non-autoregressive translation. arXiv preprint arXiv:1909.06708, 2019.\n\n[8] Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Non-autoregressive\n\nmachine translation with auxiliary regularization. arXiv preprint arXiv:1902.10245, 2019.\n\n[9] Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic non-autoregressive neural\n\nsequence modeling by iterative re\ufb01nement. arXiv preprint arXiv:1802.06901, 2018.\n\n9\n\n\f[10] John D Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random \ufb01elds:\nProbabilistic models for segmenting and labeling sequence data. 
In Proceedings of the Eigh-\nteenth International Conference on Machine Learning, pages 282\u2013289. Morgan Kaufmann\nPublishers Inc., 2001.\n\n[11] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.\n\narXiv preprint arXiv:1503.02531, 2015.\n\n[12] Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. arXiv preprint\n\narXiv:1606.07947, 2016.\n\n[13] Jind\u02c7rich Libovick`y and Jind\u02c7rich Helcl. End-to-end non-autoregressive neural machine translation\n\nwith connectionist temporal classi\ufb01cation. arXiv preprint arXiv:1811.04719, 2018.\n\n[14] Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. The\nmathematics of statistical machine translation: Parameter estimation. Computational linguistics,\n19(2):263\u2013311, 1993.\n\n[15] \u0141ukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, and\nNoam Shazeer. Fast decoding in sequence models using discrete latent variables. arXiv preprint\narXiv:1803.03382, 2018.\n\n[16] Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and experiments\n\non vector quantized autoencoders. arXiv preprint arXiv:1805.11063, 2018.\n\n[17] Thomas Lavergne, Josep Maria Crego, Alexandre Allauzen, and Fran\u00e7ois Yvon. From n-gram-\nbased to crf-based translation models. In Proceedings of the sixth workshop on statistical\nmachine translation, pages 542\u2013553. Association for Computational Linguistics, 2011.\n\n[18] Yoon Kim, Sam Wiseman, and Alexander M Rush. A tutorial on deep latent variable models of\n\nnatural language. arXiv preprint arXiv:1812.06834, 2018.\n\n[19] Ke Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. Unsupervised neural\n\nhidden markov models. arXiv preprint arXiv:1609.09007, 2016.\n\n[20] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. 
Structured attention networks.\narXiv preprint arXiv:1702.00887, 2017.\n\n[21] Ronan Collobert, Jason Weston, L\u00e9on Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel\nKuksa. Natural language processing (almost) from scratch. Journal of machine learning\nresearch, 12(Aug):2493\u20132537, 2011.\n\n[22] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman\nGanchev, Slav Petrov, and Michael Collins. Globally normalized transition-based neural\nnetworks. arXiv preprint arXiv:1603.06042, 2016.\n\n[23] Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N Gomez, Stephan\nGouws, Llion Jones, \u0141ukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. Tensor2tensor for\nneural machine translation. arXiv preprint arXiv:1803.07416, 2018.\n\n[24] Andrew McCallum, Dayne Freitag, and Fernando CN Pereira. Maximum entropy markov\nmodels for information extraction and segmentation. In ICML, pages 591\u2013598, 2000.\n\n[25] Charles Sutton, Andrew McCallum, et al. An introduction to conditional random fields.\nFoundations and Trends\u00ae in Machine Learning, 4(4):267\u2013373, 2012.\n\n[26] Michael Collins. The forward-backward algorithm. Columbia University, 2013.\n\n[27] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau,\nAaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv\npreprint arXiv:1607.07086, 2016.\n\n[28] Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc\u2019Aurelio Ranzato. Classical\nstructured prediction losses for sequence to sequence learning. arXiv preprint arXiv:1711.04956,\n2017.\n\n10\n\n[29] Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. Non-autoregressive neural\nmachine translation with enhanced decoder input. arXiv preprint arXiv:1812.09664, 2018.\n\n[30] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
arXiv preprint\narXiv:1412.6980, 2014.\n\n[31] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking\nthe inception architecture for computer vision. In Proceedings of the IEEE conference\non computer vision and pattern recognition, pages 2818\u20132826, 2016.\n\n[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic\nevaluation of machine translation. In Proceedings of the 40th annual meeting on association for\ncomputational linguistics, pages 311\u2013318. Association for Computational Linguistics, 2002.\n\n11\n", "award": [], "sourceid": 1726, "authors": [{"given_name": "Zhiqing", "family_name": "Sun", "institution": "Carnegie Mellon University"}, {"given_name": "Zhuohan", "family_name": "Li", "institution": "UC Berkeley"}, {"given_name": "Haoqing", "family_name": "Wang", "institution": "Peking University"}, {"given_name": "Di", "family_name": "He", "institution": "Peking University"}, {"given_name": "Zi", "family_name": "Lin", "institution": "Peking University"}, {"given_name": "Zhihong", "family_name": "Deng", "institution": "Peking University"}]}