{"title": "Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation", "book": "Advances in Neural Information Processing Systems", "page_first": 7944, "page_last": 7954, "abstract": "Neural Machine Translation (NMT) has achieved remarkable progress with the quick evolvement of model structures. In this paper, we propose the concept of layer-wise coordination for NMT, which explicitly coordinates the learning of hidden representations of the encoder and decoder together layer by layer, gradually from low level to high level. Specifically, we design a layer-wise attention and mixed attention mechanism, and further share the parameters of each layer between the encoder and decoder to regularize and coordinate the learning. Experiments show that combined with the state-of-the-art Transformer model, layer-wise coordination achieves  improvements on three IWSLT and two WMT translation tasks. More specifically, our method achieves 34.43 and 29.01 BLEU score on WMT16 English-Romanian and WMT14 English-German tasks, outperforming the Transformer baseline.", "full_text": "Layer-Wise Coordination between Encoder and\n\nDecoder for Neural Machine Translation\n\nTianyu He1 \u2020 \u2217\n\nhetianyu@mail.ustc.edu.cn\n\nXu Tan2 \u2020\n\nxuta@microsoft.com\n\nYingce Xia2\n\nyingce.xia@microsoft.com\n\nDi He3\n\ndi_he@pku.edu.cn\n\nTao Qin2\n\ntaoqin@microsoft.com\n\nZhibo Chen1\n\nchenzhibo@ustc.edu.cn\n\nTie-Yan Liu2\n\ntie-yan.liu@microsoft.com\n\n1CAS Key Laboratory of Technology in\n\nGeo-spatial Information Processing and Application System,\n\nUniversity of Science and Technology of China\n\n3Key Laboratory of Machine Perception, MOE, School of EECS, Peking University\n\n2Microsoft Research\n\nAbstract\n\nNeural Machine Translation (NMT) has achieved remarkable progress with the\nquick evolvement of model structures. 
In this paper, we propose the concept of\nlayer-wise coordination for NMT, which explicitly coordinates the learning of\nhidden representations of the encoder and decoder together layer by layer, grad-\nually from low level to high level. Speci\ufb01cally, we design a layer-wise attention\nand mixed attention mechanism, and further share the parameters of each layer\nbetween the encoder and decoder to regularize and coordinate the learning. Experi-\nments show that combined with the state-of-the-art Transformer model, layer-wise\ncoordination achieves improvements on three IWSLT and two WMT translation\ntasks. More speci\ufb01cally, our method achieves 34.43 and 29.01 BLEU score on\nWMT16 English-Romanian and WMT14 English-German tasks, outperforming\nthe Transformer baseline.\n\n1\n\nIntroduction\n\nNeural Machine Translation (NMT) is a challenging task that attracts a lot of attention in recent\nyears [5, 2, 18, 29, 7, 33, 28, 27, 35, 37, 9], and the structure of NMT models has evolved quickly.\nThe \ufb01rst design of NMT model is based on Recurrent Neural Networks (RNNs) [5]. 
Then the\nattention mechanism [2] is introduced to better model the alignment between source and target tokens.\nDeeper architectures are adopted later to increase the expressiveness of NMT models [36, 7, 33].\nRecently, models based on Convolutional Neural Networks [7] and self-attention [33] have been proposed, which\nachieve state-of-the-art performance on many widely used translation tasks.\n\n\u2217The work was done when the first author was an intern at Microsoft Research Asia.\n\u2020The first and second authors contributed equally to this work.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWhile those models employ different basic building blocks (e.g., RNN, CNN, or self-attention),\nthey all follow the typical encoder-decoder framework: the encoder takes the source tokens as\ninputs and generates a set of hidden representations for those tokens layer by layer, gradually from\nlow level to high level. Then the decoder takes the last-layer (the highest level) representations\nfrom the encoder as inputs, generates hidden representations for each target position from low-level\nlayers to high-level ones, and finally generates a token based on the last-layer representations. We\ncan see that the generation of hidden representations for target tokens, whether high level or low\nlevel, is always based on the highest-level representations of the source sentence. Our case study and\nattention visualization (Section 5.3) show that attending the low layer of the decoder to the high-level\nrepresentations of the encoder causes unfocused attention and harms translation quality.\nTwo questions then arise naturally: Why should the low-level representation of a target token be based\non the highest-level ones of source tokens? Why not attend each layer of the decoder to the\ncorresponding layer of the encoder? 
These questions exactly motivate our work.\nIn this paper, we propose to coordinate the learning of the encoder and decoder of an NMT model\nlayer by layer. The encoder and decoder of our model have the same number of layers, and the\ni-th layer in the decoder is aligned and coordinated with the i-th layer of the encoder. The hidden\nrepresentation of a source token in the i-th layer is generated from the hidden representations of all\nsource tokens in the (i-1)-th layer using the self-attention mechanism, and that of a target token in\nthe i-th layer is generated from the hidden representations of all source tokens and preceding target\ntokens in the (i-1)-th layer using a mixed attention mechanism. To further coordinate the learning\nbetween the encoder and decoder, we share parameters of the encoder and decoder. This new model\nhas several advantages compared with existing models.\nFirst, through layer-wise coordination, the information from the source and target sentence will\nmeet earlier, starting from the low-level representations. Consequently, the decoder can leverage\nmore \ufb01ne-grained source information when generating target tokens, instead of only using high-level\nrepresentations outputted by the encoder in the previous model structure. Such an approach has been\nshown to be effective for other NLP tasks such as text matching [16, 10, 19] or non-autoregressive\nmachine translation [8].\nSecond, through layer-wise coordination and parameter sharing, we ensure that the hidden represen-\ntations in two corresponding layers of the encoder and decoder are in the same (or closely related)\nsemantic level. Note that even if existing models can have the same number of layers in their encoder\nand decoder, there is no correspondence between an encoder layer and a decoder layer since their\nparameters are freely learned. 
Furthermore, parameter sharing allows us to stack more layers under\nthe constraint of model size, without loss of model capacity while regularizing the training process.\nThe idea of layer-wise coordination can be applied to most existing model architectures, including\nRNN [5, 2, 18, 29], CNN [7] and Transformer [33]. In this work we apply layer-wise coordination to\nTransformer, considering its superior accuracy on several benchmark tasks. Experiments show that our\nmethod outperforms strong baselines on three IWSLT tasks and two WMT tasks. In particular, we\nachieve 34.43 and 29.01 BLEU score on WMT16 English-Romanian and WMT14 English-German\ntasks.\n\n2 Background\n\n2.1 Encoder-Decoder Framework\n\nGiven a bilingual sentence pair (x, y), an NMT model learns its parameter \u03b8 by maximizing the\nlog-likelihood log P(y|x; \u03b8), where P(y|x; \u03b8) is usually decomposed into the product of the conditional probabilities\nof each target word:\n\nP(y|x; \u03b8) = \u220f_{t=1}^{m} P(y_t | y_{<t}, x; \u03b8),\n\nwhere m is the length of sentence y and y_{<t} are\nthe target tokens before position t.\nAn encoder-decoder framework [5, 2, 18, 29, 7, 33] is usually adopted to model the conditional\nprobability P(y|x; \u03b8). The encoder maps the input sentence x into a set of hidden representations\nh, and the decoder generates the target token y_t at position t using the previously generated target\ntokens y_{<t} and the source representations h. Both the encoder and decoder can be implemented\nwith different neural network structures, such as RNN (LSTM/GRU) [5, 2, 18, 29], CNN [7] and\nself-attention [33]. 
Besides the basic components of the encoder and decoder, a source-target attention\nmechanism [2] is usually adopted to selectively focus on the source representations when generating\na target token.\n\nDifferent from the typical encoder-decoder framework, our layer-wise coordination allows each layer\nof the decoder to directly leverage the hidden representations from the corresponding layer (instead\nof the last layer) of the encoder.\n\n2.2 Self-attention based Network\n\nSelf-attention has been used in many previous works [4, 21, 22, 33, 15]. [33] first introduces\nself-attention into Neural Machine Translation. For a single self-attention layer, it utilizes a cross-position\nself-attention to extract information from the tokens in the whole sentence, and then a position-wise\nfeed-forward network to increase the non-linearity. The self-attention is formulated as:\n\nAttention(Q, K, V) = softmax(QK^T / \u221a(d_model)) V,    (1)\n\nwhere d_model is the dimension of hidden representations. The embedding size and the input and output\nsizes of self-attention are all set to d_model. For the self-attention inside the encoder layer, Q, K, V \u2208\nR^{n \u00d7 d_model}, while for the self-attention inside the decoder layer, Q, K, V \u2208 R^{m \u00d7 d_model}, where n and m\nare the lengths of the source and target sentences. For the attention across the encoder and decoder from\nthe source to the target (i.e., source-target attention), Q \u2208 R^{m \u00d7 d_model} and K, V \u2208 R^{n \u00d7 d_model}. All the Q, K, V\ncome from the hidden representations of the corresponding encoder/decoder layer, but are projected by\ndifferent parameter matrices: W_Q, W_K and W_V. 
The position-wise feed-forward network consists\nof a two-layer linear transformation with ReLU activation in between:\n\nFFN(x) = max(0, xW_1 + b_1)W_2 + b_2.    (2)\n\nThe feed-forward network is applied in every layer to both the source and target sentences.\n\n3 Layer-wise Coordination\n\nIn this section, we present the idea of layer-wise coordination. In principle, layer-wise coordination\ncan be applied to any encoder-decoder based model, including RNN, CNN and Transformer. In this\nwork, we directly focus on Transformer considering that it achieves very good accuracy on multiple\ntranslation tasks.\nLayer-wise coordination modifies the structure of Transformer in two aspects: First, each layer\nin the decoder attends to the corresponding layer in the encoder. That is, the encoder and decoder\nhave the same number of layers and layer i in the decoder can extract information directly from\nlayer i in the encoder, instead of from the last layer of the encoder as in Transformer. While the decoder of\nTransformer uses a separate encoder-decoder attention module to extract information from the source\nsentence and a self-attention module to extract information from previous target tokens, we merge\nthe two attentions into one, which we call mixed attention, to coordinate the learning between\nsource and target. Second, we share the parameters of the attention and feed-forward layers between the\nencoder and decoder, in order to ensure the outputs of the corresponding layers of the encoder and\ndecoder are in the same (or closely related) semantic level, and thus enhance layer-wise coordination.\nThe overall structure of our model with layer-wise coordination is shown in Figure 1. The source\nand target sentences are concatenated and processed by the model layer by layer coordinately. We\nstack N layers, where N can be varied according to the size of the dataset. 
We introduce the key\ncomponents of the model as follows.\n\nMixed Attention In order to coordinate the learning of source and target tokens in each layer, the\ndecoder of our model uses a mixed attention for a target token to extract cross-position information\nfrom both the source and preceding target tokens. The attention mechanism is shown as the \u201cMixed\nAttention\u201d in Figure 1.\nTo enable the above attention mechanism, we add an extra mask on the dot product of Q and K based\non Equation 1 to prevent attending to the illegal positions (i.e., future target tokens):\n\nMixed_Attention(Q, K, V) = softmax(QK^T / \u221a(d_model) + M) V,\n\nM(i, j) = 0 if j < n \u2228 j \u2264 i + n, and \u2212\u221e otherwise,    (3)\n\nwhere Q \u2208 R^{m \u00d7 d_model}, K, V \u2208 R^{(n+m) \u00d7 d_model}, and M \u2208 R^{m \u00d7 (n+m)} is a mask matrix, with n and m\nbeing the lengths of the source and target sentences. When M(i, j) equals \u2212\u221e, the corresponding\nposition in the softmax output will approach zero, which prevents position i from attending to position j.\n\nFigure 1: Our proposed layer-wise coordination model for neural machine translation.\n\nPosition Embedding Since self-attention has no recurrent operation like RNN or convolution like\nCNN, we need to explicitly inject some information to indicate the absolute or relative position of a word\nto the model. In order to keep the original order of the concatenated sentence, we use the resettable\nposition embedding. The position of source tokens starts with zero, and for the target tokens, instead\nof continuing to increase the position from the end of the source sentence, we reset the position from zero again. 
As\nthe position embedding function, we follow [8, 33] and use the sine and cosine functions to form\nthe embedding vectors: p(pos, k) = sin(pos / 10000^{k/d_model}) (if k is even) or cos(pos / 10000^{k/d_model})\n(if k is odd), where pos is the position and k is the index of the hidden dimension.\n\nSource/Target Embedding Since the source and target tokens share the same model parameters,\nwe need to give the model a sense of which language each token it receives comes from. The\nresettable position embedding alone cannot identify the language of a word token. We introduce\ntwo embeddings which represent the source and target language respectively. Every position of the\nsource and target tokens is added with the corresponding source/target embedding, which is learned\nend-to-end during the model training process. Our experiments show that the source/target embeddings are\nextremely important for training the model.\n\n4 Experimental Setup\n\n4.1 Datasets\n\nWe evaluate our model on several widely used translation tasks, including IWSLT14 German-English\n(briefly, De-En), IWSLT14 Romanian-English (briefly, Ro-En), IWSLT14 Spanish-English (briefly,\nEs-En), WMT16 English-Romanian (briefly, En-Ro) and WMT14 English-German (briefly, En-De).\n\nIWSLT14 German/Romanian/Spanish-English (De-En/Ro-En/Es-En) We use the datasets extracted\nfrom the IWSLT 2014 evaluation campaign [3] (https://wit3.fbk.eu/mt.php?release=2014-01), which consist of 153K/182K/181K training\nsentence pairs for De-En/Ro-En/Es-En. For Ro-En/Es-En, we concatenate dev2010, tst2010, tst2011\nand tst2012 as the validation set and use tst2014 as the test set. 
For De-En, we use 7K sentence pairs split\nfrom the training set as the validation set and use the concatenation of dev2010, tst2010, tst2011\nand tst2012 as the test set, which is widely used in prior works [23, 1, 11]. We also lowercase the\nsentences of De-En (footnote 4) following the common practice. Sentences are encoded using sub-word types\nbased on byte-pair-encoding (BPE) [25] (footnote 5), with a shared vocabulary of about 31K/39K/34K\nsub-word tokens for De-En/Ro-En/Es-En.\n\nWMT16 English-Romanian (En-Ro) We use the same dataset and pre-processing techniques\nas [24], which result in 2.8M sentence pairs for training. We use the concatenation of newstest2013\nand newstest2014 as the validation set and newstest2016 as the test set [24]. Sentences are also\nencoded using BPE with a shared vocabulary of 36K sub-word tokens.\n\nWMT14 English-German (En-De) We use the same dataset as [17], which comprises 4.5M\nsentence pairs for training. We use the concatenation of newstest2012 and newstest2013 as the\nvalidation set and newstest2014 as the test set (footnote 6). Sentences are also encoded using BPE with a shared\nvocabulary of 40K sub-word tokens.\n\n4.2 Model Configurations\n\nFor the small datasets De-En/Ro-En/Es-En, we choose the small configuration with the model hidden\nsize d_model = 256 and feed-forward hidden size d_ff = 1024. For the relatively larger datasets En-Ro and\nEn-De, we choose the big configuration with d_model = 1024 and d_ff = 2048. We use the same\nnumber of heads as Transformer (4/8/16 for small/base/big configuration). For fair comparison with\nTransformer, we perform our experiments under the constraint that the number of parameters is\nsimilar to that of Transformer. Since our model structure shares the parameters between the encoder and\ndecoder, we can stack more layers under the same parameter constraint. 
We stack 14 layers for both\nsmall and big configurations, which have roughly the same number of parameters as the 6-layer\ntransformer_small and transformer_big counterparts [32].\n\n4.3 Training and Inference\n\nDuring training, we concatenate the source and target sentences and batch together concatenated\nsentences of similar lengths, with zero padding at the end of each sentence to ensure\nexactly the same length within one mini-batch. Each mini-batch on one GPU contains roughly 4096\ntokens. We train our models for En-Ro and En-De with 8 NVIDIA Tesla M40 GPUs on one machine.\nWe only use one M40 GPU to train the model for De-En/Ro-En/Es-En tasks, as both the model\nand the data are small. The validation sets in all the datasets are used for hyper-parameter tuning\nand early stopping. We choose the Adam optimizer [13] with \u03b2_1 = 0.9, \u03b2_2 = 0.98, \u03b5 = 10^{\u22129} and\nuse the learning rate schedule in [33].\nDuring inference, we generate the target token autoregressively, regarding the source tokens as the\npreviously generated tokens. We append the source tokens with the end-of-sentence (EOS) token,\nand then with the start-of-sentence (SOS) token, and feed them into the model to generate the first\ntarget token. We decode with beam search and set beam width beam = 6 and length penalty \u03b1 = 1.1\non all datasets except for En-De, where we use beam = 4 and \u03b1 = 0.6 to be consistent with [33].\nWe evaluate the translation quality by tokenized case-sensitive BLEU [20] with multi-bleu.pl (footnote 7), except\nfor De-En where we use case-insensitive BLEU to follow the common practice and En-Ro where\nwe use detokenized BLEU to be consistent with [24, 7] for comparison. 
Larger BLEU means better\ntranslation quality.\n\n4 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/lowercase.perl\n5 https://github.com/rsennrich/subword-nmt\n6 http://nlp.stanford.edu/projects/nmt\n7 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl\n\n5 Results\n\n5.1 Compared with Previous NMT Models\n\nWe evaluate our proposed model on the five translation tasks and compare with the previous models\nthat are under the typical encoder-decoder framework. First, we compare with the Transformer\nbaseline, which is trained with the tensor2tensor code [32]. We adopt the small configuration for\nour model on IWSLT14 De-En/Ro-En/Es-En tasks, and the big configuration on WMT16 En-Ro and\nWMT14 En-De tasks. On WMT14 En-De task, we also run an extra base configuration for our model\nto be comparable with the transformer_base model available in the Transformer paper, where\nd_model and d_ff are set to 512 and 2048. On this task, we reproduce the BLEU score in the Transformer\npaper [33] on transformer_big and transformer_base configurations (footnote 8) and therefore we just list the\noriginal number in the paper. Second, we also compare with the results of other RNN/CNN-based\nmodels reported in the previous works. The BLEU scores are listed in Tables 1 and 2.\nOn IWSLT14 De-En task, our method achieves a BLEU score of 35.07, a 2.21-point\nimprovement over the Transformer baseline. We also compare with some RNN-based models, and\nour model improves on all of them (Table 1). On IWSLT14 Ro-En/Es-En tasks, we also surpass the\nTransformer baseline by 1.08/1.93 BLEU.\nOn WMT16 En-Ro task, our model achieves a 1.73 BLEU improvement over the Transformer\nbaseline. 
Compared with the RNN-based [24] and CNN-based [7] encoder-decoder models, our\nmethod also outperforms both.\nOn WMT14 En-De task, we compare with Transformer in both the base and big model configurations.\nWe achieve a 1.03 BLEU improvement over the transformer_base model and improve on the\ntransformer_big model with a new state-of-the-art BLEU score of 29.01. Again, we outperform all\nthe RNN/CNN-based encoder-decoder models in terms of BLEU score.\n\nTask | Method | BLEU\nDe-En | MIXER [23] | 21.83\nDe-En | AC+LL [1] | 28.53\nDe-En | NPMT [11] | 28.96\nDe-En | Dual Transfer Learning [34] | 32.35\nDe-En | Transformer (small) | 32.86\nDe-En | Our method (small) | 35.07\nRo-En | Transformer (small) | 29.64\nRo-En | Our method (small) | 30.72\nEs-En | UEDIN [3] | 37.29\nEs-En | Transformer (small) | 38.57\nEs-En | Our method (small) | 40.50\n\nTable 1: BLEU scores on IWSLT 2014 translation tasks compared with the Transformer baseline and other RNN/CNN-based models.\n\nTask | Method | BLEU\nEn-Ro | GRU [24] | 28.10\nEn-Ro | ConvS2S [7] | 30.02\nEn-Ro | Transformer (big) | 32.70\nEn-Ro | Our method (big) | 34.43\nEn-De | ByteNet [12] | 23.75\nEn-De | GNMT+RL [36] | 24.60\nEn-De | ConvS2S [7] | 25.16\nEn-De | MoE [26] | 26.03\nEn-De | Transformer (base) [33] | 27.30\nEn-De | Transformer (big) [33] | 28.40\nEn-De | Our method (base) | 28.33\nEn-De | Our method (big) | 29.01\n\nTable 2: BLEU scores on WMT translation tasks compared with the Transformer baseline and other RNN/CNN-based models.\n\n5.2 Model Variations\n\nAblation Study To evaluate the importance of different components of the layer-wise coordination\nmodel, we disable each component of our model in turn and test the performance changes on De-En task. We\nfollow the same decoding strategy as described in Section 4.3 and compare the BLEU score changes\non the test set, as listed in Table 3.\nThe first row in Table 3 is the basic parameter setting for our model. First, we do not use weight\nsharing in the second row, but for fair comparison we halve the number of model layers to ensure\nthe same number of model parameters. 
\n\n8 For the transformer_base/transformer_big configuration, our reproduced score is 27.35/28.24 while the score in\nthe paper is 27.30/28.40.\n\nModel | #parameter | BLEU | \u2206\nOur model | 19.07M | 35.07 |\nOur model w/o weight sharing | 19.07M | 33.96 | 1.11 \u2193\nOur model w/o mixed attention | 19.07M | 33.77 | 1.30 \u2193\nOur model w/o source/target embedding | 19.07M | 32.80 | 2.27 \u2193\nOur model w/o position embedding | 19.07M | 18.46 | 16.61 \u2193\n\nTable 3: Ablation study on our proposed model on De-En task.\n\n#layer (our method) | 10 | 14 | 18 | 22\nBLEU | 34.32 | 35.07 | 35.31 | 35.05\n\n#layer (baseline) | 4 | 6 | 8 | 10\nBLEU | 32.78 | 32.86 | 32.72 | 32.67\n\nTable 4: The BLEU scores under different number of layers for our method and the baseline on\nDe-En task.\n\nSource (De): zwei minuten sp\u00e4ter passierten drei dinge gleichzeitig.\nReference (En): two minutes later, three things happened at the same time.\nTransformer: two minutes later, three things happened.\nOur model: two minutes later, three things happened at the same time.\n\nSource (De): mit 17 wurde sie die zweite frau eines mandarin, dessen mutter sie schlug.\nReference (En): at 17 she became the second wife of a mandarin whose mother beat her.\nTransformer: at the age of 17, she turned into a mandarin second woman whose mother beat her.\nOur model: at 17, she became the second woman of a mandarin whose mother beat her.\n\nSource (De): und ich erwiderte: \"wie kommuniziert ihr denn nun?\"\nReference (En): and i said, \"well, how do you actually communicate?\"\nTransformer: and i said, \"how does you communicates?\"\nOur model: and i said, \"how do you communicate?\"\n\nTable 5: Translation examples on De-En dataset from our model and the Transformer baseline.\n\nWe can see that weight sharing indeed outperforms the non-sharing\ncounterpart. Second, if we separate the mixed attention into self-attention and source-target\nattention as usual, the BLEU score also drops. 
Third, removing either the position embedding or the source/target\nembedding hurts the model performance, especially removing the position embedding. The\naforementioned results demonstrate the importance of each component of our model.\n\nVarying the Number of Layers We vary the number of layers to investigate how our model\nperforms. Table 4 shows the BLEU score on De-En task with 10/14/18/22 layers. For a fair\ncomparison, we also vary the number of layers of the baseline model over 4/6/8/10 to ensure a similar\nnumber of parameters between the two models (i.e., our model with 10 layers has a similar number of\nparameters to the baseline model with 4 layers). We can observe that the BLEU scores barely change or\neven drop when increasing the number of layers for the baseline model, possibly due to overfitting. However,\nour layer-wise coordination can build a deep model of up to 18 layers on this task, achieving a\nrecord-breaking 35.31 BLEU score. When increasing to 22 layers, which is an extremely deep configuration\nfor NMT, the BLEU score of our model drops to 35.05. We will investigate deeper model training through our\nlayer-wise coordination in future work.\n\n5.3 Case Study\n\nCase Analysis Table 5 shows several translation examples produced by our model compared with\nthe Transformer baseline. We can see that our layer-wise coordination model generates more adequate,\nfluent and accurate sentences. For the first case, the Transformer suffers from an adequacy problem:\nit misses the information \u201cat the same time\u201d, while our model catches this information accurately.\nIn the second case, although Transformer nearly translates the meaning of the source sentence\nadequately, the result is less fluent than that of our model. 
In the third case, Transformer\nmistakenly generates the sentence in the third-person singular while our model handles this case well.\n\nAttention Visualization To give a deeper understanding of why our model works better, we\nvisualize the attention weights of our model and Transformer for the first case from Table 5. In\nthis case, we analyze why Transformer misses the information \u201cat the same time\u201d while our model\ntranslates it successfully. We investigate what information the model attends to when generating\nthe next token after \u201chappened\u201d. The attention weights are from the first layer of the decoder for\nboth models. Transformer, as the typical encoder-decoder framework, uses two attentions to extract\ninformation from source and target separately, and it extracts the source information from the last\nlayer of hidden representations of the encoder; thus the low layer of the decoder may not extract\nthis high-level representation precisely. As shown in Figure 2, when generating the token after\n\u201chappened\u201d, it attends to diverse tokens, such as \u201cpassierten\u201d, \u201cgleichzeitig\u201d, \u201c.\u201d and \u201cEOS\u201d, which\ncauses the generation of \u201c.\u201d and ends the translation early. The target self-attention in Figure 3 alone\ncannot provide much information for the correct prediction. However, in our mixed attention as shown\nin Figure 4, we can observe that the attention weights mostly focus on the source token \u201cgleichzeitig\u201d,\nwhich means \u201csimultaneously\u201d in English, the previously generated token \u201chappened\u201d, and the current\nposition, which can precisely help the model generate the next token \u201cat\u201d, the beginning of the phrase\n\u201cat the same time\u201d. More cases can be found in the supplementary materials (part A). 
Such\ncases show the advantages of our layer-wise coordination learning over the typical encoder-decoder\nbased models.\n\nFigure 2: Source to target attention in Transformer.\n\nFigure 3: Target self-attention in Transformer.\n\nFigure 4: Mixed attention in our model.\n\n5.4 Discussions on Mixed Attention\n\nIn NMT, the generation of the target word depends on both source and target contexts, where source\ncontexts affect the adequacy of the generated sentence while target contexts have an impact on the\nfluency [6, 14, 30, 31]. Our mixed attention is designed to better coordinate the learning of encoder\nand decoder by extracting cross-position information from both the source and preceding target tokens\nin one and the same attention function. In this way, the model automatically learns the preference on\nthe source or target contexts when generating the target token, which will be beneficial when tackling\nthe adequacy and fluency problems, as a by-product of our model design.\n[30] also developed a context gate on RNN-based models to trade off the context information from\nsource and target. Here we compare our proposed model with [30] on De-En task by implementing it\nwith Transformer, since it was originally implemented for RNN-based NMT models. The implementation\ndetails can be found in the supplementary materials (part B). For fair comparison, we just use\nlayer-wise coordination without weight sharing between the encoder and decoder (the corresponding BLEU\nscore is 33.96 and parameter size is 19.07M as shown in Table 3). Our implemented Transformer\nversion of [30] has a 6-layer encoder and decoder with parameter size of 19.08M. Its BLEU score is\n33.02, 0.94 points lower than our method.\n\nWe also show more visualization cases on our mixed attention in the supplementary materials (part\nC).\n\n6 Conclusion\n\nIn this work, we improved existing NMT models through layer-wise coordination of the encoder\nand decoder. 
Our method aligns the i-th layer of the encoder to the i-th layer of the decoder and\ncoordinates the learning of the hidden representations of source and target sentences layer by layer, by\nsharing the parameters of the aligned layers. Experiments on several translation tasks demonstrated\nour proposed model outperforms the Transformer baseline as well as other RNN/CNN-based encoder-\ndecoder models.\nFor future works, we will apply the idea of layer-wise coordination to other sequence to sequence\ntasks, such as question answering and image captioning. We will also investigate better ways to\ncoordinate the learning and interaction between the source and target. Furthermore, we will study\nhow to leverage layer-wise coordination to train deeper NMT models.\n\n7 Acknowledgement\n\nThis work was partially supported by the National Key Research and Development Program of China\nunder Grant No. 2016YFC0801001, the National Program on Key Basic Research Projects (973\nProgram) under Grant 2015CB351803, NSFC under Grant 61571413, 61632001, 61390514. We\nthank all the anonymous reviewers for their valuable comments on our paper.\n\nReferences\n[1] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio.\nAn actor-critic algorithm for sequence prediction. 5th International Conference on Learning\nRepresentations, ICLR, 2017, 2017.\n\n[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align\n\nand translate. ICLR 2015, 2015.\n\n[3] M. Cettolo, J. Niehues, S. St\u00fcker, L. Bentivogli, and M. Federico. Report on the 11th iwslt\n\nevaluation campaign. In Proceedings of the 11th IWSLT, 2014.\n\n[4] J. Cheng, L. Dong, and M. Lapata. Long short-term memory-networks for machine reading. In\nProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,\nEMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 551\u2013561, 2016.\n\n[5] K. Cho, B. van Merrienboer, \u00c7. 
G\u00fcl\u00e7ehre, D. Bahdanau, F. Bougares, H. Schwenk, and\nY. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine\ntranslation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language\nProcessing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special\nInterest Group of the ACL, pages 1724\u20131734, 2014.\n\n[6] Y. Ding, Y. Liu, H. Luan, and M. Sun. Visualizing and understanding neural machine translation.\nIn Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics\n(Volume 1: Long Papers), volume 1, pages 1150\u20131159, 2017.\n\n[7] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to\nsequence learning. In Proceedings of the 34th International Conference on Machine Learning,\nICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1243\u20131252, 2017.\n\n[8] J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher. Non-autoregressive neural machine\ntranslation. CoRR, abs/1711.02281, 2017.\n\n[9] H. Hassan, A. Aue, C. Chen, V. Chowdhary, J. Clark, C. Federmann, X. Huang, M. Junczys-Dowmunt,\nW. Lewis, M. Li, S. Liu, T. Liu, R. Luo, A. Menezes, T. Qin, F. Seide, X. Tan, F. Tian,\nL. Wu, S. Wu, Y. Xia, D. Zhang, Z. Zhang, and M. Zhou. Achieving human parity on automatic\nChinese to English news translation. CoRR, abs/1803.05567, 2018.\n\n[10] B. Hu, Z. Lu, H. Li, and Q. Chen. Convolutional neural network architectures for matching\nnatural language sentences. In Advances in Neural Information Processing Systems 27: Annual\nConference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal,\nQuebec, Canada, pages 2042\u20132050, 2014.\n\n[11] P.-S. Huang, C. Wang, D. Zhou, and L. Deng. Neural phrase-based machine translation. arXiv\npreprint arXiv:1706.05565, 2017.\n\n[12] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. 
Kavukcuoglu.\nNeural machine translation in linear time. CoRR, abs/1610.10099, 2016.\n\n[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\n[14] P. Koehn. Statistical machine translation. Draft of chapter, 13, 2017.\n\n[15] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured\nself-attentive sentence embedding. 2017.\n\n[16] Z. Lu and H. Li. A deep architecture for matching short texts. In Advances in Neural Information\nProcessing Systems 26: 27th Annual Conference on Neural Information Processing Systems\n2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States,\npages 1367\u20131375, 2013.\n\n[17] M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural\nmachine translation. In EMNLP, 2015.\n\n[18] T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine\ntranslation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language\nProcessing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1412\u20131421, 2015.\n\n[19] L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng. Text matching as image recognition. In\nProceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016,\nPhoenix, Arizona, USA, pages 2793\u20132799, 2016.\n\n[20] K. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a method for automatic evaluation\nof machine translation. In Proceedings of the 40th Annual Meeting of the Association for\nComputational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311\u2013318, 2002.\n\n[21] A. P. Parikh, O. T\u00e4ckstr\u00f6m, D. Das, and J. Uszkoreit. A decomposable attention model for\nnatural language inference. 
In Proceedings of the 2016 Conference on Empirical Methods in\nNatural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages\n2249\u20132255, 2016.\n\n[22] R. Paulus, C. Xiong, and R. Socher. A deep reinforced model for abstractive summarization.\nCoRR, abs/1705.04304, 2017.\n\n[23] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural\nnetworks. CoRR, abs/1511.06732, 2015.\n\n[24] R. Sennrich, B. Haddow, and A. Birch. Edinburgh neural machine translation systems for WMT\n16. In Proceedings of the First Conference on Machine Translation, WMT 2016, colocated with\nACL 2016, August 11-12, Berlin, Germany, pages 371\u2013376, 2016.\n\n[25] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword\nunits. In Proceedings of the 54th Annual Meeting of the Association for Computational\nLinguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.\n\n[26] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean.\nOutrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR,\nabs/1701.06538, 2017.\n\n[27] Y. Shen, X. Tan, D. He, T. Qin, and T. Liu. Dense information flow for neural machine\ntranslation. In Proceedings of the 2018 Conference of the North American Chapter of the\nAssociation for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018,\nNew Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1294\u20131303,\n2018.\n\n[28] K. Song, X. Tan, D. He, J. Lu, T. Qin, and T. Liu. Double path networks for sequence to\nsequence learning. In Proceedings of the 27th International Conference on Computational\nLinguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 3064\u20133074,\n2018.\n\n[29] I. Sutskever, O. Vinyals, and Q. V. Le. 
Sequence to sequence learning with neural networks.\nIn Advances in Neural Information Processing Systems 27: Annual Conference on Neural\nInformation Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages\n3104\u20133112, 2014.\n\n[30] Z. Tu, Y. Liu, Z. Lu, X. Liu, and H. Li. Context gates for neural machine translation. TACL,\n5:87\u201399, 2017.\n\n[31] Z. Tu, Y. Liu, L. Shang, X. Liu, and H. Li. Neural machine translation with reconstruction. In\nProceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017,\nSan Francisco, California, USA, pages 3097\u20133103, 2017.\n\n[32] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser,\nN. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit. Tensor2tensor for neural\nmachine translation. CoRR, abs/1803.07416, 2018.\n\n[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and\nI. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems\n30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017,\nLong Beach, CA, USA, pages 6000\u20136010, 2017.\n\n[34] Y. Wang, Y. Xia, L. Zhao, J. Bian, T. Qin, G. Liu, and T. Liu. Dual transfer learning for neural\nmachine translation with marginal distribution regularization. In AAAI, 2018.\n\n[35] L. Wu, X. Tan, D. He, F. Tian, T. Qin, J. Lai, and T. Liu. Beyond error propagation in neural\nmachine translation: Characteristics of language also matter. In EMNLP, 2018.\n\n[36] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao,\nK. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo,\nH. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick,\nO. Vinyals, G. Corrado, M. Hughes, and J. Dean. 
Google\u2019s neural machine translation system:\nBridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.\n\n[37] Y. Xia, X. Tan, F. Tian, T. Qin, N. Yu, and T. Liu. Model-level dual learning. In Proceedings\nof the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm\u00e4ssan,\nStockholm, Sweden, July 10-15, 2018, pages 5379\u20135388, 2018.\n", "award": [], "sourceid": 4923, "authors": [{"given_name": "Tianyu", "family_name": "He", "institution": "University of Science and Technology of China"}, {"given_name": "Xu", "family_name": "Tan", "institution": "Microsoft Research"}, {"given_name": "Yingce", "family_name": "Xia", "institution": "Microsoft Research"}, {"given_name": "Di", "family_name": "He", "institution": "Peking University"}, {"given_name": "Tao", "family_name": "Qin", "institution": "Microsoft Research"}, {"given_name": "Zhibo", "family_name": "Chen", "institution": "University of Science and Technology of China"}, {"given_name": "Tie-Yan", "family_name": "Liu", "institution": "Microsoft Research Asia"}]}