{"title": "Sequence Modeling with Unconstrained Generation Order", "book": "Advances in Neural Information Processing Systems", "page_first": 7700, "page_last": 7711, "abstract": "The dominant approach to sequence generation is to produce a sequence in some predefined order, e.g. left to right. In contrast, we propose a more general model that can generate the output sequence by inserting tokens in any arbitrary order. Our model learns decoding order as a result of its training procedure. Our experiments show that this model is superior to fixed order models on a number of sequence generation tasks, such as Machine Translation, Image-to-LaTeX and Image Captioning.", "full_text": "Sequence Modeling with\n\nUnconstrained Generation Order\n\nDmitrii Emelianenko1,2 Elena Voita1,3\n\nPavel Serdyukov1\n\n1Yandex, Russia\n\n2National Research University Higher School of Economics, Russia\n\n3University of Amsterdam, Netherlands\n\n{dimdi-y, lena-voita, pavser}@yandex-team.ru\n\nAbstract\n\nThe dominant approach to sequence generation is to produce a sequence in some\nprede\ufb01ned order, e.g. left to right. In contrast, we propose a more general model\nthat can generate the output sequence by inserting tokens in any arbitrary order.\nOur model learns decoding order as a result of its training procedure. Our ex-\nperiments show that this model is superior to \ufb01xed order models on a number\nof sequence generation tasks, such as Machine Translation, Image-to-LaTeX and\nImage Captioning.1\n\n1\n\nIntroduction\n\nNeural approaches to sequence generation have seen a variety of applications such as language\nmodeling [1], machine translation [2, 3], music generation [4] and image captioning [5]. All these\ntasks involve modeling a probability distribution over sequences of some kind of tokens.\nUsually, sequences are generated in the left-to-right manner, by iteratively adding tokens to the\nend of an un\ufb01nished sequence. 
Although this approach is widely used due to its simplicity, such\ndecoding restricts the generation process. Generating sequences in the left-to-right manner reduces\noutput diversity [6] and could be unsuited for the target sequence structure [7]. To alleviate this\nissue, previous studies suggested exploiting prior knowledge about the task (e.g. the semantic roles of\nwords in a natural language sentence or the concept of language branching) to select the preferable\ngeneration order [6, 7, 8]. However, these approaches are still limited by prede\ufb01ned generation order,\nwhich is the same for all input instances.\n\nFigure 1: Examples of different decoding orders: left-to-right, alternative and right-to-left orders\nrespectively. Each line represents one decoding step.\n\nIn this work, we propose INTRUS: INsertion TRansformer for Unconstrained order Sequence\nmodeling. Our model has no prede\ufb01ned order constraint and generates sequences by iteratively\nadding tokens to a subsequence in any order, not necessarily in the order they appear in the \ufb01nal\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1The source code is available at https://github.com/TIXFeniks/neurips2019_intrus.\n\n\fsequence. 
It learns to find a convenient generation order as a by-product of its training procedure, without any reliance on prior knowledge about the task it is solving.\nOur key contributions are as follows:\n\n• We propose a neural sequence model that can generate the output sequence by inserting tokens in any arbitrary order;\n\n• The proposed model outperforms fixed-order baselines on several tasks, including Machine Translation, Image-to-LaTeX and Image Captioning;\n\n• We analyze learned generation orders and find that the model has a preference towards producing “easy” words at the beginning and leaving more complicated choices for later.\n\n2 Method\n\nWe consider the task of generating a sequence Y consisting of tokens y_t given some input X. In order to remove the predefined generation order constraint, we need to reformulate the probability of the target sequence in terms of token insertions. Unlike in traditional models, there are multiple valid insertions at each step. This formulation is closely related to the existing framework of generating unordered sets, which we briefly describe in Section 2.1. In Section 2.2, we introduce our approach.\n\n2.1 Generating unordered sets\n\nIn the context of unordered set generation, Vinyals et al. [9] proposed a method to learn sequence order from data jointly with the model. The resulting model samples a permutation π(t) of the target sequence and then scores the permuted sequence with a neural probabilistic model:\n\nP(Y_π | X, θ) = Π_t p(y_π(t) | X, y_π(0), ..., y_π(t−1), θ).   (1)\n\nThe training is performed by maximizing the data log-likelihood over both the model parameters θ and the target permutation π(t):\n\nθ* = arg max_θ Σ_{X,Y} max_π log P(Y_π | X, θ).   (2)\n\nExact maximization over π(t) requires O(|Y|!)
operations, therefore it is infeasible in practice.\nInstead, the authors propose using greedy or beam search. The resulting procedure resembles the\nExpectation Maximization algorithm:\n\n1. E step: \ufb01nd optimal \u03c0(t) for Y under current \u03b8 with inexact search,\n2. M step: update parameters \u03b8 with gradient descent under \u03c0(t) found on the E step.\n\nEM algorithms are known to easily get stuck in local optima. To mitigate this issue, the authors\nsample permutations proportionally to p(y\u03c0(t)|x, y\u03c0(0), .., y\u03c0(t\u22121), \u03b8) instead of maximizing over \u03c0.\n\n2.2 Our approach\n\nThe task now is to build a probabilistic model over sequences \u03c4 = (\u03c40, \u03c41, ..., \u03c4T ) of insertion opera-\ntions. This can be viewed as an extension of the approach described in the previous section, which\noperates on ordered sequences instead of unordered sets. At step t, the model generates either a pair\n\u03c4t = (post, tokent) consisting of a position post in the produced so far sub-sequence (post \u2208 [0, t])\nand a token tokent to be inserted at this position, or a special EOS element indicating that the gener-\nation process is terminated. 
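As a concrete illustration of the formalism above, a trajectory of (pos, token) insertions can be replayed to reconstruct the output sequence; a minimal sketch in Python (the function name is ours):

```python
def apply_trajectory(trajectory):
    """Replay a sequence of (pos, token) insertions into an output sequence.

    Each step inserts `token` at index `pos` of the partial output,
    so pos may range from 0 to the current length t.
    """
    out = []
    for pos, token in trajectory:
        out.insert(pos, token)
    return out

# One valid trajectory for the "a cat sat" example of Figure 2:
assert apply_trajectory([(0, "a"), (1, "sat"), (1, "cat")]) == ["a", "cat", "sat"]
```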
It estimates the conditional probability of a new insertion τ_t given X and a partial output Ỹ(τ_0:t−1) constructed from the previous inserts:\n\np(τ | X, θ) = Π_t p(τ_t | X, Ỹ(τ_0:t−1), θ).   (3)\n\nTraining objective We train the model by maximizing the log-likelihood of the reference sequence Y given the source X, summed over the data set D:\n\nL = Σ_{X,Y ∈ D} log p(Y | X, θ) = Σ_{X,Y ∈ D} log Σ_{τ ∈ T*(Y)} p(τ | X, θ) = Σ_{X,Y ∈ D} log Σ_{τ ∈ T*(Y)} Π_t p(τ_t | X, Ỹ(τ_0:t−1), θ),   (4)\n\nwhere T*(Y) denotes the set of all trajectories leading to Y (see Figure 2).\n\nFigure 2: Graph of trajectories for T*(Y = “a cat sat”).\n\nIntuitively, we maximize the total probability “flowing” through the acyclic graph defined by T*(Y). This graph has approximately O(|Y|!) paths from an empty sequence to the target sequence Y. Therefore, directly maximizing (4) is impractical. Our solution, inspired by [9], is to assume that for any input X there is a trajectory τ* that is the most convenient for the model. We want the model to concentrate the probability mass on this single trajectory.
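For short targets, T*(Y) can be enumerated exhaustively to confirm the factorial growth mentioned above; a small illustration (helper names are ours):

```python
from itertools import permutations
from math import factorial

def trajectories(target):
    """Enumerate every insertion trajectory in T*(target).

    Each trajectory corresponds to a permutation of target indices:
    generating token `idx` inserts it at a position equal to the number
    of already-placed tokens that precede it in the target.
    """
    trajs = []
    for order in permutations(range(len(target))):
        placed, traj = [], []
        for idx in order:
            pos = sum(1 for j in placed if j < idx)
            traj.append((pos, target[idx]))
            placed.append(idx)
        trajs.append(tuple(traj))
    return trajs

trajs = trajectories(("a", "cat", "sat"))
assert len(set(trajs)) == factorial(3)  # |Y|! distinct paths to Y
```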
This can be formulated as a lower bound of the objective (4):\n\nL = Σ_{X,Y ∈ D} log p(Y | X, θ) = Σ_{X,Y ∈ D} log Σ_{τ ∈ T*(Y)} Π_t p(τ_t | X, Ỹ(τ_0:t−1), θ) ≥ Σ_{X,Y ∈ D} log max_τ Π_t p(τ_t | X, Ỹ(τ_0:t−1), θ) = Σ_{X,Y ∈ D} max_τ Σ_t log p(τ_t | X, Ỹ(τ_0:t−1), θ).   (5)\n\nThe lower bound is tight iff the entire probability mass in T* is concentrated along a single trajectory. This leads to a convenient property: maximizing (5) forces the model to choose a certain “optimal” sequence of insertions τ* = arg max_τ Π_t p(τ_t | X, Ỹ(τ_0:t−1), θ) and concentrate most of the probability mass there.\nThe bound (5) depends only on the most probable trajectory τ*, and is thus difficult to optimize directly. This may result in convergence to a local maximum. Similar to [9], we replace the max with an expectation w.r.t. trajectories sampled from T*. We sample from the probability distribution over trajectories obtained from the model. The new lower bound is:\n\nΣ_{X,Y ∈ D} E_{τ ∼ p(τ | X, τ ∈ T*(Y), θ)} Σ_t log p(τ_t | X, Ỹ(τ_0:t−1), θ).   (6)\n\nThe sampled lower bound (6) is less than or equal to (5). However, if the entire probability mass is concentrated on a single trajectory, both lower bounds are tight.
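The ordering of the exact objective (4), the max bound (5) and the sampled bound (6) can be checked numerically on a toy distribution over trajectories; the numbers below are purely illustrative:

```python
import math

# Toy probabilities p(tau | X) of three trajectories that all lead to the same Y.
p = [0.5, 0.3, 0.1]
z = sum(p)

log_marginal = math.log(sum(p))   # objective (4): log of the sum over trajectories
max_bound = math.log(max(p))      # bound (5): log-prob of the single best trajectory
exp_bound = sum(pi / z * math.log(pi) for pi in p)  # bound (6): expectation under the
                                                    # model's renormalized distribution

# (4) >= (5) >= (6); all three coincide when one trajectory carries all the mass.
assert log_marginal >= max_bound >= exp_bound
```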
Thus, when maximizing (6), we also expect most of the probability mass to be concentrated on one or a few “best” trajectories.\n\nTraining procedure We train our model using stochastic gradient ascent on (6). For each pair {X, Y} from the current mini-batch, we sample a trajectory τ from the model: τ ∼ p(τ | X, τ ∈ T*(Y), θ). We constrain sampling to correct trajectories by allowing only the correct insertion operations (i.e. the ones that lead to producing Y). At each step along the sampled trajectory τ, we maximize log p(ref(Y, τ_0:t−1) | X, Ỹ(τ_0:t−1), θ), where ref(Y, τ_0:t−1) denotes the set of all insertions τ_t immediately after τ_0:t−1 such that the trajectory τ_0:t−1 extended with τ_t is correct: τ_0:t−1 ⊕ τ_t ∈ T*(Y). From a formal standpoint, this is the probability of picking any insertion that is on a path to Y. The simplified training procedure is given in Algorithm 1.\n\nAlgorithm 1: Training procedure (simplified)\nInputs: batch {X, Y}, parameters θ, learning rate α, g := 0   // g is the gradient accumulator\nfor X_i, Y_i ∈ {X, Y} do\n    τ ∼ p(τ | X_i, τ ∈ T*(Y_i), θ)\n    for t ∈ 0, 1, ..., |τ| − 1 do\n        ref := ref(Y_i, τ_0:t−1)   // correct insertions\n        L_i,t := log p(ref | X_i, Ỹ_i(τ_0:t−1), θ)\n        g := g + ∂L_i,t / ∂θ\n    end\nend\nreturn θ + α · g\n\nThe training procedure is split into two steps: (i) pretraining with uniform samples from the set of feasible trajectories T*(Y), and (ii) training on samples from our model’s probability distribution over T*(Y) till convergence. We discuss the importance of the pretraining step in Section 5.2.\n\nInference To find the most likely output sequence according to our model, we have to compute the probability distribution over target sequences as follows:\n\np(Y | X, θ) = Σ_{τ ∈ T*(Y)} p(τ | X, θ).   (7)\n\nComputing this probability exactly requires summation over up to O(|Y|!) trajectories, which is infeasible in practice. However, due to the nature of our optimization algorithm (explicitly maximizing the lower bound E_{τ ∼ p(τ | X, τ ∈ T*(Y), θ)} p(τ | X, θ) ≤ max_{τ ∈ T*(Y)} p(τ | X, θ) ≤ P(Y | X)), we expect most of the probability mass to be concentrated on one or a few “best” trajectories:\n\nP(Y | X) ≈ max_{τ ∈ T*(Y)} p(τ | X, θ).   (8)\n\nHence, we perform approximate inference by finding the most likely trajectory of insertions, disregarding the fact that several trajectories may lead to the same Y.2 The resulting inference problem is defined as:\n\nY* = arg max_{Y(τ)} log p(τ | X, θ).   (9)\n\nThis problem is combinatorial in nature, but it can be solved approximately using beam search. In the case of our model, beam search compares partial output sequences and extends them by selecting the k best token insertions.
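The insertion variant of beam search can be sketched as follows; the toy scoring function and all names are our own illustration, not the paper's implementation. Finished hypotheses are compared by length-normalized log-probability:

```python
def beam_search_insertions(score_fn, vocab, beam_size=4, max_steps=4):
    """Beam search where each expansion is a token insertion.

    A hypothesis is (tokens, log_prob). Expansions are all (pos, token)
    insertions; a stop action (pos=None) moves the hypothesis to the
    finished pool, scored by length-normalized log-probability.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_steps):
        candidates = []
        for toks, lp in beams:
            stop_lp = lp + score_fn(toks, None, None)
            finished.append((toks, stop_lp / (len(toks) + 1)))
            for pos in range(len(toks) + 1):
                for tok in vocab:
                    new = toks[:pos] + (tok,) + toks[pos:]
                    candidates.append((new, lp + score_fn(toks, pos, tok)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return max(finished, key=lambda c: c[1])[0]

TARGET = ("a", "cat", "sat")

def toy_score(toks, pos, tok):
    """Toy insertion scorer: reward hypotheses that stay subsequences of TARGET."""
    if pos is None:  # stop action
        return 0.0 if toks == TARGET else -10.0
    new = toks[:pos] + (tok,) + toks[pos:]
    it = iter(TARGET)
    return -0.5 if all(t in it for t in new) else -10.0

assert beam_search_insertions(toy_score, TARGET) == TARGET
```

Under this scorer the search may build the target in any insertion order (e.g. "cat" before "a"); the stop action plus length normalization selects the completed hypothesis.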
Our model also inherits a common problem of left-to-right machine translation: it tends to stop too early and produce output sequences that are shorter than the reference. To alleviate this effect, we divide hypotheses’ log-probabilities by their length. This has already been used in previous works [10, 11, 12].\n\n3 Model architecture\n\nINTRUS follows the encoder-decoder framework. Specifically, the model is based on the Transformer [10] architecture (Figure 3) due to its state-of-the-art performance on a wide range of tasks [13, 14, 15]. There are two key differences between INTRUS and left-to-right sequence models. Firstly, our model’s decoder does not require the attention mask preventing attention to subsequent positions. Decoder self-attention is re-applied at each decoding step, because the positional encodings of most tokens change when inserting a new one into an incomplete subsequence of Y.3 Secondly, the decoder predicts the joint probability of a token and a position corresponding to a single insertion (rather than the probability of a token, as usually done in the standard setting). Consequently, the predicted probabilities should add up to 1 over all positions and tokens at each step.\n\n2 To justify this transition, we translated 10^4 sentences with a fully trained model using beam size 128 and found only 4 occasions where multiple insertion trajectories in the beam led to the same output sequence.\n\n3 Though this makes training more computationally expensive than for the standard Transformer, it does not hurt decoding speed much: on average, decoding is only about 50% slower than the baseline. We discuss this in detail in Section 5.2.
We achieve this by decomposing the insertion probability into the probabilities of a token and a position:\n\np(τ_t) = p(token | pos) · p(pos),\np(pos) = softmax(H × w_loc),\np(token | pos) = softmax(h_pos × W_tok).   (10)\n\nHere h_pos ∈ R^d denotes a single decoder hidden state (of size d) corresponding to an insertion at position pos; H ∈ R^{t×d} represents the matrix of all such states. W_tok ∈ R^{d×v} is a learned weight matrix that predicts token probabilities, and w_loc ∈ R^d is a learned vector of weights used to predict positions. In other words, the hidden state of each token in the current sub-sequence defines (i) the probability that the next token will be generated at the position immediately preceding the current one and (ii) the probability of each particular token being generated next.\n\nFigure 3: Model architecture: p(τ_t | X, Ỹ(τ_0:t−1), θ) for a single token insertion. The output of column i for token j defines the probability of inserting token j into Ỹ(τ_0:t−1) before the token at the i-th position.\n\nThe encoder component of the model can have any task-specific network architecture. For the Machine Translation task, it can be an arbitrary sequence encoder: any combination of RNN [3, 2], CNN [12, 16] or self-attention [10, 17]. For image-to-sequence problems (e.g. Image-To-LaTeX [18]), any 2D convolutional encoder architecture from the domain of computer vision can be used [19, 20, 21].\n\n3.1 Relation to prior work\n\nThe closest to ours is the work by Gu et al. [22]4, who propose a decoding algorithm which supports flexible sequence generation in arbitrary orders through insertion operations.\nIn terms of modeling, they describe a similar transformer-based model, but use a relative-position-based representation to capture generation orders.
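The decomposition in (10) is easy to sanity-check numerically. The sketch below follows the shapes above (H ∈ R^{t×d}, w_loc ∈ R^d, W_tok ∈ R^{d×v}); the concrete sizes are arbitrary toy values of ours:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
t, d, v = 5, 8, 11                    # current length, hidden size, vocab size
H = rng.normal(size=(t, d))           # one decoder state per insertion position
w_loc = rng.normal(size=(d,))         # position head
W_tok = rng.normal(size=(d, v))       # token head

p_pos = softmax(H @ w_loc, axis=0)            # p(pos), shape (t,)
p_tok_given_pos = softmax(H @ W_tok, axis=1)  # p(token | pos), shape (t, v)
p_insert = p_pos[:, None] * p_tok_given_pos   # joint p(pos, token), shape (t, v)

# The joint distribution over all (position, token) insertions sums to 1.
assert np.isclose(p_insert.sum(), 1.0)
```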
This effectively addresses the problem that absolute positional encodings are unknown before generating the whole sequence. While in our model positional encodings of most of the tokens change after each insertion operation and, therefore, decoder self-attention is re-applied at each generation step, the model by Gu et al. [22] does not need this and has a better theoretical time complexity of O(len(Y)^2), in contrast to our O(len(Y)^3). However, in practice our decoding is on average only about 50% slower than the baseline; for details, see Section 5.2.\nIn terms of the training objective, they use the lower bound (5) with beam search over T*(Y), which is different from our lower bound (6). However, we found our lower bound to be beneficial in terms of quality and less prone to getting stuck in local optima. We discuss this in detail in Section 5.1.\n\n4 Experimental setup\n\nWe consider three sequence generation tasks: Machine Translation, Image-To-Latex and Image Captioning. For each, we now define the input X and output Y, the datasets and the task-specific encoder we use. Decoders for all tasks are Transformers in the base configuration [10] (either original or INTRUS) with identical hyperparameters.\n\n4 At the time of submission, this was a concurrent work.\n\nMachine Translation For MT, input and output are sentences in different languages. The encoder is the Transformer-base encoder [10].\nWu et al. [7] suggest that left-to-right NMT models fit better for right-branching languages (e.g., English) and right-to-left NMT models fit better for left-branching languages (e.g., Japanese). This defines the choice of language pairs for our experiments.
Our experiments include: En-Ru and Ru-En WMT14; En-Ja ASPEC [23]; En-Ar, En-De and De-En IWSLT14 Machine Translation data sets. We evaluate our models on the WMT2017 test set for En-Ru and Ru-En, the ASPEC test set for En-Ja, the concatenated IWSLT tst2010, tst2011 and tst2012 for En-De and De-En, and the concatenated IWSLT tst2014 and tst2013 for En-Ar.\nSentences of all translation directions except the Japanese part of the En-Ja data set are preprocessed with the Moses tokenizer [24] and segmented into subword units using BPE [25] with 32,000 merge operations. Before BPE segmentation, Japanese sentences were first segmented into words5.\n\nImage-To-Latex In this task, X is a rendered image of LaTeX markup, Y is the markup itself. We use the ImageToLatex-140K [18, 26] data set. We used the encoder CNN architecture, preprocessing pipeline and evaluation scripts by Singh [26]6.\n\nImage captioning Here X is an image, Y is its description in natural language. We use MSCOCO [27], the standard Image Captioning dataset. The encoder is VGG16 [19] pretrained7 on the ImageNet task, without the last layer.\n\nEvaluation We use BLEU8 [29] for the evaluation of Machine Translation and Image-to-Latex models. For En-Ja, we measure character-level BLEU to avoid the influence of word segmentation software. The scores on the MSCOCO dataset are obtained via the official evaluation script9.\n\nTraining details The models are trained until convergence with a base learning rate of 1.4e-3, 16,000 warm-up steps and a batch size of 4,000 tokens. We vary the learning rate over the course of training according to [10] and follow their optimization technique. We use beam search with a beam size between 4 and 64, selected using the validation data, for both the baseline and INTRUS, although our model benefits more when using even bigger beam sizes. The pretraining phase of INTRUS is 10^5 batches.\n\n5 Results\n\nTable 1: The results of our experiments. 
En-Ru, Ru-En, En-Ja, En-Ar, En-De and De-En are machine translation experiments. * indicates statistical significance with a p-value of 0.05, computed via bootstrapping [30].\n\nModel          En-Ru  Ru-En  En-Ja  En-Ar  En-De  De-En  Im2Latex  MSCOCO\n               BLEU   BLEU   BLEU   BLEU   BLEU   BLEU   BLEU      BLEU   CIDEr\nLeft-to-right  31.6   35.3   47.9   12.0   28.04  33.17  89.5      18.0   56.1\nRight-to-left  -      -      48.6   11.5   -      -      -         -      -\nINTRUS         33.2*  36.4*  50.3*  12.2   28.36* 33.08  90.3*     25.6*  81.0*\n\n5 Open-source word segmentation software is available at https://github.com/atilika/kuromoji\n6 We used https://github.com/untrix/im2latex\n7 We use pretrained weights from keras applications https://keras.io/applications/, the same for both the baseline and our model.\n8 BLEU is computed via the SacreBLEU [28] script with the following parameters: BLEU+c.lc+l.[src-lang]-[dst-lang]+.1+s.exp+tok.13a+v.1.2.18\n9 Script is available at https://github.com/tylin/coco-caption\n\nAmong all tasks, the largest improvements are for Image Captioning: 7.6 BLEU and 25.1 CIDEr. For Machine Translation, INTRUS substantially outperforms the baselines for most considered language pairs and matches the baseline for the rest. As expected, the right-to-left generation order is better than the left-to-right one for translation into Japanese. However, our model significantly outperforms both baselines. For the tasks where the left-to-right decoding order provides a strong inductive bias (e.g. 
in the De-En translation task, where source and target sentences can usually be aligned without any permutations), generation in arbitrary order does not give significant improvements.\nImage-To-Latex improves by 0.8 BLEU, which is a reasonable difference considering the high performance of the baseline.\n\n5.1 Ablation analysis\n\nIn this section, we show the superior performance of the proposed lower bound of the data log-likelihood (6) over the natural choice of (5). We also emphasize the importance of the pretraining phase for INTRUS. Specifically, we compare the performance of the following models:\n\n• Default — using the training procedure described in Section 2.2;\n• Argmax — trained with the lower bound (5) (the maximum is approximated using beam search with a beam of 4; this technique matches the one used in Gu et al. [22]);\n• Left-to-right pretraining — pretrained with the fixed left-to-right decoding order (in contrast to the uniform samples in the default setting);\n• No pretraining — with no pretraining phase;\n• Only pretraining — training is performed with a model-independent order, either uniform or left-to-right.\n\nTable 2: Training strategies of INTRUS. MT task, scores on the WMT En-Ru 2012-2013 test sets.\n\nTraining strategy                 BLEU\nINTRUS                            27.5\nArgmax                            26.6\nPretraining left-to-right         26.3\nNo pretraining                    27.1\nOnly pretraining (uniform)        24.6\nOnly pretraining (left-to-right)  25.5\nBaseline (left-to-right)          25.8\n\nTable 2 confirms the importance of the chosen pretraining strategy for the performance of the model. In preliminary experiments, we also observed that introducing either of the two pretraining strategies increases the overall robustness of our training procedure and helps to avoid convergence to poor local optima.
We attribute this to the fact that a pretrained model provides the main algorithm with a good initial exploration of the trajectory space T*, while the Argmax training strategy tends to quickly converge to the current best trajectory, which may not be globally optimal. This leads to poor performance and unstable results. This is the only strategy that required several consecutive runs to obtain reasonable quality, despite the fact that it starts from a good pretrained model.\n\n5.2 Computational complexity\n\nDespite its superior performance, INTRUS is more computationally expensive than the baseline. The main computational bottleneck in model training is the generation of insertions required to evaluate the training objective (6). This generation procedure is inherently sequential, and it is thus challenging to parallelize it effectively on GPU accelerators. In our experiments, the training time of INTRUS is 3-4 times longer than that of the baseline. The theoretical computational complexity of the model’s inference is O(|Y|^3 · k), compared to O(|Y|^2 · k) for conventional left-to-right models. However, in practice this is unlikely to cause a drastic decrease in decoding speed. Figure 4 shows the decoding speed of both INTRUS and the baseline measured on the machine translation task. On average, INTRUS is only about 50% slower, because for sentences of a reasonable length it performs comparably to the baseline.\n\nFigure 4: Inference time of INTRUS and the baseline models vs sentence length.\n\n6 Analyzing learned generation orders\n\nIn this section, we analyze generation orders learned by INTRUS on the Ru-En translation task.\n\nVisual inspection We noticed that the model often follows a general decoding direction that varies from sentence to sentence: left-to-right, right-to-left, middle-out, etc. (Figure 5 shows several examples10). When following the chosen direction, the model deviates from it for the translation of certain phrases.
For instance, the model tends to decode pairs of quotes and brackets together. We also noticed that tokens which are generated first are often uninformative (e.g., punctuation, determiners, etc.). This suggests that the model has a preference towards generating “easy” words first.\n\nFigure 5: Decoding examples: left-to-right (left), right-to-left (center) and middle-out (right). Each line represents one decoding step.\n\nPart of speech generation order We want to find out whether the model has any preference towards generating different parts of speech at the beginning or at the end of the decoding process. For each part of speech,11 we compute the relative index on the generation trajectory (for the baseline, it corresponds to the relative position in a sentence). Figure 6 shows that INTRUS tends to generate punctuation tokens and conjunctions early in decoding. Other parts of speech, such as nouns, adjectives, prepositions and adverbs, are the next easiest to predict. Most often they are produced in the middle of the generation process, when some context is already established. Finally, the most difficult for the model is inserting verbs and particles.\nThese observations are consistent with the easy-first generation hypothesis: the early decoding steps mostly produce words which are the easiest to predict based on the input data. This is especially interesting in the context of previous work. Ford et al. [8] study the influence of token generation order on language model quality. They developed a family of two-pass language models that depend on a partitioning of the vocabulary into a set of first-pass and second-pass tokens to generate sentences. The authors find that the most effective strategy is to generate function words in the first pass and content words in the second. While Ford et al. 
[8] consider three manually defined strategies, our model learned to give preference to such behavior despite not having any inductive bias to do so.\n\n10 More examples and the analysis for Image Captioning are provided in the supplementary material.\n11 To derive part-of-speech tags, we used the CoreNLP tagger [31].\n\nFigure 6: The distributions of the relative generation order of different parts of speech.\n\n7 Related work\n\nIn Machine Translation, decoding in the right-to-left order improves performance for English-to-Japanese [32, 7]. The difference in translation quality is attributed to two main factors: Error Propagation [33] and the concept of language branching [7, 34]. In some languages (e.g. English), sentences normally start with the subject/verb on the left and add more information in the rightward direction. Other languages (e.g. Japanese) have the opposite pattern.\nSeveral works suggest to first generate the most “important” token, and then the rest of the sequence using forward and backward decoders. The two decoders start the generation process from this first “important” token, which is predicted using classifiers. This approach was shown beneficial for video captioning [6] and conversational systems [35]. Other approaches to non-standard decoding include multi-pass generation models [36, 8, 37, 38] and non-autoregressive decoding [39, 38].\nSeveral recent works proposed sequence models with arbitrary generation order. Gu et al. [22] propose a similar approach using another lower bound of the log-likelihood which, as we showed in Section 5.1, underperforms ours. They, however, achieve O(|Y|^2) time complexity by utilizing a different probability parameterization along with relative position encoding. Welleck et al. [40] investigate the possibility of decoding output sequences by descending a binary insertion tree. Stern et al. 
[41] focuses on parallel decoding using one of several pre-speci\ufb01ed generation orders.\n\n8 Conclusion\n\nIn this work, we introduce INTRUS, a model which is able to generate sequences in any arbitrary\norder via iterative insertion operations. We demonstrate that our model learns convenient generation\norder as a by-product of its training procedure. The model outperforms left-to-right and right-to-left\nbaselines on several tasks. We analyze learned generation orders and show that the model has a\npreference towards producing \u201ceasy\u201d words at the beginning and leaving more complicated choices\nfor later.\n\nAcknowledgements\n\nThe authors thank David Talbot and Yandex Machine Translation team for helpful discussions and\ninspiration.\n\n9\n\n\fReferences\n[1] Tomas Mikolov, Martin Kara\ufb01\u00e1t, Luk\u00e1s Burget, Jan Cernock\u00fd, and Sanjeev Khudanpur. Recur-\n\nrent neural network based language model. In INTERSPEECH, 2010.\n\n[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\nlearning to align and translate. Presented at ICLR 2015, abs/1409.0473, September 2014. URL\nhttps://arxiv.org/abs/1409.0473.\n\n[3] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural\n\nnetworks. CoRR, abs/1409.3215, 2014. URL http://arxiv.org/abs/1409.3215.\n\n[4] Jean-Pierre Briot, Ga\u00ebtan Hadjeres, and Fran\u00e7ois Pachet. Deep learning techniques for music\ngeneration - A survey. CoRR, abs/1709.01620, 2017. URL http://arxiv.org/abs/\n1709.01620.\n\n[5] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov,\nRichard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation\nwith visual attention. CoRR, abs/1502.03044, 2015. URL http://arxiv.org/abs/\n1502.03044.\n\n[6] Shikib Mehri and Leonid Sigal. Middle-out decoding. In S. Bengio, H. Wallach, H. Larochelle,\nK. Grauman, N. Cesa-Bianchi, and R. 
Garnett, editors, Advances in Neural Information\nProcessing Systems 31, pages 5518\u20135529. Curran Associates, Inc., 2018. URL http:\n//papers.nips.cc/paper/7796-middle-out-decoding.pdf.\n\n[7] Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. Beyond\nerror propagation in neural machine translation: Characteristics of language also matter. In\nProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,\nBrussels, Belgium, October 31 - November 4, 2018, pages 3602\u20133611, 2018. URL https:\n//aclanthology.info/papers/D18-1396/d18-1396.\n\n[8] Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, and George Dahl. The importance of\ngeneration order in language modeling. In Proceedings of the 2018 Conference on Empirical\nMethods in Natural Language Processing, pages 2942\u20132946, Brussels, Belgium, October-\nNovember 2018. Association for Computational Linguistics. URL https://www.aclweb.\norg/anthology/D18-1324.\n\n[9] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence\nfor sets. In International Conference on Learning Representations (ICLR), 2016. URL http:\n//arxiv.org/abs/1511.06391.\n\n[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,\nLukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.\nURL http://arxiv.org/abs/1706.03762.\n\n[11] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang\nMacherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah,\nMelvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo,\nHideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason\nSmith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey\nDean. Google\u2019s neural machine translation system: Bridging the gap between human and\nmachine translation. 
CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.

[12] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. CoRR, abs/1705.03122, 2017. URL http://arxiv.org/abs/1705.03122.

[13] Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 272–307, Belgium, Brussels, October 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W18-6401.

[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.

[15] Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-5301. URL https://www.aclweb.org/anthology/W19-5301.

[16] Maha Elbayad, Laurent Besacier, and Jakob Verbeek. Pervasive attention: 2D convolutional neural networks for sequence-to-sequence prediction.
In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 97–107. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/K18-1010.

[17] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. CoRR, abs/1807.03819, 2018.

[18] Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M. Rush. Image-to-markup generation with coarse-to-fine attention. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 980–989, 2017. URL http://proceedings.mlr.press/v70/deng17a.html.

[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.

[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.

[21] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.

[22] Jiatao Gu, Qi Liu, and Kyunghyun Cho. Insertion-based decoding with automatically inferred generation order. 2019.

[23] Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. ASPEC: Asian scientific paper excerpt corpus. In LREC, 2016.

[24] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation.
In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1557769.1557821.

[25] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL (1). The Association for Computer Linguistics, 2016. ISBN 978-1-945626-00-5. URL http://dblp.uni-trier.de/db/conf/acl/acl2016-1.html#SennrichHB16a.

[26] S. S. Singh. Teaching machines to code: Neural markup generation with interpretable attention. 2018.

[27] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[28] Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/W18-6319.

[29] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://doi.org/10.3115/1073083.1073135.

[30] Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004. URL https://www.aclweb.org/anthology/W04-3250.

[31] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit.
In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-5010.

[32] Taro Watanabe and Eiichiro Sumita. Bidirectional decoding for statistical machine translation. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING '02, pages 1–7, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1072228.1072278. URL https://doi.org/10.3115/1072228.1072278.

[33] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1171–1179, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969239.2969370.

[34] Thomas Berg. Structure in language: A dynamic perspective. 2011.

[35] Qiao Qian, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. Assigning personality/profile to a chatting machine for coherent conversation generation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4279–4285. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/595. URL https://doi.org/10.24963/ijcai.2018/595.

[36] Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Deliberation networks: Sequence generation beyond one-pass decoding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1784–1794. Curran Associates, Inc., 2017.
URL http://papers.nips.cc/paper/6775-deliberation-networks-sequence-generation-beyond-one-pass-decoding.pdf.

[37] Xinwei Geng, Xiaocheng Feng, Bing Qin, and Ting Liu. Adaptive multi-pass decoder for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 523–532, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D18-1048.

[38] Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D18-1149.

[39] Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1l8BtlCb.

[40] Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho. Non-monotonic sequential text generation. 2019.

[41] Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. Insertion transformer: Flexible sequence generation via insertion operations. 2019.