{"title": "Plan, Attend, Generate: Planning for Sequence-to-Sequence Models", "book": "Advances in Neural Information Processing Systems", "page_first": 5474, "page_last": 5483, "abstract": "We investigate the integration of a planning mechanism into sequence-to-sequence models using attention. We develop a model which can plan ahead in the future when it computes its alignments between input and output sequences, constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by the recently proposed strategic attentive reader and writer (STRAW) model for Reinforcement Learning. Our proposed model is end-to-end trainable using primarily differentiable operations. We show that it outperforms a strong baseline on character-level translation tasks from WMT'15, the algorithmic task of finding Eulerian circuits of graphs, and question generation from the text. Our analysis demonstrates that the model computes qualitatively intuitive alignments, converges faster than the baselines, and achieves superior performance with fewer parameters.", "full_text": "Plan, Attend, Generate:\n\nPlanning for Sequence-to-Sequence Models\n\nFrancis Dutil\u2217\n\nUniversity of Montreal (MILA)\n\nfrdutil@gmail.com\n\nCaglar Gulcehre\u2217\n\nUniversity of Montreal (MILA)\n\nca9lar@gmail.com\n\nAdam Trischler\n\nMicrosoft Research Maluuba\n\nadam.trischler@microsoft.com\n\nYoshua Bengio\n\nUniversity of Montreal (MILA)\n\nyoshua.umontreal@gmail.com\n\nAbstract\n\nWe investigate the integration of a planning mechanism into sequence-to-sequence\nmodels using attention. We develop a model which can plan ahead in the future when\nit computes its alignments between input and output sequences, constructing a matrix\nof proposed future alignments and a commitment vector that governs whether to follow\nor recompute the plan. 
This mechanism is inspired by the recently proposed strategic\nattentive reader and writer (STRAW) model for Reinforcement Learning. Our proposed\nmodel is end-to-end trainable using primarily differentiable operations. We show that\nit outperforms a strong baseline on character-level translation tasks from WMT\u201915,\nthe algorithmic task of finding Eulerian circuits of graphs, and question generation\nfrom the text. Our analysis demonstrates that the model computes qualitatively intuitive\nalignments, converges faster than the baselines, and achieves superior performance\nwith fewer parameters.\n\n1 Introduction\n\nSeveral important tasks in the machine learning literature can be cast as sequence-to-sequence\nproblems (Cho et al., 2014b; Sutskever et al., 2014). Machine translation is a prime example of this: a\nsystem takes as input a sequence of words or characters in some source language, then generates an output\nsequence of words or characters in the target language \u2013 the translation.\nNeural encoder-decoder models (Cho et al., 2014b; Sutskever et al., 2014) have become a standard\napproach for sequence-to-sequence tasks such as machine translation and speech recognition. Such models\ngenerally encode the input sequence as a set of vector representations using a recurrent neural network\n(RNN). A second RNN then decodes the output sequence step-by-step, conditioned on the encodings.\nAn important augmentation to this architecture, first described by Bahdanau et al. (2015), is for models\nto compute a soft alignment between the encoder representations and the decoder state at each time-step,\nthrough an attention mechanism. The computed alignment conditions the decoder more directly on a\nrelevant subset of the input sequence. 
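In code, the soft-alignment step just described can be sketched as follows. This is a toy numpy illustration of a Bahdanau-style attention function, not the authors' implementation; the scoring MLP is collapsed to a single tanh layer, and all names and dimensions are our own:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states, W_s, W_h, v):
    """Score each encoder annotation against the decoder state with a
    small tanh network, normalize the scores into a soft alignment, and
    return the alignment together with the attention-weighted context."""
    # One logit per input position: v . tanh(W_s s + W_h h_i)
    scores = np.array([v @ np.tanh(W_s @ decoder_state + W_h @ h)
                       for h in encoder_states])
    alignment = softmax(scores)           # soft alignment over the input
    context = alignment @ encoder_states  # weighted sum of annotations
    return alignment, context

# Toy sizes: 5 input positions, encoder dim 4, decoder dim 3, scorer dim 6.
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 4))   # encoder annotations h_1..h_5
s = rng.standard_normal(3)        # current decoder state
W_s = rng.standard_normal((6, 3))
W_h = rng.standard_normal((6, 4))
v = rng.standard_normal(6)
alignment, context = attend(s, H, W_s, W_h, v)
```

The alignment is a probability distribution over input positions, and the context vector lives in the encoder's annotation space; the decoder is then conditioned on this context.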
Computationally, the attention mechanism is typically a simple learned function of the decoder\u2019s internal state, e.g., an MLP.\n\n\u2217 denotes that both authors (CG and FD) contributed equally; the order is determined randomly.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this work, we propose to augment the encoder-decoder model with attention by integrating a planning mechanism. Specifically, we develop a model that uses planning to improve the alignment between input and output sequences. It creates an explicit plan of input-output alignments to use at future time-steps, based on its current observation and a summary of its past actions, which it may follow or modify. This enables the model to plan ahead rather than attending only to what is relevant at the current generation step. Concretely, we augment the decoder\u2019s internal state with (i) an alignment plan matrix and (ii) a commitment plan vector. The alignment plan matrix is a template of alignments that the model intends to follow at future time-steps, i.e., a sequence of probability distributions over input tokens. The commitment plan vector governs whether to follow the alignment plan at the current step or to recompute it, and thus models discrete decisions. This is reminiscent of macro-actions and options from the hierarchical reinforcement learning literature (Dietterich, 2000). Our planning mechanism is inspired by the strategic attentive reader and writer (STRAW) of Vezhnevets et al. (2016), which was originally proposed as a hierarchical reinforcement learning algorithm. 
In reinforcement-learning parlance, existing sequence-to-sequence models with attention can be said\nto learn reactive policies; however, a model with a planning mechanism could learn more proactive policies.\nOur work is motivated by the intuition that, although many natural sequences are output step-by-step because\nof constraints on the output process, they are not necessarily conceived and ordered according to only local,\nstep-by-step interactions. Natural language in the form of speech and writing is again a prime example \u2013\nsentences are not conceived one word at a time. Planning, that is, choosing some goal along with candidate\nmacro-actions to arrive at it, is one way to induce coherence in sequential outputs like language. Learning\nto generate long coherent sequences, or how to form alignments over long input contexts, is difficult for\nexisting models. In the case of neural machine translation (NMT), the performance of encoder-decoder\nmodels with attention deteriorates as sequence length increases (Cho et al., 2014a; Sutskever et al., 2014).\nA planning mechanism could make the decoder\u2019s search for alignments more tractable and more scalable.\nIn this work, we perform planning over the input sequence by searching for alignments; our model does not\nform an explicit plan of the output tokens to generate. Nevertheless, we find this alignment-based planning\nto improve performance significantly in several tasks, including character-level NMT. Planning can also\nbe applied explicitly to generation in sequence-to-sequence tasks. For example, recent work by Bahdanau\net al. (2016) on actor-critic methods for sequence prediction can be seen as this kind of generative planning.\nWe evaluate our model and report results on character-level translation tasks from WMT\u201915 for English\nto German, English to Finnish, and English to Czech language pairs. 
On almost all pairs we observe\nimprovements over a baseline that represents the state-of-the-art in neural character-level translation. In\nour NMT experiments, our model outperforms the baseline despite using significantly fewer parameters\nand converges faster in training. We also show that our model performs better than strong baselines on\nthe algorithmic task of finding Eulerian circuits in random graphs and the task of natural-language question\ngeneration from a document and target answer.\n\n2 Related Works\n\nExisting sequence-to-sequence models with attention have focused on generating the target sequence by\naligning each generated output token to another token in the input sequence. This approach has proven\nsuccessful in neural machine translation (Bahdanau et al., 2016) and has recently been adapted to several\nother applications, including speech recognition (Chan et al., 2015) and image caption generation (Xu\net al., 2015). In general these models construct alignments using a simple MLP that conditions on the\ndecoder\u2019s internal state. In our work we integrate a planning mechanism into the alignment function.\nThere have been several earlier proposals for different alignment mechanisms: for instance, Yang et al.\n(2016) developed a hierarchical attention mechanism to perform document-level classification, while Luo\net al. (2016) proposed an algorithm for learning discrete alignments between two sequences using policy\ngradients (Williams, 1992).\nSilver et al. (2016) used a planning mechanism based on Monte Carlo tree search with neural networks\nto train reinforcement learning (RL) agents on the game of Go. Most similar to our work, Vezhnevets\net al. (2016) developed a neural planning mechanism, called the strategic attentive reader and writer\n(STRAW), that can learn high-level temporally abstracted macro-actions. 
STRAW uses an action plan matrix, which represents the sequences of actions the model plans to take, and a commitment plan vector, which determines whether to commit an action or recompute the plan. STRAW\u2019s action plan and commitment plan are stochastic and the model is trained with RL. Our model computes an alignment plan rather than an action plan, and both its alignment matrix and commitment vector are deterministic and end-to-end trainable with backpropagation.\n\nOur experiments focus on character-level neural machine translation because learning alignments for long sequences is difficult for existing models. This effect can be more pronounced in character-level NMT, since sequences of characters are longer than corresponding sequences of words. Furthermore, to learn a proper alignment between sequences a model often must learn to segment them correctly, a process suited to planning. Previously, Chung et al. (2016) and Lee et al. (2016) addressed the character-level machine translation problem with architectural modifications to the encoder and the decoder. Our model is the first we are aware of to tackle the problem through planning.\n\n3 Planning for Sequence-to-Sequence Learning\n\nWe now describe how to integrate a planning mechanism into a sequence-to-sequence architecture with attention (Bahdanau et al., 2015). Our model first creates a plan, then computes a soft alignment based on the plan, and generates at each time-step in the decoder. We refer to our model as PAG (Plan-Attend-Generate).\n\n3.1 Notation and Encoder\n\nAs input our model receives a sequence of tokens, X = (x0,\u00b7\u00b7\u00b7,x|X|), where |X| denotes the length of X. It processes these with the encoder, a bidirectional RNN. 
At each input position i we obtain the annotation vector hi by concatenating the forward and backward encoder states, hi = [h\u2192i; h\u2190i], where h\u2192i denotes the hidden state of the encoder\u2019s forward RNN and h\u2190i denotes the hidden state of the encoder\u2019s backward RNN.\nThrough the decoder the model predicts a sequence of output tokens, Y = (y1,\u00b7\u00b7\u00b7,y|Y|). We denote by st the hidden state of the decoder RNN generating the target output token at time-step t.\n\n3.2 Alignment and Decoder\n\nOur goal is a mechanism that plans which parts of the input sequence to focus on for the next k time-steps of decoding. For this purpose, our model computes an alignment plan matrix At \u2208 Rk\u00d7|X| and a commitment plan vector ct \u2208 Rk at each time-step. Matrix At stores the alignments for the current and the next k\u22121 time-steps; it is conditioned on the current input, i.e. the token predicted at the previous time-step, yt, and the current context \u03c8t, which is computed from the input annotations hi. Each row of At gives the logits for a probability vector over the input annotation vectors. The first row gives the logits for the current time-step, t, the second row for the next time-step, t+1, and so on. The recurrent decoder function, fdec-rnn(\u00b7), receives st\u22121, yt, \u03c8t as inputs and computes the hidden state vector\n\nst = fdec-rnn(st\u22121, yt, \u03c8t). (1)\n\nContext \u03c8t is obtained by a weighted sum of the encoder annotations,\n\n\u03c8t = \u2211i \u03b1ti hi, (2)\n\nwhere the soft-alignment vector \u03b1t = softmax(At[0]) \u2208 R|X| is a function of the first row of the alignment matrix. At each time-step, we compute a candidate alignment-plan matrix \u00afAt whose ith row is\n\n\u00afAt[i] = falign(st\u22121, hj, \u03b2it, yt), (3)\n\nwhere falign(\u00b7) is an MLP and \u03b2it denotes a summary of the alignment matrix\u2019s ith row at time t\u22121. The summary is computed using an MLP, fr(\u00b7), operating row-wise on At\u22121: \u03b2it = fr(At\u22121[i]).\nThe commitment plan vector ct governs whether to follow the existing alignment plan, by shifting it forward from t\u22121, or to recompute it. Thus, ct represents a discrete decision. For the model to operate discretely, we use the recently proposed Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016) in conjunction with the straight-through estimator (Bengio et al., 2013) to backpropagate through ct.1 The model further learns the temperature for the Gumbel-Softmax as proposed in Gulcehre et al. (2017). Both the commitment vector and the alignment plan matrix are initialized with ones; this initialization is not modified through training.\n\n1 We also experimented with training ct using REINFORCE (Williams, 1992) but found that Gumbel-Softmax led to better performance.\n\nFigure 1: Our planning mechanism in a sequence-to-sequence model that learns to plan and execute alignments. Distinct from a standard sequence-to-sequence model with attention, rather than using a simple MLP to predict alignments our model makes a plan of future alignments using its alignment-plan matrix and decides when to follow the plan by learning a separate commitment vector. We illustrate the model for a decoder with two layers: s\u2032t denotes the first layer and st the second layer of the decoder. The planning mechanism is conditioned on the first layer of the decoder (s\u2032t).\n\nAlignment-plan update Our decoder updates its alignment plan as governed by the commitment plan. We denote by gt the first element of the discretized commitment plan \u00afct. In more detail, gt = \u00afct[0], where the discretized commitment plan is obtained by setting ct\u2019s largest element to 1 and all other elements to 0. Thus, gt is a binary indicator variable; we refer to it as the commitment switch. 
When gt = 0, the decoder simply advances the time index by shifting the alignment plan matrix At\u22121 forward via the shift function \u03c1(\u00b7). When gt = 1, the controller reads the alignment-plan matrix to produce the summary of the plan, \u03b2it. We then compute the updated alignment plan by interpolating the previous alignment plan matrix At\u22121 with the candidate alignment plan matrix \u00afAt. The mixing ratio is determined by a learned update gate ut \u2208 Rk\u00d7|X|, whose elements uti correspond to tokens in the input sequence and are computed by an MLP with sigmoid activation, fup(\u00b7):\n\nuti = fup(hi, st\u22121),\nAt[:,i] = (1\u2212uti)\u2299At\u22121[:,i] + uti\u2299\u00afAt[:,i].\n\nTo reiterate, the model only updates its alignment plan when the current commitment switch gt is active. Otherwise it uses the alignments planned and committed at previous time-steps.\n\nCommitment-plan update The commitment plan also updates when gt becomes 1. If gt is 0, the shift function \u03c1(\u00b7) shifts the commitment vector forward and appends a 0-element. If gt is 1, the model recomputes ct using a single-layer MLP, fc(\u00b7), followed by a Gumbel-Softmax, and \u00afct is recomputed by discretizing ct as a one-hot vector:\n\nct = gumbel_softmax(fc(st\u22121)), (4)\n\u00afct = one_hot(ct). (5)\n\nWe provide pseudocode for the algorithm to compute the commitment plan vector and the alignment plan matrix in Algorithm 1. An overview of the model is depicted in Figure 1.\n\nAlgorithm 1: Pseudocode for updating the alignment plan and commitment vector.\n\nfor t \u2208 {1,\u00b7\u00b7\u00b7,|Y|} do\n  if gt = 1 then\n    ct = softmax(fc(st\u22121))\n    for j \u2208 {1,\u00b7\u00b7\u00b7,|X|} do\n      \u03b2jt = fr(At\u22121[j]) {Read alignment plan}\n      \u00afAt[j] = falign(st\u22121, hj, \u03b2jt, yt) {Compute candidate alignment plan}\n      utj = fup(hj, st\u22121, \u03c8t\u22121) {Compute update gate}\n      At = (1 \u2212 utj)\u2299At\u22121 + utj\u2299\u00afAt {Update alignment plan}\n    end for\n  else\n    At = \u03c1(At\u22121) {Shift alignment plan}\n    ct = \u03c1(ct\u22121) {Shift commitment plan}\n  end if\n  Compute the alignment as \u03b1t = softmax(At[0])\nend for\n\n3.2.1 Alignment Repeat\n\nIn order to reduce the model\u2019s computational cost, we also propose an alternative to computing the candidate alignment-plan matrix at every step. Specifically, we propose a model variant that reuses the alignment vector from the previous time-step until the commitment switch activates, at which time the model computes a new alignment vector. We call this variant repeat, plan, attend, and generate (rPAG). rPAG can be viewed as learning an explicit segmentation with an implicit planning mechanism in an unsupervised fashion. Repetition can drastically reduce the computational complexity of the alignment mechanism; it also eliminates the need for an explicit alignment-plan matrix, which reduces the model\u2019s memory consumption as well. 
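The repeat behavior of rPAG can be sketched as a small loop. This is a toy numpy illustration in which random alignment logits stand in for the output of falign; all names are ours:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rpag_step(g, alpha_prev, logits_new):
    """rPAG alignment update: recompute the alignment only when the
    commitment switch g fires; otherwise reuse the committed one."""
    if g == 1:
        return softmax(logits_new)  # fresh alignment from the aligner
    return alpha_prev               # reuse the previous alignment

rng = np.random.default_rng(3)
alpha = softmax(rng.standard_normal(6))  # initial alignment over 6 tokens
switches = [1, 0, 0, 1, 0]               # toy commitment-switch sequence
history = []
for g in switches:
    alpha = rpag_step(g, alpha, rng.standard_normal(6))
    history.append(alpha)
```

Between two activations of the switch, the same alignment vector is reused, which is what makes the variant cheap: no plan matrix is stored and the aligner runs only at segment boundaries.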
We provide pseudocode for rPAG in Algorithm 2.\n\nAlgorithm 2: Pseudocode for updating the repeat alignment and commitment vector.\n\nfor t \u2208 {1,\u00b7\u00b7\u00b7,|Y|} do\n  if gt = 1 then\n    ct = softmax(fc(st\u22121, \u03c8t\u22121))\n    \u03b1t = softmax(falign(st\u22121, hj, yt)) {Recompute the alignment over all input positions j}\n  else\n    ct = \u03c1(ct\u22121) {Shift the commitment vector ct\u22121}\n    \u03b1t = \u03b1t\u22121 {Reuse the old alignment}\n  end if\nend for\n\n3.3 Training\n\nWe use a deep output layer (Pascanu et al., 2013a) to compute the conditional distribution over output tokens,\n\np(yt|y<t,x) \u221d y\u22a4t exp(Wo fo(st, yt\u22121, \u03c8t)), (6)\n\nwhere Wo is a matrix of learned parameters and we have omitted the bias for brevity. Function fo is an MLP with tanh activation.\nThe full model, including both the encoder and decoder, is jointly trained to minimize the (conditional) negative log-likelihood\n\nL = \u2212(1/N) \u2211n=1..N log p\u03b8(y(n)|x(n)),\n\nwhere the training corpus is a set of (x(n),y(n)) pairs and \u03b8 denotes the set of all tunable parameters.\nAs noted by Vezhnevets et al. (2016), the proposed model can learn to recompute very often, which decreases the utility of planning. To prevent this behavior, we introduce a loss that penalizes the model for committing too often,\n\nLcom = \u03bbcom \u2211t=1..|X| \u2211i=0..k || 1/k \u2212 cti ||\u00b2\u2082, (7)\n\nwhere \u03bbcom is the commitment hyperparameter and k is the timescale over which plans operate.\n\nFigure 2: We visualize the alignments learned by PAG in (a), rPAG in (b), and our baseline model with a 2-layer GRU decoder using h2 for the attention in (c). As depicted, the alignments learned by PAG and rPAG are smoother than those of the baseline. 
The baseline tends to put too much attention on the\nlast token of the sequence, defaulting to this empty location in alternation with more relevant locations.\nOur model, however, places higher weight on the last token usually when no other good alignments exist.\nWe observe that rPAG tends to generate less monotonic alignments in general.\n\n4 Experiments\n\nOur baseline is the encoder-decoder architecture with attention described in Chung et al. (2016),\nwherein the MLP that constructs alignments conditions on the second-layer hidden states, h2, in the\ntwo-layer decoder. The integration of our planning mechanism is analogous across the family of attentive\nencoder-decoder models, thus our approach can be applied more generally. In all experiments below,\nwe use the same architecture for our baseline and the (r)PAG models. The only factor of variation is\nthe planning mechanism. For training all models we use the Adam optimizer with initial learning rate\nset to 0.0002. We clip gradients with a threshold of 5 (Pascanu et al., 2013b) and set the number of\nplanning steps (k) to 10 throughout. In order to backpropagate through the alignment-plan matrices and\nthe commitment vectors, the model must maintain these in memory, increasing the computational overhead\nof the PAG model. However, rPAG does not suffer from these computational issues.\n\n4.1 Algorithmic Task\n\nWe first compared our models on the algorithmic task from Li et al. (2015) of finding the \u201cEulerian\nCircuits\u201d in a random graph. The original work used random graphs with 4 nodes only, but we found\nthat both our baseline and the PAG model solve this task very easily. We therefore increased the number\nof nodes to 7. We tested the baseline described above with hidden-state dimension of 360, and the same\nmodel augmented with our planning mechanism. 
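For intuition about what the models must output on this task, here is a minimal sketch that computes an Eulerian circuit directly with Hierholzer's algorithm. This is our illustration of the target structure, not the paper's data-generation code (which follows Li et al. (2015)):

```python
from collections import defaultdict

def eulerian_circuit(edges):
    """Hierholzer's algorithm: repeatedly follow unused edges, splicing
    sub-tours into the main tour, until every edge is used exactly once."""
    adj = defaultdict(list)
    for u, v in edges:          # undirected graph: record both directions
        adj[u].append(v)
        adj[v].append(u)
    start = edges[0][0]
    stack, tour = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:
            v = adj[u].pop()
            adj[v].remove(u)    # consume the edge from the other side too
            stack.append(v)
        else:
            tour.append(stack.pop())
    return tour[::-1]

# A 4-cycle plus a triangle sharing node 0: every vertex has even degree,
# so an Eulerian circuit exists.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4), (4, 5), (5, 0)]
circuit = eulerian_circuit(edges)
```

The circuit visits every edge exactly once and returns to its starting node, which is exactly the node sequence a sequence-to-sequence model must emit for this task.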
The PAG model solves the Eulerian Circuits problem with 100% absolute accuracy on the test set, indicating that for all test-set graphs, all nodes of the circuit were predicted correctly. The baseline encoder-decoder architecture with attention performs well but significantly worse, achieving 90.4% accuracy on the test set.\n\n4.2 Question Generation\n\nSQUAD (Rajpurkar et al., 2016) is a question answering (QA) corpus wherein each sample is a (document, question, answer) triple. The document and the question are given in words and the answer is a span of word positions in the document. We evaluate our planning models on the recently proposed question-generation task (Yuan et al., 2017), where the goal is to generate a question conditioned on a document and an answer. We add the planning mechanism to the encoder-decoder architecture proposed by Yuan et al. (2017). Both the document and the answer are encoded via recurrent neural networks, and the model learns to align the question output with the document during decoding. The pointer-softmax mechanism (Gulcehre et al., 2016) is used to generate question words from either a shortlist vocabulary or by copying from the document. Pointer-softmax uses the alignments to predict the location of the word to copy; thus, the planning mechanism has a direct influence on the decoder\u2019s predictions.\nWe used 2000 examples from SQUAD\u2019s training set for validation and used the official development set as a test set to evaluate our models. We trained a model with 800 units for all GRU hidden states and 600 units for the word embeddings. 
On the test set the baseline achieved 66.25 NLL while PAG got 65.45 NLL.\nWe show the validation-set learning curves of both models in Figure 3.\n\nFigure 3: Learning curves for question-generation models on our development set. Both models have\nthe same capacity and are trained with the same hyperparameters. PAG converges faster than the baseline\nwith better stability.\n\n4.3 Character-level Neural Machine Translation\n\nCharacter-level neural machine translation (NMT) is an attractive research problem (Lee et al., 2016;\nChung et al., 2016; Luong and Manning, 2016) because it addresses important issues encountered in\nword-level NMT. Word-level NMT systems can suffer from problems with rare words (Gulcehre et al.,\n2016) or data sparsity, and the existence of compound words without explicit segmentation in some\nlanguage pairs can make learning alignments between different languages and translations more difficult.\nCharacter-level neural machine translation mitigates these issues.\nIn our NMT experiments we use byte pair encoding (BPE) (Sennrich et al., 2015) for the source sequence\nand characters at the target, the same setup described in Chung et al. (2016). We also use the same\npreprocessing as in that work.2 We present our experimental results in Table 1. Models were tested on\nthe WMT\u201915 tasks for English to German (En\u2192De), English to Czech (En\u2192Cs), and English to Finnish\n(En\u2192Fi) language pairs. The table shows that our planning mechanism improves translation performance\nover our baseline (which reproduces the results reported in (Chung et al., 2016) to within a small margin).\nIt does this with fewer updates and fewer parameters. We trained (r)PAG for 350K updates on the training\nset, while the baseline was trained for 680K updates. We used 600 units in (r)PAG\u2019s encoder and decoder,\nwhile the baseline used 512 in the encoder and 1024 units in the decoder. In total our model has about\n4M fewer parameters than the baseline. 
We tested all models with a beam size of 15.\nAs can be seen from Table 1, layer normalization (Ba et al., 2016) improves the performance of PAG significantly. However, according to our results on En\u2192De, layer norm affects the performance of rPAG only marginally. Thus, we decided not to train rPAG with layer norm on other language pairs.\nIn Figure 2, we show qualitatively that our model constructs smoother alignments. At each word that the baseline decoder generates, it aligns the first few characters to a word in the source sequence, but for the remaining characters places the largest alignment weight on the last, empty token of the source sequence. This is because the baseline becomes confident of which word to generate after the first few characters, and it generates the remainder of the word mainly by relying on language-model predictions. We observe that (r)PAG converges faster with the help of the improved alignments, as illustrated by the learning curves in Figure 4.\n\n2 Our implementation is based on the code available at https://github.com/nyu-dl/dl4mt-cdec\n\nEn\u2192De:\nModel       Layer Norm   Dev     Test 2014   Test 2015\nBaseline    \u2717            21.57   23.45       21.33\nBaseline\u2020   \u2717            21.4    22.1        21.16\nBaseline\u2020   \u2713            21.65   22.55       21.69\nPAG         \u2717            21.92   22.42       21.93\nPAG         \u2713            22.44   23.18       22.59\nrPAG        \u2717            21.98   22.85       22.17\nrPAG        \u2713            22.33   22.83       22.35\n\nEn\u2192Cs:\nModel       Layer Norm   Dev     Test 2014   Test 2015\nBaseline    \u2717            17.68   16.98       19.27\nBaseline\u2020   \u2713            19.1    18.79       21.35\nPAG         \u2717            18.9    18.88       20.6\nPAG         \u2713            19.44   19.48       21.64\nrPAG        \u2717            18.66   19.14       21.18\n\nEn\u2192Fi:\nModel       Layer Norm   Dev     Test 2014   Test 2015\nBaseline    \u2717            11.19   10.93       -\nBaseline\u2020   \u2713            11.26   10.71       -\nPAG         \u2717            12.09   11.08       -\nPAG         \u2713            12.85   12.15       -\nrPAG        \u2717            11.76   11.02       -\n\nTable 1: The results of different models on the WMT\u201915 tasks for the English to German, English to Czech, and English to Finnish language pairs. We report BLEU scores of each model computed via the multi-bleu.perl script. The best score of each model for each language pair appears in bold face. We use newstest2013 as our development set, newstest2014 as our \"Test 2014\" set, and newstest2015 as our \"Test 2015\" set. (\u2020) denotes the results of the baseline that we trained using the hyperparameters reported in Chung et al. (2016) and the code provided with that paper. For our baseline, we only report the median result, and we do not have multiple runs of our models. On WMT\u201914 and WMT\u201915 for En\u2192De character-level NMT, Kalchbrenner et al. (2016) have reported better results with deeper auto-regressive convolutional models (ByteNets), 23.75 and 26.26 respectively.\n\nFigure 4: Learning curves for different models on WMT\u201915 for En\u2192De. Models with the planning mechanism converge faster than our baseline (which has larger capacity).\n\n5 Conclusion\n\nIn this work we addressed a fundamental issue in neural generation of long sequences by integrating planning into the alignment mechanism of sequence-to-sequence architectures. We proposed two different planning mechanisms: PAG, which constructs explicit plans in the form of stored matrices, and rPAG, which plans implicitly and is computationally cheaper. The (r)PAG approach empirically improves alignments over long input sequences. We demonstrated our models\u2019 capabilities through results on character-level machine translation, an algorithmic task, and question generation. In machine translation, models with planning outperform a state-of-the-art baseline on almost all language pairs using fewer parameters. We also showed that our model outperforms baselines with the same architecture (minus planning) on question-generation and algorithmic tasks. 
The introduction of planning improves training convergence, and the alignment-repeat variant can potentially improve decoding speed as well.\n\nReferences\n\nJimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.\n\nDzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.\n\nDzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR).\n\nYoshua Bengio, Nicholas L\u00e9onard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.\n\nWilliam Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. 2015. Listen, attend and spell. arXiv preprint arXiv:1508.01211.\n\nKyunghyun Cho, Bart Van Merri\u00ebnboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.\n\nKyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.\n\nJunyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147.\n\nThomas G Dietterich. 2000. Hierarchical reinforcement learning.\n\nCaglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. arXiv preprint arXiv:1603.08148.\n\nCaglar Gulcehre, Sarath Chandar, and Yoshua Bengio. 2017. Memory augmented neural networks with wormhole connections. arXiv preprint arXiv:1701.08718.\n\nEric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.\n\nNal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.\n\nJason Lee, Kyunghyun Cho, and Thomas Hofmann. 2016. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017.\n\nYujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.\n\nYuping Luo, Chung-Cheng Chiu, Navdeep Jaitly, and Ilya Sutskever. 2016. Learning online alignments with continuous rewards policy gradient. arXiv preprint arXiv:1608.01281.\n\nMinh-Thang Luong and Christopher D Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv preprint arXiv:1604.00788.\n\nChris J Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.\n\nRazvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013a. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.\n\nRazvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013b. On the difficulty of training recurrent neural networks. ICML (3) 28:1310\u20131318.\n\nPranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.\n\nRico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. 
arXiv preprint arXiv:1508.07909.\n\nDavid Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484\u2013489.\n\nIlya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. pages 3104\u20133112.\n\nAlexander Vezhnevets, Volodymyr Mnih, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, and Koray Kavukcuoglu. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems. pages 3486\u20133494.\n\nRonald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229\u2013256.\n\nKelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. pages 2048\u20132057.\n\nZichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT. pages 1480\u20131489.\n\nXingdi Yuan, Tong Wang, Caglar Gulcehre, Alessandro Sordoni, Philip Bachman, Sandeep Subramanian, Saizheng Zhang, and Adam Trischler. 2017. Machine comprehension by text-to-text neural question generation. arXiv preprint arXiv:1705.02012.\n", "award": [], "sourceid": 2824, "authors": [{"given_name": "Caglar", "family_name": "Gulcehre", "institution": "Deepmind"}, {"given_name": "Francis", "family_name": "Dutil", "institution": "MILA"}, {"given_name": "Adam", "family_name": "Trischler", "institution": "Microsoft"}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": "U. Montreal"}]}