{"title": "Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1171, "page_last": 1179, "abstract": "Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning. The current approach to training them consists of maximizing the likelihood of each token in the sequence given the current (recurrent) state and the previous token. At inference, the unknown previous token is then replaced by a token generated by the model itself. This discrepancy between training and inference can yield errors that can accumulate quickly along the generated sequence. We propose a curriculum learning strategy to gently change the training process from a fully guided scheme using the true previous token, towards a less guided scheme which mostly uses the generated token instead. Experiments on several sequence prediction tasks show that this approach yields significant improvements. Moreover, it was used successfully in our winning bid to the MSCOCO image captioning challenge, 2015.", "full_text": "Scheduled Sampling for Sequence Prediction with\n\nRecurrent Neural Networks\n\nSamy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer\n\nGoogle Research\n\n{bengio,vinyals,ndjaitly,noam}@google.com\n\nMountain View, CA, USA\n\nAbstract\n\nRecurrent Neural Networks can be trained to produce sequences of tokens given\nsome input, as exempli\ufb01ed by recent results in machine translation and image\ncaptioning. The current approach to training them consists of maximizing the\nlikelihood of each token in the sequence given the current (recurrent) state and the\nprevious token. At inference, the unknown previous token is then replaced by a\ntoken generated by the model itself. 
This discrepancy between training and inference\ncan yield errors that can accumulate quickly along the generated sequence.\nWe propose a curriculum learning strategy to gently change the training process\nfrom a fully guided scheme using the true previous token, towards a less guided\nscheme which mostly uses the generated token instead. Experiments on several sequence\nprediction tasks show that this approach yields significant improvements.\nMoreover, it was used successfully in our winning entry to the MSCOCO image\ncaptioning challenge, 2015.\n\n1 Introduction\n\nRecurrent neural networks can be used to process sequences, either as input, output or both. While\nthey are known to be hard to train when there are long term dependencies in the data [1], some\nversions like the Long Short-Term Memory (LSTM) [2] are better suited for this. In fact, they have\nrecently shown impressive performance in several sequence prediction problems including machine\ntranslation [3], contextual parsing [4], image captioning [5] and even video description [6].\nIn this paper, we consider the set of problems that attempt to generate a sequence of tokens of\nvariable size, such as the problem of machine translation, where the goal is to translate a given\nsentence from a source language to a target language. We also consider problems in which the input\nis not necessarily a sequence, like the image captioning problem, where the goal is to generate a\ntextual description of a given image.\nIn both cases, recurrent neural networks (or their variants like LSTMs) are generally trained to\nmaximize the likelihood of generating the target sequence of tokens given the input. In practice, this\nis done by maximizing the likelihood of each target token given the current state of the model (which\nsummarizes the input and the past output tokens) and the previous target token, which helps the\nmodel learn a kind of language model over target tokens. 
However, during inference, true previous\ntarget tokens are unavailable, and are thus replaced by tokens generated by the model itself, yielding\na discrepancy between how the model is used at training and inference. This discrepancy can be\nmitigated by the use of a beam search heuristic maintaining several generated target sequences, but\nfor continuous state space models like recurrent neural networks, there is no dynamic programming\napproach, so the effective number of sequences considered remains small, even with beam search.\nThe main problem is that mistakes made early in the sequence generation process are fed as input\nto the model and can be quickly amplified because the model might be in a part of the state space it\nhas never seen at training time.\nHere, we propose a curriculum learning approach [7] to gently bridge the gap between training and\ninference for sequence prediction tasks using recurrent neural networks. We propose to change the\ntraining process in order to gradually force the model to deal with its own mistakes, as it would\nhave to during inference. Doing so, the model explores more during training and is thus better able\nto correct its own mistakes at inference as it has learned to do so during training. We will show\nexperimentally that this approach yields better performance on several sequence prediction tasks.\nThe paper is organized as follows: in Section 2, we present our proposed approach to better train\nsequence prediction tasks with recurrent neural networks; this is followed by Section 3 which draws\nlinks to some related approaches. 
We then present some experimental results in Section 4 and\nconclude in Section 5.\n\n2 Proposed Approach\n\nWe are considering supervised tasks where the training set is given in terms of N input/output pairs\n{X^i, Y^i}_{i=1}^N, where X^i is the input and can be either static (like an image) or dynamic (like a\nsequence) while the target output Y^i is a sequence y^i_1, y^i_2, . . . , y^i_{T_i} of a variable number of tokens\nthat belong to a fixed known dictionary.\n\n2.1 Model\n\nGiven a single input/output pair (X, Y), the log probability log P(Y|X) can be computed as:\n\n\\log P(Y \\mid X) = \\log P(y_1^T \\mid X) = \\sum_{t=1}^{T} \\log P(y_t \\mid y_1^{t-1}, X)    (1)\n\nwhere Y is a sequence of length T represented by tokens y_1, y_2, . . . , y_T. The latter term in the above\nequation is estimated by a recurrent neural network with parameters \u03b8 by introducing a state vector,\nh_t, that is a function of the previous state, h_{t-1}, and the previous output token, y_{t-1}, i.e.\n\n\\log P(y_t \\mid y_1^{t-1}, X; \\theta) = \\log P(y_t \\mid h_t; \\theta)    (2)\n\nwhere h_t is computed by a recurrent neural network as follows:\n\nh_t = \\begin{cases} f(X; \\theta) & \\text{if } t = 1, \\\\ f(h_{t-1}, y_{t-1}; \\theta) & \\text{otherwise.} \\end{cases}    (3)\n\nP(y_t|h_t; \u03b8) is often implemented as a linear projection\u00b9 of the state vector h_t into a vector of scores,\none for each token of the output dictionary, followed by a softmax transformation to ensure the\nscores are properly normalized (positive and sum to 1). f(h, y) is usually a non-linear function that\ncombines the previous state and the previous output in order to produce the current state.\nThis means that the model focuses on learning to output the next token given the current state\nof the model AND the previous token. 
Thus, the model represents the probability distribution of\nsequences in the most general form - unlike Conditional Random Fields [8] and other models that\nassume independence between outputs at different time steps, given latent variable states.\nThe capacity of the model is only limited by the representational capacity of the recurrent and\nfeedforward layers. LSTMs, with their ability to learn long range structure, are especially well suited\nto this task and make it possible to learn rich distributions over sequences.\nIn order to learn variable length sequences, a special token, <EOS>, that signifies the end of a\nsequence is added to the dictionary and the model. During training, <EOS> is concatenated to the\nend of each sequence. During inference, the model generates tokens until it generates <EOS>.\n\n\u00b9 Although one could also use a multi-layered non-linear projection.\n\n2.2 Training\n\nTraining recurrent neural networks to solve such tasks is usually accomplished by using mini-batch\nstochastic gradient descent to look for a set of parameters \u03b8* that maximizes the log likelihood of\nproducing the correct target sequence Y^i given the input data X^i for all training pairs (X^i, Y^i):\n\n\\theta^\\star = \\arg\\max_{\\theta} \\sum_{(X^i, Y^i)} \\log P(Y^i \\mid X^i; \\theta)    (4)\n\n2.3 Inference\n\nDuring inference the model can generate the full sequence y_1^T given X by generating one token at a\ntime, and advancing time by one step. When an <EOS> token is generated, it signifies the end of\nthe sequence. For this process, at time t, the model needs as input the output token y_{t-1} from the\nlast time step in order to produce y_t. Since we do not have access to the true previous token, we can\ninstead either select the most likely one given our model, or sample according to it.\nSearching for the sequence Y with the highest probability given X is too expensive because of the\ncombinatorial growth in the number of sequences. 
Instead we use a beam searching procedure to\ngenerate k \u201cbest\u201d sequences. We do this by maintaining a heap of m best candidate sequences. At\neach time step new candidates are generated by extending each candidate by one token and adding\nthem to the heap. At the end of the step, the heap is re-pruned to only keep m candidates. The beam\nsearching is truncated when no new sequences are added, and k best sequences are returned.\nWhile beam search is often used for discrete state based models like Hidden Markov Models where\ndynamic programming can be used, it is harder to use ef\ufb01ciently for continuous state based models\nlike recurrent neural networks, since there is no way to factor the followed state paths in a continuous\nspace, and hence the actual number of candidates that can be kept during beam search decoding is\nvery small.\nIn all these cases, if a wrong decision is taken at time t \u2212 1, the model can be in a part of the\nstate space that is very different from those visited from the training distribution and for which it\ndoesn\u2019t know what to do. Worse, it can easily lead to cumulative bad decisions - a classic problem in\nsequential Gibbs sampling type approaches to sampling, where future samples can have no in\ufb02uence\non the past.\n\n2.4 Bridging the Gap with Scheduled Sampling\n\nThe main difference between training and inference for sequence prediction tasks when predicting\ntoken yt is whether we use the true previous token yt\u22121 or an estimate \u02c6yt\u22121 coming from the model\nitself.\nWe propose here a sampling mechanism that will randomly decide, during training, whether we use\nyt\u22121 or \u02c6yt\u22121. 
Assuming we use a mini-batch based stochastic gradient descent approach, for every\ntoken to predict y_t \u2208 Y of the i-th mini-batch of the training algorithm, we propose to flip a coin\nand use the true previous token with probability \u03f5_i, or an estimate coming from the model itself with\nprobability (1 \u2212 \u03f5_i).\u00b2 The estimate of the model can be obtained by sampling a token according to\nthe probability distribution modeled by P(y_{t-1}|h_{t-1}), or can be taken as the arg max_s P(y_{t-1} =\ns|h_{t-1}). This process is illustrated in Figure 1.\nWhen \u03f5_i = 1, the model is trained exactly as before, while when \u03f5_i = 0 the model is trained in\nthe same setting as inference. We propose here a curriculum learning strategy to go from one to\nthe other: intuitively, at the beginning of training, sampling from the model would yield a random\ntoken since the model is not well trained, which could lead to very slow convergence, so selecting\nmore often the true previous token should help; on the other hand, at the end of training, \u03f5_i should\nfavor sampling from the model more often, as this corresponds to the true inference situation, and\none expects the model to already be good enough to handle it and sample reasonable tokens.\n\n\u00b2 Note that in the experiments, we flipped the coin for every token. We also tried to flip the coin once per\nsequence, but the results were much worse, most probably because consecutive errors are amplified during the\nfirst rounds of training.\n\nFigure 1: Illustration of the Scheduled Sampling approach,\nwhere one flips a coin at every time step to decide to use the\ntrue previous token or one sampled from the model itself.\n\nFigure 2: Examples of decay schedules.\n\nWe thus propose to use a schedule to decrease \u03f5_i as a function of i itself, in a manner similar to that used\nto decrease the learning rate in most modern stochastic gradient descent approaches. 
Examples of\nsuch schedules, shown in Figure 2, are as follows:\n\n\u2022 Linear decay: \u03f5_i = max(\u03f5, k \u2212 ci) where 0 \u2264 \u03f5 < 1 is the minimum amount of truth to be\ngiven to the model and k and c provide the offset and slope of the decay, which depend on\nthe expected speed of convergence.\n\u2022 Exponential decay: \u03f5_i = k^i where k < 1 is a constant that depends on the expected speed\nof convergence.\n\u2022 Inverse sigmoid decay: \u03f5_i = k/(k + exp(i/k)) where k \u2265 1 depends on the expected speed\nof convergence.\n\nWe call our approach Scheduled Sampling. Note that when we sample the previous token \u0177_{t-1} from\nthe model itself while training, we could back-propagate the gradient of the losses at times t \u2192 T\nthrough that decision. This was not done in the experiments described in this paper and is left for\nfuture work.\n\n3 Related Work\n\nThe discrepancy between the training and inference distributions has already been noticed in the\nliterature, in particular for control and reinforcement learning tasks.\nSEARN [9] was proposed to tackle problems where supervised training examples might be different\nfrom actual test examples when each example is made of a sequence of decisions, like acting in a\ncomplex environment where a few mistakes of the model early in the sequential decision process\nmight compound and yield a very poor global performance. Their proposed approach involves a\nmeta-algorithm where at each meta-iteration one trains a new model according to the current policy\n(essentially the expected decisions for each situation), applies it on a test set and modifies the next\niteration policy in order to account for the previous decisions and errors. 
The new policy is thus a\ncombination of the previous one and the actual behavior of the model.\nIn comparison to SEARN and related ideas [10, 11], our proposed approach is completely online: a\nsingle model is trained and the policy slowly evolves during training, instead of a batch approach,\nwhich makes it much faster to train.\u00b3 Furthermore, SEARN has been proposed in the context of\nreinforcement learning, while we consider the supervised learning setting trained using stochastic\ngradient descent on the overall objective.\nOther approaches have considered the problem from a ranking perspective, in particular for parsing\ntasks [12] where the target output is a tree. In this case, the authors proposed to use a beam search\nboth during training and inference, so that both phases are aligned. The training beam is used to find\nthe best current estimate of the model, which is compared to the guided solution (the truth) using a\nranking loss. Unfortunately, this is not feasible when using a model like a recurrent neural network\n(which is now the state-of-the-art technique in many sequential tasks), as the state sequence cannot\nbe factored easily (because it is a multi-dimensional continuous state) and thus beam search is hard\nto use efficiently at training time (as well as inference time, in fact).\n\n\u00b3 In fact, in the experiments we report in this paper, our proposed approach was not meaningfully slower\n(nor faster) to train than the baseline.\n\n[Figure 2 plot: exponential decay, inverse sigmoid decay and linear decay curves; vertical-axis ticks from 0 to 1, horizontal-axis ticks from 0 to 1000.]\n\nFinally, [13] proposed an online algorithm for parsing problems that adapts the targets through the\nuse of a dynamic oracle that takes into account the decisions of the model. 
The trained model\nis a perceptron and is thus not state-based like a recurrent neural network, and the probability of\nchoosing the truth is \ufb01xed during training.\n\n4 Experiments\n\nWe describe in this section experiments on three different tasks, in order to show that scheduled\nsampling can be helpful in different settings. We report results on image captioning, constituency\nparsing and speech recognition.\n\n4.1\n\nImage Captioning\n\nImage captioning has attracted a lot of attention in the past year. The task can be formulated as a\nmapping of an image onto a sequence of words describing its content in some natural language, and\nmost proposed approaches employ some form of recurrent network structure with simple decoding\nschemes [5, 6, 14, 15, 16]. A notable exception is the system proposed in [17], which does not\ndirectly optimize the log likelihood of the caption given the image, and instead proposes a pipelined\napproach.\nSince an image can have many valid captions, the evaluation of this task is still an open prob-\nlem. Some attempts have been made to design metrics that positively correlate with human evalua-\ntion [18], and a common set of tools have been published by the MSCOCO team [19].\nWe used the MSCOCO dataset from [19] to train our model. We trained on 75k images and report\nresults on a separate development set of 5k additional images. Each image in the corpus has 5 dif-\nferent captions, so the training procedure picks one at random, creates a mini-batch of examples,\nand optimizes the objective function de\ufb01ned in (4). The image is preprocessed by a pretrained con-\nvolutional neural network (without the last classi\ufb01cation layer) similar to the one described in [20],\nand the resulting image embedding is treated as if it was the \ufb01rst word from which the model starts\ngenerating language. 
The recurrent neural network generating words is an LSTM with one layer\nof 512 hidden units, and the input words are represented by embedding vectors of size 512. The\nnumber of words in the dictionary is 8857. We used an inverse sigmoid decay schedule for \u03f5_i for the\nscheduled sampling approach.\nTable 1 shows the results on various metrics on the development set. Each of these metrics is\na variant of estimating the overlap between the obtained sequence of words and the target one.\nSince there were 5 target captions per image, the best result is always chosen. To the best of our\nknowledge, the baseline results are consistent with (slightly better than) the current state-of-the-art on\nthat task. While dropout helped in terms of log likelihood (as expected but not shown), it had a\nnegative impact on the real metrics. On the other hand, scheduled sampling successfully trained a\nmodel more resilient to failures due to training and inference mismatch, which likely yielded higher\nquality captions according to all the metrics. Ensembling models also yielded better performance,\nboth for the baseline and the scheduled sampling approach. It is also interesting to note that a model\ntrained while always sampling from itself (hence in a regime similar to inference), dubbed Always\nSampling in the table, yielded very poor performance, as expected because the model has a hard\ntime learning the task in that case. We also trained a model with scheduled sampling, but instead\nof sampling from the model, we sampled from a uniform distribution, in order to verify that it was\nimportant to build on the current model and that the performance boost was not just a simple form\nof regularization. We called this Uniform Scheduled Sampling and the results are better than the\nbaseline, but not as good as our proposed approach. 
We also experimented with flipping the coin\nonce per sequence instead of once per token, but the results were as poor as the Always Sampling\napproach.\n\nTable 1: Various metrics (the higher the better) on the MSCOCO development set for the image\ncaptioning task.\n\nApproach vs Metric | BLEU-4 | METEOR | CIDER\nBaseline | 28.8 | 24.2 | 89.5\nBaseline with Dropout | 28.1 | 23.9 | 87.0\nAlways Sampling | 11.2 | 15.7 | 49.7\nScheduled Sampling | 30.6 | 24.3 | 92.1\nUniform Scheduled Sampling | 29.2 | 24.2 | 90.9\nBaseline ensemble of 10 | 30.7 | 25.1 | 95.7\nScheduled Sampling ensemble of 5 | 32.3 | 25.4 | 98.7\n\nIt\u2019s worth noting that we used our scheduled sampling approach to participate in the 2015 MSCOCO\nimage captioning challenge [21] and ranked first in the final leaderboard.\n\n4.2 Constituency Parsing\n\nAnother less obvious connection with the any-to-sequence paradigm is constituency parsing. Recent\nwork [4] has proposed an interpretation of a parse tree as a sequence of linear \u201coperations\u201d that build\nup the tree. This linearization procedure allowed them to train a model that can map a sentence onto\nits parse tree without any modification to the any-to-sequence formulation.\nThe trained model has one layer of 512 LSTM cells and words are represented by embedding vectors\nof size 512. We used an attention mechanism similar to the one described in [22] which helps,\nwhen considering the next output token to produce y_t, to focus on part of the input sequence only\nby applying a softmax over the LSTM state vectors corresponding to the input sequence. The input\nword dictionary contained around 90k words, while the target dictionary contained 128 symbols used\nto describe the tree. We used an inverse sigmoid decay schedule for \u03f5_i in the scheduled sampling\napproach.\nParsing is quite different from image captioning as the function that one has to learn is almost\ndeterministic. 
In contrast to an image having a large number of valid captions, most sentences have\na unique parse tree (although some very difficult cases exist). Thus, the model operates almost\ndeterministically, which can be seen by observing that the train and test perplexities are extremely\nlow compared to image captioning (1.1 vs. 7).\nThis different operating regime makes for an interesting comparison, as one would not expect the\nbaseline algorithm to make many mistakes. However, and as can be seen in Table 2, scheduled\nsampling has a positive effect which is additive to dropout. In this table we report the F1 score on the\nWSJ 22 development set [23]. We should also emphasize that there are only 40k training instances,\nso overfitting contributes largely to the performance of our system. Whether the effect of sampling\nduring training helps with regard to overfitting or the training/inference mismatch is unclear, but the\nresult is positive and additive with dropout. Once again, a model trained by always sampling from\nitself instead of using the groundtruth previous token as input yielded very bad results, in fact so bad\nthat the resulting trees were often not valid trees (hence the \u201c-\u201d in the corresponding F1 metric).\n\nTable 2: F1 score (the higher the better) on the validation set of the parsing task.\n\nApproach | F1\nBaseline LSTM | 86.54\nBaseline LSTM with Dropout | 87.0\nAlways Sampling | -\nScheduled Sampling | 88.08\nScheduled Sampling with Dropout | 88.68\n\n4.3 Speech Recognition\n\nFor the speech recognition experiments, we used a slightly different setting from the rest of the\npaper. Each training example is an input/output pair (X, Y), where X is a sequence of T input\nvectors x_1, x_2, . . . , x_T and Y is a sequence of T tokens y_1, y_2, . . . , y_T so each y_t is aligned with the\ncorresponding x_t. 
Here, x_t are the acoustic features represented by log Mel filter bank spectra at\nframe t, and y_t is the corresponding target. The targets used were HMM-state labels generated from\na GMM-HMM recipe, using the Kaldi toolkit [24] but could very well have been phoneme labels.\nThis setting is different from the other experiments in that the model we used is the following:\n\n\\log P(Y \\mid X; \\theta) = \\log P(y_1^T \\mid x_1^T; \\theta) = \\sum_{t=1}^{T} \\log P(y_t \\mid y_1^{t-1}, x_1^t; \\theta) = \\sum_{t=1}^{T} \\log P(y_t \\mid h_t; \\theta)    (5)\n\nwhere h_t is computed by a recurrent neural network as follows:\n\nh_t = \\begin{cases} f(o_h, S, x_1; \\theta) & \\text{if } t = 1, \\\\ f(h_{t-1}, y_{t-1}, x_t; \\theta) & \\text{otherwise,} \\end{cases}    (6)\n\nwhere o_h is a vector of 0\u2019s with the same dimensionality as the h_t\u2019s and S is an extra token added to the\ndictionary to represent the start of each sequence.\nWe generated data for these experiments using the TIMIT\u2074 corpus and the KALDI toolkit as described\nin [25]. Standard configurations were used for the experiments - 40 dimensional log Mel\nfilter banks and their first and second order temporal derivatives were used as inputs to each frame.\n180 dimensional targets were generated for each time frame using forced alignment to transcripts\nusing a trained GMM-HMM system. The training, validation and test sets have 3696, 400 and 192\nsequences respectively, and their average length was 304 frames. 
The validation set was used to\nchoose the best epoch in training, and the model parameters from that epoch were used to evaluate\nthe test set.\nThe trained models had two layers of 250 LSTM cells and a softmax layer, for each of five configurations\n- a baseline configuration where the ground truth was always fed to the model, a configuration\n(Always Sampling) where the model was only fed in its own predictions from the last time step,\nand three scheduled sampling configurations (Scheduled Sampling 1-3), where \u03f5_i was ramped linearly\nfrom a maximum value to a minimum value over ten epochs and then kept constant at the\nfinal value. For each configuration, we trained 3 models and report average performance over them.\nTraining of each model was done over frame targets from the GMM. The baseline configurations\ntypically reached the best validation accuracy after approximately 14 epochs whereas the sampling\nmodels reached the best accuracy after approximately 9 epochs, after which the validation accuracy\ndecreased. This is probably because the way we trained our models is not exact - it does not account\nfor the gradient of the sampling probabilities from which we sampled our targets. Future effort at\ntackling this problem may further improve results.\nTesting was done by finding the best sequence from beam search decoding (using a beam size of\n10) and computing the error rate over the sequences. We also report the next step error rate\n(where the model was fed in the ground truth to predict the class of the next frame) for each of the\nmodels on the validation set to summarize the performance of the models on the training objective.\nTable 3 shows a summary of the results.\nIt can be seen that the baseline performs better next step prediction than the models that sample the\ntokens for input. This is to be expected, since the former has access to the groundtruth. 
However, it\ncan be seen that the models that were trained with sampling perform better than the baseline during\ndecoding. It can also be seen that for this problem, the \u201cAlways Sampling\u201d model performs quite\nwell. We hypothesize that this has to do with the nature of the dataset. The HMM-aligned states\nhave a lot of correlation - the same state appears as the target for several frames, and most of the\nstates are constrained only to go to a subset of other states. Next step prediction with groundtruth\nlabels on this task ends up paying disproportionate attention to the structure of the labels (y_1^{t-1})\nand not enough to the acoustics input (x_1^t). Thus it achieves very good next step prediction error\nwhen the groundtruth sequence is fed in with the acoustic information, but is not able to exploit\nthe acoustic information sufficiently when the groundtruth sequence is not fed in. For this model\nthe testing conditions are too far from the training condition for it to make good predictions. The\nmodel that is only fed its own prediction (Always Sampling) ends up exploiting all the information\nit can find in the acoustic signal, and effectively ignores its own predictions when making the next\nstep prediction. Thus at test time, it performs just as well as it does during training. A model such as\nthe attention model of [26] which predicts phone sequences directly, instead of the highly redundant\nHMM state sequences, would not suffer from this problem because it would need to exploit both the\nacoustic signal and the language model sufficiently to make predictions. Nevertheless, even in this\nsetting, adding scheduled sampling still helped to improve the decoding frame error rate.\n\n\u2074 https://catalog.ldc.upenn.edu/LDC93S1.\n\nNote that typically speech recognition experiments use HMMs to decode predictions from neural\nnetworks in a hybrid model. 
Here we avoid using an HMM altogether and hence we do not have the\nadvantage of the smoothing that results from the HMM architecture and the language models. Thus\nthe results are not directly comparable to the typical hybrid model results.\n\nTable 3: Frame Error Rate (FER) on the speech recognition experiments. In next step prediction\n(reported on validation set) the ground truth is fed in to predict the next target like it is done during\ntraining. In decoding experiments (reported on test set), beam searching is done to find the best\nsequence. We report results on four different linear schedulings of sampling, where \u03f5_i was ramped\ndown linearly from \u03f5_s to \u03f5_e. For the baseline, the model was only fed in the ground truth. See\nSection 4.3 for an analysis of the results.\n\nApproach | \u03f5_s | \u03f5_e | Next Step FER | Decoding FER\nAlways Sampling | 0 | 0 | 34.6 | 35.8\nScheduled Sampling 1 | 0.25 | 0 | 34.3 | 34.5\nScheduled Sampling 2 | 0.5 | 0 | 34.1 | 35.0\nScheduled Sampling 3 | 0.9 | 0.5 | 19.8 | 42.0\nBaseline LSTM | 1 | 1 | 15.0 | 46.0\n\n5 Conclusion\n\nUsing recurrent neural networks to predict sequences of tokens has many useful applications like\nmachine translation and image description. However, the current approach to training them, predicting\none token at a time, conditioned on the state and the previous correct token, is different from\nhow we actually use them and thus is prone to the accumulation of errors along the decision paths.\nIn this paper, we proposed a curriculum learning approach to slowly change the training objective\nfrom an easy task, where the previous token is known, to a realistic one, where it is provided by the\nmodel itself. Experiments on several sequence prediction tasks yield performance improvements,\nwhile not incurring longer training times. 
Future work includes back-propagating the errors through\nthe sampling decisions, as well as exploring better sampling strategies including conditioning on\nsome confidence measure from the model itself.\n\nReferences\n[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long term dependencies is hard. IEEE Transactions on Neural Networks, 5(2):157\u2013166, 1994.\n[2] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.\n[3] I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, NIPS, 2014.\n[4] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In arXiv:1412.7449, 2014.\n[5] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015.\n[6] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015.\n[7] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the International Conference on Machine Learning, ICML, 2009.\n[8] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML, pages 282\u2013289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.\n[9] H. Daum\u00e9 III, J. Langford, and D. Marcu. Search-based structured prediction as classification. Machine Learning Journal, 2009.\n[10] S. Ross, G. J. Gordon, and J. A. Bagnell. 
A reduction of imitation learning and structured prediction\nto no-regret online learning. In Proceedings of the Workshop on Artificial Intelligence and Statistics,\nAISTATS, 2011.\n[11] A. Venkatraman, M. Hebert, and J. A. Bagnell. Improving multi-step prediction of learned time series\nmodels. In Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI, 2015.\n[12] M. Collins and B. Roark. Incremental parsing with the perceptron algorithm. In Proceedings of the\nAssociation for Computational Linguistics, ACL, 2004.\n[13] Y. Goldberg and J. Nivre. A dynamic oracle for arc-eager dependency parsing. In Proceedings of COLING,\n2012.\n[14] J. Mao, W. Xu, Y. Yang, J. Wang, Z. H. Huang, and A. Yuille. Deep captioning with multimodal recurrent\nneural networks (m-rnn). In International Conference on Learning Representations, ICLR, 2015.\n[15] R. Kiros, R. Salakhutdinov, and R. Zemel. Unifying visual-semantic embeddings with multimodal neural\nlanguage models. In TACL, 2015.\n[16] A. Karpathy and F.-F. Li. Deep visual-semantic alignments for generating image descriptions. In IEEE\nConference on Computer Vision and Pattern Recognition, CVPR, 2015.\n[17] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Doll\u00e1r, J. Gao, X. He, M. Mitchell, J. C.\nPlatt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In IEEE Conference on\nComputer Vision and Pattern Recognition, CVPR, 2015.\n[18] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In\nIEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015.\n[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick. Microsoft\ncoco: Common objects in context. arXiv:1405.0312, 2014.\n[20] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal\ncovariate shift. 
In Proceedings of the International Conference on Machine Learning, ICML, 2015.\n[21] Y. Cui, M. R. Ronchi, T.-Y. Lin, P. Doll\u00e1r, and L. Zitnick. Microsoft coco captioning challenge.\nhttp://mscoco.org/dataset/#captions-challenge2015, 2015.\n[22] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate.\nIn International Conference on Learning Representations, ICLR, 2015.\n[23] E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel. Ontonotes: The 90% solution. In\nProceedings of the Human Language Technology Conference of the NAACL, Short Papers, pages 57\u201360,\nNew York City, USA, June 2006. Association for Computational Linguistics.\n[24] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek,\nY. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. The kaldi speech recognition toolkit. In\nIEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing\nSociety, December 2011. IEEE Catalog No.: CFP11SRW-USB.\n[25] N. Jaitly. Exploring Deep Learning Methods for discovering features in speech signals. PhD thesis,\nUniversity of Toronto, 2014.\n[26] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio. End-to-end continuous speech\nrecognition using attention-based recurrent nn: First results. arXiv preprint arXiv:1412.1602, 2014.\n", "award": [], "sourceid": 734, "authors": [{"given_name": "Samy", "family_name": "Bengio", "institution": "Google Research"}, {"given_name": "Oriol", "family_name": "Vinyals", "institution": "Google"}, {"given_name": "Navdeep", "family_name": "Jaitly", "institution": "Google"}, {"given_name": "Noam", "family_name": "Shazeer", "institution": "Google"}]}