{"title": "Professor Forcing: A New Algorithm for Training Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4601, "page_last": 4609, "abstract": "The Teacher Forcing algorithm trains recurrent networks by supplying observed sequence values as inputs during training and using the network\u2019s own one-step-ahead predictions to do multi-step sampling. We introduce the Professor Forcing algorithm, which uses adversarial domain adaptation to encourage the dynamics of the recurrent network to be the same when training the network and when sampling from the network over multiple time steps. We apply Professor Forcing to language modeling, vocal synthesis on raw waveforms, handwriting generation, and image generation. Empirically we find that Professor Forcing acts as a regularizer, improving test likelihood on character level Penn Treebank and sequential MNIST. We also find that the model qualitatively improves samples, especially when sampling for a large number of time steps. This is supported by human evaluation of sample quality. Trade-offs between Professor Forcing and Scheduled Sampling are discussed. We produce T-SNEs showing that Professor Forcing successfully makes the dynamics of the network during training and sampling more similar.", "full_text": "Professor Forcing: A New Algorithm for Training\n\nRecurrent Networks\n\nAnirudh Goyal\u2217, Alex Lamb\u2217, Ying Zhang, Saizheng Zhang,\n\nAaron Courville and Yoshua Bengio1\nMILA, Universit\u00e9 de Montr\u00e9al, 1CIFAR\n\n{anirudhgoyal9119, alex6200, ying.zhlisa, saizhenglisa,\n\naaron.courville, yoshua.umontreal}@gmail.com\n\nAbstract\n\nThe Teacher Forcing algorithm trains recurrent networks by supplying observed\nsequence values as inputs during training and using the network\u2019s own one-step-\nahead predictions to do multi-step sampling. 
We introduce the Professor Forcing\nalgorithm, which uses adversarial domain adaptation to encourage the dynamics of\nthe recurrent network to be the same when training the network and when sampling\nfrom the network over multiple time steps. We apply Professor Forcing to language\nmodeling, vocal synthesis on raw waveforms, handwriting generation, and image\ngeneration. Empirically we \ufb01nd that Professor Forcing acts as a regularizer, im-\nproving test likelihood on character level Penn Treebank and sequential MNIST.\nWe also \ufb01nd that the model qualitatively improves samples, especially when sam-\npling for a large number of time steps. This is supported by human evaluation of\nsample quality. Trade-offs between Professor Forcing and Scheduled Sampling are\ndiscussed. We produce T-SNEs showing that Professor Forcing successfully makes\nthe dynamics of the network during training and sampling more similar.\n\n1\n\nIntroduction\n\nRecurrent neural networks (RNNs) have become the generative models of choice for sequential\ndata (Graves, 2012), with impressive results in language modeling (Mikolov, 2010; Mikolov and\nZweig, 2012), speech recognition (Bahdanau et al., 2015; Chorowski et al., 2015), machine transla-\ntion (Cho et al., 2014a; Sutskever et al., 2014; Bahdanau et al., 2014), handwriting generation (Graves,\n2013), image caption generation (Xu et al., 2015; Chen and Lawrence Zitnick, 2015), etc.\nThe RNN models the data via a fully-observed directed graphical model: it decomposes the distribu-\ntion over the discrete time sequence y1, y2, . . . yT into an ordered product of conditional distributions\nover tokens\n\nP (y1, y2, . . . yT ) = P (y1) \u220f_{t=2}^{T} P (yt | y1, . . . yt\u22121).\n\nBy far the most popular training strategy is via the maximum likelihood principle. 
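The factorization above, and the teacher-forced maximum-likelihood criterion built on it, can be made concrete with a toy stand-in for the RNN's conditional distribution (a minimal sketch; the bigram table below is illustrative and not the paper's model):

```python
import math

# Toy stand-in for the RNN's conditional P(y_t | y_1..y_{t-1}): a bigram
# table, so the conditional depends only on the previous token.
# "^" marks the start of the sequence. (Illustrative values, not from the paper.)
COND = {
    "^": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.3, "b": 0.7},
    "b": {"a": 0.5, "b": 0.5},
}

def sequence_log_prob(y):
    """log P(y_1, ..., y_T) = log P(y_1) + sum_{t>=2} log P(y_t | y_{<t})."""
    logp, prev = 0.0, "^"
    for tok in y:
        logp += math.log(COND[prev][tok])
        prev = tok
    return logp

def teacher_forced_nll(sequences):
    """Maximum-likelihood (teacher forcing) criterion: mean negative log-likelihood."""
    return -sum(sequence_log_prob(y) for y in sequences) / len(sequences)
```

Maximizing likelihood is exactly minimizing this per-sequence sum of next-token negative log-probabilities.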
In the RNN\nliterature, this form of training is also known as teacher forcing (Williams and Zipser, 1989), due to\nthe use of the ground-truth samples yt being fed back into the model to be conditioned on for the\nprediction of later outputs. These fed back samples force the RNN to stay close to the ground-truth\nsequence.\nWhen using the RNN for prediction, the ground-truth sequence is not available for conditioning and\nwe sample from the joint distribution over the sequence by sampling each yt from its conditional\n\n\u2217Indicates \ufb01rst authors. Ordering determined by coin \ufb02ip.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fdistribution given the previously generated samples. Unfortunately, this procedure can result in\nproblems in generation as small prediction errors compound in the conditioning context. This can\nlead to poor prediction performance as the RNN\u2019s conditioning context (the sequence of previously\ngenerated samples) diverges from sequences seen during training.\nRecently, Bengio et al. (2015) proposed to remedy that issue by mixing two kinds of inputs during\ntraining: those from the ground-truth training sequence and those generated from the model. However,\nwhen the model generates several consecutive yt\u2019s, it is not clear anymore that the correct target\n(in terms of its distribution) remains the one in the ground truth sequence. This is mitigated in\nvarious ways, by making the self-generated subsequences short and annealing the probability of\nusing self-generated vs ground truth samples. However, as remarked by Husz\u00e1r (2015), scheduled\nsampling yields a biased estimator, in that even as the number of examples and the capacity go\nto in\ufb01nity, this procedure may not converge to the correct model. 
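The contrast between the two modes of operation can be sketched with the same kind of toy conditional model (function names are illustrative; a bigram table stands in for the RNN's recurrent state):

```python
import random

# Sketch of the two ways of running an autoregressive model. A bigram table
# stands in for the RNN's conditional distribution over the next token given
# the previous one; "^" marks the start of the sequence.
COND = {"^": {"a": 0.6, "b": 0.4}, "a": {"a": 0.3, "b": 0.7}, "b": {"a": 0.5, "b": 0.5}}

def run_teacher_forced(ground_truth):
    """Conditioning inputs are clamped to the observed sequence; the model
    only makes one-step-ahead predictions."""
    prev, preds = "^", []
    for tok in ground_truth:
        dist = COND[prev]
        preds.append(max(dist, key=dist.get))  # greedy one-step prediction
        prev = tok                             # feed back the ground truth
    return preds

def run_free_running(length, rng):
    """Each sampled token is fed back as the next conditioning input, so
    small prediction errors can compound in the conditioning context."""
    prev, out = "^", []
    for _ in range(length):
        toks, probs = zip(*COND[prev].items())
        prev = rng.choices(toks, weights=probs)[0]
        out.append(prev)                       # feed back the model's own sample
    return out
```

Teacher forcing only ever conditions on data; free-running conditions on its own (possibly erroneous) history, which is exactly where the train/sample mismatch arises.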
It is worth noting, however, that\nexperiments with scheduled sampling clearly showed some improvements in terms of the robustness\nof the generated sequences, suggesting that something indeed needs to be \ufb01xed in (or replaced with)\nmaximum-likelihood (or teacher forcing) training of generative RNNs.\nIn this paper, we propose an alternative way of training RNNs which explicitly seeks to make the\ngenerative behavior and the teacher-forced behavior match as closely as possible. This is particularly\nimportant to allow the RNN to continue generating robustly well beyond the length of the sequences\nit saw during training. More generally, we argue that this approach helps to better model long-term\ndependencies by using a training objective that is not solely focused on predicting the next observation,\none step at a time.\nOur work provides the following contributions regarding this new training framework:\n\n\u2022 We introduce a novel method for training generative RNNs called Professor Forcing, meant\nto improve long-term sequence sampling from recurrent networks. We demonstrate this\nwith a human evaluation study of sample quality.\n\n\u2022 We \ufb01nd that Professor Forcing can act as a regularizer for recurrent networks. This is\ndemonstrated by achieving improvements in test likelihood on character-level Penn Treebank,\nSequential MNIST Generation, and speech synthesis.\nInterestingly, we \ufb01nd that\ntraining performance can also be improved, and we conjecture that it is because longer-term\ndependencies can be more easily captured.\n\n\u2022 When running an RNN in sampling mode, the region occupied by the hidden states of the\nnetwork diverges from the region occupied when doing teacher forcing. 
We empirically\nstudy this phenomenon using T-SNEs and show that it can be mitigated by using Professor\nForcing.\n\n\u2022 In some domains the sequences available at training time are shorter than the sequences\nthat we want to generate at test time. This is usually the case in long-term forecasting tasks\n(climate modeling, econometrics). We show how Professor Forcing can be used to\nimprove performance in this setting. Note that scheduled sampling cannot be used for this\ntask, because it still uses the observed sequence as targets for the network.\n\n2 Proposed Approach: Professor Forcing\n\nThe basic idea of Professor Forcing is simple: while we do want the generative RNN to match the\ntraining data, we also want the behavior of the network (both in its outputs and in the dynamics of\nits hidden states) to be indistinguishable whether the network is trained with its inputs clamped to\na training sequence (teacher forcing mode) or whether its inputs are self-generated (free-running\ngenerative mode). Because we can only compare the distribution of these sequences, it makes sense\nto take advantage of the generative adversarial networks (GANs) framework (Goodfellow et al., 2014)\nto achieve that second objective of matching the two distributions over sequences (the one observed\nin teacher forcing mode vs the one observed in free-running mode).\nHence, in addition to the generative RNN, we will train a second model, which we call the discrim-\ninator, and which can also process variable-length inputs. In the experiments we use a bidirectional\nRNN architecture for the discriminator, so that it can combine evidence at each time step t from the\npast of the behavior sequence as well as from the future of that sequence.\n\n2\n\n\f2.1 De\ufb01nitions and Notation\n\nLet the training distribution provide (x, y) pairs of input and output sequences (possibly there are\nno inputs at all). 
An output sequence y can also be generated by the generator RNN when given an\ninput sequence x, according to the sequence to sequence model distribution P\u03b8g (y|x). Let \u03b8g be the\nparameters of the generative RNN and \u03b8d be the parameters of the discriminator. The discriminator\nis trained as a probabilistic classi\ufb01er that takes as input a behavior sequence b derived from the\ngenerative RNN\u2019s activity (hiddens and outputs) when it either generates or is constrained by a\nsequence y, possibly in the context of an input sequence x (often but not necessarily of the same\nlength). The behavior sequence b is either the result of running the generative RNN in teacher forcing\nmode (with y from a training sequence with input x), or in free-running mode (with y self-generated\naccording to P\u03b8g (y|x), with x from the training sequence). The function B(x, y, \u03b8g) outputs the\nbehavior sequence (chosen hidden states and output values) given the appropriate data (where x\nalways comes from the training data but y either comes from the data or is self-generated). Let D(b)\nbe the output of the discriminator, estimating the probability that b was produced in teacher-forcing\nmode, given that half of the examples seen by the discriminator are generated in teacher forcing mode\nand half are generated in the free-running mode.\nNote that in the case where the generator RNN does not have any conditioning input, the sequence\nx is empty. 
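A minimal sketch of the behavior function B (the RNN runner below is a stub; in the paper B returns chosen hidden states and, optionally, output values):

```python
# Sketch of the behavior function B(x, y, theta_g). `run_rnn` is a stub
# standing in for the generative RNN: it returns per-time-step hidden states
# and outputs, computed either teacher-forced (y from the data) or
# free-running (y self-generated). B concatenates them into the behavior
# sequence b that the discriminator D classifies.
def behavior(run_rnn, x, y, include_outputs=True):
    hiddens, outputs = run_rnn(x, y)  # one list entry per time step
    if include_outputs:
        return [h + o for h, o in zip(hiddens, outputs)]  # per-step concat
    return hiddens
```

With `include_outputs=False`, b is just the hidden-state trajectory.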
Note also that the generated output sequences could have a different length than the\nconditioning sequence, depending on the task at hand.\n\n2.2 Training Objective\n\nThe discriminator parameters \u03b8d are trained as one would expect, i.e., to maximize the likelihood of\ncorrectly classifying a behavior sequence:\nCd(\u03b8d|\u03b8g) = E(x,y)\u223cdata[\u2212 log D(B(x, y, \u03b8g), \u03b8d)] + Ex\u223cdata,y\u223cP\u03b8g (y|x)[\u2212 log(1 \u2212 D(B(x, y, \u03b8g), \u03b8d))].\n(1)\nPractically, this is achieved with a variant of stochastic gradient descent with minibatches formed by\ncombining N sequences obtained in teacher-forcing mode and N sequences obtained in free-running\nmode, with y sampled from P\u03b8g (y|x). Note also that as \u03b8g changes, the task optimized by the\ndiscriminator changes too, and it has to track the generator, as in other GAN setups, hence the notation\nCd(\u03b8d|\u03b8g).\nThe generator RNN parameters \u03b8g are trained to (a) maximize the likelihood of the data and (b) fool\nthe discriminator. We considered two variants of the latter. 
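Once D(b) has been evaluated on each behavior sequence in a minibatch, the discriminator loss above, and the generator-side terms defined next in this section, reduce to simple log-losses. A minimal numerical sketch (function names are illustrative; the D values are assumed to be precomputed probabilities):

```python
import math

# d_teacher: D(b) values for behaviors produced in teacher-forcing mode (target 1).
# d_free:    D(b) values for behaviors produced in free-running mode   (target 0).
def discriminator_loss(d_teacher, d_free):
    """C_d: negative log-likelihood of correctly classifying both modes."""
    total = (sum(-math.log(d) for d in d_teacher)
             + sum(-math.log(1.0 - d) for d in d_free))
    return total / (len(d_teacher) + len(d_free))

def generator_fool_loss(d_free):
    """C_f: push free-running behavior to look teacher-forced to D."""
    return sum(-math.log(d) for d in d_free) / len(d_free)

def generator_match_loss(d_teacher):
    """C_t (optional): push teacher-forced behavior toward free-running."""
    return sum(-math.log(1.0 - d) for d in d_teacher) / len(d_teacher)
```

The generator is then updated on NLL + C_f (optionally + C_t), while the discriminator is updated on C_d.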
The negative log-likelihood objective (a)\nis the usual teacher-forced training criterion for RNNs:\n\nNLL(\u03b8g) = E(x,y)\u223cdata[\u2212 log P\u03b8g (y|x)].\n\n(2)\n\nRegarding (b), we consider a training objective that only tries to change the free-running behavior so\nthat it better matches the teacher-forced behavior, considering the latter \ufb01xed:\n\nCf (\u03b8g|\u03b8d) = Ex\u223cdata,y\u223cP\u03b8g (y|x)[\u2212 log D(B(x, y, \u03b8g), \u03b8d)].\n\n(3)\n\nIn addition (and optionally), we can ask the teacher-forced behavior to be indistinguishable from the\nfree-running behavior:\n\nCt(\u03b8g|\u03b8d) = E(x,y)\u223cdata[\u2212 log(1 \u2212 D(B(x, y, \u03b8g), \u03b8d))].\n\n(4)\n\nIn our experiments we either perform stochastic gradient steps on NLL + Cf or on NLL + Cf + Ct\nto update the generative RNN parameters, while we always do gradient steps on Cd to update the\ndiscriminator parameters.\n\n3 Related Work\n\nProfessor Forcing is an adversarial method for learning generative models that is closely related to\nGenerative Adversarial Networks (Goodfellow et al., 2014) and Adversarial Domain Adaptation\n(Ajakan et al., 2014; Ganin et al., 2015). Our approach is similar to generative adversarial networks\n(GANs) because both use a discriminative classi\ufb01er to provide gradients for training a generative\nmodel. However, Professor Forcing is different because the classi\ufb01er discriminates between hidden\n\n3\n\n\fFigure 1: Architecture of Professor Forcing: learn correct one-step predictions so as to\nobtain the same kind of recurrent neural network dynamics whether in open loop (teacher forcing)\nmode or in closed loop (generative) mode. An open loop generator does one-step-ahead prediction\ncorrectly; recursively composing these outputs does multi-step prediction (closed loop) and can\ngenerate new sequences. This is achieved by training a classi\ufb01er to distinguish open loop (teacher\nforcing) vs. 
closed loop (free running) dynamics, as a function of the sequence of hidden states and\noutputs. Optimize the closed loop generator to fool the classi\ufb01er. Optimize the open loop generator\nwith teacher forcing. The closed loop and open loop generators share all parameters.\n\nstates from sampling mode and teacher forcing mode, whereas the GAN\u2019s classi\ufb01er discriminates\nbetween real samples and generated samples. One practical advantage of Professor Forcing over\nGANs is that Professor Forcing can be used to learn a generative model over discrete random variables\nwithout having to approximate backpropagation through discrete spaces (Bengio et al., 2013).\nAdversarial domain adaptation uses a classi\ufb01er to discriminate between the hidden states of the\nnetwork with inputs from the source domain and the hidden states of the network with inputs from\nthe target domain. However, this method was not applied in the context of generative models and, more\nspeci\ufb01cally, was not applied to the task of improving long-term generation from recurrent networks.\nAlternative non-adversarial methods have been explored for improving long-term generation from\nrecurrent networks. The scheduled sampling method (Bengio et al., 2015), which is closely related\nto SEARN (Daum\u00e9 et al., 2009) and DAGGER (Ross et al., 2010), involves randomly using the\nnetwork\u2019s predictions as its inputs (as in sampling mode) with some probability that increases over the\ncourse of training. This forces the network to be able to stay in a reasonable regime when receiving\nthe network\u2019s predictions as inputs instead of observed inputs. While Scheduled Sampling shows\nimprovement on some tasks, it is not a consistent estimation strategy. This limitation arises because\nthe outputs sampled from the network could correspond to a distribution that is not consistent with\nthe sequence that the network is trained to generate. 
This issue is discussed in detail in Husz\u00e1r (2015).\nA practical advantage of Scheduled Sampling over Professor Forcing is that Scheduled Sampling\ndoes not require the additional overhead of having to train a discriminator network.\nFinally, the idea of matching the behavior of the model when it is generating in a free-running way\nwith its behavior when it is constrained by the observed data (being clamped on the \"visible units\")\nis precisely that which one obtains when zeroing the maximum likelihood gradient on undirected\ngraphical models with latent variables such as the Boltzmann machine. Training Boltzmann machines\namounts to matching the suf\ufb01cient statistics (which summarize the behavior of the model) in both\n\"teacher forced\" (positive phase) and \"free-running\" (negative phase) modes.\n\n4 Experiments\n\n4.1 Network Architecture and Professor Forcing Setup\n\nThe neural networks and Professor Forcing setup used in the experiments are the following. The\ngenerative RNN has a single hidden layer of gated recurrent units (GRU), previously introduced\nby Cho et al. (2014b) as a computationally cheaper alternative to LSTM units (Hochreiter and\nSchmidhuber, 1997). At each time step, the generative RNN reads an element xt of the input\n\n4\n\n\fsequence (if any) and an element of the output sequence yt (which either comes from the training\ndata or was generated at the previous step by the RNN). It then updates its state ht as a function of\nits previous state ht\u22121 and of the current input (xt, yt). It then computes a probability distribution\nP\u03b8g (yt+1|ht) = P\u03b8g (yt+1|x1, . . . , xt, y1, . . . , yt) over the next element of the output. 
For discrete\noutputs this is achieved by a softmax / af\ufb01ne layer on top of ht, with as many outputs as the size of\nthe set of values that yt can take. In free-running mode, yt+1 is then sampled from this distribution\nand will be used as part of the input for the next time step. Otherwise, the ground truth yt is used.\nThe behavior function B used in the experiments outputs the pre-tanh activation of the GRU states\nfor the whole sequence considered, and optionally the softmax outputs for the next-step prediction,\nagain for the whole sequence.\nThe discriminator architecture we used for these experiments is based on a bidirectional recurrent\nneural network, which comprises two RNNs (again, two GRU networks), one running forward in\ntime on top of the input sequence b, and one running backwards in time, with the same input. The\nhidden states of these two RNNs are concatenated at each time step and fed to a multi-layer neural\nnetwork shared across time (the same network is used for all time steps). That MLP has three layers,\neach composing an af\ufb01ne transformation and a recti\ufb01er (ReLU). Finally, the output layer composes\nan af\ufb01ne transformation and a sigmoid that outputs D(b).\nWhen the discriminator is too poor, the gradient it propagates into the generator RNN could be\ndetrimental. For this reason, we back-propagate from the discriminator into the generator RNN only\nwhen the discriminator classi\ufb01cation accuracy is greater than 75%. On the other hand, when the\ndiscriminator is too successful at identifying fake inputs, we found that it would also hurt to continue\ntraining it. So when its accuracy is greater than 99%, we do not update the discriminator.\nBoth networks are trained by minibatch stochastic gradient descent with adaptive learning rates and\nmomentum determined by the Adam algorithm (Kingma and Ba, 2014). 
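The accuracy-gated update schedule described above can be sketched as follows (thresholds from this section; the three update callbacks are placeholders for the actual gradient steps):

```python
# Gate the adversarial updates on the discriminator's classification accuracy,
# as described above: only backpropagate the fooling loss into the generator
# when the discriminator is informative enough (> 75% accuracy), and stop
# updating the discriminator once it is too strong (> 99% accuracy).
def professor_forcing_step(disc_accuracy, nll_step, fool_step, disc_step):
    nll_step()                    # teacher-forced NLL step, always applied
    if disc_accuracy > 0.75:
        fool_step()               # C_f gradient step into the generator
    if disc_accuracy <= 0.99:
        disc_step()               # C_d gradient step into the discriminator
```

This keeps a too-weak discriminator from sending detrimental gradients into the generator, and a too-strong one from running away.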
All of our experiments were\nimplemented using the Theano framework (Al-Rfou et al., 2016).\n\n4.2 Character-Level Language Modeling\n\nWe evaluate Professor Forcing on character-level language modeling on the Penn Treebank corpus,\nwhich has an alphabet size of 50 and consists of 5059k characters for training, 396k characters for\nvalidation and 446k characters for test. We divide the training set into non-overlapping sequences,\neach of length 500. During training, we monitor the negative log-likelihood (NLL) of the output\nsequences. The \ufb01nal model is evaluated by the bits-per-character (BPC) metric. The generative RNN\n\nFigure 2: Penn Treebank likelihood curves in terms of the number of iterations. Training Negative\nLog-Likelihood (left). Validation BPC (right).\n\nimplements a 1-hidden-layer GRU with 1024 hidden units. We use the Adam algorithm for optimization\nwith a learning rate of 0.0001. We feed both the hidden states and character-level embeddings into the\ndiscriminator. All the layers in the discriminator consist of 2048 hidden units. Output activation of\nthe last layer is clipped between -10 and 10. We see that the training cost of the Professor Forcing network\ndecreases faster than that of the teacher forcing network. The training time of our model is about 3 times\nthat of teacher forcing, since our model includes a sampling phase, as well as passing\nthe hidden states corresponding to the free-running and teacher forcing phases to the discriminator.\nThe \ufb01nal BPC on the validation set was 1.50 for our baseline and 1.48 with Professor Forcing.\nOn word-level Penn Treebank we did not observe any difference between Teacher Forcing and\nProfessor Forcing. One possible explanation for this difference is the increased importance of\nlong-term dependencies in character-level language modeling.\n\n5\n\n\fFigure 3: T-SNE visualization of hidden states, left: with teacher forcing, right: with professor\nforcing. 
Red dots correspond to teacher forcing hidden states, while the gold dots correspond to\nfree running mode. At t = 500, the closed-loop and open-loop hidden states clearly occupy distinct\nregions with teacher forcing, meaning that the network enters a region during sampling distinct from\nthe region seen during teacher forcing training. With professor forcing, these regions now largely\noverlap. We computed 30 T-SNEs for Teacher Forcing and 30 T-SNEs for Professor Forcing and\nfound that the mean centroid distance was reduced from 3000 to 1800 (40% relative reduction). The\nmean distance from a hidden state in the training network to a hidden state in the sampling network\nwas reduced from 22.8 with Teacher Forcing to 16.4 with Professor Forcing (vocal synthesis).\n\nMethod | MNIST NLL\nDBN 2hl (Germain et al., 2015) | \u2248 84.55\nNADE (Larochelle and Murray, 2011) | 88.33\nEoNADE-5 2hl (Raiko et al., 2014) | 84.68\nDLGM 8 leapfrog steps (Salimans et al., 2014) | \u2248 85.51\nDARN 1hl (Gregor et al., 2015) | \u2248 84.13\nDRAW (Gregor et al., 2015) | \u2264 80.97\nPixel RNN (van den Oord et al., 2016) | 79.2\nProfessor Forcing (ours) | 79.58\n\nTable 1: Test set negative log-likelihood evaluations on Sequential MNIST.\n\n4.3 Sequential MNIST\n\nWe evaluated Professor Forcing on the task of sequentially generating the pixels in MNIST digits.\nWe use the standard binarized MNIST dataset (Murray and Salakhutdinov, 2009). We selected\nhyperparameters for our model on the validation set and elected to use 512 hidden states and a\nlearning rate of 0.0001. For all experiments we used a 3-layer GRU as our generator. Unlike our other\nexperiments, we used a convolutional network for the discriminator instead of a bi-directional RNN,\nas the pixels have a 2D spatial structure. 
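Casting image generation as sequence modeling amounts to flattening each binarized 28x28 image into a length-784 pixel sequence in raster order and predicting each pixel given all previous ones; a sketch (the helper itself is illustrative):

```python
# Flatten a binarized image (given as a list of rows of 0/1 pixels) into the
# pixel sequence an autoregressive model is trained on, in raster (row-major)
# order: each pixel is then predicted conditioned on all earlier pixels.
def raster_sequence(image_rows):
    return [p for row in image_rows for p in row]
```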
In Table 1, we note that our model achieves the second-best\nreported likelihood on this task, after the PixelRNN, which used a signi\ufb01cantly more complicated\narchitecture for its generator (van den Oord et al., 2016). Combining Professor Forcing with the\nPixelRNN would be an interesting area for future research. However, the PixelRNN parallelizes\ncomputation in the teacher forcing network in a way that doesn\u2019t work in the sampling network.\nBecause Professor Forcing requires running the sampling network during training, naively combining\nProfessor Forcing with the PixelRNN would be very slow.\n\nFigure 4: Samples with Teacher Forcing (left) and Professor Forcing (right) on Sequential MNIST.\n\n6\n\n\fResponse | Percent | Count\nProfessor Forcing Much Better | 19.7 | 151\nProfessor Forcing Slightly Better | 57.2 | 439\nTeacher Forcing Slightly Better | 18.9 | 145\nTeacher Forcing Much Better | 4.3 | 33\nTotal | 100.0 | 768\n\nTable 2: Human Evaluation Study Results for Handwriting Generation.\n\n4.4 Handwriting Generation\n\nWith this task we wanted to investigate if Professor Forcing could be used to perform domain\nadaptation from a training set with short sequences to sampling much longer sequences. We train\nthe Teacher Forcing model on only 50 steps of text-conditioned handwriting (corresponding to a few\nletters) and then sample for 1000 time steps. We let the model learn a sequence of (x, y) coordinates\ntogether with binary indicators of pen-up vs. pen-down, using the standard handwriting IAM-OnDB\ndataset, which consists of 13,040 handwritten lines written by 500 writers (Liwicki and Bunke, 2005).\nFor our teacher forcing model, we use the open-source implementation of Brebisson (2016) and use their\nhyperparameters, which are based on the model in Graves (2013). 
For the professor forcing model, we\nsample for 1000 time steps and run a separate discriminator on non-overlapping segments of length\n50 (the number of steps used in the teacher forcing model).\nWe performed a human evaluation study on handwriting samples. We gave 48 volunteers 16 randomly\nselected Professor Forcing samples paired with 16 Teacher Forcing samples and asked them to\nindicate which sample was higher quality and whether it was \u201cmuch better\u201d or \u201cslightly better\u201d. Both\nmodels had equal training time and samples were drawn using the same procedure. Volunteers were\nnot aware of which samples came from which model; see Table 2 for results.\n\n4.5 Music Synthesis on Raw Waveforms\n\nWe considered the task of vocal synthesis on raw waveforms. For this task we used three hours of monk\nchanting audio scraped from YouTube (https://www.youtube.com/watch?v=9-pD28iSiTU).\nWe sampled the audio at a rate of 1 kHz and took four seconds for each training and validation\nexample. On each time step of the raw audio waveform we binned the signal\u2019s value into 8000 bins\nwith boundaries drawn uniformly between the smallest and largest signal values in the dataset. We\nthen model the raw audio waveform as a 4000-length sequence with 8000 potential values on each\ntime step.\n\nFigure 6: Music Synthesis. Left: training likelihood curves. Right: validation likelihood curves.\n\nWe evaluated the quality of our vocal synthesis model using two criteria. First, we demonstrated a\nregularizing effect and improvement in negative log-likelihood. Second, we observed improvement in\nthe quality of samples. 
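The amplitude quantization described above (8000 uniform bins between the dataset's extremes) can be sketched as follows (the helper itself is illustrative):

```python
# Map each raw signal value to an integer bin index, with bin boundaries
# spaced uniformly between the smallest (lo) and largest (hi) values seen in
# the dataset; edge values are clamped into the valid index range.
def quantize(waveform, lo, hi, n_bins=8000):
    width = (hi - lo) / n_bins
    return [min(max(int((v - lo) / width), 0), n_bins - 1) for v in waveform]
```

Each four-second, 1 kHz clip then becomes a 4000-step sequence of categorical tokens, modeled the same way as characters or pixels.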
We included a few randomly selected samples in the supplementary material\nand also performed human evaluation of the samples.\nVisual inspection of samples is known to be a \ufb02awed method for evaluating generative models,\nbecause a generative model could simply memorize a small number of examples from the training set\n(or slightly modi\ufb01ed examples from the training set) and achieve high sample quality. This issue was\ndiscussed in Theis et al. (2015). However, this is unlikely to be an issue with our evaluation because\nour method also improved validation set likelihood, whereas a model that achieves quality samples\nby dropping coverage would have poorer validation set likelihood.\n\n7\n\n\fWe performed human evaluation by asking 29 volunteers to listen to \ufb01ve randomly selected teacher\nforcing samples and \ufb01ve randomly selected professor forcing samples (included in the supplementary\nmaterials) and then rate each sample from 1-3 on the basis of quality. The annotators were given\nthe samples in random order and were not told which samples came from which algorithm. The\nhuman annotators gave the Professor Forcing samples an average score of 2.20, whereas they gave\nthe Teacher Forcing samples an average score of 1.30.\n\nFigure 7: Human evaluator ratings for vocal synthesis samples (higher is better). The height of the\nbar is the mean of the ratings and the error bar shows the spread of one standard deviation.\n\n5 Conclusion\nThe idea of matching the behavior of a model when it is running on its own, making predictions,\ngenerating samples, etc., vs. when it is forced to be consistent with observed data is an old and\npowerful one. In this paper we introduce Professor Forcing, a novel instance of this idea when the\nmodel of interest is a recurrent generative one, and which relies on training an auxiliary model, the\ndiscriminator, to spot the differences between these two modes of behavior. 
A major motivation for this approach is that the discriminator can look at the statistics of the network's behavior, not just at single-step predictions, forcing the generator to behave the same way when it is constrained by the data and when it is left to generate outputs on its own, for sequences that can be much longer than the training sequences. As we have found, this naturally yields better generalization to sequences much longer than those seen during training. We have also found that it helps generalization in terms of one-step prediction (log-likelihood), even though we are adding a possibly conflicting term to the log-likelihood training objective. This suggests that Professor Forcing acts as a regularizer, and an interesting one, since it can also greatly speed up convergence in terms of the number of training updates. We validated the advantage of Professor Forcing over traditional Teacher Forcing on a variety of sequential learning and generative tasks, with particularly impressive results in acoustic generation, where the training sequences are much shorter (because of memory constraints) than the sequences we actually want to generate.

Acknowledgments
We thank Martin Arjovsky, Dzmitry Bahdanau, Nan Rosemary Ke, José Manuel Rodríguez Sotelo, Alexandre de Brébisson, Olexa Bilaniuk, Hal Daumé III, Kari Torkkola, and David Krueger.
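The two-regime setup described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation (which trains GRU/LSTM generators and a deep discriminator with backpropagation): the toy RNN, the vocabulary and hidden sizes, the start-token convention, and the mean-pooled logistic discriminator are all assumptions chosen for brevity. The point it shows is the structure of the objective: the same generator is run teacher-forced on data and free-running on its own samples (here for twice the training length), the behavior sequences (hidden states) are fed to a discriminator, and the generator receives an extra loss for failing to make its free-running behavior look teacher-forced.

```python
import numpy as np

rng = np.random.default_rng(0)

H, V, T = 16, 5, 20  # hidden size, vocab size, training sequence length (illustrative)

# Toy generator RNN parameters (random here; they would be trained).
Wxh = rng.normal(scale=0.1, size=(V, H))
Whh = rng.normal(scale=0.1, size=(H, H))
Who = rng.normal(scale=0.1, size=(H, V))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run(seq=None, free_run_steps=0):
    """Run the RNN and return its behavior sequence (hidden states).

    If `seq` is given, the network is teacher-forced on it; otherwise it
    free-runs for `free_run_steps` steps, feeding back its own samples.
    """
    h = np.zeros(H)
    x = 0  # start token (assumption: token 0 opens every sequence)
    states = []
    steps = len(seq) if seq is not None else free_run_steps
    for t in range(steps):
        h = np.tanh(Wxh[x] + h @ Whh)
        states.append(h)
        p = softmax(h @ Who)
        # Next input: ground truth (teacher forcing) or the model's own sample.
        x = seq[t] if seq is not None else rng.choice(V, p=p)
    return np.stack(states)

# Discriminator: logistic regression on a summary statistic of the behavior
# (the paper uses a richer classifier over the whole hidden-state sequence).
Wd = rng.normal(scale=0.1, size=H)

def d_prob(states):
    # Probability that this behavior came from the teacher-forced regime.
    return 1.0 / (1.0 + np.exp(-(states.mean(axis=0) @ Wd)))

data = rng.choice(V, size=T)            # an observed training sequence
b_forced = run(seq=data)                # teacher-forced behavior
b_free = run(free_run_steps=2 * T)      # free-running, longer than training

# Discriminator objective: tell the two regimes apart.
d_loss = -np.log(d_prob(b_forced)) - np.log(1.0 - d_prob(b_free))
# Generator's adversarial term, added to the usual likelihood objective:
# make the free-running behavior indistinguishable from teacher forcing.
g_adv_loss = -np.log(d_prob(b_free))
```

In training, `g_adv_loss` would be added (with a weight) to the teacher-forced negative log-likelihood and the discriminator and generator updated in alternation, as in other adversarial setups.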