{"title": "Training Language GANs from Scratch", "book": "Advances in Neural Information Processing Systems", "page_first": 4300, "page_last": 4311, "abstract": "Generative Adversarial Networks (GANs) enjoy great success at image generation, but have proven difficult to train in the domain of natural language. Challenges with gradient estimation, optimization instability, and mode collapse have lead practitioners to resort to maximum likelihood pre-training, followed by small amounts of adversarial fine-tuning. The benefits of GAN fine-tuning for language generation are unclear, as the resulting models produce comparable or worse samples than traditional language models. We show it is in fact possible to train a language GAN from scratch --- without maximum likelihood pre-training. We combine existing techniques such as large batch sizes, dense rewards and discriminator regularization to stabilize and improve language GANs. The resulting model, ScratchGAN, performs comparably to maximum likelihood training on EMNLP2017 News and WikiText-103 corpora\naccording to quality and diversity metrics.", "full_text": "Training Language GANs from Scratch\n\nCyprien de Masson d\u2019Autume\u2217 Mihaela Rosca\u2217 Jack Rae\n\nDeepMind\n\nShakir Mohamed\n\n{cyprien,mihaelacr,jwrae,shakir}@google.com\n\nAbstract\n\nGenerative Adversarial Networks (GANs) enjoy great success at image genera-\ntion, but have proven dif\ufb01cult to train in the domain of natural language. Chal-\nlenges with gradient estimation, optimization instability, and mode collapse have\nlead practitioners to resort to maximum likelihood pre-training, followed by small\namounts of adversarial \ufb01ne-tuning. The bene\ufb01ts of GAN \ufb01ne-tuning for language\ngeneration are unclear, as the resulting models produce comparable or worse sam-\nples than traditional language models. We show it is in fact possible to train a\nlanguage GAN from scratch \u2014 without maximum likelihood pre-training. 
We combine existing techniques such as large batch sizes, dense rewards and discriminator regularization to stabilize and improve language GANs. The resulting model, ScratchGAN, performs comparably to maximum likelihood training on EMNLP2017 News and WikiText-103 corpora according to quality and diversity metrics.

1 Introduction

Unsupervised word level text generation is a stepping stone for a plethora of applications, from dialogue generation to machine translation and summarization [1, 2, 3, 4]. While recent innovations such as architectural changes and leveraging big datasets are promising [5, 6, 7], the problem of unsupervised text generation is far from being solved.

Today, language models trained using maximum likelihood are the most successful and widespread approach to text modeling, but they are not without limitations. Since they explicitly model sequence probabilities, language models trained by maximum likelihood are often confined to an autoregressive structure, limiting applications such as one-shot language generation. Non-autoregressive maximum likelihood models have been proposed, but due to reduced model capacity they rely on distilling autoregressive models to achieve comparable performance on machine translation tasks [8].

When combined with maximum likelihood training, autoregressive modelling can result in poor samples due to exposure bias [9], a distributional shift between training sequences used for learning and model data required for generation. Recently, [10] showed that sampling from state of the art language models can lead to repetitive, degenerate output. Scheduled sampling [9] has been proposed as a solution, but is thought to encourage sample quality by reducing sample diversity, inducing mode collapse [11].

Generative Adversarial Networks (GANs) [12] are an alternative to models trained via maximum likelihood.
GANs do not suffer from exposure bias since the model learns to sample during training: the learning objective is to generate samples which are indistinguishable from real data according to a discriminator. Since GANs don't require an explicit probability model, they remove the restriction to autoregressive architectures, allowing one-shot feed-forward generation [13].

The sequential and discrete nature of text has made the application of GANs to language challenging, with fundamental issues such as difficult gradient estimation and mode collapse yet to be addressed.

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Existing language GANs avoid these issues by pre-training models with maximum likelihood [14, 15, 16, 17, 18] and limiting the amount of adversarial fine-tuning by restricting the number of fine-tuning epochs and often using a small learning rate [19, 20]. This suggests "that the best-performing GANs tend to stay close to the solution given by maximum-likelihood training" [20]. Even with adversarial fine-tuning playing a limited role, extensive evaluation has shown that existing language GANs do not improve over maximum likelihood-trained models [19, 20, 21].

We show that pure adversarial training is a viable approach for unsupervised word-level text generation by training a language GAN from scratch. We achieve this by tackling the fundamental limitations of training discrete GANs through a combination of existing techniques as well as carefully choosing the model and training regime. To the best of our knowledge we are the first to do so successfully; we thus call our model ScratchGAN.
Compared to prior work on discrete language GANs which "barely achieve non-random results without supervised pre-training" [19], ScratchGAN achieves results comparable with maximum likelihood models.

Our aim is to learn models that capture both semantic coherence and grammatical correctness of language, and to demonstrate that these properties have been captured with the use of different evaluation metrics. BLEU and Self-BLEU [22] capture basic local consistency. The Fréchet Distance metric [19] captures global consistency and semantic information, while being less sensitive to local syntax. We use Language and Reverse Language model scores [20] across various softmax temperatures to capture the diversity-quality trade-off. We measure validation data perplexity, using the fact that ScratchGAN learns an explicit distribution over sentences. Nearest neighbor analysis in embedding and data space provides evidence that our model is not trivially overfitting, e.g. by copying sections of training text.

We make the following contributions:
• We show that GANs without any pre-training are comparable with maximum likelihood methods at unconditional text generation.
• We show that large batch sizes, dense rewards and discriminator regularization are key ingredients of training language GANs from scratch.
• We perform an extensive evaluation of the quality and diversity of our model. In doing so, we show that no current evaluation metric is able to capture all the desired properties of language.

The ScratchGAN code can be found at https://github.com/deepmind/deepmind-research/scratchgan.

2 Generative Models of Text

The generative model practitioner has two choices to make: how to model the unknown data distribution p*(x) and how to learn the parameters θ of the model.
The choice of model is often where prior information about the data is encoded, either through the factorization of the distribution, or through its parametrization. The language sequence x = [x_1, ..., x_T] naturally lends itself to autoregressive modeling:

p_θ(x) = ∏_{t=1}^{T} p_θ(x_t | x_1, ..., x_{t−1})    (1)

Sampling x̂_1, ..., x̂_T from an autoregressive model is an iterative process: each token x̂_t is sampled from the conditional distribution imposed by previous samples: x̂_t ∼ p_θ(x_t | x̂_1, ..., x̂_{t−1}). The distributions p_θ(x_t | x_1, ..., x_{t−1}) are Categorical distributions over the vocabulary, and are often parametrized as recurrent neural networks [23, 24].

The specific tokenization x_1, ..., x_T for a given data sequence is left to the practitioner, with character level or word level splits being the most common. Throughout this work, we use word level language modeling.

2.1 Maximum Likelihood

Once a choice of model is made, the question of how to train the model arises. The most common approach to learn a model of language is maximum likelihood estimation (MLE):

arg max_θ E_{p*(x)} log p_θ(x)    (2)

The combination of autoregressive models and maximum likelihood learning has been very fruitful in language modeling [5, 25, 26], but it is unclear whether maximum likelihood is the optimal perceptual objective for text data [11]. In this work we retain the use of autoregressive models and focus on the impact of the training criterion on the quality and sample diversity of generated data, by using adversarial training instead.

2.2 Generative Adversarial Networks

Generative adversarial networks [12] learn the data distribution p*(x) through a two player adversarial game between a discriminator and a generator.
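Returning to the autoregressive factorization in Equation (1): the ancestral sampling procedure it defines can be sketched in a few lines. This is a minimal NumPy illustration only; the hypothetical `conditional_probs` callable stands in for the recurrent parametrization of p_θ(x_t | x_1, ..., x_{t−1}) and is not part of the paper's model.

```python
import numpy as np

def sample_sequence(conditional_probs, vocab_size, T, rng):
    """Ancestral sampling from an autoregressive model.

    `conditional_probs(prefix)` stands in for p_theta(x_t | x_1, ..., x_{t-1});
    it must return a length-`vocab_size` Categorical distribution.
    """
    prefix = []
    for _ in range(T):
        p = conditional_probs(prefix)        # Categorical over the vocabulary
        token = rng.choice(vocab_size, p=p)  # x_t ~ p_theta(x_t | prefix)
        prefix.append(int(token))
    return prefix

# Toy "model": uniform over a vocabulary of 8 tokens, regardless of the prefix.
rng = np.random.default_rng(0)
tokens = sample_sequence(lambda prefix: np.full(8, 1 / 8), vocab_size=8, T=5, rng=rng)
```

In a real model, `conditional_probs` would run an LSTM over the prefix and apply a softmax over the vocabulary.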
A discriminator D_φ(x) is trained to distinguish between real data and samples from the generator distribution p_θ(x), while the generator is trained to fool the discriminator into identifying its samples as real. The original formulation proposes a min-max optimization procedure using the objective:

min_θ max_φ E_{p*(x)}[log D_φ(x)] + E_{p_θ(x)}[log(1 − D_φ(x))].    (3)

Goodfellow et al. [12] suggested using the alternative generator loss E_{p_θ(x)}[− log D_φ(x)] as it provides better gradients for the generator. Since then, multiple other losses have been proposed [27, 28, 29, 30].

Challenges of learning language GANs arise from the combination of the adversarial learning principle with the choice of an autoregressive model. Learning p_θ(x) = ∏_{t=1}^{T} p_θ(x_t | x_1, ..., x_{t−1}) using equation 3 requires backpropagating through a sampling operation, forcing the language GAN practitioner to choose between high variance, unbiased estimators such as REINFORCE [31], or lower variance, but biased estimators, such as the Gumbel-Softmax trick [32, 33] and other continuous relaxations [13]. Gradient estimation issues compounded with other GAN problems such as mode collapse or training instability [27, 34] led prior work on language GANs to use maximum likelihood pre-training [14, 15, 16, 18, 35, 36]. This is the current preferred approach to train text GANs.

2.3 Learning Signals

To train the generator we use the REINFORCE gradient estimator [31]:

∇_θ E_{p_θ(x)}[R(x)] = E_{p_θ(x)}[R(x) ∇_θ log p_θ(x)],    (4)

where R(x) is provided by the discriminator. By analogy with reinforcement learning, we call R(x) a reward. Setting R(x) = p*(x)/p_θ(x) recovers the MLE estimator in Eq (2), as shown by Che et al.
[17]:

E_{p_θ(x)}[(p*(x)/p_θ(x)) ∇_θ log p_θ(x)] = E_{p*(x)}[∇_θ log p_θ(x)] = ∇_θ E_{p*(x)} log p_θ(x).    (5)

The gradient updates provided by the MLE estimator can be seen as a special case of the REINFORCE updates used in language GAN training. The important difference lies in the fact that for language GANs rewards are learned. Learned discriminators have been shown to be a useful measure of model quality and to correlate with human evaluation [37]. We postulate that learned rewards provide a smoother signal to the generator than the classical MLE loss: the discriminator can learn to generalize and provide a meaningful signal over parts of the distribution not covered by the training data. As training progresses and the signal from the discriminator improves, the generator also explores other parts of data space, providing a natural curriculum, whereas MLE models are only exposed to the dataset.

Adversarial training also enables the use of domain knowledge. Discriminator ensembles where each discriminator is biased to focus on specific aspects of the samples such as syntax, grammar, semantics, or local versus global structure are a promising approach [38]. The research avenues opened by learned rewards and the issues with MLE pre-training motivate our search for a language GAN which does not make use of maximum likelihood pre-training.

3 Training Language GANs from Scratch

To achieve the goal of training a language GAN from scratch, we tried different loss functions and architectures, various reward structures and regularization methods, ensembles, and other modifications. Most of these approaches did not succeed or did not result in any significant gains.
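The REINFORCE estimator of Equation (4) can be checked numerically for a single Categorical "token". The sketch below is illustrative only: `reward_fn` is a toy stand-in for the discriminator reward, not the paper's model.

```python
import numpy as np

def reinforce_grad(logits, reward_fn, n_samples, rng):
    """Monte Carlo REINFORCE estimate of d/d(logits) E_{x ~ softmax(logits)}[R(x)]."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = np.zeros_like(logits)
    for _ in range(n_samples):
        x = rng.choice(len(logits), p=probs)
        # Score function: d log p(x) / d logits = onehot(x) - probs.
        score = -probs.copy()
        score[x] += 1.0
        grad += reward_fn(x) * score
    return grad / n_samples

rng = np.random.default_rng(0)
logits = np.zeros(3)  # uniform Categorical over 3 tokens
# Reward 1 for token 2, else 0: the estimate should push logits towards token 2.
g = reinforce_grad(logits, lambda x: float(x == 2), n_samples=50_000, rng=rng)
```

For this toy case the exact gradient of E[R] = p_2 with respect to the logits is [−1/9, −1/9, 2/9], which the Monte Carlo estimate approaches as the sample count grows; the large-batch strategy of Section 3.2 exploits exactly this variance reduction.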
Via this extensive experimentation we found that the key ingredients to train language GANs from scratch are: a recurrent discriminator used to provide dense rewards at each time step, large batches for variance reduction, and discriminator regularization. We describe the generator architecture and reward structure we found effective in Figure 1 and provide a list of other techniques we tried but which proved unsuccessful or unnecessary in Appendix C.

Table 1: BLEU-5 and Self-BLEU-5 metrics for a 5-gram model.

MODEL           BLEU-5   SBLEU-5
KNESER-NEY      20.67    19.73
TRAINING DATA   20.73    20.73

Figure 1: ScratchGAN architecture and reward structure.

3.1 Dense Rewards

Our ultimate goal is to generate entire sequences, so we could train a discriminator to distinguish between complete data sequences and complete sampled sequences, with the generator receiving a reward only after generating a full sequence. However, in this setting the generator would get no learning signal early in training, when generated sentences can easily be determined to be fake by the discriminator. We avoid this issue by instead training a recurrent discriminator which provides rewards for each generated token [35].
The discriminator D_φ learns to distinguish between sentence prefixes coming from real data and sampled sentence prefixes:

max_φ ∑_{t=1}^{T} E_{p*(x_t|x_1,...,x_{t−1})}[log D_φ(x_t | x_1, ..., x_{t−1})] + ∑_{t=1}^{T} E_{p_θ(x_t|x_1,...,x_{t−1})}[log(1 − D_φ(x_t | x_1, ..., x_{t−1}))]

While a sequential discriminator is potentially harder to learn than sentence based feed-forward discriminators, it is computationally cheaper than approaches that use Monte Carlo Tree Search to score partial sentences [14, 15, 18] and has been shown to perform better empirically [19].

For a generated token x̂_t ∼ p_θ(x_t | x̂_1, ..., x̂_{t−1}), the reward provided to the ScratchGAN generator at time step t is:

r_t = 2 D_φ(x̂_t | x̂_1, ..., x̂_{t−1}) − 1    (6)

Rewards scale linearly with the probability the discriminator assigns to the current prefix pertaining to a real sentence. Bounded rewards help stabilize training.

The goal of the generator at timestep t is to maximize the sum of discounted future rewards using a discount factor γ:

R_t = ∑_{s=t}^{T} γ^{s−t} r_s    (7)

Like ScratchGAN, SeqGAN-step [19] uses a recurrent discriminator to provide rewards per time step to a generator trained using policy gradient for unsupervised word level text generation.
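The reward shaping of Equations (6) and (7) amounts to two small transforms, sketched here under the assumption that the discriminator outputs per-step probabilities in [0, 1]:

```python
import numpy as np

def per_step_rewards(disc_probs):
    """Eq. (6): map D(x_t | prefix) in [0, 1] to bounded rewards in [-1, 1]."""
    return 2.0 * np.asarray(disc_probs, dtype=float) - 1.0

def discounted_returns(rewards, gamma):
    """Eq. (7): R_t = sum_{s=t}^T gamma^(s-t) r_s, accumulated right-to-left."""
    returns = np.zeros_like(rewards, dtype=float)
    acc = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        acc = rewards[t] + gamma * acc
        returns[t] = acc
    return returns

r = per_step_rewards([1.0, 0.5, 0.0])  # -> [1.0, 0.0, -1.0]
R = discounted_returns(r, gamma=0.5)   # R_2 = -1.0, R_1 = -0.5, R_0 = 0.75
```

The right-to-left accumulation computes all T returns in a single pass instead of T nested sums.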
Unlike SeqGAN-step, our model is trained from scratch using only the adversarial objective, without any maximum likelihood pretraining.

3.2 Large Batch Sizes for Variance Reduction

The ScratchGAN generator parameters θ are updated using Monte Carlo estimates of policy gradients (Equation 4), where N is the batch size:

∇_θ = ∑_{n=1}^{N} ∑_{t=1}^{T} (R^n_t − b_t) ∇_θ log p_θ(x̂^n_t | x̂^n_1, ..., x̂^n_{t−1}),    x̂^n_t ∼ p_θ(x_t | x̂^n_1, ..., x̂^n_{t−1})

A key component of ScratchGAN is the use of large batch sizes to reduce the variance of the gradient estimation, exploiting the ability to cheaply generate experience by sampling from the generator.

(a) Negative BLEU-5 versus Self-BLEU-5.    (b) Language- and reverse language-model scores.

Figure 2: BLEU scores on EMNLP2017 News (left) and language model scores on Wikitext-103 (right). For BLEU scores, left is better and down is better. LeakGAN, MaliGAN, RankGAN and SeqGAN results from Caccia et al. [20].

To further reduce the gradient variance, ScratchGAN uses a global moving-average of rewards as the baseline b_t [39], as we empirically found it improves performance for certain datasets.

Providing rewards only for the sampled token, as in the equation above, results in a substantial training speed boost compared to methods that use p_θ(x_t | x̂^n_1, ..., x̂^n_{t−1}) to provide rewards for each token in the vocabulary, in order to reduce variance and provide a richer learning signal. These methods score all prefixes at time t and thus scale linearly with vocabulary size [35].

3.3 Architectures and Discriminator Regularization

The ScratchGAN discriminator and generator use an embedding layer followed by one or more LSTM layers [23]. For the embedding layer, we have experimented with training the embeddings from scratch, as well as using pre-trained GloVe embeddings [40] concatenated with learned embeddings. When GloVe embeddings are used, they are shared by the discriminator and the generator, and kept fixed during training.

Discriminator regularization in the form of layer normalization [41], dropout [42] and L2 weight decay provides a substantial performance boost to ScratchGAN. Our findings align with prior work which showed the importance of discriminator regularization on image GANs [34, 43, 44].

Despite using a recurrent discriminator, we also provide the discriminator with positional information by concatenating a fixed sinusoidal signal to the word embeddings used in the discriminator [5]. We found this necessary to ensure that the sentence length distribution obtained from generator samples matches that of the training data.
Ablation experiments are provided in Appendix G.

4 Evaluation Metrics

Evaluating text generation remains challenging, since no single metric is able to capture all desired properties: local and global consistency, diversity and quality, as well as generalization beyond the training set. We follow Semeniuta et al. [19] and Caccia et al. [20] in the choice of metrics. We use n-gram based metrics to capture local consistency, Fréchet Distance to measure distances to real data in embedding space, and language model scores to measure the quality-diversity trade-off. To show our model is not trivially overfitting we look at nearest neighbors in data and embedding space.

4.1 n-gram based Metrics

BLEU [45] and Self-BLEU [22] have been proposed as measures of quality and diversity, respectively. BLEU based metrics capture local consistency and detect relatively simple problems with syntax, but do not capture semantic variation [19, 46].

We highlight the limitations of BLEU metrics by training a 5-gram model with Kneser-Ney smoothing [47] on EMNLP2017 News and measuring its BLEU score. The results are reported in Table 1. The 5-gram model scores close to perfect according to the BLEU-5 metric although its samples are qualitatively very poor (see Table 10 in the Appendix). In the rest of the paper we report BLEU-5 and Self-BLEU-5 metrics to compare with prior work, and complement them with metrics that capture global consistency, like Fréchet Distance.

(a) Wikitext-103.    (b) FED vs softmax temperature.    (c) ScratchGAN ablation study.

Figure 3: FED scores. Lower is better. EMNLP2017 News results unless otherwise specified.

4.2 Fréchet Embedding Distance

Semeniuta et al. [19] proposed the Fréchet InferSent Distance (FID), inspired by the Fréchet Inception Distance used for images [48]. The metric computes the Fréchet distance between two Gaussian distributions fitted to data embeddings and model sample embeddings, respectively. Semeniuta et al. [19] showed that the metric is not sensitive to the choice of embedding model; they use InferSent for model evaluation, while we use a Universal Sentence Encoder [49]². We call the metric Fréchet Embedding Distance to clarify that we use a different embedding model from Semeniuta et al. [19].

The Fréchet Embedding Distance (FED) offers several advantages over BLEU-based metrics, as highlighted in Semeniuta et al. [19]: it captures both quality and diversity; it captures global consistency; it is faster and simpler to compute than BLEU metrics; it correlates with human evaluation; it is less sensitive to word order than BLEU metrics; and it is empirically proven useful for images.

We find that the Fréchet Embedding Distance provides a useful metric to optimize during model development, and we use it to choose the best models. However, we notice that FED also has drawbacks: it can be sensitive to sentence length, and we avoid this bias by ensuring that all compared models match the sentence length distribution of the data (see details in Appendix E).

4.3 Language Model Scores

Caccia et al. [20] proposed evaluating the quality of generated model samples using a language model (Language Model score, LM), as well as training a language model on the generated samples and scoring the original data with it (Reverse Language Model score, RLM).
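For two sets of sentence embeddings, the Fréchet distance of Section 4.2 has the closed form d² = |μ₁ − μ₂|² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). A NumPy sketch follows; the embedding model itself is not included, and the symmetrized matrix square root used below is an equivalent form of the cross term.

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """Squared Frechet distance between Gaussians fitted to two embedding sets.

    d^2 = |mu_a - mu_b|^2 + Tr(S_a + S_b - 2 (S_b^{1/2} S_a S_b^{1/2})^{1/2})
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    s_a = np.cov(emb_a, rowvar=False)
    s_b = np.cov(emb_b, rowvar=False)

    def psd_sqrt(m):
        # Symmetric PSD square root via eigendecomposition.
        w, v = np.linalg.eigh(m)
        return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

    rb = psd_sqrt(s_b)
    covmean = psd_sqrt(rb @ s_a @ rb)
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(s_a + s_b - 2.0 * covmean))

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 4))  # stand-in for sentence embeddings
d_same = frechet_distance(x, x)  # identical sets: distance ~ 0
```

Shifting one embedding set by a constant vector changes the distance by exactly the squared norm of the shift, which makes the metric easy to sanity-check.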
LM measures sample quality: bad samples score poorly under a language model trained on real data. RLM measures sample diversity: real data scores poorly under a language model trained on samples which lack diversity. While insightful, this evaluation criterion relies on training new models, and hence the results can depend on the evaluator architecture. The metric could also have an inherent bias favoring language models, since they are trained using the same criterion.

5 Experimental Results

We use two datasets, EMNLP2017 News³ and Wikitext-103 [50]. We use EMNLP2017 News to compare with prior work [15, 20], but note that this dataset has limitations: a small vocabulary (5.7k words), no out-of-vocabulary tokens, a sentence length limited to 50 tokens, and a size of only 300k sentences. Wikitext-103 is a large scale dataset of almost 4 million sentences that captures more of the statistical properties of natural language and is a standard benchmark in language modeling [51, 52]. For Wikitext-103 we use a vocabulary of 20k words. In Wikitext-103 we remove sentences with fewer than 7 tokens or more than 100 tokens. All our models are trained on individual sentences, using an NVIDIA P100 GPU.

²The model can be found at https://tfhub.dev/google/universal-sentence-encoder/2
³http://www.statmt.org/wmt17/

Table 2: EMNLP2017 News perplexity.

Model        Word-level perplexity
Random       5725
ScratchGAN   154
MLE          42

Figure 4: Matching n-grams in EMNLP2017.

In all our experiments, the baseline maximum likelihood trained language model is a dropout-regularized LSTM.
Model architectures, hyperparameters, regularization and experimental procedures for the results below are detailed in Appendix D. Samples from ScratchGAN can be seen in Appendix H, alongside data and MLE samples.

5.1 Quality and Diversity

As suggested in Caccia et al. [20], we measure the diversity-quality trade-off of different models by changing the softmax temperature at sampling time. Reducing the softmax temperature below 1 results in higher quality but less diverse samples, while increasing it results in samples closer and closer to random. Reducing the temperature for a language GAN is similar to the "truncation trick" used in image GANs [43]. We compute all metrics at different temperatures.

ScratchGAN shows improved local consistency compared to existing language GANs and significantly reduces the gap between language GANs and maximum likelihood language models. Figure 2a reports negative BLEU-5 versus Self-BLEU-5 metrics on EMNLP2017 News for ScratchGAN and other language GANs, as reported in Caccia et al. [20].

ScratchGAN improves over an MLE trained model on WikiText-103 according to FED, as shown in Figure 3a. This suggests that ScratchGAN is more globally consistent and better captures semantic information. Figure 3b shows the quality-diversity trade-off as measured by FED as the softmax temperature changes. ScratchGAN performs slightly better than the MLE model on this metric. This contrasts with the Language Model Score-Reverse Language Model scores shown in Figure 2b, which suggest that MLE samples are more diverse. Similar results on EMNLP2017 News are shown in Appendix A.

Unlike image GANs, ScratchGAN learns an explicit model of the data, namely an autoregressive explicit model of language. This allows us to compute model perplexities on validation data by feeding the model ground truth at each step. We report ScratchGAN and MLE perplexities on EMNLP2017 News in Table 2.
Evaluating perplexity favors the MLE model, which is trained to minimize perplexity and thus has an incentive to spread mass around the data distribution to avoid being penalized for not explaining training instances [53]. ScratchGAN, in contrast, is penalized by the discriminator when deviating from the data manifold and thus favors quality over diversity. Improving sample diversity, together with avoiding underfitting by improving grammatical and local consistency, is required in order to further decrease the perplexity of ScratchGAN to match that of MLE models.

Our diversity and quality evaluation across multiple metrics shows that compared to the MLE model, ScratchGAN trades off local consistency to achieve slightly better global consistency.

5.2 Nearest Neighbors

A common criticism of GAN models is that they produce realistic samples by overfitting to the training set, e.g. by copying text snippets. For a selection of ScratchGAN samples we find and present the nearest neighbors present in the training set. We consider two similarity measures: a 3-gram cosine similarity, to capture copied word sequences, and a cosine similarity from embeddings produced by the Universal Sentence Encoder, to capture semantically similar sentences. In Table 5 in Appendix B we display a selection of four random samples and the corresponding top three closest training set sentences with respect to each similarity measure, and see that the training text snippets have a mild thematic correspondence but distinct phrasing and meaning. Additionally, we perform a quantitative analysis over the full set of samples: we compare the longest matching n-grams between text from the training set and (a) ScratchGAN samples, (b) MLE samples, and (c) text from the validation set.
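The longest-matching-n-gram comparison can be sketched as follows; this set-based implementation is an illustration of the idea, not necessarily the exact procedure used for Figure 4.

```python
def longest_matching_ngram(sample_tokens, train_ngrams, max_n):
    """Length of the longest n-gram of `sample_tokens` found in the training set.

    `train_ngrams` maps each n to the set of n-grams (tuples) in the training text.
    """
    for n in range(min(max_n, len(sample_tokens)), 0, -1):
        grams = {tuple(sample_tokens[i:i + n])
                 for i in range(len(sample_tokens) - n + 1)}
        if grams & train_ngrams.get(n, set()):
            return n  # longest copied run found
    return 0

# Toy corpus and sample.
train = "the cat sat on the mat".split()
train_ngrams = {n: {tuple(train[i:i + n]) for i in range(len(train) - n + 1)}
                for n in range(1, 5)}
sample = "a cat sat on a chair".split()
# Longest copied run is "cat sat on", a 3-gram.
```

Aggregating this statistic over all samples yields a histogram like the one in Figure 4; a sample that copied a long training snippet would show up in the right tail.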
In Figure 4 we see fewer ScratchGAN samples with long matching n-grams (n ≥ 5) in comparison with MLE samples and text from the validation set. We conclude that the generator is producing genuinely novel sentences, although they are not always grammatically or thematically consistent.

Table 3: FED on EMNLP2017 News.

Model                          FED
SeqGAN-step (no pretraining)   0.084
ScratchGAN                     0.015

Table 4: FED sensitivity on EMNLP2017 News.

Variation             FED
Hyperparameters       0.021 ± 0.0056
Seeds (best hypers)   0.018 ± 0.0008

5.3 Ablation Study and SeqGAN-step Comparison

We show the relative importance of individual features of ScratchGAN with an ablation study in Figure 3c. We successively add all elements that appear important to ScratchGAN performance, namely large batch size, discriminator regularization (L2 weight decay, dropout, and layer normalization), pre-trained embeddings, and a value baseline for REINFORCE. The increase in batch size results in the most significant performance boost, due to the reduction in gradient variance and the stabilizing effect on adversarial dynamics. Discriminator regularization also leads to substantial performance gains, as it ensures the discriminator is not memorizing the training data and thus provides a smoother learning signal for the generator.

The baseline model in Figure 3c is a SeqGAN-step like model [14] without pretraining. To highlight the improvement of ScratchGAN compared to prior work, we show in Table 3 the FED difference between the two models.

5.4 Training Stability

Despite the high variance of REINFORCE gradients and the often unstable GAN training dynamics, our training procedure is very stable, due to the use of large batch sizes and the chosen reward structure.
Table 4 reports the FED scores for ScratchGAN models trained with hyperparameters sampled from a large volume of hyperparameter space, as well as across 50 random seeds. The low variance across hyperparameters shows that ScratchGAN is not sensitive to changes in learning rate, REINFORCE discount factor, regularization or LSTM feature sizes, as long as these are kept in a reasonable range. The full hyperparameter sweep performed to obtain the variance estimates is described in Appendix F. When we fixed hyperparameters and repeated an experiment across 50 seeds, we obtained very similar FED scores; no divergence or mode collapse occurred in any of the 50 runs. For WikiText-103, the results are similar (0.055 ± 0.003).

6 Related Work

Our work expands on the prior work of discrete language GANs, which opened up the avenues to this line of research. Methods which use discrete data have proven to be more successful than methods using continuous relaxations [19], but face their own challenges, such as finding the right reward structure and reducing gradient variance. Previously proposed solutions include: receiving dense rewards via Monte Carlo Search [14, 15, 18] or a recurrent discriminator [19, 35], leaking information from the discriminator to the generator [15], using actor critic methods to reduce variance [35], using ranking or moment matching to provide a richer learning signal [16, 18], and curriculum learning [35]. Despite somewhat alleviating these problems, all of the above methods require pre-training, sometimes together with teacher forcing [17] or interleaved supervised and adversarial training [15]. Nie et al. [36] recently showed that language GANs can benefit from complex architectures such as Relation Networks [54]. Their RelGAN model can achieve better than random results without supervised pre-training, but still requires pre-training to achieve results comparable to MLE models. Press et al.
[55] is perhaps the closest to our work: they train a character-level GAN without pre-training. Unlike Press et al. [55], ScratchGAN is a word-level model and does not require teacher helping, curriculum learning, or continuous relaxations during training. Importantly, we have performed an extensive evaluation to quantify the performance of ScratchGAN, as well as measured overfitting using multiple metrics, beyond 4-gram matching.

By learning reward signals through the use of discriminators, our work is in line with recent imitation learning work [56], as well as training non-differentiable generators [57].

7 Discussion

Existing language GANs use maximum likelihood pretraining to minimize adversarial training challenges, such as unstable training dynamics and high variance gradient estimation. However, they have shown little to no performance improvement over traditional language models, likely due to constraining the set of possible solutions to be close to those found by maximum likelihood. We have shown that large batch sizes, dense rewards, and discriminator regularization remove the need for maximum likelihood pre-training in language GANs. To the best of our knowledge, we are the first to use Generative Adversarial Networks to train word-level language models successfully from scratch. Removing the need for maximum likelihood pretraining in language GANs opens up a new avenue of language modeling research, with future work exploring GANs with one-shot feed-forward generators and specialized discriminators which distinguish different features of language, such as semantics and syntax, or local and global structure. Borrowing from the success of GANs for image generation [43], another promising avenue is to use powerful neural network architectures [5, 54] to improve ScratchGAN.

We have measured the quality and diversity of ScratchGAN samples using BLEU metrics, Fréchet distance, and language model scores.
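The embedding-based Fréchet distance between a real and a generated sample set reduces to the Fréchet distance between Gaussians fitted to each set of sentence embeddings. A minimal NumPy/SciPy sketch, assuming the embedding matrices are precomputed (the function name is ours; this illustrates the metric itself, not the paper's exact FED pipeline, which uses Universal Sentence Encoder embeddings [49]):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_embedding_distance(real_emb, gen_emb):
    """Frechet distance between Gaussians fitted to two embedding sets.

    real_emb, gen_emb: [num_sentences, dim] arrays of sentence embeddings.
    Computes ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).
    """
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    s_r = np.cov(real_emb, rowvar=False)
    s_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(s_r @ s_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(s_r + s_g - 2.0 * covmean))
```

Identical sample sets give a distance near zero; shifts in the mean or covariance of the generated embeddings increase it.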
None of these metrics is sufficient to evaluate language generation: we have shown that BLEU metrics only capture local consistency; language model scores do not capture semantic similarity; and while the embedding-based Fréchet distance is a promising global consistency metric, it is sensitive to sentence length. Until new ways to assess language generation are developed, current metrics need to be used together to compare models.

8 Acknowledgments

We would like to thank Chris Dyer, Oriol Vinyals, Karen Simonyan, Ali Eslami, David Warde-Farley, Siddhant Jayakumar and William Fedus for thoughtful discussions.

References

[1] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[2] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.

[3] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.

[4] Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D Trippe, Juan B Gutierrez, and Krys Kochut. Text summarization techniques: a brief survey. arXiv preprint arXiv:1707.02268, 2017.

[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[6] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling.
arXiv preprint arXiv:1602.02410, 2016.

[7] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1:8, 2019.

[8] Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.

[9] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.

[10] Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

[11] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.

[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

[13] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In NIPS, 2017.

[14] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. 2017.

[15] Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624, 2017.

[16] Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850, 2017.

[17] Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks.
arXiv preprint arXiv:1702.07983, 2017.

[18] Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pages 3155–3165, 2017.

[19] Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. On accurate evaluation of gans for language generation. arXiv preprint arXiv:1806.04936, 2018.

[20] Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. Language gans falling short. CoRR, abs/1811.02549, 2018. URL http://arxiv.org/abs/1811.02549.

[21] Guy Tevet, Gavriel Habib, Vered Shwartz, and Jonathan Berant. Evaluating text gans as language models. arXiv preprint arXiv:1810.12686, 2018.

[22] Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. arXiv preprint arXiv:1802.01886, 2018.

[23] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[24] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[25] Claude E Shannon. Prediction and entropy of printed english. Bell System Technical Journal, 30(1):50–64, 1951.

[26] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.

[27] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In ICML, 2017.

[28] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks.
arXiv preprint arXiv:1611.04076, 2016.

[29] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734, 2018.

[30] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[31] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

[32] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

[33] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

[34] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: Gans do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446, 2017.

[35] William Fedus, Ian Goodfellow, and Andrew M Dai. Maskgan: Better text generation via filling in the ______. arXiv preprint arXiv:1801.07736, 2018.

[36] Weili Nie, Nina Narodytska, and Ankit Patel. RelGAN: Relational generative adversarial networks for text generation. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJedV3R5tm.

[37] Anjuli Kannan and Oriol Vinyals. Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198, 2017.

[38] Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. Learning to write with cooperative discriminators. arXiv preprint arXiv:1805.06087, 2018.

[39] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 2018.

[40] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation.
In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.

[41] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[42] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[43] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[44] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

[45] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.

[46] Ehud Reiter. A structured review of the validity of bleu. Computational Linguistics, pages 1–12, 2018.

[47] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In ICASSP, volume 1, pages 181–184, 1995.

[48] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.

[49] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.

[50] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.
arXiv preprint arXiv:1609.07843, 2016.

[51] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.

[52] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

[53] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

[54] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.

[55] Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. Language generation with recurrent generative adversarial networks without pre-training. arXiv preprint arXiv:1706.01399, 2017.

[56] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[57] Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, SM Eslami, and Oriol Vinyals. Synthesizing programs for images using reinforced adversarial learning. arXiv preprint arXiv:1804.01118, 2018.

[58] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.