{"title": "Towards Text Generation with Adversarially Learned Neural Outlines", "book": "Advances in Neural Information Processing Systems", "page_first": 7551, "page_last": 7563, "abstract": "Recent progress in deep generative models has been fueled by two paradigms -- autoregressive and adversarial models. We propose a combination of both approaches with the goal of learning generative models of text. Our method first produces a high-level sentence outline and then generates words sequentially, conditioning on both the outline and the previous outputs. We generate outlines with an adversarial model trained to approximate the distribution of sentences in a latent space induced by general-purpose sentence encoders. This provides strong, informative conditioning for the autoregressive stage. Our quantitative evaluations suggest that conditioning information from generated outlines is able to guide the autoregressive model to produce realistic samples, comparable to maximum-likelihood trained language models, even at high temperatures with multinomial sampling. Qualitative results also demonstrate that this generative procedure yields natural-looking sentences and interpolations.", "full_text": "Towards Text Generation with Adversarially Learned Neural Outlines

Sandeep Subramanian1,2,4*, Sai Rajeswar1,2,5, Alessandro Sordoni4, Adam Trischler4, Aaron Courville1,2,6, Christopher Pal1,3,5
1Montréal Institute for Learning Algorithms, 2Université de Montréal, 3École Polytechnique de Montréal, 4Microsoft Research Montréal, 5Element AI, Montréal, 6CIFAR Fellow
2{sandeep.subramanian.1,sai.rajeswar.mudumba,aaron.courville}@umontreal.ca
3christopher.pal@polymtl.ca
4{alsordon,adam.trischler}@microsoft.com

Abstract

Recent progress in deep generative models has been fueled by two paradigms -- autoregressive and adversarial models.
We propose a combination of both approaches with the goal of learning generative models of text. Our method first produces a high-level sentence outline and then generates words sequentially, conditioning on both the outline and the previous outputs. We generate outlines with an adversarial model trained to approximate the distribution of sentences in a latent space induced by general-purpose sentence encoders. This provides strong, informative conditioning for the autoregressive stage. Our quantitative evaluations suggest that conditioning information from generated outlines is able to guide the autoregressive model to produce realistic samples, comparable to maximum-likelihood trained language models, even at high temperatures with multinomial sampling. Qualitative results also demonstrate that this generative procedure yields natural-looking sentences and interpolations.

1 Introduction

Deep neural networks are powerful tools for modeling sequential data [36, 54, 24]. Tractable maximum-likelihood (MLE) training of these models typically involves factorizing the joint distribution over random variables, via the chain rule, into a product of conditional distributions that model the one-step-ahead probability in the sequence. Each conditional is then modeled by an expressive family of functions, such as neural networks. These models have been successful in a variety of tasks. However, the only source of variation is modeled in the conditional output probability at every step: there is limited capacity for capturing the higher-level structure likely present in natural text and other sequential data (e.g., through a hierarchical generation process [46]).

Variational Autoencoders (VAEs) [28] provide a tractable method to train hierarchical latent-variable generative models.
In the context of text data, latent variables may assume the role of sentence representations that govern a lower-level generation process, thus facilitating controlled generation of text. However, VAEs for text are notoriously hard to train when combined with powerful autoregressive decoders [5, 18, 47]. This is due to the "posterior collapse" problem: the model ends up relying solely on the autoregressive properties of the decoder while ignoring the latent variables, which become uninformative. This phenomenon is partly a consequence of the restrictive assumptions on the parametric form of the posterior and prior approximations, usually modeled as simple diagonal Gaussian distributions.

*Work done while the author was an intern at Microsoft Research Montreal

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Overall model architecture. (Left) A GAN setup that is trained to model the distribution of fixed-length sentence vectors. A minimal amount of noise, indicated by a small Gaussian ball, is injected at every point on the data manifold. (Right) Samples produced from our generative process by interpolating linearly between two points sampled at random from the input noise space of the generator.

General-purpose sentence encoders have been shown to produce representations that are useful across a wide range of natural language processing (NLP) tasks [29, 12, 51]. Such models seem to capture syntactic and semantic properties of text in self-contained vector representations [51, 10]. Although these representations have been used successfully in downstream tasks, their usefulness for text generation per se has not yet been thoroughly investigated, to the best of our knowledge.

In this paper, we study how pre-trained sentence representations, obtained by general-purpose sentence encoders, can be leveraged to train better models for text generation.
In particular, we interpret\nthe set of sentence embeddings obtained from training on large text corpora as samples from an\nexpressive and unknown prior distribution over a continuous space. We model this distribution using\na Generative Adversarial Network (GAN), which enables us to build a fully generative model of\nsentences as follows: we \ufb01rst sample a sentence embedding from the GAN generator, then decode it\nto the observed space (words) using a conditional GRU-based language model (which we refer to as\nthe decoder in the rest of this work). The sentence embeddings produced by the learned generator\ncan be seen as \u201cneural outlines\u201d that provide high-level information to the decoder, which in turn can\ngenerate many possible sentences given a single latent representation. The conditioning information\neffectively guides the decoder to a smaller space of possible generations.\nThe takeaways from our work are:\n\n\u2022 We propose the use of \ufb01xed-length representations induced by general-purpose sentence\nencoders for training generative models of text, and demonstrate their potential both qualita-\ntively and quantitatively.\n\n\u2022 We extend our model to conditional text generation. Speci\ufb01cally, we train a conditional\nGAN that learns to transform a given hypothesis representation in the sentence embedding\nspace into a premise embedding that satis\ufb01es a speci\ufb01ed entailment relationship.\n\n\u2022 We propose a gradient-based technique to generate meaningful interpolations between\ntwo given sentence embeddings. 
This technique may navigate around "holes" in the data manifold by moving through areas of high data density, using gradient signals from the decoder. Qualitatively, the obtained interpolations appear more natural than those obtained by linear interpolation when the distribution (in our case, the induced sentence embedding distribution) does not have a simple parametric form such as a Gaussian.

2 Related Work

Our work is similar in spirit to the line of work that employs adversarial training on a latent representation of data, such as Adversarial Autoencoders (AAE) [33], Wasserstein Autoencoders (WAE) [53] and Adversarially Regularized Autoencoders (ARAE) [26]. AAEs and WAEs are similar to Variational Autoencoders (VAE) [28] in that they learn an encoder that produces an approximate latent posterior distribution qφ(z|x), and a decoder pθ(x|z) that is trained to minimize the reconstruction error of x. They differ in the way they are regularized. VAEs penalize the discrepancy between the approximate posterior and a prior distribution, DKL(qφ(z|x) ‖ p(z)), which controls the tightness of the variational bound on the data log-likelihood.
AAEs and WAEs, on the other hand, adversarially match the\naggregated or marginal posterior, q\u03c6(z), with a \ufb01xed prior distribution that shapes the posterior into\nwhat is typically a simple distribution, such as a Gaussian.\nARAEs train, by means of a critic, a \ufb02exible prior distribution regularizing the sentence representations\nobtained by an auto-encoder.\nIn contrast, we provide evidence that assuming a uniform prior\ndistribution during the training of sentence representations and \ufb01tting a \ufb02exible prior a posteriori\nover the obtained representations yields better performance than learning both jointly, and helps us\nscale to longer sequences, larger vocabularies, and richer datasets.\nRecent successes of generative adversarial networks in the image domain [25] have motivated their\nuse in modeling discrete sequences like text. However, discreteness introduces problems in end-\nto-end training of the generator. Policy gradient techniques [57] are one way to circumvent this\nproblem, but typically require maximum-likelihood pre-training [58, 6], as do actor-critic methods\n[17, 15]. Gumbel-softmax based approaches have also proven useful in the restricted setting of short\nsequence lengths and small output vocabulary sizes [31]. Approaches without maximum-likelihood\npre-training that operate on continuous softmax outputs from the generator such as [19, 43, 42] have\nalso shown promise, but apply mostly in arti\ufb01cial settings. Adversarial training on the continuous\nhidden states of an RNN was used by [48] for unaligned style transfer and unsupervised machine\ntranslation [32]. 
While our approach can potentially be applied to unsupervised machine translation,\nwe believe that high quality sentence representations and resources to learn them in languages other\nthan English, have only recently been explored [13].\nIn this work, we argue that generative adversarial training on latent representations of a sentence\nnot only alleviates the non-differentiability issue of regular GAN training on discrete sequences, but\nalso eases the learning process by instead modeling an already smoothed manifold of the underlying\ndiscrete data distribution. We also simultaneously reap the bene\ufb01ts of a \u201csequence-level\u201d training\nobjective while somewhat side-stepping the temporal credit assignment problem.\n\n3 Approach\n\nIn this section we discuss the building blocks of our generative model. Our model consists of two\ndistinct and independently trainable components: (1) a generative adversarial network trained to\nmatch the distribution of the \ufb01xed-length vector representations induced by general purpose sentence\nencoders, and (2) a conditional RNN language model trained with maximum-likelihood to reconstruct\nthe words in a sentence given its vector representation. The overall architecture is presented in Fig. 1.\n\n3.1 General Purpose Sentence Encoders\n\nWith the success of models that learn distributed representations of words in an unsupervised manner,\nthere has been a recent focus on building models that learn \ufb01xed-length vector representations of\nwhole sentences that are \u201cgeneral purpose\u201d enough to be useful as black-box features across a wide\nrange of downstream Natural Language Processing tasks. 
Extensions of the skip-gram model to sentences [29], sequential de-noising autoencoders [22], features learned from Natural Language Inference models [12], and large-scale multi-task models that are trained with multiple objectives [51] have been shown to learn useful [11, 56] fixed-length representations of text. They also encode several characteristics of a sentence faithfully, including word order, content, and length [10, 1, 51]. This is important, since we want to reliably reconstruct the contents of a sentence from its vector representation. In this work, we will use the pre-trained sentence encoder from [51], which consists of an embedding look-up table learned from scratch and a single-layer GRU [9] with 2048 hidden units. We use the last hidden state of the GRU to create a compressed representation. Thus, each sentence x is represented by a vector E(x) = hx ∈ R^2048, where E denotes our general-purpose sentence encoder.

3.2 Generative Adversarial Training

Generative Adversarial Networks are a family of implicit generative models that formulate the learning process as a two-player minimax game between a generator and a discriminator/critic [16]: the critic is trained to distinguish samples of the true data distribution from those of the generator distribution. The generator is trained to "fool" the critic by approximating the data distribution, given samples from a much simpler distribution, e.g., a Gaussian. The discriminator D or critic fw and the generator G are typically parameterized by neural networks. In our setting, the data distribution is the distribution P(hx) of sentence embeddings, hx = E(x), obtained by applying sentence encoder E to samples x ∼ PD from a dataset D.
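The pipeline so far maps each sentence to a fixed-length vector via a GRU's last hidden state, and these vectors become the "real" data for adversarial training. The sketch below is a minimal, stdlib-only illustration of that encoding step; the tiny dimensions, random weights, and whitespace tokenizer are illustrative stand-ins for the pre-trained 2048-unit encoder of [51], and all names here are ours.

```python
import math
import random

random.seed(0)

HIDDEN = 8  # toy size; the pre-trained encoder of [51] uses 2048 units
EMBED = 6

def vec(n):
    return [random.uniform(-0.1, 0.1) for _ in range(n)]

# Toy parameters: per-gate input (W) and recurrent (U) weights, randomly initialized.
W = {g: [vec(EMBED) for _ in range(HIDDEN)] for g in ("z", "r", "h")}
U = {g: [vec(HIDDEN) for _ in range(HIDDEN)] for g in ("z", "r", "h")}
embed = {}  # word -> embedding, created lazily on first use

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m_i * v_i for m_i, v_i in zip(row, v)) for row in M]

def gru_step(x, h):
    """One GRU update: update gate z, reset gate r, candidate state h_tilde."""
    z = [sigmoid(a + b) for a, b in zip(matvec(W["z"], x), matvec(U["z"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(W["r"], x), matvec(U["r"], h))]
    rh = [r_i * h_i for r_i, h_i in zip(r, h)]
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(W["h"], x), matvec(U["h"], rh))]
    return [(1 - z_i) * h_i + z_i * ht_i for z_i, h_i, ht_i in zip(z, h, h_tilde)]

def encode(sentence):
    """E(x): run the GRU over the tokens and return the last hidden state."""
    h = [0.0] * HIDDEN
    for word in sentence.lower().split():
        if word not in embed:
            embed[word] = vec(EMBED)
        h = gru_step(embed[word], h)
    return h

hx = encode("a man is playing a guitar .")
print(len(hx))  # fixed length, independent of sentence length
```

Sentences of any length land in the same fixed-dimensional space, which is what lets a GAN treat the corpus as a cloud of vectors.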
Specifically, our critic and generator are trained with the objective

min_G max_D V(D, G) = E_{x∼PD}[log D(E(x))] + E_{z∼P(z)}[log(1 − D(G(z)))]

In presenting the Wasserstein GAN, [3] argue that while the KL divergence is a good divergence for low-dimensional data, it is often too strong in the high-dimensional settings we are often interested in modeling. We found training to be more stable using the 1-Wasserstein distance between the two distributions. We follow the setup of [19], which we found leads to more effective and robust training.

3.3 Decoding Generated Sentence Vectors to Words

Given generated or real sentences in their fixed-length latent space, we would like to train a model that maps these vectors back to their respective sentences. To this end, we train a conditional GRU language model that, for each sentence x, conditions on the sentence vector representation hx, as well as previously generated words at each step, to reconstruct x. We parameterize this decoder with an embedding lookup table, a single-layer GRU, and an affine transformation matrix that maps the GRU hidden states to the space of possible output tokens. The GRU is fed the sentence vector at every time step, as described in [51]. The decoder is trained with maximum likelihood.

Given that E is a deterministic function of x ∼ PD, where PD is a discrete distribution in the case of text data, the distribution P(hx) is a discrete distribution embedded in a continuous space. Our sentence embedding generator approximates the high-dimensional discrete distribution P(hx) with a continuous distribution G(z). Therefore, samples from G(z) will likely produce data points that are off-manifold; this may lead to unexpected behavior in the decoder, since it has been trained only on samples from the true distribution P(hx).
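The Wasserstein-with-gradient-penalty training of [19] referred to above can be sketched at the level of its losses. The helpers below operate on precomputed critic scores and gradient norms at interpolated points, so the sketch stays framework-agnostic; the function names and toy numbers are ours, not the paper's.

```python
def critic_loss(scores_real, scores_fake, grad_norms, gp_coef=10.0):
    """WGAN-GP critic loss: maximize E[f(h_real)] - E[f(G(z))], with a
    penalty pushing gradient norms at interpolated points toward 1.
    grad_norms are ||grad f|| evaluated at points between real and fake."""
    mean = lambda xs: sum(xs) / len(xs)
    wasserstein = mean(scores_real) - mean(scores_fake)
    penalty = gp_coef * mean([(g - 1.0) ** 2 for g in grad_norms])
    return -wasserstein + penalty  # minimized by the critic

def generator_loss(scores_fake):
    """The generator tries to raise the critic's score on generated embeddings."""
    return -sum(scores_fake) / len(scores_fake)

# Toy numbers: the critic currently scores real embeddings higher than fakes.
print(critic_loss([1.0, 0.8], [0.1, -0.2], [1.1, 0.9]))  # -0.85
print(generator_loss([0.1, -0.2]))  # 0.05
```

In the full setup the critic would be updated five times per generator update (as stated in Section 3.4), with autograd supplying the interpolated-point gradient norms.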
To encourage better generalization for samples from G(z), during the training of the decoder we smooth the distribution P(hx) by adding a small amount of additive isotropic Gaussian noise to each vector hx.

In Table 1, we demonstrate the impact of noise on the reconstruction BLEU scores and sample quality. We notice that the amount of noise injected introduces a trade-off between reconstruction and sample quality. Specifically, as we increase the amount of noise, the reconstruction BLEU score decreases but samples from a GAN start looking qualitatively better. This aligns with observations made in [14], where VAEs trained with a variance of ∼0.1 had crisp reconstructions but poor sample quality, while those with a variance of 1.0 had blurry reconstructions but better sample quality.

Noise | BLEU-4 | Samples
0.00  | 64.54  | the young men are playing volleyball in the ball .
0.07  | 53.12  | -
0.12  | 48.60  | the young child is playing soccer .
0.17  | 41.16  | -
0.20  | 37.51  | a young child is playing with a ball .

Table 1: The impact of noise on reconstruction quality, measured by tokenized BLEU-4 scores on SNLI [4]. We also show samples from decoders trained with different amounts of noise but always conditioned on the same sentence vector generated from a GAN.

The injection of noise into neural models has been explored in many contexts [2, 55, 40, 50, 41] and has been shown to have important implications in unsupervised representation learning. Theoretical and empirical arguments from Denoising Autoencoders (DAEs) [55] show that the addition of noise leads to robust features that are insensitive to small perturbations of examples on the data manifold. Contractive Autoencoders (CAEs) [44] impose an explicit invariance of the hidden representations to small perturbations in the input by penalizing the norm of the Jacobian of the hidden representations with respect to the inputs.
This was shown to be equivalent to adding additive Gaussian noise to the hidden representations [41], where the variance of the noise is proportional to the contraction penalty in CAEs. Although this was proved to be true only for feedforward autoencoders, we believe it has a similar effect on sequential models like the ones we use in this work.

3.4 Model Architecture

In all experiments, we use the sentence encoder from [51]. In our generator and discriminator, we use 5-layer MLPs with 1024 hidden dimensions and leaky ReLU activation functions. We use the WGAN-GP formulation [19] in all experiments, with 5 discriminator updates for every generator update and a gradient penalty coefficient of 10. Our decoder architecture is identical to the multi-task decoders used in [51]. We trained all models with the Adam [27] stochastic optimization algorithm with a learning rate of 2e-4 and β1 = 0.5, β2 = 0.9. We used a noise radius of 0.12 for experiments involving SNLI and 0.2 for the others.

4 Walking the Latent Space via Gradient-Based Optimization

We explore three different techniques to produce interpolations between sentences. The first, which is presented in Table 5, interpolates linearly in the input noise space of the GAN generator. The second and third techniques, which are presented in this section, interpolate between two given sentences in the sentence encoder latent space, either linearly or via gradient-based optimization. We show that the gradient-based method works on high-dimensional continuous representations of text that are more expressive than the Gaussian typically used in VAEs. We exploit the fact that we have continuous representations on which we can take gradient steps to iteratively transform one sentence into the other.
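Such a gradient walk can be sketched end to end with a toy stand-in for the decoder's log-likelihood. Purely for illustration, we assume a quadratic log P(x2 | h) peaked at the target embedding, so its gradient is available in closed form; in the actual method the gradient signal comes from the trained decoder.

```python
def interpolate(h_start, h_target, alpha=0.1, steps=50):
    """Gradient ascent on a toy log-likelihood log P(x2|h) = -0.5 * ||h - h_target||^2.
    Its gradient is (h_target - h), so each step moves h toward the target,
    mimicking the decoder-driven walk of Section 4 (our illustrative assumption)."""
    h = list(h_start)
    path = [list(h)]
    for _ in range(steps):
        grad = [t - v for t, v in zip(h_target, h)]
        h = [v + alpha * g for v, g in zip(h, grad)]
        path.append(list(h))
    return path

path = interpolate([0.0, 0.0], [1.0, -1.0])
print(path[0], path[-1])  # starts at h_{x1}, ends near h_{x2}
```

With a real decoder, each intermediate h on the path would additionally be decoded to words to read off the interpolation.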
We use our decoder, which maps sentence representations into words, to provide the gradient signal to move the sentence representation of the first sentence towards the second.

Specifically, given two sentences x1 and x2, we formulate the interpolation problem as an optimization that iteratively transforms x1 into x2 by taking gradient steps from hx1 in order to reconstruct x2. We start the optimization process at h0 = hx1 and take gradient steps as follows:

h_t = h_{t−1} + α ∇_{h_{t−1}} log P(x2 | h_{t−1})

log P(x2 | h_{t−1}) is given by our decoder described in Section 3.3. At every step of the optimization process, we can run the sentence representation h_t through our decoder to produce an output sentence. Unlike linear interpolations, this procedure is not guaranteed to be symmetric, i.e., the interpolation path between x1 and x2 might not be the same as the path between x2 and x1. Sample interpolations using this technique are presented in Table 2.

Gradient-based optimization (left):
of course, i had already made coffee and she headed right for the pot.
of course, she had already made coffee.
colin had already made a pot of coffee.
colin pulled out the coffee pot .
colin pulled out the file .
colin pulled out the myers file .
colin pulled out the myers file .
" my mother struggled to make ends meet when i was a child .
" my mother struggled to make ends meet .
" my mother would make ends meet .
" my mother would 've loved you too .
you would 've loved her mother 's child .
you would 've loved her . "
you would 've loved her . "

Linear interpolation in the sentence representation space (right):
of course, i had already made coffee and she headed right for the pot.
of course, i had already made coffee.
i had a lot of things to do .
colin pulled the file out of his pocket .
colin colin pulled colin out of the colin .
colin pulled out the myers file .
" my mother struggled to make ends meet when i was a child .
" my mother struggled to make ends meet when i was a child .
" my mother struggled to make ends meet .
i 'm so sorry , " i said .
i love you , i love you . "
you loved him , would n't you ? "
you would 've loved her . "

Table 2: Interpolations using gradient-based optimization (left) and corresponding linear interpolations directly in the sentence representation space (right). Two randomly selected sentences from the BookCorpus are in bold.

5 Experiments & Results

To evaluate our generative model, we set up unconditional and conditional English sentence generation experiments. We consider the SNLI [4], BookCorpus [59] and WMT15 (English fraction of the En-De parallel corpus) datasets in unconditional text generation experiments. For conditional generation experiments, we consider two different tasks: (1) generating a premise sentence that satisfies a particular relationship (entailment/contradiction/neutral) with a given hypothesis sentence from the SNLI dataset, similar to [49, 30]; and (2) a synthetic/toy task of generating captions (without the image) given a particular binary sentiment, using the SentiCap dataset [34] and product reviews from Amazon [48].

5.1 Unconditional Sentence Generation

In all settings, we partition the dataset into equally sized halves, one on which we train our generative model (GAN & Decoder) and the other for evaluation. The large holdout set is a result of the nature of our evaluation setup, described in the next section. In Tables 4 and 5, we present samples and interpolations produced from our model on three datasets. Additionally, we also trained an InfoGAN model [8] with a 10-dimensional latent categorical variable on the BookCorpus.
As shown in Appendix Table 5, the latent variable is able to pick up on some of the frequent factors of variation in the dataset, such as the subject of the sentence, punctuation, quotes, etc.

In Table 6, we compare unconditional samples from ARAEs2, a state-of-the-art LSTM language model3 [35], and our model. We experimented with different hyperparameter configurations for the ARAE to increase model capacity but found that the default parameters gave us the best results. For [35], we use the default hyperparameters without Averaged SGD, since we noticed that it did not have an impact on results.

Our model with beam search is competitive with the WD-LSTM at low temperatures (0.5) in terms of sample quality and diversity, but is able to maintain quality even at high temperatures (1.0). We also outperform the ARAE on the benchmarks. All experiments were performed with a vocabulary size of 80,000 words and a maximum sequence length of 50, except for the ARAE model trained on SNLI, where we used the pre-trained models provided in the official code repository.

Table 3: Dataset Statistics

Dataset    | Sentences | Tokens
SNLI       | 1.1M      | 12.2M
BookCorpus | 12M       | 159.5M
WMT15      | 4.5M      | 117.2M

BookCorpus samples:
1. the room was nicely decorated and the two of them were very comfortable and the bathroom was fantastic .
2. all of the information was gathered from the police or the court of justice in the united states .
3. we are working with the elders to tell the story of the ancient egyptian stories of the past .
4. all of our doctors , nurses , and other health care providers have been waiting for me .
5. and this is why it is so important that the health care system be fully understood .
6. this is going to be a good way to get a glimpse of the new york city council .
7. " what 's going on with you ?
\u201d\ni shook my head , not trusting myself .\ni was too tired to think about it .\n\u201c yes , \u201d he said , nodding .\nin the mid-1980s , he was appointed as a member of the court of human rights in afghanistan .\ndo you have any other ideas about cooperation between the european union and other countries in the world ?\nsecondly , i am not happy to see that the countries of the european union are in agreement .\nthe main objective of this study is to promote the development of a more comprehensive and accessible information society .\nwe\u2019ve been looking forward to welcoming you to the beach , with a view of the sea .\nbut it is clear that the west and east of the country are not yet fully committed .\ni would like to point out to the house that there are some amendments to the \ufb01sheries act .\nhealth and education , research and development are a major factor in the development of health education programs .\ni therefore ask the commission to cooperate fully with the commission and to parliament to approve this report .\ni hope that the next step will be to ensure that this agreement is maintained in the eu .\n\nTable 4: Generated samples from our model trained on the BookCorpus (top) and WMT15 (bottom)\n\n5.2 Towards Quantitatively Evaluating Unconditional Sentence Samples\n\nEvaluating implicit generative models such as GANs is still an active area of research, with several\npieces of work focusing on evaluating image samples [45, 21]. In this section, we revisit the evaluation\nmethod that was originally proposed in [16], of \ufb01tting a non-parametric kernel density estimator\n(KDE) on the samples produced by a GAN and then evaluating the likelihood of real examples under\nthis KDE. As pointed out in [52], KDEs seldom do a good job of capturing the underlying density,\nsince they do not scale well with data. 
However, when the underlying data distribution is discrete,\ncount-based non-parametric models such as smoothed n-gram language models [20] are extremely\n\n2Of\ufb01cial code from https://github.com/jakezhaojb/ARAE\n3https://github.com/salesforce/awd-lstm-lm\n\n6\n\n\f1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\na girl on a stage holding a guitar .\na girl on a stage holding a scythe .\nthe girl in the sweat shirt plays a guitar on a stage .\nthe girl in the sweat vest is reading a newspaper .\na woman in a white shirt is taking a shortcut .\na woman in a white shirt is holding a scythe .\na woman is playing the sax .\na man is in the \ufb01rehouse .\na man is in the \ufb01rehouse .\ntwo men are in an enclosed room .\n\n\u201c you \u2019re human , \u201d she said softly .\n\u201c you \u2019re a vampire , \u201d she said .\n\u201c you \u2019re a vampire , \u201d she said .\n\u201c you \u2019re a b**ch , \u201d she said .\n\u201c you \u2019re angry , \u201d he said \ufb02atly .\n\u201c you \u2019re a jerk , \u201d he said dryly .\n\u201c you \u2019re not going to let me out ? \u201d\n\u201c i do n\u2019t know why you \u2019re here ? \u201d\n\u201c i do n\u2019t know why you \u2019re here ? \u201d\n\u201c you do n\u2019t know what to do ? 
"
it is therefore quite impossible to separate the various provisions of the single market with regard to quotas .
i do not therefore think that it is necessary to continue with a different view of the commission .
i am therefore very sensitive to the issue of women in different member states and the social repercussions .
therefore , i am very pleased with the report on women and their social rights in the member states .
i am also very pleased to have the opportunity to discuss with the european parliament on this matter .
i would also like to take the opportunity to comment on the issue of the european economic partnership .
i would not like to mention the commission 's statement on the issue of the council 's statement .
i do not want to answer any of the commissioner 's questions to the commission .
mr president , i am not going to reply to mr santer 's statement on the lisbon strategy .
mr president .

Table 5: Linear interpolations between two randomly sampled points in the input space of the GAN generator on the BookCorpus (top left), SNLI (top right), and WMT15 (bottom). Points along the line between the two points are transformed into sentence vectors via the generator and then decoded with beam search.

Table 6 values, per dataset (FPPL and RPPL for ARAE, WDLSTM and Ours, at temperatures 0.5 and 1.0 and, for Ours, beam search with B=1 and B=5):
BookCorpus: 389.6 555.6 364.2 209.2 206.2 213.3 9.4 185.2 280.7 137.2 25.5 66.6 10.5 220.4 152.8 250.9
WMT15: 448.7 965.1 385.8 476.2 378.7 626.3 21.4 369.0 528.9 250.5 105.5 212.9 19.9 350.5 254.1 373.2
SNLI: 67.5 109.1 62.0 54.8 54.0 59.9 5.9 57.0 86.8 34.5 18.6 35.6 15.3 90.8 49.5 59.8

Table 6: Quantitative evaluation of sample quality from ARAE, WD-LSTM and our model. We report the FPPL and RPPL from a KN-smoothed 5-gram language model trained on a distinct but large subset of the data.
We also report FPPLs and RPPLs on samples generated with multinomial sampling at temperatures of 0.5 and 1.0, as well as with deterministic decoding via beam search (B). Note that deterministic decoding is not suitable for the WDLSTM, since there needs to be some stochasticity to produce diverse samples.

Table 7 rows (Grammaticality / Topicality / Overall):
WMT (Temperature 1.0): Ours 46.19% / 48.73% / 63.95%; WD-LSTM 28.90% / 25.88% / 36.05%; No preference 24.91% / 25.39% / -
WMT (Beam Search vs Temperature 0.5): Ours (Beam Search) 20.00% / 50.75% / 53.20%; WD-LSTM (Temperature 0.5) 19.30% / 19.59% / 46.80%; No preference 60.7% / 29.66% / -
BookCorpus (Temperature 1.0): Ours 15.20% / 54.27% / 70.35%; WD-LSTM 24.80% / 23.61% / 29.65%; No preference 60.00% / 22.12% / -
BookCorpus (Beam Search vs Temperature 0.5): Ours (Beam Search) 22.68% / 35.29% / 57.28%; WD-LSTM (Temperature 0.5) 25.21% / 17.64% / 42.72%; No preference 52.11% / 47.07% / -

Table 7: Human evaluation of grammaticality, topicality and overall quality for sentences generated by our model and the WD-LSTM.

fast to train and produce reasonable density estimates, which can be used as weak proxies for the real and model data densities.

Further, we identify a few simple statistics about generated samples to act as proxies for sample quality and diversity. Specifically, we report the number of unique n-grams as a proxy for the latter and the fraction of "valid" n-grams as a proxy for the former (see Supplementary Table 1).
We also attempt to analyze the extent to which these models overfit to examples in the training set (see Supplementary Table 2 and details in Supplementary Section 1.1).

Table 8 rows (Label | Given Hypothesis | Generated Premise):
E | the woman is very happy . | a lady wearing a blue white shirt is laughing .
C | no one is dancing . | a group of people playing guitar hero on a stage .
N | the man is reading the sportspage . | a man in a white shirt is sitting in a recliner .
E | a man is in a black shirt | a man in a black shirt stands in front of a store while a man in a blue hat and white shirt stands beside him .
C | an old woman has no jacket . | a woman with a white hat and jacket is playing with a girl in a red jacket .
N | a person is waiting for a train . | person in white and black hat standing in front of a train track .

Table 8: Samples from our GAN trained to conditionally transform a given hypothesis and a label, one of (E)ntailment, (C)ontradiction or (N)eutral, into a premise that satisfies the specified relationship.

Let P be the data distribution and Q be our model's distribution. Since P is unknown and Q is an implicit generative model, we do not have access to either of the underlying data-generating distributions, only samples from them. Nevertheless, we would like to find an evaluation criterion that measures an (approximate) degree of similarity between these distributions.

To do so, we fit a non-parametric Kneser-Ney smoothed 5-gram language model [20] to samples from P and Q and denote the resulting density models as P̂ and Q̂. Using these, we formulate two complementary evaluation criteria, the Forward and Reverse perplexities (FPPL, RPPL).
These correspond to approximations of H(Q, P) and H(P, Q), respectively. It is straightforward to see that this is the case by substituting Q with Q̂ in H(P, Q) ≈ −E_{x∼P}[log Q̂(x)], and P with P̂ in H(Q, P) ≈ −E_{x∼Q}[log P̂(x)]. We use the forward and reverse terminologies in the opposite sense to how researchers refer to the forward and reverse KL divergences in an optimization setting, where P is the true distribution we would like to approximate with Q; this is to be consistent with [26].
Computationally, the RPPL is equivalent to training a KN 5-gram LM on samples from our model and reporting its perplexity on the real data, while the FPPL involves the opposite. Note that H(P, Q) and H(Q, P) differ in the order of arguments in the KL divergence, with the latter being more sensitive to sample quality and the former to a balance between diversity and quality.
Finally, we also carried out human evaluations to compare samples from our model and the WD-LSTM in an A/B test. Annotators are presented with two samples, one from our model and one from the WD-LSTM (with the presentation order randomized each time), and asked to pick which they prefer along three dimensions: grammaticality, topicality, and overall quality. This protocol is identical to that of [15], who also evaluate unconditional text generation samples. Annotators are allowed to rate two samples as having equal grammaticality and topicality, but not equal overall quality. We collected a total of 1,094 annotations from 16 annotators. In Table 7, we report results for comparisons across the WMT and BookCorpus datasets with different sampling parameters: Temp=1.0 and Temp=0.5 vs. beam search. We compare our beam-search variant with the WD-LSTM at Temp=0.5 since we found it to be the most similar to beam search in trading off RPPL and FPPL.
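To make the two directions concrete, here is a minimal sketch of the forward/reverse perplexity computation. It substitutes an add-alpha-smoothed unigram model for the Kneser-Ney 5-gram model fit with KenLM in the paper; the function names and toy setup are our own.

```python
import math
from collections import Counter

def fit_unigram(corpus, vocab, alpha=1.0):
    """Add-alpha smoothed unigram LM over `vocab`; a toy stand-in
    for the Kneser-Ney 5-gram models the paper fits with KenLM."""
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def perplexity(lm, corpus):
    """exp of the per-token cross-entropy of `corpus` under `lm`."""
    log_prob, n_tokens = 0.0, 0
    for sent in corpus:
        for w in sent:
            log_prob += math.log(lm[w])
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

def fppl_rppl(real, generated, vocab):
    # FPPL ~ H(Q, P): model fit on real data, evaluated on samples.
    fppl = perplexity(fit_unigram(real, vocab), generated)
    # RPPL ~ H(P, Q): model fit on samples, evaluated on real data.
    rppl = perplexity(fit_unigram(generated, vocab), real)
    return fppl, rppl
```

When the sample distribution drifts away from the data distribution, the RPPL rises because the sample-fit model assigns low probability to held-out real text.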
Every entry in the table corresponds to the percentage of annotations in which annotators preferred a sample from a particular model. Our model does consistently better at high temperatures, while being comparable to the WD-LSTM at low temperatures.

5.3 Conditional Text Generation

In most real-world settings, we are interested in conditional rather than unconditional text generation. Conditional GANs [37] have proven extremely powerful in this context for generating images conditioned on certain discrete attributes. In this work, we explore the relatively simple and artificial task of generating sentences conditioned on binary sentiment labels from the SentiCap and Amazon Review datasets. While true sentiment is certainly more nuanced, the binary setting can serve as a simple test-bed for initial experiments with such techniques. Using the SNLI dataset, we also explore the task of generating a premise sentence conditioned on a given hypothesis plus a label that specifies a relationship between them [49]. We believe that transforming hypotheses into premises in SNLI is a harder problem than the inverse, since it requires filling in extra details rather than removing them.

Method                     Accuracy
Random                     41.1%
Baseline-Seq2Seq (Mean)    59.6%
Baseline-Seq2Seq (MOSM)    62.6%
Shen et al. Mean (N=1)     62.4%
Shen et al. MOSM (N=1)     75.9%
Ours                       70.8%

Table 9: NLI classification accuracies of generated samples on the test set, evaluated by the ESIM model [7]. All results except ours were obtained from Shen et al. [49]. Our model also had an FPPL of 15.01, which is comparable to the results in Table 6.

In both these sets of experiments, we use a conditional GAN whose generators and discriminators are MLPs that use conditioning information in their input layers.
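As a toy illustration of this conditioning scheme, the sketch below builds a generator that concatenates a per-label embedding with a noise vector at the input layer. It is pure Python with random, untrained weights and shows only the forward pass; all sizes and names are ours (in our actual model, the label embedding is 128-dimensional and updated only by the discriminator).

```python
import math
import random

random.seed(0)

EMB_DIM, NOISE_DIM, OUT_DIM = 8, 4, 16  # toy sizes, not the paper's

def make_linear(n_in, n_out):
    """Random weight matrix for a toy affine layer."""
    return [[random.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

# One learnable embedding per discrete label (e.g. negative=0 / positive=1).
label_embeddings = {0: [random.gauss(0.0, 0.1) for _ in range(EMB_DIM)],
                    1: [random.gauss(0.0, 0.1) for _ in range(EMB_DIM)]}
W = make_linear(EMB_DIM + NOISE_DIM, OUT_DIM)

def generator(label, noise):
    """Conditional generator: concatenate the label embedding with the
    noise vector, then apply a single tanh layer."""
    x = label_embeddings[label] + noise  # list concatenation = input concat
    return [math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x)))
            for row in W]

fake_outline = generator(1, [random.gauss(0.0, 1.0) for _ in range(NOISE_DIM)])
```

Changing the label while holding the noise fixed changes the generated vector, which is the mechanism a conditional discriminator can then exploit during training.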
Conditioning information for sentiment and NLI labels is learned via a 128-dimensional embedding layer updated only by the discriminator. For premise generation, the MLPs are also presented with the sentence representation corresponding to the given hypothesis. We present quantitative evaluations of both models in Tables 9 and 10, evaluated using a combination of pre-trained classifiers and sample fluency evaluations using FPPL. On SNLI, we are able to outperform a baseline sequence-to-sequence model as well as Shen et al.'s [49] simpler model variant of averaging word embeddings to produce a sentence representation. Qualitative examples shown in Table 8 and Supplementary Table 4 indicate that the model is able to capture some simple aspects of the mapping from sentiment or entailment labels to generated sentences/premises. There are, however, cases that may be attributed to a form of mode collapse, where the model adds trivial details to the caption to satisfy entailment, such as the color of one's shirt.

                 Sentiment Classification Accuracy        FPPL
Dataset          Positive   Negative   Overall      Positive  Negative  Overall
SentiCap         67.65%     84.65%     76.15%       45.3      33.1      38.8
Amazon Review    78.29%     54.68%     66.48%       14.7      14.2      14.4

Table 10: Sentiment classification accuracies using a trained FastText classifier [23] and FPPLs on 3200 generated samples from the SentiCap and Amazon Review datasets.

6 Conclusion & Future Work

We investigate and demonstrate the potential of leveraging general-purpose sentence encoders that produce fixed-length sentence representations to train generative models of sentences. We show that it is possible to train conditional generative models of text that operate by manipulating and transforming sentences entirely in the latent space. We also show that smooth transitions arise in the observed space when moving linearly along the input space of the GAN generator.
In future work, we would like to evaluate our conditional text generation approach on a more challenging benchmark such as MultiNLI. Further, in our preliminary exploration of conditional text generation with SNLI, we experimented with a combination of MSE and adversarial training objectives similar to [39], but this showed no improvements over adversarial training alone. However, ablating the coefficient of the adversarial term in this context can shed some light on the impact of the autoregressive component, and we hope to look into this in future work.

Acknowledgements

The authors thank Alex Lamb, Rithesh Kumar, Jonathan Pilault, Isabela Albuquerque, Anirudh Goyal, Kyunghyun Cho, Kelly Zhang and Varsha Embar for feedback and valuable discussions during the course of this work. We are also grateful to the PyTorch development team [38]. We thank NVIDIA for donating a DGX-1 computer used in this work and Fonds de recherche du Québec - Nature et technologies for funding.

References

[1] Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207, 2016.

[2] Guozhong An. The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8(3):643-674, 1996.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[4] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.

[5] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space.
arXiv preprint arXiv:1511.06349, 2015.

[6] Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.

[7] Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657-1668, 2017.

[8] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172-2180, 2016.

[9] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[10] Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070, 2018.

[11] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449, 2018.

[12] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364, 2017.

[13] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053, 2018.

[14] Jesse Engel, Matthew Hoffman, and Adam Roberts.
Latent constraints: Learning to generate conditionally from unconditional generative models. arXiv preprint arXiv:1711.05772, 2017.

[15] William Fedus, Ian Goodfellow, and Andrew M Dai. MaskGAN: Better text generation via filling in the ______. arXiv preprint arXiv:1801.07736, 2018.

[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

[17] Anirudh Goyal, Nan Rosemary Ke, Alex Lamb, R Devon Hjelm, Chris Pal, Joelle Pineau, and Yoshua Bengio. ACtuAL: Actor-critic under adversarial learning. arXiv preprint arXiv:1711.04755, 2017.

[18] Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pages 6716-6726, 2017.

[19] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769-5779, 2017.

[20] Kenneth Heafield. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics, 2011.

[21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.

[22] Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483, 2016.

[23] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. FastText.
zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.

[24] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

[25] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

[26] Yoon Kim, Kelly Zhang, Alexander M Rush, Yann LeCun, et al. Adversarially regularized autoencoders for generating discrete structures. arXiv preprint arXiv:1706.04223, 2017.

[27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[28] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[29] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294-3302, 2015.

[30] Vladyslav Kolesnyk, Tim Rocktäschel, and Sebastian Riedel. Generating natural language inference chains. arXiv preprint arXiv:1606.01404, 2016.

[31] Matt J Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv preprint arXiv:1611.04051, 2016.

[32] Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.

[33] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[34] Alexander Patrick Mathews, Lexing Xie, and Xuming He. SentiCap: Generating image descriptions with sentiments.
In AAAI, pages 3574-3580, 2016.

[35] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

[36] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model.

[37] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[38] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[39] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536-2544, 2016.

[40] Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.

[41] Ben Poole, Jascha Sohl-Dickstein, and Surya Ganguli. Analyzing noise in autoencoders and deep networks. CoRR, abs/1406.1831, 2014.

[42] Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. Language generation with recurrent generative adversarial networks without pre-training. arXiv preprint arXiv:1706.01399, 2017.

[43] Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. Adversarial generation of natural language. arXiv preprint arXiv:1705.10929, 2017.

[44] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction.
In Proceedings of the 28th International Conference on Machine Learning, pages 833-840. Omnipress, 2011.

[45] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234-2242, 2016.

[46] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. 2016.

[47] Samira Shabanian, Devansh Arpit, Adam Trischler, and Yoshua Bengio. Variational bi-LSTMs. arXiv preprint arXiv:1711.05717, 2017.

[48] Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6833-6844, 2017.

[49] Yikang Shen, Shawn Tan, Chin-Wei Huang, and Aaron Courville. Generating contradictory, neutral, and entailing sentences. arXiv preprint arXiv:1803.02710, 2018.

[50] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.

[51] Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079, 2018.

[52] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

[53] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders.
arXiv preprint arXiv:1711.01558, 2017.

[54] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[55] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096-1103. ACM, 2008.

[56] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

[57] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5-32. Springer, 1992.

[58] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852-2858, 2017.

[59] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19-27, 2015.