{"title": "Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 1810, "page_last": 1820, "abstract": "Responses generated by neural conversational models tend to lack informativeness and diversity. We present Adversarial Information Maximization (AIM), an adversarial learning framework that addresses these two related but distinct problems. To foster response diversity, we leverage adversarial training that allows distributional matching of synthetic and real responses. To improve informativeness, our framework explicitly optimizes a variational lower bound on pairwise mutual information between query and response. Empirical results from automatic and human evaluations demonstrate that our methods significantly boost informativeness and diversity.", "full_text": "Generating Informative and Diverse Conversational\nResponses via Adversarial Information Maximization\n\nYizhe Zhang\n\nMichel Galley\n\nJianfeng Gao\n\nZhe Gan\n\nXiujun Li\n\nChris Brockett\n\nBill Dolan\n\nMicrosoft Research, Redmond, WA, USA\n\n{yizzhang,mgalley,jfgao,zhgan,xiul,chrisbkt,billdol}@microsoft.com\n\nAbstract\n\nResponses generated by neural conversational models tend to lack informativeness\nand diversity. We present a novel adversarial learning method, called Adversarial\nInformation Maximization (AIM) model, to address these two related but dis-\ntinct problems. To foster response diversity, we leverage adversarial training that\nallows distributional matching of synthetic and real responses. To improve infor-\nmativeness, we explicitly optimize a variational lower bound on pairwise mutual\ninformation between query and response. 
Empirical results from automatic and human evaluations demonstrate that our methods significantly boost informativeness and diversity.

1 Introduction

Neural conversational models are effective in generating coherent and relevant responses [1, 2, 3, 4, etc.]. However, the maximum-likelihood objective commonly used in these neural models fosters generation of responses that average out the responses in the training data, resulting in the production of safe but bland responses [5].
We argue that this problem is in fact twofold. The responses of a system may be diverse but uninformative (e.g., “I don’t know”, “I haven’t a clue”, “I haven’t the foggiest”, “I couldn’t tell you”), or conversely informative but not diverse (e.g., always giving the same generic responses such as “I like music”, but never “I like jazz”). A major challenge, then, is to strike the right balance between informativeness and diversity. On the one hand, we seek informative responses that are relevant and fully address the input query. Mathematically, this can be measured via Mutual Information (MI) [5], by computing the reduction in uncertainty about the query given the response. On the other hand, diversity can help produce responses that are more varied and unpredictable, which contributes to making conversations seem more natural and human-like.
The MI approach of [5] conflated the problems of producing responses that are informative and diverse, and subsequent work has not attempted to address the distinction explicitly. Researchers have applied Generative Adversarial Networks (GANs) [6] to neural response generation [7, 8]. The equilibrium of the GAN objective is achieved when the synthetic data distribution matches the real data distribution. Consequently, the adversarial objective discourages generating responses that demonstrate less variation than human responses. 
However, while GANs help reduce the level of blandness, the technique was not developed for the purpose of explicitly improving either informativeness or diversity.
We propose a new adversarial learning method, Adversarial Information Maximization (AIM), for training end-to-end neural response generation models that produce informative and diverse conversational responses. Our approach exploits adversarial training to encourage diversity, and explicitly maximizes a Variational Information Maximization Objective (VIMO) [9, 10] to produce informative responses. To leverage VIMO, we train a backward model that generates the source from the target. The backward model guides the forward model (from source to target) to generate relevant responses during training, thus providing a principled approach to mutual information maximization. This work is the first application of a variational mutual information objective to text generation.
To alleviate the instability of GAN training, we propose an embedding-based discriminator rather than the binary classifier used in traditional GANs. To reduce the variance of gradient estimation, we leverage a deterministic policy gradient algorithm [11] and employ the discrete approximation strategy of [12]. We also employ a dual adversarial objective inspired by [13, 14, 15], which composes both source-to-target (forward) and target-to-source (backward) objectives. We demonstrate that this forward-backward model can work synergistically with the variational information maximization loss. The effectiveness of our approach is validated empirically on two social media datasets.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Method

2.1 Model overview

Let D = {(S_i, T_i)}_{i=1}^N denote a set of N single-turn conversations, where S_i represents a query (i.e., source) and T_i is the response to S_i (i.e., target). 
We aim to learn a generative model pθ(T|S) that produces both informative and diverse responses for arbitrary input queries.
To achieve this, we propose Adversarial Information Maximization (AIM), illustrated in Figure 1, where (i) adversarial training is employed to learn the conditional distribution pθ(T|S), so as to improve the diversity of generated responses over standard maximum likelihood training, and (ii) variational information maximization is adopted to regularize the adversarial learning process and explicitly maximize mutual information to boost the informativeness of generated responses.
In order to perform adversarial training, a discriminator Dψ(·,·) is used to distinguish real query-response pairs (S, T) from generated synthetic pairs (S, T̃), where T̃ is synthesized from pθ(T|S) given the query S. In order to evaluate the mutual information between S and T̃, a backward proposal network qφ(S|T) calculates a variational lower bound on the mutual information. In summary, the objective of AIM is defined as follows:

min_ψ max_{θ,φ} L_AIM(θ, φ, ψ) = L_GAN(θ, ψ) + λ · L_MI(θ, φ),   (1)

where L_GAN(θ, ψ) represents the objective that accounts for adversarial learning, L_MI(θ, φ) denotes the regularization term corresponding to the mutual information, and λ is a hyperparameter that balances these two parts.

Figure 1: Overview of the Adversarial Information Maximization (AIM) model for neural response generation. Orange denotes real data, and blue denotes generated fake responses. pe(S, T) represents the encoder joint distribution, explained later.

2.2 Diversity-encouraging objective

Generator  The conditional generator pθ(T|S) that produces a neural response T = (y_1, . . . , y_n) given the source sentence S = (x_1, . . .
, x_m) and an isotropic Gaussian noise vector Z is shown in Figure 2. The noise vector Z is used to inject noise into the generator to promote diversity of the generated text.
Specifically, a 3-layer convolutional neural network (CNN) is employed to encode the source sentence S into a fixed-length hidden vector H_0. A random noise vector Z with the same dimension as H_0 is then added to H_0 element-wise.

Figure 2: Illustration of the CNN-LSTM conditional generator.

This is followed by a series of long short-term memory (LSTM) units as the decoder. In our model, the t-th LSTM unit takes the previously generated word y_{t-1}, the hidden state H_{t-1}, H_0 and Z as input, and generates the next word y_t that maximizes the probability over the vocabulary. However, the argmax operation is used, instead of sampling from a multinomial distribution as in the standard LSTM. Thus, all the randomness during generation is clamped into the noise vector Z, and the reparameterization trick [16] can be used (see Eqn. (4)). However, the argmax operation is not differentiable, so no gradient can be backpropagated through y_t. Instead, we adopt the soft-argmax approximation [12] below:

onehot(y_t) ≈ softmax((V · H_t) · 1/τ),   (2)

where V is a weight matrix used for computing a distribution over words. When the temperature τ → 0, the argmax operation is exactly recovered [12]; however, the gradient will then vanish. In practice, τ should be selected to balance the approximation bias and the magnitude of the gradient variance, which scales up nearly quadratically with 1/τ. 
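The soft-argmax relaxation of Eqn. (2) can be sketched in a few lines. This is a minimal NumPy illustration (not the paper's implementation); the toy scores stand in for V · H_t:

```python
import numpy as np

def soft_argmax(logits, tau):
    """Temperature-scaled softmax that approaches a one-hot argmax
    vector as tau -> 0, while remaining differentiable (Eqn. (2))."""
    z = logits / tau
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy per-step vocabulary scores (illustrative values for V . H_t).
scores = np.array([1.0, 3.0, 2.0])

hard = np.zeros_like(scores)
hard[scores.argmax()] = 1.0  # exact (non-differentiable) one-hot argmax

approx = soft_argmax(scores, tau=0.05)
# With a small temperature the relaxation is very close to the one-hot
# vector; with tau = 1 it is just an ordinary softmax.
print(np.abs(approx - hard).max())
```

Lowering τ sharpens the approximation but, as noted above, inflates gradient variance roughly as 1/τ².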
Note that when τ = 1 this recovers the setting in [8]. However, we empirically found that using a small τ results in accumulated ambiguity when generating words in our experiments.

Discriminator  For the discriminator, we adopt a novel approach inspired by the Deep Structured Similarity Model (DSSM) [17]. As shown in Figure 3, the source sentence S, the synthetic response T̃ and the human response T are all projected to an embedding space of fixed dimensionality via different CNNs¹. The embedding network for S is denoted as Ws, while T̃ and T share a network Wt. Given these embeddings, the cosine similarities of Ws(S) versus Wt(T̃) and Wt(T) are computed, denoted as Dψ(T̃, S) and Dψ(T, S), respectively. ψ represents all the parameters of the discriminator.

Figure 3: Embedding-based sentence discrimination.

We empirically found that embedding each sentence separately yields better performance than concatenating (S, T) pairs. Presumably, mapping (S, T) pairs jointly to the embedding space requires the embedding network to capture the cross-sentence interaction features of how relevant the response is to the source. Mapping them separately divides the task into a sentence feature extraction sub-task and a sentence feature matching sub-task, rather than entangling the two; the former is presumably slightly easier to train.

Objective  The objective of our generator is to minimize the difference between Dψ(T, S) and Dψ(T̃, S). Conversely, the discriminator tries to maximize this difference. The L_GAN part of Eqn. (1) is specified as

L_GAN(θ, ψ) = −E_{T,T̃,S} [ f( Dψ(T, S) − Dψ(T̃, S) ) ],   (3)

where f(x) ≜ 2 tanh⁻¹(x) scales the difference to deliver smoother gradients. Note that Eqn. 
(3) is conceptually related to [7], in which a discriminator loss is introduced to provide sequence-level training signals. Specifically, the discriminator is responsible for assessing both the genuineness of a response and its relevance to the corresponding source. The discriminator employed in [7] evaluates a source-target pair via operations such as concatenation. In contrast, our approach explicitly structures the discriminator to compare the embeddings using a cosine similarity metric, thus avoiding having to learn a neural network to match correspondence, which can be difficult. Presumably our discriminator delivers a more direct updating signal by explicitly defining how the response is related to the source.
The objective in Eqn. (3) also resembles the Wasserstein GAN (WGAN) [19] in that, without the monotonic scaling function f, the discriminator Dψ can be perceived as the critic in WGAN with embedding-structured regularization. See details in the Supplementary Material.

¹Note that encoders based on RNNs or pure word embeddings [18] are also possible; nevertheless we limit our choice to CNNs in this paper.
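A toy sketch of the embedding-based discrimination in Eqn. (3), with fixed vectors standing in for the CNN embeddings Ws(S) and Wt(·). One assumption on our part: we halve the similarity difference before applying f(x) = 2 tanh⁻¹(x) so that the arctanh argument stays inside its (−1, 1) domain; the paper does not spell out how this range is handled:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, the role played by D_psi(. , S)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def f(x):
    # Monotonic scaling f(x) = 2 * tanh^{-1}(x) from Eqn. (3).
    return 2.0 * np.arctanh(x)

# Illustrative stand-ins for Ws(S), Wt(T), Wt(T~).
ws_s  = np.array([1.0,  1.0, 1.0,  1.0])   # source embedding
wt_t  = np.array([1.0,  1.0, 1.0,  0.5])   # real response: close to source
wt_tf = np.array([1.0, -1.0, 1.0, -1.0])   # synthetic response: orthogonal

d_real = cosine(ws_s, wt_t)    # D_psi(T, S)
d_fake = cosine(ws_s, wt_tf)   # D_psi(T~, S)

# Generator-side loss for one triple: the generator minimizes the gap,
# the discriminator maximizes it. The 0.5 keeps arctanh in-domain.
loss_gan = -f(0.5 * (d_real - d_fake))
print(d_real > d_fake, loss_gan < 0.0)
```

In the real model the three embeddings come from trained CNNs, and the expectation in Eqn. (3) is taken over minibatches of (S, T, T̃) triples.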
To backpropagate the learning signal from the discriminator Dψ to the generator pθ(T|S), instead of using the standard policy gradient as in [7], we consider a novel approach related to the deterministic policy gradient (DPG) [11], which estimates the gradient as below:

∇_θ E_{p(T̃|S,Z)} Dψ(T̃, S) = E_{p(Z)} ∇_{T̃} Dψ(T̃, S) ∇_θ T̃(S, Z),   (4)

where the expectation in Eqn. (4) is approximated by Monte Carlo sampling. T̃(S, Z) is the generated response, as a function of the source S and the randomness Z. 
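The pathwise estimator of Eqn. (4) can be illustrated on a one-dimensional toy problem (our own construction, not from the paper): a "generator" T̃(S, Z) = θ + Z and a differentiable "critic" D(T̃) = −T̃², for which the exact gradient of E_Z[D] with respect to θ is −2θ:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = 0.7
z = rng.normal(size=100_000)   # global noise Z ~ N(0, 1)

# Deterministic generator T~(S, Z) = theta + Z (1-D stand-in) and
# critic D(T~) = -T~^2.
t_tilde = theta + z

# Pathwise / DPG-style estimate of Eqn. (4):
#   E_Z[ dD/dT~ * dT~/dtheta ], with dD/dT~ = -2*T~ and dT~/dtheta = 1.
grad_est = np.mean(-2.0 * t_tilde)

# Closed form: d/dtheta E_Z[ -(theta + Z)^2 ] = -2*theta.
print(abs(grad_est - (-2.0 * theta)))
```

Because Z carries all the randomness, each sample contributes an exact per-path gradient, which is why this estimator has much lower variance than score-function methods.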
Note that ∇_θ T̃(S, Z) can be calculated because we use the soft-argmax approximation of (2). The randomness in [7] comes from the softmax-multinomial sampling at each local time step; in our approach, T̃ is a deterministic function of S and Z, so the randomness is global and separated out from the deterministic propagation, which resembles the reparameterization trick used in variational autoencoders [16]. This separation of randomness allows gradients to be backpropagated through deterministic rather than stochastic nodes. Consequently, the variance of gradient estimation is largely reduced.

2.3 Information-promoting objective

We further seek to explicitly boost the MI between S and T̃, with the aim of improving the informativeness of generated responses. Intuitively, maximizing MI allows the model to generate responses that are more specific to the source, while generic responses are largely down-weighted.
Denoting the unknown oracle joint distribution as p(S, T), we aim to find an encoder joint distribution pe(S, T) = pθ(T|S) p(S) by learning a forward model pθ(T|S), such that pe(S, T) approximates p(S, T) while the mutual information under pe(S, T) remains high. See Figure 1 for an illustration.
Empirical success has been achieved in [5] for mutual information maximization. However, that approach is limited by the fact that the MI-promoting objective is used only at test time, while the training procedure remains standard maximum likelihood training. Consequently, during training the model is not explicitly specified to maximize pertinent information. The MI objective merely provides a criterion for reweighing response candidates, rather than asking the generator to produce more informative responses in the first place. 
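The variational bound used below (Eqn. (5)) implies, for discrete variables, I_pe(S, T) ≥ H(S) + E_pe log qφ(S|T), with equality when qφ matches pe(S|T). This can be checked numerically on a small hand-made joint distribution (the numbers are arbitrary, chosen only for illustration):

```python
import numpy as np

# A small discrete encoder joint p_e(S, T): rows index S, columns T.
pe = np.array([[0.30, 0.10],
               [0.05, 0.55]])
ps = pe.sum(axis=1)   # p(S)
pt = pe.sum(axis=0)   # p_e(T)

# Exact mutual information under p_e.
mi = float((pe * np.log(pe / np.outer(ps, pt))).sum())

# An arbitrary (imperfect) backward model q(S|T); each column sums to 1.
q = np.array([[0.7, 0.2],
              [0.3, 0.8]])

# Variational objective L_MI = E_{p_e(S,T)} log q(S|T).
l_mi = float((pe * np.log(q)).sum())
hs = float(-(ps * np.log(ps)).sum())   # entropy H(S)

# The bound of Eqn. (5): I >= H(S) + L_MI, since the dropped
# KL term E_{p_e(T)} KL(p_e(S|T) || q(S|T)) is nonnegative.
print(mi, hs + l_mi)
```

Maximizing L_MI over q tightens the bound; maximizing it over the forward model pushes the true MI up, which is exactly how the objective is used in training.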
Further, the hyperparameter that balances the likelihood and anti-likelihood/reverse-likelihood terms is manually selected from (0, 1), which deviates from the actual MI objective, making the setup ad hoc.
Here, we consider explicitly maximizing the mutual information I_pe(S, T) ≜ E_{pe(S,T)} log [ pe(S, T) / (p(S) pe(T)) ] over pe(S, T) during training. However, direct optimization of I_pe(S, T) is intractable. To provide a principled approach to maximizing MI, we adopt variational information maximization [9, 10]. The mutual information I_pe(S, T) under the encoder joint distribution pe(S, T) is

I_pe(S, T) ≜ E_{pe(S,T)} log [ pe(S, T) / (p(S) pe(T)) ]
= H(S) + E_{pe(T)} D_KL(pe(S|T), qφ(S|T)) + E_{pe(S,T)} log qφ(S|T)
≥ E_{p(S)} E_{pθ(T|S)} log qφ(S|T) ≜ L_MI(θ, φ),   (5)

where H(·) denotes the entropy of a random variable, and D_KL(·,·) denotes the KL divergence between two distributions. qφ(S|T) is a backward proposal network that approximates the unknown pe(S|T). For this backward model qφ(S|T), we use the same CNN-LSTM architecture as the forward model [20]. We denote the MI objective E_{p(S)} E_{pθ(T|S)} log qφ(S|T) as L_MI(θ, φ), as used in Eqn. (1).

Figure 4: Joint distribution matching of the query-response pairs. Details explained in Section 2.4.

The gradient of L_MI(θ, φ) w.r.t. θ can be approximated by Monte Carlo samples using the REINFORCE policy gradient method [21]:

∇_θ L_MI(θ, φ) = E_{pθ(T|S)} [ log qφ(S|T) − b ] · ∇_θ log pθ(T|S),
∇_φ L_MI(θ, φ) = E_{pθ(T|S)} ∇_φ log qφ(S|T),   (6)

where b denotes a baseline. Here we choose a simple empirical average for b [21]. Note that more sophisticated baselines based on neural adaptation [22] or self-critics [23] can also be employed. 
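A toy check of the REINFORCE estimator with an empirical-average baseline, as in Eqn. (6). This is our own two-outcome stand-in (a categorical "generator" and a fixed reward playing the role of log qφ(S|T)), not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(2)

theta = np.array([0.5, -0.5])            # logits of a 2-way categorical
p = np.exp(theta) / np.exp(theta).sum()  # p_theta
r = np.array([1.0, 3.0])                 # stand-in for log q_phi(S|T)

# Exact gradient of E_{p_theta}[r] w.r.t. the logits:
#   d/dtheta_k  sum_a p_a r_a  =  p_k (r_k - E[r]).
exact = p * (r - p @ r)

# REINFORCE with an empirical-average baseline b (Eqn. (6)).
a = rng.choice(2, size=200_000, p=p)     # samples from p_theta
b = r[a].mean()                          # simple empirical baseline
score = np.eye(2)[a] - p                 # grad of log p_theta(a) in logits
est = ((r[a] - b)[:, None] * score).mean(axis=0)

print(np.abs(est - exact).max())
```

The baseline does not change the expectation of the estimator (the score function has mean zero) but substantially reduces its variance, which is why a simple empirical average already helps.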
We complement the policy gradient objective with a small proportion of likelihood-maximization loss, which was shown to stabilize training [24].
As an alternative to the REINFORCE approach used in (6), we also considered using the same DPG-like approach as in (4) for approximate gradient calculation. Compared to the REINFORCE approach, the DPG-like method yields lower variance, but is less memory-efficient in this case. This is because the L_MI(θ, φ) objective requires the gradient to be backpropagated first to the synthetic text through all backward LSTM nodes, and then from the synthetic text through all forward LSTM nodes, where both steps are densely connected. Hence, the REINFORCE approach is used in this part.

2.4 Dual Adversarial Learning

One issue with the above approach is that learning an appropriate qφ(S|T) is difficult. Similar to the forward model, this backward model qφ(S|T) may also tend to be “bland” in generating the source from the target. As illustrated in Figure 4, supposing that we define a decoder joint distribution pd(S, T) = qφ(S|T) p(T), this distribution tends to be flat along the T axis (i.e., it tends to generate the same source given different target inputs). Similarly, pe(S, T) tends to be flat along the S axis.
To address this issue, inspired by recent work on leveraging “cycle consistency” for image generation [13, 25], we implement a dual objective that treats source and target equally, by complementing the objective in Eqn. 
(1) with decoder joint distribution matching, which can be written as

min_ψ max_{θ,φ} L_DAIM = − E_{(T,T̃,S)∼pe} f( Dψ(S, T) − Dψ(S, T̃) )
− E_{(T,S̃,S)∼pd} f( Dψ(S, T) − Dψ(S̃, T) )
+ λ · E_{p(S)} E_{pθ(T|S)} log qφ(S|T)
+ λ · E_{p(T)} E_{qφ(S|T)} log pθ(T|S),   (7)

where λ is a hyperparameter to balance the GAN loss and the MI loss. An illustration is shown in Figure 5.

Figure 5: Dual objective for Adversarial Information Maximization (AIM).

With this dual objective, the forward and backward models are symmetric and collaborative. This is because a better estimate of the backward model qφ(S|T) renders a more accurate evaluation of the mutual information I_pe(S, T), on which the optimization of the forward model is based. Correspondingly, improvement of the forward model also has a positive impact on the learning of the backward model. As a consequence, the forward and backward models work in a synergistic manner to simultaneously make the encoder joint distribution pe(S, T) and the decoder joint distribution pd(S, T) match the oracle joint distribution p(S, T). Further, as seen in Eqn. (7), the discriminators for pe(S, T) and pd(S, T) are shared. Such sharing allows the model to borrow discriminative features from both sides, and augments the synthetic data pairs (both (S, T̃) and (S̃, T)) for the discriminator. Presumably, this can facilitate discriminator training, especially when the source-target correspondence is difficult to learn.
We believe that this approach also improves generation diversity. To understand this, notice that we are maximizing a surrogate objective of I_pd(S, T), which can be written as

I_pd(S, T) = H(T) − H(T|S).   (8)

When optimizing θ, the backward model qφ(S|T) is fixed and H(T|S) remains constant. Thereby, optimizing I_pd(S, T) with respect to θ can be understood as equivalently maximizing H(T), which promotes the diversity of generated text.

3 Related Work

Our work is closely related to [5], where an information-promoting objective was proposed to directly optimize an MI-based objective between source and target pairs. Despite the great success of this approach, the use of an additional hyperparameter for the anti-likelihood term renders the objective only an approximation to the actual MI. Additionally, the MI objective is employed only at test (decoding) time, while the training procedure does not involve such an MI objective and is identical to standard maximum-likelihood training. Compared with [5], our approach optimizes a principled variational lower bound on MI during training.
Adversarial learning [6, 26] has been shown to be successful in dialog generation, translation, image captioning and a series of natural language generation tasks [7, 12, 27, 28, 29, 30, 31, 32, 33, 34]. [7] leverages adversarial training and reinforcement learning to generate high-quality responses. Our adversarial training differs from [7] in both the discriminator and generator design: we adopt an embedding-based structured discriminator inspired by ideas from Deep Structured Similarity Models (DSSM) [17]. For the generator, instead of performing multinomial sampling at each generation step and leveraging a REINFORCE-like method as in [7], we clamp all the randomness in the generation process into an initial input noise vector, and employ the discrete approximation strategy used in [12]. 
As a result, the variance of gradient estimation is largely reduced.
Unlike previous work, we seek to make a conceptual distinction between informativeness and diversity, and combine the previously proposed MI and GAN approaches in a principled manner to explicitly render responses both informative (via MI) and diverse (via GAN).
Our AIM objective is further extended to a dual-learning framework. This is conceptually related to several previous GAN models in the image domain designed for joint distribution matching [13, 25, 35, 36, 37]. Among these, our work is most closely related to the Triangle GAN [13]. However, we employ an additional VIMO objective, which has an effect similar to that of “cycle-consistency” regularization and enables better communication between the forward and backward models. [14] also leverages a dual objective for supervised translation training and demonstrates superior performance. Our work differs from [14] in that we formulate the problem in an adversarial learning setup. It can thus be perceived as conditional distribution matching rather than seeking a regularized maximum likelihood solution.

4 Experiments

4.1 Setups

We evaluated our methods on two datasets: Reddit and Twitter. The Reddit dataset contains 2 million source-target pairs of single-turn conversations extracted from Reddit discussion threads. The maximum sentence length is 53. We randomly partition the data as (80%, 10%, 10%) to construct the training, validation and test sets. The Twitter dataset contains 7 million single-turn conversations from Twitter threads. We mainly compare our results with MMI [5]².
We evaluated our method using relevance and diversity metrics. For relevance evaluation, we adopt BLEU [38], ROUGE [39] and three embedding-based metrics following [8, 40]. The Greedy metric yields the maximum cosine similarity over the embeddings of two utterances [41]. 
Similarly, the Average metric [42] considers the average embedding cosine similarity. The Extreme metric [43] obtains a sentence representation by taking the largest extreme values among the embedding vectors of all the words it contains, then calculates the cosine similarity of the sentence representations.
To evaluate diversity, we follow [5] in using Dist-1 and Dist-2, defined as the ratio of the number of unique n-grams to the total number of n-grams in the tested sentences. However, this metric neglects frequency differences among n-grams. For example, token A and token B each occurring 50 times yield the same Dist-1 score (0.02) as token A occurring once and token B occurring 99 times, whereas the former is commonly considered more diverse than the latter. To accommodate this, we propose the Entropy (Ent-n) metric, which reflects how even the empirical n-gram distribution of a given sentence is:

Ent = − (1 / Σ_w F(w)) Σ_{w∈V} F(w) log [ F(w) / Σ_w F(w) ],

where V is the set of all n-grams and F(w) denotes the frequency of n-gram w.

²We did not compare with [8] since the code is not available, and the original training data used in [8] contains a large portion of the test data, owing to data leakage.

Table 1: Quantitative evaluation on the Reddit dataset. (* is implemented based on [5].)
(\u02da is implemented based on [5].)\n\nModels\nseq2seq\ncGAN\nAIM\nDAIM\nMMI\u02da\nHuman\n\nRelevance\n\nBLEU ROUGE Greedy Average\n0.591\n1.85\n1.83\n0.604\n0.645\n2.04\n0.632\n1.93\n1.87\n0.596\n\n1.845\n1.872\n1.989\n1.945\n1.864\n\n0.9\n0.9\n1.2\n1.1\n1.1\n-\n\n-\n\n-\n\n-\n\nDiversity\nExtreme Dist-1 Dist-2\n0.153\n0.342\n0.357\n0.199\n0.205\n0.362\n0.220\n0.366\n0.127\n0.353\n0.616\n\n0.040\n0.052\n0.050\n0.054\n0.046\n0.129\n\n-\n\nEnt-4\n6.807\n7.864\n8.014\n8.128\n7.142\n9.566\n\nWe evaluated conditional GAN (cGAN), adversarial information maximization (AIM), dual adversar-\nial information maximization (DAIM), together with maximum likelihood CNN-LSTM sequence-to-\nsequence baseline on multiple datasets. For comparison with previous state of the art methods, we\nalso include MMI [5]. To eliminate the impact of network architecture differences, we implemented\nMMI-bidi [5] using our CNN-LSTM framework. The settings, other than model architectures, are\nidentical to [5]. We performed a beam search with width of 200 and choose the hyperparameter based\non performance on the validation set.\nThe forward and backward models were pretrained via seq2seq training. During cGAN training,\nwe added a small portion of supervised signals to stabilize the training [24]. For embedding-based\nevaluation, we used a word2vec embedding trained on GoogleNews Corpus3, recommended by [44].\nFor all the experiments, we employed a 3-layer convolutional encoder and an LSTM decoder as in\n[45]. The \ufb01lter size, stride and the word embedding dimension were set to 5, 2 and 300, respectively,\nfollowing [46]. The hidden unit size of H0 was set to 100. We set \u03bb to be 0.1 and the supervised-loss\nbalancing parameter to be 0.001. 
All other hyperparameters were shared across experiments. All experiments were conducted on NVIDIA K80 GPUs.

4.2 Evaluation on Reddit data

Quantitative evaluation  We first evaluated our methods on the Reddit dataset using the relevance and diversity metrics. We truncated the vocabulary to the 20,000 most frequent words. For testing, we used 2,000 randomly selected samples from the test set4. The results are summarized in Table 1. We observe that incorporating the adversarial loss improves the diversity of generated responses (cGAN vs. seq2seq), while relevance under most metrics (except BLEU) increases by a small amount.
Comparing cGAN, AIM, and DAIM with MMI, we observe substantial improvements in diversity and relevance, which we attribute to the adversarial objective and, in AIM and DAIM, the additional mutual-information-promoting objective. Table 2 presents several examples. AIM and DAIM produce more informative responses because the MI objective explicitly rewards responses that are predictive of the source and down-weights those that are generic and dull.

Source:   I don't suppose you have my missing socks as well?
Human:    You can't sleep either, I see.
MMI:      I don't have socks, but I have no idea what you're talking about.
seq2seq:  I have one.
cGAN:     I have one, but I have a pair of them.
AIM:      I have one left handed.
DAIM:     Check your pants.
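As a concrete reference for the diversity numbers in Table 1, the Dist-n and Ent-n computations can be sketched as follows. This is a minimal illustration of the metric definitions, not the authors' released evaluation code; the function names and the aggregation over a list of tokenized responses are our own choices.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(responses, n):
    """Dist-n: number of unique n-grams / total number of n-grams."""
    grams = [g for r in responses for g in ngrams(r, n)]
    return len(set(grams)) / len(grams) if grams else 0.0

def ent_n(responses, n):
    """Ent-n: entropy of the empirical n-gram distribution,
    Ent = -(1 / sum_w F(w)) * sum_w F(w) * log(F(w) / sum_w F(w))."""
    freq = Counter(g for r in responses for g in ngrams(r, n))
    total = sum(freq.values())
    return -sum((f / total) * math.log(f / total)
                for f in freq.values()) if total else 0.0

# The frequency-sensitivity example from the text: both token streams have
# Dist-1 = 2/100 = 0.02, but the even one has higher entropy (log 2 vs. ~0.056).
even   = [["A"] * 50 + ["B"] * 50]   # A and B each occur 50 times
skewed = [["A"] + ["B"] * 99]        # A occurs once, B occurs 99 times
assert dist_n(even, 1) == dist_n(skewed, 1) == 0.02
assert ent_n(even, 1) > ent_n(skewed, 1)
```

The assertions reproduce the A/B example given with the metric definition: Dist-1 cannot distinguish the two distributions, while Ent-1 can.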
Under the same hyperparameter setup, we also observe that DAIM benefits from the additional backward model and outperforms AIM in diversity, coming closer to human responses. We show a histogram of the lengths of generated responses in the Supplementary Material. Our models are trained until convergence.

Table 2: Sample outputs from different methods.

Source:   Why does *** make such poor cell phones? Isn't that against the Japanese code?
Human:    They're a Korean company
MMI:      Because they use ads.
seq2seq:  I don't know how to use it.
cGAN:     Because they are more expensive.
AIM:      Because they aren't in the store.
DAIM:     Because they aren't available in Japan.

Source:   Why would he throw a lighter at you?
Human:    He was passing me it.
MMI:      Why wouldn't he?
seq2seq:  I don't know.
cGAN:     You don't?
AIM:      Though he didn't use a potato.
DAIM:     He didn't even notice that.

3 https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM
4 We did not use the full test set because MMI decoding is relatively slow.

Table 3: Human evaluation results. Results of statistical significance are shown in bold.

                    Informativeness           Relevance
Methods (A-B)       Method A   Method B      Method A   Method B
MMI-AIM             0.496      0.504         0.501      0.499
MMI-cGAN            0.505      0.495         0.514      0.486
MMI-DAIM            0.484      0.516         0.503      0.497
MMI-seq2seq         0.510      0.490         0.518      0.482
seq2seq-cGAN        0.487      0.513         0.492      0.508
seq2seq-AIM         0.478      0.522         0.492      0.508
seq2seq-DAIM        0.468      0.532         0.475      0.525
Human-DAIM          0.615      0.385         0.600      0.400
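The MMI baseline that anchors these comparisons scores candidates at decoding time rather than during training: it reranks beam-search hypotheses by a weighted combination of forward and backward likelihoods, log p(T|S) + λ·log p(S|T) [5]. A minimal sketch, in which `log_p_forward` and `log_p_backward` are hypothetical stand-ins for the pretrained forward and backward seq2seq models (the candidates and the λ value here are purely illustrative):

```python
def mmi_bidi_rerank(source, candidates, log_p_forward, log_p_backward, lam=0.5):
    """Rerank beam-search candidates T for source S by
    log p(T|S) + lam * log p(S|T), best-scoring candidate first."""
    scored = [(log_p_forward(source, c) + lam * log_p_backward(c, source), c)
              for c in candidates]
    return [c for _, c in sorted(scored, key=lambda sc: sc[0], reverse=True)]

# Toy illustration: the forward model slightly prefers the bland response,
# but the backward model heavily penalizes it (a bland response predicts the
# source poorly), so reranking demotes it.
fwd = {"i don't know.": -1.0, "check your pants.": -2.0}   # stand-in log p(T|S)
bwd = {"i don't know.": -10.0, "check your pants.": -1.0}  # stand-in log p(S|T)
ranked = mmi_bidi_rerank("where are my socks?", list(fwd),
                         lambda s, t: fwd[t], lambda t, s: bwd[t])
assert ranked[0] == "check your pants."
```

This makes concrete why AIM is the more principled variant: the reranker can only reorder whatever the beam already contains, whereas AIM pushes the mutual-information objective into training itself.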
cGAN, AIM, and DAIM consume roughly 1.7, 2.5, and 3.5 times the computation time of our seq2seq baseline, respectively. The distributional discrepancy between generated responses and ground-truth responses is arguably a more reasonable metric than single-response judgments; we leave this to future work.

Human evaluation  Informativeness is not easily measured with automatic metrics, so we performed a human evaluation on 600 randomly sampled sources using crowd-sourcing. Systems were paired, and each pair of system outputs was randomly presented to 7 judges, who ranked them for informativeness and relevance5. The human preferences are shown in Table 3. A statistically significant (p < 0.00001) preference for DAIM over MMI is observed with respect to informativeness, while relevance judgments are on par with MMI. MMI proved a strong baseline: the other two GAN systems are (with one exception) statistically indistinguishable from MMI, which in turn performs significantly better than seq2seq. Box charts illustrating these results can be found in the Supplementary Material.

Table 4: Quantitative evaluation on the Twitter dataset.

                      Relevance                        |       Diversity
Models    BLEU   ROUGE   Greedy   Average   Extreme    | Dist-1   Dist-2   Ent-4
seq2seq   0.64   0.62    1.669    0.54      0.34       | 0.020    0.084    6.427
cGAN      0.62   0.61    1.68     0.536     0.329      | 0.028    0.102    6.631
AIM       0.85   0.82    1.960    0.645     0.370      | 0.030    0.092    7.245
DAIM      0.81   0.77    1.845    0.588     0.344      | 0.032    0.137    7.907
MMI       0.80   0.75    1.876    0.591     0.348      | 0.028    0.105    7.156

4.3 Evaluation on Twitter data

We further compared our methods on the Twitter dataset; the results are shown in Table 4. We treated all dialog history before the last response in a multi-turn conversation session as the source sentence, and used the last response as the target to form our dataset.
We employed a CNN as our encoder because a CNN-based encoder is presumably advantageous in tracking long dialog history compared to an LSTM encoder. We truncated the vocabulary to the 20k most frequent words due to limited flash memory capacity, and evaluated each method on 2k test samples.
Adversarial training encourages the generation of more diverse sentences, at the cost of slightly decreased relevance scores. We hypothesize that this decrease is partially attributable to the evaluation metrics we used. All the relevance metrics are based on utterance-pair discrepancy, i.e., the score assesses how close the system output is to the ground-truth response. An MLE system output thus tends to obtain a high score despite being bland, because an MLE response is by design the most “relevant” to any random response. On the other hand, adding diversity without improving semantic relevance may occasionally hurt these relevance scores.
However, the additional MI term seems to compensate for the relevance decrease and improves response diversity, especially on Dist-n and Ent-n with larger values of n. Sampled responses are provided in the Supplementary Material.

5 Relevance relates to the degree to which judges perceived the output to be semantically tied to the previous turn, and can be regarded as a constraint on informativeness: an affirmative response like “Sure” or “Yes” is relevant but not very informative.

5 Conclusion

In this paper we propose a novel adversarial learning method, Adversarial Information Maximization (AIM), for training response generation models that promote informative and diverse conversations between humans and dialogue agents.
AIM can be viewed as a more principled version of the classical MMI method, in that AIM directly optimizes (a lower bound of) the MMI objective during model training, whereas the MMI method only uses the objective to rerank response candidates during decoding. We then extend AIM to DAIM by incorporating a dual objective so as to simultaneously learn forward and backward models. We evaluated our methods on two real-world datasets. The results demonstrate that our methods do lead to more informative and diverse responses than existing methods.

Acknowledgements

We thank Adji Bousso Dieng, Asli Celikyilmaz, Sungjin Lee, Chris Quirk, and Chengtao Li for helpful discussions. We thank the anonymous reviewers for their constructive feedback.

References

[1] Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to conversational AI. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1371–1374. ACM, 2018.

[2] Lifeng Shang, Zhengdong Lu, and Hang Li. Neural responding machine for short-text conversation. In ACL, 2015.

[3] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. A neural network approach to context-sensitive generation of conversational responses. In NAACL, 2016.

[4] Oriol Vinyals and Quoc Le. A neural conversational model. In ICML Deep Learning Workshop, 2015.

[5] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In NAACL, 2016.

[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

[7] Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation.
In EMNLP, 2017.

[8] Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, Xiaolong Wang, Zhuoran Wang, and Chao Qi. Neural response generation via GAN with an approximate embedding layer. In EMNLP, 2017.

[9] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.

[10] David Barber and Felix Agakov. The IM algorithm: A variational approach to information maximization. In NIPS, 2003.

[11] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.

[12] Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In ICML, 2017.

[13] Zhe Gan, Liqun Chen, Weiyao Wang, Yuchen Pu, Yizhe Zhang, Hao Liu, Chunyuan Li, and Lawrence Carin. Triangle generative adversarial networks. In NIPS, 2017.

[14] Yingce Xia, Tao Qin, Wei Chen, Jiang Bian, Nenghai Yu, and Tie-Yan Liu. Dual supervised learning. In ICML, 2017.

[15] Yunchen Pu, Shuyang Dai, Zhe Gan, Weiyao Wang, Guoyin Wang, Yizhe Zhang, Ricardo Henao, and Lawrence Carin. JointGAN: Multi-domain joint distribution learning with generative adversarial nets. In ICML, 2018.

[16] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[17] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.

[18] Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In ACL, 2018.

[19] Martin Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein GAN. In ICML, 2017.

[20] Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, and Lawrence Carin. Learning generic sentence representations using convolutional neural networks. In EMNLP, 2017.

[21] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.

[22] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.

[23] Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. Deep reinforcement learning-based image captioning with embedding reward. In CVPR, 2017.

[24] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[25] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

[26] Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In NIPS, 2016.

[27] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.

[28] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In ICLR, 2017.

[29] Wenlin Wang, Yunchen Pu, Vinay Kumar Verma, Kai Fan, Yizhe Zhang, Changyou Chen, Piyush Rai, and Lawrence Carin. Zero-shot learning via class-conditioned deep generative models. In AAAI, 2018.

[30] Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. Improving neural machine translation with conditional sequence generative adversarial nets.
In NAACL, 2018.

[31] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. In EMNLP, 2016.

[32] Bo Dai, Dahua Lin, Raquel Urtasun, and Sanja Fidler. Towards diverse and natural image descriptions via a conditional GAN. In ICCV, 2017.

[33] Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. In AAAI, 2018.

[34] Jingjing Xu, Xu Sun, Xuancheng Ren, Junyang Lin, Binzhen Wei, and Wei Li. DP-GAN: Diversity-promoting generative adversarial network for generating informative and diversified text. In EMNLP, 2018.

[35] Chunyuan Li, Hao Liu, Changyou Chen, Yuchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. In NIPS, 2017.

[36] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.

[37] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. In ICLR, 2017.

[38] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.

[39] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In ACL Workshop, 2004.

[40] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, 2017.

[41] Vasile Rus and Mihai Lintean. A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics.
In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 2012.

[42] Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In ACL, 2008.

[43] Gabriel Forgues, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. Bootstrapping dialog systems with word embeddings. In NIPS Modern Machine Learning and Natural Language Processing Workshop, 2014.

[44] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. Hierarchical neural network generative models for movie dialogues. In AAAI, 2016.

[45] Dinghan Shen, Yizhe Zhang, Ricardo Henao, Qinliang Su, and Lawrence Carin. Deconvolutional latent-variable model for text sequence matching. In AAAI, 2018.

[46] Yizhe Zhang, Dinghan Shen, Guoyin Wang, Zhe Gan, Ricardo Henao, and Lawrence Carin. Deconvolutional paragraph representation learning. In NIPS, 2017.