{"title": "Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space", "book": "Advances in Neural Information Processing Systems", "page_first": 5756, "page_last": 5766, "abstract": "This paper explores image caption generation using conditional variational auto-encoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around K components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture model (GMM) prior, while the second one defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a \u201cvanilla\u201d CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.", "full_text": "Diverse and Accurate Image Description Using a\n\nVariational Auto-Encoder with an Additive Gaussian\n\nEncoding Space\n\nLiwei Wang\n\nAlexander G. Schwing\n\nSvetlana Lazebnik\n\n{lwang97, aschwing, slazebni}@illinois.edu\n\nUniversity of Illinois at Urbana-Champaign\n\nAbstract\n\nThis paper explores image caption generation using conditional variational auto-\nencoders (CVAEs). Standard CVAEs with a \ufb01xed Gaussian prior yield descriptions\nwith too little variability. Instead, we propose two models that explicitly structure\nthe latent space around K components corresponding to different types of image\ncontent, and combine components to create priors for images that contain multiple\ntypes of content simultaneously (e.g., several kinds of objects). 
Our \ufb01rst model\nuses a Gaussian Mixture model (GMM) prior, while the second one de\ufb01nes a novel\nAdditive Gaussian (AG) prior that linearly combines component means. We show\nthat both models produce captions that are more diverse and more accurate than\na strong LSTM baseline or a \u201cvanilla\u201d CVAE with a \ufb01xed Gaussian prior, with\nAG-CVAE showing particular promise.\n\nIntroduction\n\n1\nAutomatic image captioning [9, 11, 18\u201320, 24] is a challenging open-ended conditional generation\ntask. State-of-the-art captioning techniques [23, 32, 36, 1] are based on recurrent neural nets with\nlong-short term memory (LSTM) units [13], which take as input a feature representation of a provided\nimage, and are trained to maximize the likelihood of reference human descriptions. Such methods are\ngood at producing relatively short, generic captions that roughly \ufb01t the image content, but they are\nunsuited for sampling multiple diverse candidate captions given the image. The ability to generate\nsuch candidates is valuable because captioning is profoundly ambiguous: not only can the same image\nbe described in many different ways, but also, images can be hard to interpret even for humans, let\nalone machines relying on imperfect visual features. In short, we would like the posterior distribution\nof captions given the image, as estimated by our model, to accurately capture both the open-ended\nnature of language and any uncertainty about what is depicted in the image.\nAchieving more diverse image description is a major theme in several recent works [6, 14, 27, 31, 35].\nDeep generative models are a natural \ufb01t for this goal, and to date, Generative Adversarial Models\n(GANs) have attracted the most attention. Dai et al. [6] proposed jointly learning a generator to\nproduce descriptions and an evaluator to assess how well a description \ufb01ts the image. Shetty et\nal. 
[27] changed the training objective of the generator from reproducing ground-truth captions to\ngenerating captions that are indistinguishable from those produced by humans.\nIn this paper, we also explore a generative model for image description, but unlike the GAN-style\ntraining of [6, 27], we adopt the conditional variational auto-encoder (CVAE) formalism [17, 29].\nOur starting point is the work of Jain et al. [14], who trained a \u201cvanilla\u201d CVAE to generate questions\ngiven images. At training time, given an image and a sentence, the CVAE encoder samples a latent z\nvector from a Gaussian distribution in the encoding space whose parameters (mean and variance)\ncome from a Gaussian prior with zero mean and unit variance. This z vector is then fed into a decoder\nthat uses it, together with the features of the input image, to generate a question. The encoder and the\ndecoder are jointly trained to maximize (an upper bound on) the likelihood of the reference questions\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Example output of our proposed AG-CVAE approach compared to an LSTM baseline\n(see Section 4 for details). For each method, we show top \ufb01ve sentences following consensus\nre-ranking [10]. The captions produced by our method are both more diverse and more accurate.\n\nFigure 2: Illustration of how our additive latent space structure controls the image description process.\nModifying the object labels changes the weight vectors associated with semantic components in\nthe latent space. In turn, this shifts the mean from which the z vectors are drawn and modi\ufb01es the\nresulting descriptions in an intuitive way.\n\ngiven the images. At test time, the decoder is seeded with an image feature and different z samples,\nso that multiple z\u2019s result in multiple questions.\nWhile Jain et al. 
[14] obtained promising question generation performance with the above CVAE\nmodel equipped with a \ufb01xed Gaussian prior, for the task of image captioning, we observed a tendency\nfor the learned conditional posteriors to collapse to a single mode, yielding little diversity in candidate\ncaptions sampled given an image. To improve the behavior of the CVAE, we propose using a set of K\nGaussian priors in the latent z space with different means and standard deviations, corresponding to\ndifferent \u201cmodes\u201d or types of image content. For concreteness, we identify these modes with speci\ufb01c\nobject categories, such as \u2018dog\u2019 or \u2018cat.\u2019 If \u2018dog\u2019 and \u2018cat\u2019 are detected in an image, we would like to\nencourage the generated captions to capture both of them.\nStarting with the idea of multiple Gaussian priors, we propose two different ways of structuring\nthe latent z space. The \ufb01rst is to represent the distribution of z vectors using a Gaussian Mixture\nmodel (GMM). Due to the intractability of Gaussian mixtures in the VAE framework, we also\nintroduce a novel Additive Gaussian (AG) prior that directly adds multiple semantic aspects in the\nz space. If an image contains several objects or aspects, each corresponding to means \u00b5k in the\nlatent space, then we require the mean of the encoder distribution to be close to a weighted linear\ncombination of the respective means. Our CVAE formulation with this additive Gaussian prior\n(AG-CVAE) is able to model a richer, more \ufb02exible encoding space, resulting in more diverse and\naccurate captions, as illustrated in Figure 1. As an additional advantage, the additive prior gives us an\ninterpretable mechanism for controlling the captions based on the image content, as shown in Figure\n2. 
Experiments of Section 4 will show that both GMM-CVAE and AG-CVAE outperform LSTMs and “vanilla” CVAE baselines on the challenging MSCOCO dataset [5], with AG-CVAE showing marginally higher accuracy and by far the best diversity and controllability.

2 Background
Our proposed framework for image captioning extends the standard variational auto-encoder [17] and its conditional variant [29]. We briefly set up the necessary background here.
Variational auto-encoder (VAE): Given samples x from a dataset, VAEs aim at modeling the data likelihood p(x). To this end, VAEs assume that the data points x cluster around a low-dimensional manifold parameterized by embeddings or encodings z. To obtain the sample x corresponding to an embedding z, we employ the decoder p(x|z), which is often based on deep nets. Since the decoder’s posterior p(z|x) is not tractably computable, we approximate it with a distribution q(z|x), which is referred to as the encoder. Taking all those ingredients together, VAEs are based on the identity

log p(x) − DKL[q(z|x), p(z|x)] = E_{q(z|x)}[log p(x|z)] − DKL[q(z|x), p(z)],   (1)

which relates the likelihood p(x) and the conditional p(z|x). It is hard to compute the KL-divergence DKL[q(z|x), p(z|x)] because the posterior p(z|x) is not readily available from the decoder distribution p(x|z) if we use deep nets. However, by choosing an encoder distribution q(z|x) with sufficient capacity, we can assume that the non-negative KL-divergence DKL[q(z|x), p(z|x)] is small. Thus, we know that the right-hand side is a lower bound on the log-likelihood log p(x), which can be maximized w.r.t.
both encoder and decoder parameters.
Conditional variational auto-encoders (CVAE): In tasks like image captioning, we are interested in modeling the conditional distribution p(x|c), where x are the desired descriptions and c is some representation of the content of the input image. The VAE identity can be straightforwardly extended by conditioning both the encoder and decoder distributions on c. Training of the encoder and decoder proceeds by maximizing the lower bound on the conditional data log-likelihood p(x|c), i.e.,

log pθ(x|c) ≥ E_{qφ(z|x,c)}[log pθ(x|z, c)] − DKL[qφ(z|x, c), p(z|c)],   (2)

where θ and φ are the parameters of the decoder distribution pθ(x|z, c) and the encoder distribution qφ(z|x, c), respectively. In practice, the following stochastic objective is typically used:

max_{θ,φ} (1/N) Σ_{i=1}^N log pθ(xi|zi, ci) − DKL[qφ(z|x, c), p(z|c)],   s.t. ∀i  zi ∼ qφ(z|x, c).

It approximates the expectation E_{qφ(z|x,c)}[log pθ(x|z, c)] using N samples zi drawn from the approximate posterior qφ(z|x, c) (typically, just a single sample is used). Backpropagation through the encoder that produces samples zi is achieved via the reparameterization trick [17], which is applicable if we restrict the encoder distribution qφ(z|x, c) to be, e.g., a Gaussian with mean and standard deviation output by a deep net.
3 Gaussian Mixture Prior and Additive Gaussian Prior
Our key observation is that the behavior of the trained CVAE crucially depends on the choice of the prior p(z|c). The prior determines how the learned latent space is structured, because the KL-divergence term in Eq.
(2) encourages q\u03c6(z|x, c), the encoder distribution over z given a particular\ndescription x and image content c, to be close to this prior distribution.\nIn the vanilla CVAE\nformulation, such as the one adopted in [14], the prior is not dependent on c and is \ufb01xed to a\nzero-mean unit-variance Gaussian. While this choice is the most computationally convenient, our\nexperiments in Sec. 4 will demonstrate that for the task of image captioning, the resulting model has\npoor diversity and worse accuracy than the standard maximum-likelihood-trained LSTM. Clearly, the\nprior has to change based on the content of the image. However, because of the need to ef\ufb01ciently\ncompute the KL-divergence in closed form, it still needs to have a simple structure, ideally a Gaussian\nor a mixture of Gaussians.\nMotivated by the above considerations, we encourage the latent z space to have a multi-modal\nstructure composed of K modes or clusters, each corresponding to different types of image content.\nGiven an image I, we assume that we can obtain a distribution c(I) = (c1(I), . . . , cK(I)), where the\nentries ck are nonnegative and sum to one. In our current work, for concreteness, we identify these\nwith a set of object categories that can be reliably detected automatically, such as \u2018car,\u2019 \u2018person,\u2019 or\n\u2018cat.\u2019 The MSCOCO dataset, on which we conduct our experiments, has direct supervision for 80\nsuch categories. 
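The requirement that the KL-divergence be computable in closed form is what makes Gaussian priors convenient. As a minimal illustration (NumPy code, not the authors' implementation; the per-dimension formula for two diagonal Gaussians is standard):

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for two diagonal Gaussians, summed over dimensions.

    Per dimension: log(sigma_p / sigma_q)
                   + (sigma_q^2 + (mu_q - mu_p)^2) / (2 * sigma_p^2) - 1/2
    """
    mu_q, mu_p = np.asarray(mu_q, float), np.asarray(mu_p, float)
    sigma_q = np.broadcast_to(np.asarray(sigma_q, float), mu_q.shape)
    sigma_p = np.broadcast_to(np.asarray(sigma_p, float), mu_p.shape)
    return float(np.sum(np.log(sigma_p / sigma_q)
                        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
                        - 0.5))
```

For identical distributions the divergence is zero, and it is always non-negative, which is what lets the lower bound of Eq. (2) be optimized directly.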
Note, however, our formulation is general and can be applied to other definitions of modes or clusters, including latent topics automatically obtained in an unsupervised fashion.
GMM-CVAE: We can model p(z|c) as a Gaussian mixture with weights ck and components with means μk and standard deviations σk:

p(z|c) = Σ_{k=1}^K ck N(z | μk, σk² I),   (3)

where ck are the weights defined above and μk represents the mean vector of the k-th component. In practice, for all components, we use the same standard deviation σ.

(a) GMM-CVAE   (b) AG-CVAE
Figure 3: Overview of GMM-CVAE and AG-CVAE models. To sample z vectors given an image, GMM-CVAE (a) switches from one cluster center to another, while AG-CVAE (b) encourages the embedding z for an image to be close to the average of its objects’ means.

It is not directly tractable to optimize Eq. (2) with the above GMM prior. We therefore approximate the KL divergence stochastically [12]. In each step during training, we first draw a discrete component k according to the cluster probability c(I), and then sample z from the resulting Gaussian component. Then we have

DKL[qφ(z|x, ck), p(z|ck)] = log(σk/σφ) + E_{qφ(z|x,ck)}[‖z − μk‖²] / (2σk²) − 1/2
                          = log(σk/σφ) + (σφ² + ‖μφ − μk‖²) / (2σk²) − 1/2,   k ∼ c(I).   (4)

We plug the above KL term into Eq. (2) to obtain an objective function, which we optimize w.r.t. the encoder and decoder parameters φ and θ using stochastic gradient descent (SGD).
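The stochastic KL estimate just described can be sketched as follows (an illustrative NumPy sketch, not the authors' code; the cluster means below are random placeholders, and the expression mirrors the closed-form KL of Eq. (4) with scalar standard deviations):

```python
import numpy as np

def gmm_kl_sample(mu_phi, sigma_phi, cluster_means, sigma_k, weights, rng):
    """One-sample estimate of the KL term for the GMM prior.

    Draws a component index k ~ c(I), then evaluates the closed-form KL
    between the encoder q = N(mu_phi, sigma_phi^2 I) and the chosen
    component N(mu_k, sigma_k^2 I), summed over the d latent dimensions.
    """
    k = rng.choice(len(weights), p=weights)      # draw component k ~ c(I)
    mu_k = cluster_means[k]
    d = mu_phi.shape[0]
    kl = (d * (np.log(sigma_k / sigma_phi) - 0.5)
          + (d * sigma_phi ** 2 + np.sum((mu_phi - mu_k) ** 2)) / (2 * sigma_k ** 2))
    return kl, k

rng = np.random.default_rng(0)
means = rng.normal(size=(3, 5))                  # K=3 placeholder clusters, 5-dim z
kl, k = gmm_kl_sample(means[1], 0.1, means, 0.1, [0.2, 0.5, 0.3], rng)
```

When the encoder matches the sampled component exactly (same mean and standard deviation), the estimate is zero, as expected.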
In principle, the prior parameters μk and σk can also be trained, but we obtained good results by keeping them fixed (the means are drawn randomly and all standard deviations are set to the same constant, as will be further explained in Section 4).
At test time, in order to generate a description given an image I, we first sample a component index k from c(I), and then sample z from the corresponding component distribution. One limitation of this procedure is that, if an image contains multiple objects, each individual description is still conditioned on just a single object.
AG-CVAE: We would like to structure the z space in a way that can directly reflect object co-occurrence. To this end, we propose a simple novel conditioning mechanism with an additive Gaussian prior. If an image contains several objects with weights ck, each corresponding to means μk in the latent space, we want the mean of the encoder distribution to be close to the linear combination of the respective means with the same weights:

p(z|c) = N(z | Σ_{k=1}^K ck μk, σ² I),   (5)

where σ²I is a spherical covariance matrix with σ² = Σ_{k=1}^K ck² σk². Figure 3 illustrates the difference between this AG-CVAE model and the GMM-CVAE model introduced above.
In order to train the AG-CVAE model using the objective of Eq. (2), we need to compute the KL-divergence DKL[qφ(z|x, c), p(z|c)], where qφ(z|x, c) = N(z | μφ(x, c), σφ²(x, c) I) and the prior p(z|c) is given by Eq. (5). Its analytic expression can be derived to be

DKL[qφ(z|x, c), p(z|c)] = log(σ/σφ) + E_{qφ}[‖z − Σ_{k=1}^K ck μk‖²] / (2σ²) − 1/2
                        = log(σ/σφ) + (σφ² + ‖μφ − Σ_{k=1}^K ck μk‖²) / (2σ²) − 1/2.

We plug the above KL-divergence term into Eq. (2) to obtain the stochastic objective function for training the encoder and decoder parameters. We initialize the mean and variance parameters μk and σk in the same way as for GMM-CVAE and keep them fixed throughout training.

Figure 4: Illustration of our encoder (left) and decoder (right). See text for details.

Next, we need to specify our architectures for the encoder and decoder, which are shown in Fig. 4. The encoder uses an LSTM to map an image I, its vector c(I), and a caption into a point in the latent space. More specifically, the LSTM receives the image feature in the first step, the cluster vector in the second step, and then the caption word by word. The hidden state hT after the last step is transformed into K mean vectors, μφk, and K log variances, log σφk², using a linear layer for each. For AG-CVAE, the μφk and σφk² are then summed with weights ck and ck² respectively to generate the desired μφ and σφ² encoder outputs.
Note that the encoder is used at training time only, and the input cluster vectors are produced from ground truth object annotations.
The decoder uses a different LSTM that receives as input first the image feature, then the cluster vector, then a z vector sampled from the conditional distribution of Eq. (5). Next, it receives a ‘start’ symbol and proceeds to output a sentence word by word until it produces an ‘end’ symbol. During training, its c(I) inputs are derived from the ground truth, same as for the encoder, and the log-loss is used to encourage reconstruction of the provided ground-truth caption. At test time, ground truth object vectors are not available, so we rely on automatic object detection, as explained in Section 4.
4 Experiments
4.1 Implementation Details
We test our methods on the MSCOCO dataset [5], which is the largest “clean” image captioning dataset available to date. The current (2014) release contains 82,783 training and 40,504 validation images with five reference captions each, but many captioning works re-partition this data to enlarge the training set. We follow the train/val/test split released by [23]. It allocates 118,287 images for training, 4,000 for validation, and 1,000 for testing.
Features. As image features, we use 4,096-dimensional activations from the VGG-16 network [28]. The cluster or object vectors c(I) are 80-dimensional, corresponding to the 80 MSCOCO object categories. At training time, c(I) consist of binary indicators corresponding to ground truth object labels, rescaled to sum to one. For example, an image with labels ‘person,’ ‘car,’ and ‘dog’ results in a cluster vector with weights of 1/3 for the corresponding objects and zeros elsewhere. For test images I, c(I) are obtained automatically through object detection. We train a Faster R-CNN detector [26] for the MSCOCO categories using our train/val split by fine-tuning the VGG-16 net [28].
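The cluster-vector construction and the resulting additive prior of Eq. (5) can be sketched as follows (a toy illustration, not the authors' code; the five-entry category list is a hypothetical stand-in for the 80 MSCOCO categories):

```python
import numpy as np

# Hypothetical small category list standing in for the 80 MSCOCO categories.
CATEGORIES = ['person', 'car', 'dog', 'cat', 'bus']

def cluster_vector(labels):
    """Binary indicators over categories, rescaled to sum to one."""
    c = np.array([1.0 if cat in labels else 0.0 for cat in CATEGORIES])
    return c / c.sum() if c.sum() > 0 else c

def ag_prior(c, cluster_means, sigma_k):
    """Additive Gaussian prior of Eq. (5):
    mean = sum_k c_k mu_k, variance = (sum_k c_k^2) sigma_k^2."""
    mu = c @ cluster_means                 # weighted combination of cluster means
    var = np.sum(c ** 2) * sigma_k ** 2    # spherical covariance scale
    return mu, var

# An image labeled {'person', 'car', 'dog'} gets weight 1/3 on each of those entries.
c = cluster_vector({'person', 'car', 'dog'})
mu, var = ag_prior(c, np.random.default_rng(0).normal(size=(5, 4)), 0.1)
```

Sampling z from N(mu, var * I) then seeds the decoder, so changing the detected labels shifts the prior mean and, in turn, the generated descriptions.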
At test\ntime, we use a threshold of 0.5 on the per-class con\ufb01dence scores output by this detector to determine\nwhether the image contains a given object (i.e., all the weights are once again equal).\nBaselines. Our LSTM baseline is obtained by deleting the z vector input from the decoder architec-\nture shown in Fig. 4. This gives a strong baseline comparable to NeuralTalk2 [1] or Google Show\nand Tell [33]. To generate different candidate sentences using the LSTM, we use beam search with\na width of 10. Our second baseline is given by the \u201cvanilla\u201d CVAE with a \ufb01xed Gaussian prior\nfollowing [14]. For completeness, we report the performance of our method as well as all baselines\nboth with and without the cluster vector input c(I).\nParameter settings and training. For all the LSTMs, we use a one-hot encoding with vocabulary\nsize of 11,488, which is the number of words in the training set. This input gets projected into a word\nembedding layer of dimension 256, and the LSTM hidden space dimension is 512. We found that\nthe same LSTM settings worked well for all models. For our three models (CVAE, GMM-CVAE,\nand AG-CVAE), we use a dimension of 150 for the z space. 
We wanted it to be at least equal to the number of categories to make sure that each z vector corresponds to a unique set of cluster weights. The means μk of clusters for GMM-CVAE and AG-CVAE are randomly initialized on the unit ball and are not changed throughout training. The standard deviations σk are set to 0.1 at training time and tuned on the validation set at test time (the values used for our results are reported in the tables). All networks are trained with SGD with a learning rate that is 0.01 for the first 5 epochs, and is reduced by half every 5 epochs. On average all models converge within 50 epochs.

Method     obj  #z   std  beam   B1     B2     B3     B4     C      R      M      S
LSTM        -   -    -    10     0.790  0.643  0.515  0.413  1.157  0.597  0.285  0.218
LSTM        ✓   -    -    10     0.797  0.654  0.529  0.428  1.202  0.607  0.290  0.223
CVAE        -   20   0.1  -      0.742  0.538  0.381  0.261  0.860  0.531  0.246  0.184
CVAE        ✓   20   2    -      0.733  0.565  0.421  0.312  0.910  0.541  0.244  0.176
GMM-CVAE    -   20   0.1  -      0.778  0.619  0.481  0.371  1.080  0.582  0.274  0.209
GMM-CVAE    ✓   20   2    -      0.813  0.666  0.533  0.423  1.216  0.617  0.298  0.233
GMM-CVAE    ✓   20   2    2      0.821  0.680  0.553  0.449  1.251  0.624  0.299  0.232
GMM-CVAE    ✓   100  2    -      0.856  0.719  0.597  0.494  1.378  0.659  0.325  0.261
GMM-CVAE    ✓   100  2    2      0.865  0.740  0.625  0.527  1.430  0.670  0.329  0.263
AG-CVAE     -   20   0.1  -      0.814  0.668  0.537  0.431  1.230  0.622  0.300  0.235
AG-CVAE     ✓   20   2    -      0.829  0.686  0.557  0.451  1.259  0.630  0.305  0.243
AG-CVAE     ✓   20   2    2      0.834  0.698  0.573  0.471  1.308  0.638  0.309  0.244
AG-CVAE     ✓   100  2    -      0.876  0.749  0.631  0.532  1.478  0.682  0.342  0.278
AG-CVAE     ✓   100  2    2      0.883  0.767  0.654  0.557  1.517  0.690  0.345  0.277
Table 1: Oracle (upper bound) performance according to each metric. Obj indicates whether the object (cluster) vector is used; #z is the number of z samples; std is the test-time standard deviation; beam is the beam width if beam search is used. For the caption quality metrics, B1–B4 are BLEU-1 through BLEU-4, C is short for CIDEr, R for ROUGE, M for METEOR, S for SPICE.

Method     obj  #z   std  beam   B1     B2     B3     B4     C      R      M      S
LSTM        -   -    -    10     0.702  0.529  0.388  0.286  0.915  0.510  0.235  0.165
LSTM        ✓   -    -    10     0.711  0.536  0.395  0.292  0.947  0.516  0.238  0.170
CVAE        -   20   0.1  -      0.674  0.495  0.347  0.245  0.775  0.491  0.217  0.147
CVAE        ✓   20   2    -      0.698  0.521  0.372  0.265  0.834  0.506  0.225  0.158
GMM-CVAE    -   20   0.1  -      0.702  0.522  0.376  0.271  0.890  0.507  0.231  0.166
GMM-CVAE    ✓   20   2    -      0.718  0.538  0.388  0.278  0.932  0.516  0.238  0.170
GMM-CVAE    ✓   20   2    2      0.715  0.538  0.394  0.289  0.941  0.513  0.235  0.169
GMM-CVAE    ✓   100  2    -      0.728  0.552  0.402  0.292  0.972  0.520  0.241  0.174
GMM-CVAE    ✓   100  2    2      0.729  0.557  0.413  0.307  0.986  0.525  0.242  0.177
AG-CVAE     -   20   0.1  -      0.715  0.540  0.394  0.287  0.942  0.518  0.238  0.168
AG-CVAE     ✓   20   2    -      0.716  0.537  0.391  0.286  0.953  0.517  0.239  0.172
AG-CVAE     ✓   20   2    2      0.716  0.544  0.402  0.299  0.963  0.518  0.237  0.173
AG-CVAE     ✓   100  2    -      0.732  0.557  0.410  0.301  0.991  0.527  0.243  0.177
AG-CVAE     ✓   100  2    2      0.732  0.559  0.417  0.311  1.001  0.528  0.245  0.179
Table 2: Consensus re-ranking using CIDEr. See caption of Table 1 for legend.

4.2 Results
A big part of the motivation for generating diverse candidate captions is the prospect of being able to re-rank them using some discriminative method. Because the performance of any re-ranking method is upper-bounded by the quality of the best candidate caption in the set, we will first evaluate different methods assuming an oracle that can choose the best sentence among all the candidates. Next, for a more realistic evaluation, we will use a consensus re-ranking approach [10] to automatically select a single top candidate per image. Finally, we will assess the diversity of the generated captions using uniqueness and novelty metrics.
Oracle evaluation. Table 1 reports caption evaluation metrics in the oracle setting, i.e., taking the maximum of each relevant metric over all the candidates. We compare caption quality using five metrics: BLEU [25], METEOR [7], CIDEr [30], SPICE [2], and ROUGE [21]. These are calculated using the MSCOCO caption evaluation tool [5] augmented by the author of SPICE [2]. For the LSTM baseline, we report the scores attained among 10 candidates generated using beam search (as suggested in [23]).
For CVAE, GMM-CVAE and AG-CVAE, we sample a fixed number of z vectors from the corresponding prior distributions (the numbers of samples are given in the table).
The high-level trend is that “vanilla” CVAE falls short even of the LSTM baseline, while the upper-bound performance for GMM-CVAE and AG-CVAE considerably exceeds that of the LSTM given the right choice of standard deviation and a large enough number of z samples. AG-CVAE obtains the highest upper bound.

Method     obj  #z   std  beam   % unique per image   % novel sentences
LSTM        ✓   -    -    10     -                    0.656
CVAE        ✓   20   2    -      0.118                0.820
GMM-CVAE    ✓   20   2    -      0.594                0.809
GMM-CVAE    ✓   20   2    2      0.539                0.716
GMM-CVAE    ✓   100  2    -      0.376                0.767
GMM-CVAE    ✓   100  2    2      0.326                0.688
AG-CVAE     ✓   20   2    -      0.764                0.795
AG-CVAE     ✓   20   2    2      0.698                0.707
AG-CVAE     ✓   100  2    -      0.550                0.745
AG-CVAE     ✓   100  2    2      0.474                0.667
Table 3: Diversity evaluation. For each method, we report the percentage of unique candidates generated per image by sampling different numbers of z vectors. We also report the percentage of novel sentences (i.e., sentences not seen in the training set) out of (at most) the top 10 sentences following consensus re-ranking. It should be noted that for CVAE, there are 2,466 novel sentences out of 3,006. For GMM-CVAE and AG-CVAE, we get roughly 6,200–7,800 novel sentences.

Figure 5: Comparison of captions produced by our AG-CVAE method and the LSTM baseline. For each method, the top five captions following consensus re-ranking are shown.
A big advantage of the CVAE variants over the LSTM is that they can be easily\nused to generate more candidate sentences simply by increasing the number of z samples, while the\nonly way to do so for the LSTM is to increase the beam width, which is computationally prohibitive.\nIn more detail, the top two lines of Table 1 compare performance of the LSTM with and without the\nadditional object (cluster) vector input, and show that it does not make a dramatic difference. That is,\nimproving over the LSTM baseline is not just a matter of adding stronger conditioning information\nas input. Similarly, for CVAE, GMM-CVAE, and AG-CVAE, using the object vector as additional\nconditioning information in the encoder and decoder can increase accuracy somewhat, but does not\naccount for all the improvements that we see. One thing we noticed about the models without the\nobject vector is that they are more sensitive to the standard deviation parameter and require more\ncareful tuning (to demonstrate this, the table includes results for several values of \u03c3 for the CVAE\nmodels).\nConsensus re-ranking evaluation. For a more realistic evaluation we next compare the same models\nafter consensus re-ranking [10, 23]. Speci\ufb01cally, for a given test image, we \ufb01rst \ufb01nd its nearest\nneighbors in the training set in the cross-modal embedding space learned by a two-branch network\nproposed in [34]. Then we take all the ground-truth reference captions of those neighbors and\ncalculate the consensus re-ranking scores between them and the candidate captions. 
For this, we use the CIDEr metric, based on the observation of [22, 30] that it can give more human-consistent evaluations than BLEU.

[Figure 5: four example images (a)–(d) with their predicted object labels, (a) 'bottle,' 'refrigerator'; (b) 'person,' 'horse,' 'bear'; (c) 'person,' 'backpack,' 'umbrella'; (d) 'person,' 'bed', together with five candidate captions each from AG-CVAE and the LSTM baseline.]

Figure 6: Comparison of captions produced by GMM-CVAE and AG-CVAE for two different versions of input object vectors for the same images. For both models, we draw 20 z samples and show the resulting unique captions.

Table 2 shows the evaluation based on the single top-ranked sentence for each test image. While the re-ranked performance cannot get near the upper bounds of Table 1, the numbers follow a similar trend, with GMM-CVAE and AG-CVAE achieving better performance than the baselines in almost all metrics. It should also be noted that, while it is not our goal to outperform the state of the art in absolute terms, our performance is actually better than that of some of the best methods to date [23, 37], although [37] was trained on a different split. AG-CVAE tends to get slightly higher numbers than GMM-CVAE, although the advantage is smaller than for the upper-bound results in Table 1. One of the most important take-aways for us is that there is still a big gap between upper-bound and re-ranking performance, and that improving the re-ranking of candidate sentences is an important future direction.
Diversity evaluation. To compare the generative capabilities of our different methods, we report two indicative numbers in Table 3. One is the average percentage of unique captions in the set of candidates generated for each image. This number is only meaningful for the CVAE models, where we sample candidates by drawing different z samples, and multiple z's can result in the same caption. For the LSTM, the candidates are obtained using beam search and are by definition distinct.
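Both diversity statistics can be computed directly from the candidate lists. The following is a minimal sketch with hypothetical caption data; for brevity, the novelty ratio here is taken over all distinct candidates rather than over the top 10 re-ranked sentences per image as in the paper.

```python
def diversity_stats(candidates_per_image, training_captions):
    # Simplified versions of the two Table 3 statistics: the average
    # fraction of unique captions per image, and the fraction of distinct
    # generated sentences never seen in the training set.
    unique_fracs, novel, total = [], 0, 0
    for captions in candidates_per_image:
        distinct = set(captions)
        unique_fracs.append(len(distinct) / len(captions))
        for c in distinct:
            total += 1
            novel += c not in training_captions
    return sum(unique_fracs) / len(unique_fracs), novel / total

train_refs = {"a cat sitting on a suitcase"}
candidates = [["a cat on a chair",
               "a cat on a chair",  # different z samples can collide
               "a cat sitting on a suitcase",
               "a black and white cat on a suitcase"]]
print(diversity_stats(candidates, train_refs))  # (0.75, 0.666...)
```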
From Table 3, we observe that CVAE has very little diversity, GMM-CVAE is much better, but AG-CVAE has the decisive advantage.
Similarly to [27], we also report the percentage of all generated sentences for the test set that have not been seen in the training set. It only really makes sense to assess novelty for sentences that are plausible, so we compute this percentage based on (at most) the top 10 sentences per image after consensus re-ranking. Based on the novelty ratio, CVAE does well. However, since it generates fewer distinct candidates per image, the absolute numbers of novel sentences are much lower than for GMM-CVAE and AG-CVAE (see table caption for details).

[Figure 6: caption lists for two images. For the first, showing a cat, candidates are generated with object labels 'cat' and 'suitcase,' and then with 'chair' added; for the second, with labels 'cup,' 'dining table,' and 'teddy bear,' and then with 'sandwich' and 'cake' added. GMM-CVAE and AG-CVAE outputs are shown side by side for each label set.]

Qualitative results. Figure 5 compares captions generated by AG-CVAE and the LSTM baseline on four example images. The AG-CVAE captions tend to exhibit a more diverse sentence structure with a wider variety of nouns and verbs used to describe the same image. Often this yields captions that are more accurate ('open refrigerator' vs. 'refrigerator' in (a)) and better reflective of the cardinality and types of entities in the image (in (b), our captions mention both the person and the horse, while the LSTM tends to mention only one). Even when AG-CVAE does not manage to generate any correct candidates, as in (d), it still gets the right number of people in some candidates. A shortcoming of AG-CVAE is that detected objects frequently end up omitted from the candidate sentences if the LSTM language model cannot accommodate them ('bear' in (b) and 'backpack' in (c)).
On the one hand, this shows that the capacity of the LSTM decoder to generate combinatorially complex sentences is still limited; on the other hand, it provides robustness against false positive detections.
Controllable sentence generation. Figure 6 illustrates how the output of our GMM-CVAE and AG-CVAE models changes when we modify the input object vectors in an attempt to control the generation process. Consistent with Table 3, we observe that for the same number of z samples, AG-CVAE produces more unique candidates than GMM-CVAE. Further, AG-CVAE is more flexible than GMM-CVAE and more responsive to the content of the object vectors. For the first image, showing a cat, when we add the additional object label 'chair,' AG-CVAE is able to generate some captions mentioning a chair, but GMM-CVAE is not. Similarly, in the second example, when we add the concepts of 'sandwich' and 'cake,' only AG-CVAE can generate some sentences that capture them. Still, the controllability of AG-CVAE leaves something to be desired, since, as observed above, it has trouble mentioning more than two or three objects in the same sentence, especially in unusual combinations.
5 Discussion
Our experiments have shown that both our proposed GMM-CVAE and AG-CVAE approaches generate image captions that are more diverse and more accurate than standard LSTM baselines. While GMM-CVAE and AG-CVAE have very similar bottom-line accuracies according to Table 2, AG-CVAE has a clear edge in terms of diversity (unique captions per image) and controllability, both quantitatively (Table 3) and qualitatively (Figure 6).
Related work. To date, CVAEs have been used for image question generation [14], but as far as we know, our work is the first to apply them to captioning. In [8], a mixture-of-Gaussians prior is used in CVAEs for colorization.
Their approach is similar in spirit to our GMM-CVAE, though it is based on mixture density networks [4] and uses a different approximation scheme during training.
Our CVAE formulation has some advantages over the CGAN approach adopted by other recent works aimed at the same general goals [6, 27]. GANs do not expose control over the structure of the latent space, while our additive prior results in an interpretable way to control the sampling process. GANs are also notoriously tricky to train, in particular for discrete sampling problems like sentence generation (Dai et al. [6] have to resort to reinforcement learning and Shetty et al. [27] to an approximate Gumbel sampler [15]). Our CVAE training is much more straightforward.
While we represent the z space as a simple vector space with multiple modes, it is possible to impose on it a more general graphical model structure [16], though this incurs a much greater level of complexity. Finally, from the viewpoint of inference, our work is also related to general approaches to diverse structured prediction, which focus on extracting multiple modes from a single energy function [3]. This is a hard problem necessitating sophisticated approximations, and we prefer to circumvent it by cheaply generating a large number of diverse and plausible candidates, so that “good enough” ones can be identified using simple re-ranking mechanisms.
Future work. We would like to investigate more general formulations for the conditioning information c(I), not necessarily relying on object labels whose supervisory information must be provided separately from the sentences. These can be obtained, for example, by automatically clustering nouns or noun phrases extracted from reference sentences, or even by clustering vector representations of entire sentences.
We are also interested in other tasks, such as question generation, where the cluster vectors can represent the question type ('what is,' 'where is,' 'how many,' etc.) as well as the image content. Control of the output by modifying the c vector would in this case be particularly natural.
Acknowledgments: This material is based upon work supported in part by the National Science Foundation under Grants No. 1563727 and 1718221, and by the Sloan Foundation. We would like to thank Jian Peng and Yang Liu for helpful discussions.

References

[1] Neuraltalk2. https://github.com/karpathy/neuraltalk2.
[2] P. Anderson, B. Fernando, M. Johnson, and S. Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.
[3] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse M-best solutions in Markov random fields. In ECCV, 2012.
[4] C. M. Bishop. Mixture density networks. 1994.
[5] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[6] B. Dai, D. Lin, R. Urtasun, and S. Fidler. Towards diverse and natural image descriptions via a conditional GAN. In ICCV, 2017.
[7] M. Denkowski and A. Lavie. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.
[8] A. Deshpande, J. Lu, M.-C. Yeh, and D. Forsyth. Learning diverse image colorization. In CVPR, 2017.
[9] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell. Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809, 2015.
[10] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick. Exploring nearest neighbor approaches for image captioning.
arXiv preprint arXiv:1505.04467, 2015.
[11] A. Farhadi, M. Hejrati, M. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[12] J. R. Hershey and P. A. Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In ICASSP, 2007.
[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[14] U. Jain, Z. Zhang, and A. Schwing. Creativity: Generating diverse questions using variational autoencoders. In CVPR, 2017.
[15] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. In ICLR, 2017.
[16] M. J. Johnson, D. Duvenaud, A. Wiltschko, S. Datta, and R. Adams. Structured VAEs: Composing probabilistic graphical models and variational autoencoders. In NIPS, 2016.
[17] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[18] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In ICML, 2014.
[19] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903, 2013.
[20] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Generalizing image captions for image-text parallel corpus. In ACL, 2013.
[21] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8, Barcelona, Spain, 2004.
[22] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy. Improved image captioning via policy gradient optimization of SPIDEr. In ICCV, 2017.
[23] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR, 2015.
[24] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747–756, 2012.
[25] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[26] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[27] R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz, and B. Schiele. Speaking the same language: Matching machine to human captions by adversarial training. In ICCV, 2017.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[29] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NIPS, 2015.
[30] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
[31] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424, 2016.
[32] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[33] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[34] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
[35] Z. Wang, F. Wu, W. Lu, J. Xiao, X. Li, Z. Zhang, and Y. Zhuang. Diverse image captioning via GroupTalk. In IJCAI, 2016.
[36] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[37] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.