{"title": "VEEGAN: Reducing Mode Collapse in GANs using Implicit Variational Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3308, "page_last": 3318, "abstract": "Deep generative models provide powerful tools for distributions over complicated manifolds, such as those of natural images. But many of these methods, including generative adversarial networks (GANs), can be difficult to train, in part because they are prone to mode collapse, which means that they characterize only a few modes of the true distribution. To address this, we introduce VEEGAN, which features a reconstructor network, reversing the action of the generator by mapping from data to noise. Our training objective retains the original asymptotic consistency guarantee of GANs, and can be interpreted as a novel autoencoder loss over the noise. In sharp contrast to a traditional autoencoder over data points, VEEGAN does not require specifying a loss function over the data, but rather only over the representations, which are standard normal by assumption. On an extensive set of synthetic and real world image datasets, VEEGAN indeed resists mode collapsing to a far greater extent than other recent GAN variants, and produces more realistic samples.", "full_text": "VEEGAN: Reducing Mode Collapse in GANs using Implicit Variational Learning\n\nAkash Srivastava\nSchool of Informatics, University of Edinburgh\nakash.srivastava@ed.ac.uk\n\nLazar Valkov\nSchool of Informatics, University of Edinburgh\nL.Valkov@sms.ed.ac.uk\n\nChris Russell\nThe Alan Turing Institute, London\ncrussell@turing.ac.uk\n\nMichael U. Gutmann\nSchool of Informatics, University of Edinburgh\nMichael.Gutmann@ed.ac.uk\n\nCharles Sutton\nSchool of Informatics, University of Edinburgh & The Alan Turing Institute\ncsutton@inf.ed.ac.uk\n\nAbstract\n\nDeep generative models provide powerful tools for distributions over complicated manifolds, such as those of natural images. 
But many of these methods, including generative adversarial networks (GANs), can be difficult to train, in part because they are prone to mode collapse, which means that they characterize only a few modes of the true distribution. To address this, we introduce VEEGAN, which features a reconstructor network, reversing the action of the generator by mapping from data to noise. Our training objective retains the original asymptotic consistency guarantee of GANs, and can be interpreted as a novel autoencoder loss over the noise. In sharp contrast to a traditional autoencoder over data points, VEEGAN does not require specifying a loss function over the data, but rather only over the representations, which are standard normal by assumption. On an extensive set of synthetic and real world image datasets, VEEGAN indeed resists mode collapsing to a far greater extent than other recent GAN variants, and produces more realistic samples.\n\n1 Introduction\n\nDeep generative models are a topic of enormous recent interest, providing a powerful class of tools for the unsupervised learning of probability distributions over difficult manifolds such as natural images [7, 11, 18]. Deep generative models are usually implicit statistical models [3], also called implicit probability distributions, meaning that they do not induce a density function that can be tractably computed, but rather provide a simulation procedure to generate new data points. Generative adversarial networks (GANs) [7] are an attractive such method, which have seen promising recent successes [17, 20, 23]. 
GANs train two deep networks in concert: a generator network that maps random noise, usually drawn from a multivariate Gaussian, to data items; and a discriminator network that estimates the likelihood ratio of the generator network to the data distribution, and is trained using an adversarial principle.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nDespite an enormous amount of recent work, GANs are notoriously fickle to train, and it has been observed [1, 19] that they often suffer from mode collapse, in which the generator network learns how to generate samples from a few modes of the data distribution but misses many other modes, even though samples from the missing modes occur throughout the training data.\n\nTo address this problem, we introduce VEEGAN,1 a variational principle for estimating implicit probability distributions that avoids mode collapse. While the generator network maps Gaussian random noise to data items, VEEGAN introduces an additional reconstructor network that maps the true data distribution to Gaussian random noise. We train the generator and reconstructor networks jointly by introducing an implicit variational principle, which encourages the reconstructor network not only to map the data distribution to a Gaussian, but also to approximately reverse the action of the generator. Intuitively, if the reconstructor learns both to map all of the true data to the noise distribution and to be an approximate inverse of the generator network, this will encourage the generator network to map from the noise distribution to the entirety of the true data distribution, thus resolving mode collapse.\n\nUnlike other adversarial methods that train reconstructor networks [4, 5, 22], the noise autoencoder dramatically reduces mode collapse. Unlike recent adversarial methods that also make use of a data autoencoder [1, 13, 14], VEEGAN autoencodes noise vectors rather than data items. 
This is a significant difference, because choosing an autoencoder loss for images is problematic, but for Gaussian noise vectors, an $\ell_2$ loss is entirely natural. Experimentally, on both synthetic and real-world image data sets, we find that VEEGAN is dramatically less susceptible to mode collapse, and produces higher-quality samples, than other state-of-the-art methods.\n\n2 Background\n\nImplicit probability distributions are specified by a sampling procedure, but do not have a tractable density [3]. Although a natural choice in many settings, implicit distributions have historically been seen as difficult to estimate. However, recent progress in formulating density estimation as a problem of supervised learning has allowed methods from the classification literature to enable implicit model estimation, both in the general case [6, 10] and for deep generative adversarial networks (GANs) in particular [7]. Let $\{x_i\}_{i=1}^N$ denote the training data, where each $x_i \in R^D$ is drawn from an unknown distribution $p(x)$. A GAN is a neural network $G_\gamma$ that maps representation vectors $z \in R^K$, typically drawn from a standard normal distribution, to data items $x \in R^D$. Because this mapping defines an implicit probability distribution, training is accomplished by introducing a second neural network $D_\omega$, called a discriminator, whose goal is to distinguish generator samples from true data samples. The parameters of these networks are estimated by solving the minimax problem\n\n$$\max_\omega \min_\gamma \; O_{GAN}(\omega, \gamma) := E_z[\log \sigma(D_\omega(G_\gamma(z)))] + E_x[\log(1 - \sigma(D_\omega(x)))],$$\n\nwhere $E_z$ indicates an expectation over the standard normal $z$, $E_x$ indicates an expectation over the data distribution $p(x)$, and $\sigma$ denotes the sigmoid function. 
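As a concrete illustration, the minimax objective above can be estimated by Monte Carlo from finite samples. The sketch below is illustrative only: the `disc` and `gen` callables and the toy one-dimensional setup are our own placeholders, not the paper's implementation.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def gan_objective(disc, gen, zs, xs):
    """Monte Carlo estimate of O_GAN(omega, gamma):
    E_z[log sigma(D(G(z)))] + E_x[log(1 - sigma(D(x)))].

    disc: discriminator D_omega mapping a data item to a real-valued logit.
    gen:  generator G_gamma mapping a noise sample to a data item.
    zs:   samples from the standard normal prior; xs: samples from p(x).
    The discriminator ascends this value; the generator descends it.
    """
    gen_term = sum(math.log(sigmoid(disc(gen(z)))) for z in zs) / len(zs)
    data_term = sum(math.log(1.0 - sigmoid(disc(x))) for x in xs) / len(xs)
    return gen_term + data_term
```

Note that a constant discriminator (e.g. `disc = lambda x: 0.0`) yields the value $2\log\tfrac{1}{2}$ regardless of the generator, which previews the mode-collapse intuition discussed below: a flat discriminator provides no learning signal for $\gamma$.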
At the optimum, in the limit of infinite data and arbitrarily powerful networks, we will have $D_\omega = \log q_\gamma(x)/p(x)$, where $q_\gamma$ is the density that is induced by running the network $G_\gamma$ on normally distributed input, and hence that $q_\gamma = p$ [7].\n\nUnfortunately, GANs can be difficult and unstable to train [19]. One common pathology that arises in GAN training is mode collapse, which is when samples from $q_\gamma(x)$ capture only a few of the modes of $p(x)$. An intuition behind why mode collapse occurs is that the only information that the objective function provides about $\gamma$ is mediated by the discriminator network $D_\omega$. For example, if $D_\omega$ is a constant, then $O_{GAN}$ is constant with respect to $\gamma$, and so learning the generator is impossible. When this situation occurs in a localized region of input space, for example, when there is a specific type of image that the generator cannot replicate, this can cause mode collapse.\n\n1VEEGAN is a Variational Encoder Enhancement to Generative Adversarial Networks. https://akashgit.github.io/VEEGAN/\n\n[Figure 1 shows, for each of two cases, the data space $p(x)$, the noise space $p_0(z)$, and the mappings $F_\theta$ and $G_\gamma$ between them.]\n\n(a) Suppose $F_\theta$ is trained to approximately invert $G_\gamma$. Then applying $F_\theta$ to true data is likely to produce a non-Gaussian distribution, allowing us to detect mode collapse.\n\nFigure 1: Illustration of how a reconstructor network $F_\theta$ can help to detect mode collapse in a deep generative network $G_\gamma$. The data distribution is $p(x)$ and the Gaussian is $p_0(z)$. 
See text for details.\n\n(b) When $F_\theta$ is trained to map the data to a Gaussian distribution, then treating $F_\theta \circ G_\gamma$ as an autoencoder provides learning signal to correct $G_\gamma$.\n\n3 Method\n\nThe main idea of VEEGAN is to introduce a second network $F_\theta$, which we call the reconstructor network, that is learned both to map the true data distribution $p(x)$ to a Gaussian and to approximately invert the generator network.\n\nTo understand why this might prevent mode collapse, consider the example in Figure 1. In both columns of the figure, the middle vertical panel represents the data space, where in this example the true distribution $p(x)$ is a mixture of two Gaussians. The bottom panel depicts the input to the generator, which is drawn from a standard normal distribution $p_0 = N(0, I)$, and the top panel depicts the result of applying the reconstructor network to the generated and the true data. The arrows labeled $G_\gamma$ show the action of the generator. The purple arrows labelled $F_\theta$ show the action of the reconstructor on the true data, whereas the green arrows show the action of the reconstructor on data from the generator. In this example, the generator has captured only one of the two modes of $p(x)$. The difference between Figures 1a and 1b is that the reconstructor networks are different.\n\nFirst, let us suppose (Figure 1a) that we have successfully trained $F_\theta$ so that it is approximately the inverse of $G_\gamma$. As we have assumed mode collapse, however, the training data for the reconstructor network $F_\theta$ does not include data items from the \u201cforgotten\u201d mode of $p(x)$; therefore the action of $F_\theta$ on data from that mode is ill-specified. This means that $F_\theta(X)$, $X \sim p(x)$, is unlikely to be Gaussian, and we can use this mismatch as an indicator of mode collapse.\n\nConversely, let us suppose (Figure 1b) that $F_\theta$ is successful at mapping the true data distribution to a Gaussian. 
In that case, if $G_\gamma$ mode collapses, then $F_\theta$ will not map all $G_\gamma(z)$ back to the original $z$, and the resulting penalty provides us with a strong learning signal for both $\gamma$ and $\theta$.\n\nTherefore, the learning principle for VEEGAN will be to train $F_\theta$ to achieve both of these objectives simultaneously. Another way of stating this intuition is that if the same reconstructor network maps both the true data and the generated data to a Gaussian distribution, then the generated data is likely to coincide with the true data. To measure whether $F_\theta$ approximately inverts $G_\gamma$, we use an autoencoder loss; more precisely, we minimize a loss function, such as the $\ell_2$ loss, between $z \sim p_0$ and $F_\theta(G_\gamma(z))$. To quantify whether $F_\theta$ maps the true data distribution to a Gaussian, we use the cross entropy $H(Z, F_\theta(X))$ between $Z$ and $F_\theta(X)$. This boils down to learning $\gamma$ and $\theta$ by minimising the sum of these two objectives, namely\n\n$$O_{entropy}(\gamma, \theta) = E\left[\|z - F_\theta(G_\gamma(z))\|_2^2\right] + H(Z, F_\theta(X)). \quad (1)$$\n\nWhile this objective captures the main idea of our paper, it cannot be easily computed and minimised. We next transform it into a computable version and derive theoretical guarantees.\n\n3.1 Objective Function\n\nLet us denote the distribution of the outputs of the reconstructor network when applied to a fixed data item $x$ by $p_\theta(z|x)$, and when applied to all $X \sim p(x)$ by $p_\theta(z) = \int p_\theta(z|x) p(x)\,dx$. The conditional distribution $p_\theta(z|x)$ is Gaussian with unit variance and, with a slight abuse of notation, (deterministic) mean function $F_\theta(x)$. 
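Before developing the variational treatment of the cross-entropy term, note that the first, autoencoder term of (1) is already straightforward to estimate by Monte Carlo. A minimal sketch, assuming vectors are represented as lists of floats and `F` and `G` are stand-in callables for the reconstructor and generator (not the paper's code):

```python
def noise_reconstruction_loss(zs, F, G):
    """Monte Carlo estimate of E ||z - F_theta(G_gamma(z))||^2, the
    autoencoder term of O_entropy, averaged over sampled noise vectors zs.

    zs: list of noise vectors (each a list of floats) drawn from p_0.
    G:  generator, noise vector -> data item.
    F:  reconstructor, data item -> noise vector.
    """
    total = 0.0
    for z in zs:
        z_rec = F(G(z))  # map noise to data and back to noise
        total += sum((a - b) ** 2 for a, b in zip(z, z_rec))
    return total / len(zs)
```

If $F_\theta$ exactly inverts $G_\gamma$ the estimate is zero; any failure to reconstruct the noise contributes a positive penalty.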
The entropy term $H(Z, F_\theta(X))$ can thus be written as\n\n$$H(Z, F_\theta(X)) = -\int p_0(z) \log p_\theta(z)\,dz = -\int p_0(z) \log \int p(x)\, p_\theta(z|x)\,dx\,dz. \quad (2)$$\n\nThis cross entropy is minimized with respect to $\theta$ when $p_\theta(z) = p_0(z)$ [2]. Unfortunately, the integral on the right-hand side of (2) cannot usually be computed in closed form. We thus introduce a variational distribution $q_\gamma(x|z)$, and by Jensen's inequality we have\n\n$$-\log p_\theta(z) = -\log \int p_\theta(z|x)\, p(x)\, \frac{q_\gamma(x|z)}{q_\gamma(x|z)}\,dx \le -\int q_\gamma(x|z) \log \frac{p_\theta(z|x)\, p(x)}{q_\gamma(x|z)}\,dx, \quad (3)$$\n\nwhich we use to bound the cross entropy in (2). In variational inference, strong parametric assumptions are typically made on $q_\gamma$. Importantly, we here relax that assumption, instead representing $q_\gamma$ implicitly as a deep generative model, enabling us to learn very complex distributions. The variational distribution $q_\gamma(x|z)$ plays exactly the same role as the generator in a GAN, and for that reason we will parameterize $q_\gamma(x|z)$ as the output of a stochastic neural network $G_\gamma(z)$.\n\nIn practice, minimizing this bound is difficult if $q_\gamma$ is specified implicitly. For instance, it is challenging to train a discriminator network that accurately estimates the unknown likelihood ratio $\log p(x)/q_\gamma(x|z)$, because $q_\gamma(x|z)$, as a conditional distribution, is much more peaked than the joint distribution $p(x)$, making it too easy for a discriminator to tell the two distributions apart. Intuitively, the discriminator in a GAN works well when it is presented with a difficult pair of distributions to distinguish. 
To circumvent this problem, we write (see supplementary material)\n\n$$-\int p_0(z) \log p_\theta(z)\,dz \le \mathrm{KL}\left[q_\gamma(x|z)\, p_0(z) \,\|\, p_\theta(z|x)\, p(x)\right] - E[\log p_0(z)]. \quad (4)$$\n\nHere all expectations are taken with respect to the joint distribution $p_0(z)\, q_\gamma(x|z)$.\n\nNow, moving to the second term in (1), we define the reconstruction penalty as an expectation of the cost of autoencoding noise vectors, that is, $E[d(z, F_\theta(G_\gamma(z)))]$. The function $d$ denotes a loss function in representation space $R^K$, such as the $\ell_2$ loss, and therefore the term is an autoencoder in representation space. To make this link explicit, we expand the expectation, assuming that we choose $d$ to be the $\ell_2$ loss. This yields $E[d(z, F_\theta(x))] = \int p_0(z) \int q_\gamma(x|z)\, \|z - F_\theta(x)\|_2^2\,dx\,dz$. Unlike a standard autoencoder, however, rather than taking a data item as input and attempting to reconstruct it, we autoencode a representation vector. This makes a substantial difference in the interpretation and performance of the method, as we discuss in Section 4. For example, notice that we do not include a regularization weight on the autoencoder term in (5), because Proposition 1 below says that this is not needed to recover the data distribution.\n\nCombining these two ideas, we obtain the final objective function\n\n$$O(\gamma, \theta) = \mathrm{KL}\left[q_\gamma(x|z)\, p_0(z) \,\|\, p_\theta(z|x)\, p(x)\right] - E[\log p_0(z)] + E[d(z, F_\theta(x))]. \quad (5)$$\n\nRather than minimizing the intractable $O_{entropy}(\gamma, \theta)$, our goal in VEEGAN is to minimize the upper bound $O$ with respect to $\gamma$ and $\theta$. Indeed, if the networks $F_\theta$ and $G_\gamma$ are sufficiently powerful, then if we succeed in globally minimizing $O$, we can guarantee that $q_\gamma$ recovers the true data distribution. This statement is formalized in the following proposition.\n\nProposition 1. 
Suppose that there exist parameters $\theta^*, \gamma^*$ such that $O(\gamma^*, \theta^*) = H[p_0]$, where $H$ denotes Shannon entropy. Then $(\gamma^*, \theta^*)$ minimizes $O$, and further\n\n$$p_{\theta^*}(z) := \int p_{\theta^*}(z|x)\, p(x)\,dx = p_0(z), \quad \text{and} \quad q_{\gamma^*}(x) := \int q_{\gamma^*}(x|z)\, p_0(z)\,dz = p(x).$$\n\nBecause neural networks are universal approximators, the conditions in the proposition can be achieved when the networks $G$ and $F$ are sufficiently powerful.\n\n3.2 Learning with Implicit Probability Distributions\n\nThis subsection describes how to approximate $O$ when we have implicit representations for $q_\gamma$ and $p_\theta$ rather than explicit densities. In this case, we cannot optimize $O$ directly, because the KL divergence in (5) depends on a density ratio which is unknown, both because $q_\gamma$ is implicit and also because $p(x)$ is unknown.\n\nAlgorithm 1 VEEGAN training\n1: while not converged do\n2:   for $i \in \{1 \ldots N\}$ do\n3:     Sample $z_i \sim p_0(z)$\n4:     Sample $x_g^i \sim q_\gamma(x|z_i)$\n5:     Sample $x_i \sim p(x)$\n6:     Sample $z_g^i \sim p_\theta(z|x_i)$\n7:   $g_\omega \leftarrow -\nabla_\omega \frac{1}{N} \sum_i \log \sigma(D_\omega(z_i, x_g^i)) + \log(1 - \sigma(D_\omega(z_g^i, x_i)))$   // Compute $\nabla_\omega \hat{O}^{LR}$\n8:   $g_\theta \leftarrow \nabla_\theta \frac{1}{N} \sum_i d(z_i, x_g^i)$   // Compute $\nabla_\theta \hat{O}$\n9:   $g_\gamma \leftarrow \nabla_\gamma \frac{1}{N} \sum_i D_\omega(z_i, x_g^i) + \frac{1}{N} \sum_i d(z_i, x_g^i)$   // Compute $\nabla_\gamma \hat{O}$\n10:  $\omega \leftarrow \omega - \eta g_\omega$;  $\theta \leftarrow \theta - \eta g_\theta$;  $\gamma \leftarrow \gamma - \eta g_\gamma$   // Perform SGD updates for $\omega$, $\theta$ and $\gamma$\n\n
Following [4, 5], we estimate this ratio using a discriminator network $D_\omega(x, z)$, which we will train to encourage\n\n$$D_\omega(z, x) = \log \frac{q_\gamma(x|z)\, p_0(z)}{p_\theta(z|x)\, p(x)}. \quad (6)$$\n\nThis will allow us to estimate $O$ as\n\n$$\hat{O}(\omega, \gamma, \theta) = \frac{1}{N} \sum_{i=1}^N D_\omega(z_i, x_g^i) + \frac{1}{N} \sum_{i=1}^N d(z_i, x_g^i), \quad (7)$$\n\nwhere $(z_i, x_g^i) \sim p_0(z)\, q_\gamma(x|z)$. In this equation, note that $x_g^i$ is a function of $\gamma$; although we suppress this in the notation, we do take this dependency into account in the algorithm. We use an auxiliary objective function to estimate $\omega$. As mentioned earlier, we omit the entropy term $-E[\log p_0(z)]$ from $\hat{O}$ as it is constant with respect to all parameters. In principle, any method for density ratio estimation could be used here; for example, see [9, 21]. In this work, we will use the logistic regression loss, much as in other methods for deep adversarial training, such as GANs [7], or for noise contrastive estimation [8]. We will train $D_\omega$ to distinguish samples from the joint distribution $q_\gamma(x|z)\, p_0(z)$ from those from $p_\theta(z|x)\, p(x)$. The objective function for this is\n\n$$O^{LR}(\omega, \gamma, \theta) = -E_\gamma[\log(\sigma(D_\omega(z, x)))] - E_\theta[\log(1 - \sigma(D_\omega(z, x)))], \quad (8)$$\n\nwhere $E_\gamma$ denotes expectation with respect to the joint distribution $q_\gamma(x|z)\, p_0(z)$ and $E_\theta$ with respect to $p_\theta(z|x)\, p(x)$. We write $\hat{O}^{LR}$ to indicate the Monte Carlo estimate of $O^{LR}$. Our learning algorithm optimizes this pair of objectives with respect to $\gamma, \omega, \theta$ using stochastic gradient descent. In particular, the algorithm aims to find a simultaneous solution to $\min_\omega \hat{O}^{LR}(\omega, \gamma, \theta)$ and $\min_{\theta,\gamma} \hat{O}(\omega, \gamma, \theta)$. This training procedure is described in Algorithm 1. 
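The two Monte Carlo estimates that this procedure descends, $\hat{O}$ from (7) over $(\gamma, \theta)$ and $\hat{O}^{LR}$ from (8) over $\omega$, can be sketched as follows. The callables and the sample-pair containers are illustrative placeholders (not the paper's code), and gradient computation is left to whatever autodiff framework is in use:

```python
import math


def _sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def estimate_objectives(D, pairs_gen, pairs_rec, d):
    """Monte Carlo estimates of eq. (7) and eq. (8).

    D:         discriminator D_omega(z, x) returning a logit.
    pairs_gen: samples (z_i, x_g^i) with z_i ~ p0(z), x_g^i ~ q_gamma(x|z_i).
    pairs_rec: samples (z_g^i, x_i) with x_i ~ p(x), z_g^i ~ p_theta(z|x_i).
    d:         reconstruction cost d(z, x), e.g. squared l2 against F_theta(x).
    Returns (O_hat, O_LR_hat): Algorithm 1 descends O_hat in (gamma, theta)
    and O_LR_hat in omega.
    """
    n = len(pairs_gen)
    o_hat = (sum(D(z, x) for z, x in pairs_gen) / n
             + sum(d(z, x) for z, x in pairs_gen) / n)
    o_lr = (-sum(math.log(_sigmoid(D(z, x))) for z, x in pairs_gen) / n
            - sum(math.log(1.0 - _sigmoid(D(z, x))) for z, x in pairs_rec)
            / len(pairs_rec))
    return o_hat, o_lr
```

With a constant discriminator, $\hat{O}$ still varies through the reconstruction term $d$, which is exactly the property that distinguishes VEEGAN from purely discriminator-mediated objectives.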
When this procedure converges, we will have that $\omega^* = \arg\min_\omega O^{LR}(\omega, \gamma^*, \theta^*)$, which means that $D_{\omega^*}$ has converged to the likelihood ratio (6). Therefore $(\gamma^*, \theta^*)$ have also converged to a minimum of $O$. We also found that pre-training the reconstructor network on samples from $p(x)$ helps in some cases.\n\n4 Relationships to Other Methods\n\nAn enormous amount of attention has been devoted recently to improved methods for GAN training, and we compare ourselves to the most closely related work in detail.\n\nBiGAN/Adversarially Learned Inference.  BiGAN [4] and Adversarially Learned Inference (ALI) [5] are two essentially identical recent adversarial methods for learning both a deep generative network $G_\gamma$ and a reconstructor network $F_\theta$. Likelihood-free variational inference (LFVI) [22] extends this idea to a hierarchical Bayesian setting. Like VEEGAN, all of these methods also use a discriminator $D_\omega(z, x)$ on the joint $(z, x)$ space. However, the VEEGAN objective function $O(\theta, \gamma)$ provides significant benefits over the logistic regression loss over $\theta$ and $\gamma$ that is used in ALI/BiGAN, or the KL divergence used in LFVI.\n\nIn all of these methods, just as in vanilla GANs, the objective function depends on $\theta$ and $\gamma$ only via the output $D_\omega(z, x)$ of the discriminator; therefore, if there is a mode of data space in which $D_\omega$ is insensitive to changes in $\theta$ and $\gamma$, there will be mode collapse. In VEEGAN, by contrast, the reconstruction term does not depend on the discriminator, and so can provide learning signal to $\gamma$ or $\theta$ even when the discriminator is constant. 
We will show in Section 5 that VEEGAN is indeed dramatically less prone to mode collapse than ALI.\n\nInfoGAN.  Although motivated differently, by the goal of obtaining disentangled representations of the data, InfoGAN also uses a latent-code reconstruction based penalty in its cost function. But unlike in VEEGAN, only a part of the latent code is reconstructed in InfoGAN. Thus, InfoGAN is similar to VEEGAN in that it also includes an autoencoder over the latent codes, but the key difference is that InfoGAN does not also train the reconstructor network on the true data distribution. We suggest that this may be the reason that InfoGAN was observed to require some of the same stabilization tricks as vanilla GANs, which are not required for VEEGAN.\n\nAdversarial Methods for Autoencoders.  A number of other recent methods have been proposed that combine adversarial methods and autoencoders, whether by explicitly regularizing the GAN loss with an autoencoder loss [1, 13], or by alternating optimization between the two losses [14]. In all of these methods, the autoencoder is over images, i.e., they incorporate a loss function of the form $\lambda\, d(x, G_\gamma(F_\theta(x)))$, where $d$ is a loss function over images, such as a pixel-wise $\ell_2$ loss, and $\lambda$ is a regularization constant. Similarly, variational autoencoders [12, 18] also autoencode images rather than noise vectors. Finally, adversarial variational Bayes (AVB) [15] is an adaptation of VAEs to the case where the posterior distribution $p_\theta(z|x)$ is implicit, but the data distribution $q_\gamma(x|z)$ must be explicit, unlike in our work.\n\nBecause these methods autoencode data points, they share a crucial disadvantage: choosing a good loss function $d$ over natural images can be problematic. For example, it has been commonly observed that minimizing an $\ell_2$ reconstruction loss on images can lead to blurry images. 
Indeed, if choosing a loss function over images were easy, we could simply train an autoencoder and dispense with adversarial learning entirely. By contrast, in VEEGAN we autoencode the noise vectors $z$, and choosing a good loss function for a noise autoencoder is easy: because the noise vectors $z$ are drawn from a standard normal distribution, using an $\ell_2$ loss on $z$ is entirely natural, and does not, as we will show in Section 5, result in blurry images compared to purely adversarial methods.\n\n5 Experiments\n\nQuantitative evaluation of GANs is problematic because implicit distributions do not have a tractable likelihood term to quantify generative accuracy. Quantifying mode collapse is also not straightforward, except in the case of synthetic data with known modes. For this reason, several indirect metrics have recently been proposed to evaluate GANs specifically for their mode collapsing behavior [1, 16]. However, none of these metrics is reliable on its own, and therefore we need to compare across a number of different metrics. In this section we evaluate VEEGAN on several synthetic and real datasets and compare its performance against vanilla GANs [7], Unrolled GAN [16] and ALI [5] on five different metrics. Our results strongly suggest that VEEGAN does indeed resolve mode collapse in GANs to a large extent. Generally, we found that VEEGAN performed well with default hyperparameter values, so we did not tune these. Full details are provided in the supplementary material.\n\n5.1 Synthetic Dataset\n\nMode collapse can be accurately measured on synthetic datasets, since the true distribution and its modes are known. 
In this section we compare all four competing methods on three synthetic datasets of increasing difficulty: a mixture of eight 2D Gaussian distributions arranged in a ring, a mixture of twenty-five 2D Gaussian distributions arranged in a grid,2 and a mixture of ten 700-dimensional Gaussian distributions embedded in a 1200-dimensional space. This mixture arrangement was chosen to mimic the higher dimensional manifolds of natural images. All of the mixture components were isotropic Gaussians. For a fair comparison of the different learning methods for GANs, we use the same network architectures for the reconstructors and the generators for all methods, namely fully-connected MLPs with two hidden layers. For the discriminator we use a two layer MLP without dropout or normalization layers. The VEEGAN method works for both deterministic and stochastic generator networks. \n\n2Experiment follows [5]. Please note that for certain settings of parameters, vanilla GAN can also recover all 25 modes, as was pointed out to us by Paulina Grnarova.\n\nTable 1: Sample quality and degree of mode collapse on mixtures of Gaussians. VEEGAN consistently captures the highest number of modes and produces better samples.\n\n              | 2D Ring                      | 2D Grid                       | 1200D Synthetic\n              | Modes     % High Quality     | Modes      % High Quality     | Modes      % High Quality\n              | (Max 8)   Samples            | (Max 25)   Samples            | (Max 10)   Samples\nGAN           | 1         99.3               | 3.3        0.5                | 1.6        2.0\nALI           | 2.8       0.13               | 15.8       1.6                | 3          5.4\nUnrolled GAN  | 7.6       35.6               | 23.6       16                 | 0          0.0\nVEEGAN        | 8         52.9               | 24.6       40                 | 5.5        28.29\n\n
To allow the generator to be a stochastic map, we add an extra dimension of noise to the generator input that is not reconstructed.\n\nTo quantify the mode collapsing behavior we report two metrics: We sample points from the generator network, and count a sample as high quality if it is within three standard deviations of the nearest mode, for the 2D datasets, or within 10 standard deviations of the nearest mode, for the 1200D dataset. Then, we report the number of modes captured as the number of mixture components whose mean is nearest to at least one high quality sample. We also report the percentage of high quality samples as a measure of sample quality. We generate 2500 samples from each trained model and average the numbers over five runs. For the unrolled GAN, we set the number of unrolling steps to five, as suggested in the authors' reference implementation.\n\nAs shown in Table 1, VEEGAN captures the greatest number of modes on all the synthetic datasets, while consistently generating higher quality samples. This is visually apparent in Figure 2, which plots the generator distributions for each method; the generators learned by VEEGAN are sharper and closer to the true distribution. This figure also shows why it is important to measure sample quality and mode collapse simultaneously, as either alone can be misleading. For instance, the GAN on the 2D ring has 99.3% sample quality, but this is simply because the GAN collapses all of its samples onto one mode (Figure 2b). 
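The captured-modes and high-quality-sample metrics just described can be computed with a simple routine. This is a sketch under our own assumptions (pure Python, points as tuples; `k` is the standard-deviation threshold: 3 for the 2D data, 10 for the 1200D data), not the authors' evaluation code:

```python
import math


def mode_metrics(samples, modes, sigma, k=3.0):
    """Count captured modes and the percentage of high-quality samples.

    samples: generated points; modes: mixture component means;
    sigma: standard deviation of the isotropic components.
    A sample is high quality if it lies within k*sigma of its nearest mode;
    a mode counts as captured if it is the nearest mode of at least one
    high-quality sample.
    """
    captured, hq = set(), 0
    for s in samples:
        j = min(range(len(modes)), key=lambda i: math.dist(s, modes[i]))
        if math.dist(s, modes[j]) <= k * sigma:
            hq += 1
            captured.add(j)
    return len(captured), 100.0 * hq / len(samples)
```

Reporting both numbers together matters, since, as noted above, either one alone can be gamed by collapsing onto one mode or by dispersing samples widely.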
On the other extreme, the unrolled GAN on the 2D grid captures almost all the modes in the true distribution, but this is simply because it is generating highly dispersed samples (Figure 2i) that do not accurately represent the true distribution, hence the low sample quality. All methods had approximately the same running time, except for unrolled GAN, which is a few orders of magnitude slower due to the unrolling overhead.\n\n5.2 Stacked MNIST\n\nFollowing [16], we evaluate our methods on the stacked MNIST dataset, a variant of the MNIST data specifically designed to increase the number of discrete modes. The data is synthesized by stacking three randomly sampled MNIST digits along the color channel, resulting in a 28x28x3 image. We now expect 1000 modes in this data set, corresponding to the number of possible triples of digits.\n\nAgain, to focus the evaluation on the difference in the learning algorithms, we use the same generator architecture for all methods. In particular, the generator architecture is an off-the-shelf standard implementation3 of DCGAN [17]. For Unrolled GAN, we used a standard implementation of the DCGAN discriminator network. For ALI and VEEGAN, the discriminator architecture is described in the supplementary material. For the reconstructor in ALI and VEEGAN, we use a simple two-layer MLP without any regularization layers.\n\n3https://github.com/carpedm20/DCGAN-tensorflow\n\n              | Stacked-MNIST             | CIFAR-10\n              | Modes (Max 1000)   KL     | IvOM\nDCGAN         | 99                 3.4    | 0.00844 \u00b1 0.002\nALI           | 16                 5.4    | 0.0067 \u00b1 0.004\nUnrolled GAN  | 48.7               4.32   | 0.013 \u00b1 0.0009\nVEEGAN        | 150                2.95   | 0.0068 \u00b1 0.0001\n\nTable 2: Degree of mode collapse, measured by modes captured and the inference via optimization measure (IvOM), and sample quality (as measured by KL) on Stacked-MNIST and CIFAR. 
VEEGAN captures the most modes and also achieves the highest quality.\n\nFinally, for VEEGAN we pretrain the reconstructor by taking a few stochastic gradient steps with respect to $\theta$ before running Algorithm 1. For all methods other than VEEGAN, we use the enhanced generator loss function suggested in [7], since we were not able to get sufficient learning signal for the generator without it. VEEGAN did not require this adjustment for successful training.\n\nAs the true locations of the modes in this data are unknown, the number of modes is estimated using a trained classifier, as described originally in [1]. We used a total of 26000 samples for all the models, and the results are averaged over five runs. As a measure of quality, following [16] again, we also report the KL divergence between the generator distribution and the data distribution. As reported in Table 2, VEEGAN not only captures the most modes, it also consistently matches the data distribution more closely than any other method. Generated samples from each of the models are shown in the supplementary material.\n\n5.3 CIFAR\n\nFinally, we evaluate the learning methods on the CIFAR-10 dataset, a well-studied and diverse dataset of natural images. We use the same discriminator, generator, and reconstructor architectures as in the previous section. However, the previous mode collapsing metric is inappropriate here, owing to CIFAR's greater diversity: even within one of the 10 classes of CIFAR, the intra-class diversity is very high compared to any of the 10 classes of MNIST. Therefore, for CIFAR it is inappropriate to assume, as the metrics of the previous subsection do, that each labelled class corresponds to a single mode of the data distribution.\n\nInstead, we use a metric introduced by [16], which we will call the inference via optimization metric (IvOM). 
The idea behind this metric is to compare real images from the test set to the nearest generated image; if the generator suffers from mode collapse, then there will be some images for which this distance is large. To quantify this, we sample a real image x from the test set and find the closest image that the GAN is capable of generating, i.e., we optimize the ℓ2 loss between x and the generated image Gγ(z) with respect to z. A method that consistently attains a low MSE can be assumed to capture more modes than one that attains a higher MSE. As before, this metric can still be fooled by highly dispersed generator distributions, and the ℓ2 metric may also favour generators that produce blurry images. Therefore we will also evaluate sample quality visually. All numerical results have been averaged over five runs. Finally, to evaluate whether the noise autoencoder in VEEGAN is indeed superior to a more traditional data autoencoder, we compare to a variant, which we call VEEGAN+DAE, that uses a data autoencoder instead, obtained by simply replacing d(z, Fθ(x)) in O with the data loss ‖x − Gγ(Fθ(x))‖₂².

As shown in Table 2, ALI and VEEGAN achieve the best IvOM. Qualitatively, however, generated samples from VEEGAN seem better than those from the other methods. In particular, the samples from VEEGAN+DAE are meaningless. Generated samples from VEEGAN are shown in Figure 3b; samples from the other methods are shown in the supplementary material. Figure 3 also illustrates the IvOM metric by showing the nearest neighbors to real images that each of the GANs was able to generate; in general, the nearest neighbors will be more semantically meaningful than randomly generated images. We omit VEEGAN+DAE from this comparison because it did not produce plausible images. 
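The latent-space search underlying the IvOM measure can be sketched concretely. The following is a minimal NumPy illustration, not the implementation used in our experiments: it substitutes a fixed linear map for the convolutional generator Gγ (an assumption made purely for brevity), so that the ℓ2 objective over z can be minimized by plain gradient descent; the function name `ivom` and the toy dimensions are likewise illustrative.

```python
import numpy as np

def ivom(x, G, z_dim, steps=500, seed=0):
    """Inference-via-optimization distance: gradient-descend on z to
    minimize ||G @ z - x||^2 and return the final residual."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(z_dim)
    # A step size below 1 / sigma_max(G)^2 keeps gradient descent on
    # this quadratic objective stable.
    lr = 0.5 / np.linalg.norm(G, ord=2) ** 2
    for _ in range(steps):
        z -= lr * 2.0 * G.T @ (G @ z - x)  # gradient of ||G z - x||^2
    return float(np.sum((G @ z - x) ** 2))

rng = np.random.default_rng(1)
G = rng.standard_normal((8, 3))             # toy linear "generator": R^3 -> R^8
x_covered = G @ np.array([0.5, -1.0, 2.0])  # a sample the generator can produce
x_missed = 10.0 * rng.standard_normal(8)    # a sample far off the generator's range

low = ivom(x_covered, G, 3)   # near zero: this "mode" is captured
high = ivom(x_missed, G, 3)   # large: no latent code reproduces x
```

The two cases mirror the interpretation in the text: a real image near the generator's output manifold yields a small residual, while an image from a dropped mode yields a large one, which is exactly the signal the averaged IvOM aggregates over the test set.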
Across the methods, we see in Figure 3 that VEEGAN captures small details, such as the face of the poodle, that other methods miss.

Figure 2: Density plots of the true data and generator distributions from different GAN methods trained on mixtures of Gaussians arranged in a ring (top) or a grid (bottom). Panels: (a) True Data, (b) GAN, (c) ALI, (d) Unrolled, (e) VEEGAN for the ring; (f) True Data, (g) GAN, (h) ALI, (i) Unrolled, (j) VEEGAN for the grid.

Figure 3: Sample images from GANs trained on CIFAR-10. Best viewed magnified on screen. (a) Generated samples nearest to real images from CIFAR-10. In each of the two panels, the first column shows real images, followed by the nearest images from DCGAN, ALI, Unrolled GAN, and VEEGAN, respectively. (b) Random samples from the generator of VEEGAN trained on CIFAR-10.

6 Conclusion

We have presented VEEGAN, a new training principle for GANs that combines a KL divergence in the joint space of representation and data points with an autoencoder over the representation space, motivated by a variational argument. Experimental results on synthetic data and real images show that our approach is much more effective than several state-of-the-art GAN methods at avoiding mode collapse while still generating good-quality samples.

Acknowledgement

We thank Martin Arjovsky, Nicolas Collignon, Luke Metz, Casper Kaae Sønderby, Lucas Theis, Soumith Chintala, Stanisław Jastrzębski, Harrison Edwards, Amos Storkey and Paulina Grnarova for their helpful comments. We would like to especially thank Ferenc Huszár for insightful discussions and feedback.

References

[1] Che, Tong, Li, Yanran, Jacob, Athul Paul, Bengio, Yoshua, and Li, Wenjie. Mode regularized generative adversarial networks. In International Conference on Learning Representations (ICLR), 2017.

[2] Cover, Thomas M. and Thomas, Joy A. Elements of information theory. 
John Wiley & Sons, 2012.

[3] Diggle, Peter J. and Gratton, Richard J. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society. Series B (Methodological), 46(2):193–227, 1984. ISSN 00359246. URL http://www.jstor.org/stable/2345504.

[4] Donahue, Jeff, Krähenbühl, Philipp, and Darrell, Trevor. Adversarial feature learning. In International Conference on Learning Representations (ICLR), 2017.

[5] Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, Mastropietro, Olivier, Lamb, Alex, Arjovsky, Martin, and Courville, Aaron. Adversarially learned inference. In International Conference on Learning Representations (ICLR), 2017.

[6] Dutta, Ritabrata, Corander, Jukka, Kaski, Samuel, and Gutmann, Michael U. Likelihood-free inference by ratio estimation. 2016.

[7] Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron C., and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

[8] Gutmann, Michael U. and Hyvärinen, Aapo. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307–361, 2012.

[9] Gutmann, M.U. and Hirayama, J. Bregman divergence as general framework to estimate unnormalized statistical models. In Proc. Conf. on Uncertainty in Artificial Intelligence (UAI), pp. 283–290, Corvallis, Oregon, 2011. AUAI Press.

[10] Gutmann, M.U., Dutta, R., Kaski, S., and Corander, J. Likelihood-free inference via classification. arXiv:1407.4981, 2014.

[11] Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[12] Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. 
In International Conference on Learning Representations (ICLR), 2014.

[13] Larsen, Anders Boesen Lindbo, Sønderby, Søren Kaae, Larochelle, Hugo, and Winther, Ole. Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning (ICML), 2016.

[14] Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, and Goodfellow, Ian J. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015. URL http://arxiv.org/abs/1511.05644.

[15] Mescheder, Lars M., Nowozin, Sebastian, and Geiger, Andreas. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017. URL http://arxiv.org/abs/1701.04722.

[16] Metz, Luke, Poole, Ben, Pfau, David, and Sohl-Dickstein, Jascha. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.

[17] Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[18] Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The 31st International Conference on Machine Learning, pp. 1278–1286, 2014.

[19] Salimans, Tim, Goodfellow, Ian J., Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. CoRR, abs/1606.03498, 2016. URL http://arxiv.org/abs/1606.03498.

[20] Sønderby, Casper Kaae, Caballero, Jose, Theis, Lucas, Shi, Wenzhe, and Huszár, Ferenc. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.

[21] Sugiyama, M., Suzuki, T., and Kanamori, T. Density ratio estimation in machine learning. Cambridge University Press, 2012.

[22] Tran, D., Ranganath, R., and Blei, D. M. Deep and hierarchical implicit models. 
arXiv e-prints, 2017.

[23] Zhu, Jun-Yan, Park, Taesung, Isola, Phillip, and Efros, Alexei A. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.