{"title": "Adaptive Density Estimation for Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 12016, "page_last": 12026, "abstract": "Unsupervised learning of generative models has seen tremendous progress over recent years, in particular due to generative adversarial networks (GANs), variational autoencoders, and flow-based models. GANs have dramatically improved sample quality, but suffer from two drawbacks: (i) they mode-drop, \\ie, do not cover the full support of the train data, and (ii) they do not allow for likelihood evaluations on held-out data. In contrast likelihood-based training encourages models to cover the full support of the train data, but yields poorer samples. These mutual shortcomings can in principle be addressed by training generative latent variable models in a hybrid adversarial-likelihood manner. However, we show that commonly made parametric assumptions create a conflict between them, making successful hybrid models non trivial. As a solution, we propose the use of deep invertible transformations in the latent variable decoder. This approach allows for likelihood computations in image space, is more efficient than fully invertible models, and can take full advantage of adversarial training. We show that our model significantly improves over existing hybrid models: offering GAN-like samples, IS and FID scores that are competitive with fully adversarial models and improved likelihood scores.", "full_text": "Adaptive Density Estimation for Generative Models\n\nThomas Lucas\u2217\n\nInria\u2020\n\nKonstantin Shmelkov\u2217,\u2021\nNoah\u2019s Ark Lab, Huawei\n\nCordelia Schmid\n\nInria\u2020\n\nKarteek Alahari\n\nInria\u2020\n\nJakob Verbeek\n\nInria\u2020\n\nAbstract\n\nUnsupervised learning of generative models has seen tremendous progress over re-\ncent years, in particular due to generative adversarial networks (GANs), variational\nautoencoders, and \ufb02ow-based models. 
GANs have dramatically improved sample quality, but suffer from two drawbacks: (i) they mode-drop, i.e., do not cover the full support of the train data, and (ii) they do not allow for likelihood evaluations on held-out data. In contrast, likelihood-based training encourages models to cover the full support of the train data, but yields poorer samples. These mutual shortcomings can in principle be addressed by training generative latent variable models in a hybrid adversarial-likelihood manner. However, we show that commonly made parametric assumptions create a conflict between them, making successful hybrid models non-trivial. As a solution, we propose to use deep invertible transformations in the latent variable decoder. This approach allows for likelihood computations in image space, is more efficient than fully invertible models, and can take full advantage of adversarial training. We show that our model significantly improves over existing hybrid models: offering GAN-like samples, IS and FID scores that are competitive with fully adversarial models, and improved likelihood scores.

1 Introduction

Successful recent generative models of natural images can be divided into two broad families, which are trained in fundamentally different ways. The first is trained using likelihood-based criteria, which ensure that all training data points are well covered by the model. This category includes variational autoencoders (VAEs) [25, 26, 39, 40], autoregressive models such as PixelCNNs [46, 53], and flow-based models such as Real-NVP [9, 20, 24]. The second category is trained based on a signal that measures to what extent (statistics of) samples from the model can be distinguished from (statistics of) the training data, i.e., based on the quality of samples drawn from the model. 
This is the case for generative adversarial networks (GANs) [2, 15, 22], and moment matching methods [28].
Despite tremendous recent progress, existing methods exhibit a number of drawbacks. Adversarially trained models such as GANs do not provide a density function, which poses a fundamental problem as it prevents assessment of how well the model fits held-out and training data. Moreover, adversarial models typically do not allow inference of the latent variables that underlie observed images. Finally, adversarial models suffer from mode collapse [2], i.e., they do not cover the full support of the training data. Likelihood-based models, on the other hand, are trained to put probability mass on all elements of the training set, but over-generalise and produce samples of substantially inferior quality as compared to adversarial models. The models with the best likelihood scores on held-out data are autoregressive models [35], which suffer from the additional problem that they are extremely

∗ The authors contributed equally. † Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. ‡ Work done while at Inria.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 2: Variational inference is used to train a latent variable generative model in feature space. The invertible mapping fψ maps back to image space, where adversarial training can be performed together with MLE.

Figure 1: An invertible non-linear mapping fψ maps an image x to a vector fψ(x) in feature space. fψ is trained to adapt to modelling assumptions made by a trained density pθ in feature space. 
This induces full covariance structure and a non-Gaussian density.

Figure 3: Our model yields compelling samples while the optimization of likelihood ensures coverage of all modes in the training support and thus sample diversity, here on LSUN churches (64 × 64).

inefficient to sample from [38], since images are generated pixel-by-pixel. The sampling inefficiency makes adversarial training of such models prohibitively expensive.
In order to overcome these shortcomings, we seek to design a model that (i) generates high-quality samples typical of adversarial models, (ii) provides a likelihood measure on the entire image space, and (iii) has a latent variable structure to enable efficient sampling and to permit adversarial training. Additionally, we show that (iv) a successful hybrid adversarial-likelihood paradigm requires going beyond simplifying assumptions commonly made with likelihood-based latent variable models. These simplifying assumptions on the conditional distribution of data x given latents z, p(x|z), include full independence across the dimensions of x and/or simple parametric forms such as Gaussian [25], or the use of fully invertible networks [9, 24]. These assumptions create a conflict between achieving high sample quality and high likelihood scores on held-out data. Autoregressive models, such as PixelCNNs [46, 53], do not make factorization assumptions, but are extremely inefficient to sample from. As a solution, we propose learning a non-linear invertible function fψ between the image space and an abstract feature space, as illustrated in Figure 1. Training a model with full support in this feature space induces a model in the image space that makes neither Gaussianity nor independence assumptions in the conditional density p(x|z). 
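To make the change-of-variables mechanism behind this construction concrete, here is a minimal 1-D numerical sketch (our own illustration, not the paper's fψ or its architecture): an invertible map reshapes a Gaussian base density into a non-Gaussian density that still normalises to one.

```python
import numpy as np

# Illustrative 1-D analogue of adaptive density estimation (a toy example,
# not the paper's model): a strictly increasing, hence invertible, map f
# reshapes a standard-normal base density via the change-of-variables
# formula p(x) = p_base(f(x)) * |f'(x)|.

def f(x):
    return x + 0.5 * np.tanh(x)  # f'(x) > 0 everywhere, so f is a bijection

def f_prime(x):
    return 1.0 + 0.5 * (1.0 - np.tanh(x) ** 2)

def base_density(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def induced_density(x):
    # density in "image" space induced by the base density in "feature" space
    return base_density(f(x)) * np.abs(f_prime(x))

# The induced density is no longer Gaussian, yet still integrates to 1.
grid = np.linspace(-10.0, 10.0, 200001)
dx = grid[1] - grid[0]
total_mass = np.sum(induced_density(grid)) * dx
```

Here `total_mass` comes out numerically close to 1, which is the property that lets such a construction define a valid likelihood in image space while the base density keeps a simple parametric form.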
Trained by MLE, fψ adapts to modelling assumptions made by pθ, so we refer to this approach as “adaptive density estimation”.
We experimentally validate our approach on the CIFAR-10 dataset with an ablation study. Our model significantly improves over existing hybrid models, producing GAN-like samples, and IS and FID scores that are competitive with fully adversarial models, see Figure 3. At the same time, we obtain likelihoods on held-out data comparable to state-of-the-art likelihood-based methods, which requires covering the full support of the dataset. We further confirm these observations with quantitative and qualitative experimental results on the STL-10, ImageNet and LSUN datasets.

2 Related work

Mode-collapse in GANs has received considerable attention, and stabilizing the training process as well as improved and bigger architectures have been shown to alleviate this issue [2, 17, 37]. Another line of work focuses on allowing the discriminator to access batch statistics of generated images, as pioneered by [22, 45], and further generalized by [29, 32]. This enables comparison of distributional statistics by the discriminator, rather than only individual samples. Other approaches to encourage diversity among GAN samples include the use of maximum mean discrepancy [1], optimal transport [47], determinantal point processes [14] and Bayesian formulations of adversarial training [42] that allow model parameters to be sampled. In contrast to our work, these models lack an inference network, and do not define an explicit density over the full data support.
Another line of research has explored inference mechanisms for GANs. The discriminator of BiGAN [10] and ALI [12], given pairs (x, z) of images and latents, predicts whether z was encoded from a real image, or x was decoded from a sampled z. 
In [52] the encoder and the discriminator are collapsed into one network that encodes both real images and generated samples, and tries to spread their posteriors apart. In [6] a symmetrized KL divergence is approximated in an adversarial setup, and reconstruction losses are used to improve the correspondence between reconstructed and target variables for x and z. Similarly, in [41] a discriminator is used to replace the KL divergence term in the variational lower bound used to train VAEs with the density ratio trick. In [33] the KL divergence term in a VAE is replaced with a discriminator that compares latent variables from the prior and the posterior; this regularization is more flexible than the standard KL divergence. The VAE-GAN model [27] and the model in [21] use the intermediate feature maps of a GAN discriminator and of a classifier, respectively, as target space for a VAE. Unlike ours, these methods do not define a likelihood over the image space.

Figure 4: (Left) Maximum likelihood training pulls probability mass towards high-density regions of the data distribution, while adversarial training pushes mass out of low-density regions. (Right) Independence assumptions become a source of conflict in a joint training setting, making hybrid training non-trivial.

Likelihood-based models typically make modelling assumptions that conflict with adversarial training; these include strong factorization and/or Gaussianity. In our work we avoid these limitations by learning the shape of the conditional density on observed data given latents, p(x|z), beyond fully factorized Gaussian models. As in our work, Flow-GAN [16] also builds on invertible transformations to construct a model that can be trained in a hybrid adversarial-MLE manner, see Figure 2. However, Flow-GAN does not use the efficient non-invertible layers we introduce, and instead relies entirely on invertible layers. 
Other approaches combine autoregressive decoders with latent variable models to go beyond typical parametric assumptions in pixel space [7, 18, 31]. They, however, are not amenable to adversarial training due to the prohibitively slow sequential pixel sampling.

3 Preliminaries on MLE and adversarial training

Maximum-likelihood and over-generalization. The de-facto standard approach to train generative models is maximum-likelihood estimation. It maximizes the probability of data sampled from an unknown data distribution p∗ under the model pθ w.r.t. the model parameters θ. This is equivalent to minimizing the Kullback-Leibler (KL) divergence, DKL(p∗||pθ), between p∗ and pθ. This yields models that tend to cover all the modes of the data, but put mass in spurious regions of the target space; a phenomenon known as “over-generalization” or “zero-avoiding” [4], manifested by unrealistic samples in the context of generative image models, see Figure 4. Over-generalization is inherent to the optimization of the KL divergence oriented in this manner. Real images are sampled from p∗, and pθ is explicitly optimized to cover all of them. The training procedure, however, does not sample from pθ to evaluate the quality of such samples, ideally using the inaccessible p∗(x) as a score. Therefore pθ may put mass in spurious regions of the space without being heavily penalized. We refer to this kind of training procedure as “coverage-driven training” (CDT). It optimizes a loss of the form LC(pθ) = ∫_x p∗(x) sc(x, pθ) dx, where sc(x, pθ) = ln pθ(x) evaluates how well a sample x is covered by the model. 
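As a small self-contained sketch of this coverage criterion (our illustration, with 1-D Gaussians standing in for p∗ and pθ), the integral can be estimated by Monte Carlo over data samples; a model that misses the data mode pays a large penalty:

```python
import numpy as np

# Monte Carlo estimate of a coverage-driven loss; we use the negative
# log-score, L_C = E_{x ~ p*}[ -ln p_theta(x) ], so that lower is better.
# 1-D Gaussians stand in for the data and model densities (toy example).

rng = np.random.default_rng(0)

def gauss_logpdf(x, mu, sigma):
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

data = rng.normal(0.0, 1.0, size=20000)  # samples from the "true" p* = N(0, 1)

def coverage_loss(mu, sigma):
    # negative log-score averaged over data samples
    return -gauss_logpdf(data, mu, sigma).mean()

loss_matched = coverage_loss(0.0, 1.0)   # model equals p*
loss_shifted = coverage_loss(3.0, 1.0)   # model misses the data mode
```

`loss_matched` lands near the entropy of N(0, 1), about 1.42 nats, while `loss_shifted` is several nats worse: under the log-score, the penalty grows with the squared distance of the missed mass.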
Any score sc that verifies LC(pθ) = 0 ⇐⇒ pθ = p∗ is equivalent to the log-score, which forms a justification for the MLE on which we focus.
Explicitly evaluating sample quality is redundant in the regime of unlimited model capacity and training data. Indeed, putting mass on spurious regions takes it away from the support of p∗, and thus reduces the likelihood of the training data. In practice, however, datasets and model capacity are finite, and models must put mass outside the finite training set in order to generalize. The maximum likelihood criterion, by construction, only measures how much mass goes off the training data, not where it goes. In classic MLE, generalization is controlled in two ways: (i) inductive bias, in the form of model architecture, controls where the off-dataset mass goes, and (ii) regularization controls to which extent this happens. An adversarial loss, by considering samples of the model pθ, can provide a second handle to evaluate and control where the off-dataset mass goes. In this sense, and in contrast to model architecture design, an adversarial loss provides a “trainable” form of inductive bias.
Adversarial models and mode collapse. Adversarially trained models produce samples of excellent quality. As mentioned, their main drawbacks are their tendency to “mode-drop”, and the lack of a measure to assess mode-dropping, or their performance in general. The reasons for this are two-fold. First, defining a valid likelihood requires adding volume to the low-dimensional manifold learned by GANs, to define a density under which training and test data have non-zero density. 
Second, computing the density of a data point under the defined probability distribution requires marginalizing out the latent variables, which is not trivial in the absence of an efficient inference mechanism.
When a human expert subjectively evaluates the quality of generated images, samples from the model are compared to the expert's implicit approximation of p∗. This type of objective may be formalized as LQ(pθ) = ∫_x pθ(x) sq(x, p∗) dx, and we refer to it as “quality-driven training” (QDT). To see that GANs [15] use this type of training, recall that the discriminator is trained with the loss LGAN = ∫_x p∗(x) ln D(x) + pθ(x) ln(1 − D(x)) dx. It is easy to show that the optimal discriminator equals D∗(x) = p∗(x)/(p∗(x) + pθ(x)). Substituting the optimal discriminator, LGAN equals the Jensen-Shannon divergence,

DJS(p∗||pθ) = (1/2) DKL(p∗ || (1/2)(pθ + p∗)) + (1/2) DKL(pθ || (1/2)(pθ + p∗)),   (1)

up to additive and multiplicative constants [15]. This loss, approximated by the discriminator, is symmetric and contains two KL divergence terms. Note that DKL(p∗ || (1/2)(pθ + p∗)) is an integral on p∗, so coverage driven. The term that approximates it in LGAN, i.e., ∫_x p∗(x) ln D(x) dx, is however independent from the generative model, and disappears when differentiating. Therefore, it cannot be used to perform coverage-driven training, and the generator is trained to minimize ln(1 − D(G(z))) (or to maximize ln D(G(z))), where G(z) is the deterministic generator that maps latent variables z to the data space. Assuming D = D∗, this yields

∫_z p(z) ln(1 − D∗(G(z))) dz = ∫_x pθ(x) ln[ pθ(x) / (pθ(x) + p∗(x)) ] dx = DKL(pθ || (pθ + p∗)/2),   (2)

which is a quality-driven criterion, favoring sample quality over support coverage.

4 Adaptive Density Estimation and hybrid adversarial-likelihood training

We present a hybrid training approach with MLE to cover the full support of the training data, and adversarial training as a trainable inductive bias mechanism to improve sample quality. Using both these criteria provides a richer training signal, but satisfying both criteria is more challenging than each in isolation for a given model complexity. In practice, model flexibility is limited by (i) the number of parameters, layers, and features in the model, and (ii) simplifying modeling assumptions, usually made for tractability. We show that these simplifying assumptions create a conflict between the two criteria, making successful joint training non-trivial. We introduce Adaptive Density Estimation as a solution to reconcile them.
Latent variable generative models, defined as pθ(x) = ∫_z pθ(x|z) p(z) dz, typically make simplifying assumptions on pθ(x|z), such as full factorization and/or Gaussianity, see e.g. [11, 25, 30]. In particular, assuming full factorization of pθ(x|z) implies that any correlations not captured by z are treated as independent per-pixel noise. This is a poor model for natural images, unless z captures each and every aspect of the image structure. Crucially, this hypothesis is problematic in the context of hybrid MLE-adversarial training. If p∗ is too complex for pθ(x|z) to fit it accurately enough, MLE will lead to a high variance in a factored (Gaussian) pθ(x|z), as illustrated in Figure 4 (right). 
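The variance inflation can be reproduced in a few lines (a hypothetical 1-D stand-in for the factored Gaussian p(x|z), not the paper's image model): fitting a single Gaussian by MLE to data with two narrow modes yields a standard deviation several times larger than that of either mode, with most mass placed between them.

```python
import numpy as np

# Toy reproduction of the effect in Figure 4 (right): MLE for a single
# Gaussian on bimodal data inflates the variance to cover both modes.

rng = np.random.default_rng(1)

n = 50000
modes = rng.choice([-3.0, 3.0], size=n)      # two modes, at -3 and +3
data = modes + rng.normal(0.0, 0.5, size=n)  # each mode has std 0.5

# The Gaussian MLE is simply the sample mean and standard deviation.
mle_mean = data.mean()
mle_std = data.std()
# mle_std comes out around 3, far above the per-mode std of 0.5: the
# model "explains" the structure it cannot capture as broad noise.
```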
This leads to unrealistic blurry samples, easily detected by an adversarial discriminator, which then does not provide a useful training signal. Conversely, adversarial training will try to avoid these poor samples by dropping modes of the training data, and driving the “noise” level to zero. This in turn is heavily penalized by maximum likelihood training, and leads to poor likelihoods on held-out data.
Adaptive density estimation. The point of view of regression hints at a possible solution. For instance, with isotropic Gaussian model densities with fixed variance, solving the optimization problem θ∗ ∈ argmaxθ ln(pθ(x|z)) is similar to solving minθ ||μθ(z) − x||², i.e., ℓ2 regression, where μθ(z) is the mean of the decoder pθ(x|z). The Euclidean distance in RGB space is known to be a poor measure of similarity between images, non-robust to small translations or other basic transformations [34]. One can instead compute the Euclidean distance in a feature space, ||fψ(x1) − fψ(x2)||², where fψ is chosen so that the distance is a better measure of similarity. A popular way to obtain fψ is to use a CNN that learns a non-linear image representation that allows linear assessment of image similarity. This is the idea underlying GAN discriminators, the FID evaluation measure [19], the reconstruction losses of VAE-GAN [27], and classifier-based perceptual losses as in [21].
Despite their flexibility, such similarity metrics are in general degenerate in the sense that they may discard information about the data point x. For instance, two different images x and y can collapse to the same point in feature space, i.e., fψ(x) = fψ(y).
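A two-line example of such degeneracy (our own toy feature map, not one from the paper): a feature that sums pixel intensities assigns identical features to visibly different images, so no density over inputs can be recovered from it.

```python
import numpy as np

# A non-injective feature map collapses two different "images" to the
# same feature value: all spatial information is discarded.

x = np.zeros((4, 4)); x[0, 0] = 1.0  # mass in the top-left corner
y = np.zeros((4, 4)); y[3, 3] = 1.0  # mass in the bottom-right corner

def collapse_feature(img):
    return float(img.sum())          # toy, non-injective feature

same_feature = collapse_feature(x) == collapse_feature(y)  # features collapse
different_images = not np.array_equal(x, y)                # yet x != y
```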
This limits the use of similarity metrics in the context of generative modeling for two reasons: (i) it does not yield a valid measure of likelihood over inputs, and (ii) points generated in the feature space fψ cannot easily be mapped to images.
To resolve this issue, we choose fψ to be a bijection. Given a model pθ trained to model fψ(x) in feature space, a density in image space is computed using the change of variable formula, which yields pθ,ψ(x) = pθ(fψ(x)) |det(∂fψ(x)/∂x⊤)|. Image samples are obtained by sampling from pθ in feature space, and mapping to the image space through fψ⁻¹. We refer to this construction as Adaptive Density Estimation. If pθ provides efficient log-likelihood computations, the change of variable formula can be used to train fψ and pθ together by maximum likelihood, and if pθ provides fast sampling, adversarial training can be performed efficiently.
MLE with adaptive density estimation. To train a generative latent variable model pθ(x) which permits efficient sampling, we rely on amortized variational inference. We use an inference network qφ(z|x) to construct a variational evidence lower bound (ELBO),

L^ψ_ELBO(x, θ, φ) = E_{qφ(z|x)}[ln(pθ(fψ(x)|z))] − DKL(qφ(z|x) || pθ(z)) ≤ ln pθ(fψ(x)).   (3)

Using this lower bound together with the change of variable formula, the mapping to the similarity space fψ and the generative model pθ can be trained jointly with the loss

LC(θ, φ, ψ) = E_{x∼p∗}[ −L^ψ_ELBO(x, θ, φ) − ln |det(∂fψ(x)/∂x⊤)| ] ≥ − E_{x∼p∗}[ln pθ,ψ(x)].   (4)

We use gradient descent to train fψ by optimizing LC(θ, φ, ψ) w.r.t. ψ. The L_ELBO term encourages the mapping fψ to maximize the density of points in feature space under the model pθ, so that fψ is trained to match modeling assumptions made in pθ. Simultaneously, the log-determinant term encourages fψ to maximize the volume of data points in feature space. This guarantees that data points cannot be collapsed to a single point in the feature space. We use a factored Gaussian form of the conditional pθ(·|z) for tractability, but since fψ can arbitrarily reshape the corresponding conditional image space, it still avoids simplifying assumptions in the image space. Therefore, the (invertible) transformation fψ avoids the conflict between the MLE and adversarial training mechanisms, and can leverage both.
Adversarial training with adaptive density estimation. To sample the generative model, we sample latents from the prior, z ∼ pθ(z), which are then mapped to feature space through μθ(z), and to image space through fψ⁻¹. We train our generator using the modified objective proposed by [50], combining both generator losses considered in [15], i.e. 
ln[(1 − D(G(z)))/D(G(z))], which yields:

LQ(pθ,ψ) = − E_{pθ(z)}[ ln D(fψ⁻¹(μθ(z))) − ln(1 − D(fψ⁻¹(μθ(z)))) ].   (5)

Assuming the discriminator D is trained to optimality at every step, it is easy to demonstrate that the generator is trained to optimize DKL(pθ,ψ||p∗). The training procedure, written as an algorithm in Appendix H, alternates between (i) bringing LQ(pθ,ψ) closer to its optimal value L∗Q(pθ,ψ) = DKL(pθ,ψ||p∗), and (ii) minimizing LC(pθ,ψ) + LQ(pθ,ψ). Assuming that the discriminator is trained to optimality at every step, the generative model is trained to minimize a bound on the symmetric sum of two KL divergences: LC(pθ,ψ) + L∗Q(pθ,ψ) ≥ DKL(p∗||pθ,ψ) + DKL(pθ,ψ||p∗) + H(p∗), where the entropy of the data generating distribution, H(p∗), is an additive constant independent of the generative model pθ,ψ. In contrast, MLE and GANs each optimize only one of these divergences.

5 Experimental evaluation

We present our evaluation protocol, followed by an ablation study to assess the importance of the components of our model (Section 5.1). We then show the quantitative and qualitative performance of our model, and compare it to the state of the art on the CIFAR-10 dataset in Section 5.2. We present additional results and comparisons on higher resolution datasets in Section 5.3.

          fψ   Adv.  MLE   BPD ↓   IS ↑   FID ↓
GAN       ×    ✓     ×     [7.0]   6.8    31.4
VAE       ×    ×     ✓     4.4     2.0    171.0
V-ADE†    ✓    ×     ✓     3.5     3.0    112.0
AV-GDE    ×    ✓     ✓     4.4     5.1    58.6
AV-ADE†   ✓    ✓     ✓     3.9     7.1    28.0

Table 1: Quantitative results. †: Parameter count decreased by 1.4% to compensate for fψ. [Square brackets] denote that the value is approximated, see Section 5.

Figure 5: Samples from GAN and VAE baselines, our V-ADE, AV-GDE and AV-ADE models, all trained on CIFAR-10. Panels: VAE, V-ADE, AV-GDE, GAN, AV-ADE (Ours).

Evaluation metrics. We evaluate our models with complementary metrics. To assess sample quality, we report the Fréchet inception distance (FID) [19] and the inception score (IS) [45], which are the de facto standard metrics to evaluate GANs [5, 54]. Although these metrics focus on sample quality, they are also sensitive to coverage, see Appendix D for details. To specifically evaluate the coverage of held-out data, we use the standard bits per dimension (BPD) metric, defined as the negative log-likelihood on held-out data, averaged across pixels and color channels [9].
Due to their degenerate low-dimensional support, GANs do not define a density in the image space, which prevents measuring BPD on them. To endow a GAN with a full support and a likelihood, we train an inference network “around it”, while keeping the weights of the GAN generator fixed. We also train an isotropic noise parameter σ. For both GANs and VAEs, we use this inference network to compute a lower bound to approximate the likelihood, i.e., an upper bound on BPD. We evaluate all metrics using held-out data not used during training, which improves over common practice in the GAN literature, where training data is often used for evaluation.

5.1 Ablation study and comparison to VAE and GAN baselines

We conduct an ablation study on the CIFAR-10 dataset.1 Our GAN baseline uses the non-residual architecture of SNGAN [37], which is stable and quick to train, without spectral normalization. The same convolutional architecture is kept to build a VAE baseline.2 It produces the mean of a factorizing Gaussian distribution. 
To ensure a valid density model we add a trainable isotropic variance σ. We train the generator for coverage by optimizing LC(pθ), for quality by optimizing LQ(pθ), and for both by optimizing the sum LC(pθ) + LQ(pθ). The model using Variational inference with Adaptive Density Estimation (ADE) is referred to as V-ADE. The addition of adversarial training is denoted AV-ADE, and hybrid training with a Gaussian decoder as AV-GDE. The bijective function fψ, implemented as a small Real-NVP with 1 scale, 3 residual blocks, and 2 layers per block, increases the number of weights by approximately 1.4%. We compensate for these additional parameters with a slight decrease in the width of the generator for fair comparison.3 See Appendix B for details.
Experimental results in Table 1 confirm that the GAN baseline yields better sample quality (IS and FID) than the VAE baseline: obtaining inception scores of 6.8 and 2.0, respectively. Conversely, the VAE achieves better coverage, with a BPD of 4.4, compared to 7.0 for the GAN. An identical generator trained for both quality and coverage, AV-GDE, obtains a sample quality that is in between that of the GAN and the VAE baselines, in line with the analysis in Section 4. Samples from the different models in Figure 5 confirm these quantitative results. Using fψ and training with LC(pθ) only, denoted by V-ADE in the table, leads to improved sample quality with IS up from 2.0 to 3.0 and FID down from 171 to 112. Note that this quality is still below the GAN baseline and our AV-GDE model.
When fψ is used with coverage- and quality-driven training, AV-ADE, we obtain improved IS and FID scores over the GAN baseline, with IS up from 6.8 to 7.1, and FID down from 31.4 to 28.0. 
1 We use the standard split of 50k/10k train/test images of 32×32 pixels.
2 In the VAE model, some intermediate feature maps are treated as conditional latent variables, allowing for hierarchical top-down sampling (see Appendix B). Experimentally, we find that similar top-down sampling is not effective for the GAN model.
3 This is, however, too small to have a significant impact on the experimental results.

The examples shown in the figure confirm the high quality of the samples generated by our AV-ADE model. Our model also achieves a better BPD than the VAE baseline. These experiments demonstrate that our proposed bijective feature space substantially improves the compatibility of coverage- and quality-driven training. We obtain improvements over both VAE and GAN in terms of held-out likelihood, and improve VAE sample quality to, or beyond, that of GAN. We further evaluate our model using the recent precision and recall approach of [43] and the classification framework of [48] in Appendix E. Additional results showing the impact of the number of layers and scales in the bijective similarity mapping fψ (Appendix F), and reconstructions qualitatively demonstrating the inference abilities of our AV-ADE model (Appendix G), are presented in the supplementary material.

5.2 Refinements and comparison to the state of the art

We now consider further refinements to our model, inspired by recent generative modeling literature. Four refinements are used: (i) adding residual connections to the discriminator [17] (rd), (ii) leveraging more accurate posterior approximations using inverse auto-regressive flow [26] (iaf), see Appendix B, (iii) training wider generators with twice as many channels (wg), and (iv) using a hierarchy of two scales to build fψ (s2), see Appendix F. 
Table 2 shows consistent improvements with these additions, in terms of BPD, IS, and FID.

Refinements          BPD ↓   IS ↑   FID ↓
GAN                  [7.0]   6.8    31.4
GAN (rd)             [6.9]   7.4    24.0
AV-ADE               3.9     7.1    28.0
AV-ADE (rd)          3.8     7.5    26.0
AV-ADE (wg, rd)      3.8     8.2    17.2
AV-ADE (iaf, rd)     3.7     8.1    18.6
AV-ADE (s2)          3.5     6.9    28.9

Table 2: Model refinements.

Table 3 compares our model to existing hybrid approaches and state-of-the-art generative models on CIFAR-10. In the category of hybrid models that define a valid likelihood over the data space, denoted by Hybrid (L) in the table, FlowGAN(H) optimizes MLE and an adversarial loss, and FlowGAN(A) is trained adversarially. The AV-ADE model significantly outperforms these two variants both in terms of BPD, from 4.2 to between 3.5 and 3.8, and quality, e.g., IS improves from 5.8 to 8.2. Compared to models that train an inference network adversarially, denoted by Hybrid (A), our model shows a substantial improvement in IS from 7.0 to 8.2. Note that these models do not allow likelihood evaluation, thus BPD values are not defined.
Compared to adversarial models, which are not optimized for support coverage, AV-ADE obtains better FID (17.2 down from 21.7) and similar IS (8.2 for both) compared to SNGAN with residual connections and hinge-loss, despite training on 17% less data than GANs (test split removed). The improvement in FID is likely due to this measure being more sensitive to support coverage than IS. Compared to models optimized with MLE only, we obtain a BPD between 3.5 and 3.7, comparable to 3.5 for Real-NVP, demonstrating a good coverage of the support of held-out data. We computed IS and FID scores for MLE-based models using publicly released code, with provided parameters (denoted by † in the table) or trained ourselves (denoted by ‡). Despite being smaller (for reference Glow has 384 layers vs. 
at most 10 for our deeper generator), our AV-ADE model generates better samples, e.g., IS up from 5.5 to 8.2 (samples displayed in Figure 6), owing to quality-driven training controlling where the off-dataset mass goes. Additional samples from our AV-ADE model and comparisons to other models are given in Appendix A.

Hybrid (L)         BPD ↓  IS ↑  FID ↓
AV-ADE (wg, rd)    3.8    8.2   17.2
AV-ADE (iaf, rd)   3.7    8.1   18.6
AV-ADE (S2)        3.5    6.9   28.9
FlowGAN(A) [16]    8.5    5.8   ×
FlowGAN(H) [16]    4.2    3.9   ×

Hybrid (A)         BPD ↓  IS ↑  FID ↓
AGE [52]           ×      5.9   ×
ALI [12]           ×      5.3   ×
SVAE [6]           ×      6.8   ×
α-GAN [41]         ×      6.8   ×
SVAE-r [6]         ×      7.0   ×

Adversarial        BPD ↓  IS ↑  FID ↓
mmd-GAN [1]        ×      7.3   25.0
SNGAN [37]         ×      7.4   29.3
BatchGAN [32]      ×      7.5   23.7
WGAN-GP [17]       ×      7.9   ×
SNGAN(R,H)         ×      8.2   21.7

MLE                BPD ↓  IS ↑   FID ↓
Real-NVP [9]       3.5    4.5†   56.8†
VAE-IAF [26]       3.1    3.8†   73.5†
PixelCNN++ [46]    2.9    5.5    ×
Flow++ [20]        3.1    ×      ×
Glow [24]          3.4    5.5‡   46.8‡

Table 3: Performance on CIFAR10, without labels. MLE and Hybrid (L) models discard the test split. †: computed by us using provided weights. ‡: computed by us using provided code to (re)train models.

5.3 Results on additional datasets
To further validate our model, we evaluate it on STL10 (48×48), ImageNet and LSUN (both 64×64). We use a wide generator to account for the higher resolution, without IAF, a single scale in fψ, and no residual blocks (see Section 5.2). The architecture and training hyper-parameters are not tuned, besides adding one layer at resolution 64×64, which demonstrates the stability of our approach. On STL10, Table 4 shows that our AV-ADE improves inception score over SNGAN, from 9.1 up to

Glow @ 3.35 BPD | FlowGAN(H) @ 4.21 BPD | AV-ADE (iaf, rd) @ 3.7 BPD

Figure 6: Samples from models trained on CIFAR-10.
Our AV-ADE spills less mass on unrealistic samples, owing to adversarial training which controls where off-dataset mass goes.

9.4, and is second best in FID. Our likelihood performance, between 3.8 and 4.4, close to that of Real-NVP at 3.7, demonstrates good coverage of the support of held-out data. On the ImageNet dataset, maintaining high sample quality while covering the full support is challenging, due to its very diverse support. Our AV-ADE model obtains a sample quality behind that of MMD-GAN, with IS/FID scores at 8.5/45.5 vs. 10.9/36.6. However, MMD-GAN is trained purely adversarially and does not provide a valid density across the data space, unlike our approach.
Figure 7 shows samples from our generator trained on a single GPU with 11 GB of memory on LSUN classes. Our model yields more compelling samples compared to those of Glow, despite having fewer layers (7 vs. over 500). Additional samples on other LSUN categories are presented in Appendix A.

STL-10 (48×48)      BPD ↓  IS ↑  FID ↓
AV-ADE (wg, wd)     4.4    9.4   44.3
AV-ADE (iaf, wd)    4.0    8.6   52.7
AV-ADE (s2)         3.8    8.6   52.1
Real-NVP            3.7‡   4.8‡  103.2‡
BatchGAN            ×      8.7   51
SNGAN (Res-Hinge)   ×      9.1   40.1

ImageNet (64×64)    BPD ↓  IS ↑  FID ↓
AV-ADE (wg, wd)     4.90   8.5   45.5
Real-NVP            3.98   ×     ×
Glow                3.81   ×     ×
Flow++              3.69   ×     ×
MMD-GAN             ×      10.9  36.6

LSUN                Real-NVP   Glow            Ours
Bedroom             (2.72/×)   (2.38/208.8†)   (3.91, 21.1)
Tower               (2.81/×)   (2.46/214.5†)   (3.95, 15.5)
Church              (3.08/×)   (2.67/222.3†)   (4.3, 13.1)
Classroom           ×          ×               (4.6, 20.0)
Restaurant          ×          ×               (4.7, 20.5)

Table 4: Results on the STL-10, ImageNet, and LSUN datasets. AV-ADE (wg, rd) is used for LSUN.

Glow [24] | Ours: AV-ADE (wg, rd)

Figure 7: Samples from models trained on LSUN Churches (C) and bedrooms (B). Our AV-ADE model over-generalises less and produces more compelling samples.
See Appendix A for more classes and samples.

6 Conclusion
We presented a generative model that leverages invertible network layers to relax the conditional independence assumptions commonly made in VAE decoders. It allows for efficient feed-forward sampling, and can be trained using a maximum likelihood criterion that ensures coverage of the data generating distribution, as well as an adversarial criterion that ensures high sample quality.

Acknowledgments. The authors would like to thank Corentin Tallec, Mathilde Caron, Adria Ruiz and Nikita Dvornik for useful feedback and discussions. Acknowledgments also go to our anonymous reviewers, who contributed valuable comments and remarks.
This work was supported in part by the grants ANR16-CE23-0006 "Deep in France" and LabEx PERSYVAL-Lab (ANR-11-LABX0025-01), as well as the Indo-French project EVEREST (no. 5302-1) funded by CEFIPRA and a grant from ANR (AVENUE project ANR-18-CE23-0011).

References
[1] M. Arbel, D. J. Sutherland, M. Binkowski, and A. Gretton. On gradient regularizers for MMD GANs. In NeurIPS, 2018.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
[3] P. Bachman. An architecture for deep, hierarchical generative models. In NeurIPS, 2016.
[4] C. Bishop. Pattern recognition and machine learning. Springer-Verlag, 2006.
[5] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[6] L. Chen, S. Dai, Y. Pu, E. Zhou, C. Li, Q. Su, C. Chen, and L. Carin. Symmetric variational autoencoder and connections to adversarial learning. In AISTATS, 2018.
[7] X. Chen, D. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. In ICLR, 2017.
[8] H. De Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. Courville. Modulating early visual processing by language.
In NeurIPS, 2017.
[9] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In ICLR, 2017.
[10] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.
[11] G. Dorta, S. Vicente, L. Agapito, N. Campbell, and I. Simpson. Structured uncertainty prediction networks. In CVPR, 2018.
[12] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. In ICLR, 2017.
[13] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.
[14] M. Elfeki, C. Couprie, M. Riviere, and M. Elhoseiny. GDPP: Learning diverse generations using determinantal point processes. arXiv preprint arXiv:1812.00068, 2018.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
[16] A. Grover, M. Dhar, and S. Ermon. Flow-GAN: Combining maximum likelihood and adversarial learning in generative models. In AAAI, 2018.
[17] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In NeurIPS, 2017.
[18] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville. PixelVAE: A latent variable model for natural images. In ICLR, 2017.
[19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[20] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In ICML, 2019.
[21] X. Hou, L. Shen, K. Sun, and G. Qiu. Deep feature consistent variational autoencoder. In WACV, 2017.
[22] T. Karras, T. Aila, S. Laine, and J. Lehtinen.
Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
[23] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[24] D. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In NeurIPS, 2018.
[25] D. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[26] D. Kingma, T. Salimans, R. Józefowicz, X. Chen, I. Sutskever, and M. Welling. Improving variational autoencoders with inverse autoregressive flow. In NeurIPS, 2016.
[27] A. Larsen, S. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
[28] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML, 2015.
[29] Z. Lin, A. Khetan, G. Fanti, and S. Oh. PacGAN: The power of two samples in generative adversarial networks. In NeurIPS, 2018.
[30] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In CVPR, 2018.
[31] T. Lucas and J. Verbeek. Auxiliary guided autoregressive variational autoencoders. In ECML, 2018.
[32] T. Lucas, C. Tallec, Y. Ollivier, and J. Verbeek. Mixed batches and symmetric discriminators for GAN training. In ICML, 2018.
[33] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. In ICLR, 2016.
[34] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[35] J. Menick and N. Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In ICLR, 2019.
[36] T. Miyato and M. Koyama. cGANs with projection discriminator. In ICLR, 2018.
[37] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
[38] P. Ramachandran, T. Paine, P. Khorrami, M.
Babaeizadeh, S. Chang, Y. Zhang, M. Hasegawa-Johnson, R. Campbell, and T. Huang. Fast generation for convolutional autoregressive models. In ICLR workshop, 2017.
[39] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In ICML, 2015.
[40] D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[41] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
[42] Y. Saatchi and A. Wilson. Bayesian GAN. In NeurIPS, 2017.
[43] M. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing generative models via precision and recall. In NeurIPS, 2018.
[44] T. Salimans and D. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NeurIPS, 2016.
[45] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NeurIPS, 2016.
[46] T. Salimans, A. Karpathy, X. Chen, and D. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.
[47] T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving GANs using optimal transport. In ICLR, 2018.
[48] K. Shmelkov, C. Schmid, and K. Alahari. How good is my GAN? In ECCV, 2018.
[49] C. Sønderby, T. Raiko, L. Maaløe, S. Sønderby, and O. Winther. Ladder variational autoencoders. In NeurIPS, 2016.
[50] C. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. In ICLR, 2017.
[51] H. Thanh-Tung, T. Tran, and S. Venkatesh. Improving generalization and stability of generative adversarial networks. In ICLR, 2019.
[52] D. Ulyanov, A. Vedaldi, and V. Lempitsky.
It takes (only) two: Adversarial generator-encoder networks. In AAAI, 2018.
[53] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
[54] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In ICML, 2019.