{"title": "Generating Diverse High-Fidelity Images with VQ-VAE-2", "book": "Advances in Neural Information Processing Systems", "page_first": 14866, "page_last": 14876, "abstract": "We explore the use of Vector Quantized Variational AutoEncoder (VQ-VAE) models for large scale image generation.\nTo this end, we scale and enhance the autoregressive priors used in VQ-VAE to generate synthetic samples of much higher coherence and fidelity than possible before. \nWe use simple feed-forward encoder and decoder networks, making our model an attractive candidate for applications where the encoding and/or decoding speed is critical. Additionally, VQ-VAE  requires sampling an autoregressive model only in the compressed latent space, which is an order of magnitude faster than sampling in the pixel space, especially for large images.\nWe demonstrate that a multi-scale hierarchical organization of  VQ-VAE, augmented with powerful priors over the latent codes, is able to generate samples with quality that rivals that of state of the art Generative Adversarial Networks on multifaceted datasets such as ImageNet, while not suffering from GAN's known shortcomings such as mode collapse and lack of diversity.", "full_text": "Generating Diverse High-Fidelity Images\n\nwith VQ-VAE-2\n\nAli Razavi\u2217\nDeepMind\n\nalirazavi@google.com\n\nA\u00e4ron van den Oord\u2217\n\nDeepMind\n\navdnoord@google.com\n\nOriol Vinyals\n\nDeepMind\n\nvinyals@google.com\n\nAbstract\n\nWe explore the use of Vector Quantized Variational AutoEncoder (VQ-VAE)\nmodels for large scale image generation. To this end, we scale and enhance the\nautoregressive priors used in VQ-VAE to generate synthetic samples of much higher\ncoherence and \ufb01delity than possible before. We use simple feed-forward encoder\nand decoder networks, making our model an attractive candidate for applications\nwhere the encoding and/or decoding speed is critical. 
Additionally, VQ-VAE\nrequires sampling an autoregressive model only in the compressed latent space,\nwhich is an order of magnitude faster than sampling in the pixel space, especially\nfor large images. We demonstrate that a multi-scale hierarchical organization of\nVQ-VAE, augmented with powerful priors over the latent codes, is able to generate\nsamples with quality that rivals that of state of the art Generative Adversarial\nNetworks on multifaceted datasets such as ImageNet, while not suffering from\nGAN\u2019s known shortcomings such as mode collapse and lack of diversity.\n\n\u2217Equal contributions.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nDeep generative models have significantly improved in the past few years [5, 27, 25]. This is, in part,\nthanks to architectural innovations as well as computational advances that allow training them at larger\nscale in both amount of data and model size. The samples generated from these models are hard to\ndistinguish from real data without close inspection, and their applications range from super resolution\n[21] to domain editing [44], artistic manipulation [36], or text-to-speech and music generation [25].\n\nFigure 1: Class-conditional 256x256 image samples from a two-level model trained on ImageNet.\n\nWe distinguish two main types of generative models: likelihood based models, which include VAEs\n[16, 31], flow based [9, 30, 10, 17] and autoregressive models [20, 39]; and implicit generative\nmodels such as Generative Adversarial Networks (GANs) [12]. Each of these models offers several\ntrade-offs such as sample quality, diversity, speed, etc.\nGANs optimize a minimax objective with a generator neural network producing images by mapping\nrandom noise onto an image, and a discriminator defining the generator\u2019s loss function by classifying\nits samples as real or fake. 
Larger scale GAN models can now generate high-quality and high-\nresolution images [5, 14]. However, it is well known that samples from these models do not fully\ncapture the diversity of the true distribution. Furthermore, GANs are challenging to evaluate, and a\nsatisfactory generalization measure on a test set to assess over\ufb01tting does not yet exist. For model\ncomparison and selection, researchers have used image samples or proxy measures of image quality\nsuch as Inception Score (IS) [33] and Fr\u00e9chet Inception Distance (FID) [13].\nIn contrast, likelihood based methods optimize negative log-likelihood (NLL) of the training data.\nThis objective allows model-comparison and measuring generalization to unseen data. Additionally,\nsince the probability that the model assigns to all examples in the training set is maximized, likelihood\nbased models, in principle, cover all modes of the data, and do not suffer from the problems of\nmode collapse and lack of diversity seen in GANs. In spite of these advantages, directly maximizing\nlikelihood in the pixel space can be challenging. First, NLL in pixel space is not always a good\nmeasure of sample quality [37], and cannot be reliably used to make comparisons between different\nmodel classes. There is no intrinsic incentive for these models to focus on, for example, global\nstructure. Some of these issues are alleviated by introducing inductive biases such as multi-scale\n[38, 39, 29, 22] or by modeling the dominant bit planes in an image [18, 17].\nIn this paper we use ideas from lossy compression to relieve the generative model from modeling\nnegligible information. Indeed, techniques such as JPEG [43] have shown that it is often possible\nto remove more than 80% of the data without noticeably changing the perceived image quality. As\nproposed by [41], we compress images into a discrete latent space by vector-quantizing intermediate\nrepresentations of an autoencoder. 
These representations are over 30x smaller than the original image,\nbut still allow the decoder to reconstruct the images with little distortion. The prior over these discrete\nrepresentations can be modeled with a state of the art PixelCNN [39, 40] with self-attention [42],\ncalled PixelSNAIL [7]. When sampling from this prior, the decoded images also exhibit the same high\nquality and coherence as the reconstructions (see Fig. 1). Furthermore, the training and sampling of\nthis generative model over the discrete latent space are also 30x faster than when directly applied to the\npixels, allowing us to train on much higher resolution images. Finally, the encoder and decoder used\nin this work retain the simplicity and speed of the original VQ-VAE, which means that the proposed\nmethod is an attractive solution for situations in which fast, low-overhead encoding and decoding of\nlarge images is required.\n\n2 Background\n\n2.1 Vector Quantized Variational AutoEncoder\n\nThe VQ-VAE model [41] can be better understood as a communication system. It consists of\nan encoder that maps observations onto a sequence of discrete latent variables, and a decoder that\nreconstructs the observations from these discrete variables. Both encoder and decoder use a shared\ncodebook. More formally, the encoder is a non-linear mapping from the input space, x, to a vector\nE(x). This vector is then quantized based on its distance to the prototype vectors in the codebook\ne_k, k \u2208 1 ... K, such that each vector E(x) is replaced by the index of the nearest prototype vector in\nthe codebook, and is transmitted to the decoder (note that this process can be lossy).\n\nQuantize(E(x)) = e_k where k = arg min_j ||E(x) \u2212 e_j||    (1)\n\nThe decoder maps back the received indices to their corresponding vectors in the codebook, from\nwhich it reconstructs the data via another non-linear function. 
To learn these mappings, the gradient\nof the reconstruction error is then back-propagated through the decoder, and to the encoder using\nthe straight-through gradient estimator. The VQ-VAE model incorporates two additional terms in its\nobjective to align the vector space of the codebook with the output of the encoder. The codebook\nloss, which only applies to the codebook variables, brings the selected codebook vector e close to the output\nof the encoder, E(x). The commitment loss, which only applies to the encoder weights, encourages\nthe output of the encoder to stay close to the chosen codebook vector to prevent it from fluctuating\ntoo frequently from one code vector to another. The overall objective is described in equation 2,\nwhere e is the quantized code for the training example x, E is the encoder function and D is the\ndecoder function. 
The operator sg refers to a stop-gradient operation that blocks gradients from\nflowing into its argument, and \u03b2 is a hyperparameter which controls the reluctance to change the code\ncorresponding to the encoder output.\n\nL(x, D(e)) = ||x \u2212 D(e)||_2^2 + ||sg[E(x)] \u2212 e||_2^2 + \u03b2 ||sg[e] \u2212 E(x)||_2^2    (2)\n\nAs proposed in [41], we use the exponential moving average updates for the codebook, as a replacement\nfor the codebook loss (the second loss term in equation 2):\n\nN_i^{(t)} := \u03b3 N_i^{(t\u22121)} + (1 \u2212 \u03b3) n_i^{(t)},\nm_i^{(t)} := \u03b3 m_i^{(t\u22121)} + (1 \u2212 \u03b3) \u2211_{j=1}^{n_i^{(t)}} E(x)_{i,j}^{(t)},\ne_i^{(t)} := m_i^{(t)} / N_i^{(t)},\n\nwhere n_i^{(t)} is the number of vectors in E(x) in the mini-batch that will be quantized to codebook\nitem e_i, and \u03b3 is a decay parameter with a value between 0 and 1 (default \u03b3 = 0.99 is used in all\nexperiments). We use the released VQ-VAE implementation in the Sonnet library (footnotes 2 and 3).\n\nFigure 2: VQ-VAE architecture.\n(a) Overview of the architecture of our hierarchical VQ-VAE. The encoders and decoders consist of\ndeep neural networks. The input to the model is a 256 \u00d7 256 image that is compressed to quantized\nlatent maps of size 64 \u00d7 64 and 32 \u00d7 32 for the bottom and top levels, respectively. The decoder\nreconstructs the image from the two latent maps.\n(b) Multi-stage image generation. The top-level PixelCNN prior is conditioned on the class label,\nthe bottom level PixelCNN is conditioned on the class label as well as the first level code. Thanks\nto the feed-forward decoder, the mapping from latents to pixels is fast. (The example image with\na parrot is generated with this model).\n\n3 Method\n\nThe proposed method follows a two-stage approach: first, we train a hierarchical VQ-VAE (see\nFig. 2a) to encode images onto a discrete latent space, and then we fit a powerful PixelCNN prior\nover the discrete latent space induced by all the data.\n\n3.1 Stage 1: Learning Hierarchical Latent Codes\n\nAs opposed to vanilla VQ-VAE, in this work we use a hierarchy of vector quantized codes to\nmodel large images. The main motivation behind this is to model local information, such as texture,\nseparately from global information such as shape and geometry of objects. The prior model over each\nlevel can thus be tailored to capture the specific correlations that exist in that level. The structure of our\nmulti-scale hierarchical encoder is illustrated in Fig. 
2a, with a top latent code which models global\ninformation, and a bottom latent code, conditioned on the top latent, responsible for representing\nlocal details (see Fig. 3). We note that if we did not condition the bottom latent on the top latent, then the\ntop latent would need to encode every detail from the pixels. We therefore allow each level in the\nhierarchy to separately depend on pixels, which encourages encoding complementary information in\neach latent map that can contribute to reducing the reconstruction error in the decoder.\nFor 256 \u00d7 256 images, we use a two level latent hierarchy. As depicted in Fig. 2a, the encoder\nnetwork first transforms and downsamples the image by a factor of 4 to a 64 \u00d7 64 representation\nwhich is quantized to our bottom level latent map. Another stack of residual blocks then further\nscales down the representations by a factor of two, yielding a top-level 32 \u00d7 32 latent map after\nquantization. The decoder is similarly a feed-forward network that takes as input all levels of the\nquantized latent hierarchy. It consists of a few residual blocks followed by a number of strided\ntransposed convolutions to upsample the representations back to the original image size.\n\nFootnote 2: https://github.com/deepmind/sonnet/blob/master/sonnet/python/modules/nets/vqvae.py\nFootnote 3: https://github.com/deepmind/sonnet/blob/master/sonnet/examples/vqvae_example.ipynb\n\n3.2 Stage 2: Learning Priors over Latent Codes\n\nIn order to further compress the image, and to be able to sample from the model learned during\nstage 1, we learn a prior over the latent codes. Fitting prior distributions using neural networks from\ntraining data has become common practice, as it can significantly improve the performance of latent\nvariable models [6]. 
This procedure also reduces the gap between the marginal posterior and the\nprior. Thus, latent variables sampled from the learned prior at test time are close to what the decoder\nnetwork has observed during training which results in more coherent outputs. From an information\ntheoretic point of view, the process of \ufb01tting a prior to the learned posterior can be considered as\nlossless compression of the latent space by re-encoding the latent variables with a distribution that\nis a better approximation of their true distribution, and thus results in bit rates closer to Shannon\u2019s\nentropy. Therefore the lower the gap between the true entropy and the negative log-likelihood of the\nlearned prior, the more realistic image samples one can expect from decoding the latent samples.\nIn the VQ-VAE framework, this auxiliary prior is modeled with a powerful, autoregressive neural\nnetwork such as PixelCNN in a post-hoc, second stage. The prior over the top latent map is responsible\nfor structural global information. Thus, we equip it with multi-headed self-attention layers as in\n[7, 26] so it can bene\ufb01t from a larger receptive \ufb01eld to capture correlations in spatial locations that\nare far apart in the image. In contrast, the conditional prior model for the bottom level over latents\nthat encode local information will operate at a larger resolution. Using self-attention layers as in the\ntop-level prior would not be practical due to memory constraints. For this prior over local information,\nwe thus \ufb01nd that using large conditioning stacks (coming from the top prior) yields good performance\n(see Fig. 2b). The hierarchical factorization also allows us to train larger models: we train each prior\nseparately, thereby leveraging all the available compute and memory on hardware accelerators. 
Please\nrefer to Appendix A for the details of the architecture and hyperparameters.\n\nFigure 3: Reconstructions from a hierarchical VQ-VAE with three latent maps (top, middle, bottom).\nPanel labels: h_top; h_top, h_middle; h_top, h_middle, h_bottom; Original.\nThe rightmost image is the original. Each latent map adds extra detail to the reconstruction. These\nlatent maps are approximately 3072x, 768x and 192x smaller than the original image (respectively).\n\n3.3 Trading off Diversity with Classifier Based Rejection Sampling\n\nUnlike GANs, probabilistic models trained with the maximum likelihood objective are forced to\nmodel all of the training data distribution. This is because the MLE objective can be expressed as the\nforward KL-divergence between the data and model distributions, which would be driven to infinity\nif an example in the training data is assigned zero mass. While the coverage of all modes in the data\ndistribution is an appealing property of these models, the task is considerably more difficult than\nadversarial modeling, since likelihood based models need to fit all the modes present in the data.\nFurthermore, ancestral sampling from autoregressive models can in practice induce errors that can\naccumulate over long sequences and result in samples with reduced quality. Recent GAN frameworks\n[5, 2] have proposed automated procedures for sample selection to trade off diversity and quality.\nIn this work, we also propose an automated method for trading off diversity and quality of samples\nbased on the intuition that the closer our samples are to the true data manifold, the more likely they\nare to be classified to the correct class labels by a pre-trained classifier. Specifically, we use a classifier\nnetwork that is trained on ImageNet to score samples from our model according to the probability the\nclassifier assigns to the correct class. 
Note that we only use this classifier for the quantitative metrics in\nthis paper (such as FID, IS, Precision, Recall) to trade off diversity with quality. None of the samples\nin this manuscript are sampled using this classifier (please follow the link in the Appendix Section to\nsee these).\n\n4 Related Works\n\nThe foundation of our work is the VQ-VAE framework of [41]. Our prior network is based on Gated\nPixelCNN [40] augmented with self-attention [42], as proposed in [7].\nBigGAN [5] is currently state-of-the-art in FID and Inception scores, and produces high quality\nhigh-resolution images. The improvements in BigGAN come mostly from incorporating architectural\nadvances such as self-attention, better stabilization methods, scaling up the model on TPUs and a\nmechanism to trade off sample diversity with sample quality. In our work we also investigate how the\naddition of some of these elements, in particular self-attention and compute scale, improves\nthe quality of samples of VQ-VAE models. Recent work on generating high\nresolution images with likelihood based models includes Subscale Pixel Networks (SPN) [22]. Similar\nto the parallel multi-scale model of [29], SPN imposes a partitioning on the spatial dimensions, but\nunlike [29], SPN does not make the corresponding independence assumptions, whereby it trades\nsampling speed for density estimation performance and sample quality.\nHierarchical latent variables have been proposed in e.g. [31]. Specifically for VQ-VAE, [8] uses a\nhierarchy of latent codes for modeling and generating music using a WaveNet decoder. The specifics\nof the encoding are, however, different from ours: in our work, the bottom levels of hierarchy do\nnot exclusively refine the information encoded by the top level, but they extract complementary\ninformation at each level, as discussed in Sect. 3.1. 
Additionally, as we are using simple, feed-forward\ndecoders and optimizing mean squared error in the pixels, our model does not suffer from, and thus\nneeds no mitigation for, the hierarchy collapse problems detailed in [8]. Concurrent to our work,\n[11] extends [8] for generating high-resolution images. The primary difference to our work is the\nuse of autoregressive decoders in the pixel space. In contrast, for reasons detailed in Sect. 3, we\nuse autoregressive models exclusively as priors in the compressed latent space, which simplifies the\nmodel and greatly improves sampling speed. Additionally, the same differences with [8] outlined\nabove also exist between our method and [11].\nImproving sample quality by rejection sampling has been previously explored for GANs [2] as well as\nfor VAEs [4], where a learned rejection sampling proposal is combined with the prior in order to reduce\nits gap with the aggregate posterior. Neural networks have recently been used towards learned image\ncompression. For lossy image compression, [24] trains hierarchical and autoregressive priors jointly\nto improve the entropy coding part of the compression system. L3C [23] is a parallel architecture\nproposed for lossless image compression that uses jointly learned auxiliary latent spaces to\nachieve speedups in sampling compared to autoregressive models. Using GANs for extremely low\nrate compression is explored in [35] and [1].\n\n5 Experiments\n\nObjective evaluation and comparison of generative models, especially across model families, remains a\nchallenge [37]. Current image generation models trade off sample quality and diversity (or precision\nvs recall [32]). In this section, we present quantitative and qualitative results of our model trained on\nImageNet 256 \u00d7 256. Samples are indeed sharp and of high quality across several representative classes,\nas can be seen in the class conditional samples provided in Fig. 4. 
In terms of diversity, we provide\nsamples from our model juxtaposed with those of BigGAN-deep [5], the state of the art GAN model\n(see footnote 4), in Fig. 5. As can be seen in these side-by-side comparisons, VQ-VAE is able to provide samples of\ncomparable fidelity and higher diversity.\n\nModel | Train NLL | Validation NLL | Train MSE | Validation MSE\nTop prior | 3.40 | 3.41 | - | -\nBottom prior | 3.45 | 3.45 | - | -\nVQ Decoder | - | - | 0.0047 | 0.0050\n\nTable 1: Train and validation negative log-likelihood (NLL) for top and bottom prior measured by\nencoding train and validation set respectively, as well as Mean Squared Error for train and validation set.\nThe small difference in both NLL and MSE suggests that neither the prior network nor the VQ-VAE\noverfits.\n\n5.1 Modeling High-Resolution Face Images\n\nTo further assess the effectiveness of our multi-scale approach for capturing extremely long range\ndependencies in the data, we train a three level hierarchical model over the FFHQ dataset [15]\nat 1024 \u00d7 1024 resolution. This dataset consists of 70000 high-quality human portraits with a\nconsiderable diversity in gender, skin colour, age, poses and attires. Although modeling faces is\ngenerally considered less difficult compared to ImageNet, at such a high resolution there are also\nunique modeling challenges that can probe generative models in interesting ways. 
For example, the\nsymmetries that exist in faces require models capable of capturing long range dependencies: a model\nwith a restricted receptive field may choose plausible colours for each eye separately, but can miss\nthe strong correlation between the two eyes that lie several hundred pixels apart from one another,\nyielding samples with mismatching eye colours.\n\n5.2 Quantitative Evaluation\n\nIn this section, we report the results of our quantitative evaluations based on several metrics aiming\nto measure the quality as well as diversity of our samples.\n\n5.2.1 Negative Log-Likelihood and Reconstruction Error\n\nOne of the chief motivations to use likelihood based generative models is that negative log likelihood\n(NLL) on the test and training sets gives an objective measure for generalization and allows us to\nmonitor for overfitting. We emphasize that other commonly used performance metrics such as FID\nand Inception Score completely ignore the issue of generalization; a model that simply memorizes the\ntraining data can obtain a perfect score on these metrics. The same issue also applies to some recently\nproposed metrics such as Precision-Recall [32, 19] and Classification Accuracy Scores [28]. These\nsample-based metrics only provide a proxy for the quality and diversity of samples, but are oblivious\nto generalization to held-out images. Note that the NLL values for our top and bottom priors, reported\nin Table 1, are close for training and validation, indicating that neither of these networks overfits. We\nnote that these NLL values are only comparable between prior models that use the same pretrained\nVQ-VAE encoder and decoder.\n\n5.2.2 Precision - Recall Metric\n\nPrecision and Recall metrics are proposed as an alternative to FID and Inception score for evaluating\nthe performance of GANs [32, 19]. These metrics aim to explicitly quantify the trade-off between\ncoverage (recall) and quality (precision). 
We compare samples from our model to those obtained\nfrom BigGAN-deep using the improved version of precision-recall with the same procedure outlined\nin [19] for all 1000 classes in ImageNet. Fig. 7b shows the Precision-Recall results for VQ-VAE and\nBigGAN with the classifier based rejection sampling (\u2018critic\u2019, see section 3.3) for various rejection\nrates and the BigGAN-deep results for different levels of truncation. VQ-VAE results in slightly lower\nlevels of precision, but higher values for recall.\n\nFootnote 4: Samples are taken from BigGAN\u2019s colab notebook in TensorFlow hub: https://tfhub.dev/deepmind/biggan-deep-256/1\n\nFigure 4: Class conditional random samples. Classes from the top row are: 108 sea anemone, 109\nbrain coral, 114 slug, 11 goldfinch, 130 flamingo, 141 redshank, 154 Pekinese, 157 papillon, 97\ndrake, and 28 spotted salamander.\n\nFigure 5: Sample diversity comparison between VQ-VAE-2 and BigGAN-deep for Tinca-Tinca (1st\nImageNet class) and Ostrich (10th ImageNet class). Panel labels: VQ-VAE, BigGAN deep, VQ-VAE, BigGAN deep.\nBigGAN was sampled with 1.0 truncation to\nyield maximum diversity. Several kinds of samples such as top view of the fish or different kinds\nof poses (e.g., close up ostrich) are absent from BigGAN\u2019s samples. Please zoom into the pdf for\ninspecting the details and refer to the Supplementary material for comparison on more classes.\n\nFigure 6: Representative samples from the 3-level hierarchical model trained on FFHQ-1024 \u00d7 1024.\nSamples capture long-range dependencies such as matching eye colour or symmetric facial features,\nwhile covering lower density data distribution modes such as green hair. 
See the supplementary\nmaterial for more samples, including full resolution samples.\n\n5.3 Classification Accuracy Score\n\nWe also evaluate our method using the recently proposed Classification Accuracy Score (CAS) [28],\nwhich requires training an ImageNet classifier only on samples from the candidate model, but then\nevaluates its classification accuracy on real images from the test set, thus measuring sample quality\nand diversity. The results of our evaluation with this metric are reported in Table 2. In the case of\nVQ-VAE, the ImageNet classifier is only trained on samples, which lack high frequency signal, noise,\netc. (due to compression). Evaluating the classifier on VQ-VAE reconstructions of the test images\ncloses the \u201cdomain gap\u201d and improves the CAS score without need for retraining the classifier.\n\nModel | Top-1 Accuracy | Top-5 Accuracy\nBigGAN deep | 42.65 | 65.92\nVQ-VAE | 54.83 | 77.59\nVQ-VAE after reconstructing | 58.74 | 80.98\nReal data | 73.09 | 91.47\n\nTable 2: Classification Accuracy Score (CAS) [28] for the real dataset, BigGAN-deep and our model.\n\nFigure 7: Quantitative Evaluation of Diversity-Quality trade-off with FID/IS and Precision/Recall.\n(a) Inception Scores (IS) [34] and Fr\u00e9chet Inception Distance scores (FID) [13].\n(b) Precision - Recall metrics [32, 19].\n\n5.3.1 FID and Inception Score\n\nThe two most common metrics for comparing GANs are Inception Score [34] and Fr\u00e9chet Inception\nDistance (FID) [13]. Noting that there are several known drawbacks to these metrics [3, 32, 19],\nwe report our results in Fig. 7a. We use the classifier-based rejection sampling for trading off\ndiversity with quality (Section 3.3). For VQ-VAE this improves both IS and FID scores, with the\nFID going from \u223c30 to \u223c10. For BigGAN-deep the rejection sampling (referred to as critic)\nworks better than the truncation method proposed in the BigGAN paper [5]. 
We observe that the\ninception classifier is quite sensitive to even the slightest blurriness or other perturbations introduced in\nthe VQ-VAE reconstructions, as shown by an FID \u223c 10 instead of \u223c 2 when simply compressing\nthe originals. We therefore also compute the FID between VQ-VAE samples and the reconstructions\n(which we denote as FID*) showing that the inception network statistics are much closer to real\nimage data than what the FID would otherwise suggest.\n\n6 Conclusion\n\nWe propose a simple method for generating diverse high resolution images using VQ-VAE with\na powerful autoregressive model as prior. Our encoder and decoder architectures are kept simple\nand light-weight as in the original VQ-VAE, with the only difference that we use hierarchical\nmulti-scale latent maps for increased resolution. The fidelity of our best class conditional samples\nis competitive with the state of the art Generative Adversarial Networks, with broader diversity\nin several classes, contrasting our method against the known limitations of GANs. Still, concrete\nmeasures of sample quality and diversity are in their infancy, and visual inspection is still necessary.\nLastly, we believe our experiments vindicate autoregressive modeling in the latent space as a simple\nand effective objective for learning large scale generative models.\n\nAcknowledgments\n\nWe would like to thank Suman Ravuri, Jeff Donahue, Sander Dieleman, Jeffrey Defauw, Danilo J.\nRezende, Karen Simonyan and Andy Brock for their help and feedback.\n\nReferences\n[1] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative\n\nadversarial networks for extreme learned image compression. CoRR, abs/1804.02958, 2018.\n\n[2] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, and Augustus Odena. Discriminator\n\nrejection sampling. In International Conference on Learning Representations, 2019.\n\n[3] Shane Barratt and Rishi Sharma. 
A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.\n[4] M. Bauer and A. Mnih. Resampled priors for variational autoencoders. In 22nd International Conference\n\non Arti\ufb01cial Intelligence and Statistics, April 2019.\n\n[5] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high \ufb01delity natural\n\nimage synthesis. In International Conference on Learning Representations, 2019.\n\n[6] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever,\n\nand Pieter Abbeel. Variational Lossy Autoencoder. In Iclr, pages 1\u201314, nov 2016.\n\n[7] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An Improved Autoregres-\n\nsive Generative Model. pages 12\u201317, 2017.\n\n[8] Sander Dieleman, Aaron van den Oord, and Karen Simonyan. The challenge of realistic music generation:\nmodelling raw audio at scale. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and\nR. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7989\u20137999. Curran\nAssociates, Inc., 2018.\n\n[9] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation.\n\narXiv preprint arXiv:1410.8516, 2014.\n\n[10] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint\n\narXiv:1605.08803, 2016.\n\n[11] Jeffrey De Fauw, Sander Dieleman, and Karen Simonyan. Hierarchical autoregressive image models with\n\nauxiliary decoders. CoRR, abs/1903.04933, 2019.\n\n[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron\nCourville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing\nsystems, pages 2672\u20132680, 2014.\n\n[13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 
Gans\ntrained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. V. Luxburg,\nS. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information\nProcessing Systems 30, pages 6626\u20136637. Curran Associates, Inc., 2017.\n\n[14] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial\n\nnetworks. arXiv preprint arXiv:1812.04948, 2018.\n\n[15] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial\n\nnetworks. arXiv preprint arXiv:1812.04948, 2018.\n\n[16] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.\n[17] Durk P Kingma and Prafulla Dhariwal. Glow: Generative \ufb02ow with invertible 1x1 convolutions. In\n\nAdvances in Neural Information Processing Systems, pages 10236\u201310245, 2018.\n\n[18] Alexander Kolesnikov and Christoph H Lampert. Pixelcnn models with auxiliary variables for natural\nimage modeling. In Proceedings of the 34th International Conference on Machine Learning-Volume 70,\npages 1905\u20131914. JMLR. org, 2017.\n\n[19] Tuomas Kynk\u00e4\u00e4nniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision\n\nand recall metric for assessing generative models. CoRR, abs/1904.06991, 2019.\n\n[20] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the\n\nFourteenth International Conference on Arti\ufb01cial Intelligence and Statistics, pages 29\u201337, 2011.\n\n[21] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes\nTotz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative\nadversarial network. CoRR, abs/1609.04802, 2016.\n\n[22] Jacob Menick and Nal Kalchbrenner. Generating high \ufb01delity images with subscale pixel networks and\n\nmultidimensional upscaling. 
In International Conference on Learning Representations, 2019.\n\n[23] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Practical full resolution learned lossless image compression. CoRR, abs/1811.12817, 2018.\n\n[24] David Minnen, Johannes Ball\u00e9, and George D Toderici. Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems 31, pages 10771\u201310780. 2018.\n\n[25] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.\n\n[26] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, \u0141ukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image Transformer. 2018.\n\n[27] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.\n\n[28] Suman Ravuri and Oriol Vinyals. Classification accuracy score for conditional generative models. arXiv preprint arXiv:1905.10887, 2019.\n\n[29] Scott Reed, A\u00e4ron van den Oord, Nal Kalchbrenner, Sergio G\u00f3mez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, and Nando de Freitas. Parallel multiscale autoregressive density estimation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2912\u20132921. JMLR.org, 2017.\n\n[30] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.\n\n[31] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. 32, 2014.\n\n[32] Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. 
In Advances in Neural Information Processing Systems, pages 5234\u20135243, 2018.\n\n[33] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234\u20132242, 2016.\n\n[34] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2234\u20132242. Curran Associates, Inc., 2016.\n\n[35] Shibani Santurkar, David M. Budden, and Nir Shavit. Generative compression. CoRR, abs/1703.01467, 2017.\n\n[36] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. CoRR, abs/1611.02200, 2016.\n\n[37] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations, Apr 2016.\n\n[38] Lucas Theis and Matthias Bethge. Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems, pages 1927\u20131935, 2015.\n\n[39] A\u00e4ron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016.\n\n[40] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel Recurrent Neural Networks. In International Conference on Machine Learning, volume 48, pages 1747\u20131756, 2016.\n\n[41] A\u00e4ron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. CoRR, abs/1711.00937, 2017.\n\n[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. (NIPS), 2017.\n\n[43] Gregory K Wallace. 
The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii\u2013xxxiv, 1992.\n\n[44] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.", "award": [], "sourceid": 8437, "authors": [{"given_name": "Ali", "family_name": "Razavi", "institution": "DeepMind"}, {"given_name": "Aaron", "family_name": "van den Oord", "institution": "Google Deepmind"}, {"given_name": "Oriol", "family_name": "Vinyals", "institution": "Google DeepMind"}]}