{"title": "Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1486, "page_last": 1494, "abstract": "In this paper we introduce a generative model capable of producing high quality samples of natural images. Our approach uses a cascade of convolutional networks (convnets) within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion. At each level of the pyramid a separate generative convnet model is trained using the Generative Adversarial Nets (GAN) approach. Samples drawn from our model are of significantly higher quality than existing models. In a quantitative assessment by human evaluators our CIFAR10 samples were mistaken for real images around 40% of the time, compared to 10% for GAN samples. We also show samples from more diverse datasets such as STL10 and LSUN.", "full_text": "Deep Generative Image Models using a\n\nLaplacian Pyramid of Adversarial Networks\n\nEmily Denton\u2217\n\nDept. of Computer Science\n\nCourant Institute\n\nNew York University\n\nSoumith Chintala\u2217\n\nArthur Szlam\nFacebook AI Research\n\nNew York\n\nRob Fergus\n\nAbstract\n\nIn this paper we introduce a generative parametric model capable of producing\nhigh quality samples of natural images. Our approach uses a cascade of convo-\nlutional networks within a Laplacian pyramid framework to generate images in\na coarse-to-\ufb01ne fashion. At each level of the pyramid, a separate generative con-\nvnet model is trained using the Generative Adversarial Nets (GAN) approach [11].\nSamples drawn from our model are of signi\ufb01cantly higher quality than alternate\napproaches. In a quantitative assessment by human evaluators, our CIFAR10 sam-\nples were mistaken for real images around 40% of the time, compared to 10% for\nsamples drawn from a GAN baseline model. 
We also show samples from models\ntrained on the higher resolution images of the LSUN scene dataset.\n\n1 Introduction\nBuilding a good generative model of natural images has been a fundamental problem within com-\nputer vision. However, images are complex and high dimensional, making them hard to model well,\ndespite extensive efforts. Given the dif\ufb01culties of modeling entire scenes at high resolution, most\nexisting approaches instead generate image patches. In contrast, we propose an approach that is\nable to generate plausible looking scenes at 32 \u00d7 32 and 64 \u00d7 64. To do this, we exploit the multi-\nscale structure of natural images, building a series of generative models, each of which captures\nimage structure at a particular scale of a Laplacian pyramid [1]. This strategy breaks the original\nproblem into a sequence of more manageable stages. At each scale we train a convolutional network-\nbased generative model using the Generative Adversarial Networks (GAN) approach of Goodfellow\net al. [11]. Samples are drawn in a coarse-to-\ufb01ne fashion, commencing with a low-frequency resid-\nual image. The second stage samples the band-pass structure at the next level, conditioned on the\nsampled residual. Subsequent levels continue this process, always conditioning on the output from\nthe previous scale, until the \ufb01nal level is reached. Thus drawing samples is an ef\ufb01cient and straight-\nforward procedure: taking random vectors as input and running forward through a cascade of deep\nconvolutional networks (convnets) to produce an image.\nDeep learning approaches have proven highly effective at discriminative tasks in vision, such as\nobject classi\ufb01cation [4]. However, the same level of success has not been obtained for generative\ntasks, despite numerous efforts [14, 26, 30]. 
Against this background, our proposed approach makes\na signi\ufb01cant advance in that it is straightforward to train and sample from, with the resulting samples\nshowing a surprising level of visual \ufb01delity.\n\n\u2217denotes equal contribution.\n\n1.1 Related Work\nGenerative image models are well studied, falling into two main approaches: non-parametric and\nparametric. The former copy patches from training images to perform, for example, texture synthesis\n[7] or super-resolution [9]. More ambitiously, entire portions of an image can be in-painted, given a\nsuf\ufb01ciently large training dataset [13]. Early parametric models addressed the easier problem of texture\nsynthesis [3, 33, 22], with Portilla & Simoncelli [22] making use of a steerable pyramid wavelet\nrepresentation [27], similar to our use of a Laplacian pyramid. For image processing tasks, models\nbased on marginal distributions of image gradients are effective [20, 25], but are only designed for\nimage restoration rather than being true density models (so cannot sample an actual image). Very\nlarge Gaussian mixture models [34] and sparse coding models of image patches [31] can also be\nused but suffer the same problem.\nA wide variety of deep learning approaches involve generative parametric models. Restricted Boltz-\nmann machines [14, 18, 21, 23], Deep Boltzmann machines [26, 8], and Denoising auto-encoders [30]\nall have a generative decoder that reconstructs the image from the latent representation. Variational\nauto-encoders [16, 24] provide a probabilistic interpretation which facilitates sampling. However, for\nall these methods convincing samples have only been shown on simple datasets such as MNIST\nand NORB, possibly due to training complexities which limit their applicability to larger and more\nrealistic images.\nSeveral recent papers have proposed novel generative models. Dosovitskiy et al. [6] showed how a\nconvnet can draw chairs with different shapes and viewpoints. 
While our model also makes use of\nconvnets, it is able to sample general scenes and objects. The DRAW model of Gregor et al. [12]\nused an attentional mechanism with an RNN to generate images via a trajectory of patches, showing\nsamples of MNIST and CIFAR10 images. Sohl-Dickstein et al. [28] use a diffusion-based process\nfor deep unsupervised learning and the resulting model is able to produce reasonable CIFAR10 sam-\nples. Theis and Bethge [29] employ LSTMs to capture spatial dependencies and show convincing\ninpainting results of natural textures.\nOur work builds on the GAN approach of Goodfellow et al. [11] which works well for smaller\nimages (e.g. MNIST) but cannot directly handle large ones, unlike our method. Most relevant to our\napproach is the preliminary work of Mirza and Osindero [19] and Gauthier [10] who both propose\nconditional versions of the GAN model. The former shows MNIST samples, while the latter focuses\nsolely on frontal face images. Our approach also uses several forms of conditional GAN model but\nis much more ambitious in its scope.\n2 Approach\nThe basic building block of our approach is the generative adversarial network (GAN) of Goodfellow\net al. [11]. After reviewing this, we introduce our LAPGAN model which integrates a conditional\nform of GAN model into the framework of a Laplacian pyramid.\n2.1 Generative Adversarial Networks\nThe GAN approach [11] is a framework for training generative models, which we brie\ufb02y explain in\nthe context of image data. The method pits two networks against one another: a generative model G\nthat captures the data distribution and a discriminative model D that distinguishes between samples\ndrawn from G and images drawn from the training data. In our approach, both G and D are convo-\nlutional networks. The former takes as input a noise vector z drawn from a distribution pNoise(z) and\noutputs an image \u02dch. 
The discriminative network D takes as input an image chosen stochastically\n(with equal probability) to be either \u02dch \u2013 as generated from G, or h \u2013 a real image drawn from the\ntraining data pData(h). D outputs a scalar probability, which is trained to be high if the input was\nreal and low if generated from G. A minimax objective is used to train both models together:\n\nmin_G max_D E_{h\u223cpData(h)}[log D(h)] + E_{z\u223cpNoise(z)}[log(1 \u2212 D(G(z)))]\n\n(1)\n\nThis encourages G to \ufb01t pData(h) so as to fool D with its generated samples \u02dch. Both G and D are\ntrained by backpropagating the loss in Eqn. 1 through both models to update the parameters.\nThe conditional generative adversarial net (CGAN) is an extension of the GAN where both networks\nG and D receive an additional vector of information l as input. This might contain, say, information\nabout the class of the training example h. The loss function thus becomes\n\nmin_G max_D E_{h,l\u223cpData(h,l)}[log D(h, l)] + E_{z\u223cpNoise(z),l\u223cpl(l)}[log(1 \u2212 D(G(z, l), l))]\n\n(2)\n\nwhere pl(l) is, for example, the prior distribution over classes. This model allows the output of\nthe generative model to be controlled by the conditioning variable l. Mirza and Osindero [19] and\nGauthier [10] both explore this model with experiments on MNIST and faces, using l as a class\nindicator. In our approach, l will be another image, generated from another CGAN model.\n\n2.2 Laplacian Pyramid\nThe Laplacian pyramid [1] is a linear invertible image representation consisting of a set of band-pass\nimages, spaced an octave apart, plus a low-frequency residual. Formally, let d(.) be a downsampling\noperation which blurs and decimates a j \u00d7 j image I, so that d(I) is a new image of size j/2 \u00d7 j/2.\nAlso, let u(.) be an upsampling operator which smooths and expands I to be twice the size, so u(I)\nis a new image of size 2j \u00d7 2j. 
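A minimal numpy sketch of the d(.) and u(.) operators defined above (a hedged illustration: a 2\u00d72 box average stands in for the Gaussian blur of Burt & Adelson, and nearest-neighbour expansion stands in for the smoothed expand; function names are ours, not the paper's code):

```python
import numpy as np

def d(img):
    """Blur and decimate: a j x j image becomes j/2 x j/2.
    A 2x2 box average is used here for simplicity; the paper's
    pyramid uses a Gaussian kernel, but any low-pass filter
    illustrates the role of d(.)."""
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def u(img):
    """Smooth and expand: a j x j image becomes 2j x 2j.
    Nearest-neighbour upsampling stands in for the smoothed expand."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

I = np.arange(16.0).reshape(4, 4)
assert d(I).shape == (2, 2)       # half the spatial extent
assert u(d(I)).shape == (4, 4)    # back to the original size
```

With these two operators, the Gaussian and Laplacian pyramids of Section 2.2 follow directly: repeated application of `d` gives the Gaussian levels, and `h_k = I_k - u(I_{k+1})` gives the band-pass coefficients.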
We \ufb01rst build a Gaussian pyramid G(I) = [I0, I1, . . . , IK], where\nI0 = I and Ik is the result of k repeated applications of d(.) to I, i.e. I2 = d(d(I)). K is the number of levels in\nthe pyramid, selected so that the \ufb01nal level has very small spatial extent (\u2264 8 \u00d7 8 pixels).\nThe coef\ufb01cients hk at each level k of the Laplacian pyramid L(I) are constructed by taking the\ndifference between adjacent levels in the Gaussian pyramid, upsampling the smaller one with u(.)\nso that the sizes are compatible:\n\nhk = Lk(I) = Gk(I) \u2212 u(Gk+1(I)) = Ik \u2212 u(Ik+1)\n\n(3)\n\nIntuitively, each level captures image structure present at a particular scale. The \ufb01nal level of the\nLaplacian pyramid hK is not a difference image, but a low-frequency residual equal to the \ufb01nal\nGaussian pyramid level, i.e. hK = IK. Reconstruction from the Laplacian pyramid coef\ufb01cients\n[h0, . . . , hK] is performed using the backward recurrence:\n\nIk = u(Ik+1) + hk\n\n(4)\n\nwhich is started with IK = hK, the reconstructed image being I = I0. In other words, starting\nat the coarsest level, we repeatedly upsample and add the difference image h at the next \ufb01ner level\nuntil we get back to the full resolution image.\n2.3 Laplacian Generative Adversarial Networks (LAPGAN)\nOur proposed approach combines the conditional GAN model with a Laplacian pyramid represen-\ntation. The model is best explained by \ufb01rst considering the sampling procedure. Following training\n(explained below), we have a set of generative convnet models {G0, . . . , GK}, each of which cap-\ntures the distribution of coef\ufb01cients hk for natural images at a different level of the Laplacian pyra-\nmid. Sampling an image is akin to the reconstruction procedure in Eqn. 
4, except that the generative\nmodels are used to produce the hk\u2019s:\n\n\u02dcIk = u( \u02dcIk+1) + \u02dchk = u( \u02dcIk+1) + Gk(zk, u( \u02dcIk+1))\n\n(5)\n\nThe recurrence starts by setting \u02dcIK+1 = 0 and using the model at the \ufb01nal level GK to generate a\nresidual image \u02dcIK using noise vector zK: \u02dcIK = GK(zK). Note that models at all levels except the\n\ufb01nal are conditional generative models that take an upsampled version of the current image \u02dcIk+1 as\na conditioning variable, in addition to the noise vector zk. Fig. 1 shows this procedure in action for\na pyramid with K = 3 using 4 generative models to sample a 64 \u00d7 64 image.\nThe generative models {G0, . . . , GK} are trained using the CGAN approach at each level of the\npyramid. Speci\ufb01cally, we construct a Laplacian pyramid from each training image I. At each level\nwe make a stochastic choice (with equal probability) to either (i) construct the coef\ufb01cients hk using\nthe standard procedure from Eqn. 3, or (ii) generate them using Gk:\n\n\u02dchk = Gk(zk, u(Ik+1))\n\n(6)\n\nFigure 1: The sampling procedure for our LAPGAN model. We start with a noise sample z3 (right side) and\nuse a generative model G3 to generate \u02dcI3. This is upsampled (green arrow) and then used as the conditioning\nvariable (orange arrow) l2 for the generative model at the next level, G2. Together with another noise sample\nz2, G2 generates a difference image \u02dch2 which is added to l2 to create \u02dcI2. This process repeats across two\nsubsequent levels to yield a \ufb01nal full resolution sample I0.\n\nFigure 2: The training procedure for our LAPGAN model. 
Starting with a 64\u00d764 input image I from our\ntraining set (top left): (i) we take I0 = I and blur and downsample it by a factor of two (red arrow) to produce\nI1; (ii) we upsample I1 by a factor of two (green arrow), giving a low-pass version l0 of I0; (iii) with equal\nprobability we use l0 to create either a real or a generated example for the discriminative model D0. In the real\ncase (blue arrows), we compute the high-pass h0 = I0 \u2212 l0 which is input to D0 that computes the probability of\nit being real vs generated. In the generated case (magenta arrows), the generative network G0 receives as input\na random noise vector z0 and l0. It outputs a generated high-pass image \u02dch0 = G0(z0, l0), which is input to\nD0. In both the real/generated cases, D0 also receives l0 (orange arrow). Optimizing Eqn. 2, G0 thus learns\nto generate realistic high-frequency structure \u02dch0 consistent with the low-pass image l0. The same procedure is\nrepeated at scales 1 and 2, using I1 and I2. Note that the models at each level are trained independently. At\nlevel 3, I3 is an 8\u00d78 image, simple enough to be modeled directly with a standard GAN, G3 & D3.\n\nNote that Gk is a convnet which uses a coarse scale version of the image lk = u(Ik+1) as an input,\nas well as noise vector zk. Dk takes as input hk or \u02dchk, along with the low-pass image lk (which is\nexplicitly added to hk or \u02dchk before the \ufb01rst convolution layer), and predicts if the image was real or\ngenerated. At the \ufb01nal scale of the pyramid, the low frequency residual is suf\ufb01ciently small that it\ncan be directly modeled with a standard GAN: \u02dchK = GK(zK) and DK only has hK or \u02dchK as input.\nThe framework is illustrated in Fig. 2.\nBreaking the generation into successive re\ufb01nements is the key idea in this work. 
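The coarse-to-fine sampling recurrence of Eqn. 5 can be sketched as follows (a hedged illustration, not the paper's Torch code: `generators` stands in for the trained {G0, ..., GK}, dummy generators replace real convnets, and upsampling is nearest-neighbour):

```python
import numpy as np

def upsample(img):
    # u(.): stand-in smoothed-expand (nearest-neighbour here)
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def sample_lapgan(generators):
    """Run the Eqn. 5 recurrence: I_K = G_K(z_K), then
    I_k = u(I_{k+1}) + G_k(z_k, u(I_{k+1})) down to full resolution.
    `generators[k](z, l)` returns a residual h_k the same size as l;
    `generators[-1](z, None)` returns the low-frequency residual."""
    K = len(generators) - 1
    z = np.random.uniform(-1, 1, size=100)       # noise prior used in the paper
    img = generators[K](z, None)                 # coarsest level: plain GAN
    for k in range(K - 1, -1, -1):
        l = upsample(img)                        # conditioning variable l_k
        z = np.random.uniform(-1, 1, size=l.size)
        img = l + generators[k](z, l)            # add generated band-pass h_k
    return img

# Dummy generators that just return zeros of the right shape:
gens = [lambda z, l: np.zeros_like(l) for _ in range(3)]
gens.append(lambda z, l: np.zeros((8, 8)))       # G_K on the 8x8 residual
assert sample_lapgan(gens).shape == (64, 64)     # 8 -> 16 -> 32 -> 64
```

This mirrors Fig. 1: one plain GAN at the coarsest 8\u00d78 level, then one conditional generator per octave up to full resolution.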
Note that we give\nup any \u201cglobal\u201d notion of \ufb01delity; we never make any attempt to train a network to discriminate\nbetween the output of a cascade and a real image and instead focus on making each step plausible.\nFurthermore, the independent training of each pyramid level has the advantage that it is far more\ndif\ufb01cult for the model to memorize training examples \u2013 a hazard when high capacity deep networks\nare used.\nAs described, our model is trained in an unsupervised manner. However, we also explore variants\nthat utilize class labels. This is done by adding a 1-hot vector c, indicating class identity, as another\nconditioning variable for Gk and Dk.\n3 Model Architecture & Training\nWe apply our approach to three datasets: (i) CIFAR10 [17] \u2013 32\u00d732 pixel color images of 10\ndifferent classes, 100k training samples with tight crops of objects; (ii) STL10 [2] \u2013 96\u00d796 pixel\ncolor images of 10 different classes, 100k training samples (we use the unlabeled portion of data);\nand (iii) LSUN [32] \u2013 \u223c10M images of 10 different natural scene types, downsampled to 64\u00d764\npixels.\nFor each dataset, we explored a variety of architectures for {Gk, Dk}. Model selection was\nperformed using a combination of visual inspection and a heuristic based on \u21132 error in pixel\nspace. The heuristic computes the error for a given validation image at level k in the pyramid\nas Lk(Ik) = min{zj} ||Gk(zj, u(Ik+1)) \u2212 hk||2 where {zj} is a large set of noise vectors, drawn\nfrom pNoise(z). In other words, the heuristic asks: are any of the generated residual images\nclose to the ground truth? Torch training and evaluation code, along with model speci\ufb01cation \ufb01les,\ncan be found at http://soumith.ch/eyescream/. For all models, the noise vector zk is\ndrawn from a uniform [-1,1] distribution.\n\n3.1 CIFAR10 and STL10\nInitial scale: This operates at 8 \u00d7 8 resolution, using densely connected nets for both GK & DK\nwith 2 hidden layers and ReLU non-linearities. DK uses Dropout and has 600 units/layer vs 1200\nfor GK. zK is a 100-d vector.\nSubsequent scales: For CIFAR10, we boost the training set size by taking four 28 \u00d7 28 crops from\nthe original images. Thus the two subsequent levels of the pyramid are 8 \u2192 14 and 14 \u2192 28. For\nSTL, we have 4 levels going from 8 \u2192 16 \u2192 32 \u2192 64 \u2192 96. For both datasets, Gk & Dk are\nconvnets with 3 and 2 layers, respectively (see [5]). The noise input zk to Gk is presented as a 4th\n\u201ccolor plane\u201d to low-pass lk, hence its dimensionality varies with the pyramid level. For CIFAR10,\nwe also explore a class conditional version of the model, where a vector c encodes the label. This is\nintegrated into Gk & Dk by passing it through a linear layer whose output is reshaped into a single\nplane feature map which is then concatenated with the 1st layer maps. The loss in Eqn. 2 is trained\nusing SGD with an initial learning rate of 0.02, decreased by a factor of (1 + 4 \u00d7 10\u22124) at each\nepoch. Momentum starts at 0.5, increasing by 0.0008 at each epoch up to a maximum of 0.8. Training\ntime depends on the model\u2019s size and pyramid level, with smaller models taking hours to train and\nlarger models taking up to a day.\n3.2 LSUN\nThe larger size of this dataset allows us to train a separate LAPGAN model for each of the scene\nclasses. The four subsequent scales 4 \u2192 8 \u2192 16 \u2192 32 \u2192 64 use a common architecture for Gk &\nDk at each level. Gk is a 5-layer convnet with {64, 368, 128, 224} feature maps and a linear output\nlayer. 7 \u00d7 7 \ufb01lters, ReLUs, batch normalization [15] and Dropout are used at each hidden layer. Dk\nhas 3 hidden layers with {48, 448, 416} maps plus a sigmoid output. 
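The \u21132 model-selection heuristic described in Section 3 can be sketched in numpy as follows (a hedged illustration under our own names and a toy generator; `G` stands in for any candidate generator taking noise plus the upsampled coarse image):

```python
import numpy as np

def l2_heuristic(G, h_k, l_k, n_draws=100, seed=0):
    """Approximate L_k(I_k) = min_{z_j} ||G(z_j, l_k) - h_k||_2 by
    drawing a large set of noise vectors and keeping the best match
    between the generated residual and the ground-truth residual h_k."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_draws):
        z = rng.uniform(-1.0, 1.0, size=100)   # noise prior from the paper
        err = np.linalg.norm(G(z, l_k) - h_k)
        best = min(best, err)
    return best

# Toy check: a "generator" that ignores z gives a fixed error.
h = np.ones((8, 8)); l = np.zeros((8, 8))
G = lambda z, l_k: l_k                         # always outputs zeros
assert np.isclose(l2_heuristic(G, h, l), 8.0)  # ||ones(8x8)||_2 = sqrt(64)
```

In practice the heuristic is evaluated per validation image and pyramid level, and used alongside visual inspection to choose among architectures.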
See [5] for full details. Note\nthat Gk and Dk are substantially larger than those used for CIFAR10 and STL, as afforded by the\nlarger training set.\n4 Experiments\nWe evaluate our approach using 3 different methods: (i) computation of log-likelihood on a held\nout image set; (ii) drawing sample images from the model; and (iii) a human subject experiment that\ncompares (a) our samples, (b) those of baseline methods and (c) real images.\n4.1 Evaluation of Log-Likelihood\nLike Goodfellow et al. [11], we are compelled to use a Gaussian Parzen window estimator to com-\npute log-likelihood, since there is no direct way of computing it using our model. Table 1 compares the\nlog-likelihood on a validation set for our LAPGAN model and a standard GAN using 50k samples\nfor each model (the Gaussian width \u03c3 was also tuned on the validation set). Our approach shows\na marginal gain over a GAN. However, we can improve the underlying estimation technique by\nleveraging the multi-scale structure of the LAPGAN model. This new approach computes a proba-\nbility at each scale of the Laplacian pyramid and combines them to give an overall image probability\n(see Appendix A in supplementary material for details). Our multi-scale Parzen estimate, shown in\nTable 1, produces a substantial gain over the traditional estimator.\nThe shortcomings of both estimators are readily apparent when compared to a simple Gaussian \ufb01t\nto the CIFAR10 training set. Even with added noise, the resulting model can obtain a far higher log-\nlikelihood than either the GAN or LAPGAN models, or other published models. More generally,\nlog-likelihood is problematic as a performance measure due to its sensitivity to the exact represen-\ntation used. 
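The Gaussian Parzen-window estimator used here can be sketched as follows (a hedged numpy illustration, names ours; in the experiments \u03c3 is tuned on a validation set and `samples` would be 50k flattened model samples):

```python
import numpy as np

def parzen_log_likelihood(samples, x, sigma):
    """Mean log-density of points x under an isotropic-Gaussian kernel
    density estimate centred on model `samples`.
    Both arguments are (N, D) arrays of flattened images."""
    d = samples.shape[1]
    # squared distances between every query point and every sample
    sq = ((x[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    log_kernel = -sq / (2.0 * sigma ** 2)
    # numerically stable log-mean-exp over the samples axis
    m = log_kernel.max(axis=1, keepdims=True)
    log_mean = m[:, 0] + np.log(np.exp(log_kernel - m).mean(axis=1))
    # Gaussian normalising constant in d dimensions
    log_norm = -0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return float((log_mean + log_norm).mean())

# Sanity check in 1-D: density at a sample point with sigma = 1
s = np.zeros((1, 1)); x = np.zeros((1, 1))
assert np.isclose(parzen_log_likelihood(s, x, 1.0), -0.5 * np.log(2 * np.pi))
```

The multi-scale variant in Table 1 applies the same idea per pyramid level and combines the per-scale probabilities, as detailed in Appendix A of the supplementary material.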
Small variations in the scaling, noise and resolution of the image (much less changing\nfrom RGB to YUV, or more substantive changes in input representation) result in wildly different\nscores, making fair comparisons to other methods dif\ufb01cult.\n\nModel | CIFAR10 (@32\u00d732) | STL10 (@32\u00d732)\nGAN [11] (Parzen window estimate) | -3617 \u00b1 353 | -3661 \u00b1 347\nLAPGAN (Parzen window estimate) | -3572 \u00b1 345 | -3563 \u00b1 311\nLAPGAN (multi-scale Parzen window estimate) | -1799 \u00b1 826 | -2906 \u00b1 728\n\nTable 1: Log-likelihood estimates for a standard GAN and our proposed LAPGAN model on CI-\nFAR10 and STL10 datasets. The mean and std. dev. are given in units of nats/image. Rows 1 and 2\nuse a Parzen-window approach at full-resolution, while row 3 uses our multi-scale Parzen-window\nestimator.\n\n4.2 Model Samples\nWe show samples from models trained on CIFAR10, STL10 and LSUN datasets. Additional sam-\nples can be found in the supplementary material [5]. Fig. 3 shows samples from our models trained\non CIFAR10. Samples from the class conditional LAPGAN are organized by class. Our reimple-\nmentation of the standard GAN model [11] produces slightly sharper images than those shown in the\noriginal paper. We attribute this improvement to the introduction of data augmentation. The LAP-\nGAN samples improve upon the standard GAN samples. They appear more object-like and have\nmore clearly de\ufb01ned edges. Conditioning on a class label improves the generations as evidenced\nby the clear object structure in the conditional LAPGAN samples. The quality of these samples\ncompares favorably with those from the DRAW model of Gregor et al. [12] and also Sohl-Dickstein\net al. [28]. The rightmost column of each image shows the nearest training example to the neighbor-\ning sample (in L2 pixel-space). This demonstrates that our model is not simply copying the input\nexamples.\nFig. 4(a) shows samples from our LAPGAN model trained on STL10. 
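The nearest-training-example check (L2 pixel space) used in Fig. 3 amounts to the following (a small numpy sketch; names ours):

```python
import numpy as np

def nearest_training_example(sample, train):
    """Return the training image closest to `sample` in L2 pixel space.
    `train` is an (N, D) array of flattened training images."""
    dists = np.linalg.norm(train - sample[None, :], axis=1)
    return train[int(np.argmin(dists))]

# Toy 2-pixel "images": the query is closest to the second row.
train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
nn = nearest_training_example(np.array([0.9, 1.1]), train)
assert (nn == np.array([1.0, 1.0])).all()
```

If the retrieved neighbour differs visibly from the sample, the model is not simply memorizing training images.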
Here, we lose clear ob-\nject shape but the samples remain sharp. Fig. 4(b) shows the generation chain for random STL10\nsamples.\nFig. 5 shows samples from LAPGAN models trained on three LSUN categories (tower, bedroom,\nchurch front). To the best of our knowledge, no other generative model has been able to produce\nsamples of this complexity. The substantial gain in quality over the CIFAR10 and STL10 samples is\nlikely due to the much larger LSUN training set which allows us to train bigger and deeper\nmodels. In supplemental material we show additional experiments probing the models, e.g. drawing\nmultiple samples using the same \ufb01xed 4 \u00d7 4 image, which illustrates the variation captured by the\nLAPGAN models.\n4.3 Human Evaluation of Samples\nTo obtain a quantitative measure of the quality of our samples, we asked 15 volunteers to participate\nin an experiment to see if they could distinguish our samples from real images. The subjects were\npresented with the user interface shown in Fig. 6(right) and shown, at random, one of four different types\nof image: samples drawn from three different GAN models trained on CIFAR10 ((i) LAPGAN, (ii)\nclass conditional LAPGAN and (iii) standard GAN [11]) and also real CIFAR10 images. After being\npresented with the image, the subject clicked the appropriate button to indicate if they believed the\nimage was real or generated. Since accuracy is a function of viewing time, we also randomly pick\nthe presentation time from one of 11 durations ranging from 50ms to 2000ms, after which a gray\nmask image is displayed. Before the experiment commenced, the subjects were shown examples of real\nimages from CIFAR10. After collecting \u223c10k samples from the volunteers, we plot in Fig. 6 the\nfraction of images believed to be real for the four different data sources, as a function of presentation\ntime. 
The curves show our models produce samples that are more realistic than those from the standard\nGAN [11].\n5 Discussion\nBy modifying the approach in [11] to better respect the structure of images, we have proposed a\nconceptually simple generative model that is able to produce high-quality sample images that are\nqualitatively better than other deep generative modeling approaches. While they exhibit reasonable\ndiversity, we cannot be sure that they cover the full data distribution. Hence our models could\npotentially be assigning low probability to parts of the manifold of natural images. Quantifying this\nis dif\ufb01cult, but could potentially be done via another human subject experiment. A key point in our\nwork is giving up any \u201cglobal\u201d notion of \ufb01delity, and instead breaking the generation into plausible\nsuccessive re\ufb01nements. We note that many other signal modalities have a multiscale structure that\nmay bene\ufb01t from a similar approach.\nAcknowledgements\nWe would like to thank the anonymous reviewers for their insightful and constructive comments.\nWe also thank Andrew Tulloch, Wojciech Zaremba and the FAIR Infrastructure team for useful\ndiscussions and support. Emily Denton was supported by an NSERC Fellowship.\n\nFigure 3: CIFAR10 samples: our class conditional CC-LAPGAN model, our LAPGAN model and\nthe standard GAN model of Goodfellow [11]. The yellow column shows the training set nearest\nneighbors of the samples in the adjacent column.\n\nFigure 4: STL10 samples: (a) Random 96\u00d796 samples from our LAPGAN model. 
(b) Coarse-to-\n\ufb01ne generation chain.\n\nFigure 5: 64 \u00d7 64 samples from three different LSUN LAPGAN models (top: tower, middle: bed-\nroom, bottom: church front)\n\nFigure 6: Left: Human evaluation of real CIFAR10 images (red) and samples from Goodfellow\net al. [11] (magenta), our LAPGAN (blue) and a class conditional LAPGAN (green). The error\nbars show \u00b11\u03c3 of the inter-subject variability. Around 40% of the samples generated by our class\nconditional LAPGAN model are realistic enough to fool a human into thinking they are real images.\nThis compares with \u2264 10% of images from the standard GAN model [11], but is still a lot lower\nthan the > 90% rate for real images. Right: The user-interface presented to the subjects.\n\nReferences\n[1] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31:532\u2013540,\n\n1983.\n\n[2] A. Coates, H. Lee, and A. Y. Ng. An analysis of single layer networks in unsupervised feature learning. In AISTATS, 2011.\n[3] J. S. De Bonet. Multiresolution sampling procedure for analysis and synthesis of texture images. In Proceedings of the 24th annual\n\nconference on Computer graphics and interactive techniques, pages 361\u2013368. ACM Press/Addison-Wesley Publishing Co., 1997.\n\n[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages\n\n248\u2013255. IEEE, 2009.\n\n[5] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a laplacian pyramid of adversarial networks:\n\nSupplementary material. 
http://soumith.ch/eyescream.\n\n[6] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. arXiv preprint\n\narXiv:1411.5928, 2014.\n\n[7] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, volume 2, pages 1033\u20131038. IEEE, 1999.\n[8] S. A. Eslami, N. Heess, C. K. Williams, and J. Winn. The shape boltzmann machine: a strong model of object shape. International\n\nJournal of Computer Vision, 107(2):155\u2013176, 2014.\n\n[9] W. T. Freeman, T. R. Jones, and E. C. Pasztor. Example-based super-resolution. Computer Graphics and Applications, IEEE, 22(2):56\u2013\n\n65, 2002.\n\n[10] J. Gauthier. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional\n\nNeural Networks for Visual Recognition, Winter semester 2014.\n\n[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets.\n\nIn NIPS, pages 2672\u20132680. 2014.\n\n[12] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623,\n\n2015.\n\n[13] J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (TOG), 26(3):4, 2007.\n[14] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504\u2013507, 2006.\n[15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint\n\narXiv:1502.03167v3, 2015.\n\n[16] D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014.\n[17] A. Krizhevsky. Learning multiple layers of features from tiny images. Masters Thesis, Department of Computer Science, University of\n\nToronto, 2009.\n\n[18] A. Krizhevsky, G. E. Hinton, et al. 
Factored 3-way restricted boltzmann machines for modeling natural images. In AISTATS, pages\n\n621\u2013628, 2010.\n\n[19] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.\n[20] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research,\n\n37(23):3311\u20133325, 1997.\n\n[21] S. Osindero and G. E. Hinton. Modeling image patches with a directed hierarchy of markov random \ufb01elds. In J. Platt, D. Koller, Y. Singer,\n\nand S. Roweis, editors, NIPS, pages 1121\u20131128. 2008.\n\n[22] J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coef\ufb01cients. International\n\nJournal of Computer Vision, 40(1):49\u201370, 2000.\n\n[23] M. Ranzato, V. Mnih, J. M. Susskind, and G. E. Hinton. Modeling natural images using gated MRFs. IEEE Transactions on Pattern\n\nAnalysis & Machine Intelligence, (9):2206\u20132222, 2013.\n\n[24] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and variational inference in deep latent gaussian models. arXiv\n\npreprint arXiv:1401.4082, 2014.\n\n[25] S. Roth and M. J. Black. Fields of experts: A framework for learning image priors. In CVPR, pages 860\u2013867, 2005.\n[26] R. Salakhutdinov and G. E. Hinton. Deep boltzmann machines. In AISTATS, pages 448\u2013455, 2009.\n[27] E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger. Shiftable multiscale transforms. IEEE Transactions on Information\n\nTheory, 38(2):587\u2013607, 1992.\n\n[28] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics.\n\nCoRR, abs/1503.03585, 2015.\n\n[29] L. Theis and M. Bethge. Generative image modeling using spatial LSTMs. In NIPS, 2015.\n[30] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. 
In\n\nICML, pages 1096\u20131103, 2008.\n\n[31] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan. Sparse representation for computer vision and pattern recognition.\n\nProceedings of the IEEE, 98(6):1031\u20131044, 2010.\n\n[32] Y. Zhang, F. Yu, S. Song, P. Xu, A. Seff, and J. Xiao. Large-scale scene understanding challenge. In CVPR Workshop, 2015.\n[33] S. C. Zhu, Y. Wu, and D. Mumford. Filters, random \ufb01elds and maximum entropy (frame): Towards a uni\ufb01ed theory for texture modeling.\n\nInternational Journal of Computer Vision, 27(2):107\u2013126, 1998.\n\n[34] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In ICCV, 2011.\n", "award": [], "sourceid": 903, "authors": [{"given_name": "Emily", "family_name": "Denton", "institution": "New York University"}, {"given_name": "Soumith", "family_name": "Chintala", "institution": "Facebook AI Research"}, {"given_name": "arthur", "family_name": "szlam", "institution": "Facebook"}, {"given_name": "Rob", "family_name": "Fergus", "institution": "Facebook AI Research"}]}