{"title": "IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 52, "page_last": 63, "abstract": "We present a novel introspective variational autoencoder (IntroVAE) model for synthesizing high-resolution photographic images. IntroVAE is capable of self-evaluating the quality of its generated samples and improving itself accordingly. Its inference and generator models are jointly trained in an introspective way. On one hand, the generator is required to reconstruct the input images from the noisy outputs of the inference model, as in normal VAEs. On the other hand, the inference model is encouraged to classify between the generated and real samples while the generator tries to fool it, as in GANs. These two famous generative frameworks are integrated in a simple yet efficient single-stream architecture that can be trained in a single stage. IntroVAE preserves the advantages of VAEs, such as stable training and nice latent manifold. Unlike most other hybrid models of VAEs and GANs, IntroVAE requires no extra discriminators, because the inference model itself serves as a discriminator to distinguish between the generated and real samples. 
Experiments demonstrate that our method produces high-resolution photo-realistic images (e.g., CELEBA images at \\(1024^{2}\\)), which are comparable to or better than the state-of-the-art GANs.", "full_text": "IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis\n\nHuaibo Huang, Zhihang Li, Ran He\u2217, Zhenan Sun, Tieniu Tan\n\n1School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China\n2Center for Research on Intelligent Perception and Computing, CASIA, Beijing, China\n3National Laboratory of Pattern Recognition, CASIA, Beijing, China\n4Center for Excellence in Brain Science and Intelligence Technology, CAS, Beijing, China\n\nhuaibo.huang@cripac.ia.ac.cn\n{zhihang.li, rhe, znsun, tnt}@nlpr.ia.ac.cn\n\nAbstract\n\nWe present a novel introspective variational autoencoder (IntroVAE) model for synthesizing high-resolution photographic images. IntroVAE is capable of self-evaluating the quality of its generated samples and improving itself accordingly. Its inference and generator models are jointly trained in an introspective way. On one hand, the generator is required to reconstruct the input images from the noisy outputs of the inference model, as in normal VAEs. On the other hand, the inference model is encouraged to classify between the generated and real samples while the generator tries to fool it, as in GANs. These two famous generative frameworks are integrated in a simple yet efficient single-stream architecture that can be trained in a single stage. IntroVAE preserves the advantages of VAEs, such as stable training and nice latent manifold. 
Unlike most other hybrid models of VAEs and GANs, IntroVAE requires no extra discriminators, because the inference model itself serves as a discriminator to distinguish between the generated and real samples. Experiments demonstrate that our method produces high-resolution photo-realistic images (e.g., CELEBA images at 1024^2), which are comparable to or better than the state-of-the-art GANs.\n\n1 Introduction\n\nIn recent years, many types of generative models such as autoregressive models [38, 37], variational autoencoders (VAEs) [20, 32], generative adversarial networks (GANs) [13], real-valued non-volume preserving (real NVP) transformations [7] and generative moment matching networks (GMMNs) [24] have been proposed and widely studied. They have achieved remarkable success in various tasks, such as unconditional or conditional image synthesis [22, 27], image-to-image translation [25, 46], image restoration [5, 17] and speech synthesis [12]. While each model has its own significant strengths and limitations, the two most prominent models are VAEs and GANs. VAEs are theoretically elegant and easy to train. They have nice manifold representations but produce very blurry images that lack details. GANs usually generate much sharper images but face challenges in training stability and sampling diversity, especially when synthesizing high-resolution images.\n\nMany techniques have been developed to address these challenges. LAPGAN [6] and StackGAN [41] train a stack of GANs within a Laplacian pyramid to generate high-resolution images in a coarse-to-fine manner. StackGAN-v2 [42] and HDGAN [43] adopt multi-scale discriminators in a tree-like structure. Some studies [11, 39] have trained a single generator with multiple discriminators to improve the image quality.\n\n\u2217Ran He is the corresponding author.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n
PGGAN [18] achieves the state of the art by training symmetric generators and discriminators progressively. As illustrated in Fig. 1(a) (A, B, C, and D show the above GANs respectively), most existing GANs require multi-scale discriminators to decompose the high-resolution task into a sequence of low-to-high resolution tasks, which increases the training complexity.\n\nIn addition, much effort has been devoted to combining the strengths of VAEs and GANs via hybrid models. VAE/GAN [23] imposes a discriminator on the data space to improve the quality of the results generated by VAEs. AAE [28] discriminates in the latent space to match the posterior to the prior distribution. ALI [10] and BiGAN [8] discriminate jointly in the data and latent space, while VEEGAN [35] uses additional constraints in the latent space. However, hybrid models usually have more complex network architectures (as illustrated in Fig. 1(b), A, B, C, and D show the above hybrid models respectively) and still lag behind GANs in image quality [18].\n\nTo alleviate this problem, we introduce an introspective variational autoencoder (IntroVAE), a simple yet efficient approach to training VAEs for photographic image synthesis. One of the reasons why samples from VAEs tend to be blurry could be that the training principle makes VAEs assign a high probability to training points, which cannot ensure that blurry points are assigned a low probability [14]. Motivated by this issue, we train VAEs in an introspective manner such that the model can self-estimate the differences between generated and real images. In the training phase, the inference model attempts to minimize the divergence of the approximate posterior from the prior for real data while maximizing it for the generated samples; the generator model attempts to mislead the inference model by minimizing the divergence of the generated samples. The model acts like a standard VAE for real data and acts like a GAN when handling generated samples. 
Compared to most VAE and GAN hybrid models, our version requires no extra discriminators, which reduces the complexity of the model. Another advantage of the proposed method is that it can generate high-resolution realistic images through a single-stream network in a single stage. The divergence objective is adversarially optimized along with the reconstruction error, which increases the difficulty of distinguishing between the generated and real images for the inference model, even at high resolution. This arrangement greatly improves the stability of the adversarial training. The reason could be that the instability of GANs is often due to the fact that the discriminator distinguishes the generated images from the training images too easily [18, 30].\n\nOur contribution is three-fold. i) We propose a new training technique for VAEs, which trains them in an introspective manner such that the model itself estimates the differences between the generated and real images without extra discriminators. ii) We propose a single-stream single-stage adversarial model for high-resolution photographic image synthesis, which is, to our knowledge, the first feasible method for GANs to generate high-resolution images in such a simple yet efficient manner, e.g., CELEBA images at 1024^2. 
iii) Experiments demonstrate that our method combines the strengths of GANs and VAEs, producing high-resolution photographic images comparable to those produced by the state-of-the-art GANs while preserving the advantages of VAEs, such as stable training and a nice latent manifold.\n\n2 Background\n\nAs our work is a specific hybrid model of VAEs and GANs, we start with a brief review of VAEs, GANs and their hybrid models.\n\nVariational Autoencoders (VAEs) consist of two networks: a generative network (Generator) p\u03b8(x|z) that samples the visible variables x given the latent variables z, and an approximate inference network (Encoder) q\u03c6(z|x) that maps the visible variables x to the latent variables z, which approximate a prior p(z). The objective of VAEs is to maximize the variational lower bound (or evidence lower bound, ELBO) of p\u03b8(x):\n\nlog p\u03b8(x) \u2265 E_{q\u03c6(z|x)}[log p\u03b8(x|z)] \u2212 DKL(q\u03c6(z|x)||p(z)). (1)\n\nThe main limitation of VAEs is that the generated samples tend to be blurry, which is often attributed to the limited expressiveness of the inference models, the injected noise and imperfect element-wise criteria such as the squared error [23, 45]. Although recent studies [4, 9, 21, 34, 45] have greatly improved the predicted log-likelihood, they still face challenges in generating high-resolution images.\n\n(a) Several GANs (b) Hybrid models\n\nFigure 1: Overviews of several typical GANs for high-resolution image generation and hybrid models of VAEs and GANs.\n\nGenerative Adversarial Networks (GANs) employ a two-player min-max game with two models: the generative model (Generator) G produces samples G(z) from the prior p(z) to confuse the discriminator D(x), while D(x) is trained to distinguish between the generated samples and the given training data. 
The training objective is\n\nmin_G max_D E_{x\u223cp_data(x)}[log D(x)] + E_{z\u223cp_z(z)}[log(1 \u2212 D(G(z)))]. (2)\n\nGANs are promising tools for generating sharp images, but they are difficult to train. The training process is usually unstable and is prone to mode collapse, especially when generating high-resolution images. Many methods [44, 1, 2, 15, 33] have been proposed to improve GANs in terms of training stability and sample variation. To synthesize high-resolution images, several studies have trained GANs in a Laplacian pyramid [6, 41] or a tree-like structure [42, 43] with multi-scale discriminators [11, 29, 39], mostly in a coarse-to-fine manner, including the state-of-the-art PGGAN [18].\n\nHybrid Models of VAEs and GANs usually consist of three components: an encoder and a decoder, as in autoencoders (AEs) or VAEs, to map between the latent space and the data space, and an extra discriminator to add an adversarial constraint on the latent space [28], the data space [23], or their joint space [8, 10, 35]. Recently, Ulyanov et al. [36] proposed adversarial generator-encoder networks (AGE), which share some similarity with ours in their two-component architecture, while the two models differ in many ways, such as the design of the inference models, the training objectives, and the divergence computations. Brock et al. [3] also proposed an introspective adversarial network (IAN) in which the encoder and discriminator share all layers except the last, and whose adversarial loss is a variation of the standard GAN loss. In addition, existing hybrid models, including AGE and IAN, still lag far behind GANs in generating high-resolution images, which is one of the focuses of our method.\n\n3 Approach\n\nIn this section, we train VAEs in an introspective manner such that the model can self-estimate the differences between the generated samples and the training data and then update itself to produce more realistic samples. 
To achieve this goal, one part of the model needs to discriminate the generated samples from the training data, and another part should mislead the former part, analogous to the generator and discriminator in GANs. Specifically, we select the approximate inference model (or encoder) of VAEs as the discriminator of GANs and the generator model of VAEs as the generator of GANs. In addition to performing adversarial learning like GANs, the inference and generator models are also expected to be trained jointly on the given training data to preserve the advantages of VAEs.\n\nThere are two components in the ELBO objective of VAEs, a log-likelihood (autoencoding) term LAE and a prior regularization term LREG, which are listed below in their negative versions:\n\nLAE = \u2212E_{q\u03c6(z|x)} log p\u03b8(x|z), (3)\n\nLREG = DKL(q\u03c6(z|x)||p(z)). (4)\n\nThe first term LAE is the reconstruction error in a probabilistic autoencoder, and the second term LREG regularizes the encoder by encouraging the approximate posterior q\u03c6(z|x) to match the prior p(z). In the following, we describe the proposed introspective VAE (IntroVAE) with the modified combination objective of these two terms.\n\n3.1 Adversarial distribution matching\n\nTo match the distribution of the generated samples with the true distribution of the given training data, we use the regularization term LREG as the adversarial training cost function. The inference model is trained to minimize LREG to encourage the posterior q\u03c6(z|x) of the real data x to match the prior p(z), and simultaneously to maximize LREG to encourage the posterior q\u03c6(z|G(z')) of the generated samples G(z') to deviate from the prior p(z), where z' is sampled from p(z). 
Conversely, the generator G is trained to produce samples G(z') that have a small LREG, such that the samples' posterior distribution approximately matches the prior distribution.\n\nGiven a data sample x and a generated sample G(z), we design two different losses, one to train the inference model E, and another to train the generator G:\n\nLE(x, z) = E(x) + [m \u2212 E(G(z))]+, (5)\n\nLG(z) = E(G(z)), (6)\n\nwhere E(x) = DKL(q\u03c6(z|x)||p(z)), [\u00b7]+ = max(0, \u00b7), and m is a positive margin. The above two equations form a min-max game between the inference model E and the generator G when E(G(z)) \u2264 m, i.e., minimizing LG for the generator G is equal to maximizing the second term of LE for the inference model E.\u2217\n\nFollowing the original GANs [14], we train the inference model E to minimize the quantity V(E, G) = \u222b_{x,z} LE(x, z) p_data(x) p_z(z) dx dz, and the generator G to minimize the quantity U(E, G) = \u222b_z LG(z) p_z(z) dz. In a non-parametric setting, i.e., when E and G are assumed to have infinite capacity, the following theorem shows that when the system reaches a Nash equilibrium (a saddle point) (E\u2217, G\u2217), the generator G\u2217 produces samples that are indistinguishable from the given training distribution, i.e., p_G\u2217 = p_data.\n\nTheorem 1. Assuming that no region exists where p_data(x) = 0, (E\u2217, G\u2217) forms a saddle point of the above system if and only if (a) p_G\u2217 = p_data and (b) E\u2217(x) = \u03b3, where \u03b3 \u2208 [0, m] is a constant.\n\nProof. 
See Appendix A.\n\nRelationships with other GANs To some degree, the proposed adversarial method appears to be similar to energy-based GANs (EBGAN) [44], which view the discriminator as an energy function that assigns low energies to regions of high data density and higher energies to the other regions. The proposed KL-divergence function can be considered a specific type of energy function that is computed by the inference model instead of by an extra autoencoder discriminator [44]. The architecture of our system is simpler, and the KL-divergence shows more promising properties than the reconstruction error [44], such as stable training for high-resolution images.\n\n3.2 Introspective variational inference\n\nAs demonstrated in the previous subsection, playing a min-max game between the inference model E and the generator G is a promising method for the model to align the generated and true distributions and thus produce visually realistic samples. However, training the model in this adversarial manner could still cause problems such as mode collapse and training instability, as in other GANs. As discussed above, we introduce IntroVAE to alleviate these problems by combining GANs with VAEs in an introspective manner.\n\nThe solution is surprisingly simple: we only need to combine the adversarial objectives in Eq. (5) and Eq. (6) with the ELBO objective of VAEs. The training objectives for the inference model E and the generator G can be reformulated as below:\n\nLE(x, z) = E(x) + [m \u2212 E(G(z))]+ + LAE(x), (7)\n\nLG(z) = E(G(z)) + LAE(x). (8)\n\nThe addition of the reconstruction error LAE builds a bridge between the inference model E and the generator G and results in a specific hybrid model of VAEs and GANs. 
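The per-sample objectives in Eq. (7) and Eq. (8) can be written out as a minimal sketch (illustrative function names; the KL terms E(x), E(G(z)) and the reconstruction error LAE are assumed to be precomputed scalars):

```python
def encoder_loss(kl_real, kl_fake, ae, m):
    """Eq. (7): L_E = E(x) + [m - E(G(z))]_+ + L_AE(x).

    kl_real: E(x), the KL term of a real sample.
    kl_fake: E(G(z)), the KL term of a generated sample.
    ae: the reconstruction error L_AE(x); m: the positive margin.
    """
    return kl_real + max(0.0, m - kl_fake) + ae

def generator_loss(kl_fake, ae):
    """Eq. (8): L_G = E(G(z)) + L_AE(x); G pulls the fake KL back down."""
    return kl_fake + ae
```

While E(G(z)) stays below the margin m, the hinge term is active, so the two losses pull E(G(z)) in opposite directions, which is the min-max game described above; once E pushes the fake KL above m, its hinge term vanishes and the adversarial pressure on that sample stops.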
For a data sample x from the training set, the objective of the proposed method collapses to the standard ELBO objective of VAEs, thus preserving the properties of VAEs; for a generated sample G(z), this objective creates a min-max game of GANs between E and G and makes G(z) more realistic.\n\n\u2217It should be noted that we use E to denote the inference model and E(x) to denote the KL-divergence function for representation convenience.\n\nRelationships with other hybrid models Compared to other hybrid models [28, 23, 8, 10, 35] of VAEs and GANs, which always use a discriminator to regularize the latent code and generated data individually or jointly, the proposed method adds prior regularization to both the latent space and the data space in an introspective manner. The first term in Eq. (7) (i.e., LREG in Eq. (4)) encourages the latent code of the training data to approximately follow the prior distribution. The adversarial part of Eq. (7) and Eq. (8) encourages the generated samples to have the same distribution as the training data. The inference model E and the generator G are trained both jointly and adversarially without extra discriminators.\n\nCompared to AGE [36], the major differences are three-fold. 1) AGE is designed in an autoencoder style, where the encoder has one output variable and no noise term is injected when reconstructing the input data. The proposed method follows the original VAEs in that the inference model has two output variables, i.e., \u00b5 and \u03c3, to utilize the reparameterization trick, i.e., z = \u00b5 + \u03c3 \u2299 \u03b5 where \u03b5 \u223c N(0, I). 2) AGE uses different reconstruction errors to regularize the encoder and generator respectively, while the proposed method uses the reconstruction error LAE to regularize both the encoder and the generator. 3) AGE computes the KL-divergence using batch-level statistics, i.e., mj and sj in Eq. 
(7) in [36], while we compute it using the two batch-independent outputs of the inference model, i.e., \u00b5 and \u03c3 in Eq. (9). For high-resolution image synthesis, the training batch size is usually limited to be very small, which may harm the performance of AGE but has little influence on ours. As AGE is trained on 64 \u00d7 64 images, we re-train AGE and find it hard to converge on 256 \u00d7 256 images; there is no improvement even when replacing AGE's network with ours.\n\n3.3 Training IntroVAE networks\n\nFollowing the original VAEs [20], we select the centered isotropic multivariate Gaussian N(0, I) as the prior p(z) over the latent variables. As illustrated in Fig. 2, the inference model E is designed to output two individual variables, \u00b5 and \u03c3, and thus the posterior q\u03c6(z|x) = N(z; \u00b5, \u03c3\u00b2). The input z of the generator G is sampled from N(z; \u00b5, \u03c3\u00b2) using the reparameterization trick: z = \u00b5 + \u03c3 \u2299 \u03b5 where \u03b5 \u223c N(0, I). In this setting, the KL-divergence LREG (i.e., E(x) in Eq. (7) and Eq. (8)), given N data samples, can be computed as below:\n\nLREG(z; \u00b5, \u03c3) = \u2212(1/2) \u2211_{i=1}^{N} \u2211_{j=1}^{Mz} (1 + log(\u03c3_ij\u00b2) \u2212 \u00b5_ij\u00b2 \u2212 \u03c3_ij\u00b2), (9)\n\nwhere Mz is the dimension of the latent code z.\n\nFor the reconstruction error LAE in Eq. (7) and Eq. (8), we choose the commonly-used pixel-wise mean squared error (MSE) function. Let xr be the reconstruction sample; LAE is defined as:\n\nLAE(x, xr) = (1/2) \u2211_{i=1}^{N} \u2211_{j=1}^{Mx} ||x_{r,ij} \u2212 x_ij||\u00b2_F, (10)\n\nwhere Mx is the dimension of the data x.\n\nSimilar to VAE/GAN [23], we train IntroVAE to discriminate real samples from both the model samples and reconstructions. As shown in Fig. 2, these two types of samples are the reconstruction samples xr and the new samples xp. 
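A small numerical sketch of Eq. (9) and Eq. (10), assuming a diagonal Gaussian posterior and NumPy arrays of shape (N, Mz) for \u00b5, \u03c3 and (N, Mx) for the images (illustrative code, not the authors' implementation):

```python
import numpy as np

def l_reg(mu, sigma):
    # Eq. (9): KL(N(mu, sigma^2) || N(0, I)), summed over the batch
    # and the latent dimensions.
    return -0.5 * np.sum(1.0 + np.log(sigma**2) - mu**2 - sigma**2)

def l_ae(x, x_rec):
    # Eq. (10): half of the summed pixel-wise squared error between
    # the input x and its reconstruction x_rec.
    return 0.5 * np.sum((x_rec - x) ** 2)

def reparameterize(mu, sigma, rng):
    # z = mu + sigma * eps with eps ~ N(0, I), so sampling stays
    # differentiable with respect to mu and sigma.
    return mu + sigma * rng.standard_normal(mu.shape)
```

For mu = 0 and sigma = 1 the KL term is exactly zero, which is the value the regularizer drives the posterior of real samples toward.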
When the KL-divergence objective of VAEs is adequately optimized, the posterior q\u03c6(z|x) matches the prior p(z) approximately and the two types of samples are similar to each other. The combined use of samples from p(z) and q\u03c6(z|x) is expected to provide a more useful signal for the model to learn a more expressive latent code and synthesize more realistic samples. The total loss functions for E and G are respectively redefined as:\n\nLE = LREG(z) + \u03b1 \u2211_{s=r,p} [m \u2212 LREG(z_s)]+ + \u03b2 LAE(x, x_r) = LREG(Enc(x)) + \u03b1 \u2211_{s=r,p} [m \u2212 LREG(Enc(ng(x_s)))]+ + \u03b2 LAE(x, x_r), (11)\n\nLG = \u03b1 \u2211_{s=r,p} LREG(Enc(x_s)) + \u03b2 LAE(x, x_r), (12)\n\nFigure 2: The architecture and training flow of IntroVAE. The left part shows that the model consists of two components, the inference model E and the generator G, in a circulation loop. The right part is the unrolled training flow of the proposed method.\n\nAlgorithm 1 Training IntroVAE model\n1: \u03b8_G, \u03c6_E \u2190 Initialize network parameters\n2: while not converged do\n3:   X \u2190 Random mini-batch from the dataset\n4:   Z \u2190 Enc(X)\n5:   Z_p \u2190 Samples from prior N(0, I)\n6:   X_r \u2190 Dec(Z), X_p \u2190 Dec(Z_p)\n7:   LAE \u2190 LAE(X_r, X)\n8:   Z_r \u2190 Enc(ng(X_r)), Z_pp \u2190 Enc(ng(X_p))\n9:   LE_adv \u2190 LREG(Z) + \u03b1{[m \u2212 LREG(Z_r)]+ + [m \u2212 LREG(Z_pp)]+}\n10:  \u03c6_E \u2190 \u03c6_E \u2212 \u03b7\u2207_{\u03c6_E}(LE_adv + \u03b2 LAE)  \u25b7 Perform Adam updates for \u03c6_E\n11:  Z_r \u2190 Enc(X_r), Z_pp \u2190 Enc(X_p)\n12:  LG_adv \u2190 \u03b1{LREG(Z_r) + LREG(Z_pp)}\n13:  \u03b8_G \u2190 \u03b8_G \u2212 \u03b7\u2207_{\u03b8_G}(LG_adv + \u03b2 LAE)  \u25b7 Perform Adam updates for \u03b8_G\n14: end while\n\nwhere ng(\u00b7) indicates that the back-propagation of the gradients is stopped at this point, Enc(\u00b7) represents the mapping function of E, and \u03b1 and \u03b2 are weighting parameters used to balance 
the importance of each term.\n\nThe networks of E and G are designed in a similar manner to other GANs [31, 18], except that E has two output variables with respect to \u00b5 and \u03c3. As shown in Algorithm 1, E and G are trained iteratively by updating E using LE to distinguish the real data X from the generated samples, X_r and X_p, and then updating G using LG to generate samples that are increasingly similar to the real data; these steps are repeated until convergence.\n\n4 Experiments\n\nIn this section, we conduct a set of experiments to evaluate the performance of the proposed method. We first introduce the experimental implementations and then discuss in detail the image quality, training stability and sample diversity of our method. We also investigate the learned manifold via interpolation in the latent space.\n\n4.1 Implementations\n\nDataset We consider three datasets, namely CelebA [26], CelebA-HQ [18] and LSUN BEDROOM [40]. The CelebA dataset consists of 202,599 celebrity images with large variations in facial attributes. Following the standard protocol of CelebA, we use 162,770 images for training, 19,867 for validation and 19,962 for testing. The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024 \u00d7 1024 resolution. The dataset is split into two sets: the first 
We adopt its whole training set of 3,033,042 images in our\nexperiments.\nNetwork architecture We design the inference and generator models of IntroVAE in a similar way\nto the discriminator and generator in PGGAN except of the use of residual blocks to accelerate the\ntraining convergence (see Appendix B for more details). Like other VAEs, the inference model has\ntwo output vectors, respectively representing the mean \u00b5 and the covariance \u03c32 in Eq. (9). For the\nimages at 1024 \u00d7 1024, the dimension of the the latent code is set to be 512 and the hyperparameters\nin Eq. (11) and Eq. (12) are set empirically to hold the training balance of the inference and generator\nmodels: m = 90 , \u03b1 = 0.25 and \u03b2 = 0.0025. For the images at 256 \u00d7 256, the latent dimension\nis 512, m = 120 , \u03b1 = 0.25 and \u03b2 = 0.05. For the images at 128 \u00d7 128, the latent dimension is\n256, m = 110 , \u03b1 = 0.25 and \u03b2 = 0.5. The key is to hold the regularization term LREG in Eq. (11)\nand Eq. (12) below the margin value m for most of the time. It is suggested to pre-train the model\nwith 1 \u223c 2 epochs in the original VAEs form (i.e., \u03b1 = 0) to \ufb01nd the appropriate con\ufb01guration of the\nhyper-parameters for different image sizes. More analyses and results for different hyper-parameters\nare provided in Appendix D.\nAs illustrated in Algorithm 1, the inference and generator models are trained iteratively using Adam\nalgorithm [19] (\u03b21 = 0.9, \u03b22 = 0.999) with a batch size of 8 and a \ufb01xed learning rate of 0.0002. An\nadditional illustration of the training \ufb02ow is provided in Appendix C.\n\n4.2 High quality image synthesis\nAs shown in Fig. 3, our method produces visually appealing high-resolution images of 1024 \u00d7 1024\nresolution both in reconstruction and sampling. The images in Fig. 3(c) are the reconstruction results\nof the original images in Fig. 3(a) from the CelebA-HQ testing set. 
Due to the training principle of VAEs, which injects random noise in the training phase, the reconstructed images cannot maintain exact pixel-wise similarity to the original images. In spite of this, our results preserve most of the global topology of the input images while achieving photographic quality in visual perception.\n\nWe also compare our sampling results against PGGAN [18], the state of the art in synthesizing high-resolution images. As illustrated in Fig. 3(d), our method is able to synthesize high-resolution, high-quality samples comparable with those of PGGAN, both of which are difficult to distinguish from the real images. While PGGAN is trained with symmetric generators and discriminators in a progressive multi-stage manner, our model is trained in a much simpler manner that iteratively trains a single inference model and a single generator in a single stage, like the original GANs [13]. The results of our method demonstrate that it is possible to synthesize very high-resolution images by training directly on high-resolution images, without decomposing the single task into multiple low-to-high resolution tasks. Additionally, we provide visual quality results on LSUN BEDROOM in Fig. 4, which further demonstrate that our method is capable of synthesizing high quality images comparable with PGGAN's. (More visual results on extra datasets are provided in Appendix F & G.)\n\n4.3 Training stability and speed\n\nFigure 5 illustrates the quality of the samples with regard to the loss functions, i.e., the reconstruction error LAE and the KL-divergences. It can be seen that the losses converge very quickly to a stable stage in which their values fluctuate slightly around a balance line. As described in Theorem 1, the prediction E(x) of the inference model reaches a constant \u03b3 in [0, m]. This is consistent with the curves in Fig. 
5: when approximately converged, the KL-divergence of real images stays around a constant value lower than m, while those of the reconstructed and sampled images fluctuate around m. Besides, the image quality of the samples improves steadily along with the training process.\n\nWe evaluate the training speed on CelebA images of various resolutions, i.e., 128 \u00d7 128, 256 \u00d7 256, 512 \u00d7 512 and 1024 \u00d7 1024. As illustrated in Tab. 1, the convergence time increases with the resolution, since the hardware limits the minibatch size at high resolutions.\n\n(a) Original (b) PGGAN [18] (c) Ours-Reconstructions (d) Ours-Samples\n\nFigure 3: Qualitative results of 1024 \u00d7 1024 images. (a) and (c) are the original and reconstruction images from the testing split, respectively. (b) and (d) are sample images of PGGAN (copied from the cited paper [18]) and our method, respectively. Best viewed by zooming in on the electronic version.\n\n(a) WGAN-GP [15] (128 \u00d7 128) (b) PGGAN [18] (256 \u00d7 256) (c) Ours (256 \u00d7 256)\n\nFigure 4: Qualitative comparison on LSUN BEDROOM. The images in (a) and (b) are copied from the cited papers [15, 18].\n\n4.4 Diversity analysis\n\nWe use two metrics to evaluate the sample diversity of our method, namely multi-scale structural similarity (MS-SSIM) [30] and Fr\u00e9chet Inception Distance (FID) [16]. MS-SSIM measures the similarity of two images, and FID measures the Fr\u00e9chet distance between two distributions in feature space. For a fair comparison with PGGAN, the MS-SSIM scores are computed among an average of\n\nTable 1: Training speed w.r.t. 
the image resolutions.\n\nResolution: 128 \u00d7 128 | 256 \u00d7 256 | 512 \u00d7 512 | 1024 \u00d7 1024\nMinibatch: 64 | 32 | 12 | 8\nTime (days): 0.5 | 1 | 7 | 21\n\nFigure 5: Illustration of the training process.\n\nTable 2: Quantitative comparison with two metrics: MS-SSIM and FID.\n\nMethod: WGAN-GP [15] | PGGAN [18] | Ours\nMS-SSIM (CELEBA): 0.2854 | 0.2828 | 0.2719\nMS-SSIM (LSUN BEDROOM): 0.0587 | 0.0636 | 0.0532\nFID (CELEBA-HQ): - | 7.30 | 5.19\nFID (LSUN BEDROOM): - | 8.34 | 8.84\n\n10K pairs of synthesized images at 128 \u00d7 128 for CelebA and LSUN BEDROOM, respectively. FID is computed from 50K images at 1024 \u00d7 1024 for CelebA-HQ and from 50K images at 256 \u00d7 256 for LSUN BEDROOM. As illustrated in Tab. 2, our method achieves comparable or better quantitative performance than PGGAN, which reflects the sample diversity to some degree. More visual results are provided in Appendix H to further demonstrate the diversity.\n\n4.5 Latent manifold analysis\n\nWe conduct interpolations of real images in the latent space to estimate the manifold continuity. For a pair of real images, we first map them to latent codes z using the inference model and then make linear interpolations between the codes. As illustrated in Fig. 6, our model demonstrates continuity in the latent space when interpolating from a male to a female or rotating a profile face. This manifold continuity verifies that the proposed model generalizes the image contents instead of simply memorizing them.\n\nFigure 6: Interpolations of real images in the latent space. The leftmost and rightmost are real images from the CelebA-HQ testing set, and the images immediately next to them are their reconstructions via our model. The rest are the interpolations. The images are compressed to save space.\n\n5 Conclusion\n\nWe have introduced introspective VAEs, a novel and simple approach to training VAEs for synthesizing high-resolution photographic images. 
The learning objective is to play a min-max game between the inference and generator models of VAEs. The inference model not only learns a nice latent manifold structure, but also acts as a discriminator that maximizes the divergence of the approximate posterior from the prior for the generated data. Thus, the proposed IntroVAE has an introspection capability: it can self-estimate the quality of the generated images and improve itself accordingly. Compared to other state-of-the-art methods, the proposed model is simpler and more efficient, using a single-stream network trained in a single stage, and it can synthesize high-resolution photographic images via a stable training process. Since our model has a standard VAE architecture, it may be easily extended to various VAE-related tasks, such as conditional image synthesis.

Acknowledgments

This work is partially funded by the State Key Development Program (Grant No. 2016YFB1001001) and the National Natural Science Foundation of China (Grant No. 61622310, 61427811).

References

[1] Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[2] Berthelot, David, Schumm, Tom, and Metz, Luke. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

[3] Brock, Andrew, Lim, Theodore, Ritchie, James M, and Weston, Nick. Neural photo editing with introspective adversarial networks. In ICLR, 2017.

[4] Chen, Xi, Kingma, Diederik P, Salimans, Tim, Duan, Yan, Dhariwal, Prafulla, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. Variational lossy autoencoder. In ICLR, 2017.

[5] Dahl, Ryan, Norouzi, Mohammad, and Shlens, Jonathon. Pixel recursive super resolution. In ICCV, 2017.

[6] Denton, Emily L, Chintala, Soumith, Fergus, Rob, et al.
Deep generative image models using a Laplacian pyramid of adversarial networks. In NeurIPS, pp. 1486–1494, 2015.

[7] Dinh, Laurent, Sohl-Dickstein, Jascha, and Bengio, Samy. Density estimation using Real NVP. In ICLR, 2017.

[8] Donahue, Jeff, Krähenbühl, Philipp, and Darrell, Trevor. Adversarial feature learning. In ICLR, 2017.

[9] Dosovitskiy, Alexey and Brox, Thomas. Generating images with perceptual similarity metrics based on deep networks. In NeurIPS, pp. 658–666, 2016.

[10] Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, Mastropietro, Olivier, Lamb, Alex, Arjovsky, Martin, and Courville, Aaron. Adversarially learned inference. In ICLR, 2017.

[11] Durugkar, Ishan, Gemp, Ian, and Mahadevan, Sridhar. Generative multi-adversarial networks. In ICLR, 2017.

[12] Gibiansky, Andrew, Arik, Sercan, Diamos, Gregory, Miller, John, Peng, Kainan, Ping, Wei, Raiman, Jonathan, and Zhou, Yanqi. Deep Voice 2: Multi-speaker neural text-to-speech. In NeurIPS, pp. 2966–2974, 2017.

[13] Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NeurIPS, pp. 2672–2680, 2014.

[14] Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

[15] Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, and Courville, Aaron C. Improved training of Wasserstein GANs. In NeurIPS, pp. 5769–5779, 2017.

[16] Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, and Hochreiter, Sepp. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, pp. 6626–6637, 2017.

[17] Huang, Huaibo, He, Ran, Sun, Zhenan, and Tan, Tieniu. Wavelet-SRNet: A wavelet-based CNN for multi-scale face super resolution. In ICCV, pp.
1689–1697, 2017.

[18] Karras, Tero, Aila, Timo, Laine, Samuli, and Lehtinen, Jaakko. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.

[19] Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In ICLR, 2015.

[20] Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. In ICLR, 2014.

[21] Kingma, Diederik P, Salimans, Tim, Jozefowicz, Rafal, Chen, Xi, Sutskever, Ilya, and Welling, Max. Improved variational inference with inverse autoregressive flow. In NeurIPS, pp. 4743–4751, 2016.

[22] Lample, Guillaume, Zeghidour, Neil, Usunier, Nicolas, Bordes, Antoine, Denoyer, Ludovic, et al. Fader networks: Manipulating images by sliding attributes. In NeurIPS, pp. 5969–5978, 2017.

[23] Larsen, Anders Boesen Lindbo, Sønderby, Søren Kaae, Larochelle, Hugo, and Winther, Ole. Autoencoding beyond pixels using a learned similarity metric. In ICML, pp. 1558–1566, 2016.

[24] Li, Yujia, Swersky, Kevin, and Zemel, Rich. Generative moment matching networks. In ICML, pp. 1718–1727, 2015.

[25] Liu, Ming-Yu, Breuel, Thomas, and Kautz, Jan. Unsupervised image-to-image translation networks. In NeurIPS, pp. 700–708, 2017.

[26] Liu, Ziwei, Luo, Ping, Wang, Xiaogang, and Tang, Xiaoou. Deep learning face attributes in the wild. In ICCV, pp. 3730–3738, 2015.

[27] Ma, Liqian, Jia, Xu, Sun, Qianru, Schiele, Bernt, Tuytelaars, Tinne, and Van Gool, Luc. Pose guided person image generation. In NeurIPS, pp. 405–415, 2017.

[28] Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, Goodfellow, Ian, and Frey, Brendan. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[29] Nguyen, Tu, Le, Trung, Vu, Hung, and Phung, Dinh. Dual discriminator generative adversarial nets. In NeurIPS, pp. 2667–2677, 2017.

[30] Odena, Augustus, Olah, Christopher, and Shlens, Jonathon.
Conditional image synthesis with auxiliary classifier GANs. In ICML, pp. 2642–2651, 2017.

[31] Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[32] Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278–1286, 2014.

[33] Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. In NeurIPS, pp. 2234–2242, 2016.

[34] Sønderby, Casper Kaae, Raiko, Tapani, Maaløe, Lars, Sønderby, Søren Kaae, and Winther, Ole. Ladder variational autoencoders. In NeurIPS, pp. 3738–3746, 2016.

[35] Srivastava, Akash, Valkoz, Lazar, Russell, Chris, Gutmann, Michael U, and Sutton, Charles. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In NeurIPS, pp. 3310–3320, 2017.

[36] Ulyanov, Dmitry, Vedaldi, Andrea, and Lempitsky, Victor. It takes (only) two: Adversarial generator-encoder networks. In AAAI, 2018.

[37] van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, et al. Conditional image generation with PixelCNN decoders. In NeurIPS, pp. 4790–4798, 2016.

[38] van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In ICML, pp. 1747–1756, 2016.

[39] Wang, Ting-Chun, Liu, Ming-Yu, Zhu, Jun-Yan, Tao, Andrew, Kautz, Jan, and Catanzaro, Bryan. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.

[40] Yu, Fisher, Seff, Ari, Zhang, Yinda, Song, Shuran, Funkhouser, Thomas, and Xiao, Jianxiong. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop.
arXiv preprint arXiv:1506.03365, 2015.

[41] Zhang, Han, Xu, Tao, Li, Hongsheng, Zhang, Shaoting, Huang, Xiaolei, Wang, Xiaogang, and Metaxas, Dimitris. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, pp. 5907–5915, 2017.

[42] Zhang, Han, Xu, Tao, Li, Hongsheng, Zhang, Shaoting, Wang, Xiaogang, Huang, Xiaolei, and Metaxas, Dimitris. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1710.10916v2, 2017.

[43] Zhang, Zizhao, Xie, Yuanpu, and Yang, Lin. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. arXiv preprint arXiv:1802.09178, 2018.

[44] Zhao, Junbo, Mathieu, Michael, and LeCun, Yann. Energy-based generative adversarial network. In ICLR, 2017.

[45] Zhao, Shengjia, Song, Jiaming, and Ermon, Stefano. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.

[46] Zhu, Jun-Yan, Zhang, Richard, Pathak, Deepak, Darrell, Trevor, Efros, Alexei A, Wang, Oliver, and Shechtman, Eli. Toward multimodal image-to-image translation. In NeurIPS, pp. 465–476, 2017.