{"title": "PixelGAN Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 1975, "page_last": 1985, "abstract": "In this paper, we describe the \"PixelGAN autoencoder\", a generative autoencoder in which the generative path is a convolutional autoregressive neural network on pixels (PixelCNN) that is conditioned on a latent code, and the recognition path uses a generative adversarial network (GAN) to impose a prior distribution on the latent code. We show that different priors result in different decompositions of information between the latent code and the autoregressive decoder. For example, by imposing a Gaussian distribution as the prior, we can achieve a global vs. local decomposition, or by imposing a categorical distribution as the prior, we can disentangle the style and content information of images in an unsupervised fashion. We further show how the PixelGAN autoencoder with a categorical prior can be directly used in semi-supervised settings and achieve competitive semi-supervised classification results on the MNIST, SVHN and NORB datasets.", "full_text": "PixelGAN Autoencoders\n\nAlireza Makhzani, Brendan Frey\n\nUniversity of Toronto\n\n{makhzani,frey}@psi.toronto.edu\n\nAbstract\n\nIn this paper, we describe the \u201cPixelGAN autoencoder\u201d, a generative autoencoder\nin which the generative path is a convolutional autoregressive neural network on\npixels (PixelCNN) that is conditioned on a latent code, and the recognition path\nuses a generative adversarial network (GAN) to impose a prior distribution on the\nlatent code. We show that different priors result in different decompositions of\ninformation between the latent code and the autoregressive decoder. For example,\nby imposing a Gaussian distribution as the prior, we can achieve a global vs. 
local\ndecomposition, or by imposing a categorical distribution as the prior, we can\ndisentangle the style and content information of images in an unsupervised fashion.\nWe further show how the PixelGAN autoencoder with a categorical prior can be\ndirectly used in semi-supervised settings and achieve competitive semi-supervised\nclassification results on the MNIST, SVHN and NORB datasets.\n\n1\n\nIntroduction\n\nIn recent years, generative models that can be trained via direct back-propagation have enabled\nremarkable progress in modeling natural images. One of the most successful models is the generative\nadversarial network (GAN) [1], which employs a two-player min-max game. The generative model,\nG, samples the prior p(z) and generates the sample G(z). The discriminator, D(x), is trained to\nidentify whether a point x is a sample from the data distribution or a sample from the generative\nmodel. The generator is trained to maximally confuse the discriminator into believing that generated\nsamples come from the data distribution. The cost function of the GAN is\n\nmin_G max_D E_{x~pdata}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))].\n\nGANs can be considered within the wider framework of implicit generative models [2, 3, 4]. Implicit\ndistributions can be sampled through their generative path, but their likelihood function is not\ntractable. Recently, several papers have proposed another application of GAN-style algorithms for\napproximate inference [2, 3, 4, 5, 6, 7, 8, 9]. These algorithms use implicit distributions to learn\nposterior approximations that are more expressive than the distributions with tractable densities that\nare often used in variational inference. For example, adversarial autoencoders [6] use a universal\napproximator posterior as the implicit posterior distribution and use adversarial training to match the\naggregated posterior of the latent code to the prior distribution. 
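To make the min-max objective concrete, the following sketch estimates the GAN value function from samples with a toy generator and discriminator; G, D and all their parameters here are hypothetical stand-ins, not the architectures used in the paper.

```python
import numpy as np

# Illustrative sketch: estimating the GAN value function
#   E_x[log D(x)] + E_z[log(1 - D(G(z)))]
# from samples, with a toy affine generator and a toy logistic discriminator.

rng = np.random.default_rng(0)

def G(z):
    # toy generator: an affine map from the prior to "data" space
    return 2.0 * z + 1.0

def D(x, w=0.0, b=0.0):
    # toy discriminator: logistic regression returning P(x is real)
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

x_real = rng.normal(1.0, 1.0, size=10_000)   # samples from p_data
z = rng.normal(0.0, 1.0, size=10_000)        # samples from the prior p(z)

value = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z))))
# With w = b = 0, D outputs 1/2 everywhere, so value = 2*log(1/2) (about -1.386),
# the value of the game when the discriminator is maximally confused.
```

In the full game, D's parameters are ascended and G's descended on this value; the sketch only evaluates it once for a fixed D.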
Adversarial variational Bayes [3, 7]\nuses a more general amortized GAN inference framework within a maximum-likelihood learning\nsetting. Another type of GAN inference technique is used in the ALI [8] and BiGAN [9] models,\nwhich have been shown to approximate maximum likelihood learning [3]. In these models, both\nthe recognition and generative models are implicit and are jointly learnt by an adversarial training\nprocess.\nVariational autoencoders (VAE) [10, 11] are another state-of-the-art image modeling technique that\nuse neural networks to parametrize the posterior distribution and pair it with a top-down generative\nnetwork. Both networks are jointly trained to maximize a variational lower bound on the data log-\nlikelihood. A different framework for learning density models is autoregressive neural networks such\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Architecture of the PixelGAN autoencoder.\n\nas NADE [12], MADE [12], PixelRNN [12] and PixelCNN [13]. Unlike variational autoencoders,\nwhich capture the statistics of the data in hierarchical latent codes, the autoregressive models learn\nthe image densities directly at the pixel level without learning a hierarchical latent representation.\nIn this paper, we present the PixelGAN autoencoder as a generative autoencoder that combines the\nbene\ufb01ts of latent variable models with autoregressive architectures. The PixelGAN autoencoder is a\ngenerative autoencoder in which the generative path is a PixelCNN that is conditioned on a latent\nvariable. 
The latent variable is inferred by matching the aggregated posterior distribution to the prior\ndistribution by an adversarial training technique similar to that of the adversarial autoencoder [6].\nHowever, whereas in adversarial autoencoders the statistics of the data distribution are captured by\nthe latent code, in the PixelGAN autoencoder they are captured jointly by the latent code and the\nautoregressive decoder. We show that imposing different distributions as the prior results in different\nfactorizations of information between the latent code and the autoregressive decoder. For example, in\nSection 2.1, we show that by imposing a Gaussian distribution on the latent code, we can achieve\na global vs. local decomposition of information. In this case, the global latent code no longer has\nto model all the irrelevant and \ufb01ne details of the image, and can use its capacity to capture more\nrelevant and global statistics of the image. Another type of decomposition of information that can\nbe learnt by PixelGAN autoencoders is a discrete vs. continuous decomposition. In Section 2.2, we\nshow that we can achieve this decomposition by imposing a categorical prior on the latent code using\nadversarial training. In this case, the categorical latent code captures the discrete underlying factors\nof variation in the data, such as class label information, and the autoregressive decoder captures\nthe remaining continuous structure, such as style information, in an unsupervised fashion. We then\nshow how PixelGAN autoencoders with categorical priors can be directly used in clustering and\nsemi-supervised scenarios and achieve very competitive classi\ufb01cation results on several datasets in\nSection 3. 
Finally, we present one of the main potential applications of PixelGAN autoencoders in\nlearning cross-domain relations between two different domains in Section 4.\n\n2 PixelGAN Autoencoders\n\nLet x be a datapoint that comes from the distribution pdata(x) and z be the hidden code. The\nrecognition path of the PixelGAN autoencoder (Figure 1) defines an implicit posterior distribution\nq(z|x) by using a deterministic neural function z = f(x, n) that takes the input x along with random\nnoise n with a fixed distribution p(n) and outputs z. The aggregated posterior q(z) of this model is\ndefined as follows:\n\nq(z) = ∫_x q(z|x) pdata(x) dx.\n\nThis parametrization of the implicit posterior distribution was originally proposed in the adversarial\nautoencoder work [6] as the universal approximator posterior. We can sample from this implicit\ndistribution q(z|x) by evaluating f(x, n) at different samples of n, but the density function of this\nposterior distribution is intractable. Appendix A.1 discusses the importance of the input noise in\ntraining PixelGAN autoencoders. The generative path p(x|z) is a conditional PixelCNN [13] that\nconditions on the latent vector z using an adaptive bias in the PixelCNN layers. The inference is done by\nan amortized GAN inference technique that was originally proposed in the adversarial autoencoder\nwork [6]. 
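The universal approximator posterior can be sketched as follows; the linear f below is a hypothetical stand-in for the encoder network, and the one-dimensional data distribution is purely illustrative.

```python
import numpy as np

# Sketch of the implicit posterior z = f(x, n): a deterministic function of the
# input x and fresh noise n ~ p(n). Sampling q(z|x) means re-evaluating f at
# different draws of n; sampling the aggregated posterior q(z) means first
# drawing x ~ p_data and then z ~ q(z|x).

rng = np.random.default_rng(1)

def f(x, n):
    # hypothetical encoder: mixes the input with the noise
    return 0.5 * x + 0.1 * n

def sample_posterior(x, num_samples):
    # samples from q(z|x) for a fixed x
    n = rng.normal(size=num_samples)
    return f(x, n)

def sample_aggregated_posterior(data, num_samples):
    # samples from q(z) = ∫ q(z|x) p_data(x) dx
    x = rng.choice(data, size=num_samples)
    n = rng.normal(size=num_samples)
    return f(x, n)

data = rng.normal(0.0, 1.0, size=10_000)          # stand-in for p_data(x)
z_given_x = sample_posterior(0.0, 1000)           # samples from q(z|x=0)
z_agg = sample_aggregated_posterior(data, 1000)   # samples from q(z)
```

Note that both distributions are easy to sample but neither density is available in closed form, which is exactly why the prior is imposed adversarially rather than through an analytic KL term.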
In this method, an adversarial network is attached on top of the hidden code vector of\n\n2\n\n\fthe autoencoder and matches the aggregated posterior distribution, q(z), to an arbitrary prior, p(z).\nSamples from q(z) and p(z) are provided to the adversarial network as the negative and positive\nexamples respectively, and the generator of the adversarial network, which is also the encoder of\nthe autoencoder, tries to match q(z) to p(z) by the gradient that comes through the discriminative\nadversarial network.\nThe adversarial network, the PixelCNN decoder and the encoder are trained jointly in two phases \u2013 the\nreconstruction phase and the adversarial phase \u2013 executed on each mini-batch. In the reconstruction\nphase, the ground truth input x along with the hidden code z inferred by the encoder are provided to\nthe PixelCNN decoder. The PixelCNN decoder weights are updated to maximize the log-likelihood of\nthe input x. The encoder weights are also updated at this stage by the gradient that comes through the\nconditioning vector of the PixelCNN. In the adversarial phase, the adversarial network updates both\nits discriminative network and its generative network (the encoder) to match q(z) to p(z). 
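The losses driving the two phases can be sketched for a single mini-batch as follows; the encoder, decoder and discriminator are toy stand-ins (in particular, the Gaussian reconstruction term replaces the conditional PixelCNN likelihood), so this only illustrates what each phase optimizes, not the paper's implementation.

```python
import numpy as np

# Sketch of the per-mini-batch losses in the two training phases.

rng = np.random.default_rng(2)

def encoder(x, n):
    return 0.5 * x + 0.1 * n                      # implicit posterior z = f(x, n)

def decoder_nll(x, z):
    # stand-in reconstruction term: negative log-likelihood of x under a
    # Gaussian p(x|z) centered at 2*z (a conditional PixelCNN in the paper)
    return 0.5 * (x - 2.0 * z) ** 2

def discriminator(z):
    return 1.0 / (1.0 + np.exp(-z))               # P(z came from the prior)

x = rng.normal(size=64)                           # one unlabeled mini-batch
z = encoder(x, rng.normal(size=64))

# Reconstruction phase: decoder and encoder are updated to minimize this loss.
recon_loss = np.mean(decoder_nll(x, z))

# Adversarial phase: the discriminator separates prior samples (positive
# examples) from posterior samples (negative examples); the encoder, acting as
# the GAN generator, is then updated to fool it.
z_prior = rng.normal(size=64)
disc_loss = -np.mean(np.log(discriminator(z_prior))) \
            - np.mean(np.log(1.0 - discriminator(z)))
gen_loss = -np.mean(np.log(discriminator(z)))
```

In a real run, the three losses would each drive a gradient step on the corresponding module, with both phases executed on every mini-batch.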
Once the\ntraining is done, we can sample from the model by first sampling z from the prior distribution p(z),\nand then sampling from the conditional likelihood p(x|z) parametrized by the PixelCNN decoder.\nWe now establish a connection between the PixelGAN autoencoder cost and maximum likelihood\nlearning using a decomposition of the aggregated evidence lower bound (ELBO) proposed in [14]:\n\nE_{x~pdata(x)}[log p(x)] >= -E_{x~pdata(x)}[E_{q(z|x)}[-log p(x|z)]] - E_{x~pdata(x)}[KL(q(z|x) || p(z))]   (1)\n                          = -E_{x~pdata(x)}[E_{q(z|x)}[-log p(x|z)]] - KL(q(z) || p(z)) - I(z; x)   (2)\n\nThe first term in Equation 2 is the reconstruction term and the second term is the marginal KL\ndivergence between the aggregated posterior and the prior distribution. The third term is the mutual\ninformation between the latent code z and the input x. This is a regularization term that encourages z\nand x to be decoupled by removing the information of the data distribution from the hidden code. If\nthe training set has N examples, I(z; x) is bounded as follows (see [14]):\n\n0 < I(z; x) < log N   (3)\n\nIn order to maximize the ELBO, we need to minimize all three terms of Equation 2. We consider\ntwo cases for the decoder p(x|z):\nDeterministic Decoder. If the decoder p(x|z) is deterministic or has very limited stochasticity, such\nas the simple factorized decoder of the VAE, the mutual information term acts in the complete opposite\ndirection of the reconstruction term. 
This is because the only way to minimize the reconstruction\nerror of x is to learn a hidden code z that is relevant to x, which results in maximizing I(z; x).\nIndeed, it can be shown that minimizing the reconstruction term maximizes a variational lower\nbound on I(z; x) [15, 16]. For example, in the case of the VAE trained on MNIST, since the\nreconstruction is precise, the mutual information term is dominated and is close to its maximum value\nI(z; x) \u2248 log N \u2248 11.00 nats [14].\nStochastic Decoder. If we use a powerful decoder such as the PixelCNN, the reconstruction term\nand the mutual information term will not compete with each other anymore and the network can\nminimize both independently. In this case, the optimal solution for maximizing the ELBO would be\nto model pdata(x) solely by p(x|z) and thereby minimizing the reconstruction term, and at the same\ntime, minimizing the mutual information term by ignoring the latent code. As a result, even though\nthe model achieves a high likelihood, the latent code does not learn any useful representation, which\nis undesirable. This problem has been observed in several previous works [17, 18] and different\ntechniques such as annealing the weight of the KL term [17] or weakening the decoder [18] have\nbeen proposed to make z and x more dependent.\nAs suggested in [19, 18], we think that the maximum likelihood objective by itself is not a useful\nobjective for representation learning especially when a powerful decoder is used. In PixelGAN\nautoencoders, in order to encourage learning more useful representations, we modify the ELBO\n(Equation 2) by removing the mutual information term from it, since this term is explicitly encouraging\nz to become independent of x. So our cost function only includes the reconstruction term and the\nmarginal KL term. The reconstruction term is optimized by the reconstruction phase of training and\nthe marginal KL term is approximately optimized by the adversarial phase1. 
Note that since the mutual information term is upper bounded by a constant (log N), we are still\nmaximizing a lower bound on the log-likelihood of the data. However, this bound is weaker than the\nELBO, which is the price that is paid for learning more useful latent representations by balancing the\ndecomposition of information between the latent code and the autoregressive decoder.\n\n1The original GAN formulation optimizes the Jensen-Shannon divergence [1], but there are other formulations\nthat optimize the KL divergence, e.g. [3].\n\n3\n\n\fFigure 2: (a) Samples of the PixelGAN autoencoder with a 2D Gaussian code and a limited receptive\nfield of size 9. (b) Samples of the PixelCNN with the same limited receptive field. (c) Samples of the\nadversarial autoencoder with a 2D code.\n\nFor implementing the conditioning adaptive bias in the PixelCNN decoder, we explore two different\narchitectures [13]. In the location-invariant bias, for each PixelCNN layer, we use the latent code\nto construct a vector that is broadcast within each feature map of the layer and then added as an\nadaptive bias to that layer. In the location-dependent bias, we use the latent code to construct a spatial\nfeature map that is broadcast across different feature maps and then added only to the first layer\nof the decoder as an adaptive bias. We will discuss the effect of these architectures on the learnt\nrepresentation in Figure 3 of Section 2.1 and their implementation details in Appendix A.2.\n\n2.1 PixelGAN Autoencoders with Gaussian Priors\n\nHere, we show that PixelGAN autoencoders with Gaussian priors can decompose the global and local\nstatistics of the images between the latent code and the autoregressive decoder. Figure 2a shows the\nsamples of a PixelGAN autoencoder model with the location-dependent bias trained on the MNIST\ndataset. 
For the purpose of better illustrating the decomposition of information, we have chosen a\n2-D Gaussian latent code and limited the receptive field size to 9 for the PixelGAN autoencoder.\nFigure 2b shows the samples of a PixelCNN model with the same limited receptive field size of 9 and\nFigure 2c shows the samples of an adversarial autoencoder with the 2-D Gaussian latent code. The\nPixelCNN can successfully capture the local statistics, but fails to capture the global statistics due\nto the limited receptive field size. In contrast, the adversarial autoencoder, whose sample quality is\nvery similar to that of the VAE, can successfully capture the global statistics, but fails to generate the\ndetails of the images. However, the PixelGAN autoencoder, with the same receptive field and code\nsize, can combine the best of both and generates sharp images with global statistics.\nIn PixelGAN autoencoders, both the PixelCNN depth and the conditioning architecture affect the\ndecomposition of information between the latent code and the autoregressive decoder. We investigate\nthese effects in Figure 3 by training a PixelGAN autoencoder on MNIST where the code size is\nchosen to be 2 for visualization purposes. As shown in Figure 3a,b, when a shallow decoder is\nused, most of the information will be encoded in the hidden code and there is a clean separation\nbetween the digit clusters. As we make the PixelCNN more powerful (Figure 3c,d), we can see that\nthe hidden code is still used to capture some relevant information of the input, but the separation of\ndigit clusters is not as sharp when the limited code size of 2 is used. In the next section, we will show\nthat by using a larger code size (e.g., 30), we can get a much better separation of digit clusters even\nwhen a powerful PixelCNN is used.\nThe conditioning architecture also affects the decomposition of information. 
In the case of the\nlocation-invariant bias, the hidden code is encouraged to learn the global information that is location-\ninvariant (the what information and not the where information), such as the class label information.\nFor example, we can see in Figure 3a,c that the network has learnt to use one of the axes of the 2D\nGaussian code to explicitly encode the digit label even though a continuous prior is imposed. In this\ncase, we can potentially get a much better separation if we impose a discrete prior.\n\n4\n\n\fFigure 3: The effect of the PixelCNN decoder depth and the conditioning architecture on the learnt\nrepresentation of the PixelGAN autoencoder. (a) Shallow PixelCNN, location-invariant bias. (b) Shallow\nPixelCNN, location-dependent bias. (c) Deep PixelCNN, location-invariant bias. (d) Deep PixelCNN,\nlocation-dependent bias. (Shallow = 3 ResBlocks, Deep = 12 ResBlocks)\n\nThis makes this\narchitecture suitable for the discrete vs. continuous decomposition and we use it for our clustering\nand semi-supervised learning experiments. In the case of the location-dependent bias (Figure 3b,d),\nthe hidden code is encouraged to learn the global information that is location-dependent, such as the\nlow-frequency content of the image, similar to what the hidden code of an adversarial or\nvariational autoencoder would learn (Figure 2c). This makes this architecture suitable for the global\nvs. local decomposition experiments such as Figure 2a.\nFrom Figure 3, we can see that the class label information is mostly captured by p(z) while the style\ninformation of the images is captured by both p(z) and p(x|z). This decomposition of information\nhas also been studied in other works that combine the latent variable models with autoregressive\ndecoders such as PixelVAE [20] and variational lossy autoencoders (VLAE) [18]. For example, the\nVLAE model [18] proposes to use the depth of the PixelCNN decoder to control the decomposition of\ninformation. 
In their model, the PixelCNN decoder is designed to have a shallow depth (small local\nreceptive \ufb01eld) so that the latent code z is forced to capture more global information. This approach is\nvery similar to our example of the PixelGAN autoencoder in Figure 2. However, the question that has\nremained unanswered is whether it is possible to achieve a complete decomposition of content and\nstyle in an unsupervised fashion, where the class label or discrete structure information is encoded in\nthe latent code z, and the remaining continuous structure such as style is captured by a powerful and\ndeep PixelCNN decoder. This kind of decomposition is particularly interesting as it can be directly\nused for clustering and semi-supervised classi\ufb01cation. In the next section, we show that we can\nlearn this decomposition of content and style by imposing a categorical distribution on the latent\nrepresentation z using adversarial training. Note that this discrete vs. continuous decomposition is\nvery different from the global vs. local decomposition, because a continuous factor of variation such\nas style can have both global and local effect on the image. Indeed, in order to achieve the discrete\nvs. 
continuous decomposition, we have to use very deep and powerful PixelCNN decoders (up to 20\nresidual blocks) to capture both the global and local statistics of the style by the PixelCNN, while the\ndiscrete content of the image is captured by the categorical latent variable.\n\n2.2 PixelGAN Autoencoders with Categorical Priors\n\nIn this section, we present an architecture of the PixelGAN autoencoder that can separate the discrete\ninformation (e.g., class label) from the continuous information (e.g., style information) in the images.\nWe then show how our architecture can be naturally adapted to semi-supervised settings.\nThe architecture that we use is similar to Figure 1, with the difference that we impose a categorical dis-\ntribution as the prior rather than the Gaussian distribution (Figure 4) and also use the location-invariant\nbias architecture. Another difference is that we use a convolutional network as the inference network\nq(z|x) to encourage the encoder to preserve the content and lose the style information of the image.\nThe inference network has a softmax output and predicts a one-hot vector whose dimension is the\nnumber of discrete labels or categories that we wish the data to be clustered into. The adversarial\nnetwork is trained directly on the continuous probability outputs of the softmax layer of the encoder.\nImposing a categorical distribution at the output of the encoder imposes two constraints. The first\nconstraint is that the encoder has to make confident decisions about the class labels of the inputs. The\n\n5\n\n\fFigure 4: Architecture of the PixelGAN autoencoder with the categorical prior. 
p(z) captures the\nclass label and p(x|z) is a multi-modal distribution that captures the style distribution of a digit\nconditioned on the class label of that digit.\n\nadversarial training pushes the output of the encoder to the corners of the softmax simplex, by which\nit ensures that the autoencoder cannot use the latent vector z to carry any continuous style information.\nThe second constraint imposed by adversarial training is that the aggregated posterior distribution of\nz should match the categorical prior distribution with uniform outcome probabilities. This constraint\nenforces the encoder to evenly distribute the class labels across the corners of the softmax simplex.\nBecause of these constraints, the latent variable will only capture the discrete content of the image\nand all the continuous style information will be captured by the autoregressive decoder.\nIn order to better understand and visualize the effect of the adversarial training on shaping the hidden\ncode distribution, we train a PixelGAN autoencoder on the \ufb01rst three digits of MNIST (18000 training\nand 3000 test points) and choose the number of clusters to be 3. Suppose z = [z1, z2, z3] is the hidden\ncode which in this case is the output probabilities of the softmax layer of the inference network.\nIn Figure 5a, we project the 3D softmax simplex of z1 + z2 + z3 = 1 onto a 2D triangle and plot\nthe hidden codes of the training examples when no distribution is imposed on the hidden code. We\ncan see from this \ufb01gure that the network has learnt to use the surface of the softmax simplex to\nencode style information of the digits and thus the three corners of the simplex do not have any\nmeaningful interpretation. Figure 5b corresponds to the code space of the same network when a\ncategorical distribution is imposed using the adversarial training. 
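A Figure-5-style visualization can be sketched by mapping softmax outputs on the 3D simplex to barycentric coordinates in a 2D triangle; the vertex placement below is an arbitrary assumption, not taken from the paper.

```python
import numpy as np

# Sketch: project points on the softmax simplex z1 + z2 + z3 = 1 onto a 2D
# triangle by taking the barycentric combination of three fixed vertices.

VERTICES = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.5, np.sqrt(3.0) / 2.0]])   # assumed triangle corners

def project_simplex(z):
    """Map softmax probabilities of shape (n, 3) to 2D coordinates (n, 2)."""
    z = np.asarray(z, dtype=float)
    return z @ VERTICES

# A one-hot code lands exactly on a triangle corner, which is where the
# categorical prior pushes the encoder outputs; a maximally uncertain code
# lands at the centroid.
corner = project_simplex([[0.0, 0.0, 1.0]])
center = project_simplex([[1/3, 1/3, 1/3]])
```

Plotting the projected codes of the training examples with and without the adversarial phase reproduces the qualitative contrast between the two panels of Figure 5.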
In this case, we can see the network\nhas successfully learnt to encode the label information of the three digits in the three corners of the\nsimplex, and all the style information has been separately captured by the autoregressive decoder.\nThis network achieves an almost perfect test error-rate of 0.3% on the \ufb01rst three digits of MNIST,\neven though it is trained in a purely unsupervised fashion.\nOnce the PixelGAN autoencoder is trained, its encoder can be used for clustering new points and its\ndecoder can be used to generate samples from each cluster. Figure 6 illustrates the samples of the\nPixelGAN autoencoder trained on the full MNIST dataset. The number of clusters is set to be 30\nand each row corresponds to the conditional samples of one of the clusters (only 16 are shown). We\ncan see that the discrete latent code of the network has learnt discrete factors of variation such as\n\n(a) Without GAN Regularization\n\n(b) With GAN Regularization\n\nFigure 5: Effect of GAN regularization (categorical prior) on the code space of PixelGAN autoencoders.\n\n6\n\n\fFigure 6: Disentangling the content and style in an unsupervised fashion with PixelGAN autoencoders.\nEach row shows samples of the model from one of the learnt clusters.\n\nclass label information and some discrete style information. For example digit 1s are put in different\nclusters based on how much tilted they are. The network is also assigning different clusters to digit 2s\n(based on whether they have a loop) and digit 7s (based on whether they have a dash in the middle).\nIn Section 3, we will show that by using the encoder of this network, we can obtain about 5% error\nrate in classifying digits in an unsupervised fashion, just by matching each cluster to a digit type.\nSemi-Supervised PixelGAN Autoencoders. The PixelGAN autoencoder can be used in a semi-\nsupervised setting. In order to incorporate the label information, we add a semi-supervised training\nphase. 
Speci\ufb01cally, we set the number of clusters to be the same as the number of class labels\nand after executing the reconstruction and the adversarial phases on an unlabeled mini-batch, the\nsemi-supervised phase is executed on a labeled mini-batch, by updating the weights of the encoder\nq(z|x) to minimize the cross-entropy cost. The semi-supervised cost also reduces the mode-missing\nbehavior of the GAN training by enforcing the encoder to learn all the modes of the categorical\ndistribution. In Section 3, we will evaluate the performance of the PixelGAN autoencoders on the\nsemi-supervised classi\ufb01cation tasks.\n3 Experiments\nIn this paper, we presented the PixelGAN autoencoder as a generative model, but the currently\navailable metrics for evaluating the likelihood of GAN-based generative models such as Parzen\nwindow estimate are fundamentally \ufb02awed [21]. So in this section, we only present the performance of\nthe PixelGAN autoencoder on downstream tasks such as unsupervised clustering and semi-supervised\nclassi\ufb01cation. The details of all the experiments can be found in Appendix B.\nUnsupervised Clustering. We trained a PixelGAN autoencoder in an unsupervised fashion on\nthe MNIST dataset (Figure 6). 
We chose the number of clusters to be 30 and used the following\nevaluation protocol: once the training is done, for each cluster i, we found the validation example\nxn that maximizes q(zi|xn), and assigned the label of xn to all the points in cluster i. We then\ncomputed the test error based on the class labels assigned to each cluster. \n\nFigure 7: Conditional samples of the semi-supervised PixelGAN autoencoder. (a) SVHN (1000 labels).\n(b) MNIST (100 labels). (c) NORB (1000 labels).\n\n7\n\n\fFigure 8: Semi-supervised error-rate of PixelGAN autoencoders on the MNIST and SVHN datasets.\n\n| Method | MNIST (Unsupervised) | MNIST (20 labels) | MNIST (50 labels) | MNIST (100 labels) | SVHN (500 labels) | SVHN (1000 labels) | NORB (1000 labels) |\n| VAE [24] | - | - | - | 3.33 (±0.14) | - | 36.02 (±0.10) | 18.79 (±0.05) |\n| VAT [25] | - | - | - | 2.33 | - | 24.63 | 9.88 |\n| ADGM [26] | - | - | - | 0.96 (±0.02) | - | 22.86 | 10.06 (±0.05) |\n| SDGM [26] | - | - | - | 1.32 (±0.07) | - | 16.61 (±0.24) | 9.40 (±0.04) |\n| Adversarial Autoencoder [6] | 4.10 (±1.13) | - | - | 1.90 (±0.10) | - | 17.70 (±0.30) | - |\n| Ladder Networks [27] | - | - | - | 0.89 (±0.50) | - | - | - |\n| Convolutional CatGAN [22] | 4.27 | - | - | 1.39 (±0.28) | - | - | - |\n| InfoGAN [16] | 5.00 | - | - | - | - | - | - |\n| Feature Matching GAN [28] | - | 16.77 (±4.52) | 2.21 (±1.36) | 0.93 (±0.06) | 18.44 (±4.80) | 8.11 (±1.30) | - |\n| Temporal Ensembling [23] | - | - | - | - | 7.05 (±0.30) | 5.43 (±0.25) | - |\n| PixelGAN Autoencoders | 5.27 (±1.81) | 12.08 (±5.50) | 1.16 (±0.17) | 1.08 (±0.15) | 10.47 (±1.80) | 6.96 (±0.55) | 8.90 (±1.0) |\nTable 1: Semi-supervised learning and clustering error-rate on MNIST, SVHN and NORB datasets.\n\n
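The evaluation protocol above can be sketched as follows; the arrays are toy stand-ins for the encoder's softmax outputs q(z|x) and the dataset labels.

```python
import numpy as np

# Sketch of the cluster-labeling protocol: for each cluster i, find the
# validation example x_n that maximizes q(z_i|x_n), assign its label to the
# whole cluster, then score test points by the label of their argmax cluster.

def label_clusters(q_val, y_val):
    # q_val: (n_val, n_clusters) softmax outputs; y_val: (n_val,) labels
    best_example = np.argmax(q_val, axis=0)       # argmax_n q(z_i|x_n) per cluster
    return y_val[best_example]                    # label assigned to each cluster

def cluster_error_rate(q_test, y_test, cluster_labels):
    pred = cluster_labels[np.argmax(q_test, axis=1)]
    return np.mean(pred != y_test)

# Toy example with 3 clusters:
q_val = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.80, 0.10],
                  [0.2, 0.10, 0.70]])
y_val = np.array([7, 3, 1])
labels = label_clusters(q_val, y_val)             # -> [7, 3, 1]

q_test = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.7, 0.2]])
y_test = np.array([7, 1])                         # the second example is wrong
err = cluster_error_rate(q_test, y_test, labels)  # -> 0.5
```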
As shown in the \ufb01rst\ncolumn of Table 1, the performance of PixelGAN autoencoders is on par with other GAN-based\nclustering algorithms such as CatGAN [22], InfoGAN [16] and adversarial autoencoders [6].\nSemi-supervised Classi\ufb01cation. Table 1 and Figure 8 report the results of semi-supervised classi-\n\ufb01cation experiments on the MNIST, SVHN and NORB datasets. On the MNIST dataset with 20,\n50 and 100 labels, our classi\ufb01cation results are highly competitive. Note that the classi\ufb01cation rate\nof unsupervised clustering of MNIST is better than semi-supervised MNIST with 20 labels. This\nis because in the unsupervised case, the number of clusters is 30, but in the semi-supervised case,\nthere are only 10 class labels which makes it more likely to confuse two digits. On the SVHN dataset\nwith 500 and 1000 labels, the PixelGAN autoencoder outperforms all the other methods except the\nrecently proposed temporal ensembling work [23] which is not a generative model. On the NORB\ndataset with 1000 labels, the PixelGAN autoencoder outperforms all the other reported results.\nFigure 7 shows the conditional samples of the semi-supervised PixelGAN autoencoder on the MNIST,\nSVHN and NORB datasets. Each column of this \ufb01gure presents sampled images conditioned on a\n\ufb01xed one-hot latent code. We can see from this \ufb01gure that the PixelGAN autoencoder can achieve a\nrather clean separation of style and content on these datasets with very few labeled data.\n4 Learning Cross-Domain Relations with PixelGAN Autoencoders\n\nIn this section, we discuss how the PixelGAN autoencoder can be viewed in the context of learning\ncross-domain relations between two different domains. 
We also describe how the problem of\nclustering or semi-supervised learning can be cast as the problem of finding a smooth cross-domain\nmapping from the data distribution to the categorical distribution.\nRecently, several GAN-based methods have been developed to learn a cross-domain mapping between\ntwo different domains [29, 30, 31, 6, 32]. In [31], an unsupervised cost function called output\ndistribution matching (ODM) is proposed to find a cross-domain mapping F between two domains\nD1 and D2 by imposing the following unsupervised constraint on the uncorrelated samples from\nx ~ D1 and y ~ D2:\n\nDistr[F(x)] = Distr[y]   (4)\n\n8\n\n\fwhere Distr[z] denotes the distribution of the random variable z. Adversarial training is proposed\nas one of the methods for matching these distributions. If we have access to a few labeled pairs (x, y),\nthen F can be further trained on them in a supervised fashion to satisfy F(x) = y. For example,\nin speech recognition, we want to find a cross-domain mapping from a sequence of phonemes to a\nsequence of characters. By optimizing the ODM cost function in Equation 4, we can find a smooth\nfunction F that takes phonemes as its input and outputs a sequence of characters that respects the\nlanguage model. However, the main problem with this method is that the network can learn to ignore\npart of the input distribution and still satisfy the ODM cost function by its output distribution. This\nproblem has also been observed in other works such as [29]. 
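For a discrete output domain, the degree to which the ODM constraint holds can be sketched by comparing empirical distributions; the mapping F, the toy domains and the total-variation metric below are illustrative assumptions, not the cost actually optimized in [31].

```python
import numpy as np

# Sketch of checking the ODM constraint Distr[F(x)] = Distr[y] for a discrete
# output domain, using total variation distance between empirical distributions.

def empirical_dist(samples, num_classes):
    counts = np.bincount(samples, minlength=num_classes)
    return counts / counts.sum()

def total_variation(p, q):
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(3)
x = rng.normal(size=10_000)                       # samples from domain D1

def F(x):
    # hypothetical mapping D1 -> D2: thresholds x into 2 classes
    return (x > 0).astype(int)

y = rng.integers(0, 2, size=10_000)               # samples from domain D2 (uniform)
tv = total_variation(empirical_dist(F(x), 2), empirical_dist(y, 2))
# tv is small here because F(x) is also close to uniform over the two classes.
# Note the failure mode discussed in the text: a constant F that ignores its
# input entirely would concentrate all mass on one class (tv = 0.5 against a
# uniform target), yet a nearly-degenerate F can still drive tv to zero.
```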
One way to avoid this problem is to add a reconstruction term to the ODM cost function by introducing a reverse mapping from the output of the encoder to the input domain. This is essentially the idea of the adversarial autoencoder (AAE) [6], which learns a generative model by finding a cross-domain mapping between a Gaussian distribution and the data distribution. Using the ODM cost function along with a reconstruction term to learn cross-domain relations has been explored in several previous works. For example, InfoGAN [16] adds a mutual information term to the ODM cost function and optimizes a variational lower bound on this term. It can be shown that maximizing this variational bound is equivalent to minimizing the reconstruction cost of an autoencoder [15]. Similarly, in [32, 33], an AAE is used to learn the cross-domain relations of the vector representations of words from two different languages. The architectures of the recent works DiscoGAN [29] and CycleGAN [30] are also similar to an AAE in which the latent representation is enforced to have the distribution of the other domain. Here we describe how our proposed PixelGAN autoencoder can potentially be used in all these application areas to learn better cross-domain relations. Suppose we want to learn a mapping from domain D1 to D2. In the architecture of Figure 1, we can use independent samples of x ∼ D1 at the input and, instead of imposing a Gaussian distribution on the latent code, we can impose the distribution of the second domain using its independent samples y ∼ D2.
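The effect of the added reconstruction term can be sketched in a few lines of numpy. Everything here is a hypothetical toy (linear encoder `We`, linear reverse mapping `Wg`, and a placeholder standing in for the adversarial ODM term), not the AAE or PixelGAN implementation; the point is that the reconstruction loss penalizes any mapping that discards input information.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy forward mapping F: D1 -> D2 and reverse mapping G: D2 -> D1 (linear layers).
d1, d2, n = 6, 2, 128
We = rng.normal(size=(d1, d2))   # parameters of F (encoder)
Wg = rng.normal(size=(d2, d1))   # parameters of G (reverse mapping)
x = rng.normal(size=(n, d1))     # minibatch x ~ D1

code = x @ We        # F(x): representation in the second domain
x_rec = code @ Wg    # G(F(x)): mapped back to the input domain

# Reconstruction term: forces F(x) to retain information about x, so F cannot
# satisfy the ODM constraint while ignoring part of the input distribution.
recon_loss = np.mean((x - x_rec) ** 2)

# Placeholder for the adversarial ODM term (a real model would run a
# discriminator on `code` versus samples y ~ D2, as in Equation 4).
adv_loss = np.mean(code.mean(axis=0) ** 2)

total_loss = adv_loss + recon_loss
print(total_loss > 0)  # prints True
```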
Unlike AAEs, the encoder of PixelGAN autoencoders does not have to retain all the input information in order to achieve a lossless reconstruction. So the encoder can use all its capacity to learn the most relevant mapping from D1 to D2 while, at the same time, the PixelCNN captures the remaining information that has been lost by the encoder.

We can adopt the ODM idea for semi-supervised learning by assuming D1 is the image domain and D2 is the label domain. Independent samples of D1 and D2 correspond to samples from the data distribution pdata(x) and the categorical distribution. The function F = q(y|x) can be parametrized by a neural network that is trained to satisfy the ODM cost function by matching the aggregated distribution q(y) = ∫ q(y|x) pdata(x) dx to the categorical distribution using adversarial training. The few labeled examples are used to further train F to satisfy F(x) = y. However, as explained above, the problem with this method is that the network can learn to generate the categorical distribution while ignoring part of the input distribution. The AAE solves this problem by adding an inverse mapping from the categorical distribution to the data distribution. However, the main drawback of the AAE architecture is that, due to the reconstruction term, the latent representation now has to model all the underlying factors of variation in the image. For example, in the semi-supervised AAE architecture [6], while we are only interested in the one-hot label representation for semi-supervised learning, we also need to infer the style of the image so that we can have a lossless reconstruction of the image.
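The aggregated distribution q(y) = ∫ q(y|x) pdata(x) dx is estimated in practice by averaging the encoder's softmax outputs over a minibatch. Here is a minimal numpy sketch of that estimate; the random linear "encoder" and the KL mismatch measure are illustrative assumptions (the actual model matches q(y) to the prior adversarially, not with an explicit KL).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Row-wise softmax."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical encoder q(y|x): a random linear layer with a softmax over k labels.
n, d, k = 256, 8, 10
W = rng.normal(size=(d, k))
x = rng.normal(size=(n, d))   # minibatch x ~ p_data(x)
q_y_given_x = softmax(x @ W)  # q(y|x), one row of label probabilities per image

# Monte Carlo estimate of the aggregated distribution
# q(y) = ∫ q(y|x) p_data(x) dx  ≈  average of q(y|x) over the minibatch.
q_y = q_y_given_x.mean(axis=0)

# Adversarial training pushes q(y) toward the categorical prior Cat(1/k);
# here we simply measure the mismatch with a KL divergence.
prior = np.full(k, 1.0 / k)
kl = np.sum(q_y * np.log(q_y / prior))

print(abs(q_y.sum() - 1.0) < 1e-9, kl >= 0)  # prints: True True
```

Since each row of `q_y_given_x` sums to one, so does their average, and KL(q(y) ‖ Cat(1/k)) is nonnegative, vanishing exactly when the aggregated distribution matches the prior.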
The PixelGAN autoencoder solves this problem by enabling the encoder to infer only the factor of variation that we are interested in (i.e., label information), while the remaining structure of the input (i.e., style information) is automatically captured by the autoregressive decoder.

5 Conclusion

In this paper, we proposed the PixelGAN autoencoder, which is a generative autoencoder that combines a generative PixelCNN with a GAN inference network that can impose arbitrary priors on the latent code. We showed that imposing different distributions as the prior enables us to learn a latent representation that captures the type of statistics that we care about, while the remaining structure of the image is captured by the PixelCNN decoder. Specifically, by imposing a Gaussian prior, we were able to disentangle the low-frequency and high-frequency statistics of the images, and by imposing a categorical prior, we were able to disentangle the style and content of images and learn representations that are specifically useful for clustering and semi-supervised learning tasks. While the main focus of this paper was to demonstrate the application of PixelGAN autoencoders in downstream tasks such as semi-supervised learning, we discussed how these architectures have many other potential applications, such as learning cross-domain relations between two different domains.

Acknowledgments

We would like to thank Nathan Killoran for helpful discussions. We also thank NVIDIA for GPU donations.

References

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[2] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[3] Ferenc Huszár. Variational inference using implicit distributions.
arXiv preprint arXiv:1702.08235, 2017.

[4] Dustin Tran, Rajesh Ranganath, and David M Blei. Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896, 2017.

[5] Rajesh Ranganath, Dustin Tran, Jaan Altosaar, and David Blei. Operator variational inference. In Advances in Neural Information Processing Systems, pages 496–504, 2016.

[6] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[7] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.

[8] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[9] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

[10] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

[11] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning, 2014.

[12] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[13] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.

[14] Matthew D Hoffman and Matthew J Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound.
In NIPS 2016 Workshop on Advances in Approximate Bayesian Inference, 2016.

[15] David Barber and Felix V Agakov. The IM algorithm: A variational approach to information maximization. In NIPS, pages 201–208, 2003.

[16] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[17] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[18] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

[19] Ferenc Huszár. Is Maximum Likelihood Useful for Representation Learning? http://www.inference.vc/maximum-likelihood-for-representation-learning-2.

[20] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.

[21] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

[22] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.

[23] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.

[24] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models.
In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[25] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. stat, 1050:25, 2015.

[26] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.

[27] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3532–3540, 2015.

[28] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.

[29] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.

[30] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

[31] Ilya Sutskever, Rafal Jozefowicz, Karol Gregor, Danilo Rezende, Tim Lillicrap, and Oriol Vinyals. Towards principled unsupervised learning. arXiv preprint arXiv:1511.06440, 2015.

[32] Antonio Valerio Miceli Barone. Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. arXiv preprint arXiv:1608.02996, 2016.

[33] Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. Adversarial training for unsupervised bilingual lexicon induction.

[34] Daniel Jiwoong Im, Sungjin Ahn, Roland Memisevic, and Yoshua Bengio. Denoising criterion for variational auto-encoding framework.
arXiv preprint arXiv:1511.06406, 2015.

[35] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.

[36] Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.

[37] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[38] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.

[39] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[40] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.