{"title": "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 2172, "page_last": 2180, "abstract": "This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.", "full_text": "InfoGAN: Interpretable Representation Learning by\nInformation Maximizing Generative Adversarial Nets\n\nXi Chen\u2020\u2021, Yan Duan\u2020\u2021, Rein Houthooft\u2020\u2021, John Schulman\u2020\u2021, Ilya Sutskever\u2021, Pieter Abbeel\u2020\u2021\n\n\u2020 UC Berkeley, Department of Electrical Engineering and Computer Sciences\n\n\u2021 OpenAI\n\nAbstract\n\nThis paper describes InfoGAN, an information-theoretic extension to the Gener-\native Adversarial Network that is able to learn disentangled representations in a\ncompletely unsupervised manner. InfoGAN is a generative adversarial network\nthat also maximizes the mutual information between a small subset of the latent\nvariables and the observation. 
We derive a lower bound of the mutual information\nobjective that can be optimized ef\ufb01ciently. Speci\ufb01cally, InfoGAN successfully\ndisentangles writing styles from digit shapes on the MNIST dataset, pose from\nlighting of 3D rendered images, and background digits from the central digit on\nthe SVHN dataset. It also discovers visual concepts that include hair styles, pres-\nence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments\nshow that InfoGAN learns interpretable representations that are competitive with\nrepresentations learned by existing supervised methods. For an up-to-date version\nof this paper, please see https://arxiv.org/abs/1606.03657.\n\n1\n\nIntroduction\n\nUnsupervised learning can be described as the general problem of extracting value from unlabelled\ndata which exists in vast quantities. A popular framework for unsupervised learning is that of\nrepresentation learning [1, 2], whose goal is to use unlabelled data to learn a representation that\nexposes important semantic features as easily decodable factors. A method that can learn such\nrepresentations is likely to exist [2], and to be useful for many downstream tasks which include\nclassi\ufb01cation, regression, visualization, and policy learning in reinforcement learning.\nWhile unsupervised learning is ill-posed because the relevant downstream tasks are unknown at\ntraining time, a disentangled representation, one which explicitly represents the salient attributes of a\ndata instance, should be helpful for the relevant but unknown tasks. For example, for a dataset of\nfaces, a useful disentangled representation may allocate a separate set of dimensions for each of the\nfollowing attributes: facial expression, eye color, hairstyle, presence or absence of eyeglasses, and the\nidentity of the corresponding person. 
A disentangled representation can be useful for natural tasks\nthat require knowledge of the salient attributes of the data, which include tasks like face recognition\nand object recognition. It is not the case for unnatural supervised tasks, where the goal could be,\nfor example, to determine whether the number of red pixels in an image is even or odd. Thus, to be\nuseful, an unsupervised learning algorithm must in effect correctly guess the likely set of downstream\nclassi\ufb01cation tasks without being directly exposed to them.\nA signi\ufb01cant fraction of unsupervised learning research is driven by generative modelling. It is\nmotivated by the belief that the ability to synthesize, or \u201ccreate\u201d the observed data entails some form\nof understanding, and it is hoped that a good generative model will automatically learn a disentangled\nrepresentation, even though it is easy to construct perfect generative models with arbitrarily bad\nrepresentations. The most prominent generative models are the variational autoencoder (VAE) [3]\nand the generative adversarial network (GAN) [4].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fIn this paper, we present a simple modi\ufb01cation to the generative adversarial network objective that\nencourages it to learn interpretable and meaningful representations. We do so by maximizing the\nmutual information between a \ufb01xed small subset of the GAN\u2019s noise variables and the observations,\nwhich turns out to be relatively straightforward. Despite its simplicity, we found our method to be\nsurprisingly effective: it was able to discover highly semantic and meaningful hidden representations\non a number of image datasets: digits (MNIST), faces (CelebA), and house numbers (SVHN). The\nquality of our unsupervised disentangled representation matches previous works that made use of\nsupervised label information [5\u20139]. 
These results suggest that generative modelling augmented with a mutual information cost could be a fruitful approach for learning disentangled representations.\nIn the remainder of the paper, we begin with a review of related work, noting the supervision required by previous methods that learn disentangled representations. Then we review GANs, which form the basis of InfoGAN. We describe how maximizing mutual information results in interpretable representations and derive a simple and efficient algorithm for doing so. Finally, in the experiments section, we first compare InfoGAN with prior approaches on relatively clean datasets and then show that InfoGAN can learn interpretable representations on complex datasets where no previous unsupervised approach is known to learn representations of comparable quality.\n\n2 Related Work\n\nThere exists a large body of work on unsupervised representation learning. Early methods were based on stacked (often denoising) autoencoders or restricted Boltzmann machines [10\u201313].\nAnother intriguing line of work consists of the ladder network [14], which has achieved spectacular results on a semi-supervised variant of the MNIST dataset. More recently, a model based on the VAE has achieved even better semi-supervised results on MNIST [15]. GANs [4] have been used by Radford et al. [16] to learn an image representation that supports basic linear algebra on the code space. Lake et al. [17] have been able to learn representations using probabilistic inference over Bayesian programs, which achieved convincing one-shot learning results on the OMNI dataset.\nIn addition, prior research has attempted to learn disentangled representations using supervised data. One class of such methods trains a subset of the representation to match the supplied label using supervised learning: bilinear models [18] separate style and content; the multi-view perceptron [19] separates face identity and viewpoint; and Yang et al. 
[20] developed a recurrent variant that generates a sequence of latent factor transformations. Similarly, VAEs [5] and Adversarial Autoencoders [9] were shown to learn representations in which the class label is separated from other variations.\nRecently, several weakly supervised methods were developed to remove the need to explicitly label variations. disBM [21] is a higher-order Boltzmann machine which learns a disentangled representation by \u201cclamping\u201d a part of the hidden units for a pair of data points that are known to match in all but one factor of variation. DC-IGN [7] extends this \u201cclamping\u201d idea to the VAE and successfully learns graphics codes that can represent pose and light in 3D rendered images. This line of work yields impressive results, but it relies on a supervised grouping of the data that is generally not available. Whitney et al. [8] proposed to alleviate the grouping requirement by learning from consecutive frames of images, using temporal continuity as a supervisory signal.\nUnlike the cited prior works that strive to recover disentangled representations, InfoGAN requires no supervision of any kind. To the best of our knowledge, the only other unsupervised method that learns disentangled representations is hossRBM [13], a higher-order extension of the spike-and-slab restricted Boltzmann machine that can disentangle emotion from identity on the Toronto Face Dataset [22]. However, hossRBM can only disentangle discrete latent factors, and its computation cost grows exponentially in the number of factors. InfoGAN can disentangle both discrete and continuous latent factors, scales to complicated datasets, and typically requires no more training time than a regular GAN.\n\n3 Background: Generative Adversarial Networks\n\nGoodfellow et al. [4] introduced the Generative Adversarial Network (GAN), a framework for training deep generative models using a minimax game. 
The goal is to learn a generator distribution PG(x) that matches the real data distribution Pdata(x). Instead of trying to explicitly assign probability to every x in the data distribution, GAN learns a generator network G that generates samples from the generator distribution PG by transforming a noise variable z \u223c Pnoise(z) into a sample G(z). This generator is trained by playing against an adversarial discriminator network D that aims to distinguish between samples from the true data distribution Pdata and the generator\u2019s distribution PG. So for a given generator, the optimal discriminator is D(x) = Pdata(x)/(Pdata(x) + PG(x)). More formally, the minimax game is given by the following expression:\n\nmin_G max_D V (D, G) = E_{x\u223cPdata}[log D(x)] + E_{z\u223cPnoise}[log(1 \u2212 D(G(z)))]   (1)\n\n4 Mutual Information for Inducing Latent Codes\n\nThe GAN formulation uses a simple factored continuous input noise vector z, while imposing no restrictions on the manner in which the generator may use this noise. As a result, it is possible that the noise will be used by the generator in a highly entangled way, causing the individual dimensions of z to not correspond to semantic features of the data.\nHowever, many domains naturally decompose into a set of semantically meaningful factors of variation. For instance, when generating images from the MNIST dataset, it would be ideal if the model automatically chose to allocate a discrete random variable to represent the numerical identity of the digit (0-9), and two additional continuous variables to represent the digit\u2019s angle and the thickness of its stroke. 
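To make this concrete, the structured input just described — one 10-way categorical code and two continuous codes on top of ordinary incompressible noise — could be sampled as in the following sketch. The dimensions here (a 62-dimensional noise vector, batch of 4) are illustrative stand-ins, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(batch_size, noise_dim=62, n_cat=10, n_cont=2):
    """Sample [z, categorical code, continuous codes] as one vector per example.
    noise_dim / n_cat / n_cont are illustrative, not the paper's settings."""
    z = rng.standard_normal((batch_size, noise_dim))          # incompressible noise
    cat = np.eye(n_cat)[rng.integers(0, n_cat, size=batch_size)]  # one-hot Cat(K=10)
    cont = rng.uniform(-1.0, 1.0, size=(batch_size, n_cont))  # Unif(-1, 1) codes
    return np.concatenate([z, cat, cont], axis=1)

latent = sample_latent(4)
```

A generator would then consume this concatenated vector just as a regular GAN consumes its noise vector.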
It would be useful if we could recover these concepts without any supervision, by simply specifying that an MNIST digit is generated by a 1-of-10 variable and two continuous variables.\nIn this paper, rather than using a single unstructured noise vector, we propose to decompose the input noise vector into two parts: (i) z, which is treated as a source of incompressible noise; (ii) c, which we will call the latent code and which targets the salient structured semantic features of the data distribution.\nMathematically, we denote the set of structured latent variables by c1, c2, . . . , cL. In its simplest form, we may assume a factored distribution, given by P (c1, c2, . . . , cL) = \u220f_{i=1}^{L} P (ci). For ease of notation, we will use the latent code c to denote the concatenation of all latent variables ci.\nWe now propose a method for discovering these latent factors in an unsupervised way: we provide the generator network with both the incompressible noise z and the latent code c, so the form of the generator becomes G(z, c). However, in a standard GAN, the generator is free to ignore the additional latent code c by finding a solution satisfying PG(x|c) = PG(x). To cope with the problem of trivial codes, we propose an information-theoretic regularization: there should be high mutual information between the latent codes c and the generator distribution G(z, c). Thus I(c; G(z, c)) should be high.\nIn information theory, the mutual information between X and Y , I(X; Y ), measures the \u201camount of information\u201d learned from knowledge of random variable Y about the other random variable X. The mutual information can be expressed as the difference of two entropy terms:\n\nI(X; Y ) = H(X) \u2212 H(X|Y ) = H(Y ) \u2212 H(Y |X)   (2)\n\nThis definition has an intuitive interpretation: I(X; Y ) is the reduction of uncertainty in X when Y is observed. 
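Eq. (2) can be checked numerically on small discrete distributions. The sketch below (toy distributions chosen for illustration, not from the paper) computes I(X; Y) in the equivalent form H(X) + H(Y) − H(X, Y), covering the two extreme cases discussed next.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability vector; ignores zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to the H(X) - H(X|Y) form of Eq. (2)."""
    px = joint.sum(axis=1)   # marginal over X (rows)
    py = joint.sum(axis=0)   # marginal over Y (columns)
    return entropy(px) + entropy(py) - entropy(joint.ravel())

# Independent X, Y: the joint factorizes as an outer product, so I(X;Y) = 0.
independent = np.outer([0.5, 0.5], [0.25, 0.75])
# Deterministic invertible relation (Y = X): I(X;Y) attains its maximum H(X).
deterministic = np.diag([0.5, 0.5])

i_indep = mutual_information(independent)
i_det = mutual_information(deterministic)
```

Here `i_indep` is zero (up to float error) and `i_det` equals H(X) = log 2.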
If X and Y are independent, then I(X; Y ) = 0, because knowing one variable reveals nothing about the other; by contrast, if X and Y are related by a deterministic, invertible function, then maximal mutual information is attained. This interpretation makes it easy to formulate a cost: given any x \u223c PG(x), we want PG(c|x) to have a small entropy. In other words, the information in the latent code c should not be lost in the generation process. Similar mutual information inspired objectives have been considered before in the context of clustering [23\u201325]. Therefore, we propose to solve the following information-regularized minimax game:\n\nmin_G max_D VI (D, G) = V (D, G) \u2212 \u03bbI(c; G(z, c))   (3)\n\n5 Variational Mutual Information Maximization\n\nIn practice, the mutual information term I(c; G(z, c)) is hard to maximize directly as it requires access to the posterior P (c|x). Fortunately we can obtain a lower bound of it by defining an auxiliary distribution Q(c|x) to approximate P (c|x):\n\nI(c; G(z, c)) = H(c) \u2212 H(c|G(z, c))\n= E_{x\u223cG(z,c)}[E_{c'\u223cP (c|x)}[log P (c'|x)]] + H(c)\n= E_{x\u223cG(z,c)}[DKL(P (\u00b7|x) \u2016 Q(\u00b7|x)) + E_{c'\u223cP (c|x)}[log Q(c'|x)]] + H(c)\n\u2265 E_{x\u223cG(z,c)}[E_{c'\u223cP (c|x)}[log Q(c'|x)]] + H(c)   (4)\n\nwhere the inequality uses DKL(P (\u00b7|x) \u2016 Q(\u00b7|x)) \u2265 0. This technique of lower bounding mutual information is known as Variational Information Maximization [26]. We note that the entropy of latent codes H(c) can be optimized as well since it has a simple analytical form for common distributions. However, in this paper we opt for simplicity by fixing the latent code distribution and we will treat H(c) as a constant. So far we have bypassed the problem of having to compute the posterior P (c|x) explicitly via this lower bound but we still need to be able to sample from the posterior in the inner expectation. Next we state a simple lemma, with its proof deferred to Appendix 1, that removes the need to sample from the posterior.\n\nLemma 5.1 For random variables X, Y and function f (x, y) under suitable regularity conditions: E_{x\u223cX,y\u223cY |x}[f (x, y)] = E_{x\u223cX,y\u223cY |x,x'\u223cX|y}[f (x', y)].\n\nBy using Lemma 5.1, we can define a variational lower bound, LI (G, Q), of the mutual information, I(c; G(z, c)):\n\nLI (G, Q) = E_{c\u223cP (c),x\u223cG(z,c)}[log Q(c|x)] + H(c)\n= E_{x\u223cG(z,c)}[E_{c'\u223cP (c|x)}[log Q(c'|x)]] + H(c)\n\u2264 I(c; G(z, c))   (5)\n\nWe note that LI (G, Q) is easy to approximate with Monte Carlo simulation. In particular, LI can be maximized w.r.t. Q directly and w.r.t. G via the reparametrization trick. Hence LI (G, Q) can be added to GAN\u2019s objectives with no change to GAN\u2019s training procedure and we call the resulting algorithm Information Maximizing Generative Adversarial Networks (InfoGAN).\nEq (4) shows that the lower bound becomes tight as the auxiliary distribution Q approaches the true posterior distribution: Ex[DKL(P (\u00b7|x) \u2016 Q(\u00b7|x))] \u2192 0. In addition, we know that when the variational lower bound attains its maximum LI (G, Q) = H(c) for discrete latent codes, the bound becomes tight and the maximal mutual information is achieved. In the Appendix, we note how InfoGAN can be connected to the Wake-Sleep algorithm [27] to provide an alternative interpretation.\nHence, InfoGAN is defined as the following minimax game with a variational regularization of mutual information and a hyperparameter \u03bb:\n\nmin_{G,Q} max_D VInfoGAN(D, G, Q) = V (D, G) \u2212 \u03bbLI (G, Q)   (6)\n\n6 Implementation\n\nIn practice, we parametrize the auxiliary distribution Q as a neural network. 
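Before turning to the architecture, the tightness claim for the bound in Eq. (5) can be illustrated on a small discrete example (toy numbers, not an experiment from the paper): the bound equals the true mutual information exactly when Q is the true posterior, and is strictly lower for any other Q.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Toy discrete joint over a latent code c (rows) and an observation x (columns),
# standing in for P(c) together with a fixed generator; purely illustrative.
joint = rng.random((10, 8))
joint /= joint.sum()
p_c = joint.sum(axis=1)
p_x = joint.sum(axis=0)
posterior = joint / p_x            # true posterior P(c|x), one column per x

mi = entropy(p_c) + entropy(p_x) - entropy(joint.ravel())  # I(c; x)

def variational_bound(q):
    """L_I = E_{c,x}[log q(c|x)] + H(c), as in Eq. (5), for column-stochastic q."""
    return np.sum(joint * np.log(q)) + entropy(p_c)

tight = variational_bound(posterior)                     # Q = true posterior
uniform_q = np.full_like(joint, 1.0 / joint.shape[0])    # an arbitrary worse Q
loose = variational_bound(uniform_q)
```

With Q equal to the posterior, the expectation term is exactly −H(c|x), recovering I(c; x); the uniform Q pays the KL gap in Eq. (4).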
In most experiments, Q and D share all convolutional layers and there is one final fully connected layer to output parameters for the conditional distribution Q(c|x), which means InfoGAN only adds a negligible computation cost to GAN. We have also observed that LI (G, Q) always converges faster than the normal GAN objectives and hence InfoGAN essentially comes for free with GAN.\nFor a categorical latent code ci, we use the natural choice of a softmax nonlinearity to represent Q(ci|x). For a continuous latent code cj, there are more options depending on the true posterior P (cj|x). In our experiments, we have found that simply treating Q(cj|x) as a factored Gaussian is sufficient.\nSince GANs are known to be difficult to train, we design our experiments based on existing techniques introduced by DC-GAN [16], which are enough to stabilize InfoGAN training; we did not have to introduce any new tricks. The detailed experimental setup is described in the Appendix. Even though InfoGAN introduces an extra hyperparameter \u03bb, it is easy to tune and simply setting it to 1 is sufficient for discrete latent codes. When the latent code contains continuous variables, a smaller \u03bb is typically used to ensure that \u03bbLI (G, Q), which now involves differential entropy, is on the same scale as the GAN objectives.\n\n7 Experiments\n\nThe first goal of our experiments is to investigate whether mutual information can be maximized efficiently. The second goal is to evaluate whether InfoGAN can learn disentangled and interpretable representations by making use of the generator to vary only one latent factor at a time, in order to assess whether varying such a factor results in only one type of semantic variation in generated images. 
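The shared-trunk design described in Section 6 — D and Q sharing all layers except their final heads — can be sketched as a forward pass. The single dense layer and all sizes below are illustrative stand-ins for the paper's convolutional architecture, shown only to make the weight sharing explicit.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# Illustrative weights: one shared "trunk" plus two small heads.
feat_W = rng.standard_normal((784, 128)) * 0.01
d_head = rng.standard_normal((128, 1)) * 0.01    # discriminator real/fake logit
q_head = rng.standard_normal((128, 10)) * 0.01   # logits for a 10-way code

x = rng.standard_normal((4, 784))                # a batch of flattened "images"
features = np.tanh(x @ feat_W)                   # computed once, shared by D and Q
d_logit = features @ d_head                      # discriminator output
q_probs = softmax(features @ q_head)             # Q(c|x), one distribution per image
```

Because `features` is computed once for both heads, the extra cost of Q over a plain discriminator is a single linear layer, matching the "negligible computation cost" remark above.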
DC-IGN [7] also uses this method to evaluate their learned representations on 3D image datasets, on which we also apply InfoGAN to establish a direct comparison.\n\n7.1 Mutual Information Maximization\n\nTo evaluate whether the mutual information between latent codes c and generated images G(z, c) can be maximized efficiently with the proposed method, we train InfoGAN on the MNIST dataset with a uniform categorical distribution on latent codes c \u223c Cat(K = 10, p = 0.1). In Figure 1, the lower bound LI (G, Q) is quickly maximized to H(c) \u2248 2.30, which means the bound (4) is tight and maximal mutual information is achieved.\nAs a baseline, we also train a regular GAN with an auxiliary distribution Q, where the generator is not explicitly encouraged to maximize the mutual information with the latent codes. Since we use an expressive neural network to parametrize Q, we can assume that Q reasonably approximates the true posterior P (c|x) and hence that there is little mutual information between latent codes and generated images in a regular GAN. We note that with a different neural network architecture, there might be higher mutual information between latent codes and generated images, even though we have not observed such a case in our experiments. This comparison is meant to demonstrate that in a regular GAN, there is no guarantee that the generator will make use of the latent codes.\n\nFigure 1: Lower bound LI over training iterations\n\n7.2 Disentangled Representation\n\nTo disentangle digit shape from styles on MNIST, we choose to model the latent codes with one categorical code, c1 \u223c Cat(K = 10, p = 0.1), which can model discontinuous variation in the data, and two continuous codes that can capture variations that are continuous in nature: c2, c3 \u223c Unif(\u22121, 1).\nIn Figure 2, we show that the discrete code c1 captures drastic changes in shape. Changing the categorical code c1 switches between digits most of the time. 
In fact, even though we train InfoGAN without any labels, c1 can be used as a classifier that achieves a 5% error rate in classifying MNIST digits by matching each category in c1 to a digit type. In the second row of Figure 2a, we can observe a digit 7 being classified as a 9.\nThe continuous codes c2, c3 capture continuous variations in style: c2 models rotation of the digits and c3 controls the width. What is remarkable is that in both cases, the generator does not simply stretch or rotate the digits but instead adjusts other details like thickness or stroke style to make sure the resulting images are natural looking. As a test to check whether the latent representation learned by InfoGAN is generalizable, we manipulated the latent codes in an exaggerated way: instead of plotting latent codes from \u22121 to 1, we plot them from \u22122 to 2, covering a wide region that the network was never trained on, and we still get meaningful generalization.\nNext we evaluate InfoGAN on two datasets of 3D images: faces [28] and chairs [29], on which DC-IGN was shown to learn highly interpretable graphics codes.\nOn the faces dataset, DC-IGN uses supervision to learn to represent azimuth (pose), elevation, and lighting as continuous latent variables. Using the same dataset, we demonstrate that InfoGAN learns a disentangled representation that recovers azimuth (pose), elevation, and lighting. In this experiment, we choose to model the latent codes with five continuous codes, ci \u223c Unif(\u22121, 1) with 1 \u2264 i \u2264 5.\nSince DC-IGN requires supervision, it was previously not possible to learn a latent code for a variation that is unlabeled, and hence salient latent factors of variation could not be discovered automatically from data. 
By contrast, InfoGAN is able to discover such variation on its own: for instance, in Figure 3d a latent code that smoothly changes a face from wide to narrow is learned, even though this variation was neither explicitly generated nor labeled in prior work.\n\n(a) Varying c1 on InfoGAN (Digit type)\n\n(b) Varying c1 on regular GAN (No clear meaning)\n\n(c) Varying c2 from \u22122 to 2 on InfoGAN (Rotation)\n\n(d) Varying c3 from \u22122 to 2 on InfoGAN (Width)\n\nFigure 2: Manipulating latent codes on MNIST: In all figures of latent code manipulation, we use the convention that in each figure one latent code varies from left to right while the other latent codes and noise are fixed. The different rows correspond to different random samples of fixed latent codes and noise. For instance, in (a), one column contains five samples from the same category in c1, and a row shows the generated images for the 10 possible categories in c1 with other noise fixed. In (a), each category in c1 largely corresponds to one digit type; in (b), varying c1 on a GAN trained without information regularization results in non-interpretable variations; in (c), a small value of c2 denotes a left-leaning digit whereas a high value corresponds to a right-leaning digit; in (d), c3 smoothly controls the width. We reorder (a) for visualization purposes, as the categorical code is inherently unordered.\n\nOn the chairs dataset, DC-IGN can learn a continuous code that represents rotation. InfoGAN again is able to learn the same concept as a continuous code (Figure 4a), and we show in addition that InfoGAN is also able to continuously interpolate between similar chair types of different widths using a single continuous code (Figure 4b). 
In this experiment, we choose to model the latent factors with four categorical codes, c1,2,3,4 \u223c Cat(K = 20, p = 0.05), and one continuous code c5 \u223c Unif(\u22121, 1).\nNext we evaluate InfoGAN on the Street View House Number (SVHN) dataset, on which learning an interpretable representation is significantly more challenging because the data is noisy, containing images of variable resolution with distracting digits, and does not have multiple variations of the same object.\nIn this experiment, we make use of four 10-dimensional categorical variables and two uniform continuous variables as latent codes. We show two of the learned latent factors in Figure 5.\nFinally, we show in Figure 6 that InfoGAN is able to learn many visual concepts on another challenging dataset: CelebA [30], which includes 200,000 celebrity images with large pose variations and background clutter. In this dataset, we model the latent variation as 10 uniform categorical variables, each of dimension 10. Surprisingly, even on this complicated dataset, InfoGAN can recover azimuth as in the 3D images, even though no single face appears in multiple pose positions. Moreover, InfoGAN can disentangle other highly semantic variations like the presence or absence of glasses, hairstyles, and emotion, demonstrating that a level of visual understanding has been acquired.\n\n(a) Azimuth (pose)\n\n(b) Elevation\n\n(c) Lighting\n\n(d) Wide or Narrow\n\nFigure 3: Manipulating latent codes on 3D Faces: We show the effect of the learned continuous latent factors on the outputs as their values vary from \u22121 to 1. In (a), we show that a continuous latent code consistently captures the azimuth of the face across different shapes; in (b), the continuous code captures elevation; in (c), the continuous code captures the orientation of lighting; and in (d), the continuous code learns to interpolate between wide and narrow faces while preserving other visual features. 
For each factor, we present the representation that most resembles prior results [7] out of 5\nrandom runs to provide direct comparison.\n\n(a) Rotation\n\n(b) Width\n\nFigure 4: Manipulating latent codes on 3D Chairs: In (a), the continuous code captures the pose\nof the chair while preserving its shape, although the learned pose mapping varies across different\ntypes; in (b), the continuous code can alternatively learn to capture the widths of different chair types,\nand smoothly interpolate between them. For each factor, we present the representation that most\nresembles prior results [7] out of 5 random runs to provide direct comparison.\n\n8 Conclusion\n\nThis paper introduces a representation learning algorithm called Information Maximizing Generative\nAdversarial Networks (InfoGAN). In contrast to previous approaches, which require supervision,\nInfoGAN is completely unsupervised and learns interpretable and disentangled representations on\nchallenging datasets. In addition, InfoGAN adds only negligible computation cost on top of GAN and\nis easy to train. The core idea of using mutual information to induce representation can be applied to\nother methods like VAE [3], which is a promising area of future work. 
Other possible extensions to\nthis work include: learning hierarchical latent representations, improving semi-supervised learning\nwith better codes [31], and using InfoGAN as a high-dimensional data discovery tool.\n\n7\n\n\f(a) Continuous variation: Lighting\n\n(b) Discrete variation: Plate Context\n\nFigure 5: Manipulating latent codes on SVHN: In (a), we show that one of the continuous codes\ncaptures variation in lighting even though in the dataset each digit is only present with one lighting\ncondition; In (b), one of the categorical codes is shown to control the context of central digit: for\nexample in the 2nd column, a digit 9 is (partially) present on the right whereas in 3rd column, a digit\n0 is present, which indicates that InfoGAN has learned to separate central digit from its context.\n\n(a) Azimuth (pose)\n\n(b) Presence or absence of glasses\n\n(c) Hair style\n\n(d) Emotion\n\nFigure 6: Manipulating latent codes on CelebA: (a) shows that a categorical code can capture the\nazimuth of face by discretizing this variation of continuous nature; in (b) a subset of the categorical\ncode is devoted to signal the presence of glasses; (c) shows variation in hair style, roughly ordered\nfrom less hair to more hair; (d) shows change in emotion, roughly ordered from stern to happy.\n\nAcknowledgements\n\nWe thank the anonymous reviewers. This research was funded in part by ONR through a PECASE\naward. Xi Chen was also supported by a Berkeley AI Research lab Fellowship. Yan Duan was also\nsupported by a Berkeley AI Research lab Fellowship and a Huawei Fellowship. Rein Houthooft was\nsupported by a Ph.D. Fellowship of the Research Foundation - Flanders (FWO).\n\nReferences\n\n[1] Y. Bengio, \u201cLearning deep architectures for ai,\u201d Foundations and trends in Machine Learning, 2009.\n\n8\n\n\f[2] Y. Bengio, A. Courville, and P. 
Vincent, \u201cRepresentation learning: A review and new perspectives,\u201d\n\nPattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, 2013.\n\n[3] D. P. Kingma and M. Welling, \u201cAuto-encoding variational bayes,\u201d ArXiv preprint arXiv:1312.6114, 2013.\nI. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y.\n[4]\nBengio, \u201cGenerative adversarial nets,\u201d in NIPS, 2014, pp. 2672\u20132680.\n\n[5] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, \u201cSemi-supervised learning with deep\n\ngenerative models,\u201d in NIPS, 2014, pp. 3581\u20133589.\n\n[6] B. Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen, \u201cDiscovering hidden factors of variation in\n\ndeep networks,\u201d ArXiv preprint arXiv:1412.6583, 2014.\n\n[7] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, \u201cDeep convolutional inverse graphics\n\nnetwork,\u201d in NIPS, 2015, pp. 2530\u20132538.\n\n[8] W. F. Whitney, M. Chang, T. Kulkarni, and J. B. Tenenbaum, \u201cUnderstanding visual concepts with\n\ncontinuation learning,\u201d ArXiv preprint arXiv:1602.06822, 2016.\n\n[9] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow, \u201cAdversarial autoencoders,\u201d ArXiv preprint\n\narXiv:1511.05644, 2015.\n\n[10] G. E. Hinton, S. Osindero, and Y.-W. Teh, \u201cA fast learning algorithm for deep belief nets,\u201d Neural\n\nComput., vol. 18, no. 7, pp. 1527\u20131554, 2006.\n\n[11] G. E. Hinton and R. R. Salakhutdinov, \u201cReducing the dimensionality of data with neural networks,\u201d\n\nScience, vol. 313, no. 5786, pp. 504\u2013507, 2006.\n\n[12] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, \u201cExtracting and composing robust features\n\nwith denoising autoencoders,\u201d in ICLR, 2008, pp. 1096\u20131103.\n\n[13] G. Desjardins, A. Courville, and Y. 
Bengio, \u201cDisentangling factors of variation via generative entangling,\u201d\n\nArXiv preprint arXiv:1210.5474, 2012.\n\n[14] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, \u201cSemi-supervised learning with ladder\n\nnetworks,\u201d in NIPS, 2015, pp. 3532\u20133540.\n\n[15] L. Maal\u00f8e, C. K. S\u00f8nderby, S. K. S\u00f8nderby, and O. Winther, \u201cImproving semi-supervised learning with\n\nauxiliary deep generative models,\u201d in ICML, 2016.\n\n[16] A. Radford, L. Metz, and S. Chintala, \u201cUnsupervised representation learning with deep convolutional\n\ngenerative adversarial networks,\u201d ArXiv preprint arXiv:1511.06434, 2015.\n\n[17] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, \u201cHuman-level concept learning through probabilistic\n\nprogram induction,\u201d Science, vol. 350, no. 6266, pp. 1332\u20131338, 2015.\nJ. B. Tenenbaum and W. T. Freeman, \u201cSeparating style and content with bilinear models,\u201d Neural\ncomputation, vol. 12, no. 6, pp. 1247\u20131283, 2000.\n\n[18]\n\n[19] Z. Zhu, P. Luo, X. Wang, and X. Tang, \u201cMulti-view perceptron: A deep model for learning face identity\n\nand view representations,\u201d in NIPS, 2014, pp. 217\u2013225.\nJ. Yang, S. E. Reed, M.-H. Yang, and H. Lee, \u201cWeakly-supervised disentangling with recurrent transfor-\nmations for 3d view synthesis,\u201d in NIPS, 2015, pp. 1099\u20131107.\n\n[20]\n\n[21] S. Reed, K. Sohn, Y. Zhang, and H. Lee, \u201cLearning to disentangle factors of variation with manifold\n\ninteraction,\u201d in ICML, 2014, pp. 1431\u20131439.\nJ. Susskind, A. Anderson, and G. E. Hinton, \u201cThe Toronto face dataset,\u201d Tech. Rep., 2010.\nJ. S. Bridle, A. J. Heading, and D. J. MacKay, \u201cUnsupervised classi\ufb01ers, mutual information and\n\u2019phantom targets\u2019,\u201d in NIPS, 1992.\n\n[22]\n[23]\n\n[24] D. Barber and F. V. Agakov, \u201cKernelized infomax clustering,\u201d in NIPS, 2005, pp. 17\u201324.\n[25] A. Krause, P. Perona, and R. 
G. Gomes, \u201cDiscriminative clustering by regularized information maximiza-\n\ntion,\u201d in NIPS, 2010, pp. 775\u2013783.\n\n[26] D. Barber and F. V. Agakov, \u201cThe IM algorithm: A variational approach to information maximization,\u201d\n\nin NIPS, 2003.\n\n[27] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, \u201cThe\" wake-sleep\" algorithm for unsupervised neural\n\nnetworks,\u201d Science, vol. 268, no. 5214, pp. 1158\u20131161, 1995.\nP. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, \u201cA 3d face model for pose and illumination\ninvariant face recognition,\u201d in AVSS, 2009, pp. 296\u2013301.\n\n[28]\n\n[29] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic, \u201cSeeing 3D chairs: Exemplar part-based\n\n2D-3D alignment using a large dataset of CAD models,\u201d in CVPR, 2014, pp. 3762\u20133769.\n\n[30] Z. Liu, P. Luo, X. Wang, and X. Tang, \u201cDeep learning face attributes in the wild,\u201d in ICCV, 2015.\n[31]\n\nJ. T. Springenberg, \u201cUnsupervised and semi-supervised learning with categorical generative adversarial\nnetworks,\u201d ArXiv preprint arXiv:1511.06390, 2015.\n\n9\n\n\f", "award": [], "sourceid": 1134, "authors": [{"given_name": "Xi", "family_name": "Chen", "institution": "UC Berkeley and OpenAI"}, {"given_name": "Yan", "family_name": "Duan", "institution": "UC Berkeley"}, {"given_name": "Rein", "family_name": "Houthooft", "institution": "Ghent University - iMinds and UC Berkeley and OpenAI"}, {"given_name": "John", "family_name": "Schulman", "institution": "OpenAI"}, {"given_name": "Ilya", "family_name": "Sutskever", "institution": "Google"}, {"given_name": "Pieter", "family_name": "Abbeel", "institution": "OpenAI / UC Berkeley / Gradescope"}]}