{"title": "Disentangling factors of variation in deep representation using adversarial training", "book": "Advances in Neural Information Processing Systems", "page_first": 5040, "page_last": 5048, "abstract": "We propose a deep generative model for learning to distill the hidden factors of variation within a set of labeled observations into two complementary codes. One code describes the factors of variation relevant to solving a specified task. The other code describes the remaining factors of variation that are irrelevant to solving this task. The only available source of supervision during the training process comes from our ability to distinguish among different observations belonging to the same category. Concrete examples include multiple images of the same object from different viewpoints, or multiple speech samples from the same speaker. In both of these instances, the factors of variation irrelevant to classification are implicitly expressed by intra-class variabilities, such as the relative position of an object in an image, or the linguistic content of an utterance. Most existing approaches for solving this problem rely heavily on having access to pairs of observations that share only a single factor of variation, e.g. different objects observed in the exact same conditions. This assumption often fails to hold in realistic settings, where data acquisition is not controlled and labels for the uninformative components are unavailable. In this work, we propose to overcome this limitation by augmenting deep convolutional autoencoders with a form of adversarial training. Both factors of variation are implicitly captured in the organization of the learned embedding space, and can be used for solving single-image analogies. \n
Experimental results on synthetic and real datasets show that the proposed method is capable of disentangling the influences of style and content factors using a flexible representation, as well as generalizing to unseen styles or content classes.", "full_text": "Disentangling factors of variation in deep representations using adversarial training\n\nMichael Mathieu, Junbo Zhao, Pablo Sprechmann, Aditya Ramesh, Yann LeCun\n\n719 Broadway, 12th Floor, New York, NY 10003\n\n{mathieu, junbo.zhao, pablo, ar2922, yann}@cs.nyu.edu\n\nAbstract\n\nWe introduce a conditional generative model for learning to disentangle the hidden factors of variation within a set of labeled observations, and separate them into complementary codes. One code summarizes the specified factors of variation associated with the labels. The other summarizes the remaining unspecified variability. During training, the only available source of supervision comes from our ability to distinguish among different observations belonging to the same class. Examples of such observations include images of a set of labeled objects captured at different viewpoints, or recordings of a set of speakers dictating multiple phrases. In both instances, the intra-class diversity is the source of the unspecified factors of variation: each object is observed at multiple viewpoints, and each speaker dictates multiple phrases. Learning to disentangle the specified factors from the unspecified ones becomes easier when strong supervision is possible. Suppose that during training, we have access to pairs of images, where each pair shows two different objects captured from the same viewpoint. This source of alignment allows us to solve our task using existing methods. However, labels for the unspecified factors are usually unavailable in realistic scenarios where data acquisition is not strictly controlled. 
We address the problem of disentanglement in this more general setting by combining deep convolutional autoencoders with a form of adversarial training. Both factors of variation are implicitly captured in the organization of the learned embedding space, and can be used for solving single-image analogies. Experimental results on synthetic and real datasets show that the proposed method is capable of generalizing to unseen classes and intra-class variabilities.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1 Introduction\n\nA fundamental challenge in understanding sensory data is learning to disentangle the underlying factors of variation that give rise to the observations [1]. For instance, the factors of variation involved in generating a speech recording include the speaker's attributes, such as gender, age, or accent, as well as the intonation and words being spoken. Similarly, the factors of variation underlying the image of an object include the object's physical representation and the viewing conditions. The difficulty of disentangling these hidden factors is that, in most real-world situations, each can influence the observation in a different and unpredictable way. It is seldom the case that one has access to rich forms of labeled data in which the nature of these influences is given explicitly.\nOften, the purpose for which a dataset is collected is to further progress in solving a certain supervised learning task. This type of learning is driven completely by the labels. The goal is for the learned representation to be invariant to factors of variation that are uninformative to the task at hand. While recent approaches for supervised learning have enjoyed tremendous success, their performance comes at the cost of discarding sources of variation that may be important for solving other, closely-related tasks. 
Ideally, we would like to be able to learn representations in which the\nuninformative factors of variation are separated from the informative ones, instead of being discarded.\nMany other exciting applications require the use of generative models that are capable of synthesizing\nnovel instances where certain key factors of variation are held \ufb01xed. Unlike classi\ufb01cation, generative\nmodeling requires preserving all factors of variation. But merely preserving these factors is not\nsuf\ufb01cient for many tasks of interest, making the disentanglement process necessary. For example,\nin speech synthesis, one may wish to transfer one person\u2019s dialog to another person\u2019s voice. Inverse\nproblems in image processing, such as denoising and super-resolution, require generating images that\nare perceptually consistent with corrupted or incomplete observations.\nIn this work, we introduce a deep conditional generative model that learns to separate the factors of\nvariation associated with the labels from the other sources of variability. We only make the weak\nassumption that we are able to distinguish between observations assigned to the same label during\ntraining. To make disentanglement possible in this more general setting, we leverage both Variational\nAuto-Encoders (VAEs) [12, 25] and Generative Adversarial Networks (GANs) [9].\n\n2 Related work\n\nThere is a vast literature on learning disentangled representations. Bilinear models [26] were an early\napproach to separate content and style for images of faces and text in various fonts. What-where\nautoencoders [22, 28] combine discrimination and reconstruction criteria to attempt to recover the\nfactors of variation not associated with the labels. 
In [10], an autoencoder is trained to separate a translation-invariant representation from a code that is used to recover the translation information. In [2], the authors show that standard deep architectures can discover and explicitly represent factors of variation aside from those relevant for classification, by combining autoencoders with simple regularization terms during training. In the context of generative models, the work in [23] extends the Restricted Boltzmann Machine by partitioning its hidden state into distinct factors of variation. The work presented in [11] uses a VAE in a semi-supervised learning setting. Their approach is able to disentangle the label information from the hidden code by providing an additional one-hot vector as input to the generative model. Similarly, [18] shows that autoencoders trained in a semi-supervised manner can transfer handwritten digit styles using a decoder conditioned on a categorical variable indicating the desired digit class. The main difference between these approaches and ours is that the former cannot generalize to unseen identities.\nThe work in [5, 13] further explores the application of content and style disentanglement to computer graphics. Whereas computer graphics involves going from an abstract description of a scene to a rendering, these methods learn to go backward from the rendering to recover the abstract description. This description can include attributes such as orientation and lighting information. While these methods are capable of producing impressive results, they benefit from being able to use synthetic data, making strong supervision possible.\nClosely related to the problem of disentangling factors of variation in representation learning is that of learning fair representations [17, 7]. 
In particular, the Fair Variational Auto-Encoder [17] aims to learn representations that are invariant to certain nuisance factors of variation, while retaining as much of the remaining information as possible. The authors propose a variant of the VAE that encourages independence between the different latent factors of variation.\nThe problem of disentangling factors of variation also plays an important role in completing image analogies, the goal of the end-to-end model proposed in [24]. Their method relies on having access to matching examples during training. Our approach requires neither matching observations nor labels aside from the class identities. These properties allow the model to be trained on data with a large number of labels, enabling generalization over the classes present in the training data.\n\n3 Background\n\n3.1 Variational autoencoder\n\nThe VAE framework is an approach for modeling a data distribution using a collection of independent latent variables. Let x be a random variable (real or binary) representing the observed data and z a collection of real-valued latent variables. The generative model over the pair (x, z) is given by p(x, z) = p(x | z)p(z), where p(z) is the prior distribution over the latent variables and p(x | z) is the conditional likelihood function. Generally, we assume that the components of z are independent Bernoulli or Gaussian random variables. The likelihood function is parameterized by a deep neural network referred to as the decoder.\nA key aspect of VAEs is the use of a learned approximate inference procedure that is trained purely using gradient-based methods [12, 25]. This is achieved by using a learned approximate posterior q(z | x) = N(μ, σI) whose parameters are given by another deep neural network referred to as the encoder. Thus, we have z ∼ Enc(x) = q(z | x) and x̃ ∼ Dec(z) = p(x | z). 
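The encoder/decoder pair and the Gaussian KL regularizer described above can be sketched as follows. This is a toy NumPy sketch, not the paper's implementation: the `encode` function is a stand-in for the deep encoder network, and the latent dimension of 2 is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    # Stand-in for the deep encoder network: maps x to the parameters
    # (mu, sigma) of the approximate posterior q(z | x) = N(mu, sigma * I).
    mu = np.full(2, x.mean())
    sigma = np.ones(2)
    return mu, sigma

def sample_z(mu, sigma):
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I),
    # which keeps the sampling step differentiable in (mu, sigma).
    return mu + sigma * rng.standard_normal(mu.shape)

def kl_to_prior(mu, sigma):
    # Closed-form KL(q(z | x) || p(z)) between diagonal Gaussians,
    # with prior p(z) = N(0, I): the regularizer in the VAE loss.
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

mu, sigma = encode(np.ones((4, 4)))
z = sample_z(mu, sigma)
print(kl_to_prior(mu, sigma))  # prints 1.0: nonzero whenever q deviates from the prior
```

The reconstruction term of the loss would then score a decoder applied to `z`; here only the posterior sampling and KL term are shown.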
The parameters of these networks are optimized by minimizing the upper bound on the expected negative log-likelihood of x, which is given by\n\nEq(z | x)[− log pθ(x | z)] + KL(q(z | x) || p(z)). (1)\n\nThe first term in (1) corresponds to the reconstruction error, and the second term is a regularizer that ensures that the approximate posterior stays close to the prior.\n\n3.2 Generative adversarial networks\n\nGenerative Adversarial Networks (GANs) [9] have enjoyed great success at producing realistic natural images [21]. The main idea is to use an auxiliary network Disc, called the discriminator, in conjunction with the generative model, Gen. The training procedure establishes a min-max game between the two networks as follows. On one hand, the discriminator is trained to differentiate between natural samples drawn from the true data distribution and synthetic images produced by the generative model. On the other hand, the generator is trained to produce samples that confuse the discriminator into mistaking them for genuine images. The goal is for the generator to produce increasingly realistic images as the discriminator learns to pick up on increasingly subtle inaccuracies that allow it to tell apart real and fake images.\nBoth Disc and Gen can be conditioned on the label of the input that we wish to classify or generate, respectively [20]. This approach has been successfully used to produce samples that belong to a specific class or possess some desirable property [4, 19, 21]. The training objective can be expressed as the min-max problem\n\nmin_Gen max_Disc Lgan, where Lgan = log Disc(x, id) + log(1 − Disc(Gen(z, id), id)), (2)\n\nwhere pd(x, id) is the data distribution conditioned on a given class label id, and p(z) is a generic prior over the latent space (e.g. 
N(0, I)).\n\n4 Model\n\n4.1 Conditional generative model\n\nWe introduce a conditional probabilistic model admitting two independent sources of variation: an observed variable s that characterizes the specified factors of variation, and a continuous latent variable z that characterizes the remaining variability. The variable s is given by a vector of real numbers, rather than a class ordinal or a one-hot vector, as we intend for the model to generalize to unseen identities.\nGiven an observed specified component s, we can sample\n\nz ∼ p(z) = N(0, I) and x ∼ pθ(x | z, s), (3)\n\nin order to generate a new instance x compatible with s.\nThe variables s and z are marginally independent, which promotes disentanglement between the specified and unspecified factors of variation. Again, pθ(x | z, s) is a likelihood function described by a decoder network, Dec, and the approximate posterior is modeled using an independent Gaussian distribution, qφ(z | x, s) = N(μ, σI), whose parameters are specified via an encoder network, Enc. In this new setting, the variational upper bound is given by\n\nEq(z | x,s)[− log pθ(x | z, s)] + KL(q(z | x, s) || p(z)). (4)\n\nThe specified component s can be obtained from one or more images belonging to the same class. In this work, we consider the simplest case in which s is obtained from a single image. To this end, we define a deterministic encoder fs that maps images to their corresponding specified components. All sources of stochasticity in s come from the data distribution. The conditional likelihood given by (3) can now be written as x ∼ pθ(x | z, fs(x')), where x' is any image sharing the same label as x, including x itself. In addition to fs, the model has an additional encoder fz that parameterizes the approximate posterior q(z | x, s). 
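Under these definitions, conditional generation reduces to pairing a specified code taken from a real image with an unspecified code sampled from the prior. A minimal NumPy sketch follows; `f_s` and `decode` are illustrative stand-ins for the convolutional networks fs and Dec, not the paper's architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_s(x):
    # Stand-in for the deterministic specified-component encoder fs:
    # s is a real-valued vector, not a one-hot label, so the model can
    # generalize to identities never seen during training.
    return x.reshape(-1)[:3]

def decode(z, s):
    # Stand-in for the decoder Dec(z, s); here it simply concatenates the
    # two codes where the real model would render an image.
    return np.concatenate([z, s])

# Generate a new instance compatible with the identity of x_prime:
# s comes from x_prime (any image with the right label), z from the prior.
x_prime = np.arange(16.0).reshape(4, 4)
s = f_s(x_prime)
z = rng.standard_normal(2)   # z ~ p(z) = N(0, I)
x_new = decode(z, s)
print(x_new.shape)  # prints (5,)
```

Because s is deterministic given the image, all stochasticity in the generated sample comes from the prior over z, mirroring equation (3).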
It is natural to consider an architecture in which the parameters of both encoders are shared.\nWe now define a single encoder Enc by Enc(x) = (fs(x), fz(x)) = (s, (μ, σ)) = (s, z), where s is the specified component, and z = (μ, σ) are the parameters of the approximate posterior that constitute the unspecified component. To generate a new instance, we feed s and z to Dec to obtain x̃ = Dec(s, z).\nThe model described above cannot be trained by minimizing the log-likelihood alone. In particular, there is nothing that prevents all of the information about the observation from flowing through the unspecified component. The decoder could learn to ignore s, and the approximate posterior could map images belonging to the same class to different regions of the latent space. This degenerate solution can be easily prevented when we have access to labels for the unspecified factors of variation, as in [24]. In this case, we could enforce that s be informative by requiring that Dec be able to reconstruct two observations having the same unspecified label after their unspecified components are swapped. But for many real-world scenarios, it is either impractical or impossible to obtain labels for the unspecified factors of variation. In the following section, we explain a way of eliminating the need for such labels.\n\n4.2 Discriminative regularization\n\nAn alternative approach to preventing the degenerate solution described in the previous section, without the need for labels for the unspecified components, makes use of GANs (Section 3.2). As before, we employ a procedure in which the unspecified components of a pair of observations are swapped. But since the observations need not be aligned along the unspecified factors of variation, it no longer makes sense to enforce reconstruction. 
After swapping, the class identities of both observations will remain the same, but the sources of variability within their corresponding classes will change. Hence, rather than enforcing reconstruction, we ensure that both observations are assigned high probabilities of belonging to their original classes by an external discriminator. Formally, we introduce the discriminative term given by (2) into the loss given by (4), yielding\n\nEq(z | x,s)[− log pθ(x | z, s)] + KL(q(z | x, s) || p(z)) + λLgan, (5)\n\nwhere λ is a non-negative weight.\nRecent works have explored combining VAEs with GANs [14, 6]. These approaches aim to add a recognition network (allowing inference problems to be solved) to the GAN framework. In the setting used in this work, the GAN is used to compensate for the lack of aligned training data. The work in [14] investigates the use of GANs for obtaining perceptually better loss functions (beyond pixels). While this is not the goal of our work, our framework is able to generate sharper images, which comes as a side effect. We also evaluated including a GAN loss for samples; however, the system became unstable without leading to perceptually better generations. An interesting variant could be to use separate discriminators for images generated with and without supervision.\n\n4.3 Training procedure\n\nLet x1 and x1' be samples sharing the same label, namely id1, and x2 a sample belonging to a different class, id2. On one hand, we want to minimize the upper bound on the negative log-likelihood of x1 when feeding the decoder inputs of the form (z1, fs(x1)) and (z1, fs(x1')), where z1 is sampled from the approximate posterior q(z | x1). On the other hand, we want to minimize the adversarial loss of samples generated by feeding the decoder inputs given by (z, fs(x2)), where z is sampled from the approximate posterior q(z | x1). 
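One such training step can be sketched as follows. This is a hedged NumPy sketch with stand-in encoders and decoder; in the actual model the reconstruction and adversarial losses would be backpropagated through deep convolutional networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_s(x):
    # Stand-in specified-component encoder.
    return np.atleast_1d(x.mean())

def sample_posterior(x):
    # Stand-in for f_z plus reparameterized sampling, z ~ q(z | x).
    return x[:2] + 0.1 * rng.standard_normal(2)

def decode(z, s):
    # Stand-in decoder Dec(z, s).
    return np.concatenate([z, s])

x1, x1_same, x2 = rng.random((3, 8))  # x1 and x1_same share id1; x2 has id2

z1 = sample_posterior(x1)
# Reconstruction terms: both decoder inputs should reconstruct x1.
rec_a = decode(z1, f_s(x1))
rec_b = decode(z1, f_s(x1_same))
# Adversarial term: swap in the specified component of x2. No ground-truth
# image exists for this combination, so the sample is instead scored by the
# discriminator, which must assign it the class identity id2.
swapped = decode(sample_posterior(x1), f_s(x2))
print(rec_a.shape, swapped.shape)  # prints (3,) (3,)
```

The reconstruction pair trains s to carry the class-relevant factors shared by x1 and x1', while the swapped sample is what the discriminative term in (5) regularizes.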
This corresponds to swapping the specified and unspecified factors of x1 and x2. We could rely on the upper bound alone only if we had access to aligned data. As in the GAN setting described in Section 3.2, we alternate this procedure with updates of the adversary network. The diagram of the network is shown in Figure 1, and the described training procedure is summarized in Algorithm 1 in the supplementary material.\n\nFigure 1: Training architecture. The inputs x1 and x1' are two different samples with the same label, whereas x2 can have any label.\n\n5 Experiments\n\nDatasets. We evaluate our model on both synthetic and real datasets: the Sprites dataset [24], MNIST [15], NORB [16] and the Extended-YaleB dataset [8]. We used Torch7 [3] to conduct all experiments. The network architectures follow that of DCGAN [21] and are described in detail in the supplementary material.\nEvaluation. To the best of our knowledge, there is no standard benchmark dataset (or task) for evaluating disentangling performance [2]. We propose two forms of evaluation to illustrate the behavior of the proposed framework, one qualitative and one quantitative.\nQualitative evaluation is obtained by visually examining the perceptual quality of single-image analogies and conditional image generation. For all datasets, we evaluated the models in four different settings: swapping: given a pair of images, we generate samples conditioning on the specified component extracted from one of the images and sampling from the approximate posterior obtained from the other one. This procedure is analogous to the sampling technique employed during training, described in Section 4.3, and corresponds to solving single-image analogies; retrieval: in order to assess the correlation between the specified and unspecified components, we performed nearest-neighbor retrieval in the learned embedding spaces. 
We computed the corresponding representations for all samples (for the unspecified component we used the mean of the approximate posterior distribution) and then retrieved the nearest neighbors for a given query image; interpolation: to evaluate the coverage of the data manifold, we generated a sequence of images by linearly interpolating the codes of two given test images (for both specified and unspecified representations); conditional generation: given a test image, we generate samples conditioning on its specified component, sampling directly from the prior distribution, p(z). In all the experiments, images were randomly chosen from the test set; please see the specific details for each dataset.\nThe objective evaluation of generative models is a difficult task and is itself the subject of current research [27]. Common evaluation metrics, such as measuring the log-likelihood of a set of validation samples, are often not very meaningful, as they do not correlate with the perceptual quality of the images [27]. Furthermore, the loss function used by our model does not correspond to a bound on the likelihood of a generative model, which would render this evaluation less meaningful. As a quantitative measure, we evaluate the degree of disentanglement via a classification task. Namely, we measure how much information about the identity is contained in the specified and unspecified components.\nMNIST. In this setup, the specified part is simply the class of the digit. The goal is to show that the model is able to learn to disentangle the style from the identity of the digit and to produce satisfactory analogies. We cannot test the ability of the model to generalize to unseen identities. In this case, one could directly condition on a class label [11, 18]. 
It is still interesting that the proposed model is able to transfer handwriting style without having access to matched examples, while still being able to learn a smooth representation of the digits, as shown in the interpolation results. Results are shown in Figure 2. We observe that the generated images are convincing and particularly sharp; the latter is a side effect produced by the GAN term in our training loss.\nSprites. The dataset is composed of 672 unique characters (we refer to them as sprites), each of which is associated with 20 animations [24]. Any image of a sprite can present 7 sources of variation: body type, gender, hair type, armor type, arm type, greaves type, and weapon type. Unlike the work in [24], we do not use any supervision regarding the positions of the sprites.\n\nFigure 2: left (a): A visualization grid of 2D MNIST image swapping generation. The top row and leftmost column digits come from the test set. The other digits are generated using z from the leftmost digit, and s from the digit at the top of the column. The diagonal digits show reconstructions. right (b): Interpolation visualization. Digits located at the top-left and bottom-right corners come from the dataset. The remaining digits are generated by interpolating s and z. As in (a), each row has a constant z and each column a constant s.\n\nFigure 3: left (a): A visualization grid of 2D sprite swapping generation. Same visualization arrangement as in 2(a); right (b): Interpolation visualization. Same arrangement as in 2(b).\n\nThe results obtained for the swapping and interpolation settings are displayed in Figure 3, while retrieval results are shown in Figure 4. Samples from the conditional model are shown in Figure 5(a). We observe that the model is able to generalize to unseen sprites quite well. The generated images are sharp and single-image analogies are resolved successfully. 
The interpolation results show that one can smoothly transition between identities or positions. It is worth noting that this dataset has a fixed number of discrete positions. Thus, 3(b) shows a reasonable coverage of the manifold with some abrupt changes. For instance, the hands do not move up through pixel space, but appear gradually from the faint background.\nNORB. For the NORB dataset we used instance identity (rather than object category) for defining the labels. This results in 25 different object identities in the training set and another 25 distinct object identities in the testing set. As in the Sprites dataset, the identities used at testing have never been presented to the network at training time. In this case, however, the small number of identities seen at training time makes generalization more difficult. In Figure 6 we present results for interpolation and swapping. We observe that the model is able to resolve analogies well. However, the quality of the results is degraded. In particular, classes having high variability (such as planes) are not reconstructed well. Also, some of the models are highly symmetric, thus creating a lot of uncertainty. We conjecture that these problems could be eliminated in the presence of more training data. Queries in the case of NORB are not as expressive as with the sprites, but we can still observe good behavior. We refer the reader to the supplementary material for these images.\nExtended-YaleB. The dataset consists of facial images of 28 individuals taken under different positions and illuminations. The training and testing sets contain roughly 600 and 180 images per individual, respectively. Figure 7 shows interpolation and swapping results for a set of testing images. Due to the small number of identities, we cannot test generalization to unseen identities in this case. 
We observe that the model is able to resolve the analogies in a satisfactory manner: position and illumination are transferred correctly, although these positions have not been seen at training time for these individuals. In the supplementary material we show samples drawn from the conditional model as well as other examples of interpolation and swapping.\n\nFigure 4: left (a): sprite retrieval querying on the specified component; right (b): sprite retrieval querying on the unspecified component. Sprites placed at the left of the white lane are used as the query.\n\nFigure 5: left (a): sprite generation by sampling; right (b): NORB generation by sampling.\n\nFigure 6: left (a): A visualization grid of 2D NORB image swapping generation. Same visualization arrangement as in 2(a); right (b): Interpolation visualization. Same arrangement as in 2(b).\n\nQuantitative evaluation. We analyze the disentanglement of the specified and unspecified representations by using them as input features for a prediction task. We trained a two-layer neural network with 256 hidden units to predict structured labels for the Sprites dataset, toy category for the NORB dataset (four-legged animals, human figures, airplanes, trucks, and cars), and subject identity for the Extended-YaleB dataset. We used early stopping on a validation set to prevent overfitting. We report both training and testing errors in Table 1. In all cases the unspecified component is agnostic to the identity information, almost matching the performance of random selection. On the other hand, the specified components are highly informative, producing almost the same results as a classifier trained directly in a discriminative manner. In particular, we observe some overfitting in the NORB dataset. This might also be due to the difficulty of generalizing to unseen identities using a small dataset.\nInfluence of components of the framework. 
It is worth evaluating the contribution of the different components of the framework. Without the adversarial regularization, the model is unable to learn disentangled representations. It can be verified empirically that the unspecified component is completely ignored, as discussed in Section 4.1. A valid question to ask is whether the training of s has to be done jointly in an end-to-end manner, or whether s could be pre-computed. In Section 4 of the supplementary material we run our setting using an embedding trained beforehand to classify the identities. The model is still able to learn a disentangled representation, but the quality of the generated images as well as the analogies is compromised. Better pre-trained embeddings could be considered, for example by enforcing the representations of different images of the same identity to be close to each other and far from those corresponding to different identities. However, joint end-to-end training still has the advantage of requiring fewer parameters, due to the parameter sharing of the encoders.\n\nFigure 7: left (a): A visualization grid of 2D Extended-YaleB face image swapping generation. right (b): Interpolation visualization. See Figure 2 for a description.\n\nTable 1: Comparison of classification based on z and s. All numbers are error rates.\n\n              | Sprites z | Sprites s | NORB z | NORB s | Extended-YaleB z | Extended-YaleB s\ntrain         | 58.6%     | 5.5%      | 79.8%  | 2.6%   | 96.4%            | 0.05%\ntest          | 59.8%     | 5.2%      | 79.9%  | 13.5%  | 96.4%            | 0.08%\nrandom-chance | 60.7%     |           | 80.0%  |        | 96.4%            |\n\n6 Conclusions and discussion\n\nThis paper presents a conditional generative model that learns to disentangle the factors of variation of the data into components specified and unspecified by a given categorization. The proposed model does not rely on strong supervision regarding the sources of variation. This is achieved by combining two very successful generative models: VAEs and GANs. 
The model is able to resolve the analogies in a consistent way on several datasets with minimal parameter/architecture tuning. Although these initial results are promising, there is a lot to be tested and understood. The model is motivated by a general setting that is expected to be encountered in more realistic scenarios. However, in this initial study we only tested the model on rather constrained examples. As observed in the results on the NORB dataset, given the weaker supervision assumed in our setting, the proposed approach seems to have a high sample complexity, relying on training samples covering the full range of variations for both specified and unspecified factors. The proposed model does not attempt to disentangle variations within the specified and unspecified components. There are many possible ways of mapping a unit Gaussian to corresponding images; in the current setting, there is nothing preventing the obtained mapping from presenting highly entangled factors of variation.\n\nReferences\n\n[1] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.\n\n[2] Brian Cheung, Jesse A. Livezey, Arjun K. Bansal, and Bruno A. Olshausen. Discovering hidden factors of variation in deep networks. CoRR, abs/1412.6583, 2014.\n\n[3] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.\n\n[4] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.\n\n[5] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. 
CoRR, abs/1411.5928, 2014.\n\n[6] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and\n\nAaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.\n\n[7] Harrison Edwards and Amos Storkey. Censoring representations with an adversary. arXiv preprint\n\narXiv:1511.05897, 2015.\n\n8\n\n\f[8] Athinodoros S Georghiades, Peter N Belhumeur, and David J Kriegman. From few to many: Illumina-\ntion cone models for face recognition under variable lighting and pose. Pattern Analysis and Machine\nIntelligence, IEEE Transactions on, 23(6):643\u2013660, 2001.\n\n[9] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C.\n\nCourville, and Yoshua Bengio. Generative adversarial networks. NIPS, 2014.\n\n[10] Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. Transforming auto-encoders. In Proceedings of\nthe 21th International Conference on Arti\ufb01cial Neural Networks - Volume Part I, ICANN\u201911, pages 44\u201351,\nBerlin, Heidelberg, 2011. Springer-Verlag.\n\n[11] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised\nlearning with deep generative models. In Advances in Neural Information Processing Systems, pages\n3581\u20133589, 2014.\n\n[12] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,\n\n2013.\n\n[13] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse\n\ngraphics network. In Advances in Neural Information Processing Systems, pages 2530\u20132538, 2015.\n\n[14] Anders Boesen Lindbo Larsen, S\u00f8ren Kaae S\u00f8nderby, and Ole Winther. Autoencoding beyond pixels using\n\na learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.\n\n[15] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to\n\ndocument recognition. 
Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[16] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with\n\ninvariance to pose and lighting. In CVPR, 2004.\n\n[17] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoen-\n\ncoder. ICLR, 2016.\n\n[18] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian J. Goodfellow. Adversarial autoencoders.\n\nCoRR, abs/1511.05644, 2015.\n\n[19] Micha\u00ebl Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean\n\nsquare error. ICLR, abs/1511.05440, 2015.\n\n[20] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784,\n\n2014.\n\n[21] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolu-\n\ntional generative adversarial networks. CoRR, abs/1511.06434, 2015.\n\n[22] Marc\u2019Aurelio Ranzato, Fu-Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of\ninvariant feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern\nRecognition Conference (CVPR\u201907). IEEE Press, 2007.\n\n[23] Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation\nwith manifold interaction. In Proceedings of the 31st International Conference on Machine Learning\n(ICML-14), pages 1431\u20131439, 2014.\n\n[24] Scott E Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In C. Cortes, N. D.\nLawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing\nSystems 28, pages 1252\u20131260. Curran Associates, Inc., 2015.\n\n[25] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approxi-\n\nmate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.\n\n[26] Joshua B. Tenenbaum and William T. Freeman. 
Separating style and content with bilinear models. Neural\n\nComput., 12(6):1247\u20131283, June 2000.\n\n[27] Lucas Theis, A\u00e4ron van den Oord, and Matthias Bethge. A note on the evaluation of generative models.\n\narXiv preprint arXiv:1511.01844, 2015.\n\n[28] Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where auto-encoders. In\n\nICLR workshop submission, 2016.\n\n9\n\n\f", "award": [], "sourceid": 2587, "authors": [{"given_name": "Michael", "family_name": "Mathieu", "institution": "NYU"}, {"given_name": "Junbo Jake", "family_name": "Zhao", "institution": "NYU"}, {"given_name": "Junbo", "family_name": "Zhao", "institution": "NYU"}, {"given_name": "Aditya", "family_name": "Ramesh", "institution": "NYU"}, {"given_name": "Pablo", "family_name": "Sprechmann", "institution": "New York University"}, {"given_name": "Yann", "family_name": "LeCun", "institution": "NYU"}]}