{"title": "Learning Latent Subspaces in Variational Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 6444, "page_last": 6454, "abstract": "Variational autoencoders (VAEs) are widely used deep generative models capable of learning unsupervised latent representations of data. Such representations are often difficult to interpret or control. We consider the problem of unsupervised learning of features correlated to specific labels in a dataset. We propose a VAE-based generative model which we show is capable of extracting features correlated to binary labels in the data and structuring it in a latent subspace which is easy to interpret. Our model, the Conditional Subspace VAE (CSVAE), uses mutual information minimization to learn a low-dimensional latent subspace associated with each label that can easily be inspected and independently manipulated. We demonstrate the utility of the learned representations for attribute manipulation tasks on both the Toronto Face and CelebA datasets.", "full_text": "Learning Latent Subspaces in\n\nVariational Autoencoders\n\nJack Klys, Jake Snell, Richard Zemel\n\n{jackklys,jsnell,zemel}@cs.toronto.edu\n\nUniversity of Toronto\n\nVector Institute\n\nAbstract\n\nVariational autoencoders (VAEs) [10, 20] are widely used deep generative models\ncapable of learning unsupervised latent representations of data. Such represen-\ntations are often dif\ufb01cult to interpret or control. We consider the problem of\nunsupervised learning of features correlated to speci\ufb01c labels in a dataset. We\npropose a VAE-based generative model which we show is capable of extracting\nfeatures correlated to binary labels in the data and structuring it in a latent subspace\nwhich is easy to interpret. Our model, the Conditional Subspace VAE (CSVAE),\nuses mutual information minimization to learn a low-dimensional latent subspace\nassociated with each label that can easily be inspected and independently ma-\nnipulated. 
We demonstrate the utility of the learned representations for attribute manipulation tasks on both the Toronto Face [23] and CelebA [15] datasets.\n\n1 Introduction\n\nDeep generative models have recently made large strides in their ability to successfully model complex, high-dimensional data such as images [8], natural language [1], and chemical molecules [6]. Though useful for data generation and feature extraction, these unstructured representations still lack the ease of understanding and exploration that we desire from generative models. For example, the correspondence between any particular dimension of the latent representation and the aspects of the data it is related to is unclear. When a latent feature of interest is labelled in the data, learning a representation which isolates it is possible [11, 21], but doing so in a fully unsupervised way remains a difficult and unsolved task.\n\nConsider instead the following slightly easier problem. Suppose we are given a dataset of N labelled examples D = {(x1, y1), . . . , (xN , yN )} with each label yi ∈ {1, . . . , K}, and data belonging to each class yi has some latent structure (for example, it can be naturally clustered into sub-classes or organized based on class-specific properties). Our goal is to learn a generative model in which this structure can easily be recovered from the learned latent representations. Moreover, we would like our model to allow manipulation of these class-specific properties in any given new data point (given only a single example), or generation of data with any class-specific property in a straightforward way.\n\nWe investigate this problem within the framework of variational autoencoders (VAE) [10, 20]. A VAE forms a generative distribution over the data pθ(x) = ∫ p(z) pθ(x|z) dz by introducing a latent variable z ∈ Z and an associated prior p(z). We propose the Conditional Subspace VAE (CSVAE), which learns a latent space Z × W that separates information correlated with the label y into a predefined subspace W. 
To accomplish this we require that the mutual information between z and y should be 0, and we give a mathematical derivation of our loss function as a consequence of imposing this condition on a directed graphical model. By setting W to be low dimensional we can easily analyze the learned representations and the effect of w on data generation.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n(a) CondVAE\n\n(b) CondVAE-info\n\n(c) CSVAE (ours)\n\nFigure 1: The encoder (left) and decoder (right) for each of the baselines and our model. Shaded nodes represent conditioning variables. Dotted arrows represent adversarially trained prediction networks used to minimize mutual information between variables.\n\nThe aim of our CSVAE model is twofold:\n\n1. Learn higher-dimensional latent features correlated with binary labels in the data.\n2. Represent these features using a subspace that is easy to interpret and manipulate when generating or modifying data.\n\nWe demonstrate these capabilities on the Toronto Faces Dataset (TFD) [23] and the CelebA face dataset [15] by comparing our model to baseline models including a conditional VAE [11, 21] and a VAE with adversarial information minimization but no latent space factorization [3]. We find through quantitative and qualitative evaluation that the CSVAE is better able to capture intra-class variation and learns a richer yet easily manipulable latent subspace in which attribute style transfer can easily be performed.\n\n2 Related Work\n\nThere are two main lines of work relevant to our approach as underscored by the dual aims of our model listed in the introduction. 
The first of these seeks to introduce useful structure into the latent representations of generative models such as Variational Autoencoders (VAEs) [10, 20] and Generative Adversarial Networks (GANs) [7]. The second utilizes trained machine learning models to manipulate and generate data in a controllable way, often in the form of images.\n\nIncorporating structure into representations. A common approach is to make use of labels by directly defining them as latent variables in the model [11, 22]. Beyond providing an explicit variable for the labelled feature this yields no other easily interpretable structure, such as discovering features correlated to the labels, as our model does. This is the case also with other methods of structuring latent space which have been explored, such as batching data according to labels [12] or use of a discriminator network in a non-generative model [13]. Though not as relevant to our setting, we note there is also recent work on discovering latent structure in an unsupervised fashion [2, 9].\n\nAn important aspect of our model used in structuring the latent space is mutual information minimization between certain latent variables. There are other works which use this idea in various ways. In [3] an adversarial network similar to the one in this paper is used, but it minimizes information between the latent space of a VAE and the feature labels (see Section 3.3). In [16] independence between latent variables is enforced by minimizing maximum mean discrepancy; it is an interesting question, which we have not pursued here, what effect their method would have in our model. Other works which utilize adversarial methods in learning latent representations which are not as directly comparable to ours include [4, 5, 17].\n\nData manipulation and generation. There are also several works that specifically consider transferring attributes in images as we do here. 
The works [26], [24], and [25] all consider this task, in which attributes from a source image are transferred onto a target image. These models can perform attribute transfer between images (e.g. “splice the beard style of image A onto image B”), but only through interpolation between existing images. Once trained, our model can modify an attribute of a single given image to any style encoded in the subspace.\n\n3 Background\n\n3.1 Variational Autoencoder (VAE)\n\nThe variational autoencoder (VAE) [10, 20] is a widely-used generative model on top of which our model is built. VAEs are trained to maximize a lower bound on the marginal log-likelihood log pθ(x) over the data by utilizing a learned approximate posterior qφ(z|x):\n\nlog pθ(x) ≥ Eqφ(z|x) [log pθ(x|z)] − DKL (qφ(z|x) ‖ p(z))   (1)\n\nOnce training is complete, the approximate posterior qφ(z|x) functions as an encoder which maps the data x to a lower dimensional latent representation.\n\n3.2 Conditional VAE (CondVAE)\n\nA conditional VAE [11, 21] (CondVAE) is a supervised variant of a VAE which models a labelled dataset. It conditions the latent representation z on another variable y representing the labels. The modified objective becomes:\n\nlog pθ (x|y) ≥ Eqφ(z|x,y) [log pθ (x|z, y)] − DKL (qφ (z|x, y) ‖ p (z))   (2)\n\nThis model provides a method of structuring the latent space. By encoding the data and modifying the variable y before decoding it is possible to manipulate the data in a controlled way. A diagram showing the encoder and decoder is in Figure 1a.\n\n3.3 Conditional VAE with Information Factorization (CondVAE-info)\n\nThe objective function of the conditional VAE can be augmented by an additional network rψ(z) as in [3] which is trained to predict y from z while qφ (z|x) is trained to minimize the accuracy of rψ. 
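The bounds (1) and (2) can be computed in closed form when the posterior and prior are diagonal Gaussians. As a concrete illustration, a minimal numpy sketch (not the paper's implementation; the function names and the unit-variance Gaussian decoder are our own assumptions):

```python
import numpy as np

def gaussian_kl(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    # the second term of the bound in Eq. (1).
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def neg_elbo(x, x_recon, mu, logvar):
    # Negative of the bound in Eq. (1) for a single posterior sample,
    # assuming a unit-variance Gaussian decoder (up to an additive constant).
    recon_nll = 0.5 * np.sum((x - x_recon) ** 2, axis=-1)
    return recon_nll + gaussian_kl(mu, logvar)
```

The conditional bound (2) has the same form, with the encoder and decoder additionally conditioned on y.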
In addition to the objective function (2) (with qφ (z|x, y) replaced with qφ (z|x)), the model optimizes\n\nmax_φ min_ψ L(rψ(qφ (z|x)), y)   (3)\n\nwhere L denotes the cross entropy loss. This removes information correlated with y from z, but the encoder does not use y and the generative network p (x|z, y) must use the one-dimensional variable y to reconstruct the data, which is suboptimal as we demonstrate in our experiments. We denote this model by CondVAE-info (diagram in Figure 1b). In the next section we will give a mathematical derivation of the loss (3) as a consequence of a mutual information condition on a probabilistic graphical model.\n\n4 Model\n\n4.1 Conditional Subspace VAE (CSVAE)\n\nSuppose we are given a dataset D of elements (x, y) with x ∈ Rn and y ∈ Y = {0, 1}k representing k features of x. Let H = Z × W = Z × ∏_{i=1}^k Wi denote a probability space which will be the latent space of our model. Our goal is to learn a latent representation of our data which encodes all the information related to feature i labelled by yi exactly in the subspace Wi.\n\nWe will do this by maximizing a form of variational lower bound on the marginal log likelihood of our model, along with minimizing the mutual information between Z and Y. We parameterize the joint log-likelihood and decompose it as:\n\nlog pθ,γ (x, y, w, z) = log pθ (x|w, z) + log p (z) + log pγ (w|y) + log p (y)   (4)\n\nwhere we are assuming that Z is independent from W and Y, and X | W is independent from Y. Given an approximate posterior qφ (z, w|x, y) we use Jensen's inequality to obtain the variational lower bound\n\nlog pθ,γ (x, y) = log Eqφ(z,w|x,y) [pθ,γ (x, y, w, z) /qφ (z, w|x, y)] ≥ Eqφ(z,w|x,y) [log pθ,γ (x, y, w, z) /qφ (z, w|x, y)] .\n\nFigure 2: Left: The swiss roll and its reconstruction by CSVAE. 
Right: Projections onto the axis planes of the latent space of CSVAE trained on the swiss roll, color coded by labels. The data overlaps in Z, making it difficult for the model to determine the label of a data point from this projection alone. Conversely the data is separated in W by its label.\n\nUsing (4) and taking the negative gives an upper bound on − log pθ,γ (x, y) of the form\n\nm1 (x, y) = −Eqφ(z,w|x,y) [log pθ (x|w, z)] + DKL (qφ (w|x, y) ‖ pγ (w|y)) + DKL (qφ (z|x, y) ‖ p (z)) − log p (y) .\n\nThus we obtain the first part of our objective function:\n\nM1 = ED(x,y) [m1 (x, y)]   (5)\n\nWe derived (5) using the assumption that Z is independent from Y, but in practice minimizing this objective will not imply that our model satisfies this condition. Thus we also minimize the mutual information\n\nI (Y ; Z) = H (Y ) − H (Y |Z)\n\nwhere H (Y |Z) is the conditional entropy. Since the prior on Y is fixed this is equivalent to maximizing the conditional entropy\n\nH (Y |Z) = −∫∫_{Z,Y} p (z) p (y|z) log p (y|z) dy dz = −∫∫∫_{Z,Y,X} p (z|x) p (x) p (y|z) log p (y|z) dx dy dz.\n\nSince the integral over Z is intractable, to approximate this quantity we use approximate posteriors qδ (y|z) and qφ (z|x) and instead average over the empirical data distribution, replacing H (Y |Z) with\n\n−ED(x) [∫∫_{Z,Y} qφ (z|x) qδ (y|z) log qδ (y|z) dy dz] .\n\nThus we let the second part of our objective function be\n\nM2 = Eqφ(z|x)D(x) [∫_Y qδ (y|z) log qδ (y|z) dy]\n\nso that minimizing M2 maximizes the entropy estimate above. Finally, computing M2 requires learning the approximate posterior qδ (y|z). 
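For a binary label the inner integral over Y is a finite sum, so the M2 term reduces to an average of q log q over classifier outputs at sampled codes z. A minimal numpy sketch (the function name and batch layout are our own illustration, not the paper's code):

```python
import numpy as np

def m2_term(pi, eps=1e-8):
    # pi: (batch, K) probabilities q_delta(y|z) evaluated at samples
    # z ~ q_phi(z|x). Returns the Monte Carlo estimate of
    # E[ sum_y q(y|z) log q(y|z) ], i.e. minus the conditional entropy.
    # The encoder minimizes this, pushing q_delta(y|z) toward uniform
    # so that z carries no information about y.
    return np.mean(np.sum(pi * np.log(pi + eps), axis=-1))
```

Uniform predictions give the minimum value −log K; confident predictions give values near 0.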
Hence we let\n\nN = Eq(z|x)D(x,y) [qδ (y|z)] .\n\nThus the complete objective function consists of two parts,\n\nmin_{θ,φ,γ} β1 M1 + β2 M2   and   max_δ β3 N\n\nwhere the βi are weights which we treat as hyperparameters. We train these parts jointly. The terms M2 and N can be viewed as constituting an adversarial component in our model, where qδ (y|z) attempts to predict the label y given z, and qφ (z|x) attempts to generate z which prevent this. A diagram of our CSVAE model is shown in Figure 1c.\n\n(a) CSVAE\n\n(b) CondVAE\n\n(c) CondVAE-info\n\n(d) Sampling Grid for CSVAE\n\nFigure 3: Images generated by each of the models when manipulating the glasses and facial hair attribute on CelebA-Glasses and CelebA-FacialHair. For CSVAE the points in the subspace Wi corresponding to each image are visualized in (d) along with the posterior distribution over the test set. For CondVAE and CondVAE-info the points are chosen uniformly in the range [0, 3]. CSVAE generates a larger variety of glasses and facial hair.\n\n4.2 Implementation\n\nIn practice we use Gaussian MLPs to represent distributions over relevant random variables: qφ1(z|x) = N (z|µφ1(x), σφ1(x)), qφ2(w|x, y) = N (w|µφ2(x, y), σφ2 (x, y)), and pθ(x|w, z) = N (µθ(w, z), σθ(w, z)). Furthermore qδ (y|z) = Cat (y|πδ (z)). Finally for fixed choices µ1, σ1, µ2, σ2, for each i = 1, . . . , k we let\n\np (wi|yi = 1) = N (µ1, σ1)\np (wi|yi = 0) = N (µ2, σ2) .\n\nThese choices are arbitrary and in all of our experiments we choose Wi = R2 for all i. 
Hence we let µ1 = (0, 0), σ1 = (0.1, 0.1) and µ2 = (3, 3), σ2 = (1, 1). This implies points x with yi = 1 will be encoded away from the origin in Wi and at 0 in Wj for all j ≠ i. These choices are motivated by the goal that our model should provide a way of switching an attribute on or off. Other choices are possible but we did not explore alternate priors in the current work.\n\nIt will be helpful to introduce the following notation. If we let wi be the projection of w ∈ W onto Wi then we will denote the corresponding factor of qφ2 (w|x, y) as q^i_φ2 (wi|x, y) = N (wi|µ^i_φ2 (x, y), σ^i_φ2 (x, y)).\n\n4.3 Attribute Manipulation\n\nWe expect that the subspaces Wi will encode higher dimensional information underlying the binary label yi. In this sense the model gives a form of semi-supervised feature extraction.\n\nThe most immediate utility of this model is for the task of attribute manipulation in images. By setting the subspaces Wi to be low-dimensional, we gain the ability to visualize the posterior for the corresponding attribute explicitly, as well as efficiently explore it and its effect on the generative distribution p (x|z, w).\n\nWe now describe the method used by each of our models to change the label of x ∈ X from i to j, by defining an attribute switching function Gij. We refer to Section 3 for the definitions of the baseline models.\n\n(a) CSVAE\n\n(b) CondVAE\n\n(c) CondVAE-info\n\n(d) Sampling Grid for CSVAE\n\nFigure 4: The analog of the results of Figure 3 on TFD for manipulating the happy and disgust expressions (a single model was used for all expressions). CSVAE again learns a larger variety of these expressions than the baseline models. The remaining expressions can be seen in Figure 8.\n\nVAE: For each i = 1, . . . , k let Si be the set of (x, y) ∈ D with yi = 1. 
Let mi be the mean of the elements of Si encoded in the latent space, that is, ESi [µφ (x)]. Then we define the attribute switching function\n\nGij (x) = µθ (µφ (x) − mi + mj) .\n\nThat is, we encode the data, perform vector arithmetic in the latent space, and then decode.\n\nCondVAE and CondVAE-info: Let y1 be a one-hot vector with y1_j = 1. For (x, y) ∈ D and p ∈ R we define\n\nGij (x, y, p) = µθ (µφ (x, y) , p y1) .\n\nThat is, we encode the data using its original label, and then switch the label and decode it. We can scale the changed label to obtain varying intensities of the desired attribute.\n\nCSVAE: Let p = (p1, . . . , pk) ∈ ∏_{i=1}^k Wi be any vector with pl = 0 for l ≠ j. For (x, y) ∈ D we define\n\nGij (x, p) = µθ (µφ1 (x) , p) .\n\nThat is, we encode the data into the subspace Z, select any point p in W, and then decode the concatenated vector. Since Wi can be high dimensional this affords us additional freedom in attribute manipulation through the choice of pj ∈ Wj.\n\nIn our experiments we will want to compare the values of Gij (x, p) for many choices of p. We use the following two methods of searching W. If each Wi is 2-dimensional we can generate a grid of points centered at µ2 (defined in Section 4.2). In the case when Wi is higher dimensional this becomes inefficient. We can alternately compute the principal components in Wi of the set {µφ2 (x, y) | yi = 1} and generate a list of linear combinations to be used instead.\n\n5 Experiments\n\n5.1 Toy Data: Swiss Roll\n\nIn order to gain intuition about the CSVAE, we first train this model on the Swiss Roll, a dataset commonly used to test dimensionality reduction algorithms. 
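For the plain VAE baseline, the switching function Gij above is simply vector arithmetic with class means in latent space. A minimal numpy sketch of the latent-space step (the function and variable names are our own; the decoder mean µθ would then be applied to the result):

```python
import numpy as np

def switch_attribute(z, Z, y, i, j):
    # Latent-space part of G_ij for the plain VAE baseline:
    # subtract the mean latent code m_i of class i and add the mean m_j
    # of class j. z is the input's latent code, Z is an (N, d) array of
    # latent codes for the dataset, and y the (N,) integer class labels.
    m_i = Z[y == i].mean(axis=0)
    m_j = Z[y == j].mean(axis=0)
    return z - m_i + m_j
```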
This experiment will demonstrate explicitly how our model structures the latent space in a low dimensional example which can be visualized.\n\nModel         | TFD     | CelebA-Glasses | CelebA-FacialHair\nVAE           | 19.08%  | 25.03%         | 49.81%\nCondVAE       | 62.97%  | 96.04%         | 88.93%\nCondVAE-info  | 62.27%  | 95.16%         | 88.03%\nCSVAE (ours)  | 76.23%  | 99.59%         | 97.75%\n\nTable 1: Accuracy of expression and attribute classifiers on images changed by each model. CSVAE shows best performance.\n\nWe generate this data using the Scikit-learn [19] function make_swiss_roll with n_samples = 10000. We furthermore assign each data point (x, y, z) the label 0 if x < 10, and 1 if x > 10, splitting the roll in half. We train our CSVAE with Z = R2 and W = R2.\n\nThe projections of the latent space are visualized in Figure 2. The projection onto (z2, w1) shows the whole swiss roll in familiar form embedded in latent space, while the projections onto Z and W show how our model encodes the data to satisfy its constraints. The data overlaps in Z, making it difficult for the model to determine the label of a data point from this projection alone. Conversely the data is separated in W by its label, with the points labelled 1 mapping near the origin.\n\n5.2 Datasets\n\n5.2.1 Toronto Faces Dataset (TFD)\n\nThe Toronto Faces Dataset [23] consists of approximately 120,000 grayscale face images partially labelled with expressions (expression labels include anger, disgust, fear, happy, sad, surprise, and neutral) and identity. Since our model requires labelled data, we assigned expression labels to the unlabelled subset as follows. A classifier was trained on the labelled subset (around 4000 examples) and applied to each unlabelled point. If the classifier assigned some label at least a 0.9 probability the data point was included with that label; otherwise it was discarded. 
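The pseudo-labelling rule just described (keep an unlabelled example only when the classifier puts at least 0.9 probability on some class) can be sketched as follows (a numpy illustration; the function name is ours, not from the paper's code):

```python
import numpy as np

def confidence_filter(probs, threshold=0.9):
    # probs: (N, K) classifier outputs on the unlabelled images.
    # Keep an example only if some class receives probability >= threshold,
    # assigning that class as its pseudo-label; otherwise discard it.
    keep = probs.max(axis=1) >= threshold
    return np.flatnonzero(keep), probs.argmax(axis=1)[keep]
```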
This resulted in a fully labelled dataset of approximately 60000 images (note the identity labels were not extended in this way). This data was randomly split into train, validation, and test sets in 80%/10%/10% proportions (preserving the proportions of originally labelled data in each split).\n\n5.2.2 CelebA\n\nCelebA [15] is a dataset of approximately 200,000 images of celebrity faces with 40 labelled attributes. We filter this data into two separate datasets which each focus on a particular attribute of interest. This is done for improved image quality for all the models and for faster training time. All the images are cropped as in [14] and resized to 64 × 64 pixels.\n\nWe prepare two main subsets of the dataset: CelebA-Glasses and CelebA-FacialHair. CelebA-Glasses contains all images labelled with the attribute glasses and twice as many images without. CelebA-FacialHair contains all images labelled with at least one of the attributes beard, mustache, goatee and an equal number of images without. Each version of the dataset therefore contains a single binary label denoting the presence or absence of the corresponding attribute. This dataset construction procedure is applied independently to each of the training, validation and test splits.\n\nWe additionally create a third subset called CelebA-GlassesFacialHair which contains the images from the previous two subsets along with the binary labels for both attributes. Thus it is a dataset with multiple binary labels, but unlike in the TFD dataset these labels are not mutually exclusive.\n\n5.3 Qualitative Evaluation\n\nOn each dataset we compare four models: a standard VAE, a conditional VAE (denoted here by CondVAE), a conditional VAE with information factorization (denoted here by CondVAE-info), and our model (denoted CSVAE). We refer to Section 3 for the precise definitions of the baseline models.\n\nWe examine generated images under several style-transfer settings. 
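The subset construction described in Section 5.2.2 can be sketched as follows (a numpy illustration under our own naming; the paper does not publish this code):

```python
import numpy as np

def build_attribute_subset(attr, neg_ratio=2, seed=0):
    # attr: boolean array over the images, True where the attribute
    # (e.g. glasses) is present. Keeps every positive image and
    # neg_ratio times as many negatives, as in CelebA-Glasses
    # (CelebA-FacialHair would use neg_ratio=1). Returns the selected
    # indices and their binary labels.
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(attr)
    neg_pool = np.flatnonzero(~attr)
    n_neg = min(len(neg_pool), neg_ratio * len(pos))
    neg = rng.choice(neg_pool, size=n_neg, replace=False)
    idx = np.concatenate([pos, neg])
    return idx, attr[idx].astype(int)
```

Applying this independently to the train, validation, and test splits mirrors the procedure in the text.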
We consider both attribute transfer, in which the goal is to transfer a specific style of an attribute to the generated image, and identity transfer, where the goal is to transfer the style of a specific image onto an image with a different identity.\n\nFigure 3 shows the result of manipulating the glasses and facial hair attributes for a fixed subject using each model, following the procedure described in Section 4.3. CSVAE can generate a larger variety of both attributes than the baseline models. On CelebA-Glasses we see a variety of rims and different styles of sunglasses. On CelebA-FacialHair we see both mustaches and beards of varying thickness. Figure 4 shows the analogous experiment on the TFD data. CSVAE can generate a larger variety of smiles, in particular teeth showing or not showing, and open mouth or closed mouth, and similarly for the disgust expression.\n\nWe also train a CSVAE on the joint CelebA-GlassesFacialHair dataset to show that it can independently manipulate attributes as above in the case where binary attribute labels are not mutually exclusive. The results are shown in Figure 5. Thus it can learn a variety of styles as before, and manipulate them simultaneously in a single image.\n\nFigure 6 shows the CSVAE model is capable of preserving the style of the given attribute over many identities, demonstrating that information about the given attribute is in fact disentangled from the Z subspace.\n\nFigure 5: Attribute transfer with a CSVAE on CelebA-GlassesFacialHair. From left to right: input image, reconstruction, Cartesian product of three representative glasses styles and facial hair styles. Additional attribute transfer results are provided in the supplementary material.\n\nFigure 6: Style transfer of facial hair and glasses across many identities using CSVAE.\n\n5.4 Quantitative Evaluation\n\nMethod 1: We train a classifier C : X → {1, . . . , K} which predicts the label y from x for (x, y) ∈ D and evaluate its accuracy on data points with attributes changed using the model as described in Section 4.3.
A shortcoming of this evaluation method is that it does not penalize images Gij (x, y, pj) which have large negative log-likelihood under the model, or are qualitatively poor, as long as the classifier can detect the desired attribute. For example, setting pj to be very large will increase the accuracy of C long after the generated images have decreased drastically in quality. Hence we follow the standard practice used in the literature of setting pj = 1 for the models CondVAE and CondVAE-info, and set pj to the empirical mean ESj [µ^j_φ2 (x)] over the validation set for CSVAE, in analogy with the other models. Even when we do not utilize the full expressive power of our model, CSVAE shows better performance.\n\nTable 1 shows the results of this evaluation on each dataset. CSVAE obtains a higher classification accuracy than the other models. Interestingly there is not much performance difference between CondVAE and CondVAE-info, showing that the information factorization loss on its own does not improve model performance much.\n\nMethod 2: We apply this method to the TFD dataset, which comes with a subset labelled with identities. For a fixed identity t let Si,t ⊂ Si be the subset of the data with attribute label i and identity t. Then\n\nModel         | target − changed | original − changed | target − original\nVAE           | 75.8922          | 13.4122            | 91.2093\nCondVAE       | 74.3354          | 18.3365            | 91.2093\nCondVAE-info  | 74.3340          | 18.7964            | 91.2093\nCSVAE (ours)  | 71.0858          | 28.1997            | 91.2093\n\nTable 2: MSE between ground truth image and image changed by model for each subject and expression. 
CSVAE exhibits the largest change from the original while getting closest to the ground truth.\n\nover all attribute label pairs i, j with i ≠ j and identities t we compute the mean-squared error\n\nL1 (p) = ∑_{i,j,t : i≠j} ∑_{x1∈Si,t, x2∈Sj,t} (x2 − Gij (x1, y1, pj))^2 .   (6)\n\nIn this case for each model we choose the points pj which minimize this loss over the validation set. The value of L1 is shown in Table 2. CSVAE shows a large improvement relative to that of CondVAE and CondVAE-info over VAE. At the same time it makes the largest change to the original image.\n\n6 Conclusion\n\nWe have proposed the CSVAE model as a deep generative model to capture intra-class variation using a latent subspace associated with each class. We demonstrated through qualitative experiments on TFD and CelebA that our model successfully captures a range of variations associated with each class. We also showed through quantitative evaluation that our model is able to more faithfully perform attribute transfer than baseline models. In future work, we plan to extend this model to the semi-supervised setting, in which some of the attribute labels are missing.\n\nAcknowledgements\n\nWe would like to thank Sageev Oore for helpful discussions. This research was supported by Samsung and the Natural Sciences and Engineering Research Council of Canada.\n\nReferences\n\n[1] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.\n\n[2] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.\n\n[3] Antonia Creswell, Anil A Bharath, and Biswa Sengupta. 
Conditional autoencoders with adversarial\n\ninformation factorization. arXiv preprint arXiv:1711.05175, 2017.\n\n[4] Harrison Edwards and Amos Storkey. Censoring representations with an adversary. arxiv preprint\n\narXiv:1511.05897, 2015.\n\n[5] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Fran\u00e7ois Laviolette,\nMario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of\nMachine Learning Research 2016, vol. 17, p. 1-35, 2015.\n\n[6] Rafael G\u00f3mez-Bombarelli, Jennifer N Wei, David Duvenaud, Jos\u00e9 Miguel Hern\u00e1ndez-Lobato, Benjam\u00edn\nS\u00e1nchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams,\nand Al\u00e1n Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of\nmolecules. ACS Central Science, 2016.\n\n[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron\nCourville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing\nsystems, pages 2672\u20132680, 2014.\n\n[8] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and\nAaron Courville. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013,\n2016.\n\n[9] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir\nMohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational\nframework. In International Conference on Learning Representations, 2017.\n\n[10] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,\n\n2013.\n\n[11] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised\nlearning with deep generative models. 
In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.\n\n[12] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.\n\n[13] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, et al. Fader networks: Manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, pages 5969–5978, 2017.\n\n[14] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.\n\n[15] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.\n\n[16] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.\n\n[17] Michael Mathieu, Junbo Zhao, Pablo Sprechmann, Aditya Ramesh, and Yann LeCun. Disentangling factors of variation in deep representations using adversarial training, 2016.\n\n[18] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.\n\n[19] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.\n\n[20] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. 
arXiv preprint arXiv:1401.4082, 2014.\n\n[21] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.\n\n[22] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.\n\n[23] Josh M Susskind, Adam K Anderson, and Geoffrey E Hinton. The toronto face database. Department of Computer Science, University of Toronto, Toronto, ON, Canada, Tech. Rep, 3, 2010.\n\n[24] Taihong Xiao, Jiapeng Hong, and Jinwen Ma. Dna-gan: Learning disentangled representations from multi-attribute images. International Conference on Learning Representations, Workshop, 2018.\n\n[25] Taihong Xiao, Jiapeng Hong, and Jinwen Ma. Elegant: Exchanging latent encodings with gan for transferring multiple face attributes. arXiv preprint arXiv:1803.10562, 2018.\n\n[26] Shuchang Zhou, Taihong Xiao, Yi Yang, Dieqiao Feng, Qinyao He, and Weiran He. Genegan: Learning object transfiguration and attribute subspace from unpaired data. In Proceedings of the British Machine Vision Conference (BMVC), 2017. URL http://arxiv.org/abs/1705.04932.\n", "award": [], "sourceid": 3170, "authors": [{"given_name": "Jack", "family_name": "Klys", "institution": "University of Toronto"}, {"given_name": "Jake", "family_name": "Snell", "institution": "University of Toronto, Vector Institute"}, {"given_name": "Richard", "family_name": "Zemel", "institution": "Vector Institute/University of Toronto"}]}