{"title": "Learning Disentangled Joint Continuous and Discrete Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 710, "page_last": 720, "abstract": "We present a framework for learning disentangled and interpretable jointly continuous and discrete representations in an unsupervised manner. By augmenting the continuous latent distribution of variational autoencoders with a relaxed discrete distribution and controlling the amount of information encoded in each latent unit, we show how continuous and categorical factors of variation can be discovered automatically from data. Experiments show that the framework disentangles continuous and discrete generative factors on various datasets and outperforms current disentangling methods when a discrete generative factor is prominent.", "full_text": "Learning Disentangled Joint Continuous and Discrete\n\nRepresentations\n\nSchlumberger Software Technology Innovation Center\n\nEmilien Dupont\n\nMenlo Park, CA, USA\n\ndupont@slb.com\n\nAbstract\n\nWe present a framework for learning disentangled and interpretable jointly continu-\nous and discrete representations in an unsupervised manner. By augmenting the\ncontinuous latent distribution of variational autoencoders with a relaxed discrete\ndistribution and controlling the amount of information encoded in each latent unit,\nwe show how continuous and categorical factors of variation can be discovered\nautomatically from data. Experiments show that the framework disentangles con-\ntinuous and discrete generative factors on various datasets and outperforms current\ndisentangling methods when a discrete generative factor is prominent.\n\n1\n\nIntroduction\n\nDisentangled representations are de\ufb01ned as ones where a change in a single unit of the representation\ncorresponds to a change in single factor of variation of the data while being invariant to others (Bengio\net al. (2013)). 
For example, a disentangled representation of 3D objects could contain a set of units\neach corresponding to a distinct generative factor such as position, color or scale. Most recent work\non learning disentangled representations has focused on modeling continuous factors of variation\n(Higgins et al. (2016); Kim & Mnih (2018); Chen et al. (2018)). However, a large number of datasets\ncontain inherently discrete generative factors which can be dif\ufb01cult to capture with these methods. In\nimage data for example, distinct objects or entities would most naturally be represented by discrete\nvariables, while their position or scale might be represented by continuous variables.\nSeveral machine learning tasks, including transfer learning and zero-shot learning, can bene\ufb01t from\ndisentangled representations (Lake et al. (2017)). Disentangled representations have also been applied\nto reinforcement learning (Higgins et al. (2017a)) and for learning visual concepts (Higgins et al.\n(2017b)). Further, in contrast to most representation learning algorithms, disentangled representations\nare often interpretable since they align with factors of variation of the data. Different approaches\nhave been explored for semi-supervised or supervised learning of factored representations (Kulkarni\net al. (2015); Whitney et al. (2016); Yang et al. (2015); Reed et al. (2014)). These approaches achieve\nimpressive results but either require knowledge of the underlying generative factors or other forms of\nweak supervision. Several methods also exist for unsupervised disentanglement with the two most\nprominent being InfoGAN and \u03b2-VAE (Chen et al. (2016); Higgins et al. (2016)). These frameworks\nhave shown promise in disentangling factors of variation in an unsupervised manner on a number of\ndatasets.\nInfoGAN (Chen et al. (2016)) is a framework based on Generative Adversarial Networks (Goodfellow\net al. 
(2014)) which disentangles generative factors by maximizing the mutual information between\na subset of latent variables and the generated samples. While this approach is able to model both\ndiscrete and continuous factors, it suffers from some of the shortcomings of Generative Adversarial\nNetworks (GAN), such as unstable training and reduced sample diversity. Recent improvements in\nthe training of GANs (Arjovsky et al. (2017); Gulrajani et al. (2017)) have mitigated some of these\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fissues, but stable GAN training still remains a challenge (and this is particularly challenging for\nInfoGAN as shown in Kim & Mnih (2018)). \u03b2-VAE (Higgins et al. (2016)), in contrast, is based on\nVariational Autoencoders (Kingma & Welling (2013); Rezende et al. (2014)) and is stable to train.\n\u03b2-VAE, however, can only model continuous latent variables.\nIn this paper we propose a framework, based on Variational Autoencoders (VAE), that learns disentan-\ngled continuous and discrete representations in an unsupervised manner. It comes with the advantages\nof VAEs, such as stable training, large sample diversity and a principled inference network, while\nhaving the \ufb02exibility to model a combination of continuous and discrete generative factors. We\nshow how our framework, which we term JointVAE, discovers independent factors of variation on\nMNIST, FashionMNIST (Xiao et al. (2017)), CelebA (Liu et al. (2015)) and Chairs (Aubry et al.\n(2014)). For example, on MNIST, JointVAE disentangles digit type (discrete) from slant, width and\nstroke thickness (continuous). In addition, the model\u2019s learned inference network can infer various\nproperties of data, such as the azimuth of a chair, in an unsupervised manner. 
The model can also be\nused for simple image editing, such as rotating a face in an image.\n\n2 Analysis of \u03b2-VAE\n\nWe derive our approach by modifying the \u03b2-VAE framework and augmenting it with a joint latent\ndistribution. \u03b2-VAEs model a joint distribution of the data x and a set of latent variables z and learn\ncontinuous disentangled representations by maximizing the objective\n\nL(\u03b8, \u03c6) = Eq\u03c6(z|x)[log p\u03b8(x|z)] \u2212 \u03b2DKL(q\u03c6(z|x) || p(z))\n\n(1)\n\nwhere the posterior or encoder q\u03c6(z|x) is a neural network with parameters \u03c6 mapping x into z,\nthe likelihood or decoder p\u03b8(x|z) is a neural network with parameters \u03b8 mapping z into x, and \u03b2\nis a positive constant. The loss is a weighted sum of a likelihood term Eq\u03c6(z|x)[log p\u03b8(x|z)], which\nencourages the model to encode the data x into a set of latent variables z which can efficiently\nreconstruct the data, and a second term that encourages the distribution of the inferred latents z\nto be close to some prior p(z). When \u03b2 = 1, this corresponds to the original VAE framework.\nHowever, when \u03b2 > 1, it is theorized that the increased pressure on the posterior q\u03c6(z|x) to match the\nprior p(z), combined with maximizing the likelihood term, gives rise to efficient and disentangled\nrepresentations of the data (Higgins et al. (2016); Burgess et al. (2017)).\nWe can derive further insights by analyzing the role of the KL divergence term in the objective (1).\nDuring training, the objective will be optimized in expectation over the data x. The KL term then\nbecomes (Makhzani & Frey (2017); Kim & Mnih (2018))\n\nEp(x)[DKL(q\u03c6(z|x) || p(z))] = I(x; z) + DKL(q(z) || p(z)) \u2265 I(x; z)\n\n(2)\n\ni.e., when taken in expectation over the data, the KL divergence term is an upper bound on the mutual\ninformation between the latents and the data (see appendix for proof and details). 
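As a concrete illustration of equations (1) and (2), the KL term has a closed form for a factorised Gaussian posterior, and its mini-batch mean estimates the mutual information upper bound. The following is a minimal NumPy sketch, not the paper's implementation; the encoder outputs below are synthetic placeholders:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)) per latent unit, in nats.

    mu, log_var: arrays of shape (batch, latent_dim), as an encoder
    q_phi(z|x) would output for a mini-batch.
    """
    return 0.5 * (mu ** 2 + np.exp(log_var) - log_var - 1.0)

# Synthetic encoder outputs standing in for a mini-batch of 64 examples
# with 10 continuous latent units (illustrative values only).
rng = np.random.default_rng(0)
mu = rng.normal(size=(64, 10))
log_var = rng.normal(scale=0.1, size=(64, 10))

kl_per_unit = gaussian_kl(mu, log_var)      # shape (64, 10)
kl_per_example = kl_per_unit.sum(axis=1)    # the KL term of equation (1)
mi_upper_bound = kl_per_example.mean()      # mini-batch estimate of the bound in (2)
```

Averaging `kl_per_unit` over the batch, one column at a time, also gives the per-unit information estimates used later in Section 4.2.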
Thus, a mini-batch\nestimate of the mean KL divergence is an estimate of the upper bound on the information z can\ntransmit about x.\nPenalizing the mutual information term improves disentanglement but comes at the cost of increased\nreconstruction error. Recently, several methods have been explored to improve the reconstruction\nquality without decreasing disentanglement (Burgess et al. (2017); Kim & Mnih (2018); Chen et al.\n(2018); Gao et al. (2018)). Burgess et al. (2017) in particular propose an objective where the upper\nbound on the mutual information is controlled and gradually increased during training. Denoting the\ncontrolled information capacity by C, the objective is defined as\n\nL(\u03b8, \u03c6) = Eq\u03c6(z|x)[log p\u03b8(x|z)] \u2212 \u03b3|DKL(q\u03c6(z|x) || p(z)) \u2212 C|\n\n(3)\n\nwhere \u03b3 is a constant which forces the KL divergence term to match the capacity C. Gradually\nincreasing C during training allows for control of the amount of information the model can encode.\nThis objective has been shown to improve reconstruction quality as compared to (1) without reducing\ndisentanglement (Burgess et al. (2017)).\n\n3 JointVAE Model\n\nWe propose a modification to the \u03b2-VAE framework which allows us to model a joint distribution of\ncontinuous and discrete latent variables. Letting z denote a set of continuous latent variables and c\ndenote a set of categorical or discrete latent variables, we define a joint posterior q\u03c6(z, c|x), prior\np(z, c) and likelihood p\u03b8(x|z, c). The \u03b2-VAE objective then becomes\n\nL(\u03b8, \u03c6) = Eq\u03c6(z,c|x)[log p\u03b8(x|z, c)] \u2212 \u03b2DKL(q\u03c6(z, c|x) || p(z, c))\n\n(4)\n\nwhere the latent distribution is now jointly continuous and discrete. Assuming the continuous\nand discrete latent variables are conditionally independent1, i.e. 
q\u03c6(z, c|x) = q\u03c6(z|x)q\u03c6(c|x) and\nsimilarly for the prior p(z, c) = p(z)p(c), we can rewrite the KL divergence as\n\nDKL(q\u03c6(z, c|x) || p(z, c)) = DKL(q\u03c6(z|x) || p(z)) + DKL(q\u03c6(c|x) || p(c))\n\n(5)\n\ni.e. we can separate the discrete and continuous KL divergence terms (see appendix for proof). Under\nthis assumption, the loss becomes\n\nL(\u03b8, \u03c6) = Eq\u03c6(z,c|x)[log p\u03b8(x|z, c)] \u2212 \u03b2DKL(q\u03c6(z|x) || p(z)) \u2212 \u03b2DKL(q\u03c6(c|x) || p(c))\n\n(6)\n\nIn our initial experiments, we found that directly optimizing this loss led to the model ignoring\nthe discrete latent variables. Similarly, gradually increasing the channel capacity as in equation (3)\nleads to the model assigning all capacity to the continuous channels. To overcome this, we split\nthe capacity increase: the capacities of the discrete and continuous latent channels are controlled\nseparately, forcing the model to encode information both in the discrete and continuous channels. The\nfinal loss is then given by\n\nL(\u03b8, \u03c6) = Eq\u03c6(z,c|x)[log p\u03b8(x|z, c)] \u2212 \u03b3|DKL(q\u03c6(z|x) || p(z)) \u2212 Cz| \u2212 \u03b3|DKL(q\u03c6(c|x) || p(c)) \u2212 Cc|\n\n(7)\n\nwhere Cz and Cc are gradually increased during training.\n\n3.1 Parametrization of continuous latent variables\n\nAs in the original VAE framework, we parametrize q\u03c6(z|x) by a factorised Gaussian, i.e. q\u03c6(z|x) =\n\u220fi q\u03c6(zi|x) where q\u03c6(zi|x) = N (\u00b5i, \u03c32i), and let the prior be a unit Gaussian p(z) = N (0, I). \u00b5\nand \u03c32 are both parametrized by neural networks.\n\n3.2 Parametrization of discrete latent variables\n\nParametrizing q\u03c6(c|x) is more difficult. Since q\u03c6(c|x) needs to be differentiable with respect to its\nparameters, we cannot parametrize q\u03c6(c|x) by a set of categorical distributions. Recently, Maddison\net al. 
(2016) proposed a differentiable relaxation of discrete random variables\nbased on the Gumbel Max trick (Gumbel (1954)). If c is a categorical variable with class probabilities\n\u03b11, \u03b12, ..., \u03b1n, then we can sample from a continuous approximation of the categorical distribution\nby sampling a set of gk \u223c Gumbel(0, 1) i.i.d. and applying the following transformation\n\nyk = exp((log \u03b1k + gk)/\u03c4 ) / \u2211i exp((log \u03b1i + gi)/\u03c4 )\n\n(8)\n\nwhere \u03c4 is a temperature parameter which controls the relaxation. The sample y is a continuous\napproximation of the one-hot representation of c. The relaxed discrete distribution is called a Concrete\nor Gumbel Softmax distribution and is denoted by g(\u03b1) where \u03b1 is a vector of class probabilities.\n\n1\u03b2-VAE assumes the data is generated by a fixed number of independent factors of variation, so all latent\nvariables are in fact conditionally independent. However, for the sake of deriving the JointVAE objective we\nonly require conditional independence between the continuous and discrete latents.\n\nWe can parametrize q\u03c6(c|x) by a product of independent Gumbel Softmax distributions, q\u03c6(c|x) =\n\u220fi q\u03c6(ci|x) where q\u03c6(ci|x) = g(\u03b1(i)) is a Gumbel Softmax distribution with class probabilities\n\u03b1(i). We let the prior p(c) be equal to a product of uniform Gumbel Softmax distributions. This\napproach allows us to use the reparametrization trick (Kingma & Welling (2013); Rezende et al.\n(2014)) and efficiently train the discrete model.\n\n3.3 Architecture\n\nThe final architecture of the JointVAE model is shown in Fig. 1. We build the encoder to output the\nparameters of the continuous distribution \u00b5 and \u03c32 and of each of the discrete distributions \u03b1(i). 
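The sampling step of equation (8) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; in the actual model the class probabilities are produced by the encoder and gradients flow through the relaxed sample:

```python
import numpy as np

def gumbel_softmax_sample(alpha, tau, rng):
    """Draw one sample from the Concrete / Gumbel-Softmax relaxation, eq. (8).

    alpha: class probabilities alpha_1, ..., alpha_n (sums to 1)
    tau:   temperature; low tau gives samples close to one-hot
    """
    g = rng.gumbel(loc=0.0, scale=1.0, size=len(alpha))  # g_k ~ Gumbel(0, 1) i.i.d.
    logits = (np.log(alpha) + g) / tau
    shifted = np.exp(logits - logits.max())  # max-shift for numerical stability
    return shifted / shifted.sum()           # y_k of equation (8)

rng = np.random.default_rng(0)
alpha = np.array([0.7, 0.2, 0.1])
y_warm = gumbel_softmax_sample(alpha, tau=5.0, rng=rng)  # smooth, far from one-hot
y_cold = gumbel_softmax_sample(alpha, tau=0.1, rng=rng)  # close to a one-hot vector
```

The max-shift before exponentiation is an added safeguard, valid because the softmax in (8) is invariant to shifting all logits by a constant.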
We then sample zi \u223c N (\u00b5i, \u03c32i) and ci \u223c g(\u03b1(i)) using the reparametrization trick and concatenate z\nand c into one latent vector which is passed as input to the decoder.\n\nFigure 1: JointVAE architecture. The input x is encoded by q\u03c6 into the parameters of the latent\ndistributions. Samples are drawn from each of the latent distributions using the reparametrization\ntrick (indicated by dashed arrows on the diagram). The samples are then concatenated and decoded\nthrough p\u03b8.\n\n3.4 Choice of and sensitivity to hyperparameters\n\nThe JointVAE loss in equation 7 depends on the hyperparameters \u03b3, Cc and Cz. While the choice of\nthese is ultimately empirical, there are various heuristics we can use to narrow the search. The value\nof \u03b3, for example, is chosen so that it is large enough to maintain the capacity at the desired level\n(e.g. large improvements in reconstruction error should not come at the cost of breaking the capacity\nconstraint). We found the model to be quite robust to changes in \u03b3. As the capacity of a discrete\nchannel is bounded, Cc is chosen to be the maximum capacity of the channel, encouraging the model\nto use all categories of the discrete distribution. Cz is more difficult to choose and is often chosen by\nexperiment to be the largest value where the representation is still disentangled (in a similar way that\n\u03b2 is chosen as the lowest value where the representation is still disentangled in \u03b2-VAE).\n\n4 Experiments\n\nWe perform experiments on several datasets including MNIST, FashionMNIST, CelebA and Chairs.\nWe parametrize the encoder by a convolutional neural network and the decoder by the same network,\ntransposed (for the full architecture and training details see appendix). 
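The capacity-controlled objective of equation (7), with the linearly increased Cz and Cc discussed in Section 3.4, can be sketched as follows. This is an illustrative snippet: gamma, the capacity maxima and the schedule length here are placeholder values, not the hyperparameters used in the paper (those are listed in its appendix):

```python
import numpy as np

def jointvae_loss(recon_log_lik, kl_continuous, kl_discrete, step,
                  gamma=30.0, cz_max=5.0, cc_max=np.log(10), anneal_steps=25000):
    """Objective of equation (7), to be maximized.

    recon_log_lik: E_q[log p(x|z, c)] over the mini-batch
    kl_continuous, kl_discrete: the two KL terms of equation (6), in nats
    Cz and Cc are increased linearly from 0 to cz_max / cc_max; for a
    10-way discrete channel the capacity is bounded by log(10) nats.
    """
    t = min(step / anneal_steps, 1.0)
    c_z = t * cz_max  # continuous channel capacity Cz
    c_c = t * cc_max  # discrete channel capacity Cc
    return (recon_log_lik
            - gamma * abs(kl_continuous - c_z)
            - gamma * abs(kl_discrete - c_c))
```

At step 0 both capacities are zero, so any KL divergence is penalized; late in training the two KL terms are instead pushed toward cz_max and cc_max, forcing information into both channels.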
The code, along with all\nexperiments and trained models presented in this paper, is available at https://github.com/\nSchlumberger/joint-vae.\n\n4\n\n...\fMNIST\n\nDisentanglement results and latent traversals for MNIST are shown in Fig. 2. The model was trained\nwith 10 continuous latent variables and one discrete 10-dimensional latent variable. The model\ndiscovers several factors of variation in the data, such as digit type (discrete), stroke thickness, angle\nand width (continuous) in an unsupervised manner. As can be seen from the latent traversals in Fig.\n2, the trained model is able to generate realistic samples for a large variety of latent settings. Fig. 4a\nshows digits generated by \ufb01xing the discrete latent and sampling the continuous latents from the prior\np(z) = N (0, 1), which can be interpreted as sampling from a distribution conditioned on digit type.\nAs can be seen, the samples are diverse, realistic and honor the conditioning.\nFor a large range of hyperparameters we were not able to achieve disentanglement using the purely\ncontinuous \u03b2-VAE framework (see Fig. 3). This is likely because MNIST has an inherently discrete\ngenerative factor (digit type), which \u03b2-VAE is unable to map onto a continuous latent variable. In\ncontrast, the JointVAE approach allows us to disentangle the discrete factors while maintaining\ndisentanglement of continuous factors. To the best of our knowledge, JointVAE is, apart from\nInfoGAN, the only framework which disentangles MNIST in a completely unsupervised manner and\nit does so in a more stable way than InfoGAN.\n\n(a) Angle (continuous)\n\n(b) Thickness (continuous)\n\n(c) Digit type (discrete)\n\n(d) Width (continuous)\n\nFigure 2: Latent traversals of the model trained on MNIST with 10 continuous latent variables and 1\ndiscrete latent variable. Each row corresponds to a \ufb01xed random setting of the latent variables and\nthe columns correspond to varying a single latent unit. 
Each sub\ufb01gure varies a different latent unit.\nAs can be seen each of the varied latent units corresponds to an interpretable generative factor, such\nas stroke thickness or digit type.\n\nFigure 3: Traversals of all latent dimensions on MNIST for JointVAE, \u03b2-VAE and \u03b2-VAE with\ncontrolled capacity increase (CC\u03b2-VAE). JointVAE is able to disentangle digit type from continuous\nfactors of variation like stroke thickness and angle, while digit type is entangled with continuous\nfactors for both \u03b2-VAE and CC\u03b2-VAE.\n\n5\n\n\f(a)\n\n(b)\n\n(c)\n\nFigure 4: (a) Samples conditioned on digit type. Each row shows samples from p\u03b8 where the discrete\nlatent variable is \ufb01xed and all other latent values are sampled from the prior. As can be seen each\nrow produces diverse samples of each digit. Note that digits which are similar, such as 5 and 8 are\nsometimes confused and not perfectly disentangled. (b) Samples conditioned on fashion item type.\nThe samples are diverse and largely disentangled. (c) Latent traversals of FashionMNIST model. The\nrows correspond to different settings of the discrete latent variable, while the columns correspond to\na traversal of the most informative continuous latent variable. Various factors are discovered, such as\nsleeve length, bag handle size, ankle height and shoe opening.\n\nFashionMNIST\n\nLatent traversals for FashionMNIST are shown in Fig. 4c. We also used 10 continuous and 1 discrete\nlatent variable for this dataset. FashionMNIST is harder to disentangle as the generative factors for\ncreating clothes are not as clear as the ones for drawing digits. However, JointVAE performs well\nand discovers interesting dimensions, such as sleeve length, heel size and shirt color. As some of\nthe classes of FashionMNIST are very similar (e.g. shirt and t-shirt are two different classes), not\nall classes are discovered. 
However, a significant amount of them are disentangled including dress,\nt-shirt, trousers, sneakers, bag, ankle boot and so on (see Fig. 4b).\n\nCelebA\n\nFor CelebA we used a model with 32 continuous latent variables and one 10-dimensional discrete\nlatent variable. As shown in Fig. 5, the JointVAE model discovers various factors of variation\nincluding azimuth, age and background color, while being able to generate realistic samples. Different\nsettings of the discrete variable correspond to different facial identities. While the samples are not as\nsharp as those produced by entangled models, we can still see details in the images such as distinct\nfacial features and skin tones (the trade-off between disentanglement and reconstruction quality is a\nwell-known problem which is discussed in Higgins et al. (2016); Burgess et al. (2017); Kim & Mnih\n(2018); Chen et al. (2018)).\n\nChairs\n\nFor the chairs dataset we used a model with 32 continuous latent variables and 3 binary discrete latent\nvariables. JointVAE discovers several factors of variation such as chair rotation, width and leg style.\nFurthermore, different settings of the discrete variables correspond to different chair types and colors.\n\n(a) Azimuth\n\n(b) Background color\n\n(c) Age\n\nFigure 5: Latent traversals of the model trained on CelebA. Each row corresponds to a fixed setting\nof the discrete latent variable and the columns correspond to varying a single continuous latent unit.\n\nModel | Score\n\u03b2-VAE | 0.73\nFactorVAE | 0.82\nJointVAE | 0.69\n\nFigure 6: Left: Disentanglement scores for various frameworks on the dSprites dataset. The scores\nare obtained by averaging scores over 10 different random seeds from the model with the best\nhyperparameters (removing outliers where the model collapsed to the mean). Right: Comparison of\nlatent traversals on the dSprites dataset. 
There are 4 continuous factors and 1 discrete factor in the\noriginal dataset and only JointVAE is able to encode all information into 4 continuous and 1 discrete\nlatent variables. Note that the final row of the JointVAE latent traversal corresponds to the discrete\nfactor of dimension 3, which is why the patterns repeat with a period of 3.\n\nWhile there is a well-defined discrete generative factor for datasets like MNIST and FashionMNIST,\nit is less clear what exactly would constitute a discrete factor of variation in datasets like CelebA and\nChairs. For example, for CelebA, JointVAE maps various facial identities onto the discrete latent\nvariable. However, facial identity is not necessarily discrete and it is possible that such a factor of\nvariation could also be mapped to a continuous latent variable. JointVAE has a clear advantage in\ndisentangling datasets where discrete factors are prominent (as shown in Fig. 3), but when this is not\nthe case, using frameworks that only disentangle continuous factors may be sufficient.\n\n4.1 Quantitative evaluation\n\nWe quantitatively evaluate our model on the dSprites dataset using the metric recently proposed by\nKim & Mnih (2018). Since the dataset is generated from 1 discrete factor (with 3 categories) and\n4 continuous factors, we used a model with 6 continuous latent variables and one 3-dimensional\ndiscrete latent variable. The results are shown in Fig. 6 (left). Even though the discrete factor in this\ndataset is not prominent (in the sense that the different categories have very small differences in pixel\nspace), our model is able to achieve scores close to the current best models. Further, as shown in\nFig. 6, our model learns meaningful latent representations. 
In particular, for the discrete factor of\nvariation, JointVAE is able to better separate the classes than other models.\n\n4.2 Detecting disentanglement in latent distributions\n\nAs noted in Section 2, taken in expectation over data, the KL divergence between the inferred latents\nq\u03c6(z, c|x) and the priors, upper bounds the mutual information between the latent units and the data.\nMotivated by this, we can plot the KL divergence values for each latent unit averaged over a mini\nbatch of data during training. As various factors of variation are discovered in the data, we would\nexpect the KL divergence of the corresponding latent units to increase. This is shown in Fig. 7a.\nAs the capacities Cz and Cc are increased, the model is able to encode more and more factors of\nvariation. For MNIST, the \ufb01rst factor to be discovered is digit type, followed by angle and width. This\nis likely because encoding digit type results in the largest reconstruction error reduction, followed by\nencoding angle and width and so on.\nAfter training, we can also measure the KL divergence of each latent unit on test data and rank the\nlatent units by their average KL values. This corresponds to ranking the latent units by how much\ninformation they are transmitting about x. Fig. 7b shows the ranked latent units for MNIST and\nChairs along with a latent traversal of each unit. As can be seen, the latent units with large information\ncontent encode various aspects of the data while latent units with approximately zero KL divergence\ndo not affect the output.\n\n7\n\n\f(a)\n\n(b)\n\n(c)\n\nFigure 7: (a) Increase of KL divergence during training. As the latent channel capacity is increased,\ndifferent factors of variation are discovered. 
Most of the latent units have a KL divergence of\napproximately zero throughout training, meaning they do not encode any information about the data.\nAs training progresses, however, some latent units start encoding more information about the data.\nEach latent unit can then be matched to a factor of variation of the data by visual inspection. (b, c)\nEach row corresponds to a latent traversal of a single latent unit. The column on the left shows the\nmean KL divergence value over a large number of examples (which corresponds to the amount of\ninformation encoded in that latent unit in nats). The rows are ordered from the latent unit with largest\nKL divergence to the lowest. As can be seen, large KL divergence values correspond to active latents\nwhich encode information about the data, whereas channels with low KL divergence values do not affect\nthe data.\n\n4.3 The inference network\n\nOne of the advantages of JointVAE is that it comes with an inference network q\u03c6(z, c|x). For\nexample, on MNIST we can infer the digit type on test data with 88.7% accuracy by simply looking\nat the value of the discrete latent variable q\u03c6(c|x). Of course, this is completely unsupervised and the\naccuracy could likely be increased dramatically by using some label information.\nSince we are learning several generative factors, the inference network can also be used to infer\nproperties which we do not have labels for. For example, the latent unit corresponding to azimuth on\nthe chairs dataset correlates well with the actual azimuth of unseen chairs. After training a model on\nthe chairs dataset and identifying the latent unit corresponding to azimuth, we can test the inference\nnetwork on images that were not used during training. This is shown in Fig. 8a. 
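The diagnostic of Section 4.2, ranking the continuous latent units by their mean KL divergence over held-out data, can be sketched as follows. This is an illustrative NumPy snippet on synthetic encoder outputs, not the paper's code:

```python
import numpy as np

def rank_latents_by_kl(mu, log_var):
    """Rank continuous latent units by mean KL to the N(0, I) prior.

    mu, log_var: encoder outputs over a batch of test data,
    shape (batch, latent_dim). Units with near-zero mean KL transmit
    (almost) no information about x and leave the output unchanged
    under traversal.
    """
    kl = 0.5 * (mu ** 2 + np.exp(log_var) - log_var - 1.0)  # per-unit KL in nats
    mean_kl = kl.mean(axis=0)
    order = np.argsort(mean_kl)[::-1]  # most informative unit first
    return order, mean_kl

# Synthetic batch: unit 0 is "active" (means spread out over the batch),
# unit 1 collapses to the prior.
rng = np.random.default_rng(0)
mu = np.stack([rng.normal(scale=3.0, size=256),
               rng.normal(scale=0.01, size=256)], axis=1)
log_var = np.zeros_like(mu)
order, mean_kl = rank_latents_by_kl(mu, log_var)
```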
As can be seen, the\nlatent unit corresponding to rotation infers the angle of the chair even though no labeled data was\ngiven (or available) for this task.\nThe framework can also be used to perform image editing or manipulation. If we wish to rotate the\nimage of a face, we can encode the face with q\u03c6, modify the latent corresponding to azimuth and\ndecode the resulting vector with p\u03b8. Examples of this are shown in Fig. 8b.\n\n4.4 Robustness and sensitivity to hyperparameters\n\nWhile our framework is robust with respect to different architectures and optimizers, it is, like most\nframeworks for unsupervised disentanglement, fairly sensitive to the choice of hyperparameters (all\nhyperparameters needed to reproduce the results in this paper are given in the appendix). Even with a\ngood choice of hyperparameters, the quality of disentanglement may vary based on the random seed.\nIn general, it is easy to achieve some degree of disentanglement for a large set of hyperparameters,\nbut achieving completely clean disentanglement (e.g. perfectly separating digit type from other generative\nfactors) can be difficult. It would be interesting to explore more principled approaches for choosing\nthe latent capacities and how to increase them, but we leave this for future work. Further, as mentioned\nin Section 4, when a discrete generative factor is not present or important, the framework may fail to\nlearn meaningful discrete representations. We have included some failure examples in Fig. 9.\n\n(a)\n\n(b)\n\nFigure 8: (a) Inference of azimuth on test data of chairs. The first row shows images of chairs from a\ntest set. The second row shows the inferred z for each of the images. As can be seen, the latent unit\nsuccessfully identifies rotation. (b) Image editing with JointVAE. An image of a celebrity is encoded\nwith q\u03c6. 
In the encoded space, we can then rotate the face, change the background color or change\nthe hair style by manipulating the latent unit corresponding to each factor. The bottom rows show\nthe decoded images when each latent factor is changed. The samples are not as sharp as the original\nimage, but these initial results show promise for using disentangled representations to edit images.\n\n(a)\n\n(b)\n\nFigure 9: Failure examples. (a) Background color is entangled with azimuth and hair length. (b)\nVarious clothing items are entangled with each other.\n\n5 Conclusion\n\nWe have proposed JointVAE, a framework for learning disentangled continuous and discrete repre-\nsentations in an unsupervised manner. The framework comes with the advantages of VAEs such as\nstable training and large sample diversity while being able to model complex jointly continuous and\ndiscrete generative factors. We have shown that JointVAE disentangles factors of variation on several\ndatasets while producing realistic samples. In addition, the inference network can be used to infer\nunlabeled quantities on test data and to edit and manipulate images.\nIn future work, it would be interesting to combine our approach with recent improvements of the\n\u03b2-VAE framework, such as FactorVAE (Kim & Mnih (2018)) or \u03b2-TCVAE (Chen et al. (2018)).\nGaining a deeper understanding of how disentanglement depends on the latent channel capacities and\nhow they are increased will likely provide insights to build more stable models. Finally, it would also\nbe interesting to explore the use of other latent distributions since the framework allows the use of\nany joint distribution of reparametrizable random variables.\n\nAcknowledgments\n\nThe author would like to thank Erik Burton, Jos\u00e9 Celaya, Suhas Suresha, Vishakh Hegde and the\nanonymous reviewers for helpful suggestions and comments that helped improve the paper.\n\n9\n\n\fReferences\nMartin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. 
Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.\n\nMathieu Aubry, Daniel Maturana, Alexei A Efros, Bryan C Russell, and Josef Sivic. Seeing 3d chairs:\nexemplar part-based 2d-3d alignment using a large dataset of cad models. In Proceedings of the\nIEEE conference on computer vision and pattern recognition, pp. 3762\u20133769, 2014.\n\nYoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new\nperspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798\u20131828,\n2013.\n\nChristopher Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins,\nand Alexander Lerchner. Understanding disentangling in beta-vae. NIPS 2017 Disentanglement\nWorkshop, 2017.\n\nTian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement\nin variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.\n\nXi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan:\nInterpretable representation learning by information maximizing generative adversarial nets. In\nAdvances in Neural Information Processing Systems, pp. 2172\u20132180, 2016.\n\nShuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan. Auto-encoding total correlation\nexplanation. arXiv preprint arXiv:1802.05822, 2018.\n\nIan Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,\nAaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672\u20132680, 2014.\n\nIshaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville.\nImproved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp.\n5769\u20135779, 2017.\n\nEmil Julius Gumbel. Statistical theory of extreme values and some practical applications. Nat. Bur.\nStandards Appl. Math. Ser. 
33, 1954.\n\nIrina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick,\nShakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a\nconstrained variational framework. ICLR 2017, 2016.\n\nIrina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel,\nMatthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot\ntransfer in reinforcement learning. arXiv preprint arXiv:1707.08475, 2017a.\n\nIrina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matthew Botvinick,\nDemis Hassabis, and Alexander Lerchner. Scan: learning abstract hierarchical compositional\nvisual concepts. arXiv preprint arXiv:1707.03389, 2017b.\n\nEric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv\npreprint arXiv:1611.01144, 2016.\n\nHyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983,\n2018.\n\nDiederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint\narXiv:1312.6114, 2013.\n\nTejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional\ninverse graphics network. In Advances in Neural Information Processing Systems, pp. 2539\u20132547,\n2015.\n\nBrenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building\nmachines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.\n\nZiwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In\nProceedings of International Conference on Computer Vision (ICCV), 2015.\n\nChris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous\nrelaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.\n\nAlireza Makhzani and Brendan J Frey. Pixelgan autoencoders. In Advances in Neural Information\nProcessing Systems, pp. 
1972\u20131982, 2017.\n\nScott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of\nvariation with manifold interaction. In International Conference on Machine Learning, pp. 1431\u2013\n1439, 2014.\n\nDanilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and\n\napproximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.\n\nWilliam F Whitney, Michael Chang, Tejas Kulkarni, and Joshua B Tenenbaum. Understanding visual\n\nconcepts with continuation learning. arXiv preprint arXiv:1602.06822, 2016.\n\nHan Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking\n\nmachine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.\n\nJimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling\nwith recurrent transformations for 3d view synthesis. In Advances in Neural Information Processing\nSystems, pp. 1099\u20131107, 2015.\n\n11\n\n\f", "award": [], "sourceid": 412, "authors": [{"given_name": "Emilien", "family_name": "Dupont", "institution": "Oxford University"}]}