{"title": "Dual Swap Disentangling", "book": "Advances in Neural Information Processing Systems", "page_first": 5894, "page_last": 5904, "abstract": "Learning interpretable disentangled representations is a crucial yet challenging task. In this paper, we propose a weakly semi-supervised method, termed as Dual Swap Disentangling (DSD), for disentangling using both labeled and unlabeled data. Unlike conventional weakly supervised methods that rely on full annotations on the group of samples, we require only limited annotations on paired samples that indicate their shared attribute like the color. Our model takes the form of a dual autoencoder structure. To achieve disentangling using the labeled pairs, we follow a ``encoding-swap-decoding'' process, where we first swap the parts of their encodings corresponding to the shared attribute, and then decode the obtained hybrid codes to reconstruct the original input pairs. For unlabeled pairs, we follow the ``encoding-swap-decoding'' process twice on designated encoding parts and enforce the final outputs to approximate the input pairs. By isolating parts of the encoding and swapping them back and forth, we impose the dimension-wise modularity and portability of the encodings of the unlabeled samples, which implicitly encourages disentangling under the guidance of labeled pairs. This dual swap mechanism, tailored for semi-supervised setting, turns out to be very effective. 
Experiments on image datasets from a wide range of domains show that our model yields state-of-the-art disentangling performance.", "full_text": "Dual Swap Disentangling

Zunlei Feng
Zhejiang University
zunleifeng@zju.edu.cn

Xinchao Wang
Stevens Institute of Technology
xinchao.wang@stevens.edu

Chenglong Ke
Zhejiang University
chenglongke@zju.edu.cn

Anxiang Zeng
Alibaba Group
renzhong@taobao.com

Dacheng Tao
University of Sydney
dctao@sydney.edu.au

Mingli Song*
Zhejiang University
brooksong@zju.edu.cn

Abstract

Learning interpretable disentangled representations is a crucial yet challenging task. In this paper, we propose a weakly semi-supervised method, termed Dual Swap Disentangling (DSD), for disentangling using both labeled and unlabeled data. Unlike conventional weakly supervised methods that rely on full annotations on the group of samples, we require only limited annotations on paired samples that indicate their shared attribute, such as color. Our model takes the form of a dual autoencoder structure. To achieve disentangling using the labeled pairs, we follow an "encoding-swap-decoding" process, where we first swap the parts of their encodings corresponding to the shared attribute, and then decode the obtained hybrid codes to reconstruct the original input pairs. For unlabeled pairs, we follow the "encoding-swap-decoding" process twice on designated encoding parts and enforce the final outputs to approximate the input pairs. By isolating parts of the encoding and swapping them back and forth, we impose dimension-wise modularity and portability on the encodings of the unlabeled samples, which implicitly encourages disentangling under the guidance of labeled pairs. 
This dual swap mechanism, tailored for the semi-supervised setting, turns out to be very effective. Experiments on image datasets from a wide range of domains show that our model yields state-of-the-art disentangling performance.

1 Introduction

Disentangling aims at learning dimension-wise interpretable representations from data. For example, given an image dataset of human faces, disentangling should produce representations or encodings in which each part corresponds to an interpretable attribute, such as facial expression, hairstyle, or eye color. It is therefore a vital step for many machine learning tasks, including transfer learning (Lake et al. [2017]), reinforcement learning (Higgins et al. [2017a]) and visual concept learning (Higgins et al. [2017b]).
Existing disentangling methods can be broadly classified into two categories: supervised approaches and unsupervised ones. Methods in the former category focus on utilizing annotated data to explicitly supervise the input-to-attribute mapping. Such supervision may take the form of partitioning the data into subsets which vary only along some particular dimension (Kulkarni et al. [2015], Bouchacourt et al. [2017]), or of explicitly labeling specific sources of variation of the data (Kingma et al. [2014], Siddharth et al. [2017], Perarnau et al. [2016], Wang et al. [2017]). Despite their promising results, supervised methods, especially deep-learning ones, usually require a large number of training samples, which are often expensive to obtain.

* Corresponding author.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Unsupervised methods, on the other hand, do not require annotations but yield disentangled representations that are usually uninterpretable and dimension-wise uncontrollable. In other words, the user has no control over the semantics encoded in each dimension of the obtained codes. 
Taking a mugshot as an example, an unsupervised approach cannot ensure that one of the disentangled codes will contain the feature of eye color. In addition, existing methods produce a single-dimension code for each attribute, which sometimes has difficulty in expressing intricate semantics.
In this paper, we propose a weakly semi-supervised learning approach, dubbed Dual Swap Disentangling (DSD), for disentangling that combines the best of the two worlds. The proposed DSD takes advantage of limited annotated sample pairs together with many unannotated ones to derive dimension-wise and semantic-controllable disentangling. We implement the DSD model using an autoencoder, trained on both labeled and unlabeled input data pairs by swapping designated parts of the encodings. Specifically, DSD differs from prior disentangling models in the following aspects.

• Limited Weakly-labeled Input Pairs. Unlike existing supervised and semi-supervised models that either require strong labels on each attribute of each training sample (Kingma et al. [2014], Perarnau et al. [2016], Siddharth et al. [2017], Wang et al. [2017], Banijamali et al. [2017]), or require fully weak labels on a group of samples sharing the same attribute (Bouchacourt et al. [2017]), our model only requires limited pairs of samples, which are much cheaper to obtain.

• Dual-stage Architecture. To the best of our knowledge, we propose the first dual-stage network architecture that utilizes unlabeled sample pairs for semi-supervised disentangling, to facilitate and improve over supervised learning using a small number of labeled pairs.

• Multi-dimension Attribute Encoding. We allow multi-dimensional encodings for each attribute to improve their expressiveness. Moreover, unlike prior methods (Kulkarni et al. [2015], Chen et al. [2016], Higgins et al. [2016], Burgess et al. [2017], Bouchacourt et al. [2017], Chen et al. 
[2018], Gao et al. [2018], Kim and Mnih [2018]), we do not impose any over-constraining assumption, such as each dimension being independent, on our encodings.

We show the architecture of DSD in Fig. 1. It comprises two stages, the primary-stage and the dual-stage, both utilizing the same autoencoder. During training, the annotated pairs go through the primary-stage only, while the unannotated ones go through both. For annotated pairs, again, we only require weak labels to indicate which attribute the two input samples share. We feed such annotated pairs to the encoder and obtain a pair of codes. We then designate which dimensions correspond to the specific shared attribute, and swap these parts of the two codes to obtain a pair of hybrid codes. Next, we feed the hybrid codes to the decoder to reconstruct the final outputs of the labeled pairs. We enforce the reconstruction to approximate the input since we swap only the shared attribute; in this way we encourage the disentangling of the specific attribute in the designated dimensions and thus make our encodings dimension-wise controllable.
The unlabeled pairs during training go through both the primary-stage and the dual-stage. In the primary-stage, unlabeled pairs undergo exactly the same procedure as the labeled ones, i.e., the encoding-swap-decoding steps. In the dual-stage, the decoded unlabeled pairs are again fed into the same autoencoder and passed through the encoding-swap-decoding process for the second time. In other words, the code parts that are swapped during the primary-stage are swapped back in the second stage. With the guidance and constraint of labeled pairs, the dual swap strategy can generate informative feedback signals to train the DSD for dimension-wise and semantic-controllable disentangling. 
The dual swap strategy, tailored for unlabeled pairs, turns out to be very effective in facilitating supervised learning with a limited number of samples.
Our contributions are therefore the first dual-stage strategy for semi-supervised disentangling; requiring weaker and more limited annotations than previous methods; and extending single-dimension attribute encodings to multi-dimension ones. We evaluate the proposed DSD on a wide range of image datasets, in terms of both qualitative visualization and quantitative measures. Our method achieves results superior to the current state-of-the-art.

Figure 1: Architecture of the proposed DSD. It comprises two stages: the primary-stage and the dual-stage. The former is employed for both labeled and unlabeled pairs, while the latter is for unlabeled pairs only.

2 Related Work

Recent works in learning disentangled representations have broadly followed two approaches, (semi-)supervised and unsupervised. Most existing unsupervised methods (Burgess et al. [2017], Chen et al. [2018], Gao et al. [2018], Kim and Mnih [2018], Dupont [2018]) are based on the two most prominent methods, InfoGAN (Chen et al. [2016]) and β-VAE (Higgins et al. [2016]). They, however, impose an independence assumption on the different dimensions of the latent code to achieve disentangling. Some semi-supervised methods (Bouchacourt et al. [2017], Siddharth et al. [2017]) import annotation information into β-VAE to achieve controllable disentangling. Supervised or semi-supervised methods like (Kingma et al. [2014], Perarnau et al. [2016], Wang et al. [2017], Banijamali et al. [2017], Feng et al. [2018]) focus on utilizing annotated data to explicitly supervise the input-to-attribute mapping. 
Different from the above methods, our method does not impose any over-constraining assumption and requires only limited weak annotations.
We also give a brief review of the swapping scheme, group labels, and the dual mechanism, which relate to our dual-stage model and weakly-labeled input. For swapping, Xiao et al. [2017] propose a supervised algorithm called DNA-GAN, which can learn disentangled representations from multiple semantic images with a swapping policy. The significant difference between our DSD and DNA-GAN is that the swapped codes correspond to different semantics in DNA-GAN. DNA-GAN requires lots of annotated multi-labeled images, and the annihilating operation adopted by DNA-GAN is destructive. Besides, DNA-GAN is based on GAN and thus also suffers from the unstable training of GANs. For group information, Bouchacourt et al. [2017] propose the Multi-Level VAE (ML-VAE) model for learning a meaningful disentanglement from a set of grouped observations. The grouping used in ML-VAE requires that observations in the same group have the same semantics. However, it also has the limitation of increased reconstruction error. For the dual mechanism, Zhu et al. [2017] use cycle-consistent adversarial networks to realize unpaired image-to-image translation. Xia et al. [2016] adopt the dual-learning framework for machine translation. However, they both require two domain entities, such as image domains (sketch and photo) or language domains (English and French). Different from these two works, our dual framework only needs one domain entity.

3 Method

In this section, we give more details of our proposed DSD model. 
We start by introducing the architecture and basic elements of our model, then show our training strategy for labeled and unlabeled pairs, and finally summarize the complete algorithm.

[Figure 1 legend: fϕ: encoder; fφ: decoder; the diagram marks the original pairs (labeled or unlabeled), the hybrid outputs, the primary-stage outputs, the dual-stage outputs, the swapping operation, and the data flows for labeled and unlabeled pairs.]

3.1 Dual-stage Autoencoder

The goal of our proposed DSD model is to take both weakly labeled and unlabeled sample pairs as input, and to train an autoencoder that accomplishes dimension-wise controllable disentangling. We show a visual illustration of our model in Fig. 1, where the dual-stage architecture is tailored for the self-supervision on the unlabeled samples. 
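The back-and-forth swapping depicted in Fig. 1 can be sketched with toy codes (a minimal NumPy sketch; the part layout and all names are illustrative, not the authors' implementation):

```python
import numpy as np

def swap_part(r_a, r_b, k, m):
    """Exchange the k-th part (length m) of two codes, leaving the inputs intact."""
    r_a, r_b = r_a.copy(), r_b.copy()
    sl = slice(k * m, (k + 1) * m)
    r_a[sl], r_b[sl] = r_b[sl].copy(), r_a[sl].copy()
    return r_a, r_b

# Two codes with n = 3 parts of m = 5 dimensions each.
rng = np.random.default_rng(0)
ra, rb = rng.normal(size=15), rng.normal(size=15)

# The primary-stage swap produces hybrid codes ...
ha, hb = swap_part(ra, rb, k=1, m=5)
# ... and swapping the same part back restores the originals.
da, db = swap_part(ha, hb, k=1, m=5)
assert np.allclose(da, ra) and np.allclose(db, rb)
```

In the full model the second swap acts on re-encoded hybrids rather than on the hybrid codes themselves, so this identity holds only if the encoder keeps the parts separated and portable, which is precisely the property the dual-stage reconstruction rewards.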
In what follows, we describe DSD's basic elements in detail: the input, the autoencoder, the swap strategy and the dual-stage design.

Input DSD takes a pair of samples as input, denoted as (IA, IB), where the pair can be either weakly labeled or unlabeled. Unlike conventional weakly supervised methods like Bouchacourt et al. [2017] that rely on full annotations on the group of samples, our model only requires limited and weak annotations, as we only require the labels to indicate which attribute, if any, is shared by a pair of samples.

Autoencoder DSD conducts disentangling using an autoencoder trained in both stages. Given a pair of inputs (IA, IB), weakly labeled or not, the encoder fϕ first encodes them into two vector representations RA = fϕ(IA) = [a1, a2, ..., an] and RB = fϕ(IB) = [b1, b2, ..., bn], and then the decoder fφ decodes the obtained codes or encodings to reconstruct the original input pairs, i.e., ĪA = fφ(RA) and ĪB = fφ(RB). We would expect the obtained codes RA and RB to possess the following two properties: i) they include as much information as possible about the original inputs IA and IB, and ii) they are disentangled and element-wise interpretable. The first property, as in any autoencoder, is achieved by minimizing the following original autoencoder loss:

Lo(IA, IB; ϕ, φ) = ||IA − ĪA||₂² + ||IB − ĪB||₂².    (1)

The second property is further achieved via the swapping strategy and the dual-stage design, described in what follows.

Swap Strategy If given the knowledge that the pair of inputs IA and IB share an attribute, such as the color, we can designate a specific part of their encodings, like ak of RA and bk of RB, to associate the attribute semantic with the designated part. 
Assume that RA and RB are disentangled; then swapping the code parts corresponding to the shared attribute, ak and bk, should change neither the encodings nor the hybrid reconstructions ĨA and ĨB. Conversely, enforcing the reconstructions after swapping to approximate the original inputs should facilitate and encourage disentangling of the specific shared attribute. Notably, here we allow each part of the encodings to be multi-dimensional, i.e., ak, bk ∈ R^m, m ≥ 1, so as to improve the expressiveness of the encodings.

Dual-stage For labeled pairs, we know what their shared attribute is and can thus swap the corresponding parts of the code. For unlabeled ones, however, we do not have such knowledge. To take advantage of the large volume of unlabeled pairs, we implement a dual-stage architecture that allows the unlabeled pairs to swap randomly designated parts of their codes to produce the reconstructions during the primary-stage, and then swap them back during the second stage. Through this process, we explicitly impose element-wise modularity and portability on the encodings of the unlabeled samples, and implicitly encourage disentangling under the guidance of labeled pairs.

3.2 Labeled Pairs

For a pair of labeled inputs (IA, IB) in group Gk, meaning that they share the attribute corresponding to the k-th part of their encodings RA and RB, we swap their k-th parts and get a pair of hybrid codes R̃A = [a1, a2, ..., bk, ..., an] and R̃B = [b1, b2, ..., ak, ..., bn]. We then feed the hybrid code pair R̃A and R̃B to the decoder fφ to obtain the final reconstructions ĨA and ĨB. We enforce the reconstructions ĨA and ĨB to approximate (IA, IB), and thereby encourage disentangling of the k-th attribute. 
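Concretely, one primary-stage pass on a labeled pair can be sketched as follows (NumPy, with a toy linear map standing in for the encoder and decoder; all names, sizes and the α value are illustrative of the setup described here, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 32, 3, 5                                # input dim, number of parts, part length
W_enc = rng.normal(scale=0.1, size=(n * m, d))    # toy stand-in for the encoder
W_dec = rng.normal(scale=0.1, size=(d, n * m))    # toy stand-in for the decoder

def encode(x):
    return W_enc @ x

def decode(r):
    return W_dec @ r

def primary_stage_loss(ia, ib, k, alpha=5.0):
    """Original autoencoder term plus alpha times the swap penalty,
    for a pair labeled as sharing its k-th attribute part."""
    ra, rb = encode(ia), encode(ib)
    # Plain reconstructions give the original autoencoder term.
    lo = np.sum((ia - decode(ra)) ** 2) + np.sum((ib - decode(rb)) ** 2)
    # Swap the k-th parts to build hybrid codes, then decode the hybrids.
    ha, hb = ra.copy(), rb.copy()
    sl = slice(k * m, (k + 1) * m)
    ha[sl], hb[sl] = rb[sl], ra[sl]
    # Hybrids must still reconstruct the original inputs: the swap penalty.
    ls = np.sum((ia - decode(ha)) ** 2) + np.sum((ib - decode(hb)) ** 2)
    return lo + alpha * ls

ia, ib = rng.normal(size=d), rng.normal(size=d)
loss = primary_stage_loss(ia, ib, k=1)
```

Minimizing such a loss over many labeled pairs (in practice with a deep encoder/decoder and stochastic gradients) is what ties the k-th part of the code to the shared attribute.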
This is achieved by minimizing the swap loss

Ls(IA, IB; ϕ, φ) = ||IA − ĨA||₂² + ||IB − ĨB||₂²,    (2)

so that the k-th parts of RA and RB will only contain the shared semantic.
We take the total loss Lp for the labeled pairs to be the sum of the original autoencoder loss Lo and the swap loss Ls:

Lp(IA, IB; ϕ, φ) = Lo(IA, IB; ϕ, φ) + α Ls(IA, IB; ϕ, φ),    (3)

where α is a balance parameter, which decides the degree of disentanglement.

Algorithm 1 The Dual Swap Disentangling (DSD) algorithm
Input: Paired observation groups {Gk, k = 1, 2, ..., n}, unannotated observation set G.
1: Initialize ϕ1 and φ1.
2: for t = 1, 3, ..., T epochs do
3:   Randomly sample k ∈ {1, 2, ..., n}.
4:   Sample a paired observation (IA, IB) from group Gk.
5:   Encode IA and IB into RA and RB with encoder fϕt.
6:   Swap the k-th parts of RA and RB to get two hybrid representations R̃A and R̃B.
7:   Reconstruct RA and RB into ĪA = fφt(RA) and ĪB = fφt(RB).
8:   Reconstruct R̃A and R̃B into ĨA = fφt(R̃A) and ĨB = fφt(R̃B).
9:   Update ϕt+1, φt+1 ← ϕt, φt by descending the gradient estimate of Lp(IA, IB; ϕt, φt).
10:  Sample an observation pair (IA, IB) from the unannotated observation set G.
11:  Encode IA and IB into RA and RB with encoder fϕt+1.
12:  Swap the k-th parts of RA and RB to get two hybrid representations R̃A and R̃B.
13:  Reconstruct RA and RB into ĪA = fφt+1(RA) and ĪB = fφt+1(RB).
14:  Reconstruct R̃A and R̃B into ĨA = fφt+1(R̃A) and ĨB = fφt+1(R̃B).
15:  Encode (ĨA, ĨB) into R̃′A and R̃′B with encoder fϕt+1.
16:  Swap the k-th parts of R̃′A and R̃′B back to get R′A and R′B.
17:  Reconstruct R′A and R′B into ÎA = fφt+1(R′A) and ÎB = fφt+1(R′B).
18:  Update ϕt+2, φt+2 ← ϕt+1, φt+1 by descending the gradient estimate of Lu(IA, IB; ϕt+1, φt+1).
19: end for
Output: ϕT, φT

3.3 Unlabeled Pairs

Unlike the labeled pairs, which go through only the primary-stage, unlabeled pairs go through both the primary-stage and the dual-stage; in other words, the "encoding-swap-decoding" process is conducted twice for disentangling. As with the labeled pairs, in the primary-stage the unlabeled pairs (IA, IB) also produce a pair of hybrid outputs ĨA and ĨB through swapping a random k-th part of RA and RB. In the dual-stage, the two hybrids ĨA and ĨB are again fed to the same encoder fϕ and encoded as new representations R̃′A = [a′1, a′2, ..., b′k, ..., a′n] and R̃′B = [b′1, b′2, ..., a′k, ..., b′n]. We then swap back the k-th parts of R̃′A and R̃′B and denote the new codes as R′A = [a′1, a′2, ..., a′k, ..., a′n] and R′B = [b′1, b′2, ..., b′k, ..., b′n]. These codes are fed to the decoder fφ to produce the final outputs ÎA = fφ(R′A) and ÎB = fφ(R′B).
We minimize the reconstruction error of the dual swap outputs with respect to the original inputs, and write the dual swap loss Ld as follows:

Ld(IA, IB; ϕ, φ) = ||IA − ÎA||₂² + ||IB − ÎB||₂².    (4)

The dual swap reconstruction minimization here provides a unique form of self-supervision. That is, by swapping random parts back and forth, we encourage the element-wise separability and modularity of the obtained encodings, which further helps the encoder to learn disentangled representations under the guidance of limited weak labels.
The total loss for the unlabeled pairs consists of the original autoencoder loss Lo and the dual swap loss Ld:

Lu(IA, IB; ϕ, φ) = Lo(IA, IB; ϕ, φ) + β Ld(IA, IB; ϕ, φ),    (5)

where β is the balance parameter. As we will show in our experiments, adopting the dual swap on unlabeled samples and solving the objective function of Eq. 5 yields a significantly better result than only using unlabeled samples during the primary-stage without swapping, which corresponds to optimizing the autoencoder loss alone.

3.4 Complete Algorithm

Within each epoch during training, we alternately optimize the autoencoder using randomly-sampled labeled and unlabeled pairs. The complete algorithm is summarized in Algorithm 1. Once
Once\ntrained, the encoder is able to infer disentangled encodings that can be applied in many applications.\n\n4 Experiments\n\nTo validate the effectiveness of our methods, we conduct experiments on six image datasets of dif-\nferent domains: a synthesized Square dataset, Teapot (Moreno et al. [2016], Eastwood and Williams\n[2018]), MNIST (Haykin and Kosko [2009]), dSprites (Higgins et al. [2016]), Mugshot (Shen et al.\n[2016]), and CAS-PEAL-R1 (Gao et al. [2008]). We \ufb01rstly qualitatively assess the visualization of\nDSD\u2019s generative capacity by performing swapping operation on the parts of latent codes, which\nveri\ufb01es the disentanglement and completeness of our method. To evaluate the informativeness of\nthe disentangled codes, we compute the classi\ufb01cation accuracies based on DSD encodings. We are\nnot able to use the framework of Eastwood and Williams [2018] as it is only applicable to methods\nthat encode each semantic into a single dimension code. In the DSD framework, the latent code\u2019s\nlength and semantic number for the six datasets are set as follows: Square (15, 3), Teapot (50, 5),\nMNIST (15, 3), dSprites (25, 5), CAS-PEAL-R1 (40, 4) and Mugshot (100, 2). The latent code\u2019s\nlength is empirically set, but usually set larger for sophisticated attributes.\nIn our experiment, the visual results are generated with the 64 \u00d7 64 network architecture and oth-\ner quantitative results are generated with the 32 \u00d7 32 network architecture. For the 32 \u00d7 32 net-\nwork architecture, the encoder / discriminatior (D) / auxilary network (Q) and the decoder / gen-\nerator (G) are shown in Table 1. The 64 \u00d7 64 network architecture is same as architecture of\nEastwood and Williams [2018]. Adam optimizer (Kingma and Ba [2014]) is adopted with learning\n\u22124 (32 \u00d7 32 network). The batch size is 64. 
For the stable training of InfoGAN, we fix the latent codes' standard deviations to 1 and use the objective of the improved Wasserstein GAN (Gulrajani et al. [2017]), simply appending InfoGAN's approximate mutual information penalty. We use layer normalization instead of batch normalization. For both network architectures, α and β are set as 5 and 0.2, respectively.

Encoder / D / Q             | Decoder / G
3 × 3 32 conv.              | FC 4 × 4 × 8 · 32
BN, ReLU, 3 × 3 32 conv     | BN, ReLU, 3 × 3 256 conv, ↑
BN, ReLU, 3 × 3 64 conv, ↓  | BN, ReLU, 3 × 3 128 conv
BN, ReLU, 3 × 3 64 conv     | BN, ReLU, 3 × 3 128 conv, ↑
BN, ReLU, 3 × 3 128 conv, ↓ | BN, ReLU, 3 × 3 64 conv
BN, ReLU, 3 × 3 128 conv    | BN, ReLU, 3 × 3 64 conv, ↑
BN, ReLU, 3 × 3 256 conv, ↓ | BN, ReLU, 3 × 3 32 conv
FC Output                   | BN, ReLU, 3 × 3 3 conv, tanh

Table 1: Network architecture for image size 32 × 32. Each network has 3 residual blocks (all but the first and last rows). The input to each residual block is added to its output (with appropriate downsampling/upsampling to ensure that the dimensions match). Downsampling ↓ is performed with mean pooling and ↑ indicates nearest-neighbour upsampling.

4.1 Qualitative Evaluation

We show in Fig. 2 some visualization results on the six datasets. For each dataset, we show the input pairs, the swapped attribute, and the results after swapping.
Square We create a synthetic image dataset of 60,000 image samples (30,000 image pairs), where each image features a randomly-colored square at a random position on a randomly-colored background. The training, validation and testing sets contain 20,000, 9,000 and 1,000 pairs, respectively. Visual results of DSD on the Square dataset are shown in Fig. 
2(a), where DSD leads to visually plausible results.

Figure 2: Visual results on six datasets: (a) Square, (b) Teapot, (c) MNIST, (d) dSprites, (e) Mugshot, and (f) CAS-PEAL-R1. "d-pair" indicates a disturbed pair.

Teapot The Teapot dataset used in Eastwood and Williams [2018] contains 200,000 64 × 64 color images of a teapot with varying poses and colors. Each generative factor is independently sampled from its respective uniform distribution: azimuth (z0) ∼ U[0, 2π], elevation (z1) ∼ U[0, 2π], red (z2) ∼ U[0, 1], blue (z3) ∼ U[0, 1], green (z4) ∼ U[0, 1]. In the experiment, we use 50,000 training, 10,000 validation and 10,000 testing samples. Fig. 2(b) shows the visual results on Teapot, where we can see that the five factors are evidently disentangled.
MNIST In the visual experiment, we adopt InfoGAN to generate 5,000 paired samples, for which we vary the following factors: digit identity (0-9), angle and stroke thickness. The whole training dataset contains 50,000 samples: 5,000 generated paired samples and 45,000 real unpaired samples collected from the original dataset. Semantics swapping for MNIST is shown in Fig. 2(c), where the digits swap one attribute but preserve the other two. For example, when swapping the angle, the digit identity and thickness are kept unchanged. The generated images again look very realistic.
dSprites dSprites is a dataset of 2D shapes procedurally generated from 6 independent ground-truth latent factors. These factors are the color (white), shape (heart, oval and square), scale (6 values), rotation (40 values), position X (32 values) and position Y (32 values) of a sprite. All possible combinations of these factors are present exactly once, generating N = 737280 total images. We sample 100,000 pairs from the original dSprites, which are divided into 80,000, 10,000 and 10,000 pairs for training, validation and testing. Fig. 
2(d) shows the visual results with the above latent factors swapped, where we can see that the five factors are again clearly disentangled.
Mugshot We also use the Mugshot dataset, which contains selfie images of different subjects with different backgrounds. This dataset is generated by artificially combining the human face images of Shen et al. [2016] with 1,000 scene photos collected from the internet. We divide the Mugshot dataset into 20,000, 9,000 and 1,000 pairs for training, validation and testing. Fig. 2(e) shows the results of the same mugshot obtained by swapping different backgrounds, which are visually impressive. Note that, in this case, we only consider two semantics, the foreground being the human selfie and the background being the collected scene. The good visual results can be partially explained by the fact that the backgrounds, paired with different subjects, have been observed by DSD during training.
CAS-PEAL-R1 CAS-PEAL-R1 contains 30,900 images of 1,040 subjects, of which 438 subjects wear 6 different types of accessories (3 types of glasses, and 3 types of hats). There are images of 233 subjects that involve at least 10 and at most 31 lighting changes. We sample 50,000 image pairs from the original CAS-PEAL-R1, divided into 40,000, 9,000 and 1,000 pairs for training, validation and testing. Fig. 2(f) shows the visual results with swapped light, hat and glasses. Notably, the hair covered by the hats can also be reconstructed when the hats are swapped, though the qualities of the hybrid images are not always exceptional. 
This can be in part explained by the existence of disturbed paired samples, as depicted in the last column. This pair of images is in fact labeled as sharing the same hat, although the appearances of the hats, such as the wearing angles, are significantly different, making the supervision very noisy.

4.2 Quantitative Evaluation

To quantitatively evaluate the informativeness of the disentangled codes, we compare our method with 4 baselines: InfoGAN (Chen et al. [2016]), β-VAE (Higgins et al. [2016]), Semi-VAE (Siddharth et al. [2017]) and a basic Autoencoder. We first use InfoGAN to generate 50,000 paired digit samples, and then train all methods on this generated dataset. For InfoGAN and β-VAE, the lengths of their codes are set as 5. To compare fairly with these two methods, the code length of Semi-VAE, the Autoencoder and our DSD is taken to be 5 × 3, meaning the code contains 3 parts and each part's length is 5. Under this setting, we can compare the part of the code (length 5) that corresponds to digit identity with the whole codes (length 5) of InfoGAN and β-VAE and with the single variable (length 1) that corresponds to digit identity. For the basic Autoencoder, the part with the highest accuracy is treated as the identity part. After training all the models, the real MNIST data are encoded as codes. Then, 55,000 training samples are used to train a simple kNN classifier and the remaining 10,000 are used as test samples. Table 2 gives the classification accuracies of the different methods, where InfoGAN achieves the worst accuracy score. DSD(0.5) achieves the best accuracy score, which further validates the informativeness of our DSD.

Table 2: The accuracy score comparison among different models. DSD(n) denotes the DSD with a supervision rate of n for paired samples. 
Accuracy (ACC) values are shown as "q/p", where q is the accuracy obtained using the digit-identity part of the codes for classification, and p is the accuracy obtained using the whole codes.

Model  β-VAE(β=1)  β-VAE(β=6)  InfoGAN    Semi-VAE   Autoencoder  DSD(0.5)   DSD(1)
ACC    0.22/0.72   0.25/0.71   0.19/0.51  0.22/0.57  0.66/0.93    0.76/0.91  0.742/0.90

In addition, to compare the annotated-data requirements of different (semi-)supervised methods, we summarize them in Table 3. The abbreviations of the compared methods are as follows: DC-IGN (Kulkarni et al. [2015]), DNA-GAN (Xiao et al. [2017]), TD-GAN (Wang et al. [2017]), Semi-DGM (Kingma et al. [2014]), Semi-VAE (Siddharth et al. [2017]), ML-VAE (Bouchacourt et al. [2017]), JADE (Banijamali et al. [2017]). DSD is the only one that requires both limited and weak labels, meaning that it requires the least amount of human annotation.

Table 3: Comparison of the required annotated data. Label indicates whether the method requires strong or weak labels. Rate indicates the proportion of annotated data required for training.

       DC-IGN  DNA-GAN  TD-GAN   Semi-DGM  Semi-VAE  JADE     ML-VAE  DSD
Label  strong  strong   strong   strong    strong    strong   weak    weak
Rate   100%    limited  limited  100%      100%      limited  100%    limited

4.3 Supervision Rate

We also conduct experiments to demonstrate the impact of the supervision rate on DSD's disentangling capability, where we set the rate to 0.0, 0.1, 0.2, ..., 1.0. From Fig. 3(a), we can see that different supervision rates do not affect the convergence of DSD. A lower supervision rate will, however, lead to overfitting if the number of epochs exceeds the optimal one. Fig. 3(d) shows the classification accuracy of DSD with different supervision rates.
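The informativeness evaluation of Sec. 4.2 can be sketched in plain Python. This is an illustrative reimplementation, not the paper's code: the codes below are synthetic stand-ins for learned encodings, and names such as `qp_accuracy` are ours; only the layout (3 parts of length 5, with one part assumed to hold digit identity) follows the setting above.

```python
# Sketch of the "q/p" k-NN protocol from Table 2: classify once from the
# identity part alone (q) and once from the whole code (p).
import random
from collections import Counter

PART_LEN, NUM_PARTS, ID_PART = 5, 3, 0  # 3 parts of length 5; part 0 = identity

def knn_predict(train, labels, x, k=5):
    """Plain k-NN: majority label among the k nearest training codes."""
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def qp_accuracy(train, train_y, test, test_y):
    """Return (q, p): accuracy from the identity part vs. the whole code."""
    sl = slice(ID_PART * PART_LEN, (ID_PART + 1) * PART_LEN)
    q = sum(knn_predict([c[sl] for c in train], train_y, x[sl]) == y
            for x, y in zip(test, test_y)) / len(test)
    p = sum(knn_predict(train, train_y, x) == y
            for x, y in zip(test, test_y)) / len(test)
    return q, p

# Toy data: part 0 carries the class signal, the remaining parts are noise,
# so q should clearly beat the 10% chance level.
rng = random.Random(0)
ys = [rng.randrange(10) for _ in range(300)]
xs = [[rng.gauss(0, 1) + (y if d < PART_LEN else 0)
       for d in range(PART_LEN * NUM_PARTS)] for y in ys]
q, p = qp_accuracy(xs[:250], ys[:250], xs[250:], ys[250:])
print(round(q, 2), round(p, 2))
```

On such synthetic codes the identity-part accuracy q is far above chance, mirroring how Table 2 separates the informativeness of a designated code part from that of the full code.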
With only 20% paired samples, DSD achieves accuracy comparable to that obtained using 100% paired data, which shows that the dual-learning mechanism is able to take good advantage of unpaired samples. Fig. 3(c) shows some hybrid images obtained by swapping the digit-identity parts of the codes. Note that the images obtained by DSD with supervision rates equal to 0.2, 0.3, 0.4, 0.5 and 0.7 keep the angles of the digits correct, while the others do not. These image pairs are highlighted in yellow.

Figure 3: Results at different supervision rates. (a) The training and validation loss curves at different supervision rates, where "t-rate" indicates the training loss and "v-rate" the validation loss at a given supervision rate. (b) The training and validation loss curves of DSD (dual framework) and the primary framework at different supervision rates. (c) Visual results at different supervision rates, obtained by swapping the parts of the codes that correspond to digit identity. (d) Classification accuracy of the codes encoded by DSD at different supervision rates.

4.4 Primary vs Dual

To verify the effectiveness of the dual-learning mechanism, we compare our DSD (dual framework) with a basic primary framework that contains only the primary stage. The primary framework also takes paired and unpaired samples. The major difference between the primary framework and DSD is that the primary framework performs no swapping operation on unpaired samples. Fig. 3(b) gives the training and validation loss curves of DSD and the primary framework at different supervision rates, where we can find that different supervision rates have no visible impact on the convergence of either framework. From Fig.
3(d), we can see that the accuracy scores of DSD are always higher than those of the primary framework across supervision rates, which shows that the codes disentangled by DSD are more informative than those disentangled by the primary framework. Fig. 3(c) gives a visual comparison between the hybrid images at different supervision rates. It is obvious that the hybrid images of the primary framework are almost the same as the original images, which indicates that the swapped codes contain redundant angle information. In other words, the disentanglement of the primary framework is defective. On the contrary, most of the hybrid images of DSD keep the angle correct, indicating that the swapped codes contain only the digit-identity information. These results show that DSD is indeed superior to the primary framework.

5 Discussion and Conclusion

In this paper, we propose the Dual Swap Disentangling (DSD) model, which learns disentangled representations using limited and weakly labeled training samples. Our model requires the shared attribute as the only annotation on a pair of input samples, and is able to take advantage of the vast amount of unlabeled samples to facilitate model training. This is achieved by the dual-stage architecture, where the labeled samples go through the "encoding-swap-decoding" process once while the unlabeled ones go through the process twice. Such a self-supervision mechanism for unlabeled samples turns out to be very effective: DSD yields results superior to the state-of-the-art on several datasets from different domains.
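The double application of the swap that underlies this self-supervision can be sketched as follows. This is an illustrative toy, with hypothetical helper names: in the actual model the codes are learned encodings and a decode/encode step sits between the two swaps, but the algebra of the swap itself is what guarantees that the final outputs can be compared against the original inputs.

```python
# Toy sketch of the DSD swap: a code is a flat vector of NUM_PARTS parts,
# each PART_LEN long; the designated part is exchanged between a pair.
PART_LEN, NUM_PARTS = 5, 3

def swap_part(code_a, code_b, part):
    """Exchange the `part`-th segment between two codes (returns new lists)."""
    s = slice(part * PART_LEN, (part + 1) * PART_LEN)
    a, b = list(code_a), list(code_b)
    a[s], b[s] = list(code_b[s]), list(code_a[s])
    return a, b

def dual_swap(code_a, code_b, part):
    """Apply the same swap twice, as done for unlabeled pairs in DSD."""
    h_a, h_b = swap_part(code_a, code_b, part)  # hybrid codes after 1st swap
    return swap_part(h_a, h_b, part)            # 2nd swap restores the pair

a = [float(i) for i in range(PART_LEN * NUM_PARTS)]
b = [float(100 + i) for i in range(PART_LEN * NUM_PARTS)]
ra, rb = dual_swap(a, b, part=1)
print(ra == a and rb == b)  # True: the double swap is the identity on codes
```

Because the double swap is the identity on the codes, a reconstruction loss on the twice-swapped outputs is a valid training signal for unlabeled pairs, while still forcing each designated part to be modular and portable.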
In future work, we will take the semantic hierarchy into consideration and potentially learn disentangled representations with even fewer labeled pairs.

Acknowledgments

This work is supported by the National Basic Research Program of China under Grant No. 2015CB352400, the National Natural Science Foundation of China (61572428, U1509206), the Fundamental Research Funds for the Central Universities (2017FZA5014), the Key Research and Development Program of Zhejiang Province (2018C01004), and Australian Research Council Projects (FL-170100117, DP-140102164).

References

Ershad Banijamali, Amir Hossein Karimi, Alexander Wong, and Ali Ghodsi. JADE: Joint autoencoders for disentanglement. 2017.

Diane Bouchacourt, Ryota Tomioka, and Sebastian Nowozin. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. 2017.

Christopher Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-VAE. In NIPS 2017 Disentanglement Workshop, 2017.

Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. 2016.

Emilien Dupont. Joint-VAE: Learning disentangled joint continuous and discrete representations. arXiv preprint arXiv:1804.00104, 2018.

Cian Eastwood and Christopher K. I. Williams.
A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.

Zunlei Feng, Zhenyun Yu, Yezhou Yang, Yongcheng Jing, Junxiao Jiang, and Mingli Song. Interpretable partitioned embedding for customized fashion outfit composition. 2018.

Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan. Auto-encoding total correlation explanation. arXiv preprint arXiv:1802.05822, 2018.

Wen Gao, Bo Cao, Shiguang Shan, Xilin Chen, Delong Zhou, Xiaohua Zhang, and Debin Zhao. The CAS-PEAL large-scale Chinese face database and baseline evaluations. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 38(1):149–161, 2008.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. 2017.

S. Haykin and B. Kosko. Gradient-based learning applied to document recognition. In IEEE, pages 306–351, 2009.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.

Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. arXiv preprint arXiv:1707.08475, 2017a.

Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner. SCAN: Learning abstract hierarchical compositional visual concepts. 2017b.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma, Danilo J Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, and Joshua B. Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.

Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

Pol Moreno, Christopher K. I. Williams, Charlie Nash, and Pushmeet Kohli. Overcoming occlusion with inverse graphics. In European Conference on Computer Vision, pages 170–185, 2016.

Guim Perarnau, Joost van de Weijer, Bogdan Raducanu, and Jose M Alvarez. Invertible conditional GANs for image editing. 2016.

Xiaoyong Shen, Aaron Hertzmann, Jiaya Jia, Sylvain Paris, Brian Price, Eli Shechtman, and Ian Sachs. Automatic portrait segmentation for image stylization. Computer Graphics Forum, 35(2):93–102, 2016.

N Siddharth, Brooks Paige, Alban Desmaison, Jan-Willem van de Meent, Frank Wood, Noah D Goodman, Pushmeet Kohli, and Philip H S Torr. Learning disentangled representations in deep generative models. 2017.

Chaoyue Wang, Chaohui Wang, Chang Xu, and Dacheng Tao. Tag disentangled generative adversarial network for object image re-rendering. In Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 2901–2907, 2017.

Yingce Xia, Di He, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. 2016.

Taihong Xiao, Jiapeng Hong, and Jinwen Ma. DNA-GAN: Learning disentangled representations from multi-attribute images. 2017.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks.
pages 2242–2251, 2017.