{"title": "A Unified Feature Disentangler for Multi-Domain Image Translation and Manipulation", "book": "Advances in Neural Information Processing Systems", "page_first": 2590, "page_last": 2599, "abstract": "We present a novel and unified deep learning framework which is capable of learning domain-invariant representation from data across multiple domains. Realized by adversarial training with additional ability to exploit domain-specific information, the proposed network is able to perform continuous cross-domain image translation and manipulation, and produces desirable output images accordingly. In addition, the resulting feature representation exhibits superior performance of unsupervised domain adaptation, which also verifies the effectiveness of the proposed model in learning disentangled features for describing cross-domain data.", "full_text": "A Uni\ufb01ed Feature Disentangler for Multi-Domain\n\nImage Translation and Manipulation\n\nAlexander H. Liu1\n\nYen-Cheng Liu2\n\nYu-Ying Yeh3\n\nYu-Chiang Frank Wang1,4\n\n2Georgia Institute of Technology, USA\n\n3University of California, San Diego, USA\n\n1National Taiwan University, Taiwan\n\n4MOST Joint Research Center for AI Technology and All Vista Healthcare, Taiwan\n\nb03902034@ntu.edu.tw, ycliu@gatech.edu\nyuyeh@eng.ucsd.edu, ycwang@ntu.edu.tw\n\nAbstract\n\nWe present a novel and uni\ufb01ed deep learning framework which is capable of learn-\ning domain-invariant representation from data across multiple domains. Realized\nby adversarial training with additional ability to exploit domain-speci\ufb01c informa-\ntion, the proposed network is able to perform continuous cross-domain image\ntranslation and manipulation, and produces desirable output images accordingly. In\naddition, the resulting feature representation exhibits superior performance of un-\nsupervised domain adaptation, which also veri\ufb01es the effectiveness of the proposed\nmodel in learning disentangled features for describing cross-domain data.\n\n1\n\nIntroduction\n\nLearning interpretable feature representation has been an active research topic in the \ufb01elds of computer\nvision and machine learning. In particular, learning deep representation with the ability to exploit\nrelationship between data across different data domains has attracted the attention from the researchers.\nRecent developments of deep learning technologies have shown progress in the tasks of cross-domain\nvisual classi\ufb01cation [6, 26, 27] and cross-domain image translation [10, 24, 30, 11, 28, 17, 16, 4].\nWhile such tasks typically learn feature mapping from one domain to another or derive a joint\nrepresentation across domains, the developed models have limited capacities in manipulating speci\ufb01c\nfeature attributes for recovering cross-domain data.\nWith the goal of understanding and describing underlying explanatory factors across distinct data\ndomains, cross-domain representation disentanglement aims to derive a joint latent feature space,\nwhere selected feature dimensions would represent particular semantic information [1]. Once such a\ndisentangled representation across domains is learned, one can describe and manipulate the attribute\nof interest for data in either domain accordingly. 
While recent work [18] has demonstrated promising ability on this task, existing model designs typically incur high computational costs when more than two data domains or multiple feature attributes are of interest.

To perform joint feature disentanglement and translation across multiple data domains, we propose a compact yet effective model, the Unified Feature Disentanglement Network (UFDN), which is composed of a unified encoder and generator pair as shown in Figure 1. From this figure, it can be seen that our encoder takes data instances from multiple domains as inputs, and a domain-invariant latent feature space is derived via adversarial training, followed by a generator/decoder which recovers or translates data across domains. Our model is able to disentangle the underlying factors which represent domain-specific information (e.g., domain code, attribute of interest, etc.). This is achieved by joint learning of our generator. Once the disentangled domain factors are observed, one can simply synthesize and manipulate the images of interest as outputs.

Figure 1: Illustration of multi-domain image translation and manipulation. With data from different domains (e.g., D1: sketch, D2: photo, D3: painting), the goal is to learn a domain-invariant feature representation. With domain information disentangled from such representation, one can synthesize and manipulate image outputs in different domains of interest (including the intermediate ones across domains).

Later in the experiments, we show that the use of our derived latent representation achieves significant improvements over state-of-the-art methods on the task of unsupervised domain adaptation. In addition to very promising results in multi-domain image-to-image translation, we further confirm that our UFDN is able to perform continuous image translation using interpolated domain codes in the resulting latent space. Our implementation and datasets are available at https://github.com/Alexander-H-Liu/UFDN.

The contributions of this paper are highlighted as follows:

• We propose a Unified Feature Disentanglement Network (UFDN), which learns deep disentangled feature representations for multi-domain image translation and manipulation.
• Our UFDN views both data domains and image attributes of interest as latent factors to be disentangled, which realizes multi-domain image translation in a single unified framework.
• Continuous multi-domain image translation and manipulation can be performed using our UFDN, while the disentangled feature representation shows promising ability on cross-domain classification tasks.

2 Related Work

Representation Disentanglement. Built on generative models such as generative adversarial networks (GANs) [8, 21] and variational autoencoders (VAEs) [13, 22], recent works on representation disentanglement [20, 9, 3, 14, 12, 18] aim at learning interpretable representations using deep neural networks with different degrees of supervision. In a fully supervised setting, Kulkarni et al. [14] learned invertible graphics codes for 3D image rendering. Odena et al. [20] achieved representation disentanglement with the proposed auxiliary classifier GAN (AC-GAN).
Kingma et al. [12] extended the VAE to the semi-supervised setting for representation disentanglement. Without utilizing any supervised data, Chen et al. [3] decomposed the representation by maximizing the mutual information between latent factors and the synthesized images. Despite promising performance, the above works focus on learning disentangled representations of images in a single domain and cannot be easily extended to describe cross-domain data. While a recent work by Liu et al. [18] addressed cross-domain disentangled representations with supervision from single-domain data only, empirical studies were required to determine their network architecture (i.e., the number of layers shared across domains), which limits its practical use. Thus, a unified disentangled representation model (like ours) for describing and manipulating multi-domain data is desirable.

Image-to-Image Translation. Image-to-image translation is another line of research dealing with cross-domain visual data. With the goal of translating images across different domains, Isola et al. [10] applied a conditional GAN trained on pairwise data across source and target domains. Taigman et al. [24] removed the restriction of pairwise training images and presented a Domain Transfer Network (DTN) which enforces cross-domain feature consistency. Likewise, Zhu et al. [30] employed a cycle consistency loss in the pixel space to achieve unpaired image translation; similar ideas were applied by Kim et al. [11] and Yi et al. [28]. Liu et al. [17] presented coupled GANs (CoGAN), which share weights in high-level layers to learn the joint distribution across domains; to achieve image-to-image translation, they further integrated CoGAN with two parallel encoders [16]. Nevertheless, the above dual-domain models cannot be easily extended to multi-domain image translation without increasing computational cost. Although Choi et al. [4] recently proposed a unified model for multi-domain image-to-image translation, their model does not exhibit the ability to learn and disentangle desirable latent representations (as ours does).

Unsupervised Domain Adaptation (UDA). Unsupervised domain adaptation (UDA) aims at classifying samples in a target domain using labeled and unlabeled training data in the source and target domains, respectively. Inspired by adversarial learning [8], Ganin et al. [6] proposed adversarial training between a domain discriminator and a standard convolutional-network classifier, making the model invariant to domain shift. Tzeng et al. [26] also attempted to build a domain-invariant classifier by introducing a domain confusion loss. Advancing adversarial learning strategies, Bousmalis et al. [2] chose to learn orthogonal representations, derived by shared and domain-specific encoders, respectively. Tzeng et al. [27] addressed UDA by adapting CNN feature extractors/classifiers across source and target domains via adversarial training. However, the above methods generally address domain adaptation by eliminating domain biases; there is no guarantee that the derived representation preserves semantic information (e.g., domain or attribute of interest). Moreover, since the goal of UDA is visual classification, image translation (dual- or multi-domain) cannot be easily achieved.
As we highlighted in Sect. 1, our UFDN learns a multi-domain disentangled representation, which enables multi-domain image-to-image translation and manipulation as well as unsupervised domain adaptation. This makes our proposed model unique among the above approaches.

3 Unified Feature Disentanglement Network

We present a unique and unified network architecture, the Unified Feature Disentanglement Network (UFDN), which disentangles domain information from the latent space and derives a domain-invariant representation from data across multiple domains (not just a pair of domains). This not only enables multi-domain image translation/manipulation, but the derived feature representation can also be applied to unsupervised domain adaptation.

Given image sets {X_c}, c = 1, ..., N, across N domains, our UFDN learns a domain-invariant representation z for an input image x_c in X_c (in domain c). This is realized by disentangling the domain information in the latent space as a domain vector v in R^N via self-supervised feature disentanglement (Sect. 3.1), followed by preserving the data recovery ability via adversarial learning in the pixel space (Sect. 3.2). We now detail our proposed model.

Figure 2: Overview of our Unified Feature Disentanglement Network (UFDN), consisting of an encoder E, a generator G, a discriminator D_x in pixel space, and a discriminator D_v in feature space. Note that x_c and x̂_c denote the input and reconstructed images with domain vector v_c, respectively; x̂_c̄ indicates the synthesized image with domain vector v_c̄.

3.1 Self-supervised feature disentanglement

To learn disentangled representations across data domains, one can simply apply a VAE architecture (e.g., components E and G in Figure 2). To be more specific, the encoder E takes an image x_c as input and derives its representation z, which is combined with its domain vector v_c to reconstruct the image x̂_c via the generator G. The objective function of the VAE is defined as:

    \mathcal{L}_{vae} = \lVert \hat{x}_c - x_c \rVert_F^2 + \mathrm{KL}(q(z \mid x_c) \,\|\, p(z)),    (1)

where the first term aims at recovering the synthesized output in the same domain c, and the second term is the Kullback-Leibler divergence penalizing the deviation of the latent feature from the prior distribution p(z) (with z ~ N(0, I)). However, this technique alone is not guaranteed to disentangle domain information from the latent space, since the generator recovers images based solely on the representation z without considering the domain information.

To address this problem, we extend the above model to eliminate domain-specific information from the representation z. This is achieved by adversarial domain classification in the resulting latent feature space. More precisely, the domain discriminator D_v in Figure 2 takes only the latent representation z as input and produces the domain code prediction l_v. The objective function of this domain discriminator is:

    \mathcal{L}^{adv}_{D_v} = \mathbb{E}[\log P(l_v = v_c \mid E(x_c))],    (2)

where P is the probability distribution over domains l_v produced by the domain discriminator D_v. The domain vector v_c can be implemented as a one-hot vector, a concatenation of multiple one-hot vectors, or simply a real-valued vector describing the domain of interest. In contrast, the encoder E aims to confuse D_v and prevent it from correctly predicting the domain code. As a result, the objective of the encoder is to maximize the entropy of the domain discriminator:

    \mathcal{L}^{adv}_{E} = -\mathcal{L}^{adv}_{D_v} = -\mathbb{E}[\log P(l_v = v_c \mid E(x_c))].    (3)
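To make the interplay of Eqs. (1)-(3) concrete, the following is a minimal PyTorch sketch of the self-supervised disentanglement objectives. It is illustrative only: we substitute small MLPs for the convolutional networks of the actual UFDN, the sizes and names (Z_DIM, N_DOMAINS, etc.) are our own, and the losses are written in the form one would minimize in practice (cross-entropy for Eq. (2), its negation for Eq. (3)).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

Z_DIM, N_DOMAINS, X_DIM = 64, 3, 32 * 32 * 3  # illustrative sizes, not the paper's

class Encoder(nn.Module):
    """E: flattened image -> (mu, logvar) parameterizing q(z|x)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(X_DIM, 256), nn.ReLU())
        self.mu = nn.Linear(256, Z_DIM)
        self.logvar = nn.Linear(256, Z_DIM)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

class Generator(nn.Module):
    """G: (z, domain vector v) -> image; v is concatenated to z."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(Z_DIM + N_DOMAINS, 256), nn.ReLU(),
            nn.Linear(256, X_DIM), nn.Sigmoid())

    def forward(self, z, v):
        return self.body(torch.cat([z, v], dim=1))

class DomainDiscriminator(nn.Module):
    """D_v: latent code z -> domain logits l_v."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(Z_DIM, 128), nn.ReLU(),
                                  nn.Linear(128, N_DOMAINS))

    def forward(self, z):
        return self.body(z)

def reparameterize(mu, logvar):
    """Sample z ~ q(z|x) with the reparameterization trick."""
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def vae_loss(x, x_hat, mu, logvar):
    """Eq. (1): reconstruction + KL(q(z|x) || N(0, I))."""
    recon = F.mse_loss(x_hat, x, reduction='sum') / x.size(0)
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return recon + kl

def dv_loss(dv_logits, domain_labels):
    """Eq. (2), minimized form: D_v learns to predict the true domain of z."""
    return F.cross_entropy(dv_logits, domain_labels)

def e_adv_loss(dv_logits, domain_labels):
    """Eq. (3): E is rewarded for confusing D_v (the paper describes this as
    maximizing the domain discriminator's entropy)."""
    return -F.cross_entropy(dv_logits, domain_labels)
```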
3.2 Adversarial learning in pixel space

Once the above domain-invariant representation z is learned, we further utilize the reconstruction module in our UFDN to preserve the recovery ability of the disentangled representation; that is, the reconstructed image x̂_c can be supervised by its original image x_c.

However, when manipulating the domain vector as v_c̄ in the above process, there is no guarantee that the synthesized image x̂_c̄ would be satisfactory given v_c̄. This is because no pairwise training data (i.e., x_c and x_c̄) are available to supervise the synthesized image x̂_c̄ during training. Moreover, as noted in [29], the VAE architecture tends to generate blurry samples, which is undesirable for practical use.

To overcome these limitations, we additionally introduce an image discriminator D_x in the pixel space for our UFDN. This discriminator not only improves the quality of the synthesized image x̂_c̄, it also enhances the ability to disentangle domain information from the latent space. The objectives of this image discriminator D_x are twofold: to distinguish whether the input image is real or fake, and to classify the observed images (i.e., x̂_c̄ and x_c) into their proper domain codes/categories.

With the above discussion, we define the objective functions for adversarial learning between the image discriminator D_x and the generator G as:

    \mathcal{L}^{adv}_{D_x} = \mathbb{E}[\log(D_x(\hat{x}_{\bar{c}}))] + \mathbb{E}[\log(1 - D_x(x_c))], \quad \mathcal{L}^{adv}_{G} = -\mathbb{E}[\log(D_x(\hat{x}_{\bar{c}}))].    (4)

On the other hand, the objective function for domain classification is:

    \mathcal{L}_{cls} = \mathbb{E}[\log P(l_x = v_{\bar{c}} \mid \hat{x}_{\bar{c}})] + \mathbb{E}[\log P(l_x = v_c \mid x_c)],    (5)

where l_x denotes the domain prediction of the image discriminator D_x. This term implicitly maximizes the mutual information between the domain vector and the synthesized image [3].
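Below is a matching sketch of the pixel-space discriminator and the losses of Eqs. (4)-(5), under the same illustrative assumptions as the previous block. D_x here has two heads, one for real/fake and one for domain classification; the adversarial terms are written in the cross-entropy form that is minimized in practice, consistent with the gradient-descent updates of Eq. (6) below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelDiscriminator(nn.Module):
    """D_x with two heads: a real/fake score and domain logits l_x."""
    def __init__(self, x_dim=32 * 32 * 3, n_domains=3):  # illustrative sizes
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.adv_head = nn.Linear(256, 1)          # real vs. fake
        self.cls_head = nn.Linear(256, n_domains)  # domain prediction l_x

    def forward(self, x):
        h = self.body(x)
        return torch.sigmoid(self.adv_head(h)), self.cls_head(h)

def dx_adv_loss(d_real, d_fake):
    """Eq. (4) in its minimized form: push D_x(x_c) -> 1 (real) and
    D_x(x_hat_cbar) -> 0 (fake)."""
    return (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
            F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

def g_adv_loss(d_fake):
    """Eq. (5)-style generator term: G tries to make the translated image
    look real, i.e., minimize -E[log D_x(x_hat_cbar)]."""
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))

def cls_loss(cls_logits, domain_labels):
    """L_cls as a cross-entropy (negative log-likelihood) term: both real and
    synthesized images should be classified into their domain codes."""
    return F.cross_entropy(cls_logits, domain_labels)
```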
To train our UFDN, we alternately update the encoder E, generator G, domain discriminator D_v, and image discriminator D_x with the following gradient steps:

    \theta_E \leftarrow \theta_E - \Delta_{\theta_E}(\mathcal{L}_{vae} + \mathcal{L}^{adv}_{E}), \quad \theta_G \leftarrow \theta_G - \Delta_{\theta_G}(\mathcal{L}_{vae} + \mathcal{L}^{adv}_{G} + \mathcal{L}_{cls}),
    \theta_{D_v} \leftarrow \theta_{D_v} - \Delta_{\theta_{D_v}}(\mathcal{L}^{adv}_{D_v}), \quad \theta_{D_x} \leftarrow \theta_{D_x} - \Delta_{\theta_{D_x}}(\mathcal{L}^{adv}_{D_x} + \mathcal{L}_{cls}).    (6)
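For completeness, here is a sketch of one alternating training step implementing the updates of Eq. (6), reusing the modules and loss helpers from the two sketches above. The optimizer choice, learning rate, and the exact composition of L_cls per player are our assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

# Reuses Encoder, Generator, DomainDiscriminator, PixelDiscriminator and the
# loss helpers from the sketches above; Adam with lr=1e-4 is illustrative.
E, G = Encoder(), Generator()
D_v, D_x = DomainDiscriminator(), PixelDiscriminator()
opts = {m: torch.optim.Adam(m.parameters(), lr=1e-4) for m in (E, G, D_v, D_x)}

def train_step(x, domain_labels):
    """One alternating update of (E, G, D_v, D_x) following Eq. (6)."""
    v = F.one_hot(domain_labels, N_DOMAINS).float()
    # Randomly draw a manipulated domain code v_bar for translation.
    bar_labels = torch.randint(0, N_DOMAINS, domain_labels.shape)
    v_bar = F.one_hot(bar_labels, N_DOMAINS).float()

    # Update E with L_vae + L_adv_E.
    mu, logvar = E(x)
    z = reparameterize(mu, logvar)
    loss_E = vae_loss(x, G(z, v), mu, logvar) + e_adv_loss(D_v(z), domain_labels)
    opts[E].zero_grad(); loss_E.backward(); opts[E].step()

    # Update G with L_vae + L_adv_G + L_cls (encoder held fixed).
    with torch.no_grad():
        mu, logvar = E(x)
        z = reparameterize(mu, logvar)
    x_hat, x_bar = G(z, v), G(z, v_bar)
    d_fake, cls_fake = D_x(x_bar)
    loss_G = (F.mse_loss(x_hat, x, reduction='sum') / x.size(0)
              + g_adv_loss(d_fake) + cls_loss(cls_fake, bar_labels))
    opts[G].zero_grad(); loss_G.backward(); opts[G].step()

    # Update D_v with L_adv_Dv on the (detached) latent codes.
    loss_Dv = dv_loss(D_v(z), domain_labels)
    opts[D_v].zero_grad(); loss_Dv.backward(); opts[D_v].step()

    # Update D_x with L_adv_Dx + L_cls on real and synthesized images.
    d_real, cls_real = D_x(x)
    d_fake, cls_fake = D_x(x_bar.detach())
    loss_Dx = (dx_adv_loss(d_real, d_fake)
               + cls_loss(cls_real, domain_labels) + cls_loss(cls_fake, bar_labels))
    opts[D_x].zero_grad(); loss_Dx.backward(); opts[D_x].step()
```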
3.3 Comparison with state-of-the-art cross-domain visual tasks

To demonstrate the uniqueness of our proposed UFDN, we compare our model with several state-of-the-art image-to-image translation works in Table 1.

Table 1: Comparisons with recent works on image-to-image translation.

Method        | Unpaired data | Bidirectional translation | Unified structure | Multiple domains | Joint representation | Feature disentanglement
Pix2Pix [10]  | - | - | - | - | - | -
CycleGAN [30] | ✓ | ✓ | - | - | - | -
StarGAN [4]   | ✓ | ✓ | ✓ | ✓ | - | -
DTN [24]      | ✓ | - | - | - | ✓ | -
UNIT [16]     | ✓ | ✓ | - | - | ✓ | -
E-CDRD [18]   | ✓ | ✓ | - | ✓ | ✓ | ✓
UFDN (Ours)   | ✓ | ✓ | ✓ | ✓ | ✓ | ✓

Without the need for pairwise training data, CycleGAN [30] learns bidirectional mappings between two pixel spaces, but it requires learning multiple individual networks for the task of multi-domain image translation. StarGAN [4] alleviates this problem by learning a unified structure; however, it does not exhibit the ability to disentangle particular semantics across different domains. Another line of work on image translation learns a joint representation across image domains [24, 16, 18]. While DTN [24] learns a joint representation to translate images from one domain to another, it only allows unidirectional image translation. UNIT [16] addresses this problem by jointly synthesizing images in both domains; however, it is not able to learn disentangled representations as ours does. A recent work, E-CDRD [18], derives cross-domain representation disentanglement, but requires high computational cost when more than two data domains are of interest, whereas ours is a unified architecture for multiple data domains (i.e., the domain code is a vector).

It is worth repeating that our UFDN does not require pairwise training data to learn multi-domain disentangled feature representations. As verified later in the experiments, our model not only enables multi-domain image-to-image translation and manipulation, but the derived domain-invariant feature also allows unsupervised domain adaptation.

4 Experiment Results

4.1 Datasets

Digits. The MNIST, USPS, and Street View House Numbers (SVHN) datasets are treated as three different domains and used as benchmark datasets for unsupervised domain adaptation (UDA). MNIST contains 60,000/10,000 training/testing images, and USPS contains 7,291/2,007 training/testing images. While these two datasets are handwritten digits, SVHN consists of digit images with complex backgrounds and various illumination conditions. We use the 60,000 images from the SVHN extra training set to train our model and a few samples from the testing set to perform image translation. All images are converted to 32x32 RGB images in our experiments.

Human faces. We use the Large-scale CelebFaces Attributes (CelebA) dataset [19] in our experiments on human face images. CelebA includes more than 200K celebrity photos annotated with 40 facial attributes. Considering photo, sketch, and paint as three different domains, we follow the setting of previous works [10, 18] and transfer half of the photos to sketches. We further transfer half of the remaining photos to paintings using off-the-shelf style transfer software (https://fotosketcher.com).

4.2 Multi-domain image translation with disentangled representation

Most previous works focus on image translation between two domains, as discussed in Section 2. In our experiments, we use human face images from different domains to perform image-to-image translation. Although Choi et al. [4] claim to achieve multi-domain image-to-image translation on a human face dataset, they define attributes (e.g., gender or hair color) as domains. In our work, we define domains by dataset properties rather than attributes: images from different domains may share the same attributes, but an image cannot belong to two domains at the same time.

With a unified framework and no restriction on the dimension of the domain vector, UFDN can perform image-to-image translation over multiple domains. As shown in Figure 3(a), we demonstrate the results of image-to-image translation between the sketch/photo/paint domains.

Figure 3: (a) Example results of image-to-image translation across the sketch/photo/paint data domains and (b) example image translation results with randomly generated identity (i.e., randomly sampled z).

Previous works [25, 15] observed that even if a disentangled feature fed to the generator/decoder is binary during training, it can be treated as a continuous variable during testing. Our model inherits this property, enabling continuous cross-domain image translation by manipulating the value of the domain vector, as sketched below.
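Concretely, continuous translation amounts to linearly interpolating the domain vector at test time. A minimal sketch follows (the function name and step count are ours; E and G follow the interfaces assumed in the Sect. 3 sketches, and we use the posterior mean as the representation):

```python
import torch

def translate_continuously(E, G, x, v_src, v_tgt, steps=5):
    """Encode x once, then decode with domain vectors interpolated from the
    source domain code v_src to the target domain code v_tgt."""
    with torch.no_grad():
        mu, _ = E(x)  # posterior mean as the domain-invariant representation z
        frames = []
        for t in torch.linspace(0.0, 1.0, steps):
            v = (1 - t) * v_src + t * v_tgt  # e.g., 0.5*sketch + 0.5*photo
            frames.append(G(mu, v.expand(x.size(0), -1)))
    return frames
```

For instance, with v_src = [1, 0, 0] (sketch) and v_tgt = [0, 1, 0] (photo), the intermediate frames realize the in-between domains illustrated in Figure 1.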
Our model is also capable of generating unseen images from randomly sampled representations in the latent space. Since the representation is sampled from a domain-invariant latent space, UFDN can further render it with any domain vector supplied. Figure 3(b) shows translation results for six randomly sampled identities. It is worth noting that this cannot be done by translation models without representation learning or by those using skip connections between encoder and decoder/generator.

Table 2: Quantitative evaluation of image-to-image translation on the human face dataset.

Method       | Sketch→Photo (SSIM / MSE / PSNR) | Paint→Photo (SSIM / MSE / PSNR)
E-CDRD [18]  | 0.6229 / 0.0207 / 16.86 | 0.5892 / 0.0174 / 17.61
StarGAN [4]  | 0.8026 / 0.0142 / 19.04 | 0.8496 / 0.0060 / 22.53
UFDN (Ours)  | 0.8222 / 0.0106 / 20.24 | 0.8798 / 0.0033 / 25.06

Table 2 provides a quantitative comparison of the recovered images between our proposed UFDN, E-CDRD [18], and StarGAN [4] (using the source code provided by the authors at https://github.com/yunjey/StarGAN). In our experiments, we converted photo images into sketches/paintings to collect cross-domain training data (but did not utilize this pairwise information during training). This is why we are able to observe the ground-truth photo images and calculate SSIM/MSE/PSNR values for the translated outputs. While both learn disentangled representations, our UFDN outperformed E-CDRD in terms of translation quality. It is also worth noting that our UFDN matched the performance of StarGAN, which was designed for image translation without learning any representation.

To further demonstrate the ability to disentangle representations, our model performs feature disentanglement of attributes shared across domains simultaneously. This can easily be done by expanding the domain vector with annotated attributes from the dataset. In our experiments, Gender and Smiling are picked as the attributes of interest; the results are shown in Figure 4. We used a fixed domain-invariant representation to show that features are highly disentangled by our UFDN: all information of interest (domain/gender/smiling) can be independently manipulated through our model. Note that every result above was produced by the same single model.

Figure 4: Example results of multi-domain image translation. Note that all images are produced by the same z with varying domain information.

4.3 Unsupervised domain adaptation with domain-invariant representation

Unsupervised domain adaptation (UDA) aims to classify samples in a target domain when labels are only available in the source domain. Previous works [6, 27] were dedicated to building a domain-invariant classifier for the UDA task. Recent works [16, 18] addressed the problem using a classifier with high-level layers tied across domains and synthesized training data provided by image-to-image translation. Following previous works, we tackle the UDA task of digit classification over the three datasets MNIST/USPS/SVHN. The notation "→" denotes the relation between source and target domains; for example, SVHN→MNIST indicates that SVHN is the source domain with categorical labels.

To verify the robustness of our domain-invariant representation, we adapt our model to the UDA task by adding a single fully-connected layer as the digit classifier. This classifier simply takes the domain-invariant representation as input and predicts the digit label. The auxiliary classifier is jointly trained with our UFDN.
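As a sketch of this setup (again with illustrative shapes; the paper specifies only that the classifier is a single fully-connected layer on top of z):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DigitClassifier(nn.Module):
    """Auxiliary UDA head: one fully-connected layer on the
    domain-invariant representation z."""
    def __init__(self, z_dim=64, n_classes=10):  # illustrative sizes
        super().__init__()
        self.fc = nn.Linear(z_dim, n_classes)

    def forward(self, z):
        return self.fc(z)

def uda_loss(E, clf, x_src, y_src):
    """Supervised digit loss on labeled source-domain images only; optimized
    jointly with the UFDN objectives, so the shared z must stay both
    domain-invariant and discriminative for digits."""
    mu, _ = E(x_src)
    return F.cross_entropy(clf(mu), y_src)
```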
Table 3: Performance comparisons of unsupervised domain adaptation (classification accuracy, in %, on target-domain data). For example, MNIST→USPS denotes MNIST and USPS as the source and target domains, respectively.

Method      | MNIST→USPS | USPS→MNIST | SVHN→MNIST
SA [5]      | 67.78 | 48.80 | 59.32
DANN [6]    | -     | -     | 73.85
DTN [24]    | -     | -     | 84.88
DRCN [7]    | 91.80 | 73.70 | 82.00
CoGAN [17]  | 95.65 | 93.15 | -
ADDA [27]   | 89.40 | 90.10 | 76.00
UNIT [16]   | 95.97 | 93.58 | 90.53
ADGAN [23]  | 92.80 | 90.80 | 92.40
CDRD [18]   | 95.05 | 94.35 | -
UFDN (Ours) | 97.13 | 93.77 | 95.01

Table 3 compares the performance of our model to others. For the setting MNIST→USPS, our model surpasses UNIT [16], the previous state of the art. For SVHN→MNIST, our model also surpasses the state of the art by a significant margin: while SVHN→MNIST is considered much more difficult than the other two settings, our model decreases the classification error rate from 7.6% to 5%. It is also worth mentioning that our model used 60K images from SVHN, considerably fewer than the 531K used by UNIT.

We visualize the domain-invariant representations with t-SNE in Figure 5. From Figures 5(a) and 5(b), we can see that the representation is properly clustered with respect to digit class rather than domain. We also synthesize images from the domain-invariant representation: as shown in Figure 6, by manipulating the domain vector, the representation of an SVHN image can be transformed to MNIST. This further supports our view that disentangled representations are worth learning.

Figure 5: t-SNE visualization of SVHN→MNIST. Note that different colors indicate data of (a) different domains and (b) different digit classes.

Figure 6: Example image translation results of SVHN→MNIST.

4.4 Ablation study

As described in Section 3, our framework combines self-supervised feature disentanglement with adversarial learning in the pixel space. To verify the effect of each component, we conducted an ablation study on the proposed framework, with results shown in Figure 7. Without self-supervised feature disentanglement (i.e., without D_v), the generator can reconstruct images from the entangled representation while ignoring the domain vector. This is verified in Figure 7(a), where self-supervised feature disentanglement is disabled, so the representation is not trained to be domain-invariant; in this case, the decoder simply decodes the input representation back to its source domain, ignoring the domain vector. Next, we disabled pixel-space adversarial learning to verify that the representation is indeed forced to be domain-invariant. As shown in Figure 7(b), the generator is now forced to synthesize images conditioned on the manipulated domain vector; however, without pixel-space adversarial learning, the difference between the photo and paint domains is less apparent than in the complete version of our UFDN.

Figure 7: Comparison of example image translation results: (a) our UFDN without self-supervised feature disentanglement, (b) our UFDN without adversarial training in the pixel space, and (c) the full version of UFDN.

5 Conclusion

We proposed a novel network architecture, the Unified Feature Disentanglement Network (UFDN), which learns disentangled feature representations for data across multiple domains via a unique encoder-generator architecture with adversarial learning. With superior properties over recent image translation works, our model not only produces promising qualitative results but also enables unsupervised domain adaptation, confirming the effectiveness of the derived deep features on these tasks.

References

[1] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(8):1798-1828, 2013.
[2] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems (NIPS), pages 343-351, 2016.
[3] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2016.
[4] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[5] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
[6] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
[7] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[9] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[11] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
[12] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems (NIPS), 2014.
[13] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[14] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems (NIPS), 2015.
[15] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, et al. Fader networks: Manipulating images by sliding attributes. In Advances in Neural Information Processing Systems (NIPS), 2017.
[16] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems (NIPS), 2017.
[17] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
[18] Y.-C. Liu, Y.-Y. Yeh, T.-C. Fu, W.-C. Chiu, S.-D. Wang, and Y.-C. F. Wang. Detach and adapt: Learning cross-domain disentangled deep representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[19] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[20] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
[21] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[22] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning (ICML), 2014.
[23] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[24] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[25] L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[26] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[27] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[28] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[29] S. Zhao, J. Song, and S. Ermon. Towards deeper understanding of variational autoencoding models. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
[30] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.