{"title": "One-Sided Unsupervised Domain Mapping", "book": "Advances in Neural Information Processing Systems", "page_first": 752, "page_last": 762, "abstract": "In unsupervised domain mapping, the learner is given two unmatched datasets $A$ and $B$. The goal is to learn a mapping $G_{AB}$ that translates a sample in $A$ to the analog sample in $B$. Recent approaches have shown that when learning simultaneously both $G_{AB}$ and the inverse mapping $G_{BA}$, convincing mappings are obtained. In this work, we present a method of learning $G_{AB}$ without learning $G_{BA}$. This is done by learning a mapping that maintains the distance between a pair of samples. Moreover, good mappings are obtained, even by maintaining the distance between different parts of the same sample before and after mapping. We present experimental results that the new method not only allows for one sided mapping learning, but also leads to preferable numerical results over the existing circularity-based constraint. Our entire code is made publicly available at~\\url{https://github.com/sagiebenaim/DistanceGAN}.", "full_text": "One-Sided Unsupervised Domain Mapping\n\nSagie Benaim1 and Lior Wolf1,2\n\n1The Blavatnik School of Computer Science , Tel Aviv University, Israel\n\n2Facebook AI Research\n\nAbstract\n\nIn unsupervised domain mapping, the learner is given two unmatched datasets\nA and B. The goal is to learn a mapping GAB that translates a sample in A\nto the analog sample in B. Recent approaches have shown that when learning\nsimultaneously both GAB and the inverse mapping GBA, convincing mappings\nare obtained. In this work, we present a method of learning GAB without learning\nGBA. This is done by learning a mapping that maintains the distance between\na pair of samples. 
Moreover, good mappings are obtained even by maintaining the distance between different parts of the same sample before and after mapping. We present experimental results showing that the new method not only allows for one-sided mapping learning, but also leads to better numerical results than the existing circularity-based constraint. Our entire code is made publicly available at https://github.com/sagiebenaim/DistanceGAN.

1 Introduction

The advent of the Generative Adversarial Network (GAN) [6] technology has allowed for the generation of realistic images that mimic a given training set by accurately capturing what is inside the given class and what is "fake". Out of the many tasks made possible by GANs, the task of mapping an image in a source domain to the analog image in a target domain is of particular interest.

The solutions proposed for this problem can generally be separated by the amount of required supervision. On one extreme, fully supervised methods employ pairs of matched samples, one in each domain, in order to learn the mapping [9]. Less direct supervision was demonstrated by employing a mapping into a semantic space and requiring that the original sample and the analog sample in the target domain share the same semantic representation [22]. If the two domains are highly related, it was demonstrated that just by sharing weights between the networks working on the two domains, and without any further supervision, one can map samples between the two domains [21, 13].
For more distant domains, it was demonstrated recently that by symmetrically learning mappings in both directions, meaningful analogs are obtained [28, 11, 27]. This is done by requiring circularity, i.e., that mapping a sample from one domain to the other and then back produces the original sample.

In this work, we go a step further and show that it is possible to learn the mapping between the source domain and the target domain in a one-sided unsupervised way, by enforcing high cross-domain correlation between the matching pairwise distances computed in each domain. The new constraint allows one-sided mapping and also provides, in our experiments, better numerical results than circularity. Combining both constraints together often leads to further improvements. Training with the new constraint requires comparing pairs of samples. While there is no real practical reason not to do so, since training batches contain multiple samples, we demonstrate that similar constraints can even be applied per image, by computing the distance between, e.g., the top part of the image and the bottom part.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1.1 Related work

Style transfer These methods [5, 23, 10] typically receive as input a style image and a content image and create a new image that has the style of the first and the content of the second. The problem of image translation between domains differs, since when mapping between domains, part of the content is replaced with new content that matches the target domain, and not just the style. However, the distinction is not sharp, and many of the cross-domain mapping examples in the literature can almost be viewed as style transfers. For example, while a zebra is not a horse in another style, the horse to zebra mapping performed in [28] seems to change horse skin to zebra skin.
This is evident from the striped Putin example obtained when mapping the image of shirtless Putin riding a horse.

Generative Adversarial Networks GAN [6] methods train a generator network G that synthesizes samples from a target distribution, given noise vectors, by jointly training a second network D. The specific generative architecture we and others employ is based on the architecture of [18]. In image mapping, the created image is based on an input image and not on random noise [11, 28, 27, 13, 22, 9].

Unsupervised Mapping The work that is most related to ours employs no supervision except for sample images from the two domains. This was done very recently [11, 28, 27] in image to image translation, and slightly earlier for translating between natural languages [24]. Note that [11] proposes the "GAN with reconstruction loss" method, which applies the cycle constraint on one side and trains only one GAN. However, unlike our method, this method requires the recovery of both mappings and is outperformed by the full two-way method.

The CoGAN method [13] learns a mapping from a random input vector to matching samples from the two domains. It was shown in [13, 28] that the method can be modified in order to perform domain translation. In CoGAN, the two domains are assumed to be similar and their generators (and GAN discriminators) share many of their layer weights, similar to [21]. As was demonstrated in [28], the method is not competitive in the field of image to image translation.

Weakly Supervised Mapping In [22], the matching between the source domain and the target domain is performed by incorporating a fixed pre-trained feature map f and requiring f-constancy, i.e., that the activations of f are the same for the input samples and for the mapped samples.

Supervised Mapping When provided with matching pairs of (input image, output image), the supervision can be performed directly.
An example of such a method that also uses GANs is [9], where the discriminator D receives a pair of images in which one image is the source image and the other is either the matching target image ("real" pair) or a generated image ("fake" pair); the linking between the source and the target image is further strengthened by employing the U-net architecture [19].

Domain Adaptation In this setting, we are typically given two domains, one having supervision in the form of matching labels, while the second has little or no supervision. The goal is to learn to label samples from the second domain. In [3], what is common to both domains and what is distinct is separated, thus improving on existing models. In [2], a transformation is learned, on the pixel level, from one domain to another, using GANs. In [7], an unsupervised adversarial approach to semantic segmentation, which uses both global and category-specific domain adaptation techniques, is proposed.

2 Preliminaries

In the problem of unsupervised mapping, the learning algorithm is provided with unlabeled datasets from two domains, A and B. The first dataset includes i.i.d. samples from the distribution pA and the second dataset includes i.i.d. samples from the distribution pB. Formally, given

{xi}, i = 1, …, m, such that xi ∼ pA i.i.d., and {xj}, j = 1, …, n, such that xj ∼ pB i.i.d.,

our goal is to learn a function GAB, which maps samples in domain A to analog samples in domain B; see examples below.
In previous work [11, 28, 27], it is necessary to simultaneously recover a second function GBA, which similarly maps samples in domain B to analog samples in domain A.

Justification In order to allow unsupervised learning of a one-directional mapping, we introduce the constraint that pairs of inputs x, x′, which are at a certain distance from each other, are mapped to pairs of outputs GAB(x), GAB(x′) with a similar distance, i.e., that the distances ‖x − x′‖ and ‖GAB(x) − GAB(x′)‖ are highly correlated. As we show below, it is reasonable to assume that this constraint approximately holds in many of the scenarios demonstrated by previous work on domain translation. Although approximate, it is sufficient, since as was shown in [21], mapping between domains requires only little supervision on top of requiring that the output distribution of the mapper matches that of the target distribution.

Figure 1: Each triplet shows the source handbag image, the target shoe as produced by CycleGAN's [28] mapper GAB, and the result of approximating GAB by a fixed nonnegative linear transformation T, which obtains each output pixel as a linear combination of input pixels. The linear transformation captures the essence of GAB, showing that much of the mapping is achieved by a fixed spatial transformation.

Consider, for example, the case of mapping shoes to edges, as presented in Fig. 4. In this case, the edge points are simply a subset of the image coordinates, selected by a local image criterion. If image x is visually similar to image x′, it is likely that their edge maps are similar. In fact, this similarity underlies the usage of gradient information in the classical computer vision literature.
Therefore, while the distances are expected to differ in the two domains, one can expect a high correlation.

Next, consider the case of handbag to shoe mapping (Fig. 4). Analogs tend to have the same distribution of image colors in different image formations. Assuming that the spatial pixel locations of handbags follow a tight distribution (i.e., the set of handbag images share the same shapes), and the same holds for shoes, then there exists a set of canonical displacement fields that transform a handbag to a shoe. If there were a single displacement that happened to be a fixed permutation of pixel locations, distances would be preserved. In practice, the image transformations are more complex. To study whether the image displacement model is a valid approximation, we learned a nonnegative linear transformation T ∈ R^(64²×64²) that maps, one channel at a time, handbag images of size 64×64×3 to the output shoe images of the same size given by the CycleGAN method. T's columns can be interpreted as weights that determine the spread of mass in the output image for each pixel location in the input image. It was estimated by minimizing the squared error of mapping every channel (R, G, or B) of a handbag image to the same channel in the matching shoe. Optimization was done by gradient descent with a projection onto the space of nonnegative matrices, i.e., zeroing the negative elements of T at each iteration.

Sample mappings by the matrix T are shown in Fig. 1. As can be seen, the nonnegative linear transformation approximates CycleGAN's multilayer CNN GAB to some degree.
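The projected-gradient fit described above is simple enough to sketch. The following toy numpy reconstruction is illustrative only, not the authors' code: image sizes are shrunk, the data is synthetic, and all variable names are ours.

```python
import numpy as np

# Toy reconstruction of the projected-gradient fit described in the text:
# learn a nonnegative matrix T that maps (flattened) source images X to
# target images Y, zeroing T's negative entries after every update.
rng = np.random.default_rng(0)
d, n = 16, 200                    # stand-ins for 64*64 pixels and the dataset size
X = rng.random((d, n))            # one flattened source channel per column
T_true = np.clip(rng.random((d, d)) - 0.5, 0.0, None)
Y = T_true @ X                    # synthetic "mapped" images

T = np.zeros((d, d))
lr = 0.05
for _ in range(500):
    grad = (T @ X - Y) @ X.T / n           # gradient of 0.5 * ||T X - Y||^2 / n
    T = np.clip(T - lr * grad, 0.0, None)  # project onto nonnegative matrices

rel_err = np.linalg.norm(T @ X - Y) / np.linalg.norm(Y)
```

After the fit, the entries of T can be inspected the same way the paper does (row sums near 1, most entries near zero) to judge how permutation-like the learned map is.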
Examining the elements of T, they share some properties with permutation matrices: the mean sum of the rows is 1.06 (SD 0.08) and 99.5% of the elements are below 0.01.

In the case of adding glasses or changing gender or hair color (Fig. 3), a relatively minor image modification, which does not significantly change the majority of the image information, suffices to create the desired visual effect. Such a change is likely to largely maintain the pairwise image distances before and after the transformation.

In the case of computer-generated heads at different angles vs. rotated cars, presented in [11], distances are highly correlated, partly because the area captured by the foreground object is a good indicator of the object's yaw. When mapping horses to zebras [28], the texture of a horse's skin is transformed to that of a zebra. In this case, most of the image information is untouched, and the part that is changed is modified by a uniform texture, again approximately maintaining pairwise distances. In Fig. 2(a), we compare the L1 distance in RGB space of pairs of horse images to the distance of the samples after mapping by the CycleGAN network [28], using the public implementation. It is evident that the cross-domain correlation between pairwise distances is high. We also looked at Cityscapes image and ground truth label pairs in Fig. 2(c), and found that there is a high correlation between the distances. This is also the case in many other literature-based mappings between datasets and ground truth pairs that we have tested.

While there is little downside to working with pairs of training images in comparison to working with single images, in order to further study the amount of information needed for successful alignment, we also consider distances between the two halves of the same image.
We compare the L1 distance between the left and right halves as computed on the input image to that which is obtained on the generated image or the corresponding ground truth image. Fig. 2(b) and Fig. 2(d) present the results for horses to zebras translation and for Cityscapes image and label pairs, respectively. As can be seen, the correlation is also very significant in this case.

Figure 2: Justifying the high correlation between distances in different domains. (a) Using the CycleGAN model [28], we map horses to zebras and vice versa. Green circles are used for the distance between two random horse images and the two corresponding translated zebra images. Blue crosses are for the reverse direction, translating zebra to horse images. The Pearson correlation for horse to zebra translation is 0.77 (p-value 1.7e−113) and for zebra to horse it is 0.73 (p-value 8.0e−96). (b) As in (a), but using the distance between two halves of the same image that is either a horse image translated to a zebra or vice versa. The Pearson correlation for horse to zebra translation is 0.91 (p-value 9.5e−23) and for zebra to horse it is 0.87 (p-value 9.7e−19). (c) Cityscapes images and associated labels. Green circles are used for the distance between two Cityscapes images and the two corresponding ground truth images. The Pearson correlation is 0.65 (p-value 6.0e−16). (d) As in (c), but using the distance between two halves of the same image. The Pearson correlation is 0.65 (p-value 1.4e−12).

From Correlations to Sum of Absolute Differences We have provided justification and empirical evidence that for many semantic mappings, there is a high degree of correlation between the pairwise distances in the two domains.
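The correlation statistic used throughout Fig. 2 is straightforward to compute. The sketch below reproduces the check on synthetic data; a fixed pixel permutation plus noise stands in for the trained mapper, and all names are illustrative.

```python
import numpy as np

# Correlate L1 distances between random image pairs with the distances
# between their "mapped" versions. A fixed permutation preserves L1
# distances exactly, so the correlation should be close to 1.
rng = np.random.default_rng(1)
n, d = 100, 64
images = rng.random((n, d))                     # flattened toy images
perm = rng.permutation(d)                       # stand-in for the mapper
mapped = images[:, perm] + 0.05 * rng.standard_normal((n, d))

i = rng.integers(0, n, 500)                     # random pairs (i, j)
j = rng.integers(0, n, 500)
d_src = np.abs(images[i] - images[j]).sum(axis=1)   # ||x - x'||_1
d_tgt = np.abs(mapped[i] - mapped[j]).sum(axis=1)   # ||G(x) - G(x')||_1
r = np.corrcoef(d_src, d_tgt)[0, 1]                 # Pearson correlation
```

For a trained network, `mapped` would instead hold GAB applied to the images, and the same statistic (with a p-value, e.g. via scipy.stats.pearsonr) yields the numbers quoted in the caption of Fig. 2.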
In other words, let (dk) be the vector of centered and unit-variance normalized pairwise distances in one domain and let (d′k) be the vector of normalized distances obtained in the other domain by translating each image of each pair between the domains; then Σk dk d′k should be high. When training the mapper GAB, the mean and variance used for normalization in each domain are precomputed based on the training samples in each domain, which assumes that the post-mapping distribution of samples is similar to the training distribution.

The pairwise distances in the source domain dk are fixed, and maximizing Σk dk d′k causes pairwise distances dk with large absolute value to dominate the optimization. Instead, we propose to minimize the sum of absolute differences Σk |dk − d′k|, which spreads the error in distances uniformly. The two losses −Σk dk d′k and Σk |dk − d′k| are highly related, and the negative correlation between them was explicitly computed for simple distributions and shown to be very strong [1].

3 Unsupervised Constraints on the Learned Mapping

There are a few types of constraints suggested in the literature which do not require paired samples. First, one can enforce the distribution of GAB(x), x ∼ pA, which we denote as GAB(pA), to be indistinguishable from that of pB. In addition, one can require that mapping from A to B and back would lead to an identity mapping. Another suggested constraint is that for every x ∈ B, GAB(x) = x. We review these constraints and then present the new constraints we propose.

Adversarial constraints Our training sets are viewed as two discrete distributions p̂A and p̂B that are sampled from the source and target domain distributions pA and pB, respectively. For the learned network GAB, the similarity between the distributions GAB(pA) and pB is modeled by a GAN.
This involves the training of a discriminator network DB : B → {0, 1}. The loss is given by:

LGAN(GAB, DB, p̂A, p̂B) = E_{xB ∼ p̂B}[log DB(xB)] + E_{xA ∼ p̂A}[log(1 − DB(GAB(xA)))]

This loss is minimized over GAB and maximized over DB. When both GAB and GBA are learned simultaneously, there is an analog expression LGAN(GBA, DA, p̂B, p̂A), in which the domains A and B switch roles, and the two losses (and four networks) are optimized jointly.

Circularity constraints In three recent reports [11, 28, 27], circularity loss was introduced for image translation. The rationale is that given a sample from domain A, translating it to domain B and then back to domain A should result in the identical sample. Formally, the following loss is added:

Lcycle(GAB, GBA, p̂A) = E_{x ∼ p̂A} ‖GBA(GAB(x)) − x‖₁

The L1 norm employed above was found to be mostly preferable, although L2 gives similar results. Since the circularity loss requires the recovery of the mappings in both directions, it is usually employed symmetrically, by considering Lcycle(GAB, GBA, p̂A) + Lcycle(GBA, GAB, p̂B). The circularity constraint is often viewed as a definite requirement for admissible functions GAB and GBA. However, just like distance-based constraints, it is an approximate one. To see this, consider the zebra to horse mapping example. Mapping a zebra to a horse means losing the stripes. The inverse mapping, therefore, cannot be expected to recover the exact input stripes.

Target Domain Identity A constraint that has been used in [22] and in some of the experiments in [28] states that GAB applied to samples from the domain B performs the identity mapping.
We did not experiment with this constraint, and it is given here for completeness:

LT-ID(GAB, p̂B) = E_{x ∼ p̂B} ‖x − GAB(x)‖₂

Distance Constraints The adversarial loss ensures that samples from the distribution of A are translated to samples in the distribution of B. However, there are many such possible mappings. Given a mapping of n samples of A to n samples of B, one can consider any permutation of the samples in B as a valid mapping and, therefore, the space of functions mapping from A to B is very large. Adding the circularity constraint enforces the mapping from B to A to be the inverse of the permutation that occurs from A to B, which reduces the number of admissible permutations.

To further reduce this space, we propose a distance preserving map, that is, the distance between two samples in A should be preserved in the mapping to B. We therefore consider the following loss, which is the expectation of the absolute differences between the distances in each domain, up to scale:

Ldistance(GAB, p̂A) = E_{xi,xj ∼ p̂A} | (1/σA)(‖xi − xj‖₁ − μA) − (1/σB)(‖GAB(xi) − GAB(xj)‖₁ − μB) |

where μA, μB (σA, σB) are the means (standard deviations) of the pairwise distances in the training sets from A and B, respectively, and are precomputed.

In practice, we compute the loss over pairs of samples that belong to the same minibatch during training. Even for minibatches with 64 samples, as in DiscoGAN [11], considering all pairs is feasible. If needed, for even larger minibatches, one can subsample the pairs. When the two mappings are simultaneously learned, Ldistance(GBA, p̂B) is similarly defined.
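Written out directly, this loss amounts to a few lines per minibatch. In the numpy sketch below, names are illustrative: `mapped` stands in for GAB applied to the batch, and the μ, σ arguments are the precomputed per-domain statistics.

```python
import numpy as np

# Pairwise distance loss over one minibatch: for every pair inside the batch,
# compare the normalized L1 distance before and after mapping.
def distance_loss(batch, mapped, mu_a, sigma_a, mu_b, sigma_b):
    total, count = 0.0, 0
    for i in range(len(batch)):
        for j in range(i + 1, len(batch)):          # all pairs in the minibatch
            da = np.abs(batch[i] - batch[j]).sum()      # ||x_i - x_j||_1
            db = np.abs(mapped[i] - mapped[j]).sum()    # ||G(x_i) - G(x_j)||_1
            total += abs((da - mu_a) / sigma_a - (db - mu_b) / sigma_b)
            count += 1
    return total / count

rng = np.random.default_rng(2)
batch = rng.random((8, 32))
mapped = 2.0 * batch            # a mapping that rescales every distance by 2
loss = distance_loss(batch, mapped, mu_a=10.0, sigma_a=2.0,
                     mu_b=20.0, sigma_b=4.0)
```

Because each domain's distances are normalized by its own precomputed moments, a mapping that merely rescales all distances (here μB = 2μA and σB = 2σA) incurs zero loss, matching the "up to scale" qualification in the text.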
In both cases, the absolute difference of the L1 distances between the pairs in the two domains is considered.

In comparison to circularity, the distance-based constraint does not suffer from the mode collapse problem described in [11]. In this phenomenon, two different samples from domain A are mapped to the same sample in domain B. The mapping in the reverse direction then generates an average of the two original samples, since the sample in domain B should be mapped back to both the first and the second original samples in A. Pairwise distance constraints prevent this from happening.

Self-distance Constraints Whether or not the distance constraint is more effective than the circularity constraint in recovering the alignment, the distance-based constraint has the advantage of being one sided. However, it requires that pairs of samples are transferred at once, which, while having little implication on the training process as it is currently done, might affect the ability to perform online learning. Furthermore, the official CycleGAN [28] implementation employs minibatches of size one. We, therefore, suggest an additional constraint, which employs one sample at a time and compares the distances between two parts of the same sample.

Let L, R : R^(h×w) → R^(h×w/2) be the operators that, given an input image, return its left or right half.
We define the following loss:

Lself-distance(GAB, p̂A) = E_{x ∼ p̂A} | (1/σA)(‖L(x) − R(x)‖₁ − μA) − (1/σB)(‖L(GAB(x)) − R(GAB(x))‖₁ − μB) |   (1)

where μA and σA are the mean and standard deviation of the distances between the two halves of the image in the training set from domain A, and similarly for μB and σB; e.g., given the training set {xj}, j = 1, …, n, from B, μB is precomputed as (1/n) Σj ‖L(xj) − R(xj)‖₁.

3.1 Network Architecture and Training

When training the networks GAB, GBA, DB and DA, we employ the following loss, which is minimized over GAB and GBA and maximized over DB and DA:

α1A LGAN(GAB, DB, p̂A, p̂B) + α1B LGAN(GBA, DA, p̂B, p̂A) + α2A Lcycle(GAB, GBA, p̂A) + α2B Lcycle(GBA, GAB, p̂B) + α3A Ldistance(GAB, p̂A) + α3B Ldistance(GBA, p̂B) + α4A Lself-distance(GAB, p̂A) + α4B Lself-distance(GBA, p̂B)

where the αiA, αiB are trade-off parameters. We did not test the distance constraint and the self-distance constraint jointly, so in every experiment, either α3A = α3B = 0 or α4A = α4B = 0. When performing one-sided mapping from A to B, only α1A and either α3A or α4A are non-zero.

We consider A and B to be subsets of R^(3×s×s), where s is either 64, 128 or 256, depending on the image resolution. In order to directly compare our results with previous work and to employ the strongest baseline on each dataset, we employ the generator and discriminator architectures of both DiscoGAN [11] and CycleGAN [28].

In DiscoGAN, the generator is built of an encoder-decoder unit. The encoder consists of convolutional layers with 4 × 4 filters followed by Leaky ReLU activation units.
The decoder consists of deconvolutional layers with 4 × 4 filters followed by ReLU activation units. Sigmoid is used for the output layer, and batch normalization [8] is used before the ReLU or Leaky ReLU activations. Between 4 and 5 convolutional/deconvolutional layers are used, depending on the domains used in A and B (we match the published code architecture per dataset). The discriminator is similar to the encoder, but has an additional convolutional layer as the first layer and a sigmoid output unit.

The CycleGAN architecture for the generator is based on [10]. The generators consist of two 2-stride convolutional layers, between 6 and 9 residual blocks depending on the image resolution, and two fractionally strided convolutions with stride 1/2. Instance normalization is used as in [10]. The discriminator uses 70 × 70 PatchGANs [9]. For training, CycleGAN employs two additional techniques. The first is to replace the negative log-likelihood by a least squares loss [25], and the second is to use a history of images for the discriminators, rather than only the last image generated [20].

Table 1: Tradeoff weights for each experiment.

Experiment     α1A  α1B  α2A  α2B  α3A  α3B  α4A  α4B
DiscoGAN       0.5  0.5  0.5  0.5  0    0    0    0
Distance →     0.5  0    0    0    0.5  0    0    0
Distance ←     0    0.5  0    0    0    0.5  0    0
Dist.+Cycle    0.5  0.5  0.5  0.5  0.5  0.5  0    0
Self Dist. →   0.5  0    0    0    0    0    0.5  0
Self Dist. ←   0    0.5  0    0    0    0    0    0.5

Table 2: Normalized RMSE between the angles of source and translated images.

Method       car2car  car2head
DiscoGAN     0.306    0.137
Distance     0.135    0.097
Dist.+Cycle  0.098    0.273
Self Dist.   0.117    0.197

Table 3: MNIST classification on mapped SVHN images.

Method       Accuracy
CycleGAN     26.1%
Distance     26.8%
Dist.+Cycle  18.0%
Self Dist.   25.2%

Table 4: CelebA mapping results using the VGG face descriptor (cosine similarity between input and mapped faces, and separation accuracy of a linear classifier between the two domains).

                 Male → Female        Blond → Black        Glasses → Without
Method           Cosine  Separation   Cosine  Separation   Cosine  Separation
DiscoGAN         0.23    0.87         0.15    0.89         0.13    0.84
Distance         0.32    0.88         0.24    0.92         0.42    0.79
Distance+Cycle   0.35    0.87         0.24    0.91         0.41    0.82
Self Distance    0.24    0.86         0.24    0.91         0.34    0.80
                 ---------------------- Other direction ----------------------
DiscoGAN         0.22    0.86         0.14    0.90         0.10    0.91
Distance         0.26    0.87         0.22    0.89         0.30    0.96
Distance+Cycle   0.31    0.89         0.22    0.85         0.30    0.95
Self Distance    0.24    0.91         0.19    0.81         0.30    0.94

4 Experiments

We compare multiple methods: the DiscoGAN or the CycleGAN baselines; the one-sided mapping using Ldistance (A → B or B → A); the combination of the baseline method with Ldistance; and the self-distance method. For DiscoGAN, we use a fixed weight configuration for all experiments, as shown in Tab. 1. For CycleGAN, there is more sensitivity to the parameters and, while the general pattern is preserved, we used different weights for the distance constraint depending on the experiment (digits, or horses to zebras).

Models based on DiscoGAN Datasets that were tested by DiscoGAN are evaluated here using this architecture. In initial tests, CycleGAN is not competitive on these out of the box. The first set of experiments maps rotated images of cars to either cars or heads. The 3D car dataset [4] consists of rendered images of 3D cars whose viewing angle varies at 15° intervals. Similarly, the head dataset [17] consists of 3D images of rotated heads whose angles vary from −70° to 70°. For the car2car experiment, the car dataset is split into two parts, one of which is used for A and one for B (each is further split into a train and a test set).
Since the rotation angle presents the largest source of variability, and since the rotation operation is shared between the datasets, we expect it to be the major invariant that the network learns, i.e., a semantic mapping would preserve angles.

A regressor was trained to calculate the angle of a given car image based on the training data. Tab. 2 shows the Root Mean Square Error (RMSE) between the angles of the source and translated images. As can be seen, the pairwise-distance-based mapping results in a lower error than the DiscoGAN one, combining both further improves the results, and self-distance outperforms both DiscoGAN and pairwise distance. The original DiscoGAN implementation was used, but due to differences in evaluation (different regressors), these numbers are not compatible with the graph shown in the DiscoGAN paper.

For car2head, DiscoGAN's solution produces mirror images, and the combination of DiscoGAN's circularity constraint with the distance constraint produces a solution that is rotated by 90°. We consider these biases ambiguities in the mapping and not mistakes and, therefore, remove the mean error prior to computing the RMSE. In this experiment, distance outperforms all other methods. The combination of both methods is less competitive than either one, perhaps since each method pulls toward a different solution. Self-distance is worse than circularity on this dataset.

Another set of experiments arises from considering face images with and without a certain property. CelebA [26, 14] was annotated for multiple attributes, including the person's gender, hair color, and the existence of glasses in the image. Following [11], we perform mapping between two values of each of these three properties. The results are shown in the supplementary material, with some examples in Fig. 3.
It is evident that the DiscoGAN method (using the unmodified authors' implementation) presents many more failure cases than our pair-based method. The self-distance method was implemented with the top and bottom image halves, instead of left-to-right distances, since faces are symmetric. This method also seems to outperform DiscoGAN.

In order to evaluate how well the face translation was performed, we use the representation layer of VGG faces [16] on the image in A and its output in B. One can assume that two images that match will have many similar features, and so the VGG representations will be similar. The cosine similarities, as evaluated between input images and their mapped versions, are shown in Tab. 4. In all cases, the pair-distance produces more similar input-output faces. Self-distance performs slightly worse than pairs, but generally better than DiscoGAN. Applying circularity together with pair-distance provides the best results, but requires, unlike the distance, learning both sides simultaneously.

While we create images that better match in the face descriptor metric, our ability to create images that are faithful to the second distribution is not impaired. This is demonstrated by learning a linear classifier between the two domains based on the training samples and then applying it to a set of test images before and after mapping. The separation accuracy between the input test images and the mapped versions is also shown in Tab. 4. As can be seen, the separation ability of our method is similar to that of DiscoGAN (it arises from the shared GAN terms).

We additionally perform a user study to assess the quality of our results. The user is first presented with a set of real images from the dataset. Then, 50 random pairs of images are presented to the user, each for a second: one trained using DiscoGAN and one using our method. The user is asked to decide which image looks more realistic. The test was performed on 22 users.
On the shoes to handbags translation, our method was judged more realistic in 65% of the cases. For handbags to shoes, the score was 87%. For male to female, both methods showed a similar realness score (51% vs. 49% for DiscoGAN). We, therefore, asked a second question: given the face of a male, which of the two generated female variants is a better fit to the original face? Our method wins 88% of the time.

In addition, in the supplementary material, we compare the losses of the GAN discriminator for the various methods and show that these values are almost identical. We also measure the losses of the various methods at test time, even when these were not directly optimized. For example, despite these constraints not being enforced, the distance-based methods present a low circularity loss, while DiscoGAN presents relatively high distance losses.

Sample results of mapping shoes to handbags and edges to shoes, and vice versa, using the DiscoGAN baseline architecture are shown in Fig. 4. More results are shown in the supplementary material. Visually, the results of the distance-based approach seem better than those of DiscoGAN, while the results of self-distance are somewhat worse. The combination of DiscoGAN and distance usually works best.

Models based on CycleGAN Using the CycleGAN architecture, we map horses to zebras; see Fig. 4 and the supplementary material for examples. Note that on the zebra to horse mapping, all methods fail, albeit in different ways. Subjectively, it seems that the distance + cycle method shows the most promise in this translation.

In order to obtain numerical results, we use the baseline CycleGAN method as well as our methods in order to translate from Street View House Numbers (SVHN) [15] to MNIST [12]. Accuracy is then measured in the MNIST space by using a neural net trained for this task. Results are shown in Tab. 3 and visually in the supplementary material.
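For reference, minimal versions of the two test-time losses compared in this section, the pairwise-distance loss and the circularity (cycle-consistency) loss, might look as follows. This is an illustrative, unnormalized sketch in our own notation; the actual training objective also standardizes the distances per domain.

```python
import numpy as np

def pairwise_distance_loss(x1, x2, y1, y2):
    """Absolute change in L1 distance between a pair of samples before
    (x1, x2) and after (y1, y2) mapping. A distance-preserving mapping
    drives this toward zero."""
    x1, x2, y1, y2 = (np.asarray(v, dtype=np.float64)
                      for v in (x1, x2, y1, y2))
    d_src = np.abs(x1 - x2).sum()  # L1 distance in the source domain
    d_tgt = np.abs(y1 - y2).sum()  # L1 distance after mapping
    return float(np.abs(d_src - d_tgt))

def circularity_loss(x, x_roundtrip):
    """Mean L1 reconstruction error of mapping a sample to the other
    domain and back, as in the cycle-consistency constraint."""
    x = np.asarray(x, dtype=np.float64)
    x_roundtrip = np.asarray(x_roundtrip, dtype=np.float64)
    return float(np.abs(x - x_roundtrip).mean())
```

Measuring both quantities on held-out samples, regardless of which one was optimized, is what allows the cross-method comparison reported in the supplementary material.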
While the pairwise-distance based method improves upon the baseline method, there is still a large gap between this unsupervised setting and the semi-supervised setting presented in [22], which achieves much higher accuracy. This can be explained by the large amount of irrelevant information in the SVHN images (examples are shown in the supplementary material). Combining the distance-based constraint with the circularity one does not work well on this dataset.

We additionally performed a quantitative evaluation using the FCN score, as in [28]. The FCN metric evaluates the interpretability of generated images: a generated cityscape image is passed through a semantic segmentation algorithm, and the resulting label map is compared to the ground-truth label map. FCN results are given as three measures: per-pixel accuracy, per-class accuracy, and class IOU. Our distance GAN method is preferable on all three scores (0.53 vs. 0.52, 0.19 vs. 0.17, and 0.11 vs. 0.11, respectively). The paired t-test p-values are 0.29, 0.002, and 0.42, respectively. In a user study similar to the one for DiscoGAN above, our cityscapes translation scores 71% for realness when compared to CycleGAN's. When looking at similarity to the ground-truth image, we score 68%.

Figure 3: Translations using various methods (input, DiscoGAN, Distance, Distance+cycle, Self-distance) on the celebA dataset: (a,b) Male to and from female. (c,d) Blond to and from black hair. (e,f) With eyeglasses to and from without eyeglasses.

Figure 4: Translations using various methods (input, Disco/CycleGAN, Distance, Distance+cycle, Self-distance): (a,b) Handbags to and from shoes. (c,d) Edges to and from shoes. (e,f) Horse to and from zebra.

5 Conclusion

We have proposed an unsupervised distance-based loss for learning a single mapping (without its inverse), which empirically outperforms the circularity loss.
It is interesting to note that the new loss is applied to raw RGB image values. This is in contrast to all of the work we are aware of that computes image similarity. Clearly, image descriptors or low-layer network activations could be used instead. However, by considering only RGB values, we not only show the general utility of our method, but also further demonstrate that a minimal amount of information is needed in order to form analogies between two related domains.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974). The authors would like to thank Laurens van der Maaten and Ross Girshick for insightful discussions.

References

[1] Werner Van Belle. Correlation between the inproduct and the sum of absolute differences is -0.8485 for uniform sampled signals on [-1:1]. Available at http://werner.yellowcouch.org/Papers/sadvssip/index.html, 2006.

[2] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, 2017.

[3] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In NIPS, pages 343-351, 2016.

[4] Sanja Fidler, Sven Dickinson, and Raquel Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. In NIPS, 2012.

[5] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.

[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.
Generative adversarial nets. In NIPS, 2014.

[7] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. 2016.

[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[9] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

[10] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.

[11] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.

[12] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.

[13] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In NIPS, pages 469-477, 2016.

[14] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.

[15] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[16] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.

[17] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D face model for pose and illumination invariant face recognition. In AVSS, 2009.

[18] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks.
arXiv preprint arXiv:1511.06434, 2015.

[19] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

[20] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. Learning from simulated and unsupervised images through adversarial training. arXiv preprint arXiv:1612.07828, 2016.

[21] Ilya Sutskever, Rafal Jozefowicz, Karol Gregor, Danilo Rezende, Tim Lillicrap, and Oriol Vinyals. Towards principled unsupervised learning. In ICLR Workshop, 2016.

[22] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. In ICLR, 2017.

[23] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.

[24] Yingce Xia, Di He, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. arXiv preprint arXiv:1611.00179, 2016.

[25] X. Mao, Q. Li, H. Xie, R. Y. Lau, and Z. Wang. Multi-class generative adversarial networks with the L2 loss function. arXiv preprint arXiv:1611.04076, 2016.

[26] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. From facial parts responses to face detection: A deep learning approach. In ICCV, pages 3676-3684, 2015.

[27] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510, 2017.

[28] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks.
arXiv preprint arXiv:1703.10593, 2017.