{"title": "Flow-based Image-to-Image Translation with Feature Disentanglement", "book": "Advances in Neural Information Processing Systems", "page_first": 4168, "page_last": 4178, "abstract": "Learning non-deterministic dynamics and intrinsic factors from images obtained through physical experiments is at the intersection of machine learning and material science. Disentangling the origins of uncertainties involved in microstructure growth, for example, is of great interest because future states vary due to thermal fluctuation and other environmental factors. To this end we propose a flow-based image-to-image model, called Flow U-Net with Squeeze modules (FUNS), that allows us to disentangle the features while retaining the ability to generate highquality diverse images from condition images. Our model successfully captures probabilistic phenomena by incorporating a U-Net-like architecture into the flowbased model. In addition, our model automatically separates the diversity of target images into condition-dependent/independent parts. We demonstrate that the quality and diversity of the images generated for microstructure growth and CelebA datasets outperform existing variational generative models.", "full_text": "Flow-based Image-to-Image Translation\n\nwith Feature Disentanglement\n\nRuho Kondo\n\nToyota Central R&D Labs.\n\nKeisuke Kawano\n\nToyota Central R&D Labs.\n\nr-kondo@mosk.tytlabs.co.jp\n\nkawano@mosk.tytlabs.co.jp\n\nSatoshi Koide\n\nToyota Central R&D Labs.\n\nTakuro Kutsuna\n\nToyota Central R&D Labs.\n\nkoide@mosk.tytlabs.co.jp\n\nkutsuna@mosk.tytlabs.co.jp\n\nAbstract\n\nLearning non-deterministic dynamics and intrinsic factors from images obtained\nthrough physical experiments is at the intersection of machine learning and material\nscience. Disentangling the origins of uncertainties involved in microstructure\ngrowth, for example, is of great interest because future states vary due to thermal\n\ufb02uctuation and other environmental factors. 
To this end, we propose a flow-based image-to-image model, called Flow U-Net with Squeeze modules (FUNS), that allows us to disentangle the features while retaining the ability to generate high-quality, diverse images from condition images. Our model successfully captures probabilistic phenomena by incorporating a U-Net-like architecture into the flow-based model. In addition, our model automatically separates the diversity of target images into condition-dependent and condition-independent parts. We demonstrate that the quality and diversity of the images generated for the microstructure-growth and CelebA datasets outperform those of existing variational generative models.

1 Introduction

Recently, machine learning models for generating various images conditioned on another image (diverse image-to-image models) have been developed [1, 2, 3, 4, 5] based on variational autoencoders (VAEs) [6] or generative adversarial networks (GANs) [7]. In the field of material science, these models can be used for learning the relationship between initial microstructure images and those after material processing. Figure 1 shows an example of microstructure growth, in which various microstructures (x) are obtained from an initial condition (c) via phase separation [8]. Such diversity is due not only to the elapsed processing time but also to other environmental factors such as thermal fluctuation [8, 9, 10]. Our first goal is to model such non-deterministic image translation, i.e., to generate diverse images (x) from the corresponding initial conditions (c).

Figure 1: Illustration of our tasks: learning microstructure growth and disentangling its features.
Condition and target are indicated as c and x, respectively.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our second goal is feature disentanglement. There are several origins of diversity within the above-mentioned image translations; some stem from the given conditions (e.g., the mixture ratio of the two compounds), and others stem from environmental factors such as thermal fluctuation. For material property optimization [11], disentangling such diversity within those images is also important. Figure 1 illustrates our concept, which aims to disentangle the latent feature into (1) condition-dependent (referred to as "cross-conditional" and "condition-specific") and (2) condition-independent parts.

Flow-based generative models [12, 13, 14] are suitable when high-quality and diverse image generation with stable learning is required. Additionally, from the viewpoint of application to downstream tasks such as property optimization [15] and inverse analysis [16], the invertible mapping between images and latent codes is a useful feature of flow-based models. However, to the best of the authors' knowledge, neither image-to-image translation nor feature disentanglement has yet been incorporated into flow-based models. With that point in mind, we propose a new flow-based image-to-image generative model and a novel technique that enables feature disentanglement in flow-based models. To condition a flow with an image, the prior distribution of the flow is associated with the condition image through a multi-scale encoder.
To disentangle latent features into condition-specific and condition-invariant parts, we force a part of the outputs of the encoder to be independent of the condition image.

Our contributions can be summarized as follows:

• We propose a diverse image-to-image translation model based on flow that outperforms current state-of-the-art variational models in terms of the Fréchet Inception Distance (FID) [17] and Learned Perceptual Image Patch Similarity (LPIPS) [18] for the CelebFaces Attributes (CelebA) [19] dataset and our original microstructure dataset created by solving the Cahn-Hilliard-Cook (CHC) equation [8].

• We propose a method to incorporate a feature disentanglement mechanism into flow-based generative models. The proposed method successfully separates the embedded features into condition-dependent and condition-independent parts.

The remainder of this paper is organized as follows. Section 2 states the preliminaries and Section 3 outlines our proposed model. Section 4 describes related work. Section 5 shows the results of image generation using the CelebA [19] and CHC datasets. Section 6 summarizes and concludes our study.

2 Preliminaries

Flow with multi-scale architecture. In flow-based generative models [12, 13, 14], the data distribution and the latent distribution are connected by an invertible mapping. The map f_θ : x ↦ z is commonly expressed by a neural network and is trained to maximize the likelihood, measured on the latent distribution, of data points pushed forward from the data distribution. Here, θ, x, and z are the parameter, target image, and latent vector, respectively. Flow-based models simultaneously achieve high representation power and numerically stable optimization because they require neither an approximation, as VAEs do [6], nor a discriminator, as GANs do [7].
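The maximum-likelihood training principle above can be illustrated with a deliberately tiny flow. The sketch below is not the paper's model: it fits a single invertible affine map f(x) = (x - b)·exp(-s) to 1-D data by gradient descent on the exact negative log-likelihood, log p(x) = log N(f(x); 0, 1) + log|df/dx|, with hand-derived gradients. All names and values are illustrative.

```python
import numpy as np

# Toy maximum-likelihood training of an invertible map (not the paper's model).
# f(x) = (x - b) * exp(-s), latent prior N(0, 1); log|df/dx| = -s, so the
# per-sample NLL is 0.5*z^2 + 0.5*log(2*pi) + s with z = f(x).
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=5000)  # toy "data distribution"

s, b = 0.0, 0.0   # learnable log-scale and shift
lr = 0.05
for _ in range(1000):
    z = (data - b) * np.exp(-s)        # push data forward to latent space
    grad_s = 1.0 - np.mean(z ** 2)     # d(mean NLL)/ds, derived by hand
    grad_b = -np.mean(z) * np.exp(-s)  # d(mean NLL)/db
    s -= lr * grad_s
    b -= lr * grad_b

# After training, f whitens the data: b approximates the mean, exp(s) the std.
```

At the optimum the pushed-forward samples are standard normal, which is exactly the training criterion stated above.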
For high-dimensional data generation, the use of a multi-scale architecture [13] is preferred. The spatial resolution of the feature map is recursively halved as the level of the flow increases. In the generation phase, x is obtained as

ξ^(L-1) = g^(L)(z^(L)),  ξ^(l) = g^(l+1)(z^(l+1) ⊕ ξ^(l+1)),  l = 0, ···, L-2,  x = ξ^(0),  (1)

where g^(l) = (f^(l))^-1, f^(l) is the invertible map at level l, ⊕ is concatenation along the channel dimension, and L is the total number of levels of the flow. For unconditional generation, z^(l) is sampled as z^(l) ∼ N(0, 1).

U-Net. U-Net [20] is one of the most popular architectures for image-to-image translation models; it comprises an encoder-decoder network [21] and skip paths. While the input image c is being encoded into the latent vector, a skip path is created each time the image resolution is halved. The spatial resolution of c is reduced L times during encoding as follows:

u_c^(0) = c,  u_c^(l) = NN(u_c^(l-1)),  l = 1, ···, L,  (2)

where NN is a neural network. The target image x is generated from u_c^(l) as ξ_c^(L-1) = NN(u_c^(L)), ξ_c^(l) = NN(u_c^(l+1) ⊕ ξ_c^(l+1)), x = ξ_c^(0), l = 0, ···, L-2, which is very similar to Eq. (1).

Figure 2: Overview of our proposed Flow U-Net with Squeeze modules (FUNS), which consists of the following two parts that are trained simultaneously in an end-to-end manner. (a) U-Net-like variational autoencoder that encodes condition c to latent feature z. Purple hatched boxes represent the squeeze modules, which disentangle the latent feature. (b) Flow-based generative model with multi-scale architecture that learns an invertible mapping between target image x and latent feature z.

3 Proposed Model

An overview of our proposed Flow U-Net with Squeeze modules (FUNS) is shown in Fig. 2.
The model consists of a U-Net-like variational autoencoder with multiple scales of latent variables between its encoder and its decoder. These latent spaces encode an image c ∈ R^{d_c} to condition on, and are trained by reconstructing c. Simultaneously, a flow-based model is trained to associate possible outputs x ∈ R^{d_x} with the given image c by maximizing the likelihood, under the U-Net's latent-space densities for c, of the latents z ∈ R^{d_x} that the flow produces given x (see Section 3.1).

Feature disentanglement is achieved as follows. Let us consider a decomposition of the form p(z|c) = p(z_iv) p(z_sp|c), which orthogonally decomposes the latent vector into condition-independent and condition-dependent parts (see Section 3.2). Here, z_sp can be considered a random variable that should be almost uniquely determined under a given condition c. For this purpose, we introduce an entropy regularization to narrow p(z_sp|c) (see Section 3.3).

3.1 Flow U-Net

The outputs of the encoder part of the U-Net and the inputs of the flow both have multi-scale architectures, as can be seen in Eqs. (1) and (2). This allows us to consistently plug the encoder outputs of the U-Net, u_c^(l), into the flow inputs z^(l) as follows (see Fig. 2):

z^(l) ∼ N(µ^(l)(c), diag σ^(l)2(c)),  [µ^(l)(c), log σ^(l)2(c)] = F^(l)(u_c^(l)),  l = 1, ···, L,  (3)

where F^(l) is an arbitrary function. In this study, we use the squeeze module proposed in the next section for F^(l). As can be seen in the above equation, z^(l) is now sampled from a distribution that depends on c. This achieves the multi-scale conditional sampling of the latent vector for the flow.
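The conditional prior of Eq. (3) at one level can be sketched as follows; a hypothetical linear head stands in for F^(l), and the sample is drawn with the usual reparameterization of a diagonal Gaussian. Shapes and values are illustrative only.

```python
import numpy as np

# Sketch of Eq. (3) at a single level: an encoder head maps the condition
# feature u_c to (mu(c), log sigma^2(c)); z is sampled from the resulting
# diagonal Gaussian. The linear map W is a hypothetical stand-in for F.
rng = np.random.default_rng(0)
d = 8                                  # latent dimension at this level

W = rng.normal(size=(2 * d, d)) * 0.1  # toy head producing [mu, log_var]
u_c = rng.normal(size=d)               # encoder feature for condition c

out = W @ u_c
mu, log_var = out[:d], out[d:]
z = mu + np.exp(0.5 * log_var) * rng.normal(size=d)  # z ~ N(mu, diag sigma^2)
```

Repeating this at every level l = 1, ..., L gives the multi-scale conditional sampling described above.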
In what follows, we abbreviate the superscript (•)^(l) and write e_φ(z|c) ≡ N(µ^(1) ⊕ ··· ⊕ µ^(L), diag(σ^(1)2 ⊕ ··· ⊕ σ^(L)2)) for simplicity, where φ is the parameter corresponding to F and u_c. We now consider the use of a decoder whose output distribution is d_ψ(c|z), with a parameter ψ (as described in Fig. 2(b)). The decoder is used to reconstruct c from z, where z is drawn from e_φ(z|c). The addition of the decoder arises from the need to maximize mutual information [22]. More specifically, the mutual information between z and c, I(z; c), is increased by minimizing the reconstruction error (see the Supplementary Material). A normal distribution with unit variance is used for d_ψ(c|z). We call the flow with the multi-scale encoder/decoder a Flow U-Net. Very recently, a similar conditional flow model that also contains an encoder and decoder was proposed [23]. The main differences between their model and ours are that our model (i) considers multi-scale conditioning and (ii) contains squeeze modules.

Figure 3: Detail of the squeeze module. Here, ⊙ and ⊕ are element-wise multiplication and element-wise addition, respectively.

3.2 Disentangling Latent Features Using Squeeze Modules

Disentangling the latent vector z into condition-specific and condition-invariant parts cannot be achieved by the Flow U-Net because the latent vector is fully conditioned on c. To separate z into these parts, we introduce a squeeze module (Fig. 3(a)), with which the actual dimensionality of z conditioned on c is squeezed without changing the total dimensionality of z.
This is required for the flow, in which x and z must have the same dimensions. Because the squeezed dimensions of z no longer depend on c, they capture the features that are common among the conditions.

It is well known that one can easily obtain a sparse latent vector by applying a rectifying nonlinear function and minimizing the L1 norm of its outputs [24]. Using this technique, we make a part of the latent vector z follow N(f_µ(b_µ), diag(exp(f_σ(b_σ)))), which is independent of c. To accomplish this, the following module is used for the F that appears in Eq. (3):

M = ReLU(m),  [h_µ,c, h_σ,c] = NN(u_c),
µ(c) = f_µ(M ⊙ h_µ,c + b_µ),  log σ^2(c) = f_σ(M ⊙ h_σ,c + b_σ).

Here, m, b_µ, and b_σ are learnable parameters whose dimensions are equal to that of µ(c); NN is a neural network; ReLU is the rectified linear unit; and ⊙ is the element-wise product. The initial value of m is set as m ∼ N(0, 1). The functions f_µ(•) = a tanh(•) and f_σ(•) = b(exp(-(•)^2) - 1) are activation functions that restrict µ(c) ∈ (-a, a) and log σ^2(c) ∈ (-b, 0], respectively, where a and b are hyperparameters. Limiting the values of µ(c) and log σ^2(c) is done only to facilitate numerical stability. In our experiments, we use a = 2 and b = log(2πe). When M_i = 0, where (•)_i indicates the i-th element of (•), both µ(c)_i and log σ^2(c)_i are independent of c. In what follows, z_sp and z_iv are defined as the parts of z where M_i > 0 and M_i = 0, respectively. Figure 3(b) illustrates the relationship between z_sp and z_iv.
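The gating behaviour of the squeeze module can be sketched as follows. The NN producing (h_µ,c, h_σ,c) is replaced by a hypothetical linear map; wherever M_i = 0, the outputs reduce to functions of the biases alone and are therefore identical for any two conditions.

```python
import numpy as np

# Sketch of the squeeze module: M = ReLU(m) gates the condition-dependent
# features. Where M_i = 0, mu_i and log sigma^2_i depend only on the learnable
# biases b_mu, b_sigma, so they are independent of the condition c.
rng = np.random.default_rng(0)
d = 6
a, b = 2.0, np.log(2 * np.pi * np.e)   # the paper's hyperparameters

m = rng.normal(size=d)                 # learnable, initialized ~ N(0, 1)
b_mu = rng.normal(size=d)
b_sigma = rng.normal(size=d)
W = rng.normal(size=(2 * d, d)) * 0.1  # hypothetical stand-in for NN(u_c)

def squeeze_module(u_c):
    M = np.maximum(m, 0.0)                               # ReLU
    h = W @ u_c
    h_mu, h_sigma = h[:d], h[d:]
    mu = a * np.tanh(M * h_mu + b_mu)                    # f_mu: range (-a, a)
    log_var = b * (np.exp(-(M * h_sigma + b_sigma) ** 2) - 1.0)  # f_sigma: (-b, 0]
    return M, mu, log_var

M, mu1, lv1 = squeeze_module(rng.normal(size=d))  # condition 1
_, mu2, lv2 = squeeze_module(rng.normal(size=d))  # condition 2
```

With the N(0, 1) initialization, roughly half of the entries of m are negative, so M starts with zeros and the L1 penalty introduced below can push further entries to zero during training.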
The distributions of z_sp depend on c, whereas those of z_iv do not.

Note that modeling the relation between z_sp and c as a probabilistic relation is important because, as we will see in the experiments, the generated images may be diverse under a specific condition c but not under another condition c′. In addition, by marginalizing p(z_sp|c) with respect to c using p(c), we obtain p(z_sp), which corresponds to the "cross-conditional diversity" in Fig. 1.

3.3 Loss Function

The loss functions of the conditional flow, L_flow, and of the condition reconstruction, L_recons, are written as follows:

L_flow(θ, φ, M) = E_{x,c∼p(x,c)}[ -log e_φ(f_θ(x)|c) - Σ_p log J_p ],  (4)
L_recons(φ, ψ, M) = E_{c∼p(c), z∼e_φ(z|c)}[ -log d_ψ(c|z) ],  (5)

where J_p is the Jacobian of the p-th invertible mapping. Moreover, L_flow + L_recons is an upper bound of E_{x,c∼p(x,c)}[-log p_{θ,φ}(x|c)] - I(c; z), which is an objective for conditional generation (see the Supplementary Material). The squeezing loss function L_squeeze is an L1 loss on M, because some of the dimensions of the mask M are required to be zero. That is,

L_squeeze(M) = ‖M‖_1.  (6)

Moreover, to reduce the uncertainty of z_sp given c, the entropy of z_sp conditioned on c (the part of z where M_i > 0) should be decreased. Hence, the following entropy regularization is introduced:

L_entropy(φ, M) = Σ_{i=1}^{d_x} M_i H[e_φ(z_i|c)],  (7)

where H[•] is the entropy and e_φ(z|c) = Π_i e_φ(z_i|c) is assumed. When e_φ(z_i|c) is a normal distribution with standard deviation σ_i, it can be written as L_entropy = (1/2) Σ_{i=1}^{d_x} M_i (log σ_i^2 + log(2πe)). It is noteworthy that, without the entropy regularization, we fail to disentangle the features (see the experiments and Fig. 7). In summary, the total loss for FUNS is

L_FUNS = L_flow + L_recons + α L_squeeze + β L_entropy,

where α and β are hyperparameters that control the amount of regularization.

3.4 Sampling Procedure

For conditional generation, µ(c) and log σ^2(c) are first obtained by passing condition c through the encoder. The latent vectors z conditioned on c are sampled from N(µ(c), diag σ^2(c)), and samples x are then obtained as x = f_θ^{-1}(z). In addition to ordinary conditional generation, our model can also generate condition-specific diverse samples x_sp and condition-invariant diverse samples x_iv from a given sample x_given. Such condition-specific and condition-invariant diverse samples are those whose diversities originate from the conditions and from other sources, respectively. Samples x_sp and x_iv can easily be obtained by resampling z_sp and z_iv, respectively, even if the ground-truth condition is unknown. Moreover, from two given images x_1^given and x_2^given, style-transferred samples x_{1,2}^trans and x_{2,1}^trans can be obtained by exchanging their condition-invariant features. The sampling procedures used to obtain the above-mentioned samples are summarized in Algorithm 1.

4 Related Work

Conditional Flows. Conditional flow models [14, 16, 23] fall into two classes. The first separates the latent vector z into condition-specific and condition-invariant parts in advance and uses c as the former [16].
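The four loss terms of Eqs. (4)-(7) can be sketched numerically as follows. All tensors are toy values; the log-Jacobian sum of Eq. (4) is supplied as a placeholder scalar, and the unit-variance Gaussian decoder of Eq. (5) reduces to a squared error up to constants.

```python
import numpy as np

# Sketch of the FUNS objective on toy values: Gaussian NLL of the flow's
# latents under the conditional prior (Eq. 4), reconstruction NLL for the
# condition decoder (Eq. 5), L1 squeeze loss (Eq. 6), and the masked Gaussian
# entropy regularizer (Eq. 7).
rng = np.random.default_rng(0)
d = 8
alpha, beta = 0.01, 0.1                # the paper's regularization weights

z = rng.normal(size=d)                 # z = f_theta(x)
mu = rng.normal(size=d)                # conditional prior parameters
log_var = -np.abs(rng.normal(size=d))
sum_log_jac = 1.3                      # placeholder for sum_p log J_p
M = np.maximum(rng.normal(size=d), 0)  # squeeze mask
c, c_recon = rng.normal(size=4), rng.normal(size=4)

# Eq. (4): -log e_phi(z|c) - sum_p log J_p, diagonal Gaussian prior
nll = 0.5 * np.sum(log_var + np.log(2 * np.pi) + (z - mu) ** 2 / np.exp(log_var))
L_flow = nll - sum_log_jac

# Eq. (5): unit-variance Gaussian decoder => squared error up to constants
L_recons = 0.5 * np.sum((c - c_recon) ** 2)

# Eq. (6): L1 loss on the mask
L_squeeze = np.sum(np.abs(M))

# Eq. (7): masked Gaussian entropies, (1/2) sum_i M_i (log sigma_i^2 + log(2*pi*e))
L_entropy = 0.5 * np.sum(M * (log_var + np.log(2 * np.pi * np.e)))

L_FUNS = L_flow + L_recons + alpha * L_squeeze + beta * L_entropy
```

Note the opposing pressures: L_squeeze drives mask entries to zero (enlarging z_iv), while L_entropy only penalizes dimensions with M_i > 0, narrowing p(z_sp|c).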
This type of model, however, can only be used when the dimensionality of c is less than or equal to that of x, because of the flow restriction requiring the dimensionalities of x and z to be the same. In the other class, a part of z is encoded from c [14], or all of z is encoded from both c and noise [23]. This type of model can potentially treat a c whose dimensionality is much larger than that of x, because the encoder can reduce the dimensionality of c. The Flow U-Net described in Section 3.1 belongs to this class. However, to separate z into z_sp and z_iv, it is necessary to choose their dimensions carefully. Generally speaking, before the start of model training, there is no oracle to indicate how large the dimensionality of z conditioned on c should be for conditional generation. This becomes a crucial problem when c consists of high-dimensional data, such as images.

Algorithm 1: Procedures used to obtain condition-specific diverse images x_sp, condition-invariant diverse images x_iv, and style-transferred images x_trans. (•)_mean is the mean of (•).

Condition-specific sampling:
  Require: x_given
  z = f_θ(x_given)
  c′ = (d_ψ(c|z = z))_mean
  z′ ∼ e_φ(z|c = c′)
  for i = 1 to d_x do
    if M_i > 0 then z_i ← z′_i
  x_sp ← f_θ^{-1}(z)
  return x_sp

Condition-invariant sampling:
  Require: x_given
  z = f_θ(x_given)
  µ = f_µ(b_µ), log σ^2 = f_σ(b_σ)
  z′ ∼ N(µ, diag σ^2)
  for i = 1 to d_x do
    if M_i = 0 then z_i ← z′_i
  x_iv ← f_θ^{-1}(z)
  return x_iv

Style transfer:
  Require: x_1^given, x_2^given
  z_1 = f_θ(x_1^given), z_2 = f_θ(x_2^given)
  for i = 1 to d_x do
    if M_i = 0 then z_{1,i} ← z_{2,i}
  x_{1,2}^trans ← f_θ^{-1}(z_1)
  return x_{1,2}^trans
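The mask-based resampling of Algorithm 1 can be sketched as follows. The invertible flow and the conditional prior are toy stand-ins; the point is which latent dimensions get resampled (M_i > 0 versus M_i = 0).

```python
import numpy as np

# Sketch of Algorithm 1 with a toy invertible map standing in for f_theta.
rng = np.random.default_rng(0)
d = 6
M = np.array([0.7, 0.0, 1.2, 0.0, 0.3, 0.0])  # squeeze mask (fixed here)

f_theta = lambda x: 2.0 * x          # toy invertible flow
f_theta_inv = lambda z: z / 2.0

def sample_x_sp(x_given, mu_c, sigma_c):
    # resample the condition-specific part (M_i > 0) from a stand-in e_phi(z|c')
    z = f_theta(x_given).copy()
    z_new = mu_c + sigma_c * rng.normal(size=d)
    z[M > 0] = z_new[M > 0]
    return f_theta_inv(z)

def sample_x_iv(x_given, mu_iv, sigma_iv):
    # resample the condition-invariant part (M_i = 0) from its c-free prior
    z = f_theta(x_given).copy()
    z_new = mu_iv + sigma_iv * rng.normal(size=d)
    z[M == 0] = z_new[M == 0]
    return f_theta_inv(z)

def style_transfer(x1, x2):
    # exchange condition-invariant features between two images
    z1, z2 = f_theta(x1).copy(), f_theta(x2)
    z1[M == 0] = z2[M == 0]
    return f_theta_inv(z1)

x = rng.normal(size=d)
x_sp = sample_x_sp(x, np.zeros(d), np.ones(d))
x_iv = sample_x_iv(x, np.zeros(d), np.ones(d))
```

Because the flow is invertible, the untouched latent dimensions reproduce the corresponding content of the given image exactly after decoding.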
In FUNS, the dimension of z_sp is learned from the data.

Conditional Variational Models. Variational approaches for image generation with feature disentanglement and diversity have progressed recently. In conditional variational autoencoders (CVAEs) [25], the latent vector z is separated in advance into two parts, a condition part and a noise part. The Variational U-Net (VUNet) [1] and the Probabilistic U-Net (PUNet) [2] employ similar strategies to generate diverse images. In fact, all of these variational models use a loss function that is similar to that used in our model (see the Supplementary Material), except for the following differences: (i) our model does not use approximated distributions; (ii) it has no deterministic path from condition c to target x; and (iii) our model uses invertible neural networks for generating x from z.

Conditional Generative Adversarial Networks. BicycleGAN [3] is a diverse image-to-image generative model based on conditional GANs [26]. Similar to CVAE, BicycleGAN generates x from both c and noise. However, unlike CVAE, the BicycleGAN generator is trained to fool the discriminator, whereas the discriminator is trained to distinguish generated from real images. In addition, cd-GAN [27] and cross-domain disentanglement networks [28] are extended models that enable feature disentanglement.
MUNIT [4] and DRIT [5] are extended models that learn from unpaired data.

5 Experiments

Baselines. We compared our model to the official implementations of VUNet [1] and PUNet [2]. VUNet was found to work well when the coefficient of the Kullback-Leibler divergence loss was fixed to one throughout training, whereas in PUNet, the prediction loss function was changed from the softmax cross-entropy to the mean squared error in order to predict real-valued images.

Datasets. We employed the celebrity face attributes dataset (CelebA [19], which consists of images with 40 attribute annotations) as well as our original Cahn-Hilliard-Cook (CHC) dataset. The CHC dataset includes a vast number of microstructure images describing phase separation and was created by solving the following partial differential equation: ∂u/∂t = ∇^2(u^3 - u - γ∇^2u) + σζ, where u is the local density, t is the time, γ is the material property, σ is the intensity of the noise, and ζ ∼ U([-1, 1]). Here, γ = 1.5 and σ = 2 are used. We generated 999 distinct initial conditions (c in Fig. 2(a)). For each initial condition, we solved the CHC equation 32 times to obtain u(t) for 0 ≤ t ≤ 9,000∆t, where ∆t = 0.01 is the time increment; the trajectories vary depending on the noise ζ. Of these, 800 trajectories were used for training, 100 for validation, and 99 for testing. From these trajectories, we created the CHC dataset as (c, x) ≡ (u(0), u(1,000k∆t)), where k = 1, 2, ···, 9. In total, the CHC dataset includes 287,712 image pairs. The numbers of data points for training, validation, and testing in CelebA were 162,770, 19,867, and 19,962, respectively, following the official train/val/test partition. The CelebA images were resized to 64 × 64; the CHC images were also 64 × 64.
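One simple way to integrate the CHC equation above is an explicit Euler step with a 5-point periodic Laplacian; the sketch below uses the paper's γ, σ, and ∆t, but the grid size and step count are illustrative, and the paper's actual discretization scheme is not specified.

```python
import numpy as np

# Sketch of integrating du/dt = lap(u^3 - u - gamma*lap(u)) + sigma*zeta
# on a periodic grid with explicit Euler (one possible scheme, not
# necessarily the paper's). gamma = 1.5, sigma = 2, dt = 0.01 as in the paper.
rng = np.random.default_rng(0)
n, dt, gamma, sigma = 32, 0.01, 1.5, 2.0

def lap(a):
    # 5-point periodic Laplacian (unit grid spacing)
    return (np.roll(a, 1, 0) + np.roll(a, -1, 0) +
            np.roll(a, 1, 1) + np.roll(a, -1, 1) - 4.0 * a)

u = 0.1 * rng.uniform(-1.0, 1.0, size=(n, n))   # initial condition c
for _ in range(200):
    zeta = rng.uniform(-1.0, 1.0, size=(n, n))  # zeta ~ U([-1, 1])
    u = u + dt * (lap(u ** 3 - u - gamma * lap(u)) + sigma * zeta)
```

Repeating such a solve with different noise realizations from the same initial condition is what produces the paired, non-deterministic (c, x) data described above.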
In the training phase, the CelebA color depth was reduced to five bits following the Glow settings [14], whereas CHC was kept at eight bits. Random noise smaller than the color-tone step was added to CelebA, a process known as dequantization [29]. The data were then rescaled to lie in the range [-1, 1]^{d_x}. Because there are no condition images in CelebA, one image per tag was chosen as the condition image, and all images in that tag were assumed to be conditioned on the selected image. The "Smiling" attribute was chosen as the CelebA tag. One Smiling image c_smile and one non-Smiling image c_not smile were chosen from the training data as condition images and were fixed during training and testing. All Smiling images were conditioned on c_smile, whereas all non-Smiling images were conditioned on c_not smile.

Implementation Details. We implemented our model with TensorFlow version 1.10.0 [30]. We used Glow [14] for the flow architecture, with K = 48 flows per level and L = 4 levels in total. In all of our experiments, α = 0.01 and β = 0.1 were used. The encoder and decoder had mirror-symmetric architectures consisting of 3L residual blocks, where each block contains two batch normalizations [31], two leaky ReLUs, and two convolutions with a filter size of three. The image size was halved every three residual blocks in the encoder and doubled every three residual blocks in the decoder.
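The preprocessing described above (5-bit reduction, dequantization, rescaling to [-1, 1]) can be sketched as follows; array shapes and values are illustrative.

```python
import numpy as np

# Sketch of the CelebA preprocessing: quantize 8-bit pixels to 5 bits, add
# uniform dequantization noise smaller than one color-tone step, and rescale
# the result to [-1, 1].
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float64)  # 8-bit image

bits = 5
levels = 2 ** bits                           # 32 tones
x = np.floor(img / 2 ** (8 - bits))          # 5-bit values in [0, 31]
x = x + rng.uniform(0.0, 1.0, size=x.shape)  # dequantization noise (< 1 step)
x = x / levels * 2.0 - 1.0                   # rescale to [-1, 1]
```

The added noise makes the discrete pixel values continuous, which is what allows a density model such as a flow to assign them finite likelihoods.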
The number of convolution channels was 16 in the first layer of the encoder and doubled each time the image size was halved. For training, we used the default settings of the Adam [32] optimizer with a learning rate of 10^-4 and a batch size of 32. All experiments were carried out on a single NVIDIA Tesla P100 GPU.

Figure 4: Comparison of diverse image generation with various models (columns: condition, ground truth, VUNet [1], PUNet [2], FUNS (ours)).

Figure 5: Multi-scale samples for CelebA and CHC (panels: z^(1), z^(2), and z^(3) sampled; z^(4) sampled; z^(5) sampled).

Evaluation Metrics. Image quality was measured by FID [17] using the Inception-v3 model [33]; note that a smaller FID implies better image quality. To evaluate the FID, 50,000 images per class were used for CelebA and 3,168 images for CHC. Image diversity was measured by LPIPS [18], the sample-wise distance between activations obtained by feeding the samples into a pretrained model. In this study, we used a modified version of this measure because LPIPS is too sensitive to inter-conditional diversity; namely, the LPIPS between images obtained under different conditions is greater than that between images obtained under the same condition. We therefore measure the intra-conditional diversity to prevent the overestimation of LPIPS values caused by the diversity of conditions. Accordingly, let LPIPS(c, c′) be the LPIPS between images generated from conditions c and c′. The intra-conditional LPIPS (c-LPIPS) is defined as E_{c∼p(c)}[LPIPS(c, c)]. The numbers of images used for evaluating both LPIPS and c-LPIPS were 4,000 for CelebA and 3,168 for CHC. To ensure that the generated images were conditioned on c, the prediction performance of c from a generated x was measured by an in-house ResNet trained on real data. For CelebA, the prediction model was trained to classify whether the input image belongs to the Smile class, and its accuracy was measured. For CHC, the prediction model was trained to predict the condition image c, and the L2 distance between the ground truth c and the prediction was measured.

Figure 6: Condition-specific and condition-invariant diverse samples x_sp and x_iv, respectively, obtained from the given image x_given (test image).

Table 1: Comparison of FID, LPIPS, and c-LPIPS. The means and standard deviations of five trials are shown for FID. All of the standard deviations of five trials for LPIPS and c-LPIPS are less than 0.021 (abbreviated). Full information can be found in the Supplementary Material.

                        CelebA                                 CHC
               FID            LPIPS   c-LPIPS    FID             LPIPS   c-LPIPS
VUNet T=1.0    66.0 ± 4.3     0.148   0.146      96.5 ± 2.1      0.217   0.118
VUNet T=0.8    81.7 ± 3.7     0.105   0.103      164.0 ± 5.7     0.249   0.113
PUNet T=1.0    114.8 ± 9.2    0.182   0.180      225.7 ± 6.1     0.226   0.108
PUNet T=0.8    117.2 ± 5.5    0.149   0.146      227.9 ± 6.0     0.214   0.088
FUNS  T=1.0    39.6 ± 3.9     0.264   0.262      10.5 ± 2.0      0.207   0.157
FUNS  T=0.8    29.5 ± 3.5     0.259   0.256      11.1 ± 2.5      0.210   0.155
Real data      –              0.286   0.284      –               0.225   0.169

Table 2: Prediction performance of the generated images with respect to their ground-truth condition. For CelebA, the classification accuracy (Acc.), i.e., whether the generated images belong to the Smile class, is measured. For CHC, the L2 distance (Err.)
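The c-LPIPS idea above, averaging pairwise distances only among samples that share a condition, can be sketched as follows; a plain Euclidean distance on toy feature vectors stands in for the real LPIPS network.

```python
import numpy as np

# Sketch of c-LPIPS vs. plain LPIPS: intra-conditional averaging removes the
# contribution of inter-conditional diversity. Euclidean distance on toy
# feature vectors is a stand-in for the learned LPIPS distance.
rng = np.random.default_rng(0)

def pairwise_mean_dist(feats):
    n = len(feats)
    d = [np.linalg.norm(feats[i] - feats[j])
         for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(d))

# toy "generated samples": 3 conditions x 4 samples x 16-dim features
samples = {c: [rng.normal(size=16) for _ in range(4)] for c in range(3)}

# intra-conditional: average pairwise distance within each condition
c_lpips = float(np.mean([pairwise_mean_dist(samples[c]) for c in samples]))

# plain version: pool all samples, mixing intra- and inter-conditional pairs
all_feats = [f for fs in samples.values() for f in fs]
lpips_all = pairwise_mean_dist(all_feats)
```

When conditions produce visibly different outputs, the pooled average includes cross-condition pairs and therefore overestimates the diversity attributable to the model, which is exactly the effect reported for the CHC dataset in Table 1.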
between the predicted c and the ground truth is measured.

               Acc.    Err.
VUNet T=1.0    0.977   9.46
VUNet T=0.8    0.981   9.44
PUNet T=1.0    1.000   9.48
PUNet T=0.8    1.000   9.48
FUNS  T=1.0    0.973   9.67
FUNS  T=0.8    0.967   9.81
Real data      0.924   9.54

Results. Figure 4 shows images generated by the various methods. To generate high-quality images, all of the images were sampled from the reduced-temperature model p_model,T(x|c) ∝ (p_model(x|c))^{T^2} [34], where p_model is the model distribution and T is the sampling temperature. Here, T = 1.0 is used in all cases except for CelebA with FUNS, for which T = 0.8 is used because it yields the best performance in terms of FID (see Table 1). PUNet tends to generate blurred images, whereas VUNet and FUNS successfully generate sharp images. All of these models appear to generate diverse images. Additional samples are shown in the Supplementary Material.

As noted in Eq. (3), there are several levels of latent features, z^(1), ···, z^(L), in our model. Figure 5 shows generated samples for which only a part of the z^(l) are sampled while the other latent features are fixed to µ^(l)(c). We can see that a larger diversity is captured by latent features at lower resolution levels (z^(4) and z^(5)), whereas very subtle variations are captured at higher resolution levels (z^(1), z^(2), and z^(3)).

The quantitative comparisons of the models are summarized in Table 1, where it can be seen that our model outperforms the competitors in terms of both FID and c-LPIPS on all datasets. Note that for the CHC dataset, LPIPS has a much larger value than c-LPIPS. This means that the original LPIPS overestimates the diversity because of the inter-conditional diversity. Table 2 summarizes the prediction performance of the conditions from the generated data. Note that the prediction error for CHC is 22.84 when the ground truth is randomly shuffled.
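For a Gaussian prior, Glow-style implementations realize reduced-temperature sampling by scaling the prior standard deviation by T, so T < 1 concentrates samples near the mean. A minimal sketch of that effect (toy mean and scale, not the model's actual prior):

```python
import numpy as np

# Sketch of reduced-temperature sampling for a diagonal Gaussian prior:
# scale the standard deviation by T, so T < 1 draws samples closer to the
# mean (sharper, typically higher-quality outputs).
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0

def sample(temp, n=20000):
    return mu + temp * sigma * rng.normal(size=n)

z_full = sample(1.0)   # ordinary sampling
z_cool = sample(0.8)   # reduced temperature, as used for CelebA with FUNS
```

The tempered samples share the same mean but have a smaller spread, trading some diversity for sample quality, which is consistent with the FID/LPIPS trade-off between T = 1.0 and T = 0.8 in Table 1.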
It can be seen that every model generates images that depend on the conditions in terms of prediction performance. PUNet and VUNet achieved higher prediction performance than the Real data and FUNS on both CelebA and CHC. This may be because PUNet and VUNet generated less diverse images than the Real data and FUNS, as can be seen in the c-LPIPS measure in Table 1.

Figure 7: Feature disentanglement and interpolation. The top-left (red-framed) and bottom-right (blue-framed) images are given images, whereas the bottom-left and top-right images are style-transferred images x_trans. All other images are obtained by linear interpolation in latent space. Panels: (a) CelebA; (b) CHC; (c) CHC without L_entropy.

Condition-specific and condition-invariant diverse samples, x_sp and x_iv, respectively, generated by FUNS following Algorithm 1, are shown in Fig. 6. In CelebA, it is interesting to note that diverse Smiling images of the same individual were sampled as x_sp when a Smiling image was given (the first column), whereas almost identical images were sampled as x_sp when a non-Smiling image was given (the second column). This indicates that even when an individual is fixed, Smiling images have a certain amount of diversity (e.g., slightly grinning or widely smiling), whereas non-Smiling images have less diversity. In contrast, diverse images of individuals with similar facial expressions were sampled as x_iv. This is because the diversity between individuals is almost independent of the Smiling/non-Smiling condition; the features corresponding to such diversity were therefore embedded into z_iv. In the CHC dataset, the diversity of x originates from both the thermal fluctuation and the elapsed time. As time proceeds, the black and white regions gradually separate and the characteristic length of their pattern grows. In Fig.
6, it can be seen that the elapsed time is captured by x_iv, whereas other fluctuation effects are captured in x_sp, because the characteristic lengths in x_sp are similar to each other. The benefit of our model is that it can capture condition-specific diversity (if it exists) and can extract a feature that is truly independent of the conditions.

An interpolation between two given images is shown in Fig. 7. Note that these results are obtained using only two (top-left and bottom-right) images for each dataset. The remaining corners, the top-right and bottom-left images, are obtained by exchanging the z_iv of the two given images. The vertical axis of Fig. 7 exhibits cross-conditional diversity, because we can choose any two images for the top-left and bottom-right positions whether or not they belong to the same condition (this should not be confused with condition-specific diversity). Figure 7 shows that the obtained image manifold is not only smooth but also aligned along meaningful axes, such as the individual and facial-expression axes for CelebA (Fig. 7(a)) and the initial-condition and elapsed-time axes for CHC (Fig. 7(b)). We argue that the axis decomposition fails if L_entropy is absent (Fig. 7(c)). The effects of each loss term (Eqs. (4)-(7)) are summarized in Table S.3 in the Supplementary Material.

6 Conclusion

Herein, we presented a flow-based framework for diverse image-to-image translation with feature disentanglement. Quantitative and qualitative comparisons showed that our model outperforms state-of-the-art variational generative models in terms of image quality and image diversity. Furthermore, our model not only successfully generates diverse images but also separates the latent features into condition-specific and condition-invariant parts.
By utilizing this property, we showed that meaningful orthogonal axes lying in the latent space can be extracted from given images.

References

[1] Patrick Esser, Ekaterina Sutter, and Björn Ommer. A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8857–8866, 2018.

[2] Simon Kohl, Bernardino Romera-Paredes, Clemens Meyer, Jeffrey De Fauw, Joseph R Ledsam, Klaus Maier-Hein, SM Ali Eslami, Danilo Jimenez Rezende, and Olaf Ronneberger. A probabilistic u-net for segmentation of ambiguous images. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 6965–6975, 2018.

[3] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 465–476, 2017.

[4] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018.

[5] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 35–51, 2018.

[6] Diederik P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.

[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.
In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.

[8] HE Cook. Brownian motion in spinodal decomposition. Acta Metallurgica, 18(3):297–306, 1970.

[9] Alain Karma and Wouter-Jan Rappel. Phase-field model of dendritic sidebranching with thermal noise. Physical Review E, 60(4):3614, 1999.

[10] KR Elder, Francois Drolet, JM Kosterlitz, and Martin Grant. Stochastic eutectic growth. Physical Review Letters, 72(5):677, 1994.

[11] David L McDowell and Gregory B Olson. Concurrent design of hierarchical materials and structures. In Scientific Modeling and Simulations, pages 207–240. Springer, 2008.

[12] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[13] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

[14] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 10215–10224, 2018.

[15] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.

[16] Lynton Ardizzone, Jakob Kruse, Sebastian Wirkert, Daniel Rahner, Eric W Pellegrini, Ralf S Klessen, Lena Maier-Hein, Carsten Rother, and Ullrich Köthe. Analyzing inverse problems with invertible neural networks.
In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 6629–6640, 2017.

[18] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018.

[19] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3730–3738, 2015.

[20] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241. Springer, 2015.

[21] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[22] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 2172–2180, 2016.

[23] Rui Liu, Yu Liu, Xinyu Gong, Xiaogang Wang, and Hongsheng Li. Conditional adversarial generative flow for controllable image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[24] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks.
In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 315–323, 2011.

[25] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 3581–3589, 2014.

[26] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017.

[27] Jianxin Lin, Yingce Xia, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Conditional image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5524–5532, 2018.

[28] Abel Gonzalez-Garcia, Joost van de Weijer, and Yoshua Bengio. Image-to-image translation for cross-domain disentanglement. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 1287–1298, 2018.

[29] Benigno Uria, Iain Murray, and Hugo Larochelle. RNADE: The real-valued neural autoregressive density-estimator. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 2175–2183, 2013.

[30] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[31] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML), pages 448–456, 2015.

[32] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[33] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.

[34] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer.
In Proceedings of the International Conference on Learning Representations (ICLR), 2018.