{"title": "Soft-Gated Warping-GAN for Pose-Guided Person Image Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 474, "page_last": 484, "abstract": "Despite remarkable advances in image synthesis research, existing works often fail in manipulating images under the context of large geometric transformations. Synthesizing person images conditioned on arbitrary poses is one of the most representative examples where the generation quality largely relies on the capability of identifying and modeling arbitrary transformations on different body parts. Current generative models are often built on local convolutions and overlook the key challenges (e.g. heavy occlusions, different views or dramatic appearance changes) when distinct geometric changes happen for each part, caused by arbitrary pose manipulations. This paper aims to resolve these challenges induced by geometric variability and spatial displacements via a new Soft-Gated Warping Generative Adversarial Network (Warping-GAN), which is composed of two stages: 1) it first synthesizes a target part segmentation map given a target pose, which depicts the region-level spatial layouts for guiding image synthesis with higher-level structure constraints; 2) the Warping-GAN equipped with a soft-gated warping-block learns feature-level mapping to render textures from the original image into the generated segmentation map. Warping-GAN is capable of controlling different transformation degrees given distinct target poses. Moreover, the proposed warping-block is light-weight and flexible enough to be injected into any networks. 
Human perceptual studies and quantitative evaluations demonstrate the superiority of our Warping-GAN that significantly outperforms all existing methods on two large datasets.", "full_text": "Soft-Gated Warping-GAN for Pose-Guided Person\n\nImage Synthesis\n\nHaoye Dong1,2 , Xiaodan Liang3,\u2217, Ke Gong1 , Hanjiang Lai1,2 , Jia Zhu4 , Jian Yin1,2\n\n1School of Data and Computer Science, Sun Yat-sen University\n\n2Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, P.R.China\n\n3School of Intelligent Systems Engineering, Sun Yat-sen University\n\n4School of Computer Science, South China Normal University\n\n{donghy7@mail2, laihanj3@mail, issjyin@mail}.sysu.edu.cn\n{xdliang328, kegong936}@gmail.com, jzhu@m.scun.edu.cn\n\nAbstract\n\nDespite remarkable advances in image synthesis research, existing works often\nfail in manipulating images under the context of large geometric transformations.\nSynthesizing person images conditioned on arbitrary poses is one of the most\nrepresentative examples where the generation quality largely relies on the capability\nof identifying and modeling arbitrary transformations on different body parts.\nCurrent generative models are often built on local convolutions and overlook the key\nchallenges (e.g. heavy occlusions, different views or dramatic appearance changes)\nwhen distinct geometric changes happen for each part, caused by arbitrary pose\nmanipulations. 
This paper aims to resolve these challenges induced by geometric variability and spatial displacements via a new Soft-Gated Warping Generative Adversarial Network (Warping-GAN), which is composed of two stages: 1) it first synthesizes a target part segmentation map given a target pose, which depicts the region-level spatial layouts for guiding image synthesis with higher-level structure constraints; 2) the Warping-GAN equipped with a soft-gated warping-block learns feature-level mapping to render textures from the original image into the generated segmentation map. Warping-GAN is capable of controlling different transformation degrees given distinct target poses. Moreover, the proposed warping-block is light-weight and flexible enough to be injected into any network. Human perceptual studies and quantitative evaluations demonstrate the superiority of our Warping-GAN, which significantly outperforms all existing methods on two large datasets.\n\n1 Introduction\n\nPerson image synthesis, one of the most challenging tasks in image analysis, has huge potential applications for movie making, human-computer interaction, motion prediction, etc. Despite recent advances in image synthesis for low-level texture transformations [13, 35, 14] (e.g. style or colors), person image synthesis remains particularly under-explored and faces more challenges that cannot be resolved due to the technical limitations of existing models. 
The main difficulties that affect the generation quality lie in substantial appearance diversity and spatial layout transformations on clothes and body parts, induced by large geometric changes for arbitrary pose manipulations. Existing models [20, 21, 28, 8, 19] built on the encoder-decoder structure fail to consider the crucial shape and appearance misalignments, often leading to unsatisfying generated person images.\n\nAmong recent attempts at person image synthesis, the best-performing methods (PG2 [20], BodyROI7 [21], and DSCF [28]) all directly used conventional convolution-based generative models\n\n*Corresponding author is Xiaodan Liang\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: Comparison against the state-of-the-art methods on DeepFashion [36], based on the same condition images and the target poses. Our results are shown in the last column. Zoom in for details.\n\nby taking either the image and target pose pairs or more body parts as inputs. DSCF [28] employed deformable skip connections to construct the generator and can only transform images at a coarse rectangular scale using simple affine properties. However, these methods ignore the most critical issue (i.e. large spatial misalignment) in person image synthesis, which limits their capabilities in dealing with large pose changes. Besides, they fail to capture the structure coherence between condition images and target poses due to the lack of modeling higher-level part-level structure layouts. Hence, their results suffer from various artifacts, blurry boundaries, and missing clothing appearance when large geometric transformations are requested by the desired poses, which is far from satisfactory. 
As Figure 1 shows, the performance of existing state-of-the-art person image synthesis methods is disappointing due to the severe misalignment problem when reaching target poses.\n\nIn this paper, we propose a novel Soft-Gated Warping-GAN to address the large spatial misalignment issues induced by geometric transformations of desired poses, which includes two stages: 1) a pose-guided parser is employed to synthesize a part segmentation map given a target pose, which depicts part-level spatial layouts to better guide the image generation with high-level structure constraints; 2) a Warping-GAN renders detailed appearances into each segmentation part by learning geometric mappings from the original image to the target pose, conditioned on the predicted segmentation map. The Warping-GAN first trains a light-weight geometric matcher and then estimates its transformation parameters between the condition and synthesized segmentation maps. Based on the learned transformation parameters, the Warping-GAN incorporates a soft-gated warping-block which warps deep feature maps of the condition image to render the target segmentation map.\n\nOur Warping-GAN has several technical merits. First, the warping-block can control the transformation degree via a soft gating function according to different pose manipulation requests. For example, a large transformation will be activated for significant pose changes, while a small degree of transformation will be performed when the original pose and target pose are similar. Second, warping informative feature maps rather than raw pixel values helps synthesize more realistic images, benefiting from powerful feature extraction. 
Third, the warping-block can adaptively select effective feature maps by attention layers to perform warping.\n\nExtensive experiments demonstrate that the proposed Soft-Gated Warping-GAN significantly outperforms the existing state-of-the-art methods on pose-based person image synthesis both qualitatively and quantitatively, especially for large pose variations. Additionally, a human perceptual study further indicates the superiority of our model, which achieves remarkably higher scores than other methods, with more realistic generated results.\n\n2 Related Work\n\nImage Synthesis. Driven by the remarkable results of GANs [10], many researchers have leveraged GANs to generate images [12, 6, 18]. DCGANs [24] introduced an unsupervised learning method to effectively generate realistic images, which combined convolutional neural networks (CNNs) with GANs. Pix2pix [13] exploited conditional adversarial networks (CGANs) [22] to tackle image-to-image translation tasks, learning the mapping from condition images to target images. CycleGAN [35], DiscoGAN [15], and DualGAN [33] each proposed an unsupervised method to translate images between two domains using unlabeled images. Furthermore, StarGAN [5] proposed a unified model for multi-domain image-to-image translation, which is effective on young-to-old, angry-to-happy, and female-to-male transformations. Pix2pixHD [30] used residual networks at two different scales to generate high-resolution images in two steps. These approaches are capable of learning to generate realistic images, but have limited scalability in handling pose-based person synthesis, because of the unseen target poses and the complex conditional appearances. Unlike those methods, we propose a novel Soft-Gated Warping-GAN that pays attention to pose alignment in deep feature space and deals with texture rendering at the region level for synthesizing person images.\nPerson Image Synthesis. 
Recently, many studies have leveraged adversarial learning for person image synthesis. PG2 [20] proposed a two-stage GAN architecture to synthesize person images based on pose keypoints. BodyROI7 [21] applied disentangling and restructuring methods to generate person images from different sampled features. DSCF [28] introduced a special U-Net [26] structure with deformable skip connections as a generator to synthesize person images from decomposed and deformable images. AUNET [8] presented a variational U-Net for generating images conditioned on a stickman (more artificial pose information), manipulating the appearance and shape with a variational autoencoder. Skeleton-Aided [32] proposed a skeleton-aided method for video generation with a standard pix2pix [13] architecture, generating human images based on poses. [1] proposed a modular GAN, separating the image into different parts and reconstructing them according to the target pose. [23] essentially used CycleGAN [35] to generate person images, applying conditional bidirectional generators to reconstruct the original image from the pose. VITON [11] used a coarse-to-fine strategy to transfer a clothing image onto a fixed-pose person image. CP-VTON [29] learns a thin-plate spline transformation for fitting in-shop clothes to the body shape of the target person via a Geometric Matching Module (GMM). However, all the methods above share a common problem: they ignore the misalignment of deep feature maps between the condition and target images. 
In this paper, we exploit a Soft-Gated Warping-GAN, including a pose-guided parser to generate the target parsing, which guides texture rendering on the specific part segmentation regions, and a novel warping-block to align the image features, which produces more realistic-looking textures for synthesizing high-quality person images conditioned on different poses.\n\n3 Soft-Gated Warping-GAN\n\nOur goal is to change the pose of a given person image to another while keeping the texture details, leveraging the transformation mapping between the condition and target segmentation maps. We decompose this task into two stages: pose-guided parsing and Warping-GAN rendering. We first describe the overview of our Soft-Gated Warping-GAN architecture. Then, we discuss the pose-guided parsing and Warping-GAN rendering in detail, respectively. Next, we present the warping-block design and the pipeline for estimating transformation parameters and warping images, which helps generate realistic-looking person images. Finally, we give a detailed description of the synthesis loss functions applied in our network.\n\n3.1 Network Architectures\n\nOur pipeline is a two-stage architecture for pose-guided parsing and Warping-GAN rendering, respectively, which includes a human parser, a pose estimator, and the affine [7]/TPS [2, 25] (Thin-Plate Spline) transformation estimator. Notably, we make the first attempt to estimate the transformation of person part segmentation maps for generating person images.\n\nFigure 2: The overview of our Soft-Gated Warping-GAN. Given a condition image and a target pose, our model first generates the target parsing using a pose-guided parser. We then estimate the transformations between the condition and target parsing by a geometric matcher, followed by a soft-gated warping-block to warp the image features. Subsequently, we concatenate the warped feature maps, embedded pose, and synthesized parsing to generate the realistic-looking image.\n\nIn stage I, we first predict the human parsing based on the target pose and the parsing result from the condition image. The synthesized parsing result serves as the spatial constraint to enhance the person coherence. In stage II, we use the synthesized parsing result from stage I, the condition image, and the target pose jointly to train a deep warping-block based generator and a discriminator, which is able to render the texture details on the specific regions. In both stages, we only take the condition image and the target pose as input. In contrast to AUNET [8] (which uses a 'stickman' to represent pose, involving more artificial constraints for training), we follow PG2 [20] to encode the pose with 18 heatmaps. Each heatmap has one point that is filled with 1 within a 4-pixel-radius circle and 0 elsewhere.\n\n3.1.1 Stage I: Pose-Guided Parsing\n\nTo learn the mapping from the condition image to the target pose on a part level, a pose-guided parser is introduced to generate the human parsing of the target image conditioned on the pose. The synthesized human parsing contains pixel-wise class labels that can guide the image generation on the class level, as it can help to refine the detailed appearance of parts, such as the face, clothes, and hands. Since the DeepFashion and Market-1501 datasets do not have human parsing labels, we use the LIP [9] dataset to train a human parsing network. The LIP [9] dataset consists of 50,462 images and each person has 20 semantic labels. To capture the refined appearance of the person, we transfer the synthesized parsing labels into a one-hot tensor with 20 channels. Each channel is a binary mask vector and denotes one class of person parts. 
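The two input encodings described above (18 binary pose heatmaps with a 4-pixel-radius disk per keypoint, and a 20-channel one-hot parsing tensor) can be sketched as follows; function names and the convention of marking invisible keypoints with negative coordinates are illustrative assumptions, not the paper's code.

```python
import numpy as np

def pose_heatmaps(keypoints, height, width, radius=4):
    """Encode pose keypoints as binary heatmaps: each channel is 1 inside a
    `radius`-pixel disk around its keypoint and 0 elsewhere."""
    maps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for c, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:  # assumed convention: negative coords mark an invisible keypoint
            continue
        maps[c][(xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2] = 1.0
    return maps

def one_hot_parsing(label_map, num_classes=20):
    """Convert an HxW parsing label map into a one-hot tensor with one
    binary mask channel per semantic class."""
    return (np.arange(num_classes)[:, None, None] == label_map[None, :, :]).astype(np.float32)
```

For an 18-keypoint pose on a 256x256 image, `pose_heatmaps` yields an 18x256x256 tensor and `one_hot_parsing` a 20x256x256 tensor, which are concatenated with the condition image as generator input.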
These vectors are trained jointly with the condition image and pose to capture information from both the image features and the structure of the person, which helps synthesize more realistic-looking person images. Adapted from Pix2pix [13], the generator of the pose-guided parser contains 9 residual blocks. In addition, we utilize a pixel-wise softmax loss from LIP [9] to enhance the quality of results. As shown in Figure 2, the pose-guided parser consists of one ResNet-like generator, which takes the condition image and target pose as input, and outputs the target parsing which obeys the target pose.\n\n3.1.2 Stage II: Warping-GAN Rendering\n\nFigure 3: The architecture of Geometric Matcher. We first produce a condition parsing from the condition image by a parser. Then, we synthesize the target parsing with the target pose and the condition parsing. The condition and synthesized parsing are passed through two feature extractors, respectively, followed by a feature matching layer and a regression network. The condition image is warped using the transformation grid in the end.\n\nIn this stage, we exploit a novel region-wise learning to render the texture details on specific regions, guided by the synthesized parsing from stage I. Formally, let Ii = P(li) denote the function for the region-wise learning, where Ii and li denote the i-th pixel value and the class label of this pixel, respectively, i (0 ≤ i < n) denotes the index of a pixel in the image, and n is the total number of pixels in one image. 
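As a minimal illustration of warping an image with a transformation grid (only the affine part; the geometric matcher additionally cascades a TPS transformation), the sketch below applies a hypothetical 2x3 affine matrix `theta` with nearest-neighbor sampling. The function name, the output-to-source coordinate convention, and the sampling scheme are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def affine_grid_warp(image, theta):
    """Warp an HxW (or HxWxC) image with a 2x3 affine matrix `theta` that
    maps output coordinates (x, y, 1) to source coordinates. Uses
    nearest-neighbor sampling; out-of-bounds pixels stay zero."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Homogeneous output coordinates -> source coordinates, one column per pixel.
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = theta @ coords                      # 2 x (h*w), rows are (x, y)
    sx = np.rint(src[0]).astype(int)
    sy = np.rint(src[1]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(image)
    out.reshape(h * w, -1)[valid] = image.reshape(h * w, -1)[sy[valid] * w + sx[valid]]
    return out
```

With `theta = [[1, 0, 0], [0, 1, 0]]` this is the identity warp; a translation or rotation matrix produces the corresponding shifted or rotated result. In the paper this grid is applied to deep feature maps rather than raw pixels.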
Note that in this work, the segmentation map is also called parsing or human parsing, since our method targets person images.\n\nHowever, the misalignment between the condition and target images leads to blurry generated results. To alleviate this problem, we further learn the mapping between the condition image and target pose by introducing two novel components: the geometric matcher and the soft-gated warping-block. Inspired by the geometric matching method GEO [25], we propose a parsing-based geometric matching method to estimate the transformation between the condition and synthesized parsing. Besides, we design a novel block, named the warping-block, for warping the condition image on a part level, using the synthesized parsing from stage I. Note that those transformation mappings are estimated from the parsing, and we can use them to warp the deep features of the condition image.\n\nGeometric Matcher. We train a geometric matcher to estimate the transformation mapping between the condition and synthesized parsing, as illustrated in Figure 3. Different from GEO [25], we handle this issue as parsing context matching, which can also estimate the transformation effectively. Due to the lack of the target image in the test phase, we use the condition and synthesized parsing to compute the transformation parameters. In our method, we combine affine and TPS to obtain the transformation mapping through a siamese convolutional neural network following GEO [25]. To be specific, we first estimate the affine transformation between the condition and synthesized parsing. Based on the results from the affine estimation, we then estimate the TPS transformation parameters between the warping results from the affine transformation and the target parsing. The transformation mappings are adopted to transform the extracted features of the condition image, which helps to alleviate the misalignment problem.\nSoft-gated Warping-Block. 
Inspired by [3], having obtained the transformation mapping from the geometric matcher, we subsequently use this mapping to warp deep feature maps, which is able to capture significant high-level information and thus helps to synthesize images in an approximate shape with more realistic-looking details. We combine the affine [7] and TPS [2] (Thin-Plate Spline) transformations as the transformation operation of the warping-block. As shown in Figure 4, we denote those transformations as the transformation grid.\n\nFigure 4: The architecture of the soft-gated warping-block. Zoom in for details.\n\nFormally, let Φ(I) denote the deep feature map, R(Φ(I)) denote the residual feature map from Φ(I), and W(I) represent the operation of the transformation grid. Regarding T as the transformation operation, we formulate the transformation mapping of the warping-block as:\n\nT(Φ(I)) = Φ(I) + W(I) · R(Φ(I)),  (1)\n\nwhere · denotes element-wise multiplication and e denotes an element of W(I), e ∈ [0, 1], which acts as the soft gate of the residual feature maps to address the misalignment problem caused by different poses. Hence, the warping-block can control the transformation degree via a soft-gated function according to different pose manipulation requests, and it is light-weight and flexible enough to be injected into any generative network.\n\n3.1.3 Generator and Discriminator\n\nGenerator. Adapted from pix2pix [13], we build two residual-like generators. The generator in stage I contains standard residual blocks in the middle, while the generator in stage II adds the warping-block after the encoder. As shown in Figure 2, both generators consist of an encoder, a decoder, and 9 residual blocks.\n\nDiscriminator. To achieve stabilized training, inspired by [30], we adopt pyramidal hierarchy layers to build the discriminator in both stages, as illustrated in Figure 5. We combine the condition image, target keypoints, condition parsing, real/synthesized parsing, and real/synthesized image as input for the discriminator. We observe that feature maps from the pyramidal hierarchy discriminator help enhance the quality of synthesized images. More details are given in the following section.\n\nFigure 5: The overview of the discriminator in stage II. Zoom in for details.\n\n3.2 Objective Functions\n\nWe aim to build a generator that synthesizes person images in arbitrary poses. Due to the complex details of the image and the variety of poses, it is challenging to train the generator well. To address these issues, we apply four losses that alleviate the problem from different aspects: the adversarial loss Ladv [10], the pixel-wise loss Lpixel [32], the perceptual loss Lperceptual [14, 11, 17], and the pyramidal hierarchy loss LPH. As shown in Figure 5, the pyramidal hierarchy contains useful and effective feature maps at different scales in different layers of the discriminator. To fully leverage these feature maps, we define a pyramidal hierarchy loss, as illustrated in Eq. 2:\n\nLPH = Σ_{i=0}^{n} αi ||Fi(Î) − Fi(I)||_1,  (2)\n\nwhere Fi(I) denotes the i-th (i = 0, 1, 2) layer feature map from the trained discriminator. 
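As a compact sketch of these computations, the NumPy code below implements the soft-gated warping of Eq. (1) (interpreting W(I) · R(Φ(I)) as element-wise gating, as the description of the gate elements e ∈ [0, 1] suggests), the pyramidal hierarchy loss of Eq. (2), and the weighted objective of Eq. (3). Shapes, function names, and the per-layer L1 reduction are illustrative assumptions.

```python
import numpy as np

def soft_gated_warp(phi, gate, residual):
    """Eq. (1): T(Phi(I)) = Phi(I) + W(I) * R(Phi(I)).
    `gate` plays the role of W(I): elements in [0, 1] softly gate the
    warped residual feature map `residual` (assumed already warped)."""
    assert gate.min() >= 0.0 and gate.max() <= 1.0
    return phi + gate * residual

def pyramidal_hierarchy_loss(feats_fake, feats_real, alphas):
    """Eq. (2): weighted sum over discriminator layers of the L1 distance
    between feature maps of the synthesized and real images."""
    return sum(a * np.abs(f - r).sum()      # ||.||_1 per layer
               for a, f, r in zip(alphas, feats_fake, feats_real))

def total_loss(l_adv, l_pixel, l_perceptual, l_ph, lambdas):
    """Eq. (3): weighted sum of the four generator losses."""
    return (lambdas[0] * l_adv + lambdas[1] * l_pixel
            + lambdas[2] * l_perceptual + lambdas[3] * l_ph)
```

A gate of all zeros leaves the feature map untouched (small pose change), while a gate of all ones applies the full warped residual (large pose change), matching the soft-gating behaviour described above.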
We use the L1 norm to compute the loss on the feature maps of each layer and sum the losses with the weights αi. The generator objective is a weighted sum of the different losses, written as follows:\n\nLtotal = λ1 Ladv + λ2 Lpixel + λ3 Lperceptual + λ4 LPH,  (3)\n\nwhere λi denotes the weight of the i-th loss.\n\n4 Experiments\n\nWe perform extensive experiments to evaluate the capability of our approach against recent methods on two widely used datasets. Moreover, we further perform a human perceptual study on Amazon Mechanical Turk (AMT) to evaluate the visual results of our method. Finally, we present an ablation study to verify the influence of each important component in our framework.\n\n4.1 Datasets and Implementation Details\n\nDeepFashion [36] consists of 52,712 person images in fashion clothes with image size 256×256. Following [20, 21, 28], we remove failure cases using a pose estimator [4] and a human parser [9], then extract the image pairs that contain the same person in the same clothes with two different poses. We select 81,414 pairs as our training set and randomly select 12,800 pairs for testing.\n\nMarket-1501 [34] contains 322,668 images collected from 1,501 persons with image size 128×64. Following [34, 20, 21, 28], we extract about 600,000 image pairs. 
Then we also randomly select 12,800 pairs as the test set and 296,938 pairs for training.\n\nTable 1: Comparison on DeepFashion and Market-1501 datasets.\n\nModel | DeepFashion [36] SSIM / IS | Market-1501 [34] SSIM / IS\npix2pix [13] (CVPR2017) | 0.692 / 3.249 | 0.183 / 2.678\nPG2 [20] (NIPS2017) | 0.762 / 3.090 | 0.253 / 3.460\nDSCF [28] (CVPR2018) | 0.761 / 3.351 | 0.290 / 3.185\nUPIS [23] (CVPR2018) | 0.747 / 2.97 | – / –\nAUNET [8] (CVPR2018) | 0.786 / 3.087 | 0.353 / 3.214\nBodyROI7 [21] (CVPR2018) | 0.614 / 3.228 | 0.099 / 3.483\nw/o parsing | 0.692 / 3.146 | 0.236 / 2.489\nw/o soft-gated warping-block | 0.777 / 3.262 | 0.337 / 3.394\nw/o Ladv | 0.780 / 3.430 | 0.346 / 3.332\nw/o Lperceptual | 0.772 / 3.446 | 0.319 / 3.407\nw/o Lpixel | 0.780 / 3.270 | 0.337 / 3.292\nw/o LPH | 0.776 / 3.323 | 0.337 / 3.448\nOurs (full) | 0.793 / 3.314 | 0.356 / 3.409\n\nTable 2: Pairwise comparison with other approaches. Chance is at 50%. Each cell lists the percentage where our result is preferred over the other method.\n\nDataset | pix2pix [13] | PG2 [20] | BodyROI7 [21] | DSCF [28]\nDeepFashion [36] | 98.7% | 87.3% | 96.3% | 79.6%\nMarket-1501 [34] | 83.4% | 67.7% | 68.4% | 62.1%\n\nEvaluation Metrics: We use Amazon Mechanical Turk (AMT) to evaluate the visual quality of synthesized results. We also apply Structural SIMilarity (SSIM) [31] and Inception Score (IS) [27] for quantitative analysis.\n\nImplementation Details. Our network architecture is adapted from pix2pixHD [30], which has shown remarkable results for synthesizing images. The architecture consists of a generator with an encoder-decoder structure and a multi-layer discriminator. The generator contains three downsampling layers, three upsampling layers, nine residual blocks, and one warping-block. 
We apply convolutional layers at three different scales in the discriminator, and extract features from those layers to compute the Pyramidal Hierarchy loss (PH loss), as Figure 5 shows. We use the Adam [16] optimizer and set β1 = 0.5, β2 = 0.999. We use a learning rate of 0.0002 and a batch size of 12 for DeepFashion and 20 for Market-1501.\n\n4.2 Quantitative Results\n\nTo verify the effectiveness of our method, we conduct experiments on two benchmarks and compare against six recent related works. For fair comparison, we directly use the results provided by the authors. The comparison results and the advantages of our approach are clearly reflected in the numerical scores in Table 1. Our proposed method consistently outperforms all baselines on the SSIM metric on both datasets, thanks to the Soft-Gated Warping-GAN, which can render high-level structure textures and control different transformation degrees to handle geometric variability. Our network also achieves comparable results on the IS metric on the two datasets, which confirms the generalization ability of our Soft-Gated Warping-GAN.\n\n4.3 Human Perceptual Study\n\nWe further evaluate our algorithm via a human subjective study. We perform pairwise A/B tests deployed on the Amazon Mechanical Turk (MTurk) platform on DeepFashion [36] and Market-1501 [34]. The workers are given two generated person images at once: one is synthesized by our method and the other is produced by the compared approach. They are then given unlimited time to select which image looks more natural. Our method achieves significantly better human evaluation scores, as summarized in Table 2.\n\nFigure 6: Comparison against the recent state-of-the-art methods, based on the same condition image and the target pose. Our results on DeepFashion [36] are shown in the last column. Zoom in for details.\n\nFigure 7: Comparison against the recent state-of-the-art methods, based on the same condition image and the target pose. Our results on Market-1501 [34] are shown in the last column.\n\nFor example, compared to BodyROI7 [21], 96.3% of workers determined that our method generated more realistic person images on the DeepFashion [36] dataset. This superior performance confirms the effectiveness of our network, comprising a human parser and the soft-gated warping-block, which synthesizes more realistic and natural person images.\n\n4.4 Qualitative Results\n\nWe next present and discuss a series of qualitative results that highlight the main characteristics of the proposed approach, including its ability to render textures with high-level semantic part segmentation details and to control the transformation degrees conditioned on various poses. The qualitative results on DeepFashion [36] and Market-1501 [34] are visualized in Figure 6 and Figure 7. Existing methods create blurry and coarse results without considering rendering the details of target clothing items via human parsing. Some methods produce sharper edges, but also cause undesirable artifacts and missing parts. In contrast to these methods, our approach accurately and seamlessly generates more detailed and precise person images conditioned on different target poses. However, there are some blurry results on Market-1501 [34], which might result from the human parser in our model being affected by the low-resolution images. We present and analyze more visualized results and failure cases in the supplementary materials.\n\n4.5 Ablation Study\n\nTo verify the impact of each component of the proposed method, we conduct an ablation study on DeepFashion [36] and Market-1501 [34]. As shown in Table 1 and Figure 8, we report evaluation results of different versions of the proposed method. We first compare the results using pose-guided parsing to the results without using it. 
From the comparison, we can see that incorporating the human parser into our generator significantly improves the quality of generation, as the part segmentation maps depict the region-level spatial layouts that guide image synthesis with higher-level structure constraints. We then examine the effectiveness of the proposed soft-gated warping-block. From Table 1 and Figure 8, we observe that the performance drops dramatically without the soft-gated warping-block. The results suggest that the improved performance attained by inserting the warping-block is not merely due to the additional parameters, but to the effective mechanism inherently brought by the warping operations, which act as a soft gate to control different transformation degrees according to the pose manipulations. We also study the importance of each term in our objective function. As can be seen, adding each of the four losses substantially enhances the results.\n\nFigure 8: Ablation studies on DeepFashion [36]. Zoom in for details.\n\n5 Conclusion\n\nIn this work, we presented a novel Soft-Gated Warping-GAN for pose-guided person image synthesis, which aims to resolve the challenges induced by geometric variability and spatial displacements. Our approach incorporates a human parser to produce a target part segmentation map to instruct image synthesis with higher-level structure information, and a soft-gated warping-block to warp the feature maps for rendering the textures. By effectively controlling different transformation degrees conditioned on various target poses, our proposed Soft-Gated Warping-GAN can generate remarkably realistic and natural results with the best human perceptual score. 
Qualitative and quantitative experimental results demonstrate the superiority of our proposed method, which achieves state-of-the-art performance on two large datasets.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (61472453, U1401256, U1501252, U1611264, U1711261, U1711262, 61602530), and by the National Natural Science Foundation of China (NSFC) under Grant No. 61836012.

References

[1] Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. Synthesizing images of humans in unseen poses. In CVPR, 2018.

[2] Fred L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(6):567–585, 1989.

[3] Kaidi Cao, Yu Rong, Cheng Li, Xiaoou Tang, and Chen Change Loy. Pose-robust face recognition via deep residual equivariant mapping. In CVPR, pages 5187–5196, 2018.

[4] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.

[5] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.

[6] Zhijie Deng, Hao Zhang, Xiaodan Liang, Luona Yang, Shizhen Xu, Jun Zhu, and Eric P Xing. Structured generative adversarial networks. In NIPS, pages 3899–3909, 2017.

[7] Ping Dong and Nikolas P Galatsanos. Affine transformation resistant watermarking based on image normalization. In ICIP, pages 489–492, 2002.

[8] Patrick Esser, Ekaterina Sutter, and Björn Ommer. A variational U-Net for conditional appearance and shape generation.
In CVPR, 2018.

[9] Ke Gong, Xiaodan Liang, Xiaohui Shen, and Liang Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR, pages 6757–6765, 2017.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.

[11] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. VITON: An image-based virtual try-on network. arXiv preprint arXiv:1711.08447, 2017.

[12] Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, Xiaodan Liang, Lianhui Qin, Haoye Dong, and Eric Xing. Deep generative models with learnable knowledge constraints. In NIPS, 2018.

[13] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

[14] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711, 2016.

[15] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.

[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 105–114, 2017.

[18] Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P Xing. Dual Motion GAN for future-flow embedded video prediction. In ICCV, 2017.

[19] Xiaodan Liang, Hao Zhang, Liang Lin, and Eric Xing.
Generative semantic manipulation with mask-contrasting GAN. In ECCV, pages 558–573, 2018.

[20] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In NIPS, 2017.

[21] Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, and Mario Fritz. Disentangled person image generation. In CVPR, 2018.

[22] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[23] Albert Pumarola, Antonio Agudo, Alberto Sanfeliu, and Francesc Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In CVPR, 2018.

[24] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[25] I. Rocco, R. Arandjelović, and J. Sivic. Convolutional neural network architecture for geometric matching. In CVPR, 2017.

[26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.

[27] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.

[28] Aliaksandr Siarohin, Enver Sangineto, Stephane Lathuiliere, and Nicu Sebe. Deformable GANs for pose-based human image generation. In CVPR, 2018.

[29] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward characteristic-preserving image-based virtual try-on network. In ECCV, 2018.

[30] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. arXiv preprint arXiv:1711.11585, 2017.

[31] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli.
Image quality assessment: From error visibility to structural similarity. TIP, 13(4):600–612, 2004.

[32] Yichao Yan, Jingwei Xu, Bingbing Ni, Wendong Zhang, and Xiaokang Yang. Skeleton-aided articulated motion generation. In ACM MM, 2017.

[33] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint, 2017.

[34] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, pages 1116–1124, 2015.

[35] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

[36] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, pages 1096–1104, 2016.