{"title": "ETNet: Error Transition Network for Arbitrary Style Transfer", "book": "Advances in Neural Information Processing Systems", "page_first": 670, "page_last": 679, "abstract": "Numerous valuable efforts have been devoted to achieving arbitrary style transfer since the seminal work of Gatys et al. However, existing state-of-the-art approaches often generate insufficiently stylized results under challenging cases. We believe a fundamental reason is that these approaches try to generate the stylized result in a single shot and hence fail to fully satisfy the constraints on semantic structures in the content images and style patterns in the style images. Inspired by the works on error-correction, instead, we propose a self-correcting model to predict what is wrong with the current stylization and refine it accordingly in an iterative manner. For each refinement, we transit the error features across both the spatial and scale domain and invert the processed features into a residual image, with a network we call Error Transition Network (ETNet). The proposed model improves over the state-of-the-art methods with better semantic structures and more adaptive style pattern details. Various qualitative and quantitative experiments show that the key concept of both progressive strategy and error-correction leads to better results. Code and models are available at https://github.com/zhijieW94/ETNet.", "full_text": "ETNet: Error Transition Network for Arbitrary\n\nStyle Transfer\n\nChunjin Song\u2217\nShenzhen University\n\nsongchunjin1990@gmail.com\n\nZhijie Wu\u2217\n\nShenzhen University\n\nwzj.micker@gmail.com\n\nYang Zhou\u2020\n\nShenzhen University\n\nzhouyangvcc@gmail.com\n\nMinglun Gong\n\nUniversity of Guelph\n\nminglun@uoguelph.ca\n\nHui Huang\u2020\n\nShenzhen University\nhhzhiyan@gmail.com\n\nAbstract\n\nNumerous valuable efforts have been devoted to achieving arbitrary style transfer\nsince the seminal work of Gatys et al. 
However, existing state-of-the-art approaches\noften generate insuf\ufb01ciently stylized results under challenging cases. We believe a\nfundamental reason is that these approaches try to generate the stylized result in a\nsingle shot and hence fail to fully satisfy the constraints on semantic structures in\nthe content images and style patterns in the style images. Inspired by the works\non error-correction, instead, we propose a self-correcting model to predict what is\nwrong with the current stylization and re\ufb01ne it accordingly in an iterative manner.\nFor each re\ufb01nement, we transit the error features across both the spatial and scale\ndomain and invert the processed features into a residual image, with a network we\ncall Error Transition Network (ETNet). The proposed model improves over the\nstate-of-the-art methods with better semantic structures and more adaptive style\npattern details. Various qualitative and quantitative experiments show that the key\nconcept of both progressive strategy and error-correction leads to better results.\nCode and models are available at https://github.com/zhijieW94/ETNet.\n\n1\n\nIntroduction\n\nStyle transfer is a challenging image manipulation task, whose goal is to apply the extracted style\npatterns to content images. Starting from the works on neural network based generative models [17,\n10, 2, 3], the seminal work of Gatys et al. [9] made the \ufb01rst effort to achieve stylization using\nneural networks with surprising results. The basic assumption they made is that the feature map\noutput by a deep encoder represents the content information, while the correlations between the\nfeature channels of multiple deep layers encode the style info. Following Gatys et al. 
[9], subsequent\nworks [8, 16, 26, 5, 31] try to address problems in quality, generalization and ef\ufb01ciency through\nreplacing the time-consuming optimization process with feed-forward neural networks.\nIn this paper, we focus on how to transfer arbitrary styles with one single model. The existing\nmethods achieve this goal by simple statistic matching [13, 21], local patch exchange [6] and their\ncombinations [24, 28]. Even with their current success, these methods still suffer from artifacts, such\nas the distortions to both semantic structures and detailed style patterns. This is because, when there\nis large variation between content and style images, it is dif\ufb01cult to use the network to synthesize the\nstylized outputs in a single step; see e.g., Figure 1(a).\n\n\u2217Equal contribution, in alphabetic order.\n\u2020Corresponding authors.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: State of the art methods are all able to achieve good stylization with a simple target style\n(top row in (a)). But for a complex style (bottom row in (a)), both WCT and Avatar-Net distort the\nspatial structures and fail to preserve texture consistency, our method, however, still performs well.\nDifferent from the existing methods, our proposed model achieves style transfer by a coarse-to-\ufb01ne\nre\ufb01nement. One can see that, from left to right in (b), \ufb01ner details arise more and more along with\nthe re\ufb01nements. See the close-up views in our supplementary material for a better visualization.\n\nTo address these challenges, we propose an iterative error-correction mechanism [11, 22, 4, 18]\nto break the stylization process into multiple re\ufb01nements; see examples in Figure 1(b). Given an\ninsuf\ufb01ciently stylized image, we compute what is wrong with the current estimate and transit the\nerror information to the whole image. 
The motivation is simple: the detected errors estimate the residuals to the ground truth and thus can guide the refinement effectively. Both the stylized images and error features encode information from the same content-style image pairs. Though they do correlate with each other, we cannot simply add the residuals to the images or to the deep features, especially to content and style features simultaneously. Thus, a novel error transition module is introduced to predict a residual image that is compatible with the intermediate stylized image. We first compute the correlation between the error features and the deep features extracted from the stylized image. Then errors are diffused to the whole image according to their similarities [27, 14].
Following [9], we employ a VGG-based encoder to extract deep features and measure the distance of the current stylization to the ground truth defined by a content-style image pair. Meanwhile, a decoder is designed for error propagation across both the spatial and scale domains, and finally inverts the processed features back to RGB space. Considering the multi-scale property of style features, our error propagation also proceeds in a coarse-to-fine manner. At the coarsest level, we utilize a non-local block to capture long-range dependencies, so that the error can be transited to the whole image. The coarse error is then propagated to finer scales to keep consistency across scales. As a cascaded refinement framework, our successive error-correction procedures can be naturally embedded into a Laplacian pyramid [1], which benefits both the efficiency and effectiveness of training.
In experiments, our model consistently outperforms the existing state-of-the-art models in qualitative and quantitative evaluations by a large margin. 
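The coarse-to-fine, error-correcting refinement described above can be sketched as follows. This is a minimal numpy mock-up rather than the authors' implementation: `etnet` is a hypothetical stand-in for the learned per-level Error Transition Network, and nearest-neighbour/average resampling replaces the paper's pyramid operations.

```python
import numpy as np

def upsample2x(img):
    # Nearest-neighbour upsampling as a placeholder for the paper's upsampling step.
    return img.repeat(2, axis=0).repeat(2, axis=1)

def downsample2x(img):
    # Simple 2x2 average pooling stand-in for pyramid downsampling.
    h, w = img.shape[0], img.shape[1]
    return img.reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def stylize(content, style, etnet, levels=3):
    """Coarse-to-fine stylization over a Laplacian-style pyramid.

    `etnet` is a hypothetical callable (stylized, content, style) -> residual image,
    standing in for the per-level Error Transition Network ETNet_k.
    """
    # Build downsampled pyramids I_c^k, I_s^k for k = 1 .. levels.
    c_pyr, s_pyr = [content], [style]
    for _ in range(levels - 1):
        c_pyr.append(downsample2x(c_pyr[-1]))
        s_pyr.append(downsample2x(s_pyr[-1]))
    # The initial stylized image at the coarsest level is all zeros.
    stylized = np.zeros_like(c_pyr[-1])
    for k in range(levels - 1, -1, -1):
        # Predict a residual image and refine the current estimate.
        stylized = stylized + etnet(stylized, c_pyr[k], s_pyr[k])
        if k > 0:  # Upsample before the next (finer) level.
            stylized = upsample2x(stylized)
    return stylized
```

With a toy `etnet` that simply returns `content - stylized`, each level's residual corrects the estimate exactly, so the loop reproduces the content image; a trained network would instead output style-aware residuals.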
In summary, our contributions are:

• We first introduce the concept of error-correction mechanism to style transfer by evaluating errors in stylization results and correcting them iteratively.
• By explicitly computing the features for perceptual loss in a feed-forward network, each refinement is formulated as an error diffusion process.
• The overall pyramid-based style transfer framework can perform arbitrary style transfer and synthesize highly detailed results with favored styles.

2 Related Work

Under the context of deep learning, many works and valuable efforts [9, 16, 20, 26, 29, 33, 31, 32] have approached the problems of style transfer. In this section, we only focus on the related works on arbitrary style transfer and refer the readers to [15] for a comprehensive survey.
The seminal work of Gatys et al. [9] first proposes to transfer arbitrary styles via a back-propagation optimization. Our approach shares the same iterative error-correction spirit with theirs.

Figure 2: Framework of our proposed stylization procedure. We start with a zero vector to represent the initial stylized image, i.e. Î^3_cs = 0. Together with the downsampled input content-style image pair (I^3_c and I^3_s), it is fed into the residual image generator ETNet_3 to generate a residual image I^3_h. The sum of I^3_h and Î^3_cs gives us the updated stylized image I^3_cs, which is then upsampled into Î^2_cs. This process is repeated across two subsequent levels to yield a final stylized image I^1_cs with full resolution.

The key difference is that our method achieves error-correction in a feed-forward manner, which allows us to explicitly perform joint analysis between errors and the synthesized image. 
With the computed (especially the long-range) dependencies guiding the error diffusion, a better residual image can hence be generated.
Since then, several recent papers have proposed novel models to reduce the computation cost with a feed-forward network. The methods based on local patch exchange [6], global statistic matching [13, 21, 19] and their combinations [24, 28] all develop a generative decoder to reconstruct the stylized images from the latent representation. Meanwhile, Shen et al. [23] employ meta networks to improve the flexibility for different styles. However, these methods all stylize images in one step, which limits their capability in transferring arbitrary styles under challenging cases. Hence, when the input content and style images differ greatly, they struggle to preserve spatial structures and adopt style patterns at the same time, leading to poor stylization results.
Different from the above methods, our approach shares the same spirit with networks coupled with an error-correction mechanism [11, 22, 4, 18]. Rather than directly learning a mapping from input to target in one step, these networks [30] decompose the prediction process into multiple steps so that the model performs an error-correction procedure. Such procedures have many successful applications, including super-resolution [11], video structure modeling [22], human pose estimation [4] and image segmentation [18]. To our knowledge, we make the first effort to introduce the error-correction mechanism to arbitrary style transfer. Coupled with a Laplacian pyramid, the proposed scheme enables the networks to gradually bring the results closer to the desired stylization and synthesize outputs with better structures and fine-grained details.

3 Developed Framework

We formulate style transfer as a succession of refinements based on error-correction procedures. 
As shown in Figure 2, the overall framework is implemented using a Laplacian pyramid [1, 12, 7] with 3 levels. Each refinement is performed by an error transition process with an affinity matrix between pixels [25, 14], followed by an error propagation procedure in a coarse-to-fine manner to compute a residual image. Each element of the affinity matrix is computed from the similarity between the error feature at one pixel and the feature of the stylized result at another pixel, which measures the possibility of error transition between them. See Section 3.1 for more details.

3.1 Error Transition Networks for Residual Images

We develop the Error Transition Network to generate a residual image for each level of a pyramid. As illustrated in Figure 3, it contains two VGG-based encoders and one decoder. Taking a content-style image pair (Ic and Is) and an intermediate stylized image (Îcs) as inputs, one encoder (Θ_in(·)) is used to extract deep features {f^i_in} from Îcs, i.e. f^i_in = Θ^i_in(Îcs), and the other encoder (Θ_err(·)) is used for the error computation. The superscript i represents the feature extracted at the relu_i_1 layer. The deepest features of both encoders are the outputs of the relu_4_1 layer, which means i ∈ {1, 2, ···, L}

Figure 3: Error Transition Network (a) and the detailed architecture of the error propagation block (b). For ETNet, the input images include a content-style image pair (Ic, Is) and an intermediate stylization (Îcs). The two encoders extract deep features {f^i_in} and error features {ΔE^i_c, ΔE^i_s} respectively. After the fusion of ΔE^4_c and ΔE^4_s, we input the fused error feature ΔE^4, together with f^4_in, into a non-local block to compute a global residual feature ΔD^4. 
Then both ΔD^4 and ΔE^4 are fed to a series of error propagation blocks to further receive lower-level information until we get the residual image Ih. Finally, we add Ih to Îcs and output the refined image Ics.

and L = 4. Then a decoder is used to diffuse errors and invert the processed features into the final residual image. Following [9], the error of the stylized image Îcs contains two parts, content error ΔE^i_c and style error ΔE^i_s, which are defined as

ΔE^i_c(Îcs, Ic) = Θ^i_err(Ic) − Θ^i_err(Îcs),   ΔE^i_s(Îcs, Is) = G^i(Is) − G^i(Îcs),   (1)

where G^i(·) represents a Gram matrix for features extracted at layer i in the encoder network.
To leverage the correlation between the error features and the deep features of the stylized image so as to determine a compatible residual feature ΔD, we apply the trick of non-local blocks [27] here. To do that, we first fuse the content and style error features into a full error feature:

ΔE^L = Ψ^L(ΔE^L_c, ΔE^L_s) = ΔE^L_c · W · ΔE^L_s,   (2)

where Ψ(·) and W indicate the fusion operation and a learnable matrix respectively, following [33, 31]. Let flat(·) and softmax(·) denote the flatten operation along the channel dimension and a softmax operation respectively. Then the residual feature at the L-th scale is determined as:

ΔD^L = flat(ΔE^L ⊗ ψh) · softmax(flat(ΔE^L ⊗ ψu) · flat(f^L_in ⊗ ψg)^T),   (3)

where ⊗ represents a convolution operation and {ψh, ψu, ψg} are implemented as learnable 1 × 1 convolutions [27]. 
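The non-local error transition of Eq. (3) can be illustrated with a toy numpy sketch. This is an assumption-laden mock-up, not the authors' code: the 1×1 convolutions ψh, ψu, ψg are reduced to plain matrix multiplications, features are pre-flattened to (N, C), and the standard non-local aggregation (attention rows normalized over positions) is used.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_residual(delta_e, f_in, psi_h, psi_u, psi_g):
    """Toy version of Eq. (3): diffuse error features over the whole image.

    delta_e: fused error feature, shape (N, C), N flattened spatial positions.
    f_in:    deep feature of the current stylized image, shape (N, C).
    psi_*:   (C, C') matrices standing in for the learnable 1x1 convolutions.
    """
    # Affinity between the error at one position and the stylized-image
    # feature at another; each row is normalized over all positions.
    affinity = softmax((delta_e @ psi_u) @ (f_in @ psi_g).T, axis=-1)  # (N, N)
    # Transit errors to the whole image according to the affinities.
    return affinity @ (delta_e @ psi_h)  # residual feature, shape (N, C')
```

Because the affinity matrix couples every position with every other one, long-range dependencies are captured in a single step, which is the property the paper relies on at the coarsest scale.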
The affinity matrix outputted by the softmax operation determines the error transition within the whole image. And since long-range dependencies are well captured by the affinity matrix, we are able to better respect the semantic structures, producing superior style pattern consistency. Note that, similar to the previous work [27], we only apply the non-local blocks at the top scale (i.e., L = 4) to better consider the locality of lower-level features and reduce the computation burden.
Next, ΔE^L and ΔD^L are fed into cascaded residual propagation blocks to receive lower-level information step by step. As shown in Figure 3(b), a residual propagation block at the i-th layer takes four features as input and has two branches, which are used to refine ΔD^i and ΔE^i respectively. Conditioned on ΔE^i and ΔD^i of the last scale, to make the residual feature ΔD^{i−1} more compatible with the stylized image Îcs and keep the fine scale consistent with coarser scales, we take all of them into account to compute the residual feature. Thus we obtain ΔE^{i−1} and ΔD^{i−1} as:

ΔE^{i−1} = ΔĒ^{i−1} ⊗ φu,   ΔĒ^{i−1} = Ψ^{i−1}(ΔE^i ⊗ φt, ΔE^{i−1}_s),   (4)

ΔD^{i−1} = [ΔD^i ⊗ φv, f^{i−1}_in, ΔĒ^{i−1}] ⊗ φw.   (5)

Here [·,·,·] denotes a concatenation operation to jointly consider all inputs for simplicity, and {φt, φu, φv, φw} represent learnable convolutions for feature adjustment. We further denote ΔÊ^i ∈ R^{N×C^{i−1}} as the processed feature, ΔÊ^i = ΔE^i ⊗ φt. N and C^{i−1} indicate the number of pixels and the channel number of ΔÊ^i respectively. 
The learnable matrix Ψ^{i−1} ∈ R^{C^{i−1}×C^{i−1}} aims to associate ΔÊ^i and ΔE^{i−1}_s for additional feature fusion. Both ΔE^{i−1} and ΔD^{i−1} will be successively upsampled until i = 1. Finally, ΔD^1 will be directly outputted as a residual image Ih, which will be added to Îcs to compute Ics. Then Ics will be inputted to the finer level of the Laplacian pyramid for further refinement or outputted as the final stylization result.

3.2 Style transfer via a Laplacian Pyramid
Figure 2 illustrates a progressive procedure for a pyramid with K = 3 to stylize a 768 × 768 image. Similar to [7], except for the final level, the input images (I^k_c and I^k_s) at the k-th level of the pyramid are downsampled versions of the original full-resolution images. Correspondingly, Î^k_cs denotes the intermediate stylized image. With error transition networks, the stylization is performed in a coarse-to-fine manner. At the beginning, we set the initial stylized image Î^3_cs to be an all-zero vector with the same size as I^3_c. Setting k = 3, together with I^k_c and I^k_s, we compute a residual image I^k_h as:

I^k_cs = Î^k_cs + I^k_h = Î^k_cs + ETNet_k(Î^k_cs, I^k_c, I^k_s),   (6)

where ETNet_k denotes the error transition network at the k-th level. Then we upsample I^k_cs to expand it to twice the size, and we denote the upsampled version of I^k_cs as Î^{k−1}_cs. We repeat this process until we go back to the full-resolution image. Note that we train each error transition network ETNet_k(·) separately due to the limitation in GPU memory. 
The independent training of each level also offers benefits in preventing overfitting [7].
The loss functions are defined based on the image structures present in each level of the pyramid as well. Thus the content and style loss functions for I^k_cs are respectively defined as:

L^k_pc = ‖Φvgg(I^k_c) − Φvgg(I^k_cs)‖² + Σ_{j=k+1}^{K} ‖Φvgg(I^j_c) − Φvgg(Ĩ^j_cs)‖²,   (7)

L^k_ps = Σ_{i=1}^{L} ‖G^i(I^k_s) − G^i(I^k_cs)‖² + Σ_{j=k+1}^{K} ‖G^L(I^j_s) − G^L(Ĩ^j_cs)‖²,   (8)

where Φvgg denotes a VGG-based encoder and, as mentioned, we set L = 4. Ĩ^j_cs is computed as (j − k) repeated applications of the downsampling operation d(·) on I^k_cs, e.g., for k = 1, Ĩ^3_cs = d(d(I^1_cs)), to capture large patterns that cannot be evaluated with Φvgg directly. Moreover, a total variation loss LTV(·) is also added to achieve piece-wise smoothness. Thus the total loss at the k-th level of the pyramid is computed as:

L^k_total = λ^k_pc L^k_pc + λ^k_ps L^k_ps + λ^k_tv LTV,   (9)

where λ^k_pc and λ^k_tv are always set to 1 and 10^{−6}, while λ^k_ps is assigned to 1, 5, 8 for k = 1, 2, 3 respectively, to preserve semantic structures and gradually add style details to outputs. We tried to use 4 or more levels in the pyramid, but found only subtle improvement in visual quality.

4 Experimental Results

We first evaluate the key ingredients of our method. Then qualitative and quantitative comparisons to several state-of-the-art methods are presented. Finally, we show some applications using our approach, demonstrating the flexibility of our framework. 
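The Gram-based losses of Eqs. (7)-(9) can be sketched in a few lines of numpy. This is a simplified illustration under stated assumptions: the cross-level terms of Eqs. (7)-(8) are omitted, features stand in for real VGG activations, and the normalization of the Gram matrix is a common convention rather than the paper's exact choice.

```python
import numpy as np

def gram(feat):
    """Gram matrix G(.) of a feature map: (H, W, C) -> (C, C), normalized."""
    h, w, c = feat.shape
    f = feat.reshape(h * w, c)
    return f.T @ f / (h * w * c)

def level_loss(feats_c, feats_s, feats_cs,
               lam_pc=1.0, lam_ps=1.0, lam_tv=1e-6, image=None):
    """Simplified per-level objective in the spirit of Eqs. (7)-(9).

    feats_*: lists of VGG-like feature maps (arrays) for the content, style
    and stylized images, ordered from shallow to deep layers.
    """
    # Content term: distance on the deepest feature (first term of Eq. (7)).
    l_pc = np.sum((feats_c[-1] - feats_cs[-1]) ** 2)
    # Style term: Gram-matrix distance over all layers (first term of Eq. (8)).
    l_ps = sum(np.sum((gram(fs) - gram(fcs)) ** 2)
               for fs, fcs in zip(feats_s, feats_cs))
    # Total variation regularizer for piece-wise smoothness.
    l_tv = 0.0
    if image is not None:
        l_tv = (np.sum(np.abs(np.diff(image, axis=0)))
                + np.sum(np.abs(np.diff(image, axis=1))))
    return lam_pc * l_pc + lam_ps * l_ps + lam_tv * l_tv
```

Note that the loss is zero when the stylized features match both targets exactly, which is the fixed point the iterative refinements move toward.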
Implementation details and more results can be found in our supplementary document, and code will be made publicly available online.

Ablation Study Our method has three key ingredients: iterative refinements, error measurement, and the joint analysis between the error features and the features of the intermediate stylization. Table 1 lists the quantitative metrics of the ablation study on these ingredients. For our full model, we can see that though the content loss increases a little, the style loss shrinks significantly with more refinements, proving the efficacy of our iterative strategy on stylization.
To evaluate the role of error computation, we replace the error features with the plain deep features from the content and style images, i.e. ΔE^i_c = Θ^i_err(Ic) and ΔE^i_s = G^i(Is), before the fusion and information propagation. From the perceptual loss shown in Table 1, the model without error information can still somewhat improve the stylization, since the plain deep features also contain error information; but compared to our full model that feeds errors explicitly, it is more difficult for the network to extract the correct residual information. Figure 4 shows a visual comparison. We can see that without error features, noise and artifacts are introduced, like the unseen stripes in the

Figure 4: Ablation study on explicit error computation. Our full model is successful in reducing artifacts and synthesizes texture patterns more faithful to the style image. Zoom in for more details.

Figure 5: Ablation study on the effect of current stylized results in computing residual images.

Table 1: Ablation study on multiple refinements, the effect of error computation and the joint analysis of intermediate stylized results in computing the residual images. All the results are averaged over 100 synthesized images with perceptual metrics. 
Note that both K and K′ represent the number of refinements, where K denotes the full model and K′ a simplified model that removes the upper encoder, i.e. it does not consider the intermediate stylization in computing residual images.

Loss | K = 1 | K = 2 | K = 3 | K = 1 (w/o err) | K = 2 (w/o err) | K = 3 (w/o err) | K′ = 1 | K′ = 2 | K′ = 3
Lc | 10.9117 | 16.8395 | 18.3771 | 10.9332 | 19.6845 | 19.4235 | 10.9424 | 19.5554 | 22.1014
Ls | 28.3117 | 12.1404 | 7.1984 | 28.3513 | 18.3374 | 15.2981 | 28.3283 | 23.6618 | 21.3698

mountain (2nd row, please zoom in to see more details). Our full model yields a much more favorable result. The synthesized patterns faithfully mimic the provided style image.
Then we evaluate the importance of features from the intermediate stylized result in computing the residual image Ih. We train a model that removes the upper encoder and directly employs the computed error features to create Ih. Specifically, we disable the non-local blocks at the bottleneck layer of our model, and the error propagation block at the i-th scale only takes ΔD^i and ΔE^{i−1}_s as inputs. From Table 1, we can see that when disabling the fusion with the features from the intermediate stylization, the final stylizations after iterative refinements are worse than our full model by a large margin. Figure 5 shows an example, where the incomplete model introduces unseen white bumps and blurs the contents more compared to our full model, demonstrating the effect of fusing the intermediate stylization into the error transition.

Qualitative Comparison Figure 6 presents the qualitative comparison to state-of-the-art methods. For the baseline methods, codes released by the authors are used with default configurations for a fair comparison. Gatys et al. 
[9] achieves arbitrary style transfer via a time-consuming optimization process, but often gets stuck in local minima, producing distorted images, and fails to capture the salient style patterns. AdaIN [13] frequently produces insufficiently stylized results due to its limitation in style

Figure 6: Comparison with results from different methods.

Figure 7: Detail cut-outs. The top row shows close-ups of highlighted areas for a better visualization. Only our result successfully captures the paint brush patterns in the style image.

representation, while WCT [21] improves the generalization ability to unseen styles but introduces unappealing distorted patterns and warps content structures. Avatar-Net [24] addresses the style distortion issue by introducing a feature decoration module, but it blurs the spatial structures and fails to capture style patterns with long-range dependency. Meanwhile, AAMS [28] generates results with worse texture consistency and introduces unseen repetitive style patterns. In contrast, better transfer results are achieved with our approach. The iterative refinement coupled with error transition shows a rather stable performance in transferring arbitrary styles. Moreover, leveraging the Laplacian pyramid further helps preserve stylization consistency across scales. The output style patterns are more faithful to the target style image, without distortion and exhibiting superior visual detail. Meanwhile, our model better respects the semantic structures in the content images, making the style patterns adapt to these structures.
In Figure 7, we show close-up views of transferred results to indicate the superiority in generating style details. The compared methods either fail to stylize local regions or miss the salient style patterns. As expected, our approach performs a better style transfer with clearer structures and good-quality details. 
Please see the brush strokes and the color distribution present in the results.

Figure 8: At the deployment stage, we can adjust the degree of stylization with parameter α.

Figure 9: A refinement for the outputs of AdaIN and WCT.

Table 2: Quantitative comparison over different models on perceptual (content & style) loss, preference score of the user study, and stylization speed. Note that all the results are averaged over 100 test images except the preference score.

Metric | AdaIN [13] | WCT [21] | Avatar-Net [24] | AAMS [28] | Ours
Content (Lc) | 11.4325 | 19.6778 | 15.5150 | 14.2689 | 18.3771
Style (Ls) | 71.5744 | 26.1967 | 42.8833 | 40.2651 | 7.1984
Preference/% | 16.1 | 26.4 | 13.0 | 11.2 | 33.3
Time/sec | 0.1484 | 0.7590 | 1.1422 | 1.3102 | 0.5680

Quantitative Comparison To quantitatively evaluate each method, we conduct a comparison regarding perceptual loss and report the results in the first two rows of Table 2. It shows that the proposed method achieves a significantly lower style loss than the baseline models, whereas its content loss is lower than WCT's but higher than those of the other three approaches. This indicates that our model is better at capturing the style patterns presented in style images while keeping good-quality content structures.
Assessing stylization results is highly subjective. Hence, a user study is further designed for the five approaches. We randomly pick 30 content images and 30 style images from the test set and generate 900 stylization results for each method. We randomly select 20 stylized images for each participant, who is asked to vote for the method achieving the best results. For each round of evaluation, the generated images are displayed in a random order. 
Finally, we collect 600 votes from 30 subjects and detail the preference percentage of each method in the 3rd row of Table 2.
In the 4th row of Table 2, we also compare with the same set of baseline methods [13, 21, 24] in terms of running time. AdaIN [13] is the most efficient model as it uses a simple feed-forward scheme. Due to the requirements for SVD and patch swapping operations, WCT and Avatar-Net are much slower. Even though the feedback mechanism makes our method slower than the fastest AdaIN algorithm, it is noticeably faster than the other three approaches.

Runtime applications Two applications are employed to reveal the flexibility of the designed model at run-time. The same trained model is used for all these tasks without any modification.
Our model is able to control the degree of stylization at running time. For each level k in the pyramid, this can be realized by interpolating between two kinds of style error features: one is computed between the intermediate output Î^k_cs and the style image I^k_s, denoted as ΔE_{c→s}; the other is attained for Î^k_cs and the content image I^k_c as ΔE_{c→c}. Thus the trade-off can be achieved as ΔE_mix = αΔE_{c→s} + (1 − α)ΔE_{c→c}, which will be fed into the decoder for a mixed effect by fusion. Figure 8 shows a smooth transition when α is changed from 0.25 to 1.
With the error-correction mechanism, the proposed model can further refine the stylized results from other existing methods. It can be seen in Figure 9 that both AdaIN and WCT fail to preserve the global color distribution and introduce unseen patterns. 
Feeding the output results of these two methods into our model, ETNet succeeds in improving the stylization level by making the color distribution more adaptive to the style image and generating more noticeable brushstroke patterns.

5 Conclusions

We present ETNet for arbitrary style transfer by introducing the concept of an error-correction mechanism. Our model decomposes the stylization task into a sequence of refinement operations. During each refinement, error features are computed and then transited across both the spatial and scale domains to compute a residual image. Meanwhile, long-range dependencies are captured to better respect the semantic relationships and facilitate texture consistency. Experiments show that our method significantly improves the stylization performance over existing methods.

Acknowledgement

We thank the anonymous reviewers for their constructive comments. This work was supported in parts by NSFC (61861130365, 61761146002, 61602461), GD Higher Education Innovation Key Program (2018KZDXM058), GD Science and Technology Program (2015A030312015), Shenzhen Innovation Program (KQJSCX20170727101233642), LHTD (20170003), and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ).

References
[1] P. Burt and E. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540, 1983.
[2] J. Cao, Y. Guo, Q. Wu, C. Shen, J. Huang, and M. Tan. Adversarial learning with local coordinate coding. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 707–715, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
[3] J. Cao, L. Mo, Y. Zhang, K. Jia, C. Shen, and M. Tan. Multi-marginal Wasserstein GAN. 
In Advances in Neural Information Processing Systems, pages 1774–1784, 2019.
[4] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4733–4742, 2016.
[5] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stylebank: An explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1897–1906, 2017.
[6] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. CoRR, abs/1612.04337, 2016.
[7] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, 2015.
[8] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. CoRR, abs/1610.07629, 2016.
[9] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] M. Haris, G. Shakhnarovich, and N. Ukita. Deep back-projection networks for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1664–1673, 2018.
[12] D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In SIGGRAPH, 1995.
[13] X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE International Conference on Computer Vision (ICCV), pages 1510–1519, 2017.
[14] P. Jiang, F. Gu, Y. Wang, C. Tu, and B. Chen.
Difnet: Semantic segmentation by diffusion networks. In Advances in Neural Information Processing Systems, pages 1630–1639, 2018.
[15] Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song. Neural style transfer: A review. arXiv preprint arXiv:1705.04058, 2017.
[16] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[17] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[18] K. Li, B. Hariharan, and J. Malik. Iterative instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3659–3667, 2016.
[19] X. Li, S. Liu, J. Kautz, and M.-H. Yang. Learning linear transformations for fast arbitrary style transfer. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[20] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 266–274, 2017.
[21] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Universal style transfer via feature transforms. In NIPS, 2017.
[22] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
[23] F. Shen, S. Yan, and G. Zeng. Meta networks for neural style transfer. CoRR, abs/1709.04111, 2017.
[24] L. Sheng, Z. Lin, J. Shao, and X. Wang. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8242–8250, 2018.
[25] Y. Tao, Q. Sun, Q. Du, and W. Liu. Nonlocal neural networks, nonlocal diffusion and nonlocal modeling. In Advances in Neural Information Processing Systems, pages 496–506, 2018.
[26] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S.
Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
[27] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[28] Y. Yao, J. Ren, X. Xie, W. Liu, Y.-J. Liu, and J. Wang. Attention-aware multi-stroke style transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[29] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In IEEE International Conference on Computer Vision, 2017.
[30] A. R. Zamir, T.-L. Wu, L. Sun, W. B. Shen, B. E. Shi, J. Malik, and S. Savarese. Feedback networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1308–1317, 2017.
[31] H. Zhang and K. J. Dana. Multi-style generative network for real-time transfer. CoRR, abs/1703.06953, 2017.
[32] R. Zhang, S. Tang, Y. Li, J. Guo, Y. Zhang, J. Li, and S. Yan. Style separation and synthesis via generative adversarial networks. In ACM Multimedia Conference, pages 183–191. ACM, 2018.
[33] Y. Zhang, Y. Zhang, and W. Cai. Separating style and content for generalized style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8447–8455, 2018.