{"title": "The Point Where Reality Meets Fantasy: Mixed Adversarial Generators for Image Splice Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 215, "page_last": 226, "abstract": "Modern photo editing tools allow creating realistic manipulated images easily. While fake images can be quickly generated, learning models for their detection is challenging due to the high variety of tampering artifacts and the lack of large labeled datasets of manipulated images. In this paper, we propose a new framework for training of discriminative segmentation model via an adversarial process. We simultaneously train four models: a generative retouching model G_R that translates manipulated image to the real image domain, a generative annotation model G_A that estimates the pixel-wise probability of image patch being either real or fake, and two discriminators D_R and D_A that qualify the output of G_R and G_A. The aim of model G_R is to maximize the probability of model G_A making a mistake. Our method extends the generative adversarial networks framework with two main contributions: (1) training of a generative model G_R against a deep semantic segmentation network G_A that learns rich scene semantics for manipulated region detection, (2) proposing per class semantic loss that facilitates semantically consistent image retouching by the G_R. We collected large-scale manipulated image dataset to train our model. The dataset includes 16k real and fake images with pixel-level annotations of manipulated areas. The dataset also provides ground truth pixel-level object annotations. We validate our approach on several modern manipulated image datasets, where quantitative results and ablations demonstrate that our method achieves and surpasses the state-of-the-art in manipulated image detection. We made our code and dataset publicly available.", "full_text": "The Point Where Reality Meets Fantasy: Mixed\nAdversarial Generators for Image Splice Detection\n\nVladimir V. Kniaz1,2, Vladimir A. Knyaz1,2\n\n1State Res. Institute of Aviation Systems (GosNIIAS)\n\n125319, 7, Victorenko str., Moscow, Russia\n\n{vl.kniaz, knyaz}@gosniias.ru\n\n2Moscow Institute of Physics and Technology (MIPT)\n\n141701, 9 Institutskiy per., Dolgoprudny, Russia\n\nFabio Remondino\n\nFondazione Bruno Kessler (FBK)\nVia Sommarive 18, Trento, Italy\n\nremondino@fbk.eu\n\nAbstract\n\nModern photo editing tools allow creating realistic manipulated images easily.\nWhile fake images can be quickly generated, learning models for their detection\nis challenging due to the high variety of tampering artifacts and the lack of large\nlabeled datasets of manipulated images. In this paper, we propose a new framework\nfor training of discriminative segmentation model via an adversarial process. We\nsimultaneously train four models: a generative retouching model GR that translates\nmanipulated image to the real image domain, a generative annotation model GA that\nestimates the pixel-wise probability of image patch being either real or fake, and two\ndiscriminators DR and DA that qualify the output of GR and GA. The aim of model\nGR is to maximize the probability of model GA making a mistake. 
Our method extends the generative adversarial networks framework with two main contributions: (1) training of a generative model G_R against a deep semantic segmentation network G_A that learns rich scene semantics for manipulated region detection, and (2) a per-class semantic loss that facilitates semantically consistent image retouching by G_R. We collected a large-scale manipulated image dataset to train our model. The dataset includes 16k real and 16k fake images with pixel-level annotations of manipulated areas. The dataset also provides ground truth pixel-level object annotations. We validate our approach on several modern manipulated image datasets, where quantitative results and ablations demonstrate that our method matches and surpasses the state of the art in manipulated image detection. We made our code and dataset publicly available at http://zefirus.org/MAG.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

While every image captured by the human eye is real, digital photos can be easily manipulated to present scenes that never existed in reality. Such a manipulated image can be generated easily by copying part of one image into another. This image manipulation is called an image splice, and it can be used maliciously to create fake news or to alter historical photos [1]. Recent research [1, 2] suggests that training a model for splice localization is more challenging than other object detection problems because the domain of manipulated images is extensive and diverse. Therefore, collecting a representative training dataset is difficult. Moreover, the forger can adapt to the detection algorithm by changing the manipulation technique. This principle is used in Generative Adversarial Networks (GANs) to train a generator network to synthesize images from noise [3], text descriptions [4], scene graphs [5], or by image-to-image translation [6, 7, 8, 9]. Fake images produced by the generator are evaluated against real images by an adversarial discriminator network that learns to classify them as 'real' or 'fake.' Even though no 'fake' images exist in the training dataset, the discriminator successfully learns to detect them during the training process. We hypothesize that adversarial training of an image-to-image translation generator against a splice localization generator can improve the splice localization accuracy.

Figure 1: Comparison against two state-of-the-art methods on our FantasticReality dataset (Section 3.3); columns show the manipulated image, the ground truth, CFA [10], LSC [1], and our results in the last column. Zoom in for details.

In this paper, we propose a Mixed Adversarial Generators (MAG) framework, in which we simultaneously train four models: a generative retoucher G_R, an adversarial generative annotator G_A, and two discriminators D_R and D_A that assess the output of G_R and G_A. The aim of our retoucher G_R is to suppress image tampering artifacts in the input image splices from the training dataset. We train our adversarial annotator G_A to predict splice localization masks in the 'retouched' images generated by the retoucher G_R. The adversarial loss provided by the annotator G_A forces the retoucher G_R to mask those particular tampering artifacts that allow G_A to detect the image splice.
Unlike other splice detection models, our annotator G_A learns to adapt to the changing tampering techniques of the retoucher G_R. Therefore, our annotator G_A receives a new sample from the manipulated image domain at every iteration. Moreover, the samples become more complex as training progresses. To further increase the splice localization rate, we train our annotator G_A to predict object classes for the input image. The resulting semantic labeling is used to provide a semantic consistency loss for the retoucher G_R. The semantic loss forces the output of the retoucher G_R to present objects of the same semantic classes as the input image.

Our adversarial generators extend the GAN framework with two key contributions: (1) training of a generative model G_R against a deep semantic segmentation network G_A that learns rich scene semantics for manipulated region detection, and (2) a per-class semantic loss that facilitates semantically consistent image retouching by G_R. Unlike the recently proposed Sem-GAN model [11], we do not use a pretrained segmentation model but train it adversarially. We perform a comprehensive evaluation of our MAG framework, where quantitative results and ablations demonstrate that our annotator G_A matches and surpasses the state of the art in splice localization on several challenging image splice datasets (see Figures 1 and 3).

We evaluate our retoucher G_R on image-to-image translation tasks to demonstrate that our MAG framework is not limited to the splice localization task. Our semantic loss function allows us to train challenging image-to-image translation tasks that are infeasible for baselines. We also introduce a new FantasticReality dataset that includes 16k image splices with pixel-level ground truth annotations of manipulated areas, and instance and class labels for ten object categories. We made our code and the dataset publicly available.

2 Related Work

Splice detection. Modern splice detection methods fall into three categories: tampering-artifact-based approaches leverage local discrepancies in image noise [12, 13, 14, 15, 16, 17, 18], compression artifacts [19, 20, 21, 22], or camera color filter array inconsistencies [10, 23, 24, 25, 26, 27, 28, 29, 30, 31] to detect tampered image regions; consistency-based methods [32, 1] compare pairs of local image patches to localize image areas where the predicted camera model [33, 32] or image metadata [1] are inconsistent with the rest of the image; deep-learning-based methods [34, 35, 1, 2, 36] detect image splice regions either by comparing local patches in a siamese network [1] or by using fully convolutional networks [2] to predict the labeling of tampered regions. While many digital image forensics datasets have been introduced recently [37, 38, 39, 40, 41], they usually include only several hundred photos and do not provide enough training data for modern methods. Related to our multi-task annotation prediction, Salloum et al. [2] have proposed multi-task training to localize tampered regions and their edges.

Image-to-image translation. Modern methods for image generation conditioned on an input image are trained in a supervised [6, 42, 43, 11, 44], unsupervised [7, 8, 9, 45, 46, 47], or mixed [48] setting. Unsupervised approaches are trained on an unpaired dataset leveraging the latent space assumption [8], the cycle consistency loss [7], or other criteria to learn a mapping from the source to the target domain.
Recent research demonstrates exciting progress in multimodal image-to-image translation [49, 42]. Related to our semantic consistency loss function are the loss functions proposed in the Sem-GAN [11] and InstaGAN [50] models. Unlike our MAG framework, the Sem-GAN model leverages a pretrained segmentation model to provide the semantic loss. Unlike the InstaGAN model [50], our retoucher generator G_R does not require instance masks as an input. Closely related to our retoucher generator G_R, Mejjati et al. [9] propose attention-guided training to perform translation only for the target object.

Most modern approaches to image-to-image translation are based on Generative Adversarial Networks [3], which can capture the sample distribution in the target domain using an adversarial game of two players. Recent research demonstrates that GANs can solve more challenging tasks than image-to-image translation. They can learn complex transforms between physically different domains, such as image-to-thermal translation [51, 52, 53, 54, 55], image-to-voxel model transformation [56, 57], and image synthesis from audio data [58]. In our MAG framework, we replace the discriminator network with an adversarial annotator generator G_A. While a discriminator predicts a scalar probability of an input image being either real or fake, our annotator generator G_A predicts a pixel-level probability map of an image patch being either authentic or spliced.

3 Mixed Adversarial Generators

Our goal is to train two generator networks adversarially: a splice retoucher G_R and a splice localization annotator G_A. We consider three domains: the input domain $\mathcal{A} \subset \mathbb{R}^{W \times H \times 3}$ of potentially manipulated images, the authentic domain $\mathcal{B} \subset \mathbb{R}^{W \times H \times 3}$ of untampered images, and the output domain $\mathcal{C} = [0, 1]^{W \times H \times (2+K)}$ of splice localization and class segmentation masks, where $K$ is the number of predicted object classes. While an image $A \in \mathcal{A}$ may be either authentic or tampered, all images $B \in \mathcal{B}$ are authentic, $\mathcal{B} \subset \mathcal{A}$. We use the assumptions made by Salloum et al. [2] as the starting point for our generator G_A. Specifically, we train our generator G_A for multi-task prediction of a splice segmentation mask $C_m \in [0, 1]^{W \times H}$, a splice edge mask $C_e \in [0, 1]^{W \times H}$, and an object class segmentation $C_s \in [0, 1]^{W \times H \times K}$. Therefore, we learn a mapping $G_A : \mathcal{A} \to \mathcal{C}$, where $A \in \mathcal{A}$ is an input, potentially manipulated image, and $C \in \mathcal{C}$ is an output tensor obtained by concatenating $C_m$, $C_e$, and $C_s$.
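To make the tensor layout of this mapping concrete, here is a minimal PyTorch-style sketch; the class, its names, and the generic backbone are our illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn

class AnnotatorGA(nn.Module):
    """Multi-task annotator G_A: maps an image A to (C_m, C_e, C_s)."""

    def __init__(self, backbone: nn.Module, num_classes: int = 10):
        super().__init__()
        # Any segmentation backbone mapping (N, 3, H, W) -> (N, 2 + K, H, W).
        self.backbone = backbone
        self.num_classes = num_classes

    def forward(self, a: torch.Tensor):
        c = torch.sigmoid(self.backbone(a))  # C in [0, 1]^{W x H x (2 + K)}
        c_m = c[:, 0:1]                      # splice localization mask C_m
        c_e = c[:, 1:2]                      # splice edge mask C_e
        c_s = c[:, 2:]                       # K-channel object class segmentation C_s
        return c_m, c_e, c_s
```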
The goal of our retoucher generator G_R is to learn a mapping from the manipulated image domain $\mathcal{A}$ to the authentic domain $\mathcal{B}$. To this end, the aim of adversarial training of G_R is to maximize the probability of the annotator G_A making a mistake in splice detection on the retouched image $\hat{B}$. We believe that having the retoucher G_R in the training loop helps our annotator G_A learn complicated splice retouching techniques. We use the attention-guided learning assumption made by Mejjati et al. [9] as the starting point for our retoucher G_R. We observe the similarity between the attention map proposed by Mejjati et al. [9] and the alpha channel used for splice generation. We hypothesize that attention-guided learning of our retoucher G_R allows us to model splice generation with layers in photo-editing applications, e.g., GIMP or Photoshop. We learn a mapping $G_R : \mathcal{A} \to (\hat{B}_{rgb}, \hat{B}_\alpha)$, where $\hat{B}_{rgb} \in \mathbb{R}^{W \times H \times 3}$ is an image with the retouched splice area and $\hat{B}_\alpha \in [0, 1]^{W \times H}$ is the attention map. We obtain the target retouched splice image $\hat{B}$ similarly to [9] by

$\hat{B} = \hat{B}_\alpha \odot \hat{B}_{rgb} + (1 - \hat{B}_\alpha) \odot A,$   (1)

where $\odot$ denotes the element-wise product. Our proposed pipeline is presented in Figure 2. We train two discriminator networks D_R and D_A to provide adversarial losses for the output of our generators G_R and G_A. The architecture and the loss function of the retoucher G_R are presented in Section 3.1, whereas the structured loss function of the annotator G_A is described in Section 3.2.
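Equation (1) is a single alpha-compositing step; the following short sketch restates it under the same tensor conventions (the function and argument names are ours):

```python
import torch

def composite_retouched(a: torch.Tensor,
                        b_rgb: torch.Tensor,
                        b_alpha: torch.Tensor) -> torch.Tensor:
    """Eq. (1): blend retouched colors into the input via the attention map.

    a       -- input, potentially spliced image A,     shape (N, 3, H, W)
    b_rgb   -- retouched color channels B_rgb,         shape (N, 3, H, W)
    b_alpha -- attention map B_alpha in [0, 1],        shape (N, 1, H, W)
    """
    # Element-wise product; the single alpha channel broadcasts over RGB.
    return b_alpha * b_rgb + (1.0 - b_alpha) * a
```

Because the blend is differentiable, the annotator's adversarial and semantic losses propagate gradients into both the retouched colors and the attention map.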
Figure 2: Our proposed pipeline. We want our annotator G_A to predict annotations correctly for three kinds of images: retouched spliced images $\hat{B} = G_R(A)$ (1), original spliced images $A \in \mathcal{A}$ from the training dataset (2), and authentic images $B \in \mathcal{B}$ (3). Our retoucher G_R learns to hide a wide range of tampering artifacts, such as modern-to-retro photo translation, blurring of tampering edges, and compensating light source inconsistencies. During training, we feed manipulated images $A$, retouched images $\hat{B}$, and authentic images $B$ to our splice localization annotator G_A. (Legend: $A$, input potentially manipulated image; $\hat{B}_{rgb}$ and $\hat{B}_\alpha$, retouched image color channels and attention map; $\hat{B}$, retouched manipulated image; $\hat{C}_m$, $\hat{C}_e$, $\hat{C}_s$, detected manipulated areas, their edges, and the output class segmentation; $B$, authentic image used for the loss function and for step 3.)

3.1 Retoucher Generator G_R

Architecture. We use the U-Net generator architecture [59] as the starting point for our retoucher G_R. While the skip connections of the U-Net generator facilitate robust learning of tampering techniques by our retoucher G_R, deconvolutional layers often introduce checkerboard artifacts in output images. Our annotator G_A quickly learns these checkerboard features to detect images produced by our retoucher G_R. To avoid such a scenario, we replaced the deconvolutional layers with an upsampling layer followed by a convolutional layer, inspired by the architecture proposed in [60]. We term the resulting architecture, which is free from checkerboard artifacts, U-Net-UC (see supplementary material, Table 1).

Loss function. Three loss functions govern the training process for our retoucher G_R: $L^{G_A}_{sem}$, $L^{G_A}_{adv}$, and $L^{D_R}_{adv}$, where the superscript indicates the network providing the loss. The aim of our semantic consistency loss function $L^{G_A}_{sem}$ is to make the classes of objects in the output image $\hat{B}$ recognizable by our annotator G_A:

$L^{G_A}_{sem}(C_s, \hat{C}_s) = \mathbb{E}_{B \sim p(B)} \left[ \| C_s - \hat{C}_s \|_1 \right],$   (2)

where $\hat{C}_s = G_A(\hat{B})_s$ is the class segmentation produced by our annotator G_A, and $C_s$ is the ground truth class segmentation. Our adversarial annotator loss $L^{G_A}_{adv}$ stimulates our retoucher G_R to mask tampering artifacts in the input spliced images. In other words, we want to maximize the probability of our annotator G_A making a mistake in the splice localization $\hat{C}_m = G_A(\hat{B})_m$:

$L^{G_A}_{adv}(C_m, \hat{C}_m) = \mathbb{E}_{B \sim p(B)} \left[ \| 0_{W,H} - \hat{C}_m \|_1 \right],$   (3)

where $0_{W,H}$ is a splice localization mask filled with zeros. Finally, we use the discriminator $D_R$'s adversarial loss function to make our image realistic globally:

$L^{D_R}_{adv}(\hat{B}) = \mathbb{E}_{B \sim p(B)} \left[ \log(1 - D_R(\hat{B})) \right].$   (4)

We obtain the final energy to be optimized by combining all losses:

$L_R(C_s, \hat{C}_s, C_m, \hat{C}_m, \hat{B}) = \lambda^{G_A}_{sem} \cdot L^{G_A}_{sem} + \lambda^{G_A}_{adv} \cdot L^{G_A}_{adv} + \lambda^{D_R}_{adv} \cdot L^{D_R}_{adv},$   (5)

where we use the loss hyper-parameters $\lambda^{G_A}_{sem} = 10$, $\lambda^{G_A}_{adv} = 10$, and $\lambda^{D_R}_{adv} = 0.25$ in our experiments.

3.2 Annotator Generator G_A

Loss function. We train our annotator G_A using a combination of our balanced L1 loss function $L_{bal}$ and an adversarial loss $L^{D_A}_{adv}$. We observe that training our annotator G_A using the L1 distance $\| C - \hat{C} \|_1$ between the ground truth and predicted annotations results in a large number of false negatives in splice localizations. We hypothesize that making the penalty for false negatives and false positives equal for each image can improve the overall splice localization score. We implement this hypothesis in our balanced loss function based on the Dice loss [61]:

$L_{bal}(C, \hat{C}) = \underbrace{\sum_{i=1}^{2+K} \frac{| C_i \cap (1 - \hat{C}_i) |}{| C_i |}}_{\text{false negatives}} + \underbrace{\sum_{i=1}^{2+K} \frac{| (1 - C_i) \cap \hat{C}_i |}{| 1 - C_i |}}_{\text{false positives}},$   (6)

where $i$ is the index of an annotation channel. Channel $C_1$ provides the splice mask annotation $C_m$, and channel $C_2$ provides the splice edge annotation $C_e$. The predicted class labels are given by $C_i$ for $i \in \{3, 4, \ldots, 2 + K\}$, where $K$ is the number of classes ($K = 10$ in our experiments). The area of predicted annotations (white area) in channel $C_i$ is given by $| C_i |$; the background area (black area) in channel $C_i$ is given by $| 1 - C_i |$.
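A minimal sketch of the balanced loss in Equation (6), assuming soft masks in [0, 1] so that intersection and area reduce to element-wise products and sums (the epsilon for numerical stability is our addition, not part of the paper):

```python
import torch

def balanced_loss(c: torch.Tensor, c_hat: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Sketch of Eq. (6): balanced false-negative / false-positive penalty per channel.

    c, c_hat: ground-truth and predicted annotations, shape (N, 2 + K, H, W),
    values in [0, 1]; channel 0 holds C_m, channel 1 holds C_e, the rest hold C_s.
    """
    spatial = (2, 3)  # reduce over pixels, separately per image and channel
    fn = (c * (1.0 - c_hat)).sum(spatial) / (c.sum(spatial) + eps)          # |C_i n (1 - C^_i)| / |C_i|
    fp = ((1.0 - c) * c_hat).sum(spatial) / ((1.0 - c).sum(spatial) + eps)  # |(1 - C_i) n C^_i| / |1 - C_i|
    return (fn + fp).sum(dim=1).mean()  # sum over the 2 + K channels, average over the batch
```

Normalizing each term by the corresponding region area makes a missed splice and a false alarm cost the same regardless of how small the tampered region is, which is the balancing effect described above.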
We want our annotator G_A to predict annotations correctly for three kinds of images: original spliced images $A \in \mathcal{A}$ from the training dataset, retouched spliced images $\hat{B} = G_R(A)$, and authentic images $B \in \mathcal{B}$. Therefore, at each iteration we evaluate the loss $L_{bal}$ on three pairs of ground truth and predicted annotations: $(C^{A'}, \hat{C}^A)$, $(C^{A'}, \hat{C}^{\hat{B}})$, and $(C^B, \hat{C}^B)$. We use the superscript to denote the corresponding color image for the annotation. Please note that both the original spliced image $A$ and the retouched spliced image $\hat{B}$ have the same annotation $C^{A'}$ with an adversarial class segmentation mask $C^{A'}_s = C^A_s \odot (1 - C^A_m)$. We want to train our annotator G_A to predict class segmentation adversarially: it must generate the correct class annotations only for authentic image areas and predict empty class annotations for manipulated regions. Specifically, we multiply our ground truth semantic segmentation $C^A_s$ by the inverted splice localization mask $(1 - C^A_m)$. The multiplication by the inverted mask leaves authentic areas untouched and removes annotations for manipulated regions.

The aim of our adversarial loss $L^{D_A}_{adv}$ is to avoid blurry output splice localization masks [6]. It is provided by a conditional discriminator D_A with the PatchGAN architecture [6]:

$L^{D_A}_{adv}(\hat{B}, \hat{C}^{\hat{B}}) = \mathbb{E}_{B \sim p(B)} \left[ \log(1 - D_A(\hat{B}, \hat{C}^{\hat{B}})) \right].$   (7)

We obtain the resulting energy to optimize by combining four loss functions:

$L_A = \lambda_{bal} \left( L_{bal}(C^{A'}, \hat{C}^A) + L_{bal}(C^{A'}, \hat{C}^{\hat{B}}) + L_{bal}(C^B, \hat{C}^B) \right) + \lambda^{D_A}_{adv} \cdot L^{D_A}_{adv}(\hat{B}, \hat{C}^{\hat{B}}),$   (8)

where we use the loss hyper-parameters $\lambda_{bal} = 1$ and $\lambda^{D_A}_{adv} = 1$ in our experiments.

3.3 FantasticReality Dataset

We collected a large-scale image tampering dataset with 16k authentic and 16k tampered images to perform extensive training and evaluation of our MAG model. Compared to previous datasets [37, 38, 39, 40], our FantasticReality dataset is more extensive in terms of scene variety and image count. To the best of our knowledge, it is the first tampering dataset that provides both tampering masks and instance and class labels for each image. For each authentic and tampered image, we manually generated instance and class segmentations for ten object classes: person, car, truck, van, bus, building, cat, dog, tram, boat. Examples from the dataset are presented in Figure 1 of the supplementary material.

4 Experiments

We perform extensive experiments to evaluate our MAG model on splice localization. We compare our model to three modern state-of-the-art deep learning splice detection frameworks: ManTra [62], LSC [1], and MFCN [2]. To be consistent with LSC, we also provide a comparison to non-deep-learning methods: NOI [18], CFA [10], and DCT [19]. ManTra-Net (ManTra) [62] is a self-supervised model that learns to classify 385 image manipulation types. Learned Self-Consistency (LSC) [1] is a self-supervised model. The Multi-Task Fully Convolutional Network (MFCN) [2] leverages a deep two-stream architecture to predict a splice mask and a splice edge mask. Noise Variance (NOI) [18] leverages wavelet analysis to detect inconsistencies in noise patterns. Color Filter Array (CFA) [10] searches for inconsistencies in the artifacts of the demosaicking algorithm to detect tampered regions. JPEG DCT [19] leverages inconsistencies of JPEG blocking artifacts to detect tampered image regions. For the LSC algorithm, we use the pretrained model provided by the authors. We implemented the MFCN model and trained it on the training split of our FantasticReality dataset. We train our MAG model on the 'Rough' split of our FantasticReality dataset for 400 epochs, using a batch size of one and an Adam solver with an initial learning rate of 2e-4.
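For orientation, the optimization setup described above can be summarized in the following simplified per-iteration skeleton; the update order and the helper functions loss_R, loss_A, and loss_D (standing for Equations (5), (8), and the discriminator objectives) are our assumptions, not the released training code:

```python
import itertools
import torch

# Four networks: retoucher G_R, annotator G_A, discriminators D_R and D_A
# (definitions omitted; loss_R / loss_A / loss_D are hypothetical helpers).
opt_g_r = torch.optim.Adam(G_R.parameters(), lr=2e-4)
opt_g_a = torch.optim.Adam(G_A.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(itertools.chain(D_R.parameters(), D_A.parameters()), lr=2e-4)

for epoch in range(400):                        # 400 epochs, batch size of one
    for a, b, annotations in loader:            # spliced image A, authentic image B, ground truth
        b_hat = composite_retouched(a, *G_R(a)) # retouched splice via Eq. (1)

        opt_g_r.zero_grad()                     # 1) retoucher update with L_R, Eq. (5)
        loss_R(G_A, D_R, annotations, b_hat).backward()
        opt_g_r.step()

        opt_g_a.zero_grad()                     # 2) annotator update with L_A, Eq. (8)
        loss_A(G_A, D_A, a, b_hat.detach(), b, annotations).backward()
        opt_g_a.step()

        opt_d.zero_grad()                       # 3) discriminator update on real vs. generated samples
        loss_D(D_R, D_A, a, b, b_hat.detach()).backward()
        opt_d.step()
```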
We perform the evaluation on five manipulated image datasets: CASIA v2.0 [37], Carvalho [38], Columbia [39], Realistic Tampering [40], and our FantasticReality dataset. For a fair evaluation, we downscale all images to match the 512 x 512 input size of our annotator generator G_A, and we use the downscaled images to evaluate all baselines and our framework. If two images are used for splice generation, the choice of 'authentic' and 'tampered' regions is ambiguous. To avoid ambiguity, we follow the method proposed in [1]. Namely, we compare the areas of the 'background' image and the 'pasted' image and define the smaller region as tampered. If the regions are equal, we calculate the mAP score for both the original tampering mask and the inverted mask and use the higher score, terming it permuted mAP (p-mAP), similar to [1]; a short sketch of this scoring rule follows Table 1. For additional details on the evaluation protocol, please refer to the supplementary material. Furthermore, we perform ablation studies to demonstrate the influence of each component of our framework on the resulting performance.

4.1 Annotator Generator G_A Evaluation

Splice Localization. We evaluate our model and the baselines on the task of splice localization using ground-truth masks of spliced regions. Specifically, we want our model to predict a per-pixel probability of an image patch being tampered. We present results in terms of mAP, permuted mAP [1], and per-class Intersection over Union (cIOU) in Table 1 and in Figure 3. Our MAG model achieves state-of-the-art splice localization on all datasets. The LSC model fails to detect splices when the authentic and spliced regions originate from the same camera model and share similar camera metadata.

Figure 3: Comparison against the state-of-the-art methods on image splices from the CASIA v2.0, Carvalho, and Realistic Tampering datasets (rows: input, ground truth, LSC [1], NOI [18], DCT [19], and ours in the last row). Zoom in for details.

Table 1: Splice Localization: We evaluate our model on 5 datasets using mean average precision over pixels (mAP, permuted mAP) and per-class IOU (cIOU). Each cell lists mAP / p-mAP / cIOU.

Method      | RT [40]          | Carvalho [38]    | CASIA v2.0 [37]  | Columbia [39]    | FantasticReality
DCT [19]    | 0.36 / 0.32 / 0.41 | 0.41 / 0.33 / 0.47 | 0.52 / 0.15 / 0.33 | 0.24 / 0.17 / 0.45 | 0.47 / 0.25 / 0.44
CFA [10]    | 0.48 / 0.37 / 0.40 | 0.44 / 0.45 / 0.48 | 0.49 / 0.32 / 0.32 | 0.33 / 0.45 / 0.50 | 0.42 / 0.39 / 0.44
NOI [18]    | 0.29 / 0.29 / 0.45 | 0.40 / 0.48 / 0.45 | 0.50 / 0.21 / 0.31 | 0.21 / 0.18 / 0.49 | 0.46 / 0.26 / 0.43
LSC [1]     | 0.41 / 0.35 / 0.39 | 0.43 / 0.37 / 0.49 | 0.48 / 0.16 / 0.37 | 0.25 / 0.26 / 0.51 | 0.35 / 0.22 / 0.42
MFCN [2]    | 0.46 / 0.36 / 0.41 | 0.42 / 0.41 / 0.51 | 0.36 / 0.42 / 0.36 | 0.37 / 0.40 / 0.51 | 0.48 / 0.27 / 0.45
ManTra [62] | 0.73 / 0.40 / 0.40 | 0.58 / 0.50 / 0.50 | 0.54 / 0.33 / 0.33 | 0.38 / 0.57 / 0.57 | 0.45 / 0.48 / 0.48
No GR       | 0.41 / 0.35 / 0.36 | 0.29 / 0.28 / 0.35 | 0.18 / 0.20 / 0.37 | 0.29 / 0.27 / 0.45 | 0.12 / 0.34 / 0.32
Single-task | 0.21 / 0.12 / 0.15 | 0.21 / 0.17 / 0.16 | 0.14 / 0.19 / 0.15 | 0.18 / 0.12 / 0.17 | 0.24 / 0.11 / 0.19
Ours        | 0.74 / 0.74 / 0.76 | 0.77 / 0.50 / 0.51 | 0.55 / 0.48 / 0.48 | 0.56 / 0.61 / 0.61 | 0.76 / 0.69 / 0.69
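The permuted scoring rule can be sketched in a few lines; we use average precision from scikit-learn as a stand-in for the per-image mAP computation, and the function name is ours:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def permuted_ap(gt_mask: np.ndarray, prob: np.ndarray) -> float:
    """p-mAP for one image: score the tampering mask and its inverse, keep the higher.

    gt_mask: binary ground-truth tampering mask, shape (H, W)
    prob:    predicted per-pixel tampering probability, shape (H, W)
    """
    ap = average_precision_score(gt_mask.ravel(), prob.ravel())
    # Swapping the 'authentic' and 'tampered' roles inverts both the labels
    # and the predicted probability of the positive class.
    ap_inverted = average_precision_score(1 - gt_mask.ravel(), 1.0 - prob.ravel())
    return max(ap, ap_inverted)
```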
Ablation Study. We evaluate the necessity of all components of our MAG framework by comparing the splice localization accuracy of several ablated versions of our model, presented in Table 1 and Figure 4. First, we evaluate the performance of the annotator G_A trained without the retoucher G_R (No GR). Both qualitative and quantitative results demonstrate that the competition of the two generators is the critical component of our MAG framework. Second, we evaluate our framework trained for the single task of predicting splice area annotations (Single-task). The results show that multi-task training outperforms the single-task version of our model.

Figure 4: Qualitative results for ablated versions of our MAG framework evaluated on the Realistic Tampering dataset (columns: manipulated image, ground truth, No GR, Single-task, original).

4.2 Semantic-guided Retoucher Generator G_R Evaluation

The examples in Figures 5 and 6 demonstrate how our retoucher G_R gradually removes the tampering artifacts in the input splice $A$ as training progresses. While other deep learning splice detection methods receive both realistic and rough splices from the first training epoch, our annotator G_A sees only rough splices at the first epoch. With increasing epochs, the retoucher G_R produces more complicated splices, which allows G_A to focus attention on the sophisticated tampering techniques that could appear in real splices. We believe that this is the main reason why our MAG framework achieves state-of-the-art results and outperforms other deep learning methods.

Figure 5: Performance of the retoucher G_R on rough and realistic splices (columns: $A$, $\hat{B} = G_R(A)$, GT, $\hat{C} = G_A(\hat{B})$; rows: rough, realistic).

Figure 6: Adaptation of the annotator G_A over time (columns: $\hat{B}$ and $\hat{C}$ at epoch 100 and at epoch 300).

5 Conclusion

We showed how adversarial training with a learning retoucher generator in the loop can help a splice localization model learn a wide range of image manipulations. Our mixed adversarial generators extend the generative adversarial networks framework by replacing a scalar-valued fake-prediction discriminator with a pixel-level fake region annotator. The proposed retoucher generator is trained simultaneously with an annotator generator while trying to maximize the probability that the annotator makes a mistake. Such adversarial training improves the annotator's splice localization rate, as the annotator observes changing image manipulation techniques throughout the training process. Furthermore, the competition of the two generators allows the retoucher generator to achieve state-of-the-art performance in image-to-image translation tasks. Our main observation is that semantic-guided training allows our splice localization annotator to reason explicitly about splices and their semantic consistency, and to match and surpass the state-of-the-art methods in splice localization on several challenging datasets.

Acknowledgments

The reported study was funded by the Russian Science Foundation (RSF) according to research project No. 19-11-11008 and by the Russian Foundation for Basic Research (RFBR) according to research project No. 17-29-04509. We want to thank the Belgian Surrealist artist René Magritte for teaching us through his art how to find the point where fantasy meets reality.

References

[1] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A. Efros. Fighting fake news: Image splice detection via learned self-consistency. In The European Conference on Computer Vision (ECCV), September 2018.
[2] Ronald Salloum, Yuzhuo Ren, and C.-C. Jay Kuo. Image splicing localization using a multi-task fully convolutional network (MFCN). Journal of Visual Communication and Image Representation, 51:201-209, February 2018.
[3] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, 2014.
[4] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
[5] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[6] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967-5976. IEEE, 2017.
[7] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[8] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems 30, pages 700-708. Curran Associates, Inc., 2017.
[9] Youssef Alami Mejjati, Christian Richardt, James Tompkin, Darren Cosker, and Kwang In Kim. Unsupervised attention-guided image-to-image translation. In Advances in Neural Information Processing Systems 31, pages 3693-3703. Curran Associates, Inc., 2018.
[10] P. Ferrara, T. Bianchi, A. De Rosa, and A. Piva. Image forgery localization via fine-grained analysis of CFA artifacts. IEEE Transactions on Information Forensics and Security, 7(5):1566-1577, October 2012.
[11] Anoop Cherian and Alan Sullivan. Sem-GAN: Semantically-consistent image-to-image translation. In IEEE Winter Conference on Applications of Computer Vision (WACV 2019), Waikoloa Village, HI, USA, January 7-11, 2019, pages 1797-1806, 2019.
[12] Bo Liu and Chi-Man Pun. Splicing forgery exposure in digital image by detecting noise discrepancies. International Journal of Computer and Communication Engineering, 4(1):33-38, 2015.
[13] Bo Liu, Chi-Man Pun, and Xiao-Chen Yuan. Digital image forgery detection using JPEG features and local noise discrepancies. The Scientific World Journal, 2014(6):1-12, 2014.
[14] Chi-Man Pun, Bo Liu, and Xiao-Chen Yuan. Multi-scale noise estimation for image splicing forgery detection. Journal of Visual Communication and Image Representation, 38(C):195-206, July 2016.
[15] Thibaut Julliand, Vincent Nozick, and Hugues Talbot. Image noise and digital image forensics. In Digital-Forensics and Watermarking, pages 3-17, Cham, 2016. Springer International Publishing.
[16] Wu-Chih Hu, Wei-Hao Chen, Deng-Yuan Huang, and Ching-Yu Yang. Novel detection of image forgery for exchanged foreground and background using image watermarking based on alpha matte. In 2012 Sixth International Conference on Genetic and Evolutionary Computing (ICGEC 2012), Kitakyushu, Japan, August 25-28, 2012, pages 245-248, 2012.
[17] Miroslav Goljan, Jessica J. Fridrich, and Rémi Cogranne. Rich model for steganalysis of color images. In 2014 IEEE International Workshop on Information Forensics and Security (WIFS 2014), Atlanta, GA, USA, December 3-5, 2014, pages 185-190, 2014.
[18] Babak Mahdian and Stanislav Saic. Using noise inconsistencies for blind image forensics. Image and Vision Computing, 27(10):1497-1503, 2009. Special Section: Computer Vision Methods for Ambient Intelligence.
[19] S. Ye, Q. Sun, and E. Chang. Detecting digital image forgeries by measuring inconsistencies of blocking artifact. In 2007 IEEE International Conference on Multimedia and Expo, pages 12-15, July 2007.
[20] Y. Su, J. Zhang, and J. Liu. Exposing digital video forgery by detecting motion-compensated edge artifact. In 2009 International Conference on Computational Intelligence and Software Engineering, pages 1-4, December 2009.
[21] Wu-Chih Hu and Wei-Hao Chen. Effective forgery detection using DCT+SVD-based watermarking for region of interest in key frames of vision-based surveillance. IJCSE, 8(4):297-305, 2013.
[22] Ashima Gupta, Nisheeth Saxena, and Sunil Kumar. Detecting copy move forgery using DCT. International Journal of Scientific and Research Publications, 3:2250-3153, May 2013.
[23] Harpreet Kaur, Jyoti Saxena, and Sukhjinder Singh. Simulative comparison of copy-move forgery detection methods for digital images. International Journal of Electronics, Electrical and Computational System, 4:62-66, 2015.
[24] Bo Liu and Chi-Man Pun. HSV based image forgery detection for copy-move attack. In Mechatronics Engineering, Computing and Information Technology, pages 2825-2828. Trans Tech Publications Ltd, July 2014.
[25] Feng Zeng, Wei Wang, Min Tang, and Zhanghua Cao. Exposing blurred image forgeries through blind image restoration. In 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2015), Krakow, Poland, November 4-6, 2015, pages 466-469, 2015.
[26] Wu-Chih Hu, Wei-Hao Chen, Deng-Yuan Huang, and Ching-Yu Yang. Effective image forgery detection of tampered foreground or background image based on image watermarking and alpha mattes. Multimedia Tools and Applications, 75(6):3495-3516, March 2016.
[27] Xin Wang, Bo Xuan, and Si-long Peng. Digital image forgery detection based on the consistency of defocus blur. In 2008 Fourth International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), pages 192-195. IEEE, July 2008.
[28] Aniket Roy, Rahul Dixit, Ruchira Naskar, and Rajat Subhra Chakraborty. Copy-move forgery detection with similar but genuine objects. In Digital Image Forensics: Theory and Implementation, pages 65-77. Springer Singapore, Singapore, 2020.
[29] Irene Amerini, Lamberto Ballan, Roberto Caldelli, Alberto Del Bimbo, Luca Del Tongo, and Giuseppe Serra. Copy-move forgery detection and localization by means of robust clustering with J-Linkage. Signal Processing: Image Communication, 28(6):659-669, July 2013.
[30] Longyin Wen, Honggang Qi, and Siwei Lyu. Contrast enhancement estimation for digital image forensics. ACM Transactions on Multimedia Computing, Communications and Applications, 14(2):1-21, May 2018.
[31] I-Cheng Chang, J. Cloud Yu, and Chih-Chuan Chang. A forgery detection algorithm for exemplar-based inpainting images using multi-region relation. Image and Vision Computing, 31(1):57-71, January 2013.
[32] L. Bondi, S. Lameri, D. Güera, P. Bestagini, E. J. Delp, and S. Tubaro. Tampering detection and localization through clustering of camera-based CNN features. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1855-1864, July 2017.
[33] L. Bondi, L. Baroffio, D. Güera, P. Bestagini, E. J. Delp, and S. Tubaro. First steps toward camera model identification with convolutional neural networks. IEEE Signal Processing Letters, 24(3):259-263, March 2017.
[34] J. H. Bappy, A. K. Roy-Chowdhury, J. Bunk, L. Nataraj, and B. S. Manjunath. Exploiting spatial structure for localizing manipulated image regions. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4980-4989, October 2017.
[35] Belhassen Bayar and Matthew C. Stamm. A deep learning approach to universal image manipulation detection using a new convolutional layer. In Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec 2016), Vigo, Galicia, Spain, June 20-22, 2016, pages 5-10, 2016.
[36] Peng Zhou, Xintong Han, Vlad I. Morariu, and Larry S. Davis. Learning rich features for image manipulation detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[37] Jing Dong, Wei Wang, and Tieniu Tan. CASIA image tampering detection evaluation database. In 2013 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP 2013), Beijing, China, July 6-10, 2013, pages 422-426, 2013.
[38] T. J. d. Carvalho, C. Riess, E. Angelopoulou, H. Pedrini, and A. d. R. Rocha. Exposing digital image forgeries by illumination color classification. IEEE Transactions on Information Forensics and Security, 8(7):1182-1194, July 2013.
[39] Tian-Tsong Ng and Shih-Fu Chang. A data set of authentic and spliced image blocks. Technical report, Columbia University, June 2004.
[40] P. Korus and J. Huang. Multi-scale analysis strategies in PRNU-based tampering localization. IEEE Transactions on Information Forensics & Security, 2017.
[41] B. Wen, Y. Zhu, R. Subramanian, T. Ng, X. Shen, and S. Winkler. COVERAGE: a novel database for copy-move forgery detection. In 2016 IEEE International Conference on Image Processing (ICIP), pages 161-165. IEEE, September 2016.
[42] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems 30, pages 465-476. Curran Associates, Inc., 2017.
[43] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[44] Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. Scribbler: Controlling deep image synthesis with sketch and color. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[45] Amelie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar Mosseri, Forrester Cole, and Kevin Murphy. XGAN: Unsupervised image-to-image translation for many-to-many mappings, 2018.
[46] Matthew Amodio and Smita Krishnaswamy. TraVeLGAN: Image-to-image translation by transformation vector learning. CoRR, abs/1902.09631, 2019.
[47] Wayne Wu, Kaidi Cao, Cheng Li, Chen Qian, and Chen Change Loy. TransGaGa: Geometry-aware unsupervised image-to-image translation. CoRR, abs/1904.09571, 2019.
[48] Soumya Tripathy, Juho Kannala, and Esa Rahtu. Learning image-to-image translation using paired and unpaired training samples. arXiv preprint arXiv:1805.03189, 2018.
[49] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
[50] Sangwoo Mo, Minsu Cho, and Jinwoo Shin. InstaGAN: Instance-aware image-to-image translation. In International Conference on Learning Representations, 2019.
[51] He Zhang, Vishal M. Patel, Benjamin S. Riggan, and Shuowen Hu. Generative adversarial network-based synthesis of visible faces from polarimetric thermal faces. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pages 100-107. IEEE, 2017.
[52] Teng Zhang, Arnold Wiliem, Siqi Yang, and Brian C. Lovell. TV-GAN: Generative adversarial network based thermal to visible face recognition. December 2017.
[53] Vladimir V. Kniaz, Vladimir A. Knyaz, Jiří Hladůvka, Walter G. Kropatsch, and Vladimir Mizginov. ThermalGAN: Multimodal color-to-thermal image translation for person re-identification in multispectral dataset. In Computer Vision - ECCV 2018 Workshops, pages 606-624, Cham, 2019. Springer International Publishing.
[54] V. V. Kniaz and A. N. Bordodymov. Long wave infrared image colorization for person re-identification. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLII-2/W12:111-116, 2019.
[55] Vladimir V. Kniaz and Vladimir A. Knyaz. Chapter 6 - Multispectral person re-identification using GAN for color-to-thermal image translation. In Multimodal Scene Understanding, pages 135-158. Academic Press, 2019.
[56] Vladimir A. Knyaz, Vladimir V. Kniaz, and Fabio Remondino. Image-to-voxel model translation with conditional adversarial networks. In Computer Vision - ECCV 2018 Workshops, pages 601-618, Cham, 2019. Springer International Publishing.
[57] V. V. Kniaz, F. Remondino, and V. A. Knyaz. Generative adversarial networks for single photo 3D reconstruction. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLII-2/W9:403-408, 2019.
[58] Chia-Hung Wan, Shun-Po Chuang, and Hung-yi Lee. Towards audio to scene image synthesis using generative adversarial network. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), Brighton, United Kingdom, May 12-17, 2019, pages 496-500, 2019.
[59] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. Springer International Publishing, Cham, 2015.
[60] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
[61] W. R. Crum, O. Camara, and D. L. G. Hill. Generalized overlap measures for evaluation and validation in medical image analysis. IEEE Transactions on Medical Imaging, 25(11):1451-1461, November 2006.
[62] Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. ManTra-Net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
", "award": [], "sourceid": 93, "authors": [{"given_name": "Vladimir", "family_name": "Kniaz", "institution": "IEEE"}, {"given_name": "Vladimir", "family_name": "Knyaz", "institution": "State Research Institute of Aviation Systems"}, {"given_name": "Fabio", "family_name": "Remondino", "institution": "\"Fondazione Bruno Kessler, Italy\""}]}