{"title": "Constructing Unrestricted Adversarial Examples with Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 8312, "page_last": 8323, "abstract": "Adversarial examples are typically constructed by perturbing an existing data point within a small norm ball, and current defense methods are focused on guarding against this type of attack. In this paper, we propose a new class of adversarial examples that are synthesized entirely from scratch using a conditional generative model, without being restricted to norm-bounded perturbations. We first train an Auxiliary Classifier Generative Adversarial Network (AC-GAN) to model the class-conditional distribution over data samples. Then, conditioned on a desired class, we search over the AC-GAN latent space to find images that are likely under the generative model and are misclassified by a target classifier. We demonstrate through human evaluation that these new adversarial images, which we call unrestricted adversarial examples, are legitimate and belong to the desired class. Our empirical results on the MNIST, SVHN, and CelebA datasets show that unrestricted adversarial examples can bypass strong adversarial training and certified defense methods designed for traditional adversarial attacks.", "full_text": "Constructing Unrestricted Adversarial Examples\n\nwith Generative Models\n\nYang Song\n\nStanford University\n\nyangsong@cs.stanford.edu\n\nRui Shu\n\nStanford University\n\nruishu@cs.stanford.edu\n\nNate Kushman\n\nMicrosoft Research\n\nnkushman@microsoft.com\n\nStefano Ermon\n\nStanford University\n\nermon@cs.stanford.edu\n\nAbstract\n\nAdversarial examples are typically constructed by perturbing an existing data point within a small norm ball, and current defense methods are focused on guarding against this type of attack. 
In this paper, we propose unrestricted adversarial examples, a new threat model in which attackers are not restricted to small norm-bounded perturbations. Different from perturbation-based attacks, we propose to synthesize unrestricted adversarial examples entirely from scratch using conditional generative models. Specifically, we first train an Auxiliary Classifier Generative Adversarial Network (AC-GAN) to model the class-conditional distribution over data samples. Then, conditioned on a desired class, we search over the AC-GAN latent space to find images that are likely under the generative model and are misclassified by a target classifier. We demonstrate through human evaluation that unrestricted adversarial examples generated this way are legitimate and belong to the desired class. Our empirical results on the MNIST, SVHN, and CelebA datasets show that unrestricted adversarial examples can bypass strong adversarial training and certified defense methods designed for traditional adversarial attacks.\n\n1 Introduction\n\nMachine learning algorithms are known to be susceptible to adversarial examples: imperceptible perturbations to samples from the dataset can mislead cutting-edge classifiers [1, 2]. This has raised concerns for safety-critical AI applications because, for example, attackers could use them to mislead autonomous driving vehicles [3, 4, 5] or hijack voice-controlled intelligent agents [6, 7, 8].\nTo mitigate the threat of adversarial examples, a large number of methods have been developed. These include augmenting training data with adversarial examples [2, 9, 10, 11], removing adversarial perturbations [12, 13, 14], and encouraging smoothness in the classifier [15]. Recently, [16, 17] proposed theoretically-certified defenses based on minimizing upper bounds of the training loss under worst-case perturbations. 
Although inspired by different perspectives, a shared design principle of current defense methods is to make classifiers more robust to small perturbations of their inputs.\nIn this paper, we introduce a more general attack mechanism where adversarial examples are constructed entirely from scratch instead of by perturbing an existing data point by a small amount. In practice, an attacker might want to change an input significantly while not changing its semantics. Taking traffic signs as an example, an adversary performing perturbation-based attacks can draw graffiti [4] or place stickers [18] on an existing stop sign in order to exploit a classifier. However, the attacker might want to go beyond this and replace the original stop sign with a new one that was specifically manufactured to be adversarial. In the latter case, the new stop sign does not have to be a close replica of the original one (the font could be different, the size could be smaller) as long as it is still identified as a stop sign by humans. We argue that all inputs that fool the classifier without confusing humans can pose potential security threats. In particular, we show that previous defense methods, including the certified ones [16, 17], are not effective against this more general attack. Ultimately, we hope that identifying and building defenses against such new vulnerabilities can shed light on the weaknesses of existing classifiers and enable progress towards more robust methods.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nGenerating this new kind of adversarial example, however, is challenging. It is clear that adding small noise is a valid mechanism for generating new images from a desired class: the label should not change if the perturbation is small enough. How can we generate completely new images from a given class? 
In this paper, we leverage recent advances in generative modeling [19, 20, 21]. Specifically, we train an Auxiliary Classifier Generative Adversarial Network (AC-GAN [20]) to model the set of legitimate images for each class. Conditioned on a desired class, we can search over the latent code of the generative model to find examples that are misclassified by the model under attack, even when protected by the most robust defense methods available. The images that successfully fool the classifier without confusing humans (verified via Amazon Mechanical Turk [22]) are referred to as Unrestricted Adversarial Examples1. The efficacy of our attack method is demonstrated on the MNIST [25], SVHN [26], and CelebA [27] datasets, where our attacks uniformly achieve success rates above 84%. In addition, our unrestricted adversarial examples show moderate transferability to other architectures, reducing the accuracy of a black-box certified classifier (i.e., a certified classifier with an architecture unknown to our method) [17] by 35.2%.\n\n2 Background\n\nIn this section, we review recent work on adversarial examples, defense methods, and conditional generative models. Although adversarial examples can be crafted for many domains, we focus on image classification tasks in the rest of this paper, and will use the words "examples" and "images" interchangeably.\nAdversarial examples  Let x ∈ R^m denote an input image to a classifier f : R^m → {1, 2, ..., k}, and assume the attacker has full knowledge of f (a.k.a. the white-box setting). [1] discovered that it is possible to find a slightly different image x′ ∈ R^m such that ‖x′ − x‖ ≤ ε but f(x′) ≠ f(x), by solving a surrogate optimization problem with L-BFGS [28]. Different norms ‖·‖ have been used, such as l∞ ([9]), l2 ([29]) or l0 ([30]). 
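As a concrete illustration of the norm-bounded attacks just described, the following is a minimal sketch of a projected gradient-sign attack on a toy linear classifier. The model, dimensions, and values here are illustrative assumptions only, not the trained networks attacked in this paper:

```python
import numpy as np

# Toy sketch (not the paper's method): a norm-bounded attack on a linear
# "classifier" with score s(x) = w.x + b, in the spirit of the
# surrogate-optimization attacks described above.

rng = np.random.default_rng(0)

def predict(w, b, x):
    """Binary decision from the linear score s(x) = w.x + b."""
    return int(w @ x + b > 0)

def linf_attack(w, b, x, eps, steps=10):
    """Gradient-sign steps on the score, projected back onto the
    l_inf ball of radius eps around x (a minimal PGD-like loop)."""
    x_adv = x.copy()
    sign = -1.0 if predict(w, b, x) == 1 else 1.0  # push score across 0
    for _ in range(steps):
        x_adv = x_adv + sign * (eps / steps) * np.sign(w)
        x_adv = np.clip(x_adv, x - eps, x + eps)   # stay in the norm ball
    return x_adv

n = 100
w = rng.normal(size=n)
b = 0.0
x = np.sign(w) * 0.05          # a point classified as 1 with a small margin
eps = 0.1

x_adv = linf_attack(w, b, x, eps)
assert np.max(np.abs(x_adv - x)) <= eps + 1e-9     # perturbation is bounded
assert predict(w, b, x) == 1 and predict(w, b, x_adv) == 0  # label flips
```

The `np.clip` call is what makes the search "projected": every iterate is mapped back onto the l∞ ball of radius ε around the original input.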
Similar optimization-based methods are also proposed in [31, 32]. In [2], the authors observe that f(x) is approximately linear and propose the Fast Gradient Sign Method (FGSM), which applies a first-order approximation of f(x) to speed up the generation of adversarial examples. This procedure can be repeated several times to give a stronger attack named Projected Gradient Descent (PGD [10]).\nDefense methods  Existing defense methods typically try to make classifiers more robust to small perturbations of the image. There has been an "arms race" between increasingly sophisticated attack and defense methods. As indicated in [33], the strongest defenses to date are adversarial training [10] and certified defenses [16, 17]. In this paper, we focus our investigation of unrestricted adversarial attacks on these defenses.\nGenerative adversarial networks (GANs)  A GAN [19, 20, 34, 35, 36] is composed of a generator gθ(z) and a discriminator dφ(x). The generator maps a source of noise z ∼ Pz to a synthetic image x = gθ(z). The discriminator receives an image x and produces a value dφ(x) to distinguish whether it is sampled from the true image distribution Px or generated by gθ(z). The goal of GAN training is to learn a discriminator that reliably distinguishes between fake and real images, and to use this discriminator to train a good generator by trying to fool the discriminator. To stabilize training, we use the Wasserstein GAN [34] formulation with gradient penalty [21], which solves the following optimization problem\n\nmin_θ max_φ  E_{z∼Pz}[dφ(gθ(z))] − E_{x∼Px}[dφ(x)] + λ E_{x̃∼Px̃}[(‖∇x̃ dφ(x̃)‖₂ − 1)²],\n\n1In previous drafts we called it Generative Adversarial Examples. We switched the name to emphasize the difference from [23]. 
Concurrently, the same name was also used in [24] to refer to adversarial examples beyond small perturbations.\n\nwhere Px̃ is the distribution obtained by sampling uniformly along straight lines between pairs of samples from Px and generated images from gθ(z).\nTo generate adversarial examples with the intended semantic information, we also need to control the labels of the generated images. One popular method of incorporating label information into the generator is the Auxiliary Classifier GAN (AC-GAN [20]), where the conditional generator gθ(z, y) takes a label y as input and an auxiliary classifier cψ(x) is introduced to predict the labels of both training and generated images. Let cψ(y | x) be the confidence of predicting label y for an input image x. The optimization objectives for the generator and the discriminator, respectively, are:\n\nmin_θ  E_{z∼Pz, y∼Py}[−dφ(gθ(z, y)) − log cψ(y | gθ(z, y))]\nmin_{φ,ψ}  E_{z∼Pz, y∼Py}[dφ(gθ(z, y))] − E_{x∼Px}[dφ(x)] − E_{x∼Px, y∼Py|x}[log cψ(y | x)] + λ E_{x̃∼Px̃}[(‖∇x̃ dφ(x̃)‖₂ − 1)²],\n\nwhere dφ(·) is the discriminator, Py represents a uniform distribution over all labels {1, 2, ..., k}, Py|x denotes the ground-truth distribution of y given x, and Pz is chosen to be N(0, 1) in our experiments.\n\n3 Methods\n\n3.1 Unrestricted adversarial examples\nWe start this section by formally characterizing perturbation-based and unrestricted adversarial examples. Let I be the set of all digital images under consideration. Suppose o : O ⊂ I → {1, 2, ..., K} is an oracle that takes an image in its domain O and outputs one of K labels. In addition, we consider a classifier f : I → {1, 2, ..., K} that can give a prediction for any image in I, and assume f ≠ o. 
Equipped with those notations, we can provide the definitions used in this paper:\nDefinition 1 (Perturbation-Based Adversarial Examples). Given a subset of (test) images T ⊂ O, a small constant ε > 0, and a norm ‖·‖, a perturbation-based adversarial example is defined to be any image in Ap ≜ {x ∈ O | ∃x′ ∈ T, ‖x − x′‖ ≤ ε ∧ f(x′) = o(x′) = o(x) ≠ f(x)}.\nIn other words, traditional adversarial examples are based on perturbing a correctly classified image in T so that f gives an incorrect prediction, according to the oracle o.\nDefinition 2 (Unrestricted Adversarial Examples). An unrestricted adversarial example is any image that is an element of Au ≜ {x ∈ O | o(x) ≠ f(x)}.\nIn most previous work on perturbation-based adversarial examples, the oracle o is implicitly defined as a black box that gives ground-truth predictions (consistent with human judgments), T is chosen to be the test dataset, and ‖·‖ is usually one of the l∞ ([9]), l2 ([29]) or l0 ([30]) norms. Since o corresponds to human evaluation, O should represent all images that look realistic to humans, including those with small perturbations. Past work assumed o(x) was known by restricting x to be close to another image x′ which came from a labeled dataset. Our work removes this restriction by using a high-quality generative model which can generate samples that, with high probability, humans will label with a given class. From the definitions it is clear that Ap ⊂ Au, which means our proposed unrestricted adversarial examples are a strict generalization of traditional perturbation-based adversarial examples, where we remove the small-norm constraints.\n\n3.2 Practical unrestricted adversarial attacks\nThe key to practically producing unrestricted adversarial examples is to model the set of legitimate images O. 
We do so by training a generative model g(z, y) to map a random variable z ∈ R^m and a label y ∈ {1, 2, ..., K} to a legitimate image x = g(z, y) ∈ O satisfying o(x) = y. If the generative model is ideal, we will have O ≅ {g(z, y) | y ∈ {1, 2, ..., K}, z ∈ R^m}. Given such a model we can in principle enumerate all unrestricted adversarial examples for a given classifier f, by finding all z and y such that f(g(z, y)) ≠ y.\nIn practice, we can exploit different approximations of the ideal generative model to produce different kinds of unrestricted adversarial examples. Because of its reliable conditioning and high fidelity image generation, we choose AC-GAN [20] as our basic class-conditional generative model. In what follows, we explore two attacks derived from variants of AC-GAN (see pseudocode in Appendix B).\nBasic attack  Let gθ(z, y), cψ(x) be the generator and auxiliary classifier of AC-GAN, and let f(x) denote the classifier that we wish to attack. We focus on targeted unrestricted adversarial attacks, where the attacker tries to generate an image x so that o(x) = ysource but f(x) = ytarget. In order to produce unrestricted adversarial examples, we propose finding the appropriate z by minimizing a loss function L that is carefully designed to produce high fidelity unrestricted adversarial examples. We decompose the loss as L = L0 + λ1 L1 + λ2 L2, where λ1, λ2 are positive hyperparameters for weighting the different terms. The first component\n\nL0 ≜ −log f(ytarget | gθ(z, ysource))  (1)\n\nencourages f to predict ytarget, where f(y | x) denotes the confidence of predicting label y for input x. The second component\n\nL1 ≜ (1/m) Σ_{i=1}^{m} max{|z_i − z0_i| − ε, 0}  (2)\n\nsoft-constrains the search region of z so that it is close to a randomly sampled noise vector z0. Here z = (z_1, z_2, ..., z_m), {z0_1, z0_2, ..., z0_m} i.i.d. ∼ N(0, 1), and ε is a small positive constant. For a good generative model, we expect that x0 = gθ(z0, ysource) is diverse for randomly sampled z0 and o(x0) = ysource holds with high probability. Therefore, by reducing the distance between z and z0, L1 has the effect of generating more diverse adversarial examples from class ysource. Without this constraint, the optimization may always converge to the same example for each class. Finally,\n\nL2 ≜ −log cψ(ysource | gθ(z, ysource))  (3)\n\nencourages the auxiliary classifier cψ to give correct predictions, where cψ(y | x) is the confidence of predicting y for x. We hypothesize that cψ is relatively uncorrelated with f, which can possibly promote the generated images to reside in class ysource.\nNote that L can be easily modified to perform untargeted attacks, for example by replacing L0 with −max_{y≠ysource} log f(y | gθ(z, ysource)). Additionally, when performing our evaluations, we need to use humans to ensure that our generative model is actually generating images which are in one of the desired classes with high probability. In contrast, when simply perturbing an existing image, past work has been able to assume that the true label does not change if the perturbation is small. Thus, during evaluation, to test whether the images are legitimate and belong to class ysource, we use crowd-sourcing on Amazon Mechanical Turk (MTurk).\nNoise-augmented attack  The representation power of the AC-GAN generator can be improved if we add small trainable noise to the generated image. Let εattack be the maximum magnitude of noise that we want to apply. The noise-augmented generator is defined as gθ(z, τ, y; εattack) ≜ gθ(z, y) + εattack · tanh(τ), where τ is an auxiliary trainable variable with the same shape as gθ(z, y) and tanh is applied element-wise. 
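Putting the pieces together, the loss L = L0 + λ1L1 + λ2L2 and the noise-augmented generator can be sketched numerically as follows. The generator and both classifiers below are random linear stand-ins (our assumption, purely for illustration), so only the loss structure mirrors the text:

```python
import numpy as np

# Toy numeric sketch of the attack loss L = L0 + lam1*L1 + lam2*L2 and the
# noise-augmented generator. The "generator" and both "classifiers" are
# random linear stand-ins (illustrative assumptions), not trained models.

rng = np.random.default_rng(0)
m, n_pix, K = 8, 16, 10            # latent dim, image dim, number of classes
G = rng.normal(size=(n_pix, m))    # stand-in for the conditional generator
Wf = rng.normal(size=(K, n_pix))   # stand-in for the target classifier f
Wc = rng.normal(size=(K, n_pix))   # stand-in for the auxiliary classifier c

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def g(z, y, tau=None, eps_attack=0.0):
    """Stand-in for g_theta(z, y); y is ignored by this toy generator."""
    x = G @ z
    if tau is not None:                    # noise-augmented generator:
        x = x + eps_attack * np.tanh(tau)  # g(z, y) + eps_attack * tanh(tau)
    return x

def attack_loss(z, z0, y_src, y_tgt, lam1=100.0, lam2=100.0, eps=0.1):
    x = g(z, y_src)
    L0 = -np.log(softmax(Wf @ x)[y_tgt])                 # push f toward y_tgt
    L1 = np.mean(np.maximum(np.abs(z - z0) - eps, 0.0))  # stay near z0
    L2 = -np.log(softmax(Wc @ x)[y_src])                 # keep c at y_src
    return L0 + lam1 * L1 + lam2 * L2

z0 = rng.normal(size=m)
assert attack_loss(z0, z0, y_src=3, y_tgt=7) > 0   # NLL terms are positive
# L1 vanishes whenever every coordinate of z lies within eps of z0:
assert np.mean(np.maximum(np.abs((z0 + 0.05) - z0) - 0.1, 0.0)) == 0.0
# the noise augmentation perturbs the image by at most eps_attack per pixel:
tau = rng.normal(size=n_pix)
assert np.max(np.abs(g(z0, 3, tau, 0.03) - g(z0, 3))) < 0.03
```

In the paper, z (and τ) are optimized by gradient descent on L through the trained AC-GAN; this sketch only evaluates the loss terms.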
As long as εattack is small, gθ(z, τ, y; εattack) should be indistinguishable from gθ(z, y), and o(gθ(z, τ, y; εattack)) = o(gθ(z, y)), i.e., adding small noise should preserve image quality and ground-truth labels. Similar to the basic attack, noise-augmented unrestricted adversarial examples can be obtained by solving min_{z,τ} L, with gθ(z, ysource) in (1) and (3) replaced by gθ(z, τ, ysource; εattack).\nOne interesting observation is that traditional perturbation-based adversarial examples can also be obtained as a special case of our noise-augmented attack, by choosing a suitable gθ(z, y) instead of the AC-GAN generator. Specifically, let T be the test dataset, and Ty = {x ∈ T | o(x) = y}. We can use a discrete latent code z ∈ {1, 2, ..., |T_ysource|} and specify gθ(z, y) to be the z-th image in Ty. Then, when z0 is uniformly drawn from {1, 2, ..., |T_ysource|}, λ1 → ∞ and λ2 = 0, we will recover an objective similar to FGSM [2] or PGD [10].\n\n4 Experiments\n\n4.1 Experimental details\nAmazon Mechanical Turk settings  In order to demonstrate the success of our unrestricted adversarial examples, we need to verify that their ground-truth labels disagree with the classifier's predictions. To this end, we use Amazon Mechanical Turk (MTurk) to manually label each unrestricted adversarial example.\n\nTable 1: Attacking certified defenses on MNIST. The unrestricted adversarial examples here are untargeted and without noise-augmentation. Numbers represent success rates (%) of our attack, based on human evaluations on MTurk. While no perturbation-based attack with ε = 0.1 can have a success rate larger than the certified rate (when evaluated on the training set), we are able to achieve that by considering a more general attack mechanism.\n\nClassifier (source class)    0     1     2     3     4     5     6     7     8     9    Overall   Certified Rate (ε = 0.1)\nRaghunathan et al. [16]    90.8  48.3  86.7  93.7  94.7  85.7  93.4  80.8  96.8  95.0    86.6         ≤ 35.0\nKolter & Wang [17]         94.2  57.3  92.2  94.0  93.7  89.6  95.7  81.4  96.3  93.5    88.8         ≤ 5.8\n\nTo improve the signal-to-noise ratio, we assign the same image to 5 different workers and use the result of a majority vote as ground truth. For each worker, the MTurk interface contains 10 images, and for each image we use a button group to show all possible labels. The worker is asked to select the correct label for each image by clicking on the corresponding button. In addition, each button group contains an "N/A" button that the workers are instructed to click on if they think the image is not legitimate or does not belong to any class. To confirm that our MTurk setup results in accurate labels, we ran a test to label MNIST images. The results show that 99.6% of majority votes obtained from workers match the ground-truth labels.\nFor some of our experiments, we want to investigate whether unrestricted adversarial examples are more similar to existing images in the dataset, compared to perturbation-based attacks. We use a classical A/B test for this, i.e., each synthesized adversarial example is randomly paired with an existing image from the dataset, and the annotators are asked to identify the synthesized images. Screenshots of all our MTurk interfaces can be found in Appendix E.\nDatasets  The datasets used in our experiments are MNIST [25], SVHN [26], and CelebA [27]. Both MNIST and SVHN are images of digits. For CelebA, we group the face images according to female/male, and focus on gender classification. 
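The five-way majority vote used in the MTurk labeling above can be sketched as follows; the strict-majority requirement and the handling of "N/A" winners are our assumptions, since the text specifies only a majority vote over 5 workers:

```python
from collections import Counter

# Sketch of the 5-worker majority-vote labeling described above. The
# strict-majority and "N/A" rules are assumptions for illustration.

def majority_label(votes):
    """Return the majority label among worker votes, or None when there
    is no strict majority or the winner is the 'N/A' option."""
    label, count = Counter(votes).most_common(1)[0]
    if count <= len(votes) // 2 or label == "N/A":
        return None
    return label

assert majority_label(["7", "7", "7", "1", "N/A"]) == "7"
assert majority_label(["N/A", "N/A", "N/A", "2", "2"]) is None
assert majority_label(["1", "1", "2", "2", "3"]) is None  # no strict majority
```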
We test our attack on these datasets because the tasks (digit categorization and gender classification) are easier and less ambiguous for MTurk workers, compared to those having more complicated labels, such as Fashion-MNIST [37] or CIFAR-10 [38].\nModel Settings  We train our AC-GAN [20] with gradient penalty [21] on all available data partitions of each dataset, including training, test, and extra images (SVHN only). This is based on the assumption that attackers can access a large number of images. We use ResNet [39] blocks in our generative models, mainly following the architecture design of [21]. For training classifiers, we only use the training partition of each dataset. We copy the network architecture from [10] for the MNIST task, and use a ResNet architecture similar to [13] for the other datasets. For more details about architectures, hyperparameters and adversarial training methods, please refer to Appendix C.\n\n4.2 Untargeted attacks against certified defenses\n\nWe first show that our new adversarial attack can bypass the recently proposed certified defenses [16, 17]. These defenses can provide a theoretically verified certificate that a training example cannot be classified incorrectly by any perturbation-based attack with a perturbation size less than a given ε.\nSetup  For each source class, we use our method to produce 1000 untargeted unrestricted adversarial examples without noise-augmentation. By design these are all incorrectly classified by the target classifier. Since they are synthesized from scratch, we report in Tab. 1 the fraction labeled by human annotators as belonging to the intended source class. We conduct these experiments on the MNIST dataset to ensure a fair comparison, since certified classifiers with pre-trained weights for this dataset can be obtained directly from the authors of [16, 17]. 
We produce untargeted unrestricted adversarial examples in this task, as the certificates are for untargeted attacks.\nResults  Tab. 1 shows the results. We can see that the stronger of the two certified defenses, [17], provides a certificate for 94.2% of the samples in the MNIST test set with ε = 0.1 (out of 1). Since our technique is not perturbation-based, 88.8% of our samples are able to fool this defense. Note this does not mean the original defense is broken, since we are considering a different threat model.\n\nFigure 1: Random samples of untargeted unrestricted adversarial examples (w/o noise) against certified defenses on MNIST: (a) untargeted attacks against [16]; (b) untargeted attacks against [17]. Green and red borders indicate success and failure respectively, according to MTurk results. The annotation in the upper-left corner of each image represents the classifier's prediction. For example, the entry in the top left corner (left panel) is classified as 0 by MTurk, and as a 7 by the classifier, and is therefore considered a success. The entry in the top left corner (right panel) is not classified as 0 by MTurk, and is therefore counted as a failure.\n\nFigure 2: Comparing PGD attacks with high ε = 0.31 (top) and ours (bottom) for [16] on MNIST. The two methods have comparable success rates of 86.0% (with ε = 0.31) and 86.6% respectively. Our images, however, look significantly more realistic.\n\n4.2.1 Evading human detection\nOne natural question is, why not increase ε to achieve higher success rates? With a large enough ε, existing perturbation-based attacks will be able to fool certified defenses using larger perturbations. However, we can see the downside of this approach in Fig. 
2: because perturbations are large, the resulting samples appear obviously altered.\nSetup  To show this quantitatively, we increase the ε value for a traditional perturbation-based attack (a 100-step PGD) until the attack success rates on the certified defenses are similar to those for our technique. Specifically, we used an ε of 0.31 and 0.2 to attack [16] and [17] respectively, resulting in success rates of 86.0% and 83.5%. We then asked human annotators to distinguish between the adversarial examples and unmodified images from the dataset, in an A/B test setting. If the two were indistinguishable, we would expect a 50% success rate.\nResults  We found that with the perturbation-based examples against [16], annotators can correctly identify adversarial images with a 92.9% success rate. In contrast, they can only correctly identify adversarial examples from our attack with a 76.8% success rate. Against [17], the success rates are 87.6% and 78.2% respectively. We expect this gap to increase even more as better generative models and defense mechanisms are developed.\n\n4.3 Targeted attacks against adversarial training\n\nThe theoretical guarantees provided by certified defenses are satisfying; however, these defenses require computationally expensive optimization techniques and do not yet scale to larger datasets. Furthermore, more traditional defense techniques actually perform better in practice in certain cases. [33] has shown that the strongest non-certified defense currently available is the adversarial training technique presented in [10], so we also evaluated our attack against this defense technique. 
We perform this evaluation in the targeted setting in order to better understand how the success rate varies between various source-target pairs.\n\nFigure 3: (a) sampled examples on MNIST; (b) success rates on MNIST; (c) sampled examples on SVHN; (d) success rates on SVHN. (a)(c) Random samples of targeted unrestricted adversarial examples (w/o noise). Row i, column j is an image that is supposed to be of class i while the classifier predicts it to be class j. A green border indicates that the image is voted legitimate by MTurk workers, and a red border means the label given by workers (as shown in the upper left corner) disagrees with the image's source class. (b)(d) The success rates (%) of our targeted unrestricted adversarial attack (w/o noise). Also see Tab. 2.\n\nTable 2: Overall success rates of our targeted unrestricted adversarial attacks. Success rates of PGD are provided to show that the network is adversarially-trained to be robust. †Best public result.\n\nRobust Classifier                 Accuracy (orig. images)   Success Rate of PGD   Our Success Rate (w/o Noise)   Our Success Rate (w/ Noise)   εattack\nMadry network [10] on MNIST              98.4                     10.4†                    85.2                          85.0                  0.3\nResNet (adv-trained) on SVHN             96.3                     59.9                     84.2                          91.6                  0.03\nResNet (adv-trained) on CelebA           97.3                     20.5                     91.1                          86.7                  0.03\n\nSetup  We produce 100 unrestricted adversarial examples for each pair of source and target classes and ask human annotators to label them. We also compare to traditional PGD attacks as a reference. For the perturbation-based attacks against the ResNet networks, we use a 20-step PGD with the values of ε given in the table. For the perturbation-based attack against the Madry network [10], we report the best published result [40]. It is important to note that the reference perturbation-based results are not directly comparable to ours because they are i) untargeted attacks and ii) limited to small perturbations. 
Nonetheless, they can provide a good sense of the robustness of adversarially-trained networks.\nResults  A summary of the results can be seen in Tab. 2. We can see that the defense from [10] is quite effective against the basic perturbation-based attack, limiting the success rate to 10.4% on MNIST, and 20.5% on CelebA. In contrast, our unrestricted adversarial examples (with or without noise-augmentation) can successfully fool this defense with more than an 84% success rate on all datasets. We find that adding noise-augmentation to our attack does not significantly change the results, boosting the SVHN success rate by 7.4% while reducing the CelebA success rate by 4.4%. In Fig. 3 and Fig. 4, we show samples and detailed success rates of unrestricted adversarial attacks without noise-augmentation.\n\nFigure 4: Sampled unrestricted adversarial examples (w/o noise) for fooling the classifier into misclassifying a female as male (left) and the other way around (right). Green and red borders and annotations have the same meanings as in Fig. 3, except "F" is short for "Female" and "M" is short for "Male".\n\nTable 3: Transferability of unrestricted adversarial examples on MNIST. We attack Madry Net [10] (adv) with our method and feed legitimate unrestricted adversarial examples, as verified by MTurk workers, to other classifiers. Here "adv" is short for "adversarially-trained" (with PGD) and "no adv" means no adversarial training is used. Numbers represent accuracies of the classifiers.\n\nAttack Type                             Madry Net [10] (no adv)   Madry Net [10] (adv)   ResNet (no adv)   ResNet (adv)   [16]   [17]\nNo attack                                        99.5                    98.4                 99.3             99.4       95.8   98.2\nOur attack (w/o noise)                           95.1                     0                   92.7             93.7       77.1   84.3\nOur attack (w/ noise, εattack = 0.3)             78.3                     0                   73.8             84.9       78.1   63.0\n\n
More samples and success rate details are provided in Appendix D.\n\n4.4 Transferability\nAn important feature of traditional perturbation-based attacks is their transferability across different classifiers.\nSetup  To test how our unrestricted adversarial examples transfer to other architectures, we use all of the unrestricted adversarial examples created to target the Madry Network [41] on MNIST for the results in Section 4.3, and filter out invalid ones using the majority vote of a set of human annotators. We then feed these unrestricted adversarial examples to other architectures. Besides the adversarially-trained Madry Network, the architectures we consider include a ResNet [39] similar to those used on the SVHN and CelebA datasets in Section 4.3. We test both normally-trained and adversarially-trained ResNets. We also take the architecture of the Madry Network in [10] and train it without adversarial training.\nResults  We show in Tab. 3 that unrestricted adversarial examples exhibit moderate transferability to different classifiers, which means they can be threatening in a black-box scenario as well. For attacks without noise-augmentation, the most successful transfer happens against [16], where the success rate is 22.9%. For the noise-augmented attack, the most successful transfer is against [17], with a success rate of 37.0%. The results indicate that the transferability of unrestricted adversarial examples can generally be enhanced with noise-augmentation.\n\n5 Analysis\n\nIn this section, we analyze why our method can attack a classifier using a generative model, under some idealized assumptions. For simplicity, we assume the target f(x) is a binary classifier, where x ∈ R^n is the input vector. Previous explanations for perturbation-based attacks [2] assume that the score function s(x) ∈ R used by f is almost linear. Suppose s(x) ≈ wᵀx + b and w, x ∈ R^n both have high dimensions (large n). 
We can choose the perturbation to be δ = ε · sign(w), so that s(x + δ) − s(x) ≈ ε · n. Though ε is small, n is typically a large number, therefore ε · n can be large enough to change the prediction of f(x). Similarly, we can explain the existence of unrestricted adversarial examples. Suppose g(z, y) ∈ R^n is an ideal generative model that can always produce legitimate images of class y ∈ {0, 1} for any z ∈ R^m, and assume for all z_0, f(g(z_0, y)) = y. The end-to-end score function s(g(z, y)) can be similarly approximated by s(g(z, y)) ≈ w_g^T z + b_g, and we can again take δ_z = ε · sign(w_g), so that s(g(z_0 + δ_z, y)) − s(g(z_0, y)) ≈ ε · m. Because m ≫ 1, ε · m can be large enough to change the prediction of f, justifying why we can find many unrestricted adversarial examples by minimizing L.

It becomes harder to analyze the case of an imperfect generative model. We provide a theoretical analysis in Appendix A under relatively strong assumptions to argue that most unrestricted adversarial examples produced by our method should be legitimate.

6 Related work

Some recent attacks also use more structured perturbations beyond simple norm bounds. For example, [42] shows that wearing eyeglass frames can cause face-recognition models to misclassify. [43] tests the robustness of classifiers to "nuisance variables", such as geometric distortions, occlusions, and illumination changes. [44] proposes converting the color space from RGB to HSV and shifting the H and S components. [45] proposes mapping the input image to a latent space using GANs, and searching for adversarial examples in the vicinity of the latent code. In contrast to our unrestricted adversarial examples, where images are synthesized from scratch, these attacking methods craft malicious inputs based on a given test dataset using a limited set of image manipulations.
Similar to what we have shown for traditional adversarial examples, we can view these attacking methods as special instances of our unrestricted adversarial attack framework by choosing a suitable generative model.

There is also a related class of maliciously crafted inputs named fooling images [46]. Different from adversarial examples, fooling images consist of noise or patterns that do not necessarily look realistic but are nonetheless predicted to be in one of the known classes with high confidence. As with our unrestricted adversarial examples, fooling images are not restricted to small norm-bounded perturbations. However, fooling images do not typically look legitimate to humans, whereas our focus is on generating adversarial examples which look realistic and meaningful.

Generative adversarial networks have also been used in some previous attack and defense mechanisms. Examples include AdvGAN [23], DeepDGA [47], ATN [48], GAT [49] and Defense-GAN [14]. The closest to our work are AdvGAN and DeepDGA. AdvGAN also proposes to use GANs for creating adversarial examples. However, their adversarial examples are still based on small norm-bounded perturbations. This enables them to assume that adversarial examples have the same ground-truth labels as the unperturbed images, whereas we rely on human evaluation to establish ground-truth labels for our examples. DeepDGA uses GANs to generate adversarial domain names. However, domain names are arguably easier to generate than images since they need to satisfy fewer constraints.

7 Conclusion

In this paper, we explore a new threat model and propose a more general form of adversarial attacks. Instead of perturbing existing data points, our unrestricted adversarial examples are synthesized entirely from scratch, using conditional generative models. As shown in our experiments, this new kind of adversarial example undermines current defenses, which are designed for perturbation-based attacks.
Moreover, unrestricted adversarial examples are able to transfer to other classifiers trained on the same dataset. Since the first draft of this paper was released, there has been a surge of interest in more general adversarial examples. For example, a contest [24] on unrestricted adversarial examples has recently been launched.

Both traditional perturbation-based attacks and the new method proposed in this paper exploit current classifiers' vulnerability to covariate shift [50]. The prevalent training framework in machine learning, Empirical Risk Minimization [51], does not guarantee performance when a model is tested on a different data distribution. Therefore, it is important to develop new training methods that can generalize to different input distributions, or new methods that can reliably detect covariate shift [52]. Such new methods should be able to alleviate the threats of both perturbation-based and unrestricted adversarial examples.

Acknowledgements

The authors would like to thank Shengjia Zhao for reviewing an early draft of this paper. We also thank Ian Goodfellow, Ben Poole, Anish Athalye and Sumanth Dathathri for helpful online discussions. This research was supported by Intel Corporation, TRI, NSF (#1651565, #1522054, #1733686) and FLI (#2017-158687).

References

[1] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[2] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[3] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[4] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song.
Robust Physical-World Attacks on Deep Learning Visual Classification. In Computer Vision and Pattern Recognition (CVPR), 2018.

[5] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. Adversarial examples for semantic segmentation and object detection. In International Conference on Computer Vision. IEEE, 2017.

[6] Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David Wagner, and Wenchao Zhou. Hidden voice commands. In USENIX Security Symposium, pages 513–530, 2016.

[7] Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, and Wenyuan Xu. Dolphinattack: Inaudible voice commands. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 103–117. ACM, 2017.

[8] Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. arXiv preprint arXiv:1707.05373, 2017.

[9] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.

[10] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[11] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.

[12] Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.

[13] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations, 2018.

[14] Pouya Samangouei, Maya Kabkab, and Rama Chellappa.
Defense-gan: Protecting classifiers against adversarial attacks using generative models. 2018.

[15] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pages 854–863, 2017.

[16] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018.

[17] J Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.

[18] Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.

[19] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[20] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In International Conference on Machine Learning, pages 2642–2651, 2017.

[21] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.

[22] Michael Buhrmester, Tracy Kwang, and Samuel D Gosling. Amazon's mechanical turk: A new source of inexpensive, yet high-quality, data? Perspectives on psychological science, 6(1):3–5, 2011.

[23] Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks.
arXiv preprint arXiv:1801.02610, 2018.

[24] Tom B Brown, Nicholas Carlini, Chiyuan Zhang, Catherine Olsson, Paul Christiano, and Ian Goodfellow. Unrestricted adversarial examples. arXiv preprint arXiv:1809.08352, 2018.

[25] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.

[26] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.

[27] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

[28] Jorge Nocedal. Updating quasi-newton matrices with limited storage. Mathematics of computation, 35(151):773–782, 1980.

[29] Seyed Mohsen Moosavi Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), number EPFL-CONF-218057, 2016.

[30] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pages 372–387. IEEE, 2016.

[31] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.

[32] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks.
In International Conference on Learning Representations, 2017.

[33] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

[34] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.

[35] Aditya Grover, Manik Dhar, and Stefano Ermon. Flow-gan: Combining maximum likelihood and adversarial learning in generative models. In AAAI Conference on Artificial Intelligence, 2018.

[36] Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning. 2018.

[37] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[38] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

[39] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[40] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Mnist adversarial examples challenge, 2017.

[41] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.

[42] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1528–1540. ACM, 2016.

[43] Alhussein Fawzi and Pascal Frossard.
Measuring the effect of nuisance variables on classifiers. In British Machine Vision Conference (BMVC), number EPFL-CONF-220613, 2016.

[44] Hossein Hosseini and Radha Poovendran. Semantic adversarial examples. arXiv preprint arXiv:1804.00499, 2018.

[45] Zhengli Zhao, Dheeru Dua, and Sameer Singh. Generating natural adversarial examples. In International Conference on Learning Representations, 2018.

[46] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.

[47] Hyrum S Anderson, Jonathan Woodbridge, and Bobby Filar. Deepdga: Adversarially-tuned domain generation and detection. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, pages 13–21. ACM, 2016.

[48] Shumeet Baluja and Ian Fischer. Adversarial transformation networks: Learning to generate adversarial examples. arXiv preprint arXiv:1703.09387, 2017.

[49] Hyeungill Lee, Sungyeob Han, and Jungwoo Lee. Generative adversarial trainer: Defense to adversarial perturbations with gan. arXiv preprint arXiv:1705.03387, 2017.

[50] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.

[51] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.

[52] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised domain adaptation. In International Conference on Learning Representations, 2018.

[53] Terence Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.

[54] Phillippe Rigollet. High-dimensional statistics.
Lecture notes for course 18.S997, 2015.