{"title": "A New Defense Against Adversarial Images: Turning a Weakness into a Strength", "book": "Advances in Neural Information Processing Systems", "page_first": 1635, "page_last": 1646, "abstract": "Natural images are virtually surrounded by low-density misclassified regions that can be efficiently discovered by gradient-guided search --- enabling the generation of adversarial images. While many techniques for detecting these attacks have been proposed, they are easily bypassed when the adversary has full knowledge of the detection mechanism and adapts the attack strategy accordingly. In this paper, we adopt a novel perspective and regard the omnipresence of adversarial perturbations as a strength rather than a weakness. We postulate that if an image has been tampered with, these adversarial directions either become harder to find with gradient methods or have substantially higher density than for natural images. We develop a practical test for this signature characteristic to successfully detect adversarial attacks, achieving unprecedented accuracy under the white-box setting where the adversary is given full knowledge of our detection mechanism.", "full_text": "A New Defense Against Adversarial Images:\n\nTurning a Weakness into a Strength\n\nTao Yu\u2217\u2020\n\nShengyuan Hu\u2217\u2020 Chuan Guo\u2020 Wei-Lun Chao\u2021 Kilian Q. Weinberger\u2020\n\nAbstract\n\nNatural images are virtually surrounded by low-density misclassi\ufb01ed regions that\ncan be ef\ufb01ciently discovered by gradient-guided search \u2014 enabling the generation\nof adversarial images. While many techniques for detecting these attacks have\nbeen proposed, they are easily bypassed when the adversary has full knowledge\nof the detection mechanism and adapts the attack strategy accordingly. In this\npaper, we adopt a novel perspective and regard the omnipresence of adversarial\nperturbations as a strength rather than a weakness. 
We postulate that if an image has been tampered with, these adversarial directions either become harder to \ufb01nd with gradient methods or have substantially higher density than for natural images. We develop a practical test for this signature characteristic to successfully detect adversarial attacks, achieving unprecedented accuracy under the white-box setting where the adversary is given full knowledge of our detection mechanism.\n\n1\n\nIntroduction\n\nThe advance of deep neural networks has led to natural questions regarding their robustness to both natural and malicious changes in the test input. For the latter scenario, the seminal work of Biggio et al. [3] and Szegedy et al. [48] \ufb01rst suggested that neural networks may be prone to imperceptible changes in the input \u2014 the so-called adversarial perturbations \u2014 that alter the model\u2019s decision entirely. This weakness not only applies to image classi\ufb01cation models, but is prevalent in various machine learning applications, including object detection and image segmentation [10, 54], speech recognition [8], and deep policy networks [2, 21].\nThe threat of adversarial perturbations has prompted tremendous effort towards the development of defense mechanisms. Common defenses either attempt to recover the true semantic labels of the input [5, 12, 19, 38, 41, 45] or detect and reject adversarial examples [17, 28, 31, 33\u201335, 55]. Although many of the proposed defenses have been successful against passive attackers \u2014 ones that are unaware of the presence of the defense mechanism \u2014 almost all fail against adversaries that have full knowledge of the internal details of the defense and modify the attack algorithm accordingly [1, 6]. 
To date, the success of existing defenses has been limited to simple datasets with a relatively small variety of classes [24, 29, 39, 44, 52].\nRecent studies [13, 42] have shown that the existence of adversarial perturbations may be an inherent property of natural data distributions in high-dimensional spaces \u2014 painting a grim picture for defenses. However, in this paper we propose a radically new approach to defending against adversarial attacks that turns this seemingly insurmountable obstacle from a weakness into a strength: we use the inherent existence of valid adversarial perturbations around a natural image as a signature to attest that it is unperturbed.\n\n\u2217Equal Contribution. \u2020Department of Computer Science, Cornell University. \u2021Department of Computer Science and Engineering, The Ohio State University. Email: {ty367, sh797, cg563, kqw4}@cornell.edu, chao.209@osu.edu.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fConcretely, we exploit two seemingly contradictory properties of natural images: On one hand, natural images lie with high probability near the decision boundary to any given label [13, 42]; on the other hand, natural images are robust to random noise [48], which means these small \u201cpockets\u201d of space where the input is misclassi\ufb01ed have low density and are unlikely to be found through random perturbations. To verify whether an image is benign, we can test for both properties effectively:\n1. We measure the degree of robustness to random noise by observing the change in prediction after adding i.i.d. Gaussian noise.\n2. We measure the proximity to a decision boundary by observing the number of gradient steps required to change the label of an input image. 
This procedure is identical to running a gradient-based attack algorithm against the input (which is potentially an adversarial image already).\nWe hypothesize that arti\ufb01cially perturbed images mostly violate at least one of the two conditions. This gives rise to an effective detection mechanism even when the adversary has full knowledge of the defense. Against strong L\u221e-bounded white-box adversaries that adaptively optimize against the detector, we achieve a worst-case detection rate of 49% at a false positive rate of 20% on ImageNet [11] using a pre-trained ResNet-101 model [20]. Prior art achieves a detection rate of 0% at an equal false positive rate under the same setting. Further analysis shows that there exists a fundamental trade-off for white-box attackers when optimizing to satisfy the two detection criteria. Our method creates new challenges for the search for adversarial examples and points to a promising direction for future research in defense against white-box adversaries.\n\n2 Background\n\nAttack overview. Test-time attacks via adversarial examples can be broadly categorized into black-box and white-box settings. In the black-box setting, the adversary can only access the model as an oracle, and may receive continuous-valued outputs or only discrete classi\ufb01cation decisions [9, 18, 22, 23, 30, 37, 49\u201351]. We focus on the white-box setting in this paper, where the attacker is assumed to be an insider and therefore has full knowledge of the internal details of the network. In particular, having access to the model parameters allows the attacker to perform powerful \ufb01rst-order optimization attacks by optimizing an adversarial loss function.\nThe white-box attack framework can be summarized as follows. 
Let h be the target classi\ufb01cation model that, given any input x, outputs a vector of probabilities h(x) with h(x)y' = p(y'|x) (i.e., the y'-th component of the vector h(x)) for every class y'. Let y be the true class of x and L be a continuous-valued adversarial loss that encourages misclassi\ufb01cation, e.g.,\n\nL(h(x'), y) = \u2212cross-entropy(h(x'), y).\n\nGiven a target image x that the model correctly classi\ufb01es as arg max_y' h(x)y' = y, the attacker aims to solve the following optimization problem:\n\nmin_{x'} L(h(x'), y)  s.t.  \u2016x \u2212 x'\u2016 \u2264 \u03c4.\n\nHere, \u2016\u00b7\u2016 is a measure of perceptible difference and is commonly approximated using the Euclidean norm \u2016\u00b7\u20162 or the max-norm \u2016\u00b7\u2016\u221e, and \u03c4 > 0 is a perceptibility threshold. This optimization problem de\ufb01nes an untargeted attack, where the adversary\u2019s goal is to cause misclassi\ufb01cation. In contrast, for a targeted attack, the adversary is given some target label yt \u2260 y and de\ufb01nes the adversarial loss to encourage classi\ufb01cation to the target label:\n\nL(h(x'), yt) = cross-entropy(h(x'), yt).\n\n(1)\n\nFor the remainder of this paper, we will focus on the targeted attack setting, but any approach can be readily augmented for untargeted attacks as well.\nOptimization. White-box (targeted) attacks mainly differ in the choice of the adversarial loss function L and the optimization procedure. One of the earliest attacks [48] used L-BFGS to optimize the cross-entropy adversarial loss in Equation 1. 
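To make this attack template concrete, the following is a minimal numpy sketch of a targeted attack that approximately solves the constrained problem with projected signed-gradient steps. It uses a hypothetical linear softmax classifier (weight matrix `W`) rather than a deep network; the function name and parameters are illustrative, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_linf_attack(W, x, y_target, tau=0.1, step=0.01, max_iter=200):
    """Minimize cross-entropy(h(x'), y_target) s.t. ||x - x'||_inf <= tau, x' in [0, 1],
    for the toy model h(x) = softmax(W @ x)."""
    onehot = np.eye(W.shape[0])[y_target]
    x_adv = x.copy()
    for _ in range(max_iter):
        p = softmax(W @ x_adv)
        if p.argmax() == y_target:          # attack succeeded
            break
        # d/dx' cross-entropy(softmax(W x'), y_t) = W^T (p - onehot(y_t))
        grad = W.T @ (p - onehot)
        x_adv = x_adv - step * np.sign(grad)     # signed gradient step
        x_adv = np.clip(x_adv, x - tau, x + tau) # project onto the L_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)         # project onto valid pixel range
    return x_adv
```

The two clipping steps implement the perceptibility constraint and the pixel-range constraint as projections after every gradient step.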
Carlini and Wagner [7] investigated the use of different adversarial loss functions and found that the margin loss\n\nL(Z(x'), yt) = [max_{y' \u2260 yt} Z(x')y' \u2212 Z(x')yt + \u03ba]_+\n\n(2)\n\n2\n\n\fis more suitable for \ufb01rst-order optimization methods, where Z is the logit vector predicted by the model and \u03ba > 0 is a chosen margin constant. This loss is optimized using Adam [25], and the resulting method is known as the Carlini-Wagner (CW) attack. Another class of attacks favors the use of simple gradient descent using the sign of the gradient [16, 27, 32], which results in improved transferability of the constructed adversarial examples from one classi\ufb01cation model to another.\nEnforcing the perceptibility constraint. For common choices of the measure of perceptibility, the attacker can either fold the constraint into the adversarial loss as a Lagrangian penalty, or apply a projection step onto the feasible region at the end of every iteration. Since the Euclidean norm \u2016\u00b7\u20162 is differentiable, it is commonly enforced with the former option, i.e.,\n\nmin_{x'} L(h(x'), yt) + c\u2016x \u2212 x'\u20162\n\nfor some choice of c > 0. On the other hand, the max-norm \u2016\u00b7\u2016\u221e is often enforced by restricting every coordinate of the difference x \u2212 x' to the range [\u2212\u03c4, \u03c4] after every gradient step. In addition, since all pixel values must fall within the range [0, 1], most methods also project x' to the unit cube at the end of every iteration [7, 32]. 
When using this option along with the cross-entropy adversarial loss, the resulting algorithm is commonly referred to as the Projected Gradient Descent (PGD) attack1 [1].\n\n3 Detection Methods and Their Insuf\ufb01ciency\n\nOne commonly accepted explanation for the existence of adversarial examples is that they operate outside the natural image manifold \u2014 regions of the space that the model had no exposure to during training, and where its behavior can therefore be manipulated arbitrarily. This view casts the problem of defending against adversarial examples as a robust classi\ufb01cation or anomaly detection problem. The former aims to project the input back to the natural image manifold and recover its true label, whereas the latter only requires determining whether the input belongs to the manifold and rejecting it if not.\nDetection methods. Many principled detection algorithms have been proposed to date [17, 28, 31, 33\u201335, 55]. The most common approach involves testing the input against one or several criteria that are satis\ufb01ed by natural images but are likely to fail for adversarially perturbed images. In what follows, we brie\ufb02y describe two representative detection mechanisms.\nFeature Squeezing [55] applies a semantic-preserving image transformation to the input and measures the difference in the model\u2019s prediction compared to the plain input. Transformations such as median smoothing, bit quantization, and non-local mean do not alter the image content; hence the model is expected to output similar predictions after applying these transformations. The method then measures the maximum L1 change in predicted probability after applying these transformations and \ufb02ags the input as adversarial if this change is above a chosen threshold.\nArtifacts [14] uses the empirical density of the input and the model uncertainty to characterize benign and adversarial images. 
The empirical density can be computed via kernel density estimation on the feature vector. For the uncertainty estimate, the method evaluates the network multiple times using different random dropout masks and computes the variance in the output. Under the Bayesian interpretation of dropout, this variance estimate encodes the model\u2019s uncertainty [15]. Adversarial inputs are expected to have lower density and higher uncertainty than natural inputs. Thus, the method predicts the input as adversarial if the density falls below or the uncertainty rises above a chosen threshold.\nDetectors that use multiple criteria (such as Feature Squeezing and Artifacts) can combine these criteria into a single detection method either by declaring the input adversarial if any criterion fails to be satis\ufb01ed, or by training a classi\ufb01er on top of the criteria as features to classify the input. Other notable features for detecting adversarial images include convolutional features extracted from intermediate layers [28, 34], distance to training samples in pixel space [17, 31], and entropy of non-maximal class probabilities [36].\nBypassing detection methods. While the approaches for detecting adversarial examples appear principled in nature, the difference in settings from traditional anomaly detection renders most techniques easy to bypass. In essence, a white-box adversary with knowledge of the features used for detection can optimize the adversarial input to mimic these features with gradient descent. 
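As a concrete illustration of such a detection score, the Feature Squeezing criterion described above can be sketched as follows. This is a toy numpy version that uses only the bit-quantization squeezer and a generic probability-valued classifier `h`; it is not the reference implementation, which also applies median smoothing and non-local mean and takes the maximum change across squeezers.

```python
import numpy as np

def bit_depth_squeeze(x, bits=4):
    """Quantize pixel values in [0, 1] down to 2**bits gray levels."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def squeeze_score(h, x, bits=4):
    """L1 change in predicted class probabilities after squeezing the input.
    h maps an image array to a probability vector."""
    return np.abs(h(x) - h(bit_depth_squeeze(x, bits))).sum()

def flag_by_squeezing(h, x, threshold=0.5, bits=4):
    """Flag the input as adversarial if the prediction moves too much
    under a semantic-preserving transformation."""
    return squeeze_score(h, x, bits) > threshold
```

A white-box adversary can treat `squeeze_score` as just another differentiable-in-effect term to minimize, which is precisely the weakness discussed next.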
Any non-differentiable component used in the detection algorithm, such as bit quantization and non-local mean,\n\n1Some works also refer to the iterative Fast Gradient Sign Method (FGSM) [16] as PGD [32].\n\n3\n\n\fcan be approximated with the identity transformation on the backward pass [1], and randomization can be circumvented by minimizing the expected adversarial loss via Monte Carlo sampling [1]. These simple techniques have proven tremendously successful, bypassing almost all known detection methods to date [6]. Given enough gradient queries, adversarial examples can be optimized to appear even \u201cmore benign\u201d than natural images.\n\n4 Detection by Adversarial Perturbations\n\nIn this section we describe a novel approach to detecting adversarial images that relies on two principled criteria regarding the distribution of adversarial perturbations around natural images. In contrast to prior work, our approach is hard to fool through \ufb01rst-order optimization.\n\n4.1 Criterion 1: Low density of adversarial perturbations\n\nThe features extracted by convolutional neural networks (CNNs) from natural images are known to be particularly robust to random input corruptions [19, 48, 53]. In other words, random perturbations applied to natural images should not lead to changes in the predicted label (i.e., an adversarial image). Our \ufb01rst criterion follows this intuition and tests whether the given input is robust to Gaussian noise:\nC1: Robustness to random noise. Sample \u03b5 \u223c N(0, \u03c3\u00b2I) (where \u03c3\u00b2 is a hyperparameter) and compute \u2206 = \u2016h(x) \u2212 h(x + \u03b5)\u20161. The input x is rejected as adversarial if \u2206 is suf\ufb01ciently large.\nThis style of reasoning has indeed been successfully applied to defend against black-box and gray-box2 attacks [19, 40, 53]. Figure 1 shows a 2D cartoon depiction of the high-dimensional decision boundary near a natural image x. 
When the adversarial attack perturbs x slightly across the decision boundary from A to an incorrect class B, the resulting adversarial image x' can be easily randomly perturbed to return to class A and will therefore fail criterion C1.\nHowever, we emphasize that this criterion alone is insuf\ufb01cient against white-box adversaries and can be easily bypassed. In order to make the adversarial image also robust against Gaussian noise, the attacker can optimize the expected adversarial loss under this defense strategy [1] through Monte Carlo sampling of noise vectors during optimization. This effectively produces an adversarial image x'' (see Figure 1) that is deep inside the decision boundary.\n\nFigure 1: Schematic illustration of the shape of adversarial regions near a natural image x.\n\nMore precisely, for a natural image x with correctly predicted label y and target label yt, let h(x) be the predicted class-probability vector. Let us de\ufb01ne padv to be identical to h(x) in every dimension, except for the correct class y and the target yt, where the two probabilities are swapped. Consequently, dimension yt is the dominant prediction in padv. We rede\ufb01ne the adversarial loss of the (targeted) PGD attack to contain two terms:\n\nL\u22c6 = L1 + L2, where L1 = L(h(x'), padv) (misclassify x' as yt), and L2 = E_{\u03b5\u223cN(0,\u03c3\u00b2I)}[\u2016h(x') \u2212 h(x' + \u03b5)\u20161] (bypass C1),\n\n(3)\n\nwhere L(\u00b7,\u00b7) denotes the cross-entropy loss. 
For the \ufb01rst term, we deviate from standard attacks by targeting the probability vector padv instead of the one-hot vector corresponding to label yt. Optimizing against the one-hot vector would cause the adversarial example to over-saturate in probability, which arti\ufb01cially increases the difference \u2206 = \u2016h(x') \u2212 h(x' + \u03b5)\u20161 and makes it easier to detect using criterion C1.\nWe evaluate this white-box attack against criterion C1 using a pre-trained ResNet-101 [20] model on ImageNet [11] as the classi\ufb01cation model. We sample 1,000 images from the ImageNet validation set and optimize the adversarial loss L\u22c6 for each of them using Adam [25] with learning rate 0.005 for a maximum of 400 steps to construct the adversarial images.\nFigure 2 (left) shows the effect of the number of gradient iterations on \u2206 when optimizing the adversarial loss L\u22c6. The center line shows median values of \u2206 across 1,000 sample images, and\n2In gray-box attacks, the adversary has full access to the classi\ufb01er h but is agnostic to the defense 
mechanism.\n\n4\n\n\fFigure 2: The variation in \u2206 under Gaussian perturbations (C1; left plot) and numbers of steps Kt to the decision boundary (C2t; right plot) for adversarial images constructed using different numbers of gradient iterations. Gray-box attacks (orange) can be detected easily with criterion C1 alone (left plot, the orange line is signi\ufb01cantly higher than the gray line). For white-box attacks (blue), C1 alone is not suf\ufb01cient (the blue line overlaps with the gray line) \u2014 however C2 (right plot) separates the two lines reliably when C1 does not.\n\nthe error bars show the range of values between the 30th and 70th quantiles. When the attacker is agnostic to the detector (orange line), i.e., only optimizing L1, \u2206 does not decrease throughout optimization and can be used to perfectly separate adversarial and real images (gray line). 
However, in the white-box attack, the adversarial loss explicitly encourages ∆ to be small, and we observe that indeed the blue line shows a downward trend as the adversary proceeds through gradient iterations. As a result, the range of values for ∆ quickly begins to overlap with and fall below that of real images after 100 steps, which shows that criterion C1 alone cannot be used to detect adversarial examples.

4.2 Criterion 2: Close proximity to decision boundary

The intuitive reason why the attack strategy described above in section 4.1 can successfully fool criterion C1 is that it effectively pushes the adversarial image deep into the decision region of the target class (e.g., x′′ in Figure 1) — an unlikely position for a natural image, which tends to be close to adversarial decision boundaries. Indeed, Fawzi et al. [13] and Shafahi et al. [42] have shown that adversarial examples are inevitable in high-dimensional spaces. Their theoretical arguments suggest that, due to the curse of dimensionality, a sample from the natural image distribution is close to the decision boundary of any classifier with high probability. Hence, we define a second criterion to test if an image is close to the decision boundary of an incorrect class:

C2(t/u): Susceptibility to adversarial noise. For a chosen first-order iterative attack algorithm A, evaluate A on the input x and record the minimum number of steps K required to adversarially perturb x. The input is rejected as adversarial if K is sufficiently large.

Criterion C2 can be further specialized to targeted attacks (C2t) and untargeted attacks (C2u), which measure the proximity (i.e., number of gradient steps) to either a chosen target class or to an arbitrary but different class. We denote these quantities as Kt and Ku, respectively.
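To make the step-counting test behind C2t concrete, the following is a toy sketch, not the paper's implementation: a fixed softmax-regression classifier stands in for the network h, signed-gradient steps stand in for PGD, and the weights W, b, the step size, and the threshold are all made-up placeholders.

```python
import numpy as np

# Toy stand-in for the classifier h: a fixed 3-class softmax regression on
# 2-D inputs. W, b, alpha, and the threshold are illustrative placeholders,
# not values from the paper (which runs PGD on ResNet-101 / VGG-19).
W = np.array([[3.0, -1.0], [-2.0, 2.0], [0.5, 1.5]])
b = np.zeros(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_xent_wrt_x(x, y):
    # Gradient of cross-entropy(softmax(W @ x + b), y) w.r.t. the input x.
    p = softmax(W @ x + b)
    return W.T @ (p - np.eye(3)[y])

def steps_to_boundary(x, target, alpha=0.05, max_steps=100):
    """Criterion C2t: count signed-gradient steps toward `target` until the
    prediction flips. A large count flags the input as adversarial."""
    x = x.copy()
    for k in range(1, max_steps + 1):
        x = x - alpha * np.sign(grad_xent_wrt_x(x, target))
        if np.argmax(W @ x + b) == target:
            return k
    return max_steps  # boundary not reached within the budget

x_natural = np.array([0.4, 0.2])   # classified as class 0 by this model
K_t = steps_to_boundary(x_natural, target=2)
reject = K_t > 20                  # t_C2t, a made-up threshold
print(K_t, reject)                 # a natural input flips within a few steps
```

For a natural input the loop terminates within a handful of steps; an input that needs many more steps than the calibrated threshold tC2t is flagged as adversarial.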
In this paper we choose A in C2 to be the targeted/untargeted PGD attack, but our framework can plausibly generalize to any first-order attack algorithm. Figure 2 (right) shows the effect of optimizing the adversarial loss L⋆ on Kt. Again, the center line shows the median value of Kt across 1,000 images and the error bars indicate the 30th and 70th quantiles. As expected, real images (gray line) require very few steps to reach the decision boundary of any random target class. When the adversary does not seek to bypass criterion C1 (orange line), the constructed adversarial images lie very close to the decision boundary and are indistinguishable from real images with C2 alone (however, here C1 is already sufficient). On the other hand, when the attacker minimizes ∆ to fool criterion C1, the adversarial image moves away from the decision boundary in order to be robust to random Gaussian noise. This results in an increase in the number of steps Kt to reach the decision boundary of a random target class. At 400 steps, there is almost no overlap between the 30th–70th quantiles of values of Kt for real and adversarial images. This separation begins almost precisely as the value of ∆ for adversarial images (left plot) begins to overlap with that of natural images at 100 steps. Thus, C2t becomes an effective criterion to detect adversarial images that optimize against C1.

4.3 Detection strategy

The fact that natural images can simultaneously satisfy criteria C1 and C2 can be regarded as almost paradoxical: while the minimum distance from a natural input to the decision boundary of any incorrect class is small, the density of directions that can lead to a decision boundary within a short distance is also very low.
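A back-of-the-envelope calculation makes this plausible (our informal gloss on the dimensionality arguments of [13, 42]; the numbers below are illustrative):

```latex
% Gaussian noise of per-coordinate scale \sigma in d dimensions
% concentrates at radius \sigma\sqrt{d}:
\eta \sim \mathcal{N}(0, \sigma^2 I_d)
\;\Rightarrow\;
\|\eta\|_2 = \sigma\sqrt{d}\,\bigl(1 + o(1)\bigr) \quad \text{with high probability.}
```

For an ImageNet-sized input (d ≈ 3·224² ≈ 1.5×10⁵), even σ = 0.1 gives ‖η‖₂ ≈ 39: an isotropic perturbation travels far from x yet almost surely misses the thin set of adversarial directions, whereas a gradient step moves along exactly such a direction and therefore crosses the nearby boundary in few steps.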
We postulate that this behavior of natural images is difficult to imitate even for an adaptive, white-box adversary.

Our detection strategy using the two criteria can be summarized as follows. Given an input x (which might be an adversarial example already), we compute (∆, Kt, Ku) and compare these quantities to chosen thresholds (tC1, tC2t, tC2u), corresponding to criteria C1, C2t, and C2u. We reject x as an adversarial example if at least one of the three (sub-)criteria is not satisfied, i.e., if any measurement is larger than the corresponding threshold. Details on hyperparameter selection can be found in the Supplementary Material.

Best effort white-box adversary. Based on our proposed detection method, we define a white-box adversary that aims to cause misclassification while passing the detection criteria C1 and C2. Let L be the adversarial loss for the defense-agnostic (targeted) attack (e.g., Equation 1). We define loss functions L1 and L2 as in Equation 3, following the same strategy used in section 4.1 to bypass C1. Since the criterion C2t is discrete, it is difficult to optimize directly. Instead, we encourage the constructed adversarial image to change prediction to any class y′ ≠ yt after a single gradient step towards y′. As natural images require very few gradient steps to cross the decision boundary, the resulting adversarial image will appear real to criterion C2t. Let

δy′ = ∇x′ LA(h(x′), y′)

denote the gradient of the cross-entropy loss w.r.t. x′ (see footnote 3). The loss term to bypass C2t can be defined as

L3 = E_{y′∼Uniform, y′≠yt}[L(h(x′ − αδy′), y′)],

which encourages x′ − αδy′ — the one-step move towards class y′ at step size α — to be close to or cross the decision boundary of class y′ for every randomly chosen class y′ ≠ yt. Similarly, to bypass criterion C2u, we simulate one gradient step at step size α away from the target class yt (which the defender perceives as the predicted class) as x′ + αδyt. We then encourage this resulting image to be classified as not yt via the loss term:

L4 = −L(h(x′ + αδyt), yt).

Gradients for L3 and L4 can be approximated using Backward Pass Differentiable Approximation (BPDA) [1]. As a result of optimizing L3 and L4, the produced image x′ will admit both a targeted and an untargeted "adversarial example" within one or a few steps of the attack algorithm A, therefore bypassing C2. Combining all the components, the modified adversarial loss L⋆ for a white-box attack against our detector becomes

L⋆ = λL1 + L2 + L3 + L4. (4)

The inclusion of additional loss terms hinders the optimality of L1 and may cause the attack to fail to generate a valid adversarial example. Thus, we include the coefficient λ so that L1 dominates the other loss terms and guarantees a close to 100% success rate in constructing adversarial examples to fool h. We optimize the total loss L⋆ using Adam [25].

5 Experiments

We test our detection mechanism against the white-box attack defined in section 4.3 in several different settings, and release our code publicly for reproducibility (see footnote 4).

5.1 Setup

Datasets and target models.
We conduct our empirical studies on ImageNet [11] and CIFAR-10 [26]. We sample 1,000 images from the ImageNet validation set (one image per class) and 1,000 from the CIFAR-10 test set (100 images per class). We use the pre-trained ResNet-101 model [20] in PyTorch for ImageNet, and train a VGG-19 model [43] with a dropout rate of 0.5 [46] for CIFAR-10, as target models. We additionally include detection results using an Inception-v3 model [47] on ImageNet in the Supplementary Material.

3 We denote the adversarial loss of the algorithm A in our detector by LA to differentiate it from L of the attacker.
4 https://github.com/s-huu/TurningWeaknessIntoStrength

Table 1: Detection rates of different detection algorithms against white-box adversaries on ImageNet. Worst-case performance against all evaluated attacks is underlined for each detector.

Detector
Feature Squeezing
Feature Squeezing

FPR
0.2
0.1

PGD
0.003
0.002

CW
0.000
0.000

C1
C2t
C2u
Combined
C1
C2t
C2u
Combined

0.2
0.2
0.2
0.2
0.1
0.1
0.1
0.1

LR=0.01 LR=0.03 LR=0.1 LR=0.01 LR=0.03 LR=0.1
0.068
0.882
0.039
0.809
0.021
0.806
0.003
0.601

0.585
0.205
0.001
0.494
0.320
0.120
0.000
0.269

0.103
0.800
0.042
0.718
0.044
0.709
0.010
0.482

0.132
0.649
0.001
0.490
0.043
0.483
0.000
0.264

0.066
0.724
0.002
0.612
0.013
0.616
0.000
0.378

0.682
0.436
0.154
0.688
0.486
0.287
0.062
0.512

Figure 3: A sample clean (left) and adversarial (right) image at L∞ perceptibility threshold of τ = 0.1.

Attack algorithms. We evaluate our detection method against the white-box adversary defined in section 4.3. Since the adversary may vary in the choice of the surrogate loss (cf.
L in Equation 3), we experiment using both targeted and untargeted variants of two representative loss functions: the margin loss defined in the Carlini-Wagner (CW) attack [7] (see Equation 2), and the cross-entropy loss used in the Projected Gradient Descent (PGD) attack [1]. The L∞-bound for all attacks is set to τ = 0.1, which is very strong and often produces images with noticeable visual distortion. See Figure 3 for an illustration. We further experiment with the boundary attack [4], attacking the target model and detection mechanism as a black box, in the Supplementary Material.

All attacks optimize the adversarial loss using Adam [25]. We set λ = 2 (cf. Equation 4) for ImageNet and λ = 3 for CIFAR-10 to guarantee a close to 100% attack success rate. We found that changing the maximum number of iterations has little effect on the attack's ability to bypass our detector, and thus we fix it to a reasonable value of 50 steps for ImageNet (which is sufficient to guarantee convergence; see Figure 4) and 200 steps for CIFAR-10. The learning rate has a more noticeable effect, and we evaluate our detector against different chosen values. See the Supplementary Material for detection results against variants of these attacks, including untargeted attacks and τ = 0.03.

Baselines. We compare our detector against two strategies: Feature Squeezing [55] and Artifacts [14]. These detection algorithms are the most similar to ours — using a combination of different criteria as features for the detector. We modify the Artifacts defense slightly to use the density and uncertainty estimates directly by thresholding, rather than training a classifier on top of these features, which has been shown in prior work [6] to remain effective against adversaries that are agnostic to the defense.
With a false positive rate (FPR) of 0.1, Feature Squeezing attains a detection rate of 0.737 on ImageNet and 0.892 on CIFAR-10, while Artifacts attains a detection rate of 0.587 on CIFAR-10. We adopt the same strategy as in section 4.3 to formulate white-box attacks against these detectors, adding a term in the adversarial loss for each criterion and using Backward Pass Differentiable Approximation (BPDA) to compute the gradient of non-differentiable transformations [1]. Details on these modifications can be found in the Supplementary Material.

5.2 Detection results

ImageNet results. Table 1 shows the detection rate of our method against various adversaries on ImageNet. We evaluate our detector under two different settings, resulting in FPRs of 0.1 and 0.2. Entries in the table correspond to the detection rate (or true positive rate) when the white-box adversary defined in section 4.3 is applied to attack the model along with the detector.

Under all six attack settings (PGD vs.
CW, LR = 0.01, 0.03, 0.1), our detector performs substantially better than random, achieving a worst-case detection rate of 0.49 at FPR = 0.2 and 0.264 at FPR = 0.1 on ImageNet.

Table 2: Detection rates of different detection algorithms against white-box adversaries on CIFAR-10. Worst-case performance against all evaluated attacks is underlined for each detector.

Detector
Feature Squeezing
Feature Squeezing
Artifacts
Artifacts

FPR
0.2
0.1
0.2
0.1

PGD
0.074
0.008
0.108
0.090

CW
0.096
0.021
0.018
0.009

C1
C2t
C2u
Combined
C1
C2t
C2u
Combined

0.2
0.2
0.2
0.2
0.1
0.1
0.1
0.1

LR=0.001 LR=0.01 LR=0.1 LR=0.001 LR=0.01 LR=0.1
0.012
0.971
0.000
0.740
0.007
0.858
0.000
0.568

0.033
0.786
0.000
0.481
0.016
0.581
0.000
0.356

0.792
0.346
0.000
0.660
0.207
0.180
0.000
0.187

0.991
0.050
0.000
0.984
0.953
0.015
0.000
0.909

0.422
0.098
0.000
0.374
0.283
0.026
0.000
0.263

1.000
0.024
0.000
0.998
0.986
0.010
0.000
0.966

This result is a considerable improvement over similar detection methods such as Feature Squeezing, where the detection rate is close to 0, i.e., the adversarial images appear "more real" than natural images. We emphasize that given the strong adversary that we evaluate against (τ = 0.1), these detection rates are very difficult to attain against white-box attacks.

Ablation study. We further decompose the components of our detector to demonstrate the trade-offs the adversary must make when attacking our detector. When using different learning rates, the adversary switches between attempting to fool criteria C1 and C2. For example, at LR = 0.01, the PGD adversary can be detected using criterion C1 substantially better than using criterion C2t due to under-optimization of the value ∆. On the other hand, at LR = 0.1, the adversary succeeds in bypassing criterion C1 at the cost of failing C2t.
The criterion C2u does not appear to be effective here, as it consistently achieves a detection rate of close to 0. However, it is a crucial component of our method against untargeted attacks (see Supplementary Material). Overall, our combined detector achieves the best worst-case detection rate across all attack scenarios.

CIFAR-10 results. The detection rates for our method are slightly worse on CIFAR-10 (Table 2) but still outperform the Feature Squeezing and Artifacts baselines, which are close to 0 in the worst case. For this dataset, criterion C2u becomes ineffective due to the over-saturation of predicted probabilities for clean images, causing untargeted perturbation to take excessively many steps. Furthermore, the CIFAR-10 dataset violates both of our hypotheses regarding the distribution of adversarial perturbations near a natural image. Models trained on CIFAR-10 are much less robust to random Gaussian noise due to lack of data augmentation and poor diversity of training samples — the VGG-19 model could only tolerate Gaussian noise of σ = 0.01, as opposed to σ = 0.1 for ResNet-101 on ImageNet. Moreover, CIFAR-10 is much lower-dimensional than ImageNet, hence natural images are comparatively farther from the decision boundary [13, 42]. Given this observation, we suggest that our detector be used only in situations where these two assumptions can be satisfied.

Gray-box detection results. Despite the fact that our detection mechanism is formulated against white-box adversaries, for completeness we also evaluate it against a gray-box adversary with knowledge of the underlying model but not of the detector.

Table 3 shows detection rates for gray-box attacks at FPRs of 0.05 and 0.1 on ImageNet. At perceptibility bound τ = 0.03, the combined detector is very successful at detecting the generated adversarial images, achieving a detection rate of 97.6% at 5% FPR.
In comparison, Feature Squeezing could only achieve a detection rate of 30.4% against the CW attack. Against the much stronger adversary at τ = 0.1, both detectors perform significantly worse, but our combined detector still achieves a non-trivial detection rate.

Table 3: Detection rates for variations of the gray-box adversary on ImageNet. Worst-case performance against all evaluated attacks is underlined for each detector.

CW PGD
0.572
0.304
0.672
0.336
0.896
0.981
0.989
0.915

Feature Squeezing
Feature Squeezing
Ours: Combined
Ours: Combined

PGD
0.669
0.758
0.976
0.990

CW
0.014
0.020
0.570
0.678

Detector

FPR

0.05
0.1
0.05
0.1

τ = 0.03

τ = 0.1

5.3 Adversarial loss curves

To further substantiate our claim that the criteria C1 and C2t/u are mutually exclusive, we plot the value of different components of the adversarial loss L⋆ throughout optimization for the white-box attack (PGD) on ImageNet. The center lines in Figure 4 show the average loss for each Li over 1,000 images and the shaded areas indicate one standard deviation. Since the primary goal is to cause misclassification, the term L1 (blue line) shows a steady descending trend throughout optimization, and its value has stabilized after 50 iterations. L2 (orange line) begins at a low value because the initialization is a natural image (and hence robust against Gaussian noise), and after 50 iterations it returns to its initial level, which shows that the adversary is successful at bypassing criterion C1. However, this success comes at the cost of L3 (red line) failing to reduce to a sufficiently low level due to its inherent conflict with L2 (and L1), hence criterion C2t can be used to detect the resulting adversarial image.

Figure 4: Plot of different components of the adversarial loss L⋆.
See text for details.

5.4 Detection times

One drawback of our method is its (relatively) high computation cost. Criteria C2t/u require executing a gradient-based attack until either the label changes or a specified number of steps is reached. To limit the number of false positives, the upper threshold on the number of gradient steps must be sufficiently high, dominating the running time of the detection algorithm. Table 4 shows the average per-image detection time for both real and (targeted) adversarial images on ImageNet and CIFAR-10. On both datasets, the average detection time for real images is approximately 5 seconds and is largely due to the large threshold for C2u. The situation is similar for adversarial images: as the CW attack optimizes the margin loss, taking the adversarial images much farther into the decision region of the target class, it takes longer (many more steps to undo via C2t/u) to detect them.

Table 4: Running time of different components of our detection algorithm on ImageNet and CIFAR-10. See text for details.

                  Real     PGD      CW
ImageNet    C1    0.074s   0.091s   0.107s
            C2t   0.403s   1.057s   3.46s
            C2u   4.512s   0.138s   0.241s
CIFAR-10    C1    0.011s   0.013s   0.012s
            C2t   0.379s   0.128s   0.27s
            C2u   5.230s   0.055s   9.631s

6 Conclusion

We have shown that our detection method achieves substantially improved resistance to white-box adversaries compared to prior work. In contrast to other detection algorithms that combine multiple criteria, the criteria used in our method are mutually exclusive — optimizing one will negatively affect the other — yet both hold inherently for natural images. While we do not suggest that our method is impervious to white-box attacks, it does present a significant hurdle to overcome and raises the bar for any potential adversary.
There are, however, some limitations to our method.
The running time of our detector is dominated by testing criterion C2, which involves running an iterative gradient-based attack algorithm. The high computation cost may make our detector unsuitable for some deployments. Furthermore, it is fair to say that the false positive rate remains relatively high due to the large variance of the statistics ∆, Kt, and Ku for the different criteria; hence a threshold-based test cannot completely separate real and adversarial inputs. Future research that improves on either front would make our method more practical for real-world systems.

Acknowledgments

C.G., W-L.C., K.Q.W. are supported by grants from the NSF (III-1618134, III-1526012, IIS-1149882, IIS-1724282, and TRIPODS-1740822), the Bill and Melinda Gates Foundation, and the Cornell Center for Materials Research with funding from the NSF MRSEC program (DMR-1719875); and are also supported by Zillow, SAP America Inc., and Facebook. We thank Pin-Yu Chen (IBM) for constructive discussions.

References

[1] A. Athalye, N. Carlini, and D. A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018. 1, 3, 4, 6, 7

[2] V. Behzadan and A. Munir. Vulnerability of deep reinforcement learning to policy induction attacks. CoRR, abs/1701.04143, 2017. 1

[3] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Proc. ECML, pages 387–402, 2013. 1

[4] W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. CoRR, abs/1712.04248, 2017. 7

[5] J. Buckman, A. Roy, C. Raffel, and I. J. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples.
In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. 1

[6] N. Carlini and D. Wagner. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. In the 10th ACM Workshop on Artificial Intelligence and Security, 2017. 1, 4, 7

[7] N. Carlini and D. Wagner. Towards Evaluating the Robustness of Neural Networks. IEEE Symposium on Security and Privacy, 2017. 2, 3, 7

[8] N. Carlini and D. A. Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. CoRR, abs/1801.01944, 2018. 1

[9] P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh. ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec@CCS 2017, Dallas, TX, USA, November 3, 2017, pages 15–26, 2017. 2

[10] M. Cisse, Y. Adi, N. Neverova, and J. Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017. 1

[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, pages 248–255. IEEE, 2009. 2, 4, 6

[12] G. S. Dhillon, K. Azizzadenesheli, Z. C. Lipton, J. Bernstein, J. Kossaifi, A. Khanna, and A. Anandkumar. Stochastic activation pruning for robust adversarial defense. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. 1

[13] A. Fawzi, H. Fawzi, and O. Fawzi. Adversarial vulnerability for any classifier. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 1186–1195, 2018. 1, 2, 5, 8

[14] R. Feinman, R. R. Curtin, S. Shintre, and A. B.
Gardner. Detecting Adversarial Samples from Artifacts. ArXiv e-prints, 2017. 3, 7

[15] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016. 3

[16] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. International Conference on Learning Representation (ICLR), 2015. 3

[17] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel. On the (Statistical) Detection of Adversarial Examples. arXiv e-prints, 2017. 1, 3

[18] C. Guo, J. R. Gardner, Y. You, A. G. Wilson, and K. Q. Weinberger. Simple black-box adversarial attacks. CoRR, abs/1905.07121, 2019. 2

[19] C. Guo, M. Rana, M. Cisse, and L. van der Maaten. Countering Adversarial Images using Input Transformations. International Conference on Learning Representation (ICLR), 2018. 1, 4

[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016. 2, 4, 6

[21] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel. Adversarial attacks on neural network policies. CoRR, abs/1702.02284, 2017. 1

[22] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box adversarial attacks with limited queries and information. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 2142–2151, 2018. 2

[23] A. Ilyas, L. Engstrom, and A. Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. CoRR, abs/1807.07978, 2018. 2

[24] H. Kannan, A. Kurakin, and I. Goodfellow. Adversarial Logit Pairing. ArXiv e-prints, 2018. 1

[25] D. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. 3, 4, 6, 7

[26] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. 6

[27] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial Machine Learning at Scale. International Conference on Learning Representation (ICLR), 2017. 3

[28] X. Li and F. Li. Adversarial examples detection in deep networks with convolutional filter statistics. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5775–5783, 2017. 1, 3

[29] X. Liu, M. Cheng, H. Zhang, and C. Hsieh. Towards robust neural networks via random self-ensemble. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, pages 381–397, 2018. 1

[30] Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016. 2

[31] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. N. R. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle, and J. Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. 1, 3

[32] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. International Conference on Learning Representation (ICLR), 2018. 3

[33] D. Meng and H. Chen. Magnet: A two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017, pages 135–147, 2017. 1, 3

[34] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff.
On detecting adversarial perturbations.\nIn 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April\n24-26, 2017, Conference Track Proceedings, 2017. 3\n\n[35] A. Nitin Bhagoji, D. Cullina, C. Sitawarin, and P. Mittal. Enhancing Robustness of Machine\nLearning Systems via Data Transformations. 52nd Annual Conference on Information Sciences\nand Systems (CISS), 2018. 1, 3\n\n[36] T. Pang, C. Du, Y. Dong, and J. Zhu. Towards robust detection of adversarial examples. In\n\nAdvances in Neural Information Processing Systems, pages 4579\u20134589, 2018. 3\n\n[37] N. Papernot, P. D. McDaniel, I. J. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-\nbox attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on\nComputer and Communications Security, AsiaCCS 2017, Abu Dhabi, United Arab Emirates,\nApril 2-6, 2017, pages 506\u2013519, 2017. 2\n\n[38] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. A. Storer. De\ufb02ecting adversarial attacks\nwith pixel de\ufb02ection. In 2018 IEEE Conference on Computer Vision and Pattern Recognition,\nCVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8571\u20138580, 2018. 1\n\n[39] A. Raghunathan, J. Steinhardt, and P. Liang. Certi\ufb01ed Defenses against Adversarial Examples.\n\nInternational Conference on Learning Representation (ICLR), 2018. 1\n\n[40] K. Roth, Y. Kilcher, and T. Hofmann. The odds are odd: A statistical test for detecting adversarial\nexamples. In Proceedings of the 36th International Conference on Machine Learning (ICML),\n2019. 4\n\n[41] P. Samangouei, M. Kabkab, and R. Chellappa. Defense-gan: Protecting classi\ufb01ers against\n\nadversarial attacks using generative models. CoRR, abs/1805.06605, 2018. 1\n\n[42] A. Shafahi, W. R. Huang, C. Studer, S. Feizi, and T. Goldstein. Are adversarial examples\n\ninevitable? CoRR, abs/1809.02104, 2018. 1, 2, 5, 8\n\n[43] K. Simonyan and A. Zisserman. 
Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. 6

[44] A. Sinha, H. Namkoong, and J. C. Duchi. Certifying some distributional robustness with principled adversarial training. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. 1

[45] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. 1

[46] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. 6

[47] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015. 6

[48] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. International Conference on Machine Learning (ICML), 2014. 1, 2, 4

[49] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. D. McDaniel. Ensemble adversarial training: Attacks and defenses. CoRR, abs/1705.07204, 2017. 2

[50] C. Tu, P. Ting, P. Chen, S. Liu, H. Zhang, J. Yi, C. Hsieh, and S. Cheng. Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. CoRR, abs/1805.11770, 2018.

[51] J. Uesato, B. O'Donoghue, P. Kohli, and A. van den Oord. Adversarial risk and the dangers of evaluating against weak attacks.
In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5032–5041, 2018. 2

[52] E. Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. International Conference on Machine Learning (ICML), 2017. 1

[53] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. L. Yuille. Mitigating adversarial effects through randomization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. 4

[54] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. L. Yuille. Adversarial examples for semantic segmentation and object detection. In ICCV, pages 1378–1387. IEEE Computer Society, 2017. 1

[55] W. Xu, D. Evans, and Y. Qi. Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. Network and Distributed Systems Security Symposium (NDSS), 2018. 1, 3, 7