{"title": "Adversarial Training and Robustness for Multiple Perturbations", "book": "Advances in Neural Information Processing Systems", "page_first": 5866, "page_last": 5876, "abstract": "Defenses against adversarial examples, such as adversarial training, are typically tailored to a single perturbation type (e.g., small $\\ell_\\infty$-noise). For other perturbations, these defenses offer no guarantees and, at times, even increase the model's vulnerability.\nOur aim is to understand the reasons underlying this robustness trade-off, and to train models that are simultaneously robust to multiple perturbation types.\n\nWe prove that a trade-off in robustness to different types of $\\ell_p$-bounded and spatial perturbations must exist in a natural and simple statistical setting.\nWe corroborate our formal analysis by demonstrating similar robustness trade-offs on MNIST and CIFAR10. We propose new multi-perturbation adversarial training schemes, as well as an efficient attack for the $\\ell_1$-norm, and use these to show that models trained against multiple attacks fail to achieve robustness competitive with that of models trained on each attack individually. \nIn particular, we find that adversarial training with first-order $\\ell_\\infty, \\ell_1$ and $\\ell_2$ attacks on MNIST achieves merely $50\\%$ robust accuracy, partly because of gradient-masking.\nFinally, we propose affine attacks that linearly interpolate between perturbation types and further degrade the accuracy of adversarially trained models.", "full_text": "Adversarial Training and Robustness for\n\nMultiple Perturbations\n\nFlorian Tram\u00e8r\nStanford University\n\nDan Boneh\n\nStanford University\n\nAbstract\n\nDefenses against adversarial examples, such as adversarial training, are typically\ntailored to a single perturbation type (e.g., small (cid:96)\u221e-noise). For other perturbations,\nthese defenses offer no guarantees and, at times, even increase the model\u2019s vulnera-\nbility. 
Our aim is to understand the reasons underlying this robustness trade-off, and to train models that are simultaneously robust to multiple perturbation types. We prove that a trade-off in robustness to different types of ℓp-bounded and spatial perturbations must exist in a natural and simple statistical setting. We corroborate our formal analysis by demonstrating similar robustness trade-offs on MNIST and CIFAR10. We propose new multi-perturbation adversarial training schemes, as well as an efficient attack for the ℓ1-norm, and use these to show that models trained against multiple attacks fail to achieve robustness competitive with that of models trained on each attack individually. In particular, we find that adversarial training with first-order ℓ∞, ℓ1 and ℓ2 attacks on MNIST achieves merely 50% robust accuracy, partly because of gradient-masking. Finally, we propose affine attacks that linearly interpolate between perturbation types and further degrade the accuracy of adversarially trained models.

1 Introduction

Adversarial examples [37, 15] are proving to be an inherent blind-spot in machine learning (ML) models. Adversarial examples highlight the tendency of ML models to learn superficial and brittle data statistics [19, 13, 18], and present a security risk for models deployed in cyber-physical systems (e.g., virtual assistants [5], malware detectors [16] or ad-blockers [39]).

Known successful defenses are tailored to a specific perturbation type (e.g., a small ℓp-ball [25, 28, 42] or small spatial transforms [11]). These defenses provide empirical (or certifiable) robustness guarantees for one perturbation type, but typically offer no guarantees against other attacks [35, 31]. Worse, increasing robustness to one perturbation type has sometimes been found to increase vulnerability to others [11, 31].
This leads us to the central problem considered in this paper:

Can we achieve adversarial robustness to different types of perturbations simultaneously?

Note that even though prior work has attained robustness to different perturbation types [25, 31, 11], these results may not compose. For instance, an ensemble of two classifiers, each of which is robust to a single type of perturbation, may be robust to neither perturbation. Our aim is to study the extent to which it is possible to learn models that are simultaneously robust to multiple types of perturbation.

To gain intuition about this problem, we first study a simple and natural classification task that has been used to analyze trade-offs between standard and adversarial accuracy [41], and the sample-complexity of adversarial generalization [30]. We define Mutually Exclusive Perturbations (MEPs) as pairs of perturbation types for which robustness to one type implies vulnerability to the other. For this task, we prove that ℓ∞ and ℓ1-perturbations are MEPs and that ℓ∞-perturbations and input rotations

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Robustness trade-off on MNIST (top) and CIFAR10 (bottom). (a) MNIST models trained on ℓ1, ℓ2 & ℓ∞ attacks. (b) MNIST models trained on ℓ∞ and RT attacks. (c) CIFAR10 models trained on ℓ1 and ℓ∞ attacks. (d) CIFAR10 models trained on ℓ∞ and RT attacks. For a union of ℓp-balls (left), or of ℓ∞-noise and rotation-translations (RT) (right), we train models Advmax on the strongest perturbation type for each input. We report the test accuracy of Advmax against each individual perturbation type (solid line) and against their union (dotted brown line).
The horizontal lines show the adversarial accuracy of models trained and evaluated on a single perturbation type.

and translations [11] are also MEPs. Moreover, for these MEP pairs, we find that robustness to either perturbation type requires fundamentally different features. The existence of such a trade-off for this simple classification task suggests that it may be prevalent in more complex statistical settings.

To complement our formal analysis, we introduce new adversarial training schemes for multiple perturbations. For each training point, these schemes build adversarial examples for all perturbation types and then train either on all examples (the "avg" strategy) or only on the worst example (the "max" strategy). These two strategies respectively minimize the average error rate across perturbation types, or the error rate against an adversary that picks the worst perturbation type for each input.

For adversarial training to be practical, we also need efficient and strong attacks [25]. We show that Projected Gradient Descent [22, 25] is inefficient in the ℓ1-case, and design a new attack, Sparse ℓ1 Descent (SLIDE), that is both efficient and competitive with strong optimization attacks [8].

We experiment with MNIST and CIFAR10. MNIST is an interesting case-study, as distinct models from prior work attain strong robustness to all perturbations we consider [25, 31, 11], yet no single classifier is robust to all attacks [31, 32, 11]. For models trained on multiple ℓp-attacks (ℓ1, ℓ2, ℓ∞ for MNIST, and ℓ1, ℓ∞ for CIFAR10), or on both ℓ∞ and spatial transforms [11], we confirm a noticeable robustness trade-off. Figure 1 plots the test accuracy of models Advmax trained using our "max" strategy.
In all cases, robustness to multiple perturbations comes at a cost, usually of 5-10% additional error, compared to models trained against each attack individually (the horizontal lines).

Robustness to ℓ1, ℓ2 and ℓ∞-noise on MNIST is a striking failure case, where the robustness trade-off is compounded by gradient-masking [27, 40, 1]. Extending prior observations [25, 31, 23], we show that models trained against an ℓ∞-adversary learn representations that mask gradients for attacks in other ℓp-norms. When trained against first-order ℓ1, ℓ2 and ℓ∞-attacks, the model learns to resist ℓ∞-attacks while giving the illusion of robustness to ℓ1 and ℓ2 attacks. This model only achieves 52% accuracy when evaluated on gradient-free attacks [3, 31]. This shows that, unlike previously thought [41], adversarial training with strong first-order attacks can suffer from gradient-masking. We thus argue that attaining robustness to ℓp-noise on MNIST requires new techniques (e.g., training on expensive gradient-free attacks, or scaling certified defenses to multiple perturbations).

MNIST has sometimes been said to be a poor dataset for evaluating adversarial example defenses, as some attacks are easy to defend against (e.g., input-thresholding or binarization works well for ℓ∞-attacks [41, 31]).
Our results paint a more nuanced view: the simplicity of these ℓ∞-defenses becomes a disadvantage when training against multiple ℓp-norms. We thus believe that MNIST should not be abandoned as a benchmark just yet. Our inability to achieve multi-ℓp robustness for this simple dataset raises questions about the viability of scaling current defenses to more complex tasks.

Looking beyond adversaries that choose from a union of perturbation types, we introduce a new affine adversary that may linearly interpolate between perturbations (e.g., by compounding ℓ∞-noise with a small rotation). We prove that for locally-linear models, robustness to a union of ℓp-perturbations implies robustness to affine attacks. In contrast, affine combinations of ℓ∞ and spatial perturbations are provably stronger than either perturbation individually. We show that this discrepancy translates to neural networks trained on real data. Thus, in some cases, attaining robustness to a union of perturbation types remains insufficient against a more creative adversary that composes perturbations.

Our results show that despite recent successes in achieving robustness to single perturbation types, many obstacles remain towards attaining truly robust models.
Beyond the robustness trade-off, efficient computational scaling of current defenses to multiple perturbations remains an open problem. The code used for all of our experiments can be found here: https://github.com/ftramer/MultiRobustness

Proofs of all theorems, experimental setups, and additional experiments are in the full version of this extended abstract [38].

2 Theoretical Limits to Multi-perturbation Robustness

We study statistical properties of adversarial robustness in a natural statistical model introduced in [41], which exhibits many phenomena observed on real data, such as trade-offs between robustness and accuracy [41] or a higher sample complexity for robust generalization [31]. This model also proves useful in analyzing and understanding adversarial robustness for multiple perturbations. Indeed, we prove a number of results that correspond to phenomena we observe on real data, in particular trade-offs in robustness to different ℓp or rotation-translation attacks [11].

We follow a line of works that study distributions for which adversarial examples exist unconditionally [41, 21, 33, 12, 14, 26]. These distributions, including ours, are much simpler than real-world data, and thus need not be evidence that adversarial examples are inevitable in practice. Rather, we hypothesize that current ML models are highly vulnerable to adversarial examples because they learn superficial data statistics [19, 13, 18] that share some properties of these simple distributions.

In prior work, a robustness trade-off for ℓ∞ and ℓ2-noise is shown in [21] for data distributed over two concentric spheres. Our conceptually simpler model has the advantage of yielding results beyond ℓp-norms (e.g., for spatial attacks) and which apply symmetrically to both classes. Building on work by Xu et al. [43], Demontis et al.
[9] show a robustness trade-off for dual norms (e.g., ℓ∞ and ℓ1-noise) in linear classifiers.

2.1 Adversarial Risk for Multiple Perturbation Models

Consider a classification task for a distribution D over examples x ∈ R^d and labels y ∈ [C]. Let f : R^d → [C] denote a classifier and let l(f(x), y) be the zero-one loss (i.e., 1_{f(x) ≠ y}).

We assume n perturbation types, each characterized by a set S of allowed perturbations for an input x. The set S can be an ℓp-ball [37, 15] or capture other perceptually small transforms such as image rotations and translations [11]. For a perturbation r ∈ S, an adversarial example is x̂ = x + r (this is pixel-wise addition for ℓp perturbations, but can be a more complex operation, e.g., for rotations).

For a perturbation set S and model f, we define R_adv(f; S) := E_{(x,y)∼D}[max_{r∈S} l(f(x + r), y)] as the adversarial error rate. To extend R_adv to multiple perturbation sets S1, ..., Sn, we can consider the average error rate over the Si, denoted R^avg_adv. This metric most clearly captures the trade-off in robustness across independent perturbation types, but is not the most appropriate from a security perspective on adversarial examples. A more natural metric, denoted R^max_adv, is the error rate against an adversary that picks, for each input, the worst perturbation from the union of the Si. More formally,

R^max_adv(f; S1, ..., Sn) := R_adv(f; ∪_i Si),    R^avg_adv(f; S1, ..., Sn) := (1/n) Σ_i R_adv(f; Si).    (1)

Most results in this section are lower bounds on R^avg_adv, which also hold for R^max_adv since R^max_adv ≥ R^avg_adv.
Two perturbation types S1, S2 are Mutually Exclusive Perturbations (MEPs) if R^avg_adv(f; S1, S2) ≥ 1/|C| for all models f (i.e., no model has non-trivial average risk against both perturbations).

2.2 A binary classification task

We analyze the adversarial robustness trade-off for different perturbation types in a natural statistical model introduced by Tsipras et al. [41]. Their binary classification task consists of input-label pairs (x, y) sampled from a distribution D as follows (note that D is (d + 1)-dimensional):

y ∼ u.a.r. {−1, +1},    x0 = { +y w.p. p0;  −y w.p. 1 − p0 },    x1, ..., xd ∼ i.i.d. N(yη, 1),    (2)

where p0 ≥ 0.5, N(µ, σ²) is the normal distribution, and η = α/√d for some positive constant α.

For this distribution, Tsipras et al. [41] show a trade-off between standard and adversarial accuracy (for ℓ∞ attacks), by drawing a distinction between the "robust" feature x0 that small ℓ∞-noise cannot manipulate, and the "non-robust" features x1, ..., xd that can be fully overridden by small ℓ∞-noise.

2.3 Small ℓ∞ and ℓ1 Perturbations are Mutually Exclusive

The starting point of our analysis is the observation that the robustness of a feature depends on the considered perturbation type. To illustrate, we recall two classifiers from [41] that operate on disjoint feature sets.
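The distribution D of Eq. (2) and these two classifiers are easy to simulate. The sketch below is our own illustration with synthetic parameters (d, α, p0, n), and the two perturbations are simple hand-crafted worst-case attacks against these particular classifiers, not general-purpose attacks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha, p0, n = 10_000, 2.0, 0.9, 2_000
eta = alpha / np.sqrt(d)

# Sample n points (x, y) from the (d+1)-dimensional distribution D of Eq. (2).
y = rng.choice([-1.0, 1.0], size=n)
x0 = np.where(rng.random(n) < p0, y, -y)
xr = rng.normal(y[:, None] * eta, 1.0, size=(n, d))  # features x1..xd

def f(x0, xr):  # classifier using only the feature x0
    return np.sign(x0)

def h(x0, xr):  # classifier using only the features x1..xd
    return np.sign(xr.sum(axis=1))

def acc(pred):
    return float((pred == y).mean())

# l_inf perturbation of size 2*eta: shift every feature x1..xd against the
# label; x0 (with |x0| = 1) cannot be flipped by noise this small.
xr_linf = xr - y[:, None] * 2 * eta

# l_1 perturbation of size 2: spend the whole budget flipping x0 alone
# (only where x0 agrees with y; already-misclassified points are left alone).
x0_l1 = x0 - 2 * np.sign(x0) * (x0 == y)

acc_f_linf, acc_h_linf = acc(f(x0, xr_linf)), acc(h(x0, xr_linf))
acc_f_l1, acc_h_l1 = acc(f(x0_l1, xr)), acc(h(x0_l1, xr))
```

With these parameters, f keeps roughly p0 accuracy under the ℓ∞ shift while h collapses, and the situation reverses under the ℓ1 flip, previewing the tension formalized in Theorem 1.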
The first, f(x) = sign(x0), achieves accuracy p0 for all ℓ∞-perturbations with ε < 1, but is highly vulnerable to ℓ1-perturbations of size ε ≥ 1. The second classifier, h(x) = sign(Σ_{i=1}^d xi), is robust to ℓ1-perturbations of average norm below E[Σ_{i=1}^d xi] = Θ(√d), yet it is fully subverted by an ℓ∞-perturbation that shifts the features x1, ..., xd by ±2η = Θ(1/√d). We prove that this tension between ℓ∞ and ℓ1 robustness, and of the choice of "robust" features, is inherent for this task:

Theorem 1. Let f be a classifier for D. Let S∞ be the set of ℓ∞-bounded perturbations with ε = 2η, and S1 the set of ℓ1-bounded perturbations with ε = 2. Then, R^avg_adv(f; S∞, S1) ≥ 1/2.

The proof is in Appendix F. The bound shows that no classifier can attain better R^avg_adv (and thus R^max_adv) than a trivial constant classifier f(x) = 1, which satisfies R_adv(f; S∞) = R_adv(f; S1) = 1/2.

Similar to [9], our analysis extends to arbitrary dual norms ℓp and ℓq with 1/p + 1/q = 1 and p < 2. The perturbation required to flip the features x1, ..., xd has an ℓp-norm of Θ(d^{1/p − 1/2}) = ω(1) and an ℓq-norm of Θ(d^{1/q − 1/2}) = Θ(d^{1/2 − 1/p}) = o(1). Thus, feature x0 is more robust than features x1, ..., xd with respect to the ℓq-norm, whereas for the dual ℓp-norm the situation is reversed.

2.4 Small ℓ∞ and Spatial Perturbations are (nearly) Mutually Exclusive

We now analyze two other orthogonal perturbation types, ℓ∞-noise and rotation-translations [11]. In some cases, increasing robustness to ℓ∞-noise has been shown to decrease robustness to rotation-translations [11]. We prove that such a trade-off is inherent for our binary classification task.

To reason about rotation-translations, we assume that the features xi form a 2D grid.
We also let x0 be distributed as N(y, α^{−2}), a technicality that does not qualitatively change our prior results. Note that the distribution of the features x1, ..., xd is permutation-invariant. Thus, the only power of a rotation-translation adversary is to "move" feature x0. Without loss of generality, we identify a small rotation-translation of an input x with a permutation of its features that sends x0 to one of N fixed positions (e.g., with translations of ±3px as in [11], x0 can be moved to N = 49 different positions).

A model can be robust to these permutations by ignoring the N positions that feature x0 can be moved to, and focusing on the remaining permutation-invariant features. Yet, this model is vulnerable to ℓ∞-noise, as it ignores x0. In turn, a model that relies on feature x0 can be robust to ℓ∞-perturbations, but is vulnerable to a spatial perturbation that "hides" x0 among other features. Formally, we show:

Theorem 2. Let f be a classifier for D (with x0 ∼ N(y, α^{−2})). Let S∞ be the set of ℓ∞-bounded perturbations with ε = 2η, and S_RT be the set of perturbations for an RT adversary with budget N. Then, R^avg_adv(f; S∞, S_RT) ≥ 1/2 − O(1/√N).

The proof, given in Appendix G, is non-trivial and yields an asymptotic lower bound on R^avg_adv. We can also provide tight numerical estimates for concrete parameter settings (see Appendix G.1).

2.5 Affine Combinations of Perturbations

We defined R^max_adv as the error rate against an adversary that may choose a different perturbation type for each input. If a model were robust to this adversary, what can we say about the robustness to a more creative adversary that combines different perturbation types?
To answer this question, we introduce a new adversary that mixes different attacks by linearly interpolating between perturbations. For a perturbation set S and β ∈ [0, 1], we denote by β·S the set of perturbations scaled down by β. For an ℓp-ball with radius ε, this is the ball with radius β·ε. For rotation-translations, the attack budget N is scaled to β·N. For two sets S1, S2, we define S_affine(S1, S2) as the set of perturbations that compound a perturbation r1 ∈ β·S1 with a perturbation r2 ∈ (1 − β)·S2, for any β ∈ [0, 1].

Consider one adversary that chooses, for each input, ℓp or ℓq-noise from balls Sp and Sq, for p, q > 0. The affine adversary picks perturbations from the set S_affine defined as above. We show:

Claim 3. For a linear classifier f(x) = sign(w^T x + b), we have R^max_adv(f; Sp, Sq) = R_adv(f; S_affine).

Thus, for linear classifiers, robustness to a union of ℓp-perturbations implies robustness to affine adversaries (this holds for any distribution). The proof, in Appendix H, extends to models that are locally linear within the balls Sp and Sq around the data points. For the distribution D of Section 2.2, we can further show that there are settings (distinct from the one in Theorem 1) where: (1) robustness against a union of ℓ∞ and ℓ1-perturbations is possible; (2) this requires the model to be non-linear; (3) yet, robustness to affine adversaries is impossible (see Appendix I for details). Our experiments in Section 4 show that neural networks trained on CIFAR10 have a behavior that is consistent with locally-linear models, in that they are as robust to affine adversaries as against a union of ℓp-attacks. In contrast, compounding ℓ∞ and spatial perturbations yields a stronger attack, even for linear models:

Theorem 4.
Let f(x) = sign(w^T x + b) be a linear classifier for D (with x0 ∼ N(y, α^{−2})). Let S∞ be some ℓ∞-ball and S_RT be rotation-translations with budget N > 2. Define S_affine as above. Assume w0 > wi > 0 for all i ∈ [1, d]. Then R_adv(f; S_affine) > R^max_adv(f; S∞, S_RT).

This result (the proof is in Appendix J) draws a distinction between the strength of affine combinations of ℓp-noise, and combinations of ℓ∞ and spatial perturbations. It also shows that robustness to a union of perturbations can be insufficient against a more creative affine adversary. These results are consistent with behavior we observe in models trained on real data (see Section 4).

3 New Attacks and Adversarial Training Schemes

We complement our theoretical results with empirical evaluations of the robustness trade-off on MNIST and CIFAR10. To this end, we first introduce new adversarial training schemes tailored to the multi-perturbation risks defined in Equation (1), as well as a novel attack for the ℓ1-norm.

Multi-perturbation adversarial training. Let

R̂_adv(f; S) = Σ_{i=1}^m max_{r∈S} L(f(x^(i) + r), y^(i))

be the empirical adversarial risk, where L is the training loss and D is the training set. For a single perturbation type, R̂_adv can be minimized with adversarial training [25]: the maximal loss is approximated by an attack procedure A(x), such that max_{r∈S} L(f(x + r), y) ≈ L(f(A(x)), y). For i ∈ [1, n], let Ai be an attack for the perturbation set Si. The two multi-attack robustness metrics introduced in Equation (1) immediately yield the following natural adversarial training strategies:
1. "Max" strategy: For each input x, we train on the strongest adversarial example from all attacks, i.e., the max in R̂_adv is replaced by L(f(A_{k*}(x)), y), for k* = arg max_k L(f(A_k(x)), y).

2. "Avg" strategy: This strategy simultaneously trains on adversarial examples from all attacks. That is, the max in R̂_adv is replaced by (1/n) Σ_{i=1}^n L(f(A_i(x)), y).

The sparse ℓ1-descent attack (SLIDE). Adversarial training is contingent on a strong and efficient attack. Training on weak attacks gives no robustness [40], while strong optimization attacks (e.g., [6, 8]) are prohibitively expensive.

Algorithm 1: The Sparse ℓ1 Descent Attack (SLIDE).
  Input: input x ∈ [0, 1]^d, steps k, step-size γ, percentile q, ℓ1-bound ε
  Output: x̂ = x + r s.t. ‖r‖1 ≤ ε
  r ← 0^d
  for 1 ≤ i ≤ k do
      g ← ∇_r L(θ, x + r, y)
      e_i = sign(g_i) if |g_i| ≥ P_q(|g|), else 0
      r ← r + γ · e/‖e‖1
      r ← Π_{S1^ε}(r)
  end
  Here P_q(|g|) denotes the q-th percentile of |g| and Π_{S1^ε} is the projection onto the ℓ1-ball (see [10]).

Projected Gradient Descent (PGD) [22, 25] is a popular choice of attack that is both efficient and produces strong perturbations. To complement our formal results, we want to train models on ℓ1-perturbations. Yet, we show that the ℓ1-version of PGD is highly inefficient, and propose a better approach suitable for adversarial training.

PGD is a steepest descent algorithm [24]. In each iteration, the perturbation is updated in the steepest descent direction arg max_{‖v‖≤1} v^T g, where g is the gradient of the loss. For the ℓ∞-norm, the steepest descent direction is sign(g) [15], and for ℓ2, it is g/‖g‖2.
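For reference, these steepest-descent directions each admit a one-line implementation. The sketch below is our own illustration (the ℓ1 case shown is the single-coordinate direction discussed next):

```python
import numpy as np

def steepest_descent_direction(g: np.ndarray, norm: str) -> np.ndarray:
    """Direction v with ||v|| <= 1 (in the given norm) maximizing v . g."""
    if norm == "linf":
        return np.sign(g)                       # l_inf: move every coordinate
    if norm == "l2":
        return g / (np.linalg.norm(g) + 1e-12)  # l_2: follow the gradient
    if norm == "l1":
        v = np.zeros_like(g)                    # l_1: only the single
        i = np.argmax(np.abs(g))                # largest-|gradient|
        v[i] = np.sign(g[i])                    # coordinate moves
        return v
    raise ValueError(f"unknown norm: {norm}")
```

For a gradient g = (3, −1, 0.5), the ℓ1 direction is (1, 0, 0), which is why plain ℓ1-steepest descent updates only one pixel per iteration.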
For the ℓ1-norm, the steepest descent direction is the unit vector e with e_{i*} = sign(g_{i*}), for i* = arg max_i |g_i|. This yields an inefficient attack, as each iteration updates a single index of the perturbation r. We thus design a new attack with finer control over the sparsity of an update step. For q ∈ [0, 1], let P_q(|g|) be the q-th percentile of |g|. We set e_i = sign(g_i) if |g_i| ≥ P_q(|g|) and 0 otherwise, and normalize e to unit ℓ1-norm. For q ≫ 1/d, we thus update many indices of r at once. We introduce another optimization to handle clipping, by ignoring gradient components where the update step cannot make progress (i.e., where x_i + r_i ∈ {0, 1} and g_i points outside the domain). To project r onto an ℓ1-ball, we use an algorithm of Duchi et al. [10]. Algorithm 1 describes our attack. It outperforms the steepest descent attack as well as a recently proposed Frank-Wolfe algorithm for ℓ1-attacks [20] (see Appendix B). Our attack is competitive with the more expensive EAD attack [8] (see Appendix C).

4 Experiments

We use our new adversarial training schemes to measure the robustness trade-off on MNIST and CIFAR10.¹ MNIST is an interesting case-study as distinct models achieve strong robustness to different ℓp and spatial attacks [31, 11]. Despite the dataset's simplicity, we show that no single model achieves strong ℓ∞, ℓ1 and ℓ2 robustness, and that new techniques are required to close this gap.

Training and evaluation setup. We first use adversarial training to train models on a single perturbation type. For MNIST, we use ℓ1 (ε = 10), ℓ2 (ε = 2) and ℓ∞ (ε = 0.3). For CIFAR10 we use ℓ∞ (ε = 4/255) and ℓ1 (ε = 2000/255).
We also train on rotation-translation attacks with ±3px translations and ±30° rotations as in [11]. We denote these models Adv1, Adv2, Adv∞, and AdvRT. We then use the "max" and "avg" strategies from Section 3 to train models Advmax and Advavg against multiple perturbations. We train once on all ℓp-perturbations, and once on both ℓ∞ and RT perturbations. We use the same CNN (for MNIST) and wide ResNet model (for CIFAR10) as Madry et al. [25]. Appendix A has more details on the training setup, and attack and training hyper-parameters.

We evaluate robustness of all models using multiple attacks: (1) we use gradient-based attacks for all ℓp-norms, i.e., PGD [25] and our SLIDE attack with 100 steps and 40 restarts (20 restarts on CIFAR10), as well as Carlini and Wagner's ℓ2-attack [6] (C&W), and an ℓ1-variant, EAD [8];

¹Kang et al. [20] recently studied the transfer between ℓ∞, ℓ1 and ℓ2-attacks for adversarially trained models on ImageNet. They show that models trained on one type of perturbation are not robust to others, but they do not attempt to train models against multiple attacks simultaneously.

Table 1: Evaluation of MNIST models trained on ℓ∞, ℓ1 and ℓ2 attacks (left) or ℓ∞ and rotation-translation (RT) attacks (right). Models Adv∞, Adv1, Adv2 and AdvRT are trained on a single attack, while Advavg and Advmax are trained on multiple attacks using the "avg" and "max" strategies. The columns show a model's accuracy on individual perturbation types, on the union of them (1 − R^max_adv), and the average accuracy across them (1 − R^avg_adv). The best results are in bold (at 95% confidence).
Results in red indicate gradient-masking; see Appendix C for a breakdown of all attacks.

MNIST, trained on ℓ∞, ℓ1, ℓ2 (left):
Model  | Acc. | ℓ∞   | ℓ1   | ℓ2   | 1−R^max_adv | 1−R^avg_adv
Nat    | 99.4 |  0.0 | 12.4 |  8.5 |  0.0 |  7.0
Adv∞   | 99.1 | 91.1 | 12.1 | 11.3 |  6.8 | 38.2
Adv1   | 98.9 |  0.0 | 78.5 | 50.6 |  0.0 | 43.0
Adv2   | 98.5 |  0.4 | 68.0 | 71.8 |  0.4 | 46.7
Advavg | 97.3 | 76.7 | 53.9 | 58.3 | 49.9 | 63.0
Advmax | 97.2 | 71.7 | 62.6 | 56.0 | 52.4 | 63.4

MNIST, trained on ℓ∞ and RT (right):
Model  | Acc. | ℓ∞   | RT   | 1−R^max_adv | 1−R^avg_adv
Nat    | 99.4 |  0.0 |  0.0 |  0.0 |  0.0
Adv∞   | 99.1 | 91.4 |  0.2 |  0.2 | 45.8
AdvRT  | 99.3 |  0.0 | 94.6 |  0.0 | 47.3
Advavg | 99.2 | 88.2 | 86.4 | 82.9 | 87.3
Advmax | 98.9 | 89.6 | 85.6 | 83.8 | 87.6

(2) to detect gradient-masking, we use decision-based attacks: the Boundary Attack [3] for ℓ2, the Pointwise Attack [31] for ℓ1, and the Boundary Attack++ [7] for ℓ∞; (3) for spatial attacks, we use the optimal attack of [11] that enumerates all small rotations and translations. For unbounded attacks (C&W, EAD and decision-based attacks), we discard perturbations outside the ℓp-ball.

For each model, we report accuracy on 1000 test points for: (1) individual perturbation types; (2) the union of these types, i.e., 1 − R^max_adv; and (3) the average over all perturbation types, 1 − R^avg_adv. We briefly discuss the optimal error that can be achieved if there is no robustness trade-off. For perturbation sets S1, ..., Sn, let R1, ..., Rn be the optimal risks achieved by distinct models. Then, a single model can at best achieve risk Ri for each Si, i.e., OPT(R^avg_adv) = (1/n) Σ_{i=1}^n Ri. If the errors are fully correlated, so that a maximal number of inputs admit no attack, we have OPT(R^max_adv) = max{R1, ..., Rn}. Our experiments show that these optimal error rates are not achieved.

Results on MNIST. Results are in Table 1.
The left table is for the union of ℓp-attacks, and the right table is for the union of ℓ∞ and RT attacks. In both cases, the multi-perturbation training strategies "succeed", in that models Advavg and Advmax achieve higher multi-perturbation accuracy than any of the models trained against a single perturbation type.

The results for ℓ∞ and RT attacks are promising, although the best model Advmax only achieves 1 − R^max_adv = 83.8% and 1 − R^avg_adv = 87.6%, which is far less than the optimal values, 1 − OPT(R^max_adv) = min{91.4%, 94.6%} = 91.4% and 1 − OPT(R^avg_adv) = (91.4% + 94.6%)/2 = 93%. Thus, these models do exhibit some form of the robustness trade-off analyzed in Section 2.

The ℓp results are surprisingly mediocre and re-raise questions about whether MNIST can be considered "solved" from a robustness perspective. Indeed, while training separate models to resist ℓ1, ℓ2 or ℓ∞ attacks works well, resisting all attacks simultaneously fails. This agrees with the results of Schott et al. [31], whose models achieve either high ℓ∞ or ℓ2 robustness, but not both simultaneously. We show that in our case, this lack of robustness is partly due to gradient masking.

First-order adversarial training and gradient masking on MNIST. The model Adv∞ is not robust to ℓ1 and ℓ2-attacks. This is unsurprising, as the model was only trained on ℓ∞-attacks. Yet, comparing the model's accuracy against multiple types of ℓ1 and ℓ2 attacks (see Appendix C) reveals a more curious phenomenon: Adv∞ has high accuracy against first-order ℓ1 and ℓ2-attacks such as PGD, but is broken by gradient-free attacks.
This is an indication of gradient-masking [27, 40, 1]. This issue had been observed before [31, 23], but an explanation remained elusive, especially since ℓ∞-PGD does not appear to suffer from gradient masking (see [25]). We explain this phenomenon by inspecting the learned features of model Adv∞, as in [25]. We find that the model’s first layer learns threshold filters z = ReLU(α · (x − ε)) for α > 0. As most pixels in MNIST are zero, most of the z_i cannot be activated by an ε-bounded ℓ∞-attack. ℓ∞-PGD thus optimizes a smooth (albeit flat) loss function. In contrast, ℓ1- and ℓ2-attacks can move a pixel x_i = 0 to x̂_i > ε, thus activating z_i, but have no gradients to rely on (i.e., dz_i/dx_i = 0 for any x_i ≤ ε). Figure 3 in Appendix D shows that the model’s loss resembles a step-function, for which first-order attacks such as PGD are inadequate.

Note that training against first-order ℓ1 or ℓ2-attacks directly (i.e., models Adv1 and Adv2 in Table 1) seems to yield genuine robustness to these perturbations. This is surprising in that, because of gradient masking, model Adv∞ actually achieves lower training loss against first-order ℓ1 and ℓ2-attacks than models Adv1 and Adv2. That is, Adv1 and Adv2 converged to sub-optimal local minima of their respective training objectives, yet these minima generalize much better to stronger attacks.

Table 2: Evaluation of CIFAR10 models trained against ℓ∞ and ℓ1 attacks (left) or ℓ∞ and rotation-translation (RT) attacks (right). Models Adv∞, Adv1 and AdvRT are trained against a single attack, while Advavg and Advmax are trained against two attacks using the “avg” and “max” strategies. The columns show a model’s accuracy on individual perturbation types, on the union of them (1 − R^max_adv), and the average accuracy across them (1 − R^avg_adv). The best results are in bold (at 95% confidence). A breakdown of all ℓ1 attacks is in Appendix C.

Model     Acc.   ℓ∞     ℓ1     1−R^max_adv   1−R^avg_adv
Nat       95.7    0.0    0.0    0.0           0.0
Adv∞      92.0   71.0   16.4   16.4          44.9
Adv1      90.8   53.4   66.2   53.1          60.0
Advavg    91.1   64.1   60.8   59.4          62.5
Advmax    91.2   65.7   62.5   61.1          64.1

Model     Acc.   ℓ∞     RT     1−R^max_adv   1−R^avg_adv
Nat       95.7    0.0    5.9    0.0           3.0
Adv∞      92.0   71.0    8.9    8.7          40.0
AdvRT     94.9    0.0   82.5    0.0          41.3
Advavg    93.6   67.8   78.2   65.2          73.0
Advmax    93.1   69.6   75.2   65.7          72.4

Table 3: Evaluation of affine attacks. For models trained with the “max” strategy, we evaluate against attacks from a union S_U of perturbation sets, and against an affine adversary that interpolates between perturbations. Examples of affine attacks are in Figure 4.

Dataset    Attacks    acc. on S_U   acc. on S_affine
MNIST      ℓ∞ & RT    83.8          62.6
CIFAR10    ℓ∞ & RT    65.7          56.0
CIFAR10    ℓ∞ & ℓ1    61.1          58.0

The models Advavg and Advmax that are trained against ℓ∞, ℓ1 and ℓ2-attacks also learn to use thresholding to resist ℓ∞-attacks while spuriously masking gradients for ℓ1 and ℓ2-attacks. This is evidence that, unlike previously thought [41], training against a strong first-order attack (such as PGD) can cause the model to minimize its training loss via gradient masking. To circumvent this issue, alternatives to first-order adversarial training seem necessary.
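The masking mechanism described above can be checked with a toy version of the threshold filter z = ReLU(α · (x − ε)); the values of α and ε below are illustrative, not taken from the trained model:

```python
import numpy as np

alpha, eps = 100.0, 0.3  # illustrative filter slope and threshold

def z(x):
    # Threshold filter of the kind learned by Adv_inf's first layer
    return np.maximum(alpha * (x - eps), 0.0)

def dz_dx(x):
    # (Sub)gradient of the filter: zero everywhere below the threshold
    return alpha * (x > eps)

x0 = 0.0  # a typical (zero-valued) MNIST pixel
# An eps-bounded l_inf attack keeps the pixel in [0, eps]: the filter stays
# off and, crucially, offers no gradient for a first-order attack to follow.
assert z(x0 + eps) == 0.0 and dz_dx(x0 + eps) == 0.0
# An l_1 or l_2 attack may spend its whole budget on this one pixel and jump
# past the threshold -- but PGD, starting from a zero gradient, never finds it.
print(dz_dx(x0), z(x0 + 1.0))  # prints: 0.0 70.0
```

The loss surface assembled from many such filters is the step-function that defeats first-order attacks.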
Potential (costly) approaches include training on gradient-free attacks, or extending certified defenses [28, 42] to multiple perturbations. Certified defenses currently provide provable bounds that are much weaker than the robustness attained by adversarial training, and certifying multiple perturbation types is likely to exacerbate this gap.

Results on CIFAR10. The left table in Table 2 considers the union of ℓ∞ and ℓ1 perturbations, while the right table considers the union of ℓ∞ and RT perturbations. As on MNIST, the models Advavg and Advmax achieve better multi-perturbation robustness than any of the models trained on a single perturbation, but fail to match the optimal error rates we could hope for. For ℓ1 and ℓ∞-attacks, we achieve 1 − R^max_adv = 61.1% and 1 − R^avg_adv = 64.1%, again significantly below the optimal values, 1 − OPT(R^max_adv) = min{71.0%, 66.2%} = 66.2% and 1 − OPT(R^avg_adv) = (71.0% + 66.2%)/2 = 68.6%. The results for ℓ∞ and RT attacks are qualitatively and quantitatively similar.²

Interestingly, models Advavg and Advmax achieve 100% training accuracy. Thus, multi-perturbation robustness increases the adversarial generalization gap [30]. These models might be resorting to more memorization because they fail to find features robust to both attacks.

Affine Adversaries. Finally, we evaluate the affine attacks introduced in Section 2.5. These attacks take affine combinations of two perturbation types, and we apply them to the models Advmax (we omit the ℓp-case on MNIST due to gradient masking). To compound ℓ∞ and ℓ1-noise, we devise an attack that updates both perturbations in alternation.
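This alternating scheme can be sketched as follows; the function names, step sizes, and two-phase schedule are our illustration (the ℓ1-ball projection follows Duchi et al. [10]), not the paper's exact implementation:

```python
import numpy as np

def linf_project(delta, eps):
    # Project onto the l_inf-ball of radius eps
    return np.clip(delta, -eps, eps)

def l1_project(delta, eps):
    # Euclidean projection onto the l_1-ball of radius eps (Duchi et al. [10])
    if np.abs(delta).sum() <= eps:
        return delta
    u = np.sort(np.abs(delta).ravel())[::-1]      # sorted magnitudes, descending
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, u.size + 1) > css - eps)[0][-1]
    theta = (css[k] - eps) / (k + 1.0)            # soft-threshold level
    return np.sign(delta) * np.maximum(np.abs(delta) - theta, 0.0)

def affine_linf_l1_attack(x, grad_fn, beta, eps_inf, eps_1,
                          steps=40, a_inf=0.01, a_1=0.5):
    # Maintain one perturbation per norm and update them in alternation, with
    # the budgets scaled by (1 - beta) and beta to interpolate between attacks.
    d_inf = np.zeros_like(x)
    d_1 = np.zeros_like(x)
    for t in range(steps):
        g = grad_fn(x + d_inf + d_1)   # loss gradient at the current point
        if t % 2 == 0:                 # l_inf phase: signed ascent step + projection
            d_inf = linf_project(d_inf + a_inf * np.sign(g), (1 - beta) * eps_inf)
        else:                          # l_1 phase: ascent step + l_1-ball projection
            d_1 = l1_project(d_1 + a_1 * g, beta * eps_1)
    return np.clip(x + d_inf + d_1, 0.0, 1.0)
```

Here `grad_fn` stands in for the gradient of the model's loss with respect to its input; the paper's own ℓ1 step uses a sparse update rule rather than the plain gradient step shown here.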
To compound ℓ∞ and RT attacks, we pick random rotation-translations (with ±3β px translations and ±30β° rotations), apply an ℓ∞-attack with budget (1 − β)ε to each, and retain the worst example.

The results in Table 3 match the predictions of our formal analysis: (1) affine combinations of ℓp perturbations are no stronger than their union. This is expected given Claim 3 and prior observations that neural networks are close to linear near the data [15, 29]; (2) combining ℓ∞ and RT attacks does yield a stronger attack, as shown in Theorem 4. This demonstrates that robustness to a union of perturbations can still be insufficient to protect against more complex combinations of perturbations.

²An interesting open question is why the model Advavg trained on ℓ∞ and RT attacks does not attain optimal average robustness R^avg_adv. Indeed, on CIFAR10, detecting the RT attack of [11] is easy, due to the black in-painted pixels in a transformed image. The following “ensemble” model thus achieves optimal R^avg_adv (but not necessarily optimal R^max_adv): on input x̂, return AdvRT(x̂) if there are black in-painted pixels, and otherwise return Adv∞(x̂). The fact that model Advavg did not learn such a function might hint at some limitation of adversarial training.

5 Discussion and Open Problems

Despite recent success in defending ML models against some perturbation types [25, 11, 31], extending these defenses to multiple perturbations unveils a clear robustness trade-off. This tension may be rooted in its unconditional occurrence in natural and simple distributions, as we proved in Section 2. Our new adversarial training strategies fail to achieve competitive robustness to more than one attack type, but narrow the gap towards multi-perturbation robustness.
We note that the risks R^max_adv and R^avg_adv that our models achieve are very close. Thus, for most data points, the models are either robust to all perturbation types or to none of them. This hints that some points (sometimes referred to as prototypical examples [4, 36]) are inherently easier to classify robustly, regardless of the perturbation type.

We showed that first-order adversarial training for multiple ℓp-attacks suffers from gradient masking on MNIST. Achieving better robustness on this simple dataset is an open problem. Another challenge is reducing the cost of our adversarial training strategies, which scales linearly in the number of perturbation types. Breaking this linear dependency requires efficient techniques for finding perturbations in a union of sets, which might be hard for sets with near-empty intersection (e.g., ℓ∞ and ℓ1-balls). The cost of adversarial training has also been reduced by merging the inner loop of a PGD attack with the gradient updates of the model parameters [34, 44], but it is unclear how to extend this approach to a union of perturbations (some of which are not optimized using PGD, e.g., rotation-translations).

Hendrycks and Dietterich [17], and Geirhos et al. [13] recently measured robustness of classifiers to multiple common (i.e., non-adversarial) image corruptions (e.g., random image blurring). In that setting, they also find that different classifiers achieve better robustness to some corruptions, and that no single classifier achieves the highest accuracy under all forms. The interplay between multi-perturbation robustness in the adversarial and common-corruption cases is worth further exploration.

References

[1] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning (ICML), 2018.

[2] A. C. Berry.
The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49(1):122–136, 1941.

[3] W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations (ICLR), 2018.

[4] N. Carlini, U. Erlingsson, and N. Papernot. Prototypical examples in deep learning: Metrics, characteristics, and utility. 2018.

[5] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou. Hidden voice commands. In USENIX Security Symposium, pages 513–530, 2016.

[6] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017.

[7] J. Chen and M. I. Jordan. Boundary Attack++: Query-efficient decision-based adversarial attack. arXiv preprint arXiv:1904.02144, 2019.

[8] P.-Y. Chen, Y. Sharma, H. Zhang, J. Yi, and C.-J. Hsieh. EAD: Elastic-net attacks to deep neural networks via adversarial examples. In AAAI Conference on Artificial Intelligence, 2018.

[9] A. Demontis, P. Russu, B. Biggio, G. Fumera, and F. Roli. On security and sparsity of linear classifiers for adversarial settings. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 322–332. Springer, 2016.

[10] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In International Conference on Machine Learning (ICML), 2008.

[11] L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv preprint arXiv:1712.02779, 2017.

[12] A. Fawzi, H. Fawzi, and O. Fawzi.
Adversarial vulnerability for any classifier. In Advances in Neural Information Processing Systems, pages 1186–1195, 2018.

[13] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR), 2019.

[14] J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.

[15] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.

[16] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel. Adversarial examples for malware detection. In European Symposium on Research in Computer Security, 2017.

[17] D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR), 2019.

[18] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry. Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175, 2019.

[19] J. Jo and Y. Bengio. Measuring the tendency of CNNs to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.

[20] D. Kang, Y. Sun, T. Brown, D. Hendrycks, and J. Steinhardt. Transfer of adversarial robustness between perturbation types. arXiv preprint arXiv:1905.01034, 2019.

[21] M. Khoury and D. Hadfield-Menell. On the geometry of adversarial examples, 2019.

[22] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. In International Conference on Learning Representations (ICLR), 2017.

[23] B. Li, C. Chen, W. Wang, and L. Carin. Second-order adversarial attack and certifiable robustness.
arXiv preprint arXiv:1809.03113, 2018.

[24] A. Madry and Z. Kolter. Adversarial robustness: Theory and practice. Tutorial at NeurIPS 2018, 2018.

[25] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.

[26] S. Mahloujifar, D. I. Diochnos, and M. Mahmoody. The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure. arXiv preprint arXiv:1809.03063, 2018.

[27] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In ASIACCS, pages 506–519. ACM, 2017.

[28] A. Raghunathan, J. Steinhardt, and P. Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations (ICLR), 2018.

[29] M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. In KDD. ACM, 2016.

[30] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pages 5019–5031, 2018.

[31] L. Schott, J. Rauber, M. Bethge, and W. Brendel. Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations (ICLR), 2019.

[32] L. Schott, J. Rauber, M. Bethge, and W. Brendel. Towards the first adversarially robust neural network model on MNIST (OpenReview comment on spatial transformations), 2019.

[33] A. Shafahi, W. R. Huang, C. Studer, S. Feizi, and T. Goldstein. Are adversarial examples inevitable? In International Conference on Learning Representations (ICLR), 2019.

[34] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T.
Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.

[35] Y. Sharma and P.-Y. Chen. Attacking the Madry defense model with L1-based adversarial examples. arXiv preprint arXiv:1710.10733, 2017.

[36] P. Stock and M. Cisse. ConvNets and ImageNet beyond accuracy: Understanding mistakes and uncovering biases. In Proceedings of the European Conference on Computer Vision (ECCV), pages 498–512, 2018.

[37] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

[38] F. Tramèr and D. Boneh. Adversarial training and robustness for multiple perturbations. In Neural Information Processing Systems (NeurIPS), 2019. arXiv preprint arXiv:1904.13000.

[39] F. Tramèr, P. Dupré, G. Rusak, G. Pellegrino, and D. Boneh. AdVersarial: Perceptual ad-blocking meets adversarial machine learning. arXiv preprint arXiv:1811.03194, 2018.

[40] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations (ICLR), 2018.

[41] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations (ICLR), 2019.

[42] E. Wong and Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5283–5292, 2018.

[43] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009.

[44] D. Zhang, T. Zhang, Y. Lu, Z. Zhu, and B. Dong. You only propagate once: Painless adversarial training using maximal principle.
arXiv preprint arXiv:1905.00877, 2019.