{"title": "Adversarial Robustness through Local Linearization", "book": "Advances in Neural Information Processing Systems", "page_first": 13847, "page_last": 13856, "abstract": "Adversarial training is an effective methodology for training deep neural networks that are robust against adversarial, norm-bounded perturbations. However, the computational cost of adversarial training grows prohibitively as the size of the model and number of input dimensions increase. Further, training against less expensive and therefore weaker adversaries produces models that are robust against weak attacks but break down under attacks that are stronger. This is often attributed to the phenomenon of gradient obfuscation; such models have a highly non-linear loss surface in the vicinity of training examples, making it hard for gradient-based attacks to succeed even though adversarial examples still exist. In this work, we introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data, thereby penalizing gradient obfuscation while encouraging robustness. We show via extensive experiments on CIFAR-10 and ImageNet, that models trained with our regularizer avoid gradient obfuscation and can be trained significantly faster than adversarial training. Using this regularizer, we exceed current state of the art and achieve 47% adversarial accuracy for ImageNet with L-infinity norm adversarial perturbations of radius 4/255 under an untargeted, strong, white-box attack. 
Additionally, we match state of the art results for CIFAR-10 at 8/255.", "full_text": "Adversarial Robustness through Local Linearization

Chongli Qin (DeepMind), James Martens (DeepMind), Sven Gowal (DeepMind), Dilip Krishnan (Google), Krishnamurthy (Dj) Dvijotham (DeepMind), Alhussein Fawzi (DeepMind), Soham De (DeepMind), Robert Stanforth (DeepMind), Pushmeet Kohli (DeepMind)
chongliqin@google.com

Abstract

Adversarial training is an effective methodology for training deep neural networks that are robust against adversarial, norm-bounded perturbations. However, the computational cost of adversarial training grows prohibitively as the size of the model and the number of input dimensions increase. Further, training against less expensive and therefore weaker adversaries produces models that are robust against weak attacks but break down under attacks that are stronger. This is often attributed to the phenomenon of gradient obfuscation: such models have a highly non-linear loss surface in the vicinity of training examples, making it hard for gradient-based attacks to succeed even though adversarial examples still exist. In this work, we introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data, thereby penalizing gradient obfuscation while encouraging robustness. We show via extensive experiments on CIFAR-10 and ImageNet that models trained with our regularizer avoid gradient obfuscation and can be trained significantly faster than adversarial training. Using this regularizer, we exceed the current state of the art and achieve 47% adversarial accuracy for ImageNet with ℓ∞ adversarial perturbations of radius 4/255 under an untargeted, strong, white-box attack. Additionally, we match state of the art results for CIFAR-10 at 8/255.

1 Introduction

In a seminal paper, Szegedy et al.
[22] demonstrated that neural networks are vulnerable to visually imperceptible but carefully chosen adversarial perturbations which cause them to output incorrect predictions. After this revealing study, a flurry of research has been conducted with the focus of making networks robust against such adversarial perturbations [14, 16, 17, 25]. Concurrently, researchers devised stronger attacks that expose previously unknown vulnerabilities of neural networks [24, 4, 1, 3].

Of the many approaches proposed [19, 2, 6, 21, 15, 17], adversarial training [14, 16] is empirically the best performing algorithm to train networks robust to adversarial perturbations. However, the cost of adversarial training becomes prohibitive with growing model complexity and input dimensionality. This is primarily due to the cost of computing adversarial perturbations, which is incurred at each step of adversarial training. In particular, for each new mini-batch one must perform multiple iterations of a gradient-based optimizer on the network's inputs to find the perturbations.1 As each step of this optimizer requires a new backwards pass, the total cost of adversarial training scales roughly with the number of such steps. Unfortunately, effective adversarial training of ImageNet often requires a large number of steps to avoid problems of gradient obfuscation [1, 24], making it significantly more expensive than conventional training.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

One approach which can alleviate the cost of adversarial training is training against weaker adversaries that are cheaper to compute, for example by taking fewer gradient steps to compute adversarial examples during training. However, this can produce models which are robust against weak attacks but break down under strong attacks, often due to gradient obfuscation.
In particular, one form of gradient obfuscation occurs when the network learns to fool a gradient-based attack by making the loss surface highly convoluted and non-linear (see Fig 1), an effect which has also been observed by Papernot et al. [18]. This non-linearity prevents gradient-based optimization methods from finding an adversarial perturbation within a small number of iterations [4, 24]. In contrast, if the loss surface were linear in the vicinity of the training examples, which is to say well-predicted by local gradient information, gradient obfuscation cannot occur. In this paper, we take up this idea and introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data. We call this regularizer the local linearity regularizer (LLR). Empirically, we find that networks trained with LLR exhibit far less gradient obfuscation, and are almost as robust against strong attacks as they are against weak ones. The main contributions of our paper are summarized below:

Figure 1: Example of a gradient obfuscated surface. The color of the surface denotes the prediction of the network.

• We show that training with LLR is significantly faster than adversarial training, allowing us to train a robust ImageNet model with a 5× speed up when training on 128 TPUv3 cores [9].
• We show that LLR trained models exhibit higher robustness relative to adversarially trained models when evaluated under strong attacks. Adversarially trained models can exhibit a decrease in accuracy of 6% when increasing the attack strength at test time for CIFAR-10, whereas LLR shows a decrease of only 2%.
• We achieve new state of the art results for adversarial accuracy against untargeted white-box attacks for ImageNet (with ε = 4/255)2: 47%.
Furthermore, we match state of the art results for CIFAR-10 (with ε = 8/255): 52.81%.3
• We perform a large scale evaluation of existing methods for adversarially robust training under consistent, strong, white-box attacks. For this we recreate several baseline models from the literature, training them both for CIFAR-10 and ImageNet (where possible).4

2 Background and Related Work

We denote our classification function by f(x; θ): x ↦ ℝ^C, mapping input features x to the output logits for classes in set C, i.e. p_i(y|x; θ) = exp(f_i(x; θ)) / Σ_j exp(f_j(x; θ)), with θ being the model parameters and y being the label. Adversarial robustness for f is defined as follows: a network is robust to adversarial perturbations of magnitude ε at input x if and only if

argmax_{i∈C} f_i(x; θ) = argmax_{i∈C} f_i(x + δ; θ)  ∀δ ∈ B_p(ε) = {δ : ‖δ‖_p ≤ ε}.  (1)

1 While computing the globally optimal adversarial example is NP-hard [12], gradient descent with several random restarts was empirically shown to be quite effective at computing adversarial perturbations of sufficient quality.
2 This means that every pixel is perturbed independently by up to 4 units up or down on a scale where pixels take values ranging between 0 and 255.
3 We note that TRADES [27] gets 55% against a much weaker attack; under our strongest attack, it gets 52.5%.
4 Baselines created are adversarial training, TRADES and CURE [17]. Contrary to CIFAR-10, we are currently unable to achieve consistent and competitive results on ImageNet at ε = 4/255 using TRADES.

In this paper, we focus on p = ∞ and we use B(ε) to denote B∞(ε) for brevity.
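Since the quantifier in Eq (1) ranges over every δ in the norm ball, the condition cannot be checked exhaustively; in practice it is tested against sampled or adversarially chosen perturbations. A minimal sketch of such a check (hypothetical helper names; `f` stands for any function returning the logit vector f(x; θ)):

```python
import numpy as np

def is_robust_at(f, x, deltas):
    """Empirically test the robustness condition of Eq (1): the predicted
    class argmax_i f_i(x + delta) must equal argmax_i f_i(x) for every
    candidate perturbation delta drawn from the ball B_p(eps)."""
    clean_label = np.argmax(f(x))
    return all(np.argmax(f(x + d)) == clean_label for d in deltas)

def project_linf(delta, eps):
    """Projection onto B_inf(eps): for the l-infinity norm the projection
    decomposes per coordinate, so clipping each entry to [-eps, eps] suffices."""
    return np.clip(delta, -eps, eps)
```

The per-coordinate clip is exactly why the ℓ∞ ball is convenient for the gradient-based attacks discussed next.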
Given that the dataset is drawn from distribution D, the standard method to train a classifier f is empirical risk minimization (ERM), defined by min_θ E_{(x,y)∼D}[ℓ(x; y, θ)]. Here, ℓ(x; y, θ) is the standard cross-entropy loss function defined by

ℓ(x; y, θ) = −yᵀ log(p(x; θ)),  (2)

where p_i(x; θ) is defined as above, and y is a 1-hot vector representing the class label. While ERM is effective at training neural networks that perform well on held-out test data, the accuracy on the test set goes to zero under adversarial evaluation. This is a result of a distribution shift in the data induced by the attack. To rectify this, adversarial training [17, 14] seeks to perturb the data distribution by performing adversarial attacks during training. More concretely, adversarial training minimizes the loss function

E_{(x,y)∼D}[ max_{δ∈B(ε)} ℓ(x + δ; y, θ) ],  (3)

where the inner maximization, max_{δ∈B(ε)} ℓ(x + δ; y, θ), is typically performed via a fixed number of steps of a gradient-based optimization method. One such method is Projected Gradient Descent (PGD), which performs the following ascent step on the loss:

δ ← Proj(δ + η∇_δ ℓ(x + δ; y, θ)),  (4)

where Proj(x) = argmin_{ξ∈B(ε)} ‖x − ξ‖. Another popular gradient-based method is to use the sign of the gradient [8]. The cost of solving Eq (3) is dominated by the cost of solving the inner maximization problem. Thus, the inner maximization should be performed efficiently to reduce the overall cost of training. A naive approach is to reduce the number of gradient steps performed by the optimization procedure. Generally, the attack is weaker when we do fewer steps.
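As a concrete illustration of Eqs (3)–(4), the inner maximization can be sketched as projected ascent using the sign-of-gradient step mentioned above (a NumPy sketch with hypothetical names; `grad_fn` is assumed to return ∇_δ ℓ(x + δ; y, θ), and the step size and step count are illustrative):

```python
import numpy as np

def pgd_linf(x, grad_fn, eps, eta=0.01, steps=10, rng=None):
    """Approximate max_{delta in B(eps)} loss(x + delta) by projected
    gradient ascent: take a sign-of-gradient step on delta [8], then
    project back onto the l-infinity ball of radius eps (a clip)."""
    rng = rng or np.random.default_rng(0)
    delta = rng.uniform(-eps, eps, size=x.shape)   # random start in B(eps)
    for _ in range(steps):
        g = grad_fn(x + delta)                     # one backwards pass
        delta = delta + eta * np.sign(g)           # ascent step on the loss
        delta = np.clip(delta, -eps, eps)          # Proj onto B(eps)
    return delta
```

Each call to `grad_fn` corresponds to one backwards pass, so k inner steps add roughly k backwards passes per mini-batch, which is the cost discussed above.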
If the attack is too weak, the trained networks often display gradient obfuscation, as shown in Fig 1.

Since the introduction of adversarial training, a body of work has researched alternative ways of making networks robust. One such approach is the TRADES method [27], which is a form of regularization that optimizes the trade-off between robustness and accuracy, as many studies have observed these two quantities to be at odds with each other [23]. Others, such as the work by Ding et al. [7], adaptively increase the perturbation radius by finding the minimal length perturbation which changes the output label. Some have proposed architectural changes which promote adversarial robustness, such as the \"denoise\" model [25] for ImageNet.

The work presented here is a regularization technique which encourages the loss function to be well approximated by its linear Taylor expansion in a sufficiently small neighbourhood. There has been prior work which uses gradient information as a form of regularization [20, 17]. The work presented in this paper is closely related to the paper by Moosavi et al. [17], which highlights that adversarial training reduces the curvature of ℓ(x; y, θ) with respect to x. Leveraging an empirical observation (the highest curvature is along the direction ∇_x ℓ(x; y, θ)), they further propose an algorithm to mimic the effects of adversarial training on the loss surface. The algorithm results in performance comparable to adversarial training at a significantly lower cost.

3 Motivating the Local Linearity Regularizer

As described above, the cost of adversarial training is dominated by solving the inner maximization problem max_{δ∈B(ε)} ℓ(x + δ). Throughout, we abbreviate ℓ(x; y, θ) with ℓ(x).
We can reduce this cost simply by reducing the number of PGD steps (as defined in Eq (4)) taken to solve max_{δ∈B(ε)} ℓ(x + δ). To motivate the local linearity regularizer (LLR), we start with an empirical analysis of how the behavior of adversarial training changes as we increase the number of PGD steps used during training. We find that the loss surface becomes increasingly linear (as captured by the local linearity measure defined below) as we increase the number of PGD steps.

3.1 Local Linearity Measure

Suppose that we are given an adversarial perturbation δ ∈ B(ε). The corresponding adversarial loss is given by ℓ(x + δ). If our loss surface is smooth and approximately linear, then ℓ(x + δ) is well approximated by its first-order Taylor expansion ℓ(x) + δᵀ∇_x ℓ(x). In other words, the absolute difference between these two values,

g(δ; x) = |ℓ(x + δ) − ℓ(x) − δᵀ∇_x ℓ(x)|,  (5)

is an indicator of how linear the surface is. Consequently, we consider the quantity

γ(ε, x) = max_{δ∈B(ε)} g(δ; x),  (6)

to be a measure of how linear the surface is within a neighbourhood B(ε). We call this quantity the local linearity measure.

3.2 Empirical Observations on Adversarial Training

Figure 2: Plots showing that γ(ε, x) (Eq (6)) is large (on the order of 10) when we train with just one or two steps of PGD for the inner maximization (2a). In contrast, γ(ε, x) becomes increasingly smaller (on the order of 10⁻¹) as we increase the number of PGD steps to 4 and above (2b).
The x-axis is the number of training iterations and the y-axis is γ(ε, x); here ε = 8/255 for CIFAR-10.

We measure γ(ε, x) for networks trained with adversarial training on CIFAR-10, where the inner maximization max_{δ∈B(ε)} ℓ(x + δ) is performed with 1, 2, 4, 8 and 16 steps of PGD. γ(ε, x) is measured throughout training on the training set.5 The architecture used is a wide residual network [26] of depth 28 and width 10 (Wide-ResNet-28-10). The results are shown in Figs 2a and 2b. Fig 2a shows that when we train with one or two steps of PGD for the inner maximization, the local loss surface is extremely non-linear at the end of training. An example visualization of such a loss surface is given in Fig A1a. However, when we train with four or more steps of PGD for the inner maximization, the surface is relatively well approximated by ℓ(x) + δᵀ∇_x ℓ(x), as shown in Fig 2b. An example of the loss surface is shown in Fig A1b. For the adversarial accuracy of the networks, see Table A1.

4 Local Linearity Regularizer (LLR)

From the section above, we make the empirical observation that the local linearity measure γ(ε, x) decreases as we train with stronger attacks.6 In this section, we give some theoretical justification of why local linearity γ(ε, x) correlates with adversarial robustness, and derive a regularizer from the local linearity measure that can be used for training robust models.

4.1 Local Linearity Upper Bounds Adversarial Loss

The following proposition establishes that the adversarial loss ℓ(x + δ) is upper bounded by the local linearity measure, plus the change in the loss as predicted by the gradient (which is given by |δᵀ∇_x ℓ(x)|).

Proposition 4.1. Consider a loss function ℓ(x) that is once-differentiable, and a local neighbourhood defined by B(ε).
Then for all δ ∈ B(ε),

|ℓ(x + δ) − ℓ(x)| ≤ |δᵀ∇_x ℓ(x)| + γ(ε, x).  (7)

5 To measure γ(ε, x) we find max_{δ∈B(ε)} g(δ; x) with 50 steps of PGD, using Adam as the optimizer and 0.1 as the step size.
6 Here, we imply an increase in the number of PGD steps for the inner maximization max_{δ∈B(ε)} ℓ(x + δ).

See Appendix B for the proof. From Eq (7) it is clear that the adversarial loss tends to ℓ(x), i.e., ℓ(x + δ) → ℓ(x), as both |δᵀ∇_x ℓ(x)| → 0 and γ(ε, x) → 0 for all δ ∈ B(ε). And assuming ℓ(x + δ) ≥ ℓ(x), one also has the upper bound ℓ(x + δ) ≤ ℓ(x) + |δᵀ∇_x ℓ(x)| + γ(ε, x).

4.2 Local Linearity Regularization (LLR)

Following the analysis above, we propose the following objective for adversarially robust training:

L(D) = E_{(x,y)∼D}[ ℓ(x) + λγ(ε, x) + μ|δ_LLRᵀ∇_x ℓ(x)| ],  (8)

where λ and μ are hyper-parameters to be optimized, and δ_LLR = argmax_{δ∈B(ε)} g(δ; x) (recall the definition of g(δ; x) from Eq (5)). Concretely, we are trying to find the point δ_LLR in B(ε) where the linear approximation ℓ(x) + δᵀ∇_x ℓ(x) is maximally violated. To train, we penalize both its linear violation γ(ε, x) = |ℓ(x + δ_LLR) − ℓ(x) − δ_LLRᵀ∇_x ℓ(x)|, and the gradient magnitude term |δ_LLRᵀ∇_x ℓ(x)|, as required by the above proposition.
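A minimal sketch of the objective in Eq (8), assuming an autodiff framework supplies `loss_fn` (returning ℓ(x)) and `grad_fn` (returning ∇_x ℓ(x)); the values λ, μ, the step size, and the step count here are illustrative defaults, not the paper's tuned hyper-parameters:

```python
import numpy as np

def llr_objective(x, loss_fn, grad_fn, eps, lam=4.0, mu=3.0, steps=10, eta=0.1):
    """Sketch of Eq (8): loss(x) + lam * gamma(eps, x) + mu * |delta_LLR^T grad(x)|,
    where delta_LLR approximately maximizes the linearity violation
    g(delta; x) = |loss(x+delta) - loss(x) - delta^T grad(x)| over B(eps)."""
    l_x, g_x = loss_fn(x), grad_fn(x)

    def g(delta):  # linearity violation, Eq (5)
        return abs(loss_fn(x + delta) - l_x - delta @ g_x)

    rng = np.random.default_rng(0)
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        # gradient of g w.r.t. delta: sign of the inner term times the
        # difference of loss gradients at x + delta and at x
        inner = loss_fn(x + delta) - l_x - delta @ g_x
        grad_g = np.sign(inner) * (grad_fn(x + delta) - g_x)
        delta = np.clip(delta + eta * np.sign(grad_g), -eps, eps)

    gamma = g(delta)  # approximate local linearity measure, Eq (6)
    return l_x + lam * gamma + mu * abs(delta @ g_x)
```

Here δ_LLR is approximated by projected sign-gradient ascent on g(δ; x); in a real training loop the returned scalar would be minimized with respect to θ.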
We note that, analogous to adversarial training, LLR requires an inner optimization to find δ_LLR, performed via gradient descent. However, as we will show in the experiments, far fewer optimization steps are required for the overall scheme to be effective. Pseudo-code for training with this regularizer is given in Appendix E.

4.3 Local Linearity Measure γ(ε, x) Bounds the Adversarial Loss by Itself

Interestingly, under certain reasonable approximations and standard choices of loss functions, we can bound |δᵀ∇_x ℓ(x)| in terms of γ(ε, x); see Appendix C for details. Consequently, the bound in Eq (7) implies that minimizing γ(ε, x) (along with the nominal loss ℓ(x)) is sufficient to minimize the adversarial loss ℓ(x + δ). This prediction is confirmed by our experiments. However, our experiments also show that including |δᵀ∇_x ℓ(x)| in the objective along with ℓ(x) and γ(ε, x) works better in practice on certain datasets, especially ImageNet. See Appendix F.3 for details.

5 Experiments and Results

We perform experiments using LLR on both the CIFAR-10 [13] and ImageNet [5] datasets. We show that LLR achieves state of the art adversarial accuracy on CIFAR-10 (at ε = 8/255) and ImageNet (at ε = 4/255) evaluated under a strong adversarial attack. Moreover, we show that as the attack strength increases, the degradation in adversarial accuracy is more graceful for networks trained using LLR than for those trained with standard adversarial training. Further, we demonstrate that training using LLR is 5× faster for ImageNet. Finally, we show that, by linearizing the loss surface, models are less prone to gradient obfuscation.

CIFAR-10: The perturbation radius we examine is ε = 8/255 and the model architectures we use are Wide-ResNet-28-8 and Wide-ResNet-40-8 [26].
Since the validity of our regularizer requires ℓ(x) to be smooth, the activation function we use is the softplus function log(1 + exp(x)), a smooth version of ReLU. The baselines we compare our results against are adversarial training (ADV) [16], TRADES [27] and CURE [17]. We recreate these baselines from the literature using the same network architecture and activation function. The evaluation is done on the full test set of 10K images.

ImageNet: The perturbation radii considered are ε = 4/255 and ε = 16/255. The architecture used is ResNet-152 [11]. We use softplus as the activation function. For ε = 4/255, the baselines we compare our results against are our recreated versions of ADV [16] and the denoising model (DENOISE) [25].7 For ε = 16/255, we compare LLR to the ADV [16] and DENOISE [25] networks which have been published in the literature. Due to computational constraints, we limit ourselves to evaluating all models on the first 1K images of the test set.

To make sure that we have a close estimate of the true robustness, we evaluate all the models on a wide range of attacks; these are described below.

7 We attempted to use TRADES on ImageNet but did not manage to get competitive results; thus it is omitted from the baselines.

5.1 Evaluation Setup

To accurately gauge the true robustness of our network, we tailor our attacks to give the lowest possible adversarial accuracy. The two parts which we tune to get the optimal attack are the loss function for the attack and its corresponding optimization procedure. The loss functions used are described below; for the optimization procedure please refer to Appendix F.1.

Loss Functions: The three loss functions we consider are summarized in Table 1.
We use the difference between logits for the attack loss function rather than the cross-entropy loss, as we have empirically found the former to yield lower adversarial accuracy.

Attack Name | Loss Function | Metric
Random-Targeted | max_{δ∈B(ε)} f_r(x + δ) − f_t(x + δ) | Attack Success Rate
Untargeted | max_{δ∈B(ε)} f_s(x + δ) − f_t(x + δ) | Adversarial Accuracy
Multi-Targeted [10] | max_{δ∈B(ε)} max_{i∈C} f_i(x + δ) − f_t(x + δ) | Adversarial Accuracy

Table 1: The loss functions corresponding to the attacks we use for evaluation, and the metric we measure on the test set for each of these attacks. Notation-wise, s = argmax_{i≠t} f_i(x + δ) is the highest logit excluding the logit corresponding to the correct class t; note that s can change through the optimization procedure. For the Random-Targeted attack, r is a randomly chosen target label that is not t and does not change throughout the optimization. C stands for the set of class labels. For the Multi-Targeted attack we maximize f_i(x + δ) − f_t(x + δ) for all i ∈ C, and consider the attack successful if any of the individual attacks on each target class i is successful. The metric used for the Random-Targeted attack is the attack success rate: the percentage of attacks where the target label r is indeed the output label (this metric is especially important for ImageNet at ε = 16/255).
For the other attacks we use adversarial accuracy as the metric, which is the accuracy on the test set after the attack.

5.2 Results for Robustness

CIFAR-10: Wide-ResNet-28-8 (8/255)
Methods | Nominal | FGSM-20 (Weak) | Untargeted (Strong) | Multi-Targeted (Very Strong)
ADV [16] | 87.25% | 48.89% | 45.92% | 44.54%
CURE [17] | 80.76% | 39.76% | 38.87% | 37.57%
ADV(S) | 85.11% | 56.76% | 53.96% | 48.79%
CURE(S) | 84.31% | 48.56% | 47.28% | 45.43%
TRADES(S) | 87.40% | 51.63% | 50.46% | 49.48%
LLR(S) | 86.83% | 54.24% | 52.99% | 51.13%

CIFAR-10: Wide-ResNet-40-8 (8/255)
Methods | Nominal | FGSM-20 (Weak) | Untargeted (Strong) | Multi-Targeted (Very Strong)
ADV(R) | 85.58% | 56.32% | 52.34% | 46.89%
TRADES(R) | 86.25% | 53.38% | 51.76% | 50.84%
ADV(S) | 85.27% | 57.94% | 55.26% | 49.79%
CURE(S) | 84.45% | 49.41% | 47.69% | 45.51%
TRADES(S) | 88.11% | 53.03% | 51.65% | 50.53%
LLR(S) | 86.28% | 56.44% | 54.95% | 52.81%

Table 2: Model accuracy results for CIFAR-10. Our LLR regularizer performs best under the strongest attack (last column). (S) denotes softplus activation; (R) denotes ReLU activation; models marked (S) or (R) are our implementations.

For CIFAR-10, the main adversarial accuracy results are given in Table 2. We compare LLR training to ADV [16], CURE [17] and TRADES [27], both with our re-implementations and the published models.8 Note that our re-implementations using softplus activations perform at or above the published results for ADV, CURE and TRADES. This is largely due to the learning rate schedule used, which is similar to the one used by TRADES [27].

8 Note that the network published for TRADES [27] uses a Wide-ResNet-34-10, so it is not shown in the table, but under the same rigorous evaluation we find that TRADES gets 84.91% nominal accuracy, 53.41% under Untargeted and 52.58% under Multi-Targeted.
We have also run ℓ∞ DeepFool (not in the table, as the attack is weaker), giving ADV(S): 64.29%, CURE(S): 58.73%, TRADES(S): 63.4%, LLR(S): 65.87%.

Interestingly, for adversarial training (ADV), evaluating with the Multi-Targeted attack gives significantly lower adversarial accuracy than with Untargeted: 49.79% versus 55.26%. Evaluation using the Multi-Targeted attack consistently gave the lowest adversarial accuracy throughout. Under this attack, the methods which stand out are LLR and TRADES. Using LLR we get state of the art results with 52.81% adversarial accuracy.

ImageNet: ResNet-152 (4/255)
Methods | PGD steps | Nominal Accuracy | Untargeted | Random-Targeted Success Rate
ADV | 30 | 69.20% | 39.70% | 0.50%
DENOISE | 30 | 69.70% | 38.90% | 0.40%
LLR | 2 | 72.70% | 47.00% | 0.40%

ImageNet: ResNet-152 (16/255)
Methods | PGD steps | Nominal Accuracy | Untargeted | Random-Targeted Success Rate
ADV [25] | 30 | 64.10% | 6.30% | 40.00%
DENOISE [25] | 30 | 66.80% | 7.50% | 38.00%
LLR | 10 | 51.20% | 6.10% | 43.80%

Table 3: LLR gets 47% adversarial accuracy for 4/255, 7.30% higher than DENOISE and ADV. For 16/255, LLR gets similar robustness results, but at a significant cost to nominal accuracy. Note that Multi-Targeted attacks for ImageNet require looping over 1000 labels; this evaluation can take up to several days even on 50 GPUs and is thus omitted from this table. The column of the strongest attack is highlighted.

For ImageNet, we compare against adversarial training (ADV) [16] and the denoising model (DENOISE) [25]. The results are shown in Table 3. For a perturbation radius of 4/255, LLR gets 47% adversarial accuracy under the Untargeted attack, notably higher than the 39.70% obtained via adversarial training. Moreover, LLR is trained with just two steps of PGD rather than the 30 steps used for adversarial training.
The amount of computation needed for each method is further discussed in Sec 5.2.1.

Also shown in Table 3 are the results for ε = 16/255. We note a significant drop in nominal accuracy when we train LLR to a perturbation radius of 16/255. When testing at perturbation radius 16/255, we also find that the adversarial accuracy under the Untargeted attack is very poor (below 8%) for all methods. We speculate that this perturbation radius is simply too large for the robustness problem: adversarial perturbations should be, by definition, imperceptible to the human eye, but upon inspection of the images generated using an adversarial attack at this radius (see Fig F4) this assumption no longer holds true. The generated images appear to consist of object parts of other classes super-imposed onto the target image. This leads us to believe that a more fine-grained analysis of what should constitute \"robustness for ImageNet\" is an important topic for debate.

5.2.1 Runtime Speed

For ImageNet, we trained on 128 TPUv3 cores [9]; the total training wall time for the LLR network (4/255) is 7 hours for 110 epochs, while for the adversarially trained (ADV) networks the total wall time is 36 hours for 110 epochs. This is a 5× speed up.

5.2.2 Accuracy Degradation: Strong vs Weak Evaluation

The model trained using LLR degrades gracefully in terms of adversarial accuracy when we increase the strength of the attack, as shown in Fig 3. In particular, Fig 3a shows that, for CIFAR-10, when the attack changes from Untargeted to Multi-Targeted, LLR's accuracy drops by only 2.18%, compared to a 5.64% drop for adversarial training (ADV). We also see similar trends in accuracy in Table 2.
This could indicate that some level of gradient obfuscation may be happening under standard adversarial training.

As we empirically observe that LLR evaluates similarly under weak and strong attacks, we hypothesize that this is because LLR explicitly linearizes the loss surface. An extreme case would be a completely linear surface: in this instance the optimal adversarial perturbation would be found with just one PGD step. Thus evaluation using a weak attack is often good enough to get an accurate gauge of how the model will perform under a stronger attack.

For ImageNet (see Fig 3b), the adversarial accuracy of the network trained using LLR remains significantly higher (by 7.5%) than that of the adversarially trained network when going from a weak to a stronger attack.

Figure 3: Adversarial accuracy shown for CIFAR-10 (3a) and ImageNet (3b) as we increase the strength of the attack. (3a) shows LLR's adversarial accuracy degrading gracefully from 53.32% to 51.14% (−2.18%) while ADV's adversarial accuracy drops from 54.43% to 48.79% (−5.64%). (3b) LLR remains 7.5% higher in terms of adversarial accuracy (47.20%) compared to ADV (39.70%). The annotations on each node denote the number of PGD steps × the number of random restarts (see Appendix F.1). In (3a), the background color denotes whether the attack is Untargeted (blue) or Multi-Targeted (orange); in (3b), we only use Untargeted attacks.

5.3 Resistance to Gradient Obfuscation

Figure 4: Comparing the loss surface ℓ(x) after training with just 1 or 2 steps of PGD for the inner maximization of either the adversarial objective (ADV) max_{δ∈B(ε)} ℓ(x + δ) or the linearity objective (LLR) γ(ε, x) = max_{δ∈B(ε)} |ℓ(x + δ) − ℓ(x) − δᵀ∇ℓ(x)|. Panels: (a) ADV-1, (b) LLR-1, (c) ADV-2, (d) LLR-2, where ADV-i refers to adversarial training with i PGD steps, and similarly for LLR-i. Results are shown for image 126 in the test set of CIFAR-10; the nominal label is deer.

We use either the standard adversarial training objective (ADV-1, ADV-2) or the LLR objective (LLR-1, LLR-2), taking one or two steps of PGD to maximize each objective. To train LLR-1/2, we only optimize the local linearity γ(ε, x), i.e. μ in Eq (8) is set to zero. We see that for adversarial training, as shown in Figs 4a and 4c, the loss surface becomes highly non-linear and jagged, in other words obfuscated. Additionally, in this setting the adversarial accuracy under our strongest attack is 0% for both; see Table F3. In contrast, the loss surface is smooth when we train using LLR, as shown in Figs 4b and 4d. Further, Table F3 shows that we obtain an adversarial accuracy of 44.50% with the LLR-2 network under our strongest evaluation. We also evaluate the values of γ(ε, x) on the CIFAR-10 test set after these networks are trained; this is shown in Fig F3. The values of γ(ε, x) when we train with LLR using two steps of PGD are comparable to those of adversarial training with 20 steps of PGD. By comparison, adversarial training with two steps of PGD results in much larger values of γ(ε, x).

6 Conclusions

We show that, by promoting linearity, deep classification networks are less susceptible to gradient obfuscation, thus allowing us to do fewer gradient descent steps for the inner optimization.
Our novel linearity regularizer promotes locally linear behavior, as justified from a theoretical perspective. The resulting models achieve state of the art adversarial robustness on the CIFAR-10 and ImageNet datasets, and can be trained 5× faster than regular adversarial training.

Acknowledgements

We would like to thank Jost Tobias Springenberg and Brendan O'Donoghue for their careful reading of this manuscript. We would also like to thank Jonathan Uesato and Po-Sen Huang for the insightful discussions.