{"title": "Deep Defense: Training DNNs with Improved Adversarial Robustness", "book": "Advances in Neural Information Processing Systems", "page_first": 419, "page_last": 428, "abstract": "Despite the efficacy on a variety of computer vision tasks, deep neural networks (DNNs) are vulnerable to adversarial attacks, limiting their applications in security-critical systems. Recent works have shown the possibility of generating imperceptibly perturbed image inputs (a.k.a., adversarial examples) to fool well-trained DNN classifiers into making arbitrary predictions. To address this problem, we propose a training recipe named \"deep defense\". Our core idea is to integrate an adversarial perturbation-based regularizer into the classification objective, such that the obtained models learn to resist potential attacks, directly and precisely. The whole optimization problem is solved just like training a recursive network. Experimental results demonstrate that our method outperforms training with adversarial/Parseval regularizations by large margins on various datasets (including MNIST, CIFAR-10 and ImageNet) and different DNN architectures. Code and models for reproducing our results are available at https://github.com/ZiangYan/deepdefense.pytorch.", "full_text": "Deep Defense: Training DNNs with Improved Adversarial Robustness\n\nZiang Yan1* Yiwen Guo2,1* Changshui Zhang1\n\n1Institute for Artificial Intelligence, Tsinghua University (THUAI),\nState Key Lab of Intelligent Technologies and Systems,\nBeijing National Research Center for Information Science and Technology (BNRist),\nDepartment of Automation, Tsinghua University, Beijing, China\n2 Intel Labs China\n\nyza18@mails.tsinghua.edu.cn yiwen.guo@intel.com zcs@mail.tsinghua.edu.cn\n\nAbstract\n\nDespite the efficacy on a variety of computer vision tasks, deep neural networks (DNNs) are vulnerable to adversarial attacks, limiting their applications in security-critical systems. 
Recent works have shown the possibility of generating imperceptibly perturbed image inputs (a.k.a., adversarial examples) to fool well-trained DNN classifiers into making arbitrary predictions. To address this problem, we propose a training recipe named \u201cdeep defense\u201d. Our core idea is to integrate an adversarial perturbation-based regularizer into the classification objective, such that the obtained models learn to resist potential attacks, directly and precisely. The whole optimization problem is solved just like training a recursive network. Experimental results demonstrate that our method outperforms training with adversarial/Parseval regularizations by large margins on various datasets (including MNIST, CIFAR-10 and ImageNet) and different DNN architectures. Code and models for reproducing our results are available at https://github.com/ZiangYan/deepdefense.pytorch.\n\n1 Introduction\n\nAlthough deep neural networks (DNNs) have advanced the state-of-the-art of many challenging computer vision tasks, they are vulnerable to adversarial examples [34] (i.e., generated images which seem perceptually similar to the real ones but are intentionally formed to fool learning models).\nA general way of synthesizing adversarial examples is to apply worst-case perturbations to real images [34, 8, 26, 3]. With proper strategies, the required perturbations for fooling a DNN model can be 1000\u00d7 smaller in magnitude than the real images, making them imperceptible to human beings. It has been reported that even state-of-the-art DNN solutions have been fooled into misclassifying such examples with high confidence [18]. Worse, adversarial perturbations can transfer across different images and network architectures [25]. 
Such transferability also enables black-box attacks, which means the adversary may succeed without any knowledge of the model architecture or parameters [28].\nThough intriguing, this property of DNNs can lead to serious issues in real-world applications like self-driving cars and face-payment systems. Unlike the instability against random noise, which is theoretically and practically guaranteed to be less critical [7, 34], the vulnerability to adversarial perturbations remains severe in deep learning. Multiple attempts have been made to analyze and explain it so far [34, 8, 5, 14]. For example, Goodfellow et al. [8] argue that the main reason why DNNs are vulnerable is their linear nature, rather than nonlinearity or overfitting.\n\n*The first two authors contributed equally to this work.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nBased on this explanation, they design an efficient l\u221e-induced perturbation and further propose to combine it with adversarial training [34] for regularization. Recently, Cisse et al. [5] investigate the Lipschitz constant of DNN-based classifiers and propose Parseval training. However, similar to some previous and contemporary methods, approximations to the theoretically optimal constraint are required in practice, making the method less effective against very strong attacks.\nIn this paper, we introduce \u201cdeep defense\u201d, an adversarial regularization method to train DNNs with improved robustness. Unlike many existing and contemporaneous methods which make approximations and optimize possibly untight bounds, we precisely integrate a perturbation-based regularizer into the classification objective. This endows DNN models with an ability of directly learning from attacks and further resisting them, in a principled way. 
Specifically, we penalize the norm of adversarial perturbations, by encouraging relatively large values for the correctly classified samples and possibly small values for the misclassified ones. As a regularizer, it is jointly optimized with the original learning objective, and the whole problem is efficiently solved by treating it as training a recursive-flavoured network. Extensive experiments on MNIST, CIFAR-10 and ImageNet show that our method significantly improves the robustness of different DNNs under advanced adversarial attacks, while no accuracy degradation is observed.\nThe remainder of this paper is structured as follows. First, we briefly introduce and discuss representative methods for conducting adversarial attacks and defenses in Section 2. Then we elaborate on the motivation and basic ideas of our method in Section 3. Section 4 provides implementation details of our method and experimentally compares it with the state-of-the-arts, and finally, Section 5 draws the conclusions.\n\n2 Related Work\n\nAdversarial Attacks. Starting from a common objective, many attack methods have been proposed. Szegedy et al. [34] propose to generate adversarial perturbations by minimizing a vector norm with box-constrained L-BFGS optimization. For better efficiency, Goodfellow et al. [8] develop the fast gradient sign (FGS) attack, which takes the sign of the gradient as the direction of the perturbation, since this is approximately optimal under an l\u221e constraint. Later, Kurakin et al. [18] present an iterative version of the FGS attack, applying it multiple times with a small step size and clipping pixel values of intermediate results. Similarly, Moosavi-Dezfooli et al. [26] propose DeepFool as an iterative lp attack. At each iteration, it linearizes the network and seeks the smallest perturbation to transform current images towards the linearized decision boundary. 
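For concreteness, the single-step FGS attack mentioned above fits in one line; the sketch below is our own illustration with a toy linear model and a hand-written loss gradient (`grad_loss`, `w` and `eps` are assumptions for the example, not from the paper):

```python
import numpy as np

def fgs_attack(x, grad_loss, eps):
    """Fast gradient sign (FGS) attack: move every input coordinate by
    eps in the sign direction of the loss gradient, which is
    approximately optimal under an l-infinity budget."""
    return x + eps * np.sign(grad_loss(x))

# Toy linear "classifier" f(x) = w.x; for a positive sample the loss
# gradient w.r.t. the input is -w (illustrative only).
w = np.array([0.5, -1.0, 2.0])
grad_loss = lambda x: -w
x = np.array([1.0, 1.0, 1.0])
x_adv = fgs_attack(x, grad_loss, eps=0.1)
# Every coordinate moves by exactly eps, in the loss-increasing direction.
assert np.allclose(np.abs(x_adv - x), 0.1)
```

The iterative FGS of Kurakin et al. simply repeats this step with a small `eps` and clips the intermediate results back into the valid pixel range.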
A more detailed explanation of DeepFool can be found in Section 3.1. More recently, Carlini and Wagner [4] reformulate attacks as optimization instances that can be solved using stochastic gradient descent to generate more sophisticated adversarial examples. Based on the above methods, input- and network-agnostic adversarial examples can also be generated [25, 28].\n\nDefenses. Resisting adversarial attacks is challenging. It has been empirically shown that conventional regularization strategies such as dropout, weight decay and distorting training data (with random noise) do not really solve the problem [8]. Fine-tuning networks using adversarial examples, namely adversarial training [34], is a simple yet effective approach to perform defense and relieve the problem [8, 18], for which the examples can be generated either online [8] or offline [26]. Adversarial training works well on small datasets such as MNIST and CIFAR. Nevertheless, as Kurakin et al. [18] have reported, it may result in a decreased benign-set accuracy on large-scale datasets like ImageNet.\nAn alternative way of defending against such attacks is to train a detector to detect and reject adversarial examples. Metzen et al. [23] utilize a binary classifier which takes intermediate representations as input for detection, and Lu et al. [21] propose to invoke an RBF-SVM operating on discrete codes from late-stage ReLUs. However, it is possible to attack the joint system if an adversary has access to the parameters of such a detector. Furthermore, it remains in doubt whether adversarial examples are intrinsically different from benign ones [3].\nAnother effective work exploits distillation [30], but it also slightly degrades the benign-set accuracy and may be broken by C&W\u2019s attack [4]. Alemi et al. [1] present an information-theoretic method which also helps to improve the resistance to adversarial attacks. 
Some recent and contemporaneous works also propose to utilize gradient masking [29] as a defense [6, 35, 2].\n\nFigure 1: Top left: the recursive-flavoured network, which takes a reshaped image xk as input and sequentially computes each perturbation component by using a pre-designed attack module. Top right: an example for generating the first component, in which the three elbow double-arrow connectors indicate weight-sharing fully-connected layers and index-sharing between ReLU activation layers. Bottom: the attack module for n-class (n \u2265 2) scenarios.\n\nSeveral regularization-based methods have also been proposed. For example, Gu and Rigazio [9] propose to penalize the Frobenius norm of the Jacobian matrix in a layer-wise fashion. Recently, Cisse et al. [5] and Hein and Andriushchenko [14] theoretically show that the sensitivity to adversarial examples can be controlled by the Lipschitz constant of DNNs and propose Parseval training and cross-Lipschitz regularization, respectively. However, these methods usually require approximations, making them less effective for defending against very strong and advanced adversarial attacks.\nAs a regularization-based method, our Deep Defense is orthogonal to the adversarial training, defense distillation and detect-then-reject methods. It also differs from previous and contemporaneous regularization-based methods (e.g., [9, 5, 14, 31]) in that it endows DNNs with the ability of directly learning from adversarial examples and precisely resisting them.\n\n3 Our Deep Defense Method\n\nMany methods regularize the learning objective of DNNs approximately, which may lead to a degraded prediction accuracy on the benign test sets or unsatisfactory robustness to advanced adversarial examples. We reckon it can be more beneficial to incorporate advanced attack modules into the learning process and learn to maximize a margin. 
In this section, we first briefly analyze a representative gradient-based attack and then introduce our solution to learn from it.\n\n3.1 Generating Adversarial Examples\n\nAs discussed, a lot of effort has been devoted to generating adversarial examples. Let us take the l2 DeepFool as an example here; it is able to conduct 100% successful attacks on advanced networks. Mathematically, starting from a binary classifier f : Rm \u2192 R which makes predictions (of the class label) based on the sign of its outputs, DeepFool generates the adversarial perturbation \u2206x for an arbitrary input vector x \u2208 Rm in a heuristic way. Concretely, \u2206x = r(0) + ... + r(u\u22121), in which the i-th (0 \u2264 i < u) addend r(i) is obtained by taking advantage of Taylor\u2019s theorem and solving:\n\nmin_r ||r||_2, s.t. f(x + \u2206x(i)) + \u2207f(x + \u2206x(i))^T r = 0, (1)\n\nin which \u2206x(i) := \u2211_{j=0}^{i\u22121} r(j), function \u2207f denotes the gradient of f w.r.t. its input, and operator || \u00b7 ||_2 denotes the l2 (i.e., Euclidean) norm. Obviously, Equation (1) has the closed-form solution:\n\nr(i) = \u2212 ( f(x + \u2206x(i)) / ||\u2207f(x + \u2206x(i))||_2^2 ) \u2207f(x + \u2206x(i)). (2)\n\nBy sequentially calculating all the r(i)s with (2), DeepFool obtains a faithful approximation to the \u2206x of minimal l2 norm. In general, the approximation algorithm converges in a reasonably small number of iterations even when f is a non-linear function represented by a very deep neural network, making it both effective and efficient in practical usage. The for-loop for calculating the r(i)s ends in advance if the attack goal sgn(f(x + \u2206x(i))) \u2260 sgn(f(x)) is already reached at some iteration i < u \u2212 1. 
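The binary-case iteration of Equations (1)-(2) can be sketched in a few lines of NumPy; the linear toy classifier below is our own illustrative assumption (for a linear f, a single step lands exactly on the decision boundary):

```python
import numpy as np

def deepfool_binary(x, f, grad_f, max_iter=50):
    """Accumulate r^(i) = -f(x+dx) / ||grad f(x+dx)||_2^2 * grad f(x+dx),
    as in Equation (2), until the sign of f flips (or max_iter is hit)."""
    dx = np.zeros_like(x)
    for _ in range(max_iter):
        if np.sign(f(x + dx)) != np.sign(f(x)):
            break  # attack goal reached early
        g = grad_f(x + dx)
        dx = dx - (f(x + dx) / np.dot(g, g)) * g
    return dx

# Toy linear classifier f(x) = w.x + b (an assumption for illustration).
w, b = np.array([1.0, 2.0]), -0.5
f = lambda x: np.dot(w, x) + b
grad_f = lambda x: w
x = np.array([2.0, 1.0])
dx = deepfool_binary(x, f, grad_f)
assert abs(f(x + dx)) < 1e-6  # x + dx lies on the decision boundary
```

In practice a small overshoot is usually applied to the accumulated perturbation so that the sign actually flips rather than landing exactly on the boundary.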
Similarly, this strategy also works for adversarial attacks on multi-class classifiers, which only additionally require a specified target label in each iteration of the algorithm.\n\n3.2 Perturbation-based Regularization\n\nOur target is to improve the robustness of off-the-shelf networks without modifying their architectures, hence adding a ||\u2206x||_p-based (p \u2208 [1,\u221e)) regularization to their original objective function seems to be a solution.\nConsidering the aforementioned attacks which utilize \u2207f when generating the perturbation \u2206x [34, 8, 26, 36], their strategy can be technically regarded as a function parameterized by the same set of learnable parameters as that of f. Therefore, it is possible to jointly optimize the original network objective and a scaled ||\u2206x||_p as a regularization for some chosen norm operator || \u00b7 ||_p, provided ||\u2206x||_p is differentiable. Specifically, given a set of training samples {(xk, yk)} and a parameterized function f, we may want to optimize:\n\nmin_W \u2211_k L(yk, f(xk;W)) + \u03bb \u2211_k R(\u2212||\u2206xk||_p / ||xk||_p), (3)\n\nin which the set W exhaustively collects the learnable parameters of f, and ||xk||_p is a normalization factor for ||\u2206xk||_p. As will be further detailed in Section 3.4, function R should treat incorrectly and correctly classified samples differently, and it should be monotonically increasing on the latter such that it gives preference to those fs resisting small ||\u2206xk||_p/||xk||_p anyway (e.g., R(t) = exp(t)). Regarding the DNN representations, W may comprise the weights and biases of network connections, the means and variances of batch normalization layers [16], and the slopes of parameterized ReLU layers [12].\n\n3.3 Network-based Formulation\n\nAs previously discussed, we re-formulate the adversarial perturbation as \u2206xk = g(xk;W), in which g needs to be differentiable except for maybe certain points, so that problem (3) can be solved using stochastic gradient descent following the chain rule. In order to make the computation more efficient and easily parallelized, an explicit formulation of g or its gradient w.r.t. W is required. Here we accomplish this task by representing g as a \u201creverse\u201d network to the original one. Taking a two-class multi-layer perceptron (MLP) as an example, we have W = {W0, b0, w1, b1} and\n\nf(xk;W) = w1^T h(W0^T xk + b0) + b1, (4)\n\nin which h denotes the non-linear activation function, and we choose h(W0^T xk + b0) := max(W0^T xk + b0, 0) (i.e., the ReLU activation function) in this paper since it is commonly used. Let us further denote ak := h(W0^T xk + b0) and \u02c6yk := f(xk;W); then we have\n\n\u2207f(xk;W) = W0(1>0(ak) \u2297 w1), (5)\n\nin which \u2297 indicates the element-wise product of two matrices, and 1>0 is an element-wise indicator function that compares the entries of its input with zero.\nWe choose \u2206xk as the previously introduced DeepFool perturbation for simplicity of notation1. Based on Equations (2) and (5), we construct a recursive-flavoured regularizer network (as illustrated in the top left of Figure 1) to calculate R(\u2212||\u2206xk||_p/||xk||_p). It takes image xk as input and calculates each addend of \u2206xk by utilizing an incorporated multi-layer attack module (as illustrated in the top right of Figure 1). Apparently, the original three-layer MLP followed by a multiplicative inverse operator makes up the first half of the attack module, and its \u201creverse\u201d followed by a norm-based rescaling operator makes up the second half. It can be easily proved that the designed network is differentiable w.r.t. each element of W, except for certain points.\n\n1Note that our method also naturally applies to some other gradient-based adversarial attacks.\n\nAs sketched in the bottom of Figure 1, such a network-based formulation can also be naturally generalized to regularize multi-class MLPs with more than one output neuron (i.e., \u02c6yk \u2208 Rn, \u2207f(xk;W) \u2208 Rm\u00d7n and n > 1). 
We use I \u2208 Rn\u00d7n to indicate the identity matrix, and \u02c6lk, lk to indicate the one-hot encodings of the current prediction label and a chosen label to fool in the first iteration, respectively.\nSeeing that current winning DNNs are constructed as a stack of convolution, non-linear activation (e.g., ReLU, parameterized ReLU and sigmoid), normalization (e.g., local response normalization [17] and batch normalization), pooling and fully-connected layers, their \u2207f functions, and thus the g functions, should be differentiable almost everywhere. Consequently, feasible \u201creverse\u201d layers can always be made available for these popular layer types. In addition to the above explored ones (i.e., ReLU and fully-connected layers), we also have deconvolution layers [27], which are reverse to the convolution layers, and unpooling layers [38], which are reverse to the pooling layers, etc. Just note that some learning parameters and variables like filter banks and pooling indices should be shared among them.\n\n3.4 Robustness and Accuracy\n\nProblem (3) integrates an adversarial perturbation-based regularization into the classification objective, which should endow parameterized models with the ability of learning from adversarial attacks and resisting them. Additionally, it is also crucial not to diminish the inference accuracy on benign sets. Goodfellow et al. [8] have shown the possibility of fulfilling such an expectation in a data augmentation manner. Here we explore more on our robust regularization to ensure it does not degrade benign-set accuracies either.\nMost attacks treat all input samples equally [34, 8, 26, 18], regardless of whether or not their predictions match the ground-truth labels. This makes sense when we aim to fool the networks, but not when we leverage the attack module to supervise training. 
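As a sanity check on the explicit gradient of Equation (5), the two-class MLP forward pass (4) and its "reverse" gradient can be written out directly and compared against central finite differences; the random weights and the check itself are our own illustration, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
W0, b0 = rng.normal(size=(4, 3)), rng.normal(size=3)  # first (hidden) layer
w1, b1 = rng.normal(size=3), rng.normal()             # second (output) layer

def f(x):
    # Equation (4): f(x) = w1^T h(W0^T x + b0) + b1, with h = ReLU
    return w1 @ np.maximum(W0.T @ x + b0, 0) + b1

def grad_f(x):
    # Equation (5): grad f = W0 (1_{>0}(a) * w1), where a = h(W0^T x + b0)
    a = np.maximum(W0.T @ x + b0, 0)
    return W0 @ ((a > 0).astype(float) * w1)

x = rng.normal(size=4)
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(4)])
assert np.allclose(num, grad_f(x), atol=1e-4)
```

The `(a > 0)` mask shared between the forward ReLU and its "reverse" is exactly the index-sharing indicated by the elbow connectors in Figure 1.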
Specifically, we might expect a decrease in ||\u2206xk||_p/||xk||_p for any misclassified sample xk, especially when the network is to be \u201cfooled\u201d into classifying it as its ground-truth. This seems different from the objective as formulated in (3), which appears to enlarge the adversarial perturbations for all training samples.\nMoreover, we found it difficult to seek reasonable trade-offs between robustness and accuracy if R is a linear function (e.g., R(z) = z). In that case, the regularization term is dominated by some extremely \u201crobust\u201d samples, so the training samples with relatively small ||\u2206xk||_p/||xk||_p are not fully optimized. This phenomenon can impose a negative impact on the classification objective L and thus the inference accuracy. In fact, for those samples which are already \u201crobust\u201d enough, enlarging ||\u2206xk||_p/||xk||_p is not really necessary. It is appropriate to penalize the currently correctly classified samples with abnormally small ||\u2206xk||_p/||xk||_p values more than those with relatively large ones (i.e., those already considered \u201crobust\u201d in regard of f and \u2206xk).\nTo this end, we rewrite the second term in the objective function of Problem (3) as\n\n\u03bb \u2211_{k\u2208T} R(\u2212c ||\u2206xk||_p / ||xk||_p) + \u03bb \u2211_{k\u2208F} R(d ||\u2206xk||_p / ||xk||_p), (6)\n\nin which F is the index set of misclassified training samples, T is its complement, c, d > 0 are two scaling factors that balance the importance of different samples, and R is chosen as the exponential function. 
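In code, the reweighted regularizer (6) is only a few lines; the sketch below (our own, with R = exp and made-up perturbation ratios) shows how correctly and incorrectly classified samples are treated differently:

```python
import numpy as np

def deep_defense_reg(ratios, correct, lam=15.0, c=25.0, d=5.0):
    """Equation (6): lam * sum_{k in T} exp(-c * rho_k)
                   + lam * sum_{k in F} exp( d * rho_k),
    where rho_k = ||dx_k||_p / ||x_k||_p, T indexes correctly classified
    samples and F the misclassified ones."""
    ratios = np.asarray(ratios, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    t_term = np.exp(-c * ratios[correct]).sum()   # penalize small margins in T
    f_term = np.exp(d * ratios[~correct]).sum()   # shrink wrong-side margins in F
    return lam * (t_term + f_term)

# A correctly classified sample with a small perturbation ratio is
# penalized far more than one that is already "robust".
small = deep_defense_reg([0.01], [True])
large = deep_defense_reg([0.30], [True])
assert small > large
```

The exponential R keeps already-robust samples from dominating the regularizer, which is exactly the failure mode described above for a linear R.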
With an extremely small or large c, our method treats all the samples in T the same; otherwise, those with abnormally small ||\u2206xk||_p/||xk||_p will be penalized more than the others.\n\n4 Experimental Results\n\nIn this section, we evaluate the efficacy of our method on three different datasets: MNIST, CIFAR-10 and ImageNet [32]. We compare our method with adversarial training and Parseval training (also known as Parseval networks). Similar to previous works [26, 1], we choose to fine-tune from pre-trained models instead of training from scratch. Fine-tuning hyper-parameters can be found in the supplementary materials. All our experiments are conducted on an NVIDIA GTX 1080 GPU. Our main results are summarized in Table 1, where the fourth column demonstrates the inference accuracy of different models on benign test images, the fifth column compares the robustness of different models to DeepFool adversarial examples, and the subsequent columns compare the robustness to FGS adversarial examples. The evaluation metrics will be carefully explained in Section 4.1.\n\nTable 1: Test set performance of different defense methods. Column 4: prediction accuracies on benign examples. Column 5: \u03c12 values under the DeepFool attack. Columns 6-8: prediction accuracies on the FGS adversarial examples.\n\nDataset | Network | Method | Acc. | \u03c12 | Acc.@0.2\u03b5ref | Acc.@0.5\u03b5ref | Acc.@1.0\u03b5ref\nMNIST | MLP | Reference | 98.31% | 1.11\u00d710\u22121 | 72.76% | 29.08% | 3.31%\nMNIST | MLP | Par. Train | 98.32% | 1.11\u00d710\u22121 | 77.44% | 28.95% | 2.96%\nMNIST | MLP | Adv. Train I | 98.49% | 1.62\u00d710\u22121 | 87.70% | 59.69% | 22.55%\nMNIST | MLP | Ours | 98.65% | 2.25\u00d710\u22121 | 95.04% | 88.93% | 50.00%\nMNIST | LeNet | Reference | 99.02% | 2.05\u00d710\u22121 | 90.95% | 53.88% | 19.75%\nMNIST | LeNet | Par. Train | 99.10% | 2.03\u00d710\u22121 | 91.68% | 66.48% | 19.64%\nMNIST | LeNet | Adv. Train I | 99.18% | 2.63\u00d710\u22121 | 95.20% | 74.82% | 41.40%\nMNIST | LeNet | Ours | 99.34% | 2.84\u00d710\u22121 | 96.51% | 88.93% | 50.00%\nCIFAR-10 | ConvNet | Reference | 79.74% | 2.59\u00d710\u22122 | 61.62% | 37.84% | 23.85%\nCIFAR-10 | ConvNet | Par. Train | 80.48% | 3.42\u00d710\u22122 | 69.19% | 50.43% | 22.13%\nCIFAR-10 | ConvNet | Adv. Train I | 80.65% | 3.05\u00d710\u22122 | 65.16% | 45.03% | 35.53%\nCIFAR-10 | ConvNet | Ours | 81.70% | 5.32\u00d710\u22122 | 72.15% | 59.02% | 50.00%\nCIFAR-10 | NIN | Reference | 89.64% | 4.20\u00d710\u22122 | 75.61% | 49.22% | 33.56%\nCIFAR-10 | NIN | Par. Train | 88.20% | 4.33\u00d710\u22122 | 75.39% | 49.75% | 17.74%\nCIFAR-10 | NIN | Adv. Train I | 89.87% | 5.25\u00d710\u22122 | 78.87% | 58.85% | 45.90%\nCIFAR-10 | NIN | Ours | 89.96% | 5.58\u00d710\u22122 | 80.70% | 70.73% | 50.00%\nImageNet | AlexNet | Reference | 56.91% | 2.98\u00d710\u22123 | 54.62% | 51.39% | 46.05%\nImageNet | AlexNet | Ours | 57.11% | 4.54\u00d710\u22123 | 55.79% | 53.50% | 50.00%\nImageNet | ResNet | Reference | 69.64% | 1.63\u00d710\u22123 | 63.39% | 54.45% | 41.70%\nImageNet | ResNet | Ours | 69.66% | 2.43\u00d710\u22123 | 65.53% | 59.46% | 50.00%\n\nSome implementation details of the compared methods are shown as follows.\n\nDeep Defense. There are three hyper-parameters in our method: \u03bb, c and d. As previously explained in Section 3.4, they balance the importance of model robustness and benign-set accuracy. We fix \u03bb = 15, c = 25, d = 5 for the major MNIST and CIFAR-10 experiments (except for NIN, where c = 70), and uniformly set \u03bb = 5, c = 500, d = 5 for all ImageNet experiments. The practical impact of varying these hyper-parameters will be discussed in Section 4.2. The Euclidean norm is simply chosen for || \u00b7 ||_p.\n\nAdversarial Training. 
There exist many different versions of adversarial training [34, 8, 26, 18, 24, 22], partly because it can be combined with different attacks. Here we choose two of them, in accordance with the adversarial attacks to be tested, and tune them to reach their optimal performance. First we evaluate the one introduced in the DeepFool paper [26], which utilizes a fixed adversarial training set generated by DeepFool, and summarize its performance in Table 1 (see \u201cAdv. Train I\u201d). We also test Goodfellow et al.\u2019s adversarial training objective [8] (referred to as \u201cAdv. Train II\u201d) and compare it with our method intensively (see supplementary materials), considering there exist trade-offs between the accuracies on benign and adversarial examples. In particular, a combined method is also evaluated to verify our previous claim of orthogonality.\n\nParseval Training. Parseval training [5] improves the robustness of a DNN by controlling its global Lipschitz constant. Practically, a projection update is performed after each stochastic gradient descent iteration to ensure the Parseval tightness of all weight matrices. Following the original paper, we uniformly sample 30% of the columns to perform this update. We set the hyper-parameter \u03b2 = 0.0001 for MNIST, and \u03b2 = 0.0003 for CIFAR-10, after doing grid search.\n\n4.1 Evaluation Metrics\n\nThis subsection explains the evaluation metrics adopted in our experiments. Different lp (e.g., l2 and l\u221e) norms have been used to perform attacks. Here we conduct the famous FGS and DeepFool attacks as representatives of l\u221e and l2 attacks and compare the robustness of the models obtained using different defense methods. 
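For reference, the Parseval tightness retraction of Cisse et al. can be sketched as below; this is our own illustration, applied to a full weight matrix (rather than the 30% column subsample used in the Parseval Training paragraph above) and with a much larger beta than the paper's 0.0001-0.0003 so that convergence is visible in a few iterations:

```python
import numpy as np

def parseval_update(W, beta=0.1):
    """One Parseval retraction step: W <- (1 + beta) W - beta * W W^T W.
    Repeated application pushes all singular values of W toward 1,
    i.e. toward Parseval tightness (W^T W close to the identity)."""
    return (1 + beta) * W - beta * (W @ W.T @ W)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 3))
W = W / np.linalg.norm(W, 2)  # scale so all singular values lie in (0, 1]
for _ in range(200):
    W = parseval_update(W)
assert np.allclose(W.T @ W, np.eye(3), atol=1e-3)
```

In actual Parseval training this cheap retraction replaces an exact (and expensive) projection onto the manifold of orthonormal matrices, which is why only a small beta and a sampled subset of columns are needed per SGD step.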
As suggested in the DeepFool paper [26], when DeepFool is used we evaluate model robustness by calculating\n\n\u03c12 := (1/|D|) \u2211_{k\u2208D} ||\u2206xk||_2 / ||xk||_2, (7)\n\nin which D is the test set (for ImageNet we use its validation set).\nIt is popular to evaluate the accuracy on a perturbed D as a metric for the FGS attack [9, 8, 5]. Likewise, we calculate the smallest \u03b5 such that 50% of the perturbed images are misclassified by our regularized models and denote it as \u03b5ref, then test the prediction accuracies of the models produced by adversarial and Parseval training at this level of perturbation (abbreviated as \u201cAcc.@1.0\u03b5ref\u201d in Table 1). Accuracies at lower levels of perturbation (a half and one fifth of \u03b5ref) are also reported. Many other metrics will be introduced and used for further comparisons in the supplementary materials.\n\nFigure 2: Convergence curves. From left to right: test accuracy and \u03c12 of MLP, and test accuracy and \u03c12 of LeNet. \u201cClean\u201d indicates fine-tuning on benign examples. Best viewed in color.\n\n4.2 Exploratory Experiments on MNIST\n\nAs a popular dataset for conducting adversarial attacks [34, 8, 26], MNIST is a reasonable choice for us to get started. It consists of 70,000 grayscale images, in which 60,000 are used for training and the remainder for test. We train a four-layer MLP and download a LeNet [19] structured CNN model2 as references (see supplementary materials for more details). For fair comparisons, we use identical fine-tuning policies and hyper-parameters for different defense methods. We cut the learning rate by 2\u00d7 after four epochs of training because it can be beneficial for convergence.\n\nRobustness and accuracy. 
The accuracy of different models (on the benign test sets) can be found in the fourth column of Table 1, and the robustness performance is compared in the last four columns. We see Deep Defense consistently and significantly outperforms competitive methods in the sense of both robustness and accuracy, even though our implementation of Adv. Train I achieves slightly better results than those reported in [26]. Using our method, we obtain an MLP model with over 2\u00d7 better robustness to DeepFool and an absolute error decrease of 46.69% under the FGS attack considering \u03b5 = 1.0\u03b5ref, while the inference accuracy also increases considerably (from 98.31% to 98.65%) in comparison with the reference model. The second best is Adv. Train I, which achieves roughly 1.5\u00d7 and an absolute 19.24% improvement under the DeepFool and FGS attacks, respectively. Parseval training also yields models with improved robustness to the FGS attack, but they are still vulnerable to DeepFool. The superiority of our method holds on LeNet, and the benign-set accuracy increases from 99.02% to 99.34% with the help of our method.\nConvergence curves of different methods are provided in Figure 2, in which the \u201cClean\u201d curve indicates fine-tuning on the benign training set with the original learning objective. Our method optimizes a more sophisticated objective than the other methods, so it takes longer to finally converge. However, both its robustness and accuracy performance surpass those of the reference models in only three epochs and keep growing in the last two. Consistent with results reported in [26], we also observe growing accuracy and decreasing \u03c12 for Adv. Train I.\n\nFigure 3: The performance of Deep Defense with varying hyper-parameters on LeNet. Best viewed in color.\n\nIn fact, the benefit of our method to test-set accuracy on benign examples is unsurprising. From a geometrical point of view, an accurate estimation of the optimal perturbation like our \u2206xk represents the distance from a benign example xk to the decision boundary, so maximizing ||\u2206xk|| approximately maximizes the margin. According to some previous theoretical works [37, 33], such regularization of the margin should relieve the overfitting problem of complex learning models (including DNNs) and thus lead to better test-set performance on benign examples.\n\nVarying Hyper-parameters. Figure 3 illustrates the impact of the hyper-parameters in our method. We fix d = 5 and vary c and \u03bb in {5, 10, 15, 20, 25, 30, 35, 40, 45} and {5, 15, 25}, respectively.\n\n2https://github.com/LTS4/DeepFool/blob/master/MATLAB/resources/net.mat\n
:78\r\u000b\u000e\u0012\r\u000b\u000f\r\r\u000b\u000f\u0012\r\u000b\u0010\r\r\u000b\u0010\u0012\r\u000b\u0011\r\r\u000b\u0011\u00129089\u00032\r\u000b\u0001\u0001\u000f\r\u000b\u0001\u0001\u0011\r\u000b\u0001\u0001\u0013\r\u000b\u0001\u0001\u0001\r\u000b\u0001\u0001\r\r\u000b\u0001\u0001\u000f\r\u000b\u0001\u0001\u0011\r\u000b\u0001\u0001\u00139089\u0003,..:7,.\u0005#010703.0!,7\u000b\u0003%7,\u00043\u0018/;\u000b\u0003%7,\u00043\u0003\u0002 :78\u0003\u0014\u0012 :78\u0003\u0014\u000e\u0012 :78\u0003\u0014\u000f\u0012\fNote that d is \ufb01xed here because it has relatively minor effect on our \ufb01ne-tuning process on MNIST.\nIn the \ufb01gure, different solid circles on the same curve indicate different values of c. From left to right,\nthey are calculated with decreasing c, which means a larger c encourages achieving a better accuracy\nbut lower robustness. Conversely, setting a very small c (e.g., c = 5) can yield models with high\nrobustness but low accuracies. By adjusting \u03bb, one changes the numerical range of the regularizer. A\nlarger \u03bb makes the regularizer contributes more to the whole objective function.\nLayer-wise Regularization. We also investigate the im-\nportance of different layers to the robustness of LeNet\nwith our Deep Defense method. Speci\ufb01cally, we mask the\ngradient (by setting its elements to zero) of our adversarial\nregularizer w.r.t. the learning parameters (e.g., weights and\nbiases) of all layers except one. By \ufb01xing \u03bb = 15, d = 5\nand varying c in the set {5, 15, 25, 35, 45}, we obtain 20\ndifferent models. Figure 4 demonstrates the \u03c12 values and\nbenign-set accuracies of these models. Different points\non the same curve correspond to \ufb01ne-tuning with differ-\nent values of c (decreasing from left to right). Legends\nindicate the gradient of which layer is not masked. 
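A minimal sketch of this layer-wise gradient masking, assuming a dict of per-layer regularizer gradients (the layer names and toy gradient values are illustrative, not from the released code):

```python
# Sketch of the layer-wise study's gradient masking (illustrative only):
# the adversarial regularizer's gradient is zeroed for every layer except
# the one under investigation, so only that layer receives the
# regularization signal during fine-tuning.

def mask_gradients(reg_grads, keep_layer):
    """Keep the regularizer gradient of `keep_layer`; zero all others."""
    return {layer: (g if layer == keep_layer else [0.0] * len(g))
            for layer, g in reg_grads.items()}

# Toy per-layer gradients of the adversarial regularizer (hypothetical).
reg_grads = {
    'conv1': [0.3, -0.1],
    'conv2': [0.2, 0.4],
    'fc1':   [-0.5, 0.7],
    'fc2':   [0.1, -0.2],
}

# Regularize only "fc1": the other layers see no adversarial signal.
masked = mask_gradients(reg_grads, keep_layer='fc1')
```

In a framework such as PyTorch, the same effect can be obtained by calling `param.grad.zero_()` on every parameter outside the chosen layer after back-propagating the regularizer.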
Apparently, when only one layer is exploited to regularize the classification objective, optimizing "fc1" achieves the best performance. This is consistent with previous results that "fc1" is the most "redundant" layer of LeNet [11, 10].

Figure 4: The performance of Deep Defense when only one layer is regularized for LeNet. Best viewed in color.

4.3 Image Classification Experiments

For image classification experiments, we verify the effectiveness of our method on several different benchmark networks on the CIFAR-10 and ImageNet datasets.

CIFAR-10 results. We train two CNNs on CIFAR-10: one with the same architecture as in [15], and the other with a network-in-network architecture [20]. Our training procedure is the same as in [26]. We still compare our Deep Defense with adversarial and Parseval training by fine-tuning from the references. Fine-tuning hyper-parameters are summarized in the supplementary materials. Likewise, we cut the learning rate by 2× for the last 10 epochs.

Quantitative comparison results can be found in Table 1, in which the two chosen CNNs are referred to as "ConvNet" and "NIN", respectively. Obviously, our Deep Defense outperforms the other defense methods considerably in all test cases. When compared with the reference models, our regularized models achieve higher test-set accuracies on benign examples and gain absolute error decreases of 26.15% and 16.44% under the FGS attack. For the DeepFool attack, which might be stronger, our method gains 2.1× and 1.3× better robustness on the two networks.

ImageNet results. As a challenging classification dataset, ImageNet consists of millions of high-resolution images [32].
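Throughout these comparisons, robustness is measured with the DeepFool-based score ρ2, i.e., the norm of the estimated minimal adversarial perturbation, averaged over the test set relative to the input norm [26]. As a hedged toy illustration rather than the paper's implementation: for an affine binary classifier f(x) = w·x + b, the minimal perturbation has the closed form Δx = −f(x)·w/‖w‖², whose norm |f(x)|/‖w‖ is exactly the margin of x.

```python
import math

# Toy illustration (not the paper's code): closed-form minimal L2
# perturbation for an affine binary classifier, and the rho_2-style
# robustness score built from it (larger = more robust).

def minimal_perturbation(w, b, x):
    """Smallest L2 step moving x onto the decision boundary of w.x + b."""
    fx = sum(wi * xi for wi, xi in zip(w, x)) + b
    w_sq = sum(wi * wi for wi in w)
    return [-fx * wi / w_sq for wi in w]

def rho2(samples, w, b):
    """Average ||dx|| / ||x|| over a set of samples."""
    norm = lambda v: math.sqrt(sum(vi * vi for vi in v))
    return sum(norm(minimal_perturbation(w, b, x)) / norm(x)
               for x in samples) / len(samples)

w, b = [3.0, 4.0], -5.0
score = rho2([[1.0, 2.0], [2.0, 1.0]], w, b)
```

For deep networks no closed form exists; DeepFool estimates Δx iteratively from local linearizations, but the interpretation of ρ2 is the same: stronger perturbations are needed to fool a more robust model.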
To verify the efficacy and scalability of our method, we collect well-trained AlexNet [17] and ResNet-18 [13] models from the Caffe and PyTorch model zoos respectively, fine-tune them on the ILSVRC-2012 training set using our Deep Defense, and test them on the validation set. After only 10 epochs of fine-tuning for AlexNet and 1 epoch for ResNet, we achieve roughly 1.5× improved robustness to the DeepFool attack on both architectures, along with a slightly increased benign-set accuracy, highlighting the effectiveness of our method.

5 Conclusion

In this paper, we investigate the vulnerability of DNNs to adversarial examples and propose a novel method to address it, by incorporating an adversarial perturbation-based regularizer into the classification objective. This endows DNNs with an ability to directly learn from attacks and precisely resist them. We cast the joint optimization problem as learning a recursive-flavoured network, so that it can be solved efficiently. Extensive experiments on MNIST, CIFAR-10 and ImageNet have shown the effectiveness of our method. In particular, when combined with FGS-based adversarial learning, our method achieves even better results on various benchmarks. Future work shall include explorations on resisting black-box attacks and attacks in the physical world.

Acknowledgments

This work is supported by NSFC (Grant No. 61876095, No. 61751308 and No. 61473167) and Beijing Natural Science Foundation (Grant No. L172037).

References

[1] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.

[2] Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In ICLR, 2018.

[3] Nicholas Carlini and David Wagner.
Adversarial examples are not easily detected: Bypassing ten detection methods. In ACM Workshop on Artificial Intelligence and Security, 2017.

[4] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), 2017.

[5] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In ICML, 2017.

[6] Guneet S Dhillon, Kamyar Azizzadenesheli, Zachary C Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Anima Anandkumar. Stochastic activation pruning for robust adversarial defense. In ICLR, 2018.

[7] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In NIPS, 2016.

[8] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.

[9] Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. In ICLR Workshop, 2015.

[10] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In NIPS, 2016.

[11] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[14] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In NIPS, 2017.

[15] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov.
Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[18] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In ICLR, 2017.

[19] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. Shape, contour and grouping in computer vision, pages 823–823, 1999.

[20] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In ICLR, 2014.

[21] Jiajun Lu, Theerasit Issaranon, and David Forsyth. Safetynet: Detecting and rejecting adversarial examples robustly. In ICCV, 2017.

[22] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.

[23] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. In ICLR, 2017.

[24] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976, 2017.

[25] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In CVPR, 2017.

[26] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In CVPR, 2016.

[27] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation.
In ICCV, 2015.

[28] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Asia Conference on Computer and Communications Security, 2017.

[29] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. Towards the science of security and privacy in machine learning. In IEEE European Symposium on Security and Privacy, 2018.

[30] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy (SP), 2016.

[31] Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In AAAI, 2018.

[32] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. IJCV, 2015.

[33] Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 2017.

[34] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.

[35] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating adversarial effects through randomization. In ICLR, 2018.

[36] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. Adversarial examples for semantic segmentation and object detection. In ICCV, 2017.

[37] Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.

[38] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks.
In ECCV, 2014.