{"title": "Robust Detection of Adversarial Attacks by Modeling the Intrinsic Properties of Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7913, "page_last": 7922, "abstract": "It has been shown that deep neural network (DNN) based classifiers are vulnerable to human-imperceptive adversarial perturbations which can cause DNN classifiers to output wrong predictions with high confidence. We propose an unsupervised learning approach to detect adversarial inputs without any knowledge of attackers. Our approach tries to capture the intrinsic properties of a DNN classifier and uses them to detect adversarial inputs. The intrinsic properties used in this study are the output distributions of the hidden neurons in a DNN classifier presented with natural images. Our approach can be easily applied to any DNN classifiers or combined with other defense strategy to improve robustness. Experimental results show that our approach demonstrates state-of-the-art robustness in defending black-box and gray-box attacks.", "full_text": "Robust Detection of Adversarial Attacks by Modeling\n\nthe Intrinsic Properties of Deep Neural Networks\n\nZhihao Zheng\n\nPengyu Hong\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nBrandeis University\nWaltham, MA 02453\n\nzhihaozh@brandeis.edu\n\nBrandeis University\nWaltham, MA 02453\n\nhongpeng@brandeis.edu\n\nAbstract\n\nIt has been shown that deep neural network (DNN) based classi\ufb01ers are vulnerable\nto human-imperceptive adversarial perturbations which can cause DNN classi\ufb01ers\nto output wrong predictions with high con\ufb01dence. We propose an unsupervised\nlearning approach to detect adversarial inputs without any knowledge of attackers.\nOur approach tries to capture the intrinsic properties of a DNN classi\ufb01er and uses\nthem to detect adversarial inputs. 
The intrinsic properties used in this study are the output distributions of the hidden neurons in a DNN classifier presented with natural images. Our approach can be easily applied to any DNN classifier or combined with other defense strategies to improve robustness. Experimental results show that our approach demonstrates state-of-the-art robustness in defending against black-box and gray-box attacks.\n\n1 Introduction\n\nSince the successful application of deep convolutional neural networks to large-scale image recognition [26] by Krizhevsky et al. [11], neural network based Deep Learning has gained significant attention. Researchers have shown that deep neural networks (DNNs) are able to deliver state-of-the-art performance in various fields, such as robotics [20, 7], self-driving cars [1, 2], face recognition for identification [33], games [29, 30, 19], biomedical image processing [6, 28], and so on. Despite these successes, DNN-based classifiers have a severe weakness [34]. For example, knowing the architecture and parameters of a DNN classifier (i.e., white-box attack), an adversarial example can be easily constructed to fool the DNN classifier by applying a small perturbation to an input image. Even though the perturbations are too small to affect human recognition, the DNN classifier can misclassify the perturbed input with high confidence. Successful attacks can also be black-box, in which the architecture and parameters of a DNN are unknown to the attackers [23]. Interestingly, adversarial images remain malicious even after being printed out and then fed to a well-trained DNN [12]. A variety of algorithms have been developed to generate powerful attacks [34, 9, 12, 24, 32, 4, 21, 22]. Without proper safeguards, users of DNN-based applications can be exposed to unforeseen hazardous situations caused by \"trivial\" noises. Various attempts have been made to defend against adversarial attacks. 
Papernot et al. [25] proposed a defensive distillation approach, which reduces the magnitude of gradients during training, to make the trained model more robust to input perturbations. However, it was later shown that the distillation approach was still highly vulnerable to attacks [3]. Recently, the adversarial training strategy became popular [9, 18, 13]. This strategy augments the training data with adversarial examples to enhance the capability of DNNs to deal with targeted attacks. It focuses more on defending against black-box attacks, and usually does not consider white-box attacks. Hence, it is still vulnerable to iterative attacks [13]. In addition, the robustness of the model trained by this strategy depends on the attacks covered by the adversarial training examples. Another strategy is\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fto actively detect adversarial inputs [17, 15] by training a two-class classifier that takes the hidden states of a DNN as the input to tell if an input is adversarial. Unfortunately, this strategy can seriously suffer from attacks unseen during the training procedure.\n\nIn this paper, we propose a strategy for detecting adversarial inputs by modeling the intrinsic properties of a DNN classifier. This strategy does not need to know the attack methods or require training the classifier with adversarial samples. Therefore, it will suffer less from unseen attacks. We implement an approach, termed I-defender (\"I\" stands for intrinsic), which explores one of the intrinsic properties of a DNN classifier, i.e., the distributions of its hidden states given natural training data. We reason as follows: when the DNN classifier mis-assigns a specific class label to an adversarial input, its hidden states are quite different from those given natural data of the same class. 
I-defender models the hidden state distributions of a DNN classifier given natural data and uses them to detect adversarial inputs. We do not attempt to model the hidden state distributions of a classifier presented with adversarial inputs because such distributions will depend specifically on the attack methods that generate the adversarial training inputs. For the same reason, we do not try to build a model to distinguish the hidden state distributions of a classifier presented with natural inputs from those of the classifier presented with adversarial inputs. We do not try to model the input distribution because the input space is of much higher dimension. In addition, the hidden state distributions encode the generalization power of the model.\n\n2 Related Works\n\n2.1 Adversarial Attacks\n\nFast Gradient Sign Method (FGSM) [9] efficiently generates adversarial examples. The perturbation \u03c1 is computed along the direction in the input space which maximally increases a linearized cost function under the \u2113\u221e norm:\n\n\u03c1 = \u03b5 \u00b7 sign(\u2207J(\u03b8, x, l))   (1)\n\nwhere \u03b5 is a scalar to restrict the norm of the perturbations, \u2207J is the gradient of the cost function, and x is the original input with its true class label l. The perturbation generated by this method is usually small with respect to the maximum value of the input.\n\nBasic Iterative Method (BIM) [12] is an iterative method. At each iteration, it adds a small perturbation decided by the gradient \u2207J at the current version of the perturbed input and clips the modifications to the range of \u03b5 from the original input:\n\nx^{i+1}_{adv} = clip_\u03b5{x^i_{adv} + \u03b1 \u00b7 sign(\u2207J(\u03b8, x^i_{adv}, l))}   (2)\n\nwhere x^i_{adv} denotes the perturbed input generated at the i-th iteration. Usually, \u03b1 is set to 1, and the number of iterations is set to 10. Based on this approach, Metzen et al. 
[17] developed another iterative method, which produces each perturbation in the direction of the \u2113_2-normalized gradient and projects the perturbed version back to the \u03b5-ball around the original input if the \u2113_2 distance between them exceeds \u03b5:\n\nx^{i+1}_{adv} = project_\u03b5{x^i_{adv} + \u03b1 \u00b7 \u2207J(\u03b8, x^i_{adv}, l) / ||\u2207J(\u03b8, x^i_{adv}, l)||_2}   (3)\n\nDeepFool Method [21] iteratively computes the minimal norm adversarial perturbation. Inputs are assumed to reside in a region confined by the decision boundary of a classifier. In the i-th iteration, the classifier is linearized around the perturbed input x^i_{adv}. The algorithm finds the closest class boundary and takes the minimal step to traverse the boundary according to the \u2113_p-norm distance from x^i_{adv}. The perturbations are accumulated onto the original input until misclassification is achieved. DeepFool is able to achieve the same level of attack success as FGSM while using smaller perturbations.\n\n2.2 Perturbation Detection Methods\n\nMetzen et al. [17] proposed to augment a DNN with a subnetwork that focuses on adversarial perturbation detection. The subnetwork connects to each layer of the DNN and is trained separately\n\n2\n\n\fusing a dataset containing both natural samples and adversarial samples generated by known attack methods. Although this method can achieve a certain degree of success, the trained subnetwork can be easily fooled by adversarial examples generated by attack methods unseen during training [15]. Lu et al. [15] introduced SafetyNet, which trains a Support Vector Machine [5] to detect the boundary between natural and perturbed data in the space of the quantified features from a DNN. Similar to the above subnetwork approach, SafetyNet is trained using certain attacking methods. Although it may produce more robust results, it still suffers from unseen adversarial patterns. Samangouei et al. 
[27] proposed Defense-GAN to leverage the expressive capability of Generative Adversarial Networks [8] to defend against attacks. Defense-GAN trains a generative model of the distribution of natural inputs. To detect adversarial inputs, Defense-GAN projects an input onto the range of the GAN generator by a Gradient Descent (GD) procedure that minimizes the Wasserstein distance between the input and the sample generated by the GAN generator. The GD procedure is run several times with different seed inputs. An input is detected as an attack if the minimal Wasserstein distance between the input and the generated samples is larger than a threshold. To achieve a higher accuracy, Defense-GAN needs to try more seed inputs and run more GD iterations, which is time-consuming. Its performance also relies on the quality of its GAN, which can be challenging to train for complex tasks.\n\n3 I-defender\n\nFigure 1: Hidden state distribution examples. The architecture of the DNN is specified in Section 4.2. The DNN was trained on CIFAR-10. (a) The IHSDs of two classes (class 1: automobile and class 2: bird) are plotted in the 2D subspace with the largest variances. (b) The intrinsic distribution (green) of a hidden state of the airplane class versus the distribution (red) of the same hidden state when the DNN misclassifies perturbed inputs as airplane.\n\nSzegedy et al. [34] interpreted adversarial examples as low-probability pockets in the manifold which are never or rarely seen during training. There are so many of them that attackers can easily exploit them to fool the trained classifier. Typically, adversarial attacks can be generated by traversing the high-dimensional decision space along the direction of the gradient to reach those pockets. Adversarial training can be viewed as a method to fill those pockets with adversarial examples generated by known attack methods. 
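The gradient-based traversal described above (FGSM, Eq. 1, and BIM, Eq. 2) can be sketched in a few lines. Below is a minimal numpy illustration against a toy linear softmax classifier standing in for a DNN; the model `(W, b)`, the helper names, and all parameter values are illustrative assumptions, not the setup used in this paper's experiments.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_gradient(W, b, x, label):
    """Gradient of the cross-entropy loss w.r.t. the input x
    for a linear softmax classifier (a stand-in for a DNN)."""
    p = softmax(W @ x + b)
    p[label] -= 1.0            # dL/dlogits = p - onehot(label)
    return W.T @ p             # chain rule back to the input

def fgsm(W, b, x, label, eps):
    """FGSM (Eq. 1): one step of size eps along the sign of the gradient."""
    return x + eps * np.sign(input_gradient(W, b, x, label))

def bim(W, b, x, label, eps, alpha=0.01, iters=10):
    """BIM (Eq. 2): repeated small FGSM steps, clipped to stay inside
    an L-infinity ball of radius eps around the original input."""
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv + alpha * np.sign(input_gradient(W, b, x_adv, label))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv
```

Both routines only ever move each coordinate by at most eps, which is why the perturbations stay "human-imperceptible" while still increasing the loss.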
However, when dealing with complex applications, this strategy may not be effective because there can be an infinite number of low-probability pockets. Instead of trying to fill those pockets, Defense-GAN tries to model the input distribution, which however can be too complex to deal with. On the other hand, given natural data, the distributions of the hidden states (i.e., outputs of hidden neurons) of a DNN classifier can be much simpler (e.g., Figure 1a). The dimensions of the hidden state spaces are often much lower than that of the input space, which can make the hidden state distributions much easier to model than the input distribution. We call a hidden state distribution of a DNN presented with natural data an intrinsic hidden state distribution (IHSD), which characterizes certain intrinsic properties of the DNN. I-defender uses the IHSDs of a classifier to reject adversarial inputs because they tend to produce hidden states lying in the low-density regions of the IHSDs (e.g., see Figure 1b). I-defender can be easily attached to any model that produces internal representations.\n\n3\n\n\fIn our current implementation, I-defender uses a Gaussian Mixture Model (GMM) to approximate the IHSD of each class as follows:\n\np(H(x)|\u03b8, c) = \u2211_{k=1}^{K} w_{ck} N(H(x)|\u00b5_{ck}, \u03a3_{ck})   (4)\n\nwhere H(x) denotes the hidden state of an input x belonging to the c-th class, \u03b8 denotes the DNN classifier of interest (or its parameters), and w_{ck}, \u00b5_{ck} and \u03a3_{ck} are the weight, mean and covariance matrix of the k-th Gaussian component in the mixture model of the c-th class. 
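The per-class mixture of Eq. (4) and the likelihood-threshold rejection it supports can be sketched with scikit-learn's GaussianMixture. This is a minimal illustration assuming the hidden states have already been collected from natural training data; the component count, covariance type, and helper names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ihsds(hidden_states, labels, n_components=5):
    """Fit one GMM per class (Eq. 4) to the hidden states collected
    from natural training data. Returns {class label: fitted GMM}."""
    ihsds = {}
    for c in np.unique(labels):
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='full', random_state=0)
        ihsds[c] = gmm.fit(hidden_states[labels == c])
    return ihsds

def is_rejected(ihsds, thresholds, h, c):
    """Flag an input as adversarial if the log-likelihood of its hidden
    state h under the predicted class c's IHSD falls below TH_c."""
    return ihsds[c].score_samples(h.reshape(1, -1))[0] < thresholds[c]
```

In practice the thresholds can be set from the natural training data itself, e.g. as a low quantile of each class's own log-likelihoods, so that a chosen fraction of natural inputs is retained.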
After training the DNN classifier, we feed all training samples into it and collect the corresponding hidden states for training a GMM for each class using the EM algorithm [16].\n\nIn this study, all DNN classifiers consisted of convolutional layers followed by fully connected layers. The states of the convolutional layers are position-dependent, which makes them non-trivial to model directly. Thus, we choose to model only the states of the fully connected hidden layers. For each class c, a threshold TH_c is chosen to reject inputs by checking if their likelihoods are lower than TH_c:\n\nReject(x, c) = p(H(x)|\u03b8, c) < TH_c   (5)\n\n4 Experiments\n\nWe evaluated I-defender on standard datasets (MNIST, F-MNIST, CIFAR-10) against several attack methods including \u2113\u221e-norm Iterative, \u2113_2-norm Iterative, FGSM, and DeepFool. The number of iterations was set to 10 for the iterative attack methods (\u2113\u221e and \u2113_2). The results are organized according to the following attack types:\n\n1. Black-box Attack: Attackers know nothing about the defense strategy and use a substitute network to generate adversarial samples.\n\n2. Semi White-box Attack: Attackers know all details of the DNN classifier, but have no knowledge of its defense strategy. Attackers use the gradients from the DNN to generate adversarial samples.\n\n3. Gray-box Attack: Attackers know the architecture of the DNN classifier and its defense strategy, but have no knowledge of their parameters. Attackers use a network of the same architecture and the same defense strategy to generate adversarial samples.\n\n4.1 Black-box Attack\n\nIn this experiment, we used two datasets: the MNIST dataset [14] and the F-MNIST dataset [35], a more challenging replacement of the MNIST dataset. It was shown that the Defense-GAN detection method outperformed previous approaches on these two datasets under the FGSM attack. 
Hence, we first compared I-defender only with the Defense-GAN detection under the FGSM attack. We used the same experiment settings as those used in the Defense-GAN detection experiment [27]. Attackers generated adversarial examples using model E, and used them to attack model F (see Table 1 for details of models E and F). The results are summarized in Tables 2-5 (the results of the Defense-GAN detection are copied from [27]).\n\nTable 1: Architectures of Models E and F. The architectures are the same as those used in the adversary detection experiment in [27]. FC(n) denotes a fully connected layer with n neurons. Conv(k, w\u00d7h, s) denotes a convolutional layer with k output features, filter size w \u00d7 h and stride s. ReLU is the Rectified Linear Unit activation.\n\nModel E: FC(200), ReLU, FC(200), ReLU, FC(10)+Softmax\nModel F: Conv(64, 8\u00d78, 2), ReLU, Conv(128, 6\u00d76, 2), ReLU, FC(10)+Softmax\n\n4\n\n\fTable 2: I-defender vs Defense-GAN of different settings. The FGSM attack used \u03b5 = 0.3. The MNIST data was used.\n\nMethod | Detection AUC | Number of GD runs | Iterations per GD run\nI-defender | 0.993 | N/A | N/A\nDefense-GAN | 1.0 | 10 | 800\nDefense-GAN | 1.0 | 10 | 400\nDefense-GAN | 0.985 | 10 | 50\nDefense-GAN | 0.982 | 5 | 100\nDefense-GAN | 0.922 | 2 | 100\nDefense-GAN | 0.836 | 1 | 100\n\nTable 3: I-defender vs Defense-GAN (10 GD runs and 400 iterations in each run) under the FGSM attack on the MNIST data. The detection AUC is used as the measurement. The ROC curves of I-defender are shown in Figure 2.\n\n\u03b5 | Defense-GAN | I-defender\n0.1 | 0.914 | 0.964\n0.15 | 0.975 | 0.979\n0.2 | 0.989 | 0.988\n0.25 | 0.998 | 0.991\n0.3 | 0.999 | 0.993\n\nTable 4: I-defender vs Defense-GAN of different settings under the FGSM attack with \u03b5 = 0.3. 
The F-MNIST data was used.\n\nMethod | Detection AUC | Number of GD runs | Iterations per GD run\nI-defender | 0.985 | N/A | N/A\nDefense-GAN | 0.987 | 10 | 800\nDefense-GAN | 0.983 | 10 | 400\nDefense-GAN | 0.965 | 10 | 100\nDefense-GAN | 0.945 | 5 | 100\nDefense-GAN | 0.935 | 10 | 25\nDefense-GAN | 0.876 | 2 | 100\nDefense-GAN | 0.794 | 1 | 100\n\nTable 5: I-defender vs Defense-GAN (10 GD runs and 200 iterations in each run) under the FGSM attack. The F-MNIST data was used. The detection AUC is used as the measurement. The ROC curves of I-defender are shown in Figure 3.\n\n\u03b5 | Defense-GAN | I-defender\n0.1 | 0.775 | 0.9302\n0.15 | 0.884 | 0.9587\n0.2 | 0.940 | 0.9722\n0.25 | 0.969 | 0.9807\n0.3 | 0.985 | 0.9850\n\n5\n\n\fFigure 2: The ROC curves of I-defender attacked by FGSM with different \u03b5 on the MNIST dataset.\n\nFigure 3: The ROC curves of I-defender attacked by FGSM with different \u03b5 on the F-MNIST dataset.\n\nFigure 4: The ROC curves of I-defender attacked by various methods on the MNIST dataset.\n\nFigure 5: The ROC curves of I-defender attacked by various methods on the F-MNIST dataset.\n\nThe performance of the Defense-GAN detection highly depends on its hyper-parameters (i.e., the number of GD runs and the number of GD iterations). Tables 2 and 4 show that I-defender outperforms the Defense-GAN configurations with fewer than 10 GD runs and fewer than 400 GD iterations. Although Defense-GAN is able to produce a slightly better result than I-defender by trying more GD runs (e.g., 10) and more GD iterations (e.g., 800), we estimated that the running time of Defense-GAN (reported in [27]) was 105 times slower than that of I-defender. Tables 3 and 5 summarize the effects of changing the attacking power (\u03b5) of FGSM, and the corresponding ROC curves are shown in Figures 2 and 3. I-defender suffers much less than Defense-GAN when the perturbation level is more subtle. 
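The detection AUC values reported above can be computed directly from the two sets of per-input log-likelihood scores (natural vs. adversarial) without fixing a threshold. A minimal sketch using the Mann-Whitney formulation of AUC; the function name is ours, not part of the paper.

```python
import numpy as np

def detection_auc(natural_ll, adversarial_ll):
    """Detection AUC of a likelihood-threshold detector: the probability
    that a random natural input scores a higher IHSD log-likelihood than
    a random adversarial input (ties counted as 1/2)."""
    nat = np.asarray(natural_ll, dtype=float)
    adv = np.asarray(adversarial_ll, dtype=float)
    diff = nat[:, None] - adv[None, :]   # all natural/adversarial pairs
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))
```

An AUC of 1.0 means every natural input scores above every adversarial one; 0.5 means the scores are indistinguishable.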
We think this is because I-defender models the hidden state distributions while Defense-GAN models the input distribution. It is explained in [9] that small input perturbations can be amplified across layers and cause a hidden state to grow by \u03c9^T\u03b7, where \u03c9 denotes a weight vector and \u03b7 the perturbation. Thus, even if a perturbation is small, the hidden states can be altered significantly and be easily detected by I-defender. We also tested I-defender using other attacking methods. The ROC curves and AUC values show that I-defender is robust (see Figures 4 and 5).\n\n4.2 Semi White-box Attack\n\nIn this experiment, we used the CIFAR-10 dataset [10], which is more complex than MNIST and F-MNIST. We trained a 34-layer wide residual network [36] with k = 8 as the classifier. Since Defense-GAN did not report results on CIFAR-10, we compared I-defender with two supervised detection methods, SafetyNet [15] and Subnetwork [17]. The experiment settings (e.g., the attack methods, the attack strengths, the balanced data for evaluating detection accuracy, etc.) were the same as those reported in [15]. The results are summarized in Table 6. Supervised methods perform better against the attack methods/strengths that are used in their training phases. Their performances drop\n\n6\n\n\fsignificantly when facing unseen attack methods/strengths. I-defender performs more consistently across different attack methods/strengths, and significantly outperforms supervised methods on unseen attack methods/strengths.\n\nTable 6: Semi white-box attack results measured by detection AUC. The results of SafetyNet (SVM and mSVM) and Subnetwork are incorporated from [15]. The column headers denote the attack methods (Iter-\u2113\u221e: Iterative-\u2113\u221e; Iter-\u2113_2: Iterative-\u2113_2) and their attack strengths indicated by \"Adv Acc\" (i.e., the accuracy of the classifier on adversarial samples). 
The parameters of the attacking methods were set to match the Adv Acc\u2019s specified in [15]. Both SafetyNet and Subnetwork were trained with Iterative \u2113\u221e at Adv Acc = 13.14% (i.e., the attack in the first column), and then were generalized to the other attacks in the remaining columns. The results of I-defender, ID-95 and ID-99, were obtained by setting the likelihood thresholds to keep 95% and 99% of the natural training data, respectively.\n\nMethod | Iter-\u2113\u221e (13.14%) | Iter-\u2113_2 (10.80%) | FGSM (27.84%) | DeepFool (24.53%) | Iter-\u2113\u221e (29.56%) | FGSM (45.68%)\nSVM | 83.6 | 84.840 | 75.545 | 78.305 | 76.330 | 61.755\nmSVM | 92.52 | 93.915 | 74.480 | 86.635 | 76.670 | 60.895\nSubnet | 98.235 | 98.660 | 68.980 | 49.270 | 49.295 | 49.27\nID-95 | 81.28 | 79.398 | 87.464 | 87.636 | 78.730 | 85.024\nID-99 | 79.04 | 74.173 | 90.485 | 90.709 | 74.580 | 87.596\n\n4.3 Gray-box Attack\n\nIn a gray-box attack, an attacker knows the structure and defense strategy of a DNN classifier, but has no knowledge of its parameters. Adversarial training is able to increase the robustness of a DNN classifier when attacked by single-step gray-box methods, but not by iterative gray-box methods. To test the robustness of I-defender under gray-box attacks, we trained two deep networks with the same structure independently on the same natural training data, and tested I-defender on three attack methods: FGSM, Iterative \u2113\u221e, and Iterative \u2113_2. We let the attackers maximize the following function to derive a perturbed input x:\n\narg max_x  J(x, y_s|\u03b8) + \u03b1 \u00b7 \u220f_{k\u2260s} Reject(x, y_k) \u00d7 max_{k\u2260s} log P(H(x)|y_k)   (6)\n\nwhere the first term J(\u00b7) 
denotes the cross-entropy loss of classifying x into its true label y_s by a classifier with parameters \u03b8 (this term encourages finding a perturbed input x that leads to misclassification), and the second term penalizes x if it is detected by our defense mechanism (i.e., it is not accepted by any class other than y_s because its likelihoods under all other classes are lower than the corresponding thresholds). The second term encourages moving x towards the most promising class other than its true class y_s.\n\nOur first experiment was on CIFAR-10, and the results (Tables 7 and 8) indicate the following. Although an attacker can successfully exploit the weakness in an I-defender that is fully exposed to the attacker, the attacker can hardly affect the detection power of another I-defender whose parameters are unknown to the attacker. This is because the two classifiers are trained independently and can have very different hidden state spaces/distributions. Moreover, a change of DNN architecture (to a more complex one) should not greatly influence the performance of our approach. We also carried out an experiment on ImageNet by setting \u03b1 to 1. Without deploying I-defender, the adversarial accuracy of a \"Target\" DNN can easily drop to around 10%. I-defender made it significantly harder for an attacker to succeed (Table 9). In addition, the lower the adversarial accuracy of the \"Target\" DNN, the higher the detection AUC. This indicates that I-defender can be used to effectively defend against attacks. We think this is because a DNN architecture appropriate for a more difficult task (e.g., ImageNet) is sophisticated and has many distinct local minima. Hence, if trained twice independently, it will produce two DNN instances with very different hidden state spaces/distributions. 
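The gray-box objective of Eq. (6) can be sketched as follows. This is a hedged toy illustration: `loss_fn` and `log_p_fn` stand in for the classifier's cross-entropy loss and the per-class IHSD log-likelihoods, the optimization loop over x is omitted, and all names are ours rather than the authors' implementation.

```python
import numpy as np

def graybox_objective(x, true_class, loss_fn, log_p_fn, thresholds, alpha=1.0):
    """Eq. (6): the classification loss on the true class, plus a term that
    is active only while every class other than the true one rejects x; it
    pushes x toward the most promising (highest-likelihood) other class."""
    others = [k for k in range(len(thresholds)) if k != true_class]
    log_ps = np.array([log_p_fn(x, k) for k in others])
    rejected = np.array([log_ps[i] < thresholds[k]
                         for i, k in enumerate(others)])
    # product of rejection indicators: 1 iff all other classes reject x
    return loss_fn(x, true_class) + alpha * np.prod(rejected) * log_ps.max()
```

Once some class other than the true one accepts x, the product of rejection indicators becomes zero and the penalty term vanishes, leaving only the misclassification objective.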
Therefore, a perturbed input generated according to a \"Source\" DNN can easily lead to a remarkably different hidden state configuration in a \"Target\" DNN, which means it is challenging for attackers to rely on a \"Source\" DNN to successfully attack a \"Target\" DNN.\n\n7\n\n\fTable 7: Defense against gray-box attacks on CIFAR-10. Adversarial examples were generated from the \"Source\" DNN (WRN34-8 [36]) and tested on both the \"Source\" and \"Target\" DNNs (WRN34-8). Performances are measured by detection AUC (Target / Source).\n\n\u03b1 | Iterative-\u2113\u221e (\u03b5 = 0.013) | Iterative-\u2113_2 (\u03b5 = 0.47)\n1e-3 | 85.203 / 76.450 | 86.249 / 75.786\n1e-2 | 83.261 / 58.967 | 84.740 / 69.243\n1e-1 | 83.875 / 59.459 | 85.406 / 71.179\n1 | 83.945 / 59.554 | 85.576 / 72.027\n\n\u03b1 | Iterative-\u2113\u221e (\u03b5 = 0.0085) | FGSM (\u03b5 = 0.075)\n1e-3 | 86.130 / 74.463 | 94.472 / 93.564\n1e-2 | 85.485 / 63.151 | 94.480 / 93.387\n1e-1 | 85.777 / 63.598 | 94.450 / 93.359\n1 | 85.745 / 63.590 | 94.457 / 93.355\n\nTable 8: Evaluation of the effects of DNN size on the performance of I-defender. We compared WRN34-8 and WRN46-8 in this experiment (WRN-34 / WRN-46).\n\n\u03b1 | Iterative-\u2113\u221e (\u03b5 = 0.013) | Iterative-\u2113_2 (\u03b5 = 0.47)\n1e-3 | 85.203 / 87.549 | 86.249 / 86.23\n1e-2 | 83.261 / 87.555 | 84.740 / 86.215\n1e-1 | 83.875 / 87.637 | 85.406 / 85.844\n1 | 83.945 / 87.553 | 85.576 / 85.528\n\n\u03b1 | Iterative-\u2113\u221e (\u03b5 = 0.0085) | FGSM (\u03b5 = 0.075)\n1e-3 | 86.130 / 85.960 | 94.472 / 94.060\n1e-2 | 85.485 / 86.084 | 94.480 / 94.159\n1e-1 | 85.777 / 85.969 | 94.450 / 94.174\n1 | 85.745 / 86.113 | 94.457 / 94.111\n\nTable 9: Defense against gray-box attacks on ImageNet. Both the \"Source\" and \"Target\" DNNs used VGG19 [31]. Note that the accuracy of the \"Target\" DNN on the natural data is 71.028%. 
The \"Adv Acc\" represents the adversarial accuracy of the \"Target\" DNN under attacks. Performances are measured by detection AUC (Adv Acc / AUC).\n\n\u03b1 | FGSM (\u03b5 = 0.3) | Iter-\u2113\u221e (\u03b5 = 0.019) | Iterative-\u2113_2 (\u03b5 = 12.5)\n1e-2 | 62.636 / 97.201 | 70.892 / 62.685 | 59.032 / 93.832\n1e-1 | 62.646 / 97.211 | 70.892 / 62.801 | 59.016 / 93.837\n1 | 62.648 / 97.215 | 70.892 / 62.612 | 59.010 / 93.834\n\n8\n\n\f5 Conclusion and Discussion\n\nWe show that modeling the intrinsic properties of a DNN classifier can be a reliable strategy for detecting adversarial attacks. This strategy does not need any knowledge about attack methods. Hence, it does not suffer from attack methods unseen during training and is able to robustly defend against various black-box and gray-box attacks, which is sufficient for most application scenarios. Our implementation of this strategy uses a GMM to approximate the hidden state distributions of a DNN classifier. Experimental results validate that our implementation achieves state-of-the-art performance among unsupervised methods and generalizes better than supervised ones. Our method is straightforward, can be easily incorporated into any DNN-based classifier, and can also be easily combined with any existing defense strategies. Depending on the application, one can replace the GMM with other more appropriate models to approximate hidden state distributions. Since our method models the hidden states of a DNN instead of the inputs, it can be directly applied to other modalities (such as text).\n\nReferences\n\n[1] E. Ackerman. How Drive.ai is mastering autonomous driving with deep learning. IEEE Spectrum, March, 2017.\n\n[2] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.\n\n[3] N. Carlini and D. 
Wagner. Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311, 2016.\n\n[4] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39\u201357. IEEE, 2017.\n\n[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273\u2013297, 1995.\n\n[6] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.\n\n[7] A. Giusti, J. Guzzi, D. C. Cire\u015fan, F.-L. He, J. P. Rodr\u00edguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661\u2013667, 2016.\n\n[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672\u20132680, 2014.\n\n[9] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.\n\n[10] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.\n\n[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097\u20131105, 2012.\n\n[12] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.\n\n[13] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.\n\n[14] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.\n\n[15] J. Lu, T. Issaranon, and D. Forsyth. 
Safetynet: Detecting and rejecting adversarial examples robustly.\n\nCoRR, abs/1704.00103, 2017.\n\n[16] G. McLachlan and T. Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.\n\n[17] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. arXiv\n\npreprint arXiv:1702.04267, 2017.\n\n9\n\n\f[18] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing with virtual\n\nadversarial training. arXiv preprint arXiv:1507.00677, 2015.\n\n[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing\n\natari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.\n\n[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller,\nA. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature,\n518(7540):529, 2015.\n\n[21] S. M. Moosavi Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep\nneural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), number EPFL-CONF-218057, 2016.\n\n[22] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. arXiv\n\npreprint, 2017.\n\n[23] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks\n\nagainst deep learning systems using adversarial examples. arXiv preprint, 2016.\n\n[24] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep\nlearning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on,\npages 372\u2013387. IEEE, 2016.\n\n[25] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations\nagainst deep neural networks. 
In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582\u2013597.\nIEEE, 2016.\n\n[26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,\nM. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer\nVision, 115(3):211\u2013252, 2015.\n\n[27] P. Samangouei, M. Kabkab, and R. Chellappa. Defense-gan: Protecting classi\ufb01ers against adversarial\nattacks using generative models. In International Conference on Learning Representations, volume 9,\n2018.\n\n[28] D. Shen, G. Wu, and H.-I. Suk. Deep learning in medical image analysis. Annual review of biomedical\n\nengineering, 19:221\u2013248, 2017.\n\n[29] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser,\nI. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural\nnetworks and tree search. nature, 529(7587):484\u2013489, 2016.\n\n[30] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai,\n\nA. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.\n\n[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.\n\narXiv preprint arXiv:1409.1556, 2014.\n\n[32] J. Su, D. V. Vargas, and S. Kouichi. One pixel attack for fooling deep neural networks. arXiv preprint\n\narXiv:1710.08864, 2017.\n\n[33] Y. Sun, D. Liang, X. Wang, and X. Tang. Deepid3: Face recognition with very deep neural networks. arXiv\n\npreprint arXiv:1502.00873, 2015.\n\n[34] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing\n\nproperties of neural networks. arXiv preprint arXiv:1312.6199, 2013.\n\n[35] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine\n\nlearning algorithms, 2017.\n\n[36] S. Zagoruyko and N. Komodakis. Wide residual networks. 
arXiv preprint arXiv:1605.07146, 2016.\n\n10\n\n\f", "award": [], "sourceid": 4902, "authors": [{"given_name": "Zhihao", "family_name": "Zheng", "institution": "Brandeis University"}, {"given_name": "Pengyu", "family_name": "Hong", "institution": "Brandeis University"}]}