{"title": "Improved Network Robustness with Adversary Critic", "book": "Advances in Neural Information Processing Systems", "page_first": 10578, "page_last": 10587, "abstract": "Ideally, what confuses neural network should be confusing to humans. However, recent experiments have shown that small, imperceptible perturbations can change the network prediction. To address this gap in perception, we propose a novel approach for learning robust classifier. Our main idea is: adversarial examples for the robust classifier should be indistinguishable from the regular data of the adversarial target. We formulate a problem of learning robust classifier in the framework of Generative Adversarial Networks (GAN), where the adversarial attack on classifier acts as a generator, and the critic network learns to distinguish between regular and adversarial images. The classifier cost is augmented with the objective that its adversarial examples should confuse the adversary critic. To improve the stability of the adversarial mapping, we introduce adversarial cycle-consistency constraint which ensures that the adversarial mapping of the adversarial examples is close to the original. In the experiments, we show the effectiveness of our defense. Our method surpasses in terms of robustness networks trained with adversarial training. Additionally, we verify in the experiments with human annotators on MTurk that adversarial examples are indeed visually confusing.", "full_text": "Improved Network Robustness\n\nwith Adversary Critic\n\nAlexander Matyasko, Lap-Pui Chau\n\nSchool of Electrical and Electronic Engineering\nNanyang Technological University, Singapore\n\naliaksan001@ntu.edu.sg, elpchau@ntu.edu.sg\n\nAbstract\n\nIdeally, what confuses neural network should be confusing to humans. However,\nrecent experiments have shown that small, imperceptible perturbations can change\nthe network prediction. 
To address this gap in perception, we propose a novel approach for learning a robust classifier. Our main idea is that adversarial examples for the robust classifier should be indistinguishable from the regular data of the adversarial target. We formulate the problem of learning a robust classifier in the framework of Generative Adversarial Networks (GAN), where the adversarial attack on the classifier acts as a generator and a critic network learns to distinguish between regular and adversarial images. The classifier cost is augmented with the objective that its adversarial examples should confuse the adversary critic. To improve the stability of the adversarial mapping, we introduce an adversarial cycle-consistency constraint, which ensures that the adversarial mapping of the adversarial examples is close to the original. In the experiments, we show the effectiveness of our defense. Our method surpasses networks trained with adversarial training in terms of robustness. Additionally, we verify in experiments with human annotators on MTurk that adversarial examples are indeed visually confusing.\n\n1 Introduction\n\nDeep neural networks are powerful representation learning models which achieve near-human performance in image [1] and speech [2] recognition tasks. Yet, state-of-the-art networks are sensitive to small input perturbations. Szegedy et al. [3] showed that adding adversarial noise to inputs produces images which are visually similar to the original inputs but which the network misclassifies with high confidence. In speech recognition, Carlini and Wagner [4] introduced an adversarial attack which can change any audio waveform such that the corrupted signal is over 99.9% similar to the original but transcribes to any targeted phrase. 
The existence of adversarial examples puts into question generalization ability of\ndeep neural networks, reduces model interpretability, and limits applications of deep learning in\nsafety and security-critical environments [5, 6].\nAdversarial training [7, 8, 9] is the most popular approach to improve network robustness. Adversarial\nexamples are generated online using the latest snapshot of the network parameters. The generated\nadversarial examples are used to augment training dataset. Then, the classi\ufb01er is trained on the\nmixture of the original and the adversarial images. In this way, adversarial training smoothens a\ndecision boundary in the vicinity of the training examples. Adversarial training (AT) is an intuitive\nand effective defense, but it has some limitations. AT is based on the assumption that adversarial\nnoise is label non-changing. If the perturbation is too large, the adversarial noise may change the true\nunderlying label of the input. Secondly, adversarial training discards the dependency between the\nmodel parameters and the adversarial noise. As a result, the neural network may fail to anticipate\nchanges in the adversary and over\ufb01t the adversary used during training.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fF1\n\nx1\n\n\u02c6x1\n\nx2\n\n\u02c6x2\n\nF2\n\nFigure 1: Adversarial examples should be indistinguishable from the regular data of the adversarial\ntarget. The images in the \ufb01gure above are generated using Carlini and Wagner [10] l2-attack on the\nnetwork trained with our defense, such that the con\ufb01dence of the prediction on the adversarial images\nis 95%. The con\ufb01dence on the original images x1 and x2 is 99%.\n\nIdeally, what confuses neural network should be confusing to humans. 
So the changes introduced by\nthe adversarial noise should be associated with removing identifying characteristics of the original\nlabel and adding identifying characteristics of the adversarial label. For example, images that are\nadversarial to the classi\ufb01er should be visually confusing to a human observer. Current techniques [7,\n8, 9] improve robustness to input perturbations from a selected uncertainty set. Yet, the model\u2019s\nadversarial examples remain semantically meaningless. To address this gap in perception, we propose\na novel approach for learning robust classi\ufb01er. Our core idea is that adversarial examples for the robust\nclassi\ufb01er should be indistinguishable from the regular data of the attack\u2019s target class (see \ufb01g. 1).\nWe formulate the problem of learning robust classi\ufb01er in the framework of Generative Adversarial\nNetworks (GAN) [11]. The adversarial attack on the classi\ufb01er acts as a generator, and the critic\nnetwork learns to distinguish between natural and adversarial images. We also introduce a novel\ntargeted adversarial attack which we use as the generator. The classi\ufb01er cost is augmented with the\nobjective that its adversarial images generated by the attack should confuse the adversary critic. The\nattack is fully-differentiable and implicitly depends on the classi\ufb01er parameters. We train the classi\ufb01er\nand the adversary critic jointly with backpropagation. To improve the stability of the adversarial\nmapping, we introduce adversarial cycle-consistency constraint which ensures that the adversarial\nmapping of the adversarial examples is close to the original. Unlike adversarial training, our method\ndoes not require adversarial noise to be label non-changing. To the contrary, we require that the\nchanges introduced by adversarial noise should change the \u201ctrue\u201d label of the input to confuse the\ncritic. 
In the experiments, we demonstrate the effectiveness of the proposed approach. Our method\nsurpasses in terms of robustness networks trained with adversarial training. Additionally, we verify\nin the experiments with human annotators that adversarial examples are indeed visually confusing.\n\n2 Related work\n\nAdversarial attacks\nSzegedy et al. [3] have originally introduced a targeted adversarial attack\nwhich generates adversarial noise by optimizing the likelihood of input for some adversarial target\nusing a box-constrained L-BFGS method. Fast Gradient Sign method (FGSM) [7] is a one-step\nattack which uses a \ufb01rst-order approximation of the likelihood loss. Basic Iterative Method (BIM),\nwhich is also known as Projected Gradient Descent (PGD), [12] iteratively applies the \ufb01rst-order\napproximation and projects the perturbation after each step. [6] propose an iterative method which at\neach iteration selects a single most salient pixel and perturbs it. DeepFool [13] iteratively generates\nadversarial perturbation by taking a step in the direction of the closest decision boundary. The decision\nboundary is approximated with \ufb01rst-order Taylor series to avoid complex non-convex optimization.\nThen, the geometric margin can be computed in the closed-form. Carlini and Wagner [10] propose\nan optimization-based attack on a modi\ufb01ed loss function with implicit box-constraints. [14] intro-\nduce a black-box adversarial attack based on transferability of adversarial examples. Adversarial\nTransformation Networks (ATN) [15] trains a neural network to attack.\n\n2\n\n\fDefenses against adversarial attacks Adversarial training (AT) [7] augments training batch with\nadversarial examples which are generated online using Fast Gradient Sign method. Virtual Adversarial\ntraining (VAT) [16] minimizes Kullback-Leibler divergence between the predictive distribution of\nclean inputs and adversarial inputs. 
Notably, adversarial examples can be generated without using\nlabel information and VAT was successfully applied in semi-supervised settings. [17] applies iterative\nProjected Gradient Descent (PGD) attack to adversarial training. Stability training [18] minimizes a\ntask-speci\ufb01c distance between the output on clean and the output on corrupted inputs. However, only\na random noise was used to distort the input. [19, 20] propose to maximize a geometric margin to\nimprove classi\ufb01er robustness. Parseval networks [21] are trained with the regularization constraint,\nso the weight matrices have a small spectral radius. Most of the existing defenses are based on robust\noptimization and improve the robustness to perturbations from a selected uncertainty set.\nDetecting adversarial examples is an alternative way to mitigate the problem of adversarial examples at\ntest time. [22] propose to train a detector network on the hidden layer\u2019s representation of the guarded\nmodel. If the detector \ufb01nds an adversarial input, an autonomous operation can be stopped and human\nintervention can be requested. [23] adopt a Bayesian interpretation of Dropout to extract con\ufb01dence\nintervals during testing. Then, the optimal threshold was selected to distinguish natural images from\nadversarial. Nonetheless, Carlini and Wagner [24] have extensively studied and demonstrated the\nlimitations of the detection-based methods. Using modi\ufb01ed adversarial attacks, such defenses can\nbe broken in both white-box and black-box setups. In our work, the adversary critic is somewhat\nsimilar to the adversary detector. 
But, unlike adversary-detection methods, we use information from\nthe adversary critic to improve the robustness of the guarded model during training and do not use\nthe adversary critic during testing.\nGenerative Adversarial Networks [11] introduce a generative model where the learning problem\nis formulated as an adversarial game between discriminator and generator. The discriminator is\ntrained to distinguish between real images and generated images. The generator is trained to produce\nnaturally looking images which confuse the discriminator. A two-player minimax game is solved by\nalternatively optimizing two models. Recently several defenses have been proposed which use GAN\nframework to improve robustness of neural networks. Defense-GAN [25] use the generator at test\ntime to project the corrupted input on the manifold of the natural examples. Lee et al. [26] introduce\nGenerative Adversarial Trainer (GAT) in which the generator is trained to attack the classi\ufb01er. Like\nAdversarial Training [7], GAT requires that adversarial noise does not change the label. Compare\nwith defenses based on robust optimization, we do not put any prior constraint on the adversarial\nattack. To the contrary, we require that adversarial noise for robust classi\ufb01er should change the \u201ctrue\u201d\nlabel of the input to confuse the critic. Our formulation has three components (the classi\ufb01er, the critic,\nand the attack) and is also related to Triple-GAN [27]. But, in our work: 1) the generator also fools\nthe classi\ufb01er; 2) we use the implicit dependency between the model and the attack to improve the\nrobustness of the classi\ufb01er. Also, we use a \ufb01xed algorithm to attack the classi\ufb01er.\n\n3 Robust Optimization\n\nWe \ufb01rst recall a mathematical formulation for the robust multiclass classi\ufb01cation. Let f (x; W) be a\nk-class classi\ufb01er, e.g. 
neural network, where x ∈ R^N is the input and W are the classifier parameters. The prediction rule is k̂(x) = arg max f(x). Robust optimization seeks a solution robust to the worst-case input perturbations:\n\nmin_W Σ_{i=1}^N max_{r_i ∈ U_i} L(f(x_i + r_i), y_i)    (1)\n\nwhere L is a training loss, r_i is an arbitrary (even adversarial) perturbation for the input x_i, and U_i is an uncertainty set, e.g. an lp-norm ε-ball U_i = {r_i : ‖r_i‖_p ≤ ε}. Prior information about the task can be used to select a problem-specific uncertainty set U.\nSeveral regularization methods can be shown to be equivalent to robust optimization, e.g. l1 lasso regression [28] and the l2 support vector machine [29]. Adversarial training [7] is a popular regularization method to improve neural network robustness. AT assumes that adversarial noise is label non-changing and trains the neural network on a mixture of the original and the adversarial images:\n\nmin_W Σ_{i=1}^N L(f(x_i), y_i) + λ L(f(x_i + r_i), y_i)    (2)\n\nwhere r_i is the adversarial perturbation generated using the Fast Gradient Sign method (FGSM). Shaham et al. [30] show that adversarial training is a form of robust optimization with an l∞-norm constraint. Madry et al. [17] experimentally argue that the Projected Gradient Descent (PGD) adversary is the inner maximizer of eq. (1) and, thus, that PGD is the optimal first-order attack. Adversarial training with the PGD attack increases the robustness of the regularized models compared to the original defense. Margin maximization [19] is another regularization method which generalizes the SVM objective to deep neural networks, and, like SVM, it is equivalent to robust optimization with the margin loss.\nSelecting a good uncertainty set U for robust optimization is crucial. 
A poorly chosen uncertainty set may result in an overly conservative robust model. Most importantly, each perturbation r ∈ U should leave the “true” class of the original input x unchanged. To ensure that the changes of the network prediction are indeed fooling examples, Goodfellow et al. [7] argue in favor of a max-norm perturbation constraint for image classification problems. However, simple disturbance models (e.g. the l2- and l∞-norm ε-balls used in adversarial training) are inadequate in practice because the distance to the decision boundary may vary significantly between examples. To adapt the uncertainty set to the problem at hand, several methods have been developed for constructing data-dependent uncertainty sets using statistical hypothesis tests [31]. In this work, we propose a novel approach for learning a robust classifier which is orthogonal to prior robust optimization methods.\nIdeally, inputs that are adversarial to the classifier should be confusing to a human observer. So the changes introduced by the adversarial noise should be associated with removing identifying characteristics of the original label and adding identifying characteristics of the adversarial target. For example, the adversarial images in Figure 2 are visually confusing. The digit ‘1’ (second row, eighth column), after adding the top stroke, was classified by the neural network as the digit ‘7’. Likewise, the digit ‘7’ (eighth row, second column), after removing the top stroke, was classified by the network as the digit ‘1’. Similarly for the other images in Figure 2, the model’s “mistakes” can be predicted visually. Such behavior of the classifier is expected and desired for problems in computer vision. Additionally, it improves the interpretability of the model. 
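The adversarial training objective of eq. (2) can be made concrete with a small sketch. The snippet below is a minimal NumPy illustration on a toy binary logistic-regression model standing in for the network f; the model (w, b), the example (x, y), and the constants eps and lam are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, b, x, y):
    """Negative log-likelihood of a binary logistic model."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def adversarial_training_loss(w, b, x, y, eps=0.1, lam=1.0):
    """Eq. (2)-style objective for a single example: the clean loss
    plus a weighted loss on an FGSM-perturbed copy of the input.
    FGSM moves x by eps * sign of the loss gradient w.r.t. x."""
    p = sigmoid(w @ x + b)
    r = eps * np.sign((p - y) * w)   # d nll / d x = (p - y) * w
    return nll(w, b, x, y) + lam * nll(w, b, x + r, y)

rng = np.random.default_rng(0)
w, b = rng.normal(size=3), 0.0
x, y = rng.normal(size=3), 1.0
# The adversarial term only adds loss, so the mixed objective
# upper-bounds the clean loss on the same example.
assert adversarial_training_loss(w, b, x, y) > nll(w, b, x, y)
```

Adversarial training proper would average this objective over mini-batches and regenerate r from the current parameters at every step.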
Figure 2: Images off-diagonal are corrupted with the adversarial noise generated by the CW [10] l2-norm attack, so the prediction confidence on the adversarial images is at least 95%. The prediction confidence on the original images is 99%.\n\nIn this work, we study image classification problems, but our formulation can be extended to classification tasks in other domains, e.g. audio or text.\nBased on the above intuition, we develop a novel formulation for learning a robust classifier. A classifier is robust if its adversarial examples are indistinguishable from the regular data of the adversarial target (see fig. 1). So, we formulate the following mathematical problem:\n\nmin_W Σ_{i=1}^N L(f(x_i), y_i) + λ D[p_data(x, y), p_adv(x, y)]    (3)\n\nwhere p_data(x, y) and p_adv(x, y) are the distributions of the natural and the adversarial (for f) examples, and the parameter λ controls the trade-off between accuracy and robustness. Note that the distribution p_adv(x, y) is constructed by transforming natural samples (x, y) ∼ p_data(x, y) with y ≠ y_adv, so that the adversarial example x_adv = A_f(x; y_adv) is classified by f as the attack’s target y_adv.\nThe first loss in eq. (3), e.g. NLL, fits the model predictive distribution to the data distribution. The second term measures the probabilistic distance between the distributions of the regular and the adversarial images and constrains the classifier, so that its adversarial examples are indistinguishable from the regular inputs. It is important to note that we minimize a probabilistic distance between the joint distributions, because the distance between the marginal distributions p_data(x) and p_adv(x) is trivially minimized when r = 0. Compared with adversarial training, the proposed formulation does not impose the assumption that adversarial noise is label non-changing. 
To the contrary, we require that adversarial noise for the robust classifier should be visually confusing and, thus, that it should change the underlying label of the input. Next, we will describe the implementation details of the proposed defense.\n\n4 Robust Learning with Adversary Critic\n\nAs we have argued in the previous section, adversarial examples for the robust classifier should be indistinguishable from the regular data of the adversarial target. Minimizing the statistical distance between p_data(x, y) and p_adv(x, y) in eq. (3) requires probability density estimation, which in itself is a difficult problem. Instead, we adopt the framework of Generative Adversarial Networks [11]. We rely on a discriminator, or adversary critic, to estimate a measure of difference between the two distributions. The discriminator, given an input-label pair (x, y), classifies it as either natural or adversarial. For the k-class classifier f, we implement the adversary critic as a k-output neural network (see fig. 3). The objective for the k-th output of the discriminator D is to correctly distinguish between natural and adversarial examples of the class y_k:\n\nL(f*, D_k) = min_{D_k} E_{x∼p_data(x|y_k)}[log D_k(x)] + E_{y: y≠y_k} E_{x∼p_data(x|y)}[log(1 − D_k(A_{f*}(x; y_k)))]    (4)\n\nwhere A_f(x; y_k) is the targeted adversarial attack on the classifier f which transforms the input x to the adversarial target y_k. An example of such an attack is Projected Gradient Descent [12], which iteratively takes a step in the direction of the target y_k. Note that the second term in eq. 
(4) is computed by transforming the regular inputs (x, y) ∼ p_data(x, y) with the original label y different from the adversarial target y_k.\nOur architecture for the discriminator in Figure 3 is slightly different from previous work on joint distribution matching [27], where the label information was added as an input to each layer of the discriminator. We use the class label only in the final classification layer of the discriminator. In the experiments, we observe that with the proposed architecture: 1) the discriminator is more stable during training; 2) the classifier f converges faster and is more robust. We also regularize the adversary critic with a gradient norm penalty [32]. For the gradient norm penalty, we do not interpolate between clean and adversarial images but simply compute the penalty at the real and the adversarial data separately. Interestingly, regularizing the gradient of the binary classifier has the interpretation of maximizing the geometric margin [19].\nThe objective for the classifier f is to minimize the number of mistakes subject to the constraint that its adversarial examples generated by the attack A_f fool the adversary critic D:\n\nL(f, D*) = min_f E_{x,y∼p_data(x,y)} L(f(x), y) + λ Σ_{y_k} E_{y: y≠y_k} E_{x∼p_data(x|y)}[log D*_k(A_f(x; y_k))]    (5)\n\nwhere L is a standard supervised loss, such as negative log-likelihood (NLL), and the parameter λ controls the trade-off between test accuracy and classifier robustness. To improve the stability of the adversarial mapping during training, we introduce an adversarial cycle-consistency constraint, which ensures that the adversarial mapping A_f of the adversarial examples is close to the original:\n\nL_cycle(y_s, y_t) = E_{x∼p_data(x|y_s)}[‖A_f(A_f(x, y_t), y_s) − x‖²₂]  ∀ y_s ≠ y_t    (6)
Figure 3: Multiclass Adversary Critic.\n\nAlgorithm 1 High-Confidence Attack A_f\n1: Input: image x, target y, network f, confidence C.\n2: Output: adversarial image x̂.\n3: x̂ ← x\n4: while p_y(x̂) < C do\n5:    f ← log C − log p_y(x̂)\n6:    w ← ∇ log p_y(x̂)\n7:    r ← (f / ‖w‖²₂) w\n8:    x̂ ← x̂ + r\n9: end while\n\nr_k = (log C − log p_k(x)) / ‖∇_x log p_k(x)‖₂    (7)\n\nwhere y_s is the original label of the input and y_t is the adversarial target. The adversarial cycle-consistency constraint is similar to the cycle-consistency constraint introduced for image-to-image translation [33], but we introduce it to constrain the adversarial mapping A_f, and it improves the robustness of the classifier f. Next, we discuss the implementation of our targeted adversarial attack A_f.\nOur defense requires that the adversarial attack A_f is differentiable. Additionally, adversarial examples generated by the attack A_f should be misclassified by the network f with high confidence. Adversarial examples which are close to the decision boundary are likely to retain some identifying characteristics of the original class. An attack which optimizes for mistakes, e.g. DeepFool [13], guarantees a confidence of only 1/k for a k-way classifier. To generate high-confidence adversarial examples, we propose a novel adversarial attack which iteratively maximizes the confidence of the adversarial target. The confidence of the target k after adding the perturbation r is p_k(x + r). The goal of the attack is to find a perturbation such that the adversarial input is misclassified as k with confidence at least C:\n\nmin ‖r‖  s.t.  p_k(x + r) ≥ C\n\nWe apply a first-order approximation to the constraint inequality:\n\nmin ‖r‖  s.t.  
p_k(x) + rᵀ∇_x p_k(x) ≥ C\n\nSoftmax in the final classification layer saturates quickly and shatters the gradient. To avoid small gradients, we use the log-likelihood instead. Finally, the l2-norm minimal perturbation can be computed using the method of Lagrange multipliers, which yields eq. (7).\nBecause we use a first-order approximation of the non-convex decision boundary, we iteratively update the perturbation r for N_max steps using eq. (7) until the adversarial input x_adv is misclassified as the target k with the confidence C. Our attack can be equivalently written as\n\nx_adv = x + Σ_{i=1}^{N_max} I(p(x + Σ_{j=1}^{i} r_j) ≤ C) r_i\n\nwhere I is an indicator function. The discrete stopping condition introduces a non-differentiable path in the computational graph. We replace the gradient of the indicator function I with the sigmoid-adjusted straight-through estimator during backpropagation [34]. This is a biased estimator, but it has low variance and performs well in the experiments.\nThe proposed attack is similar to the Basic Iterative Method (BIM) [12]. BIM takes a fixed ε-norm step in the direction of the attack target, while our method uses an adaptive step γ = |log C − log p_y(x̂)| / ‖∇_x log p_y(x̂)‖. The difference is important for our defense:\n\n1. BIM introduces an additional parameter ε. If ε is too large, then the attack will not be accurate. If ε is too small, then the attack will require many iterations to converge.\n\n2. Both attacks are differentiable. However, for the BIM attack, during backpropagation all the gradients ∂r_i/∂w have an equal weight ε. For our attack, the gradients are weighted adaptively depending on the distance γ to the attack’s target. The step γ for our attack is also fully differentiable.\n\nA full listing of our attack is shown in algorithm 1. 
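Algorithm 1 can be sketched outside of a deep-learning framework. The NumPy snippet below runs the same iteration on a toy linear softmax classifier (the weights, the target class, and the confidence level are illustrative assumptions; in the paper the confidence p_y and its gradient come from the network f via backpropagation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def high_confidence_attack(x, target, W, b, C, max_iter=50):
    """Sketch of Algorithm 1 on a linear softmax classifier.

    Each step takes r = ((log C - log p_t) / ||g||^2) * g with
    g = d log p_t / d x, i.e. a first-order step towards target
    confidence C, repeated until the confidence is reached."""
    x_adv = x.astype(float).copy()
    for _ in range(max_iter):
        p = softmax(W @ x_adv + b)
        if p[target] >= C:
            break
        g = W[target] - p @ W            # grad of log p_target w.r.t. x
        f = np.log(C) - np.log(p[target])
        x_adv = x_adv + (f / (g @ g)) * g
    return x_adv

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 5)), np.zeros(2)
x = rng.normal(size=5)
x_adv = high_confidence_attack(x, target=1, W=W, b=b, C=0.95)
assert softmax(W @ x_adv + b)[1] >= 0.95 - 1e-6
```

Because the step length adapts to the remaining confidence gap, no step-size parameter ε is needed, which is the difference from BIM discussed above.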
Next, we discuss how we select the adversarial target y_t and the attack’s target confidence C during training.\nThe classifier f approximately characterizes a conditional distribution p(y|x). If the classifier f* is optimal and robust, its adversarial examples generated by the attack A_f should fool the adversary critic D. Therefore, to fool the critic D, the attack A_f should generate adversarial examples with the confidence C equal to the confidence of the classifier f on the regular examples. During training, we maintain a running mean of the confidence score for each class on the regular data. The attack target y_t for the input x with the label y_s can be sampled from a masked uniform distribution. Alternatively, the class with the closest decision boundary [13] can be selected. The latter formulation resulted in a more robust classifier f, and we used it in all our experiments. This is similar to the support vector machine formulation, which maximizes the minimum margin.\nFinally, we train the classifier f and the adversary critic D jointly using stochastic gradient descent by alternating minimization of Equations (4) and (5). Our formulation has three components (the classifier f, the critic D, and the attack A_f) and is similar to Triple-GAN [27], but the generator in our formulation also fools the classifier.\n\n5 Experiments\n\nAdversarial training [7] discards the dependency between the model parameters and the adversarial noise. In this work, it is necessary to retain the implicit dependency between the classifier f and the adversarial noise, so that we can backpropagate through the adversarial attack A_f. For these reasons, all experiments were conducted using TensorFlow [35], which supports symbolic differentiation and computation on GPU. Backpropagation through our attack requires second-order gradients ∂²f(x; W)/(∂x ∂W), which increases the computational complexity of our defense. 
At the same time, this allows the model to anticipate changes in the adversary and, as we show, significantly improves the model robustness both numerically and perceptually.\nWe perform experiments on the MNIST dataset. While MNIST is a simple classification task, it remains unsolved in the context of robust learning. We evaluate the robustness of the models against l2 attacks. The minimal adversarial perturbation r is estimated using DeepFool [13], Carlini and Wagner [10], and the proposed attack. To improve the accuracy of DeepFool and our attack during testing, we clip the l2-norm of the perturbation at each iteration to 0.1. Note that our attack with the fixed step is equivalent to the Basic Iterative Method [12]. We set the maximum number of iterations for DeepFool and our attack to 500. The target confidence C for our attack is set to the prediction confidence on the original input x. DeepFool and our attack do not handle domain constraints explicitly, so we project the perturbation after each update. For Carlini and Wagner [10], we use the implementation provided by the authors with default settings for the attack, but we reduce the number of optimization iterations from 10000 to 1000. As suggested in [13], we measure the robustness of the model as follows:\n\nρ_adv(A_f) = (1/|D|) Σ_{x∈D} ‖r(x)‖₂ / ‖x‖₂    (8)\n\nwhere A_f is the attack on the classifier f and D is the test set.\nWe compare our defense with reference (no defense), Adversarial Training [7, 8] (ε = 0.1), Virtual Adversarial Training (VAT) [16] (ε = 2.0), and the l2-norm Margin Maximization [19] (λ = 0.1) defense. We study the robustness of two networks with rectified activations: 1) a fully-connected neural network with three hidden layers of 1200 units each; 2) the Lenet-5 convolutional neural network. We train both networks using the Adam optimizer [36] with batch size 100 for 100 epochs. 
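The robustness score ρ_adv of eq. (8) is straightforward to compute once an attack has produced a perturbation r(x) for every test input; a minimal NumPy sketch with made-up perturbations (the data below is illustrative, not from the experiments):

```python
import numpy as np

def rho_adv(xs, rs):
    """Average-case robustness from eq. (8): the mean over the test
    set of ||r(x)||_2 / ||x||_2, where r(x) is the minimal
    adversarial perturbation found by some attack for input x."""
    xs, rs = np.asarray(xs, float), np.asarray(rs, float)
    ratios = np.linalg.norm(rs, axis=1) / np.linalg.norm(xs, axis=1)
    return ratios.mean()

xs = np.array([[3.0, 4.0], [0.0, 2.0]])   # input norms 5 and 2
rs = np.array([[0.0, 0.5], [0.2, 0.0]])   # perturbation norms 0.5 and 0.2
assert np.isclose(rho_adv(xs, rs), 0.1)   # mean of 0.5/5 and 0.2/2
```

A larger ρ_adv means the attack needs relatively larger perturbations, i.e. a more robust model.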
Next, we will describe\nthe training details for our defense.\nOur critic has two layers with 1200 units each and leaky recti\ufb01ed activation. We also add Gaussian\nnoise to the input of each layer. We train both the classi\ufb01er and the critic using Adam [36] with the\nmomentum \u03b21 = 0.5. The starting learning rate is set to 5 \u00b7 10\u22124 and 10\u22123 for the classi\ufb01er and the\ndiscriminator respectively. We train our defense for 100 epochs and the learning rate is halved every\n40 epochs. We set \u03bb = 0.5 for fully-connected network and \u03bb = 0.1 for Lenet-5 network which we\nselected using validation dataset. Both networks are trained with \u03bbrec = 10\u22122 for the adversarial\ncycle-consistency loss and \u03bbgrad = 10.0 for the gradient norm penalty. The number of iterations for\nour attack Af is set to 5. The attack con\ufb01dence C is set to the running mean class con\ufb01dence of the\nclassi\ufb01er on natural images. We pretrain the classi\ufb01er f for 1 epoch without any regularization to get\nan initial estimate of the class con\ufb01dence scores.\nOur results for 10 independent runs are summarized in Table 1, where the second column shows the\ntest error on the clean images, and the subsequent columns compare the robustness \u03c1 to DeepFool [13],\nCarlini and Wagner [10], and our attacks. 
Our defense significantly increases the robustness of the model to adversarial examples.\n\nDefense | Test error % | ρ ([13]) | ρ ([10]) | ρ (Our attack)\nReference | 1.46 | 0.131 | 0.124 | 0.173\n[7] | 0.90 | 0.228 | 0.210 | 0.299\n[16] | 0.84 | 0.244 | 0.215 | 0.355\n[19] | 0.84 | 0.262 | 0.230 | 0.453\nOur | 1.18 | 0.290 | 0.272 | 0.575\n\n(a)\n\nDefense | Test error % | ρ ([13]) | ρ ([10]) | ρ (Our attack)\nReference | 0.64 | 0.157 | 0.148 | 0.207\n[7] | 0.55 | 0.215 | 0.191 | 0.286\n[16] | 0.60 | 0.225 | 0.195 | 0.330\n[19] | 0.54 | 0.248 | 0.225 | 0.470\nOur | 0.93 | 0.288 | 0.278 | 0.590\n\n(b)\n\nTable 1: Results on the MNIST dataset for the fully-connected network in table 1a and for the Lenet-5 convolutional network in table 1b. Column 2: test error on the original images. Columns 3-5: robustness ρ under DeepFool [13], Carlini and Wagner [10], and the proposed attack.\n\nFigure 4: Figure 4a shows a random subset of test images (average confidence 97%). Figure 4b shows adversarial examples at the class decision boundary (average confidence 34%). Figure 4c shows high-confidence adversarial images (average confidence 98%).\n\nDefense | % Change | % No change\nReference | 0.57 | 98.74\n[7] | 19.02 | 77.21\n[16] | 35.08 | 59.68\n[19] | 60.47 | 34.52\nOur | 87.99 | 9.86\n\n(a)\n\nDefense | % Change | % No change\nReference | 2.54 | 96.53\n[7] | 19.1 | 75.94\n[16] | 26.8 | 67.73\n[19] | 81.77 | 13.15\nOur | 92.29 | 6.51\n\n(b)\n\nTable 2: Results of the Amazon Mechanical Turk experiment for the fully-connected network in table 2a and for the Lenet-5 convolutional network in table 2b. Column 2: percent of adversarial images which human annotators label with the adversarial target, so the adversarial noise changed the “true” label of the input. Column 3: percent of adversarial images which human annotators label with the original label, so the adversarial noise did not change the underlying label of the input.\n\n
Some adversarial images for the neural network trained with our defense are shown in Figure 4. Adversarial examples are generated using the Carlini and Wagner [10] attack with default parameters. As we can observe, the adversarial examples at the decision boundary in Figure 4b are visually confusing. At the same time, the high-confidence adversarial examples in Figure 4c closely resemble natural images of the adversarial target. We propose to investigate and compare various defenses based on how many of their adversarial "mistakes" are actual mistakes.
We conduct an experiment with human annotators on MTurk. We asked the workers to label adversarial examples, which were generated from the test set using the proposed attack. The attack's target was set to the class closest to the decision boundary, and the target confidence was set to the model's confidence on the original examples. We split the 10000 test images into 400 assignments, each completed by one unique annotator. We report the results for four defenses in Table 2. For the model trained without any defense, adversarial noise does not change the label of the input. When the model is trained with our defense, the high-confidence adversarial noise actually changes the label of the input.

6 Conclusion

In this paper, we introduce a novel approach for learning a robust classifier. Our defense is based on the intuition that adversarial examples for the robust classifier should be indistinguishable from the regular data of the adversarial target. We formulate the problem of learning a robust classifier in the framework of Generative Adversarial Networks. Unlike prior work based on robust optimization, our method does not put any prior constraints on the adversarial noise. Our method surpasses networks trained with adversarial training in terms of robustness.
In experiments with human annotators, we also show that adversarial examples for our defense are indeed visually confusing. In future work, we plan to scale our defense to more complex datasets and apply it to classification tasks in other domains, such as audio or text.

Acknowledgments

This work was carried out at the Rapid-Rich Object Search (ROSE) Lab at Nanyang Technological University (NTU), Singapore. The ROSE Lab is supported by the National Research Foundation, Singapore, and the Infocomm Media Development Authority, Singapore. We thank NVIDIA Corporation for the donation of the GeForce Titan X and GeForce Titan X (Pascal) used in this research. We also thank all the anonymous reviewers for their valuable comments and suggestions.

References

[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.

[3] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In ICLR, 2013.

[4] N. Carlini and D. Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. arXiv preprint arXiv:1801.01944, 2018.

[5] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K. Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.

[6] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings.
In IEEE European Symposium on Security and Privacy, 2016.

[7] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.

[8] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. In ICLR, 2017.

[9] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In ICLR, 2018.

[10] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017.

[11] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

[12] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[13] S. M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In CVPR, 2016.

[14] Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 2017.

[15] Shumeet Baluja and Ian Fischer. Learning to attack: Adversarial transformation networks. In AAAI, 2018.

[16] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing with virtual adversarial training. In ICLR, 2015.

[17] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.

[18] S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In CVPR, 2016.

[19] Alexander Matyasko and Lap-Pui Chau.
Margin maximization for robust classi\ufb01cation using\n\ndeep learning. In IJCNN, 2017.\n\n[20] Gamaleldin Fathy Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio.\n\nLarge margin deep networks for classi\ufb01cation. In NIPS, 2018.\n\n[21] Moustapha Ciss\u00e9, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier.\n\nParseval networks: Improving robustness to adversarial examples. In ICML, 2017.\n\n[22] J. Hendrik Metzen, T. Genewein, V. Fischer, and B. Bischoff. On Detecting Adversarial\n\nPerturbations. In ICLR, 2017.\n\n[23] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner. Detecting Adversarial Samples from\n\nArtifacts. arXiv preprint arXiv:1703.00410, 2017.\n\n[24] Nicholas Carlini and David A. Wagner. Adversarial examples are not easily detected: Bypassing\nten detection methods. In Proceedings of the 10th ACM Workshop on Arti\ufb01cial Intelligence and\nSecurity, 2017.\n\n[25] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classi\ufb01ers\n\nagainst adversarial attacks using generative models. In ICLR, 2018.\n\n[26] H. Lee, S. Han, and J. Lee. Generative Adversarial Trainer: Defense to Adversarial Perturbations\n\nwith GAN. arXiv preprint arXiv:1705.03387, 2017.\n\n[27] Chongxuan LI, Tau\ufb01k Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. In NIPS,\n\n2017.\n\n[28] Huan Xu, Constantine Caramanis, and Shie Mannor. Robust regression and lasso. In NIPS,\n\n2009.\n\n[29] Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support\n\nvector machines. In Journal of Machine Learning Research, 2009.\n\n[30] U. Shaham, Y. Yamada, and S. Negahban. Understanding Adversarial Training: Increasing\nLocal Stability of Neural Nets through Robust Optimization. arXiv preprint arXiv:1511.05432,\n2015.\n\n[31] Dimitris Bertsimas, Vishal Gupta, and Nathan Kallus. Data-driven robust optimization. 
In Mathematical Programming, 2018.

[32] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.

[33] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

[34] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[35] The TensorFlow Development Team. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[36] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.