{"title": "Learning to Confuse: Generating Training Time Adversarial Data with Auto-Encoder", "book": "Advances in Neural Information Processing Systems", "page_first": 11994, "page_last": 12004, "abstract": "In this work, we consider one challenging training time attack by modifying training data with bounded perturbation, hoping to manipulate the behavior (both targeted or non-targeted) of any corresponding trained classifier during test time when facing clean samples. To achieve this, we proposed to use an auto-encoder-like network to generate such adversarial perturbations on the training data together with one imaginary victim differentiable classifier. The perturbation generator will learn to update its weights so as to produce the most harmful noise, aiming to cause the lowest performance for the victim classifier during test time. This can be formulated into a non-linear equality constrained optimization problem. Unlike GANs, solving such problem is computationally challenging, we then proposed a simple yet effective procedure to decouple the alternating updates for the two networks for stability. By teaching the perturbation generator to hijacking the training trajectory of the victim classifier, the generator can thus learn to move against the victim classifier step by step. The method proposed in this paper can be easily extended to the label specific setting where the attacker can manipulate the predictions of the victim classifier according to some predefined rules rather than only making wrong predictions. 
Experiments on various datasets, including CIFAR-10 and a reduced version of ImageNet, confirmed the effectiveness of the proposed method, and empirical results showed that such bounded perturbations transfer well across different types of victim classifiers.", "full_text": "Learning to Confuse: Generating Training Time Adversarial Data with Auto-Encoder\u2217\n\nJi Feng1,2, Qi-Zhi Cai2, Zhi-Hua Zhou1\n\n1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China\n\n2Sinovation Ventures AI Institute\n\n{fengj, zhouzh}@lamda.nju.edu.cn, caiqizhi@chuangxin.com\n\nAbstract\n\nIn this work, we consider a challenging training-time attack that modifies training data with bounded perturbations, hoping to manipulate the behavior (both targeted and non-targeted) of any classifier trained on it when facing clean samples at test time. To achieve this, we propose using an auto-encoder-like network to generate such adversarial perturbations on the training data, together with an imaginary differentiable victim classifier. The perturbation generator learns to update its weights so as to produce the most harmful noise, aiming to cause the lowest test-time performance for the victim classifier. This can be formulated as a non-linear equality-constrained optimization problem. Unlike GANs, such a problem is computationally challenging to solve, so we propose a simple yet effective procedure that decouples the alternating updates of the two networks for stability. By teaching the perturbation generator to hijack the training trajectory of the victim classifier, the generator learns to move against the victim classifier step by step. 
The proposed method can be easily extended to the label-specific setting, where the attacker can manipulate the predictions of the victim classifier according to some predefined rules rather than merely causing wrong predictions. Experiments on various datasets, including CIFAR-10 and a reduced version of ImageNet, confirmed the effectiveness of the proposed method, and empirical results showed that such bounded perturbations transfer well across different types of victim classifiers.\n\n1 Introduction\n\nHow can one modify the training data with bounded, transferable perturbations so as to induce the largest generalization gap? In other words, we consider the task of adding imperceptible noise to the training data, hoping to maximally confuse any classifier trained on it so that it makes as many wrong predictions as possible when facing clean test data. In this paper, we refer to such perturbed training samples as training-time adversarial data.\nTo achieve the above goal, we define a deep encoder-decoder-like network to generate such perturbations. Meanwhile, we use an imaginary neural network acting as the victim classifier, and the goal is to train both networks simultaneously so that the victim classifier attains the lowest possible accuracy on the clean test set. We can thus formulate the problem as a non-linear equality-constrained optimization problem. Unlike GANs [9], such an optimization problem is much harder to solve, and a direct implementation of alternating updates leads to unstable results. 
Inspired by common techniques in reinforcement learning, such as introducing a separate record-tracking network (like the target network used to stabilize Q-learning [19]), we propose a similar approach that decouples the training procedure for the two networks. By doing so, the optimization procedure becomes much more stable in practice. In other words, the adversarial perturbation generator is trained by hijacking the training procedure of the victim classifier, and the noise generator thus learns to move against the victim classifier step by step.\n\n\u2217The first two authors contributed equally to the work.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nA similar setting is data poisoning [20], proposed in the security community. However, its goal is quite different from ours. The main goal of this work is to reveal some intriguing properties of neural networks by adding bounded perturbations to the training data, whereas data poisoning operates under the restriction that only a few training samples may be changed. In other words, in traditional data poisoning, the attacker's goal is to add or modify as few training samples as possible, whereas training-time adversarial data constrains the perturbation level (keeping it as imperceptible to humans as possible). Moreover, having full control of the training data (instead of changing a few samples) is a realistic assumption. For instance, in some applications an agent may agree to release some internal data for peer assessment or academic research but does not want to enable the data receiver to build a model that performs well on real test data; this can be realized by applying such adversarial noise before the data release. In addition, from a data-privacy perspective, such a procedure is quite different from releasing synthetic data via GANs. 
Consider a company selling surveillance cameras, where the user stores all the captured data (these photos cannot be synthetic, for obvious reasons). At the same time, the user certainly does not want any unauthorized third party to steal the data and train a classifier on it. Our proposed procedure suits this kind of task, since the user can simply apply self-perturbations to its own data for protection.\nThe other contribution of this work is that the formalization can be easily extended to the label-specific case, where one wants to fool the classifier into recognizing an input pattern as a specific predefined class, rather than merely making a wrong prediction. Finally, experimental results show that the learned noise is effective and robust against other machine learning models with different structures, or even of different types, such as Random Forest [4] or Support Vector Machine (SVM) [6].\nThe rest of the paper is organized as follows: first, we formalize the proposed problem and describe the optimization procedure; experimental results are then presented; finally, conclusions and future work are discussed.\n\n2 Related Works\n\nOne subject closely related to our work is data poisoning. The task of data poisoning dates back to pre-deep-learning times. For instance, there has been research on poisoning classical models, including SVM [2], Linear Regression [14], and Naive Bayes [21], which basically transforms the poisoning task into a convex optimization problem.\nPoisoning deep models, however, is more challenging. Koh et al. [16] first proposed the possibility of poisoning deep models via the influence function to derive adversarial training examples. Currently, there are several popular approaches to data poisoning. 
For instance, sample-specific poisoning aims to manipulate the model's behavior on particular test samples [24, 5, 11]. On the other hand, general poisoning attacks aim to reduce performance on the whole clean, unseen test set [16, 20]. As explained in the previous section, one difference from data poisoning is that poisoning mainly focuses on modifying as few samples as possible, whereas our work focuses on adding bounded noise that is as small as possible. In addition, our noise-adding scheme scales to much larger datasets with good transferability.\nAnother related subject is adversarial examples, or test-time attacks, which refers to presenting malicious testing samples to an already trained classifier. Since the classifier is given and fixed, no two-party game is involved. Research has shown that deep models are very sensitive to such adversarial examples due to the high dimensionality of the input data and the linear nature of deep neural networks [10]. Recent works have shown that such adversarial examples also exist in the physical world [8, 1], making them an important security and safety issue when designing high-stakes machine learning systems in open and dynamic environments. Our work can be regarded as a training-time analogy of adversarial examples. There have been works on explaining the effectiveness of adversarial examples; for instance, [26] proposed that it is the linearity inside neural networks that makes the decision boundary vulnerable in high-dimensional space. 
Although beyond the scope of this paper, we tested several hypotheses for explaining the effectiveness of training-time adversarial noise.\n\nFigure 1: An overview of learning to confuse: decoupling the alternating update for f\u03b8 and g\u03be\n\n3 The proposed method\n\nConsider the standard supervised learning procedure for classification, where one wants to learn the mapping f\u03b8 : X \u2192 {0, 1}^K from data, with K the number of classes being predicted. To learn the optimal parameters \u03b8\u2217, a loss function L(f\u03b8(x), y) : R^K \u00d7 Z+ \u2192 R+ (such as cross-entropy) is defined on the training data, and empirical risk minimization [27] can be applied; that is, one minimizes the loss on the training data:\n\n\u03b8\u2217 = arg min_\u03b8 \u03a3_{(x,y)\u223cD} L(f\u03b8(x), y)   (1)\n\nWhen f\u03b8 is a differentiable system such as a neural network, stochastic gradient descent (SGD) [3] or its variants can be applied, updating \u03b8 via\n\n\u03b8 \u2190 \u03b8 \u2212 \u03b1 \u2207_\u03b8 L(f\u03b8(x), y),   (2)\n\nwhere \u03b1 is the learning rate.\nThe goal of this work is to perturb the training data by adding artificial, imperceptible noise such that at test time the classifier's behavior on the clean test set is dramatically different. To formulate this, we first define a noise generator g\u03be : X \u2192 X that takes a training sample x in X and transforms it into an imperceptible noise pattern in the same space X. 
For image data, this constraint can be formulated as:\n\n\u2200x, \u2016g\u03be(x)\u2016\u221e \u2264 \u03b5   (3)\n\nHere, \u03b5 controls the perturbation strength, a common practice in adversarial settings [10]. In this work, we choose the noise generator g\u03be to be an encoder-decoder neural network whose final-layer activation is defined as \u03b5 \u00b7 tanh(\u00b7) to enforce constraint (3).\nWith the above motivation and notation, we can formalize the task as the following optimization problem:\n\nmax_\u03be \u03a3_{(x,y)\u223cD} L(f_{\u03b8\u2217(\u03be)}(x), y)\ns.t. \u03b8\u2217(\u03be) = arg min_\u03b8 \u03a3_{(x,y)\u223cD} L(f\u03b8(x + g\u03be(x)), y)   (4)\n\nIn other words, every possible configuration \u03be is paired with a classifier f_{\u03b8\u2217(\u03be)} trained on the correspondingly modified data; the goal is to find a noise generator g_{\u03be\u2217} such that the paired classifier f_{\u03b8\u2217(\u03be\u2217)} has the worst performance on the clean test set, compared with all other possible \u03be.\nThis non-convex optimization problem is challenging, especially due to the non-linear equality constraint. Here we propose an alternating update procedure, borrowing some commonly accepted tricks from reinforcement learning for stability [19], which is simple yet effective in practice.\nFirst, since f\u03b8 and g\u03be are assumed to be neural networks, the equality constraint can be relaxed into\n\n\u03b8_i = \u03b8_{i\u22121} \u2212 \u03b1 \u00b7 \u2207_{\u03b8_{i\u22121}} L(f_{\u03b8_{i\u22121}}(x + g\u03be(x)), y)   (5)\n\nwhere i is the index of the SGD updates.\nSecond, the basic idea is to alternately update f\u03b8 over the adversarial training data via gradient descent and update g\u03be over the clean data via gradient ascent. 
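A toy scalar instance of this alternating scheme can make the moving parts concrete. The sketch below is illustrative only, not the paper's networks: it assumes a linear victim f(x) = θ·x with squared loss, a one-parameter generator g(ξ) = ε·tanh(ξ) (so the L∞ bound of constraint (3) holds by construction), and a finite-difference surrogate for the gradient of the clean loss through the pseudo-update; all constants are made up for the example.

```python
import numpy as np

# Toy instance: victim f_theta(x) = theta * x with squared loss, and a
# one-parameter bounded noise generator g(xi) = EPS * tanh(xi).
# All names/constants are illustrative, not the paper's deep networks.
EPS, ALPHA_F, ALPHA_G, MAXITER, TRIALS = 0.3, 0.05, 0.5, 30, 20
rng = np.random.default_rng(0)
data = [(x, 2.0 * x) for x in rng.uniform(1.0, 2.0, size=MAXITER)]  # clean (x, y)

def g(xi):                       # perturbation, |g(xi)| <= EPS by construction
    return EPS * np.tanh(xi)

def grad_theta(theta, x, y):     # d/dtheta of (theta * x - y)^2
    return 2.0 * (theta * x - y) * x

def clean_loss_after_pseudo_update(xi, theta, x, y):
    # pseudo-update the victim on the adversarial sample, then
    # evaluate the updated victim's loss on the clean sample
    theta_p = theta - ALPHA_F * grad_theta(theta, x + g(xi), y)
    return (theta_p * x - y) ** 2

xi = 0.0
for _ in range(TRIALS):
    theta, traj = 0.0, []
    for x, y in data:                        # phase 1: train victim, record trajectory
        traj.append(theta)
        theta -= ALPHA_F * grad_theta(theta, x + g(xi), y)
    for (x, y), th in zip(data, traj):       # phase 2: gradient ascent on xi
        d = 1e-4                             # finite-difference surrogate gradient
        grad_xi = (clean_loss_after_pseudo_update(xi + d, th, x, y)
                   - clean_loss_after_pseudo_update(xi - d, th, x, y)) / (2 * d)
        xi += ALPHA_G * grad_xi

assert abs(g(xi)) <= EPS and xi != 0.0       # bounded noise that actually moved
```

In the full method both θ and ξ parameterize deep networks, and the ξ-gradient is obtained by backpropagating through the pseudo-update rather than by finite differences.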
The main problem is that if we directly use this alternating approach, neither network f\u03b8 nor g\u03be converges in practice. To stabilize the process, we propose to first update f\u03b8 over the adversarial training data while collecting the update trajectory of f\u03b8; then, based on this trajectory, we update the adversarial training data as well as g\u03be by computing a pseudo-update of f\u03b8 at each time step. The whole procedure is repeated for T trials until convergence. The detailed procedure is illustrated in Algorithm 1 and Figure 1.\n\nAlgorithm 1: Deep Confuse\nInput: Training data D, number of trials T, max iterations for training a classification model maxiter, learning rate of the classification model \u03b1_f, learning rate of the noise generator \u03b1_g, batch size b\nOutput: Learned noise generator g\u03be\n1  \u03be \u2190 RandomInit()\n2  for t = 1 to T do\n3      \u03b8_0 \u2190 RandomInit()\n4      L \u2190 empty list\n5      // Update f\u03b8 while keeping g\u03be fixed\n6      for i = 0 to maxiter do\n7          (x_i, y_i) \u223c D  // Sample a mini-batch of training data\n8          L.append((\u03b8_i, x_i, y_i))\n9          x_i^adv \u2190 x_i + g\u03be(x_i)\n10         \u03b8_{i+1} \u2190 \u03b8_i \u2212 \u03b1_f \u2207_{\u03b8_i} L(f_{\u03b8_i}(x_i^adv), y_i)  // Update model f\u03b8 by SGD\n11     end\n12     // Update g\u03be via pseudo-updates of f\u03b8\n13     for i = 0 to maxiter do\n14         (\u03b8_i, x_i, y_i) \u2190 L[i]\n15         \u03b8\u2032 \u2190 \u03b8_i \u2212 \u03b1_f \u2207_{\u03b8_i} L(f_{\u03b8_i}(x_i + g\u03be(x_i)), y_i)  // Pseudo-update f\u03b8 over the current adversarial data\n16         \u03be \u2190 \u03be + \u03b1_g \u2207_\u03be L(f_{\u03b8\u2032}(x), y)  // Update g\u03be over clean data\n17     end\n18 end\n19 return g\u03be\n\nFinally, we introduce one more modification for efficiency. Note that storing the whole trajectory of the gradient updates when training f\u03b8 is memory-inefficient. 
To avoid storing this information directly, during each trial of training we can create a copy of g\u03be, denoted g\u2032\u03be, let g\u2032\u03be alternately update with f\u03b8, and then copy the parameters back to g\u03be. By doing so, we can merge the two loops within each trial into a single one and need not store the gradients at all. The detailed procedure is illustrated in Algorithm 2.\n\nAlgorithm 2: Mem-Efficient Deep Confuse\nInput: Training data D, number of trials T, max iterations for training a classification model maxiter, learning rate of the classification model \u03b1_f, learning rate of the noise generator \u03b1_g, batch size b\nOutput: Learned noise generator g\u03be\n1  \u03be \u2190 RandomInit()\n2  g\u2032\u03be \u2190 g\u03be.copy()\n3  for t = 1 to T do\n4      \u03b8_0 \u2190 RandomInit()\n5      for i = 0 to maxiter do\n6          (x_i, y_i) \u223c D  // Sample a mini-batch\n7          \u03b8\u2032 \u2190 \u03b8_i \u2212 \u03b1_f \u2207_{\u03b8_i} L(f_{\u03b8_i}(x_i + g\u2032\u03be(x_i)), y_i)  // Pseudo-update f\u03b8 over the current adversarial data\n8          \u03be\u2032 \u2190 \u03be\u2032 + \u03b1_g \u2207_{\u03be\u2032} L(f_{\u03b8\u2032}(x), y)  // Update g\u2032\u03be using the current f\u03b8\n9          x_i^adv \u2190 x_i + g\u03be(x_i)\n10         \u03b8_{i+1} \u2190 \u03b8_i \u2212 \u03b1_f \u2207_{\u03b8_i} L(f_{\u03b8_i}(x_i^adv), y_i)  // Update f\u03b8 by SGD\n11     end\n12     g\u03be \u2190 g\u2032\u03be\n13 end\n14 return g\u03be\n\n4 Label Specific Adversaries\n\nIn this section, we briefly describe how to transfer our setting to label-specific scenarios. The goal of label-specific adversaries is that the attacker not only wants the classifier to make wrong predictions but also wants those predictions to follow some pre-defined rules. For instance, the attacker may want the classifier to wrongly recognize patterns from class A specifically as class B (and thus not as class C). 
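A minimal sketch of such a pre-defined rule: the shift-by-one label permutation used later in the paper's MNIST experiment (class 1 → class 2, ..., class 9 → class 0); any fixed label permutation would do, and the class count K here is illustrative.

```python
K = 10  # number of classes (illustrative)

def eta(y, k=K):
    """Pre-defined label-transformation rule: shift every label index by one."""
    return (y + 1) % k

assert eta(1) == 2 and eta(9) == 0
assert sorted(eta(y) for y in range(K)) == list(range(K))  # eta is a permutation
```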
To achieve this, denote by \u03b7 : Z+ \u2192 Z+ a pre-defined label-transformation function, chosen by the attacker, that maps one label index to a different one. The label-specific adversary can then be formalized as:\n\nmin_\u03be \u03a3_{(x,y)\u223cD} L(f_{\u03b8\u2217(\u03be)}(x), \u03b7(y))\ns.t. \u03b8\u2217(\u03be) = arg min_\u03b8 \u03a3_{(x,y)\u223cD} L(f\u03b8(x + g\u03be(x)), y)   (6)\n\nIt is easy to show that optimizing the above problem is nearly identical to the procedure described in Algorithm 2. The only changes needed are to replace the gradient ascent on \u03be\u2032 in Algorithm 2 with gradient descent, and to replace y with \u03b7(y) in that same update, keeping everything else unchanged.\n\n5 Experiment\n\nTo validate the effectiveness of our method, we used the classical MNIST [18] and CIFAR-10 [17] datasets for multi-class classification and a subset of ImageNet [7] for 2-class classification. Concretely, the ImageNet subset (bulbul vs. jellyfish) consists of 2,600 colored images of size 224\u00d7224\u00d73 for training and 100 colored images for testing. Random samples of the adversarial training data are illustrated in Figure 2.\n\nFigure 2: First rows: original training samples. Second rows: adversarial training samples. Panels: (a) 2-class ImageNet, (b) MNIST, (c) CIFAR-10.\n\nThe victim classifier f\u03b8 used during training for MNIST is a simple convolutional network with 2 convolution layers having 20 and 50 channels respectively, followed by a fully connected layer with 500 hidden units. For 2-class ImageNet and CIFAR-10, f\u03b8 is a CNN with 5 convolution layers having 32, 64, 128, 128, and 128 channels respectively, each followed by a 2\u00d72 pooling operation. Both classifiers use ReLU activations and 3\u00d73 kernels. 
Cross-entropy is used as the loss function. The learning rate and batch size for the classifiers f\u03b8 are set to 0.01 and 64 for MNIST and CIFAR-10, and 0.1 and 32 for ImageNet. The number of trials T is set to 500 in all cases.\nThe noise generator g\u03be for MNIST and ImageNet has an encoder-decoder structure where each encoder/decoder consists of four 4\u00d74 convolution layers with 16, 32, 64, and 128 channels respectively. For CIFAR-10, we use a U-Net [23], which has larger model capacity. The learning rate for the noise generator g\u03be is set to 10^-4 using Adam [15].\n\n5.1 Performance Evaluation of Training Time Adversary\n\nUsing the model configurations described above, we trained the noise generator g\u03be and its corresponding classifier f\u03b8 with the perturbation constraint \u03b5 set to 0.3, 0.1, and 0.032 for MNIST, ImageNet, and CIFAR-10, respectively. The classification results are summarized in Table 1. Each experiment is repeated 10 times.\n\nFigure 3: First row: deep features of the adversarial training data ((a) MNIST-Train, (b) ImageNet-Train, (c) CIFAR-Train). Second row: deep features of the clean test data ((d) MNIST-Test, (e) ImageNet-Test, (f) CIFAR-Test).\n\nTable 1: Test accuracy (mean\u00b1std) when the classifier is trained on the original clean training set and on the adversarial training set, respectively.\n\n                   MNIST          ImageNet       CIFAR-10\nClean Data         99.32 \u00b1 0.05   88.5 \u00b1 2.32    77.28 \u00b1 0.17\nAdversarial Data   0.25 \u00b1 0.04    54.2 \u00b1 11.19   28.77 \u00b1 2.80\n\nWhen trained on the adversarial datasets, the test accuracy drops dramatically to only 0.25 \u00b1 0.04, 54.2 \u00b1 11.19, and 28.77 \u00b1 2.80, clear evidence of the effectiveness of the proposed method.\nWe also visualized the activations of the final hidden layers of the f\u03b8's trained on the adversarial training sets in Figure 3. 
Concretely, we fit a PCA [22] model on the final hidden layer's output of each f\u03b8 on the adversarial training data; then, using the same projection, we projected the clean data into the same space. As the figure shows, the classifier trained on the adversarial data cannot differentiate the clean samples.\nIt is interesting to examine how the perturbation constraint \u03b5 affects performance, in terms of both accuracy and visual appearance. Concretely, on the MNIST dataset we varied \u03b5 from 0 (no modification) to 0.3 with a step size of 0.05, keeping all other configurations the same; the results are illustrated in Figure 4.\nThe test accuracy in Figure 4 refers to the performance of the corresponding models trained on the adversarial training data generated under each \u03b5. From the experimental result, we observed a sudden drop in performance when \u03b5 exceeds 0.15. Although beyond the scope of this work, we conjecture this result is related to, or somewhat consistent with, a similar theoretical guarantee for the robust error bound when \u03b5 is 0.10 [28].\nFinally, we examined the results when the training data is only partially modified. Concretely, under different perturbation constraints, we varied the percentage of adversarial samples in the training data while keeping other configurations the same. The results are demonstrated in Figure 5. Random flip refers to the case where the labels in the training data are randomly flipped.\n\nFigure 4: Effect of varying \u03b5.\n\nFigure 5: Varying the ratio of adversaries under different \u03b5.\n\n5.2 Evaluation of Transferability\n\nIn a more realistic setting, it is important to know the performance when a different classifier is used. Concretely, denote the original conv-net f\u03b8 used during training as CNN_original. 
After the adversarial data is obtained, we then train several different classifiers on the same adversarial data and evaluate their performance on the clean test set.\nFor MNIST, we doubled/halved all the channels/hidden units and denote the resulting models CNN_large and CNN_small, respectively. In addition, we also trained a standard Random Forest [4] with 300 trees and an SVM [6] with an RBF kernel (kernel coefficient 0.01). The experimental results are summarized in Figure 6.\n\nFigure 6: Test performance when using different classifiers. The horizontal red line indicates random-guess accuracy.\n\nThe blue histograms in Figure 6 correspond to the test performance of models trained on the clean dataset, whereas the orange histograms correspond to models trained on the adversarial dataset. The results show that the adversarial noise produced by g\u03be is general enough that even non-NN classifiers such as random forest and SVM are vulnerable and, as expected, produce poor results.\n\nFigure 7: Test performance when using different model architectures on (a) CIFAR-10 and (b) two-class ImageNet. The horizontal red line indicates random-guess accuracy.\n\nFor CIFAR-10 and ImageNet, we tried a variety of conv-nets, including VGG [25], ResNet [12], and DenseNet [13] with different depths, and evaluated their performance accordingly. The results are summarized in Figure 7. Again, good transferability of the adversarial noise is observed.\n\n5.3 The Generalization Gap and Linear Hypothesis\n\nTo fully illustrate the generalization gap caused by the adversaries, after obtaining the adversarial training data we retrained 3 conv-nets (one per dataset) with the same architecture as f\u03b8 and plotted the training curves in Figure 8. A clear generalization gap between training and testing is observed. 
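One way to probe what such a model latched onto is to score a trained victim on noise-only inputs against the true labels of the clean samples. A minimal sketch of this measurement with stand-in predictor and generator functions (all names hypothetical, not the paper's models):

```python
import numpy as np

def noise_only_accuracy(predict, g, x_clean, y_true):
    """Accuracy of the victim's predictions when fed ONLY the generated
    noise g(x), compared against the true labels of the clean samples.
    `predict` and `g` are placeholders for a trained victim and generator."""
    preds = predict(g(x_clean))
    return float(np.mean(preds == y_true))

# Tiny self-contained check: a stand-in "victim" that keys on the sign of
# the (sign-preserving, bounded) noise recovers the labels from noise alone.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = (x > 0).astype(int)
g = lambda z: 0.3 * np.tanh(z)              # bounded, sign-preserving noise
predict = lambda n: (n > 0).astype(int)     # victim that reads the noise
assert noise_only_accuracy(predict, g, x, y) == 1.0
```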
We conjecture that the deep model tends to over-fit to the training noise g\u03be(x).\n\nFigure 8: Learning curves for f\u03b8 on (a) MNIST, (b) 2-class ImageNet, and (c) CIFAR-10.\n\nTo validate this conjecture, we measured the predictive accuracy between the true labels and the predictions f\u03b8(g\u03be(x)) obtained by feeding only the adversarial noise as input. The results are summarized in Table 2. Notice that 95.15%, 93.00%, and 72.98% accuracy is obtained from the noise alone.\nThis interesting result confirms the conjecture that the model does over-fit to the noise. Here we give one possible explanation: we hypothesize that it is the linearity inside deep models that makes the adversarial noise effective. In other words, f\u03b8(g\u03be(x)) contributes most when minimizing L(f\u03b8(x + g\u03be(x)), y). This result is deeply related to, and consistent with, the results on adversarial examples [10] and the memorization property of DNNs [29].\n\nTable 2: Prediction accuracy taking only noise as input, i.e., the accuracy between the true label and f\u03b8(g\u03be(x)), where x is the clean sample.\n\n            Noise_train   Noise_test\nMNIST       95.15         95.62\nImageNet    93.00         88.87\nCIFAR-10    72.98         78.57\n\nFigure 9: Clean samples and their corresponding adversarial noises for MNIST, CIFAR-10, and ImageNet.\n\n5.4 Weight Visualizations\n\nInstead of visualizing deep features of the adversarial data, it is also interesting to directly plot the trained weights of the victim classifier as a visual interpretation of the effectiveness. Concretely, we visualized the weights of two linear SVMs trained on clean and adversarial training data, respectively. Our results are shown in Figure 10.\nIt can be seen that, compared with the image templates (top row) obtained from clean training data, the victim SVM weights (bottom row) trained on adversarial data move in the opposite direction and tend to over-fit on image corners. 
This result also hints that the decision boundary in a high-dimensional space is indeed easy to manipulate, which in turn gives the attacker the chance to produce training-time adversarial data.\n\nFigure 10: Linear SVM weight visualization for MNIST. Top row: weights trained on clean training data. Bottom row: weights trained on adversarial training data.\n\n5.5 Label Specific Adversaries\n\nTo validate the effectiveness in the label-specific adversarial setting, without loss of generality, we shift the predictions by one. For the MNIST dataset, we want the classifier trained on the adversarial data to predict test samples from class 1 specifically as class 2, class 2 as class 3, ..., and class 9 as class 0. Using the method described in Section 4, we trained the corresponding noise generator and evaluated the corresponding CNN on the test set, as illustrated in Figure 11.\n\nFigure 11: Confusion matrices on the test set for MNIST under different scenarios, summarizing the test performance of classifiers trained on (a) clean training data, (b) the non-label-specific setting, and (c) the label-specific setting.\n\nCompared with the test accuracy (0.25 \u00b1 0.04) in the non-label-specific setting, the test accuracy here also dropped, to 1.48 \u00b1 0.21; in addition, the success rate of hitting the desired target label increased from 0.00 to 79.7 \u00b1 0.38. Such results give positive support for the effectiveness of the label-specific adversarial setting. 
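The label-specific success rate is simply the fraction of predictions that land exactly on the attacker's target class. A minimal sketch with a stand-in prediction array and the shift-by-one rule (both illustrative):

```python
import numpy as np

def target_success_rate(preds, labels, eta):
    """Fraction of test samples predicted exactly as the attacker's
    target class eta(y); `eta` is a pre-defined label-shift rule."""
    targets = np.array([eta(y) for y in labels])
    return float(np.mean(preds == targets))

# Stand-in example with the shift-by-one rule on 10 classes.
eta = lambda y: (y + 1) % 10
labels = np.array([0, 1, 2, 9])
preds = np.array([1, 2, 0, 0])           # three of four hit eta(y)
assert target_success_rate(preds, labels, eta) == 0.75
```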
Note that this is only a side product of the proposed method, showing that the formulation can easily be modified to achieve more user-specific tasks.\n\n6 Conclusion\n\nIn this work, we proposed a general framework for generating training-time adversarial data by letting an auto-encoder watch and move against an imaginary victim classifier. We further proposed a simple yet effective training scheme that trains both networks simultaneously by decoupling the alternating update procedure for stability. Experiments on image data confirmed the effectiveness of the proposed method; in particular, the adversarial data remains effective even when a different victim classifier is used, making the approach more useful in realistic settings.\nTheoretical analysis and further improvements to the optimization procedure are planned as future work. In addition, it would be interesting to design adversarially robust classifiers against this scheme.\n\nAcknowledgments\n\nThis research was supported by NSFC (61751306), the National Key R&D Program of China (2018YFB1004300), and the Collaborative Innovation Center of Novel Software Technology and Industrialization. The first two authors would like to thank Beijing Sinnovation Ventures Megvii International AI Institute Company Limited for the support.\n\nReferences\n[1] Athalye, A. and Sutskever, I. Synthesizing robust adversarial examples. In ICML, pp. 284\u2013293, 2018.\n\n[2] Biggio, B., Nelson, B., and Laskov, P. Poisoning attacks against support vector machines. In ICML, pp. 1467\u20131474, 2012.\n\n[3] Bottou, L. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pp. 177\u2013186, 2010.\n\n[4] Breiman, L. Random forests. Machine Learning, 45(1):5\u201332, 2001.\n\n[5] Chen, X., Liu, C., Li, B., Lu, K., and Song, D. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.\n\n[6] Cortes, C. and Vapnik, V. 
Support-vector networks. Machine Learning, 20(3):273\u2013297, 1995.\n\n[7] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248\u2013255, 2009.\n\n[8] Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., and Song, D. Robust physical-world attacks on deep learning visual classification. In CVPR, pp. 1625\u20131634, 2018.\n\n[9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, pp. 2672\u20132680, 2014.\n\n[10] Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In ICLR, 2015.\n\n[11] Gu, T., Dolan-Gavitt, B., and Garg, S. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.\n\n[12] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770\u2013778, 2016.\n\n[13] Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, pp. 2261\u20132269, 2017.\n\n[14] Jagielski, M., Oprea, A., Biggio, B., Liu, C., Nita-Rotaru, C., and Li, B. Manipulating machine learning: Poisoning attacks and countermeasures for regression learning. In IEEE S&P, pp. 19\u201335, 2018.\n\n[15] Kingma, D. P. and Ba, J. L. Adam: A method for stochastic optimization. In ICLR, 2014.\n\n[16] Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In ICML, pp. 1885\u20131894, 2017.\n\n[17] Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.\n\n[18] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[19] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. 
A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529\u2013533, 2015.\n\n[20] Mu\u00f1oz-Gonz\u00e1lez, L., Biggio, B., Demontis, A., Paudice, A., Wongrassamee, V., Lupu, E. C., and Roli, F. Towards poisoning of deep learning algorithms with back-gradient optimization. In ACM Workshop on Artificial Intelligence and Security, pp. 27\u201338, 2017.\n\n[21] Nelson, B., Barreno, M., Chi, F. J., Joseph, A. D., Rubinstein, B. I. P., Saini, U., Sutton, C., Tygar, J. D., and Xia, K. Exploiting machine learning to subvert your spam filter. In USENIX Workshop on Large-Scale Exploits and Emergent Threats, pp. 7:1\u20137:9, 2008.\n\n[22] Pearson, K. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559\u2013572, 1901.\n\n[23] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234\u2013241. Springer, 2015.\n\n[24] Shafahi, A., Huang, W. R., Najibi, M., Suciu, O., Studer, C., Dumitras, T., and Goldstein, T. Poison frogs! Targeted clean-label poisoning attacks on neural networks. In NIPS, pp. 6106\u20136116, 2018.\n\n[25] Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[26] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint, 2013.\n\n[27] Vapnik, V. Principles of risk minimization for learning theory. In NIPS, pp. 831\u2013838, 1992.\n\n[28] Wong, E. and Kolter, Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML, pp. 
5283\u20135292, 2018.\n\n[29] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.\n", "award": [], "sourceid": 6459, "authors": [{"given_name": "Ji", "family_name": "Feng", "institution": "Sinovation Ventures"}, {"given_name": "Qi-Zhi", "family_name": "Cai", "institution": "Sinovation Ventures"}, {"given_name": "Zhi-Hua", "family_name": "Zhou", "institution": "Nanjing University"}]}