{"title": "Large Margin Deep Networks for Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 842, "page_last": 852, "abstract": "We present a formulation of deep learning that aims at  producing a large margin classifier. The notion of \\emc{margin}, minimum distance to a decision boundary, has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large margin algorithms are applicable only to shallow models with a preset feature representation; and conventional margin methods for neural networks only enforce margin at the output layer.\nSuch methods are therefore not well suited for deep networks. In this work, we propose a novel loss function to impose a margin on any chosen set of layers of a deep network (including input and hidden layers). Our formulation allows choosing any $l_p$ norm ($p \\geq 1$) on the metric measuring the margin. We demonstrate that the decision boundary obtained by our loss has nice properties compared to standard classification loss functions. Specifically, we show improved empirical results on the MNIST, CIFAR-10 and ImageNet datasets on multiple tasks:\ngeneralization from small training sets, corrupted labels, and robustness against adversarial perturbations. The resulting loss is general and complementary to existing data augmentation (such as random/adversarial input transform) and regularization techniques such as weight decay, dropout, and batch norm. \\footnote{Code for the large margin loss function is released at \\url{https://github.com/google-research/google-research/tree/master/large_margin}}", "full_text": "Large Margin Deep Networks for Classi\ufb01cation\n\nGamaleldin F. 
Elsayed \u2217\nGoogle Research\n\nDilip Krishnan\nGoogle Research\n\nHossein Mobahi\nGoogle Research\n\nKevin Regan\nGoogle Research\n\nSamy Bengio\nGoogle Research\n\n{gamaleldin, dilipkay, hmobahi, kevinregan, bengio}@google.com\n\nAbstract\n\nWe present a formulation of deep learning that aims at producing a large margin\nclassifier. The notion of margin, minimum distance to a decision boundary, has\nserved as the foundation of several theoretically profound and empirically successful\nresults for both classification and regression tasks. However, most large\nmargin algorithms are applicable only to shallow models with a preset feature\nrepresentation; and conventional margin methods for neural networks only enforce\nmargin at the output layer. Such methods are therefore not well suited for deep\nnetworks. In this work, we propose a novel loss function to impose a margin on\nany chosen set of layers of a deep network (including input and hidden layers).\nOur formulation allows choosing any lp norm (p \u2265 1) on the metric measuring\nthe margin. We demonstrate that the decision boundary obtained by our loss has\nnice properties compared to standard classification loss functions. Specifically, we\nshow improved empirical results on the MNIST, CIFAR-10 and ImageNet datasets\non multiple tasks: generalization from small training sets, corrupted labels, and\nrobustness against adversarial perturbations. The resulting loss is general and\ncomplementary to existing data augmentation (such as random/adversarial input\ntransform) and regularization techniques such as weight decay, dropout, and batch\nnorm. 2\n\n1 Introduction\n\nThe large margin principle has played a key role in the course of machine learning history, producing\nremarkable theoretical and empirical results for classification (Vapnik, 1995) and regression problems\n(Drucker et al., 1997). 
However, exact large margin algorithms are only suitable for shallow models.\nIn fact, for deep models, computation of the margin itself becomes intractable. This is in contrast\nto classic setups such as kernel SVMs, where the margin has an analytical form (the l2 norm of the\nparameters). Desirable benefits of large margin classifiers include better generalization properties\nand robustness to input perturbations (Cortes & Vapnik, 1995; Bousquet & Elisseeff, 2002).\nTo overcome the limitations of classical margin approaches, we design a novel loss function based\non a first-order approximation of the margin. This loss function is applicable to any network\narchitecture (e.g., arbitrary depth, activation function, use of convolutions, residual networks), and\ncomplements existing general-purpose regularization techniques such as weight-decay, dropout and\nbatch normalization.\n\n\u2217Work done as member of the Google AI Residency program https://ai.google/research/join-us/ai-residency/\n\n2Code for the large margin loss function is released at https://github.com/google-research/google-research/tree/master/large_margin\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWe illustrate the basic idea of a large margin classifier within a toy setup in Figure 1. For demonstration\npurposes, consider a binary classification task and assume there is a model that can perfectly separate\nthe data. Suppose the model is parameterized by vector w, and the model g(x; w) maps the input\nvector x to a real number, as shown in Figure 1(a), where the yellow region corresponds to positive\nvalues of g(x; w) and the blue region to negative values; the red and blue dots represent training\npoints from the two classes. Such g is sometimes called a discriminant function. 
For a \ufb01xed w,\ng(x; w) partitions the input space into two sets of regions, depending on whether g(x; w) is positive\nor negative at those points. We refer to the boundary of these sets as the decision boundary, which\ncan be characterized by {x| g(x; w) = 0} when g is a continuous function. For a \ufb01xed w, consider\nthe distance of each training point to the decision boundary. We call the smallest non-negative such\ndistance the margin. A large margin classi\ufb01er seeks model parameters w that attain the largest\nmargin. Figure 1(b) shows the decision boundaries attained by our new loss (right), and another\nsolution attained by the standard cross-entropy loss (left). The yellow squares show regions where\nthe large margin solution better captures the correct data distribution.\n\nFigure 1: Illustration of large margin. (a) The distance of each training point to the decision\nboundary, with the shortest one being marked as \u03b3. While the closest point to the decision boundary\ndoes not need to be unique, the value of shortest distance (i.e. \u03b3 itself) is unique. (b) Toy example\nillustrating a good and a bad decision boundary obtained by optimizing a 4-layer deep network with\ncross-entropy loss (left), and with our proposed large margin loss (right). The two losses were trained\nfor 10000 steps on data shown in bold dots (train accuracy is 100% for both losses). Accuracy on test\ndata (light dots) is reported at the top of each \ufb01gure. Note how the decision boundary is better shaped\nin the region outlined by the yellow squares. This \ufb01gure is best seen in PDF.\n\nMargin may be de\ufb01ned based on the values of g (i.e. the output space) or on the input space. Despite\nsimilar naming, the two are very different. Margin based on output space values is the conventional\nde\ufb01nition. 
In fact, output margin can be computed exactly even for deep networks (Sun et al., 2015).\nIn contrast, the margin in the input space is computationally intractable for deep models. Despite\nthat, the input margin is often of more practical interest. For example, a large margin in the input\nspace implies immunity to input perturbations. Specifically, if a classifier attains margin of \u03b3, i.e.\nthe decision boundary is at least \u03b3 away from all the training points, then any perturbation of the\ninput that is smaller than \u03b3 will not be able to flip the predicted label. More formally, a model with a\nmargin of \u03b3 is robust to perturbations \u03b4 of a training point x in the sense that\nsign(g(x)) = sign(g(x + \u03b4)) whenever \u2016\u03b4\u2016 < \u03b3. It\nhas been shown that standard deep learning methods lack such robustness (Szegedy et al., 2013).\nIn this work, our main contribution is to derive a new loss for obtaining a large margin classifier\nfor deep networks, where the margin can be based on any lp-norm (p \u2265 1), and the margin may be\ndefined on any chosen set of layers of a network. We empirically evaluate our loss function on deep\nnetworks across different applications, datasets and model architectures. Specifically, we study the\nperformance of these models on tasks of adversarial learning, generalization from limited training\ndata, and learning from data with noisy labels. We show that the proposed loss function consistently\noutperforms baseline models trained with conventional losses, e.g. for adversarial perturbation, we\noutperform common baselines by up to 21% on MNIST, 14% on CIFAR-10 and 11% on Imagenet.\n\n[Figure 1(b) panel titles: cross entropy, test accuracy 98%; large margin, test accuracy 99%.]\n\n2 Related Work\n\nPrior work (Liu et al., 2016; Sun et al., 2015; Sokolic et al., 2016; Liang et al., 2017) has explored\nthe benefits of encouraging large margin in the context of deep networks. 
Sun et al. (2015) state that\ncross-entropy loss does not have margin-maximization properties, and add terms to the cross-entropy\nloss to encourage large margin solutions. However, these terms encourage margins only at the\noutput layer of a deep neural network. Other recent work (Soudry et al., 2017) proved that one\ncan attain a max-margin solution by using cross-entropy loss with stochastic gradient descent (SGD)\noptimization. Yet this was only demonstrated for linear architectures, making it less useful for deep,\nnonlinear networks. Sokolic et al. (2016) introduced a regularizer based on the Jacobian of the loss\nfunction with respect to network layers, and demonstrated that their regularizer can lead to larger\nmargin solutions. This formulation only offers L2 distance metrics and therefore may not be robust\nto deviation of data based on other metrics (e.g., adversarial perturbations). In contrast, our work\nformulates a loss function that directly maximizes the margin at any layer, including input, hidden\nand output layers. Our formulation generalizes to margin definitions under different distance metrics (e.g.\nl1, l2, and l\u221e norms). We provide empirical evidence of superior performance in generalization\ntasks with limited data and noisy labels, as well as robustness to adversarial perturbations. Finally,\nHein & Andriushchenko (2017) propose a linearization similar to ours, but use a very different loss\nfunction for optimization. 
Their setup and optimization are speci\ufb01c to the adversarial robustness\nscenario, whereas we also consider generalization and noisy labels; their resulting loss function\nis computationally expensive and possibly dif\ufb01cult to scale to large problems such as Imagenet.\nMatyasko & Chau (2017) also derive a similar linearization and apply it to adversarial robustness\nwith promising results on MNIST and CIFAR-10.\nIn real applications, training data is often not as copious as we would like, and collected data might\nhave noisy labels. Generalization has been extensively studied as part of the semi-supervised and\nfew-shot learning literature, e.g. (Vinyals et al., 2016; Rasmus et al., 2015). Speci\ufb01c techniques\nto handle noisy labels for deep networks have also been developed (Sukhbaatar et al., 2014; Reed\net al., 2014). Our margin loss provides generalization bene\ufb01ts and robustness to noisy labels and is\ncomplementary to these works. Deep networks are susceptible to adversarial attacks (Szegedy et al.,\n2013) and a number of attacks (Papernot et al., 2017; Sharif et al., 2016; Hosseini et al., 2017), and\ndefenses (Kurakin et al., 2016; Madry et al., 2017; Guo et al., 2017; Athalye & Sutskever, 2017) have\nbeen developed. A natural bene\ufb01t of large margins is robustness to adversarial attacks, as we show\nempirically in Sec. 4.\n\n3 Large Margin Deep Networks\nConsider a classi\ufb01cation problem with n classes. Suppose we use a function fi : X \u2192 R, for\ni = 1, . . . , n that generates a prediction score for classifying the input vector x \u2208 X to class i. The\npredicted label is decided by the class with maximal score, i.e. 
$i^* = \\arg\\max_i f_i(x)$.\nDefine the decision boundary for each class pair $\\{i, j\\}$ as:\n\n$$D_{\\{i,j\\}} \\triangleq \\{x \\mid f_i(x) = f_j(x)\\} \\tag{1}$$\n\nUnder this definition, the distance of a point x to the decision boundary $D_{\\{i,j\\}}$ is defined as the\nsmallest displacement of the point that results in a score tie:\n\n$$d_{f,x,\\{i,j\\}} \\triangleq \\min_{\\delta} \\|\\delta\\|_p \\quad \\text{s.t.} \\quad f_i(x + \\delta) = f_j(x + \\delta) \\tag{2}$$\n\nHere $\\|\\cdot\\|_p$ is any lp norm (p \u2265 1). Using this distance, we can develop a large margin loss. We start\nwith a training set consisting of pairs $(x_k, y_k)$, where the label $y_k \\in \\{1, \\ldots, n\\}$. We penalize the\ndisplacement of each $x_k$ to satisfy the margin constraint for separating class $y_k$ from class i ($i \\neq y_k$).\nThis implies using the following loss function:\n\n$$\\max\\{0, \\gamma + d_{f,x_k,\\{i,y_k\\}} \\, \\mathrm{sign}(f_i(x_k) - f_{y_k}(x_k))\\}, \\tag{3}$$\n\nwhere the sign(.) adjusts the polarity of the distance. The intuition is that, if $x_k$ is already correctly\nclassified, then we only want to ensure it has distance \u03b3 from the decision boundary, and penalize\nproportional to the distance d it falls short (so the penalty is max{0, \u03b3 \u2212 d}). However, if it is\nmisclassified, we also want to penalize the point for not being correctly classified. Hence, the penalty\nincludes the distance $x_k$ needs to travel to reach the decision boundary as well as another \u03b3 distance\nto travel on the correct side of decision boundary to attain \u03b3 margin. Therefore, the penalty becomes\nmax{0, \u03b3 + d}. 
In a multiclass setting, we aggregate individual losses arising from each $i \\neq y_k$ by\nsome aggregation operator A:\n\n$$A_{i \\neq y_k} \\max\\{0, \\gamma + d_{f,x_k,\\{i,y_k\\}} \\, \\mathrm{sign}(f_i(x_k) - f_{y_k}(x_k))\\} \\tag{4}$$\n\nIn this paper we use two aggregation operators, namely the max operator max and the sum operator\n$\\sum$. 
In order to learn the $f_i$'s, we assume they are parameterized by a vector w and should use the notation\n$f_i(x; w)$; for brevity we keep using the notation $f_i(x)$. The goal is to minimize the loss w.r.t. w:\n\n$$w^* \\triangleq \\arg\\min_w \\sum_k A_{i \\neq y_k} \\max\\{0, \\gamma + d_{f,x_k,\\{i,y_k\\}} \\, \\mathrm{sign}(f_i(x_k) - f_{y_k}(x_k))\\} \\tag{5}$$\n\nThe above formulation depends on d, whose exact computation from (2) is intractable when the $f_i$'s are\nnonlinear. Instead, we present an approximation to d by linearizing $f_i$ w.r.t. \u03b4 around \u03b4 = 0.\n\n$$\\tilde{d}_{f,x,\\{i,j\\}} \\triangleq \\min_{\\delta} \\|\\delta\\|_p \\quad \\text{s.t.} \\quad f_i(x) + \\langle \\delta, \\nabla_x f_i(x) \\rangle = f_j(x) + \\langle \\delta, \\nabla_x f_j(x) \\rangle \\tag{6}$$\n\nThis problem now has the following closed form solution (see supplementary for proof):\n\n$$\\tilde{d}_{f,x,\\{i,j\\}} = \\frac{|f_i(x) - f_j(x)|}{\\|\\nabla_x f_i(x) - \\nabla_x f_j(x)\\|_q}, \\tag{7}$$\n\nwhere $\\|\\cdot\\|_q$ is the dual-norm of $\\|\\cdot\\|_p$. lq is the dual norm of lp when it satisfies $q \\triangleq \\frac{p}{p-1}$ (Boyd &\nVandenberghe, 2004). For example if distances are measured w.r.t. l1, l2, or l\u221e norm, the norm in (7)\nwill respectively be l\u221e, l2, or l1 norm. Using the linear approximation, the loss function becomes:\n\n$$\\hat{w} \\triangleq \\arg\\min_w \\sum_k A_{i \\neq y_k} \\max\\{0, \\gamma + \\frac{|f_i(x_k) - f_{y_k}(x_k)|}{\\|\\nabla_x f_i(x_k) - \\nabla_x f_{y_k}(x_k)\\|_q} \\, \\mathrm{sign}(f_i(x_k) - f_{y_k}(x_k))\\} \\tag{8}$$\n\nSince |a| sign(a) = a, this further simplifies to the following problem:\n\n$$\\hat{w} \\triangleq \\arg\\min_w \\sum_k A_{i \\neq y_k} \\max\\{0, \\gamma + \\frac{f_i(x_k) - f_{y_k}(x_k)}{\\|\\nabla_x f_i(x_k) - \\nabla_x f_{y_k}(x_k)\\|_q}\\} \\tag{9}$$\n\nIn (Huang et al., 2015), (7) has been derived (independently of us) to facilitate adversarial training\nwith different norms. 
In contrast, we develop a novel margin-based loss function that uses this\ndistance metric at multiple hidden layers, and show benefits for a wide range of problems. In the\nsupplementary material, we show that (7) coincides with an SVM for the special case of a linear\nclassifier.\n\n3.1 Margin for Hidden Layers\nThe classic notion of margin is defined based on the distance of input samples from the decision\nboundary; in shallow models such as SVM, input/output association is the only way to define a\nmargin. In deep networks, however, the output is shaped from input by going through a number\nof transformations (layers). In fact, the activations at each intermediate layer could be interpreted\nas some intermediate representation of the data for the following part of the network. Thus, we\ncan define the margin based on any intermediate representation and the ultimate decision boundary.\nWe leverage this structure to enforce that the entire representation maintain a large margin with the\ndecision boundary. The idea then, is to simply replace the input x in the margin formulation (9)\nwith the intermediate representation of x. More precisely, let $h_\\ell$ denote the output of the $\\ell$'th layer\n($h_0 = x$) and $\\gamma_\\ell$ be the margin enforced for its corresponding representation. Then the margin loss\n(9) can be adapted as below to incorporate intermediate margins (where the $\\epsilon$ in the denominator is\nused to prevent numerical problems, and is set to a small value such as $10^{-6}$ in practice):\n\n$$\\hat{w} \\triangleq \\arg\\min_w \\sum_{\\ell,k} A_{i \\neq y_k} \\max\\{0, \\gamma_\\ell + \\frac{f_i(x_k) - f_{y_k}(x_k)}{\\epsilon + \\|\\nabla_{h_\\ell} f_i(x_k) - \\nabla_{h_\\ell} f_{y_k}(x_k)\\|_q}\\} \\tag{10}$$\n\n4 Experiments\n\nHere we provide empirical results using formulation (10) on a number of tasks and datasets. 
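As a reading aid for formulation (10), here is a minimal sketch restricted to the input layer (h0 = x) and to a hypothetical linear scorer f_i(x) = dot(weights[i], x) + biases[i]. For such a scorer the input gradient of each score is simply its weight row, so no automatic differentiation is needed and the linearized distance is exact. The names dual_exponent, norm and margin_loss, and the linear model itself, are assumptions made for illustration; this is not the released TensorFlow code.

```python
import math


def dual_exponent(p):
    # Dual exponent q with 1/p + 1/q = 1, i.e. q = p / (p - 1).
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)


def norm(v, q):
    # l_q norm of a plain Python list.
    if q == math.inf:
        return max(abs(a) for a in v)
    return sum(abs(a) ** q for a in v) ** (1.0 / q)


def margin_loss(weights, biases, x, y, gamma, p=2.0, eps=1e-6, aggregate=max):
    # Per-sample loss of formulation (10) for one layer (the input).
    # aggregate plays the role of the operator A: pass max or sum.
    q = dual_exponent(p)
    scores = [sum(w * a for w, a in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    terms = []
    for i, row in enumerate(weights):
        if i == y:
            continue
        # For a linear scorer, grad_x f_i - grad_x f_y = weights[i] - weights[y].
        grad_diff = [wi - wy for wi, wy in zip(row, weights[y])]
        signed_dist = (scores[i] - scores[y]) / (eps + norm(grad_diff, q))
        terms.append(max(0.0, gamma + signed_dist))
    return aggregate(terms)
```

For instance, with weights [[1, 0], [-1, 0]], zero biases, x = [0.5, 0] and y = 0, the point lies at distance 0.5 from the decision boundary, so the loss is zero for gamma = 0.25 and roughly 0.5 for gamma = 1.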
We\nconsider the following datasets and models: a deep convolutional network for MNIST (LeCun et al.,\n1998), a deep residual convolutional network for CIFAR-10 (Zagoruyko & Komodakis, 2016) and an\nImagenet model with the Inception v3 architecture (Szegedy et al., 2016). Details of the architectures\nand hyperparameter settings are provided in the supplementary material. Our code was written in\nTensorFlow (Abadi et al., 2016). The tasks we consider are: training with noisy labels, training with\nlimited data, and defense against adversarial perturbations. In all these cases, we expect that the\npresence of a large margin provides robustness and improves test accuracies. As shown below, this is\nindeed the case across all datasets and scenarios considered.\n\n4.1 Optimization of Parameters\n\nOur loss function (10) differs from the cross-entropy loss due to the presence of gradients in the\nloss itself. We compute these gradients for each class $i \\neq y_k$ ($y_k$ is the true label corresponding to\nsample $x_k$). To reduce computational cost, we choose a subset of the total number of classes. We\npick these classes by choosing $i \\neq y_k$ that have the highest value from the forward propagation step.\nFor MNIST and CIFAR-10, we used all 9 (other) classes. For Imagenet we used only 1 class $i \\neq y_k$\n(using more classes increased computational cost without helping performance). The backpropagation step\nfor parameter updates requires the computation of second-order mixed gradients. To further reduce\ncomputation cost to a manageable level, we use a first-order Taylor approximation to the gradient\nwith respect to the weights. This approximation simply corresponds to treating the denominator\n($\\|\\nabla_{h_\\ell} f_i(x_k) - \\nabla_{h_\\ell} f_{y_k}(x_k)\\|_q$) in (10) as a constant with respect to w for backpropagation. 
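To make concrete what is dropped when the denominator is held constant, the sketch below uses a hypothetical linear toy case with the l2 norm: v stands for the weight difference w_i - w_y, so the numerator is dot(v, x) and the denominator is the l2 norm of v. The function names are illustrative assumptions. The toy only shows which term the approximation omits; it says nothing about how much that term matters in a deep network.

```python
import math


def term(v, x):
    # dot(v, x) / l2-norm(v): the margin ratio for a linear scorer,
    # where v is both the score weights and the input gradient.
    return sum(a * b for a, b in zip(v, x)) / math.sqrt(sum(a * a for a in v))


def approx_grad(v, x):
    # Denominator treated as a constant with respect to v:
    # only the numerator dot(v, x) is differentiated.
    n = math.sqrt(sum(a * a for a in v))
    return [b / n for b in x]


def exact_grad(v, x):
    # Full quotient rule, including the derivative of the norm.
    n = math.sqrt(sum(a * a for a in v))
    s = sum(a * b for a, b in zip(v, x))
    return [b / n - s * a / n ** 3 for a, b in zip(v, x)]
```

With v = [3, 4] and x = [1, 2], approx_grad returns x divided by the norm of v, while exact_grad additionally subtracts the contribution of the changing norm. A framework implementation would typically obtain the same effect by blocking gradient flow through the denominator.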
The value\nof $\\|\\nabla_{h_\\ell} f_i(x_k) - \\nabla_{h_\\ell} f_{y_k}(x_k)\\|_q$ is recomputed at every forward propagation step. We compared\nperformance with and without this approximation for MNIST and found minimal difference in\naccuracy, but significantly higher GPU memory requirement due to the computation of second-order\nmixed derivatives without the approximation (a derivative with respect to activations, followed by\nanother with respect to weights). Using these optimizations, we found, for example, that training\nis around 20% to 60% more expensive in wall-clock time for the margin model compared to cross-entropy,\nmeasured on the same NVIDIA P100 GPU (but note that there is no additional cost at\ninference time). Finally, to improve stability when the denominator is small, we found it beneficial to\nclip the loss at some threshold. We use standard optimizers such as RMSProp (Tieleman & Hinton,\n2012).\n\n4.2 MNIST\n\nWe train a 4 hidden-layer model with 2 convolutional layers and 2 fully connected layers, with\nrectified linear unit (ReLU) activation functions, and a softmax output layer. The first baseline\nmodel uses a cross-entropy loss function, trained with stochastic gradient descent optimization with\nmomentum and learning rate decay. A natural question is whether having a large margin loss defined\nat the network output such as the standard hinge loss could be sufficient to give good performance.\nTherefore, we trained a second baseline model using a hinge loss combined with a small weight of\ncross-entropy.\nThe large margin model has the same architecture as the baseline, but we use our new loss function\nin formulation (10). We considered margin models using an l\u221e, l1 or l2 norm on the distances,\nrespectively. For each norm, we train a model with margin either only on the input layer, or on all\nhidden layers and the output layer. Thus there are 6 margin models in all. 
For models with margin\nat all layers, the hyperparameter \u03b3l is set to the same value for each layer (to reduce the number of\nhyperparameters). Furthermore, we observe that using a weighted sum of margin and cross-entropy\nfacilitates training and speeds up convergence 3. We tested all models with both stochastic gradient\ndescent with momentum, and RMSProp (Hinton et al.) optimizers and chose the one that worked\nbest on the validation set. In case of cross-entropy and hinge loss we used momentum; in case of\nmargin models for MNIST, we used RMSProp with no momentum.\nFor all our models, we perform a hyperparameter search including with and without dropout, with\nand without weight decay and different values of \u03b3l for the margin model (same value for all layers\nwhere margin is applied). We hold out 5,000 samples of the training set as a validation set, and the\nremaining 55,000 samples are used for training. The full evaluation set of 10,000 samples is used\nfor reporting all accuracies. Under this protocol, the cross-entropy and margin models trained on\nthe 55,000 sample training set achieve a test accuracy of 99.4% and the hinge loss model achieves\n99.2%.\n\n3We emphasize that the performance achieved with this combination cannot be obtained by cross-entropy\nalone, as shown in the performance plots.\n\n4.2.1 Noisy Labels\n\nIn this experiment, we choose, for each training sample, whether to flip its label with some other\nlabel chosen at random. E.g. an instance of \u201c1\u201d may be labeled as digit \u201c6\u201d. The percentage of such\nflipped labels varies from 0% to 80% in increments of 20%. Once a label is flipped, that label is fixed\nthroughout training. Fig. 2(left) shows the performance of the best performing 4 (all layer margin\nand cross-entropy) of the 8 algorithms, with test accuracy plotted against noise level. 
It is seen that\nthe margin l1 and l2 models perform better than cross-entropy across the entire range of noise levels,\nwhile the margin l\u221e model is slightly worse than cross-entropy. In particular, the margin l2 model\nachieves an evaluation accuracy of 96.4% at 80% label noise, compared to 93.9% for cross-entropy.\nThe input only margin models were outperformed by the all layer margin models and are not shown\nin Fig. 2. We find that this holds true across all our tests. The performance of all 8 methods is shown\nin the supplementary material.\n\nFigure 2: Performance of MNIST models on: (left) noisy label tasks and (right) generalization tasks.\n\n4.2.2 Generalization\n\nIn this experiment we consider models trained with significantly lower amounts of training data. This\nis a problem of practical importance, and we expect that a large margin model should have better\ngeneralization abilities. Specifically, we randomly remove some fraction of the training set, going\ndown from 100% of training samples to only 0.125%, which is 68 samples. In Fig. 2(right), the\nperformance of cross-entropy, hinge and margin (all layers) is shown. The test accuracy is plotted\nagainst the fraction of data used for training. We also show the generalization results of a Bayesian\nactive learning approach presented in (Gal et al., 2017). The all-layer margin models outperform both\ncross-entropy and (Gal et al., 2017) over the entire range of testing, and the amount by which the\nmargin models outperform increases as the dataset size decreases. The all-layer l\u221e-margin model\noutperforms cross-entropy by around 3.7% in the smallest training set of 68 samples. 
We use the\nsame randomly drawn training set for all models.\n\n4.2.3 Adversarial Perturbation\n\nBeginning with (Goodfellow et al., 2014), a number of papers (Papernot et al., 2016; Kurakin et al.,\n2016; Moosavi-Dezfooli et al., 2016) have examined the presence of adversarial examples that can\n\u201cfool\u201d deep networks. These papers show that there exist small perturbations to images that can\ncause a deep network to misclassify the resulting perturbed examples. We use the Fast Gradient Sign\nMethod (FGSM) and the iterative version (IFGSM) of perturbation introduced in (Goodfellow et al.,\n2014; Kurakin et al., 2016) to generate adversarial examples4. Details of FGSM and IFGSM are\ngiven in the supplementary.\nFor each method, we generate a set of perturbed adversarial examples using one network, and then\nmeasure the accuracy of the same (white-box) or another (black-box) network on these examples.\nFig. 3 (left, middle) shows the performance of the 8 models for IFGSM attacks (which are stronger\nthan FGSM). FGSM performance is given in the supplementary. We plot test accuracy against\ndifferent values of \u03b5 used to generate the adversarial examples. In both FGSM and IFGSM scenarios,\nall margin models significantly outperform cross-entropy, with the all-layer margin models outperforming\nthe input-only margin models, showing the benefit of margin at hidden layers. This is not surprising as the\nadversarial attacks are specifically defined in input space. Furthermore, since FGSM/IFGSM are\ndefined in the l\u221e norm, we see that the l\u221e margin model performs the best among the three norms.\nIn the supplementary, we also show the white-box performance of the method from (Madry et al.,\n2017) 5, which is an algorithm specifically designed for adversarial defenses against FGSM attacks.\nOne of the margin models outperforms this method, and another is very competitive. 
For black box,\nthe attacker is a cross-entropy model. It is seen that the margin models are robust against black-box\nattacks, significantly outperforming cross-entropy. For example at \u03b5 = 0.1, cross-entropy is at 67%\naccuracy, while the best margin model is at 90%.\n\n4There are also other methods of generating adversarial perturbations, not considered here.\n5We used \u03b5 values provided by the authors.\n\nKurakin et al. (2016) suggested adversarial training as a defense against adversarial attacks. This\napproach augments training data with adversarial examples. However, they showed that adding\nFGSM examples in this manner often does not confer robustness to IFGSM attacks, and is also\ncomputationally costly. Our margin models provide a mechanism for robustness that is independent\nof the type of attack. Further, our method is complementary and can still be used with adversarial\ntraining. To demonstrate this, Fig. 3 (right) shows the improved performance of the l\u221e model\ncompared to the cross-entropy model for black-box attacks from a cross-entropy model, when the\nmodels are adversarially trained. While the gap between cross-entropy and margin models is reduced\nin this scenario, we continue to see greater performance from the margin model at higher values\nof \u03b5. Importantly, we saw no benefit for the generalization or noisy label tasks from adversarial\ntraining - thus showing that this type of data augmentation provides very specific robustness. 
In the\nsupplementary, we also show the performance against input corrupted with varying levels of Gaussian\nnoise, showing the benefit of margin for this type of perturbation as well.\n\nFigure 3: Robustness of MNIST models to adversarial attacks: (Left) White-box IFGSM; (Middle)\nBlack-box IFGSM; (Right) Black-box performance of adversarially trained models.\n\n4.3 CIFAR-10\n\nNext, we test our models for the same tasks on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009).\nWe use the ResNet model proposed in Zagoruyko & Komodakis (2016), consisting of an input\nconvolutional layer, 3 blocks of residual convolutional layers with each block containing 9 layers,\nfor a total of 58 convolutional layers. Similar to MNIST, we set aside 10% of the training data for\nvalidation, leaving a total of 45,000 training samples, and 5,000 validation samples. We train margin\nmodels with multiple layers of margin, choosing 5 evenly spaced layers (input layer, output layer and\n3 other convolutional layers in the middle) across the network. We perform a hyper-parameter search\nacross margin values. We also train with data augmentation (random image transformations such\nas cropping and contrast/hue changes). Hyperparameter details are provided in the supplementary\nmaterial. With these settings, we achieve a baseline accuracy of around 90% for the following 5\nmodels: cross-entropy, hinge, margin l\u221e, l1 and l2 6.\n\n4.3.1 Noisy Labels\n\nFig. 4 (left) shows the performance of the 5 models under the same noisy label regime, with fractions\nof noise ranging from 0% to 80%. The margin l\u221e and l2 models consistently outperform\ncross-entropy by 4% to 10% across the range of noise levels.\n\n4.3.2 Generalization\nFig. 4(right) shows the performance of the 5 CIFAR-10 models on the generalization task. We\nconsistently see superior performance of the l1 and l\u221e margin models w.r.t. cross-entropy, especially\nas the amount of data is reduced. 
For example at 5% and 1% of the total data, the l1 margin model\noutperforms the cross-entropy model by 2.5%.\n\n6With a better CIFAR network WRN-40-10 from Zagoruyko & Komodakis (2016), we were able to achieve\n95% accuracy on full data.\n\nFigure 4: Performance of CIFAR-10 models on noisy data (left) and limited data (right).\n\nFigure 5: Performance of CIFAR-10 models on IFGSM adversarial examples.\n\n4.3.3 Adversarial Perturbations\n\nFig. 5 shows the performance of cross-entropy and margin models for IFGSM attacks, for both\nwhite-box and black box scenarios. The l1 and l\u221e margin models perform well for both sets of\nattacks, giving a clear boost over cross-entropy. For \u03b5 = 0.1, the l1 model achieves an improvement\nover cross-entropy of about 14% when defending against a cross-entropy attack. Another approach\nfor robustness is in (Cisse et al., 2017), where the Lipschitz constant of network layers is kept small,\nthereby directly insulating the network from small perturbations. Our models trained with margin\nsignificantly outperform their reported results in Table 1 for CIFAR-10. For an SNR of 33 (as\ncomputed in their paper), we achieve 82% accuracy compared to 69.1% by them (for non-adversarial\ntraining), an 18.7% relative improvement.\n\n4.4 Imagenet\n\nWe tested our l1 margin model against cross-entropy for a full-scale Imagenet model based on\nthe Inception architecture (Szegedy et al., 2016), with data augmentation. Our margin model and\nthe cross-entropy model both achieved a top-1 validation precision of 78%, close to the 78.8% reported\nin (Szegedy et al., 2016). We test the Imagenet models for white-box FGSM and IFGSM attacks, as\nwell as for black-box attacks defending against cross-entropy model attacks. Results are shown in\nFig. 6. We see that the margin model consistently outperforms cross-entropy for black and white box\nFGSM and IFGSM attacks. 
For example, at \u03b5 = 0.1, cross-entropy achieves a white-box\nFGSM accuracy of 33%, whereas the margin model achieves 44% white-box accuracy and 59% black-box\naccuracy. Note that our FGSM accuracy numbers on the cross-entropy model are quite close to those\nreported by Kurakin et al. (2016) (Table 2, top row); also note that we use a wider range of \u03b5 in our\nexperiments.\n\nFigure 6: ImageNet white-box/black-box performance on adversarial examples.\n\n5 Discussion\n\nWe have presented a new loss function inspired by the theory of large margin that is amenable to\ndeep network training. This new loss is flexible: it can establish a large margin defined\non input, hidden, or output layers, using l\u221e, l1, or l2 distance definitions. Models trained with\nthis loss perform well in a number of practical scenarios compared to baselines on standard datasets.\nThe formulation is independent of network architecture and input domain and is complementary to\nother regularization techniques such as weight decay and dropout. Our method is computationally\npractical: for ImageNet, our training was about 1.6 times more expensive than cross-entropy (per\nstep). Finally, our empirical results show the benefit of margin at the hidden layers of a network.\n\nReferences\n\nAbadi, Mart\u00edn, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin,\nMatthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, et al. TensorFlow: A system for\nlarge-scale machine learning. In OSDI, volume 16, pp. 265\u2013283, 2016.\n\nAthalye, Anish and Sutskever, Ilya. Synthesizing robust adversarial examples. arXiv preprint\narXiv:1707.07397, 2017.\n\nBousquet, Olivier and Elisseeff, Andr\u00e9. Stability and generalization. Journal of machine learning\nresearch, 2(Mar):499\u2013526, 2002.\n\nBoyd, Stephen and Vandenberghe, Lieven. Convex optimization. 
2004.\n\nCisse, Moustapha, Bojanowski, Piotr, Grave, Edouard, Dauphin, Yann, and Usunier, Nicolas. Parseval\nnetworks: Improving robustness to adversarial examples. In International Conference on Machine\nLearning, pp. 854\u2013863, 2017.\n\nCortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine learning, 20(3):273\u2013297,\n1995.\n\nDrucker, Harris, Burges, Chris J. C., Kaufman, Linda, Smola, Alex, and Vapnik, Vladimir. Support\nvector regression machines. In NIPS, pp. 155\u2013161. MIT Press, 1997.\n\nGal, Yarin, Islam, Riashat, and Ghahramani, Zoubin. Deep bayesian active learning with image data.\narXiv preprint arXiv:1703.02910, 2017.\n\nGoodfellow, Ian J, Shlens, Jonathon, and Szegedy, Christian. Explaining and harnessing adversarial\nexamples. arXiv preprint arXiv:1412.6572, 2014.\n\nGuo, Chuan, Rana, Mayank, Ciss\u00e9, Moustapha, and van der Maaten, Laurens. Countering adversarial\nimages using input transformations. arXiv preprint arXiv:1711.00117, 2017.\n\nHein, Matthias and Andriushchenko, Maksym. Formal guarantees on the robustness of a classifier\nagainst adversarial manipulation. In Advances in Neural Information Processing Systems, pp.\n2266\u20132276, 2017.\n\nHinton, Geoffrey, Srivastava, Nitish, and Swersky, Kevin. Neural networks for machine learning-\nlecture 6a-overview of mini-batch gradient descent.\n\nHosseini, Hossein, Xiao, Baicen, and Poovendran, Radha. Google\u2019s cloud vision api is not robust to\nnoise. arXiv preprint arXiv:1704.05051, 2017.\n\nHuang, Ruitong, Xu, Bing, Schuurmans, Dale, and Szepesv\u00e1ri, Csaba. Learning with a strong\nadversary. arXiv preprint arXiv:1511.03034, 2015.\n\nKrizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.\n\nKurakin, A., Goodfellow, I., and Bengio, S. Adversarial Machine Learning at Scale. ArXiv e-prints,\nNovember 2016.\n\nLeCun, Yann, Bottou, L\u00e9on, Bengio, Yoshua, and Haffner, Patrick. 
Gradient-based learning applied\nto document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\nLiang, Xuezhi, Wang, Xiaobo, Lei, Zhen, Liao, Shengcai, and Li, Stan Z. Soft-margin softmax for\ndeep classification. In International Conference on Neural Information Processing, pp. 413\u2013421.\nSpringer, 2017.\n\nLiu, Weiyang, Wen, Yandong, Yu, Zhiding, and Yang, Meng. Large-margin softmax loss for\nconvolutional neural networks. In ICML, pp. 507\u2013516, 2016.\n\nMadry, Aleksander, Makelov, Aleksandar, Schmidt, Ludwig, Tsipras, Dimitris, and Vladu, Adrian.\nTowards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083,\n2017.\n\nMatyasko, Alexander and Chau, Lap-Pui. Margin maximization for robust classification using deep\nlearning. In Neural Networks (IJCNN), 2017 International Joint Conference on, pp. 300\u2013307.\nIEEE, 2017.\n\nMoosavi-Dezfooli, Seyed-Mohsen, Fawzi, Alhussein, Fawzi, Omar, and Frossard, Pascal. Universal\nadversarial perturbations. arXiv preprint arXiv:1610.08401, 2016.\n\nPapernot, Nicolas, McDaniel, Patrick, Goodfellow, Ian, Jha, Somesh, Celik, Z Berkay, and Swami,\nAnanthram. Practical black-box attacks against deep learning systems using adversarial examples.\narXiv preprint arXiv:1602.02697, 2016.\n\nPapernot, Nicolas, McDaniel, Patrick, Goodfellow, Ian, Jha, Somesh, Celik, Z Berkay, and Swami,\nAnanthram. Practical black-box attacks against machine learning. In Proceedings of the 2017\nACM on Asia Conference on Computer and Communications Security, pp. 506\u2013519. ACM, 2017.\n\nRasmus, Antti, Berglund, Mathias, Honkala, Mikko, Valpola, Harri, and Raiko, Tapani. Semi-supervised\nlearning with ladder networks. In Advances in Neural Information Processing Systems,\npp. 3546\u20133554, 2015.\n\nReed, Scott, Lee, Honglak, Anguelov, Dragomir, Szegedy, Christian, Erhan, Dumitru, and Rabinovich,\nAndrew. Training deep neural networks on noisy labels with bootstrapping. 
arXiv preprint\narXiv:1412.6596, 2014.\n\nSharif, Mahmood, Bhagavatula, Sruti, Bauer, Lujo, and Reiter, Michael K. Accessorize to a crime:\nReal and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM\nSIGSAC Conference on Computer and Communications Security, pp. 1528\u20131540. ACM, 2016.\n\nSokolic, Jure, Giryes, Raja, Sapiro, Guillermo, and Rodrigues, Miguel R. D. Robust large margin deep\nneural networks. CoRR, abs/1605.08254, 2016. URL http://arxiv.org/abs/1605.08254.\n\nSoudry, Daniel, Hoffer, Elad, and Srebro, Nathan. The implicit bias of gradient descent on separable\ndata. arXiv preprint arXiv:1710.10345, 2017.\n\nSukhbaatar, Sainbayar, Bruna, Joan, Paluri, Manohar, Bourdev, Lubomir, and Fergus, Rob. Training\nconvolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.\n\nSun, Shizhao, Chen, Wei, Wang, Liwei, and Liu, Tie-Yan. Large margin deep neural networks: Theory\nand algorithms. CoRR, abs/1506.05232, 2015. URL http://arxiv.org/abs/1506.05232.\n\nSzegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow,\nIan J., and Fergus, Rob. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.\nURL http://arxiv.org/abs/1312.6199.\n\nSzegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, and Wojna, Zbigniew. Rethinking\nthe inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition, pp. 2818\u20132826, 2016.\n\nTieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running\naverage of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26\u201331,\n2012.\n\nVapnik, Vladimir N. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc.,\nNew York, NY, USA, 1995. ISBN 0-387-94559-8.\n\nVinyals, Oriol, Blundell, Charles, Lillicrap, Tim, Wierstra, Daan, et al. 
Matching networks for one\nshot learning. In Advances in Neural Information Processing Systems, pp. 3630\u20133638, 2016.\n\nZagoruyko, Sergey and Komodakis, Nikos. Wide residual networks. arXiv preprint arXiv:1605.07146,\n2016.\n", "award": [], "sourceid": 457, "authors": [{"given_name": "Gamaleldin", "family_name": "Elsayed", "institution": "Google Brain"}, {"given_name": "Dilip", "family_name": "Krishnan", "institution": "Google"}, {"given_name": "Hossein", "family_name": "Mobahi", "institution": "Google Research"}, {"given_name": "Kevin", "family_name": "Regan", "institution": "Google"}, {"given_name": "Samy", "family_name": "Bengio", "institution": "Google Brain"}]}