{"title": "Hessian-based Analysis of Large Batch Training and Robustness to Adversaries", "book": "Advances in Neural Information Processing Systems", "page_first": 4949, "page_last": 4959, "abstract": "Large batch size training of Neural Networks has been shown to incur accuracy\nloss when trained with the current methods.  The exact underlying reasons for\nthis are still not completely understood.  Here, we study large batch size\ntraining through the lens of the Hessian operator and robust optimization. In\nparticular, we perform a Hessian based study to analyze exactly how the landscape of the loss function changes when training with large batch size. We compute the true Hessian spectrum, without approximation, by back-propagating the second\nderivative. Extensive experiments on multiple networks show that saddle-points are\nnot the cause for generalization gap of large batch size training, and the results\nconsistently show that large batch converges to points with noticeably higher Hessian spectrum. Furthermore, we show that robust training allows one to favor flat areas, as points with large Hessian spectrum show poor robustness to adversarial perturbation. We further study this relationship, and provide empirical and theoretical proof that the inner loop for robust training is a saddle-free optimization problem \\textit{almost everywhere}. We present detailed experiments with five different network architectures, including a residual network, tested on MNIST, CIFAR-10/100 datasets.", "full_text": "Hessian-based Analysis of Large Batch Training and\n\nRobustness to Adversaries\n\nZhewei Yao1\u21e4 Amir Gholami1\u21e4 Qi Lei2 Kurt Keutzer1 Michael W. Mahoney1\n\n1 University of California at Berkeley, {zheweiy, amirgh, keutzer and mahoneymw}@berkeley.edu\n\n2 University of Texas at Austin, leiqi@ices.utexas.edu\n\nAbstract\n\nLarge batch size training of Neural Networks has been shown to incur accuracy loss\nwhen trained with the current methods. 
The exact underlying reasons for this are\nstill not completely understood. Here, we study large batch size training through\nthe lens of the Hessian operator and robust optimization. In particular, we perform\na Hessian based study to analyze exactly how the landscape of the loss function\nchanges when training with large batch size. We compute the true Hessian spectrum,\nwithout approximation, by back-propagating the second derivative. Extensive\nexperiments on multiple networks show that saddle-points are not the cause for\ngeneralization gap of large batch size training, and the results consistently show\nthat large batch converges to points with noticeably higher Hessian spectrum.\nFurthermore, we show that robust training allows one to favors \ufb02at areas, as points\nwith large Hessian spectrum show poor robustness to adversarial perturbation. We\nfurther study this relationship, and provide empirical and theoretical proof that the\ninner loop for robust training is a saddle-free optimization problem. We present\ndetailed experiments with \ufb01ve different network architectures, including a residual\nnetwork, tested on MNIST, CIFAR-10, and CIFAR-100 datasets.\n\n1\n\nIntroduction\n\nDuring the training of a Neural Network (NN), we are given a set of input data x with the correspond-\ning labels y drawn from an unknown distribution P. In practice, we only observe a set of discrete\nexamples drawn from P, and train the NN to learn this unknown distribution. This is typically a\nnon-convex optimization problem, in which the choice of hyper-parameters would highly affect the\nconvergence properties. In particular, it has been observed that using large batch size for training\noften results in convergence to points with poor convergence properties. The main motivation for\nusing large batch is the increased opportunities for data parallelism which can be used to reduce\ntraining time [13]. 
Recently, there have been several works that have proposed different methods to\navoid the performance loss with large batch [16, 28, 31]. However, these methods do not work for\nall networks and datasets. This has motivated us to revisit the original problem and study how the\noptimization with large batch size affects the convergence behavior.\nWe \ufb01rst start by analyzing how the Hessian spectrum and gradient change during training for small\nbatch and compare it to large batch size and then draw connection with robust training. In particular,\nwe aim to answer the following questions:\nQ1 How is the training for large batch size different than small batch size? Equivalently, what is\nthe difference between the local geometry of the neighborhood that the model converges when large\nbatch size is used as compared to small batch?\nA1 We backpropagate the second-derivative and compute its spectrum during training. The results\nshow that despite the arguments regarding prevalence of saddle-points plaguing optimization [6, 12],\nthat is actually not the problem with large batch size training, even when batch size is increased to\nthe gradient descent limit. In [19], an approximate numerical method was used to approximate the\n\n\u21e4Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Top 20 eigenvalues of the Hessian is shown for C1 on CIFAR-10 (left) and M1 on MNIST\n(right) datasets. The spectrum is computed using power iteration with relative error of 1E-4.\n\nmaximum at a point. Here, by directly computing the spectrum of the true Hessian, we show that\nlarge batch size progressively gets trapped in areas with noticeably larger spectrum (and not just the\ndominant eigenvalue). For details please see \u00a72, especially Figs. 1, 2 and 4.\nQ2 What is the connection between robust optimization and large batch size training? 
Equivalently,\nhow does the batch size affect the robustness of the model to adversarial perturbation?\nA2 We show that robust optimization is antithetical to large batch training, in the sense that it favors\nareas with small spectrum (aka \ufb02at minimas). We show that points converged with large batch size are\nsigni\ufb01cantly more prone to adversarial attacks as compared to a model trained with small batch size.\nFurthermore, we show that robust training progressively favors the opposite, leading to points with\n\ufb02at spectrum and robust to adversarial perturbation. We provide empirical and theoretical proof that\nthe inner loop of the robust optimization, where we \ufb01nd the worst case, is a saddle-free optimization\nproblem. Details are discussed in \u00a73, especially Table 1, 7 and Figs. 4, 6.\nLimitations: We believe it is critical for every paper to clearly state limitations. In this work, we\nhave made an effort to avoid reporting just the best results, and repeated all the experiments at least\nthree times and found all the \ufb01ndings to be consistent. Furthermore, we performed the tests on\nmultiple datasets and multiple models, including a residual network, to avoid getting results that\nmay be speci\ufb01c to a particular test. The main limitation is that we do not propose a solution for\nlarge batch training. Even though we show a very promising connection between large batch and\nrobust training, but we emphasize that this is an analysis paper to understand the original problem.\nThere has been several solutions proposed so far, but they only work for particular cases and require\nextensive hyper-parameter tuning. We are performing an in-depth follow up study to use the results\nof this paper to guide large batch size training.\nRelated Work. Deep neural networks have achieved good performance for a wide range of applica-\ntions. 
The diversity of the different problems that a DNN can be used for, has been related to their\nef\ufb01ciency in function approximation [25, 7, 21, 1]. However the work of [32] showed that not only\nthe network can perform well on real dataset, but it can also memorize randomly labeled data very\nwell. Moreover, the performance of the network is highly dependent on the hyper-parameters used\nfor training. In particular, recent studies have shown that Neural Networks can easily be fooled by\nimperceptible perturbation to input data [15]. Moreover, multiple studies have found that large batch\nsize training suffers from poor generalization capability [16, 31].\nHere we focus on the latter two aspects of training neural networks. [19] presented results showing\nthat large batches converge to a \u201csharper minima\u201d. It was argued that even if the sharp minima\nhas the same training loss as the \ufb02at one, but small discrepancies between the test data and the\ntraining data can easily lead to poor generalization performance [19, 9]. The fact that \u201c\ufb02at minimas\u201d\ngeneralize well goes back to the earlier work of [18]. The authors relate \ufb02at minima to the theory\nof minimum description length [26], and proposed an optimization method to actually favor \ufb02at\nminimas. There have been several similar attempts to change the optimization algorithm to \ufb01nd \u201cbetter\u201d\nregions [8, 5]. For instance, [5] proposed entropy-SGD, which uses Langevin dynamics to augment\nthe loss functional to favor \ufb02at regions of the \u201cenergy landscape\u201d. The notion of \ufb02at/sharpness does\nnot have a precise de\ufb01nition. A detailed comparison of different metrics is discussed in [9], where the\nauthors show that sharp minimas can also generalize well. The authors also argued that the sharpness\ncan be arbitrarily changed by reparametrization of the weights. 
However, this won\u2019t happen when\n\n2\n\n\fFigure 2: The landscape of the loss is shown along the dominant eigenvector, v1, of the Hessian for\nC1 on CIFAR-10 dataset. Here \u270f is a scalar that perturbs the model parameters along v1.\n\nconsidering the same model and just changing the training hyper-parameters which is the case here.\nIn [28, 29], the authors proposed that the training can be viewed as a stochastic differential equation,\nand argued that the optimum batch size is proportional to the training size and the learning rate.\nAs our results show, there is an interleaved connection by studying when NNs do not work well.\n[30, 15] found that they can easily fool a NN with very good generalization by slightly perturbing\nthe inputs. The perturbation magnitude is most of the time imperceptible to human eye, but can\ncompletely change the networks prediction. They introduced an effective adversarial attack algorithm\nknown as Fast Gradient Sign Method (FGSM). They related the vulnerability of the Neural Network\nto linear classi\ufb01ers and showed that RBF models, despite achieving much smaller generalization\nperformance, are considerably more robust to FGSM attacks. The FGSM method was then extended\nin [20] to an iterative FGSM, which performs multiple gradient ascend steps to compute the adversarial\nperturbation. Adversarial attack based on iterative FGSM was found to be stronger than the original\none step FGSM. Various defenses have been proposed to resist adversarial attacks [24, 14, 17, 2, 11].\nWe will later show that there is an interleaved connection between robustness of the model and the\nlarge batch size problem.\nThe structure of this paper is as follows: We present the results by \ufb01rst analyzing how the spectrum\nchanges during training, and test the generalization performances of the model for different batch\nsizes in \u00a72. In section \u00a73, we discuss details of how adversarial attack/training is performed. 
In\nparticular, we provide theoretical proof that \ufb01nding adversarial perturbation is a saddle-free problem\nunder certain conditions, and test the robustness of the model for different batch sizes. Also, we\npresent results showing how robust training affects the spectrum with empirical studies. Finally, in\nsection \u00a74 we provide concluding remarks.\n\n2 Large Batch, Generalization Gap and Hessian Spectrum\nSetup: The architecture for the networks used is reported in Table 6. In the text, we refer to each\narchitecture by the abbreviation used in this table. Unless otherwise speci\ufb01ed, each of the batch sizes\nare trained until a training loss of 0.001 or better is achieved. Different batches are trained under the\nsame conditions, and no weight decay or dropout is used.\nWe \ufb01rst focus on large batch size training versus small batch and report the results for large batch\ntraining for C1 network on CIFAR-10 dataset, and M1 network on MNIST are shown in Table 1, and\nTable 7, respectively. As one can see, after a certain point increasing batch size results in performance\ndegradation on the test dataset. This is in line with results in the literature [19, 16].\nAs discussed before, one popular argument about large batch size\u2019s poor generalization accuracy\nhas been that large batches tend to get attracted to \u201csharp\u201d minimas of the training loss. In [19] an\napproximate metric was used to measure curvature of the loss function for a given model parameter.\nHere, we directly compute the Hessian spectrum. Note that computing the whole Hessian matrix is\ninfeasible as it is a O(N 2) matrix. However, the spectrum can be computed using power iteration by\nback-propagating the matvec of the Hessian [23]. 
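A minimal sketch of this idea in plain Python: power iteration needs only Hessian-vector products, never the full Hessian. Everything below is an assumption made for illustration: the toy quadratic loss, the names grad, hessian_vector_product and dominant_eigenvalue, and the use of a central finite difference of the gradient in place of the exact back-propagated matvec described in the text.

```python
import math

def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * (3*w0^2 + w1^2) + w0*w1,
    # whose Hessian is [[3, 1], [1, 1]] (hypothetical example, not a network).
    return [3.0 * w[0] + w[1], w[0] + w[1]]

def hessian_vector_product(grad_fn, w, v, h=1e-4):
    # Central finite difference of the gradient along v.
    # Assumption: this stands in for the exact back-propagated Hessian matvec.
    w_plus = [wi + h * vi for wi, vi in zip(w, v)]
    w_minus = [wi - h * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_fn(w_plus), grad_fn(w_minus)
    return [(a - b) / (2.0 * h) for a, b in zip(g_plus, g_minus)]

def dominant_eigenvalue(grad_fn, w, tol=1e-4, max_iter=1000):
    # Power iteration that only uses Hessian-vector products.
    v = [1.0 / math.sqrt(len(w))] * len(w)  # unit-norm starting vector
    eig = 0.0
    for _ in range(max_iter):
        hv = hessian_vector_product(grad_fn, w, v)
        new_eig = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(a * a for a in hv))
        if norm == 0.0:
            return 0.0
        v = [a / norm for a in hv]
        # Stop once the relative change of the estimate falls below tol.
        if abs(new_eig - eig) <= tol * abs(new_eig):
            return new_eig
        eig = new_eig
    return eig

top = dominant_eigenvalue(grad, [0.5, -0.2])
```

On this toy problem the Hessian is [[3, 1], [1, 1]], so top should be close to 2 + sqrt(2). For a real network, grad_fn would be the gradient of the training loss w.r.t. the weights, and each matvec costs roughly one extra backward pass.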
Unless otherwise noted, we continue the power\niterations until a relative error of 1E-4 is reached for each individual eigenvalue.\nWith this approach, we have computed the top 20 eigenvalues of the Hessian for different batch\nsizes as shown in Fig. 1. Moreover, the value of the dominant eigenvalue, denoted by \u03bb_1^\u03b8, is reported\nin Table 1 and Table 2 (an additional result for MNIST tested using LeNet-5 is given in the\nappendix; please see Table 7). From Fig. 1, we can clearly see that for all the experiments, large\nbatches have a noticeably larger Hessian spectrum, both in the dominant eigenvalue and in the\nremaining 19 eigenvalues. However, note that curvature is a very local measure. It would be more\ninformative to study how the loss functional behaves in a neighborhood around the point that the\nmodel has converged to. To visually demonstrate this, we have plotted how the total loss changes when\nthe model parameters are perturbed along the dominant eigenvector as shown in Fig. 2 and Fig. 7\nfor the C1 and M1 models, respectively. We can clearly see that the large batch size models have been\nattracted to areas with higher curvature for both the test and training losses.\n\nFigure 3: The landscape of the loss is shown when the C1 model parameters are changed along the\nfirst two dominant eigenvectors of the Hessian with the perturbation magnitude \u03b51 and \u03b52.\n\nThis is reflected in the visual figures. We have also added a 3D plot, where we perturb the parameters\nof the C1 model along both the first and second eigenvectors as shown in Fig. 3. The visual results are in\nline with the numbers shown for the Hessian spectrum (see \u03bb_1^\u03b8) in Table 1 and Table 7. For instance,\nnote the value of \u03bb_1^\u03b8 for the training and test loss for B = 512, 2048 in Table 1 and compare the\ncorresponding results in Fig. 
3.\nA recent argument has been that saddle-points in high dimension plague optimization for neural\nnetworks [6, 12]. We have computed the dominant eigenvalue of the Hessian along with the total\ngradient during training and report it in Fig. 4. As we can see, large batch size progressively gets\nattracted to areas with larger spectrum, but it clearly does not get stuck in saddle points since the\ngradient is still large.\n\nTable 1: Results on the CIFAR-10 dataset using the C1 and C2 networks. We show the Hessian spectrum of\nmodels trained with different batch sizes, and the corresponding performance on adversarial datasets generated\nfrom the training/testing datasets (testing results are given in parentheses). Here \u03bb_1^\u03b8 and \u03bb_1^x denote the\ndominant Hessian eigenvalue w.r.t. the weights and the input, respectively.\n\nModel | Batch | Acc. | \u03bb_1^\u03b8 | \u03bb_1^x | ||\u2207_x J|| | Acc \u03b5 = 0.02 | Acc \u03b5 = 0.01\nC1 (CIFAR-10) | 16 | 100 (77.68) | 0.64 (32.78) | 2.69 (200.7) | 0.05 (20.41) | 48.07 (30.38) | 72.67 (42.70)\nC1 (CIFAR-10) | 32 | 100 (76.77) | 0.97 (45.28) | 3.43 (234.5) | 0.05 (23.55) | 49.04 (31.23) | 72.63 (43.30)\nC1 (CIFAR-10) | 64 | 100 (77.32) | 0.77 (48.06) | 3.14 (195.0) | 0.04 (21.47) | 50.40 (32.59) | 73.85 (44.76)\nC1 (CIFAR-10) | 128 | 100 (78.84) | 1.33 (137.5) | 1.41 (128.1) | 0.02 (13.98) | 33.15 (25.2) | 57.69 (39.09)\nC1 (CIFAR-10) | 256 | 100 (78.54) | 3.34 (338.3) | 1.51 (132.4) | 0.02 (14.08) | 25.33 (19.99) | 50.10 (34.94)\nC1 (CIFAR-10) | 512 | 100 (79.25) | 16.88 (885.6) | 1.97 (100.0) | 0.04 (10.42) | 14.17 (12.94) | 28.54 (25.08)\nC1 (CIFAR-10) | 1024 | 100 (78.50) | 51.67 (2372) | 3.11 (146.9) | 0.05 (13.33) | 8.80 (8.40) | 23.99 (21.57)\nC1 (CIFAR-10) | 2048 | 100 (77.31) | 80.18 (3769) | 5.18 (240.2) | 0.06 (18.08) | 4.14 (3.77) | 17.42 (16.31)\nC2 (CIFAR-10) | 256 | 100 (79.20) | 0.62 (28) | 12.10 (704.0) | 0.10 (41.95) | 0.57 (0.38) | 0.73 (0.47)\nC2 (CIFAR-10) | 512 | 100 (80.44) | 0.75 (57) | 4.82 (425.2) | 0.03 (26.14) | 0.34 (0.25) | 0.54 (0.38)\nC2 (CIFAR-10) | 1024 | 100 (79.61) | 2.36 (142) | 0.523 (229.9) | 0.04 (17.16) | 0.27 (0.22) | 0.46 (0.35)\nC2 (CIFAR-10) | 2048 | 100 (78.99) | 4.30 (307) | 0.145 (260.0) | 0.50 (17.94) | 0.18 (0.16) | 0.33 (0.28)\n\n3 Large Batch, Adversarial Attack and Robust training\nWe first give a 
brief overview of adversarial attack and robust training and then present results\nconnecting these with large batch size training.\n3.1 Robust Optimization and Adversarial Attack\nThe methods for adversarial attack on a neural network can be broadly split into white-box attacks,\nwhere the model architecture and its parameters are known, and black-box attacks, where such\ninformation is unavailable. Here we focus on the white-box methods, and in particular the optimization-based\napproach both for the attack and defense.\nSuppose M(\u03b8) is a learning model (the neural network architecture), and (x, y) are the input data and\nthe corresponding labels. The loss functional of the network with parameter \u03b8 on (x, y) is denoted by\nJ(\u03b8, x, y). For adversarial attack, we seek a perturbation \u0394x (with a bounded L\u221e or L2 norm) such\nthat it maximizes J(\u03b8, x + \u0394x, y):\n\n\\max_{\\Delta x \\in U} J(\\theta, x + \\Delta x, y),    (1)\n\nwhere U is an admissibility set for acceptable perturbation (typically restricting the magnitude of the\nperturbation). A typical choice for this set is U = B(x, \u03b5), a ball of radius \u03b5 centered at x. A popular\nmethod for approximately computing \u0394x is the Fast Gradient Sign Method [15], where the gradient of\nthe loss functional is computed w.r.t. inputs, and the perturbation is set to:\n\n\\Delta x = \\epsilon \\, \\mathrm{sign}\\left( \\frac{\\partial J(x, \\theta)}{\\partial x} \\right)    (2)\n\nThis is not the only attack method possible. Other approaches include an iterative FGSM method\n(FGSM-10) [20] or using other norms such as the L2 norm instead of L\u221e (we denote the L2 method\nby L2Grad in our results). Here we also use a second-order attack, where we use the Hessian w.r.t.\ninput to precondition the gradient direction with second order information; please see Table 5 in\nAppendix for details. 
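As an illustration of Eq. (2), the one-step attack can be sketched on a small logistic-regression model, where the input gradient has a closed form. The model, the weights, the input and all function names below are hypothetical and chosen only for this example; they are not the networks used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # Binary cross-entropy of a logistic-regression model on one sample.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def input_gradient(w, b, x, y):
    # For this model the gradient w.r.t. the input is dJ/dx = (p - y) * w.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign(a):
    # Returns -1, 0 or 1.
    return (a > 0) - (a < 0)

def fgsm(w, b, x, y, eps):
    # One-step perturbation delta_x = eps * sign(dJ/dx), as in Eq. (2).
    g = input_gradient(w, b, x, y)
    return [eps * sign(gi) for gi in g]

w, b = [2.0, -1.0, 0.5], 0.1   # hypothetical trained weights and bias
x, y = [0.3, 0.8, -0.4], 1     # hypothetical input and label
delta = fgsm(w, b, x, y, eps=0.02)
x_adv = [xi + di for xi, di in zip(x, delta)]
clean_loss, adv_loss = loss(w, b, x, y), loss(w, b, x_adv, y)
```

Every component of delta has magnitude eps, so the perturbation sits on the boundary of the max-norm ball, and adv_loss is larger than clean_loss for this sample. An iterative variant would repeat the step several times with a smaller step size and clip the accumulated perturbation back into the ball.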
One method to defend against such adversarial attacks is to perform robust\ntraining [30, 22]:\n\n\\min_{\\theta} \\max_{\\Delta x \\in U} J(\\theta, x + \\Delta x, y).    (3)\n\nSolving this min-max optimization problem at each iteration requires first finding the worst adversarial\nperturbation that maximizes the loss, and then updating the model parameters \u03b8 for those cases. Since\nadversarial examples have to be generated at every iteration, it would not be feasible to find the exact\nperturbation that maximizes the objective function. Instead, a popular method is to perform a single\nor multiple gradient ascent steps to approximately compute \u0394x. After computing \u0394x at each iteration, a\ntypical optimization step (variant of SGD) is performed to update \u03b8.\nNext we show that solving the maximization part is actually a saddle-free problem almost everywhere.\nThis property means that the Hessian w.r.t. the input does not have negative eigenvalues, which allows us to use\nCG within the Newton solver for our second-order adversarial perturbation tests in \u00a73.4 (footnote 2).\n3.2 Adversarial perturbation: A saddle-free problem\nRecall that our loss functional is J(\u03b8, x, y). We make the following assumption on the model to\nshow our theoretical result.\nAssumption 1. We assume that all of the model\u2019s activation functions are ReLU, and all\nlayers are either convolutional or fully connected. Batch Normalization layers are also allowed.\nNote that even though the ReLU activation is not differentiable at the origin, i.e. x = 0, the ReLU function is\ntwice differentiable almost everywhere.\n\nThe following theorem shows that the problem of finding an adversarial perturbation that maximizes\nJ is a saddle-free optimization problem, with a Positive-Semi-Definite (PSD) Hessian w.r.t. input\nalmost everywhere. For details on the proof please see Appendix A.1.\nTheorem 1. Under Assumption 1, for a DNN, its loss functional J(\u03b8, x, y) is a saddle-free function\nw.r.t. 
input x almost everywhere, i.e.\n\n\\frac{\\nabla^2 J(\\theta, x, y)}{\\nabla x^2} \\succeq 0.\n\nFrom the proof of Theorem 1, we immediately obtain the following proposition for DNNs:\nProposition 2. Based on Theorem 1 with Assumption 1, if the input x \u2208 R^d and the number of\noutput classes is c, i.e. y \u2208 {1, 2, 3, ..., c}, then the Hessian of DNNs w.r.t. x is almost a rank c\nmatrix almost everywhere; see Appendix A.1 for details.\n\n3.3 Large Batch Training and Robustness\nHere, we test the robustness of the models trained with different batches to an adversarial attack.\nWe use Fast Gradient Sign Method for all the experiments (we did not see any difference with\nFGSM-10 attack). The adversarial performance is measured by the fraction of correctly classified\n\nFootnote 2: This result might also be helpful for finding better optimization strategies for GANs.\n\nTable 2: Results on the CIFAR-100 dataset using the CR network. We show the Hessian spectrum of models\ntrained with different batch sizes, and the corresponding performance on adversarial datasets generated from the\ntraining/testing datasets (testing results are given in parentheses).\n\nBatch | Acc. | \u03bb_1^\u03b8 | Acc \u03b5 = 0.02 | Acc \u03b5 = 0.01\n64 | 99.98 (70.81) | 0.022 (10.43) | 61.54 (34.48) | 78.57 (39.94)\n128 | 99.97 (70.9) | 0.055 (26.50) | 58.15 (33.73) | 77.41 (38.77)\n256 | 99.98 (68.6) | 1.090 (148.29) | 39.96 (28.37) | 66.12 (35.02)\n512 | 99.98 (68.6) | 1.090 (148.29) | 40.48 (28.37) | 66.09 (35.02)\n\nFigure 4: Changes in the dominant eigenvalue of the Hessian w.r.t. weights and the total gradient are\nshown for different epochs during training. Note the increase in \u03bb_1^\u03b8 (blue curve) for large batch vs.\nsmall batch. In particular, note that the values for total gradient along with the Hessian spectrum\nshow that large batch does not get \u201cstuck\u201d in saddle points, but in areas of the optimization landscape\nthat have high curvature. More results are shown in Fig. 16. 
The dotted points show the corresponding\nresults when using robust optimization, which makes the solver stay in areas with smaller spectrum.\n\nadversarial inputs. We report the performance for both the training and test datasets for\n\u03b5 = 0.02 and 0.01 (\u03b5 is the adversarial perturbation magnitude in the L\u221e norm).\nThe performance results for the C1 and C2 models on CIFAR-10 and the CR model on CIFAR-100 are\nreported in the last two columns of Tables 1 and 2 (MNIST results are given in the appendix, Table 7).\nThe interesting observation is that in all cases, large batches are considerably more prone to\nadversarial attacks than small batches. This means that the robustness of the model is affected not only\nby the model design, but also by the hyper-parameters used during optimization, and in particular\nby the properties of the point that the model has converged to.\n\nTable 3: Accuracy of different models across different adversarial samples of MNIST, which are\nobtained by perturbing the original model MORI.\n\nModel | Dclean | DFGSM | DFGSM10 | DL2GRAD | DFHSM | DL2HESS | MEAN of Adv\nMORI | 99.32 | 60.37 | 77.27 | 14.32 | 82.04 | 33.21 | 53.44\nMFGSM | 99.49 | 96.18 | 97.44 | 63.46 | 97.56 | 83.33 | 87.59\nMFGSM10 | 99.5 | 96.52 | 97.63 | 66.15 | 97.66 | 84.64 | 88.52\nML2GRAD | 98.91 | 96.88 | 97.39 | 86.23 | 97.66 | 92.56 | 94.14\nMFHSM | 99.45 | 94.41 | 96.48 | 52.67 | 96.89 | 77.58 | 83.60\nML2HESS | 98.72 | 95.02 | 96.49 | 77.18 | 97.43 | 90.33 | 91.29\n\nFrom this result, there seems to be a strong correlation between the spectrum of the Hessian w.r.t. \u03b8\nand how robust the model is. However, we want to emphasize that in general there is no correlation\nbetween the Hessian w.r.t. weights and the robustness of the model w.r.t. the input. For instance,\nconsider a two variable function J(\u03b8, x) (we treat \u03b8 and x as two single variables), for which the\nHessian spectrum of \u03b8 has no correlation to robustness of J w.r.t. x. 
This can be easily demonstrated\nfor a least squares problem, L = ||\u03b8x \u2212 y||_2^2. It is not hard to see that the Hessians w.r.t. \u03b8 and x are xx^T\nand \u03b8\u03b8^T, respectively (up to a constant factor). Therefore, in general we cannot link the Hessian spectrum w.r.t. weights\nto robustness of the network. However, the numerical results for all the neural networks show that\nmodels that have higher Hessian spectrum w.r.t. \u03b8 are also more prone to adversarial attacks. A\npotential explanation for this would be to look at how the gradient and Hessian w.r.t. input (i.e.\nx) would change for different batch sizes. We have computed the dominant eigenvalue of this\nHessian using power iteration for each individual input sample for both training and testing datasets.\nFurthermore, we have computed the norm of the gradient w.r.t. x for these datasets as well.\n\nTable 4: Accuracy of different models across different samples of CIFAR-10, which are obtained by\nperturbing the original model MORI.\n\nModel | Dclean | DFGSM | DFGSM10 | DL2GRAD | DFHSM | DL2HESS | MEAN of Adv\nMORI | 79.46 | 15.25 | 4.46 | 12.37 | 29.64 | 22.93 | 16.93\nMFGSM | 71.82 | 63.05 | 63.44 | 57.68 | 66.04 | 62.36 | 62.51\nMFGSM10 | 71.14 | 63.32 | 63.88 | 58.25 | 65.95 | 62.70 | 62.82\nML2GRAD | 63.52 | 59.33 | 59.73 | 57.35 | 60.44 | 58.98 | 59.16\nMFHSM | 74.34 | 47.65 | 43.95 | 38.45 | 62.75 | 55.77 | 49.71\nML2HESS | 71.59 | 50.05 | 46.66 | 42.95 | 62.87 | 58.42 | 52.19\n\nThese two metrics are reported as \u03bb_1^x and ||\u2207_x J||; see Table 1 for details. The results of all of our experiments\nshow that these two metrics actually do not correlate with the adversarial accuracy. For instance,\nconsider the C1 model with B = 512. It has both a smaller gradient and a smaller Hessian eigenvalue w.r.t.\nx as compared to B = 32, but it performs noticeably worse under adversarial attack. 
One possible\nreason for this could be that the decision boundaries for large batches are less stable, such that with\nsmall adversarial perturbation the model gets fooled.\n\nFigure 5: 1-D Parametric plot for C3 model on CIFAR-10. We interpolate between parameters of\nMORI and MADV , and compute the cross entropy loss on the y-axis.\n3.4 Adversarial Training and Hessian Spectrum\nIn this part, we study how the Hessian spectrum and the landscape of the loss functional change\nafter adversarial training is performed. Here, we \ufb01x the batch size (and all other optimization\nhyper-parameters) and use \ufb01ve different adversarial training methods as described in \u00a73.1.\nFor the sake of clarity let us denote D to be the test dataset which can be the original clean test dataset\nor one created by using an adversarial method. For instance, we denote DF GSM to be the adversarial\ndataset generated by FGSM, and Dclean to be the original clean test dataset.\nsetup: For the MNIST experiments, we train a standard LeNet on MNIST dataset [3] (using M1\nnetwork). For the original training, we set the learning rate to 0.01 and momentum to 0.9, and\ndecay the learning rate by half after every 5 epochs, for a total of 100 epochs. Then we perform an\nadditional \ufb01ve epochs of adversarial training with a learning rate of 0.01. The perturbation magnitude,\n\u270f, is set to 0.1 for L1 attack and 2.8 for L2 attack. We also present results for C3 model [4] on\n\nFigure 6: Spectrum of the sub-sampled Hessian of the loss functional w.r.t. weights. The results are\ncomputed for different batch sizes, which are randomly chosen, of B = 1, 320, 50000 of C1.\n\n7\n\n\fCIFAR-10, using the same hyper-parameters, except that the training is performed for 100 epochs.\nAfterwards, adversarial training is performed for a subsequent 10 epochs with a learning rate of\n0.01 and momentum of 0.9 (the learning rate is decayed by half after \ufb01ve epochs). 
Furthermore, the\nadversarial perturbation magnitude is set to \u270f = 0.02 for L1 attack and 1.2 for L2 attack[27].\nThe results are shown in Table 3, 4. We can see that after adversarial training the model becomes\nmore robust to these attacks. Note that the accuracy of different adversarial attacks varies, which is\nexpected since the various strengths of different attack method. In addition, all adversarial training\nmethods improve the robustness on adversarial dataset, though they lose some accuracy on Dclean,\nwhich is consistent with the observations in [15]. As an example, consider the second row of Table 3\nwhich shows the results when FGSM is used for robust training. The performance of this model when\ntested against the L2GRAD attack method is 63.46% as opposed to 14.32% of the original model\n(MORI). The rest of the rows show the results for different algorithms.\nThe main question here is how the landscape of the loss functional is changed after these robust\noptimizations are performed? We \ufb01rst show a 1-D parametric interpolation between the original\nmodel parameters \u2713 and that of the robusti\ufb01ed models, as shown in Fig. 5 (see Fig. 11 for all cases)\nand 10. Notice the robust models are at a point that has smaller curvature as compared to the original\nmodel. To exactly quantify this, we compute the spectrum of the Hessian as shown in Fig. 6, and 12.\nBesides the full Hessian spectrum, we also report the spectrum of sub-sampled Hessian. The latter is\ncomputed by randomly selecting a subset of the training dataset. We denote the size of this subset\nas BH to avoid confusion with the training batch size. In particular, we report results for BH = 1\nand BH = 320. There are several important observations here. First, notice that the spectrum of the\nrobust models is noticeably smaller than the original model. This means that the min-max problem\nof Eq. 3 favors areas with lower curvature. 
Second, note that even though the total Hessian shows\nthat we have converged to a point with positive curvature (at least based on the top 20 eigenvalues),\nbut that is not necessarily the case when we look at individual samples (i.e. BH = 1). For a randomly\nselected batch of BH = 1, we see that we have actually converged to a point that has both positive\nand negative curvatures, with a non-zero gradient (meaning it is not a saddle point). To the best of\nour knowledge this is a new \ufb01nding, but one that is expected as SGD optimizes the expected loss\ninstead of individual ones.\nNow going back to Fig. 4, we show how the spectrum changes during training when we use robust\noptimization. We can clearly see that with robust optimization the solver is pushed to areas with\nsmaller spectrum as opposed to when we do not use robust training. This is a very interesting \ufb01nding\nand shows the possibility of using robust training as a systematic means to bias the solver to avoid\nsharp minimas.\n\n4 Conclusion\n\nWe studied Neural Networks through the lens of the Hessian operator. In particular, we studied\nlarge batch size training and its connection with stability of the model in the presence of white-box\nadversarial attacks. By computing Hessian spectrum, we provided several evidences that show that\nlarge batch size training tends to get attracted to areas with higher Hessian spectrum. We reported the\neigenvalues of the Hessian w.r.t. whole dataset, and plotted the landscape of the loss when perturbed\nalong the dominant eigenvector. Visual results were in line with the numerical values for the spectrum.\nOur empirical results show that adversarial attacks/training and large batches are closely related. We\nprovided several empirical results on multiple datasets that show large batch size training is more\nprone to adversarial attacks (more results are provided in the supplementary material). 
This means\nthat not only is the model design important, but the optimization hyper-parameters can also drastically\naffect a network\u2019s robustness. Furthermore, we observed that robust training is antithetical to large\nbatch size training, in the sense that it favors areas with noticeably smaller Hessian spectrum w.r.t. \u03b8.\nThe results show that the robustness of the model does not (at least directly) correlate with the Hessian\nw.r.t. x. We also found that this Hessian is actually a PSD matrix, meaning that the problem of finding\nthe adversarial perturbation is actually a saddle-free problem for cases that satisfy Assumption 1.\nFurthermore, we showed that even though the model may converge to an area with positive curvature\nwhen considering all of the training dataset (i.e. total loss), if we look at individual samples then\nthe Hessian can actually have significant negative eigenvalues. From an optimization viewpoint, this\nis due to the fact that SGD optimizes the expected loss and not the individual per-sample loss.\n\n5 Rebuttal\n\nWe would like to thank all the reviewers and the area chair for taking the time to review our work and\nproviding us with their valuable feedback. Below we discuss the main comments:\n\nTable 5: Results on the SVHN dataset using C1. We use the full training dataset with 530K images. The large batch\nsize behavior is consistent with other datasets (i.e. 
CIFAR-10, CIFAR-100, and MNIST).\n\nBatch | Acc | \u03bb_1^\u03b8 | Acc \u03b5 = 0.05 | Acc \u03b5 = 0.02\n256 | 100 (95.70) | 1.85 (87.71) | 23.59 (16.46) | 50.96 (35.89)\n1024 | 100 (95.41) | 18.23 (184.7) | 16.65 (11.88) | 42.67 (29.27)\n4096 | 100 (95.22) | 58.46 (606.1) | 8.36 (6.33) | 27.04 (17.78)\n16384 | 100 (94.86) | 74.28 (1040) | 6.25 (4.95) | 22.31 (15.43)\n\n* Preliminary results for adversarial regularization\n(A) Following this paper, we have designed a new adaptive algorithm that uses adversarial training\n(robust optimization) in combination with second order information that achieves state-of-the-art\nperformance for large batch training (please see [33]). The main goal of this work has been to perform\ndetailed analysis to better understand the problems with large batch training.\n\n* ReLU has 0 Hessian a.e. and I suggest adding analysis with twice differentiable activation.\n(A) This is an excellent observation regarding ReLU networks. We have performed new experiments\nwith the suggested activation functions (Softplus and ELU) and show the results in Table 6. The reason\nwe chose ReLU activation was that many/most of the new neural networks are incorporating it.\nHowever, our results still hold for twice differentiable activations. That is, larger batches are\nless robust (please see last two columns of Table 6) and get attracted to areas with higher Hessian\nspectrum, also known as sharper points, in the optimization space. We have also plotted the\ndominant eigenvalue of the Hessian versus batch size for different activation functions in Figure 7.\nThis clearly shows the same trend. We will add these results to the final version of the paper.\n\n* Experiments on two small datasets MNIST and CIFAR-10.\n(A) Our results are not limited to these two small datasets. 
We have addressed this fair concern of the\nreviewer by running an experiment on the full SVHN dataset with 530K images, as shown in Table 5.\nWe can see the results are consistent with the other datasets in that larger batches are less robust to\nadversarial perturbation (last two columns), and the Hessian spectrum also increases for larger batches\n(please see the \u03bb_1^\u03b8 column).\n\nModel | Batch | Acc. | \u03bb_1^\u03b8 | Acc \u03b5 = 0.02\nC1S | 128 | 100.00 (78.79) | 4.45 (318.9) | 20.40 (17.79)\nC1S | 256 | 100.00 (78.79) | 5.00 (507.4) | 17.79 (16.19)\nC1S | 512 | 100.00 (78.68) | 16.18 (819.2) | 12.99 (11.55)\nC1S | 1024 | 100.00 (77.78) | 46.99 (2030) | 5.55 (5.42)\nC1S | 2048 | 100.00 (76.27) | 97.71 (4329) | 2.38 (2.29)\nC1E | 128 | 100.00 (78.94) | 4.32 (271.4) | 17.37 (15.26)\nC1E | 256 | 100.00 (78.88) | 17.39 (469.2) | 13.44 (12.01)\nC1E | 512 | 100.00 (78.38) | 27.23 (1048) | 9.20 (8.74)\nC1E | 1024 | 100.00 (77.82) | 62.64 (2392) | 4.10 (3.99)\nC1E | 2048 | 100.00 (76.64) | 114.4 (4347) | 1.55 (1.6)\n\nTable 6: Results on CIFAR-10 dataset by\nC1S (replace all ReLU by Softplus, \u03b2 = 20)\nand C1E (replace all ReLU by ELU, \u03b1 =\n1). We see the same trend with these twice\ndifferentiable activations as with ReLU.\n\nFigure 7: Top eigenvalue on training dataset for\nC1 with different activation functions on various\nbatch size.\n\nReferences\n[1] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations.\nCambridge University Press, 2009.\n\n[2] Arjun Nitin Bhagoji, Daniel Cullina, and Prateek Mittal. Dimensionality reduction as a defense\nagainst evasion attacks on machine learning classifiers. arXiv preprint arXiv:1704.02654, 2017.\n\n[3] L\u00e9on Bottou, Corinna Cortes, John S Denker, Harris Drucker, Isabelle Guyon, Lawrence D\nJackel, Yann LeCun, Urs A Muller, Edward Sackinger, Patrice Simard, et al. Comparison of\nclassifier methods: a case study in handwritten digit recognition. 
In Computer Vision & Image\nProcessing., Proceedings of the 12th IAPR International, volume 2, pages 77\u201382, 1994.\n\n[4] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In\nSecurity and Privacy (SP), pages 39\u201357. IEEE, 2017.\n\n[5] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing\ngradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.\n\n[6] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and\nYoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional\nnon-convex optimization. In Advances in neural information processing systems, pages 2933\u20132941,\n2014.\n\n[7] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in\nNeural Information Processing Systems, pages 666\u2013674, 2011.\n\n[8] Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and Koray Kavukcuoglu. Natural\nneural networks. In Advances in Neural Information Processing Systems, pages 2071\u20132079,\n2015.\n\n[9] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize\nfor deep nets. arXiv preprint arXiv:1703.04933, 2017.\n\n[10] Stanley C Eisenstat and Homer F Walker. Choosing the forcing terms in an inexact newton\nmethod. SIAM Journal on Scientific Computing, 17(1):16\u201332, 1996.\n\n[11] Reuben Feinman, Ryan R Curtin, Saurabh Shintre, and Andrew B Gardner. Detecting adversarial\nsamples from artifacts. arXiv preprint arXiv:1703.00410, 2017.\n\n[12] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points\u2014online\nstochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797\u2013842,\n2015.\n\n[13] A. Gholami, A. Azad, P. Jin, K. Keutzer, and A. Buluc. Integrated model, batch and domain\nparallelism in training neural networks. SPAA\u20198, 2018. 
\n\n[14] Zhitao Gong, Wenlu Wang, and Wei-Shinn Ku. Adversarial and clean data are not twins. arXiv\npreprint arXiv:1704.04960, 2017.\n\n[15] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing\nadversarial examples. arXiv preprint arXiv:1412.6572, 2014.\n\n[16] Priya Goyal, Piotr Doll\u00e1r, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola,\nAndrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training\nimagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.\n\n[17] Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel.\nOn the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.\n\n[18] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Flat minima. Neural Computation, 9(1):1\u201342, 1997.\n\n[19] Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. Firecaffe:\nnear-linear acceleration of deep neural network training on compute clusters. In Proceedings of the\nIEEE Conference on Computer Vision and Pattern Recognition, pages 2592\u20132600, 2016.\n\n[20] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping\nTak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima.\narXiv preprint arXiv:1609.04836, 2016.\n\n[21] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical\nworld. arXiv preprint arXiv:1607.02533, 2016.\n\n[22] Nicolas Le Roux and Yoshua Bengio. Deep belief networks are compact universal approximators.\nNeural computation, 22(8):2192\u20132207, 2010.\n\n[23] Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and\nBenjamin Recht. First-order methods almost always avoid saddle points. 
arXiv preprint\narXiv:1710.07406, 2017.\n\n[24] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu.\nTowards deep learning models resistant to adversarial attacks. International Conference on\nLearning Representations, 2018.\n\n[25] James Martens and Ilya Sutskever. Training deep and recurrent networks with hessian-free\noptimization. In Neural Networks: Tricks of the trade, pages 479\u2013535. Springer, 2012.\n\n[26] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting\nadversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.\n\n[27] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of\nlinear regions of deep neural networks. In Advances in neural information processing systems,\npages 2924\u20132932, 2014.\n\n[28] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465\u2013471, 1978.\n\n[29] Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial\ntraining: Increasing local stability of neural nets through robust optimization. arXiv preprint\narXiv:1511.05432, 2015.\n\n[30] Samuel L Smith, Pieter-Jan Kindermans, and Quoc V Le. Don\u2019t decay the learning rate, increase\nthe batch size. arXiv preprint arXiv:1711.00489, 2017.\n\n[31] Samuel L Smith and Quoc V Le. A bayesian perspective on generalization and stochastic\ngradient descent. Second workshop on Bayesian Deep Learning (NIPS 2017), 2017.\n\n[32] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian\nGoodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199,\n2013.\n\n[33] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael Mahoney. Large batch size training\nof neural networks with adversarial training and second-order information. arXiv preprint\narXiv:1810.01021, 2018.\n\n[34] Yang You, Igor Gitman, and Boris Ginsburg. 
Scaling SGD batch size to 32K for imagenet\ntraining. arXiv preprint arXiv:1708.03888, 2017.\n\n[35] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding\ndeep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.\n", "award": [], "sourceid": 2391, "authors": [{"given_name": "Zhewei", "family_name": "Yao", "institution": "UC Berkeley"}, {"given_name": "Amir", "family_name": "Gholami", "institution": "University of California, Berkeley"}, {"given_name": "Qi", "family_name": "Lei", "institution": "University of Texas at Austin"}, {"given_name": "Kurt", "family_name": "Keutzer", "institution": "EECS, UC Berkeley"}, {"given_name": "Michael", "family_name": "Mahoney", "institution": "UC Berkeley"}]}