{"title": "Lipschitz-Margin Training: Scalable Certification of Perturbation Invariance for Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6541, "page_last": 6550, "abstract": "High sensitivity of neural networks against malicious perturbations on inputs causes security concerns. To take a steady step towards robust classifiers, we aim to create neural network models provably defended from perturbations. Prior certification work requires strong assumptions on network structures and massive computational costs, and thus the range of their applications was limited. From the relationship between the Lipschitz constants and prediction margins, we present a computationally efficient calculation technique to lower-bound the size of adversarial perturbations that can deceive networks, and that is widely applicable to various complicated networks. Moreover, we propose an efficient training procedure that robustifies networks and significantly improves the provably guarded areas around data points. In experimental evaluations, our method showed its ability to provide a non-trivial guarantee and enhance robustness for even large networks.", "full_text": "Lipschitz-Margin Training: Scalable Certi\ufb01cation of\nPerturbation Invariance for Deep Neural Networks\n\nYusuke Tsuzuku\n\nThe University of Tokyo\n\nRIKEN\n\ntsuzuku@ms.k.u-tokyo.ac.jp\n\nsato@k.u-tokyo.ac.jp\n\nThe University of Tokyo\n\nIssei Sato\n\nRIKEN\n\nMasashi Sugiyama\n\nRIKEN\n\nThe University of Tokyo\nsugi@k.u-tokyo.ac.jp\n\nAbstract\n\nHigh sensitivity of neural networks against malicious perturbations on inputs causes\nsecurity concerns. To take a steady step towards robust classi\ufb01ers, we aim to create\nneural network models provably defended from perturbations. Prior certi\ufb01cation\nwork requires strong assumptions on network structures and massive computational\ncosts, and thus the range of their applications was limited. 
From the relationship between the Lipschitz constants and prediction margins, we present a computationally efficient calculation technique to lower-bound the size of adversarial perturbations that can deceive networks, and that is widely applicable to various complicated networks. Moreover, we propose an efficient training procedure that robustifies networks and significantly improves the provably guarded areas around data points. In experimental evaluations, our method showed its ability to provide a non-trivial guarantee and enhance robustness for even large networks.

1 Introduction

Deep neural networks are highly vulnerable to intentionally created small perturbations on inputs [29], called adversarial perturbations, which cause serious security concerns in applications such as self-driving cars. Adversarial perturbations in object recognition systems have been intensively studied [29, 10, 6], and we mainly target object recognition systems.

One approach to defend against adversarial perturbations is to mask gradients. Defensive distillation [23], which distills networks into themselves, is one of the most prominent methods. However, Carlini and Wagner [6] showed that we can create adversarial perturbations that deceive networks trained with defensive distillation. Input transformations and detections [32, 11] are other defense strategies, although they can also be bypassed [5]. Adversarial training [10, 16, 18], which injects adversarially perturbed data into training data, is a promising approach. However, there is a risk of overfitting to attacks [16, 30]. Many other heuristics have been developed to make neural networks insensitive to small perturbations on inputs. However, recent work has repeatedly succeeded in creating adversarial perturbations for networks protected with heuristics in the literature [1]. For instance, Athalye et al. [2] reported that the defenses of many ICLR 2018 papers were circumvented soon after the announcement of their acceptance. This indicates that even protected networks can be unexpectedly vulnerable, which is a crucial problem for this specific line of research because the primary concern of these studies is security threats.

The literature indicates the difficulty of defense evaluations. Thus, our goal is to ensure lower bounds on the size of the adversarial perturbations that can deceive networks for each input. Many existing approaches, which we cover in Sec. 2, are applicable only to small networks with special structures. On the other hand, the networks commonly used in evaluations of defense methods are wide, which makes prior methods computationally intractable, and complicated, which makes some prior methods inapplicable. This work tackles this problem, and we provide a widely applicable, yet highly scalable, method that ensures large guarded areas for a wide range of network structures.

The existence of adversarial perturbations indicates that the slope of the loss landscape around data points is large, and we aim to bound this slope. An intuitive way to measure the slope is to calculate the size of the gradient of a loss with respect to an input. However, this is known to provide a false sense of security [30, 6, 2]. Thus, we require upper bounds on the gradients. The next candidate is to calculate a local Lipschitz constant, that is, the maximum size of the gradients around each data point. Even though this can provide certification, calculating the local Lipschitz constant is computationally hard. We can obtain it only for small networks, or obtain an approximation of it, which cannot provide certification [12, 31, 9]. A coarser but available alternative is to calculate the global Lipschitz constant. 

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
However, prior work could provide only certifications orders of magnitude smaller than the usual discretization width of images, even for small networks [29, 24]. We show that we can overcome such looseness with our improved and unified bounds and a newly developed training procedure. The training procedure is more general and effective than previous approaches [7, 33]. We empirically observed that the training procedure also improves robustness against current attack methods.

2 Related work

In this section, we review prior work that provides certifications for networks. One of the popular approaches is to restrict the discussion to networks using ReLU [20] exclusively as their activation function and to reduce the verification problem to some other well-studied problem. Bastani et al. [4] encoded networks as linear programs, Katz et al. [14, 13] reduced the problem to Satisfiability Modulo Theory, and Raghunathan et al. [25] encoded networks as semidefinite programs. However, these formulations demand prohibitive computational costs, and their applications are limited to only small networks. As a relatively tractable method, Kolter and Wong [15] bounded the influence of ℓ∞-norm bounded perturbations using convex outer polytopes. However, it is still hard to scale this method to deep or wide networks. Another approach is to assume smoothness of networks and losses. Hein and Andriushchenko [12] focused on local Lipschitz constants of neural networks around each input. However, the guarantee is provided only for networks with one hidden layer. Sinha et al. [27] proposed a certifiable procedure for adversarial training. However, the smoothness constants that their certification requires are usually unavailable or infinite. In concurrent work, Ruan et al. [26] proposed another algorithm to certify robustness in a more scalable manner than previous approaches. 
We note that our algorithm is still significantly faster.

3 Problem formulation

We define the threat model, our defense goal, and basic terminologies.

Threat model: Let X be a data point from data distribution D and its true label be t_X ∈ {1, ..., K}, where K is the number of classes. Attackers create a new data point similar to X which deceives defenders' classifiers. In this paper, we consider the ℓ2-norm as a similarity measure between data points because it is one of the most common metrics [19, 6].

Let c be a positive constant and F be a classifier. We assume that the output of F is a vector F(X) and the classifier predicts the label with argmax_{i∈{1,...,K}} {F(X)_i}, where F(X)_i denotes the i-th element of F(X). Now, we define an adversarial perturbation ε_{F,X} as follows:

ε_{F,X} ∈ { ε | ||ε||_2 < c ∧ t_X ≠ argmax_{i∈{1,...,K}} {F(X + ε)_i} }.

Defense goal: We define a guarded area for a network F and a data point X as a hypersphere with a radius c that satisfies the following condition:

∀ε, ||ε||_2 < c ⇒ t_X = argmax_{i∈{1,...,K}} {F(X + ε)_i}.    (1)

This condition (1) is always satisfied when c = 0. Our goal is to ensure that neural networks have larger guarded areas for data points in the data distribution.

4 Calculation and enlargement of guarded area

In this section, we first describe basic concepts for calculating the provably guarded area defined in Sec. 3. Next, we outline our training procedure to enlarge the guarded area.

4.1 Lipschitz constant and guarded area

We explain how to calculate the guarded area using the Lipschitz constant. 
If L_F bounds the Lipschitz constant of a neural network F, we have the following from the definition of the Lipschitz constant:

||F(X) − F(X + ε)||_2 ≤ L_F ||ε||_2.

Note that if the last layer of F is softmax, we only need to consider the subnetwork before the softmax layer. We introduce the notion of the prediction margin M_{F,X}:

M_{F,X} := F(X)_{t_X} − max_{i≠t_X} {F(X)_i}.

This margin has been studied in relationship to generalization bounds [17, 3, 22]. Using the prediction margin, we can prove that the following proposition holds.

Proposition 1.

(M_{F,X} ≥ √2 L_F ||ε||_2) ⇒ (M_{F,X+ε} ≥ 0).    (2)

The details of the proof are in Appendix A of the supplementary material. Thus, perturbations smaller than M_{F,X} / (√2 L_F) cannot deceive the network F for a data point X. Proposition 1 sees network F as a function with a multidimensional output. This connects the Lipschitz constant of a network, which has been discussed in Szegedy et al. [29] and Cisse et al. [7], with the absence of adversarial perturbations. If we cast the problem to a set of functions with a one-dimensional output, we can obtain a variant of Prop. 1. Assume that the last layer before softmax in F is a fully-connected layer and w_i is the i-th row of its weight matrix. Let L_sub be a Lipschitz constant of the sub-network of F before the last fully-connected layer. We obtain the following proposition directly from the definition of the Lipschitz constant [12, 31].

Proposition 2.

(∀i, (F_{t_X} − F_i ≥ L_sub ||w_{t_X} − w_i||_2 ||ε||_2)) ⇒ (M_{F,X+ε} ≥ 0).    (3)

We can use either Prop. 1 or Prop. 2 for the certification. The calculation of the Lipschitz constant, which is not straightforward in large and complex networks, will be explained in Sec. 
5.

4.2 Guarded area enlargement

To ensure non-trivial guarded areas, we propose a training procedure that enlarges the provably guarded area.

Lipschitz-margin training: To encourage conditions Eq. (2) or Eq. (3) to be satisfied on the training data, we convert them into losses. We take Eq. (2) as an example. To make Eq. (2) satisfied for perturbations with ℓ2-norm up to c, we require the following condition:

∀i ≠ t_X, (F_{t_X} ≥ F_i + √2 c L_F).    (4)

Thus, we add √2 c L_F to all elements of the logits except for the index corresponding to t_X. In training, we calculate an estimate of the upper bound of L_F in a computationally efficient and differentiable way and use it instead of L_F. The hyperparameter c is specified by users. We call this training procedure Lipschitz-margin training (LMT). The algorithm is provided in Figure 1. Using Eq. (3) instead of Eq. (2) is straightforward. Small additional techniques that make LMT more stable are given in Appendix E of the supplementary material.

Interpretation of LMT: From the previous paragraph, we can see that LMT maximizes the number of training data points that have guarded areas larger than c, as long as the original training procedure maximizes the number of points that are correctly classified. We experimentally evaluate its generalization to test data in Sec. 6. The hyperparameter c is easy to interpret and easy to tune. The larger the c we specify, the stronger the invariance property the trained network will have. However, this does not mean that the trained network always has high accuracy on noisy examples. To see this, consider the case where c is extremely large. 
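As a rough illustration, the logit modification behind Eq. (4) can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the names `lmt_logits` and `lipschitz_bound` are illustrative, and in the actual procedure the bound is the differentiable estimate described in Sec. 5.

```python
import numpy as np

def lmt_logits(logits, target, c, lipschitz_bound):
    """Add the Lipschitz-margin term sqrt(2)*c*L to every non-target
    logit, so the loss penalizes margins smaller than sqrt(2)*c*L."""
    penalty = np.sqrt(2.0) * c * lipschitz_bound
    out = logits.astype(float).copy()
    mask = np.arange(out.shape[-1]) != target
    out[mask] += penalty
    return out

logits = np.array([3.0, 1.0, 0.5])
# With c = 0.5 and L = 2.0, non-target logits rise by sqrt(2) ~ 1.414,
# so the network is pushed to keep a margin above that value.
adjusted = lmt_logits(logits, target=0, c=0.5, lipschitz_bound=2.0)
```

The adjusted logits are then fed to the usual softmax cross-entropy, as in Figure 1.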
In such a case, constant functions become an optimal solution. We can interpret LMT as an interpolation between the original function, which is highly expressive but extremely non-smooth, and constant functions, which are robust and smooth.

Computational costs: The main computational overhead of LMT is the calculation of the Lipschitz constant. We show in Sec. 5 that its computational cost is almost the same as increasing the batch size by one. Since we typically have tens or hundreds of samples in a mini-batch, this cost is negligible.

5 Calculation of the Lipschitz constant

In this section, we first describe a method to calculate upper bounds of the Lipschitz constant. We bound the Lipschitz constant of each component and recursively calculate the overall bound. The concept is from Szegedy et al. [29]. While prior work required separate analyses for slightly different components [29, 24, 7, 26], we provide a more unified analysis. Furthermore, we provide a fast calculation algorithm for both the upper bounds and their differentiable approximation.

5.1 Composition, addition, and concatenation

We describe the relationships between the Lipschitz constants and some functionals which frequently appear in deep neural networks: composition, addition, and concatenation. Let f and g be functions with Lipschitz constants bounded by L_1 and L_2, respectively. The Lipschitz constant of each functional is bounded as follows:

composition f ∘ g: L_1 · L_2,    addition f + g: L_1 + L_2,    concatenation (f, g): √(L_1² + L_2²).

5.2 Major components

We describe bounds on the Lipschitz constants of major layers commonly used in image recognition tasks. 
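For intuition, the three bounds above can be checked numerically on small linear maps, for which the Lipschitz constant is exactly the spectral norm. A minimal NumPy sketch (the matrices are arbitrary examples, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # f(x) = A @ x, Lipschitz constant ||A||_2
B = rng.standard_normal((3, 3))  # g(x) = B @ x, Lipschitz constant ||B||_2
C = rng.standard_normal((3, 3))  # another map with the same shape as B

LA = np.linalg.norm(A, 2)  # ord=2 on a matrix gives the spectral norm
LB = np.linalg.norm(B, 2)
LC = np.linalg.norm(C, 2)

# Composition f∘g: ||A @ B||_2 <= L_A * L_B.
assert np.linalg.norm(A @ B, 2) <= LA * LB + 1e-9

# Addition g + h (shapes must match): ||B + C||_2 <= L_B + L_C.
assert np.linalg.norm(B + C, 2) <= LB + LC + 1e-9

# Concatenation (g, h): stacking outputs gives at most sqrt(L_B^2 + L_C^2).
stacked = np.vstack([B, C])
assert np.linalg.norm(stacked, 2) <= np.sqrt(LB**2 + LC**2) + 1e-9
```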
We note that we can ignore any bias parameters because they do not change the Lipschitz constants of each layer.

Linear layers in general: Fully-connected, convolutional, and normalization layers are typically linear operations at inference time. For instance, batch-normalization is a multiplication by a diagonal matrix whose i-th element is γ_i / √(σ_i² + ε), where γ_i, σ_i², and ε are a scaling parameter, the running average of the variance, and a constant, respectively. Since the composition of linear operators is also linear, we can jointly calculate the Lipschitz constant of some common pairs of layers such as convolution + batch-normalization. By using the following theorem, we propose a more unified algorithm than Yoshida and Miyato [33].

Theorem 1. Let φ be a linear operator from R^n to R^m, where n < ∞ and m < ∞. We initialize a vector u ∈ R^n from a Gaussian with zero mean and unit variance. When we iteratively apply the following update formula, the ℓ2-norm of u converges to the square of the operator norm of φ in terms of the ℓ2-norm, almost surely:

u ← u / ||u||_2,    v ← φ(u),    u ← (1/2) ∂||v||_2² / ∂u.

Algorithm 1: Lipschitz-margin training
  hyperparam: c: required robustness
  input: X: image, t_X: label of X
  y ← Forward(X);
  L ← CalcLipschitzConst();
  foreach index i:
    if i ≠ t_X:
      y_i += √2 L c;
  p ← SoftmaxIfNecessary(y);
  ℓ ← CalcLoss(p, t_X);

Figure 1: Lipschitz-margin training algorithm when we use Prop. 1.

Algorithm 2: Calculation of operator norm
  input: u: array at previous iteration
  target: f: linear function
  u ← u / ||u||_2;
  // σ is an approximated spectral norm
  σ ← ||f(u)||_2;
  L ← CalcLipschitzConst(σ);
  ℓ ← CalcLoss(L);
  u ← ∂ℓ/∂u;

Figure 2: Calculation of the spectral norm of linear components at training time.

The proof is found in Appendix C.1 of the supplementary material. The algorithm for training time is provided in Figure 2. At training time, we need only one iteration of the above update formula, as in Yoshida and Miyato [33]. Note that for the estimation of the operator norm in a forward path, we do not need to use gradients. In a convolutional layer, for instance, we do not need another convolution operation or a transposed convolution. We only need to increase the batch size by one. The wide availability of our calculation method will be especially useful when more complicated linear operators than usual convolutions appear in the future. Since we want to ensure that the calculated bound is an upper bound for certification, we can use the following theorem.

Theorem 2. Let ||φ||_2 and R_k be the operator norm of a function φ in terms of the ℓ2-norm and the ℓ2-norm of the vector u at the k-th iteration, where each element of u is initialized by a Gaussian with zero mean and unit variance. With probability higher than 1 − √(2/π), the error between ||φ||_2² and R_k is smaller than (Δ_k + √(Δ_k(4R_k + Δ_k))) / 2, where Δ_k := (R_k − R_{k−1}) n.

The proof is in Appendix C.3 of the supplementary material, which is mostly from Friedman [8]. If we use a large batch for the power iteration, the probability becomes exponentially closer to one. We can also use singular value decomposition as another way for accurate calculation. 
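The update in Theorem 1 is a power iteration on φᵀφ: for linear φ, the gradient step (1/2) ∂||φ(u)||² / ∂u equals φᵀφ(u). A minimal NumPy sketch for a map given explicitly as a matrix, using the adjoint `A.T` in place of automatic differentiation (the function name and iteration count are illustrative):

```python
import numpy as np

def operator_norm_power_iteration(A, iters=200, seed=0):
    """Estimate the operator 2-norm (largest singular value) of a
    linear map A via the update of Theorem 1: power iteration on A^T A."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(A.shape[1])
    for _ in range(iters):
        u = u / np.linalg.norm(u)  # u <- u / ||u||_2
        v = A @ u                  # v <- phi(u)
        u = A.T @ v                # (1/2) d||v||^2/du = A^T A u for linear phi
    # ||u||_2 converges to the squared operator norm, so take a square root.
    return np.sqrt(np.linalg.norm(u))

A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
sigma = operator_norm_power_iteration(A)  # ~ 3.0, the largest singular value
```

In a framework with automatic differentiation, the `A.T @ v` step would instead be the gradient of ||v||²/2 with respect to u, which is what makes the estimate usable inside training.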
Despite its simplicity, the obtained bound for convolutional layers is much tighter than the previous results in Peck et al. [24] and Cisse et al. [7], and that for normalization layers is novel. We numerically confirm the improvement of the bounds in Sec. 6.

Pooling and activation: First, we have the following theorem.

Theorem 3. Define f(Z) = (f_1(Z¹), f_2(Z²), ..., f_Λ(Z^Λ)), where Z^λ ⊂ Z and the Lipschitz constant of f_λ is bounded by L for all λ. Then

||f||_2 ≤ √n L,

where n := max_j |{λ | x_j ∈ Z^λ}| and x_j is the j-th element of x.

The proof, whose idea comes from Cisse et al. [7], is found in Appendix D.1 of the supplementary material. The exact form of n in the pooling and convolutional layers is given in Appendix D.3 of the supplementary material. The assumption in Theorem 3 holds for most layers of networks for image recognition tasks, including pooling layers, convolutional layers, and activation functions. Careful counting of n leads to improved bounds on the relationship between the Lipschitz constant of a convolutional layer and the spectral norm of its reshaped kernel compared to the previous result [7].

Corollary 1. Let ||Conv||_2 be the operator norm of a convolutional layer in terms of the ℓ2-norm, and ||W'||_2 be the spectral norm of the matrix where the kernel of the convolution is reshaped into a matrix with the same number of rows as its output channel size. Assume that the width and the height of its input before padding are larger than or equal to those of the kernel. The following inequality holds:

||W'||_2 ≤ ||Conv||_2 ≤ √n ||W'||_2,

where n is a constant independent of the weight matrix.

The proof of Corollary 1 is in Appendix D.2 of the supplementary material. 
Lists of the Lipschitz constants of pooling layers and activation functions are summarized in Appendix D of the supplementary material.

5.3 Putting them together

With recursive computation using the bounds described in the previous sections, we can calculate an upper bound of the Lipschitz constant of the whole network in a manner differentiable with respect to the network parameters. At inference time, the calculation of the Lipschitz constant is required only once. In calculations at training time, there may be some notable differences in the Lipschitz constants. For example, σ_i in a batch-normalization layer depends on its input. However, we empirically found that calculating the Lipschitz constants using the same bound as at inference time effectively regularizes the Lipschitz constant. This lets us deal with batch-normalization layers, which prior work ignored despite their impact on the Lipschitz constant [7, 33].

Figure 3: Comparison of bounds in layers. Left: the second convolutional layer of a naive model in Sec. 6. Center: the second convolutional layer of an LMT model in Sec. 6. Right: pooling layers assuming size 2 and an input size of 28 × 28.

6 Numerical evaluations

In this section, we show the results of numerical evaluations. Since our goal is to create networks with stronger certification, we evaluated the following three points.

1. Our bounds of the Lipschitz constants are tighter than previous ones (Sec. 6.1).
2. LMT effectively enlarges the provably guarded area (Secs. 6.1 and 6.2).
3. Our calculation technique of the guarded area and LMT are available for modern large and complex networks (Sec. 6.2).

We also evaluated the robustness of trained networks against current attacks and confirmed that LMT robustifies networks (Secs. 6.1 and 6.2). For calculating the Lipschitz constant and guarded area, we used Prop. 2. 
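Under Prop. 2, the certified radius for one input follows directly from the per-class margins, the sub-network Lipschitz bound, and the rows of the last-layer weight matrix. A minimal sketch of that computation (names such as `certified_radius`, `W`, and `L_sub` are illustrative, not the authors' code):

```python
import numpy as np

def certified_radius(logits, W, target, L_sub):
    """Largest c such that Prop. 2 guarantees no perturbation with
    ||eps||_2 < c flips the prediction:
    min over i != target of (F_t - F_i) / (L_sub * ||w_t - w_i||_2)."""
    radii = []
    for i in range(len(logits)):
        if i == target:
            continue
        margin = logits[target] - logits[i]
        denom = L_sub * np.linalg.norm(W[target] - W[i])
        radii.append(margin / denom)
    return min(radii)

# Toy example: 3 classes, 2-dimensional feature before the last layer.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.0]])
logits = np.array([4.0, 1.0, 0.0])
radius = certified_radius(logits, W, target=0, L_sub=2.0)  # -> 1.0
```

Here class 2 is binding: its margin is 4 and the row gap ||w_0 − w_2||_2 is 2, giving 4 / (2 · 2) = 1.0, smaller than the 3 / (2√2) ≈ 1.06 obtained for class 1.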
Detailed experimental setups are available in Appendix F of the supplementary material. Our code is available at https://github.com/ytsmiling/lmt.

6.1 Tightness of bounds

We numerically validated the improvements of the bounds for each component and numerically analyzed the tightness of the overall bounds of the Lipschitz constant. We also examine the non-triviality of the provably guarded area. We used the same network and hyperparameters as Kolter and Wong [15].

Improvement in each component: We evaluated the difference of bounds in convolutional layers in networks trained using a usual training procedure and LMT. Figure 3 shows comparisons between the bounds in the second convolutional layer. It also shows the difference of bounds in pooling layers, which does not depend on training methods. We can confirm improvement in each bound. This results in significant differences in the upper bounds of the Lipschitz constants of the whole networks.

Analysis of tightness: Let L be an upper bound of the Lipschitz constant calculated by our method. Let L_local and L_global be the local and global Lipschitz constants. Between them, we have the following relationship:

(A) Margin/L  ≤(i)  (B) Margin/L_global  ≤(ii)  (C) Margin/L_local  ≤(iii)  (D) Smallest Adversarial Perturbation.    (5)

We analyzed the errors in inequalities (i) – (iii). We define the error of (i) as (B)/(A), and the others in the same way. We used lower bounds of the local and global Lipschitz constants calculated from the maximum size of the gradients found. A detailed procedure for the calculation is explained in Appendix F.1.3 of the supplementary material. For the generation of adversarial perturbations, we used DeepFool [19]. 
Note that (iii) does not necessarily hold because we calculated mere lower bounds of the Lipschitz constants in (B) and (C). We analyzed inequality (5) for an unregularized model, an adversarially trained (AT) model with the 30-iteration C&W attack [6], and an LMT model. Figure 4 shows the result. For the unregularized model, the estimated error ratios in (i) – (iii) were 39.9, 1.13, and 1.82, respectively. This shows that even if we could precisely calculate the local Lipschitz constant for each data point, with possibly substantial computational costs, inequality (iii) would still be more than 1.8 times looser than the size of adversarial perturbations found by DeepFool. For the AT model, the discrepancy became more than 2.4. On the other hand, for the LMT model, the estimated error ratios in (i) – (iii) were 1.42, 1.02, and 1.15, respectively. The overall median error between the size of found adversarial perturbations and the provably guarded area was 1.72. This shows that the trained network became smooth and Lipschitz-constant-based certifications became significantly tighter when we use LMT. This also resulted in better defense against attack. 

Figure 4: Comparison of error bounds using inequalities (5) with estimation. Each label corresponds to the value in inequality (5). Left: naive model, Center: AT model, Right: LMT model.

Figure 5: Examples of pairs of an original image and an artificially perturbed image for which the LMT model was ensured not to make wrong predictions. The differences between the images are large and visually perceptible. On the basis of Proposition 2, any pattern of perturbations with the same or smaller magnitude could not deceive the network trained with LMT.
For reference, the median size of found adversarial perturbations for the unregularized model was 0.97, while the median size of the provably guarded area was 1.02 for the LMT model.

Size of provably guarded area: We discuss the size of the provably guarded area, which is practically more interesting than tightness. While our algorithm has clear advantages in computational costs and broad applicability over prior work, the guarded areas that our algorithm ensured were non-trivially large. For a naive model, the median size of perturbations whose invariance we could certify was 0.012. This means that changing several pixels by one on the usual 0–255 scale cannot change the prediction. Even though this result is not as tight as seen in the previous paragraph, it is significantly larger than what the computationally cheap algorithm proposed by Peck et al. [24] provides. A more impressive result was obtained for models trained with LMT, for which the median size of the guarded area was 1.02. This corresponds to 0.036 in the ℓ∞-norm. Kolter and Wong [15], who used the same network and hyperparameters as ours, reported that they could defend against perturbations with ℓ∞-norm bounded by 0.1 for more than 94% of examples. Thus, in the ℓ∞-norm, our result is inferior, if we ignore their limited applicability and massive computational demands. However, our algorithm mainly targets the ℓ2-norm, and in that sense, the guarded area is significantly larger. Moreover, for more than half of the test data, we could ensure that there are no one-pixel attacks [28]. To confirm the non-triviality of the obtained certification, we show some examples of provably guarded images in Figure 5.

6.2 Scalability test

We evaluated our method with a larger and more complex network to confirm its broad applicability and scalability. 
We used 16-layer wide residual networks [34] with width factor 4 on the SVHN dataset [21], following Cisse et al. [7]. To the best of our knowledge, this is the largest network concerned with certification. We compared LMT with a naive counterpart, which uses weight decay, with spectral norm regularization [33], and with Parseval networks.

Table 1: Accuracy of trained wide residual networks on SVHN against the C&W attack.

                                          Size of perturbations
                                 Clean    0.2      0.5      1.0
    weight decay                 98.31    72.38    20.98     2.02
    Parseval network             98.30    71.35    17.92     0.94
    spectral norm regularization 98.27    73.66    20.35     1.39
    LMT                          96.38    86.90    55.11    17.69

Size of provably guarded area: For a model trained with LMT, we could ensure guarded areas larger than 0.029 for more than half of the test data. This order of certification was only provided for small networks in prior work. For models trained with the other methods, we could not provide such strong certification. There are mainly two differences between LMT and the other methods. First, LMT enlarges prediction margins. Second, LMT regularizes batch-normalization layers, while in the other methods, batch-normalization layers cancel the regularization on the weight matrices and kernels of convolutional layers. We also conducted additional experiments to provide further certification for the network. First, we replaced each convolution with kernel size 1 and stride 2 by an average-pooling layer with size 2 followed by a convolution with kernel size 1. Then, we used LMT with c = 0.1. 
As a result, while the accuracy dropped to 86%, the median size of the provably guarded areas was larger than 0.08. This means that changing 400 elements of the input by ±1 on the usual image scale (0–255) cannot cause an error rate over 50% for the trained network. These certifications are non-trivial, and to the best of our knowledge, they are the best certifications provided for such a large network.

Robustness against attack: We evaluated the robustness of trained networks against adversarial perturbations created by current attacks. We used the C&W attack [6] with 100 iterations and no random restart for the evaluation. Table 1 summarizes the results. While LMT slightly dropped accuracy, it largely improved robustness compared to the other regularization-based techniques. Since these techniques are independent of other techniques such as adversarial training or input transformations, further robustness can be expected when LMT is combined with them.

7 Conclusion

To ensure perturbation invariance of a broad range of networks with a computationally efficient procedure, we achieved the following.

1. We offered general and tighter spectral bounds for each component of neural networks.
2. We introduced a general and fast calculation algorithm for the upper bound of operator norms and its differentiable approximation.
3. We proposed a training algorithm which effectively constrains networks to be smooth, and achieves better certification and robustness against attacks.
4. We successfully provided non-trivial certification for small to large networks with negligible computational costs.

We believe that this work will serve as an essential step towards both certifiable and robust deep learning models. 
Applying the developed techniques to other Lipschitz-concerned domains, such as the training of GANs or training with noisy labels, is future work.

Acknowledgement

The authors appreciate Takeru Miyato for valuable feedback. YT was supported by the Toyota/Dwango AI scholarship. IS was supported by KAKENHI 17H04693. MS was supported by KAKENHI 17H00757.

References

[1] N. Akhtar and A. Mian. Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey. CoRR, abs/1801.00553, 2018.

[2] A. Athalye, N. Carlini, and D. Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In Proceedings of the 35th International Conference on Machine Learning, pages 274–283, 2018.

[3] P. L. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized Margin Bounds for Neural Networks. In Advances in Neural Information Processing Systems 30, pages 6241–6250, 2017.

[4] O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. V. Nori, and A. Criminisi. Measuring Neural Net Robustness with Constraints. In Advances in Neural Information Processing Systems 28, pages 2613–2621, 2015.

[5] N. Carlini and D. Wagner. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14, 2017.

[6] N. Carlini and D. A. Wagner. Towards Evaluating the Robustness of Neural Networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy, pages 39–57. IEEE Computer Society, 2017.

[7] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval Networks: Improving Robustness to Adversarial Examples. In Proceedings of the 34th International Conference on Machine Learning, pages 854–863, 2017.

[8] J. Friedman. Error bounds on the power method for determining the largest eigenvalue of a symmetric, positive definite matrix. 
Linear Algebra and its Applications, 280(2):199–216, 1998.

[9] I. J. Goodfellow. Gradient Masking Causes CLEVER to Overestimate Adversarial Perturbation Size. CoRR, abs/1804.07870, 2018.

[10] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. International Conference on Learning Representations, 2015.

[11] C. Guo, M. Rana, M. Cisse, and L. v. d. Maaten. Countering Adversarial Images using Input Transformations. International Conference on Learning Representations, 2018.

[12] M. Hein and M. Andriushchenko. Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation. In Advances in Neural Information Processing Systems 30, pages 2263–2273, 2017.

[13] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer. Towards Proving the Adversarial Robustness of Deep Neural Networks. In Proceedings of the First Workshop on Formal Verification of Autonomous Vehicles, pages 19–26, 2017.

[14] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification, Part I, pages 97–117, 2017.

[15] J. Z. Kolter and E. Wong. Provable Defenses against Adversarial Examples via the Convex Outer Adversarial Polytope. In Proceedings of the 35th International Conference on Machine Learning, pages 5286–5295, 2018.

[16] A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial Machine Learning at Scale. International Conference on Learning Representations, 2017.

[17] J. Langford and J. Shawe-Taylor. PAC-Bayes & Margins. In Advances in Neural Information Processing Systems 15, pages 439–446, 2002.

[18] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. International Conference on Learning Representations, 2018.

[19] S. Moosavi-Dezfooli, A.
Fawzi, and P. Frossard. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.

[20] V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning, 2010.

[21] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. Neural Information Processing Systems Workshop, 2011.

[22] B. Neyshabur, S. Bhojanapalli, and N. Srebro. A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks. International Conference on Learning Representations, 2018.

[23] N. Papernot, P. D. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. In Proceedings of the IEEE Symposium on Security and Privacy, pages 582–597, 2016.

[24] J. Peck, J. Roels, B. Goossens, and Y. Saeys. Lower Bounds on the Robustness to Adversarial Perturbations. In Advances in Neural Information Processing Systems 30, pages 804–813, 2017.

[25] A. Raghunathan, J. Steinhardt, and P. Liang. Certified Defenses against Adversarial Examples. International Conference on Learning Representations, 2018.

[26] W. Ruan, X. Huang, and M. Kwiatkowska. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 2651–2659, 2018.

[27] A. Sinha, H. Namkoong, and J. Duchi. Certifiable Distributional Robustness with Principled Adversarial Training. International Conference on Learning Representations, 2018.

[28] J. Su, D. V. Vargas, and S. Kouichi. One pixel attack for fooling deep neural networks.
CoRR, abs/1710.08864, 2017.

[29] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing Properties of Neural Networks. International Conference on Learning Representations, 2014.

[30] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. D. McDaniel. Ensemble Adversarial Training: Attacks and Defenses. International Conference on Learning Representations, 2018.

[31] T. Weng, H. Zhang, P. Chen, J. Yi, D. Su, Y. Gao, C. Hsieh, and L. Daniel. Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach. International Conference on Learning Representations, 2018.

[32] W. Xu, D. Evans, and Y. Qi. Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. Network and Distributed Systems Security Symposium, 2018.

[33] Y. Yoshida and T. Miyato. Spectral Norm Regularization for Improving the Generalizability of Deep Learning. CoRR, abs/1705.10941, 2017.

[34] S. Zagoruyko and N. Komodakis. Wide Residual Networks. In Proceedings of the British Machine Vision Conference, pages 87.1–87.12, 2016.