{"title": "Theoretical Analysis of Adversarial Learning: A Minimax Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 12280, "page_last": 12290, "abstract": "In this paper, we propose a general theoretical method for analyzing the risk bound in the presence of adversaries. Specifically, we try to fit the adversarial learning problem into the minimax framework. We first show that the original adversarial learning problem can be transformed into a minimax statistical learning problem by introducing a transport map between distributions. Then, we prove a new risk bound for this minimax problem in terms of covering numbers under a weak version of Lipschitz condition. Our method can be applied to multi-class classification and popular loss functions including the hinge loss and ramp loss. As some illustrative examples, we derive the adversarial risk bounds for SVMs and deep neural networks, and our bounds have two data-dependent terms, which can be optimized for achieving adversarial robustness.", "full_text": "Theoretical Analysis of Adversarial Learning: A Minimax Approach

Zhuozhuo Tu1, Jingwei Zhang2,1, Dacheng Tao1
1UBTECH Sydney AI Centre, School of Computer Science, The University of Sydney, Australia
2Department of Computer Science and Engineering, HKUST, Hong Kong
zhtu3055@uni.sydney.edu.au, jzhangey@cse.ust.hk, dacheng.tao@sydney.edu.au

Abstract

In this paper, we propose a general theoretical method for analyzing the risk bound in the presence of adversaries. Specifically, we try to fit the adversarial learning problem into the minimax framework. We first show that the original adversarial learning problem can be transformed into a minimax statistical learning problem by introducing a transport map between distributions. Then, we prove a new risk bound for this minimax problem in terms of covering numbers under a weak version of Lipschitz condition. 
Our method can be applied to multi-class classification and popular loss functions including the hinge loss and ramp loss. As some illustrative examples, we derive the adversarial risk bounds for SVMs and deep neural networks, and our bounds have two data-dependent terms, which can be optimized for achieving adversarial robustness.

1 Introduction

Machine learning models, especially deep neural networks, have achieved impressive performance across a variety of domains including image classification, natural language processing, and speech recognition. However, these techniques can easily be fooled by adversarial examples, i.e., carefully perturbed input samples aimed at causing misclassification during the test phase. This phenomenon was first studied in spam filtering [14, 31, 32] and has attracted considerable attention since 2014, when Szegedy et al. [42] noticed that small perturbations in images can cause misclassification in neural network classifiers. Since then, there has been considerable focus on developing adversarial attacks against machine learning algorithms [21, 9, 8, 4, 44], and, in response, many defense mechanisms have also been proposed to counter these attacks [22, 20, 15, 41, 33]. These works focus on creating optimization-based robust algorithms, but their generalization performance under adversarial input perturbations is still not fully understood.
Schmidt et al. [38] recently discussed the generalization problem in the adversarial setting and showed that the sample complexity of learning a specific distribution in the presence of l∞-bounded adversaries increases by an order of √d for all classifiers. The same paper recognized that deriving the agnostic-distribution generalization bound remained an open problem [38]. In a subsequent study, Cullina et al. [13] extended the standard PAC-learning framework to the adversarial setting by defining a corrupted hypothesis class and showed that the VC dimension of this corrupted hypothesis class, which controls the sample complexity, does not increase for halfspace classifiers in the presence of an adversary. While their work provided a theoretical understanding of the problem of learning with adversaries, it had two limitations. First, their results could only be applied to binary problems, whereas in practice we usually need to handle multi-class problems. Second, the 0-1 loss function used in their work is not convex and thus very hard to optimize.
In this paper, we propose a general theoretical method for analyzing generalization performance in the presence of adversaries. In particular, we fit the adversarial learning problem into the minimax framework [28].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In contrast to traditional statistical learning, where the underlying data distribution P is unknown but fixed, the minimax framework considers the uncertainty about the distribution P by introducing an ambiguity set and then aims to minimize the risk with respect to the worst-case distribution in this set. Motivated by Lee & Raginsky [28], we first note that the adversarial expected risk over a distribution P is equivalent to the standard expected risk under a new distribution P′. Since this new distribution is not fixed and depends on the hypothesis, we instead consider the worst case. In this way, the original adversarial learning problem is reduced to a minimax problem, and we use the minimax approach to derive the risk bound for the adversarial expected risk. Our contributions can be summarized as follows.

• We propose a general method for analyzing the risk bound in the presence of adversaries. Our method is general in several respects. 
First, the adversary we consider is general and encompasses all lq-bounded adversaries. Second, our method can be applied to multi-class problems and commonly used loss functions such as the hinge loss and ramp loss, whereas Cullina et al. [13] only considered the binary classification problem and the 0-1 loss.

• We prove a new bound for the local worst-case risk under a weak version of Lipschitz condition. Our bound is always better than that of Lee & Raginsky [29], and can recover the usual risk bound by setting the radius ε_B of the Wasserstein ball to 0, whereas they give an ε_B-free bound.

• We derive the adversarial risk bounds for SVMs and deep neural networks. Our bounds have two data-dependent terms, suggesting that minimizing the sum of the two terms can help achieve adversarial robustness.

The remainder of this paper is structured as follows. In Section 2, we discuss related works. Section 3 formally defines the problem, and we present our theoretical method in Section 4. The adversarial risk bounds for SVMs and neural networks are described in Section 5, and we conclude and discuss future directions in Section 6.

2 Related work

Our work leverages some of the benefits of statistical machine learning, summarized as follows.

2.1 Generalization in supervised learning

Generalization is a central problem in supervised learning, and the generalization capability of learning algorithms has been extensively studied. Here we review the salient aspects of generalization in supervised learning relevant to this work.
Two main approaches are used to analyze the generalization bound of a learning algorithm. 
The first is based on the complexity of the hypothesis class, such as the VC dimension [45, 46] for binary classification, Rademacher and Gaussian complexities [7, 5], and the covering number [53, 52, 6]. Note that hypothesis complexity-based analyses of generalization error are algorithm independent and consider the worst-case generalization over all functions in the hypothesis class. In contrast, the second approach is based on the properties of a learning algorithm and is therefore algorithm dependent. The properties characterizing the generalization of a learning algorithm include, for example, algorithmic stability [11, 39, 30], robustness [50], and algorithmic luckiness [24]. Some other methods exist for analyzing the generalization error in machine learning, such as the PAC-Bayesian approach [35, 2], compression-based bounds [27, 3], and information-theoretic approaches [49, 1, 37].

2.2 Minimax statistical learning

In contrast to standard empirical risk minimization in supervised learning, where test data follow the same distribution as training data, minimax statistical learning arises in problems of distributionally robust learning [16, 18, 28, 29, 40] and minimizes the worst-case risk over a family of probability distributions. Thus, it can be applied to the learning setting in which the test data distribution differs from that of the training data, such as in domain adaptation and transfer learning [12]. In particular, Gao & Kleywegt [18] proposed a dual representation of worst-case risk over the ambiguity set of probability distributions, which was given by balls in Wasserstein space. Then, Lee & Raginsky [28] derived the risk bound for minimax learning by exploiting the dual representation of worst-case risk. 
However, their minimax risk bound would go to infinity and thus become vacuous as ε_B → 0. Although the same authors later presented a new bound [29] by imposing a Lipschitz assumption to avoid this problem, their new bound was ε_B-free and cannot recover the usual risk bound by setting ε_B = 0. Sinha et al. [40] also provided a similar upper bound on the worst-case population loss over distributions defined by the Wasserstein metric via a Lagrangian penalty formulation, and their bound was efficiently computable by a principled adversarial training procedure, which provably certified distributional robustness. However, their training procedure required that the penalty parameter be large enough and thus can only achieve a small amount of robustness. Here we improve on the results in Lee & Raginsky [28, 29] and present a new risk bound for the minimax problem.

2.3 Learning with adversaries

The existence of adversaries during the test phase of a learning algorithm may render predictions made by a learning system untrustworthy. There is extensive literature on the analysis of adversarial robustness [47, 17, 23, 19] and the design of provable defenses against adversarial attacks [48, 36, 33, 40], in contrast to the relatively limited literature on risk bound analysis of adversarial learning. A comprehensive review of works on adversarial machine learning can be found in Biggio & Roli [10]. Concurrently with our work, Khim & Loh [25] and Yin et al. [51] provided different approaches to deriving adversarial risk bounds. Khim & Loh [25] derived adversarial risk bounds for linear classifiers and neural networks using a method called the supremum transform. However, their approach can only be applied to binary classification. Yin et al. [51] gave similar adversarial risk bounds through the lens of Rademacher complexity. 
Although they provided risk bounds in the multi-class setting, their work focused on l∞ adversarial attacks and was limited to one-hidden-layer ReLU neural networks. After the initial preprint of this paper, Khim & Loh [26] extended their method to the multi-class setting by considering the binary supremum transform on each component of the classifier, which as a result incurred an extra factor of the number of classes in their bound. Instead, we use a covering number analysis to derive the multi-class bound, which avoids an explicit dependence on this number.

3 Problem setup

We consider a standard statistical learning framework. Let Z = X × Y be a measurable instance space, where X and Y represent the feature and label spaces, respectively. We assume that examples are independently and identically distributed according to some fixed but unknown distribution P. The learning problem is then formulated as follows. The learner considers a class H of hypotheses h : X → Y′, where Y′ sometimes differs from Y, and a loss function l : Y′ × Y → R+. The learner receives n training examples, denoted by S = ((x1, y1), (x2, y2), ···, (xn, yn)), drawn i.i.d. from P, and tries to select a hypothesis h ∈ H that has a small expected risk. However, in the presence of adversaries, there will be imperceptible perturbations to the input of examples, which are called adversarial examples. Throughout this paper, we assume that the adversarial examples are generated by adversarially choosing an example from the neighborhood N(x) = {x′ : x′ − x ∈ B}, where B is a nonempty set. Note that the definition of N(x) is very general and encompasses all lq-bounded adversaries. We next give the formal definitions of the adversarial expected and empirical risk to measure the learner's performance in the presence of adversaries.
Definition 1 (Adversarial Expected Risk). 
The adversarial expected risk of a hypothesis h ∈ H over the distribution P in the presence of an adversary constrained by B is

R_P(h, B) = E_{(x,y)∼P}[ max_{x′∈N(x)} l(h(x′), y) ].

If B is the zero-dimensional space {0}, then the adversarial expected risk will reduce to the standard expected risk without an adversary. Since the true distribution is usually unknown, we instead use the empirical distribution to approximate the true distribution, which is equal to (xi, yi) with probability 1/n for each i ∈ {1, ···, n}. That gives us the following definition of adversarial empirical risk.
Definition 2 (Adversarial Empirical Risk). The adversarial empirical risk of h in the presence of an adversary constrained by B is

R_{Pn}(h, B) = (1/n) ∑_{i=1}^n [ max_{x′∈N(xi)} l(h(x′), yi) ].

4 Main results

In this section, we present our main results. The trick is to pushforward the original distribution P into a new distribution P′ using a transport map T_h : Z → Z satisfying

R_P(h, B) = R_{P′}(h),

where R_{P′}(h) = E_{(x,y)∼P′} l(h(x), y) is the standard expected risk without the adversary. Therefore, an upper bound on the expected risk over the new distribution leads to an upper bound on the adversarial expected risk.
Note that the new distribution P′ is not fixed and depends on the hypothesis h. As a result, traditional statistical learning cannot be directly applied. However, note that these new distributions lie within a Wasserstein ball centered on P, which we will show in Section 4.2. If we consider the worst case within this Wasserstein ball, then the original adversarial learning problem can be reduced to a minimax problem. 
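Definition 2 is straightforward to evaluate numerically whenever the inner maximization can be solved by enumeration. The sketch below is our own illustration (not part of the paper's method): it assumes scalar inputs, a linear hypothesis h(x) = w·x, the hinge loss, and the interval adversary B = [−ε, ε], and estimates the adversarial empirical risk by scanning a grid over the perturbation set.

```python
import numpy as np

def hinge(score, label):
    # hinge loss l(h(x), y) = max(0, 1 - y * h(x))
    return max(0.0, 1.0 - label * score)

def adversarial_empirical_risk(w, xs, ys, eps, n_grid=201):
    # Definition 2: average over examples of max_{x' in N(x)} l(h(x'), y),
    # with N(x) = {x' : x' - x in B} and B = [-eps, eps] scanned by a grid.
    risks = []
    for x, y in zip(xs, ys):
        worst = max(hinge(w * (x + d), y) for d in np.linspace(-eps, eps, n_grid))
        risks.append(worst)
    return sum(risks) / len(risks)

xs, ys = [1.0, -2.0, 0.5], [1, -1, 1]
clean = adversarial_empirical_risk(1.0, xs, ys, eps=0.0)   # standard empirical risk
robust = adversarial_empirical_risk(1.0, xs, ys, eps=0.3)  # risk under the adversary
```

With eps = 0 the perturbation set collapses to {0} and the quantity reduces to the usual empirical risk, matching the remark after Definition 1; with eps > 0 the inner maximum can only grow, so the adversarial risk dominates the standard one.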
We can therefore use the minimax approach to derive the adversarial risk bound. We first introduce the Wasserstein distance and minimax framework.

4.1 Wasserstein distance and local worst-case risk

Let (Z, d_Z) be a metric space, where Z = X × Y and d_Z is defined as

d_Z^p(z, z′) = d_Z^p((x, y), (x′, y′)) = d_X^p(x, x′) + d_Y^p(y, y′),

with d_X and d_Y representing the metric in the feature space and label space respectively. For example, if Y = {1, −1}, d_Y(y, y′) can be 1(y≠y′), and if Y = [−B, B], d_Y(y, y′) can be (y − y′)^2. In this paper, we require that d_X is translation invariant, i.e., d_X(x, x′) = d_X(x − x′, 0). With this metric, we denote with P(Z) the space of all Borel probability measures on Z, and with P_p(Z) the space of all P ∈ P(Z) with finite pth moments for p ≥ 1:

P_p(Z) := {P ∈ P(Z) : E_P[d_Z^p(z, z0)] < ∞ for z0 ∈ Z}.

Then, the p-Wasserstein distance between two probability measures P, Q ∈ P_p(Z) is defined as

W_p(P, Q) := inf_{M∈Γ(P,Q)} (E_{(z,z′)∼M}[d_Z^p(z, z′)])^{1/p},

where Γ(P, Q) denotes the collection of all measures on Z × Z with marginals P and Q on the first and second factors, respectively.
Now we define the local worst-case risk of h at P,

R_{ε,p}(P, h) := sup_{Q∈B^W_{ε,p}(P)} R_Q(h),

where B^W_{ε,p}(P) := {Q ∈ P_p(Z) : W_p(P, Q) ≤ ε} is the p-Wasserstein ball of radius ε ≥ 0 centered at P.
With these definitions, we next show the adversarial expected risk can be related to the local worst-case risk by a transport map T_h.

4.2 Transport map

Define a mapping T_h : Z → Z,

z = (x, y) → (x*, y),

where x* = arg max_{x′∈N(x)} l(h(x′), y). By the definition of d_Z, it is easy to obtain d_Z((x, y), (x*, y)) = d_X(x, x*). We now prove that the adversarial expected risk can be related to the standard expected risk via the mapping T_h.
Lemma 1. Let P′ = T_h#P, the pushforward of P by T_h. Then we have

R_P(h, B) = R_{P′}(h).

Proof. By the definition, we have

R_P(h, B) = E_{(x,y)∼P}[max_{x′∈N(x)} l(h(x′), y)] = E_{(x,y)∼P}[l(h(x*), y)] = E_{(x,y)∼P′}[l(h(x), y)].

So R_P(h, B) = R_{P′}(h).

By this lemma, the adversarial expected risk over a distribution P is equivalent to the standard expected risk over a new distribution P′. However, since the new distribution is not fixed and depends on the hypothesis h, traditional statistical learning cannot be directly applied. Luckily, the following lemma proves that all these new distributions locate within a Wasserstein ball centered at P.
Lemma 2. Define the radius of the adversary constrained by B as ε_B := sup_{x∈B} d_X(x, 0). For any hypothesis h and the corresponding P′ = T_h#P, we have

W_p(P, P′) ≤ ε_B.

Proof. By the definition of Wasserstein distance,

W_p^p(P, P′) ≤ E_P[d_Z^p(Z, T_h(Z))] = E_P[d_X^p(x, x*)] ≤ ε_B^p,

where the last inequality uses the translation invariant property of d_X. Therefore, we have W_p(P, P′) ≤ ε_B.

From this lemma, we can see that all possible new distributions lie within a Wasserstein ball of radius ε_B centered on P. So, by upper bounding the worst-case risk in the ball, we can bound the adversarial expected risk. The relationship between the local worst-case risk and the adversarial expected risk is given below; note that the resulting inequality holds for any p ≥ 1. 
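Lemmas 1 and 2 can be sanity-checked numerically in one dimension. The sketch below is our own illustrative code (assuming a linear score and the hinge loss, for which the inner maximum over B = [−ε, ε] is attained at an interval endpoint): it pushes each sample to its worst-case perturbation x* and confirms, on the empirical sample, both the risk identity of Lemma 1 and the Wasserstein bound of Lemma 2.

```python
import numpy as np

def hinge(score, label):
    return max(0.0, 1.0 - label * score)

def transport_map(w, x, y, eps):
    # T_h sends x to x* = argmax over N(x) of the loss; for a linear score
    # and hinge loss, the maximum over [x - eps, x + eps] sits at an endpoint.
    lo, hi = x - eps, x + eps
    return lo if hinge(w * lo, y) >= hinge(w * hi, y) else hi

w, eps = 1.0, 0.3
xs = np.array([1.0, -2.0, 0.5])
ys = np.array([1, -1, 1])
x_star = np.array([transport_map(w, x, y, eps) for x, y in zip(xs, ys)])

# Lemma 1: adversarial risk under P equals standard risk under P' = T_h # P.
adv_risk = np.mean([max(hinge(w * (x - eps), y), hinge(w * (x + eps), y))
                    for x, y in zip(xs, ys)])
push_risk = np.mean([hinge(w * x, y) for x, y in zip(x_star, ys)])

# Lemma 2: in 1-D, W1 between two equal-size samples is the mean gap after
# sorting, and each point moves by at most eps, so W1(P, P') <= eps_B.
w1 = np.mean(np.abs(np.sort(xs) - np.sort(x_star)))
```

The two risks agree exactly, and the transported sample stays inside the Wasserstein ball of radius ε_B, as the lemmas predict.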
For ease of exposition, in the rest of the paper, we only discuss the case p = 1; that is,

R_P(h, B) ≤ R_{ε_B,1}(P, h),  ∀h ∈ H.  (1)

4.3 Adversarial risk bounds

In this subsection, we first prove a bound for the local worst-case risk. Then, the adversarial risk bounds can be derived directly by (1). To simplify notation, we denote by F the function class obtained by composing the functions in H with the loss function l(·,·), i.e., F = {(x, y) → l(h(x), y) : h ∈ H}.
The key ingredient of a bound on the local worst-case risk is the following strong duality result by Gao & Kleywegt [18]:
Proposition 1. For any upper semicontinuous function f : Z → R and for any P ∈ P_p(Z),

R_{ε_B,1}(P, f) = min_{λ≥0} {λε_B + E_P[φ_{λ,f}(z)]},

where φ_{λ,f}(z) := sup_{z′∈Z} {f(z′) − λ · d_Z(z, z′)}.
We begin with some assumptions.
Assumption 1. The instance space Z is bounded: diam(Z) := sup_{z,z′∈Z} d_Z(z, z′) < ∞.
Assumption 2. The functions in F are upper semicontinuous and uniformly bounded: 0 ≤ f(z) ≤ M < ∞ for all f ∈ F and z ∈ Z.
Assumption 3. For any function f ∈ F and any z ∈ Z, there exists λ_{f,z} such that f(z′) − f(z) ≤ λ_{f,z} d_Z(z, z′) for any z′ ∈ Z.
Note that Assumption 3 is a weak version of the Lipschitz condition, since λ_{f,z} is not fixed and depends on f and z. It is easy to see that if the function f ∈ F is L-Lipschitz with respect to the metric d_Z, i.e., |f(z) − f(z′)| ≤ L d_Z(z, z′), Assumption 3 automatically holds with λ_{f,z} always being L. Now we give an equivalent expression for Assumption 3 which is easier to use in our proofs.
Lemma 3. 
Assumption 3 holds if and only if, for any function f ∈ F and any empirical distribution Pn, the set {λ : ψ_{f,Pn}(λ) = 0} is nonempty, where ψ_{f,Pn}(λ) := E_{Pn}( sup_{z′∈Z} {f(z′) − λ d_Z(z, z′) − f(z)} ).
The proof of Lemma 3 is contained in Appendix A.
We denote the smallest value in the set as λ+_{f,Pn} := inf{λ : ψ_{f,Pn}(λ) = 0}. In order to prove the local worst-case risk bound, we need two technical lemmas.
Lemma 4. Fix some f ∈ F. Define λ̄ via

λ̄ := arg min_{λ≥0} {λε_B + E_{Pn}[φ_{λ,f}(Z)]}.

Then

λ̄ ∈ [0, M/ε_B]  if ε_B ≥ M/λ+_{f,Pn};
λ̄ ∈ [λ−_{f,Pn}, λ+_{f,Pn}]  if ε_B < M/λ+_{f,Pn},  (2)

where λ−_{f,Pn} := sup{λ : ψ_{f,Pn}(λ) = λ+_{f,Pn} · ε_B} if the set {λ : ψ_{f,Pn}(λ) = λ+_{f,Pn} · ε_B} is nonempty, otherwise λ−_{f,Pn} := 0.
Remark 1. We can show that lim_{ε_B→0} λ−_{f,Pn} = λ+_{f,Pn} by using (ε, δ) language as follows. ∀ε > 0, define δ = ψ_{f,Pn}(λ+_{f,Pn} − ε) / λ+_{f,Pn}. Then, for any ε_B < δ, we have ψ_{f,Pn}(λ+_{f,Pn} − ε) > λ+_{f,Pn} · ε_B. By the definition of λ−_{f,Pn}, ψ_{f,Pn}(λ−_{f,Pn}) = λ+_{f,Pn} · ε_B. Since ψ_{f,Pn}(λ) is monotonically non-increasing, we have λ−_{f,Pn} > λ+_{f,Pn} − ε. Therefore, lim_{ε_B→0} λ−_{f,Pn} = λ+_{f,Pn}.
Lemma 5. Define the function class Φ := {φ_{λ,f} : λ ∈ [a, b], f ∈ F}, where b ≥ a ≥ 0. Then, the expected Rademacher complexity of the function class Φ satisfies

R_n(Φ) ≤ 12C(F)/√n + (6√π/√n) (b − a) · diam(Z),

where C(F) := ∫_0^∞ √(log N(F, ||·||∞, u/2)) du and N(F, ||·||∞, u/2) denotes the covering number of F.
The proofs of Lemmas 4 and 5 are contained in Appendix B.
We are now ready to prove the local worst-case risk bound. Let [ζ−_{f,Pn}, ζ+_{f,Pn}] denote the interval containing λ̄ in expression (2), let [ζ−, ζ+] := ∪_{f,Pn} [ζ−_{f,Pn}, ζ+_{f,Pn}], and let Λ_{ε_B} := ζ+ − ζ−. It is straightforward to check that [ζ−, ζ+] ⊂ [0, M/ε_B] from expression (2). The generalization bound for the local worst-case risk is given by the following lemma.
Lemma 6. If Assumptions 1–3 hold, then for any f ∈ F, we have

R_{ε_B,1}(P, f) − R_{ε_B,1}(Pn, f) ≤ 24C(F)/√n + (12√π/√n) Λ_{ε_B} · diam(Z) + M √(log(1/δ)/(2n))

with probability at least 1 − δ.
Remark 2. Lee & Raginsky [29] proved a bound with Λ_{ε_B} ≡ L under the Lipschitz assumption, where L represents the Lipschitz constant. Our result improves a lot on theirs. First, our Assumption 3 is weaker than their Lipschitz assumption. 
Second, even under the weaker assumptions, our bound is always better than their results, since [ζ−, ζ+] ⊂ [0, L] by expression (2) and the definition of λ+_{f,Pn}. Finally, by setting ε_B = 0, the term (12√π/√n) Λ_{ε_B} · diam(Z) in our bound will vanish, recovering the usual risk bound, whereas they gave an ε_B-free bound with Λ_{ε_B} always being the constant L.
This leads to our main theorem for the adversarial expected risk.
Theorem 1. If Assumptions 1–3 hold, for any f ∈ F, we have

R_P(f, B) ≤ (1/n) ∑_{i=1}^n f(z_i) + min_{λ≥0} {λε_B + ψ_{f,Pn}(λ)} + 24C(F)/√n + (12√π/√n) Λ_{ε_B} · diam(Z) + M √(log(1/δ)/(2n))  (3)

and

R_P(f, B) ≤ (1/n) ∑_{i=1}^n f(z_i) + λ+_{f,Pn} ε_B + 24C(F)/√n + (12√π/√n) Λ_{ε_B} · diam(Z) + M √(log(1/δ)/(2n))  (4)

with probability at least 1 − δ.
Remark 3. We are interested in how the adversarial risk bounds differ from the case in which the adversary is absent. Plugging ε_B = 0 into inequality (3) or (4) yields the usual generalization bound of the form

R_P(h) ≤ (1/n) ∑_{i=1}^n f(z_i) + 24C(F)/√n + M √(log(1/δ)/(2n)).

So the effect of an adversary is to introduce an extra complexity term (12√π/√n) Λ_{ε_B} · diam(Z) and an additional term on ε_B which contributes to the empirical risk.
Remark 4. The extra complexity term will decrease as ε_B gets bigger if ε_B ≥ M/λ+_{f,Pn} by expression (2), indicating that a stronger adversary might have a negative impact on the hypothesis class complexity. 
This is intuitive, since different hypotheses might have the same performance in the presence of a strong adversary and, therefore, the hypothesis class complexity will decrease. We emphasize that this phenomenon does not occur in the concurrent works [25, 51]; in both of those works, this term increases linearly as ε_B grows.
Remark 5. There are two data-dependent terms, (1/n) ∑_{i=1}^n f(z_i) and min_{λ≥0} {λε_B + ψ_{f,Pn}(λ)} (or λ+_{f,Pn} ε_B), in bound (3) (or (4)), corresponding to the empirical risk and the effect of the adversary on the empirical risk, respectively. Although bound (3) is tighter, it is hard to optimize because of the inner minimization problem. Bound (4) cannot be directly minimized either, because λ+_{f,Pn} is computationally intractable in practice. But we can consider an upper bound for λ+_{f,Pn}. For example, if f is L-Lipschitz, by the definition of λ+_{f,Pn}, we have λ+_{f,Pn} ≤ L. See Section 5 for more examples. This upper bound for λ+_{f,Pn} can be used in optimization, as we will discuss in Section 6. In particular, if ψ_{f,Pn}(λ) ≡ 0 for any λ ≥ 0, we get λ+_{f,Pn} = 0, and the additional term λ+_{f,Pn} ε_B in inequality (4) will disappear.

5 Example bounds

In this section, we illustrate the application of Theorem 1 to two commonly used models: SVMs and neural networks.

5.1 Support vector machines

We first start with SVMs. Let Z = X × Y, where the feature space X = {x ∈ R^d : ||x||_2 ≤ r} and the label space Y = {−1, +1}. Equip Z with the Euclidean metric

d_Z(z, z′) = d_Z((x, y), (x′, y′)) = ||x − x′||_2 + 1(y≠y′).

Consider the hypothesis space F = {(x, y) → max{0, 1 − y h(x)} : h ∈ H}, where H = {x → w · x : ||w||_2 ≤ Λ}. We can now derive the expected risk bound for SVMs in the presence of an adversary.
Corollary 1. 
For the SVM setting considered above, for any f ∈ F, with probability at least 1 − δ,

R_P(f, B) ≤ (1/n) ∑_{i=1}^n f(z_i) + λ+_{f,Pn} ε_B + (144/√n) Λr√d + (12√π/√n) Λ_{ε_B} · (2r + 1) + (1 + Λr) √(log(1/δ)/(2n)),

where λ+_{f,Pn} ≤ max_i {2 y_i w · x_i, ||w||_2}.
The proof of Corollary 1 can be found in Appendix E.

5.2 Neural networks

We next consider feed-forward neural networks. To demonstrate the generality of our method, we consider a multi-class prediction problem. Let Z = X × Y, where the feature space X = {x ∈ R^d : ||x||_2 ≤ B} and the label space Y = {1, 2, ···, k}; k represents the number of classes. The network uses L fixed nonlinear activation functions (σ1, σ2, ···, σL), where σi is ρi-Lipschitz and satisfies σi(0) = 0. Given L weight matrices A = (A1, A2, ···, AL), the network computes the following function

H_A(x) := σL(AL σL−1(AL−1 σL−2(··· σ2(A2 σ1(A1 x)) ···))),

where Ai ∈ R^{di×di−1} and H_A : R^d → R^k with d0 = d and dL = k. Let W = max{d0, d1, ···, dL}. Define a margin operator M : R^k × {1, 2, ···, k} → R as M(v, y) := v_y − max_{j≠y} v_j and the ramp loss l_γ : R → R+ as

l_γ(r) := 0 if r < −γ;  1 + r/γ if r ∈ [−γ, 0];  1 if r > 0.

Consider the hypothesis class F = {(x, y) → l_γ(−M(H_A(x), y)) : A = (A1, A2, ···, AL), ||Ai||_σ ≤ si, ||Ai||_F ≤ bi}, where ||·||_σ represents the spectral norm and ||·||_F denotes the Frobenius norm. The metric in the space Z is defined as

d_Z(z, z′) = d_Z((x, y), (x′, y′)) = ||x − x′||_2 + 1(y≠y′).

Now we derive the adversarial expected risk bound for neural networks.
Corollary 2. For the neural network setting defined above, for any f ∈ F, with probability at least 1 − δ, the following inequality holds:

R_P(f, B) ≤ (1/n) ∑_{i=1}^n f(z_i) + λ+_{f,Pn} ε_B + (288/(γ√n)) ( ∑_{i=1}^L (bi/si)^{1/2} )^2 ∏_{i=1}^L ρi si · BW + (12√π/√n) Λ_{ε_B} · (2B + 1) + √(log(1/δ)/(2n)),

where λ+_{f,Pn} ≤ max_j { (2/γ) ∏_{i=1}^L ρi ||Ai||_σ, (1/γ) ( M(H_A(x_j), y_j) + max H_A(x_j) − min H_A(x_j) ) }.
The proof of this corollary is provided in Appendix F.
Remark 6. Setting ε_B = 0, we obtain a risk bound for neural networks:

R_P(f) ≤ (1/n) ∑_{i=1}^n f(z_i) + (288/(γ√n)) ( ∑_{i=1}^L (bi/si)^{1/2} )^2 ∏_{i=1}^L ρi si · BW + √(log(1/δ)/(2n)).  (5)

The bound is in terms of the spectral norm and the Frobenius norm. Although inequality (5) is similar to the results in Bartlett et al. [6] and Neyshabur et al. [35], since our proof technique is different, our approach may provide a different perspective on the generalization of deep neural networks.

6 Conclusions

In this paper, we propose a theoretical method for deriving adversarial risk bounds. Our method is general and can easily be applied to multi-class problems and most of the commonly used loss functions. The bound may be loose in some cases, since we consider the worst-case distribution in the Wasserstein ball to avoid computing the transport map. However, for some problems, it may be possible to derive the transport map and thus get tighter bounds. 
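Returning to Corollary 1, its data-dependent upper bound on λ+_{f,Pn} is cheap to evaluate from a trained linear model. A small sketch (our own illustrative names, assuming the SVM setting of Section 5.1) computes the stated bound max_i {2 y_i w·x_i, ||w||_2}; its product with ε_B gives the adversarial term of bound (4).

```python
import numpy as np

def svm_lambda_plus_bound(w, X, y):
    # Corollary 1's upper bound: max over i of max{ 2 * y_i * <w, x_i>, ||w||_2 }.
    margins = 2.0 * y * (X @ w)
    return max(margins.max(), np.linalg.norm(w))

w = np.array([1.0, -0.5])
X = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])
y = np.array([1, -1, 1])
lam_plus = svm_lambda_plus_bound(w, X, y)
adv_term = lam_plus * 0.1  # contribution lambda+ * eps_B for eps_B = 0.1
```

Because both quantities are plain functions of (w, X, y), they can be tracked during training, in line with the suggestion that minimizing the sum of the empirical risk and the adversarial term helps achieve robustness.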
Furthermore, our bounds may be made tighter by relying on the expected Rademacher complexity directly instead of using covering numbers.
In the future, one interesting problem is to develop adversarially robust algorithms based on our results. For example, our bounds suggest that minimizing the sum of the empirical risk and the term λ+_{f,Pn} ε_B can help achieve adversarial robustness. However, since λ+_{f,Pn} is computationally intractable in practice, instead of using the exact λ+_{f,Pn} in the objective function, we may consider the data-dependent upper bound for λ+_{f,Pn}, which is usually easier to obtain, together with a regularization parameter η ∈ [0, 1] selected via grid search. For a fixed η, we multiply it by the upper bound for λ+_{f,Pn} and use this product as a surrogate of the true λ+_{f,Pn} in the objective function. Afterward, we minimize this surrogate objective function and obtain the optimal solution for this specific η. Each such η corresponds to a solution. Finally, we choose the best one from these candidates.

Acknowledgments

We thank the reviewers for their constructive comments that helped improve the paper significantly. This work was supported by the ARC FL-170100117.

References

[1] I. M. Alabdulmohsin. Algorithmic stability and uniform generalization. In Advances in Neural Information Processing Systems, pages 19–27, 2015.

[2] A. Ambroladze, E. Parrado-Hernández, and J. S. Shawe-Taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information Processing Systems, pages 9–16, 2007.

[3] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.

[4] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. 
arXiv preprint arXiv:1802.00420, 2018.

[5] P. L. Bartlett, O. Bousquet, S. Mendelson, et al. Local Rademacher complexities. The Annals of Statistics, 33(4):1497-1537, 2005.

[6] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240-6249, 2017.

[7] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

[8] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 387-402. Springer, 2013.

[9] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.

[10] B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317-331, 2018.

[11] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499-526, 2002.

[12] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853-1865, 2017.

[13] D. Cullina, A. N. Bhagoji, and P. Mittal. PAC-learning in the presence of adversaries. In Advances in Neural Information Processing Systems, pages 228-239, 2018.

[14] N. Dalvi, P. Domingos, S. Sanghai, D. Verma, et al. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-108. ACM, 2004.

[15] O. Dekel, O. Shamir, and L. Xiao.
Learning to classify with missing and corrupted features. Machine Learning, 81(2):149-178, 2010.

[16] F. Farnia and D. Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, pages 4240-4248, 2016.

[17] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1632-1640, 2016.

[18] R. Gao and A. J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.

[19] J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.

[20] A. Globerson and S. Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd International Conference on Machine Learning, pages 353-360. ACM, 2006.

[21] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[22] S. Gu and L. Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.

[23] M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, pages 2266-2276, 2017.

[24] R. Herbrich and R. C. Williamson. Algorithmic luckiness. Journal of Machine Learning Research, 3(Sep):175-212, 2002.

[25] J. Khim and P.-L. Loh. Adversarial risk bounds for binary classification via function transformation. arXiv preprint arXiv:1810.09519, 2018.

[26] J. Khim and P.-L. Loh. Adversarial risk bounds via function transformation. arXiv preprint arXiv:1810.09519v2, 2018.

[27] J. Langford.
Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(Mar):273-306, 2005.

[28] J. Lee and M. Raginsky. Minimax statistical learning and domain adaptation with Wasserstein distances. arXiv preprint arXiv:1705.07815, 2017.

[29] J. Lee and M. Raginsky. Minimax statistical learning with Wasserstein distances. In Advances in Neural Information Processing Systems, pages 2692-2701, 2018.

[30] T. Liu, G. Lugosi, G. Neu, and D. Tao. Algorithmic stability and hypothesis complexity. arXiv preprint arXiv:1702.08712, 2017.

[31] D. Lowd and C. Meek. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 641-647. ACM, 2005.

[32] D. Lowd and C. Meek. Good word attacks on statistical spam filters. In CEAS, volume 2005, 2005.

[33] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

[34] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[35] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.

[36] A. Raghunathan, J. Steinhardt, and P. Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.

[37] D. Russo and J. Zou. Controlling bias in adaptive data analysis using information theory. In A. Gretton and C. C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 1232-1240, Cadiz, Spain, 09-11 May 2016. PMLR.

[38] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Mądry.
Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.

[39] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635-2670, 2010.

[40] A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

[41] A. S. Suggala, A. Prasad, V. Nagarajan, and P. Ravikumar. On adversarial risk and training. arXiv preprint arXiv:1806.02924, 2018.

[42] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[43] M. Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer Science & Business Media, 2014.

[44] J. Uesato, B. O'Donoghue, A. v. d. Oord, and P. Kohli. Adversarial risk and the dangers of evaluating against weak attacks. arXiv preprint arXiv:1802.05666, 2018.

[45] V. Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

[46] V. N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988-999, 1999.

[47] Y. Wang, S. Jha, and K. Chaudhuri. Analyzing the robustness of nearest neighbors to adversarial examples. arXiv preprint arXiv:1706.03922, 2017.

[48] E. Wong and Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5283-5292, 2018.

[49] A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pages 2524-2533, 2017.

[50] H. Xu and S. Mannor.
Robustness and generalization. Machine Learning, 86(3):391-423, 2012.

[51] D. Yin, K. Ramchandran, and P. Bartlett. Rademacher complexity for adversarially robust generalization. arXiv preprint arXiv:1810.11914, 2018.

[52] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527-550, 2002.

[53] D.-X. Zhou. The covering number in learning theory. Journal of Complexity, 18(3):739-767, 2002.