{"title": "Robust Attribution Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 14300, "page_last": 14310, "abstract": "An emerging problem in trustworthy machine learning is to train models that produce robust interpretations for their predictions. We take a step towards solving this problem through the lens of axiomatic attribution of neural networks. Our theory is grounded in the recent work, Integrated Gradients (IG) [STY17], in axiomatically attributing a neural network\u2019s output change to its input change. We propose training objectives in classic robust optimization models to achieve robust IG attributions. Our objectives give principled generalizations of previous objectives designed for robust predictions, and they naturally degenerate to classic soft-margin training for one-layer neural networks. We also generalize previous theory and prove that the objectives for different robust optimization models are closely related. Experiments demonstrate the effectiveness of our method, and also point to intriguing problems which hint at the need for better optimization techniques or better neural network architectures for robust attribution training.", "full_text": "Robust Attribution Regularization\n\nJiefeng Chen \u21e4 1 Xi Wu \u21e4 2 Vaibhav Rastogi \u20202\n\n1 University of Wisconsin-Madison\n\nYingyu Liang 1\n\nSomesh Jha 1,3\n\n2 Google\n\n3 XaiPient\n\nAbstract\n\nAn emerging problem in trustworthy machine learning is to train models that pro-\nduce robust interpretations for their predictions. We take a step towards solving\nthis problem through the lens of axiomatic attribution of neural networks. Our\ntheory is grounded in the recent work, Integrated Gradients (IG) [STY17], in\naxiomatically attributing a neural network\u2019s output change to its input change.\nWe propose training objectives in classic robust optimization models to achieve\nrobust IG attributions. 
Our objectives give principled generalizations of previous\nobjectives designed for robust predictions, and they naturally degenerate to classic\nsoft-margin training for one-layer neural networks. We also generalize previous\ntheory and prove that the objectives for different robust optimization models are\nclosely related. Experiments demonstrate the effectiveness of our method, and\nalso point to intriguing problems which hint at the need for better optimization\ntechniques or better neural network architectures for robust attribution training.\n\n1\n\nIntroduction\n\nTrustworthy machine learning has received considerable attention in recent years. An emerging\nproblem to tackle in this domain is to train models that produce reliable interpretations for their\npredictions. For example, a pathology prediction model may predict certain images as containing\nmalignant tumor. Then one would hope that under visually indistinguishable perturbations of an\nimage, similar sections of the image, instead of entirely different ones, can account for the pre-\ndiction. However, as Ghorbani, Abid, and Zou [GAZ17] convincingly demonstrated, for existing\nmodels, one can generate minimal perturbations that substantially change model interpretations,\nwhile keeping their predictions intact. Unfortunately, while the robust prediction problem of ma-\nchine learning models is well known and has been extensively studied in recent years (for example,\n[MMS+17a, SND18, WK18], and also the tutorial by Madry and Kolter [KM18]), there has only\nbeen limited progress on the problem of robust interpretations.\nIn this paper we take a step towards solving this problem by viewing it through the lens of ax-\niomatic attribution of neural networks, and propose Robust Attribution Regularization. Our theory\nis grounded in the recent work, Integrated Gradients (IG) [STY17], in axiomatically attributing a\nneural network\u2019s output change to its input change. 
Specifically, given a model f, two input vectors $x, x'$, and an input coordinate $i$, $\mathrm{IG}^f_i(x, x')$ defines a path integration (parameterized by a curve from $x$ to $x'$) that assigns a number to the $i$-th input as its "contribution" to the change of the model's output from $f(x)$ to $f(x')$. IG enjoys several natural theoretical properties (such as the Axiom of Completeness³) that other related methods violate.

*Equal contribution.
†Work done while at UW-Madison.

Due to lack of space and for completeness, we put some definitions (such as coupling) in Section B.1. Code for this paper is publicly available at the following repository: https://github.com/jfc43/robust-attribution-regularization
³The Axiom of Completeness says that summing up attributions of all components should give $f(x') - f(x)$.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

NATURAL — Top-1000 Intersection: 0.1%, Kendall's Correlation: 0.2607
IG-NORM — Top-1000 Intersection: 58.8%, Kendall's Correlation: 0.6736
IG-SUM-NORM — Top-1000 Intersection: 60.1%, Kendall's Correlation: 0.6951

Figure 1: Attribution robustness comparing different models. Top-1000 Intersection and Kendall's Correlation are rank correlations between original and perturbed saliency maps. NATURAL is the naturally trained model; IG-NORM and IG-SUM-NORM are models trained using our robust attribution method. We use attribution attacks described in [GAZ17] to perturb the attributions while keeping predictions intact. For all images, the models give the correct prediction – Windflower. 
However, the saliency maps (also called feature importance maps), computed via IG, show that attributions of the naturally trained model are very fragile, both visually and quantitatively as measured by correlation analyses, while models trained using our method are much more robust in their attributions.

We briefly overview our approach. Given a loss function ℓ and a data generating distribution $P$, our Robust Attribution Regularization objective contains two parts: (1) achieving a small loss over the distribution $P$, and (2) keeping the IG attributions of the loss ℓ over $P$ "close" to the IG attributions over $Q$ whenever distributions $P$ and $Q$ are close to each other. We can naturally encode these two goals in two classic robust optimization models: (1) the uncertainty set model [BTEGN09], where we treat sample points as "nominal" points and assume that true sample points come from a certain vicinity around them, which gives

$$\min_\theta \; \mathbb{E}_{(x,y)\sim P}[\rho(x,y;\theta)], \quad \text{where } \rho(x,y;\theta) = \ell(x,y;\theta) + \lambda \max_{x' \in N(x,\varepsilon)} s\big(\mathrm{IG}^{\ell_y}_{h}(x, x'; r)\big),$$

where $\mathrm{IG}^{\ell_y}_{h}(\cdot)$ is the attribution w.r.t. neurons in an intermediate layer $h$, and $s(\cdot)$ is a size function (e.g., $\|\cdot\|_2$) measuring the size of IG; and (2) the distributional robustness model [SND18, MEK15], where closeness between $P$ and $Q$ is measured using metrics such as Wasserstein distance, which gives

$$\min_\theta \; \mathbb{E}_P[\ell(P;\theta)] + \lambda \sup_{Q;\, M \in \mathcal{Q}(P,Q)} \Big\{ \mathbb{E}_{Z,Z'}[d_{\mathrm{IG}}(Z,Z')] \;\; \text{s.t.} \;\; \mathbb{E}_{Z,Z'}[c(Z,Z')] \le \rho \Big\}.$$

In this formulation, $\mathcal{Q}(P,Q)$ is the set of couplings of $P$ and $Q$, and $M = (Z, Z')$ is one coupling. $c(\cdot,\cdot)$ is a metric, such as $\|\cdot\|_2$, to measure the cost of an adversary perturbing $z$ to $z'$. $\rho$ is an upper bound on the expected perturbation cost, thus constraining $P$ and $Q$ to be "close" to each other. 
$d_{\mathrm{IG}}$ is a metric to measure the change of attributions from $Z$ to $Z'$, where we want a large $d_{\mathrm{IG}}$-change under a small $c$-change. The supremum is taken over $Q$ and $\mathcal{Q}(P,Q)$.

We provide theoretical characterizations of our objectives. First, we show that they give principled generalizations of previous objectives designed for robust predictions. Specifically, under weak instantiations of the size function $s(\cdot)$ and of how we estimate IG computationally, we can leverage axioms satisfied by IG to recover the robust prediction objective of [MMS+17a], the input gradient regularization objective of [RD18], and also the distributional robust prediction objective of [SND18]. These results provide theoretical evidence that robust prediction training can provide some control over robust interpretations. Second, for one-layer neural networks, we prove that instantiating $s(\cdot)$ as the 1-norm coincides with instantiating $s(\cdot)$ as sum, and further coincides with classic soft-margin training, which implies that for generalized linear classifiers, soft-margin training will robustify both predictions and interpretations. Finally, we generalize previous theory on distributional robust prediction [SND18] to our objectives, and show that they are closely related.

Through detailed experiments we study the effect of our method in robustifying attributions. On the MNIST, Fashion-MNIST, GTSRB and Flower datasets, we report encouraging improvements in attribution robustness. Compared with naturally trained models, we show significantly improved attribution robustness, as well as prediction robustness. Compared with Madry et al.'s model [MMS+17a] trained for robust predictions, we demonstrate comparable prediction robustness (sometimes even better), while consistently improving attribution robustness. 
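Attribution robustness is quantified throughout by comparing original and perturbed attribution maps via rank correlations (Top-k intersection and Kendall's tau; see Figure 1 and Section 5). A minimal sketch of how such metrics can be computed — our illustration, with a hand-rolled O(n²) Kendall's tau (no tie handling) rather than the paper's tooling:

```python
import numpy as np

def topk_intersection(attr_a, attr_b, k):
    """Fraction of the k most important features shared by two attribution maps."""
    top_a = set(np.argsort(-np.abs(attr_a).ravel())[:k])
    top_b = set(np.argsort(-np.abs(attr_b).ravel())[:k])
    return len(top_a & top_b) / k

def kendall_tau(a, b):
    """Kendall's tau rank correlation, naive O(n^2) pair-counting version."""
    a, b = np.ravel(a), np.ravel(b)
    n = len(a)
    concordance = sum(
        np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
        for i in range(n) for j in range(i + 1, n)
    )
    return 2.0 * concordance / (n * (n - 1))
```

An unperturbed map compared with itself gives top-k intersection 1.0 and tau 1.0; attacks such as IFIA [GAZ17] drive both down.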
We observe that even when our training stops, the attribution regularization term remains much more significant compared to the natural loss term. We discuss this problem and point out that current optimization techniques may not have effectively optimized our objectives. These results hint at the need for better optimization techniques or new neural network architectures that are more amenable to robust attribution training.

The rest of the paper is organized as follows: Section 2 briefly reviews necessary background. Section 3 presents our framework for robustifying attributions, and proves theoretical characterizations. Section 4 presents instantiations of our method and their optimization, and we report experimental results in Section 5. Finally, Section 6 concludes with a discussion on future directions.

2 Preliminaries

Axiomatic attribution and Integrated Gradients. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a real-valued function, and let $x$ and $x'$ be two input vectors. Given that the function value changes from $f(x)$ to $f(x')$, a basic question is: "How do we attribute the function value change to the input variables?" A recent work by Sundararajan, Taly and Yan [STY17] provides an axiomatic answer to this question. Formally, let $r : [0,1] \to \mathbb{R}^d$ be a curve such that $r(0) = x$ and $r(1) = x'$. Integrated Gradients (IG) for input variable $i$ is defined as the following integral:

$$\mathrm{IG}^f_i(x, x'; r) = \int_0^1 \frac{\partial f(r(t))}{\partial x_i}\, r'_i(t)\, dt, \qquad (1)$$

which formalizes the contribution of the $i$-th variable as the integration of the $i$-th partial derivative as we move along curve $r$. Let $\mathrm{IG}^f(x, x'; r)$ be the vector whose $i$-th component is $\mathrm{IG}^f_i$; then $\mathrm{IG}^f$ satisfies some natural axioms. For example, the Axiom of Completeness says that summing all coordinates gives the change of function value: $\mathrm{sum}(\mathrm{IG}^f(x, x'; r)) = \sum_{i=1}^d \mathrm{IG}^f_i(x, x'; r) = f(x') - f(x)$. 
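A minimal numpy sketch (ours, not the authors' code) of definition (1) under the straight-line curve $r(t) = x + t(x' - x)$, using a left Riemann sum, together with a numerical check of the Axiom of Completeness on a toy analytic function:

```python
import numpy as np

def integrated_gradients(grad_f, x, x_prime, m=2000):
    """Left Riemann sum for IG_i = int_0^1 (df(r(t))/dx_i) r'_i(t) dt with the
    straight line r(t) = x + t*(x' - x), so r'(t) = x' - x for every t."""
    diff = x_prime - x
    grads = np.array([grad_f(x + (k / m) * diff) for k in range(m)])
    return grads.mean(axis=0) * diff

# Toy function f(x) = sum(x_i^2), whose gradient is 2x (values are illustrative).
f = lambda x: float(np.sum(x ** 2))
grad_f = lambda x: 2.0 * x
x, x_prime = np.array([1.0, -2.0]), np.array([0.5, 3.0])
ig = integrated_gradients(grad_f, x, x_prime)
# Completeness: the attributions should sum to approximately f(x') - f(x).
```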
We refer readers to the paper [STY17] for the other axioms that IG satisfies.

Integrated Gradients for an intermediate layer. We can generalize the theory of IG to an intermediate layer of neurons. The key insight is to leverage the fact that Integrated Gradients is a curve integration. Therefore, given some hidden layer $h = [h_1, \ldots, h_l]$, computed by a function $h(x)$ induced by the previous layers, one can naturally view the previous layers as inducing a curve $h \circ r$ which moves from $h(x)$ to $h(x')$ as we move from $x$ to $x'$ along curve $r$. Viewed this way, we can thus naturally compute IG for $h$ in a way that leverages all layers of the network.⁴

Lemma 1. Under a curve $r : [0,1] \to \mathbb{R}^d$ such that $r(0) = x$ and $r(1) = x'$ for moving $x$ to $x'$, and the function $h$ induced by the layers before $h$, the attribution for $h_i$ for a differentiable $f$ is

$$\mathrm{IG}^f_{h_i}(x, x') = \sum_{j=1}^d \int_0^1 \frac{\partial f(h(r(t)))}{\partial h_i}\, \frac{\partial h_i}{\partial x_j}\, r'_j(t)\, dt. \qquad (2)$$

The corresponding summation approximation is:

$$\mathrm{IG}^f_{h_i}(x, x') \approx \frac{1}{m} \sum_{j=1}^d \sum_{k=0}^{m-1} \frac{\partial f(h(r(k/m)))}{\partial h_i}\, \frac{\partial h_i}{\partial x_j}\, r'_j(k/m). \qquad (3)$$

⁴Proofs are deferred to Section B.2.

3 Robust Attribution Regularization

In this section we propose objectives for achieving robust attribution, and study their connections with existing robust training objectives. At a high level, given a loss function ℓ and a data generating distribution $P$, our objectives contain two parts: (1) achieving a small loss over the data generating distribution $P$, and (2) keeping the IG attributions of the loss ℓ over $P$ "close" to the IG attributions over distribution $Q$ whenever $P$ and $Q$ are close to each other. We can naturally encode these two goals in existing robust optimization models. 
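To make the two-part structure concrete before instantiating the models, here is a toy sketch (ours; the model, loss, and all constants are illustrative stand-ins) of the uncertainty-set objective's per-example value $\ell(x) + \lambda \max_{x' \in N(x,\varepsilon)} \|\mathrm{IG}(x,x')\|_1$, with the inner maximization crudely approximated by random search over the $\ell_\infty$ ball rather than the PGD adversary used later in the paper:

```python
import numpy as np

w = np.array([1.0, -2.0])                      # toy "model" parameters

def loss(x):                                   # l_y(x) = log(1 + exp(-<w, x>))
    return float(np.log1p(np.exp(-np.dot(w, x))))

def grad_loss(x):                              # gradient of the toy loss w.r.t. x
    return -w / (1.0 + np.exp(np.dot(w, x)))

def ig(x, x_prime, m=50):
    """Summation approximation of IG along the straight line from x to x'."""
    diff = x_prime - x
    grads = np.array([grad_loss(x + (k / m) * diff) for k in range(m)])
    return grads.mean(axis=0) * diff

def rho(x, eps, lam=1.0, trials=200, seed=0):
    """loss(x) + lam * max_{||x'-x||_inf <= eps} ||IG(x, x')||_1,
    with the inner max approximated by random search (a stand-in for PGD)."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(trials):
        x_prime = x + eps * rng.uniform(-1.0, 1.0, size=x.shape)
        best = max(best, float(np.abs(ig(x, x_prime)).sum()))
    return loss(x) + lam * best
```

With eps = 0 the regularizer vanishes and rho reduces to the plain loss; growing eps can only increase it.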
Below we do so for two popular models: the uncertainty set model and the distributional robustness model.

3.1 Uncertainty Set Model

In the uncertainty set model, for any sample $(x, y) \sim P$ for a data generating distribution $P$, we think of it as a "nominal" point and assume that the real sample comes from a neighborhood around $x$. In this case, given any intermediate layer $h$, we propose the following objective function:

$$\min_\theta \; \mathbb{E}_{(x,y)\sim P}[\rho(x,y;\theta)], \quad \text{where } \rho(x,y;\theta) = \ell(x,y;\theta) + \lambda \max_{x' \in N(x,\varepsilon)} s\big(\mathrm{IG}^{\ell_y}_{h}(x, x'; r)\big), \qquad (4)$$

where $\lambda \ge 0$ is a regularization parameter, $\ell_y$ is the loss function with label $y$ fixed: $\ell_y(x;\theta) = \ell(x,y;\theta)$, $r : [0,1] \to \mathbb{R}^d$ is a curve parameterization from $x$ to $x'$, and $\mathrm{IG}^{\ell_y}$ is the integrated gradients of $\ell_y$, which therefore gives the attribution of changes of $\ell_y$ as we go from $x$ to $x'$. $s(\cdot)$ is a size function that measures the "size" of the attribution.⁵

We now study some particular instantiations of objective (4). Specifically, we recover existing robust training objectives under weak instantiations (such as choosing $s(\cdot)$ as the summation function, which is not a metric, or using a crude approximation of IG), and also derive new instantiations that are natural extensions of existing ones.

Proposition 1 (Madry et al.'s robust prediction objective). If we set $\lambda = 1$ and let $s(\cdot)$ be the sum function (summing all components of a vector), then for any curve $r$ and any intermediate layer $h$, (4) is exactly the objective proposed by Madry et al. [MMS+17a], where $\rho(x,y;\theta) = \max_{x' \in N(x,\varepsilon)} \ell(x',y;\theta)$.

We note that: (1) sum is a weak size function which does not give a metric. 
(2) As a result, while this robust prediction objective falls within our framework and regularizes robust attributions, it allows a small regularization term where attributions actually change significantly but cancel each other in summation. Therefore, the control over robust attributions can be weak.

Proposition 2 (Input gradient regularization). For any $\lambda_0 > 0$ and $q \ge 1$, if we set $\lambda = \lambda_0/\varepsilon^q$, $s(\cdot) = \|\cdot\|_1^q$, and use only the first term of the summation approximation (3) to approximate IG, then (4) becomes exactly the input gradient regularization of Drucker and LeCun [DL92], where $\rho(x,y;\theta) = \ell(x,y;\theta) + \lambda_0 \|\nabla_x \ell(x,y;\theta)\|_q^q$.

In the above we have considered instantiations of a weak size function (the summation function), which recovers Madry et al.'s objective, and of a weak approximation of IG (picking the first term), which recovers input gradient regularization. In the next example, we pick a nontrivial size function, the 1-norm $\|\cdot\|_1$, and use the precise IG, but then we use a trivial intermediate layer, the output loss $\ell_y$.

Proposition 3 (Regularizing by attribution of the loss output). If we set $\lambda = 1$, $s(\cdot) = \|\cdot\|_1$, and $h = \ell_y$ (the output layer of the loss function!), then we have $\rho(x,y;\theta) = \ell_y(x) + \max_{x' \in N(x,\varepsilon)} |\ell_y(x') - \ell_y(x)|$.

We note that this loss function is a "surrogate" loss function for Madry et al.'s loss function because $\ell_y(x) + \max_{x' \in N(x,\varepsilon)} |\ell_y(x') - \ell_y(x)| \ge \ell_y(x) + \max_{x' \in N(x,\varepsilon)} (\ell_y(x') - \ell_y(x)) = \max_{x' \in N(x,\varepsilon)} \ell_y(x')$. Therefore, even at such a trivial instantiation, robust attribution regularization provides interesting guarantees.

3.2 Distributional Robustness Model

A different but popular model for robust optimization is the distributional robustness model. 
In this case we consider a family of distributions $\mathcal{P}$, each of which is supposed to be a "slight variation" of a base distribution $P$. The goal of robust optimization is then that certain objective functions obtain stable values over this entire family. Here we apply the same underlying idea to the distributional robustness model: one should get a small loss value over the base distribution $P$, and for any distribution $Q \in \mathcal{P}$, the IG-based attributions change only a little if we move from $P$ to $Q$. This is formalized as:

$$\min_\theta \; \mathbb{E}_P[\ell(P;\theta)] + \lambda \sup_{Q \in \mathcal{P}} \big\{ W_{d_{\mathrm{IG}}}(P, Q) \big\},$$

where $W_{d_{\mathrm{IG}}}(P, Q)$ is the Wasserstein distance between $P$ and $Q$ under a distance metric $d_{\mathrm{IG}}$.⁶ We use the subscript IG to highlight that this metric is related to integrated gradients.

⁵We stress that this regularization term depends on model parameters θ through the loss function $\ell_y$.

We propose again $d_{\mathrm{IG}}(z, z') = s(\mathrm{IG}^{\ell}_{h}(z, z'))$. We are particularly interested in the case where $\mathcal{P}$ is a Wasserstein ball around the base distribution $P$, using "perturbation" cost metric $c(\cdot)$. This gives the regularization term $\sup_{Q : W_c(P,Q) \le \rho} \{ W_{d_{\mathrm{IG}}}(P, Q) \}$. An unsatisfying aspect of this objective, as one can now observe, is that $W_{d_{\mathrm{IG}}}$ and $W_c$ can take two different couplings, while intuitively we want to use only one coupling to transport $P$ to $Q$. For example, this objective allows us to pick a coupling $M_1$ under which we achieve $W_{d_{\mathrm{IG}}}$ (recall that Wasserstein distance is an infimum over couplings), and a different coupling $M_2$ under which we achieve $W_c$, but under $M_1 = (Z, Z')$, $\mathbb{E}_{z,z' \sim M_1}[c(z,z')] > \rho$, violating the constraint. This motivates the following modification:

$$\min_\theta \; \mathbb{E}_P[\ell(P;\theta)] + \lambda \sup_{Q;\, M \in \mathcal{Q}(P,Q)} \Big\{ \mathbb{E}_{Z,Z'}[d_{\mathrm{IG}}(Z,Z')] \;\; \text{s.t.} \;\; \mathbb{E}_{Z,Z'}[c(Z,Z')] \le \rho \Big\}. \qquad (5)$$

In this formulation, $\mathcal{Q}(P,Q)$ is the set of couplings of $P$ and $Q$, and $M = (Z, Z')$ is one coupling. $c(\cdot,\cdot)$ is a metric, such as $\|\cdot\|_2$, to measure the cost of an adversary perturbing $z$ to $z'$. $\rho$ is an upper bound on the expected perturbation cost, thus constraining $P$ and $Q$ to be "close" to each other. $d_{\mathrm{IG}}$ is a metric to measure the change of attributions from $Z$ to $Z'$, where we want a large $d_{\mathrm{IG}}$-change under a small $c$-change. The supremum is taken over $Q$ and $\mathcal{Q}(P,Q)$.

Proposition 4 (Wasserstein prediction robustness). Let $s(\cdot)$ be the summation function and $\lambda = 1$; then for any curve and any layer $h$, (5) reduces to $\sup_{Q : W_c(P,Q) \le \rho} \{ \mathbb{E}_Q[\ell(Q;\theta)] \}$, which is the objective proposed by Sinha, Namkoong, and Duchi [SND18] for robust predictions.

Lagrange relaxation. For any $\gamma \ge 0$, the Lagrange relaxation of (5) is

$$\min_\theta \; \mathbb{E}_P[\ell(P;\theta)] + \lambda \sup_{Q;\, M \in \mathcal{Q}(P,Q)} \Big\{ \mathbb{E}_{M=(Z,Z')}\big[ d_{\mathrm{IG}}(Z,Z') - \gamma\, c(Z,Z') \big] \Big\}, \qquad (6)$$

where the supremum is taken over $Q$ (unconstrained) and all couplings of $P$ and $Q$, and we want to find a coupling under which IG attributions change a lot, while the perturbation cost from $P$ to $Q$ with respect to $c$ is small. Recall that $g : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a normal integrand if for each $\alpha$, the mapping $z \mapsto \{ z' \mid g(z,z') \le \alpha \}$ is closed-valued and measurable [RW09].

Our next two theorems generalize the duality theory in [SND18] to a much larger, but natural, class of objectives.

Theorem 1. Suppose $c(z,z) = 0$ and $d_{\mathrm{IG}}(z,z) = 0$ for any $z$, and suppose $\gamma c(z,z') - d_{\mathrm{IG}}(z,z')$ is a normal integrand. 
Then,

$$\sup_{Q;\, M \in \mathcal{Q}(P,Q)} \Big\{ \mathbb{E}_{M=(Z,Z')}\big[ d_{\mathrm{IG}}(Z,Z') - \gamma\, c(Z,Z') \big] \Big\} = \mathbb{E}_{z \sim P}\Big[ \sup_{z'} \big\{ d_{\mathrm{IG}}(z,z') - \gamma\, c(z,z') \big\} \Big].$$

Consequently, (6) is equal to the following:

$$\min_\theta \; \mathbb{E}_{z \sim P}\Big[ \ell(z;\theta) + \lambda \sup_{z'} \big\{ d_{\mathrm{IG}}(z,z') - \gamma\, c(z,z') \big\} \Big]. \qquad (7)$$

The assumption $d_{\mathrm{IG}}(z,z) = 0$ is true for what we propose, and $c(z,z) = 0$ is true for any typical cost such as $\ell_p$ distances. The normal integrand assumption is also very weak; e.g., it is satisfied when $d_{\mathrm{IG}}$ is continuous and $c$ is closed convex.

Note that (7) and (4) are very similar, and so we use (4) for the rest of the paper. Finally, given Theorem 1, we are also able to connect (5) and (7) with the following duality result:

Theorem 2. Suppose $c(z,z) = 0$ and $d_{\mathrm{IG}}(z,z) = 0$ for any $z$, and suppose $\gamma c(z,z') - d_{\mathrm{IG}}(z,z')$ is a normal integrand. For any $\rho > 0$, there exists $\gamma \ge 0$ such that the optimal solutions of (7) are optimal for (5).

⁶For a supervised learning problem where $P$ is of the form $Z = (X, Y)$, we use the same treatment as in [SND18], so that the cost function is defined as $c(z,z') = c_x(x,x') + \infty \cdot \mathbf{1}\{y \ne y'\}$. All our theory carries over to such $c$, which has range $\mathbb{R}_+ \cup \{\infty\}$.

3.3 One-Layer Neural Networks

We now consider the special case of one-layer neural networks, where the loss function takes the form $\ell(x,y;w) = g(-y\langle w, x \rangle)$, $w$ is the vector of model parameters, $x$ is a feature vector, $y$ is a label, and $g$ is nonnegative. We take $s(\cdot)$ to be $\|\cdot\|_1$, which corresponds to a strong instantiation that does not allow attributions to cancel each other. Interestingly, we prove that for natural choices of $g$, this is nevertheless exactly Madry et al.'s objective [MMS+17a], which corresponds to $s(\cdot) = \mathrm{sum}(\cdot)$. That is, the strong ($s(\cdot) = \|\cdot\|_1$) and weak ($s(\cdot) = \mathrm{sum}(\cdot)$) instantiations coincide for one-layer neural networks. 
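The $\ell_\infty$ inner maximization for such linear models has a closed "soft-margin" form (Theorem 3 below), which can be checked numerically; a sketch under the assumption $\ell(x,y;w) = g(-y\langle w, x\rangle)$ with $g$ the softplus hinge loss $g(z) = \log(1 + e^z)$, with illustrative numbers:

```python
import itertools
import numpy as np

def softplus(z):
    return float(np.log1p(np.exp(z)))

# Toy linear model: l(x, y; w) = g(-y * <w, x>) with g = softplus (values illustrative).
w = np.array([0.7, -1.3, 0.2])
x = np.array([0.5, 1.0, -2.0])
y, eps = 1.0, 0.3

# Inner max over the l_inf ball: since g is non-decreasing and convex, the max of
# this convex function over a box is attained at a corner; search all 2^d corners.
corner_max = max(
    softplus(-y * np.dot(w, x + eps * np.array(signs)))
    for signs in itertools.product([-1.0, 1.0], repeat=len(x))
)

# Closed form: the worst-case perturbation is delta = -y * eps * sign(w),
# giving the soft-margin value g(-y<w, x> + eps * ||w||_1).
soft_margin = softplus(-y * np.dot(w, x) + eps * np.abs(w).sum())
```

The exhaustive corner search and the closed form agree, which is the coincidence the theorem formalizes.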
This thus says that for generalized linear classifiers, "robust interpretation" coincides with "robust prediction," and further with classic soft-margin training.

Theorem 3. Suppose that $g$ is differentiable, non-decreasing, and convex. Then for $\lambda = 1$, $s(\cdot) = \|\cdot\|_1$, and an $\ell_\infty$ neighborhood, (4) reduces to Madry et al.'s objective:

$$\sum_{i=1}^m \max_{\|x'_i - x_i\|_\infty \le \varepsilon} g(-y_i \langle w, x'_i \rangle) \quad \text{(Madry et al.'s objective)}$$
$$= \sum_{i=1}^m g(-y_i \langle w, x_i \rangle + \varepsilon \|w\|_1) \quad \text{(soft-margin)}.$$

Natural losses, such as Negative Log-Likelihood and the softplus hinge loss, satisfy the conditions of this theorem.

4 Instantiations and Optimizations

In this section we discuss instantiations of (4) and how to optimize them. We start by presenting two objectives instantiated from our method: (1) IG-NORM and (2) IG-SUM-NORM. Then we discuss how to use gradient descent to optimize these objectives.

IG-NORM. As our first instantiation, we pick $s(\cdot) = \|\cdot\|_1$, $h$ to be the input layer, and $r$ to be the straight line connecting $x$ and $x'$. This gives:

$$\min_\theta \; \mathbb{E}_{(x,y)\sim P}\Big[ \ell(x,y;\theta) + \lambda \max_{x' \in N(x,\varepsilon)} \| \mathrm{IG}^{\ell_y}(x, x') \|_1 \Big].$$

IG-SUM-NORM. In the second instantiation we combine the sum size function and the norm size function, and define $s(\cdot) = \mathrm{sum}(\cdot) + \beta \|\cdot\|_1$, where $\beta \ge 0$ is a regularization parameter. With the same $h$ and $r$ as above, and setting $\lambda = 1$, our method simplifies to:

$$\min_\theta \; \mathbb{E}_{(x,y)\sim P}\Big[ \max_{x' \in N(x,\varepsilon)} \big\{ \ell(x',y;\theta) + \beta \| \mathrm{IG}^{\ell_y}(x, x') \|_1 \big\} \Big],$$

which can be viewed as appending an extra robust IG term to $\ell(x')$.

Gradient descent optimization. We propose the following gradient descent framework to optimize these objectives. The framework is parameterized by an adversary $A$ which is intended to solve the inner max by finding a point $x^\star$ that changes attributions significantly. 
Specifically, given a point $(x, y)$ at time step $t$ during SGD training, we have the following two steps (this easily generalizes to mini-batches):

Attack step. We run $A$ on $(x, y)$ to find $x^\star$ that produces a large inner max term (that is, $\|\mathrm{IG}^{\ell_y}(x, x^\star)\|_1$ for IG-NORM, and $\ell(x^\star) + \beta \|\mathrm{IG}^{\ell_y}(x, x^\star)\|_1$ for IG-SUM-NORM).

Gradient step. Fixing $x^\star$, we can then compute the gradient of the corresponding objective with respect to θ, and then update the model.

Important objective parameters. In both the attack and gradient steps, we need to differentiate IG (in the attack step, θ is fixed and we differentiate w.r.t. $x$, while in the gradient step this is reversed), and this induces a set of objective parameters to tune for optimization, which are summarized in Table 1. Differentiating the summation approximation of IG amounts to computing second partial derivatives. We rely on the auto-differentiation capability of TensorFlow [ABC+16] to compute second derivatives.

Adversary $A$ — Adversary to find $x^\star$. Note that our goal is simply to maximize the inner term in a neighborhood; thus in this paper we choose Projected Gradient Descent for this purpose.
$m$ in the attack step — To differentiate IG in the attack step, we use the summation approximation of IG; this is the number of segments for the approximation.
$m$ in the gradient step — Same as above, but in the gradient step. We keep this $m$ separate due to efficiency considerations.
λ — Regularization parameter for IG-NORM.
β — Regularization parameter for IG-SUM-NORM.

Table 1: Optimization parameters.

5 Experiments

We now perform experiments using our method. We ask the following questions: (1) Comparing models trained by our method and naturally trained models at test time, do we maintain the accuracy on unperturbed test inputs? 
(2) At test time, if we use the attribution attacks mentioned in [GAZ17] to perturb attributions while keeping predictions intact, how does the attribution robustness of our models compare with that of the naturally trained models? (3) Finally, how does the attribution robustness of our models compare with weak instantiations for robust predictions?

To answer these questions, we perform experiments on four classic datasets: MNIST [LCB98], Fashion-MNIST [XRV17], GTSRB [SSSI12], and Flower [NZ06]. In summary, our findings are the following: (1) Our method results in a very small drop in test accuracy compared with naturally trained models. (2) On the other hand, our method gives significantly better attribution robustness, as measured by correlation analyses. (3) Finally, our models yield comparable prediction robustness (sometimes even better), while consistently improving attribution robustness. In the rest of the section we give more details.

Evaluation setup. In this work we use IG to compute attributions (i.e., feature importance maps), which, as demonstrated by [GAZ17], is more robust compared to other related methods (note that IG also enjoys other theoretical properties). To attack attributions while retaining model predictions, we use the Iterative Feature Importance Attacks (IFIA) proposed by [GAZ17]. Due to lack of space, we defer details of parameters and other settings to the appendix. We use two metrics to measure attribution robustness (i.e., how similar the attributions are between original and perturbed images):

Kendall's tau rank order correlation. Attribution methods rank all of the features in order of their importance, so we use the rank correlation [Ken38] to compare similarity between interpretations.

Top-k intersection. 
We compute the size of the intersection of the $k$ most important features before and after perturbation.

Compared with [GAZ17], we use Kendall's tau correlation instead of Spearman's rank correlation. The reason is that we found that on the GTSRB and Flower datasets, Spearman's correlation is not consistent with visual inspection, and often produces too-high correlations. In comparison, Kendall's tau correlation consistently produces lower correlations and aligns better with visual inspection. Finally, when computing attribution robustness, we only consider the test samples that are correctly classified by the model.

Figure 2: Experiment results on MNIST (a), Fashion-MNIST (b), GTSRB (c), and Flower (d).

Comparing with natural models. Panels (a), (b), (c), and (d) of Figure 2 show that, compared with naturally trained models, robust attribution training gives significant improvements in attribution robustness (measured by either medians or confidence intervals). The exact numbers are recorded in Table 2: compared with naturally trained models (rows where "Approach" is NATURAL), robust attribution training has significantly better adversarial accuracy and attribution robustness, while having only a small drop in natural accuracy (denoted by Nat Acc.).

Ineffective optimization. We observe that even when our training stops, the attribution regularization term remains much more significant compared to the natural loss term. For example, for IG-NORM, when training stops on MNIST, $\ell(x)$ typically stays at 1, but $\|\mathrm{IG}(x, x')\|_1$ stays at 10 ∼ 20. This indicates that optimization has not been very effective in minimizing the regularization term. There are two possible reasons for this: (1) Because we use the summation approximation of IG, we are forced to compute second derivatives, which may not be numerically stable for deep networks. 
(2) The network architecture may be inherently unsuitable for robust attributions, rendering the optimization hard to converge.

Comparing with robust prediction models. Finally, we compare with Madry et al.'s models, which are trained for robust prediction. We use Adv Acc. to denote adversarial accuracy (prediction accuracy on perturbed inputs). Again, TopK Inter. denotes the average top-K intersection (K = 100 for the MNIST, Fashion-MNIST and GTSRB datasets, K = 1000 for Flower), and Rank Corr. denotes the average Kendall's rank order correlation. Table 2 gives the details of the results. As we can see, our models give comparable adversarial accuracy, and are sometimes even better (on the Flower dataset). On the other hand, we are consistently better in terms of attribution robustness.

Dataset | Approach | Nat Acc. | Adv Acc. | TopK Inter. | Rank Corr.
MNIST | NATURAL | 99.17% | 0.00% | 46.61% | 0.1758
MNIST | Madry et al. | 98.40% | 92.47% | 62.56% | 0.2422
MNIST | IG-NORM | 98.74% | 81.43% | 71.36% | 0.2841
MNIST | IG-SUM-NORM | 98.34% | 88.17% | 72.45% | 0.3111
Fashion-MNIST | NATURAL | 90.86% | 0.01% | 39.01% | 0.4610
Fashion-MNIST | Madry et al. | 85.73% | 73.01% | 46.12% | 0.6251
Fashion-MNIST | IG-NORM | 85.13% | 65.95% | 59.22% | 0.6171
Fashion-MNIST | IG-SUM-NORM | 85.44% | 70.26% | 72.08% | 0.6747
GTSRB | NATURAL | 98.57% | 21.05% | 54.16% | 0.6790
GTSRB | Madry et al. | 97.59% | 83.24% | 68.85% | 0.7520
GTSRB | IG-NORM | 97.02% | 75.24% | 74.81% | 0.7555
GTSRB | IG-SUM-NORM | 95.68% | 77.12% | 74.04% | 0.7684
Flower | NATURAL | 86.76% | 0.00% | 8.12% | 0.4978
Flower | Madry et al. | 83.82% | 41.91% | 55.87% | 0.7784
Flower | IG-NORM | 85.29% | 24.26% | 64.68% | 0.7591
Flower | IG-SUM-NORM | 82.35% | 47.06% | 66.33% | 0.7974

Table 2: Experiment results including prediction accuracy, prediction robustness and attribution robustness.

6 Discussion and Conclusion

This paper builds a theory to robustify model interpretations through the lens of axiomatic attributions of neural networks. 
We show that our theory gives principled generalizations of previous formulations for robust predictions, and we characterize our objectives for one-layer neural networks. We believe that our work opens many intriguing avenues for future research, and we discuss a few topics below.

Why do we want robust attributions? Model attributions are facts about model behaviors. While robust attribution does not necessarily mean that the attribution is correct, a model with brittle attributions can never be trusted. To this end, it seems interesting to examine attribution methods other than Integrated Gradients.

Robust attribution leads to more human-aligned attribution. Note that our proposed training scheme requires both prediction correctness and robust attributions, and therefore it encourages learning invariant features from data that are also highly predictive. In our experiments, we found an intriguing phenomenon: our regularized models produce attributions that are much more aligned with human perception (for example, see Figure 1). Our results are aligned with the recent work of [TSE+19, EIS+19].

Robust attribution may help tackle spurious correlations. In view of our discussion so far, we think it is plausible that robust attribution regularization can help remove spurious correlations, because intuitively spurious correlations should not admit reliable attribution. Future research on this potential connection seems warranted.

Difficulty of optimization. While our experimental results are encouraging, we observe that when training stops, the attribution regularization term remains significant (typically around tens to hundreds), which indicates ineffective optimization of the objectives. 
To this end, a main problem is network depth: as depth increases, we observe very unstable gradient descent trajectories, which seems related to the use of second-order information during robust attribution optimization (due to the summation approximation, first-order gradient terms appear in the training objectives themselves). It is therefore natural to further study better optimization techniques or better architectures for robust attribution training.

7 Acknowledgments

This work is partially supported by Air Force Grant FA9550-18-1-0166, the National Science Foundation (NSF) Grants CCF-FMitF-1836978, SaTC-Frontiers-1804648 and CCF-1652140, and ARO grant number W911NF-17-1-0405.

References

[ABC+16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265-283, 2016.

[BTEGN09] A. Ben-Tal, L. El Ghaoui, and A. S. Nemirovski. Robust Optimization. Princeton Series in Applied Mathematics. Princeton University Press, October 2009.

[DL92] Harris Drucker and Yann LeCun. Improving generalization performance using double backpropagation. IEEE Trans. Neural Networks, 3(6):991-997, 1992.

[EIS+19] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, and Aleksander Madry. Adversarial robustness as a prior for learned representations. arXiv preprint arXiv:1906.00945, 2019.

[GAZ17] Amirata Ghorbani, Abubakar Abid, and James Y. Zou. Interpretation of neural networks is fragile. CoRR, abs/1710.10547, 2017.

[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[Ken38] Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81-93, 1938.

[KM18] Zico Kolter and Aleksander Madry. Adversarial robustness: theory and practice, 2018. https://adversarial-ml-tutorial.org/.

[LCB98] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/exdb/mnist/.

[Lue97] David G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1997.

[MEK15] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. arXiv preprint arXiv:1505.05116, 2015.

[MMS+17a] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. CoRR, abs/1706.06083, 2017.

[MMS+17b] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[NZ06] M-E Nilsback and Andrew Zisserman. A visual vocabulary for flower classification. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1447-1454. IEEE, 2006.

[RD18] Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 1660-1669, 2018.

[RW09] R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

[SND18] Aman Sinha, Hongseok Namkoong, and John C. Duchi. Certifying some distributional robustness with principled adversarial training. In 6th International Conference on Learning Representations, ICLR 2018, 2018.

[SSSI12] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323-332, 2012.

[STY17] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3319-3328, 2017.

[SVZ13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[TSE+19] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.

[WK18] Eric Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 5283-5292, 2018.

[XRV17] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.