{"title": "On the (In)fidelity and Sensitivity of Explanations", "book": "Advances in Neural Information Processing Systems", "page_first": 10967, "page_last": 10978, "abstract": "We consider objective evaluation measures of saliency explanations for complex black-box machine learning models. We propose simple robust variants of two notions that have been considered in recent literature: (in)fidelity, and sensitivity. We analyze optimal explanations with respect to both these measures, and while the optimal explanation for sensitivity is a vacuous constant explanation, the optimal explanation for infidelity is a novel combination of two popular explanation methods. By varying the perturbation distribution that defines infidelity, we obtain novel explanations by optimizing infidelity, which we show to out-perform existing explanations in both quantitative and qualitative measurements. Another salient question given these measures is how to modify any given explanation to have better values with respect to these measures. We propose a simple modification based on lowering sensitivity, and moreover show that when done appropriately, we could simultaneously improve both sensitivity as well as fidelity.", "full_text": "On the (In)\ufb01delity and Sensitivity of Explanations\n\nChih-Kuan Yeh \u02da, Cheng-Yu Hsieh :, Arun Sai Suggala ;\n\nDepartment of Machine Learning\n\nCarnegie Mellon University\n\nSchool of Electrical and Computer Engineering\n\nDavid I. Inouye \u00a7\n\nPurdue University\n\nPradeep Ravikumar \u00b6\n\nDepartment of Machine Learning\n\nCarnegie Mellon University\n\nAbstract\n\nWe consider objective evaluation measures of saliency explanations for complex\nblack-box machine learning models. 
We propose simple robust variants of two\nnotions that have been considered in recent literature: (in)\ufb01delity, and sensitivity.\nWe analyze optimal explanations with respect to both these measures, and while\nthe optimal explanation for sensitivity is a vacuous constant explanation, the\noptimal explanation for in\ufb01delity is a novel combination of two popular explanation\nmethods. By varying the perturbation distribution that de\ufb01nes in\ufb01delity, we obtain\nnovel explanations by optimizing in\ufb01delity, which we show to out-perform existing\nexplanations in both quantitative and qualitative measurements. Another salient\nquestion given these measures is how to modify any given explanation to have\nbetter values with respect to these measures. We propose a simple modi\ufb01cation\nbased on lowering sensitivity, and moreover show that when done appropriately,\nwe could simultaneously improve both sensitivity as well as \ufb01delity.\n\nIntroduction\n\n1\nWe consider the task of how to explain a complex machine learning model, abstracted as a function\nthat predicts a response given an input feature vector, given only black-box access to the model. A\npopular approach to do so is to attribute any given prediction to the set of input features: ranging from\nproviding a vector of importance weights, one per input feature, to simply providing a set of important\nfeatures. For instance, given a deep neural network for image classi\ufb01cation, we may explain a speci\ufb01c\nprediction by showing the set of salient pixels, or a heatmap image showing the importance weights\nfor all the pixels. But how good is any such explanation mechanism? 
We can distinguish between two classes of explanation evaluation measures [22, 27]: objective measures and subjective measures. The predominant evaluations of explanations have been subjective measures, since the notion of explanation is very human-centric; these range from qualitative displays of explanation examples, to crowd-sourced evaluations of human satisfaction with the explanations, as well as of whether humans are able to understand the model. Nonetheless, it is also important to consider objective measures of explanation effectiveness, not only because these place explanations on a sounder theoretical foundation, but also because they allow us to improve our explanations by improving their objective measures.

˚cjyeh@cs.cmu.edu
:chyu.hsieh@gmail.com
;asuggala@andrew.cmu.edu
§dinouye@purdue.edu
¶pradeepr@cs.cmu.edu

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

One way to objectively evaluate explanations is to verify whether the explanation mechanism satisfies (or does not satisfy) certain axioms, or properties [25, 43]. In this paper, we focus on quantitative objective measures, and provide and analyze two such measures. First, we formalize the notion of fidelity of an explanation to the predictor function. One natural approach to measure fidelity, when we have a priori information that only a particular subset of features is relevant, is to test if the features with high explanation weights belong to this relevant subset [10]. In the absence of such a priori information, Ancona et al. [3] provide a more quantitative perspective on the earlier notion by measuring the correlation between the sum of a subset of feature importances and the difference in function value when setting the features in the subset to some reference value; by varying the subsets, we get different values of such subset correlations. 
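The subset-correlation evaluation of Ancona et al. [3] can be sketched in a few lines. The following is a minimal illustrative sketch, not the authors' code: the predictor, the occlusion-style attributions, and all names are toy assumptions chosen so that the behavior is easy to verify (for an additive predictor, summed occlusion attributions match the subset effect exactly, so the correlation is numerically 1).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy additive black-box predictor (stand-in for a trained model).
    return float(np.tanh(x).sum())

def occlusion_attributions(f, x):
    # Per-feature importance: change in f when feature i is set to 0.
    phi = np.zeros_like(x)
    for i in range(x.shape[0]):
        x_i = x.copy()
        x_i[i] = 0.0
        phi[i] = f(x) - f(x_i)
    return phi

def subset_correlation(f, x, phi, k=3, n_subsets=500):
    # Correlation between the summed attributions of random size-k subsets
    # and the change in f when each subset is set to the zero baseline.
    d = x.shape[0]
    sums, deltas = [], []
    for _ in range(n_subsets):
        S = rng.choice(d, size=k, replace=False)
        x_masked = x.copy()
        x_masked[S] = 0.0
        sums.append(phi[S].sum())
        deltas.append(f(x) - f(x_masked))
    return float(np.corrcoef(sums, deltas)[0, 1])

x = rng.normal(size=8)
phi = occlusion_attributions(f, x)
# Additive f: the correlation is numerically 1 for these attributions.
print(subset_correlation(f, x, phi))
```

For a non-additive model or mismatched attributions the correlation drops below 1, which is what makes the measure informative.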
In this work, we consider a simple generalization\nof this notion, that produces a single \ufb01delity measure, which we call the in\ufb01delity measure.\nOur in\ufb01delity measure is de\ufb01ned as the expected difference between the two terms: (a) the dot product\nof the input perturbation to the explanation and (b) the output perturbation (i.e., the difference in\nfunction values after signi\ufb01cant perturbations on the input). This general setup allows for a varied\nclass of signi\ufb01cant perturbations: a non-random perturbation towards a single reference or baseline\nvalue, perturbations towards multiple reference points e.g. by varying subsets of features to perturb,\nand a random perturbation towards a reference point with added small Gaussian noise, which allows\nthe in\ufb01delity measure to be robust to small mis-speci\ufb01cations or noise in either the test input or the\nreference point.\nWe then show that the optimal explanation that minimizes this in\ufb01delity measure could be loosely\ncast as a novel combination of two well-known explanation mechanisms: Smooth-Grad [40] and\nIntegrated Gradients [43] using a kernel function speci\ufb01ed by the random perturbations. As another\nvalidation of our formalization, we show that many recently proposed explanations can be seen as\noptimal explanations for the in\ufb01delity measure with speci\ufb01c perturbations. We also introduce new\nperturbations which lead to novel explanations by optimizing the in\ufb01delity measure, and we validate\nthe explanations are qualitatively better through human experiments. 
It is worth emphasizing that the\nin\ufb01delity measure, while objective, may not capture all the desiderata of a successful explanation;\nthus, it is still of interest to take a given explanation that does not have the form of the optimal\nexplanation with respect to a speci\ufb01ed in\ufb01delity measure and modify it to have lesser in\ufb01delity.\nAnalyzing this question leads us to another objective measure: the sensitivity of an explanation, which\nmeasures the degree to which the explanation is affected by insigni\ufb01cant perturbations from the test\npoint. It is natural to wish for our explanation to have low sensitivity, since that would entail differing\nexplanations with minor variations in the input (or prediction values), which might lead us to distrust\nthe explanations. Explanations with high sensitivity could also be more amenable to adversarial\nattacks, as Ghorbani et al. [13] show in the context of gradient based explanations. Regardless, we\nlargely expect explanations to be simple, and lower sensitivity could be viewed as one such notion\nof simplicity. Due in part to this, there have been some recent efforts to quantify the sensitivity of\nexplanations [2, 28, 13]. We propose and analyze a simple robust variant of these recent proposals\nthat is amenable to Monte Carlo sampling-based approximation. Our key contribution, however,\nis in relating the notion of sensitivity to our proposed notion of in\ufb01delity, which also allows us to\naddress the earlier raised question of how to modify an explanation to have better \ufb01delity. Asking this\nquestion for sensitivity might seem vacuous, since the optimal explanation that minimizes sensitivity\n(for all its related variants) is simply a trivial constant explanation, which is naturally not a desired\nexplanation. So a more interesting question would be: how do we modify a given explanation so that\nit has lower sensitivity, but not too much. 
To quantify the latter, we could in turn use \ufb01delity.\nAs one key contribution of the paper, we show that a restrained lowering of the sensitivity of an\nexplanation also increases its \ufb01delity. In particular, we consider a simple kernel smoothing based\nalgorithm that appropriately lowers the sensitivity of any given explanation, but importantly also\nlowers its in\ufb01delity. Our meta-algorithm encompasses Smooth-Grad [40] which too modi\ufb01es any\nexisting explanation mechanism by averaging explanations in a small local neighborhood of the test\npoint. In the appendix, we also consider an alternative approach to improve gradient explanation\nsensitivity and \ufb01delity by adversarial training, which however requires that we be able to modify\nthe given predictor function itself, which might not always be feasible. Our modi\ufb01cations improve\nboth sensitivity and \ufb01delity in most cases, and also provides explanations that are qualitatively better,\nwhich we validate in a series of experiments.6\n2 Objective Measure: Explanation In\ufb01delity\nConsider the following general supervised learning setting: input space X \u010e Rd, an output space\nY \u010e R, and a (machine-learnt) black-box predictor f : Rd \u00de\u00d1 R, which at some test input x P Rd,\npredicts the output fpxq. Then a feature attribution explanation is some function \u03a6 : F \u02c6 Rd \u00de\u00d1 Rd,\nthat given a black-box predictor f, and a test point x, provides importance scores \u03a6pf , xq for the set of\n\n6Implementation available at https://github.com/chihkuanyeh/saliency_evaluation.\n\n2\n\n\finput features. We let } \u00a8 } denote a given norm over the input and explanation space. 
In experiments, if not specified, this will be set to the ℓ2 norm.
2.1 Defining the infidelity measure
A natural notion of the goodness of an explanation is to quantify the degree to which it captures how the predictor function itself changes in response to significant perturbations. In this spirit, [4, 37, 43] propose the completeness axiom for explanations consisting of feature importances, which states that the feature importances should sum to the difference between the predictor function value at the given input and at some specific baseline. [3] extend this to require that the sum of any subset of feature importance weights equal the difference between the predictor function value at the given input and at a perturbed input that sets that subset of features to some specific baseline value. When the subset of features is large, this entails that explanations capture the combined importance of the subset of features even if not the individual feature importances, and when the subset of features is small, this entails that explanations capture the individual importance of features. We note that this can be contrasted with requiring the explanations to capture the function values themselves, as in the causal local explanation metric of [30], rather than the difference in function values, but we focus on the latter. 
Letting S_k denote a subset of k features, [3] measured the above desiderata as the correlation between Σ_{i∈S_k} Φ(f, x)_i and f(x) − f(x[x_{S_k} = 0]), where x[x_S = a]_j = a 1(j ∈ S) + x_j 1(j ∉ S) and 1 is the indicator function.
One minor caveat with the above is that we may be interested in perturbations more general than setting feature values to 0, or even to a single baseline; for instance, we might simultaneously require smaller discrepancy over a set of subsets, or some distribution of subsets (as is common in game-theoretic approaches to deriving feature importances [11, 42, 25]), or even simply a prior distribution over the baseline input. The correlation measure also focuses on second-order moments, and is not as easy to optimize. We thus build on the above developments, first by allowing random perturbations of feature values instead of setting certain features to some baseline value, and secondly by replacing correlation with expected mean squared error (our development could be further generalized to allow for more general loss functions). We term our evaluation measure explanation infidelity.
Definition 2.1. Given a black-box function f, an explanation functional Φ, and a random variable I ∈ R^d with probability measure μ_I, which represents meaningful perturbations of interest, we define the explanation infidelity of Φ as:

INFD(Φ, f, x) = E_{I∼μ_I}[ ( Iᵀ Φ(f, x) − (f(x) − f(x − I)) )² ].   (1)

I represents significant perturbations around x, and can be specified in various ways. We begin by listing various plausible perturbations of interest.

• Difference to baseline: I = x − x₀, the difference between the input and a baseline.
• Subset of difference to baseline: for any fixed subset S_k ⊆ [d], I_{S_k} = x − x[x_{S_k} = (x₀)_{S_k}], which corresponds to the perturbation in the correlation measure of [3] when x₀ = 0.
• Difference to noisy baseline: I = x − z₀, where z₀ = x₀ + ε for some zero-mean random vector ε, for instance ε ∼ N(0, σ²).
• Difference to multiple baselines: I = x − x₀, where x₀ is a random variable that can take multiple values.

As we will next show in Section 2.3, many recently proposed explanations can be viewed as optimizing the aforementioned infidelity for varying perturbations I. Our proposed infidelity measure can thus be seen as a unifying framework for these explanations, but moreover, as a way to design new explanations, and to evaluate any existing explanation.
2.2 Explanations with Least Infidelity
Given our notion of infidelity, a natural question is: what is the explanation that is optimal with respect to infidelity, that is, has the least infidelity possible? This naturally depends on the distribution of the perturbations I, and its surprisingly simple form is detailed in the following proposition.
Proposition 2.1. Suppose the perturbations I are such that ∫ I Iᵀ dμ_I is invertible. 
The optimal explanation Φ*(f, x) that minimizes infidelity for perturbations I can then be written as

Φ*(f, x) = ( ∫ I Iᵀ dμ_I )⁻¹ ( ∫ I Iᵀ IG(f, x, I) dμ_I ),   (2)

where IG(f, x, I) = ∫₀¹ ∇f(x + (t − 1) I) dt is the integrated gradient of f(·) between (x − I) and x [43], but can be replaced by any functional that satisfies Iᵀ IG(f, x, I) = f(x) − f(x − I). A generalized version of SmoothGrad can be written as Φ_k(f, x) := [ ∫_z k(x, z) dz ]⁻¹ ∫_z Φ(f, z) k(x, z) dz, where the Gaussian kernel can be replaced by any kernel. Therefore, the optimal solution of Proposition 2.1 can be seen as applying a smoothing operation reminiscent of SmoothGrad to Integrated Gradients (or to any explanation that satisfies the completeness axiom), where the special kernel I Iᵀ is used instead of the original kernel k(x, z). When I is deterministic, the integral of I Iᵀ is rank-one and cannot be inverted, but being optimal with respect to the infidelity can be shown to be equivalent to satisfying the completeness axiom. To enhance computational stability, we can replace the inverse by the pseudo-inverse, or add a small diagonal matrix to handle the non-invertible case, which works well in experiments.

2.3 Many Recent Explanations Optimize Infidelity

As we show in the sequel, many recently proposed explanation methods can be shown to be optimal with respect to our infidelity measure in Definition 2.1, for varying perturbations I.
Proposition 2.2. Suppose the perturbation I = x − x₀ is deterministic and equal to the difference between x and some baseline x₀. Let Φ*(f, x) be any explanation which is optimal with respect to infidelity for perturbations I. 
Then Φ*(f, x) ⊙ I satisfies the completeness axiom; that is, Σ_{j=1}^d [Φ*(f, x) ⊙ I]_j = f(x) − f(x − I). Note that the completeness axiom is also satisfied by IG [43], DeepLift [37], and LRP [4].
Proposition 2.3. Suppose the perturbation is given by I_ε = ε · e_i, where e_i is a coordinate basis vector. Then the optimal explanation Φ*_ε(f, x) with respect to infidelity for perturbations I_ε satisfies lim_{ε→0} Φ*_ε(f, x) = ∇f(x), so that the limit point of the optimal explanations is the gradient explanation [36].
Proposition 2.4. Suppose the perturbation is given by I = e_i ⊙ x, where e_i is a coordinate basis vector. Let Φ*(f, x) be the optimal explanation with respect to infidelity for perturbations I. Then Φ*(f, x) ⊙ x is the occlusion-1 explanation [47].
Proposition 2.5. Following the notation in [25], given a test input x, suppose there is a mapping h_x : {0, 1}^d ↦ R^d that maps simplified binary inputs z ∈ {0, 1}^d to R^d, such that the given test input x is equal to h_x(z₀), where z₀ is the vector of all ones, and h_x(0) = 0, where 0 is the zero vector. Now, consider the perturbation I = h_x(Z), where Z ∈ {0, 1}^d is a binary random vector with distribution P(Z = z) ∝ (d − 1) / [ C(d, ‖z‖₁) ‖z‖₁ (d − ‖z‖₁) ]. Then for the optimal explanation Φ*(f, x) with respect to infidelity for perturbations I, Φ*(f, x) ⊙ x is the Shapley value [25].

2.4 Some Novel Explanations with New Perturbations
By varying the perturbations I in our infidelity Definition 2.1, we not only recover existing explanations (as those that optimize the corresponding infidelity), but can also design novel explanations. We provide two such instances below.
Noisy Baseline. 
The completeness axiom is one of the most commonly adopted axioms in the\ncontext of explanations, but a caveat is that the baseline is set to some \ufb01xed vector, which does not\naccount for noise in the input (or the baseline itself). We thus set the baseline to be a Gaussian\nrandom vector centered around a certain clean baseline (such as the mean input or zero) depending on\nthe context. The explanation that optimizes in\ufb01delity with corresponding perturbations I is a novel\nexplanation that can be seen as satisfying a robust variant of the completeness axiom.\nSquare Removal. Our second example is speci\ufb01c for image data. We argue that perturbations that\nremove random subsets of pixels in images may be somewhat meaningless, since there is very little\nloss of information given surrounding pixels that are not removed. Also ranging over all possible\nsubsets to remove (as in SHAP [25]) is infeasible for high dimension images. We thus propose a\nmodi\ufb01ed subset distribution from that described in Proposition 2.5 where the perturbation Z has\na uniform distribution over square patches with prede\ufb01ned length, which is in spirit similar to the\nwork of [49]. This not only improves the computational complexity, but also better captures spatial\nrelationships in the images. One can also replace the square with more complex random masks\ndesigned speci\ufb01cally for the image domain [29].\n\n4\n\n\f2.5 Local and Global Explanations\nAs discussed in [3], we can contrast between local and global feature attribution explanations: global\nfeature attribution methods directly provide the change in the function value given changes in the\nfeatures, whereas local feature attribution methods focus on the sensitivity of the function to the\nchanges to the features, so that the local feature attributions need to be multiplied with the input\nto obtain an estimate of the change in the function value. 
Thus, for the gradient-based explanations considered in [3], the raw explanation, such as the gradient itself, is a local explanation, while the raw explanation multiplied element-wise with the raw input is called a global explanation. In our context, explanations optimizing Definition 2.1 are naturally local explanations, since I is real-valued. However, they can be easily modified into global explanations by multiplying with x − x₀ when I is a subset of x − x₀. The reason we emphasize this distinction is that global and local explanations capture subtly different aspects, so they should be compared separately. We note that our definition of local and global explanations follows the description of [3], and is distinct from that in [30].
3 Objective Measure: Explanation Sensitivity
A classical approach to measuring the sensitivity of a function is simply the gradient of the function with respect to the input. The sensitivity of an explanation can therefore be defined as: for any j ∈ {1, . . . , d},

[∇_x Φ(f(x))]_j = lim_{ε→0} [ Φ(f(x + ε e_j)) − Φ(f(x)) ] / ε,

where e_j ∈ R^d is the j-th coordinate basis vector, with j-th entry one and all others zero. This quantifies how the explanation changes as the input is varied infinitesimally. As a scalar-valued summary of this sensitivity, a natural approach is to simply compute some norm of the sensitivity matrix, ‖∇_x Φ(f(x))‖. A slightly more robust variant is a locally uniform bound:

SENS_GRAD(Φ, f, x, r) = sup_{‖δ‖≤r} ‖∇_x Φ(x + δ)‖.   (3)

This is in turn related to local Lipschitz continuity [2] around x:

SENS_LIPS(Φ, f, x, r) = sup_{‖δ‖≤r} ‖Φ(x) − Φ(x + δ)‖ / ‖δ‖.   (4)

Thus if an explanation has locally uniformly bounded gradients, it is locally Lipschitz continuous as well. 
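The locally uniform sensitivity notions above can be approximated by sampling perturbations δ. The following is a minimal illustrative sketch (not the paper's code): it uses a toy gradient explanation for f(x) = Σ tanh(x_i), samples perturbations of norm r, and returns a Monte Carlo lower bound on the supremum in (4).

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_explanation(x):
    # Toy explanation: gradient of f(x) = sum(tanh(x)) at x.
    return 1.0 - np.tanh(x) ** 2

def sens_lips(phi, x, r, n_samples=1000):
    # Monte Carlo lower bound on the local Lipschitz quantity of eq. (4):
    # sup over sampled perturbations of ||phi(x) - phi(x+delta)|| / ||delta||.
    # Perturbations are drawn on the sphere ||delta|| = r for simplicity.
    base = phi(x)
    best = 0.0
    for _ in range(n_samples):
        delta = rng.normal(size=x.shape)
        delta *= r / np.linalg.norm(delta)
        ratio = np.linalg.norm(base - phi(x + delta)) / np.linalg.norm(delta)
        best = max(best, ratio)
    return best

x = rng.normal(size=8)
print(sens_lips(grad_explanation, x, r=0.1))
```

Because sampling can only explore finitely many directions, the estimate lower-bounds the true supremum; the same sampling scheme is what makes the max-sensitivity variant below easy to estimate.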
In this paper, we consider a closely related measure, which we term max-sensitivity, that measures the maximum change in the explanation under a small perturbation of the input x.
Definition 3.1. Given a black-box function f, an explanation functional Φ, and a given input neighborhood radius r, we define the max-sensitivity of an explanation as:

SENS_MAX(Φ, f, x, r) = max_{‖y−x‖≤r} ‖Φ(f, y) − Φ(f, x)‖.

It can be readily seen that if an explanation is locally Lipschitz continuous, it has bounded max-sensitivity as well:

SENS_MAX(Φ, f, x, r) := max_{‖δ‖≤r} ‖Φ(f, x + δ) − Φ(f, x)‖ ≤ SENS_LIPS(Φ, f, x, r) · r.   (5)

The main attraction of the max-sensitivity measure is that it can be robustly estimated via Monte Carlo sampling, as in our experiments. We point out that in certain cases local Lipschitz continuity may be unbounded for a deep network (for instance, with ReLU activations and gradient explanations, which is a common setting), but max-sensitivity is always finite provided the explanation scores are bounded, and is thus more robust to estimate. Can we then modify a given explanation so that it has lower sensitivity? If so, by how much should we lower its sensitivity? There are two key objections to the very premise of these questions. For the first objection, as we noted in the introduction, sensitivity provides only a partial measure of what is desired from an explanation. This can be seen from the fact that the optimal explanation that minimizes the above max-sensitivity measure is simply a constant explanation that outputs a (potentially nonsensical) constant value for all possible test inputs. 
The second objection is that natural explanations might have a certain amount of sensitivity by their very nature, either because the model is sensitive, or because the explanations themselves are constructed by measuring the sensitivities of the predictor function, so that their sensitivities in turn are likely to be higher than that of the function. In that case, we might not want to lower their sensitivities, since doing so might hurt the fidelity of the explanation to the predictor function, and perhaps degrade the explanation towards the vacuous constant explanation.
As one key contribution of the paper, we show that it is indeed possible to reduce sensitivity “responsibly” by ensuring that doing so also lowers the infidelity, as we detail in the next section. We start by relating the sensitivity of an explanation to its infidelity, and then show that appropriately reducing the sensitivity can achieve two ends: lowering sensitivity of course, but surprisingly, also lowering the infidelity itself.
4 Reducing Sensitivity and Infidelity by Smoothing Explanations
In Section C of the appendix, we show that if the explanation sensitivity is much larger than the function sensitivity around some input x, the infidelity measure will necessarily be large at some point around x (that is, loosely, infidelity is lower bounded by the difference between the sensitivity of the explanation and that of the function). Given that a large class of explanations are based on the sensitivity of the function at the test input, and such sensitivities can in turn be more sensitive to the input than the function itself, does that mean that sensitivity-based explanations are simply fated to have a large infidelity? 
In this section, we show that this need not be the case: by appropriately lowering the sensitivity of any given explanation, we not only reduce its sensitivity, but also its infidelity.
Given any kernel k(x, z) over the input domain with respect to which we desire smoothness, and some explanation functional Φ(f, z), we can define a smoothed explanation as Φ_k(f, x) := ∫_z Φ(f, z) k(x, z) dz. When k(x, z) is set to the Gaussian kernel, Φ_k(f, x) becomes Smooth-Grad [40]. We now show that the smoothed explanation is less sensitive than the original sensitivity averaged around x.
Theorem 4.1. Given a black-box function f, an explanation functional Φ, and the smoothed explanation functional Φ_k,

SENS_MAX(Φ_k, f, x, r) ≤ ∫_z SENS_MAX(Φ, f, z, r) k(x, z) dz.

Thus, when the sensitivity SENS_MAX is large only at some points z, the averaged sensitivity can be much smaller than the worst-case sensitivity over z.
We now show that under certain assumptions, the infidelity of the smoothed explanation actually decreases. 
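The smoothed explanation Φ_k above can be estimated by Monte Carlo: averaging the base explanation over noisy copies of the input realizes the Gaussian-kernel case. A minimal sketch, with a toy gradient explanation as the base Φ (names and the predictor are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_explanation(phi, x, sigma=0.1, n_samples=200):
    # Kernel-smoothed explanation phi_k with a Gaussian kernel, estimated
    # by Monte Carlo: average the base explanation over noisy copies of x.
    # With phi = the input gradient, this reduces to Smooth-Grad.
    samples = [phi(x + sigma * rng.normal(size=x.shape))
               for _ in range(n_samples)]
    return np.mean(samples, axis=0)

def grad_phi(x):
    # Gradient explanation for the toy predictor f(x) = sum(tanh(x)).
    return 1.0 - np.tanh(x) ** 2

x = rng.normal(size=8)
print(smooth_explanation(grad_phi, x))
```

The kernel bandwidth sigma controls the smoothing radius; the appendix experiments referenced later vary this radius and trade off sensitivity against infidelity.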
First, we introduce two relevant terms:

C1 = max_x [ ∫_I ∫_z ( f(z) − f(z − I) − [f(x) − f(x − I)] )² k(x, z) dz dμ_I ] / [ ∫_I ∫_z ( Iᵀ Φ(f, z) − [f(x) − f(x − I)] )² k(x, z) dz dμ_I ],   (6)

C2 = max_x [ ∫_I ( ∫_z { Iᵀ Φ(f, z) − [f(x) − f(x − I)] } k(x, z) dz )² dμ_I ] / [ ∫_I ∫_z ( Iᵀ Φ(f, z) − [f(x) − f(x − I)] )² k(x, z) dz dμ_I ].   (7)

We note that when the sensitivity of f is much smaller than the sensitivity of Iᵀ Φ(f, ·), the numerator of C1 will be much smaller than its denominator, so that C1 will be small. The term C2 is at most one by Jensen's inequality, but in practice it may be much smaller than one when Iᵀ Φ(f, z) − [f(x) − f(x − I)] takes different signs for varying z. We now present our theorem, which relates the infidelity of the smoothed explanation to that of the original explanation.
Theorem 4.2. Given a black-box function f, an explanation functional Φ, the smoothed explanation functional Φ_k, some perturbation of interest I, and C1, C2 defined in (6) and (7) with C1 ≤ 1/4 and C2 ≤ 1, the infidelity of Φ_k satisfies

INFD(Φ_k, f, x) ≤ [ C2 / (1 − 2√C1) ] ∫_z INFD(Φ, f, z) k(x, z) dz.

When C2 / (1 − 2√C1) ≤ 1, the infidelity of Φ_k is thus no larger than the smoothed infidelity of Φ, and ∫_z INFD(Φ, f, z) k(x, z) dz is usually very close to INFD(Φ, f, x). This shows that the smoothed explanation can be less sensitive and more faithful, which is validated in the experiments. Another direction for improving explanation sensitivity and infidelity is to retrain the model: we show in the appendix that adversarial training leads to less sensitive and more faithful gradient explanations.
5 Experiments
Setup. 
We perform our experiments on randomly selected images from MNIST, CIFAR-10, and ImageNet. In our comparisons, we restrict local variants of the explanations to MNIST, since the sensitivity of function values under pixel perturbations makes more sense for grayscale rather than color images. To calculate our infidelity measure, we use the noisy baseline perturbation for local variants of the explanations and the square removal perturbation for global variants, and use Monte Carlo sampling to estimate the measures. We use Grad, IG, GBP, and SHAP to denote vanilla gradient [37], integrated gradient [43], Guided Back-Propagation [41], and KernelSHAP [25] respectively, and add the postfix “-SG” when Smooth-Grad [40] is applied. We call the optimal explanations with respect to the Noisy Baseline and Square Removal perturbations NB and Square for simplicity.

Table 1: Sensitivity and Infidelity for local and global explanations.

(a) Results for local explanations on the MNIST dataset.

Method          SENS_MAX  INFD
Grad            0.86      4.12
Grad-SG         0.23      1.84
IG              0.77      2.75
IG-SG           0.22      1.52
GBP             0.85      4.13
GBP-SG          0.23      1.84
Noisy Baseline  0.35      0.51

(b) Results for global explanations on MNIST, CIFAR-10, and ImageNet.

            MNIST             CIFAR-10          ImageNet
Method      SENS_MAX  INFD    SENS_MAX  INFD    SENS_MAX  INFD
Grad        0.56      2.38    1.15      15.99   1.16      0.25
Grad-SG     0.28      1.89    1.15      13.94   0.59      0.24
IG          0.47      1.88    1.08      16.03   0.93      0.24
IG-SG       0.26      1.72    0.90      15.90   0.48      0.23
GBP         0.58      2.38    1.18      15.99   1.09      0.15
GBP-SG      0.29      1.88    1.15      13.93   0.41      0.15
SHAP        0.35      1.20    0.93      5.78    –         –
Square      0.24      0.46    0.99      2.27    1.33      0.04

Figure 1: Examples of explanations on ImageNet.

Figure 2: Examples of local explanations on MNIST.
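The Monte Carlo estimation of the infidelity measure used in these experiments can be sketched concisely. The following is an illustrative toy version, not the released code: it implements Definition 2.1 with the noisy-baseline perturbation for a quadratic predictor, for which the integrated gradient between the zero baseline and x equals x itself and attains near-zero infidelity, while the plain gradient 2x does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy black-box predictor.
    return float(np.sum(x ** 2))

def infidelity(phi, f, x, x0, noise=0.01, n_samples=2000):
    # Monte Carlo estimate of Definition 2.1 with the noisy-baseline
    # perturbation: I = x - z0, where z0 = x0 + eps, eps ~ N(0, noise^2).
    vals = []
    for _ in range(n_samples):
        z0 = x0 + noise * rng.normal(size=x.shape)
        I = x - z0
        vals.append((I @ phi - (f(x) - f(x - I))) ** 2)
    return float(np.mean(vals))

x = rng.normal(size=8)
x0 = np.zeros_like(x)
# For this quadratic f, IG between the baseline 0 and x equals x, and
# satisfies I^T phi = f(x) - f(x - I) when I = x.
print(infidelity(x, f, x, x0))        # near zero: only baseline noise remains
print(infidelity(2.0 * x, f, x, x0))  # plain gradient 2x: much larger
```

The square-removal variant only changes how I is sampled (uniform square patches instead of noisy baselines), with the estimator otherwise unchanged.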
We provide more exhaustive details of the experiments in the appendix.

Explanation Sensitivity and Infidelity. We show results comparing sensitivity and infidelity for local explanations on MNIST, and for global explanations on MNIST, CIFAR-10, and ImageNet, in Table 1. Recalling the discussion from Section 2.5, global explanations include a point-wise multiplication with the image minus the baseline, while local explanations do not. We observe that the noisy baseline and square removal optimal explanations achieve the lowest infidelity, which is as expected, since they explicitly optimize the corresponding infidelity. We also observe that Smooth-Grad improves both sensitivity and infidelity for all base explanations across all datasets, which corroborates the analysis in Section 4, and also addresses plausible criticisms of lowering sensitivity via smoothing: while one might expect such smoothing to increase infidelity, modest smoothing actually improves infidelity. We also perform a sanity-check experiment in which the perturbation follows that in SHAP (defined in Proposition 2.5), and verify that SHAP has the lowest infidelity for this perturbation.
In the appendix, we investigate how varying the smoothing radius for Smooth-Grad impacts the sensitivity and infidelity. We also provide an analysis of how adversarial training of robust networks can lower both sensitivity and infidelity (which is useful in the case where we can retrain the model), and validate in additional experiments that both measures are lowered.

Visualization. For a qualitative evaluation, we show several examples of global explanations on ImageNet, and of local explanations on MNIST. The explanations optimizing our infidelity measure with respect to the Square and Noisy Baseline (NB) perturbations show a cleaner saliency map, highlighting the actual object being classified, when compared to the other explanations. 
For example, Square is the only explanation that highlights the whole bannister in the second image of Figure 1. For local examples on MNIST, NB clearly shows the digits, as well as regions that would increase the prediction score if brightened, such as the region on top of the number 6, which gives more insight into the behavior of the model. We also observe that SG provides a cleaner set of explanations, which validates the experimental results in [40], as well as our analysis in Section 4. We provide a more complete set of visualization results with higher resolution in the appendix.

Figure 3: Examples of various explanations for the original model and the randomized model. More in appendix.

Figure 4: One example of explanations where the approximated ground truth is the right block (the model focuses on the text). Some explanations focus on both text and image, so that it might be difficult to infer the ground-truth feature from these explanations alone. More examples in appendix.

Human Evaluation. We perform a controlled experiment to validate whether the infidelity measure aligns with human intuitions in a setting where we have an approximated ground truth feature for our model, following the setting of [18]. We create a dataset of two classes (bird and frog), with the image of the bird or frog in one half of the overall image, and just the caption in the other half (as shown in Figure 4). The images are potentially noisy with noise probability p ∈ {0, 0.6}: when p = 0, the image always agrees with the caption, and when p = 0.6, we randomize the image 60 percent of the time to a random image of another class. We train two models which both achieve testing accuracy above 0.95, where one model relies only on the image and the other relies only on the caption (see footnote 7).
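The two-block dataset described above can be sketched as follows. In this minimal sketch, a constant-valued block stands in for the bird/frog image and for the rendered caption, and all names and sizes are our own illustrative simplifications of the actual image-plus-caption construction:

```python
import numpy as np

def make_sample(label, p, rng, size=8):
    """One synthetic two-block input: the left half carries the 'image'
    signal, the right half the 'caption' signal. With probability p the
    image half is flipped to the other class, so only the caption half
    stays reliable. Constant blocks stand in for real image content."""
    img_cls = label
    if rng.random() < p:                 # noisy image: flip the class
        img_cls = 1 - label
    img = np.full((size, size), float(img_cls))   # 'image' half
    cap = np.full((size, size), float(label))     # 'caption' half
    return np.concatenate([img, cap], axis=1), label

def make_dataset(n, p, seed=0):
    rng = np.random.default_rng(seed)
    data = [make_sample(rng.integers(2), p, rng) for _ in range(n)]
    X, y = zip(*data)
    return np.stack(X), np.array(y)
```

With p = 0 both halves agree with the label; with p = 0.6 only the caption half is always consistent, which is what drives the trained model to rely on the caption.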
We then show the original input with aligned image and text, the prediction result, along with the corresponding explanations of the model (among Grad, IG, Grad-SG, and OPT) to humans, and test how often humans are able to infer the approximated ground truth feature (image or caption) the model relies on. The optimal explanation (OPT) is the explanation that minimizes our infidelity measure with respect to perturbation I defined as the right half or the left half of the image (since the location of the caption is in one half of the overall image in our case; but note that in more general settings, we could simply use a caption bounding box detector to specify our perturbations). Our human study includes 2 models, 4 explanations, and 16 test users, where each test user did a series of 8 tasks (2 models × 4 explanations) on random images. We report the average human accuracy and the infidelity measure for each explanation method in Table 3. We observe that, unsurprisingly, OPT has the best infidelity score by construction, and we also observe that the infidelity aligns with the human evaluation results in general. This suggests that a faithful explanation communicates the important feature better in this setting, which validates the usefulness of the objective measure.

Sanity Check. Recent work in the interpretable machine learning literature [12, 1] has strongly argued for the importance of performing sanity checks on whether the explanation is at least loosely related to the model. Here, we conduct the sanity check proposed by Adebayo et al. [1], to check if explanations look different when the network being explained is randomly perturbed. One might expect that explanations that minimize infidelity will naturally be faithful to the model, and consequently pass this sanity check.
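The randomization check and the rank-correlation score it reports can be sketched as follows. This is a minimal NumPy sketch: `explainer` is any map from (model, input) to a saliency map, the weight-randomized model is assumed to be supplied by the caller, and ties in the rank computation are not averaged as a full Spearman implementation would:

```python
import numpy as np

def rank_corr(a, b):
    """Spearman-style rank correlation between two flattened saliency
    maps (ties are broken by position, not averaged)."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v), dtype=float)
        r[order] = np.arange(len(v))
        return r
    ra, rb = ranks(a.ravel()), ranks(b.ravel())
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / (np.linalg.norm(ra) * np.linalg.norm(rb)))

def sanity_check(explainer, model, randomized_model, x, use_abs=False):
    """Compare an explanation of the trained model with that of a
    weight-randomized copy; a high correlation means the explanation is
    insensitive to the model and fails the check."""
    e1 = explainer(model, x)
    e2 = explainer(randomized_model, x)
    if use_abs:
        e1, e2 = np.abs(e1), np.abs(e2)
    return rank_corr(e1, e2)
```

A low correlation (explanations changing under randomization) is the desired outcome; taking absolute values before comparing tends to inflate the correlation, consistent with the results reported below.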
We show visualizations for various explanations (with and without absolute values) of predictions by a pretrained Resnet-50 model and a randomized Resnet-50 model, where the final fully connected layer is randomized, in Figure 3. We also report the average rank correlation of the explanations for the original model and the randomized model in Table 2. All explanations without the absolute value pass the sanity check, but the rank correlation between the original model and the randomized model is high for explanations with the absolute value. In this case, Square has the lowest rank correlation and its visualizations for the two models look the most distinct, which supports the hypothesis that an explanation with low infidelity is also faithful to the model. More examples are included in the appendix.

Table 2: Correlation of the explanations between the original model and the randomized model for the sanity check.

Method      Grad  Grad-SG  IG    IG-SG  Square
Corr        0.13  0.17     0.18  0.16   0.10
Corr (abs)  0.57  0.62     0.61  0.62   0.28

Table 3: The infidelity, and the accuracy with which humans are able to predict the input block used by the model, based on the explanations.

Method  Grad  Grad-SG  IG    OPT
Infid.  0.55  0.35     0.38  0.00
Acc.    0.47  0.53     0.50  0.88

Footnote 7: When p = 0, the trained model relies solely on the image (the accuracy for image-only input is 0.9, but the accuracy for caption-only input is 0.5). When p = 0.6, the trained model relies only on the caption (the accuracy for caption-only input is 0.98, but the accuracy for image-only input is 0.5).

6 Related Work

Our work focuses on placing attribution-based explanations on an objective footing. We begin with a brief and necessarily incomplete review of recent explanation mechanisms, and then discuss recent approaches to place these on an objective footing. While attribution-based explanations are the most popular form of explanations, other types of explanations do exist. Sample-based explanation methods attribute the decision of the model to previously observed samples [21, 45, 17]. Concept-based explanation methods seek to explain the decision of the model by high-level human concepts [18, 14, 6]. However, attribution-based explanations have the advantage that they are generally applicable to a wide range of tasks and are easy to understand. Among attribution-based explanations, perturbation-based attributions measure the prediction difference after perturbing a set of features. Zeiler & Fergus [47] use such perturbations with grey patch occlusions on CNNs. This was further improved by [49, 7] by including a generative model, similar in spirit to counterfactual visual explanations [15]. Gradient-based attribution explanations [5, 38, 47, 41, 35] range from explicit gradients to variants that leverage back-propagation to address some caveats with simple gradients. As shown in [3], many recent explanations such as ε-LRP [4], DeepLIFT [37], and Integrated Gradients [43] can be seen as variants of gradient explanations. There are also approaches that average feature importance weights by varying the active subsets of the set of input features (e.g. over the power set of the set of all features), which has roots in cooperative game theory and revenue division [11, 25].

Among works that place these explanations on a more objective footing are those that focus on improving the sensitivity of explanations. To reduce the noise in gradient saliency maps, Kindermans et al. [19] propose to calculate the signal in the image by removing distractors.
SmoothGrad [40] generates noisy images via additive Gaussian noise and averages the gradients of the sampled images. Another form of sensitivity analysis, proposed by Ribeiro et al. [32], approximates the behavior of a complex model by a locally linear interpretable model, which has been extended by [46, 30] in different domains. The reliability of these attribution explanations is a key problem of interest. Adebayo et al. [1] have shown that several saliency methods are insensitive to random perturbations in the parameter space, generating the same saliency maps even when the parameters are randomized. Ghorbani et al. [13] and Zhang et al. [48] show that it is possible to generate a perceptively indistinguishable image that changes the saliency explanations significantly. In this work, we show that the optimal explanation that optimizes fidelity passes the sanity check in [1], and that smoothing explanations with SmoothGrad [40] lowers both the sensitivity and the infidelity of explanations, which sheds light on how to generate more robust explanations that do not degrade fidelity, addressing these concerns for saliency explanations. There are also works that propose objective evaluations for saliency explanations. Montavon et al. [28] use explanation continuity as an objective measure of explanations, and observe that discontinuities may occur for gradient-based explanations, while variants such as deep Taylor LRP [4] can achieve continuous explanations, as compared to simple gradient explanations. Samek et al. [34] evaluate explanations by the area over the perturbation curve while removing the most salient features. Dabkowski & Gal [10] use object localisation metrics to evaluate the closeness of the saliency map and the actual object. Kindermans et al. [20] posit that a good explanation should fulfill input invariance. Hooker et al.
[16] propose to remove salient features and retrain the model to evaluate explanations.

7 Conclusion

We propose two objective evaluation metrics, naturally termed infidelity and sensitivity, for machine learning explanations. One of our key contributions is to show that a large number of existing explanations can be unified, as they all optimize the infidelity with respect to various perturbations. We then show that the explanation that optimizes the infidelity can be seen as a combination of two existing explanation methods, with a kernel determined by the perturbation. We then propose two perturbations and their respective optimal explanations as new explanations. Another key contribution of the paper is that we show, both theoretically and empirically, that there need not exist a trade-off between sensitivity and infidelity, as we may improve both the sensitivity and the infidelity of explanations with the right amount of smoothing. Finally, we validate that our infidelity measurement aligns with human evaluation in a setting where the ground truth of explanations is given.

Acknowledgements We acknowledge the support of DARPA via FA87501720152, and Accenture.

References

[1] Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.

[2] Alvarez-Melis, D. and Jaakkola, T. S. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018.

[3] Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. A unified view of gradient-based attribution methods for deep neural networks. International Conference on Learning Representations, 2018.

[4] Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation.
PloS one, 10(7):e0130140, 2015.

[5] Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., and Müller, K.-R. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.

[6] Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6541–6549, 2017.

[7] Chang, C.-H., Creager, E., Goldenberg, A., and Duvenaud, D. Explaining image classifiers by counterfactual generation. In International Conference on Learning Representations, 2019.

[8] Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. Learning to explain: An information-theoretic perspective on model interpretation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 882–891, 2018.

[9] Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.

[10] Dabkowski, P. and Gal, Y. Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6967–6976, 2017.

[11] Datta, A., Sen, S., and Zick, Y. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pp. 598–617. IEEE, 2016.

[12] Doshi-Velez, F. and Kim, B. A roadmap for a rigorous science of interpretability. CoRR, abs/1702.08608, 2017.

[13] Ghorbani, A., Abid, A., and Zou, J. Interpretation of neural networks is fragile. AAAI, 2019.

[14] Ghorbani, A., Wexler, J., and Kim, B. Automating interpretability: Discovering and testing visual concepts learned by neural networks.
arXiv preprint arXiv:1902.03129, 2019.

[15] Goyal, Y., Wu, Z., Ernst, J., Batra, D., Parikh, D., and Lee, S. Counterfactual visual explanations. CoRR, abs/1904.07451, 2019.

[16] Hooker, S., Erhan, D., Kindermans, P.-J., and Kim, B. Evaluating feature importance estimates. arXiv preprint arXiv:1806.10758, 2018.

[17] Khanna, R., Kim, B., Ghosh, J., and Koyejo, O. Interpreting black box predictions using fisher kernels. arXiv preprint arXiv:1810.10118, 2018.

[18] Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pp. 2673–2682, 2018.

[19] Kindermans, P.-J., Schütt, K. T., Alber, M., Müller, K.-R., and Dähne, S. PatternNet and PatternLRP: improving the interpretability of neural networks. International Conference on Learning Representations, 2018.

[20] Kindermans, P.-J., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. The (un)reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 267–280. Springer, 2019.

[21] Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885–1894, 2017.

[22] Kulesza, T., Burnett, M., Wong, W.-K., and Stumpf, S. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces, pp. 126–137. ACM, 2015.

[23] Lee, G.-H., Alvarez-Melis, D., and Jaakkola, T. S. Towards robust, locally linear deep networks. In International Conference on Learning Representations, 2019.

[24] Liu, X., Cheng, M., Zhang, H., and Hsieh, C.-J. Towards robust neural networks via random self-ensemble.
In Proceedings of the European Conference on Computer Vision (ECCV), pp. 369–385, 2018.

[25] Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774, 2017.

[26] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[27] Miller, T. Explanation in artificial intelligence: Insights from the social sciences. arXiv preprint arXiv:1706.07269, 2017.

[28] Montavon, G., Samek, W., and Müller, K.-R. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 2017.

[29] Petsiuk, V., Das, A., and Saenko, K. RISE: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421, 2018.

[30] Plumb, G., Molitor, D., and Talwalkar, A. S. Model agnostic supervised local explanations. In Advances in Neural Information Processing Systems, pp. 2515–2524, 2018.

[31] Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.

[32] Ribeiro, M. T., Singh, S., and Guestrin, C. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.

[33] Ross, A. S. and Doshi-Velez, F. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. arXiv preprint arXiv:1711.09404, 2017.

[34] Samek, W., Binder, A., Montavon, G., Lapuschkin, S., and Müller, K.-R. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, 2016.

[35] Selvaraju, R.
R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Conference on Computer Vision, 2017.

[36] Shrikumar, A., Greenside, P., Shcherbina, A., and Kundaje, A. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713, 2016.

[37] Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. International Conference on Machine Learning, 2017.

[38] Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[39] Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.

[40] Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

[41] Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

[42] Štrumbelj, E. and Kononenko, I. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, 2014.

[43] Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning, 2017.

[44] Wong, E. and Kolter, Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283–5292, 2018.

[45] Yeh, C., Kim, J. S., Yen, I. E., and Ravikumar, P. Representer point selection for explaining deep neural networks.
In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 9311–9321, 2018.

[46] Ying, R., Bourgeois, D., You, J., Zitnik, M., and Leskovec, J. GNNExplainer: A tool for post-hoc explanation of graph neural networks. arXiv preprint arXiv:1903.03894, 2019.

[47] Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.

[48] Zhang, X., Wang, N., Ji, S., Shen, H., and Wang, T. Interpretable deep learning under fire. arXiv preprint arXiv:1812.00891, 2018.

[49] Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.