{"title": "Explanations can be manipulated and geometry is to blame", "book": "Advances in Neural Information Processing Systems", "page_first": 13589, "page_last": 13600, "abstract": "Explanation methods aim to make neural networks more trustworthy and interpretable. In this paper, we demonstrate a property of explanation methods which is disconcerting for both of these purposes. Namely, we show that explanations can be manipulated arbitrarily by applying visually hardly perceptible perturbations to the input that keep the network's output approximately constant. We establish theoretically that this phenomenon can be related to certain geometrical properties of neural networks. This allows us to derive an upper bound on the susceptibility of explanations to manipulations. Based on this result, we propose effective mechanisms to enhance the robustness of explanations.", "full_text": "Explanations can be manipulated\n\nand geometry is to blame\n\nAnn-Kathrin Dombrowski1, Maximilian Alber5, Christopher J. Anders1,\n\nMarcel Ackermann2, Klaus-Robert M\u00fcller1,3,4, Pan Kessel1\n\n1Machine Learning Group, Technische Universit\u00e4t Berlin, Germany\n\n2Department of Video Coding & Analytics, Fraunhofer Heinrich-Hertz-Institute, Berlin, Germany\n\n3Max-Planck-Institut f\u00fcr Informatik, Saarbr\u00fccken, Germany\n\n4Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea\n\n5Charit\u00e9 Berlin, Berlin, Germany\n\nAbstract\n\nExplanation methods aim to make neural networks more trustworthy and inter-\npretable. In this paper, we demonstrate a property of explanation methods which is\ndisconcerting for both of these purposes. Namely, we show that explanations can\nbe manipulated arbitrarily by applying visually hardly perceptible perturbations\nto the input that keep the network\u2019s output approximately constant. We establish\ntheoretically that this phenomenon can be related to certain geometrical properties\nof neural networks. 
This allows us to derive an upper bound on the susceptibil-\nity of explanations to manipulations. Based on this result, we propose effective\nmechanisms to enhance the robustness of explanations.\n\nFigure 1: Original image with corresponding explanation map on the left. Manipulated image with\nits explanation on the right. The chosen target explanation was an image with a text stating \"this\nexplanation was manipulated\".\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOriginalImageManipulatedImage\f1\n\nIntroduction\n\nExplanation methods have attracted signi\ufb01cant attention over the last years due to their promise to\nopen the black box of deep neural networks. Interpretability is crucial for scienti\ufb01c understanding\nand safety critical applications.\nExplanations can be provided in terms of explanation maps[1\u201320] that visualize the relevance\nattributed to each input feature for the overall classi\ufb01cation result. In this work, we establish that\nthese explanation maps can be changed to an arbitrary target map. This is done by applying a visually\nhardly perceptible perturbation to the input. We refer to Figure 1 for an example. This perturbation\ndoes not change the output of the neural network, i.e. in addition to the classi\ufb01cation result also the\nvector of all class probabilities is (approximately) the same.\nThis \ufb01nding is clearly problematic if a user, say a medical doctor, is expecting a robustly interpretable\nexplanation map to rely on in the clinical decision making process.\nMotivated by this unexpected observation, we provide a theoretical analysis that establishes a\nrelation of this phenomenon to the geometry of the neural network\u2019s output manifold. This novel\nunderstanding allows us to derive a bound on the degree of possible manipulation of the explanation\nmap. 
This bound is proportional to two differential geometric quantities: the principle curvatures\nand the geodesic distance between the original input and its manipulated counterpart. Given this\ntheoretical insight, we propose ef\ufb01cient ways to limit possible manipulations and thus enhance\nresilience of explanation methods.\nIn summary, this work provides the following key contributions:\n\n\u2022 We propose an algorithm which allows to manipulate an image with a hardly perceptible\nperturbation such that the explanation matches an arbitrary target map. We demonstrate its\neffectiveness for six different explanation methods and on four network architectures as well\nas two datasets.\n\n\u2022 We provide a theoretical understanding of this phenomenon for gradient-based methods\nin terms of differential geometry. We derive a bound on the principle curvatures of the\nhypersurface of equal network output. This implies a constraint on the maximal change of\nthe explanation map due to small perturbations.\n\n\u2022 Using these insights, we propose methods to undo the manipulations and increase the\nrobustness of explanation maps by smoothing the explanation method. We demonstrate\nexperimentally that smoothing leads to increased robustness not only for gradient but also\nfor propagation-based methods.\n\n1.1 Related work\n\nIn [21], it was demonstrated that explanation maps can be sensitive to small perturbations in the image.\nThe authors apply perturbations to the image which lead to an unstructured change in the explanation\nmap. Speci\ufb01cally, their approach can increase the overall sum of relevances in a certain region of\nthe explanation map. Our work focuses on structured manipulations instead, i.e. to reproduce a\ngiven target map on a pixel-by-pixel basis. Furthermore, their attacks only keep the classi\ufb01cation\nresult the same which often leads to signi\ufb01cant changes in the network output. 
From their analysis,\nit is therefore not clear whether the explanation or the network is vulnerable (and the explanation\nmap simply re\ufb02ects the relevance of the perturbation faithfully). Our method keeps the output of\nthe network (approximately) constant. We furthermore provide a theoretical analysis in terms of\ndifferential geometry and propose effective defense mechanisms. Another approach [22] adds a\nconstant shift to the input image, which is then eliminated by changing the bias of the \ufb01rst layer.\nFor some methods, this leads to a change in the explanation map. Contrary to our approach, this\nrequires to change the network\u2019s biases. In [23], explanation maps are changed by randomization of\n(some of) the network weights and in [24] the complete network is \ufb01ne-tuned to produce manipulated\nexplanations while the accuracy remains high. These two approaches are different from our method\nas they do not aim to change the explanation to a speci\ufb01c target explanation map and modify the\nparameters of the network. In [25, 26], it is proposed to bound the (local) Lipschitz constant of the\nexplanation. This has the disadvantage that explanations become insensitive to any small perturbation,\ne.g. even those which lead to a substantial change in network output. This is clearly undesirable as\nthe explanation should re\ufb02ect why the perturbation leads to such a drastic change of the network\u2019s\n\n2\n\n\fcon\ufb01dence. In this work, we therefore propose to only bound the curvature of the hypersurface of\nequal network output.\n\n2 Manipulation of explanations\n\n2.1 Explanation methods\nWe consider a neural network g : Rd \u2192 RK with relu non-linearities which classi\ufb01es an image\nx \u2208 Rd in K categories with the predicted class given by k = arg maxi g(x)i. 
The explanation map is denoted by h : R^d → R^d and associates an image with a vector of the same dimension whose components encode the relevance score of each pixel for the neural network’s prediction.
Throughout this paper, we will use the following explanation methods:

• Gradient: The map h(x) = ∂g/∂x (x) is used and quantifies how infinitesimal perturbations in each pixel change the prediction g(x) [2, 1].

• Gradient × Input: This method uses the map h(x) = x ⊙ ∂g/∂x (x) [14]. For linear models, this measure gives the exact contribution of each pixel to the prediction.

• Integrated Gradients: This method defines h(x) = (x − x̄) ⊙ ∫_0^1 ∂g(x̄ + t(x − x̄))/∂x dt, where x̄ is a suitable baseline. See the original reference [13] for more details.

• Guided Backpropagation (GBP): This method is a variation of the gradient explanation for which negative components of the gradient are set to zero while backpropagating through the non-linearities [4].

• Layer-wise Relevance Propagation (LRP): This method [5, 16] propagates relevance backwards through the network. For the output layer, relevance is defined by¹

R^L_i = δ_{i,k} ,    (1)

which is then propagated backwards through all layers but the first using the z⁺ rule

R^l_i = Σ_j [ x^l_i (W^l)⁺_{ji} / Σ_{i′} x^l_{i′} (W^l)⁺_{ji′} ] R^{l+1}_j ,    (2)

where (W^l)⁺ denotes the positive weights of the l-th layer and x^l is the activation vector of the l-th layer. 
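The maps above can be made concrete in a few lines of code. The following sketch is purely illustrative (a toy quadratic model and hypothetical function names, not the implementation released with this paper): it computes the Gradient, Gradient × Input and Integrated Gradients maps and performs one z⁺ backward step of LRP, checking the completeness property of Integrated Gradients and the relevance conservation of the z⁺ rule.

```python
import numpy as np

# Toy stand-in for a network output: g(x) = x^T A x, with analytic gradient.
A = np.array([[2.0, 0.5], [0.5, 1.0]])

def g(x):
    return float(x @ A @ x)

def grad(x):
    """Gradient explanation h(x) = dg/dx."""
    return (A + A.T) @ x

def gradient_x_input(x):
    """Gradient x Input explanation h(x) = x * dg/dx."""
    return x * grad(x)

def integrated_gradients(x, baseline, steps=100):
    """h(x) = (x - xbar) * integral_0^1 dg/dx(xbar + t(x - xbar)) dt, midpoint rule."""
    ts = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([grad(baseline + t * (x - baseline)) for t in ts], axis=0)
    return (x - baseline) * avg_grad

def lrp_zplus_step(R_upper, x_lower, W):
    """One backward step of the z+ rule, Eq. (2); W maps lower -> upper, shape (J, I)."""
    Wp = np.maximum(W, 0.0)                      # (W^l)^+: positive weights only
    z = Wp @ x_lower                             # denominators, one per upper neuron j
    s = np.where(z > 0, R_upper / np.where(z > 0, z, 1.0), 0.0)
    return x_lower * (Wp.T @ s)                  # R^l_i

x, xbar = np.array([1.0, -2.0]), np.zeros(2)
ig = integrated_gradients(x, xbar)               # completeness: ig.sum() == g(x) - g(xbar)

acts = np.array([1.0, 2.0, 0.5])                 # relu activations are non-negative
W = np.array([[0.3, -0.2, 0.4], [0.1, 0.5, -0.3]])
R_low = lrp_zplus_step(np.array([0.0, 1.0]), acts, W)  # conservation: sums preserved
```

For the quadratic toy model the midpoint rule integrates the (linear) gradient exactly, so completeness holds to machine precision; conservation of the z⁺ rule holds whenever all denominators are positive.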
For the first layer, we use the z^B rule to account for the bounded input domain

R^0_i = Σ_j [ (x^0_i W^0_{ji} − l_i (W^0)⁺_{ji} − h_i (W^0)⁻_{ji}) / Σ_{i′} (x^0_{i′} W^0_{ji′} − l_{i′} (W^0)⁺_{ji′} − h_{i′} (W^0)⁻_{ji′}) ] R^1_j ,    (3)

where l_i and h_i are the lower and upper bounds of the input domain respectively.

• Pattern Attribution (PA): This method is equivalent to standard backpropagation upon element-wise multiplication of the weights W^l with learned patterns A^l. We refer to the original publication for more details [17].

These methods cover two classes of attribution methods, namely gradient-based and propagation-based explanations, and are frequently used in practice [27, 28].

2.2 Manipulation Method

For a given explanation method and specified target ht ∈ R^d, a manipulated image xadv = x + δx has the following properties:

1. The output of the network stays approximately constant, i.e. g(xadv) ≈ g(x).
2. The explanation is close to the target map, i.e. h(xadv) ≈ ht.
3. The norm of the perturbation δx added to the input image is small, i.e. ‖δx‖ = ‖xadv − x‖ ≪ 1, and therefore not perceptible.

¹Here we use the Kronecker symbol δ_{i,k} = 1 for i = k and δ_{i,k} = 0 for i ≠ k.

Figure 2: The explanation map of the cat is used as the target and the image of the dog is perturbed. The red box contains the manipulated images and the corresponding explanations. The first column corresponds to the original explanations of the unperturbed dog image. The target map, shown in the second column, is the corresponding explanation of the cat image. The last column visualizes the perturbations.

We obtain such manipulations by optimizing the loss function

L = ‖h(xadv) − ht‖² + γ ‖g(xadv) − g(x)‖² ,    (4)

with respect to xadv using gradient descent. We clamp xadv after each iteration so that it is a valid image. The first term in the loss function (4) ensures that the manipulated explanation map is close to the target while the second term encourages the network to have the same output. The relative weighting of these two summands is controlled by the hyperparameter γ ∈ R⁺.
Our method therefore requires us to calculate the gradient ∇h(x) of the explanation with respect to the input. For relu-networks, this gradient often depends on the vanishing second derivative of the non-linearities, which leads to problems during optimization of the loss (4). As an example, the gradient method leads to

∂‖h(xadv) − ht‖² / ∂xadv ∝ ∂h/∂xadv = ∂²g/∂xadv² ∝ relu″ = 0 .

We therefore replace the relu with softplus non-linearities

softplus_β(x) = (1/β) log(1 + e^{βx}) .    (5)

Figure 3: Left: Similarity measures between target ht and manipulated explanation map h(xadv). Right: Similarity measures between original x and perturbed image xadv. For SSIM and PCC large values indicate high similarity while for MSE small values correspond to similar images. 
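The optimization can be sketched as follows. This is an illustrative toy implementation under stated assumptions (a one-hidden-layer softplus network, a closed-form Hessian, and a backtracking step size), not the code used for the experiments; on real networks the second derivative would be obtained by automatic differentiation and xadv would additionally be clamped to the valid image range.

```python
import numpy as np

rng = np.random.default_rng(1)
W, v, beta = rng.normal(size=(8, 4)), rng.normal(size=8), 5.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(x):
    """Scalar toy network output g(x) = v^T softplus_beta(W x)."""
    return float(v @ (np.log1p(np.exp(beta * (W @ x))) / beta))

def h(x):
    """Gradient explanation h(x) = dg/dx."""
    return W.T @ (v * sigmoid(beta * (W @ x)))

def hessian(x):
    """d^2 g / dx^2, needed to differentiate h(x); non-zero thanks to softplus."""
    s = sigmoid(beta * (W @ x))
    return (W.T * (beta * v * s * (1.0 - s))) @ W

def expl_loss(z, h_t, g0, gamma=1.0):
    """The loss of Eq. (4)."""
    return float(np.sum((h(z) - h_t) ** 2) + gamma * (g(z) - g0) ** 2)

def attack(x, h_t, gamma=1.0, lr=0.01, steps=500):
    g0, x_adv = g(x), x.copy()
    for _ in range(steps):
        grad = 2 * hessian(x_adv) @ (h(x_adv) - h_t) \
             + 2 * gamma * (g(x_adv) - g0) * h(x_adv)
        cand = x_adv - lr * grad
        if expl_loss(cand, h_t, g0, gamma) < expl_loss(x_adv, h_t, g0, gamma):
            x_adv = cand                # accept only improving steps ...
        else:
            lr *= 0.5                   # ... otherwise shrink the step size
    return x_adv

x0 = rng.normal(size=4)
h_t = h(rng.normal(size=4))             # target: the explanation of another input
x_adv = attack(x0, h_t)                 # the loss (4) is driven down from x0
```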
For fair\ncomparison, we use the same 100 randomly selected images for each explanation method.\n\nFor large \u03b2 values, the softplus approximates the relu closely but has a well-de\ufb01ned second derivative.\nAfter optimization is complete, we test the manipulated image with the original relu network.\nSimilarity metrics: In our analysis, we assess the similarity between both images and explanation\nmaps. To this end, we use three metrics following [23]: the structural similarity index (SSIM), the\nPearson correlation coef\ufb01cient (PCC) and the mean squared error (MSE). SSIM and PCC are relative\nsimilarity measures with values in [0, 1], where larger values indicate high similarity. The MSE is an\nabsolute error measure for which values close to zero indicate high similarity. We normalize the sum\nof the explanation maps to be one and the images to have values between 0 and 1.\n\n2.3 Experiments\n\nTo evaluate our approach, we apply our algorithm to 100 randomly selected images for each explana-\ntion method. We use a pre-trained VGG-16 network [29] and the ImageNet dataset [30]. For each run,\nwe randomly select two images from the test set. One of the two images is used to generate a target\nexplanation map ht. The other image is perturbed by our algorithm with the goal of replicating the\ntarget ht using a few hundred iterations of gradient descent. We sum over the absolute values of the\nchannels of the explanation map to get the relevance per pixel. Further details about the experiments\nare summarized in Supplement A.\nQualitative analysis: Our method is illustrated in Figure 2 in which a dog image is manipulated\nin order to have an explanation resembling a cat. For all explanation methods, the target is closely\nemulated and the perturbation of the dog image is small. 
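PCC and MSE, together with the normalization described above, can be sketched in a few lines (names are illustrative; SSIM needs a windowed implementation, e.g. from an image-processing library, and is omitted here):

```python
import numpy as np

def normalize_map(h):
    """Normalize an explanation map so that its entries sum to one."""
    return h / h.sum()

def pcc(a, b):
    """Pearson correlation coefficient of two (flattened) maps."""
    a, b = a.ravel() - a.mean(), b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mse(a, b):
    """Mean squared error of two maps."""
    return float(np.mean((a - b) ** 2))

m1 = np.linspace(0.1, 1.0, 16).reshape(4, 4)
m2 = 0.5 * m1 + 0.2                    # affine rescaling: perfectly correlated
```

As the example shows, PCC is invariant under affine rescaling of a map, which is why the maps are additionally normalized before the absolute MSE comparison.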
More examples can be found in the supplement.
Quantitative analysis: Figure 3 shows similarity measures between the target ht and the manipulated explanation map h(xadv) as well as between the original image x and the perturbed image xadv.² All considered metrics show that the perturbed images have an explanation closely resembling the targets. At the same time, the perturbed images are very similar to the corresponding original images. We also verified by visual inspection that the results look very similar. We have uploaded the results of all runs so that interested readers can assess their similarity themselves³ and provide code⁴ to reproduce them. In addition, the output of the neural network is approximately unchanged by the perturbations, i.e. the classification of all examples is unchanged and the median of ‖g(xadv) − g(x)‖ is of the order of magnitude 10⁻³ for all methods. See Supplement B for further details.
Other architectures and datasets: We checked that comparable results are obtained for ResNet-18 [31], AlexNet [32] and DenseNet-121 [33]. Moreover, we also successfully tested our algorithm on the CIFAR-10 dataset [34]. We refer to the Supplement C for further details.

²Throughout this paper, boxes denote 25th and 75th percentiles, whiskers denote 10th and 90th percentiles, and solid lines show the medians.

3 Theoretical considerations

In this section, we analyze the vulnerability of explanations theoretically. We argue that this phenomenon can be related to the large curvature of the output manifold of the neural network. 
We focus on the gradient method, starting with an intuitive discussion before developing mathematically precise statements.
We have demonstrated that one can drastically change the explanation map while keeping the output of the neural network constant,

g(x + δx) = g(x) = c ,    (6)

using only a small perturbation in the input δx. The perturbed image xadv = x + δx therefore lies on the hypersurface of constant network output S = {p ∈ R^d | g(p) = c}.⁵ We can exclusively consider the winning class output, i.e. g(x) := g(x)_k with k = arg max_i g(x)_i, because the gradient method only depends on this component of the output. Therefore, the hypersurface S is of co-dimension one.
The gradient ∇g for every p ∈ S is normal to this hypersurface. The fact that the normal vector ∇g can be drastically changed by slightly perturbing the input along the hypersurface S suggests that the curvature of S is large.
While the latter statement may seem intuitive, it requires non-trivial concepts of differential geometry to make it precise, in particular the notion of the second fundamental form. We will briefly summarize these concepts in the following (see e.g. [35] for a standard textbook). To this end, it is advantageous to consider a normalized version of the gradient method

n(x) = ∇g(x) / ‖∇g(x)‖ .    (7)

This normalization is merely conventional as it does not change the relative importance of any pixel with respect to the others. For any point p ∈ S, we define the tangent space TpS as the vector space spanned by the tangent vectors γ̇(0) = (d/dt) γ(t)|_{t=0} of all possible curves γ : R → S with γ(0) = p. For u, v ∈ TpS, we denote their inner product by ⟨u, v⟩. 
For any u ∈ TpS, the directional derivative of a function f is uniquely defined for any choice of γ by

D_u f(p) = (d/dt) f(γ(t))|_{t=0}   with   γ(0) = p and γ̇(0) = u.    (8)

We then define the Weingarten map as⁶

L : TpS → TpS ,   u ↦ −D_u n(p) ,

where the unit normal n(p) can be written as (7). This map quantifies how much the unit normal changes as we infinitesimally move away from p in the direction u. The second fundamental form is then given by

L : TpS × TpS → R ,   (u, v) ↦ −⟨v, L(u)⟩ = ⟨v, D_u n(p)⟩ .

³https://drive.google.com/drive/folders/1TZeWngoevHRuIw6gb5CZDIRrc7EWf5yb?usp=sharing
⁴https://github.com/pankessel/adv_explanation_ref
⁵It is sufficient to consider the hypersurface S in a neighbourhood of the unperturbed input x.
⁶The fact that D_u n(p) ∈ TpS follows by taking the directional derivative with respect to u on both sides of ⟨n, n⟩ = 1.

It can be shown that the second fundamental form is bilinear and symmetric, L(u, v) = L(v, u). It is therefore diagonalizable with real eigenvalues λ1, . . . , λ_{d−1}, which are called principle curvatures.
We have therefore established the remarkable fact that the sensitivity of the gradient map (7) is described by the principle curvatures, a key concept of differential geometry.
In particular, this allows us to derive an upper bound on the maximal change of the gradient map h(x) = n(x) as we move slightly on S. To this end, we define the geodesic distance dg(p, q) of two points p, q ∈ S as the length of the shortest curve on S connecting p and q. 
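These notions can be verified numerically on a toy example with known curvatures. For the illustrative choice g(x) = ‖x‖² (standing in for the winning-class output), the hypersurface S is a sphere of radius r and all principle curvatures have magnitude 1/r. The sketch below estimates the Weingarten map by finite differences of the normalized gradient n(x) and projects it onto the tangent space (with this sign convention the sphere's curvatures come out as −1/r; only the magnitude matters for the bound below):

```python
import numpy as np

def grad_g(x):
    """Gradient of the toy output g(x) = ||x||^2."""
    return 2.0 * x

def n(x):
    """Normalized gradient map, Eq. (7)."""
    gr = grad_g(x)
    return gr / np.linalg.norm(gr)

def weingarten_eigenvalues(p, eps=1e-5):
    """Eigenvalues of L(u) = -D_u n(p), restricted to the tangent space T_pS."""
    d = p.size
    J = np.zeros((d, d))                      # column k approximates D_{e_k} n(p)
    for k in range(d):
        e = np.zeros(d)
        e[k] = eps
        J[:, k] = (n(p + e) - n(p - e)) / (2.0 * eps)
    P = np.eye(d) - np.outer(n(p), n(p))      # projector onto T_pS
    L = -P @ J @ P                            # Weingarten map in ambient coordinates
    return np.linalg.eigvalsh(0.5 * (L + L.T))

p = np.array([3.0, 0.0, 0.0, 0.0])            # a point on the sphere S of radius 3
ev = weingarten_eigenvalues(p)                # d-1 curvatures of magnitude 1/3, one ~0
```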
In the supplement, we show that:

Theorem 1 Let g : R^d → R be a network with softplus_β non-linearities and let U_ε(p) = {x ∈ R^d ; ‖x − p‖ < ε} be an environment of a point p ∈ S such that U_ε(p) ∩ S is fully connected. Let g have bounded derivatives ‖∇g(x)‖ ≥ c for all x ∈ U_ε(p) ∩ S. It then follows for all p0 ∈ U_ε(p) ∩ S that

‖h(p) − h(p0)‖ ≤ |λmax| dg(p, p0) ≤ β C dg(p, p0) ,    (9)

where λmax is the principle curvature with the largest absolute value for any point in U_ε(p) ∩ S and the constant C > 0 depends on the weights of the neural network.

This theorem can intuitively be motivated as follows: for relu non-linearities, the lines of equal network output are piece-wise linear and therefore have kinks, i.e. points of divergent curvature. These relu non-linearities are well approximated by softplus non-linearities (5) with large β. Reducing β smoothes out the kinks and therefore leads to reduced maximal curvature, i.e. |λmax| ≤ β C. For each point on the geodesic curve connecting p and p0, the normal can at worst be affected by the maximal curvature, i.e. the change in explanation is bounded by |λmax| dg(p, p0).
There are two important lessons to be learnt from this theorem. Firstly, the geodesic distance can be substantially greater than the Euclidean distance for curved manifolds. In this case, inputs which are very similar to each other, i.e. whose Euclidean distance is small, can have explanations that are drastically different. Secondly, the upper bound is proportional to the β parameter of the softplus non-linearity. 
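The scaling |λmax| ≤ β C can be illustrated in one dimension (a sketch of the intuition, not of the proof in the supplement): the second derivative of a single softplus unit is softplus_β″(x) = β σ(βx)(1 − σ(βx)), whose maximum is exactly β/4, so the maximal curvature grows linearly in β and diverges in the relu limit β → ∞.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus_second_derivative(x, beta):
    """d^2/dx^2 softplus_beta(x) = beta * sigma(beta x) * (1 - sigma(beta x))."""
    s = sigmoid(beta * x)
    return beta * s * (1.0 - s)

xs = np.linspace(-10.0, 10.0, 10001)       # grid containing the maximizer x = 0
max_curv = {beta: softplus_second_derivative(xs, beta).max()
            for beta in (1.0, 5.0, 20.0)}
# The maximum is attained at x = 0 and equals beta / 4: linear growth in beta.
```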
Therefore, smaller values of β provably result in increased robustness with respect to manipulations.

Figure 4: Left: β dependence of the correlations of the manipulated explanation (here Gradient and LRP) with the target and original explanation. Lines denote the medians; 10th and 90th percentiles are shown in semitransparent colour. Center and Right: network input and the respective explanation maps as β is decreased for Gradient (center) and LRP (right).

4 Robust explanations

Using the fact that the upper bound of the last section is proportional to the β parameter of the softplus non-linearities, we propose β-smoothing of explanations. This method calculates an explanation using a network for which the relu non-linearities are replaced by softplus with a small β parameter to smooth the principle curvatures. The precise value of β is a hyperparameter of the method, but we find that a value around one works well in practice.
As shown in the supplement, a relation between SmoothGrad [12] and β-smoothing can be proven for a one-layer neural network:

Theorem 2 For a one-layer neural network g(x) = relu(w^T x) and its β-smoothed counterpart g_β(x) = softplus_β(w^T x), it holds that

E_{ε∼p_β}[∇g(x − ε)] = ∇g_{β‖w‖}(x) ,   where p_β(ε) = β / (e^{βε/2} + e^{−βε/2})² .

Since p_β closely resembles a normal distribution with standard deviation σ = √(2π) log(2)/β, β-smoothing can be understood as the N → ∞ limit of SmoothGrad h(x) = (1/N) Σ_{i=1}^N ∇g(x − εi), where εi ∼ N(0, σ). 
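Theorem 2 can be sanity-checked by Monte Carlo in the scalar case w = 1, for which β‖w‖ = β and both sides reduce to σ(βx); note that p_β is the logistic density with scale 1/β, which numpy can sample directly. This check is our own illustration, not part of the paper's supplement:

```python
import numpy as np

rng = np.random.default_rng(42)
beta, x = 2.0, 0.7

# p_beta(eps) = beta / (e^{beta eps/2} + e^{-beta eps/2})^2 is the logistic
# density with location 0 and scale 1/beta.
eps = rng.logistic(loc=0.0, scale=1.0 / beta, size=1_000_000)

# Left-hand side: E[relu'(x - eps)] = E[ 1{x - eps > 0} ], Monte Carlo estimate.
smoothgrad_like = float(np.mean(np.heaviside(x - eps, 0.0)))

# Right-hand side: d/dx softplus_beta(x) = sigma(beta x).
beta_smoothed = 1.0 / (1.0 + np.exp(-beta * x))
```

Both quantities agree up to Monte-Carlo noise, in line with the theorem.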
We emphasize that the theorem only holds for a one-layer neural network, but for deeper networks we empirically observe that both lead to visually similar maps as they are considerably less noisy than the gradient map. The theorem therefore suggests that SmoothGrad can similarly be used to smooth the curvatures and can thereby make explanations more robust.⁷
Experiments: Figure 4 demonstrates that β-smoothing allows us to recover the original explanation map by decreasing the value of the β parameter. We stress that this works for all considered methods. We also note that the same effect can be observed using SmoothGrad by successively increasing the standard deviation σ of the noise distribution. This further underlines the similarity between the two smoothing methods.
If an attacker knew that smoothing was used to undo the manipulation, they could try to attack the smoothed method directly. However, both β-smoothing and SmoothGrad are substantially more robust than their non-smoothed counterparts, see Figure 5. It is important to note that β-smoothing achieves this at considerably lower computational cost: β-smoothing only requires a single forward and backward pass, while SmoothGrad requires as many as the number of noise samples (typically between 10 and 50).
We refer to Supplement D for more details on these experiments.

Figure 5: Left: markers are clearly left of the diagonal, i.e. explanations are more robust to manipulations when β-smoothing is used. Center: SmoothGrad has comparable results to β-smoothing, i.e. markers are distributed around the diagonal. Right: β-smoothing has significantly lower computational cost than SmoothGrad.

Figure 6 shows the evolution of the gradient explanation maps when reducing the β parameter of the softplus activations. 
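The qualitative behaviour of the two smoothing methods, and their cost difference, can be reproduced for a single relu unit g(x) = relu(x) (an illustrative toy, not one of the paper's experiments): just left of the kink the raw gradient is exactly zero, while SmoothGrad (many passes) and β-smoothing (one pass) both return an intermediate, smoothed value.

```python
import numpy as np

rng = np.random.default_rng(7)

def relu_grad(x):
    """Gradient explanation of the toy model g(x) = relu(x)."""
    return float(np.heaviside(x, 0.0))

def smoothgrad(x, sigma=0.5, n=10_000):
    """SmoothGrad: average the gradient over n noisy copies (n passes)."""
    eps = rng.normal(0.0, sigma, size=n)
    return float(np.mean(np.heaviside(x - eps, 0.0)))

def beta_smoothed_grad(x, beta=1.0):
    """beta-smoothing: a single pass through the softplus_beta surrogate."""
    return 1.0 / (1.0 + np.exp(-beta * x))

x0 = -0.05                      # just left of the relu kink
raw = relu_grad(x0)             # exactly 0: the kink is invisible to the raw gradient
sg = smoothgrad(x0)             # intermediate value, but from 10000 passes
bs = beta_smoothed_grad(x0)     # comparable intermediate value from a single pass
```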
We note that for small β the explanation maps tend to become similar to LRP/GBP/PA explanation maps (see Figure 2 for comparison). Figure 7 demonstrates that β-smoothing leads to better performance than the gradient method and to comparable performance with SmoothGrad on the pixel-flipping metric [5, 36].

⁷For explanation methods h(x) other than gradient, SmoothGrad needs to be used in a slightly generalized form, i.e. h(x) = (1/N) Σ_{i=1}^N h(x − εi).

Figure 6: Gradient explanation map produced with the original network and a network with softplus activation functions using various values for β.

Figure 7: Pixelflipping performance compared to random baseline (the lower the accuracy the better the explanation): the metric sorts pixels of images by relevance and incrementally sets the pixels to zero starting with the most relevant. In each step, the network's performance is evaluated on the complete ImageNet validation set.

5 Conclusion

Explanation methods have recently become increasingly popular among practitioners. In this contribution, we show that dedicated imperceptible manipulations of the input data can yield arbitrary and drastic changes of the explanation map. We demonstrate both qualitatively and quantitatively that explanation maps of many popular explanation methods can be arbitrarily manipulated. Crucially, this can be achieved while keeping the model's output constant. A novel theoretical analysis reveals that in fact the large curvature of the network's decision function is one important culprit for this unexpected vulnerability. 
Using this theoretical insight, we can profoundly increase the resilience to\nmanipulations by smoothing only the explanation process while leaving the model itself unchanged.\nFuture work will investigate possibilities to modify the training process of neural networks itself such\nthat they can become less vulnerable to manipulations of explanations. Another interesting future\ndirection is to generalize our theoretical analysis of gradient-based to propagation-based methods.\nThis seems particularly promising because our experiments strongly suggest that similar theoretical\n\ufb01ndings should also hold for these explanation methods.\n\nAcknowledgments\n\nWe want to thank the anonymous reviewers for their helpful feedback. We also thank Kristof Sch\u00fctt,\nGr\u00e9goire Montavon and Shinichi Nakajima for useful discussions. This work is supported by the\nGerman Ministry for Education and Research as Berlin Big Data Center (01IS18025A) and Berlin\nCenter for Machine Learning (01IS18037I). This work is also supported by the Information &\nCommunications Technology Planning & Evaluation (IITP) grant funded by the Korea government\n(No. 2017-0-001779), as well as by the Research Training Group \"Differential Equation- and Data-\ndriven Models in Life Sciences and Fluid Dynamics (DAEDALUS)\" (GRK 2433) and Grant Math+,\nEXC 2046/1, Project ID 390685689 both funded by the German Research Foundation (DFG).\n\nReferences\n[1] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and\nKlaus-Robert M\u00fcller. How to explain individual classi\ufb01cation decisions. Journal of Machine\nLearning Research, 11(Jun):1803\u20131831, 2010.\n\n9\n\nImageReLU\u03b2=10\u03b2=3\u03b2=2\u03b2=10.00.20.40.60.8Ratioofpixelssettozero0.000.250.500.75top-1accuracy\u03b2-smoothedGradientSmoothGradGradientrandom\f[2] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 
Deep Inside Convolutional Networks:\nVisualising Image Classi\ufb01cation Models and Saliency Maps. In 2nd International Conference\non Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop\nTrack Proceedings, 2014.\n\n[3] Matthew D. Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks.\nIn Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September\n6-12, 2014, Proceedings, Part I, pages 818\u2013833, 2014.\n\n[4] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving\nIn 3rd International Conference on Learning\nfor Simplicity: The All Convolutional Net.\nRepresentations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings,\n2015.\n\n[5] Sebastian Bach, Alexander Binder, Gr\u00e9goire Montavon, Frederick Klauschen, Klaus-Robert\nM\u00fcller, and Wojciech Samek. On Pixel-Wise Explanations for Non-Linear Classi\ufb01er Decisions\nby Layer-Wise Relevance Propagation. PLOS ONE, 10(7):1\u201346, 07 2015.\n\n[6] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi\nParikh, and Dhruv Batra. Grad-CAM: Why did you say that? Visual Explanations from Deep\nNetworks via Gradient-based Localization. CoRR, abs/1610.02391, 2016.\n\n[7] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining\nthe predictions of any classi\ufb01er. In Proceedings of the 22nd ACM SIGKDD international\nconference on knowledge discovery and data mining, pages 1135\u20131144. ACM, 2016.\n\n[8] Luisa M. Zintgraf, Taco S. Cohen, Tameem Adel, and Max Welling. Visualizing Deep Neural\nIn 5th International Conference on\nNetwork Decisions: Prediction Difference Analysis.\nLearning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track\nProceedings, 2017.\n\n[9] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 
Learning Important Features\nThrough Propagating Activation Differences. In Proceedings of the 34th International Con-\nference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages\n3145\u20133153, 2017.\n\n[10] Scott M Lundberg and Su-In Lee. A Uni\ufb01ed Approach to Interpreting Model Predictions. In\nI. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,\neditors, Advances in Neural Information Processing Systems 30, pages 4765\u20134774. Curran\nAssociates, Inc., 2017.\n\n[11] Piotr Dabkowski and Yarin Gal. Real Time Image Saliency for Black Box Classi\ufb01ers. In\nI. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,\neditors, Advances in Neural Information Processing Systems 30, pages 6967\u20136976. Curran\nAssociates, Inc., 2017.\n\n[12] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B. Vi\u00e9gas, and Martin Wattenberg. Smooth-\n\nGrad: removing noise by adding noise. CoRR, abs/1706.03825, 2017.\n\n[13] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic Attribution for Deep Networks.\nIn Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney,\nNSW, Australia, 6-11 August 2017, pages 3319\u20133328, 2017.\n\n[14] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning Important Features\nThrough Propagating Activation Differences. In Proceedings of the 34th International Con-\nference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages\n3145\u20133153, 2017.\n\n[15] Ruth C Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful\nperturbation. In 2017 IEEE international conference on computer vision (ICCV), pages 3449\u2013\n3457. IEEE, 2017.\n\n10\n\n\f[16] Gr\u00e9goire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-\nRobert M\u00fcller. 
Explaining nonlinear classification decisions with Deep Taylor Decomposition. Pattern Recognition, 65:211–222, 2017.

[17] Pieter-Jan Kindermans, Kristof T. Schütt, Maximilian Alber, Klaus-Robert Müller, Dumitru Erhan, Been Kim, and Sven Dähne. Learning how to explain neural networks: PatternNet and PatternAttribution. International Conference on Learning Representations, 2018.

[18] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie J. Cai, James Wexler, Fernanda B. Viégas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 2673–2682, 2018.

[19] Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 10(1):1096, 2019.

[20] Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer, 2019.

[21] Amirata Ghorbani, Abubakar Abid, and James Y. Zou. Interpretation of neural networks is fragile. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 3681–3688, 2019.

[22] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods.
In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pages 267–280. Springer, 2019.

[23] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian J. Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 9525–9536, 2018.

[24] Juyeon Heo, Sunghwan Joo, and Taesup Moon. Fooling neural network interpretations via adversarial model manipulation. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 2921–2932. Curran Associates, Inc., 2019.

[25] David Alvarez-Melis and Tommi S. Jaakkola. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 7786–7795, 2018.

[26] David Alvarez-Melis and Tommi S. Jaakkola. On the Robustness of Interpretability Methods. CoRR, abs/1806.08049, 2018.

[27] Maximilian Alber, Sebastian Lapuschkin, Philipp Seegerer, Miriam Hägele, Kristof T. Schütt, Grégoire Montavon, Wojciech Samek, Klaus-Robert Müller, Sven Dähne, and Pieter-Jan Kindermans. iNNvestigate neural networks! Journal of Machine Learning Research, 20, 2019.

[28] Marco Ancona, Enea Ceolini, Cengiz Oztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for Deep Neural Networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.

[29] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.
International Conference on Learning Representations, 2015.

[30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.

[32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 1097–1105, USA, 2012. Curran Associates Inc.

[33] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2261–2269, 2017.

[34] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images, 2009.

[35] Loring W. Tu. Differential Geometry: Connections, Curvature, and Characteristic Classes, volume 275. Springer, 2017.

[36] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller.
Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE Transactions on Neural Networks and Learning Systems, 28:2660–2673, 2017.