{"title": "Towards Automatic Concept-based Explanations", "book": "Advances in Neural Information Processing Systems", "page_first": 9277, "page_last": 9286, "abstract": "Interpretability has become an important topic of research as more machine learning (ML) models are deployed and widely used to make important decisions. Most of the current explanation methods provide explanations through feature importance scores, which identify features that are important for each individual input. However, systematically summarizing and interpreting such per-sample feature importance scores is itself challenging. In this work, we propose principles and desiderata for \\emph{concept} based explanation, which goes beyond per-sample features to identify higher level human-understandable concepts that apply across the entire dataset. We develop a new algorithm, ACE, to automatically extract visual concepts. Our systematic experiments demonstrate that ACE discovers concepts that are human-meaningful, coherent and important for the neural network's predictions.", "full_text": "Towards Automatic Concept-based Explanations\n\nAmirata Ghorbani\u2217\nStanford University\n\namiratag@stanford.edu\n\nJames Zou\n\nStanford University\n\njamesz@stanford.edu\n\nJames Wexler\nGoogle Brain\n\njwexler@google.com\n\nBeen Kim\nGoogle Brain\n\nbeenkim@google.com\n\nAbstract\n\nInterpretability has become an important topic of research as more machine learning\n(ML) models are deployed and widely used to make important decisions. Most\nof the current explanation methods provide explanations through feature importance\nscores, which identify features that are important for each individual input.\nHowever, systematically summarizing and interpreting such per-sample feature\nimportance scores is itself challenging. 
In this work, we propose principles and\ndesiderata for concept-based explanation, which goes beyond per-sample features\nto identify higher-level human-understandable concepts that apply across the entire\ndataset. We develop a new algorithm, ACE, to automatically extract visual concepts.\nOur systematic experiments demonstrate that ACE discovers concepts that are\nhuman-meaningful, coherent and important for the neural network\u2019s predictions.\n\n1 Introduction\n\nAs machine learning (ML) becomes widely used in applications ranging from medicine [17] to\ncommerce [38], gaining insights into ML models\u2019 predictions has become an important topic of\nstudy, and in some cases a legal requirement [16]. The industry is also recognizing explainability as\none of the main components of responsible use of ML [1]; not just a nice-to-have component but a\nmust-have one.\nMost of the recent literature on ML explanation methods has revolved around deep learning models.\nMethods that are focused on providing explanations of ML models follow a common procedure: for\neach input to the model, they alter individual features (pixels, super-pixels, word-vectors, etc.) either\nin the form of removal (zero-out, blur, shuffle, etc.) [29, 5] or perturbation [35, 34] to approximate the\nimportance of each feature for the model\u2019s prediction. These \u201cfeature-based\u201d explanations suffer from\nseveral drawbacks. A line of research has shown that these methods are not\nalways reliable [14, 3, 15]. For example, Kindermans et al. discussed vulnerability even to simple shifts\nin the input [21], while Ghorbani et al. designed adversarial perturbations against these methods [14]. A\nmore important concern, however, is that human experiments show that these methods are susceptible\nto human confirmation biases [20] and that they do not increase human\nunderstanding of the model or human trust in the model [28, 20]. 
For example, Kim et al. [20]\nshowed that given identical feature-based explanations, human subjects confidently find evidence for\ncompletely contradicting conclusions.\nAs a consequence, a recent line of research has focused on providing explanations in the form of high-level\nhuman \u201cconcepts\u201d [46, 20]. Instead of assigning importance to individual features or pixels, the\noutput of the method reveals the important concepts. For example, the wheel and the police logo are\nimportant concepts for detecting police vans. These methods come with their own drawbacks. Rather\nthan pointing to the important concepts, they respond to the user\u2019s queries about concepts. That is, for\neach concept whose importance is queried, a human has to provide hand-labeled examples of that concept.\nWhile these methods are useful when the user knows the set of well-defined concepts and has the\nresources to provide examples, a major problem is that the space of possible concepts to query can\nbe unlimited, or in some settings even unclear. Another important drawback is that\nthey rely on human bias in the explanation process; humans might fail to choose the right concepts to\nquery. Because these previous methods can only test concepts that are already labeled and identified\nby humans, their discovery power is severely limited.\n\n\u2217Work done while interning at Google Brain.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOur contribution We lay out general principles that a concept-based explanation of ML should\nsatisfy. Then we develop a systematic framework to automatically identify higher-level concepts which\nare meaningful to humans and are important for the ML model. Our novel method, Automated\nConcept-based Explanation (ACE), works by aggregating related local image segments across diverse\ndata. 
We apply an efficient implementation of our method to a widely-used object recognition model.\nQuantitative human experiments and evaluations demonstrate that ACE satisfies the principles of\nconcept-based explanation and provides interesting insights into the ML model.2\n\n2 Concept-based Explanation Desiderata\n\nOur goal is to explain a machine learning model\u2019s decision making via units that are more\nunderstandable to humans than individual features, pixels, characters, and so forth. Following the\nliterature [46, 20], throughout this work, we refer to these units as concepts. A precise definition of a\nconcept is not easy [13]. Instead, we lay out the desired properties that a concept-based explanation\nof a machine learning model should satisfy to be understandable by humans.\n\n1. Meaningfulness An example of a concept is semantically meaningful on its own. In the\ncase of image data, for instance, individual pixels may not satisfy this property while a\ngroup of pixels (an image segment) containing a texture concept or an object part concept\nis meaningful. Meaningfulness should also correspond to different individuals associating\nsimilar meanings to the concept.\n\n2. Coherency Examples of a concept should be perceptually similar to each other while being\ndifferent from examples of other concepts. Examples of the \u201cblack and white striped\u201d concept\nare all similar in having black and white stripes.\n\n3. Importance A concept is \u201cimportant\u201d for the prediction of a class if its presence is necessary\nfor the true prediction of samples in that class. 
In the case of image data, for instance, the\nobject whose presence is being predicted is necessary while the background color is not.\n\nWe do not claim these properties to be a complete set of desiderata; however, we believe that this is a\ngood starting point towards concept-based explanations.\n\n3 Methods\n\nAn explanation algorithm typically has three main components: a trained classification model, a\nset of test data points from the same classification task, and an importance computation procedure\nthat assigns importance to features, pixels, concepts, and so forth. The method either explains an\nindividual data point\u2019s prediction (local explanation), or an entire model, class or sets of examples\n(global explanation). One example of a local explanation method is the family of saliency map\nmethods [33, 34, 35]. Each pixel in every image is assigned an importance score for the correct\nprediction of that image, typically by using the gradient of the prediction with respect to each pixel.\nTCAV [20] is an example of a global method. For each class, it determines how important a given\nconcept is for predicting that class.\nIn what follows, we present ACE. ACE is a global explanation method that explains an entire class\nin a trained classifier without the need for human supervision.\n\n2 Implementation available: https://github.com/amiratag/ACE\n\nFigure 1: ACE algorithm (a) A set of images from the same class is given. Each image is segmented\nwith multiple resolutions resulting in a pool of segments all coming from the same class. (b) The\nactivation space of one bottleneck layer of a state-of-the-art CNN classifier is used as a similarity\nspace. After resizing each segment to the standard input size of the model, similar segments are\nclustered in the activation space and outliers are removed to increase coherency of clusters. 
(c) For\neach concept, its TCAV importance score is computed given its example segments.\n\nAutomated concept-based explanations step-by-step ACE takes a trained classifier and a set of\nimages of a class as input. It then extracts concepts present in that class and returns each concept\u2019s\nimportance. In image data, concepts are present in the form of groups of pixels (segments). To\nextract all concepts of a class, the first step of ACE (Fig. 1(a)) starts with segmentation of the given\nclass images. To capture the complete hierarchy of concepts from simple fine-grained ones like\ntextures and colors to more complex and coarse-grained ones such as parts and objects, each image is\nsegmented with multiple resolutions. In our experiments, we used three different levels of resolution\nto capture three levels of texture, object parts, and objects. As discussed in Section 4, three levels of\nsegmentation are enough to achieve the goal.\nThe second step of ACE (Fig. 1(b)) groups similar segments as examples of the same concept. To\nmeasure the similarity of segments, we use the result of previous work [45] showing that in state-of-the-art\nconvolutional neural networks (CNNs) trained on large-scale data sets like ImageNet [32], the\nEuclidean distance in the activation space of final layers is an effective perceptual similarity metric.\nEach segment is then passed through the CNN to be mapped to the activation space. Similar to the\nargument made by Dabkowski & Gal [8], as most image classifiers accept images of a standard size\nwhile the segments have arbitrary size, we resize the segment to the required size disregarding aspect\nratio. 
As the results in Section 4 suggest, this works fine in practice, but it should be mentioned\nthat the proposed similarity measure works best with classifiers robust to scale and aspect ratio.\nAfter the mapping is performed, using the Euclidean distance between segments, we cluster similar\nsegments as examples of the same concept. To preserve concept coherency, outlier segments of each\ncluster that have low similarity to the cluster\u2019s segments are removed (Fig. 1(b)).\nThe last step of ACE (Fig. 1(c)) returns the important concepts from the set of concepts\nextracted in the previous steps. The TCAV [20] concept-based importance score is used in this work (Fig. 1(c)),\nthough any other concept-importance score could be used.\n\nHow ACE is designed to achieve the three desiderata The first of the desiderata requires the\nreturned concepts to be clean of meaningless examples (segments). To perfectly satisfy meaningfulness,\nthe first step of ACE can be replaced by a human subject going over all the given images and\nextracting only meaningful segments. To automate this procedure, a long line of research has focused\non semantic segmentation algorithms [25, 23, 27, 30], that is, to segment an image so that every pixel\nis assigned to a meaningful class. State-of-the-art semantic segmentation methods use deep neural\nnetworks, which impose a higher computational cost. Most of these methods are also unable to perform\nsegmentation with different resolutions. 
To tackle these issues, ACE uses simple and fast super-pixel\nsegmentation methods which have been widely used in the hierarchical segmentation literature [43].\nThese methods can be applied at any desired level of resolution with low computational cost,\nat the price of lower segmentation quality, that is, returning segments that either are\nmeaningless or capture numerous textures, objects, etc. instead of isolating one meaningful concept.\nTo have perfect meaningfulness and coherency, we can replace the second step with a human subject\nwho goes over all the segments, clusters similar segments as concepts, and removes meaningless or\ndissimilar segments. The second step of ACE aims to automate the same procedure. It replaces\nthe human subject, acting as a perceptual similarity metric, with an ImageNet-trained CNN. It then clusters\nsimilar segments and removes outliers. The outlier removal step is necessary to make every cluster of\nsegments clean of meaningless or dissimilar segments. The idea is that if a segment is dissimilar to\nsegments in a cluster, it is either a random and meaningless segment or, if it is meaningful, it belongs\nto a different concept: a concept that has appeared only a few times in the class images and whose\nsegments are therefore not numerous enough to form a cluster. For example, asphalt texture segments are\npresent in almost every police van image and therefore are expected to form a coherent cluster, while\nsegments of grass texture that are present in only one police van image form a concept unrelated to\nthe class and are to be removed.\nACE utilizes the TCAV score as a concept\u2019s importance metric. 
The intuition behind the TCAV score\nis to approximate the average positive effect of a concept on predicting the class; it is generally\napplied to deep neural network classifiers. Given examples of a concept, the TCAV score [20] is the\nfraction of class images for which the prediction score increases if the representation of those images\nin the activation space is perturbed in the general direction of the representation of concept examples\nin the same activation space (with the use of directional derivatives). Details are described in the\noriginal work [20].\nIt is evident that satisfying the desiderata through ACE is limited by the performance of the segmentation\nmethod, the clustering and outlier removal method, and above all the reliability of using CNNs\nas a similarity metric. The results and human experiments in the next section verify the effectiveness\nof this method.\n\n4 Experiments and Results\n\nAs an experimental example, we use ACE to interpret the widely-used Inception-V3 model [36]\ntrained on the ILSVRC2012 data set (ImageNet) [32]. We select a subset of 100 classes out of the 1000\nclasses in the data set to apply ACE. As shown in the original TCAV paper [20], this importance\nscore performs well given a small number of examples for each concept (10 to 20). In our experiments\non ImageNet classes, 50 images were sufficient to extract enough examples of concepts, possibly\nbecause the concepts are frequently present in these images. The segmentation step is performed\nusing SLIC [2] due to its speed and performance (after examining several super-pixel methods\n[10, 26, 41]) with three resolutions of 15, 50, and 80 segments for each image. For our similarity\nmetric, we examined the Euclidean distance in several layers of the ImageNet-trained Inception-V3\narchitecture and chose the \u201cmixed_8\u201d layer. 
As previously shown [20], earlier layers are better at\nsimilarity of textures and colors while later ones are better for objects, and the \u201cmixed_8\u201d layer yields\nthe best trade-off. K-Means clustering is performed and outliers are removed using Euclidean distance\nto the cluster centers. More implementation details are provided in Appendix A.\n\nExamples of ACE algorithm We apply ACE to 100 randomly selected ImageNet classes. Fig. 2\ndepicts the outputs for three classes. For each class, we show the four most important concepts\nvia three randomly selected examples (each example is shown above the original image it was\nsegmented from). The figure suggests that ACE considers concepts of several levels of complexity,\nfrom lionfish spines and skin texture to a car wheel or window. More examples are shown in\nAppendix E.\n\nHuman experiments To verify the coherency of concepts, following the explainability literature [7],\nwe designed an intruder detection experiment. For each question, a subject is asked to identify\none image out of six that is conceptually different from the rest. We created a questionnaire of 34\nquestions, such as the one shown in Fig. 3. Among the 34 randomly ordered questions, 15\nuse the output concepts of ACE and another 15 use human-labeled concepts\nfrom the Broden dataset [4]. The first four questions were used for training the participants and were\ndiscarded. On average, 30 participants answered the hand-labeled dataset 97% (14.6/15) (\u00b10.7)\ncorrectly, while discovered concepts were answered 99% (14.9/15) (\u00b10.3) correctly. This experiment\n\nFigure 2: The output of ACE for three ImageNet classes. Here we depict three randomly selected\nexamples of the top-4 important concepts of each class (each example is shown above the original\nimage it was segmented from). 
Using this result, for instance, we could see that the network classifies\npolice vans using the van\u2019s tire and the police logo.\n\nconfirms that while a discovered concept is only a set of image segments, ACE outputs segments that\nare coherent.\nIn our second experiment, we test how meaningful the concepts are to humans. We asked 30\nparticipants to perform two tasks. As a baseline test of meaningfulness, first we ask them to choose\nthe more meaningful of two options, one being four segments of the same concept (along with the\nimage they were segmented from) and the other being four random segments of images in the same\nclass. The right option was chosen 95.6% (14.3/15) (\u00b11.0) of the time. To further query the meaningfulness of\nthe concepts, participants were asked to describe their chosen option with one word. As a result, for\neach question, a set of words (e.g. bike, wheel, motorbike) is provided and we tally how many\nindividuals use the same word to describe each set of images. For example, for the question in Fig. 3,\n19 users used the word human or person and 8 users used face or head. For all of the questions,\non average, 56% of participants described it with the most frequent word and its synonyms (77% of\ndescriptions were from the two most frequent words). This suggests that, first of all, ACE discovers\nconcepts with high precision. Secondly, the discovered concepts have consistent semantic/verbal\nmeanings across individuals. The questionnaire had 19 questions and the first 4 were used as training\nand were discarded.\n\nExamining the importance of important concepts To confirm the importance scores given by\nTCAV, we extend the two importance measures defined for pixel importance scores in the literature\n[8] to the case of concepts. Smallest sufficient concepts (SSC) looks for the smallest set of\nconcepts that are enough for predicting the target class. 
Smallest destroying concepts (SDC)\nlooks for the smallest set of concepts whose removal causes an incorrect prediction. Note that\nalthough these importance scores are defined and used for local pixel-based explanations in [8]\n(explaining one data point), the main idea can still be used to evaluate our global concept-based\nexplanation (explaining a class).\n\nFigure 3: Human subject experiments questionnaires. (Texts in blue are not part of the questionnaire.)\n(a) 30 human subjects were asked to identify one image out of six that is conceptually different\nfrom the rest. For comparison, each question is either a set of extracted or hand-labeled concepts. On\naverage, participants answer the hand-labeled dataset 97% (14.6/15, \u00b10.7) correctly, while discovered\nconcepts were answered 99% (14.9/15, \u00b10.3) correctly. (b) 30 human subjects were asked to identify\na set of image segments belonging to a concept versus a random set of segments and then to assign a\nword to the selected concept. On average, 55% of participants used the most frequent word and its\nsynonyms for each question and 77% of the answers were one of top-two frequent words.\n\nTo examine ACE with these two measures, we use 1000 randomly selected ImageNet validation\nimages from the same 100 classes. Each image is segmented with multiple resolutions similar to\nACE. Using the same similarity metric as in ACE, each resulting segment is assigned to the concept\nwhose examples have the least similarity distance to it. Fig. 
4 shows\nthe prediction accuracy on these examples as we add and remove important concepts.\n\nInsights into the model through ACE To begin with, some interesting correlations are revealed.\nFor many classes, the concepts with high importance follow human intuition, e.g. the \u201cPolice\u201d\ncharacters on a police car are important for detecting a police van while the asphalt on the ground is\nnot important. Fig. 5(a) shows more examples of this kind. On the other hand, there are examples\nwhere the correlations in the real world are transformed into the model\u2019s prediction behavior. For instance,\nthe most important concept for predicting basketball images is the players\u2019 jerseys rather than the\nball itself. It turns out that most of the ImageNet basketball images contain jerseys in the image (we\ninspected 50 training images and there was a jersey in 48 of them). Similar examples are shown\nin Fig. 5(b). A third category of results is shown in Fig. 5(c). In some cases, when the object\u2019s\nstructure is complex, parts of the object as separate concepts have their own importance and some\nparts are more important than others. The example of carousel is shown: lights, poles, and seats. It is\ninteresting to learn that the lights are more important than seats.\nA natural follow-up question is whether the mere existence of important concepts is enough for\nprediction without having the structural properties; e.g. an image of just black and white zebra\nstripes is predicted as zebra. For each class, we randomly place examples of the four most important\n\nFigure 4: Importance For 1000 randomly sampled images in the ImageNet validation set, we start\nremoving/adding concepts from the most important. 
As shown, the top-5 concepts are enough to\nreach within 80% of the original accuracy, and removing the top-5 concepts results in misclassification\nof more than 80% of samples that are classified correctly. For comparison, we also plot the effect of\nadding/removing concepts with random order and with reverse importance order.\n\nFigure 5: Insights into the model The text above each image describes its original class and our\nsubjective interpretation of the extracted concept; e.g. \u201cVolcano\u201d class and \u201cLava\u201d concepts. (a)\nIntuitive correlations. (b) Unintuitive correlations. (c) Different parts of an object as separate but\nimportant concepts.\n\nconcepts on a blank image (100 images for each class). Fig. 6 depicts examples of these randomly\n\u201cstitched\u201d images with their predicted class. For 20 classes (zebra, liner, etc.), more than 80% of\nimages were classified correctly. For more than half of the classes, above 40% of the images were\nclassified correctly (note that random chance is 0.1%). This result aligns with similar findings\n[6, 12] of the surprising effectiveness of Bag-of-local-Features and the bias of CNNs towards texture, and shows\nthat our extracted concepts are important enough to be sufficient for the ML model. Examples are\ndiscussed in Appendix C.\n\n5 Related Work\n\nThis work is focused on post-training explanation methods, that is, explaining an already trained model\ninstead of building an inherently explainable model [42, 19, 40]. Most common post-training explanation\nmethods provide explanations by estimating the importance of each input feature (covariates,\npixels, etc.) or training sample for the prediction of a particular data point [33, 34, 44, 22] and are designed\nto explain the prediction on individual data points. 
While this is useful when only specific data\npoints matter, these methods have been shown to come with many limitations, both methodologically\nand fundamentally [21, 18, 14].\n\nFigure 6: Stitching important concepts We test what the classifier would see if we randomly stitch\nimportant concepts. We discover that for a number of classes this results in predicting the image to be a\nmember of that class. For instance, basketball jerseys, zebra skin, lionfish, and king snake patterns all\nseem to be enough for the Inception-V3 network to classify them as images of their class.\n\nFor example, [18] showed that some input feature-based explanations\nare qualitatively and quantitatively similar for a trained model (i.e., making superhuman performance\nprediction) and a randomized model (i.e., making random predictions). Other work proved that\nsome of these methods are in fact trying to reconstruct the input image, rather than estimating pixels\u2019\nimportance for prediction [39]. In addition, it has been shown that these explanations are susceptible to\nhumans\u2019 confirmation biases [20]. Using input features as explanations also introduces challenges in\nscaling this method to high dimensional datasets (e.g., health records). Humans typically reason in\nhigher abstracted concepts [31] than a particular input feature (e.g., lab results, a particular hospital\nvisit). Recently developed methods use high-level concepts instead of input features. 
TCAV [20]\nproduces estimates of how important a concept was for the prediction and IBD [46] decomposes\nthe prediction of one image into human-interpretable conceptual components. Both methods require\nhumans to provide examples of concepts. Our work introduces an explanation method that explains\neach class in the network using concepts that are present in the images of that class while removing\nthe need for humans to label examples of those concepts. [37]\n\n6 Discussion\n\nWe note a couple of limitations of our method. The experiments are performed on image data, as\nautomatically grouping features into meaningful units is simple for this case. The general idea of\nproviding concept-based explanations applies to other data types such as texts, and this would\nbe an interesting direction of future work. Another interesting direction of future work is to apply\nmore sophisticated dictionary learning approaches, beyond clustering, on the representation space\n(e.g. sparse coding), which could reduce the need for image segmentation and learn more complex\nconcepts. Additionally, the above discussions only apply to concepts that are present in the form of\ngroups of pixels. While this assumption gave us plenty of insight into the model, there might be more\ncomplex and abstract concepts that are difficult to automatically extract. Future work includes tuning\nthe ACE hyper-parameters (multi-resolution segmentation, etc.) for each class separately. This may\nbetter capture the inherent granularity of objects; for example, scenes in nature may contain a smaller\nnumber of concepts compared to scenes in a city.\nIn conclusion, we introduce ACE, a post-training explanation method that automatically groups\ninput features into high-level concepts: meaningful concepts that appear as coherent examples and\nare important for correct prediction of the images they are present in. 
We verified the meaningfulness\nand coherency through human experiments and further validated that they indeed carry salient signals\nfor prediction. The discovered concepts reveal insights into potentially surprising correlations that\nthe model has learned. Such insights may help to promote safer use of this powerful tool, machine\nlearning.\n\nAcknowledgement A.G. is supported by a Stanford Graduate Fellowship (Robert Bosch Fellow).\nJ.Z. is supported by NSF CCF 1763191, NIH R21 MD012867-01, NIH P30AG059307, and grants\nfrom the Silicon Valley Foundation and the Chan-Zuckerberg Initiative.\n\nReferences\n\n[1] Google AI principles. https://www.blog.google/technology/ai/ai-principles/.\nAccessed: 2018-11-15.\n\n[2] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. S\u00fcsstrunk, et al. Slic superpixels compared\nto state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine\nintelligence, 34(11):2274\u20132282, 2012.\n\n[3] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for\nsaliency maps. In Advances in Neural Information Processing Systems, pages 9505\u20139515, 2018.\n\n[4] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying\ninterpretability of deep visual representations. In Computer Vision and Pattern Recognition,\n2017.\n\n[5] L. Breiman. Random forests. Machine learning, 45(1):5\u201332, 2001.\n\n[6] W. Brendel and M. Bethge. Approximating cnns with bag-of-local-features models works\nsurprisingly well on imagenet. arXiv preprint arXiv:1904.00760, 2019.\n\n[7] J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M. Blei. Reading tea leaves: How\nhumans interpret topic models. In Advances in neural information processing systems, pages\n288\u2013296, 2009.\n\n[8] P. Dabkowski and Y. Gal. 
Real time image saliency for black box classifiers. In Advances in\nNeural Information Processing Systems, pages 6967\u20136976, 2017.\n\n[9] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering\nclusters in large spatial databases with noise. In Kdd, volume 96, pages 226\u2013231, 1996.\n\n[10] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International\njournal of computer vision, 59(2):167\u2013181, 2004.\n\n[11] B. J. Frey and D. Dueck. Clustering by passing messages between data points. science,\n315(5814):972\u2013976, 2007.\n\n[12] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. Imagenet-trained\ncnns are biased towards texture; increasing shape bias improves accuracy and robustness.\narXiv preprint arXiv:1811.12231, 2018.\n\n[13] J. Genone and T. Lombrozo. Concept possession, experimental semantics, and hybrid theories\nof reference. Philosophical Psychology, 25(5):717\u2013742, 2012.\n\n[14] A. Ghorbani, A. Abid, and J. Zou. Interpretation of neural networks is fragile. arXiv preprint\narXiv:1710.10547, 2017.\n\n[15] J. R. Gimenez, A. Ghorbani, and J. Zou. Knockoffs for the mass: new feature importance\nstatistics with false discovery guarantees. arXiv preprint arXiv:1807.06214, 2018.\n\n[16] B. Goodman and S. Flaxman. European union regulations on algorithmic decision-making and\na \u201cright to explanation\u201d. arXiv preprint arXiv:1606.08813, 2016.\n\n[17] V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan,\nK. Widner, T. Madams, J. Cuadros, et al. Development and validation of a deep learning algorithm\nfor detection of diabetic retinopathy in retinal fundus photographs. Jama, 316(22):2402\u20132410,\n2016.\n\n[18] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. NIPS,\n2018.\n\n[19] B. Kim, C. Rudin, and J. A. Shah. 
The Bayesian Case Model: A generative approach for case-based reasoning and prototype classification. In Advances in Neural Information Processing Systems, pages 1952\u20131960, 2014.\n\n[20] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Vi\u00e9gas, and R. Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pages 2673\u20132682, 2018.\n\n[21] P.-J. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Sch\u00fctt, S. D\u00e4hne, D. Erhan, and B. Kim. The (un)reliability of saliency methods. arXiv preprint arXiv:1711.00867, 2017.\n\n[22] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.\n\n[23] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.\n\n[24] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129\u2013137, 1982.\n\n[25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431\u20133440, 2015.\n\n[26] P. Neubert and P. Protzel. Compact watershed and preemptive SLIC: On improving trade-offs of superpixel segmentation algorithms. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 996\u20131001. IEEE, 2014.\n\n[27] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520\u20131528, 2015.\n\n[28] F. Poursabzi-Sangdeh, D. G. Goldstein, J. M. Hofman, J. W. Vaughan, and H. Wallach. Manipulating and measuring model interpretability. arXiv preprint arXiv:1802.07810, 2018.\n\n[29] M. T. Ribeiro, S. Singh, and C. Guestrin. 
Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386, 2016.\n\n[30] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234\u2013241. Springer, 2015.\n\n[31] E. Rosch. Principles of categorization. Concepts: Core Readings, 189, 1999.\n\n[32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211\u2013252, 2015.\n\n[33] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.\n\n[34] D. Smilkov, N. Thorat, B. Kim, F. Vi\u00e9gas, and M. Wattenberg. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.\n\n[35] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3319\u20133328. JMLR.org, 2017.\n\n[36] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818\u20132826, 2016.\n\n[37] S. Tariyal, A. Majumdar, R. Singh, and M. Vatsa. Deep dictionary learning. IEEE Access, 4:10096\u201310109, 2016.\n\n[38] N. Tintarev and J. Masthoff. Designing and evaluating explanations for recommender systems. In Recommender Systems Handbook. Springer, 2011.\n\n[39] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Deep image prior. CoRR, abs/1711.10925, 2017.\n\n[40] B. Ustun and C. Rudin. Methods and models for interpretable linear classification. ArXiv, 2014.\n\n[41] A. Vedaldi and S. Soatto. 
Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision, pages 705\u2013718. Springer, 2008.\n\n[42] F. Wang and C. Rudin. Falling rule lists. In AISTATS, 2015.\n\n[43] X. Wei, Q. Yang, Y. Gong, N. Ahuja, and M.-H. Yang. Superpixel hierarchy. IEEE Transactions on Image Processing, 27(10):4838\u20134849, 2018.\n\n[44] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818\u2013833. Springer, 2014.\n\n[45] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586\u2013595, 2018.\n\n[46] B. Zhou, Y. Sun, D. Bau, and A. Torralba. Interpretable basis decomposition for visual explanation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 119\u2013134, 2018.\n", "award": [], "sourceid": 4967, "authors": [{"given_name": "Amirata", "family_name": "Ghorbani", "institution": "Stanford University"}, {"given_name": "James", "family_name": "Wexler", "institution": null}, {"given_name": "James", "family_name": "Zou", "institution": "Stanford University"}, {"given_name": "Been", "family_name": "Kim", "institution": "Google"}]}