{"title": "Model Agnostic Supervised Local Explanations", "book": "Advances in Neural Information Processing Systems", "page_first": 2515, "page_last": 2524, "abstract": "Model interpretability is an increasingly important component of practical machine learning. Some of the most common forms of interpretability systems are example-based, local, and global explanations. One of the main challenges in interpretability is designing explanation systems that can capture aspects of each of these explanation types, in order to develop a more thorough understanding of the model. We address this challenge in a novel model called MAPLE that uses local linear modeling techniques along with a dual interpretation of random forests (both as a supervised neighborhood approach and as a feature selection method). MAPLE has two fundamental advantages over existing interpretability systems. First, while it is effective as a black-box explanation system, MAPLE itself is a highly accurate predictive model that provides faithful self explanations, and thus sidesteps the typical accuracy-interpretability trade-off. Specifically, we demonstrate, on several UCI datasets, that MAPLE is at least as accurate as random forests and that it produces more faithful local explanations than LIME, a popular interpretability system. Second, MAPLE provides both example-based and local explanations and can detect global patterns, which allows it to diagnose limitations in its local explanations.", "full_text": "Model Agnostic Supervised Local Explanations\n\nGregory Plumb\n\nCMU\n\nDenali Molitor\n\nUCLA\n\ngdplumb@andrew.cmu.edu\n\ndmolitor@math.ucla.edu\n\ntalwalkar@cmu.edu\n\nAmeet Talwalkar\n\nCMU\n\nAbstract\n\nModel interpretability is an increasingly important component of practical ma-\nchine learning. Some of the most common forms of interpretability systems are\nexample-based, local, and global explanations. 
One of the main challenges in\ninterpretability is designing explanation systems that can capture aspects of each\nof these explanation types, in order to develop a more thorough understanding\nof the model. We address this challenge in a novel model called MAPLE that\nuses local linear modeling techniques along with a dual interpretation of random\nforests (both as a supervised neighborhood approach and as a feature selection\nmethod). MAPLE has two fundamental advantages over existing interpretability\nsystems. First, while it is effective as a black-box explanation system, MAPLE\nitself is a highly accurate predictive model that provides faithful self explanations,\nand thus sidesteps the typical accuracy-interpretability trade-off. Speci\ufb01cally, we\ndemonstrate, on several UCI datasets, that MAPLE is at least as accurate as random\nforests and that it produces more faithful local explanations than LIME, a popular\ninterpretability system. Second, MAPLE provides both example-based and local\nexplanations and can detect global patterns, which allows it to diagnose limitations\nin its local explanations.\n\n1\n\nIntroduction\n\nLeading machine learning models are typically opaque and dif\ufb01cult to interpret, yet they are increas-\ningly being used to make critical decisions: e.g., a doctor\u2019s diagnosis (life or death), a biologist\u2019s\nexperimental design (time and money), or a lender\u2019s loan decision (legal consequences). As a result,\nthere is a pressing need to understand these models to ensure that they are correct, fair, unbiased,\nand/or ethical. Although there is no precise de\ufb01nition of interpretability and user requirements are\ngenerally application-speci\ufb01c, three of the most common types of model explanations are:\n\n1. Example-based. In the context of an individual prediction, it is natural to ask: Which points in\nthe training set most closely resemble a test point or in\ufb02uenced the prediction? 
Nearest neighbors\nand in\ufb02uence function based methods are archetypal methods that naturally lead to example-based\nexplanations [1, 2, 12].\n\n2. Local. Alternatively, we may aim to understand an individual prediction by asking: If the input\nis changed slightly, how does the model\u2019s prediction change? Local explanations are typically\nderived from a model directly (e.g., sparse linear models), or from a local model that approximates\nthe predictive model well in a neighborhood around a speci\ufb01c point [1, 17].\n\n3. Global. To gain an understanding of a model\u2019s overall behavior we can ask: What are the\npatterns underlying the model\u2019s behavior? Global explanations usually take the form of a series\nof rules [13, 18].\n\nExample-based explanations are clearly distinct from the other two explanation types, as the former\nrelies on sample data points and the latter two on features. Further, local and global explanations\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Toy datasets. (a) Linear. (b) Shifted Inverse Logistic (SIL). (c) Step Function.\n\nthemselves capture fundamentally different characteristics of the predictive model. To see this,\nconsider the toy datasets in Fig. 1 generated from three univariate functions.\nGenerally, local explanations are better suited for modeling smooth continuous effects (Fig. 1a).\nFor discontinuous effects (Fig. 1c) or effects that are very strong in a small region (Fig. 1b), which\ncan be approximated well by discontinuities, they either fail to detect the effect or make unusual\npredictions, depending on how the local neighborhood is de\ufb01ned. We will call such effects global\npatterns because they are dif\ufb01cult to detect or model with local explanations. 
Conversely, global\nexplanations are better suited for global patterns because these discontinuities create natural \u2018rules,\u2019\nand are less effective at explaining continuous effects because explanation rules must introduce\narbitrary feature discretization/binning. Most real datasets have both continuous and discontinuous\neffects and, therefore, it is crucial to devise explanation systems that can capture, or are at least aware\nof, both types of effects.\nIn this work, we propose a novel model explanation system that draws on ideas from example-\nbased, local, and global explanations. Our method, called MAPLE1, is a supervised neighborhood\napproach that combines ideas from local linear models and ensembles of decision trees. One of\nthe distinguishing features of MAPLE is that, for a particular prediction, it assigns weights to each\ntraining point; these weights induce a probability distribution over the input space that we call the\nlocal training distribution. MAPLE is endowed with several favorable properties described below:\n\u2022 It avoids the typical trade-off between model accuracy and model interpretability [8, 15, 16], as\nit is a highly accurate predictive model (on par with leading tree ensembles) that simultaneously\nprovides example-based and local explanations.\n\n\u2022 While it does not provide global explanations, it does detect global patterns by leveraging its\nlocal training distributions, thus distinguishing it from other local explanation methods. As a\nconsequence, it can diagnose limitations of its local explanations in the presence of global patterns.\nMoreover, it offers global feature selection, which can help detect data leakage [10] or possible\nbias [18].\n\n\u2022 In some settings, we want to represent a predictive model with a small number of exemplar\nexplanations [17] and, when asked to explain a new test point, must decide which of these exemplar\nexplanations to use. 
Its local training distribution allows us to make that decision in a principled\nfashion, thereby addressing another key weakness of existing local explanation systems [18].\n\n\u2022 In addition to providing faithful self-explanations, it can also be deployed effectively as a black-box\nexplainer by simply training it on the predictions of a black-box predictive model instead of the\nactual labels. We \ufb01nd that it produces more faithful explanations than LIME [17], a commonly\nused model-agnostic algorithm for generating local explanations.\n\n2 Background and Related Work\n\nWe divide the background material into two sections. The \ufb01rst outlines interpretability and the second\nreviews the random forest literature that is most relevant to MAPLE.\n\n1The source code for our model and experiments is at https://github.com/GDPlumb/MAPLE.\n\n2\n\n\f2.1\n\nInterpretability\n\nWe say that a function is interpretable if it is human simulatable [15]. This de\ufb01nition requires that a\nperson can 1) carry out all of the model calculations in reasonable time, which rules out functions\nthat are too complicated, and 2) provide a semantic description of each calculation, which rules out\nusing features that do not have well understood meanings. Generally, sparse-linear models, small\ndecision trees, nearest neighbors, and short rule lists are all considered to be human simulatable.\nBased on the ideas in [15, 17], we de\ufb01ne a local explanation at x, denoted expx(), as an interpretable\nfunction that approximates pred() well for x(cid:48) in a neighborhood of x where pred() is the predictive\nmodel being explained. There are two main challenges when using this type of explanation. The\n\ufb01rst is accurately modeling or detecting global patterns. 
Intuitively, local explanations based on\nsupervised neighborhood methods will fail to detect global patterns and those based on unsupervised\nneighborhoods will be inaccurate near them because their neighborhoods are, respectively, either\npartitioned on the discontinuity or contain points on both sides of it. More details about and a\nvisualization of this are in Sec. 5.4. The second challenge is determining if an explanation generated\nat one point can be applied at a new point [18]. We demonstrate how MAPLE addresses these\nchallenges in Sec. 4.\nThe most closely related work to ours is LIME [17], which provides a local explanation by \ufb01tting\na sparse linear model to the predictive model\u2019s response via sampling randomly around the point\nbeing explained. The authors of [17] also demonstrate how to use local explanations in a variety\nof practical applications. They also de\ufb01ne SP-LIME which summarizes a model by \ufb01nding a set\nof points whose explanations (generated by LIME) are diverse in their selected features and their\ndependence on those features. However, this does not address the dif\ufb01culties local explanations have\nwith global patterns because the individual explanations being chosen all are unaware of them.\nFundamentally, the problem of identifying a good local explanation is a causal question, which is\ntypically very dif\ufb01cult to answer since most models are not causal. However, the local explanation is\nnot trying to \ufb01nd causal structure in the data, but in the model\u2019s response. This makes the problem\nfeasible because we can freely manipulate the input and observe how the model\u2019s response changes.\nHowever, most explanation systems are not evaluated in a way that is consistent with this goal and\nuse the standard evaluation metric: Ex[loss(expx(x), pred(x))]. 
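As a small illustration (our own sketch, not code from the paper; the names are ours), the standard metric above scores each explanation only at the very point it was generated for, so even a trivial constant explanation can score perfectly:

```python
# Our sketch of the standard fidelity metric E_x[loss(exp_x(x), pred(x))]:
# each local explanation exp_x is evaluated only at its own anchor point x.
import numpy as np

def standard_metric(explain, predict, X_test):
    # explain(x) returns a callable local surrogate exp_x(.)
    return float(np.mean([(explain(x)(x) - predict(x)) ** 2 for x in X_test]))

predict = lambda x: float(np.sin(5 * x[0]))
# Degenerate "explanation": a constant equal to the model's own output at x.
explain = lambda x: (lambda x_new, c=predict(x): c)

X_test = np.random.default_rng(0).uniform(size=(100, 1))
print(standard_metric(explain, predict, X_test))  # 0.0, yet it explains nothing
```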
To address this issue, we define the causal local explanation metric as\n\nEx,x'~px[loss(expx(x'), pred(x'))].   (1)\n\nThis metric is based on sampling x' from px, which is a distribution centered around x, and encourages the explanation generated at x to accurately predict the model's value at x'. Interestingly, performing well on the standard metric does not guarantee doing well on the causal metric. To see this, consider a local linear explanation with the form expx(x') = 0T x' + pred(x). While it performs perfectly on the standard evaluation metric, its performance on the causal metric can be arbitrarily bad. This is because all of the active effects (the features that, if perturbed, significantly affect the model's response) are rolled into the bias term pred(x) rather than being given a non-zero coefficient. This explanation does not tell us anything about why the model made its prediction and, consequently, we want a metric that does not select it.\nBased on the methods in [13, 18], we define a global explanation as a set of rules that generally hold true for pred() for all or a well defined subset of the input space. Two of the main challenges of using global explanations are 1) adequately covering the input space and 2) properly processing the data to allow the rules to be meaningful. Anchors [18], which approximates the model with a set of if-then rules, is an example of a method that provides global explanations. It has the advantage of providing easily understood explanations and it is simple to determine if an explanation applies for a specific instance. Further, [18] shows that it is possible to choose these rules such that they have very high precision (at the cost of reduced coverage). While MAPLE does not directly offer global explanations, we show how to use it to detect global patterns in Sec.
4.1, which are the patterns global\nexplanations represent best.\nBased on [1, 12, 15], we de\ufb01ne an example-based explanation as any function that assigns weights\nto the training points based on how much in\ufb02uence they have on the predictive model or on an\nindividual prediction. The most general form of an example-based explanation is in\ufb02uence functions\n[6]. In\ufb02uence functions study how the model or prediction would change if a training point was\nup-weighted in\ufb01nitesimally, but require model differentiability and convexity to work. However,\n\n3\n\n\fit was recently shown that these assumptions can be relaxed and that in\ufb02uence functions can be\nused for understanding model behavior, model debugging, detecting dataset errors, and creating\nvisually indistinguishable adversarial training examples [12]. More speci\ufb01c strategies for methods of\ngenerating example-based explanations are Case Based Reasoning [2], K Nearest Neighbors [1], and\nSP-LIME [17]. Notably, MAPLE naturally provides in\ufb02uence functions via the weights it assigns to\nthe training points for a particular test-point/prediction.\n\n2.2 Random Forests: Feature Selection and Local Models\n\nDue to their accuracy and robustness, random forests [4] have been a popular and effective method in\nmachine learning. Unfortunately, they are not generally considered to be interpretable because they\naggregate many decision trees, each of which is often quite large. However, they yield a measure\nof global variable importance and they can be viewed as a way of doing supervised neighborhood\nselection. We use both of these aspects in de\ufb01ning our method.\nThe permutation based importance measure initially proposed in [4] determines the feature importance\nby considering the performance of the random forest before and after a random permutation of a\npredictor. 
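This permutation-based measure can be sketched model-agnostically as follows (our code and names, not the implementation from [4]): the importance of feature j is the increase in error after randomly permuting column j.

```python
# Sketch of permutation-based feature importance: shuffle one column and
# measure how much the model's error increases (a larger increase means a
# more important feature). `predict` is any fitted model's prediction function.
import numpy as np

def permutation_importance(predict, X, y, rng, n_repeats=5):
    base = np.mean((predict(X) - y) ** 2)          # baseline MSE
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])                  # destroy feature j's signal
            scores[j] += np.mean((predict(Xp) - y) ** 2) - base
    return scores / n_repeats

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)
predict = lambda X: 3 * X[:, 0]                    # toy "model": uses feature 0 only
imp = permutation_importance(predict, X, y, rng)
# feature 0 dominates; features 1 and 2 contribute nothing
```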
Another popular variant was proposed in [9] which works by summing the impurity\nreductions over each node in the tree where a split was made on that variable, while adjusting for the\nnumber of points in the node, and then averaging this over the forest. DStump [11] simpli\ufb01es this\nmeasure further by only considering the splits made on the root nodes of the trees. We use DStump\nfor global feature selection as part of MAPLE, though MAPLE can be extended to work with other\nvariants.\nOne of the reasons that local methods are not commonly applied on large-scale problems is that,\nalthough their learning rates are minimax optimal, this rate is conservative when not all of the features\nare involved in the response, as demonstrated empirically in [3]. As a result, we are interested in\nusing a supervised local method. Random forests have two main interpretations towards this: \ufb01rst, as\nan adaptive method for \ufb01nding potential nearest neighbors [14] and, second, as a kernel method [19].\nRecently, [3] introduced SILO which explicitly uses random forests to de\ufb01ne the instance weights for\nlocal linear modeling. Empirically, they found that this decreased the bias of the random forest and\nincreased its variance, which is potentially problematic in high dimensional settings.\n\n3 MAPLE\n\nOur proposed method, MAPLE (Model Agnostic SuPervised Local Explanations), combines the idea\nof using random forests as a method for supervised neighborhood selection for local linear modeling,\nintroduced in [3] as SILO, with the feature selection method proposed in [11] as DStump. For a\ngiven point, SILO de\ufb01nes a local neighborhood by assigning a weight to each training point based on\nhow frequently that training point appears in the same leaf node as the given point across the trees in\nthe random forest. 
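The SILO-style weighting just described can be sketched with scikit-learn (our code; `silo_weights` is our name, and the toy data is illustrative only):

```python
# Our sketch of SILO's supervised neighborhood: weight each training point
# by how often it shares a leaf with the query point x across the forest,
# then fit a weighted linear model with those weights.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 5))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)    # only feature 0 is active

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def silo_weights(rf, X_train, x):
    """w(x_i, x): average over trees of 1{same leaf as x} / (leaf size)."""
    train_leaves = rf.apply(X_train)               # (n, K) leaf indices
    query_leaves = rf.apply(x.reshape(1, -1))      # (1, K)
    same_leaf = train_leaves == query_leaves       # connection function c_k
    num_k = same_leaf.sum(axis=0)                  # points in x's leaf, per tree
    return (same_leaf / num_k).mean(axis=1)

x = np.full(5, 0.5)
w = silo_weights(rf, X, x)                         # the local training distribution

# Weighted least squares on {x_i, w_i, y_i}, evaluated at x.
Xb = np.hstack([np.ones((len(X), 1)), X])          # bias column
sw = np.sqrt(w)
beta = np.linalg.lstsq(Xb * sw[:, None], y * sw, rcond=None)[0]
pred = float(np.array([1.0, *x]) @ beta)
```

Restricting the columns of this regression to the d best features from DStump would give MAPLE's prediction.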
DStump defines the importance of a feature based on how much it reduces the impurity of the label when it is split on at the root of the trees in the random forest. These methods are explained in more detail shortly.\nUnder certain regularity conditions, SILO has been shown to be consistent in that its estimator converges in probability to the true function [3]. Further, DStump has been shown to identify the active features in a high dimensional setting under the assumption of a general additive model [11]. As a result, the combination of these methods is likely an effective model in a variety of problem settings because it should inherit both of these properties. However, rather than focus on the statistical properties and empirical accuracy of this combination, we focus instead on how the resulting algorithm can be applied to problems where interpretability is a concern.\nBefore formally defining these procedures and our method, we introduce some notation. Let x \u2208 R^(p+1) be a feature vector; we assume [x]_0 = 1 is a constant term. The index j \u2208 {0, . . . , p} will refer to a specific feature in this vector. Next, let {x_i}_{i=1}^n be the training set. The index i will refer to a specific feature vector (training point) in the training set. Then X \u2208 R^(n\u00d7(p+1)) will denote the matrix representation of the training set (i.e., [X]_{i,j} = [x_i]_j). Finally, let {T_k}_{k=1}^K be the trees in the random forest; the index k will always refer to a specific tree.\nWe begin by defining the process by which SILO computes the weights of the training points (i.e., the local training distribution) and makes predictions. Let leaf_k(x) be the index of the leaf node of T_k that contains x. We then define the connection function of the kth tree as\n\nc_k(x, x') = 1{leaf_k(x) = leaf_k(x')},\n\nso the number of training points in the same leaf node as x is\n\nnum_k(x) = \u2211_{i=1}^n c_k(x_i, x).   (2)\n\nFinally, the weight function of the random forest for the ith training point at the point x is\n\nw(x_i, x) = (1/K) \u2211_{k=1}^K c_k(x_i, x) / num_k(x).   (3)\n\nFor a random forest, the model prediction can be written as\n\n\u02c6f_RF(x) = (1/K) \u2211_{k=1}^K \u2211_{i=1}^n c_k(x_i, x) y_i / num_k(x).\n\nFor SILO, the prediction is given by evaluating the solution to the weighted linear regression problem defined by {x_i, w(x_i, x), y_i} at x. Let W_x \u2208 R^(n\u00d7n) be the diagonal weight matrix where [W_x]_{i,i} = w(x_i, x). Then SILO's prediction is\n\n\u02c6f_SILO(x) = \u02c6\u03b2_x^T x where \u02c6\u03b2_x = (X^T W_x X)^{-1} X^T W_x y.   (4)\n\nNext, we define the process DStump uses to select features. Let split_k \u2208 {1, . . . , p} be the index of the feature that the root node of the kth tree split on and suppose that that split reduces the impurity of the label by r_k. Then DStump assigns feature j the score\n\ns_j = \u2211_{k=1}^K 1{split_k = j} r_k,\n\nand chooses the subset, A_d \u2282 {1, . . . , p}, of the d highest scored features.\nMAPLE combines these procedures by using SILO's local training distribution and the best d features from DStump (along with a constant term corresponding to the bias in the local linear model) to solve the weighted linear regression problem {[x_i]_{A_d}, w(x_i, x), y_i}. Formally, let Z_d = [X]_{:,A_d} \u2208 R^(n\u00d7(d+1)) and z_d = [x]_{A_d} \u2208 R^(d+1). Then MAPLE makes the prediction\n\n\u02c6f_MAPLE(x) = \u02c6\u03b2_{x,d}^T z_d where \u02c6\u03b2_{x,d} = (Z_d^T W_x Z_d)^{-1} Z_d^T W_x y.   (5)\n\nChoosing d: We pick the number of features to use in our local linear model via a greedy forward selection procedure that relies on the fact that features can be sorted by their s_j scores. Specifically, we evaluate the predictive accuracy of MAPLE on a held out validation set for d = 1, . . .
, p and\nchoose the value of d that gives the best validation accuracy. It would also be possible to choose d\nbased on the causal local metric.\nExtension to Gradient Boosted Regression Trees (GBRT): GBRTs [9], an alternative tree ensem-\nble approach, can also naturally be integrated with MAPLE. Indeed, GBRT tree ensembles can be\nused to generate local training distributions and feature scores, which can then be fed into MAPLE.\n\n4 MAPLE as an Explanation System\n\nIn this section we describe how to use MAPLE to generate explanations ([17] gives details on practical\napplications of these explanations). The general process of using MAPLE to explain a prediction is\nessentially the same whether we use MAPLE as a predictive model or only as a black-box explainer;\nthe only difference is that in the \ufb01rst case we \ufb01t MAPLE directly on the response variable, while\nin the second case we \ufb01t it on the predictive model\u2019s predicted response. MAPLE\u2019s local training\ndistribution is vital to these explanations, and is what enables it to address two core weaknesses of\nlocal explanations related to 1) diagnosing their limitations in the presence of global patterns, and 2)\nselecting an appropriate explanation for a new test point when restricted to an existing set of exemplar\nexplanations. We discuss both of these topics in the remainder of this section.\n\n5\n\n\fFigure 2: Distribution of the active feature of the most in\ufb02uential points in the training set across a grid search\nover the range of feature values. (a) Linear Dataset: The distributions are roughly uniform in width, typically\ncentered, and change smoothly. (b) SIL Dataset: The distributions are wider in the \ufb02atter regions of the function\nand narrower in the transition. They also follow two disjoint ranges with one intermediate range on the transition\nbetween the two \ufb02atter regions. 
(c) Step Dataset: The distributions are wide, essentially disjoint across the\ndiscontinuities of the step function, and not centered.\n\n4.1 Generating Explanations and Detecting Global Patterns\n\nWhen MAPLE makes a prediction/local explanation, it uses a local linear model, where the coef\ufb01-\ncients determine the estimated local effect of each feature. If a feature coef\ufb01cient is non-zero, then\nwe can interpret the impact of the feature according to the sign and magnitude of the coef\ufb01cient.\nHowever, a zero coef\ufb01cient has two possible interpretations: 1) the feature does not contain global\npatterns and so it is, indeed, locally inactive; or 2) the feature does contain global patterns and this\nfeature is in fact signi\ufb01cant (though not necessarily locally signi\ufb01cant). Consequently, our main goal\nis diagnosing the ef\ufb01cacy of a local explanation to determine whether a feature with a zero coef\ufb01cient\ncontains global patterns. We summarize how we can leverage the local training distribution to do this\n(see additional discussion in Sec. 5.4).\nIn particular, we propose two diagnostics for each feature. The \ufb01rst (and simpler) diagnostic involves\nusing the local training distribution for the given test point to create a boxplot to visualize the\ndistribution of each feature. If the boxplot is substantially skewed (i.e., not centered around the\ntest point), then that feature likely contains a global pattern and the test point is nearby it. If the\nboxplots are not skewed, then the second diagnostic involves performing a grid search across the\nrange of the feature in question. For each value on the grid, we can sample the remaining features in\nsome reasonable way (e.g., by \ufb01nding several training points with a similar feature value, assuming\nfeature independence and sampling from the empirical distribution, or via MCMC), and create a\nboxplot for the local training distribution across this grid (see Fig. 
2 for examples of this type of plot; the experimental setup is described in Sec. 5.4). If the local training distributions appear to share similar boundaries that change abruptly during the grid search, as seen in Fig. 2c and somewhat in Fig. 2b, then there is likely a global pattern present in that feature. Conversely, if the local training distributions are roughly centered around the test point during the grid search and change smoothly during it, as seen in Fig. 2a, then the effect of that feature likely does not have significant global patterns.\n\n4.2 Picking an Exemplar Explanation\n\nAs noted in [17], there are settings where we want to compile a small set of (presumably diverse) representative exemplar explanations, and use these exemplar explanations to explain new test points. Selecting the appropriate exemplar for a new test point is a challenge for existing local explanation systems [18]. MAPLE provides an elegant solution to this problem by using the local training distributions. Specifically, we can determine if we can apply a particular exemplar explanation to a proposed test point by evaluating how likely the proposed point is under the exemplar explanation's local training distribution.\nOf course, there is no guarantee that, collectively, these distributions span the entire input space. If asked to explain a test point from an uncovered part of the input space, then the proposed point will have low probability under all of the exemplar explanation distributions, and, having noticed that, we can determine that no exemplar explanation should be applied.
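One simple way to operationalize this (our sketch; the scoring rule and names are our own, not a procedure prescribed here) is to score a candidate point against each exemplar's local training distribution, e.g. with a weighted kernel density estimate, and fall back to "no exemplar" when every score is negligible:

```python
# Our sketch: score how well an exemplar's local training distribution
# (weights w_e over the training points) covers a new point x, using a
# weighted Gaussian kernel density estimate as the likelihood proxy.
import numpy as np

def coverage(x, X_train, w_e, bandwidth=0.1):
    d2 = np.sum((X_train - x) ** 2, axis=1)
    return float(w_e @ np.exp(-d2 / (2 * bandwidth ** 2)))

def pick_exemplar(x, X_train, exemplar_weights, threshold=1e-3):
    scores = [coverage(x, X_train, w) for w in exemplar_weights]
    best = int(np.argmax(scores))
    # below threshold: x is in an uncovered region, so no exemplar applies
    return best if scores[best] > threshold else None

# Toy check: two exemplars whose weight is concentrated at opposite ends.
X_train = np.linspace(0, 1, 50).reshape(-1, 1)
w_low = np.exp(-X_train[:, 0] / 0.05); w_low /= w_low.sum()
w_high = np.exp(-(1 - X_train[:, 0]) / 0.05); w_high /= w_high.sum()

pick_exemplar(np.array([0.05]), X_train, [w_low, w_high])  # -> 0
pick_exemplar(np.array([0.95]), X_train, [w_low, w_high])  # -> 1
pick_exemplar(np.array([5.0]), X_train, [w_low, w_high])   # -> None (uncovered)
```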
Similarly, we can detect if multiple exemplar explanations may be equally applicable.\n\n5 Experimental Results\n\nWe present experiments demonstrating that: 1) MAPLE is generally at least as accurate as random forests, GBRT, and SILO; 2) MAPLE provides faithful self-explanations, i.e., its local linear model at x is a good local explanation of the prediction at x; 3) MAPLE is more accurate in predicting a black-box predictive model's response than a comparable and popular explanation system, LIME [17]; and 4) the local training distribution can be used to detect the presence of global patterns in the predictive model.\n\n5.1 Accuracy on UCI datasets\n\nWe run our experiments on several of the UCI datasets [7]. Each dataset was divided into a 50/25/25 training, validation, and testing split for each of the trials. All variables, including the response, were standardized to have mean zero and variance one.\nWe compare MAPLE and SILO with tree ensembles constructed using standard random forests (RF) and gradient boosted regression trees (GBRT). The ensemble choice for the baseline impacts the structure of the trees, which alters the weights as well as the global features selected by DStump. We include the performance of a linear model (LM) as well. Root mean squared errors (RMSE) are reported in Table 1 and the number of selected features is in Table 4. Overall, MAPLE does at least as well as the tree ensembles and SILO, and often does better (the Music dataset being the sole exception).\n\n5.2 Faithful Self-Explanations\n\nWe next demonstrate that the local linear model that MAPLE uses to make its prediction doubles as an effective local explanation. The general data processing is the same as in the previous section, and we restrict our results to the random forest based version of MAPLE.
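Concretely, the causal metric of Eq. (1) can be estimated as in the sketch below (our code, not the released implementation); a faithful linear explanation of a linear model scores near zero, while the degenerate constant explanation from Sec. 2.1 does not:

```python
# Our sketch of the causal-metric evaluation: draw x' ~ N(x, sigma^2 I)
# around each test point and compare the local explanation's prediction
# at x' with the model's prediction there; report the RMSE form of Eq. (1).
import numpy as np

def causal_rmse(explain, predict, X_test, sigma=0.1, n_draws=5, seed=0):
    rng = np.random.default_rng(seed)
    sq_errs = []
    for x in X_test:
        exp_x = explain(x)                         # local surrogate fit at x
        for _ in range(n_draws):
            x_prime = x + sigma * rng.normal(size=x.shape)
            sq_errs.append((exp_x(x_prime) - predict(x_prime)) ** 2)
    return float(np.sqrt(np.mean(sq_errs)))

predict = lambda x: float(3 * x[0] - x[1])
# A faithful local linear explanation of this (globally linear) model ...
good = lambda x: (lambda xp: float(3 * xp[0] - xp[1]))
# ... versus the constant explanation exp_x(x') = pred(x).
const = lambda x: (lambda xp, c=predict(x): c)

X_test = np.random.default_rng(1).normal(size=(50, 2))
print(causal_rmse(good, predict, X_test))   # ~0: faithful
print(causal_rmse(const, predict, X_test))  # ~0.3: penalized, as desired
```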
We use our proposed causal metric defined in (1) as our evaluation metric, defining px as N(x, \u03c3I), using the squared l2 loss, and approximating the expectation by taking x from the testing set and drawing five x' per testing point. We chose \u03c3 = 0.1 as a reasonable choice for the neighborhood scale because the data was normalized to have variance one. The results, which show the RMSE of the causal metric in Table 2, demonstrate that the local linear models produced are good local explanations for the model as a whole when compared to using LIME to explain MAPLE.\n\n5.3 MAPLE as a Black-box Explainer\n\nThe overall setup for these experiments is the same as in the previous section, except that we are evaluating explanation systems in the black-box setting, where they are fit against the predictive model's predicted response. We use a Support Vector Regression (SVR) model (implementation and standard parameters from scikit-learn) as a black-box predictive model. We present a comparison of MAPLE to LIME in Table 3. Our results show the RMSE of the causal metric. MAPLE produces more accurate local explanations than LIME for all but the Crimes dataset, where the difference was not statistically significant.\n\nDataset          LM     RF      SILO+RF  MAPLE+RF  GBRT   SILO+GBRT  MAPLE+GBRT\nAutompgs         0.446  0.4164  0.3784   0.381     0.392  0.3745     0.377\nCommunities      0.781  0.745   0.724    0.688     0.709  0.751      0.712\nCrimes           0.327  1.012   0.531    0.331     0.968  0.493      0.295\nDay              0      0.204   1.7e-05  6e-06     0.104  1.3e-05    4e-06\nHappiness        0.001  0.644   0.001    0.001     0.344  0.001      0.001\nHousing          0.56   0.486   0.409    0.419     0.395  0.396      0.404\nMusic            0.935  0.742   0.881    0.764     0.658  0.901      0.849\nWinequality-red  0.814  0.78    0.779    0.778     0.783  0.786      0.779\nTable 1: Average RMSE across 50 trials; underlined results indicate that MAPLE differed significantly from the baseline method (RF or GBRT) and bold results indicate that MAPLE differed significantly from SILO built on the same baseline. With the exception of the Music dataset, MAPLE is at least as good as the baseline.\n\nDataset          LIME   MAPLE\nAutompgs         0.178  0.042\nCommunities      0.409  0.130\nCrimes           0.276  0.047\nDay              0.034  0\nHappiness        0.05   3e-05\nHousing          0.238  0.07\nMusic            0.189  0.181\nWinequality-red  0.149  0.06\nTable 2: A comparison of the causal metric of MAPLE vs LIME for \u03c3 = 0.1 when being used to explain the predictions of MAPLE. Values shown are RMSE averaged over 25 trials. Bold entries denote a significant difference.\n\nDataset          SVR    LIME   MAPLE\nAutompgs         0.39   0.282  0.15\nCommunities      0.761  0.338  0.323\nCrimes           0.895  0.183  0.232\nDay              0.2    0.242  0.17\nHappiness        0.267  0.28   0.187\nHousing          0.459  0.366  0.206\nMusic            0.816  0.326  0.304\nWinequality-red  0.807  0.295  0.204\nTable 3: A comparison of the causal metric of MAPLE vs LIME for \u03c3 = 0.1 when being used in the black-box setting to explain the predictions of a SVR model. Values shown are RMSE averaged over 25 trials. Bold entries denote a significant difference.\n\nDataset          n     p    d (RF)  d (GBRT)\nAutompgs         392   8    6.44    5.94\nCommunities      1993  103  54.14   50.12\nCrimes           2214  103  20.34   21.62\nDay              731   15   2.46    3.02\nHappiness        578   8    7.74    7.46\nHousing          506   12   9.98    10.06\nMusic            1059  70   5.56    14.46\nWinequality-red  1599  12   7.1     6.88\nTable 4: For each dataset, its size, n, and dimension, p, along with the average number of features used, d, by MAPLE for RF and GBRT respectively.\n\nWhat about a Larger \u03c3?: After data normalization, the median range of our features (across all datasets) is roughly six. For a single dimension, the width of the 95% confidence interval for the sample used in the causal metric is approximately 4\u03c3.
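The arithmetic behind this neighborhood-scale concern can be checked directly (our sketch): each coordinate of x' ~ N(x, \u03c3^2 I) lands within +/- 2\u03c3 of x with probability about 0.954, so the chance that all p coordinates do so decays geometrically in p:

```python
# Quick check (ours) of the 4*sigma interval claim: per-coordinate coverage
# of +/- 2*sigma is ~0.9545, so all-p-coordinate coverage is 0.9545**p.
import math

per_dim = math.erf(2 / math.sqrt(2))        # P(|Z| <= 2), Z ~ N(0, 1)

def all_dims_within(p, q=per_dim):
    """Probability that x' ~ N(x, sigma^2 I) stays within +/- 2*sigma
    of x in every one of its p coordinates."""
    return q ** p

print(round(per_dim, 4))                    # 0.9545
print(round(all_dims_within(12), 3))        # 0.572  (e.g. p = 12, Housing)
print(round(all_dims_within(103), 3))       # 0.008  (p = 103, Communities)
```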
In addition to the reported results with σ = 0.1, we ran experiments with σ = 0.25. For this larger value, the expected range of the neighborhood is approximately one sixth of the overall feature range. Further, the probability that at least one dimension falls outside of this interval increases exponentially with the number of dimensions, so the neighborhoods are effectively even larger for high-dimensional problems. Thus σ = 0.25 appears to be unreasonably large for a local explanation in high dimensions, especially considering that standard nearest neighbor methods frequently rely on a small constant number of neighbors. Nonetheless, we note that for σ = 0.25, MAPLE significantly outperformed LIME on the Autompgs, Happiness, Housing, and Winequality-red datasets. In contrast, LIME significantly outperformed MAPLE on the three datasets with the largest dimension p (see Table 4): Communities, Crimes, and Music.

5.4 Using Influential Training Points

To demonstrate how the local training distribution can be used to make inferences about the global patterns of the model, we work with the datasets introduced in Fig. 1 of Sec. 1. Each dataset consists of n = 200 draws from [0, 1]^5 taken uniformly at random. These samples are passed through either a linear, shifted inverse logistic (SIL), or step function that acts only on the first dimension of the feature vector (the remaining four dimensions are noisy features), and then normally distributed noise with σ = 0.1 is added. We refer to the first feature as the 'active' feature. We then fit a random forest to the data and fit MAPLE to the random forest's predicted response. Next, we do a grid search across the active feature (e.g., [x]_0 = 0, 0.1, . . . , 1.0) and sample the remaining features uniformly at random from [0, 1] (simulating sampling the remaining features from the data distribution).
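The influential-points step in this experiment relies on the local training distribution, which weights each training point by how often it shares a leaf with the query point across the trees of the ensemble (the similarity weights of Eq. 2). A minimal sketch of that idea using scikit-learn's random forest follows; it illustrates the weighting scheme and the grid-search setup described above, and is not the paper's exact implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def influential_points(forest, X_train, x, k=20):
    """Return indices of the k training points with the largest similarity
    weights, where a point's weight is the fraction of trees in which it
    lands in the same leaf as the query x (in the spirit of MAPLE's local
    training distribution; details of the actual method may differ)."""
    train_leaves = forest.apply(X_train)           # (n_train, n_trees) leaf ids
    query_leaves = forest.apply(x.reshape(1, -1))  # (1, n_trees)
    weights = (train_leaves == query_leaves).mean(axis=1)
    return np.argsort(weights)[::-1][:k]

# Grid search over the active feature, mirroring the experiment above:
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=200)  # linear in the active feature
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
for x0 in np.linspace(0.0, 1.0, 11):
    x = np.concatenate(([x0], rng.uniform(size=4)))
    idx = influential_points(forest, X, x)
    active_vals = X[idx, 0]  # summarized by the boxplots of Fig. 2
```

The design choice here is that shared-leaf frequency acts as a supervised notion of neighborhood: two points are "close" only if the forest treats them as interchangeable for prediction.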
For a given sampled point x′, we use MAPLE's local training distribution to identify the 20 most influential training points and plot a boxplot of the distribution of the active feature for these influential points. To smooth the results, we repeat this procedure 10 times for each point in the grid search. The plots of these distributions are in Fig. 2.

Interpreting these Distributions: When a random forest fits a continuous function, each tree splits the input space into finer and finer partitions and, if the slope of the function does not change too rapidly, the locations of the splits are relatively uncorrelated between the trees. As a result, the influential training points for a prediction at x are roughly centered around x and tend to change smoothly as x changes. This is demonstrated in Fig. 2a, with the exception of points near the boundary of the distribution.

Further, the steeper the function is at x, the finer the partitions become around x and, consequently, the more concentrated the distribution of the influential points becomes. This can be seen by contrasting Fig. 2a and Fig. 2b: in Fig. 2a, the distributions have roughly equal variance, while in Fig. 2b, the distributions in the flat areas of the function have large variance and the distributions in the steep transition of the function are much narrower. Although not shown, one extreme of this is an inactive feature; in all of our experiments, the inactive-feature distributions of the most influential training points look like the original data distribution.

In the extreme case, when fitting a discontinuous function, the trees are likely to all split at or near the discontinuity. As a result, the influential training points for x may not be centered around x when x is near the discontinuity, and they will change abruptly as x moves past the discontinuity. This is demonstrated in Fig.
2c, where we clearly observe the transitions of the step function, as well as the fact that the intervals are not centered around x. Further, in Fig. 2b we can see that the function has two flat areas separated by a short, steep transition: there are two main ranges for the influential points, with one intermediate range at [x]_0 = 0.5.

6 Conclusion and Future Work

We have shown that MAPLE is effective both as a predictive model and as an explanation system. Additionally, we have demonstrated how to use its local training distribution to address two key weaknesses of local explanations: 1) detecting and modeling global patterns, and 2) determining whether an exemplar explanation can be applied to a new test point. Some interesting avenues of future work include: 1) exploiting the fact that MAPLE is a locally linear model to tap into the wide range of approaches that use influence functions to improve model accuracy or identify interesting data points via measures such as leverage and Cook's distance [5]; 2) exploring the use of local feature selection approaches with MAPLE, e.g., by considering impurity reductions along the paths through all trees for a given test point; and 3) exploring methods other than tree ensembles for defining the similarity weights from Eq. 2.

Acknowledgements

This work was supported in part by DARPA FA875017C0141, the National Science Foundation grants IIS1705121 and IIS1838017, an Okawa Grant, a Google Faculty Award, an Amazon Web Services Award, and a Carnegie Bosch Institute Research Award.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.

References

[1] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.

[2] Jacob Bien and Robert Tibshirani. Prototype selection for interpretable classification. The Annals of Applied Statistics, pages 2403–2424, 2011.

[3] Adam Bloniarz, Ameet Talwalkar, Bin Yu, and Christopher Wu. Supervised neighborhoods for distributed nonparametric regression. In Artificial Intelligence and Statistics, pages 1450–1459, 2016.

[4] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[5] Samprit Chatterjee and Ali S Hadi. Influential observations, high leverage points, and outliers in linear regression. Statistical Science, pages 379–393, 1986.

[6] R Dennis Cook and Sanford Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980.

[7] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.

[8] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

[9] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

[10] Shachar Kaufman, Saharon Rosset, Claudia Perlich, and Ori Stitelman. Leakage in data mining: Formulation, detection, and avoidance. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(4):15, 2012.

[11] Jalil Kazemitabar, Arash Amini, Adam Bloniarz, and Ameet S Talwalkar.
Variable importance using decision trees. In Advances in Neural Information Processing Systems, pages 425–434, 2017.

[12] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.

[13] Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1675–1684. ACM, 2016.

[14] Yi Lin and Yongho Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590, 2006.

[15] Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.

[16] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4768–4777, 2017.

[17] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.

[18] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. AAAI, 2018.

[19] Erwan Scornet. Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3):1485–1500, 2016.