{"title": "Human-in-the-Loop Interpretability Prior", "book": "Advances in Neural Information Processing Systems", "page_first": 10159, "page_last": 10168, "abstract": "We often desire our models to be interpretable as well as accurate. Prior work on optimizing models for interpretability has relied on easy-to-quantify proxies for interpretability, such as sparsity or the number of operations required.  In this work, we optimize for interpretability by directly including humans in the optimization loop.  We develop an algorithm that minimizes the number of user studies to find models that are both predictive and interpretable and demonstrate our approach on several data sets.  Our human subjects results show trends towards different proxy notions of interpretability on different datasets, which suggests that different proxies are preferred on different tasks.", "full_text": "Human-in-the-Loop Interpretability Prior\n\nIsaac Lage\n\nDepartment of Computer Science\n\nHarvard University\n\nisaaclage@g.harvard.edu\n\nAndrew Slavin Ross\n\nDepartment of Computer Science\n\nHarvard University\n\nandrew_ross@g.harvard.edu\n\nBeen Kim\nGoogle Brain\n\nbeenkim@google.com\n\nSamuel J. Gershman\n\nDepartment of Psychology\n\nHarvard University\n\ngershman@fas.harvard.edu\n\nfinale@seas.harvard.edu\n\nFinale Doshi-Velez\n\nDepartment of Computer Science\n\nHarvard University\n\nAbstract\n\nWe often desire our models to be interpretable as well as accurate. Prior work on\noptimizing models for interpretability has relied on easy-to-quantify proxies for\ninterpretability, such as sparsity or the number of operations required. In this work,\nwe optimize for interpretability by directly including humans in the optimization\nloop. We develop an algorithm that minimizes the number of user studies to \ufb01nd\nmodels that are both predictive and interpretable and demonstrate our approach\non several data sets. 
Our human subjects results show trends towards different\nproxy notions of interpretability on different datasets, which suggests that different\nproxies are preferred on different tasks.\n\n1 Introduction\n\nUnderstanding machine learning models can help people discover confounders in their training\ndata, and dangerous associations or new scientific insights learned by their models [3, 9, 15]. This\nmeans that we can encourage the models we learn to be safer and more useful to us by effectively\nincorporating interpretability into our training objectives. But interpretability depends on both the\nsubjective experience of human users and the downstream application, which makes it difficult to\nincorporate into computational learning methods.\nHuman-interpretability can be achieved by learning models that are inherently easier to explain or\nby developing more sophisticated explanation methods; we focus on the first problem. This can be\nsolved with one of two broad approaches. The first defines certain classes of models as inherently\ninterpretable. Well-known examples include decision trees [9], generalized additive models [3], and\ndecision sets [13]. The second approach identifies some proxy that (presumably) makes a model\ninterpretable and then optimizes that proxy. Examples of this second approach include optimizing\nlinear models to be sparse [29], optimizing functions to be monotone [1], or optimizing neural\nnetworks to be easily explained by decision trees [33].\nIn many cases, the optimization of a property can be viewed as placing a prior over models and\nsolving for a MAP solution of the following form:\n\n$$\\max_{M \\in \\mathcal{M}} p(X|M)\\, p(M) \\tag{1}$$\n\nwhere $\\mathcal{M}$ is a family of models, X is the data, p(X|M) is the likelihood, and p(M) is a prior on the\nmodel that encourages it to share some aspect of our inductive biases. 
Two examples of biases include\nthe interpretation of the L1 penalty on logistic regression as a Laplace prior on the weights and the\nclass of norms described in Bach [2] that induce various kinds of structured sparsity. Generally, if we\nhave a functional form for p(M), we can apply a variety of optimization techniques to find the MAP\nsolution. Placing an interpretability bias on a class of models through p(M) allows us to search for\ninterpretable models in more expressive function classes.\nOptimizing for interpretability in this way relies heavily on the assumption that we can quantify\nthe subjective notion of human interpretability with some functional form p(M). Specifying this\nfunctional form might be quite challenging. In this work, we directly estimate the interpretability\nprior p(M) from human-subject feedback. Optimizing this more direct measure of interpretability\ncan give us models more suited to a task at hand than more accurately optimizing an imperfect proxy.\nSince measuring p(M) for each model M has a high cost (each measurement requires a user study), we develop a\ncost-effective approach that initially identifies models M with high likelihood p(X|M), then uses\nmodel-based optimization to identify an approximate MAP solution from that set with few queries\nto p(M). We find that different proxies for interpretability prefer different models, and that our\napproach can optimize all of these proxies. Our human subjects results suggest that we can optimize\nfor human-interpretability preferences.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Related Work\n\nLearning interpretable models with proxies Many approaches to learning interpretable models\noptimize proxies that can be computed directly from the model. 
Examples include decision tree depth\n[9], number of integer regression coef\ufb01cients [30], amount of overlap between decision rules [13],\nand different kinds of sparsity penalties in neural networks [10, 24]. In some cases, optimizing a\nproxy can be viewed as MAP estimation under an interpretability-encouraging prior [29, 2]. These\nproxy-based approaches assume that it is possible to formulate a notion of interpretability that is a\ncomputational property of the model, and that we know a priori what that property is. Lavrac [14]\nshows a case where doctors prefer longer decision trees over shorter ones, which suggests that these\nproxies do not fully capture what it means for a model to be interpretable in all contexts. Through our\napproach, we place an interpretability-encouraging prior on arbitrary classes of models that depends\ndirectly on human preferences.\n\nLearning from human feedback Since interpretability is dif\ufb01cult to quantify mathematically,\nDoshi-Velez and Kim [8] argue that evaluating it well requires a user study. Many works in inter-\npretable machine learning have user studies: some advance the science of interpretability by testing\nthe effect of explanation factors on human performance on interpretability-related tasks [20, 19]\nwhile others compare the interpretability of two classes of models through A/B tests [13, 11]. More\nbroadly, there exist many studies about situations in which human preferences are hard to articulate\nas a computational property and must be learned directly from human data. Examples include\nkernel learning [28, 31], preference based reinforcement learning [32, 5] and human based genetic\nalgorithms [12]. 
Our work resembles human computation algorithms [16] applied to user studies for\ninterpretability as we use the user studies to optimize for interpretability instead of just comparing a\nmodel to a baseline.\n\nModel-based optimization Many techniques have been developed to efficiently characterize functions\nin few evaluations when each evaluation is expensive. The field of Bayesian experimental design\n[4] optimizes which experiments to perform according to a notion of which information matters. In\nsome cases, the intent is to characterize the entire function space completely [34, 17], and in other\ncases, the intent is to find an optimum [27, 26]. We are interested in this second case. Snoek et\nal. [26] optimize the hyperparameters of a neural network in a problem setup similar to ours. For\nthem, evaluating the likelihood is expensive because it requires training a network, while in our case,\nevaluating the prior is expensive because it requires a user study. We use a similar set of techniques\nsince, in both cases, evaluating the posterior is expensive.\n\n3 Framework and Modeling Considerations\n\nFigure 1: High-level overview of the pipeline\n\nOur high-level goal is to find a model M that maximizes p(M|X) \u221d p(X|M)p(M) where p(M)\nis a measure of human interpretability. We assume that computation is relatively inexpensive, and\nthus computing and optimizing with respect to the likelihood p(X|M) is significantly less expensive\nthan evaluating the prior p(M), which requires a user study. Our strategy will be to first identify a\nlarge, diverse collection of models M with large likelihood p(X|M), that is, models that explain\nthe data well. This task can be completed without user studies. Next, we will search amongst these\nmodels to identify those that also have large prior p(M). 
Specifically, to limit the number of user\nstudies required, we will use a model-based optimization approach [27] to identify which models M\nto evaluate. Figure 1 depicts the steps in the pipeline. Below, we outline how we define the likelihood\np(X|M) and the prior p(M); in Section 4 we define our process for approximate MAP inference.\n\n3.1 Likelihood\n\nIn many domains, experts desire a model that achieves some performance threshold (and amongst\nthose, may prefer one that is most interpretable). To model this notion of a performance threshold, we\nuse the soft insensitive loss function (SILF)-based likelihood [6, 18]. The likelihood takes the form of\n\n$$p(X|M) = \\frac{1}{Z} \\exp\\big(-C \\times \\mathrm{SILF}_{\\epsilon,\\beta}(1 - \\mathrm{accuracy}(X, M))\\big)$$\n\nwhere accuracy(X, M) is the accuracy of model M on data X and $\\mathrm{SILF}_{\\epsilon,\\beta}(y)$ is given by\n\n$$\\mathrm{SILF}_{\\epsilon,\\beta}(y) = \\begin{cases} 0, & 0 \\le y \\le (1-\\beta)\\epsilon \\\\ \\frac{(y - (1-\\beta)\\epsilon)^2}{4\\beta\\epsilon}, & (1-\\beta)\\epsilon \\le y \\le (1+\\beta)\\epsilon \\\\ y - \\epsilon, & y \\ge (1+\\beta)\\epsilon \\end{cases}$$\n\nwhich effectively defines a model as having high likelihood if its accuracy is greater than 1 \u2212 (1 \u2212 \u03b2)\u03b5.\nIn practice, we choose the threshold 1 \u2212 (1 \u2212 \u03b2)\u03b5 to be equal to an accuracy threshold placed on the\nvalidation performance of our classification tasks, and only consider models that perform above that\nthreshold. (Note that with this formulation, accuracy can be replaced with any domain-specific notion\nof a high-quality model without modifying our approach.)\n\n3.2 A Prior for Interpretable Models\n\nSome model classes are generally amenable to human inspection (e.g. decision trees, rule lists,\ndecision sets [9, 13]; unlike neural networks), but within those model classes, there likely still exist\nsome models that are easier for humans to utilize than others (e.g. 
shorter decision trees rather than\nlonger ones [23], or decision sets with fewer overlaps [13]). We want our model prior p(M) to reflect\nthis more nuanced view of interpretability.\nWe consider a prior of the form:\n\n$$p(M) \\propto \\int_x \\mathrm{HIS}(x, M)\\, p(x)\\, dx \\tag{2}$$\n\nIn our experiments, we will define HIS(x, M) (human-interpretability-score) as:\n\n$$\\mathrm{HIS}(x, M) = \\begin{cases} 0, & \\text{mean-RT}(x, M) > \\text{max-RT} \\\\ \\text{max-RT} - \\text{mean-RT}(x, M), & \\text{mean-RT}(x, M) \\le \\text{max-RT} \\end{cases} \\tag{3}$$\n\nwhere mean-RT(x, M) (mean response time) measures how long it takes users to predict the label\nassigned to a data point x by the model M, and max-RT is a cap on response time that is set to\na large enough value to catch all legitimate points and exclude outliers. The choice of measuring\nthe time it takes to predict the model\u2019s label follows Doshi-Velez and Kim [8], who suggest this\nsimulation proxy as a measure of interpretability when no downstream task has been defined yet; but\nany domain-specific task and metric could be substituted into our pipeline, including error detection\nor cooperative decision-making.\n\n[Figure 1 pipeline steps: train a diverse set of high likelihood models; choose a candidate best model to evaluate with model based optimization; evaluate the interpretability of the model with a user study; update the optimization procedure with the new label; return best model as approximate MAP solution.]\n\n3.3 A Prior for Arbitrary Models\n\nIn the interpretable model case, we can give a human subject a model M and ask them questions\nabout it; in the general case, models may be too complex for this approach to be feasible. In order to\ndetermine the interpretability of complex models like neural networks, we follow the approach in\nRibeiro et al. 
[22], and construct a simple local model for each point x by sampling perturbations\nof x and training a simple model to mimic the predictions of M in this local region. We denote this\nlocal model local-proxy(M, x).\nWe change the prior in Equation 2 to reflect that we evaluate the HIS with the local proxy rather than\nthe entire model:\n\n$$p(M) \\propto \\int_x \\mathrm{HIS}(x, \\text{local-proxy}(M, x))\\, p(x)\\, dx \\tag{4}$$\n\nWe describe computational considerations for this more complex situation in Section 4.\n\n4 Inference\n\nOur goal is to find the MAP solution from Equation 1. Our overall approach will be to find a collection\nof models with high likelihood p(X|M) and then perform model-based optimization [27] to identify\nwhich priors p(M) to evaluate via user studies. Below, we describe each of the three main aspects of\nthe inference: identifying models with large likelihoods p(X|M), evaluating p(M) via user studies,\nand using model-based optimization to determine which p(M) to evaluate. The model from our set\nwith the best p(X|M)p(M) is our approximation to the MAP solution.\n\n4.1 Identifying models with high likelihood p(X|M)\n\nIn the model-finding phase, our goal is to create a diverse set of models with large likelihoods\np(X|M) in the hopes that some will have large prior value p(M) and thus allow us to identify the\napproximate MAP solution. For simpler model classes, such as decision trees, we find these solutions\nvia running multiple restarts with different hyperparameter settings and rejecting those that do not\nmeet our accuracy threshold. For neural networks, we jointly optimize a collection of predictive\nneural networks with different input gradient patterns (as a proxy for creating a diverse collection)\n[25].\n\n4.2 Computing the prior p(M)\n\nHuman-Interpretable Model Classes. For any model M and data point x, a user study is required\nfor every evaluation of HIS(x, M). 
Since it is infeasible to perform a user study for every value of x\nfor even a single model M, we approximate the integral in Equation 2 via a collection of samples:\n\n$$p(M) \\propto \\int_x \\mathrm{HIS}(x, M)\\, p(x)\\, dx \\approx \\frac{1}{N} \\sum_{x_n \\sim p(x)} \\mathrm{HIS}(x_n, M)$$\n\nIn practice, we use the empirical distribution over the inputs x as the prior p(x).\nArbitrary Model Classes. If the model M is not itself human-interpretable, we define p(M)\nto be the integral over HIS(x, local-proxy(M, x)) where local-proxy(M, x) locally approximates\nM around x (Equation 4). As before, evaluating HIS(x, local-proxy(M, x)) requires a\nuser study; however, now we must determine a procedure for generating the local approximations\nlocal-proxy(M, x).\nWe generate these local approximations via a procedure akin to Ribeiro et al. [22]: for any x, we\nsample a set of perturbations x\u2032 around x, compute the outputs of model M for each of those x\u2032, and\nthen fit a human-interpretable model (e.g. a decision tree) to those data.\nWe note that these local models will only be nontrivial if the data point x is in the vicinity of a\ndecision boundary; if not, we will not succeed in fitting a local model. Let B(M) denote the set of\ninputs x that are near the decision boundary of M. 
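
A minimal sketch of the perturb-and-fit procedure described above, under stated assumptions. The black-box model, the Gaussian perturbation noise, the depth-1 tree used as the human-interpretable model, and the rule that a point belongs to B(M) when its perturbations receive more than one label are all hypothetical stand-ins chosen for illustration; the procedure actually used is the one detailed in Appendix C.

```python
import random


def black_box(x):
    # Hypothetical stand-in for the complex model M: class 1 iff x[0] + x[1] > 1.
    return 1 if x[0] + x[1] > 1.0 else 0


def sample_perturbations(x, n, scale, rng):
    # Gaussian perturbations around x (the noise model is an assumption).
    return [[xi + rng.gauss(0.0, scale) for xi in x] for _ in range(n)]


def fit_stump(points, labels):
    # Depth-1 decision tree: pick the (feature, threshold, leaf labels)
    # combination that makes the fewest errors on the perturbed points.
    best = None
    for j in range(len(points[0])):
        for t in set(p[j] for p in points):
            for left, right in ((0, 1), (1, 0)):
                errors = sum(
                    (right if p[j] > t else left) != y
                    for p, y in zip(points, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, left, right)
    return best


def local_proxy(model, x, n=200, scale=0.3, seed=0):
    # Returns None when every perturbation gets the same label, i.e. x is
    # treated as lying outside B(M) and no nontrivial local model exists.
    rng = random.Random(seed)
    points = sample_perturbations(x, n, scale, rng)
    labels = [model(p) for p in points]
    if len(set(labels)) < 2:
        return None
    return fit_stump(points, labels)


near = local_proxy(black_box, [0.5, 0.5])  # on the decision boundary
far = local_proxy(black_box, [5.0, 5.0])   # deep inside class 1
```

In this sketch only points such as `near` would be shown to users; points such as `far` need no user study.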
Since we defined HIS to equal max-RT when\nmean-RT(x, M) is 0, as it does when no local model can be fit (see Equation 3), we can compute the\nintegral in Equation 4 more intelligently by only seeking user input for samples near the model\u2019s\ndecision boundary:\n\n$$\\begin{aligned} p(M) &\\propto \\int_x \\mathrm{HIS}(x, \\text{local-proxy}(M, x))\\, p(x)\\, dx \\\\ &= \\left( \\int_{x \\in B(M)} p(x)\\, dx \\right) \\cdot \\left( \\int_{x \\in B(M)} \\mathrm{HIS}(x, \\text{local-proxy}(M, x))\\, \\tilde{p}(x)\\, dx \\right) + \\left( \\int_{x \\notin B(M)} p(x)\\, dx \\right) \\cdot \\text{max-RT} \\\\ &\\approx \\left( \\frac{1}{N'} \\sum_{x_{n'} \\sim p(x)} I(x_{n'} \\in B(M)) \\right) \\cdot \\left( \\frac{1}{N} \\sum_{x_n \\sim \\tilde{p}(x)} \\mathrm{HIS}(x_n, M) \\right) + \\left( \\frac{1}{N'} \\sum_{x_{n'} \\sim p(x)} I(x_{n'} \\notin B(M)) \\right) \\cdot \\text{max-RT} \\end{aligned} \\tag{5}$$\n\nwhere $\\tilde{p}(x) = p(x) / \\int_{x \\in B(M)} p(x)\\, dx$. The first term (the volume of p(x) in B(M)) and the third\nterm (the volume of p(x) not in B(M)) can be approximated without any user studies by attempting to\nfit local models for each point x (or a subsample of points). We detail how we fit local explanations\nand define the boundary in Appendix C.\n\n4.3 Model-based Optimization of the MAP Objective\nThe first stage of our optimization procedure gives us a collection of models {M1, ..., MK} with high\nlikelihood p(X|M). Our goal is to identify the model Mk in this set that is the approximate MAP,\nthat is, maximizes p(X|M)p(M), with as few evaluations of p(M) as possible.\nLet L be the set of all labeled models M, that is, the set of models for which we have evaluated\np(M). We estimate the values (and uncertainties) for the remaining unlabeled models (the set U) via\na Gaussian Process (GP) [21]. (See Appendix A for details about our model-similarity kernel.)\nFollowing Srinivas et al. 
[27], we use the GP upper confidence bound acquisition function to choose\namong unlabeled models M \u2208 U that are likely to have large p(M) (this is equivalent to using the\nlower confidence bound to minimize response time):\n\n$$a_{\\mathrm{LCB}}(M; L, \\theta) = \\mu(M; L, \\theta) - \\kappa\\, \\sigma(M; L, \\theta)$$\n\n$$M_{\\mathrm{next}} = \\arg\\min_{M \\in U} a_{\\mathrm{LCB}}(M; L, \\theta)$$\n\nwhere \u03ba is a hyperparameter that can be tuned, \u03b8 are parameters of the GP, \u03bc is the GP mean function,\nand \u03c3 is the GP variance. (We find \u03ba = 1 works well in practice.)\n\n5 Experimental Setup\n\nIn this section, we provide details for applying our approach to four datasets. Our results are in\nSection 6.\n\nDatasets and Training Details We test our approach on a synthetic dataset as well as the mushroom,\ncensus income, and covertype datasets from the UCI database [7]. All features are preprocessed\nby z-scoring continuous features and one-hot encoding categorical features. We also balance the\nclasses of the first three datasets by subsampling the more common class. (The sizes reported are\nafter class balancing. We do not include a test set because we do not report held-out accuracy.)\n\u2022 Synthetic (N = 90,000, D = 6, continuous). We build a data set with two noise dimensions, two\ndimensions that enable a lower-accuracy, interpretable explanation, and two dimensions that enable\na higher-accuracy, less interpretable explanation. We use an 80%-20% train-validate split. (See\nFigure 1 in the Appendix.)\n\u2022 Mushroom (N = 8,000, D = 22 categorical with 126 distinct values). The goal is to predict if the\nmushroom is edible or poisonous. We use an 80%-20% train-validate split.\n\u2022 Census (N = 20,000, D = 13: 6 continuous, 7 categorical with 83 distinct values). The goal is\nto predict if people make more than $50,000/year. We use their 60%-40% train-validate split.\n\u2022 Covertype (N = 580,000, D = 12: 10 continuous, 2 categorical with 44 distinct values). The\ngoal is to predict tree cover type. We use a 75%-25% train-validate split.\n\n(a) For each pair of proxies (A, B) for interpretability, we first identify the best model if we only care about proxy A, then compute its rank if we now care about proxy B. This simulates the setting where we optimize for proxy B, but A is the true HIS. This value for each pair of proxies is plotted with an \u00d7. 
The large ranking value indicates that sometimes proxies disagree on which models are good.\n\n(b) Rank of the best model(s) by each proxy across multiple samples of data points (\u2018N.Z.\u2019 denotes non-zero and \u2018feats.\u2019 denotes features). This simulates the setting where we compute HIS on a human-accessible number of data points. The lines dropping below the high values in Figure 2a indicate that computing the right proxy on a human-accessible number of points is better than computing the wrong proxy accurately. This benefit occurs across all datasets and models, but it takes more samples for neural networks on Covertype than the others.\n\nFigure 2: Determining interpretability on a few points is better than using the wrong proxy.\n\nOur experiments include two classes of models: decision trees and neural networks. We train decision\ntrees for the simpler synthetic, mushroom and census datasets and neural networks for the more\ncomplex covertype dataset. Details of our model training procedure (that is, identifying models with\nhigh predictive accuracy) are in Appendix B. 
The covertype dataset, because it is modeled by a neural\nnetwork, also needs a strategy for producing local explanations; we describe our parameter choices\nas well as provide a detailed sensitivity analysis to these choices in Appendix C.\n\nProxies for Interpretability An important question is whether currently used proxies for inter-\npretability, such as sparsity or number of nodes in a path, correspond to some HIS. In the following\nwe will use four different interpretability proxies to demonstrate the ability of our pipeline to identify\nmodels that are best under these different proxies, simulating the case where we have a ground truth\nmeasure of HIS. We show that (a) different proxies favor different models and (b) how these proxies\ncorrespond to the results of our user studies.\nThe interpretability proxies we will use are: mean path length, mean number of distinct features in a\npath, number of nodes, and number of nonzero features. The \ufb01rst two are local to a speci\ufb01c input\nx while the last two are global model properties (although these will be properties of local proxy\nmodels for neural networks). These proxies include notions of tree depth [23] and sparsity [15, 20].\nWe compute the proxies based on a sample of 1, 000 points from the validation set (the same set of\npoints is used across models).\n\nHuman Experiments\nIn our human subjects experiments, we quantify HIS(x, M ) for a data point\nx and a model M as a function of the time it takes a user to simulate the label for x with M. We\nextend this to the locally interpretable case by simulating the label according to local-proxy(x, M ).\nWe refer to the model itself as the explanation in the globally interpretable case, and the local\nmodel as the explanation in the locally interpretable case. Our experiments are closely based on\nthose in Narayanan et al. [19]. 
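
As a hedged illustration of how response times turn into the score used in these experiments (Equation 3), the sketch below computes HIS for each evaluated point and averages the scores into an unnormalized estimate of p(M). The function names `his` and `prior_estimate`, the 60-second cap, and the example timings are hypothetical and are not taken from the study.

```python
def his(response_times, max_rt):
    # Equation 3: zero when the mean response time exceeds the cap,
    # otherwise the cap minus the mean response time.
    mean_rt = sum(response_times) / len(response_times)
    return 0.0 if mean_rt > max_rt else max_rt - mean_rt


def prior_estimate(times_by_point, max_rt):
    # Sample average of HIS over the evaluated points: an unnormalized
    # estimate of p(M) in the spirit of the sampled form of Equation 2.
    scores = [his(times, max_rt) for times in times_by_point]
    return sum(scores) / len(scores)


# Hypothetical response times (seconds) from two users on two data points.
times = [[10.0, 20.0], [70.0, 90.0]]
score_fast = his(times[0], max_rt=60.0)
score_slow = his(times[1], max_rt=60.0)
estimate = prior_estimate(times, max_rt=60.0)
```

A model whose explanations are simulated quickly gets a larger estimate, so maximizing this quantity is the same as minimizing capped response time.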
We provide users with a list of feature values for features used in\nthe explanation and a graphical depiction of the explanation, and ask them to identify the correct\nprediction. Figure 3a in Appendix D depicts our interface. These experiments were reviewed and\napproved by our institution\u2019s IRB. Details of the experiments we conducted with machine learning\nresearchers and details and results of a pilot study [not used in this paper] conducted using Amazon\nMechanical Turk are in Appendix D.\n\n[Figure 2 plots. Panel (a), \u2018If I Optimize for the Wrong Proxy, How Bad Will my Choice Be?\u2019: best rank by wrong proxy, with a bound, for the Synthetic, Mushroom, Census and Covertype datasets. Panel (b), \u2018How Does Subsampling Proxies Affect Rankings?\u2019: best rank by sampled proxy (95th percentile, 1000 samples) versus number of samples for Census, Mushroom and Covertype; legend: Avg. Path Length, Num. N.Z. Feats., Num. Nodes, Avg. Path Feats.]\n\nFigure 3: We ran random restarts of the pipeline with all datasets and proxies, denoted \u2018opt\u2019 (randomness\nfrom choice of start), and compared to randomly sampling the same number of models, denoted\n\u2018rd\u2019 (we account for models with the same score by computing the lowest rank of any model with that\nscore). \u2018NZ\u2019 denotes non-zero and \u2018feats\u2019 denotes features. The fact that the solid lines stay below\nthe corresponding dotted lines indicates that we do better than random guessing.\n\n6 Experimental Results\n\nOptimizing different automatic proxies results in different models. For each dataset, we run\nsimulations to test what happens when the optimized measure of interpretability does not match the\ntrue HIS. We do this by computing the best model by one proxy (our simulated HIS), then identifying\nwhat rank it would have had among the collection of models if one of the other proxies (our optimized\ninterpretability measure) had been used. 
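
The rank computation just described can be sketched as follows. This is an illustrative reading: lower proxy scores are assumed to mean more interpretable, ties are resolved by taking the lowest rank among equally scored models, and the function name and the four example models are hypothetical.

```python
def rank_under_other_proxy(true_scores, optimized_scores):
    # Best model according to the simulated HIS (lower score is better).
    best = min(range(len(true_scores)), key=lambda i: true_scores[i])
    # Its rank under the proxy being optimized: the number of models that
    # this proxy scores strictly better, so ties share the lowest rank.
    return sum(1 for s in optimized_scores if s < optimized_scores[best])


# Hypothetical proxy scores for four candidate models.
mean_path_length = [3.0, 2.0, 4.0, 5.0]
num_nodes = [7, 15, 5, 9]

same = rank_under_other_proxy(mean_path_length, mean_path_length)
mismatch = rank_under_other_proxy(mean_path_length, num_nodes)
reverse = rank_under_other_proxy(num_nodes, mean_path_length)
```

Here the model with the shortest paths also has the most nodes, so optimizing node count would place it last among the four.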
A rank of 0 indicates that the model identi\ufb01ed as the best by\none proxy is the same as the best model for the second proxy; more generally a rank of r indicates\nthat the best model by one proxy is the rth-best model under the second proxy. Figure 2a shows that\nchoosing the wrong proxy can seriously mis-rank the true best model. This suggests that it is not\na good idea to optimize an arbitrary proxy for interpretability in the hopes that the resulting model\nwill be interpretable according to the truly relevant measure. Figure 2a also shows that the synthetic\ndataset has a very different distribution of proxy mis-rankings than any of the real datasets in our\nexperiments. This suggests that it is hard to design synthetic datasets that capture the relevant notions\nof interpretability since, by assumption, we do not know what these are.\nComputing the right proxy on a small sample of data points is better than computing the\nwrong proxy. For each dataset, we run simulations to test what happens when we optimize the\ntrue HIS computed on only a small sample of points\u2013the size limitation comes from limited human\ncognitive capacity. As in the previous experiment, we compute the best model by one proxy\u2013our\nsimulated HIS. We then identify what rank it would have had among the collection of models if the\nsame proxy had been computed on a small sample of data points. Figure 2 shows that computing the\nright proxy on a small sample of data points can do better than computing the wrong proxy. This\nholds across datasets and models. 
This suggests that it may be better to find interpretable models by\nasking people to examine the interpretability of a small number of examples (which will result in\nnoisy measurements of the true quantity of interest) rather than by accurately optimizing a proxy\nthat does not capture the quantity of interest.\nOur model-based optimization approach can learn human-interpretable models that correspond\nto a variety of different proxies on globally and locally interpretable models. We run our\npipeline 100 times for 10 iterations with each proxy as the signal (the randomness comes from the\nchoice of starting point), and compare to 1,000 random draws of 10 models. We account for multiple\nmodels with the same score by computing the lowest rank for any model with the same score as the\nmodel we sample. Figure 3 shows that across all three datasets, and across all four proxies, we do\nbetter than randomly sampling models to evaluate.\n\n[Figure 3 plots, \u2018Can We Optimize Proxies Better than Random Sampling?\u2019: rank of best model so far versus number of evaluations for Census, Mushroom and Covertype; legend: Opt and Rd variants of Avg Path Len, Avg Path Feats, Num Nodes and Num NZ Feats.]\n\nOur pipeline finds models with lower response times and lower scores across all four proxies\nwhen we run it with human feedback. We run our pipeline for 10 iterations on the census and\nmushroom datasets with human response time as the signal. We recruited a group of machine\nlearning researchers who took all quizzes in a single run of the pipeline, with models iteratively\nchosen from our model-based optimization. Figure 4a shows the distributions of mean response times\ndecreasing as we evaluate more models. (In Figure 3b in Appendix D we demonstrate that increases\nin speed from repeatedly doing the task are small compared to the differences we see in Figure 4a;\nthese are real improvements in response time.)\n\n(a) We computed response times for each iteration of the pipeline on two datasets. Each data point is the mean response time for a single user. In both experiments, we see the mean response times decrease as we evaluate more models. We reach times comparable to those of the best proxy models. The last 2 models are our baselines (\u2018NZ feats\u2019 denotes non-zero features).\n\n(b) We computed the proxy scores for the model evaluated at each iteration of the pipeline. 
On the mushroom dataset, our approach converges to models with the fewest nodes and shortest paths, and on the census dataset, it converges to models with the fewest features. \u2018Mush\u2019 denotes the mushroom dataset and \u2018Cens\u2019 denotes the census dataset.\n\nFigure 4: Human subjects pipeline results show a trend towards interpretability.\n\nOn different datasets, our pipeline converges to different proxies. In the human subjects experiments\nabove, we tracked the proxy scores of each model we evaluated. Figure 4b shows a decrease in\nproxy scores that corresponds to the decrease in response times in Figure 4a (our approach did not\nhave access to these proxy scores). On the mushroom dataset, our approach converged to a model\nwith the fewest nodes and the shortest paths, while on the census dataset, it converged to a model\nwith the fewest features. This suggests that, for different datasets, different notions of interpretability\nare important to users.\n\n7 Discussion and Conclusion\n\nWe presented an approach to efficiently optimize models for human-interpretability (alongside\nprediction) by directly including humans in the optimization loop. Our experiments showed that,\nacross several datasets, several reasonable proxies for interpretability identify different models as the\nmost interpretable; not all proxies lead to the same solution. 
Our pipeline was able to efficiently\nidentify the model that humans found most expedient for forward simulation. The human-selected\nmodels often corresponded to some known proxy for interpretability, but which proxy varied\nacross datasets, suggesting that the proxies may be a good starting point but are not the full story when it\ncomes to finding human-interpretable models.\nThat said, the direct human-in-the-loop optimization has its challenges. In our initial pilot studies\n[not used in this paper] with Amazon Mechanical Turk (Appendix D), we found that the variance\namong subjects was simply too large to make the optimization cost-effective (especially with the\nbetween-subjects model that makes sense for Amazon Mechanical Turk). In contrast, our smaller but\nlonger within-subjects studies had lower variance with a smaller number of subjects. This observation,\nand the importance of downstream tasks for defining interpretability, suggest that interpretability\nstudies should be conducted with the people who will use the models (whom we can expect to be more\nfamiliar with the task and more patient).\n\n[Figure 4 plots. Panel (a), \u2018Response Times by Pipeline Iteration\u2019: mean response time (sec.) versus number of evaluations for Census and Mushroom, with Path Len and NZ Feats baselines. Panel (b), \u2018Proxy Scores by Pipeline Iteration\u2019: proxy score versus evaluation number for mean path length, mean path features, number of tree nodes and number of tree features; legend: Cens Pipe, Cens Best, Mush Pipe, Mush Best.]\n\nThe many exciting directions for future work include exploring ways to efficiently allocate the human\ncomputation to minimize the variance of our estimates p(M) by intelligently choosing which inputs\nx to evaluate and structuring these long, sequential experiments to be more engaging; and further\nrefining our model kernels to capture more nuanced notions of human-interpretability, particularly\nacross model classes. 
Optimizing models to be human-interpretable will always require user studies,\nbut with intelligent optimization approaches, we can reduce the number of studies required and thus\ncost-effectively identify human-interpretable models.\n\nAcknowledgments\n\nIL acknowledges support from NIH 5T32LM012411-02. All authors acknowledge\nsupport from the Google Faculty Research Award and the Harvard Dean\u2019s Competitive Fund.\nAll authors thank Emily Chen and Jeffrey He for their support with the experimental interface, and\nWeiwei Pan and the Harvard DTaK group for many helpful discussions and insights.\n\nReferences\n\n[1] Eric E. Altendorf, Angelo C. Restificar, and Thomas G. Dietterich. Learning from sparse data by exploiting\nmonotonicity constraints. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial\nIntelligence, UAI\u201905, pages 18\u201326, Arlington, Virginia, United States, 2005. AUAI Press.\n\n[2] Francis R. Bach. Structured sparsity-inducing norms through submodular functions. In J. D. Lafferty, C. K. I.\nWilliams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing\nSystems 23, pages 118\u2013126. Curran Associates, Inc., 2010.\n\n[3] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models\nfor healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM\nSIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721\u20131730. ACM,\n2015.\n\n[4] Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statist. Sci.,\n10(3):273\u2013304, 08 1995.\n\n[5] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement\nlearning from human preferences. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,\nS. Vishwanathan, and R. 
Garnett, editors, Advances in Neural Information Processing Systems 30, pages\n4299\u20134307. Curran Associates, Inc., 2017.\n\n[6] Wei Chu, S. S. Keerthi, and Chong Jin Ong. Bayesian support vector regression using a unified loss function.\nIEEE Transactions on Neural Networks, 15(1):29\u201344, Jan 2004.\n\n[7] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.\n\n[8] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv,\n2017.\n\n[9] Alex A. Freitas. Comprehensible classification models: A position paper. SIGKDD Explor. Newsl.,\n15(1):1\u201310, March 2014.\n\n[10] Geoffrey Hinton. A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926, 2010.\n\n[11] Been Kim, Cynthia Rudin, and Julie A Shah. The Bayesian case model: A generative approach for\ncase-based reasoning and prototype classification. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence,\nand K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1952\u20131960.\nCurran Associates, Inc., 2014.\n\n[12] Alex Kosorukoff. Human based genetic algorithm. In Proceedings of the IEEE International Conference\non Systems, Man and Cybernetics, volume 5, 05 2001.\n\n[13] Himabindu Lakkaraju, Stephen H. Bach, and Jure Leskovec. Interpretable decision sets: A joint framework\nfor description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on\nKnowledge Discovery and Data Mining, KDD \u201916, pages 1675\u20131684, New York, NY, USA, 2016. ACM.\n\n[14] Nada Lavrac. Selected techniques for data mining in medicine. Artificial Intelligence in Medicine,\n16(1):3\u201323, 1999. Data Mining Techniques and Applications in Medicine.\n\n[15] Zachary Chase Lipton. The mythos of model interpretability. CoRR, abs/1606.03490, 2016.\n\n[16] Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. 
Turkit: Human computation algorithms\non Mechanical Turk. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and\nTechnology, UIST \u201910, pages 57\u201366, New York, NY, USA, 2010. ACM.\n\n[17] Yifei Ma, Roman Garnett, and Jeff G. Schneider. Submodularity in batch active learning and survey\nproblems on Gaussian random fields. CoRR, abs/1209.3694, 2012.\n\n[18] Muhammad A. Masood and Finale Doshi-Velez. A particle-based variational approach to Bayesian\nnon-negative matrix factorization. arXiv, 2018.\n\n[19] Menaka Narayanan, Emily Chen, Jeffrey He, Been Kim, Sam Gershman, and Finale Doshi-Velez. How\ndo Humans Understand Explanations from Machine Learning Systems? An Evaluation of the\nHuman-Interpretability of Explanation. ArXiv e-prints, February 2018.\n\n[20] Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wortman Vaughan, and\nHanna M. Wallach. Manipulating and measuring model interpretability. CoRR, abs/1802.07810, 2018.\n\n[21] Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, 2006.\n\n[22] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. \"Why should I trust you?\": Explaining the\npredictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on\nKnowledge Discovery and Data Mining, KDD \u201916, pages 1135\u20131144, New York, NY, USA, 2016. ACM.\n\n[23] Lior Rokach and Oded Maimon. Introduction to Decision Trees, chapter 1, pages 1\u201316. World\nScientific, 2nd edition, 2014.\n\n[24] Andrew Ross, Isaac Lage, and Finale Doshi-Velez. The neural lasso: Local linear sparsity for interpretable\nexplanations. In Workshop on Transparent and Interpretable Machine Learning in Safety Critical Environments,\n31st Conference on Neural Information Processing Systems, 2017. https://goo.gl/TwRhXo.\n\n[25] Andrew Ross, Weiwei Pan, and Finale Doshi-Velez. 
Learning qualitatively diverse and interpretable rules\nfor classification. In 2018 ICML Workshop on Human Interpretability in Machine Learning (WHI 2018),\n2018.\n\n[26] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning\nalgorithms. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural\nInformation Processing Systems 25, pages 2951\u20132959. Curran Associates, Inc., 2012.\n\n[27] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian Process Bandits without\nRegret: An Experimental Design Approach. Technical Report arXiv:0912.3995, Dec 2009.\n\n[28] Omer Tamuz, Ce Liu, Serge J. Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the\ncrowd kernel. CoRR, abs/1105.1033, 2011.\n\n[29] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.\nSeries B (Methodological), pages 267\u2013288, 1996.\n\n[30] Berk Ustun and Cynthia Rudin. Optimized risk scores. In Proceedings of the 23rd ACM SIGKDD\nInternational Conference on Knowledge Discovery and Data Mining, KDD \u201917, pages 1125\u20131134, New York,\nNY, USA, 2017. ACM.\n\n[31] Andrew G Wilson, Christoph Dann, Chris Lucas, and Eric P Xing. The human kernel. In C. Cortes, N. D.\nLawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing\nSystems 28, pages 2854\u20132862. Curran Associates, Inc., 2015.\n\n[32] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes F\u00fcrnkranz. A survey of preference-based\nreinforcement learning methods. Journal of Machine Learning Research, 18(136):1\u201346, 2017.\n\n[33] Mike Wu, Michael C. Hughes, Sonali Parbhoo, Maurizio Zazzi, Volker Roth, and Finale Doshi-Velez.\nBeyond Sparsity: Tree Regularization of Deep Models for Interpretability. 
In Proceedings of the Thirty-Second\nAAAI Conference on Artificial Intelligence, 2018.\n\n[34] Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-supervised\nlearning using Gaussian fields and harmonic functions. In ICML 2003 workshop on The Continuum from\nLabeled to Unlabeled Data in Machine Learning and Data Mining, pages 58\u201365, 2003.\n", "award": [], "sourceid": 6531, "authors": [{"given_name": "Isaac", "family_name": "Lage", "institution": "Harvard"}, {"given_name": "Andrew", "family_name": "Ross", "institution": "Harvard University"}, {"given_name": "Samuel", "family_name": "Gershman", "institution": "Harvard University"}, {"given_name": "Been", "family_name": "Kim", "institution": "Google"}, {"given_name": "Finale", "family_name": "Doshi-Velez", "institution": "Harvard"}]}