{"title": "The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1952, "page_last": 1960, "abstract": "We present the Bayesian Case Model (BCM), a general framework for Bayesian case-based reasoning (CBR) and prototype classification and clustering. BCM brings the intuitive power of CBR to a Bayesian generative framework. The BCM learns prototypes, the ``quintessential observations that best represent clusters in a dataset, by performing joint inference on cluster labels, prototypes and important features. Simultaneously, BCM pursues sparsity by learning subspaces, the sets of features that play important roles in the characterization of the prototypes. The prototype and subspace representation provides quantitative benefits in interpretability while preserving classification accuracy. Human subject experiments verify statistically significant improvements to participants' understanding when using explanations produced by BCM, compared to those given by prior art.\"", "full_text": "The Bayesian Case Model: A Generative Approach\n\nfor Case-Based Reasoning and Prototype\n\nClassi\ufb01cation\n\nBeen Kim, Cynthia Rudin and Julie Shah\n\nMassachusetts Institute of Technology\n\nCambridge, Massachusetts 02139\n\n{beenkim, rudin, julie a shah}@csail.mit.edu\n\nAbstract\n\nWe present the Bayesian Case Model (BCM), a general framework for Bayesian\ncase-based reasoning (CBR) and prototype classi\ufb01cation and clustering. BCM\nbrings the intuitive power of CBR to a Bayesian generative framework. The BCM\nlearns prototypes, the \u201cquintessential\u201d observations that best represent clusters in\na dataset, by performing joint inference on cluster labels, prototypes and impor-\ntant features. 
Simultaneously, BCM pursues sparsity by learning subspaces, the\nsets of features that play important roles in the characterization of the prototypes.\nThe prototype and subspace representation provides quantitative bene\ufb01ts in inter-\npretability while preserving classi\ufb01cation accuracy. Human subject experiments\nverify statistically signi\ufb01cant improvements to participants\u2019 understanding when\nusing explanations produced by BCM, compared to those given by prior art.\n\n1\n\nIntroduction\n\nPeople like to look at examples. Through advertising, marketers present examples of people we\nmight want to emulate in order to lure us into making a purchase. We might ignore recommendations\nmade by Amazon.com and look instead at an Amazon customer\u2019s Listmania to \ufb01nd an example of a\ncustomer like us. We might ignore medical guidelines computed from a large number of patients in\nfavor of medical blogs where we can get examples of individual patients\u2019 experiences.\nNumerous studies have demonstrated that exemplar-based reasoning, involving various forms of\nmatching and prototyping, is fundamental to our most effective strategies for tactical decision-\nmaking ([26, 9, 21]). For example, naturalistic studies have shown that skilled decision makers\nin the \ufb01re service use recognition-primed decision making, in which new situations are matched to\ntypical cases where certain actions are appropriate and usually successful [21]. To assist humans in\nleveraging large data sources to make better decisions, we desire that machine learning algorithms\nprovide output in forms that are easily incorporated into the human decision-making process.\nStudies of human decision-making and cognition provided the key inspiration for arti\ufb01cial intelli-\ngence Case-Based Reasoning (CBR) approaches [2, 28]. CBR relies on the idea that a new situation\ncan be well-represented by the summarized experience of previously solved problems [28]. 
CBR\nhas been used in important real-world applications [24, 4], but is fundamentally limited, in that it\ndoes not learn the underlying complex structure of data in an unsupervised fashion and may not\nscale to datasets with high-dimensional feature spaces (as discussed in [29]).\nIn this work, we introduce a new Bayesian model, called the Bayesian Case Model (BCM), for\nprototype clustering and subspace learning. In this model, the prototype is the exemplar that is most\nrepresentative of the cluster. The subspace representation is a powerful output of the model because\nwe neither need nor want the best exemplar to be similar to the current situation in all possible ways:\nfor instance, a moviegoer who likes the same horror \ufb01lms as we do might be useful for identifying\ngood horror \ufb01lms, regardless of their cartoon preferences. We model the underlying data using a\nmixture model, and infer sets of features that are important within each cluster (i.e., subspace). This\ntype of model can help to bridge the gap between machine learning methods and humans, who use\nexamples as a fundamental part of their decision-making strategies.\nWe show that BCM produces prediction accuracy comparable to or better than prior art for standard\ndatasets. We also verify through human subject experiments that the prototypes and subspaces\npresent as meaningful feedback for the characterization of important aspects of a dataset. In these\nexperiments, the exemplar-based output of BCM resulted in statistically signi\ufb01cant improvements\nto participants\u2019 performance of a task requiring an understanding of clusters within a dataset, as\ncompared to outputs produced by prior art.\n\n2 Background and Related Work\n\nPeople organize and interpret information through exemplar-based reasoning, particularly when they\nare solving problems ([26, 7, 9, 21]). 
AI Case-Based Reasoning approaches are motivated by\nthis insight, and provide example cases along with the machine-learned solution. Studies show\nthat example cases signi\ufb01cantly improve user con\ufb01dence in the resulting solutions, as compared to\nproviding the solution alone or by also displaying a rule that was used to \ufb01nd the solution [11].\nHowever, CBR requires solutions (i.e., labels) for previous cases, and does not learn the underlying\nstructure of the data in an unsupervised fashion. Maintaining transparency in complex situations\nalso remains a challenge [29]. CBR models designed explicitly to produce explanations [1] rely on\nthe backward chaining of the causal relation from a solution, which does not scale as complexity\nincreases. The cognitive load of the user also increases with the complexity of the similarity measure\nused for comparing cases [14]. Other CBR models for explanations require the model to be manually\ncrafted in advance by experts [25].\nAlternatively, the mixture model is a powerful tool for discovering cluster distributions in an un-\nsupervised fashion. However, this approach does not provide intuitive explanations for the learned\nclusters (as pointed out in [8]). Sparse topic models are designed to improve interpretability by re-\nducing the number of words per topic [32, 13]. However, using the number of features as a proxy for\ninterpretability is problematic, as sparsity is often not a good or complete measure of interpretability\n[14]. Explanations produced by mixture models are typically presented as distributions over fea-\ntures. 
Even users with technical expertise in machine learning may have a dif\ufb01cult time interpreting\nsuch output, especially when the cluster is distributed over a large number of features [14].\nOur approach, the Bayesian Case Model (BCM), simultaneously performs unsupervised clustering\nand learns both the most representative cases (i.e., prototypes) and important features (i.e., sub-\nspaces). BCM preserves the power of CBR in generating interpretable output, where interpretability\ncomes not only from sparsity but from the prototype exemplars.\nIn our view, there are at least three widely known types of interpretable models: sparse linear\nclassi\ufb01ers ([30, 8, 31]); discretization methods, such as decision trees and decision lists (e.g.,\n[12, 32, 13, 23, 15]); and prototype- or case-based classi\ufb01ers (e.g., nearest neighbors [10] or a super-\nvised optimization-based method [5]). (See [14] for a review of interpretable classi\ufb01cation.) BCM is\nintended as the third model type, but uses unsupervised generative mechanisms to explain clusters,\nrather than supervised approaches [16] or by focusing myopically on neighboring points [3].\n\n3 The Bayesian Case Model\n\nIntuitively, BCM generates each observation using the important pieces of related prototypes. The\nmodel might generate a movie pro\ufb01le made of the horror movies from a quintessential horror movie\nwatcher, and action movies from a quintessential action moviegoer.\nBCM begins with a standard discrete mixture model [18, 6] to represent the underlying structure\nof the observations. It augments the standard mixture model with prototypes and subspace feature\nindicators that characterize the clusters. We show in Section 4.2 that prototypes and subspace feature\nindicators improve human interpretability as compared to the standard mixture model output. 
The graphical model for BCM is depicted in Figure 1.\n\nFigure 1: Graphical model for the Bayesian Case Model\n\nWe start with $N$ observations, denoted by $x = \{x_1, x_2, \ldots, x_N\}$, with each $x_i$ represented as a random mixture over clusters. There are $S$ clusters, where $S$ is assumed to be known in advance. (This assumption can easily be relaxed through extension to a non-parametric mixture model.) Vector $\pi_i \in \mathbb{R}^S_+$ holds the mixture weights over these clusters for the $i$th observation $x_i$. Each observation has $P$ features, and we denote the $j$th feature of the $i$th observation as $x_{ij}$. Each feature $j$ of the observation $x_i$ comes from one of the clusters; the index of the cluster for $x_{ij}$ is denoted by $z_{ij}$, and the full set of cluster assignments for observation-feature pairs is denoted by $z$. Each $z_{ij}$ takes on the value of a cluster index between 1 and $S$. Hyperparameters $q$, $\lambda$, $c$, and $\alpha$ are assumed to be fixed.\nThe explanatory power of BCM results from how the clusters are characterized. While a standard mixture model assumes that each cluster takes the form of a predefined parametric distribution (e.g., normal), BCM characterizes each cluster by a prototype, $p_s$, and a subspace feature indicator, $\omega_s$. Intuitively, the subspace feature indicator selects only a few features that play an important role in identifying the cluster and prototype (hence, BCM clusters are subspace clusters). We intuitively define these latent variables below.\nPrototype, $p_s$: The prototype $p_s$ for cluster $s$ is defined as one observation in $x$ that maximizes $p(p_s \mid \omega_s, z, x)$, with the probability density and $\omega_s$ as defined below. Our notation for element $j$ of $p_s$ is $p_{sj}$. Since $p_s$ is a prototype, it is equal to one of the observations, so $p_{sj} = x_{ij}$ for some $i$. Note that more than one maximum may exist per cluster; in this case, one prototype is arbitrarily chosen. 
Intuitively, the prototype is the \u201cquintessential\u201d observation that best represents the cluster.\nSubspace feature indicator $\omega_s$: Intuitively, $\omega_s$ \u2018turns on\u2019 the features that are important for characterizing cluster $s$ and selecting the prototype, $p_s$. Here, $\omega_s \in \{0, 1\}^P$ is an indicator variable that is 1 on the subset of features that maximizes $p(\omega_s \mid p_s, z, x)$, with the probability for $\omega_s$ as defined below. That is, $\omega_s$ is a binary vector of size $P$, where each element $\omega_{sj}$ indicates whether or not feature $j$ belongs to subspace $s$.\nThe generative process for BCM is as follows: First, we generate the subspace clusters. A subspace cluster can be fully described by three components: 1) a prototype, $p_s$, generated by sampling uniformly over all observations, $1 \ldots N$; 2) a feature indicator vector, $\omega_s$, that indicates important features for that subspace cluster, where each element of the feature indicator ($\omega_{sj}$) is generated according to a Bernoulli distribution with hyperparameter $q$; and 3) the distribution of feature outcomes for each feature, $\phi_s$, for subspace $s$, which we now describe.\nDistribution of feature outcomes $\phi_s$ for cluster $s$: Here, $\phi_s$ is a data structure wherein each \u201crow\u201d $\phi_{sj}$ is a discrete probability distribution of possible outcomes for feature $j$. Explicitly, $\phi_{sj}$ is a vector of length $V_j$, where $V_j$ is the number of possible outcomes of feature $j$. Let us define $\Theta$ as a vector of the possible outcomes of feature $j$ (e.g., for feature \u2018color\u2019, $\Theta = [\mathrm{red}, \mathrm{blue}, \mathrm{yellow}]$), where $\Theta_v$ represents a particular outcome for that feature (e.g., $\Theta_v = \mathrm{blue}$). We will generate $\phi_s$ so that it mostly takes outcomes from the prototype $p_s$ for the important dimensions of the cluster. 
We do this by considering the vector $g$, indexed by possible outcomes $v$, as follows:\n\n$g_{p_{sj}, \omega_{sj}, \lambda}(v) = \lambda\,(1 + c\,\mathbb{1}[\omega_{sj} = 1 \text{ and } p_{sj} = \Theta_v]),$\n\nwhere $c$ and $\lambda$ are constant hyperparameters that indicate how much we will copy the prototype in order to generate the observations. The distribution of feature outcomes will be determined by $g$ through $\phi_{sj} \sim \mathrm{Dirichlet}(g_{p_{sj}, \omega_{sj}, \lambda})$. To explain at an intuitive level: First, consider the irrelevant dimensions $j$ in subspace $s$, which have $\omega_{sj} = 0$. In that case, $\phi_{sj}$ will look like a uniform distribution over all possible outcomes for feature $j$; the feature values for the unimportant dimensions are generated arbitrarily according to the prior. Next, consider relevant dimensions, where $\omega_{sj} = 1$. In this case, $\phi_{sj}$ will generally take on a larger value $\lambda(1 + c)$ for the feature value that prototype $p_s$ has on feature $j$, which is called $\Theta_v$. All of the other possible outcomes are taken with lower probability $\lambda$. As a result, we will be more likely to select the outcome $\Theta_v$ that agrees with the prototype $p_s$. In the extreme case where $c$ is very large, we can copy the cluster\u2019s prototype directly within the cluster\u2019s relevant subspace and assign the rest of the feature values randomly.\nAn observation is then a mix of different prototypes, wherein we take the most important pieces of each prototype. To do this, mixture weights $\pi_i$ are generated according to a Dirichlet distribution, parameterized by hyperparameter $\alpha$. From there, to select a cluster and obtain the cluster index $z_{ij}$ for each $x_{ij}$, we sample from a multinomial distribution with parameters $\pi_i$. Finally, each feature for an observation, $x_{ij}$, is sampled from the feature distribution of the assigned subspace cluster ($\phi_{z_{ij}}$). (Note that Latent Dirichlet Allocation (LDA) [6] also begins with a standard mixture model, though our feature values exist in a discrete set that is not necessarily binary.) 
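The generative process just described can be sketched in code. Below is a minimal, illustrative Python simulation of BCM's prior, not the authors' implementation; the function names, toy sizes, and the simplification that every feature shares a single outcome set of size V are our own assumptions:

```python
import random

def sample_dirichlet(alphas):
    # Dirichlet draw via normalized Gamma samples (standard library only).
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_from_bcm_prior(x_pool, V, S=2, q=0.5, lam=1.0, c=50.0, alpha=0.1, N=10):
    """Sample subspace clusters and observations from BCM's prior (Section 3).

    x_pool: observations (lists of feature values in {0..V-1}) from which
    prototypes are drawn uniformly. Hyperparameters follow the text: q controls
    subspace sparsity, lam and c control how strongly phi copies the prototype,
    alpha is the mixture concentration.
    """
    P = len(x_pool[0])
    clusters = []
    for _ in range(S):
        p_s = random.choice(x_pool)                                # p_s ~ Uniform over observations
        w_s = [1 if random.random() < q else 0 for _ in range(P)]  # omega_sj ~ Bernoulli(q)
        phi_s = []
        for j in range(P):
            # g(v) = lam * (1 + c * 1[omega_sj = 1 and p_sj = Theta_v])
            g = [lam * (1 + c * (w_s[j] == 1 and p_s[j] == v)) for v in range(V)]
            phi_s.append(sample_dirichlet(g))                      # phi_sj ~ Dirichlet(g)
        clusters.append((p_s, w_s, phi_s))
    observations = []
    for _ in range(N):
        pi_i = sample_dirichlet([alpha] * S)                       # pi_i ~ Dirichlet(alpha)
        x_i = []
        for j in range(P):
            z_ij = random.choices(range(S), weights=pi_i)[0]       # z_ij ~ Multinomial(pi_i)
            x_i.append(random.choices(range(V), weights=clusters[z_ij][2][j])[0])
        observations.append(x_i)
    return clusters, observations
```

With a large c, generated observations mostly copy their prototypes on the subspace dimensions, mirroring the intuition in the text.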
Here is the full model, with hyperparameters $c$, $\lambda$, $q$, and $\alpha$:\n\n$p_s \sim \mathrm{Uniform}(1, N) \;\; \forall s$\n$\omega_{sj} \sim \mathrm{Bernoulli}(q) \;\; \forall s, j$\n$\phi_{sj} \sim \mathrm{Dirichlet}(g_{p_{sj}, \omega_{sj}, \lambda}) \;\; \forall s, j, \quad \text{where } g_{p_{sj}, \omega_{sj}, \lambda}(v) = \lambda\,(1 + c\,\mathbb{1}[\omega_{sj} = 1 \text{ and } p_{sj} = \Theta_v])$\n$\pi_i \sim \mathrm{Dirichlet}(\alpha) \;\; \forall i$\n$z_{ij} \sim \mathrm{Multinomial}(\pi_i) \;\; \forall i, j$\n$x_{ij} \sim \mathrm{Multinomial}(\phi_{z_{ij} j}) \;\; \forall i, j.$\n\nOur model can be readily extended to different similarity measures, such as standard kernel methods or domain-specific similarity measures, by modifying the function $g$. For example, we can use the least squares loss, i.e., for fixed threshold $\epsilon$, $g_{p_{sj}, \omega_{sj}, \lambda}(v) = \lambda\,(1 + c\,\mathbb{1}[\omega_{sj} = 1 \text{ and } (p_{sj} - \Theta_v)^2 \le \epsilon])$; or, more generally, $g_{p_{sj}, \omega_{sj}, \lambda}(v) = \lambda\,(1 + c\,\mathbb{1}[\omega_{sj} = 1 \text{ and } \ell(p_{sj}, \Theta_v) \le \epsilon])$.\nIn terms of setting hyperparameters, there is a natural setting for $\alpha$ (all entries being 1). This means that there are three real-valued parameters to set, which can be done through cross-validation, another layer of hierarchy with more diffuse hyperparameters, or plain intuition. To use BCM for classification, vector $\pi_i$ is used as $S$ features for a classifier, such as SVM.\n\n3.1 Motivating example\n\nThis section provides an illustrative example for prototypes, subspace feature indicators and subspace clusters, using a dataset composed of a mixture of smiley faces. The feature set for a smiley face is composed of types, shapes and colors of eyes and mouths. For the purpose of this example, assume that the ground truth is that there are three clusters, each of which has two features that are important for defining that cluster. In Table 1, we show the first cluster, with a subspace defined by the color (green) and shape (square) of the face; the rest of the features are not important for defining the cluster. For the second cluster, color (orange) and eye shape define the subspace. 
We generated 240 smiley faces from BCM\u2019s prior with $\alpha = 0.1$ for all entries, and $q = 0.5$, $\lambda = 1$ and $c = 50$.\n\nCluster | LDA: top 3 words and probabilities | BCM: prototype and subspaces\n1 | 0.26, 0.23, 0.12 | color (green) and shape (square) are important\n2 | 0.26, 0.24, 0.16 | color (orange) and eye are important\n3 | 0.35, 0.27, 0.15 | eye and mouth are important\n\nTable 1: The mixture of smiley faces for LDA and BCM (the face images shown in the original table are omitted here)\n\nBCM works differently from Latent Dirichlet Allocation (LDA) [6], which presents its output in a very different form. Table 1 depicts the representation of clusters in both LDA (middle column) and BCM (right column). This dataset is particularly simple, and we chose this comparison because the two most important features that both LDA and BCM learn are identical for each cluster. However, LDA does not learn prototypes, and represents information differently. To convey cluster information using LDA (i.e., to define a topic), we must record several probability distributions \u2013 one for each feature. For BCM, we need only to record a prototype (e.g., the green face depicted in the top row, right column of the figure), and state which features were important for that cluster\u2019s subspace (e.g., shape and color). For this reason, BCM is more succinct than LDA with regard to what information must be recorded in order to define the clusters. One could define a \u201cspecial\u201d constrained version of LDA with topics having uniform weights over a subset of features, and with \u201cword\u201d distributions centered around a particular value. 
This would require a similar amount of memory; however, it loses information relative to BCM, which carries a full prototype within it for each cluster.\nA major benefit of BCM over LDA is that the \u201cwords\u201d in each topic (the choice of feature values) are coupled and not assumed to be independent \u2013 correlations can be controlled depending on the choice of parameters. The independence assumption of LDA can be very strong, and this may be crippling for its use in many important applications. Given our example of images, one could easily generate an image with eyes and a nose that cannot physically occur on a single person (perhaps overlapping). BCM can also generate this image, but it would be unlikely, as the model would generally prefer to copy the important features from a prototype.\nBCM performs joint inference on prototypes, subspace feature indicators and cluster labels for observations. This encourages the inference step to achieve solutions where clusters are better represented by prototypes. We will show that this is beneficial in terms of predictive accuracy in Section 4.1. We will also show through an experiment involving human subjects that BCM\u2019s succinct representation is very effective for communicating the characteristics of clusters in Section 4.2.\n\n3.2 Inference: collapsed Gibbs sampling\n\nWe use collapsed Gibbs sampling to perform inference, as this has been observed to converge quickly, particularly in mixture models [17]. We sample $\omega_{sj}$, $z_{ij}$, and $p_s$, where $\phi$ and $\pi$ are integrated out. Note that we can recover $\phi$ by simply counting the number of feature values assigned to each subspace. 
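To make the collapsed update concrete, the conditional for $z_{ij}$ given in Eq. (1) below can be evaluated naively as follows. This is a didactic Python sketch under our own assumptions (counts are recomputed on every call rather than cached, and the denominator of the first ratio is treated as constant in $s$, as it is for sampling purposes); it is not the authors' implementation:

```python
def g_val(p_sj, w_sj, v, lam, c):
    # g_{p_sj, omega_sj, lambda}(v) = lam * (1 + c * 1[omega_sj = 1 and p_sj = Theta_v])
    return lam * (1 + c * (w_sj == 1 and p_sj == v))

def z_conditional(i, j, x, z, protos, omegas, S, V, alpha, lam, c):
    """Unnormalized p(z_ij = s | rest) for every cluster s, following Eq. (1).

    x[i][j]: feature values in {0..V-1}; z[i][j]: current assignments;
    protos[s], omegas[s]: prototype and subspace indicator of cluster s.
    """
    N, P = len(x), len(x[0])
    probs = []
    for s in range(S):
        # n_(s,i,-j,.): features of observation i, excluding j, assigned to s
        n_obs = sum(1 for jj in range(P) if jj != j and z[i][jj] == s)
        first = (alpha / S + n_obs) / (alpha + P - 1)
        # n_(s,.,j,x_ij) and n_(s,.,j,.): counts over the other observations
        n_val = sum(1 for ii in range(N)
                    if ii != i and z[ii][j] == s and x[ii][j] == x[i][j])
        n_tot = sum(1 for ii in range(N) if ii != i and z[ii][j] == s)
        num = g_val(protos[s][j], omegas[s][j], x[i][j], lam, c) + n_val
        den = sum(g_val(protos[s][j], omegas[s][j], v, lam, c) for v in range(V)) + n_tot
        probs.append(first * num / den)
    return probs
```

A Gibbs sweep would then resample each $z_{ij}$ in proportion to the returned values.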
Integrating out $\phi$ and $\pi$ results in the following expression for sampling $z_{ij}$:\n\n$p(z_{ij} = s \mid z_{i,\neg j}, x, p, \omega, \alpha, \lambda) \propto \dfrac{\alpha/S + n_{(s,i,\neg j,\cdot)}}{\alpha + n_{(\cdot,i,\neg j,\cdot)}} \times \dfrac{g_{p_{sj}, \omega_{sj}, \lambda}(x_{ij}) + n_{(s,\cdot,j,x_{ij})}}{\sum_v g_{p_{sj}, \omega_{sj}, \lambda}(v) + n_{(s,\cdot,j,\cdot)}},$ (1)\n\nwhere $n_{(s,i,j,v)} = \mathbb{1}(z_{ij} = s, x_{ij} = v)$. In other words, if $x_{ij}$ takes feature value $v$ for feature $j$ and is assigned to cluster $s$, then $n_{(s,i,j,v)} = 1$, or 0 otherwise. Notation $n_{(s,\cdot,j,v)}$ is the number of times that the $j$th feature of an observation takes feature value $v$ and that observation is assigned to subspace cluster $s$ (i.e., $n_{(s,\cdot,j,v)} = \sum_i \mathbb{1}(z_{ij} = s, x_{ij} = v)$). Notation $n_{(s,\cdot,j,\cdot)}$ means sum over $i$ and $v$. We use $n_{(s,i,\neg j,v)}$ to denote a count that does not include feature $j$. The derivation is similar to the standard collapsed Gibbs sampling for LDA mixture models [17].\nSimilarly, integrating out $\phi$ results in the following expression for sampling $\omega_{sj}$:\n\n$p(\omega_{sj} = b \mid q, p_{sj}, \lambda, x, z, \alpha) \propto \begin{cases} q \times \dfrac{B(g_{p_{sj}, 1, \lambda} + n_{(s,\cdot,j,\cdot)})}{B(g_{p_{sj}, 1, \lambda})} & b = 1 \\ (1 - q) \times \dfrac{B(g_{p_{sj}, 0, \lambda} + n_{(s,\cdot,j,\cdot)})}{B(g_{p_{sj}, 0, \lambda})} & b = 0, \end{cases}$ (2)\n\nwhere $B$ is the Beta function, which arises from integrating out the $\phi$ variables, which are sampled from Dirichlet distributions.\n\n4 Results\n\nIn this section, we show that BCM produces prediction accuracy comparable to or better than LDA for standard datasets. We also verify the interpretability of BCM through human subject experiments involving a task that requires an understanding of clusters within a dataset. We show statistically\n\n(a) Accuracy and standard deviation with SVM (b) Unsupervised accuracy for BCM (c) Sensitivity analysis for BCM\n\nFigure 2: Prediction test accuracy reported for the Handwritten Digit [19] and 20 Newsgroups\ndatasets [22]. 
(a) applies SVM for both LDA and BCM, (b) presents the unsupervised accuracy of BCM for Handwritten Digit (top) and 20 Newsgroups (bottom) and (c) depicts the sensitivity analysis conducted for hyperparameters for the Handwritten Digit dataset. Datasets were produced by randomly sampling 10 to 70 observations of each digit for the Handwritten Digit dataset, and 100-450 documents per document class for the 20 Newsgroups dataset. The Handwritten Digit pixel values (range from 0 to 255) were rescaled into seven bins (range from 0 to 6). Each 16-by-16 pixel picture was represented as a 1D vector of pixel values, with a length of 256. Both BCM and LDA were randomly initialized with the same seed (one half of the labels were incorrect and randomly mixed). The number of iterations was set at 1,000. $S = 4$ for 20 Newsgroups and $S = 10$ for Handwritten Digit. $\alpha = 0.01$, $\lambda = 1$, $c = 50$, $q = 0.8$.\n\nsigni\ufb01cant improvements in objective measures of task performance using prototypes produced by\nBCM, compared to output of LDA. Finally, we visually illustrate that the learned prototypes and sub-\nspaces present as meaningful feedback for the characterization of important aspects of the dataset.\n\n4.1 BCM maintains prediction accuracy.\n\nWe show that BCM output produces prediction accuracy comparable to or better than LDA, which\nuses the same mixture model (Section 3) to learn the underlying structure but does not learn ex-\nplanations (i.e., prototypes and subspaces). We validate this through use of two standard datasets:\nHandwritten Digit [19] and 20 Newsgroups [22]. We use the implementation of LDA available from\n[27], which incorporates Gibbs sampling, the same inference technique used for BCM.\nFigure 2a depicts the ratio of correctly assigned cluster labels for BCM and LDA. 
In order to com-\npare the prediction accuracy with LDA, the learned cluster labels are provided as features to a sup-\nport vector machine (SVM) with linear kernel, as is often done in the LDA literature on cluster-\ning [6]. The improved accuracy of BCM over LDA, as depicted in the \ufb01gures, is explained in part\nby the ability of BCM to capture dependencies among features via prototypes, as described in Sec-\ntion 3. We also note that prediction accuracy when using the full 20 Newsgroups dataset acquired\nby LDA (accuracy: 0.68\u00b1 0.01) matches that reported previously for this dataset when using a com-\nbined LDA and SVM approach [33]. Also, LDA accuracy for the full Handwritten Digit dataset\n(accuracy: 0.76 \u00b1 0.017) is comparable to that produced by BCM using the subsampled dataset (70\nsamples per digit, accuracy: 0.77 \u00b1 0.03).\nAs indicated by Figure 2b, BCM achieves high unsupervised clustering accuracy as a function of\niterations. We can compute this measure for BCM because each cluster is characterized by a pro-\ntotype \u2013 a particular data point with a label in the given datasets. (Note that this is not possible for\nLDA.) 
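The unsupervised accuracy in Figure 2b reduces to a few lines of code once the mixture weights and prototypes are in hand. Below is a minimal sketch, assuming each observation inherits the label of the prototype of its highest-weight cluster (the scheme described in the text); all variable names are illustrative:

```python
def prototype_label_accuracy(pi, proto_idx, labels):
    """Fraction of observations whose label matches the label of the
    prototype of their dominant cluster.

    pi[i]: mixture weights over the S clusters for observation i;
    proto_idx[s]: index (into labels) of cluster s's prototype;
    labels[i]: ground-truth label of observation i.
    """
    correct = 0
    for i, weights in enumerate(pi):
        s = max(range(len(weights)), key=lambda k: weights[k])  # dominant cluster
        if labels[proto_idx[s]] == labels[i]:
            correct += 1
    return correct / len(pi)
```

This measure is only meaningful when the prototypes themselves carry labels, which is why, as the text notes, it is not available for LDA.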
We set $\alpha$ to prefer each $\pi_i$ to be sparse, so only one prototype generates each observation, and we use that prototype\u2019s label for the observation. Sensitivity analysis in Figure 2c indicates that the additional parameters introduced to learn prototypes and subspaces (i.e., $q$, $\lambda$ and $c$) are not too sensitive within the range of reasonable choices.\n\nFigure 3: Web-interface for the human subject experiment\n\n4.2 Verifying the interpretability of BCM\n\nWe verified the interpretability of BCM by performing human subject experiments that incorporated a task requiring an understanding of clusters within a dataset. 
This task required each participant\nto assign 16 recipes, described only by a set of required ingredients (recipe names and instructions\nwere withheld), to one cluster representation out of a set of four to six. (This approach is similar\nto those used in prior work to measure comprehensibility [20].) We chose a recipe dataset1 for this\ntask because such a dataset requires clusters to be well-explained in order for subjects to be able to\nperform classi\ufb01cation, but does not require special expertise or training.\nOur experiment incorporated a within-subjects design, which allowed for more powerful statistical\ntesting and mitigated the effects of inter-participant variability. To account for possible learning\neffects, we blocked the BCM and LDA questions and balanced the assignment of participants into\nthe two ordering groups: Half of the subjects were presented with all eight BCM questions \ufb01rst,\nwhile the other half \ufb01rst saw the eight LDA questions. Twenty-four participants (10 females, 14\nmales, average age 27 years) performed the task, answering a total of 384 questions. Subjects were\nencouraged to answer the questions as quickly and accurately as possible, but were instructed to take\na 5-second break every four questions in order to mitigate the potential effects of fatigue.\nCluster representations (i.e., explanations) from LDA were presented as the set of top ingredients\nfor each recipe topic cluster. For BCM we presented the ingredients of the prototype without the\nname of the recipe and without subspaces. 
The number of top ingredients shown for LDA was set to the number of ingredients in the corresponding BCM prototype, and we ran Gibbs sampling for LDA with different initializations until the ground truth clusters were visually identifiable.\nUsing explanations from BCM, the average classification accuracy was 85.9%, which was statistically significantly higher ($\chi^2(1, N = 24) = 12.15$, $p < 0.001$) than that of LDA (71.3%). For both LDA and BCM, each ground truth label was manually coded by two domain experts: the first author and one independent analyst (kappa coefficient: 1). These manually-produced ground truth labels were identical to those that LDA and BCM predicted for each recipe. There was no statistically significant difference between BCM and LDA in the amount of time spent on each question ($t(24) = 0.89$, $p = 0.37$); the overall average was 32 seconds per question, with 3% more time spent on BCM than on LDA. Subjective evaluation using Likert-style questionnaires produced no statistically significant differences between reported preferences for LDA versus BCM. 
Interestingly, this suggests that participants did not have insight into their superior performance using output from BCM versus that from LDA.\n\n1Computer Cooking Contest: http://liris.cnrs.fr/ccc/ccc2014/\n\nPrototype (recipe name): ingredients (subspace ingredients were highlighted in the original figure)\nHerbs and Tomato in Pasta: basil, garlic, Italian seasoning, oil, tomato, pasta, pepper, salt\nGeneric chili recipe: beer, chili powder, cumin, garlic, meat, oil, onion, pepper, salt, tomato\nMicrowave brownies: baking powder, butter, chocolate, chopped pecans, eggs, flour, salt, vanilla\nSpiced-punch: cinnamon stick, orange juice, lemon juice, pineapple juice, sugar, water, whole cloves\n\n(a) Handwritten Digit dataset (images omitted) (b) Recipe dataset\n\nFigure 4: Learned prototypes and subspaces for the Handwritten Digit and Recipe datasets.\n\nOverall, the experiment demonstrated substantial improvement to participants\u2019 classification accuracy when using BCM compared with LDA, with no degradation to other objective or subjective measures of task performance.\n\n4.3 Learning subspaces\n\nFigure 4a illustrates the learned prototypes and subspaces as a function of sampling iterations for the Handwritten Digit dataset. For the later iterations, shown on the right of the figure, the BCM output effectively characterizes the important aspects of the data. In particular, the subspaces learned by BCM are pixels that define the digit for the cluster\u2019s prototype.\nInterestingly, the subspace highlights the absence of writing in certain areas. This makes sense: For example, one can define a \u20187\u2019 by showing the absence of pixels on the left of the image where the loop of a \u20189\u2019 might otherwise appear. 
The pixels located where there is variability among digits of the same cluster are not part of the defining subspace for the cluster.\nBecause we initialized randomly, in early iterations the subspaces tend to identify features common to the observations that were randomly initialized to the cluster. This is because $\omega_s$ assigns higher likelihood to features with the most similar values across observations within a given cluster. For example, most digits \u2018agree\u2019 (i.e., have the same zero pixel value) near the borders; thus, these are the first areas that are refined, as shown in Figure 4a. Over iterations, the third row of Figure 4a shows how BCM learns to separate the digits \u201c3\u201d and \u201c5,\u201d which tend to share many pixel values in similar locations. Note that the sparsity of the subspaces can be customized by hyperparameter $q$.\nNext, we show results for BCM using the Computer Cooking Contest dataset in Figure 4b. Each prototype consists of a set of ingredients for a recipe, and the subspace is a set of important ingredients that define that cluster, highlighted in red boxes. For instance, BCM found a \u201cchili\u201d cluster defined by the subspace \u201cbeer,\u201d \u201cchili powder,\u201d and \u201ctomato.\u201d A recipe called \u201cGeneric Chili Recipe\u201d was chosen as the prototype for the cluster. (Note that beer is indeed a typical ingredient in chili recipes.)\n\n5 Conclusion\n\nThe Bayesian Case Model provides a generative framework for case-based reasoning and prototype-based modeling. Its clusters come with natural explanations; namely, a prototype (a quintessential exemplar for the cluster) and a set of defining features for that cluster. We showed the quantitative advantages in prediction quality and interpretability resulting from the use of BCM. 
Exemplar-based modeling (nearest neighbors, case-based reasoning) has historical roots dating back to the beginning of artificial intelligence; BCM offers a fresh perspective on this topic, and a new way of thinking about the balance of accuracy and interpretability in predictive modeling.

References

[1] A. Aamodt. A knowledge-intensive, integrated approach to problem solving and sustained learning. Knowledge Engineering and Image Processing Group, University of Trondheim, pages 27–85, 1991.
[2] A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 1994.
[3] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.R. Müller. How to explain individual classification decisions. JMLR, 2010.
[4] I. Bichindaritz and C. Marling. Case-based reasoning in the health sciences: What's next? AI in Medicine, 2006.
[5] J. Bien, R. Tibshirani, et al. Prototype selection for interpretable classification. AOAS, 2011.
[6] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. JMLR, 2003.
[7] J.S. Carroll. Analyzing decision behavior: The magician's audience. Cognitive Processes in Choice and Decision Behavior, 1980.
[8] J. Chang, J.L. Boyd-Graber, S. Gerrish, C. Wang, and D.M. Blei. Reading tea leaves: How humans interpret topic models. In NIPS, 2009.
[9] M.S. Cohen, J.T. Freeman, and S. Wolf. Metarecognition in time-stressed decision making: Recognizing, critiquing, and correcting. Human Factors, 1996.
[10] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967.
[11] P. Cunningham, D. Doyle, and J. Loughrey. An evaluation of the usefulness of case-based explanation. In CBRRD. Springer, 2003.
[12] G. De'ath and K.E. Fabricius. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology, 2000.
[13] J. Eisenstein, A. Ahmed, and E. Xing. Sparse additive generative models of text. In ICML, 2011.
[14] A. Freitas. Comprehensible classification models: a position paper. ACM SIGKDD Explorations, 2014.
[15] S. Goh and C. Rudin. Box drawings for learning with imbalanced data. In KDD, 2014.
[16] A. Graf, O. Bousquet, G. Rätsch, and B. Schölkopf. Prototype classification: Insights from machine learning. Neural Computation, 2009.
[17] T.L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 2004.
[18] T. Hofmann. Probabilistic latent semantic indexing. In ACM SIGIR, 1999.
[19] J.J. Hull. A database for handwritten text recognition research. TPAMI, 1994.
[20] J. Huysmans, K. Dejaeger, C. Mues, J. Vanthienen, and B. Baesens. An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. DSS, 2011.
[21] G.A. Klein. Do decision biases explain too much? HFES, 1989.
[22] K. Lang. Newsweeder: Learning to filter netnews. In ICML, 1995.
[23] B. Letham, C. Rudin, T. McCormick, and D. Madigan. Interpretable classifiers using rules and Bayesian analysis. Technical report, University of Washington, 2014.
[24] H. Li and J. Sun. Ranking-order case-based reasoning for financial distress prediction. KBSI, 2008.
[25] J.W. Murdock, D.W. Aha, and L.A. Breslow. Assessing elaborated hypotheses: An interpretive case-based reasoning approach. In ICCBR. Springer, 2003.
[26] A. Newell and H.A. Simon. Human Problem Solving. Prentice-Hall, Englewood Cliffs, 1972.
[27] X. Phan and C. Nguyen. GibbsLDA++, a C/C++ implementation of latent Dirichlet allocation using Gibbs sampling for parameter estimation and inference, 2013.
[28] S. Slade. Case-based reasoning: A research paradigm. AI Magazine, 1991.
[29] F. Sørmo, J. Cassens, and A. Aamodt. Explanation in case-based reasoning–perspectives and goals. AI Review, 2005.
[30] R. Tibshirani. Regression shrinkage and selection via the lasso. JRSS, 1996.
[31] B. Ustun and C. Rudin. Methods and models for interpretable linear classification. ArXiv, 2014.
[32] S. Williamson, C. Wang, K. Heller, and D. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. 2010.
[33] J. Zhu, A. Ahmed, and E.P. Xing. MedLDA: maximum margin supervised topic models. JMLR, 2012.