{"title": "Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction", "book": "Advances in Neural Information Processing Systems", "page_first": 2260, "page_last": 2268, "abstract": "We present the Mind the Gap Model (MGM), an approach for interpretable feature extraction and selection. By placing interpretability criteria directly into the model, we allow for the model to both optimize parameters related to interpretability and to directly report a global set of distinguishable dimensions to assist with further data exploration and hypothesis generation. MGM extracts distinguishing features on real-world datasets of animal features, recipes ingredients, and disease co-occurrence. It also maintains or improves performance when compared to related approaches. We perform a user study with domain experts to show the MGM's ability to help with dataset exploration.", "full_text": "Mind the Gap: A Generative Approach to\n\nInterpretable Feature Selection and Extraction\n\nBeen Kim\nMassachusetts Institute of Technology\n\nJulie Shah\n\n{beenkim, julie a shah}@csail.mit.edu\n\nCambridge, MA 02139\n\nFinale Doshi-Velez\nHarvard University\n\nCambridge, MA 02138\n\nfinale@seas.harvard.edu\n\nAbstract\n\nWe present the Mind the Gap Model (MGM), an approach for interpretable fea-\nture extraction and selection. By placing interpretability criteria directly into the\nmodel, we allow for the model to both optimize parameters related to interpretabil-\nity and to directly report a global set of distinguishable dimensions to assist with\nfurther data exploration and hypothesis generation. MGM extracts distinguishing\nfeatures on real-world datasets of animal features, recipes ingredients, and dis-\nease co-occurrence. It also maintains or improves performance when compared\nto related approaches. 
We perform a user study with domain experts to show the MGM\u2019s ability to help with dataset exploration.\n\n1 Introduction\n\nNot only are our data growing in volume and dimensionality, but the understanding that we wish to gain from them is increasingly sophisticated. For example, an educator might wish to know what features characterize different clusters of assignments to provide in-class feedback tailored to each student\u2019s needs. A clinical researcher might apply a clustering algorithm to his patient cohort, and then wish to understand what sets of symptoms distinguish clusters to assist in performing a differential diagnosis. More broadly, researchers often perform clustering as a tool for data exploration and hypothesis generation. In these situations, the domain expert\u2019s goal is to understand what features characterize a cluster, and what features distinguish between clusters.\nObjectives such as data exploration present unique challenges and opportunities for problems in unsupervised learning. While in more typical scenarios the discovered latent structures are simply required for some downstream task\u2014such as features for a supervised prediction problem\u2014in data exploration, the model must provide information to a domain expert in a form that they can readily interpret. It is not sufficient to simply list which observations are part of which cluster; one must also be able to explain why the data are partitioned in that particular way. These explanations must necessarily be succinct, as people are limited in the number of cognitive entities that they can process at one time [1].\nThe de facto standard for summarizing clusters (and other latent factor representations) is to list the most probable features of each factor. 
For example, top-N word lists are the de facto standard for presenting topics from topic models [2]; principal component vectors in PCA are usually described by a list of the dimensions with the largest-magnitude values for the components with the largest-magnitude eigenvalues. Sparsity-inducing versions of these models [3, 4, 5, 6] make this goal more explicit by trying to limit the number of non-zero values in each factor. Other works make these descriptions more intuitive by deriving disjunctive normal form (DNF) expressions for each cluster [7] or learning a set of important features and examples that characterizes each cluster [8]. While these approaches might effectively characterize each cluster, they do not provide information about what distinguishes clusters from each other. Understanding these differences is important in many situations\u2014such as when performing a differential diagnosis and computing relative risks [9, 10].\nTechniques that combine variable selection and clustering assist in finding dimensions that distinguish\u2014rather than simply characterize\u2014the clusters [11, 12]. Variable extraction methods, such as PCA, project the data into a smaller number of dimensions and perform clustering there. In contrast, variable selection methods choose a small number of dimensions to retain. Within variable selection approaches, filter methods (e.g. [13, 14, 15]) first select important dimensions and then cluster based on those. Wrapper methods (e.g. [16]) iterate between selecting dimensions and clustering to maximize a clustering objective. Embedded methods (e.g. [17, 18, 19]) combine variable selection and clustering into one objective. All of these approaches identify a small subset of dimensions that can be used to form a clustering that is as good as (or better than) using all the dimensions. 
A primary motivation for identifying this small subset is that one can then accurately cluster future data with many fewer measurements per observation. However, identifying a minimal set of distinguishing dimensions is the opposite of what is required in data exploration and hypothesis generation tasks. Here, the researcher desires a comprehensive set of distinguishing dimensions to better understand the important patterns in the data.\nIn this work, we present a generative approach for discovering a global set of distinguishable dimensions when clustering high-dimensional data. Our goal is to find a comprehensive set of distinguishing dimensions to assist with further data exploration and hypothesis generation, rather than a few dimensions that will distinguish the clusters. We use an embedded approach that incorporates interpretability criteria directly into the model. First, we use a logic-based feature extraction technique to consolidate dimensions into easily-interpreted groups. Second, we define important groups as ones having multi-modal parameter values\u2014that is, groups that have a gap in their parameter values across clusters. By building these human-oriented interpretability criteria directly into the model, we can easily report back what an extracted set of features means (by its logical formula) and what sets of features distinguish one cluster from another, without any post-hoc analysis.\n\n2 Model\nWe consider a data-set {wnd} with N observations and D binary dimensions. Our goal is to decompose these N observations into K clusters while simultaneously returning a comprehensive list of which sets of dimensions d are important for distinguishing between the clusters.\nMGM has two core elements, which perform interpretable feature extraction and selection. 
At the feature extraction stage, features are grouped together by logical formulas, which are easily interpreted by people [20, 21], allowing some dimensionality reduction while maintaining interpretability. Next, we select features for which there is a large separation\u2014or a gap\u2014in parameter values. From personal communication with domain experts across several domains, we observed that separation\u2014rather than simply variation\u2014is often an aspect of interest, as it provides an unambiguous way to discriminate between clusters.\nWe focus on binary-valued data. Our feature extraction step involves consolidating dimensions into groups. We posit that there are an infinite number of groups g, and a multinomial latent variable ld that indicates the group to which dimension d belongs. Each group g is characterized by a latent variable fg which contains the formula associated with the group g. In this work, we only consider the formulas fg = or and fg = and, and constrain each dimension to belong to only one group. Simple Boolean operations like or and and are easy for people to interpret. Requiring each dimension to be part of only one group avoids having to solve a (possibly NP-complete) satisfiability problem as part of the generative procedure.\nFeature selection is performed through a binary latent variable yg which indicates whether each group g is important for distinguishing clusters. If a group is important (yg = 1), then the probability \u03b2gk that group g is present in an observation from cluster k is drawn from a bi-modal distribution (modeled as a mixture of Beta distributions). If the group is unimportant (yg = 0), then the probability \u03b2gk is drawn from a uni-modal distribution. While a uni-modal distribution with high variance can also produce both low and high values for the probability \u03b2gk, it will also produce intermediate values. 
However, draws from the bi-modal distribution will have a clear gap between low and high values. This definition of important dimensions is distinct from the criterion in [17], where parameters for important dimensions were selected from a uni-modal distribution and parameters for unimportant dimensions were shared across all clusters. Figure 1b illustrates this difference.\n\nFigure 1: Graphical model of MGM; cartoon of distinguishing dimensions. (a) Mind the Gap graphical model. (b) Cartoon describing emissions from important dimensions. In our case, we define importance by separability\u2014or a gap\u2014rather than simply variance. Thus, we distinguish panel (1) from (2) and (3), while [17] distinguishes between (2) and (3).\n\nGenerative Model The graphical model for MGM is shown in Figure 1. We assume that there are an infinite number of possible groups g, each with an associated formula fg. Each dimension d belongs to a group g, as indicated by ld. We also posit that there are a set of latent clusters k, each with emission characteristics described below. The latent variable \u03b2gk corresponds to the probability that group g is present in the data, and is drawn from a uni-modal or bi-modal distribution governed by the parameters {\u03b3g, yg, tgk}. Each observation n belongs to exactly one latent cluster k, indicated by zn. The binary variable ing indicates whether group g is present in observation n. 
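To make the gap criterion concrete, the following sketch (illustrative only, not the authors' code; all hyperparameter values are made up) draws per-cluster presence probabilities for an unimportant group from one broad uni-modal Beta, and for an important group from the two-mode Beta mixture described above:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of clusters (illustrative)

# Unimportant group (y_g = 0): beta_gk comes from one broad uni-modal Beta,
# so intermediate values are common and there is no reliable gap.
beta_unimportant = rng.beta(2.0, 2.0, size=K)

# Important group (y_g = 1): each cluster first flips t_gk, then draws
# beta_gk from a low-mode or a high-mode Beta -- a mixture with a gap.
gamma_g = 0.5                     # mixing weight between the two modes
t_gk = rng.random(K) < gamma_g    # t_gk ~ Bernoulli(gamma_g)
beta_important = np.where(
    t_gk,
    rng.beta(10.0, 1.0, size=K),  # high mode: group usually present
    rng.beta(1.0, 10.0, size=K),  # low mode: group usually absent
)

print(np.round(beta_unimportant, 2))
print(np.round(beta_important, 2))
```

Draws for the important group concentrate near 0 or near 1, leaving the middle of the unit interval empty, which is exactly the separability that panel (1) of Figure 1b depicts.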
Finally, the probability that some observation wnd = 1 depends on whether its associated group g (indicated by ld) is present in that observation (indicated by ing) and on the associated formula fg.\nThe complete generative process first involves assigning dimensions d to groups, choosing the formula fg associated with each group, and deciding whether each group g is important:\n\n\u03c0l \u223c DP(\u03b1l), ld \u223c Multinomial(\u03c0l), \u03c0f \u223c Dirichlet(\u03b1f), fg \u223c Multinomial(\u03c0f), yg \u223c Bernoulli(\u03c0g), \u03b3g \u223c Beta(\u03c31, \u03c32)\n\nwhere DP is the Dirichlet process. Thus, there are an infinite number of potential groups; however, given a finite number of dimensions, only a finite number of groups can be present in the data. Next, emission parameters are selected for each cluster k:\n\nIf yg = 0: \u03b2gk \u223c Beta(\u03b1u, \u03b2u). Else: tgk \u223c Bernoulli(\u03b3g); if tgk = 0: \u03b2gk \u223c Beta(\u03b1b, \u03b2b); else: \u03b2gk \u223c Beta(\u03b1t, \u03b2t).\n\nFinally, observations wnd are generated:\n\n\u03c0z \u223c Dirichlet(\u03b1z), zn \u223c Multinomial(\u03c0z), ing \u223c Bernoulli(\u03b2g,zn); if ing = 0: {wnd|ld = g} = 0, else: {wnd|ld = g} \u223c Formulafg\n\nThe above equations indicate that if ing = 0, that is, if group g is not present in the observation, then all wnd in that observation such that ld = g are also absent (i.e. wnd = 0). If the group g is present (ing = 1) and the group formula fg = and, then all the dimensions associated with that group are present (i.e. wnd = 1). 
Finally, if the group g is present (ing = 1) and the group formula fg = or, then we sample the associated wnd from all possible configurations of wnd such that at least one wnd = 1.\n\nFigure 2: Motivating examples with cartoons from three clusters (vacation, student, winter) and the distinguishable dimensions discovered by the MGM.\n\nLet \u03b8 = {yg, \u03b3g, tgk, \u03b2gk, ld, fg, zn, ing} be the set of variables in the MGM. Given a set of observations {wnd}, the posterior over \u03b8 factors as\n\nPr({yg, \u03b3g, tgk, \u03b2gk, ld, fg, zn, ing}|{wnd}) = [\prod_g p(yg|\u03c1) p(\u03b3g|\u03c3) p(fg|\u03b1) \prod_k p(tgk|\u03b3g) p(\u03b2gk|tgk, yg)] \u00b7 p(\u03ba|\u03b1) \prod_d p(ld|\u03ba) \u00b7 p(\u03c0|\u03b1) \prod_n p(zn|\u03c0) \u00b7 \prod_n \prod_g p(ing|\u03b2, zn) \u00b7 \prod_n \prod_d p(wnd|ing, f, ld)   (1)\n\nMost of these terms are straightforward to compute given the generative model. The likelihood term p(wnd|ing, f, ld) can be expanded as\n\np(wn\u00b7|ing, f, ld) = \prod_{d,g} 0^{1(ing=1)(1\u2212SAT(g;wn\u00b7,fg,ld))} 1^{1(ing=1)SAT(g;wn\u00b7,fg,ld)} 0^{1(ing=0)1(ld=g)1(wnd=1)} 1^{1(ing=0)1(ld=g)1(wnd=0)}   (2)\n\nwhere we use wn\u00b7 to indicate the vector of measurements associated with observation n. The function SAT(g; wn\u00b7, fg, ld) indicates whether the associated formula fg is satisfied, where fg involves the dimensions d of wn\u00b7 that belong to group g.\nMotivating Example Here we provide an example to illustrate the properties of MGM on a synthetic data-set of 400 cartoon faces. Each cartoon face can be described by eight features: earmuffs, scarf, hat, sunglasses, pencil, silly glasses, face color, and mouth shape (see Figure 2). The cartoon faces belong to three clusters. Winter faces tend to have earmuffs and scarves. Student faces tend to have silly glasses and pencils. 
Vacation faces tend to have hats and sunglasses. Face color does not distinguish between the different clusters.\nThe MGM discovers four distinguishing sets of features: the vacation cluster has hat or sunglasses, the winter cluster has earmuffs or scarves or smile, and the student cluster has silly glasses as well as pencils. Face color does not appear because it does not distinguish between the groups. However, we do identify both hats and sunglasses as important, even though only one of those two features is needed to distinguish the vacation cluster from the other clusters: our model aims to find a comprehensive list of the distinguishing features for a human expert to later review for interesting patterns, not a minimal subset for classification. By consolidating features\u2014such as (sunglasses or hat)\u2014we still provide a compact summary of the ways in which the clusters can be distinguished.\n\n3 Inference\n\nSolving Equation 1 is computationally intractable. We use a variational approach to approximate the true posterior distribution p(yg, \u03b3g, tgk, \u03b2gk, ld, fg, zn, ing|{wnd}) with a factored distribution:\n\nq\u03b7g(yg) \u223c Bernoulli(\u03b7g), q\u03bbgk(tgk) \u223c Bernoulli(\u03bbgk), q\u2113g(\u03b3g) \u223c Beta(\u2113g1, \u2113g2), q\u03c6gk(\u03b2gk) \u223c Beta(\u03c6gk1, \u03c6gk2), q\u03c4(\u03c0) \u223c Dirichlet(\u03c4), q\u03bdn(zn) \u223c Multinomial(\u03bdn), qcd(ld) \u223c Multinomial(cd), qeg(fg) \u223c Bernoulli(eg), qong(ing) \u223c Bernoulli(ong)\n\nwhere in addition we use a weak-limit approximation to the Dirichlet process to approximate the distribution over group assignments ld. 
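As shape bookkeeping for the mean-field factors above, one might organize the variational parameters as arrays (an illustrative sketch, not the authors' implementation; the sizes, the truncation level G, and the initial values are all arbitrary assumptions):

```python
import numpy as np

# Illustrative problem sizes: N observations, D binary dimensions,
# K clusters, and a weak-limit truncation to G candidate groups.
N, D, K, G = 100, 20, 4, 10
rng = np.random.default_rng(0)

q = {
    "eta": rng.random(G),              # q(y_g)     = Bernoulli(eta_g)
    "ell": np.ones((G, 2)),            # q(gamma_g) = Beta(ell_g1, ell_g2)
    "lam": rng.random((G, K)),         # q(t_gk)    = Bernoulli(lambda_gk)
    "phi": np.ones((G, K, 2)),         # q(beta_gk) = Beta(phi_gk1, phi_gk2)
    "tau": np.ones(K),                 # q(pi)      = Dirichlet(tau)
    "c":   np.full((D, G), 1.0 / G),   # q(l_d)     = Multinomial(c_d)
    "nu":  np.full((N, K), 1.0 / K),   # q(z_n)     = Multinomial(nu_n)
    "e":   np.full(G, 0.5),            # q(f_g)     = Bernoulli(e_g), over {or, and}
    "o":   rng.random((N, G)),         # q(i_ng)    = Bernoulli(o_ng)
}

# Each multinomial factor must normalize over its support.
assert np.allclose(q["c"].sum(axis=1), 1.0)
assert np.allclose(q["nu"].sum(axis=1), 1.0)
```

Keeping the factors in one structure like this makes the alternation described next (extraction variables versus selection variables) a matter of updating disjoint subsets of `q`.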
Minimizing the Kullback-Leibler divergence between the true posterior p(\u03b8|{wnd}) and the variational distribution q(\u03b8) corresponds to maximizing the evidence lower bound (the ELBO) Eq[log p(\u03b8, {wnd})] + H(q), where H(q) is the entropy of q.\nBecause of the conjugate exponential-family terms, most of the expressions in the ELBO are straightforward to compute. The most challenging part is determining how to optimize the variational terms q(ld), q(ing), and q(fg) that are involved in the likelihood in Equation 2. Here, we first relax our generative process for or to have it correspond to independently sampling each wnd with some probability s. Thus, Equation 2 becomes\n\np(wn\u00b7|ing, fg, ld) = \prod_{d,g} 0^{1(fg=and)1(ld=g)1(ing=1)1(wnd=0)} 1^{1(fg=and)1(ld=g)1(ing=1)1(wnd=1)} (1\u2212s)^{1(fg=or)1(ld=g)1(ing=1)1(wnd=0)} s^{1(fg=or)1(ld=g)1(ing=1)1(wnd=1)} 0^{1(ing=0)1(ld=g)1(wnd=1)} 1^{1(ing=0)1(ld=g)1(wnd=0)}   (3)\n\nWith this relaxation, the expression for the entire evidence lower bound is straightforward to compute. (The full derivations are given in the supplementary materials.)\nHowever, the logical formulas in Equation 3 still impose hard, combinatorial constraints on settings of the variables {ing, fg, ld} that are associated with the logical formulas. Specifically, if the values for the formula choices {fg} and group assignments {ld} are fixed, then the value of ing is also fixed, because the feature extraction step is deterministic. Once ing is fixed, however, the relationships between all the other variables are conjugate in the exponential family. Therefore, we alternate our inference between the extraction-related variables {ing, fg, ld} and the selection-related variables {yg, \u03b3g, tgk, \u03b2gk, zn}.\nFeature Extraction We consider only degenerate distributions q(ing), q(fg), q(ld) that put mass on only one setting of the variables. 
Note that this is still a valid setting for the variational inference: fixing values for ing, fg, and ld, which corresponds to a degenerate Beta or Dirichlet prior, only means that we are further limiting our set of variational distributions. Not fully optimizing the bound due to this constraint can only lower the lower bound.\nWe perform an agglomerative procedure for assigning dimensions to groups. We begin our search with each dimension d assigned to its own formula: ld = d, fd = or. Merges of groups are explored using a combination of data-driven and random proposals, in which we also explore changing the formula assignment of the group. For the data-driven proposals, we use an initial run of a vanilla k-means clustering algorithm to test whether combining two groups results in an extracted feature that has high variance. At each iteration, we compute the ELBO for non-overlapping subsets of these proposals and choose the agglomeration with the highest ELBO.\nFeature Selection Given a particular setting of the extraction variables {ing, fg, ld}, the remaining variables {yg, \u03b3g, tgk, \u03b2gk, zn} are all in the exponential family. The corresponding posterior distributions q(yg), q(\u03b3g), q(tgk), q(\u03b2gk), and q(zn) can be optimized via coordinate ascent [22].\n\n4 Results\n\nWe applied our MGM to both standard benchmarks and more interesting data sets. In all cases, we ran 5 restarts of the MGM. Inference was run for 40 iterations or until the ELBO improved by less than 0.1 relative to the previous iteration. Twenty possible merges were explored in each iteration.\n\n        MGM        Kmeans     HFS(G)      Law        DPM         HFS(L)      CC\nFaces   0.59 (13)  0.46 (4)   0.627 (16)  0.454 (4)  0.481 (12)  0.569 (12)  0.547 (4)\nDigits  0.53 (13)  0.45 (13)  0.258 (13)  0.254 (6)  0.176 (5)   0.354 (11)  0.364 (10)\n\nTable 1: Mutual information and number of clusters (in parentheses) for UCI benchmarks. 
The mutual information is with respect to the true class labels (higher is better). Performance values for HFS(G), Law, DPM, HFS(L), and CC are taken from [17].\n\nEach merge exploration involved combining two existing groups into a new group. If we failed to accept our data-driven candidate merge proposals more than three times within an iteration, we switched to random proposals for the remaining proposals. We swept over the number of clusters from K=4 to K=16 and reported the results with the highest ELBO.\n\nFigure 3: Results on real-world datasets: animal dataset (left), recipe dataset (middle) and disease dataset (right). Each row represents an important feature. Lighter boxes indicate that the feature is likely to be present in the cluster, while darker boxes indicate that it is unlikely to be present.\n\n4.1 Benchmark Problems: MGM discriminates classes\n\nWe compared the classification performance of our clustering algorithms on several UCI benchmark problems [23]. The digits data set consists of 11000 16\u00d716 grayscale images, 1100 for each digit. The faces data set consists of 640 32\u00d730 images of 20 people, with 32 images of each person from various angles. In both cases, we binarized the images, setting each pixel to 0 if its value was less than 128 and to 1 otherwise. These two data sets were chosen because they are discrete and because the same versions were used in the results cited in [17].\nThe mutual information between our discovered clusters and the true classes in the data sets is shown in Table 1. A higher mutual information between our clustering and known labels is one way to objectively show that our clusters correspond to groups that humans find interesting (i.e. the human-provided classification labels). MGM is second only to HFS(G) on the Faces dataset and is the highest-scoring model on the Digits dataset. 
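The mutual-information score used in Table 1 can be computed directly from two discrete labelings; the following is an illustrative implementation (not the authors' evaluation code):

```python
import numpy as np

def mutual_information(clusters, labels):
    """Empirical mutual information (in nats) between two discrete labelings."""
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)
    mi = 0.0
    for c in np.unique(clusters):
        for y in np.unique(labels):
            p_cy = np.mean((clusters == c) & (labels == y))  # joint frequency
            if p_cy > 0:
                p_c = np.mean(clusters == c)
                p_y = np.mean(labels == y)
                mi += p_cy * np.log(p_cy / (p_c * p_y))
    return mi

labels = np.array([0, 0, 1, 1, 2, 2])
# A clustering that matches the labels perfectly attains MI = H(labels) = log 3;
# a single-cluster (uninformative) clustering attains MI = 0.
print(mutual_information(labels, labels))
print(mutual_information(np.zeros(6), labels))
```

The function is symmetric in its arguments, so it does not require matching cluster indices to class indices before scoring.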
It always outperforms k-means.\n\n4.2 Demonstrating Interpretability: Real-world Applications\n\nOur quantitative results on the benchmark datasets show that our approach recovers structure consistent with classes defined by human labelers at least as well as other clustering approaches. However, the dimensions in the image benchmarks do not have much associated meaning, and our approach was designed for clustering, not classification. Here, we demonstrate the qualitative advantages of our approach on three more interesting datasets.\nAnimals The animals data set [24] consists of 21 biological and ecological properties of 101 animals (such as \u201chas wings\u201d or \u201chas teeth\u201d). We are also provided class labels such as insects, mammals, and birds. The result of our MGM is shown in Figure 3. Each row is a distinguishable feature; each column is a cluster. Lighter color boxes in Figure 3 indicate that the feature is likely to be present in the cluster, while darker color boxes indicate that the feature is unlikely to be present in the cluster. Below each cluster, a few animals that belong to that cluster are listed.\nWe first note that, as desired, our model selects features that have large variation in their probabilities across the clusters (rows in Figure 3). Thus, it is straightforward to read what makes each column different from the others: the mammals in the third column do not lay eggs; the insects in the fifth column are toothless and invertebrates (and therefore have no tails). They are also rarely predators. Unlike the land animals, many of the water animals in columns one and two do not breathe.\nRecipes The recipes data set consists of ingredients from recipes taken from the Computer Cooking Contest1. There are 56 recipes, with 147 total ingredients. The recipes fall into four categories: pasta, chili, brownies, or punch. 
We seek to find ingredients and groups of ingredients that can distinguish different types of recipes. Note: the names for each cluster were filled in after the analysis, based on the class label of the majority of the observations that were grouped into that cluster.\nThe MGM distills the 147 ingredients into only 3 important features. The first extracted feature contains several spices, which are present in pasta, brownies, and chili but not in punch. Punch is also distinguished from the other clusters by its lack of basic spices such as salt and pepper (the second extracted feature). The third extracted feature contains a number of savory cooking ingredients such as oil, garlic, and shallots. These are common in the pasta and chili clusters but uncommon in the punch and brownie clusters.\nDiseases Finally, we consider a data set of patients with autism spectrum disorder (ASD) accumulated over the first 15 years of life [25]. ASD is a complex disease that is often associated with co-occurring conditions such as seizures and developmental delays. As most patients have very few diagnoses, we limited our analysis to the 184 patients with at least 200 diagnoses and the 58 diagnoses that occurred in at least 5% of the patients. We binarized the count data to 0-1 values.\nOur model reduces these 58 dimensions to 9 important sets of features. The extracted features had many more dimensions than in the previous examples, so we list only two features from each group and provide the total number in parentheses. Several of the groups of extracted variables\u2014which did not use any auxiliary information\u2014are similar to those from [25]. In particular, [25] report clusters of patients with epilepsy and cerebral palsy, patients with psychiatric disorders, and patients with gastrointestinal disorders. 
Using our representation, we can easily see that there appears to be one group of sick patients (cluster 1) for whom all features are likely. We can also see which features distinguish clusters 0, 2, and 3 from each other by which ones are unlikely to be present.\n\n4.3 Verifying interpretability: Human subject experiment\n\nWe conducted a pilot study to gather qualitative feedback on the MGM. We first divided the ASD data into three datasets with random disjoint subsets of approximately 20 dimensions each. For each of these subsets, we prepared the data in three formats: raw patient data (a list of symptoms), clustered results (centroids) from K-means, and clustered results from the MGM with distinguishable sets of features. Both sets of clustered results were presented as figures like Figure 3, and the raw data were presented in a spreadsheet. Three domain experts were then tasked with exploring the different data subsets in each format (so each participant saw all formats and all data subsets) and producing a 2-3 sentence executive summary of each. The different conditions served as reference points for the subjects to give more qualitative feedback about the MGM.\nAll subjects reported that the raw data\u2014even with a \u201csmall\u201d number of 20 dimensions\u2014was impossible to summarize in a 5-minute period. Subjects also reported that the aggregation of states in the MGM helped them summarize the data faster than aggregating manually. While none of them explicitly indicated they noticed that all the rows of the MGM were relevant, they did report that it was easier to find the differences. One strictly preferred the MGM over the other options, while another found the MGM easier for constructing a narrative but was overall satisfied with both the MGM and the K-means clustering. One subject appreciated the succinctness of the MGM but was concerned that \u201cit may lose some information\u201d. 
This final comment motivates future work on structured priors over which logical formulas should be allowed or likely; future user studies should also evaluate the effects of the feature extraction and selection separately. Finally, a qualitative review of the summaries produced found similar but slightly more compact organization of notes for the MGM compared to the K-means model.\n\n1Computer Cooking Contest: http://liris.cnrs.fr/ccc/ccc2014/doku.php\n\n5 Discussion and Related Work\n\nMGM combines extractive and selective approaches for finding a small set of distinguishable dimensions when performing unsupervised learning on high-dimensional data sets. Rather than relying on criteria based on statistical measures of variation and then performing additional post-processing to interpret the results, we build interpretable criteria directly into the model. Our logic-based feature extraction step allows us to find natural groupings of dimensions such as (backbone or tail or toothless) in the animal data and (salt or pepper or cream) in the recipe data. Defining an interesting dimension as one whose parameters are drawn from a multi-modal distribution helps us recover groups like pasta and punch. Providing such comprehensive lists of distinguishing dimensions assists in the data exploration and hypothesis generation process. Similarly, providing lists of dimensions that have been consolidated in one extraction aids the human discovery process of why those dimensions might form a meaningful group.\nClosest to our work are feature selection approaches such as [17, 18, 19], which also use a mixture of Beta distributions to identify feature types. In particular, [17] uses a similar hierarchy of Beta and Bernoulli priors to identify important dimensions. They carefully choose the priors so that some dimensions can be globally important, while other dimensions can be locally important. 
The parameters for important dimensions are chosen IID from a Gaussian distribution, while values for all unimportant dimensions come from the same background distribution.\nOur approach draws parameters for important dimensions from distributions with multiple modes, while unimportant dimensions are drawn from a uni-modal distribution. Thus, our model is more expressive than approaches in which all unimportant dimension values are drawn from the same distribution. It captures the idea that not all variation is important; clusters can vary in their emission parameters for a particular dimension, and that variation still might not be interesting. Specifically, an important dimension is one where there is a gap between parameter values. Our logic-based feature extraction step collapses the dimensionality further while retaining interpretability.\nMore broadly, there are many other lines of work that focus on creating latent variable models based on diversity or differences. Methods for inducing diversity, such as determinantal point processes [26], have been used to find diverse solutions in applications including detecting objects in videos [27], topic modeling [28], and variable selection [29]. In these cases, the goal is to avoid finding multiple very similar optima; while the generated solutions are different, the model itself does not provide descriptions of what distinguishes one solution from the rest. Moreover, there may be situations in which forcing solutions to be very different might not make sense: for example, when clustering recipes, it may be very sensible for the ingredient \u201csalt\u201d to be a common feature of all clusters; likewise, when clustering patients from an autism cohort, one would expect all patients to have some kind of developmental disorder.\nFinally, other approaches focus on building models in which factors describe what distinguishes them from some baseline. 
For example, [30] builds a topic model in which each topic is described by its difference from some baseline distribution. Contrastive learning [31] focuses on finding the directions that most distinguish background data from foreground data. Max-margin approaches to topic models [32] try to find topics that can best assist in distinguishing between classes, but these topics are not necessarily readily interpretable themselves.\n\n6 Conclusions and Future Work\n\nWe presented MGM, an approach for interpretable feature extraction and selection. By incorporating interpretability-based criteria directly into the model design, we found key dimensions that distinguished clusters of animals, recipes, and patients. While this work focused on the clustering of binary data, these ideas could also be applied to mixed and multiple-membership models. Similarly, notions of interestingness based on a gap could be applied to categorical and continuous data. It would also be interesting to consider more expressive extracted features, such as more complex logical formulas. Finally, while we learned feature extractions in a completely unsupervised fashion, our generative approach also allows one to flexibly incorporate domain knowledge about possible group memberships into the priors.\n\nReferences\n[1] G. A. Miller, \u201cThe magical number seven, plus or minus two: Some limits on our capacity for processing information,\u201d The Psychological Review, pp. 81\u201397, March 1956.\n[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, \u201cLatent Dirichlet allocation,\u201d JMLR, pp. 3:993\u20131022, 2003.\n[3] H. Zou, T. Hastie, and R. Tibshirani, \u201cSparse principal component analysis,\u201d Journal of Computational and Graphical Statistics, vol. 15, 2006.\n[4] K. Than and T. B. Ho, \u201cFully sparse topic models,\u201d in ECML-PKDD, pp. 490\u2013505, 2012.\n[5] S. Williamson, C. Wang, K. Heller, and D. 
Blei, “The IBP compound Dirichlet process and its application to focused topic modeling,” in ICML, 2010.
[6] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2765–2781, 2013.
[7] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic subspace clustering of high dimensional data for data mining applications,” SIGMOD Rec., vol. 27, pp. 94–105, June 1998.
[8] B. Kim, C. Rudin, and J. A. Shah, “The Bayesian Case Model: A generative approach for case-based reasoning and prototype classification,” in NIPS, 2014.
[9] R. K, “Differential diagnosis in primary care,” JAMA, vol. 307, no. 14, pp. 1533–1534, 2012.
[10] J. Zhang and K. F. Yu, “What’s the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes,” JAMA, vol. 280, no. 19, pp. 1690–1691, 1998.
[11] S. Alelyani, J. Tang, and H. Liu, “Feature selection for clustering: A review,” Data Clustering: Algorithms and Applications, vol. 29, 2013.
[12] S. Guérif, “Unsupervised variable selection: when random rankings sound as irrelevancy,” in FSDM, 2008.
[13] P. Mitra, C. Murthy, and S. K. Pal, “Unsupervised feature selection using feature similarity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301–312, 2002.
[14] M. Dash and H. Liu, “Feature selection for clustering,” in KDD: Current Issues and New Applications, pp. 110–121, 2000.
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the Fisher score,” in NIPS, 2003.
[16] J. G. Dy and C. E. Brodley, “Feature selection for unsupervised learning,” JMLR, vol. 5, pp. 845–889, 2004.
[17] Y. Guan, J. G. Dy, and M. I.
Jordan, “A unified probabilistic model for global and local unsupervised feature selection,” in ICML, pp. 1073–1080, 2011.
[18] W. Fan and N. Bouguila, “Online learning of a Dirichlet process mixture of generalized Dirichlet distributions for simultaneous clustering and localized feature selection,” in ACML, pp. 113–128, 2012.
[19] G. Yu, R. Huang, and Z. Wang, “Document clustering via Dirichlet process mixture model with feature selection,” in KDD, pp. 763–772, ACM, 2010.
[20] A. A. Freitas, “Comprehensible classification models: a position paper,” ACM SIGKDD Explorations Newsletter, 2014.
[21] G. De’ath and K. E. Fabricius, “Classification and regression trees: a powerful yet simple technique for ecological data analysis,” Ecology, vol. 81, no. 11, pp. 3178–3192, 2000.
[22] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.
[23] M. Lichman, “UCI machine learning repository,” 2013.
[24] C. Kemp and J. B. Tenenbaum, “The discovery of structural form,” PNAS, 2008.
[25] F. Doshi-Velez, Y. Ge, and I. Kohane, “Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis,” Pediatrics, vol. 133, no. 1, pp. e54–e63, 2014.
[26] A. Kulesza, Learning with Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2012.
[27] A. Kulesza and B. Taskar, “Structured determinantal point processes,” in NIPS, 2010.
[28] J. Y. Zou and R. P. Adams, “Priors for diversity in generative latent variable models,” in NIPS, 2012.
[29] N. K. Batmanghelich, G. Quon, A. Kulesza, M. Kellis, P. Golland, and L.
Bornn, “Diversifying sparsity using variational determinantal point processes,” CoRR, 2014.
[30] J. Eisenstein, A. Ahmed, and E. P. Xing, “Sparse additive generative models of text,” in ICML, 2011.
[31] J. Y. Zou, D. J. Hsu, D. C. Parkes, and R. P. Adams, “Contrastive learning using spectral methods,” in NIPS, 2013.
[32] J. Zhu, A. Ahmed, and E. P. Xing, “MedLDA: maximum margin supervised topic models for regression and classification,” in ICML, pp. 1257–1264, ACM, 2009.