{"title": "A Debiased MDI Feature Importance Measure for Random Forests", "book": "Advances in Neural Information Processing Systems", "page_first": 8049, "page_last": 8059, "abstract": "Tree ensembles such as Random Forests have achieved impressive empirical success across a wide variety of applications. To understand how these models make predictions, people routinely turn to feature importance measures calculated from tree ensembles. It has long been known that Mean Decrease Impurity (MDI), one of the most widely used measures of feature importance, incorrectly assigns high importance to noisy features, leading to systematic bias in feature selection. In this paper, we address the feature selection bias of MDI from both theoretical and methodological perspectives. Based on the original definition of MDI by Breiman et al.  \\cite{Breiman1984} for a single tree, we derive a tight non-asymptotic bound on the expected bias of MDI importance of noisy features, showing that deep trees have higher (expected) feature selection bias than shallow ones. However, it is not clear how to reduce the bias of MDI using its existing analytical expression. We derive a new analytical expression for MDI, and based on this new expression, we are able to propose a debiased MDI feature importance measure using out-of-bag samples, called MDI-oob. 
For both the simulated data and a genomic ChIP dataset, MDI-oob achieves state-of-the-art performance in feature selection from Random Forests for both deep and shallow trees.", "full_text": "A Debiased MDI Feature Importance Measure for\n\nRandom Forests\n\nXiao Li\u2217\n\nStatistics Department\n\nUC Berkeley\n\nsxli@berkeley.edu\n\nYu Wang\n\nStatistics Department\n\nUC Berkeley\n\nwang.yu@berkeley.edu\n\nSumanta Basu\n\nStatistics and Data Science Department\n\nComputational Biology Department\n\nCornell University\n\nsumbose@cornell.edu\n\nKarl Kumbier\n\nStatistics Department\n\nUC Berkeley\n\nkkumbier@berkeley.edu\n\nBin Yu\n\nEECS, Statistics Department\n\nUC Berkeley\n\nbinyu@berkeley.edu\n\nAbstract\n\nTree ensembles such as Random Forests have achieved impressive empirical suc-\ncess across a wide variety of applications. To understand how these models make\npredictions, people routinely turn to feature importance measures calculated from\ntree ensembles. It has long been known that Mean Decrease Impurity (MDI), one\nof the most widely used measures of feature importance, incorrectly assigns high\nimportance to noisy features, leading to systematic bias in feature selection. In\nthis paper, we address the feature selection bias of MDI from both theoretical and\nmethodological perspectives. Based on the original de\ufb01nition of MDI by Breiman\net al. [3] for a single tree, we derive a tight non-asymptotic bound on the expected\nbias of MDI importance of noisy features, showing that deep trees have higher\n(expected) feature selection bias than shallow ones. However, it is not clear how to\nreduce the bias of MDI using its existing analytical expression. We derive a new\nanalytical expression for MDI, and based on this new expression, we are able to\npropose a new MDI feature importance measure using out-of-bag samples, called\nMDI-oob. 
For both the simulated data and a genomic ChIP dataset, MDI-oob\nachieves state-of-the-art performance in feature selection from Random Forests for\nboth deep and shallow trees.\n\n1\n\nIntroduction\n\nUnderstanding how a machine learning (ML) model makes predictions is important in many scienti\ufb01c\nand industrial problems [19]. Appropriate interpretations can help increase the predictive performance\nof a model and provide new domain insights. While a line of study focuses on interpreting any generic\nML model [30, 22], there is a growing interest in developing specialized methods to understand\nspeci\ufb01c models. In particular, interpreting Random Forests (RFs) [2] and its variants [14, 28, 27,\n29, 1, 12] has become an important area of research due to the wide ranging applications of RFs\nin various scienti\ufb01c areas, such as genome-wide association studies (GWAS) [7], gene expression\nmicroarray [13, 23], and gene regulatory networks [9].\nA key question in understanding RFs is how to assign feature importance. That is, which features\ndoes a RF rely on for prediction? One of the most widely used feature importance measures for\n\n\u2217The \ufb01rst two authors contributed equally to this paper.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fRFs is mean decrease impurity (MDI) [3]. MDI computes the total reduction in loss or impurity\ncontributed by all splits for a given feature. This method is computationally very ef\ufb01cient and has\nbeen widely used in a variety of applications [25, 9]. However, theoretical analysis of MDI has\nremained sparse in the literature [11]. Assuming there are an in\ufb01nite number of samples, Louppe et al.\n[16] characterized MDI for totally randomized trees using mutual information between features and\nthe response. They showed that noisy features, i.e., features independent of the outcome, have zero\nMDI importance. 
However, empirical studies have shown that MDI systematically assigns higher\nfeature importance values to numerical features or categorical features with many categories [29].\nIn other words, high MDI values do not always correspond to the predictive associations between\nfeatures and the outcome. We call this phenomenon MDI feature selection bias. Louppe [15] studied\nthis issue and demonstrated via simulations that early stopping mechanisms (e.g., limited depth and\nlarger leaf sizes) are effective in reducing the feature selection bias.\nIn this paper, using the original definition of MDI, we analyze the non-asymptotic behavior of MDI\nand bridge the gap between the population case and the finite sample case. We find that under mild\nconditions, if the samples used for each tree are i.i.d., then the expected MDI feature importance of\nnoisy features derived from any tree ensemble constructed on n samples with p features is upper\nbounded, up to a constant factor, by dn log(np)/mn, where mn is the minimum leaf size and dn is the maximum tree depth\nin the ensemble. In other words, deep trees with small leaves suffer more from feature selection\nbias. Our findings are particularly relevant for practical applications involving RFs, in which\ndeep trees are recommended [2] and commonly used. To reduce the feature selection bias for RFs,\nespecially when the trees are deep, we derive a new analytical expression for MDI and then use this\nnew expression to propose a new feature importance measure that evaluates MDI using out-of-bag\nsamples. We call this importance measure MDI-oob. For both regression and classification problems,\nwe use simulated data and a genomic dataset to demonstrate that MDI-oob often achieves 5%\u201310%\nhigher AUC scores compared to other feature importance measures used in several publicly available\npackages including party [4], ranger [33], and scikit-learn [21].\n\n1.1 Related works\n\nIn addition to MDI [32, 17], some other feature importance measures have been studied in the\nliterature and used in practice:\n\u2022 Split count, namely, the number of times a feature is used to split [29], can be used as a feature\nimportance measure. This method has been studied in [28, 1] and is available in XGBoost [6].\n\u2022 Mean decrease in accuracy (MDA) measures a feature\u2019s importance by the reduction in the model\u2019s\naccuracy after randomly permuting the values of a feature. The motivation of MDA is that permuting\nan important feature would result in a large decrease in the accuracy while permuting an unimportant\nfeature would have a negligible effect. Different permutation choices have been studied in [28, 10].\n\nRecently, Lundberg et al. [17] showed that for feature importance measures such as MDI and split\ncounts, the importance of a feature does not always increase as the outcome becomes more dependent\non that feature. To remedy this issue, they propose the tree SHAP feature importance, which focuses\non giving consistent feature attributions to each sample. Once individual feature importance is\nobtained, overall feature importance is straightforward to obtain by averaging the individual\nfeature importances across samples.\nWhile our paper focuses on interpreting trees learned via the classic RF procedure, there is another line\nof work that focuses on modifying the tree construction procedure to obtain better feature importance\nmeasures. Hothorn et al. [8] introduced cforest in the R package party that grows classification\ntrees based on a conditional inference framework. Strobl et al. [29] showed that cforest suffers less\nfrom the feature selection bias. Sandri and Zuccolotto [25] proposed to create a set of uninformative\npseudo-covariates to evaluate the bias in Gini importance. Nembrini et al. 
[20] gave a modi\ufb01ed\nalgorithm that is faster than the original method proposed by Sandri and Zuccolotto [25] with almost\nno overhead over the creation of the original RFs and available in the R package ranger. In a very\nrecent paper, Zhou and Hooker [34] proposed to evaluate the decrease in impurity at each node using\nout-of-bag samples. However, our implementation is different from that in [34] and MDI-oob enjoys\nhigher computational ef\ufb01ciency.\nIn Section 4, we will compare MDI-oob with all the aforementioned methods except the split count,\nfor which we did not \ufb01nd a package that implements it for RFs.\n\n2\n\n\f1.2 Organization\n\nThe rest of this paper is organized as follows. In Section 2, we provide a non-asymptotic analysis to\nquantify the bias in the MDI importance when noisy features are independent of relevant features.\nIn Section 3, we give a new characterization of MDI and propose a new MDI feature importance\nusing out-of-bag samples, which we call MDI-oob. In Section 4, we compare our MDI-oob with\nother commonly used feature importance measures in terms of feature selection accuracy using the\nsimulated data and a genomic ChIP dataset. We conclude our work and discuss possible future\ndirections in Section 5.\n\n2 Understanding the feature selection bias of MDI\n\nIn this section, we focus on understanding the \ufb01nite sample properties of MDI importance and why it\nmay have a signi\ufb01cant bias in feature selection. We \ufb01rst brie\ufb02y review the construction of RFs and\nintroduce some important notations. Then, using the original de\ufb01nition of MDI, we give a tight upper\nbound to quantify the expected bias of MDI importance for a noisy feature. This upper bound is tight\nup to a log n factor where n is the number of i.i.d. samples.\n\n2.1 Background and notations\nSuppose that the data set D contains n i.i.d samples from a random vector (X1, . . . , Xp, Y ), where\nX = (X1, . . . 
, Xp) \u2208 Rp are p input features and Y \u2208 R is the response. The ith sample is denoted\nby (xi, yi), where xi = (xi1, . . . , xip). We say that a feature Xk is a noisy feature if Xk and Y are\nindependent, and a relevant feature otherwise. Note that this definition of noisy features has also\nbeen used in many previous papers such as [16, 26]. We denote by S \u2282 [p] the set of indices of\nrelevant features. We are particularly interested in the case where the number of relevant features is\nsmall, namely, |S| is much smaller than p. For any number m \u2208 N, [m] denotes the set of integers\n{1, . . . , m}. For any hyper-rectangle R \u2282 Rp, let 1(X \u2208 R) be the indicator function taking value\none when X \u2208 R and zero otherwise.\nRFs are an ensemble of classification and regression trees, where each tree T defines a mapping\nfrom the feature space to the response. Trees are constructed independently of one another on a\nbootstrapped or subsampled data set D(T ) of the original data D. Any node t in a tree T represents a\nsubset (usually a hyper-rectangle) Rt of the feature space. A split of the node t is a pair (k, z) which\ndivides the hyper-rectangle Rt into two hyper-rectangles Rt \u2229 {Xk \u2264 z} and Rt \u2229 {Xk > z},\ncorresponding to the left child tleft and right child tright of node t, respectively. For a node t in a tree\nT , Nn(t) = |{i \u2208 D(T ) : xi \u2208 Rt}| denotes the number of samples falling into Rt and\n\n$$\\mu_n(t) := \\frac{1}{N_n(t)} \\sum_{i: x_i \\in R_t} y_i \\qquad (1)$$\n\ndenotes their average response.\nEach tree T is grown using a recursive procedure which proceeds in two steps for each node\nt. First, a subset M \u2282 [p] of features is chosen uniformly at random. 
Then the optimal split\nv(t) \u2208 M, z(t) \u2208 R is determined by maximizing:\n\n$$\\Delta I(t) := \\mathrm{Impurity}(t) - \\frac{N_n(t_{\\mathrm{left}})}{N_n(t)} \\mathrm{Impurity}(t_{\\mathrm{left}}) - \\frac{N_n(t_{\\mathrm{right}})}{N_n(t)} \\mathrm{Impurity}(t_{\\mathrm{right}}) \\qquad (2)$$\n\nfor some impurity measure Impurity(t). The procedure terminates at a node t if two children contain\ntoo few samples, i.e., min{Nn(tleft), Nn(tright)} \u2264 mn, or if all responses are identical. The\nthreshold mn is called the minimum leaf size. If a node t does not have any children, it is called a leaf\nnode; otherwise, it is called an inner node. We define the set of inner nodes of a tree T as I(T ). We\nsay that T' is a sub-tree of T if T' can be obtained by pruning some nodes in T .\nSome popular choices of the impurity measure Impurity(t) include variance, Gini index, or entropy.\nFor simplicity, we focus on the variance of the responses, i.e.,\n\n$$\\mathrm{Impurity}(t) = \\frac{1}{N_n(t)} \\sum_{i: x_i \\in R_t} (y_i - \\mu_n(t))^2, \\qquad (3)$$\n\nthroughout the paper unless stated otherwise. Later we show that this definition of impurity is\nequivalent to the Gini index of categorical variables with one hot encoding (see Remark in Section 3).\nThe Mean Decrease Impurity (MDI) feature importance of Xk, with respect to a single tree T (first\nproposed by Breiman et al. in [3]) and an ensemble of ntree trees T1, . . . , Tntree, can be written as\n\n$$\\mathrm{MDI}(k, T) = \\sum_{t \\in I(T),\\, v(t) = k} \\frac{N_n(t)}{n} \\Delta I(t) \\quad \\text{and} \\quad \\mathrm{MDI}(k) = \\frac{1}{n_{\\mathrm{tree}}} \\sum_{s=1}^{n_{\\mathrm{tree}}} \\mathrm{MDI}(k, T_s), \\qquad (4)$$\n\nrespectively. This expression is the best known formula for MDI and was analyzed in many papers\nsuch as Louppe et al. [16].\n\n2.2 Finite sample bias of MDI importance for Random Forests\n\nGiven the set S of relevant features and a tree T , we denote\n\n$$G_0(T) = \\sum_{k \\notin S} \\mathrm{MDI}(k, T) \\qquad (5)$$\n\nas the sum of MDI importance of all noisy features. 
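
To make Equations (2), (3) and (5) concrete, the following minimal Python sketch computes the largest variance decrease that a single root split on a pure-noise feature can achieve, for two minimum leaf sizes. It is only an illustration under stated assumptions: the helper names (variance, best_decrease), the toy data, and the convention that each child keeps at least min_leaf samples are hypothetical and are not taken from the original experiments.

```python
import random

def variance(values):
    # Impurity of Equation (3): mean squared deviation from the node mean.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_decrease(x, y, min_leaf):
    # Largest impurity decrease (Equation (2)) over all root splits on
    # feature x that leave at least min_leaf samples in each child.
    order = sorted(range(len(x)), key=lambda i: x[i])
    y_sorted = [y[i] for i in order]
    n = len(y_sorted)
    total = variance(y_sorted)
    best = 0.0
    for cut in range(min_leaf, n - min_leaf + 1):
        left, right = y_sorted[:cut], y_sorted[cut:]
        decrease = (total
                    - len(left) / n * variance(left)
                    - len(right) / n * variance(right))
        best = max(best, decrease)
    return best

random.seed(0)
n = 200
x_noise = [random.random() for _ in range(n)]   # feature independent of y
y = [random.gauss(0.0, 1.0) for _ in range(n)]  # response is pure noise

deep = best_decrease(x_noise, y, min_leaf=1)
shallow = best_decrease(x_noise, y, min_leaf=50)
print(deep, shallow)
```

Because every split allowed with min_leaf=50 is also allowed with min_leaf=1, the first value can never be smaller than the second, and it is strictly positive even though the feature carries no signal.
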
Ideally, G0(T ) should be close to zero with high\nprobability, to ensure that no noisy features get selected when using MDI importance for feature\nselection. In fact, Louppe et al. [16] show that G0(T ) is indeed zero almost surely if we grow totally\nrandomized trees with infinite samples. However, G0(T ) is typically non-negligible in real data, and\nfinite sample properties of G0(T ) are not well understood. In order to bridge this gap, we conduct\na non-asymptotic analysis of the expected value of G0(T ). Our main result characterizes how the\nexpected value of G0(T ) depends on mn, the minimum leaf size of T , and p, the dimension of the\nfeature space. We start with the following simple but important fact.\nFact 1. If T' is a sub-tree of T , then MDI(k, T') \u2264 MDI(k, T ) for any feature Xk.\nThis fact naturally follows from the observation that by definition, \u2206I(t) \u2265 0 for any node t. Since\nthe impurity decrease at each node is guaranteed to be non-negative, G0(T ) will never decrease as\nT grows deeper, in which case the minimum leaf size mn will be smaller. Indeed, if T is grown\nto purity (mn = 1), and all features are noisy (S = \u2205), then G0(T ) would simply be equal to the\nsample variance of the responses in the data D(T ). How fast does G0(T ) increase as the minimum\nleaf size mn becomes smaller? To quantify the relation between G0(T ) and mn, we need a few mild\nconditions which we now describe. Let\n\n$$y_i = \\phi(x_{i,S}) + \\epsilon_i, \\quad i = 1, \\ldots, n \\qquad (6)$$\n\nfor some unknown function \u03c6 : R^|S| \u2192 R, where the \u03b5i are i.i.d. zero-mean Gaussian noise terms. We make the\nfollowing assumptions.\n(A1) Xk \u223c Unif[0, 1] for all k \u2208 [p]. In addition, the noisy features {Xk, k \u2208 [p]\\S} are mutually\nindependent, and independent of all relevant features. 
Here S denotes the set of relevant features.\n(A2) \u03c6 is bounded: $\\sup_{x \\in [0,1]^{|S|}} |\\phi(x)| \\le M$ for some M > 0.\nAssumptions (A1) and (A2) are weaker than the assumptions usually made when studying the\nstatistical properties of RF. The marginal uniform distribution condition in (A1) is common in the RF\nliterature [26], and can be easily satisfied by transforming each feature via its CDF. Since we\nare interested in characterizing the MDI of noisy features, we do not require the relevant features to\nbe independent of each other. We do require that noisy features are independent of relevant features,\nwhich is a limitation of Theorem 1 below. Correlated features are commonly encountered in practice\nand difficult for any feature selection method.\nWe now state our first main result, which provides non-asymptotic upper and lower bounds for the\nexpected value of the supremum of G0(T ) over all trees T with minimum leaf size mn.\nTheorem 1. Let Tn(mn) denote the set of decision trees whose minimum leaf size is lower bounded\nby mn, and Tn(mn, dn) \u2282 Tn(mn) denote the subset of Tn(mn) whose depth is upper bounded by\ndn. Under Assumptions (A1) and (A2), there exists a positive constant C such that\n\n$$\\mathbb{E}_{X,\\epsilon} \\sup_{T \\in \\mathcal{T}_n(m_n, d_n)} G_0(T) \\le C \\frac{d_n \\log(np)}{m_n}. \\qquad (7)$$\n\nIn addition, when \u03c6 = 0 and mn \u2265 36 log p + 18 log n,\n\n$$\\mathbb{E}_{X,\\epsilon} \\sup_{T \\in \\mathcal{T}_n(m_n)} G_0(T) \\ge \\frac{\\log p}{C m_n}. \\qquad (8)$$\n\nWe give the proof in the Appendix. To the best of our knowledge, Theorem 1 is the first\nnon-asymptotic result on the expected MDI importance of tree ensembles. In particular, the upper bound\ncan be directly applied to any tree ensemble with a minimum leaf size mn and a maximum tree depth\ndn, including Breiman\u2019s original RF procedure, if subsampling is used instead of bootstrapping.\nProof Sketch. 
Every node t in a tree T \u2208 Tn(mn, dn) corresponds to an axis-aligned hyper-rectangle\nin $[0, 1]^p$ which contains at least mn samples and is formed by splitting on at most dn dimensions\nconsecutively. Therefore, bounding the supremum of impurity reduction for any potential node\nin Tn(mn, dn) boils down to controlling the complexity of all such hyper-rectangles. Two\nhyper-rectangles are considered equivalent if they contain the same subset of samples, since the impurity\nreductions of these two hyper-rectangles are always the same. Up to this equivalence, it can be proved\nthat the number of unique hyper-rectangles of interest is upper bounded by $(np)^{d_n}$, which corresponds\nto the dn log(np) term in the upper bound. The final result is obtained via union bound.\nIn the upper bound, each node t is obtained by splitting on at most dn features. In practice, dn is\ntypically at most of order log n. Indeed, if the decision tree is a balanced binary tree, then dn \u2264 log2 n.\nTherefore, for balanced trees, the upper bound can be written as\n\n$$\\mathbb{E}_{X,\\epsilon} \\sup_{T \\in \\mathcal{T}_n(m_n, d_n)} G_0(T) \\le C \\frac{d_n \\log(np)}{m_n} \\le C \\frac{(\\log n)^2 + \\log n \\log p}{m_n}, \\qquad (9)$$\n\nand the theorem shows that the sum of MDI importance of noisy features is of order log p / mn, i.e.,\n\n$$\\sup_{\\phi: \\|\\phi\\|_\\infty \\le M} \\mathbb{E}_{X,\\epsilon} \\sup_{T \\in \\mathcal{T}_n(m_n)} G_0(T) \\sim \\frac{\\log p}{m_n}, \\qquad (10)$$\n\nup to a log n term correction, which is typically small in the high dimensional p \u226b n setting. If\nall features Xj are categorical with a bounded number of categories, then the upper bound can be\nimproved to\n\n$$\\mathbb{E}_{X,\\epsilon} \\sup_{T \\in \\mathcal{T}_n(m_n, d_n)} G_0(T) \\le C \\frac{d_n \\log p}{m_n}, \\qquad (11)$$\n\nwhich shows that the MDI importance of noisy features can be better controlled if the noisy features\nare categorical rather than numerical. 
That is consistent with the previous empirical studies because\nthe number of candidate split points for a numerical feature is larger than that for a categorical feature.\nTheorem 1 shows that the supremum of MDI importance of noisy features over all trees with minimum\nleaf size mn is, in expectation, roughly inversely proportional to mn. In the Appendix Fig. 5, we show\nthat the inversely proportional relationship is consistent with the empirical G0(T ) on a simulated\ndataset described in the first simulation study in Section 4. Therefore, to control the finite sample\nbias of MDI importance, one should either grow shallow trees, or use only the shallow nodes in a\ndeep tree when computing the feature importance. In fact, since G0(T ) depends on the dimension p\nonly through a log factor log p, we expect G0(T ) to be very small even in a high-dimensional setting\nif mn is larger than, say, $\\sqrt{n}$. For a balanced binary tree grown to purity with depth dn = log2 n, this\ncorresponds to computing MDI only from the first dn/2 = (log2 n)/2 levels of the tree, as the node\nsize on the dth level of a balanced tree is $n/2^d$.\nFact 1 implies that the MDI importance of relevant features might also decrease as mn increases, but\nwe will show in simulation studies that they will decrease at a much slower rate, especially when the\nunderlying model is sparse.\n\n3 MDI using out-of-bag samples (MDI-oob)\n\nAs shown in the previous section, for balanced trees, the sum of MDI feature importance of all noisy\nfeatures is of order log(p)/mn if we ignore the log(n) terms. This means that the MDI feature selection\nbias becomes severe for trees with smaller leaf size mn, which usually corresponds to a deeper tree.\nFortunately, this bias can be corrected by evaluating MDI using out-of-bag samples. 
In this section,\nwe first introduce a new analytical expression of MDI as the motivation of our new method, then we\npropose MDI-oob as a new feature importance measure. For simplicity, in this section, we only\nfocus on one tree T . However, all the results are directly applicable to the forest case.\n\n3.1 A new characterization of MDI\n\nRecall that the original definition of the MDI importance of any feature k is provided in Equation (4),\nthat is, the weighted sum of impurity decreases among all the inner nodes t such that v(t) = k. Although we\ncan use this definition to analyze the feature selection bias of MDI in Theorem 1, expression (4)\ngives little intuition about how to obtain a new feature importance measure that reduces the MDI\nbias. Next, we derive a novel analytical expression of MDI, which shows that the MDI of any feature\nk can be viewed as the sample covariance between the response yi and the function fT,k(xi) defined\nin Proposition 1. This new expression inspires us to propose a new MDI feature importance measure\nby using the out-of-bag samples.\nProposition 1. Define the function fT,k(\u00b7) to be\n\n$$f_{T,k}(X) = \\sum_{t \\in I(T): v(t) = k} \\left\\{ \\mu_n(t_{\\mathrm{left}}) 1(X \\in R_{t_{\\mathrm{left}}}) + \\mu_n(t_{\\mathrm{right}}) 1(X \\in R_{t_{\\mathrm{right}}}) - \\mu_n(t) 1(X \\in R_t) \\right\\}.$$\n\nThen the MDI of the feature k in a tree T can be written as:\n\n$$\\frac{1}{|D(T)|} \\sum_{i \\in D(T)} f_{T,k}(x_i) \\cdot y_i. \\qquad (12)$$\n\nWe give the proof in the Appendix. The proof is just a few lines but it requires a good understanding\nof MDI. Although we have not seen this analytical expression in prior work, we found that the\nfunctions fT,k(\u00b7) have been studied from a quite different perspective. Those functions were first\nproposed in Saabas [24] to interpret the RF predictions for each individual sample. According to that\nwork, fT,k can be viewed as the \"contribution\" made by the feature k in the tree T . 
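
The identity in Proposition 1 can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the fixed two-split tree (TREE), the tuple encoding of nodes, the helper names (mean, variance, walk) and the six data points are hypothetical and are not taken from the original experiments. It accumulates Equation (4) and the functions f_{T,k} in one pass over the tree and compares the two resulting expressions.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical fixed tree: an inner node is (feature, threshold, left, right)
# and a leaf is None. The root splits on feature 0, its left child on feature 1.
TREE = (0, 0.5, (1, 0.5, None, None), None)

def walk(node, X, y, idx, n, mdi, f):
    # Accumulate Equation (4) in mdi[k] and f_{T,k}(x_i) in f[k][i].
    if node is None:
        return
    k, z, left, right = node
    left_idx = [i for i in idx if X[i][k] <= z]
    right_idx = [i for i in idx if X[i][k] > z]
    y_t = [y[i] for i in idx]
    y_l = [y[i] for i in left_idx]
    y_r = [y[i] for i in right_idx]
    decrease = (variance(y_t)
                - len(y_l) / len(y_t) * variance(y_l)
                - len(y_r) / len(y_t) * variance(y_r))
    mdi[k] += len(idx) / n * decrease
    for i in left_idx:
        f[k][i] += mean(y_l) - mean(y_t)
    for i in right_idx:
        f[k][i] += mean(y_r) - mean(y_t)
    walk(left, X, y, left_idx, n, mdi, f)
    walk(right, X, y, right_idx, n, mdi, f)

X = [(0.1, 0.2), (0.3, 0.9), (0.4, 0.7), (0.2, 0.1), (0.8, 0.3), (0.9, 0.6)]
y = [1.0, 3.0, 2.5, 0.5, 4.0, 5.0]
n, p = len(X), 2
mdi = [0.0] * p
f = [[0.0] * n for _ in range(p)]
walk(TREE, X, y, list(range(n)), n, mdi, f)

# Expression (12) for each feature k.
covariance_form = [sum(f[k][i] * y[i] for i in range(n)) / n for k in range(p)]
print(mdi, covariance_form)
```

For a sample that reaches an inner node, the bracket in Proposition 1 reduces to the mean of the child it falls into minus the mean of the node, which is what the two inner loops add to f[k][i].
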
For any tree,\nthose functions fT,k can be easily computed using the Python package treeinterpreter.\n\nIt can be shown that $\\sum_{i \\in D(T)} f_{T,k}(x_i) = 0$. That implies that $\\frac{1}{|D(T)|} \\sum_{i \\in D(T)} f_{T,k}(x_i) \\cdot y_i$ is essentially\nthe sample covariance between fT,k(xi) and yi on the bootstrapped dataset D(T ). This indicates a\npotential drawback of MDI: RFs use the training data D(T ) to construct the functions fT,k(\u00b7), then\nMDI uses the same data to evaluate the covariance between yi and fT,k(xi) in Equation (12).\n\nRemark: So far we have only considered regression trees, and have defined the impurity at a node\nt using the sample variance of responses. For classification trees, one may use Gini index as the\nmeasure of impurity. We point out that these two definitions of impurity are actually equivalent when\nwe use a one-hot vector to represent the categorical response. Therefore, our results are directly\napplicable to the classification case. Suppose that Y is a categorical variable which can take D values\nc1, c2, . . . , cD. Let pd = P(Y = cd). Then the Gini index of Y is Gini(Y ) = $\\sum_{d=1}^{D} p_d (1 - p_d)$. We\ndefine the one-hot encoding of Y as a D-dimensional vector \u02dcY = (1(Y = c1), . . . , 1(Y = cD)).\nThen\n\n$$\\mathrm{Var}(\\tilde{Y}) = \\mathbb{E} \\| \\tilde{Y} - \\mathbb{E} \\tilde{Y} \\|_2^2 = \\sum_{d=1}^{D} (\\mathbb{E} \\tilde{Y}_d^2 - (\\mathbb{E} \\tilde{Y}_d)^2) = \\sum_{d=1}^{D} (\\mathbb{E} \\tilde{Y}_d - (\\mathbb{E} \\tilde{Y}_d)^2) = \\sum_{d=1}^{D} p_d (1 - p_d) = \\mathrm{Gini}(Y), \\qquad (13)$$\n\nthereby showing that Gini index and variance are equivalent.\n\n3.2 Evaluating MDI using out-of-bag samples\n\nProposition 1 suggests that we can calculate the covariance between yi and fT,k(xi) in Equation (12)\nusing the out-of-bag samples D\\D(T ):\n\n$$\\text{MDI-oob of feature } k = \\frac{1}{|D \\setminus D(T)|} \\sum_{i \\in D \\setminus D(T)} f_{T,k}(x_i) \\cdot y_i. \\qquad (14)$$\n\nFigure 1: MDI against min leaf size.\n\nFigure 2: MDI against tree depth.\n\nFigure 3: MDI-oob against min leaf size.\n\nIn other words, for each tree, we calculate fT,k(xi) for all the OOB samples xi and then compute\nMDI-oob using (14). Although out-of-bag samples have been used for other feature importance\nmeasures such as MDA, to the best of the authors\u2019 knowledge, there are few results that use the\nout-of-bag samples to evaluate MDI feature importance. A naive way of using the out-of-bag samples\nto evaluate MDI is to directly compute the impurity decrease at each inner node of a tree using OOB\nsamples. However, this approach is not desirable since the impurity decrease at each node is still\nalways positive unless the responses of all the OOB samples falling into a node are constant. In this\ncase, an argument similar to the proof of Theorem 1 can show that the bias of directly computing\nimpurity using OOB samples could still be large for deep trees. The idea of MDI-oob depends\nheavily on the new analytical MDI expression. Without the new expression, it is not clear how one\ncan use out-of-bag samples to get a better estimate of MDI. One highlight of MDI-oob is its\nlow computation cost. 
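
Equation (14) can be sketched for a single tree as follows. This is a minimal illustration rather than the released implementation: the tuple encoding of the tree, the function name mdi_oob, the stump and the toy data are hypothetical, and the sketch assumes that every inner node has in-bag samples in both children (as it would if the tree had been grown on the in-bag data).

```python
def mean(values):
    return sum(values) / len(values)

def mdi_oob(tree, X_in, y_in, X_oob, y_oob, n_features):
    # Node means are computed from the in-bag samples; f_{T,k} is then
    # evaluated on the out-of-bag (OOB) samples, as in Equation (14).
    scores = [0.0] * n_features

    def walk(node, in_idx, oob_idx):
        if node is None:
            return
        k, z, left, right = node
        in_left = [i for i in in_idx if X_in[i][k] <= z]
        in_right = [i for i in in_idx if X_in[i][k] > z]
        oob_left = [i for i in oob_idx if X_oob[i][k] <= z]
        oob_right = [i for i in oob_idx if X_oob[i][k] > z]
        mu_t = mean([y_in[i] for i in in_idx])
        mu_left = mean([y_in[i] for i in in_left])
        mu_right = mean([y_in[i] for i in in_right])
        for i in oob_left:
            scores[k] += (mu_left - mu_t) * y_oob[i]
        for i in oob_right:
            scores[k] += (mu_right - mu_t) * y_oob[i]
        walk(left, in_left, oob_left)
        walk(right, in_right, oob_right)

    walk(tree, list(range(len(X_in))), list(range(len(X_oob))))
    return [s / len(X_oob) for s in scores]

# Hypothetical stump: an inner node is (feature, threshold, left, right).
STUMP = (0, 0.5, None, None)
X_in = [(0.1,), (0.2,), (0.8,), (0.9,)]
y_in = [0.0, 0.0, 1.0, 1.0]

same_data = mdi_oob(STUMP, X_in, y_in, X_in, y_in, 1)
flipped = mdi_oob(STUMP, X_in, y_in, X_in, [1.0, 1.0, 0.0, 0.0], 1)
print(same_data, flipped)  # [0.25] [-0.25]
```

When the evaluation samples coincide with the in-bag samples, the score equals the usual MDI of the stump (0.25, the variance decrease at the root). When held-out responses contradict the split, the score becomes negative, which the in-bag MDI can never be.
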
The time complexity of evaluating MDI-oob for RFs is roughly the same as\ncomputing the RF predictions for |D\\D(T )| number of samples.\n\n4 Simulation experiments2\n\nSimulated study on the effect of minimum leaf size and the tree depth\n\nIn this simulation, we investigate the empirical relationship between MDI importance and the\nminimum leaf size. To mimic the major experiment setting in the paper [29], we generate the\ndata as follows. We sample n = 200 observations, each containing 5 features. The \ufb01rst feature is\ngenerated from standard Gaussian distribution. The second feature is generated from a Bernoulli\ndistribution with p = 0.5. The third/fourth/\ufb01fth features have 4/10/20 categories respectively with\nequal probability of taking any states. The response label y is generated from a Bernoulli distribution\nsuch that P (yi = 1) = (1 + xi2)/3. While keeping the number of trees to be 300, we vary the\nminimum leaf size of RF from 1 to 50 and record the MDI of every feature. The results are shown\nin Fig. 1. We can see from this \ufb01gure that the MDI of noisy features, namely X1, X3, X4 and X5,\ndrops signi\ufb01cantly when the minimum leaf size increases from 1 to 50. This observation supports our\ntheoretical result in Theorem 1. Besides the minimum leaf size, we also investigate the relationship\nbetween MDI and the tree depth. As tree depth increases, the minimum leaf size generally decreases\nexponentially. Therefore, we expect the MDI of noisy features to become larger for increasing tree\ndepth. We vary the maximum depth from 1 to 20 and record the MDI of every feature. The results\nshown in Fig. 2 are consistent with our expectation. MDI importance of noisy features increase when\nthe tree depth increases from 1 to 20. Fig. 
3 shows the MDI-oob measure and it indeed reduces the\nbias of MDI in this simulation.\n\n2 The source code is available at https://github.com/shifwang/paper-debiased-feature-importance\n\nNoisy feature identification using the simulated data\n\nIn this experiment, we evaluate different feature importance measures in terms of their abilities\nto identify noisy features in a simulated data set. We compare our method with the following\nmethods: MDA, cforest in the R package party, SHAP [17], default feature importance (MDI) in\nscikit-learn, the impurity corrected Gini importance in the R package ranger, UFI in [34], and\nnaive-oob, which refers to the naive method that evaluates impurity decrease at each node using\nout-of-bag samples directly. To evaluate feature importance measures, we generate the following\nsimulated data. Inspired by the experiment settings in Strobl et al. [29], our setting involves discrete\nfeatures with different numbers of distinct values, which poses a critical challenge for MDI. The data\nhas 1000 samples with 50 features. All features are discrete, with the jth feature containing j + 1\ndistinct values 0, 1, . . . , j. We randomly select a set S of 5 features from the first ten as relevant\nfeatures. The remaining features are noisy features. Choosing active features with fewer categories\nrepresents the most challenging case for MDI. All samples are i.i.d. and all features are independent.\nWe generate the outcomes using the following rules:\n\u2022 Classification: $P(Y = 1 | X) = \\mathrm{Logistic}\\left(\\frac{2}{5} \\sum_{j \\in S} X_j / j - 1\\right)$.\n\u2022 Regression: $Y = \\frac{1}{5} \\sum_{j \\in S} X_j / j + \\epsilon$, where $\\epsilon \\sim N\\left(0, 100 \\cdot \\mathrm{Var}\\left(\\frac{1}{5} \\sum_{j \\in S} X_j / j\\right)\\right)$.\n\nTreating the noisy features as label 0 and the relevant features as label 1, we can evaluate a feature\nimportance measure in terms of its area under the receiver operating characteristic curve (AUC). 
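
The evaluation criterion can be sketched in a few lines of Python. The function name auc and the toy importance vectors below are hypothetical; the sketch uses the standard pairwise (Mann-Whitney) form of the AUC, with ties counted as one half, rather than any particular library routine.

```python
def auc(importances, is_relevant):
    # Fraction of (relevant, noisy) feature pairs in which the relevant
    # feature receives the higher importance; ties count as one half.
    relevant = [s for s, r in zip(importances, is_relevant) if r]
    noisy = [s for s, r in zip(importances, is_relevant) if not r]
    wins = 0.0
    for a in relevant:
        for b in noisy:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(relevant) * len(noisy))

labels = [1, 1, 0, 0, 0]  # 1 = relevant feature, 0 = noisy feature
good = auc([0.9, 0.7, 0.1, 0.3, 0.2], labels)
bad = auc([0.1, 0.2, 0.9, 0.8, 0.7], labels)
mixed = auc([0.9, 0.2, 0.1, 0.3, 0.5], labels)
print(good, bad, mixed)
```

Here good is 1.0 because both relevant features outrank every noisy one, bad is 0.0 because the ranking is fully reversed, and mixed is 4/6 because four of the six (relevant, noisy) pairs are ordered correctly.
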
Note\nthat when a feature importance measure gives low importance to relevant features, its AUC score\nmeasure can be smaller than 0.5 or even 0. We grow 100 trees with the minimum leaf size set to\neither 100 (shallow tree case) or 1 (deep tree case). The number of candidate features mtry is set to\nbe 10. We repeat the whole process 40 times and report the average AUC scores for each method in\nTable 1. The boxplots For this simulated setting, MDI-oob achieves the best AUC score under all\ncases.\n\nNoisy feature identi\ufb01cation using a genomic ChIP dataset\n\nTo evaluate our method MDI-oob in a more realistic setting, we consider a ChIP-chip and ChIP-seq\ndataset measuring the enrichment of 80 biomolecules at 3912 regions of the Drosophila genome\n[5, 18]. These data have previously been used in conjunction with RF-based methods, namely\niterative random forests (iRF) [1], to predict functional labels associated with genomic regions. They\nprovide a realistic representation of many issues encountered in practice, such as heterogeneity and\ndependencies among features, which make it especially challenging for feature selection problems.\nTo evaluate feature selection in the ChIP data, we scale each feature Xj to be between 0 and 1.\nSecond, we randomly select a set S of 5 features as relevant features and include the rest as noisy\nfeatures. We randomly permute values of any noisy features to break their dependencies with relevant\nfeatures. By this means, we avoid the cases where RFs \"think\" some features are important not\nbecause they themselves are important but because they are highly correlated with other relevant\nfeatures. 
Then we generate responses using the following rules:\n\n\u2022 Classification: $P(Y = 1 \\mid X) = \\mathrm{Logistic}\\left(\\frac{2}{5} \\sum_{j \\in S} X_j - 1\\right)$.\n\u2022 Regression: $Y = \\frac{1}{5} \\sum_{j \\in S} X_j + \\epsilon$, where $\\epsilon \\sim N\\left(0,\\, 100 \\cdot \\mathrm{Var}\\left(\\frac{1}{5} \\sum_{j \\in S} X_j\\right)\\right)$.\n\nAll the other settings remain the same as in the previous simulations. We report the average AUC\nscores for each method in Table 1. The standard errors and the beeswarm plots of all the methods\nare included in the Appendix. Naive-oob, namely the method that directly computes MDI using the\nout-of-bag samples, is hardly any better than the original Gini importance. MDI-oob or UFI\nachieves the best AUC score in three out of four cases; the exception is shallow regression trees, where\nall methods appear to be equally good with AUC scores close to 1. Although UFI and MDI-oob\nuse out-of-bag samples in different ways, their results are generally comparable. We also note that\nincreasing the minimum leaf size consistently improves the AUC scores of all methods.\nAnother observation is that MDA behaves poorly in some simulations despite its use of a validation\nset. This could be due to the low signal-to-noise ratio in the simulation setting. After we train the\nRF model on the training set, we evaluate the model\u2019s accuracy on a test set. It turns out that the\naccuracy of the model is quite low. In that case, MDA struggles because the accuracy difference\nbetween permuting a relevant feature and permuting a noisy feature is small. We observe that\nMDA improves when we increase the signal-to-noise ratio.\nThe computation time of different methods is hard to compare due to a few factors. Because\npackages such as scikit-learn and ranger compute feature importance when constructing the\ntrees, it is hard to disentangle the time taken to construct the trees and the time taken to get the feature\nimportance. 
Furthermore, different packages are implemented in different programming languages,\nso it is not clear whether a time difference is due to the algorithm or to the language. We\nimplement MDI-oob in Python, and for our first simulated classification setting, MDI-oob takes \u223c3.8\nseconds for each run. For comparison, scikit-learn, which uses Cython (a C extension for Python),\ntakes \u223c1.4 seconds to construct the RFs for each run. Thus, MDI-oob runs in a reasonable time\nframe and we expect it to be faster if it is implemented in C or C++.\n\nTable 1: Average AUC scores for noisy feature identification\n\nMethod    | Shallow tree (min leaf size = 100)     | Deep tree (min leaf size = 1)\n          | Simulated         | ChIP              | Simulated         | ChIP\n          | C        R        | C        R        | C        R        | C        R\nMDI-oob   | 0.75*    0.58*    | 0.94*    0.98     | 0.76*    0.52     | 0.87     0.98\nUFI       | 0.75*    0.56     | 0.94*    0.98     | 0.72     0.54*    | 0.88*    0.99*\nnaive-oob | 0.60     0.39     | 0.89     0.97     | 0.18     0.10     | 0.67     0.71\nSHAP      | 0.68     0.46     | 0.91     0.97     | 0.55     0.33     | 0.82     0.96\nranger    | 0.55     0.49     | 0.76     0.99*    | 0.56     0.50     | 0.73     0.97\nMDA       | 0.50     0.58*    | 0.50     0.99*    | 0.49     0.51     | 0.54     0.97\ncforest   | 0.70     0.49     | 0.90     0.98     | 0.65     0.50     | 0.79     0.93\nMDI       | 0.63     0.40     | 0.88     0.97     | 0.12     0.09     | 0.60     0.71\n\n\"C\" stands for classification, \"R\" stands for regression. The column maximum is marked with *.\n\n5 Discussion and future directions\n\nMean Decrease Impurity (MDI) is widely used to assess feature importance and its bias in feature\nselection is well known. Based on the original definition of MDI, we show that its expected bias is\nupper bounded by an expression that is inversely proportional to the minimum leaf size under mild\nconditions, which means deep trees generally have a higher feature selection bias than shallow trees.\nTo reduce the bias, we derive a new analytical expression for MDI and use the new expression to\nobtain MDI-oob. 
For the simulated data and a genomic ChIP dataset, MDI-oob exhibits\nstate-of-the-art feature selection performance in terms of AUC scores.\nComparison to SHAP. SHAP originates from game theory and offers a novel perspective to analyze\nthe existing methods. While it is desirable to have \u2018consistency, missingness and local accuracy\u2019, our\nanalysis indicates that there are other theoretical properties that are also worth taking into account.\nAs shown in our simulation, the feature selection bias of SHAP increases with the depth of the tree,\nand we believe SHAP can also use OOB samples to improve feature selection performance.\nRelationship to honest estimation. Honest estimation is an important technique built on the core\nnotion of sample splitting. It has been successfully used in causal inference and other areas to mitigate\nthe concern of over-fitting in complex learners due to the use of the same data in different stages of training.\nThe proposed algorithm MDI-oob has important connections with \"honest sampling\" or \"honest\nestimation\". For example, Breiman et al. [3] proposed to use a separate validation set for\npruning and uncertainty estimation. In [31], each within-leaf prediction is estimated using a different\nsub-sample (such as the OOB sample) from the one used to decide split points. The theoretical results of\nthese papers and Proposition 1 of our paper convey the same message: finite sample bias is\ncaused by using the same data for growing trees and for estimation, and the bias can be reduced if we\nleverage OOB data. We believe the theoretical contributions of those papers can also help us analyze\nthe statistical properties (such as variance) of MDI-oob.\nFuture directions. Although MDI-oob shows promising results for selecting relevant features, it\nalso raises many interesting questions to be considered in the future. 
First of all, how can MDI-oob be\nextended to better accommodate correlated features? Going beyond feature selection, can importance\nmeasures also rank the relevant features in a reasonable order? Finally, can we use the new analytical\nexpression of MDI to give a tighter theoretical bound for MDI\u2019s feature selection bias? We are\nexploring these interesting questions in our ongoing work.\n\nAcknowledgements\n\nThe authors would like to thank Merle Behr and Raaz Dwivedi from University of California, Berkeley\nfor their very helpful comments on this paper, which greatly improved its presentation. Partial support\nis gratefully acknowledged from ARO grant W911NF1710005, ONR grant N00014-16-1-2664,\nNSF grants DMS-1613002 and IIS 1741340, and the Center for Science of Information (CSoI), a US\nNSF Science and Technology Center, under grant agreement CCF-0939370.\n\nReferences\n\n[1] Sumanta Basu, Karl Kumbier, James B. Brown, and Bin Yu. Iterative random forests to discover predictive and stable high-order interactions. Proceedings of the National Academy of Sciences, 115(8):1943\u20131948, 2018.\n\n[2] Leo Breiman. Random Forests. Machine Learning, 45:1\u201333, 2001.\n\n[3] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and regression trees. Chapman and Hall/CRC, 1984.\n\n[4] Carolin Strobl, Torsten Hothorn, and Achim Zeileis. Party on! A New, Conditional Variable-Importance Measure for Random Forests Available in the party Package. The R Journal, 1/2:14\u201317, 2009.\n\n[5] Susan E Celniker, Laura AL Dillon, Mark B Gerstein, Kristin C Gunsalus, Steven Henikoff, Gary H Karpen, Manolis Kellis, Eric C Lai, Jason D Lieb, David M MacAlpine, et al. Unlocking the secrets of the genome. Nature, 459(7249):927, 2009.\n\n[6] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. 
In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785\u2013794, 2016.\n\n[7] R Diaz-Uriarte and S de Andr\u00e9s. Gene Selection and Classification of Microarray Data Using Random Forest. BMC Bioinformatics, 7, 2006.\n\n[8] T Hothorn, K Hornik, and A Zeileis. Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15, 2006.\n\n[9] V\u00e2n Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE, 5(9), 2010.\n\n[10] Silke Janitza, Ender Celik, and Anne Laure Boulesteix. A computationally fast variable importance test for random forests for high-dimensional data. Advances in Data Analysis and Classification, 12(4):1\u201331, 2016.\n\n[11] Jalil Kazemitabar, Arash Amini, Adam Bloniarz, and Ameet S Talwalkar. Variable importance using decision trees. In Advances in Neural Information Processing Systems, pages 426\u2013435, 2017.\n\n[12] Karl Kumbier, Sumanta Basu, James B Brown, Susan Celniker, and Bin Yu. Refining interaction search through signed iterative random forests. arXiv preprint arXiv:1810.07287, 2018.\n\n[13] Jae Won Lee, Jung Bok Lee, Mira Park, and Seuck Heun Song. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics and Data Analysis, 48(4):869\u2013885, 2005.\n\n[14] Wei-Yin Loh. Fifty years of classification and regression trees. International Statistical Review, 82(3):329\u2013348, 2014.\n\n[15] Gilles Louppe. Understanding random forests: From theory to practice. arXiv preprint arXiv:1407.7502, 2014.\n\n[16] Gilles Louppe, Louis Wehenkel, Antonio Sutera, and Pierre Geurts. Understanding variable importances in forests of randomized trees. 
In Advances in Neural Information Processing Systems 26, pages 431\u2013439, 2013.\n\n[17] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. Consistent Individualized Feature Attribution for Tree Ensembles. ArXiv e-prints arXiv:1802.03888, 2018.\n\n[18] Stewart MacArthur, Xiao-Yong Li, Jingyi Li, James B Brown, Hou Cheng Chu, Lucy Zeng, Brandi P Grondona, Aaron Hechmer, Lisa Simirenko, and Soile VE Ker\u00e4nen. Developmental roles of 21 Drosophila transcription factors are determined by quantitative differences in binding to an overlapping set of thousands of genomic regions. Genome Biology, 10(7):1, 2009.\n\n[19] W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. Interpretable machine learning: definitions, methods, and applications. ArXiv e-prints, pages 1\u201311, 2019.\n\n[20] Stefano Nembrini, Inke R. K\u00f6nig, and Marvin N. Wright. The revival of the Gini importance? Bioinformatics, 34(21):3711\u20133718, 2018.\n\n[21] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, O. Grisel, M. Blondel, B. Prettenhofer, R. Weiss, and V. Dubourg. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825\u20132830, 2011.\n\n[22] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. \u201cWhy Should I Trust You?\u201d Explaining the Predictions of Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD \u201916, 2016.\n\n[23] Wendy Rodenburg, A. Geert Heidema, M. A. Jolanda Boer, I. M. Ingeborg Bovee-Oudenhoven, J. M. Edith Feskens, C. M. Edwin Mariman, and Jaap Keijer. A Framework to Identify Physiological Responses in Microarray Based Gene Expression Studies: Selection and Interpretation of Biologically Relevant Genes. Physiological Genomics, 33, 2008.\n\n[24] Ando Saabas. Interpreting random forests, 2014.\n\n[25] Marco Sandri and Paola Zuccolotto. 
A bias correction algorithm for the Gini variable importance measure in classification trees. Journal of Computational and Graphical Statistics, 17(3):611\u2013628, 2008.\n\n[26] Erwan Scornet, Gerard Biau, and Jean Philippe Vert. Consistency of random forests. Annals of Statistics, 43(4):1716\u20131741, 2015.\n\n[27] C Strobl, A L Boulesteix, and T Augustin. Unbiased Split Selection for Classification Trees Based on the Gini Index. Computational Statistics & Data Analysis, 52, 2007.\n\n[28] Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307, 2008.\n\n[29] Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 2007.\n\n[30] Erik \u0160trumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647\u2013665, 2014.\n\n[31] Stefan Wager and Susan Athey. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Journal of the American Statistical Association, 1459:1\u201315, 2018.\n\n[32] Pengfei Wei, Zhenzhou Lu, and Jingwen Song. Variable importance analysis: A comprehensive review. Reliability Engineering and System Safety, 142:399\u2013432, 2015.\n\n[33] Marvin Wright and Andreas Ziegler. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, Articles, 77(1):1\u201317, 2017.\n\n[34] Zhengze Zhou and Giles Hooker. Unbiased measurement of feature importance in tree-based methods. 
arXiv preprint arXiv:1903.05179, 2019.", "award": [], "sourceid": 4403, "authors": [{"given_name": "Xiao", "family_name": "Li", "institution": "University of California, Berkeley"}, {"given_name": "Yu", "family_name": "Wang", "institution": "UC Berkeley"}, {"given_name": "Sumanta", "family_name": "Basu", "institution": "Cornell University"}, {"given_name": "Karl", "family_name": "Kumbier", "institution": "University of California, Berkeley"}, {"given_name": "Bin", "family_name": "Yu", "institution": "UC Berkeley"}]}