{"title": "Understanding variable importances in forests of randomized trees", "book": "Advances in Neural Information Processing Systems", "page_first": 431, "page_last": 439, "abstract": "Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a three-level decomposition of the information jointly provided by all input variables about the output in terms of i) the MDI importance of each input variable, ii) the degree of interaction of a given input variable with the other input variables, iii) the different interaction terms of a given degree. We then show that this MDI importance of a variable is equal to zero if and only if the variable is irrelevant and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees.", "full_text": "Understanding variable importances\n\nin forests of randomized trees\n\nGilles Louppe, Louis Wehenkel, Antonio Sutera and Pierre Geurts\n\n{g.louppe, l.wehenkel, a.sutera, p.geurts}@ulg.ac.be\n\nDept. of EE & CS, University of Li`ege, Belgium\n\nAbstract\n\nDespite growing interest and practical use in various scienti\ufb01c areas, variable im-\nportances derived from tree-based ensemble methods are not well understood from\na theoretical point of view. In this work we characterize the Mean Decrease Im-\npurity (MDI) variable importances as measured by an ensemble of totally ran-\ndomized trees in asymptotic sample and ensemble size conditions. 
We derive a\nthree-level decomposition of the information jointly provided by all input vari-\nables about the output in terms of i) the MDI importance of each input variable, ii)\nthe degree of interaction of a given input variable with the other input variables,\niii) the different interaction terms of a given degree. We then show that this MDI\nimportance of a variable is equal to zero if and only if the variable is irrelevant\nand that the MDI importance of a relevant variable is invariant with respect to\nthe removal or the addition of irrelevant variables. We illustrate these properties\non a simple example and discuss how they may change in the case of non-totally\nrandomized trees such as Random Forests and Extra-Trees.\n\n1 Motivation\n\nAn important task in many scienti\ufb01c \ufb01elds is the prediction of a response variable based on a set\nof predictor variables. In many situations though, the aim is not only to make the most accurate\npredictions of the response but also to identify which predictor variables are the most important\nto make these predictions, e.g.\nin order to understand the underlying process. Because of their\napplicability to a wide range of problems and their capability to both build accurate models and,\nat the same time, to provide variable importance measures, Random Forests (Breiman, 2001) and\nvariants such as Extra-Trees (Geurts et al., 2006) have become a major data analysis tool used with\nsuccess in various scienti\ufb01c areas.\nDespite their extensive use in applied research, only a couple of works have studied the theoretical\nproperties and statistical mechanisms of these algorithms. Zhao (2000), Breiman (2004), Biau et al.\n(2008); Biau (2012), Meinshausen (2006) and Lin and Jeon (2006) investigated simpli\ufb01ed to very\nrealistic variants of these algorithms and proved the consistency of those variants. 
Little is known\nhowever regarding the variable importances computed by Random Forest-like algorithms, and \u2013\nas far as we know \u2013 the work of Ishwaran (2007) is indeed the only theoretical study of tree-based\nvariable importance measures.\nIn this work, we aim at \ufb01lling this gap and present a theoretical analysis of the Mean Decrease\nImpurity importance derived from ensembles of randomized trees. The rest of the paper is organized\nas follows: in section 2, we provide the background about ensembles of randomized trees and recall\nhow variable importances can be derived from them; in section 3, we then derive a characterization\nin asymptotic conditions and show how variable importances derived from totally randomized trees\noffer a three-level decomposition of the information jointly contained in the input variables about the\noutput; section 4 shows that this characterization only depends on the relevant variables and section 5\ndiscusses these ideas in the context of variants closer to the Random Forest algorithm; section 6\nthen illustrates all these ideas on an arti\ufb01cial problem; \ufb01nally, section 7 includes our conclusions\nand proposes directions for future work.\n\n2 Background\n\nIn this section, we \ufb01rst describe decision trees, as well as forests of randomized trees. Then, we\ndescribe the two major variable importance measures derived from them \u2013 including the Mean\nDecrease Impurity (MDI) importance that we will study in the subsequent sections.\n\n2.1 Single classi\ufb01cation and regression trees and random forests\n\nA binary classi\ufb01cation (resp. regression) tree (Breiman et al., 1984) is an input-output model\nrepresented by a tree structure T , from a random input vector (X1, ..., Xp) taking its values in\nX1 \u00d7 ... \u00d7 Xp = X to a random output variable Y \u2208 Y . Any node t in the tree represents a subset\nof the space X , with the root node being X itself. 
Internal nodes t are labeled with a binary test\n(or split) st = (Xm < c) dividing their subset in two subsets1 corresponding to their two children\ntL and tR, while the terminal nodes (or leaves) are labeled with a best guess value of the output\nvariable2. The predicted output \u02c6Y for a new instance is the label of the leaf reached by the instance\nwhen it is propagated through the tree. A tree is built from a learning sample of size N drawn from\nP (X1, ..., Xp, Y ) using a recursive procedure which identi\ufb01es at each node t the split st = s\u2217 for\nwhich the partition of the Nt node samples into tL and tR maximizes the decrease\n\n\u2206i(s, t) = i(t) \u2212 pLi(tL) \u2212 pRi(tR)\n\n(1)\n\nof some impurity measure i(t) (e.g., the Gini index, the Shannon entropy, or the variance of Y ),\nand where pL = NtL/Nt and pR = NtR /Nt. The construction of the tree stops , e.g., when nodes\nbecome pure in terms of Y or when all variables Xi are locally constant.\nSingle trees typically suffer from high variance, which makes them not competitive in terms of\naccuracy. A very ef\ufb01cient and simple way to address this \ufb02aw is to use them in the context of\nrandomization-based ensemble methods. Speci\ufb01cally, the core principle is to introduce random\nperturbations into the learning procedure in order to produce several different decision trees from\na single learning set and to use some aggregation technique to combine the predictions of all these\ntrees. In Bagging (Breiman, 1996), trees are built on random bootstrap copies of the original data,\nhence producing different decision trees. In Random Forests (Breiman, 2001), Bagging is extended\nand combined with a randomization of the input variables that are used when considering candidate\nvariables to split internal nodes t. 
In particular, instead of looking for the best split s\u2217 among all\nvariables, the Random Forest algorithm selects, at each node, a random subset of K variables and\nthen determines the best split over these latter variables only.\n\n2.2 Variable importances\n\nIn the context of ensembles of randomized trees, Breiman (2001, 2002) proposed to evaluate the\nimportance of a variable Xm for predicting Y by adding up the weighted impurity decreases\np(t)\u2206i(st, t) for all nodes t where Xm is used, averaged over all NT trees in the forest:\n\n$$\\mathrm{Imp}(X_m) = \\frac{1}{N_T} \\sum_{T} \\sum_{t \\in T : v(s_t) = X_m} p(t) \\Delta i(s_t, t) \\tag{2}$$\n\nwhere p(t) is the proportion Nt/N of samples reaching t and v(st) is the variable used in split\nst. When using the Gini index as impurity function, this measure is known as the Gini importance or\nMean Decrease Gini. However, since it can be de\ufb01ned for any impurity measure i(t), we will refer\nto Equation 2 as the Mean Decrease Impurity importance (MDI), no matter the impurity measure\ni(t). We will characterize and derive results for this measure in the rest of this text.\n\n1 More generally, splits are de\ufb01ned by a (not necessarily binary) partition of the range Xm of possible values\nof a single variable Xm.\n\n2 E.g., determined as the majority class j(t) (resp., the average value \u0233(t)) within the subset of the leaf t.\n\nIn addition to MDI, Breiman (2001, 2002) also proposed to evaluate the importance of a variable\nXm by measuring the Mean Decrease Accuracy (MDA) of the forest when the values of Xm are\nrandomly permuted in the out-of-bag samples. For that reason, this latter measure is also known as\nthe permutation importance.\nThanks to popular machine learning software packages (Breiman, 2002; Liaw and Wiener, 2002; Pedregosa\net al., 2011), both of these variable importance measures have shown their practical utility in an\nincreasing number of experimental studies. 
Little is known however regarding their inner workings.\nStrobl et al. (2007) compare both MDI and MDA and show experimentally that the former is biased\ntowards some predictor variables. As explained by White and Liu (1994) in case of single decision\ntrees, this bias stems from an unfair advantage given by the usual impurity functions i(t) towards\npredictors with a large number of values. Strobl et al. (2008) later showed that MDA is biased as\nwell, and that it overestimates the importance of correlated variables \u2013 although this effect was not\ncon\ufb01rmed in a later experimental study by Genuer et al. (2010). From a theoretical point of view,\nIshwaran (2007) provides a detailed theoretical development of a simpli\ufb01ed version of MDA, giving\nkey insights for the understanding of the actual MDA.\n\n3 Variable importances derived from totally randomized tree ensembles\n\nLet us now consider the MDI importance as de\ufb01ned by Equation 2, and let us assume a set V =\n{X1, ..., Xp} of categorical input variables and a categorical output Y . For the sake of simplicity\nwe will use the Shannon entropy as impurity measure, and focus on totally randomized trees; later\non we will discuss other impurity measures and tree construction algorithms.\nGiven a training sample L of N joint observations of X1, ..., Xp, Y independently drawn from the\njoint distribution P (X1, ..., Xp, Y ), let us assume that we infer from it an in\ufb01nitely large ensemble\nof totally randomized and fully developed trees.\nIn this setting, a totally randomized and fully\ndeveloped tree is de\ufb01ned as a decision tree in which each node t is partitioned using a variable Xi\npicked uniformly at random among those not yet used at the parent nodes of t, and where each t is\nsplit into |Xi| sub-trees, i.e., one for each possible value of Xi, and where the recursive construction\nprocess halts only when all p variables have been used along the current branch. 
Hence, in such a\ntree, leaves are all at the same depth p, and the set of leaves of a fully developed tree is in bijection\nwith the set X of all possible joint con\ufb01gurations of the p input variables. For example, if the input\nvariables are all binary, the resulting tree will have exactly 2^p leaves.\nTheorem 1. The MDI importance of Xm \u2208 V for Y as computed with an in\ufb01nite ensemble of fully\ndeveloped totally randomized trees and an in\ufb01nitely large training sample is:\n\n$$\\mathrm{Imp}(X_m) = \\sum_{k=0}^{p-1} \\frac{1}{C_p^k} \\frac{1}{p-k} \\sum_{B \\in \\mathcal{P}_k(V^{-m})} I(X_m; Y \\mid B), \\tag{3}$$\n\nwhere V \u2212m denotes the subset V \\ {Xm}, Pk(V \u2212m) is the set of subsets of V \u2212m of cardinality k,\nand I(Xm; Y |B) is the conditional mutual information of Xm and Y given the variables in B.\n\nProof. See Appendix B.\nTheorem 2. For any ensemble of fully developed trees in asymptotic learning sample size conditions\n(e.g., in the same conditions as those of Theorem 1), we have that\n\n$$\\sum_{m=1}^{p} \\mathrm{Imp}(X_m) = I(X_1, \\ldots, X_p; Y). \\tag{4}$$\n\nProof. See Appendix C.\n\nTogether, Theorems 1 and 2 show that variable importances derived from totally randomized trees\nin asymptotic conditions provide a three-level decomposition of the information I(X1, . . . , Xp; Y )\ncontained in the set of input variables about the output variable. 
The \ufb01rst level is a decomposition\namong input variables (see Equation 4 of Theorem 2), the second level is a decomposition along the\ndegrees k of interaction terms of a variable with the other ones (see the outer sum in Equation 3 of\nTheorem 1), and the third level is a decomposition along the combinations B of interaction terms of\n\ufb01xed size k of possible interacting variables (see the inner sum in Equation 3).\nWe observe that the decomposition includes, for each variable, each and every interaction term\nof each and every degree weighted in a fashion resulting only from the combinatorics of possible\ninteraction terms. In particular, since all I(Xm; Y |B) terms are at most equal to H(Y ), the prior\nentropy of Y , the p terms of the outer sum of Equation 3 are each upper bounded by\n\n$$\\frac{1}{C_p^k} \\frac{1}{p-k} \\sum_{B \\in \\mathcal{P}_k(V^{-m})} H(Y) = \\frac{1}{C_p^k} \\frac{1}{p-k} C_{p-1}^k H(Y) = \\frac{1}{p} H(Y).$$\n\nAs such, the second level decomposition resulting from totally randomized trees makes the p sub-importance\nterms $\\frac{1}{C_p^k} \\frac{1}{p-k} \\sum_{B \\in \\mathcal{P}_k(V^{-m})} I(X_m; Y \\mid B)$ contribute equally (at most) to the total\nimportance, even though they each include a combinatorially different number of terms.\n\n4 Importances of relevant and irrelevant variables\n\nFollowing Kohavi and John (1997), let us de\ufb01ne as relevant to Y with respect to V a variable Xm for\nwhich there exists at least one subset B \u2286 V (possibly empty) such that I(Xm; Y |B) > 0 (see footnote 3). Thus we\nde\ufb01ne as irrelevant to Y with respect to V a variable Xi for which, for all B \u2286 V , I(Xi; Y |B) = 0.\nNote that if Xi is irrelevant to Y with respect to V , then by de\ufb01nition it is also irrelevant to Y\nwith respect to any subset of V .\nTheorem 3. 
Xi \u2208 V is irrelevant to Y with respect to V if and only if its in\ufb01nite sample size\nimportance as computed with an in\ufb01nite ensemble of fully developed totally randomized trees built\non V for Y is 0.\n\nProof. See Appendix D.\nLemma 4. Let Xi \u2209 V be an irrelevant variable for Y with respect to V . The in\ufb01nite sample size\nimportance of Xm \u2208 V as computed with an in\ufb01nite ensemble of fully developed totally randomized\ntrees built on V for Y is the same as the importance derived when using V \u222a {Xi} to build the\nensemble of trees for Y .\n\nProof. See Appendix E.\nTheorem 5. Let VR \u2286 V be the subset of all variables in V that are relevant to Y with respect\nto V . The in\ufb01nite sample size importance of any variable Xm \u2208 VR as computed with an in\ufb01nite\nensemble of fully developed totally randomized trees built on VR for Y is the same as its importance\ncomputed in the same conditions by using all variables in V . That is:\n\n$$\\mathrm{Imp}(X_m) = \\sum_{k=0}^{p-1} \\frac{1}{C_p^k} \\frac{1}{p-k} \\sum_{B \\in \\mathcal{P}_k(V^{-m})} I(X_m; Y \\mid B) = \\sum_{l=0}^{r-1} \\frac{1}{C_r^l} \\frac{1}{r-l} \\sum_{B \\in \\mathcal{P}_l(V_R^{-m})} I(X_m; Y \\mid B) \\tag{5}$$\n\nwhere r is the number of relevant variables in VR.\n\nProof. See Appendix F.\n\nTheorems 3 and 5 show that the importances computed with an ensemble of totally randomized\ntrees depend only on the relevant variables. Irrelevant variables have a zero importance and do not\naffect the importance of relevant variables. Practically, we believe that such properties are desirable\nconditions for a sound criterion assessing the importance of a variable. 
Indeed, noise should not be\ncredited of any importance and should not make any other variable more (or less) important.\n\n3Among the relevant variables, we have the marginally relevant ones, for which I(Xm; Y ) > 0, the strongly\nrelevant ones, for which I(Xm; Y |V \u2212m) > 0, and the weakly relevant variables, which are relevant but not\nstrongly relevant.\n\n4\n\n\f5 Random Forest variants\n\nIn this section, we consider and discuss variable importances as computed with other types of en-\nsembles of randomized trees. We \ufb01rst show how our results extend to any other impurity measure,\nand then analyze importances computed by depth-pruned ensemble of randomized trees and those\ncomputed by randomized trees built on random subspaces of \ufb01xed size. Finally, we discuss the case\nof non-totally randomized trees.\n\n5.1 Generalization to other impurity measures\n\nAlthough our characterization in sections 3 and 4 uses Shannon entropy as the impurity measure,\nwe show in Appendix I that theorems 1, 3 and 5 hold for other impurity measures, simply substi-\ntuting conditional mutual information for conditional impurity reduction in the different formulas\nand in the de\ufb01nition of irrelevant variables. In particular, our results thus hold for the Gini index in\nclassi\ufb01cation and can be extended to regression problems using variance as the impurity measure.\n\n5.2 Pruning and random subspaces\n\nIn sections 3 and 4, we considered totally randomized trees that were fully developed, i.e. until all\np variables were used within each branch. When totally randomized trees are developed only up to\nsome smaller depth q \u2264 p, we show in Proposition 6 that the variable importances as computed by\nthese trees is limited to the q \ufb01rst terms of Equation 3. 
We then show in Proposition 7 that these\nlatter importances are actually the same as when each tree of the ensemble is fully developed over a\nrandom subspace (Ho, 1998) of q variables drawn prior to its construction.\nProposition 6. The importance of Xm \u2208 V for Y as computed with an in\ufb01nite ensemble of pruned\ntotally randomized trees built up to depth q \u2264 p and an in\ufb01nitely large training sample is:\n\n$$\\mathrm{Imp}(X_m) = \\sum_{k=0}^{q-1} \\frac{1}{C_p^k} \\frac{1}{p-k} \\sum_{B \\in \\mathcal{P}_k(V^{-m})} I(X_m; Y \\mid B) \\tag{6}$$\n\nProof. See Appendix G.\nProposition 7. The importance of Xm \u2208 V for Y as computed with an in\ufb01nite ensemble of pruned\ntotally randomized trees built up to depth q \u2264 p and an in\ufb01nitely large training sample is identical\nto the importance as computed for Y with an in\ufb01nite ensemble of fully developed totally randomized\ntrees built on random subspaces of q variables drawn from V .\n\nProof. See Appendix H.\nAs long as q \u2265 r (where r denotes the number of relevant variables in V ), it can easily be shown\nthat all relevant variables will still obtain a strictly positive importance, which will however differ\nin general from the importances computed by fully grown totally randomized trees built over all\nvariables. Also, each irrelevant variable of course keeps an importance equal to zero, which means\nthat, in asymptotic conditions, these pruning and random subspace methods would still allow us\nto identify the relevant variables, as long as we have a good upper bound q on r.\n\n5.3 Non-totally randomized trees\n\nIn our analysis in the previous sections, trees are built totally at random and hence do not directly\nrelate to those built in Random Forests (Breiman, 2001) or in Extra-Trees (Geurts et al., 2006). 
To\nbetter understand the importances as computed by those algorithms, let us consider a close variant\nof totally randomized trees: at each node t, let us instead draw uniformly at random 1 \u2264 K \u2264 p\nvariables and let us choose the one that maximizes \u2206i(t). Notice that, for K = 1, this procedure\namounts to building ensembles of totally randomized trees as de\ufb01ned before, while, for K = p, it\namounts to building classical single trees in a deterministic way.\nFirst, the importance of Xm \u2208 V as computed with an in\ufb01nite ensemble of such randomized trees\nis not the same as Equation 3. For K > 1, masking effects indeed appear: at t, some variables are\n\n5\n\n\fnever selected because there always is some other variable for which \u2206i(t) is larger. Such effects\ntend to pull the best variables at the top of the trees and to push the others at the leaves. As a result,\nthe importance of a variable no longer decomposes into a sum including all I(Xm; Y |B) terms.\nThe importance of the best variables decomposes into a sum of their mutual information alone or\nconditioned only with the best others \u2013 but not conditioned with all variables since they no longer\never appear at the bottom of trees. By contrast, the importance of the least promising variables\nnow decomposes into a sum of their mutual information conditioned only with all variables \u2013 but\nnot alone or conditioned with a couple of others since they no longer ever appear at the top of\ntrees. In other words, because of the guided structure of the trees, the importance of Xm now takes\ninto account only some of the conditioning sets B, which may over- or underestimate its overall\nrelevance.\nTo make things clearer, let us consider a simple example. 
Let X1 perfectly explain Y and let X2 be\na slightly noisy copy of X1 (i.e., I(X1; Y ) \u2248 I(X2; Y ), I(X1; Y |X2) = \u03b5 and I(X2; Y |X1) = 0).\nUsing totally randomized trees, the importances of X1 and X2 are nearly equal \u2013 the importance of\nX1 being slightly higher than the importance of X2:\n\n$$\\mathrm{Imp}(X_1) = \\frac{1}{2} I(X_1; Y) + \\frac{1}{2} I(X_1; Y \\mid X_2) = \\frac{1}{2} I(X_1; Y) + \\frac{\\varepsilon}{2}$$\n\n$$\\mathrm{Imp}(X_2) = \\frac{1}{2} I(X_2; Y) + \\frac{1}{2} I(X_2; Y \\mid X_1) = \\frac{1}{2} I(X_2; Y) + 0$$\n\nIn non-totally randomized trees, for K = 2, X1 is always selected at the root node and X2 is\nalways used in its children. Also, since X1 perfectly explains Y , all its children are pure and the\nreduction of entropy when splitting on X2 is null. As a result, ImpK=2(X1) = I(X1; Y ) and\nImpK=2(X2) = I(X2; Y |X1) = 0. Masking effects are here clearly visible: the true importance\nof X2 is masked by X1 as if X2 were irrelevant, while it is only a bit less informative than X1.\nAs a direct consequence of the example above, for K > 1, it is no longer true that a variable is\nirrelevant if and only if its importance is zero.\nIn the same way, it can also be shown that the\nimportances become dependent on the number of irrelevant variables. Let us indeed consider the\nfollowing counter-example: let us add in the previous example an irrelevant variable Xi with respect\nto {X1, X2} and let us keep K = 2. The probability of selecting X2 at the root node now becomes\npositive, which means that ImpK=2(X2) now includes I(X2; Y ) > 0 and is therefore strictly larger\nthan the importance computed before. For K \ufb01xed, adding irrelevant variables dampens masking\neffects, which thereby makes importances indirectly dependent on the number of irrelevant variables.\nIn conclusion, the importances as computed with totally randomized trees exhibit properties that,\nby extension, neither Random Forests nor Extra-Trees possess. 
With totally randomized trees, the\nimportance of Xm only depends on the relevant variables and is 0 if and only if Xm is irrelevant.\nAs we have shown, it may no longer be the case for K > 1. Asymptotically, the use of totally\nrandomized trees for assessing the importance of a variable may therefore be more appropriate. In\na \ufb01nite setting (i.e., a limited number of samples and a limited number of trees), guiding the choice\nof the splitting variables remains however a sound strategy. In such a case, I(Xm; Y |B) can be\nmeasured neither for all Xm nor for all B. It is therefore pragmatic to promote those that look the\nmost promising \u2013 even if the resulting importances may be biased.\n\n6 Illustration on a digit recognition problem\n\nIn this section, we consider the digit recognition problem of Breiman et al. (1984) for illustrating\nvariable importances as computed with totally randomized trees. We verify that they match our\ntheoretical developments and that they decompose as predicted. We also compare these importances\nwith those computed by an ensemble of non-totally randomized trees, as discussed in section 5.3,\nfor increasing values of K.\nLet us consider a seven-segment indicator displaying numerals using horizontal and vertical lights\nin on-off combinations, as illustrated in Figure 1. Let Y be a random variable taking its value in\n{0, 1, ..., 9} with equal probability and let X1, ..., X7 be binary variables whose values are each\ndetermined univocally given the corresponding value of Y in Table 1.\nSince Table 1 perfectly de\ufb01nes the data distribution, and given the small dimensionality of the problem,\nit is practicable to directly apply Equation 3 to compute variable importances. 
\n\nFigure 1: 7-segment display\n\ny   x1  x2  x3  x4  x5  x6  x7\n0   1   1   1   0   1   1   1\n1   0   0   1   0   0   1   0\n2   1   0   1   1   1   0   1\n3   1   0   1   1   0   1   1\n4   0   1   1   1   0   1   0\n5   1   1   0   1   0   1   1\n6   1   1   0   1   1   1   1\n7   1   0   1   0   0   1   0\n8   1   1   1   1   1   1   1\n9   1   1   1   1   0   1   1\n\nTable 1: Values of Y, X1, ..., X7\n\n     Eqn. 3  K = 1  K = 2  K = 3  K = 4  K = 5  K = 6  K = 7\nX1   0.412   0.414  0.362  0.327  0.309  0.304  0.305  0.306\nX2   0.581   0.583  0.663  0.715  0.757  0.787  0.801  0.799\nX3   0.531   0.532  0.512  0.496  0.489  0.483  0.475  0.475\nX4   0.542   0.543  0.525  0.484  0.445  0.414  0.409  0.412\nX5   0.656   0.658  0.731  0.778  0.810  0.827  0.831  0.835\nX6   0.225   0.221  0.140  0.126  0.122  0.122  0.121  0.120\nX7   0.372   0.368  0.385  0.392  0.387  0.382  0.375  0.372\n\u03a3    3.321   3.321  3.321  3.321  3.321  3.321  3.321  3.321\n\nTable 2: Variable importances as computed with an ensemble of randomized trees, for increasing values of K.\nImportances at K = 1 follow their theoretical values, as predicted by Equation 3 in Theorem 1. However, as K\nincreases, importances diverge due to masking effects. In accordance with Theorem 2, their sum is also always\nequal to I(X1, . . . , X7; Y ) = H(Y ) = log2(10) = 3.321 since the inputs allow the output to be predicted perfectly.\n\nTo verify our theoretical developments, we then compare in Table 2 variable importances as computed by Equation 3\nand those yielded by an ensemble of 10000 totally randomized trees (K = 1). Note that\ngiven the known structure of the problem, building trees on a sample of \ufb01nite size that perfectly\nfollows the data distribution amounts to building them on a sample of in\ufb01nite size. At best, trees\ncan thus be built on a 10-sample dataset, containing exactly one sample for each of the equiprobable\noutcomes of Y . 
As the table illustrates, the importances yielded by totally randomized trees match\nthose computed by Equation 3, which con\ufb01rms Theorem 1. Small differences stem from the fact\nthat a \ufb01nite number of trees were built in our simulations, but those discrepancies should disappear\nas the size of the ensemble grows towards in\ufb01nity. It also shows that importances indeed add up to\nI(X1, ..., X7; Y ), which con\ufb01rms Theorem 2. Regarding the actual importances, they indicate that\nX5 is stronger than all others, followed \u2013 in that order \u2013 by X2, X4 and X3 which also show large\nimportances. X1, X7 and X6 appear to be the least informative. The table also reports importances\nfor increasing values of K. As discussed before, we see that a large value of K yields importances\nthat can be either overestimated (e.g., at K = 7, the importances of X2 and X5 are larger than at\nK = 1) or underestimated due to masking effects (e.g., at K = 7, the importances of X1, X3, X4\nand X6 are smaller than at K = 1, as if they were less important). It can also be observed that\nmasking effects may even induce changes in the variable rankings (e.g., compare the rankings at\nK = 1 and at K = 7), which thus con\ufb01rms that importances are differently affected.\nTo better understand why a variable is important, it is also insightful to look at its decomposition into\nits p sub-importance terms, as shown in Figure 2. Each row in the plots of the \ufb01gure corresponds\nto one of the p = 7 variables and each column to a size k of conditioning sets. As such, the value at\nrow m and column k corresponds to the importance of Xm when conditioned with k other variables\n(e.g., to the term $\\frac{1}{C_p^k} \\frac{1}{p-k} \\sum_{B \\in \\mathcal{P}_k(V^{-m})} I(X_m; Y \\mid B)$ in Equation 3 in the case of totally randomized\ntrees). In the left plot, for K = 1, the \ufb01gure \ufb01rst illustrates how importances yielded by totally\nrandomized trees decompose along the degrees k of interaction terms. 
We can observe that they\neach contribute equally (at most) to the total importance of a variable. The plot also illustrates why\nX5 is important: it is informative either alone or conditioned with any combination of the other\nvariables (all of its terms are signi\ufb01cantly larger than 0).\n\nFigure 2: Decomposition of variable importances along the degrees k of interactions of one variable with the\nother ones. At K = 1, all I(Xm; Y |B) are accounted for in the total importance, while at K = 7 only some\nof them are taken into account due to masking effects.\n\nBy contrast, it also clearly shows why\nX6 is not important: neither alone nor combined with others does X6 seem to be very informative\n(all of its terms are close to 0). More interestingly, this \ufb01gure also highlights redundancies: X7\nis informative alone or conditioned with a couple of others (the \ufb01rst terms are signi\ufb01cantly larger\nthan 0), but becomes uninformative when conditioned with many others (the last terms are closer\nto 0). The right plot, for K = 7, illustrates the decomposition of importances when variables are\nchosen in a deterministic way. The \ufb01rst thing to notice is masking effects. Some of the I(Xm; Y |B)\nterms are indeed clearly never encountered and their contribution is therefore reduced to 0 in the\ntotal importance. For instance, for k = 0, the sub-importances of X2 and X5 are positive, while\nall others are null, which means that only those two variables are ever selected at the root node,\nhence masking the others. As a consequence, this also means that the importances of the remaining\nvariables are biased and actually only account for their relevance when conditioned on X2\nor X5, but not for their relevance in other contexts. At k = 0, masking effects also amplify the\ncontribution of I(X2; Y ) (resp. I(X5; Y )) since X2 (resp. 
X5) appears more frequently at the root\nnode than in totally randomized trees.\nIn addition, because nodes become pure before reaching\ndepth p, conditioning sets of size k \u2265 4 are never actually encountered, which means that we can no\nlonger know whether variables are still informative when conditioned on many others. All in all, this\n\ufb01gure thus indeed con\ufb01rms that importances as computed with non-totally randomized trees take\ninto account only some of the conditioning sets B, hence biasing the measured importances.\n\n7 Conclusions\n\nIn this work, we made a \ufb01rst step towards understanding variable importances as computed with\na forest of randomized trees. In particular, we derived a theoretical characterization of the Mean\nDecrease Impurity importances as computed by totally randomized trees in asymptotic conditions.\nWe showed that they offer a three-level decomposition of the information jointly provided by all\ninput variables about the output (Section 3). We then demonstrated (Section 4) that MDI importances\nas computed by totally randomized trees exhibit desirable properties for assessing the relevance of\na variable: the importance of a variable is equal to zero if and only if the variable is irrelevant, and it depends only on the\nrelevant variables. We discussed the case of Random Forests and Extra-Trees (Section 5) and \ufb01nally\nillustrated our developments on an arti\ufb01cial but insightful example (Section 6).\nThere remain several limitations to our framework that we would like to address in the future. First, our\nresults should be adapted to binary splits as used within an actual Random Forest-like algorithm. In\nthis setting, any node t is split in only two subsets, which means that any variable may then appear\none or several times within a branch, which should in turn make variable importances dependent on\nthe cardinalities of the input variables. In the same direction, our framework should also be extended\nto the case of continuous variables. 
Finally, the results presented in this work are valid in an asymptotic\nsetting only. An important direction for future work is the characterization of the distribution\nof variable importances in a finite setting.\n\nAcknowledgements. Gilles Louppe is a research fellow of the FNRS (Belgium) and acknowledges its financial\nsupport. This work is supported by PASCAL2 and the IUAP DYSCO, initiated by the Belgian State, Science\nPolicy Office.\n\n[Figure residue: panels K=1 and K=7; axis labels X1 to X7 and 0 to 6; color scale from 0.0 to 0.5]\n\nReferences\n\nBiau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research, 13:1063\u20131095.\nBiau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research, 9:2015\u20132033.\nBreiman, L. (1996). Bagging predictors. Machine learning, 24(2):123\u2013140.\nBreiman, L. (2001). Random forests. Machine learning, 45(1):5\u201332.\nBreiman, L. (2002). Manual on setting up, using, and understanding random forests v3.1. Statistics Department, University of California, Berkeley, CA, USA.\nBreiman, L. (2004). Consistency for a simple model of random forests. Technical report, UC Berkeley.\nBreiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees.\nGenuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225\u20132236.\nGeurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3\u201342.\nHo, T. (1998). The random subspace method for constructing decision forests. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(8):832\u2013844.\nIshwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519\u2013537.\nKohavi, R. and John, G. H. (1997). 
Wrappers for feature subset selection. Artificial intelligence, 97(1):273\u2013324.\nLiaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3):18\u201322.\nLin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578\u2013590.\nMeinshausen, N. (2006). Quantile regression forests. The Journal of Machine Learning Research, 7:983\u2013999.\nPedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825\u20132830.\nStrobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1):307.\nStrobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1):25.\nWhite, A. P. and Liu, W. Z. (1994). Technical note: Bias in information-based measures in decision tree induction. Machine Learning, 15(3):321\u2013329.\nZhao, G. (2000). A new perspective on classification. PhD thesis, Utah State University, Department of Mathematics and Statistics.\n", "award": [], "sourceid": 281, "authors": [{"given_name": "Gilles", "family_name": "Louppe", "institution": "Universit\u00e9 de Li\u00e8ge"}, {"given_name": "Louis", "family_name": "Wehenkel", "institution": "Universit\u00e9 de Li\u00e8ge"}, {"given_name": "Antonio", "family_name": "Sutera", "institution": "Universit\u00e9 de Li\u00e8ge"}, {"given_name": "Pierre", "family_name": "Geurts", "institution": "Universit\u00e9 de Li\u00e8ge"}]}