{"title": "When do random forests fail?", "book": "Advances in Neural Information Processing Systems", "page_first": 2983, "page_last": 2993, "abstract": "Random forests are learning algorithms that build large collections of random trees and make predictions by averaging the individual tree predictions.\nIn this paper, we consider various tree constructions and examine how the choice of parameters affects the generalization error of the resulting random forests as the sample size goes to infinity. \nWe show that subsampling of data points during the tree construction phase is important: Forests can become inconsistent with either no subsampling or too severe subsampling. \nAs a consequence, even highly randomized trees can lead to inconsistent forests if no subsampling is used, which implies that some of the commonly used setups for random forests can be inconsistent. \nAs a second consequence we can show that trees that have good performance in nearest-neighbor search can be a poor choice for random forests.", "full_text": "When do random forests fail?\n\nGeorge Washington University\n\nMax Planck Institute for Intelligent Systems\n\nCheng Tang\n\nWashington, DC\n\ntangch@gwu.edu\n\nDamien Garreau\n\nT\u00a8ubingen, Germany\n\ndamien.garreau@tuebingen.mpg.de\n\nUlrike von Luxburg\nUniversity of T\u00a8ubingen\n\nMax Planck Institute for Intelligent Systems\n\nT\u00a8ubingen, Germany\n\nluxburg@informatik.uni-tuebingen.de\n\nAbstract\n\nRandom forests are learning algorithms that build large collections of random trees\nand make predictions by averaging the individual tree predictions. In this paper,\nwe consider various tree constructions and examine how the choice of parame-\nters affects the generalization error of the resulting random forests as the sample\nsize goes to in\ufb01nity. We show that subsampling of data points during the tree\nconstruction phase is important: Forests can become inconsistent with either no\nsubsampling or too severe subsampling. 
As a consequence, even highly randomized trees can lead to inconsistent forests if no subsampling is used, which implies that some of the commonly used setups for random forests can be inconsistent. As a second consequence, we can show that trees that have good performance in nearest-neighbor search can be a poor choice for random forests.\n\n1 Introduction\n\nRandom forests (Breiman, 2001) are considered one of the most successful general-purpose algorithms of modern times (Biau and Scornet, 2016). They can be applied to a wide range of learning tasks, but most prominently to classification and regression. A random forest is an ensemble of trees, where the construction of each tree is random. After building an ensemble of trees, the random forest makes predictions by averaging the predictions of individual trees. Random forests often make accurate and robust predictions, even for very high-dimensional problems (Biau, 2012), in a variety of applications (Criminisi and Shotton, 2013; Belgiu and Dr\u0103gu\u021b, 2016; D\u00edaz-Uriarte and Alvarez de Andr\u00e9s, 2006). Recent theoretical works have established a series of consistency results for different variants of random forests, when the forests\u2019 parameters are tuned in certain ways (Scornet, 2016; Scornet et al., 2015; Biau, 2012; Biau et al., 2008). In this paper, however, we ask the question of when random forests fail. 
In particular, we examine how varying several key parameters of the algorithm affects the generalization error of forests.\nWhen building a random forest, there are several parameters to tune: the choice of the base trees (the randomized algorithm that generates the individual trees), the number of trees in the forest, the size of the leaf nodes, the rate of data subsampling, and sometimes the rate of feature subsampling. Popular variants of random forests usually come with their own default parameter tuning guidelines, often suggested by practice. For example, common wisdom suggests that training a large number of trees and growing deep trees whose leaf sizes are fixed to a small constant lead to better performance. For data subsampling, the original random forest paper (Breiman, 2001) suggests setting the subsampling (with replacement) rate to be 1, while a later popular variant (Geurts et al., 2006)\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nproposes to disable data subsampling altogether. For feature subsampling, the consensus is to set the rate to d/3 for regression problems, with d being the dimension (Friedman et al., 2009, Section 15.3). But in D\u00edaz-Uriarte and Alvarez de Andr\u00e9s (2006), the feature sampling rate is found not to be important, while Genuer et al. (2010) suggest not subsampling the features.\nExisting analyses of random forests mostly focus on positive results and typically fall into two categories: (1) They show a forest is consistent by showing that its base trees are consistent (Biau et al., 2008; Biau, 2012; Denil et al., 2014). This class of results does not cover the case of deep trees (because individual deep trees are clearly inconsistent), and fails to highlight the advantage of using random forests as opposed to single trees. 
(2) In the deep tree regime, recent theoretical consistency results require subsampling as a sufficient condition for consistency (Scornet, 2016).\nWe focus on negative results: When are random forests inconsistent? To facilitate our theoretical investigation, we restrict our analysis to unsupervised random forests, that is, random forests whose tree construction does not use label information (Def. 2). We establish two conditions, diversity and locality (Def. 3 and 4), that are necessary for a forest to be consistent. We then examine how parameter tuning affects diversity and locality. Our results highlight the importance of subsampling data points during the tree construction phase: Without subsampling, forests of deep trees can become inconsistent due to violation of diversity; on the other hand, if we subsample too heavily, forests can also become inconsistent due to violation of locality. Our analysis implies two surprising consequences as special cases: (1) When considering partitioning trees that are particularly good for nearest-neighbor search, such as random projection trees, it is natural to expect them to be also good for random forests. Our results disagree with this intuition: Unless we use severe subsampling, they lead to inconsistent forests. (2) In a popular variant of random forests, extremely randomized trees are used and subsampling is disabled (Geurts et al., 2006). The argument in that paper is that when forests use extremely randomized trees, the randomness in the trees already reduces variance and thus subsampling becomes unnecessary. Our results suggest otherwise.\n\n2 Background on random forests\n\nThroughout this paper, we consider n i.i.d. samples X1, . . . , Xn of an unknown random variable X that has support included in [0, 1]d. Let \u03b7 : [0, 1]d \u2192 R be a measurable function. The responses Y1, . . . 
, Yn are R-valued random variables which satisfy\n\nYi = \u03b7(Xi) + \u03b5i , 1 \u2264 i \u2264 n ,\n\n(2.1)\n\nwhere the \u03b5i are centered random variables with variance \u03c32 > 0. We assume that they are independent from the observations. For any integer n, we set [n] := {1, . . . , n}. We denote by X[n] := (Xi)1\u2264i\u2264n the training set, Y[n] := (Yi)1\u2264i\u2264n the responses, and Dn := (Xi, Yi)1\u2264i\u2264n the training sample. We focus on the regression problem, that is, the problem of estimating the unknown regression function \u03b7(x) = E[Y | X = x] by constructing an estimator \u03b7\u0302n(x) based on the training sample Dn. We define the mean squared error of any estimator \u03b7\u0302n as E[|\u03b7\u0302n(X) \u2212 \u03b7(X)|2], and we say that the estimator is L2-consistent if the mean squared error goes to zero when the sample size grows to infinity, that is,\n\nlim_{n\u2192\u221e} E[|\u03b7\u0302n(X) \u2212 \u03b7(X)|2] = 0 .\n\n(2.2)\n\nThe present paper examines the consistency of random forests as estimators of the regression function. Here and in the rest of this article the expectation E[\u00b7] is taken with respect to the random variables X, X1, . . . , Xn, \u03b51, . . . , \u03b5n, and any additional source of randomness coming from the (random) tree construction, unless otherwise specified.\n\nRegression trees. A random forest makes predictions by aggregating the predictions of tree-based estimators. To obtain a tree-based estimator, one first uses the training sample to build a \u201cspatial partitioning tree.\u201d Any query x in the ambient space is then routed from the root to a unique leaf node and assigned the mean value of the responses in the corresponding cell.\nFormally, the j-th tree in the ensemble constructed from training sample Dn induces a hierarchy of finite coverings of the ambient space [0, 1]d: let k denote the height of the tree. 
Then at every level \u2113 \u2208 [k] the tree induces a p\u2113-covering of the ambient space, namely subspaces Aj1, . . . , Ajp\u2113 \u2282 [0, 1]d such that Aj1 \u222a \u00b7\u00b7\u00b7 \u222a Ajp\u2113 = [0, 1]d. Each cell Aji corresponds to a node of the tree. The tree-induced routing of a query to a unique cell in space at level \u2113 \u2208 [k] is a function Aj\u2113 : [0, 1]d \u2192 {Aj1, . . . , Ajp\u2113}; it satisfies \u2200x \u2208 [0, 1]d, \u2203! i \u2208 {1, . . . , p\u2113} such that Aj\u2113(x) = Aji. In the following, we refer to the function Aj\u2113 as the routing function associated with tree j at level \u2113, and we will often identify the trees with their associated functions at level k, Ajk (or simply Aj when there is no ambiguity). Note that this routing function is well-defined even for tree structures that allow overlapping cells.\nOnce a tree Aj has been constructed, it estimates the regression function \u03b7(x) for a query point x, using only information on training points contained in cell Aj(x). Formally, given a query point x let N(Aj(x)) denote the number of samples that belong to the cell Aj(x). We define the j-th tree-based estimator \u03b7\u0302n,Aj : [0, 1]d \u2192 R as\n\n\u03b7\u0302n,Aj(x) := (1/N(Aj(x))) \u2211_{i=1}^{n} Yi 1{Xi \u2208 Aj(x)} ,\n\nwith the convention 0/0 = 0. Intuitively, \u03b7\u0302n,Aj(x) is the empirical average of the responses of sample points falling in the same cell as x \u2014 see Fig. 1. We refer to Friedman et al. (2009, Section 9.2.2) for a more detailed overview of regression trees.\n\nRandom forests. 
A random forest builds an ensemble of T tree estimators that are all constructed based on the same data set and the same tree algorithm, which we call the base tree algorithm. Due to the inherent randomness in the base tree algorithm, which we denote by \u0398, each tree Aj will be different; Aj can depend on both the training data Dn and \u0398. For instance, the random variable \u0398 may encode what feature and threshold are used when splitting a node. An important source of randomness is the one coming from what we simply call \u201csubsampling\u201d: when building each tree Aj, we do not use the entire data set during tree construction, but just a subsample of the data (which can be with or without replacement). This source of randomness is also encoded by \u0398.\nFormally, the random forest estimator associated to the collection of trees VT = {Aj, 1 \u2264 j \u2264 T} is defined by\n\n\u03b7\u0302n,VT(x) := (1/T) \u2211_{j=1}^{T} \u03b7\u0302n,Aj(x) = (1/T) \u2211_{j=1}^{T} (1/N(Aj(x))) \u2211_{i=1}^{n} Yi 1{Xi \u2208 Aj(x)} .\n\n(2.3)\n\nWe refer to Friedman et al. (2009, Chapter 15) and Biau and Scornet (2016) for a more comprehensive introduction to random forests algorithms.\n\nLocal average estimators and infinite random forests. An important fact about random forest estimators is that they can be seen as local average estimators (Devroye et al., 1996, Section 6.5), a concept that generalizes many nonparametric estimators, including histogram, kernel, nearest-neighbor, and tree-based estimators. A local average estimator takes the following generic form:\n\n\u03b7\u0302n(x) = \u2211_{i=1}^{n} Wn,i(x) Yi .\n\n(2.4)\n\nFor a given query x, a local average estimator predicts its conditional response by averaging the responses in the training sample that are \u201cclose\u201d to x. 
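To make the connection concrete, here is a minimal NumPy sketch (our own toy setup: two hand-picked axis-aligned partitions of [0, 1] standing in for trees, not the paper's tree construction) showing that averaging per-cell means as in Eq. (2.3) yields exactly a local average estimator of the form of Eq. (2.4):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D training sample and two hand-picked "trees", each a partition of
# [0, 1] into cells given by split points (illustrative only).
X = rng.uniform(size=20)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(20)
tree_splits = [np.array([0.3, 0.7]), np.array([0.5])]  # cell boundaries per tree

def cell_id(splits, x):
    """Index of the leaf cell containing x (the routing function A^j)."""
    return np.searchsorted(splits, x)

def tree_predict(splits, x):
    """Empirical mean of the responses in the cell of x (one term of Eq. 2.3)."""
    in_cell = cell_id(splits, X) == cell_id(splits, x)
    return Y[in_cell].mean()

def forest_predict(x):
    """Average of the T = 2 tree predictions (Eq. 2.3)."""
    return np.mean([tree_predict(s, x) for s in tree_splits])

def forest_weights(x):
    """The induced local-average weights W_{n,i}(x) of Eq. (2.4)."""
    W = np.zeros(len(X))
    for s in tree_splits:
        in_cell = (cell_id(s, X) == cell_id(s, x)).astype(float)
        W += in_cell / in_cell.sum()
    return W / len(tree_splits)

x0 = 0.42
assert np.isclose(forest_weights(x0).sum(), 1.0)               # convex weights
assert np.isclose(forest_predict(x0), forest_weights(x0) @ Y)  # same estimator
```

The weights are non-negative and sum to one, so the forest prediction is a convex combination of the observed responses.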
Wn,i(x) can be thought of as the \u201cweight\u201d or the contribution of the i-th training point in predicting the response value for x.\nRandom forests form a special class of local average estimators: introducing the weights\n\nWTn,i(x) := (1/T) \u2211_{j=1}^{T} 1{Xi \u2208 Aj(x)} / N(Aj(x)) ,\n\nwe can immediately see from Eq. (2.3) that\n\n\u03b7\u0302n,VT(x) = (1/T) \u2211_{j=1}^{T} (1/N(Aj(x))) \u2211_{i=1}^{n} 1{Xi \u2208 Aj(x)} Yi = \u2211_{i=1}^{n} WTn,i(x) Yi .\n\n(2.5)\n\nIt is clear that the weights defined by a random forest are non-negative. To analyze the asymptotic properties of forests, there are different regimes that one can consider: the regime \u201cfixed T, and large n\u201d essentially does not differ from analyzing an individual tree. To see advantages of forests, one needs to let both T and n go to infinity. As is common in the literature on random forests, we first let T \u2192 \u221e to get rid of the randomness \u0398 that is inherent to the tree construction: According to the law of large numbers, the estimator defined by Eq. (2.5) behaves approximately as an infinite random forest with associated estimator\n\n\u03b7\u0302n,V\u221e(x) := \u2211_{i=1}^{n} W\u221en,i(x) Yi , where W\u221en,i(x) := E\u0398[1{Xi \u2208 A(x)} / N(A(x))]\n\nare the asymptotic weights and A(\u00b7) is the routing function associated with a generic random tree. Indeed, Scornet (2016, Theorem 3.1) shows that \u03b7\u0302n,V\u221e(\u00b7) is the limiting function of \u03b7\u0302n,VT(\u00b7) as the number of trees T goes to infinity. The concept of the infinite forest captures the common wisdom that one should use many trees in random forests (see the next paragraph). In the following, we focus on such infinite random forests. 
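The passage to infinite-forest weights can be illustrated numerically: averaging the per-tree weights over many independent draws of \u0398 approximates E\u0398[\u00b7] by the law of large numbers. In this sketch, a "tree" is just a single uniformly drawn split of [0, 1] (our own stand-in for a base tree algorithm, not one from the paper):

```python
import numpy as np

rng0 = np.random.default_rng(1)
n = 30
X = np.sort(rng0.uniform(size=n))  # fixed training points

def random_tree_weights(x, rng):
    """Weights 1{X_i in A(x)} / N(A(x)) for one random 'tree': a single
    uniformly drawn split point s of [0, 1] (a toy stand-in for Theta)."""
    s = rng.uniform()
    in_cell = (X <= s) == (x <= s)   # points on the same side of s as x
    return in_cell / in_cell.sum()

def forest_weights(x, T, seed):
    """Finite-forest weights W^T_{n,i}(x): empirical mean over T draws of Theta."""
    rng = np.random.default_rng(seed)
    return np.mean([random_tree_weights(x, rng) for _ in range(T)], axis=0)

# By the law of large numbers, two independent finite forests approach the
# same infinite-forest weight vector E_Theta[1{X_i in A(x)} / N(A(x))] as T grows.
w1 = forest_weights(0.37, T=20000, seed=2)
w2 = forest_weights(0.37, T=20000, seed=3)
```

With T = 20000 the two independent approximations agree to within about one percent per coordinate, which is the sense in which the finite forest "behaves approximately as" the infinite one.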
Now our question becomes: If we construct infinitely many trees by a particular base tree algorithm, is the forest consistent as the number n of data points goes to infinity?\n\nCommon beliefs and parameter setups in random forests. Different variants of random forests usually have different parameter tuning principles. However, there are three common beliefs about random forests in general, both in the literature and among practitioners. The first belief is that \u201cmany trees are good,\u201d in the sense that adding trees to the ensemble tends to decrease the generalization error of random forests (Biau and Scornet, 2016, Sec. 2.4). For example, the results in Theorem 3.3 of Scornet (2016) and Arlot and Genuer (2014) both corroborate this belief. The second belief is that, in the context of random forests, \u201cit is good to use deep trees\u201d (Breiman, 2000).\nDefinition 1 (Deep trees and fully-grown trees). We say a random forest has deep trees if there exists an integer n0 such that, for any sample size n, the leaf nodes of its base trees have at most n0 points almost surely; a fully-grown tree is a deep tree whose leaves have exactly one data point.\n\nThe use of deep trees seems counter-intuitive at first glance: They have low bias but extremely high variance that does not vanish as the sample size increases, and thus are destined to overfit. However, while a single deep tree estimator is clearly not consistent in general, it is believed that combining many deep trees can effectively reduce the variance of individual trees. Thus, it is believed that a random forest estimator takes advantage of the low bias of individual deep trees while retaining low variance. Recent work of Scornet (2016) provided theoretical evidence of this belief by showing that forests of fully-grown quantile trees are consistent under certain sufficient conditions. 
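The non-vanishing variance of a single deep tree can be seen in a small simulation. In this toy 1-D instance of the model of Eq. (2.1) (our own setup; the 1-nearest-neighbor rule serves as a surrogate for a fully-grown tree, whose leaf contains exactly one point), the mean squared error plateaus near the noise level \u03c32 instead of going to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5  # noise level; the MSE floor should be sigma^2 = 0.25

def deep_tree_mse(n, n_test=500):
    """Monte-Carlo MSE of a fully-grown-tree surrogate: predict a query
    with the response of its nearest training point (leaf size 1)."""
    X = rng.uniform(size=n)
    Y = np.sin(2 * np.pi * X) + sigma * rng.standard_normal(n)
    Xt = rng.uniform(size=n_test)
    nn = np.abs(Xt[:, None] - X[None, :]).argmin(axis=1)
    return float(np.mean((Y[nn] - np.sin(2 * np.pi * Xt)) ** 2))

mses = [deep_tree_mse(n) for n in (100, 1000, 5000)]
# The error does not tend to 0 as n grows: it stabilizes near sigma^2.
```

This is the sense in which a single deep tree "is destined to overfit"; the open question is when averaging many such trees repairs it.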
The third belief is that a diverse portfolio of trees helps alleviate overfitting (by reducing variance), and that randomizing the tree construction helps create a more diverse portfolio. Since the introduction of random forests, \u201ctree diversity,\u201d which has been defined as the correlation of fit residuals between base trees in Breiman (2001), has been perceived as crucial for achieving variance reduction. It has also become folklore knowledge in the random forest community that by introducing \u201cmore randomness,\u201d trees in the ensemble become more diverse, and thus less likely to overfit. In practice, many ways of injecting randomness into the tree construction have been explored, for example random feature selection, random projection, random splits, and data subsampling (bootstrapping). Geurts et al. (2006) suggest using extremely randomized trees; taking this idea to the limit yields the totally randomized trees, that is, trees constructed without using information from the responses Y[n]. Our analysis takes into account all three common beliefs, and studies forest consistency under two extreme scenarios of subsampling setup.\n\n2.1 Related Work\n\nRandom forests were first proposed by Breiman (2001), where the base trees are chosen as Classification And Regression Trees (CART) (Breiman et al., 1984) and subsampling is enabled during tree construction. A popular variant of random forests is called \u201cextremely randomized trees\u201d (extra-trees) (Geurts et al., 2006). Forests of extra-trees adopt a different parameter setup than Breiman\u2019s forest: They disable subsampling and use highly randomized trees as compared to CART trees. Besides axis-aligned trees such as CART, oblique trees (trees with non-rectangular cells) such as random projection trees are also used in random forests (Ho, 1998; Menze et al., 2011; Rodriguez et al., 2006; Tomita et al., 2015). 
On the theoretical side, all previous works that we are aware of investigate forests with axis-aligned base trees. Most works analyze trees with the UW-property (see Def. 2) and focus on establishing consistency results (Scornet (2016); Biau (2012); Biau et al. (2008)). A notable breakthrough was Scornet et al. (2015), who were the first to establish that Breiman\u2019s forest, which does not satisfy the UW-property (Def. 2), is consistent on additive regression models. To our knowledge, few works focus on negative results. An exception is Lin and Jeon (2006), which provides a lower bound on the mean squared error convergence rate of forests.\n\n2.2 Overview of our results\n\nSection 3 establishes two notions, \u201cdiversity\u201d and \u201clocality,\u201d that are necessary for local average estimators to be consistent. Then, viewing infinite random forests as local average estimators, we establish a series of inconsistency results in Section 4. In Section 4.1, we show that forests of deep trees with either the nearest-neighbor-preserving property (Def. 6) or the fast-diameter-decreasing property (see condition in Prop. 1) violate the diversity condition when subsampling is disabled. As a surprising consequence, we show that trees with the nearest-neighbor-preserving property (Algorithms 1 and 2) can be inconsistent if we follow a common forest parameter setup (Def. 5). In Section 4.2, we show that when undersampled, forests of deep trees can violate the locality condition. Our analysis applies to trees that are both axis-aligned and irregularly shaped (oblique).\n\n3 Inconsistency of local average estimators\n\nA classical result of Stone (1977, Theorem 1) provides a set of sufficient conditions for local average estimators to be consistent. 
In this section, we derive new inconsistency results for a general class of local average estimators satisfying an additional property, often used in theoretical analyses:\nDefinition 2 (UW-property). A local average estimator defined as in Eq. (2.4) satisfies the \u201cunsupervised-weights\u201d property (UW-property) if the weights Wn,i depend only on the unlabeled data.\n\n3.1 Diversity is necessary to avoid overfitting\n\nWe first define a condition on local average estimators, which we call diversity, and show that if local average estimators do not satisfy diversity, then they are inconsistent on data generated from a large class of regression models. In fact, from the proof of Lemma 1, it can be seen that violating diversity results in high asymptotic variance, hence inconsistent estimators.\nDefinition 3 (Diversity condition). We say a local average estimator as defined in Eq. (2.4) satisfies the diversity condition if E[\u2211_{i=1}^{n} Wn,i(X)2] \u2192 0 as n \u2192 \u221e .\n\nIntuitively, the diversity condition says that no single data point in the training set should be given too much weight asymptotically. The following lemma shows that diversity is necessary for a local average estimator (with the UW-property) to be consistent on a large class of regression models.\nLemma 1 (Local average estimators without diversity are inconsistent). Consider a local average estimator \u03b7\u0302n as in Eq. (2.4) that satisfies the UW-property. Suppose the data satisfies Eq. (2.1), and let \u03c3 be as defined therein. Suppose the diversity condition (Def. 3) is not satisfied: that is, there exists \u03b4 > 0 such that E[\u2211_{i=1}^{n} Wn,i(X)2] \u2265 \u03b4 for infinitely many n. Then \u03b7\u0302n is not consistent.\n\nA related result is proved in Stone (1977). 
It considers the artificial scenario where the distribution of (X, Y) is such that (i) Y is independent of X, and (ii) Y is standard Gaussian. On this particular distribution, Stone (1977, Prop. 8) shows that condition (5) of Stone (1977, Theorem 1) is necessary for a local average estimator to be consistent. In contrast, our Lemma 1 applies to a much larger class of distributions.\n\n3.2 Locality is necessary to avoid underfitting\n\nNow we introduce another necessary condition for the consistency of local average estimators, which we call locality. While diversity controls the variance of the risk, locality controls the bias.\n\nDefinition 4 (Locality condition). We say that a local average estimator \u03b7\u0302n with weights Wn,i satisfies the locality condition if, for any a > 0, E[\u2211_{i=1}^{n} Wn,i(X) 1{\u2016Xi \u2212 X\u2016 > a}] \u2192 0 as n \u2192 \u221e .\n\nThe locality condition is one of the conditions of Stone\u2019s theorem for the consistency of local average estimators. 
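The quantity in Def. 4 can be probed numerically. The following toy sketch (our own caricature of the undersampled regime studied later in Section 4.2: each "tree" routes the query to its nearest neighbor within a subsample of constant size m) estimates the expected weight outside a fixed ball around the query:

```python
import numpy as np

rng = np.random.default_rng(0)
x0, a = 0.5, 0.05  # query point and the fixed radius of Def. 4

def far_weight(n, m, T=2000):
    """Monte-Carlo estimate of E[sum_i W_i(x0) 1{|X_i - x0| > a}] for a toy
    forest whose 'trees' return the nearest neighbor of x0 within a random
    subsample of constant size m (the undersampled regime)."""
    X = rng.uniform(size=n)
    far = 0.0
    for _ in range(T):
        idx = rng.choice(n, size=m, replace=False)
        nn = idx[np.abs(X[idx] - x0).argmin()]
        far += float(abs(X[nn] - x0) > a)
    return far / T

w_small_n = far_weight(n=200, m=5)
w_large_n = far_weight(n=20000, m=5)
```

With m held fixed while n grows, the mass outside the ball stays near (1 \u2212 2a)^m = 0.9^5 \u2248 0.59, so it does not vanish and the locality condition fails.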
In plain words, it requires the estimator to give small weight to sample points located outside a ball of fixed radius centered around a query. Indeed, intuitively, a local average estimator should be able to capture fine-scale changes in the distribution of X in order to be consistent.\n\nAlgorithm 1 Randomized Projection Tree\nInput: Sample S, maximum leaf size n0;\nOutput: T = RPT(S, n0);\n1: T \u2190 empty tree;\n2: if |S| > n0 then\n3:   Sample U uniformly from Sd\u22121;\n4:   Sample q uniformly from [1/4, 3/4];\n5:   tq \u2190 empirical q-th quantile of UT \u00b7 S;\n6:   SL \u2190 {x \u2208 S : UT \u00b7 x \u2264 tq};\n7:   T.graft(RPT(SL, n0));\n8:   SR \u2190 S \\ SL;\n9:   T.graft(RPT(SR, n0));\n10: end if\n\nAlgorithm 2 Randomized Spill Tree\nInput: S, n0, \u03b1 \u2208 (0, 1/2);\nOutput: T = RST(S, n0, \u03b1);\n1: T \u2190 empty tree;\n2: if |S| > n0 then\n3:   Sample U uniformly from Sd\u22121;\n4:   tL \u2190 top (1/2 + \u03b1)-quantile of UT \u00b7 S;\n5:   tR \u2190 bottom (1/2 + \u03b1)-quantile of UT \u00b7 S;\n6:   SL \u2190 {x \u2208 S : UT \u00b7 x \u2264 tL};\n7:   T.graft(RST(SL, n0, \u03b1));\n8:   SR \u2190 {x \u2208 S : UT \u00b7 x \u2265 tR};\n9:   T.graft(RST(SR, n0, \u03b1));\n10: end if\n\nOur next result shows that there exists a distribution such that, when a local average estimator with non-negative weights violates the locality property, it is inconsistent.\nLemma 2 (Local average estimators without locality are inconsistent). In the setting given by Eq. (2.1), let \u03b7\u0302n be a local average estimator with non-negative weights Wn,i. Suppose that \u03b7\u0302n satisfies the UW-property (Def. 2). Assume furthermore that \u03b7\u0302n does not satisfy locality (Def. 4). Then, there exists a continuous bounded regression function \u03b7 : [0, 1]d \u2192 R such that \u03b7\u0302n is not consistent.\n\nThis result is a straightforward application of Prop.
6 of Stone (1977). Intuitively, when locality is violated, a local average estimator can be highly biased when the regression function \u03b7 has a large amount of local variability. Note that the data models for which we prove that locality is necessary (Lemma 2) are more restricted than those for diversity.\n\n4 Inconsistency of random forests\n\nViewing forests as a special type of local average estimators, we obtain several inconsistency results by considering the choice of subsampling rate in two extreme scenarios: in Section 4.1, we study trees without subsampling, and in Section 4.2, we study trees with constant subsample sizes.\n\n4.1 Forests without subsampling can be inconsistent\n\nIn this section, we establish inconsistency of some random forests by showing that they violate the diversity condition. In particular, we focus on infinite random forests with the following tree-construction strategy:\nDefinition 5 (Totally randomized deep trees). We say a random forest has totally randomized deep trees if its base trees (i) have the UW-property (Def. 2), (ii) are deep (Def. 1), and (iii) are grown on the entire dataset (no subsampling).\n\nThis parameter setup is similar to the one suggested by Geurts et al. (2006), and the term \u201ctotally randomized\u201d in Def. 5 follows the naming convention therein.\n\nTrees with nearest-neighbor-preserving property. Besides serving as the base algorithms for random forests, spatial partitioning trees are also widely used for other important tasks such as nearest-neighbor search (Yianilos, 1993). We show that, surprisingly, trees that are good for nearest-neighbor search can lead to inconsistent forests when we adopt the parameter setup that is widely used in the random forest community. Given X[n] and any x \u2208 [0, 1]d, we let X(i)(x) denote the i-th nearest neighbor of x from the set {X[n]} for the Euclidean distance. We define the nearest-neighbor-preserving property of a tree as follows.\n\nDefinition 6 (Nearest-neighbor-preserving property). Let A(\u00b7) be the routing function associated with a generic (randomized) tree. We say that the tree has the nearest-neighbor-preserving property if there exists \u03b5 > 0 such that P(X(1)(X) \u2208 A(X)) \u2265 \u03b5 for infinitely many n.\n\nIntuitively, Def. 6 means that if we route a query point x through the tree to its leaf cell A(x), then its nearest neighbor is likely to be in the same cell, which is quite appealing when trees are used for nearest-neighbor search. However, via Lemma 1, we can now show that such trees lead to inconsistent forests whenever we grow the trees deep and disable subsampling.\nTheorem 1 (Forests with deep, nearest-neighbor-preserving trees are inconsistent). Suppose that the data distribution satisfies the condition in Eq. (2.1). Suppose that the infinite random forest \u03b7\u0302n,V\u221e is built with totally randomized deep trees that additionally satisfy the nearest-neighbor-preserving property (Def. 6). Then \u03b7\u0302n,V\u221e is L2-inconsistent.\n\nThe intuition behind Theorem 1 is that trees with the nearest-neighbor-preserving property are highly homogeneous when subsampling is disabled: given a query point x, each tree in the forest tends to retrieve in its leaf of x a very similar set from the training data, namely those data points that are likely nearest neighbors of x. This in turn implies a violation of diversity and leads to overfitting (and inconsistency) of the random forest.\nTheorem 1 suggests that without subsampling, forests of totally randomized trees can still overfit (that is, subsampling is necessary for some forests to be consistent under the totally randomized deep tree construction regime). 
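The mechanism behind Theorem 1 can be mimicked numerically. In the following sketch (our own caricature, not the paper's tree algorithms), each "tree" routes a query to its single nearest neighbor within the tree's training set, an extreme form of the nearest-neighbor-preserving property:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(size=n)
x0 = 0.5  # fixed query point

def forest_sum_sq_weights(subsample, T=2000):
    """Estimate of sum_i W_i(x0)^2 (the diversity statistic of Def. 3) for a
    forest whose base 'tree' routes x0 to its nearest neighbor within a
    random subsample -- an extreme caricature of a deep,
    nearest-neighbor-preserving tree."""
    W = np.zeros(n)
    for _ in range(T):
        idx = rng.choice(n, size=subsample, replace=False)
        nn = idx[np.abs(X[idx] - x0).argmin()]
        W[nn] += 1.0 / T
    return float(np.sum(W ** 2))

no_sub = forest_sum_sq_weights(subsample=n)   # no subsampling: every tree
                                              # retrieves the same point
sub = forest_sum_sq_weights(subsample=20)     # subsampling spreads the weight
```

Without subsampling the diversity statistic equals 1 (every tree concentrates all its weight on the same global nearest neighbor), violating Def. 3; with a subsample of 20 out of 500 points the statistic drops far below 1, because different subsamples contribute different near neighbors.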
On the other hand, we speculate that proper subsampling can make the forests consistent again, while fixing the other parameters (that is, subsampling is also sufficient for forest consistency here): with subsampling, the nearest-neighbor-preserving property of the base tree algorithm should still hold, but each time applied to a subsample of the original data; taken together, the nearest neighbors on different subsamples form a much more significant set, hence diversity should hold again. If this can be proved, then it would imply that, contrary to common belief (Geurts et al., 2006), different ways of injecting randomness in the tree construction phase may not be equivalent in reducing overfitting, and that subsampling may be more effective than other ways of injecting randomness into the algorithm. We leave this for future work.\n\nExample: Forests of deep random projection trees. Random-projection trees (Dasgupta and Freund, 2008) are a popular data structure, both for nearest-neighbor search (Dasgupta and Sinha, 2015) and regression. In particular, in the latter case, random-projection tree based estimators were theoretically shown to be L2-consistent, with a convergence rate that adapts to the intrinsic data dimension for regression problems when they are pruned cleverly (Kpotufe and Dasgupta, 2012). Below we show, however, that two variants of these trees, namely random projection trees (Algorithm 1) and randomized spill trees (Algorithm 2), can be poor candidates as base trees for random forests when tree pruning and data subsampling are disabled.\nTheorem 2 (Forests of deep random projection trees are inconsistent). Suppose that X is distributed according to a measure \u00b5 that has doubling dimension d0 \u2265 2. Suppose additionally that the responses satisfy Eq. (2.1). Let c0 be a constant such that Dasgupta and Sinha (2015, Theorem 7) holds\u2014we recall this result as Theorem 5 in the Appendix. 
For any \u03b4 \u2208 (0, 1/3) and \u03b5 \u2208 (0, 1),\nsuppose that we grow the base trees such that each leaf contains at most n0 sample points, where n0\nis a constant which does not depend on n and is de\ufb01ned as follows:\n\n(cid:40)\n\n\u2022 (Random projection tree) n0 = max\n\n(cid:16) 2c0d2\n(cid:17)d0\n(cid:16) c0d0\nThen the random forest estimator(cid:98)\u03b7n,V\u221e is L2\u2013inconsistent.\n\n\u2022 (Randomized spill tree) n0 = 8 log 1/\u03b4\n\n8 log 1/\u03b4\n\n\u03b1(1\u2212\u03b5)\n\n(cid:17)d0\n\n(cid:16) 2c0d3\n\n0\n\n1\u2212\u03b5\n\n, exp\n\n0(8 log 1/\u03b4)1/d0\n\n1\u2212\u03b5\n\n(cid:17)(cid:41)\n\n.\n\n, with \u03b1 \u2264 \u03b10 = \u03b10(c0, d0, \u03b5, \u03b4).\n\nTheorem 2 is a direct consequence of Theorem 1 and Theorem 5; the latter shows that both Algo-\nrithms 1 and 2 are nearest-neighbor-preserving.\n\nTrees with fast shrinking cell diameter. Local average estimators such as k-nearest-neighbor\n(k-NN), kernel, and tree based estimators, often make predictions based on information in a neigh-\nborhood around the query point. In all these methods, the number of training data contained in\n\n7\n\n\f\u2022\n\n\u2022\n\nx\n\n\u2022\n\ny\n\n\u2022\n\n\u2022\n\n\u2022\n\n\u2022\n\n\u2022\n\nx\n\n\u2022\n\n\u2022\n\n\u2022\u2022\n\n\u2022\n\nFigure 1: Left: Illustration of the \u201caggregating\u201d effect of a forest induced local neighborhood; the\nblack dot is a query point x; the blue points are training points; each cell is the leaf cell of a single\ntree in the forest containing x; the maximal leaf size is n0 = 1. We can see that the aggregated cell\n(the union of the individual cells) is much larger (less local) than the individual cells. Right: The\nvertical blue lines represent the response values of the sample points belonging to the same cell as\nthe query x. 
The predicted value (in black) is the empirical mean of these values.

the local neighborhood controls the bias-variance trade-off of the estimator (Devroye et al., 1996, Sec. 6); for these methods to be consistent, the local neighborhood needs to adapt to the training size. For example, in k-NN methods, the size of the neighborhood is determined by the choice of k, the number of nearest neighbors of the query point. The classical result of Stone (1977) shows that the k-NN classifier is universally consistent if k grows with n and, at the same time, does not grow too fast, namely k/n → 0. We now present a necessary condition on the local neighborhood size for random forests to be consistent. In a particular tree j, the local neighborhood of a query x is the leaf cell containing it, Aj(x). In a forest, the local neighborhood of a query can be viewed as an aggregation of all possible realizations of tree cells containing x.

Intuitively, the aggregated cell in the forest should behave better in the following sense: Consider trees that are fully grown, that is, each leaf cell contains only one point. Then the local neighborhood of any query is too small and will result in a tree-based estimator with high variance. In the forest, however, different tree realizations partition the space differently. This means that, for a fixed query point x, different training points will end up in the leaf cell containing x in different trees, and the aggregated cell can potentially be much larger than the individual tree cell. See the left panel of Fig. 1 for an illustration of this effect.
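As a toy numerical sketch of this aggregation effect (our own illustration, not part of the formal analysis: the base trees in the paper are randomized space partitions, whereas here, for simplicity, each "tree" is a 1-nearest-neighbor predictor and the only source of inter-tree diversity is subsampling; the function name and all parameter values are hypothetical), one can measure the forest's mean squared error at a fixed query under pure-noise responses. When every tree sees the full sample (m = n), all trees return the same nearest neighbor, the aggregated neighborhood collapses to a single point, and the error stays at the noise level; when m ≪ n, different subsamples contribute different neighbors, the aggregated neighborhood spans roughly n/m points, and the variance drops.

```python
import random

random.seed(0)

def forest_mse_at_query(n, m, n_trees=60, trials=150):
    """Monte-Carlo mean squared error at the query x = 0.5 of an ensemble
    of 1-nearest-neighbor predictors ("fully grown trees"), each built on
    a random subsample of size m drawn without replacement.  The true
    regression function is 0 and Y is pure N(0, 1) noise, so the squared
    error measures the variance of the forest estimate."""
    errs = []
    for _ in range(trials):
        xs = [random.random() for _ in range(n)]
        ys = [random.gauss(0.0, 1.0) for _ in range(n)]
        dists = [abs(x - 0.5) for x in xs]
        preds = []
        for _ in range(n_trees):
            subsample = random.sample(range(n), m)
            nearest = min(subsample, key=lambda i: dists[i])
            preds.append(ys[nearest])
        errs.append((sum(preds) / n_trees) ** 2)
    return sum(errs) / trials

n = 300
mse_full = forest_mse_at_query(n, m=n)  # no subsampling: every tree returns the same neighbor
mse_sub = forest_mse_at_query(n, m=20)  # subsampling: trees aggregate many distinct neighbors
print(f"m = n:  MSE ~ {mse_full:.2f}")
print(f"m = 20: MSE ~ {mse_sub:.3f}")
```

In this sketch the m = n error stays near the noise variance 1, while the subsampled ensemble's error is far smaller, consistent with the intuition that diversity across trees, and not averaging alone, is what enlarges the effective local neighborhood.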
Based on this observation, one would hope that even forests of deep trees can have low enough variance and eventually become consistent. Our result implies that whether the intuition above holds or not depends on the size of the local neighborhood, controlled by the diameter of the generic (random) cell A(·): if the generic tree cell is too small compared to the data size, then aggregating tree cells will not do much better.

Proposition 1 (Forests of fully-grown trees with fast shrinking cells are inconsistent). Suppose that the data satisfy Eq. (2.1). Suppose additionally that (i) the distribution of X has a density f with respect to the Lebesgue measure on [0, 1]^d, and (ii) there exist constants fmin and fmax such that ∀x ∈ [0, 1]^d, 0 < fmin ≤ f(x) ≤ fmax < +∞. Consider the random forest estimator η̂n,V∞ built with totally randomized deep trees where, in addition, each tree leaf contains exactly one data point. If, with positive probability with respect to X, X[n] and Θ, there exists a deterministic sequence an of order 1/n^(1/d) such that diam(A(X)) ≤ an, then η̂n,V∞ is L2-inconsistent.

Prop. 1 is similar in spirit to Lin and Jeon (2006, Theorem 3), which is the first result connecting nearest-neighbor methods to random forests. There it was shown that forests with axis-aligned trees can be interpreted to yield sets of "potential nearest neighbors." Using this insight, the authors show that forests of deep axis-aligned trees without subsampling have a very slow convergence rate in mean squared error, of order 1/(log n)^(d−1), which is much worse than the optimal rate for regression, O(1/n^(2m/(2m+d))), by Stone (1980) (the parameter m controls the smoothness of the regression function η). To the best of our knowledge, this is the only previous result applying to non-artificial data models.
We adopt a different approach and directly relate the consistency of forests with the diameter of the generic tree cell. Prop. 1 is stronger than Lin and Jeon (2006), since it establishes inconsistency, whereas the latter only provides a lower bound on the convergence rate. In addition, Prop. 1 can be applied to any type of trees, including irregularly shaped trees, whereas the aforementioned result is only applicable to axis-aligned trees.

4.2 Forests with too severe subsampling can be inconsistent

In contrast to the "totally randomized tree" setup considered in Section 4.1, where subsampling is disabled, we now consider forests with severe subsampling: the subsample size remains constant as the data size grows to infinity.

Theorem 3 (Forests of undersampled fully-grown trees can be inconsistent). Suppose that the data satisfy Eq. (2.1) and that X has a bounded density. Suppose that the random forest estimator η̂n,V∞ has base trees that satisfy the following properties:

• Finite subsample size: each tree is constructed on a subsample (sampled with replacement, that is, bootstrapped) of the data S of size m, such that m does not vary with n;

• Fully-grown tree: each tree leaf has exactly one data point.

Then η̂n,V∞ is L2-inconsistent.

Theorem 3 applies Lemma 5 in the undersampled setup. The intuition here is that when the sample points are too "sparse," some cells must be large for the tree leaves to be non-empty (which is satisfied when trees are fully grown). Consequently, when a query point falls into such a leaf cell, with high probability it will be far away from the training data in the same cell, violating locality (see the right panel of Fig. 1). It is interesting to compare this result with Prop.
1, which relates the average diameter of a cell in the randomized tree to the tree diversity.

5 Discussion

We have shown that random forests of deep trees can be inconsistent with either no subsampling or too severe subsampling. One surprising consequence is that trees that work well for nearest-neighbor search problems can be bad candidates for forests without sufficient subsampling, due to a lack of diversity. Another implication is that even totally randomized trees can lead to overfitting forests, which disagrees with the conventional belief that injecting more "randomness" will prevent trees from overfitting (Geurts et al., 2006). In summary, our results indicate that subsampling plays an important role in random forests and may need to be tuned more carefully than other parameters.

There are interesting future directions to explore: (1) While we consider the extreme cases of no subsampling and constant subsample size, it would be interesting to explore whether inconsistency holds in the cases in between. Results in this direction would indicate how to choose the subsampling rate in practice. (2) In our analysis, we first let the number of trees T go to infinity, and then analyze the consistency of forests as n grows. In the future, it would also be interesting to study the finer interplay between T and n when both of them grow jointly. (3) Bootstrapping, that is, subsampling with replacement with subsample size equal to n, is a common practice in random forests. It differs subtly from the no-subsampling scheme and has been a matter of debate in the theory community (Biau, 2012). We believe that some of our inconsistency results can be extended to the bootstrap case.
For example, consider Theorem 2 in the bootstrap case: one would expect that the nearest-neighbor property of random projection trees holds on bootstrapped samples as well (in view of the central limit theorem for the bootstrapped empirical measure (Giné and Zinn, 1990)); when the bootstrap sample size equals n, the setup will thus not differ much from the no-subsampling setup, and inconsistency should follow.

Acknowledgements

The authors thank Debarghya Ghoshdastidar for his careful proofreading of a previous version of this article. This research has been supported by the German Research Foundation via the Research Unit 1735 "Structural Inference in Statistics: Adaptation and Efficiency" and the Institutional Strategy of the University of Tübingen (DFG ZUK 63).

References

S. Arlot and R. Genuer. Analysis of purely random forests bias. ArXiv preprint, 2014. Available at https://arxiv.org/abs/1407.3939.

M. Belgiu and L. Drăguț. Random forest in remote sensing: A review of applications and future directions. ISPRS Journal of Photogrammetry and Remote Sensing, 114:24–31, 2016.

G. Biau. Analysis of a random forests model. Journal of Machine Learning Research (JMLR), 13(1):1063–1095, 2012.

G. Biau and E. Scornet. A random forest guided tour. Test, 25(2):197–227, 2016.

G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research (JMLR), 9:2015–2033, 2008.

P. Billingsley. Probability and Measure. John Wiley & Sons, 2008.

S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

L. Breiman. Some Infinity Theory for Predictor Ensembles. Technical report, University of California, Berkeley, Statistics Department, 2000.

L. Breiman. Random forests.
Machine Learning, 45(1):5–32, 2001.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.

A. Criminisi and J. Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer, 2013.

S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In Proceedings of the 40th ACM Symposium on Theory of Computing (STOC), pages 537–546. ACM, 2008.

S. Dasgupta and K. Sinha. Randomized partition trees for nearest neighbor search. Algorithmica, 72(1):237–263, 2015.

M. Denil, D. Matheson, and N. de Freitas. Narrowing the gap: Random forests in theory and in practice. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 665–673, 2014.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

R. Díaz-Uriarte and S. Alvarez de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3–15, 2006.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics. Springer (NY), second edition, 2009.

R. Genuer, J. Poggi, and C. Tuleau-Malot. Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236, 2010.

P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

E. Giné and J. Zinn. Bootstrapping general empirical measures. The Annals of Probability, 18(2):851–869, 1990.

T. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.

S. Kpotufe and S. Dasgupta. A tree-based regressor that adapts to intrinsic dimension.
Journal of Computer and System Sciences, 78(5):1496–1515, 2012.

Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590, 2006.

B. H. Menze, B. M. Kelm, D. N. Splitthoff, U. Koethe, and F. A. Hamprecht. On oblique random forests. In D. Gunopulos, T. Hofmann, D. Malerba, and M. Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, pages 453–469, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.

J. J. Rodriguez, L. I. Kuncheva, and C. J. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619–1630, 2006.

E. Scornet. On the asymptotics of random forests. Journal of Multivariate Analysis (JMVA), 146(Supplement C):72–83, 2016. Special Issue on Statistical Models and Methods for High or Infinite Dimensional Spaces.

E. Scornet, G. Biau, and J.-P. Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015.

C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.

C. Stone. Optimal rates of convergence for nonparametric estimators. The Annals of Statistics, 8(6):1348–1360, 1980.

T. Tomita, M. Maggioni, and J. Vogelstein. Randomer Forests. ArXiv preprint, 2015. Available at https://arxiv.org/abs/1506.03410.

P. Yianilos.
Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 311–321, 1993.