{"title": "Universal consistency and minimax rates for online Mondrian Forests", "book": "Advances in Neural Information Processing Systems", "page_first": 3758, "page_last": 3767, "abstract": "We establish the consistency of an algorithm of Mondrian Forests~\\cite{lakshminarayanan2014mondrianforests,lakshminarayanan2016mondrianuncertainty}, a randomized classification algorithm that can be implemented online. First, we amend the original Mondrian Forest algorithm proposed in~\\cite{lakshminarayanan2014mondrianforests}, that considers a \\emph{fixed} lifetime parameter. Indeed, the fact that this parameter is fixed actually hinders statistical consistency of the original procedure. Our modified Mondrian Forest algorithm grows trees with increasing lifetime parameters $\\lambda_n$, and uses an alternative updating rule, allowing to work also in an online fashion. Second, we provide a theoretical analysis establishing simple conditions for consistency. Our theoretical analysis also exhibits a surprising fact: our algorithm achieves the minimax rate (optimal rate) for the estimation of a Lipschitz regression function, which is a strong extension of previous results~\\cite{arlot2014purf_bias} to an \\emph{arbitrary dimension}.", "full_text": "Universal consistency and minimax rates for online\n\nMondrian Forests\n\nJaouad Mourtada\n\nCentre de Math\u00e9matiques Appliqu\u00e9es\n\u00c9cole Polytechnique, Palaiseau, France\n\njaouad.mourtada@polytechnique.edu\n\nSt\u00e9phane Ga\u00efffas\n\nCentre de Math\u00e9matiques Appliqu\u00e9es\n\u00c9cole Polytechnique,Palaiseau, France\n\nst\u00e9phane.gaiffas@polytechnique.edu\n\nErwan Scornet\n\nCentre de Math\u00e9matiques Appliqu\u00e9es\n\u00c9cole Polytechnique,Palaiseau, France\nerwan.scornet@polytechnique.edu\n\nAbstract\n\nWe establish the consistency of an algorithm of Mondrian Forests [LRT14, LRT16],\na randomized classi\ufb01cation algorithm that can be implemented online. 
First, we amend the original Mondrian Forest algorithm proposed in [LRT14], which considers a fixed lifetime parameter. Indeed, the fact that this parameter is fixed hinders the statistical consistency of the original procedure. Our modified Mondrian Forest algorithm grows trees with increasing lifetime parameters λ_n, and uses an alternative updating rule, allowing it to also work in an online fashion. Second, we provide a theoretical analysis establishing simple conditions for consistency. Our theoretical analysis also exhibits a surprising fact: our algorithm achieves the minimax rate (optimal rate) for the estimation of a Lipschitz regression function, which is a strong extension of previous results [AG14] to an arbitrary dimension.

1 Introduction

Random Forests (RF) are state-of-the-art classification and regression algorithms that proceed by averaging the forecasts of a number of randomized decision trees grown in parallel (see [Bre01, Bre04, GEW06, BDL08, Bia12, BS16, DMdF14, SBV15]). Despite their widespread use and remarkable success in practical applications, the theoretical properties of such algorithms are still not fully understood [Bia12, DMdF14]. Among these methods, purely random forests [Bre00, BDL08, Gen12, AG14], which grow the individual trees independently of the sample, are particularly amenable to theoretical analysis; the consistency of such classifiers was obtained in [BDL08].

An important limitation of the most commonly used random forest algorithms, such as Breiman's Random Forest [Bre01] and the Extra-Trees algorithm [GEW06], is that they are typically trained in a batch manner, using the whole dataset to build the trees.
In order to enable their use in situations where large amounts of data have to be incorporated in a streaming fashion, several online adaptations of decision trees and RF algorithms have been proposed [DH00, TGP11, SLS+09, DMdF13].

Of particular interest in this article is the Mondrian Forest algorithm, an efficient and accurate online random forest classifier [LRT14]. This algorithm is based on the Mondrian process [RT09, Roy11], a natural probability distribution on the set of recursive partitions of the unit cube [0, 1]^d. An appealing property of Mondrian processes is that they can be updated in an online fashion: in [LRT14], the use of the conditional Mondrian process made it possible to design an online algorithm that matches its batch counterpart.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

While Mondrian Forests offer several advantages, both computational and in terms of predictive performance, the algorithm proposed in [LRT14] depends on a fixed lifetime parameter λ that guides the complexity of the trees. Since this parameter has to be set in advance, the resulting algorithm is inconsistent, as the complexity of the randomized trees remains bounded. Furthermore, an analysis of the learning properties of Mondrian Forests, and in particular of the influence and proper theoretical tuning of the lifetime parameter λ, is still lacking.

In this paper, we propose a modified online random forest algorithm based on Mondrian processes. Our algorithm retains the crucial property of the original method [LRT14] that the decision trees can be updated incrementally. However, contrary to the original approach, our algorithm uses an increasing sequence of lifetime parameters (λ_n)_{n≥1}, so that the corresponding trees are increasingly complex, and involves an alternative online updating algorithm.
We study such classification rules theoretically, establishing simple conditions on the sequence (λ_n)_{n≥1} to achieve consistency; see Theorem 1 in Section 5 below.

In fact, Mondrian Forests achieve much more than what they were designed for: while they were primarily introduced to derive an online algorithm, we show in Theorem 2 (Section 6) that they actually achieve minimax convergence rates for Lipschitz conditional probability (or regression) functions in arbitrary dimension. To the best of our knowledge, such results have only been proved for very specific purely random forests, where the covariate dimension is equal to one.

Related work. While random forests were introduced in the early 2000s [Bre01], as noted by [DMdF14] the theoretical analysis of these methods is outpaced by their practical use. The consistency of various simplified random forest algorithms was first established in [BDL08], as a byproduct of the consistency of individual tree classifiers. A recent line of research [Bia12, DMdF14, SBV15] has sought to obtain theoretical guarantees (i.e., consistency) for random forest variants that more closely resemble the algorithms used in practice. Another aspect of the theoretical study of random forests is the bias-variance analysis of simplified versions of random forests [Gen12, AG14], such as the purely random forests (PRF) model, which performs splits independently of the data. In particular, [Gen12] shows that some PRF variants achieve the minimax rate for the estimation of a Lipschitz regression function in dimension 1.
Additionally, the bias-variance analysis is extended in [AG14], showing that PRF can also achieve minimax rates for C^2 regression functions in dimension one, and considering higher-dimensional models of PRF that achieve suboptimal rates.

Starting with [SLS+09], online variants of the random forest algorithm have been considered. In [DMdF13], the authors propose an online random forest algorithm and prove its consistency. The procedure relies on a partitioning of the data into two streams: a structure stream (used to grow the tree structure) and an estimation stream (used to compute the prediction in each leaf). This separation of the data into two streams simplifies the proof of consistency, but leads to an unrealistic setting in practice.

A major development in the design of online random forests is the introduction of the Mondrian Forest (MF) classifier [LRT14, LRT16]. This algorithm makes elegant use of the Mondrian process, introduced in [RT09] (see also [Roy11, OR15]), to draw random trees. Indeed, this process provides a very convenient probability distribution over the set of recursive, tree-based partitions of the hypercube. In [BLG+16], the links between the Mondrian process and the Laplace kernel are used to design random features in order to efficiently approximate kernel ridge regression, leading to the so-called Mondrian kernel algorithm.

Our approach differs from the original Mondrian Forest algorithm [LRT14], since it introduces a "dual" construction that works in the "time" domain (lifetime parameters) instead of the "space" domain (feature ranges).
Indeed, in [LRT14], the splits are selected using a Mondrian process on the range of previously observed feature vectors, and the online updating of the trees is enabled by the possibility of extending a Mondrian process to a larger cell using conditional Mondrian processes. Our algorithm incrementally grows the trees by extending the lifetime; the online update of the trees exploits the Markov property of the Mondrian process, a consequence of its formulation in terms of competing exponential clocks.

2 Setting and notation

We first describe the setting in which the consistency of our procedure is stated, and we introduce notation for the main concepts used in the paper, namely trees, forests and partitions.

Considered setting. Assume we are given an i.i.d. sequence (X_1, Y_1), (X_2, Y_2), ... of [0, 1]^d × {0, 1}-valued random variables that come sequentially, such that each (X_i, Y_i) has the same distribution as (X, Y). This unknown distribution is characterized by the distribution µ of X on [0, 1]^d and the conditional probability η(x) = P(Y = 1 | X = x).

At each time step n ≥ 1, we want to output a {0, 1}-valued randomized classification rule g_n(·, Z, D_n) : [0, 1]^d → {0, 1}, where D_n = ((X_1, Y_1), ..., (X_n, Y_n)) and Z is a random variable that accounts for the randomization procedure; to simplify notation, we will generally denote ĝ_n(x, Z) = g_n(x, Z, D_n). The quality of a randomized classifier g_n is measured by its probability of error

    L(g_n) = P(g_n(X, Z, D_n) ≠ Y | D_n) = P_{(X,Y),Z}(g_n(X, Z, D_n) ≠ Y),    (1)

where P_{(X,Y),Z} denotes integration with respect to (X, Y) and Z alone. The quantity in Equation (1) is minimized by the Bayes classifier g*(x) = 1{η(x) > 1/2}, and its loss, the Bayes error, is denoted L* = L(g*).
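To make these definitions concrete, consider a small numerical illustration (ours, not from the paper): with X uniform on [0, 1] and η(x) = x, the Bayes classifier is g*(x) = 1{x > 1/2} and the Bayes error is L* = ∫₀¹ min(x, 1 − x) dx = 1/4, which a short Monte Carlo simulation recovers:

```python
import random

# Toy example: X ~ Uniform[0, 1], eta(x) = P(Y = 1 | X = x) = x.
# Bayes classifier: g*(x) = 1{eta(x) > 1/2} = 1{x > 1/2};
# Bayes error: L* = E[min(eta(X), 1 - eta(X))] = 1/4.

def bayes_classifier(x):
    return 1 if x > 0.5 else 0

def bayes_error_mc(n_samples=200_000, seed=0):
    """Monte Carlo estimate of the error probability of the Bayes classifier."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_samples):
        x = rng.random()
        y = 1 if rng.random() < x else 0   # Y | X = x ~ Bernoulli(eta(x))
        errors += (bayes_classifier(x) != y)
    return errors / n_samples

print(bayes_error_mc())   # close to L* = 0.25
```

Any classification rule g_n satisfies L(g_n) ≥ L* on this example, and a consistent rule approaches 0.25 as n grows.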
We say that a sequence of classification rules (g_n)_{n≥1} is consistent whenever L(g_n) → L* in probability as n → ∞.

Remark 1. We restrict ourselves to binary classification; note, however, that our results and proofs can be extended to multi-class classification.

Trees and Forests. The classification rules (g_n)_{n≥1} we consider take the form of a random forest, defined by averaging randomized tree classifiers. More precisely, let K ≥ 1 be a fixed number of randomized classifiers ĝ_n(x, Z_1), ..., ĝ_n(x, Z_K) associated to the same randomized mechanism, where the Z_k are i.i.d. Set Z^(K) = (Z_1, ..., Z_K). The averaging classifier ĝ_n^(K)(x, Z^(K)) is defined by taking the majority vote among the values g_n(x, Z_k), k = 1, ..., K.

Our individual randomized classifiers are decision trees. A decision tree (T, Σ) is composed of the following components:

• A finite rooted ordered binary tree T, with nodes N(T), interior nodes N°(T) and leaves L(T) (so that N(T) is the disjoint union of N°(T) and L(T)). Each interior node η has a left child left(η) and a right child right(η);
• A family of splits Σ = (σ_η)_{η∈N°(T)} at each interior node, where each split σ_η = (d_η, ν_η) is characterized by its split dimension d_η ∈ {1, ..., d} and its threshold ν_η.

Each randomized classifier ĝ_n(x, Z_k) relies on a decision tree T; the random variable Z_k is the random sampling of the splits (σ_η) defining T. This sampling mechanism, based on the Mondrian process, is defined in Section 3.

We associate to M = (T, Σ) a partition (A_φ)_{φ∈L(T)} of the unit cube [0, 1]^d, called a tree partition (or guillotine partition).
For each node η ∈ N(T), we define a hyper-rectangular region A_η recursively:

• The cell associated to the root of T is [0, 1]^d;
• For each η ∈ N°(T), we define

    A_{left(η)} := {x ∈ A_η : x_{d_η} ≤ ν_η}  and  A_{right(η)} := A_η \ A_{left(η)}.

The leaf cells (A_φ)_{φ∈L(T)} form a partition of [0, 1]^d by construction. In the sequel, we will identify a tree with splits (T, Σ) with the associated tree partition M(T, Σ), and a node η ∈ N(T) with the cell A_η ⊂ [0, 1]^d. The decision tree classifier outputs a constant prediction of the label in each leaf cell A_η, using a simple majority vote of the labels Y_i (1 ≤ i ≤ n) such that X_i ∈ A_η.

3 A new online Mondrian Forest algorithm

We describe the Mondrian process in Section 3.1, and recall the original Mondrian Forest procedure in Section 3.2. Our procedure is introduced in Section 3.3.

3.1 The Mondrian process

The probability distribution we consider on tree-based partitions of the unit cube [0, 1]^d is the Mondrian process, introduced in [RT09]. Given a rectangular box C = ∏_{j=1}^d [a_j, b_j], we denote |C| := ∑_{j=1}^d (b_j − a_j) its linear dimension.
The Mondrian process distribution MP(λ, C) is the distribution of the random tree partition of C obtained by the sampling procedure SampleMondrian(λ, C) from Algorithm 1.

Algorithm 1 SampleMondrian(λ, C) ; samples a tree partition distributed as MP(λ, C).
1: Parameters: A rectangular box C ⊂ R^d and a lifetime parameter λ > 0.
2: Call SplitCell(C, τ_C := 0, λ).

Algorithm 2 SplitCell(A, τ, λ) ; recursively splits a cell A, starting from time τ, until λ.
1: Parameters: A cell A = ∏_{1≤j≤d} [a_j, b_j], a starting time τ and a lifetime parameter λ.
2: Sample an exponential random variable E_A with intensity |A|.
3: if τ + E_A ≤ λ then
4:   Draw at random a split dimension J ∈ {1, ..., d}, with P(J = j) = (b_j − a_j)/|A|, and a split threshold ν_J uniformly in [a_J, b_J].
5:   Split A along the split (J, ν_J).
6:   Call SplitCell(left(A), τ + E_A, λ) and SplitCell(right(A), τ + E_A, λ).
7: else
8:   Do nothing.
9: end if

3.2 Online tree growing: the original scheme

In order to implement an online algorithm, it is crucial to be able to "update" the tree partitions grown at a given time step. The original Mondrian Forest algorithm [LRT14] uses a slightly different randomization mechanism, namely a Mondrian process supported on the range defined by the past feature points. More precisely, this modification amounts to replacing each call to SplitCell(A, τ, λ) by a call to SplitCell(A_range(n), τ, λ), where A_range(n) is the range of the feature points X_1, ..., X_n that fall in A (i.e., the smallest box that contains them). When a new training point (X_{n+1}, Y_{n+1}) arrives, the ranges of the training points may change.
The online update of the tree partition then relies on the extension properties of the Mondrian process: given a Mondrian partition M_1 ∼ MP(λ, C_1) on a box C_1, it is possible to efficiently sample a Mondrian partition M_0 ∼ MP(λ, C_0) on a larger box C_0 ⊃ C_1 that restricts to M_1 on the cell C_1 (this is called a "conditional Mondrian"; see [RT09]).

Remark 2. In [LRT14], a lifetime parameter λ = ∞ is actually used in experiments, which essentially amounts to growing the trees completely, until the leaves are homogeneous. We will not analyze this variant here, but it illustrates the problem of fixing a finite budget λ in advance.

3.3 Online tree growing: a dual approach

An important limitation of the original scheme is that it requires the lifetime parameter λ to be fixed in advance. In order to obtain a consistent algorithm, it is necessary to grow increasingly complex trees. To achieve this, we propose to adopt a "dual" point of view: instead of using a Mondrian process with fixed lifetime on a domain that changes as new data points are added, we use a Mondrian process on a fixed domain (the cube [0, 1]^d) but with a varying lifetime λ_n that grows with the sample size n. The rationale is that, as more data becomes available, the classifiers should be more complex and precise. Since the lifetime, rather than the domain, is the parameter that guides the complexity of the trees, it should be this parameter that dynamically adapts to the amount of training data.

It turns out that in this approach, quite surprisingly, the trees can be updated incrementally, leading to an online algorithm.
The ability to extend a tree partition M_{λ_n} ∼ MP(λ_n, [0, 1]^d) into a finer tree partition M_{λ_{n+1}} ∼ MP(λ_{n+1}, [0, 1]^d) relies on a different property of the Mondrian process, namely the fact that for λ < λ′, it is possible to efficiently sample a Mondrian tree partition M_{λ′} ∼ MP(λ′, C) given its pruning M_λ ∼ MP(λ, C) at time λ (obtained by dropping all splits of M_{λ′} performed at a time τ > λ).

The procedure ExtendMondrian(M_λ, λ, λ′) from Algorithm 3 extends a Mondrian tree partition M_λ ∼ MP(λ, C) to a tree partition M_{λ′} ∼ MP(λ′, C).

Algorithm 3 ExtendMondrian(M_λ, λ, λ′) ; extends M_λ ∼ MP(λ, C) to M_{λ′} ∼ MP(λ′, C).
1: Parameters: A tree partition M_λ, and lifetimes λ ≤ λ′.
2: for A in L(M_λ) do
3:   Call SplitCell(A, λ, λ′)
4: end for

Indeed, for each leaf cell A of M_λ, the fact that A is a leaf of M_λ means that, during the sampling of M_λ, the time of the next candidate split τ + E_A (where τ is the time A was formed and E_A ∼ Exp(|A|)) was strictly larger than λ. Now, in the procedure ExtendMondrian(M_λ, λ, λ′), the time of the next candidate split is λ + E′_A, where E′_A ∼ Exp(|A|). This is precisely where the trick resides: by the memoryless property of the exponential distribution, the distribution of τ_A + E_A conditionally on E_A > λ − τ_A is the same as that of λ + E′_A.
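For concreteness, Algorithms 1–3 can be sketched in a few lines of Python. This is a simplified illustration under a tree representation of our own choosing (the names `Cell`, `split_cell`, `sample_mondrian` and `extend_mondrian` are ours, not taken from any released implementation):

```python
import random

class Cell:
    """A node of a Mondrian tree partition: a box prod_j [a_j, b_j]."""
    def __init__(self, lower, upper, birth_time):
        self.lower, self.upper = list(lower), list(upper)
        self.birth_time = birth_time      # time tau at which the cell was formed
        self.split = None                 # (dimension, threshold) once split
        self.left = self.right = None

    def linear_dim(self):
        # |A| = sum_j (b_j - a_j)
        return sum(b - a for a, b in zip(self.lower, self.upper))

    def leaves(self):
        if self.split is None:
            return [self]
        return self.left.leaves() + self.right.leaves()

def split_cell(cell, tau, lam, rng):
    """Algorithm 2: recursively split `cell` from time tau until lifetime lam."""
    e_a = rng.expovariate(cell.linear_dim())   # E_A ~ Exp(|A|)
    if tau + e_a <= lam:
        sides = [b - a for a, b in zip(cell.lower, cell.upper)]
        # Split dimension J with P(J = j) proportional to (b_j - a_j),
        # threshold uniform in [a_J, b_J].
        j = rng.choices(range(len(sides)), weights=sides)[0]
        nu = rng.uniform(cell.lower[j], cell.upper[j])
        cell.split = (j, nu)
        left_upper = list(cell.upper); left_upper[j] = nu
        right_lower = list(cell.lower); right_lower[j] = nu
        cell.left = Cell(cell.lower, left_upper, tau + e_a)
        cell.right = Cell(right_lower, cell.upper, tau + e_a)
        split_cell(cell.left, tau + e_a, lam, rng)
        split_cell(cell.right, tau + e_a, lam, rng)

def sample_mondrian(lam, d, rng):
    """Algorithm 1: sample a tree partition of [0, 1]^d distributed as MP(lam, [0,1]^d)."""
    root = Cell([0.0] * d, [1.0] * d, 0.0)
    split_cell(root, 0.0, lam, rng)
    return root

def extend_mondrian(root, lam, lam_prime, rng):
    """Algorithm 3: extend M_lam ~ MP(lam, C) into M_lam' ~ MP(lam', C).
    By the memoryless property of the exponential distribution, restarting
    each leaf's clock at time lam yields the correct distribution."""
    for leaf in root.leaves():
        split_cell(leaf, lam, lam_prime, rng)

rng = random.Random(42)
tree = sample_mondrian(1.0, 2, rng)
n_before = len(tree.leaves())
extend_mondrian(tree, 1.0, 3.0, rng)
print(n_before, "->", len(tree.leaves()))   # the partition can only get finer
```

Note that `extend_mondrian` only touches the current leaves, which is what makes the incremental update cheap: the interior of the tree sampled at lifetime λ is left untouched.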
The procedure ExtendMondrian can be replaced by the following more efficient implementation:

• The time of the next split of the tree is sampled as λ + E_{M_λ}, with E_{M_λ} ∼ Exp(∑_{φ∈L(M_λ)} |A_φ|);
• The leaf to split is chosen using a top-down path from the root of the tree, where the choice between the left and right child of each interior node is sampled at random, proportionally to the total linear dimension of the leaves in the subtree defined by each child.

Remark 3. While we consider Mondrian partitions on the fixed domain [0, 1]^d, our increasing-lifetime trick can be used in conjunction with a varying domain based on the range of the data (as in the original MF algorithm), simply by applying ExtendMondrian(M_{λ_n}, λ_n, λ_{n+1}) after having extended the Mondrian to the new range. In order to keep the analysis tractable and avoid unnecessary complications, we will study the procedure on a fixed domain only.

Given an increasing sequence (λ_n)_{n≥1} of lifetime parameters, our modified MF algorithm incrementally updates the trees M_{λ_n}^(k) for k = 1, ..., K by calling ExtendMondrian(M_{λ_n}^(k), λ_n, λ_{n+1}), and combines the forecasts of the given trees, as explained in Algorithm 4.

Algorithm 4 MondrianForest(K, (λ_n)_{n≥1}) ; trains a Mondrian Forest classifier.
1: Parameters: The number of trees K and the lifetime sequence (λ_n)_{n≥1}.
2: Initialization: Start with K trivial partitions M_{λ_0}^(k), λ_0 := 0, k = 1, ..., K. Set the counts of the training labels in each cell to 0, and set the cell labels arbitrarily (e.g. to 0).
3: for n = 1, 2, ... do
4:   Receive the training point (X_n, Y_n).
5:   for k = 1, ..., K do
6:     Update the counts of 0 and 1 (depending on Y_n) in the leaf cell of X_n in M_{λ_{n-1}}^(k).
7:     Call ExtendMondrian(M_{λ_{n-1}}^(k), λ_{n-1}, λ_n).
8:     Fit the newly created leaves.
9:   end for
10: end for

For the prediction of the label given a new feature vector, our algorithm uses a majority vote over the predictions given by all K trees. However, other choices are possible. For instance, the original Mondrian Forest algorithm [LRT14] places a hierarchical Bayesian prior over the label distribution on each node of the tree, and performs approximate posterior inference using the so-called interpolated Kneser-Ney (IKN) smoothing. Another possibility, to be developed in an extended version of this work, is tree expert aggregation methods, such as the Context-Tree Weighting (CTW) algorithm [WST95, HS97] or specialist aggregation methods [FSSW97] over the nodes of the tree, adapting them to increasingly complex trees.

Our modification of the original Mondrian Forest replaces online tree growing with a fixed lifetime by a new process that allows the lifetime to increase. This modification not only makes it possible to prove consistency, but, more surprisingly, leads to an optimal estimation procedure in terms of minimax rates, as illustrated in Sections 5 and 6 below.

4 Mondrian Forests with fixed lifetime are inconsistent

We state in Proposition 1 the inconsistency of fixed-lifetime Mondrian Forests, such as the original algorithm [LRT14]. This negative result justifies our modified algorithm based on an increasing sequence of lifetimes (λ_n)_{n≥1}.

Proposition 1. The Mondrian Forest algorithm (Algorithm 4) with a fixed lifetime sequence λ_n = λ is inconsistent: there exists a distribution of (X, Y) ∈ [0, 1] × {0, 1} such that L* = 0 and L(g_n) = P(g_n(X) ≠ Y) does not tend to 0.
This result also holds true for the original Mondrian Forest algorithm with lifetime λ.

Proposition 1 is established in Appendix C. The proof uses a result of independent interest (Lemma 3), which states that asymptotically in the sample size, for fixed λ, the restricted domain does not affect the randomization procedure.

5 Consistency of Mondrian Forests with lifetime sequence (λ_n)

The consistency of the Mondrian Forest used with a properly tuned sequence (λ_n) is established in Theorem 1 below.

Theorem 1. Assume that λ_n → ∞ and that λ_n^d / n → 0. Then, the online Mondrian Forest described in Algorithm 4 is consistent.

This consistency result is universal, in the sense that it makes no assumption on the distribution of X nor on the conditional probability η. This contrasts with some consistency results on random forests, such as Theorem 1 of [DMdF13], which assumes that the density of X is bounded from above and below.

Theorem 1 does not require an assumption on the number of trees K. It is well known for batch Random Forests that this meta-parameter is not a sensitive tuning parameter, and that it suffices to choose it large enough to obtain good accuracy. The only important parameter is the sequence (λ_n), which encodes the complexity of the trees. Requiring an assumption on this parameter is natural, and is confirmed by the well-known fact that the tree depth is the most important tuning parameter for batch Random Forests; see for instance [BS16].

The proof of Theorem 1 can be found in the supplementary material (see Appendix D). The core of the argument lies in two lemmas describing two novel properties of Mondrian trees. Lemma 1 below provides an upper bound of order O(λ^{-1}) on the diameter of the cell A_λ(x) of a Mondrian partition M_λ ∼ MP(λ, [0, 1]^d).
This is the key to controlling the bias of Mondrian Forests with a lifetime sequence that tends to infinity.

Lemma 1 (Cell diameter). Let x ∈ [0, 1]^d, and let D_λ(x) be the ℓ2-diameter of the cell containing x in a Mondrian partition M_λ ∼ MP(λ, [0, 1]^d). If λ → ∞, then D_λ(x) → 0 in probability. More precisely, for every δ, λ > 0, we have

    P(D_λ(x) ≥ δ) ≤ d (1 + λδ/√d) exp(−λδ/√d)    (2)

and

    E[D_λ(x)^2] ≤ 4d/λ^2.    (3)

The proof of Lemma 1 is provided in the supplementary material (see Appendix A). The second important property needed to carry out the analysis, stated in Lemma 2, helps to control the "variance" of Mondrian forests. It consists in an upper bound of order O(λ^d) on the total number of splits performed by a Mondrian partition M_λ ∼ MP(λ, [0, 1]^d). This ensures that enough data points fall in each cell of the tree, so that the labels of the tree are well estimated. The proof of Lemma 2 can be found in the supplementary material (see Appendix B).

Lemma 2 (Number of splits). If K_λ denotes the number of splits performed by a Mondrian tree partition M_λ ∼ MP(λ, [0, 1]^d), we have E(K_λ) ≤ (e(λ + 1))^d.

Remark 4. It is worth noting that controlling the total number of splits ensures that the cell A_{λ_n}(X) in which a new random X ∼ µ ends up contains enough training points among X_1, ..., X_n (see Lemma 4 in Appendix D). This enables us to obtain a distribution-free consistency result.
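As an aside, the bound of Lemma 2 lends itself to a quick numerical sanity check (a simulation of our own, not part of the paper): sample many Mondrian partitions of [0, 1]^2 with lifetime λ = 1, count the splits, and compare the empirical mean with (e(λ + 1))^d:

```python
import math
import random

def count_splits(lower, upper, tau, lam, rng):
    """Number of splits produced by SplitCell (Algorithm 2) on the box
    prod_j [lower_j, upper_j], starting from time tau, until lifetime lam."""
    sides = [b - a for a, b in zip(lower, upper)]
    e_a = rng.expovariate(sum(sides))          # E_A ~ Exp(|A|)
    if tau + e_a > lam:
        return 0
    j = rng.choices(range(len(sides)), weights=sides)[0]
    nu = rng.uniform(lower[j], upper[j])
    left_upper = list(upper); left_upper[j] = nu
    right_lower = list(lower); right_lower[j] = nu
    return (1 + count_splits(lower, left_upper, tau + e_a, lam, rng)
              + count_splits(right_lower, upper, tau + e_a, lam, rng))

def mean_splits(lam, d, n_reps=2000, seed=0):
    rng = random.Random(seed)
    return sum(count_splits([0.0] * d, [1.0] * d, 0.0, lam, rng)
               for _ in range(n_reps)) / n_reps

lam, d = 1.0, 2
bound = (math.e * (lam + 1)) ** d              # (e(lam + 1))^d, about 29.6
print(mean_splits(lam, d), "<=", bound)
```

For λ = 1 and d = 2 the empirical mean is around 3 splits, comfortably below the (loose) bound of roughly 29.6.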
Another approach consists in lower-bounding the volume V_{λ_n}(x) of A_{λ_n}(x) in probability for any x ∈ [0, 1]^d, which shows that the cell A_{λ_n}(x) contains enough training points, but this would require the extra assumption that the density of X is lower-bounded.

Remarkably, owing to the nice restriction properties of the Mondrian process, Lemmas 1 and 2 essentially provide matching upper and lower bounds on the complexity of the partition. Indeed, in order to partition the cube [0, 1]^d into cells of diameter O(1/λ), at least Θ(λ^d) cells are needed; Lemma 2 shows that the Mondrian partition in fact contains only O(λ^d) cells.

6 Minimax rates over the class of Lipschitz functions

The estimates obtained in Lemmas 1 and 2 are quite explicit and sharp in their dependency on λ, and allow us to study the convergence rate of our algorithm. Indeed, it turns out that our modified Mondrian Forest, when properly tuned, achieves the minimax rate in classification over the class of Lipschitz functions (see e.g. Chapter I.3 in [Nem00] for details on minimax rates). We provide two results: a convergence rate for the estimation of the conditional probabilities, measured by the quadratic risk (Theorem 2), and a control on the distance between the classification error of our classifier and the Bayes error (Theorem 3). We also provide similar minimax bounds for the regression setting in the supplementary material; see Proposition 4 in Appendix E.

Let η̂_n be the estimate of the conditional probability η based on the Mondrian Forest (see Algorithm 4) in which:

(i) each leaf label is computed as the proportion of 1's in the corresponding leaf;
(ii) the forest prediction results from the average of the tree estimates instead of a majority vote.

Theorem 2. Assume that the conditional probability function η : [0, 1]^d → [0, 1] is Lipschitz on [0, 1]^d. Let η̂_n be a Mondrian Forest estimate as defined in points (i) and (ii), with a lifetime sequence that satisfies λ_n ≍ n^{1/(d+2)}. Then, the following upper bound holds for n large enough:

    E(η(X) − η̂_n(X))^2 = O(n^{−2/(d+2)}),    (4)

which corresponds to the minimax rate over the set of Lipschitz functions.

To the best of our knowledge, Theorem 2 is the first result showing that a classification method based on a purely random forest can be minimax optimal in arbitrary dimension. The same kind of result is stated for regression estimation in the supplementary material (see Proposition 4 in Appendix E).

Minimax rates, but only for d = 1, were obtained in [Gen12, AG14] for models of purely random forests such as Toy-PRF (where the individual partitions correspond to random shifts of the regular partition of [0, 1] in k intervals) and PURF (Purely Uniformly Random Forests, where the partitions are obtained by drawing k thresholds at random in [0, 1]).

However, for d = 1, tree partitions reduce to partitions of [0, 1] in intervals, and do not possess the recursive structure that appears in higher dimensions and makes their precise analysis difficult. For this reason, the analysis of purely random forests for d > 1 has typically produced sub-optimal results: for example, [BDL08] show consistency for UBPRF (Unbalanced Purely Random Forests, which perform a fixed number of splits and randomly choose a leaf to split at each step), but with no rate of convergence. A further step was made by [AG14], who studied BPRF (Balanced Purely Random Forests, where all leaves are split, so that the resulting tree is complete), and obtained suboptimal rates.
In our approach, the convenient properties of the Mondrian process, thanks to its recursive structure, enable us to bypass the inherent difficulties met in previous attempts and to obtain the minimax rate with a transparent proof.

Now, note that the Mondrian Forest classifier corresponds to the plug-in classifier ĝ_n(x) = 1{η̂_n(x) > 1/2}, where η̂_n is defined in points (i) and (ii). A general theorem (Theorem 6.5 in [DGL96]) allows us to derive, thanks to Theorem 2, upper bounds on the distance between the classification error of ĝ_n and the Bayes error.

Theorem 3. Under the same assumptions as in Theorem 2, the Mondrian Forest classifier ĝ_n with lifetime sequence λ_n ≍ n^{1/(d+2)} satisfies

    L(ĝ_n) − L* = o(n^{−1/(d+2)}).    (5)

The rate of convergence o(n^{−1/(d+2)}) for the error probability with a Lipschitz conditional probability η turns out to be optimal, as shown by [Yan99]. Note that faster rates can be achieved in classification under low-noise assumptions such as the margin assumption [MT99] (see e.g. [Tsy04, AT07, Lec07]). Such specializations of our results are left for future work, the aim of the present paper being an emphasis on the appealing optimality properties of our modified Mondrian Forest.

7 Experiments

We now turn to the empirical evaluation of our algorithm, and examine its predictive performance (test error) as a function of the training size.
More precisely, we compare the modified Mondrian Forest algorithm (Algorithm 4) to batch (Breiman RF [Bre01], Extra-Trees-1 [GEW06]) and online (the Mondrian Forest algorithm [LRT14] with fixed lifetime parameter λ) Random Forests algorithms. We compare the prediction accuracy (on the test set) of the aforementioned algorithms trained on varying fractions of the training data, from 10% to 100%.

Regarding our choice of competitors, we note that Breiman's RF is well-established and known to achieve state-of-the-art performance. We also included the Extra-Trees-1 (ERT-1) algorithm [GEW06], which is most comparable to the Mondrian Forest classifier since it also draws splits randomly (we note that the ERT-k algorithm [GEW06] with the default tuning k = √d in the scikit-learn implementation [PVG+11] achieves scores very close to those of Breiman's RF).

In the case of online Mondrian Forests, we included our modified Mondrian Forest classifier with an increasing lifetime parameter λn = n^{1/(d+2)}, tuned according to the theoretical analysis (see Theorem 3), as well as a Mondrian Forest classifier with constant lifetime parameter λ = 2. Note that while a higher choice of λ would have resulted in a performance closer to that of the modified version (with increasing λn), our inconsistency result (Proposition 1) shows that its error would eventually stagnate given more training samples. In both cases, the splits are drawn within the range of the training features, as in the original Mondrian Forest algorithm. Our results are reported in Figure 1.

Figure 1: Prediction accuracy as a function of the fraction of data used on several datasets. Modified MF (Algorithm 4) outperforms MF with a constant lifetime, and is better than the batch ERT-1 algorithm.
It also performs almost as well as Breiman's RF (a batch algorithm that uses the whole training dataset in order to choose each split) on several datasets, while being incremental and much faster to train. On the dna dataset, as noted in [LRT14], Breiman's RF outperforms the other algorithms because of the presence of a large number of irrelevant features.

[Figure 1: four panels (letter, satimage, usps, dna) plotting prediction accuracy against the fraction of training data used, for Breiman RF, Extra-Trees-1, Mondrian (increasing lifetime) and Mondrian (fixed lifetime).]

8 Conclusion and future work

Despite their widespread use in practice, the theoretical understanding of Random Forests is still incomplete. In this work, we show that amending the Mondrian Forest classifier, originally introduced to provide an efficient online algorithm, leads to an algorithm that is not only consistent, but in fact minimax optimal for Lipschitz conditional probabilities in arbitrary dimension. This new result suggests promising improvements in the understanding of random forests methods.

A first, natural extension of our results, to be addressed in future work, is the study of rates for smoother regression functions. Indeed, we conjecture that, through a more refined study of the local properties of Mondrian partitions, it is possible to describe exactly the distribution of the cell containing a given point. In the spirit of the work of [AG14] in dimension one, this could be used to show improved rates for the bias of forests (e.g. for C^2 regression functions) compared to the tree bias, and hence give some theoretical insight into the empirically well-known fact that a forest performs better than individual trees.

Second, the optimal upper bound O(n^{−1/(d+2)}) obtained in this paper is very slow when the number of features d is large.
This comes from the well-known curse of dimensionality, a phenomenon affecting all fully nonparametric algorithms. A standard technique used in high-dimensional settings is to work under a sparsity assumption, where only s ≪ d features are informative (i.e. affect the distribution of Y). In such settings, a natural strategy is to select the splits using the labels Y1, . . . , Yn, as most variants of Random Forests used in practice do. For example, it would be interesting to combine a Mondrian process-based randomization with a choice of the best split among several candidates, as performed by the Extra-Trees algorithm [GEW06]. Since the Mondrian Forest guarantees minimax rates, we conjecture that it could improve the feature selection of batch random forest methods, and improve the underlying randomization mechanism of these algorithms. From a theoretical perspective, it would be interesting to see how the minimax rates obtained here can be coupled with results on the ability of forests to select informative variables, see for instance [SBV15].

References

[AG14] Sylvain Arlot and Robin Genuer. Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939, 2014.

[AT07] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.

[BDL08] Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9:2015–2033, 2008.

[Bia12] Gérard Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13(1):1063–1095, 2012.

[BLG+16] Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and Yee W. Teh. The Mondrian kernel. In 32nd Conference on Uncertainty in Artificial Intelligence (UAI), 2016.

[Bre00] Leo Breiman. Some infinity theory for predictor ensembles.
Technical Report 577, Statistics Department, University of California, Berkeley, 2000.

[Bre01] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[Bre04] Leo Breiman. Consistency for a simple model of random forests. Technical Report 670, Statistics Department, University of California, Berkeley, 2004.

[BS16] Gérard Biau and Erwan Scornet. A random forest guided tour. TEST, 25(2):197–227, 2016.

[DGL96] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics. Springer-Verlag, 1996.

[DH00] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the 6th SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, 2000.

[DMdF13] Misha Denil, David Matheson, and Nando de Freitas. Consistency of online random forests. In Proceedings of the 30th Annual International Conference on Machine Learning (ICML), pages 1256–1264, 2013.

[DMdF14] Misha Denil, David Matheson, and Nando de Freitas. Narrowing the gap: Random forests in theory and in practice. In Proceedings of the 31st Annual International Conference on Machine Learning (ICML), pages 665–673, 2014.

[FSSW97] Yoav Freund, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. Using and combining predictors that specialize. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages 334–343, 1997.

[Gen12] Robin Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24(3):543–562, 2012.

[GEW06] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

[HS97] David P. Helmbold and Robert E. Schapire. Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27(1):51–68, 1997.

[Lec07] Guillaume Lecué.
Optimal rates of aggregation in classification under low noise assumption. Bernoulli, 13(4):1000–1022, 2007.

[LRT14] Balaji Lakshminarayanan, Daniel M. Roy, and Yee W. Teh. Mondrian forests: Efficient online random forests. In Advances in Neural Information Processing Systems 27, pages 3140–3148. Curran Associates, Inc., 2014.

[LRT16] Balaji Lakshminarayanan, Daniel M. Roy, and Yee W. Teh. Mondrian forests for large-scale regression when uncertainty matters. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

[MT99] Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

[Nem00] Arkadi Nemirovski. Topics in non-parametric statistics. Lectures on Probability Theory and Statistics: École d'Été de Probabilités de Saint-Flour XXVIII-1998, 28:85–277, 2000.

[OR15] Peter Orbanz and Daniel M. Roy. Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):437–461, 2015.

[PVG+11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[Roy11] Daniel M. Roy. Computability, inference and modeling in probabilistic programming. PhD thesis, Massachusetts Institute of Technology, 2011.

[RT09] Daniel M. Roy and Yee W. Teh. The Mondrian process. In Advances in Neural Information Processing Systems 21, pages 1377–1384. Curran Associates, Inc., 2009.

[SBV15] Erwan Scornet, Gérard Biau, and Jean-Philippe Vert. Consistency of random forests.
The Annals of Statistics, 43(4):1716–1741, 2015.

[SLS+09] Amir Saffari, Christian Leistner, Jacob Santner, Martin Godec, and Horst Bischof. On-line random forests. In 3rd IEEE ICCV Workshop on On-line Computer Vision, 2009.

[TGP11] Matthew A. Taddy, Robert B. Gramacy, and Nicholas G. Polson. Dynamic trees for learning and design. Journal of the American Statistical Association, 106(493):109–123, 2011.

[Tsy04] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

[WST95] Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.

[Yan99] Yuhong Yang. Minimax nonparametric classification. I. Rates of convergence. IEEE Transactions on Information Theory, 45(7):2271–2284, 1999.