{"title": "PAC-Bayes Tree: weighted subtrees with guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 9484, "page_last": 9492, "abstract": "We present a weighted-majority classification approach over subtrees of a fixed tree, which provably achieves excess-risk of the same order as the best tree-pruning. Furthermore, the computational efficiency of pruning is maintained at both training and testing time despite having to aggregate over an exponential number of subtrees. We believe this is the first subtree aggregation approach with such guarantees.", "full_text": "PAC-Bayes Tree: weighted subtrees with guarantees\n\nTin Nguyen\u2217\nMIT EECS\n\ntdn@mit.edu\n\nSamory Kpotufe\n\nPrinceton University ORFE\nsamory@princeton.edu\n\nAbstract\n\nWe present a weighted-majority classi\ufb01cation approach over subtrees of a \ufb01xed\ntree, which provably achieves excess-risk of the same order as the best tree-pruning.\nFurthermore, the computational ef\ufb01ciency of pruning is maintained at both training\nand testing time despite having to aggregate over an exponential number of subtrees.\nWe believe this is the \ufb01rst subtree aggregation approach with such guarantees.\nThe guarantees are obtained via a simple combination of insights from PAC-Bayes\ntheory, which we believe should be of independent interest, as it generically implies\nconsistency for weighted-voting classi\ufb01ers w.r.t. Bayes \u2013 while, in contrast, usual\nPAC-bayes approaches only establish consistency of Gibbs classi\ufb01ers.\n\n1\n\nIntroduction\n\nClassi\ufb01cation trees endure as popular tools in data analysis, offering both ef\ufb01cient prediction and\ninterpretability \u2013 yet they remain hard to analyze in general. 
So far there are two main approaches with generalization guarantees: in both approaches, a large tree (possibly overfitting the data) is first obtained; one approach is then to prune this tree back down to a subtree (see footnote 2) that generalizes better; the alternative approach is to combine all possible subtrees of the tree by weighted majority vote. Interestingly, while both approaches are competitive with other practical heuristics, it remains unclear whether the alternative of weighting subtrees enjoys the same strong generalization guarantees as pruning; in particular, no weighting scheme to date has been shown to be statistically consistent, let alone to attain the same tight generalization rates (in terms of excess risk) as pruning approaches.\nIn this work, we consider a new weighting scheme based on PAC-Bayesian insights [1] that (a) is consistent and attains the same generalization rates as the best pruning of a tree, (b) is efficiently computable at both training and testing time, and (c) competes against pruning approaches on real-world data. To the best of our knowledge, this is the first practical scheme with such guarantees.\nThe main technical hurdle has to do with a subtle tension between goals (a) and (b) above. Namely, let T0 denote a large tree built on n datapoints, usually a binary tree with O(n) nodes; the family of subtrees T of T0 is typically of exponential size in n [2], so a naive voting scheme that requires visiting all subtrees is impractical; on the other hand, it is known that if the weights decompose favorably over the leaves of T (e.g., multiplicatively over leaves) then efficient classification is possible. Unfortunately, while various such multiplicative weights have been designed for voting with subtrees [3, 4, 5], they are not known to yield statistically consistent prediction. 
In fact, the best known result to date [5] presents a weighting scheme which can provably achieve an excess risk (see footnote 3) over the Bayes classifier of the form oP(1) + C · min_T R(hT), where R(hT) denotes the misclassification rate of a classifier hT based on subtree T. In other words, the excess risk might never go to 0 as sample size increases, which in contrast is a basic property of the pruning alternative. Furthermore, the approach of [5], based on l1-risk minimization, does not trivially extend to multiclass classification, which is most common in practice. Our approach is designed for multiclass by default.\nFootnotes: (*) The majority of the research was done when the author was an undergraduate student at Princeton University ORFE. (2) Considering only subtrees that partition the data space. (3) The excess risk of a classifier h over the Bayes classifier hB (which minimizes R(h) over all h) is R(h) − R(hB).\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\nStatistical contribution. PAC-Bayesian theory [1, 6, 7, 8] offers useful insights into designing weighting schemes with generalization guarantees (w.r.t. a prior distribution P over classifiers). However, a direct application of existing results fails to yield a consistent weighted-majority scheme. This is because PAC-Bayes results are primarily concerned with so-called Gibbs classifiers, which in our context correspond to predicting with a random classifier hT drawn according to a weight-distribution Q over subtrees of T0. Instead, we are interested in Q-weighted majority classifiers hQ. 
Unfortunately, the corresponding error R(hQ) can be twice the risk R(Q) = E_{hT∼Q} R(hT) of the corresponding Gibbs classifier: this then results (at best, see overview in Section 2.2) in an excess risk of the form (R(hQ) − R(hB)) ≤ (R(Q) − R(hB)) + R(Q) = oP(1) + R(hB), which, similarly to [5], does not go to 0. So far, this problem is best addressed in PAC-Bayes results such as the MinCq bound of [6, 8] on R(hQ), which is tighter in the presence of low correlation between base classifiers. In contrast, our PAC-Bayes result applies even without low correlation between base classifiers, and allows an excess risk oP(1) + (C/n) · min_T log(1/P(T)) → 0 (Proposition 2). This first result is in fact of general interest, since it extends beyond subtrees to any family of classifiers, and is obtained by carefully combining existing arguments from PAC-Bayes analysis.\nHowever, our basic PAC-Bayes result alone does not ensure convergence at the same rate as that of the best pruning approaches. This requires designing a prior P that scales properly with the size of subtrees T of T0. For instance, were P uniform over all subtrees of T0, then log(1/P(T)) = Ω(n), yielding a vacuous excess risk. We show through information-theoretic arguments that an appropriate prior P can be designed to yield rates of convergence of the same order as that of the best pruning of T0. In particular, our resulting weighting scheme maintains ideal properties of pruning approaches such as adaptivity to the intrinsic dimension of the data (see e.g. [9]).\nAlgorithmic contribution. 
We show that we can design a prior P which, while meeting the above statistical constraints, yields posterior weights that decompose favorably over the leaves of a subtree T. As a result of this decomposition, the weights of all subtrees can be recovered by simply maintaining corresponding weights at the nodes of the original tree T0, allowing efficient classification in time O(log n) (this is illustrated in Figure 1). We then propose an efficient approach to obtain the weights at the nodes of T0, consisting of concurrent top-down and bottom-up dynamic programs that run in O(n) time. These match the algorithmic complexity of the most efficient pruning approaches, and thus offer a practical alternative.\nOur theoretical results are then verified in experiments over many real-world datasets. In particular, we show that our weighted-voting scheme achieves similar or better error than pruning on practical problems, as suggested by our theoretical results.\nPaper Organization. We start in Section 2 with the theoretical setup and an overview of PAC-Bayes analysis. This is followed in Section 3 with an overview of our statistical results, and in Section 4 with algorithmic results. Our experimental analysis is then presented in Section 5.\n\n2 Preliminaries\n\n2.1 Classification setup\n\nWe consider a multiclass setup where the input X ∈ X, for a bounded subset X of R^D, possibly of lower intrinsic dimension. For simplicity of presentation we assume X ⊂ [0, 1]^D (as in normalized data). The output Y ∈ [L], where we use the notation [L] = {1, 2, ..., L} for L ∈ N.\nWe are to learn a classifier h : X → [L], given an i.i.d. training sample {Xi, Yi}_{i=1}^{2n} of size 2n, from an unknown distribution over X, Y. Throughout, we let S = {Xi, Yi}_{i=1}^{n} and S0 = {Xi, Yi}_{i=n+1}^{2n}, which will serve later to simplify dependencies in our analysis. Our performance measure is as follows.\nDefinition 1. The risk of a classifier h is given as R(h) = E[1{h(X) ≠ Y}]. 
This is minimized by the Bayes classifier hB(x) = argmax_{l∈[L]} P(Y = l | X = x). Therefore, for any classifier ĥ learned over a sample {Xi, Yi}_i, we are interested in the excess-risk E(ĥ) = R(ĥ) − R(hB).\nFigure 1: A partition tree T0 over input space X, and a query x ∈ X to classify. The leaves of T0 are the 4 cells shown left, and the root is X. A query x follows a single path (shown in bold) from the root down to a leaf. A key insight towards efficient weighted-voting is that this path visits all leaves (containing x) of any subtree of T0. Therefore, weighted voting might be implemented by keeping a weight w(A) at any node A along the path, where w(A) aggregates the weights Q(T) of every subtree T that has A as a leaf. This is feasible if we can restrict Q(T) to be multiplicative over the leaves of T, without trading off accuracy.\nHere we are interested in aggregations of classification trees, defined as follows.\nDefinition 2. A hierarchical partition or (space) partition-tree T of X is a collection of nested partitions of X; this is viewed as a tree where each node is a subset A of X, each child A' of a node A is a subset of A, and whose collection of leaves, denoted π(T), is a partition of X. A classification tree hT on X is a labeled partition-tree T of X: each leaf A ∈ π(T) is assigned a label l = l(A) ∈ [L]; the classification rule is simply hT(x) = l(A) for any x ∈ A.\nGiven an initial tree T0, we will consider only subtrees T of T0 that form a hierarchical partition of X, and we henceforth use the term subtrees (of T0) without additional qualification. Finally, aggregation (of subtrees of T0) consists of majority-voting as defined below.\nDefinition 3. Let H denote a discrete family of classifiers h : X → [L], and let Q denote a distribution over H. The Q-majority classifier hQ = hQ(H) is one satisfying, for any x ∈ X,\nhQ(x) = argmax_{l∈[L]} Σ_{h∈H : h(x)=l} Q(h).\nOur oracle rates of Theorem 1 require no additional assumptions; however, the resulting corollary is stated under standard distributional conditions that characterize convergence rates for tree-prunings.\n\n2.2 PAC-Bayes Overview\n\nPAC-Bayes analysis develops tools to bound the error of a Gibbs classifier, i.e. one that randomly samples a classifier h ∼ Q over a family of classifiers H. In this work we are interested in families {hT} defined over subtrees of an initial tree T0. Here we present a basic PAC-Bayes result which we extend for our analysis. While such results are generally presented for the classification risk R (defined above), we keep our presentation generic, as we show later that a different choice of risk leads to stronger results for R than what is possible through direct application of existing results.\nGeneric Setup. Consider a random vector Z, and an i.i.d. sample Z[n] = {Zi}_{i=1}^{n}. Let Z be the support of Z, and L = {ℓh : h ∈ H} be a loss class indexed by h ∈ H, discrete, and where ℓh : Z → [0, 1]. 
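To make the Q-majority rule of Definition 3 concrete, here is a minimal Python sketch over a small hypothetical family H (the three toy classifiers and their weights are illustrative only, not taken from the paper):

```python
import numpy as np

# Hypothetical discrete family H of three classifiers mapping x in R to labels {0, 1},
# together with a weight Q(h) for each classifier (the weights sum to 1).
H = [lambda x: 0,            # always predicts label 0
     lambda x: int(x > 0),   # thresholds at x = 0
     lambda x: 1]            # always predicts label 1
Q = np.array([0.2, 0.5, 0.3])

def h_Q(x, num_labels=2):
    # Definition 3: predict the label l maximizing the total Q-mass
    # of classifiers h in H with h(x) = l.
    scores = np.zeros(num_labels)
    for h, q in zip(H, Q):
        scores[h(x)] += q
    return int(np.argmax(scores))
```

For instance, h_Q(-1.0) returns 0 (total mass 0.7 votes for label 0) while h_Q(2.0) returns 1 (mass 0.8 votes for label 1); a Gibbs classifier would instead predict with a single h drawn at random from Q.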
For h ∈ H, the loss ℓh induces the following risk and empirical counterpart:\nRL(h) = E_Z ℓh(Z),   R̂L(h, Z[n]) = (1/n) Σ_{i=1}^{n} ℓh(Zi).\nIn particular, for the above classification risk R, and Z = (X, Y), we have ℓh(Z) = 1{h(X) ≠ Y}. Given a distribution Q over H, the risk (and its empirical counterpart) of the Gibbs classifier is then\nRL(Q) = E_{h∼Q} RL(h),   R̂L(Q, Z[n]) = E_{h∼Q} R̂L(h, Z[n]).\nPAC-Bayesian results bound RL(Q) in terms of R̂L(Q, Z[n]), uniformly over any distribution Q, provided a fixed prior distribution P over H. We will build on the following form of [10], which yields an upper-bound that is convex in Q (and therefore can be optimized for a good posterior Q*).\nProposition 1 (PAC-Bayes on RL [10]). Fix a prior P supported on H, and let n ≥ 8 and δ ∈ (0, 1). With probability at least 1 − δ over Z[n], simultaneously for all λ ∈ (0, 2) and all posteriors Q over H:\nRL(Q) ≤ R̂L(Q, Z[n]) / (1 − λ/2) + ( Dkl(Q‖P) + log(2√n/δ) ) / ( λ(1 − λ/2)n ),\nwhere Dkl(Q‖P) = E_Q log(Q(h)/P(h)) is the Kullback-Leibler divergence between Q and P.\nChoice of posterior Q*. Let Q* minimize the above upper-bound, and let h* minimize RL over H. Then, by letting Q_{h*} put all mass on h*, we automatically get that, with probability at least 1 − 2δ:\nRL(Q*) ≤ RL(Q_{h*}) ≤ C · ( R̂L(h*, Z[n]) + (log(1/P(h*)) + log(n/δ))/n )\n≤ C · ( RL(h*) + (log(1/P(h*)) + log(n/δ))/n + √(log(1/δ)/n) ),   (1)\nwhere the last inequality results from bounding |RL(h*) − R̂L(h*, Z[n])| using Chernoff.\nUnfortunately, such direct application is not enough for our purpose when RL = R. We want to bound the excess risk E(hQ) for a Q-majority classifier hQ over classifiers h ∈ H. It is known that R(hQ) ≤ 2R(Q), which yields a bound of the form (1) on R(hQ*); however, this implies at best that R(hQ*) → 2R(hB) even if E(h*) → 0 (which is generally the case for optimal tree-prunings h*_T [9]). This is a general problem in converting from Gibbs error to that of majority-voting, and is studied for instance in [6, 8], where it is shown that R(hQ) can actually be smaller in some situations.\nImproved choice of Q*. Here, we want to design Q* such that R(hQ*) → R(hB) (i.e. E(hQ*) → 0) at the same rate as E(h*_T) → 0 always. Our solution relies on a proper choice of loss ℓh that relates more directly to the excess risk E than the 0-1 loss 1{h(x) ≠ y}. A first candidate is to define ℓh(x, y) as eh(x, y) = 1{h(x) ≠ y} − 1{hB(x) ≠ y}, since E(h) = E eh(X, Y); however, eh(x, y) ∉ [0, 1] and can take negative values. This is resolved by considering an intermediate loss eh(x) = E_{Y|x} eh(x, Y) ∈ [0, 1], to be related back to eh(x, y) by integration in a suitable order.\n\n3 Statistical results\n\n3.1 Basic PAC-Bayes result\n\nWe start with the following intermediate loss family over classifiers h, w.r.t. the Bayes classifier hB.\nDefinition 4. Let eh(x, y) = 1{h(x) ≠ y} − 1{hB(x) ≠ y}, and eh(x) = E_{Y|x} eh(x, Y), and\nẼ(h, S) = (1/n) Σ_{i=1}^{n} eh(Xi),   Ê(h, S) = (1/n) Σ_{i=1}^{n} eh(Xi, Yi).\nOur first contribution is a basic PAC-Bayes result which the rest of our analysis builds on.\nProposition 2 (PAC-Bayes on excess risk). Let H denote a discrete family of classifiers, and fix a prior distribution P with support H. Let n ≥ 8 and δ ∈ (0, 1). Suppose there exist bounded functions Δ̂n(h, S), Δn(h), h ∈ H (depending on δ) such that\nP( ∀h ∈ H, Ẽ(h, S) ≤ Ê(h, S) + Δ̂n(h, S) ) ≥ 1 − δ,   and   P( ∀h ∈ H, Δ̂n(h, S) ≤ Δn(h) ) ≥ 1 − δ.\nFor any λ ∈ (0, 2), consider the following posterior over H:\nQ*_λ(h) = (1/c) e^{−nλ(R̂(h,S) + Δ̂n(h,S))} P(h),   for c = E_{h∼P} e^{−nλ(R̂(h,S) + Δ̂n(h,S))}.   (2)\nThen, with probability at least 1 − 4δ over S, simultaneously for all λ ∈ (0, 2):\nE(hQ*_λ) ≤ ( L / (1 − λ/2) ) · inf_{h∈H} ( E(h) + Δn(h) + log(1/P(h))/(λn) + ( log(2√n/δ) + λ√(2n log(1/δ)) )/(λn) ).\nProposition 2 builds on Proposition 1 by first taking RL(h) to be E(h), R̂L(h) to be Ẽ(h), and Z 
The bound in Proposition 2 is then obtained by optimizing over Q for \ufb01xed \u03bb. Since this\nbound is on excess error (rather than error), optimizing over \u03bb can only improve constants, while the\nchoice of prior P is crucial in obtaining optimal rates as |H| \u2192 \u221e. Such choice is treated next.\n3.2 Oracle risk for trees (as H .\nWe start with the following de\ufb01nitions on classi\ufb01ers of interest and related quantities.\nDe\ufb01nition 5. Let T0 be a binary partition-tree of X obtained from data S0, of depth D0. Consider a\nfamily of classi\ufb01cation trees H(T0)\n= {hT} indexed by subtrees T of T0, and where hT de\ufb01nes a\n.\n\ufb01xed labeling l(A) of nodes A \u2208 \u03c0(T ), e.g., l(A)\nFurthermore, for any node A of T0, let \u02c6p(A,S) denote the empirical mass of A under S and p(A) be\nthe population mass. Then for any subtree T of T0, let |T| be the number of nodes in T and de\ufb01ne\n\n= majority label in Y if A \u2229 S0 (cid:54)= \u2205.\n.\n\n= H(T0) grows in size with T0)\n\n(cid:114)\n\n(cid:88)\n\nA\u2208\u03c0(T )\n\n(cid:98)\u2206n(hT ,S)\n(cid:115)\n(cid:88)\n\n.\n=\n\n(cid:18)\n\n\u02c6p(A,S)\n\n2 log(|T0| /\u03b4)\n\nn\n\n, and\n\n(cid:19) log(|T0| /\u03b4)\n\n(3)\n\n.\n\n(4)\n\n\u2206n(hT )\n\n.\n=\n\nA\u2208\u03c0(T )\n\n8 max\n\np(A),\n\n(2 + log D) \u00b7 D0 + log(1/\u03b4)\n\nn\n\nn\n\nRemark 1. In practice, we might start with a space partitioning tree T (cid:48)\n0 (e.g., a dyadic tree, or\nKD-tree) which partitions [0, 1]D, rather than the support X . 
We then view T0 as the intersection of\n0 with X .\nT (cid:48)\nde\ufb01nition of (cid:98)\u2206n(hT ,S) and \u2206n(hT ) satis\ufb01es the conditions of Proposition 2, and (b) that there\nOur main theorem below follows from Proposition 2 on excess risk, by showing (a) that the above\nbe a proper distribution (i.e.(cid:80)\nexists a proper prior P such that log(1/P (T )) \u223c |\u03c0(T )|, i.e., depends just on the subtree complexity\nrather than on that of T0. The main technicality in showing (b) stems from the fact that P needs to\nT P (T ) = 1) without requiring too large a normalization constant\n(remember that the number of subtrees can be exponential in the size of T0). This is established\nthrough arguments from coding theory, and in particular Kraft-McMillan inequality.\n= (1/CP )e\u22123D0\u00b7|\u03c0(T )| for a nor-\n.\nTheorem 1 (Oracle risk for trees). Let the prior satisfy P (hT )\nmalizing constant CP , and consider the corresponding posterior Q\u2217\n\u03bb as de\ufb01ned in Equation 2, such\nthat, with probability at least 1 \u2212 4\u03b4 over S, for all \u03bb \u2208 (0, 2), the excess risk E(hQ\u2217\n) of the\nmajority-classi\ufb01er is at most\n\n\u03bb\n\n(cid:0)\n\n(cid:1) \u00b7 min\n\nhT \u2208H(T0)\n\nL\n\n1 \u2212 \u03bb/2\n\n\uf8eb\uf8edE(hT ) + \u2206n(hT ) +\n\n3D0 \u00b7 |\u03c0(T )|\n\n\u03bbn\n\n+\n\nlog 2\n\n\u221a\nn\n\u03b4 + \u03bb\n\u03bbn\n\n2n log 1\n\u03b4\n\n(cid:113)\n\n\uf8f6\uf8f8 .\n\nFrom Theorem 1 we can deduce that the majority classi\ufb01er hQ\u2217\nis consistent whenever the approach\nof pruning to the best subtree is consistent (typically, minhT E(hT ) + (D0 |\u03c0(T )|)/n = oP (1)).\nFurthermore, we can infer that E(hQ\u2217\n) converges at the same rate as pruning approaches: the terms\n\u2206n(hT ) and D0 \u00b7 |\u03c0(T )|/n can be shown to be typically, of lower or similar order as E(hT ) for the\nbest subtree classi\ufb01er hT . 
These remarks are formalized next and result in Corollary 1 below.\n\n3.3 Rate of convergence\n\nMost known rates for tree-pruning are established for dyadic trees (see e.g. [9, 11]), due to their simplicity, under nonparametric assumptions on E[Y|X]. Thus, we adopt such standard assumptions here to illustrate the rates achievable by hQ*_λ, following the more general statement of Theorem 1. The first standard assumption below restricts how fast class probabilities change over space.\nAssumption 1. Consider the so-called regression function η(x) ∈ R^L with coordinates ηl(x) = E_{Y|x} 1{Y = l}, l ∈ [L]. We assume η is α-Hölder for α ∈ (0, 1], i.e., there exists λ such that for all x, x' ∈ X, ‖η(x) − η(x')‖ ≤ λ‖x − x'‖^α.\nNext, we illustrate some of the key conditions verified by dyadic trees, which standard results build on. In particular, we want the diameters of nodes of T0 to decrease relatively fast from the root down.\nAssumption 2 (Conditions on T0). The tree T0 is obtained as the intersection of X with a dyadic partition of [0, 1]^D (e.g. by cycling through coordinates) of depth D0 = O(D log n) and partition size |T0| = O(n). In particular, we emphasize that the following conditions on subtrees then hold. For any subtree T of T0, let r(T) denote the maximum diameter of leaves of T (viewed as subsets of X). There exist C1, C2, d > 0 such that, for all (C1/n) < r ≤ 1, there exists a subtree T of T0 such that r(T) ≤ r and |π(T)| ≤ C2 r^{−d}.\nThe above conditions on subtrees are known to approximately hold for other procedures such as KD-trees and PCA-trees; in this sense, analyses of dyadic trees do yield some insights into the performance of other approaches. 
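For intuition, here is a minimal sketch of the dyadic construction in Assumption 2: cells of [0, 1]^D are halved at the midpoint while cycling through coordinates. The data structure is hypothetical, for illustration only, not the authors' code:

```python
# Build a dyadic partition tree on [0,1]^D: each cell [lo, hi] is split at the
# midpoint of one coordinate, cycling through coordinates with depth.
def build_dyadic(lo, hi, depth, axis=0):
    node = {'lo': list(lo), 'hi': list(hi), 'axis': axis, 'children': None}
    if depth > 0:
        mid = (lo[axis] + hi[axis]) / 2.0
        node['mid'] = mid
        hi_left = list(hi); hi_left[axis] = mid     # left child keeps the lower half
        lo_right = list(lo); lo_right[axis] = mid   # right child keeps the upper half
        nxt = (axis + 1) % len(lo)
        node['children'] = (build_dyadic(lo, hi_left, depth - 1, nxt),
                            build_dyadic(lo_right, hi, depth - 1, nxt))
    return node

def leaf_of(node, x):
    # follow the unique root-to-leaf path of a query point x
    while node['children'] is not None:
        node = node['children'][0 if x[node['axis']] < node['mid'] else 1]
    return node

root = build_dyadic([0.0, 0.0], [1.0, 1.0], depth=2)  # 2^2 = 4 leaf cells in D = 2
cell = leaf_of(root, [0.1, 0.7])                      # cell [0, 0.5] x [0.5, 1]
```

A dyadic tree of depth D0 built this way has leaf diameters shrinking geometrically every D levels, which is the mechanism behind the r(T) condition of Assumption 2.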
The quantity d captures the intrinsic dimension (e.g., doubling or box dimension) of the data space X, or is often of the same order [12, 13, 14].\nUnder the above two assumptions, it can be shown through standard arguments that the excess error of the best pruning, namely min_{hT∈H(T0)} E(hT), is of order n^{−α/(2α+d)}, which is tight (see e.g. the minimax lower-bounds of [15]). The following corollary to Theorem 1 states that such a rate, up to a logarithmic factor in n, is also attained by majority classification under Q*_λ.\nCorollary 1 (Adaptive rate of convergence). Assume that for any cell A of T0, the labeling l(A) corresponds to the majority label in A (under S0) if A ∩ S0 ≠ ∅, or l(A) = 1 otherwise. Then, under Assumptions 1 and 2, and the conditions of Theorem 1, there exists a constant C such that:\nE_{S0,S} E(hQ*_λ) ≤ C (log n / n)^{α/(2α+d)}.\n\n4 Algorithmic Results\n\nHere we show that hQ can be efficiently implemented by storing appropriate weights at the nodes of T0. Let wQ(A) = Σ_{hT : A∈π(T)} Q(hT) aggregate the weights over all subtrees T of T0 having A as a leaf. Then hQ(x) = argmax_{l∈[L]} Σ_{A∈path(x), l(A)=l} wQ(A), where path(x) denotes all nodes of T0 containing x. Thus, hQ(x) is computable from weights proportional to wQ(A) at every node.\nWe show in what follows that we can efficiently obtain w(A) = C_{Q*_λ} · wQ*_λ(A) by dynamic-programming, by ensuring that Q*_λ(hT) is multiplicative over π(T). 
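Putting the pieces together, the following minimal sketch computes node weights by a bottom-up and a top-down pass and then classifies by a weighted vote along the query path, on a toy tree with hypothetical φ values and labels (the paper's actual procedures appear below as Algorithms 1 and 2):

```python
import math

# Toy full binary tree: node -> (left child, right child), or None at the leaves.
children = {'R': ('B', 'C'), 'B': ('D', 'E'), 'C': None, 'D': None, 'E': None}
parent = {c: p for p, kids in children.items() if kids for c in kids}
phi = {'R': -0.5, 'B': -0.2, 'C': -0.8, 'D': -0.1, 'E': -0.3}  # hypothetical node scores
label = {'R': 0, 'B': 1, 'C': 0, 'D': 1, 'E': 0}               # hypothetical node labels

def bottom_up():
    # beta(A) = sum, over subtrees T rooted at A, of exp(sum of phi over leaves of T)
    beta = {}
    def rec(a):
        if children[a] is None:
            beta[a] = math.exp(phi[a])
        else:
            left, right = children[a]
            rec(left); rec(right)
            beta[a] = math.exp(phi[a]) + beta[left] * beta[right]
    rec('R')
    return beta

def top_down(beta):
    # alpha(A) = alpha(parent) * beta(sibling); w(A) = exp(phi(A)) * alpha(A)
    # aggregates the exp-weights of every subtree of the whole tree having A as a leaf.
    alpha, w = {'R': 1.0}, {}
    for a in ['R', 'B', 'C', 'D', 'E']:          # any order visiting parents first
        if a != 'R':
            p = parent[a]
            sib = [s for s in children[p] if s != a][0]
            alpha[a] = alpha[p] * beta[sib]
        w[a] = math.exp(phi[a]) * alpha[a]
    return w

def classify(w, path, num_labels=2):
    # weighted vote along the root-to-leaf path of a query (here, a point
    # falling in leaf D): argmax_l of the total w(A) over path nodes labeled l
    scores = [0.0] * num_labels
    for a in path:
        scores[label[a]] += w[a]
    return scores.index(max(scores))

w = top_down(bottom_up())
pred = classify(w, ['R', 'B', 'D'])
```

On this toy tree the only subtrees have leaf-sets {R}, {B, C}, and {D, E, C}, so the two passes can be checked against brute-force enumeration; both passes and the prediction are linear in the tree size, matching the O(|T0|) training and O(log n) prediction claims.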
This is the case, given our choice of prior from Theorem 1: we have Q*_λ(hT) = (1/C_{Q*_λ}) · exp( Σ_{A∈π(T)} φ(A) ), where\nφ(A) = −λ Σ_{i: Xi∈A∩S} 1{Yi ≠ l(A)} − nλ √( p̂(A, S) · 2 log(|T0|/δ)/n ) − 3D0.\nWe can then compute w(A) = C_{Q*_λ} · wQ*_λ(A) via dynamic-programming. The intuition is similar to that in [5]; however, the particular form of our weights requires a two-pass dynamic program (bottom-up and top-down) rather than the single pass in [5]. Namely, w(A) divides into subweights that any node A' might contribute up or down the tree. Let\nα(A) = Σ_{hT : A∈π(T)} exp( Σ_{A'≠A, A'∈π(T)} φ(A') ),   (5)\nso that w(A) = e^{φ(A)} · α(A). As we will show (proof of Theorem 2), α(A) decomposes into contributions from the parent Ap and sibling As of A, i.e., α(A) = α(Ap)β(As), where β(As) is given as (writing T0^A for the subtree of T0 rooted at A, and T ⪯ T' when T is a subtree of T'):\nβ(As) = Σ_{T ⪯ T0^{As}} exp( Σ_{A'∈π(T)} φ(A') ).   (6)\nThe contributions β(A) are first computed using the bottom-up Algorithm 1, and the contributions α(A) and final weights w(A) are then computed using the top-down Algorithm 2. For ease of presentation, these routines run on a full-binary tree version T̄0 of T0, obtained by adding a dummy child to each node A that has a single child in T0. Each dummy node A' has φ(A') = 0.\n\nAlgorithm 1 Bottom-up pass\nfor A ∈ π(T̄0) do\n    β(A) ← e^{φ(A)}\nend for\nfor i ← D0 down to 0 do\n    Ai ← set of nodes of T̄0 at depth i\n    for A ∈ Ai ∖ π(T̄0) do\n        N ← the children nodes of A\n        β(A) ← e^{φ(A)} + Π_{A'∈N} β(A')\n    end for\nend for\n\nAlgorithm 2 Top-down pass\nα(root) ← 1\nfor i ← 1 to D0 do\n    Ai ← set of nodes of T̄0 at depth i\n    for A ∈ Ai do\n        Ap, As ← parent of node A, sibling of node A\n        α(A) ← α(Ap)β(As)\n        w(A) ← e^{φ(A)}α(A)\n    end for\nend for\n\nTheorem 2 (Computing w(A)). Running Algorithm 1, then Algorithm 2, we obtain w(A) = C_{Q*_λ} · wQ*_λ(A), where Q*_λ is as defined in Theorem 1. Furthermore, the combined runtime of Algorithms 1 and 2 is 2|T̄0| ≤ 4|T0|, where |T| is the number of nodes in T.\n\n5 Experiments\n\nTable 1: UCI datasets\nName (abbreviation) | Features count | Labels count | Train size\nSpambase (spam) | 57 | 2 | 2601\nEEG Eye State (eeg) | 14 | 2 | 12980\nEpileptic Seizure Recognition (epileptic) | 178 | 2 | 9500\nCrowdsourced Mapping (crowd) | 28 | 6 | 8546\nWine Quality (wine) | 12 | 11 | 4497\nOptical Recognition of Handwritten Digits (digit) | 64 | 10 | 3620\nLetter Recognition (letter) | 16 | 26 | 18000\n\nHere we present experiments on real-world datasets, for two common partition-tree approaches, dyadic trees and KD-trees. 
The various datasets are described in Table 1.\nThe main baseline we compare against is a popular efficient pruning heuristic, where a subtree of T0 is selected to minimize the penalized error C1(hT) = R̂(hT, S) + λ √( |π(T, S)| / n ).\nWe also compare against other tree-based approaches that are theoretically driven and efficient. First is a pruning approach proposed in [16], which picks a subtree minimizing the penalized error C2(hT) = R̂(hT, S) + λ Σ_{A∈π(T,S)} √( max( p̂(A, S), ‖A‖/n ) · ‖A‖/n ), where ‖A‖ denotes the depth of node A in T0. We note that here we choose a form of C2 that avoids theoretical constants that were of a technical nature, and instead let λ account for such. We report this approach as SN-pruning. Second is the majority classifier of [5], which however is geared towards binary classification, as it requires regression-type estimates in [0, 1] at each node. This is denoted HS-vote.\nAll the above approaches have efficient dynamic programs that run in time O(|T0|), and all predict in time O(height(T0)). The same holds for our PAC-Bayes approach, as discussed above in Section 4.\nPractical implementation of PAC-Bayes tree. Our implementation rests on the theoretical insights of Theorem 1; however, we avoid some of the technical details that were needed for rigor, such as sample splitting and overly conservative constants in concentration results. Instead, we advise cross-validating for such constants in the prior and posterior definitions. Namely, we first set P(hT) ∝ exp(−|π(T, S)|), where π(T, S) denotes the leaves of T containing data. We set Δn(hT, S) = Σ_{A∈π(T,S)} √( p̂(A, S)/n ). The posterior is then set as Q*(hT) ∝ exp( −n( λ1 R̂(hT, S) + λ2 Δn(hT, S) ) ) P(hT), where λ1, λ2 account for concentration terms to be tuned to the data.\nFinally, we use the entire data to construct T0 and compute weights, i.e., S0 = S, as inter-dependencies are in fact less of an issue in practice. We note that the above alternative theoretical approaches, SN-pruning and HS-vote, are also assumed (in theory) to work on a sample-independent choice of T0 (or equivalently one built and labeled on a separate sample S0), but are implemented here on the entire data to similarly take advantage of larger data sizes. The baseline pruning heuristic is by default always implemented on the full data.\nExperimental setup and results. The data is preprocessed as follows: for dyadic trees, data is scaled to be in [0, 1]^D, while for KD-trees data is normalized across each coordinate by standard deviation. Testing data is fixed to be of size 2000, while each experiment is run 5 times (with a random choice of training data of size reported in Table 1) and average performance is reported. In each experiment, all parameters are chosen by 2-fold cross-validation for each of the procedures. The log-grid is 10 values, equally spaced in logarithm, from 2^{−8} to 2^{6}, while the linear-grid is 10 linearly-spaced values between half the best value of the log-search and twice the best value of the log-search.\nTable 2 reports the classification performance of the various theoretical methods relative to the baseline pruning heuristic. We see that the proposed PAC-Bayes tree achieves competitive performance against all other alternatives. All the approaches have similar performance across datasets, with some working slightly better on particular datasets. 
Figure 2 further illustrates typical performance on multiclass problems as training size varies.\n\nTable 2: Ratio of classification error over that of the default pruning baseline: bold indicates best results across methods, while blue indicates improvement over baseline; N/A means the algorithm was not run on the task.\nDataset | SN-pruning (dyadic) | PAC-Bayes tree (dyadic) | HS-vote (dyadic) | SN-pruning (KD) | PAC-Bayes tree (KD) | HS-vote (KD)\nspam | 1.118 | 0.975 | 1.224 | 1.048 | 1.020 | 1.075\neeg | 0.979 | 0.993 | 1.029 | 1.000 | 0.990 | 1.000\nepileptic | 0.993 | 0.992 | 0.951 | 0.977 | 0.987 | 0.907\ncrowd | 0.991 | 1.020 | N/A | 1.001 | 1.017 | N/A\nwine | 1.035 | 0.991 | N/A | 1.010 | 0.997 | N/A\ndigit | 1.000 | 0.936 | N/A | 0.994 | 0.997 | N/A\nletter | 1.005 | 0.993 | N/A | 1.000 | 1.001 | N/A\n\nFigure 2: Classification error versus training size\n\nReferences\n[1] David A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.\n[2] László A. Székely and Hua Wang. On subtrees of trees. Advances in Applied Mathematics, 34(1):138–155, 2005.\n[3] Trevor Hastie and Daryl Pregibon. Shrinking trees. AT&T Bell Laboratories, 1990.\n[4] Wray Buntine and Tim Niblett. A further comparison of splitting rules for decision-tree induction. Machine Learning, 8(1):75–85, 1992.\n[5] David P. Helmbold and Robert E. Schapire. Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27(1):51–68, 1997.\n[6] Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, and Nicolas Usunier. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In Advances in Neural Information Processing Systems, pages 769–776, 2007.\n[7] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems, pages 439–446, 2003.\n[8] Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand, and Jean-Francis Roy. Risk bounds for the majority vote: from a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research, 16:787–860, 2015.\n[9] C. Scott and R. D. Nowak. Minimax-optimal classification with dyadic decision trees. IEEE Transactions on Information Theory, 52, 2006.\n[10] Niklas Thiemann, Christian Igel, Olivier Wintenberger, and Yevgeny Seldin. A strongly quasiconvex PAC-Bayesian bound. In Proceedings of the 28th International Conference on Algorithmic Learning Theory, volume 76 of Proceedings of Machine Learning Research, pages 466–492, 2017.\n[11] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, New York, NY, 2002.\n[12] Nakul Verma, Samory Kpotufe, and Sanjoy Dasgupta. Which spatial partition trees are adaptive to intrinsic dimension? In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 565–574, 2009.\n[13] Samory Kpotufe and Sanjoy Dasgupta. A tree-based regressor that adapts to intrinsic dimension. Journal of Computer and System Sciences, 78(5):1496–1515, 2012.\n[14] Santosh Vempala. Randomly-oriented kd-trees adapt to intrinsic dimension. In FSTTCS, volume 18, pages 48–57, 2012.\n[15] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.\n[16] Clayton Scott. Dyadic Decision Trees. 
PhD thesis, Rice University, 2004.", "award": [], "sourceid": 5763, "authors": [{"given_name": "Tin", "family_name": "Nguyen", "institution": "MIT"}, {"given_name": "Samory", "family_name": "Kpotufe", "institution": "Princeton University"}]}