{"title": "Boosting with Multi-Way Branching in Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 300, "page_last": 306, "abstract": null, "full_text": "Boosting with Multi-Way Branching in Decision Trees

Yishay Mansour    David McAllester
AT&T Labs-Research
180 Park Ave
Florham Park NJ 07932
{mansour, dmac}@research.att.com

Abstract

It is known that decision tree learning can be viewed as a form of boosting. However, existing boosting theorems for decision tree learning allow only binary-branching trees, and the generalization to multi-branching trees is not immediate. Practical decision tree algorithms, such as CART and C4.5, implement a trade-off between the number of branches and the improvement in tree quality as measured by an index function. Here we give a boosting justification for a particular quantitative trade-off curve. Our main theorem states, in essence, that if we require an improvement proportional to the log of the number of branches, then top-down greedy construction of decision trees remains an effective boosting algorithm.

1 Introduction

Decision trees have proved to be a very popular tool in experimental machine learning. Their popularity stems from two basic features: they can be constructed quickly, and they seem to achieve low error rates in practice. In some cases the time required for tree growth scales linearly with the sample size, and efficient tree construction allows for very large data sets. Moreover, although the decision tree representation has known theoretical handicaps, in practice decision trees achieve accuracy comparable to other learning paradigms such as neural networks.

While decision tree learning algorithms are popular in practice, it seems hard to quantify their success in a theoretical model.
It is fairly easy to see that even if the target function can be described by a small decision tree, tree learning algorithms may fail to find a good approximation. Kearns and Mansour [6] used the weak learning hypothesis to show that standard tree learning algorithms perform boosting. This provides a theoretical justification for decision tree learning similar to justifications that have been given for various other boosting algorithms, such as AdaBoost [4].

Most decision tree learning algorithms use a top-down growth process. Given a current tree, the algorithm selects some leaf node and extends it to an internal node by assigning to it some \"branching function\" and adding a leaf for each possible output value of this branching function. The set of branching functions may differ from one algorithm to another, but most algorithms used in practice try to keep the set of branching functions fairly simple. For example, in C4.5 [7], each branching function depends on a single attribute. For a categorical attribute, the branching is according to the attribute's value, while for a continuous attribute it compares the attribute with some constant.

Of course such top-down tree growth can over-fit the data: it is easy to construct a (large) tree whose error rate on the training data is zero. However, if the class of splitting functions has finite VC dimension then it is possible to prove that, with high confidence over the choice of the training data, for all trees T the true error rate of T is bounded by ε̂(T) + O(√(|T|/m)), where ε̂(T) is the error rate of T on the training sample, |T| is the number of leaves of T, and m is the size of the training sample. Over-fitting can be avoided by requiring that top-down tree growth produce a small tree.
In practice this is usually done by constructing a large tree and then pruning away some of its nodes. Here we take a slightly different approach. We assume a given target tree size s and consider the problem of constructing a tree T with |T| = s and ε̂(T) as small as possible. We can avoid over-fitting by selecting a small target value for the tree size.

A fundamental question in top-down tree growth is how to select the branching function when growing a given leaf. We can think of the target size as a \"budget\". A four-way branch spends more of the tree size budget than does a two-way branch: a four-way branch increases the tree size by roughly the same amount as two two-way branches. A sufficiently large branch would spend the entire tree size budget in a single step. Branches that spend more of the tree size budget should be required to achieve more progress than branches spending less of the budget. Naively, one would expect that the improvement should be required to be roughly linear in the number of new leaves introduced: one should get a return proportional to the expense. However, a weak learning assumption and a target tree size define a nontrivial game between the learner and an adversary. The learner makes moves by selecting branching functions and the adversary makes moves by presenting options consistent with the weak learning hypothesis. We prove here that the learner achieves a better value in this game by selecting branches that get a return considerably smaller than the naive linear return. Our main theorem states, in essence, that the return need only be proportional to the log of the number of branches.

2 Preliminaries

We assume a set X of instances and an unknown target function f mapping X to {0,1}. We assume a given \"training set\" S which is a set of pairs of the form (x, f(x)).
We let H be a set of potential branching functions, where each h ∈ H is a function from X to a finite set R_h; we allow different functions in H to have different ranges. We require that for any h ∈ H we have |R_h| ≥ 2. An H-tree is a tree where each internal node is labeled with a branching function h ∈ H and has children corresponding to the elements of the set R_h. We define |T| to be the number of leaf nodes of T and L(T) to be the set of leaf nodes of T. For a given tree T, leaf node ℓ of T, and sample S we write S_ℓ to denote the subset of the sample S reaching leaf ℓ. For ℓ ∈ T we define p_ℓ to be the fraction of the sample reaching leaf ℓ, i.e., |S_ℓ|/|S|. We define q_ℓ to be the fraction of the pairs (x, f(x)) in S_ℓ for which f(x) = 1. The training error of T, denoted ε̂(T), is Σ_{ℓ ∈ L(T)} p_ℓ min(q_ℓ, 1 − q_ℓ).

3 The Weak Learning Hypothesis and Boosting

Here, as in [6], we view top-down decision tree learning as a form of boosting [8, 3]. Boosting describes a general class of iterative algorithms based on a weak learning hypothesis. The classical weak learning hypothesis applies to classes of Boolean functions. Let H_2 be the subset of branching functions h ∈ H with |R_h| = 2. For δ > 0, the classical δ-weak learning hypothesis for H_2 states that for any distribution D on X there exists an h ∈ H_2 with Pr_D(h(x) ≠ f(x)) ≤ 1/2 − δ. Algorithms designed to exploit this particular hypothesis for classes of Boolean functions have proved to be quite useful in practice [5].

Kearns and Mansour [6] show that the key to using the weak learning hypothesis for decision tree learning is the use of an index function I : [0,1] → [0,1] with I(q) ≤ 1 and I(q) ≥ min(q, 1 − q), where I(T) is defined to be Σ_{ℓ ∈ L(T)} p_ℓ I(q_ℓ). Note that these conditions imply that ε̂(T) ≤ I(T). For any sample W let q_W be the fraction of pairs (x, f(x)) ∈ W such that f(x) = 1.
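As a quick numerical illustration (not part of the paper), the defining conditions I(q) ≤ 1 and I(q) ≥ min(q, 1 − q), and the resulting bound ε̂(T) ≤ I(T), can be checked directly for the index functions discussed in this paper; the function names below are our own:

```python
import math

# Three index functions I : [0,1] -> [0,1]; each satisfies
# I(q) >= min(q, 1-q) and I(q) <= 1, so the weighted index
# I(T) upper-bounds the training error of the tree.
def km_index(q):       # 2*sqrt(q*(1-q)), the index of Lemma 3.1 below
    return 2.0 * math.sqrt(q * (1.0 - q))

def gini_index(q):     # 4*q*(1-q), the (scaled) Gini index of CART
    return 4.0 * q * (1.0 - q)

def entropy_index(q):  # binary entropy, used in C4.5
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

def tree_index(leaves, index):
    """I(T) = sum over leaves of p_l * I(q_l); leaves is a list of (p_l, q_l)."""
    return sum(p * index(q) for p, q in leaves)

def training_error(leaves):
    """Training error: sum over leaves of p_l * min(q_l, 1 - q_l)."""
    return sum(p * min(q, 1.0 - q) for p, q in leaves)
```

For any leaf distribution, training_error(leaves) ≤ tree_index(leaves, index) holds for each of these three choices of index.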
For any h ∈ H let T_h be the decision tree consisting of a single internal node with branching function h plus a leaf for each member of R_h. Let I_W(T_h) denote the value of I(T_h) as measured with respect to the sample W. Let Δ(W, h) denote I(q_W) − I_W(T_h). The quantity Δ(W, h) is the reduction in the index for sample W achieved by introducing a single branch. Also note that p_ℓ Δ(S_ℓ, h) is the reduction in I(T) when the leaf ℓ is replaced by the branch h. Kearns and Mansour [6] prove the following lemma.

Lemma 3.1 (Kearns & Mansour) Assuming the δ-weak learning hypothesis for H_2, and taking I(q) to be 2√(q(1 − q)), we have that for any sample W there exists an h ∈ H_2 such that Δ(W, h) ≥ (δ²/16) I(q_W).

This lemma motivates the following definition.

Definition 1 We say that H_2 and I satisfy the γ-weak tree-growth hypothesis if for any sample W from X there exists an h ∈ H_2 such that Δ(W, h) ≥ γ I(q_W).

Lemma 3.1 states, in essence, that the classical weak learning hypothesis implies the weak tree-growth hypothesis for the index function I(q) = 2√(q(1 − q)). Empirically, however, the weak tree-growth hypothesis seems to hold for a variety of index functions that were already used for tree growth prior to the work of Kearns and Mansour. The Gini index I(q) = 4q(1 − q) is used in CART [1] and the entropy I(q) = −q log q − (1 − q) log(1 − q) is used in C4.5 [7]. It has long been empirically observed that it is possible to make steady progress in reducing I(T) for these choices of I, while it is difficult to make steady progress in reducing ε̂(T).

We now define a simple binary branching procedure. For a given training set S and target tree size s this algorithm grows a tree with |T| = s.
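Before giving the procedure, the quantity Δ(W, h) defined above can be computed directly from a sample. Here is a small Python sketch (our illustration, not the paper's code); the example data and branch functions are hypothetical:

```python
import math
from collections import defaultdict

def index(q):
    # I(q) = 2*sqrt(q*(1-q)), the index function of Lemma 3.1
    return 2.0 * math.sqrt(q * (1.0 - q))

def q_of(W):
    # q_W: fraction of pairs (x, f(x)) in W with f(x) = 1
    return sum(y for _, y in W) / len(W)

def delta(W, h):
    """Delta(W, h) = I(q_W) - I_W(T_h): the index reduction from one branch."""
    parts = defaultdict(list)
    for x, y in W:                     # partition W by the branch value h(x)
        parts[h(x)].append((x, y))
    i_branch = sum(len(P) / len(W) * index(q_of(P)) for P in parts.values())
    return index(q_of(W)) - i_branch

# Example: f(x) is the low bit of x, so branching on that bit is a perfect split.
W = [(x, x % 2) for x in range(8)]
low_bit = lambda x: x % 2
high_bit = lambda x: x // 4
```

Here delta(W, low_bit) = I(1/2) − 0 = 1, a full index reduction, while delta(W, high_bit) = 0 since each half still has q = 1/2.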
In the algorithm, ∅ denotes the trivial tree whose root is a leaf node, and T_{ℓ,h} denotes the result of replacing the leaf ℓ with the branching function h and a new leaf for each element of R_h.

T ← ∅
WHILE (|T| < s) DO
  ℓ ← argmax_ℓ p_ℓ I(q_ℓ)
  h ← argmax_{h ∈ H_2} Δ(S_ℓ, h)
  T ← T_{ℓ,h}
END-WHILE

We now define ε(n) to be the quantity Π_{i=1}^{n−1} (1 − γ/i). Note that ε(n) ≤ Π_{i=1}^{n−1} e^{−γ/i} = e^{−γ Σ_{i=1}^{n−1} 1/i} ≤ e^{−γ ln n} = n^{−γ}.

Theorem 3.2 (Kearns & Mansour) If H_2 and I satisfy the γ-weak tree-growth hypothesis then the binary branching procedure produces a tree T with ε̂(T) ≤ I(T) ≤ ε(|T|) ≤ |T|^{−γ}.

Proof: The proof is by induction on the number of iterations of the procedure. We have that I(∅) ≤ 1 = ε(1), so the initial tree immediately satisfies the condition. We now assume that the condition is satisfied by T at the beginning of an iteration and prove that it remains satisfied by T_{ℓ,h} at the end of the iteration. Since I(T) = Σ_{ℓ ∈ L(T)} p_ℓ I(q_ℓ), the leaf ℓ selected by the procedure is such that p_ℓ I(q_ℓ) ≥ I(T)/|T|. By the γ-weak tree-growth assumption the function h selected by the procedure has the property that Δ(S_ℓ, h) ≥ γ I(q_ℓ). We now have that I(T) − I(T_{ℓ,h}) = p_ℓ Δ(S_ℓ, h) ≥ γ p_ℓ I(q_ℓ) ≥ γ I(T)/|T|. This implies that I(T_{ℓ,h}) ≤ (1 − γ/|T|) I(T) ≤ (1 − γ/|T|) ε(|T|) = ε(|T| + 1) = ε(|T_{ℓ,h}|). □

4 Statement of the Main Theorem

We now construct a tree-growth algorithm that selects multi-way branching functions. As with many weak learning hypotheses, the γ-weak tree-growth hypothesis can be viewed as defining a game between the learner and an adversary. Given a tree T, the adversary selects a set of branching functions allowed at each leaf of the tree, subject to the constraint that at each leaf ℓ the adversary must provide a binary branching function h with Δ(S_ℓ, h) ≥ γ I(q_ℓ).
The learner then selects a leaf ℓ and a branching function h and replaces T by T_{ℓ,h}. The adversary then again selects a new set of options for each leaf, subject to the γ-weak tree-growth hypothesis. The proof of Theorem 3.2 implies that even when the adversary can reassign all options at every move, there exists a learner strategy, the binary branching procedure, guaranteed to achieve a final error rate of |T|^{−γ}.

Of course the optimal play for the adversary in this game is to provide only a single binary option at each leaf. However, in practice the \"adversary\" will make mistakes and provide options to the learner which can be exploited to achieve even lower error rates. Our objective now is to construct a strategy for the learner which can exploit multi-way branches provided by the adversary.

We first say that a branching function h is acceptable for tree T and target size s if either |R_h| = 2 or |T| < ε(|R_h|) γ s / (2 |R_h|). We also define g(k) to be the quantity (1 − ε(k))/γ. It should be noted that g(2) = 1. It should also be noted that ε(k) ≈ e^{−γ ln k}, and hence for γ ln k small we have ε(k) ≈ 1 − γ ln k and hence g(k) ≈ ln k. We now define the following multi-branch tree growth procedure.

T ← ∅
WHILE (|T| < s) DO
  ℓ ← argmax_ℓ p_ℓ I(q_ℓ)
  h ← argmax_{h ∈ H, h acceptable for T and s} Δ(S_ℓ, h)/g(|R_h|)
  T ← T_{ℓ,h}
END-WHILE

A run of the multi-branch tree growth procedure will be called γ-boosting if at each iteration the branching function h selected has the property that Δ(S_ℓ, h)/g(|R_h|) ≥ γ I(q_ℓ). The γ-weak tree-growth hypothesis guarantees a binary option with Δ(S_ℓ, h)/g(|R_h|) ≥ γ I(q_ℓ)/g(2) = γ I(q_ℓ). Therefore, the γ-weak tree-growth hypothesis implies that every run of the multi-branch growth procedure is γ-boosting. But a run can be γ-boosting by exploiting multi-way branches even when the γ-weak tree-growth hypothesis fails.
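As a concrete illustration of the multi-branch procedure, its acceptability condition, and the g(|R_h|) normalization, here is a Python sketch. It is a hypothetical rendering, not the paper's implementation: leaves are represented by their subsamples, candidate branches are given as (h, R_h) pairs, and the value γ = 0.1 is an assumed weak-tree-growth advantage:

```python
import math
from collections import defaultdict

GAMMA = 0.1  # assumed advantage gamma (illustrative choice)

def index(q):
    # Kearns-Mansour index I(q) = 2*sqrt(q*(1-q))
    return 2.0 * math.sqrt(q * (1.0 - q))

def eps(n):
    # e(n) = prod_{i=1}^{n-1} (1 - gamma/i); note e(n) <= n^{-gamma}
    out = 1.0
    for i in range(1, n):
        out *= 1.0 - GAMMA / i
    return out

def g(k):
    # g(k) = (1 - e(k))/gamma; g(2) = 1 and g(k) ~ ln k when gamma*ln k is small
    return (1.0 - eps(k)) / GAMMA

def acceptable(k, tree_size, s):
    # a k-way branch is acceptable iff k = 2 or |T| < e(k)*gamma*s/(2k)
    return k == 2 or tree_size < eps(k) * GAMMA * s / (2 * k)

def q_of(S):
    return sum(y for _, y in S) / len(S)

def split(S, h):
    parts = defaultdict(list)
    for x, y in S:
        parts[h(x)].append((x, y))
    return list(parts.values())

def grow(sample, H, s):
    """Multi-branch top-down growth to target size s.
    H is a list of (h, R_h) pairs; each leaf is represented by its subsample S_l."""
    leaves = [sample]
    while len(leaves) < s:
        n = sum(len(S) for S in leaves)
        # select the leaf maximizing p_l * I(q_l)
        S = max(leaves, key=lambda L: (len(L) / n) * index(q_of(L)))
        # select an acceptable branch maximizing Delta(S_l, h) / g(|R_h|)
        best, best_val = None, -1.0
        for h, R in H:
            if not acceptable(len(R), len(leaves), s):
                continue
            parts = split(S, h)
            d = index(q_of(S)) - sum(len(P) / len(S) * index(q_of(P)) for P in parts)
            if d / g(len(R)) > best_val:
                best_val, best = d / g(len(R)), h
        leaves.remove(S)
        leaves.extend(split(S, best))
    return leaves
```

With only binary candidates, g(2) = 1 and the procedure reduces to the binary branching procedure of Section 3; a k-way branch competes only when the remaining budget makes it acceptable.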
The following is the main theorem of this paper.

Theorem 4.1 If T is produced by a γ-boosting run of the multi-branch tree-growth procedure then ε̂(T) ≤ I(T) ≤ ε(|T|) ≤ |T|^{−γ}.

5 Proof of Theorem 4.1

To prove the main theorem we need the concept of a visited weighted tree, or VW-tree for short. A VW-tree is a tree in which each node m is assigned both a rational weight w_m ∈ [0,1] and an integer visitation count v_m ≥ 1. We now define the following VW tree growth procedure. In the procedure, T_w is the tree consisting of a single root node with weight w and visitation count 1. The tree T_{ℓ,w_1,…,w_k} is the result of inserting k new leaves below the leaf ℓ, where the i-th new leaf has weight w_i and new leaves have visitation count 1.

w ← any rational number in [0,1]
T ← T_w
FOR ANY NUMBER OF STEPS REPEAT THE FOLLOWING
  ℓ ← argmax_ℓ ε(v_ℓ) w_ℓ / v_ℓ
  v_ℓ ← v_ℓ + 1
  OPTIONALLY T ← T_{ℓ,w_1,…,w_{v_ℓ}} WITH w_1 + … + w_{v_ℓ} ≤ ε(v_ℓ) w_ℓ

We first prove an analog of Theorem 3.2 for the above procedure. For a VW-tree T we define |T| to be Σ_{ℓ ∈ L(T)} v_ℓ and we define I(T) to be Σ_{ℓ ∈ L(T)} ε(v_ℓ) w_ℓ.

Lemma 5.1 The VW procedure maintains the invariant that I(T) ≤ ε(|T|).

Proof: The proof is by induction on the number of iterations of the algorithm. The result is immediate for the initial tree since ε(1) = 1. We now assume that I(T) ≤ ε(|T|) at the start of an iteration and show that this remains true at the end of the iteration.

We can associate each leaf ℓ with v_ℓ \"subleaves\", each of weight ε(v_ℓ) w_ℓ / v_ℓ. We have that |T| is the total number of these subleaves and I(T) is their total weight. Therefore there must exist a subleaf whose weight is at least the average I(T)/|T|, and hence there must exist a leaf ℓ satisfying ε(v_ℓ) w_ℓ / v_ℓ ≥ I(T)/|T|. Therefore this relation must hold of the leaf ℓ selected by the procedure.
Let T' be the tree resulting from incrementing v_ℓ. We now have I(T) − I(T') = ε(v_ℓ) w_ℓ − ε(v_ℓ + 1) w_ℓ = ε(v_ℓ) w_ℓ − (1 − γ/v_ℓ) ε(v_ℓ) w_ℓ = (γ/v_ℓ) ε(v_ℓ) w_ℓ ≥ γ I(T)/|T|. So we have I(T') ≤ (1 − γ/|T|) I(T) ≤ (1 − γ/|T|) ε(|T|) = ε(|T'|).

Finally, if the procedure grows new leaves, then I(T) does not increase and |T| remains the same, and hence the invariant is maintained. □

For any internal node m in a tree T, let C(m) denote the set of nodes which are children of m. A VW-tree will be called locally well-formed if for every internal node m we have that v_m = |C(m)| and Σ_{n ∈ C(m)} w_n ≤ ε(|C(m)|) w_m. A VW-tree will be called globally safe if max_{ℓ ∈ L(T)} ε(v_ℓ) w_ℓ / v_ℓ ≤ min_{m ∈ N(T)} ε(v_m − 1) w_m / (v_m − 1), where N(T) denotes the set of internal nodes of T.

Lemma 5.2 If T is a locally well-formed and globally safe VW-tree, then T is a possible output of the VW growth procedure and therefore I(T) ≤ ε(|T|).

Proof: Since T is locally well-formed we can use T as a \"template\" for making nondeterministic choices in the VW growth procedure. This process is guaranteed to produce T provided that the growth procedure is never forced to visit a node corresponding to a leaf of T. But the global safety condition guarantees that any unfinished internal node of T has a weight at least as large as that of any leaf node of T. □

We now give a way of mapping H-trees into VW-trees. More specifically, for any H-tree T we define VW(T) to be the result of assigning to each node m in T the weight p_m I(q_m), to each internal node a visitation count equal to its number of children, and to each leaf node a visitation count equal to 1. We now have the following lemmas.

Lemma 5.3 If T is grown by a γ-boosting run of the multi-branch procedure then VW(T) is locally well-formed.

Proof: Note that the children of an internal node m are derived by selecting a branching function h for the node m. Since the run is γ-boosting we have Δ(S_ℓ, h)/g(|R_h|) ≥ γ I(q_ℓ).
Therefore Δ(S_ℓ, h) = I(q_ℓ) − I_{S_ℓ}(T_h) ≥ γ g(|R_h|) I(q_ℓ) = I(q_ℓ)(1 − ε(|R_h|)). This implies that I_{S_ℓ}(T_h) ≤ ε(|R_h|) I(q_ℓ). Multiplying by p_ℓ and translating the result into weights in the tree VW(T) gives the desired result. □

The following lemma now suffices for Theorem 4.1.

Lemma 5.4 If T is grown by a γ-boosting run of the multi-branch procedure then VW(T) is globally safe.

Proof: First note that the following is an invariant of a γ-boosting run of the multi-branch procedure:

max_{ℓ ∈ L(VW(T))} w_ℓ ≤ min_{m ∈ N(VW(T))} w_m.

The proof is a simple induction on γ-boosting tree growth, using the fact that the procedure always expands a leaf node of maximal weight.

We must now show that for every internal node m and every leaf ℓ we have that w_ℓ ≤ ε(k − 1) w_m / (k − 1), where k is the number of children of m. Note that if k = 2 then this reduces to w_ℓ ≤ w_m, which follows from the above invariant. So we can assume without loss of generality that k > 2. Also, since ε(k)/k < ε(k − 1)/(k − 1), it suffices to show that w_ℓ ≤ ε(k) w_m / k.

Let m be an internal node with k > 2 children and let T' be the tree at the time m was selected for expansion. Let w_ℓ be the maximum weight of a leaf in the final tree T. By the definition of the acceptability condition, in the last s/2 iterations the procedure performs only binary branching. Each binary expansion reduces the index by at least γ times the weight of the selected node. Since the sequence of nodes selected in the multi-branch procedure has non-increasing weights, in any iteration the weight of the selected node is at least w_ℓ. Since there are at least s/2 binary expansions after the expansion of m, each of which reduces I by at least γ w_ℓ, we have that s γ w_ℓ / 2 ≤ I(T'), so w_ℓ ≤ 2 I(T')/(γ s). The acceptability condition can be written as 2/(γ s) ≤ ε(k)/(k |T'|), which now yields w_ℓ ≤ I(T') ε(k)/(k |T'|).
But we have that I(T')/|T'| ≤ w_m, which now yields w_ℓ ≤ ε(k) w_m / k, as desired. □

References

[1] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[2] Tom Dietterich, Michael Kearns, and Yishay Mansour. Applying the weak learning framework to understand and improve C4.5. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 96-104, 1996.

[3] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.

[4] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Second European Conference, EuroCOLT '95, pages 23-37. Springer-Verlag, 1995.

[5] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156, 1996.

[6] Michael Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learning. In Proceedings of the Twenty-Eighth ACM Symposium on the Theory of Computing, pages 459-468, 1996.

[7] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[8] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.
", "award": [], "sourceid": 1659, "authors": [{"given_name": "Yishay", "family_name": "Mansour", "institution": null}, {"given_name": "David", "family_name": "McAllester", "institution": null}]}