{"title": "Generalization in Decision Trees and DNF: Does Size Matter?", "book": "Advances in Neural Information Processing Systems", "page_first": 259, "page_last": 265, "abstract": "", "full_text": "Generalization in decision trees and DNF: \n\nDoes size matter? \n\nMostefa Golea1*, Peter L. Bartlett1, Wee Sun Lee2 and Llew Mason1 \n\n1 Department of Systems Engineering \nResearch School of Information Sciences and Engineering \nAustralian National University \nCanberra, ACT, 0200, Australia \n\n2 School of Electrical Engineering \nUniversity College UNSW \nAustralian Defence Force Academy \nCanberra, ACT, 2600, Australia \n\nAbstract \n\nRecent theoretical results for pattern classification with thresholded real-valued functions (such as support vector machines, sigmoid networks, and boosting) give bounds on misclassification probability that do not depend on the size of the classifier, and hence can be considerably smaller than the bounds that follow from the VC theory. In this paper, we show that these techniques can be more widely applied, by representing other boolean functions as two-layer neural networks (thresholded convex combinations of boolean functions). For example, we show that with high probability any decision tree of depth no more than d that is consistent with m training examples has misclassification probability no more than O(((1/m)(N_eff VCdim(U) \\log^2 m \\log d))^{1/2}), where U is the class of node decision functions, and N_eff \\le N can be thought of as the effective number of leaves (it becomes small as the distribution on the leaves induced by the training data gets far from uniform). This bound is qualitatively different from the VC bound and can be considerably smaller. \nWe use the same technique to give similar results for DNF formulae. \n\n* Author to whom correspondence should be addressed \n\n1 INTRODUCTION \n\nDecision trees are widely used for pattern classification [2, 7]. For these problems, results from the VC theory suggest that the amount of training data should grow at least linearly with the size of the tree [4, 3]. However, empirical results suggest that this is not necessary (see [6, 10]). For example, it has been observed that the error rate is not always a monotonically increasing function of the tree size [6]. \nTo see why the size of a tree is not always a good measure of its complexity, consider two trees, A with N_A leaves and B with N_B leaves, where N_B << N_A. Although A is larger than B, if most of the classification in A is carried out by very few leaves and the classification in B is equally distributed over the leaves, intuition suggests that A is actually much simpler than B, since tree A can be approximated well by a small tree with few leaves. In this paper, we formalize this intuition. \nWe give misclassification probability bounds for decision trees in terms of a new complexity measure that depends on the distribution on the leaves that is induced by the training data, and can be considerably smaller than the size of the tree. \nThese results build on recent theoretical results that give misclassification probability bounds for thresholded real-valued functions, including support vector machines, sigmoid networks, and boosting (see [1, 8, 9]), that do not depend on the size of the classifier. We extend these results to decision trees by considering a decision tree as a thresholded convex combination of the leaf functions (the boolean functions that specify, for a given leaf, which patterns reach that leaf). We can then apply the misclassification probability bounds for such classifiers. 
In fact, we derive and use a refinement of the previous bounds for convex combinations of base hypotheses, in which the base hypotheses can come from several classes of different complexity, and the VC-dimension of the base hypothesis class is replaced by the average (under the convex coefficients) of the VC-dimensions of these classes. For decision trees, the bounds we obtain depend on the effective number of leaves, a data-dependent quantity that reflects how uniformly the training data covers the tree's leaves. This bound is qualitatively different from the VC bound, which depends on the total number of leaves in the tree. \nIn the next section, we give some definitions and describe the techniques used. We present bounds on the misclassification probability of a thresholded convex combination of boolean functions from base hypothesis classes, in terms of a misclassification margin and the average VC-dimension of the base hypotheses. In Sections 3 and 4, we use this result to give error bounds for decision trees and disjunctive normal form (DNF) formulae. \n\n2 GENERALIZATION ERROR IN TERMS OF MARGIN AND AVERAGE COMPLEXITY \n\nWe begin with some definitions. For a class H of {-1, 1}-valued functions defined on the input space X, the convex hull co(H) of H is the set of [-1, 1]-valued functions of the form \\sum_i a_i h_i, where a_i \\ge 0, \\sum_i a_i = 1, and h_i \\in H. A function in co(H) is used for classification by composing it with the threshold function sgn: R -> {-1, 1}, which satisfies sgn(\\alpha) = 1 iff \\alpha \\ge 0. So f \\in co(H) makes a mistake on the pair (x, y) \\in X x {-1, 1} iff sgn(f(x)) \\ne y. We assume that labelled examples (x, y) are generated according to some probability distribution D on X x {-1, 1}, and we let P_D[E] denote the probability under D of an event E. If S is a finite subset of X x {-1, 1}, we let P_S[E] denote the empirical probability of E (that is, the proportion of points in S that lie in E). 
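The definitions above can be made concrete with a minimal numeric sketch: a convex combination of {-1, 1}-valued base hypotheses and its empirical margin error P_S[yf(x) <= theta]. The base hypotheses and sample below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def convex_combination(hypotheses, coeffs):
    """Return f in co(H): a [-1, 1]-valued convex combination of
    {-1, 1}-valued base hypotheses (coeffs nonnegative, summing to 1)."""
    coeffs = np.asarray(coeffs, dtype=float)
    assert np.all(coeffs >= 0) and abs(coeffs.sum() - 1.0) < 1e-9
    return lambda x: sum(a * h(x) for a, h in zip(coeffs, hypotheses))

def empirical_margin_error(f, sample, theta):
    """P_S[y f(x) <= theta]: fraction of the sample labelled to the
    correct side of the threshold by margin at most theta."""
    return float(np.mean([y * f(x) <= theta for x, y in sample]))

# Two toy base hypotheses on scalar inputs (hypothetical).
h1 = lambda x: 1 if x > 0 else -1
h2 = lambda x: 1 if x > 2 else -1
f = convex_combination([h1, h2], [0.7, 0.3])

S = [(3.0, 1), (1.0, 1), (-1.0, -1)]
print(empirical_margin_error(f, S, 0.0))   # mistakes at the sgn level: 0.0
print(empirical_margin_error(f, S, 0.5))   # one point has margin 0.4 <= 0.5
```

Note that taking theta = 0 recovers the ordinary training error of sgn(f), while larger theta counts correctly classified points whose margin is small.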
We use E_D[.] and E_S[.] to denote expectation in a similar way. For a function class H of {-1, 1}-valued functions defined on the input space X, the growth function and VC dimension of H will be denoted by \\Pi_H(m) and VCdim(H) respectively. \nIn [8], Schapire et al. give the following bound on the misclassification probability of a thresholded convex combination of functions, in terms of the proportion of training data that is labelled to the correct side of the threshold by some margin. (Notice that P_D[sgn(f(x)) \\ne y] \\le P_D[yf(x) \\le 0].) \n\nTheorem 1 ([8]) Let D be a distribution on X x {-1, 1}, H a hypothesis class with VCdim(H) = d < \\infty, and \\delta > 0. With probability at least 1 - \\delta over a training set S of m examples chosen according to D, every function f \\in co(H) and every \\theta > 0 satisfy \n\nP_D[yf(x) \\le 0] \\le P_S[yf(x) \\le \\theta] + O\\left(\\frac{1}{\\sqrt{m}}\\left(\\frac{d \\log^2(m/d)}{\\theta^2} + \\log(1/\\delta)\\right)^{1/2}\\right). \n\nIn Theorem 1, all of the base hypotheses in the convex combination f are elements of a single class H with bounded VC-dimension. The following theorem generalizes this result to the case in which these base hypotheses may be chosen from any of k classes, H_1, ..., H_k, which can have different VC-dimensions. It also gives a related result that shows the error decreases to twice the error estimate at a faster rate. \n\nTheorem 2 Let D be a distribution on X x {-1, 1}, H_1, ..., H_k hypothesis classes with VCdim(H_i) = d_i, and \\delta > 0. With probability at least 1 - \\delta over a training set S of m examples chosen according to D, every function f \\in co(\\bigcup_{i=1}^k H_i) and every \\theta > 0 satisfy both \n\nP_D[yf(x) \\le 0] \\le P_S[yf(x) \\le \\theta] + O\\left(\\frac{1}{\\sqrt{m}}\\left(\\frac{1}{\\theta^2}(\\bar{d}\\log m + \\log k)\\log(m\\theta^2/\\bar{d}) + \\log(1/\\delta)\\right)^{1/2}\\right), \n\nP_D[yf(x) \\le 0] \\le 2 P_S[yf(x) \\le \\theta] + O\\left(\\frac{1}{m}\\left(\\frac{1}{\\theta^2}(\\bar{d}\\log m + \\log k)\\log(m\\theta^2/\\bar{d}) + \\log(1/\\delta)\\right)\\right), \n\nwhere \\bar{d} = \\sum_i a_i d_{j_i}, and the a_i and j_i are defined by f = \\sum_i a_i h_i and h_i \\in H_{j_i} for j_i \\in {1, ..., k}. \n\nProof sketch: We shall sketch only the proof of the first inequality of the theorem. The proof closely follows the proof of Theorem 1 (see [8]). We consider a number of approximating sets of the form C_{N,l} = \\{(1/N)\\sum_{i=1}^N \\tilde{h}_i : \\tilde{h}_i \\in H_{l_i}\\}, where l = (l_1, ..., l_N) \\in {1, ..., k}^N and N \\in N. Define C_N = \\bigcup_l C_{N,l}. \nFor a given f = \\sum_i a_i h_i from co(\\bigcup_{i=1}^k H_i), we shall choose an approximation g \\in C_N by choosing \\tilde{h}_1, ..., \\tilde{h}_N independently from {h_1, h_2, ...}, according to the distribution defined by the coefficients a_i. Let Q denote this distribution on C_N. As in [8], we can take the expectation under this random choice of g \\in C_N to show that, for any \\theta > 0, P_D[yf(x) \\le 0] \\le E_{g \\sim Q}[P_D[yg(x) \\le \\theta/2]] + \\exp(-N\\theta^2/8). Now, for a given l \\in {1, ..., k}^N, the probability that there is a g in C_{N,l} and a \\theta > 0 for which P_D[yg(x) \\le \\theta/2] > P_S[yg(x) \\le \\theta/2] + \\epsilon_{N,l} is at most 8(N+1)\\prod_{i=1}^N (2em/d_{l_i})^{d_{l_i}} \\exp(-m\\epsilon_{N,l}^2/32). Applying the union bound (over the values of l), taking expectation over g \\sim Q, and setting \\epsilon_{N,l} = \\left(\\frac{32}{m}\\ln\\left(8(N+1)\\prod_{i=1}^N (2em/d_{l_i})^{d_{l_i}} k^N/\\delta_N\\right)\\right)^{1/2} shows that, with probability at least 1 - \\delta_N, every f and \\theta > 0 satisfy P_D[yf(x) \\le 0] \\le E_g[P_S[yg(x) \\le \\theta/2]] + E_g[\\epsilon_{N,l}]. As above, we can bound the probability inside the first expectation in terms of P_S[yf(x) \\le \\theta]. Also, Jensen's inequality implies that E_g[\\epsilon_{N,l}] \\le \\left(\\frac{32}{m}\\left(\\ln(8(N+1)/\\delta_N) + N\\ln k + N\\sum_i a_i d_{j_i}\\ln(2em)\\right)\\right)^{1/2}. Setting \\delta_N = \\delta/(N(N+1)) and N = \\lceil(4/\\theta^2)\\ln(m\\theta^2/\\bar{d})\\rceil gives the result. \n\nTheorem 2 gives misclassification probability bounds only for thresholded convex combinations of boolean functions. 
The key technique we use in the remainder of the paper is to find representations in this form (that is, as two-layer neural networks) of more general boolean functions. We have some freedom in choosing the convex coefficients, and this choice affects both the error estimate P_S[yf(x) \\le \\theta] and the average VC-dimension \\bar{d}. We attempt to choose the coefficients and the margin \\theta so as to optimize the resulting bound on misclassification probability. In the next two sections, we use this approach to find misclassification probability bounds for decision trees and DNF formulae. \n\n3 DECISION TREES \n\nA two-class decision tree T is a tree whose internal decision nodes are labeled with boolean functions from some class U and whose leaves are labeled with class labels from {-1, +1}. For a tree with N leaves, define the leaf functions h_i: X -> {-1, 1} by h_i(x) = 1 iff x reaches leaf i, for i = 1, ..., N. Note that h_i is the conjunction of all tests on the path from the root to leaf i. \nFor a sample S and a tree T, let P_i = P_S[h_i(x) = 1]. Clearly, P = (P_1, ..., P_N) is a probability vector. Let \\sigma_i \\in {-1, +1} denote the class assigned to leaf i. Define the class of leaf functions for leaves up to depth j as \n\nH_j = \\{h : h = u_1 \\wedge u_2 \\wedge \\cdots \\wedge u_r, r \\le j, u_i \\in U\\}. \n\nIt is easy to show that VCdim(H_j) \\le 2j VCdim(U) \\ln(2ej). Let d_i denote the depth of leaf i, so h_i \\in H_{d_i}, and let d = \\max_i d_i. \nThe boolean function implemented by a decision tree T can be written as a thresholded convex combination of the form T(x) = sgn(f(x)), where f(x) = \\sum_{i=1}^N w_i \\sigma_i ((h_i(x)+1)/2) = \\sum_{i=1}^N w_i \\sigma_i h_i(x)/2 + \\sum_{i=1}^N w_i \\sigma_i/2, with w_i > 0 and \\sum_{i=1}^N w_i = 1. (To be precise, we need to enlarge the classes H_j slightly to be closed under negation. This does not affect the results by more than a constant.) We first assume that the tree is consistent with the training sample. 
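The rewriting of a tree as a thresholded convex combination of its leaf functions can be sketched directly. The toy depth-2 tree below is a hypothetical example, not one from the paper; only the leaf that x reaches contributes to f(x), so sgn(f(x)) reproduces the tree's prediction.

```python
# A toy depth-2 tree on x = (x0, x1): the root tests x0 > 0 and the
# right child tests x1 > 0.  Each leaf function h_i(x) = 1 iff x reaches
# leaf i (the conjunction of the tests on the root-to-leaf path); sigma_i
# is the class label assigned to leaf i.
leaves = [
    (lambda x: 1 if x[0] <= 0 else -1,              -1),  # (h_1, sigma_1)
    (lambda x: 1 if x[0] > 0 and x[1] <= 0 else -1, -1),  # (h_2, sigma_2)
    (lambda x: 1 if x[0] > 0 and x[1] > 0 else -1,  +1),  # (h_3, sigma_3)
]

def tree_as_convex_combination(leaves, weights):
    """f(x) = sum_i w_i sigma_i (h_i(x)+1)/2, so that T(x) = sgn(f(x)).
    Terms with h_i(x) = -1 contribute zero, leaving w_i sigma_i for the
    single leaf that x reaches."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return lambda x: sum(w * s * (h(x) + 1) / 2
                         for (h, s), w in zip(leaves, weights))

f = tree_as_convex_combination(leaves, [0.5, 0.25, 0.25])

def tree_predict(x):
    """Direct evaluation of the same toy tree, for comparison."""
    return +1 if (x[0] > 0 and x[1] > 0) else -1

for x in [(-1.0, 5.0), (1.0, -1.0), (1.0, 1.0)]:
    assert (f(x) > 0) == (tree_predict(x) > 0)
```

Since exactly one leaf function is active for any input, f(x) equals w_i sigma_i for the reached leaf i, and thresholding it recovers the tree's output.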
We will show later how the results extend to the inconsistent case. \nThe second inequality of Theorem 2 shows that, for fixed \\delta > 0 there is a constant c such that, for any distribution D, with probability at least 1 - \\delta over the sample S we have P_D[T(x) \\ne y] \\le 2 P_S[yf(x) \\le \\theta] + \\frac{1}{\\theta^2}\\sum_{i=1}^N w_i d_i B, where B = \\frac{c}{m} VCdim(U) \\log^2 m \\log d. Different choices of the w_i and the \\theta will yield different estimates of the error rate of T. We can assume (wlog) that P_1 \\ge \\cdots \\ge P_N. A natural choice is w_i = P_i and P_{j+1} \\le \\theta < P_j for some j \\in {1, ..., N}, which gives \n\nP_D[T(x) \\ne y] \\le 2\\sum_{i=j+1}^N P_i + \\frac{\\bar{d}B}{\\theta^2},     (1) \n\nwhere \\bar{d} = \\sum_{i=1}^N P_i d_i. We can optimize this expression over the choices of j \\in {1, ..., N} and \\theta to give a bound on the misclassification probability of the tree. \nLet \\rho(P, U) = \\sum_{i=1}^N (P_i - 1/N)^2 be the quadratic distance between the probability vector P = (P_1, ..., P_N) and the uniform probability vector U = (1/N, 1/N, ..., 1/N). Define N_eff = N(1 - \\rho(P, U)). The parameter N_eff is a measure of the effective number of leaves in the tree. \n\nTheorem 3 For a fixed \\delta > 0, there is a constant c that satisfies the following. Let D be a distribution on X x {-1, 1}. Consider the class of decision trees of depth up to d, with decision functions in U. With probability at least 1 - \\delta over the training set S (of size m), every decision tree T that is consistent with S has \n\nP_D[T(x) \\ne y] \\le c\\left(\\frac{N_eff VCdim(U) \\log^2 m \\log d}{m}\\right)^{1/2}, \n\nwhere N_eff is the effective number of leaves of T. \n\nProof: Supposing that \\theta \\ge (\\bar{d}/N)^{1/2}, we optimize (1) by choice of \\theta. If the chosen \\theta is actually smaller than (\\bar{d}/N)^{1/2} then we show that the optimized bound still holds by a standard VC result. If \\theta \\ge (\\bar{d}/N)^{1/2} then \\sum_{i=j+1}^N P_i \\le \\theta^2 N_eff/\\bar{d}. So (1) implies that P_D[T(x) \\ne y] \\le 2\\theta^2 N_eff/\\bar{d} + \\bar{d}B/\\theta^2. 
The optimal choice of \\theta is then (\\bar{d}^2 B/N_eff)^{1/4}. So if (\\bar{d}^2 B/N_eff)^{1/4} \\ge (\\bar{d}/N)^{1/2}, we have the result. Otherwise, the upper bound we need to prove satisfies 2(2 N_eff B)^{1/2} > 2NB, and this result is implied by standard VC results using a simple upper bound for the growth function of the class of decision trees with N leaves. \nThus the parameters that quantify the complexity of a tree are: a) the complexity of the test function class U, and b) the effective number of leaves N_eff. The effective number of leaves can potentially be much smaller than the total number of leaves in the tree [5]. Since this parameter is data-dependent, the same tree can be simple for one set of P_i and complex for another set of P_i. \nFor trees that are not consistent with the training data, the procedure to estimate the error rate is similar. By defining Q_i = P_S[y\\sigma_i = -1 | h_i(x) = 1] and P_i' = P_i(1 - Q_i)/(1 - P_S[T(x) \\ne y]) we obtain the following result. \n\nTheorem 4 For a fixed \\delta > 0, there is a constant c that satisfies the following. Let D be a distribution on X x {-1, 1}. Consider the class of decision trees of depth up to d, with decision functions in U. With probability at least 1 - \\delta over the training set S (of size m), every decision tree T has \n\nP_D[T(x) \\ne y] \\le P_S[T(x) \\ne y] + c\\left(\\frac{N_eff VCdim(U) \\log^2 m \\log d}{m}\\right)^{1/3}, \n\nwhere c is a universal constant, and N_eff = N(1 - \\rho(P', U)) is the effective number of leaves of T. \n\nNotice that this definition of N_eff generalizes the definition given before Theorem 3. \n\n4 DNF AS THRESHOLDED CONVEX COMBINATIONS \n\nA DNF formula defined on {-1, 1}^n is a disjunction of terms, where each term is a conjunction of literals and a literal is either a variable or its negation. For a given DNF formula g, we use N to denote the number of terms in g, t_i to represent the ith term in g, L_i to represent the set of literals in t_i, and N_i the size of L_i. Each term t_i can be thought of as a member of the class H_{N_i}, the set of monomials with N_i literals. Clearly, |H_{N_i}| = \\binom{2n}{N_i}. The DNF g can be written as a thresholded convex combination of the form g(x) = -sgn(-f(x)) = -sgn(-\\sum_{i=1}^N w_i ((t_i(x)+1)/2)). (Recall that sgn(\\alpha) = 1 iff \\alpha \\ge 0.) Further, each term t_i can be written as a thresholded convex combination of the form t_i(x) = sgn(f_i(x)) = sgn(\\sum_{l_k \\in L_i} v_{ik} ((l_k(x)-1)/2)). \nAssume for simplicity that the DNF is consistent (the results extend easily to the inconsistent case). Let \\gamma^+ (\\gamma^-) denote the fraction of positive (negative) examples under distribution D. Let P_{D^+}[.] (P_{D^-}[.]) denote probability with respect to the distribution over the positive (negative) examples, and let P_{S^+}[.] (P_{S^-}[.]) be defined similarly, with respect to the sample S. Notice that P_D[g(x) \\ne y] = \\gamma^+ P_{D^+}[g(x) = -1] + \\gamma^- P_{D^-}[(\\exists i) t_i(x) = 1], so the second inequality of Theorem 2 shows that, with probability at least 1 - \\delta, for any \\theta and any \\theta_i, \n\nP_D[g(x) \\ne y] \\le \\gamma^+\\left(2 P_{S^+}[f(x) \\le \\theta] + \\frac{\\bar{d}B}{\\theta^2}\\right) + \\gamma^-\\sum_{i=1}^N\\left(2 P_{S^-}[-f_i(x) \\le \\theta_i] + \\frac{B}{\\theta_i^2}\\right), \n\nwhere \\bar{d} = \\sum_{i=1}^N w_i N_i and B = c(\\log n \\log^2 m + \\log(N/\\delta))/m. As in the case of decision trees, different choices of \\theta, the \\theta_i, and the weights yield different estimates of the error. For an arbitrary order of the terms, let P_i be the fraction of positive examples covered by term t_i but not by terms t_{i-1}, ..., t_1. We order the terms such that for each i, with t_{i-1}, ..., t_1 fixed, P_i is maximized, so that P_1 \\ge \\cdots \\ge P_N, and we choose w_i = P_i. Likewise, for a given term t_i with literals l_1, ..., l_{N_i} in an arbitrary order, let p_k^{(i)} be the fraction of negative examples uncovered by literal l_k but not uncovered by l_{k-1}, ..., l_1. We order the literals of term t_i in the same greedy way as above so that p_1^{(i)} \\ge \\cdots \\ge p_{N_i}^{(i)}, and we choose v_{ik} = p_k^{(i)}. For P_{j+1} \\le \\theta < P_j and p_{j_i+1}^{(i)} \\le \\theta_i < p_{j_i}^{(i)}, where 1 \\le j \\le N and 1 \\le j_i \\le N_i, we get \n\nP_D[g(x) \\ne y] \\le \\gamma^+\\left(2\\sum_{i=j+1}^N P_i + \\frac{\\bar{d}B}{\\theta^2}\\right) + \\gamma^-\\sum_{i=1}^N\\left(2\\sum_{k=j_i+1}^{N_i} p_k^{(i)} + \\frac{B}{\\theta_i^2}\\right). \n\nNow, let P = (P_1, ..., P_N) and for each term i let p^{(i)} = (p_1^{(i)}, ..., p_{N_i}^{(i)}). Define N_eff = N(1 - \\rho(P, U)) and N_eff^{(i)} = N_i(1 - \\rho(p^{(i)}, U)), where U is the relevant uniform distribution in each case. The parameter N_eff is a measure of the effective number of terms in the DNF formula. It can be much smaller than N; this would be the case if few terms cover a large fraction of the positive examples. The parameter N_eff^{(i)} is a measure of the effective number of literals in term t_i. Again, it can be much smaller than the actual number of literals in t_i: this would be the case if few literals of the term uncover a large fraction of the negative examples. \nOptimizing over \\theta and the \\theta_i as in the proof of Theorem 3 gives the following result. \n\nTheorem 5 For a fixed \\delta > 0, there is a constant c that satisfies the following. Let D be a distribution on X x {-1, 1}. Consider the class of DNF formulae with up to N terms. With probability at least 1 - \\delta over the training set S (of size m), every DNF formula g that is consistent with S has \n\nP_D[g(x) \\ne y] \\le \\gamma^+(N_eff \\bar{d} B)^{1/2} + \\gamma^-\\sum_{i=1}^N (N_eff^{(i)} B)^{1/2}, \n\nwhere \\bar{d} = \\max_{i=1,...,N} N_i, \\gamma^{\\pm} = P_D[y = \\pm 1] and B = c(\\log n \\log^2 m + \\log(N/\\delta))/m. \n\n5 CONCLUSIONS \n\nThe results in this paper show that structural complexity measures (such as size) of decision trees and DNF formulae are not always the most appropriate in determining their generalization behaviour, and that measures of complexity that depend on the training data may give a more accurate description. 
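As a small illustration of such a data-dependent measure, the effective number of leaves from Section 3 can be computed directly from the empirical leaf frequencies; the toy counts below are hypothetical.

```python
import numpy as np

def effective_number_of_leaves(counts):
    """N_eff = N (1 - rho(P, U)), where P_i is the empirical fraction of
    training examples reaching leaf i and rho(P, U) = sum_i (P_i - 1/N)^2
    is the quadratic distance to the uniform probability vector U."""
    counts = np.asarray(counts, dtype=float)
    N = len(counts)
    P = counts / counts.sum()
    rho = np.sum((P - 1.0 / N) ** 2)
    return N * (1.0 - rho)

print(effective_number_of_leaves([25, 25, 25, 25]))  # uniform: N_eff = N = 4
print(effective_number_of_leaves([97, 1, 1, 1]))     # skewed: N_eff close to 1
```

When the training data is spread uniformly over the leaves, rho = 0 and N_eff equals the total number of leaves; as the distribution concentrates on a few leaves, N_eff shrinks toward 1, matching the intuition from the introduction.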
Our analysis can be extended to multi-class classification problems. A similar analysis implies similar bounds on misclassification probability for decision lists, and it seems likely that these techniques will also be applicable to other pattern classification methods. \nThe complexity parameter N_eff described here does not always give the best possible error bounds. For example, the effective number of leaves N_eff in a decision tree can be thought of as a single number that summarizes the probability distribution over the leaves induced by the training data. It seems unlikely that such a number will give optimal bounds for all distributions. In those cases, better bounds could be obtained by using numerical techniques to optimize over the choice of \\theta and the w_i. It would be interesting to see how the bounds we obtain and those given by numerical techniques reflect the generalization performance of classifiers used in practice. \n\nAcknowledgements \n\nThanks to Yoav Freund and Rob Schapire for helpful comments. \n\nReferences \n\n[1] P. L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Neural Information Processing Systems 9, pages 134-140. Morgan Kaufmann, San Mateo, CA, 1997. \n[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984. \n[3] A. Ehrenfeucht and D. Haussler. Learning decision trees from random examples. Information and Computation, 82:231-246, 1989. \n[4] U. M. Fayyad and K. B. Irani. What should be minimized in a decision tree? In AAAI-90, pages 749-754, 1990. \n[5] R. C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63-91, 1993. \n[6] P. M. Murphy and M. J. Pazzani. Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction. 
Journal of Artificial Intelligence Research, 1:257-275, 1994. \n[7] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992. \n[8] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 322-330, 1997. \n[9] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. A framework for structural risk minimisation. In Proc. 9th COLT, pages 68-76. ACM Press, New York, NY, 1996. \n[10] G. I. Webb. Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research, 4:397-417, 1996. \n", "award": [], "sourceid": 1340, "authors": [{"given_name": "Mostefa", "family_name": "Golea", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Wee Sun", "family_name": "Lee", "institution": null}, {"given_name": "Llew", "family_name": "Mason", "institution": null}]}