{"title": "Using Pairs of Data-Points to Define Splits for Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 507, "page_last": 513, "abstract": null, "full_text": "Using Pairs of Data-Points to Define \n\nSplits for Decision Trees \n\nGeoffrey E. Hinton \n\nMichael Revow \n\nDepartment of Computer Science \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nUniversity of Toronto \n\nToronto, Ontario, M5S lA4, Canada \n\nToronto, Ontario, M5S lA4, Canada \n\nhinton@cs.toronto.edu \n\nrevow@cs.toronto.edu \n\nAbstract \n\nConventional binary classification trees such as CART either split \nthe data using axis-aligned hyperplanes or they perform a compu(cid:173)\ntationally expensive search in the continuous space of hyperplanes \nwith unrestricted orientations. We show that the limitations of the \nformer can be overcome without resorting to the latter. For every \npair of training data-points, there is one hyperplane that is orthog(cid:173)\nonal to the line joining the data-points and bisects this line. Such \nhyperplanes are plausible candidates for splits. In a comparison \non a suite of 12 datasets we found that this method of generating \ncandidate splits outperformed the standard methods, particularly \nwhen the training sets were small. \n\n1 \n\nIntroduction \n\nBinary decision trees come in many flavours, but they all rely on splitting the set of \nk-dimensional data-points at each internal node into two disjoint sets. Each split is \nusually performed by projecting the data onto some direction in the k-dimensional \nspace and then thresholding the scalar value of the projection. There are two \ncommonly used methods of picking a projection direction. The simplest method is \nto restrict the allowable directions to the k axes defined by the data. This is the \ndefault method used in CART [1]. 
If this set of directions is too restrictive, the usual alternative is to search general directions in the full k-dimensional space or general directions in a space defined by a subset of the k axes. \n\nProjections onto one of the k axes defined by the data have many advantages over projections onto a more general direction: \n\n1. It is very efficient to perform the projection for each of the data-points. We simply ignore the values of the data-point on the other axes. \n\n2. For N data-points, it is feasible to consider all possible axis-aligned projections and thresholds because there are only k possible projections and for each of these there are at most N - 1 threshold values that yield different splits. Selecting from a fixed set of projections and thresholds is simpler than searching the k-dimensional continuous space of hyperplanes that correspond to unrestricted projections and thresholds. \n\n3. Since a split is selected from only about Nk candidates, it takes only about log2 N + log2 k bits to define the split. So it should be possible to use many more of these axis-aligned splits before overfitting occurs than if we use more general hyperplanes. If the data-points are in general position, each subset of size k defines a different hyperplane, so there are N!/k!(N - k)! distinctly different hyperplanes, and if k << N it takes approximately k log2 N bits to specify one of them. \n\nFor some datasets, the restriction to axis-aligned projections is too limiting. This is especially true for high-dimensional data, like images, in which there are strong correlations between the intensities of neighbouring pixels. In such cases, many axis-aligned boundaries may be required to approximate a planar boundary that is not axis-aligned, so it is natural to consider unrestricted projections, and some versions of the CART program allow this. 
Unfortunately this greatly increases the computational burden and the search may get trapped in local minima. Also, significant care must be exercised to avoid overfitting. There is, however, an intermediate approach which allows the projections to be non-axis-aligned but preserves all three of the attractive properties of axis-aligned projections: It is trivial to decide which side of the resulting hyperplane a given data-point lies on; the hyperplanes can be selected from a modest-sized set of sensible candidates; and hence many splits can be used before overfitting occurs because only a few bits are required to specify each split. \n\n2 Using two data-points to define a projection \n\nEach pair of data-points defines a direction in the data space. This direction is a plausible candidate for a projection to be used in splitting the data, especially if it is a classification task and the two data-points are in different classes. For each such direction, we could consider all of the N - 1 possible thresholds that would give different splits, or, to save time and reduce complexity, we could only consider the threshold value that is halfway between the two data-points that define the projection. If we use this threshold value, each pair of data-points defines exactly one hyperplane and we call the two data-points the \"poles\" of this hyperplane. \n\nFor a general k-dimensional hyperplane it requires O(k) operations to decide whether a data-point, C, is on one side or the other. But we can save a factor of k by using hyperplanes defined by pairs of data-points. If we already know the distances of C from each of the two poles, A, B, then we only need to compare \n\nFigure 1: A hyperplane orthogonal to the line joining points A and B. We can quickly determine on which side a test point, C, lies by comparing the distances AC and BC. 
\n\nthese two distances (see figure 1).1 So if we are willing to do O(kN^2) operations to compute all the pairwise distances between the data-points, we can then decide in constant time which side of the hyperplane a point lies on. \n\nAs we are building the decision tree, we need to compute the gain in performance from using each possible split at each existing terminal node. Since all the terminal nodes combined contain N data-points and there are N(N - 1)/2 possible splits,2 this takes time O(N^3) instead of O(kN^3). So the work in computing all the pairwise distances is trivial compared with the savings. \n\nUsing the Minimum Description Length framework, it is clear that pole-pair splits can be described very cheaply, so a lot of them can be used before overfitting occurs. When applying MDL to a supervised learning task we can assume that the receiver gets to see the input vectors for free. It is only the output vectors that need to be communicated. So if splits are selected from a set of N(N - 1)/2 possibilities that is determined by the input vectors, it takes only about 2 log2 N bits to communicate a split to a receiver. Even if we allow all N - 1 possible threshold values along the projection defined by two data-points, it takes only about 3 log2 N bits. So the number of these splits that can be used before overfitting occurs should be greater by a factor of about k/2 or k/3 than for general hyperplanes. Assuming that k << N, the same line of argument suggests that even more axis-aligned planes can be used, but only by a factor of about 2 or 3. \n\nTo summarize, the hyperplanes defined by pairs of data-points are computationally convenient and seem like natural candidates for good splits. They overcome the major weakness of axis-aligned splits and, because they can be specified in a modest number of bits, they may be more effective than fully general hyperplanes when the training set is small. 
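The constant-time side test described above is easy to sketch in code. The following is an illustrative reconstruction, not the authors' implementation; the function names and the toy points are our own.

```python
import math

def pairwise_distances(points):
    '''Precompute all pairwise Euclidean distances once: O(k N^2) work.'''
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(points[i], points[j])
    return d

def side_of_pole_pair(d, a, b, c):
    '''Side of the hyperplane that bisects the segment between poles a and b.

    With the threshold midway between the poles, point c lies on pole a's
    side exactly when it is closer to a than to b, so one comparison of
    precomputed distances decides the side in O(1), independent of k.
    '''
    return 0 if d[a][c] <= d[b][c] else 1

points = [(0.0, 0.0), (4.0, 0.0), (1.0, 1.0), (3.5, -1.0)]
d = pairwise_distances(points)
print(side_of_pole_pair(d, 0, 1, 2))  # prints 0: closer to pole 0
print(side_of_pole_pair(d, 0, 1, 3))  # prints 1: closer to pole 1
```

For a threshold that is not midway between the poles, the quantity in footnote 1, (d_AC^2 - d_BC^2)/2d_AB, is the signed offset of C from the midpoint along the pole-pair direction and can be compared against the threshold at the same O(1) cost per point.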
\n\n1 If the threshold value is not midway between the poles, we can still save a factor of k, but we need to compute (d_AC^2 - d_BC^2)/2d_AB instead of just the sign of this expression. \n\n2 Since we only consider splits in which the poles are in different classes, this number ignores a factor that is independent of N. \n\n3 Building the decision tree \n\nWe want to compare the \"pole-pair\" method of generating candidate hyperplanes with the standard axis-aligned method and the method that uses unrestricted hyperplanes. We can see no reason to expect strong interactions between the method of building the tree and the method of generating the candidate hyperplanes, but to minimize confounding effects we always use exactly the same method of building the decision tree. \n\nWe faithfully followed the method described in [1], except for a small modification where the code that was kindly supplied by Leo Breiman used a slightly different method for determining the amount of pruning. \n\nTraining a decision tree involves two distinct stages. In the first stage, nodes are repeatedly split until each terminal node is \"pure\", which means that all of its data-points belong to the same class. The pure tree therefore fits the training data perfectly. A node is split by considering all candidate decision planes and choosing the one that maximizes the decrease in impurity. Breiman et al. recommend using the Gini index to measure impurity.3 If p(j|t) is the probability of class j at node t, then the Gini index is 1 - sum_j p^2(j|t). \n\nClearly the tree obtained at the end of the first stage will overfit the data and so in the second stage the tree is pruned by recombining nodes. For a tree, T_i, with |T_i| terminal nodes we consider the regularized cost: \n\nR_α(T_i) = E(T_i) + α|T_i|   (1) \n\nwhere E is the classification error and α is a pruning parameter. 
In \"weakest-link\" \npruning the terminal nodes are eliminated in the order which keeps (1) minimal as \nQ increases. This leads to a particular sequence, T = {TI' T2, ... Tk} of subtrees, \nin which ITII > IT21 ... > ITkl. We call this the \"main\" sequence of subtrees because \nthey are trained on all of the training data. \n\nThe last remaining issue to be resolved is which tree in the main sequence to use. \nThe simplest method is to use a separate validation set and choose the tree size \nthat gives best classification on it. Unfortunately, many of the datasets we used \nwere too small to hold back a reserved validation set. So we always used 10-fold \ncross validation to pick the size of the tree. We first grew 10 different subsidiary \ntrees until their terminal nodes were pure, using 9/10 of the data for training each of \nthem. Then we pruned back each of these pure subsidiary trees, as above, producing \n10 sequences of subsidiary subtrees. These subsidiary sequences could then be used \nfor estimating the performance of each subtree in the main sequence. For each of \nthe main subtrees, Ti , we found the largest tree in each subsidiary sequence that \nwas no larger than Ti and estimated the performance of Ti to be the average of the \nperformance achieved by each subsidiary subtree on the 1/10 of the data that was \nnot used for training that subsidiary tree. We then chose the Ti that achieved the \nbest performance estimate and used it on the test set4. Results are expressed as \n\n3Impurity is not an information measure but, like an information measure, it is mini(cid:173)\n\nmized when all the nodes are pure and maximized when all classes at each node have equal \nprobability. 
\n\n4 This differs from the conventional application of cross validation, where it is used to \n\nDataset          IR   TR   LV   DB   BC   GL   VW   WN   VH    WV   IS   SN \nSize (N)        150  215  345  768  683  163  990  178  846  2100  351  208 \nClasses (c)       3    3    2    2    2    2   11    3    4     3    2    2 \nAttributes (k)    4    5    6    8    9    9   10   13   18    21   34   60 \n\nTable 1: Summary of the datasets used. \n\nthe ratio of the test error rate to the baseline rate, which is the error rate of a tree with only a single terminal node. \n\n4 The Datasets \n\nEleven datasets were selected from the database of machine learning tasks maintained by the University of California at Irvine (see the appendix for a list of the datasets used). Except as noted in the appendix, the datasets were used exactly in the form of the distribution as of June 1993. All datasets have only continuous attributes and there are no missing values.5 The synthetic \"waves\" example [1] was added as a twelfth dataset. \n\nTable 1 gives a brief description of the datasets. Datasets are identified by a two letter abbreviation along the top. The rows in the table give the total number of instances, number of classes and number of attributes for each dataset. \n\nA few datasets in the original distribution have designated training and testing subsets while others do not. To ensure regularity among datasets, we pooled all usable examples in a given dataset, randomized the order in the pool and then divided the pool into training and testing sets. Two divisions were considered. The large training division had ~ of the pooled examples allocated to the training set and ~ to the test set. The small training division had ~ of the data in the training set and ~ in the test set. 
\n\n5 Results \n\nTable 2 gives the error rates for both the large and small divisions of the data, expressed as a percentage of the error rate obtained by guessing the dominant class. \n\nIn both the small and large training divisions of the datasets, the pole-pair method had lower error rates than axis-aligned or linear CART in the majority of datasets tested. While these results are interesting, they do not provide any measure of confidence that one method performs better or worse than another. Since all methods were trained and tested on the same data, we can perform a two-tailed McNemar test [2] on the predictions for pairs of methods. The resulting P-values are given in table 3. On most of the tasks, the pole-pair method is significantly better than at least one of the standard methods for at least one of the training set sizes, and there are only 2 tasks for which either of the other methods is significantly better on either training set size. \n\ndetermine the best value of α rather than the tree size. \n\n5 In the BC dataset we removed the case identification number attribute and had to delete 16 cases with missing values. \n\nDatabase      Small Train             Large Train \n             cart  linear   pole     cart  linear   pole \nIR           14.3    14.3    4.3      5.6     5.6    5.6 \nTR           36.6    26.8   14.6     33.3    33.3   20.8 \nLV           88.9   100.0  100.0    108.7    87.0   97.8 \nDB           85.8    82.2   87.0     69.7    69.7   59.6 \nBC           12.8    14.1    8.3     15.7    12.0    9.6 \nGL           62.5    81.3   89.6     46.4    46.4   35.7 \nVW           31.8    37.7   30.0     21.4    26.2   19.2 \nWN           17.8    13.7   11.0     14.7    11.8   14.7 \nVH           42.5    46.5   44.2     36.2    43.9   40.7 \nWV           28.9    25.8   24.3     30.6    24.8   26.6 \nIS           44.0    31.0   41.7     21.4    23.8   42.9 \nSN           65.2    71.2   48.5     48.4    45.2   48.4 \n\nTable 2: Relative error rates expressed as a percentage of the baseline rate on the small and large training sets. \n\n6 Discussion \n\nWe only considered hyperplanes whose poles were in different classes, since these seemed more plausible candidates. An alternative strategy is to disregard class membership, and consider all possible pole-pairs. Another variant of the method arises depending on whether the inputs are scaled. We transformed all inputs so that the training data has zero mean and unit variance. However, using unscaled inputs and/or allowing both poles to have the same class makes little difference to the overall advantage of the pole-pair method. \n\nTo summarize, we have demonstrated that the pole-pair method is a simple, effective method for generating projection directions at binary tree nodes. The same idea of minimizing complexity by selecting among a sensible fixed set of possibilities rather than searching a continuous space can also be applied to the choice of input-to-hidden weights in a neural network. \n\nA Databases used in the study \n\nIR - Iris plant database. \nTR - Thyroid gland data. \nLV - BUPA liver disorders. \nDB - Pima Indians Diabetes. \nBC - Breast cancer database from the University of Wisconsin Hospitals. 
\nGL - Glass identification database. In these experiments we only considered the classification into float/nonfloat processed glass, ignoring other types of glass. \nVW - Vowel recognition. \nWN - Wine recognition. \nVH - Vehicle silhouettes. \nWV - Waveform example, the synthetic example from [1]. \nIS - Johns Hopkins University Ionosphere database. \nSN - Sonar - mines versus rocks discrimination. We did not control for aspect-angle. \n\n[Table 3: two panels (Small Training - Large Test; Large Training - Small Test), each giving P-values for the comparisons Axis-Pole, Linear-Pole and Axis-Linear on the 12 datasets; the individual entries are not legibly recoverable from this copy.] \n\nTable 3: P-values using a two-tailed McNemar test on the small (top) and large (bottom) training sets. Each row gives P-values when the methods in the leftmost column are compared. A significant difference at the P = 0.05 level is indicated with a line above (below) the P-value depending on whether the first (second) mentioned method in the first column had superior performance. For example, in the topmost row, the pole-pair method was significantly better than the axis-aligned method on the TR dataset. 
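The McNemar test used for Table 3 compares two classifiers through their discordant test cases only. The exact binomial form below is one standard way to obtain a two-tailed P-value; Fleiss [2] presents a chi-square version with continuity correction, so this is a sketch of the idea rather than the authors' exact procedure, and the function name and data are ours.

```python
from math import comb

def mcnemar_p_exact(correct_a, correct_b):
    '''Exact two-tailed McNemar test on paired classifier predictions.

    correct_a, correct_b: booleans saying whether each method classified
    each test case correctly. Only discordant pairs (exactly one method
    right) are informative; under the null hypothesis they are equally
    likely to favour either method, so their split is Binomial(n, 1/2).
    '''
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Method A right on 8 discordant cases, method B right on 2, plus 5 ties.
a = [True] * 8 + [False] * 2 + [True] * 5
b = [False] * 8 + [True] * 2 + [True] * 5
print(mcnemar_p_exact(a, b))  # prints 0.109375: not significant at P = 0.05
```

Cases both methods get right (or both wrong) drop out of the test entirely, which is what makes it appropriate when all methods are evaluated on the same test set.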
\n\nAcknowledgments \n\nWe thank Leo Breiman for kindly making his CART code available to us. This research was funded by the Institute for Robotics and Intelligent Systems and by NSERC. Hinton is a fellow of the Canadian Institute for Advanced Research. \n\nReferences \n\n[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, California, 1984. \n\n[2] J. L. Fleiss. Statistical Methods for Rates and Proportions. Second edition. Wiley, 1981. \n", "award": [], "sourceid": 1171, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Michael", "family_name": "Revow", "institution": null}]}