{"title": "Boosting Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 479, "page_last": 485, "abstract": null, "full_text": "From Isolation to Cooperation: \n\nAn Alternative View of a System of Experts \n\nStefan Schaal:!:* \n\nsschaal@cc.gatech.edu \n\nChristopher C. Atkeson:!: \n\ncga@cc.gatech.edu \n\nhttp://www.cc.gatech.eduifac/Stefan.Schaal \n\nhttp://www.cc.gatech.eduifac/Chris.Atkeson \n\n+College of Computing, Georgia Tech, 801 Atlantic Drive, Atlanta, GA 30332-0280 \n\n* A TR Human Infonnation Processing, 2-2 Hikaridai, Seiko-cho, Soraku-gun, 619-02 Kyoto \n\nAbstract \n\nWe introduce a constructive, incremental learning system for regression \nproblems that models data by means of locally linear experts. In contrast \nto other approaches, the experts are trained independently and do not \ncompete for data during learning. Only when a prediction for a query is \nrequired do the experts cooperate by blending their individual predic(cid:173)\ntions. Each expert is trained by minimizing a penalized local cross vali(cid:173)\ndation error using second order methods. In this way, an expert is able to \nfind a local distance metric by adjusting the size and shape of the recep(cid:173)\ntive field in which its predictions are valid, and also to detect relevant in(cid:173)\nput features by adjusting its bias on the importance of individual input \ndimensions. We derive asymptotic results for our method. In a variety of \nsimulations the properties of the algorithm are demonstrated with respect \nto interference, learning speed, prediction accuracy, feature detection, \nand task oriented incremental learning. \n\n1. INTRODUCTION \nDistributing a learning task among a set of experts has become a popular method in compu(cid:173)\ntationallearning. One approach is to employ several experts, each with a global domain of \nexpertise (e.g., Wolpert, 1990). 
When an output for a given input is to be predicted, every expert gives a prediction together with a confidence measure. The individual predictions are combined into a single result, for instance, based on a confidence weighted average. Another approach to employing experts, the one pursued in this paper, is to create experts with local domains of expertise. In contrast to the global experts, the local experts have little overlap or no overlap at all. To assign a local domain of expertise to each expert, it is necessary to learn an expert selection system in addition to the experts themselves. This classifier determines which expert models are used in which part of the input space. For incremental learning, competitive learning methods are usually applied. Here the experts compete for data such that they change their domains of expertise until a stable configuration is achieved (e.g., Jacobs, Jordan, Nowlan, & Hinton, 1991). The advantage of local experts is that they can have simple parameterizations, such as locally constant or locally linear models. This offers benefits in terms of analyzability, learning speed, and robustness (e.g., Jordan & Jacobs, 1994). For simple experts, however, a large number of experts is necessary to model a function. As a result, the expert selection system has to be more complicated and, thus, has a higher risk of getting stuck in local minima and/or of learning rather slowly. In incremental learning, another potential danger arises when the input distribution of the data changes. The expert selection system usually makes either implicit or explicit prior assumptions about the input data distribution. For example, in the classical mixture model (McLachlan & Basford, 1988), which was employed in several local expert approaches, the prior probabilities of each mixture component can be interpreted as the fraction of data points each expert expects to experience. Therefore, a change in input distribution will cause all experts to change their domains of expertise in order to fulfill these prior assumptions. This can lead to catastrophic interference.

In order to avoid these problems and to cope with the interference problems during incremental learning due to changes in input distribution, we suggest eliminating the competition among experts and instead isolating them during learning. Whenever some new data is experienced which is not accounted for by one of the current experts, a new expert is created. Since the experts do not compete for data with their peers, there is no reason for them to change the location of their domains of expertise. However, when it comes to making a prediction at a query point, all the experts cooperate by giving a prediction of the output together with a confidence measure. A blending of the predictions of all experts results in the final prediction. It should be noted that these local experts combine properties of both the global and local experts mentioned previously. They act like global experts by learning independently of each other and by blending their predictions, but they act like local experts by confining themselves to a local domain of expertise, i.e., their confidence measures are large only in a local region.

The topic of data fitting with structurally simple local models (or experts) has received a great deal of attention in nonparametric statistics (e.g., Nadaraya, 1964; Cleveland, 1979; Scott, 1992; Hastie & Tibshirani, 1990).
In this paper, we will demonstrate how a nonparametric approach can be applied to obtain the isolated expert network (Section 2.1), how its asymptotic properties can be analyzed (Section 2.2), and what characteristics such a learning system possesses in terms of the avoidance of interference, feature detection, dimensionality reduction, and incremental learning of motor control tasks (Section 3).

2. RECEPTIVE FIELD WEIGHTED REGRESSION

This paper focuses on regression problems, i.e., the learning of a map from R^n to R^m. Each expert in our learning method, Receptive Field Weighted Regression (RFWR), consists of two elements, a locally linear model to represent the local functional relationship, and a receptive field which determines the region in input space in which the expert's knowledge is valid. As a result, a given data set will be modeled by piecewise linear elements, blended together. For 1000 noisy data points drawn from the unit interval of the function z = max[exp(-10x^2), exp(-50y^2), 1.25 exp(-5(x^2 + y^2))], Figure 1 illustrates an example of function fitting with RFWR. This function consists of a narrow and a wide ridge which are perpendicular to each other, and a Gaussian bump at the origin. Figure 1b shows the receptive fields which the system created during the learning process. Each expert's location is at the center of its receptive field, marked by a small circle in Figure 1b.

Figure 1: (a) result of function approximation with RFWR; (b) contour lines of 0.1 iso-activation of each expert in input space (the experts' centers are marked by small circles).

The receptive fields are modeled by Gaussian functions, and their 0.1 iso-activation lines are shown in Figure 1b as well. As can be seen, each expert focuses on a certain region of the input space, and the shape and orientation of this region reflects the function's complexity, or more precisely, the function's curvature, in this region. It should be noticed that there is a certain amount of overlap among the experts, and that the placement of experts occurred on a greedy basis during learning and is not globally optimal. The approximation result (Figure 1a) is a faithful reconstruction of the real function (MSE = 0.0025 on a test set, 30 epochs of training, about 1 minute of computation on a SPARC10). As a baseline comparison, a similar result with a sigmoidal 3-layer neural network required about 100 hidden units and 10000 epochs of annealed standard backpropagation (about 4 hours on a SPARC10).

2.1 THE ALGORITHM

RFWR can be sketched in network form as shown in Figure 2. All inputs connect to all expert networks, and new experts can be added as needed. Each expert is an independent entity. It consists of a two layer linear subnet and a receptive field subnet. The receptive field subnet has a single unit with a bell-shaped activation profile, centered at the fixed location c in input space. The maximal output of this unit is "1" at the center, and it decays to zero as a function of the distance from the center. For analytical convenience, we choose this unit to be Gaussian:

w = exp(-(1/2) (x - c)^T D (x - c)),  D = M^T M    (1)

Figure 2: The RFWR network.

x is the input vector, and D the distance metric, a positive definite matrix that is generated from the upper triangular matrix M.
The output of the linear subnet is:

y_hat = x^T b + b_0 = x_tilde^T beta    (2)

The connection strengths b of the linear subnet and its bias b_0 will be denoted by the d-dimensional vector beta from now on, and the tilde sign will indicate that a vector has been augmented by a constant "1", e.g., x_tilde = (x^T, 1)^T. In generating the total output, the receptive field units act as a gating component on the output, such that the total prediction is:

y_hat = (sum_k w_k y_hat_k) / (sum_k w_k)    (3)

The parameters beta and M are the primary quantities which have to be adjusted in the learning process: beta forms the locally linear model, while M determines the shape and orientation of the receptive fields. Learning is achieved by incrementally minimizing the cost function:

J = (1 / sum_i w_i) sum_i w_i (y_i - y_hat_{i,-i})^2 + gamma sum_{n,m} D_{nm}^2    (4)

The first term of this function is the weighted mean squared cross validation error over all experienced data points, a local cross validation measure (Schaal & Atkeson, 1994); y_hat_{i,-i} denotes the prediction for the i-th point when that point is excluded from the fit. The second term is a regularization or penalty term. Local cross validation by itself is consistent, i.e., with an increasing amount of data, the size of the receptive field of an expert would shrink to zero. This would require the creation of an ever increasing number of experts during the course of learning. The penalty term introduces some non-vanishing bias in each expert such that its receptive field size does not shrink to zero. By penalizing the squared coefficients of D, we are essentially penalizing the second derivatives of the function at the site of the expert. This is similar to the approaches taken in spline fitting (de Boor, 1978) and acts as a low-pass filter: the higher the second derivatives, the more smoothing (and thus bias) will be introduced. This will be analyzed further in Section 2.2.
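Equations (2) and (3) can be illustrated with a small sketch; the dictionary-based expert records and the constant activations below are purely hypothetical stand-ins for trained experts:

```python
import numpy as np

def expert_output(x, beta):
    """Locally linear model of Eq. (2): y = x~^T beta, x~ = (x^T, 1)^T."""
    return np.append(x, 1.0) @ beta

def blended_prediction(x, experts):
    """Total prediction of Eq. (3): receptive-field-weighted average of
    the individual experts' outputs."""
    w = np.array([e["activation"](x) for e in experts])
    y = np.array([expert_output(x, e["beta"]) for e in experts])
    return float(w @ y / w.sum())

# Two hypothetical experts with equal activation at x: the blend is the mean.
experts = [
    {"activation": lambda x: 0.5, "beta": np.array([2.0, 1.0])},  # y = 2x + 1
    {"activation": lambda x: 0.5, "beta": np.array([2.0, 3.0])},  # y = 2x + 3
]
print(blended_prediction(np.array([1.0]), experts))   # 4.0
```

In RFWR the activations would of course come from Equation (1), so an expert dominates the blend only near its own center.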
The update equations for the linear subnet are the standard weighted recursive least squares equations with forgetting factor lambda (Ljung & Söderström, 1986):

beta^{n+1} = beta^n + w P^{n+1} x_tilde e_cv,
P^{n+1} = (1/lambda) (P^n - (P^n x_tilde x_tilde^T P^n) / (lambda/w + x_tilde^T P^n x_tilde)),
e_cv = y - x_tilde^T beta^n    (5)

This is a Newton method, and it requires maintaining the matrix P which, being symmetric, requires storing 0.5 d (d + 1) elements. The update of the receptive field subnet is a gradient descent in J:

M^{n+1} = M^n - alpha dJ/dM    (6)

Due to space limitations, the derivation of the derivative in (6) will not be explained here. The major ingredient is to take this derivative as in a batch update, and then to reformulate the result as an iterative scheme. The derivatives in batch mode can be calculated exactly due to the Sherman-Morrison-Woodbury theorem (Belsley, Kuh, & Welsch, 1980; Atkeson, 1992). The derivative for the incremental update is a very good approximation to the batch update and realizes incremental local cross validation.

A new expert is initialized with a default M_def and all other variables set to zero, except the matrix P. P is initialized as a diagonal matrix with elements 1/r_i^2, where the r_i are usually small quantities, e.g., 0.01. The r_i are ridge regression parameters. From a probabilistic view, they are Bayesian priors that the beta vector is the zero vector. From an algorithmic view, they are fake data points of the form [x = (0, ..., r_i, ..., 0)^T, y = 0] (Atkeson, Moore, & Schaal, submitted). Using the update rule (5), the influence of the ridge regression parameters would fade away due to the forgetting factor lambda. However, it is useful to make the ridge regression parameters adjustable. As in (6), r_i can be updated by gradient descent:

r_i^{n+1} = r_i^n - alpha dJ/dr_i    (7)

There are d ridge regression parameters, one for each diagonal element of the P matrix.
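A hedged sketch of the recursive least squares update of Equation (5); the ridge value and the demonstration data are assumed for illustration only:

```python
import numpy as np

def rls_update(beta, P, x_tilde, y, w, lam=1.0):
    """One step of weighted recursive least squares with forgetting
    factor lambda (Equation 5); x_tilde is the input augmented by 1."""
    e_cv = y - x_tilde @ beta                      # prediction error
    Px = P @ x_tilde
    P = (P - np.outer(Px, Px) / (lam / w + x_tilde @ Px)) / lam
    beta = beta + w * (P @ x_tilde) * e_cv
    return beta, P

# Fitting y = 2x + 1 from a stream of samples; the ridge prior r_i = 1e-3
# means P starts as (1/r_i^2) I = 1e6 I.
beta, P = np.zeros(2), np.eye(2) * 1e6
for i in range(100):
    x = float(i % 10)
    beta, P = rls_update(beta, P, np.array([x, 1.0]), 2.0 * x + 1.0, w=1.0)
print(np.round(beta, 3))   # close to [2. 1.]
```

With lambda below 1 older samples are gradually forgotten, which is what lets the ridge priors fade away as the text notes.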
In order to add in the update of the ridge parameters as well as to compensate for the forgetting factor, an iterative procedure based on (5) can be devised which we omit here. The computational complexity of this update is much reduced in comparison to (5) since many computations involve multiplications by zero.

Initialize the RFWR network with no expert;
For every new training sample (x, y):
  a) For k = 1 to #experts:
       - calculate the activation from (1)
       - update the expert's parameters according to (5), (6), and (7)
     end;
  b) If no expert was activated by more than w_gen:
       - create a new expert with c = x
     end;
  c) If two experts are activated more than w_prune:
       - erase the expert with the smaller receptive field
     end;
  d) calculate the mean err_mean and standard deviation err_std of the incrementally accumulated error err_k of all experts;
  e) For k = 1 to #experts:
       If (|err_k - err_mean| > theta * err_std) reinitialize expert k with M = 2 M_def
     end;
end;

In sum, a RFWR expert consists of three sets of parameters, one for the locally linear model, one for the size and shape of the receptive fields, and one for the bias. The linear model parameters are updated by a Newton method, while the other parameters are updated by gradient descent. In our implementations, we actually use second order gradient descent based on Sutton (1992), since, with minor extra effort, we can obtain estimates of the second derivatives of the cost function with respect to all parameters. Finally, the logic of RFWR becomes as shown in the pseudo-code above. Points c) and e) of the algorithm introduce a pruning facility. Pruning takes place either when two experts overlap too much, or when an expert has an exceptionally large mean squared error. The latter method corresponds to a simple form of outlier detection.
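The structural part of this loop, steps b) and c), might be sketched as follows; the threshold values and the dictionary representation of an expert are assumptions for illustration, and the parameter updates of Equations (5)-(7) are left out:

```python
import numpy as np

W_GEN, W_PRUNE = 0.1, 0.9      # thresholds as in Section 3 (assumed here)

def activation(x, expert):
    """Receptive field activation of Equation (1)."""
    d = x - expert["c"]
    return float(np.exp(-0.5 * d @ expert["D"] @ d))

def allocate_and_prune(experts, x, D_def):
    """Steps b) and c) of the loop: create an expert when no receptive
    field covers x, prune when two overlap too strongly.  The parameter
    updates of Equations (5)-(7) are omitted in this sketch."""
    acts = [activation(x, e) for e in experts]
    if not any(a > W_GEN for a in acts):
        experts.append({"c": x.copy(), "D": D_def.copy()})
        return
    hot = [e for e, a in zip(experts, acts) if a > W_PRUNE]
    if len(hot) >= 2:
        # erase the expert with the smaller receptive field, i.e. the
        # larger determinant of its distance metric D
        experts.remove(max(hot, key=lambda e: np.linalg.det(e["D"])))

experts = []
D_def = np.eye(2) * 50.0       # default metric used in Section 3.1
allocate_and_prune(experts, np.zeros(2), D_def)
allocate_and_prune(experts, np.ones(2), D_def)
print(len(experts))            # 2: the second point lies outside the first field
```

Because experts are only ever added where no receptive field responds, existing experts never have to move, which is the source of RFWR's robustness to input distribution changes.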
Local optimization of a distance metric always has a minimum for a very large receptive field size. In our case, this would mean that an expert favors global instead of locally linear regression. Such an expert will accumulate a very large error which can easily be detected in the given way. The mean squared error term, err, on which this outlier detection is based, is a bias-corrected mean squared error, as will be explained below.

2.2 ASYMPTOTIC BIAS AND PENALTY SELECTION

The penalty term in the cost function (4) introduces bias. In order to assess the asymptotic value of this bias, the real function f(x), which is to be learned, is assumed to be represented as a Taylor series expansion at the center of an expert's receptive field. Without loss of generality, the center is assumed to be at the origin in input space. We furthermore assume that the size and shape of the receptive field are such that terms higher than O(2) are negligible. Thus, the cost (4) can be written as:

J ≈ (∫ w (f_0 + f^T x + (1/2) x^T F x - b_0 - b^T x)^2 dx) / (∫ w dx) + gamma sum_{n,m} D_{nm}^2    (8)

where f_0, f, and F denote the constant, linear, and quadratic terms of the Taylor series expansion, respectively. Inserting Equation (1), the integrals can be solved analytically after the input space is rotated by an orthonormal matrix transforming F to the diagonal matrix F'.
Subsequently, b_0, b, and D can be determined such that J is minimized:

b_0' = f_0 + bias = f_0 + (1/2)(2 gamma)^{0.25} sum_n sgn(F'_nn) sqrt(|F'_nn|),  b' = f,  D'_nn = (F'_nn^2 / (2 gamma))^{0.25}    (9)

This states that the linear model will asymptotically acquire the correct locally linear model, while the constant term will have a bias proportional to the sum of the square roots of the eigenvalues of F, i.e., the F'_nn. The distance metric D, whose diagonalized counterpart is D', will be a scaled image of the Hessian F with an additional square root distortion. Thus, the penalty term accomplishes the intended task: it introduces more smoothing the higher the curvature at an expert's location is, and it prevents the receptive field of an expert shrinking to zero size (which would obviously happen for gamma -> 0). Additionally, Equation (9) shows how to determine gamma for a given learning problem from an estimate of the eigenvalues and a permissible bias. Finally, it is possible to derive estimates of the bias and the mean squared error of each expert from the current distance metric D:

bias_est = sqrt(0.5 gamma) sum_n |eigenvalues(D)_n|;  err_est = gamma sum_{n,m} D_{nm}^2    (10)

The latter term was incorporated in the mean squared error, err, in Section 2.1. Empirical evaluations (not shown here) verified the validity of these asymptotic results.

3. SIMULATION RESULTS

This section will demonstrate some of the properties of RFWR. In all simulations, the threshold parameters of the algorithm were set to theta = 3.5, w_prune = 0.9, and w_gen = 0.1. These quantities determine the overlap of the experts as well as the outlier removal threshold; the results below are not affected by moderate changes in these parameters.
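In the form reconstructed here, Equations (9) and (10) can be checked numerically: the bias expressed through the curvatures F'_nn must equal the bias expressed through the optimal distance metric D alone. The curvature values and gamma below are arbitrary illustrative choices:

```python
import math

gamma = 1e-7                  # penalty of the kind chosen in Section 3.1
F = [4.0, 25.0]               # assumed local curvatures (eigenvalues F'_nn)

# Optimal diagonal distance metric from Equation (9):
D = [(f * f / (2 * gamma)) ** 0.25 for f in F]

# Bias of the constant term via the curvatures (Equation 9) ...
bias = 0.5 * (2 * gamma) ** 0.25 * sum(math.sqrt(f) for f in F)

# ... and via the distance metric alone (Equation 10):
bias_est = math.sqrt(0.5 * gamma) * sum(D)

print(math.isclose(bias, bias_est))   # True: the two forms agree
```

The agreement follows because sqrt(|F'_nn|) = (2 gamma)^{0.25} D'_nn at the optimum, so the two coefficients cancel exactly.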
3.1 AVOIDING INTERFERENCE

In order to test RFWR's sensitivity with respect to changes in input data distribution, the data of the example of Figure 1 was partitioned into three separate training sets T_1 = {(x, y, z) | -1.0 < x < -0.2}, T_2 = {(x, y, z) | -0.4 < x < 0.4}, T_3 = {(x, y, z) | 0.2 < x < 1.0}. These data sets correspond to three overlapping stripes of data, each having about 400 uniformly distributed samples. From scratch, a RFWR network was trained first on T_1 for 20 epochs, then on T_2 for 20 epochs, and finally on T_3 for 20 epochs. The penalty was chosen as in the example of Figure 1 to be gamma = 1e-7, which corresponds to an asymptotic bias of 0.1 at the sharp ridge of the function. The default distance metric D was 50*I, where I is the identity matrix. Figure 3 shows the results of this experiment. Very little interference can be found. The MSE on the test set increased from 0.0025 (of the original experiment of Figure 1) to 0.003, which is still an excellent reconstruction of the real function.

Figure 3: Reconstructed function after training on (a) T_1, (b) then T_2, (c) and finally T_3.

3.2 LOCAL FEATURE DETECTION

The examples of RFWR given so far did not require ridge regression parameters. Their importance, however, becomes obvious when dealing with locally rank deficient data or with irrelevant input dimensions. A learning system should be able to recognize irrelevant input dimensions. It is important to note that this cannot be accomplished by a distance metric. The distance metric is only able to decide to what spatial extent averaging over data in a certain dimension should be performed. However, the distance metric has no means to exclude an input dimension. In contrast, bias learning with ridge regression parameters is able to exclude input dimensions.
To demonstrate this, we added 8 purely noisy inputs (N(0, 0.3)) to the data drawn from the function of Figure 1. After 30 epochs of training on a 10000 data point training set, we analyzed histograms of the order of magnitude of the ridge regression parameters in all 10 input dimensions over all the 79 experts that had been generated by the learning algorithm. All experts recognized that the input dimensions 3 to 10 did not contain relevant information, and correctly increased the corresponding ridge parameters to large values. The effect of a large ridge regression parameter is that the associated regression coefficient becomes zero. In contrast, the ridge parameters of the inputs 1, 2, and the bias input remained very small. The MSE on the test set was 0.0026, basically identical to the experiment with the original training set.

3.3 LEARNING AN INVERSE DYNAMICS MODEL OF A ROBOT ARM

Robot learning is one of the domains where incremental learning plays an important role. A real movement system experiences data at a high rate, and it should incorporate this data immediately to improve its performance. As learning is task oriented, input distributions will also be task oriented and interference problems can easily arise. Additionally, a real movement system does not sample data from a training set but rather has to move in order to receive new data. Thus, training data is always temporally correlated, and learning must be able to cope with this. An example of such a learning task is given in Figure 4, where a simulated 2 DOF robot arm has to learn to draw the figure "8" in two different regions of the work space at a moderate speed (1.5 sec duration). In this example, we assume that the correct movement plan exists, but that the inverse dynamics model which is to be used to control this movement has not been acquired.
The robot is first trained for 10 minutes (real movement time) in the region of the lower target trajectory where it performs a variety of rhythmic movements under simple PID control. The initial performance of this controller is shown in the bottom part of Figure 4a. This training enables the robot to learn the locally appropriate inverse dynamics model, an R^6 -> R^2 continuous mapping. Subsequent performance using this inverse model for control is depicted in the bottom part of Figure 4b. Afterwards, the same training takes place in the region of the upper target trajectory in order to acquire the inverse model in this part of the world. The figure "8" can then equally well be drawn there (upper part of Figure 4a,b). Switching back to the bottom part of the work space (Figure 4c), the first task can still be performed as before. No interference is recognizable. Thus, the robot could learn fast and reliably to fulfill the two tasks. It is important to note that the data generated by the training movements did not always have locally full rank. All the parameters of RFWR were necessary to acquire the local inverse model appropriately. A total of 39 locally linear experts were generated.

Figure 4: Learning to draw the figure "8" with a 2-joint arm: (a) Performance of a PID controller before learning (the dimmed lines denote the desired trajectories, the solid lines the actual performance); (b) Performance after learning using a PD controller with feedforward commands from the learned inverse model; (c) Performance of the learned controller after training on the upper "8" of (b) (see text for more explanations).

4. DISCUSSION

We have introduced an incremental learning algorithm, RFWR, which constructs a network of isolated experts for supervised learning of regression tasks. Each expert determines a locally linear model, a local distance metric, and local bias parameters by incrementally minimizing a penalized local cross validation error. Our algorithm differs from other local learning techniques by entirely avoiding competition among the experts, and by being based on nonparametric instead of parametric statistics. The resulting properties of RFWR are a) avoidance of interference in the case of changing input distributions, b) fast incremental learning by means of Newton and second order gradient descent methods, c) analyzable asymptotic properties which facilitate the selection of the fit parameters, and d) local feature detection and dimensionality reduction. The isolated experts are also ideally suited for parallel implementations. Future work will investigate computationally less costly delta-rule implementations of RFWR, and how well RFWR scales in higher dimensions.

5. REFERENCES

Atkeson, C. G., Moore, A. W., & Schaal, S. (submitted). "Locally weighted learning." Artificial Intelligence Review.
Atkeson, C. G. (1992). "Memory-based approaches to approximating continuous functions." In: Casdagli, M., & Eubank, S. (Eds.), Nonlinear Modeling and Forecasting, pp. 503-521. Addison Wesley.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Cleveland, W. S. (1979). "Robust locally weighted regression and smoothing scatterplots." J. American Stat. Association, 74, pp. 829-836.
de Boor, C. (1978).
A practical guide to splines. New York: Springer.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. London: Chapman and Hall.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). "Adaptive mixtures of local experts." Neural Computation, 3, pp. 79-87.
Jordan, M. I., & Jacobs, R. (1994). "Hierarchical mixtures of experts and the EM algorithm." Neural Computation, 6, pp. 181-214.
Ljung, L., & Söderström, T. (1986). Theory and practice of recursive identification. Cambridge: MIT Press.
McLachlan, G. J., & Basford, K. E. (1988). Mixture models. New York: Marcel Dekker.
Nadaraya, E. A. (1964). "On estimating regression." Theor. Prob. Appl., 9, pp. 141-142.
Schaal, S., & Atkeson, C. G. (1994). "Assessing the quality of learned local models." In: Cowan, J. D., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6. Morgan Kaufmann.
Scott, D. W. (1992). Multivariate Density Estimation. New York: Wiley.
Sutton, R. S. (1992). "Gain adaptation beats least squares." In: Proc. of 7th Yale Workshop on Adaptive and Learning Systems, New Haven, CT.
Wolpert, D. H. (1990). "Stacked generalization." Los Alamos Technical Report LA-UR-90-3460.

Boosting Decision Trees

Harris Drucker, AT&T Bell Laboratories, Holmdel, New Jersey 07733
Corinna Cortes, AT&T Bell Laboratories, Murray Hill, New Jersey 07974

Abstract

A new boosting algorithm of Freund and Schapire is used to improve the performance of decision trees which are constructed using the information ratio criterion of Quinlan's C4.5 algorithm. This boosting algorithm iteratively constructs a series of decision trees, each decision tree being trained and pruned on examples that have been filtered by previously trained trees.
Examples that have been incorrectly classified by the previous trees in the ensemble are resampled with higher probability to give a new probability distribution for the next tree in the ensemble to train on. Results from optical character recognition (OCR), and knowledge discovery and data mining problems show that in comparison to single trees, or to trees trained independently, or to trees trained on subsets of the feature space, the boosting ensemble is much better.

1 INTRODUCTION

A new boosting algorithm termed AdaBoost by its inventors (Freund and Schapire, 1995) has advantages over the original boosting algorithm (Schapire, 1990) and a second version (Freund, 1990). The implication of a boosting algorithm is that one can take a series of learning machines (termed weak learners), each having a poor error rate (but no worse than 0.5 - gamma, where gamma is some small positive number), and combine them to give an ensemble that has very good performance (termed a strong learner). The first practical implementation of boosting was in OCR (Drucker, 1993, 1994) using neural networks as the weak learners. In a series of comparisons (Bottou, 1994) boosting was shown to be superior to other techniques on a large OCR problem.

The general configuration of AdaBoost is shown in Figure 1. Each box is a decision tree built using Quinlan's C4.5 algorithm (Quinlan, 1993). The key idea is that each weak learner is trained sequentially. The first weak learner is trained on a set of patterns picked randomly (with replacement) from a training set. After training and pruning, the training patterns are passed through this first decision tree. In the two class case the hypothesis h_1 is either class 0 or class 1. Some of the patterns will be in error.

FIGURE 1. BOOSTING ENSEMBLE (weak learners #1, ..., #T with hypotheses h_1, ..., h_T and weights beta_1, ..., beta_T, combined by sum_t h_t log(1/beta_t))

FIGURE 2. INDIVIDUAL WEAK LEARNER ERROR RATE AND ENSEMBLE TRAINING AND TEST ERROR RATES

The training set for the second weak learner will consist of patterns picked from the training set with higher probability assigned to those patterns the first weak learner classifies incorrectly. Since patterns are picked with replacement, difficult patterns are more likely to occur multiple times in the training set. Thus as we proceed to build each member of the ensemble, patterns which are more difficult to classify correctly appear more and more often. The training error rate of an individual weak learner tends to grow as we increase the number of weak learners because each weak learner is asked to classify progressively more difficult patterns. However the boosting algorithm shows us that the ensemble training and test error rates decrease as we increase the number of weak learners. The ensemble output is determined by weighting the hypotheses with the log of (1/beta_t), where beta_t increases with the weak learner's error rate. If the weak learner has good error rate performance, it will contribute significantly to the output, because then 1/beta_t will be large.

Figure 2 shows the general shape of the curves we would expect. Say we have constructed N weak learners, where N is a large number (right hand side of the graph). The N'th weak learner (top curve) will have a training error rate that approaches .5 because it is trained on difficult patterns and can do only slightly better than guessing. The bottom two curves show the test and training error rates of the ensemble using all N weak learners, which decrease as weak learners are added to the ensemble.
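The ensemble vote described above can be sketched as follows, using the standard AdaBoost decision rule (output class 1 when the weighted vote sum_t h_t log(1/beta_t) exceeds half of the total weight sum_t log(1/beta_t)); the specific beta values are illustrative:

```python
import math

def ensemble_output(hypotheses, betas):
    """Weighted vote of weak hypotheses h_t in {0, 1}: each vote counts
    log(1/beta_t); output class 1 iff the votes for class 1 exceed half
    of the total available weight."""
    total = sum(math.log(1 / b) for b in betas)
    vote = sum(h * math.log(1 / b) for h, b in zip(hypotheses, betas))
    return 1 if vote >= 0.5 * total else 0

# One accurate learner (beta = 0.1) outvotes two mediocre ones (beta = 0.4):
print(ensemble_output([1, 0, 0], [0.1, 0.4, 0.4]))   # 1
```

This makes concrete the remark that a weak learner with a good error rate (small beta, hence large log(1/beta)) contributes much more to the final decision.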


2 BOOSTING

Boosting arises from the PAC (probably approximately correct) learning model, which has as one of its primary interests the efficiency of learning. Schapire was the first to show that a series of weak learners could be converted to a strong learner. The detailed algorithm is shown in Figure 3. Let us call the set of N_1 distinct examples the original training set. We distinguish the original training set from what we will call the filtered training set, which consists of N_1 examples picked with replacement from the original training set. Each of the N_1 original examples is assigned a weight which is proportional to the probability that the example will appear in the filtered training set (these weights have nothing to do with the weights usually associated with neural networks). Initially all examples are assigned a weight of unity, so that all the examples are equally likely to show up in the initial set of training examples. However, the weights are altered at each stage of boosting (Step 5 of Figure 3), and if the weights are high we may have multiple copies of some of the original examples appearing in the filtered training set. In step 3 of this algorithm, we calculate what is called the weighted training error: the error rate over all the original N_1 training examples, weighted by their current respective probabilities. The algorithm terminates if this error rate is .5 (no better than guessing) or zero (then the weights of step 5 do not change). Although not called for in the original C4.5 algorithm, we also have an original set of pruning examples, which are likewise assigned weights to form a filtered pruning set used to prune the classification trees constructed from the filtered training set. It is known (Mingers, 1989a) that reducing the size of the tree (pruning) improves generalization.
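The filtered-set construction and the weighted training error described in this section might be sketched as follows (the paper gives no code, and the variable names are ours):

```python
import random

def filtered_sample(examples, weights, rng=None):
    """Draw len(examples) items with replacement, each picked with
    probability proportional to its current weight (so heavily
    weighted examples may appear several times)."""
    rng = rng or random.Random(0)
    probs = [w / sum(weights) for w in weights]
    return rng.choices(examples, weights=probs, k=len(examples))

def weighted_error(h_outputs, labels, weights):
    """Weighted training error over the *original* examples:
    the sum of p_i * |h(i) - c(i)|, with p_i the normalized weights."""
    total = sum(weights)
    return sum(w / total * abs(h - c)
               for w, h, c in zip(weights, h_outputs, labels))
```

With unit weights the filtered set is a plain bootstrap sample; once weights diverge, hard examples crowd out easy ones.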


3 DECISION TREES

For our implementation of decision trees, we have a set of features (attributes) that specifies an example, along with its classification (we discuss the two-class problem primarily). We pick the feature that, based on some criterion, best splits the examples into two subsets. Each of these two subsets will usually not contain examples of just one class, so we recursively divide the subsets until each final subset contains examples of just one class. Thus, each internal node specifies a feature and a value for that feature that determines whether one should take the left or right branch emanating from that node. At terminal nodes, we make the final decision, class 0 or 1. Thus, in decision trees one starts at the root node and progressively traverses the tree from the root node to one of the terminal nodes.

Figure 3 (the detailed boosting algorithm):

Inputs: N_1 training patterns, N_2 pruning patterns, N_3 test patterns

Initialize the weight vector of the N_1 training patterns: w_i = 1 for i = 1, ..., N_1
Initialize the weight vector of the N_2 pruning patterns: s_i = 1 for i = 1, ..., N_2
Initialize the number of trees in the ensemble to t = 1

Do until the weighted training error rate is 0 or .5, or the ensemble test error rate asymptotes:

1. For the training and pruning sets, normalize the weights:

   p_i = w_i / (Σ_{j=1..N_1} w_j)        r_i = s_i / (Σ_{j=1..N_2} s_j)

   Pick N_1 samples from the original training set with probability p(i) to form the filtered training set.
   Pick N_2 samples from the original pruning set with probability r(i) to form the filtered pruning set.

2. Train tree t using the filtered training set and prune it using the filtered pruning set.

3. Pass the N_1 original training examples through the pruned tree, whose output h_t(i) is either 0 or 1 and whose classification c(i) is either 0 or 1. Calculate the weighted training error rate:

   ε_t = Σ_{i=1..N_1} p_i |h_t(i) - c(i)|

4. Set β_t = ε_t / (1 - ε_t)

5. Set the new training weight vector to be

   w_i^{t+1} = w_i^t β_t^{(1 - |h_t(i) - c(i)|)},   i = 1, ..., N_1

   Pass the N_2 original pruning patterns through the pruned tree and calculate the new pruning weight vector:

   s_i^{t+1} = s_i^t β_t^{(1 - |h_t(i) - c(i)|)},   i = 1, ..., N_2

6. F
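Steps 1 through 5 of Figure 3 can be sketched end to end in Python. This is a minimal illustration under stated assumptions: a one-threshold decision stump on a single feature stands in for the pruned C4.5 tree, the pruning set and its weights are omitted for brevity, and all names are ours:

```python
import random

def train_stump(xs, ys):
    """Weak learner: pick the threshold (and polarity) on a 1-D
    feature that minimizes error on the filtered set. Stands in
    for a pruned C4.5 tree."""
    best = None
    for thr in sorted(set(xs)):
        for flip in (0, 1):
            errs = sum(1 for x, y in zip(xs, ys)
                       if (flip ^ (x >= thr)) != y)
            if best is None or errs < best[0]:
                best = (errs, thr, flip)
    _, thr, flip = best
    return lambda x: flip ^ (x >= thr)

def boost(xs, ys, rounds=5, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    w = [1.0] * n                          # initial unit weights
    trees, betas = [], []
    for _ in range(rounds):
        p = [wi / sum(w) for wi in w]      # Step 1: normalize
        idx = rng.choices(range(n), weights=p, k=n)
        h = train_stump([xs[i] for i in idx],
                        [ys[i] for i in idx])              # Step 2
        eps = sum(p[i] * abs(h(xs[i]) - ys[i])
                  for i in range(n))                       # Step 3
        if eps == 0 or eps >= 0.5:         # termination condition
            break
        beta = eps / (1 - eps)             # Step 4
        w = [w[i] * beta ** (1 - abs(h(xs[i]) - ys[i]))    # Step 5
             for i in range(n)]
        trees.append(h)
        betas.append(beta)
    return trees, betas
```

Correctly classified examples have their weights multiplied by β_t < 1, so the misclassified ones dominate the next filtered sample, exactly as the weight update in Step 5 prescribes.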