{"title": "Regularizing AdaBoost", "book": "Advances in Neural Information Processing Systems", "page_first": 564, "page_last": 570, "abstract": "", "full_text": "Regularizing AdaBoost \n\nGunnar R\u00e4tsch, Takashi Onoda*, Klaus-R. M\u00fcller \nGMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany \n{raetsch, onoda, klaus}@first.gmd.de \n\nAbstract \n\nBoosting methods maximize a hard classification margin and are known as powerful techniques that do not exhibit overfitting for low-noise cases. On noisy data, however, boosting still tries to enforce a hard margin and thereby gives too much weight to outliers, which leads to the dilemma of non-smooth fits and overfitting. We therefore propose three algorithms that allow for soft margin classification by introducing regularization with slack variables into the boosting concept: (1) AdaBoost_reg and regularized versions of (2) linear and (3) quadratic programming AdaBoost. Experiments show the usefulness of the proposed algorithms in comparison to another soft margin classifier: the support vector machine. \n\n1 Introduction \n\nBoosting and other ensemble methods have been used with success in several applications, e.g. OCR [13, 8]. For low-noise cases, several lines of explanation have been proposed for why boosting methods work well: (a) Breiman proposed that during boosting a \"bagging effect\" also takes place [3], which reduces the variance and effectively limits the capacity of the system, and (b) Freund et al. [12] showed that boosting classifies with large margins, since the error function of boosting can be written as a function of the margin and every boosting step tries to minimize this function by maximizing the margin [9, 11]. Recently, studies with noisy patterns have shown that boosting does indeed overfit on noisy data; this holds for boosted decision trees [10], RBF nets [11] and also other kinds of classifiers (e.g. 
[7]). So it is clearly a myth that boosting methods will not overfit. The fact that boosting tries to maximize the margin is exactly the argument that explains why boosting must necessarily overfit for noisy patterns or overlapping distributions; we give asymptotic arguments for this statement in Section 3. Because the hard margin (the smallest margin in the training set) plays a central role in causing overfitting, we propose to relax hard margin classification and allow for misclassifications by using the soft margin classifier concept that has been applied successfully to support vector machines [5]. \n\n*Permanent address: Communication & Information Research Lab., CRIEPI, 2-11-1 Iwado kita, Komae-shi, Tokyo 201-8511, Japan. \n\nOur view is that the margin concept is central for the understanding of both support vector machines and boosting methods. So far it is not clear what the optimal margin distribution is that a learner should achieve for optimal classification in the noisy case. For data without noise a hard margin might be the best choice. However, for noisy data there is always a trade-off between believing in the data and mistrusting it, as any data point could be an outlier. In general learning strategies (e.g. for neural networks) this leads to the introduction of regularization, which reflects the prior that we have about a problem. We will likewise introduce a regularization strategy (analogous to weight decay) into boosting. This strategy uses slack variables to achieve a soft margin (Section 4). Numerical experiments show the validity of our regularization approach in Section 5, and finally a brief conclusion is given. \n\n2 AdaBoost Algorithm \n\nLet {h_t(x) : t = 1, ..., T} be an ensemble of T hypotheses defined on an input vector x, and let c = [c_1 ... c_T] be their weights, satisfying c_t > 0 and |c| = sum_t c_t = 1. 
In the binary classification case, the output is one of two class labels, i.e. h_t(x) = +/-1. The ensemble generates the label given by the weighted majority of the votes: sgn(sum_t c_t h_t(x)). In order to train this ensemble of T hypotheses {h_t(x)} and c, several algorithms have been proposed: Bagging, where the weighting is simply c_t = 1/T [2], and AdaBoost/Arcing, where the weighting scheme is more complicated [12]. In the following we give a brief description of AdaBoost/Arcing. We use a special form of Arcing, which is equivalent to AdaBoost [4]. In the binary classification case we define the margin for an input-output pair z_i = (x_i, y_i), i = 1, ..., l, by \n\n    mg(z_i, c) = y_i sum_{t=1}^{T} c_t h_t(x_i),        (1) \n\nwhich lies between -1 and 1 if |c| = 1. The correct class is predicted if the margin at z_i is positive; the larger the margin, the more confident the decision. AdaBoost maximizes the margin by (asymptotically) minimizing a function of the margin mg(z_i, c) [9, 11]: \n\n    g(b) = sum_{i=1}^{l} exp{ -(|b|/2) mg(z_i, c) },        (2) \n\nwhere b = [b_1 ... b_T] and |b| = sum_t b_t (starting from b = 0). Note that b_t is the unnormalized weight of hypothesis h_t, whereas c is simply a normalized version of b, i.e. c = b/|b|. In order to find the hypothesis h_t, the learning examples z_i are weighted in each iteration t with w_t(z_i). Using a bootstrap on this weighted sample we train h_t; alternatively, a weighted error function can be used (e.g. weighted MSE). The weights w_t(z_i) are computed according to^1 \n\n    w_t(z_i) = exp{ -|b_{t-1}| mg(z_i, c_{t-1})/2 } / sum_{j=1}^{l} exp{ -|b_{t-1}| mg(z_j, c_{t-1})/2 },        (3) \n\nand the training error epsilon_t of h_t is computed as epsilon_t = sum_{i=1}^{l} w_t(z_i) I(y_i != h_t(x_i)), where I(true) = 1 and I(false) = 0. For each given hypothesis h_t we have to find a weight b_t such that g(b) is minimized. 
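As an illustration, the weighting scheme of Eqs. (1)-(3) can be sketched in a few lines of Python. This is a toy setup, not the authors' implementation: the hypothesis outputs H and labels y are made up, and the weight b_t is set with the analytic rule discussed next in the text.

```python
import numpy as np

# Hypothetical toy setup: l = 4 training pairs z_i = (x_i, y_i) and T = 3
# fixed base hypotheses; H[t, i] = h_t(x_i) in {-1, +1}, y holds the labels.
H = np.array([[ 1,  1, -1, -1],
              [ 1, -1, -1, -1],
              [ 1,  1,  1, -1]])
y = np.array([1, 1, -1, -1])

def adaboost_weights(H, y, n_iter=3):
    """One pass of the AdaBoost/Arcing scheme of Section 2 (a sketch).

    Keeps the unnormalized hypothesis weights b and recomputes the
    pattern weights w_t(z_i) from the margins as in Eq. (3).
    """
    T, l = H.shape
    b = np.zeros(T)
    w = np.full(l, 1.0 / l)
    for t in range(n_iter):
        # margins mg(z_i, c) with c = b/|b| (Eq. (1)); zero in the first round
        margins = y * (b @ H) / max(b.sum(), 1e-12)
        # pattern weights of Eq. (3): large for small (or negative) margins
        w = np.exp(-b.sum() * margins / 2.0)
        w /= w.sum()
        # weighted training error epsilon_t of hypothesis h_t
        eps = np.clip(w @ (H[t] != y), 1e-12, 1 - 1e-12)
        # analytic minimizer of g: b_t = log(1 - eps_t) - log eps_t
        b[t] = np.log(1 - eps) - np.log(eps)
    return b, w

b, w = adaboost_weights(H, y)
```

A perfectly fitting hypothesis would give epsilon_t = 0 and an unbounded b_t; the clipping above is only a numerical guard for this toy case.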
One can optimize this parameter by a line search or directly by analytic minimization [4], which gives b_t = log(1 - epsilon_t) - log epsilon_t. Interestingly, we can write \n\n    w_t(z_i) = [dg(b_{t-1})/d mg(z_i, b_{t-1})] / sum_{j=1}^{l} [dg(b_{t-1})/d mg(z_j, b_{t-1})],        (4) \n\ni.e. the weights are the gradient of g(b_{t-1}) with respect to the margins. The weighted minimization with w_t(z_i) will give a hypothesis h_t which is an approximation to the best possible hypothesis h_t* that would be obtained by minimizing g directly. Note that the weighted minimization (bootstrap, weighted LS) will not necessarily give h_t*, even if epsilon_t is minimized [11]. AdaBoost is therefore an approximate gradient descent method which minimizes g asymptotically. \n\n^1 This direct way of computing the weights is equivalent to the update rule of AdaBoost. \n\n3 Hard margins \n\nA decrease of g(c, |b|) := g(b) is predominantly achieved by improvements of the margins mg(z_i, c). If a margin mg(z_i, c) is negative, then the error g(c, |b|) clearly takes a big value, which is additionally amplified by |b|. So, AdaBoost tries to decrease the negative margins efficiently to improve the error g(c, |b|). \nNow let us consider the asymptotic case, where the number of iterations and therefore also |b| take large values [9]. In this case, when the values of all mg(z_i, c), i = 1, ..., l, are almost the same but have small differences, these differences are amplified strongly in g(c, |b|). Obviously the function g(c, |b|) is asymptotically very sensitive to small differences between margins. Therefore, the margins mg(z_i, c) of the training patterns from the margin area (the boundary area between the classes) should asymptotically converge to the same value. From Eq. 
(3), when |b| takes a very big value, AdaBoost learning becomes a \"hard competition\": only the patterns with the smallest margin get high weights, while the other patterns are effectively neglected in the learning process. In order to confirm this reasoning, Fig. 1 (left) shows margin distributions after 10^4 AdaBoost iterations for a toy example [9] at different noise levels, generated by a uniform distribution U(0.0, sigma^2). From this figure it becomes apparent that the margin distribution asymptotically develops a step at a fixed margin value for the training patterns lying in the margin area. In previous studies [9, 11] we observed that those patterns exhibit a large overlap with the support vectors of support vector machines. The numerical results support our theoretical asymptotic analysis. The property of AdaBoost to produce a big margin area (no pattern in the area, i.e. a hard margin) will not always lead to the best generalization ability (cf. [5, 11]). This is especially true if the training patterns have classification or input noise. \n\nFigure 1: Margin distributions for AdaBoost (left) at different noise levels (sigma^2 = 0% (dotted), 9% (dashed), 16% (solid)) with a fixed number of RBF centers for the base hypothesis; typical overfitting behaviour of the generalization error as a function of the number of iterations (middle); and a typical decision line (right) generated by AdaBoost using RBF networks in the noisy case (here: 30 centers and sigma^2 = 16%; smoothed). 
In our experiments with noisy data, we often observed that AdaBoost overfits (for a high number of boosting iterations). Fig. 1 (middle) shows a typical overfitting behaviour of the generalization error for AdaBoost: after only 80 boosting iterations the best generalization performance is already achieved. Quinlan [10] and Grove et al. [7] also observed overfitting, and that the generalization performance of AdaBoost is often worse than that of the single classifier if the data has classification noise. \nThe first reason for overfitting is the increasing value of |b|: noisy patterns (e.g. badly labelled ones) can asymptotically have an \"unlimited\" influence on the decision line, leading to overfitting (cf. Eq. (3)). Another reason is the classification with a hard margin, which also means that all training patterns will asymptotically be classified correctly (without any capacity limitation!). In the presence of noise this is certainly not the right concept, because the best decision line (e.g. the Bayes decision line) usually will not give a training error of zero. So, the achievement of large hard margins on noisy data will produce hypotheses which are too complex for the problem. \n\n4 How to get Soft Margins \n\nChanging AdaBoost's error function    In order to avoid overfitting, we introduce slack variables, similar to those of the support vector algorithm [5, 14], into AdaBoost. \nWe know that all training patterns will obtain non-negative stabilities after many iterations (see Fig. 1 (left)), i.e. mg(z_i, c) >= rho for all i = 1, ..., l, where rho is the minimum margin of the patterns. Due to this fact, AdaBoost often produces high weights for the difficult training patterns by enforcing a non-negative margin rho >= 0 (for every pattern, including outliers), and this property will eventually lead to overfitting, as observed in Fig. 1. 
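The concentration of weight on the most difficult patterns is easy to reproduce numerically. The following sketch evaluates the weighting rule of Eq. (3) for made-up margin values and shows how, as |b| grows, almost all weight shifts to the smallest-margin pattern:

```python
import numpy as np

# Toy margins for five hypothetical training patterns (made-up values);
# the smallest margin, 0.10, stands in for a difficult pattern or outlier.
margins = np.array([0.10, 0.20, 0.30, 0.50, 0.90])

def pattern_weights(margins, b_norm):
    """Pattern weights of Eq. (3) for a given value of |b| (a sketch)."""
    w = np.exp(-b_norm * margins / 2.0)
    return w / w.sum()

# For small |b| the weights are fairly uniform; for large |b| the
# smallest-margin pattern absorbs almost all of the weight -- the
# "hard competition" regime described in Section 3.
w_small = pattern_weights(margins, 1.0)
w_large = pattern_weights(margins, 200.0)
```

The same computation with a negative margin in the list would make the concentration even more extreme, which is exactly the outlier problem the slack variables below are meant to relieve.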
Therefore, we introduce slack variables xi_i^t and get \n\n    mg(z_i, c) >= rho - C xi_i^t,    xi_i^t >= 0.        (5) \n\nIn these inequalities the xi_i^t are positive, and if a training pattern had high weights in the previous iterations, xi_i^t should increase. In this way, for example, we do not force outliers to be classified according to their possibly wrong labels, but we allow for some errors. In this sense we get a trade-off between the margin and the importance of a pattern in the training process (depending on the constant C >= 0). If we choose C = 0 in Eq. (5), the original AdaBoost algorithm is retrieved. If C is chosen too high, the data is not taken seriously. We adopt a prior on the weights w_r(z_i) that punishes large weights, in analogy to weight decay, and choose \n\n    xi_i^t = sum_{r=1}^{t} c_r w_r(z_i),        (6) \n\nwhere the sum is the cumulative weight of the pattern in the previous iterations (we call it the influence of a pattern, similar to the Lagrange multipliers in SVMs). With this xi_i^t, AdaBoost is unchanged for easily classifiable patterns, but is changed for difficult patterns. From Eq. (5), we can derive a new error function: \n\n    g_reg(c_t, |b_t|) = sum_{i=1}^{l} exp{ -(|b_t|/2) (mg(z_i, c_t) - C xi_i^t) }.        (7) \n\nWith this error function, we can control the trade-off between the weights which a pattern had in the last iterations and the achieved margin. The weight w_t(z_i) of a pattern is computed as the derivative of Eq. (7) with respect to mg(z_i, b_{t-1}) (cf. Eq. (4)) and is given by \n\n    w_t(z_i) = exp{ -|b_{t-1}| (mg(z_i, c_{t-1}) - C xi_i^{t-1})/2 } / sum_{j=1}^{l} exp{ -|b_{t-1}| (mg(z_j, c_{t-1}) - C xi_j^{t-1})/2 }.        (8) \n\nTable 1: Pseudocode description of the algorithms \n\nAll three: Run AdaBoost on dataset Z to obtain T hypotheses h and their weights c; construct the loss matrix L_{i,t} = -1 if h_t(x_i) != y_i, and +1 otherwise. \n\nLP-AdaBoost(Z, T): \n    minimize  -rho \n    s.t.      sum_{t=1}^{T} c_t L_{i,t} >= rho,  c_t >= 0,  sum_t c_t = 1 \n\nLPreg-AdaBoost(Z, T, C): \n    minimize  -rho + C sum_i xi_i \n    s.t.      sum_{t=1}^{T} c_t L_{i,t} >= rho - xi_i,  c_t >= 0,  sum_t c_t = 1,  xi_i >= 0 \n\nQPreg-AdaBoost(Z, T, C): \n    minimize  ||b||^2 + C sum_i xi_i \n    s.t.      sum_{t=1}^{T} b_t L_{i,t} >= 1 - xi_i,  b_t >= 0,  xi_i >= 0 \n\nThus we get an update rule for the weight of a training pattern [11]: \n\n    w_t(z_i) = w_{t-1}(z_i) exp{ b_{t-1} I(y_i != h_{t-1}(x_i)) + C xi_i^{t-2} |b_{t-2}| - C xi_i^{t-1} |b_{t-1}| }.        (9) \n\nIt is more difficult to compute the weight b_t of the t-th hypothesis analytically. However, we can obtain b_t by a line search over Eq. (7), which has a unique solution because d^2 g_reg / d b_t^2 > 0 is satisfied. This line search can be implemented very efficiently. With this line search, we can now also use real-valued outputs of the base hypotheses, whereas the original AdaBoost algorithm could not (cf. also [6]). \n\nOptimizing a given ensemble    In Grove et al. [7] it was shown how to use linear programming to maximize the minimum margin of a given ensemble, and LP-AdaBoost was proposed (Table 1, top). This algorithm maximizes the minimum margin on the training patterns. It achieves a hard margin (as AdaBoost does asymptotically) already for a small number of iterations. By the reasoning about hard margins (Section 3), this cannot generalize well. If we introduce slack variables into LP-AdaBoost, we get the algorithm LPreg-AdaBoost (Table 1, middle) [11]. This modification allows some patterns to have margins lower than rho (in particular lower than 0). There is a trade-off between (a) making all margins bigger than rho and (b) maximizing rho; this trade-off is controlled by the constant C. \nAnother formulation of an optimization problem can be derived from the support vector algorithm. 
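The LPreg-AdaBoost problem of Table 1 (middle) is a plain linear program, so it can be sketched with an off-the-shelf solver. The following is an illustrative sketch with SciPy's linprog, not the authors' implementation; the small loss matrix L is made up:

```python
import numpy as np
from scipy.optimize import linprog

def lp_reg_adaboost(L, C):
    """Solve the LPreg-AdaBoost linear program of Table 1, middle (a sketch).

    L is the l x T loss matrix with L[i, t] = +1 if h_t classifies pattern i
    correctly and -1 otherwise.  We minimize -rho + C * sum_i xi_i over
    x = [c_1..c_T, rho, xi_1..xi_l] subject to
    sum_t c_t L[i, t] >= rho - xi_i,  c_t >= 0,  sum_t c_t = 1,  xi_i >= 0.
    """
    l, T = L.shape
    obj = np.concatenate([np.zeros(T), [-1.0], C * np.ones(l)])
    # margin constraints rewritten in <= form: -L c + rho - xi <= 0
    A_ub = np.hstack([-L, np.ones((l, 1)), -np.eye(l)])
    b_ub = np.zeros(l)
    # normalization constraint sum_t c_t = 1
    A_eq = np.concatenate([np.ones(T), [0.0], np.zeros(l)])[None, :]
    bounds = [(0, None)] * T + [(None, None)] + [(0, None)] * l
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    return res.x[:T], res.x[T], res.x[T + 1:]

# Toy ensemble: 4 patterns, 3 hypotheses (made-up loss matrix)
L = np.array([[ 1,  1,  1],
              [ 1, -1,  1],
              [-1,  1,  1],
              [ 1,  1, -1]])
c, rho, xi = lp_reg_adaboost(L, C=1.0)
```

LP-AdaBoost (Table 1, top) is the special case with the xi variables removed, or equivalently with C so large that every xi_i comes out zero.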
The optimization objective of an SVM is to find a function h^w which minimizes a functional of the form E = ||w||^2 + C sum_i xi_i, where y_i h(x_i) >= 1 - xi_i and the norm of the parameter vector w is the measure of complexity of the hypothesis h^w [14]. For ensemble learning we do not have such a measure of complexity, so we use the norm of the hypothesis weight vector b instead. For |b| = 1, ||b||^2 is small if the elements are approximately equal (analogous to Bagging) and large when a few hypotheses are strongly emphasized (far away from Bagging). Experimentally, we found that ||b||^2 is often larger for more complex hypotheses. Thus we can apply the optimization principles of SVMs to AdaBoost and get the algorithm QPreg-AdaBoost (Table 1, bottom). We effectively use a linear SVM on top of the outputs of the base hypotheses. \n\n5 Experiments \n\nIn order to evaluate the performance of our new algorithms, we compare the single RBF classifier, the original AdaBoost algorithm, AdaBoost_reg (with RBF nets), L/QPreg-AdaBoost and a support vector machine (with RBF kernel). We use ten artificial and real-world datasets from the UCI and DELVE benchmark repositories: banana (toy dataset as in [9, 11]), breast cancer, image segment, ringnorm, flare sonar, splice, new-thyroid, titanic, twonorm and waveform. Some of the problems are originally not binary classification problems, hence a (random) partition into two classes was used. First we generate 20 partitions into training and test sets (mostly ~ 60% : 40%). On each partition we train the classifiers and compute their test set errors. The performance is averaged, yielding Table 2. 
\nTable 2: Comparison of the six methods: single RBF classifier, AdaBoost (AB), AdaBoost_reg (ABreg), L/QPreg-AdaBoost (LPR/QPR) and a support vector machine (SVM): estimates of the generalization error in % on 10 datasets. Clearly, AdaBoost_reg gives the best overall performance. For further explanation see text. \n\n             RBF          AB           ABreg        QPR          LPR          SVM \nBanana       10.9\u00b10.5   12.3\u00b10.7   10.1\u00b10.5   10.8\u00b10.4   11.5\u00b14.7   10.9\u00b10.5 \nCancer       28.7\u00b15.3   30.5\u00b14.5   26.3\u00b14.3   31.0\u00b14.2   26.2\u00b14.7   26.1\u00b14.8 \nImage        2.8\u00b10.7    2.5\u00b10.7    2.5\u00b10.7    2.6\u00b10.6    2.9\u00b10.7    2.4\u00b10.5 \nRingnorm     1.1\u00b10.3    2.0\u00b10.2    1.1\u00b10.2    2.2\u00b10.4    1.9\u00b10.2    1.1\u00b10.1 \nFSonar       34.6\u00b12.1   35.6\u00b11.9   33.6\u00b11.7   36.2\u00b11.7   32.5\u00b11.1   35.7\u00b14.5 \nSplice       10.0\u00b10.3   10.1\u00b10.3   9.5\u00b10.2    10.2\u00b11.6   10.1\u00b10.5   10.9\u00b10.7 \nThyroid      4.8\u00b12.4    4.4\u00b11.9    4.4\u00b12.1    4.4\u00b12.0    4.4\u00b12.2    4.8\u00b12.2 \nTitanic      23.4\u00b11.7   22.7\u00b11.2   22.5\u00b11.0   22.7\u00b11.0   22.4\u00b11.0   22.9\u00b11.9 \nTwonorm      2.8\u00b10.2    3.1\u00b10.3    2.1\u00b12.1    3.0\u00b10.2    3.0\u00b10.3    3.4\u00b10.6 \nWaveform     10.7\u00b11.0   10.8\u00b10.4   9.9\u00b10.9    10.6\u00b11.0   9.8\u00b10.3    10.1\u00b10.5 \nMean %       6.7          9.6          1.0          6.3          11.1         4.7 \nWinner %     16.4         8.2          28.5         16.6         15.0         15.3 \n\nWe used RBF nets with adaptive centers (some conjugate gradient iterations to optimize the positions and widths of the centers) as base hypotheses, as described in [1, 11]. In all experiments, we combined 200 hypotheses. Clearly, this number of hypotheses may not be optimal; however, AdaBoost with optimal early stopping is not better than AdaBoost_reg. 
The parameter C of the regularized versions of AdaBoost and the parameters (C, sigma) of the SVM are optimized on the first five training datasets. On each training set, 5-fold cross-validation is used to find the best model for this dataset.^2 Finally, the model parameters are computed as the median of the five estimates. This way of estimating the parameters is surely not possible in practice, but it makes the comparison more robust and the results more reliable. The last-but-one line of Table 2 shows 'Mean %', which is computed as follows: for each dataset, the average error rate of each classifier type is divided by the minimum error rate over all classifiers and 1 is subtracted; these numbers are then averaged over the 10 datasets. The last line shows 'Winner %', the probability that a method wins, i.e. gives the smallest generalization error, on the basis of our experiments (averaged over all ten datasets). Our experiments on noisy data show that (a) the results of AdaBoost are in almost all cases worse than those of the single classifier (a clear overfitting effect), and (b) the results of AdaBoost_reg are in all cases (much) better than those of AdaBoost and better than those of the single classifier. Furthermore, we see clearly that (c) the single classifier wins as often as the SVM, (d) L/QPreg-AdaBoost improves the results of AdaBoost, and (e) AdaBoost_reg wins most often. L/QPreg-AdaBoost improves the results of AdaBoost in almost all cases due to the established soft margin, but the results are not as good as those of AdaBoost_reg and the SVM, because the hypotheses generated by AdaBoost (aimed at constructing a hard margin) may not be the appropriate ones to generate a good soft margin. We also observe that quadratic programming gives slightly better results than linear programming. This may be due to the fact that the hypothesis coefficients generated by LPreg-AdaBoost are more sparse (smaller ensemble). 
Bigger ensembles may have a better generalization ability (due to the reduction of variance [3]). The worse performance of the SVM compared to AdaBoost_reg and the unexpected tie between SVM and RBF net may be explained by (a) the fixed sigma of the RBF kernel (losing multi-scale information), (b) the coarse model selection, and (c) the error function of the SV algorithm (noise model) being less appropriate. Summarizing, AdaBoost is useful for low-noise cases, where the classes are separable (as shown for OCR [13, 8]). AdaBoost_reg extends the applicability of boosting to \"difficult separable\" cases and should be applied if the data is noisy. \n\n^2 The parameters are only near-optimal; only 10 values for each parameter are tested. \n\n6 Conclusion \n\nWe introduced three algorithms to alleviate the overfitting problems of boosting algorithms on high-noise data: (1) direct incorporation of the regularization term into the error function (Eq. (7)), and the use of (2) linear and (3) quadratic programming with constraints given by the slack variables. The essence of our proposal is to introduce slack variables for regularization in order to allow for soft margin classification, in contrast to the hard margin classification used before. The slack variables basically allow us to control how much we trust the data, so we are permitted to ignore outliers which would otherwise have spoiled our classification. This generalization is very much in the spirit of support vector machines, which also trade off the maximization of the margin against the minimization of the classification errors in the slack variables. In our experiments, AdaBoost_reg showed a better overall generalization performance than all other algorithms, including the support vector machine. We conjecture that this unexpected result is mostly due to the fact that the SVM can only use one sigma and therefore loses scaling information. 
AdaBoost does not have this limitation. \nSo far we balance our trust in the data against the margin maximization by cross-validation. It would be better if we knew the \"optimal\" margin distribution achievable for classifying noisy patterns; then we could balance the errors and the margin sizes optimally. In future work, we plan to establish more connections between AdaBoost and SVMs. \n\nAcknowledgements: We thank A. Smola, B. Sch\u00f6lkopf, T. Frie\u00df and D. Schuurmans for valuable discussions. Partial funding from EC STORM project grant number 25387 is gratefully acknowledged. The breast cancer domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. Soklic for providing the data. \n\nReferences \n[1] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon, 1995. \n[2] L. Breiman. Bagging predictors. Machine Learning, 26(2):123-140, 1996. \n[3] L. Breiman. Arcing classifiers. Tech. Rep. 460, Berkeley Stat. Dept., 1997. \n[4] L. Breiman. Prediction games and arcing algorithms. Tech. Rep. 504, Berkeley Stat. Dept., 1997. \n[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995. \n[6] R. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Proc. of COLT'98. \n[7] A. J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proc. 15th Nat. Conf. on AI, 1998. \n[8] Y. LeCun et al. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural Networks, pages 261-276, 1995. \n[9] T. Onoda, G. R\u00e4tsch and K.-R. M\u00fcller. An asymptotic analysis of AdaBoost in the binary classification case. In Proc. of ICANN'98, April 1998. \n[10] J. Quinlan. Boosting first-order learning. In Proc. of the 7th Internat. Workshop on Algorithmic Learning Theory, LNAI 1160, 143-155. 
Springer. \n[11] G. R\u00e4tsch. Soft margins for AdaBoost. Technical Report NC-TR-1998-021, Royal Holloway College, August 1998. Submitted to Machine Learning. \n[12] R. Schapire, Y. Freund, P. Bartlett and W. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Machine Learning, 148-156, 1998. \n[13] H. Schwenk and Y. Bengio. AdaBoosting neural networks: Application to on-line character recognition. In ICANN'97, LNCS 1327, 967-972, 1997. Springer. \n[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995. \n", "award": [], "sourceid": 1615, "authors": [{"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "Takashi", "family_name": "Onoda", "institution": null}, {"given_name": "Klaus", "family_name": "M\u00fcller", "institution": null}]}