{"title": "v-Arc: Ensemble Learning in the Presence of Outliers", "book": "Advances in Neural Information Processing Systems", "page_first": 561, "page_last": 567, "abstract": null, "full_text": "v-Arc: Ensemble Learning \nin the Presence of Outliers \n\nG. Ratscht , B. Scholkopf1, A. Smola\", \nK.-R. Miillert, T. Onodatt , and S. Mikat \n\nt Microsoft Research, 1 Guildhall Street, Cambridge CB2 3NH, UK \n\nt GMD FIRST, Rudower Chaussee 5,12489 Berlin, Germany \n* Dep. of Engineering, ANU, Canberra ACT 0200, Australia \ntt CRIEPI, 2-11-1, Iwado Kita, Komae-shi, Tokyo, Japan \n\n{raetsch, klaus, mika}~first.gmd.de,bsc~microsoft.com, \n\nAlex.Smola~anu.edu.au,onoda~criepi.denken.or.jp \n\nAbstract \n\nAdaBoost and other ensemble methods have successfully been ap(cid:173)\nplied to a number of classification tasks, seemingly defying prob(cid:173)\nlems of overfitting. AdaBoost performs gradient descent in an error \nfunction with respect to the margin, asymptotically concentrating \non the patterns which are hardest to learn. For very noisy prob(cid:173)\nlems, however, this can be disadvantageous. Indeed, theoretical \nanalysis has shown that the margin distribution, as opposed to just \nthe minimal margin, plays a crucial role in understanding this phe(cid:173)\nnomenon. Loosely speaking, some outliers should be tolerated if \nthis has the benefit of substantially increasing the margin on the \nremaining points. We propose a new boosting algorithm which al(cid:173)\nlows for the possibility of a pre-specified fraction of points to lie in \nthe margin area Or even on the wrong side of the decision boundary. \n\n1 \n\nIntroduction \n\nBoosting and related Ensemble learning methods have been recently used with great \nsuccess in applications such as Optical Character Recognition (e.g. [8, 16]). \nThe idea of a large minimum margin [17] explains the good generalization perfor(cid:173)\nmance of AdaBoost in the low noise regime. 
However, AdaBoost performs worse on noisy tasks [10, 11], such as the iris and the breast cancer benchmark data sets [1]. On the latter tasks, a large margin on all training points cannot be achieved without adverse effects on the generalization error. This experimental observation was supported by the study of [13], where the generalization error of ensemble methods was bounded by the sum of the fraction of training points which have a margin smaller than some value ρ, say, plus a complexity term depending on the base hypotheses and ρ. While this bound can only capture part of what is going on in practice, it nevertheless already conveys the message that in some cases it pays to allow for some points which have a small margin, or are misclassified, if this leads to a larger overall margin on the remaining points.

To cope with this problem, it was mandatory to construct regularized variants of AdaBoost, which traded off the number of margin errors and the size of the margin [9, 11]. This goal, however, had so far been achieved in a heuristic way by introducing regularization parameters which have no immediate interpretation and which cannot be adjusted easily.

The present paper addresses this problem in two ways. Primarily, it makes an algorithmic contribution to the problem of constructing regularized boosting algorithms. However, compared to the previous efforts, it parameterizes the above trade-off in a much more intuitive way: its only free parameter directly determines the fraction of margin errors. This, in turn, is also appealing from a theoretical point of view, since it involves a parameter which controls a quantity that plays a crucial role in the generalization error bounds (cf. also [9, 13]).
Furthermore, it allows the user to roughly specify this parameter once a reasonable estimate of the expected error (possibly from other studies) can be obtained, thus reducing the training time.

2 Boosting and the Linear Programming Solution

Before deriving a new algorithm, we briefly discuss the properties of the solution generated by standard AdaBoost and, closely related, Arc-GV [2], and show the relation to a linear programming (LP) solution over the class of base hypotheses G.

Let {g_t(x) : t = 1, ..., T} be a sequence of hypotheses and α = [α_1 ... α_T] their weights satisfying α_t ≥ 0. The hypotheses g_t are elements of a hypothesis class G = {g : x ↦ [-1, 1]}, which is defined by a base learning algorithm. The ensemble generates the label which is the weighted majority of the votes by sign(f(x)), where

    f(x) = Σ_t (α_t / ||α||_1) g_t(x).     (1)

In order to express that f, and therefore also the margin ρ, depend on α, and for ease of notation, we define

    ρ(z, α) := y f(x), where z := (x, y) and f is defined as in (1).     (2)

Likewise we use the normalized margin:

    ρ(α) := min_{1 ≤ i ≤ m} ρ(z_i, α).     (3)

Ensemble learning methods have to find both the hypotheses g_t ∈ G used for the combination and their weights α. In the following we will consider only AdaBoost-type algorithms (including Arcing). For more details see e.g. [4, 2]. The main idea of AdaBoost is to introduce weights w_t(z_i) on the training patterns. They are used to control the importance of each single pattern in learning a new hypothesis (i.e. while repeatedly running the base algorithm). Training patterns that are difficult to learn (which are misclassified repeatedly) become more important. The minimization objective of AdaBoost can be expressed in terms of margins as

    𝒢(α) = Σ_{i=1}^m exp(-||α||_1 ρ(z_i, α)).     (4)

In every iteration, AdaBoost tries to minimize this error by a stepwise maximization of the margin.
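The margin quantities in (1)-(4) are straightforward to compute once the outputs of the base hypotheses are fixed. The following is a minimal sketch, not the authors' code: the hypothesis output matrix H (with H[i, t] = g_t(x_i)) and all toy values are made up for illustration.

```python
import numpy as np

def margins(alpha, H, y):
    """Margins rho(z_i, alpha) = y_i * f(x_i), with f as in eq. (1),
    i.e. f = sum_t (alpha_t / ||alpha||_1) g_t."""
    alpha = np.asarray(alpha, dtype=float)
    f = H @ (alpha / np.abs(alpha).sum())  # H[i, t] = g_t(x_i) in [-1, 1]
    return y * f

def adaboost_objective(alpha, H, y):
    """AdaBoost's objective G(alpha) = sum_i exp(-||alpha||_1 * rho(z_i, alpha)),
    cf. eq. (4)."""
    return np.exp(-np.abs(alpha).sum() * margins(alpha, H, y)).sum()

def pattern_weights(alpha, H, y):
    """Normalized pattern weights: hard patterns (small margin) get large weight."""
    w = np.exp(-np.abs(alpha).sum() * margins(alpha, H, y))
    return w / w.sum()
```

On a toy problem, the pattern with the smallest margin receives the largest weight, which is exactly the reweighting behavior described above.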
It is widely believed that AdaBoost tries to maximize the smallest margin on the training set [2, 5, 3, 13, 11]. Strictly speaking, however, a general proof is missing. It would imply that AdaBoost asymptotically approximates (up to scaling) the solution of the following linear programming problem over the complete hypothesis set G (cf. [7], assuming a finite number of base hypotheses):

    maximize    ρ
    subject to  ρ(z_i, α) ≥ ρ     for all 1 ≤ i ≤ m
                α_t, ρ ≥ 0        for all 1 ≤ t ≤ |G|
                ||α||_1 = 1     (5)

Since such a linear program cannot be solved exactly for an infinite hypothesis set in general, it is interesting to analyze approximation algorithms for this kind of problem. Breiman [2] proposed a modification of AdaBoost - Arc-GV - making it possible to show the asymptotic convergence of ρ(α^t) to the global solution ρ_lp:

Theorem 1 (Breiman [2]). Choose α_t in each iteration as

    α_t := argmin_{α_t ∈ [0,1]} Σ_i exp[-||α^t||_1 (ρ(z_i, α^t) - ρ(α^{t-1}))],     (6)

and assume that the base learner always finds the hypothesis g ∈ G which minimizes the weighted training error with respect to the weights. Then

    lim_{t→∞} ρ(α^t) = ρ_lp.

Note that the algorithm above can be derived from the modified error function

    𝒢_gv(α^t) := Σ_i exp[-||α^t||_1 (ρ(z_i, α^t) - ρ(α^{t-1}))].     (7)

The question one might ask now is whether to use AdaBoost or rather Arc-GV in practice. Does Arc-GV converge fast enough to benefit from its asymptotic properties? In [12] we conducted experiments to investigate this question. We empirically found that (a) AdaBoost has problems finding the optimal combination if ρ_lp < 0, (b) Arc-GV's convergence does not depend on ρ_lp, and (c) for ρ_lp > 0, AdaBoost usually converges to the maximum margin solution slightly faster than Arc-GV.
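For a finite hypothesis set, the LP (5) can be solved directly with an off-the-shelf solver. A sketch using scipy.optimize.linprog follows; the margin matrix M (with M[i, t] = y_i g_t(x_i)) is an invented toy example, not data from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def max_margin_lp(M):
    """Solve LP (5): maximize rho subject to sum_t alpha_t M[i, t] >= rho,
    alpha_t >= 0, rho >= 0, ||alpha||_1 = 1.  Variable vector x = (alpha, rho)."""
    m, T = M.shape
    c = np.zeros(T + 1)
    c[-1] = -1.0                                # maximize rho == minimize -rho
    A_ub = np.hstack([-M, np.ones((m, 1))])     # rho - M[i] @ alpha <= 0
    b_ub = np.zeros(m)
    A_eq = np.append(np.ones(T), 0.0).reshape(1, -1)  # sum_t alpha_t = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (T + 1), method="highs")
    assert res.status == 0
    return res.x[:-1], res.x[-1]                # (alpha, rho)
```

With three patterns whose margins under the two hypotheses are (1, 1), (1, -1) and (1, -1), all weight goes to the first hypothesis and the minimum margin becomes 1, illustrating the max-min structure of (5).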
Observation (a) becomes clear from (4): 𝒢(α) will not converge to 0, and ||α||_1 can be bounded by some value. Thus the asymptotic case cannot be reached, whereas for Arc-GV the optimum is always found. Moreover, the number of iterations necessary to converge to a good solution seems to be reasonable, but for a near optimal solution the number of iterations is rather high. This implies that for real world hypothesis sets, the number of iterations needed to find an almost optimal solution can become prohibitive, but we conjecture that in practice a reasonably good approximation to the optimum is provided by both AdaBoost and Arc-GV.

3 ν-Algorithms

For the LP-AdaBoost [7] approach it has been shown for noisy problems that the generalization performance is usually not as good as that of AdaBoost [7, 2, 11]. From Theorem 5 in [13] (cf. Theorem 3 on page 5) this fact becomes clear, as the minimum of the right hand side of the inequality (cf. (13)) need not necessarily be achieved with a maximum margin. We now propose an algorithm to directly control the number of margin errors and therefore also the contribution of both terms in the inequality separately (cf. Theorem 3). We first consider a small hypothesis class and end up with a linear program - ν-LP-AdaBoost. In subsection 3.2 we then combine this algorithm with the ideas from section 2 and get a new algorithm - ν-Arc - which approximates the ν-LP solution.

3.1 ν-LP-AdaBoost

Let us consider the case where we are given a (finite) set G = {g : x ↦ [-1, 1]} of T hypotheses. To find the coefficients α for the combined hypothesis f(x), we extend the LP-AdaBoost algorithm [7, 11] by incorporating the parameter ν [15] and solve the following linear optimization problem:

    maximize    ρ - (1/(νm)) Σ_{i=1}^m ξ_i
    subject to  ρ(z_i, α) ≥ ρ - ξ_i     for all 1 ≤ i ≤ m
                ξ_i, α_t, ρ ≥ 0          for all 1 ≤ t ≤ T and 1 ≤ i ≤ m
                ||α||_1 = 1     (8)
This algorithm does not force all margins to be beyond zero, and we get a soft margin classification (cf. SVMs) with a regularization constant 1/(νm). The following proposition shows that ν has an immediate interpretation:

Proposition 2 (Rätsch et al. [12]). Suppose we run the algorithm given in (8) on some data with the resulting optimal ρ > 0. Then

1. ν upper bounds the fraction of margin errors.
2. 1 - ν upper bounds the fraction of patterns with margin larger than ρ.

Since the slack variables ξ_i only enter the cost function linearly, their absolute size is not important. Loosely speaking, this is due to the fact that for the optimum of the primal objective function, only derivatives wrt. the primal variables matter, and the derivative of a linear function is constant. In the case of SVMs [14], where the hypotheses can be thought of as vectors in some feature space, this statement can be translated into a precise rule for distorting training patterns without changing the solution: we can move them locally orthogonal to a separating hyperplane. This yields a desirable robustness property. Note that the algorithm essentially depends on the number of outliers, not on the size of the error [15].

3.2 The ν-Arc Algorithm

Suppose we have a very large (but finite) base hypothesis class G. Then it is difficult to solve (8) as (5) directly. To this end, we propose a new algorithm - ν-Arc - that approximates the solution of (8). The optimal ρ for fixed margins ρ(z_i, α) in (8) can be written as

    ρ_ν(α) := argmax_{ρ ∈ [0,1]} (ρ - (1/(νm)) Σ_{i=1}^m (ρ - ρ(z_i, α))_+),     (9)

where (ξ)_+ := max(ξ, 0). Setting ξ_i := (ρ_ν(α) - ρ(z_i, α))_+ and subtracting (1/(νm)) Σ_{i=1}^m ξ_i from the resulting inequality on both sides yields (for all 1 ≤ i ≤ m)

    ρ(z_i, α) + ξ_i - (1/(νm)) Σ_{i=1}^m ξ_i  ≥  ρ_ν(α) - (1/(νm)) Σ_{i=1}^m ξ_i.     (10)

Two more substitutions are needed to transform the problem into one which can be solved by the AdaBoost algorithm. In particular we have to get rid of the slack variables ξ_i again by absorbing them into quantities similar to ρ(z_i, α) and ρ(α). This works as follows: on the right hand side of (10) we have the objective function (cf. (8)), and on the left hand side a term that depends nonlinearly on α. Defining

    ρ̃_ν(α) := ρ_ν(α) - (1/(νm)) Σ_{i=1}^m ξ_i   and   ρ̃_ν(z_i, α) := ρ(z_i, α) + ξ_i - (1/(νm)) Σ_{i=1}^m ξ_i,     (11)

which we substitute for ρ(α) and ρ(z, α) in (5), respectively, we obtain a new optimization problem. Note that ρ̃_ν(α) and ρ̃_ν(z_i, α) play the role of a corrected or virtual margin. We obtain a nonlinear min-max problem

    maximize    ρ̃(α)
    subject to  ρ̃(z_i, α) ≥ ρ̃(α)     for all 1 ≤ i ≤ m
                α_t ≥ 0                for all 1 ≤ t ≤ T
                ||α||_1 ≤ 1,     (12)

which Arc-GV can solve approximately (cf. section 2). Hence, by replacing the margin ρ(z, α) by ρ̃(z, α) in equation (4) and the other formulas for Arc-GV (cf. [2]), we obtain a new algorithm which we refer to as ν-Arc.

We can now state interesting properties for ν-Arc by using Theorem 5 of [13] that bounds the generalization error R(f) for ensemble methods. In our case R_ρ(f) ≤ ν by construction (i.e. the fraction of patterns with a margin smaller than ρ, cf. Proposition 2), thus we get the following simple reformulation of this bound:

Theorem 3. Let p(x, y) be a distribution over X × [-1, 1], and let X be a sample of m examples chosen iid according to p. Suppose the base-hypothesis space G has VC dimension h, and let δ > 0.
Then with probability at least 1 - δ over the random choice of the training set X, Y, every function f generated by ν-Arc, where ν ∈ (0, 1) and ρ_ν > 0, satisfies the following bound:

    R(f) ≤ ν + sqrt( (c/m) ( h log²(m/h) / ρ_ν² + log(1/δ) ) ),     (13)

where c is a constant. So, for minimizing the right hand side we can trade off between the first and the second term by controlling an easily interpretable regularization parameter ν.

4 Experiments

We show a set of toy experiments to illustrate the general behavior of ν-Arc. As base hypothesis class G we use the RBF networks of [11], and as data a two-class problem generated from several 2D Gauss blobs (cf. Banana shape dataset from http://www.first.gmd.de/~raetsch/data/banana.html). We obtain the following results:

• ν-Arc leads to approximately νm patterns that are effectively used in the training of the base learner: Figure 1 (left) shows the fraction of patterns that have high average weights during the learning process (i.e. Σ_{t=1}^T w_t(z_i) > 1/(2m)). We find that the number of the latter increases (almost) linearly with ν. This follows from (11), as the (soft) margin of patterns with ρ(z, α) < ρ_ν is set to ρ_ν and the weight of those patterns will be the same.

• The (estimated) test error, averaged over 10 training sets, exhibits a rather flat minimum in ν (Figure 1 (right)). This indicates that, just as for ν-SVMs, where corresponding results have been obtained, ν is a well-behaved parameter in the sense that a slight misadjustment is not harmful.

• ν-Arc leads to the fraction ν of margin errors (cf. dashed line in Figure 1), exactly as predicted in Proposition 2.

• Finally, a good value of ν can already be inferred from prior knowledge of the expected error. Setting it to a value similar to the latter provides a good starting point for further optimization (cf. Theorem 3).
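The inner maximization (9) that ν-Arc performs for fixed margins, and the margin-error property of Proposition 2, can be sketched as follows. This is a toy illustration, not the authors' implementation, and the margin values in the usage example are invented; it exploits the fact that the objective in (9) is piecewise linear and concave in ρ.

```python
import numpy as np

def rho_nu(margins, nu):
    """rho_nu(alpha) = argmax_{rho in [0,1]} rho - (1/(nu*m)) * sum_i (rho - rho_i)_+,
    cf. eq. (9).  The objective is piecewise linear and concave in rho, so it is
    enough to evaluate it at rho = 0, rho = 1 and each margin value in [0, 1]."""
    m = len(margins)
    cands = np.clip(np.concatenate(([0.0, 1.0], margins)), 0.0, 1.0)
    def obj(rho):
        return rho - np.maximum(rho - margins, 0.0).sum() / (nu * m)
    return max(cands, key=obj)

def margin_error_fraction(margins, rho):
    """Fraction of patterns with margin strictly below rho
    (Proposition 2 states this is at most nu)."""
    return float(np.mean(margins < rho))
```

For ν = 0.4 and five toy margins, the resulting ρ_ν leaves at most a fraction ν of patterns below it, and at most a fraction 1 - ν strictly above it, matching both parts of Proposition 2.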
Note that for ν = 1, we recover the Bagging algorithm (if we used bootstrap samples), as the weights of all patterns will be the same (w_t(z_i) = 1/m for all i = 1, ..., m) and also the hypothesis weights will be constant (α_t ≈ 1/T for all t = 1, ..., T).

Finally, we present a small comparison on ten benchmark data sets obtained from the UCI benchmark repository [1] (cf. http://ida.first.gmd.de/~raetsch/data/benchmarks.html). We analyze the performance of single RBF networks, AdaBoost, ν-Arc and RBF-SVMs. For AdaBoost and ν-Arc we use RBF networks [11] as base hypothesis. The model parameters of RBF (number of centers etc.), ν-Arc (ν) and SVMs (σ, C) are optimized using 5-fold cross-validation. More details on the experimental setup can be found in [11].

[Figure 1 appeared here: the left panel plots, as a function of ν ∈ [0, 1], the fraction of important patterns, the fraction of margin errors and the training error; the right panel plots the corresponding test error, with Arc-GV at ν = 0 and Bagging at ν = 1.]

Figure 1: Toy experiment (σ = 0): the left graph shows the average fraction of important patterns, the av. fraction of margin errors and the av. training error for different values of the regularization constant ν for ν-Arc. The right graph shows the corresponding generalization error. In both cases, the parameter ν allows us to reduce the test errors to values much lower than for the hard margin algorithm (for ν = 0 we recover Arc-GV/AdaBoost, and for ν = 1 we get Bagging).

Table 1 shows the generalization error estimates (after averaging over 100 realizations of the data sets) and the confidence interval.
The results of the best classifier and the classifiers that are not significantly worse are set in bold face. To test the significance, we used a t-test (p = 80%). On eight out of the ten data sets, ν-Arc performs significantly better than AdaBoost. This clearly shows the superior performance of ν-Arc for noisy data sets and supports this soft margin approach for AdaBoost. Furthermore, we find comparable performances for ν-Arc and SVMs. In three cases the SVM performs better and in two cases ν-Arc performs best. Summarizing, AdaBoost is useful for low noise cases, where the classes are separable. ν-Arc extends the applicability of boosting to problems that are difficult to separate and should be applied if the data are noisy.

5 Conclusion

We analyzed the AdaBoost algorithm and found that Arc-GV and AdaBoost are efficient for approximating the solution of non-linear min-max problems over huge hypothesis classes. We re-parameterized the LP Reg-AdaBoost algorithm (cf. [7, 11]) and introduced a new regularization constant ν that controls the fraction of patterns inside the margin area. The new parameter is highly intuitive and has to be optimized only on a fixed interval [0, 1]. Using the fact that Arc-GV can approximately solve min-max problems, we found a formulation of Arc-GV - ν-Arc - that implements the ν-idea for boosting by defining an appropriate soft margin. The present paper extends previous work on regularizing boosting (DOOM [9], AdaBoost_Reg [11]) and shows the utility and flexibility of the soft margin approach for AdaBoost.
\n\nRBF \n\n10.8 \u00b1 0.06 \nBanana \n27.6 \u00b1 0.47 \nB.Cancer \n24.3 \u00b1 0.19 \nDiabetes \n24.7 \u00b1 0.24 \nGerman \n17.6 \u00b1 0.33 \nHeart \n1.7 \u00b1 0.02 \nRingnorm \n34.4 \u00b1 0.20 \nF .Sonar \n4.5 \u00b1 0.21 \nThyroid \n23.3 \u00b1 0.13 \nTitanic \nWaveform 10.7 \u00b1 0.11 \n\nAB \n\n12.3 \u00b1 0.07 \n30.4 \u00b1 0.47 \n26.5 \u00b1 0.23 \n27.5 \u00b1 0.25 \n20.3 \u00b1 0.34 \n1.9 \u00b1 0.03 \n35.7 \u00b1 0.18 \n4.4 \u00b1 0.22 \n22.6 \u00b1 0.12 \n10.8 \u00b1 0.06 \n\nv-Arc \n\n10.6 \u00b1 0.05 \n25.8 \u00b1 0.46 \n23.7 \u00b1 0.20 \n24.4 \u00b1 0.22 \n16.5 \u00b1 0.36 \n1.7 \u00b1 0.02 \n34.4 \u00b1 0.19 \n4.4 \u00b1 0.22 \n23.0 \u00b1 0.14 \n10.0 \u00b1 0.07 \n\nSVM \n\n11.5 \u00b1 0.07 \n26.0 \u00b1 0.47 \n23.5 \u00b1 0.17 \n23.6 \u00b1 0.21 \n16.0 \u00b1 0.33 \n1.7 \u00b1 0.01 \n32.4 \u00b1 0.18 \n4.8 \u00b1 0.22 \n22.4 \u00b1 0.10 \n9.9 \u00b1 0.04 \n\nTable 1: Generalization error estimates and confidence intervals. The best classifiers for a \nparticular data set are marked in bold face (see text). \n\n\fv-Arc: Ensemble Learning in the Presence of Outliers \n\n567 \n\nWe found empirically that the generalization performance in v-Arc depends only \nslightly on the choice of the regularization constant. This makes model selection \n(e.g. via cross-validation) easier and faster. \nFuture work will study the detailed regularization properties of the regularized ver(cid:173)\nsions of AdaBoost, in particular in comparison to v-LP Support Vector Machines . \nAcknowledgments: Partial funding from DFG grant (Ja 379/52) is gratefully \nacknowledged. This work was done while AS and BS were at GMD FIRST. \nReferences \n\n[1] C. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases, \n\n1998. http://www.ics. uci.edu/ \",mlearn/MLRepository.html. \n\n[2] L. Breiman. Prediction games and arcing algorithms. Technical Report 504, Statistics \n\nDepartment, University of California, December 1997. \n\n[3] M. Frean and T. 
Downs. A simple cost function for boosting. Technical report, Dept. of Computer Science and Electrical Eng., University of Queensland, 1998.
[4] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Eurocolt '95, pages 23-37. Springer-Verlag, 1995.
[5] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. of Comp. & Syst. Sc., 55(1):119-139, 1997.
[6] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Technical report, Stanford University, 1998.
[7] A. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proc. of the 15th Nat. Conf. on AI, pages 692-699, 1998.
[8] Y. LeCun, L. D. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Müller, E. Säckinger, P. Simard, and V. Vapnik. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural Networks, pages 261-276, 1995.
[9] L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 1999. To appear.
[10] J. R. Quinlan. Boosting first-order learning (invited lecture). Lecture Notes in Computer Science, 1160:143, 1996.
[11] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Technical Report NC-TR-1998-021, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 1998. To appear in Machine Learning.
[12] G. Rätsch, B. Schölkopf, A. Smola, S. Mika, T. Onoda, and K.-R. Müller. Robust ensemble learning. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 207-219. MIT Press, Cambridge, MA, 1999.
[13] R. Schapire, Y. Freund, P. L. Bartlett, and W. Sun Lee.
Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 1998. (Earlier appeared in: D. H. Fisher, Jr. (ed.), Proc. ICML'97, M. Kaufmann.)
[14] B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[15] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1083-1121, 2000.
[16] H. Schwenk and Y. Bengio. Training methods for adaptive boosting of neural networks. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Inf. Processing Systems, volume 10. The MIT Press, 1998.
[17] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
", "award": [], "sourceid": 1721, "authors": [{"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}, {"given_name": "Takashi", "family_name": "Onoda", "institution": null}, {"given_name": "Sebastian", "family_name": "Mika", "institution": null}]}