{"title": "Boosting Algorithms as Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 512, "page_last": 518, "abstract": null, "full_text": "Boosting Algorithms as Gradient Descent \n\nLlew Mason \n\nResearch School of Information \n\nSciences and Engineering \n\nAustralian National University \nCanberra, ACT, 0200, Australia \n\nlmason@syseng.anu.edu.au \n\nJonathan Baxter \n\nResearch School of Information \n\nSciences and Engineering \n\nAustralian National University \nCanberra, ACT, 0200, Australia \n\nJonathan.Baxter@anu.edu.au \n\nPeter Bartlett \n\nResearch School of Information \n\nSciences and Engineering \n\nAustralian National University \nCanberra, ACT, 0200, Australia \n\nPeter.Bartlett@anu.edu.au \n\nMarcus Frean \n\nDepartment of Computer Science \n\nand Electrical Engineering \n\nThe University of Queensland \nBrisbane, QLD, 4072, Australia \n\nmarcusf@elec.uq.edu.au \n\nAbstract \n\nWe provide an abstract characterization of boosting algorithms as \ngradient descent on cost-functionals in an inner-product function \nspace. We prove convergence of these functional-gradient-descent \nalgorithms under quite weak conditions. Following previous theoretical \nresults bounding the generalization performance of convex \ncombinations of classifiers in terms of general cost functions of the \nmargin, we present a new algorithm (DOOM II) for performing a \ngradient descent optimization of such cost functions. Experiments \non several data sets from the UC Irvine repository demonstrate \nthat DOOM II generally outperforms AdaBoost, especially in high \nnoise situations, and that the overfitting behaviour of AdaBoost is \npredicted by our cost functions. 
\n\n1 Introduction \n\nThere has been considerable interest recently in voting methods for pattern classification, \nwhich predict the label of a particular example using a weighted vote over \na set of base classifiers [10, 2, 6, 9, 16, 5, 3, 19, 12, 17, 7, 11, 8]. Recent theoretical \nresults suggest that the effectiveness of these algorithms is due to their tendency \nto produce large margin classifiers [1, 18]. Loosely speaking, if a combination of \nclassifiers correctly classifies most of the training data with a large margin, then its \nerror probability is small. \n\nIn [14] we gave improved upper bounds on the misclassification probability of a \ncombined classifier in terms of the average over the training data of a certain cost \nfunction of the margins. That paper also described DOOM, an algorithm for directly \nminimizing the margin cost function by adjusting the weights associated with \neach base classifier (the base classifiers are supplied to DOOM). DOOM exhibits \nperformance improvements over AdaBoost, even when using the same base hypotheses, \nwhich provides additional empirical evidence that these margin cost functions \nare appropriate quantities to optimize. \n\nIn this paper, we present a general class of algorithms (called AnyBoost) which \nare gradient descent algorithms for choosing linear combinations of elements of an \ninner product function space so as to minimize some cost functional. The normal \noperation of a weak learner is shown to be equivalent to maximizing a certain inner \nproduct. We prove convergence of AnyBoost under weak conditions. In Section 3, \nwe show that this general class of algorithms includes as special cases nearly all \nexisting voting methods. In Section 5, we present experimental results for a special \ncase of AnyBoost that minimizes a theoretically-motivated margin cost functional. 
\nThe experiments show that the new algorithm typically outperforms AdaBoost, and \nthat this is especially true with label noise. In addition, the theoretically-motivated \ncost functions provide good estimates of the error of AdaBoost, in the sense that \nthey can be used to predict its overfitting behaviour. \n\n2 AnyBoost \n\nLet $(x, y)$ denote examples from $X \\times Y$, where $X$ is the space of measurements \n(typically $X \\subseteq \\mathbb{R}^N$) and $Y$ is the space of labels ($Y$ is usually a discrete set or some \nsubset of $\\mathbb{R}$). Let $\\mathcal{F}$ denote some class of functions (the base hypotheses) mapping \n$X \\to Y$, and $\\mathrm{lin}(\\mathcal{F})$ denote the set of all linear combinations of functions in $\\mathcal{F}$. Let \n$\\langle \\cdot, \\cdot \\rangle$ be an inner product on $\\mathrm{lin}(\\mathcal{F})$, and \n\n$$C: \\mathrm{lin}(\\mathcal{F}) \\to \\mathbb{R}$$ \n\na cost functional on $\\mathrm{lin}(\\mathcal{F})$. \n\nOur aim is to find a function $F \\in \\mathrm{lin}(\\mathcal{F})$ minimizing $C(F)$. We will proceed \niteratively via a gradient descent procedure. \n\nSuppose we have some $F \\in \\mathrm{lin}(\\mathcal{F})$ and we wish to find a new $f \\in \\mathcal{F}$ to add to $F$ \nso that the cost $C(F + \\epsilon f)$ decreases, for some small value of $\\epsilon$. Viewed in function \nspace terms, we are asking for the \"direction\" $f$ such that $C(F + \\epsilon f)$ most rapidly \ndecreases. The desired direction is simply the negative of the functional derivative \nof $C$ at $F$, $-\\nabla C(F)$, given by \n\n$$\\nabla C(F)(x) := \\left. \\frac{\\partial C(F + \\alpha 1_x)}{\\partial \\alpha} \\right|_{\\alpha = 0}, \\qquad (1)$$ \n\nwhere $1_x$ is the indicator function of $x$. Since we are restricted to choosing our new \nfunction $f$ from $\\mathcal{F}$, in general it will not be possible to choose $f = -\\nabla C(F)$, so \ninstead we search for an $f$ with greatest inner product with $-\\nabla C(F)$. That is, we \nshould choose $f$ to maximize $-\\langle \\nabla C(F), f \\rangle$. This can be motivated by observing \nthat, to first order in $\\epsilon$, $C(F + \\epsilon f) = C(F) + \\epsilon \\langle \\nabla C(F), f \\rangle$ and hence the greatest \nreduction in cost will occur for the $f$ maximizing $-\\langle \\nabla C(F), f \\rangle$. \n\nFor reasons that will become obvious later, an algorithm that chooses $f$ attempting \nto maximize $-\\langle \\nabla C(F), f \\rangle$ will be described as a weak learner. 
\nThe preceding discussion motivates Algorithm 1 (AnyBoost), an iterative algorithm \nfor finding linear combinations $F$ of base hypotheses in $\\mathcal{F}$ that minimize the cost \nfunctional $C(F)$. Note that we have allowed the base hypotheses to take values in \nan arbitrary set $Y$, we have not restricted the form of the cost or the inner product, \nand we have not specified what the step-sizes should be. Appropriate choices for \nthese things will be made when we apply the algorithm to more concrete situations. \nNote also that the algorithm terminates when $-\\langle \\nabla C(F_t), f_{t+1} \\rangle \\le 0$, i.e. when the \nweak learner $\\mathcal{L}$ returns a base hypothesis $f_{t+1}$ which no longer points in the downhill \ndirection of the cost function $C(F)$. Thus, the algorithm terminates when, to first \norder, a step in function space in the direction of the base hypothesis returned by \n$\\mathcal{L}$ would increase the cost. \n\nAlgorithm 1: AnyBoost \n\nRequire: \n\u2022 An inner product space $(\\mathcal{X}, \\langle \\cdot, \\cdot \\rangle)$ containing functions mapping from $X$ to \n  some set $Y$. \n\u2022 A class of base classifiers $\\mathcal{F} \\subseteq \\mathcal{X}$. \n\u2022 A differentiable cost functional $C: \\mathrm{lin}(\\mathcal{F}) \\to \\mathbb{R}$. \n\u2022 A weak learner $\\mathcal{L}(F)$ that accepts $F \\in \\mathrm{lin}(\\mathcal{F})$ and returns $f \\in \\mathcal{F}$ with a \n  large value of $-\\langle \\nabla C(F), f \\rangle$. \n\nLet $F_0(x) := 0$. \nfor $t := 0$ to $T$ do \n    Let $f_{t+1} := \\mathcal{L}(F_t)$. \n    if $-\\langle \\nabla C(F_t), f_{t+1} \\rangle \\le 0$ then \n        return $F_t$. \n    end if \n    Choose $w_{t+1}$. \n    Let $F_{t+1} := F_t + w_{t+1} f_{t+1}$. \nend for \nreturn $F_{T+1}$. \n\n3 A gradient descent view of voting methods \n\nWe now restrict our attention to base hypotheses $f \\in \\mathcal{F}$ mapping to $Y = \\{\\pm 1\\}$, \nand the inner product \n\n$$\\langle F, G \\rangle := \\frac{1}{m} \\sum_{i=1}^{m} F(x_i) G(x_i) \\qquad (2)$$ \n\nfor all $F, G \\in \\mathrm{lin}(\\mathcal{F})$, where $S = \\{(x_1, y_1), \\ldots, (x_m, y_m)\\}$ is a set of training examples \ngenerated according to some unknown distribution $\\mathcal{D}$ on $X \\times Y$. Our aim now is to \nfind $F \\in \\mathrm{lin}(\\mathcal{F})$ such that $\\Pr_{(x,y) \\sim \\mathcal{D}}[\\mathrm{sgn}(F(x)) \\ne y]$ is minimal, where \n$\\mathrm{sgn}(F(x)) = -1$ if $F(x) < 0$ and $\\mathrm{sgn}(F(x)) = 1$ otherwise. 
In other words, $\\mathrm{sgn}\\, F$ should minimize \nthe misclassification probability. \n\nThe margin of $F: X \\to \\mathbb{R}$ on example $(x, y)$ is defined as $yF(x)$. Consider margin \ncost-functionals defined by \n\n$$C(F) := \\frac{1}{m} \\sum_{i=1}^{m} c(y_i F(x_i)),$$ \n\nwhere $c: \\mathbb{R} \\to \\mathbb{R}$ is any differentiable real-valued function of the margin. With these \ndefinitions, a quick calculation shows: \n\n$$-\\langle \\nabla C(F), f \\rangle = -\\frac{1}{m^2} \\sum_{i=1}^{m} y_i f(x_i) c'(y_i F(x_i)).$$ \n\nSince positive margins correspond to examples correctly labelled by $\\mathrm{sgn}\\, F$ and negative \nmargins to incorrectly labelled examples, any sensible cost function of the \nmargin will be monotonically decreasing. Hence $-c'(y_i F(x_i))$ will always be positive. \nDividing through by $-\\sum_{i=1}^{m} c'(y_i F(x_i))$, we see that finding an $f$ maximizing \n$-\\langle \\nabla C(F), f \\rangle$ is equivalent to finding an $f$ minimizing the weighted error \n\n$$\\sum_{i:\\, f(x_i) \\ne y_i} D(i), \\quad \\text{where} \\quad D(i) := \\frac{c'(y_i F(x_i))}{\\sum_{j=1}^{m} c'(y_j F(x_j))} \\quad \\text{for } i = 1, \\ldots, m.$$ \n\nMany of the most successful voting methods are, for the appropriate choice of margin \ncost function $c$ and step-size, specific cases of the AnyBoost algorithm (see Table 1). \nA more detailed analysis can be found in the full version of this paper [15]. \n\nTable 1: Existing voting methods viewed as AnyBoost on margin cost functions. \n\nAlgorithm              Cost function            Step size \nAdaBoost [9]           $e^{-yF(x)}$             Line search \nARC-X4 [2]             $(1 - yF(x))^5$          $1/t$ \nConfidenceBoost [19]   $e^{-yF(x)}$             Line search \nLogitBoost [12]        $\\ln(1 + e^{-yF(x)})$    Newton-Raphson \n\n4 Convergence of AnyBoost \n\nIn this section we provide convergence results for the AnyBoost algorithm, under \nquite weak conditions on the cost functional $C$. The prescriptions given for the \nstep-sizes $w_t$ in these results are for convergence guarantees only: in practice they \nwill almost always be smaller than necessary, hence fixed small steps or some form \nof line search should be used. 
\nThe following theorem (proof omitted, see [15]) supplies a specific step-size for \nAnyBoost and characterizes the limiting behaviour with this step-size. \n\nTheorem 1. Let $C: \\mathrm{lin}(\\mathcal{F}) \\to \\mathbb{R}$ be any lower bounded, Lipschitz differentiable \ncost functional (that is, there exists $L > 0$ such that $\\|\\nabla C(F) - \\nabla C(F')\\| \\le L \\|F - F'\\|$ \nfor all $F, F' \\in \\mathrm{lin}(\\mathcal{F})$). Let $F_0, F_1, \\ldots$ be the sequence of combined hypotheses \ngenerated by the AnyBoost algorithm, using step-sizes \n\n$$w_{t+1} := -\\frac{\\langle \\nabla C(F_t), f_{t+1} \\rangle}{L \\|f_{t+1}\\|^2}. \\qquad (3)$$ \n\nThen AnyBoost either halts on round $T$ with $-\\langle \\nabla C(F_T), f_{T+1} \\rangle \\le 0$, or $C(F_t)$ \nconverges to some finite value $C^*$, in which case $\\lim_{t \\to \\infty} \\langle \\nabla C(F_t), f_{t+1} \\rangle = 0$. \n\nThe next theorem (proof omitted, see [15]) shows that if the weak learner can \nalways find the best weak hypothesis $f_t \\in \\mathcal{F}$ on each round of AnyBoost, and if \nthe cost functional $C$ is convex, then any accumulation point $F$ of the sequence \n$(F_t)$ generated by AnyBoost with the step sizes (3) is a global minimum of the \ncost. For ease of exposition, we have assumed that rather than terminating when \n$-\\langle \\nabla C(F_T), f_{T+1} \\rangle \\le 0$, AnyBoost simply continues to return $F_T$ for all subsequent \ntime steps $t$. \n\nTheorem 2. Let $C: \\mathrm{lin}(\\mathcal{F}) \\to \\mathbb{R}$ be a convex cost functional with the properties \nin Theorem 1, and let $(F_t)$ be the sequence of combined hypotheses generated by \nthe AnyBoost algorithm with step sizes given by (3). Assume that the weak hypothesis \nclass $\\mathcal{F}$ is negation closed ($f \\in \\mathcal{F} \\implies -f \\in \\mathcal{F}$) and that on each round \nthe AnyBoost algorithm finds a function $f_{t+1}$ maximizing $-\\langle \\nabla C(F_t), f_{t+1} \\rangle$. Then \nany accumulation point $F$ of the sequence $(F_t)$ satisfies $\\sup_{f \\in \\mathcal{F}} -\\langle \\nabla C(F), f \\rangle = 0$ \nand $C(F) = \\inf_{G \\in \\mathrm{lin}(\\mathcal{F})} C(G)$. \n\n5 Experiments \n\nAdaBoost had been perceived to be resistant to overfitting despite the fact that \nit can produce combinations involving very large numbers of classifiers. 
However, \nrecent studies have shown that this is not the case, even for base classifiers as \nsimple as decision stumps [13, 5, 17]. This overfitting can be attributed to the use \nof exponential margin cost functions (recall Table 1). \n\nThe results in [14] showed that overfitting may be avoided by using margin cost \nfunctionals of a form qualitatively similar to \n\n$$C(F) = \\frac{1}{m} \\sum_{i=1}^{m} \\left[ 1 - \\tanh(\\lambda y_i F(x_i)) \\right], \\qquad (4)$$ \n\nwhere $\\lambda$ is an adjustable parameter controlling the steepness of the margin cost \nfunction $c(z) = 1 - \\tanh(\\lambda z)$. For the theoretical analysis of [14] to apply, $F$ \nmust be a convex combination of base hypotheses, rather than a general linear \ncombination. Henceforth (4) will be referred to as the normalized sigmoid cost \nfunctional. AnyBoost with (4) as the cost functional and (2) as the inner product \nwill be referred to as DOOM II. In our implementation of DOOM II we use a \nfixed small step-size $\\epsilon$ (for all of the experiments $\\epsilon = 0.05$). For all details of the \nalgorithm the reader is referred to the full version of this paper [15]. \n\nWe compared the performance of DOOM II and AdaBoost on a selection of nine \ndata sets taken from the UCI machine learning repository [4] to which various levels \nof label noise had been applied. To simplify matters, only binary classification \nproblems were considered. For all of the experiments axis-orthogonal hyperplanes \n(also known as decision stumps) were used as the weak learner. Full details of \nthe experimental setup may be found in [15]. A summary of the experimental \nresults is shown in Figure 1. The improvement in test error exhibited by DOOM \nII over AdaBoost is shown for each data set and noise level. DOOM II generally \noutperforms AdaBoost and the improvement is more pronounced in the presence of \nlabel noise. 
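\n\nAs a hedged illustration of how the choice of margin cost changes the weak learner's example weights, the following worked derivation is a sketch only: it assumes the Section 3 weighting in which each example receives weight $D(i)$ proportional to $-c'(y_i F(x_i))$, it ignores the restriction of $F$ to a convex combination, and it is not taken from [15]. For the normalized sigmoid cost $c(z) = 1 - \\tanh(\\lambda z)$, \n\n$$c'(z) = -\\lambda \\left( 1 - \\tanh^2(\\lambda z) \\right), \\qquad D(i) = \\frac{1 - \\tanh^2(\\lambda y_i F(x_i))}{\\sum_{j=1}^{m} \\left( 1 - \\tanh^2(\\lambda y_j F(x_j)) \\right)}.$$ \n\nUnder these assumptions $D(i)$ is largest for margins near zero and decays towards zero as $|y_i F(x_i)|$ grows, so examples with large negative margin contribute little to the weighted error. By contrast, the exponential cost $c(z) = e^{-z}$ gives $D(i) \\propto e^{-y_i F(x_i)}$, which grows without bound as the margin becomes more negative. 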
\nThe effect of using the normalized sigmoid cost function rather than the exponential \ncost function is best illustrated by comparing the cumulative margin distributions \ngenerated by AdaBoost and DOOM II. Figure 2 shows comparisons for two data \nsets with 0% and 15% label noise applied. For a given margin, the value on the \ncurve corresponds to the proportion of training examples with margin less than or \nequal to this value. These curves show that in trying to increase the margins of \nnegative examples AdaBoost is willing to sacrifice the margin of positive examples \nsignificantly. In contrast, DOOM II 'gives up' on examples with large negative \nmargin in order to reduce the value of the cost function. \n\nGiven that AdaBoost does suffer from overfitting and is guaranteed to minimize an \nexponential cost function of the margins, this cost function certainly does not relate \nto test error. How does the value of our proposed cost function correlate with \nAdaBoost's test error? Figure 3 shows the variation in the normalized sigmoid \ncost function, the exponential cost function and the test error for AdaBoost for \ntwo UCI data sets over 10000 rounds. There is a strong correlation between the \nnormalized sigmoid cost and AdaBoost's test error. In both data sets the minimum \nof AdaBoost's test error and the minimum of the normalized sigmoid cost very \nnearly coincide, showing that the sigmoid cost function predicts when AdaBoost \nwill start to overfit. \n\n[Figure 1 plot not reproduced; x-axis: Data set.] \n\nFigure 1: Summary of test error advantage (with standard error bars) of DOOM II \nover AdaBoost with varying levels of noise on nine UCI data sets. \n\n[Figure 2 plots not reproduced; panels: breast-cancer-wisconsin, splice; x-axis: Margin.] \n\nFigure 2: Margin distributions for AdaBoost and DOOM II with 0% and 15% label \nnoise for the breast-cancer and splice data sets. \n\n[Figure 3 plots not reproduced; panels: labor, vote1; x-axis: Rounds.] \n\nFigure 3: AdaBoost test error, exponential cost and normalized sigmoid cost over \n10000 rounds of AdaBoost for the labor and vote1 data sets. Both costs have been \nscaled in each case for easier comparison with test error. \n\nReferences \n\n[1] P. L. Bartlett. The sample complexity of pattern classification with neural networks: \nthe size of the weights is more important than the size of the network. IEEE Transactions \non Information Theory, 44(2):525-536, March 1998. \n\n[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996. \n\n[3] L. Breiman. Prediction games and arcing algorithms. Technical Report 504, Department \nof Statistics, University of California, Berkeley, 1998. \n\n[4] E. Keogh, C. Blake, and C. J. Merz. UCI repository of machine learning databases, \n1998. http://www.ics.uci.edu/~mlearn/MLRepository.html. \n\n[5] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles \nof decision trees: Bagging, boosting, and randomization. Technical report, \nComputer Science Department, Oregon State University, 1998. \n\n[6] H. Drucker and C. Cortes. Boosting decision trees. In Advances in Neural Information \nProcessing Systems 8, pages 479-485, 1996. \n\n[7] N. Duffy and D. Helmbold. A geometric approach to leveraging weak learners. In \nComputational Learning Theory: 4th European Conference, 1999. (to appear). \n\n[8] Y. Freund. An adaptive version of the boost by majority algorithm. In Proceedings of \nthe Twelfth Annual Conference on Computational Learning Theory, 1999. (to appear). \n\n[9] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In \nMachine Learning: Proceedings of the Thirteenth International Conference, pages \n148-156, 1996. \n\n[10] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning \nand an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, \nAugust 1997. \n\n[11] J. Friedman. Greedy function approximation: A gradient boosting machine. Technical \nreport, Stanford University, 1999. \n\n[12] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical \nview of boosting. Technical report, Stanford University, 1998. \n\n[13] A. 
Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of \nlearned ensembles. In Proceedings of the Fifteenth National Conference on Artificial \nIntelligence, pages 692-699, 1998. \n\n[14] L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit \noptimization of margins. Machine Learning, 1999. (to appear). \n\n[15] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional Gradient \nTechniques for Combining Hypotheses. In Alex Smola, Peter Bartlett, Bernard \nSch\u00f6lkopf, and Dale Schuurmans, editors, Large Margin Classifiers. MIT Press, 1999. \nTo appear. \n\n[16] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National \nConference on Artificial Intelligence, pages 725-730, 1996. \n\n[17] G. R\u00e4tsch, T. Onoda, and K.-R. M\u00fcller. Soft margins for AdaBoost. Technical Report \nNC-TR-1998-021, Department of Computer Science, Royal Holloway, University of \nLondon, Egham, UK, 1998. \n\n[18] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: \nA new explanation for the effectiveness of voting methods. Annals of Statistics, \n26(5):1651-1686, October 1998. \n\n[19] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated \npredictions. In Proceedings of the Eleventh Annual Conference on Computational \nLearning Theory, pages 80-91, 1998. \n", "award": [], "sourceid": 1766, "authors": [{"given_name": "Llew", "family_name": "Mason", "institution": null}, {"given_name": "Jonathan", "family_name": "Baxter", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Marcus", "family_name": "Frean", "institution": null}]}