{"title": "A Gradient-Based Boosting Algorithm for Regression Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 696, "page_last": 702, "abstract": null, "full_text": "A Gradient-Based Boosting Algorithm for \n\nRegression Problems \n\nRichard S. Zemel \n\nToniann Pitassi \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nAbstract \n\nIn adaptive boosting, several weak learners trained sequentially \nare combined to boost the overall algorithm performance. Re(cid:173)\ncently adaptive boosting methods for classification problems have \nbeen derived as gradient descent algorithms. This formulation jus(cid:173)\ntifies key elements and parameters in the methods, all chosen to \noptimize a single common objective function. We propose an anal(cid:173)\nogous formulation for adaptive boosting of regression problems, \nutilizing a novel objective function that leads to a simple boosting \nalgorithm. We prove that this method reduces training error, and \ncompare its performance to other regression methods. \n\nThe aim of boosting algorithms is to \"boost\" the small advantage that a hypothesis \nproduced by a weak learner can achieve over random guessing, by using the weak \nlearning procedure several times on a sequence of carefully constructed distribu(cid:173)\ntions. Boosting methods, notably AdaBoost (Freund & Schapire, 1997), are sim(cid:173)\nple yet powerful algorithms that are easy to implement and yield excellent results \nin practice. Two crucial elements of boosting algorithms are the way in which a \nnew distribution is constructed for the learning procedure to produce the next hy(cid:173)\npothesis in the sequence, and the way in which hypotheses are combined to pro(cid:173)\nduce a highly accurate output. Both of these involve a set of parameters, whose \nvalues appeared to be determined in an ad hoc maImer. 
Recently boosting algorithms have been derived as gradient descent algorithms (Breiman, 1997; Schapire & Singer, 1998; Friedman et al., 1999; Mason et al., 1999). These formulations justify the parameter values as all serving to optimize a single common objective function. These optimization formulations of boosting, originally developed for classification problems, have recently been applied to regression problems. However, key properties of these regression boosting methods deviate significantly from the classification boosting approach. We propose a new boosting algorithm for regression problems, also derived from a central objective function, which retains these properties. In this paper, we describe the original boosting algorithm and summarize boosting methods for regression. We present our method and provide a simple proof that elucidates conditions under which convergence on training error can be guaranteed. We propose a probabilistic framework that clarifies the relationship between various optimization-based boosting methods. Finally, we summarize empirical comparisons between our method and others on some standard problems. \n\n1 A Brief Summary of Boosting Methods \n\nAdaptive boosting methods are simple modular algorithms that operate as follows. Let g : X → Y be the function to be learned, where the label set Y is finite, typically binary-valued. The algorithm uses a learning procedure, which has access to n training examples, {(x^1, y^1), ..., (x^n, y^n)}, drawn randomly from X × Y according to distribution D; it outputs a hypothesis f : X → Y, whose error is the expected value of a loss function on f(x), g(x), where x is chosen according to D. 
\nGiven ε, δ > 0 and access to random examples, a strong learning procedure outputs with probability 1 - δ a hypothesis with error at most ε, with running time polynomial in 1/ε, 1/δ and the number of examples. A weak learning procedure satisfies the same conditions, but where ε need only be better than random guessing. Schapire (1990) showed that any weak learning procedure, denoted WeakLearn, can be efficiently transformed (\"boosted\") into a strong learning procedure. The AdaBoost algorithm achieves this by calling WeakLearn multiple times, in a sequence of T stages, each time presenting it with a different distribution over a fixed training set and finally combining all of the hypotheses. The algorithm maintains a weight w_t^i for each training example i at stage t, and a distribution D_t is computed by normalizing these weights. The algorithm loops through these steps: \n\n1. At stage t, the distribution D_t is given to WeakLearn, which generates a hypothesis f_t. The error rate ε_t of f_t w.r.t. D_t is: ε_t = Σ_{i: f_t(x^i) ≠ y^i} w_t^i / Σ_{i=1}^n w_t^i \n\n2. The new training distribution is obtained from the new weights: w_{t+1}^i = w_t^i (ε_t / (1 - ε_t))^{1 - |f_t(x^i) - y^i|} \n\nAfter T stages, a test example x will be classified by a combined weighted-majority hypothesis: ŷ = sgn(Σ_{t=1}^T c_t f_t(x)). Each combination coefficient c_t = log((1 - ε_t)/ε_t) takes into account the accuracy of hypothesis f_t with respect to its distribution. \n\nThe optimization approach derives these equations as all minimizing a common objective function J, the expected error of the combined hypotheses, estimated from the training set. The new hypothesis is the step in function space in the direction of steepest descent of this objective. 
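The reweighting in steps 1 and 2, together with the coefficient c_t = log((1 - ε_t)/ε_t), can be illustrated with a minimal sketch (labels in {0, 1}; the function and variable names are ours, not from the paper):

```python
import math

def adaboost_round(w, preds, labels):
    """One AdaBoost stage (sketch): predictions and labels in {0, 1}.

    Implements steps 1 and 2 above:
      eps_t   = weight on misclassified examples / total weight
      w_{t+1} = w_t * (eps_t / (1 - eps_t)) ** (1 - |f_t(x) - y|)
    and the coefficient c_t = log((1 - eps_t) / eps_t).
    """
    total = sum(w)
    eps = sum(wi for wi, p, y in zip(w, preds, labels) if p != y) / total
    c = math.log((1 - eps) / eps)
    beta = eps / (1 - eps)  # < 1 whenever the weak learner beats chance
    # Correct examples (|f - y| = 0) are multiplied by beta and shrink;
    # misclassified ones (|f - y| = 1) keep their weight.
    new_w = [wi * beta ** (1 - abs(p - y)) for wi, p, y in zip(w, preds, labels)]
    return eps, c, new_w

# Four equally weighted examples, one mistake: eps = 0.25, c = log 3.
eps, c, new_w = adaboost_round([0.25] * 4, [1, 0, 1, 1], [1, 0, 0, 1])
```

After normalization, the single misclassified example carries half of the new distribution's mass, which is exactly the property that forces the next hypothesis to attend to it.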
For example, if J ≈ Σ_{i=1}^n exp(-Σ_t y^i c_t f_t(x^i)), then the cost after T rounds is the cost after T - 1 rounds times the cost of hypothesis f_T: \n\nJ(T) ≈ Σ_{i=1}^n exp(-Σ_{t=1}^{T-1} y^i c_t f_t(x^i)) exp(-y^i c_T f_T(x^i)) \n\nso training f_T to minimize J(T) amounts to minimizing the cost on a weighted training distribution. Similarly, the training distribution is formed by normalizing updated weights: w_{t+1}^i = w_t^i exp(-y^i c_t f_t(x^i)) = w_t^i exp(s_t^i c_t), where s_t^i = 1 if f_t(x^i) ≠ y^i, else s_t^i = -1. Note that because the objective function J is multiplicative in the costs of the hypotheses, a key property follows: the objective for each hypothesis is formed simply by re-weighting the training distribution. \n\nThis boosting algorithm applies to binary classification problems, but it does not readily generalize to regression problems. Intuitively, regression problems present special difficulties because hypotheses may not just be right or wrong, but can be a little wrong or very wrong. Recently a spate of clever optimization-based boosting methods have been proposed for regression (Duffy & Helmbold, 2000; Friedman, 1999; Karakoulas & Shawe-Taylor, 1999; Rätsch et al., 2000). While these methods involve diverse objectives and optimization approaches, they are alike in that new hypotheses are formed not by simply changing the example weights, but instead by modifying the target values. As such they can be viewed as forms of forward stage-wise additive models (Hastie & Tibshirani, 1990), which produce hypotheses sequentially to reduce residual error. We study a simple example of this approach, in which hypothesis T is trained not to produce the target output y^i on a given case i, but instead to fit the current residual, r_T^i, where r_T^i = y^i - Σ_{t=1}^{T-1} c_t f_t(x^i). 
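This residual-fitting scheme can be sketched as follows; `fit_weak` is an assumed helper (illustrative, not from the paper) that fits one hypothesis to the current residuals and returns it together with its coefficient:

```python
def stagewise_residual_fit(X, y, fit_weak, T):
    """Forward stage-wise additive modelling (sketch).

    Each new hypothesis is trained on the current residuals
    r^i = y^i - sum_{t<T} c_t f_t(x^i), rather than on y itself.
    `fit_weak(X, r)` returns a fitted function f and its coefficient c.
    """
    residual = list(y)
    ensemble = []
    for _ in range(T):
        f, c = fit_weak(X, residual)
        ensemble.append((c, f))
        # Subtract the new stage's contribution from the residuals.
        residual = [r - c * f(x) for r, x in zip(residual, X)]
    return lambda x: sum(c * f(x) for c, f in ensemble)

# A trivial weak learner that predicts the mean of the residuals:
mean_fit = lambda X, r: ((lambda x, m=sum(r) / len(r): m), 1.0)
predict = stagewise_residual_fit([0, 1, 2], [1.0, 2.0, 3.0], mean_fit, T=3)
```

With the constant learner, the first stage absorbs the mean of y and later stages contribute nothing; with a real weak learner each stage keeps reducing the residual error.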
Note that this approach develops a series of hypotheses all based on optimizing a common objective, but it deviates from standard boosting in that the distribution of examples is not used to control the generation of hypotheses, and each hypothesis is not trained to learn the same function. \n\n2 An Objective Function for Boosting Regression Problems \n\nWe derive a boosting algorithm for regression from a different objective function. This algorithm is similar to the original classification boosting method in that the objective is multiplicative in the hypotheses' costs, which means that the target outputs are not altered after each stage, but rather the objective for each hypothesis is formed simply by re-weighting the training distribution. The objective function is: \n\nJ_T = (1/n) Σ_{i=1}^n (Π_{t=1}^T c_t^{-1/2}) exp[Σ_{t=1}^T c_t (f_t(x^i) - y^i)^2]   (1) \n\nHere, training hypothesis T to minimize J_T, the cost after T stages, amounts to minimizing the exponentiated squared error of a weighted training distribution: \n\nΣ_{i=1}^n w_T^i (c_T^{-1/2} exp[c_T (f_T(x^i) - y^i)^2]) \n\nWe update each weight by multiplying by its respective error, and form the training distribution for the next hypothesis by normalizing these updated weights. In the standard AdaBoost algorithm, the combination coefficient c_t can be analytically determined by solving ∂J_t/∂c_t = 0 for c_t. Unfortunately, one cannot analytically determine the combination coefficient c_t in our algorithm, but a simple line search can be used to find the value of c_t that minimizes the cost J_t. We limit c_t to be between 0 and 1. Finally, optimizing J with respect to y produces a simple linear combination rule for the estimate: ŷ = Σ_t c_t f_t(x) / Σ_t c_t. \n\nWe introduce a constant τ as a threshold used to demarcate correct from incorrect responses. 
This threshold is the single parameter of this algorithm that must be chosen in a problem-dependent manner. It is used to judge when the performance of a new hypothesis warrants its inclusion: ε_t = Σ_i p_t^i exp[(f_t(x^i) - y^i)^2 - τ] < 1. The algorithm can be summarized as follows: \n\nNew Boosting Algorithm \n\n1. Input: \n\n• training set examples (x^1, y^1), ..., (x^n, y^n) with y ∈ ℝ; \n• WeakLearn: learning procedure produces a hypothesis f_t(x) whose accuracy on the training set is judged according to J \n\n2. Choose initial distribution p_1(x^i) = p_1^i = w_1^i = 1/n \n\n3. Iterate: \n\n• Call WeakLearn — minimize J_t with distribution p_t \n• Accept iff ε_t = Σ_i p_t^i exp[(f_t(x^i) - y^i)^2 - τ] < 1 \n• Set 0 ≤ c_t ≤ 1 to minimize J_t (using line search) \n• Update training distribution: w_{t+1}^i = w_t^i c_t^{-1/2} exp[c_t (f_t(x^i) - y^i)^2]; p_{t+1}^i = w_{t+1}^i / Σ_{j=1}^n w_{t+1}^j \n\n4. Estimate output ŷ on input x: ŷ = Σ_t c_t f_t(x) / Σ_t c_t \n\n3 Proof of Convergence \n\nTheorem: Assume that for all t ≤ T, hypothesis t makes error ε_t on its distribution. If the combined output ŷ is considered to be in error iff (ŷ - y)^2 > τ, then the output of the boosting algorithm (after T stages) will have error at most E, where \n\nE = P[(ŷ^i - y^i)^2 > τ] ≤ Π_{t=1}^T ε_t exp[τ(T - Σ_{t=1}^T c_t)] \n\nProof: We follow the approach used in the AdaBoost proof (Freund & Schapire, 1997). We show that the sum of the weights at stage T is bounded above by a constant times the product of the ε_t's, while at the same time, for each input i that is incorrect, its corresponding weight w_T^i at stage T is significant. \n\nΣ_i w_{T+1}^i ≤ Π_{t=1}^T c_t^{-1/2} exp(τ) ε_t \n\nThe inequality holds because 0 ≤ c_t ≤ 1. We now compute the new weights: \n\nΣ_t c_t (f_t(x^i) - y^i)^2 = [Σ_t c_t][Var(f^i) + (ŷ^i - y^i)^2] ≥ [Σ_t c_t][(ŷ^i - y^i)^2] \n\nwhere ŷ^i = Σ_t c_t f_t(x^i) / Σ_t c_t and Var(f^i) = Σ_t c_t (f_t(x^i) - ŷ^i)^2 / Σ_t c_t. 
Thus, \n\nw_{T+1}^i = (Π_{t=1}^T c_t^{-1/2}) exp(Σ_{t=1}^T c_t (f_t(x^i) - y^i)^2) ≥ (Π_{t=1}^T c_t^{-1/2}) exp([Σ_t c_t][(ŷ^i - y^i)^2]) \n\nNow consider an example input k such that the final answer is an error. Then, by definition, (y^k - ŷ^k)^2 > τ, so w_{T+1}^k ≥ (Π_t c_t^{-1/2}) exp(τ Σ_t c_t). If E is the total error rate of the combined output, then: \n\nΣ_i w_{T+1}^i ≥ Σ_{k: k error} w_{T+1}^k ≥ E (Π_{t=1}^T c_t^{-1/2}) exp(τ Σ_{t=1}^T c_t) \n\nE ≤ (Σ_i w_{T+1}^i)(Π_t c_t^{1/2}) exp(-τ Σ_t c_t) ≤ Π_t ε_t exp[τ(T - Σ_t c_t)] \n\nNote that as in the binary AdaBoost theorem, there are no assumptions made here about ε_t, the error rate of individual hypotheses. If all ε_t = Δ < 1, then E < Δ^T exp[τ(T - Σ_t c_t)], which is exponentially decreasing as long as c_t → 1. \n\n4 Comparing the Objectives \n\nWe can compare the objectives by adopting a probabilistic framework. We associate a probability distribution with the output of each hypothesis on input x, and combine them to form a consensus model M by multiplying the distributions: g(y|x, M) ≡ Π_t p_t(y|x, θ_t), where θ_t are parameters specific to hypothesis t. If each hypothesis t produces a single output f_t(x) and has confidence c_t assigned to it, then p_t(y|x, θ_t) can be considered a Gaussian with mean f_t(x) and variance 1/c_t: \n\ng(y|x, M) = k [Π_t c_t^{1/2}] exp[-Σ_t c_t (y - f_t(x))^2] \n\nModel parameters can be tuned to maximize g(y*|x, M), where y* is the target for x; our objective (Eq. 1) is the expected value of the reciprocal of g(y*|x, M). 
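The multiplicative combination of Gaussian hypotheses is what yields the linear combination rule ŷ = Σ_t c_t f_t(x) / Σ_t c_t: completing the square in the exponent Σ_t c_t (y - f_t(x))^2 gives c (y - ŷ)^2 plus a constant, with c = Σ_t c_t. A quick numeric check of this identity (illustrative code, not from the paper):

```python
def combine(c, f):
    """Combine Gaussian hypotheses N(f_t, 1/c_t) by multiplying densities.

    Completing the square in sum_t c_t (y - f_t)^2 yields
    c (y - ybar)^2 + const, with c = sum_t c_t and
    ybar = sum_t c_t f_t / sum_t c_t (precision-weighted mean).
    """
    ctot = sum(c)
    ybar = sum(ct * ft for ct, ft in zip(c, f)) / ctot
    return ctot, ybar

def exponent(c, f, y):
    # The (negated) exponent of the unnormalized product density.
    return sum(ct * (y - ft) ** 2 for ct, ft in zip(c, f))

c, f = [0.5, 1.0, 0.25], [1.0, 2.0, 4.0]
ctot, ybar = combine(c, f)
# Since exponent(y) = ctot * (y - ybar)^2 + const, differences of the
# exponent at two points depend only on the quadratic term:
d1 = exponent(c, f, 3.0) - exponent(c, f, ybar)
```

Here `d1` equals ctot·(3.0 − ŷ)², confirming that the product is itself a Gaussian centered at the weighted mean.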
\nAn alternative objective can be derived by first normalizing g(y|x, M): \n\np(y|x, M) = g(y|x, M) / ∫_{y'} g(y'|x, M) dy' = Π_t p_t(y|x, θ_t) / ∫_{y'} Π_t p_t(y'|x, θ_t) dy' \n\nThis probability model underlies the product-of-experts model (Hinton, 2000) and the logarithmic opinion pool (Bordley, 1982). If we again assume p_t(y|x, θ_t) ~ N(f_t(x), c_t^{-1}), then p(y|x, M) is a Gaussian, with mean f̄(x) = Σ_t c_t f_t(x) / Σ_t c_t and inverse variance c = Σ_t c_t. The objective for this model is: \n\nJ_R = -log p(y*|x, M) = c[y* - f̄(x)]^2 - (1/2) log c   (2) \n\nThis objective corresponds to a type of residual-fitting algorithm. If r(x) = [y* - f̄(x)], and {c_t} for t < T are assumed frozen, then training f_T to minimize J_R is achieved by using r(x) as a target. \n\nThese objectives can be further compared w.r.t. a bias-variance decomposition (Geman et al., 1992; Heskes, 1998). The main term in our objective can be re-expressed: \n\nΣ_t c_t [y* - f_t(x)]^2 = Σ_t c_t [y* - f̄(x)]^2 + Σ_t c_t [f_t(x) - f̄(x)]^2 = bias + variance \n\nMeanwhile, the main term of J_R corresponds to the bias term. Hence a new hypothesis can minimize J_R by having low error (f_t(x) = y*), or with a deviant (ambiguous) response (f_t(x) ≠ f̄(x)) (Krogh & Vedelsby, 1995). Thus our objective attempts to minimize the average error of the models, while the residual-fitting objective minimizes the error of the average model. \n\nFigure 1: Generalization results for our gradient-based boosting algorithm, compared to the residual-fitting and mixture-of-experts algorithms. Left: Test problem F1; Right: Boston housing data. Normalized mean-squared error is plotted against the number of stages of boosting (or number of experts for the mixture-of-experts). 
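Before turning to the experiments, one stage of the algorithm of Section 2 — the acceptance test, the line search for c_t, and the weight update — can be sketched as follows (a sketch with illustrative helper names, not the authors' implementation):

```python
import math

def stage_error(p, preds, y, tau):
    """eps_t = sum_i p^i exp[(f_t(x^i) - y^i)^2 - tau]; accept the stage iff < 1."""
    return sum(pi * math.exp((fi - yi) ** 2 - tau)
               for pi, fi, yi in zip(p, preds, y))

def line_search_c(w, preds, y, steps=1000):
    """Grid line search for c_t in (0, 1] minimizing the stage cost
    J_t = sum_i w^i c^{-1/2} exp[c (f_t(x^i) - y^i)^2]."""
    def cost(c):
        return sum(wi * c ** -0.5 * math.exp(c * (fi - yi) ** 2)
                   for wi, fi, yi in zip(w, preds, y))
    return min((k / steps for k in range(1, steps + 1)), key=cost)

def reweight(w, preds, y, c):
    """Weight update w_{t+1}^i = w_t^i c^{-1/2} exp[c (f_t(x^i) - y^i)^2],
    returning both the raw weights and the normalized distribution p_{t+1}."""
    new_w = [wi * c ** -0.5 * math.exp(c * (fi - yi) ** 2)
             for wi, fi, yi in zip(w, preds, y)]
    z = sum(new_w)
    return new_w, [wi / z for wi in new_w]
```

For a hypothesis that fits perfectly, the cost reduces to c^{-1/2} times the total weight, so the line search pushes c_t to 1 — consistent with the convergence condition c_t → 1 in Section 3.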
\n\n5 Empirical Tests of Algorithm \n\nWe report results comparing the performance of our new algorithm with two other algorithms. The first is a residual-fitting algorithm based on the J_R objective (Eq. 2), but the coefficients are not normalized. The second algorithm is a version of the mixture-of-experts algorithm (Jacobs et al., 1991). Here the hypotheses (or experts) are trained simultaneously. In the standard mixture-of-experts the combination coefficients depend on the input; to make this model comparable to the others, we allowed each expert one input-independent, adaptable coefficient. This algorithm provides a good alternative to the greedy stage-wise methods, in that the experts are trained simultaneously to collectively fit the data. \n\nWe evaluate these algorithms on two problems. The first is the nonlinear prediction problem F1 (Friedman, 1991), which has 10 independent input variables uniform in [0, 1]: y = 10 sin(πx_1 x_2) + 20(x_3 - .5)^2 + 10x_4 + 5x_5 + n, where n is a random variable drawn from a mean-zero, unit-variance normal distribution. In this problem, only five input variables (x_1 to x_5) have predictive value. We rescaled the target values y to be in [0, 3]. We used 400 training examples, and 100 validation and test examples. The second test problem is the standard Boston Housing problem. Here there are 506 examples and twelve continuous input variables. We scaled the input variables to be in [0, 1], and the outputs to be in [0, 5]. We used 400 of the examples for training, 50 for validation, and the remainder to test generalization. We used neural networks as the hypotheses and back-propagation as the learning procedure to train them. Each network had a layer of tanh() units between the input units and a single linear output. For each algorithm, we used early stopping with a validation set in order to reduce over-fitting in the hypotheses. 
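The F1 benchmark described above is easy to regenerate; a sketch (before the rescaling of y to [0, 3] that the paper applies):

```python
import math
import random

def friedman_f1(n, seed=0):
    """Sample Friedman's F1 benchmark: ten inputs uniform in [0, 1],
    y = 10 sin(pi x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + N(0, 1);
    inputs x6..x10 carry no predictive information."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.random() for _ in range(10)]
        target = (10 * math.sin(math.pi * x[0] * x[1])
                  + 20 * (x[2] - 0.5) ** 2
                  + 10 * x[3] + 5 * x[4]
                  + rng.gauss(0, 1))  # unit-variance observation noise
        X.append(x)
        y.append(target)
    return X, y

X, y = friedman_f1(400)  # the paper uses 400 training examples
```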
\nOne finding was that the other algorithms out-performed ours when the hypotheses were simple: when the weak learners had only one or two hidden nodes, the residual-fitting algorithm reduced test error. With more hidden nodes the relative performance of our algorithm improved. Figure 1 shows average results for three-hidden-unit networks over 20 runs of each algorithm on the two problems, with examples randomly assigned to the three sets on each run. The results were consistent for different values of τ in our algorithm; here τ = 0.1. Overall, the residual-fitting algorithm exhibited more over-fitting than our method. Over-fitting in these approaches may be tempered: a regularization technique known as shrinkage, which scales combination coefficients by a fractional parameter, has been found to improve generalization in gradient boosting applications to classification (Friedman, 1999). Finally, the mixture-of-experts algorithm generally out-performed the sequential training algorithm. A drawback of this method is the need to specify the number of hypotheses in advance; however, given that number, simultaneous training is likely less prone to local minima than the sequential approaches. \n\n6 Conclusion \n\nWe have proposed a new boosting algorithm for regression problems. Like several recent boosting methods for regression, the parameters and updates can be derived from a single common objective. Unlike these methods, our algorithm forms new hypotheses by simply modifying the distribution over training examples. Preliminary empirical comparisons have suggested that our method will not perform as well as a residual-fitting approach for simple hypotheses, but it works well for more complex ones, and it seems less prone to over-fitting. 
The lack of over-fitting in our method can be traced to the inherent bias-variance tradeoff, as new hypotheses are forced to resemble existing ones if they cannot improve the combined estimate. We are exploring an extension that brings our method closer to the full mixture-of-experts. The combination coefficients can be input-dependent: a learner returns not only f_t(x^i) but also k_t(x^i) ∈ [0, 1], a measure of confidence in its prediction. This elaboration makes the weak learning task harder, but may extend the applicability of the algorithm: letting each learner focus on a subset of its weighted training distribution permits a divide-and-conquer approach to function approximation. \n\nReferences \n\n[1] Bordley, R. (1982). A multiplicative formula for aggregating probability assessments. Management Science, 28, 1137-1148. \n[2] Breiman, L. (1997). Prediction games and arcing classifiers. TR 504, Statistics Dept., UC Berkeley. \n[3] Duffy, N. & Helmbold, D. (2000). Leveraging for regression. In Proceedings of COLT, 13. \n[4] Freund, Y. & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Comp. and System Sci., 55, 119-139. \n[5] Friedman, J. H. (1999). Greedy function approximation: A gradient boosting machine. TR, Dept. of Statistics, Stanford University. \n[6] Friedman, J. H., Hastie, T., & Tibshirani, R. (1999). Additive logistic regression: A statistical view of boosting. Annals of Statistics, to appear. \n[7] Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58. \n[8] Hastie, T. & Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall. \n[9] Heskes, T. (1998). Bias-variance decompositions for likelihood-based estimators. Neural Computation, 10, 1425-1433. \n[10] Hinton, G. E. (2000). 
Training products of experts by minimizing contrastive divergence. GCNU TR 2000-004, Gatsby Computational Neuroscience Unit, University College London. \n[11] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79-87. \n[12] Karakoulas, G., & Shawe-Taylor, J. (1999). Towards a strategy for boosting regressors. In Advances in Large Margin Classifiers, Smola, Bartlett, Schölkopf & Schuurmans (Eds.). \n[13] Krogh, A. & Vedelsby, J. (1995). Neural network ensembles, cross-validation, and active learning. In NIPS 7. \n[14] Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Boosting algorithms as gradient descent in function space. In NIPS 11. \n[15] Rätsch, G., Mika, S., Onoda, T., Lemm, S. & Müller, K.-R. (2000). Barrier boosting. In Proceedings of COLT, 13. \n[16] Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197-227. \n[17] Schapire, R. E. & Singer, Y. (1998). Improved boosting algorithms using confidence-rated predictions. In Proceedings of COLT, 11. \n", "award": [], "sourceid": 1797, "authors": [{"given_name": "Richard", "family_name": "Zemel", "institution": null}, {"given_name": "Toniann", "family_name": "Pitassi", "institution": null}]}