Part of Advances in Neural Information Processing Systems 13 (NIPS 2000)

*Richard Zemel, Toniann Pitassi*

In adaptive boosting, several weak learners trained sequentially are combined to boost the overall algorithm performance. Re(cid:173) cently adaptive boosting methods for classification problems have been derived as gradient descent algorithms. This formulation jus(cid:173) tifies key elements and parameters in the methods, all chosen to optimize a single common objective function. We propose an anal(cid:173) ogous formulation for adaptive boosting of regression problems, utilizing a novel objective function that leads to a simple boosting algorithm. We prove that this method reduces training error, and compare its performance to other regression methods.

The aim of boosting algorithms is to "boost" the small advantage that a hypothesis produced by a weak learner can achieve over random guessing, by using the weak learning procedure several times on a sequence of carefully constructed distribu(cid:173) tions. Boosting methods, notably AdaBoost (Freund & Schapire, 1997), are sim(cid:173) ple yet powerful algorithms that are easy to implement and yield excellent results in practice. Two crucial elements of boosting algorithms are the way in which a new distribution is constructed for the learning procedure to produce the next hy(cid:173) pothesis in the sequence, and the way in which hypotheses are combined to pro(cid:173) duce a highly accurate output. Both of these involve a set of parameters, whose values appeared to be determined in an ad hoc maImer. Recently boosting algo(cid:173) rithms have been derived as gradient descent algorithms (Breiman, 1997; Schapire & Singer, 1998; Friedman et al., 1999; Mason et al., 1999). These formulations justify the parameter values as all serving to optimize a single common objective function. These optimization formulations of boosting originally developed for classification problems have recently been applied to regression problems. However, key prop(cid:173) erties of these regression boosting methods deviate significantly from the classifica(cid:173) tion boosting approach. We propose a new boosting algorithm for regression prob(cid:173) lems, also derived from a central objective function, which retains these properties. In this paper, we describe the original boosting algorithm and summarize boosting methods for regression. We present our method and provide a simple proof that elucidates conditions under which convergence on training error can be guaran(cid:173) teed. We propose a probabilistic framework that clarifies the relationship between various optimization-based boosting methods. Finally, we summarize empirical comparisons between our method and others on some standard problems.

1 A Brief Summary of Boosting Methods

Adaptive boosting methods are simple modular algorithms that operate as follows. Let 9 : X -t Y be the function to be learned, where the label set Y is finite, typ(cid:173) ically binary-valued. The algorithm uses a learning procedure, which has access to n training examples, {(Xl, Y1), ... , (xn, Yn)}, drawn randomly from X x Yac(cid:173) cording to distribution D; it outputs a hypothesis I : X -t Y, whose error is the expected value of a loss function on I(x) , g(x), where X is chosen according to D. Given f, cl > 0 and access to random examples, a strong learning procedure outputs with probability 1 - cl a hypothesis with error at most f, with running time polyno(cid:173) mial in 1/ f, 1/ cl and the number of examples. A weak learning procedure satisfies the same conditions, but where f need only be better than random guessing. Schapire (1990) showed that any weak learning procedure, denoted WeakLeam, can be efficiently transformed ("boosted") into a strong learning procedure. The AdaBoost algorithm achieves this by calling WeakLeam multiple times, in a se(cid:173) quence of T stages, each time presenting it with a different distribution over a fixed training set and finally combining all of the hypotheses. The algorithm maintains a weight w: for each training example i at stage i, and a distribution D t is computed by normalizing these weights. The algorithm loops through these steps:

At stagei, the distribution D t is given to WeakLeam, which generates a hy(cid:173) pothesis It- The error rate ft of It w.r.t. D t is: ft = 2::i f,(x');t'y ' wU 2::7=1 w~ w: * (ft/ (l -

The new training distribution is obtained from the new weights: W;+l

Do not remove: This comment is monitored to verify that the site is working properly