{"title": "A Theory of Multiclass Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 1714, "page_last": 1722, "abstract": "", "full_text": "A Theory of Multiclass Boosting\r\n\r\nIndraneel Mukherjee\r\n\r\nRobert E. Schapire\r\n\r\nPrinceton University, Department of Computer Science, Princeton, NJ 08540\r\n{imukherj,schapire}@cs.princeton.edu\r\n\r\nAbstract\r\nBoosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the \"correct\" requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements.\r\n\r\n1\r\n\r\nIntroduction\r\n\r\nBoosting [17] refers to a general technique of combining rules of thumb, or weak classifiers, to form highly accurate combined classifiers. Minimal demands are placed on the weak classifiers, so that a variety of learning algorithms, also called weak-learners, can be employed to discover these simple rules, making the algorithm widely applicable. The theory of boosting is well-developed for the case of binary classification. In particular, the exact requirements on the weak classifiers in this setting are known: any algorithm that predicts better than random on any distribution over the training set is said to satisfy the weak learning assumption. Further, boosting algorithms that minimize loss as efficiently as possible have been designed. Specifically, it is known that the Boost-by-majority [6] algorithm is optimal in a certain sense, and that AdaBoost [11] is a practical approximation. 
Such an understanding would be desirable in the multiclass setting as well, since many natural classification problems involve more than two labels, e.g. recognizing a digit from its image, natural language processing tasks such as part-of-speech tagging, and object recognition in vision. However, for such multiclass problems, a complete theoretical understanding of boosting is lacking. In particular, we do not know the \"correct\" way to define the requirements on the weak classifiers, nor has the notion of optimal boosting been explored in the multiclass setting. Straightforward extensions of the binary weak-learning condition to multiclass do not work. Requiring less error than random guessing on every distribution, as in the binary case, turns out to be too weak for boosting to be possible when there are more than two labels. On the other hand, requiring more than 50% accuracy even when the number of labels is much larger than two is too stringent, and simple weak classifiers like decision stumps fail to meet this criterion, even though they often can be combined to produce highly accurate classifiers [9]. The most common approaches so far have relied on reductions to binary classification [2], but it is hardly clear that the weak-learning conditions implicitly assumed by such reductions are the most appropriate. The purpose of a weak-learning condition is to clarify the goal of the weak-learner, thus aiding in its design, while providing a specific minimal guarantee on performance that can be exploited by a boosting algorithm. These considerations may significantly impact learning and generalization because knowing the correct weak-learning conditions might allow the use of simpler weak classifiers, which in turn can help prevent overfitting. Furthermore, boosting algorithms that more efficiently and effectively minimize training error may prevent underfitting, which can also be important. 
In this paper, we create a broad and general framework for studying multiclass boosting that formalizes the interaction between the boosting algorithm and the weak-learner. Unlike much, but not all, of the previous work on multiclass boosting, we focus specifically on the most natural, and perhaps weakest, case in which the weak classifiers are genuine classifiers in the sense of predicting a single multiclass label for each instance. Our new framework allows us to express a range of weak-learning conditions, both new ones and most of the ones that had previously been assumed (often only implicitly). Within this formalism, we can also now finally make precise what is meant by correct weak-learning conditions that are neither too weak nor too strong. We focus particularly on a family of novel weak-learning conditions that have an especially appealing form: like the binary conditions, they require performance that is only slightly better than random guessing, though with respect to performance measures that are more general than ordinary classification error. We introduce a whole family of such conditions since there are many ways of randomly guessing on more than two labels, a key difference between the binary and multiclass settings. Although these conditions impose seemingly mild demands on the weak-learner, we show that each one of them is powerful enough to guarantee boostability, meaning that some combination of the weak classifiers has high accuracy. And while no individual member of the family is necessary for boostability, we also show that the entire family taken together is necessary in the sense that for every boostable learning problem, there exists one member of the family that is satisfied. Thus, we have identified a family of conditions which, as a whole, is necessary and sufficient for multiclass boosting.
Moreover, we can combine the entire family into a single weak-learning condition that is necessary and sufficient by taking a kind of union, or logical OR, of all the members. This combined condition can also be expressed in our framework. With this understanding, we are able to characterize previously studied weak-learning conditions. In particular, the condition implicitly used by AdaBoost.MH [19], which is based on a one-against-all reduction to binary, turns out to be strictly stronger than necessary for boostability. This also applies to AdaBoost.M1 [9], the most direct generalization of AdaBoost to multiclass, whose conditions can be shown to be equivalent to those of AdaBoost.MH in our setting. On the other hand, the condition implicit to Zhu et al.'s SAMME algorithm [21] is too weak in the sense that even when the condition is satisfied, no boosting algorithm can guarantee to drive down the training error. Finally, the condition implicit to AdaBoost.MR [19, 9] (also called AdaBoost.M2) turns out to be exactly necessary and sufficient for boostability. Employing proper weak-learning conditions is important, but we also need boosting algorithms that can exploit these conditions to effectively drive down error. For a given weak-learning condition, the boosting algorithm that drives down training error most efficiently in our framework can be understood as the optimal strategy for playing a certain two-player game. These games are nontrivial to analyze. However, using the powerful machinery of drifting games [8, 16], we are able to compute the optimal strategy for the games arising out of each weak-learning condition in the family described above. These optimal strategies have a natural interpretation in terms of random walks, a phenomenon that has been observed in other settings [1, 6]. Our focus in this paper is only on minimizing training error, which, for the algorithms we derive, provably decreases exponentially fast with the number of rounds of boosting. 
Such results can be used in turn to derive bounds on the generalization error using standard techniques that have been applied to other boosting algorithms [18, 11, 13]. (We omit these due to lack of space.) The game-theoretic strategies are non-adaptive in that they presume prior knowledge about the edge, that is, how much better than random are the weak classifiers. Algorithms that are adaptive, such as AdaBoost, are much more practical because they do not require such prior information. We show therefore how to derive an adaptive boosting algorithm by modifying one of the game-theoretic strategies. We present experiments aimed at testing the efficacy of the new methods when working with a very weak weak-learner to check that the conditions we have identified are indeed weaker than others that had previously been used. We find that our new adaptive strategy achieves low test error compared to other multiclass boosting algorithms which usually heavily underfit. This validates the potential practical benefit of a better theoretical understanding of multiclass boosting.

Previous work. The first boosting algorithms were given by Schapire [15] and Freund [6], followed by their AdaBoost algorithm [11]. Multiclass boosting techniques include AdaBoost.M1 and AdaBoost.M2 [11], as well as AdaBoost.MH and AdaBoost.MR [19]. Other approaches include [5, 21]. There are also more general approaches that can be applied to boosting including [2, 3, 4, 12]. Two game-theoretic perspectives have been applied to boosting. The first one [10, 14] views the weak-learning condition as a minimax game, while drifting games [16, 6] were designed to analyze the most efficient boosting algorithms. These games have been further analyzed in the multiclass and continuous time setting in [8].

2  Framework

We introduce some notation. Unless otherwise stated, matrices will be denoted by bold capital letters like M, and vectors by bold small letters like v.
Entries of a matrix and a vector will be denoted M(i, j) or v(i), while M(i) will denote the i-th row of a matrix. The inner product of two vectors u, v is denoted ⟨u, v⟩. The Frobenius inner product Tr(MM'ᵀ) of two matrices M, M' will be denoted M • M'. The indicator function is denoted 1[·]. The set of all distributions over {1, . . . , k} will be denoted Δ{1, . . . , k}.

In multiclass classification, we want to predict the labels of examples lying in some set X. Each example x ∈ X has a unique label y in the set {1, . . . , k}, where k ≥ 2. We are provided a training set of labeled examples {(x_1, y_1), . . . , (x_m, y_m)}. Boosting combines several mildly powerful predictors, called weak classifiers, to form a highly accurate combined classifier, and has been previously applied for multiclass classification. In this paper, we only allow weak classifiers that predict a single class for each example. This is appealing, since the combined classifier has the same form, although it differs from what has been used in much previous work.

We adopt a game-theoretic view of boosting. A game is played between two players, Booster and Weak-Learner, for a fixed number of rounds T. With binary labels, Booster outputs a distribution in each round, and Weak-Learner returns a weak classifier achieving more than 50% accuracy on that distribution. The multiclass game is an extension of the binary game. In particular, in each round t:
(1) Booster creates a cost-matrix C_t ∈ R^{m×k}, specifying to Weak-Learner that the cost of classifying example x_i as l is C_t(i, l). The cost-matrix may not be arbitrary, but should conform to certain restrictions as discussed below.
(2) Weak-Learner returns some weak classifier h_t : X → {1, . . . , k} from a fixed space h_t ∈ H so that the cost incurred, C_t • 1_{h_t} = Σ_{i=1}^m C_t(i, h_t(x_i)), is "small enough", according to some conditions discussed below. Here by 1_h we mean the m × k matrix whose (i, j)-th entry is 1[h(x_i) = j].
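The per-round bookkeeping in step (2) can be sketched in a few lines. This is an illustrative fragment, not code from the paper; the function names and the toy cost matrix are ours, labels are 0-indexed, and the weak classifier is represented simply by its vector of predictions.

```python
import numpy as np

def indicator_matrix(h_preds, k):
    """Build the m-by-k matrix 1_h whose (i, j) entry is 1 iff h predicts j on
    example i (labels 0-indexed here; the paper uses 1..k)."""
    m = len(h_preds)
    one_h = np.zeros((m, k))
    one_h[np.arange(m), h_preds] = 1.0
    return one_h

def weak_learner_cost(C, h_preds):
    """Total cost C . 1_h = sum_i C(i, h(x_i)) incurred by a weak classifier."""
    return sum(C[i, l] for i, l in enumerate(h_preds))

# Toy round: m = 3 examples, k = 3 labels, true label relabeled to 0 everywhere.
C = np.array([[-2.0, 1.0, 1.0],
              [-1.0, 0.0, 1.0],
              [-1.0, 1.0, 0.0]])
h_preds = [0, 2, 1]   # hypothetical weak classifier's predictions
print(weak_learner_cost(C, h_preds))   # -2 + 1 + 1 = 0
```

The Frobenius-inner-product form and the sum form agree, i.e. `(C * indicator_matrix(h_preds, k)).sum()` returns the same number.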
(3) Booster computes a weight α_t for the current weak classifier based on how much cost was incurred in this round.
At the end, Booster predicts according to the weighted plurality vote of the classifiers returned in each round:

H(x) = argmax_{l ∈ {1,...,k}} f_T(x, l),  where  f_T(x, l) = Σ_{t=1}^T 1[h_t(x) = l] α_t.    (1)

By carefully choosing the cost matrices in each round, Booster aims to minimize the training error of the final classifier H, even when Weak-Learner is adversarial. The restrictions on cost-matrices created by Booster, and the maximum cost Weak-Learner can suffer in each round, together define the weak-learning condition being used.

For binary labels, the traditional weak-learning condition states: for any non-negative weights w(1), . . . , w(m) on the training set, the error of the weak classifier returned is at most (1/2 − γ/2) Σ_i w(i). Here γ parametrizes the condition. There are many ways to translate this condition into our language. The one with fewest restrictions on the cost-matrices requires that labeling correctly should be less costly than labeling incorrectly: ∀i : C(i, y_i) ≤ C(i, ȳ_i), where ȳ_i denotes the incorrect label; the restriction on the returned weak classifier h requires less cost than predicting randomly: Σ_i C(i, h(x_i)) ≤ Σ_i [(1/2 + γ/2) C(i, y_i) + (1/2 − γ/2) C(i, ȳ_i)]. By the correspondence w(i) = C(i, ȳ_i) − C(i, y_i), we may verify the two conditions are the same.

We will rewrite this condition after making some simplifying assumptions. Henceforth, without loss of generality, we assume that the true label is always 1. Let C^bin ⊆ R^{m×2} consist of matrices C which satisfy C(i, 1) ≤ C(i, 2). Further, let U^bin_γ ∈ R^{m×2} be the matrix whose every row is (1/2 + γ/2, 1/2 − γ/2). Then Weak-Learner searching space H satisfies the binary weak-learning condition if: ∀C ∈ C^bin, ∃h ∈ H : C • (1_h − U^bin_γ) ≤ 0. There are two main benefits to this reformulation. With linear homogeneous constraints, the mathematics is simplified, as will be apparent later.
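The claimed equivalence between the weighted-error form and the cost-matrix form of the binary condition can be checked numerically. A minimal sketch, with our own function names, the true label fixed to the first column, and a weak classifier given by its per-example correctness:

```python
import numpy as np

def binary_condition_cost_form(C, h_correct, gamma):
    """Check C . (1_h - U^bin_gamma) <= 0 for one weak classifier.
    C rows are (cost of label 1, cost of label 2), true label fixed to 1;
    h_correct[i] is True iff h predicts label 1 on example i."""
    m = len(h_correct)
    U = np.tile([0.5 + gamma / 2, 0.5 - gamma / 2], (m, 1))
    one_h = np.column_stack([h_correct, np.logical_not(h_correct)]).astype(float)
    return ((one_h - U) * C).sum() <= 1e-12

def binary_condition_weighted_error(C, h_correct, gamma):
    """Equivalent statement: weighted error at most (1/2 - gamma/2) sum_i w(i)
    under the correspondence w(i) = C(i, 2) - C(i, 1)."""
    w = C[:, 1] - C[:, 0]
    err = w[~np.asarray(h_correct)].sum()
    return err <= (0.5 - gamma / 2) * w.sum() + 1e-12

C = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 1.0]])  # w = (1, 2, 1)
h = [True, True, False]   # weighted error 1 out of total weight 4
gamma = 0.2
print(binary_condition_cost_form(C, h, gamma),
      binary_condition_weighted_error(C, h, gamma))
```

Both checks return the same verdict for any C in C^bin, since the Σ_i C(i, 1) terms cancel on the two sides of the cost-form inequality.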
More importantly, by varying the restrictions C^bin on the cost vectors and the matrix U^bin_γ, we can generate a vast variety of weak-learning conditions for the multiclass setting k ≥ 2, as we now show. Let C ⊆ R^{m×k} be a collection of cost-matrices, and let B ∈ R^{m×k} be a matrix, which we call the baseline; we say a weak classifier space H satisfies the condition (C, B) if

∀C ∈ C, ∃h ∈ H : C • (1_h − B) ≤ 0,  i.e.,  Σ_{i=1}^m C(i, h(x_i)) ≤ Σ_{i=1}^m ⟨C(i), B(i)⟩.    (2)

In (2), the variable matrix C specifies how costly each misclassification is, while the baseline B specifies a weight for each misclassification. The condition therefore states that a weak classifier should not exceed the average cost when weighted according to baseline B. This large class of weak-learning conditions captures many previously used conditions, such as the ones used by AdaBoost.M1 [9], AdaBoost.MH [19] and AdaBoost.MR [9, 19] (see below), as well as novel conditions introduced in the next section.

By studying this vast class of weak-learning conditions, we hope to find the one that will serve the main purpose of the boosting game: finding a convex combination of weak classifiers that has zero training error. For this to be possible, at the minimum the weak classifiers should be sufficiently rich for such a perfect combination to exist. Formally, a collection H of weak classifiers is eligible for boosting, or simply boostable, if there exists a distribution λ on this space that linearly separates the data: ∀i : argmax_{l ∈ {1,...,k}} Σ_{h ∈ H} λ(h) 1[h(x_i) = l] = y_i. The weak-learning condition plays two roles. It rejects spaces that are not boostable, and provides an algorithmic means of searching for the right combination. Ideally, the second factor will not cause the weak-learning condition to impose additional restrictions on the weak classifiers; in that case, the weak-learning condition is merely a reformulation of being boostable that is more appropriate for deriving an algorithm.
In general, it could be too strong, i.e., certain boostable spaces will fail to satisfy the conditions. Or it could be too weak, i.e., non-boostable spaces might satisfy such a condition. Booster strategies relying on either of these conditions will fail to drive down error; the former due to underfitting, and the latter due to overfitting. In the next section we will describe conditions captured by our framework that avoid being too weak or too strong.

3  Necessary and sufficient weak-learning conditions

The binary weak-learning condition has an appealing form: for any distribution over the examples, the weak classifier needs to achieve error not greater than that of a random player who guesses the correct answer with probability 1/2 + γ. Further, this is the weakest condition under which boosting is possible, as follows from a game-theoretic perspective [10, 14]. Multiclass weak-learning conditions with similar properties are missing in the literature. In this section we show how our framework captures such conditions.

In the multiclass setting, we model a random player as a baseline predictor B ∈ R^{m×k} whose rows are distributions over the labels, B(i) ∈ Δ{1, . . . , k}. The prediction on example i is a sample from B(i). We only consider the space of edge-over-random baselines B^eor_γ ⊆ R^{m×k} who have a faint clue about the correct answer. More precisely, any baseline B ∈ B^eor_γ in this space is more likely to predict the correct label than an incorrect one on every example i: ∀l ≠ 1 : B(i, 1) ≥ B(i, l) + γ, with equality holding for some l.

When k = 2, the space B^eor_γ consists of the unique player U^bin_γ, and the binary weak-learning condition is given by (C^bin, U^bin_γ). The new conditions generalize this to k > 2. In particular, define C^eor to be the multiclass extension of C^bin: any cost-matrix in C^eor should put the least cost on the correct label, i.e., the rows of the cost-matrices should come from the set {c ∈ R^k : ∀l, c(1) ≤ c(l)}.
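One natural member of B^eor_γ, used repeatedly later in the paper, is the baseline closest to uniform: weight (1 − γ)/k on every incorrect label and (1 − γ)/k + γ on the correct one. A small sketch (our own helper names; labels 0-indexed with the correct label in column 0) constructs it and checks membership in B^eor_γ:

```python
import numpy as np

def uniform_eor_baseline(m, k, gamma):
    """The edge-over-random baseline closest to uniform: weight (1-gamma)/k on
    each incorrect label and (1-gamma)/k + gamma on the correct label (column 0)."""
    row = np.full(k, (1.0 - gamma) / k)
    row[0] += gamma
    return np.tile(row, (m, 1))

def is_edge_over_random(B, gamma):
    """Rows must be distributions that favor the correct label (column 0) by at
    least gamma over every incorrect label."""
    rows_sum_to_one = np.allclose(B.sum(axis=1), 1.0)
    has_edge = np.all(B[:, [0]] >= B[:, 1:] + gamma - 1e-12)
    return bool(rows_sum_to_one and has_edge)

B = uniform_eor_baseline(m=4, k=3, gamma=0.1)
print(B[0])                          # [0.4 0.3 0.3]
print(is_edge_over_random(B, 0.1))   # True
```

The purely uniform matrix (all entries 1/k) fails the check, since it has no edge at all.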
Then, for every baseline B ∈ B^eor_γ, we introduce the condition (C^eor, B), which we call an edge-over-random weak-learning condition. Since C • B is the expected cost of the edge-over-random baseline B on matrix C, the constraints (2) imposed by the new condition essentially require better than random performance.

We now present the central results of this section. The seemingly mild edge-over-random conditions guarantee eligibility, meaning weak classifiers that satisfy any one such condition can be combined to form a highly accurate combined classifier.

Theorem 1 (Sufficiency). If a weak classifier space H satisfies a weak-learning condition (C^eor, B), for some B ∈ B^eor_γ, then H is boostable.

The proof involves the Von Neumann Minimax theorem, and is in the spirit of the ones in [10]. On the other hand, the family of such conditions, taken as a whole, is necessary for boostability in the sense that every eligible space of weak classifiers satisfies some edge-over-random condition.

Theorem 2 (Relaxed necessity). For every boostable weak classifier space H, there exists a γ > 0 and B ∈ B^eor_γ such that H satisfies the weak-learning condition (C^eor, B).

The proof shows existence through non-constructive averaging arguments. Theorem 2 states that any boostable weak classifier space will satisfy some condition in our family, but it does not help us choose the right condition. Experiments in Section 5 suggest (C^eor, U_γ) is effective with very simple weak-learners compared to popular boosting algorithms. (Here U_γ ∈ B^eor_γ is the edge-over-random baseline closest to uniform; it has weight (1 − γ)/k on incorrect labels and (1 − γ)/k + γ on the correct label.) However, there are theoretical examples showing each condition in our family is too strong (supplement).

A perhaps extreme way of weakening the condition is by requiring the performance on a cost matrix to be competitive not with a fixed baseline B ∈ B^eor_γ, but with the worst of them:

∀C ∈ C^eor, ∃h ∈ H : C • 1_h ≤ max_{B ∈ B^eor_γ} C • B.    (3)
Condition (3) states that during the course of the same boosting game, Weak-Learner may choose to beat any edge-over-random baseline B ∈ B^eor_γ, possibly a different one for every round and every cost-matrix. This may superficially seem much too weak. On the contrary, this condition turns out to be equivalent to boostability. In other words, according to our criterion, it is neither too weak nor too strong as a weak-learning condition. However, unlike the edge-over-random conditions, it also turns out to be more difficult to work with algorithmically. Furthermore, this condition can be shown to be equivalent to the one used by AdaBoost.MR [19, 9]. This is perhaps remarkable since the latter is based on the apparently completely unrelated all-pairs multiclass-to-binary reduction: the MR condition is given by (C^MR, B^MR_γ), where C^MR consists of cost-matrices that put non-negative costs on incorrect labels and whose rows sum up to zero, while B^MR_γ ∈ R^{m×k} is the matrix that has γ on the first column and −γ on all other columns (supplement). Further, the MR condition, and hence (3), can be shown to be neither too weak nor too strong.

Theorem 3 (MR). A weak classifier space H satisfies AdaBoost.MR's weak-learning condition (C^MR, B^MR_γ) if and only if it satisfies (3). Moreover, this condition is equivalent to being boostable.

Next, we illustrate the strengths of our edge-over-random weak-learning conditions through concrete comparisons with previous algorithms.

Comparison with SAMME. The SAMME algorithm of [21] requires the weak classifiers to achieve less error than uniform random guessing for multiple labels; in our language, their weak-learning condition is ({C : each row of C has the form (−t, t, t, . . . , t), t ≥ 0}, U_γ). As is well-known, this condition is not sufficient for boosting to be possible. In particular, consider the dataset {(a, 1), (b, 2)} with k = 3, m = 2, and a weak classifier space consisting of h_1, h_2 which always predict 1, 2, respectively.
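The two-example dataset just described can be checked numerically. In this sketch (labels 0-indexed, γ = 0.1 an arbitrary choice of ours) the cost matrix shown is the richer one from C^eor discussed next in the text; both always-constant classifiers incur cost 0, while the random player U_γ incurs strictly negative cost, so neither classifier beats it:

```python
import numpy as np

gamma = 0.1   # any gamma in (0, 1) gives the same qualitative outcome
k = 3
# Dataset {(a, 1), (b, 2)}; weak classifiers h1, h2 always predict label 1, 2.
C = np.array([[-1.0,  1.0, 0.0],    # row for a (true label 1), a row in C^eor
              [ 1.0, -1.0, 0.0]])   # row for b (true label 2), a row in C^eor

def baseline_row(true_label):
    row = np.full(k, (1 - gamma) / k)
    row[true_label] += gamma
    return row

U = np.array([baseline_row(0), baseline_row(1)])  # U_gamma, 0-indexed labels

cost_h1 = C[0, 0] + C[1, 0]       # h1 predicts label 1 on both examples
cost_h2 = C[0, 1] + C[1, 1]       # h2 predicts label 2 on both examples
cost_random = (C * U).sum()       # expected cost of the random player U_gamma

print(cost_h1, cost_h2, cost_random)   # 0.0 0.0 -0.2
```

So with this cost matrix both classifiers suffer more cost than U_γ, which is why the space fails the edge-over-random condition even though it passes SAMME's.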
Since neither classifier distinguishes between a, b, we cannot achieve perfect accuracy by combining them in any way. Yet, due to the constraints on the cost-matrix, one of h_1, h_2 will always manage non-positive cost while random always suffers positive cost. On the other hand, our weak-learning condition allows the Booster to choose far richer cost matrices. In particular, when the cost matrix is C = (c(1) = (−1, +1, 0), c(2) = (+1, −1, 0)) ∈ C^eor, both classifiers in the above example suffer more loss than the random player U_γ, and fail to satisfy our condition.

Comparison with AdaBoost.MH. AdaBoost.MH is a popular multiclass boosting algorithm that is based on the one-against-all reduction [19]. However, we show that its implicit demands on the weak classifier space are too strong. We construct a classifier space that satisfies the condition (C^eor, U_γ) in our family, but cannot satisfy AdaBoost.MH's weak-learning condition. Consider a space H that has, for every (1/k + γ)m-element subset of the examples, a classifier that predicts correctly on exactly those elements. The expected loss of a randomly chosen classifier from this space is the same as that of the random player U_γ. Hence H satisfies this weak-learning condition. On the other hand, it can be shown (supplement) that AdaBoost.MH's weak-learning condition is the pair (C^MH, B^MH_γ), where C^MH consists of matrices with non-positive entries on correct labels and non-negative entries on incorrect labels, and where each row of the matrix B^MH_γ is the vector (1/2 + γ/2, 1/2 − γ/2, . . . , 1/2 − γ/2). A quick calculation shows that for any h ∈ H, and the C ∈ C^MH with −1 in the first column and zeroes elsewhere, C • (1_h − B^MH_γ) = m(1/2 − 1/k − γ/2). This is positive when k > 2 and γ is small, so that H fails to satisfy AdaBoost.MH's condition.

4  Algorithms

In this section we devise algorithms by analyzing the boosting games that employ our edge-over-random weak-learning conditions.
We compute the optimum Booster strategy against a completely adversarial Weak-Learner, which here is permitted to choose weak classifiers without restriction, i.e., from the entire space H^all of all possible functions mapping examples to labels. By modeling Weak-Learner adversarially, we make absolutely no assumptions on the algorithm it might use. Hence, error guarantees enjoyed in this situation will be universally applicable. Our algorithms are derived from the very general drifting games framework [16] for solving boosting games, in turn inspired by Freund's Boost-by-majority algorithm [6], which we review next.

The OS Algorithm. Fix the number of rounds T and an edge-over-random weak-learning condition (C, B). For simplicity of presentation we fix the weights α_t = 1 in each round. With f_T defined as in (1), the optimum Booster payoff can be written as

min_{C_1 ∈ C}  max_{h_1 ∈ H^all : C_1 • (1_{h_1} − B) ≤ 0}  · · ·  min_{C_T ∈ C}  max_{h_T ∈ H^all : C_T • (1_{h_T} − B) ≤ 0}  (1/m) Σ_{i=1}^m L(f_T(x_i, 1), f_T(x_i, 2), . . . , f_T(x_i, k)).

Here the function L : R^k → R is error, but we can also consider other loss functions, such as exponential loss, hinge loss, etc., that upper-bound error and are proper, i.e., L(x) is increasing in the weight of the correct label x(1), and decreasing in the weights of the incorrect labels x(l), l ≠ 1. Directly analyzing the optimal payoff is hard. However, Schapire [16] observed that the payoffs can be very well approximated by certain potential functions. Indeed, for any b ∈ R^k define the potential function φ^b_t : R^k → R by the following recurrence: φ^b_0 = L;

φ^b_t(s) = min_{c ∈ R^k : ∀l, c(1) ≤ c(l)}  max_{p ∈ Δ{1,...,k}}  { E_{l∼p}[φ^b_{t−1}(s + e_l)]  :  E_{l∼p}[c(l)] ≤ ⟨b, c⟩ },    (4)

where e_l ∈ R^k is the unit-vector whose l-th coordinate is 1 and the remaining coordinates zero.
These potential functions compute an estimate φ^b_t(s_t) of whether an example x will be misclassified, based on its current state s_t, consisting of counts of votes received so far on the various classes, s_t(l) = Σ_{t'=1}^{t−1} 1[h_{t'}(x) = l], and the number of rounds t remaining. Using these functions, Schapire [16] proposed a Booster strategy, aka the OS strategy, which, in round t, constructs a cost matrix C_t ∈ C whose each row C_t(i) achieves the minimum of the right hand side of (4), with b replaced by B(i), t replaced by T − t, and s replaced by the current state s_t(i). The following theorem provides a guarantee for the loss suffered by the OS algorithm, and also shows that it is the game-theoretically optimum strategy when the number of examples is large.

Theorem 4 (Extension of results in [16]). Suppose the weak-learning condition is given by (C, B). If Booster employs the OS algorithm, then the average potential of the states, (1/m) Σ_{i=1}^m φ^{B(i)}_{T−t}(s_t(i)), never increases in any round. In particular, the loss suffered after T rounds of play is at most (1/m) Σ_{i=1}^m φ^{B(i)}_T(0). Further, for any ε > 0, when the loss function satisfies some mild conditions, and m is large compared to T, k, 1/ε, no Booster strategy can achieve loss ε less than the above bound in T rounds.

Computing the potentials. In order to implement the OS strategy using our weak-learning conditions, we only need to compute the potential φ^b_t for distributions b ∈ Δ{1, . . . , k}. Fortunately, these potentials have a very simple solution in terms of the homogeneous random walk R^b_t(x), the random position of a particle after t time steps that starts at location x ∈ R^k and in each step moves in direction e_l with probability b(l).

Theorem 5. If L is proper, and b ∈ Δ{1, . . . , k} satisfies ∀l : b(1) ≥ b(l), then φ^b_t(s) = E[L(R^b_t(s))]. Furthermore, the vector achieving the minimum in the right hand side of (4) is given by c(l) = φ^b_{t−1}(s + e_l).
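The random-walk characterization in Theorem 5 is easy to evaluate exactly for small t and k by enumerating walk paths. A sketch, with our own function names, using 0-1 loss and the correct label in coordinate 0:

```python
import numpy as np
from itertools import product

def potential(b, t, s, L):
    """phi^b_t(s) = E[L(R^b_t(s))]: expected loss of a random walk that starts
    at s and in each of t steps moves in direction e_l with probability b[l].
    Exact enumeration of all k^t paths; fine for small t and k."""
    k = len(b)
    total = 0.0
    for path in product(range(k), repeat=t):
        pos = np.array(s, dtype=float)
        prob = 1.0
        for l in path:
            pos[l] += 1
            prob *= b[l]
        total += prob * L(pos)
    return total

# 0-1 loss: a mistake iff the correct label (coordinate 0) does not win outright.
L01 = lambda s: float(s[0] <= max(s[1:]))

b = [0.5, 0.25, 0.25]   # an edge-over-random distribution (edge 0.25)
print(potential(b, 1, [0, 0, 0], L01))   # 0.5: the walk errs unless it steps on 0
```

For production use one would replace the enumeration by the dynamic program mentioned below; this exact version is handy for sanity checks.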
Theorem 5 implies the OS strategy chooses the following cost matrix in round t: C_t(i, l) = φ^{B(i)}_{T−t−1}(s_t(i) + e_l), where s_t(i) is the state of example i in round t. Therefore everything boils down to computing the potentials, which is made possible by Theorem 5. There is no simple closed form solution for the non-convex 0-1 loss L(s) = 1[s(1) ≤ max_{l>1} s(l)]. However, we can write the potential φ_t(s) explicitly, and then compute it using dynamic programming in O(t³k) time. This yields very tight bounds. To obtain a more efficient procedure, and one that we will soon show can be made adaptive, we next focus on the exponential loss associated with AdaBoost, which does have a closed form solution.

Lemma 1. If L(s) = exp(η_2(s(2) − s(1))) + · · · + exp(η_k(s(k) − s(1))), where each η_l is positive, then the solution in Theorem 5 evaluates to φ^b_t(s) = Σ_{l=2}^k (a_l)^t e^{η_l(s(l) − s(1))}, where a_l = 1 − (b(1) + b(l)) + e^{η_l} b(l) + e^{−η_l} b(1).

The proof by induction is straightforward. In particular, when the condition is (C^eor, U_γ) and η = (α, α, . . . , α), the relevant potential is φ_t(s) = κ(γ, α)^t Σ_{l=2}^k e^{α(s(l) − s(1))}, where κ(γ, α) = 1 + ((1 − γ)/k)(e^α + e^{−α} − 2) − γ(1 − e^{−α}). The cost-matrix output by the OS algorithm can be simplified, by rescaling, or adding the same number to each coordinate of a cost vector, without affecting the constraints it imposes on a weak classifier, to the following form:

C(i, l) = (e^α − 1) e^{α(s(l) − s(1))}  if l > 1,
C(i, l) = (e^{−α} − 1) Σ_{j=2}^k e^{α(s(j) − s(1))}  if l = 1,    (5)

where s = s_t(i). With such a choice, Theorem 4 and the form of the potential guarantee that the average loss (1/m) Σ_{i=1}^m L(s_t(i)) of the states s_t(i) changes by a factor of at most κ(γ, α) every round. Hence the final loss is at most (k − 1) κ(γ, α)^T.

Variable edges. So far we have required Weak-Learner to beat random by at least a fixed amount γ > 0 in each round of the boosting game. In reality, the edge over random is larger initially, and gets smaller as the OS algorithm creates harder cost matrices.
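The per-round shrinkage factor for the exponential loss under (C^eor, U_γ) is a one-liner to evaluate, which makes it easy to see when a (γ, α) pair actually makes progress. A sketch (the name `kappa` is ours):

```python
import math

def kappa(gamma, alpha, k):
    """Per-round factor by which the exponential loss shrinks under (C^eor, U_gamma):
    kappa(gamma, alpha) = 1 + ((1-gamma)/k)(e^a + e^-a - 2) - gamma (1 - e^-a)."""
    ea, eai = math.exp(alpha), math.exp(-alpha)
    return 1 + ((1 - gamma) / k) * (ea + eai - 2) - gamma * (1 - eai)

k, gamma, alpha, T = 3, 0.2, 0.3, 50
rate = kappa(gamma, alpha, k)
bound = (k - 1) * rate ** T   # final-loss bound (k - 1) kappa^T
print(rate < 1, bound)
```

Note that κ(γ, 0) = 1 for every γ: with zero weight on the weak classifiers no progress is possible, which is exactly the tension the adaptive tuning below resolves.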
Therefore requiring a fixed edge is either unduly pessimistic or overly optimistic. If the fixed edge is too small, not enough progress is made in the initial rounds, and if the edge is too large, Weak-Learner fails to meet the weak-learning condition in later rounds. We attempt to fix this via two approaches: prescribing a decaying sequence of edges γ_1, . . . , γ_T, or being completely flexible, aka adaptive, with respect to the edges returned by the weak-learner. In either case, we only use the edge-over-random condition (C^eor, U_γ), but with varying values of γ.

Fixed sequence of edges. With a prescribed sequence of edges γ_1, . . . , γ_T, the weak-learning condition (C^eor, U_{γ_t}) in each round t is different. We allow the weights α_1, . . . , α_T to be arbitrary, but they must be fixed in advance. All the results for a uniform edge and weights α_t = 1 hold in this case as well. In particular, by the arguments leading to (5), if we want to minimize Σ_{i=1}^m Σ_{l=2}^k e^{f_t(i,l) − f_t(i,1)}, where f_t is as defined in (1), then the following strategy is optimal: in round t output the cost matrix

C(i, l) = (e^{α_t} − 1) e^{f_{t−1}(i,l) − f_{t−1}(i,1)}  if l > 1,
C(i, l) = (e^{−α_t} − 1) Σ_{j=2}^k e^{f_{t−1}(i,j) − f_{t−1}(i,1)}  if l = 1.    (6)

This will ensure that the expression Σ_{i=1}^m Σ_{l=2}^k e^{f_t(i,l) − f_t(i,1)} changes by a factor of at most κ(γ_t, α_t) in each round. Hence the final loss will be at most (k − 1) Π_{t=1}^T κ(γ_t, α_t).

Adaptive. In the adaptive setting, we depart from the game-theoretic framework in that Weak-Learner is no longer adversarial. Further, we are no longer guaranteed to receive a certain sequence of edges. Since the choice of cost-matrix in (6) does not depend on the edges, we could fix an arbitrary set of weights α_t in advance, follow the same algorithm as before, and enjoy the same bound (k − 1) Π_{t=1}^T κ(γ_t, α_t). The trouble with this is that κ(γ_t, α_t) is not less than 1 unless α_t is small compared to γ_t. To ensure progress, the weight α_t must be chosen adaptively as a function of γ_t.
Since we do not know what edge we will receive, we choose the cost matrix as before, but anticipating an infinitesimally small edge, in the spirit of [7] (and with some rescaling): C(i, l) = lim_{α→0} (1/α) C_α(i, l), where C_α is the matrix in (6) with α_t = α; explicitly,

C(i, l) = e^{f_{t−1}(i,l) − f_{t−1}(i,1)}  if l > 1,
C(i, l) = −Σ_{j=2}^k e^{f_{t−1}(i,j) − f_{t−1}(i,1)}  if l = 1.    (7)

[Figure 1, with panels for the connect4, forest, letter, pendigits, poker and satimage datasets, appears here.] Figure 1: Figure 1(a) plots the final test-errors of M1 (black, dashed), MH (blue, dotted) and New method (red, solid) against the maximum tree-sizes allowed as weak classifiers. Figure 1(b) plots how fast the test-errors of these algorithms drop with rounds, when the maximum tree-size allowed is 5.

Since Weak-Learner cooperates, we expect the edge γ_t of the returned classifier h_t on the supplied cost-matrix lim_{α→0} C_α to be more than just infinitesimal.
In that case, by continuity, there are non-infinitesimal choices of the weight $\alpha_t$ such that the edge $\gamma_t$ achieved by $h_t$ on the cost matrix $C_{\alpha_t}$ remains large enough to ensure $\pi(\gamma_t, \alpha_t) < 1$. In fact, with any choice of $\alpha_t$, we get $\pi(\gamma_t, \alpha_t) \le 1 - \frac{1}{2}(e^{\alpha_t} - e^{-\alpha_t})\gamma_t + \frac{1}{2}(e^{\alpha_t} + e^{-\alpha_t} - 2)$ (see supplement). Tuning $\alpha_t = \frac{1}{2} \ln \frac{1+\gamma_t}{1-\gamma_t}$ results in $\pi(\gamma_t, \alpha_t) \le \sqrt{1 - \gamma_t^2}$. This algorithm is adaptive, and ensures that the loss, and hence the error, after $T$ rounds is at most $(k-1) \prod_{t=1}^T \sqrt{1 - \gamma_t^2} \le (k-1) \exp\left(-\frac{1}{2} \sum_{t=1}^T \gamma_t^2\right)$.

5 Experiments

We report preliminary experimental results on six multiclass UCI datasets. The first set of experiments was aimed at determining the overall performance of our new algorithm. We compared a standard implementation of AdaBoost.M1 (M1) with C4.5 as weak learner, and the BoosTexter implementation of AdaBoost.MH (MH) using stumps [20], against the adaptive algorithm described in Section 4, which we call New method, using a naive greedy tree-searching algorithm as its weak learner. The size of the trees was chosen to be of the same order as the tree sizes used by M1. Test errors after 500 rounds of boosting are plotted in Figure 2. The performance of New method is comparable with M1 and far better than MH (understandably, since stumps are far weaker than trees), even though our weak learner is very naive compared to C4.5.

Figure 2: A plot of the final test-errors of standard implementations of M1, MH and New method after 500 rounds of boosting on the datasets connect4, forest, letter, pendigits, poker and satimage.

We next investigated how each algorithm performs with less powerful weak classifiers, namely decision trees whose size has been sharply limited to various pre-specified limits. Figure 1(a) shows test-error plotted as a function of tree size.
As predicted by our theory, our algorithm succeeds in boosting the accuracy even when the tree size is too small to meet the stronger weak-learning assumptions of the other algorithms. The differences in performance are particularly strong for the smallest tree sizes. More insight is provided by the plots in Figure 1(b) of the rate of convergence of test error with rounds when the tree size allowed is very small (5). Both M1 and MH drive down the error for a few rounds. But since boosting keeps creating harder cost matrices, very soon the small-tree learning algorithms are no longer able to meet the excessive requirements of M1 and MH. Our algorithm, however, makes more reasonable demands that are easily met by the weak learner.

References

[1] Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, pages 415–424, 2008.
[2] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.
[3] Alina Beygelzimer, John Langford, and Pradeep Ravikumar. Error-correcting tournaments. In Algorithmic Learning Theory: 20th International Conference, pages 247–262, 2009.
[4] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, January 1995.
[5] Günther Eibl and Karl-Peter Pfeiffer. Multiclass boosting for weak classifiers. Journal of Machine Learning Research, 6:189–210, 2005.
[6] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995.
[7] Yoav Freund. An adaptive version of the boost by majority algorithm. Machine Learning, 43(3):293–318, June 2001.
[8] Yoav Freund and Manfred Opper.
Continuous drifting games. Journal of Computer and System Sciences, pages 113–132, 2002.
[9] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148–156, 1996.
[10] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325–332, 1996.
[11] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.
[12] Trevor Hastie and Robert Tibshirani. Classification by pairwise coupling. Annals of Statistics, 26(2):451–471, 1998.
[13] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1), February 2002.
[14] Gunnar Rätsch and Manfred K. Warmuth. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6:2131–2152, 2005.
[15] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
[16] Robert E. Schapire. Drifting games. Machine Learning, 43(3):265–291, June 2001.
[17] Robert E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002.
[18] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, October 1998.
[19] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, December 1999.
[20] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, May/June 2000.
[21] Ji Zhu, Hui Zou, Saharon Rosset, and Trevor Hastie.
Multi-class AdaBoost. Statistics and Its Interface, 2:349–360, 2009.