{"title": "Boosting Algorithms for Maximizing the Soft Margin", "book": "Advances in Neural Information Processing Systems", "page_first": 1585, "page_last": 1592, "abstract": null, "full_text": "Boosting Algorithms for Maximizing the Soft Margin\n\nManfred K. Warmuth\u2217\nDept. of Engineering\nUniversity of California\nSanta Cruz, CA, U.S.A.\n\nKaren Glocer\n\nDept. of Engineering\nUniversity of California\nSanta Cruz, CA, U.S.A.\n\nAbstract\n\nGunnar R\u00a8atsch\n\nFriedrich Miescher Laboratory\n\nMax Planck Society\nT\u00a8ubingen, Germany\n\nWe present a novel boosting algorithm, called SoftBoost, designed for sets of bi-\nnary labeled examples that are not necessarily separable by convex combinations\nof base hypotheses. Our algorithm achieves robustness by capping the distribu-\ntions on the examples. Our update of the distribution is motivated by minimizing\na relative entropy subject to the capping constraints and constraints on the edges\nof the obtained base hypotheses. The capping constraints imply a soft margin in\nthe dual optimization problem. Our algorithm produces a convex combination of\nhypotheses whose soft margin is within \u03b4 of its maximum. We employ relative en-\ntropy projection methods to prove an O( ln N\n\u03b42 ) iteration bound for our algorithm,\nwhere N is number of examples.\nWe compare our algorithm with other approaches including LPBoost, Brown-\nBoost, and SmoothBoost. We show that there exist cases where the number of iter-\nations required by LPBoost grows linearly in N instead of the logarithmic growth\nfor SoftBoost. 
In simulation studies we show that our algorithm converges about as fast as LPBoost, faster than BrownBoost, and much faster than SmoothBoost. In a benchmark comparison we illustrate the competitiveness of our approach.

1 Introduction

Boosting methods have been used with great success in many applications like OCR, text classification, natural language processing, drug discovery, and computational biology [13]. For AdaBoost [7] it was frequently observed that the generalization error of the combined hypotheses kept decreasing after the training error had already reached zero [19]. This sparked a series of theoretical studies trying to understand the underlying principles that govern the behavior of ensemble methods [19, 1]. It became apparent that some of the power of ensemble methods lies in the fact that they tend to increase the margin of the training examples. This was consistent with the observation that AdaBoost works well on low-noise problems, such as digit recognition tasks, but not as well on tasks with high noise. On such tasks, better generalization can be achieved by not enforcing a large margin on all training points. This experimental observation was supported by the study of [19], where the generalization error of ensemble methods was bounded by the sum of two terms: the fraction of training points which have a margin smaller than some value ρ, plus a complexity term that depends on the base hypothesis class and ρ. While this worst-case bound can only capture part of what is going on in practice, it nevertheless suggests that in some cases it pays to allow some points to have a small margin or be misclassified if this leads to a larger overall margin on the remaining points.

To cope with this problem, it was necessary to construct variants of AdaBoost which trade off the fraction of examples with margin at least ρ with the size of the margin ρ.
This was typically done by preventing the distribution maintained by the algorithm from concentrating too much on the most difficult examples. This idea is implemented in many algorithms including AdaBoost with soft margins [15], MadaBoost [5], ν-Arc [16, 14], SmoothBoost [21], LPBoost [4], and several others (see references in [13]). For some of these algorithms, significant improvements were shown compared to the original AdaBoost algorithm on high noise data.

∗Supported by NSF grant CCR 9821087.

In parallel, there has been a significant interest in how the linear combination of hypotheses generated by AdaBoost is related to the maximum margin solution [1, 19, 4, 18, 17]. It was shown that AdaBoost generates a combined hypothesis with a large margin, but not necessarily the maximum hard margin [15, 18]. This observation motivated the development of many boosting algorithms that aim to maximize the margin [1, 8, 4, 17, 22, 18]. AdaBoost∗ [17] and TotalBoost [22] provably converge to the maximum hard margin within precision δ in 2 ln(N)/δ² iterations. The other algorithms have worse or no known convergence rates. However, such margin-maximizing algorithms are of limited interest for a practitioner working with noisy real-world data sets, as overfitting is even more problematic for such algorithms than for the original AdaBoost algorithm [1, 8].

In this work we combine these two lines of research into a single algorithm, called SoftBoost, that for the first time implements the soft margin idea in a practical boosting algorithm. SoftBoost finds in O(ln(N)/δ²) iterations a linear combination of base hypotheses whose soft margin is at least the optimum soft margin minus δ.
BrownBoost [6] does not always optimize the soft margin. SmoothBoost and MadaBoost can be related to maximizing the soft margin, but while they have known iteration bounds in terms of other criteria, it is unknown how quickly they converge to the maximum soft margin. From a theoretical point of view the optimization problems underlying SoftBoost as well as LPBoost are appealing, since they directly maximize the margin of a (typically large) subset of the training data [16]. This quantity plays a crucial role in the generalization error bounds [19].

Our new algorithm is most similar to LPBoost because its goal is also to optimize the soft margin. The most important difference is that we use slightly relaxed constraints and a relative entropy to the uniform distribution as the objective function. This leads to a distribution on the examples that is closer to the uniform distribution. An important result of our work is to show that this strategy may help to increase the convergence speed: we will give examples where LPBoost converges much more slowly than our algorithm, linear versus logarithmic growth in N.

The paper is organized as follows: in Section 2 we introduce the notation and the basic optimization problem. In Section 3 we discuss LPBoost and give a separable setting where N/2 iterations are needed by LPBoost to achieve a hard margin within precision .99. In Section 4 we present our new SoftBoost algorithm and prove its iteration bound. We provide an experimental comparison of the algorithms on real and synthetic data in Section 5, and conclude with a discussion in Section 6.

2 Preliminaries

In the boosting setting, we are given a set of N labeled training examples (xn, yn), n = 1 . . . N, where the instances xn are in some domain X and the labels yn ∈ ±1. Boosting algorithms maintain a distribution d on the N examples, i.e. d lies in the N-dimensional probability simplex P^N.
Intuitively, the hard to classify examples receive more weight. In each iteration, the algorithm gives the current distribution to an oracle (a.k.a. base learning algorithm), which returns a new base hypothesis h : X → [−1, 1] with a certain guarantee of performance. This guarantee will be discussed at the end of this section.

One measure of the performance of a base hypothesis h with respect to distribution d is its edge, γh = Σ_{n=1}^{N} dn yn h(xn). When the range of h is ±1 instead of the interval [−1, 1], then the edge is just an affine transformation of the weighted error εh of hypothesis h: i.e. εh(d) = 1/2 − (1/2)γh. A hypothesis that predicts perfectly has edge γ = 1, a hypothesis that always predicts incorrectly has edge γ = −1, and a random hypothesis has edge γ ≈ 0. The higher the edge, the more useful is the hypothesis for classifying the training examples. The edge of a set of hypotheses is defined as the maximum edge of the set.

After a hypothesis is received, the algorithm must update its distribution d on the examples. Boosting algorithms (for the separable case) commonly update their distribution by placing a constraint on the edge of the most recent hypothesis. Such algorithms are called corrective [17]. In totally corrective updates, one constrains the distribution to have small edge with respect to all of the previous hypotheses [11, 22]. The update developed in this paper is an adaptation of the totally corrective update of [22] that handles the inseparable case. The final output of the boosting algorithm is always a convex combination of base hypotheses fw(xn) = Σ_{t=1}^{T} wt ht(xn), where ht is the hypothesis added at iteration t and wt is its coefficient. The margin of a labeled example (xn, yn) is defined as ρn = yn fw(xn).
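The edge, weighted error, and margins above are simple weighted sums; a minimal sketch (illustrative only, with made-up toy data):

```python
import numpy as np

def edge(d, y, h_x):
    """Edge gamma_h = sum_n d_n * y_n * h(x_n) of one base hypothesis."""
    return float(np.sum(d * y * h_x))

def weighted_error(d, y, h_x):
    """For a +/-1-valued hypothesis: eps_h(d) = 1/2 - (1/2) * gamma_h."""
    return 0.5 - 0.5 * edge(d, y, h_x)

def margins(y, H, w):
    """Margins rho_n = y_n * f_w(x_n) where f_w = sum_t w_t h_t.
    H holds the predictions h_t(x_n) as rows; w is a convex combination."""
    return y * (w @ H)

y = np.array([1, 1, -1, 1])           # labels
d = np.full(4, 0.25)                  # uniform distribution on the examples
H = np.array([[1, 1, -1, -1],         # predictions of h_1
              [-1, 1, -1, -1]])       # predictions of h_2

print(edge(d, y, H[0]))               # 0.5
print(weighted_error(d, y, H[0]))     # 0.25
print(margins(y, H, np.array([0.5, 0.5])))
```

A hypothesis with edge 0.5 thus has weighted error 0.25, matching the affine relation above.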
The (hard) margin of a set of examples is taken to be the minimum margin of the set.

It is convenient to define an N-dimensional vector u^m that combines the base hypothesis hm with the labels yn of the N examples: u^m_n := yn hm(xn). With this notation, the edge of the t-th hypothesis becomes d · u^t and the margin of the n-th example w.r.t. a convex combination w of the first t − 1 hypotheses is Σ_{m=1}^{t−1} u^m_n wm.

For a given set of hypotheses {h1, . . . , ht}, the following linear programming problem (1) optimizes the minimum soft margin. The term "soft" here refers to a relaxation of the margin constraint. We now allow examples to lie below the margin but penalize them linearly via slack variables ψn. The dual problem (2) minimizes the maximum edge when the distribution is capped with 1/ν, where ν ∈ {1, . . . , N}:

  ρ∗t(ν) = max_{w,ρ,ψ} ( ρ − (1/ν) Σ_{n=1}^{N} ψn )                        (1)
           s.t. Σ_{m=1}^{t} u^m_n wm ≥ ρ − ψn, for 1 ≤ n ≤ N,
                w ∈ P^t, ψ ≥ 0.

  γ∗t(ν) = min_{d,γ} γ                                                     (2)
           s.t. d · u^m ≤ γ, for 1 ≤ m ≤ t,
                d ∈ P^N, d ≤ (1/ν) 1.

By duality, ρ∗t(ν) = γ∗t(ν). Note that the relationship between capping and the hinge loss has long been exploited by the SVM community [3, 20] and has also been used before for boosting in [16, 14]. In particular, it is known that ρ in (1) is chosen such that N − ν examples have margin at least ρ. This corresponds to ν active constraints in (2).
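The capped dual (2) is a small linear program and can be solved directly with an off-the-shelf LP solver. A sketch assuming SciPy's `linprog` (illustrative only, not the authors' implementation; the function name `min_max_edge` is ours):

```python
import numpy as np
from scipy.optimize import linprog

def min_max_edge(U, nu):
    """Solve the capped dual (2): min_{d,gamma} gamma
    s.t. d . u^m <= gamma for 1 <= m <= t, d in the simplex, d <= (1/nu) 1.
    U is an N x t matrix whose columns are the u^m vectors."""
    N, t = U.shape
    c = np.zeros(N + 1)
    c[-1] = 1.0                                  # minimize gamma (last variable)
    A_ub = np.hstack([U.T, -np.ones((t, 1))])    # d . u^m - gamma <= 0
    b_ub = np.zeros(t)
    A_eq = np.append(np.ones(N), 0.0).reshape(1, -1)  # sum_n d_n = 1
    b_eq = [1.0]
    bounds = [(0.0, 1.0 / nu)] * N + [(None, None)]   # capping: 0 <= d_n <= 1/nu
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:N], res.x[-1]

# Two opposing hypotheses u^1 = (1, -1), u^2 = (-1, 1): the min-max edge
# is attained at d = (1/2, 1/2) with value gamma* = 0.
U = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
d, gamma = min_max_edge(U, nu=1)
```

By the duality noted above, the returned value also equals the optimum soft margin ρ∗t(ν) of (1).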
The case ν = 1 is degenerate: there are no capping constraints in (2) and this is equivalent to the hard margin case.¹

Assumption on the weak learner. We assume that for any distribution d ≤ (1/ν) 1 on the examples, the oracle returns a hypothesis h with edge at least g, for some fixed g. This means that for the corresponding u vector, d · u ≥ g. For ±1-valued hypotheses, this is equivalent to the assumption that the base learner always returns a hypothesis with error at most 1/2 − (1/2)g.

Adding a new constraint can only increase the value γ∗t(ν) of the minimization problem (2), and therefore γ∗t(ν) is non-decreasing in t. It is natural to define γ∗(ν) as the value of (2) w.r.t. the entire hypothesis set from which the oracle can choose. Clearly γ∗t(ν) approaches γ∗(ν) from below. Also, the guarantee g of the oracle can be at most γ∗(ν) because for the optimal distribution d∗ that realizes γ∗(ν), all hypotheses have edge at most γ∗(ν). For computational reasons, g might however be lower than γ∗(ν), and in that case the optimum soft margin we can achieve is g.

3 LPBoost

In iteration t, the LPBoost algorithm [4] sends its current distribution dt−1 to the oracle and receives a hypothesis ht that satisfies dt−1 · ut ≥ g. It then updates its distribution to dt by solving the linear programming problem (1) based on the t hypotheses received so far. The goal of the boosting algorithms is to produce a convex combination of T hypotheses such that γ∗T(ν) ≥ g − δ.
The simplest way to achieve this is to break when this condition is satisfied. Although the guarantee g is typically not known, it is upper bounded by γ̂t = min_{1≤m≤t} dm−1 · u^m, and therefore LPBoost uses the more stringent stopping criterion γ∗t(ν) ≥ γ̂t − δ.

To our knowledge, there is no known iteration bound for LPBoost even though it provably converges to the δ-optimal solution of the optimization problem after it has seen all hypotheses [4, 10]. Empirically, the convergence speed depends on the linear programming optimizer, e.g. simplex or interior point solver [22]. For the first time, we are able to establish a lower bound showing that, independent of the optimizer, LPBoost can require Ω(N) iterations:

Algorithm 1 LPBoost with accuracy param. δ and capping parameter ν
1. Input: S = ⟨(x1, y1), . . . , (xN, yN)⟩, accuracy δ, capping parameter ν ∈ [1, N].
2. Initialize: d0 to the uniform distribution and γ̂0 to 1.
3. Do for t = 1, . . .
   (a) Send dt−1 to oracle and obtain hypothesis ht.
       Set ut_n = ht(xn) yn and γ̂t = min{γ̂t−1, dt−1 · ut}.
       (Assume dt−1 · ut ≥ g, where edge guarantee g is unknown.)
   (b) Update the distribution to any dt that solves the LP problem
       (dt, γ∗t) = argmin_{d,γ} γ  s.t. d · u^m ≤ γ, for 1 ≤ m ≤ t; d ∈ P^N, d ≤ (1/ν) 1.
   (c) If γ∗t ≥ γ̂t − δ then set T = t and break.²
4. Output: fw(x) = Σ_{m=1}^{T} wm hm(x), where the coefficients wm maximize the soft margin over the hypothesis set {h1, . . . , hT} using the LP problem (1).

²When g is known, then one can break already when γ∗t(ν) ≥ g − δ.

Theorem 1 There exists a case where LPBoost requires N/2 iterations to achieve a hard margin that is within δ = .99 of the optimum hard margin.

Proof. Assume we are in the hard margin case (ν = 1). The counterexample has N examples and N/2 + 1 base hypotheses. After N/2 iterations, the optimal value γ∗t(1) for the chosen hypotheses will still be close to −1, whereas after the last hypothesis is added, this value is at least ε/2. Here ε is a precision parameter that is an arbitrarily small number.

Figure 1: The ut vectors that are hard for LPBoost (for ν = 1). (Shown in the original as a table: the rows are the N = 8 examples, the columns are the five ut's, with entries of the form +1, +1 − ε, or −1 + kε for small multiples of ε; the bottom row lists the optimal value γ∗t(1) after each iteration.)

Figure 1 shows the case where N = 8 and T = 5, but it is trivial to generalize this example to any even N. There are 8 examples/rows and the five columns are the ut's of the five available base hypotheses. The examples are separable because if we put half of the weight on the first and last hypothesis, then the margins of all examples are at least ε/2.

We assume that in each iteration the oracle will return the remaining hypothesis with maximum edge. This will result in LPBoost choosing the hypotheses in order, and there will never be any ties. The initial distribution d0 is uniform. At the end of iteration t (1 ≤ t ≤ N/2), the distribution dt will focus all its weight on example N/2 + t, and the optimum mixture of the columns will put all of its weight on the t-th hypothesis that was just received. In other words, the value will be the bolded entries in Figure 1: −1 + 2εt at the end of iteration t = 1, . . . , N/2. After N/2 iterations the value γ∗t(1) of the underlying LP problem will still be close to −1, because ε can be made arbitrarily small. We reasoned already that the value for all N/2 + 1 hypotheses will be positive. So if ε is small enough, then after N/2 iterations LPBoost is still at least .99 away from the optimal solution. □

¹Please note that [20] have previously used the parameter ν with a slightly different meaning, namely ν/N in our notation. We use an unnormalized version of ν denoting a number of examples instead of a fraction.

Although the example set used in the above proof is linearly separable, we can modify it explicitly to argue that capping the distribution on examples will not help, in the sense that "soft" LPBoost with ν > 1 can still have linear iteration bounds. To negate the effect of capping, simply pad out the problem by duplicating all of the rows ν times. There will now be Ñ = Nν examples, and after N/2 = Ñ/(2ν) iterations, the value of the game is still close to −1. This is not a claim that capping has no value. It remains an important technique for making an algorithm more robust to noise.
However, it is not sufficient to improve the iteration bound of LPBoost from linear growth in N to logarithmic. Another attempt might be to modify LPBoost so that at each iteration a base hypothesis is chosen that increases the value of the optimization problem the most. Unfortunately we found similar Ω(N) counterexamples for this heuristic (not shown). It is also easy to see that algorithms related to the SoftBoost algorithm below choose the last hypothesis after the first and finish in just two iterations.

Algorithm 2 SoftBoost with accuracy param. δ and capping parameter ν
1. Input: S = ⟨(x1, y1), . . . , (xN, yN)⟩, desired accuracy δ, and capping parameter ν ∈ [1, N].
2. Initialize: d0 to the uniform distribution and γ̂0 to 1.
3. Do for t = 1, . . .
   (a) Send dt−1 to the oracle and obtain hypothesis ht.
       Set ut_n = ht(xn) yn and γ̂t = min{γ̂t−1, dt−1 · ut}.
       (Assume dt−1 · ut ≥ g, where edge guarantee g is unknown.)
   (b) Update³
       dt = argmin_d ∆(d, d0)  s.t. d · u^m ≤ γ̂t − δ, for 1 ≤ m ≤ t, Σ_n dn = 1, d ≤ (1/ν) 1.
   (c) If the above is infeasible or dt contains a zero then T = t and break.
4. Output: fw(x) = Σ_{m=1}^{T} wm hm(x), where the coefficients wm maximize the soft margin over the hypothesis set {h1, . . . , hT} using the LP problem (1).

³When g is known, replace the upper bound γ̂t − δ by g − δ.

4 SoftBoost

In this section, we present the SoftBoost algorithm, which adds capping to the TotalBoost algorithm of [22]. SoftBoost takes as input a sequence of examples S = ⟨(x1, y1), . . . , (xN, yN)⟩, an accuracy parameter δ, and a capping parameter ν. The algorithm has an oracle available with unknown guarantee g. Its initial distribution d0 is uniform.
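The totally corrective update in step (b) of Algorithm 2 is a convex program; a minimal sketch with a generic SQP solver (SciPy's SLSQP), illustrative only and not the authors' code, with the function name `softboost_update` ours:

```python
import numpy as np
from scipy.optimize import minimize

def softboost_update(U, gamma_hat, delta, nu):
    """Project the uniform distribution d0 onto
    { d in P^N : d . u^m <= gamma_hat - delta for all m, d <= (1/nu) 1 }
    by minimizing the relative entropy Delta(d, d0) = sum_n d_n ln(d_n / d0_n).
    Returns None when the constraint set is empty (Algorithm 2 then breaks).
    U is an N x t matrix whose columns are the u^m vectors seen so far."""
    N, t = U.shape
    d0 = np.full(N, 1.0 / N)

    def rel_entropy(d):
        d = np.clip(d, 1e-12, None)          # keep the log well defined
        return float(np.sum(d * np.log(d / d0)))

    constraints = [{"type": "eq", "fun": lambda d: np.sum(d) - 1.0}]
    for m in range(t):                        # one edge constraint per hypothesis
        constraints.append(
            {"type": "ineq",
             "fun": lambda d, m=m: (gamma_hat - delta) - d @ U[:, m]})
    res = minimize(rel_entropy, d0, method="SLSQP",
                   bounds=[(0.0, 1.0 / nu)] * N, constraints=constraints)
    return res.x if res.success else None

# One hypothesis with u = (1, -1, 1, -1): to push its edge down to
# gamma_hat - delta = -0.2, weight moves toward the examples it gets wrong.
u = np.array([[1.0], [-1.0], [1.0], [-1.0]])
d = softboost_update(u, gamma_hat=0.0, delta=0.2, nu=1)
```

The entropy objective keeps the returned distribution as close to uniform as the edge constraints allow, which is exactly the contrast with LPBoost drawn in the text.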
In each iteration t, the algorithm prompts the oracle for a new base hypothesis, incorporates it into the constraint set, and updates its distribution dt−1 to dt by minimizing the relative entropy ∆(d, d0) := Σ_n dn ln(dn/d0_n) subject to linear constraints:

  dt = argmin_d ∆(d, d0)
       s.t. d · u^m ≤ γ̂t − δ, for 1 ≤ m ≤ t (where γ̂t = min_{1≤m≤t} dm−1 · u^m),
            Σ_n dn = 1, d ≤ (1/ν) 1.

It is easy to solve this optimization problem with vanilla sequential quadratic programming methods (see [22] for details). Observe that removing the relative entropy term from the objective results in a feasibility problem for linear programming where the edges are upper bounded by γ̂t − δ. If we remove the relative entropy and instead minimize the upper bound on the edges, then we arrive at the optimization problem of LPBoost, and logarithmic growth in the number of examples is no longer possible. The relative entropy in the objective assures that the probabilities of the examples are always proportional to their exponentiated negative soft margins (not shown). That is, more weight is put on the examples with low soft margin, which are the examples that are hard to classify.

4.1 Iteration bounds for SoftBoost

Our iteration bound for SoftBoost is very similar to the bound proven for TotalBoost [22], differing only in the additional details related to capping.

Theorem 2 SoftBoost terminates after at most ⌈(2/δ²) ln(N/ν)⌉ iterations with a convex combination whose soft margin is at most δ below the optimum value g.

Proof.
We begin by observing that if the optimization problem at iteration t is infeasible, then γ∗t(ν) > γ̂t − δ ≥ g − δ. Also if dt contains a zero, then since the objective function ∆(d, d0) is strictly convex in d and minimized at the interior point d0, there is no optimal solution in the interior of the simplex. Hence, γ∗t(ν) = γ̂t − δ ≥ g − δ.

Let Ct be the convex subset of probability vectors d ∈ P^N satisfying d ≤ (1/ν) 1 and max_{1≤m≤t} d · u^m ≤ γ̂t − δ. Notice that C0 is the N-dimensional probability simplex where the components are capped to 1/ν. The distribution dt−1 at iteration t − 1 is the projection of d0 onto the closed convex set Ct−1. Because adding a new hypothesis in iteration t results in an additional constraint and γ̂t ≤ γ̂t−1, we have Ct ⊆ Ct−1. If t ≤ T − 1, then our termination condition assures that at iteration t − 1 the set Ct−1 has a feasible solution in the interior of the simplex. Also, d0 lies in the interior and dt ∈ Ct ⊆ Ct−1. These preconditions assure that at iteration t − 1, the projection dt−1 of d0 onto Ct−1 exists and the Generalized Pythagorean Theorem for Bregman divergences [2, 9] is applicable:

  ∆(dt, d0) − ∆(dt−1, d0) ≥ ∆(dt, dt−1).     (3)

By Pinsker's inequality, ∆(dt, dt−1) ≥ ||dt − dt−1||₁² / 2, and by Hölder's inequality, dt−1 · ut − dt · ut ≤ ||dt−1 − dt||₁ ||ut||∞ ≤ ||dt−1 − dt||₁. Also dt−1 · ut ≥ γ̂t by the definition of γ̂t, and the constraints of the optimization problem assure that dt · ut ≤ γ̂t − δ, and thus dt−1 · ut − dt · ut ≥ γ̂t − (γ̂t − δ) = δ. We conclude that ∆(dt, dt−1) ≥ δ²/2 at iterations 1 through T − 1.
By summing (3) over the first T − 1 iterations, we obtain

  ∆(dT, d0) − ∆(d0, d0) ≥ (T − 1) δ²/2.

Since the left side is at most ln(N/ν), the bound of the theorem follows. □

When ν = 1, capping is vacuous and the algorithm and its iteration bound coincide with those of TotalBoost. Note that the upper bound ln(N/ν) on the relative entropy decreases with ν. When ν = N, the distribution stays at d0 and the iteration bound is zero.

5 Experiments

In a first study, we use experiments on synthetic data to illustrate the general behavior of the considered algorithms.² We generated a synthetic data set by starting with a random matrix of 2000 rows and 100 columns, where each entry was chosen uniformly in [0, 1]. For the first 1000 rows, we added 1/2 to the first 10 columns and rescaled such that the entries in those columns were again in [0, 1]. The rows of this matrix are our examples, and the columns and their negations are the base hypotheses, giving us a total of 200 of them. The first 1000 examples were labeled +1 and the rest −1. This results in a well separable dataset. To illustrate how the algorithms deal with the inseparable case, we flipped the label of a random 10% of the examples. We then chose a random 500 examples as our training set and the rest as our test set. In every boosting iteration we chose the base hypothesis which has the largest edge with respect to the current distribution on the examples.

We trained LPBoost and SoftBoost for different values of ν and recorded the generalization error (cf. Figure 2; δ = 10⁻³). We should expect that for small ν (e.g. ν/N < 10%) the data is not easily separable, even when allowing ν wrong predictions.
Hence the algorithm may mistakenly concentrate on the random directions for discrimination. If ν is large enough, most incorrectly labeled examples are likely to be identified as margin errors (ψi > 0) and the performance should stabilize. In Figure 2 we observe this expected behavior and also that for large ν the classification performance decays again. The generalization performances of LPBoost and SoftBoost are very similar, which is expected as they both attempt to maximize the soft margin.

Using the same data set, we analysed the convergence speed of several algorithms: LPBoost, SoftBoost, BrownBoost, and SmoothBoost. We chose δ = 10⁻² and ν = 200.³ For every iteration we record all margins and compute the soft margin objective (1) for optimally chosen ρ and ψ's. Figure 3 plots this value against the number of iterations for the four algorithms. SmoothBoost takes dramatically longer to converge to the maximum soft margin than the other three algorithms. In our experiments it nearly converges to the maximum soft margin objective, even though no theoretical evidence is known for this observed convergence. Among the three remaining algorithms, LPBoost and SoftBoost converge in roughly the same number of iterations, but SoftBoost has a slower start. BrownBoost terminates in fewer iterations than the other algorithms but does not maximize the soft margin.⁴ This is not surprising as there is no theoretical reason to expect such a result.

²Our code is available at https://sourceforge.net/projects/nboost
³Smaller choices of ν lead to an even slower convergence of SmoothBoost.
⁴SmoothBoost has two parameters: a guarantee g on the edge of the base learner and the target margin θ. We chose g = γ∗(ν) (computed with LPBoost) and θ = (g/2)/(2 + g/2) as proposed in [21]. BrownBoost's one parameter, c = 0.35, was chosen via cross-validation.

Figure 2: Generalization performance of SoftBoost (solid) and LPBoost (dotted) on a synthetic data set with 10% label-noise for different values of ν. (Plot of classification error vs. ν/N.)

Figure 3: Soft margin objective vs. the number of iterations for LPBoost, SoftBoost, BrownBoost and SmoothBoost.

Finally, we present a small comparison on ten benchmark data sets derived from the UCI benchmark repository as previously used in [15]. We analyze the performance of AdaBoost, LPBoost, SoftBoost, BrownBoost [6] and AdaBoostReg [15] using RBF networks as base learning algorithm.⁵ The data comes in 100 predefined splits into training and test sets. For each of the splits we use 5-fold cross-validation to select the optimal regularization parameter for each of the algorithms. This leads to 100 estimates of the generalization error for each method and data set. The means and standard deviations are given in Table 1.⁶ As before, the generalization performances of SoftBoost and LPBoost are very similar. However, the soft margin algorithms outperform AdaBoost on most data sets. The generalization error of BrownBoost lies between that of AdaBoost and SoftBoost.
AdaBoostReg performs as well as SoftBoost, but there are no iteration bounds known for this algorithm.

Even though SoftBoost and LPBoost often have similar generalization error on natural datasets, the number of iterations needed by both algorithms can be radically different (see Theorem 1). Also, in [22] there are some artificial data sets where TotalBoost (i.e. SoftBoost with ν = 1) outperformed LPBoost in terms of generalization error.

           AdaBoost      LPBoost       SoftBoost     BrownBoost    AdaBoost_reg
Banana     13.3 ± 0.7    11.1 ± 0.6    11.1 ± 0.5    12.9 ± 0.7    11.3 ± 0.6
B.Cancer   32.1 ± 3.8    27.8 ± 4.3    28.0 ± 4.5    30.2 ± 3.9    27.3 ± 4.3
Diabetes   27.9 ± 1.5    24.4 ± 1.7    24.4 ± 1.7    27.2 ± 1.6    24.5 ± 1.7
German     26.9 ± 1.9    24.6 ± 2.1    24.7 ± 2.1    24.8 ± 1.9    25.0 ± 2.2
Heart      20.1 ± 2.7    18.4 ± 3.0    18.2 ± 2.7    20.0 ± 2.8    17.6 ± 3.0
Ringnorm    1.9 ± 0.3∗    1.9 ± 0.2     1.8 ± 0.2     1.9 ± 0.2     1.7 ± 0.2
F.Solar    36.1 ± 1.5    35.7 ± 1.6    35.5 ± 1.4    36.1 ± 1.4    34.4 ± 1.7
Thyroid     4.4 ± 1.9∗    4.9 ± 1.9     4.9 ± 1.9     4.6 ± 2.1     4.9 ± 2.0
Titanic    22.8 ± 1.0    22.8 ± 1.0    23.0 ± 0.8    22.8 ± 0.8    22.7 ± 1.0
Waveform   10.5 ± 0.4    10.1 ± 0.5     9.8 ± 0.5    10.4 ± 0.4    10.4 ± 0.7

Table 1: Generalization error estimates and standard deviations for ten UCI benchmark data sets. SoftBoost and LPBoost outperform AdaBoost and BrownBoost on most data sets.

6 Conclusion

We prove by counterexample that LPBoost cannot have an O(ln N) iteration bound. This counterexample may seem similar to the proof that the Simplex algorithm for LP can take exponentially more steps than interior point methods. However, this similarity is only superficial.
First, our iteration bound does not depend on the LP solver used within LPBoost. This is because in the construction, the interim solutions are always unique and thus all LP solvers will produce the same solution. Second, the iteration bound essentially says that column generation methods (of which LPBoost is a canonical example) should not solve the current subproblem at iteration t optimally. Instead a good algorithm should loosen the constraints and spread the weight via a regularization such as the relative entropy. These two tricks used by the SoftBoost algorithm make it possible to obtain iteration bounds that grow logarithmically in N.

⁵The data is from http://theoval.cmp.uea.ac.uk/~gcc/matlab/index.shtml. The RBF networks were obtained from the authors of [15], including the hyper-parameter settings for each data set.
⁶Note that [15] contains a similar benchmark comparison. It is based on a different model selection setup leading to underestimates of the generalization error. Presumably due to slight differences in the RBF hyper-parameter settings, our results for AdaBoost often deviate by 1-2%.

The iteration bound for our algorithm is a straightforward extension of a bound given in [22] that is based on Bregman projection methods. By using a different divergence in SoftBoost, such as the sum of binary relative entropies, the algorithm morphs into a "soft" version of LogitBoost (see discussion in [22]) which has essentially the same iteration bound as SoftBoost. We think that the use of Bregman projections illustrates the generality of the methods. Although the proofs seem trivial in hindsight, simple logarithmic iteration bounds for boosting algorithms that maximize the soft margin have eluded many researchers (including the authors) for a long time. Note that duality methods typically can be used in place of Bregman projections.
For example, in [12] a number of iteration bounds for boosting algorithms are proven with both methods.

On a more technical level, we show that LPBoost may require N/2 iterations to get 0.99 close to the maximum hard margin. We believe that similar methods can be used to show that Ω(N/δ) iterations may be needed to get δ close. However, the real challenge is to prove that LPBoost may require Ω(N/δ²) iterations to get δ close.

References

[1] L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1518, 1999. Also Technical Report 504, Statistics Department, University of California Berkeley.
[2] Y. Censor and S.A. Zenios. Parallel Optimization. Oxford, New York, 1997.
[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[4] A. Demiriz, K.P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3):225–254, 2002.
[5] C. Domingo and O. Watanabe. MadaBoost: A modification of AdaBoost. In Proc. COLT '00, pages 180–189, 2000.
[6] Y. Freund. An adaptive version of the boost by majority algorithm. Machine Learning, 43(3):293–318, 2001.
[7] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[8] A.J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.
[9] M. Herbster and M.K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.
[10] R. Hettich and K.O. Kortanek. Semi-infinite programming: Theory, methods and applications. SIAM Review, 3:380–429, September 1993.
[11] J. Kivinen and M.K. Warmuth.
Boosting as entropy projection. In Proc. 12th Annu. Conference on Comput. Learning Theory, pages 134–144. ACM Press, New York, NY, 1999.
[12] J. Liao. Totally Corrective Boosting Algorithms that Maximize the Margin. PhD thesis, University of California at Santa Cruz, December 2006.
[13] R. Meir and G. Rätsch. An introduction to boosting and leveraging. In S. Mendelson and A. Smola, editors, Proc. 1st Machine Learning Summer School, Canberra, LNCS, pages 119–184. Springer, 2003.
[14] G. Rätsch. Robust Boosting via Convex Optimization: Theory and Applications. PhD thesis, University of Potsdam, Germany, December 2001.
[15] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
[16] G. Rätsch, B. Schölkopf, A.J. Smola, S. Mika, T. Onoda, and K.-R. Müller. Robust ensemble learning. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 207–219. MIT Press, Cambridge, MA, 2000.
[17] G. Rätsch and M.K. Warmuth. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6:2131–2152, December 2005.
[18] C. Rudin, I. Daubechies, and R.E. Schapire. The dynamics of AdaBoost: Cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5:1557–1595, 2004.
[19] R.E. Schapire, Y. Freund, P.L. Bartlett, and W.S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
[20] B. Schölkopf, A.J. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.
[21] R.A. Servedio. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4:633–648, 2003.
[22] M.K. Warmuth, J. Liao, and G.
R\u00a8atsch. Totally corrective boosting algorithms that maximize the margin.\n\nIn Proc. ICML \u201906, pages 1001\u20131008. ACM Press, 2006.\n\n8\n\n\f", "award": [], "sourceid": 891, "authors": [{"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}, {"given_name": "Karen", "family_name": "Glocer", "institution": null}]}