{"title": "Direct 0-1 Loss Minimization and Margin Maximization with Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 872, "page_last": 880, "abstract": "We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives consistently better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n'th order bottom sample margin.", "full_text": "Direct 0-1 Loss Minimization and Margin\n\nMaximization with Boosting\n\nShaodan Zhai, Tian Xia, Ming Tan and Shaojun Wang\n\nKno.e.sis Center\n\nDepartment of Computer Science and Engineering\n\nWright State University\n\n{zhai.6,xia.7,tan.6,shaojun.wang}@wright.edu\n\nAbstract\n\nWe propose a boosting method, DirectBoost, a greedy coordinate descent algo-\nrithm that builds an ensemble classi\ufb01er of weak classi\ufb01ers through directly min-\nimizing empirical classi\ufb01cation error over labeled training examples; once the\ntraining classi\ufb01cation error is reduced to a local coordinatewise minimum, Direct-\nBoost runs a greedy coordinate ascent algorithm that continuously adds weak clas-\nsi\ufb01ers to maximize any targeted arbitrarily de\ufb01ned margins until reaching a local\ncoordinatewise maximum of the margins in a certain sense. 
Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n'th order bottom sample margin.

1 Introduction

The classification problem in machine learning and data mining is to predict an unobserved discrete output value y based on an observed input vector x. In the spirit of the model-free framework, it is always assumed that the relationship between the input vector and the output value is stochastic and described by a fixed but unknown probability distribution p(X, Y) [7]. The goal is to learn a classifier, i.e., a mapping function f(x) from x to y ∈ Y such that the probability of classification error is small. As is well known, the optimal choice is the Bayes classifier [7]. However, since p(X, Y) is unknown, we cannot learn the Bayes classifier directly. Instead, following Vapnik's general setting of empirical risk minimization [7, 24], we focus on a more realistic goal: given a set of training data D = {(x_1, y_1), ..., (x_n, y_n)} independently drawn from p(X, Y), we consider finding f(x) in a function class H that minimizes the empirical classification error,

\frac{1}{n} \sum_{i=1}^{n} 1(\hat{y}_i \neq y_i)    (1)

where \hat{y}_i = \arg\max_{y \in Y} y f(x_i), Y = {-1, 1} and 1(\cdot) is an indicator function. Under certain conditions, direct empirical classification error minimization is consistent [24], and under low-noise situations it has a fast convergence rate [15, 23]. However, due to the nonconvexity, nondifferentiability and discontinuity of the classification error function, the minimization of (1) is typically NP-hard for general linear models [13].
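To make the objective in (1) concrete, the empirical 0-1 loss can be computed directly from ensemble scores. The following is a minimal illustrative sketch (the function name and the sign convention for ties are our assumptions, not the paper's code):

```python
import numpy as np

def empirical_01_loss(scores, y):
    """Empirical classification error (1): the fraction of training
    examples whose predicted label sign(f(x_i)) disagrees with y_i.
    scores[i] = f(x_i); y[i] in {-1, +1}."""
    y_hat = np.where(np.asarray(scores) >= 0, 1, -1)
    return float(np.mean(y_hat != np.asarray(y)))
```

For example, `empirical_01_loss([0.5, -0.2, 0.1], [1, 1, -1])` counts two of the three examples as misclassified.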
The common approach is to minimize a surrogate function, usually a convex upper bound of the classification error function. Minimizing the empirical surrogate loss then becomes a convex programming problem with considerable computational advantages, and the learned classifiers remain consistent with the Bayes classifier [1, 20, 28, 29]. However, there is clearly a mismatch between the "desired" loss function used in inference and the "training" loss function used during training [16]. Moreover, it has been shown that all boosting algorithms based on convex loss functions are susceptible to random classification noise [14].

Boosting is a machine-learning method based on the idea of creating a single, highly accurate classifier by combining many weak and inaccurate "rules of thumb." A remarkably rich theory and a record of empirical success [18] have evolved around boosting; nevertheless, it is still not clear how to best exploit what is known about how boosting operates, even for binary classification. In this paper, we propose a boosting method for binary classification – DirectBoost – a greedy coordinate descent algorithm that directly minimizes classification error over labeled training examples to build an ensemble linear classifier of weak classifiers. Once the training error is reduced to a (local coordinatewise) minimum, DirectBoost runs a coordinate ascent algorithm that greedily adds weak classifiers by directly maximizing any targeted, arbitrarily defined margins; in doing so it might escape the region of minimum training error in order to achieve a larger margin. The algorithm stops once a (local coordinatewise) maximum of the margins is reached. In the next section, we first present a coordinate descent algorithm that directly minimizes 0-1 loss over labeled training examples.
We then describe coordinate ascent algorithms that aim to directly maximize any targeted, arbitrarily defined margins right after we reach a (local coordinatewise) minimum of the 0-1 loss. In Section 3, we show experimental results on a collection of machine-learning benchmark data sets for DirectBoost, AdaBoost [9], LogitBoost [11], LPBoost with column generation [6] and BrownBoost [10], and discuss our findings. Due to space limitations, the proofs of theorems, related work, technical details, as well as conclusions and future work, are given in the full version of this paper [27].

2 DirectBoost: Minimizing 0-1 Loss and Maximizing Margins
Let H = {h_1, ..., h_l} denote the set of all possible weak classifiers that can be produced by the weak learning algorithm, where a weak classifier h_j ∈ H is a mapping from an instance space X to Y = {-1, 1}. The h_j's are not assumed to be linearly independent, and H is closed under negation, i.e., both h and -h belong to H. We assume that the training set consists of examples with labels {(x_i, y_i)}, i = 1, ..., n, where (x_i, y_i) ∈ X × Y are generated independently from p(X, Y). We define C of H as the set of mappings that can be generated by taking a weighted average of classifiers from H:

C = \left\{ f : x \to \sum_{h \in H} \alpha_h h(x) \;\middle|\; \alpha_h \ge 0 \right\}    (2)

The goal here is to find f ∈ C that minimizes the empirical classification error (1) and has good generalization performance.

2.1 Minimizing 0-1 Loss
Similar to AdaBoost, DirectBoost works by sequentially running an iterative greedy coordinate descent algorithm, each time directly minimizing the true empirical classification error (1) instead of the weighted empirical classification error used in AdaBoost.
That is, at each iteration, only the parameter of the weak classifier that leads to the most significant true classification error reduction is updated, while the weights of all other weak classifiers are kept unchanged. The rationale is that the inference used to predict the label of a sample can then be written as a linear function of a single parameter. Consider the tth iteration, where the ensemble classifier is

f_t(x) = \sum_{k=1}^{t} \alpha_k h_k(x)    (3)

and the previous t − 1 weak classifiers h_k(x) and corresponding weights \alpha_k, k = 1, ..., t − 1 have already been selected and determined. The inference function for sample x_i is defined as

F_t(x_i, y) = y f_t(x_i) = y \left( \sum_{k=1}^{t-1} \alpha_k h_k(x_i) \right) + \alpha_t y h_t(x_i)    (4)

Since a(x_i) = \sum_{k=1}^{t-1} \alpha_k h_k(x_i) is constant and h_t(x_i) is either +1 or -1 depending on the sample x_i, we rewrite the equation above as

F_t(x_i, y) = y h_t(x_i) \alpha_t + y a(x_i)    (5)

Note that for each label y of sample x_i, this is a linear function of \alpha_t with slope either +1 or -1 and intercept y a(x_i). Given an input \alpha_t, each example x_i has two linear scoring functions, F_t(x_i, +1) and F_t(x_i, -1), i = 1, ..., n, one for the positive label y = +1 and one for the negative label y = -1. Of these two linear scoring functions, the one with the higher score determines the predicted label \hat{y}_i of the ensemble classifier f_t(x_i). The intersection point e_i of these two linear scoring functions is the critical point at which the predicted label \hat{y}_i switches its sign; it satisfies the condition F_t(x_i, +1) = F_t(x_i, -1) = 0, i.e., a(x_i) + \alpha_t h_t(x_i) = 0, and can be computed as e_i = -\frac{a(x_i)}{h_t(x_i)}, i = 1, ..., n.
These points divide the \alpha_t axis into (at most) n + 1 intervals; on each interval the true classification error takes a fixed value, thus the classification error is a stepwise function of \alpha_t. The values e_i, i = 1, ..., n can be negative or positive; however, since H is closed under negation, we only care about those that are positive.

Algorithm 1 Greedy coordinate descent algorithm that minimizes the 0-1 loss.
1: D = {(x_i, y_i), i = 1, ..., n}
2: Sort |a(x_i)|, i = 1, ..., n in increasing order.
3: for each weak classifier h_k ∈ H do
4:   Visit each sample in the order of increasing |a(x_i)|.
5:   Compute the slope and intercept of F(x_i, y_i) = y_i h_k(x_i) \alpha + y_i a(x_i).
6:   Let \hat{e}_i = |a(x_i)|.
7:   If (slope > 0 and intercept < 0), the error update on the right-hand side of \hat{e}_i is -1.
8:   If (slope < 0 and intercept > 0), the error update on the right-hand side of \hat{e}_i is +1.
9:   Incrementally calculate the classification error on the intervals between the \hat{e}_i's.
10:  Get the interval that has the minimum classification error.
11: end for
12: Pick the weak classifiers that lead to the largest classification error reduction.
13: Among these selected weak classifiers, update only the weight of the one that gives the smallest exponential loss.
14: Repeat 2-13 until the training error reaches a minimum.

The greedy coordinate descent algorithm that sequentially minimizes the 0-1 loss is described in Algorithm 1; lines 3-11 are the weak learning steps and the rest are boosting steps. Consider an example with 4 samples to illustrate this procedure. Suppose for a weak classifier we have F_t(x_i, y_i), i = 1, 2, 3, 4 as shown in Figure 1. At \alpha_t = 0, samples x_1 and x_2 have negative margins, thus they are misclassified and the error rate is 50%.
We incrementally update the classification error on the intervals between the \hat{e}_i, i = 1, 2, 3, 4: For F_t(x_1, y_1), the slope is negative and the intercept is negative, so sample x_1 always has a negative margin for \alpha_t > 0; thus there is no error update on the right-hand side of \hat{e}_1. For F_t(x_2, y_2), the slope is positive and the intercept is negative; when \alpha_t is to the right of \hat{e}_2, sample x_2 has a positive margin and becomes correctly classified, so we update the error by -1 and the error rate is reduced to 25%. For F_t(x_3, y_3), the slope is negative and the intercept is positive; when \alpha_t is to the right of \hat{e}_3, sample x_3 has a negative margin and becomes misclassified, so we update the error by +1 and the error rate changes to 50% again. For F_t(x_4, y_4), the slope is positive and the intercept is positive, so sample x_4 always has a positive margin for \alpha_t > 0; thus there is no error update on the right-hand side of \hat{e}_4. We finally have the minimum error rate of 25% on the interval [\hat{e}_2, \hat{e}_3].

We repeat this procedure until the training error reaches its minimum, which may be 0 in a separable case. We then go to the next stage, explained below, which aims to maximize margins. A nice property of the above greedy coordinate descent algorithm is that the classification error is monotonically decreasing. Assuming M weak classifiers are considered, the computational complexity of Algorithm 1 in the training stage is O(Mn) for each iteration.

For boosting, as long as the weak learner is strong enough to achieve reasonably high accuracy, the data will be linearly separable and the minimum 0-1 loss is usually 0.
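The per-coordinate line search of Algorithm 1 can be sketched as follows. This is our illustrative reconstruction, not the paper's code; it assumes distinct breakpoints and strict inequalities, scans the breakpoints \hat{e}_i = |a(x_i)| in sorted order, and accumulates the ±1 error updates:

```python
import numpy as np

def min_01_loss_along_coordinate(a, y, h):
    """Line search over alpha_t >= 0 for one weak classifier h.
    a[i] = sum_k alpha_k h_k(x_i), the accumulated score so far;
    y[i], h[i] in {-1, +1}.
    Returns (min #errors, (lo, hi)), where [lo, hi) is an interval of
    alpha_t attaining the minimum 0-1 loss."""
    a, y, h = map(np.asarray, (a, y, h))
    slope = y * h          # slope of F_t(x_i, y_i) as a function of alpha_t
    intercept = y * a      # intercept of F_t(x_i, y_i)
    errors = int(np.sum(intercept <= 0))   # 0-1 loss at alpha_t = 0
    e_hat = np.abs(a)                      # breakpoints where a label can flip
    order = np.argsort(e_hat)
    cur, best, best_j = errors, errors, -1
    for j, i in enumerate(order):
        if slope[i] > 0 and intercept[i] < 0:
            cur -= 1       # sample i becomes correct to the right of e_hat[i]
        elif slope[i] < 0 and intercept[i] > 0:
            cur += 1       # sample i becomes wrong to the right of e_hat[i]
        if cur < best:
            best, best_j = cur, j
    e_sorted = e_hat[order]
    lo = 0.0 if best_j < 0 else float(e_sorted[best_j])
    hi = float(e_sorted[best_j + 1]) if best_j + 1 < len(e_sorted) else np.inf
    return best, (lo, hi)
```

Replaying the four-sample example above with y = (1, 1, 1, 1), h = (-1, 1, -1, 1) and a = (-1, -2, 3, 4) yields one error on the interval [2, 3), matching the 25% minimum on [\hat{e}_2, \hat{e}_3].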
As shown in Theorem 1, the region of zero 0-1 loss is a (convex) cone.

Figure 1: An example of computing the minimum 0-1 loss of a weak learner over 4 samples.

Theorem 1 The region of zero training error, if it exists, is a cone, and it is not a set of isolated cones.

Algorithm 1 is a heuristic procedure that minimizes the 0-1 loss; it is not guaranteed to find the global minimum and may be trapped at a coordinatewise local minimum [22] of the 0-1 loss. At that point, we switch to the algorithms, presented below, that directly maximize the margins.

2.2 Maximizing Margins
The margins theory [17] provides an insightful analysis of the success of AdaBoost: the authors proved that the generalization error of any ensemble classifier is bounded in terms of the entire distribution of margins of the training examples, as well as the number of training examples and the complexity of the base classifiers, and that AdaBoost's dynamics have a strong tendency to increase the margins of the training examples. Instead, we can prove that the generalization error of any ensemble classifier is bounded in terms of the average margin of the bottom n′ samples or the n′th order margin of the training examples, as well as the number of training examples and the complexity of the base classifiers. This view motivates us to propose a coordinate ascent algorithm that directly maximizes several types of margins right after the training error reaches a (local coordinatewise) minimum.

The margin of a labeled example (x_i, y_i) with respect to an ensemble classifier f_t(x) = \sum_{k=1}^{t} \alpha_k h_k(x) is defined to be

m_i = \frac{y_i \sum_{k=1}^{t} \alpha_k h_k(x_i)}{\sum_{k=1}^{t} \alpha_k}    (6)

This is a real number between -1 and +1 that intuitively measures the confidence of the classifier in its prediction on the ith example.
It is equal to the weighted fraction of base classifiers voting for the correct label minus the weighted fraction voting for the incorrect label [17].

We denote the minimum margin, the average margin, and the median margin over the training examples as g_{min} = \min_{i \in \{1,...,n\}} m_i, g_{average} = \frac{1}{n} \sum_{i=1}^{n} m_i, and g_{median} = \mathrm{median}\{m_i, i = 1, ..., n\}. Furthermore, we can sort the margins over all training examples in increasing order, consider the n′ worst training examples (n′ ≤ n) that have the smallest margins, and compute the average margin over those n′ training examples. We call this the average margin of the bottom n′ samples, and denote it as g_{average\,n′} = \frac{1}{n′} \sum_{i \in B_{n′}} m_i, where B_{n′} denotes the set of n′ samples having the smallest margins.

The margin maximization method described below is a greedy coordinate ascent algorithm that adds the weak classifier achieving the maximum margin. It allows us to continuously maximize the margin while keeping the training error at a minimum by running the greedy coordinate descent algorithm presented in the previous section. The margin m_i is a linear fractional function of \alpha, and it is both quasiconvex and quasiconcave, i.e., quasilinear [2, 5]. Theorem 2 shows that the average margin of the bottom n′ examples is quasiconcave in the region of zero training error.

Theorem 2 Denote the average margin of the bottom n′ samples as

g_{average\,n′}(\alpha) = \frac{1}{n′} \sum_{i \in \{B_{n′}|\alpha\}} \frac{y_i \sum_{k=1}^{t} \alpha_k h_k(x_i)}{\sum_{k=1}^{t} \alpha_k}

where \{B_{n′}|\alpha\} denotes the set of n′ samples whose margins are at the bottom for fixed \alpha.
Then g_{average\,n′}(\alpha) in the region of zero training error is quasiconcave.

We denote a_i = \sum_{k=1}^{t-1} y_i \alpha_k h_k(x_i), b_{i,t} = y_i h_t(x_i) \in \{-1, +1\} and c = \sum_{k=1}^{t-1} \alpha_k; then the margin on the ith example (x_i, y_i) can be rewritten as

m_i = \frac{a_i + b_{i,t} \alpha_t}{c + \alpha_t}    (7)

The derivative of the margin on the ith example with respect to \alpha_t is

\frac{\partial m_i}{\partial \alpha_t} = \frac{b_{i,t} c - a_i}{(c + \alpha_t)^2}    (8)

Figure 2: Margin curves of six examples. At points q_1, q_2, q_3 and q_4, the median example is changed. At points q_2 and q_4, the set of bottom n′ = 3 examples is changed.

Since c ≥ a_i, depending on the sign of b_{i,t}, the derivative of the margin on the ith sample (x_i, y_i) is either positive or negative, irrespective of the value of \alpha_t. This is also true for the second derivative of the margin. Therefore, the margin on the ith example (x_i, y_i) with respect to \alpha_t is either concave when it is monotonically increasing or convex when it is monotonically decreasing. See Figure 2 for a simple illustration.

Consider a greedy coordinate ascent algorithm that maximizes the average margin g_{average} over all training examples. The derivative of g_{average} can be written as

\frac{\partial g_{average}}{\partial \alpha_t} = \frac{\sum_{i=1}^{n} b_{i,t} c - \sum_{i=1}^{n} a_i}{(c + \alpha_t)^2}    (9)

Algorithm 2 Greedy coordinate ascent algorithm that maximizes the average margin of the bottom n′ examples.
1: Input: a_i, i = 1, ..., n and c from the previous round.
2: Sort a_i, i = 1, ..., n in increasing order. B_{n′} ← {n′ samples having the smallest a_i at \alpha_t = 0}.
3: for each weak classifier do
4:   Determine the lowest sample whose margin is decreasing and determine d.
5:   Compute D_{n′} ← \sum_{i \in B_{n′}} (b_{i,t} c - a_i).
6:   j ← 0, q_j ← 0.
7:   Compute the intersection q_{j+1} of the (j+1)th highest increasing margin in B_{n′} and the (j+1)th smallest decreasing margin in B^c_{n′} (the complement of the set B_{n′}).
8:   if q_{j+1} < d and D_{n′} > 0 then
9:     Incrementally update B_{n′}, B^c_{n′} and D_{n′} at \alpha_t = q_{j+1}; j ← j + 1.
10:    Go back to Line 7.
11:  else
12:    if D_{n′} > 0 then q* ← d; otherwise q* ← q_j.
13:    Compute the average margin of the bottom n′ examples at q*.
14:  end if
15: end for
16: Pick the weak classifier with the largest increment of the average margin of the bottom n′ examples, with weight q*.
17: Repeat 2-16 until there is no increment in the average margin of the bottom n′ examples.

Therefore, the maximum average margin can only occur at the two ends of the interval. As shown in Figure 2, the maximum average margin is either at the origin or at the point d, depending on the sign of the derivative in (9). If it is positive, the average margin is monotonically increasing and we set \alpha_t = d − ε; otherwise we set \alpha_t = 0. The greedy coordinate ascent algorithm proceeds as follows: looking at all weak classifiers in H, if the numerator in (9) is positive, we set the weight to ε below the right end of the interval on which the training error is minimum, and compute the value of the average margin. We add the weak classifier which has the largest average margin increment, and iterate this procedure until convergence.
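Under the notation above, the margin of (7) and the endpoint rule implied by (9) can be sketched as follows. This is a simplified illustration with our own function names; it assumes the feasible interval [0, d] of minimum training error is already known:

```python
import numpy as np

def margins(a, b, alpha_t, c):
    """m_i(alpha_t) = (a_i + b_i * alpha_t) / (c + alpha_t), eq. (7)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return (a + b * alpha_t) / (c + alpha_t)

def best_alpha_for_average_margin(a, b, c, d, eps=0.01):
    """Each m_i is monotone in alpha_t (sign of b_i*c - a_i), so the
    average margin is maximized at an endpoint of [0, d]: take d - eps
    when the numerator of (9) is positive, else 0."""
    numerator = c * float(np.sum(b)) - float(np.sum(a))
    return d - eps if numerator > 0 else 0.0
```

For instance, with a = [0.2, -0.1], b = [1, 1], c = 1 and d = 2, the numerator 2 − 0.1 = 1.9 is positive, so the step sets alpha_t = d − eps.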
Its convergence is given by Theorem 3 below.

Theorem 3 When constrained to the region of zero training error, the greedy coordinate ascent algorithm that maximizes the average margin over all examples converges to an optimal solution.

Now consider a greedy coordinate ascent algorithm maximizing the average margin of the bottom n′ training examples, g_{average\,n′}. Clearly, maximizing the minimum margin is the special case obtained by choosing n′ = 1. Figure 2 is a simple illustration with six training examples, where the aim is to maximize the average margin of the bottom 3 examples. The interval [0, d] of \alpha_t indicates an interval where the training error is zero. At the point d, the sample margin m_3 changes from positive to negative, which causes the training error to jump from 0 to 1/6. As shown in Figure 2, the margin of each of the six training examples is either monotonically increasing or monotonically decreasing.

If we know a fixed set of bottom n′ training examples having the smallest margins on an interval of \alpha_t with a minimum training error, it is straightforward to compute the derivative of the average margin of the bottom n′ training examples as

\frac{\partial g_{average\,n′}}{\partial \alpha_t} = \frac{\sum_{i \in B_{n′}} b_{i,t} c - \sum_{i \in B_{n′}} a_i}{(c + \alpha_t)^2}    (10)

Again g_{average\,n′} is a monotonic function of \alpha_t; depending on the sign of the derivative in (10), it is maximized either at the left end or at the right end of the interval. In general, however, the set of bottom n′ training examples on an interval of \alpha_t with a minimum training error varies with \alpha_t, so we must track each snapshot of the bottom n′ examples as \alpha_t changes.

To address this, we first examine when the margins of two examples intersect. Consider the ith example (x_i, y_i) with margin m_i = \frac{a_i + b_{i,t} \alpha_t}{c + \alpha_t} and the jth example (x_j, y_j) with margin m_j = \frac{a_j + b_{j,t} \alpha_t}{c + \alpha_t}. Notice that b_i and b_j are either -1 or +1. If b_i = b_j, then because m_i ≠ m_j (since a_i ≠ a_j), the margins of examples i and j never intersect; if b_i ≠ b_j, then because m_i = m_j at \alpha_t = \frac{|a_i - a_j|}{2}, the margins of examples i and j intersect if \frac{|a_i - a_j|}{2} belongs to the interval of \alpha_t with the minimum training error. In summary, given any two samples, we can decide whether their margins intersect by checking whether the b terms have the same sign; if not, they intersect, and we can determine the intersection point.

The greedy coordinate ascent algorithm that sequentially maximizes the average margin of the bottom n′ examples is described in Algorithm 2; lines 3-15 are the weak learning steps and the rest are boosting steps. At line 5 we compute D_{n′}, which can be used to check the sign of the derivative in (10). Since the average margin of the bottom n′ examples is quasiconcave, we can determine the optimal point q* from D_{n′} and only need to compute the margin value at q*. We add the weak learner with the largest increment of the average margin over the bottom n′ examples into the ensemble classifier. This procedure terminates if there is no increment in the average margin of the bottom n′ examples over the considered weak classifiers. If M weak learners are considered, the computational complexity of Algorithm 2 in the training stage is O(max(n log n, M n′)) for each iteration.
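The intersection test just described can be written down directly; a small sketch (function name ours, with c canceling out of the crossing condition):

```python
def margin_intersection(a_i, b_i, a_j, b_j):
    """Where margins (a_i + b_i*alpha)/(c + alpha) and
    (a_j + b_j*alpha)/(c + alpha) cross. They can only cross when
    b_i != b_j; then a_i + b_i*alpha = a_j + b_j*alpha gives
    alpha = |a_i - a_j| / 2 on the positive side, else None."""
    if b_i == b_j:
        return None                  # equal slopes: margins never intersect
    alpha = (a_j - a_i) / (b_i - b_j)
    return alpha if alpha > 0 else None
```

For example, `margin_intersection(0.2, 1, 0.8, -1)` returns 0.3 = |0.2 − 0.8| / 2.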
The convergence analysis of Algorithm 2 is given by Theorem 4.

Theorem 4 When constrained to the region of zero training error, the greedy coordinate ascent algorithm that maximizes the average margin of the bottom n′ samples converges to a coordinatewise maximum solution, but it is not guaranteed to converge to an optimal solution due to the non-smoothness of the average margin of the bottom n′ samples.

ε-relaxation: Unfortunately, there is a fundamental difficulty with the greedy coordinate ascent algorithm that maximizes the average margin of the bottom n′ samples: it can get stuck at a corner, from which it is impossible to make progress along any single coordinate direction. We propose an ε-relaxation method to overcome this difficulty. This method was first proposed by [3] for the assignment problem, and was extended to the linear cost network flow problem and to strictly convex costs with linear constraints [4, 21]. The main idea is to allow a single coordinate to change even if this worsens the margin function. When a coordinate is changed, it is set to ε plus or ε minus the value that maximizes the margin function along that coordinate, where ε is a positive number.

We can design a similar greedy coordinate ascent algorithm to directly maximize the bottom n′th sample margin by making only a slight modification to Algorithm 2: for a weak classifier, we choose the intersection point that leads to the largest increase of the bottom n′th margin. When combined with ε-relaxation, this algorithm will eventually approach a small neighbourhood of a local optimal solution that maximizes the bottom n′th sample margin. As shown in Figure 2, the bottom n′th margin is a multimodal function; this algorithm is very sensitive to the choice of n′, and it usually gets stuck at a bad coordinatewise point without ε-relaxation.
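A minimal sketch of one ε-relaxed coordinate step, in our own simplified formulation: a generic objective function and a finite candidate set stand in for the margin function and its intersection breakpoints, which are assumptions for illustration only:

```python
def eps_relaxed_step(objective, alpha, k, candidates, eps=0.01):
    """One epsilon-relaxed coordinate update: line-search coordinate k
    over the candidate values, then overshoot the maximizer by eps,
    accepting the move even if the overshoot slightly worsens the
    objective, so the ascent can escape a corner."""
    def value(q):
        trial = list(alpha)
        trial[k] = q
        return objective(trial)
    best = max(candidates, key=value)
    new_alpha = list(alpha)
    new_alpha[k] = best + eps
    return new_alpha
```

With objective(a) = -(a[0]-1)**2 - (a[1]-2)**2, alpha = [0, 0] and candidates [0, 0.5, 1, 1.5], the step on coordinate 0 picks the maximizer 1 and returns [1 + eps, 0].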
However, an impressive advantage is that this method is tolerant to noise, as will be shown in Section 3.

3 Experimental Results
In the experiments below, we first evaluate the performance of DirectBoost on 10 UCI data sets, and then evaluate its noise robustness. For all the algorithms in our comparison, we use decision trees of depth either 1 or 3 as weak learners, since for the small datasets decision stumps (trees of depth 1) are already strong enough. DirectBoost with decision trees is implemented by a greedy top-down recursive partitioning algorithm to find the tree, but differently from AdaBoost and LPBoost, DirectBoost does not maintain a distribution over training samples. Instead, for each splitting node, DirectBoost simply chooses the attribute to split on by minimizing the 0-1 loss or maximizing the predefined margin value. In all the experiments in which ε-relaxation is used, the value of ε is 0.01. Note that our empirical study focuses on whether the proposed boosting algorithm is able to effectively improve on the accuracy of state-of-the-art boosting algorithms with the same weak learner space H; thus we restrict our comparison to boosting algorithms with the same weak learners, rather than to a wide range of classification algorithms such as SVMs and KNN.

3.1 Experiments on UCI data
We first compare DirectBoost with AdaBoost, LogitBoost, soft margin LPBoost and BrownBoost on 10 UCI data sets1 from the UCI Machine Learning Repository [8]. We partition each UCI dataset into five parts with the same number of samples for five-fold cross validation. In each fold, we use three parts for training, one part for validation, and the remaining part for testing. The validation

1For the Adult data, we use the subset a5a from the LIBSVM collection at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
We\n\ndo not use the original Adult data which has 48842 examples since LPBoost runs very slow on it.\n\n6\n\n\fLPBoost BrownBoost DirectBoostavg DirectBoost\u01eb\nN D depth AdaBoost LogitBoost\n1.15(0.8)\n2.62(0.8)\n1.47(0.7)\n9\n958\n1.47(1.0)\n25.49(3.0)\n26.01(3.3)\n27.71(1.7) 27.32(1.3)\n768\n8\n13.33(3.0)\n16.23(2.6) 14.49(4.4)\n14.2(1.8)\n690 14\n2.44(1.6)\n3.02(2.3)\n1.86(1.3)\n862\n2\n1.86(1.3)\n8.29(2.7)\n8.57(2.7)\n9.71(3.1)\n9.71(3.7)\n351 34\n4.0(0.5)\n4.8(1.4)\n5.3(2.6)\n5.3(1.4)\n1000 61\n4.07(2.0)\n4.25(2.5)\nCancer-wdbc 569 29\n4.42(1.4)\n3.89(1.5)\n24.62(7.6)\n27.69(7.6) 30.26(7.3) 26.15(10.5)\nCancer-wpbc 198 32\n16.67(7.5)\n270 13\n17.41(7.7) 18.52(5.1)\n19.26(8.1)\n15.28(0.8)\n16.2(1.1)\n15.39(0.8)\n15.6(0.7)\n6414 14\n\n3.66(1.3)\n26.67(2.6)\n13.77(4.6)\n2.33(1.7)\n10.86(2.8)\n6.1(1.1)\n4.25(2.2)\n28.72(8.4)\n18.15(7.2)\n15.56(0.9)\n\n0.63(0.4)\n25.62(2.5)\n14.06(3.6)\n2.33(1.0)\n7.71(3.0)\n4.8(0.7)\n4.96(3.0)\n27.69(8.1)\n18.15(5.1)\n16.25(1.7)\n\n3\n3\n3\n3\n3\n3\n1\n1\n1\n3\n\nDatasets\nTic-tac-toe\nDiabetes\nAustralian\nFourclass\nIonosphere\n\nSplice\n\nHeart\nAdult\n\navg DirectBoostorder\n\n1.05(0.4)\n23.4(3.7)\n13.48(2.9)\n1.74(1.5)\n7.71(4.4)\n6.7(1.6)\n3.72(2.9)\n27.18(10.0)\n18.15(7.6)\n15.8(1.1)\n\nTable 1: Percent test errors of AdaBoost, LogitBoost, soft margin LPBoost with column generation, Brown-\nBoost, and three DirectBoost methods on 10 UCI datasets each with N samples and D attributes.\n\nset is used to choose the optimal model for each algorithm: For AdaBoost and LogitBoost, the\nvalidation data is used to perform early stopping since there is no nature stopping criteria for these\nalgorithms. We run the algorithms until convergence where the stopping criterion is that the change\nof loss is less than 1e-6, and then choose the ensemble classi\ufb01er from the round with minimum error\non the validation data. 
For BrownBoost, we select the optimal cutoff parameters using the validation set, chosen from {0.0001, 0.001, 0.01, 0.03, 0.05, 0.08, 0.1, 0.14, 0.17, 0.2}. LPBoost maximizes the soft margin subject to linear constraints; its objective is equivalent to DirectBoost maximizing the average margin of the bottom n′ samples [19], so we set the same candidate parameters n′/n = {0.01, 0.05, 0.1, 0.2, 0.5, 0.8} for both. For LPBoost, the termination rule we use is the same as the one in [6], and we select the optimal regularization parameter using the validation set. For DirectBoost, the algorithm terminates when there is no increment in the targeted margin value, and we select the model with the optimal n′ using the validation set.

We use DirectBoostavg to denote our method that runs Algorithm 1 first and then maximizes the average of the bottom n′ margins without ε-relaxation, DirectBoostε_avg to denote our method that runs Algorithm 1 first and then maximizes the average margin of the bottom n′ samples with ε-relaxation, and DirectBoostorder to denote our method that runs Algorithm 1 first and then maximizes the bottom n′th margin with ε-relaxation. The means and standard deviations of the test errors are given in Table 1. Clearly DirectBoostavg, DirectBoostε_avg and DirectBoostorder outperform the other boosting algorithms in general; in particular, DirectBoostε_avg is better than AdaBoost, LogitBoost, LPBoost and BrownBoost on all data sets except Cancer-wdbc. Among the family of DirectBoost algorithms, DirectBoostavg wins on two datasets, where it searches for the optimal margin solution in the region of zero training error; this means that keeping the training error at zero may lead to good performance in some cases. DirectBoostorder wins on three other datasets, but its results are unstable and sensitive to n′.
With \u01eb-relaxation, DirectBoost\u01eb\navg searches the optimal margin solution in the whole parameter\nspace and gives the best performance on the remaining 5 data sets. It is well known that AdaBoost\nperforms well on the datasets with a small test error such as Tic-tac-toe and Fourclass, it is extremely\nhard for other boosting algorithms to beat AdaBoost. Nevertheless, DirectBoost is still able to give\neven better results in this case. For example, on Tic-tac-toe data set, the test error becomes 0.63%,\nmore than half the error rate reduction. Our method would be more valuable for those who value\nprediction accuracy, which might be the case in areas of medical and genetic research.\n\nDirectBoost\u01eb\navg and LPBoost are both designed\nto maximize the average margin over bot-\ntom n\u2032 samples [19], but as shown by the\navg gener-\nleft \ufb01gure in Figure 3, DirectBoost\u01eb\nates a larger margin value than LPBoost when\ndecision trees with depth greater than 1 are\nused as weak learners, this may explain why\nDirectBoost\u01eb\navg outperforms LPBoost. When\ndecision stumps are used as weak learners,\nLPBoost converges to a global optimal solu-\ntion, and DirectBoost\u01eb\navg nearly converges to\nthe maximum margin as shown by the right \ufb01g-\n\nure in Figure 3, even though no theoretical justi\ufb01cation is known for this observed phenomenon.\n\nFigure 3: The value of average margins of bottom n\u2032\nsamples vs. the number of iterations for LPBoost with\ncolumn generation and DirectBoost\u01eb\navg on Australian\ndataset, left: Decision tree, right: Decision stump.\n\n7\n\n\f# of iterations\n\nTotal running times\n\nAdaBoost\nLPBoost\nDirectBoost\u01eb\n\nTable 2 shows the number of iterations and total\nrun times (in seconds) for AdaBoost, LPBoost\navg at the training stage, where\nand DirectBoost\u01eb\nwe use the Adult dataset with 10000 training\nsamples. 
All these three algorithms employ de-\ncision trees with a depth of 3 as weak learners.\nThe experiments are conducted on a PC with\nCore2 Duo 2.6GHz CPU and 2G RAM. Clearly\navg takes less time for the entire training stage since it converges much faster. LPBoost\nDirectBoost\u01eb\nconverges in less than three hundred rounds, but as a total corrective algorithm, it has a greater com-\nputational cost on each round. To handle large scale data sets in practice, similar to AdaBoost, we\ncan use many tricks. For example, we can partition the data into many parts and use distributed\nalgorithms to select the weak classi\ufb01er.\n\nTable 2: Number of iterations and total run times (in\nseconds) in training stage on Adult dataset with 10000\ntraining samples and the depth of DecisionTrees is 3.\n\n31168\n167520\n\n606\n\n117852\n\n286\n1737\n\navg\n\n3.2 Evaluate noise robustness\nIn the experiments conducted below, we evaluate the noise robustness of each boosting method.\nFirst, we run the above algorithms on a synthetic example created by [14]. This is a simple coun-\nterexample to show that for a broad class of convex loss functions, no boosting algorithm is provably\nrobust to random label noise, this class includes AdaBoost, LogitBoost, etc. 
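The random label noise used throughout this subsection is simple independent flipping of training labels, with validation and test labels left clean. A minimal sketch (illustrative code, not the authors' implementation):

```python
import random

def corrupt_labels(labels, eta, seed=0):
    """Flip each +/-1 label independently with probability eta."""
    rng = random.Random(seed)
    return [-y if rng.random() < eta else y for y in labels]

clean = [1, -1] * 500                        # 1000 training labels
noisy = corrupt_labels(clean, eta=0.2)
flips = sum(a != b for a, b in zip(clean, noisy))
print(flips / len(clean))                    # close to eta = 0.2
```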
LPBoost and its variations [25, 26] do not satisfy the preconditions of the theorem presented by [14], but Glocer [12] showed experimentally that these soft-margin boosting methods have the same problem as AdaBoost and LogitBoost in handling random noise.

We repeat the synthetic learning problem with binary-valued weak classifiers that is described in [14]. We set the number of training examples to 1000, and the labels are corrupted with a noise rate η of 0%, 5%, and 20% respectively. Examples in this setting are binary vectors of length 2l + 11. Table 3 reports the error rates on a clean test data set of size 5000, that is, the labels of the test data are uncorrupted; a clean data set of the same size is generated as validation data. AdaBoost performs very poorly on this problem. This result is not surprising at all, since [14] designed this example on purpose to expose the inadequacy of convex optimization methods.

Table 3: Percent test errors of AdaBoost (AB), LogitBoost (LB), LPBoost (LPB), BrownBoost (BB), DirectBoostǫavg (DBǫavg), and DirectBoostorder (DBorder) on Long and Servedio's example with random noise.

  l    η     AB    LB    LPB   BB    DBǫavg  DBorder
  5    0     0     0     0     0     0       0
  5    0.05  17.6  0     0     1.2   0.6     0
  5    0.2   24.2  23.4  14.5  2.2   15.0    0
  20   0     0     0     0     0     0       0
  20   0.05  30.0  29.6  27.0  25.4  19.6    0
  20   0.2   29.9  30.0  29.8  29.6  24.7    3.2

Table 4: Percent test errors of AdaBoost (AB), LogitBoost (LB), LPBoost (LPB), BrownBoost (BB), DirectBoostǫavg (DBǫavg), and DirectBoostorder (DBorder) on two UCI datasets with random noise.

  data   η     AB    LB    LPB   BB    DBǫavg  DBorder
  wdbc   0     4.3   4.4   4.0   4.5   4.1     3.7
  wdbc   0.05  6.6   6.8   4.9   6.5   5.0     5.0
  wdbc   0.2   8.8   8.8   7.6   8.3   8.4     6.6
  Iono.  0     9.7   9.7   8.6   8.3   8.8     7.7
  Iono.  0.05  10.3  12.3  9.3   11.5  9.3     8.6
  Iono.  0.2   16.6  15.0  14.6  14.4  17.9    9.5
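Every method in these experiments has one free parameter (BrownBoost's cutoff, LPBoost's regularization parameter, DirectBoost's n′) chosen on the held-out clean validation set. That protocol is just a grid search; the sketch below is illustrative, with train_and_eval a hypothetical stand-in that trains a model for one candidate value and returns its validation error:

```python
def select_by_validation(candidates, train_and_eval):
    """Return the candidate value with the lowest validation error."""
    best_value, best_err = None, float("inf")
    for v in candidates:
        err = train_and_eval(v)   # train with this setting, score on validation
        if err < best_err:
            best_value, best_err = v, err
    return best_value, best_err

# Toy usage over the n'/n grid from the text, with a fake error curve
# minimized at n'/n = 0.1.
grid = [0.01, 0.05, 0.1, 0.2, 0.5, 0.8]
value, err = select_by_validation(grid, lambda r: abs(r - 0.1))
print(value, err)  # 0.1 0.0
```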
LogitBoost, LPBoost with column generation, and DirectBoostǫavg perform better in the case that l = 5 and η = 5%, but in the other cases they do as badly as AdaBoost. BrownBoost is designed for noise tolerance and does well in the case of l = 5, but it also cannot handle the case of l = 20 with η > 0%. On the other hand, DirectBoostorder performs very well in all cases, demonstrating its impressive noise tolerance: the most difficult examples are given up without any penalty.

These algorithms are also tested on two UCI datasets, with the training data randomly corrupted by additional label noise at rates of 5% and 20% respectively. Again, we keep the validation and test data clean. The results, obtained by five-fold cross validation as in Experiment 1, are reported in Table 4. LPBoost with column generation, DirectBoostǫavg and DirectBoostorder do well in the case of η = 5%, and their performance is better than that of AdaBoost, LogitBoost, and BrownBoost. For the case of η = 20%, all the algorithms perform much worse than in the corresponding noise-free case, except DirectBoostorder, which still achieves performance close to the noise-free case.

4 Acknowledgements
This research is supported in part by AFOSR under grant FA9550-10-1-0335, NSF under grant IIS:RI-small 1218863, DoD under grant FA2386-13-1-3023, and a Google research award.

References

[1] P. Bartlett and M. Traskin. AdaBoost is consistent. Journal of Machine Learning Research, 8:2347–2368, 2007.
[2] M. Bazaraa, H. Sherali and C. Shetty. Nonlinear Programming: Theory and Algorithms, 3rd Edition. Wiley-Interscience, 2006.
[3] D. P. Bertsekas. A distributed algorithm for the assignment problem. Technical Report, MIT, 1979.
[4] D. Bertsekas. Network Optimization: Continuous and Discrete Models. Athena Scientific, 1998.
[5] S. Boyd and L. Vandenberghe. Convex Optimization.
Cambridge University Press, 2004.
[6] A. Demiriz, K. Bennett and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46:225–254, 2002.
[7] L. Devroye, L. Györfi and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
[8] A. Frank and A. Asuncion. UCI Machine Learning Repository. School of Information and Computer Science, University of California at Irvine, 2006.
[9] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[10] Y. Freund. An adaptive version of the boost by majority algorithm. Machine Learning, 43(3):293–318, 2001.
[11] J. Friedman, T. Hastie and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–374, 2000.
[12] K. Glocer. Entropy regularization and soft margin maximization. Ph.D. Dissertation, UCSC, 2009.
[13] K. Hoffgen, H. Simon and K. van Horn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50(1):114–125, 1995.
[14] P. Long and R. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78:287–304, 2010.
[15] E. Mammen and A. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27:1808–1829, 1999.
[16] D. McAllester, T. Hazan and J. Keshet. Direct loss minimization for structured prediction. Neural Information Processing Systems (NIPS), 1594–1602, 2010.
[17] R. Schapire, Y. Freund, P. Bartlett and W. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
[18] R. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
[19] S. Shalev-Shwartz and Y. Singer.
On the equivalence of weak learnability and linear separability: new relaxations and efficient boosting algorithms. Machine Learning, 80(2-3):141–163, 2010.
[20] I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142, 2005.
[21] P. Tseng and D. Bertsekas. Relaxation methods for strictly convex costs and linear constraints. Mathematics of Operations Research, 16:462–481, 1991.
[22] P. Tseng. Convergence of block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
[23] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
[24] V. Vapnik. Statistical Learning Theory. John Wiley, 1998.
[25] M. Warmuth, K. Glocer and G. Ratsch. Boosting algorithms for maximizing the soft margin. Advances in Neural Information Processing Systems (NIPS), 21, 1585–1592, 2007.
[26] M. Warmuth, K. Glocer and S. Vishwanathan. Entropy regularized LPBoost. The 19th International Conference on Algorithmic Learning Theory (ALT), 256–271, 2008.
[27] S. Zhai, T. Xia, M. Tan and S. Wang. Direct 0-1 loss minimization and margin maximization with boosting. Technical Report, 2013.
[28] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–85, 2004.
[29] T. Zhang and B. Yu. Boosting with early stopping: Convergence and consistency.
The Annals of Statistics, 33:1538–1579, 2005.