{"title": "Multiclass Boosting: Theory and Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 2124, "page_last": 2132, "abstract": "The problem of multiclass boosting is considered. A new framework, based on multi-dimensional codewords and predictors, is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets.", "full_text": "Multiclass Boosting: Theory and Algorithms

Mohammad J. Saberian
Statistical Visual Computing Laboratory,
University of California, San Diego
saberian@ucsd.edu

Nuno Vasconcelos
Statistical Visual Computing Laboratory,
University of California, San Diego
nuno@ucsd.edu

Abstract

The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors, is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly.
The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets.

1 Introduction

Boosting is a popular approach to classifier design in machine learning. It is a simple and effective procedure to combine many weak learners into a strong classifier. However, most existing boosting methods were designed primarily for binary classification. In many cases, the extension to M-ary problems (M > 2) is not straightforward. Nevertheless, the design of multi-class boosting algorithms has been investigated since the introduction of AdaBoost in [8].

Two main approaches have been attempted. The first is to reduce the multiclass problem to a collection of binary sub-problems. Methods in this class include the popular "one vs all" approach, or methods such as "all pairs", ECOC [4, 1], AdaBoost-M2 [7], AdaBoost-MR [18] and AdaBoost-MH [18, 9]. The binary reduction can have various problems, including 1) increased complexity, 2) lack of guarantees of an optimal joint predictor, 3) reliance on data representations, such as adding one extra dimension that includes class numbers to each data point [18, 9], that may not necessarily enable effective binary discrimination, or 4) using binary boosting scores that do not represent true class probabilities [15]. The second approach is to boost an M-ary classifier directly, using multiclass weak learners, such as trees. Methods of this type include AdaBoost-M1 [7], SAMME [12] and AdaBoost-Cost [16]. These methods require strong weak learners, which substantially increases complexity and the potential for overfitting.
This is particularly problematic because, although there is a unified view of these methods under the game theory interpretation of boosting [16], none of them has been shown to maximize the multiclass margin. Overall, the problem of optimal and efficient M-ary boosting is still not as well understood as its binary counterpart.

In this work, we introduce a new formulation of multi-class boosting, based on 1) an alternative definition of the margin for M-ary problems, 2) a new loss function, 3) an optimal set of codewords, and 4) the statistical view of boosting, which leads to a convex optimization problem in a multidimensional functional space. We propose two algorithms to solve this optimization: CD-MCBoost, which is a functional coordinate descent procedure, and GD-MCBoost, which implements functional gradient descent. The two algorithms differ in terms of the strategy used to update the multidimensional predictor. CD-MCBoost supports any type of weak learners, updating one component of the predictor per boosting iteration; GD-MCBoost requires multiclass weak learners but updates all components simultaneously. Both methods directly optimize the predictor of the multiclass problem and are shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the classification risk. They also reduce to AdaBoost for binary problems. Experiments show that they outperform comparable prior methods on a number of datasets.

2 Multiclass boosting

We start by reviewing the fundamental ideas behind the classical use of boosting for the design of binary classifiers, and then extend these ideas to the multiclass setting.

2.1 Binary classification

A binary classifier, F(x), is a mapping from examples x ∈ X to class labels y ∈ {−1, 1}.
The optimal classifier, in the minimum probability of error sense, is the Bayes decision rule

F(x) = arg max_{y ∈ {−1,1}} P_{Y|X}(y|x).  (1)

This can be hard to implement, due to the difficulty of estimating the probabilities P_{Y|X}(y|x). This difficulty is avoided by large margin methods, such as boosting, which implement the classifier as

F(x) = sign[f*(x)]  (2)

where f*(x) : X → R is the continuous valued predictor

f*(x) = arg min_f R(f)  (3)

that minimizes the classification risk

R(f) = E_{X,Y}{L[y, f(x)]}  (4)

associated with a loss function L[·, ·]. In practice, the optimal predictor is learned from a sample D = {(x_i, y_i)}_{i=1}^{n} of training examples, and (4) is approximated by the empirical risk

R(f) ≈ Σ_{i=1}^{n} L[y_i, f(x_i)].  (5)

The loss L[·, ·] is said to be Bayes consistent if (1) and (2) are equivalent. For large margin methods, such as boosting, the loss is also a function of the classification margin yf(x), i.e.

L[y, f(x)] = φ(yf(x))  (6)

for some non-negative function φ(·). This dependence on the margin yf(x) guarantees that the classifier has good generalization when the training sample is small [19]. Boosting learns the optimal predictor f*(x) : X → R as the solution of

min_{f(x)} R(f)   s.t.   f(x) ∈ span(H),  (7)

where H = {h_1(x), ..., h_p(x)} is a set of weak learners h_i(x) : X → R, and the optimization is carried out by gradient descent in the functional space span(H) of linear combinations of h_i(x) [14].

2.2 Multiclass setting

To extend the above formulation to the multiclass setting, we note that the definition of the classification labels as ±1 plays a significant role in the formulation of the binary case.
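For concreteness, the binary objects of (2)-(7) can be sketched in a few lines of code: the exponential loss φ(v) = e^{−v} of AdaBoost as the margin loss, the empirical risk of (5), and the decision rule of (2). This is an illustrative sketch with our own function and array names, not code from the paper.

```python
import numpy as np

# Empirical risk of (5) under the exponential margin loss phi(v) = exp(-v),
# i.e. AdaBoost's loss. y holds labels in {-1, +1}; f_x holds the predictor
# values f(x_i). (Illustrative sketch; names are ours.)

def empirical_risk(y, f_x):
    """R(f) ~= sum_i phi(y_i f(x_i)) with phi(v) = exp(-v)."""
    margins = y * f_x                 # classification margins y_i f(x_i)
    return np.exp(-margins).sum()

def classify(f_x):
    """F(x) = sign[f*(x)], the large margin decision rule of (2)."""
    return np.sign(f_x)

y = np.array([+1, -1, +1])
f_x = np.array([2.0, -1.0, 0.5])      # all three examples correctly classified
assert (classify(f_x) == y).all()
# Larger margins mean smaller risk: scaling up a correct predictor reduces R(f).
assert empirical_risk(y, 2 * f_x) < empirical_risk(y, f_x)
```

The second assertion illustrates why the loss is called margin enforcing: the risk keeps decreasing as the margins of correctly classified examples grow.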
One of the difficulties of the multiclass extension is that these labels do not have an immediate extension to the multiclass setting. To address this problem, we return to the classical setting, where the class labels of an M-ary problem take values in the set {1, . . . , M}. Each class k is then mapped into a distinct class label y_k, which can be thought of as a codeword that identifies the class.

In the binary case, these codewords are defined as y_1 = 1 and y_2 = −1. It is possible to derive an alternative form for the expressions of the margin and classifier F(x) that depends explicitly on codewords. For this, we note that (2) can be written as

F(x) = arg max_k y_k f*(x)  (8)

and the margin can be expressed as

yf = { f if k = 1; −f if k = 2 } = { (1/2)(y_1 f − y_2 f) if k = 1; (1/2)(y_2 f − y_1 f) if k = 2 } = (1/2)(y_k f − max_{l≠k} y_l f).  (9)

The interesting property of these forms is that they are directly extensible to the M-ary classification case. For this, we assume that the codewords y_k and the predictor f(x) are multi-dimensional, i.e. y_k, f(x) ∈ R^d, for some dimension d which we will discuss in greater detail in the following section. The margin of f(x) with respect to class k is then defined as

M(f(x), y_k) = (1/2)[<f(x), y_k> − max_{l≠k} <f(x), y_l>]  (10)

and the classifier as

F(x) = arg max_k <f(x), y_k>,  (11)

where <·, ·> is the standard dot-product. Note that this is equivalent to

F(x) = arg max_{k ∈ {1,...,M}} M(f(x), y_k),  (12)

and thus F(x) is the class of largest margin for the predictor f(x). This definition is closely related to previous notions of multiclass margin.
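The margin of (10) and the decision rule of (11) are easy to state in code. The sketch below is ours (variable names included) and checks that, with the binary codewords y_1 = +1 and y_2 = −1, the definition recovers the usual binary margin yf.

```python
import numpy as np

# Multiclass margin (10) and decision rule (11) for multi-dimensional
# codewords. Y is an M x d matrix whose rows are the codewords y_k;
# f is a predictor output in R^d. (Illustrative sketch; names are ours.)

def margin(f, Y, k):
    """M(f(x), y_k) = 1/2 [<f, y_k> - max_{l != k} <f, y_l>]."""
    dots = Y @ f                        # all projections <f, y_l>
    return 0.5 * (dots[k] - np.delete(dots, k).max())

def classify(f, Y):
    """F(x) = argmax_k <f(x), y_k>, the class of largest margin."""
    return int(np.argmax(Y @ f))

# Binary case with y1 = +1, y2 = -1 recovers the usual margin yf:
Y2 = np.array([[1.0], [-1.0]])
f = np.array([0.8])
assert classify(f, Y2) == 0                  # class 1 (label +1)
assert np.isclose(margin(f, Y2, 0), 0.8)     # equals y1 * f
```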
For example, it generalizes that of [11], where the codewords y_k are restricted to the binary vectors in the canonical basis of R^d, and is a special case of that in [1], where the dot products <f(x), y_k> are replaced by a generic function of f, x, and k. Given a training sample D = {(x_i, y_i)}_{i=1}^{n}, the optimal predictor f*(x) minimizes the risk

R_M(f) = E_{X,Y}{L_M[y, f(x)]} ≈ Σ_{i=1}^{n} L_M[y_i, f(x_i)]  (13)

where L_M[·, ·] is a multiclass loss function. A natural extension of (6) and (9) is a loss of the form

L_M[y, f(x)] = φ(M(f(x), y)).  (14)

To avoid the nonlinearity of the max operator in (10), we rely on

L_M[y, f(x)] = Σ_{k=1}^{M} e^{−(1/2)[<f(x), y> − <f(x), y_k>]},  (15)

which is shown, in Appendix A, to upper bound 1 + e^{−M(f(x),y)}. It follows that the minimization of the risk of (13) encourages predictors of large margin M(f*(x_i), y_i), ∀i. For M = 2, L_M[y, f(x)] reduces to

L_2[y, f(x)] = 1 + e^{−yf(x)}  (16)

and the risk minimization problem is identical to that of AdaBoost [8]. In appendices B and C it is shown that R_M(f) is convex and Bayes consistent, in the sense that if f*(x) is the minimizer of (13), then

<f*(x), y_k> = log P_{Y|X}(y_k|x) + c  ∀k  (17)

and (11) implements the Bayes decision rule

F(x) = arg max_k P_{Y|X}(y_k|x).  (18)

2.3 Optimal set of codewords

From (15), the choice of codewords y_k has an impact on the optimal predictor f*(x), which is determined by the projections <f*(x), y_k>. To maximize the margins of (10), the difference between these projections should be as large as possible. To accomplish this we search for the set of M distinct unit codewords Y = {y1, . . .
, yM} ∈ R^d that are as dissimilar as possible,

max_{d, y_1,...,y_M}  min_{i≠j} ||y_i − y_j||^2
s.t.  ||y_k|| = 1,  y_k ∈ R^d,  ∀k = 1, ..., M.  (19)

Figure 1: Optimal codewords for M = 2, 3, 4.

To solve this problem, we start by noting that, for d < M, the smallest distance of (19) can be increased by simply increasing d, since this leads to a larger space. On the other hand, since M points y_1, ..., y_M lie in an, at most, M − 1 dimensional subspace of R^d, e.g. any three points belong to a plane, there is no benefit in increasing d beyond M − 1. On the contrary, as shown in Appendix D, if d > M − 1 there exists a vector v ∈ R^d with equal projection on all codewords,

<y_i, v> = <y_j, v>  ∀i, j = 1, ..., M.  (20)

Since the addition of v to the predictor f(x) does not change the classification rule of (11), this makes the optimal predictor underdetermined. To avoid this problem, we set d = M − 1. In this case, as shown in Appendix E, the vertices of an M − 1 dimensional regular simplex1 centered at the origin [3] are solutions of (19). Figure 1 presents the set of optimal codewords for M = 2, 3, 4. Note that in the binary case this set consists of the traditional codewords y_i ∈ {+1, −1}. In general, there is no closed form solution for the vertices of a regular simplex of M vectors. However, these can be derived from those of a regular simplex of M − 1 vectors, and a recursive solution is possible [3].

3 Risk minimization

We have so far defined a proper margin loss function for M-ary classification and identified an optimal codebook.
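One concrete solution of (19) for d = M − 1 takes the M canonical basis vectors of R^M, centers them at the origin, expresses them in an orthonormal basis of the resulting M − 1 dimensional subspace, and normalizes. The construction below is our own sketch of this recipe, not the authors' code; it verifies the defining simplex properties (unit norm, equal pairwise distances).

```python
import numpy as np

def simplex_codewords(M):
    """M unit codewords in R^(M-1): vertices of a regular (M-1)-simplex
    centered at the origin, one solution of (19). (Our own construction.)"""
    E = np.eye(M) - 1.0 / M            # centered canonical basis, rank M-1
    U, _, _ = np.linalg.svd(E)         # top M-1 left vectors span that subspace
    coords = U[:, :M - 1].T @ E        # (M-1) x M coordinates of the vertices
    Y = coords.T
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

Y = simplex_codewords(3)
assert np.allclose(np.linalg.norm(Y, axis=1), 1.0)      # unit codewords
dists = [np.linalg.norm(Y[i] - Y[j])                    # equal pairwise
         for i in range(3) for j in range(i + 1, 3)]    # distances
assert np.allclose(dists, dists[0])
```

For M = 2 this reduces to the traditional codewords ±1 (up to a sign), and in general the pairwise inner products of the codewords are −1/(M − 1).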
In this section, we derive two boosting algorithms for the minimization of the classification risk of (13). These algorithms are both based on the GradientBoost framework [14]. The first is a functional coordinate descent algorithm, which updates a single component of the predictor per boosting iteration. The second is a functional gradient descent algorithm that updates all components simultaneously.

3.1 Coordinate descent

In the first method, each component f_i^*(x) of the optimal predictor f*(x) = [f_1^*(x), ..., f_{M−1}^*(x)] is the linear combination of weak learners that solves the optimization problem

min_{f_1(x),...,f_{M−1}(x)}  R([f_1(x), ..., f_{M−1}(x)])   s.t.   f_j(x) ∈ span(H)  ∀j = 1, ..., M − 1,  (21)

where H = {h_1(x), ..., h_p(x)} is a set of weak learners, h_i(x) : X → R. These can be stumps, regression trees, or members of any other suitable model family. We denote by f^t(x) = [f_1^t(x), ..., f_{M−1}^t(x)] the predictor available after t boosting iterations. At iteration t + 1 a single component f_j(x) of f(x) is updated with a step in the direction of the scalar functional g that most decreases the risk R[f_1^t, ..., f_j^t + α_j^* g, ..., f_{M−1}^t]. For this, we consider the functional derivative of R[f(x)] along the direction of the functional g : X → R, at point f(x) = f^t(x), with respect to the jth component f_j(x) of f(x) [10],

δR[f^t; j, g] = ∂R[f^t + εg 1_j]/∂ε |_{ε=0},  (22)

where 1_j ∈ R^d is a vector whose jth element is one and the remainder zero, i.e. f^t + εg 1_j = [f_1^t, ..., f_j^t + εg, ..., f_{M−1}^t].

1A regular M − 1 dimensional simplex is the convex hull of M normal vectors which have equal pair-wise distances.
Using the risk of (13), it is shown in Appendix F that

−δR[f^t; j, g] = Σ_{i=1}^{n} g(x_i) w_i^j,  (23)

with

w_i^j = (1/2) e^{−(1/2)<f^t(x_i), y_i>} Σ_{k=1}^{M} <1_j, y_i − y_k> e^{(1/2)<f^t(x_i), y_k>}.  (24)

The direction of greatest risk decrease is the weak learner

g_j^*(x) = arg max_{g ∈ H} Σ_{i=1}^{n} g(x_i) w_i^j,  (25)

and the optimal step size along this direction is

α_j^* = arg min_{α ∈ R} R[f^t(x) + α g_j^*(x) 1_j].  (26)

The classifier is thus updated as

f^{t+1} = f^t(x) + α_j^* g_j^*(x) 1_j = [f_1^t, ..., f_j^t + α_j^* g_j^*, ..., f_{M−1}^t].  (27)

This procedure is summarized in Algorithm 1-left and denoted CD-MCBoost. It starts with f^0(x) = 0 ∈ R^d and updates the predictor components sequentially. Note that, since (13) is a convex function of f(x), it converges to the global minimum of the risk.

3.2 Gradient descent

Alternatively, (13) can be minimized by learning a linear combination of multiclass weak learners. In this case, the optimization problem is

min_{f(x)} R[f(x)]   s.t.   f(x) ∈ span(H),  (28)

where H = {h_1(x), ..., h_p(x)} is a set of multiclass weak learners, h_i(x) : X → R^{M−1}, such as decision trees. Note that, to fit tree classifiers in this definition, their output (usually a class number) should be translated into a class codeword. As before, let f^t(x) ∈ R^{M−1} be the predictor available after t boosting iterations. At iteration t + 1 a step is given along the direction g(x) ∈ H of largest decrease of the risk R[f(x)].
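The weights of (24) drive the coordinate descent step: each w_i^j measures how much a unit change of component j at x_i reduces the risk. A vectorized sketch of (24) follows; this is our own code and naming, not the release of [2].

```python
import numpy as np

# Coordinate-descent weights of (24):
#   w_i^j = 1/2 e^{-1/2 <f(x_i), y_i>} sum_k <1_j, y_i - y_k> e^{1/2 <f(x_i), y_k>}.
# F: n x (M-1) current predictions f^t(x_i); Yc: n x (M-1) codewords y_i of
# the examples; Y: M x (M-1) matrix of all codewords. (Sketch; names are ours.)

def cd_weights(F, Yc, Y, j):
    a = np.exp(-0.5 * np.sum(F * Yc, axis=1))       # e^{-1/2 <f(x_i), y_i>}
    b = np.exp(0.5 * F @ Y.T)                       # e^{ 1/2 <f(x_i), y_k>}, n x M
    diff = Yc[:, j:j + 1] - Y[np.newaxis, :, j]     # <1_j, y_i - y_k>,      n x M
    return 0.5 * a * np.sum(diff * b, axis=1)

# Binary sanity check (M = 2, codewords +1/-1, f^0 = 0): the weights push the
# weak learner of (25) toward positive outputs on class 1, negative on class 2.
Y = np.array([[1.0], [-1.0]])
Yc = Y[[0, 1]]                                      # one example of each class
w = cd_weights(np.zeros((2, 1)), Yc, Y, 0)
assert np.allclose(w, [1.0, -1.0])
```

The best weak learner of (25) is then the g ∈ H maximizing the weighted sum Σ_i g(x_i) w_i^j, exactly as in binary AdaBoost.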
For this, we consider the directional functional derivative of R[f(x)] along the direction of the functional g : X → R^{M−1}, at point f(x) = f^t(x),

δR[f^t; g] = ∂R[f^t + εg]/∂ε |_{ε=0}.  (29)

As shown in Appendix G,

−δR[f^t; g] = Σ_{i=1}^{n} <g(x_i), w_i>,  (30)

where w_i ∈ R^{M−1} with

w_i = (1/2) e^{−(1/2)<f^t(x_i), y_i>} Σ_{k=1}^{M} (y_i − y_k) e^{(1/2)<f^t(x_i), y_k>}.  (31)

The direction of greatest risk decrease is the weak learner

g*(x) = arg max_{g ∈ H} Σ_{i=1}^{n} <g(x_i), w_i>,  (32)

and the optimal step size along this direction is

α* = arg min_{α ∈ R} R[f^t(x) + α g*(x)].  (33)

The predictor is updated to f^{t+1}(x) = f^t(x) + α* g*(x). This procedure is summarized in Algorithm 1-right, and denoted GD-MCBoost. Since (13) is convex, it converges to the global minimum of the risk.

Algorithm 1 CD-MCBoost and GD-MCBoost
Input: Number of classes M, set of codewords Y = {y1, . . .
, yM}, number of iterations N, and dataset S = {(x_1, y_1), ..., (x_n, y_n)}, where x_i are examples and y_i ∈ Y are their class codewords.
Initialization: set t = 0 and f^t = 0 ∈ R^{M−1}.

CD-MCBoost:
  while t < N do
    for j = 1 to M − 1 do
      Compute the weights w_i^j with (24)
      Find g_j^*(x) and α_j^* using (25) and (26)
      Update f_j^{t+1}(x) = f_j^t(x) + α_j^* g_j^*(x)
      Update f_k^{t+1}(x) = f_k^t(x) ∀k ≠ j
      t = t + 1
    end for
  end while

GD-MCBoost:
  while t < N do
    Compute the weights w_i with (31)
    Find g*(x) and α* using (32) and (33)
    Update f^{t+1}(x) = f^t(x) + α* g*(x)
    t = t + 1
  end while

Output: decision rule F(x) = arg max_k <f^N(x), y_k>

4 Comparison to previous methods

Multi-dimensional predictors and codewords have been used implicitly [7, 18, 16, 6], or explicitly [12, 9], in all previous multiclass boosting methods.

"one vs all", "all pairs" and "ECOC" [1]: as shown in [1], these methods can be interpreted as assigning a codeword y_k to each class, where y_k ∈ {+1, 0, −1}^l and l = M for "one vs all", l = M(M − 1)/2 for "all pairs", and l is variable for "ECOC", depending on the error correction code. In all these methods, binary classifiers are learned independently for each of the codeword components. This does not guarantee an optimal joint predictor. These methods are similar to CD-MCBoost in the sense that the predictor components are updated individually at each boosting iteration. However, in CD-MCBoost, the codewords are not restricted to {+1, 0, −1} and the predictor components are learned jointly.

AdaBoost-MH [18, 9]: This method converts the M-ary classification problem into a binary one, learned from an M times larger training set, where each example x is augmented with a feature y that identifies a class.
Examples such that x belongs to class y receive binary label 1, while the remaining receive the label −1 [9]. In this way, the binary classifier learns if the multiclass label y is correct for x or not. AdaBoost-MH uses weak learners h_t : X × {1, . . . , M} → R and the decision rule

F̄(x) = arg max_{j ∈ {1,2,...,M}} Σ_t h_t(x, j),  (34)

where t is the iteration number. This is equivalent to the decision rule of (11) if f(x) is an M-dimensional predictor with jth component f_j(x) = Σ_t h_t(x, j), and the label codewords are defined as y_j = 1_j. This method is comparable to CD-MCBoost in the sense that it does not require multiclass weak learners. However, there are no guarantees that the weak learners in common use are able to discriminate the complex classes of the augmented binary problem.

AdaBoost-M1 [7] and AdaBoost-Cost [16]: These methods use multiclass weak learners h_t : X → {1, 2, ..., M} and a classification rule of the form

F̄(x) = arg max_{j ∈ {1,2,...,M}} Σ_{t | h_t(x)=j} α_t,  (35)

where t is the boosting iteration and α_t the coefficient of weak learner h_t(x). This is equivalent to the decision rule of (11) if f(x) is an M-dimensional predictor with jth component f_j(x) = Σ_{t | h_t(x)=j} α_t and label codewords y_j = 1_j. These methods are comparable to GD-MCBoost, in the sense that they update the predictor components simultaneously.
However, they have not been shown to be Bayes consistent, and it is not clear that they can be interpreted as maximizing the multiclass margin.

Figure 2: Classifier predictions of CD-MCBoost, on the test set, after t = 0, 10, 100 boosting iterations.

SAMME [12]: This method explicitly uses M-dimensional predictors with codewords

y_j = (M 1_j − 1)/(M − 1) = [−1/(M−1), ..., 1, ..., −1/(M−1)] ∈ R^M,  (36)

and decision rule

F̄(x) = arg max_{j ∈ {1,2,...,M}} f_j(x).  (37)

Since, as discussed in Section 2.3, the optimal detector is not unique when the predictor is M-dimensional, this algorithm includes the additional constraint Σ_{j=1}^{M} f_j(x) = 0 and solves a constrained optimization problem [12, 9]. It is comparable to GD-MCBoost in the sense that it updates the predictor components simultaneously, but uses the loss function L_SAMME[y_k, f(x)] = e^{−(1/M)<y_k, f(x)>}. Using (36), the minimization of this loss is equivalent to maximizing

M′(f(x), y_k) = f_k(x) − (1/(M − 1)) Σ_{j≠k} f_j(x),  (38)

which is not a proper margin, since M′(f(x), y_k) > 0 does not imply correct classification, i.e. f_k(x) > f_j(x) ∀j ≠ k.
Hence, SAMME does not guarantee a large margin solution for the\nmulticlass problem.\n\nWhen compared to all these methods, MCBoost has the advantage of combining 1) a Bayes consis-\ntent and margin enforcing loss function, 2) an optimal set of codewords, 3) the ability to boost any\ntype of weak learner, 4) guaranteed convergence to the global minimum of (21), for CD-MCBoost, or\n(28), for GD-MCBoost, and 5) equivalence to the classical AdaBoost algorithm for binary problems.\nIt is worth emphasizing that MCBoost can boost any type of weak learners of non-zero directional\nderivative, i.e. non-zero (23) for CD-MCBoost and (30) for GD-MCBoost. This is independent\nof the type of weak learner output, and unlike previous multiclass boosting approaches, which can\nonly boost weak learners of speci\ufb01c output types. Note that, although the weak learner selection\ncriteria of previous approaches can have interesting interpretations, e.g. based on weighted error\nrates [16], these only hold for speci\ufb01c weak learners. Finally, MCBoost extends the de\ufb01nition of\nmargin and loss function to multi-dimensional predictors. The derivation of Section 2 can easily be\ngeneralized to the design of other multiclass boosting algorithms by the use of 1) alternative \u03c6(v)\nfunctions in (14) (e.g. those of the logistic [9] or Tangent [13] losses for increased outlier robustness,\nasymmetric losses for cost-sensitive classi\ufb01cation, etc.), and 2) alternative optimization approaches\n(e.g. Newton\u2019s method [9, 17]).\n\n5 Evaluation\n\nA number of experiments were conducted to evaluate the MCBoost algorithms2.\n\n5.1 Synthetic data\n\nWe start with a synthetic example, for which the optimal decision rule is known. 
This is a three class problem, with two-dimensional Gaussian classes of means [1, 2], [−1, 0], [2, −1] and covariances of [1, 0.5; 0.5, 2], [1, 0.3; 0.3, 1], [0.4, 0.1; 0.1, 0.8], respectively. Training and test sets of 1,000 examples each were randomly sampled and the Bayes rule computed in closed form [5]. The associated Bayes error rate was 11.67% in the training and 11.13% in the test set. A classifier was learned with CD-MCBoost and decision stumps.

2Codes for CD-MCBoost and GD-MCBoost are available from [2].

Table 1: Accuracy of multiclass boosting methods, using decision stumps, on six UCI data sets

method             landsat   letter    pendigit  optdigit  shuttle   isolet
One Vs All         84.80%    50.92%    86.56%    89.93%    87.11%    88.97%
AdaBoost-MH [18]   47.70%    15.73%    24.41%    73.62%    79.16%    66.71%
CD-MCBoost         85.70%    49.60%    89.51%    92.82%    88.01%    91.02%

Table 2: Accuracy of multiclass boosting methods, using trees of max depth 2, on six UCI data sets

method                landsat   letter    pendigit  optdigit  shuttle   isolet
AdaBoost-M1 [7]       72.85%    −         −         −         96.45%    −
AdaBoost-SAMME [12]   79.80%    45.65%    83.82%    87.53%    99.70%    61.00%
AdaBoost-Cost [16]    83.95%    42.00%    80.53%    86.20%    99.55%    63.69%
GD-MCBoost            86.65%    59.65%    92.94%    92.32%    99.73%    84.28%

Figure 2 shows predictions3 of f^t(x) on the test set, for t = 0, 10, 100. Note that f^0(x_i) = [0, 0] for all examples x_i. However, as the iterations proceed, CD-MCBoost produces predictions that are more aligned with the true class codewords, shown as dashed lines, while maximizing the distance between examples of different classes (by increasing their distance to the origin). In this context, "alignment of f(x) with y_k" implies that <f(x), y_k> ≥ <f(x), y_j>, ∀j ≠ k. This combination of alignment and distance maximization results in higher margins, leading to more accurate and robust classification.
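The synthetic setup above is simple to reproduce. The sketch below is our own code, not the authors' release [2]: it samples the three Gaussian classes (means and covariances taken from the text) and evaluates the closed-form Bayes rule [5] under equal class priors, whose error rate lands near the ~11% figures reported.

```python
import numpy as np

# Three-class Gaussian problem of Section 5.1; means and covariances from the
# text, equal priors assumed. (Reproduction sketch, not the authors' code.)
rng = np.random.default_rng(0)
means = [np.array([1., 2.]), np.array([-1., 0.]), np.array([2., -1.])]
covs = [np.array([[1., .5], [.5, 2.]]),
        np.array([[1., .3], [.3, 1.]]),
        np.array([[.4, .1], [.1, .8]])]

def sample(n_per_class):
    X = np.vstack([rng.multivariate_normal(m, C, n_per_class)
                   for m, C in zip(means, covs)])
    return X, np.repeat(np.arange(3), n_per_class)

def bayes_rule(X):
    """argmax_k of the Gaussian log-likelihood (equal priors)."""
    ll = []
    for m, C in zip(means, covs):
        d = X - m
        ll.append(-0.5 * np.einsum('ij,jk,ik->i', d, np.linalg.inv(C), d)
                  - 0.5 * np.log(np.linalg.det(C)))
    return np.argmax(ll, axis=0)

X, y = sample(1000)
err = np.mean(bayes_rule(X) != y)    # close to the reported ~11% Bayes error
assert 0.05 < err < 0.20
```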
The test error rate after 100 iterations of boosting was 11.30%, very close to the Bayes error rate of 11.13%.

5.2 CD-MCBoost

We next conducted a number of experiments to evaluate the performance of CD-MCBoost on the six UCI datasets of Table 1. Among the methods identified as comparable in the previous section, we implemented "one vs all" and AdaBoost-MH [18]. In all cases, decision stumps were used as weak learners, and we used the training/test set decomposition specified for each dataset. The "one vs all" detectors were trained with 20 iterations. The remaining methods were then allowed to include the same number of weak learners in their final decision rules. Table 1 presents the resulting classification accuracies. CD-MCBoost produced the most accurate classifier in five of the six datasets, and was a close second in the remaining one. "One vs all" achieved the next best performance, with AdaBoost-MH producing the worst classifiers.

5.3 GD-MCBoost

Finally, the performance of GD-MCBoost was compared to AdaBoost-M1 [7], AdaBoost-Cost [16] and AdaBoost-SAMME [12]. The experiments were based on the UCI datasets of the previous section, but the weak learners were now trees of depth 2. These were built with a greedy procedure, so as to 1) minimize the weighted error rate of AdaBoost-M1 [7] and AdaBoost-SAMME [12], 2) minimize the classification cost of AdaBoost-Cost [16], or 3) maximize (32) for GD-MCBoost. Table 2 presents the classification accuracy of each method, for 50 training iterations. GD-MCBoost achieved the best accuracy on all datasets, reaching substantially higher classification rates than all other methods on the most difficult datasets, e.g. from a previous best of 63.69% to 84.28% on isolet, 45.65% to 59.65% on letter, and 83.82% to 92.94% on pendigit.
Among the remaining methods, AdaBoost-SAMME achieved the next best performance, although this was close to that of AdaBoost-Cost. AdaBoost-M1 had the worst results, and was not able to boost the weak learners used in this experiment for four of the six datasets. It should be noted that the results of Tables 1 and 2 are not directly comparable, since the classifiers are based on different types of weak learners and have different complexities.

3We emphasize the fact that these are plots of f^t(x) ∈ R^2, not x ∈ R^2.

References

[1] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res., 1:113-141, September 2001.
[2] N. N. Author. Suppressed for anonymity.
[3] H. S. M. Coxeter. Regular Polytopes. Dover Publications, 1973.
[4] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.
[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, 2nd edition, 2001.
[6] G. Eibl and R. Schapire. Multiclass boosting for weak classifiers. Journal of Machine Learning Research, 6:189-210, 2005.
[7] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148-156, 1996.
[8] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[9] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000.
[10] B. A. Frigyik, S. Srivastava, and M. R. Gupta. An introduction to functional derivatives. Technical report, University of Washington, 2008.
[11] Y. Guermeur. 
VC theory of large margin multi-category classifiers. J. Mach. Learn. Res., 8:2551-2594, December 2007.
[12] J. Zhu, H. Zou, S. Rosset, and T. Hastie. Multi-class AdaBoost. Statistics and Its Interface, 2:349-360, 2009.
[13] H. Masnadi-Shirazi, N. Vasconcelos, and V. Mahadevan. On the design of robust classifiers for computer vision. In CVPR, 2010.
[14] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In NIPS, 2000.
[15] D. Mease and A. Wyner. Evidence contrary to the statistical view of boosting. J. Mach. Learn. Res., 9:131-156, June 2008.
[16] I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. In NIPS, 2010.
[17] M. J. Saberian, H. Masnadi-Shirazi, and N. Vasconcelos. TaylorBoost: first and second order boosting algorithms with explicit margin control. In CVPR, 2010.
[18] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Mach. Learn., 37:297-336, December 1999.
[19] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
", "award": [], "sourceid": 1181, "authors": [{"given_name": "Mohammad", "family_name": "Saberian", "institution": null}, {"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}]}