{"title": "Multi-Class Deep Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 2501, "page_last": 2509, "abstract": "We present new ensemble learning algorithms for multi-class classification. Our algorithms can use as a base classifier set a family of deep decision trees or other rich or complex families and yet benefit from strong generalization guarantees. We give new data-dependent learning bounds for convex ensembles in the multi-class classification setting expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. These bounds are finer than existing ones both thanks to an improved dependency on the number of classes and, more crucially, by virtue of a more favorable complexity term expressed as an average of the Rademacher complexities based on the ensemble\u2019s mixture weights. We introduce and discuss several new multi-class ensemble algorithms benefiting from these guarantees, prove positive results for the H-consistency of several of them, and report the results of experiments showing that their performance compares favorably with that of multi-class versions of AdaBoost and Logistic Regression and their L1-regularized counterparts.", "full_text": "Multi-Class Deep Boosting\n\nVitaly Kuznetsov\nCourant Institute\n251 Mercer Street\n\nNew York, NY 10012\n\nvitaly@cims.nyu.edu\n\nMehryar Mohri\n\nCourant Institute & Google Research\n\n251 Mercer Street\n\nNew York, NY 10012\nmohri@cims.nyu.edu\n\nUmar Syed\n\nGoogle Research\n76 Ninth Avenue\n\nNew York, NY 10011\nusyed@google.com\n\nAbstract\n\nWe present new ensemble learning algorithms for multi-class classi\ufb01cation. 
Our\nalgorithms can use as a base classi\ufb01er set a family of deep decision trees or other\nrich or complex families and yet bene\ufb01t from strong generalization guarantees.\nWe give new data-dependent learning bounds for convex ensembles in the multi-\nclass classi\ufb01cation setting expressed in terms of the Rademacher complexities of\nthe sub-families composing the base classi\ufb01er set, and the mixture weight assigned\nto each sub-family. These bounds are \ufb01ner than existing ones both thanks to an\nimproved dependency on the number of classes and, more crucially, by virtue of\na more favorable complexity term expressed as an average of the Rademacher\ncomplexities based on the ensemble\u2019s mixture weights. We introduce and discuss\nseveral new multi-class ensemble algorithms bene\ufb01ting from these guarantees,\nprove positive results for the H-consistency of several of them, and report the\nresults of experiments showing that their performance compares favorably with\nthat of multi-class versions of AdaBoost and Logistic Regression and their L1-\nregularized counterparts.\n\n1\n\nIntroduction\n\nDevising ensembles of base predictors is a standard approach in machine learning which often helps\nimprove performance in practice. Ensemble methods include the family of boosting meta-algorithms\namong which the most notable and widely used one is AdaBoost [Freund and Schapire, 1997],\nalso known as forward stagewise additive modeling [Friedman et al., 1998]. AdaBoost and its\nother variants learn convex combinations of predictors. 
They seek to greedily minimize a convex\nsurrogate function upper bounding the misclassi\ufb01cation loss by augmenting, at each iteration, the\ncurrent ensemble, with a new suitably weighted predictor.\nOne key advantage of AdaBoost is that, since it is based on a stagewise procedure, it can learn\nan effective ensemble of base predictors chosen from a very large and potentially in\ufb01nite family,\nprovided that an ef\ufb01cient algorithm is available for selecting a good predictor at each stage. Fur-\nthermore, AdaBoost and its L1-regularized counterpart [R\u00a8atsch et al., 2001a] bene\ufb01t from favorable\nlearning guarantees, in particular theoretical margin bounds [Schapire et al., 1997, Koltchinskii and\nPanchenko, 2002]. However, those bounds depend not just on the margin and the sample size, but\nalso on the complexity of the base hypothesis set, which suggests a risk of over\ufb01tting when using too\ncomplex base hypothesis sets. And indeed, over\ufb01tting has been reported in practice for AdaBoost in\nthe past [Grove and Schuurmans, 1998, Schapire, 1999, Dietterich, 2000, R\u00a8atsch et al., 2001b].\nCortes, Mohri, and Syed [2014] introduced a new ensemble algorithm, DeepBoost, which they\nproved to bene\ufb01t from \ufb01ner learning guarantees, including favorable ones even when using as base\nclassi\ufb01er set relatively rich families, for example a family of very deep decision trees, or other simi-\nlarly complex families. In DeepBoost, the decisions in each iteration of which classi\ufb01er to add to the\nensemble and which weight to assign to that classi\ufb01er, depend on the (data-dependent) complexity\n\n1\n\n\fof the sub-family to which the classi\ufb01er belongs \u2013 one interpretation of DeepBoost is that it applies\nthe principle of structural risk minimization to each iteration of boosting. 
Cortes, Mohri, and Syed\n[2014] further showed that empirically DeepBoost achieves a better performance than AdaBoost,\nLogistic Regression, and their L1-regularized variants. The main contribution of this paper is an\nextension of these theoretical, algorithmic, and empirical results to the multi-class setting.\nTwo distinct approaches have been considered in the past for the de\ufb01nition and the design of boosting\nalgorithms in the multi-class setting. One approach consists of combining base classi\ufb01ers mapping\neach example x to an output label y. This includes the SAMME algorithm [Zhu et al., 2009] as\nwell as the algorithm of Mukherjee and Schapire [2013], which is shown to be, in a certain sense,\noptimal for this approach. An alternative approach, often more \ufb02exible and more widely used in\napplications, consists of combining base classi\ufb01ers mapping each pair (x, y) formed by an example\nx and a label y to a real-valued score. This is the approach adopted in this paper, which is also\nthe one used for the design of AdaBoost.MR [Schapire and Singer, 1999] and other variants of that\nalgorithm.\nIn Section 2, we prove a novel generalization bound for multi-class classi\ufb01cation ensembles that\ndepends only on the Rademacher complexity of the hypothesis classes to which the classi\ufb01ers in the\nensemble belong. Our result generalizes the main result of Cortes et al. [2014] to the multi-class set-\nting, and also represents an improvement on the multi-class generalization bound due to Koltchinskii\nand Panchenko [2002], even if we disregard our \ufb01ner analysis related to Rademacher complexity. In\nSection 3, we present several multi-class surrogate losses that are motivated by our generalization\nbound, and discuss and compare their functional and consistency properties. 
In particular, we prove that our surrogate losses are realizable H-consistent, a hypothesis-set-specific notion of consistency that was recently introduced by Long and Servedio [2013]. Our results generalize those of Long and Servedio [2013] and admit simpler proofs. We also present a family of multi-class DeepBoost learning algorithms based on each of these surrogate losses, and prove a general convergence guarantee for them. In Section 4, we report the results of experiments demonstrating that multi-class DeepBoost outperforms AdaBoost.MR and multinomial (additive) logistic regression, as well as their L1-norm regularized variants, on several datasets.\n\n2 Multi-class data-dependent learning guarantee for convex ensembles\n\nIn this section, we present a data-dependent learning bound in the multi-class setting for convex ensembles based on multiple base hypothesis sets. Let X denote the input space. We denote by Y = {1, . . . , c} a set of c \u2265 2 classes. The label associated by a hypothesis f : X \u00d7 Y \u2192 R to x \u2208 X is given by argmax_{y \u2208 Y} f(x, y). The margin \u03c1_f(x, y) of the function f for a labeled example (x, y) \u2208 X \u00d7 Y is defined by\n\n\u03c1_f(x, y) = f(x, y) \u2212 max_{y' \u2260 y} f(x, y').   (1)\n\nThus, f misclassifies (x, y) iff \u03c1_f(x, y) \u2264 0. We consider p families H1, . . . , Hp of functions mapping from X \u00d7 Y to [0, 1] and the ensemble family F = conv(\u222a_{k=1}^p Hk), that is, the family of functions f of the form f = \u2211_{t=1}^T \u03b1_t h_t, where \u03b1 = (\u03b11, . . . , \u03b1T) is in the simplex \u2206 and where, for each t \u2208 [1, T], h_t is in H_{k_t} for some k_t \u2208 [1, p]. We assume that training and test points are drawn i.i.d. according to some distribution D over X \u00d7 Y and denote by S = ((x1, y1), . . . , (xm, ym)) a training sample of size m drawn according to D^m. 
For any \u03c1 > 0, the generalization error R(f), its \u03c1-margin error R_\u03c1(f) and its empirical margin error are defined as follows:\n\nR(f) = E_{(x,y)\u223cD}[1_{\u03c1_f(x,y) \u2264 0}],   R_\u03c1(f) = E_{(x,y)\u223cD}[1_{\u03c1_f(x,y) \u2264 \u03c1}],   and   R\u0302_{S,\u03c1}(f) = E_{(x,y)\u223cS}[1_{\u03c1_f(x,y) \u2264 \u03c1}],   (2)\n\nwhere the notation (x, y) \u223c S indicates that (x, y) is drawn according to the empirical distribution defined by S. For any family of hypotheses G mapping X \u00d7 Y to R, we define \u03a01(G) by\n\n\u03a01(G) = {x \u21a6 h(x, y) : y \u2208 Y, h \u2208 G}.   (3)\n\nThe following theorem gives a margin-based Rademacher complexity bound for learning with ensembles of base classifiers with multiple hypothesis sets. As with other Rademacher complexity learning guarantees, our bound is data-dependent, which is an important and favorable characteristic of our results.\n\nTheorem 1. Assume p > 1 and let H1, . . . , Hp be p families of functions mapping from X \u00d7 Y to [0, 1]. Fix \u03c1 > 0. Then, for any \u03b4 > 0, with probability at least 1 \u2212 \u03b4 over the choice of a sample S of size m drawn i.i.d. according to D, the following inequality holds for all f = \u2211_{t=1}^T \u03b1_t h_t \u2208 F:\n\nR(f) \u2264 R\u0302_{S,\u03c1}(f) + (8c/\u03c1) \u2211_{t=1}^T \u03b1_t R_m(\u03a01(H_{k_t})) + (2/\u03c1) \u221a(log p / m) + \u221a( \u2308(4/\u03c1\u00b2) log(c\u00b2\u03c1\u00b2m / (4 log p))\u2309 (log p)/m + log(2/\u03b4)/(2m) ).\n\nThus, R(f) \u2264 R\u0302_{S,\u03c1}(f) + (8c/\u03c1) \u2211_{t=1}^T \u03b1_t R_m(\u03a01(H_{k_t})) + O( (1/\u03c1) \u221a( (log p)/m \u00b7 log[\u03c1\u00b2c\u00b2m / (log p)] ) ).\n\nThe full proof of Theorem 1 is given in Appendix B. Even for p = 1, that is for the special case of a single hypothesis set, our analysis improves upon the multi-class margin bound of Koltchinskii and Panchenko [2002] since our bound admits only a linear dependency on the number of classes c instead of a quadratic one. 
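As a small numerical illustration of the quantities appearing in the bound, the following sketch (not the authors' code; the scores, labels, mixture weights, and complexity values are all hypothetical) computes the margin of Eq. (1), the empirical margin error of Eq. (2), and the weighted complexity term \u2211_t \u03b1_t R_m(\u03a01(H_{k_t})):

```python
# Sketch of the multi-class margin, the empirical rho-margin error, and the
# weighted complexity term of Theorem 1. All numbers below are hypothetical.

def margin(scores, y):
    """rho_f(x, y) = f(x, y) - max_{y' != y} f(x, y'), with scores[y] = f(x, y)."""
    others = [s for label, s in enumerate(scores) if label != y]
    return scores[y] - max(others)

def empirical_margin_error(all_scores, labels, rho):
    """Fraction of sample points with rho_f(x, y) <= rho (hat-R_{S,rho} in Eq. (2))."""
    m = len(labels)
    return sum(1 for s, y in zip(all_scores, labels) if margin(s, y) <= rho) / m

def weighted_complexity(alphas, complexities):
    """Second term of the bound, up to the 8c/rho factor: sum_t alpha_t * R_m(...)."""
    return sum(a * r for a, r in zip(alphas, complexities))

# Toy sample: three points, c = 3 classes.
scores = [[0.9, 0.1, 0.0], [0.2, 0.5, 0.4], [0.3, 0.3, 0.4]]
labels = [0, 1, 2]
print(margin(scores[0], 0))                          # 0.8
print(empirical_margin_error(scores, labels, 0.2))
# A complex sub-family (complexity 1.0) with a small mixture weight (0.1)
# contributes little to the complexity term:
print(weighted_complexity([0.9, 0.1], [0.05, 1.0]))  # 0.145
```

The last line illustrates the point made below: a large Rademacher complexity need not hurt the bound when its total mixture weight is small.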
However, the most remarkable benefit of this learning bound is that its complexity term admits an explicit dependency on the mixture coefficients \u03b1_t: it is a weighted average of Rademacher complexities with mixture weights \u03b1_t, t \u2208 [1, T]. Thus, the second term of the bound suggests that, while some hypothesis sets H_k used for learning could have a large Rademacher complexity, this may not negatively affect generalization if the corresponding total mixture weight (the sum of the \u03b1_ts corresponding to that hypothesis set) is relatively small. Using such potentially complex families could help achieve a better margin on the training sample.\nThe theorem cannot be proven via the standard Rademacher complexity analysis of Koltchinskii and Panchenko [2002], since the complexity term of the bound would then be R_m(conv(\u222a_{k=1}^p H_k)) = R_m(\u222a_{k=1}^p H_k), which does not admit an explicit dependency on the mixture weights and is lower bounded by \u2211_{t=1}^T \u03b1_t R_m(H_{k_t}). Thus, the theorem provides a finer learning bound than the one obtained via a standard Rademacher complexity analysis.\n\n3 Algorithms\n\nIn this section, we will use the learning guarantees just described to derive several new ensemble algorithms for multi-class classification.\n\n3.1 Optimization problem\n\nLet H1, . . . , Hp be p disjoint families of functions taking values in [0, 1] with increasing Rademacher complexities R_m(H_k), k \u2208 [1, p]. For any hypothesis h \u2208 \u222a_{k=1}^p H_k, we denote by d(h) the index of the hypothesis set it belongs to, that is, h \u2208 H_{d(h)}. The bound of Theorem 1 holds uniformly for all \u03c1 > 0 and functions f \u2208 conv(\u222a_{k=1}^p H_k). Since the last term of the bound does not depend on \u03b1, it suggests selecting \u03b1 to minimize\n\nG(\u03b1) = (1/m) \u2211_{i=1}^m 1_{\u03c1_f(x_i,y_i) \u2264 \u03c1} + (8c/\u03c1) \u2211_{t=1}^T \u03b1_t r_t,\n\nwhere r_t = R_m(H_{d(h_t)}) and \u03b1 \u2208 \u2206.[1] Since for any \u03c1 > 0, f and f/\u03c1 admit the same generalization error, we can instead search for \u03b1 \u2265 0 with \u2211_{t=1}^T \u03b1_t \u2264 1/\u03c1, which leads to\n\nmin_{\u03b1 \u2265 0} (1/m) \u2211_{i=1}^m 1_{\u03c1_f(x_i,y_i) \u2264 1} + 8c \u2211_{t=1}^T \u03b1_t r_t   s.t.   \u2211_{t=1}^T \u03b1_t \u2264 1/\u03c1.   (4)\n\nThe first term of the objective is not a convex function of \u03b1 and its minimization is known to be computationally hard. Thus, we will consider instead a convex upper bound. Let u \u21a6 \u03a6(\u2212u) be a non-increasing convex function upper-bounding u \u21a6 1_{u \u2264 0} over R. \u03a6 may be selected to be, for example, the exponential function as in AdaBoost [Freund and Schapire, 1997] or the logistic function. \n\n[1] The condition \u2211_{t=1}^T \u03b1_t = 1 of Theorem 1 can be relaxed to \u2211_{t=1}^T \u03b1_t \u2264 1. To see this, use for example a null hypothesis (h_t = 0 for some t).\n
Using such an upper bound, we obtain the following convex optimization problem:\n\nmin_{\u03b1 \u2265 0} (1/m) \u2211_{i=1}^m \u03a6(1 \u2212 \u03c1_f(x_i, y_i)) + \u03bb \u2211_{t=1}^T \u03b1_t r_t   s.t.   \u2211_{t=1}^T \u03b1_t \u2264 1/\u03c1,   (5)\n\nwhere we introduced a parameter \u03bb \u2265 0 controlling the balance between the magnitude of the values taken by function \u03a6 and the second term.[2] Introducing a Lagrange variable \u03b2 \u2265 0 associated to the constraint in (5), the problem can be equivalently written as\n\nmin_{\u03b1 \u2265 0} (1/m) \u2211_{i=1}^m \u03a6(1 \u2212 min_{y \u2260 y_i} \u2211_{t=1}^T [\u03b1_t h_t(x_i, y_i) \u2212 \u03b1_t h_t(x_i, y)]) + \u2211_{t=1}^T (\u03bb r_t + \u03b2) \u03b1_t.\n\nHere, \u03b2 is a parameter that can be freely selected by the algorithm since any choice of its value is equivalent to a choice of \u03c1 in (5). Since \u03a6 is a non-decreasing function, the problem can be equivalently written as\n\nmin_{\u03b1 \u2265 0} (1/m) \u2211_{i=1}^m max_{y \u2260 y_i} \u03a6(1 \u2212 \u2211_{t=1}^T [\u03b1_t h_t(x_i, y_i) \u2212 \u03b1_t h_t(x_i, y)]) + \u2211_{t=1}^T (\u03bb r_t + \u03b2) \u03b1_t.\n\nLet {h1, . . . , hN} be the set of distinct base functions, and let F_max be the objective function based on that expression:\n\nF_max(\u03b1) = (1/m) \u2211_{i=1}^m max_{y \u2260 y_i} \u03a6(1 \u2212 \u2211_{j=1}^N \u03b1_j h_j(x_i, y_i, y)) + \u2211_{j=1}^N \u039b_j \u03b1_j,   (6)\n\nwith \u03b1 = (\u03b11, . . . , \u03b1N) \u2208 R^N, h_j(x_i, y_i, y) = h_j(x_i, y_i) \u2212 h_j(x_i, y), and \u039b_j = \u03bb r_j + \u03b2 for all j \u2208 [1, N]. Then, our optimization problem can be rewritten as min_{\u03b1 \u2265 0} F_max(\u03b1). 
This defines a convex optimization problem since the domain {\u03b1 \u2265 0} is a convex set and since F_max is convex: each term of the sum in its definition is convex as a pointwise maximum of convex functions (composition of the convex function \u03a6 with an affine function) and the second term is a linear function of \u03b1. In general, F_max is not differentiable even when \u03a6 is, but, since it is convex, it admits a sub-differential at every point. Additionally, along each direction, F_max admits left and right derivatives, both non-increasing, and a differential everywhere except for a set that is at most countable.\n\n3.2 Alternative objective functions\n\nWe now consider the following three natural upper bounds on F_max which admit useful properties that we will discuss later, the third one valid when \u03a6 can be written as the composition of two functions \u03a61 and \u03a62 with \u03a61 a non-decreasing function:\n\nF_sum(\u03b1) = (1/m) \u2211_{i=1}^m \u2211_{y \u2260 y_i} \u03a6(1 \u2212 \u2211_{j=1}^N \u03b1_j h_j(x_i, y_i, y)) + \u2211_{j=1}^N \u039b_j \u03b1_j   (7)\n\nF_maxsum(\u03b1) = (1/m) \u2211_{i=1}^m \u03a6(1 \u2212 \u2211_{j=1}^N \u03b1_j \u03c1_{h_j}(x_i, y_i)) + \u2211_{j=1}^N \u039b_j \u03b1_j   (8)\n\nF_compsum(\u03b1) = (1/m) \u2211_{i=1}^m \u03a61(\u2211_{y \u2260 y_i} \u03a62(1 \u2212 \u2211_{j=1}^N \u03b1_j h_j(x_i, y_i, y))) + \u2211_{j=1}^N \u039b_j \u03b1_j.   (9)\n\nF_sum is obtained from F_max simply by replacing in the definition of F_max the max operator by a sum. Clearly, function F_sum is convex and inherits the differentiability properties of \u03a6. A drawback of F_sum is that for problems with very large c as in structured prediction, the computation of the sum may require resorting to approximations. F_maxsum is obtained from F_max by noticing that, by the sub-additivity of the max operator, the following inequality holds:\n\nmax_{y \u2260 y_i} \u2211_{j=1}^N \u2212\u03b1_j h_j(x_i, y_i, y) \u2264 \u2211_{j=1}^N max_{y \u2260 y_i} \u2212\u03b1_j h_j(x_i, y_i, y) = \u2212 \u2211_{j=1}^N \u03b1_j \u03c1_{h_j}(x_i, y_i).\n\nAs with F_sum, function F_maxsum is convex and admits the same differentiability properties as \u03a6. Unlike F_sum, F_maxsum does not require computing a sum over the classes. Furthermore, note that the expressions \u03c1_{h_j}(x_i, y_i), i \u2208 [1, m], can be pre-computed prior to the application of any optimization algorithm. Finally, for \u03a6 = \u03a61 \u2218 \u03a62 with \u03a61 non-decreasing and \u03a62 taking non-negative values, the max operator can be replaced by a sum before applying \u03a61, as follows:\n\nmax_{y \u2260 y_i} \u03a6(1 \u2212 f(x_i, y_i, y)) = \u03a61(max_{y \u2260 y_i} \u03a62(1 \u2212 f(x_i, y_i, y))) \u2264 \u03a61(\u2211_{y \u2260 y_i} \u03a62(1 \u2212 f(x_i, y_i, y))),\n\nwhere f(x_i, y_i, y) = \u2211_{j=1}^N \u03b1_j h_j(x_i, y_i, y). This leads to the definition of F_compsum.\n\n[2] Note that this is a standard practice in the field of optimization. The optimization problem in (4) is equivalent to a vector optimization problem, where ((1/m) \u2211_{i=1}^m 1_{\u03c1_f(x_i,y_i) \u2264 1}, \u2211_{t=1}^T \u03b1_t r_t) is minimized over \u03b1. The latter problem can be scalarized, leading to the introduction of a parameter \u03bb in (5).\n\nIn Appendix C, we discuss the consistency properties of the loss functions just introduced. 
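The upper-bound relationships among these objectives (F_max \u2264 F_sum and F_max \u2264 F_maxsum, for a non-negative, non-decreasing \u03a6 and \u03b1 \u2265 0) can be checked numerically. The sketch below is not the paper's code: it uses \u03a6 = exp and randomly generated, hypothetical base-classifier scores.

```python
import math
import random

# Numerical check that F_max lower-bounds F_sum and F_maxsum (Eqs. (6)-(8))
# for Phi = exp and non-negative mixture weights. Scores are hypothetical.
random.seed(0)
m, N, c = 5, 3, 4                       # points, base classifiers, classes
h = [[[random.random() for _ in range(c)] for _ in range(m)] for _ in range(N)]
labels = [random.randrange(c) for _ in range(m)]
alpha = [0.5, 0.3, 0.2]                 # alpha_j >= 0
Lam = [0.01, 0.02, 0.03]                # Lambda_j = lambda * r_j + beta (hypothetical)
Phi = math.exp

def diff(j, i, y):                      # h_j(x_i, y_i, y) = h_j(x_i, y_i) - h_j(x_i, y)
    return h[j][i][labels[i]] - h[j][i][y]

def reg():                              # regularization term, common to all objectives
    return sum(L * a for L, a in zip(Lam, alpha))

def F_max():
    return sum(max(Phi(1 - sum(alpha[j] * diff(j, i, y) for j in range(N)))
                   for y in range(c) if y != labels[i]) for i in range(m)) / m + reg()

def F_sum():
    return sum(Phi(1 - sum(alpha[j] * diff(j, i, y) for j in range(N)))
               for i in range(m) for y in range(c) if y != labels[i]) / m + reg()

def F_maxsum():                         # uses the per-classifier margins rho_{h_j}
    total = 0.0
    for i in range(m):
        rho = [min(diff(j, i, y) for y in range(c) if y != labels[i]) for j in range(N)]
        total += Phi(1 - sum(a * r for a, r in zip(alpha, rho)))
    return total / m + reg()

print(F_max() <= F_sum(), F_max() <= F_maxsum())  # True True
```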
In partic-\nular, we prove that the loss functions associated to Fmax and Fsum are realizable H-consistent (see\nLong and Servedio [2013]) in the common cases where the exponential or logistic losses are used\nand that, similarly, in the common case where \u03a61(u) = log(1 + u) and \u03a62(u) = exp(u + 1), the\nloss function associated to Fcompsum is H-consistent.\nFurthermore, in Appendix D, we show that, under some mild assumptions, the objective functions\nwe just discussed are essentially within a constant factor of each other. Moreover, in the case of\nbinary classi\ufb01cation all of these objectives coincide.\n\n3.3 Multi-class DeepBoost algorithms\n\nIn this section, we discuss in detail a family of multi-class DeepBoost algorithms, which are derived\nby application of coordinate descent to the objective functions discussed in the previous paragraphs.\nWe will assume that \u03a6 is differentiable over R and that \u03a60(u) 6= 0 for all u. This condition is not\nnecessary, in particular, our presentation can be extended to non-differentiable functions such as the\nhinge loss, but it simpli\ufb01es the presentation. In the case of the objective function Fmaxsum, we will\nassume that both \u03a61 and \u03a62, where \u03a6 = \u03a61\u25e6\u03a62, are differentiable. Under these assumptions, Fsum,\nFmaxsum, and Fcompsum are differentiable. Fmax is not differentiable due to the presence of the max\noperators in its de\ufb01nition, but it admits a sub-differential at every point.\nFor convenience, let \u03b1t = (\u03b1t,1, . . . , \u03b1t,N )> denote the vector obtained after t \u2265 1 iterations and\nlet \u03b10 = 0. Let ek denote the kth unit vector in RN , k \u2208 [1, N]. For a differentiable objective\nF , we denote by F 0(\u03b1, ej) the directional derivative of F along the direction ej at \u03b1. 
Our coordinate descent algorithm consists of first determining the direction of maximal descent, that is k = argmax_{j \u2208 [1,N]} |F'(\u03b1_{t\u22121}, e_j)|, next of determining the best step \u03b7 along that direction that preserves the non-negativity of \u03b1, \u03b7 = argmin_{\u03b1_{t\u22121}+\u03b7 e_k \u2265 0} F(\u03b1_{t\u22121} + \u03b7 e_k), and of updating \u03b1_{t\u22121} to \u03b1_t = \u03b1_{t\u22121} + \u03b7 e_k. We will refer to this method as projected coordinate descent. The following theorem provides a convergence guarantee for our algorithms in that case.\nTheorem 2. Assume that \u03a6 is twice differentiable and that \u03a6''(u) > 0 for all u \u2208 R. Then, the projected coordinate descent algorithm applied to F converges to the solution \u03b1\u2217 of the optimization min_{\u03b1 \u2265 0} F(\u03b1) for F = F_sum, F = F_maxsum, or F = F_compsum. If additionally \u03a6 is strongly convex over the path of the iterates \u03b1_t, then there exist \u03c4 > 0 and \u03b3 > 0 such that for all t > \u03c4,\n\nF(\u03b1_{t+1}) \u2212 F(\u03b1\u2217) \u2264 (1 \u2212 1/\u03b3)(F(\u03b1_t) \u2212 F(\u03b1\u2217)).   (10)\n\nThe proof is given in Appendix I and is based on the results of Luo and Tseng [1992]. The theorem can in fact be extended to the case where, instead of the best direction, the derivative for the direction selected at each round is within a constant threshold of the best [Luo and Tseng, 1992]. The conditions of Theorem 2 hold in many cases in practice, in particular in the case of the exponential loss (\u03a6 = exp) or the logistic loss (\u03a6(\u2212x) = log2(1 + e^{\u2212x})). In particular, linear convergence is guaranteed in those cases since both the exponential and logistic losses are strongly convex over a compact set containing the converging sequence of \u03b1_ts.\n\nMDEEPBOOSTSUM(S = ((x1, y1), . . . , (xm, ym)))\n1  for i \u2190 1 to m do\n2      for y \u2208 Y \u2212 {y_i} do\n3          D1(i, y) \u2190 1/(m(c \u2212 1))\n4  for t \u2190 1 to T do\n5      k \u2190 argmin_{j \u2208 [1,N]} \u03b5_{t,j} + \u039b_j m/(2 S_t)\n6      if (1 \u2212 \u03b5_{t,k}) e^{\u03b1_{t\u22121,k}} \u2212 \u03b5_{t,k} e^{\u2212\u03b1_{t\u22121,k}} < \u039b_k m/S_t then\n7          \u03b7_t \u2190 \u2212\u03b1_{t\u22121,k}\n8      else \u03b7_t \u2190 log[\u2212\u039b_k m/(2 \u03b5_{t,k} S_t) + \u221a((\u039b_k m/(2 \u03b5_{t,k} S_t))\u00b2 + (1 \u2212 \u03b5_{t,k})/\u03b5_{t,k})]\n9      \u03b1_t \u2190 \u03b1_{t\u22121} + \u03b7_t e_k\n10     S_{t+1} \u2190 \u2211_{i=1}^m \u2211_{y \u2260 y_i} \u03a6'(1 \u2212 \u2211_{j=1}^N \u03b1_{t,j} h_j(x_i, y_i, y))\n11     for i \u2190 1 to m do\n12         for y \u2208 Y \u2212 {y_i} do\n13             D_{t+1}(i, y) \u2190 \u03a6'(1 \u2212 \u2211_{j=1}^N \u03b1_{t,j} h_j(x_i, y_i, y))/S_{t+1}\n14 f \u2190 \u2211_{j=1}^N \u03b1_{T,j} h_j\n15 return f\n\nFigure 1: Pseudocode of the MDeepBoostSum algorithm for both the exponential loss and the logistic loss. The expression of the weighted error \u03b5_{t,j} is given in (12).\n\nWe will refer to the algorithm defined by projected coordinate descent applied to F_sum by MDeepBoostSum, to F_maxsum by MDeepBoostMaxSum, to F_compsum by MDeepBoostCompSum, and to F_max by MDeepBoostMax. In the following, we briefly describe MDeepBoostSum, including its pseudocode. We give a detailed description of all of these algorithms in the supplementary material: MDeepBoostSum (Appendix E), MDeepBoostMaxSum (Appendix F), MDeepBoostCompSum (Appendix G), MDeepBoostMax (Appendix H).\nDefine f_{t\u22121} = \u2211_{j=1}^N \u03b1_{t\u22121,j} h_j. 
Then, F_sum(\u03b1_{t\u22121}) can be rewritten as follows:\n\nF_sum(\u03b1_{t\u22121}) = (1/m) \u2211_{i=1}^m \u2211_{y \u2260 y_i} \u03a6(1 \u2212 f_{t\u22121}(x_i, y_i, y)) + \u2211_{j=1}^N \u039b_j \u03b1_{t\u22121,j}.\n\nFor any t \u2208 [1, T], we denote by D_t the distribution over [1, m] \u00d7 [1, c] defined for all i \u2208 [1, m] and y \u2208 Y \u2212 {y_i} by\n\nD_t(i, y) = \u03a6'(1 \u2212 f_{t\u22121}(x_i, y_i, y)) / S_t,   (11)\n\nwhere S_t is a normalization factor, S_t = \u2211_{i=1}^m \u2211_{y \u2260 y_i} \u03a6'(1 \u2212 f_{t\u22121}(x_i, y_i, y)). For any j \u2208 [1, N] and s \u2208 [1, T], we also define the weighted error \u03b5_{s,j} as follows:\n\n\u03b5_{s,j} = (1/2) [1 \u2212 E_{(i,y)\u223cD_s}[h_j(x_i, y_i, y)]].   (12)\n\nFigure 1 gives the pseudocode of the MDeepBoostSum algorithm. The details of the derivation of the expressions are given in Appendix E. In the special cases of the exponential loss (\u03a6(\u2212u) = exp(\u2212u)) or the logistic loss (\u03a6(\u2212u) = log2(1 + exp(\u2212u))), a closed-form expression is given for the step size (lines 6-8), which is the same in both cases (see Sections E.2.1 and E.2.2). In the generic case, the step size can be found using a line search or other numerical methods.\nThe algorithms presented above have several connections with other boosting algorithms, particularly in the absence of regularization. We discuss these connections in detail in Appendix K.\n\n4 Experiments\n\nThe algorithms presented in the previous sections can be used with a variety of different base classifier sets. For our experiments, we used multi-class binary decision trees. 
A multi-class binary decision tree in dimension d can be defined by a pair (t, h), where t is a binary tree with a variable-threshold question at each internal node, e.g., X_j \u2264 \u03b8, j \u2208 [1, d], and h = (h_l)_{l \u2208 Leaves(t)} a vector of distributions over the leaves Leaves(t) of t. At any leaf l \u2208 Leaves(t), h_l(y) \u2208 [0, 1] for all y \u2208 Y and \u2211_{y \u2208 Y} h_l(y) = 1. For convenience, we will denote by t(x) the leaf l \u2208 Leaves(t) associated to x by t. Thus, the score associated by (t, h) to a pair (x, y) \u2208 X \u00d7 Y is h_l(y) where l = t(x).\nLet T_n denote the family of all multi-class decision trees with n internal nodes in dimension d. In Appendix J, we derive the following upper bound on the Rademacher complexity of T_n:\n\nR_m(\u03a01(T_n)) \u2264 \u221a( (4n + 2) log2(d + 2) log(m + 1) / m ).   (13)\n\nAll of the experiments in this section use T_n as the family of base hypothesis sets (parametrized by n). Since T_n is a very large hypothesis set when n is large, for the sake of computational efficiency we make a few approximations. First, although our MDeepBoost algorithms were derived in terms of Rademacher complexity, we use the upper bound in Eq. (13) in place of the Rademacher complexity (thus, in Algorithm 1 we let \u039b_n = \u03bbB_n + \u03b2, where B_n is the bound given in Eq. (13)). Secondly, instead of exhaustively searching for the best decision tree in T_n for each possible size n, we use the following greedy procedure: given the best decision tree of size n (starting with n = 1), we find the best decision tree of size n + 1 that can be obtained by splitting one leaf, and continue this procedure until some maximum depth K. Decision trees are commonly learned in this manner, and so in this context our Rademacher-complexity-based bounds can be viewed as a novel stopping criterion for decision tree learning. Let H\u2217_K be the set of trees found by the greedy algorithm just described.
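The bound of Eq. (13) can be used directly in this way as a complexity penalty on tree size. The sketch below (not the paper's experimental code; the training errors and the values of \u03bb and \u03b2 are hypothetical) computes B_n and selects the tree size minimizing a penalized training error:

```python
import math

# Sketch of Eq. (13) as a tree-size penalty: B_n bounds R_m(Pi_1(T_n)) and
# Lambda_n = lambda * B_n + beta. Training errors below are hypothetical.

def tree_complexity_bound(n, d, m):
    """Upper bound B_n on the Rademacher complexity of trees with n internal nodes."""
    return math.sqrt((4 * n + 2) * math.log2(d + 2) * math.log(m + 1) / m)

def penalized_score(train_error, n, d, m, lam=0.05, beta=1e-6):
    """Training error plus the complexity penalty lambda * B_n + beta."""
    return train_error + lam * tree_complexity_bound(n, d, m) + beta

d, m = 16, 5000
# Hypothetical training errors of greedily grown trees of increasing size:
errors = {1: 0.30, 3: 0.18, 7: 0.12, 15: 0.10, 31: 0.099}
best_n = min(errors, key=lambda n: penalized_score(errors[n], n, d, m))
print(best_n)  # 15: the marginal error gain from n = 31 no longer pays the penalty
```

With these (hypothetical) numbers the penalty stops the growth at n = 15, illustrating how the bound acts as a stopping criterion rather than growing trees to minimum training error.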
In each iteration t of MDeepBoost, we select the best tree in the set H\u2217_K \u222a {h1, . . . , h_{t\u22121}}, where h1, . . . , h_{t\u22121} are the trees selected in previous iterations.\nWhile we described many objective functions that can be used as the basis of a multi-class deep boosting algorithm, the experiments in this section focus on algorithms derived from F_sum. We also refer the reader to Table 3 in Appendix A for results of experiments with F_compsum objective functions. The F_sum and F_compsum objectives combine several advantages that suggest they will perform well empirically. F_sum is consistent and both F_sum and F_compsum are (by Theorem 4) H-consistent. Also, unlike F_max, both of these objectives are differentiable, and therefore the convergence guarantee of Theorem 2 applies. Our preliminary findings also indicate that algorithms based on the F_sum and F_compsum objectives perform better than those derived from F_max and F_maxsum. All of our objective functions require a choice for \u03a6, the loss function. Since Cortes et al. [2014] reported comparable results for the exponential and logistic losses for the binary version of DeepBoost, we let \u03a6 be the exponential loss in all of our experiments with MDeepBoostSum. For MDeepBoostCompSum we select \u03a61(u) = log2(1 + u) and \u03a62(\u2212u) = exp(\u2212u).\nIn our experiments, we used 8 UCI datasets: abalone, handwritten, letters, pageblocks, pendigits, satimage, statlog and yeast; see more details on these datasets in Table 4, Appendix L. In Appendix K, we explain that when \u03bb = \u03b2 = 0, MDeepBoostSum is equivalent to AdaBoost.MR. Also, if we set \u03bb = 0 and \u03b2 \u2260 0, the resulting algorithm is an L1-norm regularized variant of AdaBoost.MR. We compared MDeepBoostSum to these two algorithms, with the results also reported in Table 1 and Table 2 in Appendix A. 
Likewise, we compared MDeepBoost-\nCompSum with multinomial (additive) logistic regression, LogReg, and its L1-regularized version\nLogReg-L1, which, as discussed in Appendix K, are equivalent to MDeepBoostCompSum when\n\u03bb = \u03b2 = 0 and \u03bb = 0, \u03b2 \u2265 0 respectively. Finally, we remark that it can be argued that the parame-\nter optimization procedure (described below) signi\ufb01cantly extends AdaBoost.MR since it effectively\nimplements structural risk minimization: for each tree depth, the empirical error is minimized and\nwe choose the depth to achieve the best generalization error.\nAll of these algorithms use maximum tree depth K as a parameter. L1-norm regularized versions\nadmit two parameters: K and \u03b2 \u2265 0. Deep boosting algorithms have a third parameter, \u03bb \u2265 0.\nTo set these parameters, we used the following parameter optimization procedure: we randomly\npartitioned each dataset into 4 folds and, for each tuple (\u03bb, \u03b2, K) in the set of possible parameters\n(described below), we ran MDeepBoostSum, with a different assignment of folds to the training\n\n7\n\n\fTable 1: Empirical results for MDeepBoostSum, \u03a6 = exp. 
AB stands for AdaBoost.\n\nabalone:     AB.MR 0.739 (0.0016)   AB.MR-L1 0.737 (0.0065)   MDeepBoost 0.735 (0.0045)\nhandwritten: AB.MR 0.024 (0.0011)   AB.MR-L1 0.025 (0.0018)   MDeepBoost 0.021 (0.0015)\nletters:     AB.MR 0.065 (0.0018)   AB.MR-L1 0.059 (0.0059)   MDeepBoost 0.058 (0.0039)\npageblocks:  AB.MR 0.035 (0.0045)   AB.MR-L1 0.035 (0.0031)   MDeepBoost 0.033 (0.0014)\npendigits:   AB.MR 0.014 (0.0025)   AB.MR-L1 0.014 (0.0013)   MDeepBoost 0.012 (0.0011)\nsatimage:    AB.MR 0.112 (0.0123)   AB.MR-L1 0.117 (0.0096)   MDeepBoost 0.117 (0.0087)\nstatlog:     AB.MR 0.029 (0.0026)   AB.MR-L1 0.026 (0.0071)   MDeepBoost 0.024 (0.0008)\nyeast:       AB.MR 0.415 (0.0353)   AB.MR-L1 0.410 (0.0324)   MDeepBoost 0.407 (0.0282)\n\nset, validation set and test set for each run. Specifically, for each run i \u2208 {0, 1, 2, 3}, fold i was used for testing, fold i + 1 (mod 4) was used for validation, and the remaining folds were used for training. For each run, we selected the parameters that had the lowest error on the validation set and then measured the error of those parameters on the test set. The average test error and the standard deviation of the test error over all 4 runs is reported in Table 1. Note that an alternative procedure to compare algorithms that is adopted in a number of previous studies of boosting [Li, 2009a,b, Sun et al., 2012] is to simply record the average test error of the best parameter tuples over all runs. While it is of course possible to overestimate the performance of a learning algorithm by optimizing hyperparameters on the test set, this concern is less valid when the size of the test set is large relative to the \u201ccomplexity\u201d of the hyperparameter space. 
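The fold assignment used in the parameter optimization procedure above can be sketched as follows (a sketch of the described protocol, not the paper's code):

```python
# Fold roles for the 4-run protocol: run i tests on fold i, validates on
# fold (i + 1) mod 4, and trains on the two remaining folds.
def fold_roles(run, n_folds=4):
    test = run
    valid = (run + 1) % n_folds
    train = [f for f in range(n_folds) if f not in (test, valid)]
    return train, valid, test

for run in range(4):
    print(run, fold_roles(run))
# 0 ([2, 3], 1, 0)
# 1 ([0, 3], 2, 1)
# 2 ([0, 1], 3, 2)
# 3 ([1, 2], 0, 3)
```

Each fold thus serves exactly once as a test set and once as a validation set across the four runs.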
We report results for this alternative procedure in Table 2 and Table 3 of Appendix A.

For each dataset, the set of possible values for λ and β was initialized to {10−5, 10−6, . . . , 10−10}, and to {1, 2, 3, 4, 5} for the maximum tree depth K. However, if we found an optimal parameter value to be at an end point of these ranges, we extended the interval in that direction (by an order of magnitude for λ and β, and by 1 for the maximum tree depth K) and re-ran the experiments. We also experimented with 200 and 500 iterations, but we observed that the errors do not change significantly and that the ranking of the algorithms remains the same.

The results of our experiments show that, for each dataset, deep boosting algorithms outperform the other algorithms evaluated in our experiments. Let us point out that, even though not all of our results are statistically significant, MDeepBoostSum outperforms AdaBoost.MR and AdaBoost.MR-L1 (and hence, effectively, structural risk minimization) on each dataset. More importantly, for each dataset, MDeepBoostSum outperforms the other algorithms on most of the individual runs. Moreover, the results for some of the datasets presented here (namely pendigits) appear to be state-of-the-art. We also refer the reader to the experimental results summarized in Table 2 and Table 3 in Appendix A. These results provide further evidence in favor of DeepBoost algorithms. The consistent performance improvement by MDeepBoostSum over AdaBoost.MR or its L1-norm regularized variant shows the benefit of the new complexity-based regularization we introduced.

5 Conclusion

We presented new data-dependent learning guarantees for convex ensembles in the multi-class setting where the base classifier set is composed of increasingly complex sub-families, including very deep or complex ones. These learning bounds generalize to the multi-class setting the guarantees presented by Cortes et al.
[2014] in the binary case. We also introduced and discussed several new multi-class ensemble algorithms benefiting from these guarantees and proved positive results for the H-consistency and convergence of several of them. Finally, we reported the results of several experiments with DeepBoost algorithms, and compared their performance with that of AdaBoost.MR and additive multinomial Logistic Regression and their L1-regularized variants.

Acknowledgments

We thank Andrés Muñoz Medina and Scott Yang for discussions and help with the experiments. This work was partly funded by the NSF award IIS-1117591 and supported by an NSERC PGS grant.

References

P. Bühlmann and B. Yu. Boosting with the L2 loss. J. of the Amer. Stat. Assoc., 98(462):324–339, 2003.
M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48:253–285, September 2002.
C. Cortes, M. Mohri, and U. Syed. Deep boosting. In ICML, pages 1179–1187, 2014.
T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.
J. C. Duchi and Y. Singer. Boosting with structural sparsity. In ICML, page 38, 2009.
N. Duffy and D. P. Helmbold. Potential boosters? In NIPS, pages 258–264, 1999.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000.
J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28, 1998.
A. J. Grove and D. Schuurmans.
Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, pages 692–699, 1998.
J. Kivinen and M. K. Warmuth. Boosting as entropy projection. In COLT, pages 134–144, 1999.
V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
P. Li. ABC-boost: adaptive base class boost for multi-class classification. In ICML, page 79, 2009a.
P. Li. ABC-logitboost for multi-class classification. Technical report, Rutgers University, 2009b.
P. M. Long and R. A. Servedio. Consistency versus realizable H-consistency for multiclass classification. In ICML (3), pages 801–809, 2013.
Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.
L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In NIPS, 1999.
M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. JMLR, 14(1):437–497, 2013.
G. Rätsch and M. K. Warmuth. Maximizing the margin with boosting. In COLT, pages 334–350, 2002.
G. Rätsch and M. K. Warmuth. Efficient margin maximizing with boosting. JMLR, 6:2131–2152, 2005.
G. Rätsch, S. Mika, and M. K. Warmuth. On the convergence of leveraging. In NIPS, pages 487–494, 2001a.
G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001b.
R. E. Schapire. Theoretical views of boosting and applications. In Proceedings of ALT 1999, volume 1720 of Lecture Notes in Computer Science, pages 13–25.
Springer, 1999.
R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pages 322–330, 1997.
P. Sun, M. D. Reid, and J. Zhou. AOSO-LogitBoost: Adaptive one-vs-one LogitBoost for multi-class problem. In ICML, 2012.
A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. JMLR, 8:1007–1025, 2007.
M. K. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In ICML, pages 1001–1008, 2006.
T. Zhang. Statistical analysis of some multi-category large margin classification methods. JMLR, 5:1225–1251, 2004a.
T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56–85, 2004b.
J. Zhu, H. Zou, S. Rosset, and T. Hastie. Multi-class AdaBoost. Statistics and Its Interface, 2009.
H. Zou, J. Zhu, and T. Hastie. New multicategory boosting algorithms based on multicategory Fisher-consistent losses. Annals of Applied Statistics, 2(4):1290–1306, 2008.