{"title": "On the Convergence of Leveraging", "book": "Advances in Neural Information Processing Systems", "page_first": 487, "page_last": 494, "abstract": null, "full_text": "On the Convergence of Leveraging*\n\nGunnar Rätsch†, Sebastian Mika‡ and Manfred K. Warmuth§\n\n† RSISE, Australian National University, Canberra, ACT 0200 Australia\n‡ Fraunhofer FIRST, Kekulestr. 7, 12489 Berlin, Germany\n§ University of California at Santa Cruz, CA 95060, USA\n\nraetsch@csl.anu.edu.au, mika@first.fhg.de, manfred@cse.ucsc.edu\n\nAbstract\n\nWe give a unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-Square-Boost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotic convergence results for all these methods. Our analysis includes ℓ_1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning.\n\n1 Introduction\n\nWe show convergence rates of ensemble learning methods such as AdaBoost [10], Logistic Regression (LR) [11, 5] and the Least-Square (LS) regression algorithm called LS-Boost [12]. These algorithms have in common that they iteratively call a base learning algorithm L (also called weak learner) on a weighted training sample. The base learner is expected to return in each iteration t a hypothesis ĥ_t from some hypothesis set of weak hypotheses H that has small weighted training error. This is the weighted number of false predictions in classification and the weighted estimation error in regression. 
These hypotheses are then linearly combined to form the final hypothesis f_α̂(x) = Σ_t α̂_t ĥ_t(x); in classification one uses the sign of f_α̂ for prediction. The hypothesis coefficient α̂_t is determined at iteration t, such that a certain objective is minimized or approximately minimized, and is fixed for later iterations. Here we will work out sufficient conditions on the base learning algorithm to achieve linear convergence to the minimum of an associated loss function G. This means that for any starting condition the minimum can be reached with precision ε > 0 in only O(log(1/ε)) iterations.\n\nRelation to Previous Work In the original work on AdaBoost it has been shown that the optimization objective (which is an upper bound on the training error) converges exponentially fast to zero, if the base learner is consistently better than random guessing, i.e. its weighted training error ε is always smaller than some constant γ with γ < 1/2. In this case the convergence is known to be linear (i.e. exponentially decreasing) [10]. One can easily show that this is the case when the data is separable:¹ If the data is not separable, the\n\n* Supported by DFG grants MU 987/1-1, JA 379/9-1 and NSF grant CCR 9821087; we gratefully acknowledge help from B. Borchers, P. Spellucci, R. Israel and S. Lemm. This work has been done while G. 
Rätsch was at Fraunhofer FIRST, Berlin.\n\n¹ We call the data separable, if there exists α such that f_α(x) separates the training examples.\n\nweighted training error ε cannot be upper bounded by a constant smaller than 1/2, otherwise one could use AdaBoost to find a separation using the aforementioned convergence result.²\n\nFor AdaBoost and Logistic Regression it has been shown [5] that they generate a combined hypothesis asymptotically minimizing a loss functional G only depending on the output of the combined hypothesis f_α. This holds for the non-separable case; however, the assumed conditions in [5] on the performance of the base learner are rather strict and can usually not be satisfied in practice. Although the analysis in [5] holds in principle for any strictly convex cost function of Legendre-type (e.g. [24], p. 258, and [1]), one needs to show the existence of a so-called auxiliary function [7, 5] for each cost function other than the exponential or the logistic loss. This can indeed be done [cf. 19, Section 4.2], but in any case only leads to asymptotic results. In the present work we can also show rates of convergence.\n\nIn an earlier attempt to show the convergence of such methods for arbitrary loss functions [17], one needed to assume that the hypothesis coefficients α̂_t are upper bounded by a rather small constant. For this case it has been shown that the algorithm asymptotically converges to a combined hypothesis minimizing G. However, since the α̂_t's need to be small, the algorithm requires many iterations to achieve this goal.\n\nIn [9] it has been shown that for loss functions which are (essentially) exponentially decreasing (including the loss functions of AdaBoost and Logistic regression), the loss is O(1/√t) in the first t̃ iterations and afterwards O(η^{t̃ − t}). This implies linear convergence. However, this only holds, if the loss reaches zero, i.e. 
if the data is separable. In our work we do not need to assume separability.\n\nAn equivalent optimization problem for AdaBoost has also been considered in a paper that predates the formulation of AdaBoost [4]. This optimization problem concerns the likelihood maximization for some exponential family of distributions. In that work convergence is proven for the general non-separable case, however, only for the exponential loss, i.e. for the case of AdaBoost.³ The framework set up in this paper is more general and we are able to treat any strictly convex loss function.\n\nIn this paper we propose a family of algorithms that are able to generate a combined hypothesis f converging to the minimum of G[f] (if it exists), which is a functional depending on the outputs of the function f evaluated on the training set. Special cases are AdaBoost, Logistic Regression and LS-Boost. While assuming mild conditions on the base learning algorithm and the loss function G, we can show linear convergence rates [15] (beginning in the first iteration) of the type G[f_{t+1}] − G[f*] ≤ η(G[f_t] − G[f*]) for some fixed η ∈ [0, 1). This means that the difference to the minimum loss converges exponentially fast to zero (in the number of iterations). A similar convergence has been proven for AdaBoost in the special case of separable data [10], although the constant η shown in [10] can be considerably smaller [see also 9]. To prove the convergence of leveraging, we exploit results of Luo & Tseng [16] for a variant of the Gauss-Southwell method known from numerical optimization.\n\nSince in practice the hypothesis set H can be quite large, ensemble learning algorithms without any regularization often suffer from overfitting [22, 12, 2, 19]. Here, the complexity can only be controlled by the size of the base hypothesis set or by early stopping after a few iterations. 
However, it has been shown that shrinkage regularization implied by penalizing some norm of the hypothesis coefficients is the favorable strategy [6, 12]. We therefore extend our analysis to the case of ℓ_1-norm regularized loss functions. With a slight modification this leads to a family of converging algorithms that e.g. includes the Leveraged Vector Machine [25] and a variant of LASSO [26].\n\nIn the following section we briefly review AdaBoost, Logistic Regression, and LS-Boost and cast them in a common framework. In Sec. 3 we present our main results. After relating these results to leveraging algorithms, we present an extension to regularized cost functions in Sec. 4 and finally conclude.\n\n² This can also be seen when analyzing a certain linear program in the dual domain (cf. [23]).\n³ We will expand on this connection in the full paper (see also [14, 19]).\n\n2 Leveraging algorithms revisited\n\nWe first briefly review some of the most well known leveraging algorithms for classification and regression. For more details see e.g. [10, 11, 12, 8]. We work with Alg. 1 as a template for a generic leveraging algorithm, since these algorithms have the same algorithmic structure. Finally, we will generalize the problem and extend the notation.\n\nAdaBoost & Logistic Regression are designed for classification tasks. In each iteration they call a base learning algorithm on the training set S = {(x_1, y_1), ..., (x_N, y_N)} ⊆ X × {−1, +1} (cf. step 3a in Alg. 1). Here a weighting d^t = [d^t_1, ..., d^t_N] on the sample is used that is recomputed in each iteration t. The base learner is expected to return a hypothesis ĥ_t from some hypothesis space⁴ H := {h_j | h_j : X ↦ {−1, +1}, j = 1, ..., J} that has small weighted classification error⁵ ε_t = Σ_{n=1}^N |d^t_n| I(y_n ≠ ĥ_t(x_n)) [10, 11], where I(true) = 1 and I(false) = 0. 
It is more convenient to work with the edge of ĥ_t, which is defined as γ_t = 1 − 2ε_t = Σ_{n=1}^N d^t_n ĥ_t(x_n). After selecting the hypothesis, its weight α̂_t is computed such that it minimizes a certain functional (cf. step 3b). For AdaBoost this is\n\nG^AB(α̂) = Σ_{n=1}^N exp{−y_n (α̂ ĥ_t(x_n) + f_{t−1}(x_n))}   (1)\n\nand for Logistic Regression it is\n\nG^LR(α̂) = Σ_{n=1}^N log{1 + exp(−y_n (α̂ ĥ_t(x_n) + f_{t−1}(x_n)))},   (2)\n\nwhere f_{t−1} is the combined hypothesis of the previous iteration given by f_{t−1}(x_n) = Σ_{r=1}^{t−1} α̂_r ĥ_r(x_n). For AdaBoost it has been shown that α̂_t minimizing (1) can be computed analytically [3]. This is true because we assumed that the hypotheses are binary valued. Similarly, for LR there exists an analytic solution of (2). The weighting d on the sample is updated based on the new combined hypothesis f_t(x_n) = α̂_t ĥ_t(x_n) + f_{t−1}(x_n): d^{t+1}_n = y_n exp(−y_n f_t(x_n)) and d^{t+1}_n = y_n exp(−y_n f_t(x_n)) / (1 + exp(−y_n f_t(x_n))), for AdaBoost and Logistic Regression, respectively.\n\nLeast-Square-Boost is an algorithm to solve regression tasks. In this case S = {(x_1, y_1), ..., (x_N, y_N)} ⊆ X × Y, Y ⊆ R and H := {h_j | h_j : X ↦ Y, j = 1, ..., J}. It works in a similar way as AdaBoost and LR. It first selects a hypothesis solving\n\nĥ_t = argmin_{ĥ ∈ H} (1/2) Σ_{n=1}^N (d^t_n − ĥ(x_n))²,   (3)\n\nand then finds the hypothesis weight α̂_t by minimizing the squared error of the new combined hypothesis:\n\nG^LS(α̂) = (1/2) Σ_{n=1}^N (y_n − f_{t−1}(x_n) − α̂ ĥ_t(x_n))².   (4)\n\nThe ``weighting'' of the sample is computed as d^{t+1}_n = y_n − f_t(x_n), which is the residual of f_t [12]. 
In a second version of LS-Boost, the base hypothesis and its weight are found simultaneously by solving [12]:\n\n[ĥ_t, α̂_t] = argmin_{α̂ ∈ R, ĥ ∈ H} (1/2) Σ_{n=1}^N (y_n − f_{t−1}(x_n) − α̂ ĥ(x_n))²   (5)\n\nSince in (5) one reaches a lower loss function value than with (3) and (4), it might be the favorable strategy.\n\n⁴ Notice that H always contains only a finite number of different hypotheses when evaluated on the training set and is effectively finite [2].\n⁵ Different from common convention, we include the y_n in d_n to make the presentation simpler.\n\nAlgorithm 1 -- A Leveraging algorithm for the loss function G.\n\n1. Input: S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩, No. of Iterations T, Loss function G : R^N → R\n2. Initialize: f_0 ≡ 0, d^1_n = g'(y_n, f_0(x_n)) for all n = 1, ..., N\n3. Do for t = 1, ..., T,\n   (a) Train classifier on {S, d^t} and obtain hypothesis ĥ_t : X → Y\n   (b) Set α̂_t = argmin_{α ∈ R} G[f_t + α ĥ_t]\n   (c) Update f_{t+1} = f_t + α̂_t ĥ_t and d^{t+1}_n = g'(y_n, Σ_{r=1}^t α̂_r ĥ_r(x_n))\n4. Output: f_T\n\nThe General Case These algorithms can be summarized in Alg. 1 (where case (5) is slightly degenerate, cf. Sec. 3.2) for some appropriately defined functions G and g': plug in G[f] = Σ_{n=1}^N g(y_n, f(x_n)) and choose g as g(y, f(x)) = exp(−y f(x)) for AdaBoost (cf. (1)), g(y, f(x)) = log(1 + exp(−y f(x))) for Logistic Regression (cf. (2)) and g(y, f(x)) = (1/2)(y − f(x))² for LS-Boost (cf. 
(4)).\n\nIt can easily be verified that the function g', used for computing the weights d, is the derivative of g with respect to the second argument [3, 12].\n\nThe Optimization Problem It has been argued in [3, 18, 11, 17] and finally shown in [5] that AdaBoost and Logistic Regression under certain conditions asymptotically converge to a combined hypothesis f minimizing the respective loss G on the training sample, where f is a linear combination of hypotheses from H, i.e. f_α ∈ lin(H) := {Σ_{j=1}^J α_j h_j | h_j ∈ H, α_j ∈ R}. Thus, they solve the optimization problem:\n\nmin_{f ∈ lin(H)} G[f] = min_{α ∈ R^J} G(Hα),   (6)\n\nwhere we defined a matrix H ∈ R^{N×J} with H_{ij} = h_j(x_i).\n\nTo avoid confusion, note that hypotheses and coefficients generated during the iterative algorithm are marked by a hat. In the algorithms discussed so far, the optimization takes place by employing the leveraging scheme outlined in Alg. 1. The output of such an algorithm is a sequence of pairs (α̂_t, ĥ_t) and a combined hypothesis f(x) = Σ_{r=1}^t α̂_r ĥ_r(x). With α^t_j = Σ_{r=1}^t α̂_r I(ĥ_r = h_j), j = 1, ..., J, it is easy to verify that Σ_{r=1}^t α̂_r ĥ_r(x) = Σ_{j=1}^J α^t_j h_j(x), which is in lin(H) (note the missing hat).\n\nOther Preliminaries Throughout the paper we assume the loss function G is of the form\n\nG[f_α] = G(Hα) = Σ_{n=1}^N g(y_n, f_α(x_n)).\n\nAlthough this assumption is not necessary, the presentation becomes easier. In [7, 5, 19] a more general case of Legendre-type cost functions is considered. However, note that additive loss functions are commonly used, if one considers i.i.d.-drawn examples.\n\nWe assume that each element H_{nj} and y_n is finite (j = 1, ..., J, n = 1, ..., N) and H does not contain a zero column. Furthermore, the function g(y, ·) : R → 
R is assumed to be strictly convex for all y ∈ Y.\n\nFor simplicity we assume for the rest of the paper that H is finite and complementation closed, i.e. for every h ∈ H there exists also −h ∈ H. The assumption on the finiteness is not crucial for classification (cf. footnote 4). For regression problems the hypothesis space might be infinite. This case has explicitly been analyzed in [20, 19] and goes beyond the scope of this paper (see also [27]).\n\n3 Main Result\n\nWe now state a result known from the field of numerical optimization. Then we show how the reviewed leveraging algorithms fit into this optimization framework.\n\n3.1 Coordinate Descent\n\nThe idea of coordinate descent is to iteratively select a coordinate, say the j-th, and find α_j such that some functional F([α_1, ..., α_j, ..., α_J]^⊤) is minimized with respect to α_j. There exist several different strategies for selecting the coordinates [e.g. 15]; however, we are in particular interested in the Gauss-Southwell-type (GS) selection scheme: It selects the coordinate that has the largest absolute value in the gradient vector β := ∇F(α), i.e. j = argmax_{j'=1,...,J} |β_{j'}|. Instead of doing steps in the direction of the negative gradient as in standard gradient descent methods, one only changes the variable that has the largest gradient component. This can be efficient, if there are many variables and most of them are zero at the minimum.\n\nWe start with the following general convergence result, which seems to have fallen into oblivion even in the optimization community. It will be very useful in the analysis of leveraging algorithms. Due to a lack of space we omit proofs (see [21, 19]).\n\nTheorem 1 (Convergence of Coordinate Descent [16]). Suppose G : R^N → R is twice continuously differentiable and strictly convex on dom G. 
Assume that dom G is open, the set of solutions S* ⊂ R^J to\n\nmin_{α ∈ S} F(α) := G(Hα) + γ^⊤α   (7)\n\nis not empty, where H ∈ R^{N×J} is a fixed matrix having no zero column, γ ∈ R^J is fixed and S ⊆ R^J is a (possibly unbounded) box-constrained set. Furthermore assume that the Hessian ∇²G(Hα*) is a positive definite matrix for all α* ∈ S*. Let {α^t} be the sequence generated by coordinate descent, where the coordinate selection j_1, j_2, ... satisfies\n\n|α^{t+1,j_t}_{j_t} − α^t_{j_t}| ≥ β max_{j=1,...,J} |α^{t+1,j}_j − α^t_j|   (8)\n\nfor some β ∈ (0, 1], where α^{t+1,j}_j is the optimal value of α^{t+1}_j if it would be selected, i.e.\n\nα^{t+1,j}_j := argmin_{α ∈ S_j} G(Hα^t + H_j(α − α^t_j)) + γ_j α.   (9)\n\nThen {α^t} converges to an element in S*.\n\nThe coordinate selection in Thm. 1 is slightly different from the Gauss-Southwell selection rule described before. We therefore need the following:\n\nProposition 2 (Convergence of GS on R^J). Assume the conditions on G and H as in Thm. 1. Let S^b be a convex subset of S := R^J such that α^{t,j} ∈ S^b. Assume\n\n∂²G(Hα)/∂α_j² ≤ η_u and ∂²G(Hα)/∂α_j² ≥ η_l  ∀α ∈ S   (10)\n\nholds for some fixed η_l, η_u > 0. Then a coordinate selection j_1, j_2, ... satisfies (8) of Thm. 
1, if there exists a fixed τ ∈ (0, 1] such that\n\n|∂F(α^t)/∂α^t_{j_t}| ≥ τ max_{j=1,...,J} |∂F(α^t)/∂α^t_j|  ∀t = 1, 2, ...   (11)\n\nThus the approximate Gauss-Southwell method on R^J as described above converges. To show the convergence of the second variant of LS-Boost (cf. (5)) we need the following\n\nProposition 3 (Convergence of the maximal improvement scheme on R^J). Let G, H, S and S^b be as in Proposition 2 and assume (10) holds. Then a coordinate selection j_1, j_2, ... satisfies (8), if there exists a fixed τ ∈ (0, 1] with\n\nF(α^t) − F(α^{t+1,j_t}) ≥ τ max_{j=1,...,J} (F(α^t) − F(α^{t+1,j}))  ∀t = 1, 2, ...   (12)\n\nThus the maximal improvement scheme on R^J as above converges in the sense of Thm. 1.\n\nFinally we can also state a rate of convergence, which is surprisingly not worse than the rates for standard gradient descent methods:\n\nTheorem 4 (Rate of Convergence of Coordinate Descent, [16]). Assume the conditions of Thm. 1 hold. Let S^b be as in Prop. 2 and assume (10) holds for some η_l > 0. Then we have\n\nε_{t+1} := F(α^{t+1}) − F(α*) ≤ (1 − 1/η)(F(α^t) − F(α*)),   (13)\n\nwhere α^t is the estimate after the t-th coordinate descent step, α* denotes an optimal solution, and 0 1. Especially at iteration t: ε_t ≤ (1 − 1/η)^t ε_0.\n\nFollowing [16] one can show that the constant η is O(κ² L J⁴ N² / τ²), where L is the Lipschitz constant of ∇G and κ is a constant that depends on H and therefore on the geometry of the hypothesis set (cf. [16, 13] for details). 
While the upper bound on η can be rather large, making the convergence slow, it is important to note (i) that this is only a rough estimate of the true constant and (ii) that it still guarantees an exponential decrease in the error functional with the number of iterations.\n\n3.2 Leveraging and Coordinate Descent\n\nWe now return from the abstract convergence results in Sec. 3.1 to our examples of leveraging algorithms, i.e. we show how to retrieve the Gauss-Southwell algorithm on R^J as a part of Alg. 1. For now we set γ = 0. The gradient of G with respect to α_j is given by\n\n∂G(α)/∂α_j = Σ_{n=1}^N g'(y_n, f_α(x_n)) h_j(x_n) = Σ_{n=1}^N d_n h_j(x_n)   (14)\n\nwhere d_n is given as in step 3c of Alg. 1. Thus, the coordinate with maximal absolute gradient corresponds to the hypothesis with largest absolute edge (see definition). However, according to Propositions 2 and 3 we need to assume less on the base learner. It either has to return a hypothesis that (approximately) maximizes the edge, or alternatively (approximately) minimizes the loss function.\n\nDefinition 5 (τ-Optimality). A base learning algorithm L is called τ-optimal, if it always returns hypotheses that either satisfy condition (11) or (12) for some fixed τ > 0.\n\nSince we have assumed H is closed under complementation, there always exist two hypotheses having the same absolute gradient (h and −h). We therefore only need to consider the hypothesis with maximum edge as opposed to the maximum absolute edge.\n\nFor classification it means: if the base learner returns the hypothesis with approximately smallest weighted training error, this condition is satisfied. It is left to show that we can apply Thm. 1 for the loss functions reviewed in Sec. 2:\n\nLemma 6. The loss functions of AdaBoost, Logistic regression and LS-Boost are bounded, strongly convex and fulfill the conditions in Thm. 
1 on any bounded subset of R^N.\n\nWe can finally state the convergence result for leveraging algorithms:\n\nTheorem 7. Let G be a loss function satisfying the conditions in Thm. 1. Suppose Alg. 1 generates a sequence of hypotheses ĥ_1, ĥ_2, ... and weights α̂_1, α̂_2, ... using a τ-optimal base learner. Assume {α^t} with α^t_j = Σ_{r=1}^t α̂_r I(ĥ_r = h_j) is bounded. Then any limit point of {α^t} is a solution of (6) and {α^t} converges linearly in the sense of Thm. 4.\n\nNote that this result in particular applies to AdaBoost, Logistic regression and the second version of LS-Boost. For the selection scheme of LS-Boost given by (3) and (4), both conditions in Definition 5 cannot be satisfied in general, unless Σ_{n=1}^N h_j(x_n)² is constant for all hypotheses. Since Σ_{n=1}^N (d_n − h_j(x_n))² = Σ_{n=1}^N (h_j(x_n)² − 2 d_n h_j(x_n) + const.), the base learner prefers hypotheses with small Σ_{n=1}^N h_j(x_n)² and could therefore stop improving the objective while being not optimal (see [20, Section 4.3] and [19, Section 5] for more details).\n\n4 Regularized Leveraging approaches\n\nWe have not yet exploited all features of Thm. 1. It additionally allows for box constraints and a linear function in terms of the hypothesis coefficients. Here, we are in particular interested in ℓ_1-norm penalized loss functions of the type F(α) = G(Hα) + C‖α‖_1, which are frequently used in machine learning. The LASSO algorithm for regression [26] and the PBVM algorithm for classification [25] are examples. Since we assumed complementation closedness of H, we can assume without loss of generality that a solution α* satisfies α* ≥ 0. 
We can therefore implement the ℓ_1-norm regularization using the linear term γ^⊤α, where γ = C1 (with 1 denoting the all-ones vector) and C ≥ 0 is the regularization constant. Clearly, the regularization defines a structure of nested subsets of H, where the hypothesis set is restricted to a smaller set for larger values of C.\n\nThe constraint α ≥ 0 causes some minor complications with the assumptions on the base learning algorithm. However, these can easily be resolved (cf. [21]), while not assuming more on the base learning algorithm. The first step in solving the problem is to add the additional constraint α_t ≥ 0 to the minimization with respect to α_t in step 3b of Alg. 1. Roughly speaking, this induces the problem that a hypothesis coefficient chosen too large in a previous iteration cannot be reduced again. To solve this problem one can check for each coefficient of a previously selected hypothesis whether not selecting it would violate the τ-optimality condition (11) or (12). If so, the\n\nAlgorithm 2 -- A Leveraging algorithm for ℓ_1-norm regularized loss G.\n\n1. Input: Sample S, No. of Iterations T, Loss function G : R^N → R, Reg. const. C > 0\n2. Initialize: f_0 ≡ 0, d^1_n = g'(y_n, f_0(x_n)) for all n = 1, ..., N\n3. Do for t = 1, ..., T,\n   (a) Train classifier on {S, d^t} and obtain hypothesis ĥ_t : X → Y\n   (b) Let γ̂_r = Σ_{n=1}^N d^t_n ĥ_r(x_n) and α_r = Σ_{s=1}^t α̂_s I(ĥ_s = ĥ_r) for r = 1, ..., t\n   (c) r* = argmin_{i ∈ J} γ̂_i, where J = {i | i ∈ {1, ..., t−1} and α_i > 0}.\n   (d) if γ̂_t − C < C − γ̂_{r*} then ĥ_t = ĥ_{r*} and α_t = −α_{r*} else α_t = 0\n   (e) Set α̂_t = argmin_{α ≥ α_t} G[f_t + α ĥ_t] + Cα\n   (f) Update f_{t+1} = f_t + α̂_t ĥ_t and d^{t+1}_n = g'(y_n, f_{t+1}(x_n)), n = 1, ..., N\n4. 
Output: f_T\n\nalgorithm selects such a coordinate for the next iteration instead of calling the base learning algorithm. This idea leads to Alg. 2 (see [21] for a detailed discussion). For this algorithm we can show the following:\n\nTheorem 8 (Convergence of ℓ_1-norm penalized Leveraging). Assume G, H are as in Thm. 1, G is strictly convex, C > 0, and the base learner satisfies\n\n∂F(α^t)/∂α^t_{j_t} ≥ τ max_{j=1,...,J} ∂F(α^t)/∂α^t_j  ∀t = 1, 2, ...   (15)\n\nfor τ > 0. Then Alg. 2 converges linearly to a minimum of the regularized loss function.\n\nThis can also be shown for a maximum-improvement like condition on the base learner, which we have to omit due to space limitation.\n\nIn [27] a similar algorithm has been suggested that solves a similar optimization problem (keeping ‖α‖_1 fixed). For this algorithm one can show order one convergence (which is weaker than linear convergence), which also holds if the hypothesis set is infinite.\n\n5 Conclusion\n\nWe gave a unifying convergence analysis for a fairly general family of leveraging methods. These convergence results were obtained under rather mild assumptions on the base learner and, additionally, led to linear convergence rates. This was achieved by relating leveraging algorithms to the Gauss-Southwell method known from numerical optimization.\n\nWhile the main theorem used here was already proven in [16], its application closes a central gap between existing algorithms and their theoretical understanding in terms of convergence. Future investigations include the generalization to infinite hypothesis spaces and an improvement of the convergence rate η. Furthermore, we conjecture that our results can be extended to many other variants of boosting type algorithms proposed recently in the literature (cf. http://www.boosting.org).\n\nReferences\n\n[1] H.H. Bauschke and J.M. Borwein. 
Legendre functions and the method of random Bregman projections. Journal of Convex Analysis, 4:27-67, 1997.\n[2] K.P. Bennett, A. Demiriz, and J. Shawe-Taylor. A column generation algorithm for boosting. In P. Langley, editor, Proceedings, 17th ICML, pages 65-72. Morgan Kaufmann, 2000.\n[3] L. Breiman. Prediction games and arcing algorithms. Neural Comp., 11(7):1493-1518, 1999.\n[4] N. Cesa-Bianchi, A. Krogh, and M. Warmuth. Bounds on approximate steepest descent for likelihood maximization in exponential families. IEEE Trans. Inf. Th., 40(4):1215-1220, 1994.\n[5] M. Collins, R.E. Schapire, and Y. Singer. Logistic Regression, AdaBoost and Bregman distances. In Proc. COLT, pages 158-169, San Francisco, 2000. Morgan Kaufmann.\n[6] J. Copas. Regression, prediction and shrinkage. J.R. Statist. Soc. B, 45:311-354, 1983.\n[7] S. Della Pietra, V. Della Pietra, and J. Lafferty. Duality and auxiliary functions for Bregman distances. TR CMU-CS-01-109, Carnegie Mellon University, 2001.\n[8] N. Duffy and D.P. Helmbold. A geometric approach to leveraging weak learners. In P. Fischer and H. U. Simon, editors, Proc. EuroCOLT '99, pages 18-33, 1999.\n[9] N. Duffy and D.P. Helmbold. Potential boosters? In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, NIPS, volume 12, pages 258-264. MIT Press, 2000.\n[10] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.\n[11] J. Friedman, T. Hastie, and R.J. Tibshirani. Additive Logistic Regression: a statistical view of boosting. Annals of Statistics, 2:337-374, 2000.\n[12] J.H. Friedman. Greedy function approximation. Tech. rep., Stanford University, 1999.\n[13] A.J. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49(4):263-265, October 1952.\n[14] J. Kivinen and M. Warmuth. Boosting as entropy projection. In Proc. 
12th Annu. Conference on Comput. Learning Theory, pages 134-144. ACM Press, New York, NY, 1999.\n[15] D.G. Luenberger. Linear and Nonlinear Programming. Addison-Wesley Publishing Co., Reading, second edition, May 1984. Reprinted with corrections in May, 1989.\n[16] Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7-35, 1992.\n[17] L. Mason, J. Baxter, P.L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Adv. Large Margin Class., pages 221-247. MIT Press, 2000.\n[18] T. Onoda, G. Rätsch, and K.-R. Müller. An asymptotic analysis of AdaBoost in the binary classification case. In L. Niklasson, M. Boden, and T. Ziemke, editors, Proc. of the Int. Conf. on Artificial Neural Networks (ICANN'98), pages 195-200, March 1998.\n[19] G. Rätsch. Robust Boosting via Convex Optimization. PhD thesis, University of Potsdam, October 2001. http://mlg.anu.edu.au/raetsch/thesis.ps.gz.\n[20] G. Rätsch, A. Demiriz, and K. Bennett. Sparse regression ensembles in infinite and finite hypothesis spaces. Machine Learning, 48(1-3):193-221, 2002.\n[21] G. Rätsch, S. Mika, and M.K. Warmuth. On the convergence of leveraging. NeuroCOLT2 Technical Report 98, Royal Holloway College, London, 2001.\n[22] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, March 2001. Also NeuroCOLT Technical Report NC-TR-1998-021.\n[23] G. Rätsch and M.K. Warmuth. Marginal boosting. NeuroCOLT2 Tech. Rep. 97, 2001.\n[24] R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970.\n[25] Y. Singer. Leveraged vector machines. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, NIPS, volume 12, pages 610-616. MIT Press, 2000.\n[26] R.J. Tibshirani. Regression selection and shrinkage via the LASSO. Technical report, Department of Statistics, University of Toronto, June 1994. 
ftp://utstat.toronto.edu/pub/tibs/lasso.ps.\n[27] T. Zhang. A general greedy approximation algorithm with applications. In Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002. In press.\n", "award": [], "sourceid": 1963, "authors": [{"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "Sebastian", "family_name": "Mika", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}