{"title": "AdaBoost is Consistent", "book": "Advances in Neural Information Processing Systems", "page_first": 105, "page_last": 112, "abstract": null, "full_text": "AdaBoost is Consistent\nPeter L. Bartlett Department of Statistics and Computer Science Division University of California, Berkeley\nbartlett@stat.berkeley.edu\n\nMikhail Traskin Department of Statistics University of California, Berkeley\nmtraskin@stat.berkeley.edu\n\nAbstract\nThe risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that, provided AdaBoost is stopped after t_n = n^α iterations—for sample size n and α < 1—the sequence of risks of the classifiers it produces approaches the Bayes risk, provided the Bayes risk L* > 0.\n\n1 Introduction\nBoosting algorithms are an important recent development in classification. These algorithms belong to a group of voting methods, for example [1, 2, 3], that produce a classifier as a linear combination of base, or weak, classifiers. While empirical studies show that boosting is one of the best off-the-shelf classification algorithms (see [3]), theoretical results do not give a complete explanation of its effectiveness. Breiman [4] showed that, under some assumptions on the underlying distribution, \"population boosting\" converges to the Bayes risk as the number of iterations goes to infinity. Since the population version assumes infinite sample size, this does not imply a similar result for AdaBoost, especially given the results of Jiang [5] showing that there are examples where AdaBoost has prediction error that is asymptotically suboptimal as t → ∞ (t is the number of iterations). Several authors have shown that modified versions of AdaBoost are consistent. These modifications include restricting the l_1-norm of the combined classifier [6, 7] and restricting the step size of the algorithm [8]. 
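To make the procedure analyzed in this paper concrete, here is a minimal sketch (not the authors' code) of AdaBoost viewed as greedy coordinate descent on the empirical exponential risk R_n(f) = (1/n) Σ_i exp(−Y_i f(X_i)), using decision stumps as the base class H and the early-stopping rule t_n = ⌊n^α⌋, α < 1, that the main theorem prescribes. The toy stump parameterization and the default exponent 0.5 are illustrative assumptions.

```python
import math

def stump(j, thresh, sign):
    """A base classifier h in H: a decision stump x -> {-1, +1}."""
    return lambda x: sign if x[j] > thresh else -sign

def adaboost(X, y, t):
    """AdaBoost as coordinate descent on the empirical exponential risk
    R_n(f) = (1/n) sum_i exp(-y_i f(x_i)); returns [(alpha, h), ...]."""
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n  # weights proportional to exp(-y_i f(x_i))
    ensemble = []
    for _ in range(t):
        # line search over H: pick the stump with the smallest weighted error
        best_err, best_h = None, None
        for j in range(d):
            for v in {x[j] for x in X}:
                for sign in (1, -1):
                    h = stump(j, v - 0.5, sign)
                    err = sum(wi for wi, x, yi in zip(w, X, y) if h(x) != yi)
                    if best_err is None or err < best_err:
                        best_err, best_h = err, h
        err = min(max(best_err, 1e-10), 1 - 1e-10)      # guard against err = 0 or 1
        alpha = 0.5 * math.log((1 - err) / err)         # exact minimizer of R_n along h
        ensemble.append((alpha, best_h))
        # reweight so that w_i stays proportional to exp(-y_i f(x_i))
        w = [wi * math.exp(-alpha * yi * best_h(x)) for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def classify(ensemble, x):
    """g(f_t): the sign of the final linear combination."""
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

def stopped_adaboost(X, y, alpha_exp=0.5):
    """Run AdaBoost for t_n = floor(n^alpha) steps, alpha < 1, as in Theorem 3."""
    t_n = max(1, int(len(X) ** alpha_exp))
    return adaboost(X, y, t_n)
```

On a toy one-dimensional sample, `classify(stopped_adaboost(X, y), x)` recovers the labels; the content of the paper is that with this stopping rule the risk of the resulting classifier converges almost surely to the Bayes risk L*.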
Jiang [9] analyses the unmodified boosting algorithm and proves a process consistency property, under certain assumptions. Process consistency means that there exists a sequence (t_n) such that if AdaBoost with sample size n is stopped after t_n iterations, its risk approaches the Bayes risk. However, Jiang also imposes strong conditions on the underlying distribution: the distribution of X (the predictor) has to be absolutely continuous with respect to Lebesgue measure, and the function F_B(X) = (1/2) ln(P(Y = 1|X)/P(Y = −1|X)) has to be continuous on X. Also, Jiang's proof is not constructive and does not give any hint on when the algorithm should be stopped. Bickel, Ritov and Zakai [10] prove a consistency result for AdaBoost, under the assumption that the probability distribution is such that the steps taken by the algorithm are not too large. We would like to obtain a simple stopping rule that guarantees consistency and doesn't require any modification to the algorithm. This paper provides a constructive answer to all of the mentioned issues: 1. We consider AdaBoost (not a modification). 2. We provide a simple stopping rule: the number of iterations t is a fixed function of the sample size n. 3. We assume only that the class of base classifiers has finite VC-dimension, and that the span of this class is sufficiently rich. Both assumptions are clearly necessary.\n\n2 Setup and notation\nHere we describe the AdaBoost procedure, formulated as a coordinate descent algorithm, and introduce definitions and notation. We consider a binary classification problem. We are given X, the measurable (feature) space, and Y = {−1, 1}, the set of (binary) labels. We are given a sample S_n = {(X_i, Y_i)}_{i=1}^n of i.i.d. observations distributed as the random variable (X, Y) ~ P, where P is an unknown distribution. Our goal is to construct a classifier g_n : X → Y based on this sample. 
The quality of the classifier g_n is given by the misclassification probability L(g_n) = P(g_n(X) ≠ Y | S_n). Of course we want this probability to be as small as possible and close to the Bayes risk\nL* = inf_g L(g) = E(min{η(X), 1 − η(X)}),\nwhere the infimum is taken over all possible (measurable) classifiers and η(·) is the conditional probability η(x) = P(Y = 1|X = x). The infimum above is achieved by the Bayes classifier g*(x) = g(2η(x) − 1), where g(x) = 1 for x > 0 and g(x) = −1 for x ≤ 0. We are going to produce a classifier as a linear combination of base classifiers in H = {h | h : X → Y}. We shall assume that the class H has a finite VC (Vapnik-Chervonenkis) dimension\nd_VC(H) = max{|S| : S ⊆ X, |H|_S| = 2^{|S|}}.\nDefine\nR_n(f) = (1/n) Σ_{i=1}^n e^{−Y_i f(X_i)}   and   R(f) = E e^{−Y f(X)}.\nThen the boosting procedure can be described as follows. 1. Set f_0 ≡ 0, choose the number of iterations t. 2. For k = 1, . . . , t set f_k = f_{k−1} + α_{k−1} h_{k−1}, where the following holds:\nR_n(f_k) = inf_{h ∈ H, α ∈ R} R_n(f_{k−1} + αh).   (1)\nWe call α_i the step size of the algorithm at step i. 3. Output g(f_t) as the final classifier. We shall also use the convex hull of H scaled by λ ≥ 0,\nF_λ = { Σ_{i=1}^n λ_i h_i : n ∈ N ∪ {0}, λ_i ≥ 0, Σ_{i=1}^n λ_i = λ, h_i ∈ H },\nas well as the set of k-combinations, k ∈ N, of functions in H,\nF^k = { Σ_{i=1}^k λ_i h_i : λ_i ∈ R, h_i ∈ H }.\nWe shall also need to define the l_1-norm: for any f ∈ F,\n‖f‖_1 = inf{ Σ_i |λ_i| : f = Σ_i λ_i h_i, h_i ∈ H }.\nDefine the squashing function π_l(·) to be\nπ_l(x) = l for x > l,   π_l(x) = x for x ∈ [−l, l],   π_l(x) = −l for x < −l.\nThen the set of truncated functions is\nπ_l(F) = { f̃ : f̃ = π_l(f), f ∈ F }.\nThe set of classifiers based on a class F is denoted by\ng(F) = { f̃ : f̃ = g(f), f ∈ F }.\nDefine the derivative of an arbitrary function Q(·) in the direction of h as\nQ′(f; h) = dQ(f + αh)/dα |_{α=0}.\nThe second derivative Q″(f; h) is defined similarly.\n\n3 Consistency of boosting procedure\nWe shall need the following assumption. 
Assumption 1 Let the distribution P and the class H be such that\nlim_{λ→∞} inf_{f∈F_λ} R(f) = R*,\nwhere R* = inf R(f), the infimum taken over all measurable functions. For many classes H, the above assumption is satisfied for all possible distributions P. See [6, Lemma 1] for sufficient conditions for Assumption 1. As an example of such a class, we can take the class of indicators of all rectangles, or indicators of half-spaces defined by hyperplanes, or binary trees with the number of terminal nodes equal to d + 1 (we consider trees with terminal nodes formed by successive univariate splits), where d is the dimensionality of X (see [4]). We begin with a simple lemma (see [1, Theorem 8] or [11, Theorem 6.1]):\nLemma 1 For any t ∈ N, if d_VC(H) ≥ 2 the following holds:\nd_P(F^t) ≤ 2(t + 1)(d_VC(H) + 1) log_2[2(t + 1)/ln 2],\nwhere d_P(F^t) is the pseudodimension of the class F^t.\nThe proof of AdaBoost consistency is based on the following result, which builds on the result of Koltchinskii and Panchenko [12] and resembles [6, Lemma 2].\nLemma 2 For a continuous function φ define the Lipschitz constant\nL_{φ,λ} = inf{L : L > 0, |φ(x) − φ(y)| ≤ L|x − y| for all x, y ∈ [−λ, λ]}\nand the maximum absolute value of φ(·) for arguments in [−λ, λ],\nM_{φ,λ} = max_{x∈[−λ,λ]} |φ(x)|.\nThen for the functions R_φ(f) = Eφ(Y f(X)) and R_{φ,n}(f) = (1/n) Σ_{i=1}^n φ(Y_i f(X_i)), with V = d_VC(H) and c = 24 ∫_0^1 √(ln(8e/ε²)) dε, and any n, λ > 0 and t > 0,\nE sup_{f∈π_λ(F^t)} |R_φ(f) − R_{φ,n}(f)| ≤ cλL_{φ,λ} √((V + 1)(t + 1) log_2[2(t + 1)/ln 2]/n)   (2)\nand\nE sup_{f∈F_λ} |R_φ(f) − R_{φ,n}(f)| ≤ 4λL_{φ,λ} √(2V ln(4n + 2)/n).   (3)\nAlso, for any δ > 0, with probability at least 1 − δ,\nsup_{f∈π_λ(F^t)} |R_φ(f) − R_{φ,n}(f)| ≤ cλL_{φ,λ} √((V + 1)(t + 1) log_2[2(t + 1)/ln 2]/n) + M_{φ,λ} √(ln(1/δ)/(2n))   (4)\nand\nsup_{f∈F_λ} |R_φ(f) − R_{φ,n}(f)| ≤ 4λL_{φ,λ} √(2V ln(4n + 2)/n) + M_{φ,λ} √(ln(1/δ)/(2n)).   (5)\nProof. Equations (3) and (5) constitute [6, Lemma 2]. The proof of equations (2) and (4) is similar. 
We begin with symmetrization to get\nE sup_{f∈π_λ(F^t)} |R_φ(f) − R_{φ,n}(f)| ≤ 2 E sup_{f∈π_λ(F^t)} |(1/n) Σ_{i=1}^n ε_i (φ(−Y_i f(X_i)) − φ(0))|,\nwhere the ε_i are i.i.d. with P(ε_i = 1) = P(ε_i = −1) = 1/2. Then we use the \"contraction principle\" (see [13, Theorem 4.12, pp. 112–113]) with the function ψ(x) = (φ(x) − φ(0))/L_{φ,λ} to get\nE sup_{f∈π_λ(F^t)} |R_φ(f) − R_{φ,n}(f)| ≤ 4L_{φ,λ} E sup_{f∈π_λ(F^t)} |(1/n) Σ_{i=1}^n ε_i (−Y_i f(X_i))| = 4L_{φ,λ} E sup_{f∈π_λ(F^t)} |(1/n) Σ_{i=1}^n ε_i f(X_i)|.\nNext we bound the supremum. Notice that the functions in π_λ(F^t) are bounded, clipped to absolute value λ; therefore we can rescale π_λ(F^t) by (2λ)^{−1} and get\nE sup_{f∈π_λ(F^t)} |(1/n) Σ_{i=1}^n ε_i f(X_i)| = 2λ E sup_{f∈(2λ)^{−1}π_λ(F^t)} |(1/n) Σ_{i=1}^n ε_i f(X_i)|.\nNext, we use Dudley's entropy integral [14] to bound the r.h.s. above:\nE sup_{f∈(2λ)^{−1}π_λ(F^t)} (1/n) Σ_{i=1}^n ε_i f(X_i) ≤ (12/√n) ∫_0^1 √(ln N(ε, (2λ)^{−1}π_λ(F^t), L_2(P_n))) dε.\nSince for ε > 1 the covering number N is 1, the upper integration limit can be taken to be 1, and we can use Pollard's bound [15] for F ⊆ [0, 1]^X,\nN(ε, F, L_2(P)) ≤ 2 (4e/ε²)^{d_P(F)},\nwhere d_P(F) is the pseudodimension, and obtain, for c̃ = 12 ∫_0^1 √(ln(8e/ε²)) dε,\nE sup_{f∈(2λ)^{−1}π_λ(F^t)} (1/n) Σ_{i=1}^n ε_i f(X_i) ≤ c̃ √(d_P((2λ)^{−1}π_λ(F^t))/n);\nnotice also that the constant c̃ doesn't depend on F^t or λ. Next, since (2λ)^{−1}π_λ(·) is a non-decreasing transform, we use the inequality d_P((2λ)^{−1}π_λ(F^t)) ≤ d_P(F^t) (e.g. [11, Theorem 11.3]):\nE sup_{f∈(2λ)^{−1}π_λ(F^t)} (1/n) Σ_{i=1}^n ε_i f(X_i) ≤ c̃ √(d_P(F^t)/n).\nAnd then, since Lemma 1 gives an upper bound on the pseudodimension of the class F^t, we have\nE sup_{f∈π_λ(F^t)} |(1/n) Σ_{i=1}^n ε_i f(X_i)| ≤ cλ √((V + 1)(t + 1) log_2[2(t + 1)/ln 2]/n),\nwith the constant c above independent of H, t and λ. To prove the second statement we use McDiarmid's bounded difference inequality [16, Theorem 9.2, p. 136], since for any sample (x_j, y_j)_{j=1}^n and any replacement pair (x′_i, y′_i),\nsup_{f∈π_λ(F^t)} |R_φ(f) − R_{φ,n}(f)| − sup_{f∈π_λ(F^t)} |R_φ(f) − R′_{φ,n}(f)| ≤ M_{φ,λ}/n,\nwhere R′_{φ,n}(f) is obtained from R_{φ,n}(f) by changing the pair (x_i, y_i) to (x′_i, y′_i). This completes the proof of the lemma.\nLemma 2, unlike [6, Lemma 2], allows us to choose the number of steps t, which describes the complexity of the linear combination of base functions, in addition to the parameter λ, which governs the size of the deviations of the functions in F_λ, and this is essential for the proof of consistency. It is easy to see that for AdaBoost (i.e. φ(x) = e^{−x}) we have to choose λ = κ ln n and t = n^α with κ > 0, α > 0 and 2κ + α < 1. So far we have dealt with the statistical properties of the function we are minimizing; now we turn to the algorithmic part. We need the following simple consequence of the proof of [10, Theorem 1].\nTheorem 1 Let the function Q(f) be convex in f. Let Q* = lim_{λ→∞} inf_{f∈F_λ} Q(f). Assume that there exist c_1, c_2 such that Q* < c_1 < c_2 < ∞ and\n0 < inf{Q″(f; h) : c_1 < Q(f) < c_2, h ∈ H} ≤ sup{Q″(f; h) : Q(f) < c_2, h ∈ H} < ∞.\nThen for any reference function f̄ and the sequence of functions f_m produced by the boosting algorithm, the following bound holds for all m such that Q(f_m) > Q(f̄):\nQ(f_m) ≤ Q(f̄) + √(8B³Q(f_0)(Q(f_0) − Q(f̄))/β³) · ( ln((γ_0² + c_3(m + 1))/γ_0²) )^{−1/2},   (6)\nwhere γ_k = ‖f̄ − f_k‖_1, c_3 = 2Q(f_0)/β,\nβ = inf{Q″(f; h) : Q(f̄) < Q(f) < Q(f_0), h ∈ H},\nB = sup{Q″(f; h) : Q(f) < Q(f_0), h ∈ H}.\nProof. The statement of the theorem is a version of the result implicit in the proof of [10, Theorem 1]. If for some m̄ we have Q(f_m̄) ≤ Q(f̄), then the theorem is trivially true for all m ≥ m̄. Therefore, we are going to consider only the case Q(f_{m+1}) > Q(f̄). By convexity of Q(·),\n|Q′(f_m; f_m − f̄)| ≥ Q(f_m) − Q(f̄) = ε_m.   (7)\nLet f_m − f̄ = Σ_i λ̃_i h̃_i, where the λ̃_i and h̃_i correspond to the best representation (the one with the smallest l_1-norm). Then from (7) and linearity of the derivative we have\nε_m ≤ |Σ_i λ̃_i Q′(f_m; h̃_i)| ≤ sup_{h∈H} |Q′(f_m; h)| Σ_i |λ̃_i|,\ntherefore\nsup_{h∈H} |Q′(f_m; h)| ≥ ε_m/‖f_m − f̄‖_1.   (8)\nNext,\nQ(f_m + α_m h_m) = Q(f_m) + α_m Q′(f_m; h_m) + (α_m²/2) Q″(f̃_m; h_m),\nwhere f̃_m = f_m + α̃_m h_m for some α̃_m ∈ [0, α_m], and since by assumption f̃_m is on the path from f_m to f_{m+1} we have the bounds\nQ(f̄) < Q(f_{m+1}) ≤ Q(f̃_m) ≤ Q(f_m) ≤ Q(f_0);\nthen, by the assumption of the theorem for β (which depends on Q(f̄)), we have\nQ(f_{m+1}) ≥ Q(f_m) + inf_{α∈R} (α Q′(f_m; h_m) + (α²/2) β) = Q(f_m) − |Q′(f_m; h_m)|²/(2β).   (9)\nOn the other hand,\nQ(f_{m+1}) = Q(f_m + α_m h_m) = inf_{h∈H, α∈R} Q(f_m + αh) ≤ inf_{h∈H, α∈R} (Q(f_m) + α Q′(f_m; h) + (α²/2) B) = Q(f_m) − sup_{h∈H} |Q′(f_m; h)|²/(2B).   (10)\nTherefore, combining (9) and (10), we get\n|Q′(f_m; h_m)| ≥ √(β/B) sup_{h∈H} |Q′(f_m; h)|.   (11)\nAnother Taylor expansion, this time around f_{m+1}, gives us\nQ(f_m) = Q(f_{m+1}) + (α_m²/2) Q″(f̄_m; h_m),   (12)\nwhere f̄_m is some (other) function on the path from f_m to f_{m+1} (here Q′(f_{m+1}; h_m) = 0, since α_m minimizes Q along the direction h_m). Therefore, if |α_m| < |Q′(f_m; h_m)|/B, then\nQ(f_m) − Q(f_{m+1}) < |Q′(f_m; h_m)|²/(2B),\nbut by (10)\nQ(f_m) − Q(f_{m+1}) ≥ sup_{h∈H} |Q′(f_m; h)|²/(2B) ≥ |Q′(f_m; h_m)|²/(2B),\na contradiction; therefore we conclude, combining with (11) and (8), that\n|α_m| ≥ |Q′(f_m; h_m)|/B ≥ √β sup_{h∈H} |Q′(f_m; h)|/B^{3/2} ≥ √β ε_m/(B^{3/2} ‖f_m − f̄‖_1).   (13)\nUsing (12) we have\nΣ_{i=0}^m α_i² ≤ (2/β) Σ_{i=0}^m (Q(f_i) − Q(f_{i+1})) ≤ (2/β)(Q(f_0) − Q(f̄)).   (14)\nRecall that\n‖f_m − f̄‖_1 ≤ ‖f_{m−1} − f̄‖_1 + |α_{m−1}| ≤ ‖f_0 − f̄‖_1 + Σ_{i=0}^{m−1} |α_i| ≤ ‖f_0 − f̄‖_1 + √m (Σ_{i=0}^{m−1} α_i²)^{1/2} ≤ γ_0 + √((2Q(f_0)/β) m) = γ_0 + √(c_3 m);\ntherefore, combining with (14) and (13), since the sequence ε_i is decreasing and (a + b)² ≤ 2(a² + b²),\n(2/β)(Q(f_0) − Q(f̄)) ≥ Σ_{i=0}^m α_i² ≥ (β/B³) Σ_{i=0}^m ε_i²/‖f_i − f̄‖_1² ≥ (β ε_m²/B³) Σ_{i=0}^m 1/(γ_0 + √(c_3 i))² ≥ (β ε_m²/(2B³)) Σ_{i=0}^m 1/(γ_0² + c_3 i).\nSince\nΣ_{i=0}^m 1/(a + bi) ≥ ∫_0^{m+1} dx/(a + bx) = (1/b) ln((a + b(m + 1))/a),\nwe get\n(2/β)(Q(f_0) − Q(f̄)) ≥ (β ε_m²/(2B³)) (1/c_3) ln((γ_0² + c_3(m + 1))/γ_0²).\nTherefore\nε_m ≤ √(8B³Q(f_0)(Q(f_0) − Q(f̄))/β³) · ( ln((γ_0² + c_3(m + 1))/γ_0²) )^{−1/2},\nand this completes the proof.\nThe theorem above allows us to get an upper bound on the difference between the φ-risk of the function output by AdaBoost and the φ-risk of an appropriate reference function.\nTheorem 2 Assume R* > 0. Let t_n = n^α be the number of steps we run AdaBoost and let λ_n = κ ln n, with α > 0, κ > 0 and α + 2κ < 1. Let f̄_n be a minimizer of the function R_n(·) within F_{λ_n}. Then for n large enough, with high probability, the following holds:\nR_n(f_{t_n}) ≤ R_n(f̄_n) + (8/(R*)^{3/2}) ( ln((λ_n² + (4/R*) t_n)/λ_n²) )^{−1/2}.\nProof. This theorem follows directly from Theorem 1. Because in AdaBoost\nR_n″(f; h) = (1/n) Σ_{i=1}^n (−Y_i h(X_i))² e^{−Y_i f(X_i)} = (1/n) Σ_{i=1}^n e^{−Y_i f(X_i)} = R_n(f),\nall the conditions of Theorem 1 are satisfied (with Q(f) replaced by R_n(f)), and in Equation (6) we have B ≤ R_n(f_0) = 1, β ≥ R_n(f̄_n), γ_0 = ‖f_0 − f̄_n‖_1 ≤ λ_n. Since for t such that R_n(f_t) ≤ R_n(f̄_n) the theorem is trivially true, we only have to notice that Lemma 2 guarantees that, with probability at least 1 − δ,\n|R(f̄_n) − R_n(f̄_n)| ≤ 4λ_n L_{φ,λ_n} √(2V ln(4n + 2)/n) + M_{φ,λ_n} √(ln(1/δ)/(2n)).\nThus for n such that the r.h.s. of the above expression is less than R*/2 we have R_n(f̄_n) ≥ R*/2, and the result follows immediately from Equation (6) if we use the fact that R_n(f̄_n) > 0.\nThen, having all the ingredients at hand, we can formulate the main result of the paper.\nTheorem 3 Assume V = d_VC(H) < ∞, L* > 0,\nlim_{λ→∞} inf_{f∈F_λ} R(f) = R*,\nt_n → ∞, and t_n = O(n^α) for α < 1. Then AdaBoost stopped at step t_n returns a sequence of classifiers almost surely satisfying L(g(f_{t_n})) → L*.\nProof. For the exponential loss function, L* > 0 implies R* > 0. Let λ_n = κ ln n, κ > 0, 2κ + α < 1. Also, let f̄ be a minimizer of R within F_{λ_n} and f̄_n be a minimizer of R_n within F_{λ_n}. Then we have\nR(π_{λ_n}(f_{t_n})) ≤ R_n(π_{λ_n}(f_{t_n})) + ε_1   by Lemma 2   (15)\n≤ R_n(f_{t_n}) + ε_1 + φ(λ_n)   since φ(π_{λ_n}(x)) ≤ φ(x) + φ(λ_n)\n≤ R_n(f̄_n) + ε_1 + φ(λ_n) + ε_2   by Theorem 2   (16)\n≤ R(f̄) + ε_1 + φ(λ_n) + ε_2 + ε_3   by Lemma 2, since R_n(f̄_n) ≤ R_n(f̄).   (17)\nInequalities (15) and (17) hold with probability at least 1 − δ_n each, while inequality (16) is true for sufficiently large n when (17) holds. 
The ε's above are\nε_1 = cκ n^κ ln n √((V + 1)(n^α + 1) log_2[2(n^α + 1)/ln 2]/n) + n^κ √(ln(1/δ_n)/(2n)),\nε_2 = (8/(R*)^{3/2}) ( ln(((κ ln n)² + (4/R*) n^α)/(κ ln n)²) )^{−1/2},\nε_3 = 4κ n^κ ln n √(2V ln(4n + 2)/n) + n^κ √(ln(1/δ_n)/(2n)),\nand φ(λ_n) = n^{−κ} (note that L_{φ,λ_n} = M_{φ,λ_n} = e^{λ_n} = n^κ). Therefore, by the choice of κ and α, and an appropriate choice of δ_n, for example δ_n = n^{−2}, we have ε_1 → 0, ε_2 → 0, ε_3 → 0 and φ(λ_n) → 0. Also, R(f̄) → R* by Assumption 1. Now we appeal to the Borel–Cantelli lemma and arrive at R(π_{λ_n}(f_{t_n})) → R* a.s. Eventually we can use [17, Theorem 3] to conclude that L(g(π_{λ_n}(f_{t_n}))) → L* a.s. But for λ_n > 0 we have g(π_{λ_n}(f_{t_n})) = g(f_{t_n}), therefore L(g(f_{t_n})) → L* a.s. Hence AdaBoost is consistent if stopped after n^α steps.\n\n4 Discussion\nWe showed that AdaBoost is consistent if stopped sufficiently early, after t_n iterations, for t_n = n^α with α < 1, given that the Bayes risk L* > 0. It is unclear whether this number can be increased. Results of Jiang [5] imply that for some X and function class H the AdaBoost algorithm will achieve zero training error after t_n steps, where n²/t_n = o(1). We don't know what happens in between O(n^{1−ε}) and O(n² ln n). Lessening this gap is a subject of further research. We analyzed only AdaBoost, the boosting algorithm that uses the loss function φ(x) = e^{−x}. Since the proof of Theorem 2 relies on the properties of the exponential loss, we cannot make a similar conclusion for other versions of boosting, e.g., logit boosting with φ(x) = ln(1 + e^{−x}): in this case the assumption on the second derivative holds with R_n″(f; h) ≥ R_n(f)/n, but the resulting inequality is trivial; the factor 1/n precludes us from finding any useful bound. It is a subject of future work to find an analogue of Theorem 2 that will handle the logit loss.\nAcknowledgments\nWe gratefully acknowledge the support of NSF under award DMS-0434383.\nReferences\n[1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.\n[2] Leo Breiman. Bagging predictors. 
Machine Learning, 24(2):123–140, 1996.\n[3] Leo Breiman. Arcing classifiers (with discussion). The Annals of Statistics, 26(3):801–849, 1998. (Was Department of Statistics, U.C. Berkeley Technical Report 460, 1996.)\n[4] Leo Breiman. Some infinite theory for predictor ensembles. Technical Report 579, Department of Statistics, University of California, Berkeley, 2000.\n[5] Wenxin Jiang. On weak base hypotheses and their implications for boosting regression and classification. The Annals of Statistics, 30:51–73, 2002.\n[6] Gabor Lugosi and Nicolas Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32(1):30–55, 2004.\n[7] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–85, 2004.\n[8] Tong Zhang and Bin Yu. Boosting with early stopping: convergence and consistency. The Annals of Statistics, 33:1538–1579, 2005.\n[9] Wenxin Jiang. Process consistency for AdaBoost. The Annals of Statistics, 32(1):13–29, 2004.\n[10] P. J. Bickel, Y. Ritov, and A. Zakai. Some theory for generalized boosting algorithms. Journal of Machine Learning Research, 7:705–732, May 2006.\n[11] Martin Anthony and Peter Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.\n[12] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30:1–50, 2002.\n[13] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer-Verlag, New York, 1991.\n[14] Richard M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, Cambridge, 1999.\n[15] David Pollard. Empirical Processes: Theory and Applications. IMS, 1990.\n[16] Luc Devroye, Laszlo Gyorfi, and Gabor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.\n[17] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. 
Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.\n", "award": [], "sourceid": 2963, "authors": [{"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Mikhail", "family_name": "Traskin", "institution": null}]}