{"title": "One-Pass Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 73, "page_last": 80, "abstract": null, "full_text": "One-Pass Boosting

Zafer Barutcuoglu
zbarutcu@cs.princeton.edu

Philip M. Long
plong@google.com

Rocco A. Servedio
rocco@cs.columbia.edu

Abstract

This paper studies boosting algorithms that make a single pass over a set of base classifiers.
We first analyze a one-pass algorithm in the setting of boosting with diverse base classifiers. Our guarantee is the same as the best proved for any boosting algorithm, but our one-pass algorithm is much faster than previous approaches.
We next exhibit a random source of examples for which a “picky” variant of AdaBoost that skips poor base classifiers can outperform the standard AdaBoost algorithm, which uses every base classifier, by an exponential factor.
Experiments with Reuters and synthetic data show that one-pass boosting can substantially improve on the accuracy of Naive Bayes, and that picky boosting can sometimes lead to a further improvement in accuracy.

1 Introduction

Boosting algorithms use simple “base classifiers” to build more complex, but more accurate, aggregate classifiers. The aggregate classifier typically makes its class predictions using a weighted vote over the predictions made by the base classifiers, which are usually chosen one at a time in rounds. When boosting is applied in practice, the base classifier at each round is usually optimized: typically, each example is assigned a weight that depends on how well it is handled by the previously chosen base classifiers, and the new base classifier is chosen to minimize the weighted training error. 
But sometimes this is not feasible; there may be a huge number of base classifiers with insufficient apparent structure among them to avoid simply trying all of them out to find out which is best. For example, there may be a base classifier for each word or k-mer. (Note that, due to named entities, the number of “words” in some analyses can far exceed the number of words in any natural language.) In such situations, optimizing at each round may be prohibitively expensive.
The analysis of AdaBoost, however, suggests that there could be hope in such cases. Recall that if AdaBoost is run with a sequence of base classifiers b_1, ..., b_n that achieve weighted error 1/2 − γ_1, ..., 1/2 − γ_n, then the training error of AdaBoost's final output hypothesis is at most exp(−2 Σ_{t=1}^n γ_t²). One could imagine applying AdaBoost without performing optimization: (a) fixing an order b_1, ..., b_n of the base classifiers without looking at the data, (b) committing to use base classifier b_t in round t, and (c) setting the weight with which b_t votes as a function of its weighted training error using AdaBoost. (In a one-pass scenario, it seems sensible to use AdaBoost since, as indicated by the above bound, it can capitalize on the advantage over random guessing of every hypothesis.) The resulting algorithm uses essentially the same computational resources as Naive Bayes [2, 7], but benefits from taking some account of the dependence among base classifiers. Thus motivated, in this paper we study the performance of different boosting algorithms in a one-pass setting.
Contributions. We begin by providing theoretical support for one-pass boosting using the “diverse base classifiers” framework previously studied in [1, 6]. 
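The optimization-free scheme (a)-(c) above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: `base_predictions` is a hypothetical input holding each fixed-order base classifier's ±1 predictions on the sample, and the initial distribution is taken to be uniform over the training examples.

```python
import math

def one_pass_adaboost(base_predictions, labels):
    """One pass over base classifiers in a fixed order: each b_t is used
    exactly once, with voting weight alpha_t = 0.5*ln((1-eps_t)/eps_t)
    computed from its weighted error under the current distribution."""
    m = len(labels)
    D = [1.0 / m] * m                  # D_1 = uniform over the sample
    alphas = []
    for preds in base_predictions:     # b_1, ..., b_n in the fixed order
        eps = sum(D[i] for i in range(m) if preds[i] != labels[i])
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)   # guard against ln(0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        alphas.append(alpha)
        # reweight by exp(-alpha * y * h(x)), then renormalize
        D = [D[i] * math.exp(-alpha * labels[i] * preds[i]) for i in range(m)]
        Z = sum(D)
        D = [d / Z for d in D]
    def H(j):                          # final weighted-majority vote on example j
        return 1 if sum(a * p[j] for a, p in zip(alphas, base_predictions)) >= 0 else -1
    return H, alphas
```

Note that no base classifier is ever searched for or revisited, which is what keeps the cost comparable to Naive Bayes.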
In this scenario there are n base classifiers. For an unknown subset G of k of the base classifiers, the events that the classifiers in G are correct on a random item are mutually independent. This formalizes the notion that these k base classifiers are not redundant. Each of these k classifiers is assumed to have error 1/2 − γ under the initial distribution, and no assumption is made about the other n − k base classifiers. In [1] it is shown that if Boost-by-Majority is applied with a weak learner that does optimization (i.e. always uses the “best” of the n candidate base classifiers at each of Θ(k) stages of boosting), the error rate of the combined hypothesis with respect to the underlying distribution is (roughly) at most exp(−Ω(γ²k)). In Section 2 we show that a one-pass variant of Boost-by-Majority achieves a similar bound with a single pass through the n base classifiers, reducing the computation time required by an Ω(k) factor.
We next show in Section 3 that when running AdaBoost using one pass, it can sometimes be advantageous to abstain from using base classifiers that are too weak. Intuitively, this is because using many weak base classifiers early on can cause the boosting algorithm to reweight the data in a way that obscures the value of a strong base classifier that comes later. (Note that the quadratic dependence on γ_t in the exponent of the exp(−2 Σ_{t=1}^n γ_t²) bound means that one good base classifier is more valuable than many poor ones.) 
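The quadratic dependence just mentioned is easy to see numerically; the short sketch below (illustrative only, not from the paper) evaluates the bound exp(−2 Σ_t γ_t²) for one strong base classifier versus a hundred weak ones.

```python
import math

def adaboost_train_error_bound(advantages):
    """AdaBoost training-error bound: exp(-2 * sum of squared advantages)."""
    return math.exp(-2.0 * sum(g * g for g in advantages))

# a single base classifier with advantage 0.4 gives a smaller (better) bound ...
strong = adaboost_train_error_bound([0.4])        # exp(-0.32)
# ... than one hundred base classifiers with advantage 0.01 each
weak = adaboost_train_error_bound([0.01] * 100)   # exp(-0.02)
```

Squaring makes a hundred advantages of 0.01 contribute only 0.01 to the exponent, while one advantage of 0.4 contributes 0.16.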
In a bit more detail, suppose that base classifiers are considered in the order b_1, ..., b_n, where each of b_1, ..., b_{n−1} has a “small” advantage over random guessing under the initial distribution D and b_n has a “large” advantage under D. Using b_1, ..., b_{n−1} for the first n − 1 stages of AdaBoost can cause the distributions D_2, D_3, ... to change from the initial D_1 in such a way that when b_n is finally considered, its advantage under D_n is markedly smaller than its advantage under D_1, causing AdaBoost to assign b_n a small voting weight. In contrast, a “picky” version of AdaBoost would pass up the opportunity to use b_1, ..., b_{n−1} (since their advantages are too small) and thus be able to reap the full benefit of using b_n under the distribution D_1 (since when b_n is finally considered the distribution is still D_1, because no earlier base classifiers have been used).
Finally, Section 4 gives experimental results on Reuters and synthetic data. These show that one-pass boosting can lead to substantial improvement in accuracy over Naive Bayes while using a similar amount of computation, and that picky one-pass boosting can sometimes further improve accuracy.

2 Faster learning with diverse base classifiers

We consider the framework of boosting in the presence of diverse base classifiers studied in [1].

Definition 1 (Diverse γ-good) Let D be a distribution over X × {−1, 1}. We say that a set G of classifiers is diverse and γ-good with respect to D if (i) each classifier in G has advantage at least γ (i.e., error at most 1/2 − γ) with respect to D, and (ii) the events that the classifiers in G are correct are mutually independent under D.

We will analyze the Picky-One-Pass Boost-by-Majority (POPBM) algorithm, which we define as follows. 
It uses three parameters, α, T and ε.

1. Choose a random ordering b_1, ..., b_n of the base classifiers in H, and set i_1 = 1.

2. For as many rounds t as i_t ≤ min{T, n}:

(a) Define D_t as follows: for each example (x, y),

i. Let r_t(x, y) be the number of previously chosen base classifiers h_1, ..., h_{t−1} that are correct on (x, y);

ii. Let w_t(x, y) = C(T − t − 1, ⌊T/2⌋ − r_t(x, y)) · (1/2 + α)^{⌊T/2⌋ − r_t(x, y)} · (1/2 − α)^{⌈T/2⌉ − t − 1 + r_t(x, y)}, let Z_t = E_{(x,y)∼D}[w_t(x, y)], and let D_t(x, y) = w_t(x, y) D(x, y)/Z_t.

(b) Compare Z_t to ε/T, and

i. If Z_t ≥ ε/T, then try b_{i_t}, b_{i_t + 1}, ... until you encounter a hypothesis b_j with advantage at least α with respect to D_t (and if you run out of base classifiers before this happens, then go to step 3). Set h_t to be b_j (i.e. return b_j to the boosting algorithm) and set i_{t+1} to j + 1 (i.e. the index of the next base classifier in the list).

ii. If Z_t < ε/T, then set h_t to be the constant-1 hypothesis (i.e. return this constant hypothesis to the boosting algorithm) and set i_{t+1} = i_t.

3. If t < T + 1 (i.e. the algorithm ran out of base classifiers before selecting T of them), abort. Otherwise, output the final classifier f(x) = Maj(h_1(x), ..., h_T(x)).

The idea behind step 2.b.ii is that if Z_t is small, then Lemma 4 will show that it doesn't much matter how good this weak hypothesis is, so we simply use a constant hypothesis.
To simplify the exposition, we have assumed that POPBM can exactly determine quantities such as Z_t and the accuracies of the weak hypotheses. This would provably be the case if D were concentrated on a moderate number of examples, e.g. uniform over a training set. 
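The weight in step 2.a.ii can be read as a binomial point probability, which is why w_t(x, y) ≤ 1 always holds (a fact used later in the proof of Claim 6). The sketch below is our illustration of that reading, under our reconstruction of the formula; it is not the authors' code.

```python
from math import comb

def popbm_weight(T, t, r, alpha):
    """Boost-by-majority weight for an example that r of the first t-1
    chosen base classifiers got right.  With m = T-t-1 and j = floor(T/2)-r,
    this is exactly Pr[Binomial(m, 1/2+alpha) = j], hence always in [0, 1]."""
    m = T - t - 1
    j = T // 2 - r
    if j < 0 or j > m:
        return 0.0
    return comb(m, j) * (0.5 + alpha) ** j * (0.5 - alpha) ** (m - j)
```

Summing over all values of r that give a valid binomial index recovers the full binomial distribution, so the weights sum to exactly 1.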
With slight complications, a similar analysis can be performed when these quantities must be estimated.
The following lemma from [1] shows that if the filtered distribution is not too different from the original distribution, then there is a good weak hypothesis relative to the original distribution.

Lemma 2 ([1]) Suppose a set G of classifiers of size k is diverse and γ-good with respect to D. For any probability distribution Q such that Q(x, y) ≤ (γ/3) e^{γ²k/2} D(x, y) for all (x, y) ∈ X × {−1, 1}, there is a g ∈ G such that

Pr_{(x,y)∼Q}[g(x) = y] ≥ 1/2 + γ/4.   (1)

The following simple extension of Lemma 2 shows that, given a stronger constraint on the filtered distribution, there are many good weak hypotheses available.

Lemma 3 Suppose a set G of classifiers of size k is diverse and γ-good with respect to D. Fix any ℓ < k. For any probability distribution Q such that

Q(x, y) ≤ (γ/3) e^{γ²ℓ/2} D(x, y)   (2)

for all (x, y) ∈ X × {−1, 1}, there are at least k − ℓ + 1 members g of G such that (1) holds.

Proof: Fix any distribution Q satisfying (2). Let g_1, ..., g_ℓ be an arbitrary collection of ℓ elements of G. Since {g_1, ..., g_ℓ} and Q satisfy the requirements of Lemma 2 with k set to ℓ, one of g_1, ..., g_ℓ must satisfy (1); so any set of ℓ elements drawn from G contains an element that satisfies (1). This yields the lemma.

We will use another lemma, implicit in Freund's analysis [3], formulated as stated here in [1]. 
It formalizes two ideas: (a) if the weak learners perform well, then so will the strong learner; and (b) the performance of the weak learner is not important in rounds for which Z_t is small.

Lemma 4 Suppose that Boost-by-Majority is run with parameters α and T, and generates classifiers h_1, ..., h_T for which D_1(h_1(x) = y) = 1/2 + γ_1, ..., D_T(h_T(x) = y) = 1/2 + γ_T. Then, for a random element of D, a majority vote over h_1, ..., h_T is incorrect with probability at most

e^{−2α²T} + Σ_{t=1}^T (α − γ_t) Z_t.

Now we give our analysis.

Theorem 5 Suppose the set H of base classifiers used by POPBM contains a subset G of k elements that is diverse and γ-good with respect to the initial distribution D, where γ is a constant (say 1/4). Then there is a setting of the parameters of POPBM so that, with probability 1 − 2^{−Ω(k)}, it outputs a classifier with error exp(−Ω(γ²k)) with respect to the original distribution D.

Proof: We prove that α = γ/4, T = k/64, and ε = (3k/(8γ)) e^{−γ²k/16} is a setting of parameters as required. We will establish the following claim:

Claim 6 For the above parameter settings we have Pr[POPBM aborts in Step 3] = 2^{−Ω(k)}.

Suppose for now that the claim holds, so that with high probability POPBM outputs a classifier. In case it does, let f be this output. Then since POPBM runs for a full T rounds, we may apply Lemma 4, which bounds the error rate of the Boost-by-Majority final classifier. 
The lemma gives us that D(f(x) ≠ y) is at most

e^{−2α²T} + Σ_{t=1}^T (α − γ_t) Z_t = e^{−γ²T/8} + Σ_{t: Z_t < ε/T} (α − γ_t) Z_t + Σ_{t: Z_t ≥ ε/T} (α − γ_t) Z_t ≤ e^{−Ω(γ²k)} + T · (ε/T) + 0 = e^{−Ω(γ²k)}.

The final inequality holds since α − γ_t ≤ 0 if Z_t ≥ ε/T and α − γ_t ≤ 1 if Z_t < ε/T.
Proof of Claim 6: In order for POPBM to abort, it must be the case that, as the k base classifiers in G are encountered in sequence as the algorithm proceeds through b_1, ..., b_n, more than 63k/64 of them are skipped in Step 2.b.i. We show this occurs with probability at most 2^{−Ω(k)}.
For each j ∈ {1, ..., k}, let X_j be an indicator variable for the event that the j-th member of G in the ordering b_1, ..., b_n is encountered during the boosting process and skipped, and for each ℓ ∈ {1, ..., k}, let S_ℓ = min{(Σ_{j=1}^ℓ X_j) − (3/4)ℓ, k/8}. We claim that S_1, ..., S_{k/8} is a supermartingale, i.e. that E[S_{ℓ+1} | S_1, ..., S_ℓ] ≤ S_ℓ for all ℓ < k/8. If S_ℓ = k/8 or if the boosting process has terminated by the ℓ-th member of G, this is obvious. Suppose that S_ℓ < k/8 and that the algorithm has not terminated yet. Let t be the round of boosting in which the ℓ-th member of G is encountered. The value w_t(x, y) can be interpreted as a probability, and so we have that w_t(x, y) ≤ 1. Consequently, we have that

D_t(x, y) ≤ D(x, y)/Z_t ≤ D(x, y) · (T/ε) = D(x, y) · (γ/24) e^{γ²k/16} < D(x, y) · (γ/3) e^{γ²k/8}.

Now Lemma 3 implies that at least half of the classifiers in G have advantage at least α w.r.t. 
D_t. Since ℓ < k/4, it follows that at least k/4 of the remaining (at most k) classifiers in G that have not yet been seen have advantage at least α w.r.t. D_t. Since the base classifiers were ordered randomly, any order over the remaining hypotheses is equally likely, and so also is any order over the remaining hypotheses from G. Thus, the probability that the next member of G to be encountered has advantage at least α is at least 1/4, so the probability that it is skipped is at most 3/4. This completes the proof that S_1, ..., S_{k/8} is a supermartingale.
Since |S_ℓ − S_{ℓ−1}| ≤ 1, Azuma's inequality for supermartingales implies that Pr(S_{k/8} > k/64) ≤ e^{−Ω(k)}. This means that the probability that at least k/64 good elements were not skipped is at least 1 − e^{−Ω(k)}, which completes the proof.

3 For one-pass boosting, PickyAdaBoost can outperform AdaBoost

AdaBoost is the most popular boosting algorithm. It is most often applied in conjunction with a weak learner that performs optimization, but it can be used with any weak learner. The analysis of AdaBoost might lead to the hope that it can profitably be applied for one-pass boosting. In this section, we compare AdaBoost and its picky variant on an artificial source especially designed to illustrate why the picky variant may be needed.
AdaBoost. We briefly recall some basic facts about AdaBoost (see Figure 1). If we run AdaBoost for T stages with weak hypotheses h_1, ..., h_T, it constructs a final hypothesis

H(x) = sgn(f(x)) where f(x) = Σ_{t=1}^T α_t h_t(x)   (3)

with α_t = (1/2) ln((1 − ε_t)/ε_t). Here ε_t = Pr_{(x,y)∼D_t}[h_t(x) ≠ y], where D_t is the t-th distribution constructed by the algorithm (the first distribution D_1 is just D, the initial distribution). 
We write γ_t to denote 1/2 − ε_t, the advantage of the t-th weak hypothesis under distribution D_t. Freund and Schapire [5] proved that if AdaBoost is run with an initial distribution D over a set of labeled examples, then the error rate of the final combined classifier H is at most exp(−2 Σ_{t=1}^T γ_t²) under D:

Pr_{(x,y)∼D}[H(x) ≠ y] ≤ exp(−2 Σ_{t=1}^T γ_t²).   (4)

(We note that AdaBoost is usually described in the case in which D is uniform over a training set, but the algorithm and most of its analyses, including (4), go through in the greater generality presented here. The fact that the definition of α_t depends indirectly on an expectation evaluated according to D makes the case in which D is uniform over a sample most directly relevant to practice. However, it is easiest to describe our construction using this more general formulation of AdaBoost.)
PickyAdaBoost. Now we define a “picky” version of AdaBoost, which we call PickyAdaBoost. The PickyAdaBoost algorithm is initialized with a parameter γ̄ > 0. Given a value γ̄, the PickyAdaBoost algorithm works like AdaBoost but with the following difference. 
Suppose that PickyAdaBoost is performing round t of boosting, the current distribution is some D′, and the current base classifier h_t being considered has advantage γ′ under D′, where |γ′| < γ̄. If this is the case then PickyAdaBoost abstains in that round and does not include h_t in the combined hypothesis it is constructing. (Note that consequently the distribution for the next round of boosting will also be D′.) On the other hand, if the current base classifier has advantage γ′ where |γ′| ≥ γ̄, then PickyAdaBoost proceeds to use the weak hypothesis just like AdaBoost, i.e. it adds α_t h_t to the function f described in (3) and adjusts D′ to obtain the next distribution.
Note that we only require the magnitude of the advantage to be at least γ̄. Whether a given base classifier is used, or its negation is used, the effect that it has on the output of AdaBoost is the same (briefly, because ln((1 − ε)/ε) = −ln(ε/(1 − ε))). 

Given a source D of random examples.
• Initialize D_1 = D.
• For each round t from 1 to T:
  – Present D_t to a weak learner, and receive base classifier h_t;
  – Calculate the error ε_t = Pr_{(x,y)∼D_t}[h_t(x) ≠ y] and set α_t = (1/2) ln((1 − ε_t)/ε_t);
  – Update the distribution: define D′_{t+1} by setting D′_{t+1}(x, y) = exp(−α_t y h_t(x)) D_t(x, y), and normalize D′_{t+1} to get the probability distribution D_{t+1} = D′_{t+1}/Z_{t+1};
• Return the final classification rule H(x) = sgn(Σ_t α_t h_t(x)).

Figure 1: Pseudo-code for AdaBoost (from [4]).

Consequently, the appropriate notion of a “picky” version of AdaBoost is to require the magnitude of the advantage to be large.

3.1 The construction

We consider a sequence of n + 1 base classifiers b_1, ..., b_n, b_{n+1}. For simplicity we suppose that the domain X is {−1, 1}^{n+1} and that the value of the i-th base classifier on an instance x ∈ {−1, 1}^{n+1} is simply b_i(x) = x_i.
Now we define the distribution D over X × {−1, 1}. A draw of (x, y) is obtained from D as follows: the bit y is chosen uniformly from {+1, −1}. Each bit x_1, ..., x_n is chosen independently to equal y with probability 1/2 + γ, and the bit x_{n+1} is chosen to equal y if there exists an i, 1 ≤ i ≤ n, for which x_i = y; if x_i = −y for all 1 ≤ i ≤ n then x_{n+1} is set to −y.

3.2 Base classifiers in the order b_1, ..., b_n, b_{n+1}

Throughout Section 3.2 we will only consider parameter settings of γ, γ̄, n for which γ < γ̄ ≤ 1/2 − (1/2 − γ)^n. Note that the inequality γ < 1/2 − (1/2 − γ)^n is equivalent to (1/2 − γ)^n < 1/2 − γ, which holds for all n ≥ 2.
In the case where γ < γ̄ ≤ 1/2 − (1/2 − γ)^n, it is easy to analyze the error rate of PickyAdaBoost(γ̄) after one pass through the base classifiers in the order b_1, ..., b_n, b_{n+1}. Since each of b_1, ..., b_n has advantage exactly γ under D and b_{n+1} has advantage 1/2 − (1/2 − γ)^n under D, PickyAdaBoost(γ̄) will abstain in rounds 1, ..., n and so its final hypothesis is sgn(b_{n+1}(·)), which is the same as b_{n+1}. It is clear that b_{n+1} is wrong only if x_i ≠ y for each i = 1, ..., n, which occurs with probability (1/2 − γ)^n. 
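A sampler for this source is a one-liner per coordinate; the sketch below is our own illustration (the function name and signature are not from the paper).

```python
import random

def draw_example(n, gamma, rng):
    """One draw (x, y) from the source D of Section 3.1: y is a fair coin,
    x_1..x_n each equal y independently with probability 1/2 + gamma, and
    the last coordinate (base classifier b_{n+1}) equals y iff some earlier
    coordinate does."""
    y = rng.choice([-1, 1])
    x = [y if rng.random() < 0.5 + gamma else -y for _ in range(n)]
    x.append(y if any(xi == y for xi in x) else -y)
    return x, y
```

By construction, b_{n+1} errs exactly when all of x_1, ..., x_n disagree with y, i.e. with probability (1/2 − γ)^n.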
We thus have:

Lemma 7 For γ < γ̄ ≤ 1/2 − (1/2 − γ)^n, PickyAdaBoost(γ̄) constructs a final hypothesis which has error rate precisely (1/2 − γ)^n under D.

AdaBoost. Now let us analyze the error rate of AdaBoost after one pass through the base classifiers in the order b_1, ..., b_{n+1}. We write D_t to denote the distribution that AdaBoost uses at the t-th stage of boosting (so D = D_1). Recall that γ_t is the advantage of b_t under distribution D_t.
The following claim is an easy consequence of the fact that, given the value of y, the values of the base classifiers b_1, ..., b_n are all mutually independent:

Claim 8 For each 1 ≤ t ≤ n we have that γ_t = γ.

It follows that the coefficients α_1, ..., α_n of b_1, ..., b_n are all equal to (1/2) ln((1/2 + γ)/(1/2 − γ)) = (1/2) ln((1 + 2γ)/(1 − 2γ)).
The next claim can be straightforwardly proved by induction on t:

Claim 9 Let D_r denote the distribution constructed by AdaBoost after processing the base classifiers b_1, ..., b_{r−1} in that order. A draw of (x, y) from D_r is distributed as follows:

• The bit y is uniform random from {−1, +1};
• Each bit x_1, ..., x_{r−1} independently equals y with probability 1/2, and each bit x_r, ..., x_n independently equals y with probability 1/2 + γ;
• The bit x_{n+1} is set as described in Section 3.1, i.e. x_{n+1} = −y if and only if x_1 = ··· = x_n = −y.

Claim 9 immediately gives ε_{n+1} = Pr_{(x,y)∼D_{n+1}}[b_{n+1}(x) ≠ y] = 1/2^n. 
It follows that α_{n+1} = (1/2) ln((1 − ε_{n+1})/ε_{n+1}) = (1/2) ln(2^n − 1). Thus an explicit expression for the final hypothesis of AdaBoost after one pass over the n + 1 classifiers b_1, ..., b_{n+1} is H(x) = sgn(f(x)), where

f(x) = (1/2) ln((1 + 2γ)/(1 − 2γ)) (x_1 + ··· + x_n) + (1/2) ln(2^n − 1) x_{n+1}.

Using the fact that H(x) ≠ y if and only if y f(x) < 0, it is easy to establish the following:

Claim 10 The classifier H(x) makes a mistake on (x, y) if and only if more than A of the variables x_1, ..., x_n disagree with y, where A = n/2 + ln(2^n − 1)/(2 ln((1 + 2γ)/(1 − 2γ))).

For (x, y) drawn from the source D, each of x_1, ..., x_n independently agrees with y with probability 1/2 + γ. Thus we have established the following:

Lemma 11 Let B(n, p) denote a binomial random variable with parameters n, p (i.e. a draw from B(n, p) is obtained by summing n i.i.d. 0/1 random variables, each of which has expectation p). Then the AdaBoost final hypothesis error rate is Pr[B(n, 1/2 − γ) > A], which equals

Σ_{i=⌊A⌋+1}^n C(n, i) (1/2 − γ)^i (1/2 + γ)^{n−i}.   (5)

In terms of Lemma 11, Lemma 7 states that the PickyAdaBoost(γ̄) final hypothesis has error Pr[B(n, 1/2 − γ) ≥ n]. We thus have that if A < n − 1 then AdaBoost's final hypothesis has greater error than PickyAdaBoost's.
We now give a few concrete settings of γ, n with which PickyAdaBoost beats AdaBoost. First we observe that even in some simple cases the AdaBoost error rate (5) can exceed the PickyAdaBoost error rate by a fairly large additive constant. 
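The expressions in Claim 10, Lemma 11 and Lemma 7 can be evaluated mechanically; the sketch below is ours, for illustration only.

```python
import math

def claim10_threshold(n, gamma):
    """A from Claim 10: H errs iff more than A of x_1..x_n disagree with y."""
    return n / 2 + math.log(2 ** n - 1) / (2 * math.log((1 + 2 * gamma) / (1 - 2 * gamma)))

def adaboost_error(n, gamma):
    """Lemma 11: Pr[B(n, 1/2 - gamma) > A], as an explicit binomial tail."""
    A = claim10_threshold(n, gamma)
    return sum(math.comb(n, i) * (0.5 - gamma) ** i * (0.5 + gamma) ** (n - i)
               for i in range(math.floor(A) + 1, n + 1))

def picky_error(n, gamma):
    """Lemma 7: (1/2 - gamma)^n, for any threshold between the two advantages."""
    return (0.5 - gamma) ** n
```

For example, with n = 3 and γ = 0.38 these evaluate to roughly 0.0397 and 0.0017 respectively, an additive gap of almost 0.04.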
Taking n = 3 and γ = 0.38, we find that the error rate of PickyAdaBoost(γ̄) is (1/2 − 0.38)³ = 0.001728, whereas the AdaBoost error rate is (1/2 − 0.38)³ + 3 (1/2 − 0.38)² (1/2 + 0.38) = 0.03974.
Next we observe that there can be a large multiplicative factor difference between the AdaBoost and PickyAdaBoost error rates. We have that Pr[B(n, 1/2 − γ) > A] equals Σ_{i=0}^{n−⌊A⌋−1} C(n, i) (1/2 − γ)^{n−i} (1/2 + γ)^i. This can be lower bounded by

Pr[B(n, 1/2 − γ) > A] ≥ (1/2 − γ)^n Σ_{i=0}^{n−⌊A⌋−1} C(n, i);   (6)

this bound is rough but good enough for our purposes. Viewing n as an asymptotic parameter and γ as a fixed constant, we have

(6) ≥ (1/2 − γ)^n Σ_{i=0}^{αn} C(n, i)   (7)

where α = 1/2 − ln 2/(2 ln((1 + 2γ)/(1 − 2γ))) − o(1). Using the bound Σ_{i=0}^{αn} C(n, i) = 2^{n(H(α) ± o(1))}, which holds for 0 < α < 1/2, we see that any setting of γ such that α is bounded above zero by a constant gives an exponential gap between the error rate of PickyAdaBoost (which is (1/2 − γ)^n) and the lower bound on AdaBoost's error provided by (7). As it happens, any γ ≥ 0.17 yields α > 0.01. 
We thus have:

Claim 12 For any fixed γ ∈ (0.17, 0.5) and any γ̄ with γ < γ̄ ≤ 1/2 − (1/2 − γ)^n, the final error rate of AdaBoost on the source D is 2^{Ω(n)} times that of PickyAdaBoost(γ̄).

3.3 Base classifiers in an arbitrary ordering

The above results show that PickyAdaBoost can outperform AdaBoost if the base classifiers are considered in the particular order b_1, ..., b_{n+1}. A more involved analysis (omitted because of space constraints) establishes a similar difference when the base classifiers are chosen in a random order:

Proposition 13 Suppose that 0.3 < γ < γ̄ < 0.5 and 0 < c < 1 are fixed constants independent of n that satisfy Z(γ) < c, where Z(γ) := ln(4/(1 − 2γ)²) / ln((1 + 2γ)/(1 − 2γ)³). Suppose the base classifiers are listed in an order such that b_{n+1} occurs at position c · n. Then the error rate of AdaBoost is at least 2^{n(1−c)} − 1 = 2^{Ω(n)} times greater than the error of PickyAdaBoost(γ̄).

For the case of randomly ordered base classifiers, we may view c as a real value that is uniformly distributed in [0, 1], and for any fixed constants 0.3 < γ < γ̄ < 0.5 there is a constant probability (at least 1 − Z(γ)) that AdaBoost has error rate 2^{Ω(n)} times larger than PickyAdaBoost(γ̄). This probability can be fairly large, e.g. for γ = 0.45 it is greater than 1/5.

4 Experiments

We used Reuters data and synthetic data to examine the behavior of three algorithms: (i) Naive Bayes; (ii) one-pass AdaBoost; and (iii) PickyAdaBoost.
The Reuters data was downloaded from www.daviddlewis.com. We used the ModApte splits into training and test sets. We only used the text of each article, and the text was converted into lower case before analysis. 
We compared the boosting algorithms with multinomial Naive Bayes [7]. We used boosting with confidence-rated base classifiers [8], with a base classifier for each stem of length at most 5; analogously to multinomial Naive Bayes, the confidence of a base classifier was taken to be the number of times its stem appeared in the text. (Schapire and Singer [8, Section 3.2] suggested, when the confidence of base classifiers cannot be bounded a priori, choosing each voting weight α_t to maximize the reduction in potential. We did this, using Newton's method to perform the optimization.) We averaged over 10 random permutations of the features. The results are compiled in Table 1. The one-pass boosting algorithms usually improve on the accuracy of Naive Bayes, while retaining similar simplicity and computational efficiency. PickyAdaBoost appears to usually improve somewhat on AdaBoost. Using a t-test at level 0.01, the win-loss-tie record for PickyAdaBoost(0.1) against multinomial Naive Bayes is 5-1-4.
We also experimented with synthetic data generated according to a distribution D defined as follows: to draw (x, y), begin by picking y ∈ {−1, +1} uniformly at random. For each of the k features x_1, ..., x_k in the diverse γ-good set G, set x_i equal to y with probability 1/2 + γ (independently for each i). The remaining n − k variables are influenced by a hidden variable z which is set independently to be equal to y with probability 4/5. The features x_{k+1}, ..., x_n are each set to be independently equal to z with probability p. So each such x_j (j ≥ k + 1) agrees with y with probability (4/5) · p + (1/5) · (1 − p).
There were 10000 training examples and 10000 test examples. We tried n = 1000 and n = 10000. Results when n = 10000 are summarized in Table 2. 
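The synthetic source just described can be sampled as follows; this is an illustrative sketch of ours, not the authors' experimental code.

```python
import random

def draw_synthetic(n, k, gamma, p, rng):
    """One (x, y) from the synthetic source: the k diverse features agree
    with y independently with probability 1/2 + gamma; the remaining n - k
    features copy a hidden variable z (z = y with probability 4/5), each
    agreeing with z independently with probability p."""
    y = rng.choice([-1, 1])
    z = y if rng.random() < 0.8 else -y
    x = [y if rng.random() < 0.5 + gamma else -y for _ in range(k)]
    x += [z if rng.random() < p else -z for _ in range(n - k)]
    return x, y
```

Each correlated feature agrees with y with probability (4/5)p + (1/5)(1 − p), e.g. 0.74 for p = 0.9; it is this shared dependence on z that misleads Naive Bayes.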
The boosting algorithms predictably perform better than Naive Bayes, because Naive Bayes assigns too much weight to the correlated features. The picky boosting algorithm further ameliorates the effect of this correlation. Results for n = 1000 are omitted due to space constraints: they are qualitatively similar, with all algorithms performing better and the differences between algorithms shrinking somewhat.

                      Error rates                                  Feature counts
Data        NB     OPAB   PickyAdaBoost              NB     OPAB   PickyAdaBoost
                          0.001   0.01    0.1                      0.1    0.01   0.001
earn        0.042  0.023  0.020   0.018   0.027      19288  19288  52     542    2871
acq         0.036  0.094  0.065   0.071   0.153      19288  19288  41     508    3041
money-fx    0.043  0.042  0.041   0.041   0.048      19288  19288  62     576    2288
crude       0.026  0.031  0.027   0.026   0.040      19288  19288  58     697    2865
grain       0.038  0.021  0.023   0.019   0.018      19288  19288  64     650    2622
trade       0.068  0.028  0.028   0.026   0.029      19288  19288  61     641    2579
interest    0.026  0.032  0.029   0.032   0.035      19288  19288  58     501    2002
wheat       0.022  0.014  0.013   0.013   0.017      19288  19288  61     632    2294
ship        0.013  0.018  0.018   0.017   0.016      19288  19288  67     804    2557
corn        0.027  0.014  0.014   0.014   0.013      19288  19288  67     640    2343

Table 1: Experimental results. On the left are error rates on the 3299 test examples for Reuters data sets. On the right are counts of the number of features used in the models. NB is multinomial Naive Bayes, and OPAB is one-pass AdaBoost. 
Results are shown for three PickyAdaBoost thresholds: 0.001, 0.01 and 0.1.

 k     p     γ     NB    OPAB       PickyAdaBoost
                                           0.07   0.1
 20    0.85  0.24  0.2   0.11   0.16   0.04   0.04
 20    0.9   0.24  0.2   0.09   0.03   0.03   0.03
 20    0.95  0.24  0.21  0.06   0.03   0.02   0.02
 50    0.7   0.15  0.2   0.13   0.02   0.06   0.04
 50    0.75  0.15  0.2   0.12   0.09   0.05   0.04
 50    0.8   0.15  0.21  0.11   0.03   0.04   0.03
 100   0.63  0.11  0.2   0.14          0.07   0.05
 100   0.68  0.11  0.2   0.13          0.06   0.05
 100   0.73  0.11  0.2   0.1    0.03   0.05   0.04

Table 2: Test-set error rate for synthetic data. Each value is an average over 100 independent runs (random permutations of features). Where a result is omitted, the corresponding picky algorithm did not pick any base classifiers.

References

[1] S. Dasgupta and P. M. Long. Boosting with diverse base classifiers. In COLT, 2003.
[2] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
[3] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.
[4] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148-156, 1996.
[5] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[6] N. Littlestone. Redundant noisy attributes, attribute errors, and linear-threshold learning using Winnow. In COLT, pages 147-156, 1991.
[7] A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
[8] R. E. Schapire and Y. Singer. 
Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297-336, 1999.
", "award": [], "sourceid": 560, "authors": [{"given_name": "Zafer", "family_name": "Barutcuoglu", "institution": null}, {"given_name": "Phil", "family_name": "Long", "institution": null}, {"given_name": "Rocco", "family_name": "Servedio", "institution": null}]}