{"title": "Adaptive Martingale Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 977, "page_last": 984, "abstract": "In recent work Long and Servedio [LS05] presented a \u201cmartingale boosting\u201d algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. [LS05] showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i.e. it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. In this paper we present a variant of the original martingale boosting algorithm and prove that it is adaptive. This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step size depending on the quality of the weak learner at each stage. The new algorithm inherits the desirable properties of the original [LS05] algorithm, such as random classification noise tolerance, and has several other advantages besides adaptiveness: it requires polynomially fewer calls to the weak learner than the original algorithm, and it can be used with confidence-rated weak hypotheses that output real values rather than Boolean predictions.", "full_text": "Adaptive Martingale Boosting\n\nPhilip M. Long\n\nGoogle\n\nplong@google.com\n\nRocco A. Servedio\nColumbia University\n\nrocco@cs.columbia.edu\n\nAbstract\n\nIn recent work Long and Servedio [LS05] presented a \u201cmartingale boosting\u201d algorithm\nthat works by constructing a branching program over weak classi\ufb01ers and\nhas a simple analysis based on elementary properties of random walks. 
[LS05]\nshowed that this martingale booster can tolerate random classi\ufb01cation noise when\nit is run with a noise-tolerant weak learner; however, a drawback of the algorithm\nis that it is not adaptive, i.e. it cannot effectively take advantage of variation in the\nquality of the weak classi\ufb01ers it receives.\nWe present an adaptive variant of the martingale boosting algorithm. This adap-\ntiveness is achieved by modifying the original algorithm so that the random walks\nthat arise in its analysis have different step size depending on the quality of the\nweak learner at each stage. The new algorithm inherits the desirable properties of\nthe original [LS05] algorithm, such as random classi\ufb01cation noise tolerance, and\nhas other advantages besides adaptiveness: it requires polynomially fewer calls to\nthe weak learner than the original algorithm, and it can be used with con\ufb01dence-\nrated weak hypotheses that output real values rather than Boolean predictions.\n\n1 Introduction\n\nBoosting algorithms are ef\ufb01cient procedures that can be used to convert a weak learning algorithm\n(one which outputs a weak hypothesis that performs only slightly better than random guessing for\na binary classi\ufb01cation task) into a strong learning algorithm (one which outputs a high-accuracy\nclassi\ufb01er). A rich theory of boosting has been developed over the past two decades; see [Sch03,\nMR03] for some overviews. Two important issues for boosting algorithms which are relevant to the\ncurrent work are adaptiveness and noise-tolerance; we brie\ufb02y discuss each of these issues before\ndescribing the contributions of this paper.\n\nAdaptiveness. \u201cAdaptiveness\u201d refers to the ability of boosting algorithms to adjust to different\naccuracy levels in the sequence of weak hypotheses that they are given. 
The \ufb01rst generation of\nboosting algorithms [Sch90, Fre95] required the user to input an \u201cadvantage\u201d parameter \u03b3 such that\nthe weak learner was guaranteed to always output a weak hypothesis with accuracy at least 1/2 + \u03b3.\nGiven an initial setting of \u03b3, even if the sequence of weak classi\ufb01ers generated by the runs of the\nweak learner included some hypotheses with accuracy (perhaps signi\ufb01cantly) better than 1/2+\u03b3, the\nearly boosting algorithms were unable to capitalize on this extra accuracy; thus, these early boosters\nwere not adaptive. Adaptiveness is an important property since it is often the case that the advantage\nof successive weak classi\ufb01ers grows smaller and smaller as boosting proceeds.\n\nA major step forward was the development of the AdaBoost algorithm [FS97]. AdaBoost does\nnot require a lower bound \u03b3 on the minimum advantage, and the error rate of its \ufb01nal hypothesis\ndepends favorably on the different advantages of the different weak classi\ufb01ers in the sequence. More\nprecisely, if the accuracy of the t-th weak classi\ufb01er is 1/2 + \u03b3t, then the AdaBoost \ufb01nal hypothesis\nhas error at most $\\prod_{t=0}^{T-1} \\sqrt{1 - 4\\gamma_t^2}$. This error rate is usually upper bounded (see [FS97]) by\n\n$$\\exp\\left(-2 \\sum_{t=0}^{T-1} \\gamma_t^2\\right) \\qquad (1)$$\n\nand indeed (1) is a good approximation if no \u03b3t is too large.\n\nNoise tolerance. One drawback of many standard boosting techniques, including AdaBoost, is that\nthey can perform poorly when run on noisy data [FS96, MO97, Die00, LS08]. Motivated in part by\nthis observation, in recent years boosting algorithms that work by constructing branching programs\nover the weak classi\ufb01ers (note that this is in contrast with AdaBoost, which constructs a single\nweighted sum of weak classi\ufb01ers) have been developed and shown to enjoy some provable noise\ntolerance. 
In particular, the algorithms of [KS05, LS05] have been shown to boost to optimally high\naccuracy in the presence of random classi\ufb01cation noise when run with a random classi\ufb01cation noise\ntolerant weak learner. (Recall that \u201crandom classi\ufb01cation noise at rate \u03b7\u201d means that the true binary\nlabel of each example is independently \ufb02ipped with probability \u03b7. This is a very well studied noise\nmodel, see e.g. [AL88, Kea98, AD98, BKW03, KS05, RDM06] and many other references.)\n\nWhile the noise tolerance of the boosters [KS05, LS05] is an attractive feature, a drawback of these\nalgorithms is that they do not enjoy the adaptiveness of algorithms like AdaBoost. The MMM\nbooster of [KS05] is not known to have any adaptiveness at all, and the \u201cmartingale boosting\u201d\nalgorithm of [LS05] only has the following limited type of adaptiveness. The algorithm works in\nstages t = 0, 1, . . . where in the t-th stage a collection of t + 1 weak hypotheses is obtained; let \u03b3t\ndenote the minimum advantage of these t + 1 hypotheses obtained in stage t. [LS05] shows that the\n\ufb01nal hypothesis constructed by martingale boosting has error at most\n\n$$\\exp\\left(-\\frac{\\left(\\sum_{t=0}^{T-1} \\gamma_t\\right)^2}{2T}\\right). \\qquad (2)$$\n\n(2) is easily seen to always be a worse bound than (1), and the difference can be substantial.\nConsider, for example, a sequence of weak classi\ufb01ers in which the advantages decrease as\n\u03b3t = 1/\u221a(t + 1) (this is in line with the oft-occurring situation, mentioned above, that advantages\ngrow smaller and smaller as boosting progresses). For any \u03b5 > 0 we can bound (1) from above by \u03b5\nby taking T \u2248 1/\u221a\u03b5, whereas for this sequence of advantages the error bound (2) is never less than\n0.5 (which is trivial), and in fact (2) approaches 1 as t \u2192 \u221e.\nOur contributions: adaptive noise-tolerant boosting. 
We give the \ufb01rst boosting algorithm that\nis both adaptive enough to satisfy a bound of $\\exp\\left(-\\Omega\\left(\\sum_{t=0}^{T-1} \\gamma_t^2\\right)\\right)$ and provably tolerant to\nrandom classi\ufb01cation noise. We do this by modifying the martingale boosting algorithm of [LS05]\nto make it adaptive; the modi\ufb01cation inherits the noise-tolerance of the original [LS05] algorithm. In\naddition to its adaptiveness, the new algorithm also improves on [LS05] by constructing a branching\nprogram with polynomially fewer nodes than the original martingale boosting algorithm (thus it\nrequires fewer calls to the weak learner), and it can be used directly with weak learners that generate\ncon\ufb01dence-rated weak hypotheses (the original martingale boosting algorithm required the weak\nhypotheses to be Boolean-valued).\n\nOur approach. We brie\ufb02y sketch the new idea that lets us achieve adaptiveness. Recall that the\noriginal martingale booster of Long and Servedio formulates the boosting process as a random walk;\nintuitively, as a random example progresses down through the levels of the branching program\nconstructed by the [LS05] booster, it can be viewed as performing a simple random walk with step size 1\non the real line, where the walk is biased in the direction (positive or negative) corresponding to the\ncorrect classi\ufb01cation of the example. (The quantity tracked during the random walk is the difference\nbetween the number of positive predictions and the number of negative predictions made by base\nclassi\ufb01ers encountered in the branching program up to a given point in time.) This means that after\nenough stages, a random positive example will end up to the right of the origin with high probability,\nand contrariwise for a random negative example. 
Thus a high-accuracy classi\ufb01er is obtained simply\nby labelling each example according to the sign (+ or \u2212) of its \ufb01nal location on the real line.\nThe new algorithm extends this approach in a simple and intuitive way, by having examples perform\na random walk with variable step size:\nif the weak classi\ufb01er at a given internal node has large\n\n\fadvantage, then the new algorithm makes the examples that reach that node take a large step in\nthe random walk. This is a natural way to exploit the fact that examples reaching such a large-\nadvantage node usually tend to walk in the right direction. The idea extends straightforwardly to\nlet us handle con\ufb01dence-rated weak hypotheses (see [SS99]) whose predictions are real values in\n[\u22121, 1] as opposed to Boolean values from {\u22121, 1}. This is done simply by scaling the step size for a\ngiven example x from a given node according to the numerical value h(x) that the con\ufb01dence-rated\nweak hypothesis h at that node assigns to example x.\nWhile using different step sizes at different levels is a natural idea, it introduces some complications.\nIn particular, if a branching program is constructed naively based on this approach, it is possible for\nthe number of nodes to increase exponentially with the depth. To avoid this, we use a randomized\nrounding scheme together with the variable-step random walk to ensure that the number of nodes\nin the branching program grows polynomially rather than exponentially in the number of stages\nin the random walk (i.e.\nthe depth of the branching program). In fact, we actually improve on\nthe ef\ufb01ciency of the original martingale boosting algorithm of [LS05] by a polynomial factor, by\ntruncating \u201cextreme\u201d nodes in the branching program that are \u201cfar\u201d from the origin. 
Our analysis\nshows that this truncation has only a small effect on the accuracy of the \ufb01nal classi\ufb01er, while giving\na signi\ufb01cant asymptotic savings in the size of the \ufb01nal branching program (roughly $1/\\gamma^3$ nodes as\nopposed to the $1/\\gamma^4$ nodes of [KS05, LS05]).\n\n2 Preliminaries\n\nWe make the following assumptions and notational conventions throughout the paper. There is an\ninitial distribution D over a domain of examples X. There is a target function c : X \u2192 {\u22121, 1} that\nwe are trying to learn. Given the target function c and the distribution D, we write D+ to denote\nthe distribution D restricted to the positive examples {x \u2208 X : c(x) = 1}. Thus, for any event\nS \u2286 {x \u2208 X : c(x) = 1} we have $\\Pr_{D^+}[x \\in S] = \\Pr_D[x \\in S] / \\Pr_D[c(x) = 1]$. Similarly, we\nwrite D\u2212 to denote D restricted to the negative examples {x \u2208 X : c(x) = \u22121}.\nAs usual, our boosting algorithms work by repeatedly passing a distribution D\u2032 derived from D to\na weak learner, which outputs a classi\ufb01er h. The future behavior will be affected by how well h\nperforms on data distributed according to D\u2032. To keep the analysis clean, we will abstract away\nissues of sampling from D\u2032 and estimating the accuracy of the resulting h. These issues are trivial\nif D is uniform over a moderate-sized domain (since all probabilities can be computed exactly), and\notherwise they can be handled via the same standard estimation techniques used in [LS05].\nMartingale boosting. We brie\ufb02y recall some key aspects of the martingale boosting algorithm of\n[LS05] which are shared by our algorithm (and note some differences). Both boosters work by\nconstructing a leveled branching program. 
Each node in the branching program has a location; this\nis a pair (\u03b2, t) where \u03b2 is a real value (a location on the line) and t \u2265 0 is an integer (the level of the\nnode; each level corresponds to a distinct stage of boosting). The initial node, where all examples\nstart, is at (0, 0). In successive stages t = 0, 1, 2, . . . the booster constructs nodes in the branching\nprogram at levels 0, 1, 2, . . . . For a location (\u03b2, t) where the branching program has a node, let\nD\u03b2,t be the distribution D conditioned on reaching the node at (\u03b2, t). We sometimes refer to this\ndistribution D\u03b2,t as the distribution induced by node (\u03b2, t).\nAs boosting proceeds, in stage t, each node (\u03b2, t) at level t is assigned a hypothesis which we\ncall h\u03b2,t. Unlike [LS05] we shall allow con\ufb01dence-rated hypotheses, so each weak hypothesis is a\nmapping from X to [\u22121, 1]. Once the hypothesis h\u03b2,t has been obtained, out-edges are constructed\nfrom (\u03b2, t) to its child nodes at level t + 1. While the original martingale boosting algorithm of\n[LS05] had two child nodes at (\u03b2 \u2212 1, t + 1) and (\u03b2 + 1, t + 1) from each internal node, as we\ndescribe in Section 3 our new algorithm will typically have four child nodes for each node (but may,\nfor a con\ufb01dence-rated base classi\ufb01er, have as many as eight).\nOur algorithm. To fully specify our new boosting algorithm we must describe:\n\n(1) How the weak learner is run at each node (\u03b2, t) to obtain a weak classi\ufb01er. This is straight-\nforward for the basic case of \u201ctwo-sided\u201d weak learners that we describe in Section 3 and\nsomewhat less straightforward in the usual (non-two-sided) weak learner setting. In Sec-\ntion 5.1 we describe how to use a standard weak learner, and how to handle noise \u2013 both\nextensions borrow heavily from earlier work [LS05, KS05].\n\n\f(2) What function is used to label the node (\u03b2, t), i.e. 
how to route subsequent examples that\nreach (\u03b2, t) to one of the child nodes. It turns out that this function is a randomized version\nof the weak classi\ufb01er mentioned in point (1) above.\n\n(3) Where to place the child nodes at level t + 1; this is closely connected with (2) above.\n\nAs in [LS05], once the branching program has been fully constructed down through some level T\nthe \ufb01nal hypothesis it computes is very simple. Given an input example x, the output of the \ufb01nal\nhypothesis on x is sgn(\u03b2) where (\u03b2, T ) is the location in level T to which x is ultimately routed as\nit passes through the branching program.\n\n3 Boosting a two-sided weak learner\n\nIn this section we assume that we have a two-sided weak learner. This is an algorithm which, given\na distribution D, can always obtain hypotheses that have two-sided advantage as de\ufb01ned below:\nDe\ufb01nition 1 A hypothesis h : X \u2192 [\u22121, 1] has two-sided advantage \u03b3 with respect to D if it\nsatis\ufb01es both $\\mathbf{E}_{x \\in D^+}[h(x)] \\ge \\gamma$ and $\\mathbf{E}_{x \\in D^-}[h(x)] \\le -\\gamma$.\n\nAs we explain in Section 5.1 we may apply methods of [LS05] to reduce the typical case, in which\nwe only receive \u201cnormal\u201d weak hypotheses rather than two-sided weak hypotheses, to this case.\n\nThe branching program starts off with a single node at location (0, 0). Assuming the branching\nprogram has been constructed up through level t, we now explain how it is extended in the t-th stage\nup through level t + 1. There are two basic steps in each stage: weak training and branching.\nWeak training. Consider a given node at location (\u03b2, t) in the branching program. As in [LS05] we\nconstruct a weak hypothesis h\u03b2,t simply by running the two-sided weak learner on examples drawn\nfrom D\u03b2,t and letting h\u03b2,t be the hypothesis it generates. 
Let us write \u03b3\u03b2,t to denote\n\n$$\\gamma_{\\beta,t} \\stackrel{\\mathrm{def}}{=} \\min\\left\\{ \\mathbf{E}_{x \\in (D_{\\beta,t})^+}[h_{\\beta,t}(x)], \\mathbf{E}_{x \\in (D_{\\beta,t})^-}[-h_{\\beta,t}(x)] \\right\\}.$$\n\nWe call \u03b3\u03b2,t the advantage at node (\u03b2, t).\nWe do this for all nodes at level t. Now we de\ufb01ne the advantage at level t to be\n\n$$\\gamma_t \\stackrel{\\mathrm{def}}{=} \\min_{\\beta} \\gamma_{\\beta,t}. \\qquad (3)$$\n\nBranching. Intuitively, we would like to use \u03b3t as a scaling factor for the \u201cstep size\u201d of the random\nwalk at level t. Since we are using con\ufb01dence-rated weak hypotheses, it is also natural to have\nthe step that example x takes at a given node be proportional to the value of the con\ufb01dence-rated\nhypothesis at that node on x. The most direct way to do this would be to label the node (\u03b2, t) with\nthe weak classi\ufb01er h\u03b2,t and to route each example x to a node at location (\u03b2 + \u03b3th\u03b2,t(x), t + 1).\nHowever, there are obvious dif\ufb01culties with this approach; for one thing a single node at (\u03b2, t) could\ngive rise to arbitrarily many (in\ufb01nitely many, if |X| = \u221e) nodes at level t+1. Even if the hypotheses\nh\u03b2,t were all guaranteed to be {\u22121, 1}-valued, if we were to construct a branching program in this way\nthen it could be the case that by the T -th stage there are $2^{T-1}$ distinct nodes at level T .\nWe get around this problem by creating nodes at level t + 1 only at integer multiples of \u03b3t/2. Note that\nthis \u201cgranularity\u201d that is used is different at each level, depending on the advantage at each level (we\nshall see in the next section that this is crucial for the analysis). This keeps us from having too many\nnodes in the branching program at level t + 1. 
Of course, we only actually create those nodes in the\nbranching program that have an incoming edge as described below (later we will give an analysis to\nbound the number of such nodes).\nWe simulate the effect of having an edge from (\u03b2, t) to (\u03b2 + \u03b3th\u03b2,t(x), t + 1) by using two edges\nfrom (\u03b2, t) to (i \u00b7 \u03b3t/2, t + 1) and to ((i + 1) \u00b7 \u03b3t/2, t + 1), where i is the unique integer such that\ni\u00b7 \u03b3t/2 \u2264 \u03b2 + \u03b3th\u03b2,t(x) < (i + 1)\u00b7 \u03b3t/2. To simulate routing an example x to (\u03b2 + \u03b3th\u03b2,t(x), t + 1),\nthe branching program routes x randomly along one of these two edges so that the expected location\nat which x ends up is (\u03b2 + \u03b3th\u03b2,t(x), t + 1). More precisely, if \u03b2 + \u03b3th\u03b2,t(x) = (i + \u03c1)\u00b7 \u03b3t/2 where\n0 \u2264 \u03c1 < 1, then the rule used at node (\u03b2, t) to route an example x is \u201cwith probability \u03c1 send x to\n((i + 1) \u00b7 \u03b3t/2, t + 1) and with probability (1 \u2212 \u03c1) send x to (i \u00b7 \u03b3t/2, t + 1).\u201d\n\n\fSince |h\u03b2,t(x)| \u2264 1 for all x by assumption, it is easy to see that at most eight outgoing edges\nare required from each node (\u03b2, t). Thus the branching program that the booster constructs uses\na randomized variant of each weak hypothesis h\u03b2,t to route examples along one of (at most) eight\noutgoing edges.\n\n4 Proof of correctness for boosting a two-sided weak learner\n\nThe following theorem shows that the algorithm described above is an effective adaptive booster for\ntwo-sided weak learners:\n\nTheorem 2 Consider running the above booster for T stages. For t = 0, . . . , T \u2212 1 let the val-\nues \u03b30, . . . , \u03b3T \u22121 > 0 be de\ufb01ned as described above, so each invocation of the two-sided weak\nlearner on distribution D\u03b2,t yields a hypothesis h\u03b2,t that has \u03b3\u03b2,t \u2265 \u03b3t. 
Then the \ufb01nal hypothesis h\nconstructed by the booster satis\ufb01es\n\n$$\\Pr_{x \\in D}[h(x) \\ne c(x)] \\le \\exp\\left(-\\frac{1}{8} \\sum_{t=0}^{T-1} \\gamma_t^2\\right). \\qquad (4)$$\n\nThe algorithm makes at most $M \\le O(1) \\cdot \\sum_{t=0}^{T-1} \\frac{1}{\\gamma_t} \\sum_{j=0}^{t-1} \\gamma_j$ calls to the weak learner (i.e. constructs\na branching program with at most M nodes).\n\nProof: We will show that $\\Pr_{x \\in D^+}[h(x) \\ne 1] \\le \\exp\\left(-\\frac{1}{8} \\sum_{t=0}^{T-1} \\gamma_t^2\\right)$; a completely symmetric\nargument shows a similar bound for negative examples, which gives (4).\nFor t = 1, . . . , T we de\ufb01ne the random variable At as follows: given a draw of x from D+ (the\noriginal distribution D restricted to positive examples), the value of At is \u03b3t\u22121h\u03b2,t\u22121(x), where\n(\u03b2, t \u2212 1) is the location of the node that x reaches at level t \u2212 1 of the branching program. Intuitively\nAt captures the direction and size of the move that we would like x to make during the branching\nstep that brings it to level t.\nWe de\ufb01ne Bt to be the random variable that captures the direction and size of the move that x\nactually makes during the branching step that brings it to level t. More precisely, let i be the integer\nsuch that i \u00b7 (\u03b3t\u22121/2) \u2264 \u03b2 + \u03b3t\u22121h\u03b2,t\u22121(x) < (i + 1) \u00b7 (\u03b3t\u22121/2), and let \u03c1 \u2208 [0, 1) be such that\n\u03b2 + \u03b3t\u22121h\u03b2,t\u22121(x) = (i + \u03c1) \u00b7 (\u03b3t\u22121/2). Then\n\n$$B_t = \\begin{cases} (i + 1) \\cdot (\\gamma_{t-1}/2) - \\beta & \\text{with probability } \\rho, \\text{ and} \\\\\\\\ i \\cdot (\\gamma_{t-1}/2) - \\beta & \\text{with probability } 1 - \\rho. \\end{cases}$$\n\nWe have that E[Bt] (where the expectation is taken only over the \u03c1-probability in the de\ufb01nition of\nBt) equals (i + \u03c1) \u00b7 (\u03b3t\u22121/2) \u2212 \u03b2 = \u03b3t\u22121h\u03b2,t\u22121(x) = At. Let Xt denote $\\sum_{i=1}^{t} B_i$, so\nthe value of Xt is the actual location on the real line where x ends up at level t.\nFix 1 \u2264 t \u2264 T and let us consider the conditional random variable (Xt|Xt\u22121). Conditioned on\nXt\u22121 taking any particular value (i.e. on x reaching any particular location (\u03b2, t \u2212 1)), we have that\nx is distributed according to (D\u03b2,t\u22121)+, and thus we have\n\n$$\\mathbf{E}[X_t | X_{t-1}] = X_{t-1} + \\mathbf{E}_{x \\in (D_{\\beta,t-1})^+}[\\gamma_{t-1} h_{\\beta,t-1}(x)] \\ge X_{t-1} + \\gamma_{t-1} \\gamma_{\\beta,t-1} \\ge X_{t-1} + \\gamma_{t-1}^2, \\qquad (5)$$\n\nwhere the \ufb01rst inequality follows from the two-sided advantage of h\u03b2,t\u22121.\n\nFor t = 0, . . . , T , de\ufb01ne the random variable Yt as $Y_t = X_t - \\sum_{i=0}^{t-1} \\gamma_i^2$ (so Y0 = X0 = 0). Since\nconditioning on the value of Yt\u22121 is equivalent to conditioning on the value of Xt\u22121, using (5) we\nget\n\n$$\\mathbf{E}[Y_t | Y_{t-1}] = \\mathbf{E}\\left[X_t - \\sum_{i=0}^{t-1} \\gamma_i^2 \\,\\Big|\\, Y_{t-1}\\right] = \\mathbf{E}[X_t | Y_{t-1}] - \\sum_{i=0}^{t-1} \\gamma_i^2 \\ge X_{t-1} - \\sum_{i=0}^{t-2} \\gamma_i^2 = Y_{t-1},$$\n\nso the sequence of random variables Y0, . . . , YT is a sub-martingale.\u00b9 To see that this sub-martingale\nhas bounded differences, note that we have\n\n$$|Y_t - Y_{t-1}| = |X_t - X_{t-1} - \\gamma_{t-1}^2| = |B_t - \\gamma_{t-1}^2|.$$\n\nThe value of Bt is obtained by \ufb01rst moving by \u03b3t\u22121h\u03b2,t\u22121(x), and then rounding to a neighboring\nmultiple of \u03b3t\u22121/2, so |Bt| \u2264 (3/2)\u03b3t\u22121, which implies $|Y_t - Y_{t-1}| \\le (3/2)\\gamma_{t-1} + \\gamma_{t-1}^2 \\le 2\\gamma_{t-1}$.\n\n\u00b9 The more common de\ufb01nition of a sub-martingale requires that E[Yt|Y0, ..., Yt\u22121] \u2265 Yt\u22121, but the weaker\nassumption that E[Yt|Yt\u22121] \u2265 Yt\u22121 suf\ufb01ces for the concentration bounds that we need (see [ASE92, Hay05]).\n\nNow recall Azuma\u2019s inequality for sub-martingales:\n\nLet 0 = Y0, . . . , YT be a sub-martingale which has |Yi \u2212 Yi\u22121| \u2264 ci for each\ni = 1, . . . , T . Then for any \u03bb > 0 we have $\\Pr[Y_T \\le -\\lambda] \\le \\exp\\left(-\\frac{\\lambda^2}{2 \\sum_{i=1}^{T} c_i^2}\\right)$.\n\nWe apply this with each ci = 2\u03b3i\u22121 and $\\lambda = \\sum_{t=0}^{T-1} \\gamma_t^2$. This gives us that the error rate of h on\npositive examples, $\\Pr_{x \\in D^+}[h(x) = -1]$, equals\n\n$$\\Pr[X_T < 0] = \\Pr[Y_T < -\\lambda] \\le \\exp\\left(-\\frac{\\lambda^2}{8 \\sum_{t=0}^{T-1} \\gamma_t^2}\\right) = \\exp\\left(-\\frac{1}{8} \\sum_{t=0}^{T-1} \\gamma_t^2\\right). \\qquad (6)$$\n\nSo we have established (4); it remains to bound the number of nodes constructed in the branching\nprogram. Let us write Mt to denote the number of nodes at level t, so $M = \\sum_{t=0}^{T-1} M_t$.\nThe t-th level of boosting can cause the rightmost (leftmost) node to be at most 2\u03b3t\u22121 distance\nfarther away from the origin than the rightmost (leftmost) node at the (t \u2212 1)-st level. 
This means\nthat at level t, every node is at a position (\u03b2, t) with $|\\beta| \\le 2 \\sum_{j=0}^{t-1} \\gamma_j$. Since nodes are placed at\ninteger multiples of \u03b3t/2, we have that $M = \\sum_{t=0}^{T-1} M_t \\le O(1) \\cdot \\sum_{t=0}^{T-1} \\frac{1}{\\gamma_t} \\sum_{j=0}^{t-1} \\gamma_j$.\n\nRemark. Consider the case in which each advantage \u03b3t is just \u03b3 and we are boosting to accuracy\n\u03b5. As usual taking $T = O(\\log(1/\\epsilon)/\\gamma^2)$ gives an error bound of \u03b5. With these parameters we have\nthat $M \\le O(\\log^2(1/\\epsilon)/\\gamma^4)$, the same asymptotic bound achieved in [LS05]. In the next section we\ndescribe a modi\ufb01cation of the algorithm that improves this bound by essentially a factor of 1/\u03b3.\n\n4.1 Improving ef\ufb01ciency by freezing extreme nodes\n\nHere we describe a variant of the algorithm from the previous section that constructs a branching\nprogram with fewer nodes.\n\nThe algorithm requires an input parameter \u03b5 which is an upper bound on the desired \ufb01nal error of the\naggregate classi\ufb01er. For t \u2265 1, after the execution of step t \u2212 1 of boosting, when all nodes at level\nt have been created, each node (\u03b1, t) with $|\\alpha| > \\sqrt{\\left(8 \\sum_{s=0}^{t-1} \\gamma_s^2\\right)\\left(2 \\ln t + \\ln \\frac{4}{\\epsilon}\\right)}$ is \u201cfrozen.\u201d The\nalgorithm commits to classifying any test examples routed to any such nodes according to sgn(\u03b1),\nand these nodes are not used to generate weak hypotheses during the next round of training.\n\nWe have the following theorem about the performance of this algorithm:\n\nTheorem 3 Consider running the modi\ufb01ed booster for T stages. For t = 0, . . . , T \u2212 1 let the\nvalues \u03b30, . . . , \u03b3T \u22121 > 0 be de\ufb01ned as described above, so each invocation of the weak learner on\ndistribution D\u03b2,t yields a hypothesis h\u03b2,t that has \u03b3\u03b2,t \u2265 \u03b3t. Then the \ufb01nal output hypothesis h of\nthe booster satis\ufb01es\n\n$$\\Pr_{x \\in D}[h(x) \\ne c(x)] \\le \\frac{\\epsilon}{2} + \\exp\\left(-\\frac{1}{8} \\sum_{t=0}^{T-1} \\gamma_t^2\\right). \\qquad (7)$$\n\nThe algorithm makes $O\\left(\\sqrt{\\left(\\sum_{t=0}^{T-1} \\gamma_t^2\\right)\\left(\\ln T + \\ln \\frac{1}{\\epsilon}\\right)} \\cdot \\sum_{t=0}^{T-1} \\frac{1}{\\gamma_t}\\right)$ calls to the weak learner.\n\nProof: As in the previous proof it suf\ufb01ces to bound $\\Pr_{x \\in D^+}[h(x) \\ne 1]$. The proof of Theorem 2\ngives us that if we never did any freezing, then $\\Pr_{x \\in D^+}[h(x) \\ne 1] \\le \\exp\\left(-\\frac{1}{8} \\sum_{t=0}^{T-1} \\gamma_t^2\\right)$. Now\nlet us analyze the effect of freezing in a given stage t < T . Let At be the distance from the origin\npast which examples are frozen in round t; i.e. $A_t = \\sqrt{\\left(8 \\sum_{s=0}^{t-1} \\gamma_s^2\\right)\\left(2 \\ln t + \\ln \\frac{4}{\\epsilon}\\right)}$. Nearly exactly\nthe same analysis as proves (6) can be used here: for a positive example x to be incorrectly frozen\nin round t, it must be the case that Xt < \u2212At, or equivalently $Y_t < -A_t - \\sum_{i=0}^{t-1} \\gamma_i^2$. Thus our choice\nof At gives us that $\\Pr_{x \\in D^+}[x \\text{ incorrectly frozen in round } t]$ is at most\n\n$$\\Pr\\left[Y_t \\le -A_t - \\sum_{i=0}^{t-1} \\gamma_i^2\\right] \\le \\Pr[Y_t \\le -A_t] \\le \\frac{\\epsilon}{4t^2},$$\n\nso consequently we have $\\Pr_{x \\in D^+}[x \\text{ ever incorrectly frozen}] \\le \\frac{\\epsilon}{2}$. From here we may argue as in\n[LS05]: we have that $\\Pr_{x \\in D^+}[h(x) \\ne 1]$ equals\n\n$$\\Pr_{x \\in D^+}[h(x) \\ne 1 \\text{ and } x \\text{ is frozen}] + \\Pr_{x \\in D^+}[h(x) \\ne 1 \\text{ and } x \\text{ is not frozen}] \\le \\frac{\\epsilon}{2} + \\exp\\left(-\\frac{1}{8} \\sum_{t=0}^{T-1} \\gamma_t^2\\right),$$\n\nwhich gives (7). The bound on the number of calls to the weak learner follows from the\nfact that there are O(At/\u03b3t) such calls in each stage of boosting, and the fact that $A_t \\le \\sqrt{\\left(8 \\sum_{s=0}^{T-1} \\gamma_s^2\\right)\\left(2 \\ln T + \\ln \\frac{4}{\\epsilon}\\right)}$ for all t.\n\nIt is easy to check that if \u03b3t = \u03b3 for all t, taking $T = O(\\log(1/\\epsilon)/\\gamma^2)$ the algorithm in this section\nwill construct an \u03b5-accurate hypothesis that is an $O(\\log^2(1/\\epsilon)/\\gamma^3)$-node branching program.\n\n5 Extensions\n\n5.1 Standard weak learners\n\nIn Sections 3 and 4, we assumed that the boosting algorithm had access to a two-sided weak learner,\nwhich is more accurate than random guessing on both the positive and the negative examples\nseparately. To make use of a standard weak learner, which is merely more accurate than random guessing\non average, we can borrow ideas from [LS05].\n\nThe idea is to force a standard weak learner to provide a hypothesis with two-sided accuracy by (a)\nbalancing the distribution so that positive and negative examples are accorded equal importance, and (b)\nbalancing the predictions of the output of the weak learner so that it doesn\u2019t specialize on one kind\nof example.\n\nDe\ufb01nition 4 Given a probability distribution D over examples, let $\\widehat{D}$ be the distribution obtained\nby rescaling the positive and negative examples so that they have equal weight: i.e., let $\\widehat{D}[S] = \\frac{1}{2} D^+[S] + \\frac{1}{2} D^-[S]$.\n\nDe\ufb01nition 5 Given a con\ufb01dence-rated classi\ufb01er h : X \u2192 [\u22121, 1] and a probability distribution D\nover X, let the balanced variant of h with respect to D be the function $\\hat{h}$ : X \u2192 [\u22121, 1] de\ufb01ned as\nfollows: (a) if $\\mathbf{E}_{x \\in D}[h(x)] \\ge 0$, then, for all x \u2208 X, $\\hat{h}(x) = \\frac{h(x) + 1}{\\mathbf{E}_{x \\in D}[h(x)] + 1} - 1$; (b) if $\\mathbf{E}_{x \\in D}[h(x)] \\le 0$,\nthen, for all x \u2208 X, $\\hat{h}(x) = \\frac{h(x) - 1}{-\\mathbf{E}_{x \\in D}[h(x)] + 1} + 1$.\n\nThe analysis is the natural generalization of Section 5 of [LS05] to con\ufb01dence-rated classi\ufb01ers.\n\nLemma 6 If D is balanced with respect to c, and h is a con\ufb01dence-rated classi\ufb01er such that\n$\\mathbf{E}_{x \\in D}[h(x) c(x)] \\ge \\gamma$, then $\\mathbf{E}_{x \\in D}[\\hat{h}(x) c(x)] \\ge \\gamma/2$.\n\nProof: Assume without loss of generality that $\\mathbf{E}_{x \\in D}[h(x)] \\ge 0$ (the other case can be handled\nsymmetrically). By linearity of expectation\n\n$$\\mathbf{E}_{x \\in D}[\\hat{h}(x) c(x)] = \\frac{\\mathbf{E}_{x \\in D}[h(x) c(x)]}{\\mathbf{E}_{x \\in D}[h(x)] + 1} + \\mathbf{E}_{x \\in D}[c(x)] \\left(\\frac{1}{\\mathbf{E}_{x \\in D}[h(x)] + 1} - 1\\right).$$\n\nSince D is balanced we have $\\mathbf{E}_{x \\in D}[c(x)] = 0$, and hence $\\mathbf{E}_{x \\in D}[\\hat{h}(x) c(x)] = \\frac{\\mathbf{E}_{x \\in D}[h(x) c(x)]}{\\mathbf{E}_{x \\in D}[h(x)] + 1}$, so the\nlemma follows from the fact that $\\mathbf{E}_{x \\in D}[h(x)] \\le 1$.\n\nWe will use a standard weak learner to simulate a two-sided weak learner as follows. Given a\ndistribution D, the two-sided weak learner will pass $\\widehat{D}$ to the standard weak learner, take its output\ng, and return $h = \\hat{g}$. Our next lemma analyzes this transformation.\n\nLemma 7 If $\\mathbf{E}_{x \\in \\widehat{D}}[g(x) c(x)] \\ge \\gamma$, then $\\mathbf{E}_{x \\in D^+}[h(x)] \\ge \\gamma/2$ and $\\mathbf{E}_{x \\in D^-}[-h(x)] \\ge \\gamma/2$.\n\nProof: Lemma 6 implies that $\\mathbf{E}_{x \\in \\widehat{D}}[h(x) c(x)] \\ge \\gamma/2$. Expanding the de\ufb01nition of $\\widehat{D}$, we have\n\n$$\\mathbf{E}_{x \\in D^+}[h(x)] - \\mathbf{E}_{x \\in D^-}[h(x)] \\ge \\gamma. \\qquad (8)$$\n\nSince h balanced g with respect to $\\widehat{D}$ and c, we have $\\mathbf{E}_{x \\in \\widehat{D}}[h(x)] = 0$. Once again expanding\nthe de\ufb01nition of $\\widehat{D}$, we get that $\\mathbf{E}_{x \\in D^+}[h(x)] + \\mathbf{E}_{x \\in D^-}[h(x)] = 0$, which implies $\\mathbf{E}_{x \\in D^-}[h(x)] = -\\mathbf{E}_{x \\in D^+}[h(x)]$\nand $\\mathbf{E}_{x \\in D^+}[h(x)] = -\\mathbf{E}_{x \\in D^-}[h(x)]$. Substituting each of the RHS for its respective\nLHS in (8) completes the proof.\n\nLemma 7 is easily seen to imply counterparts of Theorems 2 and 3 in which the requirement of a\ntwo-sided weak learner is weakened to require only standard weak learning, but each \u03b3t is replaced\nwith \u03b3t/2.\n\n5.2 Tolerating random classi\ufb01cation noise\n\nAs in [LS05], noise tolerance is facilitated by the fact that the path through the network is not\naffected by altering the label of an example. On the other hand, balancing the distribution before\npassing it to the weak learner, which was needed to use a standard weak learner, may disturb the\nindependence between the event that an example is noisy, and the random draw of x. This can be\nrepaired exactly as in [KS05, LS05]; because of space constraints we omit the details.\n\nReferences\n\n[AD98] J. Aslam and S. Decatur. Speci\ufb01cation and simulation of statistical query algorithms for ef\ufb01ciency and noise tolerance. J. Comput & Syst. Sci., 56:191\u2013208, 1998.\n[AL88] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343\u2013370, 1988.\n[ASE92] N. Alon, J. Spencer, and P. Erdos. The Probabilistic Method (1st ed.). Wiley-Interscience, New York, 1992.\n[BKW03] A. Blum, A. Kalai, and H. Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model. J. ACM, 50(4):506\u2013519, 2003.\n[Die00] T.G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):139\u2013158, 2000.\n[Fre95] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256\u2013285, 1995.\n[FS96] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148\u2013156, 1996.\n[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. JCSS, 55(1):119\u2013139, 1997.\n[Hay05] T. P. Hayes. A large-deviation inequality for vector-valued martingales. 2005.\n[Kea98] M. Kearns. Ef\ufb01cient noise-tolerant learning from statistical queries. JACM, 45(6):983\u20131006, 1998.\n[KS05] A. Kalai and R. Servedio. Boosting in the presence of noise. JCSS, 71(3):266\u2013290, 2005.\n[LS05] P. Long and R. Servedio. Martingale boosting. In Proc. 18th Annual COLT, pages 79\u201394, 2005.\n[LS08] P. Long and R. Servedio. Random classi\ufb01cation noise defeats all convex potential boosters. In ICML, 2008.\n[MO97] R. Maclin and D. Opitz. An empirical evaluation of bagging and boosting. In AAAI/IAAI, pages 546\u2013551, 1997.\n[MR03] R. Meir and G. R\u00e4tsch. An introduction to boosting and leveraging. In LNAI Advanced Lectures on Machine Learning, pages 118\u2013183, 2003.\n[RDM06] L. Ralaivola, F. Denis, and C. Magnan. CN=CNNN. In ICML, pages 265\u2013272, 2006.\n[Sch90] R. Schapire. The strength of weak learnability. Machine Learning, 5(2):197\u2013227, 1990.\n[Sch03] R. Schapire. The boosting approach to machine learning: An overview. Springer, 2003.\n[SS99] R. Schapire and Y. Singer. Improved boosting algorithms using con\ufb01dence-rated predictions. Machine Learning, 37:297\u2013336, 1999.\n", "award": [], "sourceid": 101, "authors": [{"given_name": "Phil", "family_name": "Long", "institution": null}, {"given_name": "Rocco", "family_name": "Servedio", "institution": null}]}