{"title": "Boosting the Area under the ROC Curve", "book": "Advances in Neural Information Processing Systems", "page_first": 945, "page_last": 952, "abstract": null, "full_text": "Boosting the Area Under the ROC Curve\n\nPhilip M. Long\n\nplong@google.com\n\nRocco A. Servedio\n\nrocco@cs.columbia.edu\n\nAbstract\n\nWe show that any weak ranker that can achieve an area under the ROC curve\nslightly better than 1/2 (which can be achieved by random guessing) can be ef\ufb01-\nciently boosted to achieve an area under the ROC curve arbitrarily close to 1. We\nfurther show that this boosting can be performed even in the presence of indepen-\ndent misclassi\ufb01cation noise, given access to a noise-tolerant weak ranker.\n\n1 Introduction\n\nBackground. Machine learning is often used to identify members of a given class from a list of\ncandidates. This can be formulated as a ranking problem, where the algorithm takes a input a list of\nexamples of members and non-members of the class, and outputs a function that can be used to rank\ncandidates. The goal is to have the top of the list enriched for members of the class of interest.\n\nROC curves [12, 3] are often used to evaluate the quality of a ranking function. A point on an ROC\ncurve is obtained by cutting off the ranked list, and checking how many items above the cutoff are\nmembers of the target class (\u201ctrue positives\u201d), and how many are not (\u201cfalse positives\u201d).\n\nThe AUC [1, 10, 3] (area under the ROC curve) is often used as a summary statistic. It is obtained\nby rescaling the axes so the true positives and false positives vary between 0 and 1, and, as the name\nimplies, examining the area under the resulting curve.\n\nThe AUC measures the ability of a ranker to identify regions in feature space that are unusually\ndensely populated with members of a given class. 
A ranker can succeed according to this criterion even if positive examples are less dense than negative examples everywhere, but, in order to succeed, it must identify where the positive examples tend to be. This is in contrast with classification, where, if Pr[y = 1|x] is less than 1/2 everywhere, just predicting y = −1 everywhere would suffice.

Our Results. It is not hard to see that an AUC of 1/2 can be achieved by random guessing (see [3]), thus it is natural to define a "weak ranker" to be an algorithm that can achieve AUC slightly above 1/2. We show that any weak ranker can be boosted to a strong ranker that achieves AUC arbitrarily close to the best possible value of 1.

We also consider the standard independent classification noise model, in which the label of each example is flipped with probability η. We show that in this setting, given a noise-tolerant weak ranker (that achieves nontrivial AUC in the presence of noisy data as described above), we can boost to a strong ranker that achieves AUC at least 1 − ε, for any η < 1/2 and any ε > 0.

Related work. Freund, Iyer, Schapire and Singer [4] introduced RankBoost, which performs ranking with more fine-grained control over preferences between pairs of items than we consider here. They performed an analysis that implies a bound on the AUC of the boosted ranking function in terms of a different measure of the quality of weak rankers. Cortes and Mohri [2] theoretically analyzed the "typical" relationship between the error rate of a classifier based on thresholding a scoring function and the AUC obtained through the scoring function; they also pointed out the close relationship between the loss function optimized by RankBoost and the AUC.
Rudin, Cortes, Mohri, and Schapire [11] showed that, when the two classes are equally likely, the loss function optimized by AdaBoost coincides with the loss function of RankBoost. Noise-tolerant boosting has previously been studied for classification. Kalai and Servedio [7] showed that, if data is corrupted with noise at a rate η, it is possible to boost the accuracy of any noise-tolerant weak learner arbitrarily close to 1 − η, and they showed that it is impossible to boost beyond 1 − η. In contrast, we show that, in the presence of noise at a rate arbitrarily close to 1/2, the AUC can be boosted arbitrarily close to 1. Our noise-tolerant boosting algorithm uses as a subroutine the "martingale booster" for classification of Long and Servedio [9].

Methods. The key observation is that a weak ranker can be used to find a "two-sided" weak classifier (Lemma 4), which achieves accuracy slightly better than random guessing on both positive and negative examples. Two-sided weak classifiers can be boosted to obtain accuracy arbitrarily close to 1, again on both the positive examples and the negative examples; a proof of this is implicit in the analysis of [9]. Such a two-sided strong classifier is easily seen to yield AUC close to 1.

Why is it possible to boost the AUC past the noise rate, when this is provably not possible for classification? Known approaches to noise-tolerant boosting [7, 9] force the weak learner to provide a two-sided weak hypothesis by balancing the constructed distributions so that both classes are equally likely. However, this balancing skews the distributions so that it is no longer the case that the event that an example is corrupted with noise is independent of the instance; randomization was used to patch this up in [7, 9], and the necessary slack was only available if the desired accuracy was coarser than the noise rate.
(We note that the lower bound from [7] is proved using a construction in which the class probability of positive examples is less than the noise rate; the essence of that proof is to show that in that situation it is impossible to balance the distribution given access to noisy examples.) In contrast, having a weak ranker provides enough leverage to yield a two-sided weak classifier without needing any rebalancing.

Outline. Section 2 gives some definitions. In Section 3, we analyze boosting the AUC when there is no noise, in an abstract model where the weak learner is given a distribution and returns a weak ranker, and sampling issues are abstracted away. In Section 4, we consider boosting in the presence of noise in a similarly abstract model. We address sampling issues in Section 5.

2 Preliminaries

Rankings and AUC. Throughout this work we let X be a domain, c : X → {−1, 1} be a classifier, and D be a probability distribution over labeled examples (x, c(x)). We say that D is nontrivial (for c) if D assigns nonzero probability to both positive and negative examples. We write D+ to denote the marginal distribution over positive examples and D− to denote the marginal distribution over negative examples, so D is a mixture of the distributions D+ and D−.

As has been previously pointed out, we may view any function h : X → R as a ranking of X. Note that if h(x1) = h(x2) then the ranking does not order x1 relative to x2. Given a ranking function h : X → R, for each value θ ∈ R there is a point (αθ, βθ) on the ROC curve of h, where αθ is the false positive rate and βθ is the true positive rate of the classifier obtained by thresholding h at θ:

    αθ = D−[h(x) ≥ θ]  and  βθ = D+[h(x) ≥ θ].

Every ROC curve contains the points (0, 0) and (1, 1), corresponding to θ = ∞ and −∞ respectively.

Given h : X → R and D, the AUC can be defined as

    AUC(h; D) = Pr_{u∈D+, v∈D−}[h(u) > h(v)] + (1/2) Pr_{u∈D+, v∈D−}[h(u) = h(v)].

It is well known (see e.g. [2, 6]) that the AUC as defined above is equal to the area under the ROC curve for h.

Weak Rankers. Fix any distribution D. It is easy to see that any constant function h achieves AUC(h; D) = 1/2, and also that for X finite and π a random permutation of X, the expected AUC of h(π(·)) is 1/2 for any function h. This motivates the following definition:

Definition 1. A weak ranker with advantage γ is an algorithm that, given any nontrivial distribution D, returns a function h : X → R that has AUC(h; D) ≥ 1/2 + γ.

In the rest of the paper we show how boosting algorithms originally designed for classification can be adapted to convert weak rankers into "strong" rankers (that achieve AUC at least 1 − ε) in a range of different settings.

3 From weak to strong AUC

The main result of this section is a simple proof that the AUC can be boosted. We achieve this in a relatively straightforward way by using the standard AdaBoost algorithm for boosting classifiers.

As in previous work [9], to keep the focus on the main ideas we will use an abstract model in which the booster successively passes distributions D1, D2, ... to a weak ranker, which returns ranking functions h1, h2, .... When the original distribution D is uniform over a training set, as in the usual analysis of AdaBoost, this is easy to do.
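As a concrete reference point, the pairwise definition of the AUC from Section 2 can be evaluated directly on a finite sample. The sketch below is illustrative only (the function and variable names are our own, and the brute-force O(n+ · n−) comparison is a choice of convenience, not part of the algorithms analyzed here):

```python
def auc(scores_pos, scores_neg):
    """Pairwise AUC: Pr[h(u) > h(v)] + (1/2) Pr[h(u) = h(v)]
    for u drawn from the positives and v from the negatives."""
    total = 0.0
    for u in scores_pos:          # score h(u) of each positive example
        for v in scores_neg:      # score h(v) of each negative example
            if u > v:
                total += 1.0
            elif u == v:
                total += 0.5      # ties count half, as in the definition
    return total / (len(scores_pos) * len(scores_neg))
```

Note that a constant scorer makes every pair a tie, so this returns exactly 1/2, matching the observation above about constant functions.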
In this model we prove the following:

Theorem 2. There is an algorithm AUCBoost that, given access to a weak ranker with advantage γ as an oracle, for any nontrivial distribution D, outputs a ranking function with AUC at least 1 − ε. The AUCBoost algorithm makes T = O(log(1/ε)/γ²) calls to the weak ranker. If D has finite support of size m, AUCBoost takes O(mT log m) time.

As can be seen from the observation that it does not depend on the relative frequency of positive and negative examples, the AUC requires a learner to perform well on both positive and negative examples. When such a requirement is imposed on a base classifier, it has been called two-sided weak learning. The key to boosting the AUC is the observation (Lemma 4 below) that a weak ranker can be used to generate a two-sided weak learner.

Definition 3. A γ two-sided weak learner is an algorithm that, given a nontrivial distribution D, outputs a hypothesis h that satisfies both Pr_{x∈D+}[h(x) = 1] ≥ 1/2 + γ and Pr_{x∈D−}[h(x) = −1] ≥ 1/2 + γ. We say that such an h has two-sided advantage γ with respect to D.

Lemma 4. Let A be a weak ranking algorithm with advantage γ. Then there is a γ/4 two-sided weak learner A′ based on A that always returns classifiers with equal error rates on positive and negative examples.

Proof: Algorithm A′ first runs A to get a real-valued ranking function h : X → R. Consider the ROC curve corresponding to h. Since the AUC is at least 1/2 + γ, there must be some point (u, v) on the curve such that v ≥ u + γ. Recall that, by the definition of the ROC curve, this means that there is a threshold θ such that D+[h(x) ≥ θ] ≥ D−[h(x) ≥ θ] + γ.
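On a finite sample, such a threshold can be found by scanning the sorted scores, in the spirit of the O(m log m) implementation mentioned in the proof of Theorem 2. The sketch below (names our own) returns the cutoff maximizing the empirical gap between true positive rate and false positive rate, which is equivalent to minimizing the sum of the class-conditional error rates:

```python
def best_threshold(scores_pos, scores_neg):
    """Return (theta, gap) where theta maximizes
    D+[h(x) >= theta] - D-[h(x) >= theta] over the sample,
    i.e. minimizes the error-rate sum p+ + p-."""
    candidates = sorted(set(scores_pos) | set(scores_neg))
    best_theta, best_gap = None, float("-inf")
    for theta in candidates:
        tpr = sum(s >= theta for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= theta for s in scores_neg) / len(scores_neg)
        if tpr - fpr > best_gap:
            best_gap, best_theta = tpr - fpr, theta
    return best_theta, best_gap
```

If the empirical AUC is at least 1/2 + γ, the gap returned is at least γ, witnessing the point (u, v) with v ≥ u + γ used in the proof.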
Thus, for the classifier obtained by thresholding h at θ, the class-conditional error rates p+ := D+[h(x) < θ] and p− := D−[h(x) ≥ θ] satisfy p+ + p− ≤ 1 − γ. This in turn means that either p+ ≤ 1/2 − γ/2 or p− ≤ 1/2 − γ/2.

Suppose that p− ≤ p+, so that p− ≤ 1/2 − γ/2 (the other case can be handled symmetrically). Consider the randomized classifier g that behaves as follows: given input x, (a) if h(x) < θ, it flips a biased coin, and with probability ζ ≥ 0 predicts 1, and with probability 1 − ζ predicts −1, and (b) if h(x) ≥ θ, it predicts 1. Let g(x, r) be the output of g on input x with randomization r, and let ε− := Pr_{x∈D−, r}[g(x, r) = 1] and ε+ := Pr_{x∈D+, r}[g(x, r) = −1]. We have ε+ = (1 − ζ)p+ and ε− = p− + ζ(1 − p−). Let us choose ζ so that ε− = ε+; that is, we choose ζ = (p+ − p−)/(1 + p+ − p−). This yields

    ε− = ε+ = p+ / (1 + p+ − p−).    (1)

For any fixed value of p− the RHS of (1) increases with p+. Recalling that we have p+ + p− ≤ 1 − γ, the maximum of (1) is achieved at p+ = 1 − γ − p−, in which case we have (defining ε := ε− = ε+)

    ε = ((1 − γ) − p−) / (1 + (1 − γ − p−) − p−) = ((1 − γ) − p−) / (2 − γ − 2p−).

The RHS of this expression is nonincreasing in p−, and therefore is maximized at p− = 0, when it takes the value (1 − γ)/(2 − γ) = 1/2 − γ/(2(2 − γ)) ≤ 1/2 − γ/4. This completes the proof.

Figure 1 gives an illustration of the proof of the previous lemma; since the y-coordinate of (a) is at least γ more than the x-coordinate and (b) lies closer to (a) than to (1, 1), the y-coordinate of (b) is at least γ/2 more than the x-coordinate, which means that the advantage is at least γ/4.

We will also need the following simple lemma, which shows that a classifier that is good on both the positive and the negative examples, when viewed as a ranking function, achieves a good AUC.

Figure 1: The curved line represents the ROC curve for ranking function h. The lower black dot (a) corresponds to the value θ and is located at (p−, 1 − p+). The straight line connecting (0, 0) and (1, 1), which corresponds to a completely random ranking, is given for reference. The dashed line (covered by the solid line for 0 ≤ x ≤ .16) represents the ROC curve for a ranker h′ which agrees with h on those x for which h(x) ≥ θ but randomly ranks those x for which h(x) < θ. The upper black dot (b) is at the point of intersection between the ROC curve for h′ and the line y = 1 − x; its coordinates are (ε, 1 − ε). The randomized classifier g is equivalent to thresholding h′ at a value θ′ corresponding to this point.

Lemma 5. Let h : X → {−1, 1} and suppose that Pr_{x∈D+}[h(x) = 1] = 1 − ε+ and Pr_{x∈D−}[h(x) = −1] = 1 − ε−. Then AUC(h; D) = 1 − (ε+ + ε−)/2.

Proof: We have

    AUC(h; D) = (1 − ε+)(1 − ε−) + [ε+(1 − ε−) + ε−(1 − ε+)]/2 = 1 − (ε+ + ε−)/2.

Proof of Theorem 2: AUCBoost works by running AdaBoost on (1/2)D+ + (1/2)D−. In round t, AdaBoost passes its reweighted distribution Dt to the weak ranker, and then uses the process of Lemma 4 to convert the resulting weak ranking function to a classifier ht with two-sided advantage γ/4. Since ht has two-sided advantage γ/4, no matter how Dt decomposes into a mixture of D+_t and D−_t, it must be the case that Pr_{(x,y)∈Dt}[ht(x) ≠ y] ≤ 1/2 − γ/4.

The analysis of AdaBoost (see [5]) shows that T = O(log(1/ε)/γ²) rounds are sufficient for H to have error rate at most ε under (1/2)D+ + (1/2)D−. Lemma 5 now gives that the classifier H(x) is a ranking function with AUC at least 1 − ε.

For the final assertion of the theorem, note that at each round, in order to find the value of θ that defines ht, the algorithm needs to minimize the sum of the error rates on the positive and negative examples. This can be done by sorting the examples using the weak ranking function (in O(m log m) time) and processing the examples in the resulting order, keeping running counts of the number of errors of each type.

4 Boosting weak rankers in the presence of misclassification noise

The noise model: independent misclassification noise. The model of independent misclassification noise has been widely studied in computational learning theory.
In this framework there is a noise rate η < 1/2, and each example (positive or negative) drawn from distribution D has its true label c(x) independently flipped with probability η before it is given to the learner. We write Dη to denote the resulting distribution over (noise-corrupted) labeled examples (x, y).

Boosting weak rankers in the presence of independent misclassification noise. We now show how the AUC can be boosted arbitrarily close to 1 even if the data given to the booster is corrupted with independent misclassification noise, using weak rankers that are able to tolerate independent misclassification noise. We note that this is in contrast with known results for boosting the accuracy of binary classifiers in the presence of noise; Kalai and Servedio [7] show that no "black-box" boosting algorithm can be guaranteed to boost the accuracy of an arbitrary noise-tolerant weak learner to accuracy 1 − η in the presence of independent misclassification noise at rate η.

Figure 2: The branching program produced by the boosting algorithm, with nodes vi,t arranged in layers t = 1, ..., T + 1. Each node vi,t is labeled with a weak classifier hi,t; left edges correspond to −1 and right edges to 1. Terminal nodes vj,T+1 with j < T/2 output −1 and those with j ≥ T/2 output 1.

As in the previous section we begin by abstracting away sampling issues and using a model in which the booster passes a distribution to a weak ranker.
Sampling issues will be treated in Section 5.

Definition 6. A noise-tolerant weak ranker with advantage γ is an algorithm with the following property: for any noise rate η < 1/2, given a noisy distribution Dη, the algorithm outputs a ranking function h : X → R such that AUC(h; D) ≥ 1/2 + γ.

Our algorithm for boosting the AUC in the presence of noise uses the Basic MartiBoost algorithm (see Section 4 of [9]). This algorithm boosts any two-sided weak learner to arbitrarily high accuracy and works in a series of rounds. Before round t the space of labeled examples is partitioned into a series of bins B0,t, ..., Bt−1,t. (The original bin B0,1 consists of the entire space.) In the t-th round the algorithm first constructs distributions D0,t, ..., Dt−1,t by conditioning the original distribution D on membership in B0,t, ..., Bt−1,t respectively. It then calls a two-sided weak learner t times using each of D0,t, ..., Dt−1,t, getting weak classifiers h0,t, ..., ht−1,t respectively. Having done this, it creates t + 1 bins for the next round by assigning each element (x, y) of Bi,t to Bi,t+1 if hi,t(x) = −1 and to Bi+1,t+1 otherwise. Training proceeds in this way for a given number T of rounds, which is an input parameter of the algorithm.

The output of Basic MartiBoost is a layered branching program defined as follows. There is a node vi,t for each round 1 ≤ t ≤ T + 1 and each index 0 ≤ i < t (that is, for each bin constructed during training). An item x is routed through the branching program the same way a labeled example (x, y) would have been routed during the training phase: it starts in node v0,1, and from each node vi,t it goes to vi,t+1 if hi,t(x) = −1, and to vi+1,t+1 otherwise. When the item x arrives at a terminal node of the branching program in layer T + 1, it is at some node vj,T+1.
The prediction is 1 if j ≥ T/2 and is −1 if j < T/2; in other words, the prediction is according to the majority vote of the weak classifiers that were encountered along the path through the branching program that the example followed. See Figure 2.

The following lemma is proved in [9]. (The crux of the proof is the observation that positive (respectively, negative) examples are routed through the branching program according to a random walk that is biased to the right (respectively, left); hence the name "martingale boosting.")

Lemma 7 ([9]). Suppose that Basic MartiBoost is provided with a hypothesis hi,t with two-sided advantage γ w.r.t. Di,t at each node vi,t. Then for T = O(log(1/ε)/γ²), Basic MartiBoost constructs a branching program H such that D+[H(x) = −1] ≤ ε and D−[H(x) = 1] ≤ ε.

We now describe our noise-tolerant AUC boosting algorithm, which we call Basic MartiRank. Given access to a noise-tolerant weak ranker A with advantage γ, at each node vi,t the Basic MartiRank algorithm runs A and proceeds as described in Lemma 4 to obtain a weak classifier hi,t. Basic MartiRank runs Basic MartiBoost with T = O(log(1/ε)/γ²) and simply uses the resulting classifier H as its ranking function. The following theorem shows that Basic MartiRank is an effective AUC booster in the presence of independent misclassification noise:

Theorem 8. Fix any η < 1/2 and any ε > 0. Given access to Dη and a noise-tolerant weak ranker A with advantage γ, Basic MartiRank outputs a branching program H such that AUC(H; D) ≥ 1 − ε.

Proof: Fix any node vi,t in the branching program. The crux of the proof is the following simple observation: for a labeled example (x, y), the route through the branching program that is taken by (x, y) is determined completely by the predictions of the base classifiers, i.e. only by x, and is unaffected by the value of y. Consequently, if Di,t denotes the original noiseless distribution D conditioned on reaching vi,t, then the noisy distribution conditioned on reaching vi,t, i.e. (Dη)i,t, is simply Di,t corrupted with independent misclassification noise, i.e. (Di,t)η. So each time the noise-tolerant weak ranker A is invoked at a node vi,t, it is indeed the case that the distribution that it is given is an independent misclassification noise distribution. Consequently A does construct weak rankers with AUC at least 1/2 + γ, and the conversion of Lemma 4 yields weak classifiers that have advantage γ/4 with respect to the underlying distribution Di,t. Given this, Lemma 7 implies that the final classifier H has error at most ε on both positive and negative examples drawn from the original distribution D, and Lemma 5 then implies that H, viewed as a ranker, achieves AUC at least 1 − ε.

In [9], a more complex variant of Basic MartiBoost, called Noise-Tolerant SMartiBoost, is presented and is shown to boost any noise-tolerant weak learning algorithm to any accuracy less than 1 − η in the presence of independent misclassification noise. In contrast, here we are using just the Basic MartiBoost algorithm itself, and can achieve any AUC value 1 − ε, even for ε < η.

5 Implementing MartiRank with a distribution oracle

In this section we analyze learning from random examples. Formally, we assume that the weak ranker is given access to an oracle for the noisy distribution Dη.
We thus now view a noise-tolerant weak ranker with advantage γ as an algorithm A with the following property: for any noise rate η < 1/2, given access to an oracle for Dη, the algorithm outputs a ranking function h : X → R such that AUC(h; D) ≥ 1/2 + γ.

We let mA denote the number of examples from each class that suffice for A to construct a ranking function as described above. In other words, if A is provided with a sample of draws from Dη such that each class, positive and negative, has at least mA points in the sample with that true label, then algorithm A outputs a γ-advantage weak ranking function. (Note that for simplicity we are assuming here that the weak ranker always constructs a weak ranking function with the desired advantage, i.e. we gloss over the usual confidence parameter δ; this can be handled with an entirely standard analysis.)

In order to achieve a computationally efficient algorithm in this setting we must change the MartiRank algorithm somewhat; we call the new variant Sampling MartiRank, or SMartiRank. We prove that SMartiRank is computationally efficient, has moderate sample complexity, and efficiently generates a high-accuracy final ranking function with respect to the underlying distribution D.

Our approach follows the same general lines as [9], where an oracle implementation is presented for the MartiBoost algorithm. The main challenge in [9] is the following: for each node vi,t in the branching program, the boosting algorithm considered there must simulate a balanced version of the induced distribution Di,t which puts equal weight on positive and negative examples.
If only a tiny fraction of examples drawn from D are (say) positive and reach vi,t, then it is very inefficient to simulate this balanced distribution (and in a noisy scenario, as discussed earlier, if the noise rate is high relative to the frequency of the desired class then it may in fact be impossible to simulate the balanced distribution). The solution in [9] is to "freeze" any such node and simply classify any example that reaches it as negative; the analysis argues that since only a tiny fraction of positive examples reach such nodes, this freezing only mildly degrades the accuracy of the final hypothesis.

In the ranking scenario that we now consider, we do not need to construct balanced distributions, but we do need to obtain a non-negligible number of examples from each class in order to run the weak learner at a given node. So as in [9] we still freeze some nodes, but with a twist: we now freeze nodes which have the property that for some class label (positive or negative), only a tiny fraction of examples from D with that class label reach the node. With this criterion for freezing we can prove that the final classifier constructed has high accuracy both on positive and negative examples, which is what we need to achieve good AUC. We turn now to the details.

Given a node vi,t and a bit b ∈ {−1, 1}, let p^b_{i,t} denote D[x reaches vi,t and c(x) = b]. The SMartiRank algorithm is like Basic MartiBoost but with the following difference: for each node vi,t and each value b ∈ {−1, 1}, if

    p^b_{i,t} < ε · D[c(x) = b] / (T(T + 1))    (2)

then the node vi,t is "frozen," i.e. it is labeled with the opposite label −b and is established as a terminal node with no outgoing edges. (If this condition holds for both values of b at a particular node vi,t then the node is frozen and either output value may be used as the label.)
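With exact probabilities (sample estimates are handled below), the freezing rule (2) amounts to a simple comparison per class label. The following sketch uses hypothetical names of our own choosing:

```python
def frozen_label(p_pos_node, p_neg_node, p_pos, p_neg, eps, T):
    """Apply rule (2) at one node.  p_pos_node / p_neg_node are the
    probabilities that an example reaches the node with true label
    +1 / -1; p_pos / p_neg are the class probabilities D[c(x) = b].
    Returns the forced label of a frozen node, or None if not frozen."""
    cutoff = lambda p_b: eps * p_b / (T * (T + 1))
    if p_pos_node < cutoff(p_pos):
        return -1   # too few positives reach the node: output -b = -1
    if p_neg_node < cutoff(p_neg):
        return 1    # too few negatives reach the node: output -b = +1
    return None     # node stays active; run the weak ranker here
```

When both conditions hold, this sketch happens to return −1, which is consistent with the rule's allowance that either label may be used in that case.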
The following theorem establishes that if SMartiRank is given weak classifiers with two-sided advantage at each node that is not frozen, it will construct a hypothesis with small error rate on both positive and negative examples:

Theorem 9. Suppose that the SMartiRank algorithm as described above is provided with a hypothesis hi,t that has two-sided advantage γ with respect to Di,t at each node vi,t that is not frozen. Then for T = O(log(1/ε)/γ²), the final branching program hypothesis H that SMartiRank constructs will have D+[H(x) = −1] ≤ ε and D−[H(x) = 1] ≤ ε.

Proof: We analyze D+[H(x) = −1]; the other case is symmetric. Given an unlabeled instance x ∈ X, we say that x freezes at node vi,t if x's path through the branching program causes it to terminate at a node vi,t with t < T + 1 (i.e. at a node vi,t which was frozen by SMartiRank). We have

    D[x freezes and c(x) = 1] = Σ_{i,t} D[x freezes at vi,t and c(x) = 1] ≤ Σ_{i,t} ε · D[c(x) = 1]/(T(T + 1)) ≤ (ε/2) · D[c(x) = 1].

Consequently we have

    D+[x freezes] = D[x freezes and c(x) = 1] / D[c(x) = 1] < ε/2.    (3)

Naturally, D+[H(x) = −1] = D+[(H(x) = −1) & (x freezes)] + D+[(H(x) = −1) & (x does not freeze)]. By (3), this is at most ε/2 + D+[(H(x) = −1) & (x does not freeze)]. Arguments identical to those in the last two paragraphs of the proof of Theorem 3 in [9] show that D+[(H(x) = −1) & (x does not freeze)] ≤ ε/2, and we are done.

We now describe how SMartiRank can be run given oracle access to Dη, and sketch the analysis of the required sample complexity (some details are omitted because of space limits).
For simplicity of presentation we shall assume that the booster is given the value p := min{D[c(x) = −1], D[c(x) = 1]}; we note that if p is not given a priori, a standard "guess and halve" technique can be used to efficiently obtain a value that is within a multiplicative factor of two of p, which is easily seen to suffice. We also make the standard assumption (see [7, 9]) that the noise rate η is known; this assumption can similarly be removed by having the algorithm "guess and check" the value to sufficiently fine granularity. Also, the confidence can be analyzed using the standard appeal to the union bound; details are omitted.

SMartiRank will replace (2) with a comparison of sample estimates of the two quantities. To allow for the fact that they are just estimates, it will be more conservative, and freeze when the estimate of p^b_{i,t} is at most ε/(4T(T + 1)) times the estimate of D[c(x) = b].

We first observe that for any distribution D and any bit b, we have Pr_{(x,y)∼Dη}[y = b] = η + (1 − 2η) · Pr_{(x,c(x))∼D}[c(x) = b], which is equivalent to

    D[c(x) = b] = (Dη[y = b] − η) / (1 − 2η).

Consequently, given an empirical estimate of Dη[y = b] that is accurate to within an additive ±p(1 − 2η)/10 (which can easily be obtained from O(1/(p²(1 − 2η)²)) draws from Dη), it is possible to estimate D[c(x) = b] to within an additive ±p/10, and thus to estimate the RHS of (2) to within an additive ±εp/(10T(T + 1)). Now in order to determine whether node vi,t should be frozen, we must compare this estimate with a similarly accurate estimate of p^b_{i,t} (arguments similar to those of, e.g., Section 6.3 of [9] can be used to show that it suffices to run the algorithm using these estimated values).
We have

    p^b_{i,t} = D[x reaches vi,t] · D[c(x) = b | x reaches vi,t]
              = Dη[x reaches vi,t] · Di,t[c(x) = b]
              = Dη[x reaches vi,t] · ((Dη)i,t[y = b] − η) / (1 − 2η).

A standard analysis (see e.g. Chapter 5 of [8]) shows that this quantity can be estimated to additive accuracy ±τ using poly(1/τ, 1/(1 − 2η)) many calls to Dη (briefly, if Dη[x reaches vi,t] is less than τ(1 − 2η) then an estimate of 0 is good enough, while if it is greater than τ(1 − 2η) then a τ-accurate estimate of the second multiplicand can be obtained using O(1/(τ³(1 − 2η)³)) draws from Dη, since at least a τ(1 − 2η) fraction of draws will reach vi,t). Thus for each vi,t, we can determine whether to freeze it in the execution of SMartiRank using poly(T, 1/ε, 1/p, 1/(1 − 2η)) draws from Dη. For each of the nodes that are not frozen, we must run the noise-tolerant weak ranker A using the distribution (Dη)i,t. As discussed at the beginning of this section, this requires that we obtain a sample from (Dη)i,t containing at least mA examples whose true label belongs to each class. The expected number of draws from Dη that must be made in order to receive an example from a given class is 1/p, and since vi,t is not frozen, the expected number of draws from Dη belonging to a given class that must be made in order to simulate a draw from (Dη)i,t belonging to that class is O(T²/ε). Thus, O(T² mA/(εp)) many draws from Dη are required in order to run the weak learner A at any particular node. Since there are O(T²) many nodes overall, we have that all in all O(T⁴ mA/(εp)) many draws from Dη are required, in addition to the poly(T, 1/ε, 1/p, 1/(1 − 2η)) draws required to identify which nodes to freeze.
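The inversion D[c(x) = b] = (Dη[y = b] − η)/(1 − 2η) used throughout this section is straightforward to apply to an empirical label frequency. A minimal sketch, assuming the noise rate η < 1/2 is known (names our own):

```python
def estimate_class_prior(noisy_labels, b, eta):
    """Estimate D[c(x) = b] from labels corrupted by independent
    misclassification noise at known rate eta < 1/2, by inverting
    Pr[y = b] = eta + (1 - 2*eta) * Pr[c(x) = b]."""
    freq = sum(1 for y in noisy_labels if y == b) / len(noisy_labels)
    return (freq - eta) / (1 - 2 * eta)
```

With a sample of O(1/(p²(1 − 2η)²)) noisy draws, the empirical frequency is accurate enough that this estimate is within ±p/10 of the true class probability, as required by the conservative version of rule (2).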
Recalling that T = O(log(1/ε)/γ²), all in all we have:

Theorem 10. Let D be a nontrivial distribution over X, let p = min{D[c(x) = −1], D[c(x) = 1]}, and let η < 1/2. Given access to an oracle for Dη and a noise-tolerant weak ranker A with advantage γ, the SMartiRank algorithm makes mA · poly(1/ε, 1/γ, 1/(1 − 2η), 1/p) calls to Dη, and with probability 1 − δ outputs a branching program H such that AUC(H; D) ≥ 1 − ε.

Acknowledgement

We are very grateful to Naoki Abe for suggesting the problem of boosting the AUC.

References

[1] A. P. Bradley. Use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30:1145–1159, 1997.

[2] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In NIPS, 2003.

[3] T. Fawcett. ROC graphs: Notes and practical considerations for researchers. Technical Report HPL-2003-4, HP, 2003.

[4] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4(6):933–970, 2004.

[5] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[6] J. Hanley and B. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.

[7] A. Kalai and R. Servedio. Boosting in the presence of noise. Journal of Computer and System Sciences, 71(3):266–290, 2005. Preliminary version in Proc. STOC'03.

[8] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, 1994.

[9] P. Long and R. Servedio. Martingale boosting. In Proceedings of the Eighteenth Annual Conference on Computational Learning Theory (COLT), pages 79–94, 2005.

[10] F. Provost, T. Fawcett, and R. Kohavi. The case against accuracy estimation for comparing induction algorithms. In ICML, 1998.

[11] C. Rudin, C. Cortes, M. Mohri, and R. E. Schapire. Margin-based ranking meets boosting in the middle. In COLT, 2005.

[12] J. A. Swets. Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers. Lawrence Erlbaum Associates, 1995.