{"title": "AUC Optimization vs. Error Rate Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 313, "page_last": 320, "abstract": "", "full_text": "AUC Optimization vs. Error Rate Minimization\n\nCorinna Cortes\u2217and Mehryar Mohri\n\nAT&T Labs \u2013 Research\n\n180 Park Avenue, Florham Park, NJ 07932, USA\n\n{corinna, mohri}@research.att.com\n\nAbstract\n\nThe area under an ROC curve (AUC) is a criterion used in many appli-\ncations to measure the quality of a classi\ufb01cation algorithm. However,\nthe objective function optimized in most of these algorithms is the error\nrate and not the AUC value. We give a detailed statistical analysis of the\nrelationship between the AUC and the error rate, including the \ufb01rst exact\nexpression of the expected value and the variance of the AUC for a \ufb01xed\nerror rate. Our results show that the average AUC is monotonically in-\ncreasing as a function of the classi\ufb01cation accuracy, but that the standard\ndeviation for uneven distributions and higher error rates is noticeable.\nThus, algorithms designed to minimize the error rate may not lead to\nthe best possible AUC values. We show that, under certain conditions,\nthe global function optimized by the RankBoost algorithm is exactly the\nAUC. We report the results of our experiments with RankBoost in several\ndatasets demonstrating the bene\ufb01ts of an algorithm speci\ufb01cally designed\nto globally optimize the AUC over other existing algorithms optimizing\nan approximation of the AUC or only locally optimizing the AUC.\n\n1 Motivation\n\nIn many applications, the overall classi\ufb01cation error rate is not the most pertinent perfor-\nmance measure, criteria such as ordering or ranking seem more appropriate. 
Consider for example the list of relevant documents returned by a search engine for a specific query. That list may contain several thousand documents, but, in practice, only the top fifty or so are examined by the user. Thus, a search engine's ranking of the documents is more critical than the accuracy of its classification of all documents as relevant or not. More generally, for a binary classifier assigning a real-valued score to each object, a better correlation between output scores and the probability of correct classification is highly desirable.

A natural criterion or summary statistic often used to measure the ranking quality of a classifier is the area under an ROC curve (AUC) [8].¹ However, the objective function optimized by most classification algorithms is the error rate and not the AUC. Recently, several algorithms have been proposed for maximizing the AUC value locally [4] or maximizing some approximations of the global AUC value [9, 15], but, in general, these algorithms do not obtain AUC values significantly better than those obtained by an algorithm designed to minimize the error rate. Thus, it is important to determine the relationship between the AUC values and the error rate.

∗This author's new address is: Google Labs, 1440 Broadway, New York, NY 10018, corinna@google.com.

¹The AUC value is equivalent to the Wilcoxon-Mann-Whitney statistic [8] and closely related to the Gini index [1]. It has been re-invented under the name of L-measure by [11], as already pointed out by [2], and slightly modified under the name of Linear Ranking by [13, 14].

[Figure 1 here: an ROC curve with AUC = 0.718, plotting the true positive rate against the false positive rate from (0, 0) to (1, 1).]

True positive rate = (correctly classified positive) / (total positive)
False positive rate = (incorrectly classified negative) / (total negative)

Figure 1: An example of an ROC curve. 
The line connecting (0, 0) and (1, 1), corresponding to random classification, is drawn for reference. The true positive (negative) rate is sometimes referred to as the sensitivity (resp. specificity) in this context.

In the following sections, we give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate.² We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost on several datasets and demonstrate the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC.

2 Definition and properties of the AUC

Receiver Operating Characteristics (ROC) curves were originally developed in signal detection theory [3] in connection with radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [12, 10, 4, 9, 15, 2].

The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. Fig. (1) shows an example of an ROC curve. 
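The threshold-sweeping construction just described can be illustrated with a short sketch. The function name and toy scores below are ours, for illustration only:

```python
# Sketch of the ROC construction described above: sweep the classification
# threshold from the most positive score to the most negative, recording
# (false positive rate, true positive rate) points along the way.

def roc_points(pos_scores, neg_scores):
    """Return the ROC points traced by sweeping the threshold."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points

# Toy scores: three positive and three negative examples.
print(roc_points([0.9, 0.8, 0.6], [0.7, 0.4, 0.3]))
```

The last point is always (1, 1): at the lowest threshold, every example is classified as positive.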
The AUC is defined as the area under the ROC curve and is closely related to the ranking quality of the classification, as shown more formally by Lemma 1 below.

Consider a binary classification task with m positive examples and n negative examples. We will assume that a classifier outputs a strictly ordered list for these examples and will denote by 1_X the indicator function of a set X.

Lemma 1 ([8]) Let c be a fixed classifier. Let x_1, . . . , x_m be the output of c on the positive examples and y_1, . . . , y_n its output on the negative examples. Then, the AUC, A, associated to c is given by:

    A = (Σ_{i=1}^{m} Σ_{j=1}^{n} 1_{x_i > y_j}) / (mn)    (1)

that is, the value of the Wilcoxon-Mann-Whitney statistic [8].

Proof. The proof is based on the observation that the AUC value is exactly the probability P(X > Y) where X is the random variable corresponding to the distribution of the outputs for the positive examples and Y the one corresponding to the negative examples [7]. The Wilcoxon-Mann-Whitney statistic is clearly the expression of that probability in the discrete case, which proves the lemma [8].

Thus, the AUC can be viewed as a measure based on pairwise comparisons between classifications of the two classes. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. 
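Eq. (1) translates directly into code. The following sketch is our own illustration, with made-up scores:

```python
# AUC as the Wilcoxon-Mann-Whitney statistic of Eq. (1): the fraction of
# (positive, negative) pairs whose scores are correctly ordered.
# Assumes strictly ordered outputs (no ties), as in the text.

def auc_wmw(pos_scores, neg_scores):
    m, n = len(pos_scores), len(neg_scores)
    return sum(x > y for x in pos_scores for y in neg_scores) / (m * n)

print(auc_wmw([0.9, 0.8, 0.6], [0.7, 0.4, 0.3]))  # 8 of the 9 pairs are ordered correctly
```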
Any deviation from this ranking decreases the AUC.

²An attempt in that direction was made by [15], but, unfortunately, the authors' analysis and the result are both wrong.

[Figure 2 here: the region above the threshold contains m − (k − x) positive and x negative examples; the region below contains k − x positive and n − x negative examples.]

Figure 2: For a fixed number of errors k, there may be x, 0 ≤ x ≤ k, false positive examples.

3 The Expected Value of the AUC

In this section, we compute exactly the expected value of the AUC over all classifications with a fixed number of errors and compare that to the error rate.

Different classifiers may have the same error rate but different AUC values. Indeed, for a given classification threshold θ, an arbitrary reordering of the examples with outputs more than θ clearly does not affect the error rate but leads to different AUC values. Similarly, one may reorder the examples with output less than θ without changing the error rate.

Assume that the number of errors k is fixed. We wish to compute the average value of the AUC over all classifications with k errors. Our model is based on the simple assumption that all classifications or rankings with k errors are equiprobable. One could perhaps argue that errors are not necessarily evenly distributed, e.g., examples with very high or very low ranks are less likely to be errors, but we cannot justify such biases in general.

For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are k − x false negative examples. Figure 2 shows the corresponding configuration. The two regions of examples with classification outputs above and below the threshold are separated by a vertical line. For a given x, the computation of the AUC, A, as given by Eq. 
(1) can be divided into the following three parts:

    A = (A_1 + A_2 + A_3) / (mn),    (2)

with

A_1 = the sum over all pairs (x_i, y_j) with x_i and y_j in distinct regions;
A_2 = the sum over all pairs (x_i, y_j) with x_i and y_j in the region above the threshold;
A_3 = the sum over all pairs (x_i, y_j) with x_i and y_j in the region below the threshold.

The first term, A_1, is easy to compute. Since there are (m − (k − x)) positive examples above the threshold and n − x negative examples below the threshold, A_1 is given by:

    A_1 = (m − (k − x))(n − x)    (3)

To compute A_2, we can assign to each negative example above the threshold a position based on its classification rank. Let position one be the first position above the threshold and let α_1 < . . . < α_x denote the positions in increasing order of the x negative examples in the region above the threshold. The total number of examples classified as positive is N = m − (k − x) + x. Thus, by definition of A_2,

    A_2 = Σ_{i=1}^{x} [(N − α_i) − (x − i)]    (4)

where the first term N − α_i represents the number of examples ranked higher than the ith example and the second term x − i discounts the number of negative examples incorrectly ranked higher than the ith example. Similarly, let α′_1 < . . . < α′_{k−x} denote the positions of the k − x positive examples below the threshold, counting positions in reverse by starting from the threshold. Then, A_3 is given by:

    A_3 = Σ_{j=1}^{x′} [(N′ − α′_j) − (x′ − j)]    (5)

with N′ = n − x + (k − x) and x′ = k − x. 
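The three-part decomposition of Eqs. (2)-(5) can be checked numerically against the direct pairwise count of Eq. (1). The sketch below is our own; it encodes a strictly ordered list of labels best-first, with a threshold index giving the number of examples classified positive:

```python
# Check the A = (A1 + A2 + A3)/(mn) decomposition of Eqs. (2)-(5) against
# the direct pairwise AUC of Eq. (1). 'ranking' is a best-first list of
# labels (1 = positive, 0 = negative); 'cut' is the number of examples
# classified positive. Illustrative sketch only.

def auc_pairwise(ranking):
    pos = [i for i, lab in enumerate(ranking) if lab == 1]
    neg = [i for i, lab in enumerate(ranking) if lab == 0]
    return sum(p < q for p in pos for q in neg) / (len(pos) * len(neg))

def auc_decomposed(ranking, cut):
    above, below = ranking[:cut], ranking[cut:]
    m = sum(ranking)
    n = len(ranking) - m
    N, Np = len(above), len(below)
    x = N - sum(above)      # false positives: negatives above the threshold
    xp = sum(below)         # false negatives: positives below the threshold
    A1 = (m - xp) * (n - x)                                          # Eq. (3)
    # positions of the x negatives above, counted upward from the threshold
    alpha = sorted(N - i for i, lab in enumerate(above) if lab == 0)
    A2 = sum((N - a) - (x - i) for i, a in enumerate(alpha, 1))      # Eq. (4)
    # positions of the k - x positives below, counted downward from the threshold
    alpha_p = sorted(j + 1 for j, lab in enumerate(below) if lab == 1)
    A3 = sum((Np - a) - (xp - j) for j, a in enumerate(alpha_p, 1))  # Eq. (5)
    return (A1 + A2 + A3) / (m * n)

r = [1, 0, 1, 1, 0, 0]   # m = 3, n = 3, one error on each side of the threshold
assert auc_decomposed(r, 3) == auc_pairwise(r)
```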
Combining the expressions of A_1, A_2, and A_3 leads to:

    A = (A_1 + A_2 + A_3) / (mn) = 1 + ((k − 2x)² + k) / (2mn) − (Σ_{i=1}^{x} α_i + Σ_{j=1}^{x′} α′_j) / (mn)    (6)

Lemma 2 For a fixed x, the average value of the AUC A is given by:

    ⟨A⟩_x = 1 − (x/n + (k − x)/m) / 2    (7)

Proof. The proof is based on the computation of the average values of Σ_{i=1}^{x} α_i and Σ_{j=1}^{x′} α′_j for a given x. We start by computing the average value ⟨α_i⟩_x for a given i, 1 ≤ i ≤ x. Consider all the possible positions for α_1 . . . α_{i−1} and α_{i+1} . . . α_x, when the value of α_i is fixed at say α_i = l. We have i ≤ l ≤ N − (x − i) since there need to be at least i − 1 positions below α_i and x − i above it. There are l − 1 possible positions for α_1 . . . α_{i−1} and N − l possible positions for α_{i+1} . . . α_x. Since the total number of ways of choosing the x positions for α_1 . . . α_x out of N is C(N, x), the average value ⟨α_i⟩_x is:

    ⟨α_i⟩_x = (Σ_{l=i}^{N−(x−i)} l C(l − 1, i − 1) C(N − l, x − i)) / C(N, x)    (8)

Thus,

    ⟨Σ_{i=1}^{x} α_i⟩_x = (Σ_{i=1}^{x} Σ_{l=i}^{N−(x−i)} l C(l − 1, i − 1) C(N − l, x − i)) / C(N, x) = (Σ_{l=1}^{N} l Σ_{i=1}^{x} C(l − 1, i − 1) C(N − l, x − i)) / C(N, x)    (9)

Using the classical identity Σ_{p_1 + p_2 = p} C(u, p_1) C(v, p_2) = C(u + v, p), we can write:

    ⟨Σ_{i=1}^{x} α_i⟩_x = (Σ_{l=1}^{N} l C(N − 1, x − 1)) / C(N, x) = (N(N + 1)/2) C(N − 1, x − 1) / C(N, x) = x(N + 1)/2    (10)

Similarly, we have:

    ⟨Σ_{j=1}^{x′} α′_j⟩_x = x′(N′ + 1)/2    (11)

Replacing ⟨Σ α_i⟩_x and ⟨Σ α′_j⟩_x in Eq. (6) by the expressions given by Eq. (10) and Eq. (11) leads to:

    ⟨A⟩_x = 1 + ((k − 2x)² + k − x(N + 1) − x′(N′ + 1)) / (2mn) = 1 − (x/n + (k − x)/m) / 2    (12)

which ends the proof of the lemma.

Note that Eq. (7) shows that the average AUC value for a given x is simply one minus the average of the per-class error rates, that is, the average of the accuracy rates for the positive and negative classes.

Proposition 1 Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC A over all classifications with k errors is given by:

    ⟨A⟩ = 1 − k/(m + n) − ((n − m)²(m + n + 1)) / (4mn) · ( k/(m + n) − (Σ_{x=0}^{k−1} C(m + n, x)) / (Σ_{x=0}^{k} C(m + n + 1, x)) )    (13)

Proof. Lemma 2 gives the average value of the AUC for a fixed value of x. To compute the average over all possible values of x, we need to weight the expression of Eq. (7) with the total number of possible classifications for a given x. There are C(N, x) possible ways of choosing the positions of the x misclassified negative examples, and similarly C(N′, x′) possible ways of choosing the positions of the x′ = k − x misclassified positive examples. Thus, in view of Lemma 2, the average AUC is given by:

    ⟨A⟩ = (Σ_{x=0}^{k} C(N, x) C(N′, x′) (1 − (x/n + (k − x)/m)/2)) / (Σ_{x=0}^{k} C(N, x) C(N′, x′))    (14)

[Figure 3 here: mean (left) and relative standard deviation (right) of the AUC as a function of the error rate, with one curve per ratio r ∈ {0.01, 0.05, 0.1, 0.25, 0.5}.]

Figure 3: Mean (left) and relative standard deviation (right) of the AUC as a function of the error rate. Each curve corresponds to a fixed ratio of r = n/(n + m). The average AUC value monotonically increases with the accuracy. For n = m, as for the top curve in the left plot, the average AUC coincides with the accuracy. The standard deviation decreases with the accuracy, and the lowest curve corresponds to n = m.

This expression can be simplified into Eq. (13)³ using the following novel identities:

    Σ_{x=0}^{k} C(N, x) C(N′, x′) = Σ_{x=0}^{k} C(n + m + 1, x)    (15)

    Σ_{x=0}^{k} x C(N, x) C(N′, x′) = Σ_{x=0}^{k} (((k − x)(m − n) + k) / 2) C(n + m + 1, x)    (16)

that we obtained by using Zeilberger's algorithm⁴ and numerous combinatorial 'tricks'. From the expression of Eq. (13), it is clear that the average AUC value is identical to the accuracy of the classifier only for even distributions (n = m). For n ≠ m, the expected value of the AUC is a monotonic function of the accuracy, see Fig. (3)(left). For a fixed ratio of n/(n + m), the curves are obtained by increasing the accuracy from n/(n + m) to 1. The average AUC varies monotonically in the range of accuracy between 0.5 and 1.0. 
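Lemma 2, and the configuration weighting used in Eq. (14), can be verified by brute-force enumeration for small m, n, k. The following sketch is our own: it enumerates all equiprobable placements of the misclassified examples for a fixed x and compares the empirical mean AUC with Eq. (7):

```python
# Brute-force check of Lemma 2: for fixed x, average the pairwise AUC over
# all C(N, x) * C(N', k - x) equiprobable placements of the misclassified
# examples and compare with Eq. (7). Small-scale illustration only.
from itertools import combinations
from math import isclose

def auc_pairwise(ranking, m, n):
    pos = [i for i, lab in enumerate(ranking) if lab == 1]
    neg = [i for i, lab in enumerate(ranking) if lab == 0]
    return sum(p < q for p in pos for q in neg) / (m * n)

def avg_auc_fixed_x(m, n, k, x):
    xp = k - x
    N = m - xp + x          # number of examples classified positive
    Np = n - x + xp         # number of examples classified negative
    aucs = []
    for neg_above in combinations(range(N), x):
        above = [0 if i in neg_above else 1 for i in range(N)]
        for pos_below in combinations(range(Np), xp):
            below = [1 if j in pos_below else 0 for j in range(Np)]
            aucs.append(auc_pairwise(above + below, m, n))
    return sum(aucs) / len(aucs)

m, n, k, x = 4, 3, 2, 1
assert isclose(avg_auc_fixed_x(m, n, k, x), 1 - (x / n + (k - x) / m) / 2)  # Eq. (7)
```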
In other words, on average, there seems to be nothing gained in designing specific learning algorithms for maximizing the AUC: a classification algorithm minimizing the error rate also optimizes the AUC. However, this only holds for the average AUC. Indeed, we will show in the next section that the variance of the AUC value is not null for any ratio n/(n + m) when k ≠ 0.

4 The Variance of the AUC

Let D = mn + ((k − 2x)² + k)/2, a = Σ_{i=1}^{x} α_i, a′ = Σ_{j=1}^{x′} α′_j, and α = a + a′. Then, by Eq. (6), mnA = D − α. Thus, the variance of the AUC, σ²(A), is given by:

    (mn)² σ²(A) = ⟨(D − α)²⟩ − (⟨D⟩ − ⟨α⟩)²
                = ⟨D²⟩ − ⟨D⟩² + ⟨α²⟩ − ⟨α⟩² − 2(⟨αD⟩ − ⟨α⟩⟨D⟩)    (17)

As before, to compute the average of a term X over all classifications, we can first determine its average ⟨X⟩_x for a fixed x, and then use the function F defined by:

    F(Y) = (Σ_{x=0}^{k} C(N, x) C(N′, x′) Y) / (Σ_{x=0}^{k} C(N, x) C(N′, x′))    (18)

and ⟨X⟩ = F(⟨X⟩_x). A crucial step in computing the exact value of the variance of the AUC is to determine the value of the terms of the type ⟨a²⟩_x = ⟨(Σ_{i=1}^{x} α_i)²⟩_x.

³An essential difference between Eq. (14) and the expression given by [15] is the weighting by the number of configurations. The authors' analysis leads them to the conclusion that the average AUC is identical to the accuracy for all ratios n/(n + m), which is false.

⁴We thank Neil Sloane for having pointed us to Zeilberger's algorithm and Maple package.

Lemma 3 For a fixed x, the average of (Σ_{i=1}^{x} α_i)² is given by:

    ⟨a²⟩_x = x(N + 1)(3Nx + 2x + N) / 12

Proof. By definition of a, ⟨a²⟩_x = b + 2c with:

    b = ⟨Σ_{i=1}^{x} α_i²⟩_x    and    c = ⟨Σ_{1≤i<j≤x} α_i α_j⟩_x

Reasoning as in the proof of Lemma 2, we can obtain:

    b = (Σ_{i=1}^{x} Σ_{l} l² C(l − 1, i − 1) C(N − l, x − i)) / C(N, x) = (Σ_{l=1}^{N} l² C(N − 1, x − 1)) / C(N, x) = x(N + 1)(2N + 1) / 6

To compute c, we compute ⟨α_i α_j⟩_x for a given pair (i, j) with i < j. As in the proof of Lemma 2, consider all the possible positions of α_1 . . . α_{i−1}, α_{i+1} . . . α_{j−1}, and α_{j+1} . . . α_x when α_i is fixed at α_i = l, and α_j is fixed at α_j = l′. There are l − 1 possible positions for the α_1 . . . α_{i−1}, l′ − l − 1 possible positions for α_{i+1} . . . α_{j−1}, and N − l′ possible positions for α_{j+1} . . . α_x. Thus, we have:

    ⟨α_i α_j⟩_x = (Σ_{i≤l<l′} l l′ C(l − 1, i − 1) C(l′ − l − 1, j − i − 1) C(N − l′, x − j)) / C(N, x)

Summing over all pairs (i, j) and simplifying yields the statement of the lemma.

The exact expression of the variance, Eq. (25), stated as Proposition 2, follows by decomposing the variance over x:

    (mn)² σ²(A) = F((D − ⟨a + a′⟩_x)²) − F(D − ⟨a + a′⟩_x)² + F(⟨a²⟩_x − ⟨a⟩_x²) + F(⟨a′²⟩_x − ⟨a′⟩_x²)    (26)

The expressions for ⟨a⟩_x and ⟨a′⟩_x were given in the proof of Lemma 2, and that of ⟨a²⟩_x by Lemma 3. The following formula can be obtained in a similar way: ⟨a′²⟩_x = x′(N′ + 1)(3N′x′ + 2x′ + N′) / 12. Replacing these expressions in Eq. (26) and further simplifications give exactly Eq. (25) and prove the proposition.

The expression of the variance is illustrated by Fig. (3)(right), which shows the value of one standard deviation of the AUC divided by the corresponding mean value of the AUC. This figure is parallel to the one showing the mean of the AUC (Fig. (3)(left)). Each line is obtained by fixing the ratio n/(n + m) and varying the number of errors from 1 to the size of the smallest class. The more uneven class distributions have the highest variance, and the variance increases with the number of errors. These observations contradict the inexact claim of [15] that the variance is zero for all error rates with even distributions n = m. In Fig. 
(3)(right), the even distribution n = m corresponds to the lowest dashed line.

Dataset       Size  # of   n/(n+m)  AUCsplit [4]                RankBoost
                    Attr.  (%)      Accuracy (%)  AUC (%)       Accuracy (%)  AUC (%)
Breast-Wpbc   194   33     23.7     69.5 ± 10.6   59.3 ± 16.2   65.5 ± 13.8   80.4 ± 8.0
Credit        653   15     45.3     -             -             81.0 ± 7.4    94.5 ± 2.9
Ionosphere    351   34     35.9     89.6 ± 5.0    89.7 ± 6.7    83.6 ± 10.9   98.0 ± 3.3
Pima          768   8      34.9     72.5 ± 5.1    76.7 ± 6.0    69.7 ± 7.6    84.8 ± 6.5
SPECTF        269   43     20.4     -             -             67.3          93.4
Page-blocks   5473  10     10.2     96.8 ± 0.2    95.1 ± 6.9    92.0 ± 2.5    98.5 ± 1.5
Yeast (CYT)   1484  8      31.2     71.1 ± 3.6    73.3 ± 4.0    45.3 ± 3.8    78.5 ± 3.0

Table 1: Accuracy and AUC values for several datasets from the UC Irvine repository. The values for RankBoost are obtained by 10-fold cross-validation. The values for AUCsplit are from [4].

5 Experimental Results

Proposition 2 above demonstrates that, for uneven distributions, classifiers with the same fixed (low) accuracy exhibit noticeably different AUC values. This motivates the use of algorithms directly optimizing the AUC rather than doing so indirectly via minimizing the error rate. Under certain conditions, RankBoost [5] can be viewed exactly as an algorithm optimizing the AUC. 
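The spread just invoked is easy to exhibit directly. In the toy example below (ours, not from the paper), two rankings make the same number of errors at the same threshold, yet have different AUC values:

```python
# Two rankings with identical error rate but different AUC: with m = 2
# positives, n = 8 negatives, and a threshold after the 4th example, both
# make exactly 2 errors (two false positives), yet their AUCs differ.
# Illustrative toy example.

def auc(ranking):
    """Pairwise AUC of a best-first list of labels (1 = positive)."""
    pos = [i for i, lab in enumerate(ranking) if lab == 1]
    neg = [i for i, lab in enumerate(ranking) if lab == 0]
    return sum(p < q for p in pos for q in neg) / (len(pos) * len(neg))

worst = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]   # the two false positives outrank every positive
best  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # the two false positives sit just above the threshold
print(auc(worst), auc(best))   # same accuracy (80%), different AUC
```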
In this section, we make the connection between RankBoost and AUC optimization, and compare the performance of RankBoost to two recent algorithms proposed for optimizing an approximation of the AUC [15] or locally optimizing the AUC [4].

The objective of RankBoost is to produce a ranking that minimizes the number of incorrectly ordered pairs of examples, possibly with different costs assigned to the mis-rankings. When the examples to be ranked are simply two disjoint sets, the objective function minimized by RankBoost is

    rloss = (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} 1_{x_i ≤ y_j}    (27)

which is exactly one minus the Wilcoxon-Mann-Whitney statistic. Thus, by Lemma 1, the objective function maximized by RankBoost coincides with the AUC.

RankBoost's optimization is based on combining a number of weak rankings. For our experiments, we chose as weak rankings threshold rankers with the range {0, 1}, similar to the boosted stumps often used by AdaBoost [6]. We used the so-called Third Method of RankBoost for selecting the best weak ranker. According to this method, at each step, the weak threshold ranker is selected so as to maximize the AUC of the weighted distribution. Thus, with this method, the global objective of obtaining the best AUC is pursued by selecting the weak ranking with the best AUC at each step.

Furthermore, the RankBoost algorithm maintains a perfect 50-50% distribution of the weights on the positive and negative examples. By Proposition 1, for even distributions, the mean of the AUC is identical to the classification accuracy. For threshold rankers like step functions, or stumps, there is no variance of the AUC, so the mean of the AUC is equal to the observed AUC. That is, instead of viewing RankBoost as selecting the weak ranker with the best weighted AUC value, one can view it as selecting the weak ranker with the lowest weighted error rate. 
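For strictly ordered outputs, the identity rloss = 1 − AUC behind Eq. (27) can be checked directly. The sketch and scores below are ours:

```python
# Eq. (27): RankBoost's ranking loss over two disjoint sets is the fraction
# of misordered (positive, negative) pairs, i.e. one minus the
# Wilcoxon-Mann-Whitney statistic when outputs are strictly ordered.

def rloss(pos_scores, neg_scores):
    m, n = len(pos_scores), len(neg_scores)
    return sum(x <= y for x in pos_scores for y in neg_scores) / (m * n)

def auc_wmw(pos_scores, neg_scores):
    m, n = len(pos_scores), len(neg_scores)
    return sum(x > y for x in pos_scores for y in neg_scores) / (m * n)

pos, neg = [0.9, 0.8, 0.6], [0.7, 0.4, 0.3]
assert abs(rloss(pos, neg) + auc_wmw(pos, neg) - 1.0) < 1e-9  # rloss = 1 - AUC
```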
This is similar to the choice of the best weak learner for boosted stumps in AdaBoost. So, for stumps, AdaBoost and RankBoost differ only in the updating scheme of the weights: RankBoost updates the positive examples differently from the negative ones, while AdaBoost uses one common scheme for the two groups.

Our experimental results corroborate the observation that RankBoost is an algorithm optimizing the AUC. RankBoost based on boosted stumps obtains AUC values that are substantially better than those reported in the literature for algorithms designed to locally or approximately optimize the AUC. Table 1 compares the results of RankBoost on a number of datasets from the UC Irvine repository to the results reported by [4]. The results for RankBoost are obtained by 10-fold cross-validation. For RankBoost, the accuracy and the best AUC values reported on each line of the table correspond to the same boosting step. RankBoost consistently outperforms AUCsplit in a comparison based on AUC values, even for datasets such as Breast-Wpbc and Pima where the two algorithms obtain similar accuracies. The table also lists results for the UC Irvine Credit Approval and SPECTF heart datasets, for which the authors of [15] report results corresponding to their AUC optimization algorithms. The AUC values reported by [15] are no better than 92.5% for the Credit Approval dataset and only 87.5% for the SPECTF dataset, both substantially lower. From the table, it is also clear that RankBoost is not an error rate minimization algorithm: the accuracy for the Yeast (CYT) dataset is as low as 45%.

6 Conclusion

A statistical analysis of the relationship between the AUC value and the error rate was given, including the first exact expression of the expected value and standard deviation of the AUC for a fixed error rate. 
The results offer a better understanding of the effect on the AUC value of algorithms designed for error rate minimization. For uneven distributions and relatively high error rates, the standard deviation of the AUC suggests that algorithms designed to optimize the AUC value may lead to substantially better AUC values. Our experimental results using RankBoost corroborate this claim.

In separate experiments we have observed that AdaBoost achieves significantly better error rates than RankBoost (as expected) but that it also leads to AUC values close to those achieved by RankBoost. It is a topic for further study to explain and understand this property of AdaBoost. A partial explanation could be that, just like RankBoost, AdaBoost maintains at each boosting round an equal distribution of the weights for positive and negative examples.

References

[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International, Belmont, CA, 1984.

[2] J-H. Chauchat, R. Rakotomalala, M. Carloz, and C. Pelletier. Targeting customer groups using gain and cost matrix; a marketing application. Technical report, ERIC Laboratory - University of Lyon 2, 2001.

[3] J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.

[4] C. Ferri, P. Flach, and J. Hernández-Orallo. Learning decision trees using the area under the ROC curve. In ICML-2002. Morgan Kaufmann, 2002.

[5] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In ICML-98. Morgan Kaufmann, San Francisco, US, 1998.

[6] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory, volume 2, 1995.

[7] D. M. Green and J. A. Swets. Signal detection theory and psychophysics. 
New York:\n\nWiley, 1966.\n\n[8] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver\n\noperating characteristic (ROC) curve. Radiology, 1982.\n\n[9] M. C. Mozer, R. Dodier, M. D. Colagrosso, C. Guerra-Salcedo, and R. Wolniewicz.\n\nProdding the ROC curve. In NIPS-2002. MIT Press, 2002.\n\n[10] C. Perlich, F. Provost, and J. Simonoff. Tree induction vs. logistic regression: A\n\nlearning curve analysis. Journal of Machine Learning Research, 2003.\n\n[11] G. Piatetsky-Shapiro and S. Steingold. Measuring lift quality in database marketing.\n\nIn SIGKDD Explorations. ACM SIGKDD, 2000.\n\n[12] F. Provost and T. Fawcett. Analysis and visualization of classi\ufb01er performance: Com-\n\nparison under imprecise class and cost distribution. In KDD-97. AAAI, 1997.\n\n[13] S. Rosset. Ranking-methods for \ufb02exible evaluation and ef\ufb01cient comparison of 2-\n\nclass models. Master\u2019s thesis, Tel-Aviv University, 1999.\n\n[14] S. Rosset, E. Neumann, U. Eick, N. Vatnik, and I. Idan. Evaluation of prediction\n\nmodels for marketing campaigns. In KDD-2001. ACM Press, 2001.\n\n[15] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz. Optimizing Classi\ufb01er Perfor-\n\nmance Via the Wilcoxon-Mann-Whitney Statistics. In ICML-2003, 2003.\n\n\f", "award": [], "sourceid": 2518, "authors": [{"given_name": "Corinna", "family_name": "Cortes", "institution": null}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": null}]}