{"title": "A Large Deviation Bound for the Area Under the ROC Curve", "book": "Advances in Neural Information Processing Systems", "page_first": 9, "page_last": 16, "abstract": null, "full_text": "A Large Deviation Bound\n\nfor the Area Under the ROC Curve\n\nShivani Agarwal\u2217, Thore Graepel\u2020, Ralf Herbrich\u2020 and Dan Roth\u2217\n\u2217Dept. of Computer Science\n\n\u2020Microsoft Research\n7 JJ Thomson Avenue\n\nCambridge CB3 0FB, UK\n\n{thoreg,rherb}@microsoft.com\n\nUniversity of Illinois\n\nUrbana, IL 61801, USA\n\n{sagarwal,danr}@cs.uiuc.edu\n\nAbstract\n\nThe area under the ROC curve (AUC) has been advocated as an evalu-\nation criterion for the bipartite ranking problem. We study large devi-\nation properties of the AUC; in particular, we derive a distribution-free\nlarge deviation bound for the AUC which serves to bound the expected\naccuracy of a ranking function in terms of its empirical AUC on an inde-\npendent test sequence. A comparison of our result with a corresponding\nlarge deviation result for the classi\ufb01cation error rate suggests that the test\nsample size required to obtain an \u0001-accurate estimate of the expected ac-\ncuracy of a ranking function with \u03b4-con\ufb01dence is larger than that required\nto obtain an \u0001-accurate estimate of the expected error rate of a classi\ufb01-\ncation function with the same con\ufb01dence. A simple application of the\nunion bound allows the large deviation bound to be extended to learned\nranking functions chosen from \ufb01nite function classes.\n\n1 Introduction\nIn many learning problems, the goal is not simply to classify objects into one of a \ufb01xed\nnumber of classes; instead, a ranking of objects is desired. This is the case, for example, in\ninformation retrieval problems, where one is interested in retrieving documents from some\ndatabase that are \u2018relevant\u2019 to a given query or topic. 
In such problems, one wants to return to the user a list of documents that contains relevant documents at the top and irrelevant documents at the bottom; in other words, one wants a ranking of the documents such that relevant documents are ranked higher than irrelevant documents.

The problem of ranking has been studied from a learning perspective under a variety of settings [2, 8, 4, 7]. Here we consider the setting in which objects come from two categories, positive and negative; the learner is given examples of objects labeled as positive or negative, and the goal is to learn a ranking in which positive objects are ranked higher than negative ones. This captures, for example, the information retrieval problem described above; in this case, the training examples consist of documents labeled as relevant (positive) or irrelevant (negative). This form of ranking problem corresponds to the ‘bipartite feedback’ case of [7]; for this reason, we refer to it as the bipartite ranking problem.

Formally, the setting of the bipartite ranking problem is similar to that of the binary classification problem. In both problems, there is an instance space X and a set of two class labels Y = {−1, +1}. One is given a finite sequence of labeled training examples S = ((x1, y1), . . . , (xM, yM)) ∈ (X × Y)^M, and the goal is to learn a function based on this training sequence. However, the form of the function to be learned in the two problems is different. In classification, one seeks a binary-valued function h : X → Y that predicts the class of a new instance in X. On the other hand, in ranking, one seeks a real-valued function f : X → R that induces a ranking over X; an instance that is assigned a higher value by f is ranked higher than one that is assigned a lower value by f.

The area under the ROC curve (AUC) has recently gained some attention as an evaluation criterion for the bipartite ranking problem [3]. 
Given a ranking function f : X → R and a finite data sequence T = ((x1, y1), . . . , (xN, yN)) ∈ (X × Y)^N containing m positive and n negative examples, the AUC of f with respect to T, denoted Â(f; T), can be expressed as the following Wilcoxon-Mann-Whitney statistic [3]:

    Â(f; T) = (1/mn) Σ_{i : yi = +1} Σ_{j : yj = −1} ( I{f(xi) > f(xj)} + (1/2) I{f(xi) = f(xj)} ) ,        (1)

where I{·} denotes the indicator variable whose value is one if its argument is true and zero otherwise. The AUC of f with respect to T is thus simply the fraction of positive-negative pairs in T that are ranked correctly by f, assuming that ties are broken uniformly at random.¹

The AUC is an empirical quantity that evaluates a ranking function with respect to a particular data sequence. What does the empirical AUC tell us about the expected performance of a ranking function on future examples? This is the question we consider. The question has two parts, both of which are important for machine learning practice. First, what can be said about the expected performance of a ranking function based on its empirical AUC on an independent test sequence? Second, what can be said about the expected performance of a learned ranking function based on its empirical AUC on the training sequence from which it is learned? We address the first question in this paper; the second question is addressed in [1].

We start by defining the expected ranking accuracy of a ranking function (analogous to the expected error rate of a classification function) in Section 2. Section 3 contains our large deviation result, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. 
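The statistic in Eq. (1) is straightforward to evaluate directly. The following sketch is our own illustration (the function name `empirical_auc` and the toy data are not from the paper): it counts the fraction of positive-negative pairs ranked correctly, with ties credited one half.

```python
def empirical_auc(f, T):
    # T is a sequence of (x, y) pairs with labels y in {-1, +1};
    # f maps an instance x to a real-valued score.
    pos = [f(x) for x, y in T if y == +1]
    neg = [f(x) for x, y in T if y == -1]
    m, n = len(pos), len(neg)
    # Eq. (1): each correctly ordered positive-negative pair counts 1,
    # each tied pair counts 1/2.
    total = sum((1.0 if sp > sn else 0.5 if sp == sn else 0.0)
                for sp in pos for sn in neg)
    return total / (m * n)

# A ranker that orders every positive above every negative attains AUC 1.
T = [(0.9, +1), (0.7, +1), (0.4, -1), (0.1, -1)]
print(empirical_auc(lambda x: x, T))  # -> 1.0
```

A constant scoring function, which ties every pair, yields an AUC of exactly 1/2 under this convention.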
Our conceptual approach in deriving the large deviation result for the AUC is similar to that of [9], in which large deviation properties of the average precision were considered. Section 4 compares our bound to a corresponding large deviation bound for the classification error rate. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes; this is described in Section 5.

2 Expected Ranking Accuracy

We begin by introducing some notation. As in classification, we shall assume that all examples are drawn randomly and independently according to some (unknown) underlying distribution D over X × Y. The notation D+1 and D−1 will be used to denote the class-conditional distributions D_{X|Y=+1} and D_{X|Y=−1}, respectively. We shall find it convenient to decompose a data sequence T = ((x1, y1), . . . , (xN, yN)) ∈ (X × Y)^N into two components, T_X = (x1, . . . , xN) ∈ X^N and T_Y = (y1, . . . , yN) ∈ Y^N. Several of our results will involve the conditional distribution D_{T_X|T_Y=y} for some label sequence y = (y1, . . . , yN) ∈ Y^N; this distribution is simply D_{y1} × · · · × D_{yN}.² As a final note of convention, we use T ∈ (X × Y)^N to denote a general data sequence (e.g., an independent test sequence), and S ∈ (X × Y)^M to denote a training sequence.

¹In [3], a slightly simpler form of the Wilcoxon-Mann-Whitney statistic is used, which does not account for ties.
²Note that, since the AUC of a ranking function f with respect to a data sequence T ∈ (X × Y)^N is independent of the ordering of examples in the sequence, our results involving the conditional distribution D_{T_X|T_Y=y} for some label sequence y = (y1, . . . , yN) ∈ Y^N depend only on the number m of positive labels in y and the number n of negative labels in y. We state our results in terms of the distribution D_{T_X|T_Y=y} ≡ D_{y1} × · · · × D_{yN} only because this is more general than D_{+1}^m × D_{−1}^n.

Definition 1 (Expected ranking accuracy). Let f : X → R be a ranking function on X. Define the expected ranking accuracy (or simply ranking accuracy) of f, denoted by A(f), as follows:

    A(f) = E_{X∼D+1, X′∼D−1} { I{f(X) > f(X′)} + (1/2) I{f(X) = f(X′)} } .

The ranking accuracy A(f) defined above is simply the probability that an instance drawn randomly according to D+1 will be ranked higher by f than an instance drawn randomly according to D−1, assuming that ties are broken uniformly at random. The following simple lemma shows that the empirical AUC of a ranking function f is an unbiased estimator of the expected ranking accuracy of f:

Lemma 1. Let f : X → R be a ranking function on X, and let y = (y1, . . . , yN) ∈ Y^N be any finite label sequence. Then

    E_{T_X|T_Y=y} { Â(f; T) } = A(f) .

Proof. Let m be the number of positive labels in y, and n the number of negative labels in y. Then from the definition of the AUC (Eq. (1)) and linearity of expectation, we have

    E_{T_X|T_Y=y} { Â(f; T) } = (1/mn) Σ_{i : yi = +1} Σ_{j : yj = −1} E_{Xi∼D+1, Xj∼D−1} { I{f(Xi) > f(Xj)} + (1/2) I{f(Xi) = f(Xj)} }
                              = (1/mn) Σ_{i : yi = +1} Σ_{j : yj = −1} A(f)
                              = A(f) . □

3 Large Deviation Bound

We are interested in bounding the probability that the empirical AUC of a ranking function f with respect to a (random) test sequence T will have a large deviation from its expected ranking accuracy. 
In other words, we are interested in bounding probabilities of the form

    P { |Â(f; T) − A(f)| ≥ ε }

for given ε > 0. Our main tool in deriving such a large deviation bound will be the following powerful concentration inequality of McDiarmid [10], which bounds the deviation of any function of a sample for which a single change in the sample has limited effect:

Theorem 1 (McDiarmid, 1989). Let X1, . . . , XN be independent random variables with Xk taking values in a set Ak for each k. Let φ : (A1 × · · · × AN) → R be such that

    sup_{xi ∈ Ai, x′k ∈ Ak} |φ(x1, . . . , xN) − φ(x1, . . . , x_{k−1}, x′k, x_{k+1}, . . . , xN)| ≤ ck .

Then for any ε > 0,

    P { |φ(X1, . . . , XN) − E{φ(X1, . . . , XN)}| ≥ ε } ≤ 2 exp( −2ε² / Σ_{k=1}^N ck² ) .

Note that when X1, . . . , XN are independent bounded random variables with Xk ∈ [ak, bk] with probability one and φ(X1, . . . , XN) = Σ_{k=1}^N Xk, McDiarmid's inequality (with ck = bk − ak) reduces to Hoeffding's inequality. Next we define the following quantity which appears in several of our results:

Definition 2 (Positive skew). Let y = (y1, . . . , yN) ∈ Y^N be a finite label sequence of length N ∈ N. Define the positive skew of y, denoted by ρ(y), as follows:

    ρ(y) = (1/N) Σ_{i : yi = +1} 1 .

The following can be viewed as the main result of this paper. We note that our results are all distribution-free, in the sense that they hold for any distribution D over X × Y.

Theorem 2. Let f : X → R be a fixed ranking function on X and let y = (y1, . . . , yN) ∈ Y^N be any label sequence of length N ∈ N. Then for any ε > 0,

    P_{T_X|T_Y=y} { |Â(f; T) − A(f)| ≥ ε } ≤ 2 exp( −2ρ(y)(1 − ρ(y)) N ε² ) .

Proof. Let m be the number of positive labels in y, and n the number of negative labels in y. We can view T_X = (X1, . . . , XN) ∈ X^N as a random vector; given the label sequence y, the random variables X1, . . . , XN are independent, with each Xk taking values in X. Now, define φ : X^N → R as follows:

    φ(x1, . . . , xN) = Â( f; ((x1, y1), . . . , (xN, yN)) ) .

Then, for each k such that yk = +1, we have the following for all xi, x′k ∈ X:

    |φ(x1, . . . , xN) − φ(x1, . . . , x_{k−1}, x′k, x_{k+1}, . . . , xN)|
      = (1/mn) | Σ_{j : yj = −1} ( ( I{f(xk) > f(xj)} + (1/2) I{f(xk) = f(xj)} ) − ( I{f(x′k) > f(xj)} + (1/2) I{f(x′k) = f(xj)} ) ) |
      ≤ (1/mn) · n
      = 1/m .

Similarly, for each k such that yk = −1, one can show for all xi, x′k ∈ X:

    |φ(x1, . . . , xN) − φ(x1, . . . , x_{k−1}, x′k, x_{k+1}, . . . , xN)| ≤ 1/n .

Thus, taking ck = 1/m for k such that yk = +1 and ck = 1/n for k such that yk = −1, and applying McDiarmid's theorem, we get for any ε > 0,

    P_{T_X|T_Y=y} { |Â(f; T) − E_{T_X|T_Y=y}{Â(f; T)}| ≥ ε } ≤ 2 exp( −2ε² / (m(1/m)² + n(1/n)²) ) .        (2)

Now, from Lemma 1,

    E_{T_X|T_Y=y} { Â(f; T) } = A(f) .

Also, we have

    1 / (m(1/m)² + n(1/n)²) = 1 / (1/m + 1/n) = mn/(m + n) = ρ(y)(1 − ρ(y)) N .

Substituting the above into Eq. 
(2) gives the desired result. □

We note that the result of Theorem 2 can be strengthened so that the conditioning is only on the numbers m and n of positive and negative labels, and not on the specific label vector y.³ From Theorem 2, we can derive a confidence interval interpretation of the bound that gives, for any 0 < δ ≤ 1, a confidence interval based on the empirical AUC of a ranking function (on a random test sequence) which is likely to contain the true ranking accuracy with probability at least 1 − δ. More specifically, we have:

Corollary 1. Let f : X → R be a fixed ranking function on X and let y = (y1, . . . , yN) ∈ Y^N be any label sequence of length N ∈ N. Then for any 0 < δ ≤ 1,

    P_{T_X|T_Y=y} { |Â(f; T) − A(f)| ≥ √( ln(2/δ) / (2ρ(y)(1 − ρ(y)) N) ) } ≤ δ .

Proof. This follows directly from Theorem 2 by setting 2 exp(−2ρ(y)(1 − ρ(y)) N ε²) = δ and solving for ε. □

Theorem 2 also allows us to obtain an expression for a test sample size that is sufficient to obtain, for 0 < ε, δ ≤ 1, an ε-accurate estimate of the ranking accuracy with δ-confidence:

Corollary 2. Let f : X → R be a fixed ranking function on X and let 0 < ε, δ ≤ 1. Let y = (y1, . . . , yN) ∈ Y^N be any label sequence of length N ∈ N. If

    N ≥ ln(2/δ) / (2ρ(y)(1 − ρ(y)) ε²) ,

then

    P_{T_X|T_Y=y} { |Â(f; T) − A(f)| ≥ ε } ≤ δ .

Proof. 
This follows directly from Theorem 2 by setting 2 exp(−2ρ(y)(1 − ρ(y)) N ε²) ≤ δ and solving for N. □

Figure 1 illustrates the dependence of the above expression for the sufficient test sample size on the accuracy parameter ε and positive skew ρ(y) for different values of δ.

The confidence interval of Corollary 1 can in fact be generalized to remove the conditioning on the label vector completely:

Theorem 3. Let f : X → R be a fixed ranking function on X and let N ∈ N. Then for any 0 < δ ≤ 1,

    P_{T∼D^N} { |Â(f; T) − A(f)| ≥ √( ln(2/δ) / (2ρ(T_Y)(1 − ρ(T_Y)) N) ) } ≤ δ .

Proof. For T ∈ (X × Y)^N and 0 < δ ≤ 1, define the proposition

    Φ(T, δ) ≡ { |Â(f; T) − A(f)| ≥ √( ln(2/δ) / (2ρ(T_Y)(1 − ρ(T_Y)) N) ) } .

Then for any 0 < δ ≤ 1, we have

    P_T {Φ(T, δ)} = E_T { I{Φ(T, δ)} }
                  = E_{T_Y} { E_{T_X|T_Y=y} { I{Φ(T, δ)} } }
                  = E_{T_Y} { P_{T_X|T_Y=y} {Φ(T, δ)} }
                  ≤ E_{T_Y} {δ}        (by Corollary 1)
                  = δ . □

³Our thanks to an anonymous reviewer for pointing this out.

Figure 1: The test sample size N (based on Corollary 2) sufficient to obtain an ε-accurate estimate of the ranking accuracy with δ-confidence, for various values of the positive skew ρ ≡ ρ(y) for some label sequence y, for (left) δ = 0.01 and (right) δ = 0.001.

Note that the above ‘trick’ works only once we have gone to a confidence interval; an attempt to generalize the bound of Theorem 2 in a similar way gives an expression in which the final expectation is not easy to evaluate. 
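As a quick numerical illustration of Corollary 2 (our own sketch, not part of the paper; the helper name `sufficient_test_size` is ours), the sufficient test sample size can be computed directly from the accuracy ε, the confidence δ and the positive skew ρ:

```python
import math

def sufficient_test_size(eps, delta, rho):
    # Corollary 2: N >= ln(2/delta) / (2 * rho * (1 - rho) * eps^2)
    return math.ceil(math.log(2.0 / delta) / (2.0 * rho * (1.0 - rho) * eps ** 2))

# Balanced labels (rho = 1/2) require the fewest test examples;
# the required N grows as the skew departs from 1/2.
print(sufficient_test_size(0.05, 0.01, 0.5))  # eps = 0.05, delta = 0.01
print(sufficient_test_size(0.05, 0.01, 0.1))  # same accuracy, skewed labels
```

This reproduces the qualitative behaviour plotted in Figure 1: the sample size scales as 1/ε² and blows up as ρ approaches 0 or 1.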
Interestingly, the above proof does not even require a factorized distribution D_{T_Y} since it is built on a result for any fixed label sequence y. We note that the above technique could also be applied to generalize the results of [9] in a similar manner.

4 Comparison with Large Deviation Bound for Error Rate

Our use of McDiarmid's inequality in deriving the large deviation bound for the AUC of a ranking function is analogous to the use of Hoeffding's inequality in deriving a large deviation bound for the error rate of a classification function (e.g., see [6, Chapter 8]). The need for the more general inequality of McDiarmid in our derivations arises from the fact that the empirical AUC, unlike the empirical error rate, cannot be expressed as a sum of independent random variables.

Given a classification function h : X → Y, let L(h) denote the expected error rate of h:

    L(h) = E_{XY∼D} { I{h(X) ≠ Y} } .

Similarly, given a classification function h : X → Y and a finite data sequence T = ((x1, y1), . . . , (xN, yN)) ∈ (X × Y)^N, let L̂(h; T) denote the empirical error rate of h with respect to T:

    L̂(h; T) = (1/N) Σ_{i=1}^N I{h(xi) ≠ yi} .

Then the large deviation bound obtained via Hoeffding's inequality for the classification error rate states that for a fixed classification function h : X → Y and for any N ∈ N, ε > 0,

    P_{T∼D^N} { |L̂(h; T) − L(h)| ≥ ε } ≤ 2 exp(−2N ε²) .        (3)

Comparing Eq. (3) to the bound of Theorem 2, we see that the AUC bound differs from the error rate bound by a factor of ρ(y)(1 − ρ(y)) in the exponent. 
This difference translates into a 1/(ρ(y)(1 − ρ(y))) factor difference in the resulting sample size bounds: given 0 < ε, δ ≤ 1, the test sample size sufficient to obtain an ε-accurate estimate of the expected accuracy of a ranking function with δ-confidence is 1/(ρ(y)(1 − ρ(y))) times larger than the corresponding test sample size sufficient to obtain an ε-accurate estimate of the expected error rate of a classification function with the same confidence. For ρ(y) = 1/2, this means a sample size larger by a factor of 4; as the positive skew ρ(y) departs from 1/2, the factor grows larger (see Figure 2).

Figure 2: The test sample size bound for the AUC, for positive skew ρ ≡ ρ(y) for some label sequence y, is larger than the corresponding test sample size bound for the classification error rate by a factor of 1/(ρ(1 − ρ)).

5 Bound for Learned Ranking Functions Chosen from Finite Classes

The large deviation result of Theorem 2 bounds the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A simple application of the union bound allows the result to be extended to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on the training sequence from which it is learned, in the case when the learned ranking function is chosen from a finite function class. More specifically, we have:

Theorem 4. Let F be a finite class of real-valued functions on X and let fS ∈ F denote the ranking function chosen by a learning algorithm based on the training sequence S. Let y = (y1, . . . , yM) ∈ Y^M be any label sequence of length M ∈ N. 
Then for any ε > 0,

    P_{S_X|S_Y=y} { |Â(fS; S) − A(fS)| ≥ ε } ≤ 2|F| exp( −2ρ(y)(1 − ρ(y)) M ε² ) .

Proof. For any ε > 0, we have

    P_{S_X|S_Y=y} { |Â(fS; S) − A(fS)| ≥ ε } ≤ P_{S_X|S_Y=y} { max_{f∈F} |Â(f; S) − A(f)| ≥ ε }
                                             ≤ Σ_{f∈F} P_{S_X|S_Y=y} { |Â(f; S) − A(f)| ≥ ε }        (by the union bound)
                                             ≤ 2|F| exp( −2ρ(y)(1 − ρ(y)) M ε² )        (by Theorem 2) . □

As before, we can derive from Theorem 4 expressions for confidence intervals and sufficient training sample size. We give these here without proof:

Corollary 3. Under the assumptions of Theorem 4, for any 0 < δ ≤ 1,

    P_{S_X|S_Y=y} { |Â(fS; S) − A(fS)| ≥ √( (ln|F| + ln(2/δ)) / (2ρ(y)(1 − ρ(y)) M) ) } ≤ δ .

Corollary 4. Under the assumptions of Theorem 4, for any 0 < ε, δ ≤ 1, if

    M ≥ (1 / (2ρ(y)(1 − ρ(y)) ε²)) ( ln|F| + ln(2/δ) ) ,

then

    P_{S_X|S_Y=y} { |Â(fS; S) − A(fS)| ≥ ε } ≤ δ .

Theorem 5. Let F be a finite class of real-valued functions on X and let fS ∈ F denote the ranking function chosen by a learning algorithm based on the training sequence S. Let M ∈ N. 
Then for any 0 < δ ≤ 1,

    P_{S∼D^M} { |Â(fS; S) − A(fS)| ≥ √( (ln|F| + ln(2/δ)) / (2ρ(S_Y)(1 − ρ(S_Y)) M) ) } ≤ δ .

6 Conclusion

We have derived a distribution-free large deviation bound for the area under the ROC curve (AUC), a quantity used as an evaluation criterion for the bipartite ranking problem. Our result parallels the classical large deviation result for the classification error rate obtained via Hoeffding's inequality. Since the AUC cannot be expressed as a sum of independent random variables, a more powerful inequality of McDiarmid was required. A comparison with the corresponding large deviation result for the error rate suggests that, in the distribution-free setting, the test sample size required to obtain an ε-accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than the test sample size required to obtain a similar estimate of the expected error rate of a classification function. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.

A possible route for deriving an alternative large deviation bound for the AUC could be via the theory of U-statistics; the AUC can be expressed as a two-sample U-statistic, and therefore it may be possible to apply specialized results from U-statistic theory (see, for example, [5]) to the AUC.

References

[1] S. Agarwal, S. Har-Peled, and D. Roth. A uniform convergence bound for the area under the ROC curve. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005.

[2] W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243-270, 1999.

[3] C. Cortes and M. Mohri. 
AUC optimization vs. error rate minimization. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, 2004.

[4] K. Crammer and Y. Singer. Pranking with ranking. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, 2002.

[5] V. H. de la Peña and E. Giné. Decoupling: From Dependence to Independence. Springer-Verlag, New York, 1999.

[6] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.

[7] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933-969, 2003.

[8] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pages 115-132, 2000.

[9] S. I. Hill, H. Zaragoza, R. Herbrich, and P. J. W. Rayner. Average precision and the problem of generalisation. In Proceedings of the ACM SIGIR Workshop on Mathematical and Formal Methods in Information Retrieval, 2002.

[10] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989, pages 148-188. Cambridge University Press, 1989.
", "award": [], "sourceid": 2544, "authors": [{"given_name": "Shivani", "family_name": "Agarwal", "institution": null}, {"given_name": "Thore", "family_name": "Graepel", "institution": null}, {"given_name": "Ralf", "family_name": "Herbrich", "institution": null}, {"given_name": "Dan", "family_name": "Roth", "institution": null}]}