{"title": "Overlaying classifiers: a practical approach for optimal ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 313, "page_last": 320, "abstract": "ROC curves are one of the most widely used displays to evaluate performance of scoring functions. In the paper, we propose a statistical method for directly optimizing the ROC curve. The target is known to be the regression function up to an increasing transformation and this boils down to recovering the level sets of the latter. We propose to use classifiers obtained by empirical risk minimization of a weighted classification error and then to construct a scoring rule by overlaying these classifiers. We show the consistency and rate of convergence to the optimal ROC curve of this procedure in terms of supremum norm and also, as a byproduct of the analysis, we derive an empirical estimate of the optimal ROC curve.", "full_text": "Overlaying classifiers: a practical approach for optimal ranking\n\nSt\u00e9phan Cl\u00e9men\u00e7on\n\nTelecom Paristech (TSI) - LTCI UMR Institut Telecom/CNRS 5141\n\nstephan.clemencon@telecom-paristech.fr\n\nNicolas Vayatis\n\nENS Cachan & UniverSud - CMLA UMR CNRS 8536\n\nvayatis@cmla.ens-cachan.fr\n\nAbstract\n\nROC curves are one of the most widely used displays to evaluate performance of scoring functions. In the paper, we propose a statistical method for directly optimizing the ROC curve. The target is known to be the regression function up to an increasing transformation and this boils down to recovering the level sets of the latter. We propose to use classifiers obtained by empirical risk minimization of a weighted classification error and then to construct a scoring rule by overlaying these classifiers. 
We show the consistency and rate of convergence to the optimal ROC curve of this procedure in terms of supremum norm and also, as a byproduct of the analysis, we derive an empirical estimate of the optimal ROC curve.\n\n1 Introduction\n\nIn applications such as medical diagnosis, credit risk screening or information retrieval, one aims at ordering instances under binary label information. The problem of ranking binary classification data is known in the machine learning literature as the bipartite ranking problem ([FISS03], [AGH+05], [CLV08]). A natural approach is to find a real-valued scoring function which mimics the order induced by the regression function. A classical performance measure for scoring functions is the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate ([vT68], [Ega75]). The ROC curve offers a graphical display which makes it possible to judge at a glance how well a scoring rule discriminates between the two populations (positive against negative). A scoring rule whose ROC curve is close to the diagonal line does not discriminate at all, while one lying above all others is the best possible choice. From a statistical learning perspective, risk minimization (or performance maximization) strategies for bipartite ranking have been based mostly on a popular summary of the ROC curve known as the Area Under the ROC Curve (AUC - see [CLV08], [FISS03], [AGH+05]), which corresponds to the L1-metric on the space of ROC curves. In the present paper, we propose a statistical methodology to estimate the optimal ROC curve in a sense stronger than the AUC, namely in the sense of the supremum norm. At the same time, we will explain how to build a nearly optimal scoring function. 
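For readers who want to experiment, the empirical ROC curve and its AUC summary discussed above can be computed from a labeled sample in a few lines. This is an illustrative sketch (function names are ours, not the paper's); it ignores tied scores for simplicity:\n\n```python\nimport numpy as np\n\ndef empirical_roc(scores, labels):\n    """Empirical ROC curve of a scoring rule: true positive rate\n    against false positive rate over all decreasing thresholds.\n    labels are +1 / -1 as in the bipartite ranking setup."""\n    scores = np.asarray(scores, dtype=float)\n    labels = np.asarray(labels)\n    order = np.argsort(-scores)            # descending thresholds\n    pos = (labels[order] == 1)\n    tpr = np.concatenate(([0.0], np.cumsum(pos) / max(pos.sum(), 1)))\n    fpr = np.concatenate(([0.0], np.cumsum(~pos) / max((~pos).sum(), 1)))\n    return fpr, tpr\n\ndef auc(fpr, tpr):\n    """Area under the ROC curve by the trapezoidal rule,\n    i.e. the L1-type summary criterion mentioned in the text."""\n    fpr, tpr = np.asarray(fpr), np.asarray(tpr)\n    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))\n```\n\nA scorer that ranks every positive above every negative yields AUC 1, while the supremum-norm criterion studied below compares the whole curve pointwise rather than through this single number.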
Our approach is based on a simple observation: optimal scoring functions can be represented from the collection of level sets of the regression function. Hence, the bipartite ranking problem may be viewed as a 'continuum' of classification problems with asymmetric costs where the targets are the level sets. In a nonparametric setup, regression or density level sets can be estimated with plug-in methods ([Cav97], [RV06], [AA07], [WN07], ...). Here, we take a different approach based on a weighted empirical risk minimization principle. We provide rates of convergence with which an optimal point of the ROC curve can be recovered according to this principle. We also develop a practical ranking method based on a discretization of the original problem. From the resulting classifiers and their related empirical errors, we show how to build a piecewise linear estimate of the optimal ROC curve and a quasi-optimal piecewise constant scoring function. Rate bounds in terms of the supremum norm on ROC curves for these procedures are also established.\n\nThe rest of the paper is organized as follows: in Section 2, we present the problem and give some properties of ROC curves; in Section 3, we provide a statistical result for weighted empirical risk minimization; and in Section 4, we develop the main results of the paper, which describe the statistical performance of a scoring rule based on overlaying classifiers as well as the rate of convergence of the empirical estimate of the optimal ROC curve.\n\n2 Bipartite ranking, scoring rules and ROC curves\n\nSetup. We study the ranking problem for classification data with binary labels. The data are assumed to be generated as i.i.d. copies of a random pair (X, Y ) \u2208 X \u00d7 {\u22121, +1}, where X is a random descriptor living in the measurable space X and Y represents its binary label (relevant vs. irrelevant, healthy vs. sick, ...). 
We denote by P = (\u00b5, \u03b7) the distribution of (X, Y ), where \u00b5 is the marginal distribution of X and \u03b7 is the regression function (up to an affine transformation): \u03b7(x) = P{Y = 1 | X = x}, x \u2208 X . We will also denote by p = P{Y = 1} the proportion of positive labels. In the sequel, we assume that the distribution \u00b5 is absolutely continuous with respect to Lebesgue measure.\n\nOptimal scoring rules. We consider the approach where the ordering can be derived by means of a scoring function s : X \u2192 R: one expects that the higher the value s(X) is, the more likely the event \u201cY = +1\u201d should be observed. The following definition sets the goal of learning methods in the setup of bipartite ranking.\n\nDefinition 1 (Optimal scoring functions) The class of optimal scoring functions is given by the set\n\nS\u2217 = { s\u2217 = T \u25e6 \u03b7 | T : [0, 1] \u2192 R strictly increasing }.\n\nInterestingly, it is possible to make the connection between an arbitrary (bounded) optimal scoring function s\u2217 \u2208 S\u2217 and the distribution P (through the regression function \u03b7) completely explicit.\n\nProposition 1 (Optimal scoring functions representation, [CV08]) A bounded scoring function s\u2217 is optimal if and only if there exist a nonnegative integrable function w and a continuous random variable V in (0, 1) such that:\n\n\u2200x \u2208 X , s\u2217(x) = inf_X s\u2217 + E (w(V ) \u00b7 I{\u03b7(x) > V }) .\n\nA crucial consequence of the last proposition is that solving the bipartite ranking problem amounts to recovering the collection {x \u2208 X | \u03b7(x) > u}, u \u2208 (0, 1), of level sets of the regression function \u03b7. Hence, the bipartite ranking problem can be seen as a collection of overlaid classification problems. This view was first introduced in [CV07] and the present paper is devoted to the description of a statistical method implementing this idea.\n\nROC curves. 
We now recall the concept of ROC curve and explain why it is a natural choice of performance measure for the ranking problem with classification data. We consider here only true ROC curves, which correspond to the situation where the underlying distribution is known. First, we need to introduce some notations. For a given scoring rule s, the conditional cdfs of the random variable s(X) are denoted by Gs and Hs. We also set, for all z \u2208 R:\n\n\u00afGs(z) = 1 \u2212 Gs(z) = P{s(X) > z | Y = +1} ,\n\u00afHs(z) = 1 \u2212 Hs(z) = P{s(X) > z | Y = \u22121}\n\nto be the residual conditional cdfs of the random variable s(X). When s = \u03b7, we shall denote the previous functions by G\u2217, H\u2217, \u00afG\u2217, \u00afH\u2217 respectively. We introduce the notation Q(Z, \u03b1) to denote the quantile of order 1 \u2212 \u03b1 for the distribution of a random variable Z conditioned on the event Y = \u22121. In particular, the following quantile will be of interest:\n\nQ\u2217(\u03b1) = Q(\u03b7(X), \u03b1) = ( \u00afH\u2217)\u22121(\u03b1) ,\n\nwhere we have used here the notion of generalized inverse F\u22121 of a c\u00e0dl\u00e0g function F : F\u22121(z) = inf{t \u2208 R | F (t) \u2265 z}. We now turn to the definition of the ROC curve.\n\nDefinition 2 (True ROC curve) The ROC curve of a scoring function s is the parametric curve:\n\nz \u21a6 ( \u00afHs(z), \u00afGs(z) )\n\nfor thresholds z \u2208 R. 
It can also be defined as the plot of the function:\n\nROC(s, \u00b7 ) : \u03b1 \u2208 [0, 1] \u21a6 \u00afGs \u25e6 ( \u00afHs)\u22121(\u03b1) = \u00afGs (Q(s(X), \u03b1)) .\n\nBy convention, points of the curve corresponding to possible jumps (due to possible degenerate points of Hs or Gs) are connected by line segments, so that the ROC curve is always continuous. For s = \u03b7, we take the notation ROC\u2217(\u03b1) = ROC(\u03b7, \u03b1). The residual cdf \u00afGs is also called the true positive rate while \u00afHs is the false positive rate, so that the ROC curve is the plot of the true positive rate against the false positive rate.\n\nNote that, as a functional criterion, the ROC curve induces a partial order over the space of all scoring functions. Some scoring function might provide a better ranking on some part of the observation space and a worse one on some other. A natural step to take is to consider local properties of the ROC curve in order to focus on best instances, but this is not straightforward, as explained in [CV07]. We expect optimal scoring functions to be those for which the ROC curve dominates all the others for all \u03b1 \u2208 (0, 1). The next proposition highlights the fact that the ROC curve is relevant when evaluating performance in the bipartite ranking problem.\n\nProposition 2 The class S\u2217 of optimal scoring functions provides the best possible ranking with respect to the ROC curve. Indeed, for any scoring function s, we have:\n\n\u2200\u03b1 \u2208 (0, 1) , ROC\u2217(\u03b1) \u2265 ROC(s, \u03b1) ,\n\nand \u2200s\u2217 \u2208 S\u2217, \u2200\u03b1 \u2208 (0, 1) , ROC(s\u2217, \u03b1) = ROC\u2217(\u03b1).\n\nThe following result will be needed later.\n\nProposition 3 We assume that the optimal ROC curve is differentiable. 
Then, we have, for any \u03b1 such that Q\u2217(\u03b1) < 1:\n\n(d/d\u03b1) ROC\u2217(\u03b1) = ((1 \u2212 p)/p) \u00b7 Q\u2217(\u03b1)/(1 \u2212 Q\u2217(\u03b1)) .\n\nFor proofs of the previous propositions and more details on true ROC curves, we refer to [CV08].\n\n3 Recovering a point on the optimal ROC curve\n\nWe consider here the problem of recovering a single point of the optimal ROC curve from a sample of i.i.d. copies {(Xi, Yi)}i=1,...,n of (X, Y ). This amounts to recovering a single level set of the regression function \u03b7, but we aim at controlling the error in terms of rates of false positive and true positive. For any measurable set C \u2282 X , we set the following notations:\n\n\u03b1(C) = P(X \u2208 C | Y = \u22121) and \u03b2(C) = P(X \u2208 C | Y = +1) .\n\nWe also define the weighted classification error:\n\nL\u03c9(C) = 2p(1 \u2212 \u03c9) (1 \u2212 \u03b2(C)) + 2(1 \u2212 p)\u03c9 \u03b1(C) ,\n\nwith \u03c9 \u2208 (0, 1) being the asymmetry factor.\n\nProposition 4 The optimal set for this error measure is C\u2217_\u03c9 = {x : \u03b7(x) > \u03c9}. We have indeed, for all C \u2282 X :\n\nL\u03c9(C\u2217_\u03c9) \u2264 L\u03c9(C) .\n\nAlso the optimal error is given by:\n\nL\u03c9(C\u2217_\u03c9) = 2E min{\u03c9(1 \u2212 \u03b7(X)), (1 \u2212 \u03c9)\u03b7(X)} .\n\nThe excess risk for an arbitrary set C can be written:\n\nL\u03c9(C) \u2212 L\u03c9(C\u2217_\u03c9) = 2E (| \u03b7(X) \u2212 \u03c9 | I{X \u2208 C\u2206C\u2217_\u03c9}) ,\n\nwhere \u2206 stands for the symmetric difference between sets.\n\nThe empirical counterpart of the weighted classification error can be defined as:\n\n\u02c6L\u03c9(C) = (2\u03c9/n) \u2211_{i=1}^{n} I{Yi = \u22121, Xi \u2208 C} + (2(1 \u2212 \u03c9)/n) \u2211_{i=1}^{n} I{Yi = +1, Xi \u2209 C} .\n\nThis leads to consider the weighted empirical risk minimizer over a class C of candidate sets:\n\n\u02c6C\u03c9 = arg min_{C\u2208C} \u02c6L\u03c9(C).\n\nThe next result provides rates of convergence of the weighted empirical risk minimizer \u02c6C\u03c9 to the best set in the class in terms of the two types of error \u03b1 and \u03b2.\n\nTheorem 1 Let \u03c9 \u2208 (0, 1). Assume that C is of finite VC dimension V and contains C\u2217_\u03c9. Suppose also that both G\u2217 and H\u2217 are twice continuously differentiable with strictly positive first derivatives and that ROC\u2217 has a bounded second derivative. 
Then, for all \u03b4 > 0, there exists a constant c(V ) independent of \u03c9 such that, with probability at least 1 \u2212 \u03b4:\n\n|\u03b1( \u02c6C\u03c9) \u2212 \u03b1(C\u2217_\u03c9)| \u2264 (c(V ) / \u221a(p(1 \u2212 \u03c9))) \u00b7 (log(1/\u03b4)/n)^{1/3} .\n\nThe same result also holds for the excess risk of \u02c6C\u03c9 in terms of the rate \u03b2 of true positive, with a factor term of \u221a((1 \u2212 p)\u03c9) in the denominator instead.\n\nIt is noteworthy that, while convergence in terms of classification error is expected to be of the order of n^{\u22121/2}, its two components corresponding to the rate of false positive and true positive present slower rates.\n\n4 Nearly optimal scoring rule based on overlaying classifiers\n\nMain result. We now propose to collect the classifiers studied in the previous section in order to build a scoring function for the bipartite ranking problem. From Proposition 1, we can focus on optimal scoring rules of the form:\n\ns\u2217(x) = \u222b I{x \u2208 C\u2217_\u03c9} \u03bd(d\u03c9),   (1)\n\nwhere the integral is taken w.r.t. any positive measure \u03bd with same support as the distribution of \u03b7(X). Consider a fixed partition \u03c90 = 0 < \u03c91 \u2264 . . . \u2264 \u03c9K \u2264 1 = \u03c9K+1 of the interval (0, 1). We can then construct an estimator of s\u2217 by overlaying a finite collection of (estimated) density level sets \u02c6C\u03c91, . . . , \u02c6C\u03c9K :\n\n\u02c6s(x) = \u2211_{i=1}^{K} I{x \u2208 \u02c6C\u03c9i},\n\nwhich may be seen as an empirical counterpart of a discretized version of the target s\u2217. In order to assess the performance of such an estimator, we need to compare the ROC curve of \u02c6s to the optimal ROC curve. 
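Before turning to that comparison, here is a minimal sketch of the overlay construction of \u02c6s above. The level-set "classifiers" below are stand-ins obtained by thresholding an assumed plug-in estimate of \u03b7 (the names eta_hat, levels are ours), not the weighted empirical risk minimizers analysed in Section 3:\n\n```python\nimport numpy as np\n\ndef overlay_scoring_rule(level_set_indicators):\n    """Given K indicator functions x -> bool, one per level w_1 < ... < w_K,\n    return the piecewise constant scoring rule\n    s_hat(x) = sum_i I{x in C_hat_{w_i}}."""\n    def s_hat(x):\n        return sum(int(c(x)) for c in level_set_indicators)\n    return s_hat\n\n# Toy example: a sigmoidal stand-in for a plug-in estimate of eta,\n# increasing in a one-dimensional feature x, with levels 0.25, 0.5, 0.75.\neta_hat = lambda x: 1.0 / (1.0 + np.exp(-x))\nlevels = [0.25, 0.5, 0.75]\nclassifiers = [lambda x, w=w: eta_hat(x) > w for w in levels]\ns = overlay_scoring_rule(classifiers)\n```\n\nWith good level-set estimates, s takes values 0, 1, ..., K and ranks points by how many estimated level sets contain them, which is exactly the discretized scoring rule written above.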
However, if the sequence { \u02c6C\u03c9i}i=1,...,K is not decreasing, the computation of the ROC curve as a function of the errors of the overlaying classifiers becomes complicated. The main result of the paper is the next theorem, which is proved for a modified sequence yielding a different estimator. We introduce { \u02dcC\u03c9i}1\u2264i\u2264K defined by:\n\n\u02dcC\u03c91 = \u02c6C\u03c91 and \u02dcC\u03c9i+1 = \u02dcC\u03c9i \u222a \u02c6C\u03c9i+1 for all i \u2208 {1, . . . , K \u2212 1} .\n\nThe corresponding scoring function is then given by:\n\n\u02dcsK(x) = \u2211_{i=1}^{K} I{x \u2208 \u02dcC\u03c9i} .   (2)\n\nHence, the ROC curve of \u02dcsK is simply the broken line that connects the knots (\u03b1( \u02dcC\u03c9i), \u03b2( \u02dcC\u03c9i)), 0 \u2264 i \u2264 K + 1. The next result offers a rate bound in the ROC space, equipped with a sup-norm. Up to our knowledge, this is the first result on the generalization ability of decision rules in such a functional space.\n\nTheorem 2 Under the same assumptions as in Theorem 1 and with the previous notations, we set K = Kn \u223c n^{1/8}. Fix \u03b5 > 0. Then, there exists a constant c such that, with probability at least 1 \u2212 \u03b4, we have:\n\nsup_{\u03b1\u2208[\u03b5,1\u2212\u03b5]} |ROC\u2217(\u03b1) \u2212 ROC(\u02dcsK, \u03b1)| \u2264 c log(1/\u03b4) / (\u03b5 n^{1/4}) .\n\nRemark 1 (PERFORMANCE OF CLASSIFIERS AND ROC CURVES.) In the present paper, we have adopted a scoring approach to ROC analysis which is somehow related to the evaluation of the performance of classifiers in ROC space. Using combinations of such classifiers to improve performance in terms of ROC curves has also been pointed out in [BDH06] and [BCT07].\n\nRemark 2 (PLUG-IN ESTIMATOR OF THE REGRESSION FUNCTION.) Note that taking \u03bd = \u03bb the Lebesgue measure over [0, 1] in the expression of s\u2217 leads to the regression function \u03b7(x) = \u222b I{x \u2208 C\u2217_\u03c9} d\u03c9. 
An estimator for the regression function could be the following: \u02c6\u03b7K(x) = \u2211_{i=1}^{K+1} (\u03c9i \u2212 \u03c9i\u22121) I{x \u2208 \u02dcC\u03c9i}.\n\nRemark 3 (ADAPTIVITY OF THE PARTITION.) A natural extension of the approach would be to consider a flexible partition (\u03c9i)i which could possibly be chosen adaptively, depending on the local regularity of the ROC curve. For now, it is not clear how to extend the method of the paper to take adaptive partitions into account; we have investigated such partitions, corresponding to different approximation schemes of the optimal ROC curve, elsewhere ([CV08]), but the rates of convergence obtained in the present paper are faster.\n\nOptimal ROC curve approximation and estimation. We now provide some insights on the previous result. The key for the proof of Theorem 2 is the idea of a piecewise linear approximation of the optimal ROC curve. We introduce some notations. Let \u03c90 = 0 < \u03c91 < . . . < \u03c9K < \u03c9K+1 = 1 be a given partition of [0, 1] such that max_{i\u2208{0,...,K}} {\u03c9i+1 \u2212 \u03c9i} \u2264 \u03b4. Set, for all i \u2208 {0, . . . , K + 1}: \u03b1\u2217_i = \u03b1(C\u2217_\u03c9i) and \u03b2\u2217_i = \u03b2(C\u2217_\u03c9i). The broken line that connects the knots {(\u03b1\u2217_i , \u03b2\u2217_i ); 0 \u2264 i \u2264 K + 1} provides a piecewise linear (concave) approximation/interpolation of the optimal ROC curve ROC\u2217. In the spirit of the finite element method (FEM, see [dB01] for instance), we introduce the \u201chat functions\u201d defined by:\n\n\u2200i \u2208 {1, . . . , K \u2212 1}, \u03c6\u2217_i ( \u00b7 ) = \u03c6( \u00b7 ; (\u03b1\u2217_{i\u22121}, \u03b1\u2217_i )) \u2212 \u03c6( \u00b7 ; (\u03b1\u2217_i , \u03b1\u2217_{i+1})),\n\nwith the notation \u03c6(\u03b1, (\u03b11, \u03b12)) = (\u03b1 \u2212 \u03b11)/(\u03b12 \u2212 \u03b11) \u00b7 I{\u03b1 \u2208 [\u03b11, \u03b12]} for all \u03b11 < \u03b12. 
We also set \u03c6\u2217_K ( \u00b7 ) = \u03c6( \u00b7 ; (\u03b1\u2217_K , 1)) for notational convenience. The piecewise linear approximation of ROC\u2217 may then be written as:\n\n\u02dcROC\u2217(\u03b1) = \u2211_{i=1}^{K} \u03b2\u2217_i \u03c6\u2217_i (\u03b1) .\n\nIn order to obtain an empirical estimator of \u02dcROC\u2217(\u03b1), we propose: i) to find an estimate \u02c6C\u03c9i of the true level set C\u2217_\u03c9i based on the training sample {(Xi, Yi)}i=1,...,n as in Section 3; ii) to compute the corresponding errors \u02c6\u03b1i and \u02c6\u03b2i using a test sample {(X\u2032i, Y\u2032i)}i=1,...,n. Hence we define:\n\n\u02c6\u03b1i(C) = (1/n\u2212) \u2211_{j=1}^{n} I{X\u2032j \u2208 C, Y\u2032j = \u22121} and \u02c6\u03b2i(C) = (1/n+) \u2211_{j=1}^{n} I{X\u2032j \u2208 C, Y\u2032j = +1},\n\nwith n+ = \u2211_{j=1}^{n} I{Y\u2032j = +1} = n \u2212 n\u2212. We set \u02c6\u03b1i = \u02c6\u03b1i( \u02c6C\u03c9i) and \u02c6\u03b2i = \u02c6\u03b2i( \u02c6C\u03c9i). We propose the following estimator of \u02dcROC\u2217(\u03b1):\n\n\u02c6ROC\u2217(\u03b1) = \u2211_{i=1}^{K} \u02c6\u03b2i \u02c6\u03c6i(\u03b1),\n\nwhere \u02c6\u03c6K( \u00b7 ) = \u03c6( \u00b7 ; (\u02c6\u03b1K, 1)) and \u02c6\u03c6i( \u00b7 ) = \u03c6( \u00b7 ; (\u02c6\u03b1i\u22121, \u02c6\u03b1i)) \u2212 \u03c6( \u00b7 ; (\u02c6\u03b1i, \u02c6\u03b1i+1)) for 1 \u2264 i < K. Hence, \u02c6ROC\u2217 is the broken line connecting the empirical knots {(\u02c6\u03b1i, \u02c6\u03b2i); 0 \u2264 i \u2264 K + 1}. The next result takes the form of a deviation bound for the estimation of the optimal ROC curve. It quantifies the order of magnitude of a confidence band in supremum norm around an empirical estimate based on the previous approximation scheme with empirical counterparts.\n\nTheorem 3 Under the same assumptions as in Theorem 1 and with 
the previous notations, set K = Kn \u223c n^{1/6}. Fix \u03b5 > 0. Then, there exists a constant c such that, with probability at least 1 \u2212 \u03b4,\n\nsup_{\u03b1\u2208[\u03b5,1\u2212\u03b5]} |\u02c6ROC\u2217(\u03b1) \u2212 ROC\u2217(\u03b1)| \u2264 c \u03b5\u22121 (log(n/\u03b4)/n)^{1/3} .\n\n5 Conclusion\n\nWe have provided a strategy based on overlaid classifiers to build a nearly-optimal scoring function. Statistical guarantees are provided in terms of rates of convergence for a functional criterion, namely the supremum norm in the ROC space. This is the first theoretical result of this nature. To conclude, we point out that ROC analysis raises important and novel issues for statistical learning and we hope that the present contribution gives a flavor of possible research directions.\n\nAppendix - Proof section\n\nProof of Theorem 1. The idea of the proof is to relate the excess risk in terms of \u03b1-error to the excess risk in terms of weighted classification error. First we re-parameterize the weighted classification error. Set C(\u03b1) = {x \u2208 X | \u03b7(x) > Q\u2217(\u03b1)} and:\n\n\u2113\u03c9(\u03b1) = L\u03c9(C(\u03b1)) = 2(1 \u2212 p)\u03c9 \u03b1 + 2p(1 \u2212 \u03c9)(1 \u2212 ROC\u2217(\u03b1)) .\n\nSince ROC\u2217 is assumed to be differentiable and using Proposition 3, it is easy to check that the value \u03b1\u2217 = \u03b1(C\u2217_\u03c9) minimizes \u2113\u03c9(\u03b1). Denote by \u2113\u2217_\u03c9 = \u2113\u03c9(\u03b1\u2217). It follows from a Taylor expansion of \u2113\u03c9(\u03b1) around \u03b1\u2217 at the second order that there exists \u03b10 \u2208 [0, 1] such that:\n\n\u2113\u03c9(\u03b1) = \u2113\u2217_\u03c9 \u2212 p(1 \u2212 \u03c9) (d\u00b2/d\u03b1\u00b2) ROC\u2217(\u03b10) (\u03b1 \u2212 \u03b1\u2217)\u00b2 .\n\nUsing also the fact that ROC\u2217 dominates any other curve of the ROC space, we have: \u2200C \u2282 X measurable, \u03b2(C) \u2264 ROC\u2217(\u03b1(C)). 
Also, by assumption, there exists m such that: \u2200\u03b1 \u2208 [0, 1], (d\u00b2/d\u03b1\u00b2) ROC\u2217(\u03b1) \u2265 \u2212m. Hence, since \u2113\u03c9(\u03b1( \u02c6C\u03c9)) = L\u03c9( \u02c6C\u03c9), we have:\n\n(\u03b1( \u02c6C\u03c9) \u2212 \u03b1(C\u2217_\u03c9))\u00b2 \u2264 (1/(m p(1 \u2212 \u03c9))) (L\u03c9( \u02c6C\u03c9) \u2212 L\u03c9(C\u2217_\u03c9)) .\n\nWe have obtained the desired inequality. It remains to get the rate of convergence for the weighted empirical risk. Now set F \u2217 = pG\u2217 + (1 \u2212 p)H\u2217. We observe that: \u2200t > 0, P(|\u03b7(X) \u2212 \u03c9| \u2264 t) = F \u2217(\u03c9 + t) \u2212 F \u2217(\u03c9 \u2212 t) \u2264 2t sup_u (F \u2217)\u2032(u). We have thus shown that the distribution satisfies a modified Tsybakov margin condition [Tsy04], for all \u03c9 \u2208 [0, 1], of the form:\n\nP(|\u03b7(X) \u2212 \u03c9| \u2264 t) \u2264 D t^{\u03b3/(1\u2212\u03b3)} ,\n\nwith \u03b3 = 1/2 and D = 2 sup_u (F \u2217)\u2032(u). Adapting slightly the argument used in [Tsy04], [BBL05], we have that, under the modified margin condition, there exists a constant c such that, with probability 1 \u2212 \u03b4:\n\nL\u03c9( \u02c6C\u03c9) \u2212 L\u03c9(C\u2217_\u03c9) \u2264 c (log(1/\u03b4)/n)^{1/(2\u2212\u03b3)} .\n\nProof of Theorem 2. 
We note \u02dc\u03b1i = \u03b1( \u02dcC\u03c9i), \u02dc\u03b2i = \u03b2( \u02dcC\u03c9i) and also \u02dc\u03c6i( \u00b7 ) = \u03c6( \u00b7 ; (\u02dc\u03b1i\u22121, \u02dc\u03b1i)) \u2212 \u03c6( \u00b7 ; (\u02dc\u03b1i, \u02dc\u03b1i+1)). We then have ROC(\u02dcsK, \u03b1) = \u2211_{i=1}^{K} \u02dc\u03b2i \u02dc\u03c6i(\u03b1) and we can use the following decomposition, for any \u03b1 \u2208 [0, 1]:\n\nROC\u2217(\u03b1) \u2212 ROC(\u02dcsK, \u03b1) = (ROC\u2217(\u03b1) \u2212 \u2211_{i=1}^{K} ROC\u2217(\u02dc\u03b1i) \u02dc\u03c6i(\u03b1)) + \u2211_{i=1}^{K} (ROC\u2217(\u02dc\u03b1i) \u2212 \u02dc\u03b2i) \u02dc\u03c6i(\u03b1) .\n\nIt is well-known folklore in linear approximation theory ([dB01]) that if \u02dcsK is a piecewise constant scoring function whose ROC curve interpolates the points {(\u02dc\u03b1i, ROC\u2217(\u02dc\u03b1i))}i=0,...,K of the optimal ROC curve, then the first term (which is positive) is bounded, \u2200\u03b1 \u2208 [0, 1], by:\n\n\u2212(1/8) inf_{\u03b1\u2208[0,1]} (d\u00b2/d\u03b1\u00b2) ROC\u2217(\u03b1) \u00b7 max_{0\u2264i\u2264K} (\u02dc\u03b1i+1 \u2212 \u02dc\u03b1i)\u00b2 .\n\nNow, to control the second term, we upper bound the following quantity:\n\n|ROC\u2217(\u02dc\u03b1i) \u2212 \u02dc\u03b2i| \u2264 sup_{\u03b1\u2208[0,1]} (d/d\u03b1) ROC\u2217(\u03b1) \u00b7 |\u02dc\u03b1i \u2212 \u03b1\u2217_i | + |\u03b2\u2217_i \u2212 \u02dc\u03b2i| .\n\nWe further bound |\u02dc\u03b1i \u2212 \u03b1\u2217_i | \u2264 |\u02dc\u03b1i \u2212 \u03b1i| + |\u03b1i \u2212 \u03b1\u2217_i |, where \u03b1i = \u03b1( \u02c6C\u03c9i). In order to deal with the first term, the next lemma will be needed:\n\nLemma 1 We have, for all k \u2208 {1, . . . , K}:\n\n\u03b1( \u02dcC\u03c9k) = \u03b1( \u02c6C\u03c9k) + (k \u2212 1) OP(n\u22121/4) ,\n\nwhere the notation OP(1) is used for a r.v. which is bounded in probability.\n\nFrom the lemma, it follows that: max_{1\u2264i\u2264K} |\u02dc\u03b1i \u2212 \u03b1i| = OP(Kn\u22121/4). 
We can then use Theorem 1 with \u03b4 replaced by \u03b4/K to get that max_{1\u2264i\u2264K} |\u03b1i \u2212 \u03b1\u2217_i | = OP((n\u22121 log K)^{1/3}). The same inequalities hold with the \u03b2's. It remains to control the quantity \u02dc\u03b1i+1 \u2212 \u02dc\u03b1i. We have:\n\n|\u02dc\u03b1i+1 \u2212 \u02dc\u03b1i| \u2264 max_{1\u2264k\u2264K} |\u03b1( \u02c6C\u03c9k) \u2212 \u03b1( \u02c6C\u03c9k\u22121)| + K OP(n\u22121/4) .\n\nWe have that:\n\nmax_{1\u2264k\u2264K} |\u03b1( \u02c6C\u03c9k) \u2212 \u03b1( \u02c6C\u03c9k\u22121)| \u2264 2 max_{1\u2264k\u2264K} |\u03b1( \u02c6C\u03c9k) \u2212 \u03b1(C\u2217_\u03c9k)| + max_{1\u2264k\u2264K} |\u03b1(C\u2217_\u03c9k) \u2212 \u03b1(C\u2217_\u03c9k\u22121)| .\n\nAs before, the first term is of the order (log K/n)^{1/3} and, since the second derivative of the optimal ROC curve is bounded, the second term is of the order K\u22121. Eventually, we choose K in order to optimize the quantity: K\u22122 + (log K/n)^{2/3} + K\u00b2n\u22121/2 + Kn\u22121/4 + (log K/n)^{1/3}. As only the first and the third term matter, this leads to the choice of K = Kn \u223c n^{1/8}.\n\nProof of Lemma 1. We have that \u03b1( \u02dcC\u03c92) = \u03b1( \u02c6C\u03c92) + \u03b1( \u02c6C\u03c91 \\ \u02c6C\u03c92). Therefore, since C\u2217_\u03c92 \u2282 C\u2217_\u03c91 and observing that\n\n\u03b1( \u02c6C\u03c91 \\ \u02c6C\u03c92) = \u03b1((( \u02c6C\u03c91 \\ C\u2217_\u03c91) \u222a ( \u02c6C\u03c91 \u2229 C\u2217_\u03c91)) \\ (( \u02c6C\u03c92 \\ C\u2217_\u03c92) \u222a ( \u02c6C\u03c92 \u2229 C\u2217_\u03c92))) ,\n\nit suffices to use the additivity of the probability measure \u03b1(.) 
to get: \u03b1( \u02dcC\u03c92) = \u03b1( \u02c6C\u03c92) + OP(n\u22121/4). Eventually, errors are stacked and we obtain the result.\n\nProof of Theorem 3. We use the following decomposition, for any fixed \u03b1 \u2208 (0, 1):\n\n\u02c6ROC\u2217(\u03b1) \u2212 ROC\u2217(\u03b1) = (\u02c6ROC\u2217(\u03b1) \u2212 \u2211_{i=1}^{K} ROC\u2217(\u02c6\u03b1i) \u02c6\u03c6i(\u03b1)) + (\u2211_{i=1}^{K} ROC\u2217(\u02c6\u03b1i) \u02c6\u03c6i(\u03b1) \u2212 ROC\u2217(\u03b1)) .\n\nTherefore, we have by a triangle inequality: \u2200\u03b1 \u2208 [0, 1],\n\n|\u02c6ROC\u2217(\u03b1) \u2212 \u2211_{i=1}^{K} ROC\u2217(\u02c6\u03b1i) \u02c6\u03c6i(\u03b1)| \u2264 max_{1\u2264i\u2264K} ( | \u02c6\u03b2i \u2212 \u03b2i| + |\u03b2i \u2212 \u03b2\u2217_i | + |ROC\u2217(\u03b1\u2217_i ) \u2212 ROC\u2217(\u02c6\u03b1i)| ) .\n\nAnd, by the finite increments theorem, we have:\n\n|ROC\u2217(\u03b1\u2217_i ) \u2212 ROC\u2217(\u02c6\u03b1i)| \u2264 (sup_{\u03b1\u2208[0,1]} (d/d\u03b1) ROC\u2217(\u03b1)) (|\u03b1\u2217_i \u2212 \u03b1i| + |\u03b1i \u2212 \u02c6\u03b1i|) .\n\nFor the other term, we use the same result on approximation as in the proof of Theorem 2:\n\n|\u2211_{i=1}^{K} ROC\u2217(\u02c6\u03b1i) \u02c6\u03c6i(\u03b1) \u2212 ROC\u2217(\u03b1)| \u2264 \u2212(1/8) inf_{\u03b1\u2208[0,1]} (d\u00b2/d\u03b1\u00b2) ROC\u2217(\u03b1) \u00b7 max_{0\u2264i\u2264K} (\u02c6\u03b1i+1 \u2212 \u02c6\u03b1i)\u00b2 ,\n\nwith max_{0\u2264i\u2264K} (\u02c6\u03b1i+1 \u2212 \u02c6\u03b1i) \u2264 max_{0\u2264i\u2264K} (\u03b1\u2217_{i+1} \u2212 \u03b1\u2217_i ) + 2 max_{1\u2264i\u2264K} |\u03b1\u2217_i \u2212 \u03b1i| + 2 max_{1\u2264i\u2264K} |\u02c6\u03b1i \u2212 \u03b1i| .\n\nWe recall that max_{1\u2264i\u2264K} |\u02c6\u03b1i \u2212 \u03b1i| = OP(Kn\u22121/2). Moreover, max_{0\u2264i\u2264K} {\u03b1\u2217_{i+1} \u2212 \u03b1\u2217_i } is of the order of K\u22121. 
And with probability at least 1 \u2212 \u03b4, we have that max_{1\u2264i\u2264K} |\u03b1\u2217_i \u2212 \u03b1i| is bounded as in Theorem 1, except that \u03b4 is replaced by \u03b4/K in the bound. Eventually, we get the generalization bound K\u22122 + (log K/n)^{1/3}, which is optimal for a number of knots K \u223c n^{1/6}.\n\nReferences\n\n[AA07] J.-Y. Audibert and A. Tsybakov. Fast learning rates for plug-in classifiers. Annals of Statistics, 35(2):608\u2013633, 2007.\n[AGH+05] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the ROC curve. J. Mach. Learn. Res., 6:393\u2013425, 2005.\n[BBL05] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of Classification: A Survey of Some Recent Advances. ESAIM: Probability and Statistics, 9:323\u2013375, 2005.\n[BCT07] M. Barreno, A. A. Cardenas, and J. D. Tygar. Optimal ROC curve for a combination of classifiers. In NIPS'07, 2007.\n[BDH06] F. R. Bach, D. Heckerman, and E. Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7:1713\u20131741, 2006.\n[Cav97] L. Cavalier. Nonparametric estimation of regression level sets. Statistics, 29:131\u2013160, 1997.\n[CLV08] S. Cl\u00e9men\u00e7on, G. Lugosi, and N. Vayatis. Ranking and empirical risk minimization of U-statistics. The Annals of Statistics, 36(2):844\u2013874, 2008.\n[CV07] S. Cl\u00e9men\u00e7on and N. Vayatis. Ranking the best instances. Journal of Machine Learning Research, 8:2671\u20132699, 2007.\n[CV08] S. Cl\u00e9men\u00e7on and N. Vayatis. Tree-structured ranking rules and approximation of the optimal ROC curve. Technical Report hal-00268068, HAL, 2008.\n[dB01] C. de Boor. A practical guide to splines. Springer, 2001.\n[Ega75] J.P. Egan. 
Signal Detection Theory and ROC Analysis. Academic Press, 1975.\n[FISS03] Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933\u2013969, 2003.\n[RV06] P. Rigollet and R. Vert. Fast rates for plug-in estimators of density level sets. Technical Report arXiv:math/0611473v2, arXiv, 2006.\n[Tsy04] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135\u2013166, 2004.\n[vT68] H. L. van Trees. Detection, Estimation, and Modulation Theory, Part I. Wiley, 1968.\n[WN07] R. Willett and R. Nowak. Minimax optimal level set estimation. IEEE Transactions on Image Processing, 16(12):2965\u20132979, 2007.\n", "award": [], "sourceid": 581, "authors": [{"given_name": "St\u00e9phan", "family_name": "Cl\u00e9men\u00e7on", "institution": null}, {"given_name": "Nicolas", "family_name": "Vayatis", "institution": null}]}