{"title": "Empirical performance maximization for linear rank statistics", "book": "Advances in Neural Information Processing Systems", "page_first": 305, "page_last": 312, "abstract": "The ROC curve is known to be the gold standard for measuring the performance of a test/scoring statistic regarding its capacity of discrimination between two populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval, through medical diagnosis. Most practical performance measures used in scoring applications, such as the AUC, the local AUC, the p-norm push, the DCG and others, can be seen as summaries of the ROC curve. This paper highlights the fact that many of these empirical criteria can be expressed as (conditional) linear rank statistics. We investigate the properties of empirical maximizers of such performance criteria and provide preliminary results for the concentration properties of a novel class of random variables that we will call a linear rank process.", "full_text": "Empirical performance maximization\n\nfor linear rank statistics\n\nSt\u00e9phan Cl\u00e9men\u00e7on\n\nTelecom Paristech (TSI) - LTCI UMR Institut Telecom/CNRS 5141\n\nstephan.clemencon@telecom-paristech.fr\n\nNicolas Vayatis\n\nENS Cachan & UniverSud - CMLA UMR CNRS 8536\n\nvayatis@cmla.ens-cachan.fr\n\nAbstract\n\nThe ROC curve is known to be the gold standard for measuring the performance of a test/scoring statistic regarding its capacity of discrimination between two populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval, through medical diagnosis. Most practical performance measures used in scoring applications, such as the AUC, the local AUC, the p-norm push, the DCG and others, can be seen as summaries of the ROC curve. This paper highlights the fact that many of these empirical criteria can be expressed as (conditional) linear rank statistics. 
We investigate the properties of empirical maximizers of such performance criteria and provide preliminary results for the concentration properties of a novel class of random variables that we will call a linear rank process.\n\n1 Introduction\n\nIn the context of ranking, several performance measures may be considered. Even in the simplest framework of bipartite ranking, where a binary label is available, there is not one single natural criterion, but many possible options. The ROC curve provides a complete description of performance, but its functional nature renders direct optimization strategies rather complex. Empirical risk minimization strategies are thus based on summaries of the ROC curve, which take the form of empirical risk functionals where the averages involved are no longer taken over i.i.d. sequences. The most popular choice is the so-called AUC criterion (see [AGH+05] or [CLV08] for instance), but when top-ranked instances are more important, various choices can be considered: the Discounted Cumulative Gain or DCG [CZ06], the p-norm push (see [Rud06]), or the local AUC (refer to [CV07]). The present paper starts from the simple observation that all these summary criteria have a common feature: conditioned upon the labels, they all belong to the class of linear rank statistics. Such statistics have been extensively studied in the mathematical statistics literature because of their optimality properties in hypothesis testing, see [HS67]. Now, in the statistical learning view, with the importance of excess risk bounds, the theory of rank tests needs to be revisited and new problems come up. The arguments required to deal with risk functionals based on linear rank statistics have been sketched in [CV07] in a special case. The empirical AUC, known as the Wilcoxon-Mann-Whitney statistic, is also a U-statistic, and this particular dependence structure was extensively exploited in [CLV08]. 
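All of the criteria just listed can be computed from the ranks of the scores in the pooled sample. The following is a toy numerical sketch of ours (not from the paper; the helper names and the parameter values u0 = 0.8 and p = 4 are illustrative choices) of the conditional linear rank statistic \u03a3_i I{Yi=+1} \u03c6(Rank(s(Xi))/(n + 1)) studied below, for several score-generating functions \u03c6:

```python
import numpy as np

def w_ranking(scores, labels, phi):
    """Conditional linear rank statistic: sum of phi(Rank/(n+1)) over positives.

    Rank(s(X_i)) = #{j : s(X_j) <= s(X_i)} is computed in the pooled sample.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    n = len(scores)
    # Rank of each score among all n scores (ties counted as "<=")
    ranks = (scores[:, None] >= scores[None, :]).sum(axis=1)
    return float(np.sum(phi(ranks[labels == 1] / (n + 1))))

# Score-generating functions phi; the constants are illustrative choices.
phi_auc   = lambda u: u               # Wilcoxon-Mann-Whitney / AUC summary
phi_local = lambda u: u * (u >= 0.8)  # local-AUC flavour, u0 = 0.8
phi_push  = lambda u: u ** 4          # p-norm-push flavour, p = 4
```

Since only ranks enter the statistic, any strictly increasing transform of the scores leaves its value unchanged; with \u03c6(u) = u and no ties, the statistic relates to the empirical AUC through the classical Wilcoxon identity (sum of positive ranks = k(k+1)/2 + k\u00b7m\u00b7AUC).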
In the present paper, we describe the generic structure of linear rank statistics as an orthogonal decomposition after projection onto the space of sums of i.i.d. random variables (Section 2). This projection method is the key to all statistical results related to maximizers of such criteria: consistency, (fast) rates of convergence or model selection. We relate linear rank statistics to performance measures relevant for the ranking problem by showing that the targets of ranking algorithms correspond to optimal ordering rules in that sense (Section 3). Eventually, we provide some preliminary results in Section 4 for empirical maximizers of performance criteria based on linear rank statistics with smooth score-generating functions.\n\n2 Criteria based on linear rank statistics\n\nThroughout the paper, we consider the standard binary classification model. Take a random pair (X, Y) \u2208 X \u00d7 {\u22121, +1}, where X is an observation vector in a high-dimensional space X \u2282 Rd and Y is a binary label, and denote by P the distribution of (X, Y). The dependence structure between X and Y can be described by conditional distributions. We can consider two descriptions: either P = (\u00b5, \u03b7), where \u00b5 is the marginal distribution of X and \u03b7 is the posterior distribution defined by \u03b7(x) = P{Y = 1 | X = x} for all x \u2208 Rd, or else P = (p, G, H), with p = P{Y = 1} the proportion of positive instances, G = L(X | Y = +1) the conditional distribution of positive instances and H = L(X | Y = \u22121) the conditional distribution of negative instances. A sample of size n of i.i.d. realizations of this statistical model can be represented as a set of pairs {(Xi, Yi)}_{1\u2264i\u2264n}, where (Xi, Yi) is a copy of (X, Y), but also as a set {X_1^+, . . . , X_k^+, X_1^\u2212, . . . , X_m^\u2212}, where L(X_i^+) = G, L(X_i^\u2212) = H, and k + m = n. In this setup, the integers k and m are random, drawn as binomial r.v.'s of size n with respective parameters p and 1 \u2212 p.\n\n2.1 Motivation\n\nMost of statistical learning theory has been developed for empirical risk minimizers (ERM) of sums of i.i.d. random variables. Mathematical results were elaborated with the use of empirical process techniques, and particularly concentration inequalities for such processes (see [BBL05] for an overview). This was made possible by the standard assumption that, in a batch setup, for the usual prediction problems (classification, regression or density estimation), the sample data {(Xi, Yi)}_{i=1,...,n} are i.i.d. random variables. Another reason is that the error probability in these problems involves only \u201cfirst-order\u201d events, depending only on (X1, Y1). In classification, for instance, most theoretical developments focused on the error probability P{Y1 \u2260 g(X1)} of a classifier g : X \u2192 {\u22121, +1}, which is hardly considered in practice because the two populations are rarely symmetric in terms of proportions or costs. For prediction tasks such as ranking or scoring, more involved statistics need to be considered, such as the Area Under the ROC curve (AUC), the local AUC, the Discounted Cumulative Gain (DCG), the p-norm push, etc. For instance, the AUC, a very popular performance measure in various scoring applications, such as medical diagnosis or credit-risk screening, can be seen as the probability of an \u201cevent of order two\u201d, i.e. depending on (X1, Y1), (X2, Y2). In information retrieval, the DCG is the reference measure and it seems to have a rather complicated statistical structure. The first theoretical studies either attempt to get back to sums of i.i.d. 
random variables by artificially reducing the information available (see [AGH+05], [Rud06]) or adopt a plug-in strategy ([CZ06]). Our approach is to i) avoid plug-in in order to understand the intimate nature of the learning problem, ii) keep all the information available and provide the analysis of the full statistic. We shall see that this approach requires the development of new tools for handling the concentration properties of rank processes, namely collections of rank statistics indexed by classes of functions, which have never been studied before.\n\n2.2 Empirical performance of scoring rules\n\nThe learning task on which we focus here is known as the bipartite ranking problem. The goal of ranking is to order the instances Xi by means of a real-valued scoring function s : X \u2192 R, given the binary labels Yi. We denote by S the set of all scoring functions. It is natural to assume that a good scoring rule s would assign higher ranks to the positive instances (those for which Yi = +1) than to the negative ones. The rank of the observation Xi induced by the scoring function s is expressed as Rank(s(Xi)) = \u03a3_{j=1}^n I{s(Xj) \u2264 s(Xi)} and ranges from 1 to n. In the present paper, we consider a particular class of simple (conditional) linear rank statistics inspired by the Wilcoxon statistic.\n\nDefinition 1 Let \u03c6 : [0, 1] \u2192 [0, 1] be a nondecreasing function. We define the \u201cempirical W-ranking performance measure\u201d as the empirical risk functional\n\n\u02c6Wn(s) = \u03a3_{i=1}^n I{Yi=+1} \u03c6( Rank(s(Xi))/(n + 1) ), \u2200s \u2208 S.\n\nThe function \u03c6 is called the \u201cscore-generating function\u201d of the \u201crank process\u201d {\u02c6Wn(s)}_{s\u2208S}.\n\nWe refer to the book by Serfling [Ser80] for properties and asymptotic theory of rank statistics. We point out that our definition does not exactly match the standard definition of linear rank statistics. Indeed, in our case, the coefficients of the ranks in the sum are random because they involve the variables Yi. We will call the statistics \u02c6Wn(s) conditional linear rank statistics.\n\nIt is a very natural idea to consider ranking criteria based on ranks. Observe indeed that the performance of a given scoring function s is invariant under increasing transforms of the latter, when evaluated through the empirical W-ranking performance measure. For specific choices of the score-generating function \u03c6, we recover the main examples mentioned in the introduction, and many relevant criteria can be accurately approximated by statistics of this form:\n\n\u2022 \u03c6(u) = u - this choice leads to the celebrated Wilcoxon-Mann-Whitney statistic, which is related to the empirical version of the AUC (see [CLV08]).\n\n\u2022 \u03c6(u) = u \u00b7 I{u \u2265 u0}, for some u0 \u2208 (0, 1) - such a score-generating function corresponds to the local AUC criterion, introduced recently in [CV07]. Such a criterion is of interest when one wants to focus on the highest ranks.\n\n\u2022 \u03c6(u) = u^p - this is another choice which puts emphasis on high ranks, but in a smoother way than the previous one. It is related to the p-norm push approach taken in [Rud06]. However, we point out that the criterion studied in the latter work relies on a different definition of the rank of an observation. Namely, the rank of positive instances among negative instances (and not in the pooled sample) is used. 
This choice makes it possible to use independence, which renders the technical part much simpler, at the price of increasing the variance of the criterion.\n\n\u2022 \u03c6(u) = \u03c6n(u) = c((n + 1)u) \u00b7 I{u \u2265 k/(n+1)} - this corresponds to the DCG criterion in the bipartite setup, one of the \u201cgold standard quality measures\u201d in information retrieval, when grades are binary (namely I{Yi=+1}). The c(i)'s denote the discount factors, c(i) measuring the importance of rank i. The integer k denotes the number of top-ranked instances to take into account. Notice that, with our indexation, top positions correspond to the largest ranks and the sequence {c(i)} should be chosen to be increasing.\n\n2.3 Uniform approximation of linear rank statistics\n\nThis subsection describes the main result of the present analysis, which shall serve as the essential tool for deriving statistical properties of maximizers of empirical W-ranking performance measures. For a given scoring function s, we denote by Gs, respectively Hs, the conditional cumulative distribution function of s(X) given Y = +1, respectively Y = \u22121. With these notations, the unconditional cdf of s(X) is then Fs = pGs + (1 \u2212 p)Hs. For averages of non-i.i.d. random variables, the underlying statistical structure can be revealed by orthogonal projections onto the space of sums of i.i.d. random variables in many situations. This projection argument was the key to the study of empirical AUC maximization, which involved U-processes, see [CLV08]. In the case of U-statistics, this orthogonal decomposition is known as the Hoeffding decomposition and the remainder may be expressed as a degenerate U-statistic, see [Hoe48]. For rank statistics, a similar though less accurate decomposition can be considered. We refer to [Haj68] for a systematic use of the projection method for investigating the asymptotic properties of general statistics.\n\nLemma 2 ([Haj68]) Let Z1, . . . , Zn be independent r.v.'s and T = T(Z1, . . . , Zn) be a square integrable statistic. The r.v. \u02c6T = \u03a3_{i=1}^n E[T | Zi] \u2212 (n \u2212 1)E[T] is called the H\u00e1jek projection of T. It satisfies\n\nE[\u02c6T] = E[T] and E[(\u02c6T \u2212 T)^2] = E[(T \u2212 E[T])^2] \u2212 E[(\u02c6T \u2212 E[\u02c6T])^2].\n\nFrom the perspective of ERM in statistical learning theory, through the projection method, well-known concentration results for standard empirical processes may carry over to more complex collections of r.v.'s such as rank processes, as shown by the next approximation result.\n\nProposition 3 Consider a score-generating function \u03c6 which is twice continuously differentiable on [0, 1]. We set \u03a6s(x) = \u03c6(Fs(s(x))) + p \u222b_{s(x)}^{+\u221e} \u03c6'(Fs(u)) dGs(u) for all x \u2208 X. Let S0 \u2282 S be a VC major class of functions. Then, we have: \u2200s \u2208 S0,\n\n\u02c6Wn(s) = \u02c6Vn(s) + \u02c6Rn(s),\n\nwhere \u02c6Vn(s) = \u03a3_{i=1}^n I{Yi=+1} \u03a6s(Xi) and \u02c6Rn(s) = OP(1) as n \u2192 \u221e uniformly over s \u2208 S.\n\nThe notation OP(1) means bounded in probability, and the integrals are understood in the Lebesgue-Stieltjes sense. Details of the proof can be found in the Appendix.\n\nRemark 1 (ON THE COMPLEXITY ASSUMPTION.) For the terminology of major sets and major classes, we refer to [Dud99]. In the proof of Proposition 3, we need to control the complexity of subsets of the form {x \u2208 X : s(x) \u2264 t}. The stipulated complexity assumption guarantees that this collection of sets, indexed by (s, t) \u2208 S0 \u00d7 R, forms a VC class.\n\nRemark 2 (ON THE SMOOTHNESS ASSUMPTION.) We point out that it is also possible to deal with discontinuous score-generating functions, as in [CV07]. In this case, the lack of smoothness of \u03c6 has to be compensated by smoothness assumptions on the underlying conditional distributions. 
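The identities of Lemma 2 can be checked numerically on a toy statistic. In the sketch below (our own example, not from the paper), the Zi are i.i.d. standard normal and T = (Z1 + \u00b7\u00b7\u00b7 + Zn)^2, for which E[T | Zi] = Zi^2 + (n \u2212 1) and E[T] = n, so that the H\u00e1jek projection has the closed form \u03a3_i Zi^2:

```python
import numpy as np

# Toy check of the Hajek projection (Lemma 2): Z_i iid N(0,1), T = (sum Z_i)^2.
# Closed forms: E[T | Z_i] = Z_i^2 + (n - 1) and E[T] = n, hence
# hat(T) = sum_i E[T | Z_i] - (n - 1) * E[T] = sum_i Z_i^2.
rng = np.random.default_rng(1)
n, n_mc = 5, 200_000
Z = rng.normal(size=(n_mc, n))
T = Z.sum(axis=1) ** 2
T_proj = (Z ** 2 + (n - 1)).sum(axis=1) - (n - 1) * n  # equals (Z**2).sum(axis=1)

# Lemma 2: E[hat(T)] = E[T] and E[(hat(T) - T)^2] = Var(T) - Var(hat(T)).
lhs = np.mean((T_proj - T) ** 2)
rhs = np.var(T) - np.var(T_proj)
print(lhs, rhs)  # both are Monte Carlo estimates of 2n(n-1) = 40 here
```

Here Var(T) = 2n^2 and Var(\u02c6T) = 2n, so both sides of the variance decomposition should come out close to 2n(n \u2212 1); the projection is the best approximation of T by a sum of functions of the individual Zi's.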
Another approach would consist of approximating \u02c6Wn(s) by the empirical W-ranking criterion where the score-generating function \u03c8 would be a smooth approximation of \u03c6. Owing to space limitations, here we only handle the smooth case.\n\nAn essential hint to the study of the asymptotic behavior of a linear rank statistic consists in rewriting it as a function of the sampling cdf. Denoting by \u02c6Fs(x) = n^{\u22121} \u03a3_{i=1}^n I{s(Xi) \u2264 x} the empirical counterpart of Fs(x), we have:\n\n\u02c6Wn(s) = \u03a3_{i=1}^k \u03c6( (n/(n + 1)) \u02c6Fs(s(X_i^+)) ),\n\nwhich may easily be shown to converge to E[\u03c6(Fs(s(X))) | Y = +1] as n \u2192 \u221e, see [CS58].\n\nDefinition 4 For a given score-generating function \u03c6, we will call the functional\n\nW\u03c6(s) = E[\u03c6(Fs(s(X))) | Y = +1]\n\na \u201cW-ranking performance measure\u201d.\n\nThe following result is a consequence of Proposition 3 and its proof can be found in the Appendix.\n\nProposition 5 Let S0 \u2282 S be a VC major class of functions with VC dimension V and \u03c6 be a score-generating function of class C1. Then, as n \u2192 \u221e, we have with probability one:\n\nsup_{s\u2208S0} (1/n) |\u02c6Wn(s) \u2212 kW\u03c6(s)| \u2192 0.\n\n3 Optimality\n\nWe introduce the class S\u2217 of scoring functions obtained as strictly increasing transformations of the regression function \u03b7:\n\nS\u2217 = { s\u2217 = T \u25e6 \u03b7 | T : [0, 1] \u2192 R strictly increasing }.\n\nThe class S\u2217 contains the optimal scoring rules for the bipartite ranking problem. The next paragraphs motivate the use of W-ranking performance measures as optimization criteria for this problem.\n\n3.1 ROC curves\n\nA classical tool for measuring the performance of a scoring rule s is the so-called ROC curve\n\nROC(s, \u00b7) : \u03b1 \u2208 [0, 1] \u21a6 1 \u2212 Gs \u25e6 Hs^{\u22121}(1 \u2212 \u03b1),\n\nwhere Hs^{\u22121}(x) = inf{t \u2208 R | Hs(t) \u2265 x}. In the case where s = \u03b7, we will denote ROC(\u03b7, \u03b1) = ROC\u2217(\u03b1), for any \u03b1 \u2208 [0, 1]. The set of points (\u03b1, \u03b2) \u2208 [0, 1]^2 which can be achieved as (\u03b1, ROC(s, \u03b1)) for some scoring function s is called the ROC space.\n\nIt is a well-known fact that the regression function provides an optimal scoring function for the ROC curve. This fact relies on a simple application of Neyman-Pearson's lemma. We refer to [CLV08] for the details. Using the fact that, for a given scoring function, the ROC curve is invariant under increasing transformations of the scoring function, we get the following result:\n\nLemma 6 For any scoring function s and any \u03b1 \u2208 [0, 1], we have:\n\n\u2200s\u2217 \u2208 S\u2217, ROC(s, \u03b1) \u2264 ROC(s\u2217, \u03b1) =: ROC\u2217(\u03b1).\n\nThe next result states that the set of optimal scoring functions coincides with the set of maximizers of the W\u03c6-ranking performance, provided that the score-generating function \u03c6 is strictly increasing.\n\nProposition 7 Assume that the score-generating function \u03c6 is strictly increasing. Then, we have:\n\n\u2200s \u2208 S, W\u03c6(s) \u2264 W\u03c6(\u03b7).\n\nMoreover, W\u03c6\u2217 := W\u03c6(\u03b7) = W\u03c6(s\u2217) for any s\u2217 \u2208 S\u2217.\n\nRemark 3 (ON PLUG-IN RANKING RULES) Theoretically, a possible approach to ranking is the plug-in method ([DGL96]), which consists of using an estimate \u02c6\u03b7 of the regression function as a scoring function. 
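Proposition 7 and Lemma 6 can be illustrated by simulation. The sketch below is a toy model of our own (X uniform on [0, 1], \u03b7(x) = x, \u03c6(u) = u^2; all names and constants are illustrative choices, not from the paper): any strictly increasing transform of \u03b7 leaves the empirical W-ranking criterion unchanged, while a non-monotone scoring rule scores lower.

```python
import numpy as np

# Toy bipartite model (our own choice): X ~ U(0,1), eta(x) = P{Y=+1 | X=x} = x.
rng = np.random.default_rng(2)
n = 5000
X = rng.random(n)
Y = np.where(rng.random(n) < X, 1, -1)

def w_hat(score, phi=lambda u: u ** 2):
    """Empirical W-ranking criterion, normalized by k, with phi(u) = u^2."""
    s = score(X)
    ranks = (s[:, None] >= s[None, :]).sum(axis=1)  # pooled-sample ranks
    return phi(ranks[Y == 1] / (n + 1)).sum() / np.sum(Y == 1)

w_eta       = w_hat(lambda x: x)                      # eta itself (optimal)
w_transform = w_hat(lambda x: np.exp(3 * x))          # increasing transform of eta
w_bad       = w_hat(lambda x: np.cos(3 * np.pi * x))  # non-monotone scorer
```

Rank-invariance makes w_eta and w_transform coincide exactly, which is precisely why only the ordering induced by a scoring rule matters; the non-monotone scorer scrambles the ordering and attains a strictly lower empirical criterion.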
As shown by the subsequent bound, when \u03c6 is differentiable with a bounded derivative and \u02c6\u03b7 is close to \u03b7 in the L1-sense, it leads to a nearly optimal ordering in terms of the W-ranking criterion:\n\nW\u03c6\u2217 \u2212 W\u03c6(\u02c6\u03b7) \u2264 (1 \u2212 p) ||\u03c6'||\u221e E[|\u02c6\u03b7(X) \u2212 \u03b7(X)|].\n\nHowever, one faces difficulties with the plug-in approach when dealing with high-dimensional data (see [GKKW02]), which provides the motivation for exploring algorithms based on W-ranking performance maximization.\n\n3.2 Connection to hypothesis testing\n\nFrom the angle embraced in this paper, the ranking problem is tightly related to hypothesis testing. Denote by X+ and X\u2212 two r.v.'s distributed as G and H respectively. As a first go, we can reformulate the ranking problem as the one of finding a scoring function s such that s(X\u2212) is stochastically smaller than s(X+), which means, for example, that: \u2200t \u2208 R, P{s(X\u2212) \u2265 t} \u2264 P{s(X+) \u2265 t}. It is easy to see that the latter statement means that the ROC curve of s dominates the first diagonal of the ROC space. We point out that the first diagonal corresponds to nondiscriminating scoring functions s0 such that Hs0 = Gs0. However, searching for a scoring function s fulfilling this property is generally not sufficient in practice. Heuristically, one would like to pick an s in order to be as far as possible from the case where \u201cGs = Hs\u201d. This requires specifying a certain measure of dissimilarity between distributions. In this respect, various criteria may be considered, such as the L1-Mallows metric (see the next remark). Indeed, assuming temporarily that s is fixed and considering the problem of testing similarity vs. dissimilarity between two distributions Hs and Gs based on two independent samples s(X_1^+), . . . , s(X_k^+) and s(X_1^\u2212), . . . , s(X_m^\u2212), it is well known that nonparametric tests based on linear rank statistics have optimality properties. We refer to Chapter 9 in [Ser80] for an overview of rank procedures for testing homogeneity, which may yield relevant criteria in the ranking context.\n\nRemark 4 (CONNECTION BETWEEN AUC AND THE L1-MALLOWS METRIC) Consider the AUC criterion: AUC(s) = \u222b_0^1 ROC(s, \u03b1) d\u03b1. It is well known that this criterion may be interpreted as the \u201crate of concordant pairs\u201d: AUC(s) = P{s(X) < s(X') | Y = \u22121, Y' = +1}, where (X, Y) and (X', Y') denote independent copies. Furthermore, it may easily be shown that\n\nAUC(s) = 1/2 + \u222b_{\u2212\u221e}^{+\u221e} {Hs(t) \u2212 Gs(t)} dF(t),\n\nwhere the cdf F may be taken as any linear convex combination of Hs and Gs. Provided that Hs is stochastically smaller than Gs and that F(dt) is the uniform distribution over (0, 1) (this is always possible, even if it means replacing s by F \u25e6 s, which leaves the ordering untouched), the second term may be identified as the L1-Mallows distance between Hs and Gs, a well-known probability metric widely considered in the statistical literature (also known as the L1-Wasserstein metric).\n\n4 A generalization error bound\n\nWe now provide a bound on the generalization ability of scoring rules based on empirical maximization of W-ranking performance criteria.\n\nTheorem 8 Set the empirical W-ranking performance maximizer \u02c6sn = arg max_{s\u2208S} \u02c6Wn(s). 
Under the same assumptions as in Proposition 3, and assuming in addition that the class of functions \u03a6s induced by S0 is also a VC major class of functions, we have, for any \u03b4 > 0, with probability 1 \u2212 \u03b4:\n\nW\u03c6\u2217 \u2212 W\u03c6(\u02c6sn) \u2264 c1 \u221a(V/n) + c2 \u221a(log(1/\u03b4)/n),\n\nfor some positive constants c1, c2.\n\nThe proof is a straightforward consequence of Proposition 3 and can be found in the Appendix.\n\n5 Conclusion\n\nIn this paper, we considered a general class of performance measures for ranking/scoring which can be described as conditional linear rank statistics. Our overall setup encompasses in particular known criteria used in medical diagnosis and information retrieval. We have described the statistical nature of such statistics, proved that they are compatible with optimal scoring functions in the bipartite setup, and provided a preliminary generalization bound with a \u221an-rate of convergence. By doing so, we provided the very first results on a class of linear rank processes. Further work is needed to identify a variance control assumption in order to derive fast rates of convergence and to obtain consistency under weaker complexity assumptions. 
Moreover, it is not clear yet how to formulate convex surrogates for such functionals.\n\nAppendix - Proofs\n\nProof of Proposition 5\n\nBy virtue of the finite increment theorem, we have:\n\nsup_{s\u2208S0} |\u02c6Wn(s) \u2212 kW\u03c6(s)| \u2264 k ||\u03c6'||\u221e ( 1/(n + 1) + sup_{(s,t)\u2208S0\u00d7R} |\u02c6Fs(t) \u2212 Fs(t)| ),\n\nand the desired result immediately follows from the application of the VC inequality, see Remark 1.\n\nProof of Proposition 3\n\nSince \u03c6 is of class C2, a Taylor expansion at the second order immediately yields:\n\n\u02c6Wn(s) = \u03a3_{i=1}^k \u03c6(Fs(s(X_i^+))) + \u02c6Bn(s) + \u02c6Rn(s),\n\nwith\n\n\u02c6Bn(s) = \u03a3_{i=1}^k ( Rank(s(X_i^+))/(n + 1) \u2212 Fs(s(X_i^+)) ) \u03c6'(Fs(s(X_i^+))),\n\n|\u02c6Rn(s)| \u2264 \u03a3_{i=1}^k ( Rank(s(X_i^+))/(n + 1) \u2212 Fs(s(X_i^+)) )^2 ||\u03c6''||\u221e.\n\nFollowing in the footsteps of [Haj68], we first compute the projection of \u02c6Bn(s) onto the space \u03a3 of r.v.'s of the form \u03a3_{i\u2264n} fi(Xi, Yi) such that E[fi^2(Xi, Yi)] < \u221e for all i \u2208 {1, . . . , n}:\n\nP_\u03a3(\u02c6Bn(s)) = \u03a3_{j=1}^n E[ \u02c6Bn(s) | Xj, Yj ].\n\nThis projection may be split into two terms:\n\n(I) = \u03a3_{i=1}^n I{Yi=+1} ( E[Rank(s(Xi)) | s(Xi)]/(n + 1) \u2212 Fs(s(Xi)) ) \u03c6'(Fs(s(Xi))),\n\n(II) = (1/(n + 1)) \u03a3_{j=1}^n \u03a3_{i\u2260j} E[ I{Yi=+1} I{s(Xj)\u2264s(Xi)} \u03c6'(Fs(s(Xi))) | s(Xj), Yj ].\n\nThe first term is easily handled and may be seen as negligible (it is of order OP(n^{\u22121/2})), since we have E[Rank(s(Xi)) | s(Xi)] = n\u02c6Fs(s(Xi)) and, by assumption, sup_{(s,t)\u2208S\u00d7R} |\u02c6Fs(t) \u2212 Fs(t)| = OP(n^{\u22121/2}) (see Remark 1). Up to an additive term of order OP(1) uniformly over s \u2208 S, the second term may be rewritten as\n\n(II) = (k/(n + 1)) \u03a3_{j=1}^n \u222b_{s(Xj)}^\u221e \u03c6'(Fs(u)) dGs(u) \u2212 (1/(n + 1)) \u03a3_{i=1}^n I{Yi=+1} \u222b_{s(Xi)}^\u221e \u03c6'(Fs(u)) dGs(u).\n\nAs \u03a3_{i=1}^n I{Yi=+1} \u222b_{s(Xi)}^\u221e \u03c6'(Fs(u)) dGs(u)/(n + 1) \u2264 sup_{u\u2208[0,1]} \u03c6'(u) and k/(n + 1) \u223c p, we get that, uniformly over s \u2208 S0:\n\n\u03a3_{i=1}^k \u03c6(Fs(s(X_i^+))) + P_\u03a3(\u02c6Bn(s)) = \u02c6Vn(s) + OP(1) as n \u2192 \u221e.\n\nThe term \u02c6Rn(s) is negligible since, up to the multiplicative constant ||\u03c6''||\u221e, it is bounded by\n\n(1/(n + 1)^2) \u03a3_{i=1}^n E[ ( 2Fs(s(Xi)) + \u03a3_{k\u2260i} {I{s(Xk)\u2264s(Xi)} \u2212 Fs(s(Xi))} )^2 | s(Xi) ].\n\nAs Fs is bounded by 1, it suffices to observe that, for all i:\n\nE[ ( \u03a3_{k\u2260i} {I{s(Xk)\u2264s(Xi)} \u2212 Fs(s(Xi))} )^2 | s(Xi) ] = \u03a3_{k\u2260i} E[ (I{s(Xk)\u2264s(Xi)} \u2212 Fs(s(Xi)))^2 | s(Xi) ].\n\nBounding the variance of the binomial r.v., E[(I{s(Xk)\u2264s(Xi)} \u2212 Fs(s(Xi)))^2 | s(Xi)], by 1/4, one finally gets that \u02c6Rn(s) is of order OP(1) uniformly over s \u2208 S0.\n\nEventually, one needs to evaluate the accuracy of the approximation yielded by the projection, \u02c6Bn(s) \u2212 {P_\u03a3(\u02c6Bn(s)) \u2212 (n \u2212 1)E[\u02c6Bn(s)]}. Write, for all s \u2208 S0,\n\n\u02c6Bn(s) = n\u02c6Un(s) + \u03a3_{i=1}^n I{Yi=+1} ( 1/(n + 1) \u2212 Fs(s(Xi)) ) \u03c6'(Fs(s(Xi))), 
Now, given that sups\u2208S0 ||qs||\u221e < \u221e, it follows from Theorem 11 in [CLV08] for instance,\n\nwhich actually corresponds to the degenerate part of the Hoeffding decomposition of the U-statistic\n\ncombined with the basic symmetrization device of the kernel qs, that\n\n|(cid:98)Un(s) \u2212 {P\u03a3((cid:98)Un(s)) \u2212 (n \u2212 1)E[(cid:98)Un(s)]}| = OP(n\u22121) as n \u2192 \u221e,\n\nsup\ns\u2208S0\n\nwhich concludes the proof.\n\nProof of Proposition 7\nUsing the decomposition Fs = pGs + (1 \u2212 p)Hs, we are led to the following expression:\n\npW\u03c6(s) =\n\n\u03c6(u) du \u2212 (1 \u2212 p)E[\u03c6(Fs(s(X))) | Y = \u22121].\n\n(cid:90) 1\n\n0\nThen, using a change of variable:\n\nE[\u03c6(Fs(s(X))) | Y = \u22121] =\n\n(cid:90) 1\n\n0\n\n\u03c6(p(1 \u2212 ROC(s, \u03b1)) + (1 \u2212 p)(1 \u2212 \u03b1)) d\u03b1 .\n\nIt is now easy to conclude since \u03c6 is increasing (by assumption) and because of the optimality of\nelements of S\u2217 in the sense of Lemma 6.\n\nProof of Theorem 8\n\nObserve that, by virtue of Proposition 3,\n\n\u03c6 \u2212 W\u03c6(\u02c6sn) \u2264 2 sup\nW \u2217\ns\u2208S0\n\n|(cid:99)Wn(s)/k \u2212 W\u03c6(s)| \u2264 2\n\n|(cid:98)Vn(s) \u2212 kW\u03c6(s)| + OP(n\u22121),\n\nsup\ns\u2208S0\n\nk\n\nand the desired bound derives from the VC inequality applied to the sup term, noticing that it follows\nfrom our assumptions that {(x, y) (cid:55)\u2192 I{y=+1}\u03a6s(x)}s\u2208S0 is a VC class of functions.\n\n[CLV08]\n\n[BBL05]\n\nReferences\n[AGH+05] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds\nfor the area under the ROC curve. Journal of Machine Learning Research, 6:393\u2013425,\n2005.\nS. Boucheron, O. Bousquet, and G. Lugosi. Theory of Classi\ufb01cation: A Survey of\nSome Recent Advances. ESAIM: Probability and Statistics, 9:323\u2013375, 2005.\nS. Cl\u00b4emenc\u00b8on, G. Lugosi, and N. Vayatis. Ranking and empirical risk minimization of\nU-statistics. 
The Annals of Statistics, 36(2):844\u2013874, 2008.\nJ. Chernoff and Savage. Asymptotic normality and ef\ufb01ciency of certain non parametric\ntest statistics. Ann. Math. Stat., 29:972\u2013994, 1958.\nS. Cl\u00b4emenc\u00b8on and N. Vayatis. Ranking the best instances. Journal of Machine Learn-\ning Research, 8:2671\u20132699, 2007.\nD. Cossock and T. Zhang. Subset ranking using regression. In H.U. Simon and G. Lu-\ngosi, editors, Proceedings of COLT 2006, volume 4005 of Lecture Notes in Computer\nScience, pages 605\u2013619, 2006.\nL. Devroye, L. Gy\u00a8or\ufb01, and G. Lugosi. A Probabilistic Theory of Pattern Recognition.\nSpringer, 1996.\n\n[DGL96]\n\n[CV07]\n\n[CZ06]\n\n[CS58]\n\n8\n\n\f[Dud99]\n[GKKW02] L. Gy\u00a8or\ufb01, M. K\u00a8ohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Non-\n\nR.M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999.\n\n[Haj68]\n\n[Hoe48]\n\n[HS67]\n[Rud06]\n\n[Ser80]\n\nparametric Regression. Springer, 2002.\nJ. Hajek. Asymptotic normality of simple linear rank statistics under alternatives. Ann.\nMath. Stat., 39:325\u2013346, 1968.\nW. Hoeffding. A class of statistics with asymptotically normal distribution. Ann. Math.\nStat., 19:293\u2013325, 1948.\nJ. H\u00b4ajek and Z. Sid\u00b4ak. Theory of Rank Tests. Academic Press, 1967.\nC. Rudin. Ranking with a P-Norm Push.\nIn H.U. Simon and G. Lugosi, editors,\nProceedings of COLT 2006, volume 4005 of Lecture Notes in Computer Science, pages\n589\u2013604, 2006.\nR.J. Ser\ufb02ing. Approximation theorems of mathematical statistics. John Wiley & Sons,\n1980.\n\n9\n\n\f", "award": [], "sourceid": 551, "authors": [{"given_name": "St\u00e9phan", "family_name": "Cl\u00e9men\u00e7con", "institution": null}, {"given_name": "Nicolas", "family_name": "Vayatis", "institution": null}]}