{"title": "On the (Non-)existence of Convex, Calibrated Surrogate Losses for Ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 197, "page_last": 205, "abstract": "", "full_text": "On the (Non-)existence of Convex, Calibrated Surrogate Losses for Ranking

Clément Calauzènes, Nicolas Usunier, Patrick Gallinari
LIP6 - UPMC
4 place Jussieu, 75005 Paris, France
firstname.lastname@lip6.fr

Abstract

We study surrogate losses for learning to rank, in a framework where the rankings are induced by scores and the task is to learn the scoring function. We focus on the calibration of surrogate losses with respect to a ranking evaluation metric, where the calibration is equivalent to the guarantee that near-optimal values of the surrogate risk imply near-optimal values of the risk defined by the evaluation metric. We prove that if a surrogate loss is a convex function of the scores, then it is not calibrated with respect to two evaluation metrics widely used for search engine evaluation, namely the Average Precision and the Expected Reciprocal Rank. We also show that such convex surrogate losses cannot be calibrated with respect to the Pairwise Disagreement, an evaluation metric used when learning from pairwise preferences. Our results cast light on the intrinsic difficulty of some ranking problems, as well as on the limitations of learning-to-rank algorithms based on the minimization of a convex surrogate risk.

1 Introduction

A surrogate loss is a loss function used as a substitute for the true quality measure during training, in order to ease the optimization of the empirical risk. The hinge loss and the exponential loss, which are used in Support Vector Machines and AdaBoost as convex upper bounds of the classification error, are well-known examples of surrogate losses for binary classification.
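As a minimal numerical illustration (ours, not part of the original paper), the hinge and exponential losses can be checked to dominate the 0-1 classification error pointwise, as functions of the margin z = y·f(x):

```python
import math

def zero_one(z):
    """0-1 classification error as a function of the margin z = y * f(x)."""
    return 1.0 if z <= 0 else 0.0

def hinge(z):
    """Hinge loss (used in Support Vector Machines): convex upper bound of the 0-1 loss."""
    return max(0.0, 1.0 - z)

def exponential(z):
    """Exponential loss (used in AdaBoost): convex upper bound of the 0-1 loss."""
    return math.exp(-z)

# Both convex surrogates dominate the 0-1 loss at every margin value.
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert hinge(z) >= zero_one(z)
    assert exponential(z) >= zero_one(z)
```

Minimizing such convex upper bounds is tractable, which is precisely the appeal of surrogate losses discussed in this paper.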
In this paper, we study surrogate losses for learning to rank, in a context where a set of items should be ranked given an input query and where the ranking is obtained by sorting the items according to predicted numerical scores. This work is motivated by the intensive research that has recently been carried out on machine learning approaches to improving the quality of search engine results, and more specifically on the design of surrogate losses that lead to high-quality rankings (see [16] for a review).

Considering algorithms for learning to rank along the axis of scalability, there are first the algorithms designed for small-scale datasets only, which directly solve the NP-hard ranking problem [5] without using any surrogate loss; then come algorithms that use a surrogate loss chosen as a non-convex but continuous and (almost everywhere) differentiable approximation of the evaluation metric [3, 21, 10]; and finally algorithms that use a convex surrogate loss. Most algorithms for learning to rank fall into the latter category, including the reference algorithms RankBoost [12] and Ranking SVMs [14, 4] or the regression approach of [8], because convex surrogate losses lead to optimization problems that can be solved efficiently, while non-convex approaches may require intensive computations to find a good local optimum. The disadvantage of convex surrogate losses is that they cannot closely approximate the evaluation metrics on the whole prediction space. However, as more examples become available and smaller values of the surrogate risk are achieved, the only region of interest becomes that of near-optimal predictions. It is thus possible that the minimization of the surrogate risk provably leads to optimal predictions according to the risk defined by the evaluation measure.
In that case, the surrogate loss is said to be calibrated with respect to the evaluation metric.

The calibration of surrogate losses has been extensively studied for various classification settings [1, 26, 27, 18, 19] and for AUC optimization [7, 15]. For each of these tasks, many usual convex losses are calibrated with respect to the natural evaluation metric. In the context of learning to rank for search engines, several families of convex losses are calibrated with respect to the Discounted Cumulative Gain (DCG) and its variants [8, 2, 17]. However, metrics other than the DCG are often used as references for the evaluation of ranked results, such as the Average Precision (AP), used in past TREC competitions [22], the Expected Reciprocal Rank (ERR), used in the Yahoo! Learning to Rank Challenge [6], or the Pairwise Disagreement (PD), used when learning from pairwise preferences. And despite the multiplicity of convex losses that have been proposed for ranking, none of them has been proved to be calibrated with respect to any of these three metrics. This led us to the question of whether convex losses can be calibrated with respect to the AP, the ERR, or the PD.

Our main contribution is a definitive and negative answer to that question. We prove that if a surrogate loss is convex, then it cannot be calibrated with respect to any of the AP, the ERR or the PD. Thus, if one of these metrics should be optimized, the price to pay for the computational advantage of convex losses is an inconsistent learning procedure, which may converge to non-optimal predictions as the number of examples increases.

Our result generalizes previous works on non-calibration. First, Duchi et al. [11] showed that many convex losses based on pairwise comparisons, such as those of RankBoost [12] or Ranking SVMs [14, 4], are not calibrated with respect to the PD. Secondly, Buffoni et al.
[2] showed that specific convex losses, called order-preserving, are not calibrated with respect to the AP or the ERR, even though these losses are calibrated with respect to (any variant of) the DCG. Our result is stronger than those because we do not make any assumption on the exact structure of the loss; our approach as a whole is also more general because it directly applies to the three evaluation metrics (AP, ERR and PD). Finally, Duchi et al. conjectured that no convex loss can be calibrated with the PD in general [11, Section 2.1], because such a loss would provide a polynomial-time algorithm for an NP-hard problem. Our approach thus leads to a direct proof of this conjecture.

In the next section, we describe our framework for learning to rank. We then present in Section 3 the general framework of calibration of [20], give a new characterization of calibration for the evaluation metrics we consider (Theorem 2), and derive the implications of the convexity of a surrogate loss. Our main result is proved in Section 4. Section 5 concludes the paper, and Section 6 is a technical part containing the full proof of Theorem 2.

Notation. Let V, W be two sets. A set-valued function g from V to W maps each v ∈ V to a subset of W (set-valued functions appear in the paper as the result of arg min operations). Given a subset V of V, the image of V by g, denoted by g(V), is the union of the images by g of all members of V, i.e. g(V) = ∪_{v∈V} g(v). If n is a positive integer, [n] is the set {1, ..., n}, and S_n is the set of permutations of [n]. Boldface characters are used for vectors of R^n. If x ∈ R^n, the i-th component of x is denoted by x_i (normal font and subscript). The cardinality of a finite set V is denoted by |V|.

2 Ranking Framework

We describe in this section the formal framework of ranking we consider.
We first present the prediction problem we address, and then define the two main objects of our study: evaluation metrics for ranking and surrogate losses. We end the section with an outline of our technical contributions.

2.1 Framework and Definitions

We consider a framework similar to label ranking [9] or subset ranking [8]. Let X be a measurable space (the instance space). An instance x ∈ X represents a query and its associated n items to rank, for an integer n ≥ 3. The items are indexed from 1 to n, and the goal is to order the set of item indexes [n] = {1, ..., n} given x. The ordering (or ranking) is predicted by a scoring function, which is a measurable function f : X → R^n. For any input instance x, the scoring function f predicts a vector of n relevance scores (one score for each item), and the ranking is obtained by sorting the item indexes by decreasing scores. We use permutations over [n] to represent rankings, with the following conventions.
First, given a permutation σ in S_n, k in [n] is the rank of the item σ(k); second, low ranks are better, so that σ(1) is the top-ranked item.

Table 1: Formulas of r(y, σ) for some common ranking evaluation metrics

- Discounted Cumulative Gain (higher values mean better performance); feedback y ∈ Y = {0, ..., Y}^n, Y ∈ N, Y ≥ 1:
  DCG(y, σ) = Σ_{k=1}^n (2^{y_{σ(k)}} − 1) / log(1 + k)
- Expected Reciprocal Rank (higher values mean better performance); feedback y ∈ Y = {0, ..., Y}^n, Y ∈ N, Y ≥ 1:
  ERR(y, σ) = Σ_{k=1}^n (R_k / k) Π_{q=1}^{k−1} (1 − R_q), with R_k = (2^{y_{σ(k)}} − 1) / 2^Y
- Average Precision (higher values mean better performance); feedback y ∈ Y = {0, 1}^n:
  AP(y, σ) = (1 / |{i : y_i = 1}|) Σ_{i : y_i = 1} (Σ_{k=1}^{σ^{-1}(i)} y_{σ(k)}) / σ^{-1}(i)
- Pairwise Disagreement (lower values mean better performance); feedback y ∈ Y = all DAGs over [n]:
  PD(y, σ) = Σ_{i→j ∈ y} I(σ^{-1}(i) > σ^{-1}(j))

The quality of a ranking is measured by a ranking evaluation metric, relative to a feedback. The feedback space, denoted by Y, is a finite set, and an evaluation metric is a function r : Y × S_n → R. We use the convention that lower values of r are preferable, and thus when we discuss existing metrics for which higher values are better (e.g. the DCG, the AP or the ERR), we implicitly consider their opposites. Table 1 gives the formulas and feedback spaces of the evaluation metrics that we discuss in the paper. The first three metrics (the DCG, the ERR and the AP) are commonly used for search engine evaluation. The feedback they consider is a vector of relevance judgments (one judgment per item). The last measure we consider is the PD, which is widely used when learning from pairwise preferences. For the feedback space of the PD, we follow [11] and take Y as the set of all directed acyclic graphs (DAGs) over [n].
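As a concrete reading of the formulas in Table 1, the following Python sketch (our illustration; the function names and the encoding of permutations are ours, not the paper's) evaluates the DCG, the ERR and the AP for a relevance vector y and a ranking given as a list sigma, where sigma[k] is the item placed at 0-indexed rank k:

```python
import math

def dcg(y, sigma):
    # Sum over ranks k = 1..n of (2^relevance - 1) / log(1 + k).
    return sum((2 ** y[i] - 1) / math.log(k + 2) for k, i in enumerate(sigma))

def err(y, sigma, Y=1):
    # Cascade model: a user scans ranks top-down and stops at rank k
    # with probability R_k; the metric is the expected reciprocal stop rank.
    p_continue, total = 1.0, 0.0
    for k, i in enumerate(sigma, start=1):
        r = (2 ** y[i] - 1) / 2 ** Y
        total += p_continue * r / k
        p_continue *= 1.0 - r
    return total

def average_precision(y, sigma):
    # Mean, over relevant items i, of the precision at the rank of i.
    rank = {i: k for k, i in enumerate(sigma, start=1)}  # rank[i] = sigma^{-1}(i)
    relevant = [i for i in range(len(y)) if y[i] == 1]
    return sum(
        sum(y[sigma[k]] for k in range(rank[i])) / rank[i] for i in relevant
    ) / len(relevant)

# Ranking both relevant items of y = [1, 0, 1] first yields a perfect AP of 1.
assert average_precision([1, 0, 1], [0, 2, 1]) == 1.0
```

With binary relevance and Y = 1, the ERR's stopping probability R_k is 1/2 at every relevant item and 0 otherwise, which makes small worked examples easy to check by hand.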
For a DAG y ∈ Y, there is an edge from item i to item j (denoted i → j ∈ y) when i is preferred to j, or, equivalently, when i should have a better rank than j.

In general, any ranking evaluation metric r induces a quality measure on vectors of scores instead of rankings, considering that a sorting algorithm breaks ties randomly. To this end, we use the following set-valued function from R^n to S_n, called arg sort, which gives the set of rankings induced by a vector of scores:

∀s = (s_1, ..., s_n) ∈ R^n, arg sort(s) = {σ ∈ S_n | ∀k ∈ [n − 1], s_{σ(k)} ≥ s_{σ(k+1)}},

and the evaluation metric on vectors of scores induced by r is defined by:

∀y ∈ Y, ∀s ∈ R^n, r′(y, s) = (1 / |arg sort(s)|) Σ_{σ ∈ arg sort(s)} r(y, σ).

For a fixed, but unknown, probability measure D on X × Y, the objective of a learning algorithm is to find a scoring function f with low ranking risk R′(D, f) = ∫_{X×Y} r′(y, f(x)) dD(x, y), using a training set of (instance, feedback) pairs (e.g. drawn i.i.d. according to D).

The optimization of the empirical ranking risk is usually intractable because the ranking loss is discontinuous. To address this issue, algorithms optimize the empirical risk associated to a surrogate loss instead. Throughout the paper, we assume that this loss is bounded below, so that all the infima we take are well-defined. Without loss of generality, we assume that the surrogate loss has non-negative values, and we define a surrogate loss as a measurable function ℓ : Y × R^n → R_+. The
The\n\nsurrogate risk of a scoring function f is then de\ufb01ned by L(D, f ) =(cid:82)\n\nX\u00d7Y (cid:96) (y, f (x))dD(x, y).\n\n2.2 Outline of the Analysis\n\nAny learning algorithm that performs empirical or structural risk minimization on the surrogate risk\ncan, at most, be expected to reach low values of the surrogate risk. The question we address in\nthis paper is whether such an algorithm provably solves the real learning task, which is to achieve\nlow values of the ranking risk. More formally, the criterion under study is whether the following\nimplication holds for every sequence of scoring functions (fk)k\u22650 and every data distribution D:\n\nL(D, fk) \u2212\u2192\n\nk\u2192\u221e inf\n\nf\n\nL(D, f ) \u21d2 R(cid:48)(D, fk) \u2212\u2192\n\nR(cid:48)(D, f )\n\nk\u2192\u221e inf\n\nf\n\n(1)\n\nwhere the in\ufb01ma are taken over all scoring functions. In particular, we show that if a surrogate loss\nis convex in the sense that (cid:96)(y, .) is convex for every y \u2208 Y, and if the evaluation metric is the AP,\n\n3\n\n\fthe ERR or the PD, then there are distributions and sequences of scoring functions for which (1)\ndoes not hold. In other words, we show that learning-to-rank algorithms that de\ufb01ne their objective\nthrough a convex surrogate loss cannot provably optimize any of these evaluation metrics.\nIn order to perform a general analysis for all the three evaluation metrics, we consider Assumption\n(A) below, which formalizes the common property of these metrics that is relevant to our study.\nIntuitively, it means that for any given item, there is a feedback for which the performance only\ndepends on the rank of this item, with a strict improvement of performances when one improves the\nrank of the item:\n\n(A) \u2203\u03b21 < \u03b22 < ... < \u03b2n such that \u2200i \u2208 [n] ,\u2203y \u2208 Y : \u2200\u03c3 \u2208 Sn, r(y, \u03c3) = \u03b2\u03c3-1(i).\n\nNote that in the assumption, the values of \u03b2k (i.e. 
the performance when item i is predicted at rank k) are the same for all items. This is not a strong requirement because the metrics we consider do not depend on how we index the elements. The DCG, the AP and the ERR satisfy (A): for each i, we take the vector of relevance judgments with a 1 for item i and 0 for all other items, so that the values of the metrics only depend on the rank of i (which should be ranked first). The PD satisfies Assumption (A) as well: for each i, take y as the DAG containing the edges i → j, ∀j ∈ [n]\{i}, and only those edges. For this feedback, i is preferred to all other items (and no preference is specified regarding the other items), and thus the quality of a ranking only depends on the rank of i.

Our analysis is organized as follows. In the next section, we introduce the notion of a calibrated surrogate loss defined by Steinwart [20], which is a criterion equivalent to (1). We then obtain a new condition that is equivalent to calibration when (A) holds, and finally we restrict our attention to evaluation metrics satisfying (A) and to convex surrogate losses. In that context, using our new condition for calibration, we show that evaluation metrics with a calibrated surrogate loss necessarily satisfy a specific property. Then, in Section 4, we prove that the AP, the ERR and the PD do not satisfy this property. Since Assumption (A) holds for these three metrics, this latter result implies that they do not have any convex and calibrated surrogate loss. Equivalently, it implies that (1) does not hold in general for these metrics if the surrogate loss is convex.

3 A New Characterization of Calibration

We present in this section the notion of calibration as studied in [20], which is the basis of our work. Then, we provide a characterization of calibration more specific to the evaluation metrics we consider, which relates calibrated surrogate losses and evaluation metrics more closely.
This more specific characterization of calibration is the starting point of the analysis of convex and calibrated surrogate losses carried out in the last subsection, and it allows us to state the results of Section 4.

3.1 The Framework of Calibration

Applying the general results of [20] to our setting, the criterion defined by (1) can be studied by restricting our attention to the contributions of a single instance to the surrogate and ranking risks. These contributions are called the inner surrogate risk and the inner ranking risk respectively. Denoting the set of probability distributions over Y by P = {p : Y → [0, 1] | Σ_{y∈Y} p(y) = 1}, the inner risks are respectively defined for all p ∈ P and all s ∈ R^n by:

L(p, s) = Σ_{y∈Y} p(y) ℓ(y, s)  and  R′(p, s) = (1 / |arg sort(s)|) Σ_{σ ∈ arg sort(s)} R(p, σ), where ∀σ ∈ S_n, R(p, σ) = Σ_{y∈Y} p(y) r(y, σ).

Their optimal values are denoted by L(p) = inf_{s∈R^n} L(p, s) and R′(p) = R(p) = min_{σ∈S_n} R(p, σ).

More precisely, [20, Theorem 2.8] shows that (1) holds for any distribution D and any sequence of scoring functions if and only if the surrogate loss is r-calibrated according to the definition below. Similarly to (1), calibration is an implication between two limits, but it involves the inner risks L(p, s) and R′(p, s) instead of the risks L(D, f) and R′(D, f). For convenience in the rest of the work, we write the implication between the two limits of the inner risks as an inclusion of the sets of near-optimal vectors of scores.
For any ε > 0 and δ > 0, the latter sets are respectively denoted by

M_ℓ(p, δ) = {s ∈ R^n | L(p, s) − L(p) < δ}  and  M_r(p, ε) = {s ∈ R^n | R′(p, s) − R′(p) < ε},

so that the definition of an r-calibrated loss is the following:

Definition 1 ([20, Definition 2.7]). The surrogate loss ℓ is r-calibrated if

∀p ∈ P, ∀ε > 0, ∃δ > 0 : M_ℓ(p, δ) ⊆ M_r(p, ε).

3.2 Calibration through Optimal Rankings

Definition 1 is the starting point of our analysis, and our goal is to show that if the evaluation metric is the AP, the ERR or the PD, then no convex surrogate loss can satisfy it. The goal of this subsection is to give a stronger characterization of r-calibrated surrogate losses when Assumption (A) holds. The starting point of this characterization is to rewrite Definition 1 in terms of the rankings induced by the sets of near-optimal scores, from which we can deduce that ℓ is r-calibrated if and only if¹:

∀p ∈ P, ∀ε > 0, ∃δ > 0 : arg sort(M_ℓ(p, δ)) ⊆ arg sort(M_r(p, ε)).

In contrast to this characterization of calibration, our result (Theorem 2 below), which is specific to metrics that satisfy (A), replaces the inclusion (which can be strict in general) of sets of rankings by an equality when ε tends to 0. More specifically, we define the set of optimal rankings for the inner ranking risk with the following set-valued function from P to S_n:

∀p ∈ P, A_r(p) = arg min_{σ∈S_n} R(p, σ),

so that when Assumption (A) holds, the set of optimal rankings is equal to a set of rankings induced by near-optimal scores of the inner surrogate risk:

Theorem 2. If Assumption (A) holds, then ℓ is r-calibrated if and only if

∀p ∈ P, ∃δ > 0 s.t.
arg sort(M_ℓ(p, δ)) = A_r(p).

The proof of Theorem 2 is deferred to Section 6 at the end of the paper. This theorem enables us to relate the surrogate loss and the evaluation metric, so that the convexity of ℓ induces some constraints on r that are not satisfied by all evaluation metrics.

3.3 The Implication of Convexity on Sets of Optimal Rankings

If ℓ(y, .) is convex for all y ∈ Y, then the inner risk L(p, .) is also convex for every distribution p ∈ P. This implies that M_ℓ(p, δ) is a convex subset of R^n. Thus, if ℓ is r-calibrated, then Theorem 2 implies that A_r(p) = arg sort(M_ℓ(p, δ)) is a set of rankings induced by a convex subset of R^n.

The following theorem presents a condition that the set A_r(p) must satisfy if it is generated by a convex set of scores: if there exists at least one pair of items (i, j) which are inverted in two rankings of A_r(p), then i and j are "indifferent" in A_r(p):

Theorem 3. Assume that for all y ∈ Y, the function s ↦ ℓ(y, s) is convex. If Assumption (A) holds and ℓ is r-calibrated, then r satisfies: ∀p ∈ P, ∀i, j ∈ [n], ∀σ, σ′ ∈ A_r(p),

σ^{-1}(i) < σ^{-1}(j) and σ′^{-1}(i) > σ′^{-1}(j) ⇒ ∃s ∈ R^n : s_i = s_j and arg sort(s) ⊆ A_r(p).   (2)

Proof of Theorem 3. Assume that the conditions of the theorem are satisfied. From now on, we fix some p ∈ P and two items i and j in [n]. Take σ and σ′ in A_r(p) and assume that σ^{-1}(i) < σ^{-1}(j) and σ′^{-1}(i) > σ′^{-1}(j). Since Assumption (A) holds, there is a δ > 0 such that A_r(p) = arg sort(M_ℓ(p, δ)) by Theorem 2. Thus, there are two score vectors u and v in M_ℓ(p, δ) such that u_i ≥ u_j (u induces the ranking σ) and v_i ≤ v_j (v induces the ranking σ′).

Moreover, since ℓ is convex, the function L(p, .) is convex for every p ∈ P, and thus M_ℓ(p, δ) is convex. Consequently, for all t ∈ [0, 1], the vector γ(t) = (1 − t)u + tv belongs to M_ℓ(p, δ). We define g : t ↦ γ_i(t) − γ_j(t) for t ∈ [0, 1]. Then g is continuous, with g(0) = u_i − u_j ≥ 0 and g(1) = v_i − v_j ≤ 0. By the intermediate value theorem, there is t_0 ∈ [0, 1] such that g(t_0) = 0. The consequence is that the score vector s, defined by s = γ(t_0), satisfies s ∈ M_ℓ(p, δ) and s_i = s_j; since s ∈ M_ℓ(p, δ), we also have arg sort(s) ⊆ arg sort(M_ℓ(p, δ)) = A_r(p), which gives (2).

¹We remind the reader of the notation arg sort(M_ℓ(p, δ)) = ∪_{s ∈ M_ℓ(p, δ)} arg sort(s).

Table 2: Examples for Corollary 4. There are three elements to rank. i ≻ j ≻ k represents the permutation that ranks item i first, j second and k last. For the ERR and the AP, we consider binary relevance judgments. p_110 denotes the Dirac distribution at the feedback vector y = [1, 1, 0]; p_001 is defined similarly. For the Pairwise Disagreement, p_{1≻2≻3} is the Dirac distribution at the DAG containing the edges 1 → 2, 2 → 3 and 1 → 3, i.e. the DAG corresponding to 1 ≻ 2 ≻ 3. The Dirac distribution at the DAG containing only the edge 3 → 1 is denoted by p_{3≻1}. In all cases, p̃(α) is a mixture between two Dirac distributions.
The sets A_r(p̃(α)) are obtained by direct calculations. The set A_r(p̃(α)) is the same for all αs in the range given in the third column.

DISTRIBUTION p̃(α) | METRIC | RANGE OF α | A_r(p̃(α))
(1 − α)p_110 + α p_001 | ERR | α ∈ (1/3, 1/2) | {(1 ≻ 3 ≻ 2), (2 ≻ 3 ≻ 1)}
(1 − α)p_110 + α p_001 | AP | α ∈ (2/3, 1) | {(1 ≻ 2 ≻ 3), (3 ≻ 1 ≻ 2), (2 ≻ 1 ≻ 3), (3 ≻ 2 ≻ 1)}
(1 − α)p_{1≻2≻3} + α p_{3≻1} | PD | α = 5/13 | {(2 ≻ 3 ≻ 1), (3 ≻ 1 ≻ 2)}

The contrapositive of Theorem 3 is our technical tool to prove the nonexistence of convex and calibrated losses. Indeed, for a given evaluation metric r, if we are able to exhibit a distribution p ∈ P such that (2) is not satisfied, this evaluation metric cannot have a surrogate loss that is both convex and calibrated. In Section 4 below, we apply this argument to the AP, the ERR and the PD.

Remark 1. It has been proved by several authors that there exist convex surrogate losses that are DCG-calibrated [8, 2, 17]. Thus, the DCG satisfies (2). This can be seen by observing that the optimal rankings for the DCG are exactly those generated by sorting the items according to the vector of scores s*(p) defined by s*_i(p) = Σ_{y∈Y} p(y) 2^{y_i}, i.e. A_r(p) = arg sort(s*(p)).

4 Nonexistence Results

We now present the main result of the nonexistence of convex, calibrated surrogate losses:

Corollary 4. No convex surrogate loss is calibrated with respect to the AP, the ERR or the PD.

Proof. We consider the case where there are three elements to rank, and we use the examples and the notations of Table 2.
Since all three metrics satisfy (A), Theorem 3 implies that if r (taken as either the AP, the ERR or the PD) has a calibrated, convex surrogate loss, then, for any distribution p̃(α), we have: if item i is preferred to j according to a ranking in A_r(p̃(α)), and j is preferred to i according to another ranking in A_r(p̃(α)), then one of the two assertions below must hold:

(a) {(i ≻ j ≻ k), (j ≻ i ≻ k)} ⊆ A_r(p̃(α)),   (b) {(k ≻ i ≻ j), (k ≻ j ≻ i)} ⊆ A_r(p̃(α)),

because there exists s ∈ R^3 such that arg sort(s) ⊆ A_r(p̃(α)) for which either s_i = s_j ≥ s_k or s_i = s_j ≤ s_k. Now, let us consider the case of the ERR. Taking an arbitrary α ∈ (1/3, 1/2), we see in the last column of Table 2 that A_r(p̃(α)) contains two rankings: one of them ranks item 1 before item 2, and the other ranks 2 before 1. If the ERR had a convex, calibrated surrogate loss, then either (a) or (b) would hold. However, neither (a) nor (b) holds. Thus, there is no convex, ERR-calibrated surrogate loss. For the AP, a similar argument with items 1 and 3 leads to the conclusion. For the PD, taking any two items leads to the result.

A first consequence of Corollary 4 is that for ranking problems evaluated in terms of the AP, the ERR or the PD, surrogate losses defined as convex upper bounds on an evaluation metric, as discussed in [24], as well as convex surrogate losses proposed in the structured output framework, such as SVMmap [25], are not calibrated with respect to the evaluation metric they are designed for. The convex surrogate losses used by most participants of the recent Yahoo! Learning to Rank Challenge [6] are also not calibrated with respect to the ERR, the official evaluation metric of the challenge. The fact
The fact\nthat the minimization of a non-calibrated surrogate risk leads to suboptimal prediction functions on\nsome data distributions suggests that convex losses are not a de\ufb01nitive solution to learning to rank.\nSigni\ufb01cant improvements in performances may then be obtained by switching to other approaches\nthan the optimization of a convex risk.\n\n6\n\n\f5 Conclusion\n\nWe proved that convex surrogate losses cannot be calibrated with three major ranking evaluation\nmetrics. The result cast light on the intrinsic limitations of all algorithms based on (empirical)\nconvex risk minimization for ranking, even though most existing algorithms for learning to rank\nfollow this approach. A possible direction for future work is to study whether the calibration of\nconvex losses can be obtained under low noise conditions. Such studies was carried out for the\nPD [11], and calibrated, convex surrogate losses were found for special cases of practical interest.\nNonetheless, in order to obtain algorithms that do not rely on low noise assumptions, our results\nsuggest to explore whether alternatives to convex surrogate approaches can lead to improvements in\nterms of performances. A \ufb01rst possibility is to turn to non-convex losses for ranking as in [10, 3],\nand to study the calibration of such losses. Another alternative is to use another surrogate approach\nthan scoring, such as directly learning pairwise preferences [13], even though the reconstruction of\nan optimal ranking, given the pairwise predictions, that is optimal for evaluation metrics such as the\nAP, the ERR or the PD is still mostly an open issue.\n\n6 Proof of Theorem 2\n\nWe remind the statement of Theorem 2: if r satis\ufb01es (A), then (cid:96) is r-calibrated if and only if for all\n\np \u2208 P, there exists \u03b4 > 0 such that Ar(p) = arg sort(M(cid:96)(p, \u03b4)(cid:1). 
following set-valued function, which defines the set of optimal rankings for the inner surrogate risk:

A_ℓ(p) = arg min_{σ∈S_n} L̃(p, σ), where L̃(p, σ) = inf{L(p, s) | s ∈ R^n s.t. σ ∈ arg sort(s)}.

Then, Theorem 2 is a direct implication of the following two claims, which we prove in this section:

(a) the assertion ∀p ∈ P, ∃δ > 0, arg sort(M_ℓ(p, δ)) = A_ℓ(p) is true in general;
(b) if Assumption (A) holds, then ℓ is r-calibrated if and only if ∀p ∈ P, A_ℓ(p) = A_r(p).

The proof of these two claims is based on three lemmas that we present before the final proof. The first lemma, which does not need any assumption on the evaluation metric, both proves equality (a) and provides a general characterization of calibration in terms of optimal rankings. The second lemma concerns the surrogate loss; it states that a slight perturbation of p does not affect A_ℓ(p) "too much". The third lemma concerns evaluation metrics and gives a simple consequence of Assumption (A). The final proof of Theorem 2 connects all these pieces together to prove (b).

Lemma 5. The following claims are true:

(i) ∀p ∈ P, ∀δ > 0, A_ℓ(p) ⊆ arg sort(M_ℓ(p, δ)).
(ii) ∀p ∈ P, ∃δ_0 > 0 : A_ℓ(p) = arg sort(M_ℓ(p, δ_0)).
(iii) ℓ is r-calibrated if and only if: ∀p ∈ P, A_ℓ(p) ⊆ A_r(p).

Proof. (i) Fix p ∈ P and δ > 0. Let σ ∈ A_ℓ(p). By the definition of L̃, there is an s ∈ R^n such that σ ∈ arg sort(s) and L(p, s) − L̃(p, σ) < δ. Since L̃(p, σ) = min_{σ′∈S_n} L̃(p, σ′) = L(p), we have L(p, s) − L(p) < δ.
This proves s ∈ M_ℓ(p, δ), and thus σ ∈ arg sort(M_ℓ(p, δ)).

(ii) Fix p ∈ P and take δ_0 = min_{σ∉A_ℓ(p)} L̃(p, σ) − L(p) > 0, with the convention min ∅ = +∞. The choice of δ_0 guarantees that ∀s ∈ R^n, L(p, s) − L(p) < δ_0 ⇒ arg sort(s) ⊆ A_ℓ(p), which is equivalent to arg sort(M_ℓ(p, δ_0)) ⊆ A_ℓ(p). The reverse inclusion is given by the first point.

(iii) Since r can only take a finite set of values, we can prove that ℓ is r-calibrated if and only if: ∀p ∈ P, ∃δ > 0 : ∀s ∈ R^n, L(p, s) − L(p) < δ ⇒ R′(p, s) = R′(p). Moreover, we have R′(p, s) = R′(p) ⇔ arg sort(s) ⊆ A_r(p), since R′(p, s) is the mean of R(p, σ) for σ ∈ arg sort(s). Thus, ℓ is r-calibrated if and only if for every p ∈ P, there exists δ > 0 such that arg sort(M_ℓ(p, δ)) ⊆ A_r(p). This characterization and the first two points give the result.

We now present a more technical result on A_ℓ, which shows that the set of optimal rankings cannot dramatically change under a slight perturbation of the distribution over the feedback space. From now on, for any p ∈ P and any η > 0, we denote by B(p, η) the open ball of P (with respect to ||.||_1) of radius η centered at p, i.e. B(p, η) = {p′ ∈ P | ||p − p′||_1 < η}.

Lemma 6. ∀p ∈ P, ∃η > 0 such that A_ℓ(B(p, η)) = A_ℓ(p).

Proof. Note that A_ℓ(p) ⊆ A_ℓ(B(p, η)) since p ∈ B(p, η).
We now prove $\mathcal{A}_\ell(B(p, \eta)) \subseteq \mathcal{A}_\ell(p)$; the main argument is that $\tilde{L}(\cdot, \sigma)$ is continuous for every $\sigma$ because $\mathcal{Y}$ is finite [23, Theorem 2]. Indeed, define $\varepsilon = \frac{1}{2}\big( \min_{\sigma' \notin \mathcal{A}_\ell(p)} \tilde{L}(p, \sigma') - L(p) \big)$. For each $\sigma \in \mathcal{S}_n$, since $\tilde{L}(\cdot, \sigma)$ is continuous, there exists $\eta_\sigma > 0$ such that $\forall p' \in B(p, \eta_\sigma), |\tilde{L}(p', \sigma) - \tilde{L}(p, \sigma)| < \varepsilon$. Let $\eta = \min_{\sigma \in \mathcal{S}_n} \eta_\sigma$, and let $p'$ be an arbitrary member of $B(p, \eta)$. By the definition of $\varepsilon$, we have:
$$\forall \sigma' \notin \mathcal{A}_\ell(p),\quad \tilde{L}(p', \sigma') = \tilde{L}(p', \sigma') - \tilde{L}(p, \sigma') + \tilde{L}(p, \sigma') - L(p) + L(p) > -\varepsilon + 2\varepsilon + L(p) \,.$$
Thus, $\forall \sigma' \notin \mathcal{A}_\ell(p), \tilde{L}(p', \sigma') > L(p) + \varepsilon$. Additionally, the definition of $\eta$ gives $\forall \sigma \in \mathcal{A}_\ell(p), \tilde{L}(p', \sigma) < L(p) + \varepsilon$. Thus, we have $\min_{\sigma' \notin \mathcal{A}_\ell(p)} \tilde{L}(p', \sigma') > \min_{\sigma \in \mathcal{A}_\ell(p)} \tilde{L}(p', \sigma)$. This proves that a ranking that is not optimal for $\tilde{L}(p, \cdot)$ cannot be optimal for $\tilde{L}(p', \cdot)$. Thus $\mathcal{A}_\ell(p') \subseteq \mathcal{A}_\ell(p)$, from which we conclude $\mathcal{A}_\ell(B(p, \eta)) \subseteq \mathcal{A}_\ell(p)$.

Now that we have studied the properties of $\mathcal{A}_\ell$, we analyze the evaluation metrics in more depth. We prove the following consequence of Assumption (A): for each possible ranking, there is a distribution over the feedback space for which this ranking is the unique optimal ranking.

Lemma 7.
If Assumption (A) holds, then $\forall \sigma \in \mathcal{S}_n, \exists p_\sigma \in \mathcal{P}$ such that $\mathcal{A}_r(p_\sigma) = \{\sigma\}$.

Proof. Assume (A) holds and, for each item $k$, let us denote by $y_k$ the feedback corresponding to item $k$ in Assumption (A). Now, let us take some $\sigma \in \mathcal{S}_n$ and define $p_\sigma$ by $p_\sigma(y_k) = \alpha_{\sigma^{-1}(k)}$, with $\alpha_1 > \ldots > \alpha_n > 0$ and $\sum_{k=1}^n \alpha_k = 1$. Then, for any $\sigma' \in \mathcal{S}_n$, we have the equality
$$R(p_\sigma, \sigma') = \sum_{k=1}^n \alpha_{\sigma^{-1}(k)}\, r(y_k, \sigma') = \sum_{k=1}^n \alpha_{\sigma^{-1}(k)}\, \beta_{\sigma'^{-1}(k)} \,.$$
Since the $\alpha$s are non-negative, and since there are ties neither in the $\alpha$s nor in the $\beta$s, the rearrangement inequality implies that the minimum value of $R(p_\sigma, \sigma')$ is obtained for the single permutation $\sigma'$ for which the $\beta_{\sigma'^{-1}(k)}$ are in reverse order relative to the $\alpha_{\sigma^{-1}(k)}$ (i.e. smaller values of $\beta_{\sigma'^{-1}(k)}$ should be associated with greater values of $\alpha_{\sigma^{-1}(k)}$). Since the $\alpha_k$s are decreasing with $k$ and the $\beta_k$s are increasing, the minimum value of $\sigma' \mapsto R(p_\sigma, \sigma') = \sum_{k=1}^n \alpha_{\sigma^{-1}(k)} \beta_{\sigma'^{-1}(k)}$ is obtained if and only if $\sigma^{-1} = \sigma'^{-1}$ (i.e. $\sigma' = \sigma$), which proves $\mathcal{A}_r(p_\sigma) = \{\sigma\}$.

Proof of Theorem 2. We remind the reader that by the second point of Lemma 5, for any $p \in \mathcal{P}$, there is $\delta > 0$ such that $\mathcal{A}_\ell(p) = \operatorname{argsort}\big(M_\ell(p, \delta)\big)$. What remains to show is that if Assumption (A) holds, then $\ell$ is $r$-calibrated if and only if $\forall p \in \mathcal{P}, \mathcal{A}_\ell(p) = \mathcal{A}_r(p)$.

("if" direction) If $\forall p \in \mathcal{P}, \mathcal{A}_\ell(p) = \mathcal{A}_r(p)$, then $\ell$ is $r$-calibrated by Lemma 5.

("only if" direction) Assume that (A) holds and that $\ell$ is $r$-calibrated. Let $p \in \mathcal{P}$.
By Point (iii) of Lemma 5, we know that $\mathcal{A}_\ell(p) \subseteq \mathcal{A}_r(p)$. We now prove the reverse inclusion $\mathcal{A}_r(p) \subseteq \mathcal{A}_\ell(p)$. By Lemma 6, there exists some $\eta > 0$ such that $\mathcal{A}_\ell(B(p, \eta)) = \mathcal{A}_\ell(p)$. Let $\sigma \in \mathcal{A}_r(p)$. The idea is to use Lemma 7 to find some $p' \in B(p, \eta)$ such that $\mathcal{A}_\ell(p') = \{\sigma\}$, which would prove $\sigma \in \mathcal{A}_\ell(p)$ and thus the result. The rest of the proof consists in building $p'$.

Using Lemma 7, let $p_\sigma \in \mathcal{P}$ be such that $\mathcal{A}_r(p_\sigma) = \{\sigma\}$. Now, let $p' = (1 - \frac{\eta}{4}) p + \frac{\eta}{4} p_\sigma$. Then, we have $\|p - p'\|_1 = \frac{\eta}{4} \|p - p_\sigma\|_1 \leq \eta/2$, and thus $p' \in B(p, \eta)$. Moreover, $\mathcal{A}_r(p') = \{\sigma\}$ since $\sigma$ is optimal for both $p$ and $p_\sigma$, and any other permutation is suboptimal for $p_\sigma$. We also have $\mathcal{A}_\ell(p') = \{\sigma\}$ because $\mathcal{A}_\ell$ has non-empty values and calibration implies $\mathcal{A}_\ell(p') \subseteq \mathcal{A}_r(p')$ by Lemma 5. Since $p' \in B(p, \eta)$, it follows that $\sigma \in \mathcal{A}_\ell(p') \subseteq \mathcal{A}_\ell(B(p, \eta)) = \mathcal{A}_\ell(p)$, which concludes the proof.

Acknowledgements

This work was partially funded by the French DGA. The authors thank M. R. Amini, D. Buffoni, S. Clémençon, L. Denoyer and G. Wisniewski for their comments and suggestions.

References

[1] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. J. of the American Stat. Assoc., pages 1–36, 2006.

[2] D. Buffoni, C. Calauzènes, P. Gallinari, and N. Usunier. Learning scoring functions with order-preserving losses and standardized supervision. In Proc. of the Intl. Conf. on Mach. Learn., pages 825–832, 2011.

[3] C. J. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In Proc. of Adv. in Neural Info. Proc. Syst., pages 193–200, 2007.

[4] Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon.
Adapting ranking SVM to document retrieval. In Proc. of the ACM SIGIR Conf. on Res. and Dev. in Info. Retr., pages 186–193, 2006.

[5] A. Chang, C. Rudin, M. Cavaretta, R. Thomas, and G. Chou. How to reverse-engineer quality rankings. Mach. Learn., 88(3):369–398, Sept. 2012.

[6] O. Chapelle and Y. Chang. Yahoo! learning to rank challenge overview. J. of Mach. Learn. Res., 14:1–24, 2011.

[7] S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and scoring using empirical risk minimization. In Proc. of the 18th Conf. on Learning Theory, COLT'05, pages 1–15, 2005.

[8] D. Cossock and T. Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Trans. Info. Theory, 54:5140–5154, 2008.

[9] O. Dekel, C. D. Manning, and Y. Singer. Log-linear models for label ranking. In Proc. of Advances in Neural Information Processing Systems (NIPS), 2003.

[10] C. B. Do, Q. Le, C. H. Teo, O. Chapelle, and A. Smola. Tighter bounds for structured estimation. In Proc. of Adv. in Neural Inf. Processing Syst., pages 281–288, 2008.

[11] J. Duchi, L. W. Mackey, and M. I. Jordan. On the consistency of ranking algorithms. In Proc. of the Int. Conf. on Mach. Learn., pages 327–334, 2010.

[12] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. J. of Mach. Learn. Res., 4:933–969, 2003.

[13] E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16-17):1897–1916, Nov. 2008.

[14] T. Joachims. Optimizing search engines using clickthrough data. In Proc. of Know. Disc. and Dat. Mining (SIGKDD), pages 133–142, 2002.

[15] W. Kotlowski, K. Dembczynski, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In Proc. of the Intl. Conf. on Mach. Learn., pages 1113–1120, 2011.

[16] T.-Y. Liu.
Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3:225–331, March 2009.

[17] P. D. Ravikumar, A. Tewari, and E. Yang. On NDCG consistency of listwise ranking methods. J. of Mach. Learn. Res. - Proc. Track, 15:618–626, 2011.

[18] M. D. Reid and R. C. Williamson. Surrogate regret bounds for proper losses. In Proc. of the Intl. Conf. on Mach. Learn., pages 897–904, 2009.

[19] C. Scott. Surrogate losses and regret bounds for cost-sensitive classification with example-dependent costs. In Proc. of the Intl. Conf. on Mach. Learn., pages 153–160, 2011.

[20] I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.

[21] M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: optimizing non-smooth rank metrics. In Proceedings of the International Conference on Web Search and Web Data Mining, WSDM '08, pages 77–86, 2008.

[22] E. M. Voorhees and D. K. Harman, editors. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.

[23] R. A. Wijsman. Continuity of the Bayes risk. The Annals of Math. Stat., 41(3):1083–1085, 1970.

[24] J. Xu, T.-Y. Liu, M. Lu, H. Li, and W.-Y. Ma. Directly optimizing evaluation measures in learning to rank. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pages 107–114, 2008.

[25] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In Proc. of the ACM SIGIR Intl. Conf. on Res. and Dev. in Info. Retr., pages 271–278, 2007.

[26] T. Zhang. Statistical analysis of some multi-category large margin classification methods. J. of Mach. Learn. Res., 5:1225–1251, 2004.

[27] T. Zhang.
Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Stat., 32(1):56–85, 2004.
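As an editorial aside (not part of the original paper), the rearrangement-inequality argument at the heart of Lemma 7 can be checked numerically: with strictly decreasing $\alpha$s and strictly increasing $\beta$s, the risk $R(p_\sigma, \sigma') = \sum_k \alpha_{\sigma^{-1}(k)} \beta_{\sigma'^{-1}(k)}$ should be uniquely minimized at $\sigma' = \sigma$. The sketch below verifies this by brute force over $\mathcal{S}_4$; the particular values of `alpha` and `beta` are arbitrary choices satisfying the lemma's assumptions, not taken from the paper.

```python
from itertools import permutations

def inverse(perm):
    """Inverse of a 0-based permutation given as a tuple."""
    inv = [0] * len(perm)
    for pos, item in enumerate(perm):
        inv[item] = pos
    return tuple(inv)

def risk(alpha, beta, sigma, sigma_prime):
    """R(p_sigma, sigma') = sum_k alpha[sigma^{-1}(k)] * beta[sigma'^{-1}(k)]."""
    inv_s, inv_sp = inverse(sigma), inverse(sigma_prime)
    return sum(alpha[inv_s[k]] * beta[inv_sp[k]] for k in range(len(alpha)))

n = 4
alpha = [0.4, 0.3, 0.2, 0.1]   # alpha_1 > ... > alpha_n > 0, summing to 1
beta = [1.0, 2.0, 3.0, 4.0]    # strictly increasing, no ties

for sigma in permutations(range(n)):
    values = {sp: risk(alpha, beta, sigma, sp) for sp in permutations(range(n))}
    best = min(values.values())
    minimizers = [sp for sp, v in values.items() if abs(v - best) < 1e-12]
    # Lemma 7's claim: the unique minimizer of R(p_sigma, .) is sigma itself
    assert minimizers == [sigma]

print("unique-minimizer check passed for all 24 permutations of S_4")
```

Introducing ties in either sequence (e.g. `alpha = [0.3, 0.3, 0.2, 0.2]`) makes several permutations attain the minimum, which is exactly why Assumption (A) requires tie-free $\alpha$s and $\beta$s.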