{"title": "Ranking Measures and Loss Functions in Learning to Rank", "book": "Advances in Neural Information Processing Systems", "page_first": 315, "page_last": 323, "abstract": "Learning to rank has become an important research topic in machine learning. While most learning-to-rank methods learn the ranking function by minimizing the loss functions, it is the ranking measures (such as NDCG and MAP) that are used to evaluate the performance of the learned ranking function. In this work, we reveal the relationship between ranking measures and loss functions in learning-to-rank methods, such as Ranking SVM, RankBoost, RankNet, and ListMLE. We show that these loss functions are upper bounds of the measure-based ranking errors. As a result, the minimization of these loss functions will lead to the maximization of the ranking measures. The key to obtaining this result is to model ranking as a sequence of classification tasks, and define a so-called essential loss as the weighted sum of the classification errors of individual tasks in the sequence. We have proved that the essential loss is both an upper bound of the measure-based ranking errors, and a lower bound of the loss functions in the aforementioned methods. Our proof technique also suggests a way to modify existing loss functions to make them tighter bounds of the measure-based ranking errors. 
Experimental results on benchmark datasets show that the modifications can lead to better ranking performance, demonstrating the correctness of our analysis.", "full_text": "Ranking Measures and Loss Functions in Learning to Rank\n\nWei Chen\u2217\nChinese Academy of Sciences\nchenwei@amss.ac.cn\n\nTie-Yan Liu\nMicrosoft Research Asia\ntyliu@microsoft.com\n\nYanyan Lan\nChinese Academy of Sciences\nlanyanyan@amss.ac.cn\n\nZhiming Ma\nChinese Academy of Sciences\nmazm@amt.ac.cn\n\nHang Li\nMicrosoft Research Asia\nhangli@microsoft.com\n\nAbstract\n\nLearning to rank has become an important research topic in machine learning. While most learning-to-rank methods learn the ranking functions by minimizing loss functions, it is the ranking measures (such as NDCG and MAP) that are used to evaluate the performance of the learned ranking functions. In this work, we reveal the relationship between ranking measures and loss functions in learning-to-rank methods, such as Ranking SVM, RankBoost, RankNet, and ListMLE. We show that the loss functions of these methods are upper bounds of the measure-based ranking errors. As a result, the minimization of these loss functions will lead to the maximization of the ranking measures. The key to obtaining this result is to model ranking as a sequence of classification tasks, and define a so-called essential loss for ranking as the weighted sum of the classification errors of individual tasks in the sequence. We have proved that the essential loss is both an upper bound of the measure-based ranking errors, and a lower bound of the loss functions in the aforementioned methods. Our proof technique also suggests a way to modify existing loss functions to make them tighter bounds of the measure-based ranking errors. 
Experimental results on benchmark datasets show that the modifications can lead to better ranking performance, demonstrating the correctness of our theoretical analysis.\n\n1 Introduction\n\nLearning to rank has become an important research topic in many fields, such as machine learning and information retrieval. The process of learning to rank is as follows. In training, a number of sets are given, each set consisting of objects and labels representing their rankings (e.g., in terms of multi-level ratings\u00b9). Then a ranking function is constructed by minimizing a certain loss function on the training data. In testing, given a new set of objects, the ranking function is applied to produce a ranked list of the objects.\n\n\u2217 The work was performed when the first and the third authors were interns at Microsoft Research Asia.\n\u00b9 In information retrieval, such a label represents the relevance of a document to the given query.\n\nMany learning-to-rank methods have been proposed in the literature, with different motivations and formulations. In general, these methods can be divided into three categories [3]. The pointwise approach, such as subset regression [5] and McRank [10], views each single object as the learning instance. The pairwise approach, such as Ranking SVM [7], RankBoost [6], and RankNet [2], regards a pair of objects as the learning instance. The listwise approach, such as ListNet [3] and ListMLE [16], takes the entire ranked list of objects as the learning instance. Almost all these methods learn their ranking functions by minimizing certain loss functions, namely the pointwise, pairwise, and listwise losses. On the other hand, however, it is the ranking measures that are used to evaluate the performance of the learned ranking functions. 
Taking information retrieval as an example, measures such as Normalized Discounted Cumulative Gain (NDCG) [8] and Mean Average Precision (MAP) [1] are widely used, which obviously differ from the loss functions used in the aforementioned methods. In such a situation, a natural question to ask is whether the minimization of the loss functions can really lead to the optimization of the ranking measures.\u00b2\n\nPeople have tried to answer this question. It has been proved in [5] and [10] that the regression and classification based losses used in the pointwise approach are upper bounds of (1\u2212NDCG). However, for the pairwise and listwise approaches, which are regarded as the state of the art in learning to rank [3, 11], limited results have been obtained. The motivation of this work is to reveal the relationship between ranking measures and the pairwise/listwise losses.\n\nThe problem is non-trivial to solve, however. Note that ranking measures like NDCG and MAP are defined with the labels of objects (i.e., in terms of multi-level ratings). Therefore it is relatively easy to establish the connection between the pointwise losses and the ranking measures, since the pointwise losses are also defined with the labels of objects. In contrast, the pairwise and listwise losses are defined with the partial or total order relations among objects, rather than their individual labels. As a result, it is much more difficult to bridge the gap between the pairwise/listwise losses and the ranking measures.\n\nTo tackle the challenge, we propose making a transformation of the labels on objects to a permutation set. All the permutations in the set are consistent with the labels, in the sense that an object with a higher rating is ranked before another object with a lower rating in the permutation. We then define an essential loss for ranking on the permutation set as follows. 
First, for each permutation, we construct a sequence of classification tasks, with the goal of each task being to distinguish an object from the objects ranked below it in the permutation. Second, the weighted sum of the classification errors of individual tasks in the sequence is computed. Third, the essential loss is defined as the minimum value of the weighted sum over all the permutations in the set.\n\nOur study shows that the essential loss has several nice properties, which help us reveal the relationship between ranking measures and the pairwise/listwise losses. First, it can be proved that the essential loss is an upper bound of measure-based ranking errors such as (1\u2212NDCG) and (1\u2212MAP). Furthermore, the zero value of the essential loss is a sufficient and necessary condition for the zero values of (1\u2212NDCG) and (1\u2212MAP). Second, it can be proved that the pairwise losses in Ranking SVM, RankBoost, and RankNet, and the listwise loss in ListMLE are all upper bounds of the essential loss. As a consequence, we come to the conclusion that the loss functions used in these methods can bound (1\u2212NDCG) and (1\u2212MAP) from above. In other words, the minimization of these loss functions can effectively maximize NDCG and MAP.\n\nThe proofs of the above results suggest a way to modify existing pairwise/listwise losses so as to make them tighter bounds of (1\u2212NDCG). We hypothesize that tighter bounds will lead to better ranking performance; we tested this hypothesis using benchmark datasets. The experimental results show that the methods minimizing the modified losses can outperform the original methods, as well as many other baseline methods. 
This validates the correctness of our theoretical analysis.\n\n2 Related work\n\nIn this section, we review the widely-used loss functions in learning to rank, ranking measures in information retrieval, and previous work on the relationship between loss functions and ranking measures.\n\n\u00b2 Note that people have recently tried to directly optimize ranking measures [17, 12, 14, 18]. The relationship between ranking measures and the loss functions in such work is explicitly known. However, for other methods, the relationship is unclear.\n\n2.1 Loss functions in learning to rank\n\nLet x = {x1, ..., xn} be the objects to be ranked.\u00b3 Suppose the labels of the objects are given as multi-level ratings L = {l(1), ..., l(n)}, where l(i) \u2208 {r1, ..., rK} denotes the label of xi [11]. Without loss of generality, we assume l(i) \u2208 {0, 1, ..., K \u2212 1} and name the corresponding labels as K-level ratings. If l(i) > l(j), then xi should be ranked before xj. Let F be the function class and f \u2208 F be a ranking function. The optimal ranking function is learned from the training data by minimizing a certain loss function defined on the objects, their labels, and the ranking function. Several approaches have been proposed to learn the optimal ranking function.\n\nIn the pointwise approach, the loss function is defined on the basis of single objects. For example, in subset regression [5], the loss function is as follows,\n\n$$L_r(f; x, L) = \\sum_{i=1}^{n} \\bigl(f(x_i) - l(i)\\bigr)^2. \\tag{1}$$\n\nIn the pairwise approach, the loss function is defined on the basis of pairs of objects whose labels are different. For example, the loss functions of Ranking SVM [7], RankBoost [6], and RankNet [2] all have the following form,\n\n$$L_p(f; x, L) = \\sum_{s=1}^{n-1} \\sum_{i=1,\\, l(i) < l(s)}^{n} \\phi\\bigl(f(x_s) - f(x_i)\\bigr), \\tag{2}$$\n\nwhere \u03c6 is a non-negative, non-increasing function that differs among the three algorithms (the hinge, exponential, and logistic functions, respectively).\n\nIn the listwise approach, the loss function is defined on the basis of all n objects. For example, the loss function of ListMLE [16] is\n\n$$L_l(f; x, y) = \\sum_{s=1}^{n-1} \\Bigl(-f(x_{y(s)}) + \\ln \\sum_{i=s}^{n} \\exp\\bigl(f(x_{y(i)})\\bigr)\\Bigr), \\tag{3}$$\n\nwhere y is a permutation (i.e., a ranked list) that satisfies the following condition: for any two objects xi and xj, if l(i) > l(j), then xi is ranked before xj in y. 
Notation y(i) represents the index of the object ranked at the i-th position in y.\n\n2.2 Ranking measures\n\nSeveral ranking measures have been proposed in the literature to evaluate the performance of a ranking function. Here we introduce two of them, NDCG [8] and MAP [1], which are popularly used in information retrieval.\n\nNDCG is defined with respect to K-level ratings L,\n\n$$\\mathrm{NDCG}(f; x, L) = \\frac{1}{N_n} \\sum_{r=1}^{n} G\\bigl(l(\\pi_f(r))\\bigr) D(r),$$\n\nwhere \u03c0f is the ranked list produced by ranking function f, G is an increasing function (named the gain function), D is a decreasing function (named the position discount function), and $N_n = \\max_{\\pi} \\sum_{r=1}^{n} G\\bigl(l(\\pi(r))\\bigr) D(r)$. In practice, one usually sets $G(z) = 2^z - 1$; $D(z) = \\frac{1}{\\log_2(1+z)}$ if z \u2264 C, and D(z) = 0 if z > C (C is a fixed integer).\n\nMAP is defined with respect to 2-level ratings as follows,\n\n$$\\mathrm{MAP}(f; x, L) = \\frac{1}{n_1} \\sum_{s:\\, l(\\pi_f(s)) = 1} \\frac{\\sum_{i \\le s} I\\{l(\\pi_f(i)) = 1\\}}{s}, \\tag{4}$$\n\nwhere I{\u00b7} is the indicator function, and n1 is the number of objects with label 1. When the labels are given in terms of K-level ratings (K > 2), a common practice of using MAP is to fix a level k\u2217, and regard all the objects whose levels are lower than k\u2217 as having label 0, and regard the other objects as having label 1 [11].\n\nFrom the definitions of NDCG and MAP, we can see that their maximum values are both one. Therefore, we can consider (1\u2212NDCG) and (1\u2212MAP) as ranking errors. 
For ease of reference, we call them measure-based ranking errors.\n\n\u00b3 For example, for information retrieval, x represents the documents associated with a query.\n\n2.3 Previous bounds\n\nFor the pointwise approach, the following results have been obtained in [5] and [10].\u2074\n\nThe regression based pointwise loss is an upper bound of (1\u2212NDCG),\n\n$$1 - \\mathrm{NDCG}(f; x, L) \\le \\frac{1}{N_n} \\Bigl(2 \\sum_{i=1}^{n} D(i)^2\\Bigr)^{1/2} L_r(f; x, L)^{1/2}.$$\n\nThe classification based pointwise loss is also an upper bound of (1\u2212NDCG),\n\n$$1 - \\mathrm{NDCG}(f; x, L) \\le \\frac{15\\sqrt{2}}{N_n} \\Bigl(\\sum_{i=1}^{n} D(i)^2 - n \\prod_{i=1}^{n} D(i)^{2/n}\\Bigr)^{1/2} \\Bigl(\\sum_{i=1}^{n} I\\{\\hat{l}(i) \\ne l(i)\\}\\Bigr)^{1/2},$$\n\nwhere $\\hat{l}(i)$ is the label of object xi predicted by the classifier, in the setting of 5-level ratings.\n\nFor the pairwise approach, the following result has been obtained [9],\n\n$$1 - \\mathrm{MAP}(f; x, L) \\le 1 - \\frac{1}{n_1} \\bigl(L_p(f; x, L) + C_{n_1+1}^{2}\\bigr)^{-1} \\Bigl(\\sum_{i=1}^{n_1} \\sqrt{i}\\Bigr)^2.$$\n\nAccording to the above results, minimizing the regression and classification based pointwise losses will minimize (1\u2212NDCG). Note that the zero values of these two losses are sufficient but not necessary conditions for the zero value of (1\u2212NDCG). That is, when (1\u2212NDCG) is zero, the loss functions may still be very large [10]. For the pairwise losses, the result is even weaker: their zero values are not even sufficient for the zero value of (1\u2212MAP).\n\nTo the best of our knowledge, there was no other theoretical result for the pairwise/listwise losses. Given that the pairwise and listwise approaches are regarded as the state of the art in learning to rank [3, 11], it is meaningful and important to perform a more comprehensive analysis of these two approaches.\n\n3 Main results\n\nIn this section, we present our main results on the relationship between ranking measures and the pairwise/listwise losses. 
The basic conclusion is that many pairwise and listwise losses are upper bounds of a quantity which we call the essential loss, and the essential loss is an upper bound of both (1\u2212NDCG) and (1\u2212MAP). Furthermore, the zero value of the essential loss is a sufficient and necessary condition for the zero values of (1\u2212NDCG) and (1\u2212MAP).\n\n3.1 Essential loss: ranking as a sequence of classifications\n\nIn this subsection, we describe the essential loss for ranking.\n\nFirst, we propose an alternative representation of the labels of objects (i.e., multi-level ratings). The basic idea is to construct a permutation set, with all the permutations in the set being consistent with the labels. The definition that a permutation is consistent with multi-level ratings is given below.\n\nDefinition 1. Given multi-level ratings L and permutation y, we say y is consistent with L, if \u2200i, s \u2208 {1, ..., n} satisfying i < s, we always have l(y(i)) \u2265 l(y(s)), where y(i) represents the index of the object that is ranked at the i-th position in y. We denote YL = {y | y is consistent with L}.\n\nAccording to the definition, it is clear that the NDCG and MAP of a ranking function equal one, if and only if the ranked list (permutation) given by the ranking function is consistent with the labels.\n\nSecond, given each permutation y \u2208 YL, we decompose the ranking of objects x into several sequential steps. For each step s, we distinguish xy(s), the object ranked at the s-th position in y, from all the other objects ranked below the s-th position in y, using ranking function f.\u2075 Specifically, we denote x(s) = {xy(s), ..., xy(n)} and define a classifier based on f, whose target output is y(s),\n\n$$T_f(x_{(s)}) = \\arg\\max_{j \\in \\{y(s), \\ldots, y(n)\\}} f(x_j). \\tag{5}$$\n\n\u2074 Note that the bounds given in the original papers of [5] and [10] are with respect to DCG. 
Here we give their equivalent forms in terms of NDCG, and set $P(\\cdot \\mid x_i, S) = \\delta_{l(i)}(\\cdot)$ in the bound of [5], for ease of comparison.\n\u2075 For simplicity and clarity, we assume f(xi) \u2260 f(xj) \u2200i \u2260 j, such that the classifier will have a unique output. It can be proved (see [4]) that the main results in this paper still hold without this assumption.\n\nIt is clear that there are n \u2212 s + 1 possible outputs of this classifier, i.e., {y(s), ..., y(n)}. The 0-1 loss for this classification task can be written as follows, where the second equality is based on the definition of Tf,\n\n$$l_s\\bigl(f; x_{(s)}, y(s)\\bigr) = I\\{T_f(x_{(s)}) \\ne y(s)\\} = 1 - \\prod_{i=s+1}^{n} I\\{f(x_{y(s)}) > f(x_{y(i)})\\}.$$\n\nWe give a simple example in Figure 1 to illustrate the aforementioned process of decomposition.\n\ny = (A, B, C), \u03c0 = (B, A, C) [incorrect] \u21d2 remove A \u21d2 y = (B, C), \u03c0 = (B, C) [correct] \u21d2 remove B \u21d2 y = (C), \u03c0 = (C)\n\nFigure 1: Modeling ranking as a sequence of classifications\n\nSuppose there are three objects, A, B, and C, and a permutation y = (A, B, C). Suppose the output of the ranking function for these objects is (2, 3, 1), and accordingly the predicted ranked list is \u03c0 = (B, A, C). At step one of the decomposition, the ranking function predicts object B to be on the top of the list. However, A should be on the top according to y. Therefore, a prediction error occurs. For step two, we remove A from both y and \u03c0. Then the ranking function predicts object B to be on the top of the remaining list. This is in accordance with y and there is no prediction error. After that, we further remove object B, and it is easy to verify there is no prediction error in step three either. 
Overall, the ranking function makes one error in this sequence of classification tasks.\n\nThird, we assign a non-negative weight \u03b2(s) (s = 1, ..., n \u2212 1) to the classification task at the s-th step, representing its importance to the entire sequence. We compute the weighted sum of the classification errors of all individual tasks,\n\n$$L_\\beta(f; x, y) \\triangleq \\sum_{s=1}^{n-1} \\beta(s) \\Bigl(1 - \\prod_{i=s+1}^{n} I\\{f(x_{y(s)}) > f(x_{y(i)})\\}\\Bigr), \\tag{6}$$\n\nand then define the minimum value of the weighted sum over all the permutations in YL as the essential loss for ranking,\n\n$$L_\\beta(f; x, L) = \\min_{y \\in Y_L} L_\\beta(f; x, y). \\tag{7}$$\n\nAccording to the above definition of the essential loss, we can obtain the following nice property. Denote the ranked list produced by f as \u03c0f. Then it is easy to verify that,\n\n$$L_\\beta(f; x, L) = 0 \\iff \\exists y \\in Y_L \\text{ satisfying } L_\\beta(f; x, y) = 0 \\iff \\pi_f = y \\in Y_L.$$\n\nIn other words, the essential loss is zero if and only if the permutation given by the ranking function is consistent with the labels. Further considering the discussions on the consistent permutation at the beginning of this subsection, we can come to the conclusion that the zero value of the essential loss is a sufficient and necessary condition for the zero values of (1\u2212NDCG) and (1\u2212MAP).\n\n3.2 Essential loss: upper bound of measure-based ranking errors\n\nIn this subsection, we show that the essential loss is an upper bound of (1\u2212NDCG) and (1\u2212MAP), when specific weights \u03b2(s) are used.\n\nTheorem 1. 
Given K-level rating data (x, L) with nk objects having label k and $\\sum_{i=k^*}^{K} n_i > 0$, then \u2200f, the following inequalities hold,\n\n(1) $1 - \\mathrm{NDCG}(f; x, L) \\le \\frac{1}{N_n} L_{\\beta_1}(f; x, L)$, where $\\beta_1(s) = G\\bigl(l(y(s))\\bigr) D(s)$, \u2200y \u2208 YL;\n\n(2) $1 - \\mathrm{MAP}(f; x, L) \\le \\frac{1}{\\sum_{i=k^*}^{K} n_i} L_{\\beta_2}(f; x, L)$, where $\\beta_2(s) \\equiv 1$.\n\nProof. (1) We now prove the inequality for (1\u2212NDCG). First, we reformulate NDCG using the permutation set YL. This can be done by changing the index of the sum in NDCG from the rank position r in \u03c0f to the rank position s in \u2200y \u2208 YL. Considering that $s = y^{-1}\\bigl(\\pi_f(r)\\bigr)$ and $r = \\pi_f^{-1}\\bigl(y(s)\\bigr)$, it is easy to verify,\n\n$$\\mathrm{NDCG}(f; x, L) = \\frac{1}{N_n} \\sum_{s=1}^{n} G\\Bigl(l\\bigl(\\pi_f(\\pi_f^{-1}(y(s)))\\bigr)\\Bigr) D\\bigl(\\pi_f^{-1}(y(s))\\bigr) = \\frac{1}{N_n} \\sum_{s=1}^{n} G\\bigl(l(y(s))\\bigr) D\\bigl(\\pi_f^{-1}(y(s))\\bigr).$$\n\nSecond, we consider the essential loss case by case. Note that\n\n$$L_{\\beta_1}(f; x, L) = \\min_{y \\in Y_L} \\sum_{s=1}^{n-1} G\\bigl(l(y(s))\\bigr) D(s) \\Bigl(1 - \\prod_{i=s+1}^{n} I\\{\\pi_f^{-1}(y(s)) < \\pi_f^{-1}(y(i))\\}\\Bigr).$$\n\nThen \u2200y \u2208 YL, if position s satisfies $\\prod_{i=s+1}^{n} I\\{\\pi_f^{-1}(y(s)) < \\pi_f^{-1}(y(i))\\} = 1$ (i.e., \u2200i > s, $\\pi_f^{-1}(y(s)) < \\pi_f^{-1}(y(i))$), we have $\\pi_f^{-1}(y(s)) \\le s$. As a consequence, $D(s) \\prod_{i=s+1}^{n} I\\{\\pi_f^{-1}(y(s)) < \\pi_f^{-1}(y(i))\\} = D(s) \\le D\\bigl(\\pi_f^{-1}(y(s))\\bigr)$. Otherwise, if $\\prod_{i=s+1}^{n} I\\{\\pi_f^{-1}(y(s)) < \\pi_f^{-1}(y(i))\\} = 0$, it is easy to see that $D(s) \\prod_{i=s+1}^{n} I\\{\\pi_f^{-1}(y(s)) < \\pi_f^{-1}(y(i))\\} = 0 \\le D\\bigl(\\pi_f^{-1}(y(s))\\bigr)$. To sum up, \u2200s \u2208 {1, 2, ..., n \u2212 1}, $D(s) \\prod_{i=s+1}^{n} I\\{\\pi_f^{-1}(y(s)) < \\pi_f^{-1}(y(i))\\} \\le D\\bigl(\\pi_f^{-1}(y(s))\\bigr)$. Further considering that $\\pi_f^{-1}(y(n)) \\le n$ and that D(\u00b7) is a decreasing function, we have $D(n) \\le D\\bigl(\\pi_f^{-1}(y(n))\\bigr)$. As a result, we obtain,\n\n$$1 - \\mathrm{NDCG}(f; x, L) = \\frac{1}{N_n} \\sum_{s=1}^{n} G\\bigl(l(y(s))\\bigr) \\Bigl(D(s) - D\\bigl(\\pi_f^{-1}(y(s))\\bigr)\\Bigr) \\le \\frac{1}{N_n} L_{\\beta_1}(f; x, L).$$\n\n(2) We then prove the inequality for (1\u2212MAP). First, we prove the result for 2-level ratings. Given 2-level rating data (x, L), it can be proved (see Lemma 1 in [4]) that $L_{\\beta_2}(f; x, L) = n_1 - i_0 + 1$, where i0 denotes the position of the first object with label 0 in \u03c0f, and i0 \u2264 n1 + 1. We then consider\n\n$$n_1 \\bigl(1 - \\mathrm{MAP}(f; x, L)\\bigr) = n_1 - \\sum_{s:\\, l(\\pi_f(s)) = 1} \\frac{\\sum_{i \\le s} I\\{l(\\pi_f(i)) = 1\\}}{s}$$\n\ncase by case. If i0 > n1 (i.e., the first object with label 0 is ranked after position n1 in \u03c0f), then all the objects with label 1 are ranked before the objects with label 0. Thus $n_1(1 - \\mathrm{MAP}(f; x, L)) = n_1 - n_1 = 0 = L_{\\beta_2}(f; x, L)$. If i0 \u2264 n1, there are i0 \u2212 1 objects with label 1 ranked before all the objects with label 0. Thus $n_1(1 - \\mathrm{MAP}(f; x, L)) \\le n_1 - i_0 + 1 = L_{\\beta_2}(f; x, L)$. This proves the theorem for 2-level ratings.\n\nSecond, given K-level rating data (x, L), we denote the 2-level ratings induced by L as L'. Then it is easy to verify $Y_L \\subseteq Y_{L'}$. As a result, we have,\n\n$$L_{\\beta_2}(f; x, L') = \\min_{y \\in Y_{L'}} L_{\\beta_2}(f; x, y) \\le \\min_{y \\in Y_L} L_{\\beta_2}(f; x, y) = L_{\\beta_2}(f; x, L).$$\n\nUsing the result for 2-level ratings, we obtain\n\n$$1 - \\mathrm{MAP}(f; x, L) = 1 - \\mathrm{MAP}(f; x, L') \\le \\frac{1}{\\sum_{i=k^*}^{K-1} n_i} L_{\\beta_2}(f; x, L') \\le \\frac{1}{\\sum_{i=k^*}^{K-1} n_i} L_{\\beta_2}(f; x, L).$$\n\n3.3 Essential loss: lower bound of loss functions\n\nIn this subsection, we show that many pairwise/listwise losses are upper bounds of the essential loss.\n\nTheorem 2. The pairwise losses in Ranking SVM, RankBoost, and RankNet, and the listwise loss in ListMLE are all upper bounds of the essential loss, i.e.,\n\n(1) $L_\\beta(f; x, L) \\le \\bigl(\\max_{1 \\le s \\le n-1} \\beta(s)\\bigr) L_p(f; x, L)$;\n\n(2) $L_\\beta(f; x, L) \\le \\frac{1}{\\ln 2} \\bigl(\\max_{1 \\le s \\le n-1} \\beta(s)\\bigr) L_l(f; x, y)$, \u2200y \u2208 YL.\n\nProof. (1) We now prove the inequality for the pairwise losses. First, we reformulate the pairwise losses using permutation set YL,\n\n$$L_p(f; x, L) = \\sum_{s=1}^{n-1} \\sum_{i=s+1,\\, l(y(s)) \\ne l(y(i))}^{n} \\phi\\bigl(f(x_{y(s)}) - f(x_{y(i)})\\bigr) = \\sum_{s=1}^{n-1} \\sum_{i=s+1}^{n} a\\bigl(y(i), y(s)\\bigr) \\phi\\bigl(f(x_{y(s)}) - f(x_{y(i)})\\bigr),$$\n\nwhere y is an arbitrary permutation in YL, a(i, j) = 1 if l(i) \u2260 l(j); a(i, j) = 0 otherwise. Note that only those pairs whose first object has a larger label than the second one are counted in the pairwise loss. Thus, the value of the pairwise loss is the same \u2200y \u2208 YL.\n\nSecond, we consider the value of $a\\bigl(T_f(x_{(s)}), y(s)\\bigr)$ case by case. 
\u2200y and \u2200s \u2208 {1, 2, ..., n \u2212 1}, if $a\\bigl(T_f(x_{(s)}), y(s)\\bigr) = 1$ (i.e., \u2203i0 > s, satisfying l(y(i0)) \u2260 l(y(s)) and $f(x_{y(i_0)}) > f(x_{y(s)})$), considering that the functions \u03c6 in Ranking SVM, RankBoost and RankNet are all non-negative and non-increasing, with \u03c6(0) = 1, we have,\n\n$$\\sum_{i=s+1}^{n} a\\bigl(y(i), y(s)\\bigr) \\phi\\bigl(f(x_{y(s)}) - f(x_{y(i)})\\bigr) \\ge a\\bigl(y(i_0), y(s)\\bigr) \\phi\\bigl(f(x_{y(s)}) - f(x_{y(i_0)})\\bigr) = \\phi\\bigl(f(x_{y(s)}) - f(x_{y(i_0)})\\bigr) > 1 = a\\bigl(T_f(x_{(s)}), y(s)\\bigr).$$\n\nIf $a\\bigl(T_f(x_{(s)}), y(s)\\bigr) = 0$, it is clear that $\\sum_{i=s+1}^{n} a\\bigl(y(i), y(s)\\bigr) \\phi\\bigl(f(x_{y(s)}) - f(x_{y(i)})\\bigr) \\ge 0 = a\\bigl(T_f(x_{(s)}), y(s)\\bigr)$. Therefore,\n\n$$\\sum_{s=1}^{n-1} \\beta(s) \\sum_{i=s+1}^{n} a\\bigl(y(i), y(s)\\bigr) \\phi\\bigl(f(x_{y(s)}) - f(x_{y(i)})\\bigr) \\ge \\sum_{s=1}^{n-1} \\beta(s)\\, a\\bigl(T_f(x_{(s)}), y(s)\\bigr). \\tag{8}$$\n\nThird, it can be proved (see Lemma 2 in [4]) that the following inequality holds,\n\n$$L_\\beta(f; x, L) \\le \\max_{y \\in Y_L} \\sum_{s=1}^{n-1} \\beta(s)\\, a\\bigl(T_f(x_{(s)}), y(s)\\bigr).$$\n\nConsidering inequality (8) and noticing that the pairwise losses are equal \u2200y \u2208 YL, we have\n\n$$L_\\beta(f; x, L) \\le \\max_{y \\in Y_L} \\sum_{s=1}^{n-1} \\beta(s) \\sum_{i=s+1}^{n} a\\bigl(y(i), y(s)\\bigr) \\phi\\bigl(f(x_{y(s)}) - f(x_{y(i)})\\bigr) \\le \\bigl(\\max_{1 \\le s \\le n-1} \\beta(s)\\bigr) L_p(f; x, L).$$\n\n(2) We then prove the inequality for the loss function of ListMLE. Again, we prove the result case by case. Consider the loss of ListMLE in Eq. (3). \u2200y and \u2200s \u2208 {1, 2, ..., n \u2212 1}, if $I\\{T_f(x_{(s)}) \\ne y(s)\\} = 1$ (i.e., \u2203i0 > s satisfying $f(x_{y(i_0)}) > f(x_{y(s)})$), then $e^{f(x_{y(s)})} < \\frac{1}{2} \\sum_{i=s}^{n} e^{f(x_{y(i)})}$. Therefore, we have $-\\ln \\frac{e^{f(x_{y(s)})}}{\\sum_{i=s}^{n} e^{f(x_{y(i)})}} > \\ln 2 = \\ln 2 \\cdot I\\{T_f(x_{(s)}) \\ne y(s)\\}$. If $I\\{T_f(x_{(s)}) \\ne y(s)\\} = 0$, then it is clear that $-\\ln \\frac{e^{f(x_{y(s)})}}{\\sum_{i=s}^{n} e^{f(x_{y(i)})}} > 0 = \\ln 2 \\cdot I\\{T_f(x_{(s)}) \\ne y(s)\\}$. To sum up, we have,\n\n$$\\sum_{s=1}^{n-1} \\beta(s) \\Bigl(-\\ln \\frac{e^{f(x_{y(s)})}}{\\sum_{i=s}^{n} e^{f(x_{y(i)})}}\\Bigr) > \\sum_{s=1}^{n-1} \\beta(s) \\ln 2 \\cdot I\\{T_f(x_{(s)}) \\ne y(s)\\} \\ge \\ln 2 \\min_{y \\in Y_L} L_\\beta(f; x, y) = \\ln 2 \\cdot L_\\beta(f; x, L).$$\n\nBy further relaxing the inequality, we obtain the following result,\n\n$$L_\\beta(f; x, L) \\le \\frac{1}{\\ln 2} \\bigl(\\max_{1 \\le s \\le n-1} \\beta(s)\\bigr) L_l(f; x, y), \\quad \\forall y \\in Y_L.$$\n\n3.4 Summary\n\nWe have the following inequalities by combining the results obtained in the previous subsections.\n\n(1) The pairwise losses in Ranking SVM, RankBoost, and RankNet are upper bounds of (1\u2212NDCG) and (1\u2212MAP).\n\n$$1 - \\mathrm{NDCG}(f; x, L) \\le \\frac{G(K-1) D(1)}{N_n} L_p(f; x, L);$$\n\n$$1 - \\mathrm{MAP}(f; x, L) \\le \\frac{1}{\\sum_{i=k^*}^{K} n_i} L_p(f; x, L).$$\n\n(2) The listwise loss in ListMLE is an upper bound of (1\u2212NDCG) and (1\u2212MAP).\n\n$$1 - \\mathrm{NDCG}(f; x, L) \\le \\frac{G(K-1) D(1)}{N_n \\ln 2} L_l(f; x, y), \\quad \\forall y \\in Y_L;$$\n\n$$1 - \\mathrm{MAP}(f; x, L) \\le \\frac{1}{\\ln 2 \\sum_{i=k^*}^{K} n_i} L_l(f; x, y), \\quad \\forall y \\in Y_L.$$\n\nTable 1: Ranking accuracy on OHSUMED\n\nMethods | RankNet | W-RankNet | ListMLE | W-ListMLE\nNDCG@5 | 0.4568 | 0.4868 | 0.4471 | 0.4588\nNDCG@10 | 0.4414 | 0.4604 | 0.4347 | 0.4453\n\nMethods | Regression | Ranking SVM | RankBoost | FRank | ListNet | SVMMAP\nNDCG@5 | 0.4278 | 0.4164 | 0.4494 | 0.4588 | 0.4432 | 0.4516\nNDCG@10 | 0.4110 | 0.414 | 0.4302 | 0.4433 | 0.441 | 0.4319\n\n4 Discussion\n\nThe proofs of Theorems 1 and 2 actually suggest a way to improve existing loss functions. 
The key idea is to introduce weights related to \u03b21(s) to the loss functions so as to make them tighter bounds of (1\u2212NDCG).\n\nSpecifically, we introduce weights to the pairwise and listwise losses in the following way,\n\n$$\\tilde{L}_p(f; x, L) = \\sum_{s=1}^{n-1} G\\bigl(l(y(s))\\bigr) D\\Bigl(1 + \\sum_{k=l(y(s))+1}^{K-1} n_k\\Bigr) \\sum_{i=s+1}^{n} a\\bigl(y(i), y(s)\\bigr) \\phi\\bigl(f(x_{y(s)}) - f(x_{y(i)})\\bigr), \\quad \\forall y \\in Y_L;$$\n\n$$\\tilde{L}_l(f; x, y) = \\sum_{s=1}^{n-1} G\\bigl(l(y(s))\\bigr) D(s) \\Bigl(-f(x_{y(s)}) + \\ln \\sum_{i=s}^{n} \\exp\\bigl(f(x_{y(i)})\\bigr)\\Bigr).$$\n\nIt can be proved (see Proposition 1 in [4]) that the above weighted losses are still upper bounds of (1\u2212NDCG) and they are lower bounds of the original pairwise and listwise losses. In other words, the above weighted loss functions are tighter bounds of (1\u2212NDCG) than existing loss functions.\n\nWe tested the effectiveness of the weighted loss functions on the OHSUMED dataset in LETOR 3.0.\u2076 We took RankNet and ListMLE as example algorithms. The methods that minimize the weighted loss functions are referred to as W-RankNet and W-ListMLE. From Table 1, we can see that (1) W-RankNet and W-ListMLE significantly outperform RankNet and ListMLE. (2) W-RankNet and W-ListMLE also outperform other baselines on LETOR such as Regression, Ranking SVM, RankBoost, FRank [15], ListNet and SVMMAP [18]. These experimental results seem to indicate that optimizing tighter bounds of the ranking measures can lead to better ranking performance.\n\n5 Conclusion and future work\n\nIn this work, we have proved that many pairwise/listwise losses in learning to rank are actually upper bounds of measure-based ranking errors. We have also shown a way to improve existing methods by introducing appropriate weights to their loss functions. Experimental results have validated our theoretical analysis. 
As future work, we plan to investigate the following issues.\n\n(1) We have modeled ranking as a sequence of classifications, when defining the essential loss. We believe this modeling has general implications for ranking, and we will explore its other uses.\n\n(2) We have taken NDCG and MAP as two examples in this work. We will study whether the essential loss is an upper bound of other measure-based ranking errors.\n\n(3) We have taken the loss functions in Ranking SVM, RankBoost, RankNet and ListMLE as examples in this study. We plan to investigate the loss functions in other pairwise and listwise ranking methods, such as RankCosine [13], ListNet [3], FRank [15] and QBRank [19].\n\n(4) While we have mainly discussed the upper-bound relationship in this work, we will study whether loss functions in existing learning-to-rank methods are statistically consistent with the essential loss and the measure-based ranking errors.\n\n\u2076 http://research.microsoft.com/~letor\n\nReferences\n\n[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, May 1999.\n\n[2] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML \u201905: Proceedings of the 22nd International Conference on Machine learning, pages 89\u201396, New York, NY, USA, 2005. ACM.\n\n[3] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In ICML \u201907: Proceedings of the 24th International Conference on Machine learning, pages 129\u2013136, New York, NY, USA, 2007. ACM.\n\n[4] W. Chen, T.-Y. Liu, Y. Lan, Z. Ma, and H. Li. Essential loss: Bridge the gap between ranking measures and loss functions in learning to rank. Technical report, Microsoft Research, MSR-TR-2009-141, 2009.\n\n[5] D. Cossock and T. Zhang. Statistical analysis of Bayes optimal subset ranking. 
Information Theory, 54:5140\u20135154, 2008.\n\n[6] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining\npreferences. Journal of Machine Learning Research, 4:933\u2013969, 2003.\n\n[7] R. Herbrich, K. Obermayer, and T. Graepel. Large margin rank boundaries for ordinal\nregression. In Advances in Large Margin Classifiers, pages 115\u2013132, Cambridge, MA, 1999.\nMIT.\n\n[8] K. J\u00e4rvelin and J. Kek\u00e4l\u00e4inen. Cumulated gain-based evaluation of IR techniques. ACM\nTransactions on Information Systems, 20(4):422\u2013446, 2002.\n\n[9] T. Joachims. Optimizing search engines using clickthrough data. In KDD \u201902: Proceedings\nof the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,\npages 133\u2013142, New York, NY, USA, 2002. ACM.\n\n[10] P. Li, C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and\ngradient boosting. In NIPS \u201907: Advances in Neural Information Processing Systems 20, pages\n897\u2013904, Cambridge, MA, 2008. MIT.\n\n[11] T.-Y. Liu, J. Xu, T. Qin, W.-Y. Xiong, and H. Li. LETOR: Benchmark dataset for research\non learning to rank for information retrieval. In SIGIR \u201907 Workshop, San Francisco, 2007.\nMorgan Kaufmann.\n\n[12] O. Chapelle, Q. Le, and A. Smola. Large margin optimization of ranking measures. In NIPS\nWorkshop on Machine Learning for Web Search 2007, 2007.\n\n[13] T. Qin, X.-D. Zhang, M.-F. Tsai, D.-S. Wang, T.-Y. Liu, and H. Li. Query-level loss functions\nfor information retrieval. Information Processing and Management, 44(2):838\u2013855, 2008.\n\n[14] M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: optimizing non-smooth rank\nmetrics. In Proceedings of the International Conference on Web Search and Web Data Mining,\npages 77\u201386, Palo Alto, California, USA, 2008. ACM.\n\n[15] M.-F. Tsai, T.-Y. Liu, T. Qin, H.-H. Chen, and W.-Y. Ma. 
FRank: a ranking method with fidelity\nloss. In SIGIR \u201907: Proceedings of the 30th annual ACM SIGIR conference, pages 383\u2013390,\nAmsterdam, The Netherlands, 2007. ACM.\n\n[16] F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank - theory\nand algorithm. In ICML \u201908: Proceedings of the 25th International Conference on Machine\nLearning, pages 1192\u20131199. Omnipress, 2008.\n\n[17] J. Xu and H. Li. AdaRank: a boosting algorithm for information retrieval. In SIGIR \u201907:\nProceedings of the 30th annual international ACM SIGIR conference on Research and\ndevelopment in information retrieval, pages 391\u2013398, 2007.\n\n[18] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing\naverage precision. In SIGIR \u201907: Proceedings of the 30th annual international ACM SIGIR\nconference on Research and development in information retrieval, pages 271\u2013278, New York,\nNY, USA, 2007. ACM.\n\n[19] Z. Zheng, H. Zha, T. Zhang, O. Chapelle, K. Chen, and G. Sun. A general boosting method\nand its application to learning ranking functions for web search. In NIPS \u201907: Advances in\nNeural Information Processing Systems 20, pages 1697\u20131704. MIT, Cambridge, MA, 2008.\n", "award": [], "sourceid": 493, "authors": [{"given_name": "Wei", "family_name": "Chen", "institution": null}, {"given_name": "Tie-yan", "family_name": "Liu", "institution": null}, {"given_name": "Yanyan", "family_name": "Lan", "institution": null}, {"given_name": "Zhi-ming", "family_name": "Ma", "institution": null}, {"given_name": "Hang", "family_name": "Li", "institution": null}]}