{"title": "Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space", "book": "Advances in Neural Information Processing Systems", "page_first": 1232, "page_last": 1240, "abstract": "This paper is concerned with the statistical consistency of ranking methods. Recently, it was proven that many commonly used pairwise ranking methods are inconsistent with the weighted pairwise disagreement loss (WPDL), which can be viewed as the true loss of ranking, even in a low-noise setting. This result is interesting but also surprising, given that the pairwise ranking methods have been shown very effective in practice. In this paper, we argue that the aforementioned result might not be conclusive, depending on what kind of assumptions are used. We give a new assumption that the labels of objects to rank lie in a rank-differentiable probability space (RDPS), and prove that the pairwise ranking methods become consistent with WPDL under this assumption. What is especially inspiring is that RDPS is actually not stronger than but similar to the low-noise setting. Our studies provide theoretical justifications of some empirical findings on pairwise ranking methods that are unexplained before, which bridge the gap between theory and applications.", "full_text": "Statistical Consistency of Ranking Methods in A\n\nRank-Differentiable Probability Space\n\nYanyan Lan\n\nInstitute of Computing Technology\n\nChinese Academy of Sciences\nlanyanyan@ict.ac.cn\n\nJiafeng Guo\n\nInstitute of Computing Technology\n\nChinese Academy of Sciences\nguojiafeng@ict.ac.cn\n\nXueqi Cheng\n\nInstitute of Computing Technology\n\nChinese Academy of Sciences\n\nTie-Yan Liu\n\nMicrosoft Research Asia\n\nTie-Yan.Liu@microsoft.com\n\ncxq@ict.ac.cn\n\nAbstract\n\nThis paper is concerned with the statistical consistency of ranking methods. 
Recently, it was proven that many commonly used pairwise ranking methods are inconsistent with the weighted pairwise disagreement loss (WPDL), which can be viewed as the true loss of ranking, even in a low-noise setting. This result is interesting but also surprising, given that the pairwise ranking methods have been shown very effective in practice. In this paper, we argue that the aforementioned result might not be conclusive, depending on what kind of assumptions are used. We give a new assumption that the labels of objects to rank lie in a rank-differentiable probability space (RDPS), and prove that the pairwise ranking methods become consistent with WPDL under this assumption. What is especially inspiring is that RDPS is actually not stronger than but similar to the low-noise setting. Our studies provide theoretical justifications of some empirical findings on pairwise ranking methods that are unexplained before, which bridge the gap between theory and applications.\n\n1 Introduction\n\nRanking is a central problem in many applications, such as document retrieval, meta search, and collaborative filtering. In recent years, machine learning technologies called \u2018learning to rank\u2019 have been successfully applied. A learning-to-rank process can be described as follows. In training, a number of sets (queries) of objects (documents) are given and within each set the objects are labeled by assessors, mainly based on multi-level ratings. The target of learning is to create a model that provides a ranking over the objects that best respects the observed labels. 
In testing, given a new set of objects, the trained model is applied to generate a ranked list of the objects.\n\nIdeally, the learning process should be guided by minimizing a true loss such as the weighted pairwise disagreement loss (WPDL) [11], which encodes people\u2019s knowledge on ranking evaluation. However, the minimization can be very difficult due to the nonconvexity of the true loss. Alternatively, many learning-to-rank methods minimize surrogate loss functions. For example, RankSVM [14], RankBoost [12], and RankNet [3] minimize the hinge loss, the exponential loss, and the cross-entropy loss, respectively.\n\nIn machine learning, statistical consistency is regarded as a desired property of a learning method [1, 21, 20], which reveals the statistical connection between a surrogate loss function and the true loss. Statistical consistency in the context of ranking has been actively studied in recent years [8, 9, 19, 11, 2, 18]. According to the studies in [11], many existing pairwise ranking methods are, surprisingly, inconsistent with WPDL, even in a low-noise setting. However, as we know, the pairwise ranking methods have been shown to work very well in practice, and have been regarded as state-of-the-art even today [15, 16, 17]. For example, the experimental results in [2] show that a weighted preorder loss in RankSVM [4] can outperform a consistent surrogate loss in terms of NDCG (see Table 2 in [2]).\n\nThe contradiction between theory and application inspires us to revisit the statistical consistency of pairwise ranking methods. 
In particular, we will study whether there exists a new assumption on the probability space that can make statistical consistency naturally hold, and how this new assumption compares with the low-noise setting used in [11].\n\nTo perform our study, we first derive a sufficient condition for statistical consistency of ranking methods called rank-consistency, which is in nature very similar to edge-consistency in [11] and order-preserving in [2]. Then we give an assumption on the probability space where ratings (labels) of objects come from, which we call a rank-differentiable probability space (RDPS). Intuitively, RDPS reveals the reason why an object (denoted as object A) should be ranked higher than another object (denoted as object B). That is, the probability of any ratings consistent with the preference (footnote 1) is larger than that of its dual ratings (obtained by exchanging the labels of object A and object B while keeping others unchanged). We then prove that with the RDPS assumption, the weighted pairwise surrogate loss, which is a generalization of many surrogate loss functions used in existing pairwise ranking methods (e.g., the preorder loss in RankSVM [2], the exponential loss in RankBoost [12], and the logistic loss in RankNet [3]), is statistically consistent with WPDL.\n\nPlease note that our theoretical result contradicts the result obtained in [11], mainly due to the different assumptions used. What is interesting, and to some extent inspiring, is that our RDPS assumption is not stronger than the low-noise setting used in [11], and in some sense they are very similar to each other (although they focus on different aspects of the probability space). 
We then conduct detailed comparisons between them to gain more insights on what affects the consistency of ranking.\n\nAccording to our theoretical analysis, we argue that it is not yet appropriate to draw any conclusion about the inconsistency of pairwise ranking methods, especially because it is hard to know what the probability space really is. In this sense, we think the pairwise ranking methods are still good choices for real ranking applications, due to their good empirical performance.\n\nThe rest of this paper is organized as follows. Section 2 defines the consistency problem formally and provides a sufficient condition under which consistency with WPDL is achieved for ranking methods. Section 3 gives the main theoretical results, including the formal definition of RDPS and conditions of statistical consistency of pairwise ranking methods. Further discussions on whether RDPS is a strong assumption and why our results contradict those in [11] are presented in Section 4. Conclusions are presented in Section 5.\n\n2 Preliminaries of Statistical Consistency\n\nLet $\\mathbf{x} = \\{x_1, \\cdots, x_m\\}$ be a set of objects to be ranked. Suppose the labels of the objects are given as multi-level ratings $\\mathbf{r} = (r_1, \\cdots, r_m)$ from space $\\mathcal{R}$, where $r_i$ denotes the label of $x_i$. Without loss of generality, we adopt the $K$-level ratings used in [7], that is, $r_i \\in \\{0, 1, \\cdots, K-1\\}$. If $r_i > r_j$, $x_i$ should be ranked higher than $x_j$. Assume that $(\\mathbf{x}, \\mathbf{r})$ is a random variable of space $\\mathcal{X} \\times \\mathcal{R}$ according to a probability measure $P$. 
Following existing literature, let $f$ be a ranking function that gives a score to each object to produce a ranked list and denote $\\mathcal{F}$ as the space of all ranking functions.\n\nIn this paper, we adopt the weighted pairwise disagreement loss (WPDL) defined in [11, 10] as the true loss to evaluate $f$:\n\n$$l_0(\\alpha, G) = \\sum_{i<j} a^G_{ij} 1_{\\{\\alpha_i \\le \\alpha_j\\}} + \\sum_{i>j} a^G_{ij} 1_{\\{\\alpha_i < \\alpha_j\\}}, \\tag{1}$$\n\nwhere $\\alpha = (\\alpha_1, \\cdots, \\alpha_m) = (f(x_1), \\cdots, f(x_m))$, $G$ is a directed acyclic graph (DAG for short) with edge $i \\to j$ to represent the preference that $x_i$ should be ranked higher than $x_j$, and $a^G_{ij}$ is a non-negative penalty indexed by $i \\to j$ on graph $G$.\n\nFootnote 1: Here, consistency with the preference means that the rating of object A is larger than that of object B.\n\nSpecifically, in the setting of multi-level ratings, $i \\to j$ is constructed between pair $(i, j)$ with $r_i > r_j$, and $a^G_{ij}$ is thus just relevant to the labels of the two objects. For ease of representation (footnote 2), we replace $a^G_{ij}$ with $D(r_i, r_j)$, and WPDL becomes the following form:\n\n$$l_0(f; \\mathbf{x}, \\mathbf{r}) = \\sum_{i,j: r_i > r_j} D(r_i, r_j) 1_{\\{f(x_i) - f(x_j) \\le 0\\}}, \\tag{2}$$\n\nwhere $1_{\\{\\cdot\\}}$ is an indicator function (footnote 3) and $D(r_i, r_j)$ is a weight function s.t. (1) $\\forall r_i \\neq r_j$, $D(r_i, r_j) > 0$; (2) $\\forall r_i, r_j$, $D(r_i, r_j) = D(r_j, r_i)$; (3) $\\forall r_i < r_j < r_k$, $D(r_i, r_j) \\le D(r_i, r_k)$, $D(r_j, r_k) \\le D(r_i, r_k)$.\n\nThe conditional expected true risk and the expected true risk of $f$ are then defined as:\n\n$$R_0(f|\\mathbf{x}) = E_{\\mathbf{r}|\\mathbf{x}}\\, l_0(f; \\mathbf{x}, \\mathbf{r}) = \\sum_{\\mathbf{r} \\in \\mathcal{R}} l_0(f; \\mathbf{x}, \\mathbf{r}) P(\\mathbf{r}|\\mathbf{x}), \\quad R_0(f) = E_{\\mathbf{x}}[E_{\\mathbf{r}|\\mathbf{x}}\\, l_0(f; \\mathbf{x}, \\mathbf{r})]. \\tag{3}$$\n\nDue to the nonconvexity of the true loss, it is infeasible to minimize the true risk in Eq.(3). 
As is done in the literature of machine learning, we adopt a surrogate loss $l_\\Phi$ to minimize in place of $l_0$. The conditional expected surrogate risk and the expected surrogate risk of $f$ are then defined as:\n\n$$R_\\Phi(f|\\mathbf{x}) = E_{\\mathbf{r}|\\mathbf{x}}\\, l_\\Phi(f; \\mathbf{x}, \\mathbf{r}) = \\sum_{\\mathbf{r} \\in \\mathcal{R}} l_\\Phi(f; \\mathbf{x}, \\mathbf{r}) P(\\mathbf{r}|\\mathbf{x}), \\quad R_\\Phi(f) = E_{\\mathbf{x}}[E_{\\mathbf{r}|\\mathbf{x}}\\, l_\\Phi(f; \\mathbf{x}, \\mathbf{r})]. \\tag{4}$$\n\nStatistical consistency is a desired property for a good surrogate loss, which measures whether the expected true risk of the ranking function obtained by minimizing a surrogate loss converges to the expected true risk of the optimal ranking in the large sample limit.\n\nDefinition 1. We say a ranking method that minimizes a surrogate loss $l_\\Phi$ is statistically consistent with respect to the true loss $l_0$, if $\\forall \\epsilon_1 > 0$, $\\exists \\epsilon_2 > 0$, such that for any ranking function $f \\in \\mathcal{F}$, $R_\\Phi(f) \\le \\inf_{h \\in \\mathcal{F}} R_\\Phi(h) + \\epsilon_2$ implies $R_0(f) \\le \\inf_{h \\in \\mathcal{F}} R_0(h) + \\epsilon_1$.\n\nWe then introduce a property of the surrogate loss, called rank-consistency, which is a sufficient condition for the statistical consistency of the surrogate loss, as indicated by Theorem 1.\n\nDefinition 2. We say a surrogate loss $l_\\Phi$ is rank-consistent with the true loss $l_0$, if $\\forall \\mathbf{x}$, for any ranking function $f \\in \\mathcal{F}$ such that $R_0(f|\\mathbf{x}) > \\inf_{h \\in \\mathcal{F}} R_0(h|\\mathbf{x})$, the following inequality holds:\n\n$$\\inf_{h \\in \\mathcal{F}} R_\\Phi(h|\\mathbf{x}) < \\inf\\{R_\\Phi(g|\\mathbf{x}) : g \\in \\mathcal{F},\\ g(x_i) \\le g(x_j) \\text{ for } (i, j) \\text{ where } f(x_i) \\le f(x_j)\\}. \\tag{5}$$\n\nRank-consistency can be viewed as a generalization of infinite sample consistency for classification proposed in [20] (also referred to as \u2018classification-calibrated\u2019 in [1]) to ranking on a set of objects. It is also similar to edge-consistency in [11] and the order-preserving property in [2].\n\nTheorem 1. 
If a surrogate loss $l_\\Phi$ is rank-consistent with the true loss $l_0$ on the function space $\\mathcal{F}$, then it is statistically consistent with the true loss $l_0$ on $\\mathcal{F}$.\n\nWe omit the proof since it is a straightforward extension of the proof for Theorem 3 in [20]. The proof is also similar to Lemmas 3, 4, 5 and Theorem 6 in [11].\n\n3 Main Results\n\nIn this section, we present our main theoretical results: with a new assumption on the probability space, many commonly used pairwise ranking algorithms can be proved consistent with WPDL.\n\n3.1 A Rank-Differentiable Probability Space\n\nFirst, we give a new assumption named a rank-differentiable probability space (RDPS for short), with which many pairwise ranking methods will be rank-consistent with WPDL. Hereafter, we will also refer to data from RDPS as having a rank-differentiable property.\n\nFootnote 2: Here we do not distinguish $i > j$ and $i < j$, because they are just introduced to avoid minor technical issues as stated in [11]. Furthermore, it will not influence the consistency results.\n\nFootnote 3: $1_A = 1$ if $A$ is true and $1_A = 0$ if $A$ is false.\n\nBefore introducing the definition of RDPS, we give two definitions, an equivalence class of ratings and dual ratings. Intuitively, we say two ratings are equivalent if they induce the same ranking or preference relationships. And we say two ratings are the dual ratings with respect to a pair of objects, if the two ratings just exchange the ratings of the two objects while keeping the ratings of other objects unchanged. The formal definitions are given as follows.\n\nDefinition 3. A ratings $\\mathbf{r}$ is called equivalent to $\\tilde{\\mathbf{r}}$, denoted as $\\mathbf{r} \\sim \\tilde{\\mathbf{r}}$, if $\\mathcal{P}(\\mathbf{r}) = \\mathcal{P}(\\tilde{\\mathbf{r}})$, where $\\mathcal{P}(\\mathbf{r}) = \\{(i, j) : r_i > r_j\\}$ and $\\mathcal{P}(\\tilde{\\mathbf{r}}) = \\{(i, j) : \\tilde{r}_i > \\tilde{r}_j\\}$ stand for the preference relationships induced by $\\mathbf{r}$ and $\\tilde{\\mathbf{r}}$, respectively. Therefore, an equivalence class of the ratings $\\mathbf{r}$, denoted as $[\\mathbf{r}]$, is defined as the set of ratings which are equivalent to $\\mathbf{r}$. 
That is, $[\\mathbf{r}] = \\{\\tilde{\\mathbf{r}} \\in \\mathcal{R} : \\tilde{\\mathbf{r}} \\sim \\mathbf{r}\\}$.\n\nDefinition 4. Let $\\mathcal{R}(i, j) = \\{\\mathbf{r} \\in \\mathcal{R} : r_i > r_j\\}$. $\\mathbf{r}'$ is called the dual ratings of $\\mathbf{r} \\in \\mathcal{R}(i, j)$ with respect to $(i, j)$ if $r'_j = r_i$, $r'_i = r_j$, $r'_k = r_k$, $\\forall k \\neq i, j$.\n\nNow we give the definition of RDPS. An intuitive explanation of this definition is that there exists a unique equivalence class of ratings such that, for each induced pairwise preference relationship, the probability is able to separate the two dual ratings with respect to that pair.\n\nDefinition 5. Let $\\mathcal{R}(i, j) = \\{\\mathbf{r} \\in \\mathcal{R} : r_i > r_j\\}$. A probability space is called rank-differentiable with $(i, j)$, if for any $\\mathbf{r} \\in \\mathcal{R}(i, j)$, $P(\\mathbf{r}|\\mathbf{x}) \\ge P(\\mathbf{r}'|\\mathbf{x})$, and there exists at least one ratings $\\mathbf{r} \\in \\mathcal{R}(i, j)$, s.t. $P(\\mathbf{r}|\\mathbf{x}) > P(\\mathbf{r}'|\\mathbf{x})$, where $\\mathbf{r}'$ is the dual ratings of $\\mathbf{r}$.\n\nDefinition 6. A probability space is called rank-differentiable, if there exists an equivalence class $[\\mathbf{r}^*]$, s.t. $\\mathcal{P}(\\mathbf{r}^*) = \\{(i, j) : \\text{the probability space is rank-differentiable with } (i, j)\\}$, where $\\mathcal{P}(\\mathbf{r}^*) = \\{(i, j) : r^*_i > r^*_j\\}$. We will also call this probability space an RDPS or rank-differentiable with $[\\mathbf{r}^*]$.\n\nPlease note that $[\\mathbf{r}^*]$ in Definition 6 is unique, which can be directly proved by Definition 3.\n\nDefinition 5 implies that if a probability space is rank-differentiable with $(i, j)$, the optimal ranking function will rank $x_i$ higher than $x_j$, as shown in the following theorem. The proof is similar to that of Theorem 4, thus we omit it here for space limitation. Hereafter, we will call this property \u2018separability on pairs\u2019.\n\nTheorem 2. 
$\\forall \\mathbf{x} \\in \\mathcal{X}$, let $f^* \\in \\mathcal{F}$ be an optimal ranking function such that $R_0(f^*|\\mathbf{x}) = \\inf_{f \\in \\mathcal{F}} R_0(f|\\mathbf{x})$. If the probability space is rank-differentiable with $(i, j)$, we have $f^*(x_i) > f^*(x_j)$.\n\nFurther considering the \u2018transitivity over pairs\u2019 (footnote 4) of a ranking function, Definition 6 implies that if a probability space is rank-differentiable with $[\\mathbf{r}^*]$, the optimal ranking function will induce the same preference relationships, as shown in the following theorem.\n\nTheorem 3. $\\forall \\mathbf{x} \\in \\mathcal{X}$, let $f^* \\in \\mathcal{F}$ be an optimal ranking function such that $R_0(f^*|\\mathbf{x}) = \\inf_{f \\in \\mathcal{F}} R_0(f|\\mathbf{x})$. If the probability space is rank-differentiable with $[\\mathbf{r}^*]$, for any $(i, j) \\in \\mathcal{P}(\\mathbf{r}^*)$, we have $f^*(x_i) > f^*(x_j)$, where $\\mathcal{P}(\\mathbf{r}^*) = \\{(i, j) : r^*_i > r^*_j\\}$.\n\n3.2 Conditions of Statistical Consistency\n\nWith RDPS as the new assumption, we study the statistical consistency of pairwise ranking methods. First, we define the weighted pairwise surrogate loss as\n\n$$l_\\Phi(f; \\mathbf{x}, \\mathbf{r}) = \\sum_{i,j: r_i > r_j} D(r_i, r_j) \\phi(f(x_i) - f(x_j)), \\tag{6}$$\n\nwhere $\\phi$ is a convex function. The surrogate losses used in many existing pairwise ranking methods can be regarded as special cases of this weighted pairwise surrogate loss, such as the hinge loss in RankSVM [14], the exponential loss in RankBoost [12], the cross-entropy loss in RankNet [3] and the preorder loss in [2]. For the weighted pairwise surrogate loss, we get its sufficient condition of statistical consistency as shown in Theorem 5. In order to prove this theorem, we first prove Theorem 4.\n\nTheorem 4. We assume the probability space is rank-differentiable with an equivalence class $[\\mathbf{r}^*]$. Suppose that $\\phi(\\cdot) : \\mathbb{R} \\to \\mathbb{R}$ in the weighted pairwise surrogate loss is a non-increasing function such that $\\phi(z) < \\phi(-z)$, $\\forall z > 0$. 
$\\forall \\mathbf{x} \\in \\mathcal{X}$, let $f \\in \\mathcal{F}$ be a ranking function such that $R_\\Phi(f|\\mathbf{x}) = \\inf_{h \\in \\mathcal{F}} R_\\Phi(h|\\mathbf{x})$, then for any object pair $(x_i, x_j)$ with $r^*_i > r^*_j$, we have $f(x_i) \\ge f(x_j)$. Moreover, if $\\phi(\\cdot)$ is differentiable and $\\phi'(0) < 0$, we have $f(x_i) > f(x_j)$.\n\nFootnote 4: Transitivity means that if $x_i$ is ranked higher than $x_j$ and $x_j$ is ranked higher than $x_k$, $x_i$ must be ranked higher than $x_k$.\n\nProof. (1) We assume that $f(x_i) < f(x_j)$, and define $f'$ as the function such that $f'(x_i) = f(x_j)$, $f'(x_j) = f(x_i)$, $f'(x_k) = f(x_k)$, $\\forall k \\neq i, j$. We can then get the following equation,\n\n$$\\begin{aligned} R_\\Phi(f'|\\mathbf{x}) - R_\\Phi(f|\\mathbf{x}) = {} & \\sum_{\\mathbf{r}, \\mathbf{r}': \\mathbf{r} \\in \\mathcal{R}(i,j)} \\sum_{k: r_j < r_i < r_k} [D(r_k, r_j) - D(r_k, r_i)][\\phi(f(x_k) - f(x_i)) - \\phi(f(x_k) - f(x_j))][P(\\mathbf{r}|\\mathbf{x}) - P(\\mathbf{r}'|\\mathbf{x})] \\\\ & + \\sum_{\\mathbf{r}, \\mathbf{r}': \\mathbf{r} \\in \\mathcal{R}(i,j)} \\sum_{k: r_j < r_k < r_i} D(r_k, r_j)[\\phi(f(x_k) - f(x_i)) - \\phi(f(x_k) - f(x_j))][P(\\mathbf{r}|\\mathbf{x}) - P(\\mathbf{r}'|\\mathbf{x})] \\\\ & + \\sum_{\\mathbf{r}, \\mathbf{r}': \\mathbf{r} \\in \\mathcal{R}(i,j)} \\sum_{k: r_j < r_k < r_i} D(r_i, r_k)[\\phi(f(x_j) - f(x_k)) - \\phi(f(x_i) - f(x_k))][P(\\mathbf{r}|\\mathbf{x}) - P(\\mathbf{r}'|\\mathbf{x})] \\\\ & + \\sum_{\\mathbf{r}, \\mathbf{r}': \\mathbf{r} \\in \\mathcal{R}(i,j)} \\sum_{k: r_k < r_j < r_i} [D(r_i, r_k) - D(r_j, r_k)][\\phi(f(x_j) - f(x_k)) - \\phi(f(x_i) - f(x_k))][P(\\mathbf{r}|\\mathbf{x}) - P(\\mathbf{r}'|\\mathbf{x})] \\\\ & + [\\phi(f(x_j) - f(x_i)) - \\phi(f(x_i) - f(x_j))] \\sum_{\\mathbf{r}, \\mathbf{r}': \\mathbf{r} \\in \\mathcal{R}(i,j)} D(r_i, r_j)[P(\\mathbf{r}|\\mathbf{x}) - P(\\mathbf{r}'|\\mathbf{x})]. \\end{aligned}$$\n\nAccording to the conditions of RDPS, the requirements of the weight function $D$ in Section 2 and the assumption that $\\phi$ is a non-increasing function such that $\\phi(z) < \\phi(-z)$, $\\forall z > 0$, we can obtain\n\n$$R_\\Phi(f'|\\mathbf{x}) - R_\\Phi(f|\\mathbf{x}) \\le [\\phi(f(x_j) - f(x_i)) - \\phi(f(x_i) - f(x_j))] \\sum_{\\mathbf{r}, \\mathbf{r}': \\mathbf{r} \\in \\mathcal{R}(i,j)} D(r_i, r_j)[P(\\mathbf{r}|\\mathbf{x}) - P(\\mathbf{r}'|\\mathbf{x})] < 0.$$\n\nThis is a contradiction with $R_\\Phi(f|\\mathbf{x}) = \\inf_{h \\in \\mathcal{F}} R_\\Phi(h|\\mathbf{x})$. Therefore, we have proven that $f(x_i) \\ge f(x_j)$.\n\n(2) Now we assume that $f(x_i) = f(x_j) = f_0$. From the assumption $R_\\Phi(f|\\mathbf{x}) = \\inf_{h \\in \\mathcal{F}} R_\\Phi(h|\\mathbf{x})$, we can get $\\left.\\frac{\\partial R_\\Phi(f|\\mathbf{x})}{\\partial f(x_i)}\\right|_{f_0} = 0$, $\\left.\\frac{\\partial R_\\Phi(f|\\mathbf{x})}{\\partial f(x_j)}\\right|_{f_0} = 0$. Accordingly, we can obtain the two following equations:\n\n$$\\sum_{\\mathbf{r}, \\mathbf{r}': \\mathbf{r} \\in \\mathcal{R}(i,j)} A_1 P(\\mathbf{r}|\\mathbf{x}) + A_2 P(\\mathbf{r}'|\\mathbf{x}) = 0, \\quad \\sum_{\\mathbf{r}, \\mathbf{r}': \\mathbf{r} \\in \\mathcal{R}(i,j)} B_1 P(\\mathbf{r}|\\mathbf{x}) + B_2 P(\\mathbf{r}'|\\mathbf{x}) = 0, \\tag{7}$$\n\nwhere\n\n$$\\begin{aligned} A_1 = B_2 = {} & \\sum_{k: r_j < r_i < r_k} D(r_k, r_i)[-\\phi'(f(x_k) - f_0)] + \\sum_{k: r_j < r_k < r_i} D(r_i, r_k) \\phi'(f_0 - f(x_k)) \\\\ & + \\sum_{k: r_k < r_j < r_i} D(r_i, r_k) \\phi'(f_0 - f(x_k)) + D(r_i, r_j) \\phi'(0), \\\\ A_2 = B_1 = {} & \\sum_{k: r_j < r_i < r_k} D(r_k, r_j)[-\\phi'(f(x_k) - f_0)] + \\sum_{k: r_j < r_k < r_i} D(r_k, r_j)[-\\phi'(f(x_k) - f_0)] \\\\ & + \\sum_{k: r_k < r_j < r_i} D(r_j, r_k) \\phi'(f_0 - f(x_k)) + D(r_i, r_j)[-\\phi'(0)]. \\end{aligned}$$\n\nIf $\\phi'(0) < 0$, based on the requirements of RDPS and the weight function $D$, we can get\n\n$$\\sum_{\\mathbf{r}, \\mathbf{r}': \\mathbf{r} \\in \\mathcal{R}(i,j)} (A_1 - B_1) P(\\mathbf{r}|\\mathbf{x}) + (A_2 - B_2) P(\\mathbf{r}'|\\mathbf{x}) = \\sum_{\\mathbf{r}, \\mathbf{r}': \\mathbf{r} \\in \\mathcal{R}(i,j)} (A_1 - A_2)[P(\\mathbf{r}|\\mathbf{x}) - P(\\mathbf{r}'|\\mathbf{x})] \\le 2\\phi'(0) \\sum_{\\mathbf{r}, \\mathbf{r}': \\mathbf{r} \\in \\mathcal{R}(i,j)} D(r_i, r_j)[P(\\mathbf{r}|\\mathbf{x}) - P(\\mathbf{r}'|\\mathbf{x})] < 0.$$\n\nThis is a contradiction with Eq.(7). Therefore, we actually have proven that $f(x_i) > f(x_j)$.\n\nFigure 1: Relationships among order-preserving, rank-differentiable and low-noise.\n\nTheorem 5. 
Let $\\phi(\\cdot)$ be a non-negative, non-increasing and differentiable function such that $\\phi'(0) < 0$. Then the weighted pairwise surrogate loss is consistent with WPDL under the assumption of RDPS.\n\nProof. We assume that the probability space is rank-differentiable with an equivalence class $[\\mathbf{r}^*]$. Then for any object pair $(x_i, x_j)$ with $r^*_i > r^*_j$, we are going to prove that\n\n$$R^*_{\\Phi|\\mathbf{x}} = \\inf_{h \\in \\mathcal{F}} R_\\Phi(h|\\mathbf{x}) < \\inf\\{R_\\Phi(f|\\mathbf{x}) : f \\in \\mathcal{F},\\ f(x_i) \\le f(x_j)\\}, \\tag{8}$$\n\nbecause from Theorem 3 this implies the rank-consistency condition in Eq.(5) holds.\n\nSuppose Eq.(8) is not true, then we can find a sequence of functions $\\{f_m\\}$ such that $0 = f_m(x_i) \\le f_m(x_j)$, and $\\lim_m R_\\Phi(f_m|\\mathbf{x}) = R^*_{\\Phi|\\mathbf{x}}$. We can further select a subsequence such that for each pair $(i, j)$, $f_m(x_i) - f_m(x_j)$ converges (may also converge to $\\pm\\infty$). This leads to a limit function $f$, with properly defined $f(x_i) - f(x_j)$, even when either $f(x_i)$ or $f(x_j)$ is $\\pm\\infty$. This implies that $R_\\Phi(f|\\mathbf{x}) = R^*_{\\Phi|\\mathbf{x}}$ and $0 = f(x_i) \\le f(x_j)$. However, this violates Theorem 4. Thus, Eq.(8) is true. Therefore, we have proven that the weighted pairwise surrogate loss is consistent with WPDL.\n\nMany commonly used pairwise surrogate losses, such as the preorder loss in RankSVM [2], the exponential loss in RankBoost [12], and the logistic loss in RankNet [3], satisfy the conditions in Theorem 5, thus they are consistent with WPDL. In other words, we have shown that statistical consistency of pairwise ranking methods is achieved under the assumption of RDPS.\n\n4 Discussions\n\nIn Section 3, we have shown that statistical consistency of pairwise ranking methods is achieved with the assumption of RDPS. Considering the contradicting conclusion drawn in [11], a natural question is whether the RDPS is stronger than the low-noise setting used in [11]. 
In this section we discuss this issue.\n\n4.1 Relationships of RDPS with Previous Work\n\nHere, we discuss the relationships between the rank-differentiable property and the assumptions used in some previous works (including the order-preserving property in [19] and the low-noise setting in [11]). According to our analysis, we find that the rank-differentiable property is not a strong assumption on the probability space. Actually, it is a weaker assumption than the order-preserving property and is very similar to the low-noise setting. A sketch map of the relationships between the three assumptions is presented in Figure 1, where the low-noise probability spaces stand for a set of spaces satisfying the low-noise setting. Detailed discussions are given as follows.\n\n1. Rank-Differentiable vs. Order-Preserving\n\nThe rank-differentiable property is defined on the space of multi-level ratings while the order-preserving property is defined on the permutation space. To understand their relationship, we need to put them onto the same space. Actually, we can restrict the space of multi-level ratings to the permutation space by setting $K = m - 1$ and requiring the ratings of each two objects to be different. After doing so, it is not difficult to see that the rank-differentiable property is weaker than the order-preserving property, as shown in the following theorem.\n\nTheorem 6. Let $K = m - 1$. For each permutation $y \\in \\mathcal{Y}$, where $y(i)$ stands for the position of $x_i$, define the corresponding ratings $\\mathbf{r}^y = (r^y_1, \\cdots, r^y_m)$ as $r^y_i = m - y(i)$, $i = 1, \\cdots, n$. Assume that $P(\\mathbf{r}^y) = P(y)$, and $P(\\mathbf{r}) = 0$ if there does not exist a permutation $y$ s.t. $\\mathbf{r} = \\mathbf{r}^y$. If the probability space is order-preserving with respect to $m-1$ pairs $(j_1, j_2), (j_2, j_3), \\cdots, (j_{m-1}, j_m)$, it is rank-differentiable with the equivalence class $[\\mathbf{r}^*]$, where $r^*_{j_i} > r^*_{j_{i+1}}$, $i = 1, \\cdots, m$, but the converse is not always true.\n\n2. Rank-Differentiable vs. Low-Noise\n\nThe rank-differentiable property is defined on the space of multi-level ratings while the low-noise setting is defined on the space of DAGs. According to the correspondence between ratings and DAGs (as stated in Section 2), we can restrict the space of DAGs to the space of multi-level ratings. Consequently, we obtain the relationship between the rank-differentiable property and the low-noise setting as follows:\n\n(1) Mathematically, the inequalities in the low-noise setting can be viewed as the combinations of the corresponding inequalities in the rank-differentiable property. They are similar to each other in their forms and the rank-differentiable property is not stronger than the low-noise setting.\n\n(2) Intuitively, the rank-differentiable property induces \u2018separability on pairs\u2019 and \u2018transitivity over pairs\u2019 as described in Theorems 2 and 3, while the low-noise setting aims to explicitly express the transitivity over pairs, but fails in achieving it.\n\nLet us use an example to illustrate the above points. Suppose there are three objects to be ranked in the setting of three-level ratings ($K = 3$). Furthermore, suppose that the ratings of every two objects are different and all the graphs are fully connected DAGs in the setting of [11]. 
We order the ratings and DAGs as:\n\n$\\mathbf{r}_1 = (2, 1, 0)$, $\\mathbf{r}_2 = (1, 2, 0)$, $\\mathbf{r}_3 = (2, 0, 1)$, $\\mathbf{r}_4 = (0, 2, 1)$, $\\mathbf{r}_5 = (1, 0, 2)$, $\\mathbf{r}_6 = (0, 1, 2)$,\n\n$G_1 = \\{(1 \\to 2), (2 \\to 3), (1 \\to 3)\\}$, $G_2 = \\{(2 \\to 1), (1 \\to 3), (2 \\to 3)\\}$, $G_3 = \\{(1 \\to 3), (3 \\to 2), (1 \\to 2)\\}$, $G_4 = \\{(2 \\to 3), (3 \\to 1), (2 \\to 1)\\}$, $G_5 = \\{(3 \\to 1), (1 \\to 2), (3 \\to 2)\\}$, $G_6 = \\{(3 \\to 2), (2 \\to 1), (3 \\to 1)\\}$.\n\nSince $\\mathbf{r}_i$ and $G_i$ are in one-to-one correspondence, we can set the probability as $P(\\mathbf{r}_i|\\mathbf{x}) = P(G_i|\\mathbf{x}) = P_i$ and define $a^{G_i}_{kl} = D(r_{ik}, r_{il})$, $i = 1, \\cdots, 6$; $k, l = 1, 2, 3$.\n\nConsidering the conditions in the definition of RDPS, being rank-differentiable with $[\\mathbf{r}_1]$ requires the following inequalities to hold, with at least one inequality in each of (9) and (10) holding strictly.\n\n$$P_1 - P_2 \\ge 0, \\quad P_3 - P_4 \\ge 0, \\quad P_5 - P_6 \\ge 0, \\tag{9}$$\n\n$$P_4 - P_6 \\ge 0, \\quad P_2 - P_5 \\ge 0, \\quad P_1 - P_3 \\ge 0. \\tag{10}$$\n\nWe assume there are edges $1 \\to 2$ and $2 \\to 3$ in the difference graph. Then the low-noise setting in Definition 8 of [11] requires that $a_{13} - a_{31} \\ge a_{12} - a_{21} + a_{23} - a_{32}$, where\n\n$$\\begin{aligned} a_{12} - a_{21} &= D(2, 1)(P_1 - P_2) + D(2, 0)(P_3 - P_4) + D(1, 0)(P_5 - P_6), \\\\ a_{23} - a_{32} &= D(2, 1)(P_4 - P_6) + D(2, 0)(P_2 - P_5) + D(1, 0)(P_1 - P_3), \\\\ a_{13} - a_{31} &= D(2, 1)(P_3 - P_5) + D(2, 0)(P_1 - P_6) + D(1, 0)(P_2 - P_4). \\end{aligned}$$\n\nAccording to the above example,\n\n(1) $a_{12} - a_{21}$ and $a_{23} - a_{32}$ are exactly the combinations of the terms in (9) and (10), respectively. Thus, if the probability space is rank-differentiable with $[\\mathbf{r}_1]$, we can only get $a_{12} - a_{21} > 0$, $a_{23} - a_{32} > 0$, but not the inequalities in the low-noise setting. 
This indicates that our rank-differentiable property is not stronger than the low-noise setting.\n\n(2) With the assumption that $a_{ij} - a_{ji} > 0$ can guarantee the optimal ranking with which $x_i$ is ranked higher than $x_j$, it seems that the low-noise setting intends to make the preferences of $1 \\to 2$ and $2 \\to 3$ transitive to $1 \\to 3$. However, the assumption is not always true. Instead, the rank-differentiable property can naturally induce the \u2018transitivity over pairs\u2019 (see Theorems 2 and 3). In this sense, the rank-differentiable property is much more powerful than the low-noise setting, although not stronger.\n\n4.2 Explanation on Theoretical Contradiction\n\nOn one hand, different conclusions on the consistency of pairwise ranking methods have been obtained in our work and in [11]. On the other hand, we have shown that there exists a connection between the rank-differentiable property and the low-noise setting (see Figure 1). Therefore, one may get confused by the contradicting results and may wonder what will happen if a probability space satisfies both the rank-differentiable property and the low-noise setting. In this subsection, we discuss this issue.\n\nPlease note that we adopt the multi-level ratings as the labeling strategy (as stated clearly in Section 2) in our analysis. With this setting, the graph space $\\mathcal{G}$ in [11] will not contain all the DAGs. For example, considering a three-graph case, the graph $G_2 = \\{(1, 2, 3) : (2 \\to 3), (3 \\to 1)\\}$ in the proof of Theorem 11 of [11] (the main negative result on the consistency of pairwise surrogate losses) actually does not exist. That is because if $2 \\to 3$ and $3 \\to 1$ exist in a graph $G$, we can get that $r_2 > r_3$, $r_3 > r_1$ according to the correspondence between graphs and ratings as stated in Section 2. Therefore, we can immediately get $r_2 > r_1$. 
Once again according to the correspondence between graphs and ratings, we will get that $2 \\to 1$ should be contained in graph $G$, which contradicts the definition of $G_2$. Thus, $G_2$ will not exist in the setting of multi-level ratings. However, the proof in [11] does not take the constraint of multi-level ratings into consideration, and thus arrives at a contradictory result.\n\nFrom the above discussions, we can see that our theoretical results contradict those in [11] mainly because the two works consider different settings and assumptions. If a probability space satisfies both the rank-differentiable property and the low-noise setting, the pairwise ranking methods will be consistent with WPDL in the setting of multi-level ratings but inconsistent in the setting of DAGs. One may argue that the setting of multi-level ratings is not as general as the DAG setting. However, please note that multi-level ratings are the dominant setting in the literature of \u2018learning to rank\u2019 [13, 16, 15, 6] and have been widely used in many applications such as web search and document retrieval [17, 5]. Therefore, we think the setting of multi-level ratings is general enough and our result has its value to the mainstream research of learning to rank.\n\nTo sum up, based on all the discussions in this paper, we argue that it is not yet appropriate to draw any conclusion about the inconsistency of pairwise ranking methods, especially because it is hard to know what the probability space really is. In this sense, we think the pairwise ranking methods are still good choices for real ranking applications, due to their good empirical performance.\n\n5 Conclusions\n\nIn this paper, we have discussed the statistical consistency of ranking methods. Specifically, we argue that the previous results on the inconsistency of commonly-used pairwise ranking methods are not conclusive, depending on the assumptions about the probability space. 
We then propose a\nnew assumption, which we call a rank-differentiable probability space (RDPS), and prove that the\npairwise ranking methods are consistent with the same true loss as in previous studies under this\nassumption. We show that RDPS is not a stronger assumption than the assumptions used in previous\nwork, indicating that our finding is as reliable as previous ones.\n\nAcknowledgments\n\nThis research work was funded by the National Natural Science Foundation of China under Grant\nNo. 60933005, No. 61173008, No. 61003166, and No. 61203298, and by the 973 Program of China under\nGrant No. 2012CB316303.\n\nReferences\n[1] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds.\nJournal of the American Statistical Association, 101(473):138\u2013156, 2006.\n\n[2] D. Buffoni, C. Calauzenes, P. Gallinari, and N. Usunier. Learning scoring functions with\norder-preserving losses and standardized supervision. In Proceedings of the 28th International\nConference on Machine Learning (ICML 2011), pages 825\u2013832, 2011.\n\n[3] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender.\nLearning to rank using gradient descent. In Proceedings of the 22nd International Conference\non Machine Learning (ICML 2005), pages 89\u201396, 2005.\n\n[4] O. Chapelle. Training a support vector machine in the primal. Neural Computation,\n19:1155\u20131178, 2007.\n\n[5] O. Chapelle and Y. Chang. Yahoo! learning to rank challenge overview. Journal of Machine\nLearning Research - Proceedings Track, 14:1\u201324, 2011.\n\n[6] O. Chapelle, Y. Chang, and T.-Y. Liu. Future directions in learning to rank. Journal of Machine\nLearning Research - Proceedings Track, 14:91\u2013100, 2011.\n\n[7] W. Chen, T.-Y. Liu, Y. Lan, Z. Ma, and H. Li. Ranking measures and loss functions in learning\nto rank. 
In 24th Annual Conference on Neural Information Processing Systems (NIPS 2009),\npages 315\u2013323, 2009.\n\n[8] S. Cl\u00e9men\u00e7on, G. Lugosi, and N. Vayatis. Ranking and scoring using empirical risk minimization.\nIn Proceedings of the 18th Annual Conference on Learning Theory (COLT 2005), volume\n3559, pages 1\u201315, 2005.\n\n[9] D. Cossock and T. Zhang. Subset ranking using regression. In Proceedings of the 19th Annual\nConference on Learning Theory (COLT 2006), pages 605\u2013619, 2006.\n\n[10] O. Dekel, C. D. Manning, and Y. Singer. Log-linear models for label ranking. In 18th Annual\nConference on Neural Information Processing Systems (NIPS 2003), 2003.\n\n[11] J. C. Duchi, L. W. Mackey, and M. I. Jordan. On the consistency of ranking algorithms. In\nProceedings of the 27th International Conference on Machine Learning (ICML 2010), pages\n327\u2013334, 2010.\n\n[12] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining\npreferences. Journal of Machine Learning Research, 4:933\u2013969, 2003.\n\n[13] R. Herbrich, K. Obermayer, and T. Graepel. Large margin rank boundaries for ordinal regression.\nIn Advances in Large Margin Classifiers, pages 115\u2013132, 1999.\n\n[14] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the 8th\nACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD\n2002), pages 133\u2013142, 2002.\n\n[15] H. Li, T.-Y. Liu, and C. Zhai. Learning to rank for information retrieval (LR4IR 2008). SIGIR\nForum, 42:76\u201379, 2008.\n\n[16] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information\nRetrieval, 3:225\u2013331, 2009.\n\n[17] T.-Y. Liu, J. Xu, T. Qin, W.-Y. Xiong, and H. Li. LETOR: Benchmark dataset for research\non learning to rank for information retrieval. In SIGIR \u201907 Workshop, San Francisco, 2007.\nMorgan Kaufmann.\n\n[18] P. D. Ravikumar, A. Tewari, and E. Yang. 
On NDCG consistency of listwise ranking methods.\nJournal of Machine Learning Research - Proceedings Track, 15:618\u2013626, 2011.\n\n[19] F. Xia, T.-Y. Liu, J. Wang, W. S. Zhang, and H. Li. Listwise approach to learning to rank\n- theory and algorithm. In Proceedings of the 25th International Conference on Machine\nLearning (ICML 2008), 2008.\n\n[20] T. Zhang. Statistical analysis of some multi-category large margin classification methods.\nJournal of Machine Learning Research, 5:1225\u20131251, 2004.\n\n[21] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk\nminimization. Annals of Statistics, 32:56\u201385, 2004.", "award": [], "sourceid": 599, "authors": [{"given_name": "Yanyan", "family_name": "Lan", "institution": null}, {"given_name": "Jiafeng", "family_name": "Guo", "institution": null}, {"given_name": "Xueqi", "family_name": "Cheng", "institution": null}, {"given_name": "Tie-yan", "family_name": "Liu", "institution": null}]}