{"title": "Statistical Consistency of Top-k Ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 2098, "page_last": 2106, "abstract": "This paper is concerned with the consistency analysis on listwise ranking methods. Among various ranking methods, the listwise methods have competitive performances on benchmark datasets and are regarded as one of the state-of-the-art approaches. Most listwise ranking methods manage to optimize ranking on the whole list (permutation) of objects, however, in practical applications such as information retrieval, correct ranking at the top k positions is much more important. This paper aims to analyze whether existing listwise ranking methods are statistically consistent in the top-k setting. For this purpose, we define a top-k ranking framework, where the true loss (and thus the risks) are defined on the basis of top-k subgroup of permutations. This framework can include the permutation-level ranking framework proposed in previous work as a special case. Based on the new framework, we derive sufficient conditions for a listwise ranking method to be consistent with the top-k true loss, and show an effective way of modifying the surrogate loss functions in existing methods to satisfy these conditions. Experimental results show that after the modifications, the methods can work significantly better than their original versions.", "full_text": "Statistical Consistency of Top-k Ranking\n\nFen Xia\n\nInstitute of Automation\n\nChinese Academy of Sciences\n\nTie-Yan Liu\n\nMicrosoft Research Asia\n\nHang Li\n\nMicrosoft Research Asia\n\ntyliu@microsoft.com\n\nhanglig@microsoft.com\n\nfen.xia@ia.ac.cn\n\nAbstract\n\nThis paper is concerned with the consistency analysis on listwise ranking meth-\nods. Among various ranking methods, the listwise methods have competitive per-\nformances on benchmark datasets and are regarded as one of the state-of-the-art\napproaches. 
Most listwise ranking methods manage to optimize ranking over the whole list (permutation) of objects; however, in practical applications such as information retrieval, correct ranking at the top k positions is much more important. This paper aims to analyze whether existing listwise ranking methods are statistically consistent in the top-k setting. For this purpose, we define a top-k ranking framework, where the true loss (and thus the risks) are defined on the basis of top-k subgroups of permutations. This framework includes the permutation-level ranking framework proposed in previous work as a special case. Based on the new framework, we derive sufficient conditions for a listwise ranking method to be consistent with the top-k true loss, and show an effective way of modifying the surrogate loss functions in existing methods to satisfy these conditions. Experimental results show that after the modifications, the methods work significantly better than their original versions.

1 Introduction

Ranking is a central problem in many applications, including information retrieval (IR). In recent years, machine learning technologies have been successfully applied to ranking, and many learning to rank methods have been proposed, including pointwise [12] [9] [6], pairwise [8] [7] [2], and listwise methods [13] [3] [16]. Empirical results on benchmark datasets have demonstrated that the listwise ranking methods have very competitive ranking performances [10].

To explain the high ranking performances of the listwise ranking methods, a theoretical framework was proposed in [16]. In this framework, existing listwise ranking methods are interpreted as making use of different surrogate loss functions of the permutation-level 0-1 loss. 
Theoretical analysis shows that these surrogate loss functions are all statistically consistent, in the sense that minimizing their conditional expectation leads to the Bayes ranker, i.e., the optimal ranked list of the objects.

Here we point out that there is a gap between the analysis in [16] and many real ranking problems, where the correct ranking of the entire permutation is not needed. For example, in IR, users usually care much more about the top ranking results, and thus only correct ranking at the top positions is important. In this new situation, it is no longer clear whether existing listwise ranking methods are still statistically consistent. The motivation of this work is to perform a formal study of this issue.

For this purpose, we propose a new ranking framework, in which the "true loss" is defined on the top-k subgroup of permutations instead of on the entire permutation. The new true loss only measures errors occurring at the top k positions of a ranked list; therefore we refer to it as the top-k true loss. (Note that when k equals the length of the ranked list, the top-k true loss becomes exactly the permutation-level 0-1 loss.) We prove a new theorem which gives sufficient conditions for a surrogate loss function to be consistent with the top-k true loss. We also investigate how these conditions change with respect to different values of k. Our analysis shows that, as k decreases, to guarantee the consistency of a surrogate loss function, the requirement on the probability space becomes weaker while the requirement on the surrogate loss function itself becomes stronger. As a result, a surrogate loss function that is consistent with the permutation-level 0-1 loss might not be consistent with the top-k true loss any more. 
Therefore, the surrogate loss functions in existing listwise ranking methods, which have been proved to be consistent with the permutation-level 0-1 loss, are not theoretically guaranteed to have good performances in the top-k setting. Modifications to these surrogate loss functions are needed to make them consistent with the top-k true loss. We show how to make such modifications, and empirically verify that they can lead to significant performance improvement. This validates the correctness of our theoretical analysis.

2 Permutation-level ranking framework

We review the permutation-level ranking framework proposed in [16].

Let X be the input space whose elements are groups of objects to be ranked, Y be the output space whose elements are permutations of objects, and P_XY be an unknown but fixed joint probability distribution of X and Y. Let h ∈ H : X → Y be a ranking function. Let x ∈ X and y ∈ Y, and let y(i) be the index of the object that is ranked at position i in y. The task of learning to rank is to learn a function that can minimize the expected risk R(h), defined as

R(h) = ∫_{X×Y} l(h(x), y) dP(x, y),   (1)

where l(h(x), y) is the true loss such that

l(h(x), y) = 1 if h(x) ≠ y, and 0 if h(x) = y.   (2)

The above true loss indicates that if the permutation of the predicted result is exactly the same as the permutation in the ground truth, then the loss is zero; otherwise the loss is one. For ease of reference, we call it the permutation-level 0-1 loss. The optimal ranking function, which minimizes the expected true risk R(h*) = inf R(h), is referred to as the permutation-level Bayes ranker:

h*(x) = arg max_{y∈Y} P(y|x).   (3)

In practice, for efficiency considerations, the ranking function is usually defined as h(x) = sort(g(x_1), ..., g(x_n)), where g(·) denotes the scoring function and sort(·) denotes the sorting function. Since the risk is non-continuous and non-differentiable with respect to the scoring function g, a continuous and differentiable surrogate loss function φ(g(x), y) is usually used as an approximation of the true loss. In this way, the expected risk becomes

R_φ(g) = ∫_{X×Y} φ(g(x), y) dP(x, y),   (4)

where g(x) = (g(x_1), ..., g(x_n)) is a vector-valued function induced by g.

It has been shown in [16] that many existing listwise ranking methods fall into the above framework, with different surrogate loss functions used. Furthermore, their surrogate loss functions are statistically consistent with respect to the permutation-level 0-1 loss under certain conditions. However, as shown in the next section, the permutation-level 0-1 loss is not suitable for describing the ranking problem in many real applications.

3 Top-k ranking framework

We next describe the real ranking problem, and then propose the top-k ranking framework.

3.1 Top-k ranking problem

In real ranking applications like IR, people pay more attention to the top-ranked objects, and thus correct ranking at the top positions is critically important. For example, modern web search engines only return the top 1,000 results, with 10 results per page. According to a user study¹, 62% of search engine users only click on results within the first page, and 90% of users click on results within the first three pages. This means that two ranked lists of documents will likely provide the same experience to the users (and thus should suffer the same loss) if they have the same ranking results at the top positions. This, however, cannot be reflected in the permutation-level 0-1 loss in Eq.(2). This characteristic of ranking problems has also been explored in earlier studies in different settings [4, 5, 14]. 
We refer to it as the top-k ranking problem.

3.2 Top-k true loss

To better describe the top-k ranking problem, we propose defining the true loss based on the top k positions in a ranked list, referred to as the top-k true loss:

l_k(h(x), y) = 0 if ŷ(i) = y(i) ∀i ∈ {1, ..., k}, where ŷ = h(x); and 1 otherwise.   (5)

The actual value of k is determined by the application. When k equals the length of the entire ranked list, the top-k true loss becomes exactly the permutation-level 0-1 loss. In this regard, the top-k true loss is more general than the permutation-level 0-1 loss.

With Eq.(5), the expected risk becomes

R_k(h) = ∫_{X×Y} l_k(h(x), y) dP(x, y).   (6)

It can be proved that the optimal ranking function with respect to the top-k true loss (i.e., the top-k Bayes ranker) is any permutation in the top-k subgroup having the highest probability², i.e.,

h*_k(x) ∈ arg max_{G_k(j_1,...,j_k) ∈ G_k} P(G_k(j_1, j_2, ..., j_k)|x),   (7)

where G_k(j_1, j_2, ..., j_k) = {y ∈ Y | y(t) = j_t, ∀t = 1, 2, ..., k} denotes a top-k subgroup in which all the permutations have the same top-k true loss, and G_k denotes the collection of all top-k subgroups.

With the above setting, we will analyze the consistency of the surrogate loss functions in existing ranking methods with the top-k true loss in the next section.

4 Theoretical analysis

In this section, we first give the sufficient conditions of consistency for the top-k ranking problem. Next, we show how these conditions change with respect to k. Last, we discuss whether the surrogate loss functions in existing methods are consistent, and how to make them consistent if not.

4.1 Statistical consistency

We investigate what kinds of surrogate loss functions φ(g(x), y) are statistically consistent with the top-k true loss. 
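Before proceeding, the quantities just defined can be illustrated concretely. The following Python sketch (the function names `topk_true_loss` and `topk_bayes_ranker` and the toy probabilities are our own, not from the paper) computes the top-k true loss of Eq.(5) and recovers the top-k Bayes ranker of Eq.(7) by summing permutation probabilities into top-k subgroup probabilities:

```python
from itertools import permutations
from collections import defaultdict

def topk_true_loss(pred, truth, k):
    """Top-k 0-1 loss (Eq. 5): 0 iff pred and truth agree on positions 1..k."""
    return 0 if pred[:k] == truth[:k] else 1

def topk_bayes_ranker(perm_probs, k):
    """Return the top-k prefix with maximal subgroup probability (Eq. 7).

    perm_probs maps permutations (tuples of object ids) to P(y|x).
    A top-k subgroup's probability is the sum of the probabilities of all
    permutations sharing that top-k prefix (footnote 2).
    """
    subgroup = defaultdict(float)
    for y, p in perm_probs.items():
        subgroup[y[:k]] += p
    return max(subgroup, key=subgroup.get)

# Toy example with three objects, using the shape of probabilities from
# Proposition 10 with p1 = 0.5 and p2 = 0.25.
probs = {y: 0.0 for y in permutations((1, 2, 3))}
probs[(1, 2, 3)] = 0.5
probs[(2, 1, 3)] = 0.25
probs[(3, 2, 1)] = 0.25

print(topk_bayes_ranker(probs, 1))              # subgroup G_1(1) has mass 0.5
print(topk_true_loss((1, 3, 2), (1, 2, 3), 1))  # tops agree: loss 0
print(topk_true_loss((1, 3, 2), (1, 2, 3), 2))  # position 2 differs: loss 1
```

Note that any two ranked lists with the same top-k prefix receive the same loss, which is exactly the user-experience argument made above.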
For this purpose, we study whether the ranking function that minimizes the conditional expectation of the surrogate loss function, defined as follows, coincides with the top-k Bayes ranker defined in Eq.(7):

Q(P(y|x), g(x)) = Σ_{y∈Y} P(y|x) φ(g(x), y).   (8)

1 iProspect Search Engine User Behavior Study, April 2006, http://www.iprospect.com/
2 Note that the probability of a top-k subgroup is defined as the sum of the probabilities of the permutations in the subgroup (cf., Definitions 6 and 7 in [3]).

According to [1], the above condition is the weakest condition guaranteeing that optimizing a surrogate loss function leads to a model achieving the Bayes risk (in our case, the top-k Bayes ranker) when the training sample size approaches infinity.

We abbreviate Q(P(y|x), g(x)) as Q(p, g), g(x) as g, and P(y|x) as p_y. Hence, Q(p, g) is the loss of g at x with respect to the conditional probability distribution p_y. The key idea is to decompose the sorting of g into pairwise relationships between the scores of objects. To this end, we denote by Y_{i,j} the set of permutations that rank object i before object j, i.e., Y_{i,j} ≜ {y ∈ Y : y⁻¹(i) < y⁻¹(j)}, where y⁻¹(j) denotes the position of object j in permutation y, and introduce the following definitions.

Definition 1. Λ_{G_k} is the top-k subgroup probability space, Λ_{G_k} ≜ {p ∈ R^{|G_k|} : Σ_{G_k(j_1,...,j_k)∈G_k} p_{G_k(j_1,...,j_k)} = 1, p_{G_k(j_1,...,j_k)} ≥ 0}.

Definition 2. A top-k subgroup probability space Λ_{G_k} is order preserving with respect to objects i and j if, ∀y ∈ Y_{i,j} with G_k(y(1), ..., y(k)) ≠ G_k(σ⁻¹_{i,j}y(1), ..., σ⁻¹_{i,j}y(k)), we have p_{G_k(y(1),...,y(k))} > p_{G_k(σ⁻¹_{i,j}y(1),...,σ⁻¹_{i,j}y(k))}. Here σ⁻¹_{i,j}y denotes the permutation in which the positions of objects i and j are exchanged while those of the other objects remain the same as in y.

Definition 3. A surrogate loss function φ is top-k subgroup order sensitive on a set Ω ⊂ R^n, if φ is a non-negative differentiable function and the following three conditions hold for all objects i and j. (1) φ(g, y) = φ(σ⁻¹_{i,j}g, σ⁻¹_{i,j}y). (2) Assume g_i < g_j and let y ∈ Y_{i,j}. If G_k(y(1), ..., y(k)) ≠ G_k(σ⁻¹_{i,j}y(1), ..., σ⁻¹_{i,j}y(k)), then φ(g, y) ≥ φ(g, σ⁻¹_{i,j}y), and for at least one such y the strict inequality holds; otherwise, φ(g, y) = φ(g, σ⁻¹_{i,j}y). (3) Assume g_i = g_j. Then ∃y ∈ Y_{i,j} with G_k(y(1), ..., y(k)) ≠ G_k(σ⁻¹_{i,j}y(1), ..., σ⁻¹_{i,j}y(k)) satisfying ∂φ(g, σ⁻¹_{i,j}y)/∂g_i > ∂φ(g, y)/∂g_i.

The order preserving property of a top-k subgroup probability space (see Definition 2) indicates that if the top-k subgroup probability of a permutation y ∈ Y_{i,j} is larger than that of the permutation σ⁻¹_{i,j}y, then the same relation holds for any other permutation y′ ∈ Y_{i,j} and its counterpart σ⁻¹_{i,j}y′, provided that the top-k subgroup of the former differs from that of the latter. The order sensitive property of a surrogate loss function (see Definition 3) indicates that (i) φ(g, y) is symmetric, in the sense that simultaneously exchanging the positions of objects i and j in the ground truth and their scores in the predicted score list does not change the surrogate loss; (ii) when a permutation is transformed into another by exchanging the positions of two of its objects, and the two permutations do not belong to the same top-k subgroup, the loss of the permutation that ranks the two objects in decreasing order of their scores is no greater than the loss of its counterpart; and (iii) there exists a permutation for which the loss changes faster with respect to the score of an object after exchanging its position with that of another object having the same score but ranked lower. A top-k subgroup order sensitive surrogate loss function has several nice properties, as shown below.

Proposition 4. Let φ(g, y) be a top-k subgroup order sensitive loss function. ∀y and ∀π ∈ G_k(y(1), y(2), ..., y(k)), we have φ(g, π) = φ(g, y).

Proposition 5. Let φ(g, y) be a top-k subgroup order sensitive surrogate loss function. For all objects i and j with g_i = g_j and ∀y ∈ Y_{i,j}, if G_k(y(1), ..., y(k)) ≠ G_k(σ⁻¹_{i,j}y(1), ..., σ⁻¹_{i,j}y(k)), then ∂φ(g, σ⁻¹_{i,j}y)/∂g_i ≥ ∂φ(g, y)/∂g_i; otherwise, ∂φ(g, σ⁻¹_{i,j}y)/∂g_i = ∂φ(g, y)/∂g_i.

Proposition 4 shows that all permutations in the same top-k subgroup share the same loss φ(g, y), and thus the same partial derivative with respect to the score of any given object. Proposition 5 indicates that the partial derivative of φ(g, y) has a property similar to that of φ(g, y) itself (see the second condition in Definition 3). Due to space restrictions, we omit the proofs (see [15] for details).

Based on the above definitions and propositions, we give the main theorem (Theorem 6), which states sufficient conditions for a surrogate loss function to be consistent with the top-k true loss.

Theorem 6. 
Let φ be a top-k subgroup order sensitive loss function on Ω ⊂ R^n. For any n objects, if the top-k subgroup probability space is order preserving with respect to the n − 1 object pairs {(j_i, j_{i+1})}_{i=1}^{k} and {(j_{k+s_i}, j_{k+i}) : 0 ≤ s_i < i}_{i=2}^{n−k}, then the loss φ(g, y) is consistent with the top-k true loss as defined in Eq.(5).

The proof of the main theorem is mostly based on Theorem 7, which specifies the score relation between two objects under the minimizer of Q(p, g). Due to space restrictions, we only give Theorem 7 and its detailed proof here. For the detailed proof of Theorem 6, please refer to [15].

Theorem 7. Let φ(g, y) be a top-k subgroup order sensitive loss function. For any objects i and j, if the top-k subgroup probability space is order preserving with respect to them, and g is a vector that minimizes Q(p, g) in Eq.(8), then g_i > g_j.

Proof. Without loss of generality, we assume i = 1 and j = 2, and let g′ be the vector with g′_1 = g_2, g′_2 = g_1, and g′_k = g_k for k > 2.

First, we prove g_1 ≥ g_2 by contradiction. Assume g_1 < g_2. We have

Q(p, g′) − Q(p, g) = Σ_{y∈Y} (p_{σ⁻¹_{1,2}y} − p_y) φ(g, y) = Σ_{y∈Y_{1,2}} (p_{σ⁻¹_{1,2}y} − p_y)(φ(g, y) − φ(g, σ⁻¹_{1,2}y)).

The first equality is based on the fact g′ = σ⁻¹_{1,2}g, and the second on the fact σ⁻¹_{1,2}σ⁻¹_{1,2}y = y. After some algebra, using Proposition 4, we have

Q(p, g′) − Q(p, g) = Σ_{G_k(y) ∈ {G_k : G_k(y) ≠ G_k(σ⁻¹_{1,2}y)}, y∈Y_{1,2}} (p_{G_k(σ⁻¹_{1,2}y)} − p_{G_k(y)})(φ(g, y) − φ(g, σ⁻¹_{1,2}y)),

where G_k(y) denotes the subgroup that y belongs to. Since g_1 < g_2, we have φ(g, y) ≥ φ(g, σ⁻¹_{1,2}y). Meanwhile, p_{G_k(σ⁻¹_{1,2}y)} < p_{G_k(y)} by the order preserving property of the top-k subgroup probability space. Thus each term in the sum is non-positive and at least one of them is negative, which means Q(p, g′) < Q(p, g). This contradicts the optimality of g. Therefore, we must have g_1 ≥ g_2.

Second, we prove g_1 ≠ g_2, again by contradiction. Assume g_1 = g_2. By setting the derivatives of Q(p, g) with respect to g_1 and g_2 to zero and comparing them³, we have

Σ_{y∈Y_{1,2}} (p_y − p_{σ⁻¹_{1,2}y})(∂φ(g, y)/∂g_1 − ∂φ(g, σ⁻¹_{1,2}y)/∂g_1) = 0.

After some algebra, we obtain

Σ_{G_k(y) ∈ {G_k : G_k(y) ≠ G_k(σ⁻¹_{1,2}y)}, y∈Y_{1,2}} (p_{G_k(y)} − p_{G_k(σ⁻¹_{1,2}y)})(∂φ(g, y)/∂g_1 − ∂φ(g, σ⁻¹_{1,2}y)/∂g_1) = 0.

According to Proposition 5, we have ∂φ(g, y)/∂g_1 ≤ ∂φ(g, σ⁻¹_{1,2}y)/∂g_1. Meanwhile, p_{G_k(σ⁻¹_{1,2}y)} < p_{G_k(y)} by the order preserving property of the top-k subgroup probability space. Thus the above equation cannot hold, since at least one term in the sum is negative according to Definition 3.

4.2 Consistency with respect to k

We discuss how the consistency conditions change with respect to the value of k.

First, we have the following proposition for the top-k subgroup probability space.

Proposition 8. If the top-k subgroup probability space is order preserving with respect to objects i and j, then the top-(k − 1) subgroup probability space is also order preserving with respect to i and j.

The proposition can be proved by decomposing a top-(k − 1) subgroup into a sum of top-k subgroups. One can find the detailed proof in [15]. Here we give an example to illustrate the basic idea. Suppose there are three objects {1, 2, 3} to be ranked. 
If the top-2 subgroup probability space is order preserving with respect to objects 1 and 2, then we have p_{G2(1,2)} > p_{G2(2,1)}, p_{G2(1,3)} > p_{G2(2,3)}, and p_{G2(3,1)} > p_{G2(3,2)}. On the other hand, for top-1, we have p_{G1(1)} > p_{G1(2)}. Note that p_{G1(1)} = p_{G2(1,2)} + p_{G2(1,3)} and p_{G1(2)} = p_{G2(2,1)} + p_{G2(2,3)}. Thus, it is easy to verify that Proposition 8 holds for this case, while the opposite does not.

Second, we obtain the following proposition for the surrogate loss function φ.

3 By trivial modifications, one can handle the case that g_1 or g_2 is infinite (cf. [17]).

Proposition 9. If the surrogate loss function φ is top-k subgroup order sensitive on a set Ω ⊂ R^n, then it is also top-(k + 1) subgroup order sensitive on the same set.

Again, one can refer to [15] for the detailed proof of the proposition; here we only provide an example. Consider the same setting as in the previous example, and assume g_1 < g_2. If φ is top-1 subgroup order sensitive, then we have φ(g, (1, 2, 3)) ≥ φ(g, (2, 1, 3)), φ(g, (1, 3, 2)) ≥ φ(g, (2, 3, 1)), and φ(g, (3, 1, 2)) = φ(g, (3, 2, 1)). From Proposition 4, we know that the two inequalities are strict. On the other hand, if φ is top-2 subgroup order sensitive, the following inequalities hold with at least one of them being strict: φ(g, (1, 2, 3)) ≥ φ(g, (2, 1, 3)), φ(g, (1, 3, 2)) ≥ φ(g, (2, 3, 1)), and φ(g, (3, 1, 2)) ≥ φ(g, (3, 2, 1)). Therefore being top-1 subgroup order sensitive is a special case of being top-2 subgroup order sensitive.

According to the above propositions, we can come to the following conclusions.

• For the consistency with the top-k true loss, when k becomes smaller, the requirement on the probability space becomes weaker but the requirement on the surrogate loss function becomes stronger. 
Since we never know the real property of the (unknown) probability space, it is more likely that the requirement on the probability space for consistency with the top-k true loss can be satisfied than the corresponding requirement for the top-l (l > k) true loss. In particular, it is risky to assume that the requirement for the permutation-level 0-1 loss holds.

• If we fix the true loss to be top-k and the probability space to be top-k subgroup order preserving, the surrogate loss function should be at most top-l (l ≤ k) subgroup order sensitive in order to meet the consistency conditions. It is not guaranteed that a top-l (l > k) subgroup order sensitive surrogate loss function is consistent with the top-k true loss. For example, a top-1 subgroup order sensitive surrogate loss function may be consistent with any top-k true loss, but a permutation-level order sensitive surrogate loss function may not be consistent with any top-k true loss, if k is smaller than the length of the list.

To illustrate the above discussion, consider the example in the following proposition (its proof can be found in [15]). It basically says that given a probability space that is top-1 subgroup order preserving, a top-3 subgroup order sensitive surrogate loss function may not be consistent with the top-1 true loss.

Proposition 10. Suppose there are three objects to be ranked. φ is a top-3 subgroup order sensitive loss function, and the strict inequality φ(g, (3, 1, 2)) < φ(g, (3, 2, 1)) holds when g_1 > g_2. The probabilities of the permutations are p_123 = p_1, p_132 = 0, p_213 = p_2, p_231 = 0, p_312 = 0, and p_321 = p_2, respectively, where p_1 > p_2. 
Then φ is not consistent with the top-1 true loss.

The above discussions imply that although the surrogate loss functions in existing listwise ranking methods are consistent with the permutation-level 0-1 loss (under a rigid condition), they may not be consistent with the top-k true loss (under a mild condition). Therefore, it is necessary to modify these surrogate loss functions. We discuss this in the next subsection.

4.3 Consistent surrogate loss functions

In [16], the surrogate loss functions in ListNet, RankCosine, and ListMLE have been proved to be permutation-level order sensitive. According to the discussion in the previous subsection, however, they may not be top-k subgroup order sensitive, and therefore may not be consistent with the top-k true loss. Even for consistency with the permutation-level 0-1 loss, the condition required on the probability space may be too strong in some real scenarios. To tackle this challenge, it is desirable to modify these surrogate loss functions to make them top-k subgroup order sensitive. 
Actually this is doable, and the modifications to the aforementioned surrogate loss functions are given as follows.

4.3.1 Likelihood loss

The likelihood loss is the loss function used in ListMLE [16], defined as

φ(g(x), y) = − log P(y|x; g),  where  P(y|x; g) = ∏_{i=1}^{n} exp(g(x_{y(i)})) / Σ_{t=i}^{n} exp(g(x_{y(t)})).   (9)

We propose replacing the permutation probability with the top-k subgroup probability (which is also defined with the Luce model [11]) in the above definition:

P(y|x; g) = ∏_{i=1}^{k} exp(g(x_{y(i)})) / Σ_{t=i}^{n} exp(g(x_{y(t)})).   (10)

It can be proved that the modified loss is top-k subgroup order sensitive (see [15]).

4.3.2 Cosine loss

The cosine loss is the loss function used in RankCosine [13], defined as

φ(g(x), y) = (1/2) (1 − ψ_y(x)ᵀ g(x) / (‖ψ_y(x)‖ ‖g(x)‖)),   (11)

where the score vector of the ground truth is produced by a mapping function ψ_y(·) : R^d → R that retains the order in the permutation, i.e., ψ_y(x_{y(1)}) > ··· > ψ_y(x_{y(n)}).

We propose changing the mapping function as follows: let it retain the order for the top k positions of the ground truth permutation and assign to all the remaining positions a common small value ε, smaller than the score of any object ranked at the top k positions, i.e., ψ_y(x_{y(1)}) > ··· > ψ_y(x_{y(k)}) > ψ_y(x_{y(k+1)}) = ··· = ψ_y(x_{y(n)}) = ε. 
It can be proved that after the modification, the cosine loss becomes top-k subgroup order sensitive (see [15]).

4.3.3 Cross entropy loss

The cross entropy loss is the loss function used in ListNet [3], defined as

φ(g(x), y) = D(P(π|x; ψ_y) || P(π|x; g)),   (12)

where ψ is a mapping function defined similarly to that in RankCosine, and P(π|x; ψ_y) and P(π|x; g) are permutation probabilities in the Luce model.

We propose using a mapping function to modify the cross entropy loss in a similar way as in the case of the cosine loss⁴. It can be proved that such a modification makes the surrogate loss function top-k subgroup order sensitive (see [15]).

5 Experimental results

In order to validate the theoretical analysis in this work, we conducted an empirical study. Specifically, we performed experiments on the OHSUMED, TD2003, and TD2004 datasets in the LETOR benchmark collection [10]. As evaluation measures, we adopted Normalized Discounted Cumulative Gain (N) at positions 1, 3, and 10, and Precision (P) at positions 1, 3, and 10.⁵ These measures are top-k oriented and are therefore suitable for evaluating ranking performance in top-k ranking problems.

We chose ListMLE as the example method, since the likelihood loss has nice properties such as convexity, soundness, and linear computational complexity [16]. We refer to the new method obtained by applying the modifications described in Section 4.3 as top-k ListMLE. We tried different values of k (i.e., k = 1, 3, 10, and the exact length of the ranked list). Obviously the last case corresponds to the original likelihood loss in ListMLE.

Since the training data in LETOR is given in the form of multi-level ratings, we adopted the methods proposed in [16] to produce the ground-truth ranked list. 
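The loss that top-k ListMLE optimizes, i.e., the negative log of the top-k subgroup probability in Eq.(10), can be sketched in a few lines of Python. This is our own minimal sketch (the function name `listmle_loss` and the toy scores are illustrative, not from the paper): only the first k factors of the Luce-model factorization are kept, and k = n recovers the original likelihood loss of Eq.(9).

```python
import math

def listmle_loss(scores, truth, k=None):
    """Negative log-likelihood of the (top-k) Luce model, Eq.(9)/(10).

    scores: dict object_id -> g(x); truth: ground-truth permutation
    (a sequence of object ids, best first). With k = None or k = n this
    is the original ListMLE likelihood loss; with k < n only the top-k
    factors are kept.
    """
    n = len(truth)
    k = n if k is None else k
    loss = 0.0
    for i in range(k):
        # Denominator sums over the objects not yet placed (positions i..n-1).
        denom = sum(math.exp(scores[truth[t]]) for t in range(i, n))
        loss += math.log(denom) - scores[truth[i]]
    return loss

g = {'a': 2.0, 'b': 1.0, 'c': 0.0}
print(listmle_loss(g, ('a', 'b', 'c'), k=1))  # top-1 loss
print(listmle_loss(g, ('a', 'b', 'c')))       # full-permutation loss
```

Consistent with Proposition 4, the top-1 loss is identical for any two ground-truth permutations that share the same top-1 object, since the first factor depends only on which object is ranked first.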
We then used stochastic gradient descent as the optimization algorithm for the likelihood loss. As the ranking model, we chose a linear neural network, since this model has been widely used [3, 13, 16].

4 Note that in [3], a top-k cross entropy loss was also proposed, by using the top-k Luce model. However, it can be verified that the so-defined top-k cross entropy loss is still permutation-level order sensitive, but not top-k subgroup order sensitive. In other words, the proposed modification here is still needed.
5 On datasets with only two ratings, such as TD2003 and TD2004, N@1 equals P@1.

The experimental results are summarized in Tables 1-3.

Table 1: Ranking accuracies on OHSUMED

Methods           N@1     N@3     N@10    P@1     P@3     P@10
ListMLE           0.548   0.473   0.446   0.642   0.582   0.495
Top-1 ListMLE     0.529   0.482   0.447   0.652   0.595   0.499
Top-3 ListMLE     0.535   0.484   0.445   0.671   0.608   0.504
Top-10 ListMLE    0.558   0.473   0.444   0.672   0.601   0.509

Table 2: Ranking accuracies on TD2003

Methods           N/P@1   N@3     N@10    P@3     P@10
ListMLE           0.24    0.253   0.261   0.22    0.146
Top-1 ListMLE     0.4     0.329   0.314   0.3     0.176
Top-3 ListMLE     0.44    0.382   0.343   0.34    0.204
Top-10 ListMLE    0.5     0.410   0.378   0.38    0.22

Table 3: Ranking accuracies on TD2004

Methods           N/P@1   N@3     N@10    P@3     P@10
ListMLE           0.4     0.351   0.356   0.284   0.188
Top-1 ListMLE     0.52    0.469   0.451   0.413   0.248
Top-3 ListMLE     0.506   0.456   0.458   0.417   0.261
Top-10 ListMLE    0.52    0.469   0.472   0.413   0.269

From the tables, we can see that with the modifications, the ranking accuracies of ListMLE are significantly boosted, in terms of all measures, on both TD2003 and TD2004. This clearly validates our theoretical analysis. On OHSUMED, all the loss functions achieve comparable performances. A possible explanation is that the probability space in OHSUMED is well formed, such that it is order preserving for many different values of k.

Next, we take Top-10 ListMLE as an example to compare with several baseline methods: Ranking SVM [8], RankBoost [7], ListNet [3], and RankCosine [13]. The results are listed in Tables 4-6.

Table 4: Ranking accuracies on OHSUMED

Methods           N@1     N@3     N@10    P@1     P@3     P@10
RankBoost         0.497   0.472   0.435   0.604   0.586   0.495
Ranking SVM       0.495   0.464   0.441   0.633   0.592   0.507
ListNet           0.523   0.477   0.448   0.642   0.602   0.509
RankCosine        0.523   0.475   0.437   0.642   0.589   0.493
Top-10 ListMLE    0.558   0.473   0.444   0.672   0.601   0.509

We can see from the tables that Top-10 ListMLE achieves the best performance among all the methods on the TD2003 and TD2004 datasets in terms of almost all measures. On the OHSUMED dataset, it also performs fairly well as compared to the other methods. 
Especially for N@1 and P@1, Top-10 ListMLE significantly outperforms all the other methods on all the datasets.

Table 5: Ranking accuracies on TD2003

Methods          N/P@1  N@3    N@10   P@3    P@10
RankBoost        0.26   0.270  0.285  0.24   0.178
Ranking SVM      0.42   0.378  0.341  0.34   0.206
ListNet          0.46   0.408  0.374  0.36   0.222
RankCosine       0.36   0.346  0.322  0.3    0.182
Top-10 ListMLE   0.5    0.410  0.378  0.38   0.22

Table 6: Ranking accuracies on TD2004

Methods          N/P@1  N@3    N@10   P@3    P@10
RankBoost        0.48   0.463  0.471  0.404  0.253
Ranking SVM      0.44   0.409  0.420  0.351  0.225
ListNet          0.439  0.437  0.457  0.399  0.257
RankCosine       0.439  0.397  0.405  0.328  0.209
Top-10 ListMLE   0.52   0.469  0.472  0.413  0.269

From the above experimental results, we can conclude that for real ranking applications like IR, where top-k evaluation measures are widely used, it is better to use the top-k true loss than the permutation-level 0-1 loss, and better to use the modified surrogate loss functions than the original ones.

6 Conclusion

In this paper we have proposed a top-k ranking framework, which better describes real ranking applications such as information retrieval. In this framework, the true loss is defined on the top-k subgroup of permutations. We have derived sufficient conditions for a surrogate loss function to be statistically consistent with the top-k true loss, and have discussed how to modify the loss functions in existing listwise ranking methods to make them consistent with it. Our experiments have shown that with the proposed modifications, algorithms like ListMLE can significantly outperform their original versions, as well as many other ranking methods.

As future work, we plan to investigate the following issues.
(1) We will empirically study the modified ListNet and RankCosine, to see whether their performances can also be significantly boosted in the top-k setting. (2) We will study the consistency of the pointwise and pairwise loss functions with the top-k true loss.

References

[1] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, 2006.

[2] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proc. of ICML'05, pages 89-96, 2005.

[3] Z. Cao, T. Qin, T. Y. Liu, M. F. Tsai, and H. Li. Learning to rank: From pairwise approach to listwise approach. In Proc. of ICML'07, pages 129-136, 2007.

[4] S. Clemencon and N. Vayatis. Ranking the best instances. Journal of Machine Learning Research, 8:2671-2699, 2007.

[5] D. Cossock and T. Zhang. Subset ranking using regression. In Proc. of COLT'06, pages 605-619, 2006.

[6] D. Cossock and T. Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54:5140-5154, 2008.

[7] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In Proc. of ICML'98, pages 170-178, 1998.

[8] R. Herbrich, T. Graepel, and K. Obermayer. Support vector learning for ordinal regression. In Proc. of ICANN'99, pages 97-102, 1999.

[9] P. Li, C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems 20 (NIPS'07), pages 897-904, Cambridge, MA, 2008. MIT Press.

[10] T. Y. Liu, T. Qin, J. Xu, W. Y. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In LR4IR 2007, in conjunction with SIGIR 2007, 2007.

[11] J. I. Marden, editor. Analyzing and Modeling Rank Data. Chapman and Hall, London, 1995.

[12] R. Nallapati. Discriminative models for information retrieval. In Proc. of SIGIR'04, pages 64-71, 2004.

[13] T. Qin, X.-D. Zhang, M.-F. Tsai, D.-S. Wang, T.-Y. Liu, and H. Li. Query-level loss functions for information retrieval. Information Processing and Management, 44:838-855, 2008.

[14] C. Rudin. Ranking with a p-norm push. In Proc. of COLT'06, pages 589-604, 2006.

[15] F. Xia, T. Y. Liu, and H. Li. Top-k consistency of learning to rank methods. Technical report, Microsoft Research, MSR-TR-2009-139, 2009.

[16] F. Xia, T. Y. Liu, J. Wang, W. S. Zhang, and H. Li. Listwise approach to learning to rank: Theory and algorithm. In Proc. of ICML'08, pages 1192-1199, 2008.

[17] T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004.