{"title": "Learning to Rank with Nonsmooth Cost Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 193, "page_last": 200, "abstract": null, "full_text": "Learning to Rank with Nonsmooth Cost Functions\n\nChristopher J.C. Burges\n\nMicrosoft Research\nOne Microsoft Way\n\nRedmond, WA 98052, USA\ncburges@microsoft.com\n\nRobert Ragno\n\nMicrosoft Research\nOne Microsoft Way\n\nRedmond, WA 98052, USA\nrragno@microsoft.com\n\nQuoc Viet Le\n\nStatistical Machine\nLearning Program\n\nNICTA, ACT 2601, Australia\n\nquoc.le@anu.edu.au\n\nAbstract\n\nThe quality measures used in information retrieval are particularly dif\ufb01cult to op-\ntimize directly, since they depend on the model scores only through the sorted\norder of the documents returned for a given query. Thus, the derivatives of the\ncost with respect to the model parameters are either zero, or are unde\ufb01ned. In\nthis paper, we propose a class of simple, \ufb02exible algorithms, called LambdaRank,\nwhich avoids these dif\ufb01culties by working with implicit cost functions. We de-\nscribe LambdaRank using neural network models, although the idea applies to\nany differentiable function class. We give necessary and suf\ufb01cient conditions for\nthe resulting implicit cost function to be convex, and we show that the general\nmethod has a simple mechanical interpretation. We demonstrate signi\ufb01cantly im-\nproved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We\nalso show that LambdaRank provides a method for signi\ufb01cantly speeding up the\ntraining phase of that ranking algorithm. Although this paper is directed towards\nranking, the proposed method can be extended to any non-smooth and multivariate\ncost functions.\n\n1 Introduction\n\nIn many inference tasks, the cost function1 used to assess the \ufb01nal quality of the system is not the one\nused during training. 
For example, for classification tasks, an error rate for a binary SVM classifier might be reported, although the cost function used to train the SVM only very loosely models the number of errors on the training set; similarly, neural net training uses smooth costs, such as MSE or cross entropy. Thus often in machine learning tasks, there are actually two cost functions: the desired cost, and the one used in the optimization process. For brevity we will call the former the 'target' cost, and the latter the 'optimization' cost. The optimization cost plays two roles: it is chosen to make the optimization task tractable (smooth, convex, etc.), and it should approximate the target cost well. This mismatch between target and optimization costs is not limited to classification tasks, and is particularly acute for information retrieval. For example, [10] list nine target quality measures that are commonly used in information retrieval, all of which depend only on the sorted order of the documents2 and their labeled relevance. The target costs are usually averaged over a large number of queries to arrive at a single cost that can be used to assess the algorithm. These target costs present severe challenges to machine learning: they are either flat (have zero gradient with respect to the model scores), or are discontinuous, everywhere. It is very likely that a significant mismatch between the target and optimization costs will have a substantial adverse impact on the accuracy of the algorithm.

1Throughout this paper, we will use the terms "cost function" and "quality measure" interchangeably, with the understanding that the cost function is some monotonic decreasing function of the corresponding quality measure.

2For concreteness we will use the term 'documents' for the items returned for a given query, although the returned items can be more general (e.g.
multimedia items).

In this paper, we propose one method for attacking this problem. Perhaps the first approach that comes to mind would be to design smoothed versions of the cost function, but the inherent 'sort' makes this very challenging. Our method bypasses the problems introduced by the sort, by defining a virtual gradient on each item after the sort. The method is simple and very general: it can be used for any target cost function. However, in this paper we restrict ourselves to the information retrieval domain. We show that the method gives significant benefits (for both training speed and accuracy) for applications of commercial interest.

Notation: for the search problem, we denote the score of the ranking function by s_ij, where i = 1, ..., N_Q indexes the query, and j = 1, ..., n_i indexes the documents returned for that query. The general cost function is denoted C({s_ij}, {l_ij}), where the curly braces denote sets of cardinality n_i, and where l_ij is the label of the j'th document returned for the i'th query, where j indexes the documents sorted by score. We will drop the query index i when the meaning is clear. Ranked lists are indexed from the top, which is convenient when list length varies, and to conform with the notion that high rank means closer to the top of the list, we will take "higher rank" to mean "lower rank index". Terminology: for neural networks, we will use 'fprop' and 'backprop' as abbreviations for a forward pass, and for a weight-updating backward pass, respectively. Throughout this paper we also use the term "smooth" to denote C^1 (i.e. with first derivatives everywhere defined).

2 Common Quality Measures Used in Information Retrieval

We list some commonly used quality measures for information retrieval tasks: see [10] and references therein for details.
We distinguish between binary and multilevel measures: for binary measures, we assume labels in {0, 1}, with 1 meaning relevant and 0 meaning not. Average Precision is a binary measure where for each relevant document, the precision is computed at its position in the ordered list, and these precisions are then averaged over all relevant documents. The corresponding quantity averaged over queries is called 'Mean Average Precision'. Mean Reciprocal Rank (MRR) is also a binary measure: if r_i is the rank of the highest ranking relevant document for the i'th query, then the MRR is just the reciprocal rank, averaged over queries: MRR = (1/N_Q) \sum_{i=1}^{N_Q} 1/r_i. MRR was used, for example, in TREC evaluations of Question Answering systems, before 2002 [14]. Winner Takes All (WTA) is a binary measure for which, if the top ranked document for a given query is relevant, the WTA cost is zero, otherwise it is one. WTA is used, for example, in TREC evaluations of Question Answering systems, after 2002 [14]. Pair-wise Correct is a multilevel measure that counts the number of pairs that are in the correct order, as a fraction of the maximum possible number of such pairs, for a given query. In fact for binary classification tasks, the pair-wise correct is the same as the AUC, which has led to work exploring optimizing the AUC using ranking algorithms [15, 3]. bpref biases the pairwise correct to the top part of the ranking by choosing a subset of documents from which to compute the pairs [1, 10]. The Normalized Discounted Cumulative Gain (NDCG) is a cumulative, multilevel measure of ranking quality that is usually truncated at a particular rank level [6]. For a given query Q_i the NDCG is computed as

    \mathcal{N}_i \equiv N_i \sum_{j=1}^{L} (2^{r(j)} - 1) / \log(1 + j)        (1)

where r(j) is the relevance level of the j'th document, and where the normalization constant N_i is chosen so that a perfect ordering would result in \mathcal{N}_i = 1.
Here L is the ranking truncation level at\nwhich the NDCG is computed. The Ni are then averaged over the query set. NDCG is particularly\nwell suited to Web search applications because it is multilevel and because the truncation level can\nbe chosen to re\ufb02ect how many documents are shown to the user. For this reason we will use the\nNDCG measure in this paper.\n\n3 Previous Work\n\nThe ranking task is the task of \ufb01nding a sort on a set, and as such is related to the task of learning\nstructured outputs. Our approach is very different, however, from recent work on structured outputs,\nsuch as the large margin methods of [12, 13]. There, structures are also mapped to the reals (through\nchoice of a suitable inner product), but the best output is found by estimating the argmax over all\n\n\fpossible outputs. The ranking problem also maps outputs (documents) to the reals, but solves a\nmuch simpler problem in that the number of documents to be sorted is tractable. Our focus is on\na very different aspect of the problem, namely, \ufb01nding ways to directly optimize the cost that the\nuser ultimately cares about. As in [7], we handle cost functions that are multivariate, in the sense\nthat the number of documents returned for a given query can itself vary, but the key challenge we\naddress in this paper is how to work with costs that are everywhere either \ufb02at or non-differentiable.\nHowever, we emphasize that the method also handles the case of multivariate costs that cannot be\nrepresented as a sum of terms, each depending on the output for a single feature vector and its label.\nWe call such functions irreducible (such costs are also considered by [7]). Most cost functions used\nin machine learning are instead reducible (for example, MSE, cross entropy, log likelihood, and\nthe costs commonly used in kernel methods). 
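As a concrete reference for the NDCG measure of Eq. (1), here is a minimal sketch (our own, not part of the original algorithms; the natural logarithm is assumed for the discount, and the ideal ordering is obtained by simply sorting the labels):

```python
import math

def ndcg_at_L(relevances, L):
    """NDCG truncated at rank L, following Eq. (1): a document with relevance
    level r at rank position j contributes gain (2^r - 1) / log(1 + j).

    `relevances` lists the labels of the returned documents in ranked order,
    top of the list first (so the first entry has rank index j = 1).
    """
    def dcg(rels):
        return sum((2 ** r - 1) / math.log(1 + j)
                   for j, r in enumerate(rels[:L], start=1))

    # The normalization N_i is the reciprocal of the DCG of a perfect ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfect ordering scores 1.0 by construction; any inversion among documents with different labels strictly lowers the value.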
The ranking problem itself has attracted increasing attention recently (see for example [4, 2, 8]), and in this paper we will use the RankNet algorithm of [2] as a baseline, since it is both easy to implement and performs well on large retrieval tasks.

4 LambdaRank

One approach to working with a nonsmooth target cost function would be to search for an optimization function which is a good approximation to the target cost, but which is also smooth. However, the sort required by information retrieval cost functions makes this problematic. Even if the target cost depends on only the top few ranked positions after sorting, the sort itself depends on all documents returned for the query, and that set can be very large; and since the target costs depend on only the rank order and the labels, the target cost functions are either flat or discontinuous in the scores of all the returned documents. We therefore consider a different approach. We illustrate the idea with an example which also demonstrates the perils introduced by a target / optimization cost mismatch. Let the target cost be WTA and let the chosen optimization cost be a smooth approximation to pairwise error. Suppose that a ranking algorithm A is being trained, and that at some iteration, for a query for which there are only two relevant documents D1 and D2, A gives D1 rank one and D2 rank n. Then on this query, A has WTA cost zero, but a pairwise error cost of n - 2. If the parameters of A are adjusted so that D1 has rank two, and D2 rank three, then the WTA error is now maximized, but the number of pairwise errors has been reduced by n - 4. Now suppose that at the next iteration, D1 is at rank two, and D2 at rank n >> 1. The change in D1's score that is required to move it to top position is clearly less (possibly much less) than the change in D2's score required to move it to top position.
Roughly speaking, we would prefer A to spend a little capacity moving D1 up by one position, than have it spend a lot of capacity moving D2 up by n - 1 positions. If j1 and j2 are the rank indices of D1, D2 respectively, then instead of pairwise error, we would prefer an optimization cost C that has the property that

    |\partial C / \partial s_{j_1}|  >>  |\partial C / \partial s_{j_2}|        (2)

whenever j2 >> j1. This illustrates the two key intuitions behind LambdaRank: first, it is usually much easier to specify rules determining how we would like the rank order of documents to change, after sorting them by score for a given query, than to construct a general, smooth optimization cost that has the desired properties for all orderings. By only having to specify rules for a given ordering, we are defining the gradients of an implicit cost function C only at the particular points in which we are interested. Second, the rules can encode our intuition of the limited capacity of the learning algorithm, as illustrated by Eq. (2). Let us write the gradient of C with respect to the score of the document at rank position j, for the i'th query, as

    \partial C / \partial s_j = -\lambda_j(s_1, l_1, ..., s_{n_i}, l_{n_i})        (3)

The sign is chosen so that positive \lambda_j means that the document must move up the ranked list to reduce the cost. Thus, in this framework choosing an implicit cost function amounts to choosing suitable \lambda_j, which themselves are specified by rules that can depend on the ranked order (and scores) of all the documents. We will call these choices the \lambda functions. At this point two questions naturally arise: first, given a choice for the \lambda functions, when does there exist a function C for which Eq. (3) holds; and second, given that it exists, when is C convex? We have the following result from multilinear algebra (see e.g.
[11]):

Theorem (Poincaré Lemma): If S ⊂ R^n is an open set that is star-shaped with respect to the origin, then every closed form on S is exact.

Note that since every exact form is closed, it follows that on an open set that is star-shaped with respect to the origin, a form is closed if and only if it is exact. Now for a given query Q_i and corresponding set of returned D_ij, the n_i \lambda's are functions of the scores s_ij, parameterized by the (fixed) labels l_ij. Let dx_j be a basis of 1-forms on R^n and define the 1-form

    \lambda \equiv \sum_j \lambda_j dx_j        (4)

Then assuming that the scores are defined over R^n, the conditions for the theorem are satisfied and \lambda = dC for some function C if and only if d\lambda = 0 everywhere. Using classical notation, this amounts to requiring that

    \partial \lambda_j / \partial s_k = \partial \lambda_k / \partial s_j    \forall j, k \in {1, ..., n_i}        (5)

This provides a simple test on the \lambda's to determine if there exists a cost function for which they are the derivatives: the Jacobian (that is, the matrix J_jk \equiv \partial \lambda_j / \partial s_k) must be symmetric. Furthermore, given that such a cost function C does exist, then since its Hessian is just the above Jacobian, the condition that C be convex is that the Jacobian be positive semidefinite everywhere. Under these constraints, the Jacobian looks rather like a kernel matrix, except that while an entry of a kernel matrix depends on two elements of a vector space, an entry of the Jacobian can depend on all of the scores s_j. Note that for constant \lambda's, the above two conditions are trivially satisfied, and that for other choices that give rise to symmetric J, positive definiteness can be imposed by adding diagonal regularization terms of the form \lambda_j \mapsto \lambda_j + \alpha_j s_j, \alpha_j > 0.

LambdaRank has a clear physical analogy. Think of the documents returned for a given query as point masses.
\lambda_j then corresponds to a force on the point mass D_j. If the conditions of Eq. (5) are met, then the forces in the model are conservative, that is, they may be viewed as arising from a potential energy function, which in our case is the implicit cost function C. For example, if the \lambda's are linear in the outputs s, then this corresponds to a spring model, with springs that are either compressed or extended. The requirement that the Jacobian is positive semidefinite amounts to the requirement that the system of springs have a unique global minimum of the potential energy, which can be found from any initial conditions by gradient descent (this is not true in general, for arbitrary systems of springs). The physical analogy provides useful guidance in choosing \lambda functions. For example, for a given query, the forces (\lambda's) should sum to zero, since otherwise the overall system (mean score) will accelerate either up or down. Similarly if a contribution to a document A's \lambda is computed based on its position with respect to document B, then B's \lambda should be incremented by an equal and opposite amount, to prevent the pair itself from accelerating (Newton's third law, [9]).

Finally, we emphasize that LambdaRank is a very simple method. It requires only that one provide rules for the derivatives of the implicit cost for any given sorted order of the documents, and as we will show, such rules are easy to come up with.

5 A Speedup for RankNet Learning

RankNet [2] uses a neural net as its function class. Feature vectors are computed for each query/document pair. RankNet is trained on those pairs of feature vectors, for a given query, for which the corresponding documents have different labels. At runtime, single feature vectors are fpropped through the net, and the documents are ordered by the resulting scores.
The RankNet cost consists of a sigmoid (to map the outputs to [0, 1]) followed by a pair-based cross entropy cost, and takes the form given in Eq. (8) below. Training times for RankNet thus scale quadratically with the mean number of pairs per query, and linearly with the number of queries.

The ideas proposed in Section 4 suggest a simple method for significantly speeding up RankNet training, making it also approximately linear in the number of labeled documents per query, rather than in the number of pairs per query. This is a very significant benefit for large training sets. In fact the method works for any ranking method that uses gradient descent and for which the cost depends on pairs of items for each query. Most neural net training, RankNet included, uses a stochastic gradient update, which is known to give faster convergence. However here we will use batch learning per query (that is, the weights are updated for each query). We present the idea for a general ranking function f: R^n \mapsto R with optimization cost C: R \mapsto R. It is important to note that adopting batch training alone does not give a speedup: to compute the cost and its gradients we would still need to fprop each pair. Consider a single query for which n documents have been returned. Let the output scores of the ranker be s_j, j = 1, ..., n, the model parameters be w_k \in R, and let the set of pairs of document indices used for training be P. The total cost is C_T \equiv \sum_{{i,j} \in P} C(s_i, s_j) and its derivative with respect to w_k is

    \partial C_T / \partial w_k = \sum_{{i,j} \in P} [ (\partial C(s_i, s_j) / \partial s_i) (\partial s_i / \partial w_k) + (\partial C(s_i, s_j) / \partial s_j) (\partial s_j / \partial w_k) ]        (6)

It is convenient to refactor the sum: let P_i be the set of indices j for which {i, j} is a valid pair, and let D be the set of document indices. Then we can write the first term as

    \partial C_T / \partial w_k = \sum_{i \in D} (\partial s_i / \partial w_k) \sum_{j \in P_i} \partial C(s_i, s_j) / \partial s_i        (7)

and similarly for the second.
The algorithm is as follows: instead of backpropping each pair, first n fprops are performed to compute the s_i (and for the general LambdaRank algorithm, this would also be where the sort on the scores is performed); then for each i = 1, ..., n the \lambda_i \equiv \sum_{j \in P_i} \partial C(s_i, s_j) / \partial s_i are computed; then to compute the gradients \partial s_i / \partial w_k, n fprops are performed, and finally the n backprops are done. The key point is that although the overall computation still has an n^2 dependence arising from the second sum in (7), computing the terms \partial C(s_i, s_j) / \partial s_i = -1 / (1 + e^{s_i - s_j}) is far cheaper than the computation required to perform the 2n fprops and n backprops. Thus we have effectively replaced an O(n^2) algorithm with an O(n) one3.

6 Experiments

We performed experiments to (1) demonstrate the training speedup for RankNet, and (2) assess whether LambdaRank improves the NDCG test performance. For the latter, we used RankNet as a baseline. Even though the RankNet optimization cost is not NDCG, RankNet is still very effective at optimizing NDCG, using the method proposed in [2]: after each epoch, compute the NDCG on a validation set, and after training, choose the net for which the validation NDCG is highest. Rather than attempt to derive from first principles the optimal \lambda function for the NDCG target cost (and for a given dataset), which is beyond the scope of this paper, we wrote several plausible \lambda functions and tested them on the Web search data. We then picked the single \lambda function that gave the best results on that particular validation set, and then used that \lambda function for all of our experiments; this is described below.

6.1 RankNet Speedup Results

Here the training scheme is exactly LambdaRank training, but with the RankNet gradients, and with no sort: we call the corresponding \lambda function G.
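To make the per-query \lambda accumulation of Eq. (7) concrete, here is a sketch for the RankNet pair cost (the function and variable names are ours, not from the paper; in a real implementation the scores would come from the net's fprops, and each accumulated \lambda_i would then drive a single backprop):

```python
import math

def ranknet_lambdas(scores, labels):
    """Accumulate the per-document lambda_i of Section 5 for the pairwise
    RankNet cost C(s_i, s_j) = s_j - s_i + log(1 + exp(s_i - s_j)), where a
    pair (i, j) means document i should rank above document j.

    dC/ds_i = -1 / (1 + exp(s_i - s_j)) and dC/ds_j = -dC/ds_i, so each pair
    costs only two scalar updates; the expensive network passes then reduce
    to one fprop and one backprop per document, rather than per pair.
    """
    n = len(scores)
    lam = [0.0] * n
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:  # {i, j} is a valid training pair
                g = -1.0 / (1.0 + math.exp(scores[i] - scores[j]))
                lam[i] += g            # dC(s_i, s_j)/ds_i
                lam[j] -= g            # dC(s_i, s_j)/ds_j
    return lam
```

Because each pair contributes equal and opposite amounts, the \lambda's for a query sum to zero, matching the zero-net-force condition of Section 4.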
We will refer to the original RankNet training as V1 and the LambdaRank speedup as V2. We compared V1 and V2 in two sets of experiments. In the first we used 1000 queries taken from the Web data described below, and in the second we varied the number of documents for a given query, using the artificial data described below. Experiments were run on a 2.2GHz 32 bit Opteron machine. We compared V1 to V2 for 1 layer and 2 layer (with 10 hidden nodes) nets. V1 was also run using batch update per query, to clearly show the gain (the convergence as a function of epoch was found to be similar for batch and non-batch updates; furthermore running time for batch and non-batch is almost identical). For the single layer net, on the Web data, LambdaRank with G was measured to be 5.1 times faster, and for two layers, 8.0 times faster: the left panel of Figure 1 shows the results (where max validation NDCG is plotted). Each point on the graph is one epoch. Results for the two layer nets were similar. The right panel shows a log log plot of training time versus number of documents, as the number of documents per query varies from 4,000 to 512,000 in the artificial set. Fitting the curves using linear regression gives the slopes of V1 and V2 to be 1.943 and 1.185 respectively.

3Two further speedups are possible, and are not explored here: first, only the first n fprops need be performed if the node activations are stored, since those stored activations could then be used during the n backprops; second, the e^{s_i} could be precomputed before the pairwise sum is done.
Thus V1 is close to quadratic (but not exactly, due to the fact that only a subset of pairs is used, namely, those with documents whose labels differ), and V2 is close to linear, as expected.

Figure 1: Speeding up RankNet training. Left: linear nets. Right: two layer nets.

6.2 \lambda-function Chosen for Ranking Experiments

To implement LambdaRank training, we must first choose the \lambda function (Eq.
(3)), and then substitute in Eq. (5). Using the physical analogy, specifying a \lambda function amounts to specifying rules for the 'force' on a document given its neighbors in the ranked list. We tried two kinds of \lambda function: those where a document's \lambda gets a contribution from all pairs with different labels (for a given query), and those where its \lambda depends only on its nearest neighbors in the sorted list. All \lambda functions were designed with the NDCG cost function in mind, and most had a margin built in (that is, a force is exerted between two documents even if they are in the correct order, until their difference in scores exceeds that margin). We investigated step potentials, where the step sizes are proportional to the NDCG gain found by swapping the pair; spring models; models that estimated the NDCG gradient using finite differences; and models where the cost was estimated as the gradient of a smooth, pairwise cost, also scaled by the NDCG gain from swapping the two documents. We tried ten different \lambda functions in all. Due to space limitations we will not give results on all these functions here: instead we will use the one that worked best on the Web validation data for all experiments. This function used the RankNet cost, scaled by the NDCG gain found by swapping the two documents in question. The RankNet cost combines a sigmoid output and the cross entropy cost, and is similar to the negative binomial log-likelihood cost [5], except that it is based on pairs of items: if document i is to be ranked higher than document j, then the RankNet cost is [2]:

    C^R_{ij} = s_j - s_i + \log(1 + e^{s_i - s_j})        (8)

and if the corresponding document ranks are r_i and r_j, then taking derivatives of Eq.
(8) and combining with Eq. (1) gives

    \lambda = N (1 / (1 + e^{s_i - s_j})) (2^{l_i} - 2^{l_j}) (1 / \log(1 + r_i) - 1 / \log(1 + r_j))        (9)

where N is the reciprocal max DCG for the query. Thus for each pair, after the sort, we increment each document's force by \pm\lambda, where the more relevant document gets the positive increment.

6.3 Ranking for Search Experiments

We performed experiments on three datasets: artificial, web search, and intranet search data. The data are labeled from 0 to M, in order of increasing relevance: the Web search and artificial data have M = 4, and the intranet search data, M = 3. The corresponding NDCG gains (the numerators in Eq. (1)) were therefore 0, 3, 7, 15 and 31. In all graphs, 95% confidence intervals are shown. In all experiments, we varied the learning rate from as low as 1e-7 to as high as 1e-2, and for each experiment we picked the rate that gave the best validation results. For all training, the learning rate was reduced by a factor of 0.8 if the training cost (Eq. (8) for RankNet, and the NDCG at truncation level 10 for LambdaRank) increased over the value for the previous epoch. Training was done for 300 epochs for the artificial and Web search data, and for 200 epochs for the intranet data, and training was restarted (with random weights) if the cost did not reduce for 50 iterations.

6.3.1 Artificial Data

We used artificial data to remove any variance stemming from the quality of the features or of the labeling. We followed the prescription given in [2] for generating random cubic polynomial data. However, here we use five levels of relevance instead of six, a label distribution corresponding to real datasets, and more data, all to more realistically approximate a Web search application.
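Before describing the datasets, a sketch of the chosen \lambda function of Eq. (9) may help (the helper names are ours, and this is our reconstruction of the equation, with r denoting 1-based rank after sorting by score): each pair with differing labels contributes \pm\lambda, the more relevant document receiving the positive increment.

```python
import math

def lambda_ndcg_pair(s_i, s_j, l_i, l_j, r_i, r_j, N):
    """Eq. (9): the RankNet pair gradient term scaled by the NDCG change of
    swapping documents i and j. N is the reciprocal max DCG for the query;
    s = scores, l = labels, r = current rank positions (1 = top of list)."""
    return (N * (1.0 / (1.0 + math.exp(s_i - s_j)))
              * (2 ** l_i - 2 ** l_j)
              * (1.0 / math.log(1 + r_i) - 1.0 / math.log(1 + r_j)))

def accumulate_forces(scores, labels, N):
    """Sort by score, then for every pair with differing labels add +lambda to
    the more relevant document and -lambda to the other (Newton's third law),
    so the forces for a query sum to zero."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    rank = {doc: pos + 1 for pos, doc in enumerate(order)}
    force = [0.0] * len(scores)
    for a in range(len(scores)):
        for b in range(a + 1, len(scores)):
            if labels[a] == labels[b]:
                continue
            i, j = (a, b) if labels[a] > labels[b] else (b, a)  # i more relevant
            lam = lambda_ndcg_pair(scores[i], scores[j], labels[i], labels[j],
                                   rank[i], rank[j], N)
            force[i] += lam
            force[j] -= lam
    return force
```

The equal-and-opposite increments guarantee the zero-net-force property discussed in Section 4, whatever the scores and labels.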
We used 50 dimensional data, 50 documents per query, and 10K/5K/10K queries for train/valid/test respectively. We report the NDCG results in Figure 2 for ten NDCG truncation levels. In this clean dataset, LambdaRank clearly outperforms RankNet. Note that the gap increases at higher relevance levels, as one might expect due to the more direct optimization of NDCG.

Figure 2: Left: Cubic polynomial data. Right: Intranet search data.

6.3.2 Intranet Search Data

This data has dimension 87, and only 400 queries in all were available. The average number of documents per query is 59.4.
We used 5 fold cross validation, with 2+2+1 splits between\ntrain/validation/test sets. We found that it was important for such a small dataset to use a rela-\ntively large validation set to reduce variance. The results for the linear nets are shown in Figure 2:\nalthough LambdaRank gave uniformly better mean NDCGs, the overlapping error bars indicate that\non this set, LambdaRank does not give statistically signi\ufb01cantly better results than RankNet at 95%\ncon\ufb01dence. For the two layer nets the NDCG means are even closer. This is an example of a case\nwhere larger datasets are needed to see the difference between two algorithms (although it\u2019s possible\nthat more powerful statistical tests would \ufb01nd a difference here also).\n\n6.4 Web Search Data\n\nThis data is from a commercial search engine and has 367 dimensions, with on average 26.1 doc-\numents per query. The data was created by shuf\ufb02ing a larger dataset and then dividing into train,\nvalidation and test sets of size 10K/5K/10K queries, respectively. In Figure 3, we report the NDCG\nscores on the dataset at truncation levels from 1 to 10. We show separate plots to clearly show the\ndifferences: in fact, the linear LambdaRank results lie on top of the two layer RankNet results, for\nthe larger truncation values.\n\n7 Conclusions\n\nWe have demonstrated a simple and effective method for learning non-smooth target costs. 
LambdaRank is a general approach: in particular, it can be used to implement RankNet training, and it

Figure 3: NDCG for RankNet and LambdaRank. Left: linear nets.
Right: two layer nets.

furnishes a significant training speedup there. We studied LambdaRank in the context of the NDCG target cost for neural network models, but the same ideas apply to any non-smooth target cost, and to any differentiable function class. It would be interesting to investigate using the same method starting with other classifiers, such as boosted trees.

Acknowledgments

We thank M. Taylor, J. Platt, A. Laucius, P. Simard and D. Meyerzon for useful discussions and for providing data.

References

[1] C. Buckley and E. Voorhees. Evaluating evaluation measure stability. In SIGIR, pages 33–40, 2000.
[2] C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML 22, Bonn, Germany, 2005.
[3] C. Cortes and M. Mohri. Confidence intervals for the area under the ROC curve. In NIPS 18. MIT Press, 2005.
[4] Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
[5] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–374, 2000.
[6] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR 23. ACM, 2000.
[7] T. Joachims. A support vector method for multivariate performance measures. In ICML 22, 2005.
[8] I. Matveeva, C. Burges, T. Burkard, A. Laucius, and L. Wong. High accuracy retrieval with multiple nested rankers. In SIGIR, 2006.
[9] I. Newton. Philosophiae Naturalis Principia Mathematica. The Royal Society, 1687.
[10] S. Robertson and H. Zaragoza. On rank-based effectiveness measures and optimisation. Technical Report MSR-TR-2006-61, Microsoft Research, 2006.
[11] M. Spivak. Calculus on Manifolds. Addison-Wesley, 1965.
[12] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In ICML 22, Bonn, Germany, 2005.
[13] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML 21, 2004.
[14] E.M. Voorhees. Overview of the TREC 2001/2002 Question Answering Track. In TREC, 2001–2002.
[15] L. Yan, R. Dodier, M.C. Mozer, and R. Wolniewicz. Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. In ICML 20, 2003.