{"title": "A Representation Theory for Ranking Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 361, "page_last": 369, "abstract": "This paper presents a representation theory for permutation-valued functions, which in their general form can also be called listwise ranking functions. Pointwise ranking functions assign a score to each object independently, without taking into account the other objects under consideration; whereas listwise loss functions evaluate the set of scores assigned to all objects as a whole. In many supervised learning to rank tasks, it might be of interest to use listwise ranking functions instead; in particular, the Bayes Optimal ranking functions might themselves be listwise, especially if the loss function is listwise. A key caveat to using listwise ranking functions has been the lack of an appropriate representation theory for such functions. We show that a natural symmetricity assumption that we call exchangeability allows us to explicitly characterize the set of such exchangeable listwise ranking functions. Our analysis draws from the theories of tensor analysis, functional analysis and De Finetti theorems. We also present experiments using a novel reranking method motivated by our representation theory.", "full_text": "A Representation Theory for Ranking Functions\n\nHarsh Pareek, Pradeep Ravikumar\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\n\n{harshp,pradeepr}@cs.utexas.edu\n\nAbstract\n\nThis paper presents a representation theory for permutation-valued functions,\nwhich in their general form can also be called listwise ranking functions. Point-\nwise ranking functions assign a score to each object independently, without taking\ninto account the other objects under consideration; whereas listwise loss functions\nevaluate the set of scores assigned to all objects as a whole. 
In many supervised learning to rank tasks, it might be of interest to use listwise ranking functions instead; in particular, the Bayes Optimal ranking functions might themselves be listwise, especially if the loss function is listwise. A key caveat to using listwise ranking functions has been the lack of an appropriate representation theory for such functions. We show that a natural symmetricity assumption that we call exchangeability allows us to explicitly characterize the set of such exchangeable listwise ranking functions. Our analysis draws from the theories of tensor analysis, functional analysis and De Finetti theorems. We also present experiments using a novel reranking method motivated by our representation theory.

1 Introduction

A permutation-valued function, also called a ranking function, outputs a ranking over a set of objects given features corresponding to the objects, and learning such ranking functions given data is becoming an increasingly key machine learning task. For instance, tracking a set of objects given a particular order of uncertain sensory inputs involves predicting the permutation of objects corresponding to the inputs at each time step. Collaborative filtering and recommender systems can be modeled as ranking movies (or other consumer objects). Extractive document summarization involves ranking sentences in order of their importance, while also taking diversity into account. Learning rankings over documents, in particular, has received considerable attention in the Information Retrieval community, under the subfield of "learning to rank". The problems above involve diverse kinds of supervision and diverse evaluation metrics, but with the common feature that the object of interest is a ranking function, that when given an input set of objects, outputs a permutation over the set of objects.
In this paper, we will consider the standard generalization of ranking functions which output a real-valued score vector, which can be sorted to yield the desired permutation. The tasks above then entail learning a ranking function given data, and given some evaluation metric which captures the compatibility between two permutations. These evaluation metrics are domain-specific, and even in specific domains such as information retrieval, could be varied based on actual user preferences. Popular IR evaluation metrics for instance include Mean Average Precision (MAP) [1], Expected Reciprocal Rank (ERR) [7] and Normalized Discounted Cumulative Gain (NDCG) [17]. A common characteristic of these evaluation loss functionals is that they are typically listwise: the loss evaluates the entire set of scores assigned to all the objects in a manner that is not separable in the individual scores. Indeed, some tasks by their very nature require listwise evaluation metrics. A key example is that of ranking with diversity [5], where the user prefers results that are not only relevant individually, but also diverse mutually; searching for web-pages with the query "Jaguar" should not just return individually relevant results, but also results that cover the car, the animal and the sports team, among others. Chapelle et al. [8] also mention ranking for diversity as an important future direction in learning to rank. Other fundamentally listwise ranking problems include pseudo-relevance feedback, topic distillation, subtopic retrieval and ranking over graphs (e.g. social networks) [22].

While these evaluation/loss functionals (and typically their corresponding surrogate loss functionals as well) are listwise, most parameterizations of the ranking functions used within these (surrogate) loss functionals are typically pointwise, i.e. they rank each object (e.g. document) independently of the other objects.
Why should we require listwise ranking functions for listwise ranking tasks? Pointwise ranking functions have the advantage of computational efficiency: since these evaluate each object independently, they can be parameterized very compactly. Moreover, for certain ranking tasks, such as vanilla rank prediction with 0/1 loss or multilabel ranking with certain losses [11], it can be shown that the Bayes-consistent ranking function is pointwise, so that one would lose statistical efficiency by not restricting to the sub-class of pointwise ranking functions. However, as noted above, many modern ranking tasks have an inherently listwise flavor, and correspondingly their Bayes-consistent ranking functions are listwise as well. For instance, [24] show that the Bayes-consistent ranking function of the popular NDCG evaluation metric is inherently listwise.

There is however a caveat to using listwise ranking functions: a lack of representation theory, and corresponding guidance to parameterizing such listwise ranking functions. Indeed, the most commonly used ranking functions are linear ranking functions and decision trees, both of which are pointwise. With decision trees, gradient boosting is often used as a technique to increase the complexity of the function class. The Yahoo! Learning to Rank challenge [6] was dominated by such methods, which comprise the state-of-the-art in learning to rank for information retrieval today. It should be noted that gradient boosted decision trees, even if trained with listwise loss functions (e.g.
via LambdaMART [3]), are still a sum of pointwise ranking functions and therefore pointwise ranking functions themselves, and hence subject to the theoretical limitations outlined in this paper.

In a key contribution of this paper, we impose a very natural assumption on general listwise ranking functions, which we term exchangeability, which formalizes the notion that the ranking function depends only on the object features, and not the order in which the documents are presented. Specifically, as detailed further in Section 3, we define exchangeable ranking functions as those listwise functions where if their set of input objects is permuted, their output permutation/score vector is permuted in the same way. This simple assumption allows us to provide an explicit characterization of the set of listwise ranking functions in the following form:

(f(x))i = h(xi, {x\i}) = Σ_t Π_{j≠i} gt(xi, xj)    (1)

This representation theorem is the principal contribution of this work. We hope that this result will provide a general recipe for designing learning to rank algorithms for diverse domains. For each domain, practitioners would need to utilize domain knowledge to define a suitable class of pairwise functions g parameterized by w, and use this ranking function in conjunction with a suitable listwise loss. Individual terms in (1) can be fit via standard optimization methods such as gradient descent, while multiple terms can be fit via gradient boosting.

In recent work, two papers have proposed specific listwise ranking functions. Qin et al. [22] suggest the use of conditional random fields (CRFs) to predict the relevance scores of the individual documents via the most probable configuration of the CRF.
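To make the form (1) concrete, the following sketch scores a small document set with a two-term instance of (1) and checks numerically that permuting the input documents permutes the output scores identically. The Gaussian pairwise function g and all parameter values here are hypothetical choices for illustration, not part of the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(xi, xj, w):
    # A hypothetical pairwise function g_t(x_i, x_j); any g fits the form (1).
    return np.exp(-w * np.sum((xi - xj) ** 2))

def rank_scores(X, weights):
    # (f(x))_i = sum_t prod_{j != i} g_t(x_i, x_j), one term per weight.
    m = len(X)
    return np.array([
        sum(np.prod([g(X[i], X[j], w) for j in range(m) if j != i])
            for w in weights)
        for i in range(m)
    ])

X = rng.normal(size=(4, 3))          # 4 documents, 3 features each
perm = np.array([2, 0, 3, 1])
s = rank_scores(X, [0.5, 2.0])

# Exchangeability: permuting the documents permutes their scores the same way.
assert np.allclose(rank_scores(X[perm], [0.5, 2.0]), s[perm])
```

The check holds for any choice of the gt, since the product over j ≠ i depends on the other documents only through their (unordered) set.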
They distinguish between "local ranking," which we called ranking with pointwise ranking functions above, and "global ranking," which corresponds to listwise ranking functions; and argue that using CRFs would allow for global ranking. Weston and Blitzer [26] propose a listwise ranking function ("Latent Structured Ranking") assuming a low rank structure for the set of items to be ranked. Both of these ranking functions are exchangeable, as we detail in Appendix A. The improved performance of these specific classes of ranking functions also provides empirical support for the need for a representation theory of general listwise ranking functions.

We first consider the case where features are discrete and derive our representation theorem using the theory of symmetric tensor decomposition. For the more general continuous case, we first present the case with three objects using functional analytic spectral theory. We then present the extension to the general continuous case by drawing upon De Finetti's theorem. Our analysis highlights the correspondences between these theories, and brings out an important open problem in the functional analysis literature.

2 Problem Setup

We consider the general ranking setting, where the m objects to be ranked (possibly contingent on a query), are represented by the feature vectors x = (x1, x2, . . . , xm) ∈ X^m. Typically, X = R^k for some k. The key object of interest in this paper is a ranking function:

Definition 2.1 (Ranking function) Given a set of object feature vectors x (possibly contingent on a query q), a ranking function f : X^m → R^m is a function that takes as input the m object feature vectors, and has as output a vector of scores for the set of objects, so that f(x) = (f1(x), . . .
, fm(x)), for some functions fj : X^m → R.

It is instructive at this juncture to distinguish between pointwise (local) and listwise (global) ranking functions. A pointwise ranking function f would score each object xi independently, ignoring the other objects, so that each component function fj(x) above depends only on xj, and can be written as a function fj(xj) with some overloading of notation. In contrast, the components fj(x) of the output vector of a listwise ranking function would depend on the feature-vectors of all the documents.

3 Representation theory

We investigate the class of ranking functions which satisfy a very natural property: exchanging the feature-vectors of any two documents should cause their positions in the output ranking order to be exchanged. Definition 3.1 formalizes this intuition.

Definition 3.1 (Exchangeable Ranking Function) A listwise ranking function f : X^m → R^m is said to be exchangeable if f(π(x)) = π(f(x)) for every permutation π ∈ Sm (where Sm is the set of all permutations of m objects).

Letting (f1, f2, . . . , fm) denote the components of the ranking function f, we arrive at the following key characterization of exchangeable ranking functions.

Theorem 3.2 Every exchangeable ranking function f : X^m → R^m can be written as f(x) = (f1(x), f2(x), . . . , fm(x)) with

fi(x) = h(xi, {x\i})    (2)

where {x\i} = {xj | 1 ≤ j ≤ m, j ≠ i}, for some h : X^m → R symmetric in {x\i} (i.e. h(xi, y) = h(xi, π(y)), ∀y ∈ X^{m−1}, π ∈ Sm−1).

Proof The components of a ranking function f : X^m → R^m, viz. fi(x), represent the score assigned to each document. First, exchangeability implies that exchanging the feature values of some two documents does not affect the scores of the remaining documents, i.e. fi(x) does not change if i is not involved in the exchange, i.e.
fi(x) is symmetric in {x\i}. Second, exchanging the feature values of documents 1 and i exchanges their scores, i.e.,

fi(x1, . . . , xi, . . . , xm) = f1(xi, . . . , x1, . . . , xm)    (3)

Thus, the scoring function for the ith document can be expressed in terms of that of the first document. Call that scoring function h. Then, combining the two properties above, we have

fi(x) = h(xi, {x\i})    (4)

where h is symmetric in {x\i}.

Theorem 3.2 entails the intuitive result that the component functions fi of exchangeable ranking functions f can all be expressed in terms of a single partially symmetric function h whose first argument is the document corresponding to that component and which is symmetric in the other documents. Pointwise ranking functions then correspond to the special case where h is independent of the other document-feature-vectors (so that h(xi, {x\i}) = h(xi) with some overloading of notation) and are thus trivially exchangeable.

As the main result of this paper, we will characterize the class of such partially symmetric functions h, and thus the set of exchangeable listwise ranking functions, for various classes X as

fi(x) = Σ_{t=1}^{∞} Π_{j≠i} gt(xi, xj)    (5)

for some set of functions {gt}_{t=1}^{∞}, gt : X × X → R.

3.1 The Discrete Case: Tensor Decomposition

We first consider a decomposition theorem for symmetric tensors, and then through a correspondence between symmetric tensors and symmetric functions with finite domains, derive the corresponding decomposition for symmetric functions.
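Theorem 3.2 can also be exercised numerically. In the sketch below, the particular h is a hypothetical example, chosen only because it touches its second argument through order-invariant reductions and is hence symmetric in {x\i}; building every component as fi(x) = h(xi, {x\i}) then yields a function satisfying Definition 3.1.

```python
import numpy as np

rng = np.random.default_rng(1)

def h(xi, others):
    # Hypothetical partially symmetric h: symmetric in `others` because it
    # only uses order-invariant reductions (mean and max) of those rows.
    return float(xi @ others.mean(axis=0) + others.max())

def f(X):
    # Theorem 3.2: the i-th score is h applied to (x_i, {x_\i}).
    return np.array([h(X[i], np.delete(X, i, axis=0)) for i in range(len(X))])

X = rng.normal(size=(5, 2))
perm = rng.permutation(5)
# Definition 3.1: f(pi(x)) = pi(f(x)).
assert np.allclose(f(X[perm]), f(X)[perm])
```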
We then simply extend the analysis to obtain the corresponding decomposition theorem for partially symmetric functions.

The term tensor may have connotations (from its use in Physics) with regards to how a quantity behaves under linear transformations, but here we use it only to mean "multi-way array".

Definition 3.3 (Tensor) A real-valued order-k tensor is a collection of real-valued elements A_{i1,i2,...,ik} ∈ R indexed by tuples (i1, i2, . . . , ik) ∈ X^k.

Definition 3.4 (Symmetric tensor) An order-k tensor A = [A_{i1,i2,...,ik}] is said to be symmetric iff for any permutation π ∈ Sk,

A_{i1,i2,...,ik} = A_{iπ(1),iπ(2),...,iπ(k)}.    (6)

Comon et al. [9] show that such a symmetric tensor (sometimes called supersymmetric since it is symmetric w.r.t. all dimensions) can be decomposed into a sum of rank-1 symmetric tensors, where a rank-1 symmetric tensor is a k-way outer product of some vector v (we will use the standard notation ⊗ to denote an outer product u ⊗ v ⊗ ··· ⊗ z = [u_{j1} v_{j2} . . . z_{jk}]_{j1,...,jk}).

Proposition 3.5 (Decomposition theorem for symmetric tensors [9]) Any order-k symmetric tensor A can be decomposed as a finite sum of k-fold outer product tensors as follows:

A = Σ_{i=1}^{r} ⊗^k vi    (7)

The special matrix case (k = 2) of this theorem should be familiar to the reader as the spectral theorem. In that case, the vi are orthogonal, the smallest such representation is unique and can be recovered by tractable algorithms. In the general symmetric tensor case, the vi are not necessarily orthogonal and the decomposition need not be unique; it is however finite [9].
While the spectral theory for symmetric tensors is relatively straightforward, bearing similarity to that for matrices, the theory for general non-symmetric tensors is nontrivial: we refer the interested reader to [21, 20, 10]. However, since we are interested not in general non-symmetric tensors, but partially symmetric tensors, the above theorem can be extended in a straightforward way in our case, as we shall see in Theorem 3.7.

Our next step involves generalizing the earlier proposition to multivariate symmetric functions by representing them as tensors, which then yields a corresponding spectral theorem of product decompositions for such functions. In particular, note that when the feature vector of each document takes values only from a finite set X, of size |X|, a symmetric function h(x1, x2, . . . , xm) can be represented as an order-m symmetric tensor H where H_{v1v2...vm} = h(v1, v2, . . . , vm) for vi ∈ X. We can thus leverage Proposition 3.5 to obtain the result of the following proposition:

Proposition 3.6 (Symmetric product decomposition for multivariate functions (finite domain)) Any symmetric function f : X^m → R for a finite set X can be decomposed as

f(x) = Σ_{t=1}^{T} Π_j gt(xj),    (8)

for some set of functions {gt}_{t=1}^{T}, gt : X → R, T < ∞.

In the case of ranking three documents, each fi assigns a score to document i taking the other documents' features as arguments. fi then corresponds to a matrix and the functions gt correspond to the set of eigenvectors of this matrix. In the general case of ranking m documents, fi is an order m − 1 tensor and the gt are the eigenvectors for a symmetric decomposition of the tensor.

Our class of exchangeable ranking functions corresponds to partially symmetric functions.
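For m = 2 the correspondence above is just the spectral theorem, and can be seen directly. In the sketch below (the finite domain and the symmetric function are arbitrary illustrative choices), a symmetric bivariate function is stored as a matrix H and reconstructed as a sum of products of its eigenvector "functions"; the eigenvalue weights λt can be absorbed into the gt up to sign.

```python
import numpy as np

# Finite domain X = {0, 1, 2, 3}; a symmetric bivariate function as a matrix.
v = np.arange(4, dtype=float)
H = np.cos(v[:, None] - v[None, :]) + np.outer(v, v)   # H == H.T

# Spectral theorem (the k = 2 case): H[x, y] = sum_t lam_t * u_t[x] * u_t[y],
# where the orthonormal eigenvectors u_t play the role of the g_t.
lam, U = np.linalg.eigh(H)
H_rebuilt = sum(lam[t] * np.outer(U[:, t], U[:, t]) for t in range(len(lam)))
assert np.allclose(H, H_rebuilt)
```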
In the following, we extend the theory above to the partially symmetric case (proof in Appendix B).

Theorem 3.7 (Product decomposition for partially symmetric functions) A partially symmetric function h : X^m → R symmetric in x2, . . . , xm on a finite set X can be decomposed as

h(x1, {x\1}) = Σ_{t=1}^{T} Π_{j≠1} gt(x1, xj)    (9)

for some set of functions {gt}_{t=1}^{T}, gt : X × X → R, T < ∞.

Remarks:
I. To the best of our knowledge, the study of partially symmetric tensors and their decompositions as above has not been considered in the literature. Notions such as rank and best successive approximations would be interesting areas for future research.

II. The tensor view of learning to rank gives rise to a host of other interesting research directions. Consider the learning to rank problem: each training example corresponds to one entry in the resulting ranking tensor. A candidate approach to learning to rank might thus be tensor-completion, perhaps using a convex nuclear tensor norm regularization [14].

3.2 The Continuous Case

In this section, we generalize the results of the previous section to the more realistic setting where the feature space X is compact. The extension to the partially symmetric case from the symmetric one is similar to that in the discrete case and is given as Theorem C.1 in Appendix C, so we discuss only decomposition theorems for symmetric functions below.

3.2.1 Argument via Functional Analytic Spectral Theorem

We first recall some key definitions from functional analysis [25, pp. 203]. A linear operator T is bounded if its norm ‖T‖ = sup_{‖x‖=1} ‖Tx‖ is finite. A bounded linear operator T is self-adjoint if T = T*, where T* is the adjoint operator. A linear operator A from a Banach space X to a Banach space Y is compact if it takes bounded sets in X into relatively compact sets (i.e.
whose closure is compact) in Y.

The Hilbert-Schmidt theorem [25] provides a spectral decomposition for such compact self-adjoint operators. Let A be a compact self-adjoint operator on a Hilbert space H. Then, by the Hilbert-Schmidt theorem, there is a complete orthonormal basis {φn} for H so that Aφn = λnφn and λn → 0 as n → ∞. A can then be written as:

A = Σ_{n=1}^{∞} λn φn ⟨φn, ·⟩.    (10)

We refer the reader to [25] for further details. The compactness condition can be relaxed to boundedness, but in that case a discrete spectrum {λn} does not exist, and is replaced by a measure µ, and the summation in the Hilbert-Schmidt expansion (10) is replaced by an integral. We consider only compact self-adjoint operators in this paper.

In the following key theorem, we provide a decomposition theorem for bivariate symmetric functions.

Theorem 3.8 (Product decomposition for symmetric bivariate functions) A symmetric function f(x, y) ∈ L2(X × X) corresponds to a compact self-adjoint operator, and can be decomposed as

f(x, y) = Σ_{t=1}^{∞} λt gt(x) gt(y),

for some functions gt ∈ L2(X), λt → 0 as t → ∞.

The above result gives a corresponding decomposition theorem (via Theorem C.1) for partially symmetric functions in three variables. Extending the result to beyond three variables would require extending this decomposition result for linear operators to the general multilinear operator case. Unfortunately, to the best of our knowledge, a decomposition theorem for multilinear operators is an open problem in the functional analysis literature. Indeed, even the corresponding discrete tensor case has only been studied recently.
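A discretized sketch of Theorem 3.8 (the kernel, grid, and truncation level are illustrative assumptions): sampling the symmetric kernel f(x, y) = exp(−(x − y)²) on a grid gives a symmetric matrix whose eigendecomposition is the discrete analogue of the Hilbert-Schmidt expansion, and a short truncation already reconstructs the kernel closely, reflecting λt → 0.

```python
import numpy as np

# Discretize the symmetric kernel f(x, y) = exp(-(x - y)^2) on [0, 1].
n = 200
grid = np.linspace(0.0, 1.0, n)
K = np.exp(-(grid[:, None] - grid[None, :]) ** 2)

# Eigendecomposition: discrete analogue of f(x,y) = sum_t lam_t g_t(x) g_t(y).
lam, G = np.linalg.eigh(K)
order = np.argsort(-np.abs(lam))      # sort terms by |lam_t|, largest first
lam, G = lam[order], G[:, order]

# Truncating to 10 terms reconstructs K closely: the spectrum decays fast.
K10 = (G[:, :10] * lam[:10]) @ G[:, :10].T
err = np.max(np.abs(K - K10))
assert err < 1e-4
```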
In the next section, we will instead use a result from probability theory, and obtain a proof for our decomposition theorem under additional conditions.

3.2.2 Argument via De Finetti's Theorem

In the previous section, we leveraged the interpretation of multivariate functions as multilinear operators. However, it is also possible to interpret multivariate functions as measures on a product space. Under appropriate assumptions, we will show that a De Finetti-like theorem gives us the required decomposition theorem for symmetric measures.

We first review De Finetti's theorem and related terms.

Definition 3.9 (Infinite Exchangeability) An infinite sequence X1, X2, . . . of random variables is said to be exchangeable if for any n ∈ N and any permutation π ∈ Sn,

p(X1, X2, . . . , Xn) = p(Xπ(1), Xπ(2), . . . , Xπ(n))    (11)

We note that exchangeability as defined in the probability theory literature refers to symmetricity of the kind above, and is a distinct if related notion compared to that used in the rest of this paper. Then, we have a class of De Finetti-like theorems:

Theorem 3.10 (De Finetti-like theorems) A sequence of random variables X1, X2, . . . is infinitely exchangeable iff, for all n, there exists a probability distribution function µ such that

p(X1, . . . , Xn) = ∫ Π_{i=1}^{n} p(Xi; θ) µ(dθ)    (12)

where p denotes the pdf of the corresponding distribution.

This decomposes the joint distribution over n variables into an integral over product distributions. De Finetti originally proved this result for 0-1 random variables, in which case the p(Xi; θ) are Bernoulli with parameter θ, a real-valued random variable given by θ = lim_{n→∞} Σi Xi/n. For accessible proofs of this result and a similar one for the case when the Xi are instead discrete, we refer the reader to [15, 2].
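A small numeric illustration of Theorem 3.10 in the 0-1 case (the uniform prior µ is an arbitrary choice for the sketch): mixing i.i.d. Bernoulli(θ) products over µ yields a joint distribution that is invariant under permutations, and for the uniform prior it matches the known closed form 1/((n + 1)·C(n, s)) for s ones among n flips.

```python
import numpy as np
from math import comb

# Midpoint-rule grid for the mixing integral over theta in [0, 1].
thetas = (np.arange(200_000) + 0.5) / 200_000

def joint(x):
    # p(x_1..x_n) = integral of prod_i theta^{x_i} (1-theta)^{1-x_i} mu(dtheta)
    # with mu = Uniform[0, 1]; it depends on x only through s = sum(x).
    s, n = sum(x), len(x)
    return np.mean(thetas ** s * (1 - thetas) ** (n - s))

x = [1, 0, 1, 1, 0]
assert abs(joint(x) - joint([0, 1, 1, 0, 1])) < 1e-12   # exchangeable
assert abs(joint(x) - 1 / (6 * comb(5, 3))) < 1e-6      # = 1/60 here
```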
This result was later extended to the case where the variables Xi take values in a compact set X by Hewitt and Savage [16]. (The proof in [16] first shows that the set of symmetric measures is a convex set whose set of extreme points is precisely the set of all product measures, i.e. independent distributions. Then, it establishes a Choquet representation, i.e. an integral representation of this convex set as a convex combination of its extreme points, giving us a De Finetti-like theorem as above.) In this general case, the parameter θ can be interpreted as being distribution-valued, as opposed to real-valued in the binary case described above. Our description of this result is terse for lack of space; see [2, pp. 188] for details. Thus, we derive the following theorem:

Theorem 3.11 (Product decomposition for symmetric functions) Given an infinite sequence of documents with features xi from a compact set X, if a function f : X^m → R+ is symmetric in every leading subset of n documents, and ∫ f = M < ∞, then f/M corresponds to a probability measure and f can be decomposed as

f(x) = ∫ Π_j g(xj; θ) µ(dθ)    (13)

for some set of functions {g(·; θ)}, g : X → R.

This theorem can also be applied to discrete-valued features Xi, and we would obtain a representation similar to that obtained through tensor analysis in Section 3.1. Applied to features Xi belonging to a compact set, we obtain the required representation theorem similar to the functional analytic theory of Section 3.2.1. However, note that De Finetti's theorem integrates over products of probabilities, so that each term is non-negative, a restriction not present in the functional analytic case.
Moreover, we have an integral in the De Finetti decomposition, while via tensor analysis in the discrete case we have a finite sum whose size is given by the rank of the tensor, and in the functional analytic analysis the spectrum for compact operators is discrete. De Finetti's theorem also requires the existence of infinitely many objects for which every leading finite subsequence is exchangeable. The similarities and differences between the functional analytic viewpoint and De Finetti's theorem have been previously noted in the literature, for instance in Kingman's 1977 Wald Lecture [19], and we discuss them further in Appendix E.

4 Experiments

For our experiments, we consider the information retrieval learning to rank task, where we are given a training set consisting of n queries. Each query q(i) is associated with m documents, represented via feature vectors x(i) = (x(i)1, x(i)2, . . . , x(i)m) ∈ X^m. The documents for q(i) have relevance levels r(i) = (r(i)1, r(i)2, . . . , r(i)m) ∈ R^m. Typically, R = {0, 1, . . . , l − 1}. The training set thus consists of the tuples T = {x(i), r(i)}_{i=1}^{n}, assumed sampled i.i.d. from a distribution D over X^m × R^m.

Ranking Loss Functionals We are interested in the NDCG ranking evaluation metric, and hence for the ranking loss functional, we focus on optimization-amenable listwise surrogates for NDCG; specifically, a convex class of strongly NDCG-consistent loss functions introduced in [24], and nonconvex listwise loss functions, ListNet [4] and the Cosine Loss. In addition, we impose an ℓ2 regularization penalty on ‖w‖.

[24] exhaustively characterized the set of strongly NDCG-consistent surrogates as Bregman divergences Dψ corresponding to strictly convex ψ (see Appendix F).
We choose the following instances of ψ: the Cross Entropy loss with ψ(x) = 0.01(Σi xi log xi − xi), the square loss with ψ(x) = ‖x‖2^2, and the q-norm loss with ψ(x) = ‖x‖q^2, q = log(m) + 2 (where m is the number of documents). Note that the multiplicative factor in ψ is significant as it does affect φ.

Ranking Functions The representation theory of the previous sections gives a functional form for listwise ranking functions. In this section, we pick a simple class of ranking functions inspired by this representation theory, and use it to rerank the scores output by various pointwise ranking functions. Consider the following class of exchangeable ranking functions f(x) where the score for the ith document is given by:

fi(x) = b(xi) Π_{j≠i} g(xi, xj; w) = b(xi) Π_{j≠i} exp(Σ_k wk Sk(xi, xj))    (14)

where b(xi) is the score provided by the base ranker for the i-th document, and the Sk are pairwise functions ("kernels") applied to xi and xj. Note that w = 0 yields the base ranking functions. Our theory suggests that we can combine several such terms as fi(x) = Σ_t b(xi; vt) Π_{j≠i} g(xi, xj; wt). For our experiments, we only use one such term. A Gradient Boosting procedure can be used on top of our procedure to fit multiple terms for this series.

Our choice of g is motivated by computational considerations: for general functions g, the computation of (14) would require O(m) time per function evaluation, where m is the number of documents. However, the specific functional form in (14) allows O(1) time per function evaluation, as fi(x; w) = b(xi) Π_k exp(wk Σ_{j≠i} Sk(xi, xj)), where the inner term Σ_{j≠i} Sk(xi, xj) in the RHS does not depend on w and can be precomputed.
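The precomputation trick can be sketched as follows. All kernel values, base scores, and weights below are synthetic stand-ins, with S[k, i, j] playing the role of Sk(xi, xj): after tabulating P[k, i] = Σ_{j≠i} Sk(xi, xj) once, evaluating (14) for a new w no longer touches the m × m kernel values.

```python
import numpy as np

rng = np.random.default_rng(2)
m, K = 6, 3                        # documents and number of kernels S_k
b = rng.uniform(1.0, 2.0, size=m)  # stand-in base-ranker scores b(x_i)
S = rng.uniform(size=(K, m, m))    # stand-in kernel values S_k(x_i, x_j)
for k in range(K):
    np.fill_diagonal(S[k], 0.0)    # so row sums run over j != i only

# Precompute P[k, i] = sum_{j != i} S_k(x_i, x_j): independent of w.
P = S.sum(axis=2)

def scores(w):
    # f_i(x; w) = b(x_i) * prod_k exp(w_k * P[k, i]); O(K) per document.
    return b * np.exp(w @ P)

w = np.array([0.1, -0.2, 0.05])
naive = np.array([b[i] * np.prod([np.exp(w[k] * S[k, i].sum())
                                  for k in range(K)]) for i in range(m)])
assert np.allclose(scores(w), naive)          # matches the direct form (14)
assert np.allclose(scores(np.zeros(K)), b)    # w = 0 recovers the base ranker
```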
Thus after the precomputation step, each function evaluation is as efficient as that for a pointwise ranking function.

As the base pointwise rankers b, we use those provided by RankLib¹: MART, RankNet, RankBoost, AdaRank, Coordinate Ascent (CA), LambdaMART, ListNet, Random Forests, Linear regression. We refer the reader to the RankLib website for details on these.

¹https://sourceforge.net/p/lemur/wiki/RankLib/

Table 1: Results for our reranking procedure across LETOR 3.0 datasets. For each dataset, the first column is the base ranker, second column is the loss function used for reranking.

            OHSUMED                    TD2003                 NP2003
            Base       Reranked w/     Base     Reranked w/   Base     Reranked w/
            RankBoost  Cross Ent       CA       q-Norm        MART     Square
ndcg@1      0.5104     0.5421          0.3500   0.3250        0.5467   0.5600
ndcg@2      0.4798     0.4901          0.2875   0.3375        0.6500   0.6567
ndcg@5      0.4547     0.4615          0.3228   0.3461        0.7112   0.7128
ndcg@10     0.4356     0.4445          0.3210   0.3385        0.7326   0.7344

            HP2003                     HP2004                 NP2004
            Base       Reranked w/     Base       Reranked w/ Base     Reranked w/
            MART       Cross Ent       RankBoost  q-Norm      MART     Square
ndcg@1      0.6667     0.7333          0.5200     0.5333      0.3600   0.3733
ndcg@2      0.7667     0.7667          0.6067     0.6533      0.4733   0.4867
ndcg@5      0.7546     0.7618          0.7034     0.7042      0.5603   0.5719
ndcg@10     0.7740     0.7747          0.7387     0.7420      0.5951   0.6102

Results We use the LETOR 3.0 collection [23], which contains the OHSUMED dataset and the Gov collection: HP2003/04, TD2003/04, NP2003/04, which respectively correspond to the listwise Homepage Finding, Topic Distillation and Named Page Finding tasks. We use NDCG as evaluation metric and show gains instead of losses, so larger values are better.

We use the following pairwise functions/kernels {Sk}: we construct a cosine similarity function for documents using the Query Normalized document features for each LETOR dataset.
In addition, OHSUMED contains document similarity information for each query, and the Gov datasets contain link information and a sitemap, i.e. a parent-child relation. We use these relations directly as the kernels Sk in (14). Thus, we have two kernels for OHSUMED and three for the Gov datasets, and w is 2- and 3-dimensional respectively. To obtain the scores b for the baseline pointwise ranking functions, we used RankLib v2.1-patched with its default parameter values.

LETOR contains 5 predefined folds with training, validation and test sets. We use these directly and report averaged results on the test set. For the ℓ2 regularization parameter, we pick a C from [0, 1e-5, 1e-2, 1e-1, 1, 10, 1e2, 1e3], tuning for maximum NDCG@10 on the validation set. We used gradient descent on w to fit parameters. Though our objective is nonconvex, we found that random restarts did not affect the achieved minimum and used the initial value w = 0 for our experiments. Since w = 0 corresponds to the base pointwise rankers, we expect the reranking method to perform as well as the base rankers in the worst case. Table 1 shows some results across LETOR datasets which show improvements over the base rankers. For each dataset, we compare the NDCG for the specified base ranker with the NDCG for our reranking method with that base ranker and the specified listwise loss. (Detailed results are presented in Appendix G.) Gradient descent required on average only 17 iterations and 20 function evaluations; thus the principal computational cost of this method was the precomputation for eq. (14). The low computational cost and the empirical results shown for the reranking method are promising and validate our theoretical investigation.
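For reference, the NDCG@k evaluation metric used above can be computed along the following lines; this sketch uses a common gain/discount convention, (2^rel − 1)/log2(p + 1), and the exact variant implemented in the LETOR evaluation tools may differ in details.

```python
import numpy as np

def dcg_at_k(rels, k):
    # DCG@k = sum_{p=1..k} (2^rel_p - 1) / log2(p + 1).
    rels = np.asarray(rels, dtype=float)[:k]
    return float(np.sum((2.0 ** rels - 1.0) /
                        np.log2(np.arange(2, rels.size + 2))))

def ndcg_at_k(scores, rels, k):
    # Rank by predicted score, then normalize by the ideal (sorted) ranking.
    order = np.argsort(-np.asarray(scores, dtype=float))
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(np.asarray(rels)[order], k) / ideal if ideal > 0 else 0.0

rels = [2, 1, 0, 0]
assert ndcg_at_k([0.9, 0.5, 0.3, 0.1], rels, 2) == 1.0   # perfect ordering
assert ndcg_at_k([0.5, 0.9, 0.3, 0.1], rels, 2) < 1.0    # top two swapped
```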
We hope that this representation theory will enable the development of listwise ranking functions across diverse domains, especially those less studied than ranking in information retrieval.

Acknowledgements

We acknowledge the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1320894, IIS-1447574, and DMS-1264033.

References
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern information retrieval. Addison Wesley, 1999.
[2] J. M. Bernardo and A. F. Smith. Bayesian theory, volume 405. John Wiley & Sons, 2009.
[3] C. J. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11:23–581, 2010.
[4] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In International Conference on Machine Learning 24, pages 129–136. ACM, 2007.
[5] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335–336. ACM, 1998.
[6] O. Chapelle and Y. Chang. Yahoo! learning to rank challenge overview. Journal of Machine Learning Research - Proceedings Track, 14:1–24, 2011.
[7] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In Conference on Information and Knowledge Management (CIKM), 2009.
[8] O. Chapelle, Y. Chang, and T. Liu. Future directions in learning to rank. In JMLR Workshop and Conference Proceedings, volume 14, pages 91–100, 2011.
[9] P. Comon, G. Golub, L. Lim, and B. Mourrain. Symmetric tensors and symmetric tensor rank. SIAM Journal on Matrix Analysis and Applications, 30(3):1254–1279, 2008.
[10] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition.
SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.
[11] K. Dembczynski, W. Kotlowski, and E. Huellermeier. Consistent multilabel ranking through univariate losses. arXiv preprint arXiv:1206.6401, 2012.
[12] P. Diaconis. Finite forms of de Finetti's theorem on exchangeability. Synthese, 36(2):271–281, 1977.
[13] P. Diaconis and D. Freedman. Finite exchangeable sequences. The Annals of Probability, pages 745–764, 1980.
[14] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.
[15] D. Heath and W. Sudderth. De Finetti's theorem on exchangeable variables. The American Statistician, 30(4):188–189, 1976.
[16] E. Hewitt and L. J. Savage. Symmetric measures on Cartesian products. Transactions of the American Mathematical Society, pages 470–501, 1955.
[17] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 41–48, New York, NY, USA, 2000. ACM.
[18] E. T. Jaynes. Some applications and extensions of the de Finetti representation theorem. Bayesian Inference and Decision Techniques, 31:42, 1986.
[19] J. F. Kingman. Uses of exchangeability. The Annals of Probability, pages 183–197, 1978.
[20] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[21] L. Qi. The spectral theory of tensors (Rough Version). arXiv preprint arXiv:1201.3424, 2012.
[22] T. Qin, T. Liu, X. Zhang, D. Wang, and H. Li. Global ranking using continuous conditional random fields. In Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems (NIPS 2008), 2008.
[23] T. Qin, T. Liu, J. Xu, and H. Li. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346–374, 2010.
[24] P. Ravikumar, A. Tewari, and E. Yang. On NDCG consistency of listwise ranking methods. 2011.
[25] M. C. Reed and B. Simon. Methods of modern mathematical physics: Functional analysis, volume 1. Gulf Professional Publishing, 1980.
[26] J. Weston and J. Blitzer. Latent Structured Ranking. arXiv preprint arXiv:1210.4914, 2012.