{"title": "Two-Layer Generalization Analysis for Ranking Using Rademacher Average", "book": "Advances in Neural Information Processing Systems", "page_first": 370, "page_last": 378, "abstract": "This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents per query. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of algorithms. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent if represented by features extracted from the matching between document and query. To tackle the challenge, we decompose the generalization error according to the two layers, and make use of the new concept of two-layer Rademacher average. 
The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performance of ranking algorithms.", "full_text": "Two-Layer Generalization Analysis for Ranking Using Rademacher Average

Wei Chen* (Chinese Academy of Sciences, chenwei@amss.ac.cn), Tie-Yan Liu (Microsoft Research Asia, tyliu@microsoft.com), Zhiming Ma (Chinese Academy of Sciences, mazm@amt.ac.cn)

Abstract

This paper is concerned with generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d. sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query are no longer independent once represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of the two-layer Rademacher average. 
The\ngeneralization bounds we obtained are quite intuitive and are in accordance with\nprevious empirical studies on the performances of ranking algorithms.\n\n1\n\nIntroduction\n\nLearning to rank has recently gained much attention in machine learning, due to its wide applica-\ntions in real problems such as information retrieval (IR). When applied to IR, learning to rank is a\nprocess as follows [16]. First, a set of queries, their associated documents, and the corresponding\nrelevance judgments are given. Each document is represented by a set of features, measuring the\nmatching between document and query. Widely-used features include the frequency of query terms\nin the document and the query likelihood given by the language model of the document. A rank-\ning function, which combines the features to predict the relevance of a document to the query, is\nlearned by minimizing a loss function de\ufb01ned on the training data. Then for a new query, the rank-\ning function is used to rank its associated documents according to their predicted relevance. Many\nlearning to rank algorithms have been proposed, among which the pairwise ranking algorithms such\nas Ranking SVMs [12, 13], RankBoost [11], and RankNet [5] have been widely applied.\n\nTo understand existing ranking algorithms, and to guide the development of new ones, people have\nstudied the learning theory for ranking, in particular, the generalization ability of ranking methods.\nGeneralization ability is usually represented by a bound of the deviation between the expected and\nempirical risks for an arbitrary ranking function in the hypothesis space. People have investigated the\ngeneralization bounds under different assumptions. 
First, with the assumption that documents are i.i.d., the generalization bounds of RankBoost [11], of stable pairwise ranking algorithms like Ranking SVMs [2], and of algorithms minimizing the pairwise 0-1 loss [1, 9] were studied.

*The work was performed when the first author was an intern at Microsoft Research Asia.

We call these generalization bounds "document-level generalization bounds"; they converge to zero when the number of documents in the training set approaches infinity. Second, with the assumption that queries are i.i.d., the generalization bounds of stable pairwise ranking algorithms like Ranking SVMs and IR-SVM [6] and of listwise algorithms were obtained in [15] and [14]. We call these generalization bounds "query-level generalization bounds". When analyzing query-level generalization bounds, the documents associated with each query are usually regarded as a deterministic set [10, 14], and no random sampling of documents is assumed. As a result, query-level generalization bounds converge to zero only when the number of queries approaches infinity, no matter how many documents are associated with them.

While the existing generalization bounds can explain the behaviors of some ranking algorithms, they also have their limitations. (1) The assumption that documents are i.i.d. makes the document-level generalization bounds not directly applicable to ranking in IR. This is because it has been widely accepted that the documents associated with different queries do not follow the same distribution [17], and the documents associated with the same query are no longer independent once represented by document-query matching features. (2) It is not reasonable for query-level generalization bounds to assume that one can obtain the document set associated with each query in a deterministic manner. Usually there are many random factors that affect the collection of documents. 
For example, in the labeling\nprocess of TREC, the ranking results submitted by all TREC participants were put together and then\na proportion of them were selected and presented to human annotators for labeling. In this process,\nthe number of participants, the ranking result given by each participant, the overlap between dif-\nferent ranking results, the labeling budget, and the selection methodology can all in\ufb02uence which\ndocuments and how many documents are labeled for each query. As a result, it is more reasonable\nto assume a random sampling process for the generation of labeled documents per query.\n\nTo address the limitations of previous work, we propose a novel theoretical framework for ranking,\nin which a two-layer sampling strategy is assumed. In the \ufb01rst layer, queries are i.i.d. sampled\nfrom the query space according to a \ufb01xed but unknown probability distribution. In the second layer,\nfor each query, documents are i.i.d. sampled from the document space according to a \ufb01xed but\nunknown conditional probability distribution determined by the query (i.e., documents associated\nwith different queries do not have the identical distribution). Then, a set of features are extracted\nfor each document with respect to the query. Note that the feature representations of the documents\nwith the same query, as random variables, are not independent any longer. But they are conditionally\nindependent if the query is given. As can be seen, this new sampling strategy removes improper\nassumptions in previous work, and can more accurately describe the data generation process in IR.\n\nBased on the framework, we have performed two-layer generalization analysis for pairwise rank-\ning algorithms. However, the task is non-trivial mainly because the two-layer sampling does not\ncorrespond to a typical empirical process: the documents for different queries are not identically\ndistributed while the documents for the same query are not independent. 
Thus, the empirical process techniques, widely used in previous work on generalization analysis, are not sufficient. To tackle the challenge, we carefully decompose the expected risk according to the query and document layers, and employ a new concept called the two-layer Rademacher average. The new concept accurately describes the complexity in the two-layer setting, and its reduced versions can be used to derive meaningful bounds for the query-layer error and the document-layer error respectively.

According to the generalization bounds we obtained, we have the following findings: (i) Both more queries and more documents per query can enhance the generalization ability of ranking methods; (ii) Only if both the number of training queries and that of documents per query simultaneously approach infinity can the generalization bound converge to zero; (iii) Given a fixed size of training data, there exists an optimal tradeoff between the number of queries and the number of documents per query. These findings are quite intuitive and can well explain empirical observations [19].

2 Related Work

2.1 Pairwise Learning to Rank

Pairwise ranking is one of the major approaches to learning to rank, and has been widely adopted in real applications [5, 11, 12, 13]. The process of pairwise ranking can be described as follows.

Assume there are $n$ queries $\{q_1, q_2, \cdots, q_n\}$ in the training data. Each query $q_i$ is associated with $m_i$ documents $\{d^i_1, \cdots, d^i_{m_i}\}$ and their judgments $\{y^i_1, \cdots, y^i_{m_i}\}$, where $y^i_j \in Y$. Each document $d^i_j$ is represented by a set of features $x^i_j = \psi(d^i_j, q_i) \in X$, measuring the matching between document $d^i_j$ and query $q_i$. Widely-used features include the frequency of query terms in the document and the query likelihood given by the language model of the document. For ease of reference, we use $z = (x, y) \in X \times Y = Z$ to denote document $d$, since it encodes all the information of $d$ in the learning process. Then the training set can be denoted as $S = \{S_1, \cdots, S_n\}$, where $S_i \triangleq \{z^i_j \in Z\}_{j=1,\cdots,m_i}$ is the document sample for query $q_i$. For a ranking function $f: X \to \mathbb{R}$, the pairwise 0-1 loss $l_{0\text{-}1}$ and the pairwise surrogate loss $l_\phi$ are defined as below:

$l_{0\text{-}1}(z, z'; f) = I_{\{(y - y')(f(x) - f(x')) < 0\}}, \qquad l_\phi(z, z'; f) = \phi\big({-\mathrm{sgn}}(y - y') \cdot (f(x) - f(x'))\big), \quad (1)$

where $I_{\{\cdot\}}$ is the indicator function and $z, z'$ are two documents associated with the same query. When function $\phi$ takes different forms, we get the surrogate loss functions for different algorithms: for Ranking SVMs, RankBoost, and RankNet, $\phi$ is the hinge, exponential, and logistic function respectively.

2.2 Document-level Generalization Analysis

In document-level generalization analysis, it is assumed that the documents are i.i.d. 
sampled from the document space $Z$ according to $P(z)$. Then the expected risk of pairwise ranking algorithms can be defined as below,

$R^l_D(f) = \int_{Z^2} l(z, z'; f)\, dP^2(z, z'),$

where $P^2(z, z')$ is the product probability of $P(z)$ on the product space $Z^2$.

The document-level generalization bound usually takes the following form: with probability at least $1 - \delta$,

$R^l_D(f) \le \frac{1}{m(m-1)} \sum_{j \ne k} l(z_j, z_k; f) + \epsilon(\delta, F, m), \quad \forall f \in F,$

where $\epsilon(\delta, F, m) \to 0$ when the document number $m \to \infty$.

As representative work, the generalization bounds for the pairwise 0-1 loss were derived in [1, 9], and the generalization bounds for RankBoost and Ranking SVMs were obtained in [2] and [11].

As aforementioned, the assumption that documents are i.i.d. makes the document-level generalization bounds not directly applicable to ranking in IR. Even if the assumption held, the document-level generalization bounds still could not be used to explain existing pairwise ranking algorithms. Actually, according to the document-level generalization bound, what we can obtain is: with probability at least $1 - \delta$,

$R^l_D(f) \le \frac{\sum_{(i,j) \ne (i',k)} l(z^i_j, z^{i'}_k; f)}{\sum_i m_i \left(\sum_i m_i - 1\right)} + \epsilon\left(\delta, F, \textstyle\sum_i m_i\right), \quad \forall f \in F.$

The empirical risk here is the average of the pairwise losses over all document pairs. This is clearly not the real empirical risk of ranking in IR, where documents associated with different queries cannot be compared with each other, and pairs are constructed only from documents associated with the same query.

2.3 Query-level Generalization Analysis

In existing query-level generalization analysis [14], it is assumed that each query $q_i$, represented by a deterministic document set $S_i$ with the same number of documents (i.e., $m_i \equiv m$), is i.i.d. sampled from the space $Z^m$. Then the expected risk can be defined as follows,

$R^l_Q(f) = \int_{Z^m} \frac{1}{m(m-1)} \sum_{j \ne k} l(z_j, z_k; f)\, dP(z_1, \cdots, z_m).$

The query-level generalization bound usually takes the following form: with probability at least $1 - \delta$,

$R^l_Q(f) \le \frac{1}{n} \sum_{i=1}^n \frac{1}{m(m-1)} \sum_{j \ne k} l(z^i_j, z^i_k; f) + \epsilon(\delta, F, n), \quad \forall f \in F,$

where $\epsilon(\delta, F, n) \to 0$ as the query number $n \to \infty$.

As representative work, the query-level generalization bounds for stable pairwise ranking algorithms such as Ranking SVMs and IR-SVM and for listwise ranking algorithms were derived in [15]¹ and [14]. As mentioned in the introduction, the assumption that each query is associated with a deterministic set of documents is not reasonable. The fact is that many random factors can influence what kinds of documents and how many documents are labeled for each query. Due to this inappropriate assumption, the query-level generalization bounds are sometimes not intuitive. For example, when more labeled documents are added to the training set, the generalization bounds of stable pairwise ranking algorithms derived in [15] do not change, and the generalization bounds of some of the listwise ranking algorithms derived in [14] even get looser.

3 Two-Layer Generalization Analysis

In this section, we introduce the concepts of two-layer data sampling and two-layer generalization ability for ranking. 
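Before formalizing these concepts, the distinction drawn in Sections 2.2 and 2.3 can be made concrete. The sketch below uses toy data and a hypothetical ranking function (nothing here comes from the paper) to contrast the all-pairs empirical risk underlying document-level analysis with the per-query pair construction actually used in IR:

```python
# Toy contrast (invented data, hypothetical scorer f) between:
#  - the document-level empirical risk, which averages the pairwise 0-1 loss
#    over ALL ordered document pairs, ignoring query boundaries; and
#  - the per-query construction used in IR, where pairs are only formed
#    between documents associated with the same query.

def pairwise_01_loss(z, z_prime, f):
    """l_{0-1}(z, z'; f) = I{(y - y')(f(x) - f(x')) < 0}."""
    (x, y), (xp, yp) = z, z_prime
    return 1.0 if (y - yp) * (f(x) - f(xp)) < 0 else 0.0

def all_pairs_risk(docs, f):
    """Average loss over all ordered pairs j != k of one document list."""
    m = len(docs)
    return sum(pairwise_01_loss(docs[j], docs[k], f)
               for j in range(m) for k in range(m) if j != k) / (m * (m - 1))

def per_query_risk(samples, f):
    """Average the within-query all-pairs risk over the queries."""
    return sum(all_pairs_risk(S_i, f) for S_i in samples) / len(samples)

f = lambda x: x                      # hypothetical ranking function
S = [                                # each z = (feature x, judgment y)
    [(0.9, 2), (0.5, 1), (0.1, 0)],  # query 1: f orders its docs perfectly
    [(0.8, 0), (0.2, 1), (0.6, 1)],  # query 2: two inversions w.r.t. f
]
flat = [z for S_i in S for z in S_i]

print(per_query_risk(S, f))    # 0.333... : pairs built within each query only
print(all_pairs_risk(flat, f)) # 0.2      : also compares docs across queries
```

On this toy sample the two quantities differ precisely because the all-pairs average also scores document pairs drawn from different queries.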
These concepts can help describe the data generation process and explain the behaviors of learning to rank algorithms more accurately than previous work.

3.1 Two-Layer Sampling in IR

When applying learning to rank techniques to IR, a training set is needed. The creation of such a training set is usually as follows. First, queries are randomly sampled from the query logs of search engines. Then for each query, documents that are potentially relevant to the query are sampled (e.g., using the strategy in TREC [8]) from the entire document repository and presented to human annotators. Human annotators make relevance judgments on these documents, according to the matching between them and the query. Mathematically, we can represent the above process in the following manner. First, queries $Q = \{q_1, \cdots, q_n\}$ are i.i.d. sampled from the query space $\mathcal{Q}$ according to distribution $P(q)$. Second, for each query $q_i$, its associated documents and their relevance judgments $\{(d^i_1, y^i_1), \cdots, (d^i_{m_i}, y^i_{m_i})\}$ are i.i.d. sampled from the document space $D$ according to a conditional distribution $P(d|q_i)$, where $m_i$ is the number of sampled documents. Each document $d^i_j$ is then represented by a set of matching features, i.e., $x^i_j = \psi(d^i_j, q_i)$, where $\psi$ is a feature extractor. Following the notation rules in Section 2.1, we use $z^i_j = (x^i_j, y^i_j)$ to represent document $d^i_j$ and its label, and denote the training data for query $q_i$ as $S_i = \{z^i_j\}_{j=1,\cdots,m_i}$. Note that although $\{d^i_j\}_{j=1,\cdots,m_i}$ are i.i.d. samples, the random variables $\{z^i_j\}_{j=1,\cdots,m_i}$ are no longer independent, because they share the same query $q_i$. 
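The sampling process just described can be sketched as a toy simulation. Everything below (the query distribution, the conditional document distribution, and the feature extractor `psi`) is invented for illustration; only the two-layer structure itself follows the text:

```python
import random

# Toy simulation of two-layer sampling: queries are drawn i.i.d.; documents
# are drawn i.i.d. conditionally on their query; each document is then
# represented by a matching feature x = psi(d, q). Because psi depends on q,
# the z's sharing a query are dependent once q is integrated out, but they
# are conditionally independent given q.

random.seed(0)

def sample_query():
    # first layer: q ~ P(q); here a query is just a random "topic weight"
    return random.uniform(0.0, 1.0)

def sample_document(q):
    # second layer: d ~ P(d | q); raw content and judgment depend on q
    relevance = random.choice([0, 1, 2])                 # judgment y
    content = q + random.gauss(0.0, 0.3) * (2 - relevance)
    return content, relevance

def psi(d, q):
    # matching feature: a function of BOTH the document and the query
    return -abs(d - q)

n = 3                 # number of queries
m = [4, 2, 5]         # documents per query (may differ across queries)
S = []
for i in range(n):
    q_i = sample_query()
    S_i = []
    for _ in range(m[i]):
        d, y = sample_document(q_i)
        S_i.append((psi(d, q_i), y))   # z = (psi(d, q), y)
    S.append(S_i)

print([len(S_i) for S_i in S])   # [4, 2, 5]
```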
Only if $q_i$ is given can we regard them as independent of each other.

We call the above data generation process two-layer sampling, and denote the training data generated in this way as $(Q, S)$, where $Q$ is the query sample and $S = \{S_i\}_{i=1,\cdots,n}$ is the document sample. The two-layer sampling process can be illustrated using Figure 1.

[Figure 1: Two-layer sampling. Queries $q_1, \cdots, q_n$ are i.i.d. drawn from $(\mathcal{Q}, P)$; for each $q_i$, documents $(d^i_1, y^i_1), \cdots, (d^i_{m_i}, y^i_{m_i})$ are drawn according to $P(\cdot|q_i)$ and represented as $z^i_j = (\psi(q_i, d^i_j), y^i_j)$.]

Note that two-layer sampling has significant differences from the sampling strategies used in previous generalization analysis. (i) As compared to the sampling in document-level generalization analysis, two-layer sampling introduces the sampling of queries, and documents associated with different queries are sampled according to different conditional distributions. (ii) As compared to the sampling in query-level generalization analysis, two-layer sampling considers the sampling of documents for each query.

¹In [15], although a sampling strategy similar to two-layer sampling is mentioned, the generalization analysis does not consider the independent sampling at the document layer. As a result, the bound obtained there is a query-level generalization bound, not a two-layer generalization bound.

[Figure 2: (n, m)-sampling. Distributions $P_1, \cdots, P_n$ are drawn i.i.d., and each $P_i$ generates $m$ i.i.d. elements $z^i_1, \cdots, z^i_m$.]

To some extent, the aforementioned two-layer sampling is related to directly sampling from the product space of query and document, and to the (n, m)-sampling proposed in [4]. However, as shown below, they also have significant differences. Firstly, it is clear that directly sampling from the product space of query and document does not describe the real data generation process. Furthermore, even if we sample a large number of documents in this way, it is not guaranteed that we will have a sufficient number of documents for each single query. Secondly, comparing Figure 1 with Figure 2 (which illustrates (n, m)-sampling), we can easily find: (i) in (n, m)-sampling, tasks (corresponding to queries) have the same number (i.e., $m$) of elements (corresponding to documents), whereas in two-layer sampling queries can be associated with different numbers of documents; (ii) in (n, m)-sampling, all the elements are i.i.d., whereas in two-layer sampling the documents (once represented by matching features) associated with the same query are not independent of each other.

3.2 Two-Layer Generalization Ability

With the probabilistic assumption of two-layer sampling, we define the expected risk for pairwise ranking as follows,

$R^l(f) = \int_{\mathcal{Q}} \int_{Z^2} l(z, z'; f)\, dP(z, z'|q)\, dP(q), \quad (2)$

where $P(z, z'|q)$ is the product probability of $P(z|q)$ on the product space $Z^2$.

Definition 1. 
We say that an ERM learning process with loss $l$ in hypothesis space $F$ has two-layer generalization ability if, with probability at least $1 - \delta$,

$R^l(f) \le \hat{R}^l_{n;m_1,\cdots,m_n}(f; S) + \epsilon(\delta, F, n, m_1, \cdots, m_n), \quad \forall f \in F,$

where $\hat{R}^l_{n;m_1,\cdots,m_n}(f; S) = \frac{1}{n} \sum_{i=1}^n \frac{1}{m_i(m_i-1)} \sum_{j \ne k} l(z^i_j, z^i_k; f)$, and $\epsilon(\delta, F, n, m_1, \cdots, m_n) \to 0$ iff the query number $n$ and the document numbers per query $m_i$ simultaneously approach infinity.

In the next section, we will show our theoretical results on the two-layer generalization abilities of typical pairwise ranking algorithms.

4 Main Theoretical Result

In this section, we show our results on the two-layer generalization ability of ERM learning with pairwise ranking losses (either the pairwise 0-1 loss or pairwise surrogate losses). As a prerequisite, we recall the concept of the conventional Rademacher average (RA) [3].

Definition 2. For a sample $\{x_1, \ldots, x_m\}$, the RA of $l \circ F$ is defined as $R_m(l \circ F) = E_\sigma \left[ \sup_{f \in F} \left| \frac{2}{m} \sum_{j=1}^m \sigma_j l(x_j; f) \right| \right]$, where $\sigma_1, \ldots, \sigma_m$ are independent Rademacher random variables, independent of the data sample.

With the above definitions, we have the following theorem, which describes when and how the two-layer generalization bounds of pairwise ranking algorithms converge to zero.

Theorem 1. Suppose $l$ is the loss function for pairwise ranking. 
Assume 1) $l \circ F$ is bounded by $M$, and 2) $E[R_m(l \circ F)] \le D(l \circ F, m)$; then with probability at least $1 - \delta$, for all $f \in F$,

$R^l(f) \le \hat{R}^l_{n;m_1,\cdots,m_n}(f) + D(l \circ F, n) + \sqrt{\frac{2M^2 \log(4/\delta)}{n}} + \frac{1}{n} \sum_{i=1}^n D\!\left(l \circ F, \left\lfloor \tfrac{m_i}{2} \right\rfloor\right) + \sqrt{\sum_{i=1}^n \frac{2M^2 \log(4/\delta)}{m_i n^2}}.$

Remark: The condition of the existence of upper bounds for $E[R_m(l \circ F)]$ can be satisfied in many situations. For example, for a ranking function class $F$ that satisfies $VC(\tilde{F}) = V$, where $\tilde{F} = \{f(x, x') = f(x) - f(x');\, f \in F\}$, $VC(\cdot)$ denotes the VC dimension, and $|f(x)| \le B$, it has been proved that $D(l_{0\text{-}1} \circ F, m) = c_1 \sqrt{V/m}$ and $D(l_\phi \circ F, m) = c_2 B \phi'(B) \sqrt{V/m}$ in [3, 9], where $c_1$ and $c_2$ are both constants.

4.1 Proof of Theorem 1

Note that the proof of Theorem 1 is non-trivial, because documents generated by two-layer sampling are neither independent nor identically distributed, as aforementioned. As a result, two-layer sampling does not correspond to an empirical process, and classical proof techniques in statistical learning are not sufficient. To tackle the challenge, we decompose the two-layer expected risk as follows:

$R^l(f) = \hat{R}^l_{n;m_1,\cdots,m_n}(f) + \left( R^l(f) - \hat{R}^l_n(f) \right) + \left( \hat{R}^l_n(f) - \hat{R}^l_{n;m_1,\cdots,m_n}(f) \right),$

where $\hat{R}^l_n(f) = \frac{1}{n} \sum_{i=1}^n \int_{Z^2} l(z, z'; f)\, dP(z, z'|q_i)$. We call $R^l(f) - \hat{R}^l_n(f)$ the query-layer error and $\hat{R}^l_n(f) - \hat{R}^l_{n;m_1,\cdots,m_n}(f)$ the document-layer error. Then, inspired by the conventional RA [3], we propose a concept called the two-layer RA to describe the complexity of the sample $(Q, S)$.

Definition 3. 
For a two-layer sample $(Q, S)$, the two-layer RA of $l \circ F$ is defined as follows,

$R_{n;m_1,\cdots,m_n}(l \circ F(Q, S)) = E_\sigma \left[ \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^n \frac{1}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} \sigma^i_j\, l(z^i_j, z^i_{\lfloor m_i/2 \rfloor + j}; f) \right| \right],$

where $\{\sigma^i_j\}$ are independent Rademacher random variables, independent of the data sample. If $(Q, S) = \{q_i; z_i, z'_i\}_{i=1,\cdots,n}$, we call its expected two-layer RA, i.e., $E_{Q,S}[R_{n;2,\cdots,2}(l \circ F(Q, S))]$, the document-layer reduced two-layer RA. If $(q, S) = \{q; z_1, \cdots, z_m\}$, we call its conditional expected two-layer RA, i.e., $E_{S|q}[R_{1;m}(l \circ F(q, S))]$, the query-layer reduced two-layer RA.

Based on the concept of the two-layer RA, we can derive meaningful bounds for the two-layer expected risk. In Section 4.1.1, we prove the query-layer error bound by using the document-layer reduced two-layer RA; in Section 4.1.2, we prove the document-layer error bound by using the query-layer reduced two-layer RA. Combining the two bounds, we prove Theorem 1 in Section 4.1.3.

4.1.1 Query-Layer Error Bounds

As for the query-layer error bound, we have the following theorem.

Theorem 2. Assume $l \circ F$ is bounded by $M$; then with probability at least $1 - \delta$,

$R^l(f) - \hat{R}^l_n(f) \le E_{Q,S}[R_{n;2,\cdots,2}(l \circ F(Q, S))] + \sqrt{\frac{2M^2 \log(2/\delta)}{n}}, \quad \forall f \in F.$

Proof. We define a function $L_f$ as follows: $L_f(q) = \int_{Z^2} l(z, z'; f)\, dP^2(z|q)$. Since $q_1, \cdots, q_n$ are i.i.d. sampled, $L_f(q_1), \cdots, L_f(q_n)$ are also i.i.d. Denote $G_1(Q) = \sup_{f \in F} \left| R^l(f) - \hat{R}^l_n(f) \right|$. Since $l \circ F$ is bounded by $M$, by McDiarmid's inequality we have $G_1(Q) \le E[G_1(Q)] + \sqrt{\frac{2M^2 \log(2/\delta)}{n}}$. 
By introducing a ghost query sample $\tilde{Q} = \{\tilde{q}_1, \cdots, \tilde{q}_n\}$, we have

$E[G_1(Q)] = E_Q \left[ \sup_{f \in F} \left| \frac{1}{n} \sum_{i=1}^n L_f(q_i) - \int L_f(q)\, dP(q) \right| \right] \le E_{Q,\tilde{Q}} \left[ \sup_{f \in F} \left| \frac{1}{n} \sum_{i=1}^n L_f(q_i) - \frac{1}{n} \sum_{i=1}^n L_f(\tilde{q}_i) \right| \right]. \quad (3)$

Further assuming that there are virtual document samples $\{z_i, z'_i\}_{i=1,\cdots,n}$ and $\{\tilde{z}_i, \tilde{z}'_i\}_{i=1,\cdots,n}$ for the query samples $Q$ and $\tilde{Q}$, we have $L_f(q_i) = E_{z_i,z'_i|q_i}[l(z_i, z'_i; f)]$ and $L_f(\tilde{q}_i) = E_{\tilde{z}_i,\tilde{z}'_i|\tilde{q}_i}[l(\tilde{z}_i, \tilde{z}'_i; f)]$. Substituting $L_f(q_i)$ and $L_f(\tilde{q}_i)$ into inequality (3), we obtain the following result:

$E[G_1(Q)] \le E_{Q,\tilde{Q}} \left[ \sup_{f \in F} \left| \frac{1}{n} \sum_{i=1}^n E_{z_i,z'_i,\tilde{z}_i,\tilde{z}'_i|q_i,\tilde{q}_i} \big( l(z_i, z'_i; f) - l(\tilde{z}_i, \tilde{z}'_i; f) \big) \right| \right] \le E \left[ \sup_{f \in F} \left| \frac{1}{n} \sum_{i=1}^n \big( l(z_i, z'_i; f) - l(\tilde{z}_i, \tilde{z}'_i; f) \big) \right| \right] = E_{Q,S}\, E_\sigma \left[ \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^n \sigma_i\, l(z_i, z'_i; f) \right| \right].$

According to the definition of the document-layer reduced two-layer RA, Theorem 2 is proved.

4.1.2 Document-layer Error Bound

In order to obtain the bound for the document-layer error, we consider the fact that the documents are independent if the query sample is given. 
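The pairing index $\lfloor m_i/2 \rfloor + j$ that appears in Definition 3, and again in the symmetrization step below, builds document pairs in which no document occurs twice, so that the $\lfloor m_i/2 \rfloor$ pair losses are conditionally i.i.d. given the query. A minimal sketch of that pairing (toy data; the helper name is hypothetical):

```python
# Sketch of the non-overlapping pairing (z_j, z_{floor(m/2)+j}) used in the
# two-layer Rademacher average: each of the first floor(m/2) documents is
# paired with one of the next floor(m/2), so no document appears in two pairs.

def halved_pairs(docs):
    """Return the floor(m/2) pairs (z_j, z_{floor(m/2)+j}), j = 1..floor(m/2)."""
    half = len(docs) // 2
    return [(docs[j], docs[half + j]) for j in range(half)]

docs = ["z1", "z2", "z3", "z4", "z5"]   # m = 5, so floor(m/2) = 2 pairs
pairs = halved_pairs(docs)
print(pairs)   # [('z1', 'z3'), ('z2', 'z4')]
```

Note that for odd $m_i$ one document (here `z5`) is simply left out, which is consistent with the sums running only to $\lfloor m_i/2 \rfloor$.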
Then, for any given query sample, we can obtain the following theorem by concentration inequality and symmetrization.

Theorem 3. Denote $G(S) \triangleq \sup_{f \in F} \big( \hat{R}^l_n(f) - \hat{R}^l_{n;m_1,\cdots,m_n}(f) \big)$ and assume $l \circ F$ is bounded by $M$; then we have:

$P \left\{ G(S) \le \frac{1}{n} \sum_{i=1}^n E_{S_i|q_i}[R_{1;m_i}(l \circ F(q_i, S_i))] + \sqrt{\sum_{i=1}^n \frac{2M^2 \log(2/\delta)}{m_i n^2}} \;\middle|\; Q \right\} \ge 1 - \delta. \quad (4)$

Proof. First, we prove the bounded difference property for $G(S)$.² Given the query sample $Q$, all the documents in the document sample become independent. Denote by $S'$ the document sample obtained by replacing document $z^{i_0}_{j_0}$ in $S$ with a new document $\tilde{z}^{i_0}_{j_0}$. It is clear that

$\sup_{S,S'} |G(S) - G(S')| \le \sup_{S,S'} \sup_{f \in F} \left| \hat{R}_{n;m_1,\cdots,m_n}(f; S) - \hat{R}_{n;m_1,\cdots,m_n}(f; S') \right| \le \sup_{S,S'} \sup_{f \in F} \frac{\sum_{k \ne j_0} \left| l(z^{i_0}_{j_0}, z^{i_0}_k; f) - l(\tilde{z}^{i_0}_{j_0}, z^{i_0}_k; f) \right|}{n\, m_{i_0} (m_{i_0} - 1)} \le \frac{2M}{m_{i_0} n}.$

Then by McDiarmid's inequality, with probability at least $1 - \delta$, we have

$G(S) \le E_{S|Q}[G(S)] + \sqrt{\sum_{i=1}^n \frac{2M^2 \log(2/\delta)}{m_i n^2}}. \quad (5)$

Second, inspired by [9], we introduce permutations to convert the non-sum-of-i.i.d. pairwise loss into a sum-of-i.i.d. form. Let $S_{m_i}$ be the symmetric group of degree $m_i$, and let $\pi_i \in S_{m_i}$ ($i = 1, \cdots, n$) permute the $m_i$ documents associated with $q_i$. Since documents associated with the same query follow the identical distribution, we have

$\frac{1}{n} \sum_{i=1}^n \frac{1}{m_i(m_i-1)} \sum_{j \ne k} l(z^i_j, z^i_k; f) \;\stackrel{p}{=}\; \frac{1}{n} \sum_{i=1}^n \frac{1}{m_i!} \sum_{\pi_i} \frac{1}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} l(z^i_{\pi_i(j)}, z^i_{\pi_i(\lfloor m_i/2 \rfloor + j)}; f), \quad (6)$

where $\stackrel{p}{=}$ means identity in distribution. Define a function $\tilde{G}(S_i)$ on each $S_i$ as follows:

$\tilde{G}(S_i) = \sup_{f \in F} \left| \frac{1}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} l(z^i_j, z^i_{\lfloor m_i/2 \rfloor + j}; f) - E_{z,z'|q_i}\big[l(z, z'; f)\big] \right|.$

We can see that $\tilde{G}(S_i)$ does not contain any document pairs that share a common document. By using Eqn. (6), we can decompose $E_{S|Q}[G(S)]$ into the sum of the $E_{S_i|q_i}[\tilde{G}(S_i)]$ as below:

$E_{S|Q}[G(S)] \le \frac{1}{n} \sum_{i=1}^n \frac{1}{m_i!} \sum_{\pi_i} E_{S_i|q_i} \left[ \sup_{f \in F} \left| \int_{Z^2} l(z, z'; f)\, dP(z, z'|q_i) - \frac{1}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} l(z^i_{\pi_i(j)}, z^i_{\pi_i(\lfloor m_i/2 \rfloor + j)}; f) \right| \right] = \frac{1}{n} \sum_{i=1}^n E_{S_i|q_i}\big[\tilde{G}(S_i)\big]. \quad (7)$

Third, we bound $E_{S_i|q_i}[\tilde{G}(S_i)]$ by use of symmetrization. We introduce a ghost document sample $\tilde{S}_i = \{\tilde{z}^i_j\}_{j=1,\cdots,m_i}$ that is independent of $S_i$ and identically distributed. Assume $\sigma^i_1, \cdots, \sigma^i_{\lfloor m_i/2 \rfloor}$ are independent Rademacher random variables, independent of $S_i$ and $\tilde{S}_i$. Then,

$E_{S_i|q_i}[\tilde{G}(S_i)] \le E_{S_i,\tilde{S}_i|q_i} \left[ \sup_{f \in F} \left| \frac{1}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} \big( l(z^i_j, z^i_{\lfloor m_i/2 \rfloor + j}; f) - l(\tilde{z}^i_j, \tilde{z}^i_{\lfloor m_i/2 \rfloor + j}; f) \big) \right| \right] = E_{S_i,\sigma^i|q_i} \left[ \sup_{f \in F} \left| \frac{2}{\lfloor m_i/2 \rfloor} \sum_{j=1}^{\lfloor m_i/2 \rfloor} \sigma^i_j\, l(z^i_j, z^i_{\lfloor m_i/2 \rfloor + j}; f) \right| \right] = E_{S_i|q_i}[R_{1;m_i}(l \circ F(q_i, S_i))]. \quad (8)$

Jointly considering (5), (7), and (8), we can prove the theorem.

²We say a function has bounded difference if the value of the function can only change by a bounded amount when only one variable is changed.

4.1.3 Combining the Bounds

Considering Theorem 3 and taking expectation over the query sample $Q$, we can obtain that with probability at least $1 - \delta$,

$\hat{R}^l_n(f) - \hat{R}^l_{n;m_1,\cdots,m_n}(f) \le \frac{1}{n} \sum_{i=1}^n E_{S_i|q_i}[R_{1;m_i}(l \circ F(q_i, S_i))] + \sqrt{\sum_{i=1}^n \frac{2M^2 \log(2/\delta)}{m_i n^2}}, \quad \forall f \in F.$

Furthermore, if the conventional RA has an upper bound $D(l \circ F, \cdot)$ for an arbitrary sample distribution, then the document-layer reduced two-layer RA can be upper bounded by $D(l \circ F, n)$ and the query-layer reduced two-layer RA can be bounded by $D(l \circ F, \lfloor m_i/2 \rfloor)$.

Combining the document-layer error bound and the query-layer error bound presented in the previous subsections, and considering the above discussions, we can eventually prove Theorem 1.

4.2 Discussions

According to Theorem 1, we can have the following discussions.

(1) An increasing number of either queries or documents per query in the training data will enhance the two-layer generalization ability. 
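A toy Monte Carlo check of this finding (the distributions and the scorer below are invented; this illustrates the qualitative trend rather than evaluating the bound of Theorem 1):

```python
import random

# Toy illustration: with more queries n and more documents m per query, the
# two-layer empirical risk of the pairwise 0-1 loss fluctuates less around
# its expectation across independently drawn training sets.

random.seed(1)

def f(x):                       # hypothetical ranking function
    return x

def sample_query():             # first layer: q ~ P(q)
    return random.uniform(0.0, 1.0)

def sample_doc(q):              # second layer: (x, y) ~ P(z | q)
    x = random.uniform(0.0, 1.0)
    y = 1 if x + random.gauss(0.0, 0.2) > q else 0   # noisy judgment
    return x, y

def empirical_risk(n, m):
    """Two-layer empirical risk: per-query all-pairs 0-1 loss, averaged."""
    total = 0.0
    for _ in range(n):
        q = sample_query()
        docs = [sample_doc(q) for _ in range(m)]
        # ordered pairs; j == k pairs contribute zero loss anyway
        s = sum(1.0 for (x, y) in docs for (xp, yp) in docs
                if (y - yp) * (f(x) - f(xp)) < 0)
        total += s / (m * (m - 1))
    return total / n

def spread(n, m, trials=200):
    """Variance of the empirical risk over independent training sets."""
    risks = [empirical_risk(n, m) for _ in range(trials)]
    mean = sum(risks) / trials
    return sum((r - mean) ** 2 for r in risks) / trials

s_small = spread(5, 4)     # few queries, few documents per query
s_large = spread(40, 20)   # many queries AND many documents per query
print(s_small > s_large)   # larger samples concentrate the empirical risk
```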
This conclusion is more intuitive and reasonable than that obtained in [15].

(2) The two-layer generalization bound converges uniformly only if $n\to\infty$ and $m_i\to\infty$ simultaneously. That is, if the number of documents for some query is finite, there will always be a document-layer error no matter how many queries are used for training; if the number of queries is finite, there will always be a query-layer error no matter how many documents per query are used for training.

(3) If we only have a limited budget to label $C$ documents in total, then according to Theorem 1 there is an optimal trade-off between the number of training queries and the number of training documents per query. This is consistent with previous empirical findings in [19]. Actually, one can attain the optimal trade-off by solving the following optimization problem:
$$\min_{n,m_1,\dots,m_n}\; D(l\circ\mathcal{F},\, n) + \sqrt{\sum_{i=1}^n \frac{2M^2\log\frac{2}{\delta}}{m_i n^2}} + \frac{1}{n}\sum_{i=1}^n D\big(l\circ\mathcal{F},\, \lfloor m_i/2\rfloor\big) \qquad \text{s.t.}\quad \sum_{i=1}^n m_i = C.$$

This optimization problem is easy to solve. For example, if the ranking function class $\mathcal{F}$ satisfies $VC(\tilde{\mathcal{F}}) = V$, then for the pairwise 0-1 loss we have
$$n^* = \frac{c_1\sqrt{V}+\sqrt{2\log(4/\delta)}}{c_1\sqrt{2V}}\,\sqrt{C}, \qquad m^*_i \equiv \frac{C}{n^*},$$
where $c_1$ is a constant. From this result we have the following discussions. (i) $n^*$ decreases as the capacity of the function class increases. That is, we should label fewer queries and more documents per query when the hypothesis space is larger.
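The expression for $n^*$ can be recovered by a short calculation. As an illustrative sketch (not the paper's own derivation), assume $D(l\circ\mathcal{F}, k)$ takes the form $c_1\sqrt{V/k}$ for the pairwise 0-1 loss, that the query-layer deviation contributes an additional confidence term of order $\sqrt{2\log(4/\delta)/n}$, and set $m_i \equiv m = C/n$, so that the middle square-root term becomes constant in $n$. The objective then reduces to
$$B(n) = \frac{c_1\sqrt{V}+\sqrt{2\log(4/\delta)}}{\sqrt{n}} + \sqrt{\frac{2M^2\log\frac{2}{\delta}}{C}} + c_1\sqrt{\frac{2Vn}{C}}\,.$$
Setting $B'(n)=0$ gives
$$\frac{c_1\sqrt{V}+\sqrt{2\log(4/\delta)}}{2\,n^{3/2}} = \frac{c_1}{2}\sqrt{\frac{2V}{C}}\,\frac{1}{\sqrt{n}} \;\Longrightarrow\; n^* = \frac{c_1\sqrt{V}+\sqrt{2\log(4/\delta)}}{c_1\sqrt{2V}}\,\sqrt{C}.$$
This is the unique stationary point, and $B(n)\to\infty$ as $n\to 0$ or $n\to\infty$, so it is the global minimizer.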
(ii) For \ufb01xed hypothesis space, n\u2217 increases with the con\ufb01dence level \u03b4.\nThat is, we should label more query if we want the bound to hold with a larger probability.\n\n\u221aC, m\u2217i \u2261 C\n\nc1\u221aV +\u221a2 log(4/\u03b4)\n\nc1\u221a2V\n\nThe above \ufb01ndings can be used to explain the behavior of existing pairwise ranking algorithms, and\ncan be used to guide the construction of training set for learning to rank.\n\n5 Conclusions and Discussions\n\nIn this paper, we have proposed conducting two-layer generalization analysis for ranking, and proved\na two-layer generalization bound for ERM learning with pairwise losses. The theoretical results we\nhave obtained can better explain experimental observations in learning to rank than previous results,\nand can provide general guidelines to trade off between deep labeling and shallow labeling in the\nconstruction of training data.\n\nFor future work, we plan to i) extend our analysis to listwise loss functions in ranking, such as\nListNet [7] and listMLE [18]; ii) and introduce noise condition in order to obtain faster convergency.\n\n8\n\n\fReferences\n\n[1] S. Agarwal, T. Graepel, R. Herbrich, S.Har-Peled, and D. Roth. Generalization bounds for the\n\narea under the roc curve. Journal of Machine Learning Research, 6:393\u2013425, 2005.\n\n[2] S. Agarwal and P. Niyogi. Generalization bounds for ranking algorithms via algorithmic sta-\n\nbility. Journal of Machine Learning Research, 10:441\u2013474, 2009.\n\n[3] P. L. Bartlett, S. Mendelson, and M. Long. Rademacher and gaussian complexities: risk bounds\n\nand structural results. Journal of Machine Learning Research, 3:463\u2013482, 2002.\n\n[4] J. Baxter. Learning internal representations. In Proceedings of the Eighth International Con-\n\nference on Computational Learning Theory, pages 311\u2013320. ACM Press, 1995.\n\n[5] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. 
Hullender. Learning to rank using gradient descent. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 89–96, 2005.

[6] Y. Cao, J. Xu, T. Y. Liu, H. Li, Y. Huang, and H. W. Hon. Adapting ranking SVM to document retrieval. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 186–193. ACM Press, 2006.

[7] Z. Cao, T. Qin, T. Y. Liu, M. F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 129–136, 2007.

[8] C. L. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2009 web track. Technical report, no date.

[9] S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and scoring using empirical risk minimization. In COLT '05: Proceedings of the 18th Annual Conference on Learning Theory, pages 1–15, 2005.

[10] D. Cossock and T. Zhang. Subset ranking using regression. In COLT '06: Proceedings of the 19th Annual Conference on Learning Theory, pages 605–619, 2006.

[11] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.

[12] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115–132, Cambridge, MA, 1999. MIT Press.

[13] T. Joachims. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, 2002.

[14] Y. Y. Lan, T. Y. Liu, Z. M. Ma, and H. Li. Generalization analysis of listwise learning-to-rank algorithms.
In ICML \u201909: Proceedings of the 26th International Conference on Machine\nLearning, pages 577\u2013584, 2009.\n\n[15] Y. Y. Lan, T. Y. Liu, T. Qin, Z. M. Ma, and H. Li. Query-level stability and generalization in\nlearning to rank. In ICML \u201908: Proceedings of the 25th International Conference on Machine\nLearning, pages 512\u2013519, 2008.\n\n[16] T. Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information\n\nRetrieval, 3:225\u2013331, 2009.\n\n[17] J. R. Wen, J. Y. Nie, and H. J. Zhang. Clustering user queries of a search engine. In WWW \u201901:\nProceedings of the 10th international conference on World Wide Web, pages 162\u2013168, New\nYork, NY, USA, 2001. ACM.\n\n[18] F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank - theory\nand algorithm. In ICML \u201908: Proceedings of the 25th International Conference on Machine\nlearning, pages 1192\u20131199. Omnipress, 2008.\n\n[19] E. Yilmaz and S. Robertson. Deep versus shallow judgments in learning to rank. In SIGIR\n\u201909: Proceedings of the 32th annual international ACM SIGIR conference on Research and\ndevelopment in information retrieval, pages 662\u2013663, 2009.\n\n9\n\n\f", "award": [], "sourceid": 124, "authors": [{"given_name": "Wei", "family_name": "Chen", "institution": null}, {"given_name": "Tie-yan", "family_name": "Liu", "institution": null}, {"given_name": "Zhi-ming", "family_name": "Ma", "institution": null}]}