{"title": "Policy Learning for Fairness in Ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 5426, "page_last": 5436, "abstract": "Conventional Learning-to-Rank (LTR) methods optimize the utility of the rankings to the users, but they are oblivious to their impact on the ranked items. However, there has been a growing understanding that the latter is important to consider for a wide range of ranking applications (e.g. online marketplaces, job placement, admissions). To address this need, we propose a general LTR framework that can optimize a wide range of utility metrics (e.g. NDCG) while satisfying fairness of exposure constraints with respect to the items. This framework expands the class of learnable ranking functions to stochastic ranking policies, which provides a language for rigorously expressing fairness specifications. Furthermore, we provide a new LTR algorithm called Fair-PG-Rank for directly searching the space of fair ranking policies via a policy-gradient approach. Beyond the theoretical evidence in deriving the framework and the algorithm, we provide empirical results on simulated and real-world datasets verifying the effectiveness of the approach in individual and group-fairness settings.", "full_text": "Policy Learning for Fairness in Ranking\n\nAshudeep Singh\n\nDepartment of Computer Science\n\nCornell University\nIthaca, NY 14850\n\nashudeep@cs.cornell.edu\n\nThorsten Joachims\n\nDepartment of Computer Science\n\nCornell University\nIthaca, NY 14850\n\ntj@cs.cornell.edu\n\nAbstract\n\nConventional Learning-to-Rank (LTR) methods optimize the utility of the rankings\nto the users, but they are oblivious to their impact on the ranked items. However,\nthere has been a growing understanding that the latter is important to consider for\na wide range of ranking applications (e.g. online marketplaces, job placement,\nadmissions). 
To address this need, we propose a general LTR framework that can optimize a wide range of utility metrics (e.g. NDCG) while satisfying fairness of exposure constraints with respect to the items. This framework expands the class of learnable ranking functions to stochastic ranking policies, which provides a language for rigorously expressing fairness specifications. Furthermore, we provide a new LTR algorithm called FAIR-PG-RANK for directly searching the space of fair ranking policies via a policy-gradient approach. Beyond the theoretical evidence in deriving the framework and the algorithm, we provide empirical results on simulated and real-world datasets verifying the effectiveness of the approach in individual and group-fairness settings.

1 Introduction

Interfaces based on rankings are ubiquitous in today's multi-sided online economies (e.g., online marketplaces, job search, property renting, media streaming). In these systems, the items to be ranked are products, job candidates, or other entities that transfer economic benefit, and it is widely recognized that the position of an item in the ranking has a crucial influence on its exposure and economic success. Surprisingly, though, the algorithms used to learn these rankings are typically oblivious to the effect they have on the items. Instead, the learning algorithms blindly maximize the utility of the rankings to the users issuing queries to the systems [1], and there is evidence (e.g. [2, 3]) that this does not necessarily lead to rankings that would be considered fair or desirable.
In contrast to fairness in supervised learning for classification (e.g., [4-10]), fairness for rankings has been a relatively under-explored domain despite the growing influence of online information systems on our society and economy. 
In the work that does exist, some consider group fairness in rankings along the lines of demographic parity [11, 12], proposing definitions and methods that minimize the difference in representation between groups in a prefix of the ranking [13-16]. Other recent works have argued that the fairness of ranking systems corresponds to how they allocate exposure to individual items or groups of items based on their merit [3, 17]. These works specify and enforce fairness constraints that explicitly link relevance to exposure, in expectation or amortized over a set of queries. However, these works assume that the relevances of all items are known, and they do not address the learning problem.
In this paper, we develop a Learning-to-Rank (LTR) algorithm, named FAIR-PG-RANK, that not only maximizes utility to the users, but also rigorously enforces merit-based exposure constraints towards the items. We focus on notions of fairness around the key scarce resource that search engines arbitrate, namely the relative allocation of exposure based on the items' merit. Such fairness

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

constraints may be required to conform with anti-trust legislation [18], to alleviate winner-takes-all dynamics in a music streaming service [19], to implement anti-discrimination measures [20], or to implement some variant of search neutrality [21, 22]. By considering fairness already during learning, we find that FAIR-PG-RANK can identify biases in the representation that post-processing methods [3, 17] are, by design, unable to detect. Furthermore, we find that FAIR-PG-RANK performs better than heuristic approaches [23].
From a technical perspective, the main contributions of the paper are three-fold. 
First, we develop a conceptual framework in which it is possible to formulate fair LTR as a policy-learning problem subject to fairness constraints. We show that viewing fair LTR as learning a stochastic ranking policy leads to a rigorous formulation that can be addressed via Empirical Risk Minimization (ERM) on both the utility and the fairness constraint. Second, we propose a class of fairness constraints for ranking that incorporates notions of both individual and group fairness. And, third, we propose a policy-gradient method for implementing the ERM procedure that can directly optimize any information retrieval utility metric and a wide range of fairness criteria. Across a number of empirical evaluations, we find that the policy-gradient approach is a competitive LTR method in its own right, that FAIR-PG-RANK can identify and avoid biased features when trading off utility for fairness, and that it can effectively optimize notions of individual and group fairness on real-world datasets.

2 Learning Fair Ranking Policies

The key goal of our work is to learn ranking policies where the allocation of exposure to items is not an accidental by-product of maximizing utility to the users, but where one can specify a merit-based exposure-allocation constraint that is enforced by the learning algorithm. An illustrative example adapted from Singh and Joachims [3] is that of ranking 10 job candidates, where the estimated probabilities of relevance (e.g., the probability that an employer will invite the candidate for an interview) of 5 male job candidates are {0.89, 0.89, 0.89, 0.89, 0.89} and those of 5 female candidates are {0.88, 0.88, 0.88, 0.88, 0.88}. Notice that these probabilities of relevance can themselves be gender-biased because of biased data or a biased prediction model. 
If these 10 candidates were ranked by these probabilities of relevance, thus maximizing utility to the users under virtually all information retrieval metrics [1], the female candidates would get far less exposure (ranked 6, 7, 8, 9, 10) than the male candidates (ranked 1, 2, 3, 4, 5) even though they have almost the same estimated relevance. In this way, the ranking function itself is responsible for creating a strong endogenous bias against the female candidates, greatly amplifying and thus perpetuating any exogenous bias that may have led to the small differences in the relevance estimates.
To address the endogenous bias created by the system itself, we argue that it should be possible to explicitly specify how exposure is allocated (e.g. make exposure proportional to relevance), that this specified exposure allocation is truthfully learned by the ranking policy (e.g. with no systematic bias towards one of the groups), and that the ranking policy maintains high utility to the users. Generalizing from this illustrative example, we develop our fair LTR framework as guided by the following three goals:

Goal 1: Exposure allocated to an item is based on its merit. More merit means more exposure.
Goal 2: Enable the explicit statement of how exposure is allocated relative to the merit of the items.
Goal 3: Optimize the utility of the rankings to the users while satisfying Goal 1 and Goal 2.

We will illustrate and further refine these goals as we develop our framework in the rest of this section. In particular, we first formulate the LTR problem in the context of empirical risk minimization (ERM), where exposure-allocation constraints are included in the empirical risk. We then define concrete families of allocation constraints for both individual and group fairness.

2.1 Learning to Rank as Policy Learning via ERM

Let Q be the distribution from which queries are drawn. 
Each query q has a candidate set of documents d^q = {d^q_1, d^q_2, ..., d^q_n(q)} that needs to be ranked, and a corresponding set of real-valued relevance judgments rel^q = (rel^q_1, rel^q_2, ..., rel^q_n(q)). Our framework is agnostic to how relevance is defined: it could be the probability that a user with query q finds the document relevant, or it could be some subjective judgment of relevance as assigned by a relevance judge. Finally, each document d^q_i is represented by a feature vector x^q_i = Ψ(q, d^q_i) that describes the match between document d^q_i and query q.

We consider stochastic ranking functions π ∈ Π, where π(r|q) is a distribution over the rankings r (i.e. permutations) of the candidate set. We refer to π as a ranking policy and note that deterministic ranking functions are merely a special case. However, a key advantage of considering the full space of stochastic ranking policies is their ability to distribute expected exposure in a continuous fashion, which provides more fine-grained control and enables gradient-based optimization.
The conventional goal in LTR is to find a ranking policy π* that maximizes the expected utility of π:

π* = argmax_{π∈Π} E_{q∼Q}[U(π|q)],

where the utility of a stochastic policy π for a query q is defined as the expectation of a ranking metric Δ over π:

U(π|q) = E_{r∼π(r|q)}[Δ(r, rel^q)].

Common choices for Δ are DCG, NDCG, Average Rank, or ERR. For concreteness, we focus on NDCG as in [24], which is the normalized version of Δ_DCG(r, rel^q) = Σ_{j=1}^{n_q} u(r(j)|q) / log(1+j), where u(r(j)|q) is the utility of the document placed by ranking r at position j for q as a function of its relevance (e.g., u(i|q) = 2^{rel^q_i} − 1). NDCG normalizes DCG via Δ_NDCG(r, rel^q) = Δ_DCG(r, rel^q) / max_r Δ_DCG(r, rel^q).
Fair Ranking policies. 
Instead of single-mindedly maximizing this utility measure like conventional LTR algorithms, we include a constraint in the learning problem that enforces an application-dependent notion of fair allocation of exposure. To this effect, let D(π|q) ≥ 0 denote a measure of unfairness or disparity, which we will define in detail in Section § 2.2. We can now formulate the objective of fair LTR by constraining the space of admissible ranking policies to those that have expected disparity less than some parameter δ:

π*_δ = argmax_π E_{q∼Q}[U(π|q)]  s.t.  E_{q∼Q}[D(π|q)] ≤ δ

Since we only observe samples from the query distribution Q, we resort to the ERM principle and estimate the expectations with their empirical counterparts. Denoting the training set as T = {(x^q, rel^q)}_{q=1}^N, the empirical analog of the optimization problem becomes

π̂*_δ = argmax_π (1/N) Σ_{q=1}^N U(π|q)  s.t.  (1/N) Σ_{q=1}^N D(π|q) ≤ δ.

Using a Lagrange multiplier, this is equivalent to

π̂*_δ = argmax_π min_{λ≥0} (1/N) Σ_{q=1}^N U(π|q) − λ ((1/N) Σ_{q=1}^N D(π|q) − δ).

In the following, we avoid minimization w.r.t. λ for a chosen δ. Instead, we steer the utility/fairness trade-off by choosing a particular λ and then computing the corresponding δ afterwards. 
This means we merely have to solve

π̂*_λ = argmax_π (1/N) Σ_{q=1}^N U(π|q) − λ (1/N) Σ_{q=1}^N D(π|q)    (1)

and then recover δ_λ = (1/N) Σ_{q=1}^N D(π̂*_λ|q) afterwards. Note that this formulation implements our third goal from the opening paragraph, although we still lack a concrete definition of D.

2.2 Defining a Class of Fairness Measures for Rankings

To make the training objective in Equation (1) fully specified, we still need a concrete definition of the unfairness measure D. To this effect, we adapt the "Fairness of Exposure for Rankings" framework from Singh and Joachims [3], since it allows a wide range of application-dependent notions of group-based fairness, including Statistical Parity, Disparate Exposure, and Disparate Impact. In order to formulate any specific disparity measure D, we first need to define position bias and exposure.
Position Bias. The position bias v_j of position j is defined as the fraction of users accessing a ranking who examine the item at position j. This captures how much attention an item will receive, where higher positions are expected to receive more attention than lower positions. In operational systems, position bias can be directly measured using eye-tracking [25], or indirectly estimated through swap experiments [26] or intervention harvesting [27, 28].
Exposure. For a given query q and ranking distribution π(r|q), the exposure of a document is defined as the expected attention that the document receives. This is equivalent to the expected position bias over all the positions that the document can be placed in. 
Exposure is denoted as v_π(d_i) and can be expressed as

Exposure(d_i|π) = v_π(d_i) = E_{r∼π(r|q)}[v_{r(d_i)}],

where r(d_i) is the position of document d_i under ranking r.
Allocating exposure based on merit. Our first two goals from the opening paragraph postulate that exposure should be based on an application-dependent notion of merit. We define the merit of a document as a function of its relevance to the query (e.g., rel_i, rel_i^2, or √rel_i depending on the application). Let M(rel_i) ≥ 0, or simply M_i, denote the merit of document d_i, and we state that each document in the candidate set should get exposure proportional to its merit:

∀ d_i ∈ d^q : Exposure(d_i|π) ∝ M(rel_i)

For many queries, however, this set of exposure constraints is infeasible. As an example, consider a query where one document in the candidate set has relevance 1, while all other documents have small relevance ε. For sufficiently small ε, any ranking will provide too much exposure to the ε-relevant documents, since we have to put these documents somewhere in the ranking. This violates the exposure constraint, and this shortcoming is also present in the Disparate Exposure measure of Singh and Joachims [3] and the Equity of Attention constraint of Biega et al. [17].
Note that this overabundance of exposure for some queries is not a fairness problem, since the extra exposure that some items receive does not come at the expense of other items. Furthermore, it is typically the items that have slightly lower merit that get disadvantaged by utility maximization, as illustrated in the introductory example. 
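As a minimal illustration (ours, not from the paper), the exposure of each document under a stochastic ranking policy can be computed exactly when the policy is given as an explicit distribution over rankings and v is the position-bias vector:

```python
def exposure(policy, v, n_docs):
    """Expected position bias received by each document.

    policy: dict mapping ranking tuples (document index per position)
            to their probabilities under pi(r|q).
    v:      position-bias vector, v[j] for position j (0-based).
    """
    expo = [0.0] * n_docs
    for ranking, prob in policy.items():
        for pos, doc in enumerate(ranking):
            expo[doc] += prob * v[pos]
    return expo

# A uniform policy over the two rankings of two documents spreads the
# exposure evenly, which no deterministic ranking can achieve:
print(exposure({(0, 1): 0.5, (1, 0): 0.5}, v=[1.0, 0.5], n_docs=2))  # [0.75, 0.75]
```

This also illustrates why stochastic policies are needed: they distribute expected exposure continuously between documents.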
We thus replace the proportionality constraint with the following set of inequality constraints, where ∀ d_i, d_j ∈ d^q with M(rel_i) ≥ M(rel_j) > 0:

Exposure(d_i|π) / M(rel_i) ≤ Exposure(d_j|π) / M(rel_j)

This one-sided set of constraints still enforces proportionality of exposure to merit, but allows the allocation of overabundant exposure, which is achieved by only enforcing that higher-merit items don't get exposure beyond their merit. Note that the opposite direction of the constraint is already encouraged by utility maximization, where high-merit items tend to receive more exposure than they deserve.
Connecting this reasoning back to the example, after putting the item with relevance 1 at rank one, we have to put ε-relevant items in position two and further. These ε-relevant items are now overexposed, which violates the two-sided constraint, but not the one-sided constraint. In this way, the one-sided metric together with utility maximization allows non-relevant items to get higher exposure when this is unavoidable in the tail of the ranking. In the other direction, the metric counteracts unmerited rich-get-richer dynamics, as present in the motivating example earlier.
Measuring disparate exposure. We can now define the following disparity measure D that captures to what extent the fairness-of-exposure constraints are violated:

D_ind(π|q) = (1/|H_q|) Σ_{(i,j)∈H_q} max(0, v_π(d_i)/M_i − v_π(d_j)/M_j)    (3)

where H_q = {(i, j) s.t. M_i ≥ M_j > 0}. The measure D_ind(π|q) is always non-negative and equals zero only when the individual constraints are exactly satisfied.
Group fairness disparity. The disparity measure from above implements an individual notion of fairness, while other applications ask for a group-based notion. Here, fairness is aggregated over the members of each group. 
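The individual disparity measure of Eq. (3) can be sketched as follows (our own illustration, not the paper's implementation), given precomputed exposures and merits; here the pair set excludes i = j, which only changes the normalization:

```python
def d_ind(expo, merit):
    """Individual disparity D_ind: average one-sided violation of
    exposure-to-merit proportionality over pairs with M_i >= M_j > 0."""
    n = len(merit)
    pairs = [(i, j) for i in range(n) for j in range(n)
             if i != j and merit[i] >= merit[j] > 0]
    if not pairs:
        return 0.0
    return sum(max(0.0, expo[i] / merit[i] - expo[j] / merit[j])
               for i, j in pairs) / len(pairs)

# Exposure exactly proportional to merit -> zero disparity;
# under-exposing the lower-merit item -> positive disparity.
print(d_ind([1.0, 0.5], [2.0, 1.0]))   # 0.0
print(d_ind([1.0, 0.25], [2.0, 1.0]))  # 0.25
```

The same ratio comparison applied to average group exposures and merits yields the group-based variant.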
A group of documents can refer to sets of items sold by one seller in an online marketplace, to content published by one publisher, or to job candidates belonging to a protected group. Similar to the case of individual fairness, we want to allocate exposure to groups proportional to their merit. Hence, in the case of only two groups G_0 and G_1, we can define the following group fairness disparity for query q as

D_group(π|q) = max(0, v_π(G_i)/M_{G_i} − v_π(G_j)/M_{G_j})    (4)

where G_i and G_j are such that M_{G_i} ≥ M_{G_j}, Exposure(G|π) = v_π(G) = (1/|G|) Σ_{d∈G} v_π(d) is the average exposure of group G, and the merit of group G is denoted by M_G = (1/|G|) Σ_{d∈G} M(rel_d).

3 FAIR-PG-RANK: A Policy Learning Algorithm for Fair LTR

In the previous section, we defined a general framework for learning ranking policies under fairness-of-exposure constraints. What remains to be shown is that there exists a stochastic policy class Π and an associated training algorithm that can solve the objective in Equation (1) under the disparities D defined above. To this effect, we now present the FAIR-PG-RANK algorithm. In particular, we first define a class of Plackett-Luce ranking policies that incorporate a machine learning model, and then present a policy-gradient approach to efficiently optimize the training objective.

3.1 Plackett-Luce Ranking Policies

The ranking policies π we define in the following comprise two components: a scoring model that defines a distribution over rankings, and its associated sampling method. Starting with the scoring model h_θ, we allow any differentiable machine learning model with parameters θ, for example a linear model or a neural network. 
Given an input x^q representing the feature vectors of all query-document pairs in the candidate set, the scoring model outputs a vector of scores h_θ(x^q) = (h_θ(x^q_1), h_θ(x^q_2), ..., h_θ(x^q_{n_q})). Based on this score vector, the probability π_θ(r|q) of a ranking r = ⟨r(1), r(2), ..., r(n_q)⟩ under the Plackett-Luce model [29] is the following product of softmax distributions:

π_θ(r|q) = Π_{i=1}^{n_q} [ exp(h_θ(x^q_{r(i)})) / (exp(h_θ(x^q_{r(i)})) + ... + exp(h_θ(x^q_{r(n_q)}))) ]    (5)

Note that this probability of a ranking can be computed efficiently, and that the derivative of π_θ(r|q) and log π_θ(r|q) exists whenever the scoring model h_θ is differentiable. Sampling a ranking under the Plackett-Luce model is efficient as well. To sample a ranking, starting from the top, documents are drawn recursively from the probability distribution resulting from Softmax over the scores of the remaining documents in the candidate set, until the set is empty.

3.2 Policy-Gradient Training Algorithm

The next step is to search this policy space Π for a model that maximizes the objective in Equation (1). This section proposes a policy-gradient approach [30, 31], where we use stochastic gradient descent (SGD) updates to iteratively improve our ranking policy. However, since both U and D are expectations over rankings sampled from π, computing the gradient brute-force is intractable. In this section, we derive the required gradients over expectations as an expectation over gradients. 
We then estimate this expectation as an average over a finite sample of rankings from the policy to get an approximate gradient.
Conventional LTR methods that maximize user utility are either designed to optimize over a smoothed version of a specific utility metric, such as SVMRank [32], RankNet [33], etc., or use heuristics to optimize over probabilistic formulations of rankings (e.g. SoftRank [34]). Our LTR setup is similar to ListNet [35]; however, instead of using a heuristic loss function for utility, we present a policy-gradient method to directly optimize over both utility and disparity measures. Directly optimizing the ranking policy via policy-gradient learning has two advantages over most conventional LTR algorithms, which optimize upper bounds or heuristic proxy measures. First, our learning algorithm directly optimizes a specified user utility metric and has no restrictions in the choice of the information retrieval (IR) metric. Second, we can use the same policy-gradient approach on our disparity measure D as well, since it is also an expectation over rankings. Overall, the use of policy-gradient optimization in the space of stochastic ranking policies elegantly handles the non-smoothness inherent in rankings.

3.2.1 PG-RANK: Maximizing User Utility

The user utility of a policy π_θ for a query q is defined as U(π|q) = E_{r∼π_θ(r|q)}[Δ(r, rel^q)]. Note that taking the gradient w.r.t. θ over this expectation is not straightforward, since the space of rankings is exponential in cardinality. To overcome this, we use sampling via the log-derivative trick pioneered in the REINFORCE algorithm [30] as follows:

∇_θ U(π_θ|q) = ∇_θ E_{r∼π_θ(r|q)}[Δ(r, rel^q)] = E_{r∼π_θ(r|q)}[∇_θ log π_θ(r|q) · Δ(r, rel^q)]    (6)

where log π_θ(r|q) is computed from Eq. (5). This transformation exploits that the gradient of the expected value of the metric Δ over rankings sampled from π can be expressed as the expectation of the gradient of the log probability of each sampled ranking multiplied by the metric value of that ranking. The final expectation is approximated via Monte-Carlo sampling from the Plackett-Luce model in Eq. (5).
Note that this policy-gradient approach to LTR, which we call PG-RANK, is novel in itself and beyond fairness. It can be used as a standalone LTR algorithm for virtually any choice of utility metric Δ, including NDCG, DCG, ERR, and Average Rank. Furthermore, PG-RANK also supports non-linear metrics, IPS-weighted metrics for partial-information feedback [26], and listwise metrics that do not decompose as a sum over individual documents [36].
Using a baseline for variance reduction. Since making stochastic gradient descent updates with this gradient estimate is prone to high variance, we subtract a baseline term from the reward [30] to act as a control variate for variance reduction. Specifically, in the gradient estimate in Eq. (6), we replace Δ(r, rel^q) with Δ(r, rel^q) − b(q), where b(q) is the average Δ for the current query.
Entropy regularization. While optimizing over stochastic policies, entropy regularization is used as a method for encouraging exploration so as to avoid convergence to suboptimal deterministic policies [37, 38]. For our algorithm, we add the entropy of the probability distribution Softmax(h_θ(x^q)), multiplied by a regularization coefficient γ, to the objective.

3.2.2 Minimizing disparity

When a fairness-of-exposure term D is included in the training objective, we also need to compute the gradient of this term. Fortunately, it has a structure similar to the utility term, so that the same Monte-Carlo approach applies. 
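To make the Monte-Carlo estimator of Eq. (6) concrete, here is a self-contained sketch (our own illustration, with hypothetical helper names such as `grad_log_pl`) of the score-function gradient with respect to the document scores of a Plackett-Luce policy, including the per-query baseline:

```python
import math
import random

def grad_log_pl(scores, ranking):
    """d/ds log pi(ranking): at each step, one-hot for the chosen document
    minus the softmax probabilities over the documents still remaining."""
    g = [0.0] * len(scores)
    remaining = list(range(len(scores)))
    for doc in ranking:
        weights = [math.exp(scores[d]) for d in remaining]
        z = sum(weights)
        for d, w in zip(remaining, weights):
            g[d] -= w / z
        g[doc] += 1.0
        remaining.remove(doc)
    return g

def reinforce_grad(scores, utility, n_samples, rng):
    """Monte-Carlo estimate of Eq. (6) with the mean utility as baseline."""
    rankings = []
    for _ in range(n_samples):
        remaining, r = list(range(len(scores))), []
        while remaining:  # Plackett-Luce sampling
            w = [math.exp(scores[d]) for d in remaining]
            pick = rng.choices(range(len(remaining)), weights=w)[0]
            r.append(remaining.pop(pick))
        rankings.append(r)
    b = sum(utility(r) for r in rankings) / n_samples  # baseline b(q)
    grad = [0.0] * len(scores)
    for r in rankings:
        adv = utility(r) - b
        g = grad_log_pl(scores, r)
        grad = [gi + adv * gj / n_samples for gi, gj in zip(grad, g)]
    return grad
```

With the baseline, a constant utility yields an exactly zero gradient estimate, which is the variance-reduction effect the text describes.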
Specifically, for the individual-fairness disparity measure in Equation (3), the gradient can be computed as:

∇_θ D_ind = (1/|H|) Σ_{(i,j)∈H} 1[v_π(d_i)/M_i − v_π(d_j)/M_j > 0] · E_{r∼π_θ(r|q)}[∇_θ log π_θ(r|q) (v_{r(d_i)}/M_i − v_{r(d_j)}/M_j)]

with H = {(i, j) s.t. M_i ≥ M_j > 0}. For the group-fairness disparity measure defined in Equation (4), the gradient can be derived as follows:

∇_θ D_group(π|G_0, G_1, q) = ∇_θ max(0, ξ_q diff(π|q)) = 1[ξ_q diff(π|q) > 0] · ξ_q ∇_θ diff(π|q)

∇_θ diff(π|q) = E_{r∼π_θ}[∇_θ log π_θ(r|q) ((Σ_{d∈G_0} v_{r(d)}) / (Σ_{d∈G_0} M(rel_d)) − (Σ_{d∈G_1} v_{r(d)}) / (Σ_{d∈G_1} M(rel_d)))]

where diff(π|q) = v_π(G_0)/M_{G_0} − v_π(G_1)/M_{G_1} and ξ_q = sign(M_{G_0} − M_{G_1}).
The derivation of the gradients is shown in the supplementary material. The expectation of the gradient in both cases can be estimated as an average over a Monte Carlo sample of rankings from the distribution. 
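Drawing such a Monte Carlo sample under the Plackett-Luce model of Eq. (5) is straightforward; a sketch (our own illustration) that also returns the log-probability needed for the gradient estimates:

```python
import math
import random

def sample_pl(scores, rng):
    """Sample a ranking from a Plackett-Luce policy: repeatedly draw a
    document via Softmax over the scores of the remaining documents.
    Returns the ranking and its log-probability log pi(r|q)."""
    remaining = list(range(len(scores)))
    ranking, logp = [], 0.0
    while remaining:
        weights = [math.exp(scores[d]) for d in remaining]
        total = sum(weights)
        idx = rng.choices(range(len(remaining)), weights=weights)[0]
        logp += math.log(weights[idx] / total)
        ranking.append(remaining.pop(idx))
    return ranking, logp

rng = random.Random(0)
ranking, logp = sample_pl([2.0, 0.0, 0.0], rng)  # every permutation has positive probability
```

For two documents with equal scores, either ranking has probability 1/2, so logp = log(0.5) regardless of the draw.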
The size of the sample is denoted by S in the rest of the paper.
This completes all necessary ingredients for SGD training of objective (1), and we now present all steps of the FAIR-PG-RANK algorithm.

3.3 Summary of the FAIR-PG-RANK algorithm

Algorithm 1 summarizes our method for learning fair ranking policies given a training dataset.

4 Empirical Evaluation

We conduct experiments on simulated and real-world datasets to empirically evaluate our approach. First, in Section § 4.1, we validate that the policy-gradient algorithm is competitive with conventional LTR approaches independent of fairness considerations. We accomplish this by comparing our method PG-RANK to conventional LTR baselines on the Yahoo! Learning-to-Rank dataset. Second, in Section § 4.2, we use simulated data to verify that FAIR-PG-RANK can detect and mitigate unfair features. Third, we show the effectiveness of our algorithm on real-world datasets by presenting experiments on the Yahoo! Learning to Rank dataset for individual fairness and the German Credit Dataset [39] for group fairness (Section § 4.3).

Algorithm 1 FAIR-PG-RANK
Input: T = {(x^q, rel^q)}_{q=1}^N, disparity measure D, utility/fairness trade-off λ
Parameters: model h_θ, learning rate η, entropy regularization γ
Initialize h_θ with parameters θ_0
repeat
  q = (x^q, rel^q) ∼ T  {Draw a query from the training set}
  h_θ(x^q) = (h_θ(x^q_1), h_θ(x^q_2), ..., h_θ(x^q_{n_q}))  {Obtain scores for each document}
  for i = 1 to S do
    r_i ∼ π_θ(r|q)  {Plackett-Luce sampling}
  end for
  ∇ ← ∇̂_θ U − λ ∇̂_θ D  {Compute gradient as an average over all r_i using § 3.2.1 and § 3.2.2}
  θ ← θ + η∇  {Update}
until convergence on the validation set

Table 1: Comparing PG-RANK to the baseline LTR methods from [24] on the Yahoo dataset.

Method                   | NDCG@10 | ERR
RankSVM [40]             | 0.75924 | 0.43680
GBDT [41]                | 0.79013 | 0.46201
PG-RANK (Linear model)   | 0.76145 | 0.44988
PG-RANK (Neural Network) | 0.77082 | 0.45440

For all the experiments, we use NDCG as the utility metric, define merit using the identity function M(rel) = rel, and set the position bias v to follow the same distribution as the gain factor in DCG, i.e. v_j ∝ 1/log2(1+j), where j = 1, 2, 3, ... is a position in the ranking.

4.1 Can PG-RANK learn accurate ranking policies?

To validate that PG-RANK is indeed a highly effective LTR method, we conduct experiments on the Yahoo dataset [24]. We use the standard experiment setup on the SET 1 dataset and optimize NDCG using PG-RANK, which is equivalent to finding the optimal policy in Eq. (1) with λ = 0.
We train FAIR-PG-RANK for two kinds of scoring models: a linear model and a neural network (one hidden layer with 32 hidden units and ReLU activation). Details of the models and training hyperparameters are given in the supplementary material. The policy learned by our method is a stochastic policy; however, for the purpose of evaluation in this task, we use the highest-probability ranking of the candidate set for each query to compute the average NDCG@10 and ERR (Expected Reciprocal Rank) over all the test set queries. We compare our evaluation scores with two baselines
We compare our evaluation scores with two baselines\nfrom Chapelle and Chang [24] \u2013 a linear RankSVM [40] and a non-linear regression-based ranker\nthat uses Gradient-boosted Decision Trees (GBDT) [41].\nTable 1 shows that PG-RANK achieves competitive performance compared to the conventional LTR\nmethods. When comparing PG-RANK to RankSVM for linear models, our method outperforms\nRankSVM in terms of both NDCG@10 and ERR. This veri\ufb01es that the policy-gradient approach\nis effective at optimizing utility without having to rely on a possibly lose convex upper bound like\nRankSVM. PG-RANK with the non-linear neural network model further improves on the linear model.\nFurthermore, additional parameter tuning and variance-control techniques from policy optimization\nare likely to further boost the performance of PG-RANK, but are outside the scope of this paper.\n\n4.2 Can FAIR-PG-RANK effectively trade-off between utility and fairness?\n\nWe designed a synthetic dataset to allow inspection into how FAIR-PG-RANK trades-off between\nuser utility and fairness of exposure. The dataset contains 100 queries with 10 candidate documents\neach. In expectation, 8 of those documents belong to the majority group G0 and 2 belong to the\nminority group G1. For each document we independently and uniformly draw two values x1 and\nx2 from the interval (0, 3), and set the relevance of the document to x1 + x2 clipped between 0\nand 5. For the documents from the majority group G0, the features vector (x1, x2) representing the\ndocuments provides perfect information about relevance. For documents in the minority group G1,\n\n7\n\n\fFigure 1: Experiments on Simulated dataset. The shaded regions show different ranges of the values\nof (a) NDCG, (b) Group Disparity (Dgroup), with varying model parameters \u03b8 = (\u03b81, \u03b82). The (+)\npoints show the models learned by FAIR-PG-RANK under different values of \u03bb. 
(c) Comparison of NDCG and Group Disparity (Dgroup) trade-off for different methods.

Figure 2: Effect of varying λ on NDCG@10 (user utility) and Dind (individual fairness disparity) on Yahoo data. Left: linear model; right: neural network. The overlapping dotted curves represent the training set NDCG@10 and disparity, while solid curves show test set performance.

however, feature x2 is corrupted by replacing it with zero, so that the information about relevance for documents in G1 comes only from x1. This leads to a biased representation between groups, and any use of x2 is prone to produce unfair exposure.
In order to validate that FAIR-PG-RANK can detect and neutralize this biased feature, we consider a linear scoring model hθ(x) = θ1x1 + θ2x2 with parameters θ = (θ1, θ2). Figure 1 shows the contour plots of NDCG and Dgroup evaluated for different values of θ. Note that not only the direction of the θ vector but also its length affects both NDCG and Dgroup, since the length determines the amount of stochasticity in πθ. The true relevance model lies on the θ1 = θ2 line (dotted); however, a fair model is expected to ignore the biased feature x2. We use PG-RANK to train this linear model to maximize NDCG and minimize Dgroup. The dots in Figure 1 denote the models learned by FAIR-PG-RANK for different values of λ. For small values of λ, FAIR-PG-RANK puts more emphasis on NDCG and thus learns parameter vectors along the θ1 = θ2 direction. As we increase emphasis on group fairness disparity Dgroup by increasing λ, the policies learned by FAIR-PG-RANK become more stochastic, and it correctly starts to discount the biased attribute by learning models where increasingly θ1 ≫ θ2.
In Figure 1(c), we compare FAIR-PG-RANK with two baselines. 
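The biased-feature construction described above can be reproduced in a few lines. This is a sketch under the stated distributions; the function and variable names are my own, not taken from the paper's code:

```python
import numpy as np

def make_synthetic_query(n_docs=10, p_minority=0.2, rng=None):
    """One synthetic query: x1, x2 ~ U(0, 3), relevance = clip(x1 + x2, 0, 5).

    Each document falls into the minority group G1 with probability 0.2
    (so 2 of 10 in expectation); for those, x2 is zeroed out, leaving x1
    as the only informative feature -- the biased representation."""
    rng = rng or np.random.default_rng()
    group = (rng.random(n_docs) < p_minority).astype(int)  # 1 marks G1
    x = rng.uniform(0.0, 3.0, size=(n_docs, 2))
    rel = np.clip(x[:, 0] + x[:, 1], 0.0, 5.0)             # true relevance
    x[group == 1, 1] = 0.0                                 # corrupt x2 for G1
    return x, rel, group
```

A linear ranker hθ(x) = θ1x1 + θ2x2 trained on such data can avoid systematically under-ranking G1 only by discounting x2, which is exactly the behavior Figure 1 probes.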
As the first baseline, we estimate relevances with a fairness-oblivious linear regression and then use the post-processing method from [3] on the estimates. Unlike FAIR-PG-RANK, which reduces disparity with increasing λ, the post-processing method is misled by the estimated relevances that use the biased feature x2, and the ranking policies become even less fair as λ is increased. As the second baseline, we apply the method of Zehlike and Castillo [23], but the heuristic measure it optimizes shows little effect on disparity.

4.3 Can FAIR-PG-RANK learn fair ranking policies on real-world data?

In order to study FAIR-PG-RANK on real-world data, we conducted two sets of experiments.
For Individual Fairness, we train FAIR-PG-RANK with a linear and a neural network model on the Yahoo! Learning to Rank Challenge dataset, optimizing Eq. (1) with different values of λ. Details of the models and training hyperparameters are given in the supplementary material. For both models, Figure 2 shows the average NDCG@10 and Dind (individual disparity) over the test and training (dotted line) datasets for different values of the λ parameter. 
As desired, FAIR-PG-RANK emphasizes lower disparity over higher NDCG as the value of λ increases, with disparity going down to zero eventually. Furthermore, the training and test curves for both NDCG and disparity overlap, indicating that the learning method generalizes to unseen queries. This is expected, since both training quantities concentrate around their expectation as the training set size increases.

Figure 3: Left: Effect of varying λ on the test set NDCG and Dgroup for the German Credit Dataset. The shaded area shows the standard deviation over five runs of the algorithm on the data. Right: Comparison of NDCG and Group Disparity (Dgroup) trade-off for different methods.

For Group Fairness, we adapt the German Credit Dataset from the UCI repository [39] to a learning-to-rank task (described in the supplementary), choosing gender as the group attribute. We train FAIR-PG-RANK using a linear model, for different values of λ. Figure 3 shows that FAIR-PG-RANK is again able to effectively trade off NDCG and fairness. 
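For concreteness, one way the group-disparity quantity can be instantiated, following the fairness-of-exposure framework of [3], is as the difference in exposure per unit of merit between the two groups, with position bias v_j = 1/log2(1+j) and merit M(rel) = rel. The precise Dgroup optimized by FAIR-PG-RANK is defined in the paper's earlier sections, so treat this as an illustrative sketch rather than the exact objective:

```python
import numpy as np

def group_disparity(ranking, relevances, group):
    """Exposure-per-merit difference between groups G0 and G1 for one ranking.

    ranking[j] is the document placed at position j+1; that position carries
    exposure v = 1 / log2(1 + (j+1)).  Returns
    Exposure(G0)/Merit(G0) - Exposure(G1)/Merit(G1), where a group's merit
    is the sum of its documents' relevances (M(rel) = rel)."""
    relevances = np.asarray(relevances, dtype=float)
    group = np.asarray(group)
    exposure = np.empty(len(ranking))
    exposure[np.asarray(ranking)] = 1.0 / np.log2(np.arange(2, len(ranking) + 2))
    def per_merit(g):
        return exposure[group == g].sum() / relevances[group == g].sum()
    return per_merit(0) - per_merit(1)
```

With two equally relevant documents from different groups, swapping their positions flips the sign of the disparity, so averaging over the rankings of a stochastic policy can drive it to zero.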
Here we also plot the standard deviation to illustrate that the algorithm reliably converges to solutions of similar performance over multiple runs. Similar to the synthetic example, Figure 3 (right) again shows that FAIR-PG-RANK can effectively trade off NDCG for Dgroup, while the baselines fail.

5 Conclusion

We presented a framework for learning ranking functions that not only maximize utility to their users, but that also obey application-specific fairness constraints on how exposure is allocated to the ranked items based on their merit. Based on this framework, we derived the FAIR-PG-RANK policy-gradient algorithm that directly optimizes both utility and fairness without having to resort to upper bounds or heuristic surrogate measures. We demonstrated that our policy-gradient approach is effective for training high-quality ranking functions, that FAIR-PG-RANK can identify and neutralize biased features, and that it can effectively learn ranking functions under both individual fairness and group fairness constraints.

Acknowledgements

This work was supported in part by a gift from Workday Inc., as well as NSF awards IIS-1615706, IIS-1513692, and IIS-1901168. We thank Jessica Hong for the interesting discussions that informed the direction of this paper. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

[1] Stephen E Robertson. The probability ranking principle in IR. Journal of Documentation, 33(4):294–304, 1977.

[2] Matthew Kay, Cynthia Matuszek, and Sean Munson. Unequal representation and gender stereotypes in image search results for occupations. In CHI. ACM, April 2015.

[3] Ashudeep Singh and Thorsten Joachims. Fairness of exposure in rankings. In KDD, pages 2219–2228. ACM, 2018.

[4] Solon Barocas and Andrew D Selbst. 
Big data's disparate impact. Cal. L. Rev., 104:671, 2016.

[5] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In ITCS, pages 214–226, 2012.

[6] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In NIPS, pages 3315–3323, 2016.

[7] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In ICML, pages 325–333, 2013.

[8] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In WWW, pages 1171–1180, 2017.

[9] Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through causal reasoning. In NIPS, pages 656–666, 2017.

[10] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In NIPS, pages 4069–4079, 2017.

[11] Indre Zliobaite. On the relation between accuracy and fairness in binary classification. FATML Workshop at ICML, 2015.

[12] Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building classifiers with independency constraints. In Data Mining Workshops (ICDMW), pages 13–18, 2009.

[13] Ke Yang and Julia Stoyanovich. Measuring fairness in ranked outputs. SSDBM, 2017.

[14] L Elisa Celis, Damian Straszak, and Nisheeth K Vishnoi. 
Ranking with fairness constraints. arXiv preprint arXiv:1704.06840, 2017.

[15] Abolfazl Asudeh, H. V. Jagadish, Julia Stoyanovich, and Gautam Das. Designing fair ranking schemes. arXiv preprint arXiv:1712.09752, 2017.

[16] Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo Baeza-Yates. FA*IR: A fair top-k ranking algorithm. In CIKM, 2017.

[17] Asia J. Biega, Krishna P. Gummadi, and Gerhard Weikum. Equity of attention: Amortizing individual fairness in rankings. In SIGIR, pages 405–414. ACM, 2018.

[18] Mark Scott. Google Fined Record $2.7 Billion in E.U. Antitrust Ruling. New York Times, 2017. URL https://www.nytimes.com/2017/06/27/technology/eu-google-fine.html.

[19] Rishabh Mehrotra, James McInerney, Hugues Bouchard, Mounia Lalmas, and Fernando Diaz. Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness & satisfaction in recommendation systems. In CIKM, pages 2243–2251. ACM, 2018.

[20] Benjamin Edelman, Michael Luca, and Dan Svirsky. Racial discrimination in the sharing economy: Evidence from a field experiment. American Economic Journal: Applied Economics, 9(2):1–22, 2017.

[21] Lucas D Introna and Helen Nissenbaum. Shaping the web: Why the politics of search engines matters. The Information Society, 16(3):169–185, 2000.

[22] James Grimmelmann. Some skepticism about search neutrality. The Next Digital Decade: Essays on the Future of the Internet, page 435, 2011. URL https://ssrn.com/abstract=1742444.

[23] Meike Zehlike and Carlos Castillo. Reducing disparate exposure in ranking: A learning to rank approach. arXiv preprint arXiv:1805.08716, 2018.

[24] Olivier Chapelle and Yi Chang. Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge, pages 1–24, 2011.

[25] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. 
Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Transactions on Information Systems (TOIS), 25(2):7, 2007.

[26] T. Joachims, A. Swaminathan, and T. Schnabel. Unbiased learning-to-rank with biased feedback. In WSDM, pages 781–789. ACM, 2017.

[27] Aman Agarwal, Ivan Zaitsev, Xuanhui Wang, Cheng Li, Marc Najork, and Thorsten Joachims. Estimating position bias without intrusive interventions. In International Conference on Web Search and Data Mining (WSDM), 2019.

[28] Zhichong Fang, A. Agarwal, and T. Joachims. Intervention harvesting for context-dependent examination-bias estimation. In ACM Conference on Research and Development in Information Retrieval (SIGIR), 2019.

[29] Robin L Plackett. The analysis of permutations. Applied Statistics, pages 193–202, 1975.

[30] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[31] Richard S Sutton. Introduction to reinforcement learning, volume 135. 1998.

[32] Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.

[33] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, pages 89–96. ACM, 2005.

[34] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. SoftRank: Optimizing non-smooth rank metrics. In WSDM, pages 77–86. ACM, 2008.

[35] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, pages 129–136. ACM, 2007.

[36] Cheng Xiang Zhai, William W. Cohen, and John Lafferty. Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval. In SIGIR, pages 10–17. 
ACM, 2003.

[37] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016.

[38] Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.

[39] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

[40] Thorsten Joachims. Training linear SVMs in linear time. In KDD, pages 217–226. ACM, 2006.

[41] Jerry Ye, Jyh-Herng Chow, Jiang Chen, and Zhaohui Zheng. Stochastic gradient boosted distributed decision trees. In CIKM, pages 2061–2064. ACM, 2009.