NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:2906
Title:Policy Learning for Fairness in Ranking

Reviewer 1

The paper is fairly interesting in terms of motivation, but the execution, as presented, is perhaps not mature enough.

-- Since this is a new fairness criterion, a more detailed discussion of its merits would be helpful (even if it is non-rigorous). In particular, it is unclear why the disparity is measured in only one direction (over-emphasis of higher-relevance items), with the direction based on relevance rather than group identity. Ideally, fairness criteria would be defined in a way that is cognisant of historical/natural directions of bias, and would therefore check over/under-emphasis based on group identity rather than utility (which is dealt with separately). Given that, the proposed criterion is more of a "diversity" metric than a "fairness" metric.
-- The experimental results are a bit confusing, not least because some of the axes measure D and others measure -D. In particular, Figure 3 (right) seems to suggest that the post-processing method has a configuration where increasing the NDCG score also decreases the disparity. Isn't that a good thing?
-- The kinds of experiments presented are all over the place: the Yahoo dataset and the German Credit dataset show entirely different types of experiments, which makes it difficult to assess.
-- In Figure 2, the dashed and solid lines (train and test results, respectively) seem suspiciously close to each other. But since the authors have provided code (which I did not check), I am willing to give them the benefit of the doubt on this one.
-- The simulated-data experiments, while mildly interesting, are too much of a toy setting to take too seriously.

Otherwise, the paper is fairly clear and reasonably well motivated, and includes most of the relevant references. This can be a pretty good paper with some improvement, but I don't think it is there just yet.
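
To make the "one direction" concern concrete, here is a minimal sketch of a one-sided pairwise disparity. It is an assumption about what such a measure could look like, not the authors' definition: the function name `one_sided_disparity`, the pairwise averaging, and the exposure/merit ratio are all illustrative choices. A pair (i, j) is only penalized when the item with higher (or equal) merit receives more exposure per unit of merit than the other item.

```python
import numpy as np

def one_sided_disparity(exposure, merit):
    """Hypothetical one-sided disparity (illustrative, not the paper's code).

    Averages max(0, exposure[i]/merit[i] - exposure[j]/merit[j]) over all
    ordered pairs (i, j) with merit[i] >= merit[j]. Merit must be positive.
    """
    exposure = np.asarray(exposure, dtype=float)
    merit = np.asarray(merit, dtype=float)
    ratio = exposure / merit  # exposure received per unit of merit
    total, pairs = 0.0, 0
    n = len(merit)
    for i in range(n):
        for j in range(n):
            if i != j and merit[i] >= merit[j]:
                # Only over-exposure of the higher-merit item counts.
                total += max(0.0, ratio[i] - ratio[j])
                pairs += 1
    return total / pairs if pairs else 0.0

# Higher-merit item over-exposed: penalized.
over = one_sided_disparity([1.0, 0.1], [2.0, 1.0])
# Higher-merit item under-exposed: not penalized by this measure.
under = one_sided_disparity([0.2, 1.0], [2.0, 1.0])
```

In this sketch the under-exposed case yields zero disparity, which is the asymmetry the review questions: the direction is fixed by merit (relevance), not by group identity.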

Reviewer 2

The paper provides a new ranking algorithm, named Fair-PG-Rank, which seems to give useful results.

Reviewer 3

This paper presents a framework for expressing specifications for ranking fairness, along with a new learning-to-rank (LTR) algorithm based on a policy-gradient approach, which can support various fairness constraints. Overall, it is extremely well written, and makes important contributions in an area of rapidly growing importance. Instead of starting with an existing LTR algorithm, this paper takes a fresh view of LTR as the problem of learning a stochastic ranking policy, which is learnt via ERM. This is the most important and novel contribution of the paper. Next, the authors propose a class of fairness constraints for ranking that incorporate both individual and group fairness, building on previous related work and adapting it to the learning procedure. Finally, they show a policy-gradient approach for directly optimizing *any* IR utility metric, trading it off against a variety of fairness criteria. They show the effectiveness of their proposed approach on both synthetic and real-world datasets.

Ironically, the motivating example presented to explain the trade-off between utility and exposure, though only illustrative, is slightly biased itself. My suggestion would be to reverse the genders in the example to avoid perpetuating implicit biases about women’s merit.

One concern in the ERM formulation is that NDCG is typically used as NDCG@k. I believe this should be handled by appropriately setting the position bias values, but I am not sure whether it would introduce some discontinuities in the optimization problem. Some clarification on this would be useful.

Another confusing part of the formulation is the handling of the parameter \delta, which is the maximum expected disparity. I would think that it would be more desirable for the designer of the model to be able to specify the value of \delta based on the domain requirement, and then we should minimize w.r.t. \lambda for a chosen \delta.
The idea of letting the model designer steer the utility/fairness trade-off might be useful in certain settings, but not in others.

This paper makes two very interesting improvements over previous work on fairness constraints. The first is the proposal of constraints that enforce the proportionality of exposure to merit. The idea that “higher merit items don’t get exposure beyond their merit, since the opposite direction is already achieved through utility maximization” is a pretty powerful one. Second, the proposed measures of disparate exposure for individual and group fairness can be very useful by themselves as tests of fairness for any given ranking system. One clarification that would be nice concerns the use of “merit” as a function of “relevance”, which is not well motivated. It’s not clear why we need this instead of directly using relevance, which is indeed what is used in the experiments.

The FAIR-PG-RANK algorithm is well designed, but it does have the limitation of allowing only differentiable ML models, which excludes some popular LTR approaches. The log-derivative trick is extremely clever, and does open up a lot of possibilities. Again, some elaboration of what happens in the NDCG@K case would be extremely useful.

Given the overall technical strength of the paper, the empirical evaluation leaves much to be desired. For example, the proposed LTR approach is compared with only a few, relatively old methods. Even though the idea is just to show that this approach is competitive, comparisons to more recent and widely used algorithms like LambdaRank, LambdaMART or their successors would have been better. Also, the argument that it does worse than GBDT because it is a different model class is pretty weak. The synthetic data experiments could also have been repeated on larger document sets for a better understanding of model behavior. It’s not clear why group parity is not studied using synthetic data as well.
Finally, the experiments on the German Credit Dataset demonstrate the generalization and trade-off properties, but a comparison with a modified version of other fairness approaches, such as top-K, or one of the supervised classification ones, would make this paper much stronger.

Overall, the paper is well written, and makes strong contributions to the fairness-in-ranking field. The empirical evaluation is somewhat weak, but there are enough high-impact ideas proposed here for it to be accepted.
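
As a rough illustration of the NDCG@k question raised above, the following sketch combines a truncated position-bias vector with a log-derivative (score-function) gradient estimate. It is a sketch under stated assumptions, not the authors' implementation: it assumes a Plackett-Luce ranking policy, the common gain 2^rel - 1 with a 1/log2(1 + rank) discount, and all function names are hypothetical. It uses unnormalized DCG@k; dividing by the ideal DCG@k, a constant per query, would give NDCG@k.

```python
import numpy as np

def position_bias(n, k=None):
    """DCG discounts 1/log2(1 + rank); positions beyond k get weight 0."""
    v = 1.0 / np.log2(np.arange(n) + 2.0)
    if k is not None:
        v[k:] = 0.0  # the @k cutoff expressed as zero position bias
    return v

def dcg(ranking, rel, v):
    """Utility of a ranking: position bias times gain of the item placed there."""
    return float(np.sum(v * (2.0 ** rel[ranking] - 1.0)))

def sample_ranking(scores, rng):
    """Sample from a Plackett-Luce policy (assumed) via Gumbel perturbation."""
    return np.argsort(-(scores + rng.gumbel(size=scores.shape)))

def grad_log_prob(scores, ranking):
    """Gradient of log P(ranking | scores) under Plackett-Luce."""
    grad = np.zeros_like(scores)
    remaining = list(ranking)
    for item in ranking:
        s = scores[remaining]
        p = np.exp(s - s.max())
        p /= p.sum()              # softmax over items not yet placed
        grad[item] += 1.0
        grad[remaining] -= p
        remaining.pop(0)          # the item just placed is first in the list
    return grad

def policy_gradient(scores, rel, k, n_samples, rng):
    """Monte Carlo estimate of d E[DCG@k] / d scores (log-derivative trick)."""
    v = position_bias(len(scores), k)
    grad = np.zeros_like(scores)
    for _ in range(n_samples):
        r = sample_ranking(scores, rng)
        # The utility enters only as a scalar weight; it is never differentiated.
        grad += dcg(r, rel, v) * grad_log_prob(scores, r)
    return grad / n_samples

rng = np.random.default_rng(0)
rel = np.array([3.0, 0.0, 0.0, 0.0])   # toy relevance labels
g = policy_gradient(np.zeros(4), rel, k=2, n_samples=2000, rng=rng)
```

In this sketch the cutoff only changes the scalar reward that multiplies the score-function gradient, so the estimator does not differentiate through the truncation. Whether the same holds for the paper's exact formulation is for the authors to confirm.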