{"title": "Does mitigating ML's impact disparity require treatment disparity?", "book": "Advances in Neural Information Processing Systems", "page_first": 8125, "page_last": 8135, "abstract": "Following precedent in employment discrimination law, two notions of disparity are widely-discussed in papers on fairness and ML. Algorithms exhibit treatment disparity if they formally treat members of protected subgroups differently;\nalgorithms exhibit impact disparity when outcomes differ across subgroups (even unintentionally). Naturally, we can achieve impact parity through purposeful treatment disparity. One line of papers aims to reconcile the two parities proposing disparate learning processes (DLPs). Here, the sensitive feature is used during training but a group-blind classifier is produced. In this paper, we show that: (i) when sensitive and (nominally) nonsensitive features are correlated, DLPs will indirectly implement treatment disparity, undermining the policy desiderata they are designed to address; (ii) when group membership is partly revealed by other features, DLPs induce within-class discrimination; and (iii) in general, DLPs provide suboptimal trade-offs between accuracy and impact parity. Experimental results on several real-world datasets highlight the practical consequences of applying DLPs.", "full_text": "Does mitigating ML\u2019s impact disparity\n\nrequire treatment disparity?\n\nZachary C. Lipton1, Alexandra Chouldechova1, Julian McAuley2\n\nzlipton@cmu.edu, achould@cmu.edu, jmcauley@cs.ucsd.edu\n\n1Carnegie Mellon University\n\n2University of California, San Diego\n\nAbstract\n\nFollowing precedent in employment discrimination law, two notions of disparity\nare widely-discussed in papers on fairness and ML. Algorithms exhibit treatment\ndisparity if they formally treat members of protected subgroups differently; al-\ngorithms exhibit impact disparity when outcomes differ across subgroups (even\nunintentionally). 
Naturally, we can achieve impact parity through purposeful treat-\nment disparity. One line of papers aims to reconcile the two parities proposing\ndisparate learning processes (DLPs). Here, the sensitive feature is used during\ntraining but a group-blind classi\ufb01er is produced. In this paper, we show that: (i)\nwhen sensitive and (nominally) nonsensitive features are correlated, DLPs will\nindirectly implement treatment disparity, undermining the policy desiderata they\nare designed to address; (ii) when group membership is partly revealed by other fea-\ntures, DLPs induce within-class discrimination; and (iii) in general, DLPs provide\nsuboptimal trade-offs between accuracy and impact parity. Experimental results on\nseveral real-world datasets highlight the practical consequences of applying DLPs.\n\n1\n\nIntroduction\n\nEffective decision-making requires choosing among options given the available information. That\nmuch is unavoidable, unless we wish to make trivial decisions. In selection processes, such as hiring,\nuniversity admissions, and loan approval, the options are people; the available features include (but\nare rarely limited to) direct evidence of quali\ufb01cations; and decisions impact lives.\nLaws in many countries restrict the ways in which certain decisions can be made. For example, Title\nVII of the US Civil Rights Act [1], forbids employment decisions that discriminate on the basis of\ncertain protected characteristics. Interpretation of this law has led to two notions of discrimination:\ndisparate treatment and disparate impact. Disparate treatment addresses intentional discrimination,\nincluding (i) decisions explicitly based on protected characteristics; and (ii) intentional discrimination\nvia proxy variables (e.g literacy tests for voting eligibility). Disparate impact addresses facially\nneutral practices that might nevertheless have an \u201cunjusti\ufb01ed adverse impact on members of a\nprotected class\u201d [1]. 
One might hope that detecting unjustified impact were as simple as detecting unequal outcomes. However, absent intentional discrimination, unequal outcomes can emerge due to correlations between protected and unprotected characteristics. Complicating matters, unequal outcomes may not always signal unlawful discrimination [2].
Recently, owing to the increased use of machine learning (ML) to assist in consequential decisions, the topic of quantifying and mitigating ML-based discrimination has attracted interest in both policy and ML. However, while the existing legal doctrine offers qualitative ideas, intervention in an ML-based system requires more concrete formalism. Inspired by the relevant legal concepts, technical papers have proposed several criteria to quantify discrimination. One criterion requires that the fraction given a positive decision be equal across different groups. Another criterion states that a classifier should be blind to the protected characteristic. Within the technical literature, these criteria are commonly referred to as disparate impact and disparate treatment, respectively.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Demonstration of a DLP's undesirable side effects on a simple example of hiring data (see §4.1). An unconstrained classifier (vertical line) hires candidates based on work experience, yielding higher hiring rates for men than for women. A DLP (dashed diagonal) achieves near-parity by differentiating based on an irrelevant attribute (hair length). The DLP hurts some short-haired women, flipping their decisions to reject, and helps some long-haired men.

In this paper, we call these technical criteria impact parity and treatment parity to distinguish them from their legal antecedents. The distinction between technical and legal terminology is important to maintain.
While impact and treatment parity are inspired by legal concepts, technical approaches that\nachieve these criteria may fail to satisfy the underlying legal and ethical desiderata.\nWe demonstrate one such disconnect through DLPs, a class of algorithms designed to simultaneously\nsatisfy treatment- and impact-parity criteria [3\u20135]. DLPs operate according to the following principle:\nThe protected characteristic may be used during training, but is not available to the model at\nprediction time. In the earliest such approach the protected characteristic is used to winnow the set of\nacceptable rules from an expert system [3]. Others incorporate the protected characteristic as either a\nregularizer, a constraint, or to preprocess the training data [5\u20137].\nThese approaches are grounded in the premise that DLPs are acceptable in cases where using a\nprotected characteristic as a direct input to the model would constitute disparate treatment and thus\nbe impermissible. Indeed, DLPs in some sense operationalize a form of prospective fair \u201ctest design\u201d\nthat is well aligned with the ruling in Ricci v. DeStefano [8]. In this paper we investigate the utility of\nDLPs as a technical solution and present the following cautionary insights:\n1. When protected characteristics are redundantly encoded in the other features, suf\ufb01ciently powerful\n\nDLPs can (indirectly) implement any form of treatment disparity.\n\n2. When protected characteristics are partially encoded DLPs induce within-class discrimination\n\nbased on irrelevant features, and can harm some members of the protected group.\n\n3. DLPs provide a suboptimal trade-off between accuracy and impact parity.\n4. While disparate treatment is by de\ufb01nition illegal, the status of treatment disparity is debated [9].\n\n2 Disparate Learning Processes\n\nTo begin our formal description of the prior work, we\u2019ll introduce some notation. 
A dataset consists of n examples, or data points {xi ∈ X, yi ∈ Y}, each consisting of a feature vector xi and a label yi. A supervised learning algorithm f : X^n × Y^n → (X → [0, 1]) is a mapping from datasets to models. The learning algorithm produces a model ŷ : X → Y, which, given a feature vector xi, predicts the corresponding output yi. In this discussion we focus on binary classification (Y = {0, 1}).
We consider probabilistic classifiers which produce estimates p̂(x) of the conditional probability P(y = 1 | x) of the label given a feature vector x. To make a prediction ŷ(x) ∈ Y given an estimated probability p̂(x), a threshold rule is used: ŷi = 1 iff p̂i > t. The optimal choice of the threshold t depends on the performance metric being optimized. In our theoretical analysis, we consider optimizing the immediate utility [10], of which classification accuracy (expected 0−1 loss) is a special case. We will define this metric more precisely in the next section.

[Figure 1 legend: unconstrained — Acc = 0.96, p-% rule = 26%; DLP — Acc = 0.74, p-% rule = 105%; axes: work experience (years) vs. hair length (cm).]

In formal descriptions of discrimination-aware ML, a dataset possesses a protected feature zi ∈ Z, making each example a three-tuple (xi, yi, zi). The protected characteristic may be real-valued, like age, or categorical, like race or gender. The goal of many methods in discrimination-aware ML is not only to maximize accuracy, but also to ensure some form of impact parity. Following related work, we consider binary protected features that divide the set of examples into two groups a and b.
Our analysis extends directly to settings with more than two groups.
Of the various measures of impact disparity, the two that are the most relevant here are the Calders-Verwer gap and the p-% rule. At a given threshold t, let qz = (1/nz) Σ_{i : zi = z} 1(p̂i > t), where nz = Σ_i 1(zi = z). The Calders-Verwer (CV) gap, qa − qb, is the difference between the proportions assigned to the positive class in the advantaged group a and the disadvantaged group b [4]. The p-% rule is a related metric [5]. Classifiers satisfy the p-% rule if qb/qa ≥ p/100.
Many papers in discrimination-aware ML propose to optimize accuracy (or some other risk) subject to constraints on the resulting level of impact parity as assessed by some metric [3, 7, 11–14]. Use of DLPs presupposes that using the protected feature z as a model input is impermissible in this effort. Discarding protected features, however, does not guarantee impact parity [15]. DLPs incorporate z in the learning algorithm, but without making it an input to the classifier. Formally, a DLP is a mapping: X^n × Y^n × Z^n → (X → Y). By definition, DLPs achieve treatment parity. However, satisfying treatment parity in this fashion may still run afoul of the disparate treatment doctrine.

Alternative approaches. Researchers have proposed a number of other techniques for reconciling accuracy and impact parity. One approach consists of preprocessing the training data to reduce the dependence between the resulting model predictions and the sensitive attribute [6, 16–19]. These methods differ in terms of which variables they affect and the degree of independence achieved. [6] proposed flipping negative labels of training examples from the protected group. [20] proposed learning representations (cluster assignments) so that group membership cannot be inferred from cluster membership.
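For reference, the CV gap and p-% rule defined above can be computed directly from thresholded scores. The following is a minimal sketch (the function and variable names are ours, not from the paper), assuming NumPy:

```python
import numpy as np

def parity_metrics(p_hat, z, t=0.5):
    """Compute the Calders-Verwer gap and the p-% ratio.

    p_hat : estimated probabilities p-hat_i
    z     : group labels, 1 for the advantaged group a, 0 for group b
    t     : classification threshold
    """
    p_hat, z = np.asarray(p_hat), np.asarray(z)
    y_hat = p_hat > t              # thresholded predictions
    q_a = y_hat[z == 1].mean()     # proportion positive in group a
    q_b = y_hat[z == 0].mean()     # proportion positive in group b
    cv_gap = q_a - q_b             # Calders-Verwer gap
    p_pct = 100.0 * q_b / q_a      # satisfies the p-% rule iff p_pct >= p (assumes q_a > 0)
    return cv_gap, p_pct
```

A classifier with, say, q_a = 2/3 and q_b = 1/3 has a CV gap of 1/3 and satisfies at most a 50-% rule.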
[17] and [19] also construct representations designed to be marginally independent\nfrom Z.\n\n3 Theoretical Analysis\n\nWe present a set of simple theoretical results that demonstrate the optimality of treatment disparity,\nand highlight some properties of DLPs. We summarize our results as follows:\n1. Direct treatment disparity on the basis of z is the optimal strategy for maximizing classi\ufb01cation\n\naccuracy1 subject to CV and p-% constraints.\n\n2. When X fully encodes Z, a suf\ufb01ciently powerful DLP is equivalent to treatment disparity.\nIn Section 4, we empirically demonstrate a related point:\n3. When X only partially encodes Z, a DLP may be suboptimal and can induce intra-group disparity\n\non the basis of otherwise irrelevant features correlated with Z.\n\nTreatment disparity is optimal Absent impact parity constraints, the Bayes-optimal decision\nrule for minimizing expected 0 \u2212 1 loss (i.e., maximizing accuracy) is given by d\u2217\nuncon(x, z) =\n\u03b4(pY |X,Z(x, z) \u2265 0.5), where \u03b4() is an indicator function.\nWe now show that the optimal decision rules in the CV and p-% constrained problems have a similar\nform. The optimal decision rule will again be based on thresholding pY |X,Z(x, z), but at group-\nspeci\ufb01c thresholds. These rules can be thought of as operationalizing the following mechanism:\nSuppose that we start with the classi\ufb01cations of the unconstrained rule d\u2217\nuncon(x, z), and this results\nin a CV gap of qa \u2212 qb > \u03b3. To reduce the CV gap to \u03b3 we have two mechanisms: We can (i) \ufb02ip\npredictions from 0 to 1 in group b, and (ii) we can \ufb02ip predictions from 1 to 0 in group a. 
The optimal strategy is to perform these flips on group b cases that have the highest value of pY|X,Z(x, z) and group a cases that have the lowest value of pY|X,Z(x, z).

1 Our results are all presented in terms of a more general performance metric, of which classification accuracy is a special case.

The results in this section adapt the work of [10], who establish optimal decision rules d under exact parity. In that work, the authors characterize the optimal decision rule d = d(x, z) that maximizes the immediate utility u(d, c) = E[Y d(X, Z) − c d(X, Z)] for 0 < c < 1, under different exact parity criteria. We begin with a lemma showing that expected classification accuracy has the functional form of an immediate utility function.

Lemma 1. Optimizing classification accuracy is equivalent to optimizing immediate utility with c = 0.5.

Proof. The expected accuracy of a binary decision rule d(X) can be written as E[Y d(X) + (1 − Y)(1 − d(X))]. Expanding and rearranging this expression gives
E[Y d(X) + (1 − Y)(1 − d(X))] = E(2Y d(X) − d(X)) − E(Y) + 1 = 2u(d, 0.5) − E(Y) + 1.
The only term in this expression that depends on d is the immediate utility u. Thus the decision rule that maximizes u also maximizes accuracy.

We note that the results in this section are related to the recent independent work of [21], who derive Bayes-optimal decision rules under the same parity constraints we consider here, working instead with the cost-sensitive risk CS(d; c) = π(1 − c)FNR(d) + (1 − π)c FPR(d), where π = P(Y = 1). One can show that u(d, c) = −CS(d; c) + π(1 − c), and hence the problem of maximizing immediate utility considered here is equivalent to minimizing cost-sensitive risk as in [21].
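For completeness, this identity can be verified directly from the definitions, writing FNR(d) = 1 − E[Y d]/π and FPR(d) = E[(1 − Y)d]/(1 − π):

```latex
\begin{aligned}
\mathrm{CS}(d;c) &= \pi(1-c)\,\mathrm{FNR}(d) + (1-\pi)\,c\,\mathrm{FPR}(d)\\
&= \pi(1-c)\Bigl(1-\tfrac{\mathbb{E}[Yd]}{\pi}\Bigr)
   + (1-\pi)\,c\,\tfrac{\mathbb{E}[(1-Y)d]}{1-\pi}\\
&= \pi(1-c) - (1-c)\,\mathbb{E}[Yd] + c\,\mathbb{E}[d] - c\,\mathbb{E}[Yd]\\
&= \pi(1-c) - \mathbb{E}[Yd - c\,d] \;=\; \pi(1-c) - u(d,c),
\end{aligned}
```

so u(d, c) = −CS(d; c) + π(1 − c), as claimed.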
In our case, it will be\nmore convenient to work with the immediate utility.\nFor the next set of results, we follow [10] and assume that pY |X,Z(X, Z), viewed as a random variable,\nhas positive density on [0, 1]. This ensures that the optimal rules are unique and deterministic by\ndisallowing point-masses of probability that would necessitate tie-breaking among observations with\nequal probability. The \ufb01rst result that we state is a direct corollary of two results in [10]. It considers\nthe case where we desire exact parity, i.e., that qa = qb.\nCorollary 2. The optimal decision rules d\u2217 under various parity constraints have the following form\nand are unique up to a set of probability zero:\nthe optimum is d\u2217(x, z) =\n1. Among rules satisfying statistical parity (the 100% rule),\n\u03b4(pY |X,Z(x, z) \u2265 tz), where tz \u2208 [0, 1] are constants that depend only on group membership z.\n2. Among rules that have equal false positive rates across groups, the optimum is d\u2217(x, z) =\n\u03b4(pY |X,Z(x, z) \u2265 sz), where sz are constants that depend only on group membership z (but are\ndifferent from tz).\n3. (1) and (2) continue to hold even in the resource-constrained setting where the overall proportion\n\nof cases classi\ufb01ed as positive is constrained.\n\nProof. (1) and (2) are direct corollaries of Lemma 1 combined with Thm 3.2 and Prop 3.3 of [10].\n\nThe next set of results establishes optimality under general p-% and CV rules.\nProposition 3. Under the same assumptions as above, the optimum among rules that satisfy the CV\nconstraint 0 \u2264 qa \u2212 qb < \u03b3 or the p-% rule also has the form d\u2217(x, z) = \u03b4(pY |X,Z(x, z) \u2265 tz),\nwhere tz \u2208 [0, 1] are constants that depend on the group membership z, and on the choice of\nconstraint parameter \u03b3 or p. The thresholds tz are different for the CV constraint and p-% rule.\n\nProof. 
Suppose that the optimal solution under the CV or p-% rule constraint classifies proportions qa and qb of the advantaged and disadvantaged groups, respectively, to the positive class. As shown in Corbett-Davies et al. [10], we can rewrite the immediate utility as
u(d, 0.5) = E[d(X, Z)(pY|X,Z − 0.5)].

Thus the utility will be maximized when d∗(X, Z) = 1 for the qz proportion of individuals in each group that have the highest values of pY|X,Z. Since the optimal values of qz may differ between the CV-constrained solution and the p-% solution, the optimal thresholds may differ as well.

Our final result shows that a decision rule that does not directly use z as an input variable or for determining thresholds will have lower accuracy than the optimal rule that uses this information. That is, we show that DLPs are suboptimal for trading off accuracy and impact parity.

Theorem 4. Let d∗(x, z) be the optimal decision rule under the CV-γ or p-% constraint. Let dDLP(x) be the optimal solution to a DLP. If d∗(x, z) and dDLP(x) satisfy CV or p-% constraints with the same qa and qb, the DLP solution results in lower or equal accuracy (equal only if the solutions are the same).

Proof. From Proposition 3, we know that the unique accuracy-optimizing solution is given by d∗(x, z) = δ(pY|X,Z(x, z) ≥ tz), where tz is the 1 − qz quantile of pY|X,Z.
The difference in immediate utility between the two decision rules can be expressed as follows:

E[d∗(X, Z)(pY|X,Z − 0.5)] − E[dDLP(X)(pY|X,Z − 0.5)]
 = E[(d∗(X, Z) − dDLP(X))(pY|X,Z − 0.5)]
 = E[pY|X,Z − 0.5 | d∗ = 1, dDLP = 0] P(d∗ = 1, dDLP = 0) − E[pY|X,Z − 0.5 | d∗ = 0, dDLP = 1] P(d∗ = 0, dDLP = 1)
 = (E[pY|X,Z − 0.5 | d∗ = 1, dDLP = 0] − E[pY|X,Z − 0.5 | d∗ = 0, dDLP = 1]) P(d∗ = 1, dDLP = 0)
 ≥ 0,

where the third equality uses the fact that P(d∗ = 1, dDLP = 0) = P(d∗ = 0, dDLP = 1), since both rules assign the same proportions qa and qb to the positive class. The final inequality follows since d∗(X, Z) = 1 for the highest values of pY|X,Z, so pY|X,Z is stochastically greater on the event {d∗ = 1, dDLP = 0} than on {d∗ = 0, dDLP = 1}. Note that equality holds only if P(d∗ = 1, dDLP = 0) = 0, i.e., if the two rules are (almost surely) equivalent.

Our results continue to hold under "do no harm" constraints, where we require that any individual in the disadvantaged group who was classified as positive under the unconstrained rule d∗uncon(x, z) remains positively classified. This corresponds to the setting where the proportion of cases in the disadvantaged group classified as positive is constrained to be no lower than the proportion under the unconstrained rule (or no lower than some fixed value q_b^min). Such constraints impose an upper bound on the optimal thresholds tb, but do not change the structure of the optimal rules.

Functional equivalence when protected characteristic is redundantly encoded. Consider the case where the protected feature z is redundantly encoded in the other features x. More precisely, suppose that there exists a known subcomputation g such that z = g(x). This allows for any function of the data f(x, z) to be represented as a function of x alone via f̃(x) = f(x, g(x)).
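The construction is a one-line wrapper. The sketch below (illustrative names are ours) shows how a formally group-blind rule reproduces an explicitly group-aware rule whenever z is recoverable from x:

```python
def make_group_blind(f, g):
    """Given a group-aware rule f(x, z) and a reconstruction g with z = g(x),
    return a rule that takes only x yet makes identical decisions."""
    return lambda x: f(x, g(x))

# Illustrative (hypothetical) example: z is redundantly encoded as the
# parity of the first feature, and f applies group-specific thresholds.
g = lambda x: x[0] % 2                        # reconstructs z from x
f = lambda x, z: int(x[1] > (5 if z else 3))  # explicit treatment disparity
f_tilde = make_group_blind(f, g)              # never sees z directly
```

By construction, f̃ agrees with f on every input, so it implements the same treatment disparity while formally satisfying treatment parity.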
While it remains\nthe case that \u02dcf (x) does not directly use z as an input variable\u2014and thus satis\ufb01es treatment parity\u2014 \u02dcf\nshould be no less legally suspect from a disparate treatment perspective than the original function f\nthat uses z directly. The main difference for the purpose of our discussion is that \u02dcf, resulting from a\nDLP, may technically satisfy treatment parity, while f does not.\nWhile this form of \u201cstrict\u201d redundancy is unlikely, characterizing this edge case is important for\nconsidering whether DLPs should have different legal standing vis-a-vis disparate treatment than\nmethods that use z directly. This is particularly relevant if one thinks of the \u2018practitioner\u2019 in question\nas having discriminatory intent. Furthermore, the partial encoding of the protected attribute is\ncommonplace in settings where discrimination is a concern (as with gender in our experiment in\n\u00a74). Indeed, the very premise of DLPs requires that x is signi\ufb01cantly correlated with z. Moreover,\nDLPs provide an incentive for practitioners to game the system by adding features that are predictive\nof the protected attribute but not necessarily of the outcome, as these would improve the DLP\u2019s\nperformance.\n\nWithin-class discrimination when protected characteristic is partially redundantly encoded.\nWhen the protected characteristic is partially encoded in the other features, disparate treatment may\ninduce within-class discrimination by applying the bene\ufb01t of the af\ufb01rmative action unevenly, and can\neven harm some members of the protected class. Next we demonstrate this phenomenon empirically\nusing (synthetically biased) university admissions data and several public datasets. 
The ease of producing such examples might convince the reader that the varied effects of intervention with a DLP on members of the disadvantaged group raise practical and policy concerns about DLPs.

4 Empirical Analysis

The preceding analysis demonstrates several theoretical advantages to increasing impact parity via treatment disparity:

• Optimality: As demonstrated for CV score and for p-% rule, intervention via per-group thresholds maximizes accuracy subject to an impact parity constraint.
• Rational ordering: Within each group, individuals with higher probability of belonging to the positive class are always assigned to the positive class ahead of those with lower probabilities.
• Does no harm to the protected group: The treatment disparity intervention can be constrained to only benefit members of the disadvantaged class.

DLPs attempt to produce a classifier that satisfies the parity constraints by relying upon the proxy features to satisfy the parity metric. Typically, this is accomplished either by introducing constraints to a convex optimization problem, or by adding a regularization term and tuning the corresponding hyper-parameter. Because the CV score and p-% rule are non-convex in model parameters (scores only change when a point crosses the decision boundary), [4, 5] introduce convex surrogates aimed at reducing the correlation between the sensitive feature and the prediction.
These approaches presume that the proxy variables contain information about the sensitive attribute. Otherwise, the parity could only be satisfied via a trivial solution (e.g., assign either everyone or nobody to the positive class).
So we must consider two scenarios: (i) the proxy variables x fully\nencode z, in which case, a suf\ufb01ciently powerful DLP will implicitly reconstruct z, because this gives\nthe optimal solution to the impact-constrained objective; and (ii) x doesn\u2019t fully capture z, or the\nDLP is unable to recover z from x, in which case the DLP may be sub-optimal, may violate rational\nordering within groups, and may harm members of the disadvantaged group.\n\n4.1 Synthetic data example: work experience and hair length in hiring\n\nTo begin, we illustrate our arguments empirically with a simple synthetic data experiment. To\nconstruct the data, we sample nall = 2000 total observations from the data-generating process\ndescribed below. 70% of the observations are used for training, and the remaining 30% are reserved\nfor model testing.\n\nzi \u223c Bernoulli(0.5)\nhair_lengthi | zi = 1 \u223c 35 \u00b7 Beta(2, 2)\nhair_lengthi | zi = 0 \u223c 35 \u00b7 Beta(2, 7)\n\nwork_expi | zi \u223c Poisson(25 + 6zi) \u2212 Normal(20, \u03c3 = 0.2)\nyi | work_exp \u223c 2 \u00b7 Bernoulli(pi) \u2212 1,\n\nwhere pi = 1/ (1 + exp[\u2212(\u221225.5 + 2.5work_exp)])\n\nThis data-generating process has the following key properties: (i) the historical hiring process was\nbased solely on the number of years of work experience; (ii) because women on average have fewer\nyears of work experience than men (5 years vs. 11), men have been hired at a much higher rate than\nwomen; and (iii) women have longer hair than men, a fact that was irrelevant to historical hiring\npractice.\nFigure 1 shows the test set results of applying a DLP to the available historical data to equalize\nhiring rates between men and women. We apply the DLP proposed by Zafar et al. [5], using code\navailable from the authors.2 While the DLP nearly equalizes hiring rates (satisfying a 105-% rule), it\ndoes so through a problematic within-class discrimination mechanism. 
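For reference, the sampling scheme of §4.1 can be reproduced in a few lines. This is a sketch only: it mirrors the printed equations, NumPy is assumed, and the seed and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n_all = 2000

# Protected feature z_i ~ Bernoulli(0.5).
z = rng.binomial(1, 0.5, n_all)

# hair_length_i | z_i = 1 ~ 35 * Beta(2, 2);  | z_i = 0 ~ 35 * Beta(2, 7).
hair_length = np.where(z == 1,
                       35 * rng.beta(2, 2, n_all),
                       35 * rng.beta(2, 7, n_all))

# work_exp_i | z_i ~ Poisson(25 + 6 z_i) - Normal(20, 0.2).
work_exp = rng.poisson(25 + 6 * z) - rng.normal(20, 0.2, n_all)

# y_i | work_exp ~ 2 * Bernoulli(p_i) - 1, labels in {-1, +1}.
p = 1.0 / (1.0 + np.exp(-(-25.5 + 2.5 * work_exp)))
y = 2 * rng.binomial(1, p) - 1
```

Fitting the DLP itself relies on the authors' released code (footnote 2) and is not reproduced here.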
The DLP rule advantages individuals with longer hair over those with shorter hair and considerably longer work experience. We find that several women who would have been hired under historical practices, owing to their 12+ years of work experience, would not be hired under the DLP due to their short hair (i.e., their 'male-like' characteristics captured in x). Similarly, several men, who would not have been hired based on work experience alone, are advantaged by the DLP due to their longer hair (i.e., their 'female-like' characteristics in x). The DLP violates rational ordering, and harms some of the most qualified individuals in the protected group. Group parity is achieved at the cost of individual unfairness.
Granted, we might not expect factors such as hair length to knowingly be used as inputs to a typical hiring algorithm. We construct this toy example to illustrate a more general point: since DLPs do not have direct access to the protected feature, they must infer from the other features which people are most likely to belong to each subgroup. Using the protected feature directly can yield more reasonable policies: for example, by applying per-group thresholds, we could hire the highest rated individuals in each group, rather than distorting rankings within groups based on how female/male individuals appear to be from their other features.

2 https://github.com/mbilalzafar/fair-classification/

Figure 2: (left) probability of the sensitive variable versus (unconstrained) admission probability, on unseen test data. Downward triangles indicate individuals rejected only after applying the DLP ("treatment"), while upward triangles indicate individuals accepted only by the DLP. The remaining ∼4,000 blue/yellow dots indicate people whose decisions are not altered. Many students benefiting from the DLP are males who 'look like' females based on other features, whereas females who 'look like' males are hurt by the DLP. Detail view (center) and summary statistics (right) of the same plot.

4.2 Case study: Gender bias in CS graduate admissions

For our next example, we demonstrate a similar result, this time by analyzing real data with synthetic discrimination. We consider a sample of ∼9,000 students considered for admission to the MS program of a large US university over an 11-year period. Half of the examples are withheld for testing. Available attributes include basic information, such as country of origin, interest areas, and gender, as well as quantitative fields such as GRE scores. Our data also includes a label in the form of an 'above-the-bar' decision provided by faculty reviewers.
Admission rates for male and female applicants were observed to be within 1% of each other. So, to demonstrate the effects of DLPs, we corrupt the data with synthetic discrimination. Of all women who were admitted, i.e., zi = b, yi = 1, we flip 25% of those labels to 0, giving noisy labels ȳi = yi · ηi for ηi ∼ Bernoulli(0.75). This simulates historical bias in the training data.
We then train three logistic regressors: (1) to predict the (synthetically corrupted) labels ȳi from the non-sensitive features xi; (2) the same model, applying the DLP of [5]; and (3) a model to predict the sensitive feature zi from the non-sensitive features xi. The data contains limited information that predicts gender, though such predictions can be made better than random (AUC = 0.59) due to different rates of gender imbalance across (e.g.)
countries and interests.
Figure 2 (left) shapes our basic intuition for what is happening: considering the probability of admission for the unconstrained classifier (y-axis), students whose decisions are 'flipped' (after applying the fairness constraint) tend to be those close to the decision boundary. Furthermore, students predicted to be male (x-axis) tend to be flipped to the negative class (left half of plot) while students predicted to be female tend to be flipped to the positive class (right half of plot). This is shown in detail in Figure 2 (center and right). Of the 43 students whose decisions are flipped to 'non-admit,' 5 are female, each of whom has 'male-like' characteristics according to their other features, as in our synthetic hair-length example. Demonstrated here with real-world data, the DLP both disrupts the within-group ordering and violates the do-no-harm principle by disadvantaging some women who, but for the DLP, would have been admitted.

Comparison with Treatment Disparity. To demonstrate the better performance of per-group thresholding, we implement a simple decision scheme and compare its performance to the DLP.
Our thresholding rule for maximizing accuracy subject to a p-% rule works as follows. Recall that the p-% rule requires that qb/qa > p/100, which can be written as (p/100) qa − qb < 0. We denote the quantity (p/100) qa − qb as the p-gap. To maximize accuracy subject to satisfying the p-% rule, we construct a score that quantifies the reduction in p-gap per reduction in accuracy.
[Figure 2 panel title: "Graduate admissions w/ 25% synthetic rejection of females"; axes: p(female) vs. p(admit) (unconstrained); legend: Female/Male, admitted/rejected because of treatment; summary-panel values: 0.4755, 0.5083, 0.4818, 0.5221.]

Table 1: Statistics of public datasets.

dataset          source     protected feature   prediction target      n
Income           UCI [22]   Gender (female)     income > $50k          32,561
Marketing        UCI [23]   Status (married)    customer subscribes    45,211
Credit           UCI [24]   Gender (female)     credit card default    30,000
Employee Attr.   IBM [25]   Status (married)    employee attrition     1,470
Customer Attr.   IBM [25]   Status (married)    customer attrition     7,043

Starting from the accuracy-maximizing classifications ŷ (thresholding at 0.5), we then flip those predictions which close the gap fastest:

1. Assign each example with {ŷi = 0, zi = b} or {ŷi = 1, zi = a} a score ci equal to the reduction in the p-gap divided by the reduction in accuracy:
   (a) For each example in group a with initial ŷi = 1, ci = (p/100) / (na (2p̂i − 1)).
   (b) For each example in group b with initial ŷi = 0, ci = 1 / (nb (1 − 2p̂i)).
Flip examples in descending order of this score until the desired CV score is reached.

The scores do not change across iterations, so this greedy policy yields an optimal set of flips (equivalently, optimal per-group classification thresholds).

The unconstrained classifier achieves a p-% rule of 71.4%. By applying this thresholding strategy, we were able to obtain the same accuracy as the method of [5], but with a higher p-% rule of 78.3% compared to 77.6%. Note that on this data, the method of [5] cannot exceed a p-% rule of 77.6%; that is, the method is limited in what p-% rules may be achieved. By contrast, the thresholding rule can achieve any desired parity level. Subject to a < 1% drop in accuracy relative to the DLP, we can achieve a p-% rule of ∼100%.

4.3 Examples on public datasets

Finally, for reproducibility, we repeat our experiments from Section 4.2 on a variety of public datasets (code and data will be released at publication time). Again we compare applying our simple thresholding scheme against the fairness constraint of [5], considering a binary outcome and a single protected feature. Basic information about these datasets (including the prediction target and protected feature) is shown in Table 1.

The protocol we follow is the same as in Section 4.2. Each of these datasets exhibits a certain degree of bias w.r.t. the protected characteristic (Table 2), so no synthetic discrimination is applied. In Table 2, we compare (1) the p-% rule obtained using the classifier of [5] against that of a naïve classifier (column k vs. column h); and (2) the p-% rule obtained when applying our thresholding strategy from Section 4.2. As before, half of the data are withheld for testing.

First, we note that in most cases, the method of [5] increases the p-% rule (column k vs. h), while maintaining an accuracy similar to that of unconstrained classification (column i vs. f).
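As a concrete sketch of the thresholding strategy from Section 4.2 (function names are ours; group a is coded z = 0, group b as z = 1; ties and degenerate cases such as p̂i exactly 0.5 or an empty group are not handled):

```python
import numpy as np

def p_percent(yhat, z):
    """p-% score: 100 * min(qb/qa, qa/qb), where qa, qb are the rates
    of positive classifications in groups a (z == 0) and b (z == 1)."""
    qa, qb = yhat[z == 0].mean(), yhat[z == 1].mean()
    return 100.0 * min(qa / qb, qb / qa)

def threshold_flip(phat, z, p=100.0):
    """Start from the accuracy-maximizing predictions (threshold 0.5),
    then flip the candidates that close the p-gap fastest per unit of
    accuracy lost, until the p-% rule is satisfied."""
    yhat = (phat >= 0.5).astype(int)
    na, nb = (z == 0).sum(), (z == 1).sum()
    cand = []  # (score c_i, index i)
    # group a, initially positive: flipping to 0 shrinks q_a
    for i in np.where((z == 0) & (yhat == 1) & (phat > 0.5))[0]:
        cand.append(((p / 100.0) / (na * (2.0 * phat[i] - 1.0)), i))
    # group b, initially negative: flipping to 1 grows q_b
    for i in np.where((z == 1) & (yhat == 0))[0]:
        cand.append((1.0 / (nb * (1.0 - 2.0 * phat[i])), i))
    # scores never change, so one descending-order pass is the optimal
    # greedy policy (equivalently, per-group thresholds)
    for _, i in sorted(cand, reverse=True):
        if p_percent(yhat, z) >= p:
            break
        yhat[i] = 1 - yhat[i]
    return yhat
```

Because examples near p̂i = 0.5 cost the least accuracy per unit of gap closed, the flips concentrate at the decision boundary, exactly as observed in Figure 2.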
One exception is the UCI-Credit dataset, on which both the accuracy and the p-% rule decrease simultaneously; although this runs against our expectations, note that the optimization technique of [5] is an approximation scheme and offers no accuracy guarantees in practice (nor can it in general achieve a p-% rule of 100%). These details, however, are implementation-specific and not the focus of this paper. Second, as in Section 4.2, the optimal thresholding strategy offers a strictly larger p-% rule (column l vs. k) at a given accuracy (in this case, the accuracy from column i). In most cases, we can obtain a p-% rule of (close to) 100% at the given accuracy.

We emphasize that the goal of our experiments is not to 'beat' the method of [5], or even to comment on any specific discrimination-aware classification scheme. Rather, our point is that any DLP is fundamentally upper-bounded (in terms of the p-% rule/accuracy trade-off) by simple schemes that explicitly consider the protected feature. Our experiments validate this claim, and reveal that the two schemes make strikingly different decisions. While concealing the protected feature from the classifier may be conceptually desirable, practitioners should be aware of the consequences.

Table 2: Comparison between unconstrained classification, DLPs, and thresholding schemes. Note that the p-% rules from [5] were the strongest that could be obtained with their method; on complex datasets, p-% rules of 100% are rarely obtained in practice, due to their specific approximation scheme. Employee and Customer datasets are from IBM; the others are UCI datasets. Columns a-e give basic statistics (e is the p-% rule of the labels), f-h the naïve (unconstrained) classifier, i-k the fair (constrained) classifier of [5], and l the p-% rule of optimal thresholding at the accuracy of column i.

dataset     %prot.  %prot.   %non-prot.  label  acc.  prot./non-prot.  p-%   acc.  prot./non-prot.  p-%   p-% at
(a)         (b)     in +'ve  in +'ve     p-%    (f)   in positive      (h)   (i)   in positive      (k)   const. acc.
                    (c)      (d)         (e)          (g)                          (j)                    (l)
Income      66.9%   10.9%    30.6%       35.8%  0.85  8% / 25%         31%   0.85  7% / 24%         29%   52.9%
Marketing   60.2%   10.1%    14.1%       71.9%  0.89  3% / 4%          82%   0.89  3% / 3%          102%  100.3%
Credit      60.4%   20.8%    24.1%       86.0%  0.82  10% / 12%        88%   0.74  21% / 25%        85%   100.0%
Employee    45.8%   12.5%    19.2%       65.0%  0.87  8% / 12%         65%   0.86  8% / 11%         69%   100.4%
Customer    48.3%   19.7%    33.0%       59.7%  0.80  15% / 30%        49%   0.79  16% / 19%        84%   100.2%

5 Discussion

Coming to terms with treatment disparity. Legal considerations aside, treatment disparity approaches have three advantages over DLPs: they optimally trade accuracy for representativeness, preserve rankings among members of each group, and do no harm to members of the disadvantaged group. Treatment disparity has a further advantage: because it operates through class-dependent thresholds, it is easier to understand how it impacts individuals. It seems plausible that policy-makers could reason about thresholds to decide on the right trade-off between group equality and individual fairness. By contrast, the tuning parameters of DLPs may be harder to reason about from a policy standpoint. Several key challenges remain. Our theoretical arguments demonstrate that thresholding approaches are optimal in the setting where we assume complete knowledge of the data-generating distribution. It is not always clear how best to realize these gains in practice, where imbalanced or unrepresentative datasets can pose a significant obstacle to accurate estimation.

Separating estimation from decision-making. In the context of algorithmic, or algorithm-supported, decision-making, it is often useful to obtain not just a classification but also an accurate probability estimate.
These estimates could then be incorporated into the decision-theoretic part of the pipeline, where appropriate measures could be taken to align decisions with social values. By intervening at the modeling phase, DLPs distort the predicted probabilities themselves, and it is not clear what the outputs of the resulting classifiers actually signify. In unconstrained learning approaches, even if the label itself reflects historical prejudice, one at least knows what is being estimated. This leaves open the possibility of intervening at decision time to promote more equal outcomes.

Fairness beyond disparate impact. How best to quantify discrimination and unfairness remains an important open question. The CV scores and p-% rules offer one set of definitions, but there are many other parity criteria to which our results do not directly apply, e.g., equality of opportunity [13]. Other notions of fairness, and the trade-offs between them, have been studied [14, 26–29]. In a recent paper, Zafar et al. [30] depart from parity-based definitions and instead propose a preference-based notion of fairness. Dwork et al. [11] address the problem of how best to incorporate information about protected characteristics for several of these other fairness criteria.

Problematically, research into fairness in ML is often motivated by the case in which the ground-truth data are themselves tainted, capturing existing discriminatory patterns. Characterizing different forms of data bias, how to detect them, and how to draw valid inferences from such data remain important outstanding challenges.

Even in settings where treatment disparity in favor of disadvantaged groups is an acceptable solution, questions remain of "how?", "how much?", and "when?".
While in some cases treatment disparity may arguably be correcting for omitted variable bias or historical discrimination, in other settings it may be viewed as itself a form of discrimination. For example, in the United States, Asian students are simultaneously over-represented and discriminated against in higher education [2]. Such policy judgments require a keen understanding and awareness of the social and historical context in which the algorithms are developed and meant to operate. Recent work on identifying proxy discrimination [31] and causal formulations of fairness [32–34] offer some promising approaches to translating such understanding into technological solutions.

References

[1] Civil Rights Act of 1964, 1964. Accessed on September 11th, 2017.

[2] Anemona Hartocollis and Stephanie Saul. Affirmative action battle has a new focus: Asian-Americans. 2017. URL https://www.nytimes.com/2017/08/02/us/affirmative-action-battle-has-a-new-focus-asian-americans.html?mcubz=1.

[3] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In KDD, 2008.

[4] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In ICDM Workshops, 2011.

[5] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness constraints: Mechanisms for fair classification. In AISTATS, 2017.

[6] Faisal Kamiran and Toon Calders. Classifying without discriminating. In Computer, Control and Communication, 2009.

[7] Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy. Discrimination aware decision tree learning. In ICDM, 2010.

[8] Pauline Kim. Auditing algorithms for discrimination. 2017.

[9] Pauline T Kim. Data-driven discrimination at work. William & Mary Law Review, 58(3), 2017.

[10] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq.
Algorithmic decision making and the cost of fairness. arXiv preprint arXiv:1701.08230, 2017.

[11] Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. Decoupled classifiers for fair and efficient machine learning. arXiv preprint arXiv:1707.06613, 2017.

[12] Yahav Bechavod and Katrina Ligett. Learning fair classifiers: A regularization-inspired approach. arXiv preprint arXiv:1707.00044, 2017.

[13] Moritz Hardt, Eric Price, Nati Srebro, et al. Equality of opportunity in supervised learning. In NIPS, 2016.

[14] Ya'acov Ritov, Yuekai Sun, and Ruofei Zhao. On conditional parity as a notion of non-discrimination in machine learning. arXiv preprint arXiv:1706.08519, 2017.

[15] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science Conference, 2012.

[16] Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.

[17] Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In KDD, 2015.

[18] Philip Adler, Casey Falk, Sorelle A Friedler, Gabriel Rybeck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubramanian. Auditing black-box models by obscuring features. arXiv preprint arXiv:1602.07043, 2016.

[19] James E Johndrow and Kristian Lum. An algorithm for removing sensitive information: application to race-independent recidivism prediction. arXiv preprint arXiv:1703.04957, 2017.

[20] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In ICML, 2013.

[21] Aditya Menon and Robert Williamson. The cost of fairness in binary classification. In Fairness, Accountability and Transparency, 2018.

[22] Ron Kohavi.
Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In KDD, 1996.

[23] S. Moro, P. Cortez, and P. Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 2014.

[24] I. C. Yeh and C. H. Lien. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 2009.

[25] IBM Watson Analytics blog. https://www.ibm.com/communities/analytics/watson-analytics-blog/.

[26] Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. Rawlsian fairness for machine learning. arXiv preprint arXiv:1610.09559, 2016.

[27] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.

[28] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 2017.

[29] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: The state of the art. arXiv preprint arXiv:1703.09207, 2017.

[30] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, Krishna P Gummadi, and Adrian Weller. From parity to preference-based notions of fairness in classification. arXiv preprint arXiv:1707.00010, 2017.

[31] Anupam Datta, Matt Fredrikson, Gihyuk Ko, Piotr Mardziel, and Shayak Sen. Proxy non-discrimination in data-driven systems. arXiv preprint arXiv:1707.08120, 2017.

[32] Razieh Nabi and Ilya Shpitser. Fair inference on outcomes. arXiv preprint arXiv:1705.10378, 2017.

[33] Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through causal reasoning.
arXiv preprint arXiv:1706.02744, 2017.

[34] Matt J Kusner, Joshua R Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. arXiv preprint arXiv:1703.06856, 2017.