{"title": "The Price of Fair PCA: One Extra dimension", "book": "Advances in Neural Information Processing Systems", "page_first": 10976, "page_last": 10987, "abstract": "We investigate whether the standard dimensionality reduction technique of PCA inadvertently produces data representations with different fidelity for two different populations. We show on several real-world data sets, PCA has higher reconstruction error on population A than on B (for example, women versus men or lower- versus higher-educated individuals). This can happen even when the data set has a similar number of samples from A and B. This motivates our study of dimensionality reduction techniques which maintain similar fidelity for A and B. We define the notion of Fair PCA and give a polynomial-time algorithm for finding a low dimensional representation of the data which is nearly-optimal with respect to this measure. Finally, we show on real-world data sets that our algorithm can be used to efficiently generate a fair low dimensional representation of the data.", "full_text": "The Price of Fair PCA: One Extra Dimension\n\nSamira Samadi\nGeorgia Tech\n\nssamadi6@gatech.edu\n\nUthaipon Tantipongpipat\n\nGeorgia Tech\n\ntao@gatech.edu\n\nJamie Morgenstern\n\nGeorgia Tech\n\njamiemmt.cs@gatech.edu\n\nMohit Singh\nGeorgia Tech\n\nmohitsinghr@gmail.com\n\nSantosh Vempala\n\nGeorgia Tech\n\nvempala@cc.gatech.edu\n\nAbstract\n\nWe investigate whether the standard dimensionality reduction technique of PCA\ninadvertently produces data representations with different \ufb01delity for two different\npopulations. We show on several real-world data sets, PCA has higher recon-\nstruction error on population A than on B (for example, women versus men or\nlower- versus higher-educated individuals). This can happen even when the data\nset has a similar number of samples from A and B. 
This motivates our study of\ndimensionality reduction techniques which maintain similar \ufb01delity for A and B.\nWe de\ufb01ne the notion of Fair PCA and give a polynomial-time algorithm for \ufb01nding\na low dimensional representation of the data which is nearly-optimal with respect\nto this measure. Finally, we show on real-world data sets that our algorithm can be\nused to ef\ufb01ciently generate a fair low dimensional representation of the data.\n\n1\n\nIntroduction\n\nIn recent years, the ML community has witnessed an onslaught of charges that real-world machine\nlearning algorithms have produced \u201cbiased\u201d outcomes. The examples come from diverse and\nimpactful domains. Google Photos labeled African Americans as gorillas [Twitter, 2015; Simonite,\n2018] and returned queries for CEOs with images overwhelmingly male and white [Kay et al., 2015],\nsearches for African American names caused the display of arrest record advertisements with higher\nfrequency than searches for white names [Sweeney, 2013], facial recognition has wildly different\naccuracy for white men than dark-skinned women [Buolamwini and Gebru, 2018], and recidivism\nprediction software has labeled low-risk African Americans as high-risk at higher rates than low-risk\nwhite people [Angwin et al., 2018].\nThe community\u2019s work to explain these observations has roughly fallen into either \u201cbiased data\u201d or\n\u201cbiased algorithm\u201d bins. In some cases, the training data might under-represent (or over-represent)\nsome group, or have noisier labels for one population than another, or use an imperfect proxy for the\nprediction label (e.g., using arrest records in lieu of whether a crime was committed). 
Separately, issues of imbalance and bias might occur due to an algorithm's behavior, such as focusing on accuracy across the entire distribution rather than guaranteeing similar false positive rates across populations, or by improperly accounting for confirmation bias and feedback loops in data collection. If an algorithm fails to distribute loans or bail to a deserving population, the algorithm won't receive additional data showing that those people would have paid back the loan, but it will continue to receive more data about the populations it (correctly) believed should receive loans or bail.\n\nMany of the proposed solutions to “biased data” problems amount to re-weighting the training set or adding noise to some of the labels; for “biased algorithms”, most work has focused on maximizing accuracy subject to a constraint forbidding (or penalizing) an unfair model. Both of these concerns and approaches have significant merit, but they form an incomplete picture of the ML pipeline and where unfairness might be introduced therein. Our work takes another step in fleshing out this picture by analyzing when dimensionality reduction might inadvertently introduce bias. We focus on principal component analysis (henceforth PCA), perhaps the most fundamental dimensionality reduction technique in the sciences [Pearson, 1901; Hotelling, 1933; Jolliffe, 1986]. We show several real-world data sets for which PCA incurs much higher average reconstruction error for one population than another, even when the populations are of similar sizes.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: Left: Average reconstruction error of PCA on the labeled faces in the wild data set (LFW), separated by gender. Right: The same, but sampling 1000 faces with men and women equiprobably (mean over 20 samples). 
Figure 1 shows that PCA on the labeled faces in the wild data set (LFW) has higher reconstruction error for women than for men, even if male and female faces are sampled with equal weight.\n\nThis work underlines the importance of considering fairness and bias at every stage of data science, not only in gathering and documenting a data set [Gebru et al., 2018] and in training a model, but also in any interim data processing steps. Many scientific disciplines have adopted PCA as a default preprocessing step, both to avoid the curse of dimensionality and also to do exploratory/explanatory data analysis (projecting the data into a number of dimensions that humans can more easily visualize). The study of human biology, disease, and the development of health interventions all face both aforementioned difficulties, as do numerous economic and financial analyses. In such high-stakes settings, where statistical tools will help in making decisions that affect a diverse set of people, we must take particular care to ensure that we share the benefits of data science with a diverse community.\n\nWe also emphasize that this work has implications for representational rather than just allocative harms, a distinction drawn by Crawford [2017] between how people are represented and what goods or opportunities they receive. Showing primates in search results for African Americans is repugnant primarily because it represents and reaffirms a racist portrayal of African Americans, not because it directly reduces any one person's access to a resource. 
If the default template for a data set begins with running PCA, and PCA does a better job representing men than women, or white people over minorities, the new representation of the data set itself may rightly be considered an unacceptable sketch of the world it aims to describe.\n\nOur work proposes a different linear dimensionality reduction which aims to represent two populations A and B with similar fidelity, which we formalize in terms of reconstruction error. Given an n-dimensional data set and its d-dimensional approximation, the reconstruction error of the data with respect to its low-dimensional approximation is the sum of squares of distances between the original data points and their approximated points in the d-dimensional subspace. To eliminate the effect of the size of a population, we focus on the average reconstruction error over a population. One possible objective for our goal would find a d-dimensional approximation of the data which minimizes the maximum reconstruction error over the two populations. However, this objective doesn't avoid grappling with the fact that population A may perfectly embed into d dimensions, whereas B might require many more dimensions to have low reconstruction error. In such cases, this objective would not necessarily favor a solution with average reconstruction error of ε for A and y + ε for B over one with error y for A and error y for B. This holds even if B requires reconstruction error y to be embedded into d dimensions, so that the first solution is nearly optimal for both populations in d dimensions.\n\nThis motivates our focus on finding a projection which minimizes the maximum additional or marginal reconstruction error for each population, above the optimal n-into-d projection for that population alone. 
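This marginal-error criterion can be computed directly from data. A minimal numerical sketch with synthetic, hypothetical groups (the names `pca_projection`, `avg_error`, and `marginal_loss` are ours, purely for illustration):

```python
import numpy as np

def pca_projection(M, d):
    """Projection matrix W W^T onto the top-d principal directions of M."""
    # Right singular vectors of M are the eigenvectors of M^T M.
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    W = Vt[:d].T                              # n x d, orthonormal columns
    return W @ W.T

def avg_error(Y, P):
    """Average squared distance between the rows of Y and their projections Y P."""
    return np.linalg.norm(Y - Y @ P, 'fro') ** 2 / len(Y)

def marginal_loss(Y, P, d):
    """Additional average error of Y under P, above Y's own optimal rank-d PCA."""
    return avg_error(Y, P) - avg_error(Y, pca_projection(Y, d))

# Synthetic groups whose dominant directions disagree (hypothetical data).
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 10)) * np.linspace(3.0, 0.1, 10)
B = rng.normal(size=(500, 10)) * np.linspace(0.1, 3.0, 10)
M = np.vstack([A, B])
M -= M.mean(axis=0)                           # center the pooled data
A, B = M[:500], M[500:]

P = pca_projection(M, d=2)                    # vanilla PCA on the pooled data
print(avg_error(A, P), avg_error(B, P))       # per-group average errors
print(marginal_loss(A, P, 2), marginal_loss(B, P, 2))
```

The marginal losses are nonnegative by construction, and on such data they can differ substantially even though |A| = |B|.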
This quantity captures how much a population\u2019s reconstruction error increases by\nincluding another population in the dimensionality reduction optimization. Despite this computational\nproblem appearing more dif\ufb01cult than solving \u201cvanilla\u201d PCA, we introduce a polynomial-time\nalgorithm which \ufb01nds an n into (d + 1)-dimensional embedding with objective value better than\nany d-dimensional embedding. Furthermore, we show that optimal solutions have equal additional\naverage error for populations A and B.\n\nSummary of our results We show PCA can overemphasize the reconstruction error for one\npopulation over another (equally sized) population, and we should therefore think carefully about\ndimensionality reduction in domains where we care about fair treatment of different populations.\nWe propose a new dimensionality reduction problem which focuses on representing A and B with\nsimilar additional error over projecting A or B individually. We give a polynomial-time algorithm\nwhich \ufb01nds near-optimal solutions to this problem. Our algorithm relies on solving a semide\ufb01nite\nprogram (SDP), which can be prohibitively slow for practical applications. We note that it is possible\nto (approximately) solve an SDP with a much faster multiplicative-weights style algorithm, whose\nrunning time in practice is equivalent to solving standard PCA at most 10-15 times. The details of the\nalgorithm are given in the full version of this work. We then evaluate the empirical performance of\nthis algorithm on several human-centric data sets.\n\n2 Related work\n\nThis work contributes to the area of fairness for machine learning models, algorithms, and data\nrepresentations. One interpretation of our work is that we suggest using Fair PCA, rather than PCA,\nwhen creating a lower-dimensional representation of a data set for further analysis. 
Both of the pieces of work most relevant to ours take the posture of explicitly trying to reduce the correlation between a sensitive attribute (such as race or gender) and the new representation of the data. The first is a broad line of work [Zemel et al., 2013; Beutel et al., 2017; Calmon et al., 2017; Madras et al., 2018; Zhang et al., 2018] that aims to design representations which will be conditionally independent of the protected attribute, while retaining as much information as possible (and particularly task-relevant information for some fixed classification task). The second is the work of Olfat and Aswani [2018], who also look to design PCA-like maps which reduce the projected data's dependence on a sensitive attribute. Our work has a qualitatively different goal: we aim not to hide a sensitive attribute, but instead to maintain as much information as possible about each population after projecting the data. In other words, we look for a representation with similar richness for population A as for B, rather than making A and B indistinguishable.\n\nOther work has developed techniques to obfuscate a sensitive attribute directly [Pedreshi et al., 2008; Kamiran et al., 2010; Calders and Verwer, 2010; Kamiran and Calders, 2011; Luong et al., 2011; Kamiran et al., 2012; Kamishima et al., 2012; Hajian and Domingo-Ferrer, 2013; Feldman et al., 2015; Zafar et al., 2015; Fish et al., 2016; Adler et al., 2016]. This line of work diverges from ours in two ways. First, these works focus on representations which obfuscate the sensitive attribute rather than on a representation with high fidelity regardless of the sensitive attribute. Second, most of these works do not give formal guarantees on how much an objective will degrade after their transformations. 
Our work directly minimizes the amount by which each group's marginal reconstruction error increases.\n\nMuch of the other work on fairness for learning algorithms focuses on fairness in classification or scoring [Dwork et al., 2012; Hardt et al., 2016; Kleinberg et al., 2016; Chouldechova, 2017], or online learning settings [Joseph et al., 2016; Kannan et al., 2017; Ensign et al., 2017b,a]. These works focus on either statistical parity of the decision rule, or equality of false positives or negatives, or an algorithm with a fair decision rule. All of these notions are driven by a single learning task rather than a generic transformation of a data set, while our work focuses on a ubiquitous, task-agnostic preprocessing step.\n\n3 Notation and vanilla PCA\n\nWe are given n-dimensional data points represented as rows of a matrix M ∈ R^{m×n}. We will refer to the set and matrix representations interchangeably. The data consists of two subpopulations A and B corresponding to two groups with different values of a binary sensitive attribute (e.g., males and females). We denote by [A; B] the concatenation of the two matrices A, B by row. We refer to the ith row of M as M_i, the jth column of M as M^j, and the (i, j)th element of M as M_ij. We denote the Frobenius norm of a matrix M by ||M||_F and the 2-norm of the vector M_i by ||M_i||. For k ∈ N, we write [k] := {1, . . . , k}. |A| denotes the size of a set A. Given two matrices M and N of the same size, their Frobenius inner product is defined as ⟨M, N⟩ = Σ_{ij} M_ij N_ij = Tr(M^T N).\n\n3.1 PCA\n\nThis section recalls useful facts about PCA that we use in later sections. We begin with a reminder of the definition of the PCA problem in terms of minimizing the reconstruction error of a data set.\n\nDefinition 3.1 (PCA problem). Given a matrix M ∈ R^{m×n}, find a matrix M̂ ∈ R^{m×n} of rank at most d (d ≤ n) that minimizes ||M − M̂||_F.\n\nWe will refer to M̂ as an optimal rank-d approximation of M. The following well-known fact characterizes the solutions to this classic problem [e.g., Shalev-Shwartz and Ben-David, 2014].\n\nFact 3.1. If M̂ is a solution to the PCA problem, then M̂ = MWW^T for a matrix W ∈ R^{n×d} with W^T W = I. The columns of W are eigenvectors corresponding to the top d eigenvalues of M^T M. The matrix WW^T ∈ R^{n×n} is called a projection matrix.\n\n4 Fair PCA\n\nGiven the n-dimensional data with two subgroups A and B, let M̂, Â, B̂ be optimal rank-d PCA approximations for M, A, and B, respectively. We introduce our approach to fair dimensionality reduction by giving two compelling examples of settings where dimensionality reduction inherently makes a tradeoff between groups A and B. Figure 2a shows a setting where projecting onto any single dimension either favors A or B (or incurs significant reconstruction error for both), while either group separately would have a high-fidelity embedding into a single dimension. This example suggests any projection will necessarily make a tradeoff between error on A and error on B.\n\nOur second example (shown in Figure 2b) exhibits a setting where A and B suffer very different reconstruction error when projected onto one dimension: A has high reconstruction error for every projection, while B has a perfect representation in the horizontal direction. Thus, asking for a projection which minimizes the maximum reconstruction error for groups A and B might require incurring additional error for B while not improving the error for A. So, minimizing the maximum reconstruction error over A and B fails to account for the fact that two populations might have wildly different representation error when embedded into d dimensions. 
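The tradeoff in these examples is easy to reproduce numerically. A minimal sketch with synthetic two-dimensional groups shaped like Figure 2a (hypothetical data, not the paper's; `avg_err` is an illustrative helper):

```python
import numpy as np

rng = np.random.default_rng(1)
# Group A lies near the x-axis, group B near the y-axis (as in Figure 2a).
A = np.column_stack([rng.normal(0, 3.0, 300), rng.normal(0, 0.1, 300)])
B = np.column_stack([rng.normal(0, 0.1, 300), rng.normal(0, 3.0, 300)])

def avg_err(Y, v):
    """Average reconstruction error of Y projected onto the unit vector v."""
    P = np.outer(v, v)
    return np.linalg.norm(Y - Y @ P, 'fro') ** 2 / len(Y)

for v in [np.array([1.0, 0.0]),                 # favors A
          np.array([0.0, 1.0]),                 # favors B
          np.array([1.0, 1.0]) / np.sqrt(2)]:   # splits the difference
    print(avg_err(A, v), avg_err(B, v))
```

Projecting onto (1, 0) is near-perfect for A and terrible for B, (0, 1) is the reverse, and the diagonal direction is mediocre for both.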
Optimal solutions to such an objective might behave in a counterintuitive way, preferring to exactly optimize for the group with larger inherent representation error rather than approximately optimizing for both groups simultaneously. We find this behaviour undesirable: it requires a sacrifice in quality for one group for no improvement for the other group.\n\nRemark 4.1. We focus on the setting where we ask for a single projection into d dimensions rather than two separate projections, because using two distinct projections (or more generally two models) for different populations raises legal and ethical concerns.¹ Learning two different projections also faces no inherent tradeoff in representing A or B with those projections.\n\n¹Lipton et al. [2017] has asked whether equal treatment requires different models for two groups.\n\nWe therefore turn to finding a projection which minimizes the maximum deviation of each group from its optimal projection. This optimization asks that A and B suffer a similar loss for being projected together into d dimensions compared to their individually optimal projections. We now introduce our notation for measuring a group's loss when being projected to Z rather than to its optimal d-dimensional representation:\n\nDefinition 4.2 (Reconstruction error). Given two matrices Y and Z of the same size, the reconstruction error of Y with respect to Z is defined as error(Y, Z) = ||Y − Z||²_F.\n\nFigure 2: (a) The best one-dimensional PCA projection for group A is the vector (1, 0), and for group B it is the vector (0, 1). (b) Group B has a perfect one-dimensional projection; for group A, any one-dimensional projection is equally bad.\n\nDefinition 4.3 (Reconstruction loss). Given a matrix Y ∈ R^{a×n}, let Ŷ ∈ R^{a×n} be the optimal rank-d approximation of Y. For a matrix Z ∈ R^{a×n} with rank at most d we define loss(Y, Z) := ||Y − Z||²_F − ||Y − Ŷ||²_F.\n\nThen, the optimization that we study asks to minimize the maximum loss suffered by any group. This captures the idea that, fixing a feasible solution, the objective will only improve if it improves the loss for the group whose current representation is worse. Furthermore, considering the reconstruction loss and not the reconstruction error prevents the optimization from incurring error for one subpopulation without improving the error for the other one, as described in Figure 2b.\n\nDefinition 4.4 (Fair PCA). Given m data points in R^n with subgroups A and B, we define the problem of finding a fair PCA projection into d dimensions as optimizing\n\nmin_{U ∈ R^{m×n}, rank(U) ≤ d} max{ (1/|A|) loss(A, U_A), (1/|B|) loss(B, U_B) },   (1)\n\nwhere U_A and U_B are the matrices whose rows are the rows of U corresponding to groups A and B respectively.\n\nThis definition does not appear to have a closed-form solution (unlike vanilla PCA; see Fact 3.1). To take a step toward characterizing solutions to this optimization, Theorem 4.5 states that a fair PCA low-dimensional approximation of the data results in the same loss for both groups.\n\nTheorem 4.5. Let U be a solution to the Fair PCA problem (1). Then (1/|A|) loss(A, U_A) = (1/|B|) loss(B, U_B).\n\nBefore proving Theorem 4.5, we state the building blocks of the proof, Lemmas 4.6, 4.7, and 4.8. For the proofs of the lemmas, please refer to Appendix B.\n\nLemma 4.6. Given a matrix U ∈ R^{m×n} such that rank(U) ≤ d, let f(U) = max{ (1/|A|) loss(A, U_A), (1/|B|) loss(B, U_B) }. Let {v_1, . . . , v_d} ⊂ R^n be an orthonormal basis of the row space of U and V := [v_1, . . . , v_d] ∈ R^{n×d}. Then f([A; B] VV^T) ≤ f(U).\n\nThe next lemma presents some equalities that we will use frequently in the proofs.\n\nLemma 4.7. Given a matrix V = [v_1, . . . , v_d] ∈ R^{n×d} with orthonormal columns, we have:\n• loss(A, AVV^T) = ||Â||²_F − Σ_{i=1}^d ||Av_i||² = ||Â||²_F − ⟨A^T A, VV^T⟩\n• ||A − AVV^T||²_F = ||A||²_F − ||AV||²_F = ||A||²_F − Σ_{i=1}^d ||Av_i||²\n\nLet the function g_A = g_A(U) measure the reconstruction error of a fixed matrix A with respect to its orthogonal projection onto the input subspace U. The next lemma shows that the value of the function g_A at any local minimum is the same.\n\nLemma 4.8. Given a matrix A ∈ R^{a×n} and a d-dimensional subspace U, let the function g_A = g_A(U) denote the reconstruction error of the matrix A with respect to its orthogonal projection onto the subspace U, that is, g_A(U) := ||A − AUU^T||²_F, where by abuse of notation we use U inside the norm to denote a matrix whose columns form an orthonormal basis of the subspace U. The value of the function g_A at any local minimum is the same.\n\nProof of Theorem 4.5: Consider the functions g_A and g_B defined in Lemma 4.8. It follows from Lemma 4.6 and Lemma 4.7 that for V ∈ R^{n×d} with V^T V = I we have\n\nloss(A, AVV^T) = ||Â||²_F − ||A||²_F + g_A(V),  loss(B, BVV^T) = ||B̂||²_F − ||B||²_F + g_B(V).   (2)\n\nTherefore, the Fair PCA problem is equivalent to\n\nmin_{V ∈ R^{n×d}, V^T V = I} f(V) := max{ (1/|A|) loss(A, AVV^T), (1/|B|) loss(B, BVV^T) }.\n\nWe proceed to prove the claim by contradiction. Let W be a global minimum of f and assume that\n\n(1/|A|) loss(A, AWW^T) > (1/|B|) loss(B, BWW^T).   (3)\n\nHence, since loss is continuous, for any matrix W_ε with W_ε^T W_ε = I in a small enough neighborhood of W, f(W_ε) = (1/|A|) loss(A, A W_ε W_ε^T). Since W is a global minimum of f, it is a local minimum of (1/|A|) loss(A, AWW^T), or equivalently a local minimum of g_A because of (2).\n\nLet {v_1, . . . , v_n} be an orthonormal basis of eigenvectors of A^T A corresponding to eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ λ_n. Let V* be the subspace spanned by {v_1, . . . , v_d}. Note that loss(A, AV*V*^T) = 0. Since the loss is always non-negative for both A and B, (3) implies that loss(A, AWW^T) > 0. Therefore, W ≠ V* and g_A(V*) < g_A(W). By Lemma 4.8, this contradicts V* being a global minimum and W being a local minimum of g_A. ∎\n\n5 Algorithm and analysis\n\nIn this section, we present a polynomial-time algorithm for solving the fair PCA problem. Our algorithm outputs a matrix of rank at most d + 1 and guarantees that it achieves a fair PCA objective value equal to the optimal d-dimensional fair PCA value. The algorithm has two steps: first, relax fair PCA to a semidefinite optimization problem and solve the SDP; second, solve an LP designed to reduce the rank of that solution. We argue using properties of extreme point solutions that the solution must satisfy a number of constraints of the LP with equality, and argue directly that this implies the solution must lie in d + 1 or fewer dimensions. We refer the reader to Lau et al. [2011] for basics and applications of this technique in approximation algorithms.\n\nTheorem 5.1. 
There is a polynomial-time algorithm that outputs an approximation matrix of the data such that either it is of rank d and is an optimal solution to the fair PCA problem, OR it is of rank d + 1, has equal losses for the two populations, and achieves the optimal fair PCA objective value for dimension d.\n\nProof of Theorem 5.1: The algorithm proving Theorem 5.1 is presented in Algorithm 1. Using Lemma 4.7, we can write the semidefinite relaxation of the fair PCA objective (Def. 4.4) as SDP (4). This semidefinite program can be solved in polynomial time.\n\nAlgorithm 1: Fair PCA\nInput: A ∈ R^{m_1×n}, B ∈ R^{m_2×n}, d < n, m = m_1 + m_2\nOutput: U ∈ R^{m×n}, rank(U) ≤ d + 1\n1. Find optimal rank-d approximations Â, B̂ of A, B (e.g., by singular value decomposition).\n2. Let (P̂, ẑ) be a solution to the SDP:\n   min_{P ∈ R^{n×n}, z ∈ R} z   (4)\n   s.t. z ≥ (1/m_1)(||Â||²_F − ⟨A^T A, P⟩), z ≥ (1/m_2)(||B̂||²_F − ⟨B^T B, P⟩), Tr(P) ≤ d, 0 ≼ P ≼ I.\n3. Apply singular value decomposition to P̂: P̂ = Σ_{j=1}^n λ̂_j u_j u_j^T.\n4. Find an extreme solution (λ̄, z*) of the LP:\n   min_{λ ∈ R^n, z ∈ R} z   (5)\n   s.t. z ≥ (1/m_1)(||Â||²_F − Σ_{j=1}^n λ_j ⟨A^T A, u_j u_j^T⟩)   (6)\n        z ≥ (1/m_2)(||B̂||²_F − Σ_{j=1}^n λ_j ⟨B^T B, u_j u_j^T⟩)   (7)\n        Σ_{i=1}^n λ_i ≤ d   (8)\n        0 ≤ λ_i ≤ 1 for i ∈ [n]   (9)\n5. Set P* = Σ_{j=1}^n λ*_j u_j u_j^T, where λ*_j = 1 − √(1 − λ̄_j).\n6. Return U = [A; B] P*.\n\nThe system of constraints (5)-(9) is a linear program in the variables λ_i (with the u_i's fixed). Therefore, an extreme point solution (λ̄, z*) is defined by n + 1 equalities, at most three of which can be the constraints (6)-(8); the rest (at least n − 2 of them) must be of the form λ̄_i = 0 or λ̄_i = 1 for i ∈ [n]. Given the upper bound of d on the sum of the λ̄_i's, this implies that at least d − 1 of them are equal to 1, i.e., at most two are fractional, and these add up to 1.\n\nCase 1. All the eigenvalues are integral. Then there are d eigenvalues equal to 1. This results in an orthogonal projection to d dimensions.\n\nCase 2. n − 2 of the eigenvalues are in {0, 1} and two eigenvalues satisfy 0 < λ̄_d, λ̄_{d+1} < 1. Since we have n + 1 tight constraints, both of the first two constraints (6) and (7) must be tight. Therefore\n\n(1/m_1)(||Â||²_F − Σ_i λ̄_i ⟨A^T A, u_i u_i^T⟩) = (1/m_2)(||B̂||²_F − Σ_i λ̄_i ⟨B^T B, u_i u_i^T⟩) = z* ≤ ẑ,\n\nwhere the inequality follows by observing that (λ̂, ẑ) is a feasible solution of the LP. Note that the loss of group A under the affine projection P* = Σ_j λ*_j u_j u_j^T is\n\nloss(A, AP*) = ||A − AP*||²_F − ||A − Â||²_F = ||A||²_F − 2 Tr(A P* A^T) + Tr(A P*² A^T) − ||A||²_F + ||Â||²_F = ||Â||²_F − Σ_i (2λ*_i − (λ*_i)²) ⟨A^T A, u_i u_i^T⟩ = ||Â||²_F − Σ_i λ̄_i ⟨A^T A, u_i u_i^T⟩,\n\nwhere the last equality holds by the choice λ*_j = 1 − √(1 − λ̄_j), so that 2λ*_j − (λ*_j)² = λ̄_j. The same equality holds for group B. Therefore, P* gives the equal loss z* ≤ ẑ for the two groups. The embedding x → (x·u_1, . . . , x·u_{d−1}, √(λ*_d) x·u_d, √(λ*_{d+1}) x·u_{d+1}) corresponds to the affine projection of any point (row) of A, B defined by the solution P*.\n\nIn both cases, the objective value is at most the optimal value of the original fairness objective. ∎\n\nThe result of Theorem 5.1 for two groups generalizes to more than two groups as follows. Given m data points in R^n with k subgroups A1, A2, . . .
, A_k, and d ≤ n the desired number of dimensions of the projected space, we generalize Definition 4.4 of the fair PCA problem to optimizing\n\nmin_{U ∈ R^{m×n}, rank(U) ≤ d} max_{i ∈ {1,...,k}} (1/|A_i|) loss(A_i, U_{A_i}),   (10)\n\nwhere the U_{A_i} are the matrices whose rows are the rows of U corresponding to groups A_i.\n\nTheorem 5.2. There is a polynomial-time algorithm to find a projection such that it is of dimension at most d + k − 1 and achieves the optimal fairness objective value for dimension d.\n\nIn contrast to the case of two groups, when there are more than two groups in the data it is possible that no optimal solution to fair PCA assigns the same loss to all groups. However, with k − 1 extra dimensions, we can ensure that the loss of each group remains at most the optimal fairness objective in d dimensions. The result of Theorem 5.2 follows by extending the algorithm in Theorem 5.1, adding one linear constraint to the SDP and the LP for each extra group. An extreme solution (λ̄, z*) of the resulting LP contains at most k of the λ_i strictly between 0 and 1. Therefore, the final projection matrix P* has rank at most d + k − 1.\n\nRuntime. We now analyze the runtime of Algorithm 1, which consists of solving SDP (4) and finding an extreme solution of the LP (5)-(9). The SDP and the LP can be solved up to an additive error of ε > 0 in the objective value in O(n^{6.5} log(1/ε)) time [Ben-Tal and Nemirovski, 2001] and O(n^{3.5} log(1/ε)) time [Schrijver, 1998], respectively. The running time of the SDP dominates the algorithm both in theory and in practice, and is too slow for practical use even for moderate n.\n\nWe propose another algorithm for solving the SDP using the multiplicative weights (MW) update method. In theory, our MW algorithm takes O(1/ε²) iterations of solving standard PCA, giving a total runtime of O(n³/ε²), which may or may not be faster than O(n^{6.5} log(1/ε)) depending on n and ε. 
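One plausible reading of such a multiplicative-weights scheme (the paper's actual algorithm and its parameter tuning are in Appendix A; `eta` and `iters` below are hypothetical choices of ours): keep a weight per group, repeatedly run standard PCA on the reweighted covariance, shift weight toward the group currently suffering the larger average loss, and return the average of the iterates.

```python
import numpy as np

def fair_pca_mw(A, B, d, eta=1.0, iters=100):
    """Multiplicative-weights sketch for the two-group fair PCA SDP relaxation."""
    covs = [A.T @ A / len(A), B.T @ B / len(B)]      # per-group G^T G / |G|
    # Optimal per-group value ||G_hat||_F^2 / |G| = sum of the top-d eigenvalues.
    opts = [np.linalg.eigvalsh(C)[-d:].sum() for C in covs]
    w = np.array([0.5, 0.5])                         # weights over the two groups
    n = A.shape[1]
    P_avg = np.zeros((n, n))
    for _ in range(iters):
        # Standard PCA step: best rank-d projection for the weighted covariance.
        C = w[0] * covs[0] + w[1] * covs[1]
        V = np.linalg.eigh(C)[1][:, -d:]
        P = V @ V.T
        P_avg += P / iters
        # Average loss of each group under P (the identity from Lemma 4.7).
        losses = np.array([opts[i] - np.trace(covs[i] @ P) for i in (0, 1)])
        # Multiplicative update: upweight the group with the larger loss.
        w *= np.exp(eta * losses)
        w /= w.sum()
    # The averaged iterate is a fractional projection: trace d, 0 <= P_avg <= I.
    return P_avg

# Hypothetical example data.
rng = np.random.default_rng(2)
A = rng.normal(size=(200, 6)) * np.linspace(2.0, 0.2, 6)
B = rng.normal(size=(200, 6)) * np.linspace(0.2, 2.0, 6)
P = fair_pca_mw(A, B, d=2)
```

Each iteration costs one eigendecomposition, so the per-iteration work matches a standard PCA solve, as the runtime discussion above suggests.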
In practice, however, we observe that after appropriately tuning one parameter in MW, the MW algorithm achieves accuracy ε < 10^{-5} within tens of iterations, and it is therefore the method used to obtain the experimental results in this paper. Our MW implementation can handle data of dimension up to a thousand with a running time of less than a minute. The details of the implementation and the analysis of the MW method are in Appendix A.\n\n6 Experiments\n\nWe use two common human-centric data sets for our experiments. The first is labeled faces in the wild (LFW) [Huang et al., 2007]; the second is the Default Credit data set [Yeh and Lien, 2009]. We preprocess all data to have its mean at the origin. For the LFW data, we normalized each pixel value by 1/255. The gender information for LFW was taken from Afifi and Abdelhamed [2017], who manually verified the correctness of these labels. For the credit data, since different attributes are measurements in incomparable units, we normalized the variance of each attribute to 1.\n\nResults. We focus on projections into relatively few dimensions, as those are used ubiquitously in early phases of data exploration. As we already saw in Figure 1 (left), at lower dimensions there is a noticeable gap between PCA's average reconstruction error for men and women on the LFW data set. This gap is at the scale of up to 10% of the total reconstruction error when we project to 20 dimensions. This still holds when we subsample male and female faces with equal probability from the data set, so that men and women have equal weight in the objective function of PCA (Figure 1, right).\n\nFigure 3 shows the average reconstruction error of each population (male/female, higher/lower education) as the result of running vanilla PCA and Fair PCA on the LFW and Credit data. 
Figure 3: Reconstruction error of PCA/Fair PCA on LFW and the Default Credit data set.\n\nFigure 4: Loss of PCA/Fair PCA on LFW and the Default Credit data set.\n\nAs we expect, as the number of dimensions increases, the average reconstruction error of every population decreases. For LFW, the original data is in 1764 dimensions (42×42 images); therefore, at 20 dimensions we still see considerable reconstruction error. For the Credit data, we see that at 21 dimensions the average reconstruction error of both populations reaches 0, as this data originally lies in 21 dimensions. To see how fair each of these methods is, we need to zoom in further and look at the average loss of each population.\n\nFigure 4 shows the average loss of each population as the result of applying vanilla PCA and Fair PCA to both data sets. Note that at the optimal solution of Fair PCA, the average losses of the two populations are equal; therefore, we plot a single line for “Fair loss”. We observe that PCA suffers much higher average loss for female faces than for male faces. After running Fair PCA, its average loss sits roughly in the middle of the male and female average losses under PCA: the improvement in the female average loss comes at a cost in the male average loss. A similar observation holds for the Credit data set. In this context, it appears there is some cost to optimizing for the less well represented population in terms of the better-represented population.\n\n7 Future work\n\nThis work is far from a complete study of when and how dimensionality reduction might help or hurt the fair treatment of different populations. Several concrete theoretical questions remain within our framework. What is the complexity of optimizing the fairness objective? Is it NP-hard, even for d = 1? 
Our work naturally extends to k predefined subgroups rather than just 2, where the number of additional dimensions our algorithm uses is k − 1. Are these additional dimensions necessary for computational efficiency?
In a broader sense, this work aims to point out another way in which standard ML techniques might introduce unfair treatment of some subpopulation. Further work in this vein will likely prove very enlightening.

Acknowledgements

This work was supported in part by NSF awards CCF-1563838, CCF-1717349, and CCF-1717947.

References

Philip Adler, Casey Falk, Sorelle Friedler, Gabriel Rybeck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubramanian. Auditing black-box models for indirect influence. In Proceedings of the 16th International Conference on Data Mining, pages 1–10, 2016.

Mahmoud Afifi and Abdelrahman Abdelhamed. AFIF4: Deep gender classification based on AdaBoost-based fusion of isolated facial features and foggy faces. arXiv preprint arXiv:1706.04277, 2017.

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, 2018.

Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

Ahron Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, volume 2. SIAM, 2001.

Alex Beutel, Jilin Chen, Zhe Zhao, and Ed Huai-hsin Chi. Data decisions and theoretical implications when adversarially learning fair representations. CoRR, abs/1707.00075, 2017.

Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification.
In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.

Toon Calders and Sicco Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010.

Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, pages 3992–4001, 2017.

Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.

Kate Crawford. The trouble with bias, 2017. URL http://blog.revolutionanalytics.com/2017/12/the-trouble-with-bias-by-kate-crawford.html. Invited talk by Kate Crawford at NIPS 2017, Long Beach, CA.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.

Danielle Ensign, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. Runaway feedback loops in predictive policing. arXiv preprint arXiv:1706.09847, 2017a.

Danielle Ensign, Sorelle A. Friedler, Scott Neville, Carlos Eduardo Scheidegger, and Suresh Venkatasubramanian. Runaway feedback loops in predictive policing. Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2017b.

Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268, 2015.

Benjamin Fish, Jeremy Kun, and Ádám Dániel Lelkes. A confidence-based approach for balancing fairness and accuracy.
In Proceedings of the 16th SIAM International Conference on Data Mining, pages 144–152, 2016.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.

Sara Hajian and Josep Domingo-Ferrer. A methodology for direct and indirect discrimination prevention in data mining. IEEE Transactions on Knowledge and Data Engineering, 25(7):1445–1459, 2013.

Moritz Hardt, Eric Price, Nati Srebro, et al. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323, 2016.

Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417, 1933.

Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

Ian T. Jolliffe. Principal component analysis and factor analysis. In Principal Component Analysis, pages 115–128. Springer, 1986.

Matthew Joseph, Michael Kearns, Jamie H. Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pages 325–333, 2016.

Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2011.

Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy. Discrimination aware decision tree learning. In Proceedings of the 10th IEEE International Conference on Data Mining, pages 869–874, 2010.

Faisal Kamiran, Asim Karim, and Xiangliang Zhang. Decision theory for discrimination-aware classification.
In Proceedings of the 12th IEEE International Conference on Data Mining, pages 924–929, 2012.

Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50, 2012.

Sampath Kannan, Michael Kearns, Jamie Morgenstern, Mallesh M. Pai, Aaron Roth, Rakesh V. Vohra, and Zhiwei Steven Wu. Fairness incentives for myopic agents. In Proceedings of the 2017 ACM Conference on Economics and Computation, pages 369–386, 2017.

Matthew Kay, Cynthia Matuszek, and Sean A. Munson. Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3819–3828. ACM, 2015.

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.

Lap Chi Lau, Ramamoorthi Ravi, and Mohit Singh. Iterative Methods in Combinatorial Optimization, volume 46. Cambridge University Press, 2011.

Zachary C. Lipton, Alexandra Chouldechova, and Julian McAuley. Does mitigating ML's disparate impact require disparate treatment? arXiv preprint arXiv:1711.07076, 2017.

Binh Thanh Luong, Salvatore Ruggieri, and Franco Turini. k-NN as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 502–510. ACM, 2011.

David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In Proceedings of the 35th International Conference on Machine Learning, pages 3384–3393, 2018.

Matt Olfat and Anil Aswani. Convex formulations for fair principal component analysis.
arXiv preprint arXiv:1802.03765, 2018.

Karl Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 560–568. ACM, 2008.

Alexander Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, 1998.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Tom Simonite. When it comes to gorillas, Google Photos remains blind. https://www.wired.com/story/when-it-comes-to-gorillas-google-photos-remains-blind/, Jan 2018.

Latanya Sweeney. Discrimination in online ad delivery. Communications of the ACM, 56(5):44–54, 2013.

Twitter. Jacky lives: Google photos, y'all fucked up. My friend's not a gorilla. https://twitter.com/jackyalcine/status/615329515909156865, June 2015.

I-Cheng Yeh and Che-hui Lien. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2):2473–2480, 2009.

Muhammad Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna Gummadi. Fairness constraints: A mechanism for fair classification. CoRR, abs/1507.05259, 2015.

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325–333, 2013.

Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning.
arXiv preprint arXiv:1801.07593, 2018.