__ Summary and Contributions__: The paper studies the question of training fairness-constrained machine learning models given noisy or biased protected group information. The paper proposes two different approaches based on different ideas from robust optimization: the first is robust to deviations between the data distribution conditional on the actual protected attribute and the distribution conditional on the proxy protected attribute; the second is robust over a partial identification set, leveraging recent work.
------
I have read the other reviews and the rebuttal. There was extensive discussion about the setup of the experiments. I agree with the concerns regarding the experimental setup in the original paper. It seems to me that these concerns are compounded by two factors: the expectation that getting the details right matters, especially for policing, and the likelihood that future papers would build on this work.
I think the experiment in the rebuttal improves upon the label bias issue at least and introduces equalized odds.
I think many concerns about the experiment setup would not arise if the experimental setting were not policing. The stakes of letting the published record stand with mistranslations into a CS research setting would then be lower.
I think the paper could be accepted with the edited experiments and the suggested caveats from the rebuttal about these concerns, though that may not be ideal. As it stands, the BPD experiment does not seem well set up.

__ Strengths__: Relevance: the paper addresses an important problem and discusses different strategies for tackling it
Methodological: considers different methodological approaches (drawing on previous work solving DRO problems and constrained fairness problems via the Lagrangian), with a robust approach and a partial identification approach based on data combination

__ Weaknesses__: Methodological: calibration of the DRO approach. Since solving robust ERM is fairly well developed algorithmically, the details of whether the framework makes sense to apply matter.
This is acknowledged, so it is not a huge issue, but the limitations could be made a bit clearer in the paper.

__ Correctness__: Yes, the claims and method are correct to the best of my knowledge.

__ Clarity__: The paper is reasonably clear.

__ Relation to Prior Work__: Relation to prior work is clear.

__ Reproducibility__: Yes

__ Additional Feedback__: The paper is a solid application of robust optimization / constrained optimization for learning fair classifiers when the protected attribute is unobserved, which is an important problem.
The paper discusses two approaches for robustness and offers a fairly comprehensive treatment of the problem. Since DRO and ERM have been relatively well studied algorithmically, and constrained classification for fair ERM via the Lagrangian has separately been well studied, the crux lies in translating the fair-classification problem with an unobserved protected attribute into these frameworks and assessing whether they are sensible; the details of what may be lost in translation in the application of the "DRO" approach are important.
One concern regards the looseness of the TV bound.
Referring to: "P(\hat G_j \neq G_j | G = j): if \hat G maps zip codes to the most common socioeconomic group for that zip code, then census data provides a prior for how often \hat G produces an incorrect socioeconomic group."
The concern is that calibrating the DRO neighborhood with the bound of Lemma 1 is generically expected to be loose, since the bound does not seem able to take into account the quality of the proxy (e.g., even if soft groups are imputed conditional on \hat G | X, the error bound P(\hat G_j \neq G_j | G = j) is still uniformly bounded by the marginal distribution over groups). That is, it seems difficult to bound P(\hat G_j \neq G_j | G = j) by anything other than the marginal probability of the groups themselves.
This concern is particularly salient because, with an unobserved protected attribute, it seems difficult to calibrate the divergence between X, Y | A directly. While it is useful to provide Lemma 1, it may not be tight enough either.
This is certainly discussed in Figure 2, but it could perhaps be framed more explicitly as a difficulty of calibrating a DRO approach without prior information (and calibrating with Lemma 1 might be loose).
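To make the calibration concern concrete, here is a small sketch (my own construction, not from the paper) of how one would estimate the per-group proxy error rate P(\hat G \neq G | G = j) from a hypothetical auxiliary sample where both labels are observed, and compare it against the group marginals that would otherwise bound it; all names and numbers are illustrative.

```python
import numpy as np

# Hypothetical auxiliary sample where both the true group G and the proxy
# group G_hat are observed (e.g. census-linked records); illustrative only.
rng = np.random.default_rng(0)
g_true = rng.integers(0, 2, size=10000)
flip = rng.random(10000) < 0.2               # proxy wrong for ~20% of people
g_proxy = np.where(flip, 1 - g_true, g_true)

err = np.array([np.mean(g_proxy[g_true == j] != j) for j in (0, 1)])
marginal = np.array([np.mean(g_true == j) for j in (0, 1)])
print("per-group proxy error:", err)         # candidate DRO radii
print("group marginals:", marginal)          # fallback bound without a proxy model
```

The gap between the two printed quantities is exactly what an informative calibration of the DRO neighborhood would need to exploit.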

__ Summary and Contributions__: They study models of group fairness in classification when the group labels are noisy. They show bounds on the naive solution and design algorithms with better bounds under two distinct noise models. For the second model, they provide an algorithm with provable guarantees as well as a more practical heuristic. They test the algorithms empirically.
------
I have read the rebuttal and discussed with other reviewers. I have kept my score of 5. My conclusion is that the paper has a lot of merit, but should be rejected in its current form due to the experiments. If this were a journal submission, I would accept with revisions. However, I do not feel comfortable accepting the paper without reviewing the revisions.
The following points have informed my decision after discussing this submission and thinking more about it:
-The BPD experiments in the original submission should not be published for reasons summarized in my initial review.
-The new BPD experiments are an improvement, but I still believe it would be irresponsible to publish them. As far as I know, there is no software making stop-and-frisk recommendations to BPD (or NYPD), so this is different from applying fairness methods to something like recidivism prediction, which is already in practice. With the BPD experiments, the paper is implicitly proposing and endorsing the classification problem of predicting whether someone should be searched based on whether a similar person was searched in the past. This is especially problematic in a paper claiming to propose a fair algorithm.
-The new experiments on the NYPD SQF dataset are the least problematic. There is prior work (Zafar et al. From Parity to Preference-based Notions of Fairness in Classification. NeurIPS, 2017) that uses the NYPD SQF dataset. An important distinction is that the task there is predicting whether a person who *was searched* *had a weapon or not*, for which there is a known ground truth. However, since this isn't a paper on policing or criminal justice specifically, it would probably be better to find a different application for the experiments. Criminal justice is an application area that requires a lot of care to be handled responsibly, and it's hard to find space for that in a conference paper on more general-purpose ML techniques.
-I do not believe adding a paragraph explaining potential issues would be sufficient to address the aforementioned problematic experiments.
-Followup work will be expected to run comparison experiments using the same data and setup. This is an otherwise good paper which has the potential to be a highly cited paper in the fairness and noisy protected groups literature.
-Looking closer at the data in the supplement csv file, it appears that the BPD experiments are basing a person’s true race on a feature called “description”. As far as I can tell, that feature is an officer’s description of a person’s race, not their self-identification. While the officer’s description may be an appropriate feature for an analytical study exploring officer bias, I’m not sure it’s appropriate here. At the very least, it should be discussed in the paper and I don’t see such a discussion even in the supplement. In the first few lines of the csv file, there are examples where the proxy race feature is hispanic, but the description feature is black or white. In those cases, the proxy race feature could actually be the person’s self-identified race. Even the idea that a person has a ground truth race label intersects with deeper questions about how race is defined that should not be completely ignored.
Since that was a lot of criticism, I want to stress again that I feel there’s a lot of great work in the paper. If the experiment issues were resolved, it would be an easy accept for me.

__ Strengths__: The problem and models of noisy groups are well-motivated.
They consider two distinct models of noisy group membership and both theoretical and practical approaches.
The technical contributions are significant and valuable.

__ Weaknesses__: It would be good to see experiments for other fairness criteria besides equality of opportunity. Most importantly, Figure 2 suggests the DRO and SA algorithms trade accuracy for fairness as the noise increases, which is easier to do in the equal opportunity setting by simply predicting more positive labels.
I would like to see experiments on more data sets to get a clearer picture of the empirical performance. For example, we observe that at different noise levels we may prefer DRO or SA in Figure 1 or that the error rate of SA takes a sharp increase at 0.3 noise in Figure 2. It would be interesting to see maybe two more data sets plotted in those figures especially if they reveal dramatically different behavior.
I have concerns about the BPD experiments noted below. In general, it does not appear that this sensitive data set and topic is being handled with care.
As I understand it, equal opportunity in this experiment is an equal opportunity to be searched/frisked. Is that correct? In line with the meaning of this one-sided fairness criterion, shouldn't it be the opposite? Wouldn't it make more sense if the classifier predicted whether a search is necessary and people who are non-criminals (true positives) were given an equal opportunity to *not* be searched/frisked? Additionally, given the error performance, is the algorithm effectively labeling more innocent people as search/frisk as the noise increases?
There is no discussion of data set bias and biased labeling. The bias in the fiofs_type attribute is a serious and central concern in this domain. This should be addressed. I wonder if it's even appropriate to use these labels in this type of experiment.
There should be more discussion of the proxy group used in the BPD experiments. E.g., the police department itself may be using district as a proxy for group membership. The interaction between this choice of proxy and the noise model should be explored more deeply.
I would like to see more experimental comparison with prior work on non-noisy fairness. Can you compare those approaches to yours on the data sets they use, but with simulated noise or proxies for groups? I know Adult is one such data set.
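The comparison protocol suggested above is cheap to set up; here is a minimal sketch (the function name and numbers are my own, not from the paper) of simulating a noisy proxy by independently flipping observed binary group labels, e.g. on Adult.

```python
import numpy as np

# Sketch of the suggested comparison protocol: take a benchmark with observed
# binary groups (e.g. sex in Adult) and simulate a noisy proxy by flipping
# each group label independently with probability p. Illustrative names only.
def make_noisy_groups(groups, p, seed=0):
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    flip = rng.random(groups.shape[0]) < p
    return np.where(flip, 1 - groups, groups)

g = np.zeros(1000, dtype=int)
g[:400] = 1                                  # stand-in for real group labels
g_noisy = make_noisy_groups(g, p=0.3)
print("fraction flipped:", np.mean(g_noisy != g))
```

Running prior non-noisy fairness methods on (X, Y, g) and the proposed methods on (X, Y, g_noisy) would then give a direct comparison at each noise level p.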

__ Correctness__: I did not carefully check the supplement, but it appears to be very thorough on both the theoretical and experimental sides.

__ Clarity__: Overall, I found the paper clear and easy to read. Minor comments below.
Page 6:
“This ideal algorithm relies on a minimization oracle, which is not always computationally tractable.”
It would be helpful to clearly characterize the scenarios under which this algorithm is tractable or not in Section 6.2 or 6.3 of the main paper.
Page 8:
“However, as the noise increases, the naïve approach no longer controls the fairness violations on the true groups G, even though it does satisfy the constraints on the noisy groups ^G (Figure 3 in Appendix F.2).”
I found this slightly confusing until I checked Figure 3 in the appendix. Can you add a separate sentence just summarizing the results of Figure 3?

__ Relation to Prior Work__: Yes. It may be worth mentioning the following although [24] mostly suffices.
Jiahao Chen, Nathan Kallus, Xiaojie Mao, Geoffry Svacha, and Madeleine Udell. Fairness under unawareness: Assessing disparity when protected class is unobserved. FAT* 2019.

__ Reproducibility__: Yes

__ Additional Feedback__: Most questions/feedback appear in other sections. There is a lot to like about this paper, but the experiment issues noted in the weaknesses section weigh heavily on my evaluation. Absent these issues, my score would be much higher.
Can you explain the different performance of DRO and SA in Figure 2? Is there some intuition for the behavior of each?
Typo on page 2: “Let p denotes”

__ Summary and Contributions__: This paper studies the problem of fair classification when we don’t have direct access to the protected groups (G), but only have noisy proxies (\hat{G}) of these attributes. For example, zip codes can often act as a noisy indicator of socio-economic groups like race. The paper has three main contributions. First, they bound the fairness violation if one just ensures fairness property with respect to the noisy group. Next they consider two approaches to guarantee fairness with respect to the true groups. The distributionally robust approach enforces fairness constraints for all conditional distributions that are within some distance from the conditional distribution wrt the noisy group. The soft group assignments approach assigns each example with a vector of weights, where the j-th entry of the weight is the probability this example belongs to group j. Since one cannot uniquely identify such a weight vector, the authors enforce fairness constraints with respect to a set of weights consistent with the observed data. Finally, the experiments show that the proposed methods perform better than the naive approach of ensuring fairness with respect to the noisy groups but it comes at a cost of higher test error.
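As a reader's aid (my own sketch, not the paper's construction), the soft-assignment idea in its simplest, point-identified form can be illustrated with Bayes' rule applied to a hypothetical proxy confusion matrix; the paper's weights additionally depend on X, \hat Y, and Y and are only partially identified, which is why the method optimizes over a whole set of consistent weights.

```python
import numpy as np

# Illustrative sketch: given the marginal P(G) and a made-up proxy confusion
# matrix P(G_hat | G), Bayes' rule gives the soft-assignment weight vector
# w_j = P(G = j | G_hat = k) for each observed proxy value k.
p_g = np.array([0.6, 0.4])                    # P(G = j)
p_ghat_given_g = np.array([[0.9, 0.1],        # rows: G, cols: G_hat
                           [0.2, 0.8]])

joint = p_g[:, None] * p_ghat_given_g         # P(G = j, G_hat = k)
w = joint / joint.sum(axis=0, keepdims=True)  # column k: P(G = . | G_hat = k)
print(w)  # each column is the soft assignment for an example with that G_hat
```

When the confusion matrix is not identified, every column of w is replaced by a set of admissible vectors, which is the set the paper's constraints range over.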

__ Strengths__: I think the distributionally robust approach of enforcing the fairness constraint is quite natural given the observation that the TV distance between the two conditional distributions controls the fairness violation. So, the main contribution lies in developing the soft assignment approach. As the authors pointed out, this particular weighted estimator was already introduced by Kallus et al. (2020), and the main challenge was solving the robust problem with respect to the whole class of weights. This is challenging because each step of the algorithm requires access to a non-convex minimization oracle and there are multiple levels of nesting in the optimization problem.
This paper is clearly relevant to the NeurIPS community. Moreover, the particular problem considered in this paper is timely as there has been a growing interest in designing fair classifiers with missing protected attributes.

__ Weaknesses__: 1. My main criticism is that the main algorithm of the paper (Algorithm 1, the ideal algorithm) builds upon ideas developed by Kallus et al. (2020). In this sense, the approach is not entirely novel. However, it does require significant effort to solve this DRO problem.
2. The convergence guarantees are established only for the ideal algorithm, which assumes access to a non-convex minimization oracle. In the appendix, the authors show that this assumption can be relaxed by reformulating the Lagrangian. However, no convergence guarantee is provided for this practical algorithm.
3. The authors show that the TV distance between two conditional distributions (noisy and true protected attributes) provides an upper bound on the fairness violation wrt the true group. But is this bound optimal? I would have liked to see an example showing that this bound is tight, and how such a bound works for the real-world datasets.
4. Although the methods work when G and \hat{G} have different domains (e.g. race and zip-code) the experiment only considers similar domains of true and noisy values. I would have liked to see an experiment where the noisy proxy variables are high-dimensional compared to the true groups.
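On point 3, a tiny worked example (my own, with made-up numbers, not the paper's Lemma 1) shows both why the TV bound holds, since fairness violations are expectations of statistics in [0, 1] and |E_p[h] - E_q[h]| <= TV(p, q), and that the bound is attained only by a worst-case statistic, so it can be very loose for the classifier actually at hand.

```python
import numpy as np

# Toy probe of TV-bound tightness: p and q are stand-ins for discretized
# conditionals P(X, Y | G = j) and P(X, Y | G_hat = j). Numbers are made up.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.3, 0.3, 0.4])
tv = 0.5 * np.abs(p - q).sum()

h_worst = (p > q).astype(float)     # indicator of {p > q}: attains the bound
h_other = np.array([1.0, 0.0, 1.0]) # some other bounded statistic
gap_worst = abs(p @ h_worst - q @ h_worst)
gap_other = abs(p @ h_other - q @ h_other)
print(tv, gap_worst, gap_other)     # tight for h_worst, zero gap for h_other
```

An example of this flavor in the paper, plus the corresponding numbers for the real datasets, would clarify how conservative the calibrated neighborhoods are in practice.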

__ Correctness__: I think the claims are correct. I also liked the fact that the authors positioned the paper nicely among several other papers on this subject, and carefully stated what the precise contributions are. In terms of empirical methodology, the results seem reasonable to me. However, instead of generating a noisy group conditioned on the proxy variables, it would have been nice to see an example that directly works with the proxy variables.

__ Clarity__: Yes, the paper is well-written and I could follow most of the arguments. However, I felt that subsection 6.1 should have been presented in more detail. For example, why do the weights depend on \hat{y} and y, and why is the set of weights determined by that particular set of constraints?

__ Relation to Prior Work__: Yes, the authors did a wonderful job in explaining all the related work and highlighting their specific contributions.

__ Reproducibility__: Yes

__ Additional Feedback__:
---------Post Rebuttal------------
Thanks for the response and for answering my questions! I am happy with the answers to two of my concerns; however, other reviewers raised concerns about the experiments and I share them. So, I am keeping my score as it is.

__ Summary and Contributions__: The paper demonstrates (empirically) that for the task of learning with fairness constraints, if the labels used in training are noisy, it can lead to a significant violation of these constraints in reality. They also present an upper bound on the worst case violation in terms of TV distance of the conditional distributions corresponding to the correct labels and the noisy labels.
They present two ways to mitigate this issue. The first method is based on distributionally robust optimization (DRO) by directly considering all distributions that differ by up to some threshold TV distance. This formulation is solved by adapting a method from "Stochastic gradient methods for distributionally robust optimization with f-divergences" by Namkoong and Duchi. The alternative (and in my opinion more interesting) approach is the soft assignment approach (SA) that builds on the criteria proposed in "Assessing algorithmic fairness with unobserved protected class using data combination." by Kallus et al. This naturally leads to a saddle point problem and the authors propose an 'ideal' algorithm for solving it to optimality and a more practical one.
The experiments on two different data sets demonstrate that both methods succeed in learning models that satisfy the fairness constraints. The trade-off is an increase in test error, and in this regard the SA method performs better than DRO.

__ Strengths__: This work tackles an important problem and presents two methods solving the problem. The DRO method is relatively straightforward but makes complete sense to try, while the SA method builds on existing work in an interesting way and gives better empirical results. I did not go through the mathematical details in the appendix in detail, but it looks okay and seems reasonable to follow.

__ Weaknesses__: I did not notice major issues with this work. I have some questions about the experiments and about auxiliary datasets that I'll ask in the "additional feedback" section below.

__ Correctness__: To my knowledge, yes.

__ Clarity__: Yes. The paper is well-organized and I find the flow of the paper to be very natural. The contributions are clear.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: It seems like the performance of these methods can depend very heavily on parameters that one can estimate through an auxiliary dataset. What happens when the estimate is significantly off?
For DRO, a simple experiment that could illustrate this would be to fix the group noise level but change the bound on the TV distance. Then we can get plots that will show a test-error/violation trade-off. Not quite sure if there's a simple one for SA.
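The violation side of that sweep is cheap to sketch. Below is my own illustration (not the paper's algorithm, and with made-up numbers): over the TV ball of radius r around an empirical distribution, the worst-case mean of a bounded statistic is obtained by moving up to r probability mass from the smallest values of the statistic onto the largest, so the worst-case violation grows with the assumed radius.

```python
import numpy as np

# Worst-case mean of a bounded statistic h over the TV ball of radius r
# around a discrete distribution p (illustrative sketch only).
def worst_case_mean(h, p, r):
    order = np.argsort(h)
    p = np.asarray(p, dtype=float).copy()
    take = r
    for i in order:                  # drain mass from small-h atoms first
        moved = min(take, p[i])
        p[i] -= moved
        take -= moved
        if take <= 0:
            break
    p[order[-1]] += r - take         # pile the moved mass onto the largest h
    return float(p @ h)

h = np.array([0.0, 0.0, 1.0, 1.0, 1.0])  # e.g. predictions on group-j positives
p = np.full(5, 0.2)                      # empirical distribution over examples
for r in (0.0, 0.1, 0.2, 0.3):
    print(r, worst_case_mean(h, p, r))   # worst-case violation grows with r
```

In the real version of the experiment, the classifier would be retrained at each radius to trace out the test-error/violation curve.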