NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Originality: The task of characterizing biases in face classification systems has recently received increasing attention from researchers. While related work uses either computer graphics or real-world data, here the authors propose to use conditional GANs. Related work is mostly adequately cited, although I would recommend also taking the following related computer graphics approaches into account:
- Qiu, Weichao, and Alan Yuille. "UnrealCV: Connecting Computer Vision to Unreal Engine." European Conference on Computer Vision. Springer, Cham, 2016.
- Kortylewski, Adam, et al. "Analyzing and Reducing the Damage of Dataset Bias to Face Recognition With Synthetic Data." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2019.
- Kortylewski, Adam, et al. "Empirically Analyzing the Effect of Dataset Biases on Deep Face Recognition Systems." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018.

Quality: The claims made are supported by empirical analyses, although the experimental setting is rather limited, because only two classifiers have been tested. The limitations of GANs in terms of generating realistic images have been pointed out adequately.

Clarity: The quality of the manuscript is good. The paper is very easy to understand.

Significance: Although the use of conditional GANs for characterizing biases in classifiers is novel, I think the manuscript in its current state is not significant enough to justify publication at NeurIPS. The proposed work is a purely empirical analysis, no significant theoretical or technical contribution is made, and the findings basically confirm those of related work (as also pointed out by the authors).

Post-rebuttal feedback: I would like to thank the authors for taking the effort to answer my concerns. The authors have shown that their approach can be used to increase the performance of biased classifiers, although a significant gap to the performance of an unbiased classifier remains. Nevertheless, I am willing to upgrade my review to accept, because using GANs to interrogate biased classifiers can be potentially useful for a number of application areas.
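For concreteness, below is a minimal sketch (not the authors' code) of the kind of mitigation the rebuttal describes: sample extra synthetic faces from the conditional GAN at the conditioning vectors where failures were found, and fine-tune the classifier on them. The generator G, classifier C, flagged thetas, and label mapping are all hypothetical stand-ins.

```python
# Hypothetical sketch of "use the GAN's failure modes to improve the classifier".
import torch
import torch.nn as nn

latent_dim, cond_dim, n_classes = 128, 6, 2

# Stand-in conditional generator and classifier; the real ones would be a
# trained GAN and the face classifier under test.
G = nn.Sequential(nn.Linear(latent_dim + cond_dim, 3 * 32 * 32), nn.Tanh())
C = nn.Sequential(nn.Linear(3 * 32 * 32, n_classes))

# Conditioning vectors flagged as failure modes by the search (hypothetical).
flagged_thetas = torch.eye(cond_dim)[:2]

opt = torch.optim.Adam(C.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # Oversample the flagged groups when generating fine-tuning data.
    theta = flagged_thetas[torch.randint(len(flagged_thetas), (32,))]
    z = torch.randn(32, latent_dim)
    with torch.no_grad():
        x = G(torch.cat([z, theta], dim=1))  # synthetic faces for flagged groups
    y = theta.argmax(dim=1) % n_classes      # hypothetical label mapping from theta
    opt.zero_grad()
    loss_fn(C(x), y).backward()
    opt.step()
```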
Reviewer 2
The method proposed in the paper seems original to my knowledge, and the significance of the problem the paper is trying to address is also decent in my view. I'll focus on the quality and clarity of the paper in this section.

First, the paper does an outstanding job of clearly stating the problem in the introduction and giving readers detailed, comprehensive coverage of the related work. Starting with the third section, however, the paper falls short on the clarity of the method, especially the parts related to Bayesian optimization. More specifically:

1. In Eq. (2), does the classifier C have anything to do with the classifier being tested?

2. Line 173: the definition of L_c is not sufficiently clear. The authors say "Loss is the classification loss", which is overly general. Does it have to be minimized for better classifier performance, or maximized (it can go either way depending on what you use)? From the later sections it seems it needs to be minimized for good classifier performance, but I would recommend clearly defining what type of loss can be used here. Examples would be better.

3. The method section lacks sufficient detail for other researchers to fully understand the model. I know that the authors provided the code in the review process, but it does not reduce the importance of including sufficient detail of the method in the paper. More specifically (see the sketch at the end of this review for the kind of detail I have in mind):
a) Eq. (4) contains a set of previously found examples. How large should this set be? Does the size of this set affect the performance of the proposed method?
b) What is the stopping criterion for the Bayesian optimization? Does it have to reach a specific number of samples? Does the loss have to be non-increasing for a certain number of iterations? What would be the rationale for the choice?
c) The description of the Bayesian optimization process is very vague. The paper states that the loss is modeled as a GP, and the only other things the authors say about the process are that they use an RBF kernel and EI as the acquisition function. I would suggest more detail on how the GP models the loss function dynamics; related equations might also be beneficial here.
d) The authors state that the image data correspond to theta, which is a one-hot vector for genders and races. In the Bayesian optimization section, does the optimization also give one-hot vectors as resulting parameters? If yes, how is the optimization algorithm constrained to generate only one-hot vectors rather than real-valued vectors? If not, why is it reasonable to use the generated vector as input to the generator? Since the generator always takes a one-hot vector as input (as shown in Eq. (1)), it might perform poorly if given a vector otherwise.

Post rebuttal: Thanks for the response from the authors; their answers are clear to me. I would recommend including these explanations in the paper, which will definitely improve its clarity and quality.
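To illustrate the level of detail points (a)-(d) are asking for, here is a minimal sketch of one plausible reading of the optimization loop, assuming a scikit-learn GP with an RBF kernel and EI as the acquisition function, and treating the classifier loss L_c as the quantity to maximize in order to expose failure modes. The objective, dimensions, budget, and one-hot handling below are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Bayesian optimization over the GAN's conditioning
# vector theta, with a GP surrogate (RBF kernel) and expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
dim = 6  # e.g. 4 regions + 2 genders, one-hot blocks flattened (assumed)

def classifier_loss(theta):
    # Stand-in for L_c(theta): generate images conditioned on theta, run the
    # classifier under test, return its loss. Higher = worse performance.
    return float(np.sum((theta - 0.7) ** 2) + 0.05 * rng.standard_normal())

def expected_improvement(gp, X_cand, y_best):
    # EI for maximization: expected amount by which a candidate beats y_best.
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# The set of previously evaluated thetas (the set in Eq. 4); it grows by one
# point per iteration, so its size equals the query budget spent so far.
X = rng.uniform(size=(5, dim))
y = np.array([classifier_loss(x) for x in X])

for _ in range(30):  # stopping criterion here: a fixed evaluation budget
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
    gp.fit(X, y)
    cand = rng.uniform(size=(512, dim))
    theta_next = cand[np.argmax(expected_improvement(gp, cand, y.max()))]
    # One possible answer to point (d): snap each block of the continuous
    # proposal to a one-hot vector (argmax) before conditioning the generator.
    X = np.vstack([X, theta_next])
    y = np.append(y, classifier_loss(theta_next))

print("worst-performing theta:", X[np.argmax(y)], "loss:", y.max())
```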
Reviewer 3
This paper proposes a method to discover the biases of face and gender detection systems by optimizing the conditioning variables of a GAN using Bayesian optimization. Additionally, this work also releases a dataset with gender and geographic region labels.

Pros: This is a well-written paper.

Cons:
- The necessity of using a GAN is not well explained in the paper. As far as I understand, the objective in Eq. 4 can also be optimized by querying the dataset created by the authors with the relevant conditioning variables theta; there is no need to generate samples conditioned on theta using a GAN. At the bare minimum, querying would use better-quality images for optimizing Eq. 4.
- In 'Validation of Image Generation' (Section 4), it would be helpful if the authors provided a breakdown of quality ratings by both geographic region and gender (the authors do mention that there was "no significant difference" in lines 230-231, but I'd still like to see the scores for the sake of completeness). Additionally, FID scores for each region and gender would also give further insight into the quality of images generated by the GAN.
- Such a search procedure is most likely overkill for a parameter vector that has only 8 possible values (4 geographic regions x 2 genders, as explained in the 'Data' section); see the sketch below.

Post-rebuttal comments: I would like to thank the authors for their detailed response. As far as I know, GANs generally do not generate images with greater variety than the dataset they have been trained on (as claimed by the authors on line 34 of the feedback); however, I do agree with the authors that one could sample more images using a GAN. In light of their response, I have upgraded the rating to accept. In the camera-ready version, I would additionally like to see a comparison of the face detection failure rates of GAN + BO and querying + BO.
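To make the last point concrete, here is a minimal sketch of the exhaustive alternative: with only 4 regions x 2 genders there are 8 one-hot settings of theta, so every cell can simply be evaluated directly, with no surrogate model needed. The region labels and evaluate_group below are hypothetical stand-ins for querying the dataset (or the GAN) and scoring the classifier.

```python
# Hypothetical sketch: exhaustive evaluation of all 8 conditioning settings.
import itertools
import numpy as np

REGIONS = ["Region A", "Region B", "Region C", "Region D"]  # hypothetical labels
GENDERS = ["female", "male"]

def evaluate_group(region_idx, gender_idx):
    # Stand-in: fetch (or generate) images for this (region, gender) cell and
    # return the classifier's failure rate on them.
    rng = np.random.default_rng(region_idx * 2 + gender_idx)
    return float(rng.uniform(0.02, 0.25))

results = {}
for r, g in itertools.product(range(len(REGIONS)), range(len(GENDERS))):
    theta = np.zeros(len(REGIONS) + len(GENDERS))
    theta[r] = 1.0                      # one-hot region block
    theta[len(REGIONS) + g] = 1.0       # one-hot gender block
    results[(REGIONS[r], GENDERS[g])] = evaluate_group(r, g)

worst = max(results, key=results.get)
print(f"worst cell: {worst}, failure rate: {results[worst]:.3f}")
```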