NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7252
Title: Assessing Social and Intersectional Biases in Contextualized Word Representations

Reviewer 1

[Update: thank you for providing the author response. I look forward to the final version including more details about the tests, as requested by reviewer 2.]

This paper studies the presence of social biases in contextualized word representations. First, word co-occurrence statistics of pronouns and stereotypical occupations are provided for various datasets used for training contextualizers. Then, the word/sentence embedding association test is extended to the contextual case. Using templates, instead of aggregating over word representations (as in the sentence test) or taking the context-free word embedding (as in the word test), the contextual word representation is used. Then, an association test compares the association between a concept and an attribute using a permutation test. The number of significant associations across tests is recorded for every word representation model. The results show that some representations exhibit bias in more tests than others. Some of the interesting trends are that larger models exhibit fewer positive effects and that contextualized word representations present different biases than sentence encoders. Race bias seems to be more pronounced than gender bias, both on its own and when intersected with gender. The discussion acknowledges certain limitations of the work in representing gender and more nuanced identities, and points to directions for future work.

Originality: the work is a natural next step after the investigation of bias in context-free embeddings and in sentence representations. The method is a simple extension of WEAT/SEAT. The contribution is well situated w.r.t. prior work.

Quality: the paper presents a nice set of experiments that explore different possible manifestations of bias. The appendix provides more complete results. The paper also acknowledges and discusses potential limitations. Some further analysis would be welcome, such as:
- Relate the overall results in section 4.5 to the corpora statistics in table 1.
- Clarify and expand the discussion of overlap or no overlap between tests revealed by contextualized representations and sentence encoders (sec 4.5, second paragraph). It's a bit hard to follow.
- Test the claim that bigger models exhibit less bias by training models of increasing size (but otherwise identical), e.g. GloVe with different sizes.
- More discussion of the intersectional results in 4.7. What do they mean?

Clarity: the paper is well written and very clear. A few minor comments:
- Define intersectional earlier.
- Define acronyms on first occurrence (e.g. LSTM).
- corpuses -> corpora?
- Say a bit more on the counting method from [37].
- Line 113: table 1. Also refer to other tables from the text.
- Table 2: give the total number of tests per type (say, how many race tests there are in total). On the other hand, the current "total" column is misleading because it counts the same tests multiple times (across models).
- Line 130: sentence seems broken.
- Line 142: "the the"
- Section 4.2: line 153 says "we use the contextual word representation of the token of interest, before any pooling is used to obtain the sentence encoding". If I understand correctly, the contextual representation is used instead of pooling; that is, in this case, there is no pooling. Right? The specific method should be described in the main paper, in my opinion.
- Line 259: onto (a) the

Significance: the results are important for the NLP community, where both contextualized word representations and bias are topics of much research. The results show complementary benefits to previous work on sentence encodings (although this part could be analyzed better).
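
To make the association test summarized in this review concrete, here is a minimal sketch under the standard WEAT formulation of Caliskan et al.; it is not the authors' implementation. The function names, the random sampling of partitions, and the `n_samples` and `seed` parameters are assumptions for illustration. Embeddings are assumed to be NumPy arrays with one row per word (or per token of interest in a template sentence).

```python
import numpy as np

def cosine_matrix(U, V):
    """Pairwise cosine similarities between the rows of U and the rows of V."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return U @ V.T

def association(W, A, B):
    """s(w, A, B) for each row w of W: mean similarity to A minus mean similarity to B."""
    return cosine_matrix(W, A).mean(axis=1) - cosine_matrix(W, B).mean(axis=1)

def permutation_p_value(X, Y, A, B, n_samples=10000, seed=0):
    """Test statistic s(X, Y, A, B) and a one-sided permutation p-value.

    Assumption: partitions of X u Y are sampled at random rather than
    enumerated exhaustively, so the p-value is an approximation.
    """
    rng = np.random.default_rng(seed)
    s = association(np.vstack([X, Y]), A, B)
    n_x = len(X)
    total = s.sum()
    # sum over X minus sum over Y, written as 2 * sum(X) - sum(all)
    observed = 2.0 * s[:n_x].sum() - total
    exceed = 0
    for _ in range(n_samples):
        idx = rng.permutation(len(s))[:n_x]  # random equal-size re-partition
        if 2.0 * s[idx].sum() - total > observed:
            exceed += 1
    return observed, exceed / n_samples
```

A small p-value here means that few random re-partitions of the target words produce a larger differential association than the observed grouping does.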

Reviewer 2

UPDATE AFTER READING RESPONSE: Many of my concerns have been addressed; thank you for the careful response! Some minor fixes:
- Lines 128-130: the sentence is broken (noted in another review).
- Eq. 3 has a typo: missing "s(", I think.
- Line 136: "a more severe pro-stereotypical" has a grammar / word choice error. Maybe they mean "pro-stereotypical representation"?

==============

This work tests contextual neural network language representations for social biases in their representations of people, covering stereotypes associated with race, gender, and their intersection. This is a nice advance in a fairly new, rapidly growing, and important literature on social biases in AI systems, one that has very deep connections to regularities in human cognition, social psychology, and language use, and huge implications for machine-learned AI systems. The paper finds substantial biases persisting across a range of the latest-and-greatest models, and in particular within the contextually aware models (ELMo/LSTMs, BERT/GPT/self-attention) that recently supplanted acontextual word embeddings in current NLP research.

Right now, I believe there are two competing frameworks for testing bias in word embeddings. Bolukbasi's linear projection analysis was, in my opinion, fairly convincingly critiqued by Gonen and Goldberg at NAACL this year; fortunately, this paper uses the understudied (in NLP/ML) criterion from Caliskan, which compares within-group versus between-group pairwise similarities of words in two groups.

The authors argue that the previously proposed method for analyzing contextual models (May et al.'s "SEAT") was insufficient. I agree that it seems so, but the proposal in this work, to use the top layer of per-token embeddings, seems like an obvious step. Being better than May et al. does not by itself make this work a big advance; rather, it shows that May et al. was oddly limited. Still, the test is insufficiently explained, even after reading the appendix. The Caliskan test (eq. 1-4) is presented in terms of averages among "concept embeddings", but the tests are defined by wordlists. Is there averaging among a word type's instances in the corpus? Or are token-to-token similarities used to calculate the distances?

This work re-examines a number of the wordlist tests from previous work on these newer models, and proposes a few new wordlist tests as well. I wish the new wordlist tests were explained more; it is a difficult socio/cultural/linguistic problem to create a robust and social-scientifically valid wordlist to draw conclusions from. For example, for the new or modified versions of the Heilman double bind (a nontrivial phenomenon), what exactly was done for "we also extend the attribute lists of competence and likability to the context of race"? In general, since there was heavy reliance on May et al., I found it a little hard to follow what exactly was being done in parts of this paper.

The comparison of race versus gender bias was interesting, though I wish there had been more. I don't love the approach of counting significant effects. Please see the ASA's critique of p-values from 2016 ("Statement on Statistical Significance and P-Values"). Effect size may be an important additional way to interpret the meaning of the differences among these models. Or maybe I'm wrong and there's a better way to analyze the data out there. I'm not convinced that what's done in this paper is a great way to do it.

The results read like straightforward documentation, without much insight into what's going on. For example, did any of the previous studies find negative effects, or are these examples for BERT/GPT the very first time they have been observed? What do they mean? How do the findings on intersectional identities relate to the social science literature and scholarship on them?

This style of paper and analysis is potentially very interesting, and can be a huge contribution to the field. For example, in terms of one specific finding, I was impressed by the Caliskan paper's connection of real-world employment data all the way to what word embeddings learn; it's a nice illustration of how AI systems are deeply embedded in both concrete and stereotypical facts about society. To make this type of contribution, the discussion and analysis are key: they need to go beyond reporting numeric results and get into the social and AI-system hypotheses the work has implications for. As a technical issue, context could be quite interesting (e.g. do certain contexts mediate or affect bias?), and as a social issue, intersectionality could be fascinating. Unfortunately, the interesting aspects of this area don't come through strongly in this paper. The work has lots of potential, and I hope the authors can improve or explain it better to get there.
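
The suggestion to report effect sizes alongside counts of significant tests could look like the following minimal sketch. It uses the standardized effect size from the WEAT literature (difference of mean associations divided by the standard deviation over all target words); it is not taken from the paper under review. The function names and the choice of the sample standard deviation (`ddof=1`) are assumptions, and embeddings are assumed to be NumPy arrays with one row per word or token.

```python
import numpy as np

def association(W, A, B):
    """s(w, A, B) per row of W: mean cosine to attribute set A minus mean cosine to B."""
    def unit(M):
        return M / np.linalg.norm(M, axis=1, keepdims=True)
    Wn, An, Bn = unit(W), unit(A), unit(B)
    return (Wn @ An.T).mean(axis=1) - (Wn @ Bn.T).mean(axis=1)

def effect_size(X, Y, A, B):
    """Standardized difference in association between target sets X and Y.

    Assumption: the denominator is the sample standard deviation (ddof=1)
    of s(w, A, B) over all target words in X and Y.
    """
    s_x = association(X, A, B)
    s_y = association(Y, A, B)
    pooled = np.concatenate([s_x, s_y])
    return (s_x.mean() - s_y.mean()) / pooled.std(ddof=1)
```

Unlike a significance count, this quantity is comparable across models and tests: its sign gives the direction of the association, and its magnitude is bounded by 2.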

Reviewer 3

The authors use known association tests as the basis for their analysis and include existing measures and bleached sentences. Their primary contribution is the addition of race and a different approach for analyzing contextualized word embeddings (not moving to the sentence level). They find slight but predictable correlations on known biases across the population splits they analyze, as well as differences across both model sizes and datasets. This includes the surprising result that larger models appear to exhibit fewer of the known issues than expected.