NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1828
Title: SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Reviewer 1

This paper establishes a new set of benchmark datasets to evaluate "general-purpose language understanding". The benchmark contains six tasks: one general-purpose NLI (RTE), one NLI focused on embedded clauses and beliefs (CB), one causality (COPA), one QA (MultiRC), one WSD (WiC), and one coref (WSC). These tasks were selected from a larger pool, filtered based on the effectiveness of current SOTA models such as BERT. Building systems that do well on these tasks should help advance the state of the art in language understanding of this type.

As the authors say, the GLUE benchmark has been very helpful as an evaluation testbed for pre-training methods such as ELMo and BERT, multi-task learning methodology, transfer learning, and more. But as performance gets higher, the shortcomings of that benchmark become more apparent, so the creation of a new set of benchmark tasks will no doubt spur more research. I'll discuss some pros and cons of this approach:

Pros:
(1) The dataset pulls in text from a variety of domains (Table 1), with a range of NLP problem formats and phenomena tested.
(2) This dataset has fewer benchmarks than GLUE. I view this as a pro: it means that things won't be so NLI-focused by virtue of having several similar benchmarks, and it hopefully means that researchers will do more in-depth analysis on each dataset rather than simply reporting the average (six being a somewhat manageable number to deal with).
(3) As I mentioned above, the authors are very forthcoming about the limits of their datasets (biases, focus on standard written English, etc.), which I think is a good trend.

Cons:
(1) The format is less uniform than GLUE. This is probably inevitable when scaling to harder problems, but it may discourage some folks who were attracted to the highly uniform framework of GLUE.

Neutral:
(1) The datasets are smaller in scale. This will probably make pre-trained approaches even more critical for high performance here, potentially narrowing the types of work that are evaluated on the benchmark. However, there may be a lower computational barrier to entry, which may democratize research on this benchmark.

-----------

Overall, this is a resource that I believe will be quite useful for a lot of authors. Its construction is not "original" as such, but I believe the authors are leading by example in terms of methodology for creating task suites, which will become more important as NLP moves beyond single i.i.d. train-test splits. The work done so far is of high quality, so I think we can expect that the framework code will prove to be of the same standard. This paper is very clearly written and was a pleasure to read.

==============

Thanks for the author response. I had a high opinion of this work before, and the authors have provided some sensible discussion of the points raised by the reviewers. I am still highly in favor of accepting this work.

Reviewer 2

# Originality
The paper clearly explains the previous work. It builds upon the success of GLUE, and combines it with (1) new desiderata for task selection, (2) new governance, and (3) human baseline evaluations.

# Quality
The key claim in the paper was that SOTA LMs perform much worse on SuperGLUE than humans do. This claim was well supported by the experimental results in the paper.

# Clarity
The authors clearly explained how they evaluated the humans and the models to form their baseline, as well as the criteria used in selecting the baseline.

# Significance
This is the most difficult and comprehensive benchmark that I have seen for LMs, and I expect that it will help drive further research progress for the community.

Reviewer 3

Overall, this is a good-quality paper. However, a few details could be improved.

1. Can the authors assign a name to their metric? It would help others to adopt it more easily.
2. Can the authors better categorize the tasks according to the different aspects of a system's language understanding ability that each task can reveal? Basically, is there a deeper motivation other than that the chosen tasks are "more difficult"?
3. Different tasks will overlap in what they reflect about a system's language understanding ability. Have the authors considered such bias in the overall metric?
4. Could the authors provide more dataset stats on WSC? The original test set is sometimes considered to be too small. How large is the test set here?

-------------------------------

I thank the authors for providing such detailed responses, which have addressed all my concerns. Thus, I am increasing the score to 8.