Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
This paper provides a method to pretrain a single Transformer architecture on three objectives: (i) a unidirectional language model (e.g. like GPT-2), (ii) a bidirectional language model (e.g. like BERT), and (iii) a sequence-to-sequence language model (e.g. like a standard encoder-decoder architecture, with a bidirectional network to encode the source information, paired with a unidirectional decoder to generate the target sequence). This unified architecture circumvents the shortcomings of models like BERT (which can condition on bidirectional context, but is harder to use for downstream generation tasks due to its bidirectionality) and GPT-2 (easy to apply to generation tasks since it works left-to-right, but bidirectional encoders have been known to work much better than unidirectional ones in sequence-to-sequence models), thereby combining the best of both worlds. This is done using a simple masking scheme that restricts which words the model can attend to, depending on which objective function is used (e.g. under the unidirectional, left-to-right objective, all tokens to the right of the target word are masked out). Experiments on text summarisation (CNN/DailyMail and Gigaword), question answering (SQuAD, CoQA extractive, and CoQA abstractive), question generation, and GLUE indicate that the proposed pretraining approach largely matches or surpasses the current state of the art.

- Originality: this paper addresses the important problem of unifying the different language modelling pretraining objectives (unidirectional/bidirectional/seq2seq) in a single model, thus circumventing the limitations of earlier work. Their masking approach crucially enables pretraining the two key ingredients of sequence-to-sequence models with a single model: (i) a bidirectional encoder, and (ii) a unidirectional decoder.
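To make the masking idea concrete, here is a minimal sketch of the three self-attention masks (the function name, the 1/0 convention, and the shapes are illustrative assumptions of mine, not the authors' implementation):

```python
import numpy as np

def attention_mask(seq_len, mode, src_len=None):
    """Illustrative self-attention masks for the three pretraining
    objectives (1 = position may be attended to, 0 = masked out).
    `mode` and `src_len` are this sketch's own conventions."""
    if mode == "bidirectional":
        # BERT-style: every token attends to every token.
        return np.ones((seq_len, seq_len), dtype=int)
    if mode == "unidirectional":
        # GPT-style: each token attends only to itself and the left context.
        return np.tril(np.ones((seq_len, seq_len), dtype=int))
    if mode == "seq2seq":
        # Source tokens attend bidirectionally within the source;
        # target tokens attend to the full source plus preceding targets.
        mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
        mask[:src_len, :src_len] = 1
        return mask
    raise ValueError(f"unknown mode: {mode}")
```

Under this sketch the same backbone serves all three objectives, with only the mask changing per training batch, which is exactly what lets a bidirectional "encoder" and a unidirectional "decoder" be pretrained in one model.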
In earlier work, the advantage of language model pretraining has mostly been shown for classification tasks, but this work crucially demonstrates a way of incorporating such a pretraining method into language generation tasks (with substantial improvements to show for summarisation, question answering, and question generation).

- Quality: The proposed masking approach is simple yet effective. All the claims of the method's success are backed by empirical findings on multiple tasks and datasets.

- Clarity: This paper clearly explains the method used, for instance via Figure 1. Overall the method is presented well and the paper is easy to understand. The paper also contains sufficient technical detail, such as the hyper-parameters used and any additional tweaks, to help reproducibility. I have some minor comments regarding the presentation in point 5 ("Improvements") below.

- Significance: This paper introduces an effective way to combine different language modelling objectives to improve text generation in sequence-to-sequence models, which is an important and welcome contribution to the emerging field of transfer learning in NLP.

-----

Update after authors' response: the response addressed most of my concerns. Overall, I believe that the method and empirical findings help make progress towards data-efficient NLP (particularly for generation tasks) and would be useful for the community. I am therefore maintaining my overall score of "7".
1. There do exist challenges in applying pretrained LMs to both NLU and NLG tasks. This paper takes advantage of BERT's masked LM objective and GPT's architecture. Combining three types of LM objectives is a straightforward but effective extension.

2. For the second claimed advantage of this paper (lines 45-48), there are no experiments comparing pretrained LMs with different objectives. I also have doubts about why single-objective LMs would overfit, since they are trained on a large-scale corpus.

3. The generation experiments are not convincing enough, because the authors only conduct experiments on summarization and question generation. I would like to see more experimental results on other generation tasks, such as machine translation and response generation.
I have read the authors' response. While the response doesn't fully answer my concern, i.e., providing results of training their large-sized model from scratch, it is a reasonable response. Also, given the results and experiments conducted, I think the paper benefits the community. As such, I have increased my score from 6 to 7.

---

This paper extends BERT to propose a multi-task pretraining approach. The tasks are language modeling (left-to-right, right-to-left), BERT's objective, and seq2seq generation. All three tasks share the same Transformer backbone (BERT large), and the different tasks are accomplished through masking matrices. The authors present strong improvements across a wide range of tasks, such as abstractive summarization (CNN/DailyMail), question answering (SQuAD, CoQA) using both extractive and generative approaches, and question generation. My only concern is the initialization of this model from BERT large: what happens if this model is trained from scratch? Or what happens if BERT large simply continues to be trained on their version of the data?