Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
A well-motivated, well-executed demonstration of permutation-based language modeling for both autoregressive and bidirectional purposes. The permutation strategy is new and solves an important problem around the split between autoregressive and bidirectional models. This is a complete work, but I would have liked more analysis demonstrating how the new methods change the model dynamics or the attention patterns over memory and context, or some empirical attempt at explaining how the new methods create better representations. The paper is clear and well-organized, and many are likely to rely on these ideas or similar ones; it belongs in the thread of literature that succeeds BERT. Update: thank you for including the attentional analysis. If you have the opportunity to examine which tasks are affected most by transfer, and can shed light on which examples are answered correctly by XLNet but not by BERT, that might provide valuable insight into what these pretraining procedures are learning.
The paper proposes a clever solution to bridge the gap between pretraining and fine-tuning a language model (i.e., BERT with its masked-LM objective). The proposed permutation LM allows the model to look at both left and right context, which has been shown to be beneficial for downstream NLP tasks. To achieve this, the paper proposes two-stream self-attention, in which one stream prevents the current word from attending to its own content (otherwise optimization becomes trivial: the model just learns to copy). The authors evaluate their model on downstream NLP tasks, including some challenging ones such as RACE M/H, Yelp-5, and SQuAD 2.0. In all experiments, XLNet is fairly compared with BERT (Large and Base), and the results show that XLNet performs better than BERT on downstream tasks. The paper is well written and easy to follow.
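The two-stream mechanism described above can be illustrated with a small sketch. This is not the authors' code; it is a minimal illustration, assuming a permutation array encodes the factorization order, of how the content-stream and query-stream attention masks differ by exactly one element per row (whether a position may see itself):

```python
# Hypothetical sketch of XLNet-style two-stream attention masks.
# `perm` lists positions in the order they are predicted (the factorization order).
import numpy as np

def two_stream_masks(perm):
    """Return boolean masks where mask[i, j] = True means position i may attend to j.

    Content stream: i sees every token predicted before it in the permutation,
    plus itself (so its content is available to later steps).
    Query stream: i sees only strictly earlier tokens, so the prediction at i
    cannot trivially copy its own embedding.
    """
    n = len(perm)
    rank = np.empty(n, dtype=int)
    rank[perm] = np.arange(n)                  # rank[pos] = step at which pos is predicted
    content = rank[None, :] <= rank[:, None]   # earlier-or-equal rank
    query = rank[None, :] < rank[:, None]      # strictly earlier rank
    return content, query

content, query = two_stream_masks(np.array([2, 0, 3, 1]))
# Position 2 is predicted first, so its query stream attends to nothing,
# while every position's content stream includes itself.
assert not query[2].any()
assert all(content[i, i] for i in range(4))
```

The only difference between the two masks is the diagonal, which is exactly the "cannot attend to itself" constraint the review refers to.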
Originality: The architecture is novel compared to recent lines of language-model work, which all use variations of BERT or GPT (SciBERT, MT-DNN, MASS, etc.). The example (the "New York is a city" one) makes sense, but considering the permutation is random when computing the objective function, I still couldn't see why it works better than sequential order, since humans speak and write in sequential order. Could you add more intuition to the paper? Or have you tried predicting n-grams, as a comparison to permutation? Quality: Very high, considering they ran extensive studies on multiple benchmarks; the ablation study is nicely done as well. However, the comparison is a little unfair, since BERT originally predicts only sub-tokens. So I hope you could compare to BERT whole-word masking (released around the end of May) under the same training corpus, if possible. Clarity: The paper is well written and organized. Significance: Achieving first place on multiple benchmarks and more than 3500 stars on their repo explains the significance of this work. I can hardly believe there will be many direct follow-ups on this work, because probably only a small number of people can afford to train it. But people will use it as the base architecture and fine-tune on top of it, which still benefits the whole NLP community. I am very satisfied with their new ablation studies and will increase my score.