NeurIPS 2020

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Meta Review

This is another paper in the recent sequence of 'distilling contextualized word embeddings' papers. Its primary innovation is that it distills only the final layer of the transformer; it also adopts a 'teacher assistant' mechanism (introduced in previous work) to improve the performance of the final student model. The resulting method is simpler than competing work while performing better across a relatively extensive set of experiments (i.e., GLUE), especially when the supplementary material is included. The reviews were positive to begin with, and the concerns raised were addressed during the rebuttal; I therefore recommend acceptance for publication.