NeurIPS 2020

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

Meta Review

The proposed method for training BERT is practically useful. My main concern is that the novelty of the paper is somewhat limited: it combines two existing techniques. One is PreLN, which has been well studied in the literature for training BERT; the other is stochastic layer dropping, which was first proposed for training computer vision models. On the other hand, combining these two techniques effectively and tuning them to work for training BERT takes some effort. The provided training recipe should be of interest to practitioners.