Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track
Atli Kosson, Bettina Messmer, Martin Jaggi
Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size Δw_t = η_t u_t early in training by using lower values for the learning rate η_t. In this work we argue that warmup benefits training by keeping the overall size of Δw_t limited, counteracting large initial values of u_t. Focusing on small-scale GPT training with AdamW/Lion, we explore the following question: *Why and by which criteria are early updates u_t too large?* We analyze different metrics for the update size, including the ℓ2-norm, the resulting directional change, and the impact on the representations of the network, providing a new perspective on warmup. In particular, we find that warmup helps counteract large angular updates as well as a limited critical batch size early in training. Finally, we show that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize u_t based on the aforementioned metrics.
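As a rough illustration of the final point, the sketch below wraps a standard PyTorch AdamW step and rescales each parameter's raw update so that its ℓ2-norm never exceeds a fixed fraction of the weight norm, limiting how far the weights can move (and rotate) in a single early step. This is a minimal sketch under stated assumptions, not the paper's exact method: the function name `normalized_step`, the `relative_cap` parameter, and the capping rule are illustrative choices.

```python
# Minimal sketch (not the paper's exact method): cap each parameter's raw
# update u = w_after - w_before at a fixed fraction of the weight norm,
# so early steps cannot change the weights too much. `relative_cap` and
# the capping rule are illustrative assumptions.
import torch


def normalized_step(optimizer, relative_cap=2e-3, eps=1e-12):
    """Run one optimizer step, then shrink any update whose l2-norm exceeds
    relative_cap * ||w_before|| (the update direction is preserved)."""
    # Snapshot the weights that are about to be updated.
    before = {
        p: p.detach().clone()
        for group in optimizer.param_groups
        for p in group["params"]
        if p.grad is not None
    }
    optimizer.step()
    with torch.no_grad():
        for p, w0 in before.items():
            u = p.detach() - w0                    # raw update applied by the optimizer
            limit = relative_cap * w0.norm()       # allowed update size
            scale = min(1.0, (limit / (u.norm() + eps)).item())
            p.copy_(w0 + scale * u)                # capped update


# Usage: a learning-rate warmup schedule becomes less important to the
# extent the cap already limits early update sizes.
model = torch.nn.Linear(16, 16)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randn(8, 16)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
normalized_step(opt, relative_cap=2e-3)
```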