Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
Kevin Scaman, Cédric Malherbe
This work proposes a novel analysis of stochastic gradient descent (SGD) for non-convex and smooth optimization. Our analysis sheds light on the impact of the probability distribution of the gradient noise on the convergence rate of the norm of the gradient. In the case of sub-Gaussian and centered noise, we prove that, with probability $1-\delta$, the number of iterations to reach a precision $\varepsilon$ for the squared gradient norm is $O(\varepsilon^{-2}\ln(1/\delta))$. In the case of centered and integrable heavy-tailed noise, we show that, while the expectation of the iterates may be infinite, the squared gradient norm still converges with probability $1-\delta$ in $O(\varepsilon^{-p}\delta^{-q})$ iterations, where $p,q > 2$. This result shows that heavy-tailed noise on the gradient slows down the convergence of SGD without preventing it, proving that SGD is robust to gradient noise with unbounded variance, a setting of interest for Deep Learning. In addition, it indicates that choosing a step size proportional to $T^{-1/b}$, where $b$ is the tail parameter of the noise and $T$ is the number of iterations, leads to the best convergence rates. Both results are simple corollaries of a unified analysis using the novel concept of biased expectations, a simple and intuitive mathematical tool to obtain concentration inequalities. Using this concept, we propose a new quantity to measure the amount of noise added to the gradient, and discuss its value in multiple scenarios.
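To make the setting concrete, here is a minimal, purely illustrative sketch (not code from the paper): SGD on a smooth non-convex toy objective with centered, heavy-tailed gradient noise, using a constant step size scaled as $T^{-1/b}$ as suggested in the abstract. The objective, the symmetric Pareto-type noise, and all constants (dimension, horizon, step-size constant, seed) are assumptions chosen only for the demonstration.

```python
# Illustrative sketch only (not the paper's code or experiments).
import numpy as np


def grad_f(x):
    # Gradient of the toy objective f(x) = sum_i log(1 + x_i^2),
    # which is smooth and non-convex.
    return 2.0 * x / (1.0 + x ** 2)


def heavy_tailed_noise(rng, size, b):
    # Symmetric Pareto-type noise: zero mean (integrable since b > 1),
    # infinite variance for b <= 2; moments of order strictly below b are finite.
    signs = rng.choice([-1.0, 1.0], size=size)
    return signs * rng.pareto(b, size=size)


def sgd_heavy_tailed(T=100_000, d=10, b=1.5, c=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 5.0 * rng.normal(size=d)        # arbitrary initialization
    eta = c * T ** (-1.0 / b)           # step size proportional to T^{-1/b}
    best_sq_grad = np.inf
    for _ in range(T):
        g = grad_f(x)
        # Track min_t ||grad f(x_t)||^2, the quantity whose convergence is analyzed.
        best_sq_grad = min(best_sq_grad, float(np.dot(g, g)))
        x -= eta * (g + heavy_tailed_noise(rng, d, b))
    return best_sq_grad


if __name__ == "__main__":
    print(sgd_heavy_tailed())
```

Despite the noise having unbounded variance, the minimum squared gradient norm along the trajectory typically decreases with $T$, which is the behavior the high-probability bounds above describe.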