Emmanuel Abbe, Pritish Kamath, Eran Malach, Colin Sandon, Nathan Srebro
We study the power of learning via mini-batch stochastic gradient descent (SGD) on the loss of a differentiable model or neural network, and ask what learning problems can be learnt using this paradigm. We show that SGD can always simulate learning with statistical queries (SQ), but its ability to go beyond that depends on the precision $\rho$ of the gradients and the minibatch size $b$. With fine enough precision relative to minibatch size, namely when $b \rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$. Moreover, with polynomially many bits of precision (i.e. when $\rho$ is exponentially small), SGD can simulate PAC learning regardless of the batch size. On the other hand, when $b \rho^2$ is large enough, the power of SGD is equivalent to that of SQ learning.