Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track
Alex Damian, Eshaan Nichani, Rong Ge, Jason Lee
We focus on the task of learning a single index model $\sigma(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions. Prior work has shown that the sample complexity of learning $w^\star$ is governed by the information exponent $k^\star$ of the link function $\sigma$, which is defined as the index of the first nonzero Hermite coefficient of $\sigma$. Ben Arous et al. (2021) showed that $n \gtrsim d^{k^\star - 1}$ samples suffice for learning $w^\star$ and that this is tight for online SGD. However, the CSQ lower bound for gradient-based methods only shows that $n \gtrsim d^{k^\star/2}$ samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns $w^\star$ with $n \gtrsim d^{k^\star/2}$ samples. We also draw connections to statistical analyses of tensor PCA and to the implicit regularization effects of minibatch SGD on empirical losses.
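For concreteness, the information exponent mentioned above can be written through the Hermite expansion of the link function (normalization conventions aside):
\[
\sigma(z) = \sum_{j \ge 0} c_j \,\mathrm{He}_j(z), \qquad k^\star = \min\{\, j \ge 1 : c_j \neq 0 \,\},
\]
so that, for example, $\sigma(z) = \mathrm{He}_3(z) = z^3 - 3z$ has $k^\star = 3$.

The sketch below is an illustrative, hedged version of "online SGD on a smoothed loss" for this setting, not the paper's exact algorithm: it assumes a squared loss, smooths by averaging stochastic gradients over a few random tangent perturbations of $w$ of radius `lam`, and projects each iterate back to the unit sphere. The link function, hyperparameters, and helper names (`sigma`, `lam`, `n_smooth`, `grad_sample`) are placeholder choices for illustration.

```python
import numpy as np

# Illustrative sketch (assumptions noted above): online SGD with one fresh
# Gaussian sample per step, Monte Carlo smoothing of the loss landscape by
# averaging gradients at perturbed directions, and projection to the sphere.

d = 256                           # ambient dimension
sigma = lambda z: z**3 - 3 * z    # example link He_3, information exponent k* = 3
dsigma = lambda z: 3 * z**2 - 3   # its derivative

rng = np.random.default_rng(0)
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)
w = rng.normal(size=d); w /= np.linalg.norm(w)

lam, eta, n_steps, n_smooth = 1.0, 1e-3, 50_000, 8   # placeholder hyperparameters

def grad_sample(w, x, y):
    # gradient of 0.5 * (sigma(w . x) - y)^2 with respect to w
    z = w @ x
    return (sigma(z) - y) * dsigma(z) * x

for t in range(n_steps):
    x = rng.normal(size=d)            # fresh sample each step (online SGD)
    y = sigma(w_star @ x)
    g = np.zeros(d)
    for _ in range(n_smooth):         # smooth: average gradients at perturbed w
        u = rng.normal(size=d)
        u -= (u @ w) * w              # perturb tangentially to w
        u /= np.linalg.norm(u)
        w_pert = w + lam * u
        w_pert /= np.linalg.norm(w_pert)
        g += grad_sample(w_pert, x, y)
    g /= n_smooth
    w -= eta * (g - (g @ w) * w)      # spherical (projected) gradient step
    w /= np.linalg.norm(w)

print("alignment |<w, w*>| =", abs(w @ w_star))
```

The final alignment $|\langle w, w^\star\rangle|$ serves as a simple check of recovery; in this sketch the smoothing radius is held fixed rather than scheduled.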