Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track
Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu
We consider the learning of a single-index target function $f_*:\mathbb{R}^d\to\mathbb{R}$ under spiked covariance data: $f_*(\boldsymbol{x}) = \sigma_*\big(\tfrac{1}{\sqrt{1+\theta}}\langle\boldsymbol{x},\boldsymbol{\mu}\rangle\big)$, $\boldsymbol{x}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\boldsymbol{I}_d + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top)$, $\theta\asymp d^{\beta}$ for $\beta\in[0,1)$, where the link function $\sigma_*:\mathbb{R}\to\mathbb{R}$ is a degree-$p$ polynomial with information exponent $k$ (defined as the lowest degree in the Hermite expansion of $\sigma_*$), and $f_*$ depends on the projection of the input $\boldsymbol{x}$ onto the spike (signal) direction $\boldsymbol{\mu}\in\mathbb{R}^d$. In the proportional asymptotic limit where the number of training examples $n$ and the dimensionality $d$ jointly diverge, $n,d\to\infty$, $n/d\to\psi\in(0,\infty)$, we ask the following question: how large should the spike magnitude $\theta$ (i.e., the strength of the low-dimensional component) be in order for (i) kernel methods and (ii) neural networks optimized by gradient descent to learn $f_*$? We show that for kernel ridge regression, $\beta\ge 1-\frac{1}{p}$ is both sufficient and necessary, whereas for two-layer neural networks trained with gradient descent, $\beta>1-\frac{1}{k}$ suffices. Our results demonstrate that both kernel methods and neural networks benefit from low-dimensional structures in the data; moreover, since $k\le p$ by definition, neural networks can adapt to such structures more effectively.
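To make the setup concrete, the following is a minimal illustrative sketch (not from the paper's code) of the data model and of the information exponent $k$. It assumes a specific link $\sigma_*(z) = z^2 - 1$ (the second probabilists' Hermite polynomial, so $p = k = 2$), and the helper names sample_spiked_data and information_exponent are our own; the Hermite coefficients are estimated numerically by Gauss-Hermite quadrature.

import math
import numpy as np
from numpy.polynomial import hermite_e as He

def information_exponent(sigma, max_degree=10, tol=1e-8, quad_points=200):
    """Smallest j >= 1 with nonzero Hermite coefficient E_{z~N(0,1)}[sigma(z) He_j(z)]."""
    z, w = He.hermegauss(quad_points)           # nodes/weights for the weight exp(-z^2/2)
    w = w / np.sqrt(2 * np.pi)                  # normalize so the weights give an N(0,1) expectation
    for j in range(1, max_degree + 1):
        Hj = He.hermeval(z, [0.0] * j + [1.0])  # probabilists' Hermite polynomial He_j(z)
        cj = np.sum(w * sigma(z) * Hj) / math.factorial(j)
        if abs(cj) > tol:
            return j
    return None

def sample_spiked_data(n, d, theta, mu, sigma, rng):
    """Draw x ~ N(0, I_d + theta * mu mu^T) and y = sigma(<x, mu> / sqrt(1 + theta))."""
    g = rng.standard_normal((n, d))
    # x = (I + (sqrt(1+theta) - 1) mu mu^T) g has covariance I_d + theta * mu mu^T for unit-norm mu
    x = g + (np.sqrt(1.0 + theta) - 1.0) * np.outer(g @ mu, mu)
    y = sigma(x @ mu / np.sqrt(1.0 + theta))    # the argument of sigma is standard Gaussian
    return x, y

def sigma_star(z):
    # degree-p = 2 polynomial link with information exponent k = 2
    return z ** 2 - 1.0

rng = np.random.default_rng(0)
d, beta = 512, 0.5
theta = d ** beta                               # spike magnitude theta ~ d^beta
mu = rng.standard_normal(d)
mu /= np.linalg.norm(mu)
x, y = sample_spiked_data(4 * d, d, theta, mu, sigma_star, rng)
print(information_exponent(sigma_star))         # -> 2

In this example $\beta = 1/2 = 1 - \tfrac{1}{p} = 1 - \tfrac{1}{k}$, the boundary scaling discussed in the abstract; the thresholds $\beta \ge 1-\tfrac{1}{p}$ (kernel ridge regression) and $\beta > 1-\tfrac{1}{k}$ (gradient-trained two-layer networks) only differ for links whose information exponent $k$ is strictly below the degree $p$.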