
Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective

Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track


Authors

Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu

Abstract

We consider the learning of a single-index target function $f_*:\mathbb{R}^d\to\mathbb{R}$ under spiked covariance data: $f_*(\boldsymbol{x}) = \sigma_*\big(\tfrac{1}{\sqrt{1+\theta}}\langle\boldsymbol{x},\boldsymbol{\mu}\rangle\big)$, $\boldsymbol{x}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\boldsymbol{I}_d + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top)$, $\theta\asymp d^{\beta}$ for $\beta\in[0,1)$, where the link function $\sigma_*:\mathbb{R}\to\mathbb{R}$ is a degree-$p$ polynomial with information exponent $k$ (defined as the lowest degree in the Hermite expansion of $\sigma_*$), and it depends on the projection of the input $\boldsymbol{x}$ onto the spike (signal) direction $\boldsymbol{\mu}\in\mathbb{R}^d$. In the proportional asymptotic limit where the number of training examples $n$ and the dimensionality $d$ jointly diverge, $n,d\to\infty$, $n/d\to\psi\in(0,\infty)$, we ask the following question: how large should the spike magnitude $\theta$ (i.e., the strength of the low-dimensional component) be in order for (i) kernel methods and (ii) neural networks optimized by gradient descent to learn $f_*$? We show that for kernel ridge regression, $\beta\ge 1-\frac{1}{p}$ is both sufficient and necessary, whereas for two-layer neural networks trained with gradient descent, $\beta>1-\frac{1}{k}$ suffices. Our results demonstrate that both kernel methods and neural networks benefit from low-dimensional structures in the data. Furthermore, since $k\le p$ by definition, neural networks can adapt to such structures more effectively.
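To make the data model concrete, the following minimal NumPy sketch (not part of the paper) samples inputs with covariance $\boldsymbol{I}_d + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top$ and evaluates one illustrative link function, $\sigma_* = \mathrm{He}_k + \mathrm{He}_p$: a degree-$p$ polynomial whose lowest nonzero Hermite coefficient sits at degree $k$, so its information exponent is $k$. The function names and parameter values (`sample_spiked_data`, `target`, $k=2$, $p=4$, $\beta=0.6$) are choices made for illustration only.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def sample_spiked_data(n, d, beta, rng):
    """Draw x ~ N(0, I_d + theta * mu mu^T) with spike magnitude theta = d**beta."""
    theta = d ** beta
    mu = rng.standard_normal(d)
    mu /= np.linalg.norm(mu)               # unit spike direction
    z = rng.standard_normal((n, d))        # isotropic component
    s = rng.standard_normal((n, 1))        # extra variance along mu
    x = z + np.sqrt(theta) * s * mu        # covariance I_d + theta * mu mu^T
    return x, mu, theta

def target(x, mu, theta, k=2, p=4):
    """Single-index target f_*(x) = sigma_*(<x, mu> / sqrt(1 + theta)),
    with sigma_* = He_k + He_p (probabilists' Hermite polynomials), so the
    lowest degree in the Hermite expansion, i.e. the information exponent, is k."""
    proj = x @ mu / np.sqrt(1.0 + theta)   # projection onto the spike, rescaled to unit variance
    coef_k = np.zeros(k + 1); coef_k[k] = 1.0
    coef_p = np.zeros(p + 1); coef_p[p] = 1.0
    return hermeval(proj, coef_k) + hermeval(proj, coef_p)

rng = np.random.default_rng(0)
n, d, beta = 2000, 1000, 0.6               # proportional regime: psi = n/d = 2
X, mu, theta = sample_spiked_data(n, d, beta, rng)
y = target(X, mu, theta)
```

Dividing the projection $\langle\boldsymbol{x},\boldsymbol{\mu}\rangle$ by $\sqrt{1+\theta}$ matches the normalization in the target definition: along the spike direction the input has variance $1+\theta$, so the argument of $\sigma_*$ is a standard Gaussian.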