Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
Denny Wu, Ji Xu
We consider the linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}_\star + \boldsymbol{\epsilon}$ with $\mathbf{X} \in \mathbb{R}^{n \times p}$ in the overparameterized regime $p > n$. We estimate $\boldsymbol{\beta}_\star$ via generalized (weighted) ridge regression: $\hat{\boldsymbol{\beta}}_\lambda = (\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Sigma}_w)^{\dagger}\mathbf{X}^\top\mathbf{y}$, where $\boldsymbol{\Sigma}_w$ is the weighting matrix. Assuming a random-effects model with general data covariance $\boldsymbol{\Sigma}_x$ and an anisotropic prior on the true coefficients $\boldsymbol{\beta}_\star$, i.e., $\mathbb{E}\,\boldsymbol{\beta}_\star\boldsymbol{\beta}_\star^\top = \boldsymbol{\Sigma}_\beta$, we provide an exact characterization of the prediction risk $\mathbb{E}(y - \mathbf{x}^\top\hat{\boldsymbol{\beta}}_\lambda)^2$ in the proportional asymptotic limit $p/n \to \gamma \in (1, \infty)$. Our general setup leads to a number of interesting findings. We outline precise conditions that determine the sign of the optimal setting $\lambda_{\mathrm{opt}}$ of the ridge parameter $\lambda$, and confirm the implicit $\ell_2$ regularization effect of overparameterization, which theoretically justifies the surprising empirical observation that $\lambda_{\mathrm{opt}}$ can be \textit{negative} in the overparameterized regime. We also characterize the double descent phenomenon for principal component regression (PCR) when both $\mathbf{X}$ and $\boldsymbol{\beta}_\star$ are anisotropic. Finally, we determine the optimal weighting matrix $\boldsymbol{\Sigma}_w$ for both the ridgeless ($\lambda \to 0$) and optimally regularized ($\lambda = \lambda_{\mathrm{opt}}$) cases, and demonstrate the advantage of the weighted objective over standard ridge regression and PCR.
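As a minimal numerical sketch (not the authors' implementation), the setup can be simulated as follows: draw $\boldsymbol{\beta}_\star$ with second moment $\boldsymbol{\Sigma}_\beta$, form the generalized ridge estimator, and evaluate the population prediction risk. The dimensions, covariance spectra, noise level, and $\lambda$ grid below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized regime: p > n, aspect ratio gamma = p / n = 2.
n, p = 200, 400

# Illustrative anisotropic spectra for the data covariance Sigma_x and the
# prior covariance Sigma_beta (diagonal here for simplicity; the paper
# allows general positive semi-definite matrices).
eig_x = np.linspace(0.5, 2.0, p)
eig_beta = np.linspace(2.0, 0.5, p)
Sigma_x = np.diag(eig_x)

# Random-effects model: E[beta_star beta_star^T] = Sigma_beta.
beta_star = rng.normal(size=p) * np.sqrt(eig_beta)

# Rows of X have covariance Sigma_x; the noise level is an assumption.
X = rng.normal(size=(n, p)) * np.sqrt(eig_x)
sigma = 0.5
y = X @ beta_star + sigma * rng.normal(size=n)

def weighted_ridge(X, y, lam, Sigma_w):
    """Generalized ridge estimator (X^T X + lam * Sigma_w)^+ X^T y.
    The pseudoinverse covers the ridgeless case lam -> 0 (yielding the
    minimum-norm interpolator) as well as negative lam."""
    return np.linalg.pinv(X.T @ X + lam * Sigma_w) @ (X.T @ y)

def prediction_risk(beta_hat):
    """Population risk E(y - x^T beta_hat)^2 on a fresh draw:
    (beta_hat - beta_star)^T Sigma_x (beta_hat - beta_star) + sigma^2."""
    d = beta_hat - beta_star
    return d @ Sigma_x @ d + sigma**2

# Sweep lambda over a grid that includes negative values: in the
# overparameterized regime the risk-minimizing lambda can be negative.
for lam in (-0.5, 0.0, 0.5, 1.0):
    beta_hat = weighted_ridge(X, y, lam, Sigma_w=np.eye(p))  # Sigma_w = I: standard ridge
    print(f"lambda = {lam:+.2f}  risk = {prediction_risk(beta_hat):.3f}")
```

Passing a different positive semi-definite matrix as `Sigma_w` gives the weighted objective; with `Sigma_w = np.eye(p)` the sketch reduces to standard ridge regression.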