Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track
Arthur Jacot, Eugene Golikov, Clement Hongler, Franck Gabriel
We study the loss surface of DNNs with L2 regularization. We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations Zℓ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation Zℓ is optimal with respect to an attraction/repulsion problem and interpolates between the input and output representations, keeping as little information from the input as necessary to construct the activations of the next layer. For positively homogeneous non-linearities, the loss can be further reformulated in terms of the covariances of the hidden representations, which takes the form of a partially convex optimization problem over a convex cone.

This second reformulation allows us to prove a sparsity result for homogeneous DNNs: any local minimum of the L2-regularized loss can be achieved with at most N(N+1) neurons in each hidden layer (where N is the size of the training set). We show that this bound is tight by giving an example of a local minimum that requires N²/4 hidden neurons. But we also observe numerically that in more traditional settings far fewer than N² neurons are required to reach the minima.
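To make the sparsity claim concrete, here is a minimal sketch (not code from the paper) of the kind of numerical observation described above: it trains an over-wide ReLU network with weight decay (L2 regularization) in PyTorch on a toy dataset and counts how many hidden neurons still carry non-negligible incoming and outgoing weights at the end. The architecture, dataset, optimizer settings, and the tolerance used to call a neuron "active" are all illustrative assumptions.

```python
# Illustrative only: train a small ReLU MLP with weight decay and count how many
# hidden neurons remain "active", i.e. are not effectively pruned by the L2 penalty.
import torch
import torch.nn as nn

torch.manual_seed(0)

N, d_in, d_out, width = 32, 5, 1, 256   # N training points, deliberately wide hidden layers
X = torch.randn(N, d_in)
Y = torch.randn(N, d_out)

model = nn.Sequential(
    nn.Linear(d_in, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, d_out),
)

# weight_decay adds the L2 regularization term on the parameters to the objective.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
loss_fn = nn.MSELoss()

for step in range(10000):
    opt.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    opt.step()

def active_neurons(w_in, w_out, tol=1e-3):
    # A neuron counts as active if both its incoming and outgoing weights
    # have non-negligible norm; otherwise it contributes nothing to the network.
    in_norm = w_in.norm(dim=1)    # per-neuron norm of incoming weights (rows)
    out_norm = w_out.norm(dim=0)  # per-neuron norm of outgoing weights (columns)
    return int(((in_norm * out_norm) > tol).sum())

layers = [m for m in model if isinstance(m, nn.Linear)]
for i in range(len(layers) - 1):
    n_active = active_neurons(layers[i].weight.data, layers[i + 1].weight.data)
    print(f"hidden layer {i + 1}: {n_active} active neurons out of {width}")
```

Under these (illustrative) settings, the number of active neurons per hidden layer typically ends up well below the width, consistent with the observation that far fewer than N² neurons are needed at the minima in ordinary training setups.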