The paper shows that expanding a given layer of a neural network into a composition of multiple linear layers leads to higher performance than training the equivalent reduced network (i.e., one in which the composed linear layers are collapsed into a single layer). The results are convincing and the ablation analysis is thorough. I encourage the authors to further probe whether the same results hold for standard architectures such as ResNet, and whether theoretical insights can be gleaned from the empirical findings.
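For concreteness, here is a minimal PyTorch sketch of the construction as I understand it; the class name `ExpandedLinear` and the specific widths/depths are my own illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ExpandedLinear(nn.Module):
    """One linear layer expressed as a composition of linear factors.

    Illustrative only: `depth` linear layers with no nonlinearity in
    between, so the block computes a single linear map overall.
    """
    def __init__(self, in_features, out_features, hidden=64, depth=3):
        super().__init__()
        dims = [in_features] + [hidden] * (depth - 1) + [out_features]
        self.layers = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(depth)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)  # no activation: the composition stays linear
        return x

    def collapse(self):
        """Fold the composition into a single nn.Linear (the 'reduced' layer)."""
        W = torch.eye(self.layers[0].in_features)
        b = torch.zeros(self.layers[0].in_features)
        for layer in self.layers:
            # y = W_i (W x + b) + b_i, accumulated layer by layer
            b = layer.weight @ b + layer.bias
            W = layer.weight @ W
        reduced = nn.Linear(W.shape[1], W.shape[0])
        with torch.no_grad():
            reduced.weight.copy_(W)
            reduced.bias.copy_(b)
        return reduced

# Sanity check: the expanded block and its collapsed form agree.
x = torch.randn(8, 32)
expanded = ExpandedLinear(32, 16)
assert torch.allclose(expanded(x), expanded.collapse()(x), atol=1e-5)
```

The interesting point, as I read the paper, is that although both parameterizations represent the same function class, training the expanded form reaches higher performance than training the collapsed form directly.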