NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4983
Title: Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias

Reviewer 1

Post author response: I have read the authors' response and the other reviews. I believe the authors' contribution to understanding architectural bias merits publication. My overall score remains the same. Additional comment: is AlexNet on CIFAR-100 trained without data augmentation? Besides CIFAR-100 being harder than CIFAR-10, that may explain why 40% accuracy seems lower than other reported numbers on CIFAR-100. The authors should either clarify this point or, even more interesting, show that the effects do not change in the presence of data augmentation.
--------------------------------------------------------------------
The authors study the architectural bias of convolutional networks compared to fully connected networks. For many interesting tasks, CNNs have been observed to perform much better than FC networks. Although this superior performance is understood to come from CNNs incorporating prior knowledge about structured data, understanding of the training dynamics at the level of the loss landscape is quite limited. By mapping a CNN to its equivalent FCN (eFCN) and studying the training dynamics of the eFCN, the authors provide a new tool for investigating CNN architectural bias in the loss landscape. Although the practicality of the proposed method is probably low (finding an efficient way to relax the locality constraint would be interesting future work, as the authors suggest), the method is quite novel and sheds light on the SGD training dynamics of different architectures. The paper is very clearly written, and the proposed ideas and methods are easy to follow. The experimental results are presented in a logical and understandable manner. The proposed method is simple to understand yet quite novel, as far as I can tell, and provides previously unknown insights into GD-based training of CNNs and FCNs. The paper provides significant novel insights through its newly proposed method.
A few interesting results are observed: 1) CNN initialization in the ambient FC space provides a better model than FC initialization in the original space. 2) Switching from the CNN to the ambient eFC space at some intermediate time can find a better model than the fully trained CNN. 3) Sharpness indicators (gradient norm and maximum eigenvalue of the Hessian) are large around that intermediate switch time and then become quite small as training continues in the eFCN space. In the supplementary material, the authors show that the findings also hold for a different architecture and dataset. They also provide code to reproduce the results, which will allow researchers to build on the findings and gain more insight into the architectural bias of CNNs relative to FCNs.
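To make the CNN-to-eFCN mapping discussed in this review concrete, here is a minimal sketch of the idea, not the authors' code. It assumes a single-channel 1D convolution with stride 1 and no padding; the function names `conv1d_valid` and `conv_to_dense` are hypothetical and chosen only for illustration. The dense matrix repeats the same kernel weights on every row (weight sharing) and is zero elsewhere (locality), so it computes exactly the same function as the convolution.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Cross-correlation with stride 1 and no padding (the usual CNN 'convolution')."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def conv_to_dense(kernel, input_len):
    """Build a dense matrix W such that W @ x equals conv1d_valid(x, kernel).

    Illustrative assumption: one input channel, one output channel.
    """
    k = len(kernel)
    out_len = input_len - k + 1
    W = np.zeros((out_len, input_len))
    for i in range(out_len):
        # Same weights on every row (sharing); zeros outside the window (locality).
        W[i, i:i + k] = kernel
    return W

rng = np.random.default_rng(0)
x = rng.normal(size=8)
kernel = rng.normal(size=3)
W = conv_to_dense(kernel, input_len=8)

# The embedded fully connected layer reproduces the convolution's output.
assert np.allclose(W @ x, conv1d_valid(x, kernel))
```

Once the weights live in `W`, every entry can be treated as a free parameter, which is one way to picture relaxing the locality and sharing constraints: a gradient step on `W` may move the zero entries away from zero and let the rows drift apart.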

Reviewer 2

The authors show how a CNN prior (locality and weight-sharing constraints) on FCN weights can find better local minima than an FCN with no constraints. This suggests that the architectural bias is only required in the initial part of the optimization, to avoid trivial local minima that do not generalize well. One interesting observation is that even with only an initial CNN prior, the FCN tends to perform quite well compared to a regular FCN. The experimental results also suggest that a combination of template matching and local filters tends to give better performance than template matching alone or local filters alone. Many interesting insights are presented, but they are based on experimental validation only. Perhaps either a more thorough experimental analysis or some theoretical evidence could be provided. If the authors claim to achieve better performance by relaxing the constraints at the right point during training, this needs more experimental validation.

Reviewer 3

This is an empirical study to understand why an over-parametrized CNN performs better than a fully connected network. As far as I know, no one has done the same experiments before. The results suggest that the architectural bias is not necessary throughout the whole of training. My key concern about this paper is the experiments. The paper uses a realistic task (CIFAR), but the model looks small. It is unclear whether the conclusion still holds if the model has more parameters or a more advanced architecture. It becomes even more suspicious if we look at the results in the supplementary material. On AlexNet, the accuracy is only 40%. How can AlexNet perform so badly? If I remember correctly, even an FCN can reach 70% with careful tuning. In other words, there are many CNNs that perform well on CIFAR-10, so why pick AlexNet? It is also unclear how the optimization approach affects the results. The paper uses a smaller learning rate (0.01) to fine-tune. Are the results sensitive to this choice? It is also unclear whether the CNN was fine-tuned with this smaller learning rate as well. If the CNN always uses a constant learning rate, the comparison does not look fair to me. The paper does not cite "Do Deep Nets Really Need to be Deep?", which uses an FCN to mimic a CNN. Overall, I feel this is an interesting study, but the experiments are not strong enough to support the conclusion.