Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
This paper is interesting and shows some meaningful results for understanding the loss landscape of deep CNNs. However, I do not feel that the contribution of a visualization technique alone is sufficient. Rather than applying the proposed technique to explain well-known models (some of the conclusions are also already well known in the deep learning community), the paper would be stronger if it investigated some new model architectures or loss functions.
Little is known about the loss landscape of deep neural nets, given their many layers and parameters! The authors of the current work propose a new approach for 2D visualization of such landscapes! Even though the proposed method differs from existing work only in a new normalization, the insights that led to this adjustment are elegant. The authors provide different insights into the loss landscape of neural nets with different structures and depths, and use their visualization tools to justify why certain structures learn better or generalize better! Even though the work lacks any new theoretical understanding (which is not easy to obtain for the loss landscape of deep neural networks), I enjoyed reading this paper. I believe it confirms some of our previous understanding, questions some of it, and overall helps us design more effective DNN structures! One minor problem: I understand the authors' point about the scale invariance property of neural nets when ReLU is used, but I don't see that the scale invariance property generally holds for NNs with other non-linear activation functions. All the experiments are also done with ReLU! Thus, I would be a bit more precise in the title and text of this paper by pointing out that this is not for general NNs.
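The point about activation functions can be checked numerically. The sketch below is an illustration only and is not taken from the paper: it assumes a hypothetical two-layer, bias-free network with made-up weights, multiplies the first layer by a positive constant c, and divides the second layer by c. Because ReLU(c·z) = c·ReLU(z) for c > 0, the output is unchanged with ReLU, whereas with tanh it is not.

```python
import math

def relu(z):
    return max(0.0, z)

def two_layer(x, w1, w2, act):
    """Scalar output of a bias-free two-layer net: sum_j w2[j] * act(w1[j] . x)."""
    hidden = [act(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))

def rescale(w1, w2, c):
    """Multiply first-layer weights by c and divide second-layer weights by c (c > 0)."""
    return [[c * w for w in row] for row in w1], [w / c for w in w2]

# Hypothetical input and weights, chosen only for illustration.
x = [0.5, -1.0, 2.0]
w1 = [[0.2, -0.4, 0.1], [-0.3, 0.5, 0.7]]
w2 = [1.5, -0.8]
w1_s, w2_s = rescale(w1, w2, 10.0)

# Difference in network output before and after rescaling.
relu_gap = abs(two_layer(x, w1, w2, relu) - two_layer(x, w1_s, w2_s, relu))
tanh_gap = abs(two_layer(x, w1, w2, math.tanh) - two_layer(x, w1_s, w2_s, math.tanh))

print("ReLU output change:", relu_gap)  # zero up to floating-point error
print("tanh output change:", tanh_gap)  # clearly non-zero
```

The check covers only this one rescaling and one input, so it illustrates the concern rather than proving anything about general architectures.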
Overview: Visualizing the loss landscape over the parameters of a neural net is hard, as the dimension of the parameter space is huge. Being able to visualize the loss landscape could help us understand how different techniques shape the loss surface, and how the "sharpness" vs "flatness" of the loss surface is related to generalization error. Previous work includes interpolating a 1D path between the parameters of two models to reveal the loss surface along that path. However, 1D visualization can sometimes be misleading and lose critical information about the surrounding loss surface and local minima. For 2D visualizations, two directions are usually selected and the loss is evaluated along them from a chosen origin. However, neural nets have a "scale-invariance" problem, which means the weights in different layers can take different scales while having the same effect. The authors propose a new way of visualizing the neural network loss surface in 2D that takes this scale invariance of the network parameters into account. The method is to sample two Gaussian random vectors that are normalized by the filter norms in the neural network, and then compute the loss at different combinations of the two filter-normalized Gaussian random vectors around the minimizer. The authors conduct experiments to show that the loss surface in 1D can be misleading, and that their 2D filter-normalized plot reveals more information about minima compared to a non-filter-normalized plot. The experiments on ResNet also visualize networks trained under different settings to show that certain techniques do help flatten the loss surface, and that a flatter surface helps the generalization error. Quality: The method used in the paper is technically sound and the experiments are well organized. Clarity: The writing of the paper is mostly clear and easy to understand.
There are two suggestions: 1) When introducing the method to generate the filter-normalized plot, it is a little ambiguous what the authors mean by the term "filter". My understanding is that each group of elements in the Gaussian vector should be normalized by the norm of the corresponding convolutional filter in the neural network. A simple toy example would probably explain this idea better. 2) The organization of the sections is a little confusing: the authors first introduce the method to generate the 2D filter-normalized plot, but then, in the first parts of Section 3, show results for 1D plots to reveal the limitations of 1D plots. Moving the 1D plots to the front and then introducing the method to generate the 2D filter-normalized plot would probably improve the reading flow. Originality: The major difference from previous methods is the 2D filter-normalized plot. However, the authors also claim that without filter normalization, the plot can still reveal some of the local minima of the two different methods. The other original element is applying the visualization to networks trained by different methods. Significance: The visualizations of the loss landscapes are very impressive and very informative. However, the method itself is a little simple. The observations and conclusions are very helpful, but remain at the level of eyeball-based analysis. If the authors could provide a more quantitative study of the sharpness/flatness of the landscape, relate the sharpness/flatness of the surface to the regularizer used in training, or relate it to a sensitivity analysis of the network, the work would be more impactful. Also, this method is still computationally heavy, as one needs to compute the loss value at each interpolation point. I don't see high-resolution plots produced with an efficient implementation in their study.
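As a toy illustration of the "filter" ambiguity raised in suggestion 1, the sketch below follows the reading given above: each filter-sized block of a Gaussian direction is rescaled to have the same norm as the corresponding filter, and the loss is then evaluated on a 2D grid around the minimizer. All names (`filter_normalized_direction`, `loss_grid`, `toy_loss`) and all numbers are hypothetical, and the quadratic loss is only a stand-in for a real network loss; this is not the authors' implementation.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def filter_normalized_direction(filters, rng):
    """Sample a Gaussian direction, then rescale each filter-sized block
    so that its norm matches the norm of the corresponding filter."""
    direction = []
    for f in filters:
        d = [rng.gauss(0.0, 1.0) for _ in f]
        scale = norm(f) / (norm(d) + 1e-10)  # small constant avoids division by zero
        direction.append([scale * x for x in d])
    return direction

def loss_grid(loss_fn, filters, d1, d2, coords):
    """Evaluate loss_fn at theta + a*d1 + b*d2 for every pair (a, b) in coords."""
    grid = []
    for a in coords:
        row = []
        for b in coords:
            shifted = [[w + a * u + b * v for w, u, v in zip(f, f1, f2)]
                       for f, f1, f2 in zip(filters, d1, d2)]
            row.append(loss_fn(shifted))
        grid.append(row)
    return grid

# Hypothetical minimizer: two "filters" whose weights differ greatly in scale.
theta = [[0.01, -0.02, 0.015], [3.0, -4.0, 1.0, 2.0]]

def toy_loss(filters):
    # Stand-in loss (assumption): squared distance from theta.
    return sum((w - t) ** 2
               for f, tf in zip(filters, theta)
               for w, t in zip(f, tf))

rng = random.Random(0)
d1 = filter_normalized_direction(theta, rng)
d2 = filter_normalized_direction(theta, rng)
coords = [-1.0, -0.5, 0.0, 0.5, 1.0]
grid = loss_grid(toy_loss, theta, d1, d2, coords)

for row in grid:
    print(" ".join(f"{value:8.3f}" for value in row))
```

Without the per-filter rescaling, a unit-variance perturbation would swamp the small-scale filter while barely moving the large-scale one. The grid also makes the cost concern concrete: a plot with n points per axis needs n² full loss evaluations.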