Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
The authors report increased accuracy and robustness from adding an extremely simple linear filtering step with a difference-of-Gaussians kernel. While this would be a quite remarkable finding if it turned out to be reliable, I have substantial doubts, which I outline below:
- Instead of using a well-established baseline (e.g. a ResNet), the authors train their own architecture (similar to VGG) on what seems to be their own 100-class variant of ImageNet. Despite these simplifications, their performance is quite poor (a ResNet-18 would achieve ~70% top-1 accuracy on full ImageNet). As a consequence, we don't really know whether the improvement from their surround modulation module is an artefact of a poorly trained model or a real effect. Training a ResNet on ImageNet is trivial these days (scripts are provided in the official PyTorch and TensorFlow repositories), so it is really unclear to me why the authors try to establish their own baseline. In fact, training VGG-like architectures without batch norm is anything but trivial, so it is quite likely that the authors' model underperforms substantially.
- The relation between what the authors do and surround suppression in the brain is weak at best. In the authors' implementation, the only neurons contributing are those from the same feature map; such specificity is unlikely to be the case in the brain. In addition, biological surround modulation is purely modulatory, i.e. it has no effect if there is no stimulus in the classical receptive field. The authors' implementation as a linear filter, however, will elicit a response even when only the surround is stimulated, with no stimulus in the center.
- I am unsure about the value of the analysis of sparsity. What are these figures meant to tell us? I think they could just as well be left out without the paper losing anything.
- Why is the surround modulation module added only after the first layer?
- Were the networks used for NORB and occluded MNIST trained from scratch on these datasets or pre-trained on ImageNet and only fine-tuned?
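The reviewer's point above about the filter being linear rather than modulatory can be checked directly: a DoG kernel produces a nonzero (inhibitory) output even when the center of the receptive field is empty. A minimal sketch, with the kernel size and the σ values (1 and 2) chosen by me for illustration, not taken from the paper:

```python
import numpy as np

def dog_kernel(size=7, sigma_c=1.0, sigma_s=2.0):
    """Difference-of-Gaussians kernel: excitatory center, inhibitory surround.
    Each Gaussian is normalized to sum to 1, so the DoG has zero DC gain."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    g_c = np.exp(-(x**2 + y**2) / (2 * sigma_c**2))
    g_s = np.exp(-(x**2 + y**2) / (2 * sigma_s**2))
    return g_c / g_c.sum() - g_s / g_s.sum()

k = dog_kernel()
center_weight = k[3, 3]   # excitatory center: positive
corner_weight = k[0, 0]   # far surround: negative

# Stimulus only in the surround (outer ring of the 7x7 support),
# nothing in the "classical receptive field" at the center:
patch = np.zeros((7, 7))
patch[0, :] = patch[-1, :] = patch[:, 0] = patch[:, -1] = 1.0
response = float((k * patch).sum())  # nonzero (negative) despite empty center
```

Because the filter is linear, the inhibitory surround drives the output on its own; a truly modulatory mechanism would instead gate the center response (e.g. multiplicatively) and stay silent when the center is unstimulated.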
This paper incorporates the surround modulation mechanism of biological vision into convolutional neural networks. The authors add local lateral connections (defined in Eqns. 1 and 2) to the activation maps of convolutional neural networks to mimic surround modulation. To my knowledge, this work is of interest to both the machine learning and neuroscience communities, and I find the result that surround modulation improves CNN performance very interesting. However, I have the following comments:
1- The application of the difference-of-Gaussians kernel in the computer vision domain is not new. DoG filtering has been proposed before for feature extraction in computer vision (see for example Lowe et al., IEEE 1999, as well as the Wikipedia articles at https://en.wikipedia.org/wiki/Difference_of_Gaussians and https://en.wikipedia.org/wiki/Scale-invariant_feature_transform). Still, I believe the result that this kind of 'engineered' filter gives better performance than the baselines is an interesting finding.
2- The presentation of the experiments requires more clarity. In particular, I found it hard to understand the baselines. There are three things one needs to control for: 1) the number of trainable parameters, 2) the depth of the network, and 3) the structure of the SM kernel. The authors try to clarify their baselines in the paragraph starting at line 155, but I find that description largely unclear.
3- The experiments lack hyperparameter tuning. One simple explanation of the results is that the training hyperparameters were not optimal for the baseline models.
4- For the generalization results, it is unclear whether one could match or exceed the SM-CNN results by using standard regularization methods.
5- It would be very interesting to see whether the same results hold for larger networks.
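The classic feature-extraction use of DoG filtering mentioned in point 1 above (e.g. in SIFT's scale space) is, by linearity, the same as subtracting two Gaussian-blurred copies of an image. A small sketch of that equivalence; the σ values, kernel size, and random test image are arbitrary illustrative choices:

```python
import numpy as np

def gauss_kernel(size, sigma):
    """Discrete 2-D Gaussian, normalized to sum to 1."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return g / g.sum()

def conv2d_same(img, k):
    """'Same'-size 2-D correlation with zero padding (k is symmetric,
    so this equals convolution)."""
    r = k.shape[0] // 2
    p = np.pad(img, r)
    H, W = img.shape
    out = np.zeros((H, W))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += k[dy + r, dx + r] * p[r + dy:r + dy + H, r + dx:r + dx + W]
    return out

rng = np.random.default_rng(0)
img = rng.random((16, 16))

g1, g2 = gauss_kernel(7, 1.0), gauss_kernel(7, 2.0)
# SIFT-style: difference of two blurred copies of the image...
dog_a = conv2d_same(img, g1) - conv2d_same(img, g2)
# ...equals a single convolution with the DoG kernel, by linearity.
dog_b = conv2d_same(img, g1 - g2)
```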
Having read the other reviews and the authors' response, I am willing to downgrade my score a tad (due to Reviewer 1's points). But that is still a good score! I did not quite understand their response to my review; I don't believe I said anything about different-sized receptive fields.

This paper presents an incredibly simple idea that is effective. Given this, I am not sure whether it has been done before (I don't know of any other papers that do this exact thing, but I am willing to be corrected), hence I am unsure of the originality. The idea is simple: surround modulation is a pervasive feature of the visual system. A similar surround will reduce the response of a neuron, indicating that nearby neurons sensitive to the same pattern are inhibiting its response. This is implicated in a large number of phenomena in visual neuroscience, which are listed in the paper. There are at least three potential mechanisms for it, one of which is lateral connections with a difference-of-Gaussians (DoG), or center-surround, shape: nearby neurons responsive to the same stimulus enhance the response, and those a little farther away inhibit it.

In this paper, the authors choose to implement the lateral-inhibition, DoG idea. It is implemented as a convolution of a DoG linear filter over the activation maps of half of the first layer of convolutions (before the ReLU, I believe). Other variants are also tested, including applying it to all of the initial filters, applying it to the input image instead of the first layer of convolutions, and applying it to the first max-pooling layer. As a control experiment, another layer is added above the first layer, both with and without the ReLU nonlinearity. Using a subset of ImageNet, they show that this model learns more quickly and achieves higher performance than the control networks. Two of the variants also perform better than an equivalent convnet.
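The implementation as described here (a fixed DoG convolved depthwise over the activation maps of half of the first layer's channels, before the ReLU) can be sketched roughly as follows. The kernel size, σ values, tensor shapes, and the choice to replace rather than add to the original map are my guesses for illustration; the paper's Eqns. 1-2 define the actual operation:

```python
import numpy as np

def dog_kernel(size=7, sigma_c=1.0, sigma_s=2.0):
    """Fixed difference-of-Gaussians kernel, each Gaussian sum-normalized."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    g_c = np.exp(-(x**2 + y**2) / (2 * sigma_c**2))
    g_s = np.exp(-(x**2 + y**2) / (2 * sigma_s**2))
    return g_c / g_c.sum() - g_s / g_s.sum()

def conv2d_same(fmap, k):
    """'Same'-size zero-padded 2-D convolution of one feature map."""
    r = k.shape[0] // 2
    p = np.pad(fmap, r)
    H, W = fmap.shape
    out = np.zeros((H, W))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += k[dy + r, dx + r] * p[r + dy:r + dy + H, r + dx:r + dx + W]
    return out

def surround_modulate(acts, k):
    """Apply the fixed DoG depthwise to the first half of the channels of
    pre-ReLU activations (C, H, W); each map interacts only with itself,
    matching the 'same feature map only' point raised by Reviewer 1."""
    out = acts.copy()
    for c in range(acts.shape[0] // 2):
        out[c] = conv2d_same(acts[c], k)
    return out

k = dog_kernel()
acts = np.ones((8, 12, 12))        # toy pre-ReLU activations, 8 channels
sm = surround_modulate(acts, k)
relu = np.maximum(sm, 0.0)         # nonlinearity applied after the modulation
```

Note that on a constant input the filtered maps go to (near) zero in the interior, since the DoG has zero DC gain; this is one way to see the contrast-normalization interpretation mentioned below.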
Here, the paper would be clearer if it associated the three variants with the labels used in Table 1, although it was clear to me. They then test the robustness of this model against different lighting conditions using the NORB dataset. The results are quite convincing: the network with the DoG convolution is much more robust to the lighting changes than a network without it, by around 15%. My guess is that most of this should be attributed to the DoGs providing a kind of contrast normalization. They also show that this network is more robust to occlusion in MNIST when compared to a standard convnet. Finally, they show that the network activations are relatively sparse, in both the lifetime and population senses, and that the activations are made more independent by this manipulation.

The paper is very well written and clear. Lines 62-77 could be considerably shortened, considering the audience is very familiar with these concepts. I think this paper is significant (modulo my lack of knowledge of whether it has been done before), because it is a simple idea that appears to aid learning, classification performance, and robustness. I also like the fact that it is biologically inspired.

Line 51: insert "are" between "and" and "unlikely." Line 93: reference 39 is also appropriate here (with 37 and 60). Line 23 of the supplementary material: "separation" is misspelled.