NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 620
Title: Doubly Convolutional Neural Networks

Reviewer 1

Summary

This paper proposes a double convolution operation to share and compress parameters in CNNs, motivated by the observation that many learned filters are translated versions of one another up to small changes. Results on several datasets indicate that DCNNs can outperform CNNs with the same number of parameters.

Qualitative Assessment

The authors base their solution on an observation made across several datasets, and I believe that observation extends to other tasks as well. Their solution, which shares some of the parameters across filters, also seems reasonable. They compare their approach with several alternatives and show its superiority. That said, there are many ways to compress the filter parameters, since they are highly redundant. For example, the filter bank could be constructed as a product of two low-rank matrices; my guess is that such a parameterized compression approach may provide a larger compression rate, or achieve better accuracy with the same number of parameters.
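As a rough illustration of the alternative I mean, here is a minimal numpy sketch of low-rank filter compression. The filter count, filter size, and rank below are arbitrary choices for illustration, not values taken from the paper.

```python
import numpy as np

# Flatten a bank of 256 learned 5x5 filters into a 256 x 25 matrix
# (random values stand in for trained weights here).
n_filters, k, r = 256, 5, 8
W = np.random.randn(n_filters, k * k)

# Truncated SVD gives the best rank-r approximation W ~= U @ V, so the bank
# is stored as two small factors instead of the full matrix.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :r] * s[:r]     # 256 x 8
V = Vt[:r]                    # 8 x 25

print(W.size, U.size + V.size)                        # 6400 vs. 2248 parameters
print(np.linalg.norm(W - U @ V) / np.linalg.norm(W))  # relative reconstruction error
```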

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 2

Summary

The authors introduce doubly convolutional neural networks (DCNNs), an extension of convolutional neural networks that accounts for shifted filters by having one large filter and using all of its patches (or pooling over the patches). DCNNs are easy to implement and show significant improvements over pure CNNs on image classification tasks (both on CIFAR and ImageNet).
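For concreteness, here is a minimal numpy/scipy sketch of the mechanism as I understand it: each effective filter is an overlapping patch cut out of a larger meta filter, and a maxout-style variant pools over the resulting maps. The sizes, stride, and names below are my own illustrative choices, not the authors' exact configuration.

```python
import numpy as np
from scipy.signal import correlate2d

def extract_subfilters(meta_filter, k, stride=1):
    """Slice all k x k patches out of a larger z x z meta filter."""
    z = meta_filter.shape[0]
    return [meta_filter[i:i + k, j:j + k]
            for i in range(0, z - k + 1, stride)
            for j in range(0, z - k + 1, stride)]

# One 6x6 meta filter yields nine 4x4 filters that share most of their weights.
meta = np.random.randn(6, 6)
filters = extract_subfilters(meta, k=4)

image = np.random.randn(32, 32)
feature_maps = [correlate2d(image, f, mode='valid') for f in filters]

# Pooling over the nine maps gives the maxout-style variant mentioned above.
pooled = np.max(np.stack(feature_maps), axis=0)
print(len(filters), pooled.shape)   # 9 (29, 29)
```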

Qualitative Assessment

This paper makes a great case for DCNNs. It starts with an analysis of how many filters have a shifted correlate in AlexNet and VGGNet; the large number of shifted correlates beautifully motivates the introduction of DCNNs. The experimental results are strong, e.g., the top-5 error on ImageNet goes down from 10.3% for the CNN to 7.8% for the DCNN, less than even GoogLeNet. The presentation is nice, but could be improved in small details:
* When introducing convolution (lines 74-75) and double convolution (lines 90-91) you use the inner product with flattening and python notation. This is not needed -- the formula for convolution can be written in the same amount of space with just basic mathematical operators (+ and * on numbers, not vectors) and a sum. Please use the more basic version. While the inner product might expose parallelism and python notation is understood by many, it is still best to stick to the most basic notation when possible, and it is possible in this case. (To avoid python, use indices; see the sketch below.)
* In Table 3, please provide the approximate number of parameters for ResNet-152 and GoogLeNet. You might add a footnote saying that it was derived by you and might be slightly mistaken (if that is the case), or just say "about 3" or something. But some ballpark approximation would help readers understand your results and their significance better, and these numbers can be computed from the papers.
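To illustrate the kind of notation I have in mind, here is one possible indexed form of the plain convolution, followed by a guess at the doubly convolutional analogue in which the filter weights are read from a larger meta filter at an offset. The symbols O, I, F, W and the cross-correlation orientation are my own choices, not the paper's notation.

```latex
% plain convolution with indices and a sum
O_{c',i,j} = \sum_{c=1}^{C} \sum_{u=1}^{k} \sum_{v=1}^{k}
             F_{c',c,u,v}\, I_{c,\,i+u-1,\,j+v-1}

% double convolution (sketch): the same sum, but the filter entries are slices
% of a larger meta filter W taken at offset (a, b), so weights are shared
% across offsets
O_{c',a,b,i,j} = \sum_{c=1}^{C} \sum_{u=1}^{k} \sum_{v=1}^{k}
                 W_{c',c,\,a+u-1,\,b+v-1}\, I_{c,\,i+u-1,\,j+v-1}
```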

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

The paper proposes to increase parameter sharing in convolutional layers by making all filters part of larger "metafilters". Filters are obtained by convolutionally extracting overlapping parts from the metafilters, hence the name "doubly convolutional neural networks". Models with this form of parameter sharing are compared with regular CNNs and some other variants on the CIFAR-10, CIFAR-100 and ImageNet datasets.

Qualitative Assessment

Added after rebuttal / discussion: Although I still contend that the CIFAR-10 baseline network used in the paper is unnecessarily weak, I appreciate that results on a large-scale dataset such as ImageNet (on which the method is shown to be quite competitive) are much more relevant, and I no longer consider it a major issue. I still disagree with the presentation of the idea (asserting that the filters are "translated versions of each other", when the reason the method works is precisely because they are _not_ exact translated versions of each other, only approximately so), but I guess that can be put down to a matter of taste.

That leaves the problem with the CyclicCNN baseline, which I maintain is unnecessarily crippled by removing its ability to use relative orientation information. My issue was not with the fact that this is not explained in enough detail (as the rebuttal seems to imply), but rather that the model is wrong. This form of parameter sharing is pointless, except in very rare cases where relative orientation information is not relevant to the task at hand (I can't think of any situations where this is true, but there might be some). In natural image classification problems like the ones tackled here, I think it is clear that relative orientations between low-level features encode crucial information that should not be discarded. I really feel that this should be addressed, either by fixing the model architecture of the CyclicCNN so it actually benefits from the parameter sharing rather than being crippled by it (i.e. not pooling after every layer), or by simply removing this model from the paper altogether. As it stands it doesn't really add anything, because there are other, more interesting baselines to compare to, and it unnecessarily weakens the paper. The paper probably should not be held back by an issue with an additional baseline. In light of all this I have decided to increase the technical quality score to 3, in the hope that the CyclicCNN issue will be addressed.

---------

I'm not sure if the averaged maximum 1-translated correlation shown across different layers in Figure 2 is all that meaningful. A stroke-detecting filter such as the one in the top-left corner of Figure 1, and its horizontal flip, will presumably also have a relatively high correlation compared to random filters. It would be useful to compare with something like this as well, because it would be more meaningful than the comparison with random filters: arguably a filter and its horizontal flip are not translated versions of each other, so if they also have a high correlation it means the measure is not particularly meaningful. (A sketch of the kind of comparison I have in mind follows below.)

The paper states that "it is interesting to see for most filters, there exist several filters that are roughly translated versions of each other". I agree that this is interesting, surprising even, considering that convolution already involves exact filter translation. I would have expected a bit more in-depth discussion of why convnets tend to learn approximately translated filters despite this.

My intuition after reading the paper is that the value of the proposed approach lies in masking larger filters in various ways. Clearly these are _not_ exact translated versions of each other (only approximately so), else the resulting feature maps would be redundant (they would also be shifted versions of each other). So the fact that they are only _approximately_ translated versions of each other seems to be key, and this is not really stressed in the paper.
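To make the suggested check concrete, here is a minimal numpy sketch of the comparison I have in mind: the maximal normalized correlation of a filter against small translations of another filter, applied to a genuinely translated copy, to a horizontal flip, and to a random filter. The function name and the use of circular shifts are my own simplifications; the paper's exact k-translated correlation may be defined differently.

```python
import numpy as np

def max_translated_correlation(f, g, max_shift=1):
    """Max normalized correlation between f and all shifts of g by at most max_shift."""
    best = -1.0
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            g_shifted = np.roll(np.roll(g, dy, axis=0), dx, axis=1)
            corr = np.sum(f * g_shifted) / (np.linalg.norm(f) * np.linalg.norm(g_shifted))
            best = max(best, corr)
    return best

f = np.random.randn(5, 5)
print(max_translated_correlation(f, np.roll(f, 1, axis=1)))  # translated copy: close to 1
print(max_translated_correlation(f, f[:, ::-1]))             # horizontal flip
print(max_translated_correlation(f, np.random.randn(5, 5)))  # random filter baseline
```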
Although the experimental section is quite extensive, which is commendable, there are a few issues with the evaluation:
- The baseline models for CIFAR-10 and CIFAR-100 have the same number of units in all convolutional layers. This is non-standard and it constitutes an inefficient use of the parameter budget. It means that the total size of the representation (number of feature maps * height * width) shrinks considerably throughout the network. Usually, later layers are given more feature maps to compensate for smaller feature map sizes. This is reflected in the fairly poor performance of the baseline CNN (10.02% without / 9.59% with data augmentation), which reduces the impact of the results.
- The CyclicCNN model differs significantly from the models described in "Exploiting Cyclic Symmetry in CNNs" by Dieleman et al., so its name is a bit misleading. The CyclicCNN has rotation-invariant layers (because after each layer, the activations are pooled across different orientations), whereas in the original paper the entire network is invariant but the individual layers are instead equivariant (local vs. global invariance); the sketch after these comments illustrates the distinction. This is an important difference because it means that the CyclicCNN is unable to learn high-level features that depend on relative orientations of low-level features. This cripples the network compared to the baseline, so it is not surprising that it performs worse. The CyclicCNN is actually more similar to the models described in "Flip-rotate-pooling Convolution and Split Dropout on CNNs for Image Classification" by Wu et al. than to any of the models described in the cyclic symmetry paper (although their framework could arguably be used to construct this type of model).
- At the end of Section 4.2.1 it is stated that, to ensure a fair comparison, the number of layers and the number of units per layer are kept constant across models. However, this does not ensure a fair comparison at all: instead, the number of parameters should be kept (approximately) constant. In Table 2, the DCNN is shown to outperform all other models, but it has almost twice as many parameters, so this comparison is not fair. Instead, the number of feature maps in each layer of the DCNN should be reduced so that the number of parameters becomes comparable.

Minor comments:
- A nonstandard symbol (circle with middle dot) is used for the convolution operation. This symbol is usually used for elementwise multiplication instead, so it's a bit confusing.
- Regarding MaxoutDCNN, it would be interesting to also evaluate other pooling functions (mean pooling in particular, possibly after the application of a nonlinearity).
- The related work section should also discuss "Learning Invariant Features through Topographic Filter Maps" by Kavukcuoglu et al., which details another approach for grouping similar filters together (through a group sparsity penalty rather than explicit weight tying).
- Regarding [11] it is said that "this explores the scale invariant property of images", but I don't see how this is the case unless the weights across different layers are tied.
- [15] is said to "investigate the combination of more than one transformation of filters", but in fact STNs transform the input, not the filters.
- In section 4.2.1, for MaxoutCNN it is said that the stride is k. Is this a stride or a window size? Or are both equal?
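As a schematic illustration of the local vs. global invariance distinction raised above, here is a toy numpy sketch; `layer` is a stand-in for any learned layer, and only the placement of the pooling over the four 90-degree orientations matters. This is my own simplification, not the exact architecture of either paper.

```python
import numpy as np

def layer(x):                        # placeholder for a learned conv layer + nonlinearity
    return np.maximum(x - 0.1, 0)

def rotations(x):
    return [np.rot90(x, k) for k in range(4)]

def pool_orientations(maps):
    return np.max(np.stack(maps), axis=0)

x = np.random.randn(8, 8)

# (a) CyclicCNN baseline as evaluated in the paper: pool over orientations after
# *every* layer, so each layer's output is rotation invariant and relative
# orientation information is discarded early.
h = pool_orientations([layer(r) for r in rotations(x)])
h = pool_orientations([layer(r) for r in rotations(h)])

# (b) Globally invariant alternative (closer in spirit to Dieleman et al.):
# keep the four orientation branches through the stack and pool only once at
# the end, so later layers can still combine relative orientations.
branches = [layer(layer(r)) for r in rotations(x)]
h_global = pool_orientations(branches)
print(h.shape, h_global.shape)       # (8, 8) (8, 8)
```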

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)