Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
The paper is well written, the explanation is easy to follow, and the proposed approach is well motivated. Ln. 129: the paper states that CNN^2 needs far fewer filters to achieve performance comparable to existing schemes. Does the accuracy of CNN^2 improve as more filters are added? Am I correct to assume that the spatial size of the features remains the same as the input as one moves up the layers, due to the concentric multi-scale pooling? If so, is this necessary?
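For concreteness, here is one way to read the size-preserving property the question above assumes — a hedged sketch only, not the authors' implementation: each output position pools the input over concentric windows of increasing radius, producing one value per scale while keeping the input's spatial size. The names `cm_pool` and `radii` are illustrative, not from the paper.

```python
# Hedged sketch of concentric multi-scale (CM) pooling on a 2D feature map.
# For each position, max-pool over square windows of increasing radius,
# yielding one value per scale; the spatial size (H, W) is preserved,
# matching the reading that feature size stays the same as the input.

def cm_pool(x, radii=(0, 1, 2)):
    """x: 2D list of floats (H x W). Returns H x W x len(radii) nested lists."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            scales = []
            for r in radii:
                # Pool over the (2r+1) x (2r+1) window centred at (i, j),
                # clipped at the borders.
                vals = [x[a][b]
                        for a in range(max(0, i - r), min(h, i + r + 1))
                        for b in range(max(0, j - r), min(w, j + r + 1))]
                scales.append(max(vals))
            row.append(scales)
        out.append(row)
    return out

feat = [[0.0, 1.0, 0.0],
        [2.0, 0.5, 0.0],
        [0.0, 0.0, 3.0]]
pooled = cm_pool(feat)
# Spatial size unchanged: 3 x 3, with one pooled value per radius,
# e.g. pooled[0][0] == [0.0, 2.0, 3.0]
```

Under this reading, adding filters and adding pooling scales are independent axes, which is why the filter-count question above is worth a separate ablation.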
Originality: To my knowledge, the motivation for such a dual-pathway design is not new, but the particular design of this paper, CM pooling in particular, is definitely novel.

Quality: I think the evaluation of this work is quite thorough, but it is missing some important items.
1. Missing an important ablation study. Using CM pooling in vanilla CNNs is not shown in the paper, which makes it unclear whether this pooling actually improves the performance of vanilla CNNs.
2. Missing vanilla CNN tuning details. It is great that the authors provided the hyper-parameter search details for PTN, CNN2, and capsule nets, yet it is unclear how the vanilla CNNs were tuned. Hyper-parameter tuning appears to play a significant role in CNN2's performance (a 6.7% performance drop from changing the last channel count from 40 to 64).
3. Scale of the study. The number of instances in each class is small (~15-20), which may lead to high variance and makes the numbers hard to interpret.
4. Missing confusion matrices. Confusion matrices would make the numbers more interpretable, as they show the error patterns between classes.

Clarity: The paper is well written, with significant and thorough references to related work.

Significance: Personally, I believe that this direction of incorporating biologically inspired inductive biases into network designs, and the particular problem of generalizing object recognition across vastly different views, are of great significance. However, the flaws in the evaluation noted above do have a negative impact on this assessment.
With the above contributions, the paper explores the problem of paired-image classification. It designs plausible ways for the two streams of the network to interact so as to simulate binocular vision. However, the connection between the network architecture and the human binocular vision system is far-fetched: simple subtraction and concatenation do not seem to be a generic way to pass messages across network streams. In particular, the effectiveness of the proposed modules is not well justified. It is not surprising at all that a two-stream network with cross-stream interaction can outperform an early-fusion network. What about late-fusion networks and other fusion methods that pass messages between the two streams? Without comparisons against more such baselines, it is not possible to experimentally justify that the proposed modules are particularly useful for binocular images. The experimental settings are toy-like, and the neural network backbone is also weak, so results under these settings are not practically meaningful. When a strong neural network can achieve much better accuracy from a single image in practice, it is not very useful for a suboptimal binocular system to obtain some improvement over weak baselines on non-realistic datasets.
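To make the requested baseline comparison concrete, here is a hedged sketch of the three fusion schemes at issue: early fusion, the subtract-and-concatenate cross-stream interaction the review questions, and late fusion. Plain vectors stand in for real network features; all function names are illustrative, not from the paper.

```python
# Hedged sketch of fusion options for a two-stream (binocular) network.
# Feature vectors are plain lists standing in for CNN activations.

def early_fusion(left, right):
    # Concatenate the two inputs before any per-stream processing.
    return left + right  # list concatenation

def cross_stream_interaction(left, right):
    # The "simple subtraction and concatenation" at issue: each stream
    # appends the elementwise difference with the opposite stream.
    diff = [l - r for l, r in zip(left, right)]
    return left + diff, right + [-d for d in diff]

def late_fusion(left_logits, right_logits):
    # Fuse only at the output: average the per-stream predictions.
    return [(l + r) / 2 for l, r in zip(left_logits, right_logits)]

left, right = [1.0, 2.0], [0.5, 2.5]
fused_left, fused_right = cross_stream_interaction(left, right)
# fused_left == [1.0, 2.0, 0.5, -0.5]
```

The point of the criticism is that only the first two are compared in the paper; the late-fusion and other message-passing variants would need the same evaluation before the proposed interaction can be credited.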