Reviews: Neural Similarity Learning

Reviewer 1

Update: I appreciate the authors taking my feedback seriously. The results are now presented in a way that is both more complete and easier to interpret. I am counting on the authors to update the explanation of the shared parametrization (figure and text).

----------

Clarifications/typos:

- Line 47: inner produce -> product
- Line 177: it is not clear to me what the "Shared parametrization scheme" is. What is shared exactly? Is there a single M for the whole network?
- Line 184: "first transform the *inputs* to the same dimension" -> are you talking about the images or the feature maps?
- Line 380: sifnificant -> significant

Can you clarify the shared parametrization strategy?

- What is the input of M(.), the images or the feature maps?
- Is there really just a single M for the whole network?
- When you are resizing "inputs" with the adaptation network, are you talking about feature maps or images? Is there just a single adaptation network?
- In Fig. 3 there is only one layer, so it is not clear how the method behaves for many layers.

This whole section is very fuzzy; please improve clarity.

For the meta-learning experiments, the results can be hard to interpret:

* It is well known that for the few-shot learning task, accuracy depends largely on the backbone architecture.
* For instance, in the paper "A Closer Look at Few-Shot Classification", the fine-tuning baseline with a ResNet-10 architecture achieves 75.90% on miniImageNet 5-shot.
* Therefore, it only makes sense to compare meta-learning methods that use similar architectures.
* The static NSN can be reduced to the classic Conv-4 architecture and can be compared with the other methods.
* However, the dynamic NSN cannot really be considered the same architecture, as it has more flexibility than Conv-4.

Reviewer 2

The authors propose to learn a custom similarity metric for CNNs together with an adaptive kernel shape. This is formulated via learning a matrix M that modulates the application of a set of kernels W to the input X via f(W, X) = W' M X. Structural constraints can be imposed on M to simplify optimization and minimize the number of parameters, but in its most general form it is capable of completely modifying the behavior of W. Although at test time M can be integrated into the weights W via matrix multiplication, during learning it regularizes training via matrix factorization. In addition, a variant is proposed where M is predicted dynamically given the input to the layer via a dedicated subnetwork.

A comprehensive ablation analysis is provided. It demonstrates that the basic version of the proposed approach performs marginally better than a standard CNN with a comparable number of parameters on CIFAR-10, while the dynamic variant outperforms it by 1%. On ImageNet a 1.5% improvement is demonstrated on the top-1 metric, but a very weak model is used as a backbone (a 10-layer CNN without batchnorm).

Finally, the proposed approach is adapted to the few-shot learning scenario. To this end the basic variant of the model is pretrained on base categories, and the matrices M are fine-tuned on the novel categories with MAML, while the actual CNN kernels remain fixed. This approach outperforms the state-of-the-art LEO method by a statistically significant margin while being much simpler.

The paper is well written and relatively easy to follow. Overall the approach is interesting, but I have several concerns regarding the evaluation (see Improvements).

The authors have addressed my concerns as well. Given the results of the additional experiments requested by R1, I am also ready to recommend the paper for acceptance. However, I would like to point out that, as with most similar approaches, the performance improvements seem to diminish as the network depth increases. In addition, the results in Table 1 of the rebuttal indicate that the meta-learning part of the few-shot learning approach is of marginal importance. I would appreciate it if the authors toned it down in the camera-ready version.
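The bilinear form f(W, X) = W' M X from the summary above, and the remark that M can be folded into W at test time, can be illustrated with a minimal sketch. It is not the authors' implementation: small dense matrices stand in for one flattened convolution patch, and the helper names (`bilinear_similarity`, `fold_similarity`) and all sizes and values are hypothetical.

```python
# Illustrative sketch only: tiny dense matrices stand in for flattened
# convolution patches. Names, sizes and values are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def bilinear_similarity(W, M, X):
    """Training-time form f(W, X) = W' M X, with W: (d, k), M: (d, d), X: (d, n)."""
    return matmul(matmul(transpose(W), M), X)

def fold_similarity(W, M):
    """Fold M into the kernels: W_eff = M' W, so that W_eff' X == W' M X."""
    return matmul(transpose(M), W)

# d = 3 input dimensions, k = 2 kernels, n = 2 input patches.
W = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
M = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
X = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]

training_form = bilinear_similarity(W, M, X)

# At test time a static M can be multiplied into W once, after which the
# layer is an ordinary inner product with the folded kernels.
W_eff = fold_similarity(W, M)
test_form = matmul(transpose(W_eff), X)

assert training_form == test_form
```

The folding works only because M is fixed here. For the dynamic variant described above, where M is predicted from the layer input, a one-time fold of this kind would presumably not apply, since M changes with every input.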

Reviewer 3

#1. The problem tackled in this paper is quite interesting; I have not seen prior work that replaces the inner product with a more general metric. More interestingly, a convolutional neural network that uses a generalized inner product with a bilinear matrix outperforms the baseline with the same number of parameters.
#2. I'm very impressed that Dynamic NSN achieves better few-shot classification accuracy than LEO, even without using residual networks.
#3. The manuscript is very well written, and most parts are easy to follow.

== Updates after the authors' rebuttal ==

After reading the rebuttal and having a full discussion, my final recommendation is to accept this paper. Below is a summary of the justification for the final score.

[Novelty] Though inner product-based convolution is the most widely adopted, dynamic neural similarity has some potential to further improve the performance of CNNs. Specifically, such a generalization seems to be well suited to few-shot classification, because NSL is theoretically connected to nuclear norm regularization, as briefly discussed in the rebuttal.

[Experiments] In the response, the authors included some additional experiments on few-shot classification, showing that Dynamic NSN really improves classification performance with both prototypical networks and MAML.
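The rebuttal's argument is not reproduced in this review, so the following is only a sketch of one standard result that the nuclear-norm remark may refer to. Write the effective kernel as Z = M' W, so that f(W, X) = W' M X = Z' X, and assume squared Frobenius-norm weight decay on both factors (an assumption, not something stated in the reviews). The well-known variational form of the nuclear norm then reads:

```latex
\|Z\|_{*} \;=\; \min_{M,\,W \,:\, M^{\top} W = Z} \; \tfrac{1}{2}\left( \|M\|_F^{2} + \|W\|_F^{2} \right)
```

Under that assumption, penalizing the two factors separately acts like a nuclear-norm penalty on the effective kernel Z, which favors low-rank solutions.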

Paper ID:	2770
Title:	Neural Similarity Learning
