Reviews: GIFT: Learning Transformation-Invariant Dense Visual Descriptors via Group CNNs

Reviewer 1

The goal of this work is to compute descriptors for image keypoints that are invariant to a group of transforms yet retain distinctiveness. The authors attain this in a novel way, by treating image pixel descriptors as functions over the transform group. The effect is that when a transform is applied to an image, the function values at corresponding keypoints are simply permuted. The authors then apply group convolutions to the function values to aggregate local structure in these values over group neighborhoods in a way that retains equivariance, and finally they use two CNNs and group bilinear pooling to produce an invariant but discriminative descriptor that is sensitive to second-order statistics of the features. Overall this is a very nice generalization of the traditional CNN image pipeline, where equivariant convolutions over the translation group are then made invariant through max pooling.

The paper does not describe at all (1) how edge effects are handled (e.g., padding rotated and down-scaled images), (2) how to guarantee that the group of transformations can be made compact so as to enforce the permutation property (otherwise function values can fall off the ends), and (3) the limitations arising from the restriction that the group structure must itself be a regular grid, so that traditional convolutions can be applied (e.g., the authors consider only a discrete set of scales and rotations and their compositions).

The discussion of prior work, the evaluation of the method, and the ablation study are good.

This reviewer has read the authors' rebuttal. After the ensuing discussion, the final score assigned remains unchanged.
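The permutation property summarized in this review can be illustrated with a small NumPy sketch. It is a toy setting, not the authors' implementation: the group is assumed to be the four 90-degree rotations (exact on a pixel grid, so the edge effects raised above do not occur), the feature extractor is a set of random 3x3 filters with circular padding, and the names `dense_features`, `group_features`, `group_conv`, and `bilinear_pool` are hypothetical. The final descriptor is a bilinear (second-order) pooling over the group of the outputs of two small group convolutions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_features(img, kernels):
    """Toy dense feature map: circular 3x3 correlation per kernel, shape (H, W, C)."""
    maps = []
    for k in kernels:
        out = np.zeros_like(img)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                # np.roll with a negative shift reads img[y + dy, x + dx] (circularly)
                out += k[dy + 1, dx + 1] * np.roll(img, (-dy, -dx), axis=(0, 1))
        maps.append(out)
    return np.stack(maps, axis=-1)

def group_features(img, kernels):
    """Features of each rotated copy, rotated back to the input frame: (4, H, W, C)."""
    return np.stack([
        np.rot90(dense_features(np.rot90(img, g), kernels), -g, axes=(0, 1))
        for g in range(4)
    ])

def group_conv(F, W):
    """Circular correlation along the group axis, then ReLU.
    F: (4, C_in), W: (K, C_in, C_out)."""
    out = sum(np.roll(F, -k, axis=0) @ W[k] for k in range(W.shape[0]))
    return np.maximum(out, 0.0)

def bilinear_pool(A, B):
    """Sum over group elements g of outer(A[g], B[g]), flattened."""
    return (A.T @ B).ravel()

img = rng.standard_normal((8, 8))
kernels = rng.standard_normal((3, 3, 3))   # three assumed 3x3 filters
Wa = rng.standard_normal((2, 3, 5))        # assumed group-conv weights, branch A
Wb = rng.standard_normal((2, 3, 5))        # assumed group-conv weights, branch B

h = 1                                      # rotate the input by 90 degrees
G = group_features(img, kernels)
G_rot = group_features(np.rot90(img, h), kernels)
G_rot_back = np.rot90(G_rot, -h, axes=(1, 2))  # express in the original pixel frame

# Group features at corresponding pixels are a cyclic shift of each other.
print(np.allclose(G_rot_back, np.roll(G, -h, axis=0)))

# The pooled descriptor at one pixel is unchanged by that shift.
y, x = 2, 5
F, F_rot = G[:, y, x, :], G_rot_back[:, y, x, :]
d = bilinear_pool(group_conv(F, Wa), group_conv(F, Wb))
d_rot = bilinear_pool(group_conv(F_rot, Wa), group_conv(F_rot, Wb))
print(np.allclose(d, d_rot))
```

Both checks should hold here because the group is finite and closed, which is exactly the compactness condition questioned in point (2). With a truncated set of scales the shift would no longer be cyclic and values would fall off the ends.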

Reviewer 2

The submission introduces GIFT, a network that aims to learn image feature descriptors which are invariant to affine image transformations.

Pros:
+ Extension of CNNs to aggregate descriptors computed from affine-transformed images.

Cons:
- Insufficient evaluation.
- Lack of clarity.

Details: While I think the extension of CNNs to learn descriptors from different transformed images is interesting and solid, I still have major concerns:

Insufficient experiments.
- While I value that the experiments use mainly planar scenes, I think the evaluations are missing experiments on more challenging datasets (e.g., a structure-from-motion dataset). At the end of the day, these descriptors mostly end up being used for camera relocalization in SLAM systems, 3D reconstruction, and image-based localization, where scenes violate the assumption of smooth 3D surfaces.
- The evaluations are missing experiments demonstrating robustness to systematic increases in rotation, scale, and viewpoint.
- I think the experiments must show the standard deviation of the results in most tables, because the results are aggregated over different images. This is important since not all images return the same number of keypoints; images with rich textures can return more keypoints than images with low-complexity textures (e.g., walls).
- The experiments are missing details or lack clarity. In the experiments described in Section 4.3, what keypoint detector was used? According to lines 276-277 the same keypoint detector was used. I assume these were SuperPoint, DoG, and LF-Net, shown in Table 4? If so, please make it explicit in the text.

Lack of clarity. The submission is hard to read, and clarity is key for reproducibility. In particular, Section 3 defines too many symbols and operations and barely relates them to Fig. 1. I think Fig. 1 should include some symbols and operations indicating at a high level the sequence of operations and the input and output they produce. By doing so, I think the submission can significantly improve clarity. Also, I think Fig. 2 takes up too much real estate and conveys too little. Lemma 1 is good enough to make the point that features from different warps are just "permuted" in the activation tensors. Another part that needs more clarity is Eq. 2. What does it mean to construct matrix W defined on H? Does it mean there is a matrix W for every transformation h in H? Also, I assume W_i(h) is a column of the matrix W(h) and therefore b is a scalar. In sum, I think clarifying the dimensions of each element in Eq. 2 can improve clarity.

Minor comments:
- Use \max in Eq. 5.

Post Rebuttal: The rebuttal addressed some of my concerns, but unfortunately triggered more questions than answers:
1. The rebuttal still included experiments mainly on planar scenes. These planar scenes comply with the smoothness assumption. Unfortunately, real scenes are *not* smooth. Real scenes have objects with complex geometry that makes the assumption of smoothness unrealistic. Most of the references that the rebuttal mentioned in its "Importance of affine-invariant descriptors." paragraph use datasets with planar scenes. Adding results on scenes with more complex geometry (e.g., SfM datasets) would have revealed more interesting benefits and limitations of the approach. I would recommend checking out the following CVPR evaluation paper: Comparative Evaluation of Hand-Crafted and Learned Local Features, by J. Schonberger et al.
2. Why was SIFT not a baseline in the experiments measuring robustness to systematic increase in rotation and scale?
3. The experiments on SUN3D are not that challenging, since the dataset mainly contains indoor scenes, which clearly have many dominant planes. As such, this dataset satisfies their assumption. But how about outdoor SfM datasets?
4. It is not clear how the experiments measure pose errors and how the experiments estimate the relative pose. I don't understand why the results only show rotation errors and not translation errors.
5. The rotation error from SIFT correspondences seems very large in my experience. Why is this?
6. The large number of inliers for GIFT+Dense may be an artifact of using a dense grid. If so, the comparison is not fair and we cannot draw a confident conclusion.
7. The experiments do not indicate how they define an inlier and what thresholds they used.

Given these questions, my rating does not change.
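For reference on point 4 of the post-rebuttal questions, one common way to report relative pose error is sketched below. This is an assumed convention, not necessarily what the authors used: the rotation error is the angle of the residual rotation, and the translation error is the angle between translation directions, since translation recovered from two views is only known up to scale. The function names and the example poses are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_gt, R_est):
    """Angle (degrees) of the residual rotation R_gt^T R_est."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_deg(t_gt, t_est):
    """Angle (degrees) between translation directions; scale is ignored."""
    cos = np.dot(t_gt, t_est) / (np.linalg.norm(t_gt) * np.linalg.norm(t_est))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def rot_z(deg):
    """Rotation about the z axis, used only to build example poses."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Made-up example: estimate is off by 5 degrees in rotation
# and by 45 degrees in translation direction.
rot_err = rotation_error_deg(rot_z(30.0), rot_z(35.0))
t_err = translation_error_deg(np.array([1.0, 0.0, 0.0]), np.array([2.0, 2.0, 0.0]))
print(rot_err, t_err)
```

Reporting both angles, together with the inlier threshold asked about in point 7, would make the pose results easier to compare across methods.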

Reviewer 3

Originality:
1. The paper leverages a group feature extraction process whose output group features are proved to be equivariant to transformations.
2. The paper designs a novel group feature embedding module with two group CNNs and bilinear pooling.

Quality: While I appreciate the authors' effort, one missing part is the motivation/rationale for using two group CNNs and bilinear pooling. A naive solution is to use one group CNN + fc layer to generate a descriptor. We could double the group CNN parameters to achieve expressive power similar to that of the proposed group feature embedding module.

Clarity: The paper is well-written.

Significance: The paper shows improved performance on standard datasets. However, some important baseline comparisons are missing. One baseline is one group CNN + fc layer, as described above. Another is the Group Equivariant CNN, which the paper considers the most related work. Also, would a more advanced backbone feature network help, e.g., ResNet-101?

Paper ID:	3786