NeurIPS 2020

GOCor: Bringing Globally Optimized Correspondence Volumes into Your Neural Network


Review 1

Summary and Contributions: The authors introduce a module that refines the feature map before computing a correlation volume for the dense correspondence problem. Iteratively updating the feature map of the reference image with matching priors learned from the reference image itself and the query image enables robust correspondence estimation. Extensive experiments are conducted on three datasets: HPatches, KITTI, and Sintel.

Strengths:
- I like the idea of refining a feature map before computing the correlation volume, which has not been heavily explored before.
- The interpretation of the GOCor module in terms of optimization-based meta-learning techniques is interesting.
- The generality of the GOCor module means it can be integrated with any dense correspondence estimation algorithm.
- The above strengths are demonstrated through extensive experiments and thoughtful ablation studies.
- The paper is generally well written and organized, with promising results.

Weaknesses:
- Regarding the objectives described in Sec. 3.4 and 3.5, the intuitions behind them would be more solid if similar prior attempts were cited.
- Imposing the constraint of Sec. 3.4 (i.e., w^t_ij · f^r_kl = 0 when (k, l) != (i, j)) increases the discriminative power of f^r_ij, since it is encouraged to be far away from all f^r_kl (see the sketch after this paragraph). However, this may come at the cost of robustness, making w^t sensitive to large appearance or geometric variations. The smoothness constraint of Sec. 3.5 (or the pyramidal structure of GLU-Net and PWC-Net) may compensate for this trade-off, but this is somewhat heuristic. The experiments are conducted on classical dense correspondence problems where the variations might be mild, so additional experiments on semantic matching tasks [23,38,39,40], where larger appearance and geometric variations exist than in traditional stereo or optical flow estimation, would be expected. Results when combined with other matching algorithms that do not impose a smoothness constraint via a coarse-to-fine scheme would also be interesting.
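For concreteness, here is a minimal sketch of a least-squares reference-frame objective with the uniqueness constraint discussed above; all shapes and names are my own assumptions, and the paper's actual formulation may differ, e.g. in how the terms are weighted:

```python
import torch
import torch.nn.functional as F

def reference_objective(w, f_ref):
    # w:     (h*w, c) one matching filter per reference location
    # f_ref: (h*w, c) one feature vector per reference location
    scores = w @ f_ref.t()               # (h*w, h*w): filter i applied to feature j
    target = torch.eye(scores.size(0))   # want 1 when (i, j) coincide, 0 otherwise
    return F.mse_loss(scores, target, reduction='sum')
```

Driving all off-diagonal responses to zero is exactly what sharpens discriminability, and also what may hurt robustness to appearance change, as noted above.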

Correctness: See Weaknesses section.

Clarity: The paper is generally well written. In my opinion, Sec. 3.2 repeats descriptions found elsewhere in Sec. 3 and could be shortened. The spared room could include the ablation studies of Sec. E.1 and E.2 in the supplementary material, which I found useful.

Relation to Prior Work: The Related Work section is well written. The authors might want to cite these as related works:
[A] Dynamic Filter Networks, NeurIPS 2016, where filters are generated dynamically, conditioned on an input.
[B] SuperGlue: Learning Feature Matching with Graph Neural Networks, CVPR 2020, where features are updated with attentional graph neural networks whose edges are defined within the same image (intra-image edges) or to the other image (inter-image edges).

Reproducibility: Yes

Additional Feedback: Regarding the objective function of Eq. (5), what about cross entropy rather than least squares, as proposed in [23]? It has been reported that a cross-entropy loss makes the features more discriminative than an L2 distance (a sketch of this alternative follows below). -- After rebuttal -- I appreciate the authors' response to my concerns and upgrade my original recommendation to "A good submission (7)".
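For reference, a minimal sketch of the cross-entropy alternative suggested above, under the same assumed shapes as the earlier sketch (hypothetical; not from the paper or [23]):

```python
import torch
import torch.nn.functional as F

def cross_entropy_reference_objective(w, f_ref):
    # Hypothetical alternative to the least-squares objective of Eq. (5):
    # each filter's responses are treated as logits over all locations,
    # with the filter's own location as the ground-truth class.
    scores = w @ f_ref.t()                   # (h*w, h*w)
    labels = torch.arange(scores.size(0))    # filter i should select location i
    return F.cross_entropy(scores, labels)
```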


Review 2

Summary and Contributions: This paper presents globally optimized correspondence volumes (GOCor), a fully differentiable dense matching module that addresses the limitations of the standard feature correlation layer. That layer measures the similarity between each pixel of the source and target images (or reference and query images) independently and does not leverage priors on the images, which limits its capability. In contrast, the proposed method effectively learns spatial matching priors to resolve matching ambiguities. By introducing a filter predictor on the features of the source image, the correlation volumes extracted with GOCor encode matching confidence more effectively. Experiments on geometric matching and optical flow show the robustness of GOCor compared to existing methods.
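As an illustration of the filter-predictor idea described above, a minimal hypothetical sketch; the module name, shapes, and architecture are my assumptions, not the paper's:

```python
import torch.nn as nn

class FilterPredictor(nn.Module):
    # Hypothetical initializer P: maps reference features to initial matching
    # filters, which the internal optimizer then refines.
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_ref):        # (b, c, h, w) reference features
        return self.proj(f_ref)      # initial filters w^0 of the same shape
```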

Strengths:
- The problem formulation, which identifies the limitation of raw matching costs, makes sense and points to an important further direction, because the feature correlation layer is widely used in applications such as semantic matching, video object segmentation, and few-shot segmentation.
- The motivation that the reference and query images carry useful priors for dense matching also makes sense.
- The two tailored loss functions for the reference frame and the query frame are well designed.
- Thorough experiments on geometric matching and optical flow, as well as an ablation study, demonstrate the superiority of the proposed method.
- The nice visualizations in Figs. 1, 2, and 3 help the reader understand the paper.
- The paper is well organized and written.

Weaknesses: There is no strong weakness, but here are some minor issues.
- Why the filter predictor P should be built on the CNN features of the source image should be discussed. Ideally, the module could be applied not only to the source image but also to the target image, or to the target image only. Such an ablation study would make the architectural design much stronger.
- Is the optimizer described in L231-242 an offline or an online process?
- An analysis of computational complexity is lacking.
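On the offline-vs-online question, a minimal sketch of what an unrolled inner optimizer of this kind might look like; this is my own illustration under assumed shapes, not the paper's implementation, which derives its own objective terms and step lengths:

```python
import torch

def unrolled_steepest_descent(w, f_ref, target, num_iter=3):
    # A few steepest-descent steps on a least-squares fitting objective,
    # unrolled so gradients can flow through them during training.
    for _ in range(num_iter):
        residual = w @ f_ref.t() - target        # (h*w, h*w) fitting error
        grad = 2.0 * residual @ f_ref            # gradient of the squared error
        step = grad @ f_ref.t()                  # score change along the gradient
        # Exact step length for this quadratic objective:
        alpha = (grad * grad).sum() / (2.0 * (step * step).sum() + 1e-8)
        w = w - alpha * grad
    return w
```

In this reading, the inner loop is online (it runs per image pair at inference), while the surrounding network parameters are trained offline by backpropagating through the unrolled steps.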

Correctness: Correct.

Clarity: This paper is well organized and written.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: I strongly agree with the other reviewers' ratings, so I will keep my initial rating of "acceptance".


Review 3

Summary and Contributions: This paper describes a neural network module that computes match confidence values for the prediction of dense correspondences between two images. In current networks, the basic information for this task is computed by a so-called correspondence volume (or feature correlation) layer. Such a layer computes a measure of similarity between deep features at corresponding image locations for every location of the reference image and for every vector in a predefined, finite set of image displacements (or, in the global approach, for all possible integer-valued displacements). In contrast, the proposed method uses information that a correspondence volume layer does not have access to, namely, (i) the distribution of deep features for image locations with similar appearance and (ii) prior knowledge about properties of the correspondence field, and specifically its smoothness and the uniqueness of the correspondence for each reference location.
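For readers unfamiliar with such a layer, a minimal sketch of a global correlation volume of the kind described above; layout conventions vary across networks, so this version is illustrative only:

```python
import torch

def global_correlation(f_ref, f_query):
    # Global correlation volume: similarity of every reference location
    # to every query location, via dot products of deep features.
    b, c, h, w = f_ref.shape
    ref = f_ref.flatten(2)                           # (b, c, h*w)
    qry = f_query.flatten(2)                         # (b, c, h*w)
    corr = torch.einsum('bci,bcj->bij', ref, qry)    # (b, h*w, h*w)
    # One common layout: query positions as channels, reference grid as space.
    return corr.permute(0, 2, 1).reshape(b, h * w, h, w)
```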

Strengths: The correspondence confidence values computed by the proposed method optimize a global similarity criterion, rather than a local one, and the method achieves superior performance as a result. This is achieved through minimizing a simple and well-motivated loss function. The idea (lines 202-207) of imposing the smoothness constraint on the matching confidence values, as opposed to the displacement field, is interesting and novel. In addition, (the convolution kernel of) this smoothness regularizer is learned during training, rather than being hand-crafted, and can therefore potentially match the data better. The module generalizes well to domains characterized by new image content and motion patterns. The module can be plugged into standard neural networks to replace a correspondence volume layer. Experiments demonstrate superior performance, by significant margins, when the proposed module replaces correspondence volume modules in recent state-of-the-art networks for geometric correspondences and optical flow. For flow, the improvement margins are particularly good on natural scenes (as opposed to synthetic animations), even when training is done on animations. This speaks to the ability of the proposed system to generalize to different domains, a point emphasized also by one of the ablation studies (compare entries II and III in Table 3, page 8).
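To make the smoothness-on-confidences idea concrete, a minimal hypothetical sketch; the kernel size, sharing scheme, and penalty form are my assumptions rather than the paper's design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceRegularizer(nn.Module):
    # Hypothetical learned smoothness term applied to matching confidences
    # rather than to the displacement field.
    def __init__(self, kernel_size=3):
        super().__init__()
        self.kernel = nn.Parameter(0.1 * torch.randn(1, 1, kernel_size, kernel_size))

    def forward(self, confidences):                  # (b, d, h, w), d displacements
        b, d, h, w = confidences.shape
        x = confidences.reshape(b * d, 1, h, w)      # share the kernel over channels
        filtered = F.conv2d(x, self.kernel, padding=self.kernel.shape[-1] // 2)
        return (filtered ** 2).mean()                # penalty on filtered responses
```

Because the kernel is a learned parameter, training can shape it into whatever spatial operator best fits the data, rather than a hand-crafted smoother.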

Weaknesses: The advantages mentioned above are achieved at the cost of an optimization procedure that is performed at inference time. However, a loop-unrolling scheme is proposed to make this optimization efficient, albeit approximate, so this aspect is not a major drawback. Some claims are made in the description of the method that, while plausible, are not evaluated experimentally. Specifically, the proposed query-frame objective does not enforce uniqueness, but merely provides flexibility for the regularizer potentially to learn that correspondences are unique. It is not clear whether this makes things better or worse, and a focused study of this point would be useful. In the terminology of the paper, does the regularizer indeed learn “peak-enhancing operators”? Another useful ablation study would show that computing smoothness on confidence values is better than smoothness on displacement values. This is an interesting and plausible point that deserves empirical support. Similar considerations hold for a more detailed analysis of what happens in the presence of occlusions. While no claim is made about this point, it would be interesting at least to show some anecdotal examples of an optical flow map near occlusion boundaries: To what extent does smoothness regularization blur the flow map? The supplementary materials help to some extent in some of these directions.

Correctness: The claims and the mathematics are correct. The experimental methodology is also correct.

Clarity: The paper is very well written, with good structure, well-designed mathematical notation, good English, and clear pictures.

Relation to Prior Work: Yes. See discussion above.

Reproducibility: Yes

Additional Feedback: The “weaknesses” listed above are really just a plea for more insights about an interesting method. I am fully aware of the difficulty of adding material in an 8-page paper, so these further insights may have to be left for future publications, or for the supplementary materials. I would report running times in the main paper, rather than in the supplementary materials, given that inference involves optimization. The authors' responses to my technical remarks are well taken, and strengthen my original assessment that this is a good paper.


Review 4

Summary and Contributions: This paper proposes a novel cross-correlation layer, called GOCor, that explicitly takes into consideration the underlying structure of the reference and query images and the regularity exhibited in the scene. The proposed layer is well motivated, in the sense that the standard feature correlation layer does not disambiguate between multiple similar regions, and the proposed one can alleviate these issues through an internal optimization procedure that explicitly accounts for similar regions. Improvements are demonstrated on both geometric matching and optical flow benchmarks, together with an extensive ablation study that analyzes the effectiveness of each component.

Strengths: The proposed layer, GOCor, helps alleviate ambiguities in the matching field derived from the standard cross-correlation layer by increasing the uniqueness of the features extracted from the reference image and imposing regularities in the matching field induced by the query image. Repetitive patterns and textureless regions are properly addressed; also, the proposed objective is differentiable, which allows end-to-end learning. The explanation is good and the implementation details are well described. Moreover, the results are promising compared to the standard layer.

Weaknesses: Even though the proposed method can disambiguate similar regions, the introduced internal optimization also incurs heavy computation. The authors should give some advice on how to reach a good speed-accuracy trade-off.

Correctness: The main claims are supported by the experiments. But regarding the design of the regularization operator: it is said to be induced by the query frame, yet the convolution kernel R_theta is only a set of parameters trained on both the reference frame and the query frame. Does it actually take the query frame as input? If not, why call Eq. (6) the query-frame objective? It seems to me that it is just a smoothing operator.

Clarity: Yes, with a nice suppmat.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I would like to keep my rating. The appropriate naming of the terms is still not well addressed, in my opinion, but I think this does not hurt the contribution of the paper. If addressed, the paper would be more elegant.