NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5336
Title: Spatial-Aware Feature Aggregation for Image based Cross-View Geo-Localization

Reviewer 1


The main problem of the paper is that the contributions are somewhat incremental and that the details of one of the main contributions, in Section 3.2, are not as clear as they should be. For example, it is unclear to me what operation is denoted by <.,.> in Equation 2. Is it elementwise multiplication or something else? Apparently there can be several "embedding maps M", but it is not clear to me how they differ.
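To make my confusion concrete, these are the two readings of Eq. 2 that I can imagine (a toy numpy sketch with shapes I am guessing at, not the paper's actual definition):

```python
import numpy as np

# Toy shapes: a feature map F of size (H, W, C) and an embedding map M of size (H, W).
H, W, C = 4, 8, 16
F = np.random.rand(H, W, C)
M = np.random.rand(H, W)

# Reading 1: <F, M> as a per-channel inner product over spatial positions,
# i.e. elementwise multiplication followed by summation, giving one scalar per channel.
k_inner = np.einsum('hwc,hw->c', F, M)   # shape (C,)

# Reading 2: <F, M> as pure elementwise (broadcast) multiplication,
# keeping the spatial dimensions.
k_elementwise = F * M[..., None]         # shape (H, W, C)
```

Clarifying which of these (or which other operation) is meant would resolve most of my confusion about Section 3.2.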

Reviewer 2


# Paper structure
- The paper is clearly written and well structured. However, I have the impression that a reader who is not fully aware of the SOTA in this subdomain, or of what is standard practice and what is not, will not find the current literature review sufficient.
- Some parts need additional explanation; I detail below where. In general, I have the feeling that many details have not been addressed carefully enough, which leaves some doubts about the contribution.

# General comments
- The paper makes an interesting contribution, mainly because, as stated in the introduction, i) a polar transform is applied to the database-retrieved aerial image to be matched and ii) a spatial attention module is introduced. However, these two aspects, in conjunction with an experimental setup that is not crystal clear, raise some questions.
- In the polar transform, at which coordinates is the aerial image transformed, and how are those chosen? Are all possible coordinates tested against the ground-level image? L60-63 seem to clearly indicate that the polar transform is applied _before_ the deep metric learning, and therefore the location is not learned jointly. (To make my reading of the transform explicit, I include a small sketch after these comments.)
- How are the aerial images sampled in terms of size and coverage? Is the GPS of the ground image used for a ballpark cropping? How are orientation and alignment chosen? This question arises naturally, and depending on the answer further comments may follow, e.g. on whether fixing some of these parameters in advance makes the results realistic or not. For instance, in the data used, the pairs shown in the figures seem to be nicely pre-aligned, such that the center of the aerial image corresponds to the position of the ground-level image. While this is fine for training, at test time one does not have this knowledge and images should not be pre-aligned; otherwise there is no point in stating that the method is useful for localization. There is a full question mark over the retrieval part, which I hope will be clarified in the rebuttal. Along the same line, the datasets and experimental setups should be presented in a bit more detail. This point and the previous one make me wonder whether the results are very accurate only because the problem is greatly simplified, not directly by the polar transform, but by the implicit massive reduction of potential false positives. I expected a setup like [18], but that does not seem to be the case?
- The development of the spatial attention module seems a bit arbitrary. Depending on the number of neurons, the input images, the amount of detail, clutter, etc., a full max-pooling along channels might well result in non-discriminative masking. L170-171 seem to be the main justification for this choice, and it sounds rather arbitrary, which makes the statement at L172-173 a pure hypothesis at this point. In the result images, although the module focuses on some features, these do not seem to be that discriminative: e.g. in Fig 5 (left) the non-polar-transformed image seems to detect both streets (or the same one, just distorted), while the polar-transformed activation seems to highlight only one. I may be missing something here, but it would be nice to have clarification on what is doing what and, mostly, why.
- L175-176 state that "we make SPE generate multiple embeddings ...". How is this achieved? Intuitively, I would say that several polar transformations are applied and the resulting features are concatenated to perform matching. Still, from what is written in the paper this does not seem to be the case, as the polar transformation and SPE appear to be independent modules. This further strengthens my question about how the polar transformation is applied in practice, and how the target orientation and coordinates are chosen.
- Would this pipeline really work for ordinary ground-level images, or are panoramas required because of the polar transform? Can this be tested somehow?
- What level of "prior" matching is needed between aerial and ground images in terms of content? Is the method robust to changes? Aerial images are very likely to contain outdated information; can this be tested somehow by artificially changing the content of the current dataset (perhaps once the method is trained)?
- The experiments are done properly and the ablation studies are good. I just wonder how the actual retrieval/matching problem influences the final alignment, e.g. if, instead of small pre-centered images, only very large and coarse-resolution aerial images are available, how does the method scale to testing all possible locations for the polar transform?
- I wonder whether other competitors could be added to the evaluation on the second dataset, since for now it does not really help in positioning the contribution within the SOTA. I see big improvements over the previous SOTA, but it is still hard to assess them. Since the improvements are so large, and such massive improvements are often suspicious, I would be careful to motivate them even further. The ablations really help in that direction, but might not be enough.

# Specific comments
- In the introduction, I would state very explicitly how the pipeline is applied at test time, from the selection of candidates to the final scoring. This information is also missing from the experimental section.
- L171-173 need to be clarified.
- Section 4, datasets: the data should be described in a bit more detail, even though it is freely accessible. The fact that so many pairs are established suggests that the aerial images are pre-cropped and pre-centered, which, if true, would in my opinion invalidate the results, because they would be unrealistic. I understand that the competitors may evaluate under the same terms and settings, but the final numbers would still be unrealistically high for the real world, because the dataset is unrealistic.
- Tab 3 and L269: M is already used as "the position embedding map" and should not also be used to denote counts.
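For reference, this is how I understand the polar transform to operate on a single aerial crop centered at the candidate GPS location (a minimal numpy sketch of my own reading, with made-up sizes and an assumed north-up alignment; the paper may differ):

```python
import numpy as np

def polar_transform(aerial, out_h=128, out_w=512):
    """Warp a square aerial image, assumed centered on the candidate location,
    into a panorama-like image: columns sweep azimuth, rows sweep radius."""
    s = aerial.shape[0]
    cx = cy = (s - 1) / 2.0
    max_r = (s - 1) / 2.0
    out = np.zeros((out_h, out_w) + aerial.shape[2:], dtype=aerial.dtype)
    for i in range(out_h):                       # radius: image center -> border
        r = max_r * (i + 1) / out_h
        for j in range(out_w):                   # azimuth: 0 -> 2*pi
            theta = 2.0 * np.pi * j / out_w
            x = cx + r * np.sin(theta)           # assumed north-up aerial image
            y = cy - r * np.cos(theta)
            out[i, j] = aerial[int(round(y)), int(round(x))]  # nearest neighbour
    return out

# e.g. pano_like = polar_transform(np.random.rand(256, 256, 3))
```

If this reading is correct, then both the crop center (the candidate location) and the zero-azimuth direction are fixed before learning, which is exactly the prior knowledge I am asking about above.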

Reviewer 3


On the positive side, the paper makes two interesting and effective technical contributions: the polar transform and the spatial-aware feature aggregation. Detailed experiments show that each of the two contributions alone is sufficient to obtain state-of-the-art results and that their combination further improves performance. I especially like the polar transform, as it is a neat idea that is simple to implement, requires no parameter tuning or learning, and can be used as a pre-processing step to boost the performance of previous work (as shown in the supplementary material, where Polar CVM-Net, which applies the polar transform before CVM-Net [5], already reaches state-of-the-art performance compared to standard CVM-Net and [12] on both CVUSA and CVACT_val). Given that cross-view localization between ground-level and aerial images is a challenging problem, I am surprised and very impressed by how much the proposed contributions improve performance. The paper clearly advances the state-of-the-art by a significant margin.

Both the polar transform and the spatial-aware feature aggregation are technically sound. I am not aware of previous work on cross-view localization that uses either of them. The paper does a good job of motivating its technical contributions, clearly explaining the need for both of them. The very detailed experimental evaluation further verifies the two contributions by showing the impact of each one individually and combined. The references to prior work seem adequate (although [Regmi, Borji, Cross-View Image Synthesis using Conditional GANs, CVPR 2018] and [Vo et al., Revisiting IM2GPS in the Deep Learning Era, ICCV 2017] also seem relevant), and the paper does a good job of explaining the differences between the proposed approach and previous methods.

My main point of criticism is the clarity of presentation of parts of the paper, which I believe could be improved:

1) I found the description of the SPE module rather confusing. Looking at Eq. 2, it seems that the embedding map M is strongly tied to the input feature map (basically selecting the maximum activation over all channels at a given pixel position). In particular, no additional learning seems to be involved, an impression that is corroborated by Fig. 3. If this is the case, I do not see how the embedding map provides information that is not already present in the input feature maps from Fig. 3. In contrast, Fig. 2 suggests that the SPE module performs a set of convolutions to obtain the embedding map M and then combines this spatial layout with the original input features. This would make more sense to me, as I otherwise do not see how one would obtain benefits from multiple SPE modules without introducing additional trainable parameters.

2) No information is provided on which layer of the VGG16 architecture is used as input to SAFA. If SAFA contains trainable parameters, then the architecture of the SPE modules should be specified as well. Otherwise, it will be very hard to replicate the results reported in this paper (unless source code is released). In addition, it is unclear to me whether the weights of the Siamese network from Fig. 2 are shared between aerial and ground-level images. The color coding used in the figure seems to indicate that the parameters are not shared, which would make sense given that the two image sources are not geometrically aligned.

3) In order to produce similar descriptors for warped aerial images and ground-level panoramas, they would need to be aligned. There is a potential rotation ambiguity between the original panoramas and those obtained via the polar transform. However, I do not see how this would be handled by SAFA. Does the paper assume that the transformed aerial image has the same orientation as the original panorama, e.g., that the center of the panorama corresponds to the north direction?

4) Splitting the results on CVUSA and CVACT_val over multiple tables (Tab. 1, 2, and 3) makes it hard to directly compare the different variants of the proposed approach with state-of-the-art results. Combining all results into a single table would make the presentation clearer to read. It would also free up enough space to include the results for Polar CVM-Net from the supp. mat. (which I think add quite some value to the paper by showing that the polar transform can be used with other approaches as well).

5) Sec. 4.2 states that the CVACT_test set is used for evaluation, while the tables only mention the CVACT_val set. Which one is used?

6) [15] (VGG) seems to be the wrong reference for the statement "[15] aims to learn image descriptors that are invariant against large viewpoint changes." In general, it would be good to cite the conference versions of papers (e.g., [15] was published at ICLR 2015) rather than their arXiv versions.

While the shortcomings listed above impact the reproducibility of the paper under review, and thus decrease its potential impact, I do not think that they are severe enough to justify rejection, as they can be addressed in a potential camera-ready version. As such, I recommend accepting the paper.

----- update after rebuttal and discussion ----

After reading the other reviews and the rebuttal, I still believe that the paper is worth accepting. While I had hoped for more details regarding SAFA (I agree with R1 that this part is not really described in detail), I am satisfied with the rebuttal and trust the authors to provide more details in the final version of the paper. However, given the lack of technical details in the rebuttal on SAFA, I am reluctant to raise my initial score.

Regarding the concern about a prior on the orientation alignment raised by R2: I also have the impression that the orientation (north direction) of both ground and aerial images is roughly known. I do not think this is much of a limiting assumption, though. For the aerial images, it should be possible to get a good estimate of north from other sensors. For ground-level images, one could simply use multiple orientation hypotheses at test time (for panoramas, this is just a shift of the center of the image; see the sketch below). While this would come at the price of increased run-time, I do not think it would be a problem in practice, especially given the significant increase in performance achieved by the proposed approach. However, I think this part should be made clearer.
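To illustrate the kind of test-time orientation search I have in mind (a toy numpy sketch of my own suggestion, with a placeholder `describe` function standing in for whatever network maps an image to a descriptor; it is not part of the paper):

```python
import numpy as np

def orientation_hypotheses(panorama, num_shifts=8):
    """Generate rotated versions of a ground-level panorama by circularly
    shifting its columns; each shift corresponds to one heading hypothesis."""
    w = panorama.shape[1]
    return [np.roll(panorama, shift=w * k // num_shifts, axis=1)
            for k in range(num_shifts)]

def best_matching_score(panorama, aerial_descriptor, describe, num_shifts=8):
    """Score an aerial candidate against all heading hypotheses and keep the best.
    `describe` is a placeholder for an image-to-descriptor network."""
    scores = []
    for hyp in orientation_hypotheses(panorama, num_shifts):
        g = describe(hyp)
        # cosine similarity between ground and aerial descriptors
        scores.append(np.dot(g, aerial_descriptor) /
                      (np.linalg.norm(g) * np.linalg.norm(aerial_descriptor) + 1e-8))
    return max(scores)
```

The run-time cost is a constant factor (one descriptor per hypothesis per query), which seems acceptable given the reported performance gains.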