Reviews: Guided Similarity Separation for Image Retrieval

Clarity:
- the paper is globally well written, but several parts are unclear:
  * the "manifold" is mentioned without any prior explanation of what it is, e.g. lines 22~23 and lines 27~28
  * line 120: the normalization procedure cannot ensure that the rows and columns have unit norm, contrary to what is claimed. A trivial counter-example is the symmetric matrix [[1,1],[1,0]].
  * line 147: I am not sure I understand this line
  * figure 2(b): the runtime overhead is relative to what?
  * how is k-NN implemented in the experiments?
Novelty:
- the method combines existing approaches with a novel unsupervised loss. Namely, a graph convolutional network (GCN) takes as input the image descriptors and a fixed adjacency matrix defining pairwise similarities, while the novel GSS regression loss teaches the GCN how to cluster points, again based on pairwise similarities, this time computed on the output descriptors.
- the paper claims to "introduce a fully unsupervised loss to optimize the model" [for image retrieval] (line 42), which is unheard of, and thus extremely impressive... but also deceitful. As stated below, the success of the approach entirely depends on the initial image descriptors, which are the state-of-the-art descriptors from [33], trained with full supervision using a ranking loss. No experiments are given to investigate this crucial aspect.
Quality:
- the experimental results are extremely good on multiple standard benchmarks, but some important points need to be clarified:
- the proposed approach has 2 GCN layers (eq. (3)), which is actually not deep at all (i.e. at best, second-order neighbors are used to augment the image descriptors). Experiments are needed to investigate this parameter.
- similarly, there is no experimental proof that the GSS loss is actually driving the good performance. What happens when the GCN is applied with identity weights and zero bias for 1 or 2 GCN layers? Ideally, the AP on multiple datasets would be plotted as a function of the training iteration (0 to 300).
- isn't the approach actually just exploiting a bias due to the dataset construction? Specifically, we must recall that these datasets have been constructed by first gathering different clusters of related images, taking a few images out of each cluster to form the queries, and then appending random distractor images (which are naturally unclustered). Is it not expected, then, that a method that explicitly clusters the dataset, like the proposed one, achieves top performance precisely because it exploits the structure introduced by the construction procedure? What would happen on real-life datasets?
  * a hint towards this hypothesis is that the hyper-parameters need to be carefully tuned for each dataset separately (line 255). This is because the clusters in each dataset have different sizes and distributions.
  * one way to verify this hypothesis is to provide results for the datasets augmented with 1M distractor images. This would significantly hamper the clustering process (which is very easy and straightforward on small datasets) and would make the setting more realistic.
  * a related question is: what is the training time as a function of the number of images in the dataset?
- the convergence of this unsupervised approach, and hence its good results, largely depends on the initial image descriptors. This paper uses descriptors from [33], but what would happen if these descriptors were corrupted by noise?
Significance:
- as far as I understand, the proposed approach is trained separately for each dataset, which strongly limits its general applicability
- at test time, the method needs to embed the query into the adjacency matrix and make a forward pass through the GCN. As a matter of fact, this is more or less equivalent to an elaborate form of QE (and the same process has been applied to the whole database). Isn't this defeating the purpose of "directly encoding the neighbor information into image descriptors"? In the end, the query needs even more processing than with traditional QE (kNN search, then aggregation with neighbors using even more computations). This is very clear from Figure 2 (c).
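To make the line-120 counter-example concrete, here is a minimal sketch that normalizes the matrix [[1,1],[1,0]] and inspects its rows. It assumes the normalization is the symmetric degree scaling D^{-1/2} A D^{-1/2} that is common for GCNs; the exact procedure at line 120 of the paper is not reproduced in this review, so that form is an assumption, and the helper names are hypothetical. The argument does not hinge on this choice: for any symmetric scaling D A D with positive diagonal entries d1, d2, the second row has the single non-zero entry d1*d2, so a unit norm (or unit sum) forces d1*d2 = 1, and the first row, with entries d1^2 and d1*d2 = 1, then has norm and sum strictly greater than 1.

```python
import math

def sym_normalize(a):
    """Assumed normalization: D^{-1/2} A D^{-1/2}, with D the diagonal of row sums.

    Requires every row sum to be positive.
    """
    deg = [sum(row) for row in a]
    n = len(a)
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def row_l2_norms(m):
    """Euclidean norm of each row."""
    return [math.sqrt(sum(v * v for v in row)) for row in m]

def row_sums(m):
    """Sum of each row."""
    return [sum(row) for row in m]

# Counter-example from the review; it is symmetric, so columns behave like rows.
A = [[1.0, 1.0],
     [1.0, 0.0]]

A_norm = sym_normalize(A)
norms = row_l2_norms(A_norm)
sums = row_sums(A_norm)

print("normalized matrix:", A_norm)
print("row L2 norms:", norms)
print("row sums:", sums)
```

Under this assumed normalization, neither the row norms nor the row sums equal 1, which is consistent with the objection raised above.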

Originality:
- The approach is original. It uses a GCN to learn and encode neighborhood information in image retrieval, a task that previously relied mostly on handcrafted QE, similarity propagation, and spatial verification. This has the potential to learn higher-order relationships and more powerful representations.
Quality:
- The proposed technique appears sound and well researched. The proposed GSS loss is based on a simple intuition and is demonstrated to work well. The second-order approximation to the adjacency matrix during inference is reasonable.
Clarity:
- This paper is well written. The motivation and techniques are explained clearly, and implementation details are discussed. An ablation study on the most important parameter, $\beta$, is included.
Significance:
- The results show convincing improvements over the state of the art on INSTRE, ROxford and RParis. It would, however, be nice to see results on the augmented ROxford+1M and RParis+1M.
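Both reviews contrast GCN propagation with handcrafted neighbor aggregation such as QE. As a point of reference, the sketch below shows the parameter-free baseline raised in the first review: a GCN layer with identity weights and zero bias, applied once or twice. It assumes the generic layer form H' = ReLU(A_hat H W + b) with A_hat = D^{-1/2} A D^{-1/2}, and a symmetric k-NN cosine-similarity adjacency with self-loops. The paper's eq. (3) and its graph construction are not reproduced in these reviews, so both are assumptions, and the toy descriptors and function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def cos(u, v):
    """Cosine similarity of two unit-norm vectors (plain dot product)."""
    return sum(a * b for a, b in zip(u, v))

def knn_adjacency(x, k):
    """Assumed graph: symmetric k-NN adjacency with self-loops and cosine weights."""
    n = len(x)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: -cos(x[i], x[j]))
        for j in others[:k]:
            w = max(cos(x[i], x[j]), 0.0)  # clip negative similarities
            adj[i][j] = adj[j][i] = w
    return adj

def propagate(adj, h):
    """One layer with W = I and b = 0: ReLU(A_hat @ H), then L2-normalize each row."""
    deg = [sum(row) for row in adj]
    n, d = len(h), len(h[0])
    out = []
    for i in range(n):
        agg = [sum(adj[i][j] / math.sqrt(deg[i] * deg[j]) * h[j][c]
                   for j in range(n))
               for c in range(d)]
        out.append(l2_normalize([max(v, 0.0) for v in agg]))
    return out

# Hypothetical toy descriptors: two loose clusters in 2-D.
X = [l2_normalize(v) for v in ([1.0, 0.1], [0.9, 0.3], [0.1, 1.0], [0.3, 0.9])]
A = knn_adjacency(X, k=1)

H1 = propagate(A, X)   # 1 layer: first-order neighbors
H2 = propagate(A, H1)  # 2 layers: up to second-order neighbors

print("within-cluster cosine, input:   ", cos(X[0], X[1]))
print("within-cluster cosine, 1 layer: ", cos(H1[0], H1[1]))
print("within-cluster cosine, 2 layers:", cos(H2[0], H2[1]))
```

With identity weights, each layer is a weighted average over graph neighbors, which is close in spirit to query expansion applied to every database item. Comparing retrieval AP for such a baseline against the trained model would be one way to isolate how much of the gain comes from the GSS loss rather than from propagation alone.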

Paper ID:	881
Title:	Guided Similarity Separation for Image Retrieval
