{"title": "Guided Similarity Separation for Image Retrieval", "book": "Advances in Neural Information Processing Systems", "page_first": 1556, "page_last": 1566, "abstract": "Despite recent progress in computer vision, image retrieval remains a challenging open problem. Numerous variations such as view angle, lighting and occlusion make it difficult to design models that are both robust and efficient. Many leading methods traverse the nearest neighbor graph to exploit higher order neighbor information and uncover the highly complex underlying manifold. In this work we propose a different approach where we leverage graph convolutional networks to directly encode neighbor information into image descriptors. We further leverage ideas from clustering and manifold learning, and introduce an unsupervised loss based on pairwise separation of image similarities. Empirically, we demonstrate that our model is able to successfully learn a new descriptor space that significantly improves retrieval accuracy, while still allowing efficient inner product inference. Experiments on five public benchmarks show highly competitive performance with up to 24\\% relative improvement in mAP over leading baselines. Full code for this work is available here: https://github.com/layer6ai-labs/GSS.", "full_text": "Guided Similarity Separation for Image Retrieval\n\nChundi Liu\nLayer6 AI\n\nchundi@layer6.ai\n\nGuangwei Yu\n\nLayer6 AI\n\nguang@layer6.ai\n\nCheng Chang\n\nLayer6 AI\n\njason@layer6.ai\n\nHimanshu Rai\n\nLayer6 AI\n\nhimanshu@layer6.ai\n\nJunwei Ma\nLayer6 AI\n\njeremy@layer6.ai\n\nSatya Krishna Gorti\n\nLayer6 AI\n\nsatya@layer6.ai\n\nMaksims Volkovs\n\nLayer6 AI\n\nmaks@layer6.ai\n\nAbstract\n\nDespite recent progress in computer vision, image retrieval remains a challenging\nopen problem. Numerous variations such as view angle, lighting and occlusion\nmake it dif\ufb01cult to design models that are both robust and ef\ufb01cient. 
Many leading methods traverse the nearest neighbor graph to exploit higher order neighbor information and uncover the highly complex underlying manifold. In this work we propose a different approach where we leverage graph convolutional networks to directly encode neighbor information into image descriptors. We further leverage ideas from clustering and manifold learning, and introduce an unsupervised loss based on pairwise separation of image similarities. Empirically, we demonstrate that our model is able to successfully learn a new descriptor space that significantly improves retrieval accuracy, while still allowing efficient inner product inference. Experiments on five public benchmarks show highly competitive performance with up to 24% relative improvement in mAP over leading baselines. Full code for this work is available here: https://github.com/layer6ai-labs/GSS.

1 Introduction

Image retrieval is a fundamental problem in computer vision with a wide range of applications including image search [43, 12], medical image analysis [21], 3D scene reconstruction [14], e-commerce [16, 24] and surveillance [41, 35]. To cope with the tremendous volume of visual data, most image retrieval systems address the problem in two stages. First, images are mapped to descriptors that support efficient inner product retrieval. Recent advances in convolutional neural networks ushered in significant progress in descriptor models based on deep learning [13, 33], largely replacing traditional local feature descriptors [37, 25, 29] due to better performance and efficiency. Following descriptor retrieval, the second stage refines the retrieved set by considering the manifold structure that is known to be important for visual perception [34, 39].
Robust image retrieval remains a challenging problem. Variations in view angle, lighting and occlusion make it challenging to design retrieval models that are robust to these artifacts. 
Many currently leading approaches borrow ideas from clustering, where similar challenges exist. They aim to satisfy local consistency, where images with nearby descriptors are relevant, and global consistency, where images on the same descriptor manifold are also relevant [47]. Popular methods in this category include query expansion (QE) [8] and its variants [33], which combine descriptors from neighboring images, pushing them closer together. Another popular direction is similarity propagation/diffusion [48, 9, 19], which applies random walks on the nearest neighbor graph. Similarity propagation can explore higher-order neighbors than QE, and better uncover the underlying image manifold. While effective, both QE and similarity propagation rely on hyper-parameters that need to be tuned by hand, and typically no learning is done. This limits the representational power of these models as they rely heavily on base descriptors.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Inspired by these directions, we propose a new retrieval model that is trained end-to-end to improve both local and global consistency. To promote local consistency we use a graph convolutional network [23] to encode neighbor information into image descriptors. This has a similar effect to QE, where nearby descriptors share information. However, unlike QE, which always operates in the same descriptor space, our model learns a new representation that utilizes higher order neighbor information to improve retrieval. We additionally introduce a novel loss function to optimize the proposed model in a fully unsupervised fashion. The proposed loss encourages pairwise clustering of similarity scores, and stems from ideas in manifold learning where analogous pairwise objectives were shown to successfully uncover robust low-dimensional manifolds [15, 26, 44]. 
Finally, we show that our model can successfully learn refined neighbor information from spatial verification, and introduce an approximate inference procedure to leverage this information without applying spatial verification at query time. Experiments on five public benchmarks show highly competitive performance with up to 24% relative improvement in mAP over leading baselines.

2 Related Work

A popular research direction in image retrieval that shares motivation with our work is similarity propagation [46, 31, 9, 19, 3]. As the name suggests, similarity propagation applies random walks to propagate similarities on a weighted nearest-neighbor graph generated through descriptor retrieval. Similarities are propagated repeatedly until a global stable state is achieved. Spatial verification [2] is often applied in conjunction with similarity propagation to refine the neighbor graph and reduce false positives. Our approach differs from similarity propagation in that instead of traversing the nearest-neighbor graph we directly encode it into image descriptors, learning a new descriptor space in the process.
Clustering  Similarity propagation is based on the idea of local and global consistency [47], which stems from clustering and related fields [5, 44]. The commonly used cluster assumption states that the decision boundary should lie in regions of low density, and equivalently, that related points should be connected by a path passing through a high-density region [5]. Based on this assumption, many clustering approaches optimize pairwise similarity between examples, demonstrating that it indirectly achieves the desired separation effect [39, 10, 6]. 
This forms the basis of our loss function and learning scheme – we adapt the pairwise clustering paradigm to unsupervised image retrieval, where there are no fixed clusters or cluster centroids.
Deep Embedded Clustering (DEC) [44] is the most similar clustering approach to our work. The authors of DEC propose to start with a reasonably good model, then alternate between generating cluster assignments and optimizing the model using high confidence assignments. They demonstrate that improvement is made at each iteration by learning from high confidence assignments, which in turn improves low confidence ones. Our approach follows a similar framework where we alternate between updating descriptors and maximizing (minimizing) already confidently high (low) pairwise similarity scores. The main difference is that our loss is designed for retrieval, and operates purely on image pairs without any explicit cluster information.
Manifold Learning  Manifold learning aims to infer low dimensional representations for high dimensional input that capture the manifold structure. This area is highly relevant to image retrieval as global descriptors aim to achieve a similar goal. Here, popular methods include IsoMap [39], LLE [34], t-SNE [26] and LINE [38]. In image retrieval, the manifold learning method most similar to our work is the recently proposed IME [45]. Analogous to our approach, IME learns a new descriptor space by reconstructing pairwise distances between images in the nearest neighbor graph. The major difference, however, is that IME aims to preserve the structure of the graph generated by the base descriptors. 
Given that base descriptors (and the resulting neighbor graph) can have many inaccuracies, we instead focus on improving pairwise descriptor distances between images by learning from confident predictions that are likely to be correct.

3 Approach

We follow the standard image retrieval set-up used in the literature [9, 19]. Given a database of n images X = {x1, ..., xn} and a query image xq, the goal is to retrieve all relevant images from X for xq. Analogous to previous work we assume that global image descriptors have been extracted, and each image is represented by a vector x ∈ R^d [19]. We define a k-nearest neighbor (k-NN) graph Gk = (X, Ak) with nodes X and edges described by the symmetric adjacency matrix Ak = (a_ij) ∈ R^{n×n}:

    a_ij = x_i^T x_j   if x_j ∈ N_k(x_i) ∨ x_i ∈ N_k(x_j)
           0           otherwise,                                  (1)

where N_k(x) is the set of k nearest neighbors (including itself) of x in the descriptor space. The adjacency matrix is highly sparse, with no more than 2kn non-zero values. Currently leading image retrieval approaches use various versions of the k-NN graph as the transition matrix in a random walk process, also known as similarity propagation/diffusion [9, 19, 18]. Random walk enables effective traversal of the underlying manifold, significantly improving retrieval quality over base descriptors. However, most similarity propagation models depend on hyper-parameters that need to be tuned by hand, and typically no learning is done. In this work we propose a different approach where we encode neighbor information directly into image descriptors, and then train the model to learn a new descriptor space with desired properties.
The main idea behind our approach stems from clustering, which also forms the basis of many similarity propagation methods [47, 48]. 
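To make Eq. (1) concrete, the adjacency construction can be sketched in a few lines of NumPy. This is a minimal illustration under the setup above, not the authors' released implementation; the function name `knn_adjacency` and the dense-matrix representation are introduced here for clarity (a sparse format would be used at scale):

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric k-NN adjacency of Eq. (1).

    X: (n, d) array of L2-normalized global descriptors.
    Returns a dense (n, n) matrix with at most 2kn non-zero entries.
    """
    S = X @ X.T                               # pairwise inner products
    n = S.shape[0]
    # N_k(x_i): indices of the k highest-similarity images (self included)
    nn = np.argsort(-S, axis=1)[:, :k]
    mask = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    mask[rows, nn.ravel()] = True
    mask = mask | mask.T                      # x_j in N_k(x_i) OR x_i in N_k(x_j)
    return np.where(mask, S, 0.0)
```

The OR in the mask mirrors the symmetrization in Eq. (1): an edge survives if either image appears in the other's neighbor set.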
Specifically, we aim to satisfy both local consistency, where images with nearby descriptors are relevant, and global consistency, where images on the same descriptor manifold are also relevant. To achieve this we propose a new architecture where a graph convolutional network (GCN) [23] is used to encode information from the k-NN graph into image descriptors to achieve local consistency. We then introduce a novel loss function that encourages the GCN to learn descriptors that improve global consistency. Our approach, referred to as Guided Similarity Separation (GSS), is fully unsupervised and supports efficient inner product retrieval.

3.1 Model Architecture

In similarity propagation, local consistency is achieved by defining a transition matrix based on the k-NN graph. Here, we take a different approach and use a GCN [23] architecture to achieve a similar effect. The main operation in each GCN layer is the multiplication of the (normalized) adjacency matrix Ak with image descriptors X. This has the effect of applying a weighted average to neighbor descriptors. Analogous to query expansion [8], averaged descriptors move closer together, improving local consistency. However, unlike similarity propagation, which uses a static transition matrix, the parameterized GCN iteratively learns an entirely new descriptor space. This significantly increases the representational power of the model, while maintaining efficient retrieval and storage enabled by global descriptors. To apply a GCN in this setting, we first normalize the adjacency matrix using a similar procedure to [23]:

    ã_ij = (Σ_{m=1}^{n} a_im)^{-1/2} (Σ_{m=1}^{n} a_jm)^{-1/2} a_ij        (2)

Normalization reduces bias towards "popular" images that appear in many k-NN neighborhoods, and improves optimization stability. Using the normalized matrix, a multi-layer GCN is then defined as:

    h_i^{(l+1)} = σ( w^{(l)} Σ_j ã_ij h_j^{(l)} + b^{(l)} )        (3)

where h_i^{(l)} is the output of the l'th layer for image x_i with h_i^{(0)} = x_i, and σ is the non-linear activation function; w^{(l)} and b^{(l)} are weight and bias parameters to be learned. Note that we deviate from the standard GCN definition of [23] by introducing a bias term, which we empirically found to be beneficial. Successive GCN layers capture increasingly higher order neighbor information by repeatedly combining descriptors for each image with its nearest neighbors. The output of the last layer (L) is then normalized to produce new descriptors:

    x̃_i = h_i^{(L)} / ||h_i^{(L)}||_2        (4)

Once the network is trained, retrieval is done via inner product in the new descriptor space x̃.

Figure 1: Top and bottom rows show two different ROxford queries A and B at various stages of training from start (epoch 0) to epoch 300. Each figure shows the GSS gradient L'(sqi) = −α(sqi − β) against the similarity scores sqi between query xq and database images xi ∈ X. Here, α = 2 and β = 0.25, shown with a dashed vertical line. Scores for images relevant to the query are colored in orange, and the average precision (AP) retrieval score is shown at each training stage. 
Note that this model is trained in a fully unsupervised fashion and does not see the relevance labels.

3.2 Guided Similarity Separation

To improve global consistency we propose a new unsupervised loss based on pairwise alignment between descriptors. Previous work on clustering/manifold learning demonstrated that complex low-dimensional manifolds can be successfully learned by optimizing pairwise distances between examples [15, 26, 44]. Retrieval bears many similarities to clustering – if all relevant images are "closer" in descriptor inner product than non-relevant images, perfect retrieval is achieved. Consequently, we hypothesize that by optimizing pairwise distances between descriptors we can achieve an analogous effect, and learn a global low-dimensional descriptor manifold that improves retrieval.
This forms the basis of our loss function. We assume that the base descriptors are reasonably good, so that at the start of training similarity scores s_ij = x̃_i^T x̃_j are generally higher when images are relevant (we empirically demonstrate this to be true). The main idea behind guided similarity separation is to increase s_ij if it is above a given threshold and lower it otherwise. This has a clustering effect where images with higher similarity scores move closer together, and those with lower scores get pushed further apart. In gradient-based learning one way to achieve this effect is through a loss function that has the following derivative:

    ∂L(s_ij)/∂s_ij = −α(s_ij − β)        (5)

where β ∈ (0, 1) is a similarity threshold and α > 0 controls the slope. Solving the above differential equation leads to our GSS loss:

    L(s_ij) = −(α/2)(s_ij − β)^2        (6)

We then restrict the range of similarity scores to [0, 1] by re-scaling s_ij with max(0, s_ij), and set the gradient to be zero at the boundaries, i.e. ∂L(s)/∂s|_{s=0} = 0 and ∂L(s)/∂s|_{s=1} = 0. While this results in two discontinuities, it is analogous to the discontinuity in the commonly used ReLU function and has little effect on optimization.
The GSS loss has a 3-fold effect. First, similarity scores above β get increased, bringing the corresponding descriptors closer together. Second, scores below β get lowered, pushing descriptors further apart. Third, scores near β have little gradient and remain largely unchanged. Moreover, the magnitude of the gradient increases linearly with the distance from β, and α controls the rate of increase. Jointly this has the effect where confident predictions (in either direction) become more confident, while unconfident predictions remain largely unchanged.
The proposed loss further emphasizes already confident predictions, so initial similarity scores need to be reasonably good. To ensure that at the start of learning, forward passes through the randomly initialized GCN don't adversely affect the scores, we carefully initialize parameters in each GCN layer. Specifically, we set all biases to 0 and initialize weights w^{(l)} to have unit diagonal with off-diagonal elements sampled from N(0, ε). Setting the variance ε sufficiently small produces weight matrices that are very close to identity. This makes a forward pass analogous to multiple iterations of (noisy) QE applied to database descriptors [2]. Database QE has been shown to consistently improve retrieval quality by promoting local consistency [2, 13]. We observe a similar effect here, where even with near-identity weights the GCN improves base descriptors, making the GSS loss more effective.
Figure 1 shows the effect of training with our GCN architecture and GSS loss. 
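The pieces of Sections 3.1 and 3.2 can be sketched together in NumPy: the adjacency normalization of Eq. (2), a GCN forward pass following Eqs. (3)-(4), the GSS gradient of Eq. (5) with zeroed boundaries, and the near-identity weight initialization. This is an illustrative sketch, not the released TensorFlow implementation; ReLU is assumed for σ, and all function names are introduced here:

```python
import numpy as np

def normalize_adjacency(A):
    # Eq. (2): a~_ij = a_ij / (sqrt(sum_m a_im) * sqrt(sum_m a_jm))
    d = A.sum(axis=1)
    inv = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return A * inv[:, None] * inv[None, :]

def gcn_forward(A_norm, X, weights, biases):
    # Eq. (3) with sigma = ReLU (an assumption), then Eq. (4): L2-normalize.
    H = X
    for W, b in zip(weights, biases):
        H = np.maximum((A_norm @ H) @ W + b, 0.0)
    return H / np.maximum(np.linalg.norm(H, axis=1, keepdims=True), 1e-12)

def gss_grad(s, alpha, beta):
    # Eq. (5), with the gradient zeroed at the boundaries s = 0 and s = 1.
    g = -alpha * (s - beta)
    return np.where((s <= 0.0) | (s >= 1.0), 0.0, g)

def near_identity(dim, eps=1e-5, rng=None):
    # Section 3.2 init: unit diagonal, off-diagonal entries ~ N(0, eps).
    if rng is None:
        rng = np.random.default_rng(0)
    W = rng.normal(0.0, np.sqrt(eps), size=(dim, dim))
    np.fill_diagonal(W, 1.0)
    return W
```

Note the sign convention: for a score s > β the gradient is negative, so gradient descent pushes s further up, while s < β is pushed down, which is exactly the separation effect described above.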
Each row shows a query from the ROxford dataset [32] – a standard benchmark in image retrieval. GSS gradients are plotted against the similarity scores sqi between query xq and all database images xi ∈ X; relevant database images are colored in orange. The similarity threshold β is set to 0.25 and is shown with a vertical dashed line. For each query we show progress at various stages of training from epoch 0 to 300, and compute average precision (AP) retrieval accuracy at each stage. From the figure we see that the initial similarity scores for query A in the top row are reasonably well separated, and most relevant examples are above the β threshold. Our model is able to make quick progress here and further separate the relevant/not relevant examples into increasingly tighter clusters as learning progresses.
The bottom row of Figure 1 shows a less well separated query B. Here, our model is able to significantly improve the base descriptors, and still achieve near-perfect separation, gaining almost 20 points in AP. We also clearly see that the GSS gradient increases linearly as scores move further away from β in either direction. This is contrary to many existing objectives such as cross-entropy, where gradients gradually go to 0 once predictions become close to the target. We found this property to be instrumental to achieving the best performance. One possible explanation is that GSS puts a lot more emphasis on fully separating/collapsing the descriptors, resulting in much tighter clusters.

3.3 Spatial Verification

The effectiveness of the GCN architecture depends on the quality of the adjacency matrix. Spatial verification [29, 2] is a commonly used approach to reduce false positives by applying robust verification with local feature matching. It is particularly effective in combination with global descriptors that are primarily designed for fast retrieval at the cost of false positives [2, 32]. 
While effective, computing local features is computationally expensive and can significantly slow down the retrieval pipeline [29]. In this work we propose an efficient approach to incorporate spatial verification into our model.
The main idea is to apply verification in the offline training phase to refine the adjacency matrix Ak. During inference, however, verification is removed and initial retrieval for each query is done only with base descriptors. Interestingly, we found that once the GCN encodes information from the refined Ak into database descriptors, the benefit of spatial verification can be effectively preserved without explicitly applying it to the query during inference. This approach enables us to offload the computational burden to the offline phase, while still maintaining the accuracy benefit during inference.
To refine the adjacency matrix, for each database image xi we first retrieve a set of candidate images V(xi). Spatial verification is then applied to each image in V(xi), and the top-k images with the highest verification scores are kept to get Vk(xi). The refined set is used to compute the adjacency matrix:

    a_ij = x_i^T x_j   if x_j ∈ V_k(x_i) ∨ x_i ∈ V_k(x_j)
           0           otherwise                                  (7)

where Vk(xi) is the verified k-nearest neighbors of xi. The rest of the GCN architecture is applied as before. The size of the candidate set V is a hyper-parameter, and generally better results can be obtained with larger candidate sets, particularly in cases where descriptor retrieval is not accurate. This, however, comes at the expense of additional computational cost.

3.4 Inference

Once the model is trained, given a new query xq we need to retrieve relevant images for xq from X. 
Table 1: mAP retrieval results on INSTRE, ROxford and RParis (Medium and Hard) datasets. The Spatial Verification section contains approaches that use spatial verification as part of the pipeline.

    Method                                 | INSTRE | ROxford       | RParis
                                           |        | Medium | Hard | Medium | Hard
    GeM [33]                               |  69.1  |  64.7  | 38.5 |  77.2  | 56.3
    GeM+aQE [33]                           |  74.6  |  67.2  | 40.8 |  80.7  | 61.8
    GeM+DFS [19]                           |  81.1  |  69.8  | 40.5 |  88.9  | 78.5
    GeM+FSR [17]                           |  78.2  |  70.7  | 42.2 |  88.7  | 78.0
    GeM+DFS-FSR [18]                       |  77.9  |  70.5  | 40.3 |  88.7  | 78.1
    GeM+IME [45]                           |  82.3  |  70.4  | 45.6 |  85.0  | 68.7
    GeM+DSM [36]                           |   -    |  65.3  | 39.2 |  77.4  | 56.2
    GeM+DSM [36]+DFS                       |   -    |  75.0  | 46.2 |  89.3  | 79.3
    GeM+GSS                                |  89.2  |  77.8  | 57.5 |  92.4  | 83.5
    Spatial Verification
    GeM+aQE+DELF [28]-SV                   |  87.4  |  77.2  | 54.9 |  88.9  | 74.8
    GeM+DFS+HessAff-ASMK [40]-SV           |   -    |  79.1  | 52.7 |  91.0  | 81.0
    HessAffNet-HardNet++ [27]+HQE [20]-SV  |   -    |  75.2  | 53.3 |  73.1  | 48.9
    GeM+GSSV                               |  90.5  |  79.1  | 62.2 |  93.4  | 85.3
    GeM+GSSV-SV                            |  92.4  |  80.6  | 64.7 |  93.4  | 85.3

In the offline phase, we make a forward pass through the GCN to compute updated database descriptors x̃_i for each image in X. Then, to get an updated descriptor x̃_q we incorporate the query into the adjacency matrix, and make another forward pass through the GCN. However, multiplication with Ak in each GCN layer introduces a dependency on k instances in X. This dependency grows at the rate of k^L for a model with L layers and can quickly make the forward pass prohibitively expensive. To deal with this problem we use an approximation where only first and second order neighbors of the query are retained. 
Formally, we define an approximate query adjacency matrix A^q_k = (a^q_ij) ∈ R^{(n+1)×(n+1)}:

    a^q_ij = x_q^T x_j   if i = q, x_j ∈ N_k(x_q)
             x_i^T x_j   if x_i ∈ N_k(x_q), x_j ∈ N_k(x_i)
             0           otherwise.                               (8)

Note that A^q_k is highly sparse and repeated multiplication with this matrix only requires at most k(k + 1) descriptors from X. A forward pass with A^q_k can thus be done very efficiently by first caching the required descriptors, and then applying sparse matrix multiplications. Empirically, we find that approximate inference has a negligible effect on accuracy vs running the full forward pass. This is consistent with previous findings for database-side QE [13], where a large k is used to construct the augmented database descriptors, but a much smaller k is used during inference. The intuition is that the new database descriptors x̃ already encode extensive neighbor information, making query augmentation less critical.
This approximate adjacency matrix is used to make a forward pass through the GCN as outlined in Section 3.1 to get the new query descriptor x̃_q. Retrieval is then done via inner product in the new descriptor space. It is important to note here that repeat queries can be handled very efficiently and don't require forward passes through the GCN.

4 Experiments

Datasets  We evaluate our model on five challenging public benchmarks. The popular Oxford [29] and Paris [30] datasets have recently been revised to include more difficult instances, correct annotation mistakes, and introduce new evaluation protocols [32]. The new datasets, referred to as ROxford and RParis, contain 4,993 and 6,322 database images respectively. There are 70 query images in each dataset, and depending on the complexity of the retrieval task evaluation is further partitioned into Easy, Medium and Hard tasks. 
In this work, we focus on the more challenging Medium and Hard tasks. We also evaluate our model on the INSTRE dataset [42], which is an instance-level image retrieval benchmark containing various objects such as buildings, toys and book covers in natural scenes. We follow the data partitioning and evaluation protocol proposed by [19], with 1,250 query and 27,293 database images used for retrieval. Performance of all models is measured by the mean average precision (mAP).

[Figure 2 panels: (a) pairwise score distributions for RParis, ROxford and INSTRE, with selected β values of 0.24, 0.26 and 0.38, each at the 98th percentile; (b) mAP vs relative run-time overhead for GSS, GSSV and GSSV-SV; (c) per-query inference time in ms vs database size n for kNN, aQE, DFS and GSS.]

Figure 2: (a) β selection using the 98th percentile of the pairwise score distribution. (b) Relative inference run-time overhead vs mAP from applying SV to the query on ROxford Hard. The number of verified candidates |V(xq)| (see Section 3.3) is shown next to each point for GSSV-SV. We also show performance for GSS and GSSV, which don't apply SV to the query. (c) Average per-query inference run-time for GSS and several baselines. kNN corresponds to nearest neighbor retrieval with GeM. These experiments were done on the larger version of ROxford with 1M images.

Baselines  We benchmark our GSS model against leading baselines including aQE [33], IME [45], and state-of-the-art similarity propagation methods DFS [19], FSR [17], DSM [36] as well as their combinations [18]. For spatial verification (SV) we compare against recent results including a global-local hybrid model with DFS [40], DELF-based SV [28], and a leading neural network model based on HessAffNet [27] with a local feature bag-of-words pipeline [37]. 
To make the comparison fair, all models use the same GeM descriptors [33]. We use the code and weights released by the original authors¹, and don't do any re-training or fine-tuning. Following the authors' pipeline, multiple-scale aggregation and discriminative whitening are applied to obtain a 2,048-dimensional descriptor for each image. For spatial verification, we follow the standard pipeline of [7] to filter image pairs based on estimated inlier counts over aligned points of interest computed by RANSAC [11]. The deep local feature model (DELF) [28] is used for ROxford and RParis since it is trained specifically for landmark retrieval and used extensively in recent baselines. SIFT descriptors [25] are used for the more general INSTRE dataset.
GSS  We implemented our approach using the TensorFlow library [1]. After parameter sweeps, a two-layer 2048 → 2048 → 2048 GCN architecture produced the best performance and is used in all experiments. To set the important β parameter we note that only image pairs with sufficiently high similarity scores should be pushed closer together in the GSS loss. This leads to a general procedure where we first compute the distribution of the pairwise scores s_ij, then set β in the upper percentile of this distribution. We consistently found that using the 98th percentile worked well across all datasets. Figure 2a illustrates this procedure and shows pairwise score distributions for each of the three datasets together with selected β values. Other hyper-parameters are set as follows: α = 1, k = 5 for ROxford; α = 1, k = 5 for RParis; α = 1, k = 10 for INSTRE. All models are optimized using the ADAM optimizer [22] with default settings and the weight initialization outlined in Section 3.2 with ε = 10^-5. 
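The percentile-based β selection described above can be sketched as follows. This is a NumPy illustration, not the authors' code; sampling random pairs instead of materializing all n² scores is our shortcut, and `select_beta` is a name introduced here:

```python
import numpy as np

def select_beta(X_new, percentile=98.0, num_pairs=100_000, seed=0):
    """Estimate the pairwise score distribution s_ij = x~_i . x~_j on
    sampled pairs, then place beta at the given upper percentile,
    mirroring the procedure of Figure 2a."""
    rng = np.random.default_rng(seed)
    n = X_new.shape[0]
    i = rng.integers(0, n, size=num_pairs)
    j = rng.integers(0, n, size=num_pairs)
    keep = i != j                           # drop trivial self-pairs (s_ii = 1)
    s = np.einsum('nd,nd->n', X_new[i[keep]], X_new[j[keep]])
    return float(np.percentile(s, percentile))
```

With L2-normalized descriptors the scores lie in [-1, 1], so a high percentile picks a threshold that only the most similar pairs exceed, which is exactly the role β plays in the GSS loss.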
For spatial verification we use GSSV to denote models that were trained with spatial verification (see Section 3.3), and GSSV-SV to indicate that spatial verification is also applied at query time. The size of the candidate set |V| is fixed to 250 for all datasets unless otherwise stated. All experiments are conducted on a 20-core Intel(R) Xeon(R) CPU E5-2630 v4 @2.20GHz machine with an NVIDIA V100 GPU. Model training takes around 30 seconds for ROxford and RParis, and 10 minutes for INSTRE.

¹https://github.com/filipradenovic/cnnimageretrieval-pytorch

4.1 Results

Table 1 shows mAP retrieval results on all five datasets. From the table we see that GSS outperforms all baselines on each dataset. Notably, the improvement is larger on the more challenging Hard tasks for ROxford and RParis, where our model achieves a relative gain of up to 11 mAP points or 24% over the best baseline. This demonstrates that the proposed pairwise loss combined with GCN neighbor encoding can learn a much better descriptor space, improving over the input GeM descriptors by up to 50%. A similar pattern can be observed from the spatial verification section of the table. Here, GSSV also outperforms all baselines that use spatial verification.

Figure 3: Qualitative analysis on ROxford ((a) GeM, (b) GeM+GSS). GeM and GeM+GSS descriptors are plotted using PCA followed by t-SNE [26] projection to two dimensions. We show three example queries with corresponding relevant database images colored with red, green and blue. For each query, we display the query image (shown with AP score) and a hard relevant database image.

Furthermore, GSSV consistently performs better than GSS, gaining over 12% on ROxford Hard. These results indicate that our model can successfully encode information from the spatially verified adjacency matrix into database descriptors, and improve performance without applying SV at query time. 
Further improvement can be obtained by also applying SV to the query, as shown by GSSV-SV, but at additional cost. Figure 2b shows the inference run-time overhead from applying SV to the query, as determined by the number of verified candidates. We see that additional gains in accuracy can be achieved by verifying up to 200 candidates retrieved by GeM. This, however, comes at a significant overhead of over 70% increase in run-time. GSSV thus offers a principled way to leverage spatial verification without incurring additional inference cost.
Average query inference run-time in ms is shown in Figure 2c. To test the performance at scale we use the large version of the ROxford dataset with 1M database images. We vary database size from 100K to 1M, and record average query time over 100 restarts. During inference GSS requires kNN descriptor retrieval to compute the (approximate) adjacency matrix A^q_k, then a forward pass through the GCN to compute the updated query descriptor x̃_q, and finally another kNN retrieval with x̃_q. The approximate inference procedure outlined in Section 3.4 is highly efficient, making the cost of the GCN forward pass independent of n. This is further evidenced by Figure 2c, where GSS adds only a small constant overhead of around 140ms on top of aQE, which also does two stages of kNN retrieval. In contrast, the leading similarity propagation method DFS adds an overhead proportional to n [19] which, as seen from Figure 2c, becomes significant as n grows. While approximate and more efficient DFS inference has been proposed, it generally suffers from a substantial decrease in accuracy [4].
Qualitative results on ROxford Hard are shown in Figure 3. Here, we use PCA and t-SNE [26] to compress the descriptors from GeM and our model to two dimensions and then plot them. We pick three queries and show their relevant database images in red, green and blue. 
For each query we also display the query image (shown with its AP under each model) and one "hard" relevant database image. The hard images are visually dissimilar to the query, and GeM is unable to retrieve them correctly. From the figure we see that our model learns much tighter clusters and significantly improves retrieval accuracy for each query. Specifically, it is able to place each of the shown hard database images into a tight cluster around the query even though there is little visual similarity between them.

5 Conclusion

We proposed a new unsupervised model for image retrieval based on graph neural network neighbor encoding and a pairwise similarity separation loss. Our model effectively learns a new descriptor space that significantly improves retrieval accuracy while maintaining efficient inner product inference. Results on public benchmarks show highly competitive performance with up to 24% relative improvement over leading baselines. Future work involves further investigation into manifold learning as well as a supervised version of our approach.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Relja Arandjelovic and Andrew Zisserman. Three things everyone should know to improve object retrieval. In CVPR, 2012.

[3] Song Bai, Xiang Bai, Qi Tian, and Longin Jan Latecki. Regularized diffusion process for visual retrieval. In AAAI, 2017.

[4] Cheng Chang, Guangwei Yu, Chundi Liu, and Maksims Volkovs. Explore-exploit graph traversal for image retrieval. In CVPR, 2019.

[5] Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. Cluster kernels for semi-supervised learning. In NIPS, 2003.

[6] Olivier Chapelle and Alexander Zien.
Semi-supervised classification by low density separation. In AISTATS, 2005.

[7] Ondrej Chum et al. Large-scale discovery of spatially related images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(2):371–377, 2010.

[8] Ondrej Chum, James Philbin, Josef Sivic, Michael Isard, and Andrew Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, 2007.

[9] Michael Donoser and Horst Bischof. Diffusion processes for retrieval revisited. In CVPR, 2013.

[10] Bernd Fischer, Volker Roth, and Joachim M Buhmann. Clustering with the connectivity kernel. In NIPS, 2004.

[11] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.

[12] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In ECCV, 2016.

[13] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision, 124(2):237–254, 2017.

[14] Jared Heinly, Johannes L. Schönberger, Enrique Dunn, and Jan-Michael Frahm. Reconstructing the world* in six days *(as captured by the Yahoo 100 Million Image Dataset). In CVPR, 2015.

[15] Geoffrey E Hinton and Sam T Roweis. Stochastic neighbor embedding. In NIPS, 2003.

[16] Junshi Huang, Rogerio S Feris, Qiang Chen, and Shuicheng Yan. Cross-domain image retrieval with a dual attribute-aware ranking network. In ICCV, 2015.

[17] Ahmet Iscen, Yannis Avrithis, Giorgos Tolias, Teddy Furon, and Ondřej Chum. Fast spectral ranking for similarity search. In CVPR, 2018.

[18] Ahmet Iscen, Yannis Avrithis, Giorgos Tolias, Teddy Furon, and Ondrej Chum.
Hybrid diffusion: Spectral-temporal graph filtering for manifold ranking. In ACCV, 2018.

[19] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Teddy Furon, and Ondřej Chum. Efficient diffusion on region manifolds: Recovering small objects with compact CNN representations. In CVPR, 2017.

[20] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Improving bag-of-features for large scale image search. International Journal of Computer Vision, 87(3):316–336, 2010.

[21] Jayashree Kalpathy-Cramer, Alba García Seco de Herrera, Dina Demner-Fushman, Sameer Antani, Steven Bedrick, and Henning Müller. Evaluating performance of biomedical image retrieval systems—an overview of the medical image retrieval task at ImageCLEF 2004–2013. Computerized Medical Imaging and Graphics, 2015.

[22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[23] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[24] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, 2016.

[25] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[26] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[27] Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Repeatability is not enough: Learning affine regions via discriminability. In ECCV, 2018.

[28] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In ICCV, 2017.

[29] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman.
Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.

[30] James Philbin, Ondřej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.

[31] Danfeng Qin, Stephan Gammeter, Lukas Bossard, Till Quack, and Luc Van Gool. Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors. In CVPR, 2011.

[32] Filip Radenovic, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In CVPR, 2018.

[33] Filip Radenovic, Giorgos Tolias, and Ondrej Chum. Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[34] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[35] Zhiyuan Shi, Timothy M Hospedales, and Tao Xiang. Transferring a semantic representation for person re-identification and search. In CVPR, 2015.

[36] O. Simeoni, Y. Avrithis, and O. Chum. Local features and visual words emerge in activations. In CVPR, 2019.

[37] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.

[38] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In WWW, 2015.

[39] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[40] Giorgos Tolias, Yannis Avrithis, and Hervé Jégou. Image search with selective match kernels: Aggregation across single and multiple images.
International Journal of Computer Vision, 2016.

[41] Daniel A Vaquero, Rogerio S Feris, Duan Tran, Lisa Brown, Arun Hampapur, and Matthew Turk. Attribute-based people search in surveillance environments. In WACV, 2009.

[42] Shuang Wang and Shuqiang Jiang. INSTRE: A new benchmark for instance-level object retrieval and recognition. ACM Transactions on Multimedia Computing, Communications, and Applications, 2015.

[43] Tobias Weyand and Bastian Leibe. Discovering favorite views of popular places with iconoid shift. In ICCV, 2011.

[44] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.

[45] Jian Xu, Chunheng Wang, Chengzuo Qi, Cunzhao Shi, and Baihua Xiao. Iterative manifold embedding layer learned by incomplete data for large-scale image retrieval. IEEE Transactions on Multimedia, 2018.

[46] Xingwei Yang, Suzan Köknar-Tezel, and Longin Jan Latecki. Locally constrained diffusion process on locally densified distance spaces with applications to shape retrieval. In CVPR Workshops, 2009.

[47] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In NIPS, 2003.

[48] Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf. Ranking on data manifolds.
In NIPS, 2003.