{"title": "Working hard to know your neighbor's margins: Local descriptor learning loss", "book": "Advances in Neural Information Processing Systems", "page_first": 4826, "page_last": 4837, "abstract": "We introduce a loss for metric learning, which is inspired by the Lowe's matching criterion for SIFT. We show that the proposed loss, that maximizes the distance between the closest positive and closest negative example in the batch, is better than complex regularization methods; it works well for both shallow and deep convolution network architectures. Applying the novel loss to the L2Net CNN architecture results in a compact descriptor named HardNet. It has the same dimensionality as SIFT (128) and shows state-of-art performance in wide baseline stereo, patch verification and instance retrieval benchmarks.", "full_text": "Working hard to know your neighbor\u2019s margins:\n\nLocal descriptor learning loss\n\nAnastasiya Mishchuk1, Dmytro Mishkin2, Filip Radenovi\u00b4c2, Ji\u02c7ri Matas2\n\n1 Szkocka Research Group, Ukraine\nanastasiya.mishchuk@gmail.com\n\n2 Visual Recognition Group, CTU in Prague\n\n{mishkdmy, filip.radenovic, matas}@cmp.felk.cvut.cz\n\nAbstract\n\nWe introduce a loss for metric learning, which is inspired by the Lowe\u2019s matching\ncriterion for SIFT. We show that the proposed loss, that maximizes the distance\nbetween the closest positive and closest negative example in the batch, is better\nthan complex regularization methods; it works well for both shallow and deep\nconvolution network architectures. 
Applying the novel loss to the L2Net CNN\narchitecture results in a compact descriptor named HardNet.\nIt has the same\ndimensionality as SIFT (128) and shows state-of-art performance in wide baseline\nstereo, patch veri\ufb01cation and instance retrieval benchmarks.\n\n1 Introduction\n\nMany computer vision tasks rely on \ufb01nding local correspondences, e.g. image retrieval [1, 2],\npanorama stitching [3], wide baseline stereo [4], 3D-reconstruction [5, 6]. Despite the growing\nnumber of attempts to replace complex classical pipelines with end-to-end learned models, e.g., for\nimage matching [7], camera localization [8], the classical detectors and descriptors of local patches\nare still in use, due to their robustness, ef\ufb01ciency and their tight integration. Moreover, reformulating\na task that is solved by a complex pipeline as a differentiable end-to-end process is highly\nchallenging.\nAs a \ufb01rst step towards end-to-end learning, hand-crafted descriptors like SIFT [9, 10] or detectors [9, 11, 12] have been replaced with learned ones, e.g., LIFT [13], MatchNet [14] and DeepCompare [15]. However, these descriptors have not gained popularity in practical applications despite\ngood performance in the patch veri\ufb01cation task. Recent studies have con\ufb01rmed that SIFT and its\nvariants (RootSIFT-PCA [16], DSP-SIFT [17]) signi\ufb01cantly outperform learned descriptors in image\nmatching and small-scale retrieval [18], as well as in 3D-reconstruction [19]. One of the conclusions\nmade in [19] is that current local patches datasets are not large and diverse enough to allow the\nlearning of a high-quality widely-applicable descriptor.\nIn this paper, we focus on descriptor learning and, using a novel method, train a convolutional neural\nnetwork (CNN), called HardNet. 
We additionally show that our learned descriptor signi\ufb01cantly\noutperforms both hand-crafted and learned descriptors in real-world tasks like image retrieval and two\nview matching under extreme conditions. For the training, we use the standard patch correspondence\ndata thus showing that the available datasets are suf\ufb01cient for going beyond the state of the art.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n2 Related work\n\nClassical SIFT local feature matching consists of two parts: \ufb01nding nearest neighbors and comparing\nthe \ufb01rst to second nearest neighbor distance ratio against a threshold to \ufb01lter false positive matches. To the\nbest of our knowledge, no work in local descriptor learning fully mimics such a strategy as the learning\nobjective.\nSimonyan and Zisserman [20] proposed a simple \ufb01lter plus pooling scheme learned with convex\noptimization to replace the hand-crafted \ufb01lters and poolings in SIFT. Han et al. [14] proposed a two-stage siamese architecture \u2013 for embedding and for two-patch similarity. The latter network improved\nmatching performance, but prevented the use of fast approximate nearest neighbor algorithms like\nkd-tree [21]. Zagoruyko and Komodakis [15] independently presented a similar siamese-based\nmethod which explored different convolutional architectures. Simo-Serra et al. [22] harnessed hard-negative mining with a relatively shallow architecture that exploited pair-based similarity.\nThe three following papers have most closely followed the classical SIFT matching scheme. Balntas et al. [23] used a triplet margin loss and a triplet distance loss, with random sampling of the\npatch triplets. They show the superiority of the triplet-based architecture over a pair-based one. However,\nunlike SIFT matching or our work, they sampled negatives randomly. 
Choy et al [7] calculate the\ndistance matrix for mining positive as well as negative examples, followed by pairwise contrastive\nloss.\nTian et al [24] use n matching pairs in batch for generating n2 \u2212 n negative samples and require that\nthe distance to the ground truth matchings is minimum in each row and column. No other constraint\non the distance or distance ratio is enforced. Instead, they propose a penalty for the correlation\nof the descriptor dimensions and adopt deep supervision [25] by using intermediate feature maps\nfor matching. Given the state-of-art performance, we have adopted the L2Net [24] architecture as\nbase for our descriptor. We show that it is possible to learn even more powerful descriptor with\nsigni\ufb01cantly simpler learning objective without need of the two auxiliary loss terms.\n\nFigure 1: Proposed sampling procedure. First, patches are described by the current network, then a\ndistance matrix is calculated. The closest non-matching descriptor \u2013 shown in red \u2013 is selected for\neach ai and pi patch from positive pair (green) respectively. Finally, among two negative candidates\nthe hardest one is chosen. All operations are done in a single forward pass.\n\n3 The proposed descriptor\n\n3.1 Sampling and loss\n\nOur learning objective mimics SIFT matching criterion. The process is shown in Figure 1. First, a\nbatch X = (Ai, Pi)i=1..n of matching local patches is generated, where A stands for the anchor and\nP for the positive. The patches Ai and Pi correspond to the same point on 3D surface. 
We make sure\nthat in batch X, there is exactly one pair originating from a given 3D point.\nSecond, the 2n patches in X are passed through the network shown in Figure 2.\n\n[Figure 1 panel labels: batch of input patches; descriptors; distance matrix D = pdist(a, p); \ufb01nal triplet (one of n in batch).]\n\nThe L2 pairwise distance matrix D = cdist(a, p) of size n \u00d7 n is calculated, with d(ai, pj) = sqrt(2 \u2212 2 ai \u00b7 pj), i = 1..n, j = 1..n, where ai and pj denote the descriptors of patches Ai and Pj respectively.\nNext, for each matching pair ai and pi the closest non-matching descriptors i.e. 
the 2nd nearest\nneighbor, are found respectively:\nai \u2013 anchor descriptor, pi \u2013 positive descriptor,\npjmin \u2013 closest non-matching descriptor to ai, where jmin = arg min_{j=1..n, j \u2260 i} d(ai, pj),\nakmin \u2013 closest non-matching descriptor to pi, where kmin = arg min_{k=1..n, k \u2260 i} d(ak, pi).\nThen from each quadruplet of descriptors (ai, pi, pjmin, akmin), a triplet is formed: (ai, pi, pjmin) if\nd(ai, pjmin) < d(akmin, pi), and (pi, ai, akmin) otherwise.\nOur goal is to make the distance to the matching descriptor smaller, by a margin, than the distance to the closest non-matching\ndescriptor. These n triplet distances are fed into the triplet margin loss:\n\nL = (1/n) \u2211_{i=1..n} max(0, 1 + d(ai, pi) \u2212 min(d(ai, pjmin), d(akmin, pi)))    (1)\n\nwhere min(d(ai, pjmin), d(akmin, pi)) is pre-computed during the triplet construction.\nThe distance matrix calculation is done on GPU and the only overhead compared to the random triplet\nsampling is the distance matrix calculation and calculating the minimum over rows and columns.\nMoreover, compared to usual learning with triplets, our scheme needs only two-stream CNN, not\nthree, which results in 30% less memory consumption and computations.\nUnlike in [24], neither deep supervision for intermediate layers is used, nor a constraint on the\ncorrelation of descriptor dimensions. We experienced no signi\ufb01cant over-\ufb01tting.\n\n3.2 Model architecture\n\nFigure 2: The architecture of our network, adopted from L2Net [24]. Each convolutional layer is\nfollowed by batch normalization and ReLU, except the last one. Dropout regularization is used before\nthe last convolution layer.\n\nThe HardNet architecture, Figure 2, is identical to L2Net [24]. Padding with zeros is applied to\nall convolutional layers, to preserve the spatial size, except to the \ufb01nal one. There are no pooling\nlayers, since we found that they decrease performance of the descriptor. 
That is why the spatial size is\nreduced by strided convolutions. Batch normalization [26] layer followed by ReLU [27] non-linearity\nis added after each layer, except the last one. Dropout [28] regularization with 0.1 dropout rate is\napplied before the last convolution layer. The output of the network is L2 normalized to produce\na 128-D descriptor with unit length. Grayscale input patches with size 32 \u00d7 32 pixels are normalized\nby subtracting the per-patch mean and dividing by the per-patch standard deviation.\nOptimization is done by stochastic gradient descent with learning rate of 0.1, momentum of 0.9 and\nweight decay of 0.0001. Learning rate was linearly decayed to zero within 10 epochs for most of\nthe experiments in this paper. Training is done with PyTorch library [29].\n\n3.3 Model training\n\nThe model is trained on UBC Phototour [3], also known as the Brown dataset. It consists of three subsets: Liberty, Notre Dame\nand Yosemite with about 400k normalized 64x64 patches in each. Keypoints were detected by DoG\ndetector and veri\ufb01ed by 3D model.\n\n[Figure 2 diagram: 32x32 input; 3x3 convolutions (pad 1, some with stride 2), each followed by BN + ReLU; \ufb01nal 8x8 convolution followed by BN + L2Norm.]\n\nThe test set consists of 100k matching and non-matching pairs for each sequence. The common setup is to\ntrain the descriptor on one subset and test on the two others. The metric is the false positive rate (FPR) at\n0.95 true positive rate (recall). Michel Keller found that the evaluation procedures of [14] and [23]\nreport FDR (false discovery rate) instead of FPR (false positive rate). To avoid confusion,\nwe provide both FPR and FDR and re-estimate the scores for direct\ncomparison. Results are shown in Table 1. Proposed descriptor outperforms competitors, with training\naugmentation, or without it. 
We have not included results on multiscale patch sampling or the so-called\n\u201ccenter-surrounding\u201d architecture for two reasons. First, architectural choices are beyond the scope\nof the current paper. Second, it was already shown in [24, 30] that \u201ccenter-surrounding\u201d consistently\nimproves results on the Brown dataset for different descriptors, while it hurts matching performance on\nother, more realistic setups, e.g., on the Oxford-Af\ufb01ne [31] dataset.\nIn the rest of the paper we use the descriptor trained on the Liberty sequence, which is a common practice, to\nallow a fair comparison. TFeat [23] and L2Net [24] use the same dataset for training.\n\nTable 1: Patch correspondence veri\ufb01cation performance on the Brown dataset. We report false\npositive rate at true positive rate equal to 95% (FPR95). Some papers report false discovery rate\n(FDR) instead of FPR due to bug in the source code. For consistency we provide FPR, either\nobtained from the original article or re-estimated from the given FDR (marked with *). The best\nresults are in bold.\n\nTraining: Notredame, Yosemite | Liberty, Yosemite | Liberty, Notredame | Mean\nTest: Liberty | Notredame | Yosemite | FDR, FPR\nSIFT [9]: 29.84 | 22.53 | 27.29 | \u2013, 26.55\nMatchNet* [14]: 7.04, 11.47 | 3.82, 5.65 | 11.6, 8.7 | 7.74, 8.05\nTFeat-M* [23]: 7.39, 10.31 | 3.06, 3.8 | 8.06, 7.24 | 6.47, 6.64\nL2Net [24]: 3.64, 5.29 | 1.15, 1.62 | 4.43, 3.30 | \u2013, 3.24\nHardNet (ours): 3.06, 4.27 | 0.96, 1.4 | 3.04, 2.53 | 3.00, 2.54\nAugmentation: \ufb02ip, 90\u00b0 random rotation\nGLoss+ [30]: 3.69, 4.91 | 0.77, 1.14 | 3.09, 2.67 | \u2013, 2.71\nDC2ch2st+ [15]: 4.85, 7.2 | 1.9, 2.11 | 5.00, 4.10 | \u2013, 4.19\nL2Net+ [24]: 2.36, 4.7 | 0.72, 1.29 | 2.57, 1.71 | \u2013, 2.23\nHardNet+ (ours): 2.28, 3.25 | 0.57, 0.96 | 2.13, 2.22 | 1.97, 1.9\n\n3.4 Exploring the batch size in\ufb02uence\n\nWe study the in\ufb02uence of mini-batch size on the \ufb01nal descriptor performance. 
It is known that\nsmall mini-batches are bene\ufb01cial to faster convergence and better generalization [32], while large\nbatches allow better GPU utilization. Our loss function design should bene\ufb01t from seeing more\nhard negative patches to learn to distinguish them from true positive patches. We report the results\nfor batch sizes 16, 64, 128, 512, 1024, 2048. We trained the model described in Section 3.2 using\nLiberty sequence of Brown dataset. Results are shown in Figure 3. As expected, model performance\nimproves with increasing the mini-batch size, as more examples are seen to get harder negatives.\nAlthough, increasing batch size to more than 512 does not bring signi\ufb01cant bene\ufb01t.\n\n4 Empirical evaluation\n\nRecently, Balntas et al. [23] showed that good performance on patch veri\ufb01cation task on Brown dataset\ndoes not always mean good performance in the nearest neighbor setup and vice versa. Therefore, we\nhave extensively evaluated learned descriptors on real-world tasks like two view matching and image\nretrieval.\nWe have selected RootSIFT [10], TFeat-M* [23], and L2Net [24] for direct comparison with our\ndescriptor, as they show the best results on a variety of datasets.\n\n4\n\n\fFigure 3: In\ufb02uence of the batch size on de-\nscriptor performance. The metric is false\npositive rate (FPR) at true positive rate\nequal to 95%, averaged over Notredame\nand Yosemite validation sequences.\n\n4.1 Patch descriptor evaluation\n\nFigure 4: Patch retrieval descriptor performance\n(mAP) vs. the number of distractors, evaluated on\nHPatches dataset.\n\nHPatches [18] is a recent dataset for local patch descriptor evaluation. It consists of 116 sequences of\n6 images. The dataset is split into two parts: viewpoint \u2013 59 sequences with signi\ufb01cant viewpoint\nchange and illumination \u2013 57 sequences with signi\ufb01cant illumination change, both natural and\narti\ufb01cial. 
Keypoints are detected by DoG, Hessian and Harris detectors in the reference image and\nreprojected to the rest of the images in each sequence with 3 levels of geometric noise: Easy, Hard,\nand Tough variants. The HPatches benchmark de\ufb01nes three tasks: patch correspondence veri\ufb01cation,\nimage matching and small-scale patch retrieval. We refer the reader to the HPatches paper [18] for a\ndetailed protocol for each task.\nResults are shown in Figure 5. L2Net and HardNet have shown similar performance on the patch\nveri\ufb01cation task with a small advantage of HardNet. On the matching task, even the non-augmented\nversion of HardNet outperforms the augmented version of L2Net+ by a noticeable margin. The\ndifference is larger in the TOUGH and HARD setups. Illumination sequences are more challenging\nthan the geometric ones, for all the descriptors. We have trained network with TFeat architecture, but\nwith proposed loss function \u2013 it is denoted as HardTFeat. It outperforms original version in matching\nand retrieval, while being on par with it on patch veri\ufb01cation task.\nIn patch retrieval, relative performance of the descriptors is similar to the matching problem: HardNet\nbeats L2Net+. Both descriptors signi\ufb01cantly outperform the previous state-of-the-art, showing the\nsuperiority of the selected deep CNN architecture over the shallow TFeat model.\n\nFigure 5: Left to right: Veri\ufb01cation, matching and retrieval results on HPatches dataset. Marker\ncolor indicates the level of geometrical noise in: EASY, HARD and TOUGH. Marker type indicates\nthe experimental setup. DIFFSEQ and SAMESEQ shows the source of negative examples for the\nveri\ufb01cation task. VIEWPT and ILLUM indicate the type of sequences for matching. 
None of the\ndescriptors is trained on HPatches.\n\n[Figure 3 plot: FPR vs. epoch for batch sizes 16, 64, 128, 512, 1024, 2048.]\n[Figure 4 plot: mAP vs. number of distractors for HardNet, HardNet+, L2Net, L2Net+, RootSIFT, SIFT, TFeat-M*.]\n[Figure 5 bar values, mAP in %.\nPatch veri\ufb01cation: HardNet+ 87.12, L2Net+ 86.69, HardNet 86.19, L2Net 84.46, TFeat-M* 81.90, HardTFeat 81.32, rSIFT 58.53.\nImage matching: HardNet+ 50.38, HardNet 48.24, L2Net+ 45.04, L2Net 40.82, HardTFeat 38.07, TFeat-M* 32.64, rSIFT 27.22.\nPatch retrieval: HardNet+ 66.82, HardNet 65.26, L2Net+ 63.37, L2Net 59.64, HardTFeat 55.12, TFeat-M* 52.03, rSIFT 42.49.]\n\nTable 2: Comparison of the loss functions and sampling strategies on the HPatches matching task,\nthe mean mAP is reported. CPR stands for the regularization penalty of the correlation between\ndescriptor channels, as proposed in [24]. Hard negative mining is performed once per epoch. Best\nresults are in bold. HardNet uses the hardest-in-batch sampling and the triplet margin loss.\n\nSampling / Loss | Softmin | Triplet margin, m = 1 | Contrastive, m = 1 | Contrastive, m = 2\nRandom | over\ufb01t\nHard negative mining | over\ufb01t\nRandom + CPR | 0.349 | 0.286 | 0.007 | 0.083\nHard negative mining + CPR | 0.391 | 0.346 | 0.055 | 0.279\nHardest in batch (ours) | 0.474 | 0.482 | 0.444 | 0.482\n\nWe also ran another patch retrieval experiment, varying the number of distractors (non-matching\npatches) in the retrieval dataset. The results are shown in Figure 4. TFeat descriptor performance,\nwhich is comparable to L2Net in the presence of a low number of distractors, degrades quickly as\nthe size of the database grows. At about 10,000 distractors its performance drops below SIFT. This\nexperiment explains why TFeat performs relatively poorly on the Oxford5k [33] and Paris6k [34]\nbenchmarks, which contain around 12M and 15M distractors, respectively, see Section 4.4 for more\ndetails. 
Performance of HardNet decreases slightly for both the augmented and plain versions and the\ndifference in mAP to other descriptors grows with the increasing complexity of the task.\n\n4.2 Ablation study\n\nFor better understanding of the signi\ufb01cance of the sampling strategy and the loss function, we conduct\nexperiments summarized in Table 2. We train our HardNet model (architecture is exactly the same as\nL2Net model), change one parameter at a time and evaluate its impact.\nThe following sampling strategies are compared: random, the proposed \u201chardest-in-batch\u201d, and the\n\u201cclassical\u201d hard negative mining, i.e. selecting in each epoch the closest negatives from the full\ntraining set. The following loss functions are tested: softmin on distances, triplet margin with margin\nm = 1, contrastive with margins m = 1, m = 2. The last is the maximum possible distance for\nunit-normed descriptors. Mean mAP on HPatches Matching task is shown in Table 2.\nThe proposed \u201chardest-in-batch\u201d clearly outperforms all the other sampling strategies for all loss\nfunctions and it is the main reason for HardNet\u2019s good performance. The random sampling and\n\u201cclassical\u201d hard negative mining led to huge over\ufb01t, when training loss was high, but test performance\nwas low and varied several times from run to run. This behavior was observed with all loss functions.\nSimilar results for random sampling were reported in [24]. The poor results of hard negative mining\n(\u201chardest-in-the-training-set\u201d) are surprising. We hypothesize that this is due to dataset label noise: the\nmined \u201chard negatives\u201d are actually positives. Visual inspection con\ufb01rms this. We were able to get\n\n[Figure 6 legend: \u2202L/\u2202n = \u2202L/\u2202p = 1; \u2202L/\u2202n = 0, \u2202L/\u2202p = 1; \u2202L/\u2202n = \u2202L/\u2202p = 0.]\n\nFigure 6: Contribution to the gradient magnitude from the positive and negative examples. 
Horizontal\nand vertical axes show the distance from the anchor (a) to the negative (n) and positive (p) examples\nrespectively. Softmin loss gradient quickly decreases when d(a, n) > d(a, p), unlike the triplet\nmargin loss. For the contrastive loss, negative examples with d(a, n) > m contribute zero to the\ngradient. The triplet margin loss and the contrastive loss with a big margin behave very similarly.\n\n[Figure 6 panels: Triplet margin, m = 1; Softmin; Contrastive, m = 1; Contrastive, m = 2. Axes: d(a, n) and d(a, p), each from 0 to 2.]\n\nreasonable results with random and hard negative mining sampling only with additional correlation\npenalty on descriptor channels (CPR), as proposed in [24].\nRegarding the loss functions, softmin gave the most stable results across all sampling strategies, but\nit is marginally outperformed by contrastive and triplet margin loss for our strategy. One possible\nexplanation is that the triplet margin loss and contrastive loss with a large margin have constant\nnon-zero derivative w.r.t. both positive and negative samples, see Figure 6. In the case of contrastive\nloss with a small margin, many negative examples are not used in the optimization (zero derivatives),\nwhile the softmin derivatives become small, once the distance to the positive example is smaller than\nto the negative one.\n\n4.3 Wide baseline stereo\n\nTo validate descriptor generalization and their ability to operate in extreme conditions, we tested them\non the W1BS dataset [4]. 
It consists of 40 image pairs with one particular extreme change between\nthe images:\nAppearance (A): difference in appearance due to seasonal or weather change, occlusions, etc;\nGeometry (G): difference in scale, camera and object position;\nIllumination (L): signi\ufb01cant difference in intensity, wavelength of light source;\nSensor (S): difference in sensor data (IR, MRI).\n\nMoreover, local features in W1BS dataset are detected with MSER [35], Hessian-Af\ufb01ne [11] (in\nimplementation from [36]) and FOCI [37] detectors. They \ufb01re on different local structures than DoG.\nNote that DoG patches were used for the training of the descriptors. Another signi\ufb01cant difference to\nthe HPatches setup is the absence of the geometrical noise: all patches are perfectly reprojected to\nthe target image in pair. The testing protocol is the same as for the HPatches matching task.\nResults are shown in Figure 7. HardNet and L2Net perform comparably, former is performing better\non images with geometrical and appearance changes, while latter works a bit better in map2photo and\nvisible-vs-infrared pairs. Both outperform SIFT, but only by a small margin. However, considering\nthe signi\ufb01cant amount of the domain shift, descriptors perform very well, while TFeat loses badly\nto SIFT. HardTFeat signi\ufb01cantly outperforms the original TFeat descriptor on the W1BS dataset,\nshowing the superiority of the proposed loss.\nGood performance on patch matching and veri\ufb01cation task does not automatically lead to the better\nperformance in practice, e.g. to more images registered. Therefore we also compared descriptor on\nwide baseline stereo setup with two metric: number of successfully matched image pairs and average\nnumber of inliers per matched pair, following the matcher comparison protocol from [4]. 
The only\nchange to the original protocol is that the \ufb01rst fast matching steps with ORB detector and descriptor were\nremoved, as we are comparing \u201cSIFT-replacement\u201d descriptors.\nThe results are shown in Table 3. Results on Edge Foci (EF) [37], Extreme view [38] and Oxford\nAf\ufb01ne [11] datasets are saturated and all the descriptors are good enough for matching all image pairs.\nHardNet has a slight advantage in the number of inliers per image. The rest of the datasets, SymB [39],\nGDB [40], WxBS [4] and LTLL [41], have one thing in common: image pairs are either from a different\ndomain than photo (e.g. drawing to drawing) or cross-domain (e.g., drawing to photo). Here HardNet\noutperforms learned descriptors and is on par with hand-crafted RootSIFT. We would like to note\nthat HardNet was not trained to match in a different domain, nor in a cross-domain scenario, therefore such\nresults show its generalization ability.\n\nFigure 7: Descriptor evaluation on the W1BS patch dataset, mean area under precision-recall curve is\nreported. Letters denote nuisance factor, A: appearance; G: viewpoint/geometry; L: illumination; S:\nsensor; map2photo: satellite photo vs. map.\n\n[Figure 7 plot: mAUC (0 to 0.14) per nuisance factor (A, G, L, S, map2photo, Average) for TFeat, HardTFeat, SIFT, RootSIFT, L2Net, L2Net+, HardNet, HardNet+.]\n\nTable 3: Comparison of the descriptors on wide baseline stereo within MODS matcher [4] on wide\nbaseline stereo datasets. 
Number of matched image pairs and average number of inliers are reported.\nNumbers in the header correspond to the number of image pairs in each dataset.\n\nDataset (number of image pairs): EF (33) | EVD (15) | OxAff (40) | SymB (46) | GDB (22) | WxBS (37) | LTLL (172)\nDescriptor: matched pairs, inl. for each dataset\nRootSIFT: 33, 32 | 15, 34 | 40, 169 | 45, 43 | 21, 52 | 11, 93 | 123, 27\nTFeat-M*: 32, 30 | 15, 37 | 40, 265 | 40, 45 | 16, 72 | 10, 62 | 96, 29\nL2Net+: 33, 34 | 15, 34 | 40, 304 | 43, 46 | 19, 78 | 9, 51 | 127, 26\nHardNet+: 33, 35 | 15, 41 | 40, 316 | 44, 47 | 21, 75 | 11, 54 | 127, 31\n\n4.4 Image retrieval\n\nWe evaluate our method, and compare against the related ones, on the practical application of image\nretrieval with local features. Standard image retrieval datasets are used for the evaluation, i.e.,\nOxford5k [33] and Paris6k [34] datasets. Both datasets contain a set of images (5062 for Oxford5k\nand 6300 for Paris6k) depicting 11 different landmarks together with distractors. For each of the\n11 landmarks there are 5 different query regions de\ufb01ned by a bounding box, constituting 55 query\nregions per dataset. The performance is reported as mean average precision (mAP) [33].\nIn the \ufb01rst experiment, for each image in the dataset, multi-scale Hessian-af\ufb01ne features [31] are\nextracted. Exactly the same features are described by ours and all related methods, each of them\nproducing a 128-D descriptor per feature. Then, k-means with approximate nearest neighbor [21] is\nused to learn a vocabulary of 1 million visual words on an independent dataset, that is, when evaluating on\nOxford5k, the vocabulary is learned with descriptors of Paris6k and vice versa. All descriptors of the\ntesting dataset are assigned to the corresponding vocabulary, so \ufb01nally, an image is represented by\nthe histogram of visual word occurrences, i.e., the bag-of-words (BoW) [1] representation, and an\ninverted \ufb01le is used for an ef\ufb01cient search. 
Additionally, spatial veri\ufb01cation (SV) [33], and standard\nquery expansion (QE) [34] are used to re-rank and re\ufb01ne the search results. Comparison with the\nrelated work on patch description is presented in Table 4. HardNet+ and L2Net+ perform comparably\nacross both datasets and all settings, with slightly better performance of HardNet+ on average across\n\nTable 4: Performance (mAP) evaluation on bag-of-words (BoW) image retrieval. Vocabulary\nconsisting of 1M visual words is learned on independent dataset, that is, when evaluating on Oxford5k,\nthe vocabulary is learned with features of Paris6k and vice versa. SV: spatial veri\ufb01cation. QE: query\nexpansion. The best results are highlighted in bold. All the descriptors except SIFT and HardNet++\nwere learned on Liberty sequence of Brown dataset [3]. HardNet++ is trained on union of Brown and\nHPatches [18] datasets.\n\nDescriptor | Oxford5k: BoW, BoW+SV, BoW+QE | Paris6k: BoW, BoW+SV, BoW+QE\nTFeat-M* [23] | 46.7, 55.6, 72.2 | 43.8, 51.8, 65.3\nRootSIFT [10] | 55.1, 63.0, 78.4 | 59.3, 63.7, 76.4\nL2Net+ [24] | 59.8, 67.7, 80.4 | 63.0, 66.6, 77.2\nHardNet | 59.0, 67.6, 83.2 | 61.4, 67.4, 77.5\nHardNet+ | 59.8, 68.8, 83.0 | 61.0, 67.0, 77.5\nHardNet++ | 60.8, 69.6, 84.5 | 65.0, 70.3, 79.1\n\nTable 5: Performance (mAP) comparison with the state-of-the-art image retrieval with local features.\nVocabulary is learned on independent dataset, that is, when evaluating on Oxford5k, the vocabulary\nis learned with features of Paris6k and vice versa. All presented results are with spatial veri\ufb01cation\nand query expansion. VS: vocabulary size. SA: single assignment. MA: multiple assignments. 
The\nbest results are highlighted in bold.\n\nMethod | VS | Oxford5k: SA, MA | Paris6k: SA, MA\nSIFT\u2013BoW [36] | 1M | 78.4, 82.2 | \u2013, \u2013\nSIFT\u2013BoW-fVocab [46] | 16M | 74.0, 84.9 | 73.6, 82.4\nRootSIFT\u2013HQE [43] | 65k | 85.3, 88.0 | 81.3, 82.8\nHardNet++\u2013HQE | 65k | 86.8, 88.3 | 82.8, 84.9\n\nall results (average mAP 69.5 vs. 69.1). RootSIFT, which was the best performing descriptor in\nimage retrieval for a long time, falls behind with average mAP 66.0 across all results.\nWe also trained a HardNet++ version using all training data available at the time: the union of the Brown\nand HPatches datasets, instead of just the Liberty sequence from Brown used for HardNet+. It shows the\nbene\ufb01ts of having more training data and performs best for all setups.\nFinally, we compare our descriptor with the state-of-the-art image retrieval approaches that use local\nfeatures. For fairness, all methods presented in Table 5 use the same local feature detector as described\nbefore, learn the vocabulary on an independent dataset, and use spatial veri\ufb01cation (SV) and query\nexpansion (QE). In our case (HardNet++\u2013HQE), a visual vocabulary of 65k visual words is learned,\nwith additional Hamming embedding (HE) [42] technique that further re\ufb01nes descriptor assignments\nwith a 128-bit binary signature. We follow the same procedure as the RootSIFT\u2013HQE [43] method, by\nreplacing RootSIFT with our learned HardNet++ descriptor. Speci\ufb01cally, we use: (i) weighting of the\nvotes as a decreasing function of the Hamming distance [44]; (ii) burstiness suppression [44]; (iii)\nmultiple assignments of features to visual words [34, 45]; and (iv) QE with feature aggregation [43].\nAll parameters are set as in [43]. 
The performance of our method is the best reported on both\nOxford5k and Paris6k when learning the vocabulary on an independent dataset (mAP 89.1 was\nreported [10] on Oxford5k by learning it on the same dataset comprising the relevant images), and\nusing the same number of features (mAP 89.4 was reported [43] on Oxford5k when using twice as\nmany local features, i.e., 22M compared to 12.5M used here).\n\n5 Conclusions\n\nWe proposed a novel loss function for learning a local image descriptor that relies on hard negative\nmining within a mini-batch and the maximization of the distance between the closest positive and\nclosest negative patches. The proposed sampling strategy outperforms classical hard-negative mining\nand random sampling for softmin, triplet margin and contrastive losses.\nThe resulting descriptor is compact: it has the same dimensionality as SIFT (128), shows\nstate-of-the-art performance on standard matching, patch verification and retrieval benchmarks, and is\nfast to compute on a GPU. The training source code and the trained convnets are available at\nhttps://github.com/DagnyT/hardnet.\n\nAcknowledgements\n\nThe authors were supported by the Czech Science Foundation Project GACR P103/12/G084, the\nAustrian Ministry for Transport, Innovation and Technology, the Federal Ministry of Science,\nResearch and Economy, and the Province of Upper Austria in the frame of the COMET center, the\nCTU student grant SGS17/185/OHK3/3T/13, and the MSMT LL1303 ERC-CZ grant. Anastasiya\nMishchuk was supported by the Szkocka Research Group Grant.\n\nReferences\n\n[1] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In International Conference on Computer Vision (ICCV), pages 1470\u20131477, 2003.\n\n[2] Filip Radenovic, Giorgos Tolias, and Ondrej Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples.
In European Conference on Computer Vision (ECCV), pages 3\u201320, 2016.\n\n[3] Matthew Brown and David G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision (IJCV), 74(1):59\u201373, 2007.\n\n[4] Dmytro Mishkin, Jiri Matas, Michal Perdoch, and Karel Lenc. WxBS: Wide baseline stereo generalizations. arXiv 1504.06603, 2015.\n\n[5] Johannes L. Schonberger, Filip Radenovic, Ondrej Chum, and Jan-Michael Frahm. From single image query to detailed 3D reconstruction. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5126\u20135134, 2015.\n\n[6] Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104\u20134113, 2016.\n\n[7] Christopher B. Choy, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. Universal correspondence network. In Advances in Neural Information Processing Systems, pages 2414\u20132422, 2016.\n\n[8] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In International Conference on Computer Vision (ICCV), 2015.\n\n[9] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91\u2013110, 2004.\n\n[10] Relja Arandjelovic and Andrew Zisserman. Three things everyone should know to improve object retrieval. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2911\u20132918, 2012.\n\n[11] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision (IJCV), 60(1):63\u201386, 2004.\n\n[12] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF.
In International Conference on Computer Vision (ICCV), pages 2564\u20132571, 2011.\n\n[13] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned invariant feature transform. In European Conference on Computer Vision (ECCV), pages 467\u2013483, 2016.\n\n[14] Xufeng Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3279\u20133286, 2015.\n\n[15] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.\n\n[16] Andrei Bursuc, Giorgos Tolias, and Herve Jegou. Kernel local descriptors with implicit rotation matching. In ACM International Conference on Multimedia Retrieval, 2015.\n\n[17] Jingming Dong and Stefano Soatto. Domain-size pooling in local descriptors: DSP-SIFT. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5097\u20135106, 2015.\n\n[18] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[19] Johannes L. Schonberger, Hans Hardmeier, Torsten Sattler, and Marc Pollefeys. Comparative evaluation of hand-crafted and learned local features. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[20] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Descriptor learning using convex optimisation. In European Conference on Computer Vision (ECCV), pages 243\u2013256, 2012.\n\n[21] Marius Muja and David G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration.
In International Conference on Computer Vision Theory and Applications (VISAPP), pages 331\u2013340, 2009.\n\n[22] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In International Conference on Computer Vision (ICCV), pages 118\u2013126, 2015.\n\n[23] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In British Machine Vision Conference (BMVC), 2016.\n\n[24] Yurun Tian, Bin Fan, and Fuchao Wu. L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[25] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562\u2013570, 2015.\n\n[26] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 1502.03167, 2015.\n\n[27] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), pages 807\u2013814, 2010.\n\n[28] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1):1929\u20131958, 2014.\n\n[29] PyTorch. http://pytorch.org.\n\n[30] Vijay Kumar B. G., Gustavo Carneiro, and Ian Reid. Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions.
In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5385\u20135394, 2016.\n\n[31] Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri Matas, Frederik Schaffalitzky, Timor Kadir, and Luc Van Gool. A comparison of affine region detectors. International Journal of Computer Vision (IJCV), 65(1):43\u201372, 2005.\n\n[32] D. Randall Wilson and Tony R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429\u20131451, 2003.\n\n[33] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1\u20138, 2007.\n\n[34] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1\u20138, 2008.\n\n[35] Jiri Matas, Ondrej Chum, Martin Urban, and Tomas Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In British Machine Vision Conference (BMVC), pages 384\u2013393, 2002.\n\n[36] Michal Perdoch, Ondrej Chum, and Jiri Matas. Efficient representation of local geometry for large scale object retrieval. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 9\u201316, 2009.\n\n[37] C. Lawrence Zitnick and Krishnan Ramnath. Edge foci interest points. In International Conference on Computer Vision (ICCV), pages 359\u2013366, 2011.\n\n[38] Dmytro Mishkin, Jiri Matas, and Michal Perdoch. MODS: Fast and robust method for two-view matching. Computer Vision and Image Understanding, 141:81\u201393, 2015. doi: https://doi.org/10.1016/j.cviu.2015.08.005.\n\n[39] Daniel C. Hauagge and Noah Snavely. Image matching using local symmetry features.
In Conference on Computer Vision and Pattern Recognition (CVPR), pages 206\u2013213, 2012.\n\n[40] Gehua Yang, Charles V Stewart, Michal Sofka, and Chia-Ling Tsai. Registration of challenging image pairs: Initialization, estimation, and decision. Pattern Analysis and Machine Intelligence (PAMI), 29(11):1973\u20131989, 2007.\n\n[41] Basura Fernando, Tatiana Tommasi, and Tinne Tuytelaars. Location recognition over large time lags. Computer Vision and Image Understanding, 139:21\u201328, 2015. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2015.05.016.\n\n[42] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Improving bag-of-features for large scale image search. International Journal of Computer Vision (IJCV), 87(3):316\u2013336, 2010.\n\n[43] Giorgos Tolias and Herve Jegou. Visual query expansion with or without geometry: refining local descriptors by feature aggregation. Pattern Recognition, 47(10):3466\u20133476, 2014.\n\n[44] Herve Jegou, Matthijs Douze, and Cordelia Schmid. On the burstiness of visual elements. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1169\u20131176, 2009.\n\n[45] Herve Jegou, Cordelia Schmid, Hedi Harzallah, and Jakob Verbeek. Accurate image search using the contextual dissimilarity measure. Pattern Analysis and Machine Intelligence (PAMI), 32(1):2\u201311, 2010.\n\n[46] Andrej Mikulik, Michal Perdoch, Ond\u0159ej Chum, and Ji\u0159\u00ed Matas. Learning vocabularies over a fine quantization.
International Journal of Computer Vision (IJCV), 103(1):163\u2013175, 2013.", "award": [], "sourceid": 2512, "authors": [{"given_name": "Anastasiia", "family_name": "Mishchuk", "institution": "Szkocka Research Group, Ukraine"}, {"given_name": "Dmytro", "family_name": "Mishkin", "institution": "Czech Technical University in Prague"}, {"given_name": "Filip", "family_name": "Radenovic", "institution": "Visual Recognition Group, CTU in Prague"}, {"given_name": "Jiri", "family_name": "Matas", "institution": "Czech Technical University"}]}