{"title": "Neighbourhood Consensus Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1651, "page_last": 1662, "abstract": "We address the problem of finding reliable dense correspondences between a pair of images. This is a challenging task due to strong appearance differences between the corresponding scene elements and ambiguities generated by repetitive patterns. The contributions of this work are threefold. First, inspired by the classic idea of disambiguating feature matches using semi-local constraints, we develop an end-to-end trainable convolutional neural network architecture that identifies sets of spatially consistent matches by analyzing neighbourhood consensus patterns in the 4D space of all possible correspondences between a pair of images without the need for a global geometric model. Second, we demonstrate that the model can be trained effectively from weak supervision in the form of matching and non-matching image pairs without the need for costly manual annotation of point to point correspondences.\nThird, we show the proposed neighbourhood consensus network can be applied to a range of matching tasks including both category- and instance-level matching, obtaining the state-of-the-art results on the PF Pascal dataset and the InLoc indoor visual localization benchmark.", "full_text": "Neighbourhood Consensus Networks\n\nIgnacio Rocco\u2020\n\nMircea Cimpoi\u2021\n\nRelja Arandjelovi\u00b4c\u00a7\n\nAkihiko Torii\u2217\n\nTomas Pajdla\u2021\n\nJosef Sivic\u2020,\u2021\n\n\u2020Inria\n\n\u2021CIIRC, CTU in Prague\n\n\u00a7DeepMind\n\n\u2217Tokyo Institute of Technology\n\nAbstract\n\nWe address the problem of \ufb01nding reliable dense correspondences between a pair\nof images. This is a challenging task due to strong appearance differences between\nthe corresponding scene elements and ambiguities generated by repetitive patterns.\nThe contributions of this work are threefold. 
First, inspired by the classic idea\nof disambiguating feature matches using semi-local constraints, we develop an\nend-to-end trainable convolutional neural network architecture that identi\ufb01es sets\nof spatially consistent matches by analyzing neighbourhood consensus patterns\nin the 4D space of all possible correspondences between a pair of images without\nthe need for a global geometric model. Second, we demonstrate that the model\ncan be trained effectively from weak supervision in the form of matching and\nnon-matching image pairs without the need for costly manual annotation of point\nto point correspondences. Third, we show the proposed neighbourhood consensus\nnetwork can be applied to a range of matching tasks including both category- and\ninstance-level matching, obtaining the state-of-the-art results on the PF Pascal\ndataset and the InLoc indoor visual localization benchmark.\n\n1\n\nIntroduction\n\nFinding visual correspondences is one of the fundamental image understanding problems with ap-\nplications in 3D reconstruction [2], visual localization [32, 42] or object recognition [21]. In recent\nyears, signi\ufb01cant effort has gone into developing trainable image representations for \ufb01nding corre-\nspondences between images under strong appearance changes caused by viewpoint or illumination\nvariations [3, 4, 10, 13, 17, 37, 38, 44, 45]. However, unlike in other visual recognition tasks, such as\nimage classi\ufb01cation or object detection, where trainable image representations have become the de\nfacto standard, the performance gains obtained by trainable features over the classic hand-crafted\nones have been only modest at best [36].\nOne of the reasons for this plateauing performance could be the currently dominant approach for\n\ufb01nding image correspondence based on matching individual image features. 
While we now have better local patch descriptors, the matching is still performed by variants of the nearest neighbour assignment in a feature space, followed by separate disambiguation stages based on geometric constraints. This approach has, however, fundamental limitations. Imagine a scene with textureless regions or repetitive patterns, such as a corridor with almost textureless walls and only a few distinguishing features. A small patch of an image, depicting a repetitive pattern or a textureless area, is indistinguishable from other portions of the image depicting the same repetitive or textureless pattern. Such matches will be either discarded [23] or incorrect. As a result, matching individual patch descriptors will often fail in such challenging situations.\n\u2020WILLOW project, D\u00e9partement d\u2019informatique de l\u2019\u00c9cole normale sup\u00e9rieure, ENS/INRIA/CNRS UMR 8548, PSL Research University, Paris, France.\n\u2021CIIRC \u2013 Czech Institute of Informatics, Robotics and Cybernetics at the Czech Technical University in Prague, Czechia.\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\nIn this work we take a different direction and develop a trainable neural network architecture that disambiguates such challenging situations by analyzing local neighbourhood patterns in a full set of dense correspondences. The intuition is the following: in order to disambiguate a match on a repetitive pattern, it is necessary to analyze a larger context of the scene that contains a unique non-repetitive feature. The information from this unique match can then be propagated to the neighbouring uncertain matches. In other words, the certain unique matches will support the close-by uncertain ambiguous matches in the image.\nThis powerful idea goes back to at least the 1990s [5, 34, 35, 39, 47], and is typically known as neighbourhood consensus or more broadly as semi-local constraints. 
The neighbourhood consensus\nhas been typically carried out on sparsely detected local invariant features as a \ufb01ltering step performed\nafter a hard assignment of features by nearest neighbour matching using the Euclidean distance\nin the feature space. Furthermore, the neighbourhood consensus has been evaluated by manually\nengineered criteria, such as a certain number of locally consistent matches [5, 34, 39], or consistency\nin geometric parameters including distances and angles between matches [35, 47].\nIn this work, we go one step further and propose a way of learning neighbourhood consensus\nconstraints directly from training data. Moreover, we perform neighbourhood consensus before hard\nassignment of feature correspondence; that is, on the complete set of dense pair-wise matches. In\nthis way, the decision on matching assignment is done only after taking into account the spatial\nconsensus constraints, hence avoiding errors due to early matching decisions on ambiguous, repetitive\nor textureless matches.\n\nContributions. We present the following contributions. First, we develop a neighbourhood con-\nsensus network \u2013 a convolutional neural network architecture for dense matching that learns local\ngeometric constraints between neighbouring correspondences without the need for a global geomet-\nric model. Second, we show that parameters of this network can be trained from scratch using a\nweakly supervised loss-function that requires supervision at the level of image pairs without the need\nfor manually annotating individual correspondences. Finally, we show that the proposed model is\napplicable to a range of matching tasks producing high-quality dense correspondences, achieving\nstate-of-the-art results on both category- and instance-level matching benchmarks. 
Both training code\nand models are available at [1].\n\n2 Related work\n\nThis work relates to several lines of research, which we review below.\n\nMatching with hand-crafted image descriptors. Traditionally, correspondences between images\nhave been obtained by hand crafted local invariant feature detectors and descriptors [23, 25, 43]\nthat were extracted from the image with a controlled degree of invariance to local geometric and\nphotometric transformations. Candidate (tentative) correspondences were then obtained by variants\nof nearest neighbour matching. Strategies for removing ambiguous and non-distinctive matches\ninclude the widely used second nearest neighbour ratio test [23], or enforcing matches to be mutual\nnearest neighbours. Both approaches work well for many applications, but have the disadvantage\nof discarding many correct matches, which can be problematic for challenging scenes, such as\nindoor spaces considered in this work that include repetitive and textureless areas. While successful,\nhandcrafted descriptors have only limited tolerance to large appearance changes beyond the built-in\ninvariance.\n\nMatching with trainable descriptors. The majority of trainable image descriptors are based on\nconvolutional neural networks (CNNs) and typically operate on patches extracted using a feature\ndetector such as DoG [23], yielding a sparse set of descriptors [3, 4, 10, 17, 37, 38] or use a pre-trained\nimage-level CNN feature extractor [26, 33]. Others have recently developed trainable methods that\ncomprise both feature detection and description [7, 26, 44]. The extracted descriptors are typically\ncompared using the Euclidean distance, but an appropriate similarity score can be also learnt in\na discriminative manner [13, 45], where a trainable model is used to both extract descriptors and\n\n2\n\n\fproduce a similarity score. 
Finding matches consistent with a geometric model is typically performed\nin a separate post-processing stage [3, 4, 7, 10, 17, 22, 26, 37, 38, 44].\n\nTrainable image alignment. Recently, end-to-end trainable methods have been developed to\nproduce correspondences between images according to a parametric geometric model, such as an\naf\ufb01ne, perspective or thin-plate spline transformation [28, 29]. In these works, all pairwise feature\nmatches are computed and used to estimate the geometric transformation parameters using a CNN.\nUnlike previous methods that capture only a sparse set of correspondences, this geometric estimation\nCNN captures interactions between a full set of dense correspondences. However, these methods\ncurrently only estimate a low complexity parametric transformation, and therefore their application\nis limited to only coarse image alignment tasks. In contrast, we target a more general problem of\nidentifying reliable correspondences between images of a general 3D scene. Our approach is not\nlimited to a low dimensional parametric model, but outputs a generic set of locally consistent image\ncorrespondences, applicable to a wide range of computer vision problems ranging from category-level\nimage alignment to camera pose estimation. The proposed method builds on the classical ideas of\nneighbourhood consensus, which we review next.\n\nMatch \ufb01ltering by neighbourhood consensus. Several strategies have been introduced to decide\nwhether a match is correct or not, given the supporting evidence from the neighbouring matches. The\nearly examples analyzed the patterns of distances [47] or angles [35] between neighbouring matches.\nLater work simply counts the number of consistent matches in a certain image neighbourhood [34, 39],\nwhich can be built in a scale invariant manner [31] or using a regular image grid [5]. 
While\nsimple, these techniques have been remarkably effective in removing random incorrect matches and\ndisambiguating local repetitive patterns [31]. Inspired by this simple yet powerful idea we develop a\nneighbourhood consensus network \u2013 a convolutional neural architecture that (i) analyzes the full set of\ndense matches between a pair of images and (ii) learns patterns of locally consistent correspondences\ndirectly from data.\n\nFlow and disparity estimation. Related are also methods that estimate optical \ufb02ow or stereo\ndisparity such as [6, 15, 16, 24, 40], or their trainable counterparts [8, 19, 41]. These works also aim\nat establishing reliable point to point correspondences between images. However, we address a more\ngeneral matching problem where images can have large viewpoint changes (indoor localization) or\nmajor changes in appearance (category-level matching). This is different from optical \ufb02ow where\nimage pairs are usually consecutive video frames with small viewpoint or appearance changes, and\nstereo where matching is often reduced to a local search around epipolar lines. The optical \ufb02ow\nand stereo problems are well addressed by specialized methods that explicitly exploit the problem\nconstraints (such as epipolar line constraint, small motion, smoothness, etc.).\n\n3 Proposed approach\n\nIn this work, we combine the robustness of neighbourhood consensus \ufb01ltering with the power of\ntrainable neural architectures. We design a model which learns to discriminate a reliable match by\nrecognizing patterns of supporting matches in its neighbourhood. Furthermore, we do this in a fully\ndifferentiable way, such that this trainable matching module can be directly combined with strong\nCNN image descriptors. The resulting pipeline can then be trained in an end-to-end manner for the\ntask of feature matching. An overview of our proposed approach is presented in Fig. 1. 
There are five main components: (i) dense feature extraction and matching, (ii) the neighbourhood consensus network, (iii) a soft mutual nearest neighbour filtering, (iv) extraction of correspondences from the output 4D filtered match tensor, and (v) weakly supervised training loss. These components are described next.\n\n3.1 Dense feature extraction and matching\n\nIn order to produce an end-to-end trainable model, we follow the common practice of using a deep convolutional neural network (CNN) as a dense feature extractor. Then, given an image I, this feature extractor will produce a dense set of descriptors, {f^I_ij} \u2208 R^d, with indices i = 1, . . . , h and j = 1, . . . , w, and (h, w) denoting the number of features along image height and width (i.e. the spatial resolution of the features), and d the dimensionality of the features.\n\nFigure 1: Overview of the proposed method. A fully convolutional neural network is used to extract dense image descriptors f^A and f^B for images I^A and I^B, respectively. All pairs of individual feature matches f^A_ij and f^B_kl are represented in the 4-D space of matches (i, j, k, l) (here shown as a 3-D perspective for illustration), and their matching scores stored in the 4-D correlation tensor c. These matches are further processed by the proposed soft-nearest neighbour filtering and neighbourhood consensus network (see Figure 2) to produce the final set of output correspondences.\n\nWhile classic hand-crafted neighbourhood consensus approaches are applied after a hard assignment of matches is done, this is not well suited for developing a matching method that is differentiable and amenable for end-to-end training. The reason is that the step of selecting a particular match is not differentiable with respect to the set of all the possible features. 
In addition, in the case of repetitive features, assigning the match to the first nearest neighbour might result in an incorrect match, in which case the hard assignment would lose valuable information about the subsequent closest neighbours.\nTherefore, in order to have an approach that is amenable to end-to-end training, all pairwise feature matches need to be computed and stored. For this we use an approach similar to [28]. Given two sets of dense feature descriptors f^A = {f^A_ij} and f^B = {f^B_ij} corresponding to the images to be matched, the exhaustive pairwise cosine similarities between them are computed and stored in a 4-D tensor c \u2208 R^{h\u00d7w\u00d7h\u00d7w} referred to as the correlation map, where:\n\nc_ijkl = \u27e8f^A_ij, f^B_kl\u27e9 / (\u2016f^A_ij\u2016_2 \u2016f^B_kl\u2016_2).   (1)\n\nNote that, by construction, elements of c in the vicinity of index ijkl correspond to matches between features that are in the local neighbourhoods N_A and N_B of descriptors f^A_ij in image A and f^B_kl in image B, respectively, as illustrated in Fig. 1; this structure of the 4-D correlation map tensor c will be exploited in the next section.\n\n3.2 Neighbourhood consensus network\n\nThe correlation map contains the scores of all pairwise matches. In order to further process and filter the matches, we propose to use a 4-D convolutional neural network (CNN) for the neighbourhood consensus task (denoted by N(\u00b7)), which is illustrated in Fig. 2.\nDetermining the correct matches from the correlation map is, a priori, a significant challenge. Note that the number of correct matches is of the order of hw, while the size of the correlation map is of the order of (hw)^2. This means that the great majority of the information in the correlation map corresponds to matching noise due to incorrectly matched features.\nHowever, supported by the idea of neighbourhood consensus presented in Sec.
1, we can expect correct matches to have a coherent set of supporting matches surrounding them in the 4-D space. These geometric patterns are equivariant with translations in the input images; that is, if the images are translated, the matching pattern is also translated in the 4-D space by an equal amount. This property motivates the use of 4-D convolutions for processing the correlation map, as the same operations should be performed regardless of the location in the 4-D space. This is analogous to the motivation for using 2-D convolutions to process individual images \u2013 it makes sense to use convolutions, instead of, for example, a fully connected layer, in order to profit from weight sharing and keep the number of trainable parameters low. Furthermore, it facilitates sample-efficient training, as a single training example provides many error signals to the convolutional weights, since the same weights are applied at all positions of the correlation map. Finally, by processing matches with a 4D convolutional network we establish a strong locality prior on the relationships between the matches. That is, by\n\nFigure 2: Neighbourhood Consensus Network (NC-Net). A neighbourhood consensus CNN operates on the 4D space of feature matches. The first 4D convolutional layer filters span N_A \u00d7 N_B, the Cartesian product of local neighbourhoods N_A and N_B in images A and B respectively. 
The proposed 4D neighbourhood consensus CNN can learn to identify the matching patterns of reliable and unreliable matches, and filter the matches accordingly.\n\ndesign, the network will determine the quality of a match by examining only the information in a local 2D neighbourhood in each of the two images.\nThe proposed neighbourhood consensus network has several convolutional layers, as illustrated in Fig. 2, each followed by ReLU non-linearities. The convolutional filters of the first layer of the proposed CNN span a local 4-D region of the matches space, which corresponds to the Cartesian product of local neighbourhoods N_A and N_B in each image, respectively. Therefore, each 4-D filter of the first layer can process and detect patterns in all pairwise matches of these two neighbourhoods. This first layer has N1 filters that can specialize in learning different local geometric deformations, producing N1 output channels, which correspond to the agreement with these local deformations at each 4-D point of the correlation tensor. These output channels are further processed by subsequent 4-D convolutional layers. The aim is that these layers capture more complex patterns by combining the outputs from the previous layer, analogously to what has been observed for 2-D CNNs [46]. Finally, the neighbourhood consensus CNN produces a single-channel output, which has the same dimensions as the 4D input matches.\nLastly, in order to produce a method that is invariant to the particular order of the input images \u2013 that is, one that will produce the same matches regardless of whether an image pair is input to the network as (I^A, I^B) or (I^B, I^A) \u2013 we propose to apply the network twice in the following way:\n\n\u02dcc = N(c) + (N(c^T))^T,   (2)\n\nwhere by c^T we mean swapping the pair of dimensions corresponding to the first and second images: (c^T)_ijkl = c_klij. 
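For concreteness, the correlation map of (1) and the order-invariant application of (2) can be sketched in NumPy (an illustrative sketch, not the paper's PyTorch implementation; the function names are ours, and N stands for any map on the 4-D tensor):

```python
import numpy as np

def correlation_map(fA, fB, eps=1e-8):
    # Eq. (1): exhaustive pairwise cosine similarities between dense
    # descriptors fA (h, w, d) and fB (h', w', d), stored as a 4-D
    # tensor c of shape (h, w, h', w').
    fA = fA / (np.linalg.norm(fA, axis=-1, keepdims=True) + eps)
    fB = fB / (np.linalg.norm(fB, axis=-1, keepdims=True) + eps)
    return np.einsum("ijd,kld->ijkl", fA, fB)

def swap_images(c):
    # c^T in eq. (2): swap the dimension pairs of images A and B,
    # so that (c^T)_ijkl = c_klij.
    return np.transpose(c, (2, 3, 0, 1))

def symmetric_ncnet(c, N):
    # Eq. (2): c~ = N(c) + (N(c^T))^T, invariant to the order of the
    # input images for any choice of N.
    return N(c) + swap_images(N(swap_images(c)))
```

For any N, feeding the swapped correlation map yields the swapped output, so the produced matches do not depend on the order in which the two images are presented.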
This final output constitutes the filtered matches \u02dcc using the neighbourhood consensus network, where matches with inconsistent local patterns are downweighted or removed. Further filtering can be done by means of a global filtering strategy, as presented next.\n\n3.3 Soft mutual nearest neighbour filtering\n\nAlthough the proposed neighbourhood consensus network can suppress and amplify matches based on the supporting evidence in their neighbourhoods \u2013 that is, at a semi-local level \u2013 it cannot enforce global constraints on matches, such as being a reciprocal match, where matched features are required to be mutual nearest neighbours:\n\n(f^A_ab, f^B_cd) mutual N.N. \u21d0\u21d2 (a, b) = arg min_ij \u2016f^A_ij \u2212 f^B_cd\u2016 and (c, d) = arg min_kl \u2016f^A_ab \u2212 f^B_kl\u2016.   (3)\n\nFiltering the matches by imposing the hard mutual nearest neighbour condition expressed by (3) would eliminate the great majority of candidate matches, which makes it unsuitable for use in an end-to-end trainable approach, as this hard decision is non-differentiable.\n\nWe therefore propose a softer version of the mutual nearest neighbour filtering (M(\u00b7)), both in the sense of a softer decision and of better differentiability properties, that can be applied on dense 4-D match scores:\n\n\u02c6c = M(c), where \u02c6c_ijkl = r^A_ijkl r^B_ijkl c_ijkl,   (4)\n\nand r^A_ijkl and r^B_ijkl are the ratios of the score of the particular match c_ijkl with the best scores along each pair of dimensions corresponding to images A and B respectively:\n\nr^A_ijkl = c_ijkl / max_ab c_abkl, and r^B_ijkl = c_ijkl / max_cd c_ijcd.   (5)\n\nThis soft mutual nearest neighbour filtering operates as a gating mechanism on the input, downweighting the scores of matches that are not mutual nearest neighbours. Note that the proposed formulation is indeed a softer version of the mutual nearest neighbours criterion as \u02c6c_ijkl equals the matching score c_ijkl iff (f^A_ij, f^B_kl) are mutual nearest neighbours, and is decreased to a value in [0, c_ijkl) otherwise. On the contrary, the \u201chard\u201d mutual nearest neighbour matching would assign \u02c6c_ijkl = 0 in the latter case.\nWhile this filtering step has no trainable parameters, it can be inserted in the CNN pipeline at both training and evaluation stages, and it will help to enforce the global reciprocity constraint on matches. In the proposed approach, the soft mutual nearest neighbour filtering is used to filter both the correlation map, as well as the output of the neighbourhood consensus CNN, as illustrated in Fig. 1.\n\n3.4 Extracting correspondences from the correlation map\n\nSuppose that we want to match two images I^A and I^B. Then, the output of our model will produce a 4-D filtered correlation map c, which contains (filtered) scores for all pairwise matches. However, for
However, for\nvarious applications, such as image warping, geometric transformation estimation, pose estimation,\nvisualization, etc, it is desirable to obtain a set of point-to-point image correspondences between the\ntwo images. To achieve this, a hard assignment can be performed in either of two possible directions,\nfrom features in image A to features in image B, or vice versa.\nFor this purpose, two scores are de\ufb01ned from the correlation map, by performing soft-max in the\ndimensions corresponding to images A and B:\n\n(cid:80)\n\n(cid:80)\n\n(cid:80)\n\nsA\nijkl =\n\nexp(cijkl)\nab exp(cabkl)\n\nand\n\nsB\nijkl =\n\nexp(cijkl)\ncd exp(cijcd)\n\n.\n\n(6)\n\nNote that the scores are: (i) positive, (ii) normalized using the soft-max function, which makes\nijab = 1. Hence we can interpret them as discrete conditional probability distributions of\nab sB\nkl being a match, given the position (i, j) of the match in A or (k, l) in B. If we denote\nf A\nij , f B\n(I, J, K, L) the discrete random variables indicating the position of a match (a priori unknown), and\n(i, j, k, l) the particular position of a match, then:\nand\n\n(7)\nThen, the hard-assignment in one direction can be done by just taking the most likely match (the\nmode of the distribution):\n\nP (K = k, L = l | I = i, J = j) = sB\n\nP (I = i, J = j | K = k, L = l) = sA\n\nijkl.\n\nijkl\n\nkl assigned to a given f A\nf B\n\nij \u21d0\u21d2 (k, l) = arg max\n\nP (K = c, L = d | I = i, J = j)\n\ncd\n\n= arg max\n\ncd\n\nsB\nijcd,\n\n(8)\n\nand analogously to obtain the matches f A\nThis probabilistic intuition allows us to model the match uncertainty using a probability distribution\nand will be also useful to motivate the loss used for weakly-supervised training, which will be\ndescribed next.\n\nij assigned to a given f B\nkl.\n\n3.5 Weakly-supervised training\n\nIn this section we de\ufb01ne the loss function used to train our network. 
One option is to use a strongly-supervised loss, but this requires dense annotations consisting of all pairs of corresponding points for each training image pair. Obtaining such exhaustive ground-truth is complicated \u2013 dense manual annotation is impractical, while sparse annotation followed by an automatic densification technique typically results in imprecise and erroneous training data. Another alternative is to resort to synthetic imagery, which would provide point correspondences by construction, but this has the downside of making it harder to generalize to the larger appearance variations encountered in the real image pairs we wish to handle. Therefore, it is desirable to be able to train directly from pairs of real images, and using as little annotation as possible.\nFor this we propose to use a training loss that only requires a weak level of supervision consisting of annotation on the level of image pairs. These training pairs (I^A, I^B) can be of two types, positive pairs, labelled with y = +1, or negative pairs, labelled with y = \u22121. Then, the following loss function is proposed:\n\nL(I^A, I^B) = \u2212y (\u00afs^A + \u00afs^B),   (9)\n\nwhere \u00afs^A and \u00afs^B are the mean matching scores over all hard-assigned matches as per (8) of a given image pair (I^A, I^B) in both matching directions.\nNote that the minimization of this loss maximizes the scores of positive and minimizes the scores of negative image pairs, respectively. As explained in Sec. 3.4, the hard-assigned matches correspond to the modes of the distributions of (7). Therefore, maximizing the score forces the distribution towards a Kronecker delta distribution, having the desirable effect of producing well-identified matches in positive image pairs. Similarly, minimizing the score forces the distribution towards the uniform one, weakening the matches in the negative image pairs. 
Note that while the only scores that directly contribute to the loss are the ones coming from hard-assigned matches, all matching scores affect the loss because of the normalization in (6). Therefore, all matching scores will be updated at each training step.\n\n4 Experimental results\n\nThe proposed approach was evaluated on both instance- and category-level matching problems. The same approach is used to obtain reliable matches for both problems, which are then used to solve two completely different tasks: (i) camera pose estimation in the challenging scenario of indoor localization, in the instance-level matching case, and (ii) semantic object alignment in the category-level matching case. Next we present the implementation details, followed by the results on the two tasks.\n\nImplementation details. The model was implemented in PyTorch [27], and a ResNet-101 network [14] initialized on ImageNet was used for feature extraction (up to the conv4_23 layer). The neighbourhood consensus network N(\u00b7) contains three layers of 5 \u00d7 5 \u00d7 5 \u00d7 5 filters or two layers of 3 \u00d7 3 \u00d7 3 \u00d7 3 filters for category-level and instance-level matching, respectively. In both cases, the intermediate results have 16 channels (N1 = N2 = 16). A feature resolution of 25 \u00d7 25 was used for training. As accurately localized features are needed for the pose estimation task, at test time we extract correspondences for pose estimation at a higher resolution, resulting in a 200 \u00d7 150 feature map, which is downsampled to 100 \u00d7 75 using a 4-D max+argmax pooling operation after computing the correlation map, for increased efficiency. The model is initially trained for 5 epochs using the Adam optimizer [20], with a learning rate of 5 \u00d7 10\u22124 and keeping the feature extraction layer weights fixed. 
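The weakly-supervised loss of (9) can likewise be sketched in NumPy (a simplified illustration with names of our choosing; the mean mode score corresponds to \u00afs^A and \u00afs^B, computed from the soft-max scores of the hard-assigned matches):

```python
import numpy as np

def mean_mode_score(c):
    # Mean, over all positions (i, j) in image A, of the soft-max
    # probability (6) of the hard-assigned match (8), i.e. the mode
    # of P(K, L | I = i, J = j).
    e = np.exp(c - c.max(axis=(2, 3), keepdims=True))
    sB = e / e.sum(axis=(2, 3), keepdims=True)
    return sB.max(axis=(2, 3)).mean()

def weak_loss(c, y):
    # Eq. (9): L = -y (sA_bar + sB_bar), with y = +1 for matching and
    # y = -1 for non-matching image pairs; sA_bar is the same quantity
    # computed after swapping the dimension pairs of the two images.
    sB_bar = mean_mode_score(c)
    sA_bar = mean_mode_score(np.transpose(c, (2, 3, 0, 1)))
    return -y * (sA_bar + sB_bar)
```

Minimizing this loss sharpens the match distributions of positive pairs towards well-identified modes and flattens those of negative pairs.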
For category-level matching, the model is subsequently finetuned for 5 more epochs, training both the feature extraction and the neighbourhood consensus network, with a learning rate of 1 \u00d7 10\u22125. In the case of instance-level matching, finetuning the feature extraction did not improve the performance. Additional implementation details are given in the supplementary material [30].\n\n4.1 Category-level matching\n\nThe proposed method was evaluated on the task of category-level matching, where, given two images containing different instances from the same category (e.g. two different cat images), the goal is to match similar semantic parts.\n\nDataset and evaluation measure. We report results on the PF-Pascal [11] dataset, which contains 1,351 semantically related image pairs from the 20 object categories of the PASCAL VOC [9] dataset.\n\nTable 1: Results for semantic keypoint transfer. We show the rate (%) of correctly transferred keypoints within threshold \u03b1 = 0.1.\n\nMethod | PCK\nHOG+PF-LOM [11] | 62.5\nSCNet-AG+ [12] | 72.2\nCNNGeo [28] | 71.9\nWeakAlign [29] | 75.8\nNC-Net | 78.9\n\nTable 2: Comparison of indoor localization methods. We show the rate (%) of correctly localized queries within a given distance (m) and 10\u25e6 angular error.\n\nDist. (m) | SparsePE [42] | DensePE [42] | DensePE + MNN | DensePE + NC-Net | InLoc [42] | InLoc + MNN | InLoc + NC-Net\n0.25 | 21.3 | 35.3 | 31.9 | 38.9 | 37.1 | 37.1 | 44.1\n0.50 | 30.7 | 47.4 | 50.5 | 56.5 | 53.5 | 60.2 | 63.8\n1.00 | 42.6 | 57.1 | 62.0 | 69.9 | 62.9 | 72.0 | 76.0\n2.00 | 48.3 | 61.1 | 64.7 | 74.2 | 66.3 | 76.3 | 78.4\n\nWe follow the same evaluation protocol as [12, 29], and use the split from [12], which divides the dataset into approximately 700 pairs for training, 300 for validation and 300 for testing. 
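The PCK measure can be sketched as follows (a simplified NumPy version with names of our choosing; published protocols differ in whether the threshold is normalized by the image size or the object bounding-box size):

```python
import numpy as np

def pck(gt_kps, pred_kps, ref_size, alpha=0.1):
    # Percentage of correct keypoints: a transferred keypoint counts as
    # correct if it lands within alpha * max(h, w) of its ground-truth
    # annotation, where (h, w) is the reference size of the protocol.
    h, w = ref_size
    d = np.linalg.norm(np.asarray(pred_kps, float) - np.asarray(gt_kps, float), axis=-1)
    return float((d <= alpha * max(h, w)).mean())
```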
In order to train the network in a weakly-supervised manner using the proposed loss (9), the 700 training pairs are used as positive training pairs, and negative pairs are generated by randomly pairing images of different categories, such as a car with a dog image. The performance is measured using the percentage of correct keypoints (PCK), that is, the proportion of annotated keypoints that are correctly matched.\n\nResults. Quantitative results are presented in Table 1. The proposed neighbourhood consensus network (NC-Net) obtains a ~3% improvement over the state-of-the-art methods on this dataset [29]. An example of semantic keypoint transfer is shown in Figure 3 and demonstrates how our approach can correctly match semantic object parts in challenging situations with large changes of appearance and non-rigid geometric deformations. See the supplementary material [30] for additional examples.\n\n4.2 Instance-level matching\n\nNext we show that our method is also suitable for instance-level matching and consider specifically the application to indoor visual localization, where the goal is to estimate an accurate 6DoF camera pose of a query photograph given a large-scale 3D model of a building. This is an extremely challenging instance-level matching task as indoor spaces are often self-similar and contain large textureless areas. We compare our method with the recently introduced indoor localization approach of [42], which is a strong baseline that outperforms several state-of-the-art methods, and introduces a challenging dataset for large-scale indoor localization.\n\nDataset and evaluation measure. 
We evaluate on the InLoc dataset [42], which consists of 10K database images (perspective cutouts) extracted from 227 RGBD panoramas, and an additional set of 356 query images captured with a smartphone camera at a different time from the database images. We follow the same evaluation protocol and report the percentage of correctly localized queries at a given camera position error threshold. As the InLoc dataset was designed for evaluation and does not contain a training set, we collected an Indoor Venues Dataset, consisting of user-uploaded photos captured at public places such as restaurants, cafes, museums or cathedrals, obtained by crawling Google Maps. It features appearance variations similar to those of the InLoc dataset, such as illumination changes and scene modifications due to the passage of time. The dataset contains 3,861 positive image pairs from 89 different venues in 6 different cities, split into a training set of 3,481 pairs (from 80 places) and a validation set of 380 pairs (from the remaining 9 places). The design and collection procedures are described in the supplementary material [30] and the dataset is available at [1]. As in the case of category-level matching, negative pairs were generated by randomly sampling images from different places.

Results. We plug our trainable neighbourhood consensus network (NC-Net) as a correspondence module into the InLoc indoor localization pipeline [42]. We evaluate two variants of the approach. In the first variant, denoted DensePE+NC-Net, the DensePE [42] method is used for generating candidate image pairs, and our network (NC-Net) is then used to produce the correspondences that are employed for pose estimation. In the second variant, denoted InLoc+NC-Net, we use the full InLoc pipeline, including pose-verification by view synthesis.
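The negative-pair generation described above, used for both the category-level and instance-level experiments, can be sketched as follows. This is a minimal illustration, not the paper's data-loading code; the dictionary layout and function name are assumptions.

```python
import random

def make_negative_pairs(images_by_label, n_pairs, seed=0):
    """Randomly pair images drawn from two *different* labels
    (object categories, or places/venues) to produce negative,
    i.e. non-matching, training pairs."""
    rng = random.Random(seed)
    labels = list(images_by_label)
    pairs = []
    for _ in range(n_pairs):
        label_a, label_b = rng.sample(labels, 2)   # two distinct labels
        pairs.append((rng.choice(images_by_label[label_a]),
                      rng.choice(images_by_label[label_b])))
    return pairs
```

Sampling two distinct labels first guarantees that no negative pair ever contains two images of the same category or place.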
In this case, matches produced by NC-Net are used as input for pose estimation for each of the top N = 10 candidate pairs from DensePE, and the resulting candidate poses are re-ranked using pose-verification by view synthesis. As an ablation study, these two experiments are also performed with NC-Net replaced by hard mutual nearest neighbours matching (MNN), using the same base CNN network (ResNet-101).

Figure 3: Semantic keypoint transfer. The annotated (ground truth) keypoints in the left image are automatically transferred to the right image using the dense correspondences between the two images obtained from our NC-Net.

Figure 4: Instance-level matching. Top row: inlier correspondences (shown as green dots) obtained by our approach (InLoc+NC-Net). Bottom row: baseline inlier correspondences (InLoc+MNN). Our method provides a much larger and locally consistent set of matches, even in low-textured regions. Both methods use the same CNN features.

Results are summarised in Table 2 and clearly demonstrate the benefits of our approach (DensePE+NC-Net) compared to both sparse keypoint (DoG+SIFT) matching (SparsePE) and the CNN feature matching used in [42] (DensePE). When inserted into the entire localization pipeline, our approach (InLoc+NC-Net) obtains state-of-the-art results on the indoor localization benchmark. An example of the correspondences obtained on a challenging indoor scene with repetitive structures and texture-less areas is shown in Figure 4. Additional results are shown in the supplementary material [30].

4.3 Limitations

While our method identifies correct matches in many challenging cases, some situations remain difficult. The two typical failure modes are: repetitive patterns combined with large changes in scale, and locally geometrically consistent groups of incorrect matches.
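For reference, the hard mutual nearest-neighbour (MNN) matching used as the ablation baseline in Section 4.2 can be sketched as follows. This is a generic illustration on descriptor arrays; the function name and array layout are assumptions, not the paper's implementation.

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    """Hard mutual nearest-neighbour matching between two descriptor sets.

    desc_a: (Na, D) and desc_b: (Nb, D) L2-normalised descriptors.
    A pair (i, j) is kept only if b_j is the best match of a_i AND
    a_i is the best match of b_j.
    """
    sim = desc_a @ desc_b.T                     # (Na, Nb) cosine similarities
    nn_b = sim.argmax(axis=1)                   # best j in B for each a_i
    nn_a = sim.argmax(axis=0)                   # best i in A for each b_j
    mutual = nn_a[nn_b] == np.arange(desc_a.shape[0])
    return [(int(i), int(nn_b[i])) for i in np.flatnonzero(mutual)]
```

Unlike NC-Net, this criterion scores each correspondence in isolation, which is exactly what makes it vulnerable to the repetitive structures and texture-less areas shown in Figure 4.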
Furthermore, the proposed method has quadratic, O(N²), complexity with respect to the number of image pixels (or CNN features) N. This limits the resolution of the images that we are currently able to handle to 1600 × 1200 px (or 3200 × 2400 px if using the 4-D max+argmax pooling operation).

5 Conclusion

We have developed a neighbourhood consensus network, a CNN architecture that learns local patterns of correspondences for image matching without the need for a global geometric model. We have shown that the model can be trained effectively from weak supervision and obtains strong results, outperforming the state of the art on two very different matching tasks. These results open up the possibility of end-to-end learning for other challenging visual correspondence tasks, such as 3D category-level matching [18] or visual localization across day/night illumination [32].

Acknowledgements

This work was partially supported by JSPS KAKENHI Grant Numbers 15H05313 and 16KK0002, EU-H2020 project LADIO No. 731970, ERC grant LEAP No. 336845, the CIFAR Learning in Machines & Brains program, and the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15 003/0000468).

References

[1] Project webpage (code/networks). http://www.di.ens.fr/willow/research/ncnet/.

[2] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski. Building Rome in a day. Communications of the ACM, 54(10):105–112, 2011.

[3] V. Balntas, E. Johns, L. Tang, and K. Mikolajczyk. PN-Net: Conjoined triple deep network for learning local image descriptors. arXiv preprint arXiv:1601.05030, 2016.

[4] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In Proc. BMVC, 2016.

[5] J. Bian, W.-Y. Lin, Y. Matsushita, S.-K. Yeung, T.-D. Nguyen, and M.-M. Cheng.
GMS: Grid-based motion statistics for fast, ultra-robust feature correspondence. In Proc. CVPR, 2017.

[6] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE PAMI, 2011.

[7] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker. Universal correspondence network. In NIPS, 2016.

[8] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In Proc. ICCV, 2015.

[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.

[10] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. arXiv preprint arXiv:1405.5769, 2014.

[11] B. Ham, M. Cho, C. Schmid, and J. Ponce. Proposal flow: Semantic correspondences from object proposals. IEEE PAMI, 2017.

[12] K. Han, R. S. Rezende, B. Ham, K.-Y. K. Wong, M. Cho, C. Schmid, and J. Ponce. SCNet: Learning semantic correspondence. In Proc. ICCV, 2017.

[13] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In Proc. CVPR, 2015.

[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.

[15] H. Hirschmüller. Stereo processing by semiglobal matching and mutual information. IEEE PAMI, 2008.

[16] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 1981.

[17] M. Jahrer, M. Grabner, and H. Bischof. Learned local descriptors for recognition and matching. In Computer Vision Winter Workshop, 2008.

[18] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik.
Learning category-specific mesh reconstruction from image collections. In Proc. ECCV, 2018.

[19] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. In Proc. ICCV, 2017.

[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[21] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. SIFT Flow: Dense correspondence across different scenes. In Proc. ECCV, 2008.

[22] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In NIPS, 2014.

[23] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.

[24] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. IJCAI, 1981.

[25] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. ECCV, 2002.

[26] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han. Large-scale image retrieval with attentive deep local features. In Proc. ICCV, 2017.

[27] A. Paszke, S. Gross, S. Chintala, and G. Chanan. PyTorch. http://pytorch.org/.

[28] I. Rocco, R. Arandjelović, and J. Sivic. Convolutional neural network architecture for geometric matching. In Proc. CVPR, 2017.

[29] I. Rocco, R. Arandjelović, and J. Sivic. End-to-end weakly-supervised semantic alignment. In Proc. CVPR, 2018.

[30] I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic. Neighbourhood consensus networks. arXiv preprint arXiv:1810.10510, 2018.

[31] T. Sattler, B. Leibe, and L. Kobbelt. SCRAMSAC: Improving RANSAC's efficiency with a spatial consistency filter. In Proc. ICCV, 2009.

[32] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, F. Kahl, and T. Pajdla.
Benchmarking 6DoF urban visual localization in changing conditions. In Proc. CVPR, 2018.

[33] N. Savinov, L. Ladicky, and M. Pollefeys. Matching neural paths: transfer from recognition to correspondence search. In NIPS, 2017.

[34] F. Schaffalitzky and A. Zisserman. Automated scene matching in movies. In International Conference on Image and Video Retrieval, 2002.

[35] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE PAMI, 1997.

[36] J. L. Schonberger, H. Hardmeier, T. Sattler, and M. Pollefeys. Comparative evaluation of hand-crafted and learned local features. In Proc. CVPR, 2017.

[37] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proc. ICCV, 2015.

[38] K. Simonyan, A. Vedaldi, and A. Zisserman. Learning local feature descriptors using convex optimisation. IEEE PAMI, 2014.

[39] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. ICCV, 2003.

[40] D. Sun, S. Roth, and M. J. Black. Secrets of optical flow estimation and their principles. In Proc. CVPR, 2010.

[41] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proc. CVPR, 2018.

[42] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In Proc. CVPR, 2018.

[43] T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: A survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177–280, 2008.

[44] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned invariant feature transform. In Proc. ECCV, 2016.

[45] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Proc.
CVPR, 2015.

[46] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. ECCV, 2014.

[47] Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence, 1995.