{"title": "Locally Uniform Comparison Image Descriptor", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 9, "abstract": "Keypoint matching between pairs of images using popular descriptors like SIFT or a faster variant called SURF is at the heart of many computer vision algorithms including recognition, mosaicing, and structure from motion. For real-time mobile applications, very fast but less accurate descriptors like BRIEF and related methods use a random sampling of pairwise comparisons of pixel intensities in an image patch. Here, we introduce Locally Uniform Comparison Image Descriptor (LUCID), a simple description method based on permutation distances between the ordering of intensities of RGB values between two patches. LUCID is computable in linear time with respect to patch size and does not require floating point computation. An analysis reveals an underlying issue that limits the potential of BRIEF and related approaches compared to LUCID. Experiments demonstrate that LUCID is faster than BRIEF, and its accuracy is directly comparable to SURF while being more than an order of magnitude faster.", "full_text": "Locally Uniform Comparison Image Descriptor\n\nAndrew Ziegler\u2217 Eric Christiansen David Kriegman Serge Belongie\n\nDepartment of Computer Science and Engineering, University of California, San Diego\namz@gatech.edu, {echristiansen, kriegman, sjb}@cs.ucsd.edu\n\nAbstract\n\nKeypoint matching between pairs of images using popular descriptors like SIFT\nor a faster variant called SURF is at the heart of many computer vision algorithms\nincluding recognition, mosaicing, and structure from motion. However, SIFT and\nSURF do not perform well for real-time or mobile applications. As an alternative,\nvery fast binary descriptors like BRIEF and related methods use pairwise comparisons\nof pixel intensities in an image patch. 
We present an analysis of BRIEF and\nrelated approaches revealing that they are hashing schemes on the ordinal correlation\nmetric Kendall\u2019s tau. Here, we introduce Locally Uniform Comparison Image\nDescriptor (LUCID), a simple description method based on linear time permutation\ndistances between the ordering of RGB values of two image patches. LUCID\nis computable in linear time with respect to the number of pixels and does not\nrequire floating point computation.\n\n1 Introduction\n\nLocal image descriptors have long been explored in the context of machine learning and computer\nvision. There are countless applications that rely on local feature descriptors, such as visual registration,\nreconstruction and object recognition. One of the most widely used local feature descriptors\nis SIFT, which uses automatic scale selection, orientation normalization, and histograms of oriented\ngradients to achieve partial affine invariance [15]. SIFT is known for its versatility and reliable\nrecognition performance, but these characteristics come at a high computational cost.\n\nRecently, mobile devices and affordable reliable imaging sensors have become ubiquitous. The\nwide adoption of these devices has made new real-time mobile applications of computer vision\nand machine learning feasible. Examples of such applications include visual search, augmented\nreality, perceptual interfaces, and wearable computing. However, these devices have less computational\npower than typical computers and perform poorly for floating point heavy applications.\nThese factors have provided an impetus for new efficient discrete approaches to feature description\nand matching. In this work we explore current trends in feature description and provide a new\nview of BRIEF and its related methods. We also present a novel feature description method that is\nsurprisingly simple and effective.\n\n1.1 Background\n\nBay et al. 
proposed SURF as an approximation to SIFT, a notable shift toward real-time feature\ndescription [1]. SURF obtains a large speedup over SIFT while retaining most of its desirable\nproperties and comparable recognition rates. However, SURF is not generally suited to real-time\napplications without acceleration via a powerful GPU [21].\n\nIn [3] Bosch et al. proposed Ferns as a classification based approach to keypoint recognition. Ferns\nuses sparse binary intensity comparisons between pixels in an image patch for descriptive power.\nThis simple scheme provides real-time performance in exchange for expensive off-line learning.\nIn response to the success of Ferns, Calonder et al. presented a novel binary feature descriptor\nthey named BRIEF [4]. Rather than training off-line, BRIEF makes use of random pixel intensity\ncomparisons to create a binary descriptor quickly. These descriptors can be matched an order of\nmagnitude faster than SIFT with the Hamming distance, even on mobile processors. As a result,\nBRIEF has come into widespread use and has inspired several variants based on the approach [12,\n14, 19]. However, little explanation as to why or how these types of descriptors work is given.\nThere is a fuzzy notion that pairwise intensity comparisons are an approximation to signed intensity\ngradients. This is not the whole story, and in fact these methods are sampling in an ad hoc manner\nfrom a rich source of discriminative information.\n\n\u2217This work was completed while the author was at UCSD.\n\n1.2 Related work\n\nIn this work we diverge from the current paradigm for fast feature description and explore a\ndeterministic approach based on permutations. The study of distances between permutations began near\nthe inception of group theory and has continued unabated since [5, 7, 8, 9, 10, 11, 16].\n\nA notable early use of permutation based methods in the realm of visual feature description was\npresented by Bhat and Nayar in [2]. 
They investigated the use of rank permutations of pixel intensities\nfor the purpose of dense stereo, the motivation being to find a robust alternative to the \u21132 norm.\nPermutations on pixel intensities offer a transformed representation of the data which is naturally less\nsensitive to noise and invariant to monotonic photometric transformations. Bhat and Nayar present\na similarity measure between two rank permutations that is based on the Kolmogorov-Smirnov test.\nTheir measure was designed to be robust to impulse noise, sometimes called salt and pepper noise,\nwhich can greatly corrupt a rank permutation. In [20] Scherer et al. reported that though Bhat and\nNayar\u2019s method was useful, it suffered from poor discrimination.\n\nIn [18] Mittal and Ramesh proposed an improved version of the method presented by Bhat and\nNayar. Their improvement was in a similar vein to [20], based on a modification to Kendall\u2019s tau\n[11]. The key observation made was that both Kendall\u2019s tau metric and Bhat and Nayar\u2019s metric are\nhighly sensitive to Gaussian noise. To become robust to Gaussian noise Mittal and Ramesh account\nfor actual intensity differences while only considering uncorrelated order changes. We choose to\nexplore the Hamming and Cayley distances, in part because they are naturally robust to Gaussian\nnoise, impulse noise is not a major issue for modern imaging devices, and they are computable in\nlinear time as opposed to quadratic time.\n\nRecently there has been more research on the application of ordinal correlation methods to sparse\nvisual feature description. In [22] and [13] ordinal methods were applied to SIFT descriptors. In\ncontrast to [2] and [20], the elements of the SIFT descriptor are sorted, rather than the pixel\nintensities themselves. 
Though these methods do improve the recognition performance of SIFT, they\nadd computational cost rather than reducing it.\n\n1.3 Our contributions\n\nIn this paper, we introduce LUCID, a novel approach to real-time feature description based on order\npermutations. We contrast LUCID with BRIEF, and provide a theoretical basis for understanding\nthese two methods. We prove that BRIEF is effectively a locality sensitive hashing (LSH) scheme on\nKendall\u2019s tau. It follows from this that other descriptors based on binary intensity comparisons are\ndimensionality reduction schemes on Kendall\u2019s tau. We then explore alternative distances based on\nthe observation that image patch matching can be viewed as a near duplicate recognition problem.\n\nIn the next section we describe LUCID, provide a background on permutation distances and discuss\noptimizations for an efficient implementation. Section 3 provides an analysis of BRIEF and compares\nit to LUCID. Section 4 reports on experiments that evaluate LUCID\u2019s accuracy and run time\nrelative to SURF and BRIEF.\n\n2 LUCID\n\nHere we present a new method of feature description that is surprisingly simple and effective. We\ncall our method Locally Uniform Comparison Image Descriptor or LUCID. Our descriptors implicitly\nencapsulate all possible intensity comparisons in a local area of an image. They are extremely\nefficient to compute and are compared using the generalized Hamming distance for efficient matching\n[10].\n\n2.1 Constructing a descriptor\n\nLet p1 and p2 be n \u00d7 n image patches with c color channels. We can compute descriptors\nfor both patches and the Hamming distance between them in three lines of Matlab\nas shown in Figure 1. Here desc1 and desc2 are the order permutation representations\nfor p1 and p2 respectively. Let m = cn^2; then clearly this Matlab version has\nan O(m log m) running time. However, our native implementation makes use of a stable\ncomparison-free linear time sort and thus takes O(m) time and space. Descriptor\nconstruction is depicted in Figure 1.\n\n[~, desc1] = sort(p1(:));\n[~, desc2] = sort(p2(:));\ndistance = sum(desc1 ~= desc2);\n\nFigure 1: Top: LUCID feature construction and matching method in 3 lines of Matlab. Note: ~ is\nused to ignore the first return value of sort; the second value is the order permutation. Bottom: An\nillustration of an image patch split into its RGB color channels, vectorized and then sorted, inducing a\npermutation on the indices.\n\n2.2 Permutation distances\n\nA more detailed discussion of the following is given in [16]. Recall the definition of a\npermutation: a bijective mapping of a finite set onto itself. This mapping \u03c0 is a member\nof the symmetric group Sn formed by function composition on the set of all permutations\nof n labelled objects. We write \u03c0(i) = j to denote the action of \u03c0 with\ni, j \u2208 {1, 2, ..., n}. The permutation product for \u03c01, \u03c02 \u2208 Sn is defined as function\ncomposition \u03c01\u03c02 = \u03c01 \u25e6 \u03c02, the permutation that results from first applying \u03c02 then \u03c01. Every\npermutation \u03c0 \u2208 Sn can be written as a product of disjoint cycles \u03c31, \u03c32, ..., \u03c3\u2113. Cycles are\npermutations such that \u03c3^k(i) = i for some k \u2264 n, where \u03c3^k = \u220f_{j=1}^{k} \u03c3. We will use the\nnotation #cycles(\u03c0) = \u2113 to denote the number of cycles in \u03c0.\n\nA convenient representation for a permutation \u03c0 \u2208 Sn is the n dimensional vector with the ith\ncoordinate equal to \u03c0(i); this is the permutation vector. The convex hull of the permutation vectors\nSn \u2282 Rn is the permutation polytope of Sn. This is an (n \u2212 1)-dimensional polytope with |Sn| = n!\nvertices. 
The vertices are equidistant from the centroid and lie on the surface of a circumscribed\n(n \u2212 1)-dimensional sphere. The vertices corresponding to two permutations \u03c01, \u03c02 \u2208 Sn are connected by\nan edge if they are related by a pairwise adjacent transposition. This is analogous to Kendall\u2019s tau,\ndefined to be the minimum number of pairwise adjacent transpositions between two vectors, more\nprecisely Kd(\u03c01, \u03c02) = |{(i, j) | \u03c01(i) < \u03c01(j), \u03c02(i) > \u03c02(j), 1 \u2264 i, j \u2264 n}|.\n\nThere are at least two classes of distances that can be defined between permutations [16]. Spatial\ndistances can be viewed as measuring the distance travelled along some path between two vertices\nof the permutation polytope. Examples of spatial distances are Kendall\u2019s tau, which steps along the\nedges of the polytope, the Euclidean distance, which takes the straight line path, and Spearman\u2019s\nfootrule, which takes unit steps on the circumscribed sphere of the polytope. A disorder distance\nmeasures the disorder between two permutations and ignores the spatial structure of the polytope.\nExamples of disorder distances are the generalized Hamming distance Hd(\u03c01, \u03c02) = |{i | \u03c01(i) \u2260\n\u03c02(i)}|, which is the number of elements that differ between two permutation vectors, and the Cayley\ndistance Cd(\u03c01, \u03c02) = n \u2212 #cycles(\u03c02 \u03c01^{\u22121}), which is the minimum number of unrestricted\ntranspositions between \u03c01 and \u03c02. We choose the generalized Hamming distance to relate our descriptors\nbecause it is much simpler than the Cayley distance to compute. Hamming also lends itself to SIMD\nparallel processing, unlike Cayley, which is inherently serial. 
However, if time is not a constraint,\nexperimental results show that the Cayley distance should be preferred for accuracy.\n\nDisorder distances are not sensitive to Gaussian noise, but are highly sensitive to impulse noise.\nIn contrast, Kendall\u2019s tau is confused by Gaussian noise, but is more resilient to impulse noise\n[2, 18, 20]. Impulse noise can severely corrupt these permutations since it can cause pixels in a\npatch to become maximal or minimal elements, changing each element in the permutation vector.\nIn the presence of moderate impulse noise the Cayley and Hamming distances will likely become\nmaximal while Kendall\u2019s tau would be at O(1/n) of its maximal distance. Generally, modern imaging\ndevices do not suffer from severe impulse noise, but there are other sources of impulse noise such\nas occlusions and partial shadows. LUCID is used with sparse interest points and only individual\nimage patches would be affected by impulse noise. Since impulse noise would cause the distance to\nbecome maximal, these bad matches can be reliably identified via thresholding.\n\nKendall\u2019s tau is normally used in situations where multiple independent judges are ranking sets or\nsubsets of objects, such as top-k lists, movie preferences or surveys. In these scenarios multiple\njudges are asked to rank preferences and the permutation polytope can be used as a discrete analog\nto histograms to gain valuable insight into the distribution of the judges\u2019 preferences. In the context\nof sparse image patch matching, the imaging sensor ideally acts as a single consistent judge; thus a\nsingle image patch will correspond to one vertex on the permutation polytope. Ideally, for a pair of\ncorresponding patches in different images the permutations should be identical. Thus in our scenario\nthe image sensor can be viewed as one judge comparing nearly identical objects. 
The structure of\nthe permutation polytope becomes less important in this context.\n\nSince the Cayley and Hamming distances are computed in linear time rather than quadratic time\nlike Kendall\u2019s tau, they may be better suited for fast image patch matching. In Section 3 we present\na proof demonstrating that BRIEF is a locality sensitive hashing scheme on Kendall\u2019s tau metric\nbetween vectors of pixel intensities.\n\n2.3 An efficient implementation\n\nOur choice to use the Hamming distance is inspired by the new Streaming SIMD\nExtensions (SSE) instructions. SSE is a simple way to add parallelism to native\nprograms through vector operations. In our implementation we use a 128-bit\npacked comparison which gives LUCID 16x matching parallelism for grayscale\nimage patches up to 16x16, and 8x parallelism for RGB image patches up to\n147x147. Many mobile processors have these types of instructions, but even when\nthey are not available it is still possible to gain parallelism. One additional bit per descriptor element\ncan be reserved, allowing the use of binary addition and bit masks to produce a packed Hamming\ndistance. For descriptor lengths less than 2^15, 16 bits per element are needed. This strategy supports\nRGB image patches up to 105x105 pixels and yields 4x parallelism on 64-bit processors. It is also\npossible to randomly sample a small subset of pixels before sorting to achieve greater speed. This\noperation can be interpreted as randomly projecting the descriptors into a lower dimension.\n\nDescriptor      Dimension   Construction (ms)   Matching (ms)\nLUCID-8-Gray    64          20                  240\nLUCID-16-Gray   256         30                  880\nBRIEF           256         40                  2130\nLUCID-24-RGB    1728        50                  4120\nSURF            64          450                 420\n\nTable 1: Time in milliseconds to construct 10,000 descriptors and to exhaustively match 5000x5000\ndescriptors.\n\nOrder permutations are fast to construct and access memory in sequential order. 
Since pixel intensities\nare represented with small positive integers they are ideal candidates for stable linear time\nsorting methods like counting and radix sort. These sorting algorithms access memory in linear\norder and thus with the fewest number of possible cache misses. BRIEF accesses larger portions of\nmemory than LUCID in a non-linear fashion and should incur more time consuming cache misses.\nTherefore LUCID offers a modest improvement in terms of descriptor construction time as shown\nin Table 1.\n\nWe investigate three versions of LUCID, with patch widths chosen as the first three multiples of eight:\nLUCID-24-RGB, LUCID-16-Gray, and LUCID-8-Gray, which respectively are LUCID on image patches that\nare 24x24 in RGB color, 16x16 grayscale and 8x8 grayscale. Before construction a 5x5 averaging\nblur is applied to the entire image to remove noise that may perturb the order permutation. BRIEF\nalso performs pre-smoothing; Calonder et al. reported that they found a 9x9 blurring kernel to be\n\u201cnecessary and sufficient\u201d [4].\n\nWe compare the running time of LUCID to the OpenCV implementations of SURF and BRIEF with\ndefault parameters on a 2.66GHz Intel\u00ae Core\u00ae i7.\u00b9 In Table 1 timing results for SURF, BRIEF\nand the variants of LUCID are shown. BRIEF uses 48x48 image patches and produces a descriptor\nwith 256 dimensions, which is equal to the dimension of LUCID-16-Gray. Surprisingly, LUCID-16-Gray\nis faster to match than BRIEF; this was not expected since BRIEF has the same complexity\nas LUCID to match. This might indicate that there are further optimizations that can be made for\nOpenCV\u2019s implementation.\n\n3 Understanding BRIEF and related methods\n\nIn [4] Calonder et al. propose BRIEF, an efficient binary descriptor. BRIEF is intended to be simple\nto compute and match based solely on sparse intensity comparisons. These comparisons provide for\nthe efficient construction of a compact descriptor. 
Here we discuss their method as presented in [4].\nDefine a test \u03c4\n\n\u03c4(p; x, y) := 1 if p(x) < p(y), and 0 otherwise    (1)\n\nwhere p is a square image patch and p(x) is the smoothed value of the pixel with the local coordinates\nx = (u, v)^\u22a4. This test will represent one bit in the final descriptor. To construct a BRIEF\ndescriptor a set of pre-defined pixel comparisons are performed. This pattern is a set of nd pixel coordinate\npairs (x, y) that should be compared in each image patch. A descriptor is then defined to be\nthe nd dimensional bitstring f_nd(p) := \u2211_{1\u2264i\u2264nd} 2^{i\u22121} \u03c4(p; xi, yi). Calonder et al. suggest that\nintuitively these pairwise intensity comparisons capture the signs of intensity gradients. However, this\nis not precise and in the next section we prove that the reason BRIEF works is that it inadvertently\napproximates Kendall\u2019s tau.\n\n3.1 BRIEF is LSH on Kendall\u2019s Tau\n\nConsider a version of BRIEF where the pixel sampling pattern consists of all (m choose 2) pairs of pixels.\nThen the Hamming distance between two of these BRIEF descriptors is equivalent to the Kendall\u2019s\ntau distance between the pixel intensities of the vectorized image patches. The original formulation\nof BRIEF is LSH on the normalized Kendall\u2019s tau correlation metric.\n\nProof. Let p1, p2 be m dimensional vectorized image patches. Define Bk(i, j) := I(pk(i) < pk(j)),\nwhere I is the indicator function. For image patches containing m pixels, BRIEF chooses a pattern of\npairs P \u2286 {(i, j) | 1 \u2264 i < j \u2264 m}, and for two vectorized image patches p1, p2, it returns the score\n\u2211_{(i,j)\u2208P} I(B1(i, j) \u2260 B2(i, j)). When P = {(i, j) | 1 \u2264 i < j \u2264 m}, this is precisely Kd(p1, p2). 
It can\nbe shown that BRIEF satisfies the definition of LSH as defined in [6]. Consider a random pair (i, j) with i < j.\nThen\n\nP[B1(i, j) \u2260 B2(i, j)] = (1 / (m choose 2)) \u2211_{i\u2032<j\u2032} I(B1(i\u2032, j\u2032) \u2260 B2(i\u2032, j\u2032)) = KdN(p1, p2).