{"title": "Locality-sensitive binary codes from shift-invariant kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1509, "page_last": 1517, "abstract": "This paper addresses the problem of designing binary codes for high-dimensional data such that vectors that are similar in the original space map to similar binary strings. We introduce a simple distribution-free encoding scheme based on random projections, such that the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel (e.g., a Gaussian kernel) between the vectors. We present a full theoretical analysis of the convergence properties of the proposed scheme, and report favorable experimental performance as compared to a recent state-of-the-art method, spectral hashing.", "full_text": "Locality-Sensitive Binary Codes\n\nfrom Shift-Invariant Kernels\n\nMaxim Raginsky\nDuke University\n\nDurham, NC 27708\nm.raginsky@duke.edu\n\nSvetlana Lazebnik\nUNC Chapel Hill\n\nChapel Hill, NC 27599\nlazebnik@cs.unc.edu\n\nAbstract\n\nThis paper addresses the problem of designing binary codes for high-dimensional\ndata such that vectors that are similar in the original space map to similar bi-\nnary strings. We introduce a simple distribution-free encoding scheme based on\nrandom projections, such that the expected Hamming distance between the bi-\nnary codes of two vectors is related to the value of a shift-invariant kernel (e.g., a\nGaussian kernel) between the vectors. We present a full theoretical analysis of the\nconvergence properties of the proposed scheme, and report favorable experimental\nperformance as compared to a recent state-of-the-art method, spectral hashing.\n\n1 Introduction\n\nRecently, there has been a lot of interest in the problem of designing compact binary codes for\nreducing storage requirements and accelerating search and retrieval in large collections of high-\ndimensional vector data [11, 13, 15]. 
A desirable property of such coding schemes is that they should map similar data points to similar binary strings, i.e., strings with a low Hamming distance. Hamming distances can be computed very efficiently in hardware, resulting in very fast retrieval of strings similar to a given query, even for brute-force search in a database consisting of millions of data points [11, 13]. Moreover, if code strings can be effectively used as hash keys, then similarity searches can be carried out in sublinear time. In some existing schemes, e.g. [11, 13], the notion of similarity between data points comes from supervisory information, e.g., two documents are similar if they focus on the same topic or two images are similar if they contain the same objects. The binary encoder is then trained to reproduce this “semantic” similarity measure. In this paper, we are more interested in unsupervised schemes, where the similarity is given by Euclidean distance or by a kernel defined on the original feature space. Weiss et al. [15] have recently proposed a spectral hashing approach motivated by the idea that a good encoding scheme should minimize the sum of Hamming distances between pairs of code strings weighted by the value of a Gaussian kernel between the corresponding feature vectors. With appropriate heuristic simplifications, this objective can be shown to yield a very efficient encoding rule, where each bit of the code is given by the sign of a sine function applied to a one-dimensional projection of the feature vector. Spectral hashing shows promising experimental results, but its behavior is not easy to characterize theoretically. 
In particular, it is not clear whether the Hamming distance between spectral hashing code strings converges to any function of the Euclidean distance or the kernel value between the original vectors as the number of bits in the code increases.\n\nIn this paper, we propose a coding method that is similar to spectral hashing computationally, but is derived from completely different considerations, is amenable to full theoretical analysis, and shows better practical behavior as a function of code size. We start with a low-dimensional mapping of the original data that is guaranteed to preserve the value of a shift-invariant kernel (specifically, the random Fourier features of Rahimi and Recht [8]), and convert this mapping to a binary one with similar guarantees. In particular, we show that the normalized Hamming distance (i.e., Hamming distance divided by the number of bits in the code) between any two embedded points sharply concentrates around a well-defined continuous function of the kernel value. This leads to a Johnson–Lindenstrauss type result [4] which says that a set of any N points in a Euclidean feature space can be embedded in a binary cube of dimension O(log N) in a similarity-preserving way: with high probability, the binary encodings of any two points that are similar (as measured by the kernel) are nearly identical, while those of any two points that are dissimilar differ in a constant fraction of their bits. Using entropy bounds from the theory of empirical processes, we also prove a stronger result of this type that holds for any compact domain of R^D, provided the number of bits is proportional to the intrinsic dimension of the domain. Our scheme is completely distribution-free with respect to the data: its structure depends only on the underlying kernel. 
In this, it is similar to locality-sensitive hashing (LSH) [1], which is a family of methods for deriving low-dimensional discrete representations of the data for sublinear near-neighbor search. However, our scheme differs from LSH in that we obtain both upper and lower bounds on the normalized Hamming distance between any two embedded points, while in LSH the goal is only to preserve nearest neighbors (see [6] for further discussion of the distinction between LSH and more general similarity-preserving embeddings). To the best of our knowledge, our scheme is among the first random projection methods for constructing a similarity-preserving embedding into a binary cube. In addition to presenting a thorough theoretical analysis, we have evaluated our approach on both synthetic and real data (images from the LabelMe database [10] represented by high-dimensional GIST descriptors [7]) and compared its performance to that of spectral hashing. Despite the simplicity and distribution-free nature of our scheme, we have been able to obtain very encouraging experimental results.\n\n2 Binary codes for shift-invariant kernels\n\nConsider a Mercer kernel K(·,·) on R^D that satisfies the following for all points x, y ∈ R^D:\n\n(K1) It is translation-invariant (or shift-invariant), i.e., K(x, y) = K(x - y).\n(K2) It is normalized, i.e., K(x - y) ≤ 1 and K(x - x) ≡ K(0) = 1.\n(K3) For any real number α ≥ 1, K(αx - αy) ≤ K(x - y).\n\nThe Gaussian kernel K(x, y) = exp(-γ‖x - y‖²/2) and the Laplacian kernel K(x, y) = exp(-γ‖x - y‖_1) are two well-known examples. We would like to construct an embedding F^n of R^D into the binary cube {0, 1}^n such that for any pair x, y the normalized Hamming distance\n\n(1/n) d_H(F^n(x), F^n(y)) := (1/n) Σ_{i=1}^{n} 1{F_i(x) ≠ F_i(y)}\n\nbetween F^n(x) = (F_1(x), . . . , F_n(x)) and F^n(y) = (F_1(y), . . .
, F_n(y)) behaves like\n\nh_1(K(x - y)) ≤ (1/n) d_H(F^n(x), F^n(y)) ≤ h_2(K(x - y)),\n\nwhere h_1, h_2 : [0, 1] → R_+ are continuous decreasing functions with h_1(1) = h_2(1) = 0 and h_1(0) = h_2(0) = c > 0. In other words, we would like to map D-dimensional real vectors into n-bit binary strings in a locality-sensitive manner, where the notion of locality is induced by the kernel K. We will achieve this goal by drawing F^n appropriately at random.\n\nRandom Fourier features. Recently, Rahimi and Recht [8] gave a scheme that takes a Mercer kernel satisfying (K1) and (K2) and produces a random mapping Φ^n : R^D → R^n such that, with high probability, the inner product of any two transformed points approximates the kernel: Φ^n(x)·Φ^n(y) ≈ K(x - y) for all x, y. Their scheme exploits Bochner's theorem [9], a fundamental result in harmonic analysis which says that any such K is the Fourier transform of a uniquely defined probability measure P_K on R^D. They define the random Fourier features (RFF) via\n\nΦ_{ω,b}(x) := √2 cos(ω · x + b),    (1)\n\nwhere ω ~ P_K and b ~ Unif[0, 2π]. For example, for the Gaussian kernel K(s) = e^{-γ‖s‖²/2}, we take ω ~ Normal(0, γI_{D×D}). With these features, we have E[Φ_{ω,b}(x)Φ_{ω,b}(y)] = K(x - y). The scheme of [8] is as follows: draw an i.i.d. sample ((ω_1, b_1), . . . , (ω_n, b_n)), where each ω_i ~ P_K and b_i ~ Unif[0, 2π], and define a mapping Φ^n : R^D → R^n via Φ^n(x) := (1/√n)(Φ_{ω_1,b_1}(x), . . . , Φ_{ω_n,b_n}(x)) for x ∈ X. Then E[Φ^n(x) · Φ^n(y)] = K(x - y) for all x, y.\n\nFrom random Fourier features to random binary codes. We will compose the RFFs with random binary quantizers. 
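The RFF construction above can be sketched numerically. The following is a minimal illustration for the Gaussian kernel, not code from the paper; the function name `rff` and all parameter choices (NumPy, n = 20000 features, test points) are ours:

```python
import numpy as np

def rff(X, n, gamma, rng):
    """Random Fourier features for the Gaussian kernel
    K(x, y) = exp(-gamma * ||x - y||^2 / 2).

    Each feature is sqrt(2) * cos(w . x + b) with w ~ Normal(0, gamma * I)
    and b ~ Unif[0, 2*pi]; the 1/sqrt(n) scaling makes the inner product of
    two feature vectors an unbiased estimate of the kernel value.
    """
    D = X.shape[1]
    W = rng.normal(0.0, np.sqrt(gamma), size=(n, D))   # rows are the w_i
    b = rng.uniform(0.0, 2.0 * np.pi, size=n)          # b_i ~ Unif[0, 2*pi]
    return np.sqrt(2.0 / n) * np.cos(X @ W.T + b)      # shape (len(X), n)

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [1.0, 0.0]])   # two test points in D = 2 dimensions
Phi = rff(X, 20000, 1.0, rng)
approx = Phi[0] @ Phi[1]                  # estimates K(x - y) = exp(-1/2)
```

With this many features, `approx` is typically within a few hundredths of K(x - y) = e^{-1/2} ≈ 0.607, consistent with the O(1/√n) Monte Carlo error of the estimate.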
Draw a random threshold t ~ Unif[-1, 1] and define the quantizer Q_t : [-1, 1] → {-1, +1} via Q_t(u) := sgn(u + t), where we let sgn(u) = -1 if u < 0 and sgn(u) = +1 if u ≥ 0. We note the following basic fact (we omit the easy proof):\n\nLemma 2.1 For any u, v ∈ [-1, 1], P_t{Q_t(u) ≠ Q_t(v)} = |u - v|/2.\n\nNow, given a kernel K, we define a random map F_{t,ω,b} : R^D → {0, 1} through\n\nF_{t,ω,b}(x) := (1/2)[1 + Q_t(cos(ω · x + b))],    (2)\n\nwhere t ~ Unif[-1, 1], ω ~ P_K, and b ~ Unif[0, 2π] are independent of one another. From now on, we will often omit the subscripts t, ω, b and just write F for the sake of brevity. We have:\n\nLemma 2.2\n\nE 1{F(x) ≠ F(y)} = h_K(x - y) := (8/π²) Σ_{m=1}^∞ [1 - K(mx - my)] / (4m² - 1),    ∀x, y    (3)\n\nProof: Using Lemma 2.1, we can show E 1{F(x) ≠ F(y)} = (1/2) E_{ω,b} |cos(ω · x + b) - cos(ω · y + b)|. Using trigonometric identities and the independence of ω and b, we can express this expectation as\n\nE_{ω,b} |cos(ω · x + b) - cos(ω · y + b)| = (4/π) E_ω |sin(ω · (x - y)/2)|.\n\nWe now make use of the Fourier series representation of the full rectified sine wave g(τ) = |sin(τ)|:\n\ng(τ) = 2/π + (4/π) Σ_{m=1}^∞ cos(2mτ)/(1 - 4m²) = (4/π) Σ_{m=1}^∞ [1 - cos(2mτ)] / (4m² - 1).\n\nUsing this together with the fact that E_ω cos(ω · s) = K(s) for any s ∈ R^D [8], we obtain (3). □\n\nLemma 2.2 shows that the probability that F(x) ≠ F(y) is a well-defined continuous function of x - y. 
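The encoding (2) and the limit function h_K of Lemma 2.2 admit a similarly short numerical sketch. This is a hedged illustration for the Gaussian kernel, where K(m(x - y)) = K(x - y)^{m²}; the names `binary_code` and `h_K`, the 200-term truncation of the series, and the test points are our choices, not the paper's:

```python
import numpy as np

def binary_code(X, n_bits, gamma, rng):
    """Random binary codes as in Eq. (2), for the Gaussian kernel
    K(s) = exp(-gamma * ||s||^2 / 2): each bit is
    (1 + sgn(cos(w . x + b) + t)) / 2, with w ~ Normal(0, gamma * I),
    b ~ Unif[0, 2*pi], and random threshold t ~ Unif[-1, 1]."""
    D = X.shape[1]
    W = rng.normal(0.0, np.sqrt(gamma), size=(n_bits, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_bits)
    t = rng.uniform(-1.0, 1.0, size=n_bits)
    return (np.cos(X @ W.T + b) + t >= 0).astype(np.uint8)

def h_K(k, terms=200):
    """Truncation of the series (3) for the Gaussian kernel, using
    K(m(x - y)) = K(x - y)**(m**2); the argument k is the kernel value."""
    m = np.arange(1, terms + 1)
    return (8.0 / np.pi**2) * np.sum((1.0 - k ** (m**2)) / (4.0 * m**2 - 1.0))

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [1.5, 0.0]])
k = np.exp(-0.5 * 1.5**2)              # Gaussian kernel value for this pair
codes = binary_code(X, 20000, 1.0, rng)
ham = np.mean(codes[0] != codes[1])    # normalized Hamming distance
# ham concentrates around h_K(k) as the number of bits grows
```

Note that h_K(1) = 0 and h_K(0) ≈ 4/π² ≈ 0.405, the constant appearing in Lemma 2.3 below.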
The infinite series in (3) can, of course, be computed numerically to any desired precision. In addition, we have the following upper and lower bounds solely in terms of the kernel value K(x - y):\n\nLemma 2.3 Define the functions\n\nh_1(u) := (4/π²)(1 - u)    and    h_2(u) := min{ (1/2)√(1 - u), (4/π²)(1 - 2u/3) },\n\nwhere u ∈ [0, 1]. Note that h_1(0) = h_2(0) = 4/π² ≈ 0.405 and that h_1(1) = h_2(1) = 0. Then h_1(K(x - y)) ≤ h_K(x - y) ≤ h_2(K(x - y)) for all x, y.\n\nProof: Let Δ := cos(ω · x + b) - cos(ω · y + b). Then E|Δ| = E√(Δ²) ≤ √(E Δ²) (the last step uses concavity of the square root). Using the properties of the RFF, E Δ² = (1/2) E[(Φ_{ω,b}(x) - Φ_{ω,b}(y))²] = 1 - K(x - y). Therefore, E 1{F(x) ≠ F(y)} = (1/2) E|Δ| ≤ (1/2)√(1 - K(x - y)). We also have\n\nE 1{F(x) ≠ F(y)} = 4/π² - (8/π²) Σ_{m=1}^∞ K(mx - my)/(4m² - 1) ≤ 4/π² - (8/(3π²)) K(x - y) = (4/π²)(1 - 2K(x - y)/3).\n\nThis proves the upper bound in the lemma. On the other hand, since K satisfies (K3),\n\nh_K(x - y) ≥ (1 - K(x - y)) · (8/π²) Σ_{m=1}^∞ 1/(4m² - 1) = (4/π²)(1 - K(x - y)),\n\nbecause the m-th term of the series in (3) is not smaller than (1 - K(x - y))/(4m² - 1). □\n\nFig. 1 shows a comparison of the kernel approximation properties of the RFFs [8] with our scheme for the Gaussian kernel.\n\nFigure 1: (a) Approximating the Gaussian kernel by random features (green) and random signs (red). (b) Relationship of normalized Hamming distance between random signs to functions of the kernel. 
The scatter plots in (a) and (b) are obtained from a synthetic set of 500 uniformly distributed 2D points with n = 5000. (c) Bounds for normalized Hamming distance in Lemmas 2.2 and 2.3 vs. the Euclidean distance.\n\nNow we concatenate several mappings of the form F_{t,ω,b} to construct an embedding of X into the binary cube {0, 1}^n. Specifically, we draw n i.i.d. triples (t_1, ω_1, b_1), . . . , (t_n, ω_n, b_n) and define\n\nF^n(x) := (F_1(x), . . . , F_n(x)),    where F_i(x) ≡ F_{t_i,ω_i,b_i}(x), i = 1, . . . , n.\n\nAs we will show next, this construction ensures that, for any two points x and y, the fraction of the bits where the binary strings F^n(x) and F^n(y) disagree sharply concentrates around h_K(x - y), provided n is large enough. Using the results proved above, we conclude that, for any two points x and y that are “similar,” i.e., K(x - y) ~ 1, most of the bits of F^n(x) and F^n(y) will agree, whereas for any two points x and y that are “dissimilar,” i.e., K(x - y) ~ 0, F^n(x) and F^n(y) will disagree in about 40% or more of their bits.\n\nAnalysis of performance. We first prove a Johnson–Lindenstrauss type result which says that, for any finite subset of R^D, the normalized Hamming distance respects the similarities between points. It should be pointed out that the analogy with Johnson–Lindenstrauss is only qualitative: our embedding is highly nonlinear, in contrast to the random linear projections used there [4], and the resulting distortion of the neighborhood structure, although controllable, does not amount to a mere rescaling by constants.\n\nTheorem 2.4 Fix ε, δ ∈ (0, 1). For any finite data set D = {x_1, . . .
, x_N} ⊂ R^D, F^n is such that\n\nh_K(x_j - x_k) - δ ≤ (1/n) d_H(F^n(x_j), F^n(x_k)) ≤ h_K(x_j - x_k) + δ    (4)\nh_1(K(x_j - x_k)) - δ ≤ (1/n) d_H(F^n(x_j), F^n(x_k)) ≤ h_2(K(x_j - x_k)) + δ    (5)\n\nfor all j, k with probability ≥ 1 - N²e^{-2nδ²}. Moreover, the events (4) and (5) will hold with probability ≥ 1 - ε if n ≥ (1/(2δ²)) log(N²/ε). Thus, any N-point subset of R^D can be embedded, with high probability, into the binary cube of dimension O(log N) in a similarity-preserving way.\n\nThe proof (omitted) is by a standard argument using Hoeffding's inequality and the union bound, as well as the bounds of Lemma 2.3. We also prove a much stronger result: any compact subset X ⊂ R^D can be embedded into a binary cube whose dimension depends only on the intrinsic dimension and the diameter of X and on the second moment of P_K, such that the normalized Hamming distance behaves in a similarity-preserving way for all pairs of points in X simultaneously. We make use of the following [5]:\n\nDefinition 2.5 The Assouad dimension of X ⊂ R^D, denoted by d_X, is the smallest integer k such that, for any ball B ⊂ R^D, the set B ∩ X can be covered by 2^k balls of half the radius of B.\n\nThe Assouad dimension is a widely used measure of the intrinsic dimension [2, 6, 3]. For example, if X is an ℓ_p ball in R^D, then d_X = O(D); if X is a d-dimensional hyperplane in R^D, then d_X = O(d) [2]. Moreover, if X is a d-dimensional Riemannian submanifold of R^D with a suitably bounded curvature, then d_X = O(d) [3]. We now have the following result:\n\nTheorem 2.6 Suppose that the kernel K is such that L_K := √(E_{ω~P_K} ‖ω‖²) < +∞. Then there exists a constant C > 0, independent of D and K, such that the following holds. Fix any ε, δ > 0. If\n\nn ≥ max{ C L_K d_X diam(X) / δ²,  (2/δ²) log(2/ε) },\n\nthen, with probability at least 1 - ε, the mapping F^n is such that, for every pair x, y ∈ X,\n\nh_K(x - y) - δ ≤ (1/n) d_H(F^n(x), F^n(y)) ≤ h_K(x - y) + δ.    (6)\n\nProof: For every pair x, y ∈ X, let A_{x,y} be the set of all θ ≡ (t, ω, b) such that F_{t,ω,b}(x) ≠ F_{t,ω,b}(y), and let A = {A_{x,y} : x, y ∈ X}. Then we can write\n\n(1/n) d_H(F^n(x), F^n(y)) = (1/n) Σ_{i=1}^n 1{θ_i ∈ A_{x,y}}.\n\nFor any sequence θ^n = (θ_1, . . . , θ_n), define the uniform deviation\n\nΔ(θ^n) := sup_{x,y∈X} | (1/n) Σ_{i=1}^n 1{θ_i ∈ A_{x,y}} - E 1{F_{t,ω,b}(x) ≠ F_{t,ω,b}(y)} |.\n\nFor every 1 ≤ i ≤ n and an arbitrary θ'_i, let θ^n_{(i)} denote θ^n with the i-th component replaced by θ'_i. Then |Δ(θ^n) - Δ(θ^n_{(i)})| ≤ 1/n for any i and any θ'_i. Hence, by McDiarmid's inequality,\n\nP{ |Δ(θ^n) - E_{θ^n} Δ(θ^n)| > β } ≤ 2e^{-2nβ²},    ∀β > 0.    (7)\n\nNow we need to bound E_{θ^n} Δ(θ^n). Using a standard symmetrization technique [14], we can write\n\nE_{θ^n} Δ(θ^n) ≤ 2R(A) := 2 E_{θ^n,σ^n} [ sup_{x,y∈X} | (1/n) Σ_{i=1}^n σ_i 1{θ_i ∈ A_{x,y}} | ],    (8)\n\nwhere σ^n = (σ_1, . . . , σ_n) is an i.i.d. 
Rademacher sequence, P{σ_i = -1} = P{σ_i = +1} = 1/2. The quantity R(A) can be bounded by the Dudley entropy integral [14]\n\nR(A) ≤ (C_0/√n) ∫_0^∞ √( log N(ε, A, ‖·‖_{L_2(μ)}) ) dε,    (9)\n\nwhere C_0 > 0 is a universal constant, and N(ε, A, ‖·‖_{L_2(μ)}) is the ε-covering number of the function class {θ ↦ 1{θ∈A} : A ∈ A} with respect to the L_2(μ) norm, where μ is the distribution of θ ≡ (t, ω, b). We will bound these covering numbers by the covering numbers of X with respect to the Euclidean norm on R^D. It can be shown that, for any four points x, x', y, y' ∈ X,\n\n‖1_{A_{x,y}} - 1_{A_{x',y'}}‖²_{L_2(μ)} = ∫ (1{θ ∈ A_{x,y}} - 1{θ ∈ A_{x',y'}})² dμ(θ) ≤ μ(B_x △ B_{x'}) + μ(B_y △ B_{y'}),\n\nwhere △ denotes the symmetric difference of sets, and B_x := {(t, ω, b) : Q_t(cos(ω · x + b)) = +1} (details omitted for lack of space). Now,\n\n2μ(B_x △ B_{x'}) = 2 E_{ω,b}[ P_t{ Q_t(cos(ω · x + b)) ≠ Q_t(cos(ω · x' + b)) } ] = E_{ω,b} |cos(ω · x + b) - cos(ω · x' + b)| ≤ E_ω |ω · (x - x')| ≤ L_K ‖x - x'‖.\n\nThen μ(B_x △ B_{x'}) + μ(B_y △ B_{y'}) ≤ (L_K/2)(‖x - x'‖ + ‖y - y'‖). This implies that N(ε, A, ‖·‖_{L_2(μ)}) ≤ N(ε²/L_K, X, ‖·‖)², where N(δ, X, ‖·‖) are the covering numbers of X w.r.t. the Euclidean norm ‖·‖. By definition of the Assouad dimension, N(δ, X, ‖·‖) ≤ (2 diam(X)/δ)^{d_X}, so N(ε, A, ‖·‖_{L_2(μ)}) ≤ (2 L_K diam(X)/ε²)^{2d_X}. We can now estimate the integral in (9) by\n\nR(A) ≤ C_1 √( L_K d_X diam(X) / n )    (10)\n\nfor some constant C_1 > 0. From (10) and (8), we obtain E_{θ^n} Δ(θ^n) ≤ C_2 √( L_K d_X diam(X) / n ), where C_2 = 2C_1. Using this and (7) with β = δ/2, we obtain (6) with C = 16C_2². □\n\nFor example, with the Gaussian kernel K(s) = e^{-γ‖s‖²/2} on R^D, we have L_K = √(Dγ). The kernel bandwidth γ is often chosen as γ ∝ 1/[D(diam X)²] (see, e.g., [12, Sec. 7.8]); with this setting, the number of bits needed to guarantee the bound (6) is n = Ω((d_X/δ²) log(1/ε)). It is possible, in principle, to construct a dimension-reducing embedding of X into a binary cube, provided the number of bits in the embedding is larger than the intrinsic dimension of X.\n\nFigure 2: Synthetic results. First row: scatter plots of normalized Hamming distance vs. Euclidean distance for our method (a) and spectral hashing (b) with code size 32 bits. Green indicates pairs of data points that are considered true “neighbors” for the purpose of retrieval. Second row: scatter plots for our method (c) and spectral hashing (d) with code size 512 bits. Third row: recall-precision plots for our method (e) and spectral hashing (f) for code sizes from 8 to 512 bits (best viewed in color).\n\n3 Empirical Evaluation\n\nIn this section, we present the results of our scheme with a Gaussian kernel, and compare our performance to spectral hashing [15].¹ Spectral hashing is a recently introduced, state-of-the-art approach that has been reported to obtain better results than several other well-known methods, including LSH [1] and restricted Boltzmann machines [11]. 
Unlike our method, spectral hashing chooses code parameters in a deterministic, data-dependent way, motivated by results on the convergence of eigenvectors of graph Laplacians to Laplacian eigenfunctions on manifolds. Though spectral hashing is derived from completely different considerations than our method, its encoding scheme is similar to ours in terms of basic computation. Namely, each bit of a spectral hashing code is given by sgn(cos(k ω · x)), where ω is a principal direction of the data (instead of a randomly sampled direction, as in our method) and k is a weight that is deterministically chosen according to the analytical form of certain kinds of Laplacian eigenfunctions. The structural similarity between spectral hashing and our method makes comparison between them appropriate.\n\n¹We use the code made available by the authors of [15] at http://www.cs.huji.ac.il/~yweiss/SpectralHashing/.\n\nFigure 3: Recall-precision curves for the LabelMe database for our method (left) and for spectral hashing (right). Best viewed in color.\n\nTo demonstrate the basic behavior of our method, we first report results for two-dimensional synthetic data using a protocol similar to [15] (we have also conducted tests on higher-dimensional synthetic data, with very similar results). We sample 10,000 “database” and 1,000 “query” points from a uniform distribution defined on a 2D rectangle with aspect ratio 0.5. To distinguish true positives from false positives for evaluating retrieval performance, we select a “nominal” neighborhood radius so that each query point on average has 50 neighbors in the database. Next, we rescale the data so that this radius is 1, and set the bandwidth of the kernel to γ = 1. Fig. 2 (a,c) shows scatter plots of normalized Hamming distance vs. 
Euclidean distance for each query point paired with each database point for 32-bit and 512-bit codes. As more bits are added to our code, the variance of the scatter plots decreases, and the points cluster tighter around the theoretically expected curve (Eq. (3), Fig. 1). The scatter plots for spectral hashing are shown in Fig. 2 (b,d). As the number of bits in the spectral hashing code is increased, the normalized Hamming distance does not appear to converge to any clear function of the Euclidean distance. Because the derivation of spectral hashing in [15] includes several heuristic steps, the behavior of the resulting scheme appears to be difficult to analyze, and shows some undesirable effects as the code size increases. Figure 2 (e,f) compares recall-precision curves for both methods using a range of code sizes. Since the normalized Hamming distance for our method converges to a monotonic function of the Euclidean distance, its performance keeps improving as a function of code size. On the other hand, spectral hashing starts out with promising performance for very short codes (up to 32 bits), but then deteriorates for higher numbers of bits.\n\nNext, we present retrieval results for 14,871 images taken from the LabelMe database [10]. The images are represented by 320-dimensional GIST descriptors [7], which have proven to be effective at capturing perceptual similarity between scenes. For this experiment, we randomly select 1,000 images to serve as queries, and the rest make up the “database.” As with the synthetic experiments, a nominal threshold of the average distance to the 50th nearest neighbor is used to determine whether a database point returned for a given query is considered a true positive. Figure 3 shows precision-recall curves for code sizes ranging from 16 bits to 1024 bits. 
As in the synthetic experiments, spectral hashing appears to have an advantage over our method for extremely small code sizes, up to about 32 bits. However, this low-bit regime may not be very useful in practice, since below 32 bits, neither method achieves performance levels that would be satisfactory for real-world applications. For larger code sizes, our method begins to dominate. For example, with a 128-bit code (which is equivalent to just two double-precision floating point numbers), our scheme achieves 0.8 precision at 0.2 recall, whereas spectral hashing only achieves about 0.5 precision at the same recall. Moreover, the performance of spectral hashing actually begins to decrease for code sizes above 256 bits. Finally, Figure 4 shows retrieval results for our method on a couple of representative query images.\n\nFigure 4: Examples of retrieval for two query images on the LabelMe database. The left column shows the top 48 neighbors for each query according to Euclidean distance (the query image is in the top left of the collage). The middle (resp. right) column shows nearest neighbors according to normalized Hamming distance with a 32-bit (resp. 512-bit) code. The precision of retrieval is evaluated as the proportion of top Hamming neighbors that are also Euclidean neighbors within the “nominal” radius (panel precisions: 0.81, 1.00, 0.38, 0.96). Incorrectly retrieved images in the middle and right columns are shown with a red border. Best viewed in color.\n\nIn addition to being completely distribution-free and exhibiting more desirable behavior as a function of code size, our scheme has one more practical advantage. 
Unlike spectral hashing, we retain the kernel bandwidth γ as a “free parameter,” which gives us flexibility in terms of adapting to a target neighborhood size, or setting a target Hamming distance for neighbors at a given Euclidean distance. This can be especially useful for making sure that a significant fraction of neighbors for each query are mapped to strings whose Hamming distance from the query is no greater than 2. This is a necessary condition for being able to use binary codes for hashing as opposed to brute-force search (although, as demonstrated in [11, 13], even brute-force search with binary codes can already be quite fast). To ensure high recall within a low Hamming radius, we can progressively increase the kernel bandwidth γ as the code size increases, thus counteracting the increase in unnormalized Hamming distance that inevitably accompanies larger code sizes. Preliminary results (omitted for lack of space) show that this strategy can indeed increase recall for low Hamming radius while sacrificing some precision. In the future, we will evaluate this tradeoff more extensively, and test our method on datasets consisting of millions of data points. At present, our promising initial results, combined with our comprehensive theoretical analysis, convincingly demonstrate the potential usefulness of our scheme for large-scale indexing and search applications.\n\nAcknowledgments\n\nThis work was supported by NSF CAREER Award No. IIS 0845629.\n\nReferences\n\n[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117-122, 2008.\n[2] K. Clarkson. Nearest-neighbor searching and metric space dimensions. In Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pages 15-59. MIT Press, 2006.\n[3] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In STOC, 2008.\n[4] S. 
Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Alg., 22(1):60-65, 2003.\n[5] J. Heinonen. Lectures on Analysis on Metric Spaces. Springer, New York, 2001.\n[6] P. Indyk and A. Naor. Nearest-neighbor-preserving embeddings. ACM Trans. Algorithms, 3(3):Art. 31, 2007.\n[7] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Computer Vision, 42(3):145-175, 2001.\n[8] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.\n[9] M. Reed and B. Simon. Methods of Modern Mathematical Physics II: Fourier Analysis, Self-Adjointness. Academic Press, 1975.\n[10] B. Russell, A. Torralba, K. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. Int. J. Computer Vision, 77:157-173, 2008.\n[11] R. Salakhutdinov and G. Hinton. Semantic hashing. In SIGIR Workshop on Inf. Retrieval and App. of Graphical Models, 2007.\n[12] B. Schölkopf and A. J. Smola. Learning With Kernels. MIT Press, 2002.\n[13] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large databases for recognition. In CVPR, 2008.\n[14] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.\n[15] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.", "award": [], "sourceid": 146, "authors": [{"given_name": "Maxim", "family_name": "Raginsky", "institution": null}, {"given_name": "Svetlana", "family_name": "Lazebnik", "institution": null}]}