{"title": "Super-Bit Locality-Sensitive Hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 108, "page_last": 116, "abstract": "Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method which provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimation. In this work, we propose the Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement, which orthogonalizes the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within $(0,\\pi/2]$. The extensive experiments on real data well validate that given the same length of binary code, SBLSH may achieve significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH shows the superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments.", "full_text": "Super-Bit Locality-Sensitive Hashing\n\nJianqiu Ji\u21e4, Jianmin Li\u21e4, Shuicheng Yan\u2020, Bo Zhang\u21e4, Qi Tian\u2021\n\n\u21e4State Key Laboratory of Intelligent Technology and Systems,\n\nTsinghua National Laboratory for Information Science and Technology (TNList),\n\nDepartment of Computer Science and Technology,\n\nTsinghua University, Beijing 100084, China\njijq10@mails.tsinghua.edu.cn,\n\n{lijianmin, dcszb}@mail.tsinghua.edu.cn\n\u2020Department of Electrical and Computer Engineering,\nNational University of Singapore, Singapore, 117576\n\neleyans@nus.edu.sg\n\n\u2021Department of Computer Science, University of Texas at San Antonio,\n\nOne UTSA Circle, University of Texas at San Antonio, San Antonio, TX 78249-1644\n\nqitian@cs.utsa.edu\n\nAbstract\n\nSign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic\ndimension reduction method which provides an unbiased estimate of angular sim-\nilarity, 
yet suffers from the large variance of its estimation. In this work, we propose Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement: it orthogonalizes the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within (0, π/2]. Extensive experiments on real data validate that, given the same length of binary code, SBLSH may achieve significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH shows superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments.\n\n1 Introduction\n\nLocality-sensitive hashing (LSH) methods aim to hash similar data samples to the same hash code with high probability [7, 9]. There exist various kinds of LSH for approximating different distances or similarities, e.g., bit-sampling LSH [9, 7] for Hamming distance and ℓ1-distance, and min-hash [2, 5] for the Jaccard coefficient. Among them are some binary LSH schemes, which generate binary codes. Binary LSH approximates a certain distance or similarity of two data samples by computing the Hamming distance between the corresponding compact binary codes. Since computing a Hamming distance involves mainly bitwise operations, it is much faster than directly computing other distances, e.g. Euclidean or cosine, which require many arithmetic operations. On the other hand, storage is substantially reduced due to the use of compact binary codes. In large-scale applications [22, 11, 5, 17], e.g. 
near-duplicate image detection, object and scene recognition, etc., we are often confronted with intensive computation of distances or similarities between samples, and binary LSH may then act as a scalable solution.\n\n1.1 Locality-Sensitive Hashing for Angular Similarity\n\nFor many data representations, the natural pairwise similarity is only related to the angle between the data, e.g., the normalized bag-of-words representation for documents, images, and videos, and normalized histogram-based local features like SIFT [20]. In these cases, angular similarity can serve as a similarity measurement, defined as sim(a, b) = 1 − cos⁻¹(⟨a, b⟩/(‖a‖‖b‖))/π. Here ⟨a, b⟩ denotes the inner product of a and b, and ‖·‖ denotes the ℓ2-norm of a vector.\n\nOne popular LSH for approximating angular similarity is sign-random-projection LSH (SRP-LSH) [3], a binary LSH method which provides an unbiased estimate of angular similarity. Formally, in a d-dimensional data space, let v denote a random vector sampled from the normal distribution N(0, Id), and let x denote a data sample; then an SRP-LSH function is defined as hv(x) = sgn(vᵀx), where the sign function sgn(·) is defined as sgn(z) = 1 if z ≥ 0, and sgn(z) = 0 if z < 0.\n\nGiven two data samples a, b, let θa,b = cos⁻¹(⟨a, b⟩/(‖a‖‖b‖)); then it can be proven that [8]\n\nPr(hv(a) ≠ hv(b)) = θa,b/π.\n\nThis property explains the essence of locality-sensitivity, and also reveals the relation between Hamming distance and angular similarity.\nBy independently sampling K d-dimensional vectors v1, ..., vK from the normal distribution N(0, Id), we may define a function h(x) = (hv1(x), hv2(x), ..., hvK(x)), which consists of K SRP-LSH functions and thus produces K-bit codes. 
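The SRP-LSH construction above can be sketched numerically (a minimal NumPy illustration; the dimension, code length K, and test angle are our own illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d, K = 64, 4096                      # data dimension and code length (illustrative)
V = rng.standard_normal((K, d))      # K random vectors v_i ~ N(0, I_d), one per row

def srp_code(x):
    """K-bit SRP-LSH code: the i-th bit is sgn(v_i^T x), with sgn(z) = 1 iff z >= 0."""
    return (V @ x >= 0).astype(np.uint8)

theta = np.pi / 3                    # build a pair (a, b) with a known angle theta
a = np.zeros(d); a[0] = 1.0
b = np.zeros(d); b[0] = np.cos(theta); b[1] = np.sin(theta)

# E[Hamming distance] = K * theta / pi, so (Hamming distance) * pi / K estimates theta
ham = np.count_nonzero(srp_code(a) != srp_code(b))
theta_hat = ham * np.pi / K
```

With K = 4096 bits, the standard deviation of theta_hat is about π·sqrt(p(1 − p)/K) ≈ 0.02 radians for p = θ/π = 1/3, so the estimate lands close to the true angle π/3 ≈ 1.047.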
Then it is easy to prove that\n\nE[dHamming(h(a), h(b))] = Kθa,b/π = Cθa,b.\n\nThat is, the expectation of the Hamming distance between the binary hash codes of two given data samples a and b is an unbiased estimate of their angle θa,b, up to a constant scale factor C = K/π. Thus SRP-LSH provides an unbiased estimate of angular similarity.\nSince dHamming(h(a), h(b)) follows a binomial distribution, i.e. dHamming(h(a), h(b)) ∼ B(K, θa,b/π), its variance is K(θa,b/π)(1 − θa,b/π). This implies that the variance of the normalized distance dHamming(h(a), h(b))/K satisfies\n\nVar[dHamming(h(a), h(b))/K] = (θa,b/(Kπ))(1 − θa,b/π).\n\nThough widely used, SRP-LSH suffers from the large variance of its estimation, which leads to large estimation error. Generally we need a substantially long code to approximate the angular similarity accurately [24, 12, 23]. Since any two of the random vectors may be close to linearly dependent, the resulting binary code may be less informative than it seems, and may even contain many redundant bits. An intuitive idea would be to orthogonalize the random vectors. However, once orthogonalized, the random vectors can no longer be viewed as independently sampled. Moreover, it remains unclear whether the resulting Hamming distance is still an unbiased estimate of the angle θa,b multiplied by a constant, and what its variance will be. Later we give theoretically justified answers to both questions.\nIn the next section, based on the above intuitive idea, we propose the so-called Super-Bit locality-sensitive hashing (SBLSH) method. 
We provide theoretical guarantees that after orthogonalizing the random projection vectors in batches, we still get an unbiased estimate of angular similarity, yet with a smaller variance when θa,b ∈ (0, π/2], and thus the resulting binary code is more informative. Experiments on real data show the effectiveness of SBLSH, which with the same length of binary code may achieve as much as 30% mean squared error (MSE) reduction compared with SRP-LSH in estimating angular similarity. Moreover, SBLSH performs best among several widely used data-independent LSH methods in approximate nearest neighbor (ANN) retrieval experiments.\n\n2 Super-Bit Locality-Sensitive Hashing\n\nThe proposed SBLSH is founded on SRP-LSH. When the code length K satisfies 1 < K ≤ d, where d is the dimension of the data space, we can orthogonalize N (1 ≤ N ≤ min(K, d) = K) of the random vectors sampled from the normal distribution N(0, Id). The orthogonalization procedure is the Gram-Schmidt process, which projects the current vector orthogonally onto the orthogonal complement of the subspace spanned by the previous vectors. After orthogonalization, these N random vectors can no longer be viewed as independently sampled, so we group their resulting bits together as an N-Super-Bit. We call N the Super-Bit depth.\nHowever, when the code length K > d, it is impossible to orthogonalize all K vectors. Assume that K = N × L without loss of generality, with 1 ≤ N ≤ d; then we can perform the Gram-Schmidt process in L batches. Formally, K random vectors {v1, v2, ..., vK} are independently sampled from the normal distribution N(0, Id), and then divided into L batches of N vectors each. By performing the Gram-Schmidt process on these L batches of N vectors respectively, we get K = N × L projection vectors {w1, w2, ..., wK}. 
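Under the notation above, the batched Gram-Schmidt generation of {w1, ..., wK} might be sketched as follows (a hypothetical helper of our own, not the authors' code; the parameter values are illustrative):

```python
import numpy as np

def sblsh_projections(d, N, L, rng):
    """Generate K = N * L projection vectors: sample i.i.d. N(0, I_d) vectors and
    orthonormalize them with the Gram-Schmidt process in L batches of N (N <= d)."""
    assert 1 <= N <= d
    W = []
    for _ in range(L):
        batch = []
        for v in rng.standard_normal((N, d)):    # one batch of N random vectors
            w = v.copy()
            for u in batch:                      # subtract projections onto previous w's
                w -= (u @ v) * u
            batch.append(w / np.linalg.norm(w))  # normalize to unit length
        W.extend(batch)
    return np.array(W)                           # shape (K, d); rows are w_1, ..., w_K

rng = np.random.default_rng(1)
W = sblsh_projections(d=16, N=4, L=3, rng=rng)   # K = 12 projection vectors
```

Rows within each batch come out orthonormal, while rows from different batches are independent draws and hence only approximately orthogonal.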
This results in K SBLSH functions (hw1, hw2, ..., hwK), where hwi(x) = sgn(wiᵀx). These K functions produce L N-Super-Bits and altogether produce binary codes of length K. Figure 1 shows an example of generating 12 SBLSH projection vectors. Algorithm 1 lists the procedure for generating SBLSH projection vectors. Note that when the Super-Bit depth N = 1, SBLSH becomes SRP-LSH; in other words, SRP-LSH is a special case of SBLSH. The algorithm can easily be extended to the case where the code length K is not a multiple of the Super-Bit depth N. In fact, one can even use variable Super-Bit depths Ni, as long as 1 ≤ Ni ≤ d. With the same code length, SBLSH has the same running time O(Kd) as SRP-LSH in on-line processing, i.e. when generating binary codes for data.\n\nFigure 1: An illustration of 12 SBLSH projection vectors {wi} generated by orthogonalizing {vi} in 4 batches.\n\nAlgorithm 1 Generating Super-Bit Locality-Sensitive Hashing Projection Vectors\nInput: Data space dimension d, Super-Bit depth 1 ≤ N ≤ d, number of Super-Bits L ≥ 1, resulting code length K = N × L.\nGenerate a random matrix H with each element sampled independently from the normal distribution N(0, 1), with each column normalized to unit length. Denote H = [v1, v2, ..., vK].\nfor i = 0 to L − 1 do\n  for j = 1 to N do\n    wiN+j = viN+j\n    for k = 1 to j − 1 do\n      wiN+j = wiN+j − wiN+k (wiN+kᵀ viN+j)\n    end for\n    wiN+j = wiN+j / ‖wiN+j‖\n  end for\nend for\nOutput: H̃ = [w1, w2, ..., wK].\n\n2.1 Unbiased Estimate\n\nIn this subsection we prove that SBLSH provides an unbiased estimate of θa,b for any a, b ∈ Rd.\nLemma 1. ([8], Lemma 3.2) Let Sd−1 denote the unit sphere in Rd. Given a random vector v uniformly sampled from Sd−1, we have Pr[hv(a) ≠ hv(b)] = θa,b/π.\nLemma 2. 
If v ∈ Rd follows an isotropic distribution, then v̄ = v/‖v‖ is uniformly distributed on Sd−1.\nThis lemma can be proven from the definition of an isotropic distribution, and we omit the details here.\nLemma 3. Given k vectors v1, ..., vk ∈ Rd, sampled i.i.d. from the normal distribution N(0, Id) and spanning a subspace Sk, let PSk denote the orthogonal projection onto Sk; then PSk is a random matrix uniformly distributed on the Grassmann manifold Gk,d−k.\nThis lemma can be proven by applying Theorem 2.2.1(iii) and Theorem 2.2.2(iii) in [4].\nLemma 4. If P is a random matrix uniformly distributed on the Grassmann manifold Gk,d−k, 1 ≤ k ≤ d, and v ∼ N(0, Id) is independent of P, then the random vector ṽ = Pv follows an isotropic distribution.\nThis follows directly from the uniformity of P on the Grassmann manifold and the properties of the normal distribution N(0, Id). We give a sketch of the proof below.\nProof. We can write P = UUᵀ, where the columns of U = [u1, u2, ..., uk] constitute an orthonormal basis of a random k-dimensional subspace. Since the standard normal distribution is 2-stable [6], v̂ = Uᵀv = [v̂1, v̂2, ..., v̂k]ᵀ is an N(0, Ik)-distributed vector, where each v̂i ∼ N(0, 1), and it is easy to verify that v̂ is independent of U. Therefore ṽ = Pv = Uv̂ = Σ_{i=1}^{k} v̂i ui. Since u1, ..., uk can be any orthonormal basis of any k-dimensional subspace with equal probability density, and {v̂1, v̂2, ..., v̂k} are i.i.d. N(0, 1) random variables, ṽ follows an isotropic distribution.\nTheorem 1. Given N i.i.d. 
random vectors v1, v2, ..., vN ∈ Rd sampled from the normal distribution N(0, Id), where 1 ≤ N ≤ d, perform the Gram-Schmidt process on them to produce N orthogonalized vectors w1, w2, ..., wN. Then for any two data vectors a, b ∈ Rd, defining N indicator random variables X1, X2, ..., XN by Xi = 1 if hwi(a) ≠ hwi(b) and Xi = 0 if hwi(a) = hwi(b), we have E[Xi] = θa,b/π, for any 1 ≤ i ≤ N.\nProof. Denote by Si−1 the subspace spanned by {w1, ..., wi−1}, and by P⊥Si−1 the orthogonal projection onto its orthogonal complement; then wi = P⊥Si−1 vi. Denote w̄ = wi/‖wi‖. For any 1 ≤ i ≤ N, E[Xi] = Pr[Xi = 1] = Pr[hwi(a) ≠ hwi(b)] = Pr[hw̄(a) ≠ hw̄(b)]. For i = 1, by Lemma 2 and Lemma 1, we have Pr[X1 = 1] = θa,b/π. For any 1 < i ≤ N, consider the distribution of wi. By Lemma 3, PSi−1 is a random matrix uniformly distributed on the Grassmann manifold Gi−1,d−i+1, so P⊥Si−1 = I − PSi−1 is uniformly distributed on Gd−i+1,i−1. Since vi ∼ N(0, Id) is independent of v1, v2, ..., vi−1, vi is independent of P⊥Si−1. By Lemma 4, wi = P⊥Si−1 vi follows an isotropic distribution. By Lemma 2, w̄ = wi/‖wi‖ is uniformly distributed on the unit sphere in Rd. By Lemma 1, Pr[hw̄(a) ≠ hw̄(b)] = θa,b/π.\nCorollary 1. For any Super-Bit depth N, 1 ≤ N ≤ d, assuming the code length K = N × L, the Hamming distance dHamming(h(a), h(b)) is an unbiased estimate of θa,b, for any two data vectors a, b ∈ Rd, up to a constant scale factor C = K/π.\nProof. Applying Theorem 1, we get E[dHamming(h(a), h(b))] = L × E[Σ_{i=1}^{N} Xi] = L × Σ_{i=1}^{N} E[Xi] = L × Σ_{i=1}^{N} θa,b/π = Kθa,b/π = Cθa,b.\n\n2.2 Variance\n\nIn this subsection we prove that when the angle θa,b ∈ (0, π/2], the variance of SBLSH is strictly smaller than that of SRP-LSH.\nLemma 5. For the random variables {Xi} defined in Theorem 1, we have Pr[Xi = 1 | Xj = 1] = Pr[Xi = 1 | X1 = 1], for 1 ≤ j < i ≤ N ≤ d.\nProof. Pr[Xi = 1 | Xj = 1] = Pr[hwi(a) ≠ hwi(b) | Xj = 1], where wi = vi − Σ_{k=1}^{i−1} wk (wkᵀ vi). Since {w1, ..., wi−1} is a uniformly random orthonormal basis of a random subspace uniformly distributed on the Grassmann manifold, exchanging the roles of the indices j and 1 shows that this equals Pr[hwi(a) ≠ hwi(b) | hw1(a) ≠ hw1(b)] = Pr[Xi = 1 | X1 = 1].\nLemma 6. For {Xi} defined in Theorem 1, we have Pr[Xi = 1 | Xj = 1] = Pr[X2 = 1 | X1 = 1], for 1 ≤ j < i ≤ N ≤ d. Given θa,b ∈ (0, π/2], we have Pr[X2 = 1 | X1 = 1] < θa,b/π.\nThe proof of this lemma is long, so we provide it in the Appendix (in the supplementary file).\nTheorem 2. Given two vectors a, b ∈ Rd and the random variables {Xi} defined as in Theorem 1, denote p2,1 = Pr[X2 = 1 | X1 = 1] and SX = Σ_{i=1}^{N} Xi, which is the Hamming distance between the N-Super-Bits of a and b, for 1 < N ≤ d. Then Var[SX] = Nθa,b/π + N(N − 1) p2,1 θa,b/π − (Nθa,b/π)².\nProof. By Lemma 6, Pr[Xi = 1 | Xj = 1] = Pr[X2 = 1 | X1 = 1] = p2,1 when 1 ≤ j < i ≤ N. Therefore Pr[Xi = 1, Xj = 1] = Pr[Xi = 1 | Xj = 1] Pr[Xj = 1] = p2,1 θa,b/π, for any 1 ≤ j < i ≤ N. Hence Var[SX] = E[SX²] − E[SX]² = Σ_{i=1}^{N} E[Xi²] + 2 Σ_{j<i} E[Xi Xj] − (Nθa,b/π)² = Nθa,b/π + N(N − 1) p2,1 θa,b/π − (Nθa,b/π)².\nCorollary 2. Denote by Var[SBLSH_{N,K}] the variance of the Hamming distance produced by SBLSH with Super-Bit depth N and code length K = N × L. Then Var[SBLSH_{N,K}] = L × Var[SX] = Kθa,b/π + K(N − 1) p2,1 θa,b/π − KN(θa,b/π)². For N1 > N2 ≥ 1 with K = N1 × L1 = N2 × L2, Var[SBLSH_{N1,K}] − Var[SBLSH_{N2,K}] = K(N1 − N2)(p2,1 − θa,b/π) θa,b/π. If θa,b ∈ (0, π/2], then by Lemma 6, 0 ≤ p2,1 < θa,b/π, so this difference is negative: the larger the Super-Bit depth N, the smaller the variance.\nCorollary 3. Denote by Var[SRP-LSH_K] the variance of the Hamming distance produced by SRP-LSH, where K = N × L is the code length, L is a positive integer, and 1 < N ≤ d. If θa,b ∈ (0, π/2], then Var[SRP-LSH_K] > Var[SBLSH_{N,K}].\nProof. By Corollary 2, Var[SRP-LSH_K] = Var[SBLSH_{1,K}] > Var[SBLSH_{N,K}].\n\n2.2.1 Numerical Verification\n\nFigure 2: The variances of SRP-LSH and SBLSH against the angle θa,b to estimate.\n\nIn this subsection we verify numerically the behavior of the variances of both SRP-LSH and SBLSH for different angles θa,b ∈ (0, π]. By Theorem 2, the variance of SBLSH is closely related to p2,1. We randomly generate 30 points in R10, which involves 435 angles. For each angle, we numerically approximate p2,1 by sampling, with 1000 samples. We fix K = N = d, and plot the variances Var[SRP-LSH_N] and Var[SBLSH_{N,N}] against the various angles θa,b. Figure 2 shows that when θa,b ∈ (0, π/2], SBLSH has a much smaller variance than SRP-LSH, which verifies the correctness of Corollary 3 to some extent. Furthermore, Figure 2 shows that even when θa,b ∈ (π/2, π], SBLSH still has a smaller variance.\n\n2.3 Discussion\n\nFrom Corollary 1, SBLSH provides an unbiased estimate of angular similarity. From Corollary 3, when θa,b ∈ (0, π/2], with the same length of binary code, the variance of SBLSH is strictly smaller than that of SRP-LSH. 
In real applications, many vector representations are confined to the non-negative orthant, with all vector entries being non-negative, e.g., bag-of-words representations of documents and images, and histogram-based representations like the SIFT local descriptor [20]. Usually they are normalized to unit length, with only their orientations retained. For this kind of data, the angle between any two different samples lies in (0, π/2], and thus SBLSH will provide more accurate estimation than SRP-LSH on such data. In fact, our later experiments show that even when θa,b is not constrained to (0, π/2], SBLSH still gives a much more accurate estimate of angular similarity.\n\n3 Experimental Results\n\nWe conduct two sets of experiments, angular similarity estimation and approximate nearest neighbor (ANN) retrieval, to evaluate the effectiveness of the proposed SBLSH method. In the first set of experiments we directly measure the accuracy in estimating pairwise angular similarity. The second set of experiments then tests the performance of SBLSH in real retrieval applications.\n\n3.1 Angular Similarity Estimation\n\nIn this experiment, we evaluate the accuracy of estimating pairwise angular similarity on several datasets. Specifically, we test the effect on estimation accuracy when the Super-Bit depth N varies and the code length K is fixed, and vice versa. For each preprocessed dataset D, we obtain DLSH after performing SRP-LSH and DSBLSH after performing the proposed SBLSH. We compute the angles between each pair of samples in D and the corresponding Hamming distances in DLSH and DSBLSH. We then compute the mean squared error between the true angle and the approximated angles from DLSH and DSBLSH respectively. 
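As a quick sanity check of this protocol, the MSE gap between SRP-LSH and SBLSH can be simulated directly (a sketch under assumed toy parameters d = K = N = 16 and θ = π/4; it uses a QR factorization in place of explicit Gram-Schmidt, which yields the same orthonormal vectors up to sign flips, and a sign flip does not change the collision indicator):

```python
import numpy as np

rng = np.random.default_rng(2)
d = K = 16                           # Super-Bit depth N = K = d: a single batch
theta = np.pi / 4                    # an angle in (0, pi/2]
a = np.zeros(d); a[0] = 1.0
b = np.zeros(d); b[0] = np.cos(theta); b[1] = np.sin(theta)

def angle_estimate(P):
    """Estimate theta from the Hamming distance of the K-bit codes under rows of P."""
    ham = np.count_nonzero((P @ a >= 0) != (P @ b >= 0))
    return ham * np.pi / K

sq_err_srp, sq_err_sblsh = [], []
for _ in range(5000):
    V = rng.standard_normal((K, d))  # K i.i.d. N(0, I_d) random vectors
    Q, _ = np.linalg.qr(V.T)         # columns of Q: the vectors orthonormalized in order
    sq_err_srp.append((angle_estimate(V) - theta) ** 2)
    sq_err_sblsh.append((angle_estimate(Q.T) - theta) ** 2)

mse_srp, mse_sblsh = np.mean(sq_err_srp), np.mean(sq_err_sblsh)
```

Since both estimators are unbiased, the MSE is essentially the variance; by Corollary 3 the SBLSH value should come out clearly below the SRP-LSH value in such a run.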
Note that after computing the Hamming distance, we divide the result by C = K/π to obtain the approximated angle.\n\n3.1.1 Datasets and Preprocessing\n\nWe conduct the experiment on the following datasets: 1) the Photo Tourism patch dataset¹ [26], Notre Dame, which contains 104,106 patches, each represented by a 128D SIFT descriptor (Photo Tourism SIFT); and 2) MIR-Flickr², which contains 25,000 images, each represented by a 3125D bag-of-SIFT-feature histogram.\nFor each dataset, we further conduct a simple preprocessing step as in [12], i.e. mean-centering each data sample, so as to obtain additional mean-centered versions of the above datasets: Photo Tourism SIFT (mean) and MIR-Flickr (mean). The experiments on these mean-centered datasets test the performance of SBLSH when the angles of data pairs are not constrained to (0, π/2].\n\n3.1.2 The Effect of Super-Bit Depth N and Code Length K\n\nFigure 3: The effect of Super-Bit depth N (1 < N ≤ min(d, K)) with fixed code length K (K = N × L), and the effect of code length K with fixed Super-Bit depth N.\n\n¹http://phototour.cs.washington.edu/patches/default.htm\n²http://users.ecs.soton.ac.uk/jsh2/mirflickr/\n\nTable 1: ANN retrieval results, measured by the proportion of good neighbors within the query's Hamming ball of radius 3. Note that the code length K = 30.\n\nData       | E2LSH           | SRP-LSH         | SBLSH\nNotre Dame | 0.4675 ± 0.0900 | 0.7500 ± 0.0525 | 0.7845 ± 0.0352\nHalf Dome  | 0.4503 ± 0.0712 | 0.7137 ± 0.0413 | 0.7535 ± 0.0276\nTrevi      | 0.4661 ± 0.0849 | 0.7591 ± 0.0464 | 0.7891 ± 0.0329\n\nIn each dataset, for each (N, K) pair, i.e. 
Super-Bit depth N and code length K, we randomly sample 10,000 data points, which involve about 50,000,000 data pairs, and randomly generate SRP-LSH functions, together with SBLSH functions obtained by orthogonalizing the generated random projections in batches. We repeat the test 10 times, and compute the mean squared error (MSE) of the estimation.\nTo test the effect of Super-Bit depth N, we fix K = 120 for Photo Tourism SIFT and K = 3000 for MIR-Flickr; to test the effect of code length K, we fix N = 120 for Photo Tourism SIFT and N = 3000 for MIR-Flickr. We repeat the experiment on the mean-centered versions of these datasets, and denote the corresponding methods by Mean+SRP-LSH and Mean+SBLSH respectively.\nFigure 3 shows that with a fixed code length K, as the Super-Bit depth N grows (1 < N ≤ min(d, K)), the MSE of SBLSH gets smaller, and the gap between SBLSH and SRP-LSH gets larger. In particular, when N = K, over 30% MSE reduction can be observed on all the datasets. This agrees with Corollary 2: when applying SBLSH, the best strategy is to set the Super-Bit depth N as large as possible, i.e. min(d, K). An informal explanation of this phenomenon is that as the degree of orthogonality of the random projections gets higher, the code becomes more informative, and thus provides a better estimate. 
On the other hand, it can be observed that the performance on the mean-centered datasets is similar to that on the original datasets. This shows that even when the angle between each data pair is not constrained to (0, π/2], SBLSH still gives much more accurate estimation.\nFigure 3 also shows that with fixed Super-Bit depth N, SBLSH significantly outperforms SRP-LSH. As the code length K increases, the accuracies of SBLSH and SRP-LSH both increase. Again, the performance on the mean-centered datasets is similar to that on the original datasets.\n\n3.2 Approximate Nearest Neighbor Retrieval\n\nIn this subsection, we conduct an ANN retrieval experiment, which compares SBLSH with two other widely used data-independent binary LSH methods: SRP-LSH and E2LSH (we use the binary version in [25, 1]). We use the datasets Notre Dame, Half Dome and Trevi from the Photo Tourism patch dataset [26], which is also used in [12, 10, 13] for ANN retrieval. We use the 128D SIFT representation and normalize the vectors to unit norm. For each dataset, we randomly pick 1,000 samples as queries, and the remaining samples (around 100,000) form the corpus for the retrieval task. We define the good neighbors of a query q as the samples within the top 5% nearest neighbors (measured in Euclidean distance) of q. We adopt the evaluation criterion used in [12, 25], i.e. the proportion of good neighbors among the returned samples that lie within the query's Hamming ball of radius r. We set r = 3. Using code length K = 30, we repeat the experiment 10 times and take the mean of the results. For SBLSH, we fix the Super-Bit depth N = K = 30. Table 1 shows that SBLSH performs best among all these data-independent hashing methods.\n\n4 Relations to Other Hashing Methods\n\nThere exist different kinds of LSH methods, e.g. 
bit-sampling LSH [9, 7] for Hamming distance and ℓ1-distance, min-hash [2] for the Jaccard coefficient, and p-stable-distribution LSH [6] for ℓp-distance with p ∈ (0, 2]. These data-independent methods are simple, and thus easy to integrate as a module in more complicated algorithms involving pairwise distance or similarity computation, e.g. nearest neighbor search. New data-independent methods improving on these original LSH methods have been proposed recently. [1] proposed a near-optimal LSH method for Euclidean distance. Li et al. [16] proposed b-bit minwise hashing, which improves the original min-hash in terms of compactness. [17] shows that b-bit minwise hashing can be integrated into linear learning algorithms for large-scale learning tasks. [14] reduces the variance of random projections by taking advantage of marginal norms, and compares the variance of SRP with that of regular random projections when the margins are considered. [15] proposed very sparse random projections for accelerating random projections and SRP.\nPrior to SBLSH, SRP-LSH [3] was the only hashing method proven to provide an unbiased estimate of angular similarity. The proposed SBLSH method is the first data-independent method that outperforms SRP-LSH in terms of accuracy in estimating angular similarity.\nOn the other hand, data-dependent hashing methods have been studied extensively. For example, spectral hashing [25] and anchor graph hashing [19] are data-dependent unsupervised methods. Kulis et al. [13] proposed kernelized locality-sensitive hashing (KLSH), which is based on SRP-LSH, to approximate the angular similarity in very high or even infinite-dimensional spaces induced by any given kernel, with access to data only via kernels. 
There are also a number of works devoted to semi-supervised or supervised hashing methods [10, 21, 23, 24, 18], which try to capture not only the geometry of the original data, but also the semantic relations.\n\n5 Discussion\n\nInstead of the Gram-Schmidt process, other methods can be used to orthogonalize the projection vectors, e.g. Householder transformations, which are numerically more stable. The advantage of the Gram-Schmidt process is the simplicity with which the algorithm can be described.\nIn this paper we did not test the method on data of very high dimension. When the dimension is high and N is not small, the Gram-Schmidt process becomes computationally expensive. In fact, when the dimension of the data is very high, the random normal projection vectors {vi}, i = 1, 2, ..., K, already tend to be nearly orthogonal to each other, so it may not be necessary to orthogonalize them deliberately. From Corollary 2 and the results in Section 3.1.2, we can see that the closer the Super-Bit depth N is to the data dimension d, the larger the variance reduction SBLSH achieves over SRP-LSH.\nA technical report³ (Li et al.) shows that b-bit minwise hashing almost always has a smaller variance than SRP in estimating the Jaccard coefficient on binary data. A comparison of SBLSH with b-bit minwise hashing for the Jaccard coefficient is left for future work.\n\n6 Conclusion and Future Work\n\nThe proposed SBLSH is a data-independent hashing method which significantly outperforms SRP-LSH. We have theoretically proved that SBLSH provides an unbiased estimate of angular similarity, and that it has a smaller variance than SRP-LSH when the angle to estimate is in (0, π/2]. The algorithm is simple, easy to implement, and can be integrated as a basic module in more complicated algorithms. 
Experiments show that with the same length of binary code, SBLSH achieves over 30% mean squared error reduction over SRP-LSH in estimating angular similarity when the Super-Bit depth N is close to the data dimension d. Moreover, SBLSH performs best among several widely used data-independent LSH methods in approximate nearest neighbor retrieval experiments. Theoretically exploring the variance of SBLSH when the angle is in (π/2, π] is left for future work.\n\nAcknowledgments\n\nThis work was supported by the National Basic Research Program (973 Program) of China (Grant Nos. 2013CB329403 and 2012CB316301), the National Natural Science Foundation of China (Grant Nos. 91120011 and 61273023), Tsinghua University Initiative Scientific Research Program No. 20121088071, and the NExT Research Center funded under research grant WBS R-252-300-001-490 by MDA, Singapore. It was also supported in part, to Dr. Qi Tian, by ARO grant W911BF-12-1-0057, NSF IIS 1052851, and Faculty Research Awards by Google, FXPAL, and NEC Laboratories of America, respectively.\n\n³www.stat.cornell.edu/~li/hashing/RP_minwise.pdf\n\nReferences\n[1] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Annual IEEE Symposium on Foundations of Computer Science, 2006.\n[2] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157–1166, 1997.\n[3] Moses Charikar. Similarity estimation techniques from rounding algorithms. In ACM Symposium on Theory of Computing, 2002.\n[4] Yasuko Chikuse. Statistics on Special Manifolds. Springer, February 2003.\n[5] Ondrej Chum, James Philbin, and Andrew Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In British Machine Vision Conference, 2008.\n[6] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 
Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, 2004.\n[7] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In International Conference on Very Large Databases, 1999.\n[8] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115–1145, 1995.\n[9] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In ACM Symposium on Theory of Computing, 1998.\n[10] Prateek Jain, Brian Kulis, and Kristen Grauman. Fast image search for learned metrics. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.\n[11] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011.\n[12] Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In Advances in Neural Information Processing Systems, 2009.\n[13] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In IEEE International Conference on Computer Vision, 2009.\n[14] Ping Li, Trevor Hastie, and Kenneth Ward Church. Improving random projections using marginal information. In COLT, pages 635–649, 2006.\n[15] Ping Li, Trevor Hastie, and Kenneth Ward Church. Very sparse random projections. In KDD, pages 287–296, 2006.\n[16] Ping Li and Arnd Christian König. b-bit minwise hashing. In International World Wide Web Conference, 2010.\n[17] Ping Li, Anshumali Shrivastava, Joshua L. Moore, and Arnd Christian König. Hashing algorithms for large-scale learning. 
In Advances in Neural Information Processing Systems, 2011.\n[18] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.\n[19] Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs. In ICML, pages 1–8, 2011.\n[20] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.\n[21] Yadong Mu, Jialie Shen, and Shuicheng Yan. Weakly-supervised hashing in kernel space. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.\n[22] Antonio Torralba, Robert Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(11):1958–1970, 2008.\n[23] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Sequential projection learning for hashing with compact codes. In International Conference on Machine Learning, 2010.\n[24] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for large scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99(PrePrints), 2012.\n[25] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In Advances in Neural Information Processing Systems, 2008.\n[26] Simon A. J. Winder and Matthew Brown. Learning local image descriptors. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.\n", "award": [], "sourceid": 60, "authors": [{"given_name": "Jianqiu", "family_name": "Ji", "institution": null}, {"given_name": "Jianmin", "family_name": "Li", "institution": null}, {"given_name": "Shuicheng", "family_name": "Yan", "institution": null}, {"given_name": "Bo", "family_name": "Zhang", "institution": null}, {"given_name": "Qi", "family_name": "Tian", "institution": null}]}