{"title": "Angular Quantization-based Binary Codes for Fast Similarity Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1196, "page_last": 1204, "abstract": "This paper focuses on the problem of learning binary embeddings for efficient retrieval of high-dimensional non-negative data. Such data typically arises in a large number of vision and text applications where counts or frequencies are used as features. Also, cosine distance is commonly used as a measure of dissimilarity between such vectors. In this work, we introduce a novel spherical quantization scheme to generate binary embedding of such data and analyze its properties. The number of quantization landmarks in this scheme grows exponentially with data dimensionality resulting in low-distortion quantization. We propose a very efficient method for computing the binary embedding using such large number of landmarks. Further, a linear transformation is learned to minimize the quantization error by adapting the method to the input data resulting in improved embedding. Experiments on image and text retrieval applications show superior performance of the proposed method over other existing state-of-the-art methods.", "full_text": "Angular Quantization-based Binary Codes for\n\nFast Similarity Search\n\nYunchao Gong\u2020, Sanjiv Kumar(cid:63), Vishal Verma\u2020, Svetlana Lazebnik\u2021\n\n(cid:63)Google Research, New York, NY 10011, USA\n\n\u2020Computer Science Department, University of North Carolina at Chapel Hill, NC 27599, USA\n\u2021Computer Science Department, University of Illinois at Urbana-Champaign, IL 61801, USA\n\n{yunchao,verma}@cs.unc.edu, sanjivk@google.com, slazebni@uiuc.edu\n\nAbstract\n\nThis paper focuses on the problem of learning binary codes for ef\ufb01cient retrieval\nof high-dimensional non-negative data that arises in vision and text applications\nwhere counts or frequencies are used as features. 
The similarity of such feature vectors is commonly measured using the cosine of the angle between them. In this work, we introduce a novel angular quantization-based binary coding (AQBC) technique for such data and analyze its properties. In its most basic form, AQBC works by mapping each non-negative feature vector onto the vertex of the binary hypercube with which it has the smallest angle. Even though the number of vertices (quantization landmarks) in this scheme grows exponentially with data dimensionality d, we propose a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d). Further, we propose a method for learning a linear transformation of the data to minimize the quantization error, and show that it results in improved binary codes. Experiments on image and text datasets show that the proposed AQBC method outperforms the state of the art.

1 Introduction

Retrieving relevant content from massive databases containing high-dimensional data is becoming common in many applications involving images, videos, documents, etc. Two main bottlenecks in building an efficient retrieval system for such data are the need to store the huge database and the slow speed of retrieval. Mapping the original high-dimensional data to similarity-preserving binary codes provides an attractive solution to both of these problems [21, 23]. Several powerful techniques have been proposed recently to learn binary codes for large-scale nearest neighbor search and retrieval. These methods can be supervised [2, 11, 16], semi-supervised [10, 24] and unsupervised [7, 8, 9, 12, 15, 18, 20, 26], and can be applied to any type of vector data.

In this work, we investigate whether it is possible to achieve an improved binary embedding if the data vectors are known to contain only non-negative elements. 
In many vision and text-related applications, it is common to represent data as a Bag of Words (BoW), or a vector of counts or frequencies, which contains only non-negative entries. Furthermore, the cosine of the angle between vectors is typically used as a similarity measure for such data. Unfortunately, not much attention has been paid in the past to exploiting this special yet widely used data type.

A popular binary coding method for cosine similarity is based on Locality Sensitive Hashing (LSH) [4], but it does not take advantage of the non-negative nature of histogram data. As we will show in the experiments, the accuracy of LSH is limited for most real-world data. Min-wise Hashing is another popular method designed for non-negative data [3, 13, 14, 22]. However, it is appropriate only for the Jaccard distance, and it does not result in binary codes. Special clustering algorithms have been developed for data sampled on the unit hypersphere, but they also do not lead to binary codes [1]. To the best of our knowledge, this paper describes the first work that specifically learns binary codes for non-negative data with cosine similarity.

We propose a novel angular quantization technique to learn binary codes from non-negative data, where the angle between two vectors is used as a similarity measure. Without loss of generality, such data can be assumed to live in the positive orthant of a unit hypersphere. The proposed technique works by quantizing each data point to the vertex of the binary hypercube with which it has the smallest angle. The number of these quantization centers or landmarks is exponential in the dimensionality of the data, yielding a low-distortion quantization of a point. Note that it would be computationally infeasible to perform traditional nearest-neighbor quantization as in [1] with such a large number of centers. 
Moreover, even at run time, finding the nearest center for a given point would be prohibitively expensive. Instead, we present a very efficient method to find the nearest landmark for a point, i.e., the vertex of the binary hypercube with which it has the smallest angle. Since the basic form of our quantization method does not take the data distribution into account, we further propose a learning algorithm that linearly transforms the data before quantization to reduce the angular distortion. We show experimentally that it significantly outperforms other state-of-the-art binary coding methods on both visual and textual data.

2 Angular Quantization-based Binary Codes

Our goal is to find a quantization scheme that maximally preserves the cosine similarity (angle) between vectors in the positive orthant of the unit hypersphere. This section introduces the proposed angular quantization technique that directly yields binary codes of non-negative data. We first propose a simplified data-independent algorithm which does not involve any learning, and then present a method to adapt the quantization scheme to the input data to learn robust codes.

2.1 Data-independent Binary Codes

Suppose we are given a database X containing n d-dimensional points such that X = {x_i}_{i=1}^n, where x_i \in R^d. We first address the problem of computing a d-bit binary code of an input vector x_i. A c-bit code for c < d will be described later in Sec. 2.2. For angle-preserving quantization, we define a set of quantization centers or landmarks by projecting the vertices of the binary hypercube {0, 1}^d onto the unit hypersphere. This construction results in 2^d - 1 landmark points for d-dimensional data.¹ An illustration of the proposed quantization model is given in Fig. 1. 
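As a minimal sketch of this landmark construction (in 3D, giving 2^3 - 1 = 7 landmarks; numpy is assumed, and the variable names are illustrative):

```python
import itertools

import numpy as np

# Vertices of the binary hypercube {0,1}^3, excluding the all-zeros vertex
# (its norm is 0, so it cannot be projected onto the unit sphere).
vertices = [np.array(v, dtype=float)
            for v in itertools.product([0, 1], repeat=3)
            if any(v)]

# Project each vertex onto the unit hypersphere to obtain the landmarks v_i.
landmarks = [v / np.linalg.norm(v) for v in vertices]

print(len(landmarks))   # 7 landmarks, i.e. 2^3 - 1
```

For d in the thousands, this enumeration is of course infeasible, which is exactly the computational challenge addressed by Algorithm 1 below.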
Given a point x on the hypersphere, one first finds its nearest² landmark v_i, and the binary encoding for x_i is simply given by the binary vertex b_i corresponding to v_i.³

One of the main characteristics of the proposed model is that the number of landmarks grows exponentially with d, and for many practical applications d can easily be in the thousands or more. On the one hand, having a huge number of landmarks is preferred as it can provide a fine-grained, low-distortion quantization of the input data, but on the other hand, it poses the formidable computational challenge of efficiently finding the nearest landmark (and hence the binary encoding) for an arbitrary input point. Note that performing brute-force nearest-neighbor search might even be slower than nearest-neighbor retrieval from the original database! To obtain an efficient solution, we take advantage of the special structure of our set of landmarks, which are given by the projections of binary vectors onto the unit hypersphere. The nearest landmark of a point x, or the binary vertex having the smallest angle with x, is given by

\hat{b} = \arg\max_b \frac{b^T x}{\|b\|_2}  s.t.  b \in \{0, 1\}^d.  (1)

This is an integer programming problem, but its global maximum can be found very efficiently, as we show in the lemma below. The corresponding algorithm is presented in Algorithm 1.

¹Note that the vertex with all 0's is excluded as its norm is 0, which is not permissible in eq. (1).
²In terms of angle or Euclidean distance, which are equivalent for unit-norm data.
³Since in terms of angle from a point, both b_i and v_i are equivalent, we will use the term landmark for either b_i or v_i depending on the context.

Figure 1: (a) An illustration of our quantization model in 3D. Here b_i is a vertex of the unit cube and v_i is its projection on the unit sphere. Points v_i are used as the landmarks for quantization. To find the binary code of a given data point x, we first find its nearest landmark point v_i on the sphere, and the corresponding b_i gives its binary code (v_4 and b_4 in this case). (b) Bound on the cosine of the angle between a binary vertex b_1 with Hamming weight m and another vertex b_2 at a Hamming distance r from b_1. See Lemma 2 for details.

Algorithm 1: Finding the nearest binary landmark for a point on the unit hypersphere.
Input: point x on the unit hypersphere. Output: \hat{b}, binary vector having the smallest angle with x.
1. Sort the entries of x in descending order as x_(1), ..., x_(d).
2. for k = 1, ..., d
3.   if x_(k) = 0 break.
4.   Form binary vector b_k whose elements are 1 for the k largest positions in x, 0 otherwise.
5.   Compute \psi(x, k) = (x^T b_k)/\|b_k\|_2 = (\sum_{j=1}^{k} x_(j)) / \sqrt{k}.
6. end for
7. Return b_k corresponding to m = \arg\max_k \psi(x, k).

Lemma 1 The globally optimal solution of the integer programming problem in eq. (1) can be computed in O(d log d) time. Further, for a sparse vector with s non-zero entries, it can be computed in O(s log s) time.

Proof: Since b is a d-dimensional binary vector, its norm \|b\|_2 can have at most d different values, i.e., \|b\|_2 \in \{\sqrt{1}, ..., \sqrt{d}\}. We can separately consider the optimal solution of eq. (1) for each value of the norm. Given \|b\|_2 = \sqrt{k} (i.e., b has k ones), eq. (1) is maximized by setting to one the entries of b corresponding to the k largest entries of x. Since \|b\|_2 can take on d distinct values, we need to evaluate eq. (1) at most d times, and find the k and the corresponding \hat{b} for which the objective function is maximized (see Algorithm 1 for a detailed description). To find the largest k entries of x for k = 1, ..., d, we need to sort all the entries of x, which takes O(d log d) time, and checking the solutions for all k is linear in d. Further, if the vector x is sparse with only s non-zero elements, the maximum of eq. (1) is achieved for k between 1 and s. Hence, one needs to sort only the non-zero entries of x, which takes O(s log s) time, and checking all possible solutions is linear in s. ∎

Now we study the properties of the proposed quantization model. The following lemma helps to characterize the angular resolution of the quantization landmarks.

Lemma 2 Suppose b is an arbitrary binary vector with Hamming weight \|b\|_1 = m, where \|\cdot\|_1 is the L1 norm. Then for all binary vectors b' that lie at a Hamming radius r from b, the cosine of the angle between b and b', given by \frac{b^T b'}{\|b\|_2 \|b'\|_2}, is bounded by [\sqrt{(m-r)/m}, \sqrt{m/(m+r)}].

Proof: Since \|b\|_1 = m, there are exactly m ones in b and the rest are zeros, and b' has exactly r bits different from b. To find the lower bound on the cosine of the angle between b and b', we want to find a b' such that the angle is maximized. It is easy to see that this will happen when b' has exactly m - r ones in common positions with b and the remaining entries are zero, i.e., \|b'\|_1 = m - r and b^T b' = m - r. This gives the lower bound of \sqrt{(m-r)/m}. Similarly, the upper bound can be obtained when b' has all ones at the same locations as b, and r additional ones, i.e., \|b'\|_1 = m + r and b^T b' = m. This yields the upper bound of \sqrt{m/(m+r)}. ∎

We can understand this result as follows. The Hamming weight m of each binary vertex corresponds to its position in space. When m is low, the point is closer to the boundary of the positive orthant, and when m is high, it is closer to the center. The above lemma implies that for landmark points on the boundary, the Voronoi cells are relatively coarse, and the cells become progressively denser as one moves towards the center. Thus the proposed set of landmarks non-uniformly tessellates the surface of the positive orthant of the hypersphere. We show the lower and upper bounds on the cosine for various m and r in Fig. 1 (b). It is clear that for relatively large m, the angle between different landmarks is very small, thus providing dense quantization even for large r. 
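A sketch of Algorithm 1 in Python (numpy assumed; the function name `nearest_binary_landmark` is ours, not the authors'). For small d the result can be checked against brute-force enumeration of all 2^d - 1 vertices:

```python
import numpy as np

def nearest_binary_landmark(x):
    """Algorithm 1: binary vertex b (not all-zero) maximizing (b^T x) / ||b||_2.

    Runs in O(d log d) time, or O(s log s) if x has only s non-zero entries."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(-x)                # indices of entries in descending order
    xs = x[order]
    # Only prefixes ending at strictly positive entries can be optimal.
    ks = np.flatnonzero(xs > 0) + 1       # candidate Hamming weights k = 1..s
    # psi(x, k) = (sum of the k largest entries of x) / sqrt(k)
    psi = np.cumsum(xs)[ks - 1] / np.sqrt(ks)
    k_best = ks[np.argmax(psi)]
    b = np.zeros_like(x)
    b[order[:k_best]] = 1                 # ones at the k_best largest positions of x
    return b
```

For instance, comparing the returned objective value against exhaustive search over all non-zero vertices for d = 6 confirms global optimality on random non-negative inputs.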
To get good performance, the distribution of the data should be such that a majority of the points fall closer to landmarks with higher m.

Algorithm 1 constitutes the core of our proposed angular quantization method, but it has several limitations: (i) it is data-independent, and thus cannot adapt to the data distribution to control the quantization error; (ii) it cannot control m, which, based on our analysis, is critical for low quantization error; (iii) it can only produce a d-bit code for d-dimensional data, and thus cannot generate shorter codes. In the following section, we present a learning algorithm to address the above issues.

2.2 Learning Data-dependent Binary Codes

We start by addressing the first issue of how to adapt the method to the given data to minimize the quantization error. Similarly to the Iterative Quantization (ITQ) method of Gong and Lazebnik [7], we would like to align the data to a pre-defined set of quantization landmarks using a rotation, because rotating the data does not change the angles, and therefore the similarities, between the data points. Later in this section, we will present an objective function and an optimization algorithm to accomplish this goal, but first, by way of motivation, we would like to illustrate how applying even a random rotation to a typical frequency/count vector can affect the Hamming weight m of its angular binary code.

Zipf's law or power law is commonly used for modeling frequency/count data in many real-world applications [17, 28]. Suppose, for a data vector x, the sorted entries x_(1), ..., x_(d) follow Zipf's law, i.e., x_(k) \propto 1/k^s, where k is the index of the entries sorted in descending order, and s is the power parameter that controls how quickly the entries decay. 
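This effect can be reproduced in a small simulation in the spirit of Fig. 2 (a synthetic Zipf vector with d = 100 and s = 0.8, and a random orthogonal matrix drawn via QR decomposition; the exact values of m depend on the random seed, and the helper name is ours):

```python
import numpy as np

def hamming_weight_m(x):
    """Hamming weight m chosen by Algorithm 1: the k maximizing
    psi(x, k) = (sum of the k largest entries) / sqrt(k).
    Negative entries (possible after rotation) are suppressed to 0,
    since they never belong to an optimal prefix."""
    xs = np.sort(np.maximum(x, 0))[::-1]
    ks = np.arange(1, len(xs) + 1)
    psi = np.cumsum(xs) / np.sqrt(ks)
    return int(ks[np.argmax(psi)])

d, s = 100, 0.8
x = 1.0 / np.arange(1, d + 1) ** s                 # sorted entries follow Zipf's law
m_before = hamming_weight_m(x)

rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix
m_after = hamming_weight_m(R.T @ x)

print(m_before, m_after)   # m is typically much larger after rotation
```

The small m before rotation and the substantially larger m after it mirror the behavior reported for Fig. 2.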
The effective m for x depends directly on the power s: the larger s is, the faster the entries of x decay, and the smaller m becomes. More germanely, for a fixed s, applying a random rotation R to x makes the distribution of the entries of the resulting vector R^T x more uniform and raises the effective m. In Fig. 2 (a), we plot the sorted entries of x generated from Zipf's law with s = 0.8. Based on Algorithm 1, we compute the scaled cumulative sums \psi(x, k) = (\sum_{j=1}^{k} x_(j)) / \sqrt{k}, which are shown in Fig. 2 (b). Here the optimal m = \arg\max_k \psi(x, k) is relatively low (m = 2). In Fig. 2 (c), we randomly rotate the data and show the sorted values of R^T x, which become more uniform. Finally, in Fig. 2 (d), we show \psi(R^T x, k). The Hamming weight m after this random rotation becomes much higher (m = 25). This effect is typical: the average of m over 1000 random rotations for this example is 27.36. Thus, even randomly rotating the data tends to lead to finer Voronoi cells and reduced quantization error. Next, it is natural to ask whether we can optimize the rotation of the data to increase the cosine similarities between data points and their corresponding binary landmarks.

We seek a d x d orthogonal transformation R such that the sum of cosine similarities of each transformed data point R^T x_i and its corresponding binary landmark b_i is maximized.⁴ Let B \in {0, 1}^{d x n} denote a matrix whose columns are given by the b_i. Then the objective function for our optimization problem is given by

Q(B, R) = \arg\max_{B, R} \sum_{i=1}^{n} \frac{b_i^T}{\|b_i\|_2} R^T x_i  s.t.  b_i \in \{0, 1\}^d,  R^T R = I_d,  (2)

where I_d denotes the d x d identity matrix.

⁴Note that after rotation, R^T x_i may contain negative values, but this does not affect the quantization, since the binarization technique described in Algorithm 1 effectively suppresses the negative values to 0.

Figure 2: Effect of rotation on Hamming weight m of the landmark corresponding to a particular vector. (a) Sorted vector elements x_(k) following Zipf's law with s = 0.8; (b) scaled cumulative sum \psi(x, k); (c) sorted vector elements after random rotation; (d) scaled cumulative sum \psi(R^T x, k) for the rotated data. See text for discussion.

The above objective function still yields a d-bit binary code for d-dimensional data, while in many real-world applications, a low-dimensional binary code may be preferable. To generate a c-bit code where c < d, we can learn a d x c projection matrix R with orthogonal columns by optimizing the following modified objective function:

Q(B, R) = \arg\max_{B, R} \sum_{i=1}^{n} \frac{b_i^T}{\|b_i\|_2} \frac{R^T x_i}{\|R^T x_i\|_2}  s.t.  b_i \in \{0, 1\}^c,  R^T R = I_c.  (3)

Note that to minimize the angle after a low-dimensional projection (as opposed to a rotation), the denominator of the objective function contains \|R^T x_i\|_2, since after projection \|R^T x_i\|_2 \neq 1. However, adding this new term to the denominator makes the optimization problem hard to solve. We propose to relax it by optimizing the linear correlation instead of the angle:

Q(B, R) = \arg\max_{B, R} \sum_{i=1}^{n} \frac{b_i^T}{\|b_i\|_2} R^T x_i  s.t.  b_i \in \{0, 1\}^c,  R^T R = I_c.  (4)

This is similar to eq. (2), but the geometric interpretation is slightly different: we are now looking for a projection matrix R to map the d-dimensional data to a lower-dimensional space such that, after the mapping, the data has high linear correlation with a set of landmark points lying on the lower-dimensional hypersphere. Section 3 will demonstrate that this relaxation works quite well in practice.

2.3 Optimization

The objective function in (4) can be written more compactly in matrix form:

Q(\tilde{B}, R) = \arg\max_{\tilde{B}, R} Tr(\tilde{B}^T R^T X)  s.t.  R^T R = I_c,  (5)

where Tr(\cdot) is the trace operator, \tilde{B} is the c x n matrix with columns given by b_i/\|b_i\|_2, and X is the d x n matrix with columns given by x_i. This objective is nonconvex in \tilde{B} and R jointly. To obtain a local maximum, we use a simple alternating optimization procedure as follows.

(1) Fix R, update \tilde{B}. For a fixed R, eq. (5) becomes separable in x_i, and we can solve for each b_i separately. Here, the individual sub-problem for each x_i can be written as

\hat{b}_i = \arg\max_{b_i} \frac{b_i^T}{\|b_i\|_2} (R^T x_i).  (6)

Thus, given a point y_i = R^T x_i in c-dimensional space, we want to find the vertex b_i of the c-dimensional hypercube having the smallest angle with y_i. To do this, we use Algorithm 1 to find b_i for each y_i, and then normalize each b_i back to the unit hypersphere: \tilde{b}_i = b_i/\|b_i\|_2. This yields each column of \tilde{B}. Note that the \tilde{B} found in this way is the global optimum for this subproblem.

(2) Fix \tilde{B}, update R. When \tilde{B} is fixed, we want to find

\hat{R} = \arg\max_R Tr(\tilde{B}^T R^T X) = \arg\max_R Tr(R^T X \tilde{B}^T)  s.t.  R^T R = I_c.  (7)

This is a well-known problem, and its global optimum can be obtained by polar decomposition [5]. Namely, we take the SVD of the d x c matrix X \tilde{B}^T as X \tilde{B}^T = U S V^T, let U_c be the first c singular vectors of U, and finally obtain R = U_c V^T.

The above formulation involves solving two sub-problems in an alternating fashion. The first sub-problem is an integer program, and the second one has non-convex orthogonal constraints. However, in each iteration the global optimum can be obtained for each sub-problem as discussed above. So, each step of the alternating method is guaranteed to increase the objective function. Since the objective function is bounded from above, the procedure is guaranteed to converge. In practice, one needs only a few iterations (fewer than five) for the method to converge. The optimization procedure is initialized by first generating a random binary matrix by making each element 0 or 1 with probability 1/2, and then normalizing each column to unit norm. Note that the optimization is also computationally efficient: the first subproblem takes O(nc log c) time, while the second one takes O(dc^2). 
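The alternating procedure of Sec. 2.3 can be sketched as follows (numpy assumed; `learn_projection` and the synthetic data in the usage example are illustrative, not the authors' implementation):

```python
import numpy as np

def nearest_binary_landmark(y):
    """B-step helper (Algorithm 1): vertex of {0,1}^c with smallest angle to y."""
    ys = np.sort(np.maximum(y, 0))[::-1]
    ks = np.arange(1, len(ys) + 1)
    k = int(ks[np.argmax(np.cumsum(ys) / np.sqrt(ks))])
    b = np.zeros_like(y)
    b[np.argsort(-y)[:k]] = 1
    return b

def learn_projection(X, c, n_iter=5, seed=0):
    """Alternating maximization of Tr(B~^T R^T X), eq. (5).

    X: d x n data matrix (columns are unit-norm, non-negative points).
    Returns R (d x c, orthonormal columns) and the c x n matrix B~."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    # Initialize B~: random 0/1 matrix, columns normalized to the unit sphere.
    B = rng.integers(0, 2, size=(c, n)).astype(float)
    B[:, B.sum(axis=0) == 0] = 1.0             # avoid all-zero columns
    Bt = B / np.linalg.norm(B, axis=0)
    for _ in range(n_iter):
        # R-step: polar decomposition via thin SVD of X B~^T (global optimum).
        U, _, Vt = np.linalg.svd(X @ Bt.T, full_matrices=False)
        R = U @ Vt
        # B-step: solve eq. (6) independently for each projected point.
        Y = R.T @ X
        B = np.stack([nearest_binary_landmark(Y[:, i]) for i in range(n)], axis=1)
        Bt = B / np.linalg.norm(B, axis=0)
    return R, Bt

# Usage on small synthetic non-negative data:
rng = np.random.default_rng(1)
X = np.abs(rng.standard_normal((20, 30)))
X /= np.linalg.norm(X, axis=0)
R, Bt = learn_projection(X, c=8)
```

With the thin SVD, U already has c columns, so R = U V^T directly yields orthonormal columns, matching the polar-decomposition solution of eq. (7).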
This is linear in the data dimension d, which enables us to handle very high-dimensional feature vectors.

2.4 Computation of Cosine Similarity between Binary Codes

Most existing similarity-preserving binary coding methods measure the similarity between pairs of binary vectors using the Hamming distance, which is extremely efficient to compute by bitwise XOR followed by bit count (popcount). By contrast, the appropriate similarity measure for our approach is the cosine of the angle \theta between two binary vectors b and b': cos(\theta) = \frac{b^T b'}{\|b\|_2 \|b'\|_2}. In this formulation, b^T b' can be obtained by bitwise AND followed by popcount, and \|b\|_2 and \|b'\|_2 can be obtained by popcount and a lookup table to find the square root. Of course, if b is the query vector that needs to be compared to every database vector b', then one can ignore \|b\|_2. Therefore, even though the cosine similarity is marginally slower than the Hamming distance, it is still very fast compared to computing similarity of the original data vectors.

3 Experiments

To test the effectiveness of the proposed Angular Quantization-based Binary Codes (AQBC) method, we have conducted experiments on two image datasets and one text dataset. The first image dataset is SUN, which contains 142,169 natural scene images [27]. Each image is represented by a 1000-dimensional bag of visual words (BoW) feature vector computed on top of dense SIFT descriptors. The BoW vectors are power-normalized by taking the square root of each entry, which has been shown to improve performance for recognition tasks [19]. The second dataset contains 122,530 images from ImageNet [6], each represented by a 5000-dimensional vector of locality-constrained linear coding (LLC) features [25], which are improved versions of BoW features. 
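The cosine computation of Sec. 2.4 can be sketched on codes packed into Python integers (the packing helper is illustrative; a lookup table for the square roots is omitted for clarity):

```python
import numpy as np

def pack(bits):
    """Pack a 0/1 vector into a single Python integer."""
    code = 0
    for bit in bits:
        code = (code << 1) | int(bit)
    return code

def popcount(v):
    return bin(v).count("1")               # number of set bits

def cosine_binary(a, b):
    """cos(theta) = popcount(a & b) / (sqrt(popcount(a)) * sqrt(popcount(b)))."""
    return popcount(a & b) / (np.sqrt(popcount(a)) * np.sqrt(popcount(b)))

b1 = np.array([1, 0, 1, 1, 0, 0, 1, 0])
b2 = np.array([1, 1, 1, 0, 0, 0, 1, 0])
fast = cosine_binary(pack(b1), pack(b2))   # 3 common ones, 4 ones each -> 0.75
ref = b1 @ b2 / (np.linalg.norm(b1) * np.linalg.norm(b2))
print(np.isclose(fast, ref))               # True
```

The bitwise AND plus popcount reproduces b^T b' exactly, which is why the packed computation matches the dense dot product.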
Dense SIFT is also used as the local descriptor in this case. The third dataset is 20 Newsgroups,⁵ which contains 18,846 text documents and 26,214 words. Tf-idf weighting is used for each text document BoW vector. The feature vectors for all three datasets are sparse, non-negative, and normalized to unit L2 norm. Due to this, Euclidean distance directly corresponds to cosine similarity, as dist^2 = 2 - 2 sim. Therefore, in the following, we will talk about similarity and distance interchangeably.

To perform evaluation on each dataset, we randomly sample and fix 2000 points as queries, and use the remaining points as the "database" against which the similarity searches are run. For each query, we define the ground truth neighbors as all the points within the radius determined by the average distance to the 50th nearest neighbor in the dataset, and plot precision-recall curves of database points ordered by decreasing similarity of their binary codes with the query. This methodology is similar to that of other recent works [7, 20, 26]. Since our AQBC method is unsupervised, we compare with several state-of-the-art unsupervised binary coding methods: Locality Sensitive Hashing (LSH) [4], Spectral Hashing (SH) [26], Iterative Quantization (ITQ) [7], Shift-invariant Kernel LSH (SKLSH) [20], and Spherical Hashing (SPH) [9]. Although these methods are designed to work with the Euclidean distance, they can be directly applied here since all the vectors have unit norm. We use the authors' publicly available implementations and suggested parameters for all the experiments.

Results on SUN and ImageNet. The precision-recall curves for the SUN dataset are shown in Fig. 3. For all the code lengths (from 64 to 1000 bits), our method (AQBC) performs better than other state-of-the-art methods. 
For a relatively large number of bits, SKLSH works much better than the other methods, while still being worse than ours. It is interesting to verify how much we gain by using the learned data-dependent quantization instead of the data-independent naive version (Sec. 2.1). Since the naive version can only learn a d-bit code (1000 bits in this case), its performance (AQBC naive) is shown only in Fig. 3 (c). The performance is much worse than that of the learned codes, which clearly shows that adapting quantization to the data distribution is important in practice. Fig. 4 shows results on ImageNet. On this dataset, the strongest competing method is ITQ. For a relatively low number of bits (e.g., 64), AQBC and ITQ are comparable, but AQBC has a clearer advantage as the number of bits increases. This is because for fewer bits, the Hamming weight (m) of the binary codes tends to be small, resulting in larger distortion error, as discussed in Sec. 2.1. We also found that the SPH [9] method works well for relatively dense data, while it does not work very well for high-dimensional sparse data.

⁵http://people.csail.mit.edu/jrennie/20Newsgroups

Figure 3: Precision-recall curves for different methods on the SUN dataset (64, 256, and 1000 bits).

Figure 4: Precision-recall curves for different methods on the ImageNet120K dataset (64, 256, and 1024 bits).

Results on 20 Newsgroups. The results on the text features (Fig. 5) are consistent with those on the image features. Because the text features are the sparsest and have the highest dimensionality, we would like to verify whether learning the projection R helps in choosing landmarks with larger m, as conjectured in Sec. 2.2. The average empirical distribution over sorted vector elements for this data is shown in Fig. 6 (a) and the scaled cumulative sum in Fig. 6 (b). 
It is clear that the vector elements have a rapidly decaying distribution, and the quantization leads to codes with low m, implying higher quantization error. Fig. 6 (c) shows the distribution of entries of the vector R^T x, which decays more slowly than the original distribution in Fig. 6 (a). Fig. 6 (d) shows the scaled cumulative sum for the projected vectors, indicating a much higher m.

Timing. Table 1 compares the binary code generation time and retrieval speed for different methods. All results are obtained on a workstation with 64GB RAM and a 4-core 3.4GHz CPU. Our method involves a linear projection and quantization using Algorithm 1, while ITQ and LSH only involve linear projections and thresholding. SPH involves Euclidean distance computation and thresholding. SH and SKLSH involve linear projection, nonlinear mapping, and thresholding. The results show that the quantization step (Algorithm 1) of our method is fast, adding very little to the coding time. The coding speed of our method is comparable to that of LSH, ITQ, SPH, and SKLSH.

Figure 5: Precision-recall curves for different methods on the 20 Newsgroups dataset (64, 256, and 1024 bits).

Figure 6: Effect of projection on Hamming weight m for 20 Newsgroups data. (a) Distribution of sorted vector entries, (b) scaled cumulative function, (c) distribution over vector elements after learned projection, (d) scaled cumulative function for the projected data. 
For (a, b) we show only the top 1000 entries for better visualization. For (c, d), we project the data to 1000 dimensions.

Table 1: Timing results. (a) Average binary code generation time per query (milliseconds) on 5000-dimensional LLC features. For the proposed AQBC method, the first number is projection time and the second is quantization time. (b) Average retrieval time per query, i.e., exhaustive similarity computation against the 120K ImageNet images. Computation of Euclidean distance on this dataset takes 11580 ms.

(a) Code generation time (ms):
code size | SH    | LSH  | ITQ  | SKLSH | SPH  | AQBC
64 bits   | 2.20  | 0.14 | 0.14 | 0.21  | 0.33 | 0.14 + 0.09 = 0.23
512 bits  | 40.38 | 3.66 | 3.66 | 3.94  | 5.81 | 3.66 + 0.55 = 4.21

(b) Retrieval time (ms):
code size | Hamming | Cosine
64 bits   | 2.4     | 3.4
512 bits  | 15.8    | 20.4

As shown in Table 1(b), computation of cosine similarity is slightly slower than that of Hamming distance, but both are orders of magnitude faster than Euclidean distance.

4 Discussion

In this work, we have introduced a novel method for generating binary codes for non-negative frequency/count data. Retrieval results on high-dimensional image and text datasets have demonstrated that the proposed codes accurately approximate neighbors in the original feature space according to cosine similarity. Note, however, that our experiments have not focused on evaluating the semantic accuracy of the retrieved neighbors (i.e., whether these neighbors tend to belong to the same high-level category as the query). To improve the semantic precision of retrieval, our earlier ITQ method [7] could take advantage of a supervised linear projection learned from labeled data with the help of canonical correlation analysis. For the current AQBC method, it is still not clear how to incorporate supervised label information into learning of the linear projection. 
We have performed some preliminary evaluations of semantic precision using unsupervised AQBC, and we have found that it works very well for retrieving semantic neighbors in extremely high-dimensional sparse data (such as the 20 Newsgroups dataset), while ITQ currently works better for lower-dimensional, denser data. In the future, we plan to investigate how to improve the semantic precision of AQBC using either unsupervised or supervised learning. Additional resources and code are available at http://www.unc.edu/~yunchao/aqbc.htm

Acknowledgments. We thank Henry A. Rowley and Ruiqi Guo for helpful discussions, and the reviewers for helpful suggestions. Gong and Lazebnik were supported in part by NSF grants IIS 0916829 and IIS 1228082, and the DARPA Computer Science Study Group (D12AP00305).

References

[1] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. JMLR, 2005.
[2] A. Bergamo, L. Torresani, and A. Fitzgibbon. PiCoDes: Learning a compact code for novel-category recognition. NIPS, 2011.
[3] A. Broder. On the resemblance and containment of documents. Compression and Complexity of Sequences, 1997.
[4] M. S. Charikar. Similarity estimation techniques from rounding algorithms. STOC, 2002.
[5] X. Chen, B. Bai, Y. Qi, Q. Lin, and J. Carbonell. Sparse latent semantic analysis. SDM, 2011.
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.
[7] Y. Gong and S. Lazebnik. Iterative quantization: A Procrustean approach to learning binary codes. CVPR, 2011.
[8] J. He, R. Radhakrishnan, S.-F. Chang, and C. Bauer. Compact hashing with joint optimization of search accuracy and time. CVPR, 2011.
[9] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon. Spherical hashing. CVPR, 2012.
[10] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. CVPR, 2008.
[11] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. NIPS, 2009.
[12] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. ICCV, 2009.
[13] P. Li and C. König. Theory and applications of b-bit minwise hashing. Communications of the ACM, 2011.
[14] P. Li, A. Shrivastava, J. Moore, and C. König. Hashing algorithms for large-scale learning. NIPS, 2011.
[15] W. Liu, S. Kumar, and S.-F. Chang. Hashing with graphs. ICML, 2011.
[16] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. CVPR, 2012.
[17] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[18] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. ICML, 2011.
[19] F. Perronnin, J. Sanchez, and Y. Liu. Large-scale image categorization with explicit data embedding. CVPR, 2010.
[20] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. NIPS, 2009.
[21] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 2009.
[22] A. Shrivastava and P. Li. Fast near neighbor search in high-dimensional binary data. ECML, 2012.
[23] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. CVPR, 2008.
[24] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. CVPR, 2010.
[25] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. CVPR, 2010.
[26] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. NIPS, 2008.
[27] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from Abbey to Zoo. CVPR, 2010.
[28] G. K. Zipf. The Psychobiology of Language. Houghton-Mifflin, 1935.