{"title": "Multiscale Quantization for Fast Similarity Search", "book": "Advances in Neural Information Processing Systems", "page_first": 5745, "page_last": 5755, "abstract": "We propose a multiscale quantization approach for fast similarity search on large, high-dimensional datasets. The key insight of the approach is that quantization methods, in particular product quantization, perform poorly when there is large variance in the norms of the data points. This is a common scenario for real- world datasets, especially when doing product quantization of residuals obtained from coarse vector quantization. To address this issue, we propose a multiscale formulation where we learn a separate scalar quantizer of the residual norm scales. All parameters are learned jointly in a stochastic gradient descent framework to minimize the overall quantization error. We provide theoretical motivation for the proposed technique and conduct comprehensive experiments on two large-scale public datasets, demonstrating substantial improvements in recall over existing state-of-the-art methods.", "full_text": "Multiscale Quantization for Fast Similarity Search\n\nXiang Wu Ruiqi Guo Ananda Theertha Suresh Sanjiv Kumar\n\nDan Holtmann-Rice David Simcha Felix X. Yu\n\n{wuxiang, guorq, theertha, sanjivk, dhr, dsimcha, felixyu}@google.com\n\nGoogle Research, New York\n\nAbstract\n\nWe propose a multiscale quantization approach for fast similarity search on large,\nhigh-dimensional datasets. The key insight of the approach is that quantization\nmethods, in particular product quantization, perform poorly when there is large\nvariance in the norms of the data points. This is a common scenario for real-\nworld datasets, especially when doing product quantization of residuals obtained\nfrom coarse vector quantization. 
To address this issue, we propose a multiscale\nformulation where we learn a separate scalar quantizer of the residual norm scales.\nAll parameters are learned jointly in a stochastic gradient descent framework to\nminimize the overall quantization error. We provide theoretical motivation for the\nproposed technique and conduct comprehensive experiments on two large-scale\npublic datasets, demonstrating substantial improvements in recall over existing\nstate-of-the-art methods.\n\n1 Introduction\n\nLarge-scale similarity search is central to information retrieval and recommendation systems for\nimages, audio, video, and textual information. For high-dimensional data, several hashing-based\nmethods have been proposed, including randomized [19, 1, 32] and learning-based techniques\n[34, 35, 15]. Another set of techniques, based on quantization, has become popular recently due to\nits strong performance on real-world data. In particular, product quantization (PQ) [12, 20] and its\nvariants have regularly claimed top spots on public benchmarks such as GIST1M, SIFT1B [20] and\nDEEP10M [3].\nIn product quantization, the original vector space is decomposed into a Cartesian product of\nlower-dimensional subspaces, and vector quantization is performed in each subspace independently. Vector\nquantization (VQ) approximates a vector x ∈ R^{dim(x)} by finding the closest quantizer in a codebook\nC:\n\nVQ(x; C) = argmin_{c ∈ {C_j}} ‖x − c‖_2\n\nwhere C ∈ R^{dim(x)×m} is a vector quantization codebook with m codewords, and the j-th column C_j\nrepresents the j-th quantizer. 
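As a concrete illustration, the vector quantization step above, and product quantization as independent VQ per subspace, can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's implementation; codebooks are stored row-wise here rather than column-wise.

```python
import numpy as np

def vq(x, C):
    """VQ(x; C): the codeword of C (one per row) closest to x in l2 distance."""
    return C[np.argmin(np.linalg.norm(C - x, axis=1))]

def pq(x, subcodebooks):
    """PQ(x; {S^(k)}): split x into K subvectors, vector quantize each
    independently with its own codebook, and concatenate the results."""
    K = len(subcodebooks)
    return np.concatenate([vq(v, S)
                           for v, S in zip(np.split(x, K), subcodebooks)])
```

Asymmetric distance computation, discussed next, keeps the query exact and compares it only against such quantized database vectors.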
Similarly, product quantization (PQ) with K subspaces can be defined\nas the following concatenation:\n\nPQ(x; S = {S^{(k)}}) = [VQ(x^{(1)}; S^{(1)}); ··· ; VQ(x^{(K)}; S^{(K)})]    (1)\n\nwhere x^{(k)} denotes the subvector of x in the k-th subspace, and S = {S^{(k)}}, with S^{(k)} ∈ R^{dim(x^{(k)})×l}, is a collection of\nK product quantization codebooks, each with l sub-quantizers.\nProduct quantization works well in large part due to the fact that it permits asymmetric distance\ncomputation [20], in which only dataset vectors are quantized while the query remains unquantized.\nThis is more precise than techniques based on Hamming distances (which generally require hashing\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Variance in data point norms poses a challenge to product quantization. (a) PQ quantization\nerror on a synthetic dataset X ∈ R^{d×N} grows as the standard deviation of data point norms ‖x‖_2\nincreases. The mean of the dataset is zero, µ(x) = 0, and the average squared norm is fixed,\nµ(‖x‖_2^2) = 1. In both settings, m = 16 codes are generated per data point: one with l = 16\nsub-quantizers per subspace, the other with l = 256. (b) Ratio between the standard deviation\nσ(‖r_x‖_2) and the normalization factor √µ(‖r_x‖_2^2), where r_x represents the residual after vector (coarse)\nquantization on the real-world dataset SIFT1M.\n\nthe query), while still being efficient to compute using lookup table operations. We will give a more\ndetailed background on product quantization variants in Section 1.2.\n\n1.1 Motivation of Multiscale\nEmpirically, product quantization works best when the variance in each subspace is roughly\nbalanced [20]. To ensure this, a rotation matrix is often applied to the data prior to performing\nquantization. 
This rotation can be either random [20] or learned [11, 30, 39].\nIn this work, however, we argue that the quality of the product quantization codebook also degenerates\nwhen there is variance in the norms of the data points being encoded, even when the variance is\nrelatively moderate. We illustrate this point by generating synthetic datasets such that: (1) the dataset\nmean is zero; (2) data point direction is chosen uniformly at random; (3) the average squared norm\nof the data points is fixed. In Figure 1a, we plot the quantization error (MSE) of product quantization\nagainst the standard deviation of the norms of the data points. Clearly, quantization error increases\nwith the variance of the data point norms. In real-world settings (Figure 1b), the residuals of a coarse\nvector quantization of the data also tend to have highly varying norms.\nTo compensate for the case when there is large variance in norms, we modify the formulation of\nproduct quantization by separately scalar quantizing data point norms, and then unit-normalizing the\ndata points before applying product quantization. When computing asymmetric distances, this simply\nrequires a scalar multiplication of the PQ codebook once per scalar quantizer, which has negligible\ncomputational cost in practice.\nTo scale quantization based search techniques to massive datasets, a popular strategy is to first\nvector quantize the input vectors in the original space (coarse quantization), and then apply product\nquantization on the vector quantization residuals [20]. However, in such a ‘VQ-PQ’ style approach,\nthe norms of the residuals exhibit significant variance. Therefore, the proposed multiscale approach\nprovides significant gains for massive search even when the original data is fully normalized.\n\n1.2 Related Works\nThe original idea of product quantization traces back to early works in signal processing [14, 12].\nJégou et al. 
[20] first introduced efficient asymmetric distance computation (ADC) and applied it to\nthe approximate nearest neighbor (ANN) search problem. Since then, there have been multiple lines\nof work focused on improving PQ.\nCoarse Quantizer. Also termed inverted file (IVF) indexing structure in Jégou et al. [20], this\napproach learns a vector quantization of the data points via clustering, using the cluster indices to\nform an inverted index storing all data points corresponding to a given cluster index consecutively.\nA data point is encoded via PQ codes associated with the residual (offset) of the data point from its\nclosest cluster center. This design enables non-exhaustive search by searching only a subset of the m\n\n\fclusters/partitions in the inverted index. However, previous works have learned coarse quantizers as a\nseparate preprocessing step, without training the coarse quantizers jointly with the PQ codebooks.\nRotation Matrix. Since PQ quantizes each subspace independently, a rotation matrix can be applied\nto reduce the intra-subspace statistical dependence. Researchers have proposed multiple ways to\nestimate such a rotation matrix: Norouzi and Fleet [30] use ITQ [13] style alternating quantization;\nOptimized PQ [11] applies a simple strategy to minimize the quantization error; Locally\nOptimized PQ [22] learns a separate R for each coarse partition (and incurs the extra overhead\nof multiplying each local rotation with the query vector to compute lookup tables specific to each\npartition). In the high-dimensional setting, Zhang et al. [39] address the scalability issue in learning the\nd × d rotation matrix by imposing a Kronecker product structure. While learning such orthogonal\ntransformations is a good strategy in general, it does not change the norm of data points. Thus it still\nsuffers from norm variance as discussed in Section 1.1.\nAdditive Codebooks. 
Another line of research is focused on learning additive codebooks instead of\nsubspace codebooks. This includes additive quantization [5, 6, 26], composite quantization [37, 38]\nand stacked quantization [27]. Since they do not work in subspaces, additive codebooks don’t require\nrotation, although they are harder to learn and more expensive to encode. Empirically, such additive\ncodebooks are more expressive, and outperform OPQ at lower bitrates. However, OPQ achieves\nsimilar performance at higher bitrates. Since additive codebooks don’t address the variance of data\npoint norms, the proposed multiscale approach can be applied to additive codebooks as well.\nImplementation Improvements. Much effort has been put into optimizing the implementation of\nADC, as it is computationally critical. Douze et al. [10] propose using Hamming distance for fast\npruning. Johnson et al. [21] present an efficient GPU implementation for ADC lookup. André\net al. [2] propose to use SIMD-based computation to compute lower bounds for ADC. Our method is\ncompatible with all of these improvements. We also discuss our ADC implementation in Section 4.4.\nNon-quantization Techniques. There is a large body of similarity search literature on\nnon-quantization based methods in both inner product search and nearest neighbor search. Tree based\nmethods [7, 29, 9], graph based methods [16] and locality sensitive hashing style algorithms [19, 1, 32]\nfocus on non-exhaustive search by partitioning the search space. In practice, these often lead to\nrandom memory accesses, and are often combined with exhaustive methods in ways similar to\nIVFADC [20, 4, 31, 28]. Binary embedding based approaches [36, 24, 18, 13, 17, 25] focus on\nlearning short binary codes, which can be searched efficiently in Hamming space. However, there\nis typically a large gap between the precision of distance computations in Hamming vs. 
product\ncodes under the same bitrate, and ADC can be computed with similar speed ([2, 21], Section 4.4).\nTherefore, we focus on comparison to ADC-based techniques in this paper.\n\n1.3 Contributions\nWe propose a complete end-to-end training algorithm to learn coarse quantizers, a rotation matrix,\nand product quantization codebooks, together with scalar quantizers to capture coarse quantization\nresidual norms. This differs from prior work in that it (a) identifies and addresses the problem of\nvariance in data point norms; (b) includes coarse quantizers as a part of the optimization; and (c) is\nend-to-end trainable using stochastic gradient descent (SGD), which leads to a significant improvement\nin quantization error compared to previous methods using alternating optimization [30, 11]. We\nalso present ablation tests demonstrating the importance of each component of the algorithm in\nSection 4.2. In addition, we present theoretical motivation for our approach in Section 3.\n\n2 Methodology\n\nWe focus on minimizing the quantization error ‖x − x̃‖_2, where x is a data point and x̃ is its\nquantized approximation, as a proxy for minimizing the query-database distance approximation error\n|‖q − x‖_2 − ‖q − x̃‖_2|. State-of-the-art quantization techniques take a hierarchical approach [11, 27].\nFor instance, one or more “coarse” quantization stages (VQ) can be followed by product quantization\n(PQ) of the vector quantization residuals. A learned rotation is often applied to the residuals prior to\nproduct quantization to further reduce quantization error [11].\nThis style of approach provides two key benefits:\n\n\f1. Real world data is often clusterable, with the diameter of clusters substantially lower than the\ndiameter of the dataset as a whole. 
The vector quantization can thus be used to obtain a “residual\ndataset” with much smaller diameter, yielding significant reductions in quantization error when\nquantized with only a product code [15].\n\n2. By additionally learning a rotation of the VQ residuals, the variance within each PQ subspace can\nbe significantly reduced for many real world datasets, yielding substantially lower quantization\nerror and correspondingly higher recall.\n\nAs noted in Section 1.1, an additional source of quantization error when performing product\nquantization is the variance of data point norms. We extend the above strategy by explicitly representing the\nnorm of VQ residuals, learning a PQ codebook only on the unit-normalized rotated VQ residuals,\nwhile separately scalar quantizing the residual norm scales. Specifically, multiscale quantization\nemploys the following steps: (1) vector quantization of the dataset; (2) learned rotation of the vector\nquantization residuals; (3) reparameterization of the rotated residuals into direction and scale\ncomponents; (4) product quantization of the direction component; (5) scalar quantization of the scale\ncomponent.\nFormally, in multiscale quantization, the rotated residual r_x and its ℓ2-normalized version r̂_x are\ndefined as:\n\nr_x = R(x − VQ(x)),    r̂_x = r_x/‖r_x‖_2\n\nand a data point x ∈ R^d is approximated by\n\nx ≈ x̃ = VQ(x) + r̃_x, where r̃_x = SQ(λ_x) R^T PQ(r̂_x) and λ_x = ‖r_x‖_2/‖PQ(r̂_x)‖_2    (2)\n\nFrom the above, VQ(x) = argmin_{c ∈ {C_j}} ‖x − c‖_2 returns the closest vector quantization codeword for\nx; C ∈ R^{d×m} is a vector quantization codebook with m codewords; C_j is its j-th codeword (i.e. 
the\nj-th column of C); the matrix R ∈ R^{d×d} is a learned rotation, applied to the residuals of vector\nquantization; the residual norm scale λ_x is a scalar multiplier applied to the product-quantized PQ(r̂_x) that\nhelps preserve the ℓ2 norm of the rotated residual r_x; and SQ returns the nearest scalar quantizer\nfrom a scalar quantization codebook W ∈ R^p with p codewords (equivalent to one-dimensional\nvector quantization). The product quantizer PQ(r̂_x) is given by\n\nPQ(r̂_x) = [φ^{(1)}_{PQ}(r̂^{(1)}_x); φ^{(2)}_{PQ}(r̂^{(2)}_x); ··· ; φ^{(K)}_{PQ}(r̂^{(K)}_x)],    r̂_x = [r̂^{(1)}_x; r̂^{(2)}_x; ··· ; r̂^{(K)}_x]\n\nas the concatenation of codewords obtained by dividing the rotated and normalized residual r̂_x\ninto K subvectors r̂^{(k)}_x, k = 1, 2, ··· , K, and quantizing the subvectors independently by vector\nquantizers φ^{(k)}_{PQ}(·) to minimize quantization error:\n\nφ^{(k)}_{PQ}(r̂^{(k)}_x) = argmin_{s ∈ {S^{(k)}_j}} ‖r̂^{(k)}_x − s‖_2.\n\nHence, S^{(k)} ∈ R^{d^{(k)}×l} is the vector quantization codebook for the k-th subspace (with l codewords).\nFrequently, d^{(k)}, the dimension of the k-th subvector, is simply d/K, although subvectors of varying\nsize are also possible.\nThe quantized, normalized residuals are represented by the K indices index(φ^{(k)}_{PQ}(r̂^{(k)}_x)), k =\n1, ··· , K. This representation has an overall bitrate of K log_2 l, where K is the number of subspaces,\nand l is the number of product quantizers in each subspace. The residual norm scales are maintained by\norganizing the residuals associated with a VQ partition into groups, where within a group all residuals\nhave the same quantized norm scale. The groups are ordered by quantized norm scale, and thus\nonly the indices of group boundaries need to be maintained. The total storage cost including group\nboundaries and scalar quantization levels is thus O(mp), where m is the number of vector quantizers and\np is the number of scalar quantizers. In our experiments, we set p to 8, which we find has a negligible\neffect on recall compared with using unquantized norm scales.\n\n\f2.1 Efficient Search under Multiscale Quantization\nThe multiscale quantization model enables nearest neighbor search to be carried out efficiently. For\na query q, we compute the squared ℓ2 distance of q with each codeword in the vector quantization\ncodebook C, and search further within the nearest VQ partition. Suppose the corresponding quantizer\nis c*_q = argmin_{c ∈ {C_j}} ‖q − c‖_2, and the corresponding quantization partition is P*_q = {x ∈\n{X_j}_{j ∈ [N]} | VQ(x) = c*_q}. Then, the approximate squared ℓ2 distances between the query and database\npoints in P*_q are computed using a lookup table. The final prediction is made by taking the database\npoint with the smallest approximate distance, i.e.\n\nx^pred_q = argmin_{x ∈ P*_q} ‖q − c*_q‖_2^2 − 2[R(q − c*_q)] · [SQ(λ_x)PQ(r̂_x)] + ‖SQ(λ_x)PQ(r̂_x)‖_2^2.\n\nWe use a lookup table to compute the quantized inner product between subvectors of the query’s\nrotated VQ residual q̄ = R(q − c*_q) and the scaled product-quantized data point residuals\nSQ(λ_x)PQ(r̂_x). Letting q̄^{(k)} be the k-th subvector of q̄ and w_x = SQ(λ_x) the quantized norm\nscale, we first precompute the inner products and the squared quantized ℓ2 norms with the PQ codebook S\nas v^{(k)}_j = −2q̄^{(k)} · w_x S^{(k)}_j + w_x^2 ‖S^{(k)}_j‖_2^2 for all j and k, giving K lookup tables v^{(1)}, . . . , v^{(K)}, each\nof dimension l. We can then compute\n\n−2q̄ · w_x PQ(r̂_x) + w_x^2 ‖PQ(r̂_x)‖_2^2 = Σ_{k=1}^{K} v^{(k)}_{index(φ^{(k)}_{PQ}(r̂^{(k)}_x))}\n\nIn practice, instead of searching only one vector quantization partition, one can use soft vector\nquantization and search the t partitions with the lowest ‖q − C_j‖_2. The final complexity of the search\nis O(NtK/m).\nIn our implementation, since all the data points with the same quantized norm scale are stored in\nconsecutive groups, we need only create a new lookup table at the beginning of a new group, by\ncombining scale-independent lookup tables of −2q̄^{(k)} · S^{(k)}_j and ‖S^{(k)}_j‖_2^2 (multiplied by w_x and w_x^2,\nrespectively) using hardware-optimized fused multiply-add instructions. We incur this computation\ncost only p times for a VQ partition, where p is the number of scalar quantizers. In our experiments, we\nset p = 8 and the number of VQ partitions to search t = 8, maintaining relatively low performance\noverhead. We discuss more on the lookup table implementation in Section 4.4.\n2.2 Optimization Procedure\nWe can explicitly formulate the mean squared loss as a function of our parameter vector Θ =\n(C; {S^{(k)}}_{[K]}; R; {W_i}_{[m]}) per our approximation formulation (2). W_i here represents the\nparameter vector for the scalar quantizer of norm scales in partition i. To jointly train the parameters\nof the model, we use stochastic gradient descent. To optimize the orthogonal transformation of\nvector quantization residuals while maintaining orthogonality, we parameterize it via the Cayley\ncharacterization of orthogonal matrices [8]:\n\nR = (I − A)(I + A)^{−1},    (3)\n\nwhere A is a skew-symmetric matrix, i.e. A = −A^T. Note that (3) is differentiable w.r.t. the\nd(d − 1)/2 parameters of A. Computing the gradient requires an inversion of a d × d matrix at each\niteration. 
However, we found this tradeoff to be acceptable for datasets with dimensionalities in the\nhundreds to thousands. When applying this method on high-dimensional datasets, one can restrict the\nnumber of parameters of A to trade off capacity and computational cost.\nThe codebook for vector quantization is initialized using random samples from the dataset, while\nthe codebook for product quantization is initialized using the residuals (after vector quantization,\nnormalization and rotation) of a set of independent samples. To allow the vector quantization a\nchance to partition the space, we optimize only the vector quantization error for several epochs before\ninitializing the product codes and doing full joint training. The parameters of the skew-symmetric\nmatrix A were initialized by sampling from N(0, 1e−3).\nAll optimization parameters were fixed for all datasets (although we note it would be possible to\nimprove results slightly with more extensive per-dataset tuning). We used the Adam optimization\nalgorithm [23] with the parameters suggested by the authors, minibatch sizes of 2000, and a learning\nrate of 1e−4 during joint training (and 1e−3 when training only the vector quantizers).\n\n\fTo learn the scalar quantizer for residual norm scales and capture their local distribution within a\nVQ partition, we jointly optimize the assignment of PQ codes and the scalar quantizer for all data\npoints within the same partition. Leaving the PQ codebook and rotation fixed, we alternate between\nthe following two steps until convergence:\n\n1. Fix all assigned PQ codes and scalar quantize the norm scales λ_x = ‖r_x‖_2/‖PQ(r̂_x)‖_2 only\nwithin the partition.\n2. 
Fix all quantized norm scales within the partition and reassign PQ codes for r_x/SQ(λ_x).\n\nIn practice, it only takes a few iterations to converge to a local minimum for every VQ partition.\n\n3 Analysis\n\nBelow we provide theoretical motivation and analysis for the components of the proposed quantization\napproach, including the multiscale, learned rotation, and coarse quantization stages.\n\n3.1 Multiscale\n\nWe first show that adding a scalar quantizer further increases the recall when the norms of the residuals\nexhibit large variance. For a query q and a given partition with center C_j, if we define q_j = q − C_j,\nthen the ℓ2 error caused by residual quantization is\n\n|‖q_j − r_x‖_2^2 − ‖q_j − r̃_x‖_2^2| = |−2q_j · (r_x − r̃_x) + ‖r_x‖_2^2 − ‖r̃_x‖_2^2|\n≤ |2q_j · (r_x − r̃_x)| + |‖r_x‖_2^2 − ‖r̃_x‖_2^2|.\n\nThe first, query-dependent term can be further transformed as\n\n|2q_j · (r_x − r̃_x)| = 2√([(r_x − r̃_x)^T q_j][q_j^T (r_x − r̃_x)]) = 2√((r_x − r̃_x)^T (q_j q_j^T)(r_x − r̃_x)).\n\nTaking the expectation w.r.t. q yields\n\nE_q|2q_j · (r_x − r̃_x)| ≤ 2√(E_q[(r_x − r̃_x)^T (q_j q_j^T)(r_x − r̃_x)]) = 2√((r_x − r̃_x)^T E_q(q_j q_j^T)(r_x − r̃_x)),\n\nwhere the inequality follows from Jensen’s inequality. If λ_q is the largest eigenvalue of the covariance\nmatrix E_q(q_j q_j^T), then\n\nE_q|‖q_j − r_x‖_2^2 − ‖q_j − r̃_x‖_2^2| ≤ 2√λ_q ‖r_x − r̃_x‖_2 + |‖r_x‖_2^2 − ‖r̃_x‖_2^2|.\n\nExisting quantization methods have focused on the first term in the error of the ℓ2 distance. However, for\nVQ residuals with large variance in ‖r_x‖_2, the second, quadratic term becomes dominant. By scalar\nquantizing the residual norm scales, especially within each VQ partition locally, we can reduce the\nsecond term substantially and thus improve recall on real datasets.\n\n3.2 Rotation Matrix\n\nPerforming quantization after a learned rotation has been found to work well in practice [13, 30].\nHere we show rotation is required in some scenarios. 
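A toy numpy example of this phenomenon (a constructed illustration, not the formal construction used in the analysis): a dataset that a tiny product code reconstructs exactly incurs noticeable MSE once a rotation couples the coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
# Y lives on {-1, +1}^2: a product code with two codewords per coordinate
# reconstructs it exactly (MSE 0).
Y = rng.choice([-1.0, 1.0], size=(1000, 2))
theta = np.pi / 4                      # rotate by 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = Y @ R.T                            # x_i = R y_i

def product_mse(data, k=2, iters=50):
    """MSE of a per-coordinate (product) quantizer with k centers per
    coordinate, fit by a tiny 1-D Lloyd iteration."""
    total = 0.0
    for d in range(data.shape[1]):
        col = data[:, d]
        centers = np.linspace(col.min(), col.max(), k)
        for _ in range(iters):
            idx = np.argmin(np.abs(col[:, None] - centers[None, :]), axis=1)
            for j in range(k):
                if np.any(idx == j):
                    centers[j] = col[idx == j].mean()
        idx = np.argmin(np.abs(col[:, None] - centers[None, :]), axis=1)
        total += np.mean((col - centers[idx]) ** 2)
    return total
```

Here product_mse(Y) is numerically zero, while product_mse(X) stays bounded away from zero; undoing the rotation before product quantization recovers the exact code.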
Let x_i = Ry_i, 1 ≤ i ≤ n. We show that\nthere exist simple examples where the y_i’s have a product code with small codebook size and MSE\n0, whereas to get any small MSE on the x_i’s one may need to use exponentially many codewords. On\nreal-world datasets, this difference might not be quite so pronounced, but it is still significant and\nhence undoing the rotation can yield significantly better MSE. We provide the following lemma (see\nthe supplementary material for a proof).\nLemma 1. Let X = RY, i.e., for 1 ≤ i ≤ n, x_i = Ry_i. There exists a dataset Y and a rotation\nmatrix R such that a canonical basis product code of size 2 is sufficient to achieve MSE of 0 for Y,\nwhereas any product code on X requires 2^{c·min(d/K,K)ε} codewords to achieve MSE ε‖x‖_max, where\nc is some universal constant and ‖x‖_max is the maximum ℓ2 norm of any data point.\n\n\fFigure 2: (a) Breakdown by contribution to MSE reduction from each component in our model\non the SIFT1M and DEEP10M datasets with different bitrates. The baseline is the original IVFADC\nsetup with no rotation or norm scale quantization. (b) Time spent per query by different distance\ncomputation methods on linear search of a database of size |X| = 2^7, 2^8, 2^9, ··· , 2^16 under 128 bits.\nLower curves indicate faster search time.\n3.3 Coarse Quantization\nWe analyze the proposed vector and product quantization when the data is generated by a K-subspace\nmixture model that captures two properties observed in many real-world data sets: samples belong to\none of several underlying categories, also referred to as components, and within each component the\nresiduals are generated independently in K subspaces. The precise model is defined in Appendix B.\nFor a query q, let x*_q be the sample that minimizes ‖q − x‖_2. 
Let x^VQ_q be the output of the hierarchical\nnearest neighbor algorithm that first finds the nearest cluster center and then searches within that\ncluster. We show that if q is generated independently of x, then with high probability it returns an\nx^VQ_q that is near-optimal.\nTheorem 1. Given n samples from an underlying K-subspace mixture model that has been clustered\ncorrectly and an independently generated query q, with probability 1 − δ,\n\n‖q − x*_q‖_2^2 − ‖q − x^VQ_q‖_2^2 ≤ 8b√((dr^2/2K) log(4n/δ)) + 4r^2√((d^2/2K) log(2n/δ)).\n\nSee Appendix B for a proof. Note that r = max_{x ∈ X} ‖r_x‖_∞ is the maximum value of the residual\nin any coordinate and offers a natural scaling for our problem, and b = max_{x ∈ X} ‖q − x‖_2 is the\nmaximum distance between q and any data point.\n\n4 Experiments\n\n4.1 Evaluation Datasets\nWe evaluate the performance of end-to-end trained multiscale quantization (MSQ) on the SIFT1M [20]\nand DEEP10M [3] datasets, which are often used in benchmarking the performance of nearest\nneighbor search. SIFT1M [20] contains 1 million 128-dimensional SIFT descriptors extracted from\nFlickr images. DEEP10M was introduced in [3], by extracting 96 PCA components from the final\nhidden layer activations of GoogLeNet [33].\nAt training time, each dataset is indexed with 1024 VQ coarse quantizers. At query time, quantized\nresiduals from the 8 partitions closest to the query are further searched using ADC to generate the\nfinal nearest neighbors. We report results both in terms of quantization error (MSE, Section 4.2) and\nretrieval recall (Recall1@N, Section 4.3). Often, the two metrics are strongly correlated.\n\n4.2 Ablation Tests\nCompared to IVFADC [20], which uses plain PQ with coarse quantizers, our end-to-end trained MSQ\nreduces quantization error by 15-20% on SIFT1M, and 20-25% on DEEP10M, which is a substantial\nreduction. 
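The pipeline whose components these ablations isolate, the approximation of eq. (2) in Section 2, can be condensed into a single encode/reconstruct sketch; the function names and flat per-subspace codebooks here are illustrative, not the trained system.

```python
import numpy as np

def nearest_row(codebook, v):
    """Row of the codebook closest to v in l2 distance."""
    return codebook[np.argmin(np.linalg.norm(codebook - v, axis=1))]

def msq_reconstruct(x, C, R, subcodebooks, W):
    """Eq. (2): x ~ VQ(x) + SQ(lambda_x) R^T PQ(r_hat_x)."""
    c = nearest_row(C, x)                         # coarse codeword VQ(x)
    r = R @ (x - c)                               # rotated residual r_x
    r_hat = r / np.linalg.norm(r)                 # unit-norm direction
    p = np.concatenate(                           # PQ of the direction
        [nearest_row(S, v) for v, S in
         zip(np.split(r_hat, len(subcodebooks)), subcodebooks)])
    lam = np.linalg.norm(r) / np.linalg.norm(p)   # norm scale lambda_x
    sq = W[np.argmin(np.abs(W - lam))]            # scalar-quantized scale
    return c + sq * (R.T @ p)
```

Dropping R, the scale quantization, or the joint training from this sketch corresponds to the individual ablations reported in this section.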
Multiple components contribute to this reduction: (1) learned rotation of the VQ residuals;\n(2) separate quantization of the residual norms into multiple scales; and (3) end-to-end training of all\nparameters.\n\n\fFigure 3: Recall curves when retrieving Top-1 neighbors (Recall1@N) on the SIFT1M dataset with\nvarying numbers of codebooks and centers. We search t = 8 out of m = 1024 VQ partitions.\n\nIn order to understand the effect of each component, we plot the MSE reduction relative to\nIVFADC [20] for several ablation tests (Figure 2a). On DEEP10M, the proposed multiscale approach\nand the end-to-end learning contribute an additional 5-10% MSE reduction on top of learned rotation,\nwhile they contribute 10-15% on SIFT1M. It is important to note that on SIFT1M, multiscale\nquantization and end-to-end training have a bigger impact than learned rotation, which is itself often\nconsidered to yield a significant improvement.\n\n4.3 Recall Experiments\nWe compare the proposed end-to-end trained multiscale quantization method against three baseline\nmethods: product quantization (PQ) [20], optimized product quantization (OPQ) [11] and stacked\nquantizers (SQ) [27]. We generate ground-truth results using brute-force search, and compare the\nresults of each method against the ground truth in fixed-bitrate settings.\nFor fixed-bitrate experiments, we show recall curves for varying numbers of PQ codebooks from the\nrange {8, 16, 32} for the SIFT1M dataset and {6, 12, 24} for the DEEP10M dataset. For each number\nof codebooks, we experimented with both 16 centers for in-register table lookup and 256 centers\nfor in-memory table lookup in Figures 3 and 4. From the recall curves, it is clear that multiscale\nquantization performs better than all baselines across both datasets in all settings.\n\n4.4 Speed Benchmarks\nWe use the same indexing structure (IVF), and the same ADC computation implementation for all\nbaselines (PQ [20], OPQ [11], SQ [27]). 
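The shared in-memory lookup-table ADC can be sketched as follows. This is the plain PQ variant; the multiscale version additionally folds the quantized norm scale w_x into the tables, and the function names are illustrative.

```python
import numpy as np

def build_luts(q, subcodebooks):
    """One table per subspace: squared l2 distance from the query's k-th
    subvector to each of the l codewords of that subspace."""
    K = len(subcodebooks)
    return [np.sum((S - v) ** 2, axis=1)
            for v, S in zip(np.split(q, K), subcodebooks)]

def adc_distance(code, luts):
    """Approximate ||q - x||^2 for a database point stored as K codeword
    indices: just K table lookups and adds, with no decoding of x."""
    return sum(lut[j] for lut, j in zip(luts, code))
```

LUT256 keeps 256-entry tables like these in memory; LUT16 quantizes 16-entry tables into SIMD registers so the lookups can be done in parallel.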
Thus the speed of all baselines is essentially identical at the\nsame bitrate, meaning Figures 3 and 4 are both fixed-memory and fixed-time, and thus directly comparable. For codebooks with 256 centers, we implemented an in-memory lookup table (LUT256) [20]; for\ncodebooks with 16 centers, we implemented an in-register lookup table (LUT16) using the VPSHUFB\ninstruction from AVX2, which performs 32 lookups in parallel.\nAlso, we notice that there have been different implementations of ADC. The original algorithm\nproposed in [20] uses in-memory lookup tables. We place tables in SIMD registers and leverage\nSIMD instructions for fast lookup. Similar ideas are also reported in recent literature [10, 17, 2].\nHere we put them on equal footing and provide a comparison of the different approaches. In Figure 2b,\nwe plot the time for distance computation at the same bitrate. Clearly, VPSHUFB-based LUT16\nachieves almost the same speed as POPCNT-based Hamming, and they are both 5x faster\nthan in-memory ADC. As a practical observation, when the number of neighbors to be retrieved\nis large, the Recall1@N of LUT256 and LUT16 is often comparable at the same bitrate, and LUT16\nwith its 5x speedup is almost always preferred.\n\n\fFigure 4: Recall curves when retrieving Top-1 neighbors (Recall1@N) on the DEEP10M dataset with\nvarying numbers of codebooks and centers. We search t = 8 out of m = 1024 VQ partitions.\n5 Conclusions\nWe have proposed an end-to-end trainable multiscale quantization method that minimizes overall\nquantization loss. We introduced a novel scalar quantization approach to account for the variance in\ndata point norms, which is both empirically and theoretically motivated. Together with the end-to-end\ntraining, this contributes to a large reduction in quantization error over existing competing methods that\nalready employ optimized rotation and coarse quantization. 
Finally, we conducted comprehensive\nnearest neighbor search retrieval experiments on two large-scale, publicly available benchmark\ndatasets, and achieved considerable improvements over the state of the art.\n\n6 Acknowledgements\nWe thank Jeffrey Pennington and Chen Wang for their helpful comments and discussions.\n\nReferences\n[1] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. Practical and\noptimal LSH for angular distance. In Advances in Neural Information Processing Systems, pages 1225–1233,\n2015.\n\n[2] Fabien André, Anne-Marie Kermarrec, and Nicolas Le Scouarnec. Cache locality is not enough: high-performance nearest neighbor search with product quantization fast scan. Proceedings of the VLDB\nEndowment, 9(4):288–299, 2015.\n\n[3] A. Babenko and V. Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In 2016\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2055–2063, June 2016.\n\n[4] Artem Babenko and Victor Lempitsky. The inverted multi-index. In Computer Vision and Pattern\nRecognition (CVPR), 2012 IEEE Conference on, pages 3069–3076. IEEE, 2012.\n\n[5] Artem Babenko and Victor Lempitsky. Additive quantization for extreme vector compression. In Computer\nVision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 931–938. IEEE, 2014.\n\n[6] Artem Babenko and Victor Lempitsky. Tree quantization for large-scale similarity search and classification.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4240–4248,\n2015.\n\n[7] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications\nof the ACM, 18(9):509–517, 1975.\n\n[8] Arthur Cayley. Sur quelques propriétés des déterminants gauches. Journal für die reine und angewandte\nMathematik, 32:119–123, 1846.\n\n\f[9] Sanjoy Dasgupta and Yoav Freund. 
Random projection trees and low dimensional manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 537–546. ACM, 2008.

[10] Matthijs Douze, Hervé Jégou, and Florent Perronnin. Polysemous codes. In European Conference on Computer Vision, pages 785–801. Springer, 2016.

[11] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):744–755, April 2014.

[12] Allen Gersho and Robert M. Gray. Vector Quantization and Signal Compression, volume 159. Springer Science & Business Media, 2012.

[13] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.

[14] Robert M. Gray. Vector quantization. ASSP Magazine, IEEE, 1(2):4–29, 1984.

[15] Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. Quantization based fast inner product search. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pages 482–490, 2016.

[16] Ben Harwood and Tom Drummond. FANNG: Fast approximate nearest neighbour graphs. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 5713–5722. IEEE, 2016.

[17] Kaiming He, Fang Wen, and Jian Sun. K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2938–2945, 2013.

[18] Jae-Pil Heo, Youngwoon Lee, Junfeng He, Shih-Fu Chang, and Sung-Eui Yoon. Spherical hashing. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2957–2964.
IEEE, 2012.

[19] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.

[20] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.

[21] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.

[22] Yannis Kalantidis and Yannis Avrithis. Locally optimized product quantization for approximate nearest neighbor search. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2329–2336. IEEE, 2014.

[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[24] Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In Advances in Neural Information Processing Systems, pages 1042–1050, 2009.

[25] Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1–8, 2011.

[26] Julieta Martinez, Joris Clement, Holger H. Hoos, and James J. Little. Revisiting additive quantization. In European Conference on Computer Vision, pages 137–153. Springer, 2016.

[27] Julieta Martinez, Holger H. Hoos, and James J. Little. Stacked quantizers for compositional vector compression. CoRR, abs/1411.2173, 2014.

[28] Yusuke Matsui, Toshihiko Yamasaki, and Kiyoharu Aizawa. PQTable: Fast exact asymmetric distance neighbor search for product quantization using hash tables. In Proceedings of the IEEE International Conference on Computer Vision, pages 1940–1948, 2015.

[29] Marius Muja and David G. Lowe.
Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2227–2240, 2014.

[30] Mohammad Norouzi and David J. Fleet. Cartesian k-means. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3017–3024, 2013.

[31] Mohammad Norouzi, Ali Punjani, and David J. Fleet. Fast exact search in Hamming space with multi-index hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1107–1119, 2014.

[32] Anshumali Shrivastava and Ping Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems, pages 2321–2329, 2014.

[33] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[34] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927, 2014.

[35] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. Learning to hash for indexing big data: A survey. Proceedings of the IEEE, 104(1):34–57, 2016.

[36] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In Advances in Neural Information Processing Systems, pages 1753–1760, 2009.

[37] Ting Zhang, Chao Du, and Jingdong Wang. Composite quantization for approximate nearest neighbor search. In ICML, number 2, pages 838–846, 2014.

[38] Ting Zhang, Guo-Jun Qi, Jinhui Tang, and Jingdong Wang. Sparse composite quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4548–4556, 2015.

[39] Xu Zhang, Felix X. Yu, Ruiqi Guo, Sanjiv Kumar, Shengjin Wang, and Shih-Fu Chang.
Fast orthogonal projection based on Kronecker product. In International Conference on Computer Vision (ICCV), 2015.