{"title": "Norm-Ranging LSH for Maximum Inner Product Search", "book": "Advances in Neural Information Processing Systems", "page_first": 2952, "page_last": 2961, "abstract": "Neyshabur and Srebro proposed SIMPLE-LSH, which is the state-of-the-art hashing based algorithm for maximum inner product search (MIPS). We found that the performance of SIMPLE-LSH, in both theory and practice, suffers from long tails in the 2-norm distribution of real datasets. We propose NORM-RANGING LSH, which addresses the excessive normalization problem caused by long tails by partitioning a dataset into sub-datasets and building a hash index for each sub-dataset independently. We prove that NORM-RANGING LSH achieves lower query time complexity than SIMPLE-LSH under mild conditions. We also show that the idea of dataset partitioning can improve another hashing based MIPS algorithm. Experiments show that NORM-RANGING LSH probes far fewer items than SIMPLE-LSH at the same recall, thus significantly benefiting MIPS based applications.", "full_text": "Norm-Ranging LSH for Maximum Inner Product\n\nSearch\n\nXiao Yan, Jinfeng Li, Xinyan Dai, Hongzhi Chen, James Cheng\n\nDepartment of Computer Science\n\nThe Chinese University of Hong Kong\n\nShatin, Hong Kong\n\n{xyan, jfli, xydai, hzchen, jcheng}@cse.cuhk.edu.hk\n\nAbstract\n\nNeyshabur and Srebro proposed SIMPLE-LSH [2015], which is the state-of-the-art\nhashing based algorithm for maximum inner product search (MIPS). We found\nthat the performance of SIMPLE-LSH, in both theory and practice, suffers from\nlong tails in the 2-norm distribution of real datasets. We propose NORM-RANGING\nLSH, which addresses the excessive normalization problem caused by long tails\nby partitioning a dataset into sub-datasets and building a hash index for each\nsub-dataset independently. We prove that NORM-RANGING LSH achieves lower\nquery time complexity than SIMPLE-LSH under mild conditions. 
We also show that the idea of dataset partitioning can improve another hashing based MIPS algorithm. Experiments show that NORM-RANGING LSH probes far fewer items than SIMPLE-LSH at the same recall, thus significantly benefiting MIPS based applications.

1 Introduction

Given a dataset S ⊂ R^d containing n vectors (also called items) and a query q ∈ R^d, maximum inner product search (MIPS) finds the vector in S that has the maximum inner product with q,

p = arg max_{x∈S} q^T x.    (1)

MIPS may require items with the top k inner products and it usually suffices to return approximate results (i.e., items with inner products close to the maximum). MIPS has many important applications, including recommendation based on user and item embeddings obtained from matrix factorization [Koren et al., 2009], multi-class classification with linear classifiers [Dean et al., 2013], filtering in computer vision [Felzenszwalb et al., 2010], etc.

MIPS is a challenging problem, as modern datasets often have high dimensionality and large cardinality. Initially, tree-based methods [Ram and Gray, 2012, Koenigstein et al., 2012] were proposed for MIPS, which use the idea of branch and bound similar to that of k-d trees [Friedman and Tukey, 1974]. However, these methods suffer from the curse of dimensionality and their performance can be even worse than linear scan when the feature dimension is as low as 20 [Weber et al., 1998]. Shrivastava and Li proposed L2-ALSH [2014], which attains the first provable sub-linear query time complexity for approximate MIPS that is independent of dimensionality. L2-ALSH applies an asymmetric transformation¹ to transform MIPS into L2 similarity search, which can be solved with well-known LSH functions.
Following the idea of L2-ALSH, Shrivastava and Li formulated another pair of asymmetric transformations called SIGN-ALSH [2015] to transform MIPS into angular similarity search, and obtained lower query time complexity than that of L2-ALSH.

¹Asymmetric transformation means that the transformations for the queries and the items are different, while symmetric transformation means the same transformation is applied to the items and queries.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Neyshabur and Srebro showed that asymmetry is not necessary when queries are normalized and items have bounded 2-norm [2015]. They proposed SIMPLE-LSH, which adopts a symmetric transformation and transforms MIPS into angular similarity search similar to SIGN-ALSH. Moreover, they proved that SIMPLE-LSH is a universal LSH for MIPS, while L2-ALSH and SIGN-ALSH are not. SIMPLE-LSH is also parameter-free and avoids the parameter tuning of L2-ALSH and SIGN-ALSH. Most importantly, SIMPLE-LSH achieves superior performance over L2-ALSH and SIGN-ALSH in both theory and practice, and thus is the state-of-the-art hashing based algorithm for MIPS.

SIMPLE-LSH requires the 2-norms of the items to be bounded, which is achieved by normalizing the items with the maximum 2-norm in the dataset. However, real datasets often have long tails in the distribution of 2-norm, meaning that the maximum 2-norm can be much larger than the 2-norms of the majority of the items. As we will show in Section 3.1, the excessive normalization of SIMPLE-LSH makes the maximum inner product between the query and the items small, which harms the performance of SIMPLE-LSH in both theory and practice.

To solve this problem, we propose NORM-RANGING LSH. The idea is to partition the dataset into multiple sub-datasets according to the percentiles of the 2-norm distribution.
For each sub-dataset, NORM-RANGING LSH uses SIMPLE-LSH as a subroutine to build an index independent of the other sub-datasets. As each sub-dataset is normalized by its own maximum 2-norm, which is usually significantly smaller than the maximum 2-norm in the entire dataset, NORM-RANGING LSH achieves lower query time complexity than SIMPLE-LSH. To support efficient query processing, we also formulate a similarity metric which defines a probing order for buckets from different sub-datasets. We compare NORM-RANGING LSH with SIMPLE-LSH and L2-ALSH on three real datasets and show empirically that NORM-RANGING LSH offers up to an order of magnitude speedup.

2 Locality Sensitive Hashing for MIPS

2.1 Locality Sensitive Hashing

A definition of locality sensitive hashing (LSH) [Indyk and Motwani, 1998, Andoni et al., 2018] is given as follows:

Definition 1. (LSH) A family H is called (S0, cS0, p1, p2)-sensitive if, for any two vectors x, y ∈ R^d:

• if sim(x, y) ≥ S0, then P_H[h(x) = h(y)] ≥ p1;
• if sim(x, y) ≤ cS0, then P_H[h(x) = h(y)] ≤ p2.

Note that the original LSH is defined for distance functions; we adopt a formalization adapted for similarity functions [Shrivastava and Li, 2014], which is more suitable for MIPS. For a family of LSH to be useful, it is required that p1 > p2 and 0 < c < 1. Given a family of (S0, cS0, p1, p2)-LSH, a query for c-approximate nearest neighbor search² can be processed with a time complexity of O(n^ρ log n), where ρ = log p1 / log p2. For L2 distance, there exists a well-known family of LSH defined as follows:

h^{L2}_{a,b}(x) = ⌊(a^T x + b) / r⌋,    (2)

where ⌊·⌋ is the floor operation, a is a random vector whose entries follow the i.i.d. standard normal distribution, and b is generated by a uniform distribution over [0, r].
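The L2 hash family in (2) is simple to implement; below is a minimal NumPy sketch (our own illustration; `make_l2_hash` is not a name from the paper):

```python
import numpy as np

def make_l2_hash(dim, r, rng):
    """Draw one function h(x) = floor((a^T x + b) / r) from the L2 LSH family in (2)."""
    a = rng.standard_normal(dim)  # entries are i.i.d. standard normal
    b = rng.uniform(0.0, r)       # offset drawn uniformly from [0, r]
    return lambda x: int(np.floor((a @ x + b) / r))

# Closer points should collide (land in the same bucket) more often.
rng = np.random.default_rng(0)
hashes = [make_l2_hash(2, 2.5, rng) for _ in range(2000)]
x = np.array([1.0, 1.0])
near = x + 0.1
far = x + 5.0
p_near = sum(h(x) == h(near) for h in hashes) / len(hashes)
p_far = sum(h(x) == h(far) for h in hashes) / len(hashes)
```

The empirical collision rates decrease with distance, in line with the collision probability F_r(d) given in (3).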
When a hash function is drawn randomly and independently for each pair of vectors [Wang et al., 2013], the collision probability of (2) is given as:

P_H[h^{L2}_{a,b}(x) = h^{L2}_{a,b}(y)] = F_r(d) = 1 − 2Φ(−r/d) − (2d / (√(2π) r)) (1 − e^{−(r/d)²/2}),    (3)

in which Φ(x) is the cumulative distribution function of the standard normal distribution and d = ‖x − y‖ is the L2 distance between x and y. For angular similarity, sign random projection is an LSH. Its expression and collision probability are given as [Goemans and Williamson, 1995]:

h_a(x) = sign(a^T x),    P_H[h_a(x) = h_a(y)] = 1 − (1/π) cos⁻¹( x^T y / (‖x‖‖y‖) ),    (4)

where the entries of a follow the i.i.d. standard normal distribution.

²c-approximate nearest neighbor search solves the following problem: given parameters S0 > 0 and δ > 0, if there exists an S0-near neighbor of q in S, return some cS0-near neighbor in S with probability at least 1 − δ.

2.2 L2-ALSH

Shrivastava and Li proved that there exists no symmetric LSH for MIPS if the domains of the item x and the query q are both R^d [2014]. They applied a pair of asymmetric transformations, P(x) and Q(q), to the items and the query, respectively:

P(x) = [Ux; ‖Ux‖²; ‖Ux‖⁴; ...; ‖Ux‖^{2^m}];    Q(q) = [q; 1/2; 1/2; ...; 1/2].    (5)

The scaling factor U should ensure that ‖Ux‖ < 1 for all x ∈ S, and the query is normalized to unit 2-norm before the transformation.
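The transformations in (5) can be sketched directly; a small NumPy illustration (our own code, with U chosen so that ‖Ux‖ < 1):

```python
import numpy as np

def P(x, U, m):
    """L2-ALSH item transform: append ||Ux||^2, ||Ux||^4, ..., ||Ux||^(2^m)."""
    Ux = U * x
    n = np.linalg.norm(Ux)
    return np.concatenate([Ux, [n ** (2 ** (i + 1)) for i in range(m)]])

def Q(q, m):
    """L2-ALSH query transform: append m entries equal to 1/2."""
    return np.concatenate([q, np.full(m, 0.5)])

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
q = rng.standard_normal(4)
q /= np.linalg.norm(q)            # queries are normalized to unit 2-norm
U = 0.5 / np.linalg.norm(x)       # ensures ||Ux|| = 0.5 < 1
m = 3

dist_sq = np.linalg.norm(P(x, U, m) - Q(q, m)) ** 2
expected = 1 + m / 4 - 2 * U * (x @ q) + np.linalg.norm(U * x) ** (2 ** (m + 1))
```

Here `dist_sq` matches `expected` up to rounding, so minimizing the L2 distance to Q(q) maximizes q^T x once the tail term vanishes.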
After the transformation, we have:

‖P(x) − Q(q)‖² = 1 + m/4 − 2U x^T q + ‖Ux‖^{2^{m+1}}.    (6)

As the scaling factor U is common to all items and the last term vanishes with sufficiently large m because ‖Ux‖ < 1, (6) shows that the problem of MIPS is transformed into finding the nearest neighbor of Q(q) in terms of L2 distance, which can be solved using the hash function in (2). Given S0 and c, a query time complexity of O(n^ρ log n) can be obtained for c-approximate MIPS with:

ρ = log F_r(√(1 + m/4 − 2US0 + (US0)^{2^{m+1}})) / log F_r(√(1 + m/4 − 2cUS0)).    (7)

It is suggested to use a grid search to find the parameters (m, U and r) that minimize ρ.

2.3 SIMPLE-LSH

Neyshabur and Srebro proved that L2-ALSH is not a universal LSH for MIPS [2015]: that is, for any setting of m, U and r, there always exists a pair of S0 and c such that x^T q = S0 and y^T q = cS0 but P_H[h^{L2}_{a,b}(P(x)) = h^{L2}_{a,b}(Q(q))] < P_H[h^{L2}_{a,b}(P(y)) = h^{L2}_{a,b}(Q(q))]. Moreover, they showed that asymmetry is not necessary if the items have bounded 2-norm and the query is normalized, which is exactly the assumption of L2-ALSH.
They proposed a symmetric transformation to transform MIPS into angular similarity search as follows:

P(x) = [x; √(1 − ‖x‖²)];    P(q)^T P(x) = [q; 0]^T [x; √(1 − ‖x‖²)] = q^T x.    (8)

They apply the sign random projection in (4) to P(x) and P(q) to obtain an LSH for c-approximate MIPS with a query time complexity of O(n^ρ log n), where ρ is given as:

ρ = G(c, S0) = log(1 − cos⁻¹(S0)/π) / log(1 − cos⁻¹(cS0)/π).    (9)

They called their algorithm SIMPLE-LSH as it avoids the parameter tuning process of L2-ALSH. Moreover, SIMPLE-LSH is proved to be a universal LSH for MIPS under any valid configuration of S0 and c. SIMPLE-LSH also obtains better (lower) ρ values than L2-ALSH and SIGN-ALSH in theory and outperforms both of them empirically [Shrivastava and Li, 2015].

3 Norm-Ranging LSH

In this section, we first motivate norm-ranging LSH by showing the problem of SIMPLE-LSH on real datasets, then introduce how norm-ranging LSH (or RANGE-LSH for short) solves the problem.

3.1 SIMPLE-LSH on Real Datasets

We plot the relation between ρ and S0 for SIMPLE-LSH in Figure 1(a). Recall that the query time complexity of SIMPLE-LSH is O(n^ρ log n) and observe that ρ is a decreasing function of S0. As ρ is large when S0 is small, SIMPLE-LSH suffers from poor query performance when the maximum inner product between a query and the items is small. Before applying the transformation in (8), SIMPLE-LSH requires the 2-norms of the items to be bounded by 1, which is achieved by normalizing the items with the maximum 2-norm U = max_{x∈S} ‖x‖.
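A quick numeric sketch (our own, with illustrative values) of the SIMPLE-LSH transformation in (8) shows how a long-tail maximum norm shrinks inner products:

```python
import numpy as np

def simple_lsh_transform(x, U):
    """Scale x by 1/U, then append sqrt(1 - ||x/U||^2) as in SIMPLE-LSH."""
    xs = x / U
    pad = np.sqrt(max(0.0, 1.0 - float(xs @ xs)))  # max() guards rounding noise
    return np.append(xs, pad)

q = np.array([0.6, 0.8])                  # unit-norm query
x = np.array([0.6, 0.8])                  # item with ||x|| = 1 and q^T x = 1
Pq = np.append(q, 0.0)

tight = simple_lsh_transform(x, U=1.0)    # normalizer equals the item's norm
loose = simple_lsh_transform(x, U=10.0)   # normalizer from a long-tail maximum
```

The transform preserves the scaled inner product (Pq·tight = 1, Pq·loose = 0.1), but with the excessive normalizer the appended √(1 − ‖x‖²) entry dominates P(x).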
Assuming q^T x = S for item vector x, we have q^T x = S/U after normalization. If U is significantly larger than ‖x‖, the inner product will be scaled to a small value, and a small inner product leads to high query complexity.

Figure 1: (a) The relation between ρ and S0; (b) 2-norm distribution of the SIFT descriptors in the ImageNet dataset (maximum 2-norm scaled to 1); (c) The distribution of the maximum inner product of the queries after the normalization process of SIMPLE-LSH; (d) The distribution of the maximum inner product of the queries after the normalization process of RANGE-LSH (32 sub-datasets).

Algorithm 1 Norm-Ranging LSH: Index Building
1: Input: Dataset S, dataset size n, number of sub-datasets m
2: Output: A hash index Ij for each of the m sub-datasets
3: Rank the items in S according to their 2-norms;
4: Partition S into m sub-datasets {S1, S2, ..., Sm} such that Sj holds the items whose 2-norms rank in the range [(j−1)n/m, jn/m];
5: for every sub-dataset Sj do
6:   Use Uj = max_{x∈Sj} ‖x‖ to normalize Sj;
7:   Apply SIMPLE-LSH to build index Ij for Sj;
8: end for

We plot the distribution of the 2-norm of a real dataset in Figure 1(b). The distribution has a long tail and the maximum 2-norm is much larger than the 2-norms of the majority of the items. We also plot in Figure 1(c) the distribution of the maximum inner product of the queries after the normalization process of SIMPLE-LSH. The results show that for the majority of the queries, the maximum inner product is small, which translates into a large ρ and poor theoretical performance.

The long tail in the 2-norm distribution also harms the performance of SIMPLE-LSH in practice.
If ‖x‖ is small after normalization, the √(1 − ‖x‖²) term, which is irrelevant to the inner product between x and q, will be dominant in P(x) = [x; √(1 − ‖x‖²)]. In this case, the result of the sign random projection in (4) will be largely determined by the last entry of a, causing many items to be gathered in the same bucket. In our sample run of SIMPLE-LSH on the ImageNet dataset [Deng et al., 2009] with a code length of 32, there are only 60,000 buckets and the largest bucket holds about 200,000 items. Considering that the ImageNet dataset contains roughly 2 million items and 32-bit codes offer approximately 4×10⁹ buckets, these statistics show that the large √(1 − ‖x‖²) term severely degrades bucket balance in SIMPLE-LSH. Bucket balance is important for the performance of binary hashing algorithms such as SIMPLE-LSH because they use Hamming distance to determine the probing order of the buckets [Cai, 2016, Gong et al., 2013]. If the number of buckets is small or some buckets contain too many items, Hamming distance cannot define a good probing order for the items, which results in poor query performance.

3.2 Norm-Ranging LSH

The index building and query processing procedures of RANGE-LSH are presented in Algorithm 1 and Algorithm 2, respectively. To solve the excessive normalization problem of SIMPLE-LSH, RANGE-LSH partitions the items into m sub-datasets according to the percentiles of the 2-norm distribution, so that each sub-dataset contains items with similar 2-norms. Note that ties are broken arbitrarily in the ranking process of Algorithm 1 to ensure that the percentile-based partitioning works even when many items have the same 2-norm.
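The partitioning step of Algorithm 1 amounts to an argsort by 2-norm followed by an even split; a sketch (the function name and return convention are ours):

```python
import numpy as np

def norm_ranging_partition(items, m):
    """Split items into m sub-datasets of near-equal size by ascending 2-norm.

    Returns the sub-datasets and the local maximum 2-norm Uj of each, which
    is then used (instead of the global maximum) to normalize that sub-dataset.
    """
    norms = np.linalg.norm(items, axis=1)
    order = np.argsort(norms)                # ties are broken arbitrarily
    parts = np.array_split(items[order], m)  # percentile-based split
    U = [np.linalg.norm(p, axis=1).max() for p in parts]
    return parts, U

rng = np.random.default_rng(0)
items = rng.standard_normal((1000, 8))
parts, U = norm_ranging_partition(items, m=4)
```

Only the last sub-dataset keeps the global maximum norm; every other Uj is typically smaller, which is what the analysis below exploits.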
Instead of using U, i.e., the maximum 2-norm in the entire dataset, RANGE-LSH uses the local maximum 2-norm Uj = max_{x∈Sj} ‖x‖ in each sub-dataset for normalization, so as to keep the inner products of the queries large. In Figure 1(d), we plot the maximum inner product of the queries after the normalization process of RANGE-LSH. Compared with Figure 1(c), the inner products are significantly larger. As a result, given S0, the ρj of sub-dataset Sj becomes ρj = G(c, S0/Uj), which is smaller than ρ = G(c, S0/U) if Uj < U. The smaller ρ values translate into better query performance. The idea of dataset partitioning is also used in [Andoni and Razenshteyn, 2015] for L2 similarity search, where the partitioning is conducted in a pseudo-random manner. In the following, we prove that RANGE-LSH achieves a lower query time complexity bound than SIMPLE-LSH under mild conditions.

Algorithm 2 Norm-Ranging LSH: Query Processing
1: Input: Hash indexes {I1, I2, ..., Im} for the sub-datasets, query q
2: Output: A c-approximate MIPS answer x* to q
3: for every hash index Ij do
4:   Conduct MIPS with q to get x*_j;
5: end for
6: Select the item in {x*_1, x*_2, ..., x*_m} that has the maximum inner product with q as the answer x*;

Theorem 1.
RANGE-LSH attains a lower query time complexity upper bound than that of SIMPLE-LSH for c-approximate MIPS with sufficiently large n, if the dataset is partitioned into m = n^α sub-datasets and there are at most n^β sub-datasets with Uj = U, where 0 < α < min{ρ, (ρ − ρ*)/(1 − ρ*)}, 0 < β < αρ, ρ* = max_{ρj<ρ} ρj, ρj = G(c, S0/Uj) and ρ = G(c, S0/U).

Proof. Firstly, we prove the correctness of RANGE-LSH, that is, it indeed returns a cS0-approximate answer with probability at least 1 − δ. Note that S0 is a pre-specified parameter common to all sub-datasets rather than the actual maximum inner product in each sub-dataset. If there is an item x* having an inner product of S0 with q in the original dataset, it is certainly contained in one of the sub-datasets. When we conduct MIPS on all the sub-datasets, the sub-dataset containing x* will return an item having inner product at least cS0 with q with probability at least 1 − δ, according to the guarantee of SIMPLE-LSH. The final query result is obtained by selecting the optimal one (the one having the largest inner product with q) from the query answers of all sub-datasets according to Algorithm 2, and it is thus guaranteed to have inner product no less than cS0 with probability at least 1 − δ.

Now we analyze the query time complexity of RANGE-LSH. Each sub-dataset Sj contains n^{1−α} items, and the query time complexity upper bound of c-approximate MIPS on it is O(n^{(1−α)ρj} log n^{1−α}) with ρj = G(c, S0/Uj). As there are m = n^α sub-datasets, the time complexity of selecting the optimal one from the answers of all sub-datasets is O(n^α).
Considering that ρj is an increasing function of Uj and that there are at most n^β sub-datasets with Uj = U, the query time complexity of RANGE-LSH can be expressed as:

f(n) = n^α + Σ_{j=1}^{n^α} n^{(1−α)ρj} log n^{1−α} < n^α + Σ_{j=1}^{n^α} n^{(1−α)ρj} log n
     = n^α + Σ_{j=1}^{n^α − n^β} n^{(1−α)ρj} log n + n^β n^{(1−α)ρ} log n    (10)
     < n^α + n^α n^{(1−α)ρ*} log n + n^β n^{(1−α)ρ} log n.

Strictly speaking, the equal sign in the first line of (10) is not rigorous, as the constants and non-dominant terms in the complexity of querying each sub-dataset are ignored. However, we are interested in the order rather than the precise value of the query time complexity, so the equal sign is used for conciseness of expression. Comparing f(n) with the O(n^ρ log n) complexity of SIMPLE-LSH,

f(n) / (n^ρ log n) < (n^α + (n^α n^{(1−α)ρ*} + n^β n^{(1−α)ρ}) log n) / (n^ρ log n)
                   = n^{α−ρ}/log n + n^{α+(1−α)ρ*−ρ} + n^{β−αρ}.    (11)

(11) goes to 0 with sufficiently large n when α ≤ ρ, α + (1 − α)ρ* < ρ and β − αρ < 0, which is satisfied by α < min{ρ, (ρ − ρ*)/(1 − ρ*)} and β < αρ.

Note that the conditions of Theorem 1 can be easily satisfied. Theorem 1 imposes an upper bound instead of a lower bound on the number of sub-datasets, which is favorable as we usually do not want to partition the dataset into a large number of sub-datasets.
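The exponents in Theorem 1 are easy to evaluate; a sketch of G from (9) with illustrative numbers (a global normalizer U = 1 versus a sub-dataset normalizer Uj = 0.5, both our assumptions):

```python
import math

def G(c, s0):
    """rho exponent of SIMPLE-LSH, equation (9); requires 0 < c < 1 and 0 < s0 <= 1."""
    return math.log(1 - math.acos(s0) / math.pi) / math.log(1 - math.acos(c * s0) / math.pi)

c, S0 = 0.5, 0.45              # maximum inner product before normalization
rho = G(c, S0 / 1.0)           # SIMPLE-LSH: normalize by the global maximum U = 1
rho_j = G(c, S0 / 0.5)         # RANGE-LSH: normalize by a sub-dataset maximum Uj = 0.5
```

Both exponents lie in (0, 1), and rho_j is markedly smaller than rho, i.e., the O(n^ρ log n) query bound improves on sub-datasets with smaller normalizers.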
Moreover, the condition that the number of sub-datasets with Uj = U is smaller than n^{αρ} is easily satisfied, as very often only the sub-dataset that contains the items with the largest 2-norms has Uj = U. The proof also shows that RANGE-LSH is not limited to datasets with a long tail in the 2-norm distribution. As long as U > Uj holds for most sub-datasets, RANGE-LSH can provide better performance than SIMPLE-LSH. We acknowledge that RANGE-LSH and SIMPLE-LSH are equivalent when all items have the same 2-norm. However, MIPS is equivalent to angular similarity search in this case, and thus can be solved directly with sign random projection rather than SIMPLE-LSH. In most applications that involve MIPS, there are considerable variations in the 2-norms of the items and RANGE-LSH will be beneficial.

The lower theoretical query time complexity of RANGE-LSH also translates into much better bucket balance in practice. On the ImageNet dataset, RANGE-LSH with 32-bit codes maps the items to approximately 2 million buckets and most buckets contain only 1 item. Compared with the statistics of SIMPLE-LSH in Section 3.1, these numbers show that RANGE-LSH has much better bucket balance, and thus a better ability to define a good probing order for the items. This can be explained by the fact that RANGE-LSH uses more moderate scaling factors for the sub-datasets than SIMPLE-LSH, thus significantly reducing the magnitude of the √(1 − ‖x‖²) term in P(x) = [x; √(1 − ‖x‖²)].

3.3 Similarity Metric

Although the theoretical guarantee of LSH only holds when using multiple hash tables, in practice, LSH is usually used in a single-table multi-probe fashion for candidate generation in similarity search [Andoni et al., 2015, Lv et al., 2007].
The buckets (items) are ranked according to the number of identical hashes they have with the query (e.g., Hamming ranking) and the top-ranked buckets are probed first. Multi-probing is challenging for RANGE-LSH as different sub-datasets use different normalization constants, so buckets from different sub-datasets cannot be ranked simply according to their number of identical hashes. To support multi-probing in RANGE-LSH, we formulate a similarity metric for bucket ranking that is efficient to manipulate.

Combining the index building process of RANGE-LSH and the collision probability of sign random projection in (4), the probability that an item x ∈ Sj and the query collide on one bit is p = 1 − (1/π) cos⁻¹(q^T x / Uj), where Uj is the maximum 2-norm in sub-dataset Sj. Denote the code length as L and the number of identical hashes bucket b has with the query as l; we can obtain an estimate of the collision probability p as p̂ = l/L. Plugging p̂ into the collision probability, we get an estimate ŝ of the inner product between q and the items in bucket b (from sub-dataset Sj) as:

ŝ = Uj cos[π(1 − l/L)].    (12)

Therefore, we can compute ŝ for the buckets (items) and use it for ranking. When l > L/2, cos[π(1 − l/L)] > 0, thus a larger Uj indicates a higher inner product, while the opposite is true when l < L/2. Since the code length is limited and l/L can diverge from the actual collision probability p, it is possible that a bucket has a large Uj and a large inner product with q, but it happens that l < L/2. In this case, it will be probed late in the query process, which harms query performance. To alleviate this problem, we adjust the similarity indicator to ŝ = Uj cos[π(1 − ε)(1 − l/L)], where 0 < ε < 1 is a small number. For the adjusted similarity indicator, cos[π(1 − ε)(1 − l/L)] < 0 only when l < L(1/2 − ε/(2(1 − ε))), which leaves some room to accommodate the randomness in hashing.

Note that the similarity metric in (12) can be manipulated efficiently with low complexity. We can calculate the values of ŝ for all possible combinations of l and Uj, and sort them during index building. Note that the sorted structure is common to all queries and does not take too much space³. When a query comes, query processing can be conducted by traversing the sorted structure in ascending order. For a pair (Uj, l), Uj determines the sub-dataset while l is used to choose the buckets to probe in that sub-dataset with a standard hash lookup.
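Bucket ranking with the adjusted indicator can be sketched as follows (ε and the Uj values below are illustrative, not from the paper):

```python
import math

def indicator(U_j, l, L, eps=0.05):
    """Adjusted similarity indicator s = Uj * cos(pi * (1 - eps) * (1 - l / L))."""
    return U_j * math.cos(math.pi * (1 - eps) * (1 - l / L))

L = 16
norm_maxima = [0.4, 0.7, 1.0]   # Uj of three sub-datasets
pairs = [(U, l) for U in norm_maxima for l in range(L + 1)]
# Sorted once at index-building time; every query walks this probing order,
# using Uj to pick the sub-dataset and l to pick buckets via hash lookup.
order = sorted(pairs, key=lambda p: indicator(p[0], p[1], L), reverse=True)
```

The first pair probed is (1.0, 16): the bucket agreeing with the query on all L bits in the sub-dataset with the largest Uj. For a fixed Uj, the indicator increases with l, so better-matching buckets are probed earlier.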
We also provide in the supplementary material an efficient method to rank the items when the code length is large and there are many empty buckets.

³l can take L + 1 values and Uj can take m values, so the size of the sorted structure is mL + m.

Figure 2: Probed item-recall curves for top 10 MIPS on Netflix (top row), Yahoo!Music (middle row), and ImageNet (bottom row). From left to right, the code lengths are 16, 32 and 64, respectively. Each panel compares SimpleLSH, Range-LSH and L2-ALSH.

4 Experimental Results

We used three popular datasets, i.e., Netflix, Yahoo!Music and ImageNet, in the experiments⁴. For the Netflix dataset and the Yahoo!Music dataset, the user and item embeddings were obtained using alternating least squares based matrix factorization [Yun et al., 2013], and each embedding has 300 dimensions.
We used the item embeddings as dataset items and the user embeddings as queries. The ImageNet dataset contains more than 2 million SIFT descriptors of the ImageNet images; we sampled 1,000 SIFT descriptors as queries and used the rest as dataset items. Note that the 2-norm distributions of the Netflix and Yahoo!Music embeddings do not have long tails and the maximum 2-norm is close to the median (see the supplementary material), which helps verify the robustness of RANGE-LSH to different 2-norm distributions. For each dataset, we report the average performance of 1,000 randomly selected queries.

We compared RANGE-LSH with SIMPLE-LSH and L2-ALSH. For L2-ALSH, we used the parameter setting recommended by its authors, i.e., m = 3, U = 0.83, r = 2.5. For RANGE-LSH, part of the bits in the binary code are used to encode the index of the sub-datasets and the rest are generated by hashing. For example, if the code length is 16 and the dataset is partitioned into 32 sub-datasets, the 16-bit code of RANGE-LSH consists of 5 bits for indexing the 32 sub-datasets, while the remaining 11 bits are generated by hashing. We partitioned the dataset into 32, 64 and 128 sub-datasets under a code length of 16, 32 and 64, respectively. For fairness of comparison, all algorithms use the same total code length. Following existing research, we mainly compare the performance of the algorithms for single-table based multi-probing; a comparison of the multi-table single-probe performance between RANGE-LSH and SIMPLE-LSH can be found in the supplementary material.

We plot the probed item-recall curves in Figure 2. The results show that RANGE-LSH probes significantly fewer items than SIMPLE-LSH and L2-ALSH at the same recall.
Due to space limitations, we only report the performance of top 10 MIPS; the performance under more configurations can be found in the supplementary material.

Recall that Algorithm 1 partitions a dataset into sub-datasets according to percentiles in the 2-norm distribution. We tested an alternative partitioning scheme, which divides the domain of 2-norms into uniformly spaced ranges; items falling in the same range are partitioned into the same sub-dataset. The results are plotted in Figure 3(a), which shows that uniform partitioning achieves slightly better performance than percentile partitioning.

⁴Experiment codes: https://github.com/xinyandai/similarity-search/tree/mipsex.

Figure 3: (a) Comparison between percentile based partitioning and uniform partitioning (best viewed in color); the code length is 32 bits and the dataset is Yahoo!Music. prc32 and uni32 mean percentile and uniform partitioning with 32 sub-datasets, respectively. (b) The influence of the number of sub-datasets on the performance on the Yahoo!Music dataset; the code length is 32 bits and the number of sub-datasets varies from 32 to 256. RH32 means RANGE-LSH with 32 sub-datasets.
This shows that RANGE-LSH is general and robust to different partitioning methods, as long as items with similar 2-norms are grouped into the same sub-dataset. We also examined the influence of the number of sub-datasets on performance in Figure 3(b). The results show that performance improves with the number of sub-datasets when the number of sub-datasets is still small, but stabilizes when the number of sub-datasets is sufficiently large.

5 Extension to L2-ALSH

In this section, we show that the idea of RANGE-LSH, which partitions the original dataset into sub-datasets with similar 2-norms, can also be applied to L2-ALSH [Shrivastava and Li, 2014] to obtain more favorable (smaller) ρ values than (7). Note that we get (7) from (6) because we only know 0 ≤ ‖x‖ ≤ S0 if the entire dataset is considered. For a sub-dataset Sj, if we know the range of its 2-norms, u_{j−1} < ‖x‖ ≤ u_j with u_j < S0 and u_{j−1} > 0, we can obtain the ρj of Sj as:

ρj = log F_r(√(1 + m/4 − 2UjS0 + (Uj u_j)^{2^{m+1}})) / log F_r(√(1 + m/4 − 2cUjS0 + (Uj u_{j−1})^{2^{m+1}})).    (13)

As u_j < S0 and u_{j−1} > 0, the collision probability in the numerator increases while the collision probability in the denominator decreases when we compare (13) with (7). Therefore, we have ρj < ρ. Moreover, partitioning the original dataset into sub-datasets allows us to use a different normalization factor Uj (in addition to m and r) for each sub-dataset, and we only need to satisfy Uj < 1/u_j rather than U < 1/max_{x∈S} ‖x‖, which allows more flexibility for parameter optimization. Similar to Theorem 1, it can also be proved that dividing the dataset into sub-datasets results in an algorithm with lower query time complexity than the original L2-ALSH.
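Equations (7) and (13) can be compared numerically via F_r from (3). A sketch with illustrative parameters (the values of S0, Uj, u_j and u_{j−1} are our assumptions, chosen to satisfy u_j < S0, u_{j−1} > 0 and Uj u_j < 1):

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def F(d, r):
    """Collision probability of the L2 hash family, equation (3)."""
    return (1 - 2 * Phi(-r / d)
            - (2 * d / (math.sqrt(2 * math.pi) * r)) * (1 - math.exp(-((r / d) ** 2) / 2)))

def rho_full(c, s0, U, m, r):
    """Equation (7): whole dataset, norms only known to lie in [0, s0]."""
    num = F(math.sqrt(1 + m / 4 - 2 * U * s0 + (U * s0) ** (2 ** (m + 1))), r)
    den = F(math.sqrt(1 + m / 4 - 2 * c * U * s0), r)
    return math.log(num) / math.log(den)

def rho_sub(c, s0, U_j, u_hi, u_lo, m, r):
    """Equation (13): sub-dataset with 2-norms in (u_lo, u_hi]."""
    num = F(math.sqrt(1 + m / 4 - 2 * U_j * s0 + (U_j * u_hi) ** (2 ** (m + 1))), r)
    den = F(math.sqrt(1 + m / 4 - 2 * c * U_j * s0 + (U_j * u_lo) ** (2 ** (m + 1))), r)
    return math.log(num) / math.log(den)

c, s0, m, r = 0.5, 0.9, 3, 2.5
rho = rho_full(c, s0, U=0.83, m=m, r=r)
rho_j = rho_sub(c, s0, U_j=0.9, u_hi=0.8, u_lo=0.4, m=m, r=r)
```

With these settings rho_j comes out below rho, illustrating the claimed gain; the larger admissible Uj (only Uj < 1/u_j is required) contributes to the improvement.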
We show empirically that dataset partitioning improves the performance of L2-ALSH in the supplementary material.

6 Conclusions

Maximum inner product search (MIPS) has many important applications such as collaborative filtering and computer vision. We showed that SIMPLE-LSH, the state-of-the-art hashing method for MIPS, has critical performance limitations due to the long tail in the 2-norm distribution of real datasets. To tackle this problem, we proposed RANGE-LSH, which attains provably lower query time complexity than SIMPLE-LSH under mild conditions. We also formulated a novel similarity metric that can be processed with low complexity. The experimental results showed that RANGE-LSH significantly outperforms SIMPLE-LSH, and that RANGE-LSH is robust to the shape of the 2-norm distribution and to different partitioning methods. We also showed that the idea of dataset partitioning is general and can be applied to boost the performance of L2-ALSH. The superior performance of RANGE-LSH can benefit many applications that involve MIPS.

Acknowledgments

We thank the reviewers for their valuable comments. This work was supported in part by Grant CUHK 14222816 from the Hong Kong RGC.

References

A. Andoni and I. P. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In STOC, pages 793–801, 2015.

A. Andoni, P. Indyk, T. Laarhoven, I. P. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In NIPS, pages 1225–1233, 2015.

A. Andoni, P. Indyk, and I. P. Razenshteyn. Approximate nearest neighbor search in high dimensions. CoRR, 2018.

D. Cai. A revisit of hashing algorithms for approximate nearest neighbor search. CoRR, 2016.

T. L. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In CVPR, pages 1814–1821, 2013.

J. Deng, W. Dong, R. Socher, L. J. 
Li, K. Li, and F. F. Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.

P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32:1627–1645, 2010.

J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Computers, 23:881–890, 1974.

M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. JACM, 42:1115–1145, 1995.

Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35:2916–2929, 2013.

P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, 1998.

N. Koenigstein, P. Ram, and Y. Shavitt. Efficient retrieval of recommendations in a matrix factorization framework. In CIKM, pages 535–544, 2012.

Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42:30–37, 2009.

Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In VLDB, pages 950–961, 2007.

B. Neyshabur and N. Srebro. On symmetric and asymmetric LSHs for inner product search. In ICML, pages 1926–1934, 2015.

P. Ram and A. G. Gray. Maximum inner-product search using cone trees. In KDD, pages 931–939, 2012.

A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS, pages 2321–2329, 2014.

A. Shrivastava and P. Li. 
Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In UAI, pages 812–821, 2015.

H. Wang, J. Cao, L. Shu, and D. Rafiei. Locality sensitive hashing revisited: filling the gap between theory and algorithm analysis. In CIKM, pages 1969–1978, 2013.

R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998.

H. Yun, H. F. Yu, C. J. Hsieh, S. V. N. Vishwanathan, and I. S. Dhillon. NOMAD: non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. CoRR, 2013.