{"title": "Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)", "book": "Advances in Neural Information Processing Systems", "page_first": 2321, "page_last": 2329, "abstract": "We present the first provably sublinear time hashing algorithm for approximate \\emph{Maximum Inner Product Search} (MIPS). Searching with (un-normalized) inner product as the underlying similarity measure is a known difficult problem and finding hashing schemes for MIPS was considered hard. While the existing Locality Sensitive Hashing (LSH) framework is insufficient for solving MIPS, in this paper we extend the LSH framework to allow asymmetric hashing schemes. Our proposal is based on a key observation that the problem of finding maximum inner products, after independent asymmetric transformations, can be converted into the problem of approximate near neighbor search in classical settings. This key observation makes efficient sublinear hashing scheme for MIPS possible. Under the extended asymmetric LSH (ALSH) framework, this paper provides an example of explicit construction of provably fast hashing scheme for MIPS. Our proposed algorithm is simple and easy to implement. 
The proposed hashing scheme leads to significant computational savings over the two popular conventional LSH schemes: (i) Sign Random Projection (SRP) and (ii) hashing based on $p$-stable distributions for $L_2$ norm (L2LSH), in the collaborative filtering task of item recommendations on Netflix and Movielens (10M) datasets.", "full_text": "Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)\n\nAnshumali Shrivastava\nDepartment of Computer Science\nComputing and Information Science\nCornell University\nIthaca, NY 14853, USA\nanshu@cs.cornell.edu\n\nPing Li\nDepartment of Statistics and Biostatistics\nDepartment of Computer Science\nRutgers University\nPiscataway, NJ 08854, USA\npingli@stat.rutgers.edu\n\nAbstract\n\nWe present the first provably sublinear time hashing algorithm for approximate Maximum Inner Product Search (MIPS). Searching with (un-normalized) inner product as the underlying similarity measure is a known difficult problem and finding hashing schemes for MIPS was considered hard. While the existing Locality Sensitive Hashing (LSH) framework is insufficient for solving MIPS, in this paper we extend the LSH framework to allow asymmetric hashing schemes. Our proposal is based on a key observation that the problem of finding maximum inner products, after independent asymmetric transformations, can be converted into the problem of approximate near neighbor search in classical settings. This key observation makes efficient sublinear hashing scheme for MIPS possible. Under the extended asymmetric LSH (ALSH) framework, this paper provides an example of explicit construction of provably fast hashing scheme for MIPS. Our proposed algorithm is simple and easy to implement. 
The proposed hashing scheme leads to significant computational savings over the two popular conventional LSH schemes: (i) Sign Random Projection (SRP) and (ii) hashing based on p-stable distributions for L2 norm (L2LSH), in the collaborative filtering task of item recommendations on Netflix and Movielens (10M) datasets.\n\n1 Introduction and Motivation\n\nThe focus of this paper is on the problem of Maximum Inner Product Search (MIPS). In this problem, we are given a giant data vector collection $\\mathcal{S}$ of size $N$, where $\\mathcal{S} \\subset \\mathbb{R}^D$, and a given query point $q \\in \\mathbb{R}^D$. We are interested in searching for $p \\in \\mathcal{S}$ which maximizes (or approximately maximizes) the inner product $q^T p$. Formally, we are interested in efficiently computing\n\n$$p = \\arg\\max_{x \\in \\mathcal{S}} q^T x \\qquad (1)$$\n\nThe MIPS problem is related to near neighbor search (NNS), which instead requires computing\n\n$$p = \\arg\\min_{x \\in \\mathcal{S}} \\|q - x\\|_2^2 = \\arg\\min_{x \\in \\mathcal{S}} \\left( \\|x\\|_2^2 - 2 q^T x \\right) \\qquad (2)$$\n\nThese two problems are equivalent if the norm of every element $x \\in \\mathcal{S}$ is constant. Note that the value of the norm $\\|q\\|_2$ has no effect as it is a constant and does not change the identity of arg max or arg min. There are many scenarios in which MIPS arises naturally at places where the norms of the elements in $\\mathcal{S}$ have significant variations [13] and cannot be controlled, e.g., (i) recommender system, (ii) large-scale object detection with DPM, and (iii) multi-class label prediction.\n\nRecommender systems: Recommender systems are often based on collaborative filtering which relies on past behavior of users, e.g., past purchases and ratings. Latent factor modeling based on matrix factorization [14] is a popular approach for solving collaborative filtering. In a typical matrix factorization model, a user $i$ is associated with a latent user characteristic vector $u_i$, and similarly, an item $j$ is associated with a latent item characteristic vector $v_j$. 
The rating $r_{i,j}$ of item $j$ by user $i$ is modeled as the inner product between the corresponding characteristic vectors. In this setting, given a user $i$ and the corresponding learned latent vector $u_i$ finding the right item $j$, to recommend to this user, involves computing\n\n$$j = \\arg\\max_{j'} r_{i,j'} = \\arg\\max_{j'} u_i^T v_{j'} \\qquad (3)$$\n\nwhich is an instance of the standard MIPS problem. It should be noted that we do not have control over the norm of the learned vector, i.e., $\\|v_j\\|_2$, which often has a wide range in practice [13].\n\nIf there are N items to recommend, solving (3) requires computing N inner products. Recommendation systems are typically deployed in on-line application over web where the number N is huge. A brute force linear scan over all items, for computing arg max, would be prohibitively expensive.\n\nLarge-scale object detection with DPM: Deformable Part Model (DPM) based representation of images is the state-of-the-art in object detection tasks [8]. In DPM model, firstly a set of part filters are learned from the training dataset. During detection, these learned filter activations over various patches of the test image are used to score the test image. The activation of a filter on an image patch is an inner product between them. Typically, the number of possible filters are large (e.g., millions) and so scoring the test image is costly. Recently, it was shown that scoring based only on filters with high activations performs well in practice [7]. Identifying those filters having high activations on a given image patch requires computing top inner products. Consequently, an efficient solution to the MIPS problem will benefit large scale object detections based on DPM.\n\nMulti-class (and/or multi-label) prediction: The models for multi-class SVM (or logistic regression) learn a weight vector $w_i$ for each of the class label $i$. 
After the weights are learned, given a new test data vector $x_{test}$, predicting its class label is basically an MIPS problem:\n\n$$y_{test} = \\arg\\max_{i \\in \\mathcal{L}} x_{test}^T w_i \\qquad (4)$$\n\nwhere $\\mathcal{L}$ is the set of possible class labels. Note that the norms of the vectors $\\|w_i\\|_2$ are not constant.\n\nThe size, $|\\mathcal{L}|$, of the set of class labels differs in applications. Classifying with large number of possible class labels is common in multi-label learning and fine grained object classification, for instance, prediction task with $|\\mathcal{L}| = 100,000$ [7]. Computing such high-dimensional vector multiplications for predicting the class label of a single instance can be expensive in, e.g., user-facing applications.\n\n1.1 The Need for Hashing Inner Products\n\nSolving the MIPS problem can have significant practical impact. [19, 13] proposed solutions based on tree data structure combined with branch and bound space partitioning technique similar to k-d trees [9]. Later, the same method was generalized for general max kernel search [5], where the runtime guarantees, like other space partitioning methods, are heavily dependent on the dimensionality and the expansion constants. In fact, it is well-known that techniques based on space partitioning (such as k-d trees) suffer from the curse of dimensionality. For example, [24] showed that techniques based on space partitioning degrade to linear search, even for dimensions as small as 10 or 20.\n\nLocality Sensitive Hashing (LSH) [12] based randomized techniques are common and successful in industrial practice for efficiently solving NNS (near neighbor search). Unlike space partitioning techniques, both the running time as well as the accuracy guarantee of LSH based NNS are in a way independent of the dimensionality of the data. 
This makes LSH suitable for large scale processing system dealing with ultra-high dimensional datasets which are common in modern applications. Furthermore, LSH based schemes are massively parallelizable, which makes them ideal for modern \u201cBig\u201d datasets. The prime focus of this paper will be on efficient hashing based algorithms for MIPS, which do not suffer from the curse of dimensionality.\n\n1.2 Our Contributions\n\nWe develop Asymmetric LSH (ALSH), an extended LSH scheme for efficiently solving the approximate MIPS problem. Finding hashing based algorithms for MIPS was considered hard [19, 13]. We formally show that, under the current framework of LSH, there cannot exist any LSH for solving MIPS. Despite this negative result, we show that it is possible to relax the current LSH framework to allow asymmetric hash functions which can efficiently solve MIPS. This generalization comes with no extra cost and the ALSH framework inherits all the theoretical guarantees of LSH.\n\nOur construction of asymmetric LSH is based on an interesting fact that the original MIPS problem, after asymmetric transformations, reduces to the problem of approximate near neighbor search in classical settings. Based on this key observation, we provide an example of explicit construction of asymmetric hash function, leading to the first provably sublinear query time hashing algorithm for approximate similarity search with (un-normalized) inner product as the similarity. The new ALSH framework is of independent theoretical interest. We report other explicit constructions in [22, 21].\n\nWe also provide experimental evaluations on the task of recommending top-ranked items with collaborative filtering, on Netflix and Movielens (10M) datasets. 
The evaluations not only support our theoretical findings but also quantify the obtained benefit of the proposed scheme, in a useful task.\n\n2 Background\n\n2.1 Locality Sensitive Hashing (LSH)\n\nA commonly adopted formalism for approximate near-neighbor search is the following:\n\nDefinition: (c-Approximate Near Neighbor or c-NN) Given a set of points in a D-dimensional space $\\mathbb{R}^D$, and parameters $S_0 > 0$, $\\delta > 0$, construct a data structure which, given any query point q, does the following with probability $1 - \\delta$: if there exists an $S_0$-near neighbor of q in P, it reports some $cS_0$-near neighbor of q in P.\n\nIn the definition, the $S_0$-near neighbor of point q is a point p with $Sim(q, p) \\ge S_0$, where Sim is the similarity of interest. Popular techniques for c-NN are often based on Locality Sensitive Hashing (LSH) [12], which is a family of functions with the nice property that more similar objects in the domain of these functions have a higher probability of colliding in the range space than less similar ones. In formal terms, consider $\\mathcal{H}$ a family of hash functions mapping $\\mathbb{R}^D$ to a set $\\mathcal{I}$.\n\nDefinition: (Locality Sensitive Hashing (LSH)) A family $\\mathcal{H}$ is called $(S_0, cS_0, p_1, p_2)$-sensitive if, for any two point $x, y \\in \\mathbb{R}^D$, h chosen uniformly from $\\mathcal{H}$ satisfies the following:\n\n\u2022 if $Sim(x, y) \\ge S_0$ then $Pr_{\\mathcal{H}}(h(x) = h(y)) \\ge p_1$\n\u2022 if $Sim(x, y) \\le cS_0$ then $Pr_{\\mathcal{H}}(h(x) = h(y)) \\le p_2$\n\nFor efficient approximate nearest neighbor search, $p_1 > p_2$ and $c < 1$ is needed.\n\nFact 1 [12]: Given a family of $(S_0, cS_0, p_1, p_2)$-sensitive hash functions, one can construct a data structure for c-NN with $O(n^{\\rho} \\log n)$ query time and space $O(n^{1+\\rho})$, where $\\rho = \\frac{\\log p_1}{\\log p_2} < 1$.\n\n2.2 LSH for L2 Distance (L2LSH)\n\n[6] presented a novel LSH family for all $L_p$ ($p \\in (0, 2]$) distances. In particular, when p = 2, this scheme provides an LSH family for L2 distances. Formally, given a fixed (real) number r, we choose a random vector a with each component generated from i.i.d. normal, i.e., $a_i \\sim N(0, 1)$, and a scalar b generated uniformly at random from [0, r]. The hash function is defined as:\n\n$$h^{L2}_{a,b}(x) = \\left\\lfloor \\frac{a^T x + b}{r} \\right\\rfloor \\qquad (5)$$\n\nwhere $\\lfloor \\cdot \\rfloor$ is the floor operation. 
The collision probability under this scheme can be shown to be\n\n$$Pr(h^{L2}_{a,b}(x) = h^{L2}_{a,b}(y)) = F_r(d); \\qquad F_r(d) = 1 - 2\\Phi(-r/d) - \\frac{2}{\\sqrt{2\\pi}\\,(r/d)} \\left( 1 - e^{-(r/d)^2/2} \\right) \\qquad (6)$$\n\nwhere $\\Phi(x) = \\int_{-\\infty}^{x} \\frac{1}{\\sqrt{2\\pi}} e^{-x^2/2} dx$ is the cumulative density function (cdf) of standard normal distribution and $d = \\|x - y\\|_2$ is the Euclidean distance between the vectors x and y. This collision probability $F_r(d)$ is a monotonically decreasing function of the distance d and hence $h^{L2}_{a,b}$ is an LSH for L2 distances. This scheme is also the part of LSH package [1]. Here r is a parameter. As argued previously, $\\|x - y\\|_2 = \\sqrt{\\|x\\|_2^2 + \\|y\\|_2^2 - 2x^T y}$ is not monotonic in the inner product $x^T y$ unless the given data has a constant norm. Hence, $h^{L2}_{a,b}$ is not suitable for MIPS.\n\nThe recent work on coding for random projections [16] showed that L2LSH can be improved when the data are normalized for building large-scale linear classifiers as well as near neighbor search [17]. In particular, [17] showed that 1-bit coding (i.e., sign random projections (SRP) [10, 3]) or 2-bit coding are often better compared to using more bits. It is known that SRP is designed for retrieving with cosine similarity: $Sim(x, y) = \\frac{x^T y}{\\|x\\|_2 \\|y\\|_2}$. Again, ordering under this similarity can be very different from the ordering of inner product and hence SRP is also unsuitable for solving MIPS.\n\n3 Hashing for MIPS\n\n3.1 A Negative Result\n\nWe first show that, under the current LSH framework, it is impossible to obtain a locality sensitive hashing scheme for MIPS. 
In [19, 13], the authors also argued that finding locality sensitive hashing for inner products could be hard, but to the best of our knowledge we have not seen a formal proof.\n\nTheorem 1 There cannot exist any LSH family for MIPS.\n\nProof: Suppose there exists such hash function h. For un-normalized inner products the self similarity of a point x with itself is $Sim(x, x) = x^T x = \\|x\\|_2^2$ and there may exist another point y, such that $Sim(x, y) = y^T x > \\|x\\|_2^2 + C$, for any constant C. Under any single randomized hash function h, the collision probability of the event {h(x) = h(x)} is always 1. So if h is an LSH for inner product then the event {h(x) = h(y)} should have higher probability compared to the event {h(x) = h(x)}, since we can always choose y with $Sim(x, y) = S_0 + \\delta > S_0$ and $cS_0 > Sim(x, x)$ $\\forall S_0$ and $c < 1$. This is not possible because the probability cannot be greater than 1. This completes the proof. $\\Box$\n\n3.2 Our Proposal: Asymmetric LSH (ALSH)\n\nThe basic idea of LSH is probabilistic bucketing and it is more general than the requirement of having a single hash function h. The classical LSH algorithms use the same hash function h for both the preprocessing step and the query step. One assigns buckets in the hash table to all the candidates $x \\in \\mathcal{S}$ using h, then uses the same h on the query q to identify relevant buckets. The only requirement for the proof of Fact 1 to work is that the collision probability of the event {h(q) = h(x)} increases with the similarity Sim(q, x). The theory [11] behind LSH still works if we use hash function $h_1$ for preprocessing $x \\in \\mathcal{S}$ and a different hash function $h_2$ for querying, as long as the probability of the event $\\{h_2(q) = h_1(x)\\}$ increases with Sim(q, x), and there exist $p_1$ and $p_2$ with the required property. The traditional LSH definition does not allow this asymmetry but it is not a required condition in the proof. 
For this reason, we can relax the definition of c-NN without losing runtime guarantees. [20] used a related (asymmetric) idea for solving 3-way similarity search.\n\nWe first define a modified locality sensitive hashing in a form which will be useful later.\n\nDefinition: (Asymmetric Locality Sensitive Hashing (ALSH)) A family $\\mathcal{H}$, along with the two vector functions $Q: \\mathbb{R}^D \\to \\mathbb{R}^{D'}$ (Query Transformation) and $P: \\mathbb{R}^D \\to \\mathbb{R}^{D'}$ (Preprocessing Transformation), is called $(S_0, cS_0, p_1, p_2)$-sensitive if, for a given c-NN instance with query q and any x in the collection $\\mathcal{S}$, the hash function h chosen uniformly from $\\mathcal{H}$ satisfies the following:\n\n\u2022 if $Sim(q, x) \\ge S_0$ then $Pr_{\\mathcal{H}}(h(Q(q)) = h(P(x))) \\ge p_1$\n\u2022 if $Sim(q, x) \\le cS_0$ then $Pr_{\\mathcal{H}}(h(Q(q)) = h(P(x))) \\le p_2$\n\nWhen Q(x) = P(x) = x, we recover the vanilla LSH definition with h(.) as the required hash function. Coming back to the problem of MIPS, if Q and P are different, the event {h(Q(x)) = h(P(x))} will not have probability equal to 1 in general. Thus, $Q \\ne P$ can counter the fact that self similarity is not highest with inner products. 
We just need the probability of the new collision event {h(Q(q)) = h(P(y))} to satisfy the conditions in the definition of c-NN for $Sim(q, y) = q^T y$. Note that the query transformation Q is only applied on the query and the pre-processing transformation P is applied to $x \\in \\mathcal{S}$ while creating hash tables. It is this asymmetry which will allow us to solve MIPS efficiently. In Section 3.3, we explicitly show a construction (and hence the existence) of asymmetric locality sensitive hash function for solving MIPS. The source of randomization h for both q and $x \\in \\mathcal{S}$ is the same. Formally, it is not difficult to show a result analogous to Fact 1.\n\nTheorem 2 Given a family of hash function $\\mathcal{H}$ and the associated query and preprocessing transformations P and Q, which is $(S_0, cS_0, p_1, p_2)$-sensitive, one can construct a data structure for c-NN with $O(n^{\\rho} \\log n)$ query time and space $O(n^{1+\\rho})$, where $\\rho = \\frac{\\log p_1}{\\log p_2}$.\n\n3.3 From MIPS to Near Neighbor Search (NNS)\n\nWithout loss of any generality, let U < 1 be a number such that $\\|x_i\\|_2 \\le U < 1$, $\\forall x_i \\in \\mathcal{S}$. If this is not the case then define a scaling transformation,\n\n$$S(x) = \\frac{U}{M} \\times x; \\qquad M = \\max_{x_i \\in \\mathcal{S}} \\|x_i\\|_2; \\qquad (7)$$\n\nNote that we are allowed one time preprocessing and asymmetry, S is the part of asymmetric transformation. For simplicity of arguments, let us assume that $\\|q\\|_2 = 1$, the arg max is anyway independent of the norm of the query. Later we show in Section 3.6 that it can be easily removed.\n\nWe are now ready to describe the key step in our algorithm. First, we define two vector transformations $P: \\mathbb{R}^D \\to \\mathbb{R}^{D+m}$ and $Q: \\mathbb{R}^D \\to \\mathbb{R}^{D+m}$ as follows:\n\n$$P(x) = [x; \\|x\\|_2^2; \\|x\\|_2^4; ....; \\|x\\|_2^{2^m}]; \\qquad Q(x) = [x; 1/2; 1/2; ....; 1/2], \\qquad (8)$$\n\nwhere [;] is the concatenation. P(x) appends m scalars of the form $\\|x\\|_2^{2^i}$ at the end of the vector x, while Q(x) simply appends m \u201c1/2\u201d to the end of the vector x. By observing that\n\n$$Q(q)^T P(x_i) = q^T x_i + \\frac{1}{2}\\left(\\|x_i\\|_2^2 + \\|x_i\\|_2^4 + ... + \\|x_i\\|_2^{2^m}\\right); \\qquad \\|P(x_i)\\|_2^2 = \\|x_i\\|_2^2 + \\|x_i\\|_2^4 + ... + \\|x_i\\|_2^{2^{m+1}}$$\n\nwe obtain the following key equality:\n\n$$\\|Q(q) - P(x_i)\\|_2^2 = (1 + m/4) - 2q^T x_i + \\|x_i\\|_2^{2^{m+1}} \\qquad (9)$$\n\nSince $\\|x_i\\|_2 \\le U < 1$, $\\|x_i\\|_2^{2^{m+1}} \\to 0$, at the tower rate (exponential to exponential). The term (1 + m/4) is a fixed constant. As long as m is not too small (e.g., $m \\ge 3$ would suffice), we have\n\n$$\\arg\\max_{x \\in \\mathcal{S}} q^T x \\simeq \\arg\\min_{x \\in \\mathcal{S}} \\|Q(q) - P(x)\\|_2 \\qquad (10)$$\n\nThis gives us the connection between solving un-normalized MIPS and approximate near neighbor search. Transformations P and Q, when norms are less than 1, provide correction to the L2 distance $\\|Q(q) - P(x_i)\\|_2$ making it rank correlate with the (un-normalized) inner product. This works only after shrinking the norms, as norms greater than 1 will instead blow the term $\\|x_i\\|_2^{2^{m+1}}$.\n\n3.4 Fast Algorithms for MIPS\n\nEq. (10) shows that MIPS reduces to the standard approximate near neighbor search problem which can be efficiently solved. As the error term $\\|x_i\\|_2^{2^{m+1}} < U^{2^{m+1}}$ goes to zero at a tower rate, it quickly becomes negligible for any practical purposes. 
In fact, from theoretical perspective, since we are interested in guarantees for c-approximate solutions, this additional error can be absorbed in the approximation parameter c. Formally, we can state the following theorem.\n\nTheorem 3 Given a c-approximate instance of MIPS, i.e., $Sim(q, x) = q^T x$, and a query q such that $\\|q\\|_2 = 1$ along with a collection $\\mathcal{S}$ having $\\|x\\|_2 \\le U < 1$ $\\forall x \\in \\mathcal{S}$. Let P and Q be the vector transformations defined in (8). We have the following two conditions for hash function $h^{L2}_{a,b}$ (5)\n\n1) if $q^T x \\ge S_0$ then $Pr[h^{L2}_{a,b}(Q(q)) = h^{L2}_{a,b}(P(x))] \\ge F_r\\left(\\sqrt{1 + m/4 - 2S_0 + U^{2^{m+1}}}\\right)$\n2) if $q^T x \\le cS_0$ then $Pr[h^{L2}_{a,b}(Q(q)) = h^{L2}_{a,b}(P(x))] \\le F_r\\left(\\sqrt{1 + m/4 - 2cS_0}\\right)$\n\nwhere the function $F_r$ is defined in (6).\n\nThus, we have obtained $p_1 = F_r\\left(\\sqrt{(1 + m/4) - 2S_0 + U^{2^{m+1}}}\\right)$ and $p_2 = F_r\\left(\\sqrt{(1 + m/4) - 2cS_0}\\right)$. Applying Theorem 2, we can construct data structures with worst case $O(n^{\\rho} \\log n)$ query time guarantees for c-approximate MIPS, where\n\n$$\\rho = \\frac{\\log F_r\\left(\\sqrt{1 + m/4 - 2S_0 + U^{2^{m+1}}}\\right)}{\\log F_r\\left(\\sqrt{1 + m/4 - 2cS_0}\\right)} \\qquad (11)$$\n\nWe need $p_1 > p_2$ in order for $\\rho < 1$. This requires us to have $-2S_0 + U^{2^{m+1}} < -2cS_0$, which boils down to the condition $c < 1 - \\frac{U^{2^{m+1}}}{2S_0}$. Note that $\\frac{U^{2^{m+1}}}{2S_0}$ can be made arbitrarily close to zero with the appropriate value of m. For any given c < 1, there always exist U < 1 and m such that $\\rho < 1$. This way, we obtain a sublinear query time algorithm for MIPS.\n\nWe also have one more parameter r for the hash function $h_{a,b}$. 
Recall the definition of $F_r$ in Eq. (6): $F_r(d) = 1 - 2\\Phi(-r/d) - \\frac{2}{\\sqrt{2\\pi}\\,(r/d)}\\left(1 - e^{-(r/d)^2/2}\\right)$. Thus, given a c-approximate MIPS instance, $\\rho$ is a function of 3 parameters: U, m, r. The algorithm with the best query time chooses U, m and r, which minimizes the value of $\\rho$. For convenience, we define\n\n$$\\rho^* = \\min_{U, m, r} \\frac{\\log F_r\\left(\\sqrt{1 + m/4 - 2S_0 + U^{2^{m+1}}}\\right)}{\\log F_r\\left(\\sqrt{1 + m/4 - 2cS_0}\\right)} \\quad \\text{s.t.} \\quad \\frac{U^{2^{m+1}}}{2S_0} < 1 - c, \\; m \\in \\mathbb{N}^+, \\; 0 < U < 1. \\qquad (12)$$\n\nSee Figure 1 for the plots of $\\rho^*$. With this best value of $\\rho$, we can state our main result in Theorem 4.\n\nTheorem 4 (Approximate MIPS is Efficient) For the problem of c-approximate MIPS with $\\|q\\|_2 = 1$, one can construct a data structure having $O(n^{\\rho^*} \\log n)$ query time and space $O(n^{1+\\rho^*})$, where $\\rho^* < 1$ is the solution to constraint optimization (12).\n\nFigure 1: Left panel: Optimal values of $\\rho^*$ with respect to approximation ratio c for different $S_0$. The optimization of Eq. (12) was conducted by a grid search over parameters r, U and m, given $S_0$ and c. Right Panel: $\\rho$ values (dashed curves) for m = 3, U = 0.83 and r = 2.5. The solid curves are $\\rho^*$ values. See more details about parameter recommendations in arXiv:1405.5869.\n\n3.5 Practical Recommendation of Parameters\n\nJust like in the typical LSH framework, the value of $\\rho^*$ in Theorem 4 depends on the c-approximate instance we aim to solve, which requires knowing the similarity threshold $S_0$ and the approximation ratio c. Since $\\|q\\|_2 = 1$ and $\\|x\\|_2 \\le U < 1$, $\\forall x \\in \\mathcal{S}$, we have $q^T x \\le U$. A reasonable choice of the threshold $S_0$ is to choose a high fraction of U, for example, $S_0 = 0.9U$ or $S_0 = 0.8U$.\n\nThe computation of $\\rho^*$ and the optimal values of corresponding parameters can be conducted via a grid search over the possible values of U, m and r. 
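As a rough illustration of the grid search just described, the following Python sketch evaluates rho from Eq. (11) using the collision probability F_r of Eq. (6) and scans a small grid of U, m and r. The function names, the grid ranges and the choice S0 = frac * U are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math

def collision_prob(r, d):
    # F_r(d) from Eq. (6); Phi is the standard normal cdf, written via erf.
    t = r / d
    phi = 0.5 * (1.0 + math.erf(-t / math.sqrt(2.0)))  # Phi(-r/d)
    tail = (2.0 / (math.sqrt(2.0 * math.pi) * t)) * (1.0 - math.exp(-t * t / 2.0))
    return 1.0 - 2.0 * phi - tail

def rho(s0, c, u, m, r):
    # Eq. (11): rho = log p1 / log p2 for unit-norm queries.
    d1_sq = 1.0 + m / 4.0 - 2.0 * s0 + u ** (2 ** (m + 1))
    d2_sq = 1.0 + m / 4.0 - 2.0 * c * s0
    if d1_sq <= 0.0 or d2_sq <= 0.0:
        return None  # outside the range where the bound is meaningful
    p1 = collision_prob(r, math.sqrt(d1_sq))
    p2 = collision_prob(r, math.sqrt(d2_sq))
    return math.log(p1) / math.log(p2)

def grid_search(frac, c):
    # Hypothetical coarse grid; S0 is set to frac * U as suggested in Section 3.5.
    best = None
    for ui in range(50, 100):
        u = ui / 100.0
        s0 = frac * u
        for m in range(1, 6):
            if u ** (2 ** (m + 1)) / (2.0 * s0) >= 1.0 - c:
                continue  # violates the constraint in Eq. (12)
            for ri in range(10, 51):
                r = ri / 10.0
                val = rho(s0, c, u, m, r)
                if val is not None and (best is None or val < best[0]):
                    best = (val, u, m, r)
    return best
```

For example, rho(0.9 * 0.83, 0.5, 0.83, 3, 2.5) evaluates the exponent at m = 3, U = 0.83, r = 2.5 with S0 = 0.9U, and grid_search(0.9, 0.5) returns the smallest value found on this coarse grid together with the corresponding (U, m, r).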
We compute $\\rho^*$ in Figure 1 (Left Panel). For convenience, we recommend m = 3, U = 0.83, and r = 2.5. With this choice of the parameters, Figure 1 (Right Panel) shows that the $\\rho$ values using these parameters are very close to $\\rho^*$ values.\n\n3.6 Removing the Condition $\\|q\\|_2 = 1$\n\nChanging norms of the query does not affect the $\\arg\\max_{x \\in \\mathcal{S}} q^T x$. Thus in practice for retrieving top-ranked items, normalizing the query should not affect the performance. But for theoretical purposes, we want the runtime guarantee to be independent of $\\|q\\|_2$. We are interested in the c-approximate instance which being a threshold based approximation changes if the query is normalized.\n\nPreviously, transformations P and Q were precisely meant to remove the dependency on the norms of x. Realizing the fact that we are allowed asymmetry, we can use the same idea to get rid of the norm of q. Let M be the upper bound on all the norms or the radius of the space as defined in Eq (7). Let the transformation $S: \\mathbb{R}^D \\to \\mathbb{R}^D$ be the ones defined in Eq (7). Define asymmetric transformations $P': \\mathbb{R}^D \\to \\mathbb{R}^{D+2m}$ and $Q': \\mathbb{R}^D \\to \\mathbb{R}^{D+2m}$ as\n\n$$P'(x) = [x; \\|x\\|_2^2; \\|x\\|_2^4; ....; \\|x\\|_2^{2^m}; 1/2; ...1/2]; \\qquad Q'(x) = [x; 1/2; ....; 1/2; \\|x\\|_2^2; \\|x\\|_2^4; ....; \\|x\\|_2^{2^m}],$$\n\nGiven the query q and data point x, our new asymmetric transformations are $Q'(S(q))$ and $P'(S(x))$ respectively. We observe that\n\n$$\\|Q'(S(q)) - P'(S(x))\\|_2^2 = \\frac{m}{2} + \\|S(x)\\|_2^{2^{m+1}} + \\|S(q)\\|_2^{2^{m+1}} - 2 q^T x \\times \\left(\\frac{U^2}{M^2}\\right) \\qquad (13)$$\n\nBoth $\\|S(x)\\|_2^{2^{m+1}}, \\|S(q)\\|_2^{2^{m+1}} \\le U^{2^{m+1}} \\to 0$. 
Using exactly same arguments as before, we obtain\n\nTheorem 5 (Unconditional Approximate MIPS is Efficient) For the problem of c-approximate MIPS in a bounded space, one can construct a data structure having $O(n^{\\rho_u^*} \\log n)$ query time and space $O(n^{1+\\rho_u^*})$, where $\\rho_u^* < 1$ is the solution to constraint optimization (14).\n\n$$\\rho_u^* = \\min_{0 < U < 1, \\, m \\in \\mathbb{N}, \\, r} \\frac{\\log F_r\\left(\\sqrt{m/2 - 2S_0 \\left(\\frac{U^2}{M^2}\\right) + 2U^{2^{m+1}}}\\right)}{\\log F_r\\left(\\sqrt{m/2 - 2cS_0 \\left(\\frac{U^2}{M^2}\\right)}\\right)} \\quad \\text{s.t.} \\quad \\frac{U^{(2^{m+1}-2)} M^2}{S_0} < 1 - c, \\qquad (14)$$\n\nAgain, for any c-approximate MIPS instance, with $S_0$ and c, we can always choose m big enough such that $\\rho_u^* < 1$. The theoretical guarantee only depends on the radius of the space M.\n\n[Figure 1 plots: $\\rho^*$ (left) and $\\rho$ (right) against c, with curves labeled S0 = 0.5U, 0.6, 0.7, 0.8 and S0 = 0.9U; the right panel is annotated m=3, U=0.83, r=2.5.]\n\n3.7 A Generic Recipe for Constructing Asymmetric LSHs\n\nWe are allowed any asymmetric transformation on x and q. This gives us a lot of flexibility to construct ALSH for new similarities $\\mathcal{S}$ that we are interested in. The generic idea is to take a particular similarity Sim(x, q) for which we know an existing LSH or ALSH. Then we construct transformations P and Q such that Sim(P(x), Q(q)) is monotonic in the similarity $\\mathcal{S}$ that we are interested in. The other observation that makes it easier to construct P and Q is that LSH based guarantees are independent of dimensions, thus we can expand the dimensions like we did for P and Q.\n\nThis paper focuses on using L2LSH to convert near neighbor search of L2 distance into an ALSH (i.e., L2-ALSH) for MIPS. We can devise new ALSHs for MIPS using other similarities and hash functions. 
For instance, utilizing sign random projections (SRP), the known LSH for correlations, we can construct different P and Q leading to a better ALSH (i.e., Sign-ALSH) for MIPS [22]. We are aware of another work [18] which performs very similarly to Sign-ALSH. Utilizing minwise hashing [2, 15], which is the LSH for resemblance and is known to outperform SRP in sparse data [23], we can construct an even better ALSH (i.e., MinHash-ALSH) for MIPS over binary data [21].\n\n4 Evaluations\n\nDatasets. We evaluate the proposed ALSH scheme for the MIPS problem on two popular collaborative filtering datasets on the task of item recommendations: (i) Movielens(10M), and (ii) Netflix. Each dataset forms a sparse user-item matrix R, where the value of R(i, j) indicates the rating of user i for movie j. Given the user-item ratings matrix R, we follow the standard PureSVD procedure [4] to generate user and item latent vectors. This procedure generates latent vectors $u_i$ for each user i and vector $v_j$ for each item j, in some chosen fixed dimension f. The PureSVD method returns top-ranked items based on the inner products $u_i^T v_j$, $\\forall j$. Despite its simplicity, PureSVD outperforms other popular recommendation algorithms [4]. Following [4], we use the same choices for the latent dimension f, i.e., f = 150 for Movielens and f = 300 for Netflix.\n\n4.1 Ranking Experiment for Hash Code Quality Evaluations\n\nWe are interested in knowing how the two hash functions correlate with the top-10 inner products. For this task, given a user i and its corresponding user vector $u_i$, we compute the top-10 gold standard items based on the actual inner products $u_i^T v_j$, $\\forall j$. 
We then compute K different hash codes of the vector $u_i$ and all the item vectors $v_j$. For every item $v_j$, we compute the number of times its hash values match (or collide) with the hash values of the query, which is user $u_i$, i.e., we compute $Matches_j = \\sum_{t=1}^{K} 1(h_t(u_i) = h_t(v_j))$, based on which we rank all the items.\n\nFigure 2 reports the precision-recall curves in our ranking experiments for top-10 items, for comparing our proposed method with two baseline methods: the original L2LSH and the original sign random projections (SRP). These results confirm the substantial advantage of our proposed method.\n\nFigure 2: Ranking. Precision-Recall curves (higher is better), of retrieving top-10 items, with the number of hashes $K \\in \\{16, 64, 256\\}$. The proposed algorithm (solid, red if color is available) significantly outperforms L2LSH. We fix the parameters m = 3, U = 0.83, and r = 2.5 for our proposed method and we present the results of L2LSH for all r values in {1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5}.\n\nFigure 3: Bucketing. Mean number of inner products per query, relative to a linear scan, evaluated by different hashing schemes at different recall levels, for generating top-50 recommendations (Lower is better). The results corresponding to the best performing K and L (for a wide range of K and L) at a given recall value, separately for all the three hashing schemes, are shown.\n\n4.2 LSH Bucketing Experiment\n\nWe implemented the standard (K, L)-parameterized (where L is number of hash tables) bucketing algorithm [1] for retrieving top-50 items based on PureSVD procedure using the proposed ALSH hash function and the two baselines: SRP and L2LSH. We plot the recall vs the mean ratio of inner product required to achieve that recall. The ratio being computed relative to the number of inner products required in a brute force linear scan. In order to remove the effect of algorithm parameters (K, L) on the evaluations, we report the result from the best performing K and L chosen from $K \\in \\{5, 6, ..., 30\\}$ and $L \\in \\{1, 2, ..., 200\\}$ for each query. We use m = 3, U = 0.83, and r = 2.5 for our hashing scheme. 
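To make the evaluated hashing scheme concrete, here is a minimal Python sketch of L2-ALSH with the parameter values quoted above (m = 3, U = 0.83, r = 2.5): data vectors are scaled as in Eq. (7), transformed with P, and hashed with the L2LSH function of Eq. (5), while the query is normalized and transformed with Q. The helper names and the use of the standard random module are assumptions for this example, not the authors' code; count_matches mirrors the collision count Matches_j of Section 4.1.

```python
import math
import random

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def scale(data, u=0.83):
    # Eq. (7): rescale so that the largest data norm equals u < 1.
    big_m = max(norm(x) for x in data)
    return [[u * v / big_m for v in x] for x in data]

def transform_p(x, m=3):
    # P(x) of Eq. (8): append ||x||^2, ||x||^4, ..., ||x||^(2^m).
    sq = sum(v * v for v in x)
    return list(x) + [sq ** (2 ** i) for i in range(m)]

def transform_q(q, m=3):
    # Q(q) of Eq. (8): normalize the query, then append m copies of 1/2.
    n = norm(q)
    return [v / n for v in q] + [0.5] * m

def make_hashes(dim, k, r=2.5, seed=0):
    # k independent L2LSH functions h(x) = floor((a.x + b) / r), Eq. (5).
    rng = random.Random(seed)
    params = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, r))
              for _ in range(k)]
    def hash_codes(x):
        return [math.floor((sum(a * v for a, v in zip(avec, x)) + b) / r)
                for avec, b in params]
    return hash_codes

def count_matches(query_codes, item_codes):
    # Matches_j: number of hash positions where item and query collide.
    return sum(1 for hq, hx in zip(query_codes, item_codes) if hq == hx)
```

Under these assumptions, items would be ranked by count_matches between hash_codes(transform_q(q)) and hash_codes(transform_p(x)) for each x in the scaled collection, using the same hash_codes function on both sides.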
For L2LSH, we observe that using r = 4 usually performs well and so we show results for r = 4. The results are summarized in Figure 3, confirming that the proposed ALSH leads to significant savings compared to baseline hash functions.\n\n5 Conclusion\n\nMIPS (maximum inner product search) naturally arises in numerous practical scenarios, e.g., collaborative filtering. This problem is challenging and, prior to our work, there existed no provably sublinear time hashing algorithms for MIPS. Also, the existing framework of classical LSH (locality sensitive hashing) is not sufficient for solving MIPS. In this study, we develop ALSH (asymmetric LSH), which generalizes the existing LSH framework by applying (appropriately chosen) asymmetric transformations to the input query vector and the data vectors in the repository. We present an implementation of ALSH by proposing a novel transformation which converts the original inner products into L2 distances in the transformed space. We demonstrate, both theoretically and empirically, that this implementation of ALSH provides provably efficient as well as practical solution to MIPS. Other explicit constructions of ALSH, for example, ALSH through cosine similarity, or ALSH through resemblance (for binary data), will be presented in followup technical reports.\n\nAcknowledgments\n\nThe research is partially supported by NSF-DMS-1444124, NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. We appreciate the constructive comments from the program committees of KDD 2014 and NIPS 2014. 
Shrivastava would also like to thank Thorsten Joachims and the Class of CS6784 (Spring 2014) for valuable feedback.\n\n[Figure 2 panels: Precision (%) vs. Recall (%), Top 10, K = 16, 64, 256, on Movielens and NetFlix; curves: Proposed, L2LSH, SRP. Figure 3 panels: Fraction Multiplications vs. Recall, Top 50, on Movielens and Netflix; curves: Proposed, SRP, L2LSH.]\n\nReferences\n\n[1] A. Andoni and P. Indyk. E2lsh: Exact euclidean locality sensitive hashing. Technical report, 2004.\n[2] A. Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21\u201329, Positano, Italy, 1997.\n[3] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380\u2013388, Montreal, Quebec, Canada, 2002.\n[4] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems, pages 39\u201346. ACM, 2010.\n[5] R. R. Curtin, A. G. Gray, and P. Ram. Fast exact max-kernel search. In SDM, pages 1\u20139, 2013.\n[6] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253\u2013262, Brooklyn, NY, 2004.\n[7] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1814\u20131821. 
IEEE, 2013.\n[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627\u20131645, 2010.\n[9] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, 23(9):881\u2013890, 1974.\n[10] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115\u20131145, 1995.\n[11] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321\u2013350, 2012.\n[12] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604\u2013613, Dallas, TX, 1998.\n[13] N. Koenigstein, P. Ram, and Y. Shavitt. Efficient retrieval of recommendations in a matrix factorization framework. In CIKM, pages 535\u2013544, 2012.\n[14] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems.\n[15] P. Li and A. C. K\u00f6nig. Theory and applications of b-bit minwise hashing. Commun. ACM, 2011.\n[16] P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections. In ICML, 2014.\n[17] P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections and approximate near neighbor search. Technical report, arXiv:1403.8144, 2014.\n[18] B. Neyshabur and N. Srebro. A simpler and better lsh for maximum inner product search (mips). Technical report, arXiv:1410.5518, 2014.\n[19] P. Ram and A. G. Gray. Maximum inner-product search using cone trees. In KDD, pages 931\u2013939, 2012.\n[20] A. Shrivastava and P. Li. Beyond pairwise: Provably fast algorithms for approximate k-way similarity search. 
In NIPS, Lake Tahoe, NV, 2013.\n[21] A. Shrivastava and P. Li. Asymmetric minwise hashing. Technical report, 2014.\n[22] A. Shrivastava and P. Li. An improved scheme for asymmetric lsh. Technical report, arXiv:1410.5410, 2014.\n[23] A. Shrivastava and P. Li. In defense of minhash over simhash. In AISTATS, 2014.\n[24] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194\u2013205, 1998.\n", "award": [], "sourceid": 1229, "authors": [{"given_name": "Anshumali", "family_name": "Shrivastava", "institution": "Cornell University"}, {"given_name": "Ping", "family_name": "Li", "institution": "Rutgers University"}]}