{"title": "Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search", "book": "Advances in Neural Information Processing Systems", "page_first": 791, "page_last": 799, "abstract": "We go beyond the notion of pairwise similarity and look into search problems with $k$-way similarity functions. In this paper, we focus on problems related to \\emph{3-way Jaccard} similarity: $\\mathcal{R}^{3way}= \\frac{|S_1 \\cap S_2 \\cap S_3|}{|S_1 \\cup S_2 \\cup S_3|}$, $S_1, S_2, S_3 \\in \\mathcal{C}$, where $\\mathcal{C}$ is a size $n$ collection of sets (or binary vectors). We show that approximate $\\mathcal{R}^{3way}$ similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to $k$-way resemblance. In the process, we extend traditional framework of \\emph{locality sensitive hashing (LSH)} to handle higher order similarities, which could be of independent theoretical interest. The applicability of $\\mathcal{R}^{3way}$ search is shown on the Google sets\" application. In addition, we demonstrate the advantage of $\\mathcal{R}^{3way}$ resemblance over the pairwise case in improving retrieval quality.\"", "full_text": "Beyond Pairwise: Provably Fast Algorithms for\n\nApproximate k-Way Similarity Search\n\nAnshumali Shrivastava\n\nDepartment of Computer Science\nComputing and Information Science\n\nCornell University\n\nIthaca, NY 14853, USA\n\nPing Li\n\nDepartment of Statistics & Biostatistics\n\nDepartment of Computer Science\n\nRutgers University\n\nPiscataway, NJ 08854, USA\n\nanshu@cs.cornell.edu\n\npingli@stat.rutgers.edu\n\nAbstract\n\nWe go beyond the notion of pairwise similarity and look into search problems\nwith k-way similarity functions. 
In this paper, we focus on problems related to 3-way Jaccard similarity: R3way = |S1\u2229S2\u2229S3| / |S1\u222aS2\u222aS3|, S1, S2, S3 \u2208 C, where C is a size n collection of sets (or binary vectors). We show that approximate R3way similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend the traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of R3way search is shown on the \u201cGoogle Sets\u201d application. In addition, we demonstrate the advantage of R3way resemblance over the pairwise case in improving retrieval quality.\n\n1 Introduction and Motivation\nSimilarity search (near neighbor search) is one of the fundamental problems in Computer Science. The task is to identify a small set of data points which are \u201cmost similar\u201d to a given input query. Similarity search algorithms have been one of the basic building blocks in numerous applications, including search, databases, learning, recommendation systems, computer vision, etc.\nOne widely used notion of similarity on sets is the Jaccard similarity or resemblance [5, 10, 18, 20]. Given two sets S1, S2 \u2286 \u2126 = {0, 1, 2, ..., D \u2212 1}, the resemblance R2way between S1 and S2 is defined as R2way = |S1\u2229S2| / |S1\u222aS2|. Existing notions of similarity in search problems mainly work with pairwise similarity functions. In this paper, we go beyond this notion and look at the problem of k-way similarity search, where the similarity function of interest involves k sets (k \u2265 2). 
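As a concrete illustration (our own sketch, not code from the paper), both the pairwise resemblance R2way and its k-way extension can be computed directly on sets:

```python
# Minimal sketch: pairwise and k-way Jaccard resemblance on Python sets.
from functools import reduce

def resemblance(*sets):
    """k-way resemblance |S1 ∩ ... ∩ Sk| / |S1 ∪ ... ∪ Sk| (0.0 for empty union)."""
    inter = reduce(lambda a, b: a & b, sets)
    union = reduce(lambda a, b: a | b, sets)
    return len(inter) / len(union) if union else 0.0

S1, S2, S3 = {0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4, 5}
print(resemblance(S1, S2))      # 2-way: |{1,2,3}| / |{0,...,4}| = 0.6
print(resemblance(S1, S2, S3))  # 3-way: |{2,3}| / |{0,...,5}| = 1/3
```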
Our work exploits the fact that resemblance can be naturally extended to k-way resemblance similarity [18, 21], defined over k sets {S1, S2, ..., Sk} as Rk-way = |S1\u2229S2\u2229...\u2229Sk| / |S1\u222aS2\u222a...\u222aSk|.\nBinary high-dimensional data: The current web datasets are typically binary, sparse, and extremely high-dimensional, largely due to the wide adoption of the \u201cBag of Words\u201d (BoW) representations for documents and images. It is often the case, in BoW representations, that just the presence or absence (0/1) of specific feature words captures sufficient information [7, 16, 20], especially with (e.g.) 3-grams or higher-order models. And so, the web can be imagined as a giant storehouse of ultra high-dimensional sparse binary vectors. Of course, binary vectors can also be equivalently viewed as sets (containing the locations of the nonzero features).\nWe list four practical scenarios where k-way resemblance search would be a natural choice.\n(i) Google Sets: (http://googlesystem.blogspot.com/2012/11/google-sets-still-available.html) Google Sets is among the earliest Google projects; it allows users to generate a list of similar words by typing only a few related keywords. For example, if the user types \u201cmazda\u201d and \u201chonda\u201d, the application will automatically generate related words like \u201cbmw\u201d, \u201cford\u201d, \u201ctoyota\u201d, etc. This application is currently available in Google Spreadsheets. If we assume the term-document binary representation of each word w in the database, then given queries w1 and w2, we show that |w1\u2229w2\u2229w| / |w1\u222aw2\u222aw| turns out to be a very good similarity measure for this application (see Section 7.1).\n(ii) Joint recommendations: Users A and B would like to watch a movie together. The profile of each person can be represented as a sparse vector over a giant universe of attributes. 
For example, a user profile may be the set of actors, actresses, genres, directors, etc., which she/he likes. On the other hand, we can represent a movie M in the database over the same universe based on the attributes associated with the movie. If we have to recommend movie M jointly to users A and B, then a natural measure to maximize is |A\u2229B\u2229M| / |A\u222aB\u222aM|. The problem of group recommendation [3] is applicable in many more settings, such as recommending people to join circles, etc.\n(iii) Improving retrieval quality: We are interested in finding images of a particular type of object, and we have two or three (possibly noisy) representative images. In such a scenario, a natural expectation is that retrieving images simultaneously similar to all the representative images should be more refined than just retrieving images similar to any one of them. In Section 7.2, we demonstrate that in cases where we have more than one element to search for, we can refine our search quality using k-way resemblance search. In a dynamic feedback environment [4], we can improve subsequent search quality by using k-way similarity search on the pages already clicked by the user.\n(iv) Beyond pairwise clustering: While machine learning algorithms often utilize the data through pairwise similarities (e.g., inner product or resemblance), there are natural scenarios where the affinity relations are not pairwise, but rather triadic, tetradic or higher [2, 30]. The computational cost, of course, will increase exponentially if we go beyond pairwise similarity.\nEfficiency is crucial: With the data explosion in modern applications, the brute force way of scanning all the data for searching is prohibitively expensive, especially in user-facing applications like search. 
The need for k-way similarity search can only be fulfilled if it admits efficient algorithms. This paper fulfills this requirement for k-way resemblance and its derived similarities. In particular, we show fast algorithms with provable query time guarantees for approximate k-way resemblance search. Our algorithms and analysis naturally provide a framework to extend the classical LSH framework [14, 13] to handle higher-order similarities, which could be of independent theoretical interest.\nOrganization: In Section 2, we review approximate near neighbor search and classical Locality Sensitive Hashing (LSH). In Section 3, we formulate the 3-way similarity search problems. Sections 4, 5, and 6 describe provable fast algorithms for several search problems. Section 7 demonstrates the applicability of 3-way resemblance search in real applications.\n2 Classical c-NN and Locality Sensitive Hashing (LSH)\nInitial attempts at finding efficient (sub-linear time) algorithms for exact near neighbor search, based on space partitioning, turned out to be a disappointment with the massive dimensionality of current datasets [11, 28]. Approximate versions of the problem were proposed [14, 13] to break the linear query time bottleneck. One widely adopted formalism is the c-approximate near neighbor (c-NN).\nDefinition 1 (c-Approximate Near Neighbor or c-NN). Consider a set of points, denoted by P, in a D-dimensional space RD, and parameters R0 > 0, \u03b4 > 0. The task is to construct a data structure which, given any query point q, if there exists an R0-near neighbor of q in P, reports some cR0-near neighbor of q in P with probability 1 \u2212 \u03b4.\nThe usual notion of c-NN is for distance. 
Since we deal with similarities, we define an R0-near neighbor of point q as a point p with Sim(q, p) \u2265 R0, where Sim is the similarity function of interest.\nLocality sensitive hashing (LSH) [14, 13] is a popular framework for c-NN problems. LSH is a family of functions with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. In formal terms, consider H a family of hash functions mapping RD to some set S.\nDefinition 2 (Locality Sensitive Hashing (LSH)). A family H is called (R0, cR0, p1, p2)-sensitive if, for any two points x, y \u2208 RD and h chosen uniformly from H, the following is satisfied:\n\n\u2022 if Sim(x, y) \u2265 R0 then PrH(h(x) = h(y)) \u2265 p1;\n\u2022 if Sim(x, y) \u2264 cR0 then PrH(h(x) = h(y)) \u2264 p2.\n\nTypically, for approximate nearest neighbor search, p1 > p2 and c < 1 are needed. Note that c < 1 because we are defining neighbors in terms of similarity. Basically, LSH trades off query time with extra preprocessing time and space, which can be accomplished off-line.\nFact 1: Given a family of (R0, cR0, p1, p2)-sensitive hash functions, one can construct a data structure for c-NN with O(n^\u03c1 log_{1/p2} n) query time and O(n^{1+\u03c1}) space, where \u03c1 = log(1/p1) / log(1/p2).\nMinwise Hashing for Pairwise Resemblance: One popular choice of LSH family associated with resemblance similarity is the Minwise Hashing family [5, 6, 13]. The Minwise Hashing family applies an independent random permutation \u03c0 : \u2126 \u2192 \u2126 on the given set S \u2286 \u2126, and looks at the minimum element under \u03c0, i.e., min(\u03c0(S)). 
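A minimal sketch of this family (our own illustration, not the authors' implementation; permutations are simulated as random shuffles over a small toy universe):

```python
# Sketch of minwise hashing: the fraction of random permutations on which two
# sets share the same minimum hashed element estimates their resemblance.
import random

def minhash(S, perm):
    # min(π(S)) for a permutation π given as a lookup table
    return min(perm[x] for x in S)

def estimate_resemblance(S1, S2, num_perms=1000, D=20, seed=7):
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_perms):
        perm = list(range(D))
        rng.shuffle(perm)                 # a random permutation π: Ω → Ω
        hits += minhash(S1, perm) == minhash(S2, perm)
    return hits / num_perms               # ≈ |S1 ∩ S2| / |S1 ∪ S2|

S1, S2 = {0, 1, 2, 3}, {1, 2, 3, 4}
print(estimate_resemblance(S1, S2))  # close to the true resemblance 3/5 = 0.6
```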
Given two sets S1, S2 \u2286 \u2126 = {0, 1, 2, ..., D \u2212 1}, it can be shown by an elementary probability argument that\n\nPr(min(\u03c0(S1)) = min(\u03c0(S2))) = |S1 \u2229 S2| / |S1 \u222a S2| = R2way.    (1)\n\nThe recent work on b-bit minwise hashing [20, 23] provides an improvement by storing only the lowest b bits of the hashed values min(\u03c0(S1)), min(\u03c0(S2)). [26] implemented the idea of building hash tables for near neighbor search by directly using the bits from b-bit minwise hashing.\n3 3-way Similarity Search Formulation\nOur focus will remain on binary vectors, which can also be viewed as sets. We illustrate our method using the 3-way resemblance similarity function Sim(S1, S2, S3) = |S1\u2229S2\u2229S3| / |S1\u222aS2\u222aS3|. The algorithm and guarantees naturally extend to k-way resemblance. Given a size n collection C \u2286 2^\u2126 of sets (or binary vectors), we are particularly interested in the following three problems:\n\n1. Given two query sets S1 and S2, find S3 \u2208 C that maximizes Sim(S1, S2, S3).\n2. Given a query set S1, find two sets S2, S3 \u2208 C maximizing Sim(S1, S2, S3).\n3. Find three sets S1, S2, S3 \u2208 C maximizing Sim(S1, S2, S3).\n\nThe brute force way of enumerating all possibilities leads to worst case query times of O(n), O(n^2) and O(n^3) for problems 1, 2 and 3, respectively. In the hope of breaking this barrier, just as in the case of pairwise near neighbor search, we define the c-approximate (c < 1) versions of the above three problems. As in the case of c-NN, we are given two parameters R0 > 0 and \u03b4 > 0. For each of the following three problems, the guarantee is with probability at least 1 \u2212 \u03b4:\n\n1. (3-way c-Near Neighbor or 3-way c-NN) Given two query sets S1 and S2, if there exists S3 \u2208 C with Sim(S1, S2, S3) \u2265 R0, then we report some S\u20323 \u2208 C so that Sim(S1, S2, S\u20323) \u2265 cR0.\n2. (3-way c-Close Pair or 3-way c-CP) Given a query set S1, if there exists a pair of sets S2, S3 \u2208 C with Sim(S1, S2, S3) \u2265 R0, then we report sets S\u20322, S\u20323 \u2208 C so that Sim(S1, S\u20322, S\u20323) \u2265 cR0.\n3. (3-way c-Best Cluster or 3-way c-BC) If there exist sets S1, S2, S3 \u2208 C with Sim(S1, S2, S3) \u2265 R0, then we report sets S\u20321, S\u20322, S\u20323 \u2208 C so that Sim(S\u20321, S\u20322, S\u20323) \u2265 cR0.\n\n4 Sub-linear Algorithm for 3-way c-NN\nThe basic philosophy behind sub-linear search is bucketing, which allows us to preprocess the dataset in a fashion so that we can filter many bad candidates without scanning all of them. LSH-based techniques rely on randomized hash functions to create buckets that probabilistically filter bad candidates. This philosophy is not restricted to binary similarity functions and is much more general. Here, we first focus on the 3-way c-NN problem for binary data.\nTheorem 1: For R3way c-NN one can construct a data structure with O(n^\u03c1 log_{1/cR0} n) query time and O(n^{1+\u03c1}) space, where \u03c1 = 1 \u2212 log(1/c) / (log(1/c) + log(1/R0)).\nThe argument for 2-way resemblance can be naturally extended to k-way resemblance. Specifically, given three sets S1, S2, S3 \u2286 \u2126 and an independent random permutation \u03c0 : \u2126 \u2192 \u2126, we have:\n\nPr(min(\u03c0(S1)) = min(\u03c0(S2)) = min(\u03c0(S3))) = R3way.    (2)\n\nEq. (2) shows that minwise hashing, although it operates on sets individually, preserves all 3-way (in fact k-way) similarity structure of the data. The existence of such a hash function is the key requirement behind the existence of efficient approximate search. For the pairwise case, the probability event was a simple hash collision, and the min-hash itself serves as the bucket index. 
In case of the 3-way (and higher) c-NN problem, we have to take care of a more complicated event to create an indexing scheme. In particular, during preprocessing we need to create buckets for each individual S3, and while querying we need to associate the query sets S1 and S2 to the appropriate bucket. We need extra mechanisms to manipulate these minwise hashes to obtain a bucketing scheme.\nProof of Theorem 1: We use two additional functions: f1 : \u2126 \u2192 N for manipulating min(\u03c0(S3)) and f2 : \u2126 \u00d7 \u2126 \u2192 N for manipulating both min(\u03c0(S1)) and min(\u03c0(S2)). Let a \u2208 N+ be such that |\u2126| = D < 10^a. We define f1(x) = (10^a + 1) \u00d7 x and f2(x, y) = 10^a x + y. This choice ensures that, given queries S1 and S2, for any S3 \u2208 C, f1(min(\u03c0(S3))) = f2(min(\u03c0(S1)), min(\u03c0(S2))) holds if and only if min(\u03c0(S1)) = min(\u03c0(S2)) = min(\u03c0(S3)), and thus we get a bucketing scheme.\nTo complete the proof, we introduce two integer parameters K and L. Define a new hash function by concatenating K events. To be more precise, while preprocessing, for every element S3 \u2208 C create buckets g1(S3) = [f1(h1(S3)), ..., f1(hK(S3))], where hi is chosen uniformly from the minwise hashing family. For given query points S1 and S2, retrieve only points in the bucket g2(S1, S2) = [f2(h1(S1), h1(S2)), ..., f2(hK(S1), hK(S2))]. Repeat this process L times independently. Any S3 \u2208 C with Sim(S1, S2, S3) \u2265 R0 is retrieved with probability at least 1 \u2212 (1 \u2212 R0^K)^L. Using K = \u2308log n / log(1/cR0)\u2309 and L = \u2308n^\u03c1 log(1/\u03b4)\u2309, where \u03c1 = 1 \u2212 log(1/c) / (log(1/c) + log(1/R0)), the proof can be obtained using standard concentration arguments used to prove Fact 1; see [14, 13]. It is worth noting that the probability guarantee parameter \u03b4 gets absorbed in the constants as log(1/\u03b4). Note that the process is stopped as soon as we find some element with R3way \u2265 cR0.\nTheorem 1 can be easily extended to k-way resemblance with the same query time and space guarantees. Note that k-way c-NN is at least as hard as k\u2217-way c-NN for any k\u2217 \u2264 k, because we can always choose (k \u2212 k\u2217 + 1) identical query sets in k-way c-NN, and it then reduces to the k\u2217-way c-NN problem. So, any improvement in R3way c-NN implies an improvement in the classical min-hash LSH for Jaccard similarity. The proposed analysis is thus tight in this sense.\nThe above observation makes it possible to also perform the traditional pairwise c-NN search using the same hash tables deployed for 3-way c-NN. In the query phase we have an option: if we have two different queries S1, S2, then we retrieve from bucket g2(S1, S2), and that is the usual 3-way c-NN search. If we are just interested in pairwise near neighbor search given one query S1, then we look into bucket g2(S1, S1), and we know that the 3-way resemblance between S1, S1, S3 boils down to the pairwise resemblance between S1 and S3. So, the same hash tables can be used for both purposes. This property generalizes, and hash tables created for k-way c-NN can be used for any k\u2217-way similarity search so long as k\u2217 \u2264 k. The approximation guarantees still hold. This flexibility makes the k-way c-NN bucketing scheme more advantageous than the pairwise scheme.\nOne of the peculiarities of LSH-based techniques is that the query complexity exponent \u03c1 < 1 depends on the choice of the threshold R0 we are interested in and on the value of c, the approximation ratio we will tolerate. 
Figure 1 plots \u03c1 = 1 \u2212 log(1/c) / (log(1/c) + log(1/R0)) with respect to c, for selected R0 values from 0.01 to 0.99. For instance, if we are interested in highly similar pairs, i.e., R0 \u2248 1, then we are looking at near O(log n) query complexity for the c-NN problem, as \u03c1 \u2248 0. On the other hand, for a very low threshold R0, there is not much hope of time-saving because \u03c1 is close to 1.\n5 Other Efficient k-way Similarities\nWe refer to the k-way similarities for which there exist sub-linear algorithms for c-NN search, with query and space complexity exactly as given in Theorem 1, as efficient. We have demonstrated the existence of one such efficient similarity, namely the k-way resemblance. This leads to a natural question: \u201cAre there more of them?\u201d\n[9] analyzed all the transformations on similarities that preserve the existence of efficient LSH search. In particular, they showed that if S is a similarity for which there exists an LSH family, then there also exists an LSH family for any similarity which is a probability generating function (PGF) transformation on S. A PGF transformation on S is defined as PGF(S) = \u2211_{i=1}^{\u221e} p_i S^i, where S \u2208 [0, 1] and the p_i \u2265 0 satisfy \u2211_{i=1}^{\u221e} p_i = 1. A similar theorem can also be shown in the case of 3-way resemblance.\nFigure 1: \u03c1 = 1 \u2212 log(1/c) / (log(1/c) + log(1/R0)), plotted against c for R0 values from 0.01 to 0.99.\nTheorem 2: Any PGF transformation on the 3-way resemblance R3way is efficient.\nRecall that in the proof of Theorem 1, we created hash assignments f1(min(\u03c0(S3))) and f2(min(\u03c0(S1)), min(\u03c0(S2))), which lead to a bucketing scheme for the 3-way resemblance search, where the collision event E = {f1(min(\u03c0(S3))) = f2(min(\u03c0(S1)), min(\u03c0(S2)))} happens with probability Pr(E) = R3way. 
To prove Theorem 2, we will need to create hash events having probability PGF(R3way) = \u2211_{i=1}^{\u221e} p_i (R3way)^i. Note that 0 \u2264 PGF(R3way) \u2264 1. We will make use of the following simple lemma.\nLemma 1: (R3way)^n is efficient for all n \u2208 N.\nProof: Define new hash assignments g^n_1(S3) = [f1(h1(S3)), ..., f1(hn(S3))] and g^n_2(S1, S2) = [f2(h1(S1), h1(S2)), ..., f2(hn(S1), hn(S2))]. The collision event g^n_1(S3) = g^n_2(S1, S2) has probability (R3way)^n. We now use the pair <g^n_1, g^n_2> instead of <f1, f2> and obtain the same guarantees, as in Theorem 1, for (R3way)^n as well.\nProof of Theorem 2: From Lemma 1, let <g^i_1, g^i_2> be the hash pair corresponding to (R3way)^i, as used in the above lemma. We sample one hash pair from the set {<g^i_1, g^i_2> : i \u2208 N}, where the probability of sampling <g^i_1, g^i_2> is proportional to p_i. Note that p_i \u2265 0 satisfies \u2211_{i=1}^{\u221e} p_i = 1, and so the above sampling is valid. It is not difficult to see that the collision of the sampled hash pair has probability exactly \u2211_{i=1}^{\u221e} p_i (R3way)^i.\nTheorem 2 can be naturally extended to k-way similarity for any k \u2265 2. Thus, we now have infinitely many k-way similarity functions admitting efficient sub-linear search. One, that might be
One, that might be\ninteresting, because of its radial basis kernel like nature, is shown in the following corollary.\n\n1; gi\n\n1; gi\n\nCorollary 1 e\n\nRk(cid:0)way\u22121 is ef\ufb01cient.\n\nlog 1=c\n\n.\n\nlog 1=c+log 1=R0\n\nRk(cid:0)way\u22121 is a PGF on Rk\u2212way.(cid:3)\n\nRk(cid:0)way normalized by e to see that e\n\nProof: Use the expansion of e\n6 Fast Algorithms for 3-way c-CP and 3-way c-BC Problems\nFor 3-way c-CP and 3-way c-BC problems, using bucketing scheme with minwise hashing family\nwill save even more computations.\nTheorem 3 For R3way c-Close Pair Problem (or c-CP) one can construct a data structure with\nO(n2(cid:26) log1=cR0 n) query time and O(n1+2(cid:26)) space, where (cid:26) = 1 \u2212\n(cid:3)\nNote that we can switch the role of f1 and f2 in the proof of Theorem 1. We are thus left with a c-NN\nproblem with search space O(n2) (all pairs) instead of n. A bit of analysis, similar to Theorem 1,\nwill show that this procedure achieves the required query time O(n2(cid:26) log1=cR0 n), but uses a lot\nmore space, O(n2(1+(cid:26))), than shown in the above theorem. It turns out that there is a better way of\ndoing c-CP that saves us space.\nProof of Theorem 3: We again start with constructing hash tables. For every element Sc \u2208 C, we\ncreate a hash-table and store Sc in bucket B(Sc) = [h1(Sc); h2(Sc); :::; hK(Sc)], where hi is chosen\nuniformly from minwise independent family of hash functions H. We create L such hash-tables. For\na query element Sq we look for all pairs in bucket B(Sq) = [h1(Sq); h2(Sq); :::; hK(Sq)] and repeat\nthis for each of the L tables. Note, we do not form pairs of elements retrieved from different tables\nas they do not satisfy Eq. (2). If there exists a pair S1, S2 \u2208 C with Sim(Sq; S1; S2) \u2265 R0, using\nEq. 
(2), we can see that we will find that pair in bucket B(Sq) with probability at least 1 \u2212 (1 \u2212 R0^K)^L. Here, we cannot use the traditional choice of K and L from Theorem 1, as there are O(n^2) instead of O(n) possible pairs. We instead use K = \u23082 log n / log(1/cR0)\u2309 and L = \u2308n^{2\u03c1} log(1/\u03b4)\u2309, with \u03c1 = 1 \u2212 log(1/c) / (log(1/c) + log(1/R0)). With this choice of K and L, the result follows. Note that the process is stopped as soon as we find pairs S1 and S2 with Sim(Sq, S1, S2) \u2265 cR0. The key argument that saves space, from O(n^{2(1+\u03c1)}) to O(n^{1+2\u03c1}), is that we hash the n points individually. Eq. (2) makes it clear that hashing all possible pairs is not needed when every point can be processed individually, and the pairs formed within each bucket itself filter out most of the unnecessary combinations.\nTheorem 4: For the R3way c-Best Cluster Problem (or c-BC), there exists an algorithm with running time O(n^{1+2\u03c1} log_{1/cR0} n), where \u03c1 = 1 \u2212 log(1/c) / (log(1/c) + log(1/R0)).\nAn argument similar to the one used in the proof of Theorem 3 leads to a running time of O(n^{1+3\u03c1} log_{1/cR0} n), as we need L = O(n^{3\u03c1}) and we have to process all points at least once.\nProof of Theorem 4: Repeat the c-CP procedure n times, with every element in the collection C acting as the query once. We use the same set of hash tables and hash functions every time. The preprocessing time is O(n^{1+2\u03c1} log_{1/cR0} n) evaluations of hash functions, and the total querying time is O(n \u00d7 n^{2\u03c1} log_{1/cR0} n), which makes the total running time O(n^{1+2\u03c1} log_{1/cR0} n).\nFor the k-way c-BC problem, we can achieve O(n^{1+(k\u22121)\u03c1} log_{1/cR0} n) running time. If we are interested in a very high similarity cluster, with R0 \u2248 1, then \u03c1 \u2248 0, and the running time is around O(n log n). This is a huge saving over the brute force O(n^k). 
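The space-saving idea behind Theorem 3 (hash each point individually, form candidate pairs only within the query's bucket) can be sketched as follows; this is our own toy illustration, with a single identity "permutation" standing in for the tuned minwise hashes and (K, L) of the proof:

```python
# Sketch of the c-CP bucketing: points are hashed individually into buckets
# B(S) = [h1(S), ..., hK(S)]; candidate pairs come only from within a bucket.
import itertools
from collections import defaultdict

def bucket_key(S, perms):
    # hi(S) = min(π_i(S)) for each permutation π_i (given as lookup tables)
    return tuple(min(p[x] for x in S) for p in perms)

def build_table(collection, perms):
    table = defaultdict(list)
    for idx, S in enumerate(collection):
        table[bucket_key(S, perms)].append(idx)   # hash n points, not n^2 pairs
    return table

def close_pair_candidates(table, Sq, perms):
    # pairs are formed only inside the query's bucket, never across tables
    bucket = table.get(bucket_key(Sq, perms), [])
    return list(itertools.combinations(bucket, 2))

perms = [list(range(4))]                 # identity permutation, for illustration
C = [{0, 1}, {0, 2}, {1, 2}]
table = build_table(C, perms)
print(close_pair_candidates(table, {0, 3}, perms))  # → [(0, 1)]
```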
In most practical cases, especially in the big data regime where we have enormous amounts of data, we can expect the k-way similarity of good clusters to be high, and finding them should be efficient. We can see that with increasing k, hashing techniques save more computation.\n7 Experiments\nIn this section, we demonstrate the usability of 3-way and higher-order similarity search using (i) Google Sets, and (ii) improving retrieval quality.\n7.1 Google Sets: Generating Semantically Similar Words\nHere, the task is to retrieve words which are \u201csemantically\u201d similar to a given set of query words. We collected 1.2 million random documents from Wikipedia and created a standard term-doc binary vector representation of each term present in the collected documents, after removing standard stop words and punctuation marks. More specifically, every word is represented as a 1.2-million-dimension binary vector indicating its presence or absence in the corresponding document. The total number of terms (or words) was around 60,000 in this experiment.\nSince there is no standard benchmark available for this task, we show qualitative evaluations. For querying, we used the following four pairs of semantically related words: (i) \u201cjaguar\u201d and \u201ctiger\u201d; (ii) \u201cartificial\u201d and \u201cintelligence\u201d; (iii) \u201cmilky\u201d and \u201cway\u201d; (iv) \u201cfinger\u201d and \u201clakes\u201d. 
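For illustration, the 3-way ranking over such term-document sets can be sketched as follows (a toy vocabulary and made-up document IDs of our own, not the Wikipedia data):

```python
# Rank candidate words w by 3-way resemblance |w1 ∩ w2 ∩ w| / |w1 ∪ w2 ∪ w|,
# where each word is represented by the set of documents it occurs in.
def rank_3way(w1_docs, w2_docs, vocab):
    def score(w_docs):
        union = w1_docs | w2_docs | w_docs
        return len(w1_docs & w2_docs & w_docs) / len(union) if union else 0.0
    return sorted(vocab, key=lambda w: score(vocab[w]), reverse=True)

# Toy data: doc-ID sets for a few candidate words and the two query words.
vocab = {"toyota": {1, 2, 3}, "bmw": {2, 3, 4}, "banana": {7, 8}}
w1, w2 = {1, 2, 3, 4}, {2, 3, 4, 5}
print(rank_3way(w1, w2, vocab)[:2])  # → ['bmw', 'toyota']
```

Words that never co-occur with both queries (here "banana") score zero, which is how the intersection term filters unrelated candidates.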
Given the query words w1 and w2, we compare the results obtained by the following four methods.\n\u2022 Google Sets: We use Google\u2019s algorithm and report 5 words from Google Spreadsheets [1]. This is Google\u2019s algorithm, which uses its own data.\n\u2022 3-way Resemblance (3-way): We use the 3-way resemblance |w1\u2229w2\u2229w| / |w1\u222aw2\u222aw| to rank every word w and report the top 5 words based on this ranking.\n\u2022 Sum Resemblance (SR): Another intuitive method is to use the sum of pairwise resemblances, |w1\u2229w| / |w1\u222aw| + |w2\u2229w| / |w2\u222aw|, and report the top 5 words based on this ranking.\n\u2022 Pairwise Intersection (PI): We first retrieve the top 100 words based on pairwise resemblance for each of w1 and w2 independently. We then report the words common to both. If there is no word in common, we do not report anything.\nThe results in Table 1 demonstrate that using 3-way resemblance retrieves reasonable candidates for these four queries. An interesting query is \u201cfinger\u201d and \u201clakes\u201d. Finger Lakes is a region in upstate New York. Google could only relate it to New York, while 3-way resemblance could even retrieve the names of cities and lakes in the region. Also, for the query \u201cmilky\u201d and \u201cway\u201d, we can see some (perhaps) unrelated words like \u201cdance\u201d returned by Google. We do not see such random behavior with 3-way resemblance. Although we are not aware of the algorithm and the dataset used by Google, we can see that 3-way resemblance appears to be a right measure for this application.\nThe above results also illustrate the problem with using the sum-of-pairwise-similarity method. The similarity value with one of the words dominates the sum, and hence we see for the query \u201cartificial\u201d and \u201cintelligence\u201d that all the retrieved words are mostly related to the word \u201cintelligence\u201d. 
Same is the case with the query \u201cfinger\u201d and \u201clakes\u201d, as well as \u201cjaguar\u201d and \u201ctiger\u201d. Note that \u201cjaguar\u201d is also a car brand. In addition, for all 4 queries, there was no common word in the top 100 words similar to each query word individually, and so the PI method never returns anything.\nTable 1: Top five words retrieved using various methods for different queries.\n\u201cJAGUAR\u201d AND \u201cTIGER\u201d: GOOGLE = LION, LEOPARD, CHEETAH, CAT, DOG; 3-WAY = LEOPARD, CHEETAH, LION, PANTHER, CAT; SR = CAT, LEOPARD, LITRE, BMW, CHASIS; PI = (nothing returned)\n\u201cMILKY\u201d AND \u201cWAY\u201d: GOOGLE = DANCE, STARS, SPACE, THE, UNIVERSE; 3-WAY = GALAXY, STARS, EARTH, LIGHT, SPACE; SR = EVEN, ANOTHER, STILL, BACK, TIME; PI = (nothing returned)\n\u201cARTIFICIAL\u201d AND \u201cINTELLIGENCE\u201d: GOOGLE = COMPUTER, SCIENCE, ROBOT, INTELLIGENT, TECHNOLOGY; 3-WAY = COMPUTER, PROGRAMMING, SCIENCE, ROBOTICS, HUMAN; SR = SECURITY, WEAPONS, SECRET, ATTACKS, HUMAN; PI = (nothing returned)\n\u201cFINGER\u201d AND \u201cLAKES\u201d: GOOGLE = NEW, YORK, NY, PARK, CITY; 3-WAY = SENECA, CAYUGA, ERIE, ROCHESTER, IROQUOIS; SR = RIVERS, FRESHWATER, FISH, STREAMS, FORESTED; PI = (nothing returned)\nWe should note the importance of the denominator term in 3-way resemblance, without which frequent words would be blindly favored. The exciting contribution of this paper is that 3-way resemblance similarity search admits provable sub-linear guarantees, making it an ideal choice. 
On the other hand, no such provable guarantees are known for SR and other heuristic-based search methods.\n7.2 Improving Retrieval Quality in Similarity Search\nWe also demonstrate how the retrieval quality of traditional similarity search can be boosted by utilizing more query candidates instead of just one. For the evaluations we chose two public datasets, MNIST and WEBSPAM, which were used in a recent related paper [26] for near neighbor search with binary data using b-bit minwise hashing [20, 23].\nThe two datasets reflect the diversity, both in terms of task and scale, that is encountered in practice. The MNIST dataset consists of handwritten digit samples. Each sample is an image of 28 \u00d7 28 pixels, yielding a 784-dimension vector with the associated class label (digits 0-9). We binarize the data by setting all non-zeros to 1. We used the standard partition of MNIST, which consists of 10,000 samples in one set and 60,000 in the other. The WEBSPAM dataset, with 16,609,143 features, consists of sparse vector representations of emails labeled as spam or not. We randomly sampled 70,000 data points and partitioned them into two independent sets of size 35,000 each.\nTable 2: Percentage of top candidates with the same label as the query, retrieved using various similarity criteria. More indicates better retrieval quality.\nMNIST, TOP 1 / 10 / 20 / 50: Pairwise = 94.20 / 92.33 / 91.10 / 89.06; 3-way NNbor = 96.90 / 96.13 / 95.36 / 93.78; 4-way NNbor = 97.70 / 96.89 / 96.28 / 95.10\nWEBSPAM, TOP 1 / 10 / 20 / 50: Pairwise = 98.45 / 96.94 / 96.46 / 95.12; 3-way NNbor = 99.75 / 98.68 / 97.80 / 96.11; 4-way NNbor = 99.90 / 98.87 / 98.15 / 96.45\nFor evaluation, we need to generate potentially similar search query candidates for k-way search. It makes no sense to search for an object simultaneously similar to two very different objects. To generate such query candidates, we took one independent set of the data and partitioned it according to the class labels. 
We then ran a cheap k-means clustering on each class and randomly sampled triplets < x1, x2, x3 > from each cluster for evaluating 2-way, 3-way, and 4-way similarity search. For MNIST, the standard 10,000-sample test set was partitioned by label into 10 sets; each partition was then clustered into 10 clusters, and we chose 10 triplets at random from each cluster. In all, we had 100 such triplets per class, and thus 1000 query triplets overall. For WEBSPAM, which has only 2 classes, we chose one of the independent sets and performed the same procedure, selecting 100 triplets from each cluster. We thus have 1000 triplets per class, for a total of 2000 query triplets.
The above procedure ensures that the elements of each triplet < x1, x2, x3 > are not very far from each other and share the same class label. For each triplet < x1, x2, x3 >, we sort all points x in the other independent set based on the following:

• Pairwise: We use only the information in x1 and rank x by the resemblance |x1 ∩ x| / |x1 ∪ x|.
• 3-way NN: We rank x by the 3-way resemblance |x1 ∩ x2 ∩ x| / |x1 ∪ x2 ∪ x|.
• 4-way NN: We rank x by the 4-way resemblance |x1 ∩ x2 ∩ x3 ∩ x| / |x1 ∪ x2 ∪ x3 ∪ x|.

We look at the top 1, 10, 20, and 50 points under each ordering. Since all items in a query triplet share the same label, the percentage of top retrieved candidates having the same label as the query items is a natural metric for retrieval quality. These percentages, accumulated over all triplets, are summarized in Table 2.
The top candidates retrieved using 3-way resemblance, with 2 query points, are of better quality than those from vanilla pairwise similarity search, and 4-way resemblance, with 3 query points, improves the results further.
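The three orderings above follow directly from the set definitions; a small self-contained sketch (the toy sets and helper names are ours, not from the experiments):

```python
def resemblance(*sets):
    """k-way Jaccard resemblance |S1 n ... n Sk| / |S1 u ... u Sk|."""
    union = set.union(*sets)
    return len(set.intersection(*sets)) / len(union) if union else 0.0

def rank_candidates(queries, candidates, top=2):
    """Order candidate sets by their joint resemblance with all query sets.
    One query set gives the pairwise ordering; two give 3-way NN; three give 4-way NN."""
    return sorted(candidates, key=lambda x: resemblance(*queries, x), reverse=True)[:top]

q = [{1, 2, 3}, {2, 3, 4}]              # two query points: 3-way ordering
pool = [{7, 8}, {2, 3, 4, 5}, {2, 3}]
print(rank_candidates(q, pool))          # {2, 3} ranks first (R3way = 2/4)
```

Note that a candidate overlapping only one of the query sets scores zero under the joint ordering, which is exactly why multiple query points sharpen retrieval.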
This clearly demonstrates that multi-way resemblance similarity search is preferable whenever more than one representative query is available. Note that for MNIST, which contains 10 classes, the boost over pairwise retrieval is substantial. The results follow a consistent trend.

8 Future Work

While the work presented in this paper is promising for efficient 3-way and k-way similarity search in binary high-dimensional data, there are numerous interesting and practical research problems we can study as future work. In this section, we mention a few such examples.
One-permutation hashing. Traditionally, building hash tables for near neighbor search required many (e.g., 1000) independent hashes. This is both time- and energy-consuming, not only for building the tables but also for processing unseen queries. One-permutation hashing [22] offers the hope of reducing many permutations to merely one. The version in [22], however, was not applicable to near neighbor search due to the existence of many empty bins (which offer no indexing capability). The most recent work [27] is able to fill the empty bins and works well for pairwise near neighbor search. It will be interesting to extend [27] to k-way search.
Non-binary sparse data. This paper focuses on minwise hashing for binary data. Various extensions to real-valued data are possible. For example, our results naturally apply to consistent weighted sampling [25, 15], which is one way to handle non-binary sparse data. The problem, however, is not solved if we are interested in similarities such as (normalized) k-way inner products, although the line of work on Conditional Random Sampling (CRS) [19, 18] may be promising. CRS works on non-binary sparse data by storing a bottom subset of nonzero entries after applying one permutation to the (real-valued) sparse data matrix.
CRS performs very well for certain applications, but it does not work in our context because the bottom (nonzero) subsets are not properly aligned.
Building hash tables directly from the bits of minwise hashing. This would be a different approach from how the hash tables are constructed in this paper. For example, [26] directly used the bits from b-bit minwise hashing [20, 23] to build hash tables and demonstrated significant advantages over sim-hash [8, 12] and spectral hashing [29]. It would be interesting to see the performance of this approach in k-way similarity search.
k-way sign random projections. It would be very useful to develop a theory of k-way sign random projections. For usual (real-valued) random projections, it is known that volume (which is related to the determinant) is approximately preserved [24, 17]. We speculate that the collision probability of k-way sign random projections might also be a (monotonic) function of the determinant.

9 Conclusions

We formulate a new framework for k-way similarity search and obtain fast algorithms for k-way resemblance with provable worst-case approximation guarantees. We show applications of k-way resemblance search in practice and demonstrate its advantages over traditional search. Our analysis involves the idea of probabilistic hashing and extends the well-known LSH family beyond the pairwise case. We believe the idea of probabilistic hashing still has a long way to go.

Acknowledgement

The work is supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Ping Li thanks Kenneth Church for introducing Google Sets to him in the summer of 2004 at Microsoft Research.

References

[1] http://www.howtogeek.com/howto/15799/how-to-use-autofill-on-a-google-docs-spreadsheet-quick-tips/.
[2] S. Agarwal, Jongwoo Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering.
In CVPR, 2005.
[3] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawlat, Gautam Das, and Cong Yu. Group recommendation: semantics and efficiency. Proc. VLDB Endow., 2(1):754–765, 2009.
[4] Christina Brandt, Thorsten Joachims, Yisong Yue, and Jacob Bank. Dynamic ranked retrieval. In WSDM, pages 247–256, 2011.
[5] Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997.
[6] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327–336, Dallas, TX, 1998.
[7] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055–1064, 1999.
[8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[9] Flavio Chierichetti and Ravi Kumar. LSH-preserving functions and their applications. In SODA, 2012.
[10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003.
[11] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.
[12] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115–1145, 1995.
[13] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321–350, 2012.
[14] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality.
In STOC, pages 604–613, Dallas, TX, 1998.
[15] Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In ICDM, 2010.
[16] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007.
[17] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Technical report, arXiv:1207.6083, 2013.
[18] Ping Li and Kenneth W. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics (preliminary results appeared in HLT/EMNLP 2005), 33(3):305–354, 2007.
[19] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, Canada, 2006.
[20] Ping Li and Arnd Christian König. b-bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010.
[21] Ping Li, Arnd Christian König, and Wenhao Gui. b-bit minwise hashing for estimating three-way similarities. In NIPS, Vancouver, Canada, 2010.
[22] Ping Li, Art B. Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012.
[23] Ping Li, Anshumali Shrivastava, and Arnd Christian König. b-bit minwise hashing in practice. In Internetware, Changsha, China, 2013.
[24] Avner Magen and Anastasios Zouzias. Near optimal dimensionality reductions that preserve volumes. In APPROX/RANDOM, pages 523–534, 2008.
[25] Mark Manasse, Frank McSherry, and Kunal Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.
[26] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML, Bristol, UK, 2012.
[27] Anshumali Shrivastava and Ping Li.
Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014.
[28] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998.
[29] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In NIPS, Vancouver, Canada, 2008.
[30] D. Zhou, J. Huang, and B. Schölkopf. Beyond pairwise classification and clustering using hypergraphs. In NIPS, Vancouver, Canada, 2006.