{"title": "b-Bit Minwise Hashing for Estimating Three-Way Similarities", "book": "Advances in Neural Information Processing Systems", "page_first": 1387, "page_last": 1395, "abstract": "Computing two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance (Jaccard similarity) using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the lowest b bits (where b ≥ 2 for 3-way). The extension to 3-way similarity from the prior work on 2-way similarity is technically non-trivial. We develop the precise estimator, which is accurate but rather complicated, and we recommend a much simplified estimator suitable for sparse data. Our analysis shows that b-bit minwise hashing can normally achieve a 10- to 25-fold improvement in the storage space required for a given estimator accuracy of the 3-way resemblance.", "full_text": "b-Bit Minwise Hashing for Estimating Three-Way Similarities

Ping Li
Dept. of Statistical Science, Cornell University

Arnd Christian König
Microsoft Research, Microsoft Corporation

Wenhao Gui
Dept. of Statistical Science, Cornell University

Abstract

Computing two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance (Jaccard similarity) using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the lowest b bits (where b ≥ 2 for 3-way). The extension to 3-way similarity from the prior work on 2-way similarity is technically non-trivial. We develop the precise estimator, which is accurate but rather complicated, and we recommend a much simplified estimator suitable for sparse data.
Our analysis shows that b-bit minwise hashing can normally achieve a 10- to 25-fold improvement in the storage space required for a given estimator accuracy of the 3-way resemblance.

1 Introduction

The efficient computation of the similarity (or overlap) between sets is a central operation in a variety of applications, such as word associations (e.g., [13]), data cleaning (e.g., [40, 9]), data mining (e.g., [14]), selectivity estimation (e.g., [30]), or duplicate document detection [3, 4]. In machine learning applications, binary (0/1) vectors can be naturally viewed as sets. For scenarios where the underlying data size is sufficiently large to make storing the data (in main memory) or processing it in its entirety impractical, probabilistic techniques have been proposed for this task.

Word associations (collocations, co-occurrences): If one inputs the query "NIPS machine learning", all major search engines will report the number of pagehits (e.g., one reports 829,003), in addition to the top-ranked URLs. Although no search engine has revealed how it estimates the numbers of pagehits, one natural approach is to treat this as a set intersection estimation problem. Each word can be represented as a set of document IDs, and each set belongs to a very large space Ω. It is expected that |Ω| > 10^10. Word associations have many other applications in Computational Linguistics [13, 38], and were recently used for Web search query reformulation and query suggestions [42, 12].

Here is another example. Commercial search engines display various forms of "vertical" content (e.g., images, news, products) as part of Web search. In order to determine from which "vertical" to display information, there exist various techniques to select verticals. Some of these (e.g., [29, 15]) use, as features, the number of documents in which the words of a search query occur, for different text corpora representing various verticals.
Because this selection is invoked for all search queries (and search operates under tight latency bounds), the computation of these features has to be very fast. Moreover, the accuracy of vertical selection depends on the number/size of document corpora that can be processed within the allotted time [29], i.e., the processing speed can directly impact quality. Now, because of the large number of word combinations in even medium-sized text corpora (e.g., the Wikipedia corpus contains > 10^7 distinct terms), it is impossible to pre-compute and store the associations for all possible multi-term combinations (e.g., > 10^14 for 2-way and > 10^21 for 3-way); instead, the techniques described in this paper can be used for fast estimates of the co-occurrences.

[Footnote 1: This work is supported by NSF (DMS-0808864), ONR (YIP-N000140910911) and Microsoft.]

Database query optimization: Set intersection is a routine operation in databases, employed for example during the evaluation of conjunctive selection conditions in the presence of single-column indexes. Before conducting intersections, a critical task is to (quickly) estimate the sizes of the intermediate results in order to plan the optimal intersection order [20, 8, 25]. For example, consider the task of intersecting four sets of record identifiers: A ∩ B ∩ C ∩ D. Even though the final outcome will be the same, the order of the join operations, e.g., (A ∩ B) ∩ (C ∩ D) or ((A ∩ B) ∩ C) ∩ D, can significantly affect the performance, in particular if the intermediate results, e.g., A ∩ B ∩ C, become too large for main memory and need to be spilled to disk. A good query plan aims to minimize the total size of intermediate results.
Thus, it is highly desirable to have a mechanism which can estimate join sizes very efficiently, especially for the lower-order (2-way and 3-way) intersections, which could potentially result in much larger intermediate results than higher-order intersections.

Duplicate Detection in Data Cleaning: A common task in data cleaning is the identification of duplicates (e.g., duplicate names, organizations, etc.) among a set of items. Now, despite the fact that there is considerable evidence (e.g., [10]) that reliable duplicate detection should be based on local properties of groups of duplicates, most current approaches base their decisions on pairwise similarities between items only. This is in part due to the computational overhead associated with more complex interactions, which our approach may help to overcome.

Clustering: Most clustering techniques are based on pairwise distances between the items to be clustered. However, there are a number of natural scenarios where the affinity relations are not pairwise, but rather triadic, tetradic or higher (e.g., [1, 43]). Again, our approach may improve the performance in these scenarios if the distance measures can be expressed in the form of set overlap.

Data mining: A lot of work in data mining has focused on efficient candidate pruning in the context of pairwise associations (e.g., [14]); a number of such pruning techniques leverage minwise hashing to prune pairs of items. But in many contexts (e.g., association rules with more than 2 items), multi-way associations are relevant; here, pruning based on pairwise interactions may perform much worse than multi-way pruning.

1.1 Ultra-high dimensional data are often binary

For duplicate detection in the context of Web crawling/search, each document can be represented as a set of w-shingles (w contiguous words); w = 5 or 7 in several studies [3, 4, 17].
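As a concrete illustration of the w-shingle representation just described, here is a minimal sketch (the input sentence is arbitrary toy data, not from the paper):

```python
def shingles(text, w=5):
    """Return the set of w-shingles (w contiguous words) of a document.
    Only presence/absence matters, so a set (a binary vector) suffices."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

doc = "the quick brown fox jumps over the lazy dog"
S = shingles(doc, w=5)
# A document with n words has at most n - w + 1 distinct w-shingles:
# here 9 words give 9 - 5 + 1 = 5 shingles.
assert len(S) == 5
```

Set intersection on such shingle sets is then exactly the inner product of the corresponding binary vectors.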
Normally only the absence/presence (0/1) information is used, as a w-shingle rarely occurs more than once in a page if w ≥ 5. The total number of shingles is commonly set to be |Ω| = 2^64; thus the set intersection corresponds to computing the inner product between binary data vectors of 2^64 dimensions. Interestingly, even when the data are not too high-dimensional (e.g., only thousands of dimensions), empirical studies [6, 23, 26] achieved good performance using SVMs with binary-quantized (text or image) data.

1.2 Minwise Hashing and SimHash

Two of the most widely adopted approaches for estimating set intersections are minwise hashing [3, 4] and sign (1-bit) random projections (also known as simhash) [7, 34], which are both special instances of the general techniques proposed in the context of locality-sensitive hashing [7, 24]. These techniques have been successfully applied to many tasks in machine learning, databases, data mining, and information retrieval [18, 36, 11, 22, 16, 39, 28, 41, 27, 5, 2, 37, 7, 24, 21].

Limitations of random projections: The method of random projections (including simhash) is limited to estimating pairwise similarities. Random projections convert any data distribution to a (zero-mean) multivariate normal, whose density function is determined by the covariance matrix, which contains only the pairwise information of the original data. This is a serious limitation.

1.3 Prior work on b-Bit Minwise Hashing

Instead of storing each hashed value using 64 bits as in prior studies, e.g., [17], [35] proposed storing only the lowest b bits. [35] demonstrated that using b = 1 reduces the storage space at least by a factor of 21.3 (for a given accuracy) compared to b = 64, if one is interested in resemblance ≥ 0.5, the threshold used in prior studies [3, 4].
Moreover, by choosing the number b of bits to be retained, it becomes possible to systematically adjust the degree to which the estimator is "tuned" towards higher similarities, as well as the amount of hashing (random permutations) required.

[35] considered only the pairwise resemblance. To extend it to the multi-way case, we have to solve new and challenging probability problems. Compared to the pairwise case, our new estimator is significantly different. In fact, as we will show later, estimating 3-way resemblance requires b ≥ 2.

1.4 Notation

Figure 1: Notation for 2-way and 3-way set intersections (the regions f1, f2, f3, a12, a13, a23, a and their normalized counterparts r1, r2, r3, s12, s13, s23, s).

Fig. 1 describes the notation used in 3-way intersections for three sets S1, S2, S3 ⊆ Ω, |Ω| = D.

• f1 = |S1|, f2 = |S2|, f3 = |S3|.
• a12 = |S1 ∩ S2|, a13 = |S1 ∩ S3|, a23 = |S2 ∩ S3|, a = a123 = |S1 ∩ S2 ∩ S3|.
• r1 = f1/D, r2 = f2/D, r3 = f3/D; s12 = a12/D, s13 = a13/D, s23 = a23/D, s = s123 = a/D.
• u = r1 + r2 + r3 − s12 − s13 − s23 + s.

We define three 2-way resemblances (R12, R13, R23) and one 3-way resemblance (R) as:

R12 = |S1 ∩ S2| / |S1 ∪ S2|,   R13 = |S1 ∩ S3| / |S1 ∪ S3|,   R23 = |S2 ∩ S3| / |S2 ∪ S3|,   R = R123 = |S1 ∩ S2 ∩ S3| / |S1 ∪ S2 ∪ S3|,   (1)

which, using our notation, can be expressed in various forms:

Rij = aij / (fi + fj − aij) = sij / (ri + rj − sij),   i ≠ j,   (2)

R = a / (f1 + f2 + f3 − a12 − a13 − a23 + a) = s / (r1 + r2 + r3 − s12 − s13 − s23 + s) = s/u.   (3)

Note that, instead of a123, s123, R123, we simply use a, s, R.
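The notation of Sec. 1.4 is easy to ground in code. A minimal sketch with arbitrary toy sets (not data from the paper), checking that the normalized forms agree with the set-based definitions:

```python
# Three toy sets in a small space Omega = {0, ..., D-1}.
S1 = {0, 1, 2, 3, 4, 5}
S2 = {3, 4, 5, 6, 7}
S3 = {4, 5, 7, 8}
D = 16  # |Omega|, arbitrary here

f1, f2, f3 = len(S1), len(S2), len(S3)
a12, a13, a23 = len(S1 & S2), len(S1 & S3), len(S2 & S3)
a = len(S1 & S2 & S3)

# 2-way and 3-way resemblances, Eq. (1)
R12 = a12 / len(S1 | S2)
R = a / len(S1 | S2 | S3)

# The same quantities via the normalized notation, Eqs. (2)-(3)
r1, r2, r3 = f1 / D, f2 / D, f3 / D
s12, s13, s23, s = a12 / D, a13 / D, a23 / D, a / D
u = r1 + r2 + r3 - s12 - s13 - s23 + s

assert abs(R12 - s12 / (r1 + r2 - s12)) < 1e-12
assert abs(R - s / u) < 1e-12
```

Note that u is simply |S1 ∪ S2 ∪ S3| / D, by inclusion-exclusion.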
When the set sizes fi = |Si| can be assumed to be known, we can compute resemblances from intersections and vice versa:

aij = Rij / (1 + Rij) · (fi + fj),   a = R / (1 − R) · (f1 + f2 + f3 − a12 − a13 − a23).

Thus, estimating resemblances and estimating intersection sizes are two closely related problems.

1.5 Our Main Contributions

• We derive the basic probability formula for estimating 3-way resemblance using b-bit hashing. The derivation turns out to be significantly more complex than in the 2-way case. This basic probability formula naturally leads to a (complicated) estimator of resemblance.

• We leverage the observation that many real applications involve sparse data (i.e., ri = fi/D ≈ 0, while the ratios fi/fj = ri/rj may still be significant) to develop a much simplified estimator, which is desired in practical applications. This assumption of fi/D → 0 significantly simplifies the estimator and frees us from having to know the cardinalities fi.

• We analyze the theoretical variance of the simplified estimator and compare it with the original minwise hashing method (using 64 bits). Our theoretical analysis shows that b-bit minwise hashing can normally achieve a 10- to 25-fold improvement in storage space (for a given estimator accuracy of the 3-way resemblance) when the set similarities are not extremely low (e.g., when the 3-way resemblance > 0.02). These results are particularly important for applications in which only detecting high resemblance/overlap is relevant, such as many data cleaning scenarios or duplicate detection.

The recommended procedure for estimating 3-way resemblances (in sparse data) is shown as Alg. 1.

Algorithm 1: The b-bit minwise hashing algorithm, applied to estimating 3-way resemblances in a collection of N sets.
This procedure is suitable for sparse data, i.e., ri = fi/D ≈ 0.

Input: Sets Sn ⊆ Ω = {0, 1, ..., D − 1}, n = 1 to N.
Pre-processing phase:
1) Generate k random permutations πj : Ω → Ω, j = 1 to k.
2) For each set Sn and permutation πj, store the lowest b bits of min(πj(Sn)), denoted by en,t,πj, t = 1 to b.
Estimation phase: (Use three sets S1, S2, and S3 as an example.)
1) Compute P̂12,b = (1/k) Σ_{j=1}^{k} { Π_{t=1}^{b} 1{e1,t,πj = e2,t,πj} }. Similarly, compute P̂13,b and P̂23,b.
2) Compute P̂b = (1/k) Σ_{j=1}^{k} { Π_{t=1}^{b} 1{e1,t,πj = e2,t,πj = e3,t,πj} }.
3) Estimate R by R̂b = [ 4^b P̂b − 2^b ( P̂12,b + P̂13,b + P̂23,b ) + 2 ] / [ (2^b − 1)(2^b − 2) ].
4) If needed, the 2-way resemblances Rij can be estimated as R̂ij,b = (2^b P̂ij,b − 1) / (2^b − 1).

2 The Precise Theoretical Probability Analysis

Minwise hashing applies k random permutations πj : Ω → Ω, Ω = {0, 1, ..., D − 1}, and then estimates R12 (and similarly the other 2-way resemblances) using the following collision probability:

Pr( min(πj(S1)) = min(πj(S2)) ) = |S1 ∩ S2| / |S1 ∪ S2| = R12.

This method naturally extends to estimating 3-way resemblances for three sets S1, S2, S3 ⊆ Ω:

Pr( min(πj(S1)) = min(πj(S2)) = min(πj(S3)) ) = |S1 ∩ S2 ∩ S3| / |S1 ∪ S2 ∪ S3| = R.

To describe b-bit hashing, we define the minimum values under π and their lowest b bits to be:

zi = min(π(Si)),   ei,t = t-th lowest bit of zi.   (4), (5)

To estimate R, we need to compute the empirical estimates of the probabilities Pij,b and Pb, where

Pij,b = Pr( Π_{t=1}^{b} 1{ei,t = ej,t} = 1 ),   Pb = P123,b = Pr( Π_{t=1}^{b} 1{e1,t = e2,t = e3,t} = 1 ).

The main theoretical task is to derive Pb. The prior work [35] already derived Pij,b; see Appendix A. To simplify the algebra, we assume that D is large, which is virtually always satisfied in practice.

Theorem 1 Assume D is large. Then

Pb = Pr( Π_{t=1}^{b} 1{e1,t = e2,t = e3,t} = 1 ) = Z/u + R = (Z + s)/u,   (6)

where u = r1 + r2 + r3 − s12 − s13 − s23 + s, and

Z = (s12 − s) A3,b + [(r3 − s13 − s23 + s)/(r1 + r2 − s12)] s12 G12,b
  + (s13 − s) A2,b + [(r2 − s12 − s23 + s)/(r1 + r3 − s13)] s13 G13,b
  + (s23 − s) A1,b + [(r1 − s12 − s13 + s)/(r2 + r3 − s23)] s23 G23,b
  + [(r1 − s12) A2,b + (r2 − s12) A1,b] [(r3 − s13 − s23 + s)/(r1 + r2 − s12)] G12,b
  + [(r1 − s13) A3,b + (r3 − s13) A1,b] [(r2 − s12 − s23 + s)/(r1 + r3 − s13)] G13,b
  + [(r2 − s23) A3,b + (r3 − s23) A2,b] [(r1 − s12 − s13 + s)/(r2 + r3 − s23)] G23,b,

Aj,b = rj (1 − rj)^{2^b − 1} / [ 1 − (1 − rj)^{2^b} ],   Gij,b = (ri + rj − sij)(1 − ri − rj + sij)^{2^b − 1} / [ 1 − (1 − ri − rj + sij)^{2^b} ],   i, j ∈ {1, 2, 3}, i ≠ j.

Theorem 1 naturally suggests an iterative estimation procedure, by writing Eq. (6) as s = Pb·u − Z.

Figure 2: Pb, for verifying the probability formula in Theorem 1. The empirical estimates and the theoretical predictions essentially overlap regardless of the sparsity measure ri = fi/D.

A Simulation Study: For the purpose of verifying Theorem 1, we use three sets corresponding to the occurrences of three common words ("OF", "AND", and "OR") in a chunk of real-world Web crawl data. Each (word) set is a set of document (Web page) IDs which contained that word at least once.
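The basic 3-way collision identity at the start of Sec. 2, Pr(min(π(S1)) = min(π(S2)) = min(π(S3))) = R, can be checked numerically with explicit random permutations. A Monte Carlo sketch with arbitrary toy sets (not the Web-crawl data used in the paper):

```python
import random

# Toy sets in Omega = {0, ..., D-1}; intersection is range(20, 30),
# union is range(0, 50), so the true 3-way resemblance is R = 10/50 = 0.2.
D = 50
S1 = set(range(0, 30))
S2 = set(range(15, 40))
S3 = set(range(20, 50))
R = len(S1 & S2 & S3) / len(S1 | S2 | S3)

random.seed(0)
k = 50_000  # number of independent random permutations
hits = 0
for _ in range(k):
    pi = list(range(D))
    random.shuffle(pi)  # a uniformly random permutation of Omega
    if min(pi[i] for i in S1) == min(pi[i] for i in S2) == min(pi[i] for i in S3):
        hits += 1

# The empirical collision rate matches R up to Monte Carlo noise.
assert abs(hits / k - R) < 0.01
```

The standard error here is sqrt(R(1 − R)/k) ≈ 0.0018, so the 0.01 tolerance is comfortable.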
The three sets are not too sparse, and D = 2^16 suffices to represent their elements. The ri = fi/D values are 0.5697, 0.5537, and 0.3564, respectively. The true 3-way resemblance is R = 0.47.

We can also increase D by mapping these sets into a larger space using a random mapping, with D = 2^16, 2^18, 2^20, or 2^22. When D = 2^22, the ri values are 0.0089, 0.0087, 0.0056.

Fig. 2 presents the empirical estimates of the probability Pb, together with the theoretical predictions by Theorem 1. The empirical estimates essentially overlap the theoretical predictions. Even though the proof assumes D → ∞, D does not have to be too large for Theorem 1 to be accurate.

3 The Much Simplified Estimator for Sparse Data

The basic probability formula (Theorem 1) we derived could be too complicated for practical use. To obtain a simpler formula, we leverage the observation that in practice we often have ri = fi/D ≈ 0, even though both fi and D can be very large. For example, consider web duplicate detection [17]. Here, D = 2^64, which means that even for a web page with fi = 2^54 shingles, we still have fi/D ≈ 0.001. Note that, even when ri → 0, the ratios, e.g., r2/r1, can still be large. Recall that the resemblances (2) and (3) are determined only by these ratios.

We analyzed the distribution of fi/D using two real-life datasets: the UCI dataset containing 3 × 10^5 NYTimes articles, and a Microsoft proprietary dataset with 10^6 news articles [19]. For the UCI-NYTimes dataset, each document was already processed as a set of single words.
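Since Alg. 1 is recommended exactly for this sparse regime, here is an end-to-end sketch of it (arbitrary toy sets; explicit random permutations are used for clarity, whereas large-scale implementations would use hash functions):

```python
import random

def bbit_minhash(sets, D, k, b, seed=0):
    """Pre-processing phase of Alg. 1: for each set, store the lowest b bits
    of min(pi_j(S)) under k random permutations pi_j of {0, ..., D-1}."""
    rng = random.Random(seed)
    mask = (1 << b) - 1
    perms = []
    for _ in range(k):
        pi = list(range(D))
        rng.shuffle(pi)
        perms.append(pi)
    return [[min(pi[i] for i in S) & mask for pi in perms] for S in sets]

def estimate_R3(e1, e2, e3, b):
    """Estimation phase of Alg. 1: the sparse-data 3-way estimator R_hat_b.
    Comparing the masked b-bit values is equivalent to the product of the
    per-bit indicators in steps 1) and 2)."""
    k = len(e1)
    P12 = sum(x == y for x, y in zip(e1, e2)) / k
    P13 = sum(x == y for x, y in zip(e1, e3)) / k
    P23 = sum(x == y for x, y in zip(e2, e3)) / k
    Pb = sum(x == y == z for x, y, z in zip(e1, e2, e3)) / k
    return (4**b * Pb - 2**b * (P12 + P13 + P23) + 2) / ((2**b - 1) * (2**b - 2))

# Toy example: ri = 60/2000 = 0.03, reasonably sparse; true R = 20/100 = 0.2.
D = 2000
S1, S2, S3 = set(range(0, 60)), set(range(30, 90)), set(range(40, 100))
R = len(S1 & S2 & S3) / len(S1 | S2 | S3)
e1, e2, e3 = bbit_minhash([S1, S2, S3], D, k=2000, b=2)
R_hat = estimate_R3(e1, e2, e3, b=2)  # close to R for this sample size
```

With b = 2 and k = 2000 the theoretical standard deviation of R̂2 here is about 0.014, so the estimate lands within a few hundredths of R = 0.2.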
For the Microsoft dataset, we report results using three different representations: single words (1-shingle), 2-shingles (two contiguous words), and 3-shingles. Table 1 reports the summary statistics of the fi/D values.

Table 1: Summary statistics of the fi/D values in two datasets

Data                                 | Mean    | Std.    | Median
3 × 10^5 UCI-NYTimes articles        | 0.0022  | 0.0011  | 0.0021
10^6 Microsoft articles (1-shingle)  | 0.00032 | 0.00023 | 0.00027
10^6 Microsoft articles (2-shingle)  | 0.00004 | 0.00005 | 0.00003
10^6 Microsoft articles (3-shingle)  | 0.00002 | 0.00002 | 0.00002

For truly large-scale applications, prior studies [3, 4, 17] commonly used 5-shingles. This means that real-world data may be significantly more sparse than the values reported in Table 1.

3.1 The Simplified Probability Formula and the Practical Estimator

Theorem 2 Assume D is large. Let T = R12 + R13 + R23. As r1, r2, r3 → 0,

Pb = Pr( Π_{i=1}^{b} 1{e1,i = e2,i = e3,i} = 1 ) = [ (2^b − 1)(2^b − 2) R + (2^b − 1) T + 1 ] / 4^b.   (7)

Interestingly, if b = 1, then P1 = (1 + T)/4, i.e., no information about the 3-way resemblance R is contained. Hence, it is necessary to use b ≥ 2 to estimate 3-way similarities.

Alg. 1 uses P̂b and P̂ij,b to denote the empirical estimates of the theoretical probabilities Pb and Pij,b, respectively. Assuming r1, r2, r3 → 0, the proposed estimator of R, denoted by R̂b, is

R̂b = [ 4^b P̂b − 2^b ( P̂12,b + P̂13,b + P̂23,b ) + 2 ] / [ (2^b − 1)(2^b − 2) ].   (8)

Theorem 3 Assume D is large and r1, r2, r3 → 0.
Then R̂b in (8) is unbiased, with variance

Var(R̂b) = (1/k) · [ 1 + (2^b − 3) T + (4^b − 6·2^b + 10) R − (2^b − 1)(2^b − 2) R² ] / [ (2^b − 1)(2^b − 2) ].   (9)

It is interesting to examine several special cases:

• b = 1: Var(R̂1) = ∞, i.e., one must use b ≥ 2.
• b = 2: Var(R̂2) = (1/6k) (1 + T + 2R − 6R²).
• b = ∞: Var(R̂∞) = (1/k) R(1 − R) = Var(R̂M). R̂M is the original minwise hashing estimator for 3-way resemblance. In principle, the estimator R̂M requires infinite precision (i.e., b = ∞). Numerically, Var(R̂M) and Var(R̂64) are indistinguishable.

3.2 Simulations for Validating Theorem 3

We now present a simulation study for verifying Theorem 3, using the same three sets used in Fig. 2. Fig. 3 presents the resulting empirical biases, E(R̂b) − R. Fig. 4 presents the empirical mean square errors (MSE = bias² + variance), together with the theoretical variances Var(R̂b) in Theorem 3.

Figure 3: Bias of R̂b (8). We used 3 (word) sets, "OF", "AND", and "OR", and four D values: 2^16, 2^18, 2^20, and 2^22. We conducted experiments using b = 2, 3, and 4, as well as the original minwise hashing (denoted by "M"). The plots verify that as ri decreases (to zero), the biases vanish. Note that the set sizes fi remain the same, but the relative values ri = fi/D decrease as D increases.

Figure 4: MSE of R̂b (8). The solid curves are the empirical MSEs (= var + bias²) and the dashed lines are the theoretical variances (9), under the assumption of ri → 0. Ideally, we would like to see the solid and dashed lines overlap. When D = 2^20 and D = 2^22, even though the ri values are not too small, the solid and dashed lines almost overlap.
Note that, at the same sample size k, we always have Var(R̂2) > Var(R̂3) > Var(R̂4) > Var(R̂M), where R̂M is the original minwise hashing estimator. We can see that Var(R̂3) and Var(R̂4) are very close to Var(R̂M).

We can summarize the results in Fig. 3 and Fig. 4 as follows:

• When the ri = fi/D values are large (e.g., ri ≈ 0.5 when D = 2^16), the estimates using (8) can be noticeably biased. The estimation biases diminish as the ri values decrease. In fact, even when the ri values are not small (e.g., ri ≈ 0.05 when D = 2^20), the biases are already very small (roughly 0.005 when D = 2^20).

• The variance formula (9) becomes accurate when the ri values are not too large. For example, when D = 2^18 (ri ≈ 0.1), the empirical MSEs largely overlap the theoretical variances (which assume ri → 0) unless the sample size k is large. When D = 2^20 (and D = 2^22), the empirical MSEs and theoretical variances overlap.

• For real applications, we expect that D will be very large (e.g., 2^64) and the ri values (fi/D) will be very small, so our proposed simple estimator (8) will be very useful in practice, because it becomes unbiased and its variance can be reliably predicted by (9).

4 Improving Estimates for Dense Data Using Theorem 1

While we believe the simple estimator in (8) and Alg. 1 should suffice in most applications, we demonstrate here that the sparsity assumption ri → 0 is not essential if one is willing to use the more sophisticated estimation procedure provided by Theorem 1.

By Eq. (6), s = Pb·u − Z, where Z itself contains s, sij, ri, etc. We first estimate the sij (from the estimated Rij) using the precise formula for the two-way case; see Appendix A. We then iteratively solve for s, using the initial guess provided by the estimator R̂b in (8). Usually a few iterations suffice.

Fig.
5 reports the bias (leftmost panel, only for D = 2^16) and MSE, corresponding to Fig. 3 and Fig. 4. In Fig. 5, the solid curves are obtained using the precise estimation procedure by Theorem 1. The dashed curves are the estimates using the simplified estimator R̂b, which assumes ri → 0.

Even when the data are not sparse, the precise estimation procedure provides unbiased estimates, as verified by the leftmost panel of Fig. 5. Using the precise procedure results in noticeably more accurate estimates in non-sparse data, as verified by the second panel of Fig. 5. However, as long as the data are reasonably sparse (the right two panels), the simple estimator R̂b in (8) is accurate.

Figure 5: The bias (leftmost panel) and MSE of the precise estimation procedure, using the same data used in Fig. 3 and Fig. 4. The dashed curves correspond to the estimates using the simplified estimator R̂b in (8), which assumes ri → 0.

5 Quantifying the Improvements Using b-Bit Hashing

This section is devoted to analyzing the improvements of b-bit minwise hashing, compared to using 64 bits for each hashed value.
Throughout the paper, we use the terms "sample" and "sample size" (denoted by k). The original minwise hashing stores each "sample" using 64 bits (as in [17]). For b-bit minwise hashing, we store each "sample" using b bits only. Note that Var(R̂64) and Var(R̂M) (the variance of the original minwise hashing) are numerically indistinguishable.

As we decrease b, the space needed for each sample becomes smaller; the estimation variance at the same sample size k, however, increases. This variance-space trade-off can be quantified by B(b) = b × Var(R̂b) × k, which is called the storage factor. Lower B(b) is more desirable. The ratio B(64)/B(b) precisely characterizes the improvement of b-bit hashing compared to using 64 bits.

Fig. 6 confirms the substantial improvements of b-bit hashing over the original minwise hashing using 64 bits. The improvements in terms of the storage space are usually 10 (or 15) to 25-fold when the sets are reasonably similar (i.e., when the 3-way resemblance > 0.1). When the three sets are very similar (e.g., the top left panel), the improvement can even be 25- to 30-fold.

Figure 6: B(64)/B(b), the relative storage improvement of using b = 2, 3, 4, 6 bits, compared to using 64 bits. Since the variance (9) contains both R and T = R12 + R13 + R23, we compare variances using different T/R ratios. As 3R ≤ T always holds, we let T = αR for some α ≥ 3. Since T ≤ 3, we know R ≤ 3/α.
Practical applications are often interested in cases with reasonably large R values.

6 Evaluation of Accuracy

We conducted a duplicate detection experiment on a public (UCI) collection of 300,000 NYTimes news articles. The task is to identify 3-groups with 3-way resemblance R exceeding a threshold R0. We used a subset of the data; the total number of 3-groups is about one billion. We experimented with b = 2, b = 4, and the original minwise hashing. Fig. 7 presents the precision curves for a representative set of thresholds R0. Just like in [35], the recall curves are not shown because they could not differentiate the estimators. These curves confirm the significant improvement of using b-bit minwise hashing when the threshold R0 is quite high (e.g., 0.3). In fact, when R0 = 0.3, using b = 4 resulted in similar precision as the original minwise hashing (i.e., a 64/4 = 16-fold reduction in storage).
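The storage-factor comparison of Sec. 5 is straightforward to reproduce from the variance formula (9). A sketch (the values of R and T below are arbitrary illustrations, and b = 64 stands in for the "infinite precision" baseline R̂M):

```python
def var_Rb(b, R, T, k=1):
    """Variance (9) of the sparse-data estimator R_hat_b (per sample, k = 1)."""
    c = (2**b - 1) * (2**b - 2)
    return (1 + (2**b - 3) * T + (4**b - 6 * 2**b + 10) * R - c * R * R) / (c * k)

def storage_ratio(b, R, T):
    """B(64)/B(b), where B(b) = b * Var(R_hat_b) * k; k cancels in the ratio."""
    return (64 * var_Rb(64, R, T)) / (b * var_Rb(b, R, T))

# Example: T = 3R (the minimum possible T) and a fairly similar triple.
R = 0.4
print(round(storage_ratio(2, R, 3 * R), 1))  # → 22.6
```

Note that var_Rb(64, ...) is numerically R(1 − R), matching the b = ∞ special case, and that at this high similarity b = 2 gives the largest ratio, consistent with the top-left panel of Fig. 6.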
Even when R0 = 0.1, using b = 4 can still achieve precision similar to the original minwise hashing by only slightly increasing the sample size k.

Figure 7: Precision curves on the UCI collection of news data. The task is to retrieve news-article 3-groups with resemblance R ≥ R0. For example, consider R0 = 0.2. To achieve a precision of at least 0.8, 2-bit hashing and 4-bit hashing require about k = 500 samples and k = 260 samples, respectively, while the original minwise hashing (denoted by M) requires about 170 samples.

7 Conclusion

Computing set similarities is fundamental in many applications. In machine learning, high-dimensional binary data are common and are equivalent to sets. This study is devoted to simultaneously estimating 2-way and 3-way similarities using b-bit minwise hashing. Compared to the prior work on estimating 2-way resemblance [35], the extension to 3-way is important for many application scenarios (as described in Sec. 1) and is technically non-trivial.

For estimating 3-way resemblance, our analysis shows that b-bit minwise hashing can normally achieve a 10- to 25-fold improvement in the storage space required for a given estimator accuracy, when the set similarities are not extremely low (e.g., 3-way resemblance > 0.02). Many applications such as data cleaning and de-duplication are mainly concerned with relatively high set similarities. For many practical applications, the reductions in storage directly translate into improvements in processing speed as well, especially when memory latency is the main bottleneck, which, with the advent of many-core processors, is more and more common.

Future work: We are interested in developing a b-bit version of Conditional Random Sampling (CRS) [31, 32, 33], which requires only one permutation (instead of k permutations) and naturally extends to non-binary data. CRS is also provably more accurate than minwise hashing for binary data.
However, the analysis for developing the b-bit version of CRS appears to be very difficult.

A Review of b-Bit Minwise Hashing for 2-Way Resemblance

Theorem 4 ([35]) Assume D is large. Then

P12,b = Pr( Π_{i=1}^{b} 1{e1,i = e2,i} = 1 ) = C1,b + (1 − C2,b) R12,

where

C1,b = A1,b · r2/(r1 + r2) + A2,b · r1/(r1 + r2),   C2,b = A1,b · r1/(r1 + r2) + A2,b · r2/(r1 + r2),

A1,b = r1 (1 − r1)^{2^b − 1} / [ 1 − (1 − r1)^{2^b} ],   A2,b = r2 (1 − r2)^{2^b − 1} / [ 1 − (1 − r2)^{2^b} ].

If r1, r2 → 0, then P12,b = [1 + (2^b − 1) R12] / 2^b, and one can estimate R12 by (2^b P̂12,b − 1)/(2^b − 1), where P̂12,b is the empirical observation of P12,b. If r1, r2 are not small, R12 is estimated by (P̂12,b − C1,b)/(1 − C2,b).

References

[1] S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, 2005.
[2] M. Bendersky and W. B. Croft. Finding text reuse on the web. In WSDM, pages 262–271, Barcelona, Spain, 2009.
[3] A. Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997.
[4] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In WWW, pages 1157–1166, Santa Clara, CA, 1997.
[5] G. Buehrer and K. Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM, pages 95–106, Stanford, CA, 2008.
[6] O. Chapelle, P. Haffner, and V. N. Vapnik. Support vector machines for histogram-based image classification. 10(5):1055–1064, 1999.
[7] M. S. Charikar.
Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, Montreal, Quebec, Canada, 2002.
[8] S. Chaudhuri. An overview of query optimization in relational systems. In PODS, pages 34–43, 1998.
[9] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
[10] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In ICDE, pages 865–876, Tokyo, Japan, 2005.
[11] F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD, pages 219–228, Paris, France, 2009.
[12] K. Church. Approximate lexicography and web search. International Journal of Lexicography, 21(3):325–336, 2008.
[13] K. Church and P. Hanks. Word association norms, mutual information and lexicography. Computational Linguistics, 16(1):22–29, 1991.
[14] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Trans. on Knowl. and Data Eng., 13(1), 2001.
[15] F. Diaz. Integration of news content into web results. In WSDM, 2009.
[16] Y. Dourisboure, F. Geraci, and M. Pellegrini. Extraction and classification of dense implicit communities in the web graph. ACM Trans. Web, 3(2):1–36, 2009.
[17] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003.
[18] G. Forman, K. Eshghi, and J. Suermondt. Efficient detection of large-scale redundancy in enterprise file systems. SIGOPS Oper. Syst. Rev., 43(1):84–91, 2009.
[19] M. Gamon, S. Basu, D. Belenko, D. Fisher, M. Hurst, and A. C. König. Blews: Using blogs to provide context for news articles. In AAAI Conference on Weblogs and Social Media, 2008.
[20] H. Garcia-Molina, J. D. Ullman, and J.
Widom. Database Systems: The Complete Book. Prentice Hall, New York, NY, 2002.
[21] A. Gionis, D. Gunopulos, and N. Koudas. Efficient and tunable similar set retrieval. In SIGMOD, pages 247–258, CA, 2001.
[22] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. In WWW, pages 381–390, Madrid, Spain, 2009.
[23] M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136–143, Barbados, 2005.
[24] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.
[25] Y. E. Ioannidis. The history of histograms (abridged). In VLDB, 2003.
[26] Y. Jiang, C. Ngo, and J. Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007.
[27] N. Jindal and B. Liu. Opinion spam and analysis. In WSDM, pages 219–230, Palo Alto, California, USA, 2008.
[28] K. Kalpakis and S. Tang. Collaborative data gathering in wireless sensor networks using measurement co-occurrence. Computer Communications, 31(10):1979–1992, 2008.
[29] A. C. König, M. Gamon, and Q. Wu. Click-through prediction for news queries. In SIGIR, 2009.
[30] H. Lee, R. T. Ng, and K. Shim. Power-law based estimation of set similarity join size. In PVLDB, 2009.
[31] P. Li and K. W. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics, 33(3):305–354, 2007 (preliminary results appeared in HLT/EMNLP 2005).
[32] P. Li, K. W. Church, and T. J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, BC, Canada, 2006.
[33] P. Li, K. W. Church, and T. J. Hastie. One sketch for all: Theory and applications of conditional random sampling.
In NIPS, Vancouver, BC, Canada, 2008.
[34] P. Li, T. J. Hastie, and K. W. Church. Improving random projections using marginal information. In COLT, pages 635–649, Pittsburgh, PA, 2006.
[35] P. Li and A. C. König. b-bit minwise hashing. In WWW, pages 671–680, Raleigh, NC, 2010.
[36] L. Cherkasova, K. Eshghi, C. B. Morrey III, J. Tucek, and A. Veitch. Applying syntactic similarity algorithms for enterprise information management. In KDD, pages 1087–1096, Paris, France, 2009.
[37] G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web-crawling. In WWW, Banff, Alberta, Canada, 2007.
[38] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.
[39] M. Najork, S. Gollapudi, and R. Panigrahy. Less is more: Sampling the neighborhood graph makes SALSA better and faster. In WSDM, pages 242–251, Barcelona, Spain, 2009.
[40] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, pages 743–754, 2004.
[41] T. Urvoy, E. Chauveau, P. Filoche, and T. Lavergne. Tracking web spam with html style similarities. ACM Trans. Web, 2(1):1–28, 2008.
[42] X. Wang and C. Zhai. Mining term association patterns from search logs for effective query reformulation. In CIKM, pages 479–488, Napa Valley, California, USA, 2008.
[43] D. Zhou, J. Huang, and B. Schölkopf. Beyond pairwise classification and clustering using hypergraphs. 2006.
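As a concrete illustration of the simplified 2-way estimator reviewed in Appendix A, $\hat{R}_{12} = (2^b \hat{P}_{12,b} - 1)/(2^b - 1)$ for sparse data, the following is a minimal Python sketch. The function names and the multiply-add hash family are illustrative assumptions (they approximate random permutations), not the implementation used in the paper's experiments.

```python
import random

def bbit_minwise_sketch(s, k, b, seed=0):
    # For each of k random hash functions, keep only the lowest b bits
    # of the minimum hash value over the set. The 64-bit multiply-add
    # hash family below is an illustrative stand-in for the random
    # permutations assumed by the theory.
    rng = random.Random(seed)
    coeffs = [(rng.getrandbits(64) | 1, rng.getrandbits(64)) for _ in range(k)]
    mask = (1 << b) - 1
    return [min(((a * x + c) & 0xFFFFFFFFFFFFFFFF) for x in s) & mask
            for a, c in coeffs]

def estimate_r12(sk1, sk2, b):
    # Simplified estimator for sparse data (r1, r2 -> 0):
    #   R_hat = (2^b * P_hat - 1) / (2^b - 1),
    # where P_hat is the fraction of sample pairs whose low b bits agree.
    p_hat = sum(e1 == e2 for e1, e2 in zip(sk1, sk2)) / len(sk1)
    return ((1 << b) * p_hat - 1) / ((1 << b) - 1)

# Two sets with resemblance |S1 ∩ S2| / |S1 ∪ S2| = 500/1500 = 1/3.
s1 = set(range(1000))
s2 = set(range(500, 1500))
est = estimate_r12(bbit_minwise_sketch(s1, 1000, 2),
                   bbit_minwise_sketch(s2, 1000, 2), b=2)
# est is close to 1/3, up to sampling noise
```

Because both sketches use the same hash functions, an element in the intersection hashes identically in both sets, so the lowest b bits collide with probability $(1 + (2^b - 1)R_{12})/2^b$ as in Theorem 4, and the estimator simply inverts that relation.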