{"title": "Simple and Efficient Weighted Minwise Hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 1498, "page_last": 1506, "abstract": "Weighted minwise hashing (WMH) is one of the fundamental subroutines, required by many celebrated approximation algorithms, commonly adopted in industrial practice for large-scale search and learning. The resource bottleneck with WMH is the computation of multiple (typically a few hundred to thousands) independent hashes of the data. We propose a simple rejection type sampling scheme based on a carefully designed red-green map, where we show that the number of rejected samples has exactly the same distribution as weighted minwise sampling. The running time of our method, for many practical datasets, is an order of magnitude smaller than existing methods. Experimental evaluations, on real datasets, show that for computing 500 WMH, our proposal can be 60000x faster than Ioffe's method without losing any accuracy. Our method is also around 100x faster than approximate heuristics capitalizing on the efficient \u201cdensified\u201d one permutation hashing schemes [26, 27]. Given the simplicity of our approach and its significant advantages, we hope that it will replace existing implementations in practice.", "full_text": "Simple and Efficient Weighted Minwise Hashing\n\nAnshumali Shrivastava\n\nDepartment of Computer Science\n\nRice University\n\nHouston, TX, 77005\n\nanshumali@rice.edu\n\nAbstract\n\nWeighted minwise hashing (WMH) is one of the fundamental subroutines, required by many celebrated approximation algorithms, commonly adopted in industrial practice for large-scale search and learning. The resource bottleneck with WMH is the computation of multiple (typically a few hundred to thousands) independent hashes of the data. 
We propose a simple rejection type sampling scheme based on a carefully designed red-green map, where we show that the number of rejected samples has exactly the same distribution as weighted minwise sampling. The running time of our method, for many practical datasets, is an order of magnitude smaller than existing methods. Experimental evaluations, on real datasets, show that for computing 500 WMH, our proposal can be 60000x faster than Ioffe\u2019s method without losing any accuracy. Our method is also around 100x faster than approximate heuristics capitalizing on the efficient \u201cdensified\u201d one permutation hashing schemes [26, 27]. Given the simplicity of our approach and its significant advantages, we hope that it will replace existing implementations in practice.\n\n1 Introduction\n\n(Weighted) Minwise Hashing (or Sampling) [2, 4, 17] is the most popular and successful randomized hashing technique, commonly deployed in commercial big-data systems for reducing the computational requirements of many large-scale applications [3, 1, 25].\nMinwise sampling is a known LSH for the Jaccard similarity [22]. Given two positive vectors x, y \u2208 R^D, x, y > 0, the (generalized) Jaccard similarity is defined as\n\nJ(x, y) = \\frac{\\sum_{i=1}^{D} \\min\\{x_i, y_i\\}}{\\sum_{i=1}^{D} \\max\\{x_i, y_i\\}}. \\quad (1)\n\nJ(x, y) is a frequently used measure for comparing web-documents [2], histograms (especially images [13]), gene sequences [23], etc. Recently, it was shown to be a very effective kernel for large-scale non-linear learning [15]. 
WMH leads to the best-known LSH for L1 distance [13], commonly used in computer vision, improving over [7].\nWeighted Minwise Hashing (WMH) (or Minwise Sampling) generates a randomized hash (or fingerprint) h(x) of the given data vector x \u2265 0, such that for any pair of vectors x and y, the probability of hash collision (or agreement of hash values) is given by\n\n\\Pr(h(x) = h(y)) = \\frac{\\sum_{i=1}^{D} \\min\\{x_i, y_i\\}}{\\sum_{i=1}^{D} \\max\\{x_i, y_i\\}} = J(x, y). \\quad (2)\n\nA notable special case is when x and y are binary (or sets), i.e. x, y \u2208 {0, 1}^D. For this case, the similarity measure boils down to\n\nJ(x, y) = \\frac{\\sum_{i} \\min\\{x_i, y_i\\}}{\\sum_{i} \\max\\{x_i, y_i\\}} = \\frac{|x \\cap y|}{|x \\cup y|}.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nBeing able to generate a randomized signature, h(x), satisfying Equation 2 is the key breakthrough behind some of the best-known approximation algorithms for metric labelling [14], metric embedding [5], mechanism design, and differential privacy [8].\nA typical requirement for algorithms relying on minwise hashing is to generate some large enough number, k, of independent minwise hashes (or fingerprints) of the data vector x, i.e. compute h_i(x), i \u2208 {1, 2, ..., k}, repeatedly with independent randomization. These independent hashes can then be used for a variety of data mining tasks such as cheap similarity estimation, indexing for sublinear search, kernel features for large-scale learning, etc. The bottleneck step in all these applications is the costly computation of the multiple hashes, which requires multiple passes over the data. The number of required hashes typically ranges from a few hundred to several thousand [26]. For example, the number of hashes required by the famous LSH algorithm is O(n^\u03c1), which grows with the size of the data. 
[15] showed the necessity of around 4000 hashes per data vector in large-scale learning with J(x, y) as the kernel, making hash generation the most costly step.\nOwing to the significance of WMH and its impact in practice, there is a series of work over the last decade trying to reduce its costly computation [11]. The first groundbreaking work on minwise hashing [2] computed hashes h(x) only for unweighted sets x (or binary vectors), i.e. when the vector components x_i can only take values 0 and 1. Later it was realized that vectors with positive integer weights, which are equivalent to weighted sets, can be reduced to unweighted sets by replicating elements in proportion to their weights [10, 11]. This scheme was very expensive due to the blowup in the number of elements caused by replications. Also, it cannot handle real weights. In [11], the authors showed a few approximate solutions to reduce these replications.\nLater, [17] introduced the concept of consistent weighted sampling (CWS), which focuses on sampling directly from some well-tailored distribution to avoid any replication. This method, unlike previous ones, could handle real weights exactly. Going a step further, Ioffe [13] was able to compute the exact distribution of minwise sampling, leading to a scheme with worst case O(d) time, where d is the number of non-zeros. This is the fastest known exact weighted minwise sampling scheme, which will also be our main baseline.\nO(dk) for computing k independent hashes is very expensive for modern massive datasets, especially when k ranges up to thousands. Recently, there was a big success for the binary case, where using the novel idea of \u201cDensification\u201d [26, 27, 25] the computation time for unweighted minwise hashing was brought down to O(d + k). This resulted in over 100-1000 fold improvement. However, this speedup was limited only to binary vectors. 
Moreover, the samples were not completely independent.\nCapitalizing on recent advances for fast unweighted minwise hashing, [11] exploited the old idea of replication to convert weighted sets into unweighted sets. To deal with non-integer weights, the method samples the coordinates with probabilities proportional to leftover weights. The overall process converts the weighted minwise sampling to an unweighted problem, however, at a cost of incurring some bias (see Algorithm 2). This scheme is faster than Ioffe\u2019s scheme but, unlike other prior works on CWS, it is not exact and leads to biased and correlated samples. Moreover, it requires strong and expensive independence [12].\nAll these lines of work lead to a natural question: does there exist an unbiased and independent WMH scheme with the same property as Ioffe\u2019s hashes but significantly faster than all existing methodologies? We answer this question positively.\n1.1 Our Contributions:\n1. We provide an unbiased weighted minwise hashing scheme, where each hash computation takes time inversely proportional to effective sparsity (defined later), which can be an order of magnitude (even more) smaller than O(d). This improves upon the best-known scheme in the literature by Ioffe [13] for a wide range of datasets. Experimental evaluations on real datasets show more than 60000x speedup over the best known exact scheme and around 100x speedup over biased approximate schemes based on the recent idea of fast minwise hashing.\n2. In practice, our hashing scheme requires much fewer bits, usually 5-9 bits instead of the 64 bits (or higher) required by existing schemes, leading to around 8x savings in space, as shown on real datasets.\n3. We derive our scheme from elementary first principles. Our scheme is simple and it only requires access to a uniform random number generator, instead of costly sampling and transformations needed by other methods. 
The hashing procedure is different from traditional schemes and could be of independent interest in itself. Our scheme naturally provides the quantification of when and how much savings we can obtain compared to existing methodologies.\n4. Weighted Minwise sampling is a fundamental subroutine in many celebrated approximation algorithms. Some of the immediate consequences of our proposal are as follows:\n\n\u2022 We obtain an algorithmic improvement, over the query time of LSH based algorithms, for L1 distance and Jaccard Similarity search.\n\u2022 We reduce the kernel feature [21] computation time with min-max kernels [15].\n\u2022 We reduce the sketching time for fast estimation of a variety of measures, including L1 and earth mover distance [14, 5].\n\n2 Review: Ioffe\u2019s Algorithm and Fast Unweighted Minwise Hashing\n\nAlgorithm 1 Ioffe\u2019s CWS [13]\ninput Vector x, random seed[][]\n  for i = 1 to k do\n    for Iterate over x_j s.t. x_j > 0 do\n      randomseed = seed[i][j];\n      Sample r_{i,j}, c_{i,j} \u223c Gamma(2, 1).\n      Sample \u03b2_{i,j} \u223c Uniform(0, 1)\n      t_j = \u230a(log x_j) / r_{i,j} + \u03b2_{i,j}\u230b\n      y_j = exp(r_{i,j} (t_j \u2212 \u03b2_{i,j}))\n      z_j = y_j \u2217 exp(r_{i,j})\n      a_j = c_{i,j} / z_j\n    end for\n    k\u2217 = arg min_j a_j\n    HashPairs[i] = (k\u2217, t_{k\u2217})\n  end for\n  RETURN HashPairs[]\n\nWe briefly review the state-of-the-art methodologies for Weighted Minwise Hashing (WMH). Since WMH is only defined for weighted sets, our vectors under consideration will always be non-negative, i.e. every x_i \u2265 0. D will denote the dimensionality of the data, and we will use d to denote the number (or the average) of non-zeros of the vector(s) under consideration.\nThe fastest known scheme for exact weighted minwise hashing is based on an elegant derivation of the exact sampling process for \u201cConsistent Weighted Sampling\u201d (CWS) due to Ioffe [13], which is summarized in Algorithm 1. 
This scheme requires O(d) computations.\nO(d) for a single hash computation is quite expensive. Even the unweighted case of minwise hashing had complexity O(d) per hash, until recently. [26, 27] showed a new one permutation based scheme for generating k near-independent unweighted minwise hashes in O(d + k), breaking the old O(dk) barrier. However, this improvement does not directly extend to the weighted case. Nevertheless, it leads to a very powerful heuristic in practice.\nIt was known that with some bias, weighted minwise sampling can be reduced to an unweighted minwise sampling using the idea of sampling weights in proportion to their probabilities [10, 14]. Algorithm 2 describes such a procedure. A reasonable idea is then to use the fast unweighted hashing scheme on top of this biased approximation [11, 24]. The inside for-loop in Algorithm 2 blows up the number of non-zeros in the returned unweighted set. This makes the process slower and dependent on the magnitude of weights. Moreover, unweighted sampling requires very costly random permutations for good accuracy [20].\nBoth Ioffe\u2019s scheme and the biased unweighted approximation scheme generate big hash values requiring 32 bits or more of storage per hash value. For reducing this to a manageable size of, say, 4-8 bits, a commonly adopted practical methodology is to randomly rehash it to a smaller space at the cost of loss in accuracy [16]. 
It turns out that our hashing scheme generates 5-9 bit values, h(x), satisfying Equation 2, without losing any accuracy, for many real datasets.\n\nAlgorithm 2 Reduce to Unweighted [11]\ninput Vector x,\n  S = \u03c6\n  for Iterate over x_j s.t. x_j > 0 do\n    floorx_j = \u230ax_j\u230b\n    for i = 1 to floorx_j do\n      S = S \u222a (i, j)\n    end for\n    r = Uniform(0, 1)\n    if r \u2264 x_j \u2212 floorx_j then\n      S = S \u222a (floorx_j + 1, j)\n    end if\n  end for\n  RETURN S (unweighted set)\n\n3 Our Proposal: New Hashing Scheme\nWe first describe our procedure in detail. We will later talk about the correctness of the scheme. We will then discuss its runtime complexity and other practical issues.\n3.1 Procedure\nWe will denote the ith component of vector x \u2208 R^D by x_i. Let m_i be the upper bound on the value of component x_i in the given dataset. We can always assume m_i to be an integer, otherwise we take the ceiling \u2308m_i\u2309 as our upper bound. Define, with M_0 = 0,\n\nM_i = \\sum_{k=1}^{i} m_k \\quad \\text{and} \\quad M = M_D = \\sum_{k=1}^{D} m_k. \\quad (3)\n\nIf the data is normalized, then m_i = 1 and M = D.\n\nFigure 1: Illustration of Red-Green Map of 4 dimensional vectors x and y.\n\nGiven a vector x, we first create a red-green map associated with it, as shown in Figure 1. For this, we first take an interval [0, M] and divide it into D disjoint intervals, with the ith interval being [M_{i\u22121}, M_i], which is of size m_i. Note that \\sum_{i=1}^{D} m_i = M, so we can always do that. We then create two regions, red and green. For the ith interval [M_{i\u22121}, M_i], we mark the subinterval [M_{i\u22121}, M_{i\u22121} + x_i] as green and the rest [M_{i\u22121} + x_i, M_i] as red, as shown in Figure 1. 
If x_i = 0 for some i, then the whole ith interval [M_{i\u22121}, M_i] is marked as red.\nFormally, for a given vector x, define the green x_green and the red x_red regions as follows\n\nx_{green} = \\cup_{i=1}^{D} [M_{i-1}, M_{i-1} + x_i]; \\quad x_{red} = \\cup_{i=1}^{D} [M_{i-1} + x_i, M_i]. \\quad (4)\n\nOur sampling procedure simply draws an independent random real number between [0, M]; if the random number lies in the red region we repeat and re-sample. We stop the process as soon as the generated random number lies in the green region. Our hash value for a given data vector, h(x), is simply the number of steps taken before we stop. We summarize the procedure in Algorithm 3. More formally,\nDefinition 1 Define {r_i : i = 1, 2, 3, ...} as a sequence of i.i.d. uniformly generated random numbers between [0, M]. Then we define the hash of x, h(x), as\n\nh(x) = \\arg\\min_{i} r_i, \\quad \\text{s.t. } r_i \\in x_{green}. \\quad (5)\n\nAlgorithm 3 Weighted MinHash\ninput Vector x, M_i\u2019s, k, random seed[].\n  Initialise Hashes[] to all 0s.\n  for i = 1 to k do\n    randomseed = seed[i];\n    while true do\n      r = M \u00d7 Uniform(0, 1);\n      if ISGREEN(r) (check if r \u2208 x_green) then\n        break;\n      end if\n      randomseed = \u2308r \u2217 1000000\u2309;\n      Hashes[i]++;\n    end while\n  end for\n  RETURN Hashes\n\nOur procedure can be viewed as a form of rejection sampling [30]. To the best of our knowledge, there has been no prior evidence in the literature that the number of samples rejected has the locality sensitive property.\nWe want our hashing scheme to be consistent [13] across different data points to guarantee Equation 2. This requires ensuring the consistency of the random numbers in hashes [13]. We can achieve the required consistency by pre-generating the sequence of random numbers and storing them, analogous to other hashing schemes. However, there is an easy way to generate a fixed sequence of random numbers on the fly by ensuring the consistency of the random seed. 
This does not require any storage, except the starting seed. Our Algorithm 3 uses this criterion to ensure the consistency of random numbers. We start with a fixed random seed for generating random numbers. If the generated random number lies in the red region, then before re-sampling, we reset the seed of our random number generator as a function of the discarded random number. In the algorithm, we used \u2308100000 \u2217 r\u2309, where \u2308\u2309 is the ceiling operation, as a convenient way to ensure the consistency of the sequence, without any memory overhead. This seems to work nicely in practice. Since we are sampling real numbers, the probability of any repetition (or cycle) is zero. For generating k independent hashes we just use different random seeds which are kept fixed for the entire dataset.\n\n[Figure 1 labels: interval boundaries M_0 = 0, M_1, M_2, M_3, M_4 = M on [0, M], drawn once for x (x_1, ..., x_4) and once for y (y_1, ..., y_4, with y_3 = 0).]\n\n3.2 Correctness\nWe show that the simple, but very unusual, scheme given in Algorithm 3 actually does possess the required property, i.e. for any pair of points x and y Equation 2 holds. Unlike the previous works on this line [17, 13], which require computing the exact distribution of associated quantities, the proof of our proposed scheme is elementary and can be derived from first principles. 
This is not surprising given the simplicity of our procedure.\n\nTheorem 1 For any two vectors x and y, we have\n\n\\Pr(h(x) = h(y)) = J(x, y) = \\frac{\\sum_{i=1}^{D} \\min\\{x_i, y_i\\}}{\\sum_{i=1}^{D} \\max\\{x_i, y_i\\}}. \\quad (6)\n\nTheorem 1 implies that the sampling process is exact and we automatically have an unbiased estimator of J(x, y), using k independently generated WMH, h_i(x)s from Algorithm 3.\n\n\\hat{J} = \\frac{1}{k} \\sum_{i=1}^{k} 1\\{h_i(x) = h_i(y)\\}; \\quad E(\\hat{J}) = J(x, y); \\quad Var(\\hat{J}) = \\frac{J(x, y)(1 - J(x, y))}{k}, \\quad (7)\n\nwhere 1 is the indicator function.\n3.3 Running Time Analysis and Fast Implementation\nDefine\n\ns_x = \\frac{\\text{Size of green region}}{\\text{Size of red region} + \\text{Size of green region}} = \\frac{\\sum_{i=1}^{D} x_i}{M} = \\frac{\\|x\\|_1}{M} \\quad (8)\n\nas the effective sparsity of the vector x. Note that this is also the probability Pr(r \u2208 x_green). Algorithm 3 has a while loop.\nWe show that the expected number of times the while loop runs, which is also the expected value of h(x), is the inverse of the effective sparsity. Formally,\n\nTheorem 2\n\nE(h(x)) = \\frac{1}{s_x}; \\quad Var(h(x)) = \\frac{1 - s_x}{s_x^2}; \\quad \\Pr\\left(h(x) \\geq \\frac{\\log \\delta}{\\log(1 - s_x)}\\right) \\leq \\delta. \\quad (9)\n\n3.4 When is this advantageous over Ioffe\u2019s scheme?\nThe time to compute each hash value, in expectation, is the inverse of effective sparsity, 1/s. This is a very different quantity compared to existing solutions, which need O(d). For datasets with 1/s \u226a d, we can expect our method to be much faster. For real datasets, such as image histograms, where minwise sampling is popular [13], the value of this sparsity is of the order of 0.02-0.08 (see Section 4.2), leading to 1/s_x \u2248 13-50. On the other hand, the number of non-zeros is around half a million. 
Therefore, we can expect significant speed-ups.\n\nCorollary 1 The expected number of bits required to represent h(x) is small, in particular,\n\nE(bits) \\leq -\\log s_x; \\quad E(bits) \\approx \\log \\frac{1}{s_x} - \\frac{1 - s_x}{2}. \\quad (10)\n\nExisting hashing schemes require 64 bits, which is quite expensive. A popular approach for reducing space uses the least significant bits of hashes [16, 13]. This tradeoff in space comes at the cost of accuracy [16]. Our hashing scheme naturally requires only a few bits, typically 5-9 (see Section 4.2), eliminating the need for trading accuracy for manageable space.\nWe know from Theorem 2 that each hash function computation requires, in expectation, 1/s calls to ISGREEN(r). If we can implement ISGREEN(r) in constant time, i.e. O(1), then we can generate k independent hashes in total O(d + k/s) time instead of the O(dk) required by [13]. Note that O(d) is the time to read the input vector, which cannot be avoided. Once the data is loaded into memory, our procedure is actually O(k/s) for computing k hashes, for all k \u2265 1. This can be a huge improvement, as in many real scenarios 1/s \u226a d.\nBefore we jump into a constant time implementation of ISGREEN(r), we would like readers to note that there is a straightforward binary search algorithm for ISGREEN(r) in log d time. We consider the d intervals [M_{i\u22121}, M_{i\u22121} + x_i] for all i such that x_i \u2260 0. Because of the nature of the problem, M_{i\u22121} + x_i \u2264 M_i \u2200i. Therefore, these intervals are disjoint and sorted. Therefore, given a random number r, determining if r \u2208 \u222a_{i=1}^{D} [M_{i\u22121}, M_{i\u22121} + x_i] only needs binary search over d ranges. 
Thus, in expectation, we already have a scheme that generates k independent hashes in total O(d + (k/s) log d) time, improving over the best known O(dk) required by [13] for exact unbiased sampling, whenever d \u226b 1/s.\nWe show that with some algorithmic tricks and a few more data structures, we can implement ISGREEN(r) in constant time O(1). We need two global pre-computed hashmaps, IntToComp (Integer to Vector Component) and CompToM (Vector Component to M value). Here, as in Algorithm 4, components are indexed from 0, so the interval of component i is [M_i, M_{i+1}) with M_0 = 0. IntToComp is a hashmap that maps every integer in [0, M) to the associated component, i.e., all integers in [M_i, M_{i+1}) are mapped to i, because they are associated with the ith component. CompToM maps every component of vectors i \u2208 {0, 1, ..., D \u2212 1} to its associated value M_i. The procedure for computing these hashmaps is straightforward and is summarized in Algorithm 4. It should be noted that computing these hashmaps is a one-time pre-processing operation over the entire dataset with negligible cost. The m_i\u2019s can be computed (estimated) while reading the data.\n\nAlgorithm 4 ComputeHashMaps (Once per dataset)\ninput m_i\u2019s,\n  index = 0, CompToM[0] = 0\n  for i = 0 to D \u2212 1 do\n    if i < D \u2212 1 then\n      CompToM[i + 1] = m_i + CompToM[i]\n    end if\n    for j = 0 to m_i \u2212 1 do\n      IntToComp[index] = i\n      index++\n    end for\n  end for\n  RETURN CompToM[] and IntToComp[]\n\nUsing these two pre-computed hashmaps, the ISGREEN(r) methodology works as follows: We first compute the floor of r, i.e. \u230ar\u230b, then we find the component i associated with r, i.e., r \u2208 [M_i, M_{i+1}), and the corresponding M_i using hashmaps IntToComp and CompToM. Finally, we return true if r \u2264 x_i + M_i, otherwise we return false. The main observation is that since we ensure that all M_i\u2019s are integers, for any real number r, if r \u2208 [M_i, M_{i+1}) then the same holds for \u230ar\u230b, i.e., \u230ar\u230b \u2208 [M_i, M_{i+1}). Hence we can work with hashmaps using \u230ar\u230b as the key. The overall procedure is summarized in Algorithm 5.\n\nAlgorithm 5 ISGREEN(r)\ninput r, x, Hashmaps IntToComp[] and CompToM[] from Algorithm 4.\n  index = \u230ar\u230b\n  i = IntToComp[index]\n  M_i = CompToM[i]\n  if r \u2264 M_i + x_i then\n    RETURN TRUE\n  end if\n  RETURN FALSE\n\nNote that our overall procedure is much simpler compared to Algorithm 1. 
We only need to generate random numbers followed by a simple condition check using two hash lookups. Our analysis shows that we have to repeat this only a small number of times. Compare this with the scheme of Ioffe, where for every non-zero component of a vector we need to sample two Gamma variables followed by computing several expensive transformations including exponentials. We next demonstrate the benefits of our approach in practice.\n\n4 Experiments\n\nIn this section, we demonstrate that in real high-dimensional settings, our proposal provides significant speedup and requires less memory than existing methods. We also need to validate our theory that our scheme is unbiased and should be indistinguishable in accuracy from Ioffe\u2019s method.\nBaselines: Ioffe\u2019s method is the fastest known exact method in the literature, so it serves as our natural baseline. We also compare our method with biased unweighted approximations (see Algorithm 2) which capitalize on the recent success in fast unweighted minwise hashing [26, 27]; we call it Fast-WDOPH (for Fast Weighted Densified One Permutation Hashing). Fast-WDOPH needs very long permutations, which are expensive. For efficiency, we implemented the permutation using fast 2-universal hashing, which is always recommended [18].\n\nFigure 2: Average Errors in Jaccard Similarity Estimation with the Number of Hash Values. Estimates are averaged over 200 repetitions. [Six panels: Sim = 0.8, 0.72, 0.61, 0.56, 0.44, 0.27; x-axis: Number of Hashes; y-axis: Average Error; curves: Proposed, Fast-WDOPH, Ioffe.]\n\nDatasets: Weighted Minwise sampling is commonly used for sketching image histograms [13]. We chose two popular publicly available vision datasets, Caltech101 [9] and Oxford [19]. 
We used the standard publicly available Histogram of Oriented Gradient (HOG) codes [6], popular in vision tasks, to convert images into feature vectors. In addition, we also used random web images [29] and computed simple histograms of RGB values. We call this dataset Hist. The statistics of these datasets are summarized in Table 1. These datasets cover a wide range of variations in terms of dimensionality, non-zeros and sparsity.\n\nTable 1: Basic Statistics of the Datasets\n\nData | non-zeros (d) | Dim (D) | Sparsity (s)\nHist | 737 | 768 | 0.081\nCaltech101 | 95029 | 485640 | 0.024\nOxford | 401879 | 580644 | 0.086\n\nTable 2: Time taken in milliseconds (ms) to compute 500 hashes by different schemes. Our proposed scheme is significantly faster.\n\nMethod | Prop | Ioffe | Fast-WDOPH\nHist | 10ms | 986ms | 57ms\nCaltech101 | 57ms | 87105ms | 268ms\nOxford | 11ms | 746120ms | 959ms\n\n4.1 Comparing Estimation Accuracy\nIn this section, we perform a sanity check experiment and compare the estimation accuracy with WMH. For this task we take 9 pairs of vectors from our datasets with varying levels of similarity. For each pair (x, y), we generate k weighted minwise hashes h_i(x) and h_i(y) for i \u2208 {1, 2, ..., k}, using the three competing schemes. We then compute the estimate of the Jaccard similarity J(x, y) using the formula \\frac{1}{k} \\sum_{i=1}^{k} 1\\{h_i(x) = h_i(y)\\} (see Equation 7). We compute the errors in the estimate as a function of k. To minimize the effect of randomization, we average the errors from 200 random repetitions with different seeds. We plot this average error with k = {1, 2, ..., 50} in Figure 2 for different similarity levels.\nWe can clearly see from the plots that the accuracy of the proposed scheme is indistinguishable from Ioffe\u2019s scheme. This is not surprising because both schemes are unbiased and have the same theoretical distribution. 
This validates Theorem 1.\nThe accuracy of Fast-WDOPH is inferior to that of the other two unbiased schemes and sometimes its performance is poor. This is because the weighted to unweighted reduction is biased and approximate. The bias of this reduction depends on the vector pairs under consideration, which can be unpredictable.\n\nTable 3: The range of the observed hash values, using the proposed scheme, along with the maximum bits needed per hash value. The mean hash values agree with Theorem 2.\n\n | Hist | Caltech101 | Oxford\nMean Values | 11.94 | 52.88 | 9.13\nHash Range | [1,107] | [1,487] | [1,69]\nBits Needed | 7 | 9 | 7\n\n4.2 Speed Comparisons\nWe compute the average time (in milliseconds) taken by the competing algorithms to compute 500 hashes of a given data vector for all three datasets. Our experiments were coded in C# on an Intel Xeon CPU with 256 GB RAM. Table 2 summarises the comparison. We do not include the data loading cost in these numbers and assume that the data is in memory for all three methodologies.\nWe can clearly see tremendous speedup over Ioffe\u2019s scheme. For the Hist dataset with mere 768 non-zeros, our scheme is 100 times faster than Ioffe\u2019s scheme and around 5 times faster than the Fast-WDOPH approximation. 
On the Caltech101 and Oxford datasets, which are high dimensional and dense, our scheme can be 1500x to 60000x faster than Ioffe\u2019s scheme, and around 5x to 100x faster than the Fast-WDOPH scheme. Dense datasets like Caltech101 and Oxford represent more realistic scenarios. These features are taken from real applications [6] and such levels of sparsity and dimensionality are more common in practice.\nThe results are not surprising because Ioffe\u2019s scheme is very slow, O(dk). Moreover, the constants inside the big-O are also large, because of complex transformations. Therefore, for datasets with high values of d (non-zeros) this scheme is very slow. A similar phenomenon was observed in [13]: decreasing the non-zeros by ignoring non-frequent dimensions can make it around 150 times faster. However, ignoring dimensions loses accuracy.\n4.3 Memory Comparisons\nTable 3 summarizes the range of the hash values and the maximum number of bits needed to encode these hash values without any bias. We can clearly see that the hash values, even for such high-dimensional datasets, only require 7-9 bits. This is a huge saving compared to existing hashing schemes, which require 32-64 bits [16]. Thus, our method leads to around 5-6 times savings in space. The mean values observed (Table 3) validate the formula in Theorem 2.\n5 Discussions\nTheorem 2 shows that the quantity s_x = \\frac{\\sum_{i=1}^{D} x_i}{\\sum_{i=1}^{D} m_i} determines the runtime. If s_x is very small then, although the running time is constant (independent of d or D), it can still make the algorithm unnecessarily slow. Note that for the algorithm to work we choose m_i to be the smallest integer no less than the maximum possible value of coordinate i in the given dataset. If this integer gap is big then we unnecessarily increase the running time. Ideally, the best running time is obtained when the maximum value is itself an integer, or is very close to its ceiling value. 
If all the values are integers, scaling up does not matter, as it does not change s_x, but scaling down can make s_x worse. Ideally we should scale by the factor\n\n\\alpha^* = \\arg\\max_{\\alpha} \\frac{\\sum_{i=1}^{D} \\alpha x_i}{\\sum_{i=1}^{D} \\lceil \\alpha m_i \\rceil},\n\nwhere m_i is the maximum value of coordinate i.\n\n5.1 Very Sparse Datasets\nFor very sparse datasets, the information is more or less in the sparsity pattern rather than in the magnitude [28]. Binarization of very sparse datasets is a common practice and densified one permutation hashing [26, 27] provably solves the problem in O(d + k). Nevertheless, for applications when the data is extremely sparse, and the magnitude of component seems crucial, binary approximations followed by densified one permutation hashing (Fast-DOPH) should be the preferred method. Ioffe\u2019s scheme is preferable, due to its exactness, when the number of non-zeros is of the order of k.\n6 Acknowledgements\nThis work is supported by Rice Faculty Initiative Award 2016-17. We would like to thank anonymous reviewers, Don Macmillen, and Ryan Moulton for feedback on the presentation of the paper.\n\nReferences\n[1] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131\u2013140, 2007.\n\n[2] A. Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21\u201329, Positano, Italy, 1997.\n\n[3] A. Z. Broder. Filtering near-duplicate documents. In FUN, Isola d\u2019Elba, Italy, 1998.\n\n[4] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In WWW, pages 1157\u20131166, Santa Clara, CA, 1997.\n\n[5] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380\u2013388, Montreal, Quebec, Canada, 2002.\n\n[6] N. Dalal and B. Triggs. 
Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, volume 1, pages 886\u2013893. IEEE, 2005.\n\n[7] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253\u2013262, Brooklyn, NY, 2004.\n\n[8] C. Dwork and A. Roth. The algorithmic foundations of differential privacy.\n\n[9] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59\u201370, 2007.\n\n[10] S. Gollapudi and R. Panigrahy. Exploiting asymmetry in hierarchical topic extraction. In Proceedings of the 15th ACM international conference on Information and knowledge management, pages 475\u2013482. ACM, 2006.\n\n[11] B. Haeupler, M. Manasse, and K. Talwar. Consistent weighted sampling made fast, small, and easy. Technical report, arXiv:1410.4266, 2014.\n\n[12] P. Indyk. A small approximately min-wise independent family of hash functions. Journal of Algorithms, 38(1):84\u201390, 2001.\n\n[13] S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246\u2013255, Sydney, AU, 2010.\n\n[14] J. Kleinberg and E. Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In FOCS, pages 14\u201323, New York, 1999.\n\n[15] P. Li. 0-bit consistent weighted sampling. In KDD, 2015.\n\n[16] P. Li and A. C. K\u00f6nig. Theory and applications of b-bit minwise hashing. Commun. ACM, 2011.\n\n[17] M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.\n\n[18] M. Mitzenmacher and S. Vadhan. Why simple hash functions work: exploiting the entropy in a data stream. 
In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 746\u2013755. Society for Industrial and Applied Mathematics, 2008.\n\n[19] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.\n\n[20] M. P\u0103tra\u015fcu and M. Thorup. On the k-independence required by linear probing and minwise independence. In ICALP, pages 715\u2013726, 2010.\n\n[21] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177\u20131184, 2007.\n\n[22] A. Rajaraman and J. Ullman. Mining of Massive Datasets.\n\n[23] Z. Rasheed and H. Rangwala. Mc-minh: Metagenome clustering using minwise based hashing. SIAM.\n\n[24] P. Sadosky, A. Shrivastava, M. Price, and R. C. Steorts. Blocking methods applied to casualty records from the syrian conflict. arXiv preprint arXiv:1510.07714, 2015.\n\n[25] A. Shrivastava. Probabilistic Hashing Techniques For Big Data. PhD thesis, Cornell University, 2015.\n\n[26] A. Shrivastava and P. Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014.\n\n[27] A. Shrivastava and P. Li. Improved densification of one permutation hashing. In UAI, Quebec, CA, 2014.\n\n[28] A. Shrivastava and P. Li. In defense of minhash over simhash. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 886\u2013894, 2014.\n\n[29] J. Wang, J. Li, D. Chan, and G. Wiederhold. Semantics-sensitive retrieval for digital picture libraries. D-Lib Magazine, 5(11), 1999.\n\n[30] Wikipedia. 
https://en.wikipedia.org/wiki/Rejection_sampling.", "award": [], "sourceid": 828, "authors": [{"given_name": "Anshumali", "family_name": "Shrivastava", "institution": "Rice University"}]}