{"title": "Practical Hash Functions for Similarity Estimation and Dimensionality Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 6615, "page_last": 6625, "abstract": "Hashing is a basic tool for dimensionality reduction employed in several aspects of machine learning. However, the performance analysis is often carried out under the abstract assumption that a truly random unit cost hash function is used, without concern for which concrete hash function is employed. The concrete hash function may work fine on sufficiently random input. The question is whether it can be trusted in the real world when faced with more structured input. In this paper we focus on two prominent applications of hashing, namely similarity estimation with the one permutation hashing (OPH) scheme of Li et al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of which have found numerous applications, e.g. in approximate near-neighbour search with LSH and large-scale classification with SVM. We consider the recent mixed tabulation hash function of Dahlgaard et al. [FOCS'15], which was proved theoretically to perform like a truly random hash function in many applications, including the above OPH. Here we first show improved concentration bounds for FH with truly random hashing and then argue that mixed tabulation performs similarly when the input vectors are sparse. Our main contribution, however, is an experimental comparison of different hashing schemes when used inside FH, OPH, and LSH. We find that mixed tabulation hashing is almost as fast as the classic multiply-mod-prime scheme (ax+b) mod p. Multiply-mod-prime is guaranteed to work well on sufficiently random data, but we demonstrate that in the above applications, it can lead to bias and poor concentration on both real-world and synthetic data. We also compare with the very popular MurmurHash3, which has no proven guarantees. 
Mixed tabulation and MurmurHash3 both perform similarly to truly random hashing in our experiments. However, mixed tabulation was 40% faster than MurmurHash3, and it has the proven guarantee of good performance on all possible input, making it more reliable.", "full_text": "Practical Hash Functions for Similarity Estimation and Dimensionality Reduction\n\nS\u00f8ren Dahlgaard\n\nUniversity of Copenhagen / SupWiz\n\ns.dahlgaard@supwiz.com\n\nMathias B\u00e6k Tejs Knudsen\n\nUniversity of Copenhagen / SupWiz\n\nm.knudsen@supwiz.com\n\nMikkel Thorup\n\nUniversity of Copenhagen\n\nmthorup@di.ku.dk\n\nAbstract\n\nHashing is a basic tool for dimensionality reduction employed in several aspects of\nmachine learning. However, the performance analysis is often carried out under the\nabstract assumption that a truly random unit cost hash function is used, without\nconcern for which concrete hash function is employed. The concrete hash function\nmay work fine on sufficiently random input. The question is whether it can be trusted\nin the real world where it may be faced with more structured input.\nIn this paper we focus on two prominent applications of hashing, namely similarity\nestimation with the one permutation hashing (OPH) scheme of Li et al. [NIPS\u201912]\nand feature hashing (FH) of Weinberger et al. [ICML\u201909], both of which have\nfound numerous applications, e.g. in approximate near-neighbour search with LSH\nand large-scale classification with SVM.\nWe consider the recent mixed tabulation hash function of Dahlgaard et al.\n[FOCS\u201915], which was proved theoretically to perform like a truly random hash\nfunction in many applications, including the above OPH. Here we first show\nimproved concentration bounds for FH with truly random hashing and then argue that\nmixed tabulation performs similarly when the input vectors are not too dense. 
Our\nmain contribution, however, is an experimental comparison of different hashing\nschemes when used inside FH, OPH, and LSH.\nWe find that mixed tabulation hashing is almost as fast as the classic multiply-mod-\nprime scheme (ax + b) mod p. Multiply-mod-prime is guaranteed to work well on\nsufficiently random data, but here we demonstrate that in the above applications, it\ncan lead to bias and poor concentration on both real-world and synthetic data. We\nalso compare with the very popular MurmurHash3, which has no proven guarantees.\nMixed tabulation and MurmurHash3 both perform similarly to truly random hashing\nin our experiments. However, mixed tabulation was 40% faster than MurmurHash3,\nand it has the proven guarantee of good performance (like fully random) on all\npossible input, making it more reliable.\n\n1 Introduction\n\nHashing is a standard technique for dimensionality reduction and is employed as an underlying tool in\nseveral aspects of machine learning including search [22, 31, 32, 3], classification [24, 22], duplicate\ndetection [25], computer vision and information retrieval [30]. The need for dimensionality reduction\ntechniques such as hashing is becoming increasingly important due to the huge growth in data sizes. As\nan example, already in 2010, Tong [36] discussed data sets with 10^11 data points and 10^9 features.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFurthermore, when working with text, data points are often stored as w-shingles (i.e. w contiguous\nwords or bytes) with w \u2265 5. This further increases the dimension from, say, 10^5 common English\nwords to 10^{5w}.\nTwo particularly prominent applications are set similarity estimation as initiated by the MinHash\nalgorithm of Broder et al. [8, 9] and feature hashing (FH) of Weinberger et al. [37]. Both applications\nhave in common that they are used as an underlying ingredient in many other applications. 
While\nboth MinHash and FH can be seen as hash functions mapping an entire set or vector, they are perhaps\nbetter described as algorithms implemented using what we will call basic hash functions. A basic\nhash function h maps a given key to a hash value, and any such basic hash function, h, can be used to\nimplement MinHash, which maps a set of keys, A, to the smallest hash value min_{a\u2208A} h(a). A similar\ncase can be made for other locality-sensitive hash functions such as SimHash [12], One Permutation\nHashing (OPH) [22, 31, 32], and cross-polytope hashing [2, 33, 20], which are all implemented using\nbasic hash functions.\n\n1.1 Importance of understanding basic hash functions\n\nIn this paper we analyze the basic hash functions needed for the applications of similarity estimation\nand FH. This is important for two reasons: 1) As mentioned in [22], dimensionality reduction is\noften a time bottleneck and using a fast basic hash function to implement it may improve running\ntimes significantly, and 2) the theoretical guarantees of hashing schemes such as MinHash and FH\nrely crucially on the basic hash functions used to implement them, and this is further propagated into\napplications of these schemes such as approximate similarity search with the seminal LSH framework\nof Indyk and Motwani [19].\nTo fully appreciate this, consider LSH for approximate similarity search implemented with MinHash.\nWe know from [19] that this structure obtains provably sub-linear query time and provably sub-\nquadratic space, where the exponent depends on the probability of hash collisions for \u201csimilar\u201d and\n\u201cnot-similar\u201d sets. 
However, we also know that implementing MinHash with a poorly chosen hash\nfunction leads to constant bias in the estimation [28], and this constant then appears in the exponent\nof both the space and the query time of the search structure leading to worse theoretical guarantees.\nChoosing the right basic hash function is an often overlooked aspect, and many authors simply state\nthat any (universal) hash function \u201cis usually sufficient in practice\u201d (see e.g. [22, page 3]). While\nthis is indeed the case most of the time (and provably if the input has enough entropy [26]), many\napplications rely on taking advantage of highly structured data to perform well (such as classification\nor similarity search). In these cases a poorly chosen hash function may lead to very systematic\ninconsistencies. Perhaps the most famous example of this is hashing with linear probing which was\ndeemed very fast but unreliable in practice until it was fully understood which hash functions to\nemploy (see [35] for discussion and experiments). Other papers (see e.g. [31, 32]) suggest using\nvery powerful machinery such as the seminal pseudorandom generator of Nisan [27]. However,\nsuch a PRG does not represent a hash function and implementing it as such would incur a huge\ncomputational overhead.\nMeanwhile, some papers do indeed consider which concrete hash functions to use. In [15] it was\nconsidered to use 2-independent hashing for bottom-k sketches, which was proved in [34] to work for\nthis application. However, bottom-k sketches do not work for SVMs and LSH. Closer to our work,\n[23] considered the use of 2-independent (and 4-independent) hashing for large-scale classification\nand online learning with b-bit minwise hashing. Their experiments indicate that 2-independent\nhashing often works, and they state that \u201cthe simple and highly efficient 2-independent scheme may\nbe sufficient in practice\u201d. 
However, no amount of experiments can show that this is the case for all\ninput. In fact, we demonstrate in this paper \u2013 for the underlying FH and OPH \u2013 that this is not the case,\nand that we cannot trust 2-independent hashing to work in general. As noted, [23] used hashing for\nsimilarity estimation in classification, but without considering the quality of the underlying similarity\nestimation. Due to space restrictions, we do not consider classification in this paper, but instead focus\non the quality of the underlying similarity estimation and dimensionality reduction sketches as well\nas considering these sketches in LSH as the sole application (see also the discussion below).\n\n1.2 Our contribution\n\nWe analyze the very fast and powerful mixed tabulation scheme of [14] comparing it to some of the\nmost popular and widely employed hash functions. In [14] it was shown that implementing OPH\nwith mixed tabulation gives concentration bounds \u201cessentially as good as truly random\u201d. For feature\nhashing, we first present new concentration bounds for the truly random case improving on [37, 16].\nWe then argue that mixed tabulation gives essentially as good concentration bounds in the case where\nthe input vectors are not too dense, which is a very common case for applying feature hashing.\nExperimentally, we demonstrate that mixed tabulation is almost as fast as the classic multiply-mod-\nprime hashing scheme. This classic scheme is guaranteed to work well for the considered applications\nwhen the data is sufficiently random, but we demonstrate that bias and poor concentration can occur\non both synthetic and real-world data. We verify on the same experiments that mixed tabulation\nhas the desired strong concentration, confirming the theory. We also find that mixed tabulation is\nroughly 40% faster than the very popular MurmurHash3 and CityHash. 
In our experiments these hash\nfunctions perform similarly to mixed tabulation in terms of concentration. They do not, however, have\nthe same theoretical guarantees, making them harder to trust. We also consider different basic hash\nfunctions for implementing LSH with OPH. We demonstrate that the bias and poor concentration of\nthe simpler hash functions for OPH translate into poor concentration for e.g. the recall and number\nof retrieved data points of the corresponding LSH search structure. Again, we observe that this is not\nthe case for mixed tabulation, which systematically out-performs the faster hash functions. We note\nthat [23] suggests that 2-independent hashing only has problems with dense data sets, but both the\nreal-world and synthetic data considered in this work are sparse or, in the case of synthetic data, can\nbe generalized to arbitrarily sparse data. While we do not consider b-bit hashing as in [23], we note\nthat applying the b-bit trick to our experiments would only introduce a bias from false positives for\nall basic hash functions and leave the conclusion the same.\nIt is important to note that our results do not imply that standard hashing techniques (i.e. multiply-mod-\nprime) never work. Rather, they show that there do exist practical scenarios where the theoretical\nguarantees matter, making mixed tabulation more consistent. We believe that the very fast evaluation\ntime and consistency of mixed tabulation make it the best choice for the applications considered in\nthis paper.\n\n2 Preliminaries\n\nAs mentioned we focus on similarity estimation and feature hashing. Here we briefly describe the\nmethods used. We let [m] = {0, . . . , m \u2212 1}, for some integer m, denote the output range of the\nhash functions considered.\n\n2.1 Similarity estimation\n\nIn similarity estimation we are given two sets, A and B belonging to some universe U and are tasked\nwith estimating the Jaccard similarity J(A, B) = |A \u2229 B|/|A \u222a B|. 
As mentioned earlier, this can\nbe solved using k independent repetitions of the MinHash algorithm, however this requires O(k \u00b7 |A|)\nrunning time. In this paper we instead use the faster OPH of Li et al. [22] with the densification\nscheme of Shrivastava and Li [32]. This scheme works as follows: Let k be a parameter with k\nbeing a divisor of m, and pick a random hash function h : U \u2192 [m]. For each element x split\nh(x) into two parts b(x), v(x), where b : U \u2192 [k] is given by b(x) = h(x) mod k and v(x) is given by\n\u230ah(x)/k\u230b. To create the sketch S_OPH(A) of size k we simply let S_OPH(A)[i] = min_{a\u2208A, b(a)=i} v(a).\nTo estimate the similarity of two sets A and B we simply take the fraction of indices, i, where\nS_OPH(A)[i] = S_OPH(B)[i].\nThis is, however, not an unbiased estimator, as there may be empty bins. Thus, [31, 32] worked on\nhandling empty bins. They showed that the following addition gives an unbiased estimator with good\nvariance. For each index i \u2208 [k] let b_i be a random bit. Now, for a given sketch S_OPH(A), if the\nith bin is empty we copy the value of the closest non-empty bin going left (circularly) if b_i = 0 and\ngoing right if b_i = 1. We also add j \u00b7 C to this copied value, where j is the distance to the copied bin\nand C is some sufficiently large offset parameter. The entire construction is illustrated in Figure 1.\n\nFigure 1: Left: Example of one permutation sketch creation of a set A with |U| = 20 and k = 5. For\neach of the 20 possible hash values the corresponding bin and value are displayed. The hash values of\nA, h(A), are displayed as an indicator vector with the minimal value per bin marked in red. Note that\nthe 3rd bin is empty. Right: Example of the densification from [32].\n\n2.2 Feature hashing\n\nFeature hashing (FH) introduced by Weinberger et al. [37] takes a vector v of dimension d and\nproduces a vector v\u2032 of dimension d\u2032 \u226a d preserving (roughly) the norm of v. More precisely,\nlet h : [d] \u2192 [d\u2032] and sgn : [d] \u2192 {\u22121, +1} be random hash functions, then v\u2032 is defined as\nv\u2032_i = \u2211_{j : h(j)=i} sgn(j) v_j. Weinberger et al. [37] (see also [16]) showed exponential tail bounds on\n\u2016v\u2032\u2016_2^2 when \u2016v\u2016_\u221e is sufficiently small and d\u2032 is sufficiently large.\n\n2.3 Locality-sensitive hashing\n\nThe LSH framework of [19] is a solution to the approximate near neighbour search problem: Given a\ngiant collection of sets C = A_1, . . . , A_n, store a data structure such that, given a query set A_q, we\ncan, loosely speaking, efficiently find an A_i with large J(A_i, A_q). Clearly, given the potentially massive\nsize of C it is infeasible to perform a linear scan.\nWith LSH parameterized by positive integers K, L we create a size K sketch S_OPH(A_i) (or using\nanother method) for each A_i \u2208 C. We then store the set A_i in a large table indexed by this sketch\nT[S_OPH(A_i)]. For a given query A_q we then go over all sets stored in T[S_OPH(A_q)] returning only\nthose that are \u201csufficiently similar\u201d. By picking K large enough we ensure that very distinct sets\n(almost) never end up in the same bucket, and by repeating the data structure L independent times\n(creating L such tables) we ensure that similar sets are likely to be retrieved in at least one of the\ntables.\nRecently, much work has gone into providing theoretically optimal [5, 4, 13] LSH. However, as noted\nin [2], these solutions require very sophisticated locality-sensitive hash functions and are mainly\nimpractical. We therefore choose to focus on more practical variants relying either on OPH [31, 32]\nor FH [12, 2].\n\n2.4 Mixed tabulation\n\nMixed tabulation was introduced by [14]. 
For simplicity assume that we are hashing from the universe\n[2^w] and fix integers c, d such that c is a divisor of w. Tabulation-based hashing views each key x\nas a list of c characters x_0, . . . , x_{c\u22121}, where x_i consists of the ith w/c bits of x. We say that the\nalphabet \u03a3 = [2^{w/c}]. Mixed tabulation uses x to derive d additional characters from \u03a3. To do this\nwe choose c tables T_{1,i} : \u03a3 \u2192 \u03a3^d uniformly at random and let y = \u2295_{i\u2208[c]} T_{1,i}[x_i] (here \u2295 denotes\nthe XOR operation). The d derived characters are then y_0, . . . , y_{d\u22121}. To create the final hash value\nwe additionally choose c + d random tables T_{2,i} : \u03a3 \u2192 [m] and define\n\nh(x) = \u2295_{i\u2208[c]} T_{2,i}[x_i] \u2295 \u2295_{i\u2208[d]} T_{2,i+c}[y_i] .\n\nMixed Tabulation is extremely fast in practice due to the word-parallelism of the XOR operation and\nthe small table sizes which fit in fast cache. It was proved in [14] that implementing OPH with mixed\ntabulation gives Chernoff-style concentration bounds when estimating Jaccard similarity.\nAnother advantage of mixed tabulation arises when generating many hash values for the same key. In\nthis case, we can increase the output size of the tables T_{2,i}, and then with high probability over the choice of T_{1,i} the\nresulting output bits will be independent. As an example, assume that we want to map each key to\ntwo 32-bit hash values. We then use a mixed tabulation hash function as described above mapping\nkeys to one 64-bit hash value, and then split this hash value into two 32-bit values, which would be\nindependent of each other with high probability. Doing this with e.g. multiply-mod-prime hashing\nwould not work, as the output bits are not independent. 
Thereby we significantly speed up the hashing\ntime when generating many hash values for the same keys.\nA sample implementation with c = d = 4 and 32-bit keys and values can be found below.\n\n#include <stdint.h>\n\nuint64_t mt_T1[256][4]; // Filled with random bits\nuint32_t mt_T2[256][4]; // Filled with random bits\n\nuint32_t mixedtab(uint32_t x) {\n    uint64_t h = 0; // This will be the final hash value\n    // XOR one 64-bit table entry per 8-bit character of the key\n    for (int i = 0; i < 4; ++i, x >>= 8)\n        h ^= mt_T1[(uint8_t)x][i];\n    // The high 32 bits of h hold the four derived characters\n    uint32_t drv = h >> 32;\n    for (int i = 0; i < 4; ++i, drv >>= 8)\n        h ^= mt_T2[(uint8_t)drv][i];\n    return (uint32_t)h;\n}\n\nThe main drawback to mixed tabulation hashing is that it needs a relatively large random seed to\nfill out the tables T1 and T2. However, as noted in [14] for all the applications we consider here it\nsuffices to fill in the tables using a \u0398(log |U|)-independent hash function.\n\n3 Feature Hashing with Mixed Tabulation\n\nAs noted, Weinberger et al. [37] showed exponential tail bounds for feature hashing. Here, we first\nprove improved concentration bounds, and then, using techniques from [14] we argue that these\nbounds still hold (up to a small additive factor polynomial in the universe size) when implementing\nFH with mixed tabulation.\nThe concentration bounds we show are as follows (proved in the full version).\nTheorem 1. Let v \u2208 R^d with \u2016v\u2016_2 = 1 and let v\u2032 be the d\u2032-dimensional vector obtained by applying\nfeature hashing implemented with truly random hash functions. Let \u03b5, \u03b4 \u2208 (0, 1). Assume that\nd\u2032 \u2265 16 \u03b5^{\u22122} lg(1/\u03b4) and \u2016v\u2016_\u221e \u2264 \u221a(\u03b5 log(1 + 4/\u03b5)) / (6 \u221a(log(1/\u03b4) log(d\u2032/\u03b4))). Then it holds that\n\nPr[1 \u2212 \u03b5 < \u2016v\u2032\u2016_2^2 < 1 + \u03b5] \u2265 1 \u2212 4\u03b4 .    (1)\n\nTheorem 1 is very similar to the bounds on feature hashing by Weinberger et al. [37] and Dasgupta\net al. [16], but improves on the requirement on the size of \u2016v\u2016_\u221e. Weinberger et al. [37] show that\n(1) holds if \u2016v\u2016_\u221e is bounded by \u03b5 / (18 \u221a(log(1/\u03b4) log(d\u2032/\u03b4))), and Dasgupta et al. [16] show that (1) holds if\n\u2016v\u2016_\u221e is bounded by \u221a(\u03b5 / (16 log(1/\u03b4) log^2(d\u2032/\u03b4))). We improve on these results by factors of\n\u0398(\u221a((1/\u03b5) log(1/\u03b5))) and \u0398(\u221a(log(1/\u03b5) log(d\u2032/\u03b4))), respectively. We note that if we use feature hashing with a\npreconditioner (as in e.g. [16, Theorem 1]) these improvements translate into an improved running\ntime.\nUsing [14, Theorem 1] we get the following corollary.\nCorollary 1. Let v, \u03b5, \u03b4 and d\u2032 be as in Theorem 1, and let v\u2032 be the d\u2032-dimensional vector obtained\nusing feature hashing on v implemented with mixed tabulation hashing. Then, if supp(v) \u2264 |\u03a3|/(1 +\n\u2126(1)) it holds that\n\nPr[1 \u2212 \u03b5 < \u2016v\u2032\u2016_2^2 < 1 + \u03b5] \u2265 1 \u2212 4\u03b4 \u2212 O(|\u03a3|^{1\u2212\u230ad/2\u230b}) .\n\nIn fact Corollary 1 holds even if both h and sgn from Section 2.2 are implemented using the same\nhash function, i.e., if h* : [d] \u2192 {\u22121, +1} \u00d7 [d\u2032] is a mixed tabulation hash function as described in\nSection 2.4.\nWe note that feature hashing is often applied on very high dimensional, but sparse, data (e.g. 
in [2]),\nand thus the requirement supp(v) \u2264 |\u03a3|/(1 + \u2126(1)) is not very prohibitive. Furthermore, the target\ndimension d\u2032 is usually logarithmic in the universe, and then Corollary 1 still works for vectors with\npolynomial support giving an exponential decrease.\n\n4 Experimental evaluation\n\nWe experimentally evaluate several different basic hash functions. We first perform an evaluation of\nrunning time. We then evaluate the fastest hash functions on synthetic data confirming the theoretical\nresults of Section 3 and [14]. Finally, we demonstrate that even on real-world data, the provable\nguarantees of mixed tabulation sometimes yield systematically better results.\nDue to space restrictions, we only present some of our experiments here, and refer to the full version\nfor more details.\nWe consider some of the most popular and fast hash functions employed in practice: k-wise\nPolyHash [10], Multiply-shift [17], MurmurHash3 [6], CityHash [29], and the cryptographic hash\nfunction Blake2 [7]. Of these hash functions only mixed tabulation (and very high degree PolyHash)\nprovably works well for the applications we consider. However, Blake2 is a cryptographic function\nwhich provides similar guarantees conditioned on certain cryptographic assumptions being true. The\nremaining hash functions have provable weaknesses, but often work well (and are widely employed)\nin practice. See e.g. [1] who showed how to break both MurmurHash3 and CityHash64.\nAll experiments are implemented in C++11 using a random seed from http://www.random.org.\nThe seed for mixed tabulation was filled out using a random 20-wise PolyHash function. All keys and\nhash outputs were 32-bit integers to ensure efficient implementation of multiply-shift and PolyHash\nusing Mersenne prime p = 2^61 \u2212 1 and GCC\u2019s 128-bit integers.\nWe perform two time experiments, the results of which are presented in Table 1. 
Namely, we\nevaluate each hash function on the same 10^7 randomly chosen integers and use each hash function to\nimplement FH on the News20 dataset (discussed later). We see that the only two functions faster\nthan mixed tabulation are the very simple multiply-shift and 2-wise PolyHash. MurmurHash3 and\nCityHash were roughly 30-70% slower than mixed tabulation. This holds even though we used the official\nimplementations of MurmurHash3, CityHash and Blake2 which are highly optimized to the x86 and\nx64 architectures, whereas mixed tabulation is just standard, portable C++11 code. The cryptographic\nhash function, Blake2, is orders of magnitude slower as we would expect.\n\nTable 1: Time taken to evaluate different hash functions to 1) hash 10^7 random numbers, and 2)\nperform feature hashing with d\u2032 = 128 on the entire News20 data set.\n\nHash function     time (1..10^7)  time (News20)\nMultiply-shift    7.72 ms         55.78 ms\n2-wise PolyHash   17.55 ms        82.47 ms\n3-wise PolyHash   42.42 ms        120.19 ms\nMurmurHash3       59.70 ms        159.44 ms\nCityHash          59.06 ms        162.04 ms\nBlake2            3476.31 ms      6408.40 ms\nMixed tabulation  42.98 ms        90.55 ms\n\nBased on Table 1 we choose to compare mixed tabulation to multiply-shift, 2-wise PolyHash and\nMurmurHash3. We also include results for 20-wise PolyHash as a (cheating) way to \u201csimulate\u201d truly\nrandom hashing.\n\n4.1 Synthetic data\n\nFor a parameter, n, we generate two sets A, B as follows. The intersection A \u2229 B is created by\nsampling each integer from [2n] independently at random with probability 1/2. The symmetric\ndifference is generated by sampling n numbers greater than 2n (distributed evenly to A and B).\nIntuitively, with a hash function like (ax + b) mod p, the dense subset of [2n] will be mapped very\nsystematically and is likely (i.e. depending on the choice of a) to be spread out evenly. 
When using\nOPH, this means that elements from the intersection are more likely to be the smallest element in each\nbucket, leading to an over-estimation of J(A, B).\nWe use OPH with densification as in [32] implemented with different basic hash functions to estimate\nJ(A, B). We generate one instance of A and B and perform 2000 independent repetitions for each\ndifferent hash function on these A and B. Figure 2 shows the histogram and mean squared error\n(MSE) of estimates obtained with n = 2000 and k = 200. The figure confirms the theory: Both\nmultiply-shift and 2-wise PolyHash exhibit bias and bad concentration whereas both mixed tabulation\nand MurmurHash3 behave essentially as truly random hashing. We also performed experiments\nwith k = 100 and k = 500 and considered the case of n = k/2, where we expect many empty bins\nand the densification of [32] kicks in. All experiments obtained results similar to Figure 2.\n\nFigure 2: Histograms of set similarity estimates obtained using OPH with densification of [32] on\nsynthetic data implemented with different basic hash families and k = 200. The mean squared error\nfor each hash function is displayed in the top right corner.\n\nFor FH we obtained a vector v by taking the indicator vector of a set A generated as above and\nnormalizing the length. For each hash function we perform 2000 independent repetitions of the\nfollowing experiment: Generate v\u2032 using FH and calculate \u2016v\u2032\u2016_2^2. Using a good hash function we\nshould get good concentration of this value around 1. Figure 3 displays the histograms and MSE\nwe obtained for d\u2032 = 200. Again we see that multiply-shift and 2-wise PolyHash give poorly\nconcentrated results, and while the results are not biased this is only because of a very heavy tail of\nlarge values. 
We also ran experiments with d\u2032 = 100 and d\u2032 = 500 which were similar.\n\nFigure 3: Histograms of the 2-norm of the vectors output by FH on synthetic data implemented\nwith different basic hash families and d\u2032 = 200. The mean squared error for each hash function is\ndisplayed in the top right corner.\n\nWe briefly argue that this input is in fact quite natural: When encoding a document as shingles or\nbag-of-words, it is quite common to let frequent words/shingles have the lowest identifier (using\nfewest bits). In this case the intersection of two sets A and B will likely be a dense subset of small\nidentifiers. This is also the case when using Huffman Encoding [18], or if identifiers are generated\non-the-fly as words occur. Furthermore, for images it is often true that a pixel is more likely to have a\nnon-zero value if its neighbouring pixels have non-zero values giving many consecutive non-zeros.\n\nAdditional synthetic results. We also considered the following synthetic dataset, which actually\nshowed even more biased and poorly concentrated results. For similarity estimation we used elements\nfrom [4n], and let the symmetric difference be uniformly random sampled elements from {0, . . . , n \u2212\n1} \u222a {3n, . . . , 4n \u2212 1} with probability 1/2 and the intersection be the same but for {n, . . . 
, 3n \u2212 1}.\nThis gave an MSE that was roughly 6 times larger for multiply-shift and 4 times larger for 2-wise\nPolyHash compared to the other three. For feature hashing we sampled the numbers from 0 to\n3n \u2212 1 independently at random with probability 1/2 giving an MSE that was 20 times higher for\nmultiply-shift and 10 times higher for 2-wise PolyHash.\nWe also considered both datasets without the sampling, which showed an even wider gap between the\nhash functions.\n\n[Figure 2 panels (MSE): Multiply-shift 0.0058, 2-wise PolyHash 0.0049, Mixed Tabulation 0.0012, MurmurHash3 0.0012, \"Random\" 0.0011.]\n[Figure 3 panels (MSE): Multiply-shift 0.6066, 2-wise PolyHash 0.305, Mixed Tabulation 0.0099, MurmurHash3 0.0097, \"Random\" 0.01.]\n\n4.2 Real-world data\n\nWe consider the following real-world data sets\n\n\u2022 MNIST [21] Standard collection of handwritten digits. The average number of non-zeros is\nroughly 150 and the total number of features is 728. We use the standard partition of 60000\ndatabase points and 10000 query points.\n\u2022 News20 [11] Collection of newsgroup documents. The average number of non-zeros is\nroughly 500 and the total number of features is roughly 1.3 \u00b7 10^6. We randomly split the set\ninto two sets of roughly 10000 database and query points.\n\nThese two data sets cover both the sparse and dense regime, as well as the cases where each data\npoint is similar to many other points or few other points. 
For MNIST this number is roughly 3437 on\naverage and for News20 it is roughly 0.2 on average for similarity threshold above 1/2.\n\nFeature hashing. We perform the same experiment as for synthetic data by calculating \u2016v\u2032\u2016_2^2 for\neach v in the data set with 100 independent repetitions of each hash function (i.e. getting 6,000,000\nestimates for MNIST). Our results are shown in Figure 4 for output dimension d\u2032 = 128. Results with\nd\u2032 = 64 and d\u2032 = 256 were similar. The results confirm the theory and show that mixed tabulation\nperforms essentially as well as a truly random hash function clearly outperforming the weaker hash\nfunctions, which produce poorly concentrated results. This is particularly clear for the MNIST data\nset, but also for the News20 dataset, where e.g. 2-wise PolyHash resulted in \u2016v\u2032\u2016_2^2 as large as 16.671\ncompared to 2.077 with mixed tabulation.\n\nFigure 4: Histograms of the norm of vectors output by FH on the MNIST (top) and News20 (bottom)\ndata sets implemented with different basic hash families and d\u2032 = 128. The mean squared error for\neach hash function is displayed in the top right corner.\n\nSimilarity search with LSH. We perform a rigorous evaluation based on the setup of [31]. We test\nall combinations of K \u2208 {8, 10, 12} and L \u2208 {8, 10, 12}. 
For readability we only provide results\nfor multiply-shift and mixed tabulation and note that the results obtained for 2-wise PolyHash and\nMurmurHash3 are essentially identical to those for multiply-shift and mixed tabulation respectively.\nFollowing [31] we evaluate the results based on two metrics: 1) The fraction of total data points\nretrieved per query, and 2) the recall at a given threshold T_0 defined as the ratio of retrieved data\npoints having similarity at least T_0 with the query to the total number of data points having similarity\nat least T_0 with the query. Since the recall may be inflated by poor hash functions that just retrieve\nmany data points, we instead report #retrieved/recall-ratio, i.e. the number of data points that were\nretrieved divided by the percentage of recalled data points. The goal is to minimize this ratio as we\nwant to simultaneously retrieve few points and obtain high recall. Due to space restrictions we only\nreport our results for K = L = 10. We note that the other results were similar.\n\n[Figure 4 histogram panels, one per hash function (x-axis 0.0 to 2.0). Top row: Multiply-shift MSE=0.144, PolyHash 2-wise MSE=0.1655, Mixed Tabulation MSE=0.0155, MurmurHash3 MSE=0.016, \"random\" MSE=0.0163. Bottom row: Multiply-shift MSE=0.1106, PolyHash 2-wise MSE=0.0474, Mixed Tabulation MSE=0.0176, MurmurHash3 MSE=0.0176, \"random\" MSE=0.0177.]\n\nOur results can be seen in Figure 5. The results somewhat echo what we found on synthetic data.\nNamely, 1) Using multiply-shift overestimates the similarities of sets thus retrieving more points, and\n2) Multiply-shift gives very poorly concentrated results. 
As a consequence of 1), multiply-shift does,\nhowever, achieve slightly higher recall (not visible in the figure), but despite recalling slightly more\npoints, the #retrieved/recall-ratio of multiply-shift is systematically worse.\n\nFigure 5: Experimental evaluation of LSH with OPH and different hash functions with K = L = 10.\nThe hash functions used are multiply-shift (blue) and mixed tabulation (green). The value studied is\nthe #retrieved/recall-ratio (lower is better).\n\n5 Conclusion\n\nIn this paper we consider mixed tabulation for computational primitives in computer vision, information\nretrieval, and machine learning, namely similarity estimation and feature hashing. It was\npreviously shown [14] that mixed tabulation provably works essentially as well as truly random for\nsimilarity estimation with one permutation hashing. We complement this with a similar result for\nFH when the input vectors are sparse, even improving on the concentration bounds for truly random\nhashing found by [37, 16].\nOur empirical results demonstrate this in practice. Mixed tabulation significantly outperforms the\nsimple hashing schemes and is not much slower. Meanwhile, mixed tabulation is 40% faster than\nboth MurmurHash3 and CityHash, which showed similar performance to mixed tabulation. However,\nthese two hash functions do not have the same theoretical guarantees as mixed tabulation. We believe\nthat our findings make mixed tabulation the best candidate for implementing these applications in\npractice.\n\nAcknowledgements\n\nThe authors gratefully acknowledge support from Mikkel Thorup\u2019s Advanced Grant DFF-0602-02499B\nfrom the Danish Council for Independent Research as well as the DABAI project. Mathias\nB\u00e6k Tejs Knudsen gratefully acknowledges support from the FNU project AlgoDisc.\n\nReferences\n[1] Breaking murmur: Hash-flooding dos reloaded, 2012. 
URL: https://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/.\n\n[Figure 5 panels: MNIST, Thr=0.8; MNIST, Thr=0.5; news20, Thr=0.8; news20, Thr=0.5. Axes: Ratio (x), Frequency (y).]\n\n[2] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya P. Razenshteyn, and Ludwig Schmidt. Practical and optimal LSH for angular distance. In Proc. 28th Advances in Neural Information Processing Systems, pages 1225\u20131233, 2015.\n\n[3] Alexandr Andoni, Piotr Indyk, Huy L. Nguyen, and Ilya Razenshteyn. Beyond locality-sensitive hashing. In Proc. 25th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 1018\u20131028, 2014.\n\n[4] Alexandr Andoni, Thijs Laarhoven, Ilya P. Razenshteyn, and Erik Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proc. 28th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 47\u201366, 2017.\n\n[5] Alexandr Andoni and Ilya P. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. 47th ACM Symposium on Theory of Computing (STOC), pages 793\u2013801, 2015.\n\n[6] Austin Appleby. Murmurhash3, 2016. URL: https://github.com/aappleby/smhasher/wiki/MurmurHash3.\n\n[7] Jean-Philippe Aumasson, Samuel Neves, Zooko Wilcox-O\u2019Hearn, and Christian Winnerlein. BLAKE2: simpler, smaller, fast as MD5. In Proc. 11th International Conference on Applied Cryptography and Network Security, pages 119\u2013135, 2013.\n\n[8] Andrei Z. Broder. On the resemblance and containment of documents. In Proc. Compression and Complexity of Sequences (SEQUENCES), pages 21\u201329, 1997.\n\n[9] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. Computer Networks, 29:1157\u20131166, 1997.\n\n[10] Larry Carter and Mark N. Wegman. Universal classes of hash functions. 
Journal of Computer and System Sciences, 18(2):143\u2013154, 1979. See also STOC\u201977.\n\n[11] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):27:1\u201327:27, 2011.\n\n[12] Moses Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th ACM Symposium on Theory of Computing (STOC), pages 380\u2013388, 2002.\n\n[13] Tobias Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive filtering. In Proc. 28th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 31\u201346, 2017.\n\n[14] S\u00f8ren Dahlgaard, Mathias B\u00e6k Tejs Knudsen, Eva Rotenberg, and Mikkel Thorup. Hashing for statistics over k-partitions. In Proc. 56th IEEE Symposium on Foundations of Computer Science (FOCS), pages 1292\u20131310, 2015.\n\n[15] S\u00f8ren Dahlgaard, Christian Igel, and Mikkel Thorup. Nearest neighbor classification using bottom-k sketches. In IEEE BigData Conference, pages 28\u201334, 2013.\n\n[16] Anirban Dasgupta, Ravi Kumar, and Tam\u00e1s Sarl\u00f3s. A sparse Johnson-Lindenstrauss transform. In Proc. 42nd ACM Symposium on Theory of Computing (STOC), pages 341\u2013350, 2010.\n\n[17] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closest-pair problem. Journal of Algorithms, 25(1):19\u201351, 1997.\n\n[18] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers, 40(9):1098\u20131101, September 1952.\n\n[19] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th ACM Symposium on Theory of Computing (STOC), pages 604\u2013613, 1998.\n\n[20] Christopher Kennedy and Rachel Ward. Fast cross-polytope locality-sensitive hashing. CoRR, abs/1602.06922, 2016.\n\n[21] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. 
The MNIST database of handwritten digits, 1998. URL: http://yann.lecun.com/exdb/mnist/.\n\n[22] Ping Li, Art B. Owen, and Cun-Hui Zhang. One permutation hashing. In Proc. 26th Advances in Neural Information Processing Systems, pages 3122\u20133130, 2012.\n\n[23] Ping Li, Anshumali Shrivastava, and Arnd Christian K\u00f6nig. b-bit minwise hashing in practice: Large-scale batch and online learning and using gpus for fast preprocessing with simple hash functions. CoRR, abs/1205.2958, 2012. URL: http://arxiv.org/abs/1205.2958.\n\n[24] Ping Li, Anshumali Shrivastava, Joshua L. Moore, and Arnd Christian K\u00f6nig. Hashing algorithms for large-scale learning. In Proc. 25th Advances in Neural Information Processing Systems, pages 2672\u20132680, 2011.\n\n[25] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting near-duplicates for web crawling. In Proc. 10th WWW, pages 141\u2013150, 2007.\n\n[26] Michael Mitzenmacher and Salil P. Vadhan. Why simple hash functions work: exploiting the entropy in a data stream. In Proc. 19th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 746\u2013755, 2008.\n\n[27] Noam Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449\u2013461, 1992. See also STOC\u201990.\n\n[28] Mihai Patrascu and Mikkel Thorup. On the k-independence required by linear probing and minwise independence. ACM Transactions on Algorithms, 12(1):8:1\u20138:27, 2016. See also ICALP\u201910.\n\n[29] Geoff Pike and Jyrki Alakuijala. Introducing cityhash, 2011. URL: https://opensource.googleblog.com/2011/04/introducing-cityhash.html.\n\n[30] Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk. Nearest-neighbor methods in learning and vision. IEEE Trans. Neural Networks, 19(2):377, 2008.\n\n[31] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In Proc. 
31st International Conference on Machine Learning (ICML), pages 557\u2013565, 2014.\n\n[32] Anshumali Shrivastava and Ping Li. Improved densification of one permutation hashing. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI 2014, Quebec City, Quebec, Canada, July 23-27, 2014, pages 732\u2013741, 2014.\n\n[33] Kengo Terasawa and Yuzuru Tanaka. Spherical LSH for approximate nearest neighbor search on unit hypersphere. In Proc. 10th Workshop on Algorithms and Data Structures (WADS), pages 27\u201338, 2007.\n\n[34] Mikkel Thorup. Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In Proc. 45th ACM Symposium on Theory of Computing (STOC), 2013.\n\n[35] Mikkel Thorup and Yin Zhang. Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM Journal on Computing, 41(2):293\u2013331, 2012. Announced at SODA\u201904 and ALENEX\u201910.\n\n[36] Simon Tong. Lessons learned developing a practical large scale machine learning system, April 2010. URL: https://research.googleblog.com/2010/04/lessons-learned-developing-practical.html.\n\n[37] Kilian Q. Weinberger, Anirban Dasgupta, John Langford, Alexander J. Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proc. 26th International Conference on Machine Learning (ICML), pages 1113\u20131120, 2009.", "award": [], "sourceid": 3315, "authors": [{"given_name": "S\u00f8ren", "family_name": "Dahlgaard", "institution": "University of Copenhagen"}, {"given_name": "Mathias", "family_name": "Knudsen", "institution": "University of Copenhagen"}, {"given_name": "Mikkel", "family_name": "Thorup", "institution": "University of Copenhagen"}]}