{"title": "One Permutation Hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 3113, "page_last": 3121, "abstract": "While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) $k=500$ permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple \\textbf{one permutation hashing} scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing $k$ permutations to just one would be much more \\textbf{energy-efficient}, which might be an important perspective as minwise hashing is commonly deployed in the search industry. While the theoretical probability analysis is interesting, our experiments on similarity estimation and SVM \\& logistic regression also confirm the theoretical results.", "full_text": "One Permutation Hashing\n\nDepartment of Statistical Science\n\nDepartment of Statistics\n\nPing Li\n\nCornell University\n\nArt B Owen\n\nStanford University\nAbstract\n\nCun-Hui Zhang\n\nDepartment of Statistics\n\nRutgers University\n\nMinwise hashing is a standard procedure in the context of search, for ef\ufb01ciently\nestimating set similarities in massive binary data such as text. Recently, b-bit\nminwise hashing has been applied to large-scale learning and sublinear time near-\nneighbor search. The major drawback of minwise hashing is the expensive pre-\nprocessing, as the method requires applying (e.g.,) k = 200 to 500 permutations\non the data. 
This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.\n\n1 Introduction\nMinwise hashing [4, 3] is a standard technique in the context of search, for efficiently computing set similarities. Recently, b-bit minwise hashing [18, 19], which stores only the lowest b bits of each hashed value, has been applied to sublinear time near neighbor search [22] and learning [16], on large-scale high-dimensional binary data (e.g., text). A drawback of minwise hashing is that it requires a costly preprocessing step, for conducting (e.g.,) k = 200 ~ 500 permutations on the data.\n\n1.1 Massive High-Dimensional Binary Data\nIn the context of search, text data are often processed to be binary in extremely high dimensions. A standard procedure is to represent documents (e.g., Web pages) using w-shingles (i.e., w contiguous words), where w ≥ 5 in several studies [4, 8]. This means the size of the dictionary needs to be substantially increased, from (e.g.,) 10^5 common English words to 10^(5w) “super-words”. In current practice, it appears sufficient to set the total dimensionality to be D = 2^64, for convenience. Text data generated by w-shingles are often treated as binary. 
The concept of shingling can be naturally extended to Computer Vision, either at the pixel level (for aligned images) or at the visual feature level [23]. In machine learning practice, the use of extremely high-dimensional data has become common. For example, [24] discusses training datasets with (on average) n = 10^11 items and D = 10^9 distinct features. [25] experimented with a dataset of potentially D = 16 trillion (1.6 × 10^13) unique features.\n\n1.2 Minwise Hashing and b-Bit Minwise Hashing\nMinwise hashing was mainly designed for binary data. A binary (0/1) data vector can be viewed as a set (the locations of its nonzeros). Consider sets Si ⊆ Ω = {0, 1, 2, ..., D − 1}, where D, the size of the space, is often set as D = 2^64 in industrial applications. The similarity between two sets, S1 and S2, is commonly measured by the resemblance, which is a version of the normalized inner product:\n\nR = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a), where f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|. (1)\n\nFor large-scale applications, the cost of computing resemblances exactly can be prohibitive in time, space, and energy consumption. The minwise hashing method was proposed for efficiently computing resemblances. The method requires applying k independent random permutations on the data. Denote by π a random permutation: π : Ω → Ω. The hashed values are the two minimums of π(S1) and π(S2). 
The probability that the two hashed values are equal is\n\nPr(min(π(S1)) = min(π(S2))) = |S1 ∩ S2| / |S1 ∪ S2| = R. (2)\n\nOne can then estimate R from k independent permutations, π1, ..., πk:\n\nR̂M = (1/k) Σ_{j=1}^{k} 1{min(πj(S1)) = min(πj(S2))}, Var(R̂M) = (1/k) R(1 − R). (3)\n\nBecause the indicator function 1{min(πj(S1)) = min(πj(S2))} can be written as an inner product between two binary vectors (each having only one 1) in D dimensions [16]:\n\n1{min(πj(S1)) = min(πj(S2))} = Σ_{i=0}^{D−1} 1{min(πj(S1)) = i} × 1{min(πj(S2)) = i}, (4)\n\nwe know that minwise hashing can potentially be used for training linear SVM and logistic regression on high-dimensional binary data by converting the permuted data into a new data matrix in D × k dimensions. This of course would not be realistic if D = 2^64.\nThe method of b-bit minwise hashing [18, 19] provides a simple solution by storing only the lowest b bits of each hashed value, reducing the dimensionality of the (expanded) hashed data matrix to just 2^b × k. [16] applied this idea to large-scale learning on the webspam dataset and demonstrated that using b = 8 and k = 200 to 500 could achieve very similar accuracies as using the original data.\n\n1.3 The Cost of Preprocessing and Testing\nClearly, the preprocessing of minwise hashing can be very costly. In our experiments, loading the webspam dataset (350,000 samples, about 16 million features, and about 24GB in LibSVM/svmlight (text) format) used in [16] took about 1000 seconds when the data were stored in text format, and took about 150 seconds after we converted the data into binary. In contrast, the preprocessing cost for k = 500 was about 6000 seconds. Note that, compared to industrial applications [24], the webspam dataset is very small. 
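The k-permutation estimation procedure of (2)–(3) can be sketched as follows (a minimal simulation with explicit permutation vectors and a small toy D; industrial systems use far larger D and hash-based implementations of the permutations):

```python
import random

def minhash(s, perms):
    # One hashed value per permutation: the minimum of pi(S).
    return [min(p[i] for i in s) for p in perms]

def estimate_R(h1, h2):
    # Eq. (3): fraction of permutations whose minimums collide.
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)

random.seed(0)
D, k = 1000, 500
S1 = set(range(0, 200))
S2 = set(range(100, 300))
R = len(S1 & S2) / len(S1 | S2)   # true resemblance, Eq. (1): 100/300

perms = []
for _ in range(k):
    p = list(range(D))
    random.shuffle(p)
    perms.append(p)

R_hat = estimate_R(minhash(S1, perms), minhash(S2, perms))
print(R, round(R_hat, 3))  # R_hat should be close to R = 1/3
```

By Eq. (3) the standard deviation of R̂M here is about sqrt(R(1 − R)/k) ≈ 0.021, so the estimate is typically within a few percent of the truth.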
For larger datasets, the preprocessing step will be much more expensive.\nIn the testing phase (in search or learning), if a new data point (e.g., a new document or a new image) has not been processed, then the total cost will be expensive if it includes the preprocessing. This may raise significant issues in user-facing applications where testing efficiency is crucial. Intuitively, the standard practice of minwise hashing is quite “wasteful” in that all the nonzero elements in one set are scanned (permuted) but only the smallest one is used.\n\n1.4 Our Proposal: One Permutation Hashing\n\nFigure 1: Consider S1, S2, S3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16). We apply one permutation π on the sets and present π(S1), π(S2), and π(S3) as binary (0/1) vectors, where π(S1) = {2, 4, 7, 13}, π(S2) = {0, 6, 13}, and π(S3) = {0, 1, 10, 12}. We divide the space Ω evenly into k = 4 bins, select the smallest nonzero in each bin, and re-index the selected elements as: [2, 0, ∗, 1], [0, 2, ∗, 1], and [0, ∗, 2, 0]. For now, we use “∗” for empty bins, which occur rarely unless the number of nonzeros is small compared to k.\n\nAs illustrated in Figure 1, the idea of one permutation hashing is simple. We view sets as 0/1 vectors in D dimensions so that we can treat a collection of sets as a binary data matrix in D dimensions. After we permute the columns (features) of the data matrix, we divide the columns evenly into k parts (bins) and we simply take, for each data vector, the smallest nonzero element in each bin.\nIn the example in Figure 1 (which concerns 3 sets), the sample selected from π(S1) is [2, 4, ∗, 13], where we use “∗” to denote an empty bin, for the time being. 
Since we only want to compare elements with the same bin number (so that we can obtain an inner product), we can re-index the elements of each bin to use the smallest possible representations. For example, for π(S1), after re-indexing, the sample [2, 4, ∗, 13] becomes [2 − 4 × 0, 4 − 4 × 1, ∗, 13 − 4 × 3] = [2, 0, ∗, 1].\nWe will show that empty bins occur rarely unless the total number of nonzeros for some set is small compared to k, and we will present strategies on how to deal with empty bins should they occur.\n\n1.5 Advantages of One Permutation Hashing\nReducing k (e.g., 500) permutations to just one permutation (or a few) is much more computationally efficient. From the perspective of energy consumption, this scheme is desirable, especially considering that minwise hashing is deployed in the search industry. Parallel solutions (e.g., GPU [17]), which require additional hardware and software implementation, will not be energy-efficient.\nIn the testing phase, if a new data point (e.g., a new document or a new image) has to first be processed with k permutations, then the testing performance may not meet the demand in, for example, user-facing applications such as search or interactive visual analytics.\nOne permutation hashing should also be easier to implement, from the perspective of random number generation. 
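The binning and re-indexing step of Figure 1 can be sketched as follows (a minimal sketch using an explicit permutation vector; None plays the role of “∗”):

```python
def one_permutation_hash(s, perm, D, k):
    # Apply the permutation, then keep the smallest nonzero location in
    # each of the k equal-width bins, re-indexed within its bin.
    # Empty bins are marked None (the '*' in Figure 1).
    width = D // k
    bins = [None] * k
    for loc in s:
        v = perm[loc]
        j = v // width
        r = v - j * width          # re-indexed value within bin j
        if bins[j] is None or r < bins[j]:
            bins[j] = r
    return bins

# Figure 1 example: D = 16, k = 4. We use the identity permutation so
# that pi(S1) = {2, 4, 7, 13} etc. hold directly.
D, k = 16, 4
perm = list(range(D))
print(one_permutation_hash({2, 4, 7, 13}, perm, D, k))   # [2, 0, None, 1]
print(one_permutation_hash({0, 6, 13}, perm, D, k))      # [0, 2, None, 1]
print(one_permutation_hash({0, 1, 10, 12}, perm, D, k))  # [0, None, 2, 0]
```

The three outputs match the re-indexed samples [2, 0, ∗, 1], [0, 2, ∗, 1], and [0, ∗, 2, 0] given in Figure 1.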
For example, if a dataset has one billion features (D = 10^9), we can simply generate a “permutation vector” of length D = 10^9, the memory cost of which (i.e., 4GB) is not significant. On the other hand, it would not be realistic to store a “permutation matrix” of size D × k if D = 10^9 and k = 500; instead, one usually has to resort to approximations such as universal hashing [5]. Universal hashing often works well in practice, although theoretically there are always worst cases.\nOne permutation hashing is a better matrix sparsification scheme. In terms of the original binary data matrix, the one permutation scheme simply makes many nonzero entries zero, without further “damaging” the matrix. Using the k-permutation scheme, we store, for each permutation and each row, only the first nonzero and make all the other nonzero entries zero; and then we have to concatenate k such data matrices. This significantly changes the structure of the original data matrix.\n\n1.6 Related Work\nOne of the authors has worked on another “one permutation” scheme named Conditional Random Sampling (CRS) [13, 14] since 2005. Basically, CRS continuously takes the bottom-k nonzeros after applying one permutation on the data, then it uses a simple “trick” to construct a random sample for each pair, with the effective sample size determined at the estimation stage. By taking the nonzeros continuously, however, the samples are no longer “aligned” and hence we can not write the estimator as an inner product in a unified fashion. [16] commented that using CRS for linear learning does not produce as good results compared to using b-bit minwise hashing. 
Interestingly, in the original “minwise hashing” paper [4] (we use quotes because the scheme was not called “minwise hashing” at that time), only one permutation was used and a sample was the first k nonzeros after the permutation. The authors then quickly moved to the k-permutation minwise hashing scheme [3].\nWe are also inspired by the work on very sparse random projections [15] and very sparse stable random projections [12]. The regular random projection method also has an expensive preprocessing cost, as it needs a large number of projections. [15, 12] showed that one can substantially reduce the preprocessing cost by using an extremely sparse projection matrix. The preprocessing cost of very sparse random projections can be as small as merely doing one projection. See www.stanford.edu/group/mmds/slides2012/s-pli.pdf for experimental results on clustering/classification/regression using very sparse random projections.\nThis paper focuses on the “fixed-length” scheme as shown in Figure 1. The technical report (arXiv:1208.1259) also describes a “variable-length” scheme. The two schemes are more or less equivalent, although the fixed-length scheme is more convenient to implement (and it is slightly more accurate). The variable-length hashing scheme is to some extent related to the Count-Min (CM) sketch [6] and the Vowpal Wabbit (VW) [21, 25] hashing algorithms.\n\n2 Applications of Minwise Hashing on Efficient Search and Learning\nIn this section, we will briefly review two important applications of the k-permutation b-bit minwise hashing: (i) sublinear time near neighbor search [22], and (ii) large-scale linear learning [16].\n\n2.1 Sublinear Time Near Neighbor Search\nThe task of near neighbor search is to identify a set of data points which are “most similar” to a query data point. 
Developing efficient algorithms for near neighbor search has been an active research topic since the early days of modern computing (e.g., [9]). In current practice, methods for approximate near neighbor search often fall into the general framework of Locality Sensitive Hashing (LSH) [10, 1]. The performance of LSH largely depends on its underlying implementation. The idea in [22] is to directly use the bits from b-bit minwise hashing to construct hash tables.\nSpecifically, we hash the data points using k random permutations and store each hash value using b bits. For each data point, we concatenate the resultant B = bk bits as a signature (e.g., bk = 16). This way, we create a table of 2^B buckets and each bucket stores the pointers of the data points whose signatures match the bucket number. In the testing phase, we apply the same k permutations to a query data point to generate a bk-bit signature and only search data points in the corresponding bucket. Since using only one table will likely miss many true near neighbors, as a remedy, we independently generate L tables. The query result is the union of the data points retrieved from the L tables.\n\nFigure 2: An example of hash tables, with b = 2, k = 2, and L = 2.\n\nFigure 2 provides an example with b = 2 bits, k = 2 permutations, and L = 2 tables. The size of each hash table is 2^4. Given n data points, we apply k = 2 permutations and store b = 2 bits of each hashed value to generate n (4-bit) signatures L times. Consider data point 6. For Table 1 (left panel of Figure 2), the lowest b bits of its two hashed values are 00 and 00 and thus its signature is 0000 in binary; hence we place a pointer to data point 6 in bucket number 0. For Table 2 (right panel of Figure 2), we apply another k = 2 permutations. This time, the signature of data point 6 becomes 1111 in binary and hence we place it in the last bucket. 
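The signature construction in this example can be sketched as follows (a minimal sketch; the hashed values are given directly rather than computed from permutations):

```python
def signature(hash_values, b):
    # Concatenate the lowest b bits of each of the k hashed values
    # into a single B = b*k bit bucket index.
    sig = 0
    for h in hash_values:
        sig = (sig << b) | (h & ((1 << b) - 1))
    return sig

# b = 2, k = 2: two hashed values whose lowest bits are 00 and 00
# give bucket 0 (binary 0000); lowest bits 11 and 11 give bucket 15
# (binary 1111), the last bucket of the 2**4-entry table.
print(signature([4, 8], b=2))   # 0
print(signature([7, 11], b=2))  # 15
```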
Suppose in the testing phase, the two (4-bit) signatures of a new data point are 0000 and 1111, respectively. We then only search for the near neighbors in the set {6, 15, 26, 79, 110, 143}, instead of the original set of n data points.\n\n2.2 Large-Scale Linear Learning\nThe recent development of highly efficient linear learning algorithms is a major breakthrough. Popular packages include SVMperf [11], Pegasos [20], Bottou's SGD SVM [2], and LIBLINEAR [7].\nGiven a dataset {(xi, yi)}_{i=1}^{n}, xi ∈ R^D, yi ∈ {−1, 1}, L2-regularized logistic regression solves the following optimization problem (where C > 0 is the regularization parameter):\n\nmin_w (1/2) w^T w + C Σ_{i=1}^{n} log(1 + exp(−yi w^T xi)). (5)\n\nL2-regularized linear SVM solves a similar problem:\n\nmin_w (1/2) w^T w + C Σ_{i=1}^{n} max(1 − yi w^T xi, 0). (6)\n\nIn [16], the authors apply k random permutations on each (binary) feature vector xi and store the lowest b bits of each hashed value, to obtain a new dataset which can be stored using merely nbk bits. At run-time, each new data point has to be expanded into a 2^b × k-length vector with exactly k 1's.\nTo illustrate this simple procedure, [16] provided a toy example with k = 3 permutations. Suppose for one data vector, the hashed values are {12013, 25964, 20191}, whose binary digits are respectively {010111011101101, 110010101101100, 100111011011111}. Using b = 2 bits, the lowest binary digits are stored as {01, 00, 11} (which correspond to {1, 0, 3} in decimals). At run-time, the (b-bit) hashed data are expanded into a new feature vector of length 2^b k = 12: {0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0}. 
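This expansion can be sketched as follows (a minimal sketch; the placement of the 1 within each block of length 2^b follows the toy example above, i.e., a b-bit value v maps to position 2^b − 1 − v):

```python
def expand(hash_values, b):
    # Map each b-bit value v to a one-hot block of length 2**b,
    # with the 1 at position 2**b - 1 - v, and concatenate the blocks.
    B = 1 << b
    out = []
    for h in hash_values:
        v = h & (B - 1)            # lowest b bits of the hashed value
        block = [0] * B
        block[B - 1 - v] = 1
        out.extend(block)
    return out

# Toy example from [16]: k = 3 hashed values, b = 2 bits.
print(expand([12013, 25964, 20191], b=2))
# [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
```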
The same procedure is then applied to all n feature vectors.\nClearly, in both applications (near neighbor search and linear learning), the hashed data have to be “aligned” in that only the hashed data generated from the same permutation are interacted. Note that, with our one permutation scheme as in Figure 1, the hashed data are indeed aligned.\n\n3 Theoretical Analysis of the One Permutation Scheme\nThis section presents the probability analysis to provide a rigorous foundation for one permutation hashing as illustrated in Figure 1. Consider two sets S1 and S2. We first introduce two definitions, for the number of “jointly empty bins” and the number of “matched bins,” respectively:\n\nNemp = Σ_{j=1}^{k} Iemp,j, Nmat = Σ_{j=1}^{k} Imat,j, (7)\n\nwhere Iemp,j and Imat,j are defined for the j-th bin as\n\nIemp,j = 1 if both π(S1) and π(S2) are empty in the j-th bin, and 0 otherwise; (8)\n\nImat,j = 1 if both π(S1) and π(S2) are not empty in the j-th bin and the smallest element of π(S1) matches the smallest element of π(S2), and 0 otherwise. (9)\n\nRecall the notation: f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|. 
We also use f = |S1 ∪ S2| = f1 + f2 − a.\n\nLemma 1 Assume D(1 − 1/k) ≥ f = f1 + f2 − a. Then\n\nPr(Nemp = j) = Σ_{s=0}^{k−j} (−1)^s (k! / (j! s! (k − j − s)!)) Π_{t=0}^{f−1} (D(1 − (j + s)/k) − t) / (D − t), 0 ≤ j ≤ k − 1, (10)\n\nE(Nemp)/k = Π_{j=0}^{f−1} (D(1 − 1/k) − j) / (D − j) ≤ (1 − 1/k)^f, (11)\n\nE(Nmat)/k = R (1 − E(Nemp)/k), (12)\n\nCov(Nmat, Nemp) ≤ 0. (13)\n\nIn practical scenarios, the data are often sparse, i.e., f = f1 + f2 − a ≪ D. In this case, the upper bound in (11), (1 − 1/k)^f, is a good approximation to the true value of E(Nemp)/k. Since (1 − 1/k)^f ≈ e^{−f/k}, we know that the chance of empty bins is small when f ≫ k. For example, if f/k = 5 then (1 − 1/k)^f ≈ 0.0067. For practical applications, we would expect that f ≫ k (for most data pairs), otherwise hashing probably would not be too useful anyway. This is why we do not expect empty bins to significantly impact (if at all) the performance in practical settings.\nLemma 2 shows that the following estimator R̂mat of the resemblance is unbiased:\n\nLemma 2
R̂mat = Nmat / (k − Nemp), E(R̂mat) = R, (14)\n\nVar(R̂mat) = R(1 − R) (E(1/(k − Nemp)) (1 + 1/(f − 1)) − 1/(f − 1)), (15)\n\nE(1/(k − Nemp)) = Σ_{j=0}^{k−1} Pr(Nemp = j) / (k − j) ≥ 1 / (k − E(Nemp)). (16)\n\nThe fact that E(R̂mat) = R may seem surprising, as in general ratio estimators are not unbiased. Note that k − Nemp > 0, because we assume the original data vectors are not completely empty (all-zero). As expected, when k ≪ f = f1 + f2 − a, Nemp is essentially zero and hence Var(R̂mat) ≈ R(1 − R)/k. In fact, Var(R̂mat) is a bit smaller than R(1 − R)/k, especially for large k.\nIt is probably not surprising that our one permutation scheme (slightly) outperforms the original k-permutation scheme (at merely 1/k of the preprocessing cost), because one permutation hashing, which is “sampling-without-replacement”, provides a better strategy for matrix sparsification.\n\n4 Strategies for Dealing with Empty Bins\nIn general, we expect that empty bins should not occur often because E(Nemp)/k ≈ e^{−f/k}, which is very close to zero if f/k > 5. (Recall f = |S1 ∪ S2|.) 
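The e^{−f/k} approximation to the empty-bin fraction is easy to check numerically (a minimal sketch):

```python
import math

# Expected fraction of empty bins for a set with f nonzeros and k bins,
# using the sparse-data approximation (1 - 1/k)**f from Lemma 1.
k = 500
for f_over_k in [1, 2, 5, 10]:
    f = f_over_k * k
    bound = (1 - 1 / k) ** f       # upper bound in Eq. (11)
    approx = math.exp(-f / k)      # e^{-f/k}
    print(f_over_k, round(bound, 4), round(approx, 4))
```

At f/k = 5 both quantities are about 0.0067, matching the value quoted above.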
If the goal of using minwise hashing is data reduction, i.e., reducing the number of nonzeros, then we would expect that f ≫ k anyway.\nNevertheless, in applications where we need the estimators to be inner products, we need strategies to deal with empty bins in case they occur. Fortunately, we realize a (in retrospect) simple strategy which can be nicely integrated with linear learning algorithms and which performs well.\nFigure 3 plots the histogram of the numbers of nonzeros in the webspam dataset, which has 350,000 samples. The average number of nonzeros is about 4000, which should be much larger than k (e.g., 500) for the hashing procedure. On the other hand, about 10% (or 2.8%) of the samples have < 500 (or < 200) nonzeros. Thus, we must deal with empty bins if we do not want to exclude those data points. For example, if f = k = 500, then E(Nemp)/k ≈ e^{−f/k} = e^{−1} ≈ 0.3679, which is not small.\n\nFigure 3: Histogram of the numbers of nonzeros in the webspam dataset (350,000 samples).\n\nThe strategy we recommend for linear learning is zero coding, which is tightly coupled with the strategy of hashed data expansion [16] as reviewed in Sec. 2.2. More details will be elaborated in Sec. 4.2. Basically, we can encode “∗” as “zero” in the expanded space, which means Nmat will remain the same (after taking the inner product in the expanded space). This strategy, which is sparsity-preserving, essentially corresponds to the following modified estimator:\n\nR̂mat^(0) = Nmat / (√(k − Nemp^(1)) √(k − Nemp^(2))), (17)\n\nwhere Nemp^(1) = Σ_{j=1}^{k} Iemp,j^(1) and Nemp^(2) = Σ_{j=1}^{k} Iemp,j^(2) are the numbers of empty bins in π(S1) and π(S2), respectively. 
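The estimator (17) can be computed from the two per-vector samples alone (a minimal sketch building on bin lists like those of Figure 1, with None marking an empty bin):

```python
def r_mat_zero(bins1, bins2):
    # Eq. (17): matched bins divided by the geometric mean of the
    # non-empty bin counts; each count is computable from one vector
    # separately (None marks an empty bin).
    k = len(bins1)
    n1 = sum(b is None for b in bins1)   # empty bins in pi(S1)
    n2 = sum(b is None for b in bins2)   # empty bins in pi(S2)
    n_mat = sum(
        a is not None and b is not None and a == b
        for a, b in zip(bins1, bins2)
    )
    return n_mat / (((k - n1) * (k - n2)) ** 0.5)

# Figure 1 samples after re-indexing: pi(S1) -> [2, 0, None, 1] and
# pi(S2) -> [0, 2, None, 1]; one matched bin, three non-empty bins each.
print(round(r_mat_zero([2, 0, None, 1], [0, 2, None, 1]), 3))  # 0.333
```

With only k = 4 bins this estimate is of course very noisy; the point is merely that the numerator and denominator decompose over the two vectors.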
This modified estimator makes sense for a number of reasons. Basically, since each data vector is processed and coded separately, we actually do not know Nemp (the number of jointly empty bins) until we see both π(S1) and π(S2). In other words, we can not really compute Nemp if we want to use linear estimators. On the other hand, Nemp^(1) and Nemp^(2) are always available. In fact, the use of √(k − Nemp^(1)) √(k − Nemp^(2)) in the denominator corresponds to the normalizing step which is needed before feeding the data to a solver for SVM or logistic regression.\nWhen Nemp^(1) = Nemp^(2) = Nemp, (17) is equivalent to the original R̂mat. When two original vectors are very similar (e.g., large R), Nemp^(1) and Nemp^(2) will be close to Nemp. When two sets are highly unbalanced, using (17) will overestimate R; however, in this case, Nmat will be so small that the absolute error will not be large.\n\n4.1 The m-Permutation Scheme with 1 < m ≪ k\nIf one would like to further (significantly) reduce the chance of the occurrence of empty bins, we shall mention here that one does not really have to strictly follow “one permutation,” since one can always conduct m permutations with k' = k/m and concatenate the hashed data. Once the preprocessing is no longer the bottleneck, it matters less whether we use 1 permutation or (e.g.,) m = 3 permutations. The chance of having empty bins decreases exponentially with increasing m.\n\n4.2 An Example of the “Zero Coding” Strategy for Linear Learning\nSec. 2.2 reviewed the data-expansion strategy used by [16] for integrating b-bit minwise hashing with linear learning. We will adopt a similar strategy, with modifications to handle empty bins. We use a similar example as in Sec. 2.2. Suppose we apply our one permutation hashing scheme and use k = 4 bins. For the first data vector, the hashed values are [12013, 25964, 20191, ∗] (i.e., the 4-th bin is empty). Suppose again we use b = 2 bits. 
With the “zero coding” strategy, our procedure is summarized as follows:\n\nOriginal hashed values (k = 4): 12013, 25964, 20191, ∗\nOriginal binary representations: 010111011101101, 110010101101100, 100111011011111, ∗\nLowest b = 2 binary digits: 01, 00, 11, ∗\nExpanded 2^b = 4 binary digits: 0010, 0001, 1000, 0000\nNew feature vector fed to a solver: (1/√(4 − 1)) × [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]\n\nWe apply the same procedure to all feature vectors in the data matrix to generate a new data matrix. The normalization factor 1/√(k − Nemp^(i)) varies, depending on the number of empty bins in the i-th vector.\n\n5 Experimental Results on the Webspam Dataset\nThe webspam dataset has 350,000 samples and 16,609,143 features. Each feature vector has on average about 4000 nonzeros; see Figure 3. Following [16], we use 80% of the samples for training and the remaining 20% for testing. We conduct extensive experiments on linear SVM and logistic regression, using our proposed one permutation hashing scheme with k ∈ {2^6, 2^7, 2^8, 2^9} and b ∈ {1, 2, 4, 6, 8}. For convenience, we use D = 2^24 = 16,777,216, which is divisible by k.\nThere is one regularization parameter C in linear SVM and logistic regression. Since our purpose is to demonstrate the effectiveness of our proposed hashing scheme, we simply provide the results for a wide range of C values and assume that the best performance is achievable if we conduct cross-validations. This way, interested readers may be able to easily reproduce our experiments.\nFigure 4 presents the test accuracies for both linear SVM (upper panels) and logistic regression (bottom panels). Clearly, when k = 512 (or even 256) and b = 8, b-bit one permutation hashing achieves similar test accuracies as using the original data. 
Also, compared to the original k-permutation scheme as in [16], our one permutation scheme achieves similar (or even slightly better) accuracies.\n\nFigure 4: Test accuracies of SVM (upper panels) and logistic regression (bottom panels), averaged over 50 repetitions. The accuracies of using the original data are plotted as dashed (red, if color is available) curves with “diamond” markers. C is the regularization parameter. Compared with the original k-permutation minwise hashing (dashed, and blue if color is available), the one permutation hashing scheme achieves similar accuracies, or even slightly better accuracies when k is large.\n\nThe empirical results on the webspam dataset are encouraging because they verify that our proposed one permutation hashing scheme performs as well as (or even slightly better than) the original k-permutation scheme, at merely 1/k of the original preprocessing cost. On the other hand, it would be more interesting, from the perspective of testing the robustness of our algorithm, to conduct experiments on a dataset (e.g., news20) where empty bins occur much more frequently.\n\n6 Experimental Results on the News20 Dataset\nThe news20 dataset (with 20,000 samples and 1,355,191 features) is a very small dataset in not-too-high dimensions. The average number of nonzeros per feature vector is about 500, which is also small. 
Therefore, this is more like a contrived example, and we use it just to verify that our one permutation scheme (with the zero coding strategy) still works very well even when we let k be as large as 4096 (i.e., most of the bins are empty). In fact, the one permutation scheme achieves noticeably better accuracies than the original k-permutation scheme. We believe this is because the one permutation scheme is “sampling-without-replacement” and provides a better matrix sparsification strategy, without “contaminating” the original data matrix too much.\nWe experiment with k ∈ {2^5, 2^6, 2^7, 2^8, 2^9, 2^10, 2^11, 2^12} and b ∈ {1, 2, 4, 6, 8}, for both the one permutation scheme and the k-permutation scheme. We use 10,000 samples for training and the other 10,000 samples for testing. 
For convenience, we let D = 2^21 (which is larger than 1,355,191).\nFigure 5 and Figure 6 present the test accuracies for linear SVM and logistic regression, respectively. When k is small (e.g., k ≤ 64), both the one permutation scheme and the original k-permutation scheme perform similarly. For larger k values (especially k ≥ 256), however, our one permutation scheme noticeably outperforms the k-permutation scheme. Using the original data, the test accuracies are about 98%. Our one permutation scheme with k ≥ 512 and b = 8 essentially achieves the original test accuracies, while the k-permutation scheme could only reach about 97.5%.\n\nFigure 5: Test accuracies of linear SVM averaged over 100 repetitions. The one permutation scheme noticeably outperforms the original k-permutation scheme, especially when k is not small.\n\nFigure 6: Test accuracies of logistic regression averaged over 100 repetitions. The one permutation scheme noticeably outperforms the original k-permutation scheme, especially when k is not small.\n\n7 Conclusion\nA new hashing algorithm is developed for large-scale search and learning in massive binary data. Compared with the original k-permutation (e.g., k = 500) minwise hashing (which is a standard procedure in the context of search), our method requires only one permutation and can achieve similar or even better accuracies at merely 1/k of the original preprocessing cost. We expect that one permutation hashing (or a variant) will be adopted in practice. See more details in arXiv:1208.1259.\n\nAcknowledgement: The research of Ping Li is partially supported by NSF-IIS-1249316, NSF-DMS-0808864, NSF-SES-1131848, and ONR-YIP-N000140910911. The research of Art B Owen is partially supported by NSF-0906056. 
The research of Cun-Hui Zhang is partially supported by NSF-DMS-0906420, NSF-DMS-1106753, NSF-DMS-1209014, and NSA-H98230-11-1-0205.

[Figure panels: test accuracies of linear SVM and logistic regression on News20, for k from 32 to 4096 and b ∈ {1, 2, 4, 6, 8}, comparing the one permutation and k-permutation schemes against the original data.]