{"title": "Fully Understanding The Hashing Trick", "book": "Advances in Neural Information Processing Systems", "page_first": 5389, "page_last": 5399, "abstract": "Feature hashing, also known as {\\em the hashing trick}, introduced by Weinberger et al. (2009), is one of the key techniques used in scaling-up machine learning algorithms. Loosely speaking, feature hashing uses a random sparse projection matrix $A : \\mathbb{R}^n \\to \\mathbb{R}^m$ (where $m \\ll n$) in order to reduce the dimension of the data from $n$ to $m$ while approximately preserving the Euclidean norm. Every column of $A$ contains exactly one non-zero entry, equals to either $-1$ or $1$.\n\nWeinberger et al. showed tail bounds on $\\|Ax\\|_2^2$. Specifically they showed that for every $\\varepsilon, \\delta$, if $\\|x\\|_{\\infty} / \\|x\\|_2$ is sufficiently small, and $m$ is sufficiently large, then \n\\begin{equation*}\\Pr[ \\; | \\;\\|Ax\\|_2^2 - \\|x\\|_2^2\\; | < \\varepsilon \\|x\\|_2^2 \\;] \\ge 1 - \\delta \\;.\\end{equation*}\nThese bounds were later extended by Dasgupta et al. (2010) and most recently refined by Dahlgaard et al. (2017), however, the true nature of the performance of this key technique, and specifically the correct tradeoff between the pivotal parameters $\\|x\\|_{\\infty} / \\|x\\|_2, m, \\varepsilon, \\delta$ remained an open question.\n\nWe settle this question by giving tight asymptotic bounds on the exact tradeoff between the central parameters, thus providing a complete understanding of the performance of feature hashing. 
We complement the asymptotic bound with empirical data, which shows that the constants \"hiding\" in the asymptotic notation are, in fact, very close to $1$, thus further illustrating the tightness of the presented bounds in practice.", "full_text": "Fully Understanding The Hashing Trick

Casper Freksen∗
Department of Computer Science
Aarhus University, Denmark
cfreksen@cs.au.dk

Lior Kamma∗
Department of Computer Science
Aarhus University, Denmark
lior.kamma@cs.au.dk

Kasper Green Larsen∗
Department of Computer Science
Aarhus University, Denmark
larsen@cs.au.dk

Abstract

Feature hashing, also known as the hashing trick, introduced by Weinberger et al. (2009), is one of the key techniques used in scaling-up machine learning algorithms. Loosely speaking, feature hashing uses a random sparse projection matrix A : Rⁿ → Rᵐ (where m ≪ n) in order to reduce the dimension of the data from n to m while approximately preserving the Euclidean norm. Every column of A contains exactly one non-zero entry, equal to either −1 or 1.
Weinberger et al. showed tail bounds on ‖Ax‖₂². Specifically they showed that for every ε, δ, if ‖x‖∞/‖x‖₂ is sufficiently small, and m is sufficiently large, then

Pr[ | ‖Ax‖₂² − ‖x‖₂² | < ε‖x‖₂² ] ≥ 1 − δ .

These bounds were later extended by Dasgupta et al. (2010) and most recently refined by Dahlgaard et al.
(2017); however, the true nature of the performance of this key technique, and specifically the correct tradeoff between the pivotal parameters ‖x‖∞/‖x‖₂, m, ε, δ, remained an open question.
We settle this question by giving tight asymptotic bounds on the exact tradeoff between the central parameters, thus providing a complete understanding of the performance of feature hashing. We complement the asymptotic bound with empirical data, which shows that the constants “hiding” in the asymptotic notation are, in fact, very close to 1, thus further illustrating the tightness of the presented bounds in practice.

1 Introduction

Dimensionality reduction that approximately preserves Euclidean distances is a key tool used as a preprocessing step in many geometric, algebraic and classification algorithms, whose performance heavily depends on the dimension of the input. Loosely speaking, a distance-preserving dimensionality reduction is an (often random) embedding of a high-dimensional Euclidean space into a space of low dimension, such that pairwise distances are approximately preserved (with high probability). Its applications range over nearest neighbor search [AC09, HIM12, AIL+15], classification and regression [RR08, MM09, PBMID14], manifold learning [HWB08], sparse recovery [CT06] and numerical linear algebra [CW09, MM13, Sár06]. For more applications see, e.g., [Vem05].
One of the most fundamental results in the field was presented in the seminal paper by Johnson and Lindenstrauss [JL84].

∗All authors contributed equally, and are presented in alphabetical order.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Lemma 1 (Distributional JL Lemma).
For every n ∈ N and ε, δ ∈ (0, 1), there exists a random m × n projection matrix A, where m = Θ(ε⁻² lg(1/δ)), such that for every x ∈ Rⁿ

Pr[ | ‖Ax‖₂² − ‖x‖₂² | < ε‖x‖₂² ] ≥ 1 − δ .    (1)

The target dimension m in the lemma is known to be optimal [JW13, LN17].

Running Time Performances. Perhaps the most common proof of the lemma (see, e.g., [DG03, Mat08]) samples a projection matrix by independently sampling each entry from a standard Gaussian (or Rademacher) distribution. Such matrices are by nature very dense, and thus a naïve embedding runs in O(m‖x‖₀) time, where ‖x‖₀ is the number of non-zero entries of x.
Due to the algorithmic significance of the lemma, much effort was invested in finding techniques to accelerate the embedding time. One fruitful approach for accomplishing this goal is to consider a distribution over sparse projection matrices. This line of work was initiated by Achlioptas [Ach03], who constructed a distribution over matrices in which the expected fraction of non-zero entries is at most one third, while maintaining the target dimension. The best result to date in constructing a sparse Johnson-Lindenstrauss matrix is due to Kane and Nelson [KN14], who presented a distribution over matrices satisfying (1) in which every column has at most s = O(ε⁻¹ lg(1/δ)) non-zero entries. Conversely, Nelson and Nguyễn [NN13] showed that this is almost asymptotically optimal.
That is, every distribution over n × m matrices satisfying (1) with m = Θ(ε⁻² lg(1/δ)), and such that every column has at most s non-zero entries, must satisfy s = Ω((ε lg(1/ε))⁻¹ lg(1/δ)).
While the bound presented by Nelson and Nguyễn is theoretically tight, we can provably still do much better in practice. Specifically, the lower bound is attained on vectors x ∈ Rⁿ for which, loosely speaking, the “mass” of x is concentrated in few entries; formally, the ratio ‖x‖∞/‖x‖₂ is large. However, in practical scenarios, such as the term frequency–inverse document frequency representation of a document, we may often assume that the mass of x is “well-distributed” over many entries (that is, ‖x‖∞/‖x‖₂ is small). In these common scenarios, projection matrices which are significantly sparser turn out to be very effective.

Feature Hashing. In the pursuit of sparse projection matrices, Weinberger et al. [WDL+09] introduced dimensionality reduction via feature hashing, in which the projection matrix A is, in a sense, as sparse as possible. That is, every column of A contains exactly one non-zero entry, randomly chosen from {−1, 1}. This technique is one of the most influential mathematical tools in the study of scaling-up machine learning algorithms, mainly due to its simplicity and good performance in practice [Dal13, Sut15]. More formally, for n, m ∈ N₊, the projection matrix A is sampled as follows. Sample h ∈_R [n] → [m] and σ = ⟨σⱼ⟩_{j∈[n]} ∈_R {−1, 1}ⁿ independently. For every i ∈ [m], j ∈ [n], let a_ij = a_ij(h, σ) := σⱼ · 1_{h(j)=i} (that is, a_ij = σⱼ iff h(j) = i, and 0 otherwise).
Weinberger et al.
additionally showed exponential tail bounds on ‖Ax‖₂² when the ratio ‖x‖∞/‖x‖₂ is sufficiently small and m is sufficiently large. These bounds were later improved by Dasgupta et al. [DKS10], and most recently Dahlgaard, Knudsen and Thorup [DKT17a] further improved these concentration bounds. Conversely, a result by Kane and Nelson [KN14] implies that if we allow ‖x‖∞/‖x‖₂ to be too large, then there exist vectors for which (1) does not hold.
Finding the correct tradeoff between ‖x‖∞/‖x‖₂ and m, ε, δ for which feature hashing performs well remained an open problem. Our main contribution is settling this problem, and providing a complete and comprehensive understanding of the performance of feature hashing.

1.1 Main results

The main result of this paper is a tight tradeoff between the target dimension m, the approximation ratio ε, the error probability δ and the ratio ‖x‖∞/‖x‖₂. More formally, let ε, δ > 0 and m ∈ N₊. Let ν(m, ε, δ) be the maximum ν ∈ [0, 1] such that for every x ∈ Rⁿ, if ‖x‖∞ ≤ ν‖x‖₂ then (1) holds. Our main result is the following theorem, which gives tight asymptotic bounds for the performance of feature hashing, thus closing the long-standing gap.

Theorem 2. There exist constants C ≥ D > 0 such that for every ε, δ ∈ (0, 1) and m ∈ N₊ the following holds.
If (C lg(1/δ))/ε² ≤ m < 2/(ε²δ), then

ν(m, ε, δ) = Θ( √ε · min{ lg(εm / lg(1/δ)) / lg(1/δ) , √( lg(ε²m / lg(1/δ)) / lg(1/δ) ) } ) .

Otherwise, if m ≥ 2/(ε²δ) then ν(m, ε, δ) = 1. Moreover, if m < (D lg(1/δ))/ε² then ν(m, ε, δ) = 0.
While the bound presented in the theorem may seem surprising at first, due to the intricacy of the expressions involved, the tightness of the result shows that this is, in fact, the correct and “true” bound. Moreover, the proof of the theorem demonstrates how both branches in the min expression are required in order to give a tight bound.

Experimental Results. Our theoretical bounds are accompanied by empirical results that shed light on the nature of the constants in Theorem 2. Our empirical results show that in practice the constants inside the Θ-notation are significantly tighter than the theoretical proof might suggest, and in fact feature hashing performs well for a larger scope of vectors. Specifically, for a synthetic set of generated bit-vectors, we show that whenever 4 lg(1/δ)/ε² ≤ m < 2/(ε²δ), the constant hidden by the Θ-notation is at least 0.725 (except for very sparse vectors, i.e. ‖x‖₀ ≤ 7).
That is,

ν(m, ε, δ) ≥ 0.725 √ε · min{ lg(εm / lg(1/δ)) / lg(1/δ) , √( lg(ε²m / lg(1/δ)) / lg(1/δ) ) } .

For a bag-of-words representation of 1500 NIPS papers with stopwords removed [DKT17b, New08], our experiments show that the constant is even larger, whereas the theoretical proof provides a much smaller constant of 2⁻⁶ in front of √ε. Since feature hashing satisfies (1) whenever ‖x‖∞ ≤ ν(m, ε, δ)‖x‖₂, this implies that feature hashing works with a better constant than the theory suggests.

Proof Technique. As a fundamental step in the proof of Theorem 2 we prove tight asymptotic bounds for high-order norms of the approximation factor.² More formally, for every x ∈ Rⁿ \ {0} let X(x) = |‖Ax‖₂² − ‖x‖₂²|. The technical crux of our results is tight bounds on high-order moments of X(x). Note that by rescaling we may restrict our focus without loss of generality to unit vectors.

Notation 1. For every m, r, k > 0 denote

Λ(m, r, k) =
  √(r/m)                                     if k ≥ mr ;
  max{ √(r/m) , r² / (k ln²(emr/k)) }        if mr > k ≥ √(mr) ;
  max{ √(r/m) , r / (k ln(emr/k²)) }         if √(mr) > k .

In these notations our main technical lemmas are the following.

Lemma 3. For every even r ≤ m/4 and every unit vector x ∈ Rⁿ, ‖X(x)‖_r = O(Λ(m, r, ‖x‖∞⁻²)).

Lemma 4.
For every k ≤ n and even r ≤ min{m/4, k}, ‖X(x^(k))‖_r = Ω(Λ(m, r, k)), where x^(k) ∈ Rⁿ is the unit vector whose first k entries equal 1/√k.
While it might seem at a glance that bounding the high-order moments of X(x) is merely a technical issue, known tools and techniques could not be used to prove Lemmas 3 and 4. Particularly, earlier work by Kane and Nelson [KN14, CJN18] and Freksen and Larsen [FL17] used high-order moment bounds as a step in proving probability tail bounds of random variables. The existing techniques, however, cannot be adapted to bound high-order moments of X(x) (see also Section 1.2), and novel approaches were needed. Specifically, our proof incorporates a novel combinatorial scheme for counting edge-labeled Eulerian graphs.

²Given a random variable X and r > 0, the rth norm of X (if it exists) is defined as ‖X‖_r := (E[|X|^r])^{1/r}.

Previous Results. Weinberger et al. [WDL+09] showed that whenever m = Ω(ε⁻² lg(1/δ)), then ν(m, ε, δ) = Ω(ε · (lg(1/δ) lg(m/δ))^{−1/2}). Dasgupta et al. [DKS10] showed that under similar conditions ν(m, ε, δ) = Ω(√ε · (lg(1/δ) lg²(m/δ))^{−1/2}). These bounds were recently improved by Dahlgaard et al. [DKT17a], who showed that ν(m, ε, δ) = Ω( √ε · √( lg(1/ε) / (lg(1/δ) lg(m/δ)) ) ). Conversely, Kane and Nelson [KN14] showed that for the restricted case of m = Θ(ε⁻² lg(1/δ)), ν(m, ε, δ) = O( √ε · lg(1/ε) / lg(1/δ) ), which matches the bound in Theorem 2 if, in addition, lg(1/ε) ≤ √(lg(1/δ)).

1.2 Related Work

The CountSketch scheme, presented by Charikar et al. [CCF04], was shown to satisfy (1) by Thorup and Zhang [TZ12].
The scheme essentially samples O(lg(1/δ)) independent copies of a feature hashing matrix with m = O(ε⁻²) rows, and applies them all to x. The estimator for ‖x‖₂² is then given by computing the median norm over all projected vectors. The CountSketch scheme thus constructs a sketching matrix A such that every column has O(lg(1/δ)) non-zero entries. However, this construction does not provide a norm-preserving embedding into a Euclidean space (that is, the estimator of ‖x‖₂² cannot be represented as a norm of Ax), which is essential for some applications such as nearest-neighbor search [HIM12].
Kane and Nelson [KN14] presented a simple construction for the so-called sparse Johnson-Lindenstrauss transform. This is a distribution over m × n matrices, for m = Θ(ε⁻² lg(1/δ)), where every column has s non-zero entries, randomly chosen from {−1, 1}. Note that if s = 1, this distribution yields the feature hashing one. Kane and Nelson showed that for s = Θ(εm) this construction satisfies (1). Recently, Cohen et al. [CJN18] presented two simple proofs for this result. While their proof methods give (simple) bounds for high-order moments similar to those in Lemmas 3 and 4, they rely heavily on the fact that s is relatively large. Specifically, for s = 1 the bounds that their method (or an extension thereof) gives are trivial.

2 Bounding ν(m, ε, δ)

In this section we prove the principal part of Theorem 2, assuming Lemmas 3 and 4, whose proofs are deferred to the full version of the paper. Formally, we prove the following.

Theorem 5 (Main Part of Theorem 2).
There exists a constant Ĉ > 0 such that for every ε, δ ∈ (0, 1) and m ∈ N₊, if (Ĉ lg(1/δ))/ε² ≤ m < 2/(ε²δ) then

ν(m, ε, δ) = Θ( √ε · min{ lg(εm / lg(1/δ)) / lg(1/δ) , √( lg(ε²m / lg(1/δ)) / lg(1/δ) ) } ) .

Fix ε, δ ∈ (0, 1) and an integer m. From Lemmas 3 and 4 there exist C₁, C₂ > 0 such that for every r, k, if r ≤ m/4 then for every unit vector x with ‖x‖∞ ≤ 1/√k, ‖X(x)‖_r ≤ 2^{C₂} Λ(m, r, k). Moreover, if r ≤ k then

2^{−C₁} Λ(m, r, k) ≤ ‖X(x^(k))‖_r ≤ 2^{C₂} Λ(m, r, k) .

Note that in addition Λ(m, 2r, k) ≤ 4Λ(m, r, k). Denote Ĉ = 2^{2C₂+2} and C = 2C₁ + 2C₂ + 5. For the rest of the proof we assume that (Ĉ lg(1/δ))/ε² ≤ m < 2/(ε²δ), and we start by proving a lower bound on ν.

Lemma 6. ν(m, ε, δ) = Ω( min{ (√ε / lg(1/δ)) · lg(εm / lg(1/δ)) , √( ε lg(ε²m / lg(1/δ)) / lg(1/δ) ) } ).

Proof. Let r = lg(1/δ), and let x ∈ Rⁿ be a unit vector such that

‖x‖∞ ≤ min{ √(ε / (2^{C₂}e)) · ln(eεm/r) / r , √( ε lg(eε²m/r) / (2^{C₂}er) ) } ,

and let k := 1/‖x‖∞², so that k ≥ max{ 2^{C₂}er² / (ε ln²(eεm/r)) , 2^{C₂}er / (ε lg(eε²m/r)) }.
If k ≤ mr, then since r²/(k ln²(emr/k)) is convex as a function of k over the interval [2^{C₂}er²/(ε ln²(eεm/r)), mr], it attains its maximum at an endpoint of the interval, and a direct computation of both endpoint values shows that

2^{C₂} · r² / (k ln²(emr/k)) < ε/2 .    (2)

Moreover, if k ≤ √(mr), then since r/(k ln(emr/k²)) is convex as a function of k over [2^{C₂}er/(ε lg(eε²m/r)), √(mr)], the same endpoint argument gives

2^{C₂} · r / (k ln(emr/k²)) ≤ ε/2 .

Since clearly √(2^{2C₂}r/m) ≤ ε/2, by Lemma 3 we have ‖X(x)‖_r ≤ 2^{C₂} Λ(m, r, k) ≤ ε/2, and therefore E[(X(x))^r] ≤ (ε/2)^r. Thus

Pr[ |‖Ax‖₂² − 1| > ε ] = Pr[ (X(x))^r > ε^r ] ≤ 2^{−r} = δ .

Hence ν(m, ε, δ) ≥ min{ √(ε/(2^{C₂}e)) · ln(eεm/r)/r , √( ε lg(eε²m/r) / (2^{C₂}er) ) } = Ω( min{ (√ε / lg(1/δ)) · lg(εm / lg(1/δ)) , √( ε lg(ε²m / lg(1/δ)) / lg(1/δ) ) } ).

Lemma 7.
ν(m, ε, δ) = O( min{ (√ε / lg(1/δ)) · lg(εm / lg(1/δ)) , √( ε lg(ε²m / lg(1/δ)) / lg(1/δ) ) } ).

Proof. To this end, let r = (1/C) lg(1/δ), and denote

t = min{ (√(eε)/r) · ln(eεm/r) , √( (eε/r) · ln(eε²m/r) ) } .

Assume first that t ≤ 1/√r, and let k = 1/t². We will show that E[(X(x^(k)))^r] ≥ (2ε)^r. Since t ≤ 1/√r, then k ≥ r. If t = √((eε/r) ln(eε²m/r)), then k = r/(eε ln(eε²m/r)). Since eε²m/r > e, then k ≤ mr. Moreover, since ε²m/r > 1, then k ≤ r/ε ≤ √(mr). Therefore

E[(X(x^(k)))^r] = ‖X(x^(k))‖_r^r ≥ ( r / (k ln(emr/k²)) )^r = ( eε ln(eε²m/r) / ln(e³(ε²m/r) ln²(eε²m/r)) )^r ≥ (2ε)^r .

Otherwise, t = (√(eε)/r) · ln(eεm/r) and k = r²/(eε ln²(eεm/r)).
Since eεm/r > e, then k ≤ mr, and therefore

E[(X(x^(k)))^r] = ‖X(x^(k))‖_r^r ≥ ( r² / (k ln²(emr/k)) )^r = ( eε ln²(eεm/r) / ln²(e²ε(m/r) ln²(eεm/r)) )^r ≥ (2ε)^r .

Applying the Paley-Zygmund inequality we get that

Pr[ |‖Ax^(k)‖₂² − 1| > ε ] ≥ Pr[ (X(x^(k)))^r > 2^{−r} E[(X(x^(k)))^r] ] ≥ (1/4) · ( 2^{−C₁} Λ(m, r, k) / (2^{C₂} Λ(m, 2r, k)) )^{2r} ≥ δ .

Therefore ν(m, ε, δ) ≤ ‖x^(k)‖∞ = t.
Assume next that 1/√r < t < √(ε/4), and note that since (√(eε)/r) · ln(eεm/r) > 1/√r, then m > e^{√r}/(eε), and since √((eε/r) ln(eε²m/r)) ≥ 1/√r, then m > e^{1/(eε)}. Let k = 1/t², and consider independent h ∈_R [n] → [m] and σ = (σ₁, . . . , σₙ) ∈_R {−1, 1}ⁿ. Let y ∈ Rⁿ be defined as follows. For every j ∈ [n], y_j = x^(k)_j if and only if h(j) = 1, and y_j = 0 otherwise. Denote z = x^(k) − y. Then ‖x^(k)‖₂² = ‖y‖₂² + ‖z‖₂²,
Note that if\n2\u221a\u03b5k \u2264\n\n\u221a\n, then since m > max{e1/e\u03b5, e\nlg 1\n\u03b4\n\n2(cid:12)(cid:12) < \u03b5(cid:107)z(cid:107)2\n\nC\u221ae ln e\u03b5m\n\nr} we get\n\nlg 1\n\u03b4\n2 ln m\n\ne\u03b5 ln2 e\u03b5m\n\nlg 1\n\u03b4\n\nlg 1\n\u03b4\n\nk =\n\nr2\n\nC\u221ae(ln em \u2212 ln 1\n\n\u03b5 \u2212 ln r) \u2264\n\nC\u221ae(ln m \u2212 3 ln ln m) \u2264\n\nr \u2264\n\n,\n\nr\n\nand otherwise, k =\n\nr\n\n, and\n\ne\u03b5 ln e\u03b52m\n\nr\n\nlg 1\n\u03b4\n\neC(ln em \u2212 4 ln ln m) \u2264\n\u221a\n\nlg 1\n\u03b4\n2 ln m\n\n.\n\n2\u221a\u03b5k \u2264 \u03b5k =\n\nlg 1\n\u03b4\n\neC ln e\u03b52m\n\nr\n\n=\n\nTherefore for small enough \u03b5,\n\nPr[Ef irst] =(cid:18) k\n2\u221a\u03b5k(cid:19) \u00b7(cid:18) 1\nm(cid:19)2\n\u00b7(cid:18)1 \u2212\n\u2265(cid:18) 1\nm(cid:19)2\n\n\u221a\n\n\u03b5k\n\nlg 1\n\u03b4\n\n1\n\n\u03b5k\n\n\u03b5k\n\n\u221a\n\n\u221a\n\n\u03b5 \u2212 ln r) \u2264\neC(ln em \u2212 2 ln 1\nm(cid:19)k\u22122\n\u00b7(cid:18)1 \u2212\n\u2212r \u2265(cid:18) 1\nm(cid:19)r\nm(cid:19)2\nk \u2212 2\u221a\u03b5k\n\n\u00b7 2\n\n\u00b7 2\n\n\u22122\n\n\u221a\n\n\u03b5k\n\n1\n\n\u03b5k\n\n\u2212 2\n\nC lg 1\n\nk\n\n\u00b7 2\n\n4\u03b5k\nk\n\n2 = (cid:107)Ay(cid:107)2\n\n2+(cid:107)Az(cid:107)2 \u2265\n\n+(1\u2212\u03b5)(cid:107)z(cid:107)2\n\n2 = 4\u03b5+(1\u2212\u03b5)\u00b7\n\n4. Since(cid:113) e\u03b5 ln e\u03b52m\n\n\u03b4 \u2265 \u03b43/4 .\nThus for small enough \u03b4, Pr[Ef irst \u2227 Erest] \u2265 \u03b4. Conditioned on Ef irst \u2227 Erest we get that\n\u2265 4\u03b5+(1\u2212\u03b5)2 > 1+\u03b5 ,\n(cid:107)Ax(k)(cid:107)2\n\u03b5 . Therefore \u03bd(m, \u03b5, \u03b4) \u2264 (cid:107)x(k)(cid:107)\u221e = t.\nwhere the inequality before last is due to the fact that k \u2265 4\n\u2265 t >(cid:112) \u03b5\nFinally, assume t >(cid:112) \u03b5\ne\u03b52\u03b41/(4eC) .\n\u03b5 . Consider independent h \u2208R [n] \u2192 [m], and \u03c3 = (\u03c31, . . . , \u03c3m) \u2208R {\u22121, 1}m, and let\nLet k = 2\nA = A(h, \u03c3). 
Let E_col denote the event that there are j ≠ ℓ ∈ [k] such that for every p ≠ q ∈ [k], h(p) = h(q) if and only if {p, q} = {j, ℓ}. Then for small enough ε, δ,

Pr[E_col] = (k choose 2) · (1/m) · ∏_{j∈[k−1]} (1 − j/m) ≥ 2δ · (1 − ε/2) · (1 − 4e δ^{1/(4Ce)}) ≥ δ .

Conditioned on E_col we get that |‖Ax^(k)‖₂² − 1| = 2/k = ε. Therefore ν(m, ε, δ) ≤ ‖x^(k)‖∞ = √(ε/2) = O(t).
This completes the proof of Lemma 7, and thus of Theorem 5.

3 Empirical Analysis

We complement our theoretical bounds with experimental results on both real and synthetic data. The goal of the experiments is to give bounds on some of the constants hidden in the main theorem. Our synthetic-data experiments show that for 4 lg(1/δ)/ε² ≤ m < 2/(ε²δ) the constant inside the Θ-notation in Theorem 2 is at least 0.725, except for very sparse vectors (‖x‖₀ ≤ 7), where the constant is at least 0.6. Furthermore, we confirm that ν(m, ε, δ) = 1 when m ≥ 2/(ε²δ), and that there exist data points where ν(m, ε, δ) < 1 while m = 2^{−γ}/(ε²δ), for some small γ. In addition, for the real-world data we tested feature hashing on, the constant is at least 1.1 or 0.8, depending on the data set.

3.1 Experiment Setup and Analysis

To arrive at the results, we ran experiments and analysed the data in several phases.
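Concretely, the feature-hashing step at the heart of these phases — sample h and σ, then scatter-add signed coordinates into buckets — takes only a few lines. The following is a minimal NumPy sketch for illustration (the name `feature_hash` is ours); the actual experiments used double tabulation hashing, as described in Section 3.3:

```python
import numpy as np

def feature_hash(x, m, rng=None):
    """One draw of the feature-hashing projection A = A(h, sigma):
    (Ax)_i = sum of sigma_j * x_j over all j with h(j) = i."""
    rng = np.random.default_rng(rng)
    n = len(x)
    h = rng.integers(0, m, size=n)            # h : [n] -> [m], bucket per coordinate
    sigma = rng.choice([-1.0, 1.0], size=n)   # independent random signs
    y = np.zeros(m)
    np.add.at(y, h, sigma * x)                # each column of A has one non-zero
    return y

# A bit-vector scaled to unit norm, with ||x||_2 = sqrt(k) * ||x||_inf.
k, m = 4096, 1024
x = np.ones(k) / np.sqrt(k)
y = feature_hash(x, m, rng=0)
distortion = abs(np.dot(y, y) - 1.0)          # | ||Ax||_2^2 - ||x||_2^2 |
```

Because ‖x‖∞/‖x‖₂ = 1/√k is small here, the measured distortion should concentrate, in line with Theorem 2; making x much sparser (small k) makes large distortions far more likely.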
In the first phase we varied the target dimension m over exponentially spaced values in the range [2⁴, 2¹⁴], and a parameter k which controls the ratio between the ℓ∞ and the ℓ₂ norm. The values of k varied over exponentially spaced values in the range [2¹, 2¹³]. Then for all m and k, we generated 2²⁴ vectors x with entries in {0, 1} such that ‖x‖₂ = √k ‖x‖∞, and for any given m and k the supports of the vectors were pairwise disjoint. We then hashed the generated vectors using feature hashing, and recorded the ℓ₂ norm of the embedded vectors.

Figure 1: The plot shows the ratio between ν̂ values and the theoretical bound, abbreviated here as min{left, right}. This ratio corresponds to the constant in the Θ-notation in Theorem 2. The points are marked with blue circles if left < right, and with green ×'s otherwise. The horizontal line at 0.725 is there to ease comparisons with Figure 2, while the line at 1.1 helps in comparing against real world data (Figure 4 and Figure 5).

The second phase then calculated the distortion between the original and the embedded vectors, and computed the error probability δ̂. Loosely speaking, δ̂(m, k, ε) is the fraction of the 2²⁴ vectors for a given m and k that have distortion greater than ε. Formally, δ̂ is calculated using the following formula

δ̂(m, k, ε) = |{ x : ‖x‖₂ = √k ‖x‖∞ , |‖A_m x‖₂² − ‖x‖₂²| ≥ ε‖x‖₂² }| / |{ x : ‖x‖₂ = √k ‖x‖∞ }| ,

where ε was varied over exponentially spaced values in the range [2⁻¹⁰, 2⁻¹].
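A scaled-down sketch of these two phases (generate bit-vectors, hash, measure the failure rate) might look as follows; the helper `empirical_delta` and the small trial count are ours for illustration, far below the 2²⁴ vectors used in the actual experiments:

```python
import numpy as np

def empirical_delta(m, k, eps, trials, rng=None):
    """Estimate delta-hat(m, k, eps): the fraction of hashed k-sparse
    bit-vectors (so ||x||_2 = sqrt(k) * ||x||_inf) whose squared norm
    is distorted by at least eps."""
    rng = np.random.default_rng(rng)
    failures = 0
    for _ in range(trials):
        h = rng.integers(0, m, size=k)           # buckets of the k support coordinates
        sigma = rng.choice([-1.0, 1.0], size=k)  # their random signs
        y = np.zeros(m)
        np.add.at(y, h, sigma)                   # hash the bit-vector
        if abs(np.dot(y, y) - k) >= eps * k:     # here ||x||_2^2 = k
            failures += 1
    return failures / trials

dhat = empirical_delta(m=1024, k=256, eps=0.5, trials=200, rng=0)
```

Sweeping m, k, and ε over exponentially spaced grids reproduces a small version of the (m, k, ε, δ̂) tables described above.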
Note that δ̂ tends to the true error probability as the number of vectors tends to infinity. Computing δ̂ yielded a series of 4-tuples (m, k, ε, δ̂), which can be interpreted as: given target dimension m, ℓ∞/ℓ₂ ratio 1/√k, and distortion ε, we have measured that the failure probability is at most δ̂.
In the third phase, we varied δ over exponentially spaced values in the range [2⁻²⁰, 2⁰], and calculated a value ν̂. Intuitively, ν̂(m, ε, δ) is the largest ℓ∞/ℓ₂ ratio such that for all vectors having at most this ℓ∞/ℓ₂ ratio the measured error probability δ̂ is at most δ. Formally,

ν̂(m, ε, δ) = max{ 1/√k : ∀k' ≥ k, δ̂(m, k', ε) ≤ δ } .

Note once more that ν̂ tends to the true ν value as the number of vectors tends to infinity. To find a bound on the constant of the Θ-notation in Theorem 2, we truncated data points that did not satisfy 4 lg(1/δ)/ε² ≤ m < 2/(ε²δ), and for the remaining points we plotted ν̂ over the theoretical bound in Figure 1. From this plot we conclude that the constant is at least 0.6 on the large range of parameters we tested. However, the smallest values seem to be outliers and come from a combination of very sparse vectors (k = 7) and high target dimension (m = 2¹⁴). For the rest of the data points the constant is at least 0.725. While there are data points where the constant is larger (i.e. feature hashing performs better), there are data points close to 0.725 over the entire range of ε and δ.
In Figure 2 we show that we indeed need both terms in the minimum in Theorem 2, by plotting the measured ν̂ values over both terms in the minimum in the theoretical bound separately.
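The third-phase computation of ν̂ from a table of measured failure rates is a simple scan; `empirical_nu` and the table below are hypothetical illustrations, not measured values from the paper:

```python
import math

def empirical_nu(delta_hat, delta):
    """nu-hat = max{ 1/sqrt(k) : delta_hat[k'] <= delta for all measured
    k' >= k }, given a dict mapping k to a measured failure rate."""
    ks = sorted(delta_hat)
    best = 0.0
    for k in ks:
        if all(delta_hat[kp] <= delta for kp in ks if kp >= k):
            best = max(best, 1.0 / math.sqrt(k))
    return best

# Hypothetical failure rates for one fixed (m, eps): sparser vectors
# (small k, i.e. large l_inf/l_2 ratio) fail more often.
table = {2: 0.30, 8: 0.20, 32: 0.01, 128: 0.005, 512: 0.001}
nu_hat = empirical_nu(table, delta=0.05)   # largest passing ratio is 1/sqrt(32)
```

Raising δ admits sparser vectors (a larger ν̂), while lowering it below the best measured failure rate drives ν̂ to 0, mirroring the tradeoff in Theorem 2.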
For both terms there are points whose value is significantly below 0.725.
To find a bound on m where ν̂(m, ε, δ) = 1, we took the untruncated data and recorded the maximal δ̂ for each m and ε. We then plotted mε²δ̂ in Figure 3. From Figure 3 it is clear that ν̂(m, ε, δ) = 1 when m ≥ 2/(ε²δ). Furthermore, the figure also shows that there are data points where ν̂(m, ε, δ) < 1 while m = 2^{−γ}/(ε²δ), for some small γ. Therefore we conclude the bound m ≥ 2/(ε²δ) is tight.

Figure 2: This plot shows the measured ν̂ values over each of the two terms in the minimum in the theoretical bound (abbreviated here): min{left, right}. In the left subfigure the y-axis of the blue circles is ν̂/left, while the y-axis of the green ×'s in the right subfigure is ν̂/right. Note that the x-axis (values of lg(1/δ)) is the same in both subfigures, and the same as in the right subfigure of Figure 1. As in Figure 1, the horizontal line at 0.725 is there to ease comparison between the figures.

Figure 3: This plot shows the constant where ν̂(m, ε, δ) becomes 1. The theory states that if 2 ≤ mε²δ then ν̂(m, ε, δ) = 1. The distinct curves in the left plot correspond to distinct values of m.

3.2 Real-World Data

We also ran experiments on real-world data, namely bag-of-words representations of 1500 NIPS papers with stopwords removed [DKT17b, New08].
We ran experiments on this data set both with and without preprocessing with the common logarithmic term frequency-inverse document frequency (tf-idf) weighting. These experiments were executed and analysed similarly to the synthetic experiments described above, except for a few changes. First, in order to explore any meaningful $\delta$ values, we hashed each vector $2^{20}$ times. In this way, iterating over the original vectors in the real-world experiments plays a similar role to iterating over the $k$ values in the synthetic experiments. Secondly, in these experiments $m$ ranged over values in $[2^{4}, 2^{12}]$.

The results of these experiments can be seen in Figure 4 and Figure 5, from which we conclude that feature hashing performs even better on the real-world data we tested than on the synthetic data, as the constant of Theorem 2 is always above $1.1$ and $0.8$ with and without tf-idf, respectively. Furthermore, the vast majority of data points have a constant around or above $1.2$.

Figure 4: This plot has a similar structure to Figure 1, but is based on the NIPS dataset preprocessed with tf-idf. This plot shows the measured $\hat{\nu}$ values over the theoretical bound (abbreviated here): $\min\{\text{left}, \text{right}\}$. This ratio corresponds to the constant in the $\Theta$-notation in Theorem 2. The points are marked with blue circles if $\text{left} < \text{right}$; otherwise they are marked with green $\times$'s. The horizontal lines at $0.725$ and $1.1$ are there to ease comparisons with Figure 1.

Figure 5: This plot has a similar structure to Figure 1, but is based on the NIPS dataset without tf-idf preprocessing.
This plot shows the measured $\hat{\nu}$ values over the theoretical bound (abbreviated here): $\min\{\text{left}, \text{right}\}$. This ratio corresponds to the constant in the $\Theta$-notation in Theorem 2. The points are marked with blue circles if $\text{left} < \text{right}$; otherwise they are marked with green $\times$'s. The horizontal lines at $0.725$ and $1.1$ are there to ease comparisons with Figure 1.

3.3 Implementation Details

As random number generators, we used degree-20 polynomials modulo the Mersenne prime $2^{61} - 1$, where the coefficients were random data from random.org. The random data was independent between experiments with different values of $m$, between synthetic and real-world experiments, and between the random number generators used for vector generation and hashing.

Feature hashing was done using double tabulation hashing [Tho14] on 64-bit numbers. The tables in our implementation of double tabulation hashing were filled with numbers from the aforementioned random number generator. Double tabulation hashing has been proven to behave fully randomly with high probability [DKRT15].

In order to efficiently do the $2^{20}$ hashings per vector for the real-world data, we utilised the high independence of double tabulation hashing. Let $d$ be the original dimension of the vectors. We blew up the source dimension to $2^{20}d$, and at the $i$-th rehash we shifted the coordinates of the original vector $i \cdot d$ places to the right.

Acknowledgments

This work was supported by a Villum Young Investigator Grant.

References

[AC09] N. Ailon and B. Chazelle.
The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009.

[Ach03] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, June 2003.

[AIL+15] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS '15, pages 1225–1233. MIT Press, 2015. Available from: http://dl.acm.org/citation.cfm?id=2969239.2969376.

[CCF04] M. Charikar, K. C. Chen, and M. Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3–15, 2004.

[CJN18] M. B. Cohen, T. S. Jayram, and J. Nelson. Simple analyses of the sparse Johnson-Lindenstrauss transform. In 1st Symposium on Simplicity in Algorithms, SOSA 2018, January 7-10, 2018, New Orleans, LA, USA, pages 15:1–15:9, 2018.

[CT06] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.

[CW09] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, pages 205–214. ACM, 2009.

[Dal13] B. Dalessandro. Bring the noise: Embracing randomness is the key to scaling up machine learning algorithms. Big Data, 1(2):110–112, 2013.

[DG03] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 2003.

[DKRT15] S. Dahlgaard, M. B. T. Knudsen, E. Rotenberg, and M. Thorup. Hashing for statistics over k-partitions. In Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), FOCS '15, pages 1292–1310. IEEE Computer Society, 2015.

[DKS10] A. Dasgupta, R. Kumar, and T. Sarlós. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, Cambridge, Massachusetts, USA, 5-8 June 2010, STOC '10, pages 341–350, 2010.

[DKT17a] S. Dahlgaard, M. Knudsen, and M. Thorup. Practical hash functions for similarity estimation and dimensionality reduction. In Advances in Neural Information Processing Systems 30, pages 6615–6625. Curran Associates, Inc., 2017.

[DKT17b] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017. Available from: http://archive.ics.uci.edu/ml.

[FL17] C. B. Freksen and K. G. Larsen. On using Toeplitz and circulant matrices for Johnson-Lindenstrauss transforms. In 28th International Symposium on Algorithms and Computation, ISAAC 2017, December 9-12, 2017, Phuket, Thailand, pages 32:1–32:12, 2017.

[HIM12] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321–350, 2012.

[HWB08] C. Hegde, M. Wakin, and R. Baraniuk. Random projections for manifold learning. In Advances in Neural Information Processing Systems 20, pages 641–648. Curran Associates, Inc., 2008.

[JL84] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, pages 189–206. American Mathematical Society, 1984.

[JW13] T. S. Jayram and D. P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Trans. Algorithms, 9(3):26:1–26:17, 2013.

[KN14] D. M. Kane and J. Nelson. Sparser Johnson-Lindenstrauss transforms. J. ACM, 61(1):4:1–4:23, January 2014.

[LN17] K. G. Larsen and J. Nelson. Optimality of the Johnson-Lindenstrauss lemma. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 633–638, 2017.

[Mat08] J. Matoušek. On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms, 33(2):142–156, 2008.

[MM09] O. Maillard and R. Munos. Compressed least-squares regression. In Advances in Neural Information Processing Systems 22, pages 1213–1221. Curran Associates, Inc., 2009.

[MM13] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Symposium on Theory of Computing Conference, STOC '13, Palo Alto, CA, USA, June 1-4, 2013, pages 91–100, 2013.

[New08] D. Newman. Bag of words data set, 2008. Available from: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words.

[NN13] J. Nelson and H. L. Nguyen. Sparsity lower bounds for dimensionality reducing maps. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC '13, pages 101–110. ACM, 2013.

[PBMID14] S. Paul, C. Boutsidis, M. Magdon-Ismail, and P. Drineas. Random projections for linear support vector machines. ACM Trans. Knowl. Discov. Data, 8(4):22:1–22:25, 2014.

[RR08] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.

[Sár06] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 143–152. IEEE Computer Society, 2006.

[Sut15] S. Suthaharan. Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning. Springer Publishing Company, Incorporated, 1st edition, 2015.

[Tho14] M. Thorup. Simple tabulation, fast expanders, double tabulation, and high independence. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 90–99. IEEE Computer Society, 2014.

[TZ12] M. Thorup and Y. Zhang. Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM J. Comput., 41(2):293–331, April 2012.

[Vem05] S. S. Vempala. The random projection method. DIMACS series in discrete mathematics and theoretical computer science. American Mathematical Society, 2005.

[WDL+09] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1113–1120, 2009.