{"title": "Understanding Sparse JL for Feature Hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 15203, "page_last": 15213, "abstract": "Feature hashing and other random projection schemes are commonly used to reduce the dimensionality of feature vectors. The goal is to efficiently project a high-dimensional feature vector living in R^n into a much lower-dimensional space R^m, while approximately preserving Euclidean norm. These schemes can be constructed using sparse random projections, for example using a sparse Johnson-Lindenstrauss (JL) transform. A line of work introduced by Weinberger et. al (ICML '09) analyzes the accuracy of sparse JL with sparsity 1 on feature vectors with small l_infinity-to-l_2 norm ratio. Recently, Freksen, Kamma, and Larsen (NeurIPS '18) closed this line of work by proving a tight tradeoff between l_infinity-to-l_2 norm ratio and accuracy for sparse JL with sparsity 1. In this paper, we demonstrate the benefits of using sparsity s greater than 1 in sparse JL on feature vectors. Our main result is a tight tradeoff between l_infinity-to-l_2 norm ratio and accuracy for a general sparsity s, that significantly generalizes the result of Freksen et. al. Our result theoretically demonstrates that sparse JL with s > 1 can have significantly better norm-preservation properties on feature vectors than sparse JL with s = 1; we also empirically demonstrate this finding.", "full_text": "Understanding Sparse JL for Feature Hashing\n\nMeena Jagadeesan\u2217\nHarvard University\n\nCambridge, MA 02138\n\nmjagadeesan@college.harvard.edu\n\nAbstract\n\nFeature hashing and other random projection schemes are commonly used to\nreduce the dimensionality of feature vectors. The goal is to ef\ufb01ciently project\na high-dimensional feature vector living in Rn into a much lower-dimensional\nspace Rm, while approximately preserving Euclidean norm. 
These schemes can be constructed using sparse random projections, for example using a sparse Johnson-Lindenstrauss (JL) transform. A line of work introduced by Weinberger et al. (ICML '09) analyzes the accuracy of sparse JL with sparsity 1 on feature vectors with small ℓ∞-to-ℓ₂ norm ratio. Recently, Freksen, Kamma, and Larsen (NeurIPS '18) closed this line of work by proving a tight tradeoff between ℓ∞-to-ℓ₂ norm ratio and accuracy for sparse JL with sparsity 1.
In this paper, we demonstrate the benefits of using sparsity s greater than 1 in sparse JL on feature vectors. Our main result is a tight tradeoff between ℓ∞-to-ℓ₂ norm ratio and accuracy for a general sparsity s, which significantly generalizes the result of Freksen et al. Our result theoretically demonstrates that sparse JL with s > 1 can have significantly better norm-preservation properties on feature vectors than sparse JL with s = 1; we also empirically demonstrate this finding.

1 Introduction

Feature hashing and other random projection schemes are influential in helping manage large data [11]. The goal is to reduce the dimensionality of feature vectors: more specifically, to project high-dimensional feature vectors living in R^n into a lower-dimensional space R^m (where m ≪ n), while approximately preserving Euclidean distances (i.e. ℓ₂ distances) with high probability. This dimensionality reduction enables a classifier to process vectors in R^m, instead of vectors in R^n. In this context, feature hashing was first introduced by Weinberger et al. [29] for document-based classification tasks such as email spam filtering.
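In its simplest form (sparsity 1), feature hashing is the classical hashing trick: each input token is sent to one of m buckets with a pseudorandom sign. The following is a minimal illustrative sketch, not the authors' code; the function name and the choice of hash are ours.

```python
import hashlib

def feature_hash(tokens, m):
    """Embed a bag of words into R^m (the s = 1 hashing trick):
    each token is mapped to one of m buckets with a pseudorandom sign."""
    v = [0.0] * m
    for tok in tokens:
        h = int(hashlib.sha256(tok.encode("utf-8")).hexdigest(), 16)
        idx = h % m                                 # bucket index in [0, m)
        sign = 1.0 if (h // m) % 2 == 0 else -1.0   # sign bit taken independently of idx
        v[idx] += sign                              # repeated tokens accumulate
    return v

emb = feature_hash(["spam", "offer", "spam", "meeting"], m=8)
```

Because the mapping is deterministic in the token, the same word always lands in the same signed bucket, so the embedding of a document can be computed in a single streaming pass.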
For such tasks, feature hashing yields a lower-dimensional embedding of a high-dimensional feature vector derived from a bag-of-words model. Since then, feature hashing has become a mainstream approach [28], applied to numerous domains including ranking text documents [4], compressing neural networks [7], and protein sequence classification [5].

Random Projections

Dimensionality reduction schemes for feature vectors fit nicely into the random projection literature. In fact, the feature hashing scheme proposed by Weinberger et al. [29] boils down to uniformly drawing a random m × n matrix where each column contains one nonzero entry, equal to −1 or 1. The ℓ₂-norm-preserving objective can be expressed mathematically as follows: for error ε > 0 and failure probability δ, the goal is to construct a probability distribution A over m × n real matrices that satisfies the following condition for vectors x ∈ R^n:

  P_{A∈A}[(1 − ε)‖x‖₂ ≤ ‖Ax‖₂ ≤ (1 + ε)‖x‖₂] > 1 − δ.  (1)

* I would like to thank Prof. Jelani Nelson for advising this project.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The result underlying the random projection literature is the Johnson-Lindenstrauss lemma, which gives an upper bound on the dimension m achievable by a probability distribution A satisfying (1):

Lemma 1.1 (Johnson-Lindenstrauss [16]) For any n ∈ N and ε, δ ∈ (0, 1), there exists a probability distribution A over m × n matrices, with m = Θ(ε⁻² ln(1/δ)), that satisfies (1).

The optimality of the dimension m achieved by Lemma 1.1 has been proven [17, 15].
To speed up projection time, it is useful to consider probability distributions over sparse matrices (i.e. matrices with a small number of nonzero entries per column).
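Concretely, a column-sparse projection matrix can be stored as, for each column, its s (row, value) pairs, which is what makes the projection cost scale with the sparsity rather than with m. A hypothetical sketch (the representation and names are ours, assuming the uniform construction described later in Section 1.1):

```python
import random

def draw_sparse_jl_columns(n, m, s, seed=0):
    """For each of the n columns, store the s nonzero entries as (row, value)
    pairs, with value +/- 1/sqrt(s) and rows drawn without replacement."""
    rng = random.Random(seed)
    scale = 1.0 / s ** 0.5
    return [[(r, rng.choice((-scale, scale))) for r in rng.sample(range(m), s)]
            for _ in range(n)]

def project(cols, x_nonzeros, m):
    """Compute Ax while touching only the s stored entries of each column i
    with x_i != 0, so the total work is O(s * nnz(x))."""
    y = [0.0] * m
    for i, xi in x_nonzeros:          # x given as sparse (index, value) pairs
        for r, a in cols[i]:
            y[r] += a * xi
    return y
```

Note that each stored column has exactly s entries of magnitude 1/√s, so the image of a standard basis vector has unit ℓ₂ norm by construction.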
More specifically, for matrices with s nonzero entries per column, the projection time for a vector x goes down from O(m‖x‖₀) to O(s‖x‖₀), where ‖x‖₀ is the number of nonzero entries of x. In this context, Kane and Nelson [19] constructed sparse JL distributions (which we define formally in Section 1.1), improving upon previous work [2, 22, 12]. Roughly speaking, a sparse JL distribution, as constructed in [19], boils down to drawing a random m × n matrix where each column contains exactly s nonzero entries, each equal to −1/√s or 1/√s. Kane and Nelson show that sparse JL distributions achieve the same (optimal) dimension as Lemma 1.1, while also satisfying a sparsity property.

Theorem 1.2 (Sparse JL [19]) For any n ∈ N and ε, δ ∈ (0, 1), a sparse JL distribution As,m,n (defined formally in Section 1.1) over m × n matrices, with dimension m = Θ(ε⁻² ln(1/δ)) and sparsity s = Θ(ε⁻¹ ln(1/δ)), satisfies (1).

Sparse JL distributions are state-of-the-art sparse random projections, and achieve a sparsity that is nearly optimal when the dimension m is Θ(ε⁻² ln(1/δ)).² However, in practice, it can be necessary to utilize a lower sparsity s, since the projection time is linear in s. Resolving this issue, Cohen [8] extended the upper bound in Theorem 1.2 to show that sparse JL distributions can achieve a lower sparsity with an appropriate gain in dimension.
He proved the following dimension-sparsity tradeoffs:

Theorem 1.3 (Dimension-Sparsity Tradeoffs [8]) For any n ∈ N and ε, δ ∈ (0, 1), a uniform sparse JL distribution As,m,n (defined formally in Section 1.1), with s ≤ Θ(ε⁻¹ ln(1/δ)) and m ≥ min( 2ε⁻²/δ , ε⁻² ln(1/δ) e^{Θ(ε⁻¹ ln(1/δ)/s)} ), satisfies (1).

Connection to Feature Hashing

Sparse JL distributions have particularly close ties to feature hashing. In particular, the feature hashing scheme proposed by Weinberger et al. [29] can be viewed as a special case of sparse JL, namely with s = 1. Interestingly, in practice, feature hashing can do much better than theoretical results, such as Theorem 1.2 and Theorem 1.3, would indicate [13]. An explanation for this phenomenon is that the highest error terms in sparse JL stem from vectors with mass concentrated on a very small number of entries, while in practice, the mass of feature vectors may be spread out among many coordinates. This motivates studying the tradeoff space for vectors with low ℓ∞-to-ℓ₂ ratio.
More formally, take Sv to be { x ∈ R^n | ‖x‖∞/‖x‖₂ ≤ v }, so that S₁ = R^n and Sv ⊊ Sw for 0 ≤ v < w ≤ 1. Let v(m, ε, δ, s) be the supremum over all 0 ≤ v ≤ 1 such that a sparse JL distribution with sparsity s and dimension m satisfies (1) for each x ∈ Sv. (That is, v(m, ε, δ, s) is the maximum v ∈ [0, 1] such that for every x ∈ R^n, if ‖x‖∞ ≤ v‖x‖₂ then (1) holds.) For s = 1, a line of work [29, 12, 18, 10, 19] improved bounds on v(m, ε, δ, 1), and was recently closed by Freksen et al. [13].

Theorem 1.4 ([13]) For any m ∈ N and ε, δ ∈ (0, 1), the function v(m, ε, δ, 1) is equal to f(m, ε, ln(1/δ)) where:

  f(m, ε, p) =
    1                                                    if m ≥ 2ε⁻² e^p
    Θ( √ε · min( ln(mε/p)/p , √(ln(mε²/p)/p) ) )          if Θ(ε⁻²p) ≤ m < 2ε⁻² e^p
    0                                                    if m ≤ Θ(ε⁻²p).

² Nelson and Nguyen [25] showed that any distribution satisfying (1) requires sparsity Ω(ε⁻¹ ln(1/δ)/ln(1/ε)) when the dimension m is Θ(ε⁻² ln(1/δ)). Kane and Nelson [19] also showed that the analysis of sparse JL distributions in Theorem 1.2 is tight at m = Θ(ε⁻² ln(1/δ)).

Generalizing to Sparse Random Projections with s > 1

While Theorem 1.4 is restricted to the case of s = 1, dimensionality reduction schemes constructed using sparse random projections with sparsity s > 1 have been used in practice for projecting feature vectors. For example, sparse JL-like methods (with s > 1) have been used to project feature vectors in machine learning domains including visual tracking [27], face recognition [23], and recently in ELM [6]. Now, a variant of sparse JL is included in the Python sklearn library.³
In this context, it is natural to explore how constructions with s > 1 perform on feature vectors, by studying v(m, ε, δ, s) for sparse JL with s > 1. In fact, a related question was considered by Weinberger et al. [29] for "multiple hashing," an alternate distribution over sparse matrices constructed by adding s draws from A1,m,n and scaling by 1/√s.
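Multiple hashing as just described (add s draws of the s = 1 matrix and scale by 1/√s) can be sketched as follows; unlike sparse JL, the s draws may collide in the same row of a column. This is illustrative code of our own, not the construction's reference implementation.

```python
import random

def multiple_hashing_column(m, s, rng):
    """One column of a multiple-hashing matrix: add s independent draws of a
    single-hash column (one +/-1 entry in a uniformly random row), then scale
    by 1/sqrt(s). Draws may collide in a row, unlike sparse JL's s distinct rows."""
    col = [0.0] * m
    for _ in range(s):
        col[rng.randrange(m)] += rng.choice((-1.0, 1.0))
    return [c / s ** 0.5 for c in col]
```

Because of collisions, a multiple-hashing column need not have exactly s nonzero entries or exactly unit ℓ₂ norm (colliding draws can cancel), which gives some intuition for why multiple hashing needs a larger sparsity than sparse JL.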
More specifically, they show that v(m, ε, δ, s) ≥ min(1, √s · v(m, ε, δ, 1)) for multiple hashing. However, Kane and Nelson [19] later showed that multiple hashing has worse geometry-preserving properties than sparse JL: that is, multiple hashing requires a larger sparsity than sparse JL to satisfy (1).
Characterizing v(m, ε, δ, s) for sparse JL distributions, which are state-of-the-art, remained an open problem. In this work, we settle how v(m, ε, δ, s) behaves for sparse JL with a general sparsity s > 1, giving tight bounds. Our theoretical result shows that sparse JL with s > 1, even if s is a small constant, can achieve significantly better norm-preservation properties for feature vectors than sparse JL with s = 1. Moreover, we empirically demonstrate this finding.

Main Results

We show the following tight bounds on v(m, ε, δ, s) for a general sparsity s:

Theorem 1.5 For any s, m ∈ N such that s ≤ m/e, consider a uniform sparse JL distribution (defined in Section 1.1) with sparsity s and dimension m.⁴ If ε and δ are small enough⁵, the function v(m, ε, δ, s) is equal to f′(m, ε, ln(1/δ), s), where f′(m, ε, p, s) is⁶:

  f′(m, ε, p, s) =
    1                                                    if m ≥ min( 2ε⁻² e^p , ε⁻² p · e^{Θ(max(1, pε⁻¹/s))} )
    Θ( √(εs · ln(mε²/p)/p) )                              else, if max( Θ(ε⁻²p) , s · e^{Θ(max(1, pε⁻¹/s))} ) ≤ m ≤ ε⁻² e^{Θ(p)}
    Θ( √(εs) · min( ln(mε/p)/p , √(ln(mε²/p)/p) ) )       else, if Θ(ε⁻²p) ≤ m ≤ min( ε⁻² e^{Θ(p)} , s · e^{Θ(max(1, pε⁻¹/s))} )
    0                                                    if m ≤ Θ(ε⁻²p).

Our main result, Theorem 1.5, significantly generalizes Theorem 1.2, Theorem 1.3, and Theorem 1.4. Notice our bound in Theorem 1.5 has up to four regimes. In the first regime, which occurs when m ≥ min(2ε⁻²/δ, ε⁻² ln(1/δ) e^{Θ(max(1, ln(1/δ)ε⁻¹/s))}), Theorem 1.5 shows v(m, ε, δ, s) = 1, so (1) holds on the full space R^n. Notice this boundary on m occurs at the dimension-sparsity tradeoff in Theorem 1.3. In the last regime, which occurs when m ≤ Θ(ε⁻² ln(1/δ)), Theorem 1.5 shows that v(m, ε, δ, s) = 0, so there are vectors with arbitrarily small ℓ∞-to-ℓ₂ norm ratio where (1) does not hold. When s ≤ Θ(ε⁻¹ ln(1/δ)), Theorem 1.5 shows that up to two intermediate regimes exist. One of the regimes, Θ(√(εs) · min(ln(mε/p)/p, √(ln(mε²/p)/p))), matches the middle regime of v(m, ε, δ, 1) in Theorem 1.4 with an extra factor of √s, much like the bound for multiple hashing in [29] that we mentioned previously. However, unlike the multiple hashing bound, Theorem 1.5 sometimes has another regime, Θ(√(εs · ln(mε²/p)/p)), which does not arise for s = 1 (i.e. in Theorem 1.4).⁷ Intuitively, we expect this additional regime for sparse JL with s close to Θ(ε⁻¹ ln(1/δ)): at s = Θ(ε⁻¹ ln(1/δ)) and m = Θ(ε⁻² ln(1/δ)), Theorem 1.2 tells us v(m, ε, δ, s) = 1, but if ε is a constant, then the branch Θ(√(εs) · ln(mε/p)/p) yields Θ(1/√(ln(1/δ))), while the branch Θ(√(εs · ln(mε²/p)/p)) yields Θ(1). Thus, it is natural that the first branch disappears for large m.

³ See https://scikit-learn.org/stable/modules/random_projection.html.
⁴ We prove the lower bound on v(m, ε, δ, s) in Theorem 1.5 for any sparse JL distribution.
⁵ By "small enough", we mean the condition that ε, δ ∈ (0, C′) for some positive constant C′.
⁶ Notice that the function f′(m, ε, p, s) is not defined for certain "constant-factor" intervals between the boundaries of regimes (e.g. C₁ε⁻²p ≤ m ≤ C₂ε⁻²p). See Appendix A for a discussion.
⁷ This regime does not arise for s = 1, since e^{Θ(pε⁻¹)} ≥ ε⁻² e^{Θ(p)} for sufficiently small ε.

Our result elucidates that v(m, ε, δ, s) increases approximately as √s, thus providing insight into how even small constant increases in sparsity can be useful in practice. Another consequence of our result is a lower bound on dimension-sparsity tradeoffs (Corollary A.1 in Appendix A) that essentially matches the upper bound in Theorem 1.3.
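The parameter in which all of these tradeoffs are stated is v = ‖x‖∞/‖x‖₂, which is small exactly when the mass of x is spread over many coordinates. A quick illustration (our code, not the paper's):

```python
def linf_to_l2_ratio(x):
    """v = ||x||_inf / ||x||_2, the norm-ratio parameter of the tradeoffs."""
    return max(abs(t) for t in x) / sum(t * t for t in x) ** 0.5

n = 10000
spread = [1.0] * n                       # mass spread over all n coordinates
concentrated = [1.0] + [0.0] * (n - 1)   # all mass on one coordinate

r_spread = linf_to_l2_ratio(spread)          # 1/sqrt(n) = 0.01
r_concentrated = linf_to_l2_ratio(concentrated)  # 1.0, the worst case
```

A tf-idf bag-of-words vector typically behaves much more like the spread case than the concentrated one, which is why guarantees on Sv for small v are the relevant regime for feature hashing.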
Moreover, we require new techniques to prove Theorem 1.5, for reasons that we discuss further in Section 1.2.
We also empirically support our theoretical findings in Theorem 1.5. First, we illustrate with real-world datasets the potential benefits of using small constants s > 1 for sparse JL on feature vectors. We specifically show that s ∈ {4, 8, 16} consistently outperforms s = 1 in preserving the ℓ₂ norm of each vector, and that there can be up to a factor of ten decrease in failure probability for s = 8, 16 in comparison to s = 1. Second, we use synthetic data to illustrate phase transitions and other trends in Theorem 1.5. More specifically, we empirically show that v(m, ε, δ, s) is not smooth, and that the middle regime(s) of v(m, ε, δ, s) increase with s.

1.1 Preliminaries

We call As,m,n a sparse JL distribution if the entries of a matrix A ∈ As,m,n are generated as follows. Let Ar,i = ηr,i σr,i/√s, where {σr,i}r∈[m],i∈[n] and {ηr,i}r∈[m],i∈[n] are defined as follows:

• The families {σr,i}r∈[m],i∈[n] and {ηr,i}r∈[m],i∈[n] are independent from each other.
• The variables {σr,i}r∈[m],i∈[n] are i.i.d. Rademachers (±1 coin flips).
• The variables {ηr,i}r∈[m],i∈[n] are identically distributed Bernoullis ({0, 1} random variables) with expectation s/m.
• The {ηr,i}r∈[m],i∈[n] are independent across columns but not independent within each column. For every column 1 ≤ i ≤ n, it holds that ∑_{r=1}^{m} ηr,i = s. Moreover, the random variables are negatively correlated: for every subset S ⊆ [m] and every column 1 ≤ i ≤ n, it holds that E[∏_{r∈S} ηr,i] ≤ ∏_{r∈S} E[ηr,i].

A common special case is a uniform sparse JL distribution, generated as follows: for every column 1 ≤ i ≤ n, we uniformly choose exactly s of the variables in {ηr,i}r∈[m] to be 1. When s = 1, every sparse JL distribution is a uniform sparse JL distribution, but for s > 1, this is not the case.
Another common special case is a block sparse JL distribution. This produces a different construction for s > 1. In this distribution, each column 1 ≤ i ≤ n is partitioned into s blocks of ⌊m/s⌋ consecutive rows. In each block in each column, the distribution of the variables {ηr,i} is defined by uniformly choosing exactly one of these variables to be 1.⁸

⁸ Our lower bound in Theorem 1.5 applies to this distribution, though our upper bound does not. An interesting direction for future work would be to generalize the upper bound to this distribution.

1.2 Proof Techniques

We use the following notation. For any random variable X and value q ≥ 1, we call E[|X|^q] the qth moment of X, where E denotes the expectation. We use ‖X‖_q to denote the q-norm (E[|X|^q])^{1/q}.
For every x = [x1, . . . , xn] ∈ R^n such that ‖x‖₂ = 1, we need to analyze tail bounds of an error term, which for the sparse JL construction is the following random variable:

  ‖Ax‖₂² − 1 = (1/s) ∑_{i≠j} ∑_{r=1}^{m} ηr,i ηr,j σr,i σr,j xi xj =: R(x1, . . . , xn).

An upper bound on the tail probability of R(x1, . . . , xn) is needed to prove the lower bound on v(m, ε, δ, s) in Theorem 1.5, and a lower bound is needed to prove the upper bound on v(m, ε, δ, s) in Theorem 1.5 at each threshold v value.
It turns out that it suffices to tightly analyze the moments E[(R(x1, . . . , xn))^q].
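The error random variable R(x1, . . . , xn) = ‖Ax‖₂² − 1 is easy to simulate for small instances. The following sketch is our own code (not the paper's), assuming the uniform sparse JL construction of Section 1.1:

```python
import random

def sample_uniform_sparse_jl(m, n, s, rng):
    """A ~ As,m,n (uniform case): each column gets exactly s nonzero rows (eta),
    i.i.d. Rademacher signs (sigma), and entries eta * sigma / sqrt(s)."""
    A = [[0.0] * n for _ in range(m)]
    for i in range(n):
        for r in rng.sample(range(m), s):       # exactly s nonzero rows per column
            A[r][i] = rng.choice((-1.0, 1.0)) / s ** 0.5
    return A

def error_term(A, x):
    """R(x1, ..., xn) = ||Ax||_2^2 - 1 for a unit vector x."""
    y = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return sum(t * t for t in y) - 1.0

rng = random.Random(0)
n = 16
x = [1.0 / n ** 0.5] * n   # unit vector with l_inf-to-l_2 ratio v = 1/4
samples = [error_term(sample_uniform_sparse_jl(32, n, 4, rng), x) for _ in range(200)]
```

Empirical moments of such samples are exactly the quantities E[(R(x1, . . . , xn))^q] that the analysis below bounds.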
For the upper bound, we use Markov's inequality like in [13, 19, 3, 24], and for the lower bound, we use the Paley-Zygmund inequality like in [13]: Markov's inequality gives a tail upper bound from upper bounds on moments, and the Paley-Zygmund inequality gives a tail lower bound from upper and lower bounds on moments. Thus, the key ingredient of our analysis is a tight bound on ‖R(x1, . . . , xn)‖_q on Sv = { x ∈ R^n | ‖x‖∞/‖x‖₂ ≤ v }.
While the moments of R(x1, . . . , xn) have been studied in previous analyses of sparse JL, we emphasize that it is not clear how to adapt these existing approaches to obtain a tight bound on every Sv. The moment bound that we require and obtain is far more general: the bounds in [19, 9] are limited to R^n = S₁ and the bound in [13] is limited to s = 1.⁹ The non-combinatorial approach in [9] for bounding ‖R(x1, . . . , xn)‖_q on R^n = S₁ also turns out to not be sufficiently precise on Sv, for reasons we discuss in Section 2.¹⁰
Thus, we require new tools for our moment bound. Our analysis provides a new perspective, inspired by the probability theory literature, that differs from the existing approaches in the JL literature. We believe our style of analysis is less brittle than combinatorial approaches [13, 19, 3, 24]: in this setting, once the sparsity s = 1 case is recovered, it becomes straightforward to generalize to other s values. Moreover, our approach can yield greater precision than the existing non-combinatorial approaches [9, 8, 14], which is necessary for this setting. Thus, we believe that our structural approach to analyzing JL distributions could be of use in other settings.
In Section 2, we present an overview of our methods and the key technical lemmas used to analyze ‖R(x1, . . . , xn)‖_q. We defer the proofs to the Appendix. In Section 3, we prove the tail bounds in Theorem 1.5 from these moment bounds.
In Section 4, we empirically evaluate sparse JL.\n\n2 Sketch of Bounding the Moments of R(x1, . . . , xn)\n\n(i.e. (cid:80)\n\nt1,t2\n\nOur approach takes advantage of the structure of R(x1, . . . , xn) as a quadratic form of Rademachers\nat1,t2\u03c3t1 \u03c3t2) with random variable coef\ufb01cients (i.e. where at1,t2 is itself a random\nvariable). For the upper bound, we need to analyze (cid:107)R(x1, . . . , xn)(cid:107)q for general vectors [x1, . . . , xn].\nFor the lower bound, we only need to show (cid:107)R(x1, . . . , xn)(cid:107)q is large for single vector in each Sv,\nand we show we can select the vector in the (cid:96)2-unit ball with 1/v2 nonzero entries, all equal to v. For\nease of notation, we denote this vector by [v, . . . , v, 0, . . . , 0] for the remainder of the paper.\nWe analyze (cid:107)R(x1, . . . , xn)(cid:107)q using general moment bounds for Rademacher linear and quadratic\nforms. Though Cohen, Jayram, and Nelson [9] also view R(x1, . . . , xn) as a quadratic form, we\nshow in the supplementary material that their approach of bounding the Rademachers by gaussians is\nnot suf\ufb01ciently precise for our setting.11\nIn our approach, we make use of stronger moment bounds for Rademacher linear and quadratic forms,\nsome of which are known to the probability theory community through Lata\u0142a\u2019s work in [21, 20] and\nsome of which are new adaptions tailored to the constraints arising in our setting. More speci\ufb01cally,\nLata\u0142a\u2019s bounds [21, 20] target the setting where the coef\ufb01cients are scalars. In our setting, however,\nthe coef\ufb01cients are themselves random variables, and we need bounds that are tractable to analyze in\nthis setting, which involves creating new bounds to handle some cases.\nOur strategy for bounding (cid:107)R(x1, . . . , xn)(cid:107)q is to break down into rows. We de\ufb01ne\n\nZr(x1, . . . 
, xn) :=\n\n\u03b7r,i\u03b7r,j\u03c3r,i\u03c3r,jxixj\n\n(cid:88)\n\n1\u2264i(cid:54)=j\u2264n\n\n9As described in [13], even for the case for s = 1, the approach in [19] cannot be directly generalized to\nrecover Theorem 1.4. Moreover, the approach in [13], though more precise for s = 1, is highly tailored to s = 1,\nand it is not clear how to generalize it to s > 1.\n\n10In predecessor work [14], we give a non-combinatorial approach similar to [9] for a sign-consistent variant\nof the JL distribution. Moreover, a different non-combinatorial approach for subspace embeddings is given in\n[8]. However, these approaches both suffer from issues in this setting that are similar to [9].\n\n11We actually made a similar conceptual point for a different JL distribution in our predecessor work [14], but\n\nthe alternate bound that we produce there also suffers from precision issues in this setting.\n\n5\n\n\f(cid:80)m\n\n,\n\n1\n\nmv2\n\n(cid:38)\n\n\u221a\n\nT 2v2\n\nT\n\nln(m/s)\n\nmT v2\n\n(cid:17)\n\ns\n\nmT v2\n\ns\n\nmT v2\n\n(cid:16)\n\n(cid:38) s\n\ni=1 \u03b71,i=2\n\nln2(mv2T /s)\n\nSuppose\n\nln(mT v2/s)2 ,\ns\n\nfor T = 2, 3 \u2264 T \u2264 se\nfor T \u2265 3, T \u2265 se\nfor T \u2265 3, T \u2265 se\n\nso that R(x1, . . . , xn) = 1\nr=1 Zr(x1, . . . , xn). We analyze the moments of Zr(x1, . . . , xn), and\ns\nthen combine these bounds to obtain moment bounds for R(x1, . . . , xn). In our bounds, we use the\nnotation f (cid:46) g (resp. f (cid:38) g) to denote f \u2264 Cg (resp. f \u2265 Cg) for some constant C.\n2.1 Bounding (cid:107)Zr(x1, . . . , xn)(cid:107)q\nWe show the following bounds on (cid:107)Zr(x1, . . . , xn)(cid:107)q. For the lower bound, as we discussed\nbefore, it suf\ufb01ces to bound (cid:107)Zr(v, . . . , v, 0, . . . , 0)(cid:107)q. For the upper bound, we need to bound\n(cid:107)Zr(x1, . . . 
, xn)(cid:107)q for general vectors as a function of the (cid:96)\u221e-to-(cid:96)2 norm ratio.\nLemma 2.1 Let As,m,n be a sparse JL distribution such that s \u2264 m/e. Suppose that x =\n[x1, . . . , xn] satis\ufb01es (cid:107)x(cid:107)\u221e \u2264 v and (cid:107)x(cid:107)2 = 1. If T is even, then:\n\nT s\nm ,\nmin\n\n(cid:107)Zr(x1, . . . , xn)(cid:107)T\n\n\uf8f1\uf8f4\uf8f2\uf8f4\uf8f3\nv2(cid:0)\n(cid:13)(cid:13)(cid:13)Zr(v, . . . , v, 0, . . . , 0)I(cid:80)1/v2\n\n(cid:46)\n\n(cid:1)2/T and\n\n(cid:107)Zr(v, . . . , v, 0, . . . , 0)(cid:107)T\n\n(cid:1)2/T\n(cid:13)(cid:13)(cid:13)T\nLemma 2.2 Let As,m,n be a sparse JL distribution.\n(cid:38) v2(cid:0)\ngers. Then, (cid:107)Zr(v, . . . , v, 0, . . . , 0)(cid:107)2\nm. Moreover, if s \u2264 m/e and T \u2265 se\n\uf8f1\uf8f2\uf8f3 T 2v2\nv2(cid:0)\n(cid:1)2/T\n\nfor 1 \u2264 ln(mv2T /s) \u2264 T, v \u2264\nfor ln(mv2T /s) > T.\n\nmv2 , ln(T mv2/s) \u2264 T\nmv2 , ln(T mv2/s) > T.\nv2 and T are even inte-\nmv2 , then\n\n(Lemma 2.2), we can view Zr(v, . . . , v, 0, . . . , 0) as a quadratic form(cid:80)\n\nWe now sketch our methods to prove Lemma 2.1 and Lemma 2.2. For the lower bound\nat1,t2\u03c3t1\u03c3t2 where\n(at1,t2 )t1,t2\u2208[mn] is an appropriately de\ufb01ned block-diagonal mn dimensional matrix. We can\nwrite E\u03c3,\u03b7[(Zr(v, . . . , v, 0, . . . , 0))q] as E\u03b7 [E\u03c3[(Zr(v, . . . , v, 0, . . . , 0))q]]: for \ufb01xed \u03b7r,i values, the\ncoef\ufb01cients are scalars. We make use of Lata\u0142a\u2019s tight bound on Rademacher quadratic forms\nwith scalar coef\ufb01cients [21] to analyze E\u03c3[(Zr(v, . . . , v, 0, . . . , 0))q] as a function of the \u03b7r,i.\nThen, we handle the randomness of the \u03b7r,i by taking an expectation of the resulting bound on\nE\u03c3[(Zr(v, . . . , v, 0, . . . , 0))q] over the \u03b7r,i values to obtain a bound on (cid:107)Zr(v, . . . , v, 0, . . . 
, 0)(cid:107)q.\nFor the upper bound (Lemma 2.1), since Lata\u0142a\u2019s bound [21] is tight for scalar quadratic forms,\nthe natural approach would be to use it to upper bound E\u03c3[(Zr(x1, . . . , xn))q] for general vectors.\nHowever, when the vector is not of the form [v, . . . , v, 0, . . . , 0], the asymmetry makes the resulting\nbound intractable to simplify. Speci\ufb01cally, there is a term, which can be viewed as a generalization of\nan operator norm to an (cid:96)2 ball cut out by (cid:96)\u221e hyperplanes, that becomes problematic when taking an\nexpectation over the \u03b7r,i to obtain a bound on E\u03c3,\u03b7[(Zr(x1, . . . , xn))q]. Thus, we construct simpler\nestimates that avoid these complications while remaining suf\ufb01ciently precise for our setting. These\nestimates take advantage of the structure of Zr(x1, . . . , xn) and enable us to show Lemma 2.1.\n2.2 Obtaining bounds on (cid:107)R(x1, . . . , xn)(cid:107)q\nNow, we use Lemma 2.1 and Lemma 2.2 to show the following bounds on (cid:107)R(x1, . . . , xn)(cid:107)q:\nLemma 2.3 Suppose As,m,n is a sparse JL distribution such that s \u2264 m/e, and let x = [x1, . . . , xn]\nbe such that (cid:107)x(cid:107)2 = 1. Then, (cid:107)R(x1, . . . , xn)(cid:107)2 \u2264 \u221a\nm. Now, suppose that 2 < q \u2264 m is an even\n2\u221a\n(cid:46) \u221a\ninteger and (cid:107)x(cid:107)\u221e \u2264 v. If se\nq\u221a\nmv2 < q and if there exists a\nm. If se\nconstant C2 \u2265 1 such that C2q3mv4 \u2265 s2, then (cid:107)R(x1, . . . , xn)(cid:107)q\n\nmv2 \u2265 q, then (cid:107)R(x1, . . . 
, xn)‖q ≲ g, where g is:

g =
  √q/√m                                                                   if ln(qmv⁴/s²) ≤ 2 and ln(qmv²/s) ≤ q,
  max( √q/√m , C₂^{1/3}·q²v²/(s ln²(qmv²/s)) )                            if ln(qmv⁴/s²) ≤ 2 and ln(qmv²/s) > q,
  max( √q/√m , qv²/(s ln(qmv⁴/s²)) )                                      if ln(qmv⁴/s²) > 2 and ln(qmv²/s) ≤ q,
  max( √q/√m , min( qv²/(s ln(qmv⁴/s²)) , C₂^{1/3}·q²v²/(s ln²(qmv²/s)) ) )   if ln(qmv⁴/s²) > 2 and ln(qmv²/s) > q.

Lemma 2.4 Suppose As,m,n is a uniform sparse JL distribution. Let q be a power of 2, and suppose that 0 < v ≤ 0.5 and 1/v² is an even integer. If qv² ≤ s and s ≤ m/e, then ‖R(v, . . . , v, 0, . . . , 0)‖q ≳ √q/√m. If m ≥ q, 2 ≤ ln(qmv⁴/s²) ≤ q, 2qv² ≤ 0.5s ln(qmv⁴/s²), and s ≤ m/e, then ‖R(v, . . . , v, 0, . . . , 0)‖q ≳ qv²/(s ln(qmv⁴/s²)). If s ≤ m/e, v ≤ √(ln(m/s)/q), and 1 ≤ ln(qmv²/s) ≤ q, then ‖R(v, . . . , v, 0, . . . , 0)‖q ≳ q²v²/(s ln²(qmv²/s)).

We now sketch how to prove bounds on ‖R(x1, . . . , xn)‖q using bounds on ‖Zr(x1, . . . , xn)‖T. To show Lemma 2.3, we show that making the row terms Zr(x1, . . . , xn) independent does not decrease ‖R(x1, . . . , xn)‖q, and then we apply a general result from [20] for moments of sums of i.i.d. symmetric random variables. For Lemma 2.4, handling the correlations between the row terms Zr(x1, . . . , xn) requires more care. We show that the negative correlations induced by having exactly s nonzero entries per column do not lead to significant loss, and then stitch together a lower bound on ‖R(v, . . . , v, 0, . . . , 0)‖q using the moments of Zr(v, . . . , v, 0, . . . , 0) that contribute the most.

3 Proof of Main Result from Moment Bounds

We now sketch how to prove Theorem 1.5 using Lemma 2.3 and Lemma 2.4. First, we simplify these bounds at the target parameters to obtain the following:

Lemma 3.1 Let As,m,n be a sparse JL distribution, and suppose ε and δ are small enough, s ≤ m/e, Θ(ε⁻² ln(1/δ)) ≤ m < 2ε⁻²/δ, v ≤ f′(m, ε, ln(1/δ), s), and p = Θ(ln(1/δ)) is even. If x = [x1, . . . , xn] satisfies ‖x‖∞ ≤ v and ‖x‖2 = 1, then ‖R(x1, . . . , xn)‖p ≤ ε/2.

Lemma 3.2 There is a universal constant D satisfying the following property. Let As,m,n be a uniform sparse JL distribution, and suppose ε, δ are small enough, s ≤ m/e, f′(m, ε, ln(1/δ), s) ≤ 0.5, and q is an even integer such that q = min(m/2, Θ(ln(1/δ))). For each ψ > 0, there exists v ≤ f′(m, ε, ln(1/δ), s) + ψ such that ‖R(v, . . . , v, 0, . . . , 0)‖q ≥ 2ε and ‖R(v, . . . , v, 0, . . . , 0)‖q / ‖R(v, . . . , v, 0, . . . , 0)‖2q ≥ D.

Now, we use Lemma 3.1 and Lemma 3.2 to prove Theorem 1.5.

Proof of Theorem 1.5. Since the maps in As,m,n are linear, it suffices to consider unit vectors x. First, we prove the lower bound on v(m, ε, δ, s). To handle m ≥ 2ε⁻²/δ, we take q = 2 in Lemma 3.1 and apply Chebyshev’s inequality. Otherwise, we take p = ln(1/δ) (approximately) and apply Lemma 3.1 and Markov’s inequality. We see that P[|‖Ax‖₂² − 1| ≥ ε] can be expressed as:

P[|R(x1, . . . , xn)| ≥ ε] = P[R(x1, . . . , xn)^p ≥ ε^p] ≤ ε^{−p} E[R(x1, . . . , xn)^p] ≤ δ.

Thus, condition (1) is satisfied for x ∈ Sv when v ≤ f′(m, ε, ln(1/δ), s), as desired.

Now, we prove the upper bound on v(m, ε, δ, s). We need to lower bound the tail probability of R(v, . . . , v, 0, . . . , 0), and to do this, we use the Paley-Zygmund inequality applied to qth moments. Let D be defined as in Lemma 3.2, and take q = min(m/2, (ln(1/δ) − 2)/(−2 ln D)). By the Paley-Zygmund inequality and Lemma 3.2, there exists v ≤ f′(m, ε, ln(1/δ), s) + ψ such that:

P[|R(v, . . . , v, 0, . . . , 0)| > ε] ≥ 0.25 ( ‖R(v, . . . , v, 0, . . . , 0)‖q / ‖R(v, . . . , v, 0, . . . , 0)‖2q )^{2q} ≥ 0.25 D^{2q} > δ.

Thus, it follows that sup over x ∈ S_{f′(m, ε, ln(1/δ), s)+ψ} with ‖x‖2 = 1 of P[|‖Ax‖₂² − 1| > ε] exceeds δ, as desired.

4 Empirical Evaluation

Recall that for sparse JL distributions with sparsity s, the projection time for an input vector x is O(s‖x‖₀), where ‖x‖₀ is the number of nonzero entries in x. Since this grows linearly in s, in order to minimize the impact on projection time, we restrict to small constant s values (i.e. 1 ≤ s ≤ 16). In Section 4.1, we demonstrate on real-world data the benefits of using s > 1. In Section 4.2, we illustrate trends in our theoretical bounds on synthetic data. Additional graphs can be found in Appendix I.
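To make the object being evaluated concrete, here is a minimal NumPy sketch of one common block sparse JL construction: the m rows are split into s blocks of m/s rows, and each input coordinate is hashed to one uniformly random signed row per block, scaled by 1/√s. This is an illustrative sparsity-s map, not the exact implementation used in our experiments; the function name and parameters are ours.

```python
import numpy as np

def block_sparse_jl(x, m, s, rng):
    """Apply one draw of a block sparse JL map to x.

    The m output coordinates are split into s blocks of m/s rows. Each
    input coordinate is hashed to one uniformly random row in every
    block, with an independent uniformly random sign, and the result is
    scaled by 1/sqrt(s), so each column of the implicit matrix has
    exactly s nonzero entries of magnitude 1/sqrt(s).

    Only nonzero entries of x are touched: projection time is O(s * nnz(x)).
    """
    assert m % s == 0, "m must be divisible by s"
    block = m // s
    y = np.zeros(m)
    (nz,) = np.nonzero(x)
    for b in range(s):
        rows = rng.integers(0, block, size=nz.size)    # hash of each nonzero coordinate
        signs = rng.choice([-1.0, 1.0], size=nz.size)  # Rademacher signs
        np.add.at(y, b * block + rows, signs * x[nz])  # accumulate hash collisions
    return y / np.sqrt(s)

rng = np.random.default_rng(0)
x = np.zeros(10000)
x[:100] = 0.1                      # unit-norm vector with l_inf-to-l_2 ratio v = 0.1
y = block_sparse_jl(x, m=256, s=8, rng=rng)
print(abs(np.linalg.norm(y) - 1))  # distortion of the Euclidean norm; typically small
```

Sparsity s > 1 spreads each coordinate over s rows, which is what dampens the effect of any single hash collision on a large-magnitude coordinate.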
For all experiments, we use a block sparse JL distribution to demonstrate that our theoretical upper bounds also empirically generalize to non-uniform sparse JL distributions.

4.1 Real-World Datasets

We considered two bag-of-words datasets: the News20 dataset [1] (based on newsgroup documents), and the Enron email dataset [26] (based on e-mails from the senior management of Enron).¹² Both datasets were pre-processed with the standard tf-idf preprocessing. In this experiment, we evaluated how well sparse JL preserves the ℓ2 norms of the vectors in the dataset. An interesting direction for future work would be to empirically evaluate how well sparse JL preserves other aspects of the geometry of real-world datasets, such as the ℓ2 distances between pairs of vectors.

In our experiment, we estimated the failure probability δ̂(s, m, ε) for each dataset as follows. Let D be the number of vectors in the dataset, and let n be the dimension (n = 101631, D = 11314 for News20; n = 28102, D = 39861 for Enron). We drew a matrix M ∼ As,m,n from a block sparse JL distribution. Then, we computed ‖Mx‖2/‖x‖2 for each vector x in the dataset, and used these values to compute an estimate

δ̂(s, m, ε) = (number of vectors x such that ‖Mx‖2/‖x‖2 ∉ 1 ± ε) / D.

We ran 100 trials to produce 100 estimates δ̂(s, m, ε).

Figure 1: News20: δ̂(s, m, 0.07) vs. s          Figure 2: Enron: δ̂(s, m, 0.07) vs. s

Figure 1 and Figure 2 show the mean and error bars (3 standard errors of the mean) of δ̂(s, m, ε) at ε = 0.07.
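The failure-probability estimate δ̂(s, m, ε) described above can be sketched as follows. The random matrix here is a toy stand-in for a tf-idf dataset, and `sparse_jl_project` is our own helper standing in for one draw from the block sparse JL distribution; names and parameters are ours.

```python
import numpy as np

def sparse_jl_project(X, m, s, rng):
    """Apply one draw of a block sparse JL map (s blocks of m/s rows, one
    random signed entry per column in each block) to every row of X."""
    n = X.shape[1]
    block = m // s
    Y = np.zeros((X.shape[0], m))
    for b in range(s):
        rows = b * block + rng.integers(0, block, size=n)
        signs = rng.choice([-1.0, 1.0], size=n)
        for j in range(n):                      # O(s * n) column updates in total
            Y[:, rows[j]] += signs[j] * X[:, j]
    return Y / np.sqrt(s)

def failure_rate(X, m, s, eps, rng):
    """One estimate of delta_hat(s, m, eps): the fraction of dataset
    vectors whose l2 norm is distorted by more than a (1 +/- eps)
    factor under a single draw of the projection."""
    Y = sparse_jl_project(X, m, s, rng)
    ratios = np.linalg.norm(Y, axis=1) / np.linalg.norm(X, axis=1)
    return np.mean((ratios < 1 - eps) | (ratios > 1 + eps))

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2000))            # toy stand-in for a tf-idf matrix
for s in (1, 2, 4, 8):
    print(s, failure_rate(X, m=1024, s=s, eps=0.07, rng=rng))
```

In the experiment above, this estimate is repeated over 100 independent draws of M to obtain a mean and standard error.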
We consider s ∈ {1, 2, 4, 8, 16}, and choose m values so that 0.01 ≤ δ̂(1, m, ε) ≤ 0.04. All of the plots show that s ∈ {2, 4, 8, 16} achieves a lower failure probability than s = 1, with the differences most pronounced when m is larger. In fact, at m = 1000, there is a factor-of-four decrease in δ between s = 1 and s = 4, and a factor-of-ten decrease between s = 1 and s = 8, 16. We note that in plots in the Appendix, there is a slight increase between s = 8 and s = 16 at some ε, δ, m values (see Appendix I for a discussion of this non-monotonicity in s); however, s > 1 still consistently beats s = 1. Thus, these findings demonstrate the potential benefits of using small constants s > 1 in sparse JL in practice, which aligns with our theoretical results.

4.2 Synthetic Datasets

We used synthetic data to illustrate the phase transitions in our bounds on v(m, ε, δ, s) in Theorem 1.5 for a block sparse JL distribution. For several choices of s, m, ε, δ, we computed an estimate v̂(m, ε, δ, s) of v(m, ε, δ, s) as follows. Our experiment borrowed aspects of the experimental design in [13]. Our synthetic data consisted of binary vectors (i.e. vectors whose entries are in {0, 1}). The binary vectors were defined by a set W of values exponentially spread between 0.03 and 1:¹³ for each w ∈ W, we constructed a binary vector xw where the first 1/w² entries are nonzero, and computed an estimate δ̂(s, m, ε, w) of the failure probability of the block sparse JL distribution on the specific vector xw (i.e. P_{A∈As,m,1/w²}[‖Axw‖2 ∉ (1 ± ε)‖xw‖2]). We computed each δ̂(s, m, ε, w) using T = 100,000 samples from a block sparse JL distribution, as follows.
In each sample, we independently drew a matrix M ∼ As,m,1/w² and computed the ratio ‖Mxw‖2/‖xw‖2. Then, we took δ̂(s, m, ε, w) := (number of samples where ‖Mxw‖2/‖xw‖2 ∉ 1 ± ε)/T. Finally, we used the estimates δ̂(s, m, ε, w) to obtain the estimate v̂(m, ε, δ, s) = max{ v ∈ W | δ̂(s, m, ε, w) < δ for all w ∈ W where w ≤ v }.

Why does this procedure estimate v(m, ε, δ, s)? With enough samples, δ̂(s, m, ε, w) → P_{A∈As,m,1/w²}[‖Axw‖2 ∉ (1 ± ε)‖xw‖2].¹⁴ As a result, if xw is a “violating” vector, i.e. δ̂(s, m, ε, w) ≥ δ, then likely P_{A∈As,m,n}[‖Axw‖2 ∉ (1 ± ε)‖xw‖2] ≥ δ, and so v̂(m, ε, δ, s) ≥ v(m, ε, δ, s). For the other direction, we use that in the proof of Theorem 1.5, we show that asymptotically, if a “violating” vector (i.e. x s.t. P_{A∈As,m,n}[‖Ax‖2 ∉ (1 ± ε)‖x‖2] ≥ δ) exists in Sv, then there is a “violating” vector of the form xw for some w ≤ Θ(v).

¹²Note that the News20 dataset is used in [10], and the Enron dataset is from the same collection as the dataset used in [13], but contains a larger number of documents.
¹³We took W = { w | w⁻² ∈ {986, 657, 438, 292, 195, 130, 87, 58, 39, 26, 18, 12, 9, 8, 7, 6, 5, 4, 3, 2, 1} }.
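The estimation procedure above can be condensed into the following sketch, with far fewer samples than the T = 100,000 used in our experiments, a much coarser grid W, and our own inline sampler for the sparsity-s projection; names and parameters are ours.

```python
import numpy as np

def failure_prob_binary(w, m, s, eps, trials, rng):
    """Estimate delta_hat(s, m, eps, w) for the binary vector x_w with
    1/w^2 nonzero entries (so ||x_w||_inf / ||x_w||_2 = w)."""
    k = round(1 / w**2)          # number of nonzero entries; ||x_w||_2 = sqrt(k)
    block = m // s
    fails = 0
    for _ in range(trials):
        y = np.zeros(m)
        for b in range(s):       # one signed hash per block, as in block sparse JL
            rows = b * block + rng.integers(0, block, size=k)
            np.add.at(y, rows, rng.choice([-1.0, 1.0], size=k))
        ratio = np.linalg.norm(y) / np.sqrt(s * k)   # ||Ax_w||_2 / ||x_w||_2
        fails += not (1 - eps <= ratio <= 1 + eps)
    return fails / trials

def v_hat(m, eps, delta, s, W, trials, rng):
    """max{v in W : delta_hat(s, m, eps, w) < delta for all w in W with w <= v}."""
    dh = {w: failure_prob_binary(w, m, s, eps, trials, rng) for w in W}
    candidates = [v for v in W if all(dh[w] < delta for w in W if w <= v)]
    return max(candidates, default=0.0)

rng = np.random.default_rng(2)
W = [k ** -0.5 for k in (512, 128, 32, 8, 2, 1)]   # w between ~0.044 and 1
print(v_hat(m=256, eps=0.1, delta=0.05, s=2, W=W, trials=300, rng=rng))
```
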
Thus, the estimate v̂(m, ε, δ, s) = Θ(v(m, ε, δ, s)) as T → ∞ and as the precision in W goes to ∞.

Figure 3: Phase transitions of v̂(m, 0.1, 0.01, s)    Figure 4: Phase transitions of v̂(m, 0.05, 0.05, s)

Figure 3 and Figure 4 show v̂(m, ε, δ, s) as a function of the dimension m for s ∈ {1, 2, 3, 4, 8} for two settings of ε and δ. The error bars are based on the distance to the next highest v value in W.

Our first observation is that for each set of s, ε, δ values considered, the curve v̂(m, ε, δ, s) has “sharp” changes as a function of m. More specifically, v̂(m, ε, δ, s) is 0 at small m, then there is a phase transition to a nonzero value, then an increase to a higher value, then an interval where the value appears “flat”, and lastly a second phase transition to 1. The first phase transition is shared between s values, but the second phase transition occurs at different dimensions m (though within a factor of 3 between s values). Here, the first phase transition likely corresponds to Θ(ε⁻² ln(1/δ)) and the second phase transition likely corresponds to min( ε⁻² e^{Θ(ln(1/δ))} , ε⁻² ln(1/δ) e^{Θ(ln(1/δ)ε⁻¹/s)} ).

Our second observation is that as s increases, the “flat” part occurs at a higher y-coordinate.
Here, the increase in the “flat” y-coordinate as a function of s corresponds to the √s term in v(m, ε, δ, s). Technically, according to Theorem 1.5, the “flat” parts should be increasing in m at a slow rate; the empirical “flatness” likely arises since W is a finite set in the experiments.

Our third observation is that s > 1 generally outperforms s = 1, as Theorem 1.5 suggests: that is, s > 1 generally attains a higher v̂(m, ε, δ, s) value than s = 1. We note that at large m values (where v̂(m, ε, δ, s) is close to 1), lower s settings sometimes attain a higher v̂(m, ε, δ, s) than higher s settings (e.g. the second phase transition doesn’t quite occur in decreasing order of s in Figure 3): see Appendix I for a discussion of this non-monotonicity in s.¹⁵ Nonetheless, in practice, it is unlikely one would select such a large dimension m, since the ℓ∞-to-ℓ2 guarantees of smaller m are likely sufficient. Hence, a greater sparsity generally leads to a better v̂(m, ε, δ, s) value, thus aligning with our theoretical findings.

¹⁴With 100,000 samples, running our procedure twice yielded the same v̂(m, ε, δ, s) values both times.
¹⁵In Appendix I, we also show more examples where at large m values, lower s settings attain a higher v̂(m, ε, δ, s) than higher s settings.

References

[1] The 20 newsgroups text dataset. https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html.

[2] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, June 2003.

[3] Z. Allen-Zhu, R. Gelashvili, S. Micali, and N. Shavit. Sparse sign-consistent Johnson–Lindenstrauss matrices: Compression with neuroscience-based constraints.
Proceedings of the National Academy of Sciences, 111:16872–16876, 2014.

[4] B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K. Weinberger. Learning to rank with (a lot of) word features. Information Retrieval, 13(3):291–314, June 2010.

[5] C. Caragea, A. Silvescu, and P. Mitra. Protein sequence classification using feature hashing. Proteome Science, 10(1), 2012.

[6] C. Chen, C. Vong, C. Wong, W. Wang, and P. Wong. Efficient extreme learning machine via very sparse random projection. Soft Computing, 22, 2018.

[7] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In Proceedings of the 32nd Annual International Conference on Machine Learning (ICML), pages 2285–2294, 2015.

[8] M. B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 278–287, 2016.

[9] M. B. Cohen, T. S. Jayram, and J. Nelson. Simple analyses of the sparse Johnson-Lindenstrauss transform. In Proceedings of the 1st Symposium on Simplicity in Algorithms (SOSA), pages 1–9, 2018.

[10] S. Dahlgaard, M. Knudsen, and M. Thorup. Practical hash functions for similarity estimation and dimensionality reduction. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), pages 6618–6628, 2017.

[11] B. Dalessandro. Bring the noise: Embracing randomness is the key to scaling up machine learning algorithms. Big Data, 1(2):110–112, 2013.

[12] A. Dasgupta, R. Kumar, and T. Sarlos. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 341–350, 2010.

[13] C. Freksen, L. Kamma, and K. G. Larsen. Fully understanding the hashing trick.
In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pages 5394–5404, 2018.

[14] M. Jagadeesan. Simple analysis of sparse, sign-consistent JL. In Proceedings of the 23rd International Conference on Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques (RANDOM), pages 61:1–61:20, 2019.

[15] T. S. Jayram and D. P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Transactions on Algorithms (TALG), Special Issue on SODA’11, volume 9, pages 1–26, 2013.

[16] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[17] D. M. Kane, R. Meka, and J. Nelson. Almost optimal explicit Johnson-Lindenstrauss families. In Proceedings of the 14th International Workshop and 15th International Conference on Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques (RANDOM), pages 628–639, 2011.

[18] D. M. Kane and J. Nelson. A derandomized sparse Johnson-Lindenstrauss transform. CoRR, abs/1006.3585, 2010.

[19] D. M. Kane and J. Nelson. Sparser Johnson-Lindenstrauss transforms. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1195–1206, 2012.

[20] R. Latała. Estimation of moments of sums of independent real random variables. Annals of Probability, 25(3):1502–1513, 1997.

[21] R. Latała. Tail and moment estimates for some types of chaos. Studia Mathematica, 135(1):39–53, 1999.

[22] P. Li, T. Hastie, and K. Church. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 287–296, 2006.

[23] C. Ma, J. Jung, S. Kim, and S. Ko.
Random projection-based partial feature extraction for robust face recognition. Neurocomputing, 149:1232–1244, 2015.

[24] J. Nelson and H. L. Nguyen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 117–126, 2013.

[25] J. Nelson and H. L. Nguyen. Sparsity lower bounds for dimensionality reducing maps. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pages 101–110, 2013.

[26] D. Newman. Bag of words data set. https://archive.ics.uci.edu/ml/datasets/Bag+of+Words, 2008.

[27] H. Song. Robust visual tracking via online informative feature selection. Electronics Letters, 50(25):1931–1932, 2014.

[28] S. Suthaharan. Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning, volume 36 of Integrated Series in Information Systems. Springer US, Boston, MA, 2016.

[29] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 1113–1120, 2009.