{"title": "Quantized Random Projections and Non-Linear Estimation of Cosine Similarity", "book": "Advances in Neural Information Processing Systems", "page_first": 2756, "page_last": 2764, "abstract": "Random projections constitute a simple, yet effective technique for dimensionality reduction with applications in learning and search problems. In the present paper, we consider the problem of estimating cosine similarities when the projected data undergo scalar quantization to $b$ bits. We here argue that the maximum likelihood estimator (MLE) is a principled approach to deal with the non-linearity resulting from quantization, and subsequently study its computational and statistical properties. A specific focus is on the on the trade-off between bit depth and the number of projections given a fixed budget of bits for storage or transmission. Along the way, we also touch upon the existence of a qualitative counterpart to the Johnson-Lindenstrauss lemma in the presence of quantization.", "full_text": "Quantized Random Projections and Non-Linear\n\nEstimation of Cosine Similarity\n\nPing Li\nRutgers University\npingli@stat.rutgers.edu\n\nMichael Mitzenmacher\nHarvard University\nmichaelm@eecs.harvard.edu\n\nMartin Slawski\nRutgers University\nmartin.slawski@rutgers.edu\n\nAbstract\n\nRandom projections constitute a simple, yet effective technique for dimensionality\nreduction with applications in learning and search problems. In the present paper,\nwe consider the problem of estimating cosine similarities when the projected\ndata undergo scalar quantization to b bits. We here argue that the maximum\nlikelihood estimator (MLE) is a principled approach to deal with the non-linearity\nresulting from quantization, and subsequently study its computational and statistical\nproperties. 
A speci\ufb01c focus is on the on the trade-off between bit depth and the\nnumber of projections given a \ufb01xed budget of bits for storage or transmission.\nAlong the way, we also touch upon the existence of a qualitative counterpart to the\nJohnson-Lindenstrauss lemma in the presence of quantization.\n\nIntroduction\n\n1\nThe method of random projections (RPs) is an important approach to linear dimensionality reduc-\ntion [23]. RPs have established themselves as an alternative to principal components analysis which\nis computationally more demanding. Instead of determining an optimal low-dimensional subspace\nvia a singular value decomposition, the data are projected on a subspace spanned by a set of directions\npicked at random (e.g. by sampling from the Gaussian distribution). Despite its simplicity, this\napproach comes with a theoretical guarantee: as asserted by the celebrated Johnson-Lindenstrauss\n(J-L) lemma [6, 12], k = O(log n/\u03b52) random directions are enough to preserve the squared distances\nbetween all pairs from a data set of size n up to a relative error of \u03b5, irrespective of the dimension d the\ndata set resides in originally. Inner products are preserved similarly. As a consequence, procedures\nonly requiring distances or inner products can be approximated in the lower-dimensional space,\nthereby achieving substantial reductions in terms of computation and storage, or mitigating the curse\nof dimensionality. The idea of RPs has thus been employed in linear learning [7, 19], fast matrix\nfactorization [24], similarity search [1, 9], clustering [2, 5], statistical testing [18, 22], etc.\nThe idea of data compression by RPs has been extended to the case where the projected data are\nadditionally quantized to b bits so as to achieve further reductions in data storage and transmission.\nThe extreme case of b = 1 is well-studied in the context of locality sensitive hashing [4]. 
More\nrecently, b-bit quantized random projections for b \u2265 1 have been considered from different perspec-\ntives. The paper [17] studies Hamming distance-based estimation of cosine similarity and linear\nclassi\ufb01cation when using a coding scheme that maps a real value to a binary vector of length 2b. It\nis demonstrated that for similarity estimation, taking b > 1 may yield improvements if the target\nsimilarity is high. The paper [10] is dedicated to J-L-type results for quantized RPs, considerably\nimproving over an earlier result of the same \ufb02avor in [15]. The work [15] also discusses the trade-off\nbetween the number of projections k and number of bits b per projection under a given budget of bits\nas it also appears in the literature on quantized compressed sensing [11, 14].\nIn the present paper, all of these aspects and some more are studied for an approach that can be\nsubstantially more accurate for small b (speci\ufb01cally, we focus on 1 \u2264 b \u2264 6) than those in [10, 17, 15].\nIn [10, 15] the non-linearity of quantization is ignored by treating the quantized data as if they had\nbeen observed directly. Such \u201clinear\u201d approach bene\ufb01ts from its simplicity, but it is geared towards\n\ufb01ne quantization, whereas for small b the bias resulting from quantization dominates. By contrast,\nthe approach proposed herein makes full use of the knowledge about the quantizer. As in [17] we\nsuppose that the original data set is contained in the unit sphere of Rd, or at least that the Euclidean\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fnorms of the data points are given. In this case, approximating distances boils down to estimating\ninner products (or cosine similarity) which can be done by maximum likelihood (ML) estimation\nbased on the quantized data. 
Several questions of interest can be addressed by considering the Fisher\ninformation of the maximum likelihood estimator (MLE). With regard to the aforementioned trade-off\nbetween k and b, it turns out that the choice b = 1 is optimal (in the sense of yielding maximum\nFisher information) as long as the underlying similarity is smaller than 0.2; as the latter increases, the\nmore effective it becomes to increase b. By considering the rate of growth of the Fisher information\nnear the maximum similarity of one, we discover a gap between the \ufb01nite bit and in\ufb01nite bit case with\nrates of \u0398((1 \u2212 \u03c1\u2217)\u22123/2) and \u0398((1 \u2212 \u03c1\u2217)\u22122), respectively, where \u03c1\u2217 denotes the target similarity.\nAs an implication, an exact equivalent of the J-L lemma does not exist in the \ufb01nite bit case.\nThe MLE under study does not have a closed form solution. We show that it is possible to approximate\nthe MLE by a non-iterative scheme only requiring pre-computed look-up tables. Derivation of this\nscheme lets us draw connections to alternatives like the Hamming distance-based estimator in [17].\nWe present experimental results concerning applications of the proposed approach in nearest neighbor\nsearch and linear classi\ufb01cation. In nearest neighbor search, we focus on the high similarity regime and\ncon\ufb01rm theoretical insights into the trade-off between k and b. For linear classi\ufb01cation, we observe\nempirically that intermediate values of b can yield better trade-offs than single-bit quantization.\nNotation. We let [d] = {1, . . . , d}. I(P ) denotes the indicator function of expression P . For a\nfunction f (\u03c1), we use \u02d9f (\u03c1) and \u00a8f (\u03c1) for its \ufb01rst resp. second derivative. P\u03c1 and E\u03c1 denote probabil-\nity/expectation w.r.t. 
a zero mean, unit variance bivariate normal distribution with correlation \u03c1.\n\nSupplement: Proofs and additional experimental results can be found in the supplement.\n2 Quantized random projections, properties of the MLE, and implications\nWe start by formally introducing the setup, the problem and the approach that is taken before\ndiscussing properties of the MLE in this speci\ufb01c case, along with important implications.\nSetup. Let X = {x1, . . . , xn} \u2282 Sd\u22121, where Sd\u22121 := {x \u2208 Rd : (cid:107)x(cid:107)2 = 1} denotes the unit\nsphere in Rd, be a set of data points. We think of d being large. As discussed below, the requirement\nof having all data points normalized to unit norm is not necessary, but it simpli\ufb01es our exposition\nconsiderably. Let x, x(cid:48) be a generic pair of elements from X and let \u03c1\u2217 = (cid:104)x, x(cid:48)(cid:105) denote their inner\nproduct. Alternatively, we may refer to \u03c1\u2217 as (cosine) similarity or correlation. Again for simplicity,\nwe assume that 0 \u2264 \u03c1\u2217 < 1; the case of negative \u03c1\u2217 is a trivial extension because of symmetry.\nWe aim at reducing the dimensionality of the given data set by means of a random projection, which\nis realized by sampling a random matrix A of dimension k by d whose entries are i.i.d. N (0, 1)\n(i.e., zero-mean Gaussian with unit variance). Applying A to X yields Z = {zi}n\ni=1 \u2282 Rk with\nzi = Axi, i \u2208 [n]. Subsequently, the projected data points {zi}n\ni=1 are subject to scalar quantization.\nA b-bit scalar quantizer is parameterized by 1) thresholds t = (t1, . . . , tK\u22121) with 0 = t0 < t1 <\n. . . < tK\u22121 < tK = +\u221e inducing a partitioning of the positive real line into K = 2b\u22121 intervals\n{[tr\u22121, tr), r \u2208 [K]} and 2) a codebook M = {\u00b51, . . . , \u00b5K} with code \u00b5r representing interval\n[tr\u22121, tr), r \u2208 [K]. 
Given t and M, the scalar quantizer (or quantization map) is de\ufb01ned by\n\nK(cid:88)\n\nr=1\n\nQ : R \u2192 M\u00b1 := \u2212M \u222a M,\n\nz (cid:55)\u2192 Q(z) = sign(z)\n\n\u00b5rI(|z| \u2208 [tr\u22121, tr))\n\ni=1 \u2282 (M\u00b1)k, qi = ( Q(zij) )k\n\n2 = 2(1 \u2212 \u03c1\u2217). If z, z(cid:48) were given, it would be standard to use 1\n\n(1)\nThe projected, b-bit quantized data result as Q = {qi}n\nj=1, i \u2208 [n].\nProblem statement. Let z, z(cid:48) and q, q(cid:48) denote the pairs corresponding to x, x(cid:48) in Z respectively\nQ. The goal is to estimate \u03c1\u2217 = (cid:104)x, x(cid:48)(cid:105) from q, q(cid:48) which automatically yields an estimate of\nk (cid:104)z, z(cid:48)(cid:105) as an unbiased\n(cid:107)x \u2212 x(cid:48)(cid:107)2\nestimator of \u03c1\u2217. This \"linear\" approach is commonly adopted when the data undergo uniform\nquantization with saturation level T (i.e., tr = T \u00b7 r/(K \u2212 1), \u00b5r = (tr \u2212 tr\u22121)/2, r \u2208 [K \u2212 1],\nk (cid:104)z, z(cid:48)(cid:105) which in turn is sharply\n\u00b5K = T ), based on the rationale that as b \u2192 \u221e, 1\nconcentrated around its expectation \u03c1\u2217.\nk (cid:104)q, q(cid:48)(cid:105) has a\nThere are two major concerns about this approach. First, for \ufb01nite b the estimator 1\nbias resulting from the non-linearity of Q that does not vanish as k \u2192 \u221e. For small b, the effect of\nthis bias is particularly pronounced. 
Lloyd-Max quantization (see Proposition 1 below) in place of\n\nk (cid:104)q, q(cid:48)(cid:105) \u2192 1\n\n2\n\n\fp1 = P\u03c1(Z \u2208 (0, t1], Z(cid:48) \u2208 (0, t1])\np2 = P\u03c1(Z \u2208 (0, t1], Z(cid:48) \u2208 (t1,\u221e))\np3 = P\u03c1(Z \u2208 (t1,\u221e), Z(cid:48) \u2208 (t1,\u221e))\np4 = P\u2212\u03c1(Z \u2208 (0, t1], Z(cid:48) \u2208 (0, t1])\np5 = P\u2212\u03c1(Z \u2208 (0, t1], Z(cid:48) \u2208 (t1,\u221e))\np6 = P\u2212\u03c1(Z \u2208 (t1,\u221e), Z(cid:48) \u2208 (t1,\u221e))\n\nk((cid:98)\u03c1MLE \u2212 \u03c1\u2217)2 for b = 3 (averaged over 104 i.i.d. data sets with k = 100) compared to the inverse\n\nFigure 1: (L, M): Partitioning into cells for b = 2 and cell probabilities. (R): Empirical MSE\ninformation. The disagreement for \u03c1 \u2264 0.2 results from positive truncation of the MLE at zero.\n\nuniform quantization provides some remedy, but the issue of non-vanishing bias remains. Second,\neven for in\ufb01nite b, the approach is statistically not ef\ufb01cient. In order to see this, note that\n\n{(zj, z(cid:48)\n\nj)}k\n\nj=1\n\ni.i.d.\u223c (Z, Z(cid:48)), where (Z, Z(cid:48)) \u223c N2\n\n0,\n\n.\n\n(2)\n\n(cid:18)\n\n(cid:18) 1\n\n\u03c1\u2217\n\n(cid:19)(cid:19)\n\n\u03c1\u2217\n1\n\nk(cid:89)\n\nIt is shown in [16] that the MLE of \u03c1\u2217 under the above bivariate normal model has a variance of\n(1 \u2212 \u03c12\u2217)2/{k (1 + \u03c12\u2217)}, while Var((cid:104)z, z(cid:48)(cid:105) /k) = (1 + \u03c12\u2217)/k which is a substantial difference for\nlarge \u03c1\u2217. The higher variance results from not using the information that the components of z and z(cid:48)\nhave unit variance [16]. In conclusion, the linear approach as outlined above suffers from noticeable\nbias/and or high variance if the similarity \u03c1\u2217 is high, and it thus makes sense to study alternatives.\nMaximum likelihood estimation of \u03c1\u2217. 
We here propose the MLE in place of the linear approach.\nThe advantage of the MLE is that it can have substantially better statistical performance as the\nquantization map is explicitly taken into account. The MLE is based on bivariate normality according\nto (2). The effect of quantization is identical to that of what is known as interval censoring in statistics,\ni.e., in place of observing a speci\ufb01c value, one only observes that the datum is contained in an interval.\nThe concept is easiest to understand in the case of one-bit quantization. For any j \u2208 [k], each of\nthe four possible outcomes of (qj, q(cid:48)\nj) corresponds to one of the four orthants of R2. By symmetry,\nthe probability of (qj, q(cid:48)\nj) falling into the positive or into the negative orthant are identical; both\ncorrespond to a \u201ccollision\u201d, i.e., to the event {qj = q(cid:48)\nj) falling\nj}.\ninto one of the remaining two orthants are identical, corresponding to a disagreement {qj (cid:54)= q(cid:48)\nAccordingly, the likelihood function in \u03c1 is given by\n\nj}. Likewise, the probability of (qj, q(cid:48)\n\n{\u03c0(\u03c1)I(qj =q(cid:48)\n\nj )(1 \u2212 \u03c0(\u03c1))I(qj(cid:54)=q(cid:48)\n\nj )},\n\n\u03c0(\u03c1) := P\u03c1(sign(Z) = sign(Z(cid:48))),\n\nj=1\n\nwhere \u03c0(\u03c1) denotes the probability of a collision after quantization for (Z, Z(cid:48)) as in (2) with \u03c1\u2217\n\nreplaced by \u03c1. It is straightforward to show that the MLE is given by(cid:98)\u03c1MLE = cos(\u03c0(1 \u2212(cid:98)\u03c0)), where\n\u03c0 is the circle constant and(cid:98)\u03c0 = k\u22121(cid:80)k\nthat the expression for(cid:98)\u03c1MLE follows the same rationale as used for the simhash in [4].\n\nj) is the empirical counterpart to \u03c0(\u03c1). We note\n\nj=1 I(qj = q(cid:48)\n\nWith these preparations, it is not hard to see how the MLE generalizes to cases with more than one\nbit. 
For b = 2, there is a single non-trivial threshold t1 that yields a partitioning of the real axis into\nfour bins and accordingly a component (qj, q(cid:48)\nj) of a quantized pair can fall into 16 possible cells\n(rectangles), cf. Figure 1. By orthant symmetry and symmetries within each orthant, one ends up\nwith six distinct probabilities p1, . . . , p6 for (qj, q(cid:48)\nj) falling into one of those cells depending on \u03c1.\nWeighting those probabilities according to the number of their occurrences in the left part of Figure 1,\nwe end up with probabilities \u03c01 = \u03c01(\u03c1), . . . , \u03c06 = \u03c06(\u03c1) that sum up to one. The corresponding\nj=1 form a suf\ufb01cient statistic for \u03c1. For\ngeneral b, we have 22bcells and L = K(K + 1) (recall that K = 2b\u22121) distinct probabilities, so\nthat L = 20, 72, 272, 1056 for b = 3, . . . , 6. This yields the following compact expressions for the\n\nrelative cell frequencies(cid:98)\u03c01, . . . ,(cid:98)\u03c06 resulting from (qj, q(cid:48)\n\nj)k\n\n3\n\np1p2p2p4p5p5p4p5p5p1p2p2p3p3p6p600.20.40.60.8100.20.40.60.811.21.4\u03c1 empirical MSEI\u22121(\u03c1)b = 3\f(\u03c1)/I\u22121\n\nFigure 2: b \u00b7 I\u22121\n1 (\u03c1) vs. \u03c1 for different choices of t: Lloyd-Max and uniform quantization\nwith saturation levels T0.9, T0.95, T0.99, cf. \u00a74.1 for a de\ufb01nition. The latter are better suited for high\nsimilarity. The differences become smaller as b increases. Note that for b = 6, \u03c1 > 0.7 is required for\neither quantization scheme to achieve a better trade-off than the one-bit MLE.\n\nb\n\nnegative log-likelihood l(\u03c1) and the Fisher information I(\u03c1) = E\u03c1[\u00a8l(\u03c1)] (up to a factor of k)\n\nL(cid:88)\n\n(cid:98)\u03c0(cid:96) log(\u03c0(cid:96)(\u03c1)),\n\nL(cid:88)\n\n( \u02d9\u03c0(cid:96)(\u03c1))2\n\u03c0(cid:96)(\u03c1)\n\n.\n\n(3)\n\nl(\u03c1) =\n\nI(\u03c1) =\n\n(cid:96)=1\n\n(cid:96)=1\n\nThe information I(\u03c1) is of particular interest. 
By classical statistical theory [21], {E[(cid:98)\u03c1MLE] \u2212 \u03c1\u2217}2 =\nO(1/k2), Var((cid:98)\u03c1MLE) = I\u22121(\u03c1)/k, E[((cid:98)\u03c1MLE \u2212 \u03c1\u2217)2] = I\u22121(\u03c1)/k + O(1/k2) as k \u2192 \u221e. While\n(cid:98)\u03c1MLE in subsequent analysis.\n\nthis is an asymptotic result, it agrees to a good extent with what one observes for \ufb01nite, but not too\nsmall samples, cf. Figure 1. We therefore treat the inverse information as a proxy for the accuracy of\n\nRemark. We here brie\ufb02y address the case of known, but possibly non-unit norms, i.e., (cid:107)x(cid:107)2 = \u03c3x,\n(cid:107)x(cid:48)(cid:107)2 = \u03c3x(cid:48). This can be handled by re-scaling the thresholds of the quantizer (1) by \u03c3x resp. \u03c3x(cid:48),\nestimating \u03c1\u2217 based on q, q(cid:48) as in the unit norm case, and subsequently re-scaling the estimate by\n\u03c3x\u03c3x(cid:48) to obtain an estimate of (cid:104)x, x(cid:48)(cid:105). The assumption that the norms are known is not hard to satisfy\nin practice as they can be computed by one linear scan during data collection. With a limited bit\nbudget, the norms additionally need to be quantized. It is unclear how to accurately estimate them\nfrom quantized data (for b = 1, it is de\ufb01nitely impossible).\nChoice of the quantizer. Equipped with the Fisher information (3), one of the questions that can\nbe addressed is quantizer design. Note that as opposed to the linear approach, the speci\ufb01c choice\nof the {\u00b5r}K\nr=1 in (1) is not important as ML estimation only depends on cell frequencies but not\non the values associated with the intervals {(tr\u22121, tr]}K\nr=1. The thresholds t, however, turn out to\nhave a considerable impact, at least for small b. An optimal set of thresholds can be determined by\nminimizing the inverse information I\u22121(\u03c1; t) w.r.t. t for \ufb01xed \u03c1. As the underlying similarity is not\nknown, this may not seem practical. 
On the other hand, prior knowledge about the range of \u03c1 may be\navailable, or the closed form one-bit estimator can be used as pilot estimator. For \u03c1 = 0, the optimal\nset of thresholds coincide with those of Lloyd-Max quantization [20].\nProposition 1. Let g \u223c N (0, 1) and consider Lloyd-Max quantization given by\n\n(t\u2217,{\u00b5\u2217\n\nr}K\nr=1) = argmin\nt,{\u00b5r}K\n\nr=1\n\nE[{g \u2212 Q(g; t,{\u00b5r}K\n\nr=1)}2]. We also have t\u2217 = argmin\n\nI\u22121(0; t).\n\nt\n\n(cid:96)=1 and their derivatives { \u02d9\u03c0(cid:96)(\u03c1; t)}L\n\nThe Lloyd-Max problem can be solved numerically by means of an alternating scheme which can\nbe shown to converge to a global optimum [13]. For \u03c1 > 0, an optimal set of thresholds can be\ndetermined by general procedures for nonlinear optimization. Evaluation of I\u22121(\u03c1; t) requires\ncomputation of the probabilities {\u03c0(cid:96)(\u03c1; t)}L\n(cid:96)=1. The latter are\navailable in closed form (cf. supplement), while for the former specialized numerical integration\nprocedures [8] can be used. In order to avoid multi-dimensional optimization, it makes sense to\ncon\ufb01ne oneself to thresholds of the form tr = T \u00b7 r/(K \u2212 1), r \u2208 [K \u2212 1], so that only T needs to\nbe optimized. Even though the Lloyd-Max scheme performs reasonably also for large values of \u03c1,\nthe one-parameter scheme may still yield signi\ufb01cant improvements in that case, cf. Figure 2. Once\nb \u2265 5, the differences between the two schemes become marginal.\nTrade-off between k and b. Suppose we are given a \ufb01xed budget of bits B = k \u00b7 b for transmission\nor storage, and we are in free choosing b. 
The optimal choice of b can be determined by comparing\n\n4\n\n00.20.40.60.810.40.60.811.21.41.61.8b = 2 \u03c1b\u2217I\u22121b(\u03c1)/I\u221211(\u03c1) Lloyd-MaxT0.9T0.95T0.9900.20.40.60.8100.20.40.60.811.21.41.61.82b = 4 \u03c1b\u2217I\u22121b(\u03c1)/I\u221211(\u03c1) Lloyd-MaxT0.9T0.95T0.9900.20.40.60.8100.511.522.5b = 6 \u03c1b\u2217I\u22121b(\u03c1)/I\u221211(\u03c1) Lloyd-MaxT0.9T0.95T0.99\fb\n\nb\n\nb\n\n(\u03c1) vs. \u03c1.\n\n(\u03c1) vs. \u03c1 for 1 \u2264 b \u2264 6 with t chosen by Lloyd-Max.\n\nschemes above. Since the mean squared error of(cid:98)\u03c1MLE decays with 1/k for any b, for b(cid:48) with b(cid:48) > b\n\nFigure 3: Trade-off between k and b. (L): b\u00b7 I\u22121\n(M): Zoom into the range 0.9 \u2264 \u03c1 \u2264 1. (R): choice of b minimizing b \u00b7 I\u22121\nthe inverse Fisher information I\u22121\n(\u03c1) for changing b with t chosen according to either of the two\nto be more ef\ufb01cient than b at the bit scale it, is required that Ib(cid:48)(\u03c1)/Ib(\u03c1) > b(cid:48)/b as with the smaller\nchoice b one would be allowed to increase k by a factor of b(cid:48)/b. Again, this comparison is dependent\non a speci\ufb01c \u03c1. From Figure 3, however, one can draw general conclusions: for \u03c1 < 0.2, it does not\npay off to increase b beyond one; as \u03c1 increases, higher values of b achieve a better trade-off with\neven b = 6 being the optimal choice for \u03c1 > 0.98. The intuition is that two points of high similarity\nagree on their \ufb01rst signi\ufb01cant bit for most coordinates, in which case increasing the number of bits\nbecomes bene\ufb01cial. This \ufb01nding is particularly relevant to (near-)duplicate detection/nearest neighbor\nsearch where high similarities prevail, an application investigated in \u00a74.\nRate of growth of the Fisher information near \u03c1 = 1. 
Interestingly, we do not observe a \u201csaturation\u201d\neven for b = 6 in the sense that for \u03c1 close enough to 1, one can still achieve an improvement\nat the bit scale compared to 1 \u2264 b \u2264 5. This raises the question about the rate of growth of\nthe Fisher information near one relative to the full precision case (b \u2192 \u221e). As shown in [16]\nI\u221e(\u03c1) = (1 + \u03c12)/(1 \u2212 \u03c12)2 = \u0398((1 \u2212 \u03c1)\u22122) as \u03c1 \u2192 1. As stated below, in the \ufb01nite bit case, the\nexponent is only 3/2 for all b. This is a noticeable gap.\nTheorem 1. For 1 \u2264 b < \u221e, we have I(\u03c1) = \u0398((1 \u2212 \u03c1)\u22123/2) as \u03c1 \u2192 1.\nThe theorem has an interesting implication with regard to the existence of a Johnson-Lindenstrauss\n(J-L)-type result for quantized random projections. In a nutshell, the J-L lemma states that as long as\nk = \u2126(log n/\u03b52), with high probability we have that\n\n(1 \u2212 \u03b5)(cid:107)xi \u2212 xj(cid:107)2\n\n2 \u2264 (cid:107)zi \u2212 zj(cid:107)2\n\n2/k \u2264 (1 + \u03b5)(cid:107)xi \u2212 xj(cid:107)2\n\n2 for all pairs (i, j),\n\ni.e., the distances of the data in X are preserved in Z up to a relative error of \u03b5. In our setting, one\nwould hope for an equivalent of the form\n\n(4)\nMLE denotes the MLE for \u03c1ij given quantized RPs. The\nstandard proof of the J-L lemma [6] combines norm preservation for each individual pair of the form\n\n(1 \u2212 \u03b5)2(1 \u2212 \u03c1ij) \u2264 2(1 \u2212(cid:98)\u03c1ij\nwhere \u03c1ij = (cid:104)xi, xj(cid:105), i, j \u2208 [n], and (cid:98)\u03c1ij\nwith a union bound. 
Such a concentration result does not appear to be attainable for(cid:98)\u03c1MLE \u2212 \u03c1\u2217, not\n2 \u2264 (cid:107)zi \u2212 zj(cid:107)2\neven asymptotically as k \u2192 \u221e in which case(cid:98)\u03c1MLE \u2212 \u03c1\u2217 is asymptotically normal with mean zero\n\nMLE) \u2264 (1 + \u03b5)2(1 \u2212 \u03c1ij) \u2200(i, j) as long as k = \u2126(log n/\u03b52),\n\nand variance I\u22121(\u03c1\u2217)/k. This yields an asymptotic tail bound of the form\n\n2/k \u2264 (1 + \u03b5)(cid:107)xi \u2212 xj(cid:107)2\n\n2) \u2264 2 exp(\u2212k\u0398(\u03b52))\n\nP((1 \u2212 \u03b5)(cid:107)xi \u2212 xj(cid:107)2\n\nP(|(cid:98)\u03c1MLE \u2212 \u03c1\u2217| > \u03b4) \u2264 2 exp(\u2212\u03b42k/{2I\u22121(\u03c1\u2217)}).\n\n(5)\nFor a result of the form (4), which is about relative distance preservation, one would need to choose \u03b4\nproportional to \u03b5(1\u2212 \u03c1\u2217). In virtue of Theorem 1, I\u22121(\u03c1\u2217) = \u0398((1\u2212 \u03c1\u2217)3/2) as \u03c1\u2217 \u2192 1 so that with\n\u03b4 chosen in that way the exponent in (5) would vanish as \u03c1\u2217 \u2192 1. By constrast, the required rate of\ndecay of I\u22121(\u03c1\u2217) is achieved in the full precision case. Given the asymptotic optimality of the MLE\naccording to the Cramer-Rao lower bound suggests that a qualitative counterpart to the J-L lemma (4)\nis out of reach. Weaker versions in which the required lower bound on k would depend inversely on\nthe minimum distance of points in X are still possible. Similarly, a weaker result of the form\nMLE) \u2264 2(1 \u2212 \u03c1ij) + \u03b5 \u2200(i, j) as long as k = \u2126(log n/\u03b52),\n\n2(1 \u2212 \u03c1ij) \u2212 \u03b5 \u2264 2(1 \u2212(cid:98)\u03c1ij\n\nis known to hold already in the one-bit case and follows immediately from the closed form expression\nof the MLE, Hoeffdings\u2019s inequality, and the union bound; cf. e.g. 
[10].\n\n5\n\n00.20.40.60.81\u03c101234567b\u2217I\u22121b(\u03c1)all\u03c1b=1b=2b=3b=4b=5b=60.90.920.940.960.981\u03c100.050.10.150.20.25b\u2217I\u22121b(\u03c1)\u03c1\u22650.9b=1b=2b=3b=4b=5b=600.20.40.60.81123456\u03c1optimal b\f3 A general class of estimators and approximate MLE computation\nA natural concern about the MLE relative to the linear approach is that it requires optimization via an\niterative scheme. The optimization problem is smooth, one-dimensional and over the unit interval,\nhence not challenging for modern solvers. However, in applications it is typically required to compute\nthe MLE many times, hence avoiding an iterative scheme for optimization is worthwhile. In this\nsection, we introduce an approximation to the MLE that only requires at most two table look-ups.\n\nA general class of estimators. Let \u03c0(\u03c1) = (\u03c01(\u03c1), . . . , \u03c0L(\u03c1))(cid:62),(cid:80)L\n\n(cid:96)=1 \u03c0(cid:96)(\u03c1) = 1, be the normal-\nized cell frequencies depending on \u03c1 as de\ufb01ned in \u00a72, let further w \u2208 RL be a \ufb01xed vector of weights,\nand consider the map \u03c1 (cid:55)\u2192 \u03b8(\u03c1; w) := (cid:104)\u03c0(\u03c1), w(cid:105). If (cid:104) \u02d9\u03c0(\u03c1), w(cid:105) > 0 uniformly in \u03c1 (such w always\nexist), \u03b8(\u00b7; w) is increasing and has an inverse \u03b8\u22121(\u00b7 ; w). We can then consider the estimator\n\ntwo-fold application of the continuous mapping theorem. By choosing w such that w(cid:96) = 1 for (cid:96)\n\nwhere we recall that(cid:98)\u03c0 = ((cid:98)\u03c0, . . . 
,(cid:98)\u03c0L)(cid:62) are the empirical cell frequencies given quantized data q, q(cid:48).\nIt is easy to see that(cid:98)\u03c1w is a consistent estimator of \u03c1\u2217: we have(cid:98)\u03c0 \u2192 \u03c0(\u03c1\u2217) in probability by the\nlaw of large numbers, and \u03b8\u22121((cid:104)(cid:98)\u03c0, w(cid:105) ; w) \u2192 \u03b8\u22121((cid:104)\u03c0(\u03c1\u2217), w(cid:105) ; w) = \u03b8\u22121(\u03b8(\u03c1\u2217; w); w) = \u03c1\u2217 by\ncorresponding to cells contained in the positive/negative orthant and w(cid:96) = \u22121 otherwise,(cid:98)\u03c1w becomes\nthe one-bit MLE. By choosing w(cid:96) = 1 for diagonal cells (cf. Figure 1) corresponding to a collision\nAlternatively, we may choose w such that the asymptotic variance of(cid:98)\u03c1w is minimized.\nj} and w(cid:96) = 0 otherwise, we obtain the Hamming distance-based estimator in [17].\nevent {qj = q(cid:48)\nTheorem 2. For any w s.t. \u02d9\u03c0(\u03c1\u2217)(cid:62)w (cid:54)= 0, we have Var((cid:98)\u03c1w) = V (w; \u03c1\u2217)/k + O(1/k2) as k \u2192 \u221e,\n\n(cid:98)\u03c1w = \u03b8\u22121((cid:104)(cid:98)\u03c0, w(cid:105) ; w),\n\nV (w; \u03c1\u2217) = (w(cid:62)\u03a3(\u03c1\u2217)w)/{ \u02d9\u03c0(\u03c1\u2217)(cid:62)w}2, \u03a3(\u03c1\u2217) := \u03a0(\u03c1\u2217) \u2212 \u03c0(\u03c1\u2217)\u03c0(\u03c1\u2217)(cid:62),\n\n(6)\n\nand \u03a0(\u03c1\u2217) := diag( (\u03c0(cid:96)(\u03c1\u2217))L\n\n(cid:96)=1 ). Moreover, let w\u2217 = \u03a0\u22121(\u03c1\u2217) \u02d9\u03c0(\u03c1\u2217). Then:\n\nV (w\u2217; \u03c1\u2217) = I\u22121(\u03c1\u2217),\n\nargminw V (w; \u03c1\u2217) = {\u03b1(w\u2217 + c1), \u03b1 (cid:54)= 0, c \u2208 R},\n\nTheorem 2 yields an expression for the optimal weights w\u2217 = \u03a0\u22121(\u03c1\u2217) \u02d9\u03c0(\u03c1\u2217). 
This optimal choice is\non the choice w = w\u2217 achieves asymptotically the same statistical performance as the MLE.\n\nand E[((cid:98)\u03c1w\u2217 \u2212 \u03c1\u2217)2] = E[((cid:98)\u03c1MLE \u2212 \u03c1\u2217)2] + O(1/k2).\nunique up to translation by a multiple of the constant vector 1 and scaling. The estimator(cid:98)\u03c1w\u2217 based\nApproximate computation. The estimator (cid:98)\u03c1w\u2217 is not operational as the optimal choice of the\nweights depends on the estimand itself. This issue can be dealt with by using a pilot estimator(cid:98)\u03c10 like\nthe one-bit MLE, the Hamming distance-based estimator in [17] or(cid:98)\u03c10 =(cid:98)\u03c1w, where w =(cid:82) 1\nestimator, we may then replace w\u2217 by w((cid:98)\u03c10) and use(cid:98)\u03c1w((cid:98)\u03c10) as a proxy for(cid:98)\u03c1w\u2217 which achieves the\nA second issue is that computation of(cid:98)\u03c1w (6) entails inversion of the function \u03b8(\u00b7; w). The inverse\ncomputing(cid:98)\u03c1w((cid:98)\u03c10), the weights depends on the data via the pilot estimator. We thus need to tabulate\n\nmay not be de\ufb01ned in general, but for the choices of w that we have in mind, this is not a concern\n(cf. supplement). Inversion of \u03b8(\u00b7; w) can be carried out with tolerance \u03b5 by tabulating the function\nvalues on a uniform grid of cardinality (cid:100)1/\u03b5(cid:101) and performing a table lookup for each query. When\n\n0 w(\u03c1) d\u03c1\naverages the expression w(\u03c1) = \u03a0\u22121(\u03c1) \u02d9\u03c0(\u03c1) for the optimal weights over \u03c1. Given the pilot\n\nsame statistical performance asymptotically.\n\nw(\u03c1) on a grid, too. Accordingly, a whole set of look-up tables is required for function inversion, one\nfor each set of weights. Given parameters \u03b5, \u03b4 > 0, a formal description of our scheme is as follows.\n1. 
Set R = (cid:100)1/\u03b5(cid:101), \u03c1r = r/R, r \u2208 [R], and B = (cid:100)1/\u03b4(cid:101), \u03c1b = b/B, b \u2208 [B].\n2. Tabulate w(\u03c1b), b \u2208 [B], and function values \u03b8(\u03c1r; w(\u03c1b)) = (cid:104)w(\u03c1b), \u03c0(\u03c1r)(cid:105), r \u2208 [R], b \u2208 [B].\n\nSteps 1. and 2. constitute a one-time pre-processing. Given data q, q(cid:48), we proceed as follows.\n\n3. Obtain(cid:98)\u03c0 and the pilot estimator(cid:98)\u03c10 = \u03b8\u22121((cid:104)(cid:98)\u03c0, w(cid:105) ; w), with w de\ufb01ned in the previous paragraph.\n4. Return(cid:98)\u03c1 = \u03b8\u22121((cid:104)(cid:98)\u03c0, w((cid:101)\u03c10)(cid:105) ; w((cid:101)\u03c10)), where(cid:101)\u03c10 is the value closest to(cid:98)\u03c10 among the {\u03c1b}.\n\nStep 2. requires about C = (cid:100)1/\u03b5(cid:101) \u00b7 (cid:100)1/\u03b4(cid:101) \u00b7 L computations/storage. From experimental results we\n\ufb01nd that \u03b5 = 10\u22124 and \u03b4 = .02 appear suf\ufb01cient for practical purposes, which is still manageable\neven for b = 6 with L = 1056 cells in which case C \u2248 5 \u00d7 108. Again, this cost is occurred\nlookups. By organizing computations ef\ufb01ciently, the frequencies(cid:98)\u03c0 can be obtained from one pass\nonly once independent of the data. The function inversions in steps 3. and 4. are replaced by table\nover (qj \u00b7 q(cid:48)\nj), j \u2208 [k]. Equipped with the look-up tables, estimating the similarity of two points\nrequires O(k + L + log(1/\u03b5)) \ufb02ops which is only slightly more than a linear scheme with O(k).\n\n6\n\n\fFigure 4: Average fraction of K = 10 nearest neighbors retrieved vs. total # of bits (log2 scale) for\n1 \u2264 b \u2264 6. b = \u221e (dashed) represents the MLE based on unquantized data, with k as for b = 6. 
The oracle curve (dotted) corresponds to $b = \infty$ with maximum $k$ (i.e., as for $b = 1$).

4 Experiments

We here illustrate the approach outlined above in nearest neighbor search and linear classification. The focus is on the trade-off between $b$ and $k$, in particular in the presence of high similarity.

4.1 Nearest Neighbor Search

Finding the most similar data points for a given query is a standard task in information retrieval. Another application is nearest neighbor classification. We here investigate how the performance of our approach is affected by the choice of $k$, $b$, and the quantization scheme. Moreover, we compare to two baseline competitors: the Hamming distance-based approach in [17] and the linear approach, in which the quantized data are treated like the original unquantized data. For the approach in [17], the similarity of the quantized data is measured in terms of their Hamming distance $\sum_{j=1}^k I(q_j \neq q'_j)$.

Synthetic data. We generate $k$ i.i.d. samples of Gaussian data, where each sample $X = (X_0, X_1, \ldots, X_{96})$ is generated as $X_0 \sim N(0, 1)$, $X_j = \rho_j X_0 + (1 - \rho_j^2)^{1/2} Z_j$, $1 \le j \le 96$, where the $\{Z_j\}_{j=1}^{96}$ are i.i.d. $N(0, 1)$ and independent of $X_0$. We have $\mathbb{E}[(X_0 - X_j)^2] = 2(1 - \rho_j)$, where $\rho_j = \min\{0.8 + (j-1) \cdot 0.002,\ 0.99\}$, $1 \le j \le 96$. The data thus generated subsequently undergo $b$-bit quantization, for $1 \le b \le 6$. Regarding the number of samples, we let $k \in \{2^6/b, 2^7/b, \ldots, 2^{13}/b\}$, which yields bit budgets between $2^6$ and $2^{13}$ for all $b$. The goal is to recover the $K$ nearest neighbors of $X_0$ according to the $\{\rho_j\}$, i.e., $X_{96}$ is the nearest neighbor, etc. The purpose of this specific setting is to mimic the use of quantized random projections in the situation of a query $x_0$ and data points $\mathcal{X} = \{x_1, \ldots, x_{96}\}$ having cosine similarities $\{\rho_j\}_{j=1}^{96}$ with the query.
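The synthetic data generation above can be sketched in a few lines of NumPy (a minimal version; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 2**13   # number of samples at the largest bit budget for b = 1

# rho_j = min{0.8 + (j - 1) * 0.002, 0.99} for j = 1, ..., 96
rho = np.minimum(0.8 + np.arange(96) * 0.002, 0.99)

X0 = rng.standard_normal(k)                        # X_0 ~ N(0, 1)
Z = rng.standard_normal((k, 96))                   # i.i.d. N(0, 1), independent of X_0
X = rho * X0[:, None] + np.sqrt(1 - rho**2) * Z    # X_j = rho_j X_0 + (1 - rho_j^2)^{1/2} Z_j

# sanity checks: X_96 is the nearest neighbor of X_0, with cosine similarity ~0.99,
# and E[(X_0 - X_j)^2] = 2 (1 - rho_j)
sim = (X0 @ X[:, 95]) / (np.linalg.norm(X0) * np.linalg.norm(X[:, 95]))
print(sim, ((X0 - X[:, 0]) ** 2).mean())
```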
Real data. We consider the Farm Ads data set ($n = 4{,}143$, $d = 54{,}877$) from the UCI repository and the RCV1 data set ($n = 20{,}242$, $d = 47{,}236$) from the LIBSVM webpage [3]. For both data sets, each instance is normalized to unit norm. As queries we select all data points whose first neighbor has (cosine) similarity less than 0.999, whose tenth neighbor has similarity at least 0.8, and whose hundredth neighbor has similarity less than 0.5. These restrictions allow for a clearer presentation of our results. Prior to nearest neighbor search, $b$-bit quantized random projections are applied to the data, where the ranges for $b$ and for the number of projections $k$ are as for the synthetic data.

Quantization. Four different quantization schemes are considered: Lloyd–Max quantization, and quantization with thresholds $t_r = T_\rho \cdot r/(K - 1)$, $r \in [K - 1]$, where $T_\rho$ is chosen to minimize $I^{-1}(\rho)$; we consider $\rho \in \{0.9, 0.95, 0.99\}$. For the linear approach, we choose $\mu_r = \mathbb{E}[g \,|\, g \in (t_{r-1}, t_r)]$, $r \in [K]$, where $g \sim N(0, 1)$. For our approach and that in [17], the specific choice of the $\{\mu_r\}$ is not important.

Evaluation. We perform 100 (synthetic data) respectively 20 (real data) independent replications. We then inspect the top $K$ neighbors for $K \in \{3, 5, 10\}$ returned by the methods under consideration, and for each $K$ we report the average fraction of true $K$ nearest neighbors retrieved over the replications, where for the real data we also average over the chosen queries (366 for Farm Ads and 160 for RCV1).

The results of our experiments point to several conclusions that can be summarized as follows. One-bit quantization is consistently outperformed by higher-bit quantization. The optimal choice of $b$ depends on the underlying similarities, and interacts with the choice of the thresholds $t$.
It is an encouraging result that the performance based on full-precision data (with $k$ as for $b = 6$) can essentially be matched when quantized data are used. For $b = 2$, the performance of the MLE is only marginally better than that of the approach based on the Hamming distance. The superiority of the former becomes apparent once $b \ge 4$, which is expected: for increasing $b$, the Hamming distance is statistically inefficient as it only uses the information whether a pair of quantized values agrees or disagrees. Some of these findings are reflected in Figures 4 and 5. We refer to the supplement for additional figures.

Figure 5: Average fraction of $K = 10$ nearest neighbors retrieved vs. total # of bits ($\log_2$ scale) of our approach (MLE) relative to the approach based on the Hamming distance and the linear approach, for $b = 2, 4$.

4.2 Linear Classification

We here outline an application to linear classification given features generated by (quantized) random projections. We aim at reconstructing the original Gram matrix $G = (\langle x_i, x_{i'} \rangle)_{1 \le i, i' \le n}$ from $\widehat{G} = (\widehat{g}_{ii'})$, where for $i \neq i'$, $\widehat{g}_{ii'} = \widehat{\rho}_{\mathrm{MLE}}(q_i, q_{i'})$ equals the MLE of $\langle x_i, x_{i'} \rangle$ given a quantized data pair $q_i, q_{i'}$, and $\widehat{g}_{ii'} = 1$ otherwise (assuming normalized data). The matrix $\widehat{G}$ is subsequently fed into LIBSVM. For testing, the inner products between test and training pairs are approximated accordingly.
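A minimal sketch of this pipeline for $b = 1$, using the closed-form one-bit estimator $\cos(\pi(1 - a))$ (with $a$ the fraction of agreeing sign bits) as a stand-in for the general $b$-bit MLE; sizes are ours, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 40, 200, 4096

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalized data, so g_ii = 1
G = X @ X.T                                     # original Gram matrix

# b = 1 quantized random projections: keep only the sign bits
A = rng.standard_normal((d, k))
Q = np.sign(X @ A)

# Estimate each off-diagonal entry from the fraction of agreeing bits; this
# closed form replaces the general b-bit MLE used in the paper (b = 1 stand-in).
agree = (Q @ Q.T + k) / (2 * k)                 # fraction of agreeing signs per pair
G_hat = np.cos(np.pi * (1.0 - agree))
np.fill_diagonal(G_hat, 1.0)                    # g_ii = 1 for normalized data

print(np.abs(G_hat - G).max())                  # small reconstruction error
```

The reconstructed `G_hat` would then be passed to an SVM as a precomputed kernel, e.g. via scikit-learn's `SVC(kernel='precomputed')`, which wraps LIBSVM; the rectangular test-vs-training block is computed the same way.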
Setup. We work with the Farm Ads data set using the first 3,000 samples for training, and the Arcene data set from the UCI repository with 100 training and 100 test samples in dimension $d = 10^4$. The choice of $k$ and $b$ is as in §4.1; for Arcene, the total bit budget is lowered by a factor of 2. We perform 20 independent replications for each combination of $k$ and $b$. For SVM classification, we consider logarithmically spaced grids between $10^{-3}$ and $10^3$ for the parameter $C$ (cf. the LIBSVM manual).

Figure 6: (L, M): accuracy vs. bits, optimized over the SVM parameter $C$. (R): accuracy vs. $C$ for a fixed # of bits. $b = \infty$ indicates the performance based on unquantized data with $k$ as for $b = 6$. The oracle curve (dotted) corresponds to $b = \infty$ with maximum $k$ (i.e., as for $b = 1$).

Figure 6 (L, M) displays the average accuracy on the test data (after optimizing over $C$) in dependence on the bit budget. For the Farm Ads data set, $b = 2$ achieves the best trade-off, followed by $b = 1$ and $b = 3$. For the Arcene data set, $b = 3$ or $4$ is optimal. In both cases, it does not pay off to go for $b \ge 5$.

5 Conclusion

In this paper, we bridge the gap between random projections with full precision and random projections quantized to a single bit. While Theorem 1 indicates that an exact counterpart to the J-L lemma is not attainable, other theoretical and empirical results herein point to the usefulness of the intermediate cases, which give rise to an interesting trade-off that deserves further study in contexts where random projections can naturally be applied, e.g., linear learning, nearest neighbor classification, or clustering.
The optimal choice of $b$ eventually depends on the application: increasing $b$ puts an emphasis on local rather than global similarity preservation.

Acknowledgement

The work of Ping Li and Martin Slawski is supported by NSF-Bigdata-1419210 and NSF-III-1360971. The work of Michael Mitzenmacher is supported by NSF CCF-1535795 and NSF CCF-1320231.

References

[1] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In Conference on Knowledge Discovery and Data Mining (KDD), pages 245–250, 2001.

[2] C. Boutsidis, A. Zouzias, and P. Drineas. Random projections for k-means clustering. In Advances in Neural Information Processing Systems (NIPS), pages 298–306, 2010.

[3] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[4] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Symposium on Theory of Computing (STOC), pages 380–388, 2002.

[5] S. Dasgupta. Learning mixtures of Gaussians. In FOCS, pages 634–644, 1999.

[6] S. Dasgupta. An elementary proof of a theorem of Johnson and Lindenstrauss.
Random Structures and Algorithms, 22:60–65, 2003.

[7] D. Fradkin and D. Madigan. Experiments with random projections for machine learning. In Conference on Knowledge Discovery and Data Mining (KDD), pages 517–522, 2003.

[8] A. Genz. BVN: A function for computing bivariate normal probabilities. http://www.math.wsu.edu/faculty/genz/homepage.

[9] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Symposium on Theory of Computing (STOC), pages 604–613, 1998.

[10] L. Jacques. A quantized Johnson-Lindenstrauss lemma: The finding of Buffon's needle. IEEE Transactions on Information Theory, 61:5012–5027, 2015.

[11] L. Jacques, K. Degraux, and C. De Vleeschouwer. Quantized iterative hard thresholding: Bridging 1-bit and high-resolution quantized compressed sensing. arXiv:1305.1786, 2013.

[12] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, pages 189–206, 1984.

[13] J. Kieffer. Uniqueness of locally optimal quantizer for log-concave density and convex error weighting function. IEEE Transactions on Information Theory, 29:42–47, 1983.

[14] J. Laska and R. Baraniuk. Regime change: Bit-depth versus measurement-rate in compressive sensing. IEEE Transactions on Signal Processing, 60:3496–3505, 2012.

[15] M. Li, S. Rane, and P. Boufounos. Quantized embeddings of scale-invariant image features for mobile augmented reality. In International Workshop on Multimedia Signal Processing (MMSP), pages 1–6, 2012.

[16] P. Li, T. Hastie, and K. Church. Improving random projections using marginal information. In Annual Conference on Learning Theory (COLT), pages 635–649, 2006.

[17] P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections.
In Proceedings of the International Conference on Machine Learning (ICML), 2014.

[18] M. Lopes, L. Jacob, and M. Wainwright. A more powerful two-sample test in high dimensions using random projection. In Advances in Neural Information Processing Systems (NIPS), pages 1206–1214, 2011.

[19] O. Maillard and R. Munos. Compressed least-squares regression. In Advances in Neural Information Processing Systems (NIPS), pages 1213–1221, 2009.

[20] J. Max. Quantizing for minimum distortion. IRE Transactions on Information Theory, 6:7–12, 1960.

[21] L. Shenton and K. Bowman. Higher moments of a maximum-likelihood estimate. Journal of the Royal Statistical Society, Series B, pages 305–317, 1963.

[22] R. Srivastava, P. Li, and D. Ruppert. RAPTT: An exact two-sample test in high dimensions using random projections. Journal of Computational and Graphical Statistics, 25(3):954–970, 2016.

[23] S. Vempala. The Random Projection Method. American Mathematical Society, 2005.

[24] F. Wang and P. Li. Efficient nonnegative matrix factorization with random projections. In SDM, pages 281–292, 2010.