{"title": "Simple strategies for recovering inner products from coarsely quantized random projections", "book": "Advances in Neural Information Processing Systems", "page_first": 4567, "page_last": 4576, "abstract": "Random projections have been increasingly adopted for a diverse set of tasks in machine learning involving dimensionality reduction. One specific line of research on this topic has investigated the use of quantization subsequent to projection with the aim of additional data compression. Motivated by applications in nearest neighbor search and linear learning, we revisit the problem of recovering inner products (respectively, cosine similarities) in such a setting. We show that even under coarse scalar quantization with 3 to 5 bits per projection, the loss in accuracy tends to range from ``negligible'' to ``moderate''. One implication is that in most scenarios of practical interest, there is no need for a sophisticated recovery approach like maximum likelihood estimation as considered in previous work on the subject. What we propose herein also yields considerable improvements in terms of accuracy over the Hamming distance-based approach in Li et al. (ICML 2014), which is comparable in terms of simplicity.", "full_text": "Simple Strategies for Recovering Inner Products from Coarsely Quantized Random Projections

Ping Li
Baidu Research, and Rutgers University
pingli98@gmail.com

Martin Slawski
Department of Statistics, George Mason University
mslawsk3@gmu.edu

Abstract

Random projections have been increasingly adopted for a diverse set of tasks in machine learning involving dimensionality reduction. One specific line of research on this topic has investigated the use of quantization subsequent to projection with the aim of additional data compression. Motivated by applications in nearest neighbor search and linear learning, we revisit the problem of recovering inner products (respectively, cosine similarities) in such a setting.
We show that even under coarse scalar quantization with 3 to 5 bits per projection, the loss in accuracy tends to range from “negligible” to “moderate”. One implication is that in most scenarios of practical interest, there is no need for a sophisticated recovery approach like maximum likelihood estimation as considered in previous work on the subject. What we propose herein also yields considerable improvements in terms of accuracy over the Hamming distance-based approach in Li et al. (ICML 2014), which is comparable in terms of simplicity.

1 Introduction

The method of random projections (RPs) for linear dimensionality reduction has become increasingly popular over the years after the basic theoretical foundation, the celebrated Johnson-Lindenstrauss (JL) Lemma [12, 20, 33], had been laid out. In a nutshell, it states that it is possible to considerably lower the dimension of a set of data points by means of a linear map in such a way that squared Euclidean distances and inner products are roughly preserved in the low-dimensional representation. Conveniently, a linear map of this sort can be realized by a variety of random matrices [1, 2, 18]. The scope of applications of RPs has expanded dramatically over time, and includes dimension reduction in linear classification and regression [14, 30], similarity search [5, 17], compressed sensing [8], clustering [7, 11], randomized numerical linear algebra and matrix sketching [29], and differential privacy [21], among others.

The idea of achieving further data compression by means of subsequent scalar quantization of the projected data has been considered for some time. Such a setting can be motivated by constraints concerning data storage and communication, locality-sensitive hashing [13, 27], or the enhancement of privacy [31].
The extreme case of one-bit quantization can be associated with two seminal works in computer science: the SDP relaxation of the MAXCUT problem [16] and simhash [10]. One-bit compressed sensing was introduced in [6] and, along with its numerous extensions, has since developed into a subfield of the compressed sensing literature. A series of recent papers discusses quantized RPs with a focus on similarity estimation and search. The papers [25, 32] consider quantized RPs with a focus on image retrieval based on nearest neighbor search. Independent of the specific application, [25, 32] provide JL-type statements for quantized RPs and consider the trade-off between the number of projections and the number of bits per projection under a given budget of bits, as it also appears in the compressed sensing literature [24]. The paper [19] studies approximate JL-type results for quantized RPs in detail.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The approach to quantized RPs taken in the present paper follows [27, 28], in which the problem of recovering distances and inner products is recast within the framework of classical statistical point estimation theory. The paper [28] discusses maximum likelihood estimation in this context, with an emphasis on the aforementioned trade-off between the number of RPs and the bit depth per projection. In the present paper we focus on the much simpler and computationally more convenient approach in which the presence of the quantizer is ignored, i.e., quantized data are treated in the same way as full-precision data. We herein quantify the loss of accuracy of this approach relative to the full-precision case, which turns out to be insignificant in many scenarios of practical interest even under coarse quantization with 3 to 5 bits per projection.
Moreover, we show that the approach compares favorably to the Hamming distance-based (or, equivalently, collision-based) scheme in [27], which is of similar simplicity. We argue that both approaches have their merits: the collision-based scheme performs better at preserving local geometry (the distances of nearby points), whereas the one studied in more detail herein yields better preservation globally.

Notation. For a positive integer $m$, we let $[m] = \{1, \ldots, m\}$. For $l \in [m]$, $v_{(l)}$ denotes the $l$-th component of a vector $v \in \mathbb{R}^m$; if there is no danger of confusion with another index, the brackets in the subscript are omitted. $I(P)$ denotes the indicator function of the expression $P$.

Supplement. Proofs and additional experimental results can be found in the supplement.

Basic setup. Let $\mathcal{X} = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ be a set of input data with squared Euclidean norms $\lambda_i^2 := \|x_i\|_2^2$, $i \in [n]$. We think of $d$ as being large. RPs reduce the dimensionality of the input data by means of a linear map $A : \mathbb{R}^d \to \mathbb{R}^k$, $k \ll d$. We assume throughout the paper that the map $A$ is realized by a random matrix with i.i.d. entries from the standard Gaussian distribution, i.e., $A_{lj} \sim N(0, 1)$, $l \in [k]$, $j \in [d]$. One standard goal of RPs is to approximately preserve distances in $\mathcal{X}$ while lowering the dimension, i.e., $\|Ax_i - Ax_j\|_2^2 / k \approx \|x_i - x_j\|_2^2$ for all $(i, j)$. This is implied by approximate inner product preservation, $\langle x_i, x_j \rangle \approx \langle Ax_i, Ax_j \rangle / k$ for all $(i, j)$.

For the time being, we assume that it is possible to compute and store the squared norms $\{\lambda_i^2\}_{i=1}^n$ and to rescale the input data to unit norm, i.e., one first forms $\widetilde{x}_i \leftarrow x_i / \lambda_i$, $i \in [n]$, before applying $A$.
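The preservation property just described is easy to check numerically. The following sketch (dimensions and random seed are illustrative, not taken from the paper) projects a pair of points with a Gaussian matrix and compares inner products and squared distances before and after projection:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 1_000          # illustrative original and projected dimensions

x1 = rng.standard_normal(d)
x2 = rng.standard_normal(d)

# Gaussian random projection: A has i.i.d. N(0, 1) entries.
A = rng.standard_normal((k, d))
z1, z2 = A @ x1, A @ x2

# Inner products and squared distances are approximately preserved
# after rescaling the projected quantities by 1/k.
ip_true, ip_proj = x1 @ x2, (z1 @ z2) / k
d2_true, d2_proj = np.sum((x1 - x2) ** 2), np.sum((z1 - z2) ** 2) / k

print(abs(ip_proj - ip_true) / d2_true)   # small relative error
print(abs(d2_proj - d2_true) / d2_true)   # small relative error
```

The relative errors shrink at rate $O(1/\sqrt{k})$, in line with the variance computations of the next section.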
In this case, it suffices to recover the (cosine) similarities $\rho_{ij} := \langle x_i, x_j \rangle / (\lambda_i \lambda_j) = \langle \widetilde{x}_i, \widetilde{x}_j \rangle$, $i, j \in [n]$, of the input data $\mathcal{X}$ from their compressed representation $\mathcal{Z} = \{z_1, \ldots, z_n\}$, $z_i := A \widetilde{x}_i$, $i \in [n]$.

2 Estimation of cosine similarity based on full-precision RPs

As preparation for later sections, we start by providing background concerning the usual setting without quantization. Let $(Z, Z')_r$ be random variables having a bivariate Gaussian distribution with zero mean, unit variance, and correlation $r \in (-1, 1)$:

$(Z, Z')_r \sim N_2\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix} \right). \qquad (1)$

Let further $x, x'$ be a generic pair of points from $\mathcal{X}$, and let $z := A\widetilde{x}$, $z' := A\widetilde{x}'$ be the counterpart in $\mathcal{Z}$. Then the components $\{(z_{(l)}, z'_{(l)})\}_{l=1}^k$ of $(z, z')$ are distributed i.i.d. as in (1) with $r = \rho := \langle \widetilde{x}, \widetilde{x}' \rangle$. Hence the problem of recovering the cosine similarity of $x$ and $x'$ can be recast as estimating the correlation from an i.i.d. sample of $k$ bivariate Gaussian random variables. To simplify our exposition, we henceforth assume that $0 \le \rho < 1$, as this can easily be achieved by flipping the sign of one of $x$ or $x'$. The standard estimator of $\rho$ is what is called the “linear estimator” herein:

$\widehat{\rho}_{\mathrm{lin}} = \frac{1}{k} \langle z, z' \rangle = \frac{1}{k} \sum_{l=1}^{k} z_{(l)} z'_{(l)}. \qquad (2)$

As pointed out in [26], this estimator can be considerably improved upon by the maximum likelihood estimator (MLE) given (1):

$\widehat{\rho}_{\mathrm{MLE}} = \operatorname*{argmax}_{r} \left\{ -\frac{1}{2} \log(1 - r^2) - \frac{1}{2}\,\frac{1}{1 - r^2} \left( \frac{1}{k} \|z\|_2^2 + \frac{1}{k} \|z'\|_2^2 - \frac{2r}{k} \langle z, z' \rangle \right) \right\}. \qquad (3)$

The estimator $\widehat{\rho}_{\mathrm{MLE}}$ is not available in closed form, which is potentially a serious concern since it needs to be evaluated for numerous different pairs of data points. However, this can be addressed
However, this can be addressed\n\nlog(1 \u2212 r2) \u2212 1\n2\n\n(cid:104)z, z(cid:48)(cid:105) 2r\n\n(cid:18) 1\n\n(cid:107)z(cid:48)(cid:107)2\n\n2 \u2212 1\nk\n\n(cid:107)z(cid:107)2\n\n2 +\n\n1\n\n1 \u2212 r2\n\n(cid:19)(cid:27)\n\n.\n\n(3)\n\n1\nk\n\nk\n\nr\n\n2\n\n\f(cid:110)(cid:16)(cid:107)z(cid:107)2\n\n(cid:17)\n\n(cid:111)\n\n(6)\n(7)\n\n(8)\n\nby tabulation of the two statistics\n\n(cid:98)\u03c1MLE over a suf\ufb01ciently \ufb01ne grid. At processing time, computation of(cid:98)\u03c1MLE can then be reduced to a\nOne obvious issue of(cid:98)\u03c1lin is that it does not respect the range of the underlying parameter. A natural\n\nlook-up in a pre-computed table.\n\nand the corresponding solutions\n\n/k, (cid:104)z, z(cid:48)(cid:105) /k\n\n2 + (cid:107)z(cid:48)(cid:107)2\n\n2\n\n\ufb01x is the use of the \u201cnormalized linear estimator\u201d\n\n(cid:98)\u03c1norm = (cid:104)z, z(cid:48)(cid:105) /((cid:107)z(cid:107)2 (cid:107)z(cid:48)(cid:107)2).\n\u03c1((cid:98)\u03c1) + Var\u03c1((cid:98)\u03c1),\n\n(4)\nWhen comparing different estimators of \u03c1 in terms of statistical accuracy, we evaluate the mean\nsquared error (MSE), possibly asymptotically as the number of RPs k \u2192 \u221e. Speci\ufb01cally, we consider\n\nMSE\u03c1((cid:98)\u03c1) = E\u03c1[(\u03c1 \u2212(cid:98)\u03c1)2] = Bias2\n\nwhere(cid:98)\u03c1 is some estimator, and the subscript \u03c1 indicates that expectations are taken with respect to a\nIt turns out that(cid:98)\u03c1norm and(cid:98)\u03c1MLE can have dramatically lower (asymptotic) MSEs than(cid:98)\u03c1lin for large\n\nsample (z, z(cid:48)) following the bivariate normal distribution in (1) with r = \u03c1.\n\nvalues of \u03c1, i.e., for points of high cosine similarity. It can be shown that (cf. 
It can be shown that (cf. [4], p. 132, and [26])

$\mathrm{Bias}_\rho(\widehat{\rho}_{\mathrm{lin}}) = 0, \qquad \mathrm{Var}_\rho(\widehat{\rho}_{\mathrm{lin}}) = (1 + \rho^2)/k, \qquad (6)$

$\mathrm{Bias}_\rho^2(\widehat{\rho}_{\mathrm{norm}}) = O(1/k^2), \qquad \mathrm{Var}_\rho(\widehat{\rho}_{\mathrm{norm}}) = (1 - \rho^2)^2/k + O(1/k^2), \qquad (7)$

$\mathrm{Bias}_\rho^2(\widehat{\rho}_{\mathrm{MLE}}) = O(1/k^2), \qquad \mathrm{Var}_\rho(\widehat{\rho}_{\mathrm{MLE}}) = \frac{(1 - \rho^2)^2}{1 + \rho^2}\Big/ k + O(1/k^2). \qquad (8)$

While for $\rho = 0$ the (asymptotic) MSEs are the same, we note that the leading terms of the MSEs of $\widehat{\rho}_{\mathrm{norm}}$ and $\widehat{\rho}_{\mathrm{MLE}}$ decay at rate $\Theta((1 - \rho)^2)$ as $\rho \to 1$, whereas the MSE of $\widehat{\rho}_{\mathrm{lin}}$ grows with $\rho$. The following table provides the asymptotic MSE ratios of $\widehat{\rho}_{\mathrm{lin}}$ and $\widehat{\rho}_{\mathrm{norm}}$ for selected values of $\rho$.

$\rho$                                                                0.5    0.6    0.7    0.8     0.9    0.95    0.99
$\mathrm{MSE}_\rho(\widehat{\rho}_{\mathrm{lin}}) / \mathrm{MSE}_\rho(\widehat{\rho}_{\mathrm{norm}})$    2.2    3.3    5.7    12.6    50     200     5000

In conclusion, if it is possible to pre-compute and store the norms of the data prior to dimensionality reduction, a simple form of normalization can yield important benefits with regard to the recovery of inner products and distances for pairs of points having high cosine similarity. The MLE can provide a further refinement, but the improvement over $\widehat{\rho}_{\mathrm{norm}}$ can be at most by a factor of 2.
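The table entries follow from the leading terms in (6) and (7): to leading order the ratio equals $(1 + \rho^2)/(1 - \rho^2)^2$ and is free of $k$. A quick sanity check:

```python
# Leading-order MSE ratio from (6)-(7):
#   MSE(rho_lin) / MSE(rho_norm) ~ (1 + rho^2) / (1 - rho^2)^2,
# independent of the number of projections k.
def mse_ratio(rho):
    return (1 + rho**2) / (1 - rho**2) ** 2

for rho in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(rho, round(mse_ratio(rho), 1))
```

This reproduces the tabulated values up to rounding.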
3 Estimation of cosine similarity based on quantized RPs

The following section contains our main results. After introducing preliminaries regarding quantization, we review previous approaches to the problem before analyzing estimators following a different paradigm. We conclude with a comparison and some recommendations about what to use in practice.

Quantization. After obtaining the projected data $\mathcal{Z}$, the next step is scalar quantization. Let $t = (t_1, \ldots, t_{K-1})$ with $0 = t_0 < t_1 < \ldots < t_{K-1} < t_K = +\infty$ be a set of thresholds inducing a partitioning of the positive real line into $K$ intervals $\{[t_{s-1}, t_s), s \in [K]\}$, and let $\mathcal{M} = \{\mu_1, \ldots, \mu_K\}$ be a set of codes with $\mu_s$ representing the interval $[t_{s-1}, t_s)$, $s \in [K]$. Given $t$ and $\mathcal{M}$, the scalar quantizer (or quantization map) is defined by

$Q : \mathbb{R} \to \mathcal{M}_\pm := -\mathcal{M} \cup \mathcal{M}, \qquad z \mapsto Q(z) = \operatorname{sign}(z) \sum_{s=1}^{K} \mu_s I(|z| \in [t_{s-1}, t_s)). \qquad (9)$

The projected and quantized data result as $\mathcal{Q} = \{q_i\}_{i=1}^n \subset (\mathcal{M}_\pm)^k$, $q_i = \big(Q(z_{i(l)})\big)_{l=1}^k$, where $z_{i(l)}$ denotes the $l$-th component of $z_i \in \mathcal{Z}$, $l \in [k]$, $i \in [n]$. The bit depth $b$ of the quantizer is given by $b := 1 + \log_2(K)$. For simplicity, we only consider the case where $b$ is an integer. The case $b = 1$ is well studied [10, 27] and is hence disregarded in our analysis to keep our exposition compact.
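A minimal implementation of the quantization map (9) via binary search over the thresholds; the thresholds and codes below are illustrative placeholders rather than optimized values:

```python
import numpy as np

def make_quantizer(t, mu):
    """Scalar quantizer (9): z -> sign(z) * mu_s for |z| in [t_{s-1}, t_s).

    t:  interior thresholds (t_1, ..., t_{K-1}) on the positive axis.
    mu: the K codes (mu_1, ..., mu_K).
    """
    t, mu = np.asarray(t), np.asarray(mu)
    def Q(z):
        s = np.searchsorted(t, np.abs(z), side="right")  # bin index in 0..K-1
        return np.sign(z) * mu[s]
    return Q

# Example with b = 2 bits, i.e. K = 2 intervals on the positive axis
# (threshold 0.75 and codes 0.4, 1.2 are illustrative, not Lloyd-Max optimal).
Q = make_quantizer(t=[0.75], mu=[0.4, 1.2])
print(Q(np.array([-2.0, -0.3, 0.5, 1.7])))   # codes -1.2, -0.4, 0.4, 1.2
```

With $K = 2$ the bit depth is $b = 1 + \log_2(2) = 2$, matching the definition above.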
Bin-based vs. code-based approaches. Let $q = Q(z)$ and $q' = Q(z')$ be the points resulting from quantization of the generic pair $z, z'$ from the previous section. In this paper, we distinguish between two basic paradigms for estimating the cosine similarity of the underlying pair $x, x'$ from $q, q'$. The first paradigm, which we refer to as bin-based estimation, does not make use of the specific values of the codes $\mathcal{M}_\pm$, but only of the intervals (“bins”) associated with each code. This is opposite to the second paradigm, referred to as code-based estimation, which only makes use of the values of the codes. As we elaborate below, an advantage of the bin-based approach is that working with intervals reflects the process of quantization more faithfully and hence can be statistically more accurate; on the other hand, a code-based approach tends to be more convenient from the point of view of computation. In this paper, we make a case for the code-based approach by showing that the loss in statistical accuracy can be fairly minor in several regimes of practical interest.

Lloyd-Max (LM) quantizer. With $b$ (respectively $K$) being fixed, one needs to choose the thresholds $t$ and the codes $\mathcal{M}$ of the quantizer (the latter are crucial only for a code-based approach). In our setting, with $z_{i(l)} \sim N(0, 1)$, $i \in [n]$, $l \in [k]$, which is inherited from the distribution of the entries of $A$, a standard choice is LM quantization [15], which minimizes the squared distortion error:

$(t^\star, \mu^\star) = \operatorname*{argmin}_{t, \mu} \; E_{g \sim N(0,1)}\big[\{g - Q(g; t, \mu)\}^2\big]. \qquad (10)$

Problem (10) can be solved by an iterative scheme that alternates between optimization of $t$ for fixed $\mu$ and vice versa. This scheme can be shown to deliver the global optimum [22]. In the absence of any prior information about the cosine similarities that we would like to recover, (10) appears to be a reasonable default whose use for bin-based estimation has been justified in [28]. In the limit of cosine similarity $\rho \to 1$, it may seem more plausible to use (10) with $g$ replaced by its square and to take the root of the resulting optimal thresholds and codes. However, it turns out that empirically this yields reduced performance more often than improvements, hence we stick to (10) in the sequel.
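The alternating scheme for (10) can be sketched as follows. Since $N(0,1)$ is symmetric and $Q$ in (9) is odd, it suffices to quantize $|g|$, which follows a half-normal distribution: the optimal thresholds are midpoints of adjacent codes, and the optimal codes are conditional means, available in closed form from the Gaussian pdf and cdf. This is a sketch of the standard Lloyd-Max iteration under these assumptions, not code from the paper:

```python
import numpy as np
from scipy.stats import norm

def lloyd_max_gaussian(K, iters=200):
    """Lloyd-Max quantizer (10) for N(0, 1) by alternating minimization.

    Returns thresholds (t_1, ..., t_{K-1}) and codes (mu_1, ..., mu_K)
    for the positive axis; the full quantizer is their symmetric extension.
    """
    # Initialize codes at half-normal quantiles.
    mu = norm.ppf(0.5 + (np.arange(K) + 0.5) / (2 * K))
    for _ in range(iters):
        t = (mu[:-1] + mu[1:]) / 2                     # optimal t given mu
        edges = np.concatenate(([0.0], t, [np.inf]))
        lo, hi = edges[:-1], edges[1:]
        # Optimal codes given t: conditional mean of |g| on [lo, hi),
        # i.e. (phi(lo) - phi(hi)) / (Phi(hi) - Phi(lo)).
        mu = (norm.pdf(lo) - norm.pdf(hi)) / (norm.cdf(hi) - norm.cdf(lo))
    return t, mu

t, mu = lloyd_max_gaussian(K=4)      # b = 1 + log2(4) = 3 bits
print(np.round(t, 3), np.round(mu, 3))
```

At convergence the pair $(t, \mu)$ is self-consistent: each threshold is the midpoint of its neighboring codes, and each code is the conditional mean of its bin.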
3.1 Bin-based approaches

MLE. Given a pair $q = (q_{(l)})_{l=1}^k$ and $q' = (q'_{(l)})_{l=1}^k$ of projected and quantized points, maximum likelihood estimation of the underlying cosine similarity $\rho$ is studied in depth in [28]. The associated likelihood function $L(r)$ is based on bivariate normal probabilities of the form $P_r(Z \in [t_{s-1}, t_s),\, Z' \in [t_{u-1}, t_u))$ and $P_{-r}(Z \in [t_{s-1}, t_s),\, Z' \in [t_{u-1}, t_u))$ with $(Z, Z')_r$ as in (1). It is shown in [28] that the MLE with $b \ge 2$ can be more efficient at the bit level than common single-bit quantization [10, 16]; the optimal choice of $b$ increases with $\rho$. While statistically optimal in the given setting, the MLE remains computationally cumbersome even when using the approximation in [28], because it requires cross-tabulation of the empirical frequencies corresponding to the bivariate normal probabilities above. This makes the use of the MLE unattractive, particularly in situations in which it is not feasible to materialize all $O(n^2)$ pairwise similarities estimable from $(q_i, q_j)_{i<j}$.
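For concreteness, here is a direct, unoptimized sketch of this bin-based MLE: a plain grid search over $r$ stands in for the approximation in [28], the bin probabilities are obtained from the bivariate normal CDF by inclusion-exclusion, and the 2-bit thresholds below are illustrative (with $\pm 8$ standing in for $\pm\infty$):

```python
import numpy as np
from scipy.stats import multivariate_normal

def bin_probs(r, edges):
    """P(Z in bin s, Z' in bin u) for all signed bins, under bivariate
    N(0, 1) with correlation r, via inclusion-exclusion on the joint CDF."""
    F = np.array([[multivariate_normal.cdf([a, b], mean=[0.0, 0.0],
                                           cov=[[1.0, r], [r, 1.0]])
                   for b in edges] for a in edges])
    return F[1:, 1:] - F[:-1, 1:] - F[1:, :-1] + F[:-1, :-1]

def rho_mle_bins(counts, edges, grid=np.linspace(0.0, 0.95, 39)):
    """Bin-based MLE: maximize the multinomial log-likelihood of the
    cross-tabulated bin counts over a grid of candidate correlations."""
    lls = [np.sum(counts * np.log(np.clip(bin_probs(r, edges), 1e-12, None)))
           for r in grid]
    return grid[int(np.argmax(lls))]

# Demo with b = 2 bits: signed bin edges (-inf, -t_1, 0, t_1, inf), where
# +-8 stands in for +-inf and the threshold 0.98 is merely illustrative.
rng = np.random.default_rng(2)
rho = 0.8
z, z2 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=3000).T
edges = np.array([-8.0, -0.98, 0.0, 0.98, 8.0])
counts = np.zeros((4, 4))
np.add.at(counts, (np.digitize(z, edges) - 1, np.digitize(z2, edges) - 1), 1)
est = rho_mle_bins(counts, edges)
print(est)   # close to 0.8
```

Only the small table of cross-tabulated counts enters the likelihood, which is exactly why the cross-tabulation step dominates the cost when many pairs are processed.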