{"title": "On the Downstream Performance of Compressed Word Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 11805, "page_last": 11816, "abstract": "Compressing word embeddings is important for deploying NLP models in memory-constrained settings. However, understanding what makes compressed embeddings perform well on downstream tasks is challenging---existing measures of compression quality often fail to distinguish between embeddings that perform well and those that do not. We thus propose the eigenspace overlap score as a new measure. We relate the eigenspace overlap score to downstream performance by developing generalization bounds for the compressed embeddings in terms of this score, in the context of linear and logistic regression. We then show that we can lower bound the eigenspace overlap score for a simple uniform quantization compression method, helping to explain the strong empirical performance of this method. Finally, we show that by using the eigenspace overlap score as a selection criterion between embeddings drawn from a representative set we compressed, we can efficiently identify the better performing embedding with up to 2x lower selection error rates than the next best measure of compression quality, and avoid the cost of training a separate model for each task of interest.", "full_text": "On the Downstream Performance of Compressed\n\nWord Embeddings\n\nAvner May\n\nJian Zhang\n\nDepartment of Computer Science, Stanford University\n\nTri Dao\n\nChristopher R\u00e9\n\n{avnermay, zjian, trid, chrismre}@cs.stanford.edu\n\nAbstract\n\nCompressing word embeddings is important for deploying NLP models in memory-\nconstrained settings. However, understanding what makes compressed embeddings\nperform well on downstream tasks is challenging\u2014existing measures of compres-\nsion quality often fail to distinguish between embeddings that perform well and\nthose that do not. 
We thus propose the eigenspace overlap score as a new measure.\nWe relate the eigenspace overlap score to downstream performance by developing\ngeneralization bounds for the compressed embeddings in terms of this score, in the\ncontext of linear and logistic regression. We then show that we can lower bound the\neigenspace overlap score for a simple uniform quantization compression method,\nhelping to explain the strong empirical performance of this method. Finally, we\nshow that by using the eigenspace overlap score as a selection criterion between\nembeddings drawn from a representative set we compressed, we can ef\ufb01ciently\nidentify the better performing embedding with up to 2\u00d7 lower selection error rates\nthan the next best measure of compression quality, and avoid the cost of training a\nmodel for each task of interest.\n\n1\n\nIntroduction\n\nIn recent years, word embeddings [22, 28, 23, 29, 10] have brought large improvements to a wide range\nof applications in natural language processing (NLP) [1, 5, 37]. However, these word embeddings can\noccupy a large amount of memory, making it expensive to deploy them in data centers, and impractical\nto use them in memory-constrained environments like smartphones. To reduce and amortize these\ncosts, embeddings can be compressed [e.g., 33] and shared across many downstream tasks [7].\nRecently, there have been numerous successful methods proposed for compressing embeddings; these\nmethods take a variety of approaches, ranging from compression using k-means clustering [2] to\ndictionary learning using neural networks [33, 6].\nThe goal of this work is to gain a deeper understanding of what makes compressed embeddings\nperform well on downstream tasks. Practically, this understanding could allow for evaluating the\nquality of a compressed embedding without having to train a model for each task of interest. 
Our work\nis motivated by two surprising empirical observations: First, we \ufb01nd that existing ways [40, 3, 41] of\nmeasuring the quality of compressed embeddings do not effectively explain the relative downstream\nperformance of different compressed embeddings\u2014for example, failing to discriminate between\nembeddings that perform well and those that do not. Second, we observe that a simple uniform\nquantization method can match or outperform the state-of-the-art deep compositional code learning\nmethod [33] and the k-means compression method [2] in terms of downstream performance. These\nobservations suggest that there is currently an incomplete understanding of what makes a compressed\nembedding perform well on downstream tasks. One way to narrow this gap in our understanding is to\n\ufb01nd a measure of compression quality that (i) is directly related to generalization performance, and\n(ii) can be used to analyze the performance of uniformly quantized embeddings.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fHere we introduce the eigenspace overlap score as a new measure of compression quality, and\nshow that it satis\ufb01es the above two desired properties. This score measures the degree of overlap\nbetween the subspaces spanned by the eigenvectors of the Gram matrices of the compressed and\nuncompressed embedding matrices. 
Our theoretical contributions are two-fold, addressing the\nsurprising observations and desired properties discussed above: First, we prove generalization bounds\nfor the compressed embeddings in terms of the eigenspace overlap score in the context of linear and\nlogistic regression, revealing a direct connection between this score and downstream performance.\nSecond, we prove that in expectation uniformly quantized embeddings attain a high eigenspace\noverlap score with the uncompressed embeddings at relatively high compression rates, helping to\nexplain their strong performance. Inspired by these theoretical connections between the eigenspace\noverlap score and generalization performance, we propose using this score as a selection criterion for\nef\ufb01ciently picking among a set of compressed embeddings, without having to train a model for each\ntask of interest using each embedding.\nWe empirically validate our theoretical contributions and the ef\ufb01cacy of our proposed selection\ncriterion by showing three main experimental results: First, we show the eigenspace overlap score is\nmore predictive of downstream performance than existing measures of compression quality [40, 3, 41].\nSecond, we show uniform quantization consistently matches or outperforms all the compression\nmethods to which we compare [2, 33, 15], in terms of both the eigenspace overlap score and\ndownstream performance. Third, we show the eigenspace overlap score is a more accurate criterion\nfor choosing between compressed embeddings than existing measures; speci\ufb01cally, we show that\nwhen choosing between embeddings drawn from a representative set we compressed [2, 33, 11, 15],\nthe eigenspace overlap score is able to identify the one that attains better downstream performance\nwith up to 2\u00d7 lower selection error rates than the next best measure of compression quality. 
We\nconsider several baseline measures of compression quality: the Pairwise Inner Product (PIP) loss [40],\nand two spectral measures of approximation error between the embedding Gram matrices [3, 41].\nOur results are consistent across a range of NLP tasks [32, 18, 37], embedding types [28, 23, 10],\nand compression methods [2, 33, 11].\nThe rest of this paper is organized as follows. In Section 2 we review background on word embedding\ncompression methods and existing measures of compression quality, and present the two motivating\nempirical observations. In Section 3 we present the eigenspace overlap score along with our corre-\nsponding theoretical contributions, and propose to use the eigenspace overlap score as a selection\ncriterion. In Section 4, we show the results from our extensive experiments validating the practical\nsigni\ufb01cance of our theoretical contributions, and the ef\ufb01cacy of our proposed selection criterion. We\npresent related work in Section 5, and conclude in Section 6.\n\n2 Background and Motivation\nWe \ufb01rst review different compression methods in Section 2.1 and existing ways to measure the\nquality of a compressed embedding relative to the uncompressed embedding in Section 2.2. We then\nshow in Section 2.3 that existing measures of compression quality do not satisfactorily explain the\nrelative downstream performance of existing compression methods; this motivates our work to better\nunderstand the downstream performance of compressed embeddings.\n\n2.1 Embedding Compression Methods\n\nWe now discuss a number of compression methods for word embeddings. 
For the purposes of this paper, the goal of an embedding compression method C(·) is to take as input an uncompressed embedding X ∈ R^{n×d}, and produce as output a compressed embedding X̃ := C(X) ∈ R^{n×k} which uses less memory than X, but attains similar performance to X when used in downstream models. Here, n denotes the vocabulary size, and d and k the uncompressed and compressed dimensions.
Deep Compositional Code Learning (DCCL)  The DCCL method [33] uses a dictionary learning approach to represent a large number of word vectors using a much smaller number of basis vectors. The dictionaries are trained using an autoencoder-style architecture to minimize the embedding matrix reconstruction error. A similar approach was independently proposed by Chen et al. [6].
K-means Compression  The k-means algorithm can be used to compress word embeddings by first clustering all the scalar entries in the word embedding matrix, and then replacing each scalar with the closest centroid [2]. Using 2^b centroids allows for storing each matrix entry using only b bits.
Dimensionality Reduction  One can train an embedding with a lower dimension, or use a method like principal component analysis (PCA) to reduce the dimensionality of an existing embedding.
Uniform Quantization  To compress real numbers, uniform quantization divides an interval into sub-intervals of equal size, and then (deterministically or stochastically) rounds the numbers in each sub-interval to one of the boundaries [11, 13]. To apply uniform quantization to embedding compression, we propose to first determine the optimal threshold at which to clip the extreme values in the word embedding matrix, and then uniformly quantize the clipped embeddings within the clipped interval.
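The clip-then-quantize procedure just described can be sketched in a few lines of NumPy. This is a hedged illustration, assuming unbiased stochastic rounding as in [13]; the function name and the simple choice of clipping threshold are ours, whereas the paper tunes the threshold (see its Appendix D.3):

```python
import numpy as np

def uniform_quantize(X, b, clip=None, rng=None):
    """Clip X to [-clip, clip], then stochastically round each entry to one of
    2**b uniformly spaced values in that interval (unbiased rounding).
    Illustrative sketch; `clip` defaults to the max absolute entry."""
    rng = np.random.default_rng() if rng is None else rng
    if clip is None:
        clip = np.abs(X).max()
    Xc = np.clip(X, -clip, clip)
    levels = 2 ** b - 1                  # number of sub-intervals
    step = 2 * clip / levels
    scaled = (Xc + clip) / step          # position measured in units of `step`
    low = np.floor(scaled)
    # round up with probability equal to the fractional part, so the
    # quantized value is an unbiased estimate of the clipped value
    up = rng.random(np.shape(Xc)) < (scaled - low)
    q = np.minimum(low + up, levels)
    return q * step - clip
```

Storing b bits per entry in place of a 32-bit float gives a 32/b× compression rate, which is how the compression rates quoted later for this method arise.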
For more details about uniform quantization and how we use it to compress embeddings, see Appendices A.1 and D.3 respectively.

2.2 Measures of Compression Quality

We review ways of measuring the compression quality of a compressed embedding relative to the uncompressed embedding. For our purposes, an ideal measure would consider a compressed embedding to have high quality when it is likely to perform similarly to the uncompressed embedding on downstream tasks, and low quality otherwise. Such a measure would shed light on what determines the downstream performance of a compressed embedding, and give us a way of measuring the quality of a compressed embedding without having to train a downstream model for each task.
Several of the measures discussed below are based on comparing the pairwise inner product (Gram) matrices of the compressed and uncompressed embeddings. The Gram matrices of embeddings are natural to consider for two reasons: First, the loss function for training word embeddings typically only considers dot products between embedding vectors [22, 28]. Second, one can view word embedding training as implicit matrix factorization [20], and thus comparing the Gram matrices of two embedding matrices is similar to comparing the matrices these embeddings are implicitly factoring. We now review several existing ways of measuring compression quality.
Word Embedding Reconstruction Error  The first and simplest way of comparing two embeddings X and X̃ is to measure the reconstruction error ‖X − X̃‖_F. Note that in order to be able to use this measure of quality, X and X̃ must have the same dimension.
Pairwise Inner Product (PIP) Loss  Given XX^T and X̃X̃^T, the Gram matrices of the uncompressed and compressed embeddings, the Pairwise Inner Product (PIP) loss [40] is defined as ‖XX^T − X̃X̃^T‖_F.
This measure of quality was recently proposed to explain the existence of an optimal dimension for word embeddings, in terms of a bias-variance trade-off for the PIP loss.
Spectral Approximation Error  A symmetric matrix A is defined [41] to be a (∆_1, ∆_2)-spectral approximation of another symmetric matrix B if it satisfies (1 − ∆_1)B ⪯ A ⪯ (1 + ∆_2)B (in the semidefinite order). Zhang et al. [41] show that if X̃X̃^T + λI is a (∆_1, ∆_2)-spectral approximation of XX^T + λI for sufficiently small values of ∆_1 and ∆_2, then the linear model trained using X̃ and regularization parameter λ will attain similar generalization performance to the model trained using X. Avron et al. [3] use a single scalar ∆ in place of ∆_1 and ∆_2, and use this scalar as a measure of approximation error, while Zhang et al. [41] consider ∆_1 and ∆_2 independently, and use the quantity ∆_max := max(1/(1 − ∆_1), ∆_2) to measure approximation error.

2.3 Two Motivating Empirical Observations

We now present two empirical observations which illustrate the need to better understand the downstream performance of models trained using compressed embeddings. In these experiments we compare the downstream performance of the methods introduced in Section 2.1, and attempt to use the measures of compression quality from Section 2.2 to explain the relative performance of these compression methods. Our observations reveal that explaining the downstream performance of compressed embeddings is challenging.
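For concreteness, the PIP loss and the spectral approximation constants ∆_1, ∆_2 reviewed above can be computed for small vocabularies as follows. This is an illustrative NumPy sketch, not the authors' code; it uses the fact that the smallest valid ∆_1, ∆_2 are determined by the extreme generalized eigenvalues of the two regularized Gram matrices:

```python
import numpy as np

def pip_loss(X, X_t):
    """PIP loss: Frobenius distance between the two Gram matrices [40]."""
    return np.linalg.norm(X @ X.T - X_t @ X_t.T)

def spectral_deltas(X, X_t, lam=1.0):
    """Smallest (D1, D2) such that
    (1 - D1)(XX^T + lam*I) <= X~X~^T + lam*I <= (1 + D2)(XX^T + lam*I),
    computed from generalized eigenvalues (small-n illustration only)."""
    n = X.shape[0]
    A = X_t @ X_t.T + lam * np.eye(n)
    B = X @ X.T + lam * np.eye(n)
    w, V = np.linalg.eigh(B)                      # B is positive definite
    B_inv_half = V @ np.diag(w ** -0.5) @ V.T
    gev = np.linalg.eigvalsh(B_inv_half @ A @ B_inv_half)
    return 1.0 - gev.min(), gev.max() - 1.0
```

Both measures require O(n^2) memory for the Gram matrices, so in practice they would be computed via the SVDs of X and X̃ rather than as written here.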
We now provide an overview of these two observations; for a more thorough presentation of these results, see Section 4.
• First, we observe that the downstream performance of embeddings compressed using the various methods from Section 2.1 cannot be satisfactorily explained in terms of any of the existing measures of compression quality described in Section 2.2. For example, in Figure 1 we see that on GloVe embeddings [28], the uniform quantization method with compression rate 32× can have over 1.3× higher PIP loss than dimensionality reduction with compression rate 6×, while attaining better downstream performance by over 2.5 F1 points on the Stanford Question Answering Dataset (SQuAD) [32]. Furthermore, the PIP loss and the two spectral measures of approximation error ∆ and ∆_max only achieve Spearman correlation absolute values of 0.49, 0.46, and 0.62 with the question answering test F1 score, respectively (Table 1). These results show that existing measures of compression quality correlate relatively poorly with downstream performance.
• Our second observation is that the simple uniform quantization method matches or outperforms the more complex DCCL and k-means compression methods across a number of tasks, embedding types, and compression ratios. For example, with a compression ratio of 32×, uniform quantization attains an average F1 score 0.47 points below the uncompressed GloVe embeddings on the Stanford Question Answering Dataset [32], while the DCCL method [33] is 0.43 points below.

[Figure 1: The PIP loss does not satisfactorily explain the relative downstream performance of different compression methods.]

These two observations suggest the need to better understand the downstream performance of compressed embeddings.
Toward this end, we focus on finding a measure of compression quality with the properties that (i) we can directly relate it to generalization performance, and (ii) we can use it to analyze the performance of uniformly quantized embeddings.

3 A New Measure of Compression Quality

To better understand what properties of compressed embeddings determine their downstream performance, and to help explain the motivating empirical observations above, we introduce the eigenspace overlap score, and show that it satisfies the two desired properties described above. In Section 3.1 we present generalization bounds for compressed embeddings in the context of linear and logistic regression, in terms of the eigenspace overlap score between the compressed and uncompressed embeddings. In Section 3.2 we show that in expectation, uniformly quantized embeddings attain high eigenspace overlap scores, helping to explain their strong downstream performance. Based on the connection between the eigenspace overlap score and downstream performance, in Section 3.3 we propose using this score as a way of efficiently selecting among different compressed embeddings.

3.1 The Eigenspace Overlap Score and Generalization Performance

We begin by defining the eigenspace overlap score, which measures how well a compressed embedding approximates an uncompressed embedding. We then present our theoretical results relating the generalization performance of compressed embeddings to their eigenspace overlap scores.

3.1.1 The Eigenspace Overlap Score

We now define the eigenspace overlap score, and discuss the intuition behind this definition.
Definition 1. Given two full-rank embedding matrices X ∈ R^{n×d}, X̃ ∈ R^{n×k}, whose Gram matrices have eigendecompositions XX^T = UΛU^T, X̃X̃^T = ŨΛ̃Ũ^T for U ∈ R^{n×d}, Ũ ∈ R^{n×k}, we define the eigenspace overlap score E(X, X̃) := (1/max(d, k)) ‖U^T Ũ‖_F^2.
This score quantifies the similarity between the subspaces spanned by the eigenvectors with nonzero eigenvalues of X̃X̃^T and XX^T. In particular, assuming k ≤ d, it measures the ratio between the squared Frobenius norm of U before and after being projected onto Ũ. It attains a maximum value of one when span(U) = span(Ũ), and a minimum value of zero when these two spans are orthogonal. Computing this score takes time O(n max(d, k)^2), as it requires computing the singular value decompositions (SVDs) of X and X̃. As is clear from the definition, the eigenspace overlap score only depends on the left singular vectors of the two embedding matrices. To better understand why this is a desirable property, consider two embedding matrices X and X̃ with the same left singular vectors. It follows that the output of any linear model over X can be exactly matched by the output of a linear model over X̃; if we consider the SVDs X = USV^T, X̃ := US̃Ṽ^T, then for any parameter vector w ∈ R^d over X, w̃ := ṼS̃^{-1}SV^T w gives Xw = X̃w̃. This observation shows how central the left singular vectors of an embedding matrix are to the set of models which use this matrix, and thus why it is reasonable for the eigenspace overlap score to only consider the left singular vectors.
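Definition 1 translates directly into a few lines of NumPy. This is a minimal sketch assuming the embeddings fit in memory; the function name is ours:

```python
import numpy as np

def eigenspace_overlap(X, X_tilde):
    """Eigenspace overlap score E(X, X~) = ||U^T U~||_F^2 / max(d, k),
    where U and U~ hold the left singular vectors of X and X~ (Definition 1)."""
    U = np.linalg.svd(X, full_matrices=False)[0]
    U_t = np.linalg.svd(X_tilde, full_matrices=False)[0]
    return np.linalg.norm(U.T @ U_t) ** 2 / max(X.shape[1], X_tilde.shape[1])
```

Note that right-multiplying X̃ by any invertible matrix leaves the score unchanged, since it preserves the span of the left singular vectors; this is the invariance the surrounding discussion motivates.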
In Appendix B.3 we discuss this score's robustness to perturbations, while in Appendix B.4 we discuss the connection between this score and a variant of embedding reconstruction error.

3.1.2 Generalization Results

We now present our theoretical results relating the difference in generalization performance between models trained on compressed vs. uncompressed embeddings, in terms of the eigenspace overlap score. For these results, we consider an average-case analysis in the context of fixed design linear regression, for both the squared loss function and for any Lipschitz continuous loss function (e.g., logistic loss). We consider the fixed design setting for ease of analysis; for example, when using the squared loss there is a closed-form expression for a regressor's generalization performance. Before presenting our results in Theorems 1 and 2 for the two types of loss functions, we briefly review fixed design linear regression, and discuss the average-case setting we consider.
In fixed design linear regression, we observe a set of labeled points {(x_i, y_i)}_{i=1}^n where the observed labels y_i = ȳ_i + ε_i ∈ R are perturbed from the true labels ȳ_i with independent noise ε_i with mean zero and variance σ^2. If we let x_i ∈ R^d denote the i-th row of the matrix X ∈ R^{n×d} with SVD X = USV^T, let y and ȳ in R^n denote the perturbed and true label vectors, and let ℓ : R × R → R be a convex loss function, we can define f_{X,ε} as the linear model which minimizes the empirical loss: f_{X,ε}(x) := x^T w*, where w* := argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n ℓ(x_i^T w, y_i).
When the loss function is the squared loss, we can use the closed-form solution w* = (X^T X)^{-1} X^T y to show that the expected loss of f_{X,ε} is equal to R_ȳ(X) := E_ε[(1/n) Σ_{i=1}^n ℓ(f_{X,ε}(x_i), ȳ_i)] = (1/n)(‖ȳ‖^2 − ‖U^T ȳ‖^2 + dσ^2); for the derivation, see Appendix A.2. If we instead consider any Lipschitz continuous convex loss function (e.g., the logistic loss¹) there may not be a closed-form solution for the parameter vector w*, but we can still derive upper bounds on the expected loss in this setting (see Theorem 2).
We consider average-case analysis for two reasons: First, in the setting where one would like to use the same compressed embedding across many tasks (i.e., different label vectors ȳ), an average-case result describes the average performance across these tasks. Second, for both empirical and theoretical reasons we argue that worst-case bounds are too loose to explain our empirical observations. Empirically, we observe that compressed embeddings with large values of ∆_1 and ∆_2 (defined in Section 2.2) can still attain strong generalization performance (Appendix E.6), even though these values imply large worst-case bounds on the generalization error [41]. From a theoretical perspective, worst-case bounds must account for all possible label vectors, including those chosen adversarially. For example, if there exists a single direction in span(U) orthogonal to span(Ũ) (which always occurs when dim(Ũ) < dim(U)) the label vector ȳ can be in this direction, resulting in large generalization error for X̃ and small generalization error for X. Thus, we consider an average-case analysis in which we assume ȳ is a random label vector in span(U).
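The closed-form expression for R_ȳ(X) above can be checked with a short simulation: fit ordinary least squares on many noisy label vectors and compare the average risk against the formula. This is an illustrative sketch, not the paper's code; the function names are ours:

```python
import numpy as np

def closed_form_risk(X, ybar, sigma):
    """(1/n)(||ybar||^2 - ||U^T ybar||^2 + d*sigma^2): the expected
    fixed-design squared-loss risk derived in the text (Appendix A.2)."""
    n, d = X.shape
    U = np.linalg.svd(X, full_matrices=False)[0]
    return (np.sum(ybar ** 2) - np.sum((U.T @ ybar) ** 2) + d * sigma ** 2) / n

def monte_carlo_risk(X, ybar, sigma, trials=100_000, seed=0):
    """Average squared risk of the OLS fit over sampled label noise.
    Uses the fact that the OLS predictions on the training design are
    X w* = U U^T y, the projection of y onto the column space of X."""
    n, d = X.shape
    U = np.linalg.svd(X, full_matrices=False)[0]
    rng = np.random.default_rng(seed)
    Y = ybar[:, None] + sigma * rng.standard_normal((n, trials))
    preds = U @ (U.T @ Y)               # OLS predictions for every noise draw
    return np.mean(np.sum((preds - ybar[:, None]) ** 2, axis=0) / n)
```

For ȳ ∈ span(U) the bias term vanishes and the risk reduces to dσ^2/n, the quantity used in the average-case discussion above.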
We consider this setting because we are most interested in the situation where we know the uncompressed embedding matrix X performs well (in this case, R_ȳ(X) = dσ^2/n), and we would like to understand how well X̃ can do.²
We now present our result for the squared loss. To maintain a constant signal (ȳ) to noise (ε) ratio for different embedding matrix sizes, we define c ∈ R as the scalar for which σ^2 = c^2 · E_ȳ[(1/n) Σ_{i=1}^n ȳ_i^2]. Thus, when c = 1 the entries of the true label vector on average have the same variance as the noise.
Theorem 1. Let X = USV^T ∈ R^{n×d} be the singular value decomposition of a full-rank embedding matrix X, and let X̃ ∈ R^{n×k} be another full-rank embedding matrix. Let ȳ = Uz ∈ R^n denote a random label vector in span(U), where z is random with zero mean and identity covariance matrix. Letting σ^2 = c^2 · E_ȳ[(1/n) Σ_{i=1}^n ȳ_i^2] = c^2 d/n denote the variance of the label noise, it follows that

E_ȳ[R_ȳ(X̃) − R_ȳ(X)] = (d/n) · (1 − E(X, X̃)) − c^2 · d(d − k)/n^2.   (1)

¹We consider the logistic loss ℓ(z′, z) := −(σ(z) log(σ(z′)) + (1 − σ(z)) log(1 − σ(z′))), where here σ : R → R denotes the sigmoid function, and z and z′ both represent logits. If z′ := w^T x is bounded (which occurs when the weight vector and data are both bounded), this loss is Lipschitz continuous in both arguments.
²The difference between average-case and worst-case analysis is central to understanding the difference between (∆_1, ∆_2)-spectral approximation (which yields worst-case generalization bounds) [41] and the eigenspace overlap score (which yields average-case generalization bounds).

This theorem reveals that a larger eigenspace overlap score E(X, X̃) results in better expected loss for the compressed embedding. Note that if we focus on the low-dimensional and low-noise setting, where d ≪ n and c^2 = O(1), we can effectively ignore the term c^2 · d(d − k)/n^2 = O(d^2/n^2), and the generalization performance is determined by the eigenspace overlap score.
We now present a result analogous to Theorem 1 for Lipschitz continuous loss functions.
Theorem 2. Let X ∈ R^{n×d}, X̃ ∈ R^{n×k}, ȳ ∈ R^n, and c ∈ R be defined as in Theorem 1. Let ℓ : R × R → R be a convex non-negative loss function which is L-Lipschitz continuous in both arguments and satisfies argmin_{v′} ℓ(v′, v) = v for all v ∈ R. It follows that

E_ȳ[R_ȳ(X̃) − R_ȳ(X)] ≤ (L√d / √n) · (√(1 − E(X, X̃)) + 2c).

Similarly to Theorem 1, we see that a larger eigenspace overlap score results in a tighter bound on the generalization performance of the compressed embeddings. See Appendix B for the proofs for Theorems 1 and 2, where we consider the more general setting of z having arbitrary covariance.

3.2 The Eigenspace Overlap Score and Uniform Quantization

To help explain the strong downstream performance of uniformly quantized embeddings, in this section we present a lower bound on the expected eigenspace overlap score for uniformly quantized embeddings.
Combining this result with Theorem 1 directly provides a guarantee on the performance of the uniformly quantized embeddings.
To prove this bound on the eigenspace overlap score, we use the Davis-Kahan sin(Θ) theorem [8], which upper bounds the amount the eigenvectors of a matrix can change after the matrix is perturbed, in terms of the perturbation magnitude. Because for uniform quantization we can exactly characterize the magnitude of the perturbation, this theorem allows us to bound the eigenspace overlap score of uniformly quantized embeddings. Note that we assume unbiased stochastic rounding is used for the uniform quantization (see [13] or Appendix A.1). We now present the result (proof in Appendix C):
Theorem 3. Let X ∈ R^{n×d} be a bounded embedding matrix with X_ij ∈ [−1/√d, 1/√d]³ and smallest singular value σ_min = a√(n/d), for a ∈ (0, 1].⁴ Let X̃ be an unbiased stochastic uniform quantization of X, where b bits are used per entry. Then for n ≥ max(33, d), we can lower bound the expected eigenspace overlap score of X̃, over the randomness of the stochastic quantization, as follows:

E[1 − E(X, X̃)] ≤ 20 / ((2^b − 1)^2 a^4).

A consequence of this theorem is that with only a logarithmic number of bits b ≥ log_2(√20/(a^2 √ε) + 1), uniform quantization can attain an expected eigenspace overlap score of at least 1 − ε.
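This bit-count consequence is a one-line computation. The helper name below is illustrative, and the guarantee of course only applies under Theorem 3's assumptions on X:

```python
import math

def bits_for_overlap(eps, a):
    """Smallest integer b with b >= log2(sqrt(20)/(a**2 * sqrt(eps)) + 1),
    which by Theorem 3 guarantees E[1 - E(X, X~)] <= eps."""
    return math.ceil(math.log2(math.sqrt(20) / (a ** 2 * math.sqrt(eps)) + 1))
```

For example, with a = 1 and ε = 0.01 this gives b = 6 bits, and indeed 20/((2^6 − 1)^2) ≈ 0.005 ≤ 0.01.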
This helps explain the strong downstream performance of uniform quantization at high compression rates. In Appendix C.2 we empirically validate that the scaling of the eigenspace overlap score with respect to the quantities in Theorem 3 matches the theory; we show 1 − E(X, X̃) drops as the precision b and the scalar a are increased, and is relatively unaffected by changes to the vocabulary size n and dimension d.

³This bound on the entries of X results in the entries of its Gram matrix being bounded by a constant independent of d.
⁴The maximum possible value of σ_min is √(n/d), which occurs when ‖X‖_F^2 = n and σ_min = σ_max.

[Figure 2: Downstream performance vs. measures of compression quality. We plot the performance of compressed fastText embeddings on the SQuAD question answering task as a function of different measures of compression quality. The eigenspace overlap score E demonstrates better alignment with downstream performance across compression methods than the other measures. We quantify the degree of alignment using the Spearman correlation ρ, and include ρ in the plot titles.]

3.3 The Eigenspace Overlap Score as a Selection Criterion

Due to the theoretical connections between generalization performance and the eigenspace overlap score, we propose using the eigenspace overlap score as a selection criterion between different compressed embeddings. Specifically, the algorithm we propose takes as input an uncompressed embedding along with two or more compressed versions of this embedding, and returns the compressed embedding with the highest eigenspace overlap score to the uncompressed embedding. Ideally, a selection criterion should be both accurate and robust. For each downstream task, we consider accuracy as the fraction of cases where a criterion selects the best-performing embedding on the task.
We quantify the robustness as the maximum observed performance difference between the\nselected embedding and the one which performs the best on a downstream task. In Section 4.3, we\nempirically validate that the eigenspace overlap score is a more accurate and robust criterion than\nexisting measures of compression quality.\n\n4 Experiments\nWe empirically validate our theory relating the eigenspace overlap score with generalization per-\nformance, our analysis on the strong performance of uniform quantization, and the ef\ufb01cacy of the\neigenspace overlap score as an embedding selection criterion. We \ufb01rst demonstrate that this score\ncorrelates better with downstream performance than existing measures of compression quality in\nSection 4.1. We then demonstrate in Section 4.2 that uniform quantization consistently matches or\noutperforms the compression methods to which we compare, both in terms of the eigenspace overlap\nscore and downstream performance. In Section 4.3, we show that the eigenspace overlap score is a\nmore accurate and robust selection criterion than other measures of compression quality.\nExperiment setup We evaluate compressed versions of publicly available 300-dimensional fast-\nText and GloVe embeddings on question answering and sentiment analysis tasks, and compressed\n768-dimensional WordPiece embeddings from the pre-trained case-sensitive BERTBASE model [10]\non tasks from the General Language Understanding Evaluation (GLUE) benchmark [37]. We use the\nfour compression methods discussed in Section 2: DCCL, k-means, dimensionality reduction, and\nuniform quantization.5 For the tasks, we consider question answering using the DrQA model [5] on\nthe Stanford Question Answering Dataset (SQuAD) [32], sentiment analysis using a CNN model [18]\non all the datasets used by Kim [18], and language understanding using the BERTBASE model on the\ntasks in the GLUE benchmark [37]. 
We present results on the SQuAD dataset, the largest sentiment analysis dataset (SST-1 [34]), and the two largest GLUE tasks (MNLI and QQP) in this section, and include the results on the other sentiment analysis and GLUE tasks in Appendix E. We evaluate downstream performance using the F1 score for question answering, accuracy for sentiment analysis, and the standard evaluation metric for each GLUE task (Table 5 in Appendix D). Across embedding types and tasks, we first compress the pre-trained embeddings, and then train the non-embedding model parameters in the standard manner for each task, keeping the embeddings fixed throughout training. For the GLUE tasks, we add a linear layer on top of the final layer of the pre-trained BERT model (as in [10]), and then fine-tune the non-embedding model parameters.6 For more details on the various embeddings, tasks, and hyperparameters we use, see Appendix D.

4.1 The Eigenspace Overlap Score and Downstream Performance

To empirically validate the theoretical connection between the eigenspace overlap score and downstream performance, we show that the eigenspace overlap score correlates better with downstream performance than the existing measures of compression quality discussed in Section 2.
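For concreteness, the eigenspace overlap score can be computed directly from the left singular vectors of the uncompressed and compressed embedding matrices. The sketch below assumes the score is the squared Frobenius norm of UᵀŨ normalized by the larger of the two embedding dimensions, matching the definition given in Section 2; the matrix shapes used are illustrative.

```python
import numpy as np

def eigenspace_overlap(X, X_tilde):
    """Eigenspace overlap score between an uncompressed embedding matrix X
    (n x d) and a compressed version X_tilde (n x d').

    Sketch assuming E(X, X_tilde) = ||U^T U_tilde||_F^2 / max(d, d'),
    where U and U_tilde hold the left singular vectors of X and X_tilde.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_tilde, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    overlap = np.linalg.norm(U.T @ U_tilde, "fro") ** 2
    return overlap / max(X.shape[1], X_tilde.shape[1])
```

Because the columns of U and U_tilde are orthonormal, the score always lies in [0, 1], and equals 1 when the two embeddings span the same subspace (in particular, E(X, X) = 1 for a full-rank X).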
Thus, even though our analysis is for linear and logistic regression, we see the eigenspace overlap score also has strong empirical correlation with downstream performance on tasks using neural network models.

5 For dimensionality reduction, we use PCA for fastText and BERT embeddings (compression rates: 1, 2, 4, 8), and publicly available lower-dimensional embeddings for GloVe (compression rates: 1, 1.5, 3, 6).

6 Freezing the WordPiece embeddings does not observably affect performance (see Appendix E.1).

Table 1: Spearman correlation between measures of compression quality and downstream performance. For each measure of compression quality, we show the absolute value of its Spearman correlation with downstream performance, on the SQuAD (question answering), SST-1 (sentiment analysis), MNLI (natural language inference), and QQP (question pair matching) tasks. We see that the eigenspace overlap score E attains stronger correlation than the other measures.

Dataset      SQuAD             SST-1             MNLI             QQP
Embedding    GloVe   fastText  GloVe   fastText  BERT WordPiece   BERT WordPiece
PIP loss     0.49    0.34      0.46    0.25      0.45             0.45
∆            0.46    0.31      0.33    0.29      0.44             0.36
∆max         0.62    0.72      0.51    0.60      0.86             0.86
1 − E        0.81    0.91      0.75    0.73      0.92             0.93

In Figure 2 we present results for question answering (SQuAD) performance for compressed fastText embeddings as a function of the various measures of compression quality.
In each plot, for each combination of compression rate and compression method, we plot the average compression quality measure (x-axis) and the average downstream performance (y-axis) across the five random seeds used (error bars indicate standard deviations). If the ranking based on the measure of compression quality were identical to the ranking based on downstream performance, we would see a monotonically decreasing sequence of points. As we can see from the rightmost plot in Figure 2, the downstream performance decreases smoothly as the eigenspace overlap value decreases; the downstream performance does not align as well with the other measures of compression quality (left three plots).

To quantify how well the ranking based on the quality measures matches the ranking based on downstream performance, we compute the Spearman correlation ρ between these quantities. In Table 1 we can see that the eigenspace overlap score gets consistently higher correlation values with downstream performance than the other measures of compression quality. Note that ∆max also attains relatively high correlation values, though the eigenspace overlap score still outperforms ∆max by 0.06 to 0.24 on the tasks in Table 1. See Appendix E.5 for similar results on other tasks.

4.2 Downstream Performance of Uniform Quantization

We show that across tasks and compression rates uniform quantization consistently matches or outperforms the other compression methods, in terms of both the eigenspace overlap score and downstream performance. These empirical results validate our analysis from Section 3.2 showing that uniformly quantized embeddings in expectation attain high eigenspace overlap scores, and are thus likely to attain strong downstream performance.
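The uniform quantization baseline itself is simple to sketch: each entry is stochastically rounded to one of 2^b evenly spaced values on a clipped interval [−a, a]. In the sketch below we simply set a to the largest absolute entry of the matrix; the paper treats the clipping parameter a more carefully, so this choice, and the helper's name, are illustrative assumptions.

```python
import numpy as np

def uniform_quantize(X, b, rng=None):
    """Stochastically round each entry of X to a uniform grid of 2^b values
    spanning [-a, a]. Here a = max |X_ij| for simplicity (an assumption;
    in general the clipping parameter a can be tuned)."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.abs(X).max()
    levels = 2 ** b - 1                  # number of grid intervals
    scaled = (X + a) / (2 * a) * levels  # map entries into [0, levels]
    floor = np.floor(scaled)
    prob_up = scaled - floor             # stochastic (unbiased) rounding
    q = floor + (rng.random(X.shape) < prob_up)
    return q / levels * (2 * a) - a      # map grid indices back to [-a, a]
```

Stochastic rounding keeps the quantized matrix unbiased in expectation, and the per-entry error is at most one grid step, 2a/(2^b − 1), consistent with the quantization error shrinking as the precision b grows.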
In Figure 3 we plot the average eigenspace overlap (left) and average question answering (SQuAD) performance (right) of compressed fastText embeddings for different compression methods and compression rates; we visualize the standard deviation over five random seeds with error bars. Our primary conclusion is that the simple uniform quantization method consistently performs similarly to or better than the other compression methods, both in terms of the eigenspace overlap score and downstream performance.7 Given the connections between downstream performance and the eigenspace overlap score, the high eigenspace overlap scores attained by uniform quantization help explain its strong downstream performance. For results with the same trend on the GLUE and sentiment tasks, see Appendices E.1 and E.4.8

Figure 3: Eigenspace overlap and downstream performance of uniform quantization. Uniform quantization can attain high values for the eigenspace overlap E, and match the k-means and DCCL methods for fastText embeddings on the question answering (SQuAD) task.

7 We apply uniform quantization to compress embeddings trained end-to-end for a translation task in Appendix E.2; we show it outperforms a tensorized factorization [16] proposed for the task-specific setting.

8 We provide a memory-efficient implementation of the uniform quantization method in https://github.com/HazyResearch/smallfry.

Table 2: The selection error rate of each measure of compression quality as a selection criterion. Across all pairs of compressed embeddings from our experiments, we measure for each task the fraction of cases when a quality measure selects the worse performing embedding.
We observe that the eigenspace overlap score E achieves lower error rates than other compression quality measures.

Dataset      SQuAD             SST-1             MNLI             QQP
Embedding    GloVe   fastText  GloVe   fastText  BERT WordPiece   BERT WordPiece
PIP loss     0.32    0.37      0.32    0.40      0.31             0.32
∆            0.34    0.58      0.39    0.57      0.32             0.33
∆max         0.28    0.22      0.30    0.27      0.15             0.16
1 − E        0.17    0.11      0.19    0.20      0.10             0.10

4.3 Compressed Embedding Selection with the Eigenspace Overlap Score

We now show that the eigenspace overlap score is a more accurate and robust selection criterion for compressed embeddings than the existing measures of compression quality. In our experiment, we first enumerate all the embeddings we compressed using different compression methods, compression rates, and five random seeds, and we evaluate each of these embeddings on the various downstream tasks; we use the same random seed for compression and for downstream training. We then consider for each task all pairs of compressed embeddings, and for each measure of compression quality report the selection error rate: the fraction of cases where the embedding with a higher compression quality score attains worse downstream performance. We show in Table 2 that across different tasks the eigenspace overlap score achieves lower selection error rates than the PIP loss and the spectral distance measures ∆ and ∆max, with 1.3× to 2× lower selection error rates than the second best measure. To demonstrate the robustness of the eigenspace overlap score as a criterion, we measure the maximum difference in downstream performance, across all pairs of compressed embeddings discussed above, between the better performing embedding and the one selected by the eigenspace overlap score.
We observe that this maximum performance difference is 1.1× to 5.5× smaller for the eigenspace overlap score than for the measure of compression quality with the second smallest maximum performance difference. See Appendix E.8 for more detailed results on the robustness of the eigenspace overlap score as a selection criterion.

5 Related Work

Compressing machine learning models is critical for training and inference in resource-constrained settings. To enable low-memory training, recent work investigates using low numerical precision [21, 9] and sparsity [35, 24]. To compress a model for low-memory inference, Han et al. [14] investigate pruning and quantization for deep neural networks.

Our work on understanding the generalization performance of compressed embeddings is also closely related to work on understanding the generalization performance of kernel approximation methods [38, 31]. In particular, training a linear model over compressed word embeddings can be viewed as training a model with a linear kernel using an approximation to the kernel matrix. Recently, there has been work on how different measures of kernel approximation error relate to the generalization performance of the model trained using the approximate kernels, with Avron et al. [3] and Zhang et al. [41] proposing the spectral measures of approximation error which we consider in this work.

6 Conclusion and Future Work

We proposed the eigenspace overlap score, a new way to measure the quality of a compressed embedding without requiring training for each downstream task of interest. We related this score to the generalization performance of linear and logistic regression models, used this score to better understand the strong empirical performance of uniformly quantized embeddings, and showed that this score is an accurate and robust selection criterion for compressed embeddings.
Although this work focuses on word embeddings, for future work we hope to show that the ideas presented here extend to other domains: for example, to other types of embeddings (e.g., graph node embeddings [12]), and to compressing the activations of neural networks. We also believe that our work can help understand the performance of any model trained using compressed or perturbed features, and help explain why certain proposed methods for compressing neural networks succeed while others fail. We hope this work inspires improvements to compression methods in various domains.

Acknowledgments

We thank Tony Ginart, Max Lam, Stephanie Wang, and Christopher Aberger for all their work on the early stages of this project. We further thank all the members of our research group for their helpful discussions and feedback throughout the course of this work.

We gratefully acknowledge the support of DARPA under Nos. FA87501720095 (D3M), FA86501827865 (SDH), and FA86501827882 (ASED); NIH under No. U54EB020405 (Mobilize); NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N000141712266 (Unifying Weak Supervision); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Swiss Re, and members of the Stanford DAWN project: Teradata, Facebook, Google, Ant Financial, NEC, VMWare, and Infosys. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S.
Government.

References

[1] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. Globally normalized transition-based neural networks. In ACL, 2016.

[2] Martin Andrews. Compressing word embeddings. In ICONIP, 2016.

[3] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In ICML, 2017.

[4] Nicola Bertoldi, Prashant Mathur, Nicholas Ruiz, and Marcello Federico. FBK's machine translation and speech translation systems for the IWSLT 2014 evaluation campaign. In IWSLT, 2014.

[5] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In ACL, 2017.

[6] Ting Chen, Martin Renqiang Min, and Yizhou Sun. Learning k-way d-dimensional discrete codes for compact embedding representations. In ICML, 2018.

[7] Dan Shiebler, Chris Green, Luca Belli, and Abhishek Tayal. Embeddings@Twitter, 2018. URL https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html. [Online; published 13-Sept-2018; accessed 20-May-2019].

[8] C. Davis and W. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.

[9] Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R Aberger, Kunle Olukotun, and Christopher Ré. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383, 2018.

[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[11] A. Gersho. Quantization. IEEE Communications Society Magazine, 15(5):16–16, Sep. 1977.

[12] Aditya Grover and Jure Leskovec.
node2vec: Scalable feature learning for networks. In KDD, 2016.

[13] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, 2015.

[14] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In ICLR, 2016.

[15] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417, 1933.

[16] Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan V. Oseledets. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787, 2019.

[17] J. Kiefer. Sequential minimax search for a maximum. Proceedings of the American Mathematical Society, 4:502–506, 1953.

[18] Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.

[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In NeurIPS, 2014.

[21] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Frederick Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In ICLR, 2018.

[22] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[23] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. In LREC, 2018.

[24] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization.
In ICML, 2019.

[25] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL-HLT: Demonstrations, 2019.

[26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.

[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[28] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

[29] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.

[30] Tiberiu Popoviciu. Sur les équations algébriques ayant toutes leurs racines réelles. Mathematica, 9:129–145, 1935.

[31] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NeurIPS, 2007.

[32] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.

[33] Raphael Shu and Hideki Nakayama. Compressing word embeddings via deep compositional code learning. In ICLR, 2018.

[34] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.

[35] Nimit Sharad Sohoni, Christopher Richard Aberger, Megan Leszczynski, Jian Zhang, and Christopher Ré.
Low-memory neural network training: A technical report. arXiv preprint arXiv:1904.10631, 2019.

[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

[37] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019.

[38] Christopher K. I. Williams and Matthias W. Seeger. Using the Nyström method to speed up kernel machines. In NeurIPS, 2000.

[39] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[40] Zi Yin and Yuanyuan Shen. On the dimensionality of word embedding. In NeurIPS, 2018.

[41] Jian Zhang, Avner May, Tri Dao, and Christopher Ré. Low-precision random Fourier features for memory-constrained kernel approximation. In AISTATS, 2019.