{"title": "On the Dimensionality of Word Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 887, "page_last": 898, "abstract": "In this paper, we provide a theoretical understanding of word embedding and its dimensionality. Motivated by the unitary-invariance of word embedding, we propose the Pairwise Inner Product (PIP) loss, a novel metric on the dissimilarity between word embeddings. Using techniques from matrix perturbation theory, we reveal a fundamental bias-variance trade-off in dimensionality selection for word embeddings. This bias-variance trade-off sheds light on many empirical observations which were previously unexplained, for example the existence of an optimal dimensionality. Moreover, new insights and discoveries, like when and how word embeddings are robust to over-fitting, are revealed. By optimizing over the bias-variance trade-off of the PIP loss, we can explicitly answer the open question of dimensionality selection for word embedding.", "full_text": "On the Dimensionality of Word Embedding\n\nZi Yin\n\nStanford University\n\nYuanyuan Shen\n\nMicrosoft Corp. & Stanford University\n\ns0960974@gmail.com\n\nYuanyuan.Shen@microsoft.com\n\nAbstract\n\nIn this paper, we provide a theoretical understanding of word embedding and its\ndimensionality. Motivated by the unitary-invariance of word embedding, we pro-\npose the Pairwise Inner Product (PIP) loss, a novel metric on the dissimilarity\nbetween word embeddings. Using techniques from matrix perturbation theory, we\nreveal a fundamental bias-variance trade-off in dimensionality selection for word\nembeddings. This bias-variance trade-off sheds light on many empirical observa-\ntions which were previously unexplained, for example the existence of an optimal\ndimensionality. Moreover, new insights and discoveries, like when and how word\nembeddings are robust to over-\ufb01tting, are revealed. 
By optimizing over the bias-variance trade-off of the PIP loss, we can explicitly answer the open question of dimensionality selection for word embedding.\n\n1 Introduction\n\nWord embeddings are very useful and versatile tools, serving as keys to many fundamental problems in NLP research [Turney and Pantel, 2010]. To name a few, word embeddings are widely applied in information retrieval [Salton, 1971, Salton and Buckley, 1988, Sparck Jones, 1972], recommendation systems [Breese et al., 1998, Yin et al., 2017], image description [Frome et al., 2013], relation discovery [Mikolov et al., 2013c] and word-level translation [Mikolov et al., 2013b]. Furthermore, numerous important applications are built on top of word embeddings. Some prominent examples are long short-term memory (LSTM) networks [Hochreiter and Schmidhuber, 1997] that are used for language modeling [Bengio et al., 2003], machine translation [Sutskever et al., 2014, Bahdanau et al., 2014], text summarization [Nallapati et al., 2016] and image caption generation [Xu et al., 2015, Vinyals et al., 2015]. Other important applications include named entity recognition [Lample et al., 2016], sentiment analysis [Socher et al., 2013] and so on.\n\nHowever, the impact of dimensionality on word embedding has not yet been fully understood. As a critical hyper-parameter, the choice of dimensionality for word vectors has a huge influence on the performance of a word embedding. First, it directly impacts the quality of word vectors: a word embedding with a small dimensionality is typically not expressive enough to capture all possible word relations, whereas one with a very large dimensionality suffers from over-fitting. Second, the number of parameters for a word embedding or a model that builds on word embeddings (e.g. recurrent neural networks) is usually a linear or quadratic function of dimensionality, which directly affects training time and computational costs. 
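To make the parameter-count argument concrete, here is a back-of-the-envelope sketch; the vocabulary size, dimensionality, and LSTM gate structure below are illustrative assumptions, not numbers from this paper:

```python
# Back-of-the-envelope parameter counts (illustrative sizes only).
# The embedding table grows linearly in the dimensionality d, while an
# LSTM stacked on top of the embeddings grows quadratically in d when
# its hidden size is tied to d.

def embedding_params(n_vocab: int, d: int) -> int:
    # one d-dimensional vector per vocabulary word
    return n_vocab * d

def lstm_params(d: int, h: int) -> int:
    # 4 gates, each with input weights (d x h), recurrent weights (h x h),
    # and a bias vector of size h
    return 4 * (d * h + h * h + h)

n_vocab, d = 10_000, 300
total = embedding_params(n_vocab, d) + lstm_params(d, d)
```

Doubling d in this sketch roughly doubles the embedding table but quadruples the LSTM weights, which is the cost argument made above.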
Therefore, large dimensionalities tend to increase model complexity, slow down training speed, and add inferential latency, all of which are constraints that can potentially limit model applicability and deployment [Wu et al., 2016].\n\nDimensionality selection for embedding is a well-known open problem. In most NLP research, dimensionality is either selected ad hoc or by grid search, either of which can lead to sub-optimal model performance. For example, 300 is perhaps the most commonly used dimensionality in various studies [Mikolov et al., 2013a, Pennington et al., 2014, Bojanowski et al., 2017]. This is possibly due to the influence of the groundbreaking paper which introduced the skip-gram Word2Vec model and chose a dimensionality of 300 [Mikolov et al., 2013a].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nA better empirical approach used by some researchers is to first train many embeddings of different dimensionalities, evaluate them on a functionality test (like word relatedness or word analogy), and then pick the one with the best empirical performance. However, this method suffers from 1) greatly increased time complexity and computational burden, 2) inability to exhaust all possible dimensionalities and 3) lack of consensus between different functionality tests, as their results can differ. Thus, we need a universal criterion that can reflect the relationship between the dimensionality and quality of word embeddings in order to establish a dimensionality selection procedure for embedding methods.\n\nIn this regard, we outline a few major contributions of our paper:\n\n1. We introduce the PIP loss, a novel metric on the dissimilarity between word embeddings;\n\n2. We develop a mathematical framework that reveals a fundamental bias-variance trade-off in dimensionality selection. 
We explain the existence of an optimal dimensionality, a phenomenon commonly observed but previously unexplained;\n\n3. We quantify the robustness of embedding algorithms using the exponent parameter α, and establish that many widely used embedding algorithms, including skip-gram and GloVe, are robust to over-fitting;\n\n4. We propose a mathematically rigorous answer to the open problem of dimensionality selection by minimizing the PIP loss. We perform this procedure and cross-validate the results with grid search for LSA, skip-gram Word2Vec and GloVe on an English corpus.\n\nFor the rest of the paper, we consider the problem of learning an embedding for a vocabulary of size n, which is canonically defined as V = {1, 2, ..., n}. Specifically, we want to learn a vector representation v_i ∈ R^d for each token i. The main object is the embedding matrix E ∈ R^{n×d}, consisting of the stacked vectors v_i, where E_{i,·} = v_i. All matrix norms in the paper are Frobenius norms unless otherwise stated.\n\n2 Preliminaries and Background Knowledge\n\nOur framework is built on the following preliminaries:\n\n1. Word embeddings are unitary-invariant;\n\n2. Most existing word embedding algorithms can be formulated as low-rank matrix approximations, either explicitly or implicitly.\n\n2.1 Unitary Invariance of Word Embeddings\n\nThe unitary-invariance of word embeddings has been discovered in recent research [Hamilton et al., 2016, Artetxe et al., 2016, Smith et al., 2017, Yin, 2018]. It states that two embeddings are essentially identical if one can be obtained from the other by performing a unitary operation, e.g., a rotation. A unitary operation on a vector corresponds to multiplying the vector by a unitary matrix, i.e. v′ = vU, where U^T U = U U^T = I_d. Note that a unitary transformation preserves the relative geometry of the vectors, and hence defines an equivalence class of embeddings. 
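This equivalence is easy to verify numerically; the following is a minimal sketch (sizes chosen arbitrarily) checking that an orthogonal rotation leaves all pairwise inner products unchanged:

```python
import numpy as np

# A unitary (here: real orthogonal) transformation preserves the relative
# geometry of an embedding: E U has exactly the same pairwise inner
# products as E, since (E U)(E U)^T = E U U^T E^T = E E^T.
rng = np.random.default_rng(0)
n, d = 50, 10
E = rng.standard_normal((n, d))

# a random orthogonal matrix via QR decomposition
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
E_rot = E @ U

assert np.allclose(U.T @ U, np.eye(d))          # U is orthogonal
assert np.allclose(E_rot @ E_rot.T, E @ E.T)    # relative geometry preserved
```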
In Section 3, we introduce the Pairwise Inner Product loss, a unitary-invariant metric on embedding similarity.\n\n2.2 Word Embeddings from Explicit Matrix Factorization\n\nA wide range of embedding algorithms use explicit matrix factorization, including the popular Latent Semantic Analysis (LSA). In LSA, word embeddings are obtained by truncated SVD of a signal matrix M which is usually based on co-occurrence statistics, for example the Pointwise Mutual Information (PMI) matrix, positive PMI (PPMI) matrix and Shifted PPMI (SPPMI) matrix [Levy and Goldberg, 2014]. Eigenwords [Dhillon et al., 2015] is another example of this type.\n\nCaron [2001], Bullinaria and Levy [2012], Turney [2012], Levy and Goldberg [2014] described a generic approach of obtaining embeddings from matrix factorization. Let M be the signal matrix (e.g. the PMI matrix) and M = U D V^T be its SVD. A k-dimensional embedding is obtained by truncating the left singular matrix U at dimension k, and multiplying it by a power of the truncated diagonal matrix D, i.e. E = U_{·,1:k} D^α_{1:k,1:k} for some α ∈ [0, 1]. Caron [2001], Bullinaria and Levy [2012] discovered through empirical studies that different values of α work better for different language tasks. In Levy and Goldberg [2014], where the authors explained the connection between skip-gram Word2Vec and matrix factorization, α is set to 0.5 to enforce symmetry. We discover that α controls the robustness of embeddings against over-fitting, as will be discussed in Section 5.1.\n\n2.3 Word Embeddings from Implicit Matrix Factorization\n\nIn NLP, the two most widely used embedding models are skip-gram Word2Vec [Mikolov et al., 2013c] and GloVe [Pennington et al., 2014]. 
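The generic recipe above (truncate the SVD at dimension k, then scale the directions by D^α) can be sketched in a few lines; the random signal matrix, k, and α below are arbitrary stand-ins, not values prescribed by any particular algorithm:

```python
import numpy as np

# Generic matrix-factorization embedding: E = U_{.,1:k} D^alpha_{1:k,1:k},
# where M = U D V^T is the SVD of a signal matrix (e.g. a PMI matrix).
def embed(M: np.ndarray, k: int, alpha: float) -> np.ndarray:
    U, s, _ = np.linalg.svd(M)
    return U[:, :k] * s[:k] ** alpha   # scale each kept direction

rng = np.random.default_rng(1)
M = rng.standard_normal((20, 20))
M = (M + M.T) / 2                      # symmetrize, like a PMI matrix
E = embed(M, k=5, alpha=0.5)           # alpha = 0.5 enforces symmetry
assert E.shape == (20, 5)
```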
Although they learn word embeddings by optimizing objective functions with stochastic gradient methods, both have been shown to implicitly perform matrix factorizations.\n\nSkip-gram. Skip-gram Word2Vec maximizes the likelihood of co-occurrence of the center word and its context words. The log likelihood is defined as\n\nΣ_{i=0}^n Σ_{j=i−w, j≠i}^{i+w} log(σ(v_j^T v_i)), where σ(x) = e^x / (1 + e^x)\n\nLevy and Goldberg [2014] showed that skip-gram Word2Vec's objective is an implicit symmetric factorization of the Pointwise Mutual Information (PMI) matrix:\n\nPMI_{ij} = log( p(v_i, v_j) / (p(v_i) p(v_j)) )\n\nSkip-gram is sometimes enhanced with techniques like negative sampling [Mikolov et al., 2013b], in which case the signal matrix becomes the Shifted PMI matrix [Levy and Goldberg, 2014].\n\nGloVe. Levy et al. [2015] pointed out that the objective of GloVe is implicitly a symmetric factorization of the log-count matrix. The factorization is sometimes augmented with bias vectors, and the log-count matrix is sometimes raised to an exponent γ ∈ [0, 1] [Pennington et al., 2014].\n\n3 PIP Loss: a Novel Unitary-invariant Loss Function for Embeddings\n\nHow do we know whether a trained word embedding is good enough? Questions of this kind cannot be answered without a properly defined loss function. For example, in statistical estimation (e.g. linear regression), the quality of an estimator θ̂ can often be measured using the l2 loss E[‖θ̂ − θ*‖₂²], where θ* is the unobserved ground-truth parameter. Similarly, for word embedding, a proper metric is needed in order to evaluate the quality of a trained embedding.\n\nAs discussed in Section 2.1, a reasonable loss function between embeddings should respect the unitary-invariance. 
This rules out choices like direct comparisons, for example using ‖E_1 − E_2‖ as the loss function. We propose the Pairwise Inner Product (PIP) loss, which naturally arises from the unitary-invariance, as the dissimilarity metric between two word embeddings:\n\nDefinition 1 (PIP matrix). Given an embedding matrix E ∈ R^{n×d}, define its associated Pairwise Inner Product (PIP) matrix to be\n\nPIP(E) = E E^T\n\nIt can be seen that the (i, j)-th entry of the PIP matrix corresponds to the inner product between the embeddings for word i and word j, i.e. PIP_{i,j} = ⟨v_i, v_j⟩. To compare E_1 and E_2, two embedding matrices on a common vocabulary, we propose the PIP loss:\n\nDefinition 2 (PIP loss). The PIP loss between E_1 and E_2 is defined as the norm of the difference between their PIP matrices:\n\n‖PIP(E_1) − PIP(E_2)‖ = ‖E_1 E_1^T − E_2 E_2^T‖ = √(Σ_{i,j} (⟨v_i^{(1)}, v_j^{(1)}⟩ − ⟨v_i^{(2)}, v_j^{(2)}⟩)²)\n\nNote that the i-th row of the PIP matrix, v_i E^T = (⟨v_i, v_1⟩, ..., ⟨v_i, v_n⟩), can be viewed as the relative position of v_i anchored against all other vectors {v_1, ..., v_n}. In essence, the PIP loss measures the vectors' relative position shifts between E_1 and E_2, thereby removing their dependencies on any specific coordinate system. The PIP loss respects the unitary-invariance. Specifically, if E_2 = E_1 U where U is a unitary matrix, then the PIP loss between E_1 and E_2 is zero, because E_2 E_2^T = E_1 U U^T E_1^T = E_1 E_1^T. In addition, the PIP loss serves as a metric of functionality dissimilarity. 
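The two definitions above translate directly into code; a minimal sketch follows, with random matrices standing in for trained embeddings:

```python
import numpy as np

# PIP loss between two embeddings over the same vocabulary:
# the Frobenius norm of the difference of their PIP matrices.
def pip_loss(E1: np.ndarray, E2: np.ndarray) -> float:
    return float(np.linalg.norm(E1 @ E1.T - E2 @ E2.T))

rng = np.random.default_rng(2)
E1 = rng.standard_normal((30, 8))
U, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # random unitary matrix

assert np.isclose(pip_loss(E1, E1 @ U), 0.0)   # unitary-invariance
assert pip_loss(E1, E1[:, :7]) > 0.0           # a dropped dimension is detected
```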
A practitioner may only care about the usability of word embeddings, for example, using them to solve analogy and relatedness tasks [Schnabel et al., 2015, Baroni et al., 2014], which are the two most important properties of word embeddings. Since both properties are tightly related to vector inner products, a small PIP loss between E_1 and E_2 leads to a small difference in E_1 and E_2's relatedness and analogy, as the PIP loss measures the difference in inner products[1]. As a result, from both theoretical and practical standpoints, the PIP loss is a suitable loss function for embeddings. Furthermore, we show in Section 4 that this formulation opens up a new angle to understanding the effect of embedding dimensionality with matrix perturbation theory.\n\n4 How Does Dimensionality Affect the Quality of Embedding?\n\nWith the PIP loss, we can now study the quality of trained word embeddings for any algorithm that uses matrix factorization. Suppose a d-dimensional embedding is derived from a signal matrix M with the form f_{α,d}(M) ≜ U_{·,1:d} D^α_{1:d,1:d}, where M = U D V^T is the SVD. In the ideal scenario, a genie reveals a clean signal matrix M (e.g. the PMI matrix) to the algorithm, which yields the oracle embedding E = f_{α,d}(M). However, in practice, there is no magical oil lamp, and we have to estimate M̃ (e.g. the empirical PMI matrix) from the training data, where M̃ = M + Z is perturbed by the estimation noise Z. The trained embedding Ê = f_{α,k}(M̃) is computed by factorizing this noisy matrix. To ensure Ê is close to E, we want the PIP loss ‖E E^T − Ê Ê^T‖ to be small. In particular, this PIP loss is affected by k, the dimensionality we select for the trained embedding.\n\nArora [2016] discussed in an article a mysterious empirical observation of word embeddings: “... 
A striking finding in empirical work on word embeddings is that there is a sweet spot for the dimensionality of word vectors: neither too small, nor too large”[2]. He proceeded by discussing two possible explanations: low-dimensional projection (like the Johnson-Lindenstrauss Lemma) and standard generalization theory (like the VC dimension), and pointed out why neither is sufficient for explaining this phenomenon. While some may argue that this is caused by underfitting/overfitting, the concept itself is too broad to provide any useful insight. We show that this phenomenon can be explicitly explained by a bias-variance trade-off in Sections 4.1, 4.2 and 4.3. Equipped with the PIP loss, we give a mathematical presentation of the bias-variance trade-off using matrix perturbation theory. We first introduce a classical result in Lemma 1; the proof is deferred to the appendix, and can also be found in Stewart and Sun [1990].\n\nLemma 1. Let X, Y be two orthogonal matrices in R^{n×n}. Write X = [X_0, X_1] and Y = [Y_0, Y_1], where X_0, Y_0 ∈ R^{n×k} are the first k columns of X and Y respectively, with k ≤ n. Then\n\n‖X_0 X_0^T − Y_0 Y_0^T‖ = c ‖X_0^T Y_1‖\n\nwhere c is a constant depending on the norm only: c = 1 for the 2-norm and √2 for the Frobenius norm.\n\nAs pointed out by several papers [Caron, 2001, Bullinaria and Levy, 2012, Turney, 2012, Levy and Goldberg, 2014], embedding algorithms can be generically characterized as E = U_{·,1:k} D^α_{1:k,1:k} for some α ∈ [0, 1]. For illustration purposes, we first consider a special case where α = 0.\n\n4.1 The Bias-Variance Trade-off for a Special Case: α = 0\n\nThe following theorem shows how the PIP loss can be naturally decomposed into a bias term and a variance term when α = 0:\n\nTheorem 1. 
Let E ∈ R^{n×d} and Ê ∈ R^{n×k} be the oracle and trained embeddings, where k ≤ d. Assume both have orthonormal columns. Then the PIP loss has a bias-variance decomposition\n\n‖PIP(E) − PIP(Ê)‖² = d − k + 2‖Ê^T E_⊥‖²\n\nProof. The proof utilizes techniques from matrix perturbation theory. To simplify notations, denote X_0 = E, Y_0 = Ê, and let X = [X_0, X_1], Y = [Y_0, Y_1] be the completed n-by-n orthogonal matrices. Since k ≤ d, we can further split X_0 into X_{0,1} and X_{0,2}, where the former has k columns and the latter d − k. Now, the PIP loss equals\n\n‖E E^T − Ê Ê^T‖² = ‖X_{0,1} X_{0,1}^T + X_{0,2} X_{0,2}^T − Y_0 Y_0^T‖²\n= ‖X_{0,1} X_{0,1}^T − Y_0 Y_0^T‖² + ‖X_{0,2} X_{0,2}^T‖² + 2⟨X_{0,1} X_{0,1}^T − Y_0 Y_0^T, X_{0,2} X_{0,2}^T⟩\n=(a) 2‖Y_0^T [X_{0,2}, X_1]‖² + d − k − 2⟨Y_0 Y_0^T, X_{0,2} X_{0,2}^T⟩\n= 2‖Y_0^T X_{0,2}‖² + 2‖Y_0^T X_1‖² + d − k − 2⟨Y_0 Y_0^T, X_{0,2} X_{0,2}^T⟩\n= d − k + 2‖Y_0^T X_1‖² = d − k + 2‖Ê^T E_⊥‖²\n\nwhere in equality (a) we used Lemma 1.\n\n[1] A detailed discussion on the PIP loss and analogy/relatedness is deferred to the appendix.\n[2] http://www.offconvex.org/2016/02/14/word-embeddings-2/\n\nThe observation is that the right-hand side now consists of two parts, which we identify as bias and variance. The first part, d − k, is the amount of lost signal, caused by discarding the last d − k dimensions when selecting k ≤ d. However, ‖Ê^T E_⊥‖ increases as k increases, as the noise perturbs the subspace spanned by E, and the singular vectors corresponding to smaller singular values are more prone to such perturbation. 
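Because Theorem 1 is an exact identity, it can be checked numerically; below is a small sketch (dimensions arbitrary) where the oracle and trained embeddings are random matrices with orthonormal columns:

```python
import numpy as np

# Check the alpha = 0 decomposition: for E (n x d) and E_hat (n x k) with
# orthonormal columns and k <= d,
#   ||E E^T - E_hat E_hat^T||_F^2 = (d - k) + 2 ||E_hat^T E_perp||_F^2,
# where E_perp spans the orthogonal complement of E's columns.
rng = np.random.default_rng(3)
n, d, k = 40, 12, 7

X, _ = np.linalg.qr(rng.standard_normal((n, n)))   # full orthogonal basis
E, E_perp = X[:, :d], X[:, d:]
E_hat, _ = np.linalg.qr(rng.standard_normal((n, k)))

lhs = np.linalg.norm(E @ E.T - E_hat @ E_hat.T) ** 2
rhs = (d - k) + 2 * np.linalg.norm(E_hat.T @ E_perp) ** 2
assert np.isclose(lhs, rhs)
```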
As a result, the optimal dimensionality k* which minimizes the PIP loss lies in between 0 and d, the rank of the matrix M.\n\n4.2 The Bias-Variance Trade-off for the Generic Case: α ∈ (0, 1]\n\nIn this generic case, the columns of E, Ê are no longer orthonormal, so the assumptions of classical matrix perturbation theory are not satisfied. We develop a novel technique where Lemma 1 is applied in a telescoping fashion. The proof of the theorem is deferred to the appendix.\n\nTheorem 2. Let M = U D V^T, M̃ = Ũ D̃ Ṽ^T be the SVDs of the clean and estimated signal matrices. Suppose E = U_{·,1:d} D^α_{1:d,1:d} is the oracle embedding and Ê = Ũ_{·,1:k} D̃^α_{1:k,1:k} is the trained embedding, for some k ≤ d. Let D = diag(λ_i) and D̃ = diag(λ̃_i); then\n\n‖PIP(E) − PIP(Ê)‖ ≤ √(Σ_{i=k+1}^d λ_i^{4α}) + √(Σ_{i=1}^k (λ_i^{2α} − λ̃_i^{2α})²) + √2 Σ_{i=1}^k (λ_i^{2α} − λ_{i+1}^{2α}) ‖Ũ_{·,1:i}^T U_{·,i:n}‖\n\nAs before, the three terms in Theorem 2 can be characterized as bias and variance. The first term is the bias, as we lose part of the signal by choosing k ≤ d. Notice that the embedding matrix E consists of signal directions (given by U) and their magnitudes (given by D^α). The second term is the variance on the magnitudes, and the third term is the variance on the directions.\n\n4.3 The Bias-Variance Trade-off Captures the Signal-to-Noise Ratio\n\nWe now present the main theorem, which shows that the bias-variance trade-off reflects the “signal-to-noise ratio” in dimensionality selection.\n\nTheorem 3 (Main theorem). Suppose M̃ = M + Z, where M is the signal matrix, symmetric with spectrum {λ_i}_{i=1}^d. 
Z is the estimation noise, symmetric with iid, zero mean, variance σ² entries. For any 0 ≤ α ≤ 1 and k ≤ d, let the oracle and trained embeddings be\n\nE = U_{·,1:d} D^α_{1:d,1:d},  Ê = Ũ_{·,1:k} D̃^α_{1:k,1:k}\n\nwhere M = U D V^T, M̃ = Ũ D̃ Ṽ^T are the SVDs of the clean and estimated signal matrices. Then\n\n1. When α = 0,\n\nE[‖E E^T − Ê Ê^T‖] ≤ √(d − k + 2σ² Σ_{r≤k, s>d} (λ_r − λ_s)^{−2})\n\n2. When 0 < α ≤ 1,\n\nE[‖E E^T − Ê Ê^T‖] ≤ √(Σ_{i=k+1}^d λ_i^{4α}) + 2√(2n) ασ √(Σ_{i=1}^k λ_i^{4α−2}) + √2 σ Σ_{i=1}^k (λ_i^{2α} − λ_{i+1}^{2α}) √(Σ_{r≤i<s≤n} (λ_r − λ_s)^{−2})\n\nWriting U = [U_0, U_1] and Ũ = [Ũ_0, Ũ_1] with U_0, Ũ_0 ∈ R^{n×k}, suppose the spectral gap λ_k − λ_{k+1} > 0, and Z has iid, zero mean entries with variance σ²; then\n\nE[‖Ũ_1^T U_0‖] ≤ σ √(Σ_{1≤i≤k<j≤n} (λ_i − λ_j)^{−2}