{"title": "What the Vec? Towards Probabilistically Grounded Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 7467, "page_last": 7477, "abstract": "Word2Vec (W2V) and GloVe are popular word embedding algorithms that perform well on a variety of natural language processing tasks. The algorithms are fast and efficient, and their embeddings are widely used. Moreover, the W2V algorithm has recently been adopted in the field of graph embedding, where it underpins several leading algorithms. However, despite their ubiquity and the relative simplicity of their common architecture, what the embedding parameters of W2V and GloVe learn, and why that is useful in downstream tasks, largely remains a mystery. We show that different interactions of PMI vectors encode semantic properties that can be captured in low dimensional word embeddings by suitable projection, theoretically explaining why the embeddings of W2V and GloVe work, and, in turn, revealing an interesting mathematical interconnection between the semantic relationships of relatedness, similarity, paraphrase and analogy.", "full_text": "What the Vec?\n\nTowards Probabilistically Grounded Embeddings\n\nCarl Allen1\n\nIvana Balažević1\n\nTimothy Hospedales1,2\n\n1 School of Informatics, University of Edinburgh, UK\n\n2 Samsung AI Centre, Cambridge, UK\n\n{carl.allen, ivana.balazevic, t.hospedales}@ed.ac.uk\n\nAbstract\n\nWord2Vec (W2V) and GloVe are popular, fast and efficient word embedding algorithms. Their embeddings are widely used and perform well on a variety of natural language processing tasks. Moreover, W2V has recently been adopted in the field of graph embedding, where it underpins several leading algorithms. However, despite their ubiquity and relatively simple model architecture, a theoretical understanding of what the embedding parameters of W2V and GloVe learn and why that is useful in downstream tasks has been lacking. 
We show that different\ninteractions between PMI vectors re\ufb02ect semantic word relationships, such as\nsimilarity and paraphrasing, that are encoded in low dimensional word embeddings\nunder a suitable projection, theoretically explaining why embeddings of W2V\nand GloVe work. As a consequence, we also reveal an interesting mathematical\ninterconnection between the considered semantic relationships themselves.\n\n1\n\nIntroduction\n\nWord2Vec1 (W2V) [25] and GloVe [29] are fast, straightforward algorithms for generating word\nembeddings, or vector representations of words, often considered points in a semantic space. Their\nembeddings perform well on downstream tasks, such as identifying word similarity by vector\ncomparison (e.g. cosine similarity) and solving analogies, such as the well known \u201cman is to king as\nwoman is to queen\u201d, by the addition and subtraction of respective embeddings [26, 27, 19].\nIn addition, the W2V algorithm has recently been adopted within the growing \ufb01eld of graph em-\nbedding, where the typical aim is to represent graph nodes in a common latent space such that their\nrelative positioning can be used to predict edge relationships. Several state-of-the-art models for\ngraph representation incorporate the W2V algorithm to learn node embeddings based on random\nwalks over the graph [13, 30, 31]. Furthermore, word embeddings often underpin embeddings of\nword sequences, e.g. sentences. Although sentence embedding models can be complex [8, 17], as\nshown recently [38] they sometimes learn little beyond the information available in word embeddings.\nDespite their relative ubiquity, much remains unknown of the W2V and GloVe algorithms, perhaps\nmost fundamentally we lack a theoretical understanding of (i) what is learned in the embedding\nparameters; and (ii) why that is useful in downstream tasks. 
Answering such core questions is of\ninterest in itself, particularly since the algorithms are unsupervised, but may also lead to improved\nembedding algorithms, or enable better use to be made of the embeddings we have. For example,\nboth algorithms generate two embedding matrices, but little is known of how they relate or should\ninteract. Typically one is simply discarded, whereas empirically their mean can perform well [29] and\nelsewhere they are assumed identical [14, 4]. As for embedding interactions, a variety of heuristics\nare in common use, e.g. cosine similarity [26] and 3CosMult [19].\n\n1We refer exclusively, throughout, to the more common implementation Skipgram with negative sampling.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fOf works that seek to theoretically explain these embedding models [20, 14, 4, 9, 18], Levy and\nGoldberg [20] identify the loss function minimised (implicitly) by W2V and, thereby, the relationship\nbetween W2V word embeddings and the Pointwise Mutual Information (PMI) of word co-occurrences.\nMore recently, Allen and Hospedales [2] showed that this relationship explains the linear interaction\nobserved between embeddings of analogies. 
Building on these results, our key contributions are:

• to show how particular semantic relationships correspond to linear interactions of high dimensional PMI vectors and thus to equivalent interactions of low dimensional word embeddings generated by their linear projection, thereby explaining the semantic properties exhibited by embeddings of W2V and GloVe;

• to derive a relationship between embedding matrices proving that they must differ, justifying the heuristic use of their mean and enabling word embedding interactions – including the widely used cosine similarity – to be semantically interpreted; and

• to establish a novel hierarchical mathematical inter-relationship between relatedness, similarity, paraphrase and analogy (Fig 2).

2 Background

Word2Vec [25, 26] takes as input word pairs {(w_{i_r}, c_{j_r})}_{r=1}^{D} extracted from a large text corpus, where target word w_i ∈ E ranges over the corpus and context word c_j ∈ E ranges over a window of size l, symmetric about w_i (E is the dictionary of distinct words, n = |E|). For each observed word pair, k random pairs (negative samples) are generated from unigram distributions. For embedding dimension d, W2V's architecture comprises the product of two weight matrices W, C ∈ R^{d×n} subject to the logistic sigmoid function. Columns of W and C are the word embeddings: w_i ∈ R^d, the ith column of W, represents the ith word in E observed as the target word (w_i); and c_j ∈ R^d, the jth column of C, represents the jth word in E observed as a context word (c_j).

Levy and Goldberg [20] show that the loss function of W2V is given by:

\ell_{W2V} = -\sum_{i=1}^{n} \sum_{j=1}^{n} \#(w_i, c_j) \log \sigma(w_i^\top c_j) + \tfrac{k}{D} \#(w_i)\#(c_j) \log \sigma(-w_i^\top c_j),   (1)

which is minimised if w_i^\top c_j = P_{i,j} - \log k, where P_{i,j} = \log \frac{p(w_i, c_j)}{p(w_i)\,p(c_j)} is pointwise mutual information (PMI). In matrix form, this equates to factorising a shifted PMI matrix S ∈ R^{n×n}:

W^\top C = S.   (2)

GloVe [29] has the same architecture as W2V, but a different loss function, minimised when:

w_i^\top c_j = \log p(w_i, c_j) - b_i - b_j + \log Z,   (3)

for biases b_i, b_j and normalising constant Z. In principle, the biases provide flexibility, broadening the family of statistical relationships that GloVe embeddings can learn.

Analogies are word relationships, such as the canonical "man is to king as woman is to queen", that are of particular interest because their word embeddings appear to satisfy a linear relationship [27, 19]. Allen and Hospedales [2] recently showed that this phenomenon follows from relationships between PMI vectors, i.e. rows of the (unshifted) PMI matrix P ∈ R^{n×n}. In doing so, the authors define (i) the induced distribution of an observation ◦ as p(E|◦), the probability distribution over all context words observed given ◦; and (ii) that a word w* paraphrases a set of words W ⊂ E if the induced distributions p(E|w*) and p(E|W) are (elementwise) similar.

3 Related Work

While many works explore empirical properties of word embeddings (e.g. [19, 23, 5]), we focus here on those that seek to theoretically explain why W2V and GloVe word embeddings capture semantic properties useful in downstream tasks. The first of these is the previously mentioned derivation by Levy and Goldberg [20] of the loss function (1) and the PMI relationship that minimises it (2). Hashimoto et al. [14] and Arora et al. [4] propose generative language models to explain the structure found in word embeddings.

2
However, both contain strong a priori assumptions of an underlying geometry that we do not require (further, we find that several assumptions of [4] fail in practice (Appendix D)). Cotterell et al. [9] and Landgraf and Bellay [18] show that W2V performs exponential (binomial) PCA [7]; however, this follows from the (binomial) negative sampling and so describes the algorithm's mechanics, not why it works. Several works focus on the linearity of analogy embeddings [4, 12, 2, 10], but only [2] rigorously links semantics to embedding geometry (S.2).

To our knowledge, no previous work explains how the semantic properties of relatedness, similarity, paraphrase and analogy are all encoded in the relationships of PMI vectors and thereby manifest in the low dimensional word embeddings of W2V and GloVe.

4 PMI: linking geometry to semantics

The derivative of W2V's loss function (1) with respect to embedding w_i is given by:

\tfrac{1}{D} \nabla_{w_i} \ell_{W2V} = \sum_{j=1}^{n} \underbrace{\big( p(w_i, c_j) + k\,p(w_i)p(c_j) \big)}_{d^{(i)}_j} \underbrace{\big( \sigma(S_{i,j}) - \sigma(w_i^\top c_j) \big)}_{e^{(i)}_j} \, c_j = C D^{(i)} e^{(i)},   (4)

for diagonal matrix D^{(i)} = diag(d^{(i)}) ∈ R^{n×n}; d^{(i)}, e^{(i)} ∈ R^n containing the probability and error terms indicated; and all probabilities estimated empirically from the corpus. This confirms that (1) is minimised if W^⊤C = S (2), since then all e^{(i)}_j = 0, but that requires W and C to each have rank at least that of S. In the general case, including the typical case d ≪ n, (1) is minimised when probability weighted error vectors D^{(i)}e^{(i)} are orthogonal to the rows of C. As such, embeddings w_i can be seen as a non-linear (due to the sigmoid function σ(·)) projection of rows of S, induced by the loss function.
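As an editorial sketch of the factorisation in (2) (not the paper's implementation): the shifted PMI matrix computed from toy, invented co-occurrence counts can be factorised into W^⊤C by truncated SVD, the least-squares low-rank solution. All counts and dimensions below are illustrative assumptions.

```python
import numpy as np

# Toy co-occurrence counts #(w_i, c_j); invented numbers, for illustration only.
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(8, 8)).astype(float)
D = counts.sum()
p_wc = counts / D                       # joint p(w_i, c_j)
p_w = p_wc.sum(axis=1, keepdims=True)   # marginal p(w_i)
p_c = p_wc.sum(axis=0, keepdims=True)   # marginal p(c_j)
P = np.log(p_wc / (p_w * p_c))          # PMI matrix
k = 5
S = P - np.log(k)                       # shifted PMI matrix, as in Eq. (2)

def factorise(S, d):
    """Rank-d factorisation S ~ W^T C via truncated SVD (least-squares optimal)."""
    U, s, Vt = np.linalg.svd(S)
    W = (U[:, :d] * np.sqrt(s[:d])).T        # d x n
    C = np.sqrt(s[:d])[:, None] * Vt[:d]     # d x n
    return W, C

W, C = factorise(S, d=8)    # full rank: exact reconstruction
assert np.allclose(W.T @ C, S)
W2, C2 = factorise(S, d=3)  # low rank: best least-squares approximation
err = np.linalg.norm(W2.T @ C2 - S)
```

Note this linear factorisation differs from W2V's own (sigmoid-weighted) projection discussed above; it is the linear reference point against which that projection is compared.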
(Note that the distinction between W and C is arbitrary: embeddings c_j can also be viewed as projections onto the rows of W.)

Recognising that the log k shift term is an artefact of the W2V algorithm (see Appendix A), whose effect can be evaluated subsequently (as in [2]), we exclude it and analyse properties and interactions of word embeddings w_i that are projections of p_i, the corresponding rows of P (PMI vectors). We aim to identify the properties of PMI vectors that capture semantics and are then preserved in word embeddings under the low-rank projection induced by a suitably chosen loss function.

4.1 The domain of PMI vectors

PMI vector p_i ∈ R^n has a component PMI(w_i, c_j) for all context words c_j ∈ E, given by:

PMI(w_i, c_j) = \log \frac{p(c_j, w_i)}{p(w_i)\,p(c_j)} = \log \frac{p(c_j|w_i)}{p(c_j)}.   (5)

Any difference in the probability of observing c_j having observed w_i, relative to its marginal probability, can be thought of as due to w_i. Thus PMI(w_i, c_j) captures the influence of one word on another. Specifically, by reference to marginal probability p(c_j): PMI(w_i, c_j) > 0 implies c_j is more likely to occur in the presence of w_i; PMI(w_i, c_j) < 0 implies c_j is less likely to occur given w_i; and PMI(w_i, c_j) = 0 indicates that w_i and c_j occur independently, i.e. they are unrelated. PMI thus reflects the semantic property of relatedness, as previously noted [36, 6, 15]. A PMI vector thus reflects any change in the probability distribution over all words p(E), given (or due to) w_i:

p_i ≜ \big\{ \log \tfrac{p(c_j|w_i)}{p(c_j)} \big\}_{c_j ∈ E} ≜ \log \tfrac{p(E|w_i)}{p(E)}.   (6)

While PMI values are unconstrained in R, PMI vectors are constrained to an n−1 dimensional surface S ⊂ R^n, where each dimension corresponds to a word (Fig 1) (although technically a hypersurface, we refer to S simply as a "surface").
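The two forms of (5), and the reading of (6) as a log-ratio of the induced distribution to the marginal, can be checked numerically. A minimal sketch on an invented joint distribution (all values illustrative); it also sanity-checks that a PMI vector can never be element-wise all-positive, since Σ_j p(c_j) e^{p_i[j]} = Σ_j p(c_j|w_i) = 1.

```python
import numpy as np

# Invented toy joint distribution p(w_i, c_j), for illustration only.
rng = np.random.default_rng(1)
joint = rng.random((6, 6))
joint /= joint.sum()
p_w = joint.sum(axis=1)                 # marginal over target words
p_c = joint.sum(axis=0)                 # marginal over context words, p(E)
p_c_given_w = joint / p_w[:, None]      # induced distributions p(E|w_i)

i = 2
pmi_i = np.log(joint[i] / (p_w[i] * p_c))   # Eq. (5), component-wise
pi = np.log(p_c_given_w[i] / p_c)           # Eq. (6): log p(E|w_i)/p(E)
assert np.allclose(pmi_i, pi)

# A PMI vector cannot lie in the fully positive (or negative) orthant:
# p(E) exponentially re-weights it back onto the simplex.
assert np.isclose((p_c * np.exp(pi)).sum(), 1.0)
assert pi.min() < 0 < pi.max()
```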
The geometry of S can be constructed step-wise from (6):

• the vector of numerator terms q_i = p(E|w_i) lies on the simplex Q ⊂ R^n;

• dividing all q ∈ Q (element-wise) by p = p(E) ∈ Q gives probability ratio vectors q/p that lie on a "stretched simplex" R ⊂ R^n (containing 1 ∈ R^n) that has a vertex at 1/p(c_j) on axis j, ∀c_j ∈ E; and

• the natural logarithm transforms R to the surface S, with p_i = \log \tfrac{p(E|w_i)}{p(E)} ∈ S, ∀w_i ∈ E.

3

Note, p = p(E) uniquely determines S. Considering each point s ∈ S as an element-wise log probability ratio vector s = log(q/p) ∈ S (q ∈ Q) shows S to have the following properties (proofs in Appendix B):

P1 S, and any subsurface of S, is non-linear. PMI vectors are thus not constrained to a linear subspace, identifiable by low-rank factorisation of the PMI matrix, as may seem suggested by (2).

P2 S contains the origin, which can be considered the PMI vector of the null word ∅, i.e. p_∅ = log(p(E|∅)/p(E)) = log(p(E)/p(E)) = 0 ∈ R^n.

P3 Probability vector q ∈ Q is normal to the tangent plane of S at s = log(q/p) ∈ S.

P4 S does not intersect with the fully positive or fully negative orthants (excluding 0). Thus PMI vectors are not isotropically (i.e. uniformly) distributed in space (as assumed in [4]).

P5 The sum of 2 points s + s′ lies in S only for certain s, s′ ∈ S. That is, for any s ∈ S (s ≠ 0), there exists a (strict) subset S_s ⊂ S, such that s + s′ ∈ S iff s′ ∈ S_s. Trivially 0 ∈ S_s, ∀s ∈ S.

Note that while all PMI vectors lie in S, certainly not all (infinite) points in S correspond to the (finite) PMI vectors of words. Interestingly, P2 and P5 allude to properties of a vector space, often the desired structure for a semantic space [14]. Whilst the domain of PMI vectors is clearly not a
Whilst the domain of PMI vectors is clearly not a\nvector space, addition and subtraction of PMI vectors do have semantic meaning, as we now show.\n\nFigure 1: The PMI surface S, showing sample PMI vectors of words (red dots)\n\n4.2 Subtraction of PMI vectors \ufb01nds similarity\nTaking the de\ufb01nition from [2] (see S.2), we consider a word wi that paraphrases a word set W \u2208E,\nwhere W ={wj} contains a single word. Since paraphrasing requires distributions of local context\nwords (induced distributions) to be similar, this intuitively \ufb01nds wi that are interchangeable with, or\nsimilar to, wj: in the limit wj itself or, less trivially, a synonym. Thus, word similarity corresponds\nto a low KL divergence between p(E|wi) and p(E|wj). Interestingly, the difference between the\nassociated PMI vectors:\n\n\u03c1i,j = pi \u2212 pj = log p(E|wi)\np(E|wj ) ,\n\n(7)\n\nis a vector of un-weighted KL divergence components. Thus, if dimensions were suitably weighted,\nthe sum of difference components (comparable to Manhattan distance but directed) would equate to\na KL divergence between induced distributions. That is, if qi = p(E|wi), then a KL divergence is\ngiven by qi(cid:62)\u03c1i,j. Furthermore, qi is the normal to the surface S at pi (with unit l1 norm), by P3.\nThe projection onto the normal (to S) at pj, i.e. \u2212qj(cid:62)\u03c1i,j, gives the other KL divergence. (Intuition\nfor the semantic interpretation of each KL divergence is discussed in Appendix A of [2].)\n\n4\n\n\f4.3 Addition of PMI vectors \ufb01nds paraphrases\nFrom geometric arguments (P5), we know that only certain pairs of points in S sum to another point\nin the surface. 
We can also consider the probabilistic conditions for PMI vectors to sum to another:

x = p_i + p_j = \log \tfrac{p(E|w_i)}{p(E)} + \log \tfrac{p(E|w_j)}{p(E)} = \underbrace{\log \tfrac{p(E|w_i, w_j)}{p(E)}}_{p_{i,j}} - \underbrace{\log \tfrac{p(w_i, w_j|E)}{p(w_i|E)\,p(w_j|E)}}_{\sigma_{ij}} + \underbrace{\log \tfrac{p(w_i, w_j)}{p(w_i)\,p(w_j)}}_{\tau_{ij}} \mathbf{1} = p_{i,j} - \sigma_{ij} + \tau_{ij}\mathbf{1},   (8)

where (overloading notation) p_{i,j} ∈ S is a vector of PMI terms involving p(E|w_i, w_j), the induced distribution of w_i and w_j observed together;2 and σ_{ij} ∈ R^n, τ_{ij} ∈ R are the conditional and marginal dependence terms indicated (as seen in [2]). From (8), if w_i, w_j ∈ E occur both independently and conditionally independently given each and every word in E, then x = p_{i,j} ∈ S, and (from P5) p_j ∈ S_{p_i} and p_i ∈ S_{p_j}. If not, error vector ε_{ij} = σ_{ij} − τ_{ij}1 separates x and p_{i,j}, and x ∉ S, unless by meaningless coincidence. (Note, whilst probabilistic aspects here mirror those of [2], we combine these with a geometric understanding.) Although certainly p_{i,j} ∈ S, the extent to which p_{i,j} ≈ p_k for some w_k ∈ E depends on paraphrase error ρ_{k,{i,j}} = p_k − p_{i,j}, which compares the induced distributions of w_k and {w_i, w_j}. Thus the PMI vector difference (p_i + p_j) − p_k for any words w_i, w_j, w_k ∈ E comprises: ε_{ij}, a component between p_i + p_j and the surface S (reflecting word dependence); and ρ_{k,{i,j}}, a component along the surface (reflecting paraphrase error). The latter captures a semantic relationship with w_k, which the former may obscure, irrespective of w_k.
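Both the KL reading of PMI vector differences (7) and the decomposition of PMI vector sums (8) are exact identities for any joint distribution, so they can be verified numerically. A sketch on an invented three-way toy distribution p(a, b, c) over two target words and a context word (all values illustrative, not corpus statistics):

```python
import numpy as np

# Invented toy joint distribution p(a, b, c), for illustration only.
rng = np.random.default_rng(2)
J = rng.random((3, 4, 8))
J /= J.sum()
p_a = J.sum(axis=(1, 2)); p_b = J.sum(axis=(0, 2)); p_c = J.sum(axis=(0, 1))
p_ac = J.sum(axis=1); p_bc = J.sum(axis=0)     # pairwise marginals
a, b = 1, 2                                    # a fixed word pair

p_i = np.log(p_ac[a] / p_a[a] / p_c)           # PMI vector of word a, Eq. (6)
p_j = np.log(p_bc[b] / p_b[b] / p_c)           # PMI vector of word b

# Eq. (7): q_i^T (p_i - p_j) equals KL( p(E|a) || p(E|b) ).
q_i = p_ac[a] / p_a[a]                         # induced distribution p(E|a)
q_j = p_bc[b] / p_b[b]                         # induced distribution p(E|b)
kl = (q_i * np.log(q_i / q_j)).sum()
assert np.isclose(q_i @ (p_i - p_j), kl)

# Eq. (8): p_i + p_j = p_ij - sigma_ij + tau_ij * 1.
p_ab = J[a, b].sum()                           # p(a, b)
p_ij = np.log((J[a, b] / p_ab) / p_c)          # PMI vector of the pair {a, b}
sigma = np.log((J[a, b] / p_c) / ((p_ac[a] / p_c) * (p_bc[b] / p_c)))
tau = np.log(p_ab / (p_a[a] * p_b[b]))
assert np.allclose(p_i + p_j, p_ij - sigma + tau)
```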
(Further geometric and probabilistic implications are considered in Appendix C.)

4.4 Linear combinations of PMI vectors find analogies

PMI vectors of analogy relationships "w_a is to w_{a*} as w_b is to w_{b*}" have been proven [2] to satisfy:

p_{b*} ≈ p_{a*} − p_a + p_b.   (9)

The proof builds on the concept of paraphrasing (with error terms similar to those in Section 4.3), comparing PMI vectors of analogous word pairs to show that p_a + p_{b*} ≈ p_{a*} + p_b, and thus (9).

5 Encoding PMI: from PMI vectors to word embeddings

Having seen how high dimensional PMI vectors encode semantic properties desirable in word embeddings, we consider how those properties can be transferred to low dimensional representations. A key observation is that all PMI vector interactions, for similarity (7), paraphrases (8) and analogies (9), are additive, and are therefore preserved under linear projection. By comparison, the loss function of W2V (1) projects PMI vectors non-linearly, and that of GloVe (3) does project linearly, but not (necessarily) PMI vectors. Linear projection can be achieved by the least squares loss function:3

\ell_{LSQ} = \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \big( w_i^\top c_j - \mathrm{PMI}(w_i, c_j) \big)^2.   (10)

ℓ_{LSQ} is minimised when ∇_{W^⊤}ℓ_{LSQ} = (W^⊤C − P)C^⊤ = 0, or W^⊤ = PC^†, for C^† = C^⊤(CC^⊤)^{−1} the Moore–Penrose pseudoinverse of C. This explicit linear projection allows interactions performed between word embeddings, e.g. dot product, to be mapped to interactions between PMI vectors, and thereby semantically interpreted. However, we do better still by considering how W and C relate.

5.1 The relationship between W and C

Whilst W2V and GloVe train two embedding matrices, typically only W is used and C discarded. Thus, although relationships are learned between W and C, they are tested between W and W.
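The closed-form minimiser of (10), W^⊤ = PC^†, and the claim that additive PMI vector interactions survive linear projection can be sketched directly. P and C below are random stand-ins, purely for illustration; the point is the algebra, not the data.

```python
import numpy as np

# Random stand-ins for a symmetric "PMI" matrix P and context embeddings C.
rng = np.random.default_rng(3)
n, d = 12, 4
P = rng.standard_normal((n, n)); P = (P + P.T) / 2
C = rng.standard_normal((d, n))

C_pinv = C.T @ np.linalg.inv(C @ C.T)   # Moore-Penrose pseudoinverse C^+
W = (P @ C_pinv).T                      # minimiser of Eq. (10): W^T = P C^+, d x n

# The gradient of Eq. (10) vanishes at this W.
assert np.allclose((W.T @ C - P) @ C.T, 0)

# Additivity is preserved by linear projection: the projection of
# p_a - p_b + p_c is exactly w_a - w_b + w_c, as for analogies in Eq. (9).
a, b, c = 0, 1, 2
lhs = (P[a] - P[b] + P[c]) @ C_pinv
rhs = W[:, a] - W[:, b] + W[:, c]
assert np.allclose(lhs, rhs)
```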
If\n\n2Whilst wi, wj are both target words, by symmetry we can interchange roles of context and target words to\n\ncompute p(E|w, w(cid:48)) based on the distribution of target words for which wi and wj are both context words.\n\n3We note that the W2V and GloVe loss functions include probability weightings (as considered in [35]),\n\nwhich we omit for simplicity.\n\n5\n\n\fW and C are equal, the distinction falls away, but that is not found to be the case in practice. Here,\nwe consider why typically W(cid:54)= C and, as such, what relationship between W and C does exist.\nIf the symmetric PMI matrix P is positive semi-de\ufb01nite (PSD), its closest low-rank approximation\n(minimising (cid:96)LSQ) is given by the eigendecomposition P = \u03a0\u039b\u03a0(cid:62), \u03a0, \u039b \u2208 Rn\u00d7n, \u03a0(cid:62)\u03a0 = I;\nand (cid:96)LSQ is minimised by W = C = S1/2U(cid:62), where S\u2208 Rd\u00d7d, U\u2208 Rd\u00d7n are \u039b, \u03a0, respectively,\ntruncated to their d largest eigenvalue components. Any matrix pair W\u2217 = M(cid:62)W, C\u2217 = M\u22121W,\nalso minimises (cid:96)LSQ (for any invertible M\u2208 Rd\u00d7d), but of these W, C are unique (up to rotation\nand permutation) in satisfying W = C, a preferred solution for learning word embeddings since the\nnumber of free parameters is halved and consideration of whether to use W, C or both falls away.\nHowever, P is not typically PSD in practice and this preferred (real) factorisation does not exist\nsince P has negative eigenvalues, S1/2 is complex and any W, C minimising (cid:96)LSQ with W = C\nmust also be complex. (Complex word embeddings arise elsewhere, e.g. [16, 22], but since the word\nembeddings we examine are real we keep to the real domain.) 
By implication, any W, C\u2208 Rd\u00d7n\nthat minimise (cid:96)LSQ cannot be equal, contradicting the assumption W = C sometimes made [14, 4].\nReturning to the eigendecomposition, if S contains the d largest absolute eigenvalues and U the\nii =\u00b11) such that S =|S|I(cid:48). Thus,\ncorresponding eigenvectors of P, we de\ufb01ne I(cid:48) = sign(S) (i.e. I(cid:48)\nW =|S|1/2U(cid:62)and C = I(cid:48)W can be seen to minimise (cid:96)LSQ (i.e. W(cid:62)C\u2248 P) with W(cid:54)= C but where\ncorresponding rows of W, C (denoted by superscript) satisfy Wi =\u00b1Ci (recall word embeddings\nwi, ci are columns of W, C). Such W, C can be seen as quasi-complex conjugate. Again, W, C\ncan be used to de\ufb01ne a family of matrix pairs that minimise (cid:96)LSQ, of which W, C themselves are a\nmost parameter ef\ufb01cient choice, with (n+1)d free parameters compared to 2nd.\n\n5.2\n\nInterpreting embedding interactions\n\ni wj), rather than W and C (e.g. w(cid:62)\n\nVarious word embedding interactions are used to predict semantic relationships, e.g. cosine similarity\n[26] and 3CosMult [19], although typically with little theoretical justi\ufb01cation. With a semantic un-\nderstanding of PMI vector interactions (S.4) and the derived relationship C = I(cid:48)W, we now interpret\ncommonly used word embedding interactions and evaluate the effect of combining embeddings of W\nonly (e.g. w(cid:62)\ni cj). For use below, we note that W(cid:62)C = USU(cid:62),\nC\u2020 = U|S|\u22121/2I(cid:48) and de\ufb01ne: reconstruction error matrix E = P\u2212 W(cid:62)C, i.e. 
E = \bar{U}\bar{S}\bar{U}^\top, where \bar{U}, \bar{S} contain the n−d smallest absolute eigenvalue components of Π, Λ (those omitted from U, S); F = U \big( \tfrac{S − |S|}{2} \big) U^\top, comprising the negative eigenvalue components of P; and mean embeddings a_i as the columns of A = \tfrac{W+C}{2} = I''W ∈ R^{d×n}, where I'' = \tfrac{I + I'}{2} (i.e. I''_{ii} ∈ {0, 1}).

Dot Product: We compare the following interactions, associated with predicting relatedness:

W, C :  w_i^⊤c_j = U_i S U_j^⊤ = P_{i,j} − E_{i,j}

W, W :  w_i^⊤w_j = U_i |S| U_j^⊤ = U_i (S − (S − |S|)) U_j^⊤ = P_{i,j} − E_{i,j} − 2F_{i,j}

A, A :  a_i^⊤a_j = U_i |S| I'' U_j^⊤ = U_i (S − \tfrac{S − |S|}{2}) U_j^⊤ = P_{i,j} − E_{i,j} − F_{i,j}

This shows that w_i^⊤w_j overestimates the PMI approximation given by w_i^⊤c_j by twice any component relating to negative eigenvalues – an overestimation that is halved using mean embeddings, a_i^⊤a_j.

Difference sum:  (w_i − w_j)^⊤1 = (p_i − p_j)C^†1 = \sum_{k=1}^{n} x_k \log \tfrac{p(c_k|w_i)}{p(c_k|w_j)},  for x = C^†1 = U|S|^{−1/2}I'1.

Thus, summing over the difference of embedding components compares to a KL divergence between induced distributions (and so similarity) more so than for PMI vectors (S.4.2), as dimensions are weighted by x_k. However, unlike a KL divergence, x is not a probability distribution and does not vary with w_i or w_j.
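The construction W = |S|^{1/2}U^⊤, C = I'W of S.5.1 and the dot-product identities above can be verified numerically. A sketch in which P is a symmetric but deliberately non-PSD stand-in with an invented spectrum (so negative eigenvalues are guaranteed), not real corpus statistics:

```python
import numpy as np

# Symmetric, non-PSD stand-in for P with a hand-picked mixed spectrum.
rng = np.random.default_rng(4)
n, d = 10, 6
eigvals = np.array([5.0, -4.0, 3.0, -2.5, 2.0, -1.5, 1.0, -0.5, 0.3, -0.2])
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal basis
P = Q @ np.diag(eigvals) @ Q.T

lam, Pi = np.linalg.eigh(P)                        # P = Pi diag(lam) Pi^T
order = np.argsort(-np.abs(lam))                   # sort by absolute eigenvalue
lam, Pi = lam[order], Pi[:, order]
S, U = np.diag(lam[:d]), Pi[:, :d]                 # retained top-d components
S_bar, U_bar = np.diag(lam[d:]), Pi[:, d:]         # omitted components

I_prime = np.sign(S)                               # I': diagonal of +-1
W = np.sqrt(np.abs(S)) @ U.T                       # |S|^(1/2) U^T, d x n
C = I_prime @ W                                    # quasi-conjugate partner
A = (W + C) / 2                                    # mean embeddings

E = U_bar @ S_bar @ U_bar.T                        # reconstruction error matrix
F = U @ ((S - np.abs(S)) / 2) @ U.T                # negative-eigenvalue component

assert not np.allclose(W, C)                       # W and C necessarily differ
assert np.allclose(W.T @ C, P - E)                 # w_i^T c_j = P_ij - E_ij
assert np.allclose(W.T @ W, P - E - 2 * F)         # W,W overestimates by 2F
assert np.allclose(A.T @ A, P - E - F)             # mean embeddings halve it
```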
We speculate that, between x and the omitted probability weighting of the loss function, the dimensions of low probability words are down-weighted, mitigating the effect of "outliers" to which PMI is known to be sensitive [37], and loosely reflecting a KL divergence.

Euclidean distance: ‖w_i − w_j‖₂ = ‖(\log \tfrac{p(E|w_i)}{p(E|w_j)}) C^†‖₂ shows no obvious meaning.

Cosine similarity: Surprisingly, \tfrac{w_i^⊤w_j}{‖w_i‖‖w_j‖}, as often used effectively to predict word relatedness and/or similarity [33, 5], has no immediate semantic interpretation. However, recent work [3] proposes a more holistic description of relatedness than PMI(w_i, w_j) > 0 (S.4.1): that related words

6

Table 1: Accuracy in semantic tasks using different loss functions on the text8 corpus [24].

Model | Loss | Relationship | Relatedness [1] | Similarity [1] | Analogy [25]
W2V   | W2V  | W⊤C ≈ P      | .628 | .703 | .283
W=C   | LSQ  | W⊤W ≈ P      | .721 | .786 | .411
LSQ   | LSQ  | W⊤C ≈ P      | .727 | .791 | .425

(w_i, w_j) have multiple positive PMI vector components in common, because all words associated with any common semantic "theme" are also more likely to co-occur. The strength of relatedness (similarity being the extreme case) is given by the number of common word associations, as reflected in the dimensionality of a common aggregate PMI vector component, which projects to a common embedding component.
The magnitude of such a common component is not directly meaningful, but as relatedness increases and w_i, w_j share more common word associations, the angle between their PMI vectors, and so too their embeddings, narrows, justifying the widespread use of cosine similarity. Other statistical word embedding relationships assumed in [4] are considered in Appendix D.

6 Empirical evidence

Word embeddings (especially those of W2V) have been well studied empirically, with many experimental findings. Here we draw on previous results and run test experiments to provide empirical support for our main theoretical results:

1. Analogies form as linear relationships between linear projections of PMI vectors (S.4.4). Whilst previously explained in [2], we emphasise that their rationale for this well known phenomenon fits precisely within our broader explanation of W2V and GloVe embeddings. Further, re-ordering paraphrase questions is observed to materially affect prediction accuracy [23], which can be justified from the explanation provided in [2] (see Appendix E).

2. The linear projection of additive PMI vectors captures semantic properties more accurately than the non-linear projection of W2V (S.5). Several works consider alternatives to the W2V loss function [20, 21], but none isolates the effect of an equivalent linear loss function, which we therefore implement (detail below). Comparing models W2V and LSQ (Table 1) shows a material improvement across all semantic tasks from linear projection.

3. Word embedding matrices W and C are dissimilar (S.5.1). W, C are typically found to differ, e.g. [26, 29, 28]. To demonstrate the difference, we include an experiment tying W = C. Comparing models W=C and LSQ (Table 1) shows a small but consistent improvement in the latter, despite its lower data-to-parameter ratio.

4. Dot products recover PMI with decreasing accuracy: w_i^⊤c_j ≥ a_i^⊤a_j ≥ w_i^⊤w_j (S.5.2). The use of average embeddings a_i^⊤a_j over w_i^⊤w_j is a well-known heuristic [29, 21]. More recently, [5] show that relatedness correlates noticeably better to w_i^⊤c_j than either of the "symmetric" choices (a_i^⊤a_j or w_i^⊤w_j).

5. Relatedness is reflected by interactions between W and C embeddings, and similarity is reflected by interactions between W and W (S.5.2). Asr et al. [5] compare human judgements of similarity and relatedness to cosine similarity between combinations of W, C and A. The authors find "very consistent" support for their conclusion that "WC ... best measures ... relatedness" and "similarity [is] best predicted by ... WW". An example is given for house: w_i^⊤w_j gives mansion, farmhouse and cottage, i.e. similar or synonymous words; w_i^⊤c_j gives barn, residence, estate, kitchen, i.e. related words.

Models: As we perform a standard comparison of loss functions, similar to [20, 21], we leave experimental details to Appendix F. In summary, we learn 500 dimensional embeddings from word co-occurrences extracted from a standard corpus ("text8" [24]). We implement loss function (1) explicitly as model W2V. Models W=C and LSQ use least squares loss (10), with the constraint W = C imposed in the former (see point 3 above).
Evaluation on popular data sets [1, 25] uses the Gensim toolkit [32].

7

Figure 2: Interconnection between semantic relationships: relatedness is a base pairwise comparison (measured by PMI); global relatedness considers relatedness to all words (PMI vector); similarity, paraphrase and analogy depend on global relatedness between words (w ∈ E) and word sets (W ⊆ E).

7 Discussion

Having established mathematical formulations for relatedness, similarity, paraphrase and analogy that explain how they are captured in word embeddings derived from PMI vectors (S.4), it can be seen that they also imply an interesting, hierarchical interplay between the semantic relationships themselves (Fig 2). At the core is relatedness, which correlates with PMI, both empirically [36, 6, 15] and intuitively (S.4.2). As a pairwise comparison of words, relatedness acts somewhat akin to a kernel (an actual kernel requires P to be PSD), allowing words to be considered numerically in terms of their relatedness to all words, as captured in a PMI vector, and compared according to how they each relate to all other words, or globally relate. Given this meta-comparison, we see that one word is similar to another if they are globally related (1-1); a paraphrase requires one word to globally relate to the joint occurrence of a set of words (1-n); and analogies arise when joint occurrences of word pairs are globally related (n-n).
Continuing the \u201ckernel\u201d analogy, the PMI matrix mirrors a kernel\nmatrix, and word embeddings the representations derived from kernelised PCA [34].\n\n8 Conclusion\n\nIn this work, we take two previous results \u2013 the well known link between W2V embeddings and\nPMI [20], and a recent connection between PMI and analogies [2] \u2013 to show how the semantic\nproperties of relatedness, similarity, paraphrase and analogy are captured in word embeddings that\nare linear projections of PMI vectors. The loss functions of W2V (2) and GloVe (3) approximate\nsuch a projection: non-linearly in the case of W2V and linearly projecting a variant of PMI in GloVe;\nexplaining why their embeddings exhibit semantic properties useful in downstream tasks.\nWe derive a relationship between embedding matrices W and C, enabling word embedding interac-\ntions (e.g. dot product) to be semantically interpreted and justifying the familiar cosine similarity as a\nmeasure of relatedness and similarity. Our theoretical results explain several empirical observations,\ne.g. why W and C are not found to be equal despite representing the same words, their symmetric\ntreatment in the loss function and a symmetric PMI matrix; why mean embeddings (A) are often\nfound to outperform those from either W or C; and why relatedness corresponds to interactions\nbetween W and C, and similarity to interactions between W and W.\nWe discover an interesting hierarchical structure between semantic relationships: with relatedness\nas a basic pairwise comparison, similarity, paraphrase and analogy are de\ufb01ned according to how\ntarget words each relate to all words. Error terms arise in the latter higher order relationships due to\nstatistical dependence between words. 
Such errors can be interpreted geometrically with respect to the hypersurface S on which all PMI vectors lie, and can, in principle, be evaluated from higher order statistics (e.g. trigram co-occurrences).

Several further details of W2V and GloVe remain to be explained, which we hope to address in future work, e.g. the weighting of PMI components over the context window [31], the exponent 3/4 often applied to unigram distributions [26], the probability weighting in the loss function (S.5), and an interpretation of the weight vector x in embedding differences (S.5.2).

Acknowledgements

We thank Ivan Titov, Jonathan Mallinson and the anonymous reviewers for helpful comments. Carl Allen and Ivana Balažević were supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1) and the University of Edinburgh.

References

[1] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. A study on similarity and relatedness using distributional and WordNet-based approaches. In North American Chapter of the Association for Computational Linguistics, 2009.

[2] Carl Allen and Timothy Hospedales. Analogies explained: Towards understanding word embeddings. In International Conference on Machine Learning, 2019.

[3] Carl Allen, Ivana Balazevic, and Timothy M Hospedales. On understanding knowledge graph representation. arXiv preprint arXiv:1909.11611, 2019.

[4] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 2016.

[5] Fatemeh Torabi Asr, Robert Zinkov, and Michael Jones. Querying word embeddings for similarity and relatedness. In North American Chapter of the Association for Computational Linguistics, 2018.

[6] John A Bullinaria and Joseph P Levy.
Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526, 2007.

[7] Michael Collins, Sanjoy Dasgupta, and Robert E Schapire. A generalization of principal components analysis to the exponential family. In Advances in Neural Information Processing Systems, 2002.

[8] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Empirical Methods in Natural Language Processing, 2017.

[9] Ryan Cotterell, Adam Poliak, Benjamin Van Durme, and Jason Eisner. Explaining and generalizing skip-gram through exponential family principal component analysis. In European Chapter of the Association for Computational Linguistics, 2017.

[10] Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. Towards understanding linear word analogies. In Association for Computational Linguistics, 2019.

[11] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. In International Conference on World Wide Web, 2001.

[12] Alex Gittens, Dimitris Achlioptas, and Michael W Mahoney. Skip-Gram - Zipf + Uniform = Vector Additivity. In Association for Computational Linguistics, 2017.

[13] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In International Conference on Knowledge Discovery and Data Mining, 2016.

[14] Tatsunori B Hashimoto, David Alvarez-Melis, and Tommi S Jaakkola. Word embeddings as metric recovery in semantic spaces. Transactions of the Association for Computational Linguistics, 2016.

[15] Aminul Islam, Evangelos Milios, and Vlado Keselj. Comparing word relatedness measures based on Google n-grams.
In International Conference on Computational Linguistics, 2012.

[16] Amit Kumar Jaiswal, Guilherme Holdack, Ingo Frommholz, and Haiming Liu. Quantum-like generalization of complex word embedding: a lightweight approach for textual classification. In Lernen, Wissen, Daten, Analysen, 2018.

[17] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, 2015.

[18] Andrew J Landgraf and Jeremy Bellay. word2vec Skip-Gram with Negative Sampling is a Weighted Logistic PCA. arXiv preprint arXiv:1705.09755, 2017.

[19] Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicit word representations. In Computational Natural Language Learning, 2014.

[20] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, 2014.

[21] Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 2015.

[22] Qiuchi Li, Sagar Uprety, Benyou Wang, and Dawei Song. Quantum-inspired complex word embedding. In Workshop on Representation Learning for NLP, 2018.

[23] Tal Linzen. Issues in evaluating semantic spaces using word analogies. In 1st Workshop on Evaluating Vector-Space Representations for NLP, 2016.

[24] Matt Mahoney. text8 Wikipedia dump. http://mattmahoney.net/dc/textdata.html, 2011. [Online; accessed May 2019].

[25] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, Workshop Track Proceedings, 2013.

[26] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean.
Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013.

[27] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In North American Chapter of the Association for Computational Linguistics, 2013.

[28] David Mimno and Laure Thompson. The strange geometry of skip-gram with negative sampling. In Empirical Methods in Natural Language Processing, 2017.

[29] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing, 2014.

[30] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In International Conference on Knowledge Discovery and Data Mining, 2014.

[31] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In International Conference on Web Search and Data Mining, 2018.

[32] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Workshop on New Challenges for NLP Frameworks, 2010.

[33] Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. Evaluation methods for unsupervised word embeddings. In Empirical Methods in Natural Language Processing, 2015.

[34] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, 1997.

[35] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In International Conference on Machine Learning, 2003.

[36] Peter D Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In European Conference on Machine Learning. Springer, 2001.

[37] Peter D Turney and Patrick Pantel.
From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188, 2010.

[38] John Wieting and Douwe Kiela. No training required: Exploring random encoders for sentence classification. In International Conference on Learning Representations, 2019.