{"title": "Invariance and identifiability issues for word embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 15140, "page_last": 15149, "abstract": "Word embeddings are commonly obtained as optimisers of a criterion function f of a text corpus, but assessed on word-task performance using a different evaluation function g of the test data. We contend that a possible source of disparity in performance on tasks is the incompatibility between classes of transformations that leave f and g invariant. In particular, word embeddings defined by f are not unique; they are defined only up to a class of transformations to which f is invariant, and this class is larger than the class to which g is invariant. One implication of this is that the apparent superiority of one word embedding over another, as measured by word task performance, may largely be a consequence of the arbitrary elements selected from the respective solution sets. We provide a formal treatment of the above identifiability issue, present some numerical examples, and discuss possible resolutions.", "full_text": "Invariance and identi\ufb01ability issues for word\n\nembeddings\n\nRachel Carrington\n\nKarthik Bharath\n\nSimon Preston\n\nSchool of Mathematical Sciences, University of Nottingham\n\n{rachel.carrington, karthik.bharath, simon.preston}@nottingham.ac.uk\n\nAbstract\n\nWord embeddings are commonly obtained as optimizers of a criterion function f of\na text corpus, but assessed on word-task performance using a different evaluation\nfunction g of the test data. We contend that a possible source of disparity in\nperformance on tasks is the incompatibility between classes of transformations that\nleave f and g invariant. In particular, word embeddings de\ufb01ned by f are not unique;\nthey are de\ufb01ned only up to a class of transformations to which f is invariant, and\nthis class is larger than the class to which g is invariant. 
One implication of this is\nthat the apparent superiority of one word embedding over another, as measured by\nword task performance, may largely be a consequence of the arbitrary elements\nselected from the respective solution sets. We provide a formal treatment of the\nabove identifiability issue, present some numerical examples, and discuss possible\nresolutions.\n\n1 Introduction\n\nWord embeddings map a text corpus, say X, to a collection of vectors V = (v1, ..., vp) where each\nvj \u2208 R^d, for a prescribed embedding dimension d, represents one of p words in the corpus. Different\nword embedding models can be cast as the solution of an optimisation\n\narg min_{U,V} F(X, U, V) = arg min_{U,V} f(X, UV),    (1)\n\nfor particular corpus representation X and objective function f, where U = (u1, ..., un)^T are\nvectors in R^d representing contexts, typically not of main interest. The setup subsumes some popular\nembedding techniques such as Latent Semantic Analysis (LSA) [Deerwester et al., 1990], word2vec\n[Mikolov et al., 2013b,a], and GloVe [Pennington et al., 2014], wherein the matrices U and V appear\nin a suitably chosen f only through their product UV.\nOnce a word embedding V is constructed by solving (1), the embedding is evaluated on its performance\nin tasks, including identifying word similarity (given word a, identify words with similar\nmeanings), and word analogy (for the statement \"a is to b what c is to x\", given a, b and c, identify\nx). Similarities or analogies can be computed from V, then performance evaluated against a test data\nset D containing human-assigned judgements as\n\ng(D, V),    (2)\n\nfor some function g.
Constructing word embeddings is \"unsupervised\" with respect to the evaluation\ntask in the sense that V is determined from (1) independently of the choice of g and the data D in (2),\nalthough f typically entails free parameters that may, consciously or not, be chosen to optimize (2)\n[Levy et al., 2015].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nDifferent word embedding models, identified as different f in (1), are often compared based on\nperformance in word tasks in the sense of g in (2). But there are several reasons why comparing\nperformance in this way is difficult. First: performance may be affected less by the structure of\nmodel f, and more by the number of free parameters it entails and how well they have been tuned\n[Levy et al., 2015]. Second: for many embeddings, solving (1) entails a Monte Carlo optimisation,\nso different runs with identical f will result in different realisations of V and hence different values\nof g(D, V). Third, more subtle and often conflated with the first and second: for most embedding\nmodels f, (1) does not uniquely identify V (V is said to be non-identifiable), and different\nsolutions V, each equally optimal with respect to (1), correspond to different values of g(D, V).\nThis raises the disconcerting question: can apparent differences in performances in word tasks\nas evaluated with g be substantially attributed to the arbitrary selection of a solution V from the\nset of solutions of f? In this paper we explore the non-identifiability of V, particularly with\nrespect to the class of non-singular transformations C for which f(X, UV) = f(X, UC^{\u22121}CV)\nbut g(D, V) \u2260 g(D, CV), and the consequences for constructing and evaluating word embeddings.\nSpecifically, our contributions are as follows.\n\n1. For g defined using inner products of embedded word vectors (e.g.
cosine similarity) in d dimensions,\nwe characterise the subset Fd contained in the set of non-singular transformations\nto which g is not invariant.\n\n2. We study a widely used strategy for constructing word embeddings that involves multiplying\na \"base\" embedding by a powered matrix of singular values, and show that this amounts to\nexploring a one-dimensional subset of the optimal solutions.\n\n3. We discuss resolutions to the non-identifiability, including (i) constraining the set of solutions\nof f to ensure compatibility with invariances of g, and (ii) optimizing over the solutions of\nf with respect to g in a supervised learning sense.\n\n2 Non-identifiability of word embedding V\n\nThe issue of non-identifiability is most transparent in word embedding models explicitly involving\nmatrix factorisation. LSA assumes X is an n \u00d7 p context-word matrix and seeks V as\n\narg min_{U,V} f(X, UV) := arg min_{U,V} \u2016X \u2212 UV\u2016,    (3)\n\nwhere \u2016\u00b7\u2016 is the Frobenius norm, and U is an n \u00d7 d matrix of contexts to be estimated. For any\nparticular solution {U\u2217, V\u2217} of (3), {U\u2217C^{\u22121}, CV\u2217} is also a solution, where C is any d \u00d7 d invertible\nmatrix. The solution of (3) for V is hence a set\n\n{CV\u2217 : C \u2208 GL(d)}    (4)\n\nwhere GL(d) denotes the general linear group of d \u00d7 d invertible matrices.\nOne way to find an element of the solution set (4) is by using the singular value decomposition (SVD)\nof X. The SVD decomposes X as X = A\u03a3B^T where A and B have orthogonal columns and \u03a3 is a\ndiagonal matrix with the singular values in decreasing order on the diagonal.
Then a rank d matrix\nthat minimizes \u2016X \u2212 X_d\u2016 is X_d = A_d \u03a3_d B_d^T, where A_d and B_d are matrices containing the first d\ncolumns of A and B respectively, and \u03a3_d is the d \u00d7 d upper left part of \u03a3 [Eckart and Young, 1936].\nHence a solution to (3) is obtained by taking\n\nU\u2217 = A_d,   V\u2217 = \u03a3_d B_d^T,    (5)\n\ncalled by Bullinaria and Levy [2012] the \"simple SVD\" solution. Bullinaria and Levy [2012] and\nTurney [2013] have investigated the word embedding V\u2217 = \u03a3_d^{1\u2212\u03b1} B_d^T, which generalises V\u2217 in (5)\nby introducing a tunable parameter \u03b1 \u2208 R, motivated by empirical evidence that \u03b1 \u2260 0 often leads to\nbetter performance on word tasks. Such an embedding is perfectly justified, however, as an alternative\nsolution\n\nU\u2217 = A_d \u03a3_d^{\u03b1},   V\u2217 = \u03a3_d^{1\u2212\u03b1} B_d^T,\n\nto (3), for any \u03b1 \u2208 R. We can hence interpret the tuning parameter \u03b1 as indexing different elements\nof the solution set (4), each optimal with respect to the embedding model f, with \u03b1 free to be chosen\nso that the word-task performance g is maximized.\n\nIndeed, by choosing the particular solution V\u2217 in (5), and setting C = \u03a3_d^{\u2212\u03b1}, we see that tuning \u03b1\namounts to optimising over the one-parameter subgroup \u03b3(\u03b1) := \u03a3_d^{\u2212\u03b1} \u2208 GL(d), a one-dimensional\nsubset of the d^2-dimensional group GL(d) to which V is non-identifiable. The motivation for\nrestricting the optimisation to this particular subset is unclear, however.
In fact, it is not clear that\nchoice of the matrix of singular values \u03a3d in the subgroup \u03b3 necessarily leads to better performance\nwith g; Figure 2 in Section 4.2, demonstrates superior performance for alternate (but arbitrary)\ndiagonal matrices for certain values of \u03b1.\nYin and Shen [2018] (see also references therein) recognise \"unitary [equivalently orthogonal]\ninvariance\" of word embeddings, explaining that \"two embeddings are essentially identical if one\ncan be obtained from the other by performing a unitary [orthogonal] operation.\" Here \"essentially\nidentical\" appears to mean with respect to the performance evaluation, our g in this paper. We\nemphasise the distinction between this and the non-identi\ufb01ability of V , which refers to the invariance\nof f to a (typically larger) class of transformations. The distinction was similarly made by Mu\net al. [2019] who suggested modifying the embedding model f such that the class of invariant\ntransformations of f and g match. We brie\ufb02y discuss further their approach later.\nRemark 1. The foregoing discussion focuses on the LSA embedding model, f in (3), in which the\noptimal embedding V arises clearly from a matrix factorisation X \u2248 U V with respect to Frobenius\nnorm, and the non-identi\ufb01ability is transparent. But other embedding models, including word2vec\nand GloVe, are de\ufb01ned by different f yet share the same property that V is non-identi\ufb01able, i.e. that\nthe solution is de\ufb01ned as the set (4). Levy et al. [2015] have shown that word2vec and GloVe both\namount to solving implicit matrix factorisation problems each with respect to a particular corpus\nrepresentation X and metric. 
To see this, and the consequent non-identifiability, it is sufficient to\nobserve, as with the objective of LSA, that the objective functions of word2vec and GloVe involve\nmatrices U and V appearing only as the product UV.\n\n3 Effect of non-identifiability of embeddings on g\n\nThe word embeddings are evaluated on tasks on the test data D using the function g, which typically\nis based on cosine similarity between elements of R^d. Our focus will hence be on functions g that\ndepend on V only through the cosine similarity between its columns.\nThe set of invariances associated with such g consists of the group cO(d) := {cQ \u2208 GL(d) : c \u2208\nR, Q \u2208 O(d)}, where O(d) is the subset of orthogonal matrices {Q \u2208 GL(d) : Q^T Q = QQ^T = I_d}.\nThis set also contains the set of scale transformations cI := {cI_d : c \u2208 R \u2212 {0}}. O(d) relates to\ntransformations that leave \u27e8v1, v2\u27e9 invariant; the scale transformation preserves the angle between v1\nand v2.\nFigure 1 (left) illustrates the incompatibility between invariances of f and g. For embedding\ndimension d = 2, vi and vj are 2D embeddings of words i and j obtained from solving f with respect\nto coordinate vectors {e1, e2}. For Q \u2208 O(d), with respect to orthogonally transformed coordinates\n{Qe1, Qe2}, Qvi and Qvj are also viable solutions of f. A g that depends only on cos(vi, vj) has\nthe same value for cos(Qvi, Qvj). On the other hand, equally valid solutions Cvi and Cvj of f with\nrespect to nonsingularly transformed coordinates {Ce1, Ce2} for C \u2208 GL(d) lead to a different value\nof g since cos(vi, vj) \u2260 cos(Cvi, Cvj) unless C \u2208 cO(d).\nThus with respect to the evaluation function g, each solution from the set {CV\u2217 : C \u2208 cO(d)} is\nequally good (or bad). However, since cO(d) \u2282 GL(d), there still exist embeddings CV\u2217 which solve\nf with g(\u00b7, CV\u2217) \u2260 g(\u00b7, V\u2217).
Such C are precisely those which characterise the incompatibility\nbetween invariances of f and g. One such example is the set of C given by the one-parameter\nsubgroup R \u220b \u03b1 \u21a6 \u039b^\u03b1, where \u039b is a d-dimensional diagonal matrix with positive elements. This\ngeneralises the subgroup \u03b3(\u03b1) discussed in \u00a72, which is the special case with \u039b = \u03a3d. Figure 1\n(right) illustrates the solution set and 1D subsets {\u039b^\u03b1 V\u2217} for different \u039b and particular solutions V\u2217.\n\nFigure 1: Left: For d = 2, orthogonally transformed coordinates {Qe1, Qe2} (blue) with Q \u2208 O(d), and\nnonsingularly transformed {Ce1, Ce2} (green) with C \u2208 GL(d), where {e1, e2} (red) are standard coordinates.\nDistances between two embedding vectors vi and vj are preserved in the coordinates {Qe1, Qe2}, but altered in\nthe coordinates {Ce1, Ce2}. However, {vi, vj}, {Qvi, Qvj} and {Cvi, Cvj} are valid solutions to (1). Right:\nIllustration of the solution set and one-dimensional subsets \u039b^\u03b1 V\u2217 parameterised by \u03b1 for two choices of \u039b and\ntwo particular solutions V\u2217.\n\nThe discussion above is summarised through the following Proposition.\nProposition 1. Let V\u2217 be a solution of (1). Then g is not invariant to non-singular transforms\nV\u2217 \u21a6 \u039b^\u03b1 V\u2217 for any \u03b1 \u2208 R unless \u039b \u2208 cI for some c \u2208 R.\nThe key message from Proposition 1 is: for \u03b11, \u03b12 \u2208 R, comparison of performances of embeddings\n\u039b^{\u03b11} V\u2217 and \u039b^{\u03b12} V\u2217 using g depends on the (arbitrary) choice of the orthogonal coordinates of R^d.\nNote however that the choice of the orthogonal coordinates does not have any bearing on f, and\nhence \u039b^{\u03b11} V\u2217 and \u039b^{\u03b12} V\u2217 are both solutions of f.
The \ufb01rst step towards addressing identi\ufb01ability\nissues pertaining to f and g is to isolate and understand the structure of the set Fd of transformations\nin GL(d) which leave f invariant but not g.\n3.1 Structure of the set Fd\nWhat is the dimension of the set Fd \u2282 GL(d)? The dimension of GL(d) is d2 and that of O(d) is\nd(d\u22121)/2. Since cI is one-dimensional, the dimension of Fd is d2\u2212d(d\u22121)/2\u22121 = d(d+1)/2\u22121.\nFigure 1 (right) clari\ufb01es the implication of the result of Proposition 1: given a solution V \u2217, tuning \u03b1\nexplores only a one-dimensional set within {CV \u2217 : C \u2208 Fd} (yellow) within the overall solution set\n{CV \u2217 : C \u2208 GL(d)} (green).\nA group-theoretic formalism is useful in precisely identifying Fd. Since O(d) is a subgroup of GL(d),\nwe are interested in those elements of GL(d) that cannot be related by an orthogonal transformation.\nSuch elements can be identi\ufb01ed as the (right) coset GL(d) \\ O(d) of O(d) in GL(d): equivalence\nclasses [C] := {QC : Q \u2208 O(d)} for C \u2208 GL(d), known as orbits, under the equivalence relation\nM \u223c N if there exists Q \u2208 O(d) such that M = QN. The set of orbits {[C] : C \u2208 GL(d)} forms a\npartition of GL(d): each nonsingular transformation C \u2208 GL(d) is associated with its [C], elements\nof which are orthogonally equivalent.\nFrom the de\ufb01nition of GL(d) \\ O(d), we can represent Fd as Fd = \u02dcFd \u2212 cI, where \u02dcFd represents\nwhat is left behind in GL(d) once O(d) has been \u2018removed\u2019, and \u2212 denotes the set difference.\nProposition 2. The set \u02dcFd can be identi\ufb01ed with the subgroup UT(d) of upper triangular matrices\nwithin GL(d) with positive diagonal entries.\nProof. The proof is based on identifying a set S \u2282 GL(d) that is in bijection with the orbits in\nGL(d) \\ O(d). 
Such a subset S is known as a cross section of the coset GL(d) \\ O(d), and intersects\neach orbit [C] at a single point. Since O(d) is a subgroup of GL(d), no two members of Fd belong\nto the same orbit [C] of any C \u2208 GL(d). Thus Fd can be identified with any cross section of\nGL(d) \\ O(d).\nThe map GL(d) \u220b C \u21a6 h(C) := C^T C is invariant to the action of O(d) since h(QC) =\n(QC)^T QC = C^T C. This implies that h is constant within each orbit [C]. To show that h is maximal\ninvariant, we need to show that h(C1) = h(C2) if and only if there is a Q \u2208 O(d) with C1 = QC2.\nTo see this, suppose that C1^T C1 = C2^T C2, and let v1, ..., vd be a basis for R^d. Let xi = C1vi and\nyi = C2vi. Then \u27e8xi, xj\u27e9 = \u27e8C1vi, C1vj\u27e9 = \u27e8vi, C1^T C1vj\u27e9 = \u27e8vi, C2^T C2vj\u27e9 = \u27e8C2vi, C2vj\u27e9 =\n\u27e8yi, yj\u27e9. There thus exists a linear isometry, say Q, such that Qyi = xi for i = 1, ..., d. This implies\nthat QC2vi = C1vi for i = 1, ..., d, and since v1, ..., vd is a basis for R^d, QC2 = C1 with Q \u2208 O(d).\nThus the range of h is in bijection with the orbits in GL(d) \\ O(d), and constitutes a cross section.\n\nFor any C \u2208 GL(d) consider its unique QR decomposition C = QR, where Q \u2208 O(d) and\nR \u2208 UT(d), made possible since R is assumed to have positive diagonal elements. Clearly then\nh(C) = h(QR) = R^T R, and its range h(GL(d)) can be identified with the set UT(d).\n\nRemark 2. The result of Proposition 2 can be distilled to the existence of a unique QR decomposition\nof C \u2208 GL(d): C = QR, where Q \u2208 O(d) and R \u2208 UT(d). There is no loss of generality in\nassuming that R has positive entries along the diagonal, since this amounts to multiplying by another\northogonal matrix which changes signs accordingly.
Thus the map GL(d) (cid:51) C (cid:55)\u2192 {UT(d) \u2212 cI}\nuniquely identi\ufb01es an element of Fd.\nThe map GL(d) (cid:51) C (cid:55)\u2192 h(C) = C T C is referred to as a maximal invariant function, and indexes the\nelements of GL(d) \\ O(d), and hence UT(d). This offers veri\ufb01cation of the fact that the dimension of\nFd is d(d + 1)/2 \u2212 1 since it is one fewer than the dimension of the subgroup UT(d). Another way\nto arrive at the conclusion is to notice that any d \u00d7 d upper triangular matrix R can be represented\nas R = D(Id + L), where Id is the identity, L is an upper triangular matrix with zeroes along the\ndiagonal, and D is a diagonal matrix. The dimension of the set of L is d(d \u2212 1)/2 and that of the set\nof D is d, resulting in d + d(d \u2212 1)/2 = d(d + 1)/2 as the dimension of the set of R.\n\n4 Resolving the problem of non-identi\ufb01ability\nFrom the preceding discussion we gather that {CV \u2217 : C \u2208 Fd} comprises the set of solutions of\nf which do not leave g invariant. We explore two resolutions: (i) imposing additional constraints\non V in (1) to identify solutions up to C \u2208 O(d) (Theorem 1), and uniquely (Corollary 1); and\n(ii) considering C as a free parameter. In (i) the identi\ufb01ed solution is chosen in a way that is\nmathematically natural, but need not be necessarily optimal with respect to g. In (ii), where C is\nconsidered as a free parameter, it may be chosen to optimize performance in tasks, i.e., by optimising\ng(D, CV \u2217) over C \u2208 UT(d).\n\n4.1 Constraining the solution set\n\nRede\ufb01ne (1) as a constrained optimisation\n\narg min\nU,V :V \u2208Cv\n\nf (X, U V ),\n\n(6)\n\nover a subset Cv of possible values of V which ensures that the only possible solutions are of the form\n{CV \u2217 : C \u2208 O(d)} for any solution V \u2217. The set of possible U is unconstrained. 
From Proposition\n2 and the QR decomposition of an element of GL(d), this is tantamount to ensuring that CV\u2217 for\nC \u2208 UT(d) is a solution of (6) if and only if C = Id, the identity matrix. Theorem 1 below identifies\nthe set Cv for any solution of U.\nTheorem 1. Let Cv = {V \u2208 R^{d\u00d7p} : VV^T = Id}. Then for any solution V\u2217 to the constrained\nproblem (6), any other solution of the form CV\u2217 for C \u2208 GL(d) satisfies g(D, CV\u2217) = g(D, V\u2217)\nfor a given test data D.\nProof. Let {U\u0304, V\u0304} be a solution to the unconstrained problem. The proof rests on the simultaneous\ndiagonalisation of V\u0304V\u0304^T and U\u0304^T U\u0304. Since V\u0304V\u0304^T is positive definite there exists M \u2208 GL(d) such\nthat V\u0304V\u0304^T = M^T M. Then M^{\u2212T}(U\u0304^T U\u0304)M^{\u22121} is symmetric, and there exists Q \u2208 O(d) such that\nQ^T M^{\u2212T}(U\u0304^T U\u0304)M^{\u22121}Q = \u039b, where \u039b is diagonal. Setting C = M^{\u22121}Q results in C^T V\u0304V\u0304^T C =\nQ^T M^{\u2212T}(V\u0304V\u0304^T)M^{\u22121}Q = Id.\nWe thus arrive at the conclusion that there exists a C \u2208 GL(d) such that C^T V\u0304V\u0304^T C = Id\nand C^T U\u0304^T U\u0304 C = \u039b. The elements of \u039b solve the generalised eigenvalue problem\ndet(U\u0304^T U\u0304 \u2212 \u03bbV\u0304V\u0304^T) = 0. Evidently then C \u2208 GL(d) is orthogonal if V\u0304V\u0304^T = Id.\n\nAn obvious but important corollary to the above Theorem is that any two solutions from Cv are\nrelated through an orthogonal transformation (not necessarily unique).\nCorollary 1. For any solutions V1 and V2 of (6) in C there exists a Q \u2208 O(d) such that QV1 = V2.\nIn other words, O(d) acts transitively on C.\n\nRemark 3. Optimisation over the constrained set Cv results in a reduction of the invariance transformations\nof f from GL(d) to O(d).
This can be understood as choosing CV \u2217 for a \ufb01xed solution V \u2217\nand arbitrary C \u2208 GL(d), performing a Gram\u2013Schmidt procedure to obtain QRV \u2217 for an Q \u2208 O(d)\nand R \u2208 UT(d), and discarding R. Topologically then, the set of solutions {QV \u2217 : Q \u2208 O(d)} is ho-\nmotopically equivalent to the set {CV \u2217 : C \u2208 GL(d)}. This is because the inclusion O(d) (cid:44)\u2192 GL(d)\nis a homotopy equivalence, as it is well-known that the Gram Schmidt process GL(d) \u2192 O(d) is a\n(strong) deformation retraction.\n\nA unique solution for V can be identi\ufb01ed by imposing additional constraints on U as follows.\nCorollary 2. Denote by Cu the set of all U \u2208 Rn\u00d7d which satisfy the following conditions: (i) The\ncolumns of U are orthogonal; (ii) the diagonal elements of U T U are arranged in descending order;\n(iii) \ufb01rst non-zero element of each column of U is positive. Then, any solution to the optimisation\nproblem in (1) over the constrained set (U, V ) \u2208 Cu \u00d7 Cv is unique.\nProof. We need to show that on the constrained space Cu \u00d7 Cv, the orthogonal C obtained by\noptimising (6) reduces to the identity.\nOn the set Cv, from the proof of Theorem 1, we note that there exists a C \u2208 O(d) such that\nC T \u00afU T \u00afU C = \u039b for a diagonal \u039b containing the eigenvalues of U T U with respect to V V T obtained\na solution of det( \u00afU T \u00afU \u2212 \u03bb \u00afV \u00afV T ).\nIn addition to being orthogonal, condition (i) forces C to be a matrix with each column and row\ncontaining one non-zero element assuming values \u00b11. In other words, C is forced to be a monomial\nmatrix with entries equal to \u00b11. This implies that the diagonal C T U T U C contains the same elements\nas U T U, but possibly in a different order. Condition (ii) then \ufb01xes a particular order, and condition\n(iii) ensures that each diagonal element is +1. 
We thus end up with C = Id.\n\nThe idea to modify the optimisation so that the solution is unique up to transformations in O(d), but\nnot necessarily GL(d), is also used by Mu et al. [2019]. Rather than place constraints on V , as above,\nthey modi\ufb01ed the objective f to include Frobenius norm penalties on U and V , which achieves the\nsame outcome, although the relationship between the solutions of the penalised and unpenalised\nproblems is not transparent.\n\n4.1.1 Exploiting symmetry of X\n\nIf the corpus representation X is a symmetric matrix, for example involving counts of word-word\nco-occurrences, then the rows of U and the columns of V both have the same interpretation as word\nembeddings. In such cases the symmetry motivates the imposition U T = V . For example, in LSA\n(3) and its solution (5), this is achieved by taking \u03b1 = 1/2, since Ad = Bd owing to the symmetry.\nThis identi\ufb01es a solution up to sign changes and permutations of the word vectors, transformations\nwhich are contained within O(d) and hence are of no consequence to g.\nIn GloVe, Pennington et al. [2014] observe that when X is symmetric the U T and V are equivalent\nbut differ in practise \"as a result of their random initializations\". It seems likely that different runs\ninvolve the optimisation routine converging to different elements of the solution set, and not in\ngeneral to solutions with U T = V . For a given run Pennington et al seek to treat solutions U\u2217T\nand V \u2217 symmetrically by taking the word embedding to be V = U\u2217T + V \u2217, which is not itself in\ngeneral optimal with respect to the GloVe objective function, f (although they report that using it\nover V = V \u2217 typically confers a small performance advantage). 
A different approach is to take the\nembedding to be V = CV\u2217 where C \u2208 GL(d) is the solution to the equation C^{\u2212T} U\u2217^T = CV\u2217,\nwhich more directly identifies an element of the solution set for which U^T = V, and hence avoids\ntaking the final embedding to be one that is non-optimal with respect to criterion f. The same strategy\nis also appropriate to other word embedding models, e.g. word2vec.\n\n4.2 Optimizing over Fd\n\nTo what extent can we optimize word-task performance g(D, V) by choosing an appropriate element\nV of the solution set (4)? The set of transformations Fd has dimension d(d + 1)/2 \u2212 1, typically\nmuch larger than the number of cases in D, so care is needed to avoid overfitting. In particular,\nif the embeddings generated are to be regarded as a predictive model, then it is necessary to use\ncross-validation rather than just optimising the embeddings with respect to a particular test set. One\napproach is to restrict the dimension of the optimisation, for example as earlier by considering\nsolutions V = \u039b^\u03b1 V\u2217 for a particular solution V\u2217 and diagonal matrix \u039b. A widely used approach\ncorresponds to choosing \u039b = \u03a3d, a matrix containing the dominant singular values of X; Figure 2\nshows how g varies with \u03b1 for this \u039b and some other choices of \u039b chosen quite arbitrarily (details\nin the caption). There is clearly substantial variability in g with \u03b1, but performance with \u039b = \u03a3d is\nonly on a par with the other arbitrary choices.\n\n
Figure 2: Plots showing word task evaluation scores g(D, V) corresponding to the WordSim-353 task [Finkelstein\net al., 2002] (located at http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/)\nwhich provides a set of word pairs with human-assigned similarity scores. Panels: V\u2217 = \u03a3_d B_d^T (Pearson); V\u2217 = \u03a3_d B_d^T (Spearman); V\u2217 = B_d^T (Pearson); V\u2217 = B_d^T (Spearman). The embeddings are evaluated by\ncalculating the cosine similarities between the word pairs and using either Pearson or Spearman correlation (each\ninvariant to O(d) \u222a cI) to score correspondence between embedding and human-assigned similarity values.\nThe embedding is from model (3), with X taken to be a document\u2013term matrix computed from the Corpus of\nHistorical American English [Davies, 2012], and the plotted lines show how performance varies with different\nelements of the solution set, namely V = \u039b^\u03b1 V\u2217 for V\u2217 as indicated and different \u039b = diag(\u03bb1, ..., \u03bbd) as\nfollows: \u039b = \u03a3d (red lines); \u03bbi = i (green); \u03bbi \u223c U(0, 1) (blue); and \u03bbi \u223c |N(0, 1)| (purple). Performance\nfor \u039b = \u03a3d, which is widely used, is not obviously superior to performance of the other completely arbitrary\nchoices for \u039b.\n\nFigure 3 shows the distribution of g for V = RV\u2217 where V\u2217 is a GloVe embedding, and R is a\nrandom element of Fd, which is either upper triangular or diagonal, with its non-zero elements taken\nfrom the distribution |N(0, 1)|, and g measures the performance of the embeddings on two similarity\ntest sets. (More details in caption.) The histograms show substantial variance in the scores for\ndifferent R.
The score for the base embedding V\u2217 is at the higher end of the distribution, though for\nsome instances of random R the performance of V is superior. It is also noticeable that there is a\nmuch greater range of scores when R is sampled from the set of diagonal matrices than when it is\nsampled from the set of upper triangular matrices. We hypothesize that this is because when R is\ndiagonal, there is a possibility of very small elements on the diagonal which will essentially wipe out\nwhole rows of V, which could have a significant impact on the results.\nTable 1 shows scores that result from optimising g(D, V) for V = \u039bV\u2217 with respect to the elements\nof \u039b = diag(\u03bb1, ..., \u03bbd), using R\u2019s optim implementation of the Nelder\u2013Mead method, where V\u2217\nare 300-dimensional embeddings generated using GloVe and word2vec. The results show that there\nexists a transformed embedding \u039bV\u2217 that performs substantially better than the base embedding. In\npractice, in order to use this optimization method to generate embeddings, it would be necessary to\nuse cross-validation, as embeddings which achieve optimal performance with respect to one test set\nmay do less well on others. Our aim here is merely to point out that it is possible to improve the test\nscores by optimizing over elements of \u039b.\n\n[Figure 2 plot panels: horizontal axis \u03b1, vertical axis correlation coefficient.]\n\nFigure 3: For the same type of task as in Fig. 2, histograms of Spearman correlation scores for embeddings\nV = RV\u2217 where V\u2217 is a GloVe embedding (source: https://nlp.stanford.edu/projects/glove/) with d = 300 trained on Wikipedia 2014 + Gigaword 5 corpus,\nevaluated on the WordSim-353 test set in (a) and (b), and on the SimLex-999 test set [Hill et al., 2015] in\n(c) and (d). R \u2208 Fd is a random matrix, taken to be diagonal in (a) and (c) and upper-triangular in (b) and\n(d), in each case with the non-zero elements each distributed as |N(0, 1)|. The number of runs in each case\nwas 1000. The red line on each graph shows the score for the original embedding in each case.\n\nFigure 4: Histograms showing the performance of word2vec embeddings trained on the 100-billion word\nGoogle News corpus, where d = 300 (downloaded from https://code.google.com/archive/p/word2vec). As for\nFigure 3, the test set used is the WordSim-353 test set in (a) and (b), and SimLex-999 in (c) and (d), with the test\nscore being calculated using the Spearman correlation coefficient. In graphs (a) and (c) R is sampled from the\nset of diagonal matrices, and in (b) and (d) it is taken to be upper triangular.\n\n
5 Conclusions\n\nWe summarise our conclusions as follows.\n\n1. Examining word embeddings (including LSA, word2vec and GloVe) through the relationship\nwith low-rank matrix factorisations with respect to a criterion f makes it clear that\nthe solution V is non-identifiable: for a particular solution V\u2217, CV\u2217 for any C \u2208 GL(d) is\nalso a solution. Different elements of the d^2-dimensional solution set perform differently in\nevaluations, g, of word task performance.\n\n2. An important implication is that the disparity in performance between word embeddings\non tasks g may be due to the particular elements selected from the solution sets. In word\nembeddings for which the f is optimized numerically with some randomness, for example\nin the initializations, the optimisation may converge to different elements of the solution\nset.
An embedding chosen based on the best performance in g over repeated runs of the\noptimisation can essentially be viewed as a Monte Carlo optimisation over the solution set.\n\n3. The evaluation function g is usually only invariant to orthogonal (O(d)) and scale-type (cI)\ntransformations. Thus for an embedding dimension d, the effective dimension of the solution\nset after accounting for the orthogonal transformations, and scaled versions of the identity, is\nd(d + 1)/2 \u2212 1 (the d^2 dimensions of GL(d), less d(d \u2212 1)/2 for O(d) and 1 for scaling). Conclusions from evaluations with large d must hence be interpreted with\nsome care, especially if V is optimized with respect to the incompatible transformations\nFd directly or indirectly, for example as in point 2 above.\n\n4. These considerations have a bearing on the interpretation of the performance of the popular\nembedding approach of taking V = \u039b^\u03b1 V \u2217 where \u03b1 is a tuning parameter and \u039b is a diagonal\nmatrix taken, for example, to contain the singular values of X. This amounts to providing a\nway to perform a search over a one-dimensional subset of the (d(d + 1)/2 \u2212 1)-dimensional\nsolution set. Our numerical results suggest there is nothing special about this particular\nchoice of \u039b (or the corresponding one-dimensional subset being searched over), nor is there\na clear rationale for restricting to a one-dimensional subset.\n\nTest set | Embeddings | Spearman | Pearson\nWordSim-353 | GloVe vectors reported in [Pennington et al., 2014] | 0.658 | -\nWordSim-353 | GloVe: GloVe embedding V \u2217 | 0.601 | 0.603\nWordSim-353 | GloVe with Equation 6 imposed: GloVe embedding V \u2217 | 0.641 | 0.637\nWordSim-353 | GloVe optimized over \u039b = diag(\u03bb1, ..., \u03bbd): V = \u039bV \u2217 | 0.679 | 0.760\nWordSim-353 | w2v: word2vec embedding V \u2217 | 0.700 | 0.652\nWordSim-353 | w2v with Equation 6 imposed: word2vec embedding V \u2217 | 0.645 | 0.588\nWordSim-353 | w2v optimized over \u039b = diag(\u03bb1, ..., \u03bbd): V = \u039bV \u2217 | 0.797 | 0.838\nSimLex-999 | GloVe: GloVe embedding V \u2217 | 0.371 | 0.389\nSimLex-999 | GloVe with Equation 6 imposed: GloVe embedding V \u2217 | 0.402 | 0.421\nSimLex-999 | GloVe optimized over \u039b = diag(\u03bb1, ..., \u03bbd): V = \u039bV \u2217 | 0.560 | 0.582\nSimLex-999 | w2v: word2vec embedding V \u2217 | 0.441 | 0.453\nSimLex-999 | w2v with Equation 6 imposed: word2vec embedding V \u2217 | 0.475 | 0.480\nSimLex-999 | w2v optimized over \u039b = diag(\u03bb1, ..., \u03bbd): V = \u039bV \u2217 | 0.583 | 0.617\n\nTable 1: Evaluation task scores g(D, V) corresponding to the WordSim-353 [Finkelstein et al., 2002] and SimLex-999\n[Hill et al., 2015] test sets. The base GloVe embedding V \u2217 is as described in the caption of Figure 3; the\nword2vec embedding is as described in the caption of Figure 4.\n\nIn the first row we note for reference the performance reported in [Pennington et al., 2014]. The\nresults indicate substantial scope for improving performance scores via an appropriate choice of \u039b.\n\nAcknowledgments\n\nThe authors gratefully acknowledge support for this work from grants NSF DMS 1613054 and NIH\nRO1 CA214955 (KB), a Bloomberg Data Science Research Grant (KB & SP), and an EPSRC PhD\nstudentship (RC).\n\nReferences\n\nJohn A Bullinaria and Joseph P Levy. Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and svd. Behavior research methods, 44(3):890\u2013907, 2012.\n\nMark Davies. Expanding horizons in historical linguistics with the 400-million word corpus of historical american english. Corpora, 7(2):121\u2013157, 2012.\n\nScott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391\u2013407, 1990.\n\nCarl Eckart and Gale Young.
The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211\u2013218, 1936.\n\nLev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. ACM Transactions on information systems, 20(1):116\u2013131, 2002.\n\nFelix Hill, Roi Reichart, and Anna Korhonen. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665\u2013695, 2015.\n\nOmer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211\u2013225, 2015.\n\nTomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.\n\nTomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111\u20133119, 2013b.\n\nCun Mu, Guang Yang, and Zheng Yan. Revisiting skip-gram negative sampling model with rectification. arXiv preprint arXiv:1804.00306v2, 2019.\n\nJeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532\u20131543, 2014.\n\nPeter D Turney. Distributional semantics beyond words: Supervised learning of analogy and paraphrase. Transactions of the Association for Computational Linguistics, 1:353\u2013366, 2013.\n\nZi Yin and Yuanyuan Shen. On the dimensionality of word embedding.
In Advances in Neural Information Processing Systems, pages 887\u2013898, 2018.", "award": [], "sourceid": 8672, "authors": [{"given_name": "Rachel", "family_name": "Carrington", "institution": "University of Nottingham"}, {"given_name": "Karthik", "family_name": "Bharath", "institution": "University of Nottingham"}, {"given_name": "Simon", "family_name": "Preston", "institution": "University of Nottingham"}]}