{"title": "Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 10237, "page_last": 10248, "abstract": "Embedding complex objects as vectors in low dimensional spaces is a longstanding problem in machine learning. We propose in this work an extension of that approach, which consists in embedding objects as elliptical probability distributions, namely distributions whose densities have elliptical level sets. We endow these measures with the 2-Wasserstein metric, with two important benefits: (i) For such measures, the squared 2-Wasserstein metric has a closed form, equal to a weighted sum of the squared Euclidean distance between means and the squared Bures metric between covariance matrices. The latter is a Riemannian metric between positive semi-definite matrices, which turns out to be Euclidean on a suitable factor representation of such matrices, which is valid on the entire geodesic between these matrices. (ii) The 2-Wasserstein distance boils down to the usual Euclidean metric when comparing Diracs, and therefore provides a natural framework to extend point embeddings. We show that for these reasons Wasserstein elliptical embeddings are more intuitive and yield tools that are better behaved numerically than the alternative choice of Gaussian embeddings with the Kullback-Leibler divergence. In particular, and unlike previous work based on the KL geometry, we learn elliptical distributions that are not necessarily diagonal. 
We demonstrate the advantages of elliptical embeddings by using them for visualization, to compute embeddings of words, and to reflect entailment or hypernymy.", "full_text": "Generalizing Point Embeddings using the\nWasserstein Space of Elliptical Distributions\n\nBoris Muzellec\nCREST, ENSAE\n\nboris.muzellec@ensae.fr\n\nMarco Cuturi\n\nGoogle Brain and CREST, ENSAE\n\ncuturi@google.com\n\nAbstract\n\nEmbedding complex objects as vectors in low dimensional spaces is a longstanding\nproblem in machine learning. We propose in this work an extension of that\napproach, which consists in embedding objects as elliptical probability distributions,\nnamely distributions whose densities have elliptical level sets. We endow these\nmeasures with the 2-Wasserstein metric, with two important bene\ufb01ts: (i) For such\nmeasures, the squared 2-Wasserstein metric has a closed form, equal to a weighted\nsum of the squared Euclidean distance between means and the squared Bures\nmetric between covariance matrices. The latter is a Riemannian metric between\npositive semi-de\ufb01nite matrices, which turns out to be Euclidean on a suitable factor\nrepresentation of such matrices, which is valid on the entire geodesic between\nthese matrices. (ii) The 2-Wasserstein distance boils down to the usual Euclidean\nmetric when comparing Diracs, and therefore provides a natural framework to\nextend point embeddings. We show that for these reasons Wasserstein elliptical\nembeddings are more intuitive and yield tools that are better behaved numerically\nthan the alternative choice of Gaussian embeddings with the Kullback-Leibler\ndivergence. In particular, and unlike previous work based on the KL geometry, we\nlearn elliptical distributions that are not necessarily diagonal. 
We demonstrate the advantages of elliptical embeddings by using them for visualization, to compute embeddings of words, and to reflect entailment or hypernymy.

1 Introduction

One of the holy grails of machine learning is to compute meaningful low-dimensional embeddings for high-dimensional complex data. That ability has recently proved crucial to tackle more advanced tasks, such as for instance: inference on texts using word embeddings [Mikolov et al., 2013b, Pennington et al., 2014, Bojanowski et al., 2017], improved image understanding [Norouzi et al., 2014], representations for nodes in large graphs [Grover and Leskovec, 2016].
Such embeddings have been traditionally recovered by seeking isometric embeddings in lower dimensional Euclidean spaces, as studied in [Johnson and Lindenstrauss, 1984, Bourgain, 1985]. Given n input points x₁, . . . , xₙ, one seeks as many embeddings y₁, . . . , yₙ in a target space Y = ℝᵈ whose pairwise distances ‖yᵢ − yⱼ‖₂ do not depart too much from the original distances d_X(xᵢ, xⱼ) in the input space. Note that when d is restricted to be 2 or 3, these embeddings (yᵢ)ᵢ provide a useful way to visualize the entire dataset. Starting with metric multidimensional scaling (mMDS) [De Leeuw, 1977, Borg and Groenen, 2005], several approaches have refined this intuition [Tenenbaum et al., 2000, Roweis and Saul, 2000, Hinton and Roweis, 2003, Maaten and Hinton, 2008].
More general criteria, such as reconstruction error [Hinton and Salakhutdinov, 2006, Kingma and Welling, 2014]; co-occurrence [Globerson et al., 2007]; or relational knowledge, be it in metric learning [Weinberger and Saul, 2009] or between words [Mikolov et al., 2013b] can be used to obtain vector embeddings. In such cases, distances ‖yᵢ − yⱼ‖₂ between embeddings, or alternatively their dot-products ⟨yᵢ, yⱼ⟩,

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

must comply with sophisticated desiderata. Naturally, more general and flexible approaches in which the embedding space Y need not be Euclidean can be considered, for instance in generalized MDS on the sphere [Maron et al., 2010], on surfaces [Bronstein et al., 2006], in spaces of trees [Bădoiu et al., 2007, Fakcharoenphol et al., 2003] or, more recently, computed in the Poincaré hyperbolic space [Nickel and Kiela, 2017].
Probabilistic Embeddings. Our work belongs to a recent trend, pioneered by Vilnis and McCallum, who proposed to embed data points as probability measures in ℝᵈ [2015], and therefore generalize point embeddings. Indeed, point embeddings can be regarded as a very particular, and degenerate, case of probabilistic embedding, in which the uncertainty is infinitely concentrated on a single point (a Dirac). Probability measures can be more spread-out, or even multimodal, and provide therefore an opportunity for additional flexibility. Naturally, such an opportunity can only be exploited by defining a metric, divergence or dot-product on the space (or a subspace thereof) of probability measures. Vilnis and McCallum proposed to embed words as Gaussians endowed either with the Kullback-Leibler (KL) divergence or the expected likelihood kernel [Jebara et al., 2004].
The Kullback-Leibler divergence and expected likelihood kernel on measures have, however, an important drawback: these geometries do not coincide with the usual Euclidean metric between point embeddings when the variances of these Gaussians collapse. Indeed, the KL divergence and the ℓ₂ distance between two Gaussians diverge to ∞ or saturate when the variances of these Gaussians become small. To avoid numerical instabilities arising from this degeneracy, Vilnis and McCallum must restrict their work to diagonal covariance matrices. In a concurrent approach, Singh et al. represent words as distributions over their contexts in the optimal transport geometry [Singh et al., 2018].
Contributions. We propose in this work a new framework for probabilistic embeddings, in which point embeddings are seamlessly handled as a particular case. We consider arbitrary families of elliptical distributions, which subsume Gaussians, and also include uniform elliptical distributions, which are arguably easier to visualize because of their compact support. Our approach uses the 2-Wasserstein distance to compare elliptical distributions. The latter can handle degenerate measures, and both its value and its gradients admit closed forms [Gelbrich, 1990], either in their natural Riemannian formulation, as well as in a more amenable local Euclidean parameterization. We provide numerical tools to carry out the computation of elliptical embeddings in different scenarios, both to optimize them with respect to metric requirements (as is done in multidimensional scaling) or with respect to dot-products (as shown in our applications to word embeddings for entailment, similarity and hypernymy tasks) for which we introduce a proxy using a polarization identity.
Notations. 𝒮ᵈ₊₊ (resp. 𝒮ᵈ₊) is the set of positive (resp. semi-)definite d × d matrices. For two vectors x, c ∈ ℝᵈ and a matrix M ∈ 𝒮ᵈ₊, we write the Mahalanobis norm induced by M as ‖x − c‖²_M = (x − c)ᵀ M (x − c), and |M| for det(M). For V an affine subspace of dimension m of ℝᵈ, λ_V is the Lebesgue measure on that subspace. M† is the pseudo-inverse of M.

2 The Geometry of Elliptical Distributions in the Wasserstein Space

We recall in this section basic facts about elliptical distributions in ℝᵈ. We adopt a general formulation that can handle measures supported on subspaces of ℝᵈ as well as Dirac (point) measures. That level of generality is needed to provide a seamless connection with usual vector embeddings, seen in the context of this paper as Dirac masses. We recall results from the literature showing that the squared 2-Wasserstein distance between two distributions from the same family of elliptical distributions is equal to the squared Euclidean distance between their means plus the squared Bures metric between their scale parameters, scaled by a suitable constant.
Elliptically Contoured Densities. In their simplest form, elliptical distributions can be seen as generalizations of Gaussian multivariate densities in ℝᵈ: their level sets describe concentric ellipsoids, shaped following a scale parameter C ∈ 𝒮ᵈ₊₊, and centered around a mean parameter c ∈ ℝᵈ [Cambanis et al., 1981]. The density at a point x of such distributions is f(‖x − c‖²_{C⁻¹}) / √|C|, where the generator function f is such that ∫_{ℝᵈ} f(‖x‖²₂) dx = 1. Gaussians are recovered with f = g, g(·) ∝ e^{−·/2}, while uniform distributions on full rank ellipsoids result from f = u, u(·) ∝ 1_{· ≤ 1}. Because the norm induced by C⁻¹ appears in the formulas above, the scale parameter C must have full rank for these definitions to be meaningful. Cases where C does not have full rank can however appear when a probability measure is supported on an affine subspace¹ of ℝᵈ, such as lines in ℝ², or even possibly a space of null dimension when the measure is supported on a single point (a Dirac measure), in which case its scale parameter C is 0.
We provide in what follows a more general approach to handle these degenerate cases.
Elliptical Distributions. To lift this limitation, several reformulations of elliptical distributions have been proposed to handle degenerate scale matrices C of rank rk C < d. Gelbrich [1990, Theorem 2.4] defines elliptical distributions as measures with a density w.r.t. the Lebesgue measure of dimension rk C, in the affine space c + Im C, where the image of C is Im C := {Cx, x ∈ ℝᵈ}. This approach is intuitive, in that it reduces to describing densities in their relevant subspace. A more elegant approach uses the parameterization provided by characteristic functions [Cambanis et al., 1981, Fang et al., 1990]. In a nutshell, recall that the characteristic function of a multivariate Gaussian is equal to φ(t) = e^{i tᵀ c} g(tᵀ C t) where, as in the paragraph above, g(·) = e^{−·/2}. A natural generalization to other elliptical distributions is therefore to substitute for g other functions h of positive type [Ushakov, 1999, Theo. 1.8.9], such as the indicator function u above, and still apply them to the same argument tᵀ C t. Such functions are called characteristic generators and fully determine, along with a mean c and a scale parameter C, an elliptical measure. This parameterization does not require the scale parameter C to be invertible, and therefore allows to define probability distributions that do not necessarily have a density w.r.t. the Lebesgue measure in ℝᵈ. Both constructions are relatively complex, and we refer the interested reader to these references for a rigorous treatment.
Rank Deficient Elliptical Distributions and their Variances. For the purpose of this work, we will only require the following result: the variance of an elliptical measure is equal to its scale parameter C multiplied by a scalar that only depends on its characteristic generator.
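For the uniform distribution on the unit ball of ℝᵈ, for instance, that constant is 1/(d + 2), which a quick Monte-Carlo check confirms. A minimal numpy sketch for the isotropic case C = I, c = 0 (the ball-sampling scheme is standard, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
n = 200_000
# sample uniformly from the unit ball in R^d:
# uniform direction on the sphere, radius distributed as U^(1/d)
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
x *= rng.uniform(size=(n, 1)) ** (1 / d)
cov = np.cov(x.T)                   # empirically close to I / (d + 2)
print(np.round(cov * (d + 2), 2))   # prints a matrix close to the 3 x 3 identity
```

The same check with a general scale parameter (pushing the ball forward by C^{1/2}) recovers a covariance close to C/(d + 2).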
Indeed, given a mean vector c ∈ ℝᵈ, a scale semi-definite matrix C ∈ 𝒮ᵈ₊ and a characteristic generator function h, we define µ_{h,c,C} to be the measure with characteristic function t ↦ e^{i tᵀ c} h(tᵀ C t). In that case, one can show that the covariance matrix of µ_{h,c,C} is equal to its scale parameter C times a constant τ_h that only depends on h, namely

    var(µ_{h,c,C}) = τ_h C .    (1)

For Gaussians, the scale parameter C and the covariance matrix coincide, that is τ_g = 1. For uniform elliptical distributions, one has τ_u = 1/(d + 2): the covariance of a uniform distribution on the volume {c + C^{1/2}x, x ∈ ℝᵈ, ‖x‖ ≤ 1}, such as those represented in Figure 1, is equal to C/(d + 2).
The 2-Wasserstein Bures Metric. A natural metric for elliptical distributions arises from optimal transport (OT) theory. We refer interested readers to [Santambrogio, 2015, Peyré and Cuturi, 2018] for exhaustive surveys on OT. Recall that for two arbitrary probability measures µ, ν ∈ 𝒫(ℝᵈ), their squared 2-Wasserstein distance is equal to

    W₂²(µ, ν) := inf_{X∼µ, Y∼ν} E‖X − Y‖₂² .

Figure 1: Five measures from the family of uniform elliptical distributions in ℝ³. Each measure has a mean (location) and scale parameter. In this carefully selected example, the reference measure (with scale parameter A) is equidistant (according to the 2-Wasserstein metric) to the four remaining measures, whose scale parameters B₀ = 0_{3×3}, B₁ = vvᵀ, B₂, B₃ have ranks equal to their indices (here, v = [3, 7, −2]ᵀ).

This formula rarely has a closed form.
However, in the footsteps of Dowson and Landau [1982] who proved it for Gaussians, Gelbrich [1990] showed that for α := µ_{h,a,A} and β := µ_{h,b,B} in the same family P_h = {µ_{h,c,C}, c ∈ ℝᵈ, C ∈ 𝒮ᵈ₊}, one has

    W₂²(α, β) = ‖a − b‖₂² + B²(var α, var β) = ‖a − b‖₂² + τ_h B²(A, B) ,    (2)

where B² is the (squared) Bures metric on 𝒮ᵈ₊, proposed in quantum information geometry [1969] and studied recently in [Bhatia et al., 2018, Malagò et al., 2018],

    B²(X, Y) := Tr(X + Y − 2(X^{1/2} Y X^{1/2})^{1/2}) .    (3)

The factor τ_h next to the rightmost term B² in (2) arises from the homogeneity of B² in its arguments (3), which is leveraged using the identity in (1).
A few remarks. (i) When both scale matrices A = diag d_A and B = diag d_B are diagonal, W₂²(α, β) is the sum of two terms: the usual squared Euclidean distance between their means, plus τ_h times the squared Hellinger metric between the diagonals d_A, d_B: H²(d_A, d_B) := ‖√d_A − √d_B‖₂². (ii) The distance W₂ between two Diracs δ_a, δ_b is equal to the usual distance between vectors ‖a − b‖₂. (iii) The squared distance W₂² between a Dirac δ_a and a measure µ_{h,b,B} in P_h reduces to ‖a − b‖² + τ_h Tr B. The distance between a point and an ellipsoid distribution therefore always increases as the scale parameter of the latter increases. Although this point makes sense from the quadratic viewpoint of W₂² (in which the quadratic contribution ‖a − x‖₂² of points x in the ellipsoid that stand further away from a than b will dominate that brought by points x that are closer, see Figure 3), this may be counterintuitive for applications to visualization, an issue that will be addressed in Section 4. (iv) The W₂ distance between two elliptical distributions in the same family P_h is always finite, no matter how degenerate they are. This is illustrated in Figure 1 in which a uniform measure µ_{a,A} is shown to be exactly equidistant to four other uniform elliptical measures, some of which are degenerate. However, as can be hinted by the simple example of the Hellinger metric, that distance may not be differentiable for degenerate measures (in the same sense that (√x − √y)² is defined at x = 0 but not differentiable w.r.t. x). (v) Although we focus in this paper on uniform elliptical distributions, notably because they are easier to plot and visualize, considering any other elliptical family simply amounts to changing the constant τ_h next to the Bures metric in (2). Alternatively, increasing (or tuning) that parameter τ_h simply amounts to considering elliptical distributions with increasingly heavier tails.

¹For instance, the random variable Y in ℝ² obtained by duplicating the same normal random variable X in ℝ, Y = [X, X], is supported on a line in ℝ² and has no density w.r.t. the Lebesgue measure in ℝ².

3 Optimizing over the Space of Elliptical Embeddings

Our goal in this paper is to use the set of elliptical distributions endowed with the W₂ distance as an embedding space. To optimize objective functions involving W₂ terms, we study in this section several parameterizations of the parameters of elliptical distributions. Location parameters only appear in the computation of W₂ through their Euclidean metric, and offer therefore no particular challenge. Scale parameters are more tricky to handle since they are constrained to lie in 𝒮ᵈ₊.
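As a numerical sanity check of the closed form (2)-(3), the squared Bures metric and the resulting squared 2-Wasserstein distance can be evaluated directly with scipy for the Gaussian case τ_h = 1; a minimal sketch (function names are mine, not the paper's):

```python
import numpy as np
from scipy.linalg import sqrtm

def bures2(A, B):
    # squared Bures metric (3): Tr(A + B - 2 (A^{1/2} B A^{1/2})^{1/2})
    rA = np.real(sqrtm(A))
    cross = np.real(sqrtm(rA @ B @ rA))
    return float(np.trace(A) + np.trace(B) - 2 * np.trace(cross))

def w2_elliptical2(a, A, b, B, tau=1.0):
    # squared 2-Wasserstein distance (2); tau = tau_h, equal to 1 for Gaussians
    return float(np.sum((a - b) ** 2) + tau * bures2(A, B))
```

As in remark (i), for diagonal matrices bures2 coincides with the squared Hellinger metric between the diagonals, and as in remark (ii) the distance collapses to the usual squared Euclidean distance when both scale parameters vanish.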
Rather than keeping track of scale parameters, we advocate optimizing directly on factors (square roots) of such parameters, which results in simple Euclidean (unconstrained) updates reviewed below.
Geodesics for Elliptical Distributions. When A and B have full rank, the geodesic from α to β is a curve of measures in the same family of elliptical distributions, characterized by location and scale parameters c(t), C(t), where

    c(t) = (1 − t)a + tb;   C(t) = ((1 − t)I + t T_AB) A ((1 − t)I + t T_AB) ,    (4)

and where the matrix T_AB is such that x ↦ T_AB(x − a) + b is the so-called Brenier optimal transportation map [1987] from α to β, given in closed form as

    T_AB := A^{−1/2} (A^{1/2} B A^{1/2})^{1/2} A^{−1/2} ,    (5)

and is the unique matrix such that B = T_AB A T_AB [Peyré and Cuturi, 2018, Remark 2.30]. When A is degenerate, such a curve still exists as long as Im B ⊂ Im A, in which case the expression above is still valid using pseudo-inverse square roots A^{†/2} in place of the usual inverse square root.
Differentiability in Riemannian Parameterization. Scale parameters are restricted to lie on the cone 𝒮ᵈ₊. For such problems, it is well known that a direct gradient-and-project based optimization on scale parameters would prove too expensive. A natural remedy to this issue is to perform manifold optimization [Absil et al., 2009]. Indeed, as in any Riemannian manifold, the Riemannian gradient grad_x ½ d²(x, y) is given by −log_x y [Lee, 1997]. Using the expressions of the exp and log given in [Malagò et al., 2018], we can show that minimizing ½ B²(A, B) using Riemannian gradient descent corresponds to making updates of the form, with step length η,

    A′ = ((1 − η)I + η T_AB) A ((1 − η)I + η T_AB) .    (6)

When 0 ≤ η ≤ 1, this corresponds to considering a new point A′ closer to B along the Bures geodesic between A and B. When η is negative or larger than 1, A′ no longer lies on this geodesic but is guaranteed to remain PSD, as can be seen from (6). Figure 2 shows a W₂ geodesic between two measures µ₀ and µ₁, as well as its extrapolation following exactly the formula given in (4). That figure illustrates that µ_t is not necessarily geodesic outside of the boundaries [0, 1] w.r.t. three relevant measures, because its metric derivative is smaller than 1 [Ambrosio et al., 2006, Theorem 1.1.2]. When negative steps are taken (for instance when the W₂² distance needs to be increased), this lack of geodesicity has proved difficult to handle numerically for a simple reason: such updates may lead to degenerate scale parameters A′, as illustrated around time t = 1.5 of the curve in Figure 2. Another obvious drawback of Riemannian approaches is that they are not as well studied as simpler non-constrained Euclidean problems, for which a plethora of optimization techniques are available. This observation motivates an alternative Euclidean parameterization, detailed in the next paragraph.

Figure 2: (left) Interpolation (µ_t)_t between two measures µ₀ and µ₁ following the geodesic equation (4). The same formula can be used to interpolate on the left and right of times 0, 1. Displayed times are [−2, −1, −.5, 0, .25, .5, .75, 1, 1.5, 2, 3]. Note that geodesicity is not ensured outside of the boundaries [0, 1]. This is illustrated in the right plot displaying normalized metric derivatives |dW₂/dt| / W₂(µ₀, µ₁) of the curve µ_t to four relevant points: µ₀, µ₁, µ₋₂, µ₃.
The curve µ_t is not always locally geodesic, as can be seen from the fact that the metric derivative is strictly smaller than 1 in several cases.
Differentiability in Euclidean Parameterization. A canonical way to handle a PSD constraint for A is to rewrite it in factor form A = LLᵀ. In the particular case of the Bures metric, we show that this simple parameterization comes without losing the geometric interest of manifold optimization, while benefiting from simpler additive updates. Indeed, one can show (see supplementary material) that the halved squared Bures metric has the following gradient:

    ∇_L ½ B²(A, B) = (I − T_AB) L ,  with updates  L′ = ((1 − η)I + η T_AB) L .    (7)

Links between Euclidean and Riemannian Parameterizations. The factor updates in (7) are exactly equivalent to the Riemannian ones (6), in the sense that A′ = L′L′ᵀ. Therefore, by using a factor parameterization we carry out updates that stay on the Riemannian geodesic yet only require linear updates on L, independently of the factor L chosen to represent A (given a factor L of A, any right-side multiplication of that matrix by a unitary matrix remains a factor of A).
When considering a general loss function 𝓛 that takes as arguments squared Bures distances, one can also show that 𝓛 is geodesically convex w.r.t. scale matrices A if and only if it is convex in the usual sense with respect to L, where A = LLᵀ. Write now L_B = T_AB L. One can recover that L_B L_Bᵀ = B. Therefore, expanding the expression B² for the right term below, we obtain

    B²(A, B) = B²(LLᵀ, L_B L_Bᵀ) = B²(LLᵀ, (T_AB L)(T_AB L)ᵀ) = ‖L − T_AB L‖²_F .

Indeed, the Bures distance simply reduces to the Frobenius distance between two factors of A and B. However these factors need to be carefully chosen: given L for A, the factor for B must be computed according to an optimal transport map T_AB.
Polarization between Elliptical Distributions. Some of the applications we consider, such as the estimation of word embeddings, are inherently based on dot-products. By analogy with the polarization identity ⟨x, y⟩ = (‖x − 0‖² + ‖y − 0‖² − ‖x − y‖²)/2, we define a Wasserstein-Bures pseudo-dot-product, where δ₀ = µ_{0_d, 0_{d×d}} is the Dirac mass at 0,

    [µ_{a,A} : µ_{b,B}] := ½ (W₂²(µ_{a,A}, δ₀) + W₂²(µ_{b,B}, δ₀) − W₂²(µ_{a,A}, µ_{b,B})) = ⟨a, b⟩ + Tr(A^{1/2} B A^{1/2})^{1/2} .

Note that [· : ·] is not an actual inner product since the Bures metric is not Hilbertian, unless we restrict ourselves to diagonal covariance matrices, in which case it is the inner product between (a, √d_A) and (b, √d_B). We use [µ_{a,A} : µ_{b,B}] as a similarity measure which has, however, some regularity: one can show that when a, b are constrained to have equal norms and A and B equal traces, then [µ_{a,A} : µ_{b,B}] is maximal when a = b and A = B. Differentiating all three terms in that sum, the gradient of this pseudo-dot-product w.r.t. A reduces to ∇_A [µ_{a,A} : µ_{b,B}] = ½ T_AB.
Computational Aspects. The computational bottleneck of gradient-based Bures optimization lies in the matrix square roots and inverse square roots that arise when instantiating transport maps T as in (5). A naive method using eigenvector decomposition is far too time-consuming, and there is not yet, to the best of our knowledge, a straightforward way to perform it in batches on a GPU.
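Combining the closed-form map (5) with the factor updates (7), one descent step on L can be sketched as follows (a CPU prototype using scipy's sqrtm; helper names are mine):

```python
import numpy as np
from scipy.linalg import sqrtm

def transport_map(A, B):
    # optimal transport map of eq. (5):
    # T_AB = A^{-1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2}
    rA = np.real(sqrtm(A))
    irA = np.linalg.inv(rA)
    return irA @ np.real(sqrtm(rA @ B @ rA)) @ irA

def bures_factor_step(L, B, eta):
    # factor update of eq. (7): L' = ((1 - eta) I + eta T_AB) L,
    # one gradient step on L for the loss (1/2) B^2(L L^T, B)
    A = L @ L.T
    T = transport_map(A, B)
    return ((1 - eta) * np.eye(L.shape[0]) + eta * T) @ L
```

With eta = 1 the update jumps to the end of the geodesic: for full-rank A, the new factor L′ satisfies L′L′ᵀ = T_AB A T_AB = B.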
We propose to use Newton-Schulz iterations (Algorithm 1, see [Higham, 2008, Ch. 6]) to approximate these root computations. These iterations produce both a root and an inverse root approximation, and, relying exclusively on matrix-matrix multiplications, stream efficiently on GPUs. Another problem lies in the fact that numerous roots and inverse roots are required to form the map T. To solve this, we exploit an alternative formula for T_AB (proof in the supplementary material):

    T_AB = A^{−1/2} (A^{1/2} B A^{1/2})^{1/2} A^{−1/2} = B^{1/2} (B^{1/2} A B^{1/2})^{−1/2} B^{1/2} .    (8)

In a gradient update, both the loss and the gradient of the metric are needed. In our case, we can use the matrix roots computed during loss evaluation and leverage the identity above to compute on a budget the gradients with respect to either scale matrices A and B. Indeed, a naive computation of ∇_A B²(A, B) and ∇_B B²(A, B) would require the knowledge of 6 roots,

    A^{1/2}, A^{−1/2}, B^{1/2}, B^{−1/2}, (A^{1/2} B A^{1/2})^{1/2}, and (B^{1/2} A B^{1/2})^{1/2} ,

to compute the following transport maps,

    T_AB = A^{−1/2} (A^{1/2} B A^{1/2})^{1/2} A^{−1/2} ,  T_BA = B^{−1/2} (B^{1/2} A B^{1/2})^{1/2} B^{−1/2} ,

namely four matrix roots and two matrix inverse roots. We can avoid computing those six matrices using identity (8) and limit ourselves to two runs of Algorithm 1, to obtain the same quantities from

    {Y₁ := A^{1/2}, Z₁ := A^{−1/2}},  {Y₂ := (A^{1/2} B A^{1/2})^{1/2}, Z₂ := (A^{1/2} B A^{1/2})^{−1/2}},
    T_AB = Z₁ Y₂ Z₁ ,  T_BA = Y₁ Z₂ Y₁ .

Algorithm 1 Newton-Schulz
  Input: PSD matrix A, ε > 0
  Y ← A / ((1 + ε)‖A‖), Z ← I
  while not converged do
    T ← (3I − ZY)/2
    Y ← YT
    Z ← TZ
  end while

When computing the gradients of n × m squared Wasserstein distances W₂²(αᵢ, βⱼ) in parallel, one only needs to run n Newton-Schulz algorithms (in parallel) to compute matrices (Y₁ⁱ, Z₁ⁱ)_{i≤n}, and then n × m Newton-Schulz algorithms to recover cross matrices Y₂^{i,j}, Z₂^{i,j}. On the other hand, using an automatic differentiation framework would require an additional backward computation of the same complexity as the forward pass evaluating the roots and inverse roots, hence requiring roughly twice as many operations per batch.
Avoiding Rank Deficiency at Optimization Time. Although B²(A, B) is defined for rank-deficient matrices A and B, it is not differentiable with respect to these matrices if they are rank deficient. Indeed, as mentioned earlier, this can be compared to the non-differentiability of the Hellinger metric (√x − √y)² when x or y becomes 0, at which point it is no longer differentiable. If Im B ⊄ Im A, which is notably the case if rk B > rk A, then ∇_A B²(A, B) no longer exists. However, even in that case, ∇_B B²(A, B) exists iff Im A ⊂ Im B. Since it would be cumbersome to account for these subtleties in a large-scale optimization setting, we propose to add a small common regularization term to all the factor products considered for our embeddings, and set A_ε = LLᵀ + εI where ε > 0 is a hyperparameter. This ensures that all matrices are full rank, and thus that all gradients exist.
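Algorithm 1 takes a few lines in numpy; a sketch (the stopping rule, here a fixed iteration count, is my assumption):

```python
import numpy as np

def newton_schulz(A, eps=1e-6, n_iter=30):
    # Algorithm 1: coupled Newton-Schulz iterations returning approximations
    # of A^{1/2} and A^{-1/2} using only matrix-matrix products
    d = A.shape[0]
    s = (1 + eps) * np.linalg.norm(A)  # rescale so that the iterations converge
    Y, Z = A / s, np.eye(d)
    for _ in range(n_iter):            # "while not converged" in Algorithm 1
        T = (3 * np.eye(d) - Z @ Y) / 2
        Y, Z = Y @ T, T @ Z
    return np.sqrt(s) * Y, Z / np.sqrt(s)  # square root, inverse square root
```

Since every operation inside the loop is a matrix product, the same recursion can be carried out over stacked batches of matrices, which is what makes the scheme GPU-friendly, as noted above.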
End of Algorithm 1:
  Y ← √((1 + ε)‖A‖) Y, Z ← Z / √((1 + ε)‖A‖)
  Output: square root Y, inverse square root Z

Most importantly, all our derivations still hold with this regularization, and can be shown to leave the method to compute the gradients w.r.t. L unchanged, namely to remain equal to (I − T_{A_ε B}) L.

4 Experiments

We discuss in this section several applications of elliptical embeddings. We first consider a simple mMDS type visualization task, in which elliptical distributions in d = 2 are used to embed isometrically points in high dimension. We argue that for such purposes, a more natural way to visualize ellipses is to use their precision matrices. This is due to the fact that the human eye somewhat acts in the opposite direction to the Bures metric, as discussed in Figure 3. We follow with more advanced experiments in which we consider the task of computing word embeddings on large corpora as a testing ground, and equal or improve on the state-of-the-art.

Figure 3: (left) three points on the plane. (middle) isometric elliptic embedding with the Bures metric: ellipses of a given color have the same respective distances as points on the left.
Although the mechanics of optimal transport indicate that the blue ellipsoid is far from the two others, in agreement with the left plot, the human eye tends to focus on those areas that overlap (below the ellipsoid center) rather than those far away areas (north-east area) that contribute more significantly to the W₂ distance. (right) the precision matrix visualization, obtained by considering ellipses with the same axes but inverted eigenvalues, agrees better with intuition, since it emphasizes that overlap and extension of the ellipses means on the contrary that those axes contribute less to the increase of the metric.

Figure 4: Toy experiment: visualization of a dataset of 10 PISA scores for 35 countries in the OECD. (left) MDS embeddings of these countries on the plane. (right) elliptical embeddings on the plane using the precision visualization discussed in Figure 3. The normalized stress with standard MDS is 0.62. The stress with elliptical embeddings is close to 5e−3 after 1000 gradient iterations, with random initializations for scale matrices (following a standard Wishart with 4 degrees of freedom) and initial means located on the MDS solution.

Visualizing Datasets Using Ellipsoids. Multidimensional scaling [De Leeuw, 1977] aims at embedding points x₁, . . . , xₙ in a finite metric space in a lower dimensional one by minimizing the stress Σ_{ij} (‖xᵢ − xⱼ‖ − ‖yᵢ − yⱼ‖)². In our case, this translates to the minimization of L_MDS(a₁, . . . , aₙ, A₁, . . . , Aₙ) = Σ_{ij} (‖xᵢ − xⱼ‖ − W₂(µ_{aᵢ,Aᵢ}, µ_{aⱼ,Aⱼ}))². This objective can be crudely minimized with a simple gradient descent approach operating on factors as advocated in Section 3, as illustrated in a toy example carried out using data from OECD's PISA study².
Word Embeddings. The skipgram model [Mikolov et al., 2013a] computes word embeddings in a vector space by maximizing the log-probability of observing surrounding context words given an input central word. Vilnis and McCallum [2015] extended this approach to diagonal Gaussian embeddings using an energy whose overall principles we adopt here, adapted to elliptical distributions with full covariance matrices in the 2-Wasserstein space. For every word w, we consider an input (as a word) and an output (as a context) representation as an elliptical measure, denoted respectively µ_w and ν_w, both parameterized by a location vector and a scale parameter (stored in factor form).

²http://pisadataexplorer.oecd.org/ide/idepisa/

Figure 5: Precision matrix visualization of trained embeddings of a set of words on the plane spanned by the two principal eigenvectors of the covariance matrix of "Bach".

Given a set R of positive word/context pairs of words (w, c), and for each input word a set N(w) of n negative context words sampled randomly, we adapt Vilnis and McCallum's loss function to the W₂² distance and minimize the following hinge loss:

    Σ_{(w,c)∈R} [ M − [µ_w : ν_c] + (1/n) Σ_{c′∈N(w)} [µ_w : ν_{c′}] ]₊ ,

where M > 0 is a margin parameter. We train our embeddings on the concatenated ukWaC and WaCkypedia corpora [Baroni et al., 2009], consisting of about 3 billion tokens, on which we keep only the tokens appearing more than 100 times in the text (for a total number of 261583 different words).

Table 1: Results for elliptical embeddings (evaluated using our cosine mixture) compared to diagonal Gaussian embeddings trained with the seomoz package (evaluated using expected likelihood cosine similarity as recommended by Vilnis and McCallum).

    Dataset     W2G/45/C   Ell/12/CM
    SimLex        25.09      24.09
    WordSim       53.45      66.02
    WordSim-R     61.70      71.07
    WordSim-S     48.99      60.58
    MEN           65.16      65.58
    MC            59.48      65.95
    RG            69.77      65.58
    YP            37.18      25.14
    MT-287        61.72      59.53
    MT-771        57.63      56.78
    RW            40.14      29.04
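One term of the hinge loss above can be prototyped directly from the pseudo-dot-product's closed form (a scipy sketch; the function names and the tuple-based signature are mine, not the paper's training code):

```python
import numpy as np
from scipy.linalg import sqrtm

def wb_dot(a, A, b, B):
    # Wasserstein-Bures pseudo-dot-product: <a, b> + Tr (A^{1/2} B A^{1/2})^{1/2}
    rA = np.real(sqrtm(A))
    return float(a @ b + np.trace(np.real(sqrtm(rA @ B @ rA))))

def hinge_loss(word, pos_ctx, neg_ctxs, margin):
    # one summand: [M - [mu_w : nu_c] + (1/n) sum_{c'} [mu_w : nu_{c'}]]_+
    pos = wb_dot(*word, *pos_ctx)
    neg = np.mean([wb_dot(*word, *nc) for nc in neg_ctxs])
    return max(0.0, margin - pos + neg)
```

Each embedding is represented here as a (location, scale) pair; in the paper's setting the scale would be kept in factor form and updated via (7).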
We train our embeddings using adagrad [Duchi et al., 2011], sampling one negative context per positive context; in order to prevent the norms of the embeddings from being too highly correlated with the corresponding word frequencies (see figure in supplementary material), we use two distinct sets of embeddings for the input and context words.
We compare our full elliptical embeddings to diagonal Gaussian embeddings trained using the methods described in [Vilnis and McCallum, 2015] on a collection of similarity datasets, by computing the Spearman rank correlation between the similarity scores provided in the data and the scores we compute based on our embeddings. Note that these results are obtained using context ($\nu_w$) rather than input ($\mu_w$) embeddings. For a fair comparison across methods, we set dimensions by ensuring that the number of free parameters remains the same: because of the symmetry of the covariance matrix, elliptical embeddings in dimension d have $d + d(d+1)/2$ free parameters ($d$ for the means, $d(d+1)/2$ for the covariance matrices), compared with $2d$ for diagonal Gaussians. For elliptical embeddings, we follow the common practice of using some form of normalized quantity (a cosine) rather than the direct dot product. We implement this here by computing the mean of two cosine terms, each corresponding separately to mean and covariance contributions:
$$S_B[\mu_{a,A}, \mu_{b,B}] := \frac{1}{2}\left( \frac{\langle a, b \rangle}{\|a\|\,\|b\|} + \frac{\mathrm{Tr}\,\big(A^{\frac{1}{2}} B A^{\frac{1}{2}}\big)^{\frac{1}{2}}}{\sqrt{\mathrm{Tr}\,A \,\mathrm{Tr}\,B}} \right)$$
Using this similarity measure rather than the Wasserstein-Bures dot product is motivated by the fact that the norms of the embeddings show some dependency on word frequencies (see figures in supplementary material) and become dominant when comparing words with different frequency scales. An alternative could have been obtained by normalizing the Wasserstein-Bures dot product in a more standard way that pools together means and covariances.
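The cosine mixture $S_B$ above can be computed cheaply in factor form: with $A = LL^T$, $\mathrm{Tr}\,A$ is the squared Frobenius norm of $L$, and the Bures numerator is again a nuclear norm. A minimal numpy sketch (the function name `cosine_mixture` is ours):

```python
import numpy as np

def cosine_mixture(a, L, b, M):
    # mean of a cosine on locations and a "Bures cosine" on scale
    # matrices A = L L^T, B = M M^T, both given in factor form:
    # Tr(A^{1/2} B A^{1/2})^{1/2} = nuclear norm of M^T L, Tr A = ||L||_F^2
    cos_mean = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    cos_cov = np.linalg.norm(M.T @ L, ord="nuc") / np.sqrt(
        np.sum(L ** 2) * np.sum(M ** 2))
    return 0.5 * (cos_mean + cos_cov)
```

Both terms lie in $[-1, 1]$ (the covariance term in $[0, 1]$), so the mixture is insensitive to the absolute norms of means and scales, which is the point of using it instead of the raw dot product.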
However, as discussed in the supplementary material, this choice makes it harder to deal with the variations in scale of the means and covariances, therefore decreasing performance.

Table 2: Entailment benchmark: we evaluate our embeddings on the Entailment dataset using average precision (AP) and F1 scores. The threshold for F1 is chosen to be the best at test time.

Model         | F1   | AP
W2G/45/Cosine | 0.74 | 0.70
W2G/45/KL     | 0.74 | 0.72
Ell/12/CM     | 0.73 | 0.70

We also evaluate our embeddings on the Entailment dataset [Baroni et al., 2012], on which we obtain results roughly comparable to those of [Vilnis and McCallum, 2015]. Note that contrary to the similarity experiments, in this framework using the (asymmetric) KL divergence makes sense and possibly gives an advantage, as it is possible to choose the order of the arguments in the KL divergence between the entailing and entailed words.

Hypernymy  In this experiment, we use the framework of [Nickel and Kiela, 2017] on hypernymy relationships to test our embeddings. A word A is said to be a hypernym of a word B if any B is a type of A (e.g., any dog is a type of mammal), which yields a tree-like structure on nouns. The WORDNET dataset [Miller, 1995] features a transitive closure of 743,241 hypernymy relations on 82,115 distinct nouns, which we consider as an undirected graph of relations $\mathcal{R}$.
Similarly to the skipgram model, for each noun u we sample a fixed number n of negative examples, stored in a set N(u), and optimize the following loss:
$$\sum_{(u,v)\in\mathcal{R}} \log \frac{e^{[\mu_u, \mu_v]}}{e^{[\mu_u, \mu_v]} + \sum_{v' \in N(u)} e^{[\mu_u, \mu_{v'}]}}.$$

Figure 6: Reconstruction performance of our embeddings against Poincaré embeddings (reported from [Nickel and Kiela, 2017], as we were not able to reproduce scores comparable to these values), evaluated by mean retrieved rank (lower is better) and MAP (higher is better).

We train the model using SGD, with only one set of embeddings. The embeddings are then evaluated on a link reconstruction task: we embed the full tree, rank the similarity of each positive hypernym pair (u, v) among all negative pairs (u, v'), and compute the mean rank thus achieved as well as the mean average precision (MAP), using the Wasserstein-Bures dot product as the similarity measure. Elliptical embeddings consistently outperform Poincaré embeddings for dimensions above a small threshold, as shown in Figure 6, which confirms our intuition that adding a notion of variance or uncertainty to point embeddings allows for a richer and more meaningful representation of words.

Conclusion  We have proposed to use the space of elliptical distributions endowed with the W2 metric to embed complex objects. This latest iteration of probabilistic embeddings, in which an object is represented as a probability measure rather than a point, can handle elliptical measures (including Gaussians) with arbitrary covariance matrices. Using the W2 metric provides a natural and seamless generalization of point embeddings in $\mathbb{R}^d$. Each embedding is described by a location c and a scale parameter C, the latter being represented in practice by a factor matrix L from which C is recovered as $LL^T$. The visualization part of this work is still subject to open questions.
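Returning to the hypernymy experiment, the negative-sampling objective and the rank used in the reconstruction evaluation can be sketched as follows. This is a minimal numpy sketch with illustrative names (`neg_sampling_log_prob`, `rank_among_negatives`); the scores stand for Wasserstein-Bures dot products:

```python
import numpy as np

def neg_sampling_log_prob(pos_score, neg_scores):
    # log-probability of the positive pair (u, v) against its sampled
    # negatives, i.e. one term of the hypernymy loss above
    s = np.concatenate(([pos_score], np.asarray(neg_scores, dtype=float)))
    s = s - s.max()  # shift for a numerically stable log-sum-exp
    return float(s[0] - np.log(np.exp(s).sum()))

def rank_among_negatives(pos_score, neg_scores):
    # rank of the positive pair among negative pairs (1 = best), as
    # averaged into the mean rank of the reconstruction task
    return 1 + int(np.sum(np.asarray(neg_scores) > pos_score))
```

Maximizing the sum of `neg_sampling_log_prob` terms over positive pairs pushes positive similarities above those of the sampled negatives, which is exactly what the mean rank and MAP then measure.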
One may seek a different method than the one proposed here using precision matrices, and ask whether one can include more advanced constraints on these embeddings, such as inclusions or the presence (or absence) of intersections across ellipses. Handling multimodality using mixtures of Gaussians could be pursued. In that case, a natural upper bound on the W2 distance between two such mixtures can be computed by solving the OT problem between them with a simpler proxy: consider them as discrete measures putting Dirac masses in the space of Gaussians endowed with the W2 metric as a ground cost, and use the optimal cost of that proxy as an upper bound on their Wasserstein distance. Finally, note that an elliptical measure $\mu_{c,C}$ endowed with the Bures metric can also be interpreted, given that $C = LL^T$ with $L \in \mathbb{R}^{d \times k}$, and writing $\tilde{l}_i = l_i - \bar{l}$ for the centered column vectors of $L$, as a discrete point cloud $(c + \sqrt{k}\,\tilde{l}_i)_i$ endowed with a W2 metric that only looks at first and second order moments. These k points, whose mean and covariance matrix match c and C, can therefore fully characterize the geometric properties of the distribution $\mu_{c,C}$, and may provide a simple form of multimodal embedding.

References

P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.

L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows in metric spaces and in the space of probability measures. Springer, 2006.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, September 2009.

Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. Entailment above the word level in distributional semantics.
In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 23–32. ACL, 2012.

Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 2018.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.

Ingwer Borg and Patrick JF Groenen. Modern multidimensional scaling: Theory and applications. Springer Science & Business Media, 2005.

Jean Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal of Mathematics, 52(1):46–52, 1985.

Yann Brenier. Décomposition polaire et réarrangement monotone des champs de vecteurs. CR Acad. Sci. Paris Sér. I Math, 305(19):805–808, 1987.

Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proceedings of the National Academy of Sciences, 103(5):1168–1172, 2006.

Elia Bruni, Nam Khanh Tran, and Marco Baroni. Multimodal distributional semantics. J. Artif. Int. Res., 49(1):1–47, January 2014.

Mihai Bădoiu, Piotr Indyk, and Anastasios Sidiropoulos. Approximation algorithms for embedding general metrics into trees. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 512–521. Society for Industrial and Applied Mathematics, 2007.

Donald Bures. An extension of Kakutani's theorem on infinite product measures to the tensor product of semifinite w*-algebras. Transactions of the American Mathematical Society, 135:199–212, 1969.

Stamatis Cambanis, Steel Huang, and Gordon Simons.
On the theory of elliptically contoured distributions. Journal of Multivariate Analysis, 11(3):368–385, 1981.

Jan De Leeuw. Applications of convex analysis to multidimensional scaling. In Recent Developments in Statistics, 1977.

DC Dowson and BV Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight bound on approximating arbitrary metrics by tree metrics. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 448–455. ACM, 2003.

KT Fang, S Kotz, and KW Ng. Symmetric Multivariate and Related Distributions. Chapman and Hall/CRC, 1990.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: the concept revisited. ACM Trans. Inf. Syst., 20(1):116–131, 2002.

Matthias Gelbrich. On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten, 147(1):185–203, 1990.

Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8(Oct):2265–2295, 2007.

Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.

Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. Large-scale learning of word relatedness with constraints.
In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 1406–1414, New York, NY, USA, 2012. ACM.

Nicholas J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008.

Felix Hill, Roi Reichart, and Anna Korhonen. SimLex-999: Evaluating semantic models with genuine similarity estimation. Comput. Linguist., 41(4):665–695, December 2015.

Geoffrey E Hinton and Sam T Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, pages 857–864, 2003.

Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Tony Jebara, Risi Kondor, and Andrew Howard. Probability product kernels. Journal of Machine Learning Research, 5:819–844, 2004.

William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemp. Math., pages 189–206. Amer. Math. Soc., Providence, RI, 1984.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.

J.M. Lee. Riemannian Manifolds: An Introduction to Curvature. Graduate Texts in Mathematics. Springer New York, 1997.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

Luigi Malagò, Luigi Montrucchio, and Giovanni Pistone. Wasserstein-Riemannian geometry of positive-definite matrices. arXiv preprint arXiv:1801.09269, 2018.

Yariv Maron, Michael Lamar, and Elie Bienenstock. Sphere embedding: An application to part-of-speech induction.
In Advances in Neural Information Processing Systems, pages 1567–1575, 2010.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013a.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013b.

George A. Miller. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41, November 1995.

George A. Miller and Walter G. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991.

Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6341–6350. Curran Associates, Inc., 2017.

Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In Proceedings of the International Conference on Learning Representations, 2014.

Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

Gabriel Peyré and Marco Cuturi. Computational optimal transport. arXiv preprint arXiv:1803.00567, 2018.

Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, pages 337–346, New York, NY, USA, 2011.
ACM.

Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

Herbert Rubenstein and John B. Goodenough. Contextual correlates of synonymy. Commun. ACM, 8(10):627–633, October 1965.

Filippo Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, 2015.

Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, and Martin Jaggi. Context mover's distance & barycenters: Optimal transport of contexts for building representations. arXiv preprint arXiv:1808.09663, 2018.

Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

Minh-Thang Luong, Richard Socher, and Christopher D. Manning. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL), 2013.

Nikolai G Ushakov. Selected topics in characteristic functions. Walter de Gruyter, 1999.

Luke Vilnis and Andrew McCallum. Word representations via Gaussian embedding. In Proceedings of the International Conference on Learning Representations, 2015. arXiv preprint arXiv:1412.6623.

K.Q. Weinberger and L.K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.

Dongqiang Yang and David M. W. Powers. Measuring semantic similarity in the taxonomy of WordNet. In Proceedings of the Twenty-eighth Australasian Conference on Computer Science - Volume 38, ACSC '05, pages 315–322, Darlinghurst, Australia, 2005.
Australian Computer Society, Inc.