{"title": "Comparing Unsupervised Word Translation Methods Step by Step", "book": "Advances in Neural Information Processing Systems", "page_first": 6033, "page_last": 6043, "abstract": "Cross-lingual word vector space alignment is the task of mapping the vocabularies of two languages into a shared semantic space, which can be used for dictionary induction, unsupervised machine translation, and transfer learning. In the unsupervised regime, an initial seed dictionary is learned in the absence of any known correspondences between words, through {\\bf distribution matching}, and the seed dictionary is then used to supervise the induction of the final alignment in what is typically referred to as a (possibly iterative) {\\bf refinement} step. We focus on the first step and compare distribution matching techniques in the context of language pairs for which mixed training stability and evaluation scores have been reported. We show that, surprisingly, when looking at this initial step in isolation, vanilla GANs are superior to more recent methods, both in terms of precision and robustness. 
The improvements reported by more recent methods thus stem from the refinement techniques, and we show that we can obtain state-of-the-art performance combining vanilla GANs with such refinement techniques.", "full_text": "Comparing Unsupervised Word Translation Methods\n\nStep by Step\n\nMareike Hartmann\n\nDepartment of Computer Science\n\nUniversity of Copenhagen\n\nCopenhagen, Denmark\nhartmann@di.ku.dk\n\nYova Kementchedjhieva\n\nDepartment of Computer Science\n\nUniversity of Copenhagen\n\nCopenhagen, Denmark\n\nyova@di.ku.dk\n\nAnders S\u00f8gaard\n\nDepartment of Computer Science\n\nUniversity of Copenhagen\n\nCopenhagen, Denmark\nsoegaard@di.ku.dk\n\nAbstract\n\nCross-lingual word vector space alignment is the task of mapping the vocabularies\nof two languages into a shared semantic space, which can be used for dictionary\ninduction, unsupervised machine translation, and transfer learning. In the unsu-\npervised regime, an initial seed dictionary is learned in the absence of any known\ncorrespondences between words, through distribution matching, and the seed\ndictionary is then used to supervise the induction of the \ufb01nal alignment in what is\ntypically referred to as a (possibly iterative) re\ufb01nement step. We focus on the \ufb01rst\nstep and compare distribution matching techniques in the context of language pairs\nfor which mixed training stability and evaluation scores have been reported. We\nshow that, surprisingly, when looking at this initial step in isolation, vanilla GANs\nare superior to more recent methods, both in terms of precision and robustness.\nThe improvements reported by more recent methods thus stem from the re\ufb01nement\ntechniques, and we show that we can obtain state-of-the-art performance combining\nvanilla GANs with such re\ufb01nement techniques.\n\n1\n\nIntroduction\n\nA word vector space \u2013 sometimes referred to as a word embedding \u2013 associates similar words in a\nvocabulary with similar vectors. 
Learning a projection of one word vector space into another, such\nthat similar words \u2013 across the two word embeddings \u2013 are associated with similar vectors, is useful\nin many contexts, with the most prominent example being the alignment of vocabularies of different\nlanguages, i.e., word translation. This is a key step in machine translation of low-resource languages\n(Lample et al., 2018).\nProjections between word vector spaces have typically been learned from seed dictionaries. In\nseminal papers (Mikolov et al., 2013; Faruqui and Dyer, 2014; Gouws and S\u00f8gaard, 2015), these\nseeds would comprise thousands of words, but Vuli\u0107 and Korhonen (2016) showed that we can learn\nreliable projections from as few as 50 words. Smith et al. (2017) and Hauer et al. (2017) subsequently\nshowed that the seed can be replaced with just words that are identical across languages; and Artetxe\net al. (2017) showed that numerals can also do the job, in some cases; both proposals removing the\nneed for an actual dictionary. Even more recently, entirely unsupervised approaches to projecting\nword vector spaces onto each other have been proposed, which induce seed dictionaries in the absence\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fINITIALIZATION AND OPTIMIZATION STEPS\n\nAuthors | Unsupervised step | Supervised step | Extras\nBarone (2016) | GAN | None |\nZhang et al. (2017) | Wasserstein GAN | Procrustes |\nConneau et al. (2018) | GAN | Procrustes |\nHoshen and Wolf (2018) | ICP | Procrustes | Restarts\nAlvarez-Melis and Jaakkola (2018) | Gromov-Wasserstein | Procrustes |\nArtetxe et al. (2018) | Gromov-Wasserstein | Stochastic |\nYang et al. (2018) | Gromov-Wasserstein | MMD |\nXu et al. (2018) | GAN | Sinkhorn | Back-translation\nGrave et al. (2018) | Gold-Rangarajan | Sinkhorn |\n\nTable 1: Approaches to unsupervised alignment of word vector spaces. 
We break down these\napproaches in two steps (and extras): (1) Unsupervised distribution matching for seed dictionary\nlearning: (W)GANs, ICP, Gromov-Wasserstein initialization, and the convex relaxation proposed in\nGold and Rangarajan (1996). (2) Supervised re\ufb01nement: Procrustes, stochastic dictionary induction,\nmaximum mean discrepancy (MMD), and the Sinkhorn algorithm.\n\nof any known correspondences between words, using distribution matching techniques. These seed\ndictionaries are then used as supervision for alignment algorithms based on, e.g., Procrustes Analysis\n(Sch\u00f6nemann, 1966). These unsupervised systems, in other words, typically combine two steps: an\nunsupervised step of distribution matching and a (possibly iterative) (pseudo-)supervised step of\nre\ufb01nement, based on a seed dictionary learned in the \ufb01rst step. See Table 1 for an overview.\nThe \ufb01rst unsupervised bilingual dictionary induction (UBDI) systems (Barone, 2016; Zhang et al., 2017;\nConneau et al., 2018) were based on Generative Adversarial Networks (GANs) (Goodfellow et al.,\n2014). These approaches learn a linear transformation to minimize the divergence between a target\ndistribution (say French word embeddings) and a source distribution (the English word embeddings\nprojected into the French space). GAN-based approaches achieve impressive results for some\nlanguage pairs (Conneau et al., 2018), but show instabilities for others. In particular, S\u00f8gaard et al.\n(2018) presented results suggesting that GAN-based UBDI is dif\ufb01cult for some language pairs\nexhibiting very different morphosyntactic properties, as well as when the monolingual corpora are\nvery different. Recently, a range of unsupervised approaches that do not rely on GANs have been\nproposed (Artetxe et al., 2018; Hoshen and Wolf, 2018; Grave et al., 2018) in the hope that they would\nprovide a more robust alternative. 
In this paper, we show that none of these is more robust on the\nlanguage pairs we consider. Instead we propose a simple technique for making (vanilla) GAN-based\nUBDI more robust and show that combining this with a recently proposed re\ufb01nement technique \u2013\nstochastic dictionary induction (Artetxe et al., 2018) \u2013 leads to state-of-the-art performance in UBDI.\n\nContributions We present the \ufb01rst systematic comparison of (a subset of) recently proposed\nmethods for UBDI. These methods are two-step pipelines of unsupervised distribution matching for\nseed induction and supervised re\ufb01nement. While the authors typically introduce new approaches to\nboth steps (see Table 1), distribution matching and re\ufb01nement are independent, and in this paper,\nwe focus on the distribution matching step, either omitting re\ufb01nement or using the same\nre\ufb01nement method across the different distribution matching (i.e., seed dictionary induction) methods. On\nthe language pairs considered here, vanilla GANs are superior to the more recent distribution\nmatching techniques. Moreover, we show that using an unsupervised model selection method, we can\noften pick out the best vanilla GAN runs in the absence of cross-lingual supervision. Since vanilla\nGANs thus seem to remain an interesting technique for inducing seed dictionaries, we explore what\ncauses the instability of vanilla GAN seed induction, by looking at how they perform on simple\ntransformations of the embedding spaces, and by using a combination of supervised training and\nmodel interpolation to analyze the loss landscapes. The results lead us to conclude that the instability\nis caused by a mild form of mode collapse that cannot easily be overcome by changes in the number\nof parameters, batch size, and learning rate. 
Nevertheless, vanilla GANs with unsupervised model\nselection seem superior to more recently proposed methods, and we show that when combined with a\nstate-of-the-art re\ufb01nement technique, vanilla GANs with unsupervised model selection are superior to\nthese methods across the board.\n\n2\n\n\f2 GAN-initialized UBDI\n\nIn this section, we discuss the dynamics of GAN-based UBDI. While the idea of using GANs for\nUBDI originates with Barone (2016), we refer to Conneau et al. (2018) as the canonical implementation\nof GAN-based UBDI. Note that GANs are not a necessary component of unsupervised\ndistribution matching for the alignment of vector spaces, albeit a popular approach (Barone, 2016;\nConneau et al., 2018; Zhang et al., 2017). In \u00a73, we brie\ufb02y discuss how GAN-based initialization\ncompares to the alternative of using point set registration techniques (Hoshen and Wolf, 2018) and\nrelated strategies.\nA GAN consists of a generator and a discriminator (Goodfellow et al., 2014). The generator G is\ntrained to fool the discriminator D. The generator can be any differentiable function; in Conneau et al.\n(2018), it is a linear transform \u2126. Let e \u2208 E be an English word vector, and f \u2208 F a French word\nvector, both of dimensionality d. The goal of the generator is then to choose \u2126 \u2208 Rd\u00d7d such that \u2126E\nhas a distribution close to F . The discriminator is a map Dw : X \u2192 {0, 1}, implemented in Conneau\net al. (2018) as a multi-layered perceptron. The objective of the discriminator is to discriminate\nbetween vector spaces F and \u2126E. During training, the model parameters \u2126 and w are optimized\nusing stochastic gradient descent by alternately updating the parameters of the discriminator based\non the gradient of the discriminator loss and the parameters of the generator based on the gradient of\nthe generator loss, which, by de\ufb01nition, is the inverse of the discriminator loss. 
The loss function\nused in Conneau et al. (2018) and in our experiments below is cross-entropy. In each iteration, we\nsample N vectors e \u2208 E and N vectors f \u2208 F and update the discriminator parameters w according\nto w \u2192 w + \u03b1 \u2211N\ni=1 \u2207[log Dw(fi) + log(1 \u2212 Dw(G\u2126(ei)))].\nTheoretically, the optimal parameters are a solution to the min-max problem\nmin\u2126 maxw E[log(Dw(F )) + log(1 \u2212 Dw(G\u2126(E)))], which reduces to min\u2126 JS(PF \u2016 P\u2126E). If a\ngenerator wins the game against an ideal discriminator on a very large number of samples, then F\nand \u2126E can be shown to be close in Jensen-Shannon divergence, and thus the model has learned the\ntrue data distribution. This result, referring to the distribution of the data, pdata, and the distribution,\npg, that G is sampling from, is from Goodfellow et al. (2014): If G and D have enough capacity, and if at\neach step of training the discriminator is allowed to reach its optimum given G, and pg is updated so\nas to improve the criterion Ex\u223cpdata [log D\u2217G(x)], then pg converges to pdata. This result relies on a\nnumber of assumptions that do not hold in practice. The generator in Conneau et al. (2018), which\nlearns a linear transform \u2126, has very limited capacity, for example, and we are updating \u2126 rather\nthan pg. In practice, therefore, during training, Conneau et al. (2018) alternate between k steps of\noptimizing the discriminator and one step of optimizing the generator. Another common problem\nwith training GANs is that the discriminator loss quickly drops to zero, when there is no overlap\nbetween pg and pdata (Arjovsky et al., 2017); but note that in our case, the discriminator is initially\npresented with IE and F , for which there is typically no trivial solution, since the embedding\nspaces are likely to overlap. 
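The alternating updates described above can be sketched in a few lines of NumPy. This is a deliberately minimal illustration, not the MUSE implementation: the discriminator here is plain logistic regression rather than the multi-layered perceptron of Conneau et al. (2018), and all names and hyper-parameter values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def gan_align(E, F, steps=200, k=5, lr=0.1, batch=32, seed=0):
    """Minimal GAN for distribution matching between point clouds E and F.

    Generator:      a linear map Omega (d x d), applied as E @ Omega.T.
    Discriminator:  logistic regression D(x) = sigmoid(x @ w + b), trained
                    to output 1 on F ("real") and 0 on Omega E ("fake").
    k discriminator updates are taken per generator update.
    """
    rng = np.random.default_rng(seed)
    d = E.shape[1]
    Omega = np.eye(d)              # generator initialized to the identity
    w, b = np.zeros(d), 0.0        # discriminator parameters
    for _ in range(steps):
        for _ in range(k):         # discriminator: ascend the log-likelihood
            f = F[rng.integers(0, len(F), batch)]
            g = E[rng.integers(0, len(E), batch)] @ Omega.T
            X = np.vstack([f, g])
            y = np.concatenate([np.ones(batch), np.zeros(batch)])
            p = sigmoid(X @ w + b)
            w += lr * X.T @ (y - p) / len(y)
            b += lr * float(np.mean(y - p))
        # generator: move Omega so that the fakes look real to D
        e = E[rng.integers(0, len(E), batch)]
        p = sigmoid((e @ Omega.T) @ w + b)
        Omega += lr * np.outer(w, np.mean((1.0 - p)[:, None] * e, axis=0))
    return Omega
```

On toy data where F is an exact rotation of E, a run of gan_align typically moves Omega toward that rotation, although, as discussed in the remainder of the paper, convergence is not guaranteed.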
We show in \u00a74 that the discriminator and generator losses are poor\nmodel selection criteria, however; instead we propose a simple criterion based on cosine similarities\nbetween nearest neighbors in the learned alignment.\nFrom \u2126E and F , a seed (bilingual) dictionary can be extracted using nearest neighbor queries, i.e.,\nby asking for the nearest neighbor of \u2126E in F , or vice versa. Conneau et al. (2018) use a normalized\nnearest neighbor retrieval method to reduce the in\ufb02uence of hubs (Radovanovi\u0107 et al., 2010; Dinu et al.,\n2015). The method is called cross-domain similarity local scaling (CSLS) and is used to expand high-density\nareas and condense low-density ones. The mean similarity of a source language embedding\n\u2126e to its k nearest neighbors f1, . . . , fk in the target language is de\ufb01ned as \u00b5E(\u2126(e)) = (1/k) \u2211k\ni=1 cos(\u2126(e), fi),\nwhere cos is the cosine similarity. \u00b5F (fi) is de\ufb01ned in an analogous manner for every i. CSLS(e, fi)\nis then calculated as 2 cos(\u2126(e), fi) \u2212 \u00b5E(\u2126(e)) \u2212 \u00b5F (fi). Conneau et al. (2018) use an unsupervised\nvalidation criterion based on CSLS. The translations of the top k (10,000) most frequent words in the\nsource language are obtained with CSLS and average pairwise cosine similarity is computed over\nthem. This metric is considered indicative of the closeness between the projected source space and\nthe target space, and is found to correlate well with supervised evaluation metrics. After inducing a\nbilingual dictionary, Ed and Fd, by querying \u2126E and F with CSLS, Conneau et al. (2018) perform a\nre\ufb01nement step based on Procrustes Analysis (Sch\u00f6nemann, 1966). 
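The CSLS rescoring described above amounts to a simple transformation of the cosine-similarity matrix. The following NumPy sketch (the function names are ours, and the implementation is deliberately simplified relative to MUSE) computes CSLS scores for all source-target pairs:

```python
import numpy as np

def cosine_matrix(X, Y):
    """Pairwise cosine similarities between the rows of X and the rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def csls_scores(S, k=10):
    """CSLS(e, f) = 2 cos(e, f) - mu_E(e) - mu_F(f), where mu_E(e) is the
    mean similarity of source word e to its k nearest target neighbors and
    mu_F(f) is the analogue for target word f in the source direction.
    S is the cosine-similarity matrix of the (mapped) source vs. the target."""
    mu_src = np.sort(S, axis=1)[:, -k:].mean(axis=1)  # one value per source word
    mu_tgt = np.sort(S, axis=0)[-k:, :].mean(axis=0)  # one value per target word
    return 2 * S - mu_src[:, None] - mu_tgt[None, :]
```

A seed dictionary is then read off with csls_scores(S).argmax(axis=1); hubs, which have high mean similarity to everything, are penalized through their large mu_tgt term.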
Here, the optimal mapping \u2126\nthat maps the words in the seed dictionary onto each other is computed analytically as \u2126 = U V^T,\nwhere U and V are obtained via the singular value decomposition U\u03a3V^T of Fd^T Ed.\n\n3\n\n\f3 Alternatives to GAN-initialized UBDI\n\nThis section introduces some recent alternatives to (vanilla) GAN-initialized UBDI. In Table 1, we\nlist more approaches and classify them by how they perform unsupervised distribution matching and\nsupervised re\ufb01nement.\n\nIterative closest point\nThe idea of minimizing nearest neighbor distances for unsupervised model\nselection is also found in point set registration and lies at the core of iterative closest point (ICP)\noptimization (Besl and McKay, 1992). ICP typically minimizes the L2 distance (mean squared error)\nbetween nearest neighbor pairs. The ICP optimization algorithm works by assigning each transformed\nvector to its nearest neighbor and then computing the new relative transformation that minimizes\nthe cost function with respect to this assignment. ICP can be shown to converge to local optima\n(Besl and McKay, 1992), in polynomial time (Ezra et al., 2006). ICP easily gets trapped in local\noptima, however; exact algorithms exist only for two- and three-dimensional point set registration,\nand these algorithms are slow (Yang et al., 2016). Generally, it holds that the optimal solution to\nthe GAN min-max problem is also optimal for ICP. To see this, note that a GAN minimizes the\nJensen-Shannon divergence between F and \u2126E. The optimal solution to this is F = \u2126E. As sample\nsize goes to in\ufb01nity, this means the L2 loss in ICP goes to 0. In other words, the ICP loss is minimal\nif an optimal solution to the UBDI min-max problem is found. ICP was independently proposed for\nUBDI in Hoshen and Wolf (2018). They report their method only works using PCA initialization,\ni.e. 
they project a subset of both sets of word embeddings onto the \ufb01rst 50 principal components,\nand learn an initial seed dictionary using ICP on the lower-dimensional embeddings. This seed\nmapping is then used as the starting point for ICP on the full word embeddings. We explored PCA\ninitialization for GAN-based distribution matching, but observed the opposite effect, namely that\nPCA initialization leads to a degradation in performance. The most important thing to note from\nHoshen and Wolf (2018), however, is that they do 500 random restarts of the PCA initialization to\nobtain robust performance; ICP, in other words, is extremely sensitive to initialization. This explains\ntheir poor performance under our experimental protocol below (Table 2).\n\nWasserstein GAN Zhang et al. (2017) were the \ufb01rst to introduce Wasserstein GANs as a way to\nlearn seed dictionaries in the context of UBDI. In their best system, they train simple Wasserstein\nGANs and use the resulting seed dictionaries to supervise Procrustes Analysis. We modi\ufb01ed the\nMUSE code to experiment with Wasserstein GANs in a controlled way. Simple Wasserstein GANs\nwere unsuccessful, but with gradient penalty (Gulrajani et al., 2017), we obtained almost competitive\nresults, after tuning the learning rate and the gradient penalty \u03bb using nearest neighbor cosine distance\nas the validation criterion. On the other hand, the results were not signi\ufb01cantly better, and instability did\nnot improve. Finally, we experimented with CT-GANs (Wei et al., 2018), an extension of Wasserstein\nGANs with gradient penalty, but this only lowered performance and increased instability. Since\nWasserstein GANs and CT-GANs were consistently worse and less stable than vanilla GANs, we do\nnot include them in the experiments below.\n\nGromov-Wasserstein Alvarez-Melis and Jaakkola (2018) present a very different initialization\nstrategy. 
In brief, Alvarez-Melis and Jaakkola (2018) learn a linear transformation to minimize\nGromov-Wasserstein distances of distances between nearest neighbors, in the absence of cross-lingual\nsupervision. We report the performance of their system in the experiments below, but results\n(Table 2) were all negative. We think the reason is that Alvarez-Melis and Jaakkola (2018) only\nconsider small subsamples of the vector spaces, and that in hard cases, alignments induced on\nsubspaces are unlikely to scale. It achieved an impressive P@1 of 85.6 on the Greek MUSE dataset\n(Conneau et al. (2018) obtain 59.5), but on the datasets considered here, where Conneau et al. (2018)\nis unstable, it consistently fails to align the vector spaces.\nArtetxe et al. (2018) introduce a very simple, related initialization method that is also based on\nGromov-Wasserstein distances of distances between nearest neighbors: They use these second-order\ndistances to build a seed dictionary directly by aligning nearest neighbors across languages. By itself,\nthis is a poor initialization method (see Table 2). Artetxe et al. (2018), however, combine this with a\nnew re\ufb01nement method called stochastic dictionary induction, i.e., randomly dropping out dimensions\nof the similarity matrix when extracting a seed dictionary for the next iteration of Procrustes Analysis.\nArtetxe et al. (2018) show in an ablation study for one language pair (English-Finnish) that the\ninitialization method only works in combination with the stochastic dictionary induction step, i.e.,\nwithout the application of stochasticity, the induced mapping is degenerate.\n\n4\n\n\fTO ENGLISH (each cell: max P@1 / fails out of 10 runs)\n\n| et | fa | \ufb01 | lv | tr | vi | av\nNO REFINEMENT\nConneau et al. (2018) GAN | 6.4 / 9 | 22.5 / 3 | 28.5 / 1 | 14.3 / 9 | 32.1 / 2 | 2.4 / 9 | 17.7 / 5.5\nHoshen and Wolf (2018) ICP | 0.1 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10\nArtetxe et al. (2018) GW | 0 / 10 | 0.1 / 10 | 0.1 / 10 | 0.1 / 10 | 0.1 / 10 | 0.1 / 10 | 0.1 / 10\nAlvarez-Melis and Jaakkola (2018) GW | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10\nWITH PROCRUSTES REFINEMENT\nConneau et al. (2018) GAN | 27.5 / 9 | 40.9 / 3 | 58.9 / 1 | 33.2 / 9 | 60.6 / 2 | 51.3 / 9 | 45.4 / 5.5\nHoshen and Wolf (2018) ICP | 0.1 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10\nArtetxe et al. (2018) GW | 1.1 / 10 | 40.2 / 0 | 60.5 / 0 | 0.1 / 10 | 59.6 / 0 | 0.3 / 10 | 27.0 / 5\nAlvarez-Melis and Jaakkola (2018) GW | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10 | 0 / 10\n\nTable 2: Comparisons of unsupervised seed dictionary learning strategies in the absence of re\ufb01nement\n(upper half) or using the same re\ufb01nement technique (orthogonal Procrustes) (lower half). For results\nwith re\ufb01nement, we use GANs, ICPs, and Gromov-Wasserstein (GW) distribution matching and\nfeed seed dictionaries to Procrustes re\ufb01nement. We then report maximum performance (P@1) and\nstability (fails) across 10 runs. We consider a P@1 score below 2% a failure. The results suggest that\nGANs, in spite of their instability, have the highest potential for inducing useful seed dictionaries.\n\nIn our experiments below,\nwe show that this \ufb01nding generalizes to other language pairs, suggesting that the stochastic dictionary\ninduction is the main contribution in their work. 
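The refinement loop just described (iterated Procrustes Analysis with stochastic dictionary induction) can be sketched as follows. This is a simplified NumPy illustration based on the description above, not the VecMap implementation; in particular, the dropout rate is fixed here, whereas the schedule and retrieval details in Artetxe et al. (2018) are more involved, and the function names are ours.

```python
import numpy as np

def procrustes(E_d, F_d):
    """Closed-form orthogonal map Omega = U V^T, with U and V from the
    singular value decomposition of F_d^T E_d (Schoenemann, 1966)."""
    U, _, Vt = np.linalg.svd(F_d.T @ E_d)
    return U @ Vt

def stochastic_refinement(E, F, src, tgt, iters=10, keep=0.9, seed=0):
    """Iterated Procrustes with stochastic dictionary induction.

    Each iteration fits Omega on the current seed dictionary (src[i] in E
    translates to tgt[i] in F), recomputes all similarities, randomly drops
    entries of the similarity matrix (each kept with probability `keep`),
    and re-extracts the dictionary by nearest neighbor over the survivors.
    """
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        Omega = procrustes(E[src], F[tgt])
        S = (E @ Omega.T) @ F.T                  # similarity of mapped E to F
        S = np.where(rng.random(S.shape) < keep, S, -np.inf)
        src = np.arange(len(E))                  # new dictionary: every source word...
        tgt = S.argmax(axis=1)                   # ...with its surviving nearest neighbor
    # final pass without dropout
    S = (E @ procrustes(E[src], F[tgt]).T) @ F.T
    return procrustes(E, F[S.argmax(axis=1)])
```

On synthetic data where F is an orthogonal transformation of E, a small correct seed dictionary suffices for this loop to recover the transformation; the dropout forces the dictionary to keep changing across iterations, which is what prevents the degenerate fixed points discussed above.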
We show that when combined with vanilla GANs\nfor the initial step of learning a seed dictionary through distribution matching, stochastic dictionary\ninduction performs even better.\n\nConvex Relaxation The Gold-Rangarajan relaxation is a convex relaxation of the (NP-hard) graph\nmatching problem and can be solved using the Frank-Wolfe algorithm. Once the minimal optimizer\nis computed, an initial transformation is obtained using singular-value decomposition. The Gold-Rangarajan\nrelaxation can thus be used for stable learning of seed dictionaries (Grave et al., 2018).\nIt remains an open question how this strategy fares on challenging language pairs such as the ones\nincluded here. We would have liked to include this approach in our experiments, but the code was not\npublicly available at the time of writing.\n\nProperties of Unsupervised Alignment Algorithms The above approaches provably work if the\ntwo vector spaces to be aligned are isomorphic, except for the pathological case where the vectors are\nplaced on an equidistant grid forming a sphere.1 A function \u2126 from E to F is a linear transformation\nif \u2126(f + g) = \u2126(f ) + \u2126(g) and \u2126(kf ) = k\u2126(f ) for all elements f, g of E, and for all scalars k.\nAn invertible linear transformation is called an isomorphism. The two vector spaces E and F are\ncalled isomorphic if there is an isomorphism from E to F . Equivalently, if the kernel of a linear\ntransformation between two vector spaces of the same dimensionality contains only the zero vector, it\nis invertible and hence an isomorphism. Most work on supervised or unsupervised alignment of word\nvector spaces relies on the assumption that they are approximately isomorphic, i.e., isomorphic after\nremoving a small set of vertices (Mikolov et al., 2013; Barone, 2016; Zhang et al., 2017; Conneau\net al., 2018). It is not dif\ufb01cult to show that many pairs of vector spaces are not approximately\nisomorphic, however. 
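The kernel criterion above is easy to check numerically: a square matrix represents an isomorphism exactly when none of its singular values is (numerically) zero. A small illustrative sketch, with a function name of our own choosing:

```python
import numpy as np

def is_isomorphism(Omega, tol=1e-10):
    """True iff the linear map Omega has a trivial kernel, i.e. iff its
    smallest singular value is bounded away from zero."""
    s = np.linalg.svd(Omega, compute_uv=False)
    return bool(s.min() > tol)
```

For example, the identity map np.eye(300) passes this check, while a rank-deficient map such as np.ones((3, 3)) is rejected.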
See S\u00f8gaard et al. (2018) for examples.\n\n1In this case, there is an in\ufb01nite set of equally good linear transformations (rotations) that achieve the same\ntraining loss. Similarly for two binary-valued, n-dimensional vector spaces with one vector in each possible\nposition. Here the number of local optima would be 2^n, but since the loss is the same in each of them, the loss\nlandscape is highly non-convex, and the basin of convergence is therefore very small (Yang et al., 2016). The\nchance of aligning the two spaces using gradient descent optimization would be 1/2^n. In other words, minimizing\nthe Jensen-Shannon divergence between the word vector distributions, even in the easy case, is not always\nguaranteed to uncover an alignment between translation equivalents. From the above, it follows that alignments\nbetween linearly alignable vector spaces cannot always be learned using UBDI methods. In \u00a73.1, we test for\napproximate isomorphism to decide whether two vector spaces are linearly alignable. \u00a73.2\u20133.3 are devoted to\nanalyzing when alignments between linearly alignable vector spaces can be learned.\n\n5\n\n\f4 Experiments\n\nIn our experiments, we focus on aligning word vector spaces between two languages, by projecting\nfrom the foreign language into English. Our languages are: Estonian (et), Farsi (fa), Finnish (\ufb01),\nLatvian (lv), Turkish (tr), and Vietnamese (vi). This selection of languages is motivated by observed\ninstability when training vanilla GANs, e.g., S\u00f8gaard et al. (2018). 
In addition, the languages span\nfour language families: Finno-Ugric (et, \ufb01), Indo-European (fa, lv), Turkic (tr), and Austroasiatic (vi).\n\nData\nIn all our experiments, we use pretrained FastText embeddings (Bojanowski et al., 2017) and\nthe bilingual test dictionaries released along with the MUSE system.2 The FastText embeddings are\ntrained on Wikipedia dumps3; the bilingual dictionaries were created using an in-house Facebook\ntranslation tool and contain translations for 1500 test words for each language pair. Since we cannot\ndo reliable hyper-parameter optimization in the absence of cross-lingual supervision, we use MUSE\nwith the default parameters (Conneau et al., 2018). For the experiments with stochastic dictionary\ninduction (Table 3), we use the implementation in the VecMap framework (Artetxe et al., 2018).4\n\n4.1 Comparison of distribution matching strategies\n\nOur main experiments, reported in Table 2, compare the following initialization strategies:\nvanilla GANs, the two varieties of Gromov-Wasserstein (see \u00a73), and ICP.5 Table 2 is split in two:\nfirst we report the performance, measured as precision at one, in the absence of re\ufb01nement, and\nthen we report the performance with re\ufb01nement, using the same re\ufb01nement technique (Procrustes\nAnalysis) across the board. For all the randomly initialized algorithms (the \ufb01rst three), we report the\nbest of 10 runs and the number of fails, where fails are runs with scores lower than 2%.6 The reported\nscores are P@1, i.e., the fraction of words whose nearest neighbors are translation equivalents.\nWe believe it is crucial to evaluate the different techniques this way, instead of simply comparing the\nnumbers reported in the relevant papers: First of all, no three of these authors report performance\non the same datasets. 
Secondly, if the authors use different re\ufb01nement techniques, it is impossible\nto see the impact of the initialization strategies in the reported numbers. Instead we control for the\nre\ufb01nement techniques and study the distribution matching techniques in Table 1 in isolation. This\nmeans, for example, that we evaluate Artetxe et al. (2018) in the absence of stochastic dictionary\ninduction, and Hoshen and Wolf (2018) in the absence of 500 random restarts. In \u00a74.2 (Table 3), we\ncompare vanilla GANs and Gromov-Wasserstein in the context of stochastic dictionary induction.\nThe patterns in Table 2 are very consistent. Vanilla GAN distribution matching is very unstable,\nwith 1/10 fails for Finnish and Turkish, but 6, 7 and 9 fails for Estonian, Latvian, and Vietnamese,\nrespectively. All other methods are even more unstable, however, with the distribution matching techniques\nin Hoshen and Wolf (2018) and Alvarez-Melis and Jaakkola (2018) failing across the board, with or\nwithout supervised Procrustes re\ufb01nement. Vanilla GAN distribution matching also leads to higher\nprecision for 5/6 language pairs.\nVanilla GAN distribution matching thus seems to have the highest potential for inducing useful seed\ndictionaries among all these methods. If we could only manage their instability, GANs seem to\nprovide us with a better point of departure. 
This naturally leads us to ask: Is it feasible to select good\nvanilla GAN UBDI runs from a batch of random restarts, in the absence of cross-lingual supervision?\nThis question is explored in \u00a74.2, in which we also explore whether state-of-the-art performance\ncan be achieved with vanilla GANs and a more advanced re\ufb01nement technique, namely stochastic\ndictionary induction.\n\n2https://github.com/facebookresearch/MUSE\n3https://fasttext.cc/docs/en/pretrained-vectors.html\n4https://github.com/artetxem/vecmap\n5We ignore Wasserstein GANs, which proved more unstable than vanilla GANs in our preliminary experiments,\nas well as Gold-Rangarajan, which performs considerably below current state of the art.\n6In practice, performance tends to be much higher than 2% for successful runs, hence slight changes in the\nthreshold value would not affect results.\n\n6\n\n\f| PROCRUSTES | STOCHASTIC DICTIONARY INDUCTION\n| C-MUSE | C-MUSE | Artetxe et al. (2018)\net-en | 27.5 | 47.6 | 47.6\nfa-en | 40.9 | 41.5 | 40.2\n\ufb01-en | 58.9 | 62.5 | 63.6\nlv-en | 33.2 | 44.1 | 41.6\ntr-en | 60.6 | 62.8 | 60.6\nvi-en | 51.3 | 54.3 | 0.3\naverage | 45.4 | 52.1 | 42.3\n\nTable 3: Comparison of MUSE with cosine-based model selection over 10 random restarts (C-MUSE)\nwith and without stochastic dictionary induction (with suggested hyper-parameters from Artetxe\net al. (2018)), against state of the art. Using vanilla GANs is better than Gromov-Wasserstein on\naverage and better on 4/6 language pairs.\n\n4.2 GAN distribution matching with random restarts\n\nExploring this question, we found that the discriminator loss during training, which is used as a\nmodel selection criterion in Daskalakis et al. (2018), is a poor selection criterion. However, we did\n\ufb01nd another unsupervised model selection criterion that correlates well with UBDI performance:\ncosine similarity of (induced) nearest neighbors. This criterion is also used as a stopping criterion in\nConneau et al. 
(2018), and can be used with or without CSLS scaling. This stopping criterion in fact\nturns out to be quite a robust model selection criterion for picking the best out of n random restarts.\nIn Table 3, we compare MUSE with 10 random restarts and using CSLS cosine similarity of nearest\nneighbors as an unsupervised model selection criterion, to the full state-of-the-art model in Artetxe\net al. (2018) with stochastic dictionary induction. What we see in these results is that Artetxe et al.\n(2018) is still superior to MUSE with random restarts, but even with 10 restarts, the gap narrows\nconsiderably, and MUSE is better on 2/6 languages. Note, however, that this is a comparison of\ntwo systems using two different re\ufb01nement techniques. If we combine vanilla GAN distribution\nmatching from MUSE with the stochastic dictionary induction technique from Artetxe et al. (2018),\nwe obtain slightly better performance than Artetxe et al. (2018) (Table 3, middle column): while overall\nimprovements are small, compared to the differences in seed dictionary quality, the combination of\nvanilla GANs for distribution matching and stochastic dictionary induction provides a promising and\nfully competitive alternative to the state of the art for unsupervised word translation.\n\n4.3 Discussion and Further Experiments\n\nWe have shown that while vanilla GANs are unstable, they carry a seemingly unique potential for\nUBDI. We have shown that a simple unsupervised cosine-based model selection criterion can achieve\nrobust state-of-the-art performance. We have performed several other experiments to probe this\ninstability in search of ways to stabilize vanilla GANs without signi\ufb01cant performance drops. This\nsubsection summarizes these experiments.\n\nNormalization We observed that GAN-based UBDI becomes more unstable and performance\ndeteriorates with unit length normalization. 
We performed unit length normalization (ULN) of all vectors x, i.e., x' = x / ||x||_2, which is often used in supervised bilingual dictionary induction (Xing et al., 2015; Artetxe et al., 2017). We used this transform to project word vectors onto a sphere, in order to control for shape information. If vectors are distributed smoothly over two spheres, there is no way to learn an alignment in the absence of a seed dictionary; in other words, if vanilla GAN distribution matching is unaffected by this transform, vanilla GANs learn from density information alone. While supervised methods are insensitive to or benefit from ULN, we find that vanilla GANs are very sensitive to such normalization; in fact, the number of failed runs over six languages increases from below 50% to 90%. For example, while for Finnish MUSE fails in only 1/10 runs, MUSE with ULN failed across the board; for Farsi, MUSE with ULN failed in 6/10 runs, compared to 3/10 without. We verify that supervised alignment is not affected by ULN by running Procrustes refinement with a seed dictionary as supervision; here, performance remains unchanged under this transformation.

Figure 1: Discriminator loss averaged over all training data points (green), P@1 on the test data points (blue), and mean cosine similarity (red) on the training data, for generator parameters on the line segment that connects the unsupervised GAN solution with the supervised Procrustes Analysis solution. α is the interpolation parameter moving the generator parameters from the unsupervised GAN solution (α = 0) to the supervised solution (α = 1).

Noise injection On the contrary, GAN-based UBDI is largely unaffected by noise injection. We saw this from running experiments on a few languages, but do not report performance across the board. Specifically, we add 25% random vectors, randomly sampled from a hypercube bounding the vector set.
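As a sketch of this setup (the helper name and the 25% default are ours, matching the fraction above), sampling noise uniformly from the axis-aligned bounding hypercube of the embeddings might look like:

```python
import numpy as np

def inject_noise(X, fraction=0.25, seed=0):
    """Append fraction * n random vectors drawn uniformly from the
    axis-aligned hypercube bounding the rows of X (an n x d matrix)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)   # per-dimension bounds
    n_noise = int(fraction * len(X))
    noise = rng.uniform(lo, hi, size=(n_noise, X.shape[1]))
    return np.vstack([X, noise])

# toy vector set: 100 vectors in 8 dimensions, plus 25 noise vectors
X = np.random.default_rng(1).normal(size=(100, 8))
X_noisy = inject_noise(X, fraction=0.25)
```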
GAN-based UBDI results are not affected: this, we found, is because the injected vectors rarely end up in the seed dictionaries used for subsequent refinement.

Over-parameterization GAN training is unstable because discriminators end up in poor local optima or saddle points (see below). A known technique for escaping local optima is over-parameterization (Brutzkus et al., 2018). We experimented with widening our discriminators to smooth the loss landscape. Results were mixed, with more stability and better performance on some languages, and less stability and worse performance on others. We provide the full list of results in the Appendix.

Large batches and small learning rates Previous work has shown that a large learning rate and a small batch size contribute towards SGD finding flatter minima (Jastrzebski et al., 2018), but in our experiments, we are interested in the discriminator not ending up in flat regions, where there is no signal to update the generator. We therefore experiment with smaller (as well as higher) learning rates and larger (as well as smaller) batch sizes. The motivation behind smaller learning rates and larger batches is to decrease the scale of random fluctuations in the SGD dynamics (Smith and Le, 2017; Balles et al., 2017), enabling the discriminator to explore narrower regions in the loss landscape. Increasing the batch size or varying the learning rate (up or down), however, leads to worse performance, and it seems the MUSE default hyperparameters are close to optimal. We provide the full list of results in the Appendix.

Exploring the loss landscapes GAN training instability arises from discriminators getting stuck in saddle points, where neither the discriminator nor the generator has a learning signal. To show this, we analyze the discriminator loss in areas of convergence by plotting it as a function of the generator parameters.
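The interpolation underlying such a plot is linear in parameter space. A minimal sketch (names and toy matrices are ours), assuming the two solutions are given as generator weight matrices:

```python
import numpy as np

def interpolate_params(theta_gan, theta_sup, alphas):
    """Generator parameters on the line segment from the unsupervised
    GAN solution (alpha = 0) to the supervised Procrustes solution
    (alpha = 1): theta(alpha) = (1 - alpha) * theta_gan + alpha * theta_sup."""
    return [(1.0 - a) * theta_gan + a * theta_sup for a in alphas]

# toy 2x2 mapping matrices standing in for the two solutions
W_gan = np.eye(2)
W_sup = np.array([[0.0, -1.0], [1.0, 0.0]])  # a 90-degree rotation
points = interpolate_params(W_gan, W_sup, np.linspace(0.0, 1.0, 5))
```

The discriminator loss would then be evaluated at each interpolated parameter setting to trace the loss curve.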
Specifically, we plot the loss surface along its intersection with a line segment connecting two sets of parameters (Goodfellow et al., 2015; Li et al., 2018). In our case, we interpolate between the model induced by GAN-based UBDI and the (oracle) model obtained using supervised Procrustes Analysis. Results are shown in Figure 1. The green loss curves represent the current discriminator's loss along all the generators between the current generator and the generator found by Procrustes refinement. We see that while performance (P@1 and mean cosine similarity) goes up as soon as we move closer toward the supervised solution, the discriminator loss does not change until we get very close to this solution, suggesting there is no learning signal in this direction for GAN-based UBDI.

This is along a line segment representing the shortest path from the failed generator to the oracle generator, of course; linear interpolation provides no guarantee that there are no almost-as-short paths with plenty of signal. A more sophisticated sampling method is to sample along two random direction vectors (Goodfellow et al., 2015; Li et al., 2018). We used an alternative strategy of sampling from normal distributions with fixed variance that were orthogonal to the line segment. We observed the same pattern, leading us to conclude that instability is caused by discriminator saddle points.

5 Conclusions

This paper explores the dynamics of (vanilla) GAN training in the context of unsupervised word translation and presents a systematic comparison of GANs with different distribution matching (seed induction) methods across six challenging language pairs.
Our main finding is that vanilla GANs, in spite of their instability, have the highest potential for inducing useful seed dictionaries. We explore an unsupervised model selection criterion for selecting the best models from multiple random restarts, narrowing the gap between MUSE and Artetxe et al. (2018), and further show that combining GANs with stochastic dictionary induction provides a new state of the art for unsupervised word translation.

Acknowledgements

We thank the anonymous reviewers for their comments and suggestions. Mareike Hartmann was supported by the Carlsberg Foundation. Anders Søgaard was supported by a Google Focused Research Award.

References

David Alvarez-Melis and Tommi Jaakkola. 2018. Gromov-Wasserstein alignment of word embedding spaces. In EMNLP.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. CoRR abs/1701.07875.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of ACL. pages 451-462. https://doi.org/10.18653/v1/P17-1042.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In ACL.

Lukas Balles, Javier Romero, and Philipp Hennig. 2017. Coupling adaptive batch sizes with learning rates. In Proceedings of UAI.

Antonio Valerio Miceli Barone. 2016. Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. In Proceedings of the 1st Workshop on Representation Learning for NLP. pages 121-126. http://arxiv.org/pdf/1608.02996.pdf.

Paul Besl and Neil McKay. 1992. A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2).

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information.
Transactions of the ACL 5:135-146.

Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. 2018. SGD learns over-parameterized networks that provably generalize on linearly separable data. In Proceedings of ICLR.

Alexis Conneau, Guillaume Lample, Marc Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In ICLR.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. 2018. Training GANs with optimism. In ICLR.

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2015. Improving zero-shot learning by mitigating the hubness problem. In Proceedings of ICLR (Workshop Papers). http://arxiv.org/abs/1412.6568.

Ester Ezra, Micha Sharir, and Alon Efrat. 2006. On the ICP algorithm. In SGC.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL. pages 462-471. http://repository.cmu.edu/lti/31.

S. Gold and A. Rangarajan. 1996. A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 18:377-388.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial networks. In Proceedings of NIPS.

Ian J. Goodfellow, Oriol Vinyals, and Andrew Saxe. 2015. Qualitatively characterizing neural network optimization problems. In Proceedings of ICLR.

Stephan Gouws and Anders Søgaard. 2015. Simple task-specific bilingual word embeddings. In Proceedings of NAACL-HLT. pages 1302-1306.

Edouard Grave, Armand Joulin, and Quentin Berthet. 2018. Unsupervised alignment of embeddings with Wasserstein Procrustes. CoRR abs/1805.11222.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. Improved training of Wasserstein GANs.
In Proceedings of NIPS.

Bradley Hauer, Garrett Nicolai, and Grzegorz Kondrak. 2017. Bootstrapping unsupervised bilingual lexicon induction. In Proceedings of EACL. pages 619-624.

Yedid Hoshen and Lior Wolf. 2018. An iterative closest point method for unsupervised word translation. CoRR abs/1801.06126.

Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. 2018. Finding flatter minima with SGD. In Proceedings of ICLR.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In Proceedings of ICLR (Conference Papers). http://arxiv.org/abs/1711.00043.

Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. 2018. Visualizing the loss landscape of neural nets. In ICLR.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. CoRR abs/1309.4168. http://arxiv.org/abs/1309.4168.

Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11:2487-2531. http://portal.acm.org/citation.cfm?id=1953015.

Peter Schönemann. 1966. A generalized solution of the orthogonal procrustes problem. Psychometrika 31:1-10.

Samuel Smith and Quoc Le. 2017. A Bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451.

Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Bilingual word vectors, orthogonal transformations and the inverted softmax. In Proceedings of ICLR (Conference Track).

Anders Søgaard, Sebastian Ruder, and Ivan Vulić. 2018. On the limitations of unsupervised bilingual dictionary induction. In Proceedings of ACL.

Ivan Vulić and Anna Korhonen. 2016.
On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of ACL. pages 247-257.

Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, and Liqiang Wang. 2018. Improving the improved training of Wasserstein GANs: A consistency term and its dual effect. In Proceedings of ICLR.

Chao Xing, Chao Liu, Dong Wang, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of NAACL-HLT.

Ruochen Xu, Yiming Yang, Naoki Otani, and Yuexin Wu. 2018. Unsupervised cross-lingual transfer of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2465-2474. http://aclweb.org/anthology/D18-1268.

Jiaolong Yang, Hongdong Li, Dylan Campbell, and Yunde Jia. 2016. Go-ICP: A globally optimal solution to 3D ICP point-set registration. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(11).

Pengcheng Yang, Fuli Luo, Shuangzhi Wu, Jingjing Xu, Dongdong Zhang, and Xu Sun. 2018. Learning unsupervised word mapping by maximizing mean discrepancy. CoRR abs/1811.00275.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of ACL.