{"title": "Cross-lingual Language Model Pretraining", "book": "Advances in Neural Information Processing Systems", "page_first": 7059, "page_last": 7069, "abstract": "Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.", "full_text": "Cross-lingual Language Model Pretraining\n\nAlexis Conneau\u2217\nFacebook AI Research\nUniversit\u00e9 Le Mans\naconneau@fb.com\n\nGuillaume Lample\u2217\nFacebook AI Research\nSorbonne Universit\u00e9s\nglample@fb.com\n\nAbstract\n\nRecent studies have demonstrated the ef\ufb01ciency of generative pretraining for En-\nglish natural language understanding. In this work, we extend this approach to\nmultiple languages and show the effectiveness of cross-lingual pretraining. We\npropose two methods to learn cross-lingual language models (XLMs): one unsu-\npervised that only relies on monolingual data, and one supervised that leverages\nparallel data with a new cross-lingual language model objective. We obtain state-of-\nthe-art results on cross-lingual classi\ufb01cation, unsupervised and supervised machine\ntranslation. On XNLI, our approach pushes the state of the art by an absolute gain\nof 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on\nWMT\u201916 German-English, improving the previous state of the art by more than 9\nBLEU. On supervised machine translation, we obtain a new state of the art of 38.5\nBLEU on WMT\u201916 Romanian-English, outperforming the previous best approach\nby more than 4 BLEU. Our code and pretrained models are publicly available1.\n\n1\n\nIntroduction\n\nGenerative pretraining of sentence encoders [30, 20, 14] has led to strong improvements on numerous\nnatural language understanding benchmarks [40]. In this context, a Transformer [38] language model\nis learned on a large unsupervised text corpus, and then \ufb01ne-tuned on natural language understanding\n(NLU) tasks such as classi\ufb01cation [35] or natural language inference [7, 42]. Although there has\nbeen a surge of interest in learning general-purpose sentence representations, research in that area\nhas been essentially monolingual, and largely focused around English benchmarks [10, 40]. 
Recent developments in learning and evaluating cross-lingual sentence representations in many languages [12] aim at mitigating this English-centric bias, and suggest that it is possible to build universal cross-lingual encoders that can encode any sentence into a shared embedding space.

In this work, we demonstrate the effectiveness of cross-lingual language model pretraining on multiple cross-lingual understanding (XLU) benchmarks. Precisely, we make the following contributions:

1. We introduce a new unsupervised method for learning cross-lingual representations using cross-lingual language modeling, and investigate two monolingual pretraining objectives.

2. We introduce a new supervised learning objective that improves cross-lingual pretraining when parallel data is available.

3. We significantly outperform the previous state of the art on cross-lingual classification, unsupervised machine translation and supervised machine translation.

4. We show that cross-lingual language models can provide significant improvements on the perplexity of low-resource languages.

5. We make our code and pretrained models publicly available.¹

* Equal contribution.
¹ https://github.com/facebookresearch/XLM

2 Related Work

Our work builds on top of Radford et al. [30], Howard and Ruder [20], and Devlin et al. [14], who investigate language modeling for pretraining Transformer encoders. Their approaches lead to drastic improvements on several classification tasks from the GLUE benchmark [40]. Ramachandran et al. [31] show that language modeling pretraining can also provide significant improvements on machine translation tasks, even for high-resource language pairs such as English-German, where a significant amount of parallel data exists. Concurrently with our work, results on cross-lingual classification using a cross-lingual language modeling approach were showcased on the BERT repository. We compare those results to our approach in Section 5.

Aligning distributions of text representations has a long tradition, starting from word embedding alignment and the work of Mikolov et al. [27], which leverages small dictionaries to align word representations from different languages. A series of follow-up studies show that cross-lingual representations can be used to improve the quality of monolingual representations [16], that orthogonal transformations are sufficient to align these word distributions [43], and that all these techniques can be applied to an arbitrary number of languages [2]. Following this line of work, the need for cross-lingual supervision was further reduced [34] until it was completely removed [11]. We take these ideas one step further by aligning distributions of sentences while also reducing the need for parallel data.

There is a large body of work on aligning sentence representations from multiple languages. Using parallel data, Hermann and Blunsom [18], Conneau et al. [12], and Eriguchi et al. [15] investigated zero-shot cross-lingual sentence classification. But the most successful recent approach to cross-lingual encoders is probably that of Johnson et al. [21] for multilingual machine translation. They show that a single sequence-to-sequence model can be used to perform machine translation for many language pairs, using a single shared LSTM encoder and decoder.
Their multilingual model outperformed the state of the art on low-resource language pairs, and enabled zero-shot translation. Following this approach, Artetxe and Schwenk [4] show that the resulting encoder can be used to produce cross-lingual sentence embeddings. By leveraging more than 200 million parallel sentences, they obtain a new state of the art on the XNLI cross-lingual classification benchmark [12]. While these methods require a significant amount of parallel data, recent work in unsupervised machine translation shows that sentence representations can be aligned in a completely unsupervised way [25, 5]. For instance, Lample et al. [26] obtained 25.2 BLEU on WMT'16 German-English without using parallel sentences. Similarly, we show that we can align distributions of sentences in a completely unsupervised way, and that our cross-lingual models can be used for a broad set of natural language understanding tasks, including machine translation.

The most similar work to ours is probably that of Wada and Iwata [39], where the authors train an LSTM [19] language model with sentences from different languages to align word embeddings in an unsupervised way.

3 Cross-lingual language models

In this section, we present the three language modeling objectives we consider throughout this work. Two of them only require monolingual data (unsupervised), while the third one requires parallel sentences (supervised). We consider N languages. Unless stated otherwise, we suppose that we have N monolingual corpora {Ci}i=1...N, and we denote by ni the number of sentences in Ci.

3.1 Shared sub-word vocabulary

In all our experiments we process all languages with the same shared vocabulary created through Byte Pair Encoding (BPE) [32]. As shown in Lample et al. [25], this greatly improves the alignment of embedding spaces across languages that share either the same alphabet or anchor tokens such as digits [34] or proper nouns. We learn the BPE splits on the concatenation of sentences sampled randomly from the monolingual corpora. Sentences are sampled according to a multinomial distribution with probabilities {qi}i=1...N, where

    qi = pi^α / Σ_{j=1..N} pj^α    with    pi = ni / Σ_{k=1..N} nk.

We consider α = 0.5. Sampling with this distribution increases the number of tokens associated with low-resource languages and alleviates the bias towards high-resource languages. In particular, this prevents words of low-resource languages from being split at the character level.
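For concreteness, the sampling distribution above can be computed as follows. This is a minimal NumPy sketch (the function and variable names are ours, not from the released code):

```python
import numpy as np

def language_sampling_probs(n_sentences, alpha=0.5):
    """Rebalanced multinomial over languages: qi is proportional to
    pi^alpha, where pi is the empirical share of language i."""
    p = np.asarray(n_sentences, dtype=np.float64)
    p /= p.sum()                      # pi = ni / sum_k nk
    q = p ** alpha
    return q / q.sum()                # qi = pi^alpha / sum_j pj^alpha

# With one high-resource and one low-resource corpus, alpha = 0.5
# raises the low-resource share from about 1% to about 9%.
print(language_sampling_probs([10_000_000, 100_000]))
```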
3.2 Causal Language Modeling (CLM)

Our causal language modeling (CLM) task consists of a Transformer language model trained to model the probability of a word given the previous words in a sentence, P(wt | w1, ..., wt-1, θ). While recurrent neural networks obtain state-of-the-art performance on language modeling benchmarks [22], Transformer models are also very competitive [13]. In the case of LSTM language models, back-propagation through time (BPTT) [41] is performed by providing the LSTM with the last hidden state of the previous iteration. In the case of Transformers, previous hidden states can be passed to the current batch [1] to provide context to the first words in the batch. However, this technique does not scale to the cross-lingual setting, so for simplicity we just leave the first words in each batch without context.

3.3 Masked Language Modeling (MLM)

We also consider the masked language modeling (MLM) objective of Devlin et al. [14], also known as the Cloze task [36]. Following Devlin et al. [14], we randomly sample 15% of the BPE tokens from the text streams, replace them by a [MASK] token 80% of the time, by a random token 10% of the time, and keep them unchanged 10% of the time. Differences between our approach and the MLM of Devlin et al. [14] include the use of text streams of an arbitrary number of sentences (truncated at 256 tokens) instead of pairs of sentences. To counter the imbalance between rare and frequent tokens (e.g. punctuation or stop words), we also subsample the frequent outputs using an approach similar to Mikolov et al. [28]: tokens in a text stream are sampled according to a multinomial distribution whose weights are proportional to the square root of their inverted frequencies. Our MLM objective is illustrated in Figure 1.
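A sketch of this masking step is shown below. It assumes the stream is a 1-D LongTensor of BPE token ids and that `counts` holds corpus frequencies indexed by token id; the `-100` label convention for ignored positions follows PyTorch's cross-entropy `ignore_index` and is our choice, not a detail from the paper:

```python
import torch

def mask_tokens(stream, vocab_size, counts, mask_index, mask_frac=0.15):
    """Select ~15% of the positions in a stream of BPE token ids,
    weighting positions by 1/sqrt(token frequency) to subsample
    frequent tokens, then apply the 80/10/10 replacement scheme."""
    weights = counts[stream].float() ** -0.5       # sqrt of inverted frequency
    n_pred = max(1, int(mask_frac * stream.numel()))
    positions = torch.multinomial(weights, n_pred, replacement=False)

    selected = torch.zeros_like(stream, dtype=torch.bool)
    selected[positions] = True

    labels = stream.clone()
    labels[~selected] = -100                       # ignored by the MLM loss

    inputs = stream.clone()
    r = torch.rand(stream.shape)
    inputs[selected & (r < 0.8)] = mask_index      # 80%: [MASK]
    rnd = selected & (r >= 0.8) & (r < 0.9)        # 10%: random token
    inputs[rnd] = torch.randint(vocab_size, stream.shape)[rnd]
    return inputs, labels                          # remaining 10%: unchanged
```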
3.4 Translation Language Modeling (TLM)

Both the CLM and MLM objectives are unsupervised and only require monolingual data. However, these objectives cannot be used to leverage parallel data when it is available. We introduce a new translation language modeling (TLM) objective for improving cross-lingual pretraining. Our TLM objective is an extension of MLM, where instead of considering monolingual text streams, we concatenate parallel sentences as illustrated in Figure 1. We randomly mask words in both the source and target sentences. To predict a word masked in an English sentence, the model can either attend to surrounding English words or to the French translation, encouraging the model to align the English and French representations. In particular, the model can leverage the French context if the English one is not sufficient to infer the masked English words. To facilitate the alignment, we also reset the positions of target sentences.
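The TLM input construction can be sketched as follows. The exact special-token layout (a [/s] separator around each sentence, as suggested by Figure 1) is an assumption; the essential points are the concatenation, the position reset for the target sentence, and the per-token language ids:

```python
import torch

def build_tlm_input(src_ids, tgt_ids, src_lang_id, tgt_lang_id, sep_index):
    """Concatenate a parallel sentence pair into a single TLM stream.
    Positions restart at 0 for the target sentence, and every token
    carries the id of its language; MLM-style masking is then applied
    to both sides of the concatenated stream."""
    sep = torch.tensor([sep_index])
    src = torch.cat([sep, src_ids, sep])             # [/s] source [/s]
    tgt = torch.cat([sep, tgt_ids, sep])             # [/s] target [/s]
    tokens = torch.cat([src, tgt])
    positions = torch.cat([torch.arange(len(src)),
                           torch.arange(len(tgt))])  # reset for the target
    langs = torch.cat([torch.full((len(src),), src_lang_id, dtype=torch.long),
                       torch.full((len(tgt),), tgt_lang_id, dtype=torch.long)])
    return tokens, positions, langs
```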
3.5 Cross-lingual Language Models

In this work, we consider cross-lingual language model pretraining with either CLM, MLM, or MLM used in combination with TLM. For the CLM and MLM objectives, we train the model with batches of 64 streams of continuous sentences composed of 256 tokens. At each iteration, a batch is composed of sentences coming from the same language, which is sampled from the distribution {qi}i=1...N above, with α = 0.7. When TLM is used in combination with MLM, we alternate between these two objectives, and sample the language pairs with a similar approach.

4 Cross-lingual language model pretraining

In this section, we explain how cross-lingual language models can be used to obtain:

• a better initialization of sentence encoders for zero-shot cross-lingual classification
• a better initialization of supervised and unsupervised neural machine translation systems
• language models for low-resource languages
• unsupervised cross-lingual word embeddings

Figure 1: Cross-lingual language model pretraining. The MLM objective is similar to the one of Devlin et al. [14], but with continuous streams of text as opposed to sentence pairs. The TLM objective extends MLM to pairs of parallel sentences. To predict a masked English word, the model can attend to both the English sentence and its French translation, and is encouraged to align English and French representations. Position embeddings of the target sentence are reset to facilitate the alignment. [Figure: two example token streams, one for MLM and one for TLM, showing token, position and language embeddings summed at each input position; the diagram itself is not recoverable from the extraction.]

4.1 Cross-lingual classification

Our pretrained XLM models provide general-purpose cross-lingual text representations. Similar to monolingual language model fine-tuning [30, 14] on English classification tasks, we fine-tune XLMs on a cross-lingual classification benchmark. We use the cross-lingual natural language inference (XNLI) dataset to evaluate our approach. Precisely, we add a linear classifier on top of the first hidden state of the pretrained Transformer, and fine-tune all parameters on the English NLI training dataset. We then evaluate the capacity of our model to make correct NLI predictions in the 15 XNLI languages. Following Conneau et al. [12], we also include machine translation baselines of train and test sets. We report our results in Table 1.
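A minimal sketch of this fine-tuning setup is given below, assuming a pretrained encoder that returns one hidden state per input position; the encoder's call signature is a placeholder, and the three classes are the standard NLI labels (entailment, neutral, contradiction):

```python
import torch.nn as nn

class XNLIClassifier(nn.Module):
    """Linear classifier over the first hidden state of a pretrained
    cross-lingual encoder; all parameters are fine-tuned jointly."""

    def __init__(self, encoder, hidden_dim=1024, n_classes=3):
        super().__init__()
        self.encoder = encoder                        # pretrained Transformer
        self.proj = nn.Linear(hidden_dim, n_classes)  # randomly initialized

    def forward(self, tokens, positions, langs):
        # h: (batch, seq_len, hidden_dim); the encoder call signature is
        # hypothetical and stands in for whatever the pretrained model uses.
        h = self.encoder(tokens, positions, langs)
        return self.proj(h[:, 0])                     # first hidden state
```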
4.2 Unsupervised Machine Translation

Pretraining is a key ingredient of unsupervised neural machine translation (UNMT) [25, 5]. Lample et al. [26] show that the quality of the pretrained cross-lingual word embeddings used to initialize the lookup table has a significant impact on the performance of an unsupervised machine translation model. We propose to take this idea one step further by pretraining the entire encoder and decoder with a cross-lingual language model to bootstrap the iterative process of UNMT. We explore various initialization schemes and evaluate their impact on several standard machine translation benchmarks, including WMT'14 English-French, WMT'16 English-German and WMT'16 English-Romanian. Results are presented in Table 2.

4.3 Supervised Machine Translation

We also investigate the impact of cross-lingual language modeling pretraining for supervised machine translation, and extend the approach of Ramachandran et al. [31] to multilingual NMT [21]. We evaluate the impact of both CLM and MLM pretraining on WMT'16 Romanian-English, and present results in Table 3.

4.4 Low-resource language modeling

For low-resource languages, it is often beneficial to leverage data in similar but higher-resource languages, especially when they share a significant fraction of their vocabularies. For instance, there are about 100k sentences written in Nepali on Wikipedia, and about 6 times more in Hindi. These two languages also have more than 80% of their tokens in common in a shared BPE vocabulary of 100k subword units. We provide in Table 4 a comparison in perplexity between a Nepali language model and a cross-lingual language model trained in Nepali but enriched with different combinations of Hindi and English data.

4.5 Unsupervised cross-lingual word embeddings

Conneau et al. [11] showed how to perform unsupervised word translation by aligning monolingual word embedding spaces with adversarial training (MUSE). Lample et al. [25] showed that using a shared vocabulary between two languages and then applying fastText [6] on the concatenation of their monolingual corpora also directly provides high-quality cross-lingual word embeddings (Concat) for languages that share a common alphabet. In this work, we also use a shared vocabulary, but our word embeddings are obtained via the lookup table of our cross-lingual language model (XLM). In Section 5, we compare these three approaches on three different metrics: cosine similarity, L2 distance and cross-lingual word similarity.
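The first two metrics can be computed directly from the lookup table. The sketch below is illustrative: `emb` and `pairs` are assumed tensor layouts, not names from the released code:

```python
import torch
import torch.nn.functional as F

def translation_pair_stats(emb, pairs):
    """Mean cosine similarity and mean L2 distance between source words
    and their translations. `emb` is the (vocab, dim) lookup table of the
    model; `pairs` is an (n, 2) LongTensor of (source id, translation id)
    rows taken from a bilingual dictionary."""
    src, tgt = emb[pairs[:, 0]], emb[pairs[:, 1]]
    cos = F.cosine_similarity(src, tgt, dim=1).mean().item()
    l2 = (src - tgt).norm(dim=1).mean().item()
    return cos, l2
```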
5 Experiments and results

In this section, we empirically demonstrate the strong impact of cross-lingual language model pretraining on several benchmarks, and compare our approach to the current state of the art.

5.1 Training details

In all experiments, we use a Transformer architecture with 1024 hidden units, 8 heads, GELU activations [17], a dropout rate of 0.1 and learned positional embeddings. We train our models with the Adam optimizer [23], a linear warm-up [38] and learning rates varying from 10^-4 to 5 x 10^-4.

For the CLM and MLM objectives, we use streams of 256 tokens and mini-batches of size 64. Unlike Devlin et al. [14], a sequence in a mini-batch can contain more than two consecutive sentences, as explained in Section 3.3. For the TLM objective, we sample mini-batches of 4000 tokens composed of sentences with similar lengths. We use the averaged perplexity over languages as a stopping criterion for training. For machine translation, we only use 6 layers, and we create mini-batches of 2000 tokens.

When fine-tuning on XNLI, we use mini-batches of size 8 or 16, and we clip the sentence length to 256 words. We use 80k BPE splits and a vocabulary of 95k tokens, and train a 12-layer model on the Wikipedias of the XNLI languages. We sample the learning rate of the Adam optimizer with values from 5 x 10^-4 to 2 x 10^-4, and use small evaluation epochs of 20000 random samples. We use the first hidden state of the last layer of the Transformer as input to the randomly initialized final linear classifier, and fine-tune all parameters. In our experiments, using either max-pooling or mean-pooling over the last layer did not work better than using the first hidden state.

We implement all our models in PyTorch [29], and train them on 64 Volta GPUs for the language modeling tasks, and 8 GPUs for the MT tasks. We use float16 operations to speed up training and to reduce the memory usage of our models.
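A sketch of this optimization setup is shown below, under the assumption that the learning rate is simply held constant after the warm-up (the paper only states that a linear warm-up is used, and the warm-up length here is an arbitrary placeholder):

```python
import torch

def make_optimizer(model, peak_lr=1e-4, warmup_steps=4000):
    """Adam with a linear learning-rate warm-up: the rate grows linearly
    from 0 to peak_lr over warmup_steps, then stays at peak_lr. Call
    scheduler.step() after every optimizer update."""
    optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
    return optimizer, scheduler
```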
5.2 Data preprocessing

We use WikiExtractor to extract raw sentences from Wikipedia dumps and use them as monolingual data for the CLM and MLM objectives. For the TLM objective, we only use parallel data that involves English, similar to Conneau et al. [12]. Precisely, we use MultiUN [44] for French, Spanish, Russian, Arabic and Chinese, and the IIT Bombay corpus [3] for Hindi. We extract the following corpora from the OPUS website [37]: the EUbookshop corpus for German, Greek and Bulgarian, OpenSubtitles 2018 for Turkish, Vietnamese and Thai, Tanzil for both Urdu and Swahili, and GlobalVoices for Swahili. For Chinese and Thai we use the tokenizer of Chang et al. [9] and the PyThaiNLP tokenizer, respectively. For all other languages, we use the tokenizer provided by Moses [24], falling back on the default English tokenizer when necessary. We use fastBPE to learn BPE codes and split words into subword units. The BPE codes are learned on the concatenation of sentences sampled from all languages, following the method presented in Section 3.1.

                         en    fr    es    de    el    bg    ru    tr    ar    vi    th    zh    hi    sw    ur    Δ
Machine translation baselines (TRANSLATE-TRAIN)
Devlin et al. [14]      81.9    -   77.8  75.9    -     -     -     -   70.7    -     -   76.6    -     -   61.6    -
XLM (MLM+TLM)           85.0  80.2  80.8  80.3  78.1  79.3  78.1  74.7  76.5  76.6  75.5  78.6  72.3  70.9  63.2  76.7
Machine translation baselines (TRANSLATE-TEST)
Devlin et al. [14]      81.4    -   74.9  74.4    -     -     -     -   70.4    -     -   70.1    -     -   62.1    -
XLM (MLM+TLM)           85.0  79.0  79.5  78.1  77.8  77.6  75.5  73.7  73.7  70.8  70.4  73.6  69.0  64.7  65.1  74.2
Evaluation of cross-lingual sentence encoders
Conneau et al. [12]     73.7  67.7  68.7  67.7  68.9  67.9  65.4  64.2  64.8  66.4  64.1  65.8  64.1  55.7  58.4  65.6
Devlin et al. [14]      81.4    -   74.3  70.5    -     -     -     -   62.1    -     -   63.8    -     -   58.3    -
Artetxe and Schwenk [4] 73.9  71.9  72.9  72.6  73.1  74.2  71.5  69.7  71.4  72.0  69.2  71.4  65.5  62.2  61.0  70.2
XLM (MLM)               83.2  76.5  76.3  74.2  73.1  74.0  73.1  67.8  68.5  71.2  69.2  71.9  65.7  64.6  63.4  71.5
XLM (MLM+TLM)           85.0  78.7  78.9  77.8  76.6  77.4  75.3  72.5  73.1  76.1  73.2  76.5  69.6  68.4  67.3  75.1

Table 1: Results on cross-lingual classification accuracy. Test accuracy on the 15 XNLI languages. We report results for machine translation baselines and zero-shot classification approaches based on cross-lingual sentence encoders. XLM (MLM) corresponds to our unsupervised approach trained only on monolingual corpora, and XLM (MLM+TLM) corresponds to our supervised method that leverages both monolingual and parallel data through the TLM objective. Δ corresponds to the average accuracy.

5.3 Results and analysis

In this section, we demonstrate the effectiveness of cross-lingual language model pretraining. Our approach significantly outperforms the previous state of the art on cross-lingual classification, and on unsupervised and supervised machine translation.

Cross-lingual classification  In Table 1, we evaluate two types of pretrained cross-lingual encoders: an unsupervised cross-lingual language model that uses the MLM objective on monolingual corpora only, and a supervised cross-lingual language model that combines both the MLM and the TLM loss using additional parallel data. Following Conneau et al. [12], we include two machine translation baselines: TRANSLATE-TRAIN, where the English MultiNLI training set is machine translated into each XNLI language, and TRANSLATE-TEST, where every dev and test set of XNLI is translated to English. We report the XNLI baselines of Conneau et al. [12], the multilingual BERT approach of Devlin et al. [14] and the recent work of Artetxe and Schwenk [4].

Our fully unsupervised MLM method sets a new state of the art on zero-shot cross-lingual classification and significantly outperforms the supervised approach of Artetxe and Schwenk [4], which uses 223 million parallel sentences. Precisely, MLM obtains 71.5% accuracy on average (Δ), while they obtained 70.2% accuracy. By leveraging parallel data through the TLM objective (MLM+TLM), we get a significant boost in performance of 3.6% accuracy, further improving the state of the art to 75.1%. On the Swahili and Urdu low-resource languages, we outperform the previous state of the art by 6.2% and 6.3% respectively. Using TLM in addition to MLM also improves English accuracy from 83.2% to 85.0%, outperforming Artetxe and Schwenk [4] and Devlin et al. [14] by 11.1% and 3.6% accuracy respectively.

When fine-tuned on the training set of each XNLI language (TRANSLATE-TRAIN), our supervised model outperforms our zero-shot approach by 1.6%, reaching an absolute state of the art of 76.7% average accuracy. This result demonstrates the consistency of our approach and shows that XLMs can be fine-tuned on any language with strong performance. Similar to multilingual BERT [14], we observe that TRANSLATE-TRAIN outperforms TRANSLATE-TEST by 2.5% average accuracy, and additionally that our zero-shot approach outperforms TRANSLATE-TEST by 0.9%.
Unsupervised machine translation  For the unsupervised machine translation task we consider three language pairs: English-French, English-German, and English-Romanian. Our setting is identical to the one of Lample et al. [26], except for the initialization step, where we use cross-lingual language modeling to pretrain the full model as opposed to only the lookup table.

Encoder  Decoder      en-fr  fr-en  en-de  de-en  en-ro  ro-en
Previous state of the art - Lample et al. [26]
NMT                   25.1   24.2   17.2   21.0   21.2   19.4
PBSMT                 28.1   27.2   17.8   22.7   21.3   23.0
PBSMT + NMT           27.6   27.7   20.2   25.2   25.1   23.9
Our results for different encoder and decoder initializations
  -       -           13.0   15.8    6.7   15.3   18.9   18.3
 EMB     EMB          29.4   29.4   21.3   27.3   27.5   26.6
 CLM     CLM          30.4   30.0   22.7   30.5   29.0   27.8
 MLM     MLM          33.4   33.3   26.4   34.3   33.3   31.8
 CLM      -           28.7   28.2   24.4   30.3   29.2   28.0
 MLM      -           31.6   32.1   27.0   33.2   31.8   30.5
  -      CLM          25.3   26.4   19.2   26.0   25.7   24.6
  -      MLM          29.2   29.1   21.6   28.6   28.2   27.3
 CLM     MLM          32.3   31.6   24.3   32.5   31.6   29.8
 MLM     CLM          33.4   32.3   24.9   32.9   31.7   30.4

Table 2: Results on unsupervised MT. BLEU scores on WMT'14 English-French, WMT'16 German-English and WMT'16 Romanian-English. For our results, the first two columns indicate the model used to pretrain the encoder and the decoder. "-" means the model was randomly initialized. EMB corresponds to pretraining the lookup table with cross-lingual embeddings; CLM and MLM correspond to pretraining with models trained on the CLM or MLM objectives.

For both the encoder and the decoder, we consider different possible initializations: CLM pretraining, MLM pretraining, or random initialization. We then follow Lample et al. [26] and train the model with a denoising auto-encoding loss along with an online back-translation loss. Results are reported in Table 2. We compare our approach with the ones of Lample et al. [26]. For each language pair, we observe significant improvements over the previous state of the art. We re-implemented the NMT approach of Lample et al. [26] (EMB), and obtained better results than reported in their paper. We expect that this is due to our multi-GPU implementation, which uses significantly larger batches. On German-English, our best model outperforms the previous unsupervised approach by 9.1 BLEU, and by 13.3 BLEU if we only consider neural unsupervised approaches. Compared to pretraining only the lookup table (EMB), pretraining both the encoder and decoder with MLM leads to consistent and significant improvements of up to 7 BLEU on German-English. We also observe that pretraining with the MLM objective consistently outperforms pretraining with CLM, going from 30.4 to 33.4 BLEU on English-French, and from 28.0 to 31.8 on Romanian-English. These results are consistent with those of Devlin et al. [14], who observed better generalization on NLU tasks when training on the MLM objective compared to CLM. We also observe that the encoder is the most important element to pretrain: when compared to pretraining both the encoder and the decoder, pretraining only the decoder leads to a significant drop in performance, while pretraining only the encoder has only a small impact on the final BLEU score.

Supervised machine translation  In Table 3 we report the performance on WMT'16 Romanian-English for different supervised training configurations: mono-directional (ro→en), bidirectional (ro↔en, a multi-NMT model trained on both en→ro and ro→en) and bidirectional with back-translation (ro↔en + BT). Models with back-translation are trained with the same monolingual data as the language models used for pretraining. As in the unsupervised setting, we observe that pretraining provides a significant boost in BLEU score for each configuration, and that pretraining with the MLM objective leads to the best performance. Also, while models with back-translation have access to the same amount of monolingual data as the pretrained models, they are not able to generalize as well on the evaluation sets. Our bidirectional model trained with back-translation obtains the best performance and reaches 38.5 BLEU, outperforming the previous state of the art of Sennrich et al. [33] (based on back-translation and ensemble models) by more than 4 BLEU. Similar to English-Romanian, we obtained a 1.5 BLEU improvement for English-German WMT'16 using MLM pretraining. For English-French WMT'14, which contains significantly more supervised training data, we only obtained a minor improvement of 0.1 BLEU, which tends to indicate that the gains coming from pretraining are not as important in very high-resource settings as they are for lower-resource languages. However, in all cases we observed that convergence with pretraining is extremely fast. Typically, even for English-French, we observed that the model only needs a few epochs to converge.

Pretraining             -    CLM   MLM
Sennrich et al. [33]   33.9   -     -
ro → en                28.4  31.5  35.3
ro ↔ en                28.5  31.5  35.6
ro ↔ en + BT           34.4  37.0  38.5

Table 3: Results on supervised MT. BLEU scores on WMT'16 Romanian-English. The previous state of the art of Sennrich et al. [33] uses both back-translation and an ensemble model. ro ↔ en corresponds to models trained on both directions.
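One simple way to realize these encoder and decoder initializations is to copy every pretrained parameter whose name and shape match, leaving the remaining parameters (such as the decoder's source attention, which has no counterpart in the language model) randomly initialized. The sketch below assumes hypothetical `nmt_model.encoder` and `nmt_model.decoder` attributes; it is an illustration, not the released implementation:

```python
def init_from_pretrained(nmt_model, pretrained_state,
                         init_encoder=True, init_decoder=True):
    """Copy every pretrained parameter whose name and shape match into
    the encoder and/or decoder of an encoder-decoder model; unmatched
    parameters keep their random initialization."""
    targets = [(nmt_model.encoder, init_encoder),
               (nmt_model.decoder, init_decoder)]
    for module, enabled in targets:
        if not enabled:
            continue
        state = module.state_dict()
        for name, tensor in pretrained_state.items():
            if name in state and state[name].shape == tensor.shape:
                state[name] = tensor
        module.load_state_dict(state)
```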
Low-resource language model  In Table 4, we investigate the impact of cross-lingual language modeling for improving the perplexity of a Nepali language model. To do so, we train a Nepali language model on Wikipedia, together with additional data from either English or Hindi. While Nepali and English are distant languages, Nepali and Hindi are similar, as they share the same Devanagari script and have a common Sanskrit ancestor. When using English data, we reduce the perplexity of the Nepali language model by 17.1 points, going from 157.2 for Nepali-only language modeling to 140.1 when using English. Using additional data from Hindi, we get a much larger perplexity reduction of 41.6. Finally, by leveraging data from both English and Hindi, we reduce the perplexity even further, to 109.3 on Nepali. The gains in perplexity from cross-lingual language modeling can be partly explained by the n-gram anchor points that are shared across languages, for instance in Wikipedia articles. The cross-lingual language model can thus transfer the additional context provided by the Hindi or English monolingual corpora through these anchor points to improve the Nepali language model.

Training languages          Nepali perplexity
Nepali                           157.2
Nepali + English                 140.1
Nepali + Hindi                   115.6
Nepali + English + Hindi         109.3

Table 4: Results on language modeling. Nepali perplexity when using additional data from a similar language (Hindi) or a distant language (English).

          Cosine sim.   L2 dist.   SemEval'17
MUSE         0.38         5.13        0.65
Concat       0.36         4.89        0.52
XLM          0.55         2.64        0.69

Table 5: Unsupervised cross-lingual word embeddings. Cosine similarity and L2 distance between source words and their translations. Pearson correlation on the SemEval'17 cross-lingual word similarity task of Camacho-Collados et al. [8].

Unsupervised cross-lingual word embeddings  The MUSE, Concat and XLM (MLM) methods provide unsupervised cross-lingual word embedding spaces that have different properties. In Table 5, we study those three methods using the same word vocabulary, and compute the cosine similarity and L2 distance between word translation pairs from the MUSE dictionaries. We also evaluate the quality of the cosine similarity measure via the SemEval'17 cross-lingual word similarity task of Camacho-Collados et al. [8]. We observe that XLM outperforms both MUSE and Concat on cross-lingual word similarity, reaching a Pearson correlation of 0.69. Interestingly, word translation pairs are also far closer in the XLM cross-lingual word embedding space than for MUSE or Concat. Specifically, MUSE obtains 0.38 and 5.13 for cosine similarity and L2 distance, while XLM gives 0.55 and 2.64 for the same metrics. Note that XLM embeddings have the particularity of being trained together with a sentence encoder, which may enforce this closeness, while MUSE and Concat are based on fastText word embeddings.

6 Conclusion

In this work, we show for the first time the strong impact of cross-lingual language model (XLM) pretraining. We investigate two unsupervised training objectives that require only monolingual corpora: Causal Language Modeling (CLM) and Masked Language Modeling (MLM). We show that both the CLM and MLM approaches provide strong cross-lingual features that can be used for pretraining models. On unsupervised machine translation, we show that MLM pretraining is extremely effective. We reach a new state of the art of 34.3 BLEU on WMT'16 German-English, outperforming the previous best approach by more than 9 BLEU. Similarly, we obtain strong improvements on supervised machine translation. We reach a new state of the art on WMT'16 Romanian-English of 38.5 BLEU, which corresponds to an improvement of more than 4 BLEU points. We also demonstrate that XLMs can be used to improve the perplexity of a Nepali language model, and that they provide unsupervised cross-lingual word embeddings. Without using a single parallel sentence, our MLM model fine-tuned on XNLI already outperforms the previous supervised state of the art by 1.3% accuracy on average. Our translation language model objective (TLM) leverages parallel data to further improve the alignment of sentence representations. When used together with MLM, we show that this supervised approach beats the previous state of the art on XNLI by 4.9% accuracy on average. Our code and pretrained models are publicly available.
References

[1] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444, 2018.

[2] Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925, 2016.

[3] Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. The IIT Bombay English-Hindi parallel corpus. In LREC, 2018.

[4] Mikel Artetxe and Holger Schwenk. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv preprint arXiv:1812.10464, 2018.

[5] Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. In International Conference on Learning Representations (ICLR), 2018.

[6] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146, 2017.

[7] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In EMNLP, 2015.

[8] Jose Camacho-Collados, Mohammad Taher Pilehvar, Nigel Collier, and Roberto Navigli. SemEval-2017 task 2: Multilingual and cross-lingual semantic word similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 15-26, 2017.

[9] Pi-Chuan Chang, Michel Galley, and Christopher D. Manning. Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 224-232, 2008.

[10] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In LREC, 2018.

[11] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. In ICLR, 2018.

[12] Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018.

[13] Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Language modeling with longer-term dependency, 2019. URL https://openreview.net/forum?id=HJePno0cYm.

[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[15] Akiko Eriguchi, Melvin Johnson, Orhan Firat, Hideto Kazawa, and Wolfgang Macherey. Zero-shot cross-lingual classification using multilingual neural machine translation. arXiv preprint arXiv:1809.04686, 2018.

[16] Manaal Faruqui and Chris Dyer. Improving vector space word representations using multilingual correlation. In Proceedings of EACL, 2014.

[17] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. arXiv preprint arXiv:1606.08415, 2016.

[18] Karl Moritz Hermann and Phil Blunsom. Multilingual models for compositional distributed semantics. arXiv preprint arXiv:1404.4641, 2014.

[19] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

[20] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328-339, 2018.

[21] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339-351, 2017.

[22] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[24] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics, 2007.

[25] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. In ICLR, 2018.

[26] Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Phrase-based & neural unsupervised machine translation. In EMNLP, 2018.

[27] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.

[28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.

[29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. NIPS 2017 Autodiff Workshop, 2017.

[30] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
[31] Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683, 2016.

[32] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1715-1725, 2016.

[33] Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891, 2016.

[34] Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In International Conference on Learning Representations, 2017.

[35] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631-1642, 2013.

[36] Wilson L. Taylor. "Cloze procedure": A new tool for measuring readability. Journalism Bulletin, 30(4):415-433, 1953.

[37] Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In LREC, Istanbul, Turkey, May 2012. European Language Resources Association (ELRA). ISBN 978-2-9517408-7-7.

[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010, 2017.

[39] Takashi Wada and Tomoharu Iwata. Unsupervised cross-lingual word embedding by multilingual neural language models. arXiv preprint arXiv:1809.02306, 2018.

[40] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

[41] Paul J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560, 1990.

[42] Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL, 2017.

[43] Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of NAACL, 2015.

[44] Michal Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. The United Nations parallel corpus v1.0. In LREC, 2016.