{"title": "An Autoencoder Approach to Learning Bilingual Word Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 1853, "page_last": 1861, "abstract": "Cross-language learning allows us to use training data from one language to build models for a different language. Many approaches to bilingual learning require that we have word-level alignment of sentences from parallel corpora. In this work we explore the use of autoencoder-based methods for cross-language learning of vectorial word representations that are aligned between two languages, while not relying on word-level alignments. We show that by simply learning to reconstruct the bag-of-words representations of aligned sentences, within and between languages, we can in fact learn high-quality representations and do without word alignments. We empirically investigate the success of our approach on the problem of cross-language text classification, where a classifier trained on a given language (e.g., English) must learn to generalize to a different language (e.g., German). In experiments on 3 language pairs, we show that our approach achieves state-of-the-art performance, outperforming a method exploiting word alignments and a strong machine translation baseline.", "full_text": "An Autoencoder Approach to Learning Bilingual Word Representations\n\nSarath Chandar A P\u00b9\u2217, Stanislas Lauly\u00b2\u2217, Hugo Larochelle\u00b2, Mitesh M Khapra\u00b3,\nBalaraman Ravindran\u00b9, Vikas Raykar\u00b3, Amrita Saha\u00b3\n\n\u2217 Both authors contributed equally\n\n\u00b9Indian Institute of Technology Madras, \u00b2Universit\u00e9 de Sherbrooke, \u00b3IBM Research India\n\napsarathchandar@gmail.com, {stanislas.lauly,hugo.larochelle}@usherbrooke.ca,\n{mikhapra,viraykar,amrsaha4}@in.ibm.com, ravi@cse.iitm.ac.in\n\nAbstract\n\nCross-language learning allows one to use training data from one language to\nbuild models for a different language. 
Many approaches to bilingual learning require\nthat we have word-level alignment of sentences from parallel corpora. In this\nwork we explore the use of autoencoder-based methods for cross-language learning\nof vectorial word representations that are coherent between two languages,\nwhile not relying on word-level alignments. We show that by simply learning to\nreconstruct the bag-of-words representations of aligned sentences, within and between\nlanguages, we can in fact learn high-quality representations and do without\nword alignments. We empirically investigate the success of our approach on the\nproblem of cross-language text classification, where a classifier trained on a given\nlanguage (e.g., English) must learn to generalize to a different language (e.g., German).\nIn experiments on 3 language pairs, we show that our approach achieves\nstate-of-the-art performance, outperforming a method exploiting word alignments\nand a strong machine translation baseline.\n\n1 Introduction\n\nThe accuracy of Natural Language Processing (NLP) tools for a given language depends heavily on\nthe availability of annotated resources in that language. For example, high quality POS taggers\n[1], parsers [2], and sentiment analyzers [3] are readily available for English. However, this is not the\ncase for many other languages such as Hindi, Marathi, Bodo, Farsi, and Urdu, for which annotated\ndata is scarce. This situation was acceptable in the past when only a few languages dominated the\ndigital content available online and elsewhere. However, the ever increasing number of languages\non the web today has made it important to accurately process natural language data in such resource-deprived\nlanguages also. An obvious solution to this problem is to improve the annotated inventory\nof these languages, but the cost, time and effort required act as a natural deterrent to this.\nAnother option is to exploit the unlabeled data available in a language. 
In this context, vectorial text\nrepresentations have proven useful for multiple NLP tasks [4, 5]. It has been shown that meaningful\nrepresentations, capturing syntactic and semantic similarity, can be learned from unlabeled data.\nWhile the majority of previous work on vectorial text representations has concentrated on the monolingual\ncase, there has also been considerable interest in learning word and document representations\nthat are aligned across languages [6, 7, 8, 9, 10, 11, 12]. Such aligned representations allow the use\nof resources from a resource-fortunate language to develop NLP capabilities in a resource-deprived\nlanguage.\nOne approach to cross-lingual exploitation of resources is to project parameters learned from the\nannotated data of one language to another language [13, 14, 15, 16, 17]. These approaches rely on a\nbilingual resource such as a Machine Translation (MT) system. Recent attempts at learning common\nbilingual representations [9, 10, 11] aim to eliminate the need for such an MT system. A common\nproperty of these approaches is that a word-level alignment of translated sentences is leveraged to\nderive a regularization term relating word embeddings across languages. Such methods not only\neliminate the need for an MT system but also outperform MT based projection approaches.\nIn this paper, we experiment with methods that learn bilingual word representations without word-to-word\nalignments of bilingual corpora during training. Unlike previous approaches, we only require\naligned sentences and do not rely on word-level alignments (e.g., extracted using GIZA++, as is\nusual), simplifying the learning procedure. To do so, we propose and investigate bilingual autoencoder\nmodels that learn hidden encoder representations of paired bag-of-words sentences that are\nnot only informative of the original bag-of-words but also predictive of the other language. 
Word\nrepresentations can then easily be extracted from the encoder and used in the context of a supervised\nNLP task. Specifically, we demonstrate the quality of these representations for the task of\ncross-language document classification, where a labeled data set can be available in one language,\nbut not in another one. As we\u2019ll see, our approach is able to reach state-of-the-art performance,\noutperforming a method exploiting word alignments and a strong machine translation baseline.\n\n2 Autoencoder for Bags-of-Words\n\nLet x be the bag-of-words representation of a sentence. Specifically, each $x_i$ is a word index from\na fixed vocabulary of V words. As this is a bag-of-words, the order of the words within x does not\ncorrespond to the word order in the original sentence. We wish to learn a D-dimensional vectorial\nrepresentation of our words from a training set of sentence bags-of-words $\\{x^{(t)}\\}_{t=1}^{T}$.\nWe propose to achieve this by using an autoencoder model that encodes an input bag-of-words x with\na sum of the representations (embeddings) of the words present in x, followed by a non-linearity.\nSpecifically, let matrix W be the D \u00d7 V matrix whose columns are the vector representations for\neach word. The encoder\u2019s computation will involve summing over the columns of W for each\nword in the bag-of-words. We will denote this encoder function \u03c6(x). Then, using a decoder, the\nautoencoder will be trained to optimize a loss function that measures how predictive of the original\nbag-of-words the encoder representation \u03c6(x) is.\nThere are different variations we can consider in the design of the encoder/decoder and the choice of\nloss function. One must be careful, however, as certain choices can be inappropriate for training on\nword observations, which are intrinsically sparse and high-dimensional. 
In this paper, we explore\nand compare two different approaches, described in the next two sub-sections.\n\n2.1 Binary bag-of-words reconstruction training with merged bags-of-words\n\nIn the first approach, we start from the conventional autoencoder architecture, which minimizes a\ncross-entropy loss that compares a binary vector observation with a decoder reconstruction. We thus\nconvert the bag-of-words x into a fixed-size but sparse binary vector v(x), which is such that $v(x)_{x_i}$\nis 1 if word $x_i$ is present in x and otherwise 0.\nFrom this representation, we obtain an encoder representation by multiplying v(x) with the word\nrepresentation matrix W:\n\n$$a(x) = c + W v(x), \\quad \\phi(x) = h(a(x)) \\qquad (1)$$\n\nwhere h(\u00b7) is an element-wise non-linearity such as the sigmoid or hyperbolic tangent, and c is a\nD-dimensional bias vector. Encoding thus involves summing the word representations of the words\npresent at least once in the bag-of-words.\nTo produce a reconstruction, we parametrize the decoder using the following non-linear form:\n\n$$\\hat{v}(x) = \\mathrm{sigm}(\\mathbf{V} \\phi(x) + b) \\qquad (2)$$\n\nwhere $\\mathbf{V} = W^T$ is the decoder weight matrix (written in bold to distinguish it from the vocabulary size V), b is the bias vector of the reconstruction layer and sigm(a) = 1/(1 + exp(\u2212a)) is\nthe sigmoid non-linearity.\nThen, the reconstruction is compared to the original binary bag-of-words as follows:\n\n$$\\ell(v(x)) = -\\sum_{i=1}^{V} \\left[ v(x)_i \\log(\\hat{v}(x)_i) + (1 - v(x)_i) \\log(1 - \\hat{v}(x)_i) \\right] \\qquad (3)$$\n\nTraining proceeds by optimizing the sum of reconstruction cross-entropies across the training set,\ne.g., using stochastic or mini-batch gradient descent.\nNote that, since the binary bags-of-words are very high-dimensional (the dimensionality corresponds\nto the size of the vocabulary, which is typically large), the above training procedure, which aims at\nreconstructing the complete binary bag-of-words, will be slow. 
Since we will later be training on\nmillions of sentences, training on each individual sentence bag-of-words will be expensive.\nThus, we propose a simple trick, which exploits the bag-of-words structure of the input. Assuming\nwe are performing mini-batch training (where a mini-batch contains a list of the bags-of-words of\nadjacent sentences), we simply propose to merge the bags-of-words of the mini-batch into a single\nbag-of-words and perform an update based on that merged bag-of-words. The resulting effect is that\neach update is as efficient as in stochastic gradient descent, but the number of updates per training\nepoch is divided by the mini-batch size. As we\u2019ll see in the experimental section, this trick produces\ngood word representations, while sufficiently reducing training time. We note that, additionally, we\ncould have used the stochastic approach proposed by Dauphin et al. [18] for reconstructing binary\nbag-of-words representations of documents, to further improve the efficiency of training. They use\nimportance sampling to avoid reconstructing the whole V-dimensional input vector.\n\n2.2 Tree-based decoder training\n\nThe previous autoencoder architecture worked with a binary vectorial representation of the input\nbag-of-words. In the second autoencoder architecture we investigate, we consider an architecture\nthat instead works with the bag (unordered list) representation more directly.\nFirst, the encoder representation will now involve a sum of the representations of all words, reflecting\nthe relative frequency of each word:\n\n$$a(x) = c + \\sum_{i=1}^{|x|} W_{\\cdot,x_i}, \\quad \\phi(x) = h(a(x)) \\qquad (4)$$\n\nMoreover, decoder training will assume that, from the decoder\u2019s output, we can obtain a probability\ndistribution $p(\\hat{x} \\mid \\phi(x))$ over any word $\\hat{x}$ observed at the reconstruction output layer. 
Then, we can\ntreat the input bag-of-words as a |x|-trials multinomial sample from that distribution and use as the\nreconstruction loss its negative log-likelihood:\n\n$$\\ell(x) = \\sum_{i=1}^{|x|} -\\log p(\\hat{x} = x_i \\mid \\phi(x)) \\qquad (5)$$\n\nWe now must ensure that the decoder can compute $p(\\hat{x} = x_i \\mid \\phi(x))$ efficiently from \u03c6(x). Specifically,\nwe\u2019d like to avoid a procedure scaling linearly with the vocabulary size V, since V will be very\nlarge in practice. This precludes any procedure that would compute the numerator of $p(\\hat{x} = w \\mid \\phi(x))$\nfor each possible word w separately and normalize it so it sums to one.\nWe instead opt for an approach borrowed from the work on neural network language models [19, 20].\nSpecifically, we use a probabilistic tree decomposition of $p(\\hat{x} = x_i \\mid \\phi(x))$. Let\u2019s assume each word\nhas been placed at the leaf of a binary tree. We can then treat the sampling of a word as a stochastic\npath from the root of the tree to one of the leaves.\nWe denote as l(x) the sequence of internal nodes in the path from the root to a given word x, with\n$l(x)_1$ always corresponding to the root. We will denote as \u03c0(x) the vector of associated left/right\nbranching choices on that path, where $\\pi(x)_k = 0$ means the path branches left at internal node $l(x)_k$\nand otherwise branches right if $\\pi(x)_k = 1$. Then, the probability $p(\\hat{x} = x \\mid \\phi(x))$ of reconstructing\na certain word x observed in the bag-of-words is computed as\n\n$$p(\\hat{x} \\mid \\phi(x)) = \\prod_{k=1}^{|\\pi(\\hat{x})|} p(\\pi(\\hat{x})_k \\mid \\phi(x)) \\qquad (6)$$\n\nwhere $p(\\pi(\\hat{x})_k \\mid \\phi(x))$ is output by the decoder. 
By using a full binary tree of words, the number of\ndifferent decoder outputs required to compute $p(\\hat{x} \\mid \\phi(x))$ will be logarithmic in the vocabulary size\nV. Since there are |x| words in the bag-of-words, at most O(|x| log V) outputs are required from\nthe decoder. This is of course a worst case scenario, since words will share internal nodes between\ntheir paths, for which the decoder output can be computed just once. As for organizing words into a\ntree, as in Larochelle and Lauly [21] we used a random assignment of words to the leaves of the full\nbinary tree, which we have found to work well in practice.\nFinally, we need to choose a parametrized form for the decoder. We choose the following form:\n\n$$p(\\pi(\\hat{x})_k = 1 \\mid \\phi(x)) = \\mathrm{sigm}(b_{l(\\hat{x})_k} + \\mathbf{V}_{l(\\hat{x})_k,\\cdot} \\phi(x)) \\qquad (7)$$\n\nwhere b is a (V \u2212 1)-dimensional bias vector and $\\mathbf{V}$ is a (V \u2212 1) \u00d7 D matrix (written in bold to distinguish it from the vocabulary size V). Each left/right branching\nprobability is thus modeled with a logistic regression model applied on the encoder representation\nof the input bag-of-words \u03c6(x).\n\n3 Bilingual autoencoders\n\nLet\u2019s now assume that for each sentence bag-of-words x in some source language X, we have an\nassociated bag-of-words y for this sentence translated in some target language Y by a human expert.\nAssuming we have a training set of such (x, y) pairs, we\u2019d like to use it to learn representations in\nboth languages that are aligned, such that pairs of translated words have similar representations.\nTo achieve this, we propose to augment the regular autoencoder proposed in Section 2 so that, from\nthe sentence representation in a given language, a reconstruction can be attempted of the original\nsentence in the other language. 
Specifically, we now define language specific word representation\nmatrices $W^x$ and $W^y$, corresponding to the languages of the words in x and y respectively. 
Let\n$V^{\\mathcal{X}}$ and $V^{\\mathcal{Y}}$ also be the number of words in the vocabulary of each language, which can be different.\nThe word representations however are of the same size D in both languages. For the binary\nreconstruction autoencoder, the bag-of-words representations extracted by the encoder become\n\n$$\\phi(x) = h\\left(c + W^{\\mathcal{X}} v(x)\\right), \\quad \\phi(y) = h\\left(c + W^{\\mathcal{Y}} v(y)\\right)$$\n\nand are similarly extended for the tree-based autoencoder. Notice that we share the bias c before the\nnon-linearity across encoders, to encourage the encoders in both languages to produce representations\non the same scale.\nFrom the sentence in either language, we want to be able to perform a reconstruction of the original\nsentence in both languages. In particular, given a representation in any language, we\u2019d like a\ndecoder that can perform a reconstruction in language X and another decoder that can reconstruct in\nlanguage Y. Again, we use decoders of the form proposed in either Section 2.1 or 2.2 (see Figure 1),\nbut let the decoders of each language have their own parameters $(b^{\\mathcal{X}}, \\mathbf{V}^{\\mathcal{X}})$ and $(b^{\\mathcal{Y}}, \\mathbf{V}^{\\mathcal{Y}})$, where the bold decoder matrices are distinct from the vocabulary sizes above.\nThis encoder/decoder decomposition structure allows us to learn a mapping within each language\nand across the languages. Specifically, for a given pair (x, y), we can train the model to (1) construct\ny from x (loss \u2113(x, y)), (2) construct x from y (loss \u2113(y, x)), (3) reconstruct x from itself (loss\n\u2113(x)) and (4) reconstruct y from itself (loss \u2113(y)). 
We follow this approach in our experiments and\noptimize the sum of the corresponding 4 losses during training.\n\n3.1 Joint reconstruction and cross-lingual correlation\n\nWe also considered incorporating two additional terms to the loss function, in an attempt to favour\neven more meaningful bilingual representations:\n\n$$\\ell(x, y) + \\ell(y, x) + \\ell(x) + \\ell(y) + \\beta \\ell([x, y], [x, y]) - \\lambda \\cdot \\mathrm{cor}(a(x), a(y)) \\qquad (8)$$\n\nThe term \u2113([x, y], [x, y]) is simply a joint reconstruction term, where both languages are simultaneously\npresented as input and reconstructed. The second term cor(a(x), a(y)) encourages correlation\nbetween the representation of each language. It is the sum of the scalar correlations between\neach pair $a(x)_k$, $a(y)_k$, across all dimensions k of the vectors a(x), a(y)\u00b9. To obtain a stochastic\nestimate of the correlation, during training, small mini-batches are used.\n\n\u00b9 While we could have applied the correlation term on \u03c6(x), \u03c6(y) directly, applying it to the pre-activation\nfunction vectors was found to be more numerically stable.\n\nFigure 1: Left: Bilingual autoencoder based on the binary reconstruction error. Right: Tree-based\nbilingual autoencoder. In this example, they both reconstruct the bag-of-words for the English sentence\n\u201cthe dog barked\u201d from its French translation \u201cle chien a japp\u00e9\u201d.\n\n3.2 Document representations\n\nOnce we learn the language specific word representation matrices $W^x$ and $W^y$ as described above,\nwe can use them to construct document representations, by using their columns as word vector\nrepresentations. 
Given a document d written in language Z \u2208 {X, Y} and containing m words,\n$z_1, z_2, \\ldots, z_m$, we represent it as the tf-idf weighted sum of its words\u2019 representations:\n\n$$\\psi(d) = \\sum_{i=1}^{m} \\text{tf-idf}(z_i) \\cdot W^{\\mathcal{Z}}_{\\cdot, z_i}$$\n\nWe use the document representations thus obtained to train our document\nclassifiers, in the cross-lingual document classification task described in Section 5.\n\n4 Related Work\n\nRecent work that has considered the problem of learning bilingual representations of words usually\nhas relied on word-level alignments. 
Klementiev et al. [9] propose to train simultaneously two neural\nnetwork language models, along with a regularization term that encourages pairs of frequently\naligned words to have similar word embeddings. Thus, the use of this regularization term requires\nfirst obtaining word-level alignments from parallel corpora. Zou et al. [10] use a similar approach,\nwith a different form for the regularizer and neural network language models as in [5]. In our work,\nwe specifically investigate whether a method that does not rely on word-level alignments can learn\ncomparably useful multilingual embeddings in the context of document classification.\nLooking more generally at neural networks that learn multilingual representations of words or\nphrases, we mention the work of Mikolov et al. [11] which showed that a useful linear mapping between\nseparately trained monolingual skip-gram language models could be learned. They too however\nrely on the specification of pairs of words in the two languages to align. Gao et al. [22] also propose\na method for training a neural network to learn useful representations of phrases, in the context\nof a phrase-based translation model. In this case, phrase-level alignments (usually extracted from\nword-level alignments) are required. Recently, Hermann and Blunsom [23, 24] proposed neural\nnetwork architectures and a margin-based training objective that, as in this work, does not rely on\nword alignments. 
We will briefly discuss this work in the experiments section.\n\n5 Experiments\n\nThe techniques proposed in this paper enable us to learn bilingual embeddings which capture cross-language\nsimilarity between words. We propose to evaluate the quality of these embeddings by using\nthem for the task of cross-language document classification. We followed closely the setup used by\nKlementiev et al. [9] and compare with their method, for which word representations are publicly\navailable\u00b2. The setup is as follows. A labeled data set of documents in some language X is available\nto train a classifier; however, we are interested in classifying documents in a different language Y\nat test time. To achieve this, we leverage a bilingual corpus, which is not labeled with any\ndocument-level categories. This bilingual corpus is used to learn document representations that are\ncoherent between languages X and Y. The hope is thus that we can successfully apply the classifier\ntrained on document representations for language X directly to the document representations for\nlanguage Y. Following this setup, we performed experiments on 3 data sets of language pairs:\nEnglish/German (EN/DE), English/French (EN/FR) and English/Spanish (EN/ES).\n\n\u00b2 http://people.mmci.uni-saarland.de/~aklement/data/distrib/\n\n5.1 Data\n\nFor learning the bilingual embeddings, we used sections of the Europarl corpus [25] which contains\nroughly 2 million parallel sentences. We considered 3 language pairs. We used the same pre-processing\nas used by Klementiev et al. [9]. We tokenized the sentences using NLTK [26], removed\npunctuation and lowercased all words. We did not remove stopwords.\nAs for the labeled document classification data sets, they were extracted from sections of the Reuters\nRCV1/RCV2 corpora, again for the 3 pairs considered in our experiments. Following Klementiev\net al. 
[9], we consider only documents which were assigned exactly one of the 4 top level categories\nin the topic hierarchy (CCAT, ECAT, GCAT and MCAT). These documents are also pre-processed\nusing a similar procedure as that used for the Europarl corpus. We used the same vocabularies as\nthose used by Klementiev et al. [9] (varying in size between 35,000 and 50,000).\nFor each pair of languages, our overall procedure for cross-language classification can be summarized\nas follows:\nTrain representation: Train bilingual word representations $W^x$ and $W^y$ on sentence pairs extracted\nfrom Europarl for languages X and Y. Optionally, we also use the monolingual documents\nfrom RCV1/RCV2 to reinforce the monolingual embeddings (this choice is cross-validated). These\nnon-parallel documents can be used through the losses \u2113(x) and \u2113(y) (i.e. by reconstructing x from x\nor y from y). Note that Klementiev et al. [9] also used this data when training word representations.\nTrain classifier: Train document classifier on the Reuters training set for language X, where documents\nare represented using the word representations $W^x$ (see Section 3.2). As in Klementiev et al.\n[9] we used an averaged perceptron trained for 10 epochs, for all the experiments.\nTest-time classification: Use the classifier trained in the previous step on the Reuters test set for\nlanguage Y, using the word representations $W^y$ to represent the documents.\nWe trained the following autoencoders\u00b3: BAE-cr which uses reconstruction error based decoder\ntraining (see Section 2.1) and BAE-tr which uses tree-based decoder training (see Section 2.2).\nModels were trained for up to 20 epochs using the same data as described earlier. BAE-cr used\nmini-batch (of size 20) stochastic gradient descent, while BAE-tr used regular stochastic gradient.\nAll results are for word embeddings of size D = 40, as in Klementiev et al. [9]. 
Further, to speed\nup the training for BAE-cr we merged every 5 adjacent sentence pairs into a single training instance,\nas described in Section 2.1. For all language pairs, the joint reconstruction weight \u03b2 was fixed to 1 and\nthe cross-lingual correlation factor \u03bb to 4 for BAE-cr. For BAE-tr, neither of these additional terms\nwas found to be particularly beneficial, so we set their weights to 0 for all tasks. The other hyper-parameters\nwere tuned to each task using a training/validation set split of 80% and 20% and using\nthe performance on the validation set of an averaged perceptron trained on the smaller training set\nportion (notice that this corresponds to a monolingual classification experiment, since the general\nassumption is that no labeled data is available in the test set language).\n\n5.2 Comparison of the performance of different models\n\nWe now present the cross-language classification results obtained by using the embeddings produced\nby our two autoencoders. We also compare our models with the following approaches:\nKlementiev et al.: This model uses word embeddings learned by a multitask neural network language\nmodel with a regularization term that encourages pairs of frequently aligned words to have\nsimilar word embeddings. 
From these embeddings, document representations are computed as described\nin Section 3.2.\nMT: Here, test documents are translated to the language of the training documents using a standard\nphrase-based MT system, MOSES\u2074, which was trained using default parameters and a 5-gram language\nmodel on the Europarl corpus (same as the one used for inducing our bilingual embeddings).\nMajority Class: Test documents are simply assigned the most frequent class in the training set.\nFor the EN/DE language pairs, we directly report the results from Klementiev et al. [9]. For the other\npairs (not reported in Klementiev et al. [9]), we used the embeddings available online and performed\nthe classification experiment ourselves. Similarly, we generated the MT baseline ourselves.\n\n\u00b3 Our word representations and code are available at http://www.sarathchandar.in/crl.html\n\nTable 1: Cross-lingual classification accuracy for 3 language pairs, with 1000 labeled examples.\n\n                    EN \u2192 DE   DE \u2192 EN   EN \u2192 FR   FR \u2192 EN   EN \u2192 ES   ES \u2192 EN\nBAE-tr              81.8      60.1      70.4      61.8      59.4      60.4\nBAE-cr              91.8      74.2      84.6      74.2      49.0      64.4\nKlementiev et al.   77.6      71.1      74.5      61.9      31.3      63.0\nMT                  68.1      67.4      76.3      71.1      52.0      58.4\nMajority Class      46.8      46.8      22.5      25.0      15.3      22.2\n\nTable 1 summarizes the results. They were obtained using 1000 RCV training examples. We report\nresults in both directions, i.e. language X to Y and vice versa. The best performing method is always\neither BAE-cr or BAE-tr, with BAE-cr having the best performance overall. In particular, BAE-cr\noften outperforms the approach of Klementiev et al. [9] by a large margin.\nWe also mention the recent work of Hermann and Blunsom [23], who proposed two neural network\narchitectures for learning word and document representations using sentence-aligned data only. 
Instead\nof an autoencoder paradigm, they propose a margin-based objective that aims to make the\nrepresentation of aligned sentences closer than non-aligned sentences. While their trained embeddings\nare not publicly available, they report results for the EN/DE classification experiments, with\nrepresentations of the same size as here (D = 40) and trained on 500K EN/DE sentence pairs. Their\nbest model in that setting reaches accuracies of 83.7% and 71.4% respectively for the EN \u2192 DE and\nDE \u2192 EN tasks. One clear advantage of our model is that unlike their model, it can use additional\nmonolingual data. Indeed, when we train BAE-cr with 500K EN/DE sentence pairs, plus monolingual\nRCV documents (which come at no additional cost), we get accuracies of 87.9% (EN \u2192 DE)\nand 76.7% (DE \u2192 EN), still improving on their best model. If we do not use the monolingual\ndata, BAE-cr\u2019s performance is worse but still competitive at 86.1% for EN \u2192 DE and 68.8% for\nDE \u2192 EN. Finally, without constraining D to 40 (they use 128) and by using additional French data,\nthe best results of Hermann and Blunsom [23] are 88.1% (EN \u2192 DE) and 79.1% (DE \u2192 EN), the\nlatter being, to our knowledge, the current state-of-the-art.\nWe also evaluate the effect of varying the amount of supervised training data for training the classifier.\nFor brevity, we report only the results for the EN/DE pair, which are summarized in Figure 2.\nWe observe that BAE-cr clearly outperforms the other models at almost all data sizes. 
More importantly,\nit performs remarkably well at very low data sizes (100), suggesting it learns very meaningful\nembeddings, though the method can still benefit from more labeled data (as in the DE \u2192 EN case).\nTable 2 also illustrates the properties captured within and across languages, for the EN/DE pair\u2075.\nFor a few English words, the words with closest word representations (in Euclidean distance) are\nshown, for both English and German. We observe that words that form a translation pair are close,\nbut also that close words within a language are syntactically/semantically similar as well.\nThe excellent performance of BAE-cr suggests that merging several sentences into single bags-of-words\ncan still yield good word embeddings. In other words, not only do we not need to rely\non word-level alignments, but exact sentence-level alignment is also not essential to reach good\nperformance. We experimented with the merging of 5, 25 and 50 adjacent sentences (see the\nsupplementary material). Generally speaking, these experiments also confirm that even coarser\nmerges are sometimes not detrimental. However, for certain language pairs, there can be an\nimportant decrease in performance. 
On the other hand, when comparing the performance of BAE-tr\nwith the use of 5-sentence merges, no substantial impact is observed.\n\n\u2074 http://www.statmt.org/moses/\n\u2075 See also the supplementary material for a t-SNE visualization of the word representations.\n\nTable 2: Example English words along with the closest words both in English (EN) and German\n(DE), using the Euclidean distance between the embeddings learned by BAE-cr.\n\nWord       Lang  Nearest neighbors\njanuary    EN    january, march, october\n           DE    januar, m\u00e4rz, oktober\npresident  EN    president, i, mr, presidents\n           DE    pr\u00e4sident, pr\u00e4sidentin\nsaid       EN    said, told, say, believe\n           DE    gesagt, sagte, sehr, heute\noil        EN    oil, supply, supplies, gas\n           DE    \u00f6l, boden, befindet, ger\u00e4t\nmicrosoft  EN    microsoft, cds, insider\n           DE    microsoft, cds, warner\nmarket     EN    market, markets, single\n           DE    markt, marktes, m\u00e4rkte\n\nFigure 2: Cross-lingual classification accuracy results, from EN \u2192 DE (left), and DE \u2192 EN (right).\n\n6 Conclusion and Future Work\n\nWe presented evidence that meaningful bilingual word representations could be learned without\nrelying on word-level alignments, even when using fairly coarse sentence-level alignments. In particular, we\nshowed that even though our model does not use word-level alignments, it is able to reach state-of-the-art\nperformance, even compared to a method that exploits word-level alignments. In addition, it\nalso outperforms a strong machine translation baseline. For future work, we would like to investigate\nextensions of our bag-of-words bilingual autoencoder to bags-of-n-grams, where the model would\nalso have to learn representations for short phrases. Such a model should be particularly useful in the\ncontext of a machine translation system. 
We would also like to explore the possibility of converting\nour bilingual model to a multilingual model that can learn common representations for multiple\nlanguages given different amounts of parallel data between these languages.\n\nAcknowledgement\n\nWe would like to thank Alexandre Klementiev and Ivan Titov for providing the code for the classifier\nand data indices. This work was supported in part by Google.\n\nReferences\n\n[1] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech\ntagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American\nChapter of the Association for Computational Linguistics, NAACL \u201903, pages 173\u2013180, 2003.\n\n[2] Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. Parsing with compositional\nvector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational\nLinguistics (Volume 1: Long Papers), pages 455\u2013465, Sofia, Bulgaria, August 2013.\n\n[3] Bing Liu. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies.\nMorgan & Claypool Publishers, 2012.\n\n[4] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: A simple and general method for\nsemi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational\nLinguistics (ACL2010), pages 384\u2013394, 2010.\n\n[5] Ronan Collobert, Jason Weston, L\u00e9on Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa.\nNatural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12, 2011.\n\n[6] Susan T Dumais, Todd A Letsche, Michael L Littman, and Thomas K Landauer. Automatic cross-language\nretrieval using latent semantic indexing. AAAI spring symposium on cross-language text and\nspeech retrieval, 15:21, 1997.\n\n[7] John C. Platt, Kristina Toutanova, and Wen-tau Yih. 
Translingual document representations from discriminative\nprojections. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language\nProcessing, EMNLP \u201910, pages 251\u2013261, Stroudsburg, PA, USA, 2010.\n\n[8] Wen-tau Yih, Kristina Toutanova, John C. Platt, and Christopher Meek. Learning discriminative projections\nfor text similarity measures. In Proceedings of the Fifteenth Conference on Computational Natural\nLanguage Learning, CoNLL \u201911, pages 247\u2013256, Stroudsburg, PA, USA, 2011.\n\n[9] Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. Inducing Crosslingual Distributed Representations\nof Words. In Proceedings of the International Conference on Computational Linguistics, 2012.\n\n[10] Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. Bilingual Word Embeddings for\nPhrase-Based Machine Translation. In Empirical Methods in Natural Language Processing, 2013.\n\n[11] Tomas Mikolov, Quoc Le, and Ilya Sutskever. Exploiting Similarities among Languages for Machine\nTranslation. Technical report, arXiv, 2013.\n\n[12] Manaal Faruqui and Chris Dyer. Improving vector space word representations using multilingual correlation.\nIn Proceedings of the 14th Conference of the European Chapter of the Association for Computational\nLinguistics, pages 462\u2013471, Gothenburg, Sweden, April 2014.\n\n[13] David Yarowsky and Grace Ngai. Inducing multilingual pos taggers and np bracketers via robust projection\nacross aligned corpora. In Proceedings of the second meeting of the North American Chapter of the\nAssociation for Computational Linguistics on Language technologies, pages 1\u20138, Pennsylvania, 2001.\n\n[14] Dipanjan Das and Slav Petrov. Unsupervised part-of-speech tagging with bilingual graph-based projections. 
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:\nHuman Language Technologies, pages 600\u2013609, Portland, Oregon, USA, June 2011.\n\n[15] Rada Mihalcea, Carmen Banea, and Janyce Wiebe. Learning multilingual subjective language via cross-lingual\nprojections. In Proceedings of the 45th Annual Meeting of the Association of Computational\nLinguistics, pages 976\u2013983, Prague, Czech Republic, June 2007.\n\n[16] Xiaojun Wan. Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference\nof the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural\nLanguage Processing of the AFNLP, pages 235\u2013243, Suntec, Singapore, August 2009.\n\n[17] Sebastian Pad\u00f3 and Mirella Lapata. Cross-lingual annotation projection for semantic roles. Journal of\nArtificial Intelligence Research (JAIR), 36:307\u2013340, 2009.\n\n[18] Yann Dauphin, Xavier Glorot, and Yoshua Bengio. Large-Scale Learning of Embeddings with Reconstruction\nSampling. In Proceedings of the 28th International Conference on Machine Learning (ICML\n2011), pages 945\u2013952. Omnipress, 2011.\n\n[19] Frederic Morin and Yoshua Bengio. Hierarchical Probabilistic Neural Network Language Model. In\nProceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS 2005),\npages 246\u2013252. Society for Artificial Intelligence and Statistics, 2005.\n\n[20] Andriy Mnih and Geoffrey E Hinton. A Scalable Hierarchical Distributed Language Model. In Advances\nin Neural Information Processing Systems 21 (NIPS 2008), pages 1081\u20131088, 2009.\n\n[21] Hugo Larochelle and Stanislas Lauly. A Neural Autoregressive Topic Model. In Advances in Neural\nInformation Processing Systems 25 (NIPS 25), 2012.\n\n[22] Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. Learning continuous phrase representations for\ntranslation modeling. 
In Proceedings of the 52nd Annual Meeting of the Association for Computational\nLinguistics (Volume 1: Long Papers), pages 699\u2013709, Baltimore, Maryland, June 2014.\n\n[23] Karl Moritz Hermann and Phil Blunsom. Multilingual models for compositional distributed semantics.\nIn Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014,\nJune 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 58\u201368, 2014.\n\n[24] Karl Moritz Hermann and Phil Blunsom. Multilingual Distributed Representations without Word Alignment.\nIn Proceedings of International Conference on Learning Representations (ICLR), 2014.\n\n[25] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit, 2005.\n\n[26] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media\nInc., 2009.\n", "award": [], "sourceid": 995, "authors": [{"given_name": "Sarath", "family_name": "Chandar A P", "institution": "IBM Research India"}, {"given_name": "Stanislas", "family_name": "Lauly", "institution": "Universit\\'e de Sherbrooke"}, {"given_name": "Hugo", "family_name": "Larochelle", "institution": "Universit\u00e9 de Sherbrooke (Quebec)"}, {"given_name": "Mitesh", "family_name": "Khapra", "institution": "IBM India Research Lab"}, {"given_name": "Balaraman", "family_name": "Ravindran", "institution": "Indian Institute of Technology Madras"}, {"given_name": "Vikas", "family_name": "Raykar", "institution": "IBM Research"}, {"given_name": "Amrita", "family_name": "Saha", "institution": "IBM India Research Lab"}]}