{"title": "Learned in Translation: Contextualized Word Vectors", "book": "Advances in Neural Information Processing Systems", "page_first": 6294, "page_last": 6305, "abstract": "Computer vision has benefited from initializing multiple deep layers with weights pretrained on large supervised training sets like ImageNet. Natural language processing (NLP) typically sees initialization of only the lowest layer of deep models with pretrained word vectors. In this paper, we use a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation (MT) to contextualize word vectors. We show that adding these context vectors (CoVe) improves performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD). For fine-grained sentiment analysis and entailment, CoVe improves performance of our baseline models to the state of the art.", "full_text": "Learned in Translation: Contextualized Word Vectors\n\nBryan McCann\n\nbmccann@salesforce.com\n\nJames Bradbury\n\njames.bradbury@salesforce.com\n\nCaiming Xiong\n\ncxiong@salesforce.com\n\nRichard Socher\n\nrsocher@salesforce.com\n\nAbstract\n\nComputer vision has bene\ufb01ted from initializing multiple deep layers with weights\npretrained on large supervised training sets like ImageNet. Natural language pro-\ncessing (NLP) typically sees initialization of only the lowest layer of deep models\nwith pretrained word vectors. In this paper, we use a deep LSTM encoder from\nan attentional sequence-to-sequence model trained for machine translation (MT)\nto contextualize word vectors. 
We show that adding these context vectors (CoVe) improves performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD). For fine-grained sentiment analysis and entailment, CoVe improves performance of our baseline models to the state of the art.

1 Introduction

Significant gains have been made through transfer and multi-task learning between synergistic tasks. In many cases, these synergies can be exploited by architectures that rely on similar components. In computer vision, convolutional neural networks (CNNs) pretrained on ImageNet [Krizhevsky et al., 2012, Deng et al., 2009] have become the de facto initialization for more complex and deeper models. This initialization improves accuracy on other related tasks such as visual question answering [Xiong et al., 2016] or image captioning [Lu et al., 2016, Socher et al., 2014].

In NLP, distributed representations pretrained with models like Word2Vec [Mikolov et al., 2013] and GloVe [Pennington et al., 2014] have become common initializations for the word vectors of deep learning models. Transferring information from large amounts of unlabeled training data in the form of word vectors has been shown to improve performance over random word vector initialization on a variety of downstream tasks, e.g. part-of-speech tagging [Collobert et al., 2011], named entity recognition [Pennington et al., 2014], and question answering [Xiong et al., 2017]; however, words rarely appear in isolation.
The ability to share a common representation of words in the context of sentences that include them could further improve transfer learning in NLP.

Inspired by the successful transfer of CNNs trained on ImageNet to other tasks in computer vision, we focus on training an encoder for a large NLP task and transferring that encoder to other tasks in NLP. Machine translation (MT) requires a model to encode words in context so as to decode them into another language, and attentional sequence-to-sequence models for MT often contain an LSTM-based encoder, which is a common component in other NLP models. We hypothesize that MT data in general holds potential comparable to that of ImageNet as a cornerstone for reusable models. This makes an MT-LSTM pairing in NLP a natural candidate for mirroring the ImageNet-CNN pairing of computer vision.

As depicted in Figure 1, we begin by training LSTM encoders on several machine translation datasets, and we show that these encoders can be used to improve performance of models trained for other tasks in NLP.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: We a) train a two-layer, bidirectional LSTM as the encoder of an attentional sequence-to-sequence model for machine translation and b) use it to provide context for other NLP models.

In order to test the transferability of these encoders, we develop a common architecture for a variety of classification tasks, and we modify the Dynamic Coattention Network for question answering [Xiong et al., 2017]. We append the outputs of the MT-LSTMs, which we call context vectors (CoVe), to the word vectors typically used as inputs to these models. This approach improved the performance of models for downstream tasks over that of baseline models using pretrained word vectors alone.
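As a concrete illustration of this input scheme, the concatenation of pretrained word vectors with MT-LSTM outputs can be sketched in a few lines of numpy. This is a minimal sketch, not the authors' released code: the random arrays stand in for real GloVe embeddings and MT-LSTM outputs, and the function name `append_cove` is our own; the dimensions (300 for GloVe, 600 for the bidirectional MT-LSTM) follow the paper's setup.

```python
import numpy as np

def append_cove(glove, cove):
    """Concatenate each word vector with its context vector.

    glove: (n_tokens, 300) array of GloVe vectors for a sentence.
    cove:  (n_tokens, 600) array of MT-LSTM outputs for the same tokens.
    Returns an (n_tokens, 900) array: one joined vector per token.
    """
    assert glove.shape[0] == cove.shape[0], "one CoVe vector per token"
    return np.concatenate([glove, cove], axis=1)

n_tokens = 5
glove = np.random.randn(n_tokens, 300)   # stand-in for GloVe(w)
cove = np.random.randn(n_tokens, 600)    # stand-in for MT-LSTM(GloVe(w))
w_tilde = append_cove(glove, cove)
print(w_tilde.shape)  # (5, 900)
```

Downstream models then consume `w_tilde` exactly where they would otherwise consume the word vectors alone.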
For the Stanford Sentiment Treebank (SST) and the Stanford Natural Language Inference Corpus (SNLI), CoVe pushes performance of our baseline model to the state of the art.

Experiments reveal that the quantity of training data used to train the MT-LSTM is positively correlated with performance on downstream tasks. This is yet another advantage of relying on MT, as data for MT is more abundant than for most other supervised NLP tasks, and it suggests that higher quality MT-LSTMs carry over more useful information. This reinforces the idea that machine translation is a good candidate task for further research into models that possess a stronger sense of natural language understanding.

2 Related Work

Transfer Learning. Transfer learning, or domain adaptation, has been applied in a variety of areas where researchers identified synergistic relationships between independently collected datasets. Saenko et al. [2010] adapt object recognition models developed for one visual domain to new imaging conditions by learning a transformation that minimizes domain-induced changes in the feature distribution. Zhu et al. [2011] use matrix factorization to incorporate textual information into tagged images to enhance image classification. In natural language processing (NLP), Collobert et al. [2011] leverage representations learned from unsupervised learning to improve performance on supervised tasks like named entity recognition, part-of-speech tagging, and chunking. Recent work in NLP has continued in this direction by using pretrained word representations to improve models for entailment [Bowman et al., 2014], sentiment analysis [Socher et al., 2013], summarization [Nallapati et al., 2016], and question answering [Seo et al., 2017, Xiong et al., 2017]. Ramachandran et al. [2016] propose initializing sequence-to-sequence models with pretrained language models and fine-tuning for a specific task. Kiros et al.
[2015] propose an unsupervised method for training an encoder that outputs sentence vectors that are predictive of surrounding sentences. We also propose a method of transferring higher-level representations than word vectors, but we use a supervised method to train our sentence encoder and show that it improves models for text classification and question answering without fine-tuning.

Neural Machine Translation. Our source domain of transfer learning is machine translation, a task that has seen marked improvements in recent years with the advance of neural machine translation (NMT) models. Sutskever et al. [2014] investigate sequence-to-sequence models that consist of a neural network encoder and decoder for machine translation. Bahdanau et al. [2015] propose augmenting sequence-to-sequence models with an attention mechanism that gives the decoder access to the encoder representations of the input sequence at each step of sequence generation. Luong et al. [2015] further study the effectiveness of various attention mechanisms with respect to machine translation. Attention mechanisms have also been successfully applied to NLP tasks like entailment [Conneau et al., 2017], summarization [Nallapati et al., 2016], question answering [Seo et al., 2017, Xiong et al., 2017, Min et al., 2017], and semantic parsing [Dong and Lapata, 2016]. We show that attentional encoders trained for NMT transfer well to other NLP tasks.

Transfer Learning and Machine Translation. Machine translation is a suitable source domain for transfer learning because the task, by nature, requires the model to faithfully reproduce a sentence in the target language without losing information in the source language sentence. Moreover, there is an abundance of machine translation data that can be used for transfer learning. Hill et al.
[2016] study the effect of transferring from a variety of source domains to the semantic similarity tasks in Agirre et al. [2014]. Hill et al. [2017] further demonstrate that fixed-length representations obtained from NMT encoders outperform those obtained from monolingual (e.g. language modeling) encoders on semantic similarity tasks. Unlike previous work, we do not transfer from fixed-length representations produced by NMT encoders. Instead, we transfer representations for each token in the input sequence. Our approach makes the transfer of the trained encoder more directly compatible with subsequent LSTMs, attention mechanisms, and, in general, layers that expect input sequences. This additionally facilitates the transfer of sequential dependencies between encoder states.

Transfer Learning in Computer Vision. Since the success of CNNs on the ImageNet challenge, a number of approaches to computer vision tasks have relied on pretrained CNNs as off-the-shelf feature extractors. Girshick et al. [2014] show that using a pretrained CNN to extract features from region proposals improves object detection and semantic segmentation models. Qi et al. [2016] propose a CNN-based object tracking framework, which uses hierarchical features from a pretrained CNN (VGG-19 by Simonyan and Zisserman [2014]). For image captioning, Lu et al. [2016] train a visual sentinel with a pretrained CNN and fine-tune the model with a smaller learning rate. For VQA, Fukui et al. [2016] propose to combine text representations with visual representations extracted by a pretrained residual network [He et al., 2016]. Although model transfer has seen widespread success in computer vision, transfer learning beyond pretrained word vectors is far less pervasive in NLP.

3 Machine Translation Model

We begin by training an attentional sequence-to-sequence model for English-to-German translation based on Klein et al.
[2017] with the goal of transferring the encoder to other tasks.

For training, we are given a sequence of words in the source language w^x = [w^x_1, ..., w^x_n] and a sequence of words in the target language w^z = [w^z_1, ..., w^z_m]. Let GloVe(w^x) be a sequence of GloVe vectors corresponding to the words in w^x, and let z be a sequence of randomly initialized word vectors corresponding to the words in w^z.

We feed GloVe(w^x) to a standard, two-layer, bidirectional, long short-term memory network¹ [Graves and Schmidhuber, 2005] that we refer to as an MT-LSTM to indicate that it is this same two-layer BiLSTM that we later transfer as a pretrained encoder. The MT-LSTM is used to compute a sequence of hidden states

    h = MT-LSTM(GloVe(w^x)).    (1)

For machine translation, the MT-LSTM supplies the context for an attentional decoder that produces a distribution over output words p(ŵ^z_t | H, w^z_1, ..., w^z_{t-1}) at each time-step.

At time-step t, the decoder first uses a two-layer, unidirectional LSTM to produce a hidden state h^dec_t based on the previous target embedding z_{t-1} and a context-adjusted hidden state h̃_{t-1}:

    h^dec_t = LSTM([z_{t-1}; h̃_{t-1}], h^dec_{t-1}).    (2)

The decoder then computes a vector of attention weights α representing the relevance of each encoding time-step to the current decoder state:

    α_t = softmax(H(W_1 h^dec_t + b_1))    (3)

where H refers to the elements of h stacked along the time dimension.

¹ Since there are several biLSTM variants, we define ours as follows. Let h = [h_1, ..., h_n] = biLSTM(x) represent the output sequence of our biLSTM operating on an input sequence x. Then a forward LSTM computes h^→_t = LSTM(x_t, h^→_{t-1}) for each time step, and a backward LSTM computes h^←_t = LSTM(x_t, h^←_{t+1}).
The final outputs of the biLSTM for each time step are h_t = [h^→_t; h^←_t].

The decoder then uses these weights as coefficients in an attentional sum that is concatenated with the decoder state and passed through a tanh layer to form the context-adjusted hidden state h̃:

    h̃_t = [tanh(W_2 H^T α_t + b_2); h^dec_t].    (4)

The distribution over output words is generated by a final transformation of the context-adjusted hidden state: p(ŵ^z_t | X, w^z_1, ..., w^z_{t-1}) = softmax(W_out h̃_t + b_out).

4 Context Vectors (CoVe)

We transfer what is learned by the MT-LSTM to downstream tasks by treating the outputs of the MT-LSTM as context vectors. If w is a sequence of words and GloVe(w) the corresponding sequence of word vectors produced by the GloVe model, then

    CoVe(w) = MT-LSTM(GloVe(w))    (5)

is the sequence of context vectors produced by the MT-LSTM. For classification and question answering, for an input sequence w, we concatenate each vector in GloVe(w) with its corresponding vector in CoVe(w)

    w̃ = [GloVe(w); CoVe(w)]    (6)

as depicted in Figure 1b.

5 Classification with CoVe

We now describe a general biattentive classification network (BCN) we use to test how well CoVe transfer to other tasks. This model, shown in Figure 2, is designed to handle both single-sentence and two-sentence classification tasks. In the case of single-sentence tasks, the input sequence is duplicated to form two sequences, so we will assume two input sequences for the rest of this section.

Input sequences w^x and w^y are converted to sequences of vectors, w̃^x and w̃^y, as described in Eq.
6 before being fed to the task-specific portion of the model (Figure 1b).

A function f applies a feedforward network with ReLU activation [Nair and Hinton, 2010] to each element of w̃^x and w̃^y, and a bidirectional LSTM processes the resulting sequences to obtain task-specific representations,

    x = biLSTM(f(w̃^x))    (7)
    y = biLSTM(f(w̃^y))    (8)

These sequences are each stacked along the time axis to get matrices X and Y.

In order to compute representations that are interdependent, we use a biattention mechanism [Seo et al., 2017, Xiong et al., 2017]. The biattention first computes an affinity matrix A = XY^T. It then extracts attention weights with column-wise normalization:

    A_x = softmax(A)    A_y = softmax(A^T)    (9)

which amounts to a novel form of self-attention when x = y. Next, it uses context summaries

    C_x = A_x^T X    C_y = A_y^T Y    (10)

to condition each sequence on the other.

Figure 2: Our BCN uses a feedforward network with ReLU activation and biLSTM encoder to create task-specific representations of each input sequence.
Biattention conditions each representation on the other, a biLSTM integrates the conditional information, and a maxout network uses pooled features to compute a distribution over possible classes.

We integrate the conditioning information into our representations for each sequence with two separate one-layer, bidirectional LSTMs that operate on the concatenation of the original representations (to ensure no information is lost in conditioning), their differences from the context summaries (to explicitly capture the difference from the original signals), and the element-wise products between originals and context summaries (to amplify or dampen the original signals).

    X_|y = biLSTM([X; X − C_y; X ⊙ C_y])    (11)
    Y_|x = biLSTM([Y; Y − C_x; Y ⊙ C_x])    (12)

The outputs of the bidirectional LSTMs are aggregated by pooling along the time dimension. Max and mean pooling have been used in other models to extract features, but we have found that adding both min pooling and self-attentive pooling can aid in some tasks.
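A minimal numpy sketch of the biattention and conditioning computation (Eqs. 9–12) may help make the shapes concrete. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names `softmax_cols` and `biattend` are ours, the array sizes are arbitrary, and the task-specific encoders of Eqs. 7–8 as well as the integrating biLSTMs themselves are omitted.

```python
import numpy as np

def softmax_cols(a):
    """Column-wise softmax: each column of the result sums to 1."""
    e = np.exp(a - a.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def biattend(X, Y):
    """Biattention: condition each sequence on the other.

    X: (n, d) and Y: (m, d) task-specific representations.
    Returns the concatenated inputs that would be fed to the two
    integrating biLSTMs (the biLSTMs are omitted from this sketch).
    """
    A = X @ Y.T                 # affinity matrix, shape (n, m)
    Ax = softmax_cols(A)        # attention over X positions, (n, m)
    Ay = softmax_cols(A.T)      # attention over Y positions, (m, n)
    Cx = Ax.T @ X               # context summary of X for Y, (m, d)
    Cy = Ay.T @ Y               # context summary of Y for X, (n, d)
    # Originals, differences, and element-wise products, concatenated.
    X_in = np.concatenate([X, X - Cy, X * Cy], axis=1)  # (n, 3d)
    Y_in = np.concatenate([Y, Y - Cx, Y * Cx], axis=1)  # (m, 3d)
    return X_in, Y_in

X = np.random.randn(4, 6)       # toy sequence of length 4, dim 6
Y = np.random.randn(3, 6)       # toy sequence of length 3, dim 6
X_in, Y_in = biattend(X, Y)
print(X_in.shape, Y_in.shape)   # (4, 18) (3, 18)
```

Note how the column-wise softmax makes each context summary a convex combination of the other sequence's positions, so the three concatenated blocks all share the original feature dimension.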
Each captures a different perspective on the conditioned sequences.

The self-attentive pooling computes weights for each time step of the sequence

    β_x = softmax(X_|y v_1 + d_1)    (13)
    β_y = softmax(Y_|x v_2 + d_2)    (14)

and uses these weights to get weighted summations of each sequence:

    x_self = X_|y^T β_x    (15)
    y_self = Y_|x^T β_y    (16)

The pooled representations are combined to get one joined representation for all inputs.

    x_pool = [max(X_|y); mean(X_|y); min(X_|y); x_self]
    y_pool = [max(Y_|x); mean(Y_|x); min(Y_|x); y_self]

We feed this joined representation through a three-layer, batch-normalized [Ioffe and Szegedy, 2015] maxout network [Goodfellow et al., 2013] to produce a probability distribution over possible classes.

6 Question Answering with CoVe

For question answering, we obtain sequences x and y just as we do in Eq. 7 and Eq. 8 for classification, except that the function f is replaced with a function g that uses a tanh activation instead of a ReLU activation. In this case, one of the sequences is the document and the other the question in the question-document pair. These sequences are then fed through the coattention and dynamic decoder implemented as in the original Dynamic Coattention Network (DCN) [Xiong et al., 2016].

7 Datasets

Machine Translation. We use three different English-German machine translation datasets to train three separate MT-LSTMs. Each is tokenized using the Moses Toolkit [Koehn et al., 2007].

Our smallest MT dataset comes from the WMT 2016 multi-modal translation shared task [Specia et al., 2016]. The training set consists of 30,000 sentence pairs that briefly describe Flickr captions and is often referred to as Multi30k.
Due to the nature of image captions, this dataset contains sentences that are, on average, shorter and simpler than those from larger counterparts.

Our medium-sized MT dataset is the 2016 version of the machine translation task prepared for the International Workshop on Spoken Language Translation [Cettolo et al., 2015]. The training set consists of 209,772 sentence pairs from transcribed TED presentations that cover a wide variety of topics with more conversational language than in the other two machine translation datasets.

Our largest MT dataset comes from the news translation shared task from WMT 2017. The training set consists of roughly 7 million sentence pairs that come from web crawl data, a news and commentary corpus, European Parliament proceedings, and European Union press releases.

We refer to the three MT datasets as MT-Small, MT-Medium, and MT-Large, respectively, and we refer to context vectors from encoders trained on each in turn as CoVe-S, CoVe-M, and CoVe-L.

Dataset    Task                        Details                          Examples
SST-2      Sentiment Classification    2 classes, single sentences      56.4k
SST-5      Sentiment Classification    5 classes, single sentences      94.2k
IMDb       Sentiment Classification    2 classes, multiple sentences    22.5k
TREC-6     Question Classification     6 classes                        5k
TREC-50    Question Classification     50 classes                       5k
SNLI       Entailment Classification   3 classes                        550k
SQuAD      Question Answering          open-ended (answer-spans)        87.6k

Table 1: Datasets, tasks, details, and number of training examples.

Sentiment Analysis. We train our model separately on two sentiment analysis datasets: the Stanford Sentiment Treebank (SST) [Socher et al., 2013] and the IMDb dataset [Maas et al., 2011]. Both of these datasets comprise movie reviews and their sentiment. We use the binary version of each dataset as well as the five-class version of SST. For training on SST, we use all sub-trees with length greater than 3.
SST-2 contains roughly 56,400 reviews after removing "neutral" examples. SST-5 contains roughly 94,200 reviews and does include "neutral" examples. IMDb contains 25,000 multi-sentence reviews, which we truncate to the first 200 words. 2,500 reviews are held out for validation.

Question Classification. For question classification, we use the small TREC dataset [Voorhees and Tice, 1999] of open-domain, fact-based questions divided into broad semantic categories. We experiment with both the six-class and fifty-class versions of TREC, which we refer to as TREC-6 and TREC-50, respectively. We hold out 452 examples for validation and leave 5,000 for training.

Entailment. For entailment, we use the Stanford Natural Language Inference Corpus (SNLI) [Bowman et al., 2015], which has 550,152 training, 10,000 validation, and 10,000 testing examples. Each example consists of a premise, a hypothesis, and a label specifying whether the premise entails, contradicts, or is neutral with respect to the hypothesis.

Question Answering. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al., 2016] is a large-scale question answering dataset with 87,599 training examples, 10,570 development examples, and a test set that is not released to the public. Examples consist of question-answer pairs associated with a paragraph from the English Wikipedia. SQuAD examples assume that the question is answerable and that the answer is contained verbatim somewhere in the paragraph.

8 Experiments

8.1 Machine Translation

The MT-LSTM trained on MT-Small obtains an uncased, tokenized BLEU score of 38.5 on the Multi30k test set from 2016. The model trained on MT-Medium obtains an uncased, tokenized BLEU score of 25.54 on the IWSLT test set from 2014. The MT-LSTM trained on MT-Large obtains an uncased, tokenized BLEU score of 28.96 on the WMT 2016 test set.
These results represent strong baseline machine translation models for their respective datasets. Note that, while the smallest dataset has the highest BLEU score, it is also a much simpler dataset with a restricted domain.

Training Details. When training an MT-LSTM, we used fixed 300-dimensional word vectors. We used the CommonCrawl-840B GloVe model for English word vectors, which were completely fixed during training, so that the MT-LSTM had to learn how to use the pretrained vectors for translation. The hidden size of the LSTMs in all MT-LSTMs is 300. Because all MT-LSTMs are bidirectional, they output 600-dimensional vectors. The model was trained with stochastic gradient descent with a learning rate that began at 1 and decayed by half each epoch after the validation perplexity increased for the first time. Dropout with ratio 0.2 was applied to the inputs and outputs of all layers of the encoder and decoder.

8.2 Classification and Question Answering

For classification and question answering, we explore how varying the input representations affects final performance. Table 2 contains validation performances for experiments comparing the use of GloVe, character n-grams, CoVe, and combinations of the three.

[Figure 3: The Benefits of CoVe. Panel (a): CoVe and GloVe. Panel (b): CoVe and Characters.]

Dataset    Random  GloVe   GloVe+  GloVe+  GloVe+  GloVe+  GloVe+
                           Char    CoVe-S  CoVe-M  CoVe-L  Char+CoVe-L
SST-2      84.2    88.4    90.1    89.0    90.9    91.1    91.2
SST-5      48.6    53.5    52.2    54.0    54.7    54.5    55.2
IMDb       88.4    91.1    91.3    90.6    91.6    91.7    92.1
TREC-6     88.9    94.9    94.7    94.7    95.1    95.8    95.8
TREC-50    81.9    89.2    89.8    89.6    89.6    90.5    91.2
SNLI       82.3    87.7    87.7    87.3    87.5    87.9    88.1
SQuAD      65.4    76.0    78.1    76.5    77.1    79.5    79.9

Table 2: CoVe improves validation performance. CoVe has an advantage over character n-gram embeddings, but using both improves performance further.
Models benefit most by using an MT-LSTM trained with MT-Large (CoVe-L). Accuracy is reported for classification tasks, and F1 is reported for SQuAD.

Training Details. Unsupervised vectors and MT-LSTMs remain fixed in this set of experiments. LSTMs have hidden size 300. Models were trained using Adam with α = 0.001. Dropout was applied before all feedforward layers with dropout ratio 0.1, 0.2, or 0.3. Maxout networks pool over 4 channels, reduce dimensionality by 2, 4, or 8, reduce again by 2, and project to the output dimension.

The Benefits of CoVe. Figure 3a shows that models that use CoVe alongside GloVe achieve higher validation performance than models that use only GloVe. Figure 3b shows that using CoVe in Eq. 6 brings larger improvements than using character n-gram embeddings [Hashimoto et al., 2016]. It also shows that altering Eq. 6 by additionally appending character n-gram embeddings can boost performance even further for some tasks. This suggests that the information provided by CoVe is complementary to both the word-level information provided by GloVe as well as the character-level information provided by character n-gram embeddings.

The Effects of MT Training Data. We experimented with different training datasets for the MT-LSTMs to see how varying the MT training data affects the benefits of using CoVe in downstream tasks. Figure 4 shows an important trend we can extract from Table 2. There appears to be a positive correlation between the larger MT datasets, which contain more complex, varied language, and the improvement that using CoVe brings to downstream tasks.
This is evidence for our hypothesis that MT data has potential as a large resource for transfer learning in NLP.

[Figure 4: The Effects of MT Training Data. Bars show % improvement over randomly initialized word vectors for GloVe, GloVe+CoVe-S, GloVe+CoVe-M, and GloVe+CoVe-L on SST-2, SST-5, IMDb, TREC-6, TREC-50, SNLI, and SQuAD.]

SST-2                                        Test
  P-LSTM [Wieting et al., 2016]              89.2
  CT-LSTM [Looks et al., 2017]               89.4
  TE-LSTM [Huang et al., 2017]               89.6
  NSE [Munkhdalai and Yu, 2016a]             89.7
  BCN+Char+CoVe [Ours]                       90.3
  bmLSTM [Radford et al., 2017]              91.8
SST-5
  MVN [Guo et al., 2017]                     51.5
  DMN [Kumar et al., 2016]                   52.1
  LSTM-CNN [Zhou et al., 2016]               52.4
  TE-LSTM [Huang et al., 2017]               52.6
  NTI [Munkhdalai and Yu, 2016b]             53.1
  BCN+Char+CoVe [Ours]                       53.7
IMDb
  BCN+Char+CoVe [Ours]                       91.8
  SA-LSTM [Dai and Le, 2015]                 92.8
  bmLSTM [Radford et al., 2017]              92.9
  TRNN [Dieng et al., 2016]                  93.8
  oh-LSTM [Johnson and Zhang, 2016]          94.1
  Virtual [Miyato et al., 2017]              94.1
TREC-6
  SVM [da Silva et al., 2011]                95.0
  SVM [Van-Tu and Anh-Cuong, 2016]           95.2
  DSCNN-P [Zhang et al., 2016]               95.6
  BCN+Char+CoVe [Ours]                       95.8
  TBCNN [Mou et al., 2015]                   96.0
  LSTM-CNN [Zhou et al., 2016]               96.1
TREC-50
  SVM [Loni et al., 2011]                    89.0
  SNoW [Li and Roth, 2006]                   89.3
  BCN+Char+CoVe [Ours]                       90.2
  RulesUHC [da Silva et al., 2011]           90.8
  SVM [Van-Tu and Anh-Cuong, 2016]           91.6
  Rules [Madabushi and Lee, 2016]            97.2
SNLI
  DecAtt+Intra [Parikh et al., 2016]         86.8
  NTI [Munkhdalai and Yu, 2016b]             87.3
  re-read LSTM [Sha et al., 2016]            87.5
  btree-LSTM [Paria et al., 2016]            87.6
  600D ESIM [Chen et al., 2016]              88.0
  BCN+Char+CoVe [Ours]                       88.1

Table 4: Single model test accuracies for
classification tasks.

Model                                  EM     F1
LR [Rajpurkar et al., 2016]            40.0   51.0
DCR [Yu et al., 2017]                  62.5   72.1
hM-LSTM+AP [Wang and Jiang, 2017]      64.1   73.9
DCN+Char [Xiong et al., 2017]          65.4   75.6
BiDAF [Seo et al., 2017]               68.0   77.3
R-NET [Wang et al., 2017]              71.1   79.5
DCN+Char+CoVe [Ours]                   71.3   79.9

Table 3: Exact match and F1 validation scores for single-model question answering.

Test Performance. Table 4 shows the final test accuracies of our best classification models, each of which achieved the highest validation accuracy on its task using GloVe, CoVe, and character n-gram embeddings. Final test performances on SST-5 and SNLI reached a new state of the art.

Table 3 shows how the validation exact match and F1 scores of our best SQuAD model compare to the scores of the most recent top models in the literature. We did not submit the SQuAD model for testing, but the addition of CoVe was enough to push the validation performance of the original DCN, which already used character n-gram embeddings, above the validation performance of the published version of the R-NET. Test performances are tracked by the SQuAD leaderboard².

Comparison to Skip-Thought Vectors. Kiros et al. [2015] show how to encode a sentence into a single skip-thought vector that transfers well to a variety of tasks. Both skip-thought and CoVe pretrain encoders to capture information at a higher level than words. However, skip-thought encoders are trained with an unsupervised method that relies on the final output of the encoder. MT-LSTMs are trained with a supervised method that instead relies on intermediate outputs associated with each input word. Additionally, the 4800-dimensional skip-thought vectors make training more unstable than using the 600-dimensional CoVe.
Table 5 shows that these differences make CoVe more suitable for transfer learning in our classification experiments.

           GloVe+Char+
Dataset    Skip-Thought   CoVe-L
SST-2      88.7           91.2
SST-5      52.1           55.2
TREC-6     94.2           95.8
TREC-50    89.6           91.2
SNLI       86.0           88.1

Table 5: Classification validation accuracies with skip-thought and CoVe.

² https://rajpurkar.github.io/SQuAD-explorer/

9 Conclusion

We introduce an approach for transferring knowledge from an encoder pretrained on machine translation to a variety of downstream NLP tasks. In all cases, models that used CoVe from our best, pretrained MT-LSTM performed better than baselines that used random word vector initialization, baselines that used pretrained word vectors from a GloVe model, and baselines that used word vectors from a GloVe model together with character n-gram embeddings. We hope this is a step towards the goal of building unified NLP models that rely on increasingly more general reusable weights.

The PyTorch code at https://github.com/salesforce/cove includes an example of how to generate CoVe from the MT-LSTM we used in all of our best models. We hope that making our best MT-LSTM available will encourage further research into shared representations for NLP models.

References

E. Agirre, C. Banea, C. Cardie, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe. SemEval-2014 Task 10: Multilingual semantic textual similarity. In SemEval@COLING, 2014.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

S. R. Bowman, C. Potts, and C. D. Manning. Recursive neural networks for learning logical semantics. CoRR, abs/1406.1827, 2014.

S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated corpus for learning natural language inference.
In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015.

M. Cettolo, J. Niehues, S. Stücker, L. Bentivogli, R. Cattoni, and M. Federico. The IWSLT 2015 evaluation campaign. In IWSLT, 2015.

Q. Chen, X.-D. Zhu, Z.-H. Ling, S. Wei, and H. Jiang. Enhancing and combining sequential and tree LSTM for natural language inference. CoRR, abs/1609.06038, 2016.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, 2011.

A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364, 2017.

J. P. C. G. da Silva, L. Coheur, A. C. Mendes, and A. Wichert. From symbolic to sub-symbolic information in question classification. Artif. Intell. Rev., 35:137–154, 2011.

A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In NIPS, 2015.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

A. B. Dieng, C. Wang, J. Gao, and J. W. Paisley. TopicRNN: A recurrent neural network with long-range semantic dependency. CoRR, abs/1611.01702, 2016.

L. Dong and M. Lapata. Language to logical form with neural attention. CoRR, abs/1601.01280, 2016.

A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.

R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.

I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.

A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.

H. Guo, C. Cherry, and J. Su. End-to-end multi-view networks for text classification. CoRR, abs/1704.05907, 2017.

K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher. A joint many-task model: Growing a neural network for multiple NLP tasks. CoRR, abs/1611.01587, 2016.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

F. Hill, K. Cho, and A. Korhonen. Learning distributed representations of sentences from unlabelled data. In HLT-NAACL, 2016.

F. Hill, K. Cho, S. Jean, and Y. Bengio. The representational geometry of word meanings acquired by neural machine translation models. Machine Translation, pages 1–16, 2017. ISSN 1573-0573. doi: 10.1007/s10590-017-9194-2. URL http://dx.doi.org/10.1007/s10590-017-9194-2.

M. Huang, Q. Qian, and X. Zhu. Encoding syntactic knowledge in neural networks for sentiment classification. ACM Trans. Inf. Syst., 35:26:1–26:27, 2017.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

R. Johnson and T. Zhang. Supervised and semi-supervised text categorization using LSTM for region embeddings. In ICML, 2016.

R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In NIPS, 2015.

G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. OpenNMT: Open-source toolkit for neural machine translation.
ArXiv e-prints, 2017.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In ACL, 2007.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. In ICML, 2016.

X. Li and D. Roth. Learning question classifiers: The role of semantic information. Natural Language Engineering, 12:229–249, 2006.

B. Loni, G. van Tulder, P. Wiggers, D. M. J. Tax, and M. Loog. Question classification by weighted combination of lexical, syntactic and semantic features. In TSD, 2011.

M. Looks, M. Herreshoff, D. Hutchins, and P. Norvig. Deep learning with dynamic computation graphs. CoRR, abs/1702.02181, 2017.

J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. arXiv preprint arXiv:1612.01887, 2016.

T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.

A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.

H. T. Madabushi and M. Lee. High accuracy rule-based question classification using question syntax and semantics. In COLING, 2016.

T. Mikolov, K. Chen, G.
Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR (workshop), 2013.

S. Min, M. Seo, and H. Hajishirzi. Question answering through transfer learning from large fine-grained supervision data. 2017.

T. Miyato, A. M. Dai, and I. Goodfellow. Adversarial training methods for semi-supervised text classification. 2017.

L. Mou, H. Peng, G. Li, Y. Xu, L. Zhang, and Z. Jin. Discriminative neural sentence modeling by tree-based convolution. In EMNLP, 2015.

T. Munkhdalai and H. Yu. Neural semantic encoders. CoRR, abs/1607.04315, 2016a.

T. Munkhdalai and H. Yu. Neural tree indexers for text understanding. CoRR, abs/1607.04492, 2016b.

V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

R. Nallapati, B. Zhou, C. N. dos Santos, Çaglar Gülçehre, and B. Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL, 2016.

B. Paria, K. M. Annervaz, A. Dukkipati, A. Chatterjee, and S. Podder. A neural architecture mimicking humans end-to-end for natural language inference. CoRR, abs/1611.04741, 2016.

A. P. Parikh, O. Tackstrom, D. Das, and J. Uszkoreit. A decomposable attention model for natural language inference. In EMNLP, 2016.

J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang. Hedged deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4303–4311, 2016.

A. Radford, R. Józefowicz, and I. Sutskever. Learning to generate reviews and discovering sentiment. CoRR, abs/1704.01444, 2017.

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

P. Ramachandran, P. J. Liu, and Q. V. Le.
Unsupervised pretraining for sequence to sequence learning. CoRR, abs/1611.02683, 2016.

K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, 2010.

M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional attention flow for machine comprehension. ICLR, 2017.

L. Sha, B. Chang, Z. Sui, and S. Li. Reading and thinking: Re-read LSTM unit for textual entailment recognition. In COLING, 2016.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.

R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. In ACL, 2014.

L. Specia, S. Frank, K. Sima'an, and D. Elliott. A shared task on multimodal machine translation and crosslingual image description. In WMT, 2016.

I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.

N. Van-Tu and L. Anh-Cuong. Improving question classification by feature extraction and selection. Indian Journal of Science and Technology, 9(17), 2016.

E. M. Voorhees and D. M. Tice. The TREC-8 question answering track evaluation. In TREC, volume 1999, page 82, 1999.

S. Wang and J. Jiang. Machine comprehension using Match-LSTM and answer pointer. 2017.

W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou. Gated self-matching networks for reading comprehension and question answering. 2017.

J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. Towards universal paraphrastic sentence embeddings. In ICLR, 2016.

C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering.
In Proceedings of The 33rd International Conference on Machine Learning, pages 2397–2406, 2016.

C. Xiong, V. Zhong, and R. Socher. Dynamic coattention networks for question answering. ICLR, 2017.

Y. Yu, W. Zhang, K. Hasan, M. Yu, B. Xiang, and B. Zhou. End-to-end reading comprehension with dynamic answer chunk ranking. ICLR, 2017.

R. Zhang, H. Lee, and D. R. Radev. Dependency sensitive convolutional neural networks for modeling sentences and documents. In HLT-NAACL, 2016.

P. Zhou, Z. Qi, S. Zheng, J. Xu, H. Bao, and B. Xu. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In COLING, 2016.

Y. Zhu, Y. Chen, Z. Lu, S. J. Pan, G.-R. Xue, Y. Yu, and Q. Yang. Heterogeneous transfer learning for image classification. In AAAI, 2011.