{"title": "Learning word embeddings efficiently with noise-contrastive estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 2265, "page_last": 2273, "abstract": "Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the art method of Mikolov et al. (2013a). We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones.", "full_text": "Learning word embeddings ef\ufb01ciently with\n\nnoise-contrastive estimation\n\nAndriy Mnih\n\nDeepMind Technologies\n\nandriy@deepmind.com\n\nKoray Kavukcuoglu\nDeepMind Technologies\nkoray@deepmind.com\n\nAbstract\n\nContinuous-valued word embeddings learned by neural language models have re-\ncently been shown to capture semantic and syntactic information about words very\nwell, setting performance records on several word similarity tasks. The best results\nare obtained by learning high-dimensional embeddings from very large quantities\nof data, which makes scalability of the training method a critical factor.\nWe propose a simple and scalable new approach to learning word embeddings\nbased on training log-bilinear models with noise-contrastive estimation. 
Our ap-\nproach is simpler, faster, and produces better results than the current state-of-the-\nart method. We achieve results comparable to the best ones reported, which were\nobtained on a cluster, using four times less data and more than an order of mag-\nnitude less computing time. We also investigate several model types and \ufb01nd that\nthe embeddings learned by the simpler models perform at least as well as those\nlearned by the more complex ones.\n\n1\n\nIntroduction\n\nNatural language processing and information retrieval systems can often bene\ufb01t from incorporating\naccurate word similarity information. Learning word representations from large collections of un-\nstructured text is an effective way of capturing such information. The classic approach to this task\nis to use the word space model, representing each word with a vector of co-occurrence counts with\nother words [16]. Representations of this type suffer from data sparsity problems due to the ex-\ntreme dimensionality of the word count vectors. To address this, Latent Semantic Analysis performs\ndimensionality reduction on such vectors, producing lower-dimensional real-valued word embed-\ndings.\nBetter real-valued representations, however, are learned by neural language models which are trained\nto predict the next word in the sentence given the preceding words. Such representations have been\nused to achieve excellent performance on classic NLP tasks [4, 18, 17]. Unfortunately, few neural\nlanguage models scale well to large datasets and vocabularies due to use of hidden layers and the\ncost of computing normalized probabilities.\nRecently, a scalable method for learning word embeddings using light-weight tree-structured neural\nlanguage models was proposed in [10]. Although tree-structured models can be trained quickly, they\nare considerably more complex than the traditional (\ufb02at) models and their performance is sensitive\nto the choice of the tree over words [13]. 
Inspired by the excellent results of [10], we investigate\na simpler approach based on noise-contrastive estimation (NCE) [6], which enables fast training\nwithout the complexity of working with tree-structured models. We compound the speedup obtained\nby using NCE to eliminate the normalization costs during training, by using very simple variants of\nthe log-bilinear model [14], resulting in parameter update complexity linear in the word embedding\ndimensionality.\n\n1\n\n\fWe evaluate our approach on two analogy-based word similarity tasks [11, 10] and show that de-\nspite the considerably shorter training times our models outperform the Skip-gram model from [10]\ntrained on the same 1.5B-word Wikipedia dataset. Furthermore, we can obtain performance com-\nparable to that of the huge Skip-gram and CBOW models trained on a 125-CPU-core cluster after\ntraining for only four days on a single core using four times less training data. Finally, we explore\nseveral model architectures and discover that the simplest architectures learn embeddings that are at\nleast as good as those learned by the more complex ones.\n\n2 Neural probabilistic language models\n\nNeural probabilistic language models (NPLMs) specify the distribution for the target word w, given\na sequence of words h, called the context. In statistical language modelling, w is typically the next\nword in the sentence, while the context h is the sequence of words that precede w. Though some\nmodels such as recurrent neural language models [9] can handle arbitrarily long contexts, in this\npaper, we will restrict our attention to \ufb01xed-length contexts. 
Since we are interested in learning\nword representations as opposed to assigning probabilities to sentences, we do not need to restrict\nour models to predicting the next word, and can, for example, predict w from the words surrounding\nit as was done in [4].\nGiven a context h, an NPLM de\ufb01nes the distribution for the word to be predicted using the scoring\nfunction s\u03b8(w, h) that quanti\ufb01es the compatibility between the context and the candidate target\nword. Here \u03b8 are model parameters, which include the word embeddings. The scores are converted\nto probabilities by exponentiating and normalizing:\n\n(cid:80)\n\nP h\n\n\u03b8 (w) =\n\nexp(s\u03b8(w, h))\nw(cid:48) exp(s\u03b8(w(cid:48), h))\n\n.\n\n(1)\n\nUnfortunately both evaluating P h\n\u03b8 (w) and computing the corresponding likelihood gradient requires\nnormalizing over the entire vocabulary, which means that maximum likelihood training of such\nmodels takes time linear in the vocabulary size, and thus is prohibitively expensive for all but the\nsmallest vocabularies.\nThere are two main approaches to scaling up NPLMs to large vocabularies. The \ufb01rst one involves\nusing a tree-structured vocabulary with words at the leaves, resulting in training time logarithmic\nin the vocabulary size [15]. Unfortunately, this approach is considerably more involved than ML\ntraining and \ufb01nding well-performing trees is non-trivial [13]. The alternative is to keep the model but\nuse a different training strategy. Using importance sampling to approximate the likelihood gradient\nwas the \ufb01rst such method to be proposed [2, 3], and though it could produce substantial speedups, it\nsuffered from stability problems. Recently, a method for training unnormalized probabilistic models,\ncalled noise-contrastive estimation (NCE) [6], has been shown to be a stable and ef\ufb01cient way of\ntraining NPLMs [14]. 
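To make the normalization cost concrete, here is a minimal Python sketch of Eq. 1 (function and variable names are our own, not from the paper): computing any single probability requires exponentiating and summing the scores of every word in the vocabulary, which is why maximum-likelihood training is linear in the vocabulary size.

```python
import math

def softmax_probs(scores):
    """Convert raw compatibility scores s_theta(w, h), one per vocabulary
    word, into probabilities P_theta^h(w) as in Eq. 1. The normalizing sum
    touches every vocabulary word, which is the scalability bottleneck."""
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                          # O(V) normalization term
    return [e / z for e in exps]

# Toy 4-word vocabulary: all 4 scores must be touched to normalize.
probs = softmax_probs([2.0, 1.0, 0.5, -1.0])
```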
As it is also considerably simpler than the tree-based prediction approach, we\nuse NCE for training models in this paper. We will describe NCE in detail in Section 3.1.\n\n3 Scalable log-bilinear models\n\nWe are interested in highly scalable models that can be trained on billion-word datasets with vocab-\nularies of hundreds of thousands of words within a few days on a single core, which rules out most\ntraditional neural language models such as those from [1] and [4]. We will use the log-bilinear lan-\nguage model (LBL) [12] as our starting point, which unlike traditional NPLMs, does not have a hid-\nden layer and works by performing linear prediction in the word feature vector space. In particular,\nwe will use a more scalable version of LBL [14] that uses vectors instead of matrices for its context\nweights to avoid the high cost of matrix-vector multiplication. This model, like all other models\nwe will describe, has two sets of word representations: one for the target words (i.e.\nthe words\nbeing predicted) and one for the context words. We denote the target and the context representations\nfor word w with qw and rw respectively. Given a sequence of context words h = w1, .., wn, the\nmodel computes the predicted representation for the target word by taking a linear combination of\nthe context word feature vectors:\n\n\u02c6q(h) =\n\nci (cid:12) rwi,\n\n(2)\n\nn(cid:88)\n\ni=1\n\n2\n\n\fwhere ci is the weight vector for the context word in position i and (cid:12) denotes element-wise mul-\ntiplication. The context can consist of words preceding, following, or surrounding the word being\npredicted. The scoring function then computes the similarity between the predicted feature vector\nand one for word w:\n\ns\u03b8(w, h) = \u02c6q(h)(cid:62)qw + bw,\n\n(cid:80)n\n\n(3)\nwhere bw is a bias that captures the context-independent frequency of word w. 
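A minimal sketch of the prediction step (Eq. 2) and scoring step (Eq. 3) of the model just defined, using a toy two-dimensional embedding space; all names and numeric values are illustrative, not learned quantities.

```python
def predict_representation(context_words, r, c):
    """q_hat(h) = sum_i c_i ⊙ r_{w_i} (Eq. 2): element-wise products of each
    context word's feature vector with its position weight vector, summed."""
    dim = len(next(iter(r.values())))
    q_hat = [0.0] * dim
    for i, w in enumerate(context_words):
        for d in range(dim):
            q_hat[d] += c[i][d] * r[w][d]
    return q_hat

def score(q_hat, q_w, b_w):
    """s_theta(w, h) = q_hat(h)^T q_w + b_w (Eq. 3)."""
    return sum(x * y for x, y in zip(q_hat, q_w)) + b_w

r = {"the": [0.1, 0.2], "cat": [0.3, -0.1]}   # context representations r_w
q = {"sat": [0.5, 0.4]}                        # target representations q_w
c = [[1.0, 1.0], [0.5, 2.0]]                   # position-dependent weights c_i
q_hat = predict_representation(["the", "cat"], r, c)
s = score(q_hat, q["sat"], b_w=0.0)
```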
We will refer to this\nmodel as vLBL, for vector LBL.\nvLBL can be made even simpler by eliminating the position-dependent weights and computing the\npredicted feature vector simply by averaging the context word feature vectors: \u02c6q(h) = 1\ni=1 rwi.\nn\nThe result is something like a local topic model, which ignores the order of context words, potentially\nforcing it to capture more semantic information, perhaps at the expense of syntax. The idea of simply\naveraging context word feature vectors was introduced in [8], where it was used to condition on large\ncontexts such as entire documents. The resulting model can be seen as a non-hierarchical version of\nthe CBOW model of [10].\nAs our primary concern is learning word representations as opposed to creating useful language\nmodels, we are free to move away from the paradigm of predicting the target word from its context\nand, for example, do the reverse. This approach is motivated by the distributional hypothesis, which\nstates that words with similar meanings often occur in the same contexts [7] and thus suggests look-\ning for word representations that capture their context distributions. The inverse language modelling\napproach of learning to predict the context from the word is a natural way to do that. Some classic\nword-space models such as HAL and COALS [16] follow this approach by representing the context\ndistribution using a bag-of-words but they do not learn embeddings from this information.\nUnfortunately, predicting an n-word context requires modelling the joint distribution of n words,\nwhich is considerably harder than modelling the distribution of a single word. 
We make the task\ntractable by assuming that the words in different context positions are conditionally independent\ngiven the current word w:\n\nn(cid:89)\n\nP w\n\n\u03b8 (h) =\n\nP w\n\ni,\u03b8(wi).\n\n(4)\n\ni=1\n\nThough this assumption can be easily relaxed without giving up tractability by introducing some\nMarkov structure into the context distribution, we leave investigating this direction as future work.\nThe context word distributions P w\ni,\u03b8(wi) are simply vLBL models that condition on the current word\nand are de\ufb01ned by the scoring function\n\nsi,\u03b8(wi, w) = (ci (cid:12) rw)(cid:62)qwi + bwi.\n\n(5)\nThe resulting model can be seen as a Naive Bayes classi\ufb01er parameterized in terms of word embed-\ndings. As this model performs inverse language modelling, we will refer to it as ivLBL.\nAs with our traditional language model, we also consider the simpler version of this model without\nposition-dependent weights, de\ufb01ned by the scoring function\n\nsi,\u03b8(wi, w) = r(cid:62)\n\n(6)\nThe resulting model is the non-hierarchical counterpart of the Skip-gram model [10]. Note that\nunlike the tree-based models, such as those in the above paper, which only learn conditional embed-\ndings for words, in our models each word has both a conditional and a target embedding which can\npotentially capture complementary information. Tree-based models replace target embeddings with\nparameters vectors associated with the tree nodes, as opposed to individual words.\n\nw qwi + bwi.\n\n3.1 Noise-contrastive estimation\n\nWe train our models using noise-contrastive estimation, a method for \ufb01tting unnormalized models\n[6], adapted to neural language modelling in [14]. NCE is based on the reduction of density estima-\ntion to probabilistic binary classi\ufb01cation. 
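The factorized inverse-model scoring functions of Eqs. 5 and 6 can be sketched as follows, covering both the position-dependent and position-independent variants; the names and toy values are illustrative.

```python
def ivlbl_score(i, w_i, w, r, q, b, c=None):
    """Score context word w_i at position i given the current word w.
    With position weights c:   (c_i ⊙ r_w)^T q_{w_i} + b_{w_i}   (Eq. 5)
    Without them (c is None):   r_w^T q_{w_i} + b_{w_i}          (Eq. 6)"""
    rw = r[w]
    if c is not None:
        rw = [c[i][d] * rw[d] for d in range(len(rw))]
    return sum(x * y for x, y in zip(rw, q[w_i])) + b[w_i]

r = {"sat": [0.5, 0.4]}    # representation of the current (conditioning) word
q = {"cat": [0.3, -0.1]}   # target representations of candidate context words
b = {"cat": 0.1}           # context-word bias
s_indep = ivlbl_score(0, "cat", "sat", r, q, b)                  # Eq. 6
s_dep = ivlbl_score(0, "cat", "sat", r, q, b, c=[[2.0, 1.0]])    # Eq. 5
```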
The basic idea is to train a logistic regression classi\ufb01er to\ndiscriminate between samples from the data distribution and samples from some \u201cnoise\u201d distribu-\ntion, based on the ratio of probabilities of the sample under the model and the noise distribution. The\n\n3\n\n\fmain advantage of NCE is that it allows us to \ufb01t models that are not explicitly normalized making\nthe training time effectively independent of the vocabulary size. Thus, we will be able to drop the\nnormalizing factor from Eq. 1, and simply use exp(s\u03b8(w, h)) in place of P h\n\u03b8 (w) during training. The\nperplexity of NPLMs trained using this approach has been shown to be on par with those trained\nwith maximum likelihood learning, but at a fraction of the computational cost.\nSuppose we would like to learn the distribution of words for some speci\ufb01c context h, denoted by\nP h(w). To do that, we create an auxiliary binary classi\ufb01cation problem, treating the training data as\npositive examples and samples from a noise distribution Pn(w) as negative examples. We are free\nto choose any noise distribution that is easy to sample from and compute probabilities under, and\nthat does not assign zero probability to any word. We will use the (global) unigram distribution of\nthe training data as the noise distribution, a choice that is known to work well for training language\nmodels. If we assume that noise samples are k times more frequent than data samples, the probability\nthat the given sample came from the data is P h(D = 1|w) =\nd (w)+kPn(w). 
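The noise distribution and the classification probability just described can be sketched as follows; the corpus, names, and constants are illustrative, and the unigram distribution is estimated by simple counting.

```python
from collections import Counter

def unigram_noise(corpus_tokens):
    """Empirical (global) unigram distribution of the training data, used
    here as the NCE noise distribution Pn(w)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def p_data_given_word(p_d, p_n, k):
    """P(D = 1 | w) = P_d(w) / (P_d(w) + k * Pn(w)): the probability that a
    sample came from the data when noise samples are k times more frequent."""
    return p_d / (p_d + k * p_n)

Pn = unigram_noise("the cat sat on the mat".split())
post = p_data_given_word(p_d=0.2, p_n=Pn["the"], k=5)
```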
Our estimate of this\nP h\nprobability is obtained by using our model distribution in place P h\nd :\n\nd (w)\n\nP h\n\nP h(D = 1|w, \u03b8) =\n\nP h\n\n\u03b8 (w)\n\nP h\n\n= \u03c3 (\u2206s\u03b8(w, h)) ,\n\n\u03b8 (w) + kPn(w)\n\n(7)\nwhere \u03c3(x) is the logistic function and \u2206s\u03b8(w, h) = s\u03b8(w, h) \u2212 log(kPn(w)) is the difference in\nthe scores of word w under the model and the (scaled) noise distribution. The scaling factor k in\nfront of Pn(w) accounts for the fact that noise samples are k times more frequent than data samples.\nNote that in the above equation we used s\u03b8(w, h) in place of log P h\n\u03b8 (w), ignoring the normalization\nterm, because we are working with an unnormalized model. We can do this because the NCE\nobjective encourages the model to be approximately normalized and recovers a perfectly normalized\nmodel if the model class contains the data distribution [6].\nWe \ufb01t the model by maximizing the log-posterior probability of the correct labels D averaged over\nthe data and noise samples:\nJ h(\u03b8) =EP h\n=EP h\n\n(cid:2)log P h(D = 1|w, \u03b8)(cid:3) + kEPn\n\n[log \u03c3 (\u2206s\u03b8(w, h))] + kEPn [log (1 \u2212 \u03c3 (\u2206s\u03b8(w, h)))] ,\n\n(cid:2)log P h(D = 0|w, \u03b8)(cid:3)\n\n(8)\n\nd\n\nd\n\nIn practice, the expectation over the noise distribution is approximated by sampling. Thus, we\nestimate the contribution of a word / context pair w, h to the gradient of Eq. 8 by generating k noise\nsamples {xi} and computing\n\n(cid:21)\n\n(cid:20)\n\n\u03b8 (w) \u2212 k(cid:88)\n\ni=1\n\nJ h,w(\u03b8) = (1 \u2212 \u03c3 (\u2206s\u03b8(w, h)))\n\n\u2202\n\u2202\u03b8\n\n\u2202\n\u2202\u03b8\n\nlog P h\n\n\u03c3 (\u2206s\u03b8(xi, h))\n\nlog P h\n\n\u03b8 (xi)\n\n.\n\n(9)\n\n\u2202\n\u2202\u03b8\n\nNote that the gradient in Eq. 
9 involves a sum over k noise samples instead of a sum over the entire\nvocabulary, making the NCE training time linear in the number of noise samples and independent\nof the vocabulary size. As we increase the number of noise samples k, this estimate approaches\nthe likelihood gradient of the normalized model, allowing us to trade off computation cost against\nestimation accuracy [6].\nNCE shares some similarities with a training method for non-probabilistic neural language models\nthat involves optimizing a margin-based ranking objective [4]. As that approach is non-probabilistic,\nit is outside the scope of this paper, though it would be interesting to see whether it can be used to\nlearn competitive word embeddings.\n\n4 Evaluating word embeddings\n\nUsing word embeddings learned by neural language models outside of the language modelling con-\ntext is a relatively recent development. An early example of this is the multi-layer neural network\nof [4] trained to perform several NLP tasks which represented words exclusively in terms of learned\nword embeddings. [18] provided the \ufb01rst comparison of several word embeddings learned with dif-\nferent methods and showed that incorporating them into established NLP pipelines can boost their\nperformance.\n\n4\n\n\fRecently the focus has shifted towards evaluating such representations more directly, instead of mea-\nsuring their effect on the performance of larger systems. Microsoft Research (MSR) has released\ntwo challenge sets: a set of sentences each with a missing word to be \ufb01lled in [20] and a set of\nanalogy questions [11], designed to evaluate semantic and syntactic content of word representa-\ntions respectively. Another dataset, consisting of semantic and syntactic analogy questions has been\nreleased by Google [10].\nIn this paper we will concentrate on the two analogy-based challenge sets, which consist of questions\n\u201d, denoted as a : b \u2192 c : ? . 
The task is to identify the held-out\nof the form \u201ca is to b is as c is to\nfourth word, with only exact word matches deemed correct. Word embeddings learned by neural\nlanguage models have been shown to perform very well on these datasets when using the following\nvector-similarity-based protocol for answering the questions. Suppose (cid:126)w is the representation vector\nfor word w normalized to unit norm. Then, following [11], we answer a : b \u2192 c : ? , by \ufb01nding the\nword d\u2217 with the representation closest to (cid:126)b \u2212 (cid:126)a + (cid:126)c according to cosine similarity:\n\nd\u2217 = arg max\n\nx\n\n((cid:126)b \u2212 (cid:126)a + (cid:126)c)(cid:62)(cid:126)x\n(cid:107)(cid:126)b \u2212 (cid:126)a + (cid:126)c(cid:107) .\n\n(10)\n\nWe discovered that reproducing the results reported in [10] and [11] for publicly available word\nembeddings required excluding b and c from the vocabulary when looking for d\u2217 using Eq. 10,\nthough that was not clear from the papers. To see why this is necessary, we can rewrite Eq. 10 as\n\nd\u2217 = arg max\n\n(cid:126)b(cid:62)(cid:126)x \u2212 (cid:126)a(cid:62)(cid:126)x + (cid:126)c(cid:62)(cid:126)x\n\nx\n\n(11)\n\nand notice that setting x to b or c maximizes the \ufb01rst or third term respectively (since the vectors are\nnormalized), resulting in a high similarity score. This equation suggests the following interpretation\nof d\u2217: it is simply the word with the representation most similar to (cid:126)b and (cid:126)c and dissimilar to (cid:126)a, which\nmakes it quite natural to exclude b and c themselves from consideration.\n\n5 Experimental evaluation\n\n5.1 Datasets\n\nWe evaluated our word embeddings on two analogy-based word similarity tasks released recently\nby Google and Microsoft Research that we described in Section 4. We could not train on the data\nused for learning the embeddings in the original papers as it was not readily available. 
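This answering protocol, Eq. 10 with b and c excluded from consideration as just described, can be sketched as follows; the toy vectors are illustrative placeholders, not learned embeddings.

```python
import math

def answer_analogy(a, b, c, vectors):
    """Answer a : b -> c : ? by finding the word whose unit-norm vector is
    closest to b - a + c under cosine similarity (Eq. 10). The denominator
    in Eq. 10 is constant over candidates, so only the dot product matters.
    Excluding b and c from the candidates is essential (see Eq. 11)."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    va, vb, vc = unit(vectors[a]), unit(vectors[b]), unit(vectors[c])
    target = [vb[d] - va[d] + vc[d] for d in range(len(va))]
    best, best_sim = None, -float("inf")
    for w, v in vectors.items():
        if w in (b, c):                    # otherwise b or c trivially wins
            continue
        sim = sum(t * x for t, x in zip(target, unit(v)))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

vectors = {"man": [1.0, 0.0], "king": [1.0, 1.0],
           "woman": [0.0, 1.0], "queen": [-0.2, 1.0]}
```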
[10] used the\nproprietary Google News corpus consisting of 6 billion words, while the 320-million-word training\nset used in [11] is a compilation of several Linguistic Data Consortium corpora, some of which\navailable only to their subscribers.\nInstead, we decided to use two freely-available datasets: the April 2013 dump of English Wikipedia\nand the collection of about 500 Project Gutenberg texts that form the canonical training data for\nthe MSR Sentence Completion Challenge [19]. We preprocessed Wikipedia by stripping out the\nXML formatting, mapping all words to lowercase, and replacing all digits with 7, leaving us with\n1.5 billion words. Keeping all words that occurred at least 10 times resulted in a vocabulary of\nabout 872 thousand words. Such a large vocabulary was used to demonstrate the scalability of our\nmethod as well as to ensure that the models will have seen almost all the words they will be tested\non. When preprocessing the 47M-word Gutenberg dataset, we kept all words that occurred 5 or\nmore times, resulting in an 80-thousand-word vocabulary. Note that many words used for testing\nthe representations are missing from this dataset, which greatly limits the accuracy achievable when\nusing it. To make our results directly comparable to those in other papers, we report accuracy scores\ncomputed using Eq. 10, excluding the second and the third word in the question from consideration,\nas explained in Section 4.\n\n5.2 Details of training\n\nAll models were trained on a single core, using minibatches of size 100 and the initial learning\nrate of 3 \u00d7 10\u22122. No regularization was used. Initially we used a validation-set based learning\nrate adaptation scheme described in [14], which halves the learning rate whenever the validation set\n\n5\n\n\fTable 1: Accuracy in percent on word similarity tasks. The models had 100D word embeddings\nand were trained to predict 5 words on both sides of the current word on the 1.5B-word Wikipedia\ndataset. 
Skip-gram(*) is our implementation of the model from [10]. ivLBL is the inverse language\nmodel without position-dependent weights. NCEk denotes NCE training using k noise samples.\n\nGOOGLE\nSYNTACTIC OVERALL\n\nMODEL\nSKIP-GRAM(*)\nIVLBL+NCE1\nIVLBL+NCE2\nIVLBL+NCE3\nIVLBL+NCE5\nIVLBL+NCE10\nIVLBL+NCE25\n\nSEMANTIC\n\n28.0\n28.4\n30.8\n34.2\n37.2\n38.9\n40.0\n\n36.4\n42.1\n44.1\n43.6\n44.7\n45.0\n46.1\n\nMSR\n\n31.7\n34.9\n36.2\n36.3\n36.7\n36.0\n36.7\n\nTIME\n(HOURS)\n12.3\n3.1\n4.0\n5.1\n7.3\n12.2\n26.8\n\n32.6\n35.9\n38.0\n39.4\n41.3\n42.2\n43.3\n\nTable 2: Accuracy in percent on word similarity tasks for large models. The Skip-gram\u2020 and\nCBOW\u2020 results are from [10].\nivLBL models predict 5 words before and after the current word.\nvLBL models predict the current word from the 5 preceding and 5 following words.\n\nMODEL\nSKIP-GRAM\u2020\nSKIP-GRAM\u2020\nSKIP-GRAM\u2020\nIVLBL+NCE25\nIVLBL+NCE25\nIVLBL+NCE25\nIVLBL+NCE25\nIVLBL+NCE25\nIVLBL+NCE25\nCBOW\u2020\nCBOW\u2020\nVLBL+NCE5\nVLBL+NCE5\nVLBL+NCE5\nVLBL+NCE5\nVLBL+NCE5\n\nEMBED.\n\nDIM.\n300\n300\n1000\n300\n300\n300\u00d72\n100\n100\n100\u00d72\n300\n1000\n300\n100\n300\n600\n600\u00d72\n\nTRAINING\nSET SIZE\n\n1.6B\n785M\n\n6B\n1.5B\n1.5B\n1.5B\n1.5B\n1.5B\n1.5B\n1.6B\n6B\n1.5B\n1.5B\n1.5B\n1.5B\n1.5B\n\nSEM.\n52.2\n56.7\n66.1\n61.2\n63.6\n65.2\n52.6\n55.9\n59.3\n16.1\n57.3\n40.3\n45.0\n54.2\n57.3\n60.5\n\nGOOGLE\nSYN. OVERALL\n55.1\n52.2\n65.1\n58.4\n61.8\n63.0\n48.5\n50.1\n54.2\n52.6\n68.9\n55.4\n56.8\n64.8\n66.0\n67.1\n\n53.8\n55.5\n65.6\n59.7\n62.6\n64.0\n50.3\n53.2\n56.5\n36.1\n63.7\n48.5\n51.5\n60.0\n62.1\n64.1\n\nMSR\n\n48.8\n52.4\n54.2\n39.2\n42.3\n44.6\n\n48.7\n52.3\n58.1\n59.1\n60.8\n\nTIME\n(DAYS)\n2.0\n2.5\n2.5\u00d7125\n1.2\n4.1\n4.1\n1.2\n2.9\n2.9\n0.6\n2\u00d7140\n0.3\n2.0\n2.0\n2.0\n3.0\n\nperplexity failed to improve after some time, but found that it led to poor representations despite\nachieving low perplexity scores, which was likely due to undertraining. 
The linear learning rate\nschedule described in [10] produced better results. Unfortunately, using it requires knowing in\nadvance how many passes through the data will be performed, which is not always possible or\nconvenient. Perhaps more seriously, this approach might result in undertraining of representations\nfor rare words because all representation share the same learning rate.\nAdaGrad [5] provides an automatic way of dealing with this issue. Though AdaGrad has already\nbeen used to train neural language models in a distributed setting [10], we found that it helped\nto learn better word representations even using a single CPU core. We reduced the potentially\nprohibitive memory requirements of AdaGrad, which requires storing a running sum of squared\ngradient values for each parameter, by using the same learning rate for all dimensions of a word\nembedding. Thus we store only one extra number per embedding vector, which is helpful when\ntraining models with hundreds of millions of parameters.\n\n5.3 Results\n\nInspired by the excellent performance of tree-based models of [10], we started by comparing the\nbest-performing model from that paper, the Skip-gram, to its non-hierarchical counterpart, ivLBL\nwithout position-dependent weights, proposed in Section 3, trained using NCE. As there is no pub-\nlicly available Skip-gram implementation, we wrote our own. Our implementation is faithful to the\ndescription in the paper, with one exception. To speed up training, instead of predicting all context\nwords around the current word, we predict only one context word, sampled at random using the\n\n6\n\n\fTable 3: Results for various models trained for 20 epochs on the 47M-word Gutenberg dataset\nusing NCE5 with AdaGrad. (D) and (I) denote models with and without position-dependent weights\nrespectively. For each task, the left (right) column give the accuracy obtained using the conditional\n(target) word embeddings. 
nL (nR) denotes n words on the left (right) of the current word.\n\nMODEL\nVLBL(D)\nVLBL(D)\nVLBL(D)\nVLBL(I)\nVLBL(I)\nVLBL(I)\nIVLBL(D)\nIVLBL(I)\n\nCONTEXT\n\nSIZE\n\n5L + 5R\n\n5L + 5R\n5L + 5R\n\n5L + 5R\n\n10L\n10R\n\n10L\n10R\n\nSEMANTIC\n2.6\n2.4\n1.9\n2.8\n2.4\n2.7\n2.9\n3.0\n2.8\n2.5\n2.6\n2.3\n2.8\n2.3\n2.6\n2.8\n\nGOOGLE\nSYNTACTIC\n23.8\n24.7\n22.1\n14.8\n24.1\n13.1\n29.6\n27.5\n16.1\n23.5\n24.6\n16.2\n15.1\n13.0\n26.8\n26.8\n\nOVERALL\n14.2\n14.6\n12.9\n9.3\n14.2\n8.4\n17.5\n16.4\n10.1\n14.0\n14.6\n9.9\n9.5\n8.1\n15.8\n15.9\n\nMSR\n\n23.4\n20.9\n8.8\n22.9\n19.8\n10.0\n14.5\n21.4\n\n23.1\n9.0\n23.0\n24.2\n10.1\n20.3\n14.0\n21.0\n\nTIME\n\n(HOURS)\n\n2.6\n2.6\n2.6\n2.3\n2.3\n2.1\n1.2\n1.2\n\nnon-uniform weighting scheme from the paper. Note that our models are also trained using the same\ncontext-word sampling approach. To make the comparison fair, we did not use AdaGrad for our\nmodels in these experiments, using the linear learning rate schedule as in [10] instead.\nTable 1 shows the results on the word similarity tasks for the two models trained on the Wikipedia\ndataset. We ran NCE training several times with different numbers of noise samples to investigate the\neffect of this parameter on the representation quality and training time. The models were trained for\nthree epochs, which in our experience provided a reasonable compromise between training time and\nrepresentation quality.1 All NCE-trained models outperformed the Skip-gram. Accuracy steadily\nincreased with the number of noise samples used, as did the training time. The best compromise\nbetween running time and performance seems to be achieved with 5 or 10 noise samples.\nWe then experimented with training models using AdaGrad and found that it signi\ufb01cantly improved\nthe quality of embeddings obtained when training with 10 or 25 noise samples, increasing the se-\nmantic score for the NCE25 model by over 10 percentage points. 
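One plausible reading of the memory-saving AdaGrad variant from Section 5.2 is sketched below: a single running sum of squared gradients is kept per embedding vector rather than per dimension, so only one extra number is stored per vector. Pooling the squared gradients by summing over dimensions is our assumption, as are all names and constants.

```python
import math

def adagrad_update_embedding(vec, grad, accum, word, lr=0.05, eps=1e-8):
    """Update one embedding vector in place with a per-vector AdaGrad rule:
    accum[word] holds a single accumulated squared-gradient scalar shared by
    all dimensions of that word's embedding (one extra number per vector)."""
    accum[word] += sum(g * g for g in grad)        # one scalar per vector
    step = lr / (math.sqrt(accum[word]) + eps)     # shared learning rate
    for d in range(len(vec)):
        vec[d] -= step * grad[d]
    return vec

accum = {"w": 0.0}
v = adagrad_update_embedding([1.0, 1.0], [3.0, 4.0], accum, "w")
```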
Encouraged by this, we trained\ntwo ivLBL models with position-independent weights and different embedding dimensionalities\nfor several days using this approach. As some of the best results in [10] were obtained with the\nCBOW model, we also trained its non-hierarchical counterpart from Section 3, vLBL with position-\nindependent weights, using 100/300/600-dimensional embeddings and NCE with 5 noise samples,\nfor shorter training times. Note that due to the unavailability of the Google News dataset used in that\npaper, we trained on Wikipedia. The scores for ivLBL and vLBL models were obtained using the\nconditional word and target word representations respectively, while the scores marked with d \u00d7 2\nwere obtained by concatenating the two word representations, after normalizing them.\nThe results, reported in Table 2, show that our models substantially outperform their hierarchical\ncounterparts when trained using comparable amounts of time and data. For example, the 300D\nivLBL model trained for just over a day, achieves accuracy scores 3-9 percentage points better than\nthe 300D Skip-gram trained on the same amount of data for almost twice as long. The same model\ntrained for four days achieves accuracy scores that are only 2-4 percentage points lower than those\nof the 1000D Skip-gram trained on four times as much data using 75 times as many CPU cycles.\nBy computing word similarity scores using the conditional and the target word representations con-\ncatenated together, we can bring the accuracy gap down to 2 percentage points at no additional\ncomputational cost. The accuracy achieved by vLBL models as compared to that of CBOW models\nfollows a similar pattern. 
Once again our models achieve better accuracy scores faster and we can\nget within 3 percentage points of the result obtained on a cluster using much less data and far less\ncomputation.\nTo determine whether we were crippling our models by using position-independent weight, we\nevaluated all model architectures described in Section 3 on the Gutenberg corpus. The models were\ntrained for 20 epochs using NCE5 and AdaGrad. We report the accuracy obtained with both condi-\ntional and target representation (left and right columns respectively) for each of the models in Ta-\n\n1We checked this by training the Skip-gram model for 10 epochs, which did not result in a substantial\n\nincrease in accuracy.\n\n7\n\n\fTable 4: Accuracy on the MSR Sentence Completion Challenge dataset.\n\nMODEL\n\nCONTEXT\n\nLATENT\n\nLSA [19]\n\nSKIP-GRAM [10]\n\nLBL [14]\nIVLBL\nIVLBL\nIVLBL\n\nSIZE\n\nSENTENCE\n10L+10R\n\n10L\n\n5L+5R\n5L+5R\n5L+5R\n\nDIM\n300\n640\n300\n100\n300\n600\n\nPERCENT\nCORRECT\n\n49\n48.0\n54.7\n51.0\n55.2\n55.5\n\nble 3. Perhaps surprisingly, the results show that representations learned with position-independent\nweights, designated with (I), tend to perform better than the ones learned with position-dependent\nweights. The difference is small for traditional language models (vLBL), but is quite pronounced\nfor the inverse language model (ivLBL). The best-performing representations were learned by the\ntraditional language model with the context surrounding the word and position-independent weights.\nSentence completion: We also applied our approach to the MSR Sentence Completion Challenge\n[19], where the task is to complete each of the 1,040 test sentences by picking the missing word\nfrom the list of \ufb01ve candidate words. Using the 47M-word Gutenberg dataset, preprocessed as in\n[14], as the training set, we trained several ivLBL models with NCE5 to predict 5 words preceding\nand 5 following the current word. 
To complete a sentence, we compute the probability of the 10\nwords around the missing word (using Eq. 4) for each of the candidate words and pick the one\nproducing the highest value. The resulting accuracy scores, given in Table 4 along with those of\nseveral baselines, show that ivLBL models perform very well. Even the model with the lowest\nembedding dimensionality of 100, achieves 51.0% correct, compared to 48.0% correct reported in\n[10] for the Skip-gram model with 640D embeddings. The 55.5% correct achieved by the model\nwith 600D embeddings is also better than the best single-model score on this dataset in the literature\n(54.7% in [14]).\n\n6 Discussion\n\nWe have proposed a new highly scalable approach to learning word embeddings which involves\ntraining lightweight log-bilinear language models with noise-contrastive estimation. It is simpler\nthan the tree-based language modelling approach of [10] and produces better-performing embed-\ndings faster. Embeddings learned using a simple single-core implementation of our method achieve\naccuracy scores comparable to the best reported ones, which were obtained on a large cluster using\nfour times as much data and almost two orders of magnitude as many CPU cycles. The scores we\nreport in this paper are also easy to compare to, because we trained our models only on publicly\navailable data.\nSeveral promising directions remain to be explored. [8] have recently proposed a way of learning\nmultiple representations for each word by clustering the contexts the word occurs in and allocating\na different representation for each cluster, prior to training the model. As ivLBL predicts the context\nfrom the word, it naturally allows using multiple context representations per current word, resulting\nin a more principled approach to the problem based on mixture modeling. 
Sharing representations between the context and the target words is also worth investigating, as it might result in better-estimated representations for rare words.

Acknowledgments

We thank Volodymyr Mnih for his helpful comments.

References

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

[2] Yoshua Bengio and Jean-Sébastien Senécal. Quick training of probabilistic neural nets by importance sampling. In AISTATS'03, 2003.

[3] Yoshua Bengio and Jean-Sébastien Senécal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.

[4] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[6] M. U. Gutmann and A. Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307–361, 2012.

[7] Zellig S. Harris. Distributional structure. Word, 1954.

[8] Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 873–882, 2012.

[9] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model.
In Eleventh Annual Conference of the International Speech Communication Association, 2010.

[10] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013.

[11] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT, 2013.

[12] A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648, 2007.

[13] Andriy Mnih and Geoffrey Hinton. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, volume 21, 2009.

[14] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751–1758, 2012.

[15] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In AISTATS'05, pages 246–252, 2005.

[16] Magnus Sahlgren. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University, 2006.

[17] R. Socher, C. C. Lin, A. Y. Ng, and C. D. Manning. Parsing natural scenes and natural language with recursive neural networks. In International Conference on Machine Learning (ICML), 2011.

[18] J. Turian, L. Ratinov, and Y. Bengio. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, 2010.

[19] G. Zweig and C. J. C. Burges. The Microsoft Research Sentence Completion Challenge.
Technical Report MSR-TR-2011-129, Microsoft Research, 2011.

[20] Geoffrey Zweig and Chris J. C. Burges. A challenge set for advancing language modeling. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 29–36, 2012.