{"title": "FRAGE: Frequency-Agnostic Word Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 1334, "page_last": 1345, "abstract": "Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks. Although it is widely accepted that words with similar semantics should be close to each other in the embedding space, we find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space, and the embedding of a rare word and a popular word can be far from each other even if they are semantically similar. This makes learned word embeddings ineffective, especially for rare words, and consequently limits the performance of these neural network models. In order to mitigate the issue, in this paper, we propose a neat, simple yet effective adversarial training method to blur the boundary between the embeddings of high-frequency words and low-frequency words. We conducted comprehensive studies on ten datasets across four natural language processing tasks, including word similarity, language modeling, machine translation and text classification. 
Results show that we achieve higher performance than the baselines in all tasks.", "full_text": "FRAGE: Frequency-Agnostic Word Representation\n\nChengyue Gong1\n\nDi He2\n\nXu Tan3\n\ncygong@pku.edu.cn\n\ndi_he@pku.edu.cn\n\nxu.tan@microsoft.com\n\nTao Qin3\n\nLiwei Wang2,4\n\nTie-Yan Liu3\n\ntaoqin@microsoft.com\n\nwanglw@cis.pku.edu.cn\n\ntie-yan.liu@microsoft.com\n\n1Peking University\n\n2Key Laboratory of Machine Perception, MOE, School of EECS, Peking University\n\n4Center for Data Science, Peking University, Beijing Institute of Big Data Research\n\n3Microsoft Research Asia\n\nAbstract\n\nContinuous word representation (aka word embedding) is a basic building block\nin many neural network-based models used in natural language processing tasks.\nAlthough it is widely accepted that words with similar semantics should be close\nto each other in the embedding space, we \ufb01nd that word embeddings learned in\nseveral tasks are biased towards word frequency: the embeddings of high-frequency\nand low-frequency words lie in different subregions of the embedding space, and\nthe embedding of a rare word and a popular word can be far from each other even\nif they are semantically similar. This makes learned word embeddings ineffective,\nespecially for rare words, and consequently limits the performance of these neural\nnetwork models. In this paper, we develop FRequency-AGnostic word Embedding\n(FRAGE) which is a neat, simple yet effective way to learn word representation\nusing adversarial training. We conducted comprehensive studies on ten datasets\nacross four natural language processing tasks, including word similarity, language\nmodeling, machine translation, and text classi\ufb01cation. 
Results show that with FRAGE, we achieve higher performance than the baselines in all tasks.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1 Introduction\n\nWord embeddings, which are distributed and continuous vector representations for word tokens, have been one of the basic building blocks for many neural network-based models used in natural language processing (NLP) tasks, such as language modeling [18, 16], text classification [24, 7] and machine translation [4, 5, 40, 38, 11]. Different from classic one-hot representation, the learned word embeddings contain semantic information which can measure the semantic similarity between words [28], and can also be transferred into other learning tasks [29, 3].\nIn deep learning approaches for NLP tasks, word embeddings act as the inputs of the neural network and are usually trained together with neural network parameters. As the inputs of the neural network, word embeddings carry all the information of words that will be further processed by the network, and the quality of embeddings is critical and highly impacts the final performance of the learning task [15]. Unfortunately, we find the word embeddings learned by many deep learning approaches are far from perfect. As shown in Figure 1(a) and 1(b), in the embedding space learned by the word2vec model, the nearest neighbors of the word \u201cPeking\u201d include \u201cquickest\u201d, \u201cmulticellular\u201d, and \u201cepigenetic\u201d, which are not semantically similar, while semantically related words such as \u201cBeijing\u201d and \u201cChina\u201d are far from it. Similar phenomena are observed from the word embeddings learned from translation tasks.\nWith a careful study, we find a more general problem which is rooted in low-frequency words in the text corpus. 
Without any confusion, we also call high-frequency words as popular words and\ncall low-frequency words as rare words. As is well known [23], the frequency distribution of words\nroughly follows a simple mathematical form known as Zipf\u2019s law. When the size of a text corpus\ngrows, the frequency of rare words is much smaller than popular words while the number of unique\nrare words is much larger than popular words. Interestingly, the learned embeddings of rare words\nand popular words behave differently. (1) In the embedding space, a popular word usually has\nsemantically related neighbors, while a rare word usually does not. Moreover, the nearest neighbors\nof more than 85% rare words are rare words. (2) Word embeddings encode frequency information.\nAs shown in Figure 1(a) and 1(b), the embeddings of rare words and popular words actually lie in\ndifferent subregions of the space. Such a phenomenon is also observed in [29].\nWe argue that the different behaviors of the embeddings of popular words and rare words are\nproblematic. First, such embeddings will affect the semantic understanding of words. We observe\nmore than half of the rare words are nouns or variants of popular words. Those rare words should\nhave similar meanings or share the same topics with popular words. Second, the neighbors of a large\nnumber of rare words are semantically unrelated rare words. To some extent, those word embeddings\nencode more frequency information than semantic information which is not good from the view\nof semantic understanding. It will consequently limit the performance of down-stream tasks using\nthe embeddings. For example, in text classi\ufb01cation, it cannot be well guaranteed that the label of a\nsentence does not change when you replace one popular/rare word in the sentence by its rare/popular\nalternatives.\nTo address this problem, in this paper, we propose an adversarial training method to learn FRequency-\nAGnostic word Embedding (FRAGE). 
For a given NLP task, in addition to minimizing the task-speci\ufb01c\nloss by optimizing the task-speci\ufb01c parameters together with word embeddings, we introduce another\ndiscriminator, which takes a word embedding as input and classi\ufb01es whether it is a popular/rare\nword. The discriminator optimizes its parameters to maximize its classi\ufb01cation accuracy, while word\nembeddings are optimized towards a low task-dependent loss as well as fooling the discriminator\nto misclassify the popular and rare words. When the whole training process converges and the\nsystem achieves an equilibrium, the discriminator cannot well differentiate popular words from rare\nwords. Consequently, rare words lie in the same region as and are mixed with popular words in the\nembedding space. Then FRAGE will catch better semantic information and help the task-speci\ufb01c\nmodel to perform better.\nWe conduct experiments on four types of NLP tasks, including three word similarity tasks, two\nlanguage modeling tasks, three sentiment classi\ufb01cation tasks, and two machine translation tasks to\ntest our method. In all tasks, FRAGE outperforms the baselines. Speci\ufb01cally, in language modeling\nand machine translation, we achieve better performance than the state-of-the-art results on PTB, WT2\nand WMT14 English-German datasets.\n\n2 Background\n\n2.1 Word Representation\n\nWords are the basic units of natural languages, and distributed word representations (i.e., word\nembeddings) are the basic units of many models in NLP tasks including language modeling [18, 16]\nand machine translation [4, 5, 40, 38, 11]. 
It has been demonstrated that word representations learned\nfrom one task can be transferred to other tasks and achieve competitive performance [3].\nWhile word embeddings play an important role in neural network-based models in NLP and achieve\ngreat success, one technical challenge is that the embeddings of rare words are dif\ufb01cult to train due\nto their low frequency of occurrences. [35] develops a novel way to split each word into sub-word\nunits which is widely used in neural machine translation. However, the low-frequency sub-word units\nare still dif\ufb01cult to train: [31] provides a comprehensive study which shows that the rare (sub)words\nare usually under-estimated in neural machine translation: during inference step, the model tends to\nchoose popular words over their rare alternatives.\n\n2\n\n\f2.2 Adversarial Training\n\nThe basic idea of our work to address the above problem is adversarial training, in which two or\nmore models learn together by pursuing competing goals. A representative example of adversarial\ntraining is Generative Adversarial Networks (GANs) [13, 34] for image generation [33, 42, 2], in\nwhich a discriminator and a generator compete with each other: the generator aims to generate images\nsimilar to the natural ones, and the discriminator aims to detect the generated ones from the natural\nones. Recently, adversarial training has been successfully applied to NLP tasks [6, 22, 21]. [6, 22]\nintroduce an additional discriminator to differentiate the semantics learned from different languages\nin non-parallel bilingual data. [21] develops a discriminator to classify whether a sentence is created\nby human or generated by a model.\nOur proposed method is under the adversarial training framework but not exactly the conventional\ngenerator-discriminator approach since there is no generator in our scenario. 
For an NLP task and its\nneural network model (including word embeddings), we introduce a discriminator to differentiate\nembeddings of popular words and rare words; while the NN model aims to fool the discriminator and\nminimize the task-speci\ufb01c loss simultaneously.\nOur work is also weakly related to adversarial domain adaptation which attempts to mitigate the\nnegative effects of domain shift between training and testing [9, 36]. The difference between this\nwork and adversarial domain adaptation is that we do not target at the mismatch between training and\ntesting; instead, we aim to improve the effectiveness of word embeddings and consequently improve\nthe performance of end-to-end NLP tasks.\n\n3 Empirical Study\n\nIn this section, we study the embeddings of popular words and rare words based on the models trained\nfrom Google News corpora using word2vec 1 and trained from WMT14 English-German translation\ntask using Transformer [38]. The implementation details can be found in [12].\n\nExperimental Design In both tasks, we simply set the top 20% frequent words in vocabulary as\npopular words and denote the rest as rare words (roughly speaking, we set a word as a rare word if\nits relative frequency is lower than 10\u22126 in WMT14 dataset and 10\u22127 in Google News dataset). We\nhave tried other thresholds such as 10% or 25% and found the observations are similar.\nWe study whether the semantic relationship between two words is reasonable. To achieve this, we\nrandomly sampled some rare/popular words and checked the embeddings trained from different\ntasks. For each sampled word, we determined its nearest neighbors based on the cosine similarity\nbetween its embeddings and others\u2019.2 We also manually chose words which are semantically similar\nto it. 
For simplicity, for each word, we call the nearest words predicted from the embeddings model-predicted neighbors, and call our chosen words semantic neighbors.\n\nObservation To visualize word embeddings, we reduce their dimensionalities by SVD and plot two cases in Figure 1. More cases and other studies without dimensionality reduction can be found in Section 5.\nWe find that the embeddings trained from different tasks share some common patterns. For both tasks, more than 90% of model-predicted neighbors of rare words are rare words. For each rare word, the model-predicted neighbor is usually not semantically related to this word, and semantic neighbors we chose are far away from it in the embedding space. In contrast, the model-predicted neighbors of popular words are very reasonable.\nAs the patterns in rare words are different from those of popular words, we further check the whole embedding matrix to gain a general understanding. We also visualize the word embeddings using SVD by keeping the two directions with the two largest singular values as in [28, 30] and plot them in Figure 1(c) and 1(d). From the figure, we can see that the embeddings actually encode frequencies to a certain degree: the rare words and popular words lie in different regions after this linear projection, and thus they occupy different regions in the original embedding space. This strange phenomenon is also observed in other learned embeddings (e.g., CBOW and GloVe) and mentioned in [30].\n\n1 https://code.google.com/archive/p/word2vec/\n2 Cosine distance is the most popularly used metric in literature to measure semantic similarity [28, 32, 29]. We also have tried other metrics, e.g., Euclidean distance, and the phenomena still exist.\n\nFigure 1: Case study of the embeddings trained from WMT14 translation task using Transformer and trained from Google News dataset using word2vec is shown in (a) WMT En\u2192De Case and (b) Word2vec Case. (c) WMT En\u2192De and (d) Word2vec show the visualization of embeddings trained from WMT14 translation task using Transformer and trained from Google News dataset using word2vec. Red points represent rare words and blue points represent popular words. In (a) and (b), we highlight the semantic neighbors in bold.\n\nExplanation From the empirical study above, we can see that the occupied spaces of popular words and rare words are different and here we intuitively explain a possible reason. We simply take word2vec as an example which is trained by stochastic gradient descent. During training, the sample rate of a popular word is high and the embedding of a popular word updates frequently. For a rare word, the sample rate is low and its embedding rarely updates. According to our study, on average, the moving distance of the embedding for a popular word is twice as long as that of a rare word during training. As all word embeddings are usually initialized around the origin with a small variance, we observe that in the final model, the embeddings of rare words are still around the origin and the popular words have moved far away.\n\nDiscussion We have strong evidence that the current phenomena are problematic. First, according to our study,^3 in both tasks, more than half of the rare words are nouns, e.g., company names, city names. They may share some similar topics to popular entities, e.g., big companies and cities; around 10% of rare words include a hyphen (which is usually used to join popular words), and over 30% of rare words are different parts of speech (PoS) of popular words. These words should have mixed or similar semantics to some popular words. These facts show that rare words and popular words should lie in the same region of the embedding space, which is different from what we observed. 
Second, as we can see from the cases, for rare words, model-predicted neighbors are usually not semantically related words but frequency-related words (rare words). This shows that, for rare words, the embeddings encode more frequency information than semantic information. Such word embeddings are ill-suited to semantic understanding tasks, e.g., text classification, language modeling, language understanding, and translation.\n\n3 We use the POS tagger from Natural Language Toolkit, https://github.com/nltk.\n\n4 Our Method\n\nIn this section, we present our method to improve word representations. As we have a strong prior that many rare words should share the same region in the embedding space as popular words, the basic idea of our algorithm is to train the word embeddings in an adversarial framework: we introduce a discriminator to categorize word embeddings into two classes: popular ones or rare ones. We hope the learned word embeddings not only minimize the task-specific training loss but also fool the discriminator. By doing so, the frequency information is removed from the embedding and we call our method frequency-agnostic word embedding (FRAGE).\nWe first define some notations and then introduce our algorithm. We use three types of notation: embeddings, task-specific parameters/loss, and discriminator parameters/loss.\nDenote \u03b8emb \u2208 R^{d\u00d7|V|} as the word embedding matrix to be learned, where d is the dimension of the embedding vectors and |V| is the vocabulary size. Let Vpop denote the set of popular words and Vrare = V \\ Vpop denote the set of rare words. Then the embedding matrix \u03b8emb can be divided into two parts: \u03b8emb_pop for popular words and \u03b8emb_rare for rare words. Let \u03b8emb_w denote the embedding of word w. Let \u03b8model denote all the other task-specific parameters except word embeddings. For instance, for language modeling, \u03b8model is the parameters of the RNN or LSTM; for neural machine translation, \u03b8model is the parameters of the encoder, attention module, and decoder.\n\nFigure 2: The proposed learning framework includes a task-specific predictor and a discriminator, whose function is to classify rare and popular words. Both modules use word embeddings as the input.\n\nLet L_T(S; \u03b8model, \u03b8emb) denote the task-specific loss over a dataset S. Taking language modeling as an example, the loss L_T(S; \u03b8model, \u03b8emb) is defined as the negative log likelihood of the data:\n\nL_T(S; \u03b8model, \u03b8emb) = \u2212(1/|S|) \u2211_{y\u2208S} log P(y; \u03b8model, \u03b8emb),   (1)\n\nwhere y is a sentence.\nLet f_\u03b8D denote a discriminator with parameters \u03b8D, which takes a word embedding as input and outputs a confidence score between 0 and 1 indicating how likely the word is a rare word. Let L_D(V; \u03b8D, \u03b8emb) denote the loss of the discriminator:\n\nL_D(V; \u03b8D, \u03b8emb) = (1/|Vpop|) \u2211_{w\u2208Vpop} log f_\u03b8D(\u03b8emb_w) + (1/|Vrare|) \u2211_{w\u2208Vrare} log(1 \u2212 f_\u03b8D(\u03b8emb_w)).   (2)\n\nFollowing the principle of adversarial training, we develop a minimax objective to train the task-specific model (\u03b8model and \u03b8emb) and the discriminator (\u03b8D) as below:\n\nmin_{\u03b8model, \u03b8emb} max_{\u03b8D} L_T(S; \u03b8model, \u03b8emb) \u2212 \u03bb L_D(V; \u03b8D, \u03b8emb),   (3)\n\nwhere \u03bb is a coefficient to trade off the two loss terms. 
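As a rough illustration of Eq. (2) and Eq. (3), the sketch below evaluates the discriminator loss L_D and the combined objective on a toy vocabulary. It is not the authors' implementation: the logistic-regression parameters `w` and `b`, the helper names, the toy embeddings, and the choice to store embeddings as rows (the transpose of the d x |V| matrix in the text) are all assumptions made for this example. Only the default `lam = 0.1` is taken from the settings reported in Section 5.1.

```python
import numpy as np


def sigmoid(z):
    # Logistic function mapping real scores to (0, 1).
    return 1.0 / (1.0 + np.exp(-z))


def discriminator_loss(emb, is_popular, w, b):
    # emb: (num_words, d) array, one embedding per row (assumed layout).
    # is_popular: boolean mask of length num_words (True for popular words).
    # f is the confidence that a word is rare, matching the definition of f.
    f = sigmoid(emb @ w + b)
    eps = 1e-12  # guards against log(0)
    pop_term = np.mean(np.log(f[is_popular] + eps))
    rare_term = np.mean(np.log(1.0 - f[~is_popular] + eps))
    return pop_term + rare_term  # L_D in Eq. (2)


def total_objective(task_loss, emb, is_popular, w, b, lam=0.1):
    # L_T - lambda * L_D from Eq. (3); task_loss stands in for L_T.
    return task_loss - lam * discriminator_loss(emb, is_popular, w, b)


# Toy vocabulary: two popular and two rare words, separated along dimension 0.
emb = np.array([[-5.0, 0.0], [-5.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
is_popular = np.array([True, True, False, False])

# A discriminator that separates the two groups drives L_D far below
# the value 2 * log(0.5) obtained by an uninformative one (w = 0, b = 0).
sharp = discriminator_loss(emb, is_popular, np.array([1.0, 0.0]), 0.0)
blind = discriminator_loss(emb, is_popular, np.zeros(2), 0.0)
print(sharp, blind)
```

With the separating discriminator, L_D is roughly -10.0, so the embedding-side objective grows by about `lam * 10.0`; when the discriminator cannot tell the groups apart, L_D rises to 2 log 0.5, roughly -1.39. This is the sense in which the embeddings are pushed to mix popular and rare words.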
We can see that when the model parameter \u03b8model and the embedding \u03b8emb are fixed, the optimization of the discriminator \u03b8D becomes\n\nmax_{\u03b8D} \u2212\u03bb L_D(V; \u03b8D, \u03b8emb),   (4)\n\nwhich is to minimize the classification error of popular and rare words. When the discriminator \u03b8D is fixed, the optimization of \u03b8model and \u03b8emb becomes\n\nmin_{\u03b8model, \u03b8emb} L_T(S; \u03b8model, \u03b8emb) \u2212 \u03bb L_D(V; \u03b8D, \u03b8emb),   (5)\n\ni.e., to optimize the task performance as well as fooling the discriminator. We train \u03b8model, \u03b8emb and \u03b8D iteratively by stochastic gradient descent or its variants. The general training process is shown in Algorithm 1.\n\nAlgorithm 1 Proposed Algorithm\n1: Input: Dataset S, vocabulary V = Vpop \u222a Vrare, \u03b8model, \u03b8emb, \u03b8D.\n2: repeat\n3:   Sample a minibatch S\u0302 from S.\n4:   Sample a minibatch V\u0302 = V\u0302pop \u222a V\u0302rare from V.\n5:   Update \u03b8model, \u03b8emb by gradient descent according to Eqn. (5) with data S\u0302.\n6:   Update \u03b8D by gradient ascent according to Eqn. (4) with vocabulary V\u0302.\n7: until Converge\n8: Output: \u03b8model, \u03b8emb, \u03b8D.\n\n5 Experiment\n\nWe test our method on a wide range of tasks, including word similarity, language modeling, machine translation, and text classification. For each task, we choose the state-of-the-art architecture together with the state-of-the-art training method as our baseline.^4\nFor fair comparisons, for each task, our method shares the same model architecture as the baseline. The only difference is that we use the original task-specific loss function with an additional adversarial loss as in Eqn. (3). Dataset description and hyper-parameter configurations can be found in [12].\n\n4 Code for our implementation is available at https://github.com/ChengyueGongR/FrequencyAgnostic\n\n5.1 Settings\n\nWe conduct experiments on the following tasks.\nWord Similarity evaluates the performance of the learned word embeddings by calculating the word similarity: it evaluates whether the most similar words of a given word in the embedding space are consistent with the ground-truth, in terms of Spearman\u2019s rank correlation. We use the skip-gram model as our baseline model [28]^5, and train the embeddings using Enwik9^6. We test the baseline and our method on three datasets: RG65, WS, and RW. The RW dataset is a dataset for the evaluation of rare words. Following common practice [28, 1, 32, 29], we use cosine distance while computing the similarity between two word embeddings.\nLanguage Modeling is a basic task in natural language processing. The goal is to predict the next word conditioned on previous words and the task is evaluated by perplexity. We do experiments on two widely used datasets [25, 26, 41], Penn Treebank (PTB) [27] and WikiText-2 (WT2) [26]. We choose two recent works as our baselines: the AWD-LSTM model^7 [25] and the AWD-LSTM-MoS model^8 [41]. AWD-LSTM [25] is a weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a means of recurrent regularization. The model is trained by NT-ASGD, which is a variant of the averaged stochastic gradient method. The training process has two steps; in the second step, the model is finetuned using another configuration of NT-ASGD. 
AWD-LSTM-MoS [41] adds the Mixture of Softmaxes structure to the vanilla AWD-LSTM and achieves the state-of-the-art result on PTB and WT2.\nMachine Translation is a popular task in both deep learning and natural language processing. We choose two datasets: WMT14 English-German and IWSLT14 German-English datasets, which are evaluated in terms of BLEU score^9. We use Transformer [38] as the baseline model. Transformer [38] is a recently developed architecture in which the self-attention network is used during the encoding and decoding steps. It achieves the best performances on several machine translation tasks, e.g., WMT14 English-German and WMT14 English-French datasets. We use transformer_base and transformer_big configurations following tensor2tensor [37]^10.\nText Classification is a conventional machine learning task and is evaluated by accuracy. Following the setting in [20], we implement a Recurrent CNN-based model^11 and test it on AG\u2019s news corpus (AG\u2019s), IMDB movie review dataset (IMDB) and 20 Newsgroups (20NG). RCNN [20] contains both recurrent and convolutional layers to catch the key components in texts and is widely used in text classification tasks.\nIn all tasks, we simply set the top 20% frequent words in vocabulary as popular words and denote the rest as rare words, which is the same as our empirical study. For all the tasks except training the skip-gram model, we use full-batch gradient descent to update the discriminator. For training the skip-gram model, mini-batch stochastic gradient descent is used to update the discriminator with a batch size of 3000, since the vocabulary size is large. For language modeling and machine translation tasks, we use logistic regression as the discriminator. For other tasks, we find using a shallow neural network with one hidden layer is more efficient and we set the number of nodes in the hidden layer as 1.5 times the embedding size. In all tasks, we set the hyper-parameter \u03bb to 0.1.\n\n5 https://github.com/tensorflow/models/blob/master/tutorials/embedding\n6 http://mattmahoney.net/dc/textdata.html\n7 https://github.com/salesforce/awd-lstm-lm\n8 https://github.com/zihangdai/mos\n9 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl\n10 To improve the training for imbalanced labeled data, a common method is to adjust the loss function by reweighting the training samples; to regularize the parameter space, a common method is to use l2 regularization. We tested these methods in machine translation and found the performance is not good. Detailed analysis is provided in [12].\n11 https://github.com/brightmart/text_classification\n\nDataset | Orig. | with FRAGE\nRG65 | 75.63 | 78.78\nWS | 66.74 | 69.35\nRW | 52.67 | 58.12\n\nTable 1: Results on three word similarity datasets.\n\nDataset | Model | Paras | Orig. Validation | Orig. Test | with FRAGE Validation | with FRAGE Test\nPTB | AWD-LSTM w/o finetune [25] | 24M | 60.7 | 58.8 | 60.2 | 58.0\nPTB | AWD-LSTM [25] | 24M | 60.0 | 57.3 | 58.1 | 56.1\nPTB | AWD-LSTM + continuous cache pointer [25] | 24M | 53.9 | 52.8 | 52.3 | 51.8\nPTB | AWD-LSTM-MoS w/o finetune [41] | 24M | 58.08 | 55.97 | 57.55 | 55.23\nPTB | AWD-LSTM-MoS [41] | 24M | 56.54 | 54.44 | 55.52 | 53.31\nPTB | AWD-LSTM-MoS + dynamic evaluation [41] | 24M | 48.33 | 47.69 | 47.38 | 46.54\nWT2 | AWD-LSTM w/o finetune [25] | 33M | 69.1 | 67.1 | 67.9 | 64.8\nWT2 | AWD-LSTM [25] | 33M | 68.6 | 65.8 | 66.5 | 63.4\nWT2 | AWD-LSTM + continuous cache pointer [25] | 33M | 53.8 | 52.0 | 51.0 | 49.3\nWT2 | AWD-LSTM-MoS w/o finetune [41] | 35M | 66.01 | 63.33 | 64.86 | 62.12\nWT2 | AWD-LSTM-MoS [41] | 35M | 63.88 | 61.45 | 62.68 | 59.73\nWT2 | AWD-LSTM-MoS + dynamic evaluation [41] | 35M | 42.41 | 40.68 | 40.85 | 39.14\n\nTable 2: Perplexity on validation and test sets on Penn Treebank and WikiText2. The smaller the perplexity, the better the result. Baseline results are obtained from [25, 41]. \u201cParas\u201d denotes the number of model parameters.\n\n5.2 Results\n\nIn this subsection, we provide the experimental results of all tasks. For simplicity, we use \u201cwith FRAGE\u201d to denote our proposed method in the tables.\nWord Similarity The results on three word similarity tasks are listed in Table 1. From the table, we can see that our method consistently outperforms the baseline on all datasets. In particular, we outperform the baseline by about 5.4 points on the rare word dataset RW. This result shows that our method improves the representation of words, especially the rare words.\nLanguage Modeling The results of language modeling on PTB and WT2 datasets are presented in Table 2. We test our model and the baselines at several checkpoints used in the baseline papers: without finetune, with finetune, with post-process (continuous cache pointer [14] or dynamic evaluation [19]). In all these settings, our method outperforms the two baselines. On the PTB dataset, our method improves the AWD-LSTM and AWD-LSTM-MoS baselines by 0.8/1.2/1.0 and 0.76/1.13/1.15 points on the test set at different checkpoints. On the WT2 dataset, which contains more rare words, our method achieves larger improvements. We improve the results of AWD-LSTM and AWD-LSTM-MoS by 2.3/2.4/2.7 and 1.15/1.72/1.54 in terms of test perplexity, respectively.\n\nWMT En\u2192De\nMethod | BLEU\nByteNet [17] | 23.75\nConvS2S [11] | 25.16\nTransformer Base [38] | 27.30\nTransformer Base with FRAGE | 28.36\nTransformer Big [38] | 28.40\nTransformer Big with FRAGE | 29.11\n\nIWSLT De\u2192En\nMethod | BLEU\nDeepConv [10] | 30.04\nDual transfer learning [39] | 32.35\nConvS2S+SeqNLL [8] | 32.68\nConvS2S+Risk [8] | 32.93\nTransformer | 33.12\nTransformer with FRAGE | 33.97\n\nTable 3: BLEU scores on test set on WMT2014 English-German and IWSLT German-English tasks.\n\nMachine Translation The results of neural machine translation on WMT14 English-German and IWSLT14 German-English tasks are shown in Table 3. We outperform the baselines by 1.06/0.71 BLEU in the transformer_base and transformer_big settings on the WMT14 English-German task, respectively. The model learned from adversarial training also outperforms the original one on the IWSLT14 German-English task by 0.85. These results show that improving word embeddings can achieve better results in more complicated tasks and larger datasets.\n\nDataset | Orig. | with FRAGE\nAG\u2019s | 90.47% | 91.73%\nIMDB | 92.41% | 93.07%\n20NG | 96.49% [20] | 96.93%\n\nTable 4: Accuracy on test sets of AG\u2019s news corpus (AG\u2019s), IMDB movie review dataset (IMDB) and 20 Newsgroups (20NG) for text classification.\n\nText Classification The results are listed in Table 4. Our method outperforms the baseline method by 1.26%/0.66%/0.44% on three different datasets.\nIn summary, our experiments on four different tasks with 10 datasets verify the effectiveness of our method. We provide a case study and qualitative analysis of the model with and without our method in Table 5 and Figure 3. By comparing the cases, we find that, with our method, the word similarities are improved and popular/rare words are better mixed together. More cases are shown in [12].\n\nFigure 3: Panels (a) and (b) show that, in different tasks, the embeddings of rare and popular words are better mixed together after applying our method.\n\nOrig.\nWord | citizens | citizenship* | accepts* | bacterial*\nModel-predicted neighbor | clinicians*, astronomers*, westliche, adults | announces*, digs*, externally*, empowers* | bliss*, pakistanis*, dismiss*, reinforces* | multicellular*, epigenetic*, isotopic*, conformational*\nSemantic neighbor + Model-predicted Ranking | citizen*: 771, citizenship*: 832 | citizen*: 10745, citizens: 11706 | accepted*: 21109, accept: 30612 | bacteria*: 116, chemical: 233\n\nwith FRAGE\nWord | citizens | citizenship* | accepts* | bacterial*\nModel-predicted neighbor | homes, citizen*, b\u00fcrger, population | population, st\u00e4dtischen*, dignity, b\u00fcrger | registered, tolerate*, recognizing*, accepting* | myeloproliferative*, metabolic*, bacteria*, apoptotic*\nSemantic neighbor + Model-predicted Ranking | citizen*: 2, citizenship*: 40 | citizen*: 79, citizens: 7 | accepted*: 26, accept: 29 | bacteria*: 3, chemical: 8\n\nTable 5: Case study for the original model and our method. Rare words are marked by \u201c*\u201d. For each word, we list its model-predicted neighbors. Moreover, we also show the ranking positions of the semantic neighbors based on cosine similarity. As we can see, the ranking positions of the semantic neighbors are very low for the original model.\n\n6 Conclusion\n\nIn this paper, we find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space. This makes learned word embeddings ineffective, especially for rare words, and consequently limits the performance of these neural network models. We propose a neat, simple yet effective adversarial training method to improve the model performance, which is verified in a wide range of tasks.\nWe will explore several directions in the future. First, we will investigate the theoretical aspects of word embedding learning and our adversarial training method. Second, we will study more applications which have a similar problem, even beyond NLP.\n\nAcknowledgement\n\nThis work is supported by National Basic Research Program of China (973 Program) (grant no. 2015CB352502), NSFC (61573026) and BJNSF (L172037) and a grant from Microsoft Research Asia. We would like to thank the anonymous reviewers for their valuable comments on our paper.\n\nReferences\n[1] R. Al-Rfou, B. Perozzi, and S. Skiena. 
Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183\u2013192, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.\n\n[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.\n\n[3] S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. 2016.\n\n[4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[5] K. Cho, B. Van Merri\u00ebnboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.\n\n[6] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. J\u00e9gou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.\n\n[7] A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079\u20133087, 2015.\n\n[8] S. Edunov, M. Ott, M. Auli, D. Grangier, and M. Ranzato. Classical structured prediction losses for sequence to sequence learning. arXiv preprint arXiv:1711.04956, 2017.\n\n[9] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180\u20131189, 2015.\n\n[10] J. Gehring, M. Auli, D. Grangier, and Y. N. Dauphin. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344, 2016.\n\n[11] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.\n\n[12] C. Gong, D. He, X. Tan, T. Qin, L. Wang, and T. Liu. FRAGE: frequency-agnostic word representation. CoRR, abs/1809.06858, 2018.\n\n[13] I. 
Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672\u20132680, 2014.\n\n[14] E. Grave, A. Joulin, and N. Usunier. Improving neural language models with a continuous cache. CoRR, abs/1612.04426, 2016.\n\n[15] E. Hoffer, I. Hubara, and D. Soudry. Fix your classifier: the marginal value of training the last weight layer. ICLR, 2018.\n\n[16] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.\n\n[17] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. v. d. Oord, A. Graves, and K. Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.\n\n[18] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush. Character-aware neural language models. In AAAI, pages 2741\u20132749, 2016.\n\n[19] B. Krause, E. Kahembwe, I. Murray, and S. Renals. Dynamic evaluation of neural sequence models. CoRR, abs/1709.07432, 2017.\n\n[20] S. Lai, L. Xu, K. Liu, and J. Zhao. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267\u20132273, 2015.\n\n[21] A. M. Lamb, A. G. A. P. Goyal, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pages 4601\u20134609, 2016.\n\n[22] G. Lample, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.\n\n[23] R. R. Larson. Introduction to information retrieval. Journal of the American Society for Information Science and Technology, 61(4):852\u2013853, 2010.\n\n[24] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. 
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142\u2013150. Association for Computational Linguistics, 2011.\n\n[25] S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. CoRR, abs/1708.02182, 2017.\n\n[26] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. CoRR, abs/1609.07843, 2016.\n\n[27] T. Mikolov, M. Karafi\u00e1t, L. Burget, J. Cernock\u00fd, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045\u20131048, 2010.\n\n[28] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111\u20133119, 2013.\n\n[29] J. Mu, S. Bhat, and P. Viswanath. All-but-the-top: Simple and effective postprocessing for word representations. arXiv preprint arXiv:1702.01417, 2017.\n\n[30] J. Mu, S. Bhat, and P. Viswanath. All-but-the-top: Simple and effective postprocessing for word representations. CoRR, abs/1702.01417, 2017.\n\n[31] M. Ott, M. Auli, D. Grangier, and M. Ranzato. Analyzing uncertainty in neural machine translation. arXiv preprint arXiv:1803.00047, 2018.\n\n[32] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532\u20131543, 2014.\n\n[33] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.\n\n[34] T. Salimans, I. 
Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234\u20132242, 2016.\n\n[35] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.\n\n[36] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2962\u20132971, 2017.\n\n[37] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018.\n\n[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \u0141. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000\u20136010, 2017.\n\n[39] Y. Wang, Y. Xia, L. Zhao, J. Bian, T. Qin, G. Liu, and T.-Y. Liu. Dual transfer learning for neural machine translation with marginal distribution regularization.\n\n[40] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.\n\n[41] Z. Yang, Z. Dai, R. Salakhutdinov, and W. W. Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. CoRR, abs/1711.03953, 2017.\n\n[42] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. 
arXiv preprint arXiv:1703.10593, 2017.", "award": [], "sourceid": 692, "authors": [{"given_name": "Chengyue", "family_name": "Gong", "institution": "Peking University"}, {"given_name": "Di", "family_name": "He", "institution": "Peking University"}, {"given_name": "Xu", "family_name": "Tan", "institution": null}, {"given_name": "Tao", "family_name": "Qin", "institution": "Microsoft Research"}, {"given_name": "Liwei", "family_name": "Wang", "institution": "Peking University"}, {"given_name": "Tie-Yan", "family_name": "Liu", "institution": "Microsoft Research Asia"}]}