{"title": "Kernelized Bayesian Softmax for Text Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 12508, "page_last": 12518, "abstract": "Neural models for text generation require a softmax layer with proper token embeddings during the decoding phase.\nMost existing approaches adopt single point embedding for each token. \nHowever, a word may have multiple senses according to different context, some of which might be distinct. \nIn this paper, we propose KerBS, a novel approach for learning better embeddings for text generation. \nKerBS embodies two advantages: \n(a) it employs a Bayesian composition of embeddings for words with multiple senses;\n(b) it is adaptive to semantic variances of words and robust to rare sentence context by imposing learned kernels to capture the closeness of words (senses) in the embedding space.\nEmpirical studies show that KerBS significantly boosts the performance of several text generation tasks.", "full_text": "Kernelized Bayesian Softmax for Text Generation\n\nNing Miao Hao Zhou Chengqi Zhao Wenxian Shi Lei Li\n\nByteDance AI lab\n\n{miaoning,zhouhao.nlp,zhaochengqi.d,shiwenxian,lileilab}@bytedance.com\n\nAbstract\n\nNeural models for text generation require a softmax layer with proper word em-\nbeddings during the decoding phase. Most existing approaches adopt single point\nembedding for each word. However, a word may have multiple senses according\nto different context, some of which might be distinct. In this paper, we propose\nKerBS, a novel approach for learning better embeddings for text generation. KerBS\nembodies two advantages: a) it employs a Bayesian composition of embeddings\nfor words with multiple senses; b) it is adaptive to semantic variances of words and\nrobust to rare sentence context by imposing learned kernels to capture the closeness\nof words (senses) in the embedding space. 
Empirical studies show that KerBS significantly boosts the performance of several text generation tasks.

1 Introduction

Text generation has been significantly improved with deep learning approaches in tasks such as language modeling [Bengio et al., 2003, Mikolov et al., 2010], machine translation [Sutskever et al., 2014, Bahdanau et al., 2015, Vaswani et al., 2017], and dialog generation [Sordoni et al., 2015]. All these models include a final softmax layer to yield words. The softmax layer takes a context state (h) from an upstream network such as an RNN cell as input, and transforms h into word probabilities with a linear projection (W · h) and an exponential activation. Each row of W can be viewed as the embedding of a word. Essentially, softmax conducts embedding matching with inner-product scoring between a computed context vector h and the word embeddings W of the vocabulary.

The commonly adopted softmax setting imposes a strong hypothesis on the embedding space: it assumes that each word corresponds to a single vector, and that the context vector h from the decoding network must be indiscriminately close to the desired word embedding under some distance metric. We find that this assumption does not hold in practice. Fig. 1 visualizes context vectors for utterances containing the examined words, computed with the BERT model [Devlin et al., 2019]. We make three interesting observations. a) Multi-sense: not every word's context vectors form a single cluster; some words exhibit multiple clusters (Fig. 1b). b) Varying-variance: the variances of context vectors vary significantly across clusters; some words have smaller variances and others larger ones (Fig. 1c). c) Robustness: there are outliers in the context space (Fig. 1b). These observations explain the ineffectiveness of training with the traditional softmax.
The traditional approach yields a word embedding W that is ill-centered with respect to all context vectors of the same word, even though those vectors might belong to multiple clusters. At the same time, the variances of different words are completely ignored in the plain softmax with the inner product as the similarity score. It is also vulnerable to outliers, since a single anomaly can push the word embedding far from the main cluster. In short, the softmax layer lacks sufficient expressive capacity.

Yang et al. [2018] propose Mixture-of-Softmax (MoS) to enhance the expressiveness of softmax. It replaces a single softmax layer with a weighted average of M softmax layers. However, all words share the same fixed number of components M and the same averaging weights, which heavily restricts MoS's capacity. Furthermore, the variances of context vectors are not taken into consideration.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Context vectors h calculated from BERT and projected using PCA. Each point corresponds to one utterance containing the word. (a) "computer" has only one cluster; (b) "monitor" has two clusters, representing the verb (left) and the noun (right); the outlier at the lower right only appears in the phrase "christian science monitor"; (c) "car" has smaller variance than "vehicle".

In this paper, we propose KerBS, a novel approach to learning text embeddings for generation. KerBS avoids the above softmax issues by introducing a Bayesian composition of multiple embeddings and a learnable kernel to measure the similarities among embeddings. Instead of a single embedding, KerBS explicitly represents a word with a weighted combination of multiple embeddings, each regarded as a "sense"¹.
The number of embeddings is automatically learned from the corpus as well. We design a family of kernel functions to replace the embedding matching (i.e., the matrix-vector dot product) in the softmax layer. With parameters learned from text, each word (or "sense") can have an individual variance in its embedding space. In addition, the kernel family is more robust to outliers than Gaussian kernels.

We conduct experiments on a variety of text generation tasks, including machine translation, language modeling, and dialog generation. The empirical results verify the effectiveness of KerBS. An ablation study indicates that each part of KerBS, including the Bayesian composition and the kernel function, is necessary for the performance improvement. We also find that words with more semantic meanings are allocated more sense embeddings, which matches our intuition.

2 Related work

Word Embeddings. Word2Vec [Mikolov et al., 2013] and GloVe [Pennington et al., 2014] learn distributed word representations from corpora in an unsupervised way. However, only one embedding is assigned to each word, which both ignores polysemy and cannot provide context-dependent word embeddings. Recent works [Alec Radford and Sutskever, 2018, Peters et al., 2018, Devlin et al., 2019] indicate that pre-trained contextualized word representations are beneficial for downstream natural language processing tasks. BERT [Devlin et al., 2019] pre-trains a masked language model with a deep bidirectional Transformer and achieves state-of-the-art performance on various NLP tasks.

Multi-Sense Word Embeddings. Early works obtain multi-sense word embeddings by first training single-point word embeddings and then clustering the context embeddings (for example, the average embedding of neighboring words). However, these methods are not scalable and take substantial effort in parameter tuning [Reisinger and Mooney, 2010, Huang et al., 2012]. Tian et al.
[2014] introduce a probabilistic model, which uses a latent variable to control the sense selection of each word. Liu et al. [2015] add a topic variable for each word and condition word embeddings on it. Both Tian et al. [2014] and Liu et al. [2015] can be easily integrated into the Skip-Gram model [Mikolov et al., 2013], which is highly efficient. Other works [Chen et al., 2014, Jauhar et al., 2015, Chen et al., 2015, Wu and Giles, 2015] further improve the performance of multi-sense embeddings by making use of large resources such as WordNet [Miller, 1995] and Wikipedia. However, these works mainly focus on text understanding rather than text generation.

Word Embedding as a Distribution. In order to represent the semantic breadth of each word, Vilnis and McCallum [2015] propose to map each word to a Gaussian distribution in the embedding space. Instead of the cosine similarity used in Mikolov et al. [2013], Vilnis and McCallum [2015] use the KL-divergence between the embedding distributions to measure word similarity. To improve the numerical stability of Gaussian word embeddings, especially when comparing very close or very distant distributions, Sun et al. [2018] propose to replace KL-divergence with the Wasserstein distance. Though Gaussian word embeddings perform well on word-level tasks such as similarity and entailment detection, they cannot be directly applied to text generation, because it is difficult to perform embedding matching between Gaussian word embeddings and output embeddings, which are usually single points in the embedding space.

¹Since there is no direct supervision, an embedding vector does not necessarily correspond to a semantic sense.

3 Background

Most text generation models generate words through an embedding matching procedure.
Intuitively, at each step, an upstream network such as an RNN decoder computes a context vector h from the encoded input and the previously generated words. The context vector h serves as a query to search for the most similar match in a pre-calculated vocabulary embedding matrix W. In practice, this is implemented as an inner product between W and h. Normalized probabilities over all words are computed with the softmax function, and words with the highest probabilities are chosen during inference.

Specifically, given an utterance ŷ, a GRU decoder computes:

e_t = LOOKUP(W_in, ŷ_{t−1}),   (1)
h_t = GRU(h_{t−1}, e_t),   (2)
P(y_t = i) = SOFTMAX(h_t W)_i.   (3)

At time step t, the word embedding e_t is obtained by looking up the previous output word in the word embedding matrix W_in = [w̄_1, w̄_2, ..., w̄_V] (Eq. (1)). Here w̄_i is the embedding of the i-th word in the vocabulary and V is the vocabulary size. The context embedding h_t of the t-th step is obtained from the GRU by combining the information of h_{t−1} and e_t (Eq. (2)). Other decoders such as the Transformer [Vaswani et al., 2017] work similarly.

Eq. (3) performs embedding matching between h_t and W, and word probabilities are obtained with a softmax activation. Intuitively, to generate the correct word ŷ_t, the context embedding h_t should lie in a small neighborhood of ŷ_t's word embedding w_{ŷ_t}.

4 Proposed KerBS

In this section, we first introduce KerBS for text generation. It is designed according to the three observations mentioned in the introduction: multi-sense, varying-variance, and robustness.
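To ground the notation, the decoding step of Eqs. (1)-(3) can be sketched in NumPy as below. The GRU cell is replaced by a toy stand-in, and all shapes, seeds, and names are illustrative, not the paper's actual configuration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: vocabulary V, embedding dimension d.
V, d = 1000, 64
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))   # input embedding table (one row per word)
W = rng.normal(size=(V, d))      # output embedding table (often tied to W_in)

def decode_step(h_prev, y_prev, gru_cell):
    e_t = W_in[y_prev]            # Eq. (1): embedding lookup of previous word
    h_t = gru_cell(h_prev, e_t)   # Eq. (2): recurrent state update
    p_t = softmax(h_t @ W.T)      # Eq. (3): inner-product matching + softmax
    return h_t, p_t

# Toy stand-in for a trained GRU cell, just to make the step runnable.
toy_gru = lambda h, e: np.tanh(h + e)
h, p = decode_step(np.zeros(d), 42, toy_gru)
assert p.shape == (V,) and abs(p.sum() - 1.0) < 1e-6
```

At inference time, `p` would be fed to greedy or beam-search selection of the next word.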
Then we provide a training scheme that dynamically allocates senses, since it is difficult to directly learn the number of senses of each word.

4.1 Model Structure

KerBS assumes that the space of context vectors for the same word consists of several geometrically separate components. Each component represents a "sense" with its own variance. To better model their distribution, we replace Eq. (3) with the following equations:

P(s_t = ⟨i, j⟩) = Softmax([K_θ(h_t, W)])_{⟨i,j⟩} = exp(K_{θ_i^j}(h_t, w_i^j)) / Σ_k Σ_{r∈{1,...,M_k}} exp(K_{θ_k^r}(h_t, w_k^r)),   (4)

P(y_t = i) = Σ_{j∈{1,...,M_i}} P(s_t = ⟨i, j⟩).   (5)

Here, s_t is the sense index at step t. Its value ⟨i, j⟩ corresponds to the j-th sense of the i-th word in the vocabulary. M_i is the number of senses of word i, which may differ across words. Instead of directly calculating word probabilities, KerBS first calculates the probabilities of all senses belonging to a word and then sums them up to get the word probability.

The sense probability in Eq. (4) is not a strict Gaussian posterior, as training Gaussian models in high-dimensional space is numerically unstable. Instead, we propose a carefully designed kernel function to model the distribution variance of each sense. Concretely, we replace the inner product in Eq. (3) with a kernel function K, which depends on a variance-related parameter θ. [K_θ(h_t, W)] is a simplified notation containing all kernel values K_{θ_i^j}(h_t, w_i^j). With a separate θ_i^j for each sense, we can model the variances of their distributions individually.

4.1.1 Bayesian Composition of Embeddings

In this part, we introduce in detail how KerBS models the multi-sense property of words.
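The two-level factorization of Eqs. (4)-(5), a softmax over all senses followed by a per-word sum, can be sketched as follows. The sense counts, embeddings, and the kernel (here a plain inner-product placeholder) are made up for illustration.

```python
import numpy as np

def kerbs_word_probs(h, sense_emb, theta, word_of_sense, V, kernel):
    """Eq. (4): softmax over all senses; Eq. (5): sum sense probs per word.

    sense_emb:     (S, d) stacked sense embeddings for the whole vocabulary
    theta:         (S,)   one variance-related parameter per sense
    word_of_sense: (S,)   maps each sense index to its word index
    """
    scores = np.array([kernel(h, sense_emb[s], theta[s])
                       for s in range(len(theta))])
    z = np.exp(scores - scores.max())
    p_sense = z / z.sum()                      # Eq. (4)
    p_word = np.zeros(V)
    np.add.at(p_word, word_of_sense, p_sense)  # Eq. (5): sum a word's senses
    return p_word

# Tiny illustration: 3 words with 1, 2, and 1 senses (4 senses total).
rng = np.random.default_rng(0)
d = 8
sense_emb = rng.normal(size=(4, d))
theta = np.full(4, 0.5)
word_of_sense = np.array([0, 1, 1, 2])
inner = lambda h, e, th: h @ e      # placeholder kernel: plain inner product
p = kerbs_word_probs(rng.normal(size=d), sense_emb, theta, word_of_sense, 3, inner)
assert abs(p.sum() - 1.0) < 1e-6
```

Swapping the placeholder for the learned kernel of Section 4.1.2 gives the full KerBS output layer.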
Intuitively, we use a Bayesian composition of embeddings in text generation because the same word can have entirely different meanings. For words with more than one sense, the corresponding context vectors can usually be divided into separate clusters (see Figure 1). If we use single-embedding models such as traditional softmax to fit these clusters, the word embedding converges to the mean of the clusters and can be distant from all of them, which may lead to poor performance in text generation.

As shown in Eq. (4), we allocate a separate embedding to each sense. We first obtain the sense probabilities by matching the context vector h against the sense embedding matrix W, and then add up the sense probabilities belonging to each word to get the word probabilities.

We adopt the weight tying scheme [Inan et al., 2017], in which the decoding embedding and the input embedding are shared. Since W is a matrix of sense embeddings, it cannot be used directly in the decoding network at the next step as in Eq. (1). Instead, we obtain the embedding e_t as the weighted sum of sense embeddings according to their conditional probabilities. Assume that ŷ_{t−1} = i is the input word at step t:

e_t = Σ_{j∈{1,...,M_i}} P(s_{t−1} = ⟨i, j⟩ | ŷ_{[0:t−1]}) w_i^j,   (6)

P(s_{t−1} = ⟨i, j⟩ | ŷ_{[0:t−1]}) = P(s_{t−1} = ⟨i, j⟩ | ŷ_{[0:t−2]}) / Σ_{k∈{1,...,M_i}} P(s_{t−1} = ⟨i, k⟩ | ŷ_{[0:t−2]}).   (7)

4.1.2 Embedding Matching with Kernels

To calculate the probability of each sense, it would be straightforward to introduce Gaussian distributions in the embedding space. However, it is difficult to learn a Gaussian distribution for embeddings in high-dimensional space, for the following reasons.
Context vectors are usually distributed on low-dimensional manifolds embedded in a high-dimensional space, and using an isotropic Gaussian distribution to model them can lead to serious instability. Assume that in a d-dimensional space, the context vectors follow N(0, σ_1) within a d_1-dimensional subspace, and that we fit them with an isotropic model N(0, σ). There are often some noisy outliers, which we assume are distributed uniformly in a cube with edge length 1 centered at the origin. The average squared distance between an outlier and the origin is then d/12, which grows linearly with d. The log-likelihood to maximize is

L = Σ_{x∈X} log((√(2π) σ)^{−d} exp(−Σ_{i∈{1,...,d}} x_i² / (2σ²))) = Σ_{x∈X} (−d log(√(2π) σ) − Σ_{i∈{1,...,d}} x_i² / (2σ²)),   (8)

where X is the set of data points including outliers. Denote the proportion of outliers in X by α. Since E(Σ_{i∈{1,...,d}} x_i²) equals d_1 σ_1 for points generated by the oracle and d/12 for outliers, L is dominated by the outliers when d is large. The optimal σ approximately equals √((αd/12 + (1−α) d_1 σ_1) / d); with large d it approaches √(α/12), which is independent of the real variance σ_1. As expected, we find that directly modeling Gaussian distributions does not work well in our preliminary experiments.

Therefore we design a kernel function to model embedding variances, which is more easily learned than a Gaussian mixture model.
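Before moving on, the outlier argument above can be checked with a quick simulation (all sizes and constants here are illustrative, not from the paper): uniform outliers in a unit cube have mean squared norm d/12, and the isotropic-Gaussian fit of a mixed sample is dominated by them when d is large.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 20000

# Uniform outliers in a unit cube centered at the origin: E[||x||^2] = d/12.
outliers = rng.uniform(-0.5, 0.5, size=(n, d))
assert abs((outliers**2).sum(axis=1).mean() - d / 12) < 1.0

# Mix: the "real" signal lives in a d1-dim subspace with small variance
# sigma1_sq, plus a fraction alpha of outliers. The MLE of sigma^2 for an
# isotropic N(0, sigma^2 I) is the mean squared coordinate, which tends to
# alpha/12 for large d, regardless of sigma1_sq.
d1, sigma1_sq, alpha = 8, 0.01, 0.1
signal = np.zeros((n, d))
signal[:, :d1] = rng.normal(scale=np.sqrt(sigma1_sq), size=(n, d1))
mask = rng.random(n) < alpha
data = np.where(mask[:, None], outliers, signal)
sigma_sq_mle = (data**2).mean()     # fitted variance, dominated by outliers
print(sigma_sq_mle, alpha / 12)     # the two values are close
```

Rerunning with a larger `sigma1_sq` barely moves `sigma_sq_mle`, which is the instability the kernel design works around.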
Specifically, we replace the inner product I(h, e) = cos(h, e)|h||e|, which can be regarded as a fixed kernel over the whole space, with the kernel function

K_θ(h, e) = |h||e| (a exp(−θ cos(h, e)) − a).   (9)

Figure 2: Kernel shapes with different θ: (a) θ = −2; (b) θ = 2.

Here θ is a parameter controlling the embedding variance of each sense, and a = −θ / (2(exp(−θ) + θ − 1)) is a normalization factor. When θ → 0, K_θ(h, e) → I(h, e), i.e., the kernel degenerates to the common inner product. As shown in Figure 2, with a small θ embeddings are concentrated in a small region, while a large θ yields a flat kernel. Finally, the parameters for the i-th word are {[w_i^1, θ_i^1], [w_i^2, θ_i^2], ..., [w_i^{M_i}, θ_i^{M_i}]}, where w_i^j and θ_i^j are the embedding and kernel parameter of sense ⟨i, j⟩. Intuitively, under the original inner-product similarity the density of probability mass is uniform over the space, whereas K distorts the probabilistic space, allowing the variances of context vectors to differ across senses.

Since

∂ log K_θ(h, e) / ∂θ = (1/a) ∂a/∂θ − cos(h, e) exp(−θ cos(h, e)) / (exp(−θ cos(h, e)) − 1),   (10)

the gradient is bounded for fixed θ. This follows from the continuity of cos(h, e) exp(−θ cos(h, e)) / (exp(−θ cos(h, e)) − 1) when cos(h, e) ≠ 0, and from the fact that this ratio tends to −1/θ as cos(h, e) → 0.
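A minimal sketch of the kernel in Eq. (9), with numerical checks of its θ → 0 limit and of the fact that it still orders candidates by similarity (vector sizes and test values here are arbitrary):

```python
import numpy as np

def kerbs_kernel(h, e, theta):
    """Kernel of Eq. (9): K_theta(h, e) = |h||e| (a exp(-theta cos(h, e)) - a),
    with normalizer a = -theta / (2 (exp(-theta) + theta - 1))."""
    nh, ne = np.linalg.norm(h), np.linalg.norm(e)
    c = h @ e / (nh * ne)                    # cos(h, e)
    a = -theta / (2 * (np.exp(-theta) + theta - 1))
    return nh * ne * (a * np.exp(-theta * c) - a)

rng = np.random.default_rng(0)
h, e = rng.normal(size=16), rng.normal(size=16)
inner = h @ e

# theta -> 0 recovers the plain inner product I(h, e) = cos(h, e)|h||e|.
assert abs(kerbs_kernel(h, e, 1e-4) - inner) < 1e-2

# The kernel still ranks by similarity: an aligned vector scores higher
# than an anti-aligned one, here with theta = 2.
assert kerbs_kernel(h, h, 2.0) > kerbs_kernel(h, -h, 2.0)
```

For very small |θ| the normalizer involves the cancellation exp(−θ) + θ − 1 ≈ θ²/2, so an implementation used in training would want a series expansion or a clamp near θ = 0.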
As a result, a small proportion of outliers or noise points will not have a major impact on training stability.

4.2 Training Scheme

It is difficult to determine the sense number of each word empirically, as this amounts to a very large set of hyper-parameters; moreover, the properties of the same word may vary across corpora and tasks. We therefore design a training scheme for KerBS that includes dynamic sense allocation: instead of providing a sense number for each word, we only need to specify the total sense number, and the algorithm allocates senses automatically.

Details of the training scheme are shown in Algorithm 1. To obtain the parameters of both KerBS and the upstream network f_φ, which outputs the context vectors, the whole process alternates between adaptation and allocation phases. Before training, W and θ are initialized by a random matrix and a random vector, respectively, and the M_sum senses are randomly allocated to words. After initialization, we first enter the adaptation phase. Given a sequence ŷ in the training set, at step t we get the context vector h_t from f_φ. Sense and word probabilities are then calculated by Eq. (4) and Eq. (5), respectively. Afterwards, we calculate the log-probability L of generating ŷ and maximize it by tuning W, θ and φ:

L = Σ_t log(P(y_t = ŷ_t | ŷ_{[0:t−1]}; W, θ, φ)).   (11)

During the adaptation phase, KerBS learns w_i^j, the sense embedding vectors, and θ_i^j, the indicators of distribution variance.

During the allocation phase, we remove redundant senses and reallocate them to poorly predicted words.
To determine which senses to remove and which words need more senses, we record moving averages of each word's log prediction accuracy log P_i and of each sense's usage U_i^j:

log P_i ← (1 − β) log P_i + β log(P(y_t = i)) 1_{i=ŷ_t},   (12)

U_i^j ← (1 − β) U_i^j + β P(s_t = ⟨i, j⟩) 1_{i=ŷ_t},   (13)

where β is the updating rate and 1_{i=ŷ_t} indicates whether word i is the target at step t. For a word i, if log P_i stays below a threshold ϵ for several epochs, we conclude that the senses currently allocated to i are not enough; we then delete the least used sense and reallocate it to i. We alternate between adaptation and reallocation until convergence.

Algorithm 1: Training scheme for KerBS
Input: training corpus Ŷ, total sense number M_sum, word number V, embedding dimension d, adaptation-allocation ratio Q, threshold ϵ
Output: W, θ, sense allocation list L
Initialize W, H, θ, U, L, step = 0;
while not converged do
    Randomly select ŷ ∈ Ŷ;
    for t in T do
        h_t ← f_φ(ŷ_{[0:t−1]});
        Calculate sense probability P(s_t = ⟨i, j⟩) and word probability P(y_t = i) by Eq. (4), (5);
        Maximize log(P(y_t = ŷ_t)) by Adam;
        Update log P and U by Eq. (12), (13);
    end
    if step mod Q = 0 then
        for i in {1, 2, ..., V} do
            if log P_i < ϵ then
                ⟨i′_0, j′_0⟩ ← argmin_{i′, j′} U_{i′}^{j′};
                θ_{i′_0}^{j′_0} ← 1e−8;  U_{i′_0}^{j′_0} ← MEAN(U);
                L[⟨i′_0, j′_0⟩] ← i;
            end
        end
    end
    step = step + 1;
end

4.3 Theoretical Analysis

In this part, we explain why KerBS has the ability to learn the complex distributions of context vectors.
We only give brief statements of the following lemmas and leave detailed proofs to the appendix.

Lemma 4.1. KerBS has the ability to learn the multi-sense property. If the real distribution of context vectors consists of several disconnected clusters, KerBS will learn to represent as many clusters as possible.

Proof (sketch). Each cluster of word i's context vectors attracts i's KerBS sense embeddings, drawing them nearer in order to increase L. However, once a cluster is already represented by a KerBS sense, its attraction on the embeddings of the other senses weakens, so those senses converge to other clusters. Instead of gathering in a few clusters, the senses try to cover as many clusters of the context-vector distribution as possible.

Lemma 4.2. KerBS has the ability to learn the variances of the embedding distribution. For distributions with larger variances, KerBS learns larger θ.

Proof (sketch). The optimized θ is a solution of the equation ∂L/∂θ = 0. It suffices to show that, as the variance of h grows, the solution of this equation gets larger.

5 Experiment

In this section, we empirically validate the effectiveness of KerBS. We first set up the experiments and then give the experimental results in Section 5.2. We test KerBS on several text generation tasks, including:

• Machine Translation (MT) is conducted on IWSLT'16 De→En, which contains 196k sentence pairs for training.
• Language modeling (LM) is included to test unconditional text generation. Following previous work, we use 300k, 10k and 30k subsets of the One-Billion-Word Corpus for training, validation and testing, respectively.
• Dialog generation (Dialog) is also included. We employ the DailyDialog dataset from Li et al.
[2017] for our experiments, removing the overlap between the training and test sets in advance.

Note that these text generation tasks emphasize different aspects. MT tests the ability to transfer semantics across a bilingual corpus. LM tests whether KerBS helps generate more fluent sentences in general. Dialog generation even requires some prior knowledge to produce good responses, which makes it the most challenging task.

For LM, we measure performance with perplexity (PPL). For MT and Dialog, we measure generation quality with BLEU-4 and BLEU-1 scores [Papineni et al., 2002]. Human evaluation is also included for Dialog: 3 volunteers are asked to label Dialog data containing 50 sets of sentences. Each set contains the input sentence as well as the output responses generated by KerBS and the baseline models. Volunteers score the responses according to their fluency and relevance to the corresponding questions (see the detailed scoring in the appendix). After the responses are labeled, we calculate the average score of each method and perform a t-test to reject the hypothesis that KerBS is not better than the baseline methods.

5.1 Implementation Details

For LM, we use a GRU language model [Chung et al., 2014] as our testbed. We try different sets of parameters, including the number of RNN layers, hidden sizes and embedding dimensions; the model that performs best with traditional softmax is chosen as the baseline.

For MT and Dialog, we implement the attention-based sequence-to-sequence model (Seq2Seq) [Bahdanau et al., 2015] as well as the Transformer [Vaswani et al., 2017] as our baselines. For Seq2Seq, (hidden size, embedding dimension) are set to (512, 256) and (1024, 512), respectively. For the Transformer, (hidden size, embedding dim, dropout, layer num, head num) are set to (288, 507, 0.1, 5, 2) for both MT and Dialog, following Lee et al. [2018].
All models are trained on sentences of up to 80 words. We set the batch size to 128 and the beam size to 5 for decoding. For both German and English, we first tokenize sentences with the Moses tokenizer [Koehn et al., 2007], and then apply BPE [Sennrich et al., 2016] to segment each word into subwords.

Adam [Kingma and Ba, 2014] is adopted as our optimization algorithm. We start to decay the learning rate when the loss on the validation set stops decreasing. For LM, we set the initial learning rate to 1.0 and the decay rate to 0.8. For MT and Dialog, the initial learning rate is 5e-4 and the decay rate is 0.5.

5.2 Results of Text Generation

We list the results of using KerBS in Tables 1 and 2, and then give some analysis.

Table 1: Performance of KerBS on Seq2Seq.
Tasks  | Metrics     | Seq2Seq | + MoS [Yang et al., 2018] | + KerBS
MT     | BLEU-4      | 25.91   | 26.45                     | 27.28
LM     | PPL         | 103.12  | 102.72                    | 102.17
Dialog | BLEU-1      | 16.56   | 13.73                     | 17.85
Dialog | Human Eval. | 1.24    | 1.04                      | 1.40

Table 2: Performance of KerBS on Transformer.
Tasks  | Metrics | Transformer | + MoS [Yang et al., 2018] | + KerBS
MT     | BLEU-4  | 29.61       | 28.54                     | 30.90
Dialog | BLEU-1  | 10.61       | 9.81                      | 10.90

Machine Translation. For machine translation, KerBS achieves higher BLEU-4 scores on both Seq2Seq (+1.37) and Transformer (+1.29). By contrast, the gain of MoS is not significant, and on Transformer it is not even as good as the vanilla model.
Cases of MT on Transformer are shown in Table 3.

Table 3: Examples of MT on IWSLT'16 De→En
Source: meine gebildete Mutter aber wurde Lehrerin.
Transformer: my foster mother was a teacher.
+ MoS: and my educated mother was a teacher.
+ KerBS: but my educated mother became a teacher.

Source: man erreicht niemals eine Gemeinde mit Ideen, man setzt sich mit den Einheimischen zusammen.
Transformer: you never achieve a community with ideas; you put together with local people.
+ MoS: you never get a community with ideas, you're putting together with indigenous people.
+ KerBS: you never get to a community with ideas, and you sit with the local people.

Table 4: Examples of dialog generation on DailyDialog
Source: what do you mean?
Seq2Seq: i mean, what's up with the and iron bars on your windows.
+ MoS: well, how can i put this? france is a terrible team.
+ KerBS: well, i mean. we always do the same thing. there's no variety in our lives.

Source: now, what seems to be the trouble?
Seq2Seq: trouble is trouble.
+ MoS: yeah. and he was.
+ KerBS: not bad. but i have a bad cold today.

Source: it can't be more than fourteen days late for us to accept it.
Seq2Seq: it will just fine.
+ MoS: well, i see. have you been back to work then?
+ KerBS: maybe you're right. i think we should take it.

Language Model. As expected, KerBS achieves a lower PPL (102.17) on LM than both MoS (102.72) and traditional softmax (103.12). Although it introduces more parameters, KerBS does not lead to overfitting; on the contrary, the increased capacity helps the model better capture the structure of the embedding space.

Dialogue Generation. We also report results on dialog generation. Unlike tasks where source and target sentences are highly aligned, dialog generation may need some prior knowledge to obtain good responses.
Moreover, the multi-modality of the generated sentences is a serious problem in Dialog, and we expect the more expressive structure of KerBS to help here. Since the performance of the Transformer is not comparable to Seq2Seq on dialog generation, we focus on Seq2Seq in this part. KerBS achieves a BLEU-1 score of 17.85 on the test set, which is remarkable compared with the baselines. Human evaluation also confirms the effectiveness of KerBS for dialog generation: a one-tailed hypothesis test yields a p-value below 0.05, meaning the improvements on Dialog are nontrivial. We list some generated responses of the different models in Table 4.

5.3 Ablation Study

We perform an ablation study with three variants of KerBS on the MT task (Table 5). KerBS w/o kernel removes the kernel function from KerBS, so that distribution variances are no longer explicitly controlled; it loses 0.49 BLEU compared with the full KerBS, indicating that explicitly expressing the distribution variances of hidden states is important and that KerBS does so effectively. KerBS w/ only single sense replaces the multi-sense model with a single-sense one, which also degrades performance; this further confirms our assumption that the distribution of context vectors is multi-modal, and that in such cases the output layer should be multi-modal as well. In KerBS w/o dynamic allocation, each word is allocated a fixed number of senses.

Table 5: Results of ablation study on MT (Seq2Seq).
Models                 | BLEU-4
Seq2Seq + KerBS        | 27.28
w/o kernel             | 26.79
w/ only single sense   | 26.80
w/o dynamic allocation | 27.00
Though it still performs better than single-sense models, it is slightly worse than the full KerBS model, which shows the necessity of dynamic allocation.

5.4 Detailed Analysis

In this part, we verify by example that KerBS learns reasonable sense numbers M and variance parameters θ, and draw the following conclusions.

Firstly, KerBS learns the multi-sense property. From Table 6, we find that words with a single meaning, including some proper nouns, are allocated only one sense, while words with more complex meanings, such as pronouns, need more senses to represent them. (In our experiments, we restrict each word's sense number to between 1 and 4 in order to keep training stable.) In addition, we find that words with 4 senses have several distinct meanings; for instance, "change" means transformation as well as small currency.

Table 6: Randomly selected words with different numbers of senses M after training.
M = 1: Redwood, particular, heal, structural, theoretical, rotate
M = 2: figure, during, known, size
M = 3: open, order, amazing, sound, base
M = 4: they, work, body, power, change

Figure 3: Words with different θ.

Secondly, θ in KerBS is an indicator of a word's semantic scope. In Figure 3 we compare the θ of 3 sets of nouns; in each set, words denoting broader concepts (such as car, animal and earth) have larger θ.

5.5 Time Complexity

Compared with the baselines, the extra computation cost of incorporating KerBS into text generation mainly lies in the larger vocabulary for embedding matching, which is only a portion of the whole computation of text generation. Empirically, when we set the total sense number to about 3 times the vocabulary size, KerBS takes twice as long as vanilla softmax per epoch.

6 Conclusion

Text generation requires a proper embedding space for words.
In this paper, we proposed KerBS to learn better embeddings for text generation. Unlike the traditional softmax, KerBS includes a Bayesian composition of multi-sense embeddings for words and a learnable kernel to capture the similarities between words. Incorporating KerBS into text generation boosts the performance of several text generation tasks, especially dialog generation. Future work includes proposing better kernels for generation and designing a meta-learner to dynamically reallocate senses.

Acknowledgments

We would like to thank Xunpeng Huang and Yitan Li for helpful discussion and review of the first version. We also wish to thank the anonymous reviewers for their insightful comments.

References

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. Improving distributed representation of word sense via WordNet gloss composition and context clustering. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 15–20, 2015.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. A unified model for word sense representation and disambiguation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1025–1035, 2014.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014, 2014.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics, 2012.

Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR, 2017.

Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy. Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 683–693, 2015.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In ACL, 2007.

Jason Lee, Elman Mansimov, and Kyunghyun Cho.
Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182, 2018.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, 2017.

Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Topical word embeddings. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2418–2424. AAAI Press, 2015.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.

Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, 2018.

Joseph Reisinger and Raymond J. Mooney. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 109–117. Association for Computational Linguistics, 2010.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 196–205, 2015.

Chi Sun, Hang Yan, Xipeng Qiu, and Xuanjing Huang. Gaussian word embedding with a Wasserstein distance loss. arXiv preprint arXiv:1808.07016, 2018.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 151–160, 2014.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

Luke Vilnis and Andrew McCallum.
Word representation via Gaussian embedding. In ICLR, 2015.

Zhaohui Wu and C. Lee Giles. Sense-aware semantic analysis: A multi-prototype word representation model using Wikipedia. In AAAI, pages 2188–2194, 2015.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. In ICLR, 2018.