{"title": "Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models", "book": "Advances in Neural Information Processing Systems", "page_first": 6308, "page_last": 6319, "abstract": "Neural language models (NLMs) have recently gained a renewed interest by achieving state-of-the-art performance across many natural language processing (NLP) tasks. However, NLMs are very computationally demanding largely due to the computational cost of the decoding process, which consists of a softmax layer over a large vocabulary.We observe that in the decoding of many NLP tasks, only the probabilities of the top-K hypotheses need to be calculated preciously and K is often much smaller than the vocabulary size.\nThis paper proposes a novel softmax layer approximation algorithm, called Fast Graph Decoder (FGD), which quickly identifies, for a given context, a set of K words that are most likely to occur according to a NLM. We demonstrate that FGD reduces the decoding time by an order of magnitude while attaining close to the full softmax baseline accuracy on neural machine translation and language modeling tasks. We also prove the theoretical guarantee on the softmax approximation quality.", "full_text": "Navigating with Graph Representations for Fast and\n\nScalable Decoding of Neural Language Models\n\nMinjia Zhang\n\nXiaodong Liu\n\nJianfeng Gao\n\nYuxiong He\n\n{minjiaz,xiaodl,wenhanw,jfgao,yuxhe}@microsoft.com\n\nWenhan Wang\nMicrosoft\n\nAbstract\n\nNeural language models (NLMs) have recently gained a renewed interest by achiev-\ning state-of-the-art performance across many natural language processing (NLP)\ntasks. However, NLMs are very computationally demanding largely due to the\ncomputational cost of the decoding process, which consists of a softmax layer\nover a large vocabulary. 
We observe that in the decoding of many NLP tasks, only the probabilities of the top-K hypotheses need to be calculated precisely, and K is often much smaller than the vocabulary size. This paper proposes a novel softmax layer approximation algorithm, called Fast Graph Decoder (FGD), which quickly identifies, for a given context, a set of K words that are most likely to occur according to an NLM. We demonstrate that FGD reduces the decoding time by an order of magnitude while attaining accuracy close to the full-softmax baseline on neural machine translation and language modeling tasks. We also prove a theoretical guarantee on the softmax approximation quality.

1 Introduction

Drawing inspiration from biology and neurophysiology, recent progress on many natural language processing (NLP) tasks has been remarkable with deep neural network based approaches, including machine translation [1–3], sentence summarization [4], speech recognition [5–7], and conversational agents [8–11]. Such approaches often employ a neural language model (NLM) in a decoder at inference time to generate a sequence of tokens (e.g., words) given an input [1–7, 9, 10, 12–14]. One long-recognized issue of decoding with NLMs is the computational complexity, which easily becomes a bottleneck when the vocabulary size is large. Consider a beam search decoder using an NLM. At each decoding step, a recurrent neural network [15, 16] first generates a context vector based on each partial hypothesis in the beam. It then uses a softmax layer to compute a word probability distribution over the vocabulary [17–19]. The softmax layer consists of an inner product operator that projects the context vector into a vocabulary-sized vector of logits, followed by a softmax function that transforms these logits into a vector of probabilities.
Finally, the decoder selects the top-K words with the highest probabilities given the context (i.e., the top-K maximum subset under inner product), and stores the expanded hypotheses and their probabilities in the beam. The most computationally expensive part of this process is the softmax layer, where the complexity of performing the inner product is linear with respect to the vocabulary size. In this paper we strive to develop new softmax approximation methods for fast decoding.
Many techniques have been proposed to speed up the softmax layer in training, such as hierarchical softmax [20] and sampling-based approaches [21–24]. However, most of them cannot be directly applied to decoding at inference time because they rely on knowing the words to be predicted, and they still need to calculate the probabilities of all words to find the most likely prediction. Other works speed up softmax inference (in training and decoding) by reducing the cost of computing each word's probability using some approximation [23, 25–27]. Though the cost of computing each word's probability is reduced, the complexity of the softmax layer as a whole is still linear with respect to the size of the vocabulary. We notice that in many NLP tasks we only need to identify the top-K most likely next words given a context. Do we have to go over the entire large vocabulary to search for the top-K words? Our answer is no. Before we present our approach, we briefly review the finding in biological science that motivates our research.
In spite of the large number of words in a vocabulary, the human brain is capable of managing them effectively and navigating the massive mental lexicon very efficiently. How is this possible? How is the vocabulary implemented in the human brain?

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
One theory from biological science indicates that human language has the character of a complex network [28–30], where the intrinsic relational structure (the fact that words are related to each other and thus form a small world graph) provides hints about how the lexicon is mentally organized. To predict the next word given a context, humans never need to examine every word in the vocabulary stored in their brains. Instead, a person can immediately identify a small set of K candidate words that are most semantically related to the context, and then pick the most appropriate word among the top-K candidates. We believe that if we can represent the vocabulary using a similar small world graph, we can significantly improve the decoding efficiency of NLMs, because softmax only needs to explicitly compute the probabilities of K words, where K is much smaller than the vocabulary size.
We propose a Fast Graph Decoder (FGD) to approximate the softmax layer of an NLM in the decoding process. First, we construct a small world graph representation [31, 32] of the vocabulary. The nodes in the graph are words, each represented by a continuous vector transformed from its word embedding vector in the NLM. The edges in the graph encode word-word distances in a well-defined metric space. Then, at each decoding step, we identify for a given context (e.g., a partial hypothesis in the beam search) the top-K hypotheses and compute their probabilities in the softmax layer of the NLM. We prove that finding the top-K hypotheses in the softmax layer is equivalent to finding the K nearest neighbors in the small world graph using FGD, and the latter can be performed approximately using efficient graph navigating methods.
We also prove that the decoding error due to the use of the approximate K-nearest-neighbor search with graph navigation is theoretically bounded.
We validate the effectiveness of our approach on two NLP tasks, neural machine translation and language modeling. Empirical results show that FGD achieves an order of magnitude speedup while attaining accuracy comparable to existing state-of-the-art approaches.
In the rest of the paper, Section 2 details the softmax layer implementation and the challenge. Section 3 describes FGD and gives theoretical justifications. Section 4 presents experimental results. Conclusions are drawn in Section 5.

2 Motivation and Challenge

The softmax layer of an NLM is a major computational bottleneck in the decoding process of many NLP tasks. Consider an NLM that uses a two-layer LSTM and a vocabulary of size |V| [17–19]. The total number of floating point operations (FLOPS) per LSTM step is 2 (layers) × (I + D) × D × 4 × 2 (footnote 1), where I and D represent the input and hidden dimension, respectively. The number of FLOPS of the softmax layer is D × |V| × 2, which is proportional to the vocabulary size |V|. Assuming that the input/hidden dimension of the LSTM is 500 and the vocabulary size is 50K, the LSTM part takes 8M FLOPS, whereas the softmax layer takes 50M FLOPS. The softmax layer dominates the computational cost of the NLM, and even more so with a larger vocabulary.
This decoding bottleneck limits NLMs' application in many interactive services such as Web search, online recommendation systems, and conversational bots, where low latency, often at the scale of milliseconds, is demanded.
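These back-of-the-envelope FLOP counts can be reproduced with a short sketch (the helper names are ours, not the paper's):

```python
def lstm_flops(input_dim, hidden_dim, layers=2):
    # 4 weight matrices per layer (3 gates + 1 memory cell), each of
    # shape (I + D) x D; each weight incurs a multiply-and-add (x2).
    return layers * (input_dim + hidden_dim) * hidden_dim * 4 * 2

def softmax_flops(hidden_dim, vocab_size):
    # One D x |V| projection; again each weight is a multiply-and-add.
    return hidden_dim * vocab_size * 2

# Numbers from Section 2: I = D = 500, |V| = 50K.
print(lstm_flops(500, 500))        # 8,000,000  (8M FLOPS)
print(softmax_flops(500, 50_000))  # 50,000,000 (50M FLOPS)
```

At these sizes the softmax projection alone is over 6x the cost of the two recurrent layers, and the gap widens linearly with |V|.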
In addition, unlike model training, where we can leverage the massive parallelism of GPUs, inference needs to run on various clients, ranging from PCs and mobile phones to IoT (Internet of Things) devices, most of which have limited hardware resources and where GPUs are not always available [33]. Therefore, fast decoding is crucial to broadening the applicability of NLMs.

3 Approach

The goal of this work is to reduce the computational complexity of decoding, the major bottleneck for many NLP tasks. Our approach is called FGD (Fast Graph Decoder). It consists of an offline step that transforms the pre-trained word embeddings into a small world graph representation, and an online inference step that finds the top-K hypotheses, as outlined in Figure 1. In what follows, we present in turn:

• Why do we use the small world graph representation to find top-K hypotheses? (Section 3.1)
• How to construct a small world graph? (Section 3.2)
• How to identify top-K hypotheses for a given context on the small world graph? (Section 3.3)

Footnote 1: The factor 4 is because an LSTM has 3 gates and 1 memory cell; the factor 2 is because each weight value incurs a multiply-and-add operation.

Figure 1: Overview of FGD: (a) illustrates the original operations of selecting top-K hypotheses in the decoding process, where the top-K hypotheses are selected from the output of softmax. The inner product between the context vector h and the word embedding vectors x1, x2, ..., x|V| in the softmax layer is the most expensive part, with a complexity of O(D × |V|). (b) shows the transformation from x1, x2, ..., x|V| to a small world graph representation that encodes pair-wise similarity information among words. This transformation, incurred once offline, is essential for FGD to perform fast decoding. (c) shows the decoding process of FGD at online inference time.
For a given context vector h, FGD identifies the top-K hypotheses in the small world graph and produces their probabilities with a search complexity of O(D × log |V|).

3.1 Why Small World Graphs?

Nearest neighbor search is a commonly used method to identify the top-K points in a set that are most similar to a given point. The small world graph has recently been introduced to address the problem of nearest neighbor search [34, 35]. Research shows that navigation in a small world graph exhibits O(log N) search complexity (where N is the number of nodes in the graph) and performs well in high dimensions [34–36]. These results motivate us to investigate the use of the small world graph to develop fast decoding methods for NLMs with large vocabularies.
To attain logarithmic search complexity, the small world graph needs to perform neighborhood selection based on a well-defined pair-wise metric distance, and to possess small world properties, such as high local connectivity (as in a lattice graph) combined with a small graph diameter (as in a random graph) [37], which we describe in detail in the following sections.

3.2 Small World Graph Construction

Figure 1 illustrates the transformation from the set of word embedding vectors in Figure 1(a) to a small world graph representation G in Figure 1(b). Given a set of word embedding vectors X = [x1, x2, ..., x|V|], xi ∈ R^D, where |V| is the vocabulary size and D is the word embedding dimension, our transformation process takes two steps.

1. Inner product preserving transformation (representation transformation): For each vector-based embedding xi in X, we apply a transformation called Inner Product Preserving Transformation (IPPT) to obtain X̄ = [x̄1, x̄2, ..., x̄|V|], x̄i ∈ R^{D+2}, which establishes the equivalence between finding the top-K maximum subset under inner product in X and searching for the top-K nearest neighbors in X̄ under a given distance metric ρ.
(Section 3.2.1)

2. Small world graph construction (data structure transformation): Denote by G = (X̄, E) the small world graph with the set X̄ as graph nodes and E as graph edges, where we impose the edge set E based on the distance metric ρ so as to form a small world graph. (Section 3.2.2)

3.2.1 Inner Product Preserving Transformation

Why is the inner product insufficient? Using the inner product to represent mutual similarity between nodes is deficient because it lacks basic properties that are required of distance (i.e., the inverse of similarity) functions in metric spaces (e.g., Euclidean spaces): the identity of indiscernibles and the triangle inequality [38]. For example, in Euclidean space, two points are the same iff their distance is 0. The inner product of a point x with itself is ‖x‖₂², but there can be other points whose inner product with x is smaller than ‖x‖₂². The search process on small world graphs relies on these properties to converge and achieve its efficiency [34].

Representation transformation. To create a similarity relationship between words represented with a metric, we present a new method called Inner Product Preserving Transformation (IPPT), which converts the word embedding vectors into higher dimensional vectors. We establish the equivalence between finding the top-K maximum subset under inner product and searching for the top-K nearest neighbors under a distance metric in the higher dimensional space. We use the notation ⟨·,·⟩ for the inner product and ρ(·,·) for the distance in Euclidean space; thus ρ(ξ, η) = ‖ξ − η‖₂ = sqrt(⟨ξ − η, ξ − η⟩). In the following lemma, we define a transform φ for the word embeddings and another transform h ↦ h̄ for the context vector into a higher dimensional space, such that the inner product between the embedding and context vectors remains the same before and after the transform. We call this inner product preserving. We include the proof in Appendix A.

Lemma 3.1. Suppose D ≥ 1. Let xi ∈ R^D and bi ∈ R be the vector-based word embedding and bias at position i in the softmax layer, respectively, for 1 ≤ i ≤ |V|. Choose U such that U ≥ max_{i∈V} sqrt(‖xi‖₂² + bi²). Let [;] denote vector concatenation. Define φ : {(ξ, η) ∈ R^D × R : ‖ξ‖₂² + η² ≤ U²} → R^{D+2} as φ(x, b) = [x; b; sqrt(U² − ‖x‖₂² − b²)]. For a context vector h ∈ R^D, let h̄ = [h; 1; 0] ∈ R^{D+2}. Then for any i ∈ V, we have ⟨h, xi⟩ + bi = ⟨h̄, φ(xi, bi)⟩ = (1/2)(U² + 1 + ‖h‖₂² − ρ(h̄, φ(xi, bi))²).

The lemma states that for a fixed context vector h, ⟨h, xi⟩ + bi depends linearly and monotonically decreasingly on ρ(h̄, φ(xi, bi))². In particular, ⟨h, xi⟩ + bi ≤ ⟨h, xj⟩ + bj iff ρ(h̄, φ(xi, bi)) ≥ ρ(h̄, φ(xj, bj)). This gives rise to the equivalence between top-K maximum inner product search and top-K nearest neighbor search in the graph.

Definition 1 (Top-K maximum (minimum) subset). Let V be the vocabulary and 1 ≤ K ≤ |V|.
We call K ⊆ V a top-K maximum (minimum) subset for a function f : V → R if |K| = K and f(vi) ≥ f(vj) (respectively, f(vi) ≤ f(vj)) for all vi ∈ K and vj ∉ K.

Theorem 3.2. Suppose 1 ≤ K ≤ |V| and consider a fixed context vector h. Let K ⊆ V be a top-K maximum subset for vi ↦ ⟨h, xi⟩ + bi. Then K is also a top-K minimum subset for the Euclidean distance vi ↦ ρ(h̄, φ(xi, bi)).

Based on this theorem, we can build a small world graph in R^{D+2} to equivalently solve the top-K maximum inner product search problem.

3.2.2 Small World Graph Construction

We present the algorithm FGD–P (P for Preprocessing) for constructing the small world graph in Algorithm 1. FGD–P is performed only once, on a trained model. It transforms the trained vector-based word representation X, a D-by-|V| matrix, into X̄ with IPPT (lines 4–9). FGD–P then builds an in-memory proximity graph possessing small world properties using G = CreateSwg(X̄, M) (line 10). Several existing works are devoted to constructing small world graphs. Among the most accomplished algorithms, HNSW (Hierarchical Navigable Small World) has recently attained outstanding speed-accuracy trade-offs [35]. We employ HNSW's graph construction algorithm to create the small world graph.
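The lift of Lemma 3.1 and the ordering equivalence of Theorem 3.2 can be verified numerically. The sketch below uses made-up dimensions, and the helper names `ippt` and `lift_context` are ours:

```python
import numpy as np

def ippt(x, b, U):
    # phi(x, b) = [x; b; sqrt(U^2 - ||x||^2 - b^2)]  (Lemma 3.1)
    return np.concatenate([x, [b, np.sqrt(U**2 - x @ x - b**2)]])

def lift_context(h):
    # h_bar = [h; 1; 0]
    return np.concatenate([h, [1.0, 0.0]])

rng = np.random.default_rng(0)
D, V = 8, 20
X = rng.normal(size=(V, D))
bias = rng.normal(size=V)
# Any U >= max_i sqrt(||x_i||^2 + b_i^2) works; pad slightly for float safety.
U = 1.001 * np.sqrt(max(x @ x + b**2 for x, b in zip(X, bias)))

Phi = np.stack([ippt(x, b, U) for x, b in zip(X, bias)])  # shape (V, D + 2)
h = lift_context(rng.normal(size=D))

logits = X @ h[:D] + bias
dists = np.linalg.norm(Phi - h, axis=1)

# The lift preserves the inner product: <h_bar, phi(x_i, b_i)> = <h, x_i> + b_i.
assert np.allclose(Phi @ h, logits)
# Larger logit <=> smaller distance, so the two rankings coincide (Theorem 3.2).
assert np.array_equal(np.argsort(-logits), np.argsort(dists))
```

All lifted points lie on a sphere of radius U, which is exactly what makes Euclidean distance to h̄ a monotone function of the logit.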
We briefly describe the main ideas below; please find more details in Malkov and Yashunin [35].

Algorithm 1 Offline preprocessing algorithm FGD–P
1: Input: Trained weights of the softmax layer X, and bias vector b.
2: Output: Small world graph G, and Umax.
3: Hyperparameter: Small world graph neighbor degree M.
4: for all i in 0..(|X| − 1) do
5:     X̃:i ← [X:i; bi]                          ▷ Word embedding and bias fusion
6: Umax ← maxi ‖X̃:i‖₂                           ▷ Calculate the normalizer
7: for all i in 0..(|X̃| − 1) do
8:     Δi ← sqrt(Umax² − ‖X̃:i‖₂²)
9:     X̄:i ← [X̃:i; Δi]
10: G ← CreateSwg(X̄, M)                         ▷ Build small world graph

CreateSwg creates a multi-layer small world graph G, which consists of a chain of subsets V = L0 ⊇ L1 ⊇ ... ⊇ Ll of nodes as "layers", where the ground layer L0 contains all x̄i as nodes. L0 is built incrementally by iteratively inserting each word vector x̄i in X̄ as a node. Each node generates M (i.e., the neighbor degree) out-going edges. Among those, M − 1 are short-range edges, which connect x̄i to M − 1 sufficiently close nodes, i.e., neighbors, according to their pair-wise Euclidean distance to x̄i (e.g., the edge between x̄1 and x̄2 in Figure 1(b)). The remaining one is a long-range edge that connects x̄i to a randomly picked node; it does not necessarily connect the two closest nodes, but it connects isolated clusters (e.g., the edge between x̄3 and x̄|V| in Figure 1(b)). It is theoretically justified that these two types of edges give the graph small world properties [34, 35, 37]. Given L0, each layer Lk (k > 0) is formed recursively by picking each node in Lk−1 with a fixed probability 1/M, and the top layer contains only a single node.
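The layer assignment described above can be sketched as a geometric random draw, as in HNSW's construction; the helper name and the values of M and |V| below are ours, for illustration only:

```python
import math
import random

def assign_level(M, rng):
    # A node survives to layer k with probability (1/M)^k, so its top
    # level follows a geometric distribution, as in HNSW.
    level = 0
    while rng.random() < 1.0 / M:
        level += 1
    return level

rng = random.Random(0)
M, V = 8, 50_000
levels = [assign_level(M, rng) for _ in range(V)]

# Roughly |V|/M nodes reach layer 1, |V|/M^2 reach layer 2, and so on,
# so the hierarchy height is on the order of log|V| / log M (about 5 here).
print(max(levels), round(math.log(V) / math.log(M), 1))
```

The expected fraction of nodes above the ground layer is 1/M, which keeps the upper layers cheap while still providing long "express" routes toward a query's neighborhood.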
The number of layers is bounded by O(log |V| / log M) [34].

3.3 Decoding as Searching Small World Graphs

FGD–I (Algorithm 2) shows how FGD enables fast decoding (as in Figure 1(c)). It first transforms the context vector h to h̄ = [h; 1; 0] ∈ R^{D+2} (line 4). SearchSwg(G, h̄, K) then searches the small world graph to find the top-K hypotheses using the search method from HNSW [35]. Here we provide a brief description of the methodology.
The search starts from the top layer Ll of the graph and uses greedy search to find the node with the closest distance to h̄ as an entry point for descending to the lower layer. The upper layers route h̄ to an entry point in the ground layer L0 that is already close to the nearest neighbors of h̄. Once it reaches the ground layer, SearchSwg employs a prioritized breadth-first search: it examines the neighbors of each visited node and stores all visited nodes in a priority queue based on their distances to the context vector. The length of the queue is bounded by efSearch, a hyperparameter that controls the trade-off between search time and accuracy. When the search reaches a stop condition (e.g., a given number of distance calculations), SearchSwg returns the top-K word hypotheses and their distances to h̄. We then transform the distance values back to inner products (lines 5–7) using the inner product preserving property of IPPT.
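The distance-to-logit mapping in lines 5–7 follows directly from Lemma 3.1. Below is a minimal numerical sketch; the helper name `logits_from_distances` is ours, and the toy points stand in for IPPT-lifted embeddings:

```python
import numpy as np

def logits_from_distances(h_lifted, dists, U):
    # <h, x_i> + b_i = (1/2)(||h_bar||^2 + U^2 - rho^2)  (Lemma 3.1)
    return 0.5 * (h_lifted @ h_lifted + U**2 - dists**2)

# Toy lifted points on a sphere of radius U = 1, as IPPT guarantees.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(10, 6))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)
h = np.concatenate([rng.normal(size=4), [1.0, 0.0]])  # lifted context

d = np.linalg.norm(Phi - h, axis=1)
s = logits_from_distances(h, d, U=1.0)

# The recovered logits match the direct inner products.
assert np.allclose(s, Phi @ h)

# A softmax over only the K returned candidates (line 8 of FGD-I):
p = np.exp(s - s.max())
p /= p.sum()
assert np.isclose(p.sum(), 1.0)
```

Because only K distances are returned, the exponentiation and normalization touch K values rather than |V|.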
FGD–I generates the output by computing a softmax distribution over the inner products of the top-K returned results (line 8).

Algorithm 2 Online inference algorithm FGD–I
1: Input: Context vector h, small world graph G, and Umax.
2: Output: Probability distribution P over top-K word hypotheses.
3: Hyperparameter: Candidate queue length efSearch.
4: h̄ ← [h; 1; 0]                                ▷ Map context vector from R^D to R^{D+2}
5: I^K, D^K ← SearchSwg(G, h̄, K)                ▷ Return top-K hypotheses with minimal distance to h̄
6: for all i in 0..(K − 1) do
7:     S[I^K_i] ← (1/2)(‖h̄‖₂² + Umax² − (D^K_i)²)   ▷ Map Euclidean distance back to inner product
8: P ← exp(S) / Σ exp(S)                         ▷ Compute top-K softmax probability distribution

Bounding the softmax probability distribution. In practice, letting K = |V| is both slow and unnecessary; an approximate approach is often much more efficient without sacrificing much accuracy. We demonstrate the effectiveness of our approach empirically in Section 4 and provide a theoretically derived error bound of the softmax approximation with top-K word hypotheses toward a probability distribution in Appendix C.

4 Evaluation

Summary of main results. In this section, we present the results of FGD on two different tasks: neural machine translation (NMT) and language modeling (LM).

1. On NMT, FGD obtains more than 14X speedup in softmax layer execution time over full-softmax with a BLEU score similar to the baseline, and obtains 30X speedup at the cost of a 0.67 BLEU score decrease.

2. On LM, FGD scales with a logarithmic increase of execution time and outperforms full-softmax in speed by an order of magnitude for large vocabularies.

Setup. We implement FGD in Python using numpy (footnote 2). To construct the small world graph, we employ a state-of-the-art framework, NMSLIB [35, 39].
Execution time is reported as the average per-step decoding time in milliseconds, measured on a 64-bit Linux Ubuntu 16.04 server with two Intel Xeon E5-2650 v4 @ 2.20GHz processors, in a single-thread regime so that all algorithms are compared under the same amount of hardware resources.

4.1 Neural Machine Translation

NMT is implemented using a sequence-to-sequence model which contains an RNN encoder and an RNN decoder. The decoder performs an output projection at every step to predict the next word. Decoding time and BLEU score [40] are the two major metrics for this evaluation; the lower the decoding time without sacrificing much BLEU, the better the result. We trained a global attention-based [41] encoder-decoder model with two stacked unidirectional LSTMs [1, 2] using the OpenNMT-py toolkit [42] on the IWSLT'14 German-English corpus [43]. We set the LSTM hidden dimension to 200. The model is optimized with SGD using an initial learning rate of 1.0 and a dropout [44] ratio of 0.3. The dataset is tokenized and preprocessed using the OpenNMT data preprocessor with the |V| = 50,000 most frequent words [24, 41]. The BLEU score is computed using the Moses toolkit [45].
Once the model is trained, we process the trained weights of the softmax layer with FGD–P offline. It takes three minutes on our server to construct the small world graph. During online processing, the hyperparameter efSearch decides the length of the candidate queue used to track nearest neighbors, which offers a trade-off between online decoding speed and BLEU score quality. We tested different efSearch values and identified [20, 200] as a good range.

Decoding time and BLEU score comparison with existing methods.
Two approaches are used for comparison: 1) a full-softmax approach; and 2) a state-of-the-art approach called SVD-softmax [25]. SVD-softmax improves inference speed by approximating the softmax layer using singular value decomposition (SVD). It includes two steps: it first estimates the probability of each word using a small part of the softmax layer weight matrix, and then performs a refinement on the top-Ṽ most likely words based on the previous estimates. It reduces the complexity from O(|V| × D) to O(|V| × D̃ + Ṽ × D), where 1 ≤ D̃ < D is the preview window width and Ṽ < |V| is the refinement window width. As suggested by [25], we use two configurations of SVD-softmax: SVD-a (footnote 3) and SVD-b (footnote 4).

Footnote 2: http://www.numpy.org/
Footnote 3: The preview window width D̃ is set to 16, and the refinement window width Ṽ is set to 2500.
Footnote 4: The preview window width D̃ is set to 16, and the refinement window width Ṽ is set to 5000.

Figure 2: Execution time of the softmax layer and BLEU score of the NMT model with FGD, SVD-softmax (SVD), and full-softmax (Full). [20, 50, 100, 200] are the values of the FGD hyperparameter efSearch. Execution time is displayed as the height of the bars, in milliseconds (lower is better). BLEU scores are labeled with colored numbers on top (higher is better).

Figure 2 shows the main results: FGD achieves significantly lower execution time than the existing methods with comparable BLEU scores.
Compared with full softmax, when efSearch is 20, FGD reduces the execution time from 6.3ms to 0.21ms, achieving 30X speedup at the cost of losing 0.67 BLEU score. By increasing efSearch to 50, FGD obtains nearly the same BLEU score as the full-softmax baseline while reducing the execution time from 6.3ms to 0.43ms, achieving more than 14X speedup.
For SVD-softmax, we observe that SVD-b approaches a BLEU score close to the full-softmax baseline, but it is much slower than FGD in terms of execution time (5.53ms vs. 0.43ms).
SVD-a shows slightly better speed than SVD-b but with a lower BLEU score. Although the theoretical speedup of SVD-a is 5.5X, it achieves only 1.3X speedup in practice, because the top-Ṽ most likely words selected in the first step appear at discontinuous locations in memory, which incurs non-negligible memory copy cost to bring them into a contiguous space for the second-step calculation.

Sensitivity to sequence length. Figure 3 reports the results with efSearch = 100. FGD is on a par with the full-softmax baseline uniformly across different lengths (without statistically significant differences), which demonstrates the robustness of the proposed approach.

Sensitivity to beam size. We vary the beam size among 1, 2, 5, and 10, which are typical settings used in prior work [1–3, 46]. Table 1 shows that, when efSearch is equal to or larger than 50, FGD obtains BLEU scores close to the full-softmax baseline under all beam sizes, without statistically significant differences.

Figure 3: BLEU score breakdown by sentence length (setting efSearch = 100).

Table 1: BLEU score on the NMT task, with various beam sizes.
efSearch   Beam = 1   Beam = 2   Beam = 5   Beam = 10
20         26.69      27.65      27.81      27.62
50         27.55      28.76      29.06      28.9
100        27.63      28.94      29.28      29.1
200        27.53      28.99      29.28      29.22
Full       27.36      28.91      29.45      29.34

Internals of FGD. To reveal the internals of FGD, we analyze two metrics: precision@K (or equivalently P@K) and dist_cnt. Precision@K measures the proportion of overlap between the retrieved top-K hypotheses and the expected top-K hypotheses, i.e., what top-K on a full-softmax would return. dist_cnt measures the number of distance computations in FGD under a given efSearch. Table 2 reports precision@K for K = 1, 2, 5, and 10, corresponding to beam sizes 1, 2, 5, and 10 respectively, and dist_cnt with versus without FGD.
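The precision@K metric can be computed with a one-liner (a sketch; the function name is ours):

```python
def precision_at_k(retrieved, expected, k):
    # Overlap between the graph-search top-K and the exact top-K
    # that a full softmax would return, as a fraction of K.
    return len(set(retrieved[:k]) & set(expected[:k])) / k

# Toy example: the search recovered one of the two true top-2 words.
assert precision_at_k([3, 7, 1, 9], [3, 1, 4, 7], 2) == 0.5
assert precision_at_k([3, 7, 1, 9], [3, 1, 4, 7], 4) == 0.75
```

Note that the metric is set-based: it does not penalize rank swaps within the top-K, only misses.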
Overall, FGD achieves fairly high precision. In particular, gradually increasing efSearch leads to higher precision at the expense of an increased number of distance computations. This matches the observations that higher efSearch leads to a higher BLEU score (Figure 2) and also longer execution time (Table 1). Further increasing efSearch yields little extra precision improvement but significantly more distance computation, because the precision is already close to 1; this explains why FGD can get close to the baseline BLEU score (Table 1). We also observe that, under the same efSearch, increasing K sometimes leads to slightly worse precision if efSearch is not large enough (e.g., efSearch = 20), as highly ranked words not visited during the graph search are definitively lost. On the other hand, the amount of distance computation grows proportionally to efSearch. Compared with the full-softmax, the amount of distance computation is reduced by 10–50 times, which explains the speedup in decoding time (Figure 2).

Table 2: Precision and distance computation results on the NMT model.
efSearch   P@1     P@2     P@5     P@10    dist_cnt (FGD / Full)
20         0.939   0.934   0.929   0.918   981 / 50K
50         0.974   0.974   0.973   0.971   1922 / 50K
100        0.986   0.986   0.987   0.987   3310 / 50K
200        0.992   0.993   0.994   0.994   5785 / 50K

4.2 Language Modeling

This section evaluates the impact of vocabulary size and word embedding dimension on FGD using language models (footnote 5) trained on WikiText-2 [47]. The model uses a two-layer LSTM (footnote 6).

Impact of vocabulary size. We explore models with vocabulary sizes of 10,000 (10K), 20,000 (20K), 40,000 (40K), and 80,000 (80K). The vocabulary is created by tokenizing the raw text via the Moses toolkit [45] and choosing the correspondingly most frequent words on the raw WikiText-2 dataset [47].
Both input and hidden dimension are set to 256.\nTable 3 shows the impact on search quality by varying the vocabulary size from 10K to 80K. With\nthe same ef Search, FGD generally obtains better precision results for smaller vocabularies; With\nthe same vocabulary size, bigger ef Search is better for high precision. With ef Search being 200,\nthe precision of FGD is getting very close to 100%.\n\n20\n\n50\n\n100\n\nFGD (efSearch)\n\nP@K\n\n|V|\n10K P@1\nP@10\n20K P@1\nP@10\n40K P@1\nP@10\n80K P@1\nP@10\n\n200\n0.870 0.938 0.989 1.000\n0.909 0.972 0.992 0.998\n0.845 0.932 0.975 0.995\n0.871 0.955 0.987 0.997\n0.808 0.912 0.936 0.980\n0.845 0.931 0.961 0.991\n0.832 0.933 0.966 0.982\n0.858 0.945 0.978 0.994\nTable 3: Precision of FGD on WikiText-\n2 dataset varying vocabulary size.\n\nFigure 4: Scalability of WikiText-2 language model\nvarying vocabulary size.\n\nFigure 4 shows the decoding time of varying vocabulary sizes on the full softmax baseline and\nFGD (settings ef Search={50, 200} for the sake of readability). As expected, the execution time all\nincreases with the increase of the vocabulary size. However, compared to the baseline, FGD provides\na shorter execution time consistently. As the vocabulary size increases, the execution time of the\nbaseline increases almost linearly, whereas FGD\u2019s execution time increases much more slowly. This\nis because the complexity of softmax is O(D \u00d7 |V |), which is linear to the size of the vocabulary,\nwhereas the complexity of FGD is O(D\u00d7 log |V |) is logarithmic to |V |. Therefore, FGD scales much\n\n5https://github.com/pytorch/examples/tree/master/word_language_model\n6 The models are trained with stochastic gradient descent (SGD) with an initial learning rate of 20 [48]. The\nbatch size is set to 20, and the network is unrolled for 35 timesteps. Dropout is applied to LSTM layers with a\ndropout ratio of 0.3 [44]. 
Gradient clipping is set to 0.25 [49].

better, and the improvement becomes more significant with larger vocabulary sizes. In particular, FGD is more than an order of magnitude faster than the baseline when the vocabulary size is medium or large. For example, FGD achieves more than a 30X speedup with |V| = 80K when efSearch = 50 (Appendix 6 includes a speedup graph).

It is worth mentioning that perplexity is often used to evaluate language models, but it is not applicable here. This work focuses on generating the probabilities of the top-K word hypotheses for fast decoding, and FGD achieves this by skipping the probability computation for all words outside the top-K list. As a result, evaluation in terms of perplexity is not applicable, because the probabilities of the words in the test set are undefined whenever those words are not in the top-K list. We observe that many end applications, such as machine translation, do not require the probabilities of words outside the top-K list. We therefore evaluate the language model using precision@K and the accuracy on the end applications.

Sensitivity to word embedding dimension. We also tested various word embedding dimensions; FGD consistently obtains higher precision with an order-of-magnitude reduction in execution time compared with the baselines (see Appendix D).

5 Discussion

Compatibility with parallelizable recurrence. There has been work on optimizing language models and their end applications by leveraging parallelizable structures to speed up the recurrent layer, because the sequential dependencies in the recurrent layer are hard to compute in parallel [50-56]. In comparison with these works, FGD speeds up the vocabulary search at the softmax layer. Combining FGD with these approaches would reduce the end-to-end decoding time even further.

Generalization to training.
There are several challenges that need to be addressed to make training efficient using FGD. During inference, the weights of the softmax layer are static. In contrast, during training those weights change constantly as new examples are seen. It would be prohibitively slow to update the parameters of all word embedding vectors and to update the underlying small world graph after every training step. One possibility is, during backpropagation, to propagate gradients based only on the top-K hypotheses retrieved during the forward pass of the model, and to update the parameter vectors of only those retrieved hypotheses. The challenge is then to figure out how gradients based on only a small set of examples affect the word embedding vectors, and how these sparse gradients can be propagated in a computationally efficient way.

6 Conclusion

We propose a novel decoding algorithm, called Fast Graph Decoder (FGD), which, for a given context, quickly navigates a small world graph representation of word embeddings to search for the set of K words that are most likely to be the next words to predict according to the NLM. On neural machine translation and neural language modeling tasks, we demonstrate that FGD reduces the decoding time by an order of magnitude (e.g., a 14X speedup compared with the full softmax baseline) while attaining similar accuracy. As future work, we would also like to explore how to speed up the training of NLMs with large vocabularies.

Acknowledgments

We thank Kevin Duh for reading a previous version of this paper and providing feedback. We thank the anonymous reviewers for their helpful suggestions for improving this paper.

References

[1] Kyunghyun Cho, Bart van Merriënboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP '14, pages 1724-1734, 2014.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473, 2014.

[3] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, NIPS '14, pages 3104-3112, 2014.

[4] Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP '15, pages 379-389, 2015.

[5] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech Recognition with Deep Recurrent Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '13, pages 6645-6649, 2013.

[6] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.

[7] Geoffrey Zweig, Chengzhu Yu, Jasha Droppo, and Andreas Stolcke. Advances in all-neural speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '17, pages 4805-4809, 2017.

[8] Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to conversational AI. arXiv preprint arXiv:1809.08267, 2018.

[9] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. Deep Reinforcement Learning for Dialogue Generation.
In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP '16, pages 1192-1202, 2016.

[10] Oriol Vinyals and Quoc V. Le. A Neural Conversational Model. arXiv preprint arXiv:1506.05869, 2015.

[11] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. A neural network approach to context-sensitive generation of conversational responses. In NAACL-HLT, May 2015.

[12] Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP '15, pages 379-389, 2015.

[13] Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. Sentence Compression by Deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP '15, pages 360-368, 2015.

[14] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI '16, pages 3776-3784, 2016.

[15] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.

[16] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv preprint arXiv:1412.3555, 2014.

[17] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent Neural Network Based Language Model.
In 11th Annual Conference of the International Speech Communication Association, INTERSPEECH '10, pages 1045-1048, 2010.

[18] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3:1137-1155, 2003.

[19] Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the Limits of Language Modeling. arXiv preprint arXiv:1602.02410, 2016.

[20] Frederic Morin and Yoshua Bengio. Hierarchical Probabilistic Neural Network Language Model. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS '05.

[21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems, NIPS '13.

[22] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, ICML '12.

[23] Wenlin Chen, David Grangier, and Michael Auli. Strategies for Training Large Vocabulary Neural Language Models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL '16, 2016.

[24] Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL, pages 1-10, 2015.

[25] Kyuhong Shim, Minjae Lee, Iksoo Choi, Yoonho Boo, and Wonyong Sung.
SVD-Softmax: Fast Softmax Approximation on Large Vocabulary Neural Networks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, NIPS '17, pages 5469-5479, 2017.

[26] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. arXiv preprint arXiv:1711.03953, 2017.

[27] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M. Schwartz, and John Makhoul. Fast and Robust Neural Network Joint Models for Statistical Machine Translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL '14.

[28] Ramon Ferrer i Cancho and Richard V. Solé. The small world of human language. Proceedings of the Royal Society of London B: Biological Sciences, 268(1482):2261-2265, 2001.

[29] Adilson E. Motter, Alessandro P. S. De Moura, Ying-Cheng Lai, and Partha Dasgupta. Topology of the Conceptual Network of Language. Physical Review E, 65(6):065102, 2002.

[30] Sergey N. Dorogovtsev and José Fernando F. Mendes. Language as an Evolving Word Web. Proceedings of the Royal Society of London B: Biological Sciences, 268(1485):2603-2606, 2001.

[31] Jeffrey Travers and Stanley Milgram. The Small World Problem. Psychology Today, 1(1):61-67, 1967.

[32] Mark E. J. Newman. Models of the Small World. Journal of Statistical Physics, 101(3-4):819-841, 2000.

[33] Mung Chiang and Tao Zhang. Fog and IoT: An Overview of Research Opportunities. IEEE Internet of Things Journal, 3(6):854-864, 2016.

[34] Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, pages 61-68, 2014.

[35] Yury A. Malkov and D. A. Yashunin.
Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. arXiv preprint arXiv:1603.09320, 2016.

[36] Jon Kleinberg. The Small-world Phenomenon: An Algorithmic Perspective. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, STOC '00, pages 163-170, 2000.

[37] Duncan J. Watts. Small Worlds: The Dynamics of Networks Between Order and Randomness. 1999.

[38] James R. Munkres. Elements of Algebraic Topology. CRC Press, 2018.

[39] Leonid Boytsov and Bilegsaikhan Naidan. Engineering Efficient and Effective Non-metric Space Library. In Similarity Search and Applications - 6th International Conference, SISAP '13, pages 280-293, 2013.

[40] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311-318, 2002.

[41] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 1412-1421, 2015.

[42] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. OpenNMT: Open-Source Toolkit for Neural Machine Translation. arXiv preprint arXiv:1701.02810, 2017.

[43] Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, 2014.

[44] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
Journal of Machine Learning Research, 15(1):1929-1958, 2014.

[45] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177-180, 2007.

[46] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[47] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer Sentinel Mixture Models. arXiv preprint arXiv:1609.07843, 2016.

[48] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177-186. Springer, 2010.

[49] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML, pages 1310-1318, 2013.

[50] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-Recurrent Neural Networks. arXiv preprint arXiv:1611.01576, 2016.

[51] Tao Lei, Yu Zhang, and Yoav Artzi. Training RNNs as Fast as CNNs. arXiv preprint arXiv:1709.02755, 2017.

[52] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.

[53] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An Analysis of Neural Language Modeling at Multiple Scales.
arXiv preprint arXiv:1803.08240, 2018.

[54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pages 6000-6010, 2017.

[55] Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. Non-Autoregressive Neural Machine Translation. arXiv preprint arXiv:1711.02281, 2017.

[56] Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and Experiments on Vector Quantized Autoencoders. arXiv preprint arXiv:1805.11063, 2018.

[57] Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.