{"title": "Mixtape: Breaking the Softmax Bottleneck Efficiently", "book": "Advances in Neural Information Processing Systems", "page_first": 5775, "page_last": 5783, "abstract": "The softmax bottleneck has been shown to limit the expressiveness of neural lan-\r\nguage models. Mixture of Softmaxes (MoS) is an effective approach to address such a theoretical limitation, but are expensive compared to softmax in terms of both memory and time. We propose Mixtape, an output layer that breaks the softmax bottleneck more efficiently with three novel techniques\u2014logit space vector gating, sigmoid tree decomposition, and gate sharing. On four benchmarks including language modeling and machine translation, the Mixtape layer substantially improves the efficiency over the MoS layer by 3.5x to 10.5x while obtaining similar performance. A network equipped with Mixtape is only 20% to 34% slower than a\r\nsoftmax-based network with 10-30K vocabulary sizes, and outperforms softmax in perplexity and translation quality.", "full_text": "Mixtape: Breaking the Softmax Bottleneck Ef\ufb01ciently\n\nZhilin Yang1, Thang Luong2, Ruslan Salakhutdinov1, Quoc Le2\n\n1Carnegie Mellon University, 2Google Brain\n\n{zhiliny,rsalakhu}@cs.cmu.edu, {thangluong,qvl}@google.com\n\nAbstract\n\nThe softmax bottleneck has been shown to limit the expressiveness of neural lan-\nguage models. Mixture of Softmaxes (MoS) is an effective approach to address\nsuch a theoretical limitation, but are expensive compared to softmax in terms of\nboth memory and time. We propose Mixtape, an output layer that breaks the soft-\nmax bottleneck more ef\ufb01ciently with three novel techniques\u2014logit space vector\ngating, sigmoid tree decomposition, and gate sharing. On four benchmarks includ-\ning language modeling and machine translation, the Mixtape layer substantially\nimproves the ef\ufb01ciency over the MoS layer by 3.5x to 10.5x while obtaining similar\nperformance. 
A network equipped with Mixtape is only 20% to 34% slower than a softmax-based network with 10-30K vocabulary sizes, and outperforms softmax in perplexity and translation quality.\n\n1 Introduction\n\nSoftmax has been a standard output layer for a wide variety of neural networks, including the majority of neural language models [5, 2, 3, 8, 11]. However, as pointed out by [19], softmax fundamentally limits the expressiveness of neural language models, because it constrains the output representations to be low-rank, which might not be sufficient for modeling the complexity of natural language. Such a limitation is called the softmax bottleneck. To break the softmax bottleneck, [19] proposed Mixture of Softmaxes (MoS), which introduces discrete latent variables into the output layer so that the log probability matrix is high-rank because of the log-sum-exp nonlinear transformation. However, MoS is expensive compared to softmax in terms of both memory and time, which makes it less practically useful when computational budgets are limited.\nTo reduce the computational cost of MoS, we propose Mixtape, a novel output layer that breaks the softmax bottleneck efficiently. Mixtape can be plugged into any existing network as an additional layer before the cross entropy loss. Instead of employing a scalar mixture in the probability space as in MoS, Mixtape applies a vector gating mechanism in the logit space to avoid using multiple expensive softmaxes. In addition, Mixtape uses two more novel techniques to further reduce the computational cost. First, the vector gating mechanism is expensive because we need to compute a softmax gate for each word in the vocabulary. We propose sigmoid tree decomposition, which decomposes a softmax probability gating distribution into a depth-2 binary tree structure, where each branch carries a portion of the probability mass determined by a sigmoid function. 
Sigmoid tree decomposition is much more efficient because it avoids the reduction and division operations in softmax. The other technique, gate sharing, shares the gate values among all infrequent words, resulting in partially high-rank representations. This technique saves a considerable amount of memory and computation without affecting performance, because the gate values of infrequent words are usually hard to estimate accurately even without sharing the gates.\nWith all the above techniques combined, Mixtape substantially improves the efficiency of MoS while obtaining comparable or even better performance on four benchmarks, including language modeling and machine translation. With normal vocabulary sizes (e.g., 10K-30K), the Mixtape layer is 1.6x to 11.5x faster than the MoS layer given the same batch size, and is 3.5x to 10.5x faster given the same memory budget. With normal vocabulary sizes, a Mixtape-based network is only 5% to 18% slower than a softmax-based network given the same batch size, and is only 20% to 34% slower given the same memory budget. With a large vocabulary of 100K tokens, a Mixtape-based network is still only 60% slower than a softmax-based network. Both Mixtape and MoS outperform softmax in perplexity and translation quality. Interestingly, these benchmarks have varied vocabulary sizes ranging from 10K to 100K and different input representations including words and BPE subwords, which demonstrates that Mixtape is effective and robust with a variety of inputs.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Softmax Bottleneck\n\nIn the following, we introduce the notation and review the softmax bottleneck problem pointed out by [19].\nConsider a general setting of language modeling and text generation, where given the context C we want to estimate the conditional distribution of the next token P\u2217(X|C). 
Here we use P\u2217 to denote the true data distribution. The context C denotes the tokens that have occurred so far. For example, given a corpus (X1, X2, \u00b7\u00b7\u00b7, XT), for each time step t, we aim to estimate the probability P\u2217(Xt|C = X
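The two gating techniques described in the introduction can be sketched as follows. This is a minimal illustration based only on the paper's description, not its implementation: the component count K=4, the shape of the per-word gate activations, and all variable names are our assumptions. It shows how a depth-2 tree of three sigmoids yields a normalized 4-way gate distribution without softmax's reduction and division, and how those gates mix component logits per word before a single softmax.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_tree_gates(pre_acts):
    """Sigmoid tree decomposition (sketch): turn 3 sigmoid activations
    into a 4-way gate distribution via a depth-2 binary tree.
    pre_acts: array of shape (..., 3). Returns shape (..., 4)."""
    g1 = sigmoid(pre_acts[..., 0])  # mass split at the root
    g2 = sigmoid(pre_acts[..., 1])  # split within the left branch
    g3 = sigmoid(pre_acts[..., 2])  # split within the right branch
    # The four leaves partition the probability mass, so they sum to 1
    # by construction -- no normalizing reduction or division needed.
    return np.stack([g1 * g2, g1 * (1 - g2),
                     (1 - g1) * g3, (1 - g1) * (1 - g3)], axis=-1)

def mixtape_next_token_dist(H, W, gate_pre_acts):
    """Logit-space vector gating (sketch): mix K component logits with
    per-word gates, then apply one softmax over the vocabulary.
    H: (K, d) component context vectors (K=4 here, an assumption);
    W: (V, d) output word embeddings;
    gate_pre_acts: (V, 3) per-word pre-sigmoid gate activations."""
    gates = sigmoid_tree_gates(gate_pre_acts)   # (V, 4)
    component_logits = W @ H.T                  # (V, 4), one logit per component
    logits = (gates * component_logits).sum(-1) # (V,), gated mixture in logit space
    z = logits - logits.max()                   # stabilize the single softmax
    return np.exp(z) / np.exp(z).sum()
```

Note the contrast with MoS, which would run K full softmaxes over the vocabulary and mix the resulting probabilities; here only one softmax over V logits remains, which is where the claimed efficiency gain comes from.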