{"title": "Large Memory Layers with Product Keys", "book": "Advances in Neural Information Processing Systems", "page_first": 8548, "page_last": 8559, "abstract": "This paper introduces a structured memory which can be easily integrated into a neural network. The memory is very large by design and significantly increases the capacity of the architecture, by up to a billion parameters with a negligible computational overhead.\nIts design and access pattern is based on product keys, which enable fast and exact nearest neighbor search. The ability to increase the number of parameters while keeping the same computational budget lets the overall system strike a better trade-off between prediction accuracy and computation efficiency both at training and test time. This memory layer allows us to tackle very large scale language modeling tasks. In our experiments we consider a dataset with up to 30 billion words, and we plug our memory layer in a state-of-the-art transformer-based architecture. In particular, we found that a memory augmented model with only 12 layers outperforms a baseline transformer model with 24 layers, while being twice faster at inference time. We release our code for reproducibility purposes.", "full_text": "Large Memory Layers with Product Keys\n\nGuillaume Lample\u2217\u2020, Alexandre Sablayrolles\u2217, Marc\u2019Aurelio Ranzato\u2217,\n\nLudovic Denoyer\u2217\u2020, Herv\u00b4e J\u00b4egou\u2217\n\n{glample,asablayrolles,ranzato,denoyer,rvj}@fb.com\n\nAbstract\n\nThis paper introduces a structured memory which can be easily integrated into a\nneural network. The memory is very large by design and signi\ufb01cantly increases\nthe capacity of the architecture, by up to a billion parameters with a negligi-\nble computational overhead. Its design and access pattern is based on product\nkeys, which enable fast and exact nearest neighbor search. 
The ability to increase the number of parameters while keeping the same computational budget lets the overall system strike a better trade-off between prediction accuracy and computation efficiency, both at training and test time. This memory layer allows us to tackle very large scale language modeling tasks. In our experiments we consider a dataset with up to 30 billion words, and we plug our memory layer into a state-of-the-art transformer-based architecture. In particular, we found that a memory-augmented model with only 12 layers outperforms a baseline transformer model with 24 layers, while being twice as fast at inference time. We release our code for reproducibility purposes.3

1 Introduction

Neural networks are commonly employed to address many complex tasks such as machine translation [43], image classification [27] or speech recognition [16]. As more and more data becomes available for training, these networks grow increasingly large [19]. For instance, recent models both in vision [29] and in natural language processing [20, 36, 28] have more than a billion parameters. The higher capacity enables better modeling of data like natural text or images, and it also improves generalization [41, 33]. Unfortunately, increasing capacity has led to a dramatic increase of computational complexity, both at training and inference time [20].

There is a growing interest in developing architectures with reasonable computational complexity. Recently, there have been some efforts to develop high capacity architectures that operate on a limited computational budget [40, 18]. This is well illustrated by the "On-device Visual Intelligence Challenge" [5], which specifically focuses on the complexity/accuracy trade-off for image classification.

Some researchers have attempted to increase the capacity of a network without increasing its computational complexity. Most notably, Rae et al. 
[37] incorporate fast nearest neighbor search within a neural network architecture to leverage large key-value layers with sparse reads and writes. Their approach relies on an external indexing structure [32], which is approximate and needs to be re-learned regularly while training the neural network to avoid a catastrophic drift.

In this work, we propose a key-value memory layer that can scale to very large sizes while keeping exact search on the key space. This layer dramatically increases the capacity of the overall system for a negligible computational overhead. Unlike existing models based on key-value memories (see

∗Facebook AI Research
†Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6
3https://github.com/facebookresearch/XLM

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Overview of a key-value memory layer: The input x is processed through a query network that produces a query vector q, which is compared to all the keys. The output is the sparse weighted sum over the memories associated with the selected keys. For a large number of keys |K|, the key selection procedure becomes too expensive in practice. Our product key method is exact and makes this search process very fast.

Figure 1), we define keys as the concatenation of two sub-keys, in the spirit of product quantization [21]. As shown in more detail in Figure 2, this structure implicitly defines a very large set of keys, each being associated with a value memory slot. The set of value vectors introduces the bulk of the parameters, as it scales quadratically with the number of sub-keys. Despite the large number of memory slots, finding the exact closest keys to the input is very efficient, typically requiring O(√|K|) vector comparisons, where |K| is the total number of memory slots. 
All the memory parameters are trainable, yet only a handful of memory slots are updated for each input at training time. Sparsity of key selection and parameter updates makes both training and inference very efficient.

Our layer allows us to tackle problems where current architectures underfit given the vast amount of available data, or where they are too slow to work in practice. We thus focus on the language modeling task, integrating our memory within the popular transformer architecture [44]. This choice is motivated by the success of BERT [11] and GPT-2 [36], which demonstrated that increasing the capacity of large models directly translates to large improvements in language modeling, which in turn translates to better performance in both language understanding tasks [11, 46] and text generation [36]. Overall, our paper makes the following contributions:

◦ We introduce a new layer that provides a large capacity to a neural network for only a slight computational overhead, both at train and test time.

◦ Our fast indexing strategy offers exact nearest neighbor search by construction, and avoids the pitfall of relying on an indexing structure that needs to be re-learned during training.

◦ We demonstrate our method within a large state-of-the-art transformer, composed of 24 layers of dimension 1600. Our method with 1 memory and 12 layers outperforms a 24-layer transformer while being twice as fast at inference time. We show that adding more memory layers to transformers of various complexities provides systematic and significant improvements on our target task.

2 Related work

Different approaches have been proposed to increase the capacity of neural networks without increasing the computational complexity too much. 
For instance, conditional computation models aim at routing inputs into very large neural networks such that only a subset of connections and/or layers is used to process each input. Different methods have been developed, such as large mixtures of experts [40], gating techniques [3, 12, 6] or even reinforcement learning-based approaches [10].

Another line of research is the development of memory augmented neural networks. For instance, memory-based neural layers [47, 42] are an efficient way to represent variable length inputs for complex problems such as question answering [48]. Such memories can also operate in feature space and have various reading and writing mechanisms [23, 17]. Unfortunately, these approaches scale linearly with the size of the memory, which is prohibitive for very large memories. Neural cache models [15] suffer from the same scaling issues, which are circumvented by adopting approximate lookup techniques at test time [14].

Discretization techniques have been intensively studied for compressing network weights [8, 38] and/or activations [7, 38], or to accelerate inference. For instance, Gerald et al. [13] propose to map an input to a low-dimensional binary code, each code being associated with one category, thus reducing the complexity of inference by avoiding the use of a final large linear layer. Another model is proposed in [45], where the authors develop a fast locality-sensitive hashing technique to approximate the dot product between large matrices and vectors in neural networks. However, exploiting binary codes or approximate techniques at training time raises several challenges in terms of optimization, because approximate indexes are not accurate in high-dimensional spaces. In our paper, we borrow some ideas from product quantization (PQ) [21]. This is an approximate search technique that maps database vectors into compact codes. 
However, our goal is different: we do not build an approximate index, but rather we exploit the idea to represent a large set of key vectors by a drastically smaller number of vectors, which we update by regular back-propagation. As discussed later, the selection of the closest keys is exact and inherits from the fast neighbor search of PQ.

Our model is also related to sparsity models, which have mainly been studied in the unsupervised learning setting [34, 24]. For instance, the k-sparse autoencoder [30] only keeps the k largest values in the latent representation of an auto-encoder, similar to our memory layer but without the product keys component. In winner-take-all autoencoders [31], sparsity is induced by using mini-batch statistics, while the sparse access memory [37] reports some speed-up by both thresholding the memory to a sparse subset and using efficient data structures for content-based read operations. Unfortunately, the fast access to memories relies on an approximate external indexing structure [32] that has to be re-learned periodically. Our work solves this issue by fully incorporating the key selection mechanism as a network component.

The transformer network [44] is the current workhorse of Natural Language Processing (NLP): it is employed ubiquitously across a large variety of tasks. Transformers are built by stacking blocks composed of self-attention layers followed by fully connected layers (dubbed FFN), as shown in Figure 3. The components of the memory layer bear similarities to the query, key and value networks used in self-attention layers, with two notable differences: the keys and values do not correspond to input tokens but are free embedding vectors, and the number of values (memory size) is very large.

3 Learnable product key memories

We consider the design of a function m : Rd → Rn that will act as a layer in a neural network. 
The purpose of m is to offer a large capacity within a neural network.

3.1 Memory design

High-level structure. The overall structure of our memory is illustrated by Figures 1 and 2. The memory is composed of three components: a query network, a key selection module containing two sets of sub-keys, and a value lookup table. It first computes a query that is compared to the set of product keys. For each product key, it computes a score and selects the k product keys with the highest scores. The scores are then used to produce an output m(x) via a weighted sum over the values associated with the selected keys. All the parameters of the memory are trainable, yet only k memory slots are updated for each input. The sparse selection and parameter update make both training and inference very efficient.

Query generation: pre-processing network. The function q : x ↦ q(x) ∈ Rdq, referred to as the query network, maps the d-dimensional input to a latent space of dimensionality dq. Typically, q is a linear mapping or a multi-layer perceptron that reduces the dimensionality from d to dq = 512. As keys are randomly initialized, they occupy the space relatively uniformly. Adding a batch normalization layer on top of the query network helps increase key coverage during training. This insight is confirmed by our ablation experiments in Section 4.5.

Standard key assignment and weighting. Let q(x) be a query and Tk denote the top-k operator4. Given a set of keys K = {k1, . . . , k|K|} composed of |K| dq-dimensional vectors, and an input x,

4If the permutation (i1, . . . , in) sorts numbers (t1, . . . , tn) as ti1 ≥ ti2 ≥ ··· ≥ tin, the top-k indices are Tk(t1, . . . , tn) = {i1, . . . , ik}.

Figure 2: Illustration of the product keys. We define two discrete subsets of keys (sub-key set 1 and sub-key set 2). 
They induce a much larger set of keys, which are never made explicit (product keys). Given a query, we split it into two sub-queries (q1 and q2). Selecting the k closest keys (k = 2 in the figure) in each subset implicitly selects k × k keys. The k keys maximizing the inner product with the query are guaranteed to belong to this subset, on which the search can be done efficiently.

we select the top k keys maximizing the inner product with the query q(x):

I = Tk((q(x)ᵀki)i∈{1,...,|K|})    # Get k nearest neighbors    (1)
w = Softmax((q(x)ᵀki)i∈I)         # Normalize top-k scores     (2)
m(x) = Σi∈I wi vi                  # Aggregate selected values  (3)

Here I denotes the indices of the k most similar keys (where the similarity measure is the inner product), and w is the vector of normalized scores associated with the selected keys. All these operations can be implemented using auto-differentiation mechanisms, making our layer pluggable at any location in a neural network.

Operations (2), (3) only depend on the top-k indices and are therefore computationally efficient. In contrast, the exhaustive comparison of Equation (1) is not efficient for large memories since it involves computing |K| inner products. To circumvent this issue, we resort to a structured set of keys, which we refer to as product keys.

The product key set is defined as the outer product, with respect to the vector concatenation operator, of two vector codebooks C and C′:

K = {(c, c′) | c ∈ C, c′ ∈ C′}

The total number of keys induced by this Cartesian product construction is |K| = |C| × |C′|. The sets C and C′ both comprise sub-keys of dimension dq/2. We exploit this structure to compute the closest keys I ⊂ {1, . . . , |K|} efficiently. First, we split the query q(x) into two sub-queries q1 and q2. We then compute the k sub-keys in C (resp. C′) closest to the sub-query q1 (resp. q2):

IC = Tk((q1(x)ᵀci)i∈{1,...,|C|}),   IC′ = Tk((q2(x)ᵀc′j)j∈{1,...,|C′|})    (4)

We are guaranteed that the k most similar keys in K are of the form {(ci, c′j) | i ∈ IC, j ∈ IC′}. An example of product keys with the key selection process is shown in Figure 2.

3.2 Complexity

Searching for the top-k most similar keys when the keys have a flat representation requires |K| comparisons of vectors of size dq, i.e. O(|K| × dq) operations.

For product keys, we consider the setup where |C| = |C′|, i.e. the configuration that maximizes |C| × |C′| for a fixed number of sub-keys |C| + |C′|. Since |K| = |C| × |C′|, we have |C| = √|K|. We only need to compare the two sub-queries with the |C| and |C′| sub-keys of size dq/2, which amounts to O(|C| × dq/2 + |C′| × dq/2) = O(|C| × dq) = O(√|K| × dq) operations.

Then, we need to search for the top-k keys in {(ci, c′j) | i ∈ IC, j ∈ IC′}, which is a set composed of k² keys of dimension dq. This can be done in O(k² × dq) operations (in practice, this could be done in O(k log k) scalar operations with a priority list [1], but this choice is less compliant with GPU architectures). As a result, the overall complexity is:

O((√|K| + k²) × dq)

Figure 3: Left: A typical transformer block is composed of a self-attention layer followed by an FFN layer (a two-layer network). Right: In our system, we replace the FFN layer with a product key memory layer, which is analogous to a sparse FFN layer with a very large hidden state. In practice, we only replace the FFN layer in N layers, where typically N ∈ {0, 1, 2}.

For small values of k, and a memory of size |K| = 1024², retrieving the nearest product keys requires about 10³ times fewer operations than an exhaustive search. As shown later in our ablation study, product keys also lead to a better performance compared to a set composed of flat keys.

3.3 Multi-head memory attention

We make the model more expressive with a multi-head mechanism, where each head independently computes a query used to select k keys from the memory. The memory simply sums the output mi(x) of each head i: m(x) = Σi=1..H mi(x), where H is the number of heads.

Each head has its own query network and its own set of sub-keys, but all heads share the same values. This is similar to the multi-head attention used in transformers, except that we do not split the query into H heads, but instead create H queries. As the query networks are independent from each other and randomly initialized, they often map the same input to very different values of the memory. In practice, for the same input we observe very little overlap between the keys selected by two different heads. This method lets us increase key usage and generally improves performance. The impact of the multi-head attention mechanism is discussed in Section 4.5.

4 Experiments

We report results on large-scale experiments for transformer models equipped with a memory, followed by an ablation study that shows the impact of different memory components on the model performance and memory usage. We propose to replace the FFN block of some transformer layers by a memory, as presented in Figure 3. 
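The exact product-key selection described in Sections 3.1-3.2 can be sketched in a few lines of NumPy. This is an illustrative toy (`product_key_topk` is our name, not from the paper's released code): it splits the query, runs two small top-k searches as in Equation (4), re-ranks the k × k candidates, and checks the result against a brute-force search over all |C|² product keys.

```python
import numpy as np

def product_key_topk(q, C1, C2, k):
    """Exact top-k over the |C1| * |C2| product keys via two small searches.

    A product key (i, j) is the concatenation [C1[i], C2[j]], so its score
    against q = [q1, q2] decomposes as q1.C1[i] + q2.C2[j].
    """
    dq = q.shape[0]
    q1, q2 = q[: dq // 2], q[dq // 2:]
    s1, s2 = C1 @ q1, C2 @ q2                  # sub-key scores
    i1 = np.argsort(-s1)[:k]                   # top-k sub-keys in C1
    i2 = np.argsort(-s2)[:k]                   # top-k sub-keys in C2
    # scores of the k*k candidate product keys, then keep the best k
    cand = s1[i1][:, None] + s2[i2][None, :]
    flat = np.argsort(-cand, axis=None)[:k]
    rows, cols = np.unravel_index(flat, cand.shape)
    idx = i1[rows] * len(C2) + i2[cols]        # flat product-key ids
    return idx, cand[rows, cols]

# Sanity check against brute force over all |C1|*|C2| keys.
rng = np.random.default_rng(0)
n, dq, k = 32, 16, 4
C1 = rng.normal(size=(n, dq // 2))
C2 = rng.normal(size=(n, dq // 2))
q = rng.normal(size=dq)
# key (i, j) sits at row i*n + j of the materialized key matrix
all_keys = np.concatenate([np.repeat(C1, n, axis=0),
                           np.tile(C2, (n, 1))], axis=1)
brute = np.argsort(-(all_keys @ q))[:k]
idx, scores = product_key_topk(q, C1, C2, k)
assert set(idx) == set(brute)
```

The assertion holds because any key in the global top-k must have each of its two sub-keys in the corresponding sub-key top-k; only the k² candidate scores are ever materialized, giving the O((√|K| + k²) × dq) cost stated above.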
In that setting, the memory is integrated with a residual connection in the network, and the input x to the memory layer becomes x ← x + PKM(x) instead of x ← x + FFN(x). In practice, we could also keep the FFN layer and simply interleave the memory between some transformer layers.

4.1 Dataset

We evaluate the impact of our memory in a large scale language modeling task, where traditional models are known to underfit. The largest publicly available language modeling dataset is the One Billion Word corpus [4]. As noted in prior work [2, 9, 36], obtaining a good performance on this dataset requires tedious regularization, as it is now too small for standard architectures. In our experiments, we encountered the same issues, and observed that even a small model was enough to overfit: on this dataset, for a 16-layer model with a dimensionality of 1024, we obtain a test perplexity of 25.3 when the validation perplexity starts to increase. The train perplexity is then equal to 14.8 and keeps improving while the validation perplexity deteriorates.

We therefore evaluate the benefit of our approach on a corpus that is 30 times larger, extracted from the public Common Crawl. The training set is composed of 28 billion words (140 GB of data) extracted from about 40 million English news articles indexed by Common Crawl corpora. The validation and test sets are both composed of 5000 news articles removed from the training set. Unlike in the One Billion Word corpus, we did not shuffle sentences, allowing the model to learn long range dependencies. On this dataset, we did not observe any overfitting, and increasing the model capacity systematically led to a better performance on the validation set. We tokenized the data using the tokenizer provided by the Moses toolkit [26]. 
To reduce the vocabulary size, we use fastBPE5 to apply Byte Pair Encoding (BPE) [39], with 60k BPE splits.

4.2 Evaluation metrics

We measure the performance of our models by reporting the perplexity on the test set. For models with memories, we report two different metrics to evaluate the usage:

• The memory usage, i.e. the fraction of accessed values: #{i : zi ≠ 0} / |K|
• The KL divergence between z and the uniform distribution: log(|K|) + Σi zi log(zi)

where z = z′/‖z′‖1, and z′ ∈ R|K| is defined as z′i = Σx w(x)i, where w(x) represents the weights of the keys accessed in the memory when the network is fed with an input x from the test set (i.e., the w(x) are sparse with at most H × k non-zero elements).

At test time, we expect the model to access as many keys as possible, i.e. to have a usage near 100%; a lower usage means that part of the capacity is not exploited at all. The KL divergence reflects imbalance in the access patterns to the memory: if the model attended the same key for every query (while giving a tiny weight to the remaining keys), it would give a perfect usage but a very high KL, showing that the same performance could be achieved with just one value.

4.3 Training details

We use a transformer architecture with 16 attention heads and learned positional embeddings. We consider models with 12, 16 or 24 layers, with either 1024 or 1600 dimensions. We train our models with the Adam optimizer [25], with a learning rate of 2.5 × 10⁻⁴, with β1 = 0.9, β2 = 0.98, following the learning rate schedule of Vaswani et al. [44]. In the memory, the keys and the query network are learned with the same optimizer and learning rate as the rest of the network. Since the memory values are learned with sparse updates, we found it beneficial to learn them with a higher Adam learning rate of 10⁻³. 
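The usage and KL metrics defined in Section 4.2 can be illustrated with a small NumPy sketch. The access weights `w` below are toy values (one row per test input, one column per memory slot), not the paper's evaluation code:

```python
import numpy as np

K = 8  # number of memory slots (toy example)
# Hypothetical sparse access weights w(x): 3 test inputs, K slots each.
w = np.array([
    [0.7, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])

z_prime = w.sum(axis=0)                   # z'_i = sum_x w(x)_i
usage = np.count_nonzero(z_prime) / K     # fraction of accessed values
z = z_prime / np.abs(z_prime).sum()       # z = z' / ||z'||_1
nz = z[z > 0]
kl = np.log(K) + np.sum(nz * np.log(nz))  # KL(z || uniform)

print(f"usage = {usage:.3f}, KL = {kl:.3f}")
```

Here only 3 of the 8 slots are ever touched, so the usage is 0.375, and the concentration of mass on slot 1 yields a positive KL; spreading the same total weight uniformly over all slots would give usage 1.0 and KL 0.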
We implement our models with PyTorch [35], and train them on 32 Volta GPUs. We use float16 operations to speed up training and to reduce the GPU memory usage of our models. To retrieve key indices efficiently, we perform the search over sub-keys with the fast nearest neighbors implementation of Johnson et al. [22].

For a transformer model with L layers and N memories, we intersperse the memories at regular intervals. For instance, for L = 16 and N = 2, we replace the FFN of layers 6 and 12. This way, the network can leverage information at different levels of the architecture. The impact of the memory position within the network is studied in Section 4.5. In our main experiments, we use H = 4 memory heads, we select k = 32 keys per head, and use |K| = 512² memory slots.

4.4 Results

Table 1: Test perplexity for models with and without memory. PKM models with 12 layers outperform 24-layer models of the same dimensionality. Bold refers to models optimizing performance for a given dimension.

Dimension  |          1024           |    1600
N memories | 0     1     2     3     | 0     1
12 layers  | 17.7  15.6  14.8  14.5  | 15.0  13.7
16 layers  | 16.7  14.9  14.1  -     | 14.4  13.2
24 layers  | 16.0  14.6  -     -     | 14.0  -

Table 1 and Figure 4 show the perplexity of different models on the test set of the CC-News corpus. We observe that increasing either the dimensionality or the number of layers leads to significant perplexity improvements in all the models. However, adding a memory to the model is more beneficial than increasing the number of layers; for instance, a model with a single memory and 12 layers outperforms a memoryless model with the same hidden dimension and 24 layers, both when the number of hidden units is 1024 and 1600. 
Adding 2 or 3 memory layers further improves performance.

Figure 4 also shows speed, as measured in words per second, for different model configurations. In particular, when the internal hidden states have 1024 dimensions, a model with 12 layers and a memory obtains a better perplexity than a model with 24 layers (same configuration as BERT large), and it is almost twice as fast. When adding memory to large models that have internal dimensionality equal to 1600, inference time barely increases.

5https://github.com/glample/fastBPE

Figure 4: Trade-off between speed and perplexity on the test set. Labels on the graph represent the number of layers. Adding memory layers significantly improves the performance and has a negligible impact on the inference speed. Models with 12 layers and a Product Key Memory (PKM) outperform 24-layer models of the same dimension, while being almost twice as fast at inference. In particular, a 12-layer model of dimension 1024 with a memory outperforms a model of 24 layers of the same dimension (same configuration as BERT large).

4.5 Ablation Study

In this section we study the impact of the different components of the memory layer, and measure how they affect the model performance and the memory usage. For all experiments, we consider a transformer network with 6 layers and 8 heads. Unless specified otherwise, we consider a memory of 512² = 262k slots, with 4 memory heads, k = 32 selected keys, and we insert it at layer 5.

Memory size. We train transformer models with memories of size |K| = |C| × |C′|, with |C′| = |C| and |C| ∈ {128, 256, 384, 512, 768, 1024}. Table 2 shows that test perplexity decreases as the memory becomes larger. A model with a memory size of 16k obtains a perplexity of 22.8. 
Increasing the size to 1M decreases the perplexity down to 18.0, while leaving the inference time unchanged. The dominant factor for inference time is the number of accessed memory values, which is governed by the number of memory heads and the parameter k, but not by the memory size.

Query Batch Normalization. Table 2 and Figure 5 present results with and without batch normalization in the query network. We observe that for small memories the usage is always close to 100%, but for a memory of size 1M, the batch normalization layer improves usage from 25.8% to 80.3%, with a consequent perplexity decrease from 19.8 down to 18.0. For comparison, a model without memory obtains a perplexity of 23.0, which is on par with a memory of size 16k.

Finally, we observe a correlation between the number of used keys and the model performance. In particular, a model with a memory of size 1M that does not use batch normalization uses about 25.8% of the memory values (i.e. roughly 250k values), and obtains a perplexity of 19.8, which is on par with the model using a memory of size 262k with batch normalization, which has a nearly optimal memory usage of 100%.

Table 2: Perplexity and memory usage for different memory sizes, with and without BatchNorm. Adding a batch normalization layer in the query network encourages the model to use more keys. This is not necessary for small memories of size 16k and 65k, where the usage is already close to 100% without batch normalization, but for memories of size 147k or more, batch normalization improves the memory usage significantly, along with the perplexity.

Memory size | 16k        | 65k         | 147k       | 262k       | 590k       | 1M
BatchNorm   | No    Yes  | No    Yes   | No    Yes  | No    Yes  | No    Yes  | No    Yes
Perplexity  | 22.8  23.0 | 21.7  21.9  | 20.9  20.7 | 20.5  19.8 | 20.0  18.7 | 19.8  18.0
Usage (%)   | 100   100  | 99.0  100.0 | 83.8  99.6 | 64.4  97.9 | 38.0  90.3 | 25.8  80.3
KL          | 0.56  0.56 | 0.69  0.58  | 0.94  0.65 | 1.20  0.68 | 1.70  0.83 | 2.06  0.95

Figure 5: Memory usage and perplexity with and without query batch normalization. Adding batch normalization increases both performance and the fraction of used memory slots.

Figure 6: Memory usage and perplexity for different numbers of heads and of k-NN. Increasing the number of heads or k-NN increases both performance and the fraction of used memory slots.

Memory position. In this experiment we insert the memory at different levels in the transformer, to see where it is the most beneficial. In Table 3 we observe that the model benefits the most from the memory when it replaces the FFN of layer 4 or 5 in the transformer. Putting the memory at layer 1 (after the input token embeddings) gives the worst performance. When the memory is inserted at layer 6, it is located right before the softmax output, and the model has only one linear layer to process the information read from the memory. The best position to insert the memory is at an intermediate layer. We surmise that effective use of the memory requires operating in a more abstract feature space than the input, and that it is important to have some layers on top of the memory to further process and aggregate information from every location.

Number of heads / k-NN. Figure 6 shows that increasing the number of heads or the number of k-NN improves both the perplexity of the model and the memory usage. We also note that models with identical h × k (h being the number of heads and k the number of nearest neighbors) have a similar memory usage, i.e. models with (h, k) ∈ {(1, 64), (2, 32), (4, 16), (8, 8)} all have a memory usage around 70%, and a perplexity around 20.5. 
Adding more heads overall improves the performance, but also increases the computation time. Overall, we found that using 4 heads and 32 k-NN strikes a good trade-off between speed and performance.

Table 3: Perplexity and memory usage for different memory positions in a transformer with 6 layers. Adding a memory in position 4 or 5 maximizes the performance (layer 1 is the worst).

Position   | 1     | 2     | 3    | 4    | 5    | 6
Perplexity | 21.5  | 20.7  | 20.4 | 20.1 | 19.8 | 20.3
Usage (%)  | 100.0 | 100.0 | 98.3 | 97.1 | 97.9 | 96.9
KL         | 2.23  | 0.95  | 0.74 | 0.71 | 0.68 | 1.08

Table 4: Perplexity, memory usage and inference speed with product keys and regular keys. Models with product keys have a much better usage than models that represent keys by a flat matrix, and obtain a better perplexity. They also have significantly fewer parameters and are dramatically faster to run. The speed is measured at inference, in thousands of words per second (w/s). For flat-key models with 262k memory slots or more, we only report the inference time. We observe that with product keys, the memory size does not impact the inference time.

Memory size  | 16k          | 65k          | 147k         | 262k        | 590k        | 1M
Product keys | No     Yes   | No     Yes   | No     Yes   | No    Yes   | No    Yes   | No    Yes
Perplexity   | 23.2   23.0  | 22.6   21.9  | 22.1   20.7  | -     19.8  | -     18.7  | -     18.0
Usage (%)    | 19.6   100   | 13.6   100.0 | 10.1   99.6  | -     97.9  | -     90.3  | -     80.3
KL           | 2.04   0.56  | 2.48   0.58  | 2.77   0.65  | -     0.68  | -     0.83  | -     0.95
Speed (w/s)  | 35.0k  35.8k | 28.5k  36.7k | 13.9k  36.4k | 7.7k  36.3k | 4.7k  36.2k | 1.2k  35.7k

Product keys vs. flat keys. 
Product keys presented in Figure 2 enable finding the nearest neighbors in a matrix of size (|C|^2, d_k) with the same time/compute complexity as a search over two matrices of size (|C|, d_k/2). As a result, product keys contain |C| times fewer parameters than keys represented by a full matrix. Table 4 and Figure 7 compare product keys to the default regular flat keys. In the second case, searching for the nearest keys boils down to a linear index search at each iteration, which is computationally very expensive. As a result, we only report results for memories of size 16k, 65k and 147k, as experiments with a flat index on larger memories take an unreasonable amount of time to converge. We can see that models with product keys are not only faster, but also have a much better memory usage, and consequently obtain a better perplexity.

Figure 7: Speed over memory size. Speed (in thousands of words per second) for different memory sizes. For regular flat keys, increasing the number of keys significantly slows down the model, while with product keys, increasing the memory size barely impacts the inference speed.

5 Conclusion

This paper introduces a memory layer that drastically increases the capacity of a neural network with a negligible computational overhead. The efficiency of our layer relies on two key ingredients: the factorization of keys as a product set, and the sparse read/write accesses to the memory values. Our layer is integrated into an existing neural network architecture. We show experimentally that it provides important gains on large-scale language modeling, reaching with 12 layers the performance of a 24-layer BERT-large model with half the running time.

References

[1] Artem Babenko and Victor Lempitsky.
The inverted multi-index. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

[2] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In International Conference on Representation Learning, 2019.

[3] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.

[4] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. Conference of the International Speech Communication Association, 2014.

[5] Bo Chen and Jeffrey M. Gilbert. The on-device visual intelligence challenge. https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html, 2019. Accessed: 2019-05-20.

[6] Kyunghyun Cho and Yoshua Bengio. Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning. CoRR, abs/1406.7362, 2014.

[7] Matthieu Courbariaux and Yoshua Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, 2016.

[8] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. Advances in Neural Information Processing Systems, 2015.

[9] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Conference of the Association for Computational Linguistics, 2019.

[10] Ludovic Denoyer and Patrick Gallinari. Deep sequential neural network. CoRR, abs/1410.0510, 2014.

[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics, 2018.

[12] D. Eigen, I. Sutskever, and M. Ranzato. Learning factored representations in a deep mixture of experts. In Workshop at the International Conference on Learning Representations, 2014.

[13] Thomas Gerald, Nicolas Baskiotis, and Ludovic Denoyer. Binary stochastic representations for large multi-class classification. In International Conference on Neural Information Processing, 2017.

[14] Edouard Grave, Moustapha M Cisse, and Armand Joulin. Unbounded cache model for online language modeling with open vocabulary. In Advances in Neural Information Processing Systems, 2017.

[15] Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. In International Conference on Representation Learning, 2017.

[16] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In International Conference on Acoustics, Speech, and Signal Processing, 2013.

[17] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.

[18] Sam Gross, Marc'Aurelio Ranzato, and Arthur Szlam. Hard mixtures of experts for large scale weakly supervised vision. In Conference on Computer Vision and Pattern Recognition, 2017.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, 2016.

[20] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism. CoRR, abs/1811.06965, 2018.

[21] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.

[22] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2017.

[23] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems, 2015.

[24] Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. CoRR, abs/1010.3467, 2010. URL http://arxiv.org/abs/1010.3467.

[25] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Representation Learning, 2015.

[26] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Conference of the Association for Computational Linguistics, 2007.

[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[28] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, 2019.

[29] Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In European Conference on Computer Vision, 2018.

[30] Alireza Makhzani and Brendan Frey. K-sparse autoencoders. In International Conference on Representation Learning, 2014.

[31] Alireza Makhzani and Brendan J Frey. Winner-take-all autoencoders. In Advances in Neural Information Processing Systems, 2015.

[32] Marius Muja and David G. Lowe.
Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

[33] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. In International Conference on Representation Learning, 2019.

[34] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set, a strategy employed by V1? Vision Research, 1997.

[35] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.

[36] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019.

[37] Jack Rae, Jonathan J Hunt, Ivo Danihelka, Timothy Harley, Andrew W Senior, Gregory Wayne, Alex Graves, and Timothy Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, 2016.

[38] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, 2016.

[39] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Conference of the Association for Computational Linguistics, 2015.

[40] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Representation Learning, 2017.

[41] Stefano Spigler, Mario Geiger, Stéphane d'Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart.
A jamming transition from under- to over-parametrization affects loss landscape and generalization. CoRR, abs/1810.09665, 2018.

[42] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In Advances in Neural Information Processing Systems, 2015.

[43] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 2014.

[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

[45] Sudheendra Vijayanarasimhan, Jonathon Shlens, Rajat Monga, and Jay Yagnik. Deep networks with large output spaces. In Workshop at the International Conference on Learning Representations, 2015.

[46] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Representation Learning, 2018.

[47] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In International Conference on Representation Learning, 2015.

[48] Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks.
In International Conference on Representation Learning, 2016.