{"title": "Neural Shuffle-Exchange Networks - Sequence Processing in O(n log n) Time", "book": "Advances in Neural Information Processing Systems", "page_first": 6630, "page_last": 6641, "abstract": "A key requirement in sequence to sequence processing is the modeling of long range dependencies. To this end, a vast majority of the state-of-the-art models use an attention mechanism, which has O(n^2) complexity and leads to slow execution on long sequences. \n\nWe introduce a new Shuffle-Exchange neural network model for sequence to sequence tasks that has O(log n) depth and O(n log n) total complexity. We show that this model is powerful enough to infer efficient algorithms for common algorithmic benchmarks including sorting, addition and multiplication. We evaluate our architecture on the challenging LAMBADA question answering dataset and compare it with the state-of-the-art models which use attention. Our model achieves competitive accuracy and scales to sequences of more than a hundred thousand elements.\n\nWe are confident that the proposed model has the potential for building more efficient architectures for processing large interrelated data in language modeling, music generation and other application domains.", "full_text": "Neural Shuffle-Exchange Networks - Sequence Processing in O(n log n) Time\n\nKārlis Freivalds, Emīls Ozoliņš, Agris Šostaks\nInstitute of Mathematics and Computer Science\nUniversity of Latvia\nRaina bulvaris 29, Riga, LV-1459, Latvia\n{Karlis.Freivalds, Emils.Ozolins, Agris.Sostaks}@lumii.lv\n\nAbstract\n\nA key requirement in sequence to sequence processing is the modeling of long range dependencies.
To this end, a vast majority of the state-of-the-art models use an attention mechanism, which has O(n^2) complexity and leads to slow execution on long sequences.\nWe introduce a new Shuffle-Exchange neural network model for sequence to sequence tasks that has O(log n) depth and O(n log n) total complexity. We show that this model is powerful enough to infer efficient algorithms for common algorithmic benchmarks including sorting, addition and multiplication. We evaluate our architecture on the challenging LAMBADA question answering dataset and compare it with the state-of-the-art models which use attention. Our model achieves competitive accuracy and scales to sequences of more than a hundred thousand elements.\nWe are confident that the proposed model has the potential for building more efficient architectures for processing large interrelated data in language modeling, music generation and other application domains.\n\n1 Introduction\n\nA key requirement in sequence to sequence processing is the modeling of long range dependencies. Such dependencies occur in natural language when the meaning of some word depends on other words in the same or some previous sentence. There are important cases, e.g., to resolve coreferences, when such distant information may not be disregarded. A similar phenomenon occurs in music, where a common motif may reappear throughout the entire piece and should be kept coherent under any applied transformation (Huang et al., 2019). Dealing with long range dependencies requires processing very long sequences (several pages of text or an entire musical composition) in a manner that aggregates information from their distant parts.\nAggregation of distant information is even more important for algorithmic tasks, where each output symbol typically depends on every input symbol.
The goal of algorithm synthesis is to derive an algorithm from given input-output examples, which are often given as sequences. Algorithmic tasks are especially challenging due to the need for processing sequences of unlimited length. Also, generalization plays an important role since training is often performed on short sequences but testing on long ones.\nCurrently, the best neural network architectures do not scale well with the sequence length. A large fraction of them uses the attention mechanism, which has quadratic complexity in the sequence length. These models can be easily trained on length 512 or so but become very slow and memory hungry on longer sequences.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOn sequential algorithmic tasks, the best architecture with respect to the variety of learnable tasks and generalization to longer sequences is the improved (Freivalds and Liepins, 2018) Neural GPU (Kaiser and Sutskever, 2015). It has O(n) convolutional layers, where each layer performs O(n) operations, with n being the input length. This architecture can represent algorithms of running time Θ(n^2), but to learn faster algorithms, for example of complexity O(n log n), a fundamentally new approach is needed.\nIn this paper, we propose a new differentiable architecture for sequence processing tasks that has depth O(log n) and allows modeling of any dependencies in the sequence. This architecture is derived from the Shuffle-Exchange network used for packet routing in the field of computer networks (Dally and Towles, 2004).\nWe empirically validate our model on algorithmic tasks and the LAMBADA question answering task (Paperno et al., 2016). Our model is able to synthesize O(n log n) time algorithms for common benchmark tasks such as copy, reversal, sorting and long binary addition, which generalize to longer sequences.
On the LAMBADA task, our model scores second best in terms of accuracy, losing only to the Universal Transformer (Dehghani et al., 2018), but is significantly faster and able to process 32x longer sequences than the Universal Transformer.\n\n2 Related Work\n\nCommon tools for sequence processing tasks are recurrent networks, in particular, LSTM (Hochreiter and Schmidhuber, 1997) and GRU networks (Cho et al., 2014). They can efficiently process sequences of any length and have the ability to remember arbitrarily long dependencies. But they process symbols one by one and have limited state memory, hence can remember only a limited number of such dependencies. They are successful at natural language processing but too weak for nontrivial algorithmic tasks. Grid LSTM (Kalchbrenner et al., 2015) allows creating a multilayered recurrent structure that is more powerful and able to learn more complex tasks, such as addition and memorization, at the expense of increased running time.\nConvolutional architectures can be used for sequence processing tasks (Gehring et al., 2017). But convolutions are inherently local: the value of a particular neuron depends on a small neighborhood of the previous layer, so it is common to augment the network with the attention mechanism. Attention allows combining information from any two locations of the sequence in a single step. The attention mechanism has become a standard choice in numerous neural models, including Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2018), which achieve state-of-the-art accuracy in NLP and related tasks. Although efficient for moderately long sequences, the complexity of the attention mechanism is quadratic in the input length and does not scale to longer sequences.\nResearchers have recognized the need for processing long sequences and are searching for ways to overcome the complexity of attention.
An obvious way is to cut the sequence into short segments and use attention only within the segment boundaries (Al-Rfou et al., 2018). To, at least partially, recover the lost information, recurrent connections can be added between the segments (Dai et al., 2019). Child et al. (2019) reduce the complexity of attention to O(n√n) by attending only to a small predetermined subset of locations. Star-Transformer (Guo et al., 2019) sparsifies attention even more by pushing all the long range dependency information through one central node and reaches linear time performance. Clark and Gardner (2017) enable document level question answering by preselecting the paragraph most likely to contain the answer.\nA different way to capture long range structure is to increase the receptive field of convolution by using dilated (atrous) convolution, where the convolution mask is spread out at regular spatial intervals. Dilated architectures have achieved great success in image segmentation (Yu and Koltun, 2015) and audio generation (van den Oord et al., 2016) but are hard to apply to algorithmic tasks that require generalization to longer sequences. That would require layers with shared weights but different dilation patterns, a setting that is not yet explored. Also, it is important to avoid congestion (which may arise, for example, in tree-like structures) when a lot of long range information is forced to travel through a few nodes. Both these problems are elegantly solved by the proposed architecture.\nThe design of memory access is crucial for learning algorithmic tasks; see (Kant, 2018) for a good overview. To access memory, an attention mechanism over the input sequence is used in Pointer Networks (Vinyals et al., 2015).
Specialized memory modules, which are controlled by a mechanism similar to attention, are used by the Neural Turing Machine (Graves et al., 2014) and the Differentiable Neural Computer (Graves et al., 2016).\n\nFigure 1: Shuffle-Exchange network.\n\nFigure 2: Beneš network.\n\nNeural GPU (Kaiser and Sutskever, 2015) utilizes active memory (Kaiser and Bengio, 2016), where computation is coupled with memory access. This architecture is simple and fast and can learn fairly complicated algorithms such as long number addition and multiplication. The computation and memory coupling introduces a limitation: for the information to travel from one end of the sequence to the other, Ω(n) layers are required, which results in Ω(n^2) total complexity. The flow of information is facilitated by introducing diagonal gates in (Freivalds and Liepins, 2018), which improves training and generalization but does not address the performance problem caused by many layers.\nThe goal of inferring algorithms with running time O(n log n) is pursued in (Nowak et al., 2018), which uses the divide and conquer paradigm with learnable split and merge steps. A hierarchical memory layout with logarithmic access time is introduced in Andrychowicz and Kurach (2016); however, the model is discrete and has to be trained with reinforcement learning to achieve the claimed performance. Neural programmer-interpreters (Reed and de Freitas, 2016) can learn very efficient algorithms but often require program traces during training. Neural Random-Access Machines (Kurach et al., 2016) introduce a memory addressing scheme potentially allowing constant time access facilitated by discretization.
Discrete models are hard to train; our model, on the contrary, is continuous and differentiable and yet allows synthesizing O(n log n) time algorithms.\n\n3 Shuffle-Exchange Networks\n\nRouting messages from many sources to many destinations is a well-explored topic in the area of computer networks, where several kinds of sparse architectures have been developed to connect two sets of devices. The Shuffle-Exchange network1 has a regular layered structure and serves best as a prototype for a neural network. A Shuffle-Exchange network consists of the repeated application of two stages - shuffle and exchange. Fig. 1 shows a Shuffle-Exchange network for routing 8 messages. Messages arrive on the left side, flow through the network layers and arrive at the rightmost nodes.\nWe consider Shuffle-Exchange networks with 2^k inputs and pad the input data to the nearest power of two. First comes the exchange stage, where elements are divided into adjacent pairs and each pair is passed through a switch. The switch contains logic to select which input is routed to which output. The shuffle stage follows (depicted as arrows in the figures), where messages are permuted according to the perfect-shuffle permutation. The perfect shuffle is often employed to shuffle a deck of cards by splitting the deck into two halves and then interleaving the halves. In this permutation, the destination address is a cyclic bit shift (left or right) of the source address. The network for routing 2^k messages contains k exchange stages and k - 1 shuffle stages. It is proven that the switches can always be programmed in a way that connects any source to any destination through the network (Dally and Towles, 2004).\nBut the throughput of the Shuffle-Exchange network is limited - it may not be possible to route several messages simultaneously. A better design for multiple message routing is the Beneš network (see Fig. 2).
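As a small illustration of the routing pattern (our own sketch, not code from the paper), the perfect-shuffle stage can be written as a one-bit cyclic rotation of each source index; the function names below are ours:

```python
def rotate_left(x, k):
    """Cyclically rotate the k-bit integer x left by one position."""
    return ((x << 1) | (x >> (k - 1))) & ((1 << k) - 1)

def perfect_shuffle(seq):
    """Apply the perfect-shuffle permutation to a sequence of length 2**k:
    the destination address is a cyclic bit shift of the source address."""
    k = (len(seq) - 1).bit_length()
    assert len(seq) == 1 << k, "length must be a power of two"
    out = [None] * len(seq)
    for src in range(len(seq)):
        out[rotate_left(src, k)] = seq[src]
    return out

# Interleaving the two halves of a deck (a riffle shuffle) gives
# exactly the same permutation:
deck = list(range(8))
half = len(deck) // 2
riffle = [deck[i // 2 + half * (i % 2)] for i in range(len(deck))]
assert perfect_shuffle(deck) == riffle
```

Applying the shuffle k times rotates each k-bit address all the way around, restoring the original order, which matches the cyclic-bit-shift description above.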
The Beneš network is formed by connecting a Shuffle-Exchange network with its mirror copy2. The mirror copy is obtained by reversing the direction of the bit shift in the destination address calculation. The Beneš network has 2k - 1 exchange stages and 2k - 2 shuffle stages. Such a network can route 2^k messages in any input-to-output permutation (Dally and Towles, 2004).\n\n1Also called the Omega network.\n2In the literature, the Butterfly network is typically used instead, which is isomorphic to the Shuffle-Exchange network.\n\n4 The Model\n\nWe propose a neural network analogue of the Beneš network where we replace each switch with a learnable 2-to-2 function. The input to the neural network is a sequence of cells of length 2^k, where each cell is a vector of size m. The network consists of alternating Switch and Shuffle layers. The Switch Layer corresponds to a column of boxes in Fig. 1 and Fig. 2, and the Shuffle Layer corresponds to the links between the box columns.\nIn the Switch Layer, we divide the cells into adjacent non-overlapping pairs and apply the Switch Unit to each pair3. The Switch Unit is similar to the Gated Recurrent Unit (GRU), but it has two inputs [s1, s2] and two outputs [s1_o, s2_o]. It contains two reset gates, one for each output. The reset gate performs the computing logic of the unit (σ and tanh nonlinearities just keep the values in range), and it is important that each output uses a separate reset gate for the unit to produce unrelated outputs. Technically, creating the pairs is implemented as reshaping the sequence s into a twice shorter sequence where each new cell concatenates two adjacent cells [s1, s2] along the feature dimension.
The Switch Unit is defined as follows:\n\ns = [s1, s2]\nr1 = σ(W1_r s + B1_r)\nr2 = σ(W2_r s + B2_r)\nc1 = tanh(W1_c (r1 ⊙ s) + B1_c)\nc2 = tanh(W2_c (r2 ⊙ s) + B2_c)\nu = σ(W_u s + B_u)\ns~ = swapHalf(s1, s2)\n[s1_o, s2_o] = u ⊙ s~ + (1 - u) ⊙ [c1, c2]\n\nIn the above equations, the W_r and W_u are weight matrices of size 2m × 2m, the W_c are weight matrices of size 2m × m, and the B are bias vectors; these are the parameters that will be learned; ⊙ denotes element-wise vector multiplication and σ is the sigmoid function.\nThe function swapHalf splits the values of s1 and s2 into two halves along the feature dimension and swaps their second halves:\n\nswapHalf([a; b], [c; d]) = ([a; d], [c; b])\n\nThe motivation for swapHalf is to encourage the unit to perform one of its two default actions - returning the two inputs unchanged or swapping them. The update gate u, which is borrowed from the GRU, is responsible for this. For the GRU, there is only one default action, and its update gate performs a straight-through copy. To facilitate both actions of the Switch Unit, we perform the straight-through copy in the first half of the feature maps and the swapped copy in the second half. Such a fixed assignment to maps works better than introducing another gate for action selection; see the ablation study below. A similar fixed assignment was found beneficial in (Freivalds and Liepins, 2018) for introducing diagonal update gates.\nThe Shuffle Layer permutes the cells according to the bit rotation permutation, namely s[x] = s[rotate(x, k)], where rotate(x, k) performs a cyclic bit shift of x by one position, with x treated as a binary number of length k.
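The Switch Unit equations above can be sketched in NumPy for a single pair of cells. The weight shapes follow the text, but the parameter names and random initialization here are our own illustration, not the paper's TensorFlow code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swap_half(s1, s2):
    """swapHalf: exchange the second halves of two m-dimensional cells."""
    h = len(s1) // 2
    return (np.concatenate([s1[:h], s2[h:]]),
            np.concatenate([s2[:h], s1[h:]]))

def init_params(m, rng):
    # Illustrative random initialization; shapes follow the text:
    # W_r and W_u map 2m -> 2m, the W_c map 2m -> m.
    z = lambda *shape: rng.normal(scale=0.1, size=shape)
    return {"W1r": z(2 * m, 2 * m), "B1r": np.zeros(2 * m),
            "W2r": z(2 * m, 2 * m), "B2r": np.zeros(2 * m),
            "W1c": z(m, 2 * m), "B1c": np.zeros(m),
            "W2c": z(m, 2 * m), "B2c": np.zeros(m),
            "Wu": z(2 * m, 2 * m), "Bu": np.zeros(2 * m)}

def switch_unit(s1, s2, p):
    """GRU-like 2-to-2 switch: two reset gates, one update gate,
    with the default action given by swapHalf."""
    s = np.concatenate([s1, s2])                  # (2m,)
    r1 = sigmoid(p["W1r"] @ s + p["B1r"])         # reset gate for output 1
    r2 = sigmoid(p["W2r"] @ s + p["B2r"])         # reset gate for output 2
    c1 = np.tanh(p["W1c"] @ (r1 * s) + p["B1c"])  # candidate for output 1, (m,)
    c2 = np.tanh(p["W2c"] @ (r2 * s) + p["B2c"])  # candidate for output 2, (m,)
    u = sigmoid(p["Wu"] @ s + p["Bu"])            # update gate, (2m,)
    st = np.concatenate(swap_half(s1, s2))        # default copy/swap action
    out = u * st + (1.0 - u) * np.concatenate([c1, c2])
    m = len(s1)
    return out[:m], out[m:]
```

When the update gate saturates (u near 1), the unit reduces to its default copy-in-the-first-half, swap-in-the-second-half behaviour, which is the motivation for swapHalf given in the text.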
Left rotation is used in the first part of the Beneš network and right rotation in the second (the difference is insignificant if we apply the rotations the other way around). The Shuffle Layer has no learnable parameters. The whole network is organized by connecting these two kinds of layers in the pattern of the Beneš network. A deeper architecture can be obtained by stacking several blocks, where each is in the form of the Beneš network. In such a case, the last switch layer of every Beneš block except the final one is omitted. We typically use two stacked Beneš blocks in our models, except for the simple algorithmic tasks, which use one block. Please see Fig. 3 for the whole model with two Beneš blocks on input length 16, k = 4. We employ residual skip connections between blocks where a scaled input of each Switch Layer is added to the input of the corresponding layer in the next Beneš block. The scaling factor is a learnable parameter resembling the forget gate of LSTM.\n\n3Identical transformation is applied to all pairs of the sequence.\n\nFigure 3: Neural Shuffle-Exchange network with 2 Beneš blocks.\n\nWe use shared weights for every consecutive k - 1 layers (shown with colors in Fig. 3). The last layer of the last Beneš block has non-shared weights. Weight sharing is required to obtain generalization to longer sequences, but we use it always, also for tasks not requiring generalization. This way we reduce the number of learnable parameters without an observable decline in accuracy. See more analysis in the Appendix.\n\n5 Evaluation\n\nIn this section, we evaluate the proposed architecture on algorithmic tasks and the LAMBADA question answering task (Paperno et al., 2016). We have implemented the proposed architecture in TensorFlow.
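Under our reading of the layer pattern described above, a single Beneš block is just alternating switch and shuffle layers, with left rotations in the first half and right rotations in the mirrored half (2k - 1 switch layers in total). A simplified sketch with our own naming, abstracting the learned Switch Unit as a callable:

```python
def rotate(x, k, left=True):
    """Cyclic one-bit rotation of a k-bit index, in either direction."""
    mask = (1 << k) - 1
    if left:
        return ((x << 1) | (x >> (k - 1))) & mask
    return (x >> 1) | ((x & 1) << (k - 1))

def shuffle_layer(cells, k, left=True):
    """Permute cells by the bit rotation permutation (no parameters)."""
    out = [None] * len(cells)
    for src, cell in enumerate(cells):
        out[rotate(src, k, left)] = cell
    return out

def switch_layer(cells, unit):
    """Apply the same 2-to-2 unit to every adjacent pair of cells."""
    out = []
    for i in range(0, len(cells), 2):
        a, b = unit(cells[i], cells[i + 1])
        out.extend([a, b])
    return out

def benes_block(cells, unit):
    """Forward pass of one Benes block over 2**k cells:
    k-1 (switch, left-shuffle) stages, then k-1 mirrored stages
    with right shuffles, then a final switch layer."""
    k = (len(cells) - 1).bit_length()
    for _ in range(k - 1):
        cells = shuffle_layer(switch_layer(cells, unit), k, left=True)
    for _ in range(k - 1):
        cells = shuffle_layer(switch_layer(cells, unit), k, left=False)
    return switch_layer(cells, unit)
```

A sanity check of the mirror structure: with an identity switch, the left rotations of the first half are exactly undone by the right rotations of the second half, so the block returns its input unchanged.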
The code is available at https://github.com/LUMII-Syslab/shuffle-exchange.\nThe neural network model for evaluation consists of an embedding layer, where each symbol of the input is mapped to a vector of length m, one or more Beneš blocks, and an output layer which performs a linear transformation to the required number of classes with a softmax cross-entropy loss for each symbol independently (except for the LAMBADA task, which will be described later). All models are trained on a single Nvidia RTX 2080 Ti (11GB) GPU with the Adam optimizer (Kingma and Ba, 2014).\n\n5.1 Algorithmic tasks\n\nLet us evaluate the Shuffle-Exchange network to see if it can infer O(n log n) time algorithms purely from input-output examples. We consider sequence duplication, reversal, long binary addition, long binary multiplication, and sorting. These tasks are common benchmarks in several papers including (Kalchbrenner et al., 2015; Zaremba and Sutskever, 2015; Zaremba et al., 2016; Joulin and Mikolov, 2015; Grefenstette et al., 2015; Kaiser and Sutskever, 2015; Freivalds and Liepins, 2018; Dehghani et al., 2018). There are known O(n log n) time algorithms for these tasks (Brent and Kung, 1982; Ajtai et al., 1983; Seiferas, 2009; Harvey and Van Der Hoeven, 2019), although for sorting and multiplication all known O(n log n) time algorithms have huge constants hidden by the O notation and are not realistically usable in practice.\nThe proposed architecture performs very well on the simple tasks (all except multiplication) - it achieves 100% test accuracy on examples of the length it was trained on, so the main focus is to explore how well these tasks generalize to longer inputs.
Multiplication is a considerably harder task than the rest; of the existing architectures, only the Neural GPU (Kaiser and Sutskever, 2015; Freivalds and Liepins, 2018) is able to learn it purely from input-output examples. The Shuffle-Exchange architecture can learn multiplication up to a certain input length (depending on the model size), but the solution does not generalize to longer examples. Since it is the hardest task, we analyze it separately and use it for tuning our model and the ablation study.\nWe use the dataset generators and curriculum learning from (Freivalds and Liepins, 2018). For training, we instantiate several models for different sequence lengths (powers of 2) sharing the same weights and train each example on the smallest instance it fits.\nLet us consider the simple tasks first. We train them on inputs of length 64 and test on 8x longer instances. A small model comprising one Beneš block and 192 feature maps suffices for these tasks. For the duplication, reversal and sorting tasks we use a fixed alphabet of symbols in the range 1-12. Fig. 4 shows the generalization of the simple tasks to sequences of length 512 vs. the training step. The graphs show the accuracy on the test set averaged over 5 training runs. We see that the duplication and reversal tasks converge in a few hundred steps and generalize perfectly; the addition task quickly reaches 90% accuracy (we measure the fraction of correctly predicted output symbols) and then converges to 98% in about 10k steps. The sorting task reaches about 95% generalization accuracy.\nIt is interesting that this architecture (of depth 2 log n) is able to learn a sorting algorithm that generalizes. Although the best known sorting circuits have depth O(log n) (Ajtai et al., 1983; Seiferas, 2009), they are huge, with an estimated depth of over 100 log n. Simple sorting circuits have depth Θ(log^2 n) (Knuth, 1973).
Apparently, the inferred algorithm takes advantage of the discrete nature and limited range of the alphabet, whereas the mentioned sorting circuits work in a comparison based model that is applicable to any numeric values. Indeed, when we try increasing the alphabet size, training becomes slower and at some point a larger model is required.\nThe multiplication task can be trained up to quite a large length (longer examples require a larger model and longer training); the solution generalizes to other examples of the same length but not to longer ones. We chose to demonstrate the multiplication task on sequences of length 128, for which a model can be trained in a reasonably short time. A model with 384 feature maps reaches 100% accuracy on the test set in about 50K steps, see Fig. 10.\nThe generalization to longer sequences of the Shuffle-Exchange network is shown in Fig. 6. We compare it to the optimized implementation of the Neural GPU with diagonal gates (DNGPU) by Freivalds and Liepins (2018), which is currently the best in this respect. The Shuffle-Exchange network provides a similar generalization on the duplication and reversal tasks but performs better on the addition and sorting tasks. Both models were trained for 40k steps on lengths up to 64 symbols. A possible explanation of why the multiplication task does not generalize is that there may not exist a simple O(n log n) time algorithm for it.
The only currently known asymptotically optimal algorithm (Harvey and Van Der Hoeven, 2019) is galactic, but practical algorithms, for example those based on the Fast Fourier Transform, have complexity O(n log n log log n), which is slightly more than available within our model. We have tried training a model of depth O(log^2 n) but were not able to obtain better generalization. Such a model is relatively deep for the sequence lengths on which it can be realistically trained (for example, we get 113 switch layers for sequence length 256), and it seems unable to recognize the proper scaling of O(log^2 n) instead of cn for some constant c.\nA distinct advantage of the Shuffle-Exchange model is its speed. Fig. 5 shows the comparison of the running time of the learned binary addition algorithm (or any other algorithm having the same sized model) for the proposed model and DNGPU. We use 96 feature maps for both models, which is enough for both of them to learn binary addition. The time complexity of our model is O(n log n) vs. O(n^2) for DNGPU, which is reflected in the measured running time. Also, DNGPU demands a lot more GPU memory - we were able to evaluate it only up to length 16K, whereas the Shuffle-Exchange model is able to process sequences of two million symbols in about 5 seconds.\n\n5.2 LAMBADA question answering\n\nThe goal of the LAMBADA task is to predict a given target word from its broad context (on average 4.6 sentences collected from novels). The sentences in the LAMBADA dataset (Paperno et al., 2016) are specially selected such that giving the right answer requires examining the whole passage. In 81% of the test set cases the target word can be found in the text, and we follow a common strategy\n\nFigure 4: Accuracy on length 512 vs. training step.
Figure 5: Evaluation and training time comparison with DNGPU (log scale).\n\n(a) Shuffle-Exchange\n\n(b) DNGPU\n\nFigure 6: Generalization to longer sequences.\n\n(Chu et al., 2017; Dehghani et al., 2018) to choose the target word as one from the text. Obviously, the answer will be wrong in the remaining cases, so the obtained accuracy will not exceed 81%.\nWe instantiate the model for input length 128 (almost all test examples fit into this length) and pad the input sequence to that length by placing the sequence at a random position and adding zeros on both ends. Without randomization, the model overfits. We use a pretrained fastText 1M English word embedding (Mikolov et al., 2018) for the input words. The embedding layer is followed by 2 Beneš blocks with 384 feature maps. To perform the answer selection as a word from the text, the final layer of the network is constructed differently. Each symbol of the output is linearly mapped to a single scalar, and we use a softmax loss over the obtained sequence to select the position of the answer word.\nIn Table 1 we give our results in the context of previous works. Our architecture is able to score better than the Gated-Attention Reader (Chu et al., 2017) and loses only to the Universal Transformer (Dehghani et al., 2018). Both these networks use an attention mechanism enclosed in a highly elaborate neural model, so it is remarkable that our model of much simpler architecture can provide competitive accuracy.\nOn the other hand, our model is significantly faster and scales better to long sequences. In Fig. 7 we compare the training and execution time of our model to the Universal Transformer. We use the official Universal Transformer implementation from the Tensor2Tensor library (Vaswani et al., 2018) and measure the time for one training or evaluation step on a single sequence. For both models, we use configurations that reach the best test accuracy.
We use the base configuration (as mentioned in Dehghani et al. (2018)) for the Universal Transformer, which has 152M parameters, and the Shuffle-Exchange network with 384 feature maps and 2 Beneš blocks with a total parameter count of 33M. It is worth mentioning that our model is about 5x smaller than the Universal Transformer. We perform the evaluation on sequence lengths that are able to fit in the 11GB of GPU memory. We can see that the Shuffle-Exchange network is significantly faster and can be applied to up to 32x longer sequences using the same amount of GPU memory.\nIt is interesting to note that, in contrast to Transformers, our architecture does not need positional embeddings. It can learn a positional embedding itself if required. Assume that the input has a marked position (an end-of-line marker, for example) and consider the binary tree of paths from it to the nodes at depth log(n). Each leaf of this tree can be uniquely labeled according to the left-or-right choices on the path connecting it to the root.
Such labeling can be learned by the first log(n) layers of the network, and the rest of the network can use this information as a positional embedding.\n\nTable 1: Accuracy on the LAMBADA word prediction task\n\nModel | Test accuracy (%)\nRandom word from passage (Paperno et al., 2016) | 1.6\nGated-Attention Reader (Chu et al., 2017) | 49.0\nShuffle-Exchange network (this work) | 52.28\nUniversal Transformer (Dehghani et al., 2018) | 56.0\nHuman performance (Chu et al., 2017) | 86.0\n\nFigure 7: Evaluation and training time of Shuffle-Exchange and Universal Transformer (log scale).\n\n5.3 Ablation study\n\nWe have studied several modifications of our architecture and confirmed that the presented form is the best. We verified the need for the key features by removing them one by one and comparing with the baseline version on the multiplication, LAMBADA, addition and sorting tasks (Fig. 8). We considered the following ablations: using the identity function instead of swapHalf (without swap), removing the residual connections (without residual), disabling the idea by Beneš to use two shuffle directions (without Beneš), and using an additional swap gate in place of the swapHalf function, which chooses a straight-through or a swapped input for the update gate (swap gate). Additionally, we explore two versions where the Switch Unit is replaced with two fully connected layers with ReLU and twice the number of feature maps in the middle (ReLU on both FC layers does not work well). The first ablation replaces the entire unit (Two FC layers); the second one replaces only the part involving the reset gates but keeps the update gate (Two FC layers + gate).\nWe can see that the baseline version gives the best accuracy overall.
The differences are most pronounced on the multiplication task; on the other tasks all versions, except the one without the gate, perform reasonably well.\nWe have also evaluated the effect of the network depth (measured in the Beneš block count) and the feature map count on the test accuracy, see Figures 9 and 10. There are no surprises - a larger and deeper network is generally better.\n\n6 Conclusions\n\nWe have introduced a Shuffle-Exchange neural network model that has O(log n) depth and O(n log n) total complexity and allows modeling any dependencies in the sequence. We have shown that this model can successfully synthesize nontrivial O(n log n) time algorithms with good generalization.\nThe Shuffle-Exchange model can serve as an alternative to the attention mechanism with better scaling to long sequences. Although we obtained slightly lower accuracy on the LAMBADA question answering task, our model is much simpler than the winning Universal Transformer and has fewer parameters.
We are looking forward to implementing more elaborate architectures combining Shuffle-Exchange blocks with other kinds of layers to achieve unmatched accuracy for long sequence tasks.\n\n(a) Multiplication task\n\n(b) LAMBADA task\n\n(c) Addition task\n\n(d) Sorting task\n\nFigure 8: Ablation study.\n\n(a) Multiplication task\n\n(b) LAMBADA task\n\nFigure 9: Test accuracy depending on the block count.\n\n(a) Multiplication task\n\n(b) LAMBADA task\n\nFigure 10: Test accuracy depending on the feature map count.\n\nAcknowledgements\n\nWe would like to thank Renārs Liepiņš for the valuable discussion regarding the subject matter of the paper, the IMCS UL Scientific Cloud for the computing power and Leo Trukšāns for the technical support. We sincerely thank all the reviewers for their comments and suggestions. This research is funded by the Latvian Council of Science, project No.
lzp-2018/1-0327.

References

Ajtai, M., J. Komlós, and E. Szemerédi
1983. Sorting in c log n parallel steps. Combinatorica, 3(1):1–19.

Al-Rfou, R., D. Choe, N. Constant, M. Guo, and L. Jones
2018. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.

Andrychowicz, M. and K. Kurach
2016. Learning efficient algorithms with hierarchical attentive memory. arXiv preprint arXiv:1602.03218.

Brent, R. P. and H. T. Kung
1982. A regular layout for parallel adders. IEEE Transactions on Computers, (3):260–264.

Child, R., S. Gray, A. Radford, and I. Sutskever
2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Cho, K., B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio
2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Chu, Z., H. Wang, K. Gimpel, and D. McAllester
2017. Broad context language modeling as reading comprehension. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Pp. 52–57. ACL (Association for Computational Linguistics).

Clark, C. and M. Gardner
2017. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723.

Dai, Z., Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov
2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Dally, W. J. and B. P. Towles
2004. Principles and Practices of Interconnection Networks. Elsevier.

Dehghani, M., S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser
2018. Universal transformers. arXiv preprint arXiv:1807.03819.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova
2018.
BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Freivalds, K. and R. Liepins
2018. Improving the neural GPU architecture for algorithm learning. The ICML workshop Neural Abstract Machines & Program Induction v2 (NAMPI 2018).

Gehring, J., M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin
2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh, eds., volume 70 of Proceedings of Machine Learning Research, Pp. 1243–1252. PMLR.

Graves, A., G. Wayne, and I. Danihelka
2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Graves, A., G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al.
2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471.

Grefenstette, E., K. M. Hermann, M. Suleyman, and P. Blunsom
2015. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems 28, C. Cortes and Lee D.D. et al., eds., Pp. 1828–1836. Curran Associates, Inc.

Guo, Q., X. Qiu, P. Liu, Y. Shao, X. Xue, and Z. Zhang
2019. Star-Transformer. arXiv preprint arXiv:1902.09113.

Harvey, D. and J. Van Der Hoeven
2019. Integer multiplication in time O(n log n).

Hochreiter, S. and J. Schmidhuber
1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Huang, C.-Z. A., A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck
2019. Music Transformer. In International Conference on Learning Representations.

Joulin, A. and T. Mikolov
2015. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems 28, C. Cortes and Lee D.D. et al., eds., Pp.
190–198. Curran Associates, Inc.

Kaiser, Ł. and S. Bengio
2016. Can active memory replace attention? In Advances in Neural Information Processing Systems 29, D. Lee and Luxburg U.V. et al., eds., Pp. 3781–3789. Curran Associates, Inc.

Kaiser, Ł. and I. Sutskever
2015. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228.

Kalchbrenner, N., I. Danihelka, and A. Graves
2015. Grid long short-term memory. arXiv preprint arXiv:1507.01526.

Kant, N.
2018. Recent advances in neural program synthesis. arXiv preprint arXiv:1802.02353.

Kingma, D. and J. Ba
2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Knuth, D. E.
1973. The Art of Computer Programming, vol. 3: Sorting and Searching. Reading, MA: Addison-Wesley.

Kurach, K., M. Andrychowicz, and I. Sutskever
2016. Neural random access machines. ICLR.

Mikolov, T., E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin
2018. Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), N. Calzolari and Choukri, Khalid et al., eds. European Language Resources Association (ELRA).

Nowak, A., D. Folqué, and J. Bruna
2018. Divide and conquer networks. In International Conference on Learning Representations.

Paperno, D., G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández
2016. The LAMBADA dataset: word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Pp. 1525–1534. ACL (Association for Computational Linguistics).

Reed, S. E. and N. de Freitas
2016. Neural programmer-interpreters.
In 4th International Conference on Learning Representa-\ntions, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.\n\nSeiferas, J.\n\n2009. Sorting networks of logarithmic depth, further simpli\ufb01ed. Algorithmica, 53(3):374\u2013384.\n\nvan den Oord, A., S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W.\n\nSenior, and K. Kavukcuoglu\n2016. Wavenet: A generative model for raw audio. In SSW - the 9th ISCA Speech Synthesis\nWorkshop, Sunnyvale, CA, USA, 2016, P. 125.\n\nVaswani, A., S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, \u0141. Kaiser,\n\nN. Kalchbrenner, N. Parmar, et al.\n2018. Tensor2tensor for neural machine translation. arXiv preprint arXiv:1803.07416.\n\nVaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \u0141. Kaiser, and I. Polosukhin\n2017. Attention is all you need. In Advances in Neural Information Processing Systems 30,\nI. Guyon and Luxburg U.V. et al., eds., Pp. 5998\u20136008. Curran Associates, Inc.\n\nVinyals, O., M. Fortunato, and N. Jaitly\n\n2015. Pointer networks. In Advances in Neural Information Processing Systems 28, C. Cortes and\nLee D.D. et al., eds., Pp. 2692\u20132700. Curran Associates, Inc.\n\nYu, F. and V. Koltun\n\n2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.\n\nZaremba, W., T. Mikolov, A. Joulin, and R. Fergus\n\n2016. Learning simple algorithms from examples. In International Conference on Machine\nLearning, Pp. 421\u2013429.\n\nZaremba, W. and I. Sutskever\n\n2015. Reinforcement learning neural Turing machines-revised. 
arXiv preprint arXiv:1505.00521.