{"title": "Compositional De-Attention Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6135, "page_last": 6145, "abstract": "Attentional models are distinctly characterized by their ability to learn relative importance, i.e., assigning a different weight to input values. This paper proposes a new quasi-attention that is compositional in nature, i.e., learning whether to \\textit{add}, \\textit{subtract} or \\textit{nullify} a certain vector when learning representations. This is strongly contrasted with vanilla attention, which simply re-weights input tokens. Our proposed \\textit{Compositional De-Attention} (CoDA) is fundamentally built upon the intuition of both similarity and dissimilarity (negative affinity) when computing affinity scores, benefiting from a greater extent of expressiveness. We evaluate CoDA on six NLP tasks, i.e. open domain question answering, retrieval/ranking, natural language inference, machine translation, sentiment analysis and text2code generation. We obtain promising experimental results, achieving state-of-the-art performance on several tasks/datasets.", "full_text": "Compositional De-Attention Networks\n\n\u2020Yi Tay\u2217, (cid:93)Luu Anh Tuan\u2217, (cid:92)Aston Zhang, \u2663Shuohang Wang, (cid:91)Siu Cheung Hui\n\n\u2020,(cid:91)Nanyang Technological University, Singapore\n\n(cid:93)MIT CSAIL, (cid:92)Amazon AI\n\n\u2663Microsoft Dynamics 365 AI Research\n\nytay017@gmail.com\n\nAbstract\n\nAttentional models are distinctly characterized by their ability to learn relative\nimportance, i.e., assigning a different weight to input values. This paper proposes\na new quasi-attention that is compositional in nature, i.e., learning whether to\nadd, subtract or nullify a certain vector when learning representations. 
This is strongly contrasted with vanilla attention, which simply re-weights input tokens. Our proposed Compositional De-Attention (CoDA) is fundamentally built upon the intuition of both similarity and dissimilarity (negative affinity) when computing affinity scores, benefiting from a greater extent of expressiveness. We evaluate CoDA on six NLP tasks, i.e., open domain question answering, retrieval/ranking, natural language inference, machine translation, sentiment analysis and text2code generation. We obtain promising experimental results, achieving state-of-the-art performance on several tasks/datasets.\n\n1 Introduction\n\nNot all inputs are created equal. This highly intuitive motivator, commonly referred to as \u2018attention\u2019, forms the bedrock of many recent and successful advances in deep learning research [Bahdanau et al., 2014, Parikh et al., 2016, Seo et al., 2016, Vaswani et al., 2017]. To this end, the Softmax operator lives at its heart, signifying that learning relative importance is a highly effective inductive bias for many problem domains and model architectures.\nThis paper proposes a new general purpose quasi-attention method. Our method is \u2018quasi\u2019 in the sense that it behaves like an attention mechanism, albeit with several key fundamental differences. Firstly, instead of learning relative importance (a weighted sum), we learn a compositional pooling of tokens, deciding whether to add, subtract or delete an input token. Since our method learns to flip/subtract tokens, deviating from the original motivation of attention, we refer to it as a quasi-attention method. Secondly, we introduce a secondary de-attention (deleted attention) matrix, finally learning a multiplicative composition of similarity and dissimilarity. 
We hypothesize that this more flexible design can lead to more expressive and powerful models that achieve better performance.\nIn order to achieve this, we introduce two technical contributions. The first is a dual affinity scheme, which introduces a secondary affinity matrix N in addition to the original affinity matrix E. The affinity matrix E, commonly found in pairwise [Parikh et al., 2016] or self-attentional [Vaswani et al., 2017] models, learns a pairwise similarity computation between all elements in a sequence (or two sequences), i.e., eij = ai⊤bj. Contrary to E, our new matrix N is learned as a dissimilarity metric, such as the negative L1 distance, providing dual flavours of pairwise composition.\n\n∗Denotes equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nSecondly, we introduce a compositional mechanism which composes tanh(E) with sigmoid(N) to form the quasi-attention matrix M. In this case, the first term tanh(E) controls the adding and subtracting of vectors, while the secondary affinity N can be interpreted as a type of gating mechanism, erasing unnecessary pairwise scores to zero when desired. The motivation for using dissimilarity as a gate is natural, serving as a protection against over-relying on raw similarity, i.e., if dissimilarity is high then the negative component learns to erase the positive affinity score.\nFinally, the quasi-attention matrix is utilized as per standard vanilla attention models, pooling across a sequence of vectors for learning attentional representations. More concretely, this new quasi-attention mechanism is given the ability to express arithmetic operations [Trask et al., 2018] when composing vectors, i.e., compositional pooling. 
As a general-purpose and universal neural component, our CoDA mechanism can be readily applied to many state-of-the-art neural models, such as models that use pairwise attention or self-attention-based Transformers.\nAll in all, the prime contributions of this work are as follows:\n\n\u2022 We introduce Compositional De-Attention (CoDA), a form of quasi-attention method. Our CoDA mechanism is largely based on two new concepts, (1) dual affinity matrices and (2) compositional pooling, distinguishing itself from all other attention mechanisms in the literature.\n\n\u2022 Our CoDA method decouples the Softmax operator from standard attention mechanisms and puts forward a new paradigm for attentional pooling in neural architectures. To the best of our knowledge, this is the first work that explores the usage of Softmax-less attention mechanisms. As a by-product, we show that going Softmax-less can be a viable choice even in attentional models.\n\n\u2022 CoDA enables a greater extent of flexibility in composing vectors during the attentional pooling process. We imbue our model with the ability to subtract vectors (not only relatively weight them).\n\n\u2022 We conduct extensive experiments on a myriad of NLP tasks such as open domain question answering, ranking, natural language inference, machine translation, sentiment analysis and text2code generation. 
We obtain reasonably promising results, demonstrating the utility of the proposed CoDA mechanism, outperforming vanilla attention more often than not. Moreover, CoDA achieves state-of-the-art performance on several tasks/datasets.\n\n2 Compositional De-Attention Networks (CoDA)\n\nThis section introduces the proposed CoDA method.\n\n2.1 Input Format and Pairwise Formulation\n\nOur proposed CoDA method accepts two input sequences A ∈ Rℓa×d and B ∈ Rℓb×d, where ℓa, ℓb are the lengths of sequences A, B respectively and d is the dimensionality of the input vectors, and returns pooled representations of equal dimensions. Note that CoDA is universal in the sense that it can be applied to both pairwise (cross) attention [Parikh et al., 2016, Seo et al., 2016] as well as single sequence attention. In the case of single sequence attention, A and B refer to the same sequence (i.e., self-attention [Vaswani et al., 2017]).\n\n2.2 Dual Affinity Computation\n\nWe compute the pairwise affinity between each element in A and B via:\n\nEij = α FE(ai)FE(bj)⊤ (1)\n\nwhich captures the pairwise similarity between each element in A and each element of B. In this case, FE(.) is a parameterized function such as a linear/nonlinear projection. Moreover, α is a scaling constant and a non-negative hyperparameter which can be interpreted as a temperature setting that controls saturation. Next, as a measure of negative affinity (dissimilarity), we compute:\n\nNij = −β ||FN(ai) − FN(bj)||ℓ1 (2)\n\nwhere FN(.) is a parameterized function, β is a scaling constant, and ||.||ℓ1 is the L1 norm. In practice, we may share the parameters of FE(.) and FN(.). Note that Nij ∈ R is a scalar value and the affinity matrix N has equal dimensions to the affinity matrix E. 
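The two affinity computations above (Equations 1 and 2) can be sketched in a few lines of numpy. This is a minimal illustration rather than the authors' implementation: FE(.) and FN(.) are taken to be simple (optionally shared) linear projections, and the function and variable names are ours.

```python
import numpy as np

def dual_affinity(A, B, W_E, W_N, alpha=1.0, beta=1.0):
    """Dual affinity matrices E (Eq. 1) and N (Eq. 2).

    A: (la, d) and B: (lb, d) input sequences.
    W_E, W_N: (d, d) projection weights standing in for F_E(.), F_N(.).
    Returns E and N, both of shape (la, lb).
    """
    FA_E, FB_E = A @ W_E, B @ W_E   # F_E(a_i), F_E(b_j)
    FA_N, FB_N = A @ W_N, B @ W_N   # F_N(a_i), F_N(b_j)
    # Pairwise dot-product similarity: E_ij = alpha * F_E(a_i) F_E(b_j)^T
    E = alpha * (FA_E @ FB_E.T)
    # Pairwise negative L1 distance: N_ij = -beta * ||F_N(a_i) - F_N(b_j)||_1
    N = -beta * np.abs(FA_N[:, None, :] - FB_N[None, :, :]).sum(-1)
    return E, N

rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
W = rng.normal(size=(8, 8))
E, N = dual_affinity(A, B, W, W)   # parameters shared, as the text allows
print(E.shape, N.shape)            # (5, 7) (5, 7)
```

Note that N is everywhere non-positive by construction, which is what later motivates centering or scaling its sigmoid.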
We hypothesize that capturing a flavour of dissimilarity (subtractive compositionality) is crucial in attentional models. The rationale for using the negative distance is for these negative affinity values to act as a form of gating (as elaborated subsequently).\n\n2.3 Compositional De-Attention Matrix\n\nIn the typical case of vanilla attention, Softmax is applied onto the matrix E ∈ RℓA×ℓB row-wise and column-wise, normalizing the matrix. Hence, multiplying the normalized matrix E with the original input sequence can be interpreted as a form of attentional pooling (learning to align), in which each element of A pools all relevant information across all elements of B. For our case, we use the following equation:\n\nM = tanh(E) ⊙ sigmoid(N) (3)\n\nwhere M is the final (quasi-)attention matrix in our CoDA mechanism and ⊙ is the element-wise multiplication between the two matrices.\n\nCentering of N. Since N is constructed from the negative L1 distance, it is clear that the range of sigmoid(N) is [0, 0.5]. Hence, in order to ensure that sigmoid(N) lies in [0, 1], we center the matrix N to have zero mean:\n\nN → N − Mean(N) (4)\n\nIntuitively, by centering N to zero mean, we are also able to maintain its ability to both erase and retain values in tanh(E), since sigmoid(N) now saturates at 0 and 1, behaving more like a gate.\n\nScaling of sigmoid(N). A second variation, as an alternative to centering, is to scale sigmoid(N) by 2, ensuring that its range falls within [0, 1] instead of [0, 0.5]:\n\nM = tanh(E) ⊙ (2 ∗ sigmoid(N)) (5)\n\nEmpirically, we found that this approach also works considerably well.\n\nCentering of E. Additionally, there is no guarantee that E contains both positive and negative values. 
Hence, in order to ensure that tanh(E) is able to effectively express subtraction (negative values), we may normalize E to have zero mean:\n\nE → E − Mean(E) (6)\n\nThis normalization/centering can be interpreted as a form of inductive bias. Without it, we have no guarantee that the model does not converge to a solution where all values satisfy E > 0 or E < 0. Naturally, we also observe that the scalar value α in Equation 1 acts like a temperature hyperparameter: when α is large, the values of tanh(E) will saturate towards {−1, 1}.\n\nIntuition Note that since the distance value takes a summation over vector elements, the value of sigmoid(N) ∈ [0, 1] (centered) will saturate towards 0 or 1. Hence, this encodes a strong prior for either erasing (a.k.a. \u2018de-attention\u2019) or keeping the entire scores from E. Contrary to typical attention mechanisms, M is biased towards the values {−1, 0, 1}, since sigmoid(N) is biased towards {0, 1} whilst tanh(E) is biased towards {−1, 1}. Intuitively, N (the negative affinity matrix) controls the deletion operation while E controls whether we want to add or subtract a vector.\nAdditionally, we can consider sigmoid(N) to act as affinity gates. A higher dissimilarity (denoted by a large negative distance) will \u2018erase\u2019 the values on the main affinity matrix tanh(E). The choice of other activation functions and their compositions is discussed in Section 4.\n\nTemperature We introduced the hyperparameters α, β in earlier sections to control the magnitudes of E and N. Intuitively, these hyperparameters control the temperature of the tanh and sigmoid functions. In other words, a high value of α, β will enforce a hard form of compositional pooling. For most cases, setting α = 1, β = 1 may suffice. 
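Continuing the sketch, the quasi-attention matrix of Equation 3 together with the centering variants (Equations 4 and 6) and the scaled-sigmoid alternative (Equation 5) can be written as follows. Again, this is a minimal numpy illustration under our own naming, not the authors' released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coda_matrix(E, N, center=True):
    """Quasi-attention matrix M = tanh(E) * sigmoid(N) (Eq. 3).

    center=True applies the zero-mean centering of E and N (Eqs. 6 and 4);
    center=False instead scales sigmoid(N) by 2 (Eq. 5), since sigmoid of
    a non-positive N only covers [0, 0.5].
    """
    if center:
        E = E - E.mean()
        N = N - N.mean()
        gate = sigmoid(N)           # now saturates towards 0 and 1
    else:
        gate = 2.0 * sigmoid(N)     # stretches (0, 0.5] to (0, 1]
    return np.tanh(E) * gate

rng = np.random.default_rng(1)
E = rng.normal(size=(5, 7))
N = -np.abs(rng.normal(size=(5, 7)))   # negative distances are <= 0
M = coda_matrix(E, N)
print(M.shape)                         # (5, 7)
```

In both variants every entry of M lies in (−1, 1), consistent with the stated bias of M towards {−1, 0, 1}.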
Note that the dimensionality of the vectors also contributes to the hardness of our method, since a large dimensionality results in large values in the E, N matrices; in this case, α, β may be set to ≤ 1 to prevent M from being too hard.\n\nFigure 1: Illustration of our proposed Compositional De-Attention (CoDA) affinity matrix composition. Red represents positive values and blue represents negative values. White represents close to zero values.\n\n2.4 Compositional Pooling\n\nWe then apply M (the quasi-attention matrix) to the input sequences A and B:\n\nA′ = MB and B′ = M⊤A (7)\n\nwhere A′ ∈ RℓA×d and B′ ∈ RℓB×d are the compositionally manipulated representations of A and B respectively. In the case of A′, each element Ai in A scans across the sequence B and decides² whether to include/add (+1), subtract (−1) or delete (×0) the tokens in B. Similar intuition applies to B′, where each element in B scans across the sequence A and decides to add, subtract or delete the tokens in A. Intuitively, this allows for rich and expressive representations, unlike typical attentional pooling methods that softly perform weighted averages over a sequence.\n\n2.5 Incorporating CoDA into Existing Models\n\nIn this section, we discuss various ways in which CoDA may be used and incorporated into existing neural models.\n\nCoDA Cross-Attention Many models for pairwise sequence problems require a form of cross attention. In this case, CoDA is applied as:\n\nA′, B′ = CoDA(A, B) (8)\n\nwhere A ∈ RℓA×d, B ∈ RℓB×d are two input sequences (e.g., document-query or premise-hypothesis pairs) and A′ ∈ RℓA×d, B′ ∈ RℓB×d are compositionally aligned representations of A and B respectively. Next, we use an alignment function\n\nF = [F(A′, A); F(B′, B)] (9)\n\nto learn cross-sentence feature representations. Note that F(.) may be any parameterized function such as an RNN, an MLP or even a simple pooling function. [;] is the concatenation operator.\n\nCoDA Transformer Transformers [Vaswani et al., 2017] adopt self-attention mechanisms, which can be interpreted as cross-attention with respect to the same sequence. The original Transformer³ equation A = softmax(QK⊤/√dk)V now becomes:\n\nA = (tanh(QK⊤/√dk) ⊙ sigmoid(G(Q, K)/√dk))V (10)\n\nwhere G(., .) is the negation of the outer L1 distance between all rows of Q and all rows of K. We either apply centering to G(Q, K)/√dk or use 2 ∗ sigmoid(G(Q, K)/√dk) to ensure the gate values lie in [0, 1]. Finally, note that both affinity matrices are learned by transforming Q, K, V only once.\n\n²We emphasize that this is still done in a soft manner.\n³We find that, for some tasks, removing the scaling factor of 1/√dk works better for our CoDA mechanism.\n\n3 Experiments\n\nWe perform experiments on a variety of NLP tasks including open domain question answering, retrieval/ranking, natural language inference, neural machine translation, sentiment analysis and text2code generation. This section provides experimental details such as experimental setups, results and detailed discussions.\n\n3.1 Open Domain Question Answering\n\nWe evaluate CoDA on Open Domain QA. The task at hand is to predict an appropriate answer span in a collection of paragraphs. We use well-established benchmarks, SearchQA [Dunn et al., 2017] and Quasar-T [Dhingra et al., 2017]. 
Both datasets comprise QA pairs with an accompanying set of documents retrieved by search engines. For this experiment, we use the recently proposed and open source⁴ DecaProp [Tay et al., 2018] as a base model and replace its context-query attention with our CoDA variation. We set the hyperparameters as closely to the original implementation as possible, since the key here is to observe whether the CoDA-enhanced adaptation can improve upon the original DecaProp. As competitors, we compare with the latest [Das et al., 2019], a sophisticated multi-step reasoner specially targeted at open domain QA, as well as the canonical R3 model [Wang et al., 2017], AQA [Buck et al., 2017] and BiDAF [Seo et al., 2016].\n\nModel | Quasar-T (EM/F1) | SearchQA (EM/F1)\nGA | 26.4/26.4 | -/-\nBiDAF [Seo et al., 2016] | 25.9/28.5 | 28.6/34.6\nAQA [Buck et al., 2017] | -/- | 38.7/45.6\nR3 Reader-Ranker [Wang et al., 2017] | 34.2/40.9 | 49.0/55.3\nMulti-step-reasoner [Das et al., 2019] | 40.6/47.0 | 56.3/61.4\nDecaProp [Tay et al., 2018] | 38.6/46.9 | 56.8/63.6\nDecaProp + CoDA (Ours) | 41.3/49.7 | 57.2/63.9\n\nTable 1: Experimental results on Open Domain Question Answering. DecaProp + CoDA achieves state-of-the-art performance on both datasets.\n\nResults Table 1 reports the results on Open Domain QA. Most importantly, we find that CoDA reasonably improves upon the base DecaProp model on the Quasar-T dataset (+2.7%) while marginally improving performance on the SearchQA dataset. Notably, DecaProp + CoDA also exceeds specialized open domain QA models such as the recent Multi-step reasoner [Das et al., 2019] and achieves state-of-the-art performance on both datasets.\n\n3.2 Retrieval and Ranking\n\nWe evaluate CoDA on a series of retrieval and ranking tasks. 
More concretely, we use well-established answer retrieval datasets (TrecQA [Wang et al., 2007] and WikiQA [Yang et al., 2015]) along with a response selection dataset (the Ubuntu Dialogue Corpus [Lowe et al., 2015]). In these datasets, models are given a question-answer or message-response pair and are tasked to rank the answers/responses by how likely they match the question.\n\n⁴https://github.com/vanzytay/NIPS2018_DECAPROP.\n\nFor this experiment, we use a competitive baseline, DecompAtt [Parikh et al., 2016], as the base building block for our experiments. We train DecompAtt in a pointwise manner for the ranking tasks with a binary Softmax loss. We report MAP/MRR for TrecQA/WikiQA and top-1 accuracy for the Ubuntu Dialogue Corpus (UDC). We train all models for 20 epochs, optimizing with Adam with learning rate 0.0003. Hidden dimensions are set to 200 following the original DecompAtt model. Batch size is set to 64.\n\nModel | TrecQA (MAP/MRR) | WikiQA (MAP/MRR) | UDC\nD-ATT | 80.6/83.9 | 66.4/68.0 | 51.8\nD-ATT + CoDA | 80.0/84.5 | 70.5/72.4 | 52.5\n\nTable 2: Experimental results on Retrieval and Ranking.\n\nResults Table 2 reports our results on the retrieval and ranking tasks. D-ATT + CoDA outperforms vanilla D-ATT in most cases. On WikiQA, we observe a +4% gain on both MAP and MRR metrics. The performance gains on UDC and TrecQA (MRR) are marginal. Overall, the results on this task are quite promising.\n\n3.3 Natural Language Inference\n\nThe task of Natural Language Inference (NLI) is concerned with determining whether two sentences entail or contradict each other. This task has commonly been associated with language understanding in general. We use four datasets, SNLI [Bowman et al., 2015], MNLI [Williams et al., 2017], SciTail [Khot et al., 2018] and the newly released Dialogue NLI (DNLI) [Welleck et al., 2018]. Similar to the retrieval and ranking tasks, we use the DecompAtt model as the base model. 
We use identical hyperparameter settings to the retrieval and ranking experiments, but train all models for 50 epochs. We set the batch size to 32 for SciTail owing to its smaller dataset size.\n\nModel | SNLI | MNLI | SciTail | DNLI\nD-ATT | 84.61 | 71.34/71.97 | 82.0 | 88.2\nD-ATT + CoDA | 85.71 | 72.19/72.45 | 83.6 | 88.8\n\nTable 3: Experimental results (accuracy) on Natural Language Inference.\n\nResults Table 3 reports the results of our experiments on all four NLI tasks. Concretely, our results show that CoDA helps the base DecompAtt on all four datasets. Notably, D-ATT + CoDA outperforms the state-of-the-art result of 88.2% on the original DNLI dataset leaderboard in [Welleck et al., 2018].\n\n3.4 Machine Translation\n\nWe evaluate the CoDA Transformer against the vanilla Transformer on a Machine Translation (MT) task. In our experiments, we use the IWSLT\u201915 English-Vietnamese dataset. We implement the CoDA Transformer in Tensor2Tensor⁵. We use the transformer_base_single_gpu setting, run the model on a single TitanX GPU for 50K steps and use the default checkpoint averaging script. Competitors include Stanford Statistical MT [Luong and Manning], traditional Seq2Seq + Attention [Bahdanau et al., 2014], and Neural Phrase-based MT [Huang et al., 2017].\n\nModel | BLEU\nLuong & Manning (2015) | 23.30\nSeq2Seq Attention | 26.10\nNeural Phrase-based MT | 27.69\nNeural Phrase-based MT + LM | 28.07\nTransformer | 28.43\nCoDA Transformer | 29.84\n\nTable 4: Experimental results on the Machine Translation task using the IWSLT\u201915 English-Vietnamese dataset.\n\n⁵https://github.com/tensorflow/tensor2tensor.\n\nResults Table 4 reports the results of our MT experiments. We observe that CoDA improves the base Transformer model by about +1.4 BLEU points on this dataset. 
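For reference, the CoDA Transformer used in these experiments replaces the Softmax head as in Equation 10. Below is a minimal single-head numpy sketch of that computation; it is our own illustration rather than the released implementation, using the 2x sigmoid-scaling variant in place of centering, with G as the negated pairwise L1 distance between rows of Q and K.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coda_self_attention(Q, K, V, scale=True):
    """Single CoDA quasi-attention head (Eq. 10).

    Computes (tanh(QK^T/sqrt(dk)) * 2*sigmoid(G(Q,K)/sqrt(dk))) V,
    replacing the usual softmax(QK^T/sqrt(dk)) V.
    """
    dk = Q.shape[-1]
    s = np.sqrt(dk) if scale else 1.0   # the paper notes removing 1/sqrt(dk) sometimes helps
    E = (Q @ K.T) / s                                       # similarity affinity
    G = -np.abs(Q[:, None, :] - K[None, :, :]).sum(-1) / s  # negated outer L1 distance
    M = np.tanh(E) * (2.0 * sigmoid(G))                     # quasi-attention matrix
    return M @ V

rng = np.random.default_rng(2)
Q = rng.normal(size=(6, 16))
K = rng.normal(size=(6, 16))
V = rng.normal(size=(6, 16))
out = coda_self_attention(Q, K, V)
print(out.shape)   # (6, 16)
```

As in the paper, both affinity matrices are derived from a single transformation of Q, K, V; no Softmax normalization is applied.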
Notably, the CoDA Transformer also outperforms all other prior work on this dataset by a reasonable margin.\n\n3.5 Sentiment Analysis\n\nWe compare the CoDA Transformer and the vanilla Transformer on word-level sentiment analysis. We use the IMDb and Stanford Sentiment Treebank (SST) sentiment datasets. We implement the CoDA Transformer in Tensor2Tensor and compare both models using the default tiny hyperparameter setting. We train both models for 2000 steps.\n\nModel | IMDb | SST\nTransformer | 82.6 | 78.9\nCoDA Transformer | 83.3 | 80.6\n\nTable 5: Experimental results on IMDb and SST Sentiment Analysis.\n\nResults We observe that the CoDA Transformer outperforms the vanilla Transformer on both datasets. Note that this implementation uses byte-pair encoding and no pretrained vectors, and is therefore not directly comparable with other works in the literature that use the IMDb and SST datasets.\n\n3.6 Mathematical Language Understanding (MLU)\n\nWe evaluate CoDA on the arithmetic MLU dataset [Wangperawong, 2018], a character-level transduction task⁶. The key idea is to test the compositional reasoning capabilities of the proposed techniques. An example input to the model is x = 85, y = −523, x ∗ y and the corresponding output is −44455. A series of variations are introduced, such as permutation of variables or introduction of other operators. The dataset comprises 12 million of these input-output pairs.\n\nImplementation Following [Wangperawong, 2018], we trained a CoDA Transformer model on this dataset for 100K steps. Evaluation is performed using accuracy per sequence, which assigns a positive class only when there is an exact match.\n\nModel | Acc\nTransformer† | 76.1\nUniversal Transformer† | 78.8\nCoDA Transformer | 84.3\n\nTable 6: Experimental results on Mathematical Language Understanding (MLU). † denotes results reported from [Wangperawong, 2018].\n\n⁶This task is readily available in the Tensor2Tensor framework.\n\nResults Table 6 reports the results of our experiments. 
We observe that our CoDA Transformer achieves a sharp performance gain (+8.2%) over the base Transformer model. Moreover, the CoDA Transformer comes substantially closer to solving this task, showing the advantages that CoDA has on compositional reasoning.\n\n3.7 Program Search / Text2Code Generation\n\nWe report additional experiments on language-to-code generation. We use the AlgoLisp dataset from [Polosukhin and Skidanov, 2018], in which problems are implemented in a Lisp-inspired programming language. Each problem has 10 tests, where each test is an input to be fed into the synthesized program together with the expected output the program should produce. We frame the problem as a sequence transduction task.\nSimilar to the other experiments, our implementation is based on the Tensor2Tensor framework. We train the Transformer and the CoDA Transformer for 100K steps using the tiny setting. The evaluation metric is accuracy per sequence, which means the model is only correct if it generates the entire sequence correctly. Baselines are reported from [Polosukhin and Skidanov, 2018].\n\nModel | Acc\nAttention Seq2Seq | 54.4\nSeq2Tree + Search | 86.1\nTransformer | 96.8\nCoDA Transformer | 97.7\n\nTable 7: Experimental results on the Text2Code Generation task (AlgoLisp).\n\nResults Table 7 reports the results on the Text2Code task. Our CoDA Transformer outperforms the base Transformer by about +0.9% and overall achieves state-of-the-art results on this task. Similar to the MLU task, the CoDA Transformer comes close to solving this problem.\n\n4 Analysis\n\nCentering of E and N. We further examine the need for centering of E and N and show the experimental results on the NMT, WikiQA, SciTail and DNLI datasets in Table 8. We observe that centering E and N does not help performance in most of the cases, except for the DNLI dataset (though the gap is minimal). 
We conclude that, instead of forcing a balance between the positive and negative values in E and N, it is better for the model to learn when to add (positive value), negate (negative value) or ignore (zero value) the tokens.\n\nFunction | NMT | WikiQA | SciTail | DNLI\nFTanh(E) ⊙ FSigmoid(N) | 27.8 | 70.4/71.2 | 85.3 | 86.5\nFTanh(E′) ⊙ FSigmoid(N) | 26.9 | 68.1/68.3 | 84.5 | 86.8\nFTanh(E′) ⊙ FSigmoid(N′) | 27.3 | 69.3/70.2 | 85.0 | 86.7\n\nTable 8: Ablation study (development scores) of centering E and N. E′ and N′ refer to centered/mean-zeroed values of E and N. Model architectures remain identical to those in the experiment section.\n\nAblation study Table 9 shows the development scores of different compositions on the Machine Translation task using the IWSLT\u201915 dataset. We observe that the composition applying Tanh on E and Sigmoid on N helps our proposed attention mechanism achieve the best performance compared to the other compositions.\n\nFunction | BLEU\nFTanh(E) ⊙ FSigmoid(N) | 27.8\nFTanh(E) ⊙ FTanh(N) | 23.1\nFTanh(E) ⊙ FArctan(N) | 21.9\nFTanh(E) ⊙ FAlgebraic(N) | 26.1\nFSigmoid(E) ⊙ FTanh(N) | 21.9\nFSigmoid(E) ⊙ FSigmoid(N) | 27.5\nFSigmoid(E) ⊙ FArctan(N) | 25.1\nFSigmoid(E) ⊙ FAlgebraic(N) | 26.3\n\nTable 9: Ablation study (development scores) of various composition functions on the MT task. Model architectures remain identical to those in the experiment section.\n\n5 Visualization\n\nIn order to provide an in-depth study of the behaviour of our proposed mechanism, this section presents a visual study of the CoDA mechanism. More concretely, we trained a model and extracted the matrices tanh(E), sigmoid(N) and M. Figure 2 illustrates some of these visualizations.\nWe make several key observations. First, the behaviour of sigmoid(N) is aligned with our intuition, saturating at {0, 1} and acting as gates. 
Second, the behaviour of tanh(E) concentrates around 0 but spreads across both negative and positive values. Lastly, the shape of matrix M follows tanh(E) quite closely, although there is a larger percentage of values close to 0 due to composing with sigmoid(N).\nAt convergence, the model learns values of tanh(E) which are close to 0 and more biased towards negative values. This is surprising, since we also found that tanh(E) is saturated, i.e., at {−1, 1}, in the early epochs. We found that the model learns to shrink representations, allowing tanh(E) to have values closer to 0. Finally, we note that the shape of M remains similar to tanh(E). However, the distribution near 0 changes slightly, likely influenced by the sigmoid(N) values.\n\n(a) sigmoid(N) (b) tanh(E) (c) Matrix M\n\nFigure 2: Visualization at d = 200.\n\n6 Related Work\n\nAttention [Bahdanau et al., 2014] is a well-established building block in deep learning research today. A wide spectrum of variations has been proposed across recent years, including Content-based [Graves et al., 2014], Additive [Bahdanau et al., 2014], Location-based [Luong et al., 2015], Dot-Product [Luong et al., 2015] and Scaled Dot-Product [Vaswani et al., 2017] attention. Many of these adaptations vary the scoring function which computes alignment scores. Ultimately, the Softmax operator normalizes the sequence and computes relative importance. In essence, the motivation of attention is literally derived from its naming, i.e., to pay attention to certain parts of the input representation.\nIn very recent years, more sub-branches of attention mechanisms have also started to show great promise across many application domains. Self-attention [Xu et al., 2015, Vaswani et al., 2017] has been shown to be an effective replacement for recurrence/convolution. 
On the other hand, Bidirectional Attention Flow [Seo et al., 2016] is known to be effective at learning query-document representations. Decomposable attention [Parikh et al., 2016] provides a strong inductive prior for learning alignment in natural language inference. A common denominator of these recent, advanced attention mechanisms is the computation of an affinity matrix, which can be interpreted as a fully connected graph that connects all nodes/tokens in each sequence.\nThe extent of paying attention is also an interesting area of research. An extreme focus, commonly referred to as hard attention [Xu et al., 2015], tries to learn discriminative representations that focus solely on certain targets. Conversely, soft attention [Bahdanau et al., 2014] accesses and pools across the entire input. There is also active research pertaining to the activation functions of attention mechanisms, e.g., Sparsemax [Martins and Astudillo, 2016], Sparsegen [Laha et al., 2018] or EntMax [Peters et al., 2019]. However, influencing the sparsity of the softmax can be considered an orthogonal direction to this work.\nAll in all, the idea of attention is to learn relative representations. To the best of our knowledge, there has been no work that considers learning attentive representations that enable negative representations (subtracting) during pooling. Moreover, there is also no work that considers a dual affinity scheme, i.e., considering both positive and negative affinity when learning to attend.\n\n7 Conclusion\n\nWe proposed a new quasi-attention method, the Compositional De-Attention (CoDA) mechanism. We applied CoDA across an extensive number of NLP tasks. The results are promising, and the CoDA variations of several existing state-of-the-art models achieve new state-of-the-art performance on several datasets.\n\nReferences\n\nDzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 
Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\nSamuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.\n\nChristian Buck, Jannis Bulian, Massimiliano Ciaramita, Wojciech Gajewski, Andrea Gesmundo, Neil Houlsby, and Wei Wang. Ask the right questions: Active question reformulation with reinforcement learning. arXiv preprint arXiv:1705.07830, 2017.\n\nRajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. Multi-step retriever-reader interaction for scalable open-domain question answering. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HkfPSh05K7.\n\nBhuwan Dhingra, Kathryn Mazaitis, and William W Cohen. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904, 2017.\n\nMatthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179, 2017.\n\nAlex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.\n\nPo-Sen Huang, Chong Wang, Sitao Huang, Dengyong Zhou, and Li Deng. Towards neural phrase-based machine translation. arXiv preprint arXiv:1706.05565, 2017.\n\nTushar Khot, Ashish Sabharwal, and Peter Clark. SciTail: A textual entailment dataset from science question answering. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.\n\nAnirban Laha, Saneem Ahmed Chemmengath, Priyanka Agrawal, Mitesh Khapra, Karthik Sankaranarayanan, and Harish G Ramaswamy. 
On controllable sparse alternatives to softmax. In Advances in Neural Information Processing Systems, pages 6422–6432, 2018.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909, 2015.

Minh-Thang Luong and Christopher D Manning. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, 2015.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pages 1614–1623, 2016.

Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933, 2016.

Ben Peters, Vlad Niculae, and André FT Martins. Sparse sequence-to-sequence models. arXiv preprint arXiv:1905.05702, 2019.

Illia Polosukhin and Alexander Skidanov. Neural program search: Solving programming tasks from description and examples. arXiv preprint arXiv:1802.04335, 2018.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.

Yi Tay, Anh Tuan Luu, Siu Cheung Hui, and Jian Su. Densely connected attention propagation for reading comprehension. In Advances in Neural Information Processing Systems, pages 4911–4922, 2018.

Andrew Trask, Felix Hill, Scott E Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units.
In Advances in Neural Information Processing Systems, pages 8046–8055, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

Mengqiu Wang, Noah A Smith, and Teruko Mitamura. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. R³: Reinforced reader-ranker for open-domain question answering. arXiv preprint arXiv:1709.00023, 2017.

Artit Wangperawong. Attending to mathematical language with transformers. CoRR, abs/1812.02825, 2018. URL http://arxiv.org/abs/1812.02825.

Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. Dialogue natural language inference. arXiv preprint arXiv:1811.00671, 2018.

Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.

Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A challenge dataset for open-domain question answering.
In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, 2015.