{"title": "High-Order Attention Models for Visual Question Answering", "book": "Advances in Neural Information Processing Systems", "page_first": 3664, "page_last": 3674, "abstract": "The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they  take into account different data modalities,  such as visual and textual input. In this paper we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various data modalities. We show that high-order correlations effectively direct the appropriate attention to the relevant elements in the different data modalities that are required to solve the joint task. We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset.", "full_text": "High-Order Attention Models for Visual Question\n\nAnswering\n\nIdan Schwartz\n\nDepartment of Computer Science\n\nTechnion\n\nAlexander G. Schwing\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of Illinois at Urbana-Champaign\n\nidansc@cs.technion.ac.il\n\naschwing@illinois.edu\n\nDepartment of Industrial Engineering & Management\n\nTamir Hazan\n\nTechnion\n\ntamir.hazan@gmail.com\n\nAbstract\n\nThe quest for algorithms that enable cognitive abilities is an important part of\nmachine learning. A common trait in many recently investigated cognitive-like\ntasks is that they take into account different data modalities, such as visual and\ntextual input. In this paper we propose a novel and generally applicable form\nof attention mechanism that learns high-order correlations between various data\nmodalities. 
We show that high-order correlations effectively direct the appropriate\nattention to the relevant elements in the different data modalities that are required\nto solve the joint task. We demonstrate the effectiveness of our high-order attention\nmechanism on the task of visual question answering (VQA), where we achieve\nstate-of-the-art performance on the standard VQA dataset.\n\n1\n\nIntroduction\n\nThe quest for algorithms which enable cognitive abilities is an important part of machine learning\nand appears in many facets, e.g., in visual question answering tasks [6], image captioning [26],\nvisual question generation [18, 10] and machine comprehension [8]. A common trait in these recent\ncognitive-like tasks is that they take into account different data modalities, for example, visual and\ntextual data.\nTo address these tasks, recently, attention mechanisms have emerged as a powerful common theme,\nwhich provides not only some form of interpretability if applied to deep net models, but also often\nimproves performance [8]. The latter effect is attributed to more expressive yet concise forms of the\nvarious data modalities. Present day attention mechanisms, like for example [15, 26], are however\noften lacking in two main aspects. First, the systems generally extract abstract representations of\ndata in an ad-hoc and entangled manner. Second, present day attention mechanisms are often geared\ntowards a speci\ufb01c form of input and therefore hand-crafted for a particular task.\nTo address both issues, we propose a novel and generally applicable form of attention mechanism\nthat learns high-order correlations between various data modalities. For example, second order\ncorrelations can model interactions between two data modalities, e.g., an image and a question, and\nmore generally, k\u2212th order correlations can model interactions between k modalities. 
Learning these\ncorrelations effectively directs the appropriate attention to the relevant elements in the different data\nmodalities that are required to solve the joint task.\nWe demonstrate the effectiveness of our novel attention mechanism on the task of visual question\nanswering (VQA), where we achieve state-of-the-art performance on the VQA dataset [2]. Some\nof our results are visualized in Fig. 1, where we show how the visual attention correlates with the\ntextual attention.\nWe begin by reviewing the related work. We subsequently provide details of our proposed technique,\nfocusing on the high-order nature of our attention models. We then conclude by presenting the\napplication of our high-order attention mechanism to VQA and comparing it to the state-of-the-art.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Results of our multi-modal attention for one image and two different questions (1st\ncolumn). The unary image attention is identical by construction. The pairwise potentials differ\nfor both questions and images since both modalities are taken into account (3rd column). The \ufb01nal\nattention is illustrated in the 4th column.\n\n2 Related work\n\nAttention mechanisms have been investigated for both image and textual data. In the following we\nreview mechanisms for both.\nImage attention mechanisms: Over the past few years, single image embeddings extracted from a\ndeep net (e.g., [17, 16]) have been extended to a variety of image attention modules, when considering\nVQA. For example, a textual long short term memory net (LSTM) may be augmented with a spatial\nattention [29]. Similarly, Andreas et al. [1] employ a language parser together with a series of neural\nnet modules, one of which attends to regions in an image. The language parser suggests which neural\nnet module to use. Stacking of attention units was also investigated by Yang et al. [27]. 
Their stacked\nattention network predicts the answer successively. Dynamic memory network modules which capture\ncontextual information from neighboring image regions have been considered by Xiong et al. [24].\nShih et al. [23] use object proposals and rank regions according to relevance. The multi-hop\nattention scheme of Xu et al. [25] was proposed to extract \ufb01ne-grained details. A joint attention\nmechanism was discussed by Lu et al. [15], and Fukui et al. [7] suggest an ef\ufb01cient outer product\nmechanism to combine visual representation and text representation before applying attention over\nthe combined representation. Additionally, they suggested the use of glimpses. Very recently, Kazemi\net al. [11] showed a similar approach using concatenation instead of outer product. Importantly, all of\nthese approaches model attention as a single network. The fact that multiple modalities are involved\nis often not considered explicitly, which distinguishes the aforementioned approaches from the technique\nwe present.\nVery recently Kim et al. [14] presented a technique that also interprets attention as a multi-variate\nprobabilistic model, to incorporate structural dependencies into the deep net. Other recent techniques\nare work by Nam et al. [19] on dual attention mechanisms and work by Kim et al. [13] on bilinear\nmodels. In contrast to the latter two models our approach is easy to extend to any number of data\nmodalities.\nTextual attention mechanisms: We also want to provide a brief review of textual attention. To\naddress some of the challenges, e.g., long sentences, faced by translation models, Hermann et al. 
[8]\nproposed RNNSearch. To address the challenges which arise by \ufb01xing the latent dimension of neural\nnets processing text data, Bahdanau et al. [3] \ufb01rst encode a document and a query via a bidirectional\nLSTM which are then used to compute attentions. This mechanism was later re\ufb01ned in [22] where\na word based technique reasons about sentence representations. Joint attention between two CNN\nhierarchies is discussed by Yin et al. [28].\nAmong all those attention mechanisms, relevant to our approach is work by Lu et al. [15] and the\napproach presented by Xu et al. [25]. Both discuss attention mechanisms which operate jointly over\ntwo modalities. Xu et al. [25] use pairwise interactions in the form of a similarity matrix, but ignore\nthe attentions on individual data modalities. Lu et al. [15] suggest an alternating model that directly\ncombines the features of the modalities before attending. Additionally, they suggested a parallel\nmodel which uses a similarity matrix to map features for one modality to the other. It is hard to extend\nthis approach to more than two modalities. In contrast, our approach develops a probabilistic model\nbased on high order potentials and performs mean-\ufb01eld inference to obtain marginal probabilities.\nThis permits trivial extension of the model to any number of modalities.\nAdditionally, Jabri et al. [9] propose a model where answers are also used as inputs. Their approach\nquestions the need for attention mechanisms and develops an alternative solution based on binary\nclassi\ufb01cation. In contrast, our approach captures high-order attention correlations, which we found to\nimprove performance signi\ufb01cantly.\nOverall, while there is early work that proposes a combination of language and image attention for\nVQA, e.g., [15, 25, 12], attention mechanisms with several potentials have not been discussed in detail\nyet. 
In the following we present our approach for joint attention over any number of modalities.\n\n3 Higher order attention models\n\nAttention modules are a crucial component for present day decision making systems. Particularly\nwhen taking into account more and more data of different modalities, attention mechanisms are able\nto provide insights into the inner workings of the oftentimes abstract and automatically extracted\nrepresentations of our systems.\nAn example of such a system that captured a lot of research efforts in recent years is Visual Question\nAnswering (VQA). Considering VQA as an example, we immediately note its dependence on two or\neven three different data modalities, the visual input V , the question Q and the answer A, which get\nprocessed simultaneously. More formally, we let\n\nV \u2208 Rnv\u00d7d, Q \u2208 Rnq\u00d7d, A \u2208 Rna\u00d7d\n\ndenote a representation for the visual input, the question and the answer respectively. Hereby, nv, nq\nand na are the number of pixels, the number of words in the question, and the number of possible\nanswers. We use d to denote the dimensionality of the data. For simplicity of the exposition we\nassume d to be identical across all data modalities.\nDue to this dependence on multiple data modalities, present day decision making systems can be\ndecomposed into three major parts: (i) the data embedding; (ii) attention mechanisms; and (iii) the\ndecision making. For a state-of-the-art VQA system such as the one we developed here, those three\nparts are immediately apparent when considering the high-level system architecture outlined in Fig. 2.\n\n3.1 Data embedding\n\nAttention modules deliver to the decision making component a succinct representation of the relevant\ndata modalities. As such, their performance depends on how we represent the data modalities\nthemselves. 
Oftentimes, an attention module tends to use expressive yet concise data embedding\nalgorithms to better capture their correlations and consequently to improve the decision making\nperformance. For example, data embeddings based on convolutional deep nets constitute the\nstate-of-the-art in many visual recognition and scene understanding tasks. Language embeddings\nheavily rely on LSTMs, which are able to capture context in sequential data, such as words, phrases\nand sentences. We give a detailed account of our data embedding architectures for VQA in Sec. 4.1.\n\nFigure 2: Our state-of-the-art VQA system, with its decision (Sec. 3.3), attention (Sec. 3.2) and data embedding (Sec. 3.1) stages marked.\n\n3.2 Attention\n\nAs apparent from the aforementioned description, attention is the crucial component connecting data\nembeddings with decision making modules.\nSubsequently we denote attention over the nq words in the question via PQ(iq), where iq \u2208\n{1, . . . , nq} is the word index. Similarly, attention over the image is referred to via PV (iv),\nwhere iv \u2208 {1, . . . , nv}, and attention over the possible answers is denoted by PA(ia), where\nia \u2208 {1, . . . , na}.\nWe consider the attention mechanism as a probability model, with each attention mechanism computing\n\u201cpotentials.\u201d First, unary potentials \u03b8V , \u03b8Q, \u03b8A denote the importance of each feature (e.g.,\nquestion word representations, multiple choice answers representations, and image patch features)\nfor the VQA task. Second, pairwise potentials, \u03b8V,Q, \u03b8V,A, \u03b8Q,A express correlations between two\nmodalities. 
Last, the third-order potential \u03b8V,Q,A captures dependencies between the three modalities.\nTo obtain marginal probabilities PQ, PV and PA from potentials, our model performs mean-\ufb01eld\ninference. We combine the unary potential, the marginalized pairwise potential and the marginalized\nthird order potential linearly including a bias term:\n\nPV (iv) = smax(\u03b11\u03b8V (iv)+\u03b12\u03b8V,Q(iv)+\u03b13\u03b8A,V (iv)+\u03b14\u03b8V,Q,A(iv) + \u03b15),\nPQ(iq) = smax(\u03b21\u03b8Q(iq)+\u03b22\u03b8V,Q(iq)+\u03b23\u03b8A,Q(iq)+\u03b24\u03b8V,Q,A(iq) + \u03b25),\nPA(ia) = smax(\u03b31\u03b8A(ia)+\u03b32\u03b8A,V (ia)+\u03b33\u03b8A,Q(ia)+\u03b34\u03b8V,Q,A(ia) + \u03b35).    (1)\n\nHereby \u03b1i, \u03b2i, and \u03b3i are learnable parameters and smax(\u00b7) refers to the soft-max operation over\niv \u2208 {1, . . . , nv}, iq \u2208 {1, . . . , nq} and ia \u2208 {1, . . . , na} respectively. The soft-max converts the\ncombined potentials to probability distributions, which corresponds to a single mean-\ufb01eld iteration.\nSuch a linear combination of potentials provides extra \ufb02exibility for the model, since it can learn\nthe reliability of the potential from the data. For instance, we observe that question attention relies\nmore on the unary question potential and on pairwise question and answer potentials. In contrast, the\nimage attention relies more on the pairwise question and image potential.\nGiven the aforementioned probabilities PV , PQ, and PA, the attended image, question and answer\nvectors are denoted by aV \u2208 Rd, aQ \u2208 Rd and aA \u2208 Rd. The attended modalities are calculated\nas the weighted sum of the image features V = [v1, . . . , vnv ]T \u2208 Rnv\u00d7d, the question features\nQ = [q1, . . . , qnq ]T \u2208 Rnq\u00d7d, and the answer features A = [a1, . . . 
, ana ]T \u2208 Rna\u00d7d, i.e.,\n\na_V = \\sum_{i_v=1}^{n_v} P_V(i_v) v_{i_v}, \\quad a_Q = \\sum_{i_q=1}^{n_q} P_Q(i_q) q_{i_q}, \\quad \\text{and} \\quad a_A = \\sum_{i_a=1}^{n_a} P_A(i_a) a_{i_a}.\n\nFigure 3: Illustration of our k-th order attention. (a) unary attention module (e.g., visual). (b) pairwise attention\nmodule (e.g., visual and question) marginalized over its two data modalities. (c) ternary attention module (e.g.,\nvisual, question and answer) marginalized over its three data modalities.\n\nThe attended modalities, which effectively focus on the data relevant for the task, are passed to a\nclassi\ufb01er for decision making, e.g., the ones discussed in Sec. 3.3. In the following we describe\nthe attention mechanisms for unary, pairwise and ternary potentials in more detail.\n\n3.2.1 Unary potentials\nWe illustrate the unary attention schematically in Fig. 3 (a). The input to the unary attention module\nis a data representation, i.e., either the visual representation V , the question representation Q, or the\nanswer representation A. Using those representations, we obtain the \u2018unary potentials\u2019 \u03b8V , \u03b8Q and\n\u03b8A using a convolution operation with kernel size 1 \u00d7 1 over the data representation as an additional\nembedding step, followed by a non-linearity (tanh in our case), followed by another convolution\noperation with kernel size 1 \u00d7 1 to reduce embedding dimensionality. 
Since convolutions with kernel\nsize 1 \u00d7 1 are identical to matrix multiplies we formally obtain the unary potentials via\n\n\u03b8V (iv) = tanh(V Wv2 )Wv1 ,\n\n\u03b8Q(iq) = tanh(QWq2)Wq1 ,\n\n\u03b8A(ia) = tanh(AWa2)Wa1 ,\n\nwhere Wv1, Wq1, Wa1 \u2208 Rd\u00d71, and Wv2, Wq2, Wa2 \u2208 Rd\u00d7d are trainable parameters.\n\n3.2.2 Pairwise potentials\nBesides the mentioned mechanisms to generate unary potentials, we speci\ufb01cally aim at taking\nadvantage of pairwise attention modules, which are able to capture the correlation between the\nrepresentation of different modalities. Our approach is illustrated in Fig. 3 (b). We use a similarity\nmatrix between image and question modalities C2 = QWq(V Wv)\u22a4. Alternatively, the (i, j)-th entry\nis the correlation (inner-product) of the i-th column of QWq and the j-th column of V Wv:\n\n(C_2)_{i,j} = \\mathrm{corr}_2((QW_q)_{:,i}, (VW_v)_{:,j}), \\qquad \\mathrm{corr}_2(q, v) = \\sum_{l=1}^{d} q_l v_l,\n\nwhere Wq, Wv \u2208 Rd\u00d7d are trainable parameters. We consider (C2)i,j as a pairwise potential that\nrepresents the correlation of the i-th word in a question and the j-th patch in an image. Therefore, to\nretrieve the attention for a speci\ufb01c word, we convolve the matrix along the visual dimension using a\n1 \u00d7 1 dimensional kernel. Speci\ufb01cally,\n\n\\theta_{V,Q}(i_q) = \\tanh\\left( \\sum_{i_v=1}^{n_v} w_{i_v} (C_2)_{i_v,i_q} \\right), \\quad \\text{and} \\quad \\theta_{V,Q}(i_v) = \\tanh\\left( \\sum_{i_q=1}^{n_q} w_{i_q} (C_2)_{i_v,i_q} \\right).\n\nSimilarly, we obtain \u03b8A,V and \u03b8A,Q, which we omit due to space limitations. These potentials are\nused to compute the attention probabilities as de\ufb01ned in Eq. 
(1).\n\n3.2.3 Ternary potentials\n\nTo capture the dependencies between all three modalities, we consider their high-order correlations:\n\n(C_3)_{i,j,k} = \\mathrm{corr}_3((QW_q)_{:,i}, (VW_v)_{:,j}, (AW_a)_{:,k}), \\qquad \\mathrm{corr}_3(q, v, a) = \\sum_{l=1}^{d} q_l v_l a_l,\n\nwhere Wq, Wv, Wa \u2208 Rd\u00d7d are trainable parameters. Similarly to the pairwise potentials, we use\nthe C3 tensor to obtain correlated attention for each modality:\n\n\\theta_{V,Q,A}(i_q) = \\tanh\\left( \\sum_{i_v=1}^{n_v} \\sum_{i_a=1}^{n_a} w_{i_v,i_a} (C_3)_{i_q,i_v,i_a} \\right),\n\n\\theta_{V,Q,A}(i_v) = \\tanh\\left( \\sum_{i_q=1}^{n_q} \\sum_{i_a=1}^{n_a} w_{i_q,i_a} (C_3)_{i_q,i_v,i_a} \\right),\n\nand\n\n\\theta_{V,Q,A}(i_a) = \\tanh\\left( \\sum_{i_v=1}^{n_v} \\sum_{i_q=1}^{n_q} w_{i_v,i_q} (C_3)_{i_q,i_v,i_a} \\right).\n\nThese potentials are used to compute the attention probabilities as de\ufb01ned in Eq. (1).\n\nFigure 4: Illustration of correlation units used for decision making. (a) The MCB unit approximately samples from\nthe outer product space of two attention vectors. (b) The MCT unit approximately samples from the outer product space of\nthree attention vectors.\n\n3.3 Decision making\n\nThe decision making component receives as input the attended modalities and predicts the desired\noutput. 
Each attended modality is a vector that consists of the relevant data for making the decision.\nWhile the decision making component can consider the modalities independently, the nature of\nthe task usually requires taking into account correlations between the attended modalities. The\ncorrelation of a set of attended modalities is represented by the outer product of their respective\nvectors, e.g., the correlation of two attended modalities is represented by a matrix and the correlation\nof k attended modalities is represented by a k-dimensional tensor.\nIdeally, the attended modalities and their high-order correlation tensors are fed into a deep net which\nproduces the \ufb01nal decision. The number of parameters in such a network grows exponentially in\nthe number of modalities, as seen in Fig. 4. To overcome this computational bottleneck, we follow\nthe tensor sketch algorithm of Pham and Pagh [21], which was recently applied to attention models\nby Fukui et al. [7] via Multimodal Compact Bilinear Pooling (MCB) in the pairwise setting or\nMultimodal Compact Trilinear Pooling (MCT), an extension of MCB that pools data from three\nmodalities. The tensor sketch algorithm enables us to reduce the dimension of any rank-one tensor\nwhile referring to it implicitly. It relies on the count sketch technique [4] that randomly embeds an\nattended vector a \u2208 Rd1 into another Euclidean space \u03a8(a) \u2208 Rd2. The tensor sketch algorithm\nthen projects the rank-one tensor $\\otimes_{i=1}^{k} a_i$, which consists of attention correlations of order k, using\nthe convolution $\\Psi(\\otimes_{i=1}^{k} a_i) = \\ast_{i=1}^{k} \\Psi(a_i)$. 
For example, for two attention modalities, the correlation\nmatrix $a_1 a_2^\\top = a_1 \\otimes a_2$ is randomly projected to $R^{d_2}$ by the convolution $\\Psi(a_1 \\otimes a_2) = \\Psi(a_1) \\ast \\Psi(a_2)$.\nThe attended modalities $\\Psi(a_i)$ and their high-order correlations $\\Psi(\\otimes_{i=1}^{k} a_i)$ are fed into a fully\nconnected neural net to complete decision making.\n\n4 Visual question answering\n\nIn the following we evaluate our approach qualitatively and quantitatively. Before doing so we\ndescribe the data embeddings.\n\n4.1 Data embedding\nThe attention module requires the question representation Q \u2208 Rnq\u00d7d, the image representation\nV \u2208 Rnv\u00d7d, and the answer representation A \u2208 Rna\u00d7d, which are computed as follows.\nImage embedding: To embed the image, we use pre-trained convolutional deep nets (i.e., VGG-19,\nResNet). We extract the last layer before the fully connected units. Its dimension in the VGG net\ncase is 512 \u00d7 14 \u00d7 14 and the dimension in the ResNet case is 2048 \u00d7 14 \u00d7 14. Hence we obtain\n\nTable 1: Comparison of results on the Multiple-Choice VQA dataset for a variety of methods. 
We\nobserve the combination of all three unary, pairwise and ternary potentials to yield the best result.\n\nMethod | test-dev Y/N | test-dev Num | test-dev Other | test-dev All | test-std All\nNaive Bayes [15] | 79.7 | 40.1 | 57.9 | 64.9 | -\nHieCoAtt (ResNet) [15] | 79.7 | 40.0 | 59.8 | 65.8 | 66.1\nRAU (ResNet) [20] | 81.9 | 41.1 | 61.5 | 67.7 | 67.3\nMCB (ResNet) [7] | - | - | - | 68.6 | -\nDAN (VGG) [19] | - | - | - | 67.0 | -\nDAN (ResNet) [19] | - | - | - | 69.1 | 69.0\nMLB (ResNet) [13] | - | - | - | - | 68.9\n2-Modalities: Unary+Pairwise (ResNet) | 80.9 | 36.0 | 61.6 | 66.7 | -\n3-Modalities: Unary+Pairwise (ResNet) | 82.0 | 42.7 | 63.3 | 68.7 | 68.7\n3-Modalities: Unary + Pairwise + Ternary (VGG) | 81.2 | 42.7 | 62.3 | 67.9 | -\n3-Modalities: Unary + Pairwise + Ternary (ResNet) | 81.6 | 43.3 | 64.8 | 69.4 | 69.3\n\nnv = 196 and we embed the 196 VGG-19 or ResNet features into a d = 512 dimensional space\nto obtain the image representation V .\nQuestion embedding: To obtain a question representation, Q \u2208 Rnq\u00d7d, we \ufb01rst map a 1-hot encoding\nof each word in the question into a d-dimensional embedding space using a linear transformation\nplus corresponding bias terms. To obtain a richer representation that accounts for neighboring words,\nwe use a 1-dimensional temporal convolution with a \ufb01lter of size 3. While a combination of multiple\nsized \ufb01lters is suggested in the literature [15], we did not \ufb01nd any bene\ufb01t from using such an approach.\nSubsequently, to capture long-term dependencies, we used a Long Short Term Memory (LSTM)\nlayer. To reduce over\ufb01tting caused by the LSTM units, we used two LSTM layers with d/2 hidden\ndimension: one uses as input the word embedding representation, and the other one operates on\nthe 1D conv layer output. Their outputs are then concatenated to obtain Q. 
We also note that nq is a\nconstant hyperparameter, i.e., questions with more than nq words are truncated, while questions with fewer\nwords are zero-padded.\nAnswer embedding: To embed the possible answers we use a regular word embedding. The\nvocabulary is speci\ufb01ed by taking only the most frequent answers in the training set. Answers that\nare not included in the top answers are embedded to the same vector. Answers containing multiple\nwords are embedded as n-grams to a single vector. We assume there is no real dependency between\nthe answers, so there is no need for additional 1D conv or LSTM layers.\n\n4.2 Decision making\n\nFor our VQA example we investigate two techniques to combine vectors from three modalities. First,\nthe attended feature representations for each modality, i.e., aV , aA and aQ, are combined using an\nMCT unit. Each feature element is of the form ((aV )i \u00b7 (aQ)j \u00b7 (aA)k). While this \ufb01rst solution\nis most general, in some cases like VQA, our experiments show that it is better to use our second\napproach, a 2-layer MCB unit combination. This permits greater expressiveness as we employ\nfeatures of the form ((aV )i \u00b7 (aQ)j \u00b7 (aQ)k \u00b7 (aA)t) therefore also allowing image features to interact\nwith themselves. Note that in terms of parameters both approaches are identical as neither MCB nor\nMCT is a parametric module.\nBeyond MCB, we tested several other techniques that were suggested in the literature, including\nelement-wise multiplication, element-wise addition and concatenation [13, 15, 11], optionally followed\nby another hidden fully connected layer. The tensor sketching units consistently performed\nbest.\n\n4.3 Results\nExperimental setup: We use the RMSProp optimizer with a base learning rate of 4e\u22124 and \u03b1 = 0.99\nas well as \u03b5 = 1e\u22128. The batch size is set to 300. The dimension d of all hidden layers is set to 512.\nThe MCB unit feature dimension was set to d = 8192. 
We apply dropout with a rate of 0.5 after the\nword embeddings, the LSTM layer, and the \ufb01rst conv layer in the unary potential units. Additionally,\nfor the last fully connected layer we use a dropout rate of 0.3. We use the top 3000 most frequent\nanswers as possible outputs, which cover 91% of all answers in the train set. We implemented our\nmodels using the Torch framework1 [5].\n\nFigure 5: For each image (1st column) we show the attention generated for two different questions in columns\n2-4 and columns 5-7 respectively. The attentions are ordered as unary attention, pairwise attention and combined\nattention for both the image and the question. We observe the combined attention to signi\ufb01cantly depend on the\nquestion.\n\nFigure 6: The attention generated for two different questions over three modalities. We \ufb01nd the attention over\nmultiple choice answers to emphasize the unusual answers.\n\nAs a comparison for our attention mechanism we use the approach of Lu et al. [15] and the technique\nof Fukui et al. [7]. Their methods are based on a hierarchical attention mechanism and multi-modal\ncompact bilinear (MCB) pooling. In contrast to their approach we demonstrate a relatively simple\ntechnique based on a probabilistic intuition grounded on potentials. For comparative reasons only,\nthe visualized attention is based on two modalities: image and question.\nWe evaluate our attention modules on the VQA real-image test-dev and test-std datasets [2]. The\ndataset consists of 123,287 training images and 81,434 test set images. Each image comes with 3\nquestions along with 18 multiple choice answers.\nQuantitative evaluation: We \ufb01rst evaluate the overall performance of our model and compare it to a\nvariety of baselines. Tab. 1 shows the performance of our model and the baselines on the test-dev and\nthe test-standard datasets for multiple choice (MC) questions. 
To obtain multiple choice results we\nfollow common practice and use the highest scoring answer among the provided ones. Our approach\n(Fig. 2) for the multiple choice answering task achieved the reported result after 180,000 iterations,\nwhich requires about 40 hours of training on the \u2018train+val\u2019 dataset using a TitanX GPU. Despite\nthe fact that our model has only 40 million parameters, while techniques like [7] use over 70 million\nparameters, we observe state-of-the-art behavior. Additionally, we employ a 2-modality model having\na similar experimental setup. We observe a signi\ufb01cant improvement for our 3-modality model, which\nshows the importance of high-order attention models. Due to the fact that we use a lower embedding\ndimension of 512 (similar to [15]) compared to 2048 of existing 2-modality models [13, 7], the\n2-modality model achieves inferior performance. We believe that higher embedding dimension and\nproper tuning can improve our 2-modality starting point.\nAdditionally, we compared our proposed decision units: MCT, which is a generic extension of MCB\nfor 3 modalities, and 2-layer MCB, which has greater expressiveness (Sec. 4.2). Evaluating on the\n\u2018val\u2019 dataset while training on the \u2018train\u2019 part using the VGG features, the MCT setup yields 63.82%\nwhereas 2-layer MCB yields 64.57%. We also tested a different ordering of the input to the 2-modality\nMCB and found it to yield inferior results.\n\n1 https://github.com/idansc/HighOrderAtten\n\nFigure 7: Comparison of our attention results (2nd column) with attention provided by [15] (3rd column)\nand [7] (4th column). The fourth column provides the question and the answer of the different techniques.\nRow 1: Is she using a battery device? Ours: yes; [15]: no; [7]: no; GT: yes.\nRow 2: Is this boy or a girl? Ours: girl; [15]: boy; [7]: girl; GT: girl.\n\nFigure 8: Failure cases: Unary, pairwise and combined attention of our approach. 
Our system\nfocuses on the colorful umbrella as opposed to the table in the \ufb01rst row.\nRow 1: What color is the table? GT: brown; Ours: blue.\nRow 2: What color is the umbrella? GT: blue; Ours: blue.\n\nQualitative evaluation: Next, we evaluate our technique qualitatively. In Fig. 5 we illustrate the\nunary, pairwise and combined attention of our approach based on the two modality architecture,\nwithout the multiple choice as input. For each image we show multiple questions. We observe the\nunary attention usually attends to strong features of the image, while pairwise potentials emphasize\nareas that correlate with question words. Importantly, the combined result is dependent on the\nprovided question. For instance, in the \ufb01rst row we observe for the question \u201cHow many glasses are\non the table?,\u201d that the pairwise potential reacts to the image area depicting the glass. 
In contrast, for\nthe question \u201cIs anyone in the scene wearing blue?\u201d the pairwise potential reacts to the person with the\nblue shirt. In Fig. 6, we illustrate the attention for our 3-modality model. We find the attention over\nmultiple choice answers to favor the more unusual results.\nIn Fig. 7, we compare the final attention obtained from our approach to the results obtained with\ntechniques discussed in [15] and [7]. We observe that our approach attends to reasonable pixel and\nquestion locations. For example, considering the first row in Fig. 7, the question refers to a\nbattery-operated device. Compared to existing approaches, our technique attends to the laptop, which seems\nto help in choosing the correct answer. In the second row, the question asks \u201cIs this a boy or a\ngirl?\u201d. Both correct answers were produced when the attention focused on the hair.\nIn Fig. 8, we illustrate a failure case, where the attention of our approach is identical, despite two\ndifferent input questions. Our system focuses on the colorful umbrella as opposed to the object\nqueried for in the question.\n\n5 Conclusion\nIn this paper we investigated a series of techniques to design attention for multimodal input data.\nBeyond demonstrating state-of-the-art performance using relatively simple models, we hope that this\nwork inspires researchers to work in this direction.\n\nAcknowledgments: This research was supported in part by The Israel Science Foundation (grant\nNo. 948/15). This material is based upon work supported in part by the National Science Foundation\nunder Grant No. 1718221. 
We thank Nvidia for providing GPUs used in this research.\n\nReferences\n[1] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016.\n\n[2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.\n\n[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[4] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In ICALP. Springer, 2002.\n\n[5] Ronan Collobert, Koray Kavukcuoglu, and Cl\u00e9ment Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.\n\n[6] Abhishek Das, Harsh Agrawal, C Lawrence Zitnick, Devi Parikh, and Dhruv Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? arXiv preprint arXiv:1606.03556, 2016.\n\n[7] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.\n\n[8] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, pages 1693\u20131701, 2015.\n\n[9] Allan Jabri, Armand Joulin, and Laurens van der Maaten. Revisiting visual question answering baselines. In ECCV. Springer, 2016.\n\n[10] U. Jain\u2217, Z. Zhang\u2217, and A. G. Schwing. Creativity: Generating Diverse Questions using Variational Autoencoders. In CVPR, 2017. \u2217 equal contribution.\n\n[11] Vahid Kazemi and Ali Elqursh. Show, ask, attend, and answer: A strong baseline for visual question answering. 
arXiv preprint arXiv:1704.03162, 2017.\n\n[12] Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal residual learning for visual QA. In NIPS, 2016.\n\n[13] Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard Product for Low-rank Bilinear Pooling. In ICLR, 2017.\n\n[14] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. Structured attention networks. arXiv preprint arXiv:1702.00887, 2017.\n\n[15] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.\n\n[16] Lin Ma, Zhengdong Lu, and Hang Li. Learning to answer questions from image using convolutional neural network. arXiv preprint arXiv:1506.00333, 2015.\n\n[17] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, 2015.\n\n[18] Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. Generating natural questions about an image. arXiv preprint arXiv:1603.06059, 2016.\n\n[19] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471, 2016.\n\n[20] Hyeonwoo Noh and Bohyung Han. Training recurrent answering units with joint loss minimization for VQA. arXiv preprint arXiv:1606.03647, 2016.\n\n[21] Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. In SIGKDD. ACM, 2013.\n\n[22] Tim Rockt\u00e4schel, Edward Grefenstette, Karl Moritz Hermann, Tom\u00e1\u0161 Ko\u010disk\u00fd, and Phil Blunsom. Reasoning about entailment with neural attention. In ICLR, 2016.\n\n[23] Kevin J Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual question answering. 
In CVPR, 2016.\n\n[24] Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. arXiv preprint arXiv:1603.01417, 2016.\n\n[25] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, pages 451\u2013466. Springer, 2016.\n\n[26] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.\n\n[27] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, 2016.\n\n[28] Wenpeng Yin, Hinrich Sch\u00fctze, Bing Xiang, and Bowen Zhou. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. arXiv preprint arXiv:1512.05193, 2015.\n\n[29] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. In CVPR, 2016.", "award": [], "sourceid": 2048, "authors": [{"given_name": "Idan", "family_name": "Schwartz", "institution": "Technion"}, {"given_name": "Alexander", "family_name": "Schwing", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "Tamir", "family_name": "Hazan", "institution": "Technion"}]}