{"title": "Hierarchical Question-Image Co-Attention for Visual Question Answering", "book": "Advances in Neural Information Processing Systems", "page_first": 289, "page_last": 297, "abstract": "A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling \"where to look\" or visual attention, it is equally important to model \"what words to listen to\" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.", "full_text": "Hierarchical Question-Image Co-Attention\n\nfor Visual Question Answering\n\nJiasen Lu\u2217, Jianwei Yang\u2217, Dhruv Batra\u2217\u2020 , Devi Parikh\u2217\u2020\n\n\u2217 Virginia Tech, \u2020 Georgia Institute of Technology\n\n{jiasenlu, jw2yang, dbatra, parikh}@vt.edu\n\nAbstract\n\nA number of recent works have proposed attention models for Visual Question\nAnswering (VQA) that generate spatial maps highlighting image regions relevant to\nanswering the question. In this paper, we argue that in addition to modeling \u201cwhere\nto look\u201d or visual attention, it is equally important to model \u201cwhat words to listen\nto\u201d or question attention. We present a novel co-attention model for VQA that\njointly reasons about image and question attention. In addition, our model reasons\nabout the question (and consequently the image via the co-attention mechanism)\nin a hierarchical fashion via a novel 1-dimensional convolution neural networks\n(CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to\n60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the\nperformance is further improved to 62.1% for VQA and 65.4% for COCO-QA.1.\n\nIntroduction\n\n1\nVisual Question Answering (VQA) [2, 6, 14, 15, 27] has emerged as a prominent multi-discipline\nresearch problem in both academia and industry. To correctly answer visual questions about an\nimage, the machine needs to understand both the image and question. Recently, visual attention based\nmodels [18, 21\u201323] have been explored for VQA, where the attention mechanism typically produces\na spatial map highlighting image regions relevant to answering the question.\nSo far, all attention models for VQA in literature have focused on the problem of identifying \u201cwhere\nto look\u201d or visual attention. In this paper, we argue that the problem of identifying \u201cwhich words to\nlisten to\u201d or question attention is equally important. Consider the questions \u201chow many horses are\nin this image?\u201d and \u201chow many horses can you see in this image?\". They have the same meaning,\nessentially captured by the \ufb01rst three words. 
A machine that attends to the \ufb01rst three words would\narguably be more robust to linguistic variations irrelevant to the meaning and answer of the question.\nMotivated by this observation, in addition to reasoning about visual attention, we also address the\nproblem of question attention. Speci\ufb01cally, we present a novel multi-modal attention model for VQA\nwith the following two unique features:\nCo-Attention: We propose a novel mechanism that jointly reasons about visual attention and question\nattention, which we refer to as co-attention. Unlike previous works, which only focus on visual\nattention, our model has a natural symmetry between the image and question, in the sense that the\nimage representation is used to guide the question attention and the question representation(s) are\nused to guide image attention.\nQuestion Hierarchy: We build a hierarchical architecture that co-attends to the image and question\nat three levels: (a) word level, (b) phrase level and (c) question level. At the word level, we embed the\nwords to a vector space through an embedding matrix. At the phrase level, 1-dimensional convolution\nneural networks are used to capture the information contained in unigrams, bigrams and trigrams.\n\n1The source code can be downloaded from https://github.com/jiasenlu/HieCoAttenVQA\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: Flowchart of our proposed hierarchical co-attention model. Given a question, we extract its word\nlevel, phrase level and question level embeddings. At each level, we apply co-attention on both the image and\nquestion. The \ufb01nal answer prediction is based on all the co-attended image and question features.\n\nSpeci\ufb01cally, we convolve word representations with temporal \ufb01lters of varying support, and then\ncombine the various n-gram responses by pooling them into a single phrase level representation. At\nthe question level, we use recurrent neural networks to encode the entire question. For each level\nof the question representation in this hierarchy, we construct joint question and image co-attention\nmaps, which are then combined recursively to ultimately predict a distribution over the answers.\nOverall, the main contributions of our work are:\n\n\u2022 We propose a novel co-attention mechanism for VQA that jointly performs question-guided\nvisual attention and image-guided question attention. We explore this mechanism with two\nstrategies, parallel and alternating co-attention, which are described in Sec. 3.3;\n\u2022 We propose a hierarchical architecture to represent the question, and consequently construct\nimage-question co-attention maps at 3 different levels: word level, phrase level and question\nlevel. These co-attended features are then recursively combined from word level to question\nlevel for the \ufb01nal answer prediction;\n\u2022 At the phrase level, we propose a novel convolution-pooling strategy to adaptively select the\n\u2022 Finally, we evaluate our proposed model on two large datasets, VQA [2] and COCO-QA [15].\n\nphrase sizes whose representations are passed to the question level representation;\n\nWe also perform ablation studies to quantify the roles of different components in our model.\n\n2 Related Work\n\nMany recent works [2, 6, 11, 14, 15, 25] have proposed models for VQA. 
We compare and relate our proposed co-attention mechanism to other vision and language attention mechanisms in the literature.
Image attention. Instead of directly using the holistic entire-image embedding from the fully connected layer of a deep CNN (as in [2, 13-15]), a number of recent works have explored image attention models for VQA. Zhu et al. [26] add spatial attention to the standard LSTM model for pointing and grounded QA. Andreas et al. [1] propose a compositional scheme that consists of a language parser and a number of neural module networks; the language parser predicts which neural module network should be instantiated to answer the question. Some other works perform image attention multiple times in a stacked manner. In [23], the authors propose a stacked attention network, which runs multiple hops to infer the answer progressively. To capture fine-grained information from the question, Xu et al. [22] propose a multi-hop image attention scheme: it aligns words to image patches in the first hop, and then refers to the entire question for obtaining image attention maps in the second hop. In [18], the authors generate image regions with object proposals and then select the regions relevant to the question and answer choice. Xiong et al. [21] augment the dynamic memory network with a new input fusion module and retrieve an answer from an attention-based GRU. In concurrent work, [5] collected 'human attention maps' that are used to evaluate the attention maps generated by attention models for VQA. Note that all of these approaches model visual attention alone, and do not model question attention. Moreover, [22, 23] model attention sequentially, i.e., later attention is based on earlier attention, which is prone to error propagation. In contrast, we conduct co-attention at three levels independently.
Language Attention. Though no prior work has explored question attention in VQA, there are some related works in natural language processing (NLP) more broadly that have modeled language attention. In order to overcome the difficulty of translating long sentences, Bahdanau et al. [3] propose RNNSearch to learn an alignment over the input sentences.
In [8], the authors propose an attention model to circumvent the bottleneck caused by the fixed-width hidden vector in text reading and comprehension. A more fine-grained attention mechanism is proposed in [16]: the authors employ a word-by-word neural attention mechanism to reason about the entailment between two sentences. Also focused on modeling sentence pairs, the authors in [24] propose an attention-based bigram CNN for jointly performing attention between two CNN hierarchies; in their work, three attention schemes are proposed and evaluated. In [17], the authors propose a two-way attention mechanism to project the paired inputs into a common representation space.

3 Method

We begin by introducing the notation used in this paper. To ease understanding, our full model is described in parts. First, our hierarchical question representation is described in Sec. 3.2 and the proposed co-attention mechanism is then described in Sec. 3.3. Finally, Sec. 3.4 shows how to recursively combine the attended question and image features to output answers.

3.1 Notation

Given a question with T words, its representation is denoted by Q = {q_1, . . . , q_T}, where q_t is the feature vector for the t-th word. We denote q^w_t, q^p_t and q^s_t as the word embedding, phrase embedding and question embedding at position t, respectively. The image feature is denoted by V = {v_1, . . . , v_N}, where v_n is the feature vector at spatial location n. The co-attention features of image and question at each level in the hierarchy are denoted as v̂^r and q̂^r, where r ∈ {w, p, s}. The weights in different modules/layers are denoted by W, with appropriate sub/super-scripts as necessary. In the exposition that follows, we omit the bias term b to avoid notational clutter.

3.2 Question Hierarchy

Given the 1-hot encoding of the question words Q = {q_1, . . . , q_T}, we first embed the words into a vector space (learnt end-to-end) to get Q^w = {q^w_1, . . . , q^w_T}. To compute the phrase features, we apply 1-D convolution on the word embedding vectors. Concretely, at each word location, we compute the inner product of the word vectors with filters of three window sizes: unigram, bigram and trigram. For the t-th word, the convolution output with window size s is given by

q̂^p_{s,t} = tanh(W^s_c q^w_{t:t+s-1}),   s ∈ {1, 2, 3}    (1)

where W^s_c are the weight parameters. The word-level features Q^w are appropriately 0-padded before feeding into the bigram and trigram convolutions to maintain the length of the sequence after convolution. Given the convolution results, we then apply max-pooling across the different n-grams at each word location to obtain the phrase-level features

q^p_t = max(q̂^p_{1,t}, q̂^p_{2,t}, q̂^p_{3,t}),   t ∈ {1, 2, . . . , T}    (2)

Our pooling method differs from those used in previous works [9] in that it adaptively selects different gram features at each time step, while preserving the original sequence length and order. We use an LSTM to encode the sequence q^p_t after max-pooling. The corresponding question-level feature q^s_t is the LSTM hidden vector at time t.
Our hierarchical representation of the question is depicted in Fig. 3(a).
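To make Eqs. (1)-(2) concrete, the following is a minimal sketch of the phrase-level convolution-pooling step in PyTorch. It is an illustration only, not the released Torch implementation: the module name PhraseLevelEncoder and the default feature size d = 512 are our own choices (the 512 simply mirrors the hidden size reported in Sec. 4.2).

```python
import torch
import torch.nn as nn

class PhraseLevelEncoder(nn.Module):
    """Sketch of Eqs. (1)-(2): unigram/bigram/trigram 1-D convolutions over word
    embeddings, then a max over the three n-gram responses at each word position.
    Padding keeps the sequence length T unchanged."""
    def __init__(self, d=512):
        super().__init__()
        # One temporal filter bank per window size s in {1, 2, 3}.
        self.unigram = nn.Conv1d(d, d, kernel_size=1)
        self.bigram  = nn.Conv1d(d, d, kernel_size=2, padding=1)
        self.trigram = nn.Conv1d(d, d, kernel_size=3, padding=1)

    def forward(self, q_word):              # q_word: (batch, T, d) word embeddings Q^w
        x = q_word.transpose(1, 2)          # (batch, d, T) layout expected by Conv1d
        uni = torch.tanh(self.unigram(x))            # (batch, d, T)
        bi  = torch.tanh(self.bigram(x))[:, :, :-1]  # drop the extra step introduced by padding
        tri = torch.tanh(self.trigram(x))            # (batch, d, T)
        # Eq. (2): max across n-gram responses at each word location t.
        q_phrase = torch.stack([uni, bi, tri], dim=0).max(dim=0)[0]
        return q_phrase.transpose(1, 2)     # (batch, T, d) phrase features Q^p
```

The phrase features q^p_t produced this way would then be fed to an LSTM to obtain the question-level features q^s_t, as described above.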
3.3 Co-Attention

We propose two co-attention mechanisms that differ in the order in which image and question attention maps are generated. The first mechanism, which we call parallel co-attention, generates image and question attention simultaneously. The second mechanism, which we call alternating co-attention, sequentially alternates between generating image and question attentions. See Fig. 2. These co-attention mechanisms are executed at all three levels of the question hierarchy.

Figure 2: (a) Parallel co-attention mechanism; (b) Alternating co-attention mechanism.

Parallel Co-Attention. Parallel co-attention attends to the image and question simultaneously. Similar to [22], we connect the image and question by calculating the similarity between image and question features at all pairs of image-locations and question-locations. Specifically, given an image feature map V ∈ R^{d×N} and the question representation Q ∈ R^{d×T}, the affinity matrix C ∈ R^{T×N} is calculated by

C = tanh(Q^T W_b V)    (3)

where W_b ∈ R^{d×d} contains the weights. After computing this affinity matrix, one possible way of computing the image (or question) attention is to simply maximize out the affinity over the locations of the other modality, i.e. a^v[n] = max_i(C_{i,n}) and a^q[t] = max_j(C_{t,j}). Instead of choosing the max activation, we find that performance is improved if we consider this affinity matrix as a feature and learn to predict the image and question attention maps via the following

H^v = tanh(W_v V + (W_q Q) C),   H^q = tanh(W_q Q + (W_v V) C^T)
a^v = softmax(w^T_{hv} H^v),   a^q = softmax(w^T_{hq} H^q)    (4)

where W_v, W_q ∈ R^{k×d} and w_{hv}, w_{hq} ∈ R^k are the weight parameters, and a^v ∈ R^N and a^q ∈ R^T are the attention probabilities of each image region v_n and word q_t, respectively. The affinity matrix C transforms question attention space to image attention space (and vice versa for C^T). Based on the above attention weights, the image and question attention vectors are calculated as the weighted sums of the image features and question features, i.e.,

v̂ = Σ_{n=1}^{N} a^v_n v_n,   q̂ = Σ_{t=1}^{T} a^q_t q_t    (5)

The parallel co-attention is done at each level in the hierarchy, leading to v̂^r and q̂^r, where r ∈ {w, p, s}.
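As a companion to Eqs. (3)-(5), here is a small self-contained sketch of parallel co-attention under the same illustrative conventions: batch-first tensors, and names such as ParallelCoAttention and the defaults d = k = 512 are ours rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Sketch of Eqs. (3)-(5): affinity matrix C, attention maps a^v and a^q,
    and the attended vectors v_hat and q_hat."""
    def __init__(self, d=512, k=512):
        super().__init__()
        self.W_b = nn.Parameter(torch.randn(d, d) * 0.01)   # affinity weights, Eq. (3)
        self.W_v = nn.Linear(d, k, bias=False)
        self.W_q = nn.Linear(d, k, bias=False)
        self.w_hv = nn.Linear(k, 1, bias=False)
        self.w_hq = nn.Linear(k, 1, bias=False)

    def forward(self, V, Q):
        # V: (batch, N, d) image features; Q: (batch, T, d) question features.
        # Eq. (3): C = tanh(Q^T W_b V), here with batch-first layouts -> (batch, T, N).
        C = torch.tanh(Q @ self.W_b @ V.transpose(1, 2))
        # Eq. (4): treat C as a feature that maps one modality's attention space into the other's.
        H_v = torch.tanh(self.W_v(V) + C.transpose(1, 2) @ self.W_q(Q))   # (batch, N, k)
        H_q = torch.tanh(self.W_q(Q) + C @ self.W_v(V))                   # (batch, T, k)
        a_v = F.softmax(self.w_hv(H_v).squeeze(-1), dim=1)                # (batch, N)
        a_q = F.softmax(self.w_hq(H_q).squeeze(-1), dim=1)                # (batch, T)
        # Eq. (5): attended image and question vectors as weighted sums.
        v_hat = (a_v.unsqueeze(-1) * V).sum(dim=1)                        # (batch, d)
        q_hat = (a_q.unsqueeze(-1) * Q).sum(dim=1)                        # (batch, d)
        return v_hat, q_hat, a_v, a_q
```

Run once per level of the question hierarchy, this yields the co-attended pairs (v̂^w, q̂^w), (v̂^p, q̂^p) and (v̂^s, q̂^s) used later for answer prediction.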
Alternating Co-Attention. In this attention mechanism, we sequentially alternate between generating image and question attention. Briefly, this consists of three steps (marked in Fig. 2b): 1) summarize the question into a single vector q; 2) attend to the image based on the question summary q; 3) attend to the question based on the attended image feature.
Concretely, we define an attention operation x̂ = A(X; g), which takes the image (or question) features X and attention guidance g derived from the question (or image) as inputs, and outputs the attended image (or question) vector. The operation can be expressed in the following steps:

H = tanh(W_x X + (W_g g) 1^T)
a^x = softmax(w^T_{hx} H)
x̂ = Σ_i a^x_i x_i    (6)

where 1 is a vector with all elements equal to 1, W_x, W_g ∈ R^{k×d} and w_{hx} ∈ R^k are parameters, and a^x is the attention weight over the features X.
The alternating co-attention process is illustrated in Fig. 2(b). At the first step of alternating co-attention, X = Q and g is 0; at the second step, X = V, where V is the image feature map, and the guidance g is the intermediate attended question feature ŝ from the first step; finally, we use the attended image feature v̂ as the guidance to attend to the question again, i.e., X = Q and g = v̂. Similar to the parallel co-attention, the alternating co-attention is also done at each level of the hierarchy.

Figure 3: (a) Hierarchical question encoding (Sec. 3.2); (b) Encoding for predicting answers (Sec. 3.4). [Panels show the word embedding layer, the convolution layer with multiple filters of different widths, the max-over-filters pooling layer, the LSTM question encoding, and the softmax answer prediction.]

3.4 Encoding for Predicting Answers

Following [2], we treat VQA as a classification task. We predict the answer based on the co-attended image and question features from all three levels. We use a multi-layer perceptron (MLP) to recursively encode the attention features, as shown in Fig. 3(b):

h^w = tanh(W_w (q̂^w + v̂^w))
h^p = tanh(W_p [(q̂^p + v̂^p), h^w])
h^s = tanh(W_s [(q̂^s + v̂^s), h^p])
p = softmax(W_h h^s)    (7)

where W_w, W_p, W_s and W_h are the weight parameters, [·] is the concatenation operation on two vectors, and p is the probability of the final answer.
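Before turning to the experiments, a compact sketch of the two remaining pieces: the attention operation A(X; g) of Eq. (6) used by alternating co-attention, and the recursive answer encoding of Eq. (7). Again this is illustrative rather than the authors' code: AttentionOp, alternating_coattention and AnswerEncoder are our names, the dimensions d = k = 512 and the 1000-answer vocabulary simply follow Secs. 4.1-4.2, and a single attention module is reused across the three alternation steps for brevity (separate parameters per step are an equally valid reading of Eq. (6)).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionOp(nn.Module):
    """Sketch of Eq. (6): x_hat = A(X; g) attends over features X given guidance g."""
    def __init__(self, d=512, k=512):
        super().__init__()
        self.W_x = nn.Linear(d, k, bias=False)
        self.W_g = nn.Linear(d, k, bias=False)
        self.w_hx = nn.Linear(k, 1, bias=False)

    def forward(self, X, g):
        # X: (batch, L, d) image or question features; g: (batch, d) guidance (zeros in step 1).
        H = torch.tanh(self.W_x(X) + self.W_g(g).unsqueeze(1))   # broadcast of (W_g g) 1^T
        a = F.softmax(self.w_hx(H).squeeze(-1), dim=1)           # (batch, L) attention weights a^x
        return (a.unsqueeze(-1) * X).sum(dim=1)                  # (batch, d) attended vector x_hat

def alternating_coattention(attend, V, Q):
    """Three steps of alternating co-attention (Fig. 2b), reusing one AttentionOp for brevity."""
    zeros = torch.zeros(Q.size(0), Q.size(2), device=Q.device)
    s_hat = attend(Q, zeros)     # 1) summarize the question (g = 0)
    v_hat = attend(V, s_hat)     # 2) attend to the image given the question summary
    q_hat = attend(Q, v_hat)     # 3) attend to the question given the attended image feature
    return v_hat, q_hat

class AnswerEncoder(nn.Module):
    """Sketch of Eq. (7): recursively fuse word-, phrase- and question-level
    co-attended features and predict a distribution over candidate answers."""
    def __init__(self, d=512, k=512, num_answers=1000):
        super().__init__()
        self.W_w = nn.Linear(d, k)
        self.W_p = nn.Linear(d + k, k)
        self.W_s = nn.Linear(d + k, k)
        self.W_h = nn.Linear(k, num_answers)

    def forward(self, qw, vw, qp, vp, qs, vs):
        h_w = torch.tanh(self.W_w(qw + vw))
        h_p = torch.tanh(self.W_p(torch.cat([qp + vp, h_w], dim=1)))
        h_s = torch.tanh(self.W_s(torch.cat([qs + vs, h_p], dim=1)))
        return F.softmax(self.W_h(h_s), dim=1)   # p: probabilities over the answer vocabulary
```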
4 Experiment

4.1 Datasets

We evaluate the proposed model on two datasets, the VQA dataset [2] and the COCO-QA dataset [15].
VQA dataset [2] is the largest dataset for this problem, containing human-annotated questions and answers on the Microsoft COCO dataset [12]. The dataset contains 248,349 training questions, 121,512 validation questions, 244,302 testing questions, and a total of 6,141,630 question-answer pairs. There are three sub-categories according to answer type: yes/no, number, and other. Each question has 10 free-response answers. We use the top 1000 most frequent answers as the possible outputs, similar to [2]. This set of answers covers 86.54% of the train+val answers. For testing, we train our model on VQA train+val and report the test-dev and test-standard results from the VQA evaluation server. We use the evaluation protocol of [2] in the experiments.
COCO-QA dataset [15] is automatically generated from captions in the Microsoft COCO dataset [12]. There are 78,736 train questions and 38,948 test questions in the dataset. These questions are based on 8,000 and 4,000 images, respectively. There are four types of questions, object, number, color, and location, accounting for 70%, 7%, 17%, and 6% of the dataset, respectively. All answers in this dataset are a single word. As in [15], we report classification accuracy as well as Wu-Palmer similarity (WUPS) in Table 2.

Table 1: Results on the VQA dataset. "-" indicates the result is not available.

                         Open-Ended                       Multiple-Choice
                    test-dev            test-std     test-dev            test-std
Method              Y/N  Num  Other  All    All      Y/N  Num  Other  All    All
LSTM Q+I [2]        80.5 36.8 43.0   57.8   58.2     80.5 38.2 53.0   62.7   63.1
Region Sel. [18]    -    -    -      -      -        77.6 34.3 55.8   62.4   -
SMem [22]           80.9 37.3 43.1   58.0   58.2     -    -    -      -      -
SAN [23]            79.3 36.6 46.1   58.7   58.9     -    -    -      -      -
FDA [10]            81.1 36.2 45.8   59.2   59.5     81.5 39.0 54.7   64.0   64.2
DMN+ [21]           80.5 36.8 48.3   60.3   60.4     -    -    -      -      -
Ours_p+VGG          79.5 38.7 48.3   60.1   -        79.5 39.8 57.4   64.6   -
Ours_a+VGG          79.6 38.4 49.1   60.5   -        79.7 40.1 57.9   64.9   -
Ours_a+ResNet       79.7 38.7 51.7   61.8   62.1     79.7 40.0 59.8   65.8   66.1

4.2 Setup

We use Torch [4] to develop our model. We use the RMSprop optimizer with a base learning rate of 4e-4, momentum 0.99 and weight-decay 1e-8. We set the batch size to 300 and train for up to 256 epochs, with early stopping if the validation accuracy has not improved in the last 5 epochs. The size of the hidden layer W_s is set to 512 for COCO-QA and to 1024 for VQA, since the latter is a much larger dataset. All other word embeddings and hidden layers are vectors of size 512. We apply dropout with probability 0.5 on each layer. Following [23], we rescale the image to 448 × 448, and then take the activation from the last pooling layer of VGGNet [19] or ResNet [7] as its feature.

4.3 Results and Analysis

There are two test scenarios on VQA: open-ended and multiple-choice. The best-performing method deeper LSTM Q + norm I from [2] is used as our baseline. For the open-ended test scenario, we compare our method with the recently proposed SMem [22], SAN [23], FDA [10] and DMN+ [21]. For multiple-choice, we compare with Region Sel. [18] and FDA [10]. On COCO-QA, we compare with 2-VIS+BLSTM [15], IMG-CNN [13] and SAN [23]. We use Ours_p to refer to our parallel co-attention model and Ours_a to our alternating co-attention model.
Table 1 shows results on the VQA test sets for both open-ended and multiple-choice settings. We can see that our approach improves the state of the art from 60.4% (DMN+ [21]) to 62.1% (Ours_a+ResNet) on open-ended and from 64.2% (FDA [10]) to 66.1% (Ours_a+ResNet) on multiple-choice. Notably, for the question types Other and Num, we achieve 3.4% and 1.4% improvements on open-ended questions, and 4.0% and 1.1% on multiple-choice questions. As we can see, ResNet features outperform or match VGG features in all cases. Our improvements are not solely due to the use of a better CNN: FDA [10] also uses ResNet [7], but Ours_a+ResNet outperforms it by 1.8% on test-dev; SMem [22] uses GoogLeNet [20] and the rest all use VGGNet [19], and Ours_a+VGG outperforms the best of them (DMN+ [21]) by 0.2% on test-dev.
Table 2 shows results on the COCO-QA test set. Similar to the result on VQA, our model improves the state of the art from 61.6% (SAN(2,CNN) [23]) to 65.4% (Ours_a+ResNet). We observe that parallel co-attention performs better than alternating co-attention in this setup. Both attention mechanisms have their advantages and disadvantages: parallel co-attention is harder to train because of the dot product between image and question features, which compresses two vectors into a single value.
On the other hand, alternating co-attention may suffer from errors being accumulated at each round.

Table 2: Results on the COCO-QA dataset. "-" indicates the result is not available.

Method              Object  Number  Color  Location  Accuracy  WUPS0.9  WUPS0.0
2-VIS+BLSTM [15]    58.2    44.8    49.5   47.3      55.1      65.3     88.6
IMG-CNN [13]        -       -       -      -         58.4      68.5     89.7
SAN(2, CNN) [23]    64.5    48.6    57.9   54.0      61.6      71.6     90.9
Ours_p+VGG          65.6    49.6    61.5   56.8      63.3      73.0     91.3
Ours_a+VGG          65.6    48.9    59.8   56.7      62.9      72.8     91.3
Ours_a+ResNet       68.0    51.0    62.9   58.8      65.4      75.1     92.0

4.4 Ablation Study

In this section, we perform ablation studies to quantify the role of each component in our model. Specifically, we re-train our approach by ablating certain components:

• Image Attention alone, where, in a manner similar to previous works [23], we do not use any question attention. The goal of this comparison is to verify that our improvements are not the result of orthogonal contributions (say, better optimization or better CNN features).
• Question Attention alone, where no image attention is performed.
• W/O Conv, where no convolution and pooling is performed to represent phrases. Instead, we stack another word embedding layer on top of the word-level outputs.
• W/O W-Atten, where no word-level co-attention is performed. We replace the word-level attention with a uniform distribution. Phrase- and question-level co-attentions are still modeled.
• W/O P-Atten, where no phrase-level co-attention is performed, and the phrase-level attention is set to be uniform. Word- and question-level co-attentions are still modeled.
• W/O Q-Atten, where no question-level co-attention is performed. We replace the question-level attention with a uniform distribution. Word- and phrase-level co-attentions are still modeled.

Table 3: Ablation study on the VQA validation set using Ours_a+VGG.

Method           Y/N   Num   Other  All
LSTM Q+I         79.8  32.9  40.7   54.3
Image Atten      79.8  33.9  43.6   55.9
Question Atten   79.4  33.3  41.7   54.8
W/O Q-Atten      79.6  32.1  42.9   55.3
W/O P-Atten      79.5  34.1  45.4   56.7
W/O W-Atten      79.6  34.4  45.6   56.8
Full Model       79.6  35.0  45.7   57.0

Table 3 shows the comparison of our full approach w.r.t. these ablations on the VQA validation set (test sets are not recommended to be used for such experiments). The deeper LSTM Q + norm I baseline in [2] is also reported for comparison. We can see that image-attention-alone does improve performance over the holistic image feature (deeper LSTM Q + norm I), which is consistent with the findings of previous attention models for VQA [21, 23].
Comparing the full model w.r.t. the ablated versions without word-, phrase- and question-level attention reveals a clear trend: the attention mechanisms closest to the 'top' of the hierarchy (i.e. question) matter most, with a drop of 1.7% in accuracy if not modeled; followed by the intermediate level (i.e. phrase), with a drop of 0.3%; finally followed by the 'bottom' of the hierarchy (i.e. word), with a drop of 0.2% in accuracy. We hypothesize that this is because the question level is the 'closest' to the answer prediction layers in our model. Note that all levels are important, and our final model significantly outperforms not using any linguistic attention (1.1% difference between Full Model and Image Atten). The question-attention-alone model is better than LSTM Q+I, with an improvement of 0.5%, and worse than image attention alone, with a drop of 1.1%.
Ours_a further improves if we perform alternating co-attention for one more round, with an improvement of 0.3%.

4.5 Qualitative Results

We now visualize some co-attention maps generated by our method in Fig. 4. At the word level, our model attends mostly to the object regions in an image, e.g., heads, bird. At the phrase level, the image attention has different patterns across images. For the first two images, the attention transfers from objects to background regions. For the third image, the attention becomes more focused on the objects. We suspect that this is caused by the different question types. On the question side, our model is capable of localizing the key phrases in the question, thus essentially discovering the question types in the dataset. For example, our model pays attention to the phrases "what color" and "how many snowboarders". Our model successfully attends to the regions in images and phrases in the questions appropriate for answering the question, e.g., "color of the bird" and the bird region. Because our model performs co-attention at three levels, it often captures complementary information from each level, and then combines them to predict the answer.

Figure 4: Visualization of image and question co-attention maps on the COCO-QA dataset. From left to right: original image and question pairs, word-level co-attention maps, phrase-level co-attention maps and question-level co-attention maps. For visualization, both image and question attentions are scaled (from red: high to blue: low). Best viewed in color. [The three examples shown are: "what is the man holding a snowboard on top of a snow covered?" (A: mountain); "what is the color of the bird?" (A: white); "how many snowboarders in formation in the snow, four is sitting?" (A: 5).]

5 Conclusion

In this paper, we proposed a hierarchical co-attention model for visual question answering. Co-attention allows our model to attend to different regions of the image as well as different fragments of the question. We model the question hierarchically at three levels to capture information from different granularities. The ablation studies further demonstrate the roles of co-attention and question hierarchy in our final performance. Through visualizations, we can see that our model co-attends to interpretable regions of images and questions for predicting the answer.
Though our model was\nevaluated on visual question answering, it can be potentially applied to other tasks involving vision\nand language.\nAcknowledgements\nThis work was funded in part by NSF CAREER awards to DP and DB, an ONR YIP award to DP, ONR Grant\nN00014-14-1-0679 to DB, a Sloan Fellowship to DP, ARO YIP awards to DB and DP, a Allen Distinguished\nInvestigator award to DP from the Paul G. Allen Family Foundation, ICTAS Junior Faculty awards to DB and\nDP, Google Faculty Research Awards to DP and DB, AWS in Education Research grant to DB, and NVIDIA\nGPU donations to DB. The views and conclusions contained herein are those of the authors and should not be\ninterpreted as necessarily representing the of\ufb01cial policies or endorsements, either expressed or implied, of the\nU.S. Government or any sponsor.\n\nReferences\n[1] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Deep compositional question answering\n\nwith neural module networks. In CVPR, 2016.\n\n[2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick,\n\nand Devi Parikh. Vqa: Visual question answering. In ICCV, 2015.\n\n[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning\n\nto align and translate. In ICLR, 2015.\n\n8\n\n\f[4] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning.\n\nIn BigLearn, NIPS Workshop, 2011.\n\n[5] Abhishek Das, Harsh Agrawal, C Lawrence Zitnick, Devi Parikh, and Dhruv Batra. Human attention\nin visual question answering: Do humans and deep networks look at the same regions? arXiv preprint\narXiv:1606.03556, 2016.\n\n[6] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a\n\nmachine? dataset and methods for multilingual image question answering. In NIPS, 2015.\n\n[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\n\nIn CVPR, 2016.\n\n[8] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman,\n\nand Phil Blunsom. Teaching machines to read and comprehend. In NIPS, 2015.\n\n[9] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures for\n\nmatching natural language sentences. In NIPS, 2014.\n\n[10] Ilija Ilievski, Shuicheng Yan, and Jiashi Feng. A focused dynamic attention model for visual question\n\nanswering. arXiv:1604.01485, 2016.\n\n[11] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen,\nYannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision\nusing crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.\n\n[12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r,\n\nand C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.\n\n[13] Lin Ma, Zhengdong Lu, and Hang Li. Learning to answer questions from image using convolutional neural\n\nnetwork. In AAAI, 2016.\n\n[14] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to\n\nanswering questions about images. In ICCV, 2015.\n\n[15] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering.\n\nIn NIPS, 2015.\n\n[16] Tim Rockt\u00e4schel, Edward Grefenstette, Karl Moritz Hermann, Tom\u00e1\u0161 Ko\u02c7cisk`y, and Phil Blunsom. 
Reason-\n\ning about entailment with neural attention. In ICLR, 2016.\n\n[17] Cicero dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. Attentive pooling networks. arXiv preprint\n\narXiv:1602.03609, 2016.\n\n[18] Kevin J Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual question\n\nanswering. In CVPR, 2016.\n\n[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\n\ntion. CoRR, abs/1409.1556, 2014.\n\n[20] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru\n\nErhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.\n\n[21] Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual\n\nquestion answering. In ICML, 2016.\n\n[22] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for\n\nvisual question answering. arXiv preprint arXiv:1511.05234, 2015.\n\n[23] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image\n\nquestion answering. In CVPR, 2016.\n\n[24] Wenpeng Yin, Hinrich Sch\u00fctze, Bing Xiang, and Bowen Zhou. Abcnn: Attention-based convolutional\n\nneural network for modeling sentence pairs. In ACL, 2016.\n\n[25] Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Yin and yang: Balancing\n\nand answering binary visual questions. arXiv preprint arXiv:1511.05099, 2015.\n\n[26] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in\n\nimages. In CVPR, 2016.\n\n[27] C Lawrence Zitnick, Aishwarya Agrawal, Stanislaw Antol, Margaret Mitchell, Dhruv Batra, and Devi\n\nParikh. Measuring machine intelligence through visual question answering. AI Magazine, 37(1), 2016.\n\n9\n\n\f", "award": [], "sourceid": 188, "authors": [{"given_name": "Jiasen", "family_name": "Lu", "institution": "Virginia Tech"}, {"given_name": "Jianwei", "family_name": "Yang", "institution": "Virginia Tech"}, {"given_name": "Dhruv", "family_name": "Batra", "institution": "Virginia Tech"}, {"given_name": "Devi", "family_name": "Parikh", "institution": "Virginia Tech"}]}