{"title": "Bilinear Attention Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1564, "page_last": 1574, "abstract": "Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost to learn attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions for each modality neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions among two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit eight-attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-arts on both datasets.", "full_text": "Bilinear Attention Networks\n\nJin-Hwa Kim1\u21e4, Jaehyun Jun2, Byoung-Tak Zhang2,3\n\n1SK T-Brain, 2Seoul National University, 3Surromind Robotics\njnhwkim@sktbrain.com, {jhjun,btzhang}@bi.snu.ac.kr\n\nAbstract\n\nAttention networks in multimodal learning provide an ef\ufb01cient way to utilize given\nvisual information selectively. However, the computational cost to learn attention\ndistributions for every pair of multimodal input channels is prohibitively expensive.\nTo solve this problem, co-attention builds two separate attention distributions for\neach modality neglecting the interaction between multimodal inputs. 
In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions among two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit the eight attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on the visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-art results on both datasets.

1 Introduction

Machine learning for computer vision and natural language processing accelerates the advancement of artificial intelligence. Since vision and natural language are the major modalities of human interaction, understanding and reasoning over visual and linguistic information becomes a key challenge. For instance, visual question answering involves a vision-language cross-grounding problem. A machine is expected to answer questions like "who is wearing glasses?", "is the umbrella upside down?", or "how many children are in the bed?" by exploiting visually-grounded information.

For this reason, visual attention based models have succeeded in multimodal learning tasks by identifying selective regions in a spatial map of the image defined by the model. Textual attention can also be considered along with visual attention. The attention mechanism of co-attention networks [36, 18, 20, 39] concurrently infers a visual and a textual attention distribution, one for each modality. The co-attention networks selectively attend to question words in addition to a subset of image regions.
However, co-attention neglects the interaction between words and visual regions to avoid increasing computational complexity.

In this paper, we extend the idea of co-attention into bilinear attention, which considers every pair of multimodal channels, e.g., the pairs of question words and image regions. If the given question involves multiple visual concepts represented by multiple words, inference using visual attention distributions for each word can exploit relevant information better than inference using a single compressed attention distribution.

From this background, we propose bilinear attention networks (BAN) to use a bilinear attention distribution, on top of low-rank bilinear pooling [15]. Notice that BAN exploits bilinear interactions between two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual

*This work was done while at Seoul National University.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

[Figure 1 diagram: the question "What is the mustache made of?" is encoded by a GRU into all hidden states X, and object detection provides features Y. Step 1 computes bilinear attention maps from X^T U and Y^T V via softmax; Step 2 applies bilinear attention networks with residual learning, repeated per glimpse (att_1, att_2), followed by sum pooling and an MLP classifier.]

Figure 1: Overview of the two-glimpse BAN.
Two multi-channel inputs, φ object detection features and ρ-length GRU hidden vectors, are used to get bilinear attention maps and joint representations to be used by a classifier. For the definition of the BAN, see the text in Section 3.

networks (MRN) to efficiently utilize the multiple bilinear attention maps of the BAN, unlike the previous works [6, 15] where multiple attention maps are used by concatenating the attended features. The proposed residual learning method for BAN exploits residual summations instead of concatenation, which allows us to learn up to an eight-glimpse BAN both parameter-efficiently and effectively. For an overview of the two-glimpse BAN, please refer to Figure 1.

Our main contributions are:

• We propose the bilinear attention networks (BAN) to learn and use bilinear attention distributions, on top of the low-rank bilinear pooling technique.
• We propose a variant of multimodal residual networks (MRN) to efficiently utilize the multiple bilinear attention maps generated by our model. Unlike previous works, our method successfully utilizes up to 8 attention maps.
• Finally, we validate our proposed method on a large and highly-competitive dataset, VQA 2.0 [8]. Our model achieves a new state-of-the-art while maintaining a simple model structure. Moreover, we evaluate the visual grounding of the bilinear attention map on Flickr30k Entities [23], outperforming previous methods, along with a 25.37% improvement in inference speed by taking advantage of the processing of multi-channel inputs.

2 Low-rank bilinear pooling

We first review low-rank bilinear pooling and its application to attention networks [15], which uses a single-channel input (question vector) to combine the other multi-channel input (image features) into a single-channel intermediate representation (attended feature).

Low-rank bilinear model.
The previous works [35, 22] proposed a low-rank bilinear model to reduce the rank of the bilinear weight matrix W_i to give regularity. For this, W_i is replaced with the multiplication of two smaller matrices U_i V_i^T, where U_i ∈ R^{N×d} and V_i ∈ R^{M×d}. As a result, this replacement makes the rank of W_i at most d ≤ min(N, M). For the scalar output f_i (bias terms are omitted without loss of generality):

f_i = x^T W_i y ≈ x^T U_i V_i^T y = 1^T (U_i^T x ∘ V_i^T y)    (1)

where 1 ∈ R^d is a vector of ones and ∘ denotes the Hadamard product (element-wise multiplication).

Low-rank bilinear pooling. For a vector output f, a pooling matrix P is introduced:

f = P^T (U^T x ∘ V^T y)    (2)

where P ∈ R^{d×c}, U ∈ R^{N×d}, and V ∈ R^{M×d}. Introducing P allows U and V to remain two-dimensional tensors for a vector output f ∈ R^c, significantly reducing the number of parameters.

Unitary attention networks. Attention provides an efficient mechanism to reduce the input channels by selectively utilizing given information. Assuming a multi-channel input Y consisting of φ = |{y_i}| column vectors, we want to get a single channel ŷ from Y using the weights {α_i}:

ŷ = Σ_i α_i y_i    (3)

where α represents an attention distribution to selectively combine input channels. Using low-rank bilinear pooling, α is defined as the output of a softmax function:

α := softmax(P^T ((U^T x · 1^T) ∘ (V^T Y)))    (4)

where α ∈ R^{G×φ}, P ∈ R^{d×G}, U ∈ R^{N×d}, x ∈ R^N, 1 ∈ R^φ, V ∈ R^{M×d}, and Y ∈ R^{M×φ}. If G > 1 multiple glimpses (a.k.a. attention heads) are used [13, 6, 15], then ŷ = ‖_{g=1}^{G} Σ_i α_{g,i} y_i, the concatenation of attended outputs.
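As a concrete illustration, Equations 2-4 can be sketched in NumPy (a minimal sketch with made-up dimensions, not the authors' released code; biases and nonlinearities are omitted as in the text):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_bilinear_pooling(x, y, U, V, P):
    """Eq. 2: f = P^T (U^T x ∘ V^T y), where ∘ is the Hadamard product."""
    return P.T @ ((U.T @ x) * (V.T @ y))

def unitary_attention(x, Y, U, V, P):
    """Eqs. 3-4: attend over the phi columns of Y conditioned on x."""
    # logits[g, i] = p_g^T (U^T x ∘ V^T y_i); softmax over channels i
    logits = P.T @ ((U.T @ x)[:, None] * (V.T @ Y))  # G x phi
    alpha = softmax(logits, axis=-1)                 # each row sums to 1
    y_hat = Y @ alpha.T                              # M x G attended features
    return alpha, y_hat
```

With G > 1 glimpses, each column of `y_hat` is one attended output, and the concatenation of Equation 4's discussion corresponds to flattening `y_hat`.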
Finally, the two single-channel inputs x and ŷ can be used to get the joint representation using another low-rank bilinear pooling for a classifier.

3 Bilinear attention networks

We generalize the bilinear model for two multi-channel inputs, X ∈ R^{N×ρ} and Y ∈ R^{M×φ}, where ρ = |{x_i}| and φ = |{y_j}| are the numbers of the two inputs' channels, respectively. To reduce both input channels simultaneously, we introduce a bilinear attention map A ∈ R^{ρ×φ} as follows:

f'_k = (X^T U')_k^T A (Y^T V')_k    (5)

where U' ∈ R^{N×K}, V' ∈ R^{M×K}, (X^T U')_k ∈ R^ρ, (Y^T V')_k ∈ R^φ, and f'_k denotes the k-th element of the intermediate representation. The subscript k for the matrices indicates the index of the column. Notice that Equation 5 is a bilinear model for the two groups of input channels where A in the middle is a bilinear weight matrix. Interestingly, Equation 5 can be rewritten as:

f'_k = Σ_{i=1}^{ρ} Σ_{j=1}^{φ} A_{i,j} (X_i^T U'_k)(V'_k^T Y_j) = Σ_{i=1}^{ρ} Σ_{j=1}^{φ} A_{i,j} X_i^T (U'_k V'_k^T) Y_j    (6)

where X_i and Y_j denote the i-th channel (column) of input X and the j-th channel (column) of input Y, respectively, U'_k and V'_k denote the k-th columns of the U' and V' matrices, respectively, and A_{i,j} denotes the element in the i-th row and j-th column of A. Notice that, for each pair of channels, the rank-1 bilinear representation of two feature vectors is modeled by X_i^T (U'_k V'_k^T) Y_j in Equation 6 (eventually at most rank-K bilinear pooling for f' ∈ R^K). Then, the bilinear joint representation is f = P^T f', where f ∈ R^C and P ∈ R^{K×C}. For convenience, we define the bilinear attention networks as a function of two multi-channel inputs parameterized by a bilinear attention map as follows:

f = BAN(X, Y; A).    (7)

Bilinear attention map. Now, we want to get the attention map similarly to Equation 4. Using
Using\nHadamard product and matrix-matrix multiplication, the attention map A is de\ufb01ned as:\n\nf = BAN(X, Y;A).\n\n(8)\nwhere 1 2 R\u21e2, p 2 RK0, and remind that A2 R\u21e2\u21e5. The softmax function is applied element-\nwisely. Notice that each logit Ai,j of the softmax is the output of low-rank bilinear pooling as:\n(9)\n\nA := softmax\u21e3(1 \u00b7 pT ) XT UVT Y\u2318\nAi,j = pT(UT Xi) (VT Yj).\nAg := softmax\u21e3(1 \u00b7 pT\ng ) XT UVT Y\u2318\n\n(10)\nwhere the parameters of U and V are shared, but not for pg where g denotes the index of glimpses.\nResidual learning of attention. Inspired by multimodal residual networks (MRN) from Kim et al.\n[14], we propose a variant of MRN to integrate the joint representations from the multiple bilinear\nattention maps. The i + 1-th output is de\ufb01ned as:\n\nThe multiple bilinear attention maps can be extended as follows:\n\nfi+1 = BANi(fi, Y;Ai) \u00b7 1T + fi\n\n(11)\nwhere f0 = X (if N = K) and 1 2 R\u21e2. Here, the size of fi is the same with the size of X as\nsuccessive attention maps are processed. To get the logits for a classi\ufb01er, e.g., two-layer MLP, we\nsum over the channel dimension of the last output fG, where G is the number of glimpses.\nTime complexity. When we assume that the number of input channels is smaller than feature\nsizes, M N K \u21e2, the time complexity of the BAN is the same with the case of one\nmulti-channel input as O(KM) for single glimpse model. Since the BAN consists of matrix chain\nmultiplication and exploits the property of low-rank factorization in the low-rank bilinear pooling.\n\n3\n\n\f4 Related works\n\nMultimodal factorized bilinear pooling. Yu et al. [39] extends low-rank bilinear pooling [15]\nusing the rank > 1. They remove a projection matrix P, instead, d in Equation 2 is replaced with\nmuch smaller k while U and V are three-dimensional tensors. However, this generalization was\nnot effective for BAN, at least in our experimental setting. 
Please see BAN-1+MFB in Figure 2b, where the performance is not significantly improved over that of BAN-1. Furthermore, the peak GPU memory consumption is larger due to its model structure, which hinders the use of a multi-glimpse BAN.

Co-attention networks. Xu and Saenko [36] proposed the spatial memory network model, estimating the correlation between every image patch and every token in a sentence. The estimated correlation C is defined as (UX)^T Y in our notation. Unlike our method, they get an attention distribution α = softmax(m) ∈ R^ρ with m_i = max_j C_{i,j}, i.e., the logits to the softmax are the maximum values in each row vector of C. The attention distribution for the other input can be calculated similarly. There are variants of co-attention networks [18, 20]; in particular, Lu et al. [18] sequentially get two attention distributions, each conditioned on the other modality. Recently, Yu et al. [39] reduced the co-attention method to two steps: self-attention for a question embedding and question-conditioned attention for a visual embedding. However, these co-attention approaches use separate attention distributions for each modality, neglecting the interaction between the modalities, which is what we consider and model.

5 Experiments

5.1 Datasets

Visual Question Answering (VQA). We evaluate on the VQA 2.0 dataset [1, 8], which improves on the previous version by reducing the answer bias in the dataset to emphasize visual understanding. This improvement pushes the model to have a more effective joint representation of question and image, which fits the motivation of our bilinear attention approach. The VQA evaluation metric considers inter-human variability, defined as Accuracy(ans) = min(#humans that said ans / 3, 1). Note that reported accuracies are averaged over all ten-choose-nine sets of ground truths. The test set is split into test-dev, test-standard, test-challenge, and test-reserve.
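The VQA metric just described, including the averaging over all ten-choose-nine subsets of ground-truth answers, can be sketched as:

```python
from itertools import combinations

def vqa_accuracy(answer, human_answers):
    """Accuracy(ans) = min(#humans that said ans / 3, 1), averaged over
    every 9-answer subset of the ten ground-truth answers."""
    assert len(human_answers) == 10
    scores = [min(subset.count(answer) / 3.0, 1.0)
              for subset in combinations(human_answers, 9)]
    return sum(scores) / len(scores)
```

For example, an answer given by exactly three of ten annotators scores 0.9 rather than 1.0, because some of the nine-answer subsets drop one of the three agreeing annotators.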
The annotations for the test set are unavailable except through the remote evaluation servers.

Flickr30k Entities. For the evaluation of visual grounding by the bilinear attention maps, we use Flickr30k Entities [23], consisting of 31,783 images [38] and 244,035 annotations in which multiple entities (phrases) in a sentence for an image are mapped to boxes on the image to indicate the correspondences between them. The task is to localize a corresponding box for each entity. In this way, visual grounding of textual information is quantitatively measured. Following the evaluation metric [23], a prediction for a given entity is correct if the predicted box has an intersection over union (IoU) of at least 0.5 with one of the ground-truth boxes. This metric is called Recall@1. If K predictions are permitted to find at least one correct box, it is called Recall@K. We report Recall@1, 5, and 10 to compare with the state-of-the-art (R@K in Table 4). The upper bound of performance depends on the performance of the object detector, since the detector proposes the candidate boxes for prediction.

5.2 Preprocessing

Question embedding. For VQA, we get a question embedding X^T ∈ R^{14×N} using GloVe word embeddings [21] and the outputs of a Gated Recurrent Unit (GRU) [5] for every time-step up to the first 14 tokens, following the previous work [29]. Questions shorter than 14 words are end-padded with zero vectors. For Flickr30k Entities, we use the full length of each sentence (82 words at maximum) to get all entities. We mark the token positions at the end of each annotated phrase, then select the corresponding subset of the GRU output channels, so that the number of channels equals the number of entities in the sentence. The word embeddings and GRU are fine-tuned during training.

Image features. We use the image features extracted from bottom-up attention [2].
These features are the output of a Faster R-CNN [25], pre-trained using Visual Genome [17]. We set a threshold for object detection to get φ = 10 to 100 objects per image. The features are represented as Y^T ∈ R^{φ×2,048} and are fixed during training. To deal with variable-channel inputs, we mask the padding logits with minus infinity to get zero probability from the softmax, avoiding underflow.

5.3 Nonlinearity and classifier

Nonlinearity. We use ReLU [19] to give nonlinearity to BAN:

f'_k = σ(X^T U')_k^T · A · σ(Y^T V')_k    (12)

where σ denotes ReLU(x) := max(x, 0). For the attention maps, the logits are defined as:

A := ((1 · p^T) ∘ σ(X^T U)) · σ(V^T Y).    (13)

Classifier. For VQA, we use a two-layer multi-layer perceptron as a classifier on the final joint representation f_G. The activation function is ReLU. The number of outputs is determined by the answers that occur at least nine times among unique questions in the dataset, which gives 3,129 outputs. Binary cross entropy is used as the loss function, following the previous work [29]. For Flickr30k Entities, we take the output of the bilinear attention map, and binary cross entropy is applied to this output.

5.4 Hyperparameters and regularization

Hyperparameters. The sizes of the image features and question embeddings are M = 2,048 and N = 1,024, respectively. The size of the joint representation C is the same as the rank K in low-rank bilinear pooling, C = K = 1,024, but K' = K × 3 is used in the bilinear attention maps to increase the representational capacity for residual learning of attention. Every linear mapping is regularized by weight normalization [27] and dropout [28] (p = .2, except .5 for the classifier). The Adamax optimizer [16], a variant of Adam based on the infinity norm, is used.
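The padding trick described above, masking logits for absent objects with minus infinity so the softmax assigns them zero probability, can be sketched as (an illustrative helper, not the authors' code):

```python
import numpy as np

def masked_softmax(logits, num_objects):
    """Softmax over object logits, giving zero probability to padded slots."""
    positions = np.arange(logits.shape[-1])
    masked = np.where(positions < num_objects, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # stability
    e = np.exp(masked)                                    # exp(-inf) == 0
    return e / e.sum(axis=-1, keepdims=True)
```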
The learning rate is min(i·10^-3, 4·10^-3), where i is the number of epochs counted from 1; after 10 epochs, the learning rate is decayed by 1/4 every 2 epochs up to 13 epochs (i.e., 10^-3 for the 11th and 2.5·10^-4 for the 13th epoch). We clip the 2-norm of the vectorized gradients to .25. The batch size is 512.

Regularization. For the test split of VQA, both the train and validation splits are used for training. We augment with a subset of the Visual Genome [17] dataset, following the procedure of the previous works [29]. Accordingly, we adjust the model capacity by increasing all of N, C, and K to 1,280, and G = 8 glimpses are used. For Flickr30k Entities, we use the same test split as the previous methods [23], without additional hyperparameter tuning from the VQA experiments.

6 VQA results and discussions

6.1 Quantitative results

Comparison with the state-of-the-art. The first row in Table 1 shows the 2017 VQA Challenge winner architecture [2, 29]. BAN significantly outperforms this baseline and successfully utilizes up to eight bilinear attention maps to improve its performance, taking advantage of residual learning of attention. As shown in Table 3, BAN outperforms the latest model [39], which uses the same bottom-up attention features [2], by a substantial margin. BAN-Glove uses the concatenation of 300-dimensional GloVe word embeddings and the semantically-close mixture of these embeddings (see Appendix A.1). Notice that similar approaches can be found in the competitive models [6, 39] in Table 3 with a different initialization strategy for the same 600-dimensional word embedding. BAN-Glove-Counter uses both the previous 600-dimensional word embeddings and the counting module [41], which exploits spatial information of the detected object boxes from the feature extractor [2]. The learned representation c ∈ R^{φ+1} for the counting mechanism is linearly projected and added to the joint representation after applying ReLU (see Equation 15 in Appendix A.2).
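The warm-up and decay schedule just described can be written out as (a sketch of the stated schedule; the function name is ours):

```python
def learning_rate(epoch):
    """min(epoch * 1e-3, 4e-3) with epochs counted from 1; after epoch 10,
    decay by 1/4 every 2 epochs up to epoch 13."""
    if epoch <= 10:
        return min(epoch * 1e-3, 4e-3)
    decays = (epoch - 9) // 2  # 1 for epochs 11-12, 2 for epoch 13
    return 4e-3 * (0.25 ** decays)
```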
In Table 5 (Appendix), we compare with the entries in the leaderboards of both VQA Challenge 2017 and 2018, achieving 1st place at the time of submission (our entry is not shown in the leaderboard since challenge entries are not visible).

Comparison with other attention methods. Unitary attention has a similar architecture to Kim et al. [15], where a question embedding vector is used to calculate the attentional weights for the multiple image features of an image. Co-attention has the same mechanism as Yu et al. [39], similar to Lu et al. [18] and Xu and Saenko [36], where multiple question embeddings are combined into a single embedding vector using a self-attention mechanism, and then unitary visual attention is applied. Table 2 confirms that bilinear attention is significantly better than the other attention methods. Co-attention is slightly better than simple unitary attention. In Figure 2a, co-attention suffers from overfitting more severely (green) than the other methods, while bilinear attention (blue) is more regularized compared with the others.

Table 1: Validation scores on the VQA 2.0 dataset for the number of glimpses of the BAN. The standard deviations are reported after ±, computed over three random initializations.

Model | VQA Score
Bottom-Up [29] | 63.37 ±0.21
BAN-1 | 65.36 ±0.14
BAN-2 | 65.61 ±0.10
BAN-4 | 65.81 ±0.09
BAN-8 | 66.00 ±0.11
BAN-12 | 66.04 ±0.08

Table 2: Validation scores on the VQA 2.0 dataset for attention and integration mechanisms. nParams indicates the number of parameters. Note that the hidden sizes of unitary attention and co-attention are 1,280, while 1,024 for the BAN.

Model | nParams | VQA Score
Unitary attention | 31.9M | 64.59 ±0.04
Co-attention | 32.5M | 64.79 ±0.06
Bilinear attention | 32.2M | 65.36 ±0.14
BAN-4 (residual) | 44.8M | 65.81 ±0.09
BAN-4 (sum) | 44.8M | 64.78 ±0.08
BAN-4 (concat) | 51.1M | 64.71 ±0.21

Figure 2: (a) learning curves. Bilinear attention (bi-att) is more robust to overfitting than unitary attention (uni-att) and co-attention (co-att). (b) validation scores for the number of parameters. The error bar indicates the standard deviation among three randomly initialized models, although it is too small to be noticed for over-15M parameters. (c) ablation study for the first N glimpses (x-axis) used in the evaluation. (d) the information entropy (y-axis) for each attention map in the four-glimpse BAN. The entropy of the multiple attention maps converges to certain levels.

In Figure 2b, BAN is the most parameter-efficient among the attention methods. Notice that four-glimpse BAN utilizes its parameters more parsimoniously than one-glimpse BAN does.

6.2 Residual learning of attention

Comparison with other approaches. In the second section of Table 2, residual learning of attention significantly outperforms the other integration methods, sum, i.e., f_G = Σ_i BAN_i(X, Y; A_i), and concatenation (concat), i.e., f_G = ‖_i BAN_i(X, Y; A_i), whereas the difference between sum and concat is not significant. Notice that the number of parameters of concat is larger than the others since the input size of the classifier is increased.

Ablation study. An interesting property of residual learning is robustness toward arbitrary ablations [31]. To see the relative contributions, we observe the learning curve of validation scores when incremental ablation is performed. First, we train {1,2,4,8,12}-glimpse models using the training split. Then, we evaluate each model on the validation split using only the first N attention maps. Hence, the intermediate representation f_N is directly fed into the classifier instead of f_G.
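This incremental ablation, feeding f_N instead of f_G to the classifier, amounts to truncating the residual sum of Equation 11 (an illustrative sketch with hypothetical per-glimpse update terms):

```python
import numpy as np

def truncated_representation(f0, glimpse_updates, n_used):
    """Accumulate only the first n_used residual glimpse updates:
    f_{i+1} = f_i + update_i, returning f_{n_used}."""
    f = np.asarray(f0, dtype=float)
    for update in glimpse_updates[:n_used]:
        f = f + update
    return f
```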
As shown in Figure 2c, the accuracy gain of the first glimpse is the highest, and the gain decreases smoothly as the number of used glimpses increases.

Entropy of attention. We analyze the information entropy of the attention distributions in a four-glimpse BAN. As shown in Figure 2d, the mean entropy of each attention map on the validation split converges to a different level. This result is repeatably observed for the other numbers of glimpses. Our speculation is that the multiple attention maps do not contribute equally, as in voting by committees, but instead perform residual learning via multi-step attention. We argue that this is a novel observation where residual learning [9] is used for stacked attention networks.

Figure 3: Visualization of the bilinear attention maps for a two-glimpse BAN. The left and right groups indicate the first and second bilinear attention maps (right in each group, log-scaled) and the visualized image (left in each group). The most salient six boxes (numbered 1-6 in the images and on the x-axis of the grids) in the first attention map, determined by marginalization, are visualized on both images for comparison. The model gives the correct answer, brown.

(a) A girl in a yellow tennis suit, green visor and white tennis shoes holding a tennis racket in a position where she is going to hit the tennis ball. (b) A man in a denim shirt and pants is smoking a cigarette while playing a cello for money. (c) A male conductor wearing all black leading an orchestra and choir on a brown stage playing and singing a musical number.

Figure 4: Visualization examples from the test split of Flickr30k Entities are shown. Solid-lined boxes indicate predicted phrase localizations and dashed-lined boxes indicate the ground truth. If there are multiple ground-truth boxes, the closest box is shown for inspection.
Each color of a phrase is matched with the corresponding color of the predicted and ground-truth boxes. Best viewed in color.

6.3 Qualitative analysis

The visualization for a two-glimpse BAN is shown in Figure 3. The question is "what color are the pants of the guy skateboarding". The question's content words, what, pants, guy, and skateboarding, and the skateboarder's pants in the image are attended. Notice that box 2 (orange) captures the sitting man's pants at the bottom.

7 Flickr30k Entities results and discussions

To examine the capability of the bilinear attention map to capture vision-language interactions, we conduct experiments on Flickr30k Entities [23]. Our experiments show that BAN outperforms the previous state-of-the-art on the phrase localization task by a large margin of 4.48% at a high speed of inference.

Performance. In Table 4, we compare with previous approaches. Using the bilinear attention map to predict the boxes for the phrase entities in a sentence, we achieve a new state-of-the-art of 69.69% Recall@1. This result is remarkable considering that BAN does not use any additional features such as box size, color, segmentation, or pose estimation [23, 37]. Note that both Query-Adaptive RCNN [10] and our off-the-shelf object detector [2] are based on Faster RCNN [25] and pre-trained on Visual Genome [17]. Compared to Query-Adaptive RCNN, the parameters of our object detector are fixed and only used to extract 10-100 visual features and the corresponding box proposals.

Type. In Table 6 (included in the Appendix), we report the results for each type of Flickr30k Entities. Notice that clothing and body parts are significantly improved, to 74.95% and 47.23%, respectively.

Speed. Faster inference is achieved by taking advantage of the multi-channel inputs in our BAN.
Unlike previous methods, BAN is able to infer multiple entities in a sentence, which can be prepared as a multi-channel input. Therefore, the number of forward passes needed for inference is significantly decreased. In our experiment, BAN takes 0.67 ms/entity, whereas the setting with a single entity per example takes 0.84 ms/entity, a 25.37% improvement. We emphasize that this property is novel in our model, which considers every interaction among vision-language multi-channel inputs.

Table 3: Test-dev and test-standard scores of single models on the VQA 2.0 dataset compared with the state-of-the-art, trained on the training and validation splits, with Visual Genome used for feature extraction or data augmentation. † This model can be found in https://github.com/yuzcccc/vqa-mfb, which is not published in the paper.

Model | Overall | Yes/no | Number | Other | Test-std
Bottom-Up [2, 29] | 65.32 | 81.82 | 44.21 | 56.05 | 65.67
MFH [39] | 66.12 | - | - | - | -
Counter [41] | 68.09 | 83.14 | 51.62 | 58.97 | 68.41
MFH+Bottom-Up [39]† | 68.76 | 84.27 | 49.56 | 59.89 | -
BAN (ours) | 69.52 | 85.31 | 50.93 | 60.26 | -
BAN+Glove (ours) | 69.66 | 85.46 | 50.66 | 60.50 | -
BAN+Glove+Counter (ours) | 70.04 | 85.42 | 54.04 | 60.52 | 70.35

Table 4: Test split results for Flickr30k Entities. We report the average performance of our three randomly-initialized models (the standard deviation of R@1 is 0.17). The upper bound of performance asserted by the object detector is also shown. † box size and color information are used as additional features. ‡ semantic segmentation, object detection, and pose estimation are used as additional features. Notice that the detectors of Hinami and Satoh [10] and ours [2] are based on Faster RCNN [25], pre-trained using the Visual Genome dataset [17].

Model | Detector | R@1 | R@5 | R@10 | Upper Bound
Zhang et al. [40] | MCG [3] | 28.5 | 52.7 | 61.3 | -
Hu et al. [11] | Edge Boxes [42] | 27.8 | - | 62.9 | 76.9
Rohrbach et al. [26] | Fast RCNN [7] | 42.43 | - | - | 77.90
Wang et al. [33] | Fast RCNN [7] | 42.08 | - | - | 76.91
Wang et al. [32] | Fast RCNN [7] | 43.89 | 64.46 | 68.66 | 76.91
Rohrbach et al. [26] | Fast RCNN [7] | 48.38 | - | - | 77.90
Fukui et al. [6] | Fast RCNN [7] | 48.69 | - | - | -
Plummer et al. [23] | Fast RCNN [7]† | 50.89 | 71.09 | 75.73 | 85.12
Yeh et al. [37] | YOLOv2 [24]‡ | 53.97 | - | - | -
Hinami and Satoh [10] | Query-Adaptive RCNN [10] | 65.21 | - | - | -
BAN (ours) | Bottom-Up [2] | 69.69 | 84.22 | 86.35 | 87.45

Visualization. Figure 4 shows examples from the test split of Flickr30k Entities. The entities with distinctive visual properties, i.e., a yellow tennis suit and white tennis shoes in Figure 4a, and a denim shirt in Figure 4b, are correct. However, a relatively small object (e.g., a cigarette in Figure 4b) and an entity that requires semantic inference (e.g., a male conductor in Figure 4c) are incorrect.

8 Conclusions

BAN gracefully extends unitary attention networks by exploiting bilinear attention maps, where the joint representations of multimodal multi-channel inputs are extracted using low-rank bilinear pooling. Although BAN considers every pair of multimodal input channels, the computational cost remains of the same magnitude, since BAN consists of matrix chain multiplications that permit efficient computation. The proposed residual learning of attention efficiently uses up to eight bilinear attention maps, keeping the size of the intermediate features constant.
We believe BAN provides a new opportunity to learn richer joint representations for multimodal multi-channel inputs, which appear in many real-world problems.

Acknowledgments
We would like to thank Kyoung-Woon On, Bohyung Han, Hyeonwoo Noh, Sungeun Hong, Jaesun Park, and Yongseok Choi for helpful comments and discussion. Jin-Hwa Kim was supported by a 2017 Google Ph.D. Fellowship in Machine Learning and a Ph.D. Completion Scholarship from the College of Humanities, Seoul National University. This work was funded by the Korea government (IITP-2017-0-01772-VTT, IITP-R0126-16-1072-SW.StarLab, 2018-0-00622-RMI, KEIT-10060086-RISF). Part of the computing resources used in this study was generously shared by Standigm Inc.

References
[1] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C Lawrence Zitnick, Devi Parikh, and Dhruv Batra. VQA: Visual Question Answering. International Journal of Computer Vision, 123(1):4–31, 2017.

[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. arXiv preprint arXiv:1707.07998, 2017.

[3] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In IEEE Conference on Computer Vision and Pattern Recognition, pages 328–335, 2014.

[4] Hedi Ben-younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. MUTAN: Multimodal Tucker Fusion for Visual Question Answering. In IEEE International Conference on Computer Vision, pages 2612–2620, 2017.

[5] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.
In 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.

[6] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. arXiv preprint arXiv:1606.01847, 2016.

[7] Ross Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

[8] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[10] Ryota Hinami and Shin'ichi Satoh. Query-Adaptive R-CNN for Open-Vocabulary Object Detection and Retrieval. arXiv preprint arXiv:1711.09509, 2017.

[11] Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. Natural language object retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4555–4564, 2016.

[12] Ilija Ilievski and Jiashi Feng. A Simple Loss Function for Improving the Convergence and Accuracy of Visual Question Answering Models. 2017.

[13] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial Transformer Networks. In Advances in Neural Information Processing Systems 28, pages 2008–2016, 2015.

[14] Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal Residual Learning for Visual QA. In Advances in Neural Information Processing Systems 29, pages 361–369, 2016.

[15] Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard Product for Low-rank Bilinear Pooling.
In The 5th International Conference on Learning Representations, 2017.

[16] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.

[17] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.

[18] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical Question-Image Co-Attention for Visual Question Answering. arXiv preprint arXiv:1606.00061, 2016.

[19] Vinod Nair and Geoffrey E Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In 27th International Conference on Machine Learning, pages 807–814, 2010.

[20] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual Attention Networks for Multimodal Reasoning and Matching. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[21] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014.

[22] Hamed Pirsiavash, Deva Ramanan, and Charless C. Fowlkes. Bilinear classifiers for visual recognition. In Advances in Neural Information Processing Systems 22, pages 1482–1490, 2009.

[23] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. International Journal of Computer Vision, 123:74–93, 2017.

[24] Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster, Stronger. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[25] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 2017.

[26] Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834, 2016.

[27] Tim Salimans and Diederik P. Kingma. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. arXiv preprint arXiv:1602.07868, 2016.

[28] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[29] Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge. arXiv preprint arXiv:1708.02711, 2017.

[30] Alexander Trott, Caiming Xiong, and Richard Socher. Interpretable Counting for Visual Question Answering. In International Conference on Learning Representations, 2018.

[31] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual Networks are Exponential Ensembles of Relatively Shallow Networks. In Advances in Neural Information Processing Systems 29, pages 550–558, 2016.

[32] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning Deep Structure-Preserving Image-Text Embeddings. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5005–5013, 2016.

[33] Mingzhe Wang, Mahmoud Azab, Noriyuki Kojima, Rada Mihalcea, and Jia Deng. Structured Matching for Phrase Localization. In European Conference on Computer Vision, volume 9908, pages 696–711, 2016.

[34] Peng Wang, Qi Wu, Chunhua Shen, and Anton van den Hengel. The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 1173–1182, 2017.

[35] Lior Wolf, Hueihan Jhuang, and Tamir Hazan. Modeling appearances with low-rank SVM. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[36] Huijuan Xu and Kate Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In European Conference on Computer Vision, 2016.

[37] Raymond A Yeh, Jinjun Xiong, Wen-Mei W Hwu, Minh N Do, and Alexander G Schwing. Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts. In Advances in Neural Information Processing Systems 30, 2017.

[38] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

[39] Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 2018.

[40] Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-Down Neural Attention by Excitation Backprop. In European Conference on Computer Vision, volume 9908, pages 543–559, 2016.

[41] Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Learning to Count Objects in Natural Images for Visual Question Answering. In International Conference on Learning Representations, 2018.

[42] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges.
In European Conference on Computer Vision, pages 391–405, 2014.