{"title": "Self-Critical Reasoning for Robust Visual Question Answering", "book": "Advances in Neural Information Processing Systems", "page_first": 8604, "page_last": 8614, "abstract": "Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors and fail to generalize to test data with a significantly different question-answer (QA) distribution. To address this issue, we introduce a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than other competitive answer candidates. The influential regions are either determined from human visual/textual explanations or automatically from just significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state-of-the-art i.e. 49.5\\% using textual explanations and 48.5\\% using automatically", "full_text": "Self-Critical Reasoning\n\nfor Robust Visual Question Answering\n\nJialin Wu\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\n\njialinwu@utexas.edu\n\nRaymond J. Mooney\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\n\nmooney@cs.utexas.edu\n\nAbstract\n\nVisual Question Answering (VQA) deep-learning systems tend to capture super\ufb01-\ncial statistical correlations in the training data because of strong language priors\nand fail to generalize to test data with a signi\ufb01cantly different question-answer\n(QA) distribution [1]. To address this issue, we introduce a self-critical training\nobjective that ensures that visual explanations of correct answers match the most\nin\ufb02uential image regions more than other competitive answer candidates. The\nin\ufb02uential regions are either determined from human visual/textual explanations or\nautomatically from just signi\ufb01cant words in the question and answer. 
We evaluate\nour approach on the VQA generalization task using the VQA-CP dataset, achiev-\ning a new state-of-the-art i.e., 49.5% using textual explanations and 48.5% using\nautomatically annotated regions.\n\n1\n\nIntroduction\n\nRecently, Visual Question Answering (VQA) [4] has emerged as a challenging task that requires\narti\ufb01cial intelligence (AI) systems to compute answers by jointly analyzing both natural language\nquestions and visual content. The state-of-the-art VQA systems [8, 2, 1, 3, 12, 33, 25, 29, 14, 15, 21]\nachieve high performance when the training and test question-answer (QA) pairs are sampled from the\nsame distribution. However, most of these systems fail to generalize to test data with a substantially\ndifferent QA distribution. In particular, their performance drops catastrophically on the recently\nintroduced Visual Question Answering under Changing Priors (VQA-CP) [1] dataset. The strong\nlanguage priors encourage systems to blindly capture super\ufb01cial statistical correlations in the training\nQA pairs and simply output the most common answers, instead of reasoning about the relevant image\nregions on which a human would focus. For example, since about 40% of questions that begin with\n\u201cwhat sport\u201d have the answer \u201ctennis\u201d, systems tend to learn to output \u201ctennis\u201d for these questions\nregardless of image content.\nA number of recent VQA systems [28, 35, 25, 20] learn to not only predict correct answers but also\nbe \u201cright for the right reasons\u201d [23, 25]. These systems are trained to encourage the network to focus\non regions in the image that humans have somehow annotated as important (which we will refer to as\n\u201cimportant regions.\u201d). However, many times, the network also focuses on these important regions\neven when it produces a wrong answer. 
Previous approaches do nothing to actively discourage this\nphenomenon, which we have found occurs quite frequently (footnote 1). For example, as shown in Figure 1, we\nask the VQA system, \u201cWhat is the man eating?\u201d The baseline system predicts \u201chot dog\u201d but focuses\non the banana because hot dog appears much more frequently in the training data. What\u2019s worse, this\nerror is hard to detect when only analyzing the correct answer \u201cbanana\u201d that has been successfully\ngrounded in the image.\n\nFootnote 1: We examine these situations by designing a metric called false sensitivity rate (FSR) in Sec. 5.2.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Example of a common answer misleading the prediction even though the VQA system\nhas the right reasons for the correct answer. Figure (a) shows the important regions extracted from\nhuman visual attention. Figures (b) and (e) show the answer distributions for the question \u201cWhat is the\nman eating?\u201d in the training and test datasets. Figures (c) and (d) show the most in\ufb02uential region for the\npredictions \u201chot dog\u201d and \u201cbanana\u201d using the baseline UpDn VQA system, and Figures (f) and (g) show the\nin\ufb02uential region for the predictions \u201chot dog\u201d and \u201cbanana\u201d using the VQA system after being trained\nwith our self-critical objective. The number on the bounding box shows the answer\u2019s sensitivity to\nthe object.\n\nTo address this issue, we present a \u201cself-critical\u201d approach that directly criticizes incorrect answers\u2019\nsensitivity to the important regions. First, for each QA, we determine the important region that most\nin\ufb02uences the network\u2019s prediction of the correct answer. 
We then penalize the network for focusing\non this region when its predicted answer for this question is wrong.\nOur self-critical approach is end-to-end trainable and only requires that the base VQA system be\ndifferentiable to the visual content, and thus can be applied to most current state-of-the-art systems.\nWe investigated three approaches to determining important regions. First, like the previous work\n[28, 35, 25, 20], we used regions that humans have explicitly marked as important. However, this\nrequires a fair bit of extra human effort to provide such detailed annotations. So we also explored\nusing human textual VQA explanations from the VQA-X [18] dataset to determine important objects\nwhich are then grounded to important regions in the image. Finally, we tried determining important\nregions by only using objects mentioned in the question or answer and grounding them in the image,\nwhich requires no additional human annotation of the VQA training data.\nWe evaluate our approach using the UpDn VQA system [2] on the VQA-CP dataset [1] and achieve\na new state-of-the-art performance (currently 47.7%): i.e. 49.5% overall score with VQA-X [18]\ntextual explanations, 49.1 % with VQA-HAT [7] visual explanations and 48.5% using just mentioned\nobjects in the questions and answers. Our code is available at https://github.com/jialinwu17/\nSelf_Critical_VQA.\n\n2 Related Work\n\n2.1 Human Explanations for VQA\n\nThere are two main kinds of human explanations available for the most popular VQA dataset [4],\ni.e., visual and textual explanations. The VQA-HAT dataset [7] is a visual explanation dataset that\ncollects human attention maps by giving human experts blurred images and asking them to determine\nwhere to deblur in order to answer a given visual question. Alternatively, [18] presents the VQA-X\ndataset that associates a textual explanation with each QA pair, which a human has provided to justify\nan answer to a given question. 
In this work, we utilize both of these kinds of explanations to provide\nthe important regions.\n\n2.2 Language Priors in VQA\n\nLanguage priors [1, 9] in VQA refer to the fact that question types and their answers are highly\ncorrelated. For instance, questions that begin with \u201cHow many\u201d are usually answered by either two\nor three. These language priors allow VQA systems to take a shortcut when answering questions\nby only focusing on the questions without reasoning about the visual content. In order to prevent\nthis shortcut, VQA v2 [4] balances the answer distribution so that there exist at least two similar\nimages with different answers for each question. Recently, [1] introduced a diagnostic recon\ufb01guration\nof the VQA v2 dataset called VQA-CP, where the distribution of the QA pairs in the training set\nis signi\ufb01cantly different from that in the test set. Most state-of-the-art VQA systems are found\nto rely heavily on language priors and experience a catastrophic performance drop on VQA-CP. We\nevaluate our approach on VQA-CP in order to demonstrate that it generalizes better and is less\nsensitive to distribution changes.\n\n2.3 Improving VQA using Human Explanations\n\nThe desired property for VQA systems is to not only infer the correct answers to visual questions\nbut also base the answer on image regions that a human believes are important, i.e., right for the\nright reasons. The VQA systems that address this issue can be classi\ufb01ed into two categories. The\n\ufb01rst trend is to build a system whose model is inherently interpretable. 
For example, GVQA [1]\nexplicitly disentangles the vision and language components by introducing a separate visual concept\nveri\ufb01er and answer cluster classi\ufb01ers. The other trend is to align a system\u2019s explanation to human\nexperts\u2019 explanations for the correct answers. [35, 20] align the internal attention weights over the\nimage to the human attention maps. The work most related to ours is HINT [25], which enforces the\nsystem\u2019s gradient-based importance scores for each detected object to have the same rankings as its\nhuman importance scores. In contrast to prior work, our approach not only encourages the systems to\nbe sensitive to the important regions identi\ufb01ed by humans, but also decreases the incorrect answers\u2019\nsensitivity to these regions.\n\n3 Preliminaries\n\nIn this section, we \ufb01rst introduce our base Bottom-up Top-down (UpDn) VQA system [2] (footnote 2). Then, we\ndescribe our method for constructing a proposed object set that covers the most in\ufb02uential objects on\nwhich a human would focus when answering the question.\n\n3.1 Bottom-Up Top-Down VQA\n\nA large number of previous VQA systems [8, 5, 21] utilize a trainable Top-Down attention mechanism\nover convolutional features to recognize relevant image regions. [2] introduced complementary\nbottom-up attention that \ufb01rst detects common objects and attributes so that the top-down attention\ncan directly model the contribution of higher-level concepts. This UpDn approach is heavily used in\nrecent work [25, 31, 14, 32, 26] and signi\ufb01cantly improves VQA performance.\nTechnically, on the vision side, UpDn systems \ufb01rst extract a visual feature set $\\mathcal{V} = \\{v_1, \\ldots, v_{|\\mathcal{V}|}\\}$\nfor each image, whose element $v_i$ is a feature vector for the i-th detected object. On\nthe language side, UpDn systems sequentially encode each question Q to produce a question vector\nq using a standard single-layer GRU [6] denoted by h, i.e. q = h(Q). 
Let f denote the answer\nprediction operator that takes both visual features and question features as input and predicts the\ncon\ufb01dence for each answer a in the answer candidate set A, i.e. $P(a \\mid \\mathcal{V}, Q) = f(\\mathcal{V}, q)$. The VQA\ntask is framed as a multi-label regression problem with the gold-standard soft scores as targets in\norder to be consistent with the evaluation metric. In particular, the standard binary cross entropy loss\n$L_{vqa}$ is used to supervise the sigmoid-normalized outputs.\n\nFootnote 2: The key approach used by the VQA-challenge winning entries in the last two years.\n\nFigure 2: Model overview. In the left top block, the base UpDn VQA system \ufb01rst detects a set\nof objects and predicts an answer. We then analyze the correct answer\u2019s sensitivity (Fork) to the\ndetected objects via visual explanation and extract the most in\ufb02uential one in the proposal object set\nas the most in\ufb02uential object, which is also further strengthened via the in\ufb02uence strengthen loss\n(left bottom block). Finally, we analyze the competitive incorrect answers\u2019 sensitivities (Knife) to\nthe most in\ufb02uential object and criticize the sensitivity until the VQA system answers the question\ncorrectly (right block). The number on a bounding box is the answer\u2019s sensitivity to the given object.\n\n3.2 Proposed In\ufb02uential Object Set Construction\n\nOur approach ideally requires identifying important regions that a human considers most critical\nin answering the question. However, directly obtaining such a clear set of in\ufb02uential objects from\neither visual or textual explanations is hard, as the visual explanations also highlight the neighbor\nobjects around the most in\ufb02uential one, and grounding textual explanations in images is still an active\nresearch \ufb01eld. We relax this requirement by identifying a proposed set of in\ufb02uential objects $\\mathcal{I}$ for\neach QA pair. This set may be noisy and contain some irrelevant objects, but we assume that it at\nleast includes the most relevant object. As previously mentioned, we explore three separate methods\nfor constructing this proposal set, as described below:\nConstruction from Visual Explanations. Following HINT [25], we use the VQA-HAT dataset\n[7] as the visual explanation source. HAT maps contain a total of 59,457 image-question pairs,\ncorresponding to approximately 9% of the VQA-CP training and test set. We also inherit HINT\u2019s\nobject scoring system that is based on the normalized human attention map energy inside the proposal\nbox relative to the normalized energy outside the box. We score each detected object from the\nbottom-up attention and build the potential object set by selecting the top $|\\mathcal{I}|$ objects.\nConstruction from Textual Explanations. Recently, [18] introduced a textual explanation dataset\nthat annotates 32,886 image-question pairs, corresponding to 5% of the entire VQA-CP dataset.\nTo extract the potential object set, we \ufb01rst assign part-of-speech (POS) tags to each word in the\nexplanation using the spaCy POS tagger [11] and extract the nouns in the sentence. Then, we select\nthe detected objects for which the cosine similarity between the Glove embedding [19] of their category\nname and that of any of the extracted nouns is greater than 0.6. Finally, we select the $|\\mathcal{I}|$ objects with the\nhighest similarity.\nConstruction from Questions and Answers. Since the above explanations may not be available in\nother datasets, we also consider a simple way to extract the proposal object set from just the training\nQA pairs alone. The method is quite similar to the way we construct the potential set from textual\nexplanations. 
The only difference is that instead of parsing the explanations, we parse the QA pairs\nand extract nouns from them.\n\n4 Approach\n\nIn this section, we present our self-critical approach to prevent the most common answer from\ndominating the correct answer given the proposal sets of in\ufb02uential objects. Figure 2 shows an\noverview of our approach. Besides the UpDn VQA system (left top block), our approach contains\ntwo other components: we \ufb01rst recognize and strengthen the most in\ufb02uential objects (left bottom\nblock), and then we criticize incorrect answers that are more highly ranked than the correct answer\nand try to make them less sensitive to these key objects (right block). As recent research suggests that\ngradient-based methods more faithfully represent a model\u2019s decision making process [25, 34, 30, 13],\nwe use a modi\ufb01ed GradCAM [24] to compute the answer a\u2019s sensitivity to the i-th object features $v_i$,\nas shown in Eq. 1 (footnote 3).\n\n$$S(a, v_i) := \\left(\\nabla_{v_i} P(a \\mid \\mathcal{V}, q)\\right)^T \\mathbf{1} \\tag{1}$$\n\nThere are two modi\ufb01cations to GradCAM: (1) ReLU units are removed, (2) gradients are no longer\nweighted by their feature vectors. This is because negative gradients on the inputs to a ReLU are\nvaluable evidence against the current prediction. Therefore, there is no need to zero them out with\na ReLU. Also, before they are weighted by the feature vectors, the gradients indicate how small\nchanges in any direction in\ufb02uence the \ufb01nal prediction. If weighted by the feature vectors, the output\ntends to re\ufb02ect the in\ufb02uence caused only by existing attributes of the objects, thereby ignoring other\npotential attributes that may appear in the test data.\n\n4.1 Recognizing and Strengthening In\ufb02uential Objects\nGiven a proposal object set $\\mathcal{I}$ and the entire detected object set $\\mathcal{V}$, we identify the object that the\ncorrect answer is most sensitive to and further strengthen its sensitivity. We \ufb01rst introduce a sensitivity\nviolation term $\\mathrm{SV}(a, v_i, v_j)$ for answer a and the i-th and j-th object features $v_i$ and $v_j$ as the amount\nby which the sensitivity to $v_j$ surpasses the sensitivity to $v_i$, as shown in Eq. 2.\n\n$$\\mathrm{SV}(a, v_i, v_j) = \\max\\left(S(a, v_j) - S(a, v_i), 0\\right) \\tag{2}$$\n\nBased on the assumption that the proposal set contains at least one in\ufb02uential object that a human\nwould use to infer the answer, we impose the constraint that the most sensitive object in the proposal\nset should not be less sensitive than any object outside the proposal set. Therefore, we introduce the\nin\ufb02uence strengthen loss $L_{infl}$ in Eq. 3:\n\n$$L_{infl} = \\min_{v_i \\in \\mathcal{I}} \\left( \\sum_{v_j \\in \\mathcal{V} \\setminus \\mathcal{I}} \\mathrm{SV}(a_{gt}, v_i, v_j) \\right) \\tag{3}$$\n\nwhere $a_{gt}$ denotes the ground truth answer. The key differences between our in\ufb02uence strengthen\nloss and the ranking-based HINT loss are that (1) we relax the unnecessary constraint that the objects\nshould follow the exact human ranking, and (2) it is easier to adapt to different types of explanation\n(e.g. textual explanations) where such detailed rankings are not available.\n\n4.2 Criticizing Incorrect Dominant Answers\n\nNext, for the incorrect answers ranked higher than the correct answer, we attempt to decrease their\nsensitivity to the in\ufb02uential objects. For example, in VQA-CP, bedrooms are the most common room\ntype. 
Therefore, during testing, systems frequently incorrectly classify bathrooms (which are rare in\nthe training data) as bedrooms. Since humans identify a sink as an in\ufb02uential object when identifying\nbathrooms, we want to decrease the in\ufb02uence of sinks on concluding bedroom.\nIn order to address this issue, we design a self-critical objective to criticize the VQA systems\u2019\nincorrect but competitive decisions based on the most in\ufb02uential object $v^*$ to which the correct answer\nis most sensitive as de\ufb01ned in Eq. 4.\n\n$$v^* = \\arg\\min_{v_i \\in \\mathcal{I}} \\left( \\sum_{v_j \\in \\mathcal{V} \\setminus \\mathcal{I}} \\mathrm{SV}(a_{gt}, v_i, v_j) \\right) \\tag{4}$$\n\nSpeci\ufb01cally, we extract a bucket of at most B predictions with higher con\ufb01dence than the correct\nanswer, $\\mathcal{B} = \\{a_1, a_2, \\ldots, a_{|\\mathcal{B}|}\\}$, and utilize the proposed self-critical loss $L_{crit}$ to directly minimize the\nweighted sensitivities of the answers in the bucket $\\mathcal{B}$ to the selected most in\ufb02uential object, as shown\nin Eq. 5.\n\n$$L_{crit} = \\sum_{a \\in \\mathcal{B}} w(a) \\left( S(a, v^*) - S(a_{gt}, v^*) \\right) \\tag{5}$$\n\nwhere $a_{gt}$ denotes the ground truth answer. Because several answer candidates could be similar (e.g.\ncow and cattle), we weight the sensitivity gaps in Eq. 5 by the cosine distance between the answers\u2019\n300-d Glove embeddings [19], i.e. $w(a) = \\text{cosine\\_dist}(\\text{Glove}(a_{gt}), \\text{Glove}(a))$. In the multi-word\nanswer case, the Glove embeddings of these answers are computed as the sum of the individual\nwords\u2019 Glove embeddings.\n\nFootnote 3: $\\mathbf{1}$ denotes a vector with all 1\u2019s.\n\n4.3 Implementation and Training Details\n\nIn this section, we describe the detailed implementation and training procedure of our self-critical\napproach to VQA using VQA-X explanations.\nTraining Details. We \ufb01rst pre-train our base UpDn VQA system on the VQA-CP training set using\nstandard VQA loss $L_{vqa}$ (binary cross-entropy loss with soft scores as supervision) with the Adam\noptimizer [16] for at most 20 epochs. 
As suggested in [27], the learning rate is \ufb01xed to 10e-3 with a\nbatch size of 384 during the pre-training process, and we use 1,280 hidden units in the base UpDn\nVQA system. Then, we \ufb01ne-tune our system to recognize important objects using $L_{vqa} + \\lambda_{infl} L_{infl}$\nwith a learning rate of 10e-5 for at most 15 epochs on the intersection of VQA-X and VQA-CP\ntraining set. We initialize the model with the best model from the pre-train stage. In this stage, we\nalso \ufb01nd the best in\ufb02uence strengthening loss weight $\\lambda^{\\star}_{infl}$. Finally, we \ufb01ne-tune the system with the\njoint loss $L = L_{vqa} + \\lambda^{\\star}_{infl} L_{infl} + \\lambda_{crit} L_{crit}$ for at most 15 epochs with a learning rate of 10e-5 on\nthe intersection of VQA-X and VQA-CP training set. The bucket size $|\\mathcal{B}|$ of the competitive answers\nis set to 5 because we observed that the top-5 overall score of the pre-trained system on the VQA-CP\ndataset achieves 80.4%, and increasing the bucket size only marginally improves the score.\nImplementation. We implemented our approach on top of the original UpDn system. The base\nsystem utilizes a Faster R-CNN head [22] in conjunction with a ResNet-101 base network [10]\nas the object detection module. The detection head is pre-trained on the Visual Genome dataset\n[17] and is capable of detecting 1,600 object categories and 400 attributes. UpDn takes the \ufb01nal\ndetection outputs and performs non-maximum suppression (NMS) for each object category using an\nIoU threshold of 0.7. Then, the convolutional features for the top 36 objects are extracted for each\nimage as the visual features, i.e. a 2,048-dimensional vector for each object. For question embedding,\nfollowing [2], we perform standard text pre-processing and tokenization. In particular, questions are\n\ufb01rst converted to lower case and then trimmed to a maximum of 14 words, and the words that appear\nless than 5 times are replaced with an \u201c<unk>\u201d token. 
A single layer GRU [6] is used to sequentially\nprocess the word vectors and produce a sentential representation for the pre-processed question. We\nalso use Glove vectors [19] to initialize the word embedding matrix when embedding the questions.\nThe size of the proposal object set is set to 6.\n\n5 Experimental Results\n\nFirst, we present experiments on a simple synthetic dataset to illustrate basic aspects of our approach.\nWe then present experimental results on the VQA-CP (Visual Question Answering under Changing\nPriors) [1] dataset where the QA pairs in the training data and test data have signi\ufb01cantly different\ndistributions. We also present experimental results on the VQA v2 validation set for completeness.\nWe compare our self-critical system\u2019s VQA performance with the state-of-the-art systems via the\nstandard evaluation metric. After that, we perform ablation studies to verify the contribution of\nstrengthening the in\ufb02uential objects and criticizing competitive answers. Finally, we show some\nqualitative examples to illustrate the effectiveness of criticizing the incorrect answers\u2019 sensitivity.\n\n5.1 Results on Synthetic Data\n\nWe manually created a dataset where the inputs are drawn from a mixture of two Gaussians, i.e.\n$N_1 = \\mathcal{N}([-3, 3]^T, 2I_2)$ and $N_2 = \\mathcal{N}([3, 3]^T, 2I_2)$, where each distribution de\ufb01nes a category. In\norder to ensure the training and test data have different category distributions, we intentionally assign\ndifferent weights to the two components. In particular, during training, the examples are drawn from\n$N_1$ with probability p, and during testing, the examples are drawn from $N_1$ with probability $1 - p$.\nWe examine the effectiveness of our self-critical approach for p in {0.05, 0.1, 0.2, 0.5}, where\n0.5 means no train/test difference. In these experiments, we use the obvious human\nexplanation that the \ufb01rst channel (x-axis) is important for all training examples. 
We use a 15-layer\nfeed-forward neural network with 256 hidden units and 1000 examples for both training and test\nin all of our experiments. We use Adam to optimize our model with a learning rate of 1e-3 during\npre-training (100 epochs) with binary cross-entropy loss, and 1e-5 during \ufb01ne-tuning (50 epochs)\nwith our self-critical approach. The in\ufb02uence strengthening loss weight and self-critical loss weight\nare set to 20 and 1000, respectively. The results in Fig. 3 show that the self-critical approach helps\nshift the decision boundary towards the correct, unbiased position, increasing robustness and accuracy\non the test data.\n\nFigure 3: Decision boundaries and test set accuracies on synthetic data with various class ratios p,\nwhich takes the values 0.05, 0.1, 0.2, and 0.5 from left to right. The training data is shown in the top row,\ntesting in the bottom. Red and blue colors denote different categories. Dashed lines and solid lines\ndenote the boundaries of the pretrained and \ufb01ne-tuned models, respectively.\n\nMethod | Expl. | VQA-CP v2 test: All | Yes/No | Num | Other | VQA v2 val: All | Yes/No | Num | Other\nGVQA [1] | - | 31.3 | 58.0 | 13.7 | 22.1 | 48.2 | 72.0 | 31.2 | 34.7\nUpDn [2] | - | 39.7 | 42.7 | 11.9 | 46.1 | 63.5 | 81.2 | 42.1 | 55.7\nUpDn+AttAlign [25] | - | 38.5 | 42.5 | 11.4 | 43.8 | 61.0 | 78.9 | 38.4 | 53.3\nUpDn+AdvReg. [21] | - | 41.2 | 65.5 | 15.5 | 35.5 | 62.8 | 79.8 | 42.4 | 55.2\nUpDn+SCR (ours) | QA | 48.47 | 70.41 | 10.42 | 47.29 | 62.3 | 77.4 | 40.9 | 56.5\nUpDn+HINT [25] | HAT | 47.7 | 70.0 | 10.7 | 46.3 | 62.5 | 80.5 | 41.8 | 54.0\nUpDn+SCR (ours) | HAT | 49.17 | 71.55 | 10.72 | 47.49 | 62.2 | 78.9 | 41.4 | 54.3\nUpDn+SCR (ours) | VQA-X | 49.45 | 72.36 | 10.93 | 48.02 | 62.2 | 78.8 | 41.6 | 54.5\nTable 1: Comparison of the results on VQA-CP test and VQA v2 validation dataset with the state-of-\nthe-art systems. The upper part includes VQA systems without human explanations during training,\nand the VQA systems in the bottom part use either visual or textual human explanations. The \u201cExpl.\u201d\ncolumn shows the source of explanations for training the VQA systems. SCR is the shorthand for\nour self-critical reasoning approach. The results with a precision of 2 decimal points denote the mean\nof three runs with different random initial seeds.\n\n5.2 Results on VQA Data\n\nVQA Performance on VQA-CP and VQA v2 datasets\nTable 1 shows results on the VQA-CP generalization task, comparing our results with the state-of-\nthe-art methods. We also report our system\u2019s performance on the balanced VQA v2 validation set for\ncompleteness.\nOur system signi\ufb01cantly outperforms other state-of-the-art systems (e.g., HINT [25]) by 1.5% on\nthe overall score for VQA-CP when using the same human visual explanations (VQA-HAT), which\nindicates the effectiveness of directly criticizing the competitive answers\u2019 sensitivity to the most\nin\ufb02uential objects. Using human textual explanations as supervision is even a bit more effective.\nWith only about half the number of explanations compared to VQA-HAT, these textual explanations\nimprove VQA performance by an additional 0.3% on the overall score, achieving a new state-of-the-\nart of 49.5%.\nWithout human explanations, our approach that only uses the QA proposal object set as supervision\nclearly outperforms all of the previous approaches, even those that use human explanations. We\nfurther analyzed the quality of the in\ufb02uential object proposal sets extracted from the QA pairs by\ncomparing them to those from the corresponding human explanations. 
On average, the QA proposal\nsets contain 57.1% and 54.3% of the objects in the VQA-X and VQA-HAT proposal object sets,\nrespectively, indicating a signi\ufb01cant but not perfect overlap.\nNote that our self-critical objective particularly improves VQA performance in the \u201cYes/No\u201d and\n\u201cOther\u201d question categories; however, it does not do as well in the \u201cNum\u201d category. This is understandable\nbecause counting problems are generally harder than the other two types and require the VQA\nsystem to consider all of the objects jointly. Therefore, criticizing only the most sensitive ones does\nnot improve the performance.\nFor the VQA v2 validation set, our self-critical methods are competitive with previous approaches. This\nindicates that criticizing the wrong answers\u2019 sensitivities at least does not hurt performance when the\ntraining and test data have the same distribution.\nAblation Study on the Loss Weights\nTables 2 and 3 evaluate the impact of varying the weight of the in\ufb02uence strengthening loss and\nself-critical loss on the VQA-CP test data using VQA-X textual explanations. Table 2 shows that\nwithout $L_{crit}$ to criticize the false sensitivity, our in\ufb02uence-strengthening still improves the UpDn\nVQA system by 8.1% on the overall score. As shown in Table 3, combining with the $L_{crit}$ loss, our\napproach sets a new state-of-the-art score (49.5%) on the VQA-CP test set using textual explanations.\nWe also notice that our approach is fairly robust to changes in the weight of both losses, $L_{infl}$ and $L_{crit}$,\nand consistently improves VQA performance for a wide range of loss weights.\n\nFigure 3 panel labels (test accuracy, left to right): Baseline 77.4%, Self-critical 79.8%; Baseline 80.3%, Self-critical 86.2%; Baseline 88.5%, Self-critical 89.6%; Baseline 95.1%, Self-critical 95.0%.\n\nMethod | Expl. | $\\lambda_{infl}$ | $\\lambda_{crit}$ | VQA-CP v2 test: All | Yes/No | Num | Other\nUpDn [2] | - | - | - | 39.7 | 42.7 | 11.9 | 46.1\nUpDn+SCR (ours) | VQA-X | 5 | 0 | 46.6 | 62.2 | 12.1 | 47.9\nUpDn+SCR (ours) | VQA-X | 20 | 0 | 47.8 | 67.9 | 12.4 | 47.0\nUpDn+SCR (ours) | VQA-X | 60 | 0 | 47.2 | 65.0 | 11.9 | 47.6\nUpDn+SCR (ours) | VQA-X | 80 | 0 | 47.0 | 64.3 | 11.8 | 47.5\nTable 2: Ablation study on various in\ufb02uence-strengthening loss weights on VQA-CP test data. The\n\u201cExpl.\u201d column shows the source of explanations for training the VQA systems. The \u201c$\\lambda_{infl}$\u201d column\nshows the in\ufb02uence-strengthening loss weight. The \u201c$\\lambda_{crit}$\u201d column shows the self-critical loss weight.\nSCR is the shorthand for our self-critical reasoning approach.\n\nMethod | Expl. | $\\lambda_{infl}$ | $\\lambda_{crit}$ | VQA-CP v2 test: All | Yes/No | Num | Other\nUpDn [2] | - | - | - | 39.7 | 42.7 | 11.9 | 46.1\nUpDn+SCR (ours) | VQA-X | 20 | 500 | 48.7 | 70.0 | 10.4 | 47.8\nUpDn+SCR (ours) | VQA-X | 20 | 2000 | 49.5 | 72.2 | 11.0 | 48.1\nUpDn+SCR (ours) | VQA-X | 20 | 4000 | 49.4 | 71.9 | 11.0 | 48.1\nUpDn+SCR (ours) | VQA-X | 20 | 6000 | 49.0 | 71.6 | 10.9 | 47.8\nTable 3: Ablation study on various self-critical loss weights on VQA-CP test data. The \u201cExpl.\u201d\ncolumn shows the source of explanations for training the VQA systems. The \u201c$\\lambda_{crit}$\u201d column shows\nthe self-critical loss weight. SCR is the shorthand for our self-critical reasoning approach.\n\nStudy on Proposal In\ufb02uential Object Set Size\nTable 4 reports results with various set sizes, indicating that the two objectives are fairly robust to this choice. 
We use\nVQA-HAT visual explanations to construct the influential object sets and both losses to fine-tune our\nmodel.\n\n|I|             4      5      6      7      8      10\nVQA-CP v2 test  48.8%  49.1%  49.2%  49.1%  48.7%  48.3%\n\nTable 4: Ablation study on the size of the proposal influential object set.\n\nFigure 4: Positive examples showing that our self-critical reasoning approach prevents the\nincorrectly predicted answer in the UpDn baseline system from being sensitive to the most influential\nobject. For each example, the top two figures show the object to which the ground truth (left) and\nincorrectly predicted (right) answers are sensitive. The bottom two figures show the corresponding\nmost influential object after our self-critical training. Note that the attention for the incorrect answer\nshifts to a more relevant part of the image for that answer. The number around the bounding box is\nthe answer\u2019s sensitivity to the object.\n\nEffectiveness of Criticizing False Sensitivity\nIn this section, we quantitatively evaluate the effectiveness of the proposed self-critical objective.\nIn particular, we measure how often false sensitivity occurs, i.e., how often the incorrectly predicted answer\u2019s\nsensitivity to the influential object (to which the correct answer is most sensitive) is greater than the\ncorrect answer\u2019s sensitivity. We formally define the false sensitivity rate in Eq. 6:\n\n$$\\mathrm{FSR} = \\frac{\\sum_{Q,V} \\mathbb{1}\\left[ S(a_{pred}, v^{*}) - S(a_{gt}, v^{*}) > 0,\\; \\mathrm{score}(a_{pred}) = 0 \\right]}{\\sum_{Q,V} 1} \\tag{6}$$\n\nwhere $\\mathbb{1}[\\cdot]$ denotes the indicator function, which returns 1 if the condition is satisfied and 0 otherwise,\nand score(a_pred) = 0 indicates that the predicted answer is incorrect.\nFor the original UpDn VQA system, we observe a false sensitivity rate of 35.5% among all the test\nQA pairs in VQA-CP. After self-critical training, the false sensitivity rate drops to 20.4%\nusing the VQA-HAT explanations, and to 19.6% using VQA-X explanations. 
This indicates that false\nsensitivity is a common problem in VQA systems and shows the utility of addressing it.\nSome examples of how our self-critical approach mitigates false sensitivity are shown in Figure 4.\nNote that for the correct answer, our approach increases the influence of the most influential object,\nwhich we attribute to the influence-strengthening component. More importantly, we observe that this object\u2019s\ninfluence on the incorrect answer decreases and sometimes falls below that of other objects.\n\n     UpDn   UpDn + QA  UpDn + HAT  UpDn + VQA-X\nFSR  35.5%  22.6%      20.4%       19.6%\n\nTable 5: False sensitivity rate (FSR) comparison of using different types of human explanations.\n\n6 Conclusion and Future Work\n\nIn this work, we have explored how to improve VQA performance by criticizing the sensitivity of\nincorrect answers to the most influential object for the correct answer. Our \u201cself-critical\u201d approach\nhelps VQA systems generalize to test data where the distribution of question-answer pairs is\nsignificantly different from the training data. The influential objects are selected from a proposal set\nextracted from human visual or textual explanations, or simply from the mentioned objects in the\nquestions and answers. Our approach outperforms the state-of-the-art VQA systems on the VQA-CP\ndataset by a clear margin even without human explanations as additional supervision. 
In the future,\nwe would like to combine the visual and textual explanations to better train VQA systems.\nThis is difficult because the proposal object sets for these two types of explanations contain different\ntypes of noise (i.e., question-irrelevant objects), and therefore different biases.\n\nAcknowledgement\n\nThis research was supported by the DARPA XAI program under a grant from AFRL.\n\n[Figure 4 panel labels: \u201cWhat is the sitting on?\u201d (key objects for \u201cbench\u201d and \u201cground\u201d); \u201cWhat color is the net?\u201d (key objects for \u201cblue\u201d and \u201cwhite\u201d); \u201cWhat is the man holding?\u201d (key objects for \u201ckite\u201d and \u201ctennis racket\u201d); each shown for the baseline and after self-critical training.]\n\nReferences\n\n[1] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don\u2019t Just Assume; Look and Answer:\nOvercoming Priors for Visual Question Answering. In CVPR, 2018.\n\n[2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-Up\nand Top-Down Attention for Image Captioning and VQA. In CVPR, 2018.\n\n[3] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural Module Networks. In CVPR, 2016.\n\n[4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA:\nVisual Question Answering. In ICCV, 2015.\n\n[5] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome. MUTAN: Multimodal Tucker Fusion for\nVisual Question Answering. In ICCV, 2017.\n\n[6] K. Cho, B. Van Merri\u00ebnboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and\nY. Bengio. 
Learning Phrase Representations Using RNN Encoder-Decoder for Statistical\nMachine Translation. In EMNLP, 2014.\n\n[7] A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra. Human Attention in Visual Question\nAnswering: Do Humans and Deep Networks Look at the Same Regions? In Computer Vision\nand Image Understanding, 2017.\n\n[8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal Compact\nBilinear Pooling for Visual Question Answering and Visual Grounding. In EMNLP, 2016.\n\n[9] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter:\nElevating the role of image understanding in Visual Question Answering. In CVPR, 2017.\n\n[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR,\n2016.\n\n[11] M. Honnibal and I. Montani. spacy 2: Natural language understanding with bloom embeddings,\nconvolutional neural networks and incremental parsing. 2017.\n\n[12] R. Hu, J. Andreas, T. Darrell, and K. Saenko. Explainable Neural Computation via Stack Neural\nModule Networks. In ECCV, 2018.\n\n[13] S. Jain and B. C. Wallace. Attention is not Explanation. In NAACL, 2019.\n\n[14] Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh. Pythia v0.1: the\nWinning Entry to the VQA Challenge 2018. arXiv preprint arXiv:1807.09956, 2018.\n\n[15] J.-H. Kim, J. Jun, and B.-T. Zhang. Bilinear Attention Networks. In NeurIPS, 2018.\n\n[16] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.\n\n[17] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li,\nD. A. Shamma, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced\nDense Image Annotations. IJCV.\n\n[18] D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach.\nMultimodal Explanations: Justifying Decisions and Pointing to the Evidence. In CVPR, 2018.\n\n[19] J. Pennington, R. Socher, and C. 
Manning. GloVe: Global Vectors for Word Representation. In\nEMNLP, 2014.\n\n[20] T. Qiao, J. Dong, and D. Xu. Exploring Human-Like Attention Supervision in Visual Question\nAnswering. In AAAI, 2018.\n\n[21] S. Ramakrishnan, A. Agrawal, and S. Lee. Overcoming Language Priors in Visual Question\nAnswering with Adversarial Regularization. In NeurIPS, 2018.\n\n[22] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-time Object Detection\nwith Region Proposal Networks. In NIPS, 2015.\n\n[23] A. S. Ross, M. C. Hughes, and F. Doshi-Velez. Right for the Right Reasons: Training\nDifferentiable Models by Constraining Their Explanations. In IJCAI, 2017.\n\n[24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. Grad-CAM:\nVisual Explanations from Deep Networks via Gradient-Based Localization. In ICCV, 2017.\n\n[25] R. R. Selvaraju, S. Lee, Y. Shen, H. Jin, D. Batra, and D. Parikh. Taking a HINT: Leveraging\nExplanations to Make Vision and Language Models More Grounded. In ICCV, 2019.\n\n[26] M. Shah, X. Chen, M. Rohrbach, and D. Parikh. Cycle-Consistency for Robust Visual Question\nAnswering. In CVPR, 2019.\n\n[27] D. Teney, P. Anderson, X. He, and A. v. d. Hengel. Tips and Tricks for Visual Question\nAnswering: Learnings from the 2017 Challenge. arXiv preprint arXiv:1708.02711, 2017.\n\n[28] A. Trott, C. Xiong, and R. Socher. Interpretable Counting for Visual Question Answering. 2018.\n\n[29] J. Wu, Z. Hu, and R. J. Mooney. Generating Question Relevant Captions to Aid Visual Question\nAnswering. In ACL, 2019.\n\n[30] J. Wu, D. Li, Y. Yang, C. Bajaj, and X. Ji. Dynamic Filtering with Large Sampling Field for\nConvnets. In ECCV, 2018.\n\n[31] J. Wu and R. J. Mooney. Faithful Multimodal Explanation for Visual Question Answering. In\nACL BlackboxNLP Workshop, 2019.\n\n[32] N. Xie, F. Lai, D. Doran, and A. Kadav. Visual Entailment: A Novel Task for Fine-Grained\nImage Understanding. 
arXiv preprint arXiv:1901.06706, 2019.\n\n[33] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked Attention Networks for Image Question\nAnswering. In CVPR, 2016.\n\n[34] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-Down Neural Attention\nby Excitation Backprop. IJCV.\n\n[35] Y. Zhang, J. C. Niebles, and A. Soto. Interpretable Visual Question Answering by Visual\nGrounding from Attention Supervision Mining. In WACV, 2019.\n", "award": [], "sourceid": 4639, "authors": [{"given_name": "Jialin", "family_name": "Wu", "institution": "UT Austin"}, {"given_name": "Raymond", "family_name": "Mooney", "institution": "University of Texas at Austin"}]}