{"title": "Connective Cognition Network for Directional Visual Commonsense Reasoning", "book": "Advances in Neural Information Processing Systems", "page_first": 5669, "page_last": 5679, "abstract": "Visual commonsense reasoning (VCR) has been introduced to boost research of cognition-level visual understanding, i.e., a thorough understanding of correlated details of the scene plus an inference with related commonsense knowledge. Recent studies on neuroscience have suggested that brain function or cognition can be described as a global and dynamic integration of local neuronal connectivity, which is context-sensitive to specific cognition tasks. Inspired by this idea, towards VCR, we propose a connective cognition network (CCN) to dynamically reorganize the visual neuron connectivity that is contextualized by the meaning of questions and answers. Concretely, we first develop visual neuron connectivity to fully model correlations of visual content. Then, a contextualization process is introduced to fuse the sentence representation with that of visual neurons. Finally, based on the output of contextualized connectivity, we propose directional connectivity to infer answers or rationales. Experimental results on the VCR dataset demonstrate the effectiveness of our method. Particularly, in $Q \\to AR$ mode, our method is around 4\\% higher than the state-of-the-art method.", "full_text": "Connective Cognition Network for Directional Visual\n\nCommonsense Reasoning\n\nAming Wu1\u2217 Linchao Zhu2 Yahong Han1\u2020 Yi Yang2\n\n1College of Intelligence and Computing, Tianjin University, Tianjin, China\n\n2ReLER, University of Technology Sydney, Australia\n\n{tjwam,yahong}@tju.edu.cn, {Linchao.Zhu, yi.yang}@uts.edu.au\n\nAbstract\n\nVisual commonsense reasoning (VCR) has been introduced to boost research of\ncognition-level visual understanding, i.e., a thorough understanding of correlated\ndetails of the scene plus an inference with related commonsense knowledge. 
Recent studies in neuroscience have suggested that brain function, or cognition, can be described as a global and dynamic integration of local neuronal connectivity that is context-sensitive to specific cognition tasks. Inspired by this idea, towards VCR, we propose a connective cognition network (CCN) to dynamically reorganize visual neuron connectivity, contextualized by the meaning of questions and answers. Concretely, we first develop visual neuron connectivity to fully model correlations of visual content. Then, a contextualization process is introduced to fuse the sentence representation with that of the visual neurons. Finally, based on the output of the contextualized connectivity, we propose directional connectivity to infer answers or rationales. Experimental results on the VCR dataset demonstrate the effectiveness of our method. In particular, in Q → AR mode, our method is around 4% higher than the state-of-the-art method.

1 Introduction

Recent advances in visual understanding mainly make progress on recognition-level perception of visual content, e.g., object detection [13, 23] and segmentation [9, 5], or on recognition-level grounding of visual concepts with image regions, e.g., image captioning [40, 24] and visual question answering [1, 6]. Towards complete visual understanding, a model must move forward from perception to reasoning, which includes cognitive inferences with correlated details of the scene and related commonsense knowledge. As a key step towards complete visual understanding, the task of Visual Commonsense Reasoning (VCR) [42] has been proposed along with a well-devised new dataset.
In VCR, given an image, a machine is required not only to answer a question about the thorough understanding of the correlated details of the visual content, but also to provide a rationale, e.g., contextualized with related visual details and background knowledge, to justify why the answer is true. As a first attempt to narrow the gap between recognition- and cognition-level visual understanding, Recognition-to-Cognition Networks (R2C) [42] conduct visual commonsense reasoning step by step, i.e., grounding the meaning of natural language with respect to the referred objects, contextualizing the meaning of an answer with respect to the question and related global objects, and finally reasoning over the shared representation to reach a decision on an answer. Due to the large discrepancy between the reasoning scheme of VCR and the cognition function of the human brain, R2C's performance falls well short of the human score, e.g., 65% vs. 91% in Q → A mode.

∗This work was done when Aming Wu visited ReLER Lab, UTS.
†Corresponding author

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Overview of our CCN method. The yellow, blue, and green circles indicate visual elements, question representation, and answer representation, respectively. Our method mainly includes visual neuron connectivity, contextualized connectivity, and directional connectivity. For semantic context, two LSTM units are used to extract sentence representations.

Recent studies [31, 8] on brain networks have suggested that brain function or cognition can be described as the global and dynamic integration of local (segregated) neuronal connectivity, and that such a global and dynamic integration is context-sensitive with respect to a specific cognition task. Inspired by this idea, in this paper, we propose a Connective Cognition Network (CCN) for visual commonsense reasoning. As shown in Fig.
1, the main process of CCN is to dynamically reorganize (integrate) the visual neuron connectivity that is contextualized by the meaning of answers and questions in the current reasoning task.

Concretely, taking visual words as visual neurons and object features as segregated visual modules, we first devise a Conditional GraphVLAD approach to represent an image's visual neuron connectivity, which includes connections among visual neurons and visual modules. The visual neuron connectivity serves as the basis for the dynamic integration in the process of reasoning. Meanwhile, as a context-sensitive integration, the meaning is specified by the semantic context of questions and answers. After obtaining the sequential information of sentences via an LSTM network [16], we fuse the sentence representation with that of the visual neurons, which constitutes a contextualization. Then we employ a graph convolutional network (GCN) to fully integrate both the local and global connectivity. For example, in Fig. 1, connections between "He" and "Person4", "Person4" and "Person3", as well as "Person3" and "table" could all be incorporated in the contextualized connectivity, where the last connection, between "Person3" and "table", belongs to the global integration. Though the contextualized connectivity is ready for reasoning, it lacks direction information, which is an important clue for cognitive reasoning [32]. Taking the answer sentence in Fig. 1 as an example, there is a directional connection from "Person4" to "Person3" via the predicate "tell", as well as from "Person1" to "sandwich" via the predicate "order". Though easy to define in first-order logic (FOL) [36], such directionality is nontrivial to incorporate into a data-driven learning process.
In this paper, we devise a direction learner on the GCN to further improve reasoning performance. Particularly, a network is first used to learn the semantic direction of input features. Then, we add the direction to the computation of the adjacency matrix of the GCN to obtain a directional adjacency matrix, which serves as directional connectivity for reasoning.

Thus, we develop a novel connective cognition network for directional visual commonsense reasoning. Our main contributions are twofold: this is the first attempt to use an end-to-end trained neural network for the cognitive reasoning process, i.e., the global and dynamic integration of local (segregated) visual neuron connectivity, which is context-sensitive with respect to a specific VQA task; moreover, we incorporate directional reasoning into a data-driven learning process. Experimental results on the VCR dataset [42] demonstrate the effectiveness of the proposed method.
On the three reasoning modes of the VCR task, i.e., Q → A, QA → R, and Q → AR, the CCN with directional reasoning significantly outperforms R2C by 3.4%, 3.2%, and 4.4%, respectively.

2 Related Work

Visual Question Answering: Recently, many effective methods have been proposed for the VQA task, including those based on attention [21, 26], multi-modal fusion [33, 12], and visual reasoning [35, 29]. Most methods focus on the recognition of visual content and spatial positions, but they lack the ability of commonsense reasoning. To advance the research of reasoning, the new task of VCR [42] has been proposed. Given a query-image pair, this task requires models to choose the correct answer and a rationale justifying why the answer is true. The challenges mainly include a thorough understanding of vision and language as well as a method to infer responses (answers or rationales). In this paper, we propose a CCN model for VCR, which proves to be effective in our experiments.

Figure 2: The framework of the CCN method. It mainly includes visual neuron connectivity, contextualized connectivity, and directional connectivity for reasoning. Here, '{U, O}' indicates the set including the output U of GraphVLAD and the object features O. 'fθ' indicates the prediction function for responses (answers or rationales). 'F' indicates the fusion operation.

NetVLAD: The work [4] proposes NetVLAD, which is used to extract local features.
Particularly, it includes an aggregation layer that clusters local features into a VLAD [19] global descriptor. Recently, NetVLAD has been demonstrated to be effective in many tasks [2, 37]. In particular, the work [2] proposes PointNetVLAD to extract a global descriptor from a given 3D point cloud. Besides, state-of-the-art models for video classification [7, 27] mostly use NetVLAD pooling to aggregate information over all frames of a video. However, the original NetVLAD learns multiple centers from the overall dataset to represent each input, which ignores the characteristics of the input data and reduces the accuracy of the representation. To alleviate this problem, we propose a conditional GraphVLAD that integrates the characteristics of the input data.

Graph Convolutional Network: GCN [22, 39, 28, 43] aims to generalize the Convolutional Neural Network (CNN) to graph-structured data. By encoding both the structure of the graph surrounding a node and the feature of the node, a GCN can learn an effective representation for every node. As GCNs have the benefit of capturing relations between nodes, many works have employed them for reasoning [17, 29]. In particular, the work [29] uses a GCN to infer answers. However, it only constructs an undirected graph for reasoning, which ignores the directional information between nodes; directional information is often considered an important factor for inference [32]. Here, we propose a directional connectivity to infer answers, which proves to be effective.

3 Connective Cognition Network

Fig. 2 shows the framework of the CCN model. It mainly includes visual neuron connectivity, contextualized connectivity, and directional connectivity for reasoning.

3.1 Visual Neuron Connectivity

The goal of visual neuron connectivity (Fig. 3(a)) is to obtain a global representation of an image, which is helpful for a thorough understanding of visual content.
It mainly includes visual element connectivity and the computation of both conditional centers and GraphVLAD.

Visual Element Connectivity. We first use a pre-trained network, e.g., ResNet [15], to obtain the feature map X ∈ R^{w×h×m} of an image, where w, h, and m indicate the width, height, and number of channels, respectively. We take each element of the feature map as a visual element. We take the output Y ∈ R^n of an LSTM [16] at the last time step as the representation of the query (the question, or the question with a correct answer).

Figure 3: (a) shows the process of visual neuron connectivity. 'AT' indicates affine transformation. (b) shows the initial state of NetVLAD. (c) shows the conditional centers after an affine transformation. Here, we use the fusion of image and question to compute the parameters γ and β. (d) and (e) show the results of NetVLAD and GraphVLAD, respectively.

In general, certain relations exist between the objects of an image [10]. As shown in the left part of Fig. 1, relations (solid and dotted lines) exist not only between elements (yellow circles) in the same object region, but also between various objects (Person1, Person3, Person4, and the background). Capturing these relations is clearly helpful for a thorough understanding of the entire scene. In this paper, we employ a GCN to capture these relations.
Specifically, we seek to construct an undirected graph G_g = {V, ξ, A}, where ξ is the set of graph edges to learn and A ∈ R^{N×N} (N = wh) is the corresponding adjacency matrix. Each node ν ∈ V corresponds to one element of the feature map, and the size of V is N. We first reshape X to X̃ ∈ R^{N×m}. Then, we define the adjacency matrix of the undirected graph as A = softmax_r(X̃X̃ᵀ) + I_d, where I_d is the identity matrix and softmax_r denotes a softmax operation along the row direction. The GCN computation is

M = A X̃,  M̃ = tanh(w_f^c * M + b_f^c) ⊙ σ(w_g^c * M + b_g^c),  (1)

where w_f^c ∈ R^{1×m×n}, w_g^c ∈ R^{1×m×n}, b_f^c ∈ R^n, and b_g^c ∈ R^n are trainable parameters, '*' denotes the convolution operation, and '⊙' denotes the element-wise product. Each row of the matrix M represents the feature vector of a node, which is a weighted sum of the features of the node's neighbors. M̃ ∈ R^{N×n} is the output of the GCN.

The Computation of Conditional Centers. Since M̃ only captures relations between visual elements and does not have the capability to fully understand the image, we use NetVLAD [19, 4] to further enhance the representation of the image. By learning multiple centers, i.e., visual words, NetVLAD can use these centers to describe a scene [4]. However, these centers are learned from the overall dataset and reflect the attributes of the dataset. In other words, the centers are independent of the current input, which ignores the characteristics of the input data and reduces the accuracy of the representation.
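The gated graph convolution over visual elements (Eq. 1) can be sketched in NumPy as follows. This is a minimal sketch, not the released implementation: since the paper's 1×1 convolutions act per node, they are written here as plain linear maps, and `Wf`, `bf`, `Wg`, `bg` are hypothetical parameter names.

```python
import numpy as np

def row_softmax(S):
    """Softmax across each row of S (the paper's softmax_r)."""
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def visual_element_gcn(X_tilde, Wf, bf, Wg, bg):
    """One gated graph-convolution step over visual elements (Eq. 1).
    X_tilde: (N, m) reshaped feature map; Wf, Wg: (m, n); bf, bg: (n,).
    The 1x1 convolutions of Eq. (1) act per node, so they appear here as
    matrix multiplications (an equivalent formulation)."""
    N = X_tilde.shape[0]
    A = row_softmax(X_tilde @ X_tilde.T) + np.eye(N)  # adjacency with self-loops
    M = A @ X_tilde                                   # aggregate neighbor features
    gate = 1.0 / (1.0 + np.exp(-(M @ Wg + bg)))       # sigma(w_g * M + b_g)
    return np.tanh(M @ Wf + bf) * gate                # gated output M~, shape (N, n)
```

Because the output is a tanh term multiplied by a sigmoid gate, every entry of M̃ lies strictly inside (-1, 1).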
Here, we make an affine transformation of the initial centers and use the transformed centers to represent an image. Concretely, we first define the initial centers C = {c_i ∈ R^n, i = 1, ..., K}. Next, based on the current input query-image pair, we apply an affine transformation [34] to the initial centers:

γ = f(⟨M̃, Ỹ⟩),  β = h(⟨M̃, Ỹ⟩),  z_i = γ c_i + β,  (2)

where ⟨a, b⟩ denotes the concatenation of a and b, and Ỹ ∈ R^{N×n} is obtained by stacking Y. We use a two-layer convolutional network for each of f and h, and z_i ∈ R^n denotes the i-th generated conditional center. We take the concatenation of the representations of the input image and its corresponding query as the input of f and h to compute the parameters γ and β. Since γ and β are learned from the input query-image pair, they reflect the character of the current input. Equipped with the affine transformation, the initial centers move towards the input features, which improves the accuracy of the residual operation of NetVLAD (Fig. 3(d)). As shown in Fig. 3(b) and (c), after the affine transformation, the centers move towards the features (colored circles). Finally, we use Z = {z_1, ..., z_K} to denote the new conditional centers.

The Computation of GraphVLAD.
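Before the equations, a compact NumPy sketch of how the conditional centers (Eq. 2) feed the soft-assignment VLAD aggregation described next (Eq. 3). This is an illustrative sketch under assumptions: the networks f and h are abstracted away (γ and β are passed in), and `W`, `b` stand for the per-center assignment parameters {w_j}, {b_j}.

```python
import numpy as np

def conditional_graphvlad_core(M, C, gamma, beta, W, b):
    """Conditional centers (Eq. 2) plus soft-assignment VLAD aggregation (Eq. 3).
    M: (N, n) visual-neuron features; C: (K, n) initial centers;
    gamma, beta: (n,) affine parameters (produced in the paper by the networks
    f and h from the fused query-image representation, assumed given here);
    W: (n, K), b: (K,) per-center assignment parameters."""
    Z = gamma * C + beta                          # Eq. (2): z_i = gamma * c_i + beta
    logits = M @ W + b
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    a = e / e.sum(axis=1, keepdims=True)          # (N, K) soft assignments
    # Eq. (3): D_j = sum_i a_ij * (M_i - z_j), vectorized over all K centers
    D = a.T @ M - a.sum(axis=0)[:, None] * Z      # (K, n)
    return D, Z
```

The vectorized form works because D_j splits into (Σ_i a_ij M_i) minus (Σ_i a_ij) z_j.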
Next, we use Z and M̃ to perform the NetVLAD operation:

D_j = Σ_{i=1}^{N} ( e^{w_jᵀ M̃_i + b_j} / Σ_{j'} e^{w_{j'}ᵀ M̃_i + b_{j'}} ) (M̃_i − z_j),  (3)

where {w_j} and {b_j} are sets of trainable parameters, one for each center z_j, j = 1, ..., K. We use D ∈ R^{K×n} to denote the output of NetVLAD.

Besides, as shown in Fig. 3(d), NetVLAD only captures relations between elements and centers. As NetVLAD is computed from visual elements among which relations exist, we expect certain relations to also exist between its outputs. Here, we employ a GCN to capture these relations. Concretely, we first concatenate the NetVLAD output and the conditional centers, i.e., Z̃ = ⟨z_1, ..., z_K⟩ with Z̃ ∈ R^{K×n}, and H = ⟨D, Z̃⟩. Then, we define the adjacency matrix of an undirected graph as B = softmax_r(HHᵀ) + I_d. The subsequent processing is the same as Eq. (1). Finally, we use U ∈ R^{K×n} to denote the output of GraphVLAD. By this operation, we obtain global information about the image, which complements the local object features O ∈ R^{L×n} (L is the number of objects) extracted by a pre-trained network and a GCN. Finally, the set S = {U, O} is taken as the global representation of the image.

3.2 Contextualized Connectivity

The goal of contextualized connectivity is not only to capture the relevance between linguistic features and the global representation S, but also to extract the deep semantics of sentences according to the visual information. Concretely, an LSTM is employed to obtain representations Q̃ ∈ R^{P×n} and Ã ∈ R^{J×n} of the query and response, respectively, where P and J indicate the lengths of the query and response. Next, we introduce the processing of the query.
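The core of this processing is a token-to-vision attention (Eq. 4 below). A minimal NumPy sketch, with `attend` as a hypothetical helper name:

```python
import numpy as np

def attend(Q, V):
    """Row-softmax attention of query tokens over visual features (Eq. 4).
    Q: (P, n) token representations; V: (K, n) GraphVLAD outputs or (L, n)
    object features. Returns an attended visual summary per token, (P, n)."""
    S = Q @ V.T                                   # token-to-feature affinities
    e = np.exp(S - S.max(axis=1, keepdims=True))
    F = e / e.sum(axis=1, keepdims=True)          # softmax over the visual axis
    return F @ V

# The fused query Q_F of the text would then be the concatenation
# np.concatenate([attend(Q, U), attend(Q, O), Q], axis=1)  # shape (P, 3n)
```

With a single visual feature the softmax weights are trivially 1, so every token just receives that feature, which is a quick sanity check on the normalization axis.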
An attention operation is first used to obtain the relevance between the query and the global representation:

F_qu = softmax_r(Q̃Uᵀ),  F_qo = softmax_r(Q̃Oᵀ),  Q_U = F_qu U,  Q_O = F_qo O.  (4)

We then take the concatenation of Q_U, Q_O, and Q̃ as Q_F ∈ R^{P×3n}, where U and O are the outputs of GraphVLAD. So far we have only obtained sequential features, rather than the structural information [41] that is helpful for a better understanding of sentence semantics. Meanwhile, an LSTM suffers from long-term information dilution [38], which weakens the capacity of the sentence representation. We therefore use a GCN to extract structural information. Concretely, we define the adjacency matrix of an undirected graph as Q = softmax_r(Q_F Q_Fᵀ) + I_d. The subsequent processing is the same as Eq. (1). Finally, we use Q_g ∈ R^{P×n} to denote the output of this network. The processing of responses is the same as that of queries, and the response representation generated by the GCN is denoted A_g ∈ R^{J×n}.

3.3 Directional Connectivity for Reasoning

Directional information is an important clue for cognitive reasoning, and using it can improve the accuracy of reasoning [32]. Here, we propose a semantic-direction-based GCN for reasoning. Concretely, we first use Ã to obtain the attention representation Q_a ∈ R^{J×n} of Q_g; the process is the same as Eq. (4). Then, Q_a and A_g are concatenated as E_qa ∈ R^{J×2n}. Next, based on E_qa, we learn the direction information:

D_qa = φ(E_qa),  G_t = D_qa D_qaᵀ,  D_t = sign(G_t),  V_e = softmax_r(abs(G_t)),  (5)

where abs denotes the absolute-value operation. Here, φ is a directional function, defined as a one-layer convolutional network without an activation.
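A NumPy sketch of this direction learner and the signed adjacency it induces (Eqs. 5 and 6). This is a minimal sketch, assuming `W_phi` as a linear stand-in for the one-layer convolution φ (no activation, so correlation signs survive):

```python
import numpy as np

def directional_adjacency(E, W_phi):
    """Signed adjacency of the direction learner (Eqs. 5-6).
    E: (J, d) fused query-response features E_qa; W_phi: (d, d') stands in
    for the paper's one-layer convolution phi."""
    D = E @ W_phi                        # D_qa = phi(E_qa), no activation
    G = D @ D.T                          # G_t: pairwise correlations
    a = np.abs(G)
    e = np.exp(a - a.max(axis=1, keepdims=True))
    V = e / e.sum(axis=1, keepdims=True) # V_e: row-softmax of |G_t|
    return np.sign(G) * V + np.eye(E.shape[0])  # H = D_t * V_e + I_d (Eq. 6)
```

Removing the self-loops leaves entries sign(G_ij)·V_ij, whose magnitudes per row sum to 1 (the softmax mass), with the sign carrying the learned direction.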
To preserve sign information for learning the direction, we do not use a ReLU activation at the last layer of the network φ. Through the sign function we obtain the direction matrix D_t, where -1 and 1 indicate negative and positive correlation, respectively. Next, based on D_t, we compute the adjacency matrix and the GCN output:

H = D_t ⊙ V_e + I_d,  M_t = H E_qa,  R_t = tanh(w_f^r * M_t + b_f^r) ⊙ σ(w_g^r * M_t + b_g^r),  (6)

where H is the adjacency matrix, and w_f^r ∈ R^{1×2n×n}, w_g^r ∈ R^{1×2n×n}, b_f^r ∈ R^n, and b_g^r ∈ R^n are trainable parameters. We take R_t ∈ R^{J×n} as the GCN output. With this operation, our model not only learns the direction information between nodes but also leverages it in the GCN computation, which results in more accurate inference. In our experiments, compared with an undirected GCN, this method improves performance significantly.

3.4 Prediction Layer and Loss Function

After obtaining the output of the reasoning module, we concatenate R_t and Ã along the channel dimension, i.e., F_c = ⟨R_t, Ã⟩ with F_c ∈ R^{J×2n}. Next, we compute a global vector representation F̃ ∈ R^{2n} via a max-pooling operation along the node dimension of F_c. This operation is helpful for obtaining a permutation-invariant output and focusing on the impact of the graph structure [30]. Finally, we compute classification logits through a two-layer MLP with ReLU activation.

In the VCR task, each query-image pair comes with four response choices. We train our model with a multi-class cross-entropy loss between the set of responses and the labels, i.e., l(y, ŷ) = −Σ_{i=1}^{4} y_i log(ŷ_i), where y denotes the ground truth and ŷ the predicted distribution.

4 Experiments

In this section, we evaluate our method on the VCR dataset.
This dataset contains 290k pairs of questions, answers, and rationales over 110k unique movie scenes. The task considers three modes: Q → A (given a question, select the correct answer), QA → R (given a question and the correct answer, select the correct rationale), and Q → AR (given a question, select the correct answer and then the correct rationale). In Q → AR mode, no points are awarded if either the answer or the rationale is wrong. The code is available at https://github.com/AmingWu/CCN.

Implementation details. We use ResNet50 [15] to extract image and object features, and BERT [11] as the word embedding. The feature map is X ∈ R^{12×24×512}. The size of the hidden state of the LSTM is set to 512. For Eq. (1), we use a one-layer GCN, and 32 centers are used to compute GraphVLAD. For Eq. (2), we use a two-layer network for each of f and h; their parameters are all set to 1 × 1024 × 512 and 1 × 512 × 512. Next, we use a one-layer GCN to capture relations between centers, with the same parameter settings as in Eq. (1). For contextualized connectivity, we use a one-layer GCN for the query and another for the response, again with the same parameter settings as in Eq. (1). For Eq. (5), a one-layer GCN is used for reasoning, and the parameters of the network φ are set to 1 × 1024 × 512. During training, we use the Adam optimizer with a learning rate of 2 × 10⁻³.

4.1 The Performance of Our Method

We evaluate our method on the three modes of the VCR task. The results are shown in Table 1. Some state-of-the-art VQA methods, e.g., MUTAN [6] and BottomUpTopDown [1], do not perform well on this task.
This shows that these VQA methods lack inference ability, which results in unsatisfactory performance on a task requiring high-level commonsense reasoning. Meanwhile, compared with the baseline, our method is 3.4%, 3.2%, and 4.4% higher than R2C on the three modes of the VCR task, respectively, demonstrating its effectiveness.

Figure 4: Qualitative examples from CCN. Correct choices are highlighted in blue. Incorrect inferences are in red. The number after each option indicates the score given by our model. The first row is a successful case. The second and last rows correspond to two failure cases.

In Fig. 4, we show some qualitative examples. As these examples show, compared with classical VQA datasets [3, 14], both the questions and answers of the VCR dataset are much more complex; recognition of visual content alone is not enough to choose the right answers and rationales. The first row of Fig. 4 is a successful case: our model deduces the correct answer and its corresponding correct rationale with a high score. The second row shows a failure case, where the model chooses the right answer but the wrong rationale. However, the rationale chosen by our model is an explanation for the answer based on an understanding of the entire scene. Though reasonable from this view, compared with the ground truth our rationale is slightly indirect and unclear. This shows that when visual reasoning involves more commonsense, interpreting the answer is more difficult. Moreover, though the model fails, the chosen rationale does match the visual content, which suggests that the GraphVLAD module is helpful for obtaining an effective visual representation. The third row is also a failure case, where our model chooses the wrong answer and rationale.
From these two failure cases, we can see that when the question, answer, and rationale involve a great deal of commonsense, the model easily makes a wrong selection; a strong inference ability is indeed required to choose the right answer and rationale. More examples can be found in the Appendix.

Table 1: The performance of our CCN model on the VCR dataset.

Model                 | Q → A (Val / Test) | QA → R (Val / Test) | Q → AR (Val / Test)
Revisited VQA [18]    | 40.5 / 39.4        | 33.7 / 34.0         | 13.8 / 13.5
BottomUpTopDown [1]   | 42.8 / 44.1        | 25.1 / 25.1         | 10.7 / 11.0
MLB [20]              | 45.5 / 46.2        | 36.1 / 36.8         | 17.0 / 17.2
MUTAN [6]             | 44.4 / 45.5        | 32.0 / 32.2         | 14.6 / 14.6
R2C (baseline) [42]   | 63.8 / 65.1        | 67.2 / 67.3         | 43.1 / 44.0
CCN                   | 67.4 / 68.5        | 70.6 / 70.5         | 47.7 / 48.4

4.2 Ablation Analysis

In this section, based on the validation set, we conduct ablation analyses of our proposed conditional GraphVLAD, the contextualized connectivity for extracting sentence semantics, and the directional connectivity for reasoning, respectively.
Figure 5: t-SNE plot of conditional centers. Here, the red pentagrams, blue circles, and green rhombuses indicate the initial centers and two different sets of conditional centers, respectively. (a) and (b) show the inputs used to compute the blue and green centers, respectively.

GraphVLAD. The number of centers is an important hyper-parameter for GraphVLAD. If too few centers are used, the representation ability of GraphVLAD is weakened; conversely, too many centers increase the number of parameters and the computational cost. In Q → A, QA → R, and Q → AR modes, the performance with 16 centers is 66.4%, 69.2%, and 46.4%, and with 48 centers it is 67.1%, 69.8%, and 46.9%. For our method, 32 centers perform best. In Table 2, we analyze the effect of conditional centers and the GCN in GraphVLAD. Here, 'No-C + No-G' indicates that we use neither conditional centers nor the GCN in the computation of GraphVLAD, while the other components of our model are kept unchanged.
We can see that employing conditional centers and the GCN improves performance significantly. In particular, compared with NetVLAD, which corresponds to the 'No-C + No-G' case, our Conditional GraphVLAD performs significantly better, showing the effectiveness of our method. Besides, in Fig. 5, we show two t-SNE [25] examples of conditional centers. The queries of Fig. 5(a) and (b) are "Who does the dog belong to?" and "What will happen after the person pushes the lifeboat over the edge of the ship?". We can see that the positions of the centers vary depending on the visual content and its corresponding query. When an image contains rich content and its corresponding query is complex, e.g., Fig. 5(b), the centers learn to spread further apart from each other in order to capture the rich visual information needed to answer the query. When the image content and its query contain relatively less information, e.g., Fig. 5(a), the centers adaptively become more concentrated in order to focus on the visual information related to the query. In this way, we obtain an effective visual representation, which is helpful for the subsequent contextualization and reasoning.

Contextualized Connectivity. In this paper, we employ a separate GCN to capture the semantics of queries and of responses. To show that this choice is effective, we compare it with a common alternative, i.e., using a single GCN to process the concatenation of vision, query, and response.
In the Q → A, QA → R, and Q → AR modes, this common alternative achieves 66.5%, 68.1%, and 45.7%, which is clearly weaker than our method.

Table 2: Ablation analysis of GraphVLAD.
Method        Q → A   QA → R   Q → AR
No-C + No-G   65.8    68.3     45.6
No-C          66.5    69.6     46.6
No-G          66.9    69.4     46.5
C + G         67.4    70.6     47.7

Table 3: Ablation of Directional Reasoning.
Method        Q → A   QA → R   Q → AR
No-R          65.9    67.9     45.3
LSTM-R        64.8    67.1     43.9
GCN           66.5    69.4     46.4
D-GCN         67.4    70.6     47.7

Directional Connectivity for Reasoning. In this paper, we propose a directional reasoning method and compare it with other reasoning methods. The results are shown in Table 3. Here, ‘No-R’ indicates that no reasoning is used, and ‘LSTM-R’, ‘GCN’, and ‘D-GCN’ indicate reasoning based on an LSTM, an undirected GCN, and a directed GCN, respectively; all other components of our model are kept the same. The method without reasoning performs clearly worse, which shows that reasoning is a necessary step for our method. The LSTM-based reasoning is also weak, which suggests that an LSTM cannot capture complex relations effectively. Finally, our directional GCN significantly outperforms the undirected GCN, which shows that using directional information in reasoning improves the accuracy of inference.

5 Conclusion

We propose a connective cognition network for directional visual commonsense reasoning. The model mainly consists of visual neuron connectivity, contextualized connectivity, and directional connectivity for reasoning. In particular, for visual neuron connectivity, we propose a conditional GraphVLAD module to represent an image; for reasoning, we propose a directional GCN. The experimental results demonstrate the effectiveness of our method.
Particularly, in the Q → AR mode, our method is 4.4% higher than R2C.

Acknowledgement

This work is supported by the NSFC (under Grant 61876130, 61932009, U1509206).

References

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.

[2] Mikaela Angelina Uy and Gim Hee Lee. PointNetVLAD: Deep point cloud based retrieval for large-scale place recognition. In CVPR, 2018.

[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.

[4] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.

[5] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

[6] Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. MUTAN: Multimodal Tucker fusion for visual question answering. In ICCV, 2017.

[7] Shweta Bhardwaj, Mukundhan Srinivasan, and Mitesh M Khapra. Efficient video classification using fewer frames. In CVPR, 2019.

[8] Michał Bola and Bernhard A Sabel. Dynamic reorganization of brain functional networks during cognition. NeuroImage, 114:398–413, 2015.

[9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille.
DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.

[10] Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Shuicheng Yan, Jiashi Feng, and Yannis Kalantidis. Graph-based global reasoning networks. arXiv preprint arXiv:1811.12814, 2018.

[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[12] Peng Gao, Hongsheng Li, Shuang Li, Pan Lu, Yikang Li, Steven CH Hoi, and Xiaogang Wang. Question-guided hybrid convolution for visual question answering. In ECCV, 2018.

[13] Ross Girshick. Fast R-CNN. In ICCV, 2015.

[14] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[16] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[17] Idan Schwartz, Seunghak Yu, Tamir Hazan, and Alexander Schwing. Factor graph attention. In CVPR, 2019.

[18] Allan Jabri, Armand Joulin, and Laurens van der Maaten. Revisiting visual question answering baselines. In ECCV. Springer, 2016.

[19] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In CVPR. IEEE Computer Society, 2010.

[20] Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325, 2016.

[21] Kyung-Min Kim, Seong-Ho Choi, Jin-Hwa Kim, and Byoung-Tak Zhang.
Multimodal dual attention memory for video story question answering. In ECCV, 2018.

[22] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[23] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In ECCV. Springer, 2016.

[24] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.

[25] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[26] Mateusz Malinowski, Carl Doersch, Adam Santoro, and Peter Battaglia. Learning visual question answering by bootstrapping hard attention. In ECCV, 2018.

[27] Antoine Miech, Ivan Laptev, and Josef Sivic. Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905, 2017.

[28] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, 2017.

[29] Medhini Narasimhan, Svetlana Lazebnik, and Alexander Schwing. Out of the box: Reasoning with graph convolution nets for factual visual question answering. In Advances in Neural Information Processing Systems, 2018.

[30] Will Norcliffe-Brown, Stathis Vafeias, and Sarah Parisot. Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, 2018.

[31] Hae-Jeong Park and Karl Friston. Structural and functional brain networks: from connections to cognition. Science, 342(6158):1238411, 2013.

[32] Vimla L. Patel and Marco F. Ramoni. Expertise in context,
chapter Cognitive Models of Directional Inference in Expert Medical Reasoning, pages 67–99. MIT Press, Cambridge, MA, USA, 1997.

[33] Gao Peng, Hongsheng Li, Haoxuan You, Zhengkai Jiang, Pan Lu, Steven Hoi, and Xiaogang Wang. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. arXiv preprint arXiv:1812.05252, 2018.

[34] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.

[35] Remi Cadene, Hedi Ben-younes, Matthieu Cord, and Nicolas Thome. MUREL: Multimodal relational reasoning for visual question answering. In CVPR, 2019.

[36] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition, 2009.

[37] Yongyi Tang, Xing Zhang, Lin Ma, Jingwen Wang, Shaoxiang Chen, and Yu-Gang Jiang. Non-local NetVLAD encoding for video classification. In ECCV, 2018.

[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

[39] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[40] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.

[41] Kun Xu, Lingfei Wu, Zhiguo Wang, Mo Yu, Liwei Chen, and Vadim Sheinin. Exploiting rich syntactic information for semantic parsing with graph-to-sequence model. arXiv preprint arXiv:1808.07624, 2018.

[42] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi.
From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.

[43] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.