{"title": "Learning Conditioned Graph Structures for Interpretable Visual Question Answering", "book": "Advances in Neural Information Processing Systems", "page_first": 8334, "page_last": 8343, "abstract": "Visual Question answering is a challenging problem requiring a combination of concepts from Computer Vision and Natural Language Processing. Most existing approaches use a two streams strategy, computing image and question features that are consequently merged using a variety of techniques. Nonetheless, very few rely on higher level image representations, which can capture semantic and spatial relationships. In this paper, we propose a novel graph-based approach for Visual Question Answering. Our method combines a graph learner module, which learns a question specific graph representation of the input image, with the recent concept of graph convolutions, aiming to learn image representations that capture question specific interactions. We test our approach on the VQA v2 dataset using a simple baseline architecture enhanced by the proposed graph learner module. We obtain promising results with 66.18% accuracy and demonstrate the interpretability of the proposed method. Code can be found at github.com/aimbrain/vqa-project.", "full_text": "Learning Conditioned Graph Structures for\nInterpretable Visual Question Answering\n\nWill Norcliffe-Brown\n\nAimBrain Ltd.\n\nwill.norcliffe@aimbrain.com\n\nEfstathios Vafeias\n\nAimBrain Ltd.\n\nstathis@aimbrain.com\n\nSarah Parisot\nAimBrain Ltd.\n\nsarah@aimbrain.com\n\nAbstract\n\nVisual Question answering is a challenging problem requiring a combination of\nconcepts from Computer Vision and Natural Language Processing. Most existing\napproaches use a two streams strategy, computing image and question features\nthat are consequently merged using a variety of techniques. Nonetheless, very few\nrely on higher level image representations, which can capture semantic and spatial\nrelationships. 
In this paper, we propose a novel graph-based approach for Visual Question Answering. Our method combines a graph learner module, which learns a question-specific graph representation of the input image, with the recent concept of graph convolutions, aiming to learn image representations that capture question-specific interactions. We test our approach on the VQA v2 dataset using a simple baseline architecture enhanced by the proposed graph learner module. We obtain promising results with 66.18% accuracy and demonstrate the interpretability of the proposed method. Code can be found at github.com/aimbrain/vqa-project.\n\n1 Introduction\n\nVisual Question Answering (VQA) is an emerging topic that has received an increasing amount of attention in recent years [1]. Its attractiveness lies in the fact that it combines two fields that are typically approached individually (Computer Vision and Natural Language Processing (NLP)). This allows researchers to look at both problems from a new perspective. Given an image and a question, the objective of VQA is to answer the question based on the information provided by the image. Understanding both the question and the image, as well as modelling their interactions, requires us to combine Computer Vision and NLP techniques. The problem is generally framed in terms of classification, such that the network learns to produce answers from a finite set of classes, which facilitates training and evaluation. Most VQA methods follow a two-stream strategy, learning separate image and question embeddings from deep Convolutional Neural Networks (CNNs) and well-known word embedding strategies respectively. 
Techniques to combine the two streams range from element-wise product to bilinear pooling [2, 3], as well as attention-based approaches [4].\nRecent computer vision works have been exploring higher-level representations of images, notably using object detectors and graph-based structures for better semantic and spatial image understanding [5]. Representing images as graphs allows one to explicitly model interactions, so as to seamlessly transfer information between graph items (e.g. objects in the image) through advanced graph processing techniques such as the emerging paradigm of graph CNNs [6, 7, 8, 9]. Such graph-based techniques have been the focus of recent VQA works, for abstract image understanding [10] or object counting [11], reaching state-of-the-art performance. Nonetheless, an important drawback of the proposed techniques is the fact that the input graph structures are heavily engineered, image-specific rather than question-specific, and not easily transferable from abstract scenes to real images. Furthermore, very few approaches provide a means to interpret the model's behaviour, an essential aspect that is often lacking in deep learning models.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: Illustration of our graph learning module's ability to condition the bounding box connections based on the question. Two graph structures are learned from the same image but tailored to answer different questions. The thickness and opacity of graph nodes (bounding boxes) and edges (white lines) are determined by node degree and edge weights, showing the main objects and relationships of interest.\n\nContributions. In this paper, we propose a novel, interpretable, graph-based approach for visual question answering. 
Most recent VQA approaches focus on creating new attention architectures of increasing complexity, but fail to model the semantic connections between objects in the scene. Here, we propose to address this issue by introducing a prior in the form of a scene structure, defined as a graph which is learned from observations according to the context of the question. Bounding box object detections are defined as graph nodes, while graph edges conditioned on the question are learned via an attention-based module. This not only identifies the most relevant objects in the image associated with the question, but also the most important interactions (e.g. relative position, similarities), without any handcrafted description of the structure of the graph. Learning a graph structure allows us to learn question-specific object representations that are influenced by relevant neighbours using graph convolutions. Our intuition is that learning a graph structure not only provides strong predictive power for the VQA task, but also interpretability of the model's behaviour by inspecting the most important graph nodes and edges. Experiments on the VQA v2 dataset confirm our hypothesis. Combined with a relatively simple baseline, our graph learner module achieves 66.18% accuracy on the test set and provides interpretable results via visualisation of the learned graph representations.\n\n2 Related work\n\n2.1 Graph Convolutional Neural Networks\n\nGraph CNNs (GCNs) are a relatively new concept, aiming to generalise Convolutional Neural Networks (CNNs) to graph-structured data. CNNs intrinsically exploit the regular grid-like structure of data defined on the Euclidean domain (e.g. images). Extending this concept to non-regularly structured data (e.g. meshes, or social/brain networks) is non-trivial. 
We distinguish graph CNNs defined in the spectral [12, 6] and spatial [8, 9, 13] domains.\nSpectral GCNs exploit concepts from graph signal processing [14], using analogies with the Euclidean domain to define a graph Fourier transform, allowing convolutions to be performed in the spectral domain as multiplications. Spectral GCNs are the most principled and out-of-the-box approach; however, they are limited by the requirement that the graph structure be the same for all training samples, as the trained filters are defined on the graph Laplacian's basis.\n\nFigure 2: Overview of the proposed model architecture. We model the VQA problem as a classification problem, where each answer from the training set is a class. The core of our method is the graph learner, which takes as input a question encoding, and a set of object bounding boxes with corresponding image features. The graph learner module learns a graph representation of the image that is conditioned on the question, and models the relevant interactions between objects in the scene. We use this graph representation to learn image features that are influenced by their relevant neighbours using graph convolutions, followed by max-pooling, element-wise product and fully connected layers.\n\nSpatial GCNs tend to be more engineered, as they require the definition of a node ordering and a patch operator. Several approaches have been defined specifically for regular meshes [13]. Monti et al. [8] recently provided a general spatial GCN formulation, learning a patch operator as a mixture of Gaussians. Recently, Graph Attention Networks were proposed in [9], modelling the convolution operator as an attention operation on node neighbours, where the attention weights can be interpreted as graph edges. Similar to our work, both of these methods compute a type of attention to learn a graph structure. 
However, while these methods learn a fixed graph structure, we aim to learn a dynamic graph that is conditioned on the context of a query. Accordingly, our approach extends the notion of an edge to be adaptive to the context.\n\n2.2 Visual Question Answering\n\nExplicit modelling of object interactions through graph representations has recently received growing interest. A graph-based approach was notably proposed in [10], combining graph representations of questions and abstract images with graph neural networks. This approach beat the state of the art by a large margin, demonstrating the potential of graph-based methods for VQA. This approach is however not easily applicable to natural images, where the scene graph representation is not known a priori.\nThe decision to use object proposals as image features resulted in major improvements in VQA performance. This idea was first introduced in [15], outperforming the state of the art with a relatively simple model. Such image representations have since been exploited to model interactions between objects through implicit and explicit graph structures [11, 16], with a focus on counting. [11] compute a graph based on the outer product of the attention weights of proposed features. The computed graph is altered with explicitly engineered features solely to improve their baseline model's ability to count. [16] propose an iterative approach relying on object similarities to improve the model's counting abilities and interpretability, which was only evaluated on counting questions. The main aim of both approaches is to eliminate duplicate object detections.\n\n3 Methods\n\nAn overview of the method is shown in Fig. 2. 
We develop a deep neural network that combines spatial, image and textual features in a novel manner in order to answer a question about an image. Our model first computes a question representation using word embeddings and a Recurrent Neural Network (RNN), and a set of object descriptors comprising bounding box coordinates and image feature vectors. Our graph learning module then learns an adjacency matrix of the image objects that is conditioned on a given question. This adjacency matrix enables the next layers (the spatial graph convolutions) to focus not only on the objects but also on the object relationships that are the most relevant to the question. Our convolved graph features are max-pooled, and combined with the question embedding using a simple element-wise product to predict a class pertaining to the answer of the given question.\n\n3.1 Computing model inputs\n\nThe first stage of our model is to compute embeddings for both the input image and question. We convert a given image into a set of K visual features using an object detector. Object detections are essential for the subsequent step of our model, as each bounding box will constitute a node in the question-specific graph representations we are learning. An embedding is produced for each proposed bounding box, which is the mean of the corresponding area of the convolutional feature map. Using such object features has been observed to yield better performance in VQA tasks [3, 11], as this allows the model to focus on object-level features rather than a pure CNN which produces a grid of features. For each question, we use pre-trained word embeddings as suggested in [3] to convert the question into a variable-length sequence of embeddings. 
Then we use a dynamic RNN with a GRU cell [17] to encode the sequence of word embeddings as a single question embedding q.\n\n3.2 Graph learner\n\nIn this section, we introduce the key element of our model and our main contribution: the graph learner module. This novel module produces a graphical representation of an image conditioned on a question. It is general, easy to implement and, as we highlight in Section 4.3, learns complex relationships between features that are interpretable and query-dependent. The learned graph structure drives the spatial graph convolutions by defining node neighbourhoods, which, in contrast to previous models such as [18, 19], allows unary and pairwise attention to be learned naturally, as the adjacency matrix contains self loops.\nWe seek to construct an undirected graph G = {V, E, A}, where E is the set of graph edges to learn and A ∈ R^{N×N} the corresponding adjacency matrix. Each vertex v ∈ V with |V| = N corresponds to a detected image object (bounding box coordinates and associated feature vector v_n ∈ R^d). We aim to learn the adjacency matrix A so that each edge (i, j, A_{i,j}) ∈ E is conditioned on the question encoding q. Intuitively, we need to model the similarities between feature vectors as well as their relevance to the given question. This is done by first concatenating the question embedding q onto each of the N visual features v_n, which we write as [v_n ‖ q]. We then compute a joint embedding as:\n\ne_n = F([v_n ‖ q]), n = 1, 2, ..., N (1)\n\nwhere F : R^{d_v+d_q} → R^{d_e} is a non-linear function and d_v, d_q, d_e are the dimensions of the image feature vectors, question encoding and joint embedding respectively. 
By concatenating the joint embeddings e_n together into a matrix E ∈ R^{N×d_e}, it is then possible to define an adjacency matrix for an undirected graph with self loops as A = EE^T, so that A_{i,j} = e_i^T e_j.\nSuch a definition does not impose any constraints on the graph sparsity, and could therefore yield a fully connected adjacency matrix. Not only is this a problem computationally, but the vast majority of VQA questions require attending to only a small subset of the graph nodes. The learned graph structure will be the backbone of the subsequent graph convolution layers, where the objective is to learn a representation of object features that is conditioned on the most relevant, question-specific neighbours. This requires a sparse graph structure focusing on the most relevant aspects of the image. In order to learn a sparse neighbourhood system for each node, we adopt a ranking strategy as:\n\nN(i) = topm(a_i) (2)\n\nwhere topm returns the indices of the m largest values of an input vector, and a_i denotes the ith row of the adjacency matrix. In other words, the neighbourhood system of a given node will correspond to the nodes with which it has the strongest connections.\n\n3.3 Spatial graph convolutions\n\nGiven a question-specific graph structure, we then exploit a method of graph convolutions to learn new object representations that are informed by a neighbourhood system tailored to answer the given question. The graph vertices V (i.e. the bounding boxes and their corresponding features) are notably characterised by their location in the image, making the problem of modelling their interactions inherently spatial. 
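The graph-learner computation of Eqs. (1) and (2), joint embedding, inner-product adjacency, and top-m neighbourhood selection, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: a fixed random ReLU projection stands in for the learned function F, and the question/object features are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dv, dq, de, m = 36, 2052, 1024, 512, 16

V = rng.standard_normal((N, dv))   # one row of object features per bounding box
q = rng.standard_normal(dq)        # question embedding (placeholder)

# Eq. (1): e_n = F([v_n || q]); a random linear map + ReLU stands in for the
# learned non-linear function F (an assumption for illustration only).
W = rng.standard_normal((dv + dq, de)) / np.sqrt(dv + dq)
E = np.maximum(np.concatenate([V, np.tile(q, (N, 1))], axis=1) @ W, 0.0)

# Adjacency of an undirected graph with self loops: A = E E^T.
A = E @ E.T

# Eq. (2): sparse neighbourhood = indices of the m strongest connections per node.
def topm(a_i, m):
    return np.argsort(a_i)[-m:]

neighbourhoods = np.stack([topm(A[i], m) for i in range(N)])
```

Note that A is symmetric by construction, so the learned graph is undirected, exactly as in the text.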
Additionally, a lot of VQA questions require that the model has an awareness of the orientation and relative position of features in an image, an issue that many previous approaches have neglected.\nAs a result, we opt to use a graph CNN approach inspired by [8], operating directly in the graph domain and heavily relying on spatial relationships. Crucially, their method captures spatial information through the use of a pairwise pseudo-coordinate function u(i, j) which defines, for each vertex i, a coordinate system centred at i, with u(i, j) being the coordinates of vertex j in that system. Our pseudo-coordinate function u(i, j) returns a polar coordinate vector (ρ, θ), describing the relative spatial positions of the centres of the bounding boxes associated with vertices i and j. We considered both Cartesian and polar coordinates as input to the Gaussian kernels and observed that polar coordinates worked significantly better. We posit that this is because polar coordinates separate orientation (θ) and distance (ρ), providing two disentangled factors to represent spatial relationships.\nAn essential step and challenge of graph CNNs is the definition of a patch operator describing the influence of each neighbouring node that is robust to irregular neighbourhood structures. Monti et al. [8] propose to do so using a set of K Gaussian kernels of learnable means and covariances, where the mean is interpretable as a direction and distance in pseudo-coordinates. We obtain a kernel weight w_k(u) for each k, such that the patch operator is defined at kernel k for node i as:\n\nf_k(i) = Σ_{j∈N(i)} w_k(u(i, j)) v_j, k = 1, 2, ..., K (3)\n\nwhere f_k(i) ∈ R^{d_v} and N(i) denotes the neighbourhood of vertex i as described in Eq. 2. 
Considering a given vertex i, we can think of the output of the patch operator as a weighted sum of the neighbouring features, where the set of Gaussian kernels describes the influence of each neighbour on the output of the convolution operation.\nWe adjust the patch operator so that it includes an additional weighting factor conditioned on the produced graph edges:\n\nf_k(i) = Σ_{j∈N(i)} w_k(u(i, j)) v_j α_{ij} (4)\n\nwith α_{ij} = s(a_i)_j, where s(.)_j is the jth element of a scaling function (defined here as a softmax of the selected adjacency matrix elements). This more general form means that the strength of messages passed between vertices can be weighted by information in addition to spatial orientation. For our use case, this can be interpreted as how much attention the network should pay to the relationship between two nodes in terms of answering a question. Thus the network learns to attend to the visual features in a pairwise manner conditioned on the question.\nFinally, we define the output of the convolution operation at vertex i as a concatenation over the K kernels:\n\nh_i = ‖_{k=1}^{K} G_k f_k(i) (5)\n\nwhere each G_k ∈ R^{(d_h/K)×d_v} is a matrix of learnable weights (the convolution filters), with d_h as the chosen dimensionality of the output convolved features. This results in a convolved graph representation H ∈ R^{N×d_h}.\n\n3.4 Prediction layers\n\nOur convolved graph representation H is computed through L spatial graph convolution layers. We then compute a global vector representation of the graph h_max via a max-pooling layer across the node dimension. This operation was chosen so as to obtain a permutation-invariant output, and for its simplicity, so that the focus is on the impact of the graph structure. This vector representation of the graph can be considered a highly non-linear compression of the graph, where the representation has been optimised for answering the question at hand. 
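A toy NumPy sketch of the weighted patch operator (Eq. 4) and the concatenated convolution output (Eq. 5) may help make the shapes concrete. It is an illustration only: the Gaussian kernel means/variances and the filter matrices G_k, which are learned in the actual model, are fixed random placeholders here, as are the adjacency matrix and node features.

```python
import numpy as np

rng = np.random.default_rng(1)
N, dv, dh, K, m = 36, 2052, 2048, 8, 16

V = rng.standard_normal((N, dv))              # node (object) features, placeholder
centres = rng.random((N, 2))                  # bounding-box centres in [0, 1]
A = rng.standard_normal((N, N)); A = A @ A.T  # stand-in for the learned adjacency
nbrs = np.argsort(A, axis=1)[:, -m:]          # Eq. (2): top-m neighbourhoods

# Pseudo-coordinates u(i, j): polar coordinates (rho, theta) of j centred at i.
def u(i, j):
    d = centres[j] - centres[i]
    return np.array([np.hypot(*d), np.arctan2(d[1], d[0])])

# K Gaussian kernels; means/variances are fixed here but learned in the model.
mu = rng.random((K, 2)); sigma = 0.5
def w(k, coords):
    return np.exp(-np.sum((coords - mu[k]) ** 2) / (2 * sigma ** 2))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Eq. (4): patch operator, with alpha_ij a softmax over the selected row entries.
def f(k, i):
    alpha = softmax(A[i, nbrs[i]])
    return sum(w(k, u(i, j)) * V[j] * alpha[t] for t, j in enumerate(nbrs[i]))

# Eq. (5): concatenate the K filtered responses G_k f_k(i) into h_i.
G = rng.standard_normal((K, dh // K, dv)) / np.sqrt(dv)
H = np.stack([np.concatenate([G[k] @ f(k, i) for k in range(K)]) for i in range(N)])
```

With d_h = 2048 and K = 8, each kernel contributes a 256-dimensional slice of h_i, giving the convolved representation H of shape N x d_h.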
We then merge the question q and image h_max encodings through a simple element-wise product. Finally, we compute classification logits through a 2-layer MLP with ReLU activations.\n\n3.5 Loss function\n\nThe VQA task is cast as a multi-class classification problem, where each class corresponds to one of the most common answers in the training set. Following [3], we use a sigmoid activation function with soft target scores, which has been shown to yield better predictions. The intuition behind this is that it allows us to consider multiple correct answers per question and provides more information regarding the reliability of each answer. Assuming each question is associated with n valid provided answers, we compute the soft target score of each class as t = (number of votes)/n. If an answer is not in the top answers (i.e. the considered classes) then it has no corresponding element in the target vector. We then compute the multi-label soft loss, which is simply the sum of the binary cross entropy losses for each element in the target vector:\n\nL(t, y) = −Σ_i [ t_i log(1/(1 + exp(−y_i))) + (1 − t_i) log(exp(−y_i)/(1 + exp(−y_i))) ] (6)\n\nwhere y is the logit vector (i.e. the output of our model).\n\n4 Evaluation\n\n4.1 Dataset and preprocessing\n\nWe evaluate our model using the VQA 2.0 dataset [20], which contains a total of 1,105,904 questions and 204,721 images from the COCO dataset. The dataset is split up roughly into proportions of 40%, 20%, 40% for train, validation and test sets respectively. Each question in the dataset is associated with 10 different answers obtained by crowdsourcing. Accuracy on this dataset is computed so as to be robust to inter-human variability as:\n\nacc(a) = min{(number of times a is chosen)/3, 1} (7)\n\nWe consider the 3000 most common answers in the train set as possible answers for our network to predict. 
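The soft-target loss of Eq. (6) is a summed binary cross-entropy between the soft scores t_i and the sigmoid of the logits y_i, and the robust accuracy of Eq. (7) is a clipped vote count. A small self-contained sketch, with made-up target and logit values (e.g. 7 of 10 annotators voting for class 0):

```python
import numpy as np

def soft_bce_loss(t, y):
    """Multi-label soft loss (Eq. 6): summed binary cross-entropy between
    soft target scores t and the sigmoid of the logits y."""
    s = 1.0 / (1.0 + np.exp(-y))
    return -np.sum(t * np.log(s) + (1.0 - t) * np.log(1.0 - s))

def vqa_accuracy(times_chosen):
    """VQA accuracy (Eq. 7): full credit once an answer matches 3 annotators."""
    return min(times_chosen / 3.0, 1.0)

# Made-up example: 10 annotators, 7 voted class 0, 3 voted class 1, none class 2.
t = np.array([0.7, 0.3, 0.0])
y = np.array([2.0, -1.0, -3.0])   # hypothetical model logits

loss = soft_bce_loss(t, y)        # loss is approximately 1.389 for these values
```

Answers absent from the 3000 considered classes simply get no target element, which is why their votes cannot contribute to the loss.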
Each question is tokenized and mapped into a sequence of 300-dimensional pre-trained GloVe word embeddings [21]. The COCO images are encoded as a set of 36 object bounding boxes with corresponding 2048-dimensional feature vectors, as described in [15]. We normalise the bounding box corners by the image height and width so they lie in the interval [0, 1]. The bounding box corners, which provide absolute spatial information, are then concatenated onto the image feature vectors, so that they become 2052-dimensional (d_v = 2052). Pseudo-coordinates are computed as the polar coordinates of the bounding box centres and give the model relative spatial information.\n\n4.2 Implementation\n\nOur question encoder is a dynamic Gated Recurrent Unit (GRU) [17] with a hidden state size of 1024 (d_q = 1024). Our function F (see Eq. 1), which learns the adjacency matrix, comprises two dense linear layers of size 512 (d_e = 512). We use L = 2 spatial graph convolution layers of dimensions 2048 and 1024, so that (d_h1 = 2048, d_h2 = 1024). All dense layers and convolutional layers are activated using Rectified Linear Unit (ReLU) activation functions. During training we use dropout with probability 0.5 on the image features and on all but the final dense layers' nodes. We train for 35 epochs using a batch size of 64 and the Adam optimizer [22] with a learning rate of 0.0001, which we halve after the 30th epoch. Parameters are chosen based on validation-set performance when training on the training set alone. The model is then trained on the training and validation sets using the chosen parameters for evaluation on the test set.\n\n4.3 Results\n\nWe investigated the influence of our model's main parameters on the classification accuracy, namely the neighbourhood size (m) and the number of Gaussian kernels (K). We trained models for m from 8 to 32 with a step of 4, and K ∈ {2, 4, 8, 16, 32}. Results are reported in Fig. 3. 
Figure 3: Results of parameter exploration. The VQA score for each question type is reported for a variety of settings of the number of kernels (K) and the neighbourhood size (m). The left-hand plot shows varying m while keeping K = 8; the right-hand plot shows varying K while keeping m = 16.\n\nTable 1: VQA 2.0 standard test set results - comparison with baselines and current state-of-the-art methods\n\nModel | All | Y/N | Num. | Other\nReasonNet [19] | 64.61 | 78.86 | 41.98 | 57.39\nBottom-Up [3] | 65.67 | 82.20 | 43.90 | 56.26\nCounting module [11] | 68.41 | 83.56 | 51.39 | 59.11\nkNN graph | 61.00 | 79.35 | 41.63 | 49.70\nAttention | 61.90 | 79.87 | 42.48 | 50.95\nOurs | 66.18 | 82.91 | 47.13 | 56.22\n\nOur results suggest that m = 16 and K = 8 are optimal parameters. Performance drops for m < 16 and is stable for m > 16, while an optimal performance is seen for K = 8, in particular for the number questions. In the remaining experiments, we therefore set K = 8 and m = 16.\nTable 1 shows our results on the VQA 2.0 test set. We report results on a single version of our model and compare to recent state-of-the-art VQA methods. We compare our model to [19] (ReasonNet), the approach by [3] (Bottom-Up) who won the 2017 VQA challenge 1, and the recent approach proposed in [11] (Counting module), which focuses on optimising counting questions. To highlight the importance of learning the graph structure, we also report results using a k-nearest neighbour graph (based on distances between bounding box centres) (kNN graph) as input to the graph convolution layers, and train a baseline model which replaces the graph learning and convolutions with a simple question-to-image attention (Attention). Even though it is simpler than our approach, the Attention model provides a good intuition of the advantage of using a graph structure.\nDespite not being heavily engineered to optimise performance, our model's performance is close to the state of the art. 
Notably, it improves substantially on numeric questions (with the exception of [11], which is specifically designed for this purpose). It should also be noted that both Bottom-Up and Counting module use the object detector with a variable number of objects per image, while we use a fixed number. Our method compares favourably with our baselines (Attention, kNN graph). This highlights not only the advantage of learning a graph structure, but also of using an attention-based approach, as kNN graph is the only method without attention.\nFigures 4 and 5 show examples of learned graph structures for multiple questions and images, highlighting the most important nodes and edges of each learned graph (largest node degree/number of connections and edge weights respectively).\n\n1http://www.visualqa.org/\n\nFigure 4: Visual examples of the learned graph structures for multiple images and questions per image. Box and edge thickness/opacity correspond to the strength of the node degree and edge weight respectively, showing the graph nodes and edges that were considered to be the most relevant to answer the question.\n\nGraph CNNs learn feature representations of the graph nodes that are influenced by their closest neighbours in the graph. As a result, the most important nodes can be seen as the locations where the network is \"looking\", as they will strongly influence most feature representations. Edges represent the most important relationships between objects. As a result, one can identify whether the network focused on the right objects by looking at the most relevant nodes, while edge weights inform us of the relationships that were considered the most relevant to answer the question.\nFigure 4 reports examples of learned graph structures leading to successful classification. We report results for multiple questions per image, highlighting how the learned graph is tailored to the question at hand. 
Figure 5: Visual examples of interpretable failure cases. Box and edge thickness/opacity correspond to the strength of the node degree and edge weight respectively, showing the graph nodes and edges that were considered to be the most relevant to answer the question.\n\nFigure 5 shows failure cases and allows us to study the interpretability of the proposed model. Figures 5-a,d show cases where the model looked at the wrong object, mistakes which can be attributed to missed object detections (the correct purse for Fig. 5-a, the man's face for Fig. 5-d). Figure 5-b shows that while the focus is on all animals on the bed, the cat is considered to be the most relevant object, hence the answer. Finally, Fig. 5-c shows a case that may not be adapted to bounding-box-based models.\n\n5 Discussion\n\nIn this paper, we propose a novel graph-based approach for Visual Question Answering. Our model learns a graph representation of the input image that is conditioned on the question at hand. It then exploits the learned graph structure to learn better image features that are conditioned on the most relevant neighbours, using the novel and powerful concept of graph convolutions. Experiments on the VQA v2 dataset yield promising results and demonstrate the relevance and interpretability of the learned graph structure.\nSeveral extensions and improvements could be considered. Our main objective was to show the potential and interpretability of learning a graph structure. We found that, with a fairly simple architecture, the learned graph structure was very effective; further work might consider more complex architectures to refine the learned graph further. For example, scalar edge weights may not be able to capture the full complexity of the relationships between graph items, and so producing vector edges could yield improvements. This could be implemented as an adjacency matrix per convolutional kernel. 
An important limitation of our approach is the use of an object detector as a preprocessing step. The performance of the model is highly dependent on the quality of the detector, which can yield duplicates or miss objects (as highlighted in the results section). Furthermore, our image features comprise a fixed number of detected objects per image, which can further exacerbate this problem. Finally, while the focus of this paper is the VQA problem, our approach could be adapted to more general problems, such as few-shot learning tasks where one could learn a graph structure from training samples.\nPerformance on the VQA v2 dataset is still rather limited, which could be linked to issues within the dataset itself. Indeed, several questions are subjective and cannot be associated with a correct answer (e.g. \"Would you want to fly in that plane?\") [23]. In addition, modelling the problem as multi-class classification is the most common approach in recent VQA methods, and can strongly limit performance. Questions often require answers that cannot be found in the predefined answers (e.g. \"what time is it?\"). This explains the low performance on \"Number\" questions, as this category comprises several questions requiring answers absent from the training set.\n\nReferences\n\n[1] Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 163:21–40, 2017.\n\n[2] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. CoRR, abs/1606.01847, 2016.\n\n[3] Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711, 2017.\n\n[4] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 
Hierarchical question-image co-attention for visual question answering. In NIPS, pages 289–297, 2016.\n\n[5] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In CVPR, volume 2, 2017.\n\n[6] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.\n\n[7] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. CoRR, abs/1511.05493, 2015.\n\n[8] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, pages 5425–5434, 2017.\n\n[9] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. ICLR, 2018.\n\n[10] Damien Teney, Lingqiao Liu, and Anton van den Hengel. Graph-structured representations for visual question answering. CoRR, abs/1609.05600, 2016.\n\n[11] Yan Zhang, Jonathon S. Hare, and Adam Prügel-Bennett. Learning to count objects in natural images for visual question answering. ICLR, 2018.\n\n[12] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pages 3844–3852, 2016.\n\n[13] Davide Boscaini, Jonathan Masci, Emanuele Rodolà, and Michael Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In NIPS, pages 3189–3197, 2016.\n\n[14] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. 
IEEE Signal Processing Magazine, 30(3):83–98, 2013.\n\n[15] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. arXiv preprint arXiv:1707.07998, 2017.\n\n[16] Alexander Trott, Caiming Xiong, and Richard Socher. Interpretable counting for visual question answering. ICLR, 2018.\n\n[17] Kyunghyun Cho, Bart van Merriënboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734, 2014.\n\n[18] Idan Schwartz, Alexander G. Schwing, and Tamir Hazan. High-order attention models for visual question answering. In NIPS, pages 3667–3677, 2017.\n\n[19] Ilija Ilievski and Jiashi Feng. Multimodal learning and reasoning for visual question answering. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, NIPS, pages 551–562, 2017.\n\n[20] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017.\n\n[21] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.\n\n[22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[23] Kushal Kafle and Christopher Kanan. Visual question answering: Datasets, algorithms, and future challenges. 
Computer Vision and Image Understanding, 163:3–20, 2017.", "award": [], "sourceid": 5053, "authors": [{"given_name": "Will", "family_name": "Norcliffe-Brown", "institution": "AimBrain"}, {"given_name": "Stathis", "family_name": "Vafeias", "institution": "AimBrain"}, {"given_name": "Sarah", "family_name": "Parisot", "institution": "Huawei"}]}