Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Originality: The paper proposes a novel model for the recently introduced VCR task. The main novelty of the proposed model lies in the GraphVLAD and directional GCN modules. The paper identifies Narasimhan et al., NeurIPS 2018, which used a GCN to infer answers in VQA, as one of the closest prior works; however, that work constructs an undirected graph and ignores the directional information between the graph nodes. This paper uses a directed graph instead and shows the usefulness of incorporating directional information. It would be good for the paper to include more related work on the GraphVLAD front.
Quality: The paper evaluates the proposed approach on the VCR dataset and compares it with the baselines and the previous state of the art, demonstrating that the proposed work improves significantly on the previous best performance. The paper also reports ablation studies that ablate each contributed module of the proposed CCN model, demonstrating the usefulness of each component. It would be useful if the paper could shed some light on the failure modes of the proposed model.
Clarity: I would like the authors to clarify the following:
1. Based on the description in Sec 3.1, GraphVLAD needs the query LSTM representation as an input. However, this is not consistent with Fig. 2. Can the authors please correct Fig. 2?
2. In Eq. 3, it seems that the denominator should contain b_j’ instead of b_k’. Can the authors please comment on this and correct it accordingly?
3. Sec 3.1: how are the conditional centers (z1, ..., zK) initialized?
4. L149: can the authors clarify how object features are extracted? What kind of pre-trained network is used, and how does it differ from the network used to extract the image feature map X that is used in GraphVLAD?
5. How is the query representation Y defined in L100 different from the Q~ defined in L154?
Significance: The proposed model is novel and interesting, and it addresses a new and useful task. The ideas behind the GraphVLAD module and directional reasoning seem impactful and could be used for other vision and language tasks as well. The experiments demonstrate that the proposed model improves the state of the art on VCR significantly.
--- Post-rebuttal comments ---
The authors have responded to and addressed most of my clarification questions. It would have been nice to see some related work on the GraphVLAD front in the rebuttal too (very briefly), but I am trusting the authors to do justice to this in the final version of the paper. Regarding the concerns from my fellow reviewers:
1. Connections to the brain: I don't feel too strongly about this.
2. Results on VQA-CP / VQA: I treat VCR as a different task from VQA / VQA-CP because it focusses more on commonsense reasoning, which is not sufficiently present in VQA / VQA-CP. It would be nice to test the proposed model on VQA / VQA-CP as well. The authors have done so, but only on VQA-CP, and their VQA-CP results do not beat the state of the art (47.70 on VQA-CP v2 by Selvaraju et al., ICCV 2019, and 41.17 on VQA-CP v2 by Ramakrishnan et al., NeurIPS 2018). However, I do not consider failing to beat the state of the art on VQA / VQA-CP a reason to reject this paper.
3. GloVe in the language model: the proposed model uses BERT and beats the previous state of the art that also uses BERT (R2C). Zellers et al. already show that VQA models using GloVe perform much worse on VCR than VQA models using BERT. Given that BERT has been established as much better than GloVe for VCR, I am not sure why it is not enough to keep using and comparing with BERT without also showing results with GloVe (assuming a fair comparison with the previous state of the art).
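The contrast between undirected and directed graphs raised under Originality can be illustrated with a small sketch. This is not the paper's directional GCN, whose exact formulation is not reproduced in this review; it is a generic single graph-convolution step with row normalization, and the function name `gcn_layer`, the toy three-node chain, and the convention that `A[i, j] = 1` means node i receives a message from node j are all assumptions made for illustration.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One generic graph-convolution step: ReLU(D^-1 (A + I) H W).

    A: (N, N) adjacency; A[i, j] = 1 means node i receives from node j
       (an assumed convention, not taken from the paper).
    H: (N, F) node features, W: (F, F_out) weights.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)    # incoming degree incl. self-loop
    return np.maximum((A_hat / deg) @ H @ W, 0.0)

# Toy chain 0 -> 1 -> 2 with one-hot features and identity weights,
# so each output row shows which nodes contributed to that node.
A_dir = np.zeros((3, 3))
A_dir[1, 0] = 1.0   # node 1 receives from node 0
A_dir[2, 1] = 1.0   # node 2 receives from node 1
H = np.eye(3)
W = np.eye(3)

out_dir = gcn_layer(A_dir, H, W)            # directed: node 0 sees only itself
out_undir = gcn_layer(A_dir + A_dir.T, H, W)  # undirected: node 0 also sees node 1
print(out_dir)
print(out_undir)
```

In the directed case node 0 aggregates only its own features, while symmetrizing the adjacency lets information flow back from node 1 to node 0; this lost asymmetry is what the Originality comment refers to.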
The paper proposes a new model for VQA and explores it in the context of the VCR dataset for visual commonsense reasoning. The model uses two main components, first NetVLAD [1] and then a Graph Convolution Network [2], to propagate contextual information across the visual objects, which is interesting and different from prior work in VQA (especially the use of NetVLAD). The experimental results are good, and the paper is generally well written and clearly structured (except for some parts of the model description, as discussed below).
Model
- The paper motivates the new model mainly by comparing it to the neuronal connectivity within the brain. I feel that in this case the comparison is not well justified or convincing. While relational reasoning is certainly important for visual question answering, I would be happy if more evidence could be presented and discussed in the paper to establish the proposed connection between the relational model and the operation of the brain.
- NetVLAD: Throughout the model description section, the discussion assumes that readers are familiar with VLAD and NetVLAD. I think it would be helpful to discuss them at least briefly, either in the Related Work section or at the beginning of the subsection on “The Computation of Conditional Centers” (Page 4), e.g. to explain at a high level what the method does and provide a bit more detail on how the new model uses it.
Experiments
- The paper provides experiments only on a new dataset for commonsense reasoning called VCR. It would thus be really useful if experiments were also provided for more standard datasets, such as VQA1/2 or the more recent VQA-CP, to allow better comparison with the many existing approaches to visual question answering.
- Most VQA models so far tend to use word embeddings + LSTM to represent language. Could you please perform an ablation experiment for your model using such an approach? While it’s very likely that BERT helps improve performance, and indeed the baselines for VCR use BERT, I believe it is important to also have results based on the currently more common approach, again to allow better comparison with most existing VQA approaches.
- (Page 8) The t-SNE visualization is explained only briefly, and it is not clear to me what the main insight or conclusion is: we see that for one image (blue) the centers are more concentrated and for the other (green) they are spread further apart. It would be good if the authors could discuss in more detail why this might happen, which aspects of the image affect the density of the centers, and how the density affects downstream reasoning performance.
Clarity
- (Page 2) The second paragraph, which gives an overview of the model, is quite long and hard to follow. I think it would be good to make this part more structured, perhaps by splitting the paragraph into several that each present one of the stages the model goes through.
[1] Arandjelovic, Relja, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. "NetVLAD: CNN architecture for weakly supervised place recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297-5307. 2016.
[2] Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
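As an example of the kind of brief, high-level explanation of NetVLAD requested in the Model comments above, the standard aggregation can be sketched as follows. This is a minimal NumPy sketch of plain NetVLAD as described by Arandjelovic et al., not the paper's GraphVLAD module or its conditional centers; the function name `netvlad_aggregate`, the argument names, and the shapes are assumptions chosen for illustration.

```python
import numpy as np

def netvlad_aggregate(X, centers, W, b):
    """Aggregate N local descriptors into one (K*D,) vector, NetVLAD style.

    X:       (N, D) local descriptors (e.g. one per feature-map location)
    centers: (K, D) cluster centers
    W, b:    (K, D) and (K,) parameters of the soft-assignment layer
    """
    # Soft assignment: softmax over the K clusters for each descriptor.
    logits = X @ W.T + b                              # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a = a / a.sum(axis=1, keepdims=True)              # each row sums to 1

    # Weighted sum of residuals between descriptors and each center.
    residuals = X[:, None, :] - centers[None, :, :]   # (N, K, D)
    V = (a[:, :, None] * residuals).sum(axis=0)       # (K, D)

    # Intra-normalization per cluster, then global L2 normalization.
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    v = V.reshape(-1)
    v = v / (np.linalg.norm(v) + 1e-12)
    return v, a

# Example with random inputs (sizes are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
centers = rng.normal(size=(3, 4))
W = rng.normal(size=(3, 4))
v, a = netvlad_aggregate(X, centers, W, np.zeros(3))
print(v.shape)
```

In words: each local descriptor is softly assigned to every cluster, its residual to each center is weighted by that assignment, and the per-cluster sums are normalized and concatenated into a single image-level descriptor.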
This paper proposes neural network models that represent stepwise, human-like cognitive processes for visual understanding, and applies them to the VCR dataset. The idea behind the methods is very interesting, and the results are good enough to outperform other state-of-the-art methods. However, the paper would be better if the following issues were addressed. Firstly, the notation is rather confusing and inconsistent. It would be nice to make it more concise and consistent. Secondly, as reported in [38, Zellers et al., CVPR'19], GloVe is used as the language model for the comparative methods in Table 1. Since the paper does not comment on this, readers may wrongly assume that all of the other models use BERT, as the proposed models do. It would be better to state this explicitly and to report the performance in both cases. Thirdly, some experimental settings are not explained. How is 'Rationale' configured in each mode? For example, in the case of QA->R, is A concatenated with Q in textual form after Q->A is finished? If so, the trailing characters may be discarded depending on the values of P and J; what values are used for them? Optionally, it would be helpful to report how well the model generalizes by showing results on well-known visual QA datasets such as VQA 1.0 and 2.0. Minor comments: In the references, the order of authors in  does not match the original. Also, some commas are missing between the author names. In Eq. 3, k' in the denominator should be j'.