NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 2328
Title: Visual Question Answering with Question Representation Update (QRU)

Reviewer 1

Summary

The paper proposes multiple reasoning layers through which question representations are iteratively updated based on the image. A reasoning layer consists of two parts. The first part, called Question-Image interaction, is a multilayer perceptron that takes in the previous layer's question representation and the image representation. The second part is weighted pooling, where the updated representations from the Question-Image interaction are summed according to attention weights learned through backpropagation. The model uses object proposals to generate candidate image regions, and these regions are encoded with a convolutional neural network. The encoded image features interact with the question representation, and a soft attention mechanism is applied to generate an attention distribution over the image regions.
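To make the two parts concrete, a minimal PyTorch-style sketch of such a reasoning layer is given below; the concatenation-based fusion, the tanh nonlinearity, and the single-layer interaction MLP are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReasoningLayer(nn.Module):
    """One question-update step: fuse the question with every region
    (the "Question-Image interaction" MLP), then pool the fused vectors
    with softmax attention weights (the "weighted pooling")."""
    def __init__(self, dim):
        super().__init__()
        self.interaction = nn.Linear(2 * dim, dim)  # single-layer MLP over [question; region]
        self.score = nn.Linear(dim, 1)              # scores each fused vector for attention

    def forward(self, q, regions):
        # q: (batch, dim) question representation; regions: (batch, R, dim) region features
        q_tiled = q.unsqueeze(1).expand_as(regions)
        fused = torch.tanh(self.interaction(torch.cat([q_tiled, regions], dim=2)))
        weights = F.softmax(self.score(fused).squeeze(2), dim=1)  # (batch, R) attention
        return (weights.unsqueeze(2) * fused).sum(dim=1)          # updated question, (batch, dim)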

Qualitative Assessment

Strengths: The technical contributions are a clever and simple extension/combination of existing ideas such as the “Neural Reasoner” [B. Peng, Z. Lu, H. Li, and K.-F. Wong. Towards neural network-based reasoning. arXiv preprint arXiv:1508.05508, 2015], spatial coordinates [R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. arXiv preprint arXiv:1511.04164, 2015], and the soft attention mechanism [K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015]. The paper is well-written and easy to follow; in particular, the architecture of the model and the explanations for it are modular and simple (image understanding layer, question encoding layer, reasoning layer, and answering layer). I have not yet encountered a VQA system that changes the question representation based on the image; this novelty adds strength to the paper.

Weaknesses: In Section 3.1, the paper states that adding an 8D representation remedies the lack of spatial information about object location (one common form of this encoding is sketched at the end of this review). Is there an ablation comparing the model with and without these spatial coordinates? Without any experimental results, it is hard to tell whether these extra coordinates actually help the model understand the spatial relationships between object proposals. If they turn out not to be very helpful, the model might still suffer from the same issue, which seems like a major drawback. In Section 4.3 (starting at line 209), the analysis claims that, compared to SAN, the proposed model deals only with selected object proposal regions that have a high probability of being an object, and thus gives better results when answering questions involving objects. Also, line 216 states that a similar pattern can be observed on the VQA dataset, with a prominent improvement on the Other type. However, the results in Table 3 do not seem to support this claim: compared with SAN, the proposed model performs better only on Yes/No questions and actually performs worse than SAN on the Other type. The results and analysis seem inconsistent, and it is difficult to see concrete, quantitative evidence supporting the claimed strength of the model, namely its ability to answer Other-type questions. Overall, the proposed model seems to be an extension/application of the “Neural Reasoner” [B. Peng, Z. Lu, H. Li, and K.-F. Wong. Towards neural network-based reasoning. arXiv preprint arXiv:1508.05508, 2015] using images and an attention mechanism, which weakens the novelty of the approach.

=== post rebuttal ===
The improved performance mentioned in the rebuttal is good, but it does not show any additional strength of the model, as other models should improve similarly with fine-tuning of the visual representation (I assume that is what "better results by fine-tuning the initial learning." means). I think the two contributions stated in lines 43-46 are limited: 1. Iteratively updating the question representation is similar to [21], but applied to images. No ablations showing the benefit of this idea, e.g., against a SAN [33] architecture or a single or double attention over the image features, are presented, so it remains unclear how beneficial it is. A comparison to related work is not sufficient, as it is unclear whether the difference is due to the experimental setup or to the different architecture, especially since the difference in performance is small. 2. The use of object proposals is not novel, as it has already been done in [23]. Thus the limited novelty, combined with the limited improvements over related work and the absence of ablation experiments, leads me to recommend rejecting the paper.
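For reference, the 8D spatial representation mentioned in the weaknesses above commonly encodes the normalized box corners, centre, width, and height (as in the Hu et al. reference). Below is a minimal sketch assuming that layout, which may differ from the paper's exact formulation.

def spatial_feature(box, img_w, img_h):
    """8-D spatial encoding of a region proposal: normalized corners,
    centre, width, and height. The exact layout/normalization used by
    the paper is an assumption here."""
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    x_c, y_c = x_min + w / 2.0, y_min + h / 2.0
    return [x_min / img_w, y_min / img_h,
            x_max / img_w, y_max / img_h,
            x_c / img_w, y_c / img_h,
            w / img_w, h / img_h]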

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 2

Summary

This paper adapts the "neural reasoning" framework to the problem of visual question answering. In particular, it addresses the issue that some questions are inherently ambiguous or difficult without seeing the image for context. By breaking the picture into candidate image regions, the whole image becomes a sequence of "facts" (similar to the setting in the bAbI dataset), to which the "neural reasoning" model can then be applied. Evaluation on COCO-QA shows decent results that are slightly higher than those of the state-of-the-art models. On the VQA dataset, the results are comparable to other systems but slightly worse.

Qualitative Assessment

I like the basic idea of this work, and it seems that the authors have successfully modified a model designed for the somewhat artificial bAbI task and managed to show strong performance on the more realistic VQA task. The general story makes sense to me, but I'm not fully convinced that the strong empirical results are indeed due to the claimed advantages of the model. For instance, some of the example questions seem very ambiguous. How many such questions are answered incorrectly by other models but correctly by the proposed method? In addition, Figure 3 and the associated questions do not seem intuitive to me, as the question is ambiguous anyway. The results on the newer VQA dataset were not as impressive as those on the COCO-QA dataset, but there is little discussion of the pros and cons of the proposed models compared with other state-of-the-art methods, especially in the VQA experiments.

Confidence in this Review

1-Less confident (might not have understood significant parts)


Reviewer 3

Summary

This paper describes a neural model derived from [21] for visual question answering. The main idea is to iteratively update the question representation by selecting relevant image regions. Each input image is encoded by a VGG model, and features are extracted from each candidate region (20 candidate regions per image). The image representation is transformed into a latent space shared with the question features. Each word in the question sequence is embedded into a word vector with an embedding matrix, and the embedded vector is fed into a GRU at each time step. The final hidden state is taken as the representation of the question and is also transformed into the common latent space. The question is updated through each reasoning layer with a multilayer perceptron and weighted pooling (an attention mechanism is used). The final answer is generated through a softmax layer. Evaluation results on COCO-QA show that the model outperforms other state-of-the-art models. Results on the VQA dataset are also comparable with the state of the art.
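A rough, runnable skeleton of the data flow summarized above is sketched below; the dimensions, the single-layer interaction MLP, and the tanh projections are assumptions for illustration, not details confirmed by the paper.

import torch
import torch.nn as nn

class QRUSkeleton(nn.Module):
    """GRU-encode the question, project question and region features into a
    shared latent space, apply a stack of reasoning layers (question-image
    interaction + weighted pooling), and classify the answer via softmax."""
    def __init__(self, vocab, answers, dim=512, img_dim=4096, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, dim)  # region CNN features -> shared space
        self.fuse = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(layers)])
        self.score = nn.ModuleList([nn.Linear(dim, 1) for _ in range(layers)])
        self.classify = nn.Linear(dim, answers)

    def forward(self, question_ids, region_feats):
        # question_ids: (B, T); region_feats: (B, R, img_dim), e.g. R = 20 proposals
        _, h = self.gru(self.embed(question_ids))
        q = h.squeeze(0)                          # final hidden state = question representation
        v = torch.tanh(self.img_proj(region_feats))
        for fuse, score in zip(self.fuse, self.score):
            fused = torch.tanh(fuse(torch.cat([q.unsqueeze(1).expand_as(v), v], dim=2)))
            att = torch.softmax(score(fused).squeeze(2), dim=1)
            q = (att.unsqueeze(2) * fused).sum(dim=1)  # updated question representation
        return self.classify(q)                   # answer logits (softmax applied in the loss)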

Qualitative Assessment

The paper is generally clear and well-written. It could be improved by including basic descriptions of the baseline models and other state-of-the-art approaches. An attention mechanism is employed in the weighted pooling; it may contribute to the performance improvement, which seems reasonable. Comparing different question-representation pooling mechanisms in this paper could be a significant contribution, since the original paper [21] did not show any results for this part (nor did it indicate which specific mechanism was employed). It would be great to provide an explanation for using a single layer in the MLP. Also, the third example question in Figure 3 could be clarified: it seems the question representation stays the same after the update in one reasoning layer. It would be great if the authors could provide a thorough analysis of the results. Although the model is derived from [21], this appears to be the first study that modifies that model to address the VQA problem. The authors have done a nice job of applying an effective text question answering model to the VQA task. This paper, as well as [21] (the paper on the original QA model), seems interesting and opens a new area of research for QA/VQA.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

This paper studies the task of visual question answering. In order to answer questions about images, this work derives its model from neural reasoners. Neural reasoners take a question and the supporting facts necessary to answer it, and produce an updated query based on the refinement provided by those facts. This work proposes to use object detections (with their spatial positions) as the facts necessary to answer the question. The results on standard test datasets for VQA show some promise but do not match the performance of current state-of-the-art methods.

Qualitative Assessment

The paper is well written and the experiments appear sound. The overall motivation of having a more directed and detailed question rather than a vague one is also important for VQA. The method is derived from the neural reasoning system previously developed for text-based QA. The authors do not offer any improvements or modifications to the reasoning system but rather use objects in the image as facts instead of the textual facts in the traditional neural reasoning system. Hence, the overall novelty of the proposed method is minor.

Additionally, I am not fully convinced that object detection boxes alone provide enough facts to answer the questions in the current datasets. First, such methods invariably perform poorly on counting questions, which the authors also note (line 220). Second, the objects and their locations should only be helpful for questions directly related to objects. The authors contend that 'most of the questions are related to objects' (lines 97-99), but that is not the case for the VQA dataset. Many questions refer to an object but actually seek information near the object (What is next to the couch? What is to the left of the dog? etc.). Moreover, the object proposals are not drawn based on the entities referred to in the question but rather by using automated 'top-ranked' regions. So the use of objects in the scene (even in conjunction with their locations) serves only as a limited source of knowledge. Since the entire set of 'facts' also includes the whole image, it is not possible to study the effect of including the edge boxes. The authors do not present ablation studies to show that the inclusion of these facts does indeed help the reasoning system. This could be done by training a controlled system devoid of the 'overall' image feature and consisting only of the object proposals. By the same token, the attention map is also produced over top-ranked image regions that are generated without knowledge of the question. A grid layout (often obtained from CNNs before the pooling layers, [1], [2], etc.) is also not used, which means that certain image regions (those not covered by the top-ranked regions) are unavailable for attention. The resolution provided by including the whole image does not offer the granularity required for attending to specific parts.

Similarly, the authors show a number of examples of question representation update where the updated question, at times, shows a greater degree of detail. However, I am not convinced that this can be taken as an indication of successful representation update, for several reasons. 1) The authors use a nearest-neighbor approach to find the closest question in the dataset that matches the updated question representation (a minimal sketch of such a lookup is given after the references below). It is unsurprising that the updated representations contain references to objects present in the image, since the input to the reasoning system consists of explicit object locations and features extracted from them. It is, however, uncertain whether the updated representation actually asks about the same entity as the original question. Certainly, the updated questions contain more detail, but is the meaning of the question unchanged relative to the original? 2) What do the failure cases look like? 3) The examples only include questions from the COCO-QA dataset, which are derived from captions via NLP algorithms and are notoriously bad in terms of grammar, composition, and phrasing. The above comments do not, however, mean that the updated representation won't produce a better query; they only mean that the proposed nearest-neighbor method of assessing or demonstrating the method's efficacy is dubious.

In a nutshell, the paper is well written and presents an interesting method of using a reasoning engine for VQA. However, the authors do not offer any new innovation on the neural reasoning side, and the image-based fact generation they propose may be inadequate to completely reason about the images. The general applicability of the work in its current form is low, especially considering that similar methods offer better performance on VQA. For a stronger case, the visual facts need to be question-specific and more detailed so that they can offer a better basis for reasoning. The efficacy of the proposed work should also be demonstrated explicitly with the help of ablation studies.

References
1. Kevin J. Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual question answering. In CVPR, 2016.
2. Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, 2016.
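For clarity on point 1) above, the nearest-neighbor lookup over updated question representations can be thought of as follows; this is a minimal sketch assuming cosine similarity in the representation space (the actual distance metric used by the authors is not specified here).

import numpy as np

def nearest_question(updated_rep, dataset_reps, dataset_questions):
    """Return the dataset question whose representation is closest (by cosine
    similarity) to the updated question representation. Cosine similarity is
    an illustrative assumption, not necessarily the authors' choice."""
    q = updated_rep / np.linalg.norm(updated_rep)
    d = dataset_reps / np.linalg.norm(dataset_reps, axis=1, keepdims=True)
    return dataset_questions[int(np.argmax(d @ q))]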

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 5

Summary

This paper proposes a method for visual question answering that updates the representation of the question iteratively so that it better discriminates the actual content of the image. This is a very interesting idea, reminiscent of the 20 questions game, where uncertainty is reduced iteratively. The idea is inspired by the neural reasoner paper [21], but is now applied to visual images. The neural network model relies on certain pre-trained networks (CNN for images, GRU for text), which are augmented by a "reasoning" layer. A softmax layer gives the final answer. The model is evaluated on challenging real-world datasets with good results.

Qualitative Assessment

The paper is explained well, with several qualitative results in addition to the quantitative evaluation. The implementation details give additional information on how to adjust learning rates, etc. To improve the expository quality of the paper, some limitations and failure cases could be illustrated through figures; in this sense, Figure 5 could be complemented by another figure showing limitations. This would provide guidance for future work. Specific ambiguities in the parsing of text or confusions between categories could be illustrated. Although the method is evaluated experimentally, it is still unclear why the "question update" iterations should converge. In fact, I feel they could very well oscillate between two conflicting interpretations of the image content. It would be nice if the authors commented on this case.

Confidence in this Review

1-Less confident (might not have understood significant parts)


Reviewer 6

Summary

This paper proposes a framework based on a deep convolutional neural network and a gated recurrent unit for visual question answering. The proposed model learns to answer questions by iteratively updating the question representation. Instead of using the entire input image, the authors use object proposals to obtain multiple candidate regions and focus on the image regions that are most related to the question.

Qualitative Assessment

This paper is well written and easy to follow. The idea of using region proposals is borrowed from object detection, but it sounds reasonable for improving the performance of visual question answering. My questions: 1. The paper claims that the proposed model can iteratively update the question, but it seems that the question is only updated once (Figures 1 and 3). Is the number of iterations the same as the number of reasoning layers? Could you use a recurrent network in the reasoning layer to recursively update the question representation and achieve better accuracy? 2. In Table 3, the proposed method outperforms many previous methods but is slightly worse than FDA [11] and DMN+ [30]. Is there any explanation for this result on the VQA dataset?

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)