Sunday, December 8 through Saturday, December 14, 2019, at the Vancouver Convention Center
The authors report good results on the VCR dataset. However, the dataset is fairly new, and it is hard to judge how meaningful these results are. The overall method is not particularly novel: we already knew from visual question answering work that a richer image representation, e.g., one incorporating attributes, should help (see, e.g., "Solving Visual Madlibs with Multiple Cues"). The tagging component appears to be the slightly more novel part.
Post-rebuttal: The authors addressed my (minor) concern, and I'd recommend acceptance of the paper, as it would provide a strong baseline to the community.

General Comments:
(G1) L25-L29 need citations [1, 29, 34]
(G2) L38: At this point, the readers are left wondering why the proposed model is "Intricate". I request the authors to add a clear explanation here.
(G3) L126: What do the authors mean by "joint encoder is identical"? Is it the same architecture with different parameters?

Typos:
L42: "to this end we" -> "to this end, we"
Many of the gains come from a more thorough approach to analyzing the language (e.g., synsets) and from new, finer-grained labels. A somewhat unfair characterization of this work would be that its gains come primarily from "cleaning up" the data. I'm surprised that there is no benefit from additional fine-tuning of BERT/ResNet, and I would appreciate more insight into the modeling design choices (and/or ablations to this end).