Summary and Contributions: The authors develop a novel architecture for vision-and-language tasks such as VQA, GQA, and caption generation. In the first stage, they use Faster R-CNN to encode images and a WordPiece tokenizer for the words. Both modalities use BERT to learn good representations, and a contrastive loss is used to match the two modalities. Three different encoders are compared: i) a shared BERT encoder across both text and images; ii) a BERT encoder that can attend to the visual part when predicting text; iii) a cross-modal BERT encoder that can attend to both sides. The second stage is the main contribution: they learn a relationship probe by computing pairwise distances between the embeddings of each modality and making sure the resulting matrices are consistent across different data augmentations. Conventional data augmentation is used for images, while for text they use a pre-trained translator for back-translation (En-De-En and En-Ru-En). For text, a supplementary supervised loss is used to align the relations with a pre-parsed dependency tree. The ablation study shows that the cross-modal encoder provides much better accuracy on NLVR2, and that the relationship probes provide a mild improvement over the cross-modal encoder. Other benchmarks show consistent improvement over baselines.
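To make sure I understood the second stage correctly: my reading is that the probe embeddings of two augmented views of the same input should induce the same pairwise-distance matrix. A minimal NumPy sketch of such a consistency objective (my interpretation, with my own names, not the authors' code):

```python
import numpy as np

def pairwise_distances(emb):
    """Pairwise Euclidean distances between row embeddings: (n, d) -> (n, n)."""
    diff = emb[:, None, :] - emb[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def consistency_loss(emb_view1, emb_view2):
    """Mean absolute difference between the two views' distance matrices.

    The idea, as I read it: the relations induced by the probe should be
    stable under data augmentation.
    """
    d1 = pairwise_distances(emb_view1)
    d2 = pairwise_distances(emb_view2)
    return np.abs(d1 - d2).mean()

rng = np.random.default_rng(0)
e = rng.normal(size=(5, 16))
noise = 0.01 * rng.normal(size=(5, 16))
print(consistency_loss(e, e))          # identical views -> 0.0
print(consistency_loss(e, e + noise))  # small perturbation -> small loss
```

If this reading is right, the loss is zero exactly when the two views agree on every pairwise relation, which is what the consistency requirement seems to demand.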
Strengths: This work addresses an interesting challenge: learning relationships between different parts of images and objects in an unsupervised fashion to target the long tail of rare relationships. The self-supervised relationship probing is novel to my knowledge and is clever.
Weaknesses: A big weakness of the paper is readability. The resulting algorithm has many parts: two training stages and a total of 7 different loss components. That makes it really hard to follow and evaluate. It is also hard to relate to existing architectures; e.g., I believe that the current architecture (besides relationship probing) is very similar to LXMERT, but it is not exactly clear what the differences are. Another weakness is the experimental evaluation. While the ablation study is useful, it still doesn't show whether the model actually learns useful relationships. The gain could simply come from the data augmentation provided by the pre-trained translator or the parser. One of their contributions is to achieve competitive performance without the need for an ancillary corpus, by using the parser and translator. While this is interesting, it makes comparison to baselines very difficult. Also, the qualitative evaluation in Figure 4 is highly unconvincing. ========= POST REBUTTAL ========== The rebuttal is well made and addresses my concerns. Table A shows a good ablation study, and Table B makes a clever comparison of BLEU scores in their query matching. I've increased my score to 6.
Correctness: In both the abstract and the introduction, the authors justify this work by the long tail of relationships, but none of this is really evaluated in the experimental setup. The ablation study highlights a mild improvement when adding the relationship probe. This makes the model better than the reported baselines, but they omit UNITER, which has significantly higher accuracy. In Table 5, they compare to BUTD, which dates from 2017 on the leaderboard. I would assume that other algorithms have come out since then?
Clarity: For a reader who is not savvy in the latest vision-and-language developments, the paper is really hard to follow. I would recommend a major rewrite of the algorithm description with a more organised and concise approach. The figures could also be drawn differently to provide a more intelligible explanation of the algorithms; e.g., the stack of layers is repeated 8 times in Figure 2, and the horizontal lines showing the links between the positives and negatives are really confusing. Why not use two sub-figures, one that describes the stack of layers in more detail and another that highlights how the different losses are composed? Also, there should be one figure for Stage 1 and another for Stage 2. Finally, the fact that v and w are used for both pre-BERT and post-BERT embeddings is extremely confusing. While I appreciate the many details included for reproducibility purposes, I doubt I would be able to reproduce this work reliably, mostly because of this confusion. There is no code included in the submission and no mention of open-sourcing the code. Also, I could not read Figure 3 at the maximum zoom of Mendeley on my computer; I had to bring in another reader.
Relation to Prior Work: No (see above).
Summary and Contributions: The paper suggests a self-supervised approach to modeling relationships among entities in the same modality by leveraging cross-modal and intra-modal relationships, and evaluates the approach on a suite of vision-and-language tasks.
Strengths: Learning relationships without supervision, only by leveraging the intrinsic alignment between modalities, is a good idea. Scene graphs can be a powerful representation for many downstream tasks since they provide many useful abstractions, and doing away with the need for labels can bring obvious benefits in training them. Contrastive methods have seen much success in the self-supervised learning community and their application in learning relationship graphs is novel to my knowledge. The technique appears to lead to modest performance gains on NLVR2. The approach does not require massive amounts of data.
Weaknesses: While the method is referred to as self-supervised, it appears the authors use supervised methods to extract ground truth for object labels and sentence dependency trees. In particular, the dependency trees are used to guide learning of the relationship-graph-inducing distance metric in both vision and language. I am concerned as to whether this method relies on the strong structural information provided by GT parses to guide learning. The authors clarify that they call their approach self-supervised w.r.t. the lack of predicate labels, but there is still a good amount of supervision seemingly used. While there are some performance gains on NLVR2, as shown in Table 3, more ablations on the benefit of adding "Stage 2" would help further demonstrate the effectiveness of the method. In particular, since the approach is framed as one that is more useful for downstream tasks than on its own, this evaluation seems a little weak. On the other hand, there is also not much analysis presented of the relationship graphs that the model does learn. Both of these factors combine to make it hard to judge the method's effectiveness as a whole. As an aside, it is not clear to me how much variation is acquired by back-translation, or whether that variation is sensible. It may have been interesting to explore other methods of text augmentation.
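To illustrate what I mean by other methods of text augmentation: even something as simple as random word dropout could serve as a cheap baseline against back-translation (a hypothetical sketch of mine, not from the paper):

```python
import random

def word_dropout(sentence, p=0.1, rng=None):
    """Randomly drop each word with probability p -- one simple
    alternative to back-translation for generating text views."""
    rng = rng or random.Random(0)
    words = sentence.split()
    kept = [w for w in words if rng.random() >= p]
    # Fall back to the original sentence if everything was dropped.
    return " ".join(kept) if kept else sentence
```

Comparing the probe's consistency under such trivial augmentations versus back-translation would help show whether the translator's paraphrase diversity actually matters.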
Correctness: The methodology and approach seem sensible and correct, and the authors appear to follow common fine-tuning practices for evaluation. I am not sure whether the claim that the method is self-supervised can fully stand.
Clarity: Overall, I did find the paper well written and clear, but I had a few smaller remarks.
- 3.3.1, Eq. 3: How do you obtain p(v_i) and p(w_j)? Presumably, p(w_j) is a dot-product softmax over the whole text vocabulary? Is p(v_i) predicting the object class label as output by an off-the-shelf detector? If so, I believe it should be made clearer that you do use supervision in the form of bounding box labels.
- Is the function g simply notation for accessing the hidden vector output by the intra-modality vision encoder at masked locations, or is there a learnable layer there? The phrase "outputs the unmasked visual feature" was not clear, since it sounds like it outputs the ground-truth RoI feature, but that doesn't seem to be the case.
- In the cases where image and text inputs are not aligned (for image-text matching), you still seem to use the representations output by the modality-combining encoder. Do you train on the MLM loss in these cases too (forcing the model to overcome misleading context)?
- Are the intra-modality encoders trained from scratch on the augmented MSCOCO, or is the text encoder initialized with BERT pretraining?
- Typo on Line 191: Textural -> textual
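For reference, the masking scheme I am assuming throughout these questions is the standard BERT recipe, roughly the following (a generic sketch of the common scheme, not the submission's code; the function and its arguments are my own):

```python
import random

def bert_mask(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Standard BERT-style masking: select ~15% of positions as
    prediction targets; of those, 80% become [MASK], 10% become a
    random vocabulary token, and 10% are left unchanged (but still
    predicted). Returns (masked_tokens, target_positions)."""
    rng = rng or random.Random(0)
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = mask_token
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token as a target
    return out, targets
```

If the paper deviates from this (e.g., in how masked RoI features are replaced on the visual side), stating so explicitly would resolve my question about g.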
Relation to Prior Work: Yes, the discussion compared to prior work seems complete. Whereas there are many difficulties in extracting relationships post-hoc from attention-based models, this paper suggests explicitly learning them.
Summary and Contributions: The authors propose a self-supervised learning method that implicitly learns visual relationships without relying on visual relationship annotations. The proposed method integrates several methods for self-supervision and benefits various vision-language tasks.
Strengths:
- The proposed self-supervised framework can learn visual relationships without using any relationship annotations, which avoids the limitations caused by manual labeling.
- The experimental results show that the self-supervised learning method can benefit both vision and VL understanding tasks.
Weaknesses:
- The proposed method is complicated; it is in fact a combination of a modified masked language model and contrastive learning. So the contribution should be the application of these methods to implicit relationship learning, not a totally new framework.
- Line 175: the authors say that both positive and negative image-sentence pairs are sampled. Since the image-text matching loss is applied at the same time as the reconstruction loss in Stage 1, I think the authors should give a clearer explanation of how single images and image pairs are sampled at the same time.
- The SSRP method is complicated and contains a variety of losses and components for self-supervised learning. The authors should provide more ablation results, such as removing the image-text matching loss.
- In Table 4 and Table 5, the authors use different datasets or different settings compared to other methods. I am curious what the performance would be if SSRP were trained with larger corpora like VL-BERT* in Table 4.
Clarity: The paper is well-written.
Relation to Prior Work: Yes
Summary and Contributions: The paper introduces a new and fresh idea compared to the many minor ablations we are seeing in the V&L pretraining domain. A self-supervised method is introduced that can implicitly learn the visual relationships in an image without relying on any ground-truth visual relationship annotations, thus breaking through the curse of limited annotated visual relationship data, which also has a long-tail distribution problem. The method builds the intra- and inter-modality encodings separately and then uses relationship probing (a contrastive learning loss and a dependency tree) to discover the visual relationships within each modality. The method shows impressive results on multiple datasets and can be used by approaches that require vision-only embeddings, such as image captioning, and improves these as well.
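For concreteness, the contrastive matching of the two modalities as I understand it is in the spirit of a standard symmetric InfoNCE objective over a batch of image-text pairs (a generic sketch with my own names, not the authors' exact loss):

```python
import numpy as np

def info_nce(img, txt, temperature=0.1):
    """Symmetric InfoNCE over a batch of (image, text) embedding pairs.

    Matched pairs sit on the diagonal of the similarity matrix; all
    other entries in the same row/column act as negatives."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent(l):  # mean cross-entropy with the diagonal as the target
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this pulls each image toward its own caption and away from the other captions in the batch, which matches the paper's described use of a contrastive loss to align the two modalities.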
Strengths:
- The method is novel and fresh and improves existing object feature embeddings by building implicit visual relationship knowledge using self-supervised learning.
- Relationship probing further helps in improving the MLM pretraining representations.
- The paper also introduces data augmentation techniques (though not novel) to gather more pretraining data from the only source, COCO.
- Results suggest that both probing and data augmentation are useful.
- The method models intra- and inter-modality encodings separately, which allows using the encodings in tasks that require inputs from only one modality.
- The enhanced features help with the image captioning task as well, improving the metrics compared to the original BUTD model when used instead of the original Faster R-CNN features.
- The results on image retrieval and the annotated visual relationships are amazing given that they are trained in a self-supervised way.
Weaknesses: I don’t have many concerns with this paper, but I have some high-level issues that I believe should be addressed.
- The effect of relationship probing hasn’t been studied independently of MLM training. Do we even need MLM, or can we just get away with relationship probing?
- The results on GQA are somewhat surprising compared to LXMERT. GQA is a task that should show better numbers with better visual relationship understanding, as the task depends on the scene graph itself. I understand that the corpus for other datasets is larger, but can we know the numbers compared to VisualBERT or LXMERT trained only on COCO, to clearly understand the actual impact?
- For a fair comparison, the number of parameters should also be compared between the different baselines and SSRP.
- It would be good to have metrics on actual retrieval tasks or zero-shot caption retrieval to see quantitatively how good the model is, along with the qualitative results.
- To understand the actual impact on downstream tasks and the quality of the learned representations, it would make sense to test on low-resource tasks such as the Hateful Memes dataset, OK-VQA, TextVQA, TextCaps, and nocaps for captioning. The current downstream task settings in the paper are data-intensive and might not be capturing the full power of the model.
Correctness: The methodology and claims seem correct. The numbers have been reported on the online set.
Clarity: The paper was easy to read. I would suggest the authors be more descriptive with their captions. For example, the model figure is very hard to understand with the current caption.
Relation to Prior Work: Yes, it mostly covers all of the literature. It might make sense to update the references to include the latest content. Some of the papers I mentioned and some of the latest content:
1. Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., ... & Anderson, P. (2019). nocaps: novel object captioning at scale. In Proceedings of the IEEE International Conference on Computer Vision (pp. 8948-8957).
2. Singh, A., Goswami, V., & Parikh, D. (2020). Are we pretraining it right? Digging deeper into visio-linguistic pretraining. arXiv preprint arXiv:2004.08744.
3. Huang, Z., Zeng, Z., Liu, B., Fu, D., & Fu, J. (2020). Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849.
4. Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., & Testuggine, D. (2020). The Hateful Memes challenge: Detecting hate speech in multimodal memes. arXiv preprint arXiv:2005.04790.
5. Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., ... & Rohrbach, M. (2019). Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8317-8326).
6. Li, X., Yin, X., Li, C., Hu, X., Zhang, P., Zhang, L., ... & Choi, Y. (2020). Oscar: Object-semantics aligned pre-training for vision-language tasks. arXiv preprint arXiv:2004.06165.
7. Marino, K., Rastegari, M., Farhadi, A., & Mottaghi, R. (2019). OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3195-3204).
Additional Feedback: Update after rebuttal: I thank the authors for a very good rebuttal and their effort in addressing most of the reviewers' concerns. I believe most of my concerns have been addressed, and I would like to keep my score as it is. The concern around GQA still stands; I would love to see if the authors have any comments on it.