Summary and Contributions: This paper introduces VILLA, a task-agnostic approach for training Transformer-based vision-language models with adversarial perturbations. The key idea is to add adversarial perturbations to the embedding space of visual and textual representations, and train the model to be invariant to these. An additional KL-term explicitly encourages the model's predictions with and without these adversarial perturbations to be similar. Experiments are conducted with UNITER (Chen et al.) as the base model, and adding VILLA on UNITER improves its performance on a host of tasks -- VQA, VCR, NLVR, SNLI-VE, RefCOCO, image-text retrieval -- achieving state-of-the-art results. The authors further conduct ablations to shed more light on how much VILLA helps 1) during pretraining vs. finetuning, 2) on the image vs. text domain. Finally, the authors also include some preliminary analysis to show that UNITER trained with VILLA learns better image-text correspondences (visual grounding for words) than the base UNITER model.
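To make the summarized objective concrete, here is a minimal sketch of a loss that combines a clean term, an adversarial term computed on a perturbed embedding, and a KL term tying the two predictions together. Everything in it is an assumption for illustration rather than the authors' implementation: the toy linear classifier, the single normalized gradient-ascent step used to build the perturbation, the symmetric form of the KL term, and the names `adversarial_delta`, `villa_style_loss`, `epsilon`, and `alpha`.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def linear_logits(weights, embedding):
    # Toy stand-in for the model: one weight row per class.
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]

def cross_entropy(probs, label):
    return -math.log(probs[label] + 1e-12)

def adversarial_delta(weights, embedding, label, epsilon):
    # One gradient-ascent step on the cross-entropy with respect to the
    # embedding, rescaled to have L2 norm epsilon (an assumed attack).
    probs = softmax(linear_logits(weights, embedding))
    err = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    grad = [sum(err[k] * weights[k][j] for k in range(len(weights)))
            for j in range(len(embedding))]
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(embedding)
    return [epsilon * g / norm for g in grad]

def villa_style_loss(weights, embedding, label, epsilon=0.1, alpha=1.0):
    # Clean loss + adversarial loss + alpha * symmetric KL between the
    # clean and perturbed predictions.
    clean = softmax(linear_logits(weights, embedding))
    delta = adversarial_delta(weights, embedding, label, epsilon)
    perturbed = [x + d for x, d in zip(embedding, delta)]
    adv = softmax(linear_logits(weights, perturbed))
    loss_clean = cross_entropy(clean, label)
    loss_adv = cross_entropy(adv, label)
    loss_kl = kl_divergence(clean, adv) + kl_divergence(adv, clean)
    total = loss_clean + loss_adv + alpha * loss_kl
    return total, (loss_clean, loss_adv, loss_kl)
```

The sketch uses a single perturbation step for brevity; a multi-step or "free" training schedule would replace `adversarial_delta` while leaving the three-term structure of the loss unchanged.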
Strengths: This paper thoroughly explores a simple idea of adversarial training and demonstrates convincing results across a wide range of vision-language tasks. The proposed approach improves on the previous state-of-the-art by significant margins and thus makes an important empirical contribution.
Weaknesses: While this is a good paper overall with compelling results, it would be stronger and more complete if the following design choices were discussed or supported by empirical evidence:
- How do adversarial perturbations in embedding space compare to those in pixel space? The latter follows more naturally from prior work in adversarial training.
- What happens if adversarial perturbations are added to the image and text domains simultaneously, instead of one at a time?
In addition to improving generalization performance, do these adversarial perturbations make the model more robust to adversarial attacks (where inputs, not embeddings, are adversarially perturbed)?
Correctness: To the best of my knowledge, the empirical evaluations sufficiently back claims made.
Clarity: The paper is clearly written and well-organized; it was a joy to read. Great work!
Relation to Prior Work: This work appropriately situates itself in the context of prior work, and explores a complementary direction -- adversarial training for V&L models -- that hasn't been explored before in Transformer-based vision-language models.
Summary and Contributions: Update: The rebuttal has not changed my (positive) opinion of the paper. The rebuttal, like the paper, seems strong.
-------------
The paper performs large-scale adversarial training for vision + language representation learning. It demonstrates SOTA performance on 6 standard vision and language tasks. (My review is shorter than my average review because I think this work is solid, the paper is very well-written, and I don't have much else to say.)
Strengths:
- Clear, very well-written paper.
- SOTA performance on 6 standard tasks.
- Other systematic evaluation (ablation study to examine the effect of adversarial training only in the pre-training or only in the fine-tuning stage, evaluating adversarial pre-training across checkpoints, etc.).
- General idea, applied to a couple of different model architectures.
- Submitted code; will release models.
The question the paper poses, "can we apply similar adversarial training techniques to V+L problems to improve model performance?", is a reasonable one that is worth asking and answering. Adding adversarial noise to the embeddings makes sense, and does make the approach more general (which is useful when dealing with multiple modalities).
Weaknesses: None that I can think of. More tasks, more base architectures, more ablations can always make the work stronger but I think this paper has more than sufficient empirical evaluation as is. One could say that this paper has limited novelty because it does not introduce a new model architecture or learning paradigm. But I think the question this paper asks and answers (listed above) is a useful one. The findings will be useful for the community. Authors have committed to releasing their models, which will also be useful for the community. So overall, I don't see any significant weaknesses.
Correctness: Everything seems correct
Clarity: Paper is very well-written and clear
Relation to Prior Work: Connections to prior work have been described well
Additional Feedback: Out of curiosity: Did you experiment with $\hat{y} = f_\theta(x_{img} + \delta_{img}; x_{txt} + \delta_{txt})$? In general, if there were things the authors tried that didn't work out as well as expected, it would be worth describing those briefly in the paper. Readers might find that interesting and useful.
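The variant asked about here differs from per-modality perturbation only in which inputs are perturbed within a single forward pass. The sketch below contrasts the three cases. It rests on assumptions: `toy_model` is a hypothetical dot-product stand-in for f_theta, and the perturbations are passed in as given rather than computed by an attack.

```python
def add(u, v):
    # Elementwise sum of two equal-length vectors.
    return [a + b for a, b in zip(u, v)]

def toy_model(x_img, x_txt):
    # Hypothetical stand-in for f_theta: dot product of the two embeddings.
    return sum(a * b for a, b in zip(x_img, x_txt))

def perturbed_outputs(x_img, x_txt, delta_img, delta_txt):
    # Forward passes with the image perturbed, the text perturbed,
    # and both perturbed at once.
    return {
        "image_only": toy_model(add(x_img, delta_img), x_txt),
        "text_only": toy_model(x_img, add(x_txt, delta_txt)),
        "joint": toy_model(add(x_img, delta_img), add(x_txt, delta_txt)),
    }
```

For this toy model the joint output also contains a cross term between the two perturbations, which is one reason the joint case need not behave like the sum of the two single-modality cases.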
Summary and Contributions: This paper presents large-scale adversarial training for vision-and-language representation learning. Instead of adding adversarial perturbations in the input domain (image and text), the paper proposes to add the noise in the embedding domain, and extends the FreeLB training scheme with two additional training objectives: a label-preserving attack loss and a confidence-preserving attack loss. The authors apply VILLA to current V+L models and achieve a new state of the art on a wide range of tasks.
Strengths: It is very interesting to see large-scale adversarial training for vision-and-language representation learning. The paper is well written and the results are very good. The presentation of the paper is also very nice.
Weaknesses: Despite the strengths of the paper, I have some concerns.
1: The original goal of adversarial training is to defend against adversarial attacks. In this paper, the authors show that adding adversarial perturbations to the embeddings improves performance on the final downstream tasks. This is great; however, the paper does not answer whether the proposed method holds up better under adversarial attack. What is the connection between adding noise in embedding space and in pixel/token space? There are multiple ways to test whether the proposed method is more robust, for example:
- Some downstream tasks focus on paraphrasing. There is a VQA-Rephrasings dataset (Cycle-Consistency for Robust Visual Question Answering), and I am curious whether injecting adversarial noise into the embedding space leads to better performance on it.
- How does performance change when the model faces traditional adversarial attacks (e.g., perturbing pixels or changing tokens)?
2: Since this model is trained from UNITER's saved checkpoints, it receives more optimization steps than UNITER. For a fair comparison, the baseline UNITER model should be trained for a similar number of optimization steps.
3: The training setup seems a little tricky to me. Following on from the previous point, in the supplementary materials the authors claim that most of the hyper-parameters follow UNITER. However, in Table 5, the batch size of the proposed model is very different from the one reported in the UNITER paper (VQA VILLA_BASE batch size 5120, while UNITER 10240, etc). I wonder how these hyper-parameters were selected and whether they are the same as in UNITER, as the paper claims.
Relation to Prior Work: yes
Summary and Contributions: This paper presents a new large-scale adversarial training framework for vision-and-language (V+L) representation learning. The proposed framework consists of two training stages: task-agnostic adversarial pre-training and task-specific adversarial finetuning. In addition, it performs adversarial training in the embedding space of each modality and adopts the “free” adversarial training strategy, as well as KL-divergence-based regularization, to make large-scale cross-modal training feasible. It conducts experiments on a wide range of V+L tasks and shows clear improvements over existing methods.
Strengths:
1. This paper proposes a novel method of adversarial pre-training and adversarial finetuning for vision-and-language (V+L) representation learning.
2. This paper is well written and organized, and it is clear and easy for readers to follow.
3. This paper conducts comprehensive experiments on six popular V+L tasks and shows consistent improvements over existing methods.
Weaknesses: One slight concern is that visualizing only one example is not enough to demonstrate the advantage of VILLA over UNITER.
Correctness: The paper is technically solid, as it provides comprehensive theoretical and experimental support for the proposed methods.
Clarity: The paper is perfectly written and organized.
Relation to Prior Work: This paper makes a comprehensive review of prior work and highlights the contributions it gains compared to the previous approaches.
Additional Feedback: It’s a good job in every aspect.