NeurIPS 2020

Self-Supervised MultiModal Versatile Networks


Review 1

Summary and Contributions: This paper proposes MultiModal Versatile (MMV) networks that learn representations from vision, audio and text in a self-supervised manner. The main contributions include:
+ A hierarchical fine-and-coarse spaces approach (FAC) to configure the relationships between the different modalities, which specifically tackles the issue that the audio and vision modalities are more fine-grained than the textual modality.
+ A network deflation mechanism to transfer the learned video representations to images.
+ Solid experiments and state-of-the-art performance evaluated on action recognition (UCF101/HMDB), audio classification (ESC-50), zero-shot text-to-video retrieval (MSRVTT and YouCook2) and image classification (PASCAL VOC and ImageNet).

Strengths:
+ I'm convinced that multimodal self-supervised learning is a promising route to good representations.
+ The main technical contributions (FAC and network deflation) are sound.
+ Though simple, the techniques proposed in this paper (in particular FAC) make sense and look novel to me.
+ The experiment protocols look sensible to me and the results obtained with the proposed method are convincing.

Weaknesses:
- In Table 1a, I'm a bit surprised by the significant gains of VAT over VA on UCF and HMDB, as I'm not sure why text obtained from ASR plays such an important role for datasets like UCF and HMDB. It would be great if the authors could provide more insight here.
- The main difference between the proposed deflation mechanism and a "naive deflation" lies in the fact that the former re-tunes the batch-norm parameters of the network using an "L1 loss between the output of the original video network when presented with single-image static-videos, and the output of the deflated network for the same images". If re-tuning is indeed the important factor here, I think a more straightforward baseline is missing: the authors could fine-tune on PASCAL VOC with everything in the backbone frozen except the gamma and beta of the batch norms (a minimal sketch of this baseline is given below). That way, one doesn't need the extra L1 loss to adjust the batch-norm statistics.
- If possible, I am curious how much better the proposed method could do when trained on even larger datasets (e.g., AS+HT+IG65M). This would help us understand the limits of the proposed multimodal self-supervised learning approach.
- L173: what exactly constitutes "N(x)"? Are all N-1 negative pairs utilized? Could the authors elaborate?
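To make the suggested baseline concrete, here is a minimal PyTorch sketch of what I have in mind; this is my own illustration, not the authors' code, and `backbone`, `feat_dim` and `num_classes` are placeholders.

```python
import torch
import torch.nn as nn

def bn_only_finetune(backbone: nn.Module, feat_dim: int, num_classes: int):
    """Freeze the pretrained backbone except BatchNorm gamma/beta, add a linear head."""
    for p in backbone.parameters():
        p.requires_grad = False
    for m in backbone.modules():
        # Re-enable gradients only for the BatchNorm affine parameters (gamma/beta).
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            if m.weight is not None:
                m.weight.requires_grad = True
            if m.bias is not None:
                m.bias.requires_grad = True
    classifier = nn.Linear(feat_dim, num_classes)  # e.g. 20 classes for PASCAL VOC
    model = nn.Sequential(backbone, classifier)
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=1e-2, momentum=0.9)
    return model, optimizer
```

Fine-tuning this on VOC and comparing it against the proposed L1-based deflation would isolate how much of the gain comes from re-calibrating the batch-norm parameters alone.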

Correctness: Yes, both the approach and experiment protocols look reasonable to me.

Clarity: It's very well-written and easy to follow.

Relation to Prior Work: Yes, the authors did a good job discussing related work. To provide a more comprehensive view of the current state of the art, I encourage the authors to add the following papers to Table 2: "Audiovisual SlowFast Networks for Video Recognition" and "Audio-Visual Instance Discrimination with Cross-Modal Agreement".

Reproducibility: Yes

Additional Feedback: POST REBUTTAL: Although in certain cases it's hard to have exact apples-to-apples comparisons (as the authors highlighted some practical difficulties in their rebuttal), I feel the results they provide do give me confidence in the story they're trying to sell ("fine and coarse" multimodal learning). Thus, I will keep my rating as in my initial review.
------------------------------------------------------------------------------
Please see my comments in the sections above.


Review 2

Summary and Contributions: In this work, the authors learn representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. A multimodal versatile network is proposed: such a network can ingest multiple modalities, and its representations enable downstream tasks in multiple modalities. The authors demonstrate the proposed approach by training the networks on large collections of unlabelled video data and applying them to video, video-text, image and audio tasks.

Strengths: This paper is well written. The introduction very clearly presents the motivation for building such a multimodal versatile network. The presentation of the idea and the technical details is also easy to follow. I also like the idea of learning the multimodal embeddings in a fine-and-coarse manner. It differs from most previous works, which focus on learning a single common space.

Weaknesses:
1. I am concerned about the novelty of the proposed approach. The motivation of the idea is to design a fine-and-coarse learning scheme. However, the loss function is actually a combination of two contrastive losses, i.e., an audio-video contrast and a text-video contrast. Such a training objective is fairly standard in itself, unless the authors can clarify what is novel beyond it. The only difference is that the model is trained on three different modalities, but I did not see any particular design for handling the multimodal data or for combining the multimodal contrastive losses.
2. I am concerned about a mismatch between the authors' claims and their approach. The authors claim that they aim to learn the multimodal embeddings in a fine-and-coarse manner, which differs from the previous disjoint and shared embedding spaces. However, the objective is actually two contrastive losses, i.e., L_{a,v} and L_{t,v}. In other words, it uses video as an anchor to contrast with audio and text, respectively. In this way, it seems no different from learning disjoint audio-video and text-video spaces (see the sketch after this list for my reading of the objective). With no clarification on this, the claim cannot be validated.
3. Only a few state-of-the-art methods are compared. More approaches and benchmarks need to be compared with, for example on action classification and audio event classification: [1] Audio-Visual Instance Discrimination with Cross-Modal Agreement; [2] Learning Video Representations Using Contrastive Bidirectional Transformer.
4. The current performance does not outperform all state-of-the-art methods, and the comparison is not fair. When all methods are trained on AudioSet, XDC achieves 91.2 and AVID [1] achieves 91.5 on UCF101, both outperforming this work (90.1) by more than 1%. On ESC-50, AVID achieves 89.2 while this work reaches 86.1, a gap of more than 3%. The best performance of this paper comes from training on a combination of AudioSet and HowTo100M, which is much larger than the data used by the compared approaches. Considering that the model is trained on a much larger dataset and also incorporates more modalities, the current performance is NOT surprisingly good.
5. The authors claim that one of the benefits of the fine-and-coarse design is that the information/knowledge embedded in a specific modality can easily be transferred/translated to another, e.g., g_v -> g_va -> g_vat. Thus, I am curious to see experiments and analysis of the transferability of the learned network. However, only video-text retrieval validates that the network learns some alignment between video and text, and one may suspect that this is learned from the video-text contrastive loss. Since during training the network sees the audio-video and video-text connections, it is natural that the model can transfer between audio-video and video-text. But to prove that the whole network is transferable, the authors also need to show that information/knowledge can be transferred/translated between audio and text. I did not see a clear clarification of this in the approach section, and there are no experiments to empirically validate it.
6. Related work on learning from multimodal data (audio, video, text) is missing: [3] Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck; [4] One Model To Learn Them All.
7. There are not enough implementation details, and I did not see the code.
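For clarity, this is how I read the training objective from the paper's description. It is my own minimal sketch, not the authors' code: the embeddings, projection heads and batch construction are placeholders, and I assume a standard in-batch NCE form with L2-normalized embeddings.

```python
import torch
import torch.nn.functional as F

def nce_loss(anchor, positive, temperature=0.07):
    # Standard in-batch InfoNCE: matching pairs lie on the diagonal of the
    # similarity matrix; all other samples in the batch act as negatives.
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def mmv_objective(z_v_fine, z_a, z_v_coarse, z_t, lam=1.0):
    # z_v_fine / z_a: video and audio embeddings in the fine VA space.
    # z_v_coarse / z_t: video and text embeddings in the coarse VAT space.
    # All tensors are (batch, dim) and assumed L2-normalized.
    loss_av = nce_loss(z_v_fine, z_a)    # audio-video contrast, L_{a,v}
    loss_tv = nce_loss(z_v_coarse, z_t)  # text-video contrast, L_{t,v}
    return loss_av + lam * loss_tv
```

Under this reading, the "fine and coarse" aspect is realized only by where the projection heads place the embeddings (VA space vs. shared VAT space) rather than by the loss itself, which is why I would like the authors to clarify what, beyond the choice of spaces, distinguishes this from learning two disjoint pairwise spaces.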

Correctness: I am concerned about a mismatch between the authors' claims and their approach. The authors claim that they aim to learn the multimodal embeddings in a fine-and-coarse manner, which differs from the previous disjoint and shared embedding spaces. However, the objective is actually two contrastive losses, i.e., L_{a,v} and L_{t,v}. In other words, it uses video as an anchor to contrast with audio and text, respectively. In this way, it seems no different from learning disjoint audio-video and text-video spaces. With no clarification on this, the claim cannot be validated. The authors also claim that one of the benefits of the fine-and-coarse design is that the information/knowledge embedded in a specific modality can easily be transferred/translated to another, e.g., g_v -> g_va -> g_vat. Thus, I am curious to see experiments and analysis of the transferability of the learned network. However, only video-text retrieval validates that the network learns some alignment between video and text, and one may suspect that this is learned from the video-text contrastive loss. Since during training the network sees the audio-video and video-text connections, it is natural that the model can transfer between audio-video and video-text. But to prove that the whole network is transferable, the authors also need to show that information/knowledge can be transferred/translated between audio and text. I did not see a clear clarification of this in the approach section, and there are no experiments to empirically validate it.

Clarity: Yes. This paper is well written. The introduction very clearly presents the motivation for building such a multimodal versatile network. The presentation of the idea and the technical details is also easy to follow.

Relation to Prior Work: No. This paper is missing some related work, especially multi-modality work covering vision, audio and language. Considering that one of the main technical contributions is learning across all three of these modalities, the current related-work discussion is not satisfactory. Besides, the authors claim that the main difference is learning a fine-and-coarse space instead of the previous 'shared space' and 'disjoint space'. However, from the approach section, I am not convinced that the current objective design achieves what the authors claim. There is no clear discussion differentiating the proposed approach from previous ones.
[1] Learning Video Representations Using Contrastive Bidirectional Transformer.
[2] Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck.
[3] One Model To Learn Them All.

Reproducibility: No

Additional Feedback:
-------------------------POST REBUTTAL------------------------------------------------------------
Thanks for the discussion. Yes, I agree that the incomplete comparison should NOT be the key factor affecting the contribution of this paper. The authors' feedback also addressed my concern here, so I am fine with the evaluation part. Regarding my concerns about the technical novelty, the authors did address part of them by clarifying that the main contribution is the architecture design rather than the loss function. However, the design does not fully convince me: though I like the idea of 'fine-and-coarse', the architecture design and training loss do not realize this motivation in a novel way. Considering this, I raise my score to 5, and I will be fine if this paper gets in, because it is well written and the 'fine-and-coarse' concept is new and interesting.


Review 3

Summary and Contributions: The authors present a self-supervised representation learning framework for videos that contain multimodal streams of information. In particular, two components are emphasized in the paper: the Fine and Coarse spaces (FAC) and the deflation method. Empirical experiments demonstrate the effectiveness of the approach.
----------------------
POST REBUTTAL: I will maintain my rating and still think the paper is worthy of acceptance; I keep my score of 6.

Strengths: The experimental results are promising. The paper is easy to follow and understand. Nonetheless, many of the details are hidden in the main text.

Weaknesses:
(Minor) The paper fails to discuss a series of related works on human multimodal language. Details are provided below.
(Minor-Major) In lines 174-176, the authors claim that the audio is perfectly aligned with the visual source, which may not be valid. Different sampling rates or different angles of the visual scene may cause misalignment between the visual and audio signals.
(Minor-Major) The authors should discuss more how the positives are chosen for the text. Details are provided below.
(Minor-Major) The authors should elaborate more on the deflation process in Section 3.2.
(Minor) The experimental section is a bit hard to follow. Nonetheless, I believe these positive results.

Correctness: The authors state that the visual and audio streams are always aligned in a given video, which may fail in most cases.

Clarity:
1. It is unclear how the positive text signals are sampled. For instance, lines 180-181 are particularly vague. Also, what is P(x) here? (One possible reading is sketched below.)
2. I can't grasp how important the deflation introduced in the paper is. Perhaps more discussion or a better explanation could be provided in Section 3.2.
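For concreteness, my assumption (not confirmed by the paper) is that P(x) follows a MIL-NCE-style formulation, i.e. the set of positive video-text pairs formed from narrations temporally close to clip x, with N(x) the corresponding negatives:

```latex
% Assumed MIL-NCE-style reading of P(x) and N(x); v and t denote video and
% text embeddings in the shared space and tau is a temperature.
\[
\mathcal{L}_{t,v}(x) = -\log
\frac{\sum_{(v,t)\in\mathcal{P}(x)} \exp\!\left(v^{\top} t / \tau\right)}
     {\sum_{(v,t)\in\mathcal{P}(x)} \exp\!\left(v^{\top} t / \tau\right)
      + \sum_{(v,t')\in\mathcal{N}(x)} \exp\!\left(v^{\top} t' / \tau\right)}
\]
```

If this is indeed the intended definition, stating it explicitly (and how many neighbouring narrations are taken as positives) would resolve the ambiguity in lines 180-181.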

Relation to Prior Work: There is a series of important works [1,2,3] on multimodal video that should be discussed. These works also consider vision, text, and audio signals in video.
[1] Tensor Fusion Network for Multimodal Sentiment Analysis, Zadeh et al., EMNLP'17.
[2] Multi-attention Recurrent Network for Human Communication Comprehension, Zadeh et al., AAAI'18.
[3] Multimodal Transformer for Unaligned Multimodal Language Sequences, Tsai et al., ACL'19.

Reproducibility: No

Additional Feedback: n/a


Review 4

Summary and Contributions: This paper proposes a new approach to multi-modal self-supervised representation learning. This new approach (called MMV) is a contrastive learning-based approach that learns from three modalities: visual, audio, and text. MMV maps the three modalities into a fine-and-coarse (FAC) feature space. The paper evaluates the performance on multiple downstream tasks and achieves good results.

Strengths:
- The paper is evaluated on many downstream tasks.
- To the best of the reviewer's knowledge, this is the first self-supervised approach to work on the three modalities: vision, audio, and text.
- The ablation studies are well-conducted.

Weaknesses: The main weakness of this work is the lack of direct comparisons with previous works. The authors claim that they outperform the state of the art on UCF, HMDB, and ESC, but fail to compare with other approaches under the same backbone architecture and the same pretraining dataset. For example, the following concerns arise from Table 1:
- Is MMV with TSM-50 better than ELo because MMV has a better architecture and was pretrained on a larger dataset? What is the performance of ELo pretrained on AudioSet+HowTo100M using TSM-50? Or what is the performance of MMV using R(2+1)D-50 and pretrained on YouTube8M?
- XDC outperformed its R(2+1)D-18 fully-supervised pretraining baseline, but MMV did not exceed its fully-supervised pretraining baseline. On the other hand, MMV shows better performance than XDC. Can the authors explain why this is the case? Is MMV better than XDC because it uses a better backbone architecture?
- MIL-NCE uses the same S3D backbone as MMV, but MIL-NCE is only pretrained on HowTo100M. Could the additional AudioSet pretraining be the reason why MMV outperforms MIL-NCE by 1.3% on UCF? What is the performance of MMV trained only on HowTo100M?
- AVTS uses the MC3 architecture, which has inferior performance to TSM-50 on action recognition tasks. Could this be a factor in why AVTS shows worse performance on UCF compared to MMV? What is the performance of AVTS using TSM-50? Or what is the performance of MMV using MC3?
Overall, the lack of a fair direct comparison with any of the previous approaches undermines the claim that MMV is a better representation-learning model. The authors should compare with at least one method under the same settings (the same architecture and the same pretraining dataset).

Correctness: The claim that MMV is better than previous works is not backed by fair direct comparisons using the same backbone architecture and pretraining dataset. Refer to the "Weaknesses" section for more details.

Clarity: The paper is well written and organized

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: **Post-rebuttal review updates:** After reading the authors' rebuttal, I have the following comments about the response to my concerns about the lack of direct comparisons with other works (i.e., on the same architecture and the same pretraining dataset):

Authors: "Comparison on equal grounds is a problem for all papers in this area" - Yes, that's sadly true - but that's why we need to find a way to fix it. It's unreasonable (and it is not what I asked for) to ask the authors to compare directly against all previous methods, but they should at least have one direct comparison using the same backbone and pretraining dataset. Otherwise, we cannot know if a new paper is actually better than others because of its specific approach or due to the use of advanced backbones and large datasets (two factors that are not novel by themselves).

Authors: "We did another comparison to XDC (R2, R4) by running our VA model on the same data (AudioSet) and backbone (R(2+1)D-18) ... Note that R(2+1)D-18 actually outperforms S3D" - The authors providing this experiment is exactly what I was asking for to convince a reader of the merits of this paper. Now we can directly compare to another method (XDC), and now we can fairly say MMV is a better method. That being said, I'm somewhat surprised that using R(2+1)D-18 gave MMV better results than using S3D. In supervised action recognition on Kinetics, S3D outperforms R(2+1)D-34 (note this is the 34-layer variant), let alone the smaller and weaker R(2+1)D-18 model. I'm not sure what might cause this mismatch, but I will take the authors' findings as correct. However, the authors should discuss this surprising mismatch.

Authors: "XDC beats their own fully supervised baseline but we report a stronger and more meaningful quantity – the best externally published performance for supervised transfer" - I'm afraid this is a false statement. The reported fully-supervised results in Table 2 are actually the performance of S3D itself [55] (which is the fully-supervised baseline for MMV). Specifically, the 96.8 on UCF and 75.9 on HMDB are taken directly from Table 5 of the S3D paper [55]. So, I still stand by my concern here. MMV appears to underperform its fully-supervised baseline and at the same time outperforms XDC, which outperforms its fully-supervised baseline. The authors' response is misleading and does not address my concern. That being said, this does not mean that MMV is worse because it failed to outperform its fully-supervised baseline. But the authors should have a convincing argument as to why it does not.

I like this paper and I think it's the first SSL approach to work on the three modalities: vision, audio, and text. The authors' response addressed part of my concerns by directly comparing to one method. I'm leaning towards changing my rating from 4 to 5, as I still think the authors failed to address my remaining concerns.