Reviews: Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models

* Update after rebuttal: After reading the other reviews and the authors' response, I have decided to maintain my score. I appreciate the clarifications about the MNIST-SVHN protocol and the authors' willingness to add pixel-level generation and MVAE-on-CUB results as requested. On the other hand, my questions about the likelihood results remained unaddressed, and I still feel it is misleading that the qualitative visual results in Fig. 7 (via feature look-up) are presented as 'reconstruction' and 'generation'. (There may have been a misunderstanding about what I meant by 'missing modalities': I was referring to individual samples that may not have all modalities available, rather than a modality being unobserved for the entire dataset. I realise my wording was not very clear, so I am disregarding this point.)

Summary
------------
The authors posit that a good generative model for multi-modal data should be able to achieve latent factorisation, coherent joint and cross generation, and synergy between modalities. They propose a deep generative model based on variational autoencoders (VAE) and mixture-of-experts (MoE), which satisfies these criteria and sidesteps some limitations of existing models. The model has one branch for each modality and a shared latent space.

Originality
-------------
The formulation of the desiderata for multi-modal generative modelling feels quite original. The related work section is detailed and fairly comprehensive, clearly delineating the authors' contribution. In particular, the proposed model differs from the closest related method (MVAE; Wu & Goodman, 2018) in significant and non-trivial aspects.

Quality
---------
- It is unclear why the authors limit their scope to cases wherein all modalities are fully available during training (lines 65-66), as this could be a deal-breaker for some applications. Could the training objective not be adapted somehow to allow missing modalities? There should be some discussion of this limitation and/or of potential solutions.
- There is no discussion of the likelihood results at the end of Sec. 4.3. What is the message here? What should we have expected to see? Do these numbers confirm the authors' hypotheses? If the expected difference was between the third and fourth columns ('p(xm|xm,xn)' vs. 'p(xm|xm)'), then the effect was minuscule... Also, the sentence describing the calculation of these results is quite confusing (lines 276-279), and this is not helped by the table's lack of a title and caption, which makes it even harder to interpret the results.
- The experiments with the Caltech-UCSD Birds (CUB) dataset are quite preliminary. The authors apply their model to the outputs of a pre-trained deep feature extractor, on the grounds of avoiding blurriness in generated images. They then report as 'generated samples' the nearest-neighbour training images in this feature space. This is a seriously understated limitation, and the visual results shown in Fig. 7, for example, can be very misleading to less attentive readers.
- Although I am not deeply familiar with recent NLP literature, the caption generation results feel fairly unconvincing in terms of text quality. There are also several inconsistencies in language->language and image->language generation, so the authors' claim of 'quality and coherence' (Fig. 7 caption) of the sampled captions appears quite subjective. It would be useful to show many more examples of these five tasks (e.g. in the supplement) to give readers a better picture of what the model is able to learn from this data.
- Why is there no comparison with the MVAE baseline (Wu & Goodman, 2018) on CUB?

Clarity
--------
- Generally the exposition is very clear and a pleasure to read.
- I believe it could be helpful for readers if the authors added to lines 112-123 a brief intuitive description of their MoE formulation.
  For example, my understanding of the mixture-of-experts is that evidence can come from *either* modality, rather than the product-of-experts' assumption that *all* modalities should agree.
- The initial description of the MMVAE objective in lines 124-136 felt a bit convoluted to me, and I am not sure it is currently adding much value in the main text. In fact, the jump from IWAE straight to the explanation starting from line 137 seems a lot more intuitive to begin with. Since the authors seem pressed for space, I would suggest moving the initial formulation and surrounding discussion to the supplement.
- It would be interesting to briefly justify the choice of factorised Laplace distributions for priors, posteriors and likelihoods, and to clarify how the constraint on the scales is enforced during training.
- In the interest of squeezing in more content, the authors seem to have tampered with some spacing elements: the abstract's lateral margins and the gaps above section headings are visibly reduced. Some additional space could be spared e.g. by deleting the superfluous Sec. 4.1 heading and rephrasing to get rid of dangling short lines of text (e.g. 257, 318, 334).
- Consider increasing the spacing around inline figures (Figs. 1, 3, 5, and 6), as it currently feels very tight and makes the surrounding text a bit harder to read.
- The authors could clarify that MMVAE refers to multi-modal MoE VAE (or MoE multi-modal VAE?), to better distinguish it from Wu & Goodman (2018)'s multi-modal VAE (MVAE).
- The multiple likelihood samples (N=9) in Fig. 4 do not seem to provide any additional information or insight, especially given the heavy speckle noise in the generated images. Here it might be clearer to display just the predicted mean for each sampled latent code.
- The description of the quantitative analysis protocol for MNIST-SVHN (lines 258-267) needs some further clarification. Is it classifying digits?
  Are the SVHN and MNIST accuracies computed by feeding through only data from the corresponding modality? To which model does the 94.5% on line 267 refer, and what are the implications of this result?
- Minor fixes:
  * Line 18: 'lack *of* explicit labels'
  * Line 19: 'provided' => 'provide'
  * Line 25: '[...] them (Yildirim, 2014).'
  * Lines 41, 43, 63, 72, 145: 'c.f.' => 'cf.'
  * Line 64: 'end-to-end'
  * Line 166: 'targetted' => 'targeted'
  * Line 205: '..' => '.'
  * Line 214: 'recap' sounds a bit informal
  * Line 238: 'generative' => 'generate'
  * Line 293: 'euclidean' => 'Euclidean'
  * Ref. Kalchbrenner et al. (2014): author list is messed up
- The authors cite no fewer than 17 arXiv preprints, most of which were subsequently published and have been available for a while. Consider searching for the updated references and citing the published works instead.

Significance
----------------
The authors' formalisation of criteria for multi-modal generative modelling is inspiring. The model and inference algorithm could also raise interest and invite extensions. While there are several issues with the evaluation on the more complex CUB dataset, the MNIST-SVHN experiments are interesting and informative, and suggest improvements over the MVAE baseline.
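
The MoE versus PoE intuition raised under Clarity (evidence from *either* modality versus agreement of *all* modalities) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes two 1-D Gaussian experts for simplicity (the paper is described as using Laplace distributions), uniform mixture weights 1/M, and it omits the prior expert that a PoE model such as MVAE would normally include. All function names are hypothetical.

```python
import math
import random


def poe_gaussian(mus, sigmas):
    """Product of 1-D Gaussian experts (prior expert omitted, an assumption).

    Precisions add, and the mean is the precision-weighted average,
    so the result is sharper than any single expert.
    """
    precisions = [1.0 / s ** 2 for s in sigmas]
    total = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / total
    return mu, math.sqrt(1.0 / total)


def moe_moments(mus, sigmas):
    """Mean and standard deviation of a uniform (1/M) mixture of Gaussians."""
    M = len(mus)
    mean = sum(mus) / M
    second_moment = sum(s ** 2 + m ** 2 for m, s in zip(mus, sigmas)) / M
    return mean, math.sqrt(second_moment - mean ** 2)


def moe_sample(mus, sigmas, rng):
    """Sample from the mixture: pick one expert uniformly, then sample from it."""
    m = rng.randrange(len(mus))
    return rng.gauss(mus[m], sigmas[m])


# Two experts that disagree strongly about the latent code.
mus, sigmas = [-2.0, 2.0], [0.5, 0.5]

poe_mu, poe_sigma = poe_gaussian(mus, sigmas)
moe_mu, moe_sigma = moe_moments(mus, sigmas)

rng = random.Random(0)
samples = [moe_sample(mus, sigmas, rng) for _ in range(1000)]

print("PoE:", poe_mu, poe_sigma)
print("MoE:", moe_mu, moe_sigma)
```

In this toy setting the product concentrates tightly at a point between the two experts, where neither expert places much mass, while the mixture keeps both modes and samples land near one expert or the other.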

****************************Originality****************************
Strengths:
- The paper does a nice job of discussing the various desiderata of successful multimodal models in the introduction, nicely decomposing these into generative and discriminative objectives.
Weaknesses:
- The paper is not that original given the amount of existing work on learning multimodal generative models:
  - For example, from the perspective of the model, the paper builds on the work by Wu and Goodman (2018), except that it learns a mixture-of-experts rather than a product-of-experts variational posterior.
  - In addition, from the perspective of the 4 desirable attributes for multimodal learning that the authors mention in the introduction, it seems very similar to the motivation in the paper by Tsai et al., Learning Factorized Multimodal Representations, ICLR 2019, which also proposed a multimodal factorized deep generative model that performs well for discriminative and generative tasks as well as in the presence of missing modalities. The authors should have cited and compared with this paper.
****************************Quality****************************
Strengths:
- The experimental results are nice. The paper claims that the MMVAE model fulfills all four criteria: (1) latent variables that decompose into shared and private subspaces, (2) the ability to generate data across all modalities, (3) the ability to generate data across individual modalities, and (4) improved discriminative performance in each modality by leveraging related data from other modalities. Let's look at each of these 4 in detail:
  - (1) Yes, their model does indeed learn factorized variables, as shown by good conditional generation on the MNIST+SVHN dataset.
  - (2) Yes, joint generation (which I assume to mean generation from a single modality) is performed on vision -> vision and language -> language for CUB.
  - (3) Yes, conditional generation can be performed on CUB via language -> vision and vice versa.
Weaknesses:
- (Continuing on whether the model does indeed achieve the 4 properties that the authors describe)
  - (3, continued) However, it is unclear how significant the performance is for both (2) and (3), since the authors report no comparisons with existing generative models, even simple ones such as a conditional VAE from language to vision. In other words, what if I forgo the complicated MoE VAE and all the components of the proposed model, and simply use a conditional VAE from language to vision? Many ablation studies are missing from the paper, which matters especially because the model is so complicated.
  - (4) The authors do not seem to have performed extensive experiments for this criterion, since they only report the performance of a simple linear classifier on top of the latent variables. There has been much work in learning discriminative models for multimodal data involving aligning or fusing language and vision spaces. Just to name a few involving language and vision:
    - Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, EMNLP 2016
    - DeViSE: A Deep Visual-Semantic Embedding Model, NeurIPS 2013
    Therefore, it is important to justify why I should use this MMVAE model when there is a lot of existing work on fusing multimodal data for prediction.
****************************Clarity****************************
Strengths:
- The paper is generally clear. I particularly liked the introduction of the paper, especially the motivating Figures 1 and 2. Figure 2 is particularly informative given what we know about multimodal data and multimodal information.
- The table in Figure 2 nicely summarizes some of the existing works in multimodal learning and whether they fulfill the 4 criteria that the authors have pointed out to be important.
Weaknesses:
- Given the authors' great job in setting up the paper via Figure 1, Figure 2, and the introduction, I was rather disappointed that section 2 did not continue this clear flow.
  To begin, a model diagram/schematic at the beginning of section 2 would have helped a lot. Ideally, such a model diagram could closely resemble Figure 2, where you have already set up a nice 'Venn Diagram' of multimodal information. Given this, your model basically assigns latent variables to each of the overlapping information spaces, with arrows (neural network layers) as the inference and generation paths between the variables and the observed data. Showing such a detailed model diagram in an 'expanded' or 'more detailed' version of Figure 2 would be extremely helpful in understanding the notation (of which there is a lot), how MMVAE accomplishes all 4 properties, and the inference and generation paths in MMVAE.
- Unfortunately, the table in Figure 2 is not very complete given the amount of work that has been done in latent factorization (e.g. Learning Factorized Multimodal Representations, ICLR 2019) and purely discriminative multimodal fusion (i.e. point d on synergy).
- There are a few typos and stylistic issues:
  1. line 18: "Given the lack explicit labels available" -> "Given the lack of explicit labels available"
  2. line 19: "can provided important" -> "can provide important"
  3. line 25: "between (Yildirim, 2014) them" -> "between them (Yildirim, 2014)"
  4. and so on...
****************************Significance****************************
Strengths:
- This paper will likely be a nice addition to the current models we have for processing multimodal data, especially since the results are quite interesting.
- The paper did a commendable job in attempting to perform experiments to justify the 4 properties they outlined in the introduction.
- I can see future practitioners using the variational MoE layers for encoding multimodal data, especially when there is missing multimodal data.
Weaknesses:
- That being said, there are some important concerns, especially regarding the utility of the model as compared to existing work.
  In particular, there are some statements in the model description where it would be nice to have some experimental results in order to convince the reader that this model compares favorably with existing work:
  1. line 113: You set \alpha_m uniformly to be 1/M, which implies that the contributions from all modalities are the same. However, works in multimodal fusion have shown that dynamically weighting the modalities is quite important because 1) modalities might contain noise or uncertain information, and 2) different modalities contribute differently to the prediction (e.g. in a video, when a speaker is not saying anything, their visual behaviors are more indicative than their speech or language behaviors). Recent works therefore study, for example, gated attention (e.g. Gated-Attention Architectures for Task-Oriented Language Grounding, AAAI 2018 or Multimodal Sentiment Analysis with Word-level Fusion and Reinforcement Learning, ICMI 2017) to learn these weights. How does your model compare to this line of related work, and can your model be modified to take advantage of these fusion methods?
  2. line 145-146: "We prefer the IWAE objective over the standard ELBO objective not just for the fact that it estimates a tighter bound, but also for the properties of the posterior when computing the multi-sample estimate." -> Do you have experimental results that back this up? How significant is the difference?
  3. line 157-158: "needing M^2 passes over the respective decoders in total" -> Do you have experimental runtimes to show that this is not a significant overhead? The number of modalities is quite small (2 or 3), but when the decoders are large recurrent or deconvolutional layers, this could be costly.
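
As background for point 2, the "tighter bound" property can be checked numerically with a minimal sketch. This is not the paper's objective or code. It assumes a toy 1-D model with prior p(z) = N(0, 1), likelihood p(x|z) = N(z, 1) and a deliberately mismatched proposal q(z|x) = N(0, 1), and all function names are hypothetical. The single-sample ELBO estimate averages the log importance weights, while the K-sample IWAE estimate takes the log of their average, so by Jensen's inequality the IWAE estimate is never smaller for the same set of weights.

```python
import math
import random


def log_normal(x, mu, sigma):
    """Log density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sample_log_weights(x, mu_q, sigma_q, K, rng):
    """Draw K latents from q(z|x) and return log w_k = log p(x, z_k) - log q(z_k|x)."""
    log_w = []
    for _ in range(K):
        z = rng.gauss(mu_q, sigma_q)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        log_w.append(log_joint - log_normal(z, mu_q, sigma_q))
    return log_w


def elbo_estimate(log_w):
    """Standard ELBO estimate: mean of the log weights."""
    return sum(log_w) / len(log_w)


def iwae_estimate(log_w):
    """IWAE estimate: log of the mean weight, computed in log space."""
    return logsumexp(log_w) - math.log(len(log_w))


rng = random.Random(0)
log_w = sample_log_weights(x=1.0, mu_q=0.0, sigma_q=1.0, K=50, rng=rng)

elbo = elbo_estimate(log_w)
iwae = iwae_estimate(log_w)
# In this toy model the marginal is available in closed form: p(x) = N(0, 2).
true_log_px = log_normal(1.0, 0.0, math.sqrt(2.0))

print("ELBO estimate:", elbo)
print("IWAE estimate:", iwae)
print("true log p(x):", true_log_px)
```

The sketch only shows the ordering of the two estimates. It says nothing about the posterior properties the quoted sentence refers to, which is where experimental evidence from the authors would be needed.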
****************************Post Rebuttal****************************
The author response addressed some of my concerns regarding novelty, but I am still inclined to keep my score since I do not believe that the paper substantially improves over (Wu and Goodman, 2018) and (Tsai et al., 2019). The clarity of writing can be improved in some parts, and I hope that the authors will make these changes. Regarding the quality of generation, it is definitely not close to SOTA language models such as GPT-2, but I would still give the authors credit since generation is not their main goal, but rather one of their 4 defined goals for measuring the quality of multimodal representation learning.

Paper ID:	9192