Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Variational Autoencoders (VAEs) are an effective approach to unsupervised learning, but they suffer from a problem known as "posterior collapse". The ML research community has devoted considerable attention to solving this problem, and this paper belongs to that line of research.
Clarity: The paper is well written; however, a few things hinder clarity. For example, Section 5 as it stands is misleading: it makes it seem as though the theoretical conclusions of the previous sections also hold for deep VAEs. Furthermore, the discussion makes other misleading statements that were not verified in the draft. For example, the statement "We demonstrate empirically that the same optimization issues play a role in deep non-linear VAEs" is misleading because the experiments did not directly show this.
Quality: The experiments do not align well with the theoretical analysis in the earlier sections. The experiment section focuses heavily on showcasing the performance of KL annealing. The related work section is also limited: much related work on posterior collapse in VAEs is not referenced. See for example [1] and [2] below.
Significance: I find the paper of limited significance. The analysis for the linear case does not apply to the non-linear case, and the empirical evidence does not show that the conclusions from the linear case still hold in the non-linear case. Because of this, I think significance is lacking.
Minor Comments:
1. Throughout the paper: it's "log marginal likelihood", not "marginal log-likelihood".
2. What happens when the decoder is linear but the encoder is not? Does the analysis on linear decoders still hold?
[1] Tackling Over-Pruning in Variational Autoencoders. Yeung et al., 2017.
[2] Avoiding Latent Variable Collapse with Generative Skip Models. Dieng et al., 2018.
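For reference, the terminology in minor comment 1 can be made concrete in standard VAE notation. The symbols below (decoder parameters \theta, encoder parameters \phi, prior p(z)) are assumed generic notation and are not necessarily the paper's own.

```latex
\log p_\theta(x)
  = \log \int p_\theta(x \mid z)\, p(z)\, dz
  \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
        - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

The left-hand side is the log of the marginal likelihood, hence "log marginal likelihood"; the right-hand side is the ELBO. Posterior collapse is usually described as the KL term vanishing for some latent dimensions, so that q_\phi(z_i \mid x) matches the prior for those dimensions.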
Originality: The connection between pPCA and linear VAEs is well known and already discussed in the literature. However, the paper proposes to analyze posterior collapse by inspecting pPCA and linear VAEs. I find this analysis interesting and important. Quite surprisingly, it seems that posterior collapse follows from marginal likelihood maximization, and that neither the non-linear character of the model nor the ELBO causes this issue.
Quality: The theoretical results are properly derived and are important for understanding the phenomenon under consideration. Moreover, the empirical results support the theoretical analysis. The authors carried out experiments on the real-valued MNIST and CelebA datasets.
Clarity: The paper is clearly written and well organized. The theoretical part is well explained and all concepts are outlined. The only problem with this paper is that, once a problem has been identified, I would expect a proposal for fixing it. The authors propose to play with the variance; however, this is described quite chaotically. I would expect a better explanation of their proposal.
Significance: The problem of posterior collapse is very important for learning VAEs. The findings of this paper are thus of high significance.
Remarks:
- The "jump" from the theoretical part to the experiments is too large. What I mean is that a more natural flow would be to first identify the problem, then propose a solution, and then verify empirically whether the proposed solution is correct. I miss the "solution part" in this paper.
- Can we extend the result of this paper to distributions other than Gaussians? As presented in the Appendix, it is not trivial.
- The paper considers continuous latent variables. The conclusion of the paper is that the problem lies in the marginal likelihood optimization. However, with discrete-valued latents, does posterior collapse still occur?
==========AFTER REBUTTAL========== I would like to thank the authors for their rebuttal. I have read it, as well as the other reviews. I am satisfied with the rebuttal and stand by my score. In my opinion, the authors can easily address the questions raised by the reviewers in the final version of the paper. I understand some of the other reviewers' concerns about the invalidity of the presented analysis in the case of non-linear VAEs; however, I still believe that the main result of the paper is interesting. Showing that the problem lies in the marginal log-likelihood rather than in the VAE itself is interesting.
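The point that collapse can already arise from marginal likelihood maximization can be illustrated with the classical pPCA result (Tipping and Bishop): for a fixed observation variance sigma^2, the maximum-likelihood decoder column associated with covariance eigenvalue lambda_i has scale sqrt(max(lambda_i - sigma^2, 0)). The sketch below is not taken from the paper; the function names and the example spectrum are hypothetical, and the eigenvalues are assumed to be given.

```python
import math

def ppca_column_scales(eigenvalues, sigma_sq, latent_dim):
    """Scales of the ML decoder columns in pPCA for a fixed noise variance.

    eigenvalues: eigenvalues of the data covariance (any order).
    sigma_sq: fixed observation noise variance.
    latent_dim: number of latent dimensions (columns of W).
    """
    top = sorted(eigenvalues, reverse=True)[:latent_dim]
    # A column is zero (collapsed) when its eigenvalue does not exceed sigma^2.
    return [math.sqrt(max(lam - sigma_sq, 0.0)) for lam in top]

def count_collapsed(eigenvalues, sigma_sq, latent_dim):
    """Number of latent dimensions whose decoder column is exactly zero."""
    scales = ppca_column_scales(eigenvalues, sigma_sq, latent_dim)
    return sum(1 for s in scales if s == 0.0)

# Hypothetical covariance spectrum, for illustration only.
spectrum = [4.0, 2.0, 0.5, 0.1]
print(count_collapsed(spectrum, sigma_sq=1.0, latent_dim=4))   # 2
print(count_collapsed(spectrum, sigma_sq=0.05, latent_dim=4))  # 0
```

With sigma^2 = 1, the two directions whose eigenvalues fall below the noise variance are dropped, while a smaller sigma^2 keeps all four, which is the dependence on observation noise that the reviews discuss.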
*** Originality *** This work belongs to the line of theoretical developments on linear autoencoders and proposes to study linear VAEs in the context of posterior collapse. I do not know the area well enough to be sure that all relevant work is referenced. However, the related work section is clear and well written. As a side note, the title can be misleading, as the theoretical developments are not enough to "understand" posterior collapse extensively. There might be other aspects when working with more complex models, such as a mismatch between the true posterior and the approximate posterior.
*** Quality *** To my understanding, the analysis of pPCA (which has probably been observed before) and Theorem 1 are rather interesting findings. I appreciate the idea that if a problematic behavior already appears in a simple model (pPCA with Gaussian noise), then it should be expected in more complex scenarios as well. Regarding the experiments, I have a couple of remarks:
+ There is probably no need to point out "bad" implementations in this manuscript. However, I have never actually seen an open-source implementation of a VAE with sigma^2 = 1, and this looks like a bad idea to start with. Can the authors provide more insight into why people would do this in the first place?
+ The main experimental results are in Table 1. In particular, they simply suggest that there might be some challenges in initializing the VAEs? Unfortunately, the theory does not provide much more insight on this?
+ Figure 4 suggests that KL annealing might still be useful to "help" the VAE initialize its sigma towards low values. To my understanding, KL annealing was not studied in this manuscript, and this is something that was not predicted by the theory?
+ "This is strong evidence that the observation noise controls the stability of the stationary points in the non-linear model as in the linear case" -> are two datasets enough to claim strong evidence?
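Since several of the remarks above concern KL annealing, a minimal sketch of a linear annealing schedule may help fix ideas. It reflects what "KL annealing" commonly means (a weight on the KL term ramped from 0 to 1) and is an assumption, not the paper's exact procedure; `warmup_steps` and all function names are hypothetical.

```python
import math

def kl_weight(step, warmup_steps):
    """Linear KL annealing: the weight grows from 0 to 1 over warmup_steps."""
    if warmup_steps <= 0:
        return 1.0  # no warm-up: use the plain ELBO from the start
    return min(1.0, max(0.0, step / warmup_steps))

def gaussian_kl(mean, var):
    """KL( N(mean, var) || N(0, 1) ) for a single latent dimension."""
    return 0.5 * (mean ** 2 + var - 1.0 - math.log(var))

def annealed_objective(recon_log_lik, mean, var, step, warmup_steps):
    """Reconstruction term minus the annealed KL term, for one latent dimension."""
    return recon_log_lik - kl_weight(step, warmup_steps) * gaussian_kl(mean, var)

print(kl_weight(0, 100), kl_weight(50, 100), kl_weight(200, 100))  # 0.0 0.5 1.0
print(gaussian_kl(0.0, 1.0))  # 0.0
```

Early in training the KL penalty is effectively switched off, and a collapsed dimension (mean 0, variance 1) contributes zero KL regardless of the weight.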
All in all, my concern is that there are some additional claims in the experimental section that might not have a corresponding theory. I think the paper could make it clearer what is expected or analyzed (in this manuscript or in any of the related work) and what is not.
*** Clarity *** The manuscript is well written and clear; I appreciated reviewing it.
*** Significance *** My main criticism is about the link between the experiments and the theory, which could be made more rigorous or explicit.
*** Minor remarks ***
1) Figures 4 and 5 should be permuted.
2) The notation for the encoder E_\psi is not used in the paper.
3) Table 1 could have a clearer caption: "trained on XX on the training set"?
4) Figures 4 and 5 are not the easiest to read. Why not indicate to the reader exactly where it is best to see the mass?
===== After rebuttal ===== The authors answered my comments. I still believe there is a mismatch between experiments and theory, but this is a good contribution.