Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Summary: The authors describe an improved objective function for variational inference. In the spirit of the Importance Weighted Autoencoder (IWAE), they use multiple samples from the approximating distribution to obtain a tighter bound on the log marginal probability. The key insight of the paper is that they can use all combinations (across subsets of parameters) of samples drawn from the approximating distribution to compute a marginal probability estimator. Naively, this would require computation that scales exponentially in the number of subsets. The authors reduce this complexity by exploiting dependency structures in the generative model. They show the method is applicable when using fully factorized approximating distributions, or more complex approximating distributions which preserve some dependency structure. The authors benchmark their approach on synthetic data sets and compare the accuracy of estimates for the log marginal to IWAE and variational SMC. They finish by exploring the use of two variance reduction schemes with their method and IWAE.
Comments:
- The paper is reasonably well written. The introduction and background are relevant. The description of the approach is reasonably easy to follow, though some important points appear only in the supplemental material. In particular, the complexity of the non-factorized method is not really presented in the main text. The non-factorized version in fact uses a different generative model in the objective, which is not obvious from the main text.
- The idea is quite interesting and is a nice synthesis of ideas from the VAE, graphical models and MCMC literature. I feel like there may also be some interesting connections to the discrete particle filter literature.
- The paper focuses on the method's ability to provide a tighter bound on the marginal likelihood. It is not clear that this leads to any performance improvement in the training of the neural nets.
- The discussion of computational complexity is hard to follow. If I understand correctly, the method will scale exponentially in the size of the largest factor. The authors argue this is unimportant in the context of neural nets. They only show a simple linear chain model, where this scaling has no impact on performance. But in other models it seems likely that it will make their approach slower than the IWAE, possibly significantly.
- The authors seem to make a major point of how the method meshes well with tensor implementation on GPUs. To me that is a minor point, as the major insight is the reduction in computational complexity that is achieved by using the sum-product algorithm.
- Figure 3 is hard/impossible to read. Better color coding or line style choice is needed.
- (Post author feedback) The authors have proposed to address most of my concerns in the final manuscript. Thus I support publication.
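To make the complexity point in the comments above concrete, here is a minimal sketch of why averaging over all sample combinations need not cost exponential time in a linear chain. It is an illustration under stated assumptions, not the authors' implementation: it assumes the log importance weight of a combination decomposes into a per-variable term plus a term for each pair of neighbouring variables, and the names `unary`, `pairwise`, and both function names are hypothetical. With n latent variables and K samples each, the brute-force average visits K^n combinations, while the sum-product recursion needs on the order of n K^2 operations.

```python
import itertools
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def log_mean_all_combinations_naive(unary, pairwise):
    """Brute force: log of the average weight over all K**n combinations.

    unary[i, k]       : assumed log-weight term for sample k of variable i
    pairwise[i, k, l] : assumed log-weight term linking sample k of
                        variable i with sample l of variable i + 1
    """
    n, K = unary.shape
    log_terms = []
    for ks in itertools.product(range(K), repeat=n):
        lw = sum(unary[i, ks[i]] for i in range(n))
        lw += sum(pairwise[i, ks[i], ks[i + 1]] for i in range(n - 1))
        log_terms.append(lw)
    return logsumexp(np.array(log_terms), axis=0) - n * np.log(K)


def log_mean_all_combinations_chain(unary, pairwise):
    """Same quantity via a forward sum-product pass along the chain."""
    n, K = unary.shape
    msg = unary[0]  # log-message over the K samples of variable 0
    for i in range(n - 1):
        # Sum out variable i, then absorb the unary term of variable i + 1.
        msg = logsumexp(msg[:, None] + pairwise[i], axis=0) + unary[i + 1]
    return logsumexp(msg, axis=0) - n * np.log(K)
```

Calling both functions on the same random `unary` array of shape `(n, K)` and `pairwise` array of shape `(n - 1, K, K)` should give matching values up to floating-point error. For factors that couple more than two variables, the message tables grow with the factor size, which is the exponential dependence on the largest factor raised in the comments.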
First of all, this paper is hard to understand. Many important discussions and results are shown only in the supplemental material, and the authors do not specify in the main paper which section of the appendix the reader should refer to. I cannot follow how Eq (11) and Eq (12) are derived, or under what kind of models they hold. The authors only state that Eq (11) holds "In this (and many other cases)", but I cannot understand what this sentence means. Also, there is no formal definition of kappa_j in the main paper (I found it in Eq (36) in the appendix). The experiments are not sufficient. Although the second paragraph of the Introduction indicates that IWAE is not enough for large models, the experiments are conducted only on very simple models on which IWAE works well. I think additional experiments on real-world data with large models are needed to verify the usefulness of the proposed method.
EDIT POST-REBUTTAL: I maintain my original score for the reasons already indicated. I think there is value in publishing this work. *** This work is in a long line of papers seeking to improve stochastic optimization for variational inference with a reparameterized lower bound. While the particular algorithmic contribution is not a major departure from the state of the art, the careful empirical study, theoretical grounding, and discussion in relation to recent works have made me reconsider what I considered obvious before reading: that DReGS was the most efficient method of gradient estimation due to unbiasedness and lower variance. I appreciate this insight. The writing is very clear, and addresses current open issues in approximate inference. The empirical results are appropriate, using both synthetic and MNIST data.