NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:162
Title:Divide and Couple: Using Monte Carlo Variational Objectives for Posterior Approximation

Reviewer 1

This paper contributes methodology for improving methods commonly used for statistical, likelihood-based approximate inference. More precisely, the paper proposes improvements to the popular variational inference method. The authors investigate the optimisation of variational objectives constructed from Monte Carlo estimators of the likelihood and, in particular, the effect of such optimisation on the approximation of the posterior distribution of interest. They develop a procedure that they call “divide and couple” to find an approximation to the posterior of interest that diverges by no more than the gap between the log-likelihood and its lower bound derived from the Monte Carlo estimator. The authors give both the theoretical foundations for their new method and guidelines for its practical implementation. I find the research question very interesting and the paper clearly written, with a careful description of the new methodology. I have a few comments on the paper below.
1) The authors should organise Sections 4 and 5 better, so that the implementation and the results of their empirical study are presented more clearly. Moreover, a short section with conclusions would be very useful if added after Section 5.
2) There are small mistakes that affect the readability of the paper; for example, in line 118 the word “be” should be added after “are going to”. The reference list needs to be checked again, since I think that the paper in [1] refers to the paper published in JRSSB.
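For reference, the guarantee summarised at the start of this review can be written compactly. This is only a sketch of my reading of the claim, not a statement copied from the paper: R is assumed to denote an unbiased Monte Carlo estimator of the likelihood p(x), and Q(z) the posterior approximation produced by the coupling step.

```latex
% Assumption: R is an unbiased Monte Carlo estimator of p(x),
% and Q(z) is the approximation obtained from the coupling step.
\mathbb{E}[R] = p(x)
\quad\Longrightarrow\quad
\mathrm{KL}\big(Q(z)\,\|\,p(z \mid x)\big)
\;\le\;
\log p(x) - \mathbb{E}[\log R].
```

The right-hand side is non-negative by Jensen's inequality, so tightening the Monte Carlo bound also tightens the bound on the divergence.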

Reviewer 2

EDIT: I have read the response and the other reviews. I appreciate the clarification by the authors, and am content that they will include a more careful discussion of related work in the final version.
Clarity: I found this paper generally clear and easy to read. I would suggest the following improvements for readability.
- Include x in R(omega) and R to highlight the fact that it is a (random) function of x.
- Using sans-serif to indicate when a quantity is random is nice, but its usefulness is largely undermined by the fact that the serif and sans-serif symbols for omega are nearly identical. I would recommend either using capital letters or using a symbol other than omega for the auxiliary variables.
- I believe there is a typo in Lemma 1 (should be Q(omega)R(omega)).
Quality: The paper is generally of high quality and correct. Still, the current draft lacks a comprehensive discussion of related work. Especially in the context of questions about originality (below), this is a major omission.
Originality: There is significant overlap between this work and known techniques in the Monte Carlo literature. The idea is very closely related to the idea of matching the proposal process to a distribution over an extended state space. This idea was used in Neal 2001 and Andrieu et al. 2010, and reviewed in Axel Finke's thesis. Moreover, I believe this idea is essentially equivalent to the idea proposed in Lawson et al. 2019, presented at an ICLR 2019 workshop. The focus of the Lawson et al. 2019 paper is different, in that they focus on the choice of the coupling. The Lawson et al. 2019 paper derives from work on auxiliary variable variational inference; see e.g. Agakov & Barber 2004. In any case, I urge the authors to spend more time explaining the connections between their work and the existing literature.
Significance: This is an active research area, and this paper has the potential to generate interest.
Even so, the experimental results are currently quite weak and some interesting avenues remain unexplored.
- The improvements appear in Figure 6 to be on the order of a hundredth of a nat, which is a bit too small to justify changing a codebase to incorporate these bounds.
- I would urge the authors to investigate how these distinct bounds interact with learning the parameters of q (the variational approximation) and p (the model).
Citations:
- Radford Neal. Annealed Importance Sampling. Statistics and Computing, 2001.
- Axel Finke. On Extended State-Space Constructions for Monte Carlo Methods.
- Christophe Andrieu, Arnaud Doucet, Roman Holenstein. Particle Markov chain Monte Carlo methods. JRSSB, 2010.
- Dieterich Lawson, George Tucker, Bo Dai, Rajesh Ranganath. Revisiting Auxiliary Latent Variables in Generative Models. ICLR Workshop, 2019.
- Felix V. Agakov, David Barber. An auxiliary variational method. International Conference on Neural Information Processing. Springer, Berlin, Heidelberg, 2004.

Reviewer 3

% Post rebuttal comments
Thank you for your rebuttal. I am satisfied that the authors have taken on the feedback from the reviewers and are taking steps to improve the paper. I do still feel that a lot more could be done with the experiments section beyond the changes in presentation proposed in the rebuttal, and so I strongly encourage the authors to look into additional experiments for the camera-ready as well. I have decided to stick to my original score.
%%%%%
Overall I think this is a strong submission that would be of notable interest to the NeurIPS community. As explained above, I think the work has clear significance, while the approach also scores well on originality: although specific instances of the framework have already appeared in the literature (e.g. the appropriate Q(z) is known for IWAE and VSMC) and the objectives themselves are not really novel (i.e. the divide step is just prior work), there is clear novelty in the general nature of the coupling framework and the associated analysis which justifies it. The paper is clearly written and most of it is easy to follow (with the exception of the experiments section), despite the somewhat subtle nature of some of the ideas. The quality of the paper is also very good: I carefully checked the proofs for the theorems in the paper and found them to be correct, while a very detailed and precise measure-theoretic version of the results is also provided in the supplement.
My main criticism of the paper is the experiments section, which feels like a bit of an afterthought. For example, the quality of the writing is far below that of the rest of the paper, there is a large unnecessary tangent about mappings of variance-reduced sampling approaches in Gaussians (i.e. Figure 7 and accompanying text), and I do not feel it really provides an insightful investigation of the ideas being introduced.
Though the results from Figure 8 are useful, they are both somewhat difficult to interpret and do not give the most emphatic support for the suggested approaches. I feel there are a number of interesting empirical investigations that could have been made but were not, such as using the methods in a VAE context or doing more detailed investigations of the qualitative behavior of different Q(z).
Other specific points
1. There appears to be a major typo in Lemma 1 and elsewhere: P(omega) is never defined, but is interpreted as Q(omega) in the vast majority of places (in the original version of this result from Le et al. 2018 [9] it is just Q(omega)). I presume this is just meant to be Q(omega) everywhere, but the fact that line 417 has P(omega)=Q(omega) makes me think the authors might be trying to convey something extra here. Please comment.
2. The KL is only one of many subjective metrics for the disparity between two distributions. This raises a number of interesting subtleties about whether Q(z) is an objectively better approximation, or simply one that is more suited to KL(Q||P) (e.g. its tendency towards mode seeking). I would expect it to be the former, but I think the paper would be better for some discussion and maybe even experimentation on this. For example, if Q_0(omega) makes a mean field assumption, does Q(z) allow one to overcome this?
3. The definition of the KL for conditional distributions as being the expectation over the conditioning variable is liable to cause confusion, as this is not standard in the VAE literature: it really threw me off the first time I read Theorem 2. I do not think anything is lost by writing these out explicitly (e.g. as an expectation over Q(z) in Theorem 2), and it would make things much easier to interpret and avoid potential confusions like mine.
4. In lines 56-57 the paper talks about "novel objectives enabled by this framework".
This is a bit of a misnomer, as the paper is not really contributing much in the way of new objectives: it is already known that the expectation of the log of any unnormalized marginal likelihood estimator is a valid variational bound, which already implies all of these objectives, which are standard marginal likelihood estimators.
5. I personally find it quite difficult to distinguish the sans-serif font for random variables. It might be good to think about a different way of distinguishing them.
6. The spacing in Table 1 should be fixed.
7. I would use something other than u in Figure 5 to avoid confusion with the "uniform" random variables \omega.
8. I really do not feel that Fig 7 and the accompanying text add much to the paper. I would move them to the supplement and use the space to better explain the numerical experiments: the last paragraph of section 5 is extremely terse and it is really difficult to figure out exactly what experiments you are actually running (you almost have to read the Figure 8 caption before any of the main text to give things context). There is also no real discussion of the results, which makes them difficult to interpret. In general, I think section 5 needs some pretty heavy editing.
9. I would explicitly define << on line 314 to make this more accessible, as people may not be familiar with its typical measure-theoretic meaning.
10. Line 334 has a tex error in the citation.
11. The proof block for Theorem 9 ends on line 380 instead of 387 for some reason.
12. In the proof of Theorem 2, I think you should spell out (i.e. add some words for) the change from P^{MC}(z|x) to p(z|x): this is somewhat the crux of the proof and it is left rather implicit as to why it is the case; some hand-holding would help. The proof also seems a bit sloppy in switching between typesetting omega etc. as random variables and not.
13. In the proof of Claim 10 the second delta should be \delta(z-T(\omega)) instead of \delta(z-\omega).
The result is still the same.
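To make point 4 concrete, here is a minimal numerical sketch of the known fact that the expected log of an unbiased marginal likelihood estimator lower-bounds log p(x). Nothing here is taken from the paper: the toy model (z ~ N(0, 1), x | z ~ N(z, 1), hence p(x) = N(x; 0, 2)), the choice of the prior as proposal, and the helper names `log_normal` and `log_estimator` are all assumptions made for illustration. The estimator is the usual importance-weighted average over M particles.

```python
import numpy as np


def log_normal(v, mean, var):
    """Log density of a univariate Gaussian N(v; mean, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)


def log_estimator(x, num_particles, num_repeats, rng, q_mean=0.0, q_var=1.0):
    """Draw num_repeats independent values of log R_M, where
    R_M = (1/M) * sum_i p(z_i, x) / q(z_i) and z_i ~ q = N(q_mean, q_var)."""
    z = rng.normal(q_mean, np.sqrt(q_var), size=(num_repeats, num_particles))
    # Log importance weights for the assumed toy model.
    log_w = (
        log_normal(z, 0.0, 1.0)        # prior p(z)
        + log_normal(x, z, 1.0)        # likelihood p(x | z)
        - log_normal(z, q_mean, q_var)  # proposal q(z)
    )
    # Numerically stable log of the mean weight in each row.
    m = log_w.max(axis=1, keepdims=True)
    return (m + np.log(np.mean(np.exp(log_w - m), axis=1, keepdims=True))).ravel()


rng = np.random.default_rng(0)
x = 1.5
log_px = log_normal(x, 0.0, 2.0)  # exact log marginal likelihood of the toy model

# Monte Carlo estimate of E[log R_M] for several particle counts M.
bounds = {M: log_estimator(x, M, 20000, rng).mean() for M in (1, 10, 100)}

for M, value in bounds.items():
    print(f"M={M:4d}  E[log R_M] ~ {value:.4f}  gap ~ {log_px - value:.4f}")
print(f"exact log p(x) = {log_px:.4f}")
```

For M = 1 the gap is the ordinary KL(q || p(z|x)), and it shrinks as M grows. This is the sense in which the objectives are already implied by standard estimators, and it is this same gap that bounds the divergence of the coupled approximation Q(z).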