Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper connects variational inference with thermodynamic integration, so that the data log-likelihood can be formulated as a 1D integral of the instantaneous ELBO over the unit interval. By applying a left Riemann sum, the authors derive TVO, a novel lower bound on the marginal log-likelihood that recovers the traditional variational ELBO when only one partition is used. The authors then design an importance-sampling-based gradient estimator to optimize the objective, and compare with other methods on both discrete and continuous deep generative models. The paper also unifies other methods such as wake-sleep within the TVO framework. Originality and Significance: The formulation of TVO is an interesting idea. Better optimization methods than the importance-sampling-based approach are worth exploring further. The connections to previous methods provide new insights into unifying different learning methods. Quality: In section 2, the development of TVO is a solid derivation. But in section 3, the authors claim the proposed method uses neither the high-variance REINFORCE estimator nor reparameterization. This is confusing because, during the derivation, both the score-function trick and the reparameterization trick were applied. During the rebuttal period, the authors provided more explanation on this point, both theoretically and experimentally, which I think is very beneficial. Some minor concerns: 1) f was not defined before it is used in section 3. Is it the terms multiplied by 1/K on the left-hand side of Eq. 2? 2) The authors provided a detailed comparison with REINFORCE during the rebuttal period and showed mathematically why the proposed method has lower variance, which is nice. However, the gradient of the covariance estimator in the rebuttal feedback is not obvious, so it would be good if more derivation details were provided in the final version.
3) In the section "The effect of S, K, and β locations", why does using the importance sampler "increase bias", and why are the estimates "likely to be more biased"? Should it be higher variance instead? 4) As a follow-up to 3), is it possible that K=2 is sufficient because the proposed method has too large a variance? 5) It's great to see the authors agree to add more experimental comparisons with regular ELBO optimization using REINFORCE or reparameterization. 6) It's great that the authors explained during the rebuttal how the variance reduction technique in Owen (2013) can be applied to the proposed method, and provided additional experimental results to show its effect. Clarity: The organization of the submission is fine, but in terms of overall clarity there is still plenty of room for improvement. For example, more mathematical background on WS and RWS is needed before discussing their connections with TVO in section 4. The readability of the second half of the paper is clearly worse than that of the first half. A couple of descriptions of subplot positions in Figure 2 are wrong (top left should be top right, etc.). Many sentences in sections 6 and 7 could be rephrased to read more like scientific writing.
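The left-Riemann-sum bound summarized at the start of this review can be checked numerically on a small tractable model. The sketch below is only an illustration: the three-state latent variable, the probability tables `Q` and `P_JOINT`, and the function names are hypothetical and are not taken from the submission. It evaluates the integrand exactly (no sampling) and uses a uniform partition of the unit interval.

```python
import math

# Hypothetical toy model: the latent z takes three values and x is fixed.
Q = [0.5, 0.3, 0.2]            # variational posterior q(z|x)
P_JOINT = [0.02, 0.10, 0.08]   # joint p(x, z); sums to p(x) = 0.2

LOG_PX = math.log(sum(P_JOINT))
# f(z) = log p(x, z) - log q(z|x), the quantity averaged in the ELBO.
F = [math.log(p) - math.log(q) for p, q in zip(P_JOINT, Q)]
ELBO = sum(q * f for q, f in zip(Q, F))


def integrand(beta):
    """Exact E_{pi_beta}[f(z)], where pi_beta is proportional to q^(1-beta) * p^beta."""
    # q^(1-beta) * p^beta equals q * exp(beta * f).
    weights = [q * math.exp(beta * f) for q, f in zip(Q, F)]
    return sum(w * f for w, f in zip(weights, F)) / sum(weights)


def tvo(num_partitions):
    """Left Riemann sum of the integrand on a uniform partition of [0, 1]."""
    k_total = num_partitions
    return sum(integrand(k / k_total) for k in range(k_total)) / k_total


for k_total in (1, 2, 4, 8):
    print(k_total, tvo(k_total), "log p(x) =", LOG_PX)
```

With one partition the sum reduces to the integrand at β = 0, which is the ELBO. Because the integrand is non-decreasing in β, every left Riemann sum stays below log p(x), and refining the partition by halving each interval can only tighten the bound in this exact setting.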
UPDATE AFTER READING AUTHOR FEEDBACK
================================================================================
I would like to thank the authors for taking the concerns expressed in my review very seriously. The author feedback addresses my concerns very well. I think with the promised fixes this will be a strong paper with an original idea that could be simple enough to be used in practice. From a theory perspective, the paper might spark new ideas in readers, since TI is explained very well and there are probably more connections between TI and VI.
ORIGINAL REVIEW
================================================================================
The paper proposes a series of new lower bounds on the model evidence in variational inference that generalize the standard ELBO. The proposal is based on a discretized version of thermodynamic integration (TI). Intuitively, instead of evaluating log p(x) directly by solving the intractable integral $p(x) = \int p(x,z) dz$, one first evaluates log p_0(x) for a reference model p_0(x, z) for which this is easy to do. One then changes the model p_0(x, z) along a continuous path until it is deformed into the target model p(x,z), and one integrates the changes that this procedure incurs in the evidence log p(x). Depending on how the integral is discretized, one obtains either a lower or an upper bound on the evidence. I find the proposal convincing and well written. The underlying idea of TI is simple and well explained. A non-obvious way to obtain stochastic gradient estimates is also well explained. Experiments focus on models with discrete latent variables, as it is advertised that the proposed gradient estimator is applicable even to this situation. I am curious whether the authors expect similar performance gains for models with continuous latent variables. My only main comment is that I cannot find a discussion of the variance of the gradient estimator.
The proposed bound is tighter than the ELBO, i.e., it is less biased. Usually, bias reduction comes at the cost of an increase in gradient variance (see [Bamler et al., NIPS 2017] and [Rainforth et al., ICML 2018]). Larger gradient variance slows down convergence of gradient descent. The variance of the proposed gradient estimator would also be interesting for another reason. The paper proposes a new gradient estimator although the standard REINFORCE method would in principle also work for the proposed bound. As far as I can tell, the only argument against REINFORCE gradients would be their high variance (the use of discrete latent variables is only an argument against reparameterization gradients, not against REINFORCE). However, the proposed gradient estimator in Eq. 11 looks like it could suffer from high variance too, because one has to estimate a covariance between two quantities. To estimate a covariance, one has to estimate the difference between two quantities that are typically of similar magnitude (see Eq. 12). This is hard to do numerically, as absolute values cancel approximately while variances add up. The "Related Work" section mentions only work related to TI. There has been more work on tighter bounds for black-box variational inference, e.g., the above-mentioned papers or [Li and Turner, NIPS 2016].
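For readers following the covariance argument, the generic calculation behind a covariance-form gradient can be sketched as below. The notation is assumed for illustration ($\tilde\pi_\lambda$ for an unnormalized density with parameters $\lambda$, $Z_\lambda$ for its normalizer, $f_\lambda$ for the function being averaged) and may differ from the notation of Eq. 11 and Eq. 12 in the submission.

```latex
\begin{align}
\pi_\lambda(z) &= \frac{\tilde\pi_\lambda(z)}{Z_\lambda},
  \qquad Z_\lambda = \int \tilde\pi_\lambda(z)\,dz, \\
\nabla_\lambda \log Z_\lambda
  &= \frac{1}{Z_\lambda}\int \tilde\pi_\lambda(z)\,\nabla_\lambda \log\tilde\pi_\lambda(z)\,dz
   = \mathbb{E}_{\pi_\lambda}\!\left[\nabla_\lambda \log\tilde\pi_\lambda(z)\right], \\
\nabla_\lambda\, \mathbb{E}_{\pi_\lambda}\!\left[f_\lambda(z)\right]
  &= \mathbb{E}_{\pi_\lambda}\!\left[\nabla_\lambda f_\lambda(z)\right]
   + \mathbb{E}_{\pi_\lambda}\!\left[f_\lambda(z)\,\nabla_\lambda \log\pi_\lambda(z)\right] \\
  &= \mathbb{E}_{\pi_\lambda}\!\left[\nabla_\lambda f_\lambda(z)\right]
   + \mathbb{E}_{\pi_\lambda}\!\left[f_\lambda(z)\,\nabla_\lambda \log\tilde\pi_\lambda(z)\right]
   - \mathbb{E}_{\pi_\lambda}\!\left[f_\lambda(z)\right]
     \mathbb{E}_{\pi_\lambda}\!\left[\nabla_\lambda \log\tilde\pi_\lambda(z)\right] \\
  &= \mathbb{E}_{\pi_\lambda}\!\left[\nabla_\lambda f_\lambda(z)\right]
   + \operatorname{Cov}_{\pi_\lambda}\!\left[f_\lambda(z),\, \nabla_\lambda \log\tilde\pi_\lambda(z)\right].
\end{align}
```

The second-to-last line makes the numerical concern explicit: the covariance is a difference of two expectations that can be of similar magnitude, and each has to be estimated from samples.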
*originality* The paper is very original, and the provided framework extending the standard ELBO to TVO is very elegant. I expect that this paper will stimulate a lot of new research in this direction. A natural and good idea. *quality* The mathematical derivations are very clear and easy to follow. The experimental evaluation is well conducted, but restricted to MNIST only. The implementation of TVO used (based on importance-weighted sampling) seems to be of limited advantage (the number of partitions and the number of particles seem to have a limited impact on the learned model), which is somewhat underwhelming. Also, the effect that more particles in RWS, VIMCO, and TVO lead to a worse approximation of the posterior is surprising (Fig. 4), but not further explored. It would be interesting to see the TVI integrand (Fig. 1) for a real example/model/dataset, e.g. estimated with a massive number of importance samples. In Section 4, it is unclear to me why both the wake and sleep phases are over \phi. *clarity* The paper is largely clear. The connection to wake-sleep, however, remains somewhat unclear. *significance* While the experimental evaluation could be improved, the main contribution, the TVO, is very refreshing and of high significance.
************************************************************************************************************************
Update: As said in my original review, I find the proposed approach refreshing, original and creative. However, a richer set of experiments covering more datasets and models could make the paper considerably stronger. Thus, I stick with my original rating (7).
The paper introduces the use of thermodynamic integration for the training of variational autoencoders. The new connection between TI and the ELBO is insightful, and the derivation of new lower bounds as Riemann discretizations of the TI formulation of the log model evidence is clever. However, the suggested gradient estimator is quite unoriginal, as it is simply a REINFORCE estimator with importance sampling weights. The paper is clearly written, although the notation is sometimes slightly confusing and some symbols are used before being properly defined. The experiments are good enough. Comments: 1) In line 84, the authors claim that their gradient estimator does not involve the REINFORCE estimator. However, their gradient estimation method IS the REINFORCE estimator with the addition of importance sampling and an extra term coming from the fact that their distribution is not normalized. 2) Eq. 2 contains several undefined symbols. It is unhelpful and confusing to give an equation whose meaning cannot be understood in the section in which it is presented. 3) The authors claim that the performance of the method does not increase monotonically with the number of partitions because the importance sampling scheme introduces a bias. This claim makes intuitive sense, but it should be backed up by either theoretical or experimental analysis. I suggest studying a case where p(x|z) p(z) is tractable and the importance samples are unbiased.
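The kind of tractable study suggested in comment 3) can be sketched on a toy model in which the bias of a self-normalized importance sampling estimate is computed exactly rather than simulated. Everything below is a hypothetical illustration: the three-state model, the tables `Q` and `P_JOINT`, and the function names are assumptions, and the self-normalized estimator is a generic one that may differ in detail from the one used in the submission.

```python
import itertools
import math

# Hypothetical toy model: the latent z takes three values and x is fixed.
Q = [0.5, 0.3, 0.2]            # proposal and variational posterior q(z|x)
P_JOINT = [0.02, 0.10, 0.08]   # joint p(x, z)
# f(z) = log p(x, z) - log q(z|x)
F = [math.log(p) - math.log(q) for p, q in zip(P_JOINT, Q)]


def exact_integrand(beta):
    """Exact E_{pi_beta}[f(z)], where pi_beta is proportional to q^(1-beta) * p^beta."""
    weights = [q * math.exp(beta * f) for q, f in zip(Q, F)]
    return sum(w * f for w, f in zip(weights, F)) / sum(weights)


def expected_snis(beta, num_samples):
    """Exact expectation of the self-normalized estimate of E_{pi_beta}[f].

    Enumerates every tuple of num_samples i.i.d. draws from q, so the
    result contains no Monte Carlo noise.
    """
    total = 0.0
    for zs in itertools.product(range(len(Q)), repeat=num_samples):
        prob = math.prod(Q[z] for z in zs)
        # Unnormalized importance weight pi_beta / q = exp(beta * f).
        w = [math.exp(beta * F[z]) for z in zs]
        estimate = sum(wi * F[z] for wi, z in zip(w, zs)) / sum(w)
        total += prob * estimate
    return total


def bias(beta, num_samples):
    return expected_snis(beta, num_samples) - exact_integrand(beta)


for s in (1, 2, 3, 4, 5):
    print(s, bias(1.0, s))
```

With a single sample the self-normalized weight cancels, so the estimate has the same expectation as the β = 0 integrand regardless of β. In this toy case the estimate at β = 1 is therefore biased downward, which is the kind of effect an exact study of this type could isolate from sampling noise.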