NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 1914
Title: Universal Boosting Variational Inference

Reviewer 1

The author(s) investigate how to improve the fit of a mixture distribution to an unknown and possibly unnormalized probability density of interest (e.g. a posterior distribution). Typical approximation families (e.g. Gaussians) generally incur a non-zero approximation error; in contrast, mixture approximations become asymptotically exact in the limit of infinitely many components. The authors' approach improves upon standard boosting variational inference (BVI) by replacing the KL divergence usually used to measure approximation quality with the Hellinger distance. They claim that, thanks to the simple geometry the Hellinger distance induces on the space of probability distributions, this yields an algorithm with much simpler guarantees. The authors furthermore provide a long list of theoretical results (Section 4) detailing how the Hellinger distance might be an interesting alternative to the usual KL divergence for variational inference. Finally, they test how their new method compares to alternatives on canonically challenging distributions: the Cauchy distribution and the "banana" distribution of Haario et al. 2001.

In a nutshell, their algorithm iterates two alternating steps to improve a mixture approximation of the target distribution: 1. finding the next best component to add to the mixture; 2. recomputing the best weights for the current mixture. They provide a clear theoretical derivation of the algorithm and a theoretical result (Th. 2; line 130) asserting its asymptotic correctness as the number of mixture components goes to infinity.

To me, the article seems to be a fairly strong accept, though I am not very familiar with methods for fitting mixture approximations to a target distribution. The overall content of the article is very solid. I have one major issue with Section 4, which I worry might contain mathematically false results, though I might be making a mistake.
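For concreteness, the two alternating steps summarized above could be sketched in a toy 1-D form. This is my own illustration, not the submission's algorithm: the grid-based Hellinger estimate, the candidate grid, and the random-search weight refit (`boost`, `refit_weights`) are all hypothetical stand-ins for the paper's actual optimization steps.

```python
import numpy as np

def gauss(x, mu, sigma):
    """Gaussian density N(mu, sigma^2) evaluated on a grid."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def hellinger_sq(p, q, dx):
    """Squared Hellinger distance H^2(p, q) = 1 - integral sqrt(p q), on a grid."""
    return 1.0 - np.sum(np.sqrt(p * q)) * dx

def mix(comps, weights, x):
    """Evaluate the current mixture approximation on the grid."""
    return sum(w * gauss(x, mu, s) for w, (mu, s) in zip(weights, comps))

def refit_weights(p, comps, x, dx, iters=200, seed=0):
    """Step 2 (crude stand-in): re-fit all mixture weights by random search
    over the simplex, keeping the weight vector with smallest H^2."""
    rng = np.random.default_rng(seed)
    k = len(comps)
    best_w, best_h = np.ones(k) / k, np.inf
    for _ in range(iters):
        w = rng.dirichlet(np.ones(k))
        h = hellinger_sq(p, mix(comps, w, x), dx)
        if h < best_h:
            best_w, best_h = w, h
    return best_w

def boost(target, x, n_components=5):
    """Alternate: (1) greedily pick the next component, (2) refit all weights."""
    dx = x[1] - x[0]
    p = target(x)
    comps, weights = [], np.array([])
    for _ in range(n_components):
        # Step 1: pick the (mu, sigma) that most reduces H^2 when blended
        # 50/50 into the current approximation (a simplistic line-search proxy).
        best, best_h = None, np.inf
        for mu in np.linspace(x[0], x[-1], 25):
            for sigma in (0.3, 1.0, 3.0):
                cand = gauss(x, mu, sigma)
                q = cand if not comps else 0.5 * mix(comps, weights, x) + 0.5 * cand
                h = hellinger_sq(p, q, dx)
                if h < best_h:
                    best, best_h = (mu, sigma), h
        comps.append(best)
        # Step 2: recompute the best weights for the enlarged mixture.
        weights = refit_weights(p, comps, x, dx)
    return comps, weights
```

Run on a bimodal target, the loop places components near both modes and the refit step rebalances the weights, with H^2 shrinking as components are added.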
While the system flagged this submission as potentially similar to submission 6102, I am confident that there is zero overlap between the two submissions.

Major remarks: In Section 4, I do not understand how Props. 5, 6, and 7 can hold in full generality. Indeed, consider the case of two distributions with disjoint supports (which I do not believe your assumptions rule out). Then the expected value of R^2 is infinite, importance sampling has infinite error and is nonsensical, etc. However, the Hellinger distance is bounded by 1, and the propositions would thus bound infinite quantities by finite ones. Can you please tell me if I'm wrong?

Minor remarks:
- I struggle to understand Th. 2. Q is a family of mixtures, so how can hat(p) be unique? Shouldn't we instead have a sequence of approximations with error tending to 0?
- Still in Th. 2, I do not understand what it means to solve eq. 4 with a (1 - delta) error. Are both energy functions in eq. 4 equal? If not, which half of eq. 4 are we talking about? How can we actually assert that we have solved eq. 4 with such a level of precision?
- On line 37, a reference to the Bernstein–von Mises theorem seems necessary to me. I like Kleijn and van der Vaart 2012, or van der Vaart's book, for this purpose.
- Dehaene and Barthelme (JRSS 2018) assert asymptotic consistency of expectation propagation and might be a relevant reference for line 37.
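The contrast the reviewer is pointing at, between unbounded KL-type quantities and the bounded Hellinger distance, can be illustrated numerically with two unit-variance Gaussians pushed apart (my own sketch, not from the submission; the closed forms below are the standard ones for equal-variance Gaussians):

```python
import numpy as np

# For N(m1, 1) and N(m2, 1):
#   KL(N(m1,1) || N(m2,1)) = (m1 - m2)^2 / 2        -> grows without bound
#   H^2(N(m1,1), N(m2,1)) = 1 - exp(-(m1 - m2)^2/8) -> saturates at 1
def kl_gauss(m1, m2):
    return 0.5 * (m1 - m2) ** 2

def hellinger_sq_gauss(m1, m2):
    return 1.0 - np.exp(-((m1 - m2) ** 2) / 8.0)

for gap in (1.0, 5.0, 20.0):
    print(gap, kl_gauss(0.0, gap), hellinger_sq_gauss(0.0, gap))
```

As the mean gap widens (the nearly-disjoint-support regime), the KL divergence grows quadratically while the squared Hellinger distance flattens just below 1, which is exactly why a Hellinger bound stays finite where a KL- or R^2-based quantity blows up.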

Reviewer 2

# Strengths

- A clear demonstration of a failure case of current BVI methods. BVI variants potentially offer a powerful extension to classical mean-field VI approaches. However, the authors show both empirically and theoretically that current variants can fail even when the true posterior belongs to the family of approximating distributions, demonstrating the need for improvements on prior work.
- The switch to the Hellinger distance is well motivated, both in itself and especially in the BVI context. Replacing the more common KL divergence with the Hellinger distance allows the authors to overcome these problems, and the change is well introduced and motivated.
- Strong theoretical evaluation. Throughout the paper the authors present a strong theoretical evaluation of their proposed method. Through a series of propositions, proved in great detail in the appendix, a variety of bounds and convergence results are given in support of the claims.

# Weaknesses

- Weak empirical evaluation. The empirical evaluation of the proposed method is limited to mostly synthetic datasets, which serve well to demonstrate the qualitatively different behaviour from earlier BVI approaches. Beyond these synthetic experiments, however, the experimental section is rather limited in scope. A wider range of real-world datasets would be desirable and would further strengthen the paper if the authors can show that the theory also performs well in practical applications.

# Recommendation

Apart from the experimental section, which could be improved, the paper is overall clearly written, well motivated, and well evaluated on the theoretical side. I therefore recommend acceptance.

Reviewer 3

Summary:
--------
The paper introduces a new approach to boosting variational inference. The standard approach is based on optimizing the KL divergence, which can lead to degenerate optimization problems. The paper proposes a boosting approach based on the Hellinger distance instead. The new approach improves the properties of the optimization problem, and the experiments suggest that it can lead to better posterior approximations.

Comments:
---------
- I appreciate that you discuss the advantages of the Hellinger distance. However, a complete discussion should also point out its disadvantages. Why isn't the Hellinger distance used more often in variational inference? Is it because it is less known? Or does the Hellinger distance have computational disadvantages, or are there settings where it leads to worse approximations?
- Can an efficient implementation of your approach also be obtained for mixture families other than exponential-family mixtures?
- What is the complexity of the algorithm? In particular, how does it scale in high dimensions?

Experiments:
------------
- More details on the methods should be provided. What is BBVI (I haven't seen a definition)? Do you mean boosting variational inference (BVI)? If so, does BVI use the same family for constructing the mixture?
- The target models/distributions seem quite simple (logistic regression plus simple target distributions) and are sufficient as toy experiments, but 1-2 more challenging real-world target models should be included.
- The datasets also seem too simplistic. If I see it correctly, the highest dimension considered is 10. Does your method also work in much higher dimensions?
- The number of data points also seems quite small. Does your method scale to big datasets?
- Since the methods optimize different distances/divergences, a direct comparison on other performance measures (e.g. predictive performance) would be interesting.

Conclusion:
-----------
The paper is well written and the math seems sound. The new approach to boosting variational inference based on the Hellinger distance seems to have some advantages over standard KL-divergence-based BVI. My main concern is that the empirical evaluation focuses on overly simple settings, and it is unclear how significant this work is.