{"title": "Universal Boosting Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 3484, "page_last": 3495, "abstract": "Boosting variational inference (BVI) approximates an intractable probability density by iteratively building up a mixture of simple component distributions one at a time, using techniques from sparse convex optimization to provide both computational scalability and approximation error guarantees. But the guarantees have strong conditions that do not often hold in practice, resulting in degenerate component optimization problems; and we show that the ad-hoc regularization used to prevent degeneracy in practice can cause BVI to fail in unintuitive ways. We thus develop universal boosting variational inference (UBVI), a BVI scheme that exploits the simple geometry of probability densities under the Hellinger metric to prevent the degeneracy of other gradient-based BVI methods, avoid difficult joint optimizations of both component and weight, and simplify fully-corrective weight optimizations. We show that for any target density and any mixture component family, the output of UBVI converges to the best possible approximation in the mixture family, even when the mixture family is misspecified. We develop a scalable implementation based on exponential family mixture components and standard stochastic optimization techniques. Finally, we discuss statistical benefits of the Hellinger distance as a variational objective through bounds on posterior probability, moment, and importance sampling errors. 
Experiments on multiple datasets and models show that UBVI provides reliable, accurate posterior approximations.", "full_text": "Universal Boosting Variational Inference\n\nTrevor Campbell\n\nDepartment of Statistics\n\nUniversity of British Columbia\n\nVancouver, BC V6T 1Z4\ntrevor@stat.ubc.ca\n\nXinglong Li\n\nDepartment of Statistics\n\nUniversity of British Columbia\n\nVancouver, BC V6T 1Z4\n\nxinglong.li@stat.ubc.ca\n\nAbstract\n\nBoosting variational inference (BVI) approximates an intractable probability den-\nsity by iteratively building up a mixture of simple component distributions one at a\ntime, using techniques from sparse convex optimization to provide both compu-\ntational scalability and approximation error guarantees. But the guarantees have\nstrong conditions that do not often hold in practice, resulting in degenerate com-\nponent optimization problems; and we show that the ad-hoc regularization used\nto prevent degeneracy in practice can cause BVI to fail in unintuitive ways. We\nthus develop universal boosting variational inference (UBVI), a BVI scheme that\nexploits the simple geometry of probability densities under the Hellinger metric to\nprevent the degeneracy of other gradient-based BVI methods, avoid dif\ufb01cult joint\noptimizations of both component and weight, and simplify fully-corrective weight\noptimizations. We show that for any target density and any mixture component\nfamily, the output of UBVI converges to the best possible approximation in the mix-\nture family, even when the mixture family is misspeci\ufb01ed. We develop a scalable\nimplementation based on exponential family mixture components and standard\nstochastic optimization techniques. Finally, we discuss statistical bene\ufb01ts of the\nHellinger distance as a variational objective through bounds on posterior probabil-\nity, moment, and importance sampling errors. 
Experiments on multiple datasets and models show that UBVI provides reliable, accurate posterior approximations.\n\n1 Introduction\n\nBayesian statistical models provide a powerful framework for learning from data, with the ability to encode complex hierarchical dependence structures and prior domain expertise, as well as coherently capture uncertainty in latent parameters. The two predominant methods for Bayesian inference are Markov chain Monte Carlo (MCMC) [1, 2], which obtains approximate posterior samples by simulating a Markov chain, and variational inference (VI) [3, 4], which obtains an approximate distribution by minimizing some divergence to the posterior within a tractable family. The key strengths of MCMC are its generality and the ability to perform a computation-quality tradeoff: one can obtain a higher quality approximation by simulating the chain for a longer period [5, Theorem 4 & Fact 5]. However, the resulting Monte Carlo estimators have an unknown bias or random computation time [6], and statistical distances between the discrete sample posterior approximation and a diffuse true posterior are vacuous, ill-defined, or hard to bound without restrictive assumptions or a choice of kernel [7\u20139]. Designing correct MCMC schemes in the large-scale data setting is also a challenging task [10\u201312]. VI, on the other hand, is both computationally scalable and widely applicable due to advances from stochastic optimization and automatic differentiation [13\u201317]. However, the major disadvantage of the approach, and the fundamental reason that MCMC remains the preferred method in statistics, is that the variational family typically does not contain the posterior, fundamentally limiting the achievable approximation quality. 
And despite recent results in the asymptotic theory of variational methods [18\u201322], it is difficult to assess the effect of the chosen family on the approximation for finite data; a poor choice can result in severe underestimation of posterior uncertainty [23, Ch. 21].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nBoosting variational inference (BVI) [24\u201326] is an exciting new approach that addresses this fundamental limitation by using a nonparametric mixture variational family. By adding and reweighting only a single mixture component at a time, the approximation may be iteratively refined, achieving the computation/quality tradeoff of MCMC and the scalability of VI. Theoretical guarantees on the convergence rate of Kullback-Leibler (KL) divergence [24, 27, 28] are much stronger than those available for standard Monte Carlo, which degrade as the number of estimands increases, enabling the practitioner to confidently reuse the same approximation for multiple tasks. However, the bounds require the KL divergence to be sufficiently smooth over the class of mixtures, an assumption that does not hold for many standard mixture families, e.g. Gaussians, resulting in a degenerate procedure in practice. To overcome this, an ad-hoc entropy regularization is typically added to each component optimization; but this regularization invalidates convergence guarantees, and, depending on the regularization weight, sometimes does not actually prevent degeneracy.\n\nIn this paper, we develop universal boosting variational inference (UBVI), a variational scheme based on the Hellinger distance rather than the KL divergence. The primary advantage of using the Hellinger distance is that it endows the space of probability densities with a particularly simple unit-spherical geometry in a Hilbert space. 
We exploit this geometry to prevent the degeneracy of other gradient-based BVI methods, avoid difficult joint optimizations of both component and weight, simplify fully-corrective weight optimizations, and provide a procedure in which the normalization constant of $f$ does not need to be known, a crucial property in most VI settings. It also leads to the universality of UBVI: we show that for any target density and any mixture component family, the output of UBVI converges to the best possible approximation in the mixture family, even when the mixture family is misspecified. We develop a scalable implementation based on exponential family mixture components and standard stochastic optimization techniques. Finally, we discuss other statistical benefits of the Hellinger distance as a variational objective through bounds on posterior probability, moment, and importance sampling errors. Experiments on multiple datasets and models show that UBVI provides reliable, accurate posterior approximations.\n\n2 Background: variational inference and boosting\n\nVariational inference, in its most general form, involves approximating a probability density $p$ by minimizing some divergence $D(\\cdot \\| \\cdot)$ from $\\xi$ to $p$ over densities $\\xi$ in a family $\\mathcal{Q}$,\n\n$$q = \\arg\\min_{\\xi \\in \\mathcal{Q}} D(\\xi \\| p).$$\n\nPast work has almost exclusively involved parametric families $\\mathcal{Q}$, such as mean-field exponential families [4], finite mixtures [29\u201331], normalizing flows [32], and neural nets [16]. The issue with these families is that typically $\\min_{\\xi \\in \\mathcal{Q}} D(\\xi \\| p) > 0$, meaning the practitioner cannot achieve arbitrary approximation quality with more computational effort, and a priori, there is no way to tell how poor the best approximation is. 
To address this, boosting variational inference (BVI) [24\u201326] proposes the use of the nonparametric family of all finite mixtures of a component density family $\\mathcal{C}$,\n\n$$\\mathcal{Q} = \\operatorname{conv} \\mathcal{C} := \\left\\{ \\sum_{k=1}^{K} w_k \\xi_k : K \\in \\mathbb{N},\\ w \\in \\Delta^{K-1},\\ \\forall k \\in \\mathbb{N}\\ \\ \\xi_k \\in \\mathcal{C} \\right\\}.$$\n\nGiven a judicious choice of $\\mathcal{C}$, we have that $\\inf_{\\xi \\in \\mathcal{Q}} D(\\xi \\| p) = 0$; in other words, we can approximate any continuous density $p$ with arbitrarily low divergence [33]. As optimizing directly over the nonparametric $\\mathcal{Q}$ is intractable, BVI instead adds one component at a time to iteratively refine the approximation. There are two general formulations of BVI; Miller et al. [26] propose minimizing KL divergence over both the weight and component simultaneously,\n\n$$q_n = \\sum_{k=1}^{n} w_{nk} \\xi_k, \\qquad \\xi_{n+1}, \\omega = \\arg\\min_{\\xi \\in \\mathcal{C},\\, \\rho \\in [0,1]} D_{\\mathrm{KL}}(\\rho \\xi + (1 - \\rho) q_n \\| p), \\qquad w_{n+1} = [(1 - \\omega) w_n \\ \\ \\omega]^T,$$\n\nwhile Guo et al. and Wang [24, 25] argue that optimizing both simultaneously is too difficult, and use a gradient boosting [34] formulation instead,\n\n$$\\xi_{n+1} = \\arg\\min_{\\xi \\in \\mathcal{C}} \\left\\langle \\xi, \\nabla D_{\\mathrm{KL}}(\\cdot \\| p)|_{q_n} \\right\\rangle, \\qquad w_{n+1} = \\arg\\min_{\\omega = [(1 - \\rho) w_n \\ \\ \\rho]^T,\\ \\rho \\in [0,1]} D_{\\mathrm{KL}}\\left( \\sum_{k=1}^{n+1} \\omega_k \\xi_k \\,\\Big\\|\\, p \\right).$$\n\nFigure 1: (1a): Greedy component selection, with target $f$, current iterate $\\bar{g}_n$, candidate components $h$, optimal component $g_{n+1}$, the closest point $g^\\star$ to $f$ on the $\\bar{g}_n \\to g_{n+1}$ geodesic, and arrows for initial geodesic directions. The quality of $g_{n+1}$ is determined by the distance from $f$ to $g^\\star$, or equivalently, by the alignment of the initial directions $\\bar{g}_n \\to g_{n+1}$ and $\\bar{g}_n \\to f$. (1b): BVI can fail even when $p$ is in the mixture family. Here $p = \\frac{1}{2} \\mathcal{N}(0, 1) + \\frac{1}{2} \\mathcal{N}(25, 5)$, and UBVI finds the correct mixture in 2 iterations. BVI (with regularization weight in $\\{1, 10, 30\\}$) does not converge. For example, when the regularization weight is 1 the first component will have variance $< 5$, and the second component optimization diverges since the target $\\mathcal{N}(25, 5)$ component has a heavier tail. Upon reweighting the second component is removed, and the approximation will never improve.\n\nBoth algorithms attain $D_{\\mathrm{KL}}(q_N \\| p) = O(1/N)$\u00b9: the former by appealing to results from convex functional analysis [35, Theorem II.1], and the latter by viewing BVI as functional Frank-Wolfe optimization [27, 36, 37]. This requires that $D_{\\mathrm{KL}}(q \\| p)$ is strongly smooth or has bounded curvature over $q \\in \\mathcal{Q}$, for which it is sufficient that densities in $\\mathcal{Q}$ are bounded away from 0, bounded above, and have compact support [27], or have a bounded parameter space [28]. However, these assumptions do not hold in practice for many simple (and common) cases, e.g., where $\\mathcal{C}$ is the class of multivariate normal distributions. Indeed, gradient boosting-based BVI methods all require some ad-hoc entropy regularization in the component optimizations to avoid degeneracy [24, 25, 28]. In particular, given a sequence of regularization weights $r_n > 0$, BVI solves the following component optimization [28]:\n\n$$\\xi_{n+1} = \\arg\\min_{\\xi \\in \\mathcal{C}} \\left\\langle \\xi, \\log \\frac{\\xi^{r_{n+1}} q_n}{p} \\right\\rangle. \\tag{1}$$\n\nThis addition of regularization has an adverse effect on performance in practice as demonstrated in Fig. 1b, and can lead to unintuitive behaviour and nonconvergence, even when $p \\in \\mathcal{Q}$ (Proposition 1) or when the distributions in $\\mathcal{C}$ have lighter tails than $p$ (Proposition 2).\n\nProposition 1. Suppose $\\mathcal{C}$ is the set of univariate Gaussians with mean 0 parametrized by variance, let $p = \\mathcal{N}(0, 1)$, and let the initial approximation be $q_1 = \\mathcal{N}(0, \\tau^2)$. Then BVI in Eq. 
(1) with regularization $r_2 > 0$ returns a degenerate next component $\\xi_2$ if $\\tau^2 \\le 1$, and iterates infinitely without improving the approximation if $\\tau^2 > 1$ and $r_2 > \\tau^2 - 1$.\n\nProposition 2. Suppose $\\mathcal{C}$ is the set of univariate Gaussians with mean 0 parametrized by variance, and let $p = \\mathrm{Cauchy}(0, 1)$. Then BVI in Eq. (1) with regularization $r_1 > 0$ returns a degenerate first component $\\xi_1$ if $r_1 \\ge 2$.\n\n3 Universal boosting variational inference (UBVI)\n\n3.1 Algorithm and convergence guarantee\n\nTo design a BVI procedure without the need for ad-hoc regularization, we use a variational objective based on the Hellinger distance, which for any probability space $(\\mathcal{X}, \\Sigma, \\mu)$ and densities $p, q$ is\n\n$$D_H^2(p, q) := \\frac{1}{2} \\int \\left( \\sqrt{p(x)} - \\sqrt{q(x)} \\right)^2 \\mu(dx).$$\n\n\u00b9We assume throughout that nonconvex optimization problems can be solved reliably.\n\nAlgorithm 1 The universal boosting variational inference (UBVI) algorithm.\n1: procedure UBVI($p$, $\\mathcal{H}$, $N$)\n2:     $f \\overset{\\propto}{\\leftarrow} \\sqrt{p}$\n3:     $\\bar{g}_0 \\leftarrow 0$\n4:     for $n = 1, \\dots, N$ do\n           $\\triangleright$ Find the next component to add to the approximation using Eq. (5)\n5:         $g_n \\leftarrow \\arg\\max_{h \\in \\mathcal{H}} \\frac{\\langle f - \\langle f, \\bar{g}_{n-1} \\rangle \\bar{g}_{n-1}, h \\rangle}{\\sqrt{1 - \\langle h, \\bar{g}_{n-1} \\rangle^2}}$\n           $\\triangleright$ Compute pairwise normalizations using Eq. (2)\n6:         for $i = 1, \\dots, n$ do\n7:             $Z_{n,i} = Z_{i,n} \\leftarrow \\langle g_n, g_i \\rangle$\n8:         end for\n           $\\triangleright$ Update weights using Eq. (7)\n9:         $d = (\\langle f, g_1 \\rangle, \\dots, \\langle f, g_n \\rangle)^T$\n10:        $\\beta = \\arg\\min_{b \\in \\mathbb{R}^n,\\, b \\ge 0} b^T Z^{-1} b + 2 b^T Z^{-1} d$\n11:        $(\\lambda_{n,1}, \\dots, \\lambda_{n,n}) = \\frac{Z^{-1}(\\beta + d)}{\\sqrt{(\\beta + d)^T Z^{-1} (\\beta + d)}}$\n           $\\triangleright$ Update boosting approximation\n12:        $\\bar{g}_n \\leftarrow \\sum_{i=1}^{n} \\lambda_{ni} g_i$\n13:    end for\n14:    return $q = \\bar{g}_N^2$\n15: end procedure\n\nOur general approach relies on two facts about the Hellinger distance. First, the metric $D_H(\\cdot, \\cdot)$ endows the set of $\\mu$-densities with a simple geometry corresponding to the nonnegative functions on the unit sphere in $L^2(\\mu)$. In particular, if $f, g \\in L^2(\\mu)$ satisfy $\\|f\\|_2 = \\|g\\|_2 = 1$, $f, g \\ge 0$, then $p = f^2$ and $q = g^2$ are probability densities and\n\n$$D_H^2(p, q) = \\frac{1}{2} \\|f - g\\|_2^2.$$\n\nOne can thus perform Hellinger distance boosting by iteratively finding components that minimize geodesic distance to $f$ on the unit sphere in $L^2(\\mu)$. Like the Miller et al. approach [26], the boosting step directly minimizes a statistical distance, leading to a nondegenerate method; but like the Guo et al. and Wang approach [24, 25], this avoids the joint optimization of both component and weight; see Section 3.2 for details. Second, a conic combination $g = \\sum_{i=1}^{N} \\lambda_i g_i$, $\\lambda_i \\ge 0$, $\\|g_i\\|_2 = 1$, $g_i \\ge 0$ in $L^2(\\mu)$ satisfying $\\|g\\|_2 = 1$ corresponds to the mixture model density\n\n$$q = g^2 = \\sum_{i,j=1}^{N} Z_{ij} \\lambda_i \\lambda_j \\left( \\frac{g_i g_j}{Z_{ij}} \\right), \\qquad Z_{ij} := \\langle g_i, g_j \\rangle \\ge 0. \\tag{2}$$\n\nTherefore, if we can find a conic combination satisfying $\\|f - g\\|_2 \\le \\sqrt{2}\\, \\epsilon$ for $p = f^2$, we can guarantee that the corresponding mixture density $q$ satisfies $D_H(p, q) \\le \\epsilon$. The mixture will be built from a family $\\mathcal{H} \\subset L^2(\\mu)$ of component functions for which $\\forall h \\in \\mathcal{H}$, $\\|h\\|_2 = 1$ and $h \\ge 0$. 
We assume that the target function $f \\in L^2(\\mu)$, $\\|f\\|_2 = 1$, $f \\ge 0$ is known up to proportionality. We also assume that $f$ is not orthogonal to $\\operatorname{span} \\mathcal{H}$ for expositional brevity, although the algorithms and theoretical results presented here apply equally well in this case. We make no other assumptions; in particular, we do not assume $f$ is in $\\operatorname{cl} \\operatorname{span} \\mathcal{H}$.\n\nThe universal boosting variational inference (UBVI) procedure is shown in Algorithm 1. In each iteration, the algorithm finds a new mixture component from $\\mathcal{H}$ (line 5; see Section 3.2 and Fig. 1a). Once the new component is found, the algorithm solves a convex quadratic optimization problem to update the weights (lines 9\u201311). The primary requirement to run Algorithm 1 is the ability to compute or estimate $\\langle h, f \\rangle$ and $\\langle h, h' \\rangle$ for $h, h' \\in \\mathcal{H}$. For this purpose we employ an exponential component family $\\mathcal{H}$ such that $Z_{ij}$ is available in closed-form, and use samples from $h^2$ to obtain estimates of $\\langle h, f \\rangle$; see Appendix A for further implementation details.\n\nThe major benefit of UBVI is that it comes with a computation/quality tradeoff akin to MCMC: for any target $p$ and component family $\\mathcal{H}$, (1) there is a unique mixture $\\hat{p} = \\hat{f}^2$ minimizing $D_H(\\hat{p}, p)$ over the closure of finite mixtures $\\operatorname{cl} \\mathcal{Q}$; and (2) the output $q$ of UBVI($p$, $\\mathcal{H}$, $N$) satisfies $D_H(q, \\hat{p}) = O(1/\\sqrt{N})$ with a dimension-independent constant. No matter how coarse the family $\\mathcal{H}$ is, the output of UBVI will converge to the best possible mixture approximation. Theorem 3 provides the precise result.\n\nTheorem 3. For any density $p$ there is a unique density $\\hat{p} = \\hat{f}^2$ satisfying $\\hat{p} = \\arg\\min_{\\xi \\in \\operatorname{cl} \\mathcal{Q}} D_H(\\xi, p)$. If each component optimization Eq. (5) is solved with a relative suboptimality of at most $(1 - \\delta)$, then the variational mixture approximation $q$ returned by UBVI($p$, $\\mathcal{H}$, $N$) satisfies\n\n$$D_H(\\hat{p}, q)^2 \\le \\frac{J_1}{1 + \\left( \\frac{1 - \\delta}{\\tau} \\right)^2 J_1 (N - 1)}, \\qquad J_1 := 1 - \\langle \\hat{f}, g_1 \\rangle^2 \\in [0, 1), \\qquad \\tau := \\text{Eq. (3)} < \\infty.$$\n\nThe proof of Theorem 3 may be found in Appendix B.3, and consists of three primary steps. First, Lemma 9 guarantees the existence and uniqueness of the convergence target $\\hat{f}$ under possible misspecification of the component family $\\mathcal{H}$. Then the difficulty of approximating $\\hat{f}$ with conic combinations of functions in $\\mathcal{H}$ is captured by the basis pursuit denoising problem [38]\n\n$$\\tau := \\inf_{h_i \\in \\operatorname{cone} \\mathcal{H},\\ x \\in [0, 1)} (1 - x)^{-1} \\sum_{i=1}^{\\infty} \\|h_i\\|_2 \\quad \\text{s.t.} \\quad \\left\\| \\hat{f} - \\sum_{i=1}^{\\infty} h_i \\right\\|_2 \\le x, \\quad \\forall i,\\ h_i \\ge 0. \\tag{3}$$\n\nLemma 10 guarantees that $\\tau$ is finite, and in particular $\\tau \\le \\frac{\\sqrt{1 - J_1}}{1 - \\sqrt{J_1}}$, which can be estimated in practice using Eq. (9). Finally, Lemma 11 develops an objective function recursion, which is then solved to yield Theorem 3. Although UBVI and Theorem 3 are reminiscent of past work on greedy approximation in a Hilbert space [34, 39\u201346], they provide the crucial advantage that the greedy steps do not require knowledge of the normalization of $p$. UBVI is inspired by a previous greedy method [46], but provides guarantees with an arbitrary, potentially misspecified infinite dictionary in a Hilbert space, and uses quadratic optimization to perform weight updates. Note that both the theoretical and practical costs of UBVI are dominated by finding the next component (line 5), which is a nonconvex optimization problem. The other expensive step is inverting $Z$; however, incremental methods using block matrix inversion [47, p. 46] reduce the cost at iteration $n$ to $O(n^2)$ and overall cost to $O(N^3)$, which is not a concern for practical mixtures with $\\ll 10^3$ components. The weight optimization (line 10) is a nonnegative least squares problem, which can be solved efficiently [48, Ch. 23].\n\n3.2 Greedy boosting along density manifold geodesics\n\nThis section provides the technical derivation of UBVI (Algorithm 1) by exploiting the geometry of square-root densities under the Hellinger metric. Let the conic combination in $L^2(\\mu)$ after initialization followed by $n - 1$ steps of greedy construction be denoted\n\n$$\\bar{g}_n := \\sum_{i=1}^{n} \\lambda_{ni} g_i, \\qquad \\|\\bar{g}_n\\|_2 = 1,$$\n\nwhere $\\lambda_{ni} \\ge 0$ is the weight for component $i$ at step $n$, and $g_i$ is the component added at step $i$. To find the next component, we minimize the distance between $\\bar{g}_{n+1}$ and $f$ over choices of $h \\in \\mathcal{H}$ and position $x \\in [0, 1]$ along the $\\bar{g}_n \\to h$ geodesic,\u00b2\n\n$$\\bar{g}_0 = 0, \\qquad g_{n+1}, x^\\star = \\arg\\min_{h \\in \\mathcal{H},\\, x \\in [0,1]} \\left\\| f - \\left( x \\frac{h - \\langle h, \\bar{g}_n \\rangle \\bar{g}_n}{\\|h - \\langle h, \\bar{g}_n \\rangle \\bar{g}_n\\|_2} + \\sqrt{1 - x^2}\\, \\bar{g}_n \\right) \\right\\|_2 = \\arg\\max_{h \\in \\mathcal{H},\\, x \\in [0,1]} x \\left\\langle f, \\frac{h - \\langle h, \\bar{g}_n \\rangle \\bar{g}_n}{\\|h - \\langle h, \\bar{g}_n \\rangle \\bar{g}_n\\|_2} \\right\\rangle + \\sqrt{1 - x^2}\\, \\langle f, \\bar{g}_n \\rangle. \\tag{4}$$\n\nNoting that $h - \\langle h, \\bar{g}_n \\rangle \\bar{g}_n$ is orthogonal to $\\bar{g}_n$, the second term does not depend on $h$, and $x \\ge 0$, we avoid optimizing the weight and component simultaneously and find that\n\n$$g_{n+1} = \\arg\\max_{h \\in \\mathcal{H}} \\left\\langle \\frac{f - \\langle f, \\bar{g}_n \\rangle \\bar{g}_n}{\\|f - \\langle f, \\bar{g}_n \\rangle \\bar{g}_n\\|_2}, \\frac{h - \\langle h, \\bar{g}_n \\rangle \\bar{g}_n}{\\|h - \\langle h, \\bar{g}_n \\rangle \\bar{g}_n\\|_2} \\right\\rangle = \\arg\\max_{h \\in \\mathcal{H}} \\frac{\\langle f - \\langle f, \\bar{g}_n \\rangle \\bar{g}_n, h \\rangle}{\\sqrt{1 - \\langle h, \\bar{g}_n \\rangle^2}}. \\tag{5}$$\n\n\u00b2Note that the arg max may not be unique, and when $\\mathcal{H}$ is infinite it may not exist; Theorem 3 still holds and UBVI works as intended in this case. For simplicity, we use (. . . ) = arg max(. . . ) throughout.\n\nFigure 2: Forward KL divergence, which controls worst-case downstream importance sampling error, and importance-sampling-based covariance estimation error on a task of approximating $\\mathcal{N}(0, A^T A)$, $A_{ij} \\overset{\\text{i.i.d.}}{\\sim} \\mathcal{N}(0, 1)$ with $\\mathcal{N}(0, \\sigma^2 I)$ by minimizing Hellinger, forward KL, and reverse KL, plotted as a function of condition number $\\kappa(A^T A)$. Minimizing Hellinger distance provides significantly lower forward KL divergence and estimation error than minimizing reverse KL.\n\nIntuitively, Eq. (5) attempts to maximize alignment of $g_{n+1}$ with the residual $f - \\langle f, \\bar{g}_n \\rangle \\bar{g}_n$ (the numerator) resulting in a ring of possible solutions, and among these, Eq. (5) minimizes alignment with the current iterate $\\bar{g}_n$ (the denominator). The first form in Eq. (5) provides an alternative intuition: $g_{n+1}$ achieves the maximal alignment of the initial geodesic directions $\\bar{g}_n \\to f$ and $\\bar{g}_n \\to h$ on the sphere. See Fig. 1a for a depiction. After selecting the next component $g_{n+1}$, one option to obtain $\\bar{g}_{n+1}$ is to use the optimal weighting $x^\\star$ from Eq. (4); in practice, however, it is typically the case that solving Eq. (5) is expensive enough that finding the optimal set of coefficients for $\\{g_1, \\dots, g_{n+1}\\}$ is worthwhile. 
This is accomplished by maximizing alignment with $f$ subject to a nonnegativity and unit-norm constraint:\n\n$$(\\lambda_{(n+1)1}, \\dots, \\lambda_{(n+1)(n+1)}) = \\arg\\max_{x \\in \\mathbb{R}^{n+1}} \\left\\langle f, \\sum_{i=1}^{n+1} x_i g_i \\right\\rangle \\quad \\text{s.t.} \\quad x \\ge 0, \\quad x^T Z x \\le 1, \\tag{6}$$\n\nwhere $Z \\in \\mathbb{R}^{(n+1) \\times (n+1)}$ is the matrix with entries $Z_{ij}$ from Eq. (2). Since projection onto the feasible set of Eq. (6) may be difficult, the problem may instead be solved using the dual via\n\n$$(\\lambda_{(n+1)1}, \\dots, \\lambda_{(n+1)(n+1)}) = \\frac{Z^{-1}(\\beta + d)}{\\sqrt{(\\beta + d)^T Z^{-1} (\\beta + d)}}, \\qquad d = (\\langle f, g_1 \\rangle, \\dots, \\langle f, g_{n+1} \\rangle)^T, \\qquad \\beta = \\arg\\min_{b \\in \\mathbb{R}^{n+1},\\, b \\ge 0} b^T Z^{-1} b + 2 b^T Z^{-1} d. \\tag{7}$$\n\nEq. (7) is a nonnegative linear least-squares problem, for which very efficient algorithms are available [48, Ch. 23], in contrast to prior variational boosting methods, where a fully-corrective weight update is a general constrained convex optimization problem. Note that, crucially, none of the above steps rely on knowledge of the normalization constant of $f$.\n\n4 Hellinger distance as a variational objective\n\nWhile the Hellinger distance has most frequently been applied in asymptotic analyses (e.g., [49]), it has seen recent use as a variational objective [50] and possesses a number of particularly useful properties that make it a natural fit for this purpose. First, $D_H(\\cdot, \\cdot)$ applies to any arbitrary pair of densities, unlike $D_{\\mathrm{KL}}(p \\| q)$, which requires that $p \\ll q$. Minimizing $D_H(\\cdot, \\cdot)$ also implicitly minimizes error in posterior probabilities and moments, two quantities of primary importance to practitioners, via its control on total variation and $\\ell$-Wasserstein by Propositions 4 and 5. 
Note that the upper bound in Proposition 4 is typically tighter than that provided by the usual $D_{\\mathrm{KL}}(q \\| p)$ variational objective via Pinsker\u2019s inequality (and at the very least is always in $[0, 1]$), and the bound in Proposition 5 shows that convergence in $D_H(\\cdot, \\cdot)$ implies convergence in up to $\\ell$th moments [51, Theorem 6.9] under relatively weak conditions.\n\nProposition 4 (e.g. [52, p. 61]). The Hellinger distance bounds total variation via\n\n$$D_H^2(p, q) \\le D_{\\mathrm{TV}}(p, q) := \\frac{1}{2} \\|p - q\\|_1 \\le D_H(p, q) \\sqrt{2 - D_H^2(p, q)}.$$\n\nProposition 5. Suppose $\\mathcal{X}$ is a Polish space with metric $d(\\cdot, \\cdot)$, $\\ell \\ge 1$, and $p, q$ are densities with respect to a common measure $\\mu$. Then for any $x_0$,\n\n$$W_\\ell(p, q) \\le 2 D_H(p, q)^{1/\\ell} \\left( \\mathbb{E}\\left[ d(x_0, X)^{2\\ell} \\right] + \\mathbb{E}\\left[ d(x_0, Y)^{2\\ell} \\right] \\right)^{1/(2\\ell)},$$\n\nwhere $Y \\sim p(y)\\mu(dy)$ and $X \\sim q(x)\\mu(dx)$. In particular, if densities $(q_N)_{N \\in \\mathbb{N}}$ and $p$ have uniformly bounded $2\\ell$th moments, $D_H(p, q_N) \\to 0 \\implies W_\\ell(p, q_N) \\to 0$ as $N \\to \\infty$.\n\nOnce a variational approximation $q$ is obtained, it will typically be used to estimate expectations of some function of interest $\\phi(x) \\in L^2(\\mu)$ via Monte Carlo. Unless $q$ is trusted entirely, this involves importance sampling, using $I_n(\\phi)$ or $J_n(\\phi)$ in Eq. (8) depending on whether the normalization of $p$ is known, to account for the error in $q$ compared with the target distribution $p$ [53],\n\n$$I_n(\\phi) := \\frac{1}{N} \\sum_{i=1}^{N} \\frac{p(X_i)}{q(X_i)} \\phi(X_i), \\qquad J_n(\\phi) := \\frac{I_n(\\phi)}{I_n(1)}, \\qquad X_i \\overset{\\text{i.i.d.}}{\\sim} q(x)\\mu(dx). \\tag{8}$$\n\nRecent work has shown that the error of importance sampling is controlled by the intractable forward KL-divergence $D_{\\mathrm{KL}}(p \\| q)$ [54]. 
This is where the Hellinger distance shines; Proposition 6 shows that it penalizes both positive and negative values of $\\log p(x)/q(x)$ and thus provides moderate control on $D_{\\mathrm{KL}}(p \\| q)$, unlike $D_{\\mathrm{KL}}(q \\| p)$, which only penalizes negative values. See Fig. 2 for a demonstration of this effect on the classical correlated Gaussian example [23, Ch. 21]. While the takeaway from this setup is typically that minimizing $D_{\\mathrm{KL}}(q \\| p)$ may cause severe underestimation of variance, a reasonable practitioner should attempt to use importance sampling to correct for this anyway. But Fig. 2 shows that minimizing $D_{\\mathrm{KL}}(q \\| p)$ doesn\u2019t minimize $D_{\\mathrm{KL}}(p \\| q)$ well, leading to poor estimates from importance sampling. Even though minimizing $D_H(p, q)$ also underestimates variance, it provides enough control on $D_{\\mathrm{KL}}(p \\| q)$ so that importance sampling can correct the errors. Direct bounds on the error of importance sampling estimates are also provided in Proposition 7.\n\nProposition 6. Define $R := \\log \\frac{p(X)}{q(X)}$ where $X \\sim p(x)\\mu(dx)$. Then\n\n$$D_H(p, q) \\ge \\frac{1}{2} \\sqrt{ \\mathbb{E}\\left[ R^2 \\left( \\frac{1 + \\mathbb{1}[R \\le 0] R}{1 + R} \\right)^2 \\right] } \\ge \\frac{D_{\\mathrm{KL}}(p \\| q)}{2 \\sqrt{1 + \\mathbb{E}\\left[ \\mathbb{1}[R > 0] (1 + R)^2 \\right]}}.$$\n\nProposition 7. Define $\\alpha := \\left( N^{-1/4} + 2 \\sqrt{D_H(p, q)} \\right)^2$. Then the importance sampling error with known normalization is bounded by\n\n$$\\mathbb{E}\\left[ |I_n(\\phi) - I(\\phi)| \\right] \\le \\|\\sqrt{p}\\, \\phi\\|_2\\, \\alpha,$$\n\nand with unknown normalization by\n\n$$\\forall t > 0 \\qquad \\mathbb{P}\\left( |J_n(\\phi) - I(\\phi)| > \\|\\sqrt{p}\\, \\phi\\|_2\\, t \\right) \\le \\left( 1 + 4 t^{-1} \\sqrt{1 + t} \\right) \\alpha.$$\n\nNext, the Hellinger distance between densities $q, p$ can be estimated with high relative accuracy given samples from $q$, enabling the use of the above bounds in practice. 
This involves computing either $\\widehat{D_H^2}(p, q)$ or $\\widetilde{D_H^2}(p, q)$ below, depending on whether the normalization of $p$ is known. The expected error of both of these estimates relative to $D_H(p, q)$ is bounded via Proposition 8.\n\n$$\\widehat{D_H^2}(p, q) := 1 - \\frac{1}{N} \\sum_{n=1}^{N} \\sqrt{\\frac{p(X_n)}{q(X_n)}}, \\qquad \\widetilde{D_H^2}(p, q) := 1 - \\frac{\\frac{1}{N} \\sum_{n=1}^{N} \\sqrt{\\frac{p(X_n)}{q(X_n)}}}{\\sqrt{\\frac{1}{N} \\sum_{n=1}^{N} \\frac{p(X_n)}{q(X_n)}}}, \\qquad X_n \\overset{\\text{i.i.d.}}{\\sim} q(x)\\mu(dx). \\tag{9}$$\n\nProposition 8. The mean absolute difference between the Hellinger squared estimates is\n\n$$\\mathbb{E}\\left[ \\left| \\widehat{D_H^2}(p, q) - D_H(p, q)^2 \\right| \\right] \\le D_H(p, q) \\frac{\\sqrt{2 - D_H(p, q)^2}}{\\sqrt{N}},$$\n\nwith a corresponding bound on $\\mathbb{E}\\left[ \\left| \\widetilde{D_H^2}(p, q) - D_H(p, q)^2 \\right| \\right]$ [the right-hand side of this second inequality is not recoverable from the extracted text].\n\nIt is worth pointing out that although the above statistical properties of the Hellinger distance make it well-suited as a variational objective, it does pose computational issues during optimization. In particular, to avoid numerically unstable gradient estimation, one must transform Hellinger-based objectives such as Eq. (5). This typically produces biased and occasionally higher-variance Monte Carlo gradient estimates than the corresponding KL gradient estimates. We detail these transformations and other computational considerations in Appendix A.\n\n[Figure 3 panels: (a) Cauchy density; (b) Cauchy log density; (c) Cauchy KL divergence; (d) Banana density; (e) Banana log marginal densities; (f) Banana KL divergence]\n\nFigure 3: Results on the Cauchy and banana distributions; all subfigures use the legend from Fig. 3a. (Figs. 3a and 3d): Density approximation with 30 components for Cauchy (3a) and banana (3d). 
BVI has degenerate component optimizations after the first, while UBVI and BVI+ are able to refine the approximation. (Figs. 3b and 3e): Log density approximations for Cauchy (3b) and banana marginals (3e). UBVI provides more accurate approximation of distribution tails than the KL-based BVI(+) algorithms. (Figs. 3c and 3f): The forward KL divergence vs. the number of boosting components and computation time. UBVI consistently improves its approximation as more components are added, while the KL-based BVI(+) methods improve either slowly or not at all due to degeneracy. Solid lines / dots indicate median, and dashed lines / whiskers indicate 25th and 75th percentile.\n\n5 Experiments\n\nIn this section, we compare UBVI, KL divergence boosting variational inference (BVI) [28], BVI with an ad-hoc stabilization in which $q_n$ in Eq. (1) is replaced by $q_n + 10^{-3}$ to help prevent degeneracy (BVI+), and standard VI. For all experiments, we used a regularization schedule of $r_n = 1/\\sqrt{n}$ for BVI(+) in Eq. (1). We used the multivariate Gaussian family for $\\mathcal{H}$ parametrized by mean and log-transformed diagonal covariance matrix. We used 10,000 iterations of ADAM [55] for optimization, with decaying step size $1/\\sqrt{1 + i}$ and Monte Carlo gradients based on 1,000 samples. Fully-corrective weight optimization was conducted via simplex-projected SGD for BVI(+) and nonnegative least squares for UBVI. Monte Carlo estimates of $\\langle f, g_n \\rangle$ in UBVI were based on 10,000 samples. Each component optimization was initialized from the best of 10,000 trials of sampling a component (with mean $m$ and covariance $\\Sigma$) from the current mixture, sampling the initialized component mean from $\\mathcal{N}(m, 16\\Sigma)$, and setting the initialized component covariance to $\\exp(Z)\\Sigma$, $Z \\sim \\mathcal{N}(0, 1)$. Each experiment was run 20 times with an Intel i7 8700K processor and 32GB of memory. 
Code is available at www.github.com/trevorcampbell/ubvi.\n\n5.1 Cauchy and banana distributions\n\nFig. 3 shows the results of running UBVI, BVI, and BVI+ for 30 boosting iterations on the standard univariate Cauchy distribution and the banana distribution [56] with curvature $b = 0.1$. These distributions were selected for their heavy tails and complex structure (shown in Figs. 3b and 3e), two features that standard variational inference does not often address but boosting methods should handle. However, BVI particularly struggles with heavy-tailed distributions, where its component optimization objective after the first is degenerate. BVI+ is able to refine its approximation, but still cannot capture heavy tails well, leading to large forward KL divergence (which controls downstream importance sampling error). We also found that the behaviour of BVI(+) is very sensitive to the choice of regularization schedule $r_n$, and is difficult to tune well. UBVI, in contrast, approximates both heavy-tailed and complex distributions well with few components, and involves no tuning effort beyond the component optimization step size.\n\n5.2 Logistic regression with a heavy-tailed prior\n\nFig. 4 shows the results of running 10 boosting iterations of UBVI, BVI+, and standard VI for posterior inference in Bayesian logistic regression. We used a multivariate $T_2(\\mu, \\Sigma)$ prior, where in each trial, the prior parameters were set via $\\mu = 0$ and $\\Sigma = A^T A$ for $A_{ij} \\overset{\\text{i.i.d.}}{\\sim} \\mathcal{N}(0, 1)$. We ran this experiment on a 2-dimensional synthetic dataset generated from the model, a 10-dimensional chemical reactivity dataset, and a 10-dimensional phishing websites dataset, each with 20 subsampled datapoints.\u00b3 The small dataset size and heavy-tailed prior were chosen to create a complex posterior structure better-suited to evaluating boosting variational methods. The results in Fig. 4 are similar to those in the synthetic test from Section 5.1; UBVI is able to refine its posterior approximation as it adds components without tuning effort, while the KL-based BVI+ method is difficult to tune well and does not reliably provide better posterior approximations than standard VI. BVI (no stabilization) is not shown, as its component optimizations after the first are degenerate and it reduces to standard VI.\n\n\u00b3 Real datasets available online at http://komarix.org/ac/ds/ and https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.\n\n[Figure 4 panels: (a) Synthetic; (b) Chemical Reactivity; (c) Phishing]\n\nFigure 4: Results from Bayesian logistic regression posterior inference on the synthetic (4a), chemical (4b), and phishing (4c) datasets, showing the energy distance [57] to the posterior (via NUTS [58]) vs. the number of components and CPU time. Energy distance and time are normalized by the VI median. Solid lines / dots indicate the median, and dashed lines / whiskers indicate the 25th / 75th percentiles.\n\n6 Conclusion\n\nThis paper developed universal boosting variational inference (UBVI). UBVI optimizes the Hellinger metric, avoiding the degeneracy, tuning, and difficult joint component/weight optimizations of other gradient-based BVI methods, while simplifying fully-corrective weight optimizations. Theoretical guarantees on the convergence of Hellinger distance provide an MCMC-like computation/quality tradeoff, and experimental results demonstrate the advantages over previous variational methods.\n\n7 Acknowledgments\n\nT. Campbell and X. Li are supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant and an NSERC Discovery Launch Supplement.\n\nReferences\n\n[1] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. 
Handbook of Markov chain Monte Carlo. CRC Press, 2011.\n\n[2] Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin. Bayesian data analysis. CRC Press, 3rd edition, 2013.\n\n[3] Michael Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183\u2013233, 1999.\n\n[4] Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1\u20132):1\u2013305, 2008.\n\n[5] Gareth Roberts and Jeffrey Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20\u201371, 2004.\n\n[6] Pierre Jacob, John O\u2019Leary, and Yves Atchad\u00e9. Unbiased Markov chain Monte Carlo with couplings. arXiv:1708.03625, 2017.\n\n[7] Jackson Gorham and Lester Mackey. Measuring sample quality with Stein\u2019s method. In Advances in Neural Information Processing Systems, 2015.\n\n[8] Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. In International Conference on Machine Learning, 2016.\n\n[9] Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, 2016.\n\n[10] R\u00e9mi Bardenet, Arnaud Doucet, and Chris Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18:1\u201343, 2017.\n\n[11] Steven Scott, Alexander Blocker, Fernando Bonassi, Hugh Chipman, Edward George, and Robert McCulloch. Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11:78\u201388, 2016.\n\n[12] Michael Betancourt. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. 
In International Conference on Machine Learning, 2015.\n\n[13] Matthew Hoffman, David Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303\u20131347, 2013.\n\n[14] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18:1\u201345, 2017.\n\n[15] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, 2014.\n\n[16] Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.\n\n[17] At\u0131l\u0131m G\u00fcne\u015f Baydin, Barak Pearlmutter, Alexey Radul, and Jeffrey Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1\u201343, 2018.\n\n[18] Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their variational approximations. The Annals of Statistics, 2018 (to appear).\n\n[19] Yixin Wang and David Blei. Frequentist consistency of variational Bayes. Journal of the American Statistical Association, 0(0):1\u201315, 2018.\n\n[20] Yun Yang, Debdeep Pati, and Anirban Bhattacharya. \u03b1-variational inference with statistical guarantees. The Annals of Statistics, 2018 (to appear).\n\n[21] Badr-Eddine Ch\u00e9rief-Abdellatif and Pierre Alquier. Consistency of variational Bayes inference for estimation and model selection in mixtures. Electronic Journal of Statistics, 12:2995\u20133035, 2018.\n\n[22] Guillaume Dehaene and Simon Barthelm\u00e9. Expectation propagation in the large data limit. Journal of the Royal Statistical Society, Series B, 80(1):199\u2013217, 2018.\n\n[23] Kevin Murphy. Machine learning: a probabilistic perspective. 
The MIT Press, 2012.\n\n[24] Fangjian Guo, Xiangyu Wang, Kai Fan, Tamara Broderick, and David Dunson. Boosting variational inference. In Advances in Neural Information Processing Systems, 2016.\n\n[25] Xiangyu Wang. Boosting variational inference: theory and examples. Master\u2019s thesis, Duke University, 2016.\n\n[26] Andrew Miller, Nicholas Foti, and Ryan Adams. Variational boosting: iteratively refining posterior approximations. In International Conference on Machine Learning, 2017.\n\n[27] Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, and Gunnar R\u00e4tsch. Boosting variational inference: an optimization perspective. In International Conference on Artificial Intelligence and Statistics, 2018.\n\n[28] Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, and Gunnar R\u00e4tsch. Boosting black box variational inference. In Advances in Neural Information Processing Systems, 2018.\n\n[29] Tommi Jaakkola and Michael Jordan. Improving the mean field approximation via the use of mixture distributions. In Learning in graphical models, pages 163\u2013173. Springer, 1998.\n\n[30] O. Zobay. Variational Bayesian inference with Gaussian-mixture approximations. Electronic Journal of Statistics, 8:355\u2013389, 2014.\n\n[31] Samuel Gershman, Matthew Hoffman, and David Blei. Nonparametric variational inference. In International Conference on Machine Learning, 2012.\n\n[32] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.\n\n[33] Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065\u20131076, 1962.\n\n[34] Robert Schapire. The strength of weak learnability. Machine Learning, 5(2):197\u2013227, 1990.\n\n[35] Tong Zhang. Sequential greedy approximation for certain convex optimization problems. 
IEEE Transactions on Information Theory, 49(3):682\u2013691, 2003.\n\n[36] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95\u2013110, 1956.\n\n[37] Martin Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In International Conference on Machine Learning, 2013.\n\n[38] Scott Chen, David Donoho, and Michael Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129\u2013159, 2001.\n\n[39] Yoav Freund and Robert Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119\u2013139, 1997.\n\n[40] Andrew Barron, Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Approximation and learning by greedy algorithms. The Annals of Statistics, 36(1):64\u201394, 2008.\n\n[41] Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Uncertainty in Artificial Intelligence, 2010.\n\n[42] St\u00e9phane Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397\u20133415, 1993.\n\n[43] Sheng Chen, Stephen Billings, and Wan Luo. Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50(5):1873\u20131896, 1989.\n\n[44] Scott Chen, David Donoho, and Michael Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129\u2013159, 1999.\n\n[45] Joel Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231\u20132242, 2004.\n\n[46] Trevor Campbell and Tamara Broderick. 
Bayesian coreset construction via greedy iterative geodesic ascent. In International Conference on Machine Learning, 2018.\n\n[47] Kaare Petersen and Michael Pedersen. The Matrix Cookbook. Online: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf, 2012.\n\n[48] Charles Lawson and Richard Hanson. Solving least squares problems. Society for Industrial and Applied Mathematics, 1995.\n\n[49] Subhashis Ghosal, Jayanta Ghosh, and Aad van der Vaart. Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500\u2013531, 2000.\n\n[50] Yingzhen Li and Richard Turner. R\u00e9nyi divergence variational inference. In Advances in Neural Information Processing Systems, 2016.\n\n[51] C\u00e9dric Villani. Optimal transport: old and new. Springer, 2009.\n\n[52] David Pollard. A user\u2019s guide to probability theory. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, 7th edition, 2002.\n\n[53] Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman. Yes, but did it work? Evaluating variational inference. In International Conference on Machine Learning, 2018.\n\n[54] Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099\u20131135, 2018.\n\n[55] Diederik Kingma and Jimmy Ba. ADAM: a method for stochastic optimization. In International Conference on Learning Representations, 2015.\n\n[56] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive Metropolis algorithm. Bernoulli, pages 223\u2013242, 2001.\n\n[57] G\u00e1bor Szekely and Maria Rizzo. Energy statistics: statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249\u20131272, 2013.\n\n[58] Matthew Hoffman and Andrew Gelman. The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. 
Journal of Machine Learning Research, 15:1351\u20131381, 2014.\n\n[59] Trevor Campbell and Tamara Broderick. Automated scalable Bayesian inference via Hilbert coresets. Journal of Machine Learning Research, 20(15):1\u201338, 2019.\n", "award": [], "sourceid": 1914, "authors": [{"given_name": "Trevor", "family_name": "Campbell", "institution": "UBC"}, {"given_name": "Xinglong", "family_name": "Li", "institution": "The University of British Columbia"}]}