{"title": "Boosting Black Box Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 3401, "page_last": 3411, "abstract": "Approximating a probability density in a tractable manner is a central task in Bayesian statistics. Variational Inference (VI) is a popular technique that achieves tractability by choosing a relatively simple variational approximation. Borrowing ideas from the classic boosting framework, recent approaches attempt to \\emph{boost} VI by replacing the selection of a single density with an iteratively constructed mixture of densities. In order to guarantee convergence, previous works impose stringent assumptions that require significant effort for practitioners. Specifically, they require a custom implementation of the greedy step (called the LMO) for every probabilistic model with respect to an unnatural variational family of truncated distributions. Our work fixes these issues with novel theoretical and algorithmic insights. On the theoretical side, we show that boosting VI satisfies a relaxed smoothness assumption which is sufficient for the convergence of the functional Frank-Wolfe (FW) algorithm. Furthermore, we rephrase the LMO problem and propose to maximize the Residual ELBO (RELBO) which replaces the standard ELBO optimization in VI. These theoretical enhancements allow for black box implementation of the boosting subroutine. Finally, we present a stopping criterion drawn from the duality gap in the classic FW analyses and exhaustive experiments to illustrate the usefulness of our theoretical and algorithmic contributions.", "full_text": "Boosting Black Box Variational Inference\n\nFrancesco Locatello\u21e41,2, Gideon Dresdner\u21e42, Rajiv Khanna3, Isabel Valera1, and Gunnar R\u00e4tsch2\n\n2Dept. for Computer Science, ETH Zurich, Universit\u00e4tsstrasse 6, 8092 Zurich, Switzerland.\n\n1Max-Planck Institute for Intelligent Systems, Germany\n\n3The University of Texas at Austin, USA\n\nAbstract\n\nApproximating a probability density in a tractable manner is a central task in\nBayesian statistics. Variational Inference (VI) is a popular technique that achieves\ntractability by choosing a relatively simple variational approximation. Borrowing\nideas from the classic boosting framework, recent approaches attempt to boost\nVI by replacing the selection of a single density with an iteratively constructed\nmixture of densities. In order to guarantee convergence, previous works impose\nstringent assumptions that require signi\ufb01cant effort for practitioners. Speci\ufb01cally,\nthey require a custom implementation of the greedy step (called the LMO) for every\nprobabilistic model with respect to an unnatural variational family of truncated\ndistributions. Our work \ufb01xes these issues with novel theoretical and algorithmic\ninsights. On the theoretical side, we show that boosting VI satis\ufb01es a relaxed\nsmoothness assumption which is suf\ufb01cient for the convergence of the functional\nFrank-Wolfe (FW) algorithm. Furthermore, we rephrase the LMO problem and\npropose to maximize the Residual ELBO (RELBO) which replaces the standard\nELBO optimization in VI. These theoretical enhancements allow for black box\nimplementation of the boosting subroutine. Finally, we present a stopping criterion\ndrawn from the duality gap in the classic FW analyses and exhaustive experiments\nto illustrate the usefulness of our theoretical and algorithmic contributions.\n\n1\n\nIntroduction\n\nApproximating probability densities is a core problem in Bayesian statistics, where inference translates\nto the computation of a posterior distribution. Posterior distributions depend on the modeling\nassumptions and can rarely be computed exactly. Variational Inference (VI) is a technique to\napproximate posterior distributions through optimization. It involves choosing a set of tractable\ndensities, a.k.a. variational family, out of which the \ufb01nal approximation is to be chosen. The\napproximation is done by selecting a density in the candidate set that is close to the true posterior in\nterms of Kullback-Leibler (KL) divergence [1]. There is an inherent trade-off involved in \ufb01xing the\nset of tractable densities. Increasing the capacity of the variational family to approximate the posterior\nalso increases the complexity of the optimization problem. Consider a degenerate case where the\nvariational family contains just a single density. The optimization problem is trivial and runs in\nconstant time, but the quality of the solution is poor and stands in no relation to the true posterior.\nThis contrived example is clearly too restrictive, and in practice, the mean \ufb01eld approximation offers\na good trade-off between expressivity and tractability [2]. However, in many real-world applications,\nmean \ufb01eld approximations are lacking in their ability to accurately approximate the posterior.\nImagine a practitioner that, after designing a Bayesian model and using a VI algorithm to approximate\nthe posterior, \ufb01nds that the approximation is too poor to be useful. Standard VI does not give the\n\n*Authors contributed equally\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fpractitioner the option to trade additional computational cost for a better approximation. As a result,\nthere have been several efforts to ramp up the representative capacity of the variational family while\nmaintaining tractability.\nOne line of work in this direction involves replacing the simple mean \ufb01eld family by a mixture\nof Gaussians. It is known that mixtures of Gaussians can approximate any distribution arbitrarily\nclosely [18]. Boosting is a practical approach to \ufb01nding the optimal approximating mixture and\ninvolves adding components to the mixture greedily one at a time [5, 13, 16]. Not only is boosting a\npractical solution, it also has well-studied trade-off bounds for number of iterations vs. approximation\nquality [13] by virtue of being essentially a variant of the classical Frank-Wolfe (FW) algorithm [8, 13].\nUnfortunately, these greedy algorithms require a specialized, restricted variational family to ensure\nconvergence and therefore a white box implementation of the boosting subroutine. These restrictions\ninclude that (a) each potential component of the mixture has a bounded support i.e., truncated\ndensities, and (b) the subroutine should not return degenerate distributions. These assumptions\nrequire specialized care during implementation, and therefore, one cannot simply take existing VI\nsolvers and boost them. This makes boosting VI unattractive for practitioners. In this work, we \ufb01x\nthis issue by proposing a boosting black box VI algorithm that has many practical bene\ufb01ts.\nOur work presents several key algorithmic and theoretical contributions, which we summarize below:\n\u2022 We relax the previously known conditions for guaranteed convergence from smoothness to\nbounded curvature. As a consequence, the set of candidate densities no longer needs to be\ntruncated, thereby easing its implementation and improving on the convergence guarantees.\n\u2022 We propose a modi\ufb01ed form of the ELBO optimization, the Residual ELBO (RELBO),\nwhich guarantees that the selected density is non-degenerate and is suitable for black box\nsolvers (e.g. black box variational inference [19]).\n\nany boosting VI algorithm.\n\n\u2022 We propose a novel stopping criterion using the duality gap from FW, which is applicable to\nIn addition to these theoretical contributions, we present extensive simulated and real-world empirical\nexperiments to show the applicability of our proposed algorithmic extensions.\nWhile our work is motivated by applications to VI, our theoretical results have a more general\nimpact. We essentially analyze the application of the functional FW algorithm to the general task of\nminimizing the KL-divergence over a space of probability densities.\n\n1.1 Related work\n\nThere is a vast literature on VI. We refer to [1] for a thorough review. Our focus is to use boosting\nto increase the complexity of a density, similar to the goal of Normalizing Flows [20], MCMC-VI\nhybrid methods [22, 21], or distribution transformations [23]. Our approach is in line with several\nprevious approaches using mixtures of distributions to improve the expressiveness of the variational\napproximation [9, 6] but goes further to draw connections with classic optimization to obtain several\nnovel theoretical and algorithmic insights.\nWhile boosting has been well studied in other settings [15], it has only recently been applied to the\nproblem of VI. Related works of [5] and [16] developed the algorithmic framework and conjectured\na possible convergence rate of O(1/T ) but without theoretical analyses. The authors in [5] identify\nthe need of truncated densities to ensure smoothness of the KL cost function. A more recent\nwork [13] provides a theoretical base for analyzing the algorithm. They identify the suf\ufb01cient\nconditions for guaranteeing convergence and provide explicit constants to the conjectured O(1/T )\nrate. Unfortunately, these suf\ufb01cient conditions amount to restrictive assumptions about the variational\nfamily and therefore require the practitioner to have white box access to the variational family and\nunderlying VI algorithm. In this work, we remove these assumptions to allow use of black box VI\nmethods.\nOur analysis is mainly based on the FW algorithm [8], which is a commonly used algorithm for\nprojection free constrained optimization. The convergence rates and requisite assumptions are\nwell studied in various settings [12, 11, 14]. Its applications include non-Euclidean spaces, e.g., a\nvariational objective for approximate marginal inference over the marginal polytope [10].\nThe rest of the paper is organized as follows. We begin by introducing and formalizing the boosting VI\nframework in Section 2. In Section 3, we review and analyze the Functional FW algorithm to greedily\n\n2\n\n\fsolve the boosting VI. In Section 4, we \ufb01rst propose RELBO, an alternative of the contemporary\nELBO optimization to implement a black box LMO (linear minimization oracle). Then, we propose\na duality gap based stopping criterion for boosting VI algorithms. Finally, experimental evaluation is\npresented in Section 5. We refer the reader to the appendix for all proofs.\nNotation. We represent vectors by small letters bold, (e.g. x) and matrices by capital bold, (e.g. X).\nFor a non-empty subset A of some Hilbert space H, let conv(A) denote its convex hull. A is often\ncalled atom set in the literature, and its elements are called atoms. The support of a density function q\nis a measurable set denoted by capital letters sans serif i.e., Z. The inner product between two density\nfunctions p, q : Z ! R in L2 is de\ufb01ned as hp, qi :=RZ p(z)q(z)dz.\n\n2 Variational inference and boosting\n\nBayesian inference involves computing the posterior distribution given a model and the data. More\nformally, we choose a distribution for our observations x given unobserved latent variables z, called\nthe likelihood p(x|z), and a prior distribution over the latent variables, p(z). Our goal is to infer\nthe posterior, p(z|x) [1]. Bayes theorem relates these three distributions by expressing the posterior\nas equal to the product of prior and likelihood divided by the normalization constant, p(x). The\nposterior is often intractable because the normalization constant p(x) =RZ p(x|z)p(z)dz requires\nintegrating over the full latent variable space.\nThe goal of VI is to \ufb01nd a tractable approximation q(z) of p(z|x). From an optimization viewpoint,\none can think of the posterior as an unknown function p(z|x) : Z ! R+\n>0 where Z is a measurable\nset. The task of VI is to \ufb01nd the best approximation, in terms of KL divergence, to this unknown\nfunction within a family of tractable distributions Q. Therefore, VI can be written as the following\noptimization problem:\n\nmin\nq(z)2Q\n\nDKL(q(z)kp(z|x)).\n\n(1)\n\nIt should be obvious that the quality of the approximation directly depends on the expressivity of the\nfamily Q. However, as we increase the complexity of Q, the optimization problem (1) also becomes\nmore complex.\nRather than optimizing the objective (1) which requires access to an unknown function p(z|x) and is\ntherefore not computable, VI equivalently maximizes instead the so-called Evidence Lower BOund\n(ELBO) [1]:\n\n(2)\nA recent line of work [5, 16, 13] aims at replacing Q with conv(Q) thereby expanding the capacity\nof the variational approximation to the class of mixtures of the base family Q:\n\nEq [log q(z)] + Eq [log p(x, z)] .\n\nmin\n\nq(z)2conv(Q)\n\nDKL(q(z)||p(x, z)).\n\n(3)\n\nThe boosting approach to this problem consists of specifying an iterative procedure, in which the\nproblem (3) is solved via the greedy combination of solutions from simpler surrogate problems.\nThis approach was \ufb01rst proposed in [5], and its connection to the FW algorithm was studied in [13].\nContrary to the boosting approaches in the deep generative modeling literature initiated by [24],\nboosting VI does not enjoy a simple and elegant subproblem as we discuss in Section 3.1. Next, we\nshow how to tackle (3) from a formal and yet very practical optimization perspective.\n\n3 Functional Frank-Wolfe for boosting variational inference\n\nTaking a step back from the problem (3), we \ufb01rst de\ufb01ne the general optimization problem and the\nrelevant quantities needed for proving the convergence of FW. Then, we present the extension to\nboosting black box VI.\nWe start with the general optimization problem:\n\nmin\n\nq2conv(A)\n\nf (q).\n\n3\n\n(4)\n\n\fwhere A \u21e2 H is closed and bounded and f : conv(A)! R is a convex functional with bounded\ncurvature over its domain. Here the curvature is de\ufb01ned as in [8]:\n\nCf,A :=\n\nsup\n\ns2A, q2conv(A)\ny=q+(sq)\n\n2[0,1]\n\n2\n2 D(y, q),\n\n(5)\n\nwhere\n\nD(y, q) := f (y)  f (q)  hy  q,rf (q)i.\n\nt+2\n\nIt is known that if rf is Lipschitz continuous with constant L (often referred to as L-smoothness)\nover conv(A), then Cf,A \uf8ff L diam(A)2 where diam(A) := maxp,q2A ||p  q||2 [8].\nThe FW algorithm is depicted in Algo-\nAlgorithm 1 Af\ufb01ne Invariant Frank-Wolfe\nrithm 1. Note that Algorithm 2 in [5]\n1: init q0 2 conv(A), S := {q0}, and accuracy  > 0\nis a variant of Algorithm 1. In each\n2: for t = 0 . . . T\niteration, the FW algorithm queries a\nso-called Linear Minimization Oracle\n3:\n(LMO) which solves the following in-\n4:\nner optimization problem:\n5:\n6:\nLMOA(y) := arg min\n7:\n8:\n9: end for\n\nFind st := (Approx-)LMOA(rf (qt))\nVariant 0:  = 2\nVariant 1:  = arg min2[0,1] f ((1  )qt + st)\nqt+1 := (1  )qt + st\nVariant 2: S = S [ st\n\nfor a given y 2 H and A \u21e2 H. To\ntackle the constrained convex mini-\nmization problem in Eq. (4), Frank-\nWolfe iteratively solves a linear constrained problem where, at iteration t, the function f is replaced\nby its \ufb01rst-order Taylor approximation around the current iterate qt. It is easy to see that the so-\nlution of this problem can be obtained by querying LMO(rf (qt)). Indeed, the following holds:\narg mins2conv(A) f (qt) + hrf (qt), s  qti = arg mins2conv(A)hrf (qt), si =: LMO(rf (qt)).\nDepending on A, computing an exact solution of (6) can be hard in practice. This motivates the\napproximate LMO which returns an approximate minimizer \u02dcs of (6) for some accuracy parameter \nand the current iterate qt such that:\n\nqt+1 = arg minq2conv(S) f (q)\n\ns2A hy, si,\n\n(6)\n\nhy, \u02dcs  qti \uf8ff  min\n\ns2Ahy, s  qti\n\n(7)\n\nWe discuss a simple algorithm to implement the LMO in Section 4. Finally, we \ufb01nd that Algorithm 1\nis known to converge sublinearly to the minimizer q? of (4) with the following rate:\nTheorem 1 ([8]). Let A \u21e2 H be a closed and bounded set and let f : H! R be a convex function\nwith bounded curvature Cf,A over A. Then, the Af\ufb01ne Invariant FW algorithm (Algorithm 1)\nconverges for t  0 as\n\nf (qt)  f (q?) \uf8ff\n\n2 1\n Cf,A + \"0\n\nt + 2\n\nwhere \"0 := f (q0)  f (q?) is the initial error in objective, and  2 (0, 1] is the accuracy parameter\nof the approximate LMO.\n\nDiscussion. Theorem 1 has several implications for boosting VI. First, the LMO problem does not\nneed to be solved exactly in order to guarantee convergence. Second, Theorem 1 guarantees that\nAlgorithm 1 converges to the best approximation in conv(A) which, depending on the expressivity of\nthe base family, could even contain the full posterior. Furthermore, the theorem gives a convergence\nrate which states that, in order to achieve an error of \u270f, we need to perform O( 1\nSimilar discussions are also presented by [13]. The crucial question, which they leave unaddressed,\nis whether under their assumptions there exists a variational family of densities which (a) is expres-\nsive enough to well-approximate the posterior; (b) satis\ufb01es the conditions required to guarantee\nconvergence; and (c) allows for ef\ufb01cient implementation of the LMO.\n\n\u270f ) iterations.\n\n3.1 Curvature of boosting variational inference\nIn order to boost VI using FW in practice, we need to ensure that the assumptions are not violated.\nAssume that A \u21e2 Q is the set of probability density functions with compact parameter space as well\n\n4\n\n\fas bounded in\ufb01nity norm and L2 norm. These assumptions on the search space are easily justi\ufb01ed\nsince it is reasonable to assume that the posterior is not degenerate (bounded in\ufb01nity norm) and has\nmodes that are not arbitrarily far away from each other (compactness). Under these assumptions, the\noptimization domain is closed and bounded. It is simple to show that the solution of the LMO problem\nover conv(A) is an element of A. Therefore, A is closed. The troublesome condition that needs to\nbe satis\ufb01ed for the convergence of FW is smoothness. Indeed, the work of [5] already recognized\nthat the boosting literature typically require a smooth objective and showed that densities bounded\naway from zero are suf\ufb01cient. [13] showed that the necessary condition to achieve smoothness is that\nthe approximation be not arbitrarily far from the optimum. They argue that while this is a practical\nassumption, the general theory would require truncated densities. We relax this assumption. As per\nTheorem 1, a bounded curvature is actually suf\ufb01cient to guarantee convergence. This condition is\nweaker than smoothness, which was assumed by [13, 5]. For the KL divergence, the following holds.\nTheorem 2. Cf,A is bounded for the KL divergence if the parameter space of the densities in A is\nbounded.\n\nThe proof is provided in Appendix A.\nDiscussion. Surprisingly, a bounded curvature for the DKL can be obtained as long as:\n\n2\n2 DKL(ykq)\n\nsup\n\ns2A, q2conv(A)\ny=q+(sq)\n\n2[0,1]\n\nis bounded. The proof sketch proceeds as follows. For any pair s and q, we need to check that\n2 DKL(ykq) is bounded as a function of  2 [0, 1]. The two limit points, DKL(skq) for  = 1\n2\n2 for  = 0, are both bounded for any choice of s and q. Hence, the Cf,A is bounded\nand ks  qk2\nas it is a continuous function of  in [0, 1] with bounded function values at the extreme points.\nDKL(skq) is bounded because the parameter space is bounded. ks  qk2\n2 is bounded by the triangle\ninequality and bounded L2 norm of the elements of A. This result is particularly relevant, as it\nmakes the complicated truncation described in [13] unnecessary without any additional assumption.\nIndeed, while a bounded parameter space was already assumed in [13] and is a practical assumption,\ntruncation is tedious to implement. Note that [5] considers full densities as an approximation of the\ntruncated one. They also argue that the theoretically grounded family of distributions for boosting\nshould contain truncated densities. Avoiding truncation has another very important consequence for\nthe optimization. Indeed, [13] proves convergence of boosting VI only to a truncated version of the\nposterior. Therefore, Theorem 8 in [13] contains a term that does not decrease when the number of\niteration increases. While this term could be small, as it contains the error of the approximation on\nthe tails, it introduces a crucial hyperparameter in the algorithm i.e., where to truncate. Instead, we\nhere show that under much weaker assumptions on the set A, it is possible to converge to the true\nposterior.\n\n4 The residual ELBO: implementing a black box LMO\nNote that the LMO is a constrained linear problem in a function space. A complicated heuristic\nis developed in [5] to deal with the fact that the unconstrained linear problem they consider has a\ndegenerate solution. The authors of [13] propose to use projected stochastic gradient descent on the\nparameters of s. The problem with this is that, to the best of our knowledge, the convergence of\nprojected stochastic gradient descent is not yet understood in this setting. To guarantee convergence\nof the FW procedure, it is crucial to make sure that the solution of the LMO is not a degenerate\ndistribution. This translates to a constraint on the in\ufb01nity norm of s. Such a constraint is hardly\npractical. Indeed, one must be able to compute the maximum value of s as a function of its parameters\nwhich depends on the particular choice of A. In contrast, the entropy is a general term that can be\napproximated via sampling and therefore allows for black box computation. We relate in\ufb01nity norm\nand entropy in the following lemma.\nLemma 3. A density with bounded in\ufb01nity norm has entropy bounded from below. The converse is\ntrue for many of the distributions which are commonly used in VI (for example Gaussian, Cauchy\nand Laplace).\n\nThe proof is provided in Appendix A.\n\n5\n\n\fIn general, bounded entropy does not always imply bounded in\ufb01nity norm. While this is precisely the\nstatement we need, a simple veri\ufb01cation is suf\ufb01cient to show that it holds in most cases of interest.\nWe assume that A is a family for which bounded entropy implies bounded in\ufb01nity norm. Therefore,\nwe can constrain the optimization problem with the entropy instead of the in\ufb01nity norm. We call \u00afA\nthe family A without the in\ufb01nity norm constraint. At every iteration, we need to solve:\n\narg min\n\nH(s)M\u2327s, log\u2713 qt\np\u25c6\n\ns2 \u00afA\n\nNote that this is simply the LMO from Equation (6) with y = rqDKL(qtkp) = log qt\np but with an\nadditional constraint on the entropy. This constraint on the entropy is crucial since otherwise the\nsolution of the LMO would be a degenerate distribution as the authors of [5] have also argued.\nWe now replace this problem with its regularized form using Lagrange multipliers and solve for s\ngiven a \ufb01xed value of :\n\narg min\n\ns2 \u00afA \u2327s, log\u2713 qt\n\np\u25c6 +  (H(s)  M ) = arg min\n\n(8)\n\n= arg min\n\n= arg min\n\np\n\ns2 \u00afA \u2327s, log\u2713 qt\np\u25c6 + hs, log si\ns2 \u00afA *s, log s\nqt!+\n1A+ .\ns2 \u00afA *s, log0@ s\nq p\nr p\n\nqt ),\n\nqt\n\nTherefore, the regularized LMO problem is equivalent to the following minimization problem:\n\nwhere Z is the normalization constant of q p\n\nwe call the Residual Evidence Lower Bound (RELBO) as:\n\n1\nZ\n\narg min\n\ns2 \u00afA\n\nDKL(sk\nqt . From this optimization problem, we can write what\n\nRELBO(s, ) := Es[log p]  Es[log s]  Es[log qt].\n\n(9)\n\nDiscussion. Let us now analyze the RELBO and compare it with the ELBO in standard VI [1]. First,\nnote that we introduce the hyperparameter  which controls the weight of the entropy. In order to\nobtain the true LMO solution, one would need to maximize the LHS of Equation (8) for  and solve\nthe saddle point problem. In light of the fact that an approximate solution is suf\ufb01cient for convergence,\nwe consider the regularized problem as a simple heuristic. One can then \ufb01x an arbitrary value for\n or decrease it when t increases. The latter amounts to allowing increasingly sharp densities as\noptimization proceeds. The other important difference between ELBO and RELBO is the residual\nterm which is expressed through Es[log p]  Es[log qt]. Maximizing this term amounts to looking\nfor a density with low cross entropy with the joint p and high cross entropy with the current iterate\nqt. In other words, the next component st needs to be as close as possible to the target p but also\nsuf\ufb01ciently different from the current approximation qt. Indeed, st should capture the aspects of the\nposterior that the current mixture could not approximate yet.\nFailure Modes. Using a black box VI as an implementation for the LMO represents an attractive\npractical solution. Indeed, one could just run VI once and, if the result is not good enough, run it\nagain on the residual without changing the structure of the implementation. Unfortunately, there are\ntwo failure modes that should be discussed. First, if the target posterior is a perfectly symmetric\nmultimodal distribution, then the residual is also symmetric and the algorithm may get stuck. A\nsimple solution to this problem is to run the black box VI for fewer iterations, breaking the symmetry\nof the residual. The second problem arises in scenarios where the posterior distribution can be\napproximated well by a single element of Q. In such cases, most of the residual will be on the tails.\nThe algorithm will then \ufb01t the tails and in the following iterations re-learn a distribution close to q0.\nAs a consequence, it is important to identify good solutions before investing additional computational\neffort by adding more components to the mixture. Note that the ELBO cannot be used for this\npurpose, as its value at the maximum is unknown.\n\n6\n\n\fStopping criterion. We propose a stopping criterion for boosting VI which allows us to identify\nwhen a reasonably good approximation is reached and save computational effort. To this end, we\nrephrase the notion of duality gap [8, 7] in the context of boosting VI, which gives a surprisingly\nsimple stopping criterion for the algorithm.\nLemma 4. The duality gap g(q) := maxs2conv(A)hq  s, log q\nconv(A) is an upper bound on the primal error DKL(qkp)  DKL(q?kp).\nThe proof is provided in Appendix A.\nNote that the arg maxs2conv(A)hq  s, log q\npi is precisely the LMO solution to the problem (6).\nTherefore, with an exact LMO, one obtains a certi\ufb01cate on the primal error for free, without knowing\nthe value of DKL(q?kp). It is possible to show that a convergence rate similar to Theorem 1 also\nholds for the duality gap [8]. If the oracle is inexact, the estimate of the duality gap \u02dcg(q) satis\ufb01es that\n \u02dcg(q)  g(q), as a consequence of (7).\n1\n5 Experimental evaluation\n\npi computed at some iterate q 2\n\nNotably, our VI algorithm is black box in the sense that it leaves the de\ufb01nition of the model and the\nchoice of variational family up to the user. Therefore, we are able to reuse the same boosting black\nbox VI solver to run all our experiments, and more generally, any probabilistic model and choice of\nvariational family. We chose to implement our algorithm as an extension to the Edward probabilistic\nprogramming framework [25] thereby enabling users to apply boosting VI to any probabilistic model\nand variational family which are de\ufb01nable in Edward. In Appendix B, we show a code sample of our\nimplementation of Bayesian logistic regression.\nFor comparisons to baseline VI, we use Edward\u2019s built-in black box VI (BBVI) algorithm without\nmodi\ufb01cation. We run these baseline VI experiments for 10,000 iterations which is orders of magnitude\nmore than what is required for convergence. Unless otherwise noted, we use Gaussians as our base\nfamily. Note that FW with \ufb01xed step size is not monotonic and so in the experiments in which\nwe use a \ufb01xed step size, it is expected that the last iteration is not optimal. We use the training\nlog-likelihood to select the best iteration and we used the duality gap as a diagnostic tool in the\nimplementation to understand the impact of . We found that  = 1pt+1 worked well in all\nthe experiments. Code to reproduce the experiments is available at: https://github.com/\nratschlab/boosting-bbvi.\n\n5.1 Synthetic data\n\nFirst, we use synthetic data to visualize the ap-\nproximation of our algorithm of a bimodal poste-\nrior. In particular, we consider a mixture of two\nGaussians with parameters \u00b5 = (1, +1),  =\n(0.5, 0.5), and mixing weights \u21e1 = (0.4, 0.6).\nWe performed experiments with all three vari-\nants of FW described in Algorithm 1. For the\nfully corrective variant, we used FW to solve\nthe subproblem of \ufb01nding the optimal weights\nfor the current atom set. We observe that un-\nlike BBVI, all three variants are able to \ufb01t both\nmodes of the bimodal target distribution. The\nfully corrective version gives the best \ufb01t. Un-\nfortunately, this improved solution comes at a\ncomputational cost \u2014 solving the line search\nand fully corrective subproblems is dramatically\nslower than the \ufb01xed step size variant. In the experiments that follow we were able to improve upon\nthe initial VI solution using the simple \ufb01xed step size. We believe this is the most interesting variant\nfor practitioners as it does not require any additional implementation other than the VI subroutine.\nOur synthetic data results are summarized in Figure 1.\n\nFigure 1: Comparison between BBVI and three\nvariants of our boosting BBVI method on a mixture\nof Gaussians example.\n\n7\n\n\fTable 1: Comparison of boosting BBVI on the CHEMREACT dataset. We observe that using the\nLaplace distribution as the base family, our method outperforms BBVI using either Laplace or\nGaussian distributions as the variational family. In addition, boosting BBVI has lower variance across\nrepetitions.\n\nBoosting BBVI (Laplace)\nBBVI Edward (Laplace)\nBBVI Edward (Gaussian)\nLine Search Boosting VI ([5])\nFixed Step Boosting VI ([13])\nNorm Corrective Boosting VI ([13])\n\nTrain LL\n\n-.677 \u00b1 0.002\n-0.681 \u00b1 0.003\n-0.671 \u00b1 0.002\n\n-2.808\n-3.045\n-2.725\n\nTest AUROC\n0.794 \u00b1 0.005\n0.781 \u00b1 0.012\n0.790 \u00b1 0.009\n\n0.6377\n0.6193\n0.6440\n\nTable 2: Comparison of boosting BBVI on EICU COLLABORATIVE RESEARCH dataset. We observe\nthat our method outperforms BBVI with the Laplace distribution. In addition, boosting BBVI has\nlower variance across repetitions.\n\nTrain LL\n\n-0.195 \u00b1 0.007\n-0.200 \u00b1 0.032\n\nTest AUROC\n0.844 \u00b1 0.006\n0.838 \u00b1 0.016\n\nBoosting BBVI (Laplace)\nBBVI Edward (Laplace)\n\nTable 3: Matrix factorization results for latent dimension D = 3, 5, 10 on the CBCL FACE dataset.\nWe observe that our method outperforms the baseline BBVI method on mean-squared error (MSE).\n\nD=3\nD=5\nD=10\n\nBBVI MSE\n0.0184 \u00b1 0.001\n0.0187 \u00b1 0.001\n0.0188 \u00b1 0.001\n\nBoosting BBVI MSE\n0.0139 \u00b1 0.44e-04\n0.0137 \u00b1 0.53e-04\n0.0135 \u00b1 0.52e-04\n\nBBVI Test LL\n-0.9363 \u00b1 0.6e-3\n-0.9391 \u00b1 0.6e-3\n-0.9468 \u00b1 0.3e-3\n\n5.2 Bayesian logistic regression on two real-world datasets\n\nBoosting BBVI Test LL\n\n-0.9354 \u00b1 0.3e-3\n-0.9393 \u00b1 .4e-3\n-0.9492 \u00b1 .001\n\nIn this experiment, we consider two real-world binary-classi\ufb01cation tasks: predicting the reactivity\nof a chemical and predicting mortality in the intensive case unit (ICU). For both tasks we use the\nBayesian logistic regression model. This allows us to compare to previous work in [13]. Bayesian\nlogistic regression is a conditional prediction model with prior p(w) = N (0, 1) on the weights and\nconditional likelihood p(y|X) = Bernoulli(p = sigmoid(X>w)). This model is commonly used\nas an example of a simple model which does not have a closed form posterior. [1]\nWe use the CHEMREACT dataset which contains 26,733 chemicals, each with 100 features. For this\nexperiment, we ran our algorithm for 35 iterations and found that iteration 17 had the best performance.\nWe observe that running merely one single well-tuned iteration of BBVI as implemented in the Edward\nframework using Gaussian as the variational class outperforms 10 iterations of boosting VI in [13].\nWhile BBVI already has good performance in terms of AUROC, we are able to improve it further by\nusing the \ufb01xed step size variant of FW and the Laplace distributions as the base family. In addition,\nour solution is more stable, namely it has lower standard deviation across replications. Results are\nsummarized in Table 1.\nFor the mortality prediction task, we used a preprocessed dataset created by the authors of [3] from the\nEICU COLLABORATIVE RESEARCH database [4]. The preprocessing included selecting patient stays\nbetween 1 and 30 days, removing patients with missing values, and selecting a subset of clinically\nrelevant features. The \ufb01nal dataset contains 71,366 patient stays and 70 relevant features ranging\nfrom age and gender to lab test results. We performed a 70-30% train-test split. We ran our algorithm\nfor 29 iterations and again found that iteration 17 had the best performance. We observed that our\nmethod improves upon the AUROC of Edward\u2019s baseline VI solution and is also more stable. Results\nare summarized in Table 2.\n\n5.3 Bayesian matrix factorization\nBayesian Matrix Factorization [17] is a more complex model de\ufb01ned in terms of two latent variables,\nU and V for some choice of the latent dimension D. In the base distribution, each entry of the\nmatrices U and V are independent Gaussians. To sample from Rt, we sample U, V from the\n\n8\n\n\fi \u21b5isi(U, V) and si(U, V) is the t-th iterate returned by the LMO.\n\nboosted posterior (U, V)t and then sample from N (U>V, I). Thus, Rt \u21e0 N (Ut>Vt, I) where\n(U, V)t \u21e0Pt\n\nWe use the CBCL FACE1 dataset which is composed of 2,492 images of 361 pixels each, arranged\ninto a matrix. Using a 50% mask for the train-test split, we performed matrix completion on this\ndata using the above model. We compared our boosting BBVI to BBVI for three choices of the\nlatent dimension D = 3, 5, 10 and observe improvements across all three in mean-squared error. For\nheld-out test log-likelihood, we observe an improved performance for D = 3. Similar to the results\nfor Bayesian linear regression, we observe that the variance across replications is also smaller for our\nmethod. Surprisingly, increasing D does not have a signi\ufb01cant effect on either of the metrics which\nmay indicate that a relatively inexpressive model (D = 3) already contains a good approximation.\nResults are summarized in Table 3.\n6 Conclusion\n\nWe have presented a re\ufb01ned yet practical theoretical analysis for the boosting variational inference\nparadigm. Our approach incorporates black box VI solvers into a general gradient boosting framework\nbased on the Frank-Wolfe algorithm. Furthermore, we introduced a subroutine which is \ufb01nally\nattractive to practitioners as it does not require any additional overhead beyond running a general\nblack box VI solver multiple times. This is an important step forward in adding boosting VI to the\nstandard toolbox of Bayesian inference.\n\nAcknowledgements. FL is partially supported by the Max-Planck ETH Center for Learning Systems.\nFL, GD are partially supported by an ETH core grant (to GR). RK is supported by NSF Grant IIS\n1421729. We thank Matthias H\u00fcser for providing the preprocessed eICU dataset. We also thank\nDavid Blei and Anant Raj for helpful discussions.\n\n1http://cbcl.mit.edu/software-datasets/FaceData2.html\n\n9\n\n\fReferences\n[1] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for\n\nstatisticians. arXiv preprint arXiv:1601.00670, 2016.\n\n[2] M Bishop Christopher. Pattern Recognition and Machine Learning. Springer-Verlag New York,\n\n2016.\n\n[3] Vincent Fortuin, Matthias H\u00fcser, Francesco Locatello, Heiko Strathmann, and Gunnar R\u00e4tsch.\nDeep Self-Organization: Interpretable Discrete Representation Learning on Time Series. arXiv\npreprint arXiv:1806.02199, 2018.\n\n[4] Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov,\nRoger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley.\nPhysiobank, physiotoolkit, and physionet. Circulation, 101(23):e215\u2013e220, 2000.\n\n[5] Fangjian Guo, Xiangyu Wang, Kai Fan, Tamara Broderick, and David B Dunson. Boosting\n\nvariational inference. arXiv preprint arXiv:1611.05559, 2016.\n\n[6] Tommi S. Jaakkola and Michael I. Jordan. Improving the mean \ufb01eld approximation via the use\n\nof mixture distributions, 1998.\n\n[7] Martin Jaggi. Convex Optimization without Projection Steps. arXiv math.OC, August 2011.\n[8] Martin Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML\n\n2013 - Proceedings of the 30th International Conference on Machine Learning, 2013.\n\n[9] Ghassen Jerfel. Boosted stochastic backpropagation for variational inference, 2017.\n[10] Rahul G. Krishnan, Simon Lacoste-Julien, and David Sontag. Barrier frank-wolfe for marginal\n\ninference. pages 532\u2013540, 2015.\n\n[11] Simon Lacoste-Julien. Convergence rate of frank-wolfe for non-convex objectives. arXiv\n\npreprint arXiv:1607.00345, 2016.\n\n[12] Simon Lacoste-Julien and Martin Jaggi. On the Global Linear Convergence of Frank-Wolfe\n\nOptimization Variants. In NIPS 2015, pages 496\u2013504, 2015.\n\n[13] Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, and Gunnar R\u00e4tsch. Boosting variational\ninference: an optimization perspective. In Proceedings of the 21st International Conference\non Arti\ufb01cial Intelligence and Statistics (AISTATS 2018), volume 84 of Proceedings of Machine\nLearning Research, pages 464\u2013472. PMLR, 2018.\n\n[14] Francesco Locatello, Rajiv Khanna, Michael Tschannen, and Martin Jaggi. A uni\ufb01ed optimiza-\ntion view on generalized matching pursuit and frank-wolfe. In Proc. International Conference\non Arti\ufb01cial Intelligence and Statistics (AISTATS), 2017.\n\n[15] Ron Meir and Gunnar R\u00e4tsch. An introduction to boosting and leveraging. In Advanced lectures\n\non machine learning, pages 118\u2013183. Springer, 2003.\n\n[16] Andrew C Miller, Nicholas Foti, and Ryan P Adams. Variational boosting: Iteratively re\ufb01ning\n\nposterior approximations. arXiv preprint arXiv:1611.06585, 2016.\n\n[17] Andriy Mnih and Ruslan R Salakhutdinov. Probabilistic matrix factorization. In Advances in\n\nneural information processing systems, pages 1257\u20131264, 2008.\n\n[18] Emanuel Parzen. On estimation of a probability density function and mode. The Annals of\n\nMathematical Statistics, 33:pp. 1065\u20131076, 1962.\n\n[19] Rajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference.\n\nAISTATS, pages 814\u2013822, 2014.\n\nIn\n\n[20] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing \ufb02ows.\nIn Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Workshop and\nConference Proceedings, pages 1530\u20131538. JMLR.org, 2015.\n\n10\n\n\f[21] Ardavan Saeedi, Tejas D. Kulkarni, Vikash K. Mansinghka, and Samuel J. Gershman. Variational\n\nparticle approximations. Journal of Machine Learning Research, 18(69):1\u201329, 2017.\n\n[22] Tim Salimans, Diederik P. Kingma, and Max Welling. Markov chain monte carlo and variational\n\ninference: Bridging the gap. In ICML, 2015.\n\n[23] Siddhartha Saxena, Shibhansh Dohare, and Jaivardhan Kapoor. Variational inference via\n\ntransformations on distributions. CoRR, abs/1707.02510, 2017.\n\n[24] Ilya Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard\n\nSch\u00f6lkopf. Adagan: Boosting generative models. arXiv preprint arXiv:1701.02386, 2017.\n\n[25] Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M.\nBlei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint\narXiv:1610.09787, 2016.\n\n11\n\n\f", "award": [], "sourceid": 1726, "authors": [{"given_name": "Francesco", "family_name": "Locatello", "institution": "MPI T\u00fcbingen - ETH Z\u00fcrich"}, {"given_name": "Gideon", "family_name": "Dresdner", "institution": "ETH Z\u00fcrich"}, {"given_name": "Rajiv", "family_name": "Khanna", "institution": "University of Texas at Austin"}, {"given_name": "Isabel", "family_name": "Valera", "institution": "Max Planck Institute for Intelligent Systems"}, {"given_name": "Gunnar", "family_name": "Raetsch", "institution": "ETHZ"}]}