{"title": "Variational Inference via $\\chi$ Upper Bound Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2732, "page_last": 2741, "abstract": "Variational inference (VI) is widely used as an efficient alternative to Markov chain Monte Carlo. It posits a family of approximating distributions $q$ and finds the closest member to the exact posterior $p$. Closeness is usually measured via a divergence $D(q || p)$ from $q$ to $p$. While successful, this approach also has problems. Notably, it typically leads to underestimation of the posterior variance. In this paper we propose CHIVI, a black-box variational inference algorithm that minimizes $D_{\\chi}(p || q)$, the $\\chi$-divergence from $p$ to $q$. CHIVI minimizes an upper bound of the model evidence, which we term the $\\chi$ upper bound (CUBO). Minimizing the CUBO leads to improved posterior uncertainty, and it can also be used with the classical VI lower bound (ELBO) to provide a sandwich estimate of the model evidence. We study CHIVI on three models: probit regression, Gaussian process classification, and a Cox process model of basketball plays. When compared to expectation propagation and classical VI, CHIVI produces better error rates and more accurate estimates of posterior variance.", "full_text": "Variational Inference via\n\n\u03c7 Upper Bound Minimization\n\nAdji B. Dieng\n\nColumbia University\n\nDustin Tran\n\nColumbia University\n\nRajesh Ranganath\nPrinceton University\n\nJohn Paisley\n\nColumbia University\n\nDavid M. Blei\n\nColumbia University\n\nAbstract\n\nVariational inference (VI) is widely used as an ef\ufb01cient alternative to Markov\nchain Monte Carlo. It posits a family of approximating distributions q and \ufb01nds\nthe closest member to the exact posterior p. Closeness is usually measured via a\ndivergence D(q||p) from q to p. 
While successful, this approach also has problems.\nNotably, it typically leads to underestimation of the posterior variance. In this paper\nwe propose CHIVI, a black-box variational inference algorithm that minimizes\nD_\u03c7(p || q), the \u03c7-divergence from p to q. CHIVI minimizes an upper bound of the\nmodel evidence, which we term the \u03c7 upper bound (CUBO). Minimizing the\nCUBO leads to improved posterior uncertainty, and it can also be used with the\nclassical VI lower bound (ELBO) to provide a sandwich estimate of the model\nevidence. We study CHIVI on three models: probit regression, Gaussian process\nclassification, and a Cox process model of basketball plays. When compared to\nexpectation propagation and classical VI, CHIVI produces better error rates and\nmore accurate estimates of posterior variance.\n\n1 Introduction\n\nBayesian analysis provides a foundation for reasoning with probabilistic models. We first set a joint\ndistribution p(x, z) of latent variables z and observed variables x. We then analyze data through the\nposterior, p(z | x). In most applications, the posterior is difficult to compute because the marginal\nlikelihood p(x) is intractable. We must use approximate posterior inference methods such as Monte\nCarlo [1] and variational inference [2]. This paper focuses on variational inference.\nVariational inference approximates the posterior using optimization. The idea is to posit a family of\napproximating distributions and then to find the member of the family that is closest to the posterior.\nTypically, closeness is defined by the Kullback-Leibler (KL) divergence KL(q || p), where q(z; \u03bb) is\na variational family indexed by parameters \u03bb. 
This approach, which we call KLVI, also provides the\nevidence lower bound (ELBO), a convenient lower bound of the model evidence log p(x).\nKLVI scales well and is suited to applications that use complex models to analyze large data sets [3].\nBut it has drawbacks. For one, it tends to favor underdispersed approximations relative to the exact\nposterior [4, 5]. This produces difficulties with light-tailed posteriors when the variational distribution\nhas heavier tails. For example, KLVI for Gaussian process classification typically uses a Gaussian\napproximation; this leads to unstable optimization and a poor approximation [6].\nOne alternative to KLVI is expectation propagation (EP), which enjoys good empirical performance\non models with light-tailed posteriors [7, 8]. Procedurally, EP reverses the arguments in the KL divergence\nand performs local minimizations of KL(p || q); this corresponds to iterative moment matching\non partitions of the data. Relative to KLVI, EP produces overdispersed approximations. But EP also\nhas drawbacks. It is not guaranteed to converge [7, Figure 3.6]; it does not provide an easy estimate\nof the marginal likelihood; and it does not optimize a well-defined global objective [9].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this paper we develop a new algorithm for approximate posterior inference, \u03c7-divergence variational\ninference (CHIVI). CHIVI minimizes the \u03c7-divergence from the posterior to the variational family,\n\n$$D_{\\chi^2}(p \\| q) = \\mathbb{E}_{q(z;\\lambda)}\\left[\\left(\\frac{p(z \\mid x)}{q(z;\\lambda)}\\right)^2 - 1\\right]. \\qquad (1)$$\n\nCHIVI enjoys advantages of both EP and KLVI. Like EP, it produces overdispersed approximations;\nlike KLVI, it optimizes a well-defined objective and estimates the model evidence.\nAs we mentioned, KLVI optimizes a lower bound on the model evidence. 
The idea behind CHIVI\nis to optimize an upper bound, which we call the \u03c7 upper bound (CUBO). Minimizing the CUBO\nis equivalent to minimizing the \u03c7-divergence. In providing an upper bound, CHIVI can be used (in\nconcert with KLVI) to sandwich estimate the model evidence. Sandwich estimates are useful for\ntasks like model selection [10]. Existing work on sandwich estimation relies on MCMC and only\nevaluates simulated data [11]. We derive a sandwich theorem (Section 2) that relates CUBO and\nELBO. Section 3 demonstrates sandwich estimation on real data.\nAside from providing an upper bound, there are two additional bene\ufb01ts to CHIVI. First, it is a\nblack-box inference algorithm [12] in that it does not need model-speci\ufb01c derivations and it is easy to\napply to a wide class of models. It minimizes an upper bound in a principled way using unbiased\nreparameterization gradients [13, 14] of the exponentiated CUBO.\nSecond, it is a viable alternative to EP. The \u03c7-divergence enjoys the same \u201czero-avoiding\u201d behavior of\nEP, which seeks to place positive mass everywhere, and so CHIVI is useful when the KL divergence is\nnot a good objective (such as for light-tailed posteriors). Unlike EP, CHIVI is guaranteed to converge;\nprovides an easy estimate of the marginal likelihood; and optimizes a well-de\ufb01ned global objective.\nSection 3 shows that CHIVI outperforms KLVI and EP for Gaussian process classi\ufb01cation.\nThe rest of this paper is organized as follows. Section 2 derives the CUBO, develops CHIVI, and\nexpands on its zero-avoiding property that \ufb01nds overdispersed posterior approximations. Section 3\napplies CHIVI to Bayesian probit regression, Gaussian process classi\ufb01cation, and a Cox process\nmodel of basketball plays. On Bayesian probit regression and Gaussian process classi\ufb01cation, it\nyielded lower classi\ufb01cation error than KLVI and EP. 
When modeling basketball data with a Cox\nprocess, it gave more accurate estimates of posterior variance than KLVI.\nRelated work. The most widely studied variational objective is KL(q || p). The main alternative\nis EP [15, 7], which locally minimizes KL(p || q). Recent work revisits EP from the perspective of\ndistributed computing [16, 17, 18] and also revisits [19], which studies local minimizations with the\ngeneral family of \u03b1-divergences [20, 21]. CHIVI relates to EP and its extensions in that it leads to\noverdispersed approximations relative to KLVI. However, unlike [19, 20], CHIVI does not rely on\ntying local factors; it optimizes a well-defined global objective. In this sense, CHIVI relates to the\nrecent work on alternative divergence measures for variational inference [21, 22].\nA closely related work is [21]. They perform black-box variational inference using the reverse\n\u03b1-divergence D_\u03b1(q || p), which is a valid divergence when \u03b1 > 0 (see footnote 1). Their work shows that minimizing\nD_\u03b1(q || p) is equivalent to maximizing a lower bound of the model evidence. No positive value of\n\u03b1 in D_\u03b1(q || p) leads to the \u03c7-divergence. Even though taking \u03b1 \u2264 0 leads to CUBO, it does not\ncorrespond to a valid divergence in D_\u03b1(q || p). The algorithm in [21] also cannot minimize the upper\nbound we study in this paper. In this sense, our work complements [21].\nAn exciting concurrent work by [23] also studies the \u03c7-divergence. Their work focuses on upper\nbounding the partition function in undirected graphical models. This is a complementary application:\nBayesian inference and undirected models both involve an intractable normalizing constant.\n\n2 \u03c7-Divergence Variational Inference\n\nWe present the \u03c7-divergence for variational inference. 
We describe some of its properties and develop\nCHIVI, a black box algorithm that minimizes the \u03c7-divergence for a large class of models.\n\n1It satis\ufb01es D(p(cid:107) q) \u2265 0 and D(p(cid:107) q) = 0 \u21d0\u21d2 p = q almost everywhere\n\n2\n\n\fVariational inference (VI) casts Bayesian inference as optimization [24]. VI posits a family of\napproximating distributions and \ufb01nds the closest member to the posterior. In its typical formulation, VI\nminimizes the Kullback-Leibler divergence from q(z; \u03bb) to p(z| x). Minimizing the KL divergence\nis equivalent to maximizing the ELBO, a lower bound to the model evidence log p(x).\n2.1 The \u03c7-divergence\n\nMaximizing the ELBO imposes properties on the resulting approximation such as underestimation of\nthe posterior\u2019s support [4, 5]. These properties may be undesirable, especially when dealing with\nlight-tailed posteriors such as in Gaussian process classi\ufb01cation [6].\nWe consider the \u03c7-divergence (Equation 1). Minimizing the \u03c7-divergence induces alternative proper-\nties on the resulting approximation. (See Appendix 5 for more details on all these properties.) Below\nwe describe a key property which leads to overestimation of the posterior\u2019s support.\nZero-avoiding behavior: Optimizing the \u03c7-divergence leads to a variational distribution with a\nzero-avoiding behavior, which is similar to EP [25]. Namely, the \u03c7-divergence is in\ufb01nite whenever\nq(z; \u03bb) = 0 and p(z| x) > 0. Thus when minimizing it, setting p(z| x) > 0 forces q(z; \u03bb) > 0.\nThis means q avoids having zero mass at locations where p has nonzero mass.\nThe classical objective KL(q (cid:107) p) leads to approximate posteriors with the opposite behavior, called\nzero-forcing. Namely, KL(q (cid:107) p) is in\ufb01nite when p(z| x) = 0 and q(z; \u03bb) > 0. Therefore the optimal\nvariational distribution q will be 0 when p(z| x) = 0. 
This zero-forcing behavior leads to degenerate\nsolutions during optimization, and is the source of \u201cpruning\u201d often reported in the literature (e.g.,\n[26, 27]). For example, if the approximating family q has heavier tails than the target posterior p, the\nvariational distributions must be overconfident enough that the heavier tail does not allocate mass\noutside the lighter tail\u2019s support (see footnote 2).\n\n2.2 CUBO: the \u03c7 Upper Bound\n\nWe derive a tractable objective for variational inference with the \u03c7^2-divergence and also generalize it\nto the \u03c7^n-divergence for n > 1. Consider the optimization problem of minimizing Equation 1. We\nseek to find a relationship between the \u03c7^2-divergence and log p(x). Because p(x, z) = p(x) p(z | x), Equation 1 gives\n\n$$\\mathbb{E}_{q(z;\\lambda)}\\left[\\left(\\frac{p(x, z)}{q(z;\\lambda)}\\right)^2\\right] = p(x)^2 \\, \\mathbb{E}_{q(z;\\lambda)}\\left[\\left(\\frac{p(z \\mid x)}{q(z;\\lambda)}\\right)^2\\right] = p(x)^2 \\left[1 + D_{\\chi^2}(p(z \\mid x) \\| q(z;\\lambda))\\right].$$\n\nTaking logarithms on both sides, we find a relationship analogous to how KL(q || p) relates to the\nELBO. Namely, the \u03c7^2-divergence satisfies\n\n$$\\frac{1}{2} \\log\\left(1 + D_{\\chi^2}(p(z \\mid x) \\| q(z;\\lambda))\\right) = -\\log p(x) + \\frac{1}{2} \\log \\mathbb{E}_{q(z;\\lambda)}\\left[\\left(\\frac{p(x, z)}{q(z;\\lambda)}\\right)^2\\right].$$\n\nBy monotonicity of log, and because log p(x) is constant, minimizing the \u03c7^2-divergence is equivalent\nto minimizing\n\n$$L_{\\chi^2}(\\lambda) = \\frac{1}{2} \\log \\mathbb{E}_{q(z;\\lambda)}\\left[\\left(\\frac{p(x, z)}{q(z;\\lambda)}\\right)^2\\right].$$\n\nFurthermore, by nonnegativity of the \u03c7^2-divergence, this quantity is an upper bound to the model\nevidence. We call this objective the \u03c7 upper bound (CUBO).\nA general upper bound. The derivation extends to an upper bound for the general \u03c7^n-divergence,\n\n$$L_{\\chi^n}(\\lambda) = \\frac{1}{n} \\log \\mathbb{E}_{q(z;\\lambda)}\\left[\\left(\\frac{p(x, z)}{q(z;\\lambda)}\\right)^n\\right] = \\mathrm{CUBO}_n. \\qquad (2)$$\n\nThis produces a family of bounds. 
When n < 1, CUBO_n is a lower bound, and minimizing it for\nthese values of n does not minimize the \u03c7-divergence (rather, when n < 1, we recover the reverse\n\u03b1-divergence and the VR-bound [21]). When n = 1, the bound is tight: CUBO_1 = log p(x).\nFor n \u2265 1, CUBO_n is an upper bound to the model evidence. In this paper we focus on n = 2. Other\nvalues of n are possible depending on the application and dataset. We chose n = 2 because it is\nthe most standard, and is equivalent to finding the optimal proposal in importance sampling. See\nAppendix 4 for more details.\n\nFootnote 2: Zero-forcing may be preferable in settings such as multimodal posteriors with unimodal approximations:\nfor predictive tasks, it helps to concentrate on one mode rather than spread mass over all of them [5]. In this\npaper, we focus on applications with light-tailed posteriors and one to relatively few modes.\n\nSandwiching the model evidence. Equation 2 has practical value. We can minimize the CUBO_n\nand maximize the ELBO. This produces a sandwich on the model evidence. (See Appendix 8 for a\nsimulated illustration.) The following sandwich theorem states that the gap induced by CUBO_n and\nELBO increases with n. This suggests that taking n as close to 1 as possible enables approximating\nlog p(x) with higher precision. When we further decrease n to 0, CUBO_n becomes a lower bound\nand tends to the ELBO.\n\nTheorem 1 (Sandwich Theorem): Define CUBO_n as in Equation 2. Then the following holds:\n\n\u2022 \u2200n \u2265 1, ELBO \u2264 log p(x) \u2264 CUBO_n.\n\u2022 \u2200n \u2265 1, CUBO_n is a non-decreasing function of the order n of the \u03c7-divergence.\n\u2022 lim_{n\u21920} CUBO_n = ELBO.\n\nSee proof in Appendix 1. Theorem 1 can be utilized for estimating log p(x), which is important for\nmany applications such as the evidence framework [28], where the marginal likelihood is argued to\nembody an Occam\u2019s razor. 
Model selection based solely on the ELBO is inappropriate because of\nthe possible variation in the tightness of this bound. With an accompanying upper bound, one can\nperform what we call maximum entropy model selection, in which each model\u2019s evidence value is\nchosen to be the one that maximizes the entropy of the resulting distribution on models. We leave this\nas future work. Theorem 1 can also help estimate Bayes factors [29]. In general, this technique is\nimportant as there is little existing work: for example, Ref. [11] proposes an MCMC approach and\nevaluates it on simulated data. We illustrate sandwich estimation in Section 3 on UCI datasets.\n\n2.3 Optimizing the CUBO\n\nWe derived the CUBO_n, a general upper bound to the model evidence that can be used to minimize\nthe \u03c7-divergence. We now develop CHIVI, a black box algorithm that minimizes CUBO_n.\nThe goal in CHIVI is to minimize the CUBO_n with respect to variational parameters,\n\n$$\\mathrm{CUBO}_n(\\lambda) = \\frac{1}{n} \\log \\mathbb{E}_{q(z;\\lambda)}\\left[\\left(\\frac{p(x, z)}{q(z;\\lambda)}\\right)^n\\right].$$\n\nThe expectation in the CUBO_n is usually intractable. Thus we use Monte Carlo to construct an\nestimate. One approach is to naively perform Monte Carlo on this objective,\n\n$$\\mathrm{CUBO}_n(\\lambda) \\approx \\frac{1}{n} \\log \\frac{1}{S} \\sum_{s=1}^{S} \\left(\\frac{p(x, z^{(s)})}{q(z^{(s)};\\lambda)}\\right)^n,$$\n\nfor S samples z^(1), ..., z^(S) \u223c q(z; \u03bb). However, by Jensen\u2019s inequality, the log transform of the\nexpectation implies that this is a biased estimate of CUBO_n(\u03bb):\n\n$$\\mathbb{E}_q\\left[\\frac{1}{n} \\log \\frac{1}{S} \\sum_{s=1}^{S} \\left(\\frac{p(x, z^{(s)})}{q(z^{(s)};\\lambda)}\\right)^n\\right] \\neq \\mathrm{CUBO}_n.$$\n\nIn fact this expectation changes during optimization and depends on the sample size S. The objective\nis not guaranteed to be an upper bound if S is not chosen appropriately from the beginning. 
This\nproblem does not exist for lower bounds because the Monte Carlo approximation is still a lower bound;\nthis is why the approach in [21] works for lower bounds but not for upper bounds. Furthermore,\ngradients of this biased Monte Carlo objective are also biased.\nWe propose a way to minimize upper bounds which also can be used for lower bounds. The approach\nkeeps the upper bounding property intact. It does so by minimizing a Monte Carlo approximation of\nthe exponentiated upper bound,\n\n$$L = \\exp\\{n \\cdot \\mathrm{CUBO}_n(\\lambda)\\}.$$\n\nAlgorithm 1: \u03c7-divergence variational inference (CHIVI)\n\nInput: Data x, Model p(x, z), Variational family q(z; \u03bb).\nOutput: Variational parameters \u03bb.\nInitialize \u03bb randomly.\nwhile not converged do\n    Draw S samples z^(1), ..., z^(S) from q(z; \u03bb) and a data subsample {x_{i_1}, ..., x_{i_M}}.\n    Set \u03c1_t according to a learning rate schedule.\n    Set $\\log w^{(s)} = \\log p(z^{(s)}) + \\frac{N}{M} \\sum_{j=1}^{M} \\log p(x_{i_j} \\mid z^{(s)}) - \\log q(z^{(s)};\\lambda_t)$, $s \\in \\{1, ..., S\\}$.\n    Set $w^{(s)} = \\exp(\\log w^{(s)} - \\max_s \\log w^{(s)})$, $s \\in \\{1, ..., S\\}$.\n    Update $\\lambda_{t+1} = \\lambda_t - \\frac{(1-n) \\cdot \\rho_t}{S} \\sum_{s=1}^{S} \\left[\\left(w^{(s)}\\right)^n \\nabla_\\lambda \\log q(z^{(s)};\\lambda_t)\\right]$.\nend\n\nBy monotonicity of exp, this objective admits the same optima as CUBO_n(\u03bb). Monte Carlo produces\nan unbiased estimate, and the number of samples only affects the variance of the gradients. We\nminimize it using reparameterization gradients [13, 14]. These gradients apply to models with\ndifferentiable latent variables. Formally, assume we can rewrite the generative process as z = g(\u03bb, \u03b5)\nwhere \u03b5 \u223c p(\u03b5) for some deterministic function g. 
Then\n\n$$\\hat{L} = \\frac{1}{B} \\sum_{b=1}^{B} \\left(\\frac{p(x, g(\\lambda, \\epsilon^{(b)}))}{q(g(\\lambda, \\epsilon^{(b)});\\lambda)}\\right)^n$$\n\nis an unbiased estimator of L and its gradient is\n\n$$\\nabla_\\lambda \\hat{L} = \\frac{n}{B} \\sum_{b=1}^{B} \\left(\\frac{p(x, g(\\lambda, \\epsilon^{(b)}))}{q(g(\\lambda, \\epsilon^{(b)});\\lambda)}\\right)^n \\nabla_\\lambda \\log \\left(\\frac{p(x, g(\\lambda, \\epsilon^{(b)}))}{q(g(\\lambda, \\epsilon^{(b)});\\lambda)}\\right). \\qquad (3)$$\n\n(See Appendix 7 for a more detailed derivation and also a more general alternative with score function\ngradients [30].)\nComputing Equation 3 requires the full dataset x. We can apply the \u201caverage likelihood\u201d technique\nfrom EP [18, 31]. Consider data {x_1, ..., x_N} and a subsample {x_{i_1}, ..., x_{i_M}}. We approximate\nthe full log-likelihood by\n\n$$\\log p(x \\mid z) \\approx \\frac{N}{M} \\sum_{j=1}^{M} \\log p(x_{i_j} \\mid z).$$\n\nUsing this proxy to the full dataset we derive CHIVI, an algorithm in which each iteration depends on\nonly a mini-batch of data. CHIVI is a black box algorithm for performing approximate inference with\nthe \u03c7^n-divergence. Algorithm 1 summarizes the procedure. In practice, we subtract the maximum of\nthe logarithm of the importance weights, defined as\n\n$$\\log w = \\log p(x, z) - \\log q(z;\\lambda),$$\n\nto avoid underflow. Stochastic optimization theory still gives us convergence with this approach [32].\n\n3 Empirical Study\n\nWe developed CHIVI, a black box variational inference algorithm for minimizing the \u03c7-divergence.\nWe now study CHIVI with several models: probit regression, Gaussian process (GP) classification,\nand Cox processes. With probit regression, we demonstrate the sandwich estimator on real and\nsynthetic data. CHIVI provides a useful tool to estimate the marginal likelihood. We also show that\nfor this model, where the ELBO is applicable, CHIVI works well and yields good test error rates.\n\nFigure 1: Sandwich gap via CHIVI and BBVI on different datasets. 
The first two plots correspond to\nsandwich plots for the two UCI datasets Ionosphere and Heart respectively. The last plot corresponds\nto a sandwich for generated data where we know the log marginal likelihood of the data. There the\ngap is tight after only a few iterations. More sandwich plots can be found in the appendix.\n\nTable 1: Test error for Bayesian probit regression. The lower the better. CHIVI (this paper) yields\nlower test error rates when compared to BBVI [12] and EP on most datasets.\n\nDataset      BBVI             EP               CHIVI\nPima         0.235 \u00b1 0.006    0.234 \u00b1 0.006    0.222 \u00b1 0.048\nIonos        0.123 \u00b1 0.008    0.124 \u00b1 0.008    0.116 \u00b1 0.05\nMadelon      0.457 \u00b1 0.005    0.445 \u00b1 0.005    0.453 \u00b1 0.029\nCovertype    0.157 \u00b1 0.01     0.155 \u00b1 0.018    0.154 \u00b1 0.014\n\nSecond, we compare CHIVI to Laplace and EP on GP classification, a model class for which KLVI\nfails (because the typically chosen variational distribution has heavier tails than the posterior; see footnote 3). In these\nsettings, EP has been the method of choice. CHIVI outperforms both of these methods.\nThird, we show that CHIVI does not suffer from the posterior support underestimation problem\nresulting from maximizing the ELBO. For that we analyze Cox processes, a type of spatial point\nprocess, to compare profiles of different NBA basketball players. We find CHIVI yields better\nposterior uncertainty estimates (using HMC as the ground truth).\n\n3.1 Bayesian Probit Regression\n\nWe analyze inference for Bayesian probit regression. First, we illustrate sandwich estimation on UCI\ndatasets. Figure 1 illustrates the bounds of the log marginal likelihood given by the ELBO and the\nCUBO. Using both quantities provides a reliable approximation of the model evidence. In addition,\nthese figures show convergence for CHIVI, which EP does not always satisfy.\nWe also compared the predictive performance of CHIVI, EP, and KLVI. 
We used a minibatch size\nof 64 and 2000 iterations for each batch. We computed the average classification error rate and the\nstandard deviation using 50 random splits of the data. We split all the datasets with 90% of the\ndata for training and 10% for testing. For the Covertype dataset, we implemented Bayesian probit\nregression to discriminate class 1 against all other classes. Table 1 shows the average error rate\nfor KLVI, EP, and CHIVI. CHIVI performs better for all but one dataset.\n\n3.2 Gaussian Process Classification\n\nGP classification is an alternative to probit regression. The posterior is analytically intractable because\nthe likelihood is not conjugate to the prior. Moreover, the posterior tends to be skewed. EP has been\nthe method of choice for approximating the posterior [8]. We choose a factorized Gaussian for the\nvariational distribution q and fit its mean and log variance parameters.\nWith UCI benchmark datasets, we compared the predictive performance of CHIVI to EP and Laplace.\nTable 2 summarizes the results. The error rates for CHIVI correspond to the average of 10 error\nrates obtained by dividing the data into 10 folds, applying CHIVI to 9 folds to learn the variational\nparameters and performing prediction on the remainder. The kernel hyperparameters were chosen\nusing grid search. The error rates for the other methods correspond to the best results reported in [8]\nand [34]. On all the datasets CHIVI performs as well as or better than EP and Laplace.\n\nFootnote 3: For KLVI, we use the black box variational inference (BBVI) version [12], specifically via Edward [33].\n\n[Figure 1 panels: Sandwich Plot Using CHIVI and BBVI On Ionosphere Dataset; Sandwich Plot Using CHIVI and BBVI On Heart Dataset. Axes: epoch, objective. Curves: upper bound, lower bound.]\n\nTable 2: Test error for Gaussian process classification. The lower the better. CHIVI (this paper)\nyields lower test error rates when compared to Laplace and EP on most datasets.\n\nDataset    Laplace    EP             CHIVI\nCrabs      0.02       0.02           0.03 \u00b1 0.03\nSonar      0.154      0.139          0.055 \u00b1 0.035\nIonos      0.084      0.08 \u00b1 0.04    0.069 \u00b1 0.034\n\nTable 3: Average L1 error for posterior uncertainty estimates (ground truth from HMC). We find that\nCHIVI is similar to or better than BBVI at capturing posterior uncertainties. Demarcus Cousins, who\nplays center, stands out in particular. His shots are concentrated near the basket, so the posterior is\nuncertain over a large part of the court (Figure 2).\n\n         Curry    Demarcus    Lebron    Duncan\nCHIVI    0.060    0.073       0.0825    0.0849\nBBVI     0.066    0.082       0.0812    0.0871\n\n3.3 Cox Processes\n\nFinally, we study Cox processes. They are Poisson processes with stochastic rate functions. They\ncapture dependence between the frequency of points in different regions of a space. We apply\nCox processes to model the spatial locations of shots (made and missed) from the 2015-2016 NBA\nseason [35]. The data are from 308 NBA players who took more than 150,000 shots in total. The\nnth player\u2019s set of M_n shot attempts is x_n = {x_{n,1}, ..., x_{n,M_n}}, and the location of the mth shot\nby the nth player on the basketball court is x_{n,m} \u2208 [\u221225, 25] \u00d7 [0, 40]. Let PP(\u03bb) denote a Poisson\nprocess with intensity function \u03bb, and K be a covariance matrix resulting from a kernel applied to\nevery location of the court. 
The generative process for the nth player\u2019s shot is\n\n$$K_{i,j} = k(x_i, x_j) = \\sigma^2 \\exp\\left(-\\frac{1}{2\\phi^2} \\|x_i - x_j\\|^2\\right),$$\n\n$$f \\sim \\mathcal{GP}(0, k(\\cdot, \\cdot)); \\quad \\lambda = \\exp(f); \\quad x_{n,k} \\sim \\mathrm{PP}(\\lambda) \\text{ for } k \\in \\{1, ..., M_n\\}.$$\n\nThe kernel of the Gaussian process encodes the spatial correlation between different areas of the\nbasketball court. The model treats the N players as independent. But the kernel K introduces\ncorrelation between the shots attempted by a given player.\nOur goal is to infer the intensity functions \u03bb(\u00b7) for each player. We compare the shooting profiles\nof different players using these inferred intensity surfaces. The results are shown in Figure 2. The\nshooting profiles of Demarcus Cousins and Stephen Curry are captured by both BBVI and CHIVI.\nBBVI has lower posterior uncertainty while CHIVI provides more overdispersed solutions. We plot\nthe profiles for two more players, LeBron James and Tim Duncan, in the appendix.\nIn Table 3, we compare the posterior uncertainty estimates of CHIVI and BBVI to those of HMC, a\ncomputationally expensive Markov chain Monte Carlo procedure that we treat as exact. We use the\naverage L1 distance from HMC as the error measure. We do this on four different players: Stephen Curry,\nDemarcus Cousins, LeBron James, and Tim Duncan. We find that CHIVI is similar to or better than\nBBVI, especially on players like Demarcus Cousins who shoot in a limited part of the court.\n\n4 Discussion\n\nWe described CHIVI, a black box algorithm that minimizes the \u03c7-divergence by minimizing the\nCUBO. We motivated CHIVI as a useful alternative to EP. 
We justified how the approach used in\nCHIVI enables upper bound minimization, in contrast to existing \u03b1-divergence minimization techniques.\nThis enables sandwich estimation using variational inference instead of Markov chain Monte Carlo.\nWe illustrated this by showing how to use CHIVI in concert with KLVI to sandwich-estimate the\nmodel evidence. Finally, we showed that CHIVI is an effective algorithm for Bayesian probit\nregression, Gaussian process classification, and Cox processes.\nPerforming VI via upper bound minimization, and hence enabling overdispersed posterior approximations,\nsandwich estimation, and model selection, comes with a cost. Exponentiating the original\nCUBO bound leads to high variance during optimization even with reparameterization gradients.\nDeveloping variance reduction schemes for these types of objectives (expectations of likelihood\nratios) is an open research problem; solutions will benefit this paper and related approaches.\n\nFigure 2: Basketball players\u2019 shooting profiles as inferred by BBVI [12], CHIVI (this paper), and\nHamiltonian Monte Carlo (HMC). The top row displays the raw data, consisting of made shots\n(green) and missed shots (red). The second and third rows display the posterior intensities inferred\nby BBVI, CHIVI, and HMC for Stephen Curry and Demarcus Cousins respectively. Both BBVI and\nCHIVI capture the shooting behavior of both players in terms of the posterior mean. The last two\nrows display the posterior uncertainty inferred by BBVI, CHIVI, and HMC for Stephen Curry and\nDemarcus Cousins respectively. CHIVI tends to get higher posterior uncertainty for both players in\nareas where data is scarce compared to BBVI. This illustrates the variance underestimation problem\nof KLVI, which is not the case for CHIVI. More player profiles with posterior mean and uncertainty\nestimates can be found in the appendix.\n\n[Figure 2 panels: Curry Shot Chart; Demarcus Shot Chart; Curry Posterior Intensity (KLQP, Chi, HMC); Demarcus Posterior Intensity (KLQP, Chi, HMC); Curry Posterior Uncertainty (KLQP, Chi, HMC); Demarcus Posterior Uncertainty (KLQP, Chi, HMC).]\n\nAcknowledgments\n\nWe thank Alp Kucukelbir, Francisco J. R. Ruiz, Christian A. Naesseth, Scott W. Linderman,\nMaja Rudolph, and Jaan Altosaar for their insightful comments. This work is supported by NSF\nIIS-1247664, ONR N00014-11-1-0651, DARPA PPAML FA8750-14-2-0009, DARPA SIMPLEX\nN66001-15-C-4032, the Alfred P. Sloan Foundation, and the John Simon Guggenheim Foundation.\n\nReferences\n[1] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, 2004.\n[2] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 37:183\u2013233, 1999.\n[3] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. JMLR, 2013.\n[4] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.\n[5] C. M. Bishop. Pattern recognition. 
Machine Learning, 128, 2006.\n[6] J. Hensman, M. Zwie\u00dfele, and N. D. Lawrence. Tilted variational Bayes. JMLR, 2014.\n[7] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.\n[8] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process\n\nclassi\ufb01cation. JMLR, 6:1679\u20131704, 2005.\n\n[9] M. J. Beal. Variational algorithms for approximate Bayesian inference. University of London,\n\n2003.\n\n[10] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415\u2013447, 1992.\n[11] R. B. Grosse, Z. Ghahramani, and R. P. Adams. Sandwiching the marginal likelihood using\n\nbidirectional monte carlo. arXiv preprint arXiv:1511.02543, 2015.\n\n[12] R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, 2014.\n[13] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.\n[14] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate\n\nInference in Deep Generative Models. In ICML, 2014.\n\n[15] M. Opper and O. Winther. Gaussian processes for classi\ufb01cation: Mean-\ufb01eld algorithms. Neural\n\nComputation, 12(11):2655\u20132684, 2000.\n\n[16] Andrew Gelman, Aki Vehtari, Pasi Jyl\u00e4nki, Tuomas Sivula, Dustin Tran, Swupnil Sahai, Paul\nBlomstedt, John P Cunningham, David Schiminovich, and Christian Robert. Expectation\npropagation as a way of life: A framework for Bayesian inference on partitioned data. arXiv\npreprint arXiv:1412.4869, 2017.\n\n[17] Y. W. Teh, L. Hasenclever, T. Lienart, S. Vollmer, S. Webb, B. Lakshminarayanan, and C. Blun-\ndell. Distributed Bayesian learning with stochastic natural-gradient expectation propagation\nand the posterior server. arXiv preprint arXiv:1512.09327, 2015.\n\n[18] Y. Li, J. M. Hern\u00e1ndez-Lobato, and R. E. Turner. Stochastic Expectation Propagation. In NIPS,\n\n2015.\n\n[19] T. Minka. Power EP. Technical report, Microsoft Research, 2004.\n[20] J. M. 
Hern\u00e1ndez-Lobato, Y. Li, D. Hern\u00e1ndez-Lobato, T. Bui, and R. E. Turner. Black-box \u03b1-divergence minimization. ICML, 2016.\n[21] Y. Li and R. E. Turner. Variational inference with R\u00e9nyi divergence. In NIPS, 2016.\n[22] Rajesh Ranganath, Jaan Altosaar, Dustin Tran, and David M. Blei. Operator variational inference. In NIPS, 2016.\n[23] Volodymyr Kuleshov and Stefano Ermon. Neural variational inference and learning in undirected graphical models. In NIPS, 2017.\n[24] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183\u2013233, 1999.\n[25] T. Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.\n[26] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. In International Conference on Learning Representations, 2016.\n[27] Matthew D Hoffman. Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo. In International Conference on Machine Learning, 2017.\n[28] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge Univ. Press, 2003.\n[29] A. E. Raftery. Bayesian model selection in social research. Sociological methodology, 25:111\u2013164, 1995.\n[30] J. Paisley, D. Blei, and M. Jordan. Variational Bayesian inference with stochastic search. In ICML, 2012.\n[31] G. Dehaene and S. Barthelm\u00e9. Expectation propagation in the large-data limit. In NIPS, 2015.\n[32] Peter Sunehag, Jochen Trumpf, SVN Vishwanathan, Nicol N Schraudolph, et al. Variable metric stochastic approximation theory. In AISTATS, pages 560\u2013566, 2009.\n[33] Dustin Tran, Alp Kucukelbir, Adji B Dieng, Maja Rudolph, Dawen Liang, and David M Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.\n[34] H. Kim and Z. Ghahramani. 
The EM-EP algorithm for Gaussian process classification. In Proceedings of the Workshop on Probabilistic Graphical Models for Classification (ECML), pages 37\u201348, 2003.\n[35] A. Miller, L. Bornn, R. Adams, and K. Goldsberry. Factorized point process intensities: A spatial analysis of professional basketball. In ICML, 2014.\n", "award": [], "sourceid": 1548, "authors": [{"given_name": "Adji Bousso", "family_name": "Dieng", "institution": "Columbia University"}, {"given_name": "Dustin", "family_name": "Tran", "institution": "Columbia University & OpenAI"}, {"given_name": "Rajesh", "family_name": "Ranganath", "institution": "Princeton University"}, {"given_name": "John", "family_name": "Paisley", "institution": null}, {"given_name": "David", "family_name": "Blei", "institution": "Columbia University"}]}