{"title": "Divide and Couple: Using Monte Carlo Variational Objectives for Posterior Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 339, "page_last": 349, "abstract": "Recent work in variational inference (VI) has used ideas from Monte Carlo estimation to obtain tighter lower bounds on the log-likelihood to be used as objectives for VI. However, there is not a systematic understanding of how optimizing different objectives relates to approximating the posterior distribution. Developing such a connection is important if the ideas are to be applied to inference\u2014i.e., applications that require an approximate posterior and not just an approximation of the log-likelihood. Given a VI objective defined by a Monte Carlo estimator of the likelihood, we use a \"divide and couple\" procedure to identify augmented proposal and target distributions so that the gap between the VI objective and the log-likelihood is equal to the divergence between these distributions. Thus, after maximizing the VI objective, the augmented variational distribution may be used to approximate the posterior distribution.", "full_text": "Divide and Couple: Using Monte Carlo Variational\n\nObjectives for Posterior Approximation\n\nJustin Domke1 and Daniel Sheldon1,2\n\n1 College of Information and Computer Sciences, University of Massachusetts Amherst\n\n2 Department of Computer Science, Mount Holyoke College\n\nAbstract\n\nRecent work in variational inference (VI) uses ideas from Monte Carlo estimation to\ntighten the lower bounds on the log-likelihood that are used as objectives. However,\nthere is no systematic understanding of how optimizing different objectives relates\nto approximating the posterior distribution. Developing such a connection is\nimportant if the ideas are to be applied to inference\u2014i.e., applications that require\nan approximate posterior and not just an approximation of the log-likelihood. 
Given a VI objective defined by a Monte Carlo estimator of the likelihood, we use a "divide and couple" procedure to identify augmented proposal and target distributions. The divergence between these is equal to the gap between the VI objective and the log-likelihood. Thus, after maximizing the VI objective, the augmented variational distribution may be used to approximate the posterior distribution.

1 Introduction

Variational inference (VI) is a leading approximate inference method in which a posterior distribution p(z|x) is approximated by a simpler distribution q(z) from some approximating family. The procedure to select q is based on the decomposition [29]

log p(x) = E_{q(z)} log [p(z, x) / q(z)] + KL[q(z) ‖ p(z|x)].   (1)

The first term is the evidence lower bound (ELBO) [4]. Selecting q to maximize the ELBO tightens the lower bound on log p(x) and simultaneously minimizes the KL-divergence in the second term. This dual view is important because minimizing the KL-divergence justifies using q to approximate the posterior for making predictions.

Recent work has investigated tighter objectives [6, 20, 18, 17, 23, 25], based on the following principle: Let R be an estimator of the likelihood, i.e., a nonnegative random variable with E R = p(x). By Jensen's inequality, log p(x) ≥ E log R, so log R is a stochastic lower bound on the log-likelihood. Parameters of the estimator can be optimized to tighten the bound. Standard VI is the case where R = p(z, x)/q(z) and z ∼ q, which is parameterized in terms of q. Importance-weighted autoencoders [IWAEs; 6] essentially use R = (1/M) Σ_{m=1}^M p(z_m, x)/q(z_m), where z_1, …, z_M ∼ q are i.i.d. Sequential Monte Carlo (SMC) also gives a variational objective [20, 18, 17]. The principle underlying these works is that likelihood estimators that are more concentrated lead to tighter bounds, because the gap in Jensen's inequality is smaller.
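To make the Jensen-gap argument concrete, the following sketch (our own illustration, not code from the paper) compares the standard ELBO (M = 1) with an IWAE-style bound (M = 10) on a conjugate Gaussian toy model where log p(x) is available in closed form; the model and all names are illustrative assumptions:

```python
import numpy as np

# Toy model (an assumption for illustration): p(z) = N(0, 1), p(x|z) = N(z, 1),
# observed x = 2. Then p(x) = N(x; 0, 2) exactly.
x = 2.0
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / (2 * 2.0)

rng = np.random.default_rng(0)

def log_weight(z):
    # log [p(z, x) / q(z)] with proposal q = N(0, 1), equal to the prior here,
    # so the weight reduces to the likelihood p(x|z)
    return -0.5 * np.log(2 * np.pi) - (x - z) ** 2 / 2

def bound(M, n_batches=20000):
    # Monte Carlo estimate of E log R for R = (1/M) sum_m p(z_m, x) / q(z_m)
    z = rng.normal(0.0, 1.0, size=(n_batches, M))
    log_R = np.logaddexp.reduce(log_weight(z), axis=1) - np.log(M)
    return log_R.mean()

elbo = bound(1)    # standard ELBO; its exact value is log p(x) - KL[q || p(z|x)]
iwae = bound(10)   # averaging 10 weights concentrates R and tightens the bound
```

Both estimates lie below the exact log p(x), and the M = 10 bound is tighter, as the concentration argument above predicts.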
To date, the main application has been for learning parameters of the generative model p.

Our key question is: what are the implications of modified variational objectives for probabilistic inference? Eq. (1) relates the standard ELBO to KL[q ‖ p], which justifies using q for posterior inference. If we optimize a variational objective obtained from a different estimator, does this still correspond to minimizing some KL-divergence? It has been shown [10, 20, 11] that maximizing the IWAE objective corresponds to minimizing (an upper bound on) KL[q_IS(z) ‖ p(z|x)], where q_IS is a version of q that is "corrected" toward p using importance sampling; this justifies using q_IS to approximate the posterior. Naesseth et al. [20] also show that performing VI with an SMC objective can be seen as minimizing (an upper bound on) a divergence from the SMC sampling distribution q_SMC(z) to p(z|x). For an arbitrary estimator, however, there is little understanding.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Left: Naive VI on the running example. Right: A tighter bound using antithetic sampling.

Figure 2: A kernel density approximation of Q(z) for the antithetic estimator in Fig. 1.

We establish a deeper connection between variational objectives and approximating families. Given a non-negative Monte Carlo (MC) estimator R such that E R = p(x), we show how to find a distribution Q(z) such that the divergence between Q(z) and p(z|x) is at most the gap between E log R and log p(x). Thus, better estimators mean better posterior approximations. The approximate posterior Q(z) can be found by a two-step "divide and couple" procedure.
The "divide" step follows Le et al. [17] and connects maximizing E log R to minimizing a divergence between two distributions, but not necessarily involving p(z|x). The "couple" step shows how to find Q(z) such that this divergence is an upper bound on KL[Q(z) ‖ p(z|x)]. We show how a range of ideas from the statistical literature, such as antithetic sampling, stratified sampling, and quasi Monte Carlo [24], can produce novel variational objectives; then, using the divide and couple framework, we describe efficient-to-sample approximate posteriors Q(z) for each of these objectives. We contribute mathematical tools for deriving new estimators and approximating distributions within this framework. Experiments show that the novel objectives enabled by this framework can lead to improved likelihood bounds and posterior approximations.

There is a large body of work using MC techniques to reduce the variance of gradient estimators of the standard variational objective [26, 5, 19, 31, 13]. The aims of this paper are different: we use MC techniques to change R so as to get a tighter objective.

2 Setup and Motivation

Imagine we have some distribution p(z, x). After observing data x, we wish to approximate the posterior p(z|x). Traditional VI tries to both bound log p(x) and approximate p(z|x) using the "ELBO decomposition" of Eq. (1). We already observed that a similar lower bound can be obtained from any non-negative random variable R with E R = p(x), since by Jensen's inequality,

log p(x) ≥ E log R.

Traditional VI can be seen as defining R = p(z, x)/q(z) for z ∼ q and then optimizing the parameters of q to maximize E log R. Many other estimators R of p(x) can be designed and their parameters optimized to make E log R as large as possible. We want to know: what relationship does this have to approximating the posterior p(z|x)?

2.1 Example

Fig.
1 shows a one-dimensional target distribution p(z, x) as a function of z, and the Gaussian q(z) obtained by standard VI, i.e. maximizing E log R for R = p(z, x)/q(z). The resulting bound is E log R ≈ −0.237, while the true value is log p(x) = 0. By Eq. (1), the KL-divergence from q(z) to p(z|x) is 0.237. Tightening the likelihood bound has made q close to p.

A Gaussian cannot exactly represent the main mode of p(z, x), since it is asymmetric. Antithetic sampling [24] can exploit this. Define

R′ = (1/2) (p(z, x) + p(T(z), x)) / q(z),   z ∼ q,   (2)

where T(z) = μ − (z − μ) is z "reflected" around the mean μ of q. This is a valid estimator since q(z) is constant under reflection. Tightening this bound over Gaussians q gives E log R′ ≈ −0.060. This is better, intuitively, since the right half of q is a good match to the main mode of p, i.e. since q(z) ≈ (1/2) p(z, x) for z in that region.

What about p(z|x)? It is not true that antithetic sampling gives a q(z) with lower divergence (it is around 7.34). After all, naive VI already found the optimal Gaussian. Is there some other distribution that is close to p(z|x)? How can we find it? These questions motivate this paper.

2.2 Notation and Conventions

We use sans-serif font for random variables. All estimators R may depend on the input x, but we suppress this for simplicity. Similarly, a(z|ω) may depend on x. Proofs for all results are in the supplement. Objects such as P^MC, Q, p, q are distributions of random variables and will be written like densities: Q(ω), p(z, x). However, the results are more general: the supplement includes a more rigorous version of our main results using probability measures. We write densities with Dirac delta functions. These are not Lebesgue-integrable, but can be interpreted unambiguously as Dirac measures: e.g., a(z|ω) = δ(z − ω) means the conditional distribution of z given ω = ω
is the Dirac measure δ_ω. Throughout, x is fixed, so p(z, x) and P^MC(ω, z, x) are unnormalized distributions over the other variables, and p(z|x) and P^MC(ω, z|x) are the corresponding normalized distributions.

3 The Divide-and-Couple Framework

In this section we identify a correspondence between maximizing a likelihood bound and posterior inference for general non-negative estimators, using a two-step "divide" and then "couple" construction.

3.1 Divide

Let R(ω) be a positive function of ω ∼ Q(ω) such that E_{Q(ω)} R(ω) = p(x), i.e., R is an unbiased likelihood estimator with sampling distribution Q(ω). The "divide" step follows [17, Claim 1]: we can interpret E_{Q(ω)} log R(ω) as an ELBO by defining P^MC so that R(ω) = P^MC(ω, x)/Q(ω). That is, P^MC and Q "divide" to produce R. Specifically:

Lemma 1. Let ω be a random variable with distribution Q(ω) and let R(ω) be a positive estimator such that E_{Q(ω)} R(ω) = p(x). Then

P^MC(ω, x) = Q(ω) R(ω)

is an unnormalized distribution over ω with normalization constant p(x), and R(ω) = P^MC(ω, x)/Q(ω) for Q(ω) > 0. Furthermore, as defined above,

log p(x) = E_{Q(ω)} log R(ω) + KL[Q(ω) ‖ P^MC(ω|x)].   (3)

While this shows that it is easy to connect a stochastic likelihood bound to minimizing a KL-divergence, this construction alone is not useful for probabilistic inference, since neither Q(ω) nor P^MC(ω, x) makes any reference to z. Put another way: even if KL[Q(ω) ‖ P^MC(ω|x)] is small, so what? This motivates the coupling step below. More generally, P^MC is defined by letting R = dP^MC/dQ be the Radon-Nikodym derivative, a change of measure from Q to P^MC; see supplement.

3.2 Couple

If the distributions identified in the above lemma are going to be useful for approximating p(z|x), they must be connected somehow to z. In this section, we suggest coupling P^MC(ω, x) and p(z, x) into some new distribution P^MC(ω, z, x) with P^MC(z, x) = p(z, x).
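Lemma 1 can be checked mechanically when ω ranges over a finite set, since every expectation becomes a finite sum. The sketch below is our own illustration (not the paper's code); it builds an arbitrary positive estimator on a finite ω-space and verifies the decomposition of Eq. (3) by direct summation:

```python
import numpy as np

rng = np.random.default_rng(1)

# A finite omega-space with an arbitrary sampling distribution Q(omega)
K = 6
Q = rng.dirichlet(np.ones(K))          # Q(omega), strictly positive
R = rng.uniform(0.5, 2.0, size=K)      # a positive estimator R(omega)

# The implied unnormalized target P_MC(omega, x) = Q(omega) R(omega),
# whose normalizer is p(x) = E_Q R(omega)
P_unnorm = Q * R
p_x = P_unnorm.sum()
P_cond = P_unnorm / p_x                # P_MC(omega | x)

# Both sides of Eq. (3): log p(x) = E_Q log R + KL[Q || P_MC(.|x)]
elbo = np.sum(Q * np.log(R))
kl = np.sum(Q * np.log(Q / P_cond))
```

The identity holds for any positive R, which is the point of the lemma: the gap of the stochastic bound is exactly a KL-divergence over ω.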
In practice, it is convenient to describe couplings via a conditional distribution a(z|ω) that augments P^MC(ω, x). There is a straightforward condition for when a is valid: for the augmented distribution P^MC(z, ω, x) = P^MC(ω, x) a(z|ω) to be a valid coupling, we require that ∫ P^MC(ω, x) a(z|ω) dω = p(z, x). An equivalent statement of this requirement is as follows.

Definition. An estimator R(ω) and a distribution a(z|ω) are a valid estimator-coupling pair under distribution Q(ω) if

E_{Q(ω)} R(ω) a(z|ω) = p(z, x).   (4)

The definition implies that E_{Q(ω)} R(ω) = p(x), as may be seen by integrating out z from both sides. That is, the definition implies that R is an unbiased estimator for p(x).

Now, suppose we have a valid estimator-coupling pair and that R is a good (low-variance) estimator. How does this help us to approximate the posterior p(z|x)? The following theorem gives the "divide and couple" framework.¹

Theorem 2. Suppose that R(ω) and a(z|ω) are a valid estimator-coupling pair under Q(ω). Then

Q(z, ω) = Q(ω) a(z|ω),   (5)
P^MC(z, ω, x) = Q(ω) R(ω) a(z|ω),   (6)

are valid distributions, P^MC(z, x) = p(z, x), and

log p(x) = E_{Q(ω)} log R(ω) + KL[Q(z) ‖ p(z|x)] + KL[Q(ω|z) ‖ P^MC(ω|z, x)].   (7)

The final term in Eq. (7) is a conditional divergence [9, Sec. 2.5]. This is the divergence over ω between Q(ω|z) and P^MC(ω|z, x), averaged over z drawn from Q(z). At a high level, this theorem is proved by applying Lem. 1, using the chain rule of KL-divergence, and simplifying using the fact that a is a valid coupling. The point of this theorem is: if R is a good estimator, then E log R will be close to log p(x). Since KL-divergences are non-negative, this means that the marginal Q(z) must be close to the target p(z|x). A coupling gives us a way to transform Q(ω)
so as to approximate p(z|x).

To be useful, we need to access Q(z), usually by sampling. We assume it is possible to sample from Q(ω), since this is part of the estimator. The user must supply a routine to sample from a(z|ω). This is why the "trivial coupling" a(z|ω) = p(z|x) is not helpful: if the user could sample from p(z|x), the inference problem would already be solved! Some estimators may be pre-equipped with a method to approximate p(z|x). For example, in SMC, ω includes particles z_i and weights w_i such that selecting a particle in proportion to its weight approximates p(z|x). This can be seen as a coupling a(z|ω), which provides an alternate interpretation of the divergence bounds of [20]. The divide and couple framework is also closely related to extended-space MCMC methods [2], which also use extended target distributions that admit p(z, x) as a marginal; see also Sec. 6. However, in these methods, the estimators also seem to come with "obvious" couplings. There is no systematic understanding of how to derive couplings for general estimators.

3.3 Example

Consider again the antithetic estimator from Eq. (2). We saw before that the antithetic estimator gives a tighter variational bound than naive VI under the distribution in Fig. 1. However, that distribution is less similar to the target than the one from naive VI. To reflect this (and match our general notation), we now² write ω ∼ Q (instead of z ∼ q).

Now, we again ask: since Q(ω) is a poor approximation of the target, can we find some other distribution that is a good approximation? Consider the coupling distribution

a(z|ω) = π(ω) δ(z − ω) + (1 − π(ω)) δ(z − T(ω)),   π(ω) = p(ω, x) / (p(ω, x) + p(T(ω), x)).

Intuitively, a(z|ω) is supported only on z = ω and on z = T(ω), with probability proportional to the target distribution.
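The same construction can be checked exactly in a discretized analogue of this example. The sketch below (our own toy construction, with z and ω on a finite grid and T a reflection that leaves a uniform Q invariant) verifies the validity condition of Eq. (4) by direct summation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete toy: z and omega live on a grid {0, ..., K-1}
K = 8
p_joint = rng.uniform(0.2, 1.0, size=K)    # unnormalized p(z, x) on the grid
Q = np.full(K, 1.0 / K)                    # uniform Q(omega), invariant under T
T = (K - 1) - np.arange(K)                 # reflection T(omega); Q(T(omega)) = Q(omega)

# Antithetic estimator, Eq. (2) with omega in place of z
R = 0.5 * (p_joint + p_joint[T]) / Q

# Coupling: z = omega with probability pi(omega), else z = T(omega)
pi = p_joint / (p_joint + p_joint[T])
a = np.zeros((K, K))                       # a[omega, z]
a[np.arange(K), np.arange(K)] = pi
a[np.arange(K), T] += 1.0 - pi

# Eq. (4): sum_omega Q(omega) R(omega) a(z|omega) should equal p(z, x)
recovered = (Q[:, None] * R[:, None] * a).sum(axis=0)
```

Each grid point z receives mass p(z, x)/2 from ω = z and p(z, x)/2 from ω = T(z), which is the same cancellation that makes the continuous coupling valid.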
It is simple to verify (by substitution; see Claim 5 in the supplement) that R and a form a valid estimator-coupling pair. Thus, the augmented variational distribution is Q(ω, z) = Q(ω) a(z|ω). To sample from this, draw ω ∼ Q and select z = ω with probability π(ω), or z = T(ω) otherwise. The marginal Q(z) is shown in Fig. 2. This is a much better approximation of the target than naive VI.

¹ With measures instead of densities we would write Q(z ∈ B, ω ∈ A) = ∫_A a(z ∈ B|ω) dQ(ω), where a is a Markov kernel; see supplement.

² With this notation, Eq. (2) becomes R = (1/2) (p(ω, x) + p(T(ω), x)) / Q(ω), where ω ∼ Q. Here and in the rest of this paper, when p(·, ·) has two arguments, the first always plays the role of z.

Table 1: Variance reduction methods jointly transform estimators and couplings. Take an estimator R0(ω) with coupling a0(z|ω), valid under Q0(ω). Each line shows a new estimator R(·) and coupling a(z|·). The method to simulate Q(·) is described in the left column. Here, F^{-1} is a mapping so that if ω is uniform on [0, 1]^d, then F^{-1}(ω) has density Q0(ω). The methods (left column) are:

IID Mean: ω1, …, ωM ∼ Q0 i.i.d.
Stratified Sampling: Ω1, …, ΩM partition Ω; ωm,1, …, ωm,Nm ∼ Q0 restricted to Ωm; μm = Q0(ω ∈ Ωm).
Antithetic Sampling: ω ∼ Q0; for all m, Tm(ω) has the same distribution as ω.
Randomized Quasi Monte Carlo: ω ∼ Unif([0, 1]^d); fixed points ω̄1, …, ω̄M; Tm(ω) = F^{-1}(ω̄m + ω (mod 1)).
Latin Hypercube Sampling: ω1, …, ωM jointly sampled from a Latin hypercube [24, Ch.
10.3]; T = F^{-1}. The corresponding estimators R(·) and couplings a(z|·) (right columns) are:

IID Mean: R = (1/M) Σ_{m=1}^M R0(ωm);   a(z|·) = Σ_m R0(ωm) a0(z|ωm) / Σ_m R0(ωm).
Stratified Sampling: R = Σ_{m=1}^M (μm/Nm) Σ_{n=1}^{Nm} R0(ωm,n);   a(z|·) = Σ_m (μm/Nm) Σ_n R0(ωm,n) a0(z|ωm,n) / Σ_m (μm/Nm) Σ_n R0(ωm,n).
Antithetic Sampling: R = (1/M) Σ_{m=1}^M R0(Tm(ω));   a(z|·) = Σ_m R0(Tm(ω)) a0(z|Tm(ω)) / Σ_m R0(Tm(ω)).
Randomized Quasi Monte Carlo: R = (1/M) Σ_{m=1}^M R0(Tm(ω));   a(z|·) = Σ_m R0(Tm(ω)) a0(z|Tm(ω)) / Σ_m R0(Tm(ω)).
Latin Hypercube Sampling: R = (1/M) Σ_{m=1}^M R0(T(ωm));   a(z|·) = Σ_m R0(T(ωm)) a0(z|T(ωm)) / Σ_m R0(T(ωm)).

4 Deriving Couplings

Thm. 2 says that if E_{Q(ω)} log R(ω) is close to log p(x) and you have a tractable coupling a(z|ω), then drawing ω ∼ Q(ω) and then z ∼ a(z|ω) yields samples from a distribution Q(z) close to p(z|x). But how can we find a tractable coupling?

Monte Carlo estimators are often created recursively using techniques that take some valid estimator R and transform it into a new valid estimator R′. These techniques (e.g. change of measure, Rao-Blackwellization, stratified sampling) are intended to reduce variance. Part of the power of Monte Carlo methods is that these techniques can easily be combined. In this section, we extend some of these techniques to transform valid estimator-coupling pairs into new valid estimator-coupling pairs. The hope is that the standard toolbox of variance reduction techniques can be applied as usual, and the coupling is derived "automatically".

Table 1 shows corresponding transformations of estimators and couplings for several standard variance reduction techniques. In the rest of this section, we give two abstract tools that can be used to create all the entries in this table. For concreteness, we begin with a trivial "base" estimator-coupling pair. Take a distribution Q0(ω) and let R0(ω) = p(ω, x)/Q0(ω) and a0(z|ω) = δ(z − ω) (the deterministic coupling). It is easy to check that these satisfy Eq.
(4).

4.1 Abstract Transformations of Estimators and Couplings

Our first abstract tool transforms an estimator-coupling pair on some space Ω into another estimator-coupling pair on the space Ω^M × {1, …, M}. This can be thought of as having M "replicates" of the ω in the original estimator, along with an extra integer-valued variable that selects one of them. We emphasize that this result does not (by itself) reduce variance; in fact, R has exactly the same distribution as R0.

Theorem 3. Suppose that R0(ω) and a0(z|ω) are a valid estimator-coupling pair under Q0(ω). Let Q(ω1, …, ωM, m) be any distribution such that if (ω1, …, ωM, m) ∼ Q, then ωm ∼ Q0. Then

R(ω1, …, ωM, m) = R0(ωm),   (8)
a(z|ω1, …, ωM, m) = a0(z|ωm),   (9)

are a valid estimator-coupling pair under Q(ω1, …, ωM, m).

Rao-Blackwellization is a well-known way to transform an estimator to reduce variance; we want to know how it affects couplings. Take an estimator R0(ω, ν) with state space Ω × N and distribution Q0(ω, ν).

Figure 3: Stratified sampling and antithetic within stratified sampling on the running example.

Figure 4: A kernel density approximation of Q(z|x) for the estimators in Fig. 3.

A new estimator R(ω) = E_{Q0(ν|ω)} R0(ω, ν) that analytically marginalizes out ν has the same expectation and equal or lesser variance, by the Rao-Blackwell theorem. The following result shows that if R0 had a coupling, then it is easy to define a new coupling for R.

Theorem 4. Suppose that R0(ω, ν) and a0(z|ω, ν) are a valid estimator-coupling pair under Q0(ω, ν). Then

R(ω) = E_{Q0(ν|ω)} R0(ω, ν),   a(z|ω) = (1/R(ω)) E_{Q0(ν|ω)} [R0(ω, ν) a0(z|ω, ν)],

are a valid estimator-coupling pair under Q(ω)
= ∫ Q0(ω, ν) dν.

4.2 Specific Variance Reduction Techniques

Each of the techniques in Table 1 can be derived by first applying Thm. 3 and then Thm. 4. As a simple example, consider the IID mean. Suppose R0(ω) and a0(z|ω) are valid under Q0. If we let ω1, …, ωM ∼ Q0 i.i.d. and m be uniform on {1, …, M}, then this satisfies the condition of Thm. 3 that ωm ∼ Q0. Thus we can define R and a as in Eq. (8) and Eq. (9). Applying Rao-Blackwellization to marginalize out m gives exactly the form for R and a shown in the table. Details are in Sec. 9 of the supplement.

As another example, take stratified sampling. For simplicity, we assume one sample in each stratum (Nm = 1). Suppose Ω1, …, ΩM partition the state space, let ωm ∼ Q0(ω|ω ∈ Ωm), and let m equal m with probability μm = Q0(ω ∈ Ωm). It is again the case that ωm ∼ Q0, and applying Thm. 3 and then Thm. 4 produces the estimator-coupling pair shown in the table. Again, details are in Sec. 9 of the supplement.

4.3 Example

We return to the example from Sec. 2.1 and Sec. 3.3. Fig. 3 shows the result of applying stratified sampling to the standard VI estimator R = p(z, x)/q(z) and then adjusting the parameters of q to tighten the bound. The bound is tighter than standard VI and slightly worse than antithetic sampling.

Why not combine antithetic and stratified sampling? Fig. 3 shows the result of applying antithetic sampling inside of stratified sampling. Specifically, the estimator R(ωm) for each stratum m is replaced by (1/2)(R(ωm) + R(Tm(ωm))), where Tm is a reflection inside the stratum that leaves the density invariant. A fairly tight bound results. For all of antithetic sampling (Fig. 2), stratified sampling (Fig. 3) and antithetic within stratified sampling (Fig.
3), tightening E log R finds Q(ω) such that all batches place some density on z in the main mode of p. Thus, the better sampling methods permit a q with some coverage of the left mode of p while precluding the possibility that all samples in a batch are simultaneously in a low-density region (which would result in R near zero, and thus a very low value for E log R). What do these estimators say about p(z|x)? Fig. 4 compares the resulting Q(z) for each estimator; the similarity to p(z|x) correlates with the likelihood bound.

Figure 5: Different sampling methods applied to Gaussian VI. Top row: Different methods to sample from the unit cube. Middle row: these samples transformed using the "Cartesian" mapping. Bottom row: Same samples transformed using the "Elliptical" mapping.

5 Implementation and Empirical Study

Our results are easy to put into practice, e.g. for variational inference with Gaussian approximating distributions and the reparameterization trick to estimate gradients. To illustrate this, we show a simple but general approach. As shown in Fig. 5, the idea is to start with a batch of samples ω1, …, ωM generated from the unit hypercube. Different sampling strategies can give more uniform coverage of the cube than i.i.d. sampling. After transformation, one obtains samples z1, …, zM that have more uniform coverage of the Gaussian. This better coverage often manifests as a lower-variance estimator R. Our coupling framework gives a corresponding approximate posterior Q(z).

Formally, take any distribution Q(ω1, …, ωM) such that each marginal Q(ωm) is uniform over the unit cube (but the different ωm may be dependent). As shown in Fig. 5, there are various ways to generate ω1, …, ωM and to map them to samples z1, …, zM from a Gaussian q(zm). Then, Fig.
6 gives algorithms to generate an estimator R and to generate z from a distribution Q(z) corresponding to a valid coupling. We use mappings ω → u = F^{-1}(ω) → z = Tθ(u), where tθ = Tθ ∘ F^{-1} maps ω ∼ Unif([0, 1]^d) to tθ(ω) ∼ qθ for some density qθ. The idea is to implement variance reduction to sample (batches of) ω, use F^{-1} to map ω to a "standard" distribution (typically in the same family as qθ), and then use Tθ to map samples from the standard distribution to samples from qθ.

The algorithms are again derived from Thm. 3 and Thm. 4. Define Q0(ω) uniform on [0, 1]^d, R0(ω) = p(tθ(ω), x)/qθ(tθ(ω)), and a0(z|ω) = δ(z − tθ(ω)). These define a valid estimator-coupling pair. Let Q(ω1, …, ωM) be as described (uniform marginals) and m uniform on {1, …, M}. Then Q(ω1, …, ωM, m) satisfies the assumptions of Thm. 3, so we can use that theorem and then Thm. 4 to Rao-Blackwellize out m. This produces the estimator-coupling pair in Fig. 6.

Algorithm (Generate R)
• Generate ω1, …, ωM from any distribution where ωm is marginally uniform over [0, 1]^d.
• Map to a standard dist. as um = F^{-1}(ωm).
• Map to qθ as zm = Tθ(um).
• Return R = (1/M) Σ_{m=1}^M p(zm, x)/qθ(zm).

Algorithm (Sample from Q(z))
• Generate z1, …, zM as on the left.
• For all m, compute weight wm = p(zm, x)/qθ(zm).
• Select m with probability wm / Σ_{m′=1}^M wm′.
• Return zm.

Figure 6: Generic methods to sample R (left) and Q(z) (right). Here, Q(ω1, …, ωM) is any distribution where the marginals Q(ωm) are uniform over the unit hypercube.

The value of this approach is the many off-the-shelf methods to generate "batches" of samples (ω1, …, ωM) that have good "coverage" of the unit cube.
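A minimal runnable version of the two routines in Fig. 6, under illustrative assumptions: a one-dimensional conjugate Gaussian target (so p(x) is known exactly), the Cartesian map for F^{-1}, and an i.i.d. batch standing in for Q(ω1, …, ωM); any of the variance reduction schemes from Table 1 could replace the i.i.d. draw:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
inv_cdf = NormalDist().inv_cdf   # Gaussian inverse CDF (the Cartesian F^{-1})

# Illustrative target: p(z) = N(0, 1), p(x|z) = N(z, 1), observed x = 2,
# so p(x) = N(2; 0, 2) and the posterior is N(1, 0.5)
x = 2.0

def log_p_joint(z):
    return -np.log(2 * np.pi) - z ** 2 / 2 - (x - z) ** 2 / 2

# q_theta(z) = N(mu, C^2) via T_theta(u) = C * u + mu
C, mu = 0.9, 0.8

def log_q(z):
    return -0.5 * np.log(2 * np.pi * C ** 2) - (z - mu) ** 2 / (2 * C ** 2)

def generate_batch(M=16):
    # omega -> u = F^{-1}(omega) -> z = T_theta(u), as in Fig. 6
    omega = rng.uniform(size=M)        # i.i.d. stand-in for Q(omega_1, ..., omega_M)
    u = np.array([inv_cdf(o) for o in omega])
    return C * u + mu

def generate_R(M=16):
    z = generate_batch(M)
    return np.mean(np.exp(log_p_joint(z) - log_q(z)))

def sample_Q(M=16):
    # the coupling: select z_m with probability w_m / sum of the batch weights
    z = generate_batch(M)
    w = np.exp(log_p_joint(z) - log_q(z))
    return z[rng.choice(M, p=w / w.sum())]
```

Averaging many draws of generate_R recovers p(x) (unbiasedness), while repeated calls to sample_Q approximate the posterior.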
This manifests as coverage of qθ after the samples are mapped. Fig. 5 shows examples of this with multivariate Gaussians. As shown, there may be multiple mappings F^{-1}. These manifest as different coverage of qθ, so the choice of mapping influences the quality of the estimator. We consider two examples. The "Cartesian" mapping F^{-1}_N(ω) simply applies the inverse CDF of the standard Gaussian in each component. An "elliptical" mapping, meanwhile, uses the "elliptical" reparameterization of the Gaussian [11]: if r ∼ χ_d and v is uniform over the unit sphere, then r·v ∼ N(0, I). In Fig. 5 we generate r and v from the uniform distribution as r = F^{-1}_{χ_d}(ω1) and v = (cos(2πω2), sin(2πω2)), and then set F^{-1}(ω) = r·v. In higher dimensions, it is easier to generate samples from the unit sphere using redundant dimensions. Thus, we use ω ∈ R^{d+1} and map the first component to r, again using the inverse CDF F^{-1}_{χ_d}. The other components are mapped to the unit sphere by first applying the Gaussian inverse CDF in each component, then normalizing.

In the experiments, we use a multivariate Gaussian qθ with parameters θ = (C, μ). The mapping is Tθ(u) = Cu + μ. To ensure a diverse test, we downloaded the corpus of models from the Stan [7] model library [30] (see also Regier et al. [27]) and created an interface for automatic differentiation in Stan to interoperate with automatic differentiation code written in Python. We compare VI in terms of the likelihood bound and in terms of the (squared Frobenius norm) error in the estimated posterior variance.
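For d = 2 the elliptical mapping described above can be written with elementary functions, since χ_2 is the Rayleigh distribution, whose inverse CDF is r = sqrt(−2 log(1 − u)). The sketch below is our own illustration (not the paper's code); it maps the unit square to draws from N(0, I) and checks the first two moments:

```python
import numpy as np

rng = np.random.default_rng(4)

def elliptical_map(omega):
    # Map omega in [0,1]^2 to a sample of N(0, I_2):
    # r ~ chi_2 via the Rayleigh inverse CDF, v uniform on the unit circle,
    # and r * v is then standard bivariate normal.
    r = np.sqrt(-2.0 * np.log1p(-omega[..., 0]))       # chi_2 inverse CDF
    theta = 2.0 * np.pi * omega[..., 1]
    v = np.stack([np.cos(theta), np.sin(theta)], axis=-1)
    return r[..., None] * v

omega = rng.uniform(size=(200_000, 2))
u = elliptical_map(omega)
```

Feeding this map a low-discrepancy or antithetic batch of ω values (instead of i.i.d. uniforms, as here) is what produces the more even Gaussian coverage shown in the bottom row of Fig. 5.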
As a surrogate for the true variance, we computed the empirical variance of 100,000 samples generated via Stan's Hamiltonian Markov chain Monte Carlo (MCMC) method. For tractability, we restrict to the 88 models where profiling indicates MCMC would take at most 10 hours, and evaluating the posterior for 10,000 settings of the latent variables would take at most 2 seconds. It was infeasible to tune stochastic gradient methods for all models. Instead, we used a fixed set of 50,000 batches ω1, …, ωM and optimized the empirical ELBO using BFGS, initialized using Laplace's method. A fresh batch of 500,000 samples was used to compute the final likelihood bound and covariance estimator. Fig. 7 shows example errors for a few models. The supplement contains similar plots for all models, as well as plots aggregating statistics, and a visualization of how the posterior density approximation changes.

6 Conclusions

Recent work has studied the use of improved Monte Carlo estimators for better variational likelihood bounds. The central insight of this paper is that an approximate posterior can be constructed from an estimator using a coupling. This posterior's divergence is bounded by the looseness of the likelihood bound. We suggest a framework of "estimator-coupling" pairs to make this coupling easy to construct for many estimators.

Several recent works have viewed Monte Carlo VI bounds through the lens of augmented VI [3, 10, 20, 11, 16]. These establish connections between particular likelihood estimators and approximate posteriors through extended distributions. They differ from our work primarily in the "direction" of the construction, the generality, or both. Most of this work uses the following reasoning, which starts with an approximate posterior and arrives at a tractable likelihood estimator. Take a Monte Carlo method (e.g.
self-normalized importance sampling) to approximately sample from p(z|x). Call the approximation q(z), but suppose it is not tractable to evaluate q(z). A tractable likelihood estimator can be obtained as R(ω, z) = p(z, x) p(ω|z, x)/q(ω, z), where q(ω, z) is the (tractable) joint density over the "internal randomness" ω of the Monte Carlo procedure and the final sample z, and p(ω|z, x) is a conditional distribution used to extend the target to also contain these variables. Different choices for the Monte Carlo procedure q(ω, z) and the target extension p(ω|z, x) lead to different estimators. To arrive at a particular existing likelihood estimator R requires careful estimator-specific choices and derivations. In contrast, our work proceeds in the opposite direction: we start with an arbitrary estimator R and show (via coupling) how to find a corresponding Monte Carlo procedure q(ω, z). We also provide a set of tools to "automatically" find couplings for many types of estimators.

The idea of using extended state-spaces is common in (Markov chain) Monte Carlo inference methods [12, 2, 21, 22]. These works also identify extended target distributions that admit p(z, x) as a marginal, i.e., a coupling in our terminology. By running a Markov chain Monte Carlo (MCMC) sampler on the extended target and dropping the auxiliary variables, they obtain an MCMC sampler for p(z|x). Our work can be seen as the VI analogue of these MCMC methods. Other recent work [6, 18, 10, 20, 17, 11, 8, 28] has explored the connection between using estimators in variational

Figure 7: Across all models, improvements in likelihood bounds correlate strongly with improvements in posterior accuracy. Better sampling methods can improve both. First row: the common case where a simple Gaussian posterior is already very accurate.
Here, only a tiny improve-\nment in the ELBO is possible, and improvement in the posterior is below the level detectable when\ncomparing to MCMC. The other rows show cases where larger improvements are possible.\n\nbounds and auxiliary variational inference [1]. To the best of our knowledge, all of these works\nconsider situations in which the relevant extended state space (z, !) is known. Thus, in these works,\nthe estimator essentially comes with an \u201cobvious\u201d coupling distribution a(z|!). In contrast, the\ngoal of this paper is to consider an arbitrary estimator R(!), where it is not obvious that a tractable\ncoupling distribution a(z|!) even exists. This is the situation in which our framework of estimator-\ncoupling pairs is likely to be useful. The alternative would be manual construction of extended\nstate-spaces for each individual estimator.\n\nReferences\n[1] Felix V. Agakov and David Barber. An Auxiliary Variational Method. In Neural Information\nProcessing, Lecture Notes in Computer Science, pages 561\u2013566. Springer, Berlin, Heidelberg,\n2004.\n\n[2] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte\n\nCarlo methods. Journal of the Royal Statistical Society: Series B, 72:269\u2013342, 2010.\n\n[3] Philip Bachman and Doina Precup. Training Deep Generative Models: Variations on a Theme.\n\nIn NIPS Workshop: Advances in Approximate Bayesian Inference, 2015.\n\n[4] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational Inference: A Review for\n\nStatisticians. Journal of the American Statistical Association, 112(518):859\u2013877, 2017.\n\n[5] Alexander Buchholz, Florian Wenzel, and Stephan Mandt. Quasi-Monte Carlo Variational\n\nInference. In ICML, 2018.\n\n[6] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. In\n\nICLR, 2015.\n\n9\n\n\f[7] Bob Carpenter, Andrew Gelman, Matthew D. 
Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A Probabilistic Programming Language. Journal of Statistical Software, 76(1), 2017.

[8] Christian Andersson Naesseth. Machine Learning Using Approximate Inference: Variational and Sequential Monte Carlo Methods. PhD thesis, Linköping University, 2018.

[9] T. M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, Hoboken, NJ, 2nd edition, 2006.

[10] Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting Importance-Weighted Autoencoders. arXiv:1704.02916 [stat], 2017.

[11] Justin Domke and Daniel Sheldon. Importance Weighting and Variational Inference. In NeurIPS, 2018.

[12] Axel Finke. On Extended State-Space Constructions for Monte Carlo Methods. PhD thesis, University of Warwick, 2015.

[13] Tomas Geffner and Justin Domke. Using Large Ensembles of Control Variates for Variational Inference. In NeurIPS, 2018.

[14] Robert M. Gray. Entropy and Information Theory. Springer, New York, 2nd edition, 2011.

[15] Achim Klenke. Probability Theory: A Comprehensive Course. Universitext. Springer-Verlag, London, 2nd edition, 2014.

[16] John Lawson, George Tucker, Bo Dai, and Rajesh Ranganath. Energy-Inspired Models: Learning with Sampler-Induced Distributions. In NeurIPS, 2019.

[17] Tuan Anh Le, Maximilian Igl, Tom Rainforth, Tom Jin, and Frank Wood. Auto-Encoding Sequential Monte Carlo. In ICLR, 2018.

[18] Chris J. Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. Filtering Variational Objectives. In NeurIPS, 2017.

[19] Andrew Miller, Nick Foti, Alexander D'Amour, and Ryan P. Adams. Reducing Reparameterization Gradient Variance. In NeurIPS, 2017.

[20] Christian A. Naesseth, Scott W. Linderman, Rajesh Ranganath, and David M. Blei.
Variational Sequential Monte Carlo. In AISTATS, volume 84 of Proceedings of Machine Learning Research, pages 968–977. PMLR, 2018.

[21] Radford M. Neal. Annealed Importance Sampling. arXiv:physics/9803008, 1998.

[22] Radford M. Neal. Hamiltonian Importance Sampling. http://www.cs.toronto.edu/pub/radford/his-talk.pdf, 2005.

[23] Sebastian Nowozin. Debiasing Evidence Approximations: On Importance-Weighted Autoencoders and Jackknife Variational Inference. In ICLR, 2018.

[24] Art Owen. Monte Carlo Theory, Methods and Examples. 2013.

[25] Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter Variational Bounds are Not Necessarily Better. In ICML, 2018.

[26] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black Box Variational Inference. In AISTATS, 2014.

[27] Jeffrey Regier, Michael I. Jordan, and Jon McAuliffe. Fast Black-box Variational Inference through Stochastic Trust-Region Optimization. In NeurIPS, 2017.

[28] Hongyu Ren, Shengjia Zhao, and Stefano Ermon. Adaptive Antithetic Sampling for Variance Reduction. In ICML, pages 5420–5428, 2019.

[29] L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean Field Theory for Sigmoid Belief Networks. Journal of Artificial Intelligence Research, 4:61–76, 1996.

[30] Stan developers. Example Models. https://github.com/stan-dev/example-models, 2018.

[31] Michalis K. Titsias and Miguel Lázaro-Gredilla. Local Expectation Gradients for Black Box Variational Inference. In NeurIPS, 2015.