{"title": "Measuring the reliability of MCMC inference with bidirectional Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 2451, "page_last": 2459, "abstract": "Markov chain Monte Carlo (MCMC) is one of the main workhorses of probabilistic inference, but it is notoriously hard to measure the quality of approximate posterior samples. This challenge is particularly salient in black box inference methods, which can hide details and obscure inference failures. In this work, we extend the recently introduced bidirectional Monte Carlo technique to evaluate MCMC-based posterior inference algorithms. By running annealed importance sampling (AIS) chains both from prior to posterior and vice versa on simulated data, we upper bound in expectation the symmetrized KL divergence between the true posterior distribution and the distribution of approximate samples. We integrate our method into two probabilistic programming languages, WebPPL and Stan, and validate it on several models and datasets. As an example of how our method be used to guide the design of inference algorithms, we apply it to study the effectiveness of different model representations in WebPPL and Stan.", "full_text": "Measuring the reliability of MCMC inference with\n\nbidirectional Monte Carlo\n\nRoger B. Grosse\n\nSiddharth Ancha\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nUniversity of Toronto\n\nDaniel M. Roy\n\nDepartment of Statistics\nUniversity of Toronto\n\nAbstract\n\nMarkov chain Monte Carlo (MCMC) is one of the main workhorses of probabilistic\ninference, but it is notoriously hard to measure the quality of approximate posterior\nsamples. This challenge is particularly salient in black box inference methods,\nwhich can hide details and obscure inference failures. In this work, we extend\nthe recently introduced bidirectional Monte Carlo [GGA15] technique to evaluate\nMCMC-based posterior inference algorithms. 
By running annealed importance sampling (AIS) chains both from prior to posterior and vice versa on simulated data, we upper bound in expectation the symmetrized KL divergence between the true posterior distribution and the distribution of approximate samples. We integrate our method into two probabilistic programming languages, WebPPL [GS] and Stan [CGHL+ p], and validate it on several models and datasets. As an example of how our method can be used to guide the design of inference algorithms, we apply it to study the effectiveness of different model representations in WebPPL and Stan.\n\n1 Introduction\n\nMarkov chain Monte Carlo (MCMC) is one of the most important classes of probabilistic inference methods and underlies a variety of approaches to automatic inference [e.g. LTBS00; GMRB+08; GS; CGHL+ p]. Despite its widespread use, it is still difficult to rigorously validate the effectiveness of an MCMC inference algorithm. There are various heuristics for diagnosing convergence, but reliable quantitative measures are hard to find. This creates difficulties both for end users of automatic inference systems and for experienced researchers who develop models and algorithms.\n\nIn this paper, we extend the recently proposed bidirectional Monte Carlo (BDMC) [GGA15] method to evaluate certain kinds of MCMC-based inference algorithms by bounding the symmetrized KL divergence (Jeffreys divergence) between the distribution of approximate samples and the true posterior distribution. Specifically, our method is applicable to algorithms which can be viewed as importance sampling over an extended state space, such as annealed importance sampling (AIS; [Nea01]) or sequential Monte Carlo (SMC; [MDJ06]). 
BDMC was proposed as a method for accurately estimating the log marginal likelihood (log-ML) on simulated data by sandwiching the true value between stochastic upper and lower bounds which converge in the limit of infinite computation. These log-ML values were used to benchmark marginal likelihood estimators. We show that it can also be used to measure the accuracy of approximate posterior samples obtained from algorithms like AIS or SMC. More precisely, we refine the analysis of [GGA15] to derive an estimator which upper bounds in expectation the Jeffreys divergence between the distribution of approximate samples and the true posterior distribution. We show that this upper bound is quite accurate on some toy distributions for which both the true Jeffreys divergence and the upper bound can be computed exactly. We refer to our method of bounding the Jeffreys divergence by sandwiching the log-ML as Bounding Divergences with REverse Annealing (BREAD).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWhile our method is only directly applicable to certain algorithms such as AIS or SMC, these algorithms involve many of the same design choices as traditional MCMC methods, such as the choice of model representation (e.g. whether to collapse out certain variables), or the choice of MCMC transition operators. Therefore, the ability to evaluate AIS-based inference should also yield insights which inform the design of MCMC inference algorithms more broadly.\n\nOne additional hurdle must be overcome to use BREAD to evaluate posterior inference: the method yields rigorous bounds only for simulated data because it requires an exact posterior sample. One would like to be sure that the results on simulated data accurately reflect the accuracy of posterior inference on the real-world data of interest. We present a protocol for using BREAD to diagnose inference quality on real-world data. 
Speci\ufb01cally, we infer hyperparameters on the real data, simulate\ndata from those hyperparameters, measure inference quality on the simulated data, and validate the\nconsistency of the inference algorithm\u2019s behavior between the real and simulated data. (This protocol\nis somewhat similar in spirit to the parametric bootstrap [ET98].)\nWe integrate BREAD into the tool chains of two probabilistic programming languages: WebPPL\n[GS] and Stan [CGHL+ p]. Both probabilistic programming systems can be used as automatic\ninference software packages, where the user provides a program specifying a joint probabilistic model\nover observed and unobserved quantities. In principle, probabilistic programming has the potential to\nput the power of sophisticated probabilistic modeling and ef\ufb01cient statistical inference into the hands\nof non-experts, but realizing this vision is challenging because it is dif\ufb01cult for a non-expert user\nto judge the reliability of results produced by black-box inference. We believe BREAD provides a\nrigorous, general, and automatic procedure for monitoring the quality of posterior inference, so that\nthe user of a probabilistic programming language can have con\ufb01dence in the accuracy of the results.\nOur approach to evaluating probabilistic programming inference is closely related to independent\nwork [CTM16] that is also based on the ideas of BDMC. We discuss the relationships between both\nmethods in Section 4.\nIn summary, this work includes four main technical contributions. First, we show that BDMC yields\nan estimator which upper bounds in expectation the Jeffreys divergence of approximate samples\nfrom the true posterior. Second, we present a technique for exactly computing both the true Jeffreys\ndivergence and the upper bound on small examples, and show that the upper bound is often a\ngood match in practice. 
Third, we propose a protocol for using BDMC to evaluate the accuracy of approximate inference on real-world datasets. Finally, we extend both WebPPL and Stan to implement BREAD, and validate BREAD on a variety of probabilistic models in both frameworks. As an example of how BREAD can be used to guide modeling and algorithmic decisions, we use it to analyze the effectiveness of different representations of a matrix factorization model in both WebPPL and Stan.\n\n2 Background\n\n2.1 Annealed Importance Sampling\n\nAnnealed importance sampling (AIS; [Nea01]) is a Monte Carlo algorithm commonly used to estimate (ratios of) normalizing constants. More carefully, fix a sequence of T distributions p_1, . . . , p_T, with p_t(x) = f_t(x)/Z_t. The final distribution in the sequence, p_T, is called the target distribution; the first distribution, p_1, is called the initial distribution. It is required that one can obtain one or more exact samples from p_1.[1] Given a sequence of reversible MCMC transition operators T_1, . . . , T_T, where T_t leaves p_t invariant, AIS produces a (nonnegative) unbiased estimate of Z_T/Z_1 as follows: first, we sample a random initial state x_1 from p_1 and set the initial weight w_1 = 1. For every stage t ≥ 2, we update the weight w and sample the state x_t according to\n\nw_t ← w_{t-1} f_t(x_{t-1}) / f_{t-1}(x_{t-1}),    x_t ← sample from T_t(x | x_{t-1}).    (1)\n\nNeal [Nea01] justified AIS by showing that it is a simple importance sampler over an extended state space (see Appendix A for a derivation in our notation). From this analysis, it follows that the weight w_T is an unbiased estimate of the ratio Z_T/Z_1. Two trivial facts are worth highlighting: when Z_1 is known, Z_1 w_T is an unbiased estimate of Z_T, and when Z_T is known, w_T/Z_T is an unbiased estimate of 1/Z_1.\n\n[1] Traditionally, this has meant having access to an exact sampler. However, in this work, we sometimes have access to a sample from p_1, but not a sampler.
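As a concrete illustration of the update (1), the procedure can be sketched in a few lines of NumPy on a one-dimensional toy problem where the true ratio is known. The densities, schedule, and Metropolis kernel below are made-up choices for the example, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy AIS: anneal from p1 = N(0, 1) to pT = N(1, 0.5^2).
# With f1, fT the unnormalized Gaussian densities below, Z1 = sqrt(2*pi) and
# ZT = 0.5*sqrt(2*pi), so the true ratio ZT / Z1 is exactly 0.5.
def log_f1(x): return -0.5 * x**2
def log_fT(x): return -0.5 * (x - 1.0)**2 / 0.25

def log_ft(x, beta):
    # Geometric average: f_t = f1^(1-beta) * fT^beta
    return (1 - beta) * log_f1(x) + beta * log_fT(x)

def ais_log_weight(T=100, mh_steps=2):
    betas = np.linspace(0.0, 1.0, T)
    x = rng.normal()          # exact sample from p1
    log_w = 0.0
    for t in range(1, T):
        # weight update: w *= f_t(x_{t-1}) / f_{t-1}(x_{t-1})
        log_w += log_ft(x, betas[t]) - log_ft(x, betas[t - 1])
        # Metropolis transitions leaving p_t invariant
        for _ in range(mh_steps):
            prop = x + 0.5 * rng.normal()
            if np.log(rng.uniform()) < log_ft(prop, betas[t]) - log_ft(x, betas[t]):
                x = prop
    return log_w

log_ws = np.array([ais_log_weight() for _ in range(200)])
ratio_hat = np.exp(log_ws).mean()   # unbiased estimate of ZT / Z1 (~0.5 here)
```

Averaging the 200 independent weights recovers the true ratio to within a few percent; the final states of the chains are the approximate posterior samples discussed below.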
In practice, it is common to repeat the AIS procedure to produce K independent estimates and combine these by simple averaging to reduce the variance of the overall estimate.\n\nIn most applications of AIS, the normalization constant Z_T for the target distribution p_T is the focus of attention, and the initial distribution p_1 is chosen to have a known normalization constant Z_1. Any sequence of intermediate distributions satisfying a mild domination criterion suffices to produce a valid estimate, but in typical applications, the intermediate distributions are simply defined to be geometric averages f_t(x) = f_1(x)^{1-β_t} f_T(x)^{β_t}, where the β_t are monotonically increasing parameters with β_1 = 0 and β_T = 1. (An alternative approach is to average moments [GMS13].)\n\nIn the setting of Bayesian posterior inference over parameters θ and latent variables z given some fixed observation y, we take f_1(θ, z) = p(θ, z) to be the prior distribution (hence Z_1 = 1), and we take f_T(θ, z) = p(θ, z, y) = p(θ, z) p(y | θ, z). This can be viewed as the unnormalized posterior distribution, whose normalizing constant Z_T = p(y) is the marginal likelihood. Using geometric averaging, the intermediate distributions are then\n\nf_t(θ, z) = p(θ, z) p(y | θ, z)^{β_t}.    (2)\n\nIn addition to moment averaging, reasonable intermediate distributions can be produced in the Bayesian inference setting by conditioning on a sequence of increasing subsets of data; this insight relates AIS to the seemingly different class of sequential Monte Carlo (SMC) methods [MDJ06].\n\n2.2 Stochastic lower bounds on the log partition function ratio\n\nAIS produces a nonnegative unbiased estimate R̂ of the ratio R = Z_T/Z_1 of partition functions. Unfortunately, because such ratios often vary across many orders of magnitude, it frequently happens that R̂ underestimates R with overwhelming probability, while occasionally taking extremely large values. 
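This failure mode is easy to reproduce with a synthetic nonnegative unbiased estimator (a hypothetical lognormal one chosen for the illustration, not AIS itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic nonnegative unbiased estimator of R = 1: R_hat = exp(s*Z - s^2/2)
# with Z ~ N(0, 1) satisfies E[R_hat] = 1 exactly, for any s.
s = 3.0
r_hat = np.exp(s * rng.normal(size=1_000_000) - s**2 / 2)

frac_under = (r_hat < 1.0).mean()  # underestimates R with overwhelming probability
typical = np.median(r_hat)         # the typical estimate is tiny: exp(-s^2/2)
mean_est = r_hat.mean()            # unbiasedness is rescued by rare, huge estimates
log_bias = np.log(r_hat).mean()    # E[log R_hat] = -s^2/2, well below log R = 0
```

Here over 90% of the draws fall below the true value and the median is around 0.01, yet the sample mean stays near 1 because of occasional enormous draws, exactly the behavior described above.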
Correspondingly, the variance may be extremely large, or even infinite.\n\nFor these reasons, it is more meaningful to estimate log R. Unfortunately, the logarithm of a nonnegative unbiased estimate (such as the AIS estimate) is, in general, a biased estimator of the log estimand. More carefully, let Â be a nonnegative unbiased estimator for A = E[Â]. Then, by Jensen's inequality, E[log Â] ≤ log E[Â] = log A, and so log Â is a lower bound on log A in expectation. The estimator log Â satisfies another important property: by Markov's inequality for nonnegative random variables, Pr(log Â > log A + b) < e^{-b}, and so log Â is extremely unlikely to overestimate log A by any appreciable number of nats. These observations motivate the following definition [BGS15]: a stochastic lower bound on X is an estimator X̂ satisfying E[X̂] ≤ X and Pr(X̂ > X + b) < e^{-b}. Stochastic upper bounds are defined analogously. The above analysis shows that log Â is a stochastic lower bound on log A when Â is a nonnegative unbiased estimate of A, and, in particular, log R̂ is a stochastic lower bound on log R. (It is possible to strengthen the tail bound by combining multiple samples [GBD07].)\n\n2.3 Reverse AIS and Bidirectional Monte Carlo\n\nUpper and lower bounds are most useful in combination, as one can then sandwich the true value. As described above, AIS produces a stochastic lower bound on the ratio R; many other algorithms do as well. Upper bounds are more challenging to obtain. The key insight behind bidirectional Monte Carlo (BDMC; [GGA15]) is that, provided one has an exact sample from the target distribution p_T, one can run AIS in reverse to produce a stochastic lower bound on log R_rev = log Z_1/Z_T, and therefore a stochastic upper bound on log R = −log R_rev. 
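On a toy target where an exact sample is available by construction, the sandwich can be demonstrated end to end. The densities and kernels below are made-up for the illustration; forward chains give the stochastic lower bound and reversed chains, started from exact target samples, give the stochastic upper bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# Anneal between p1 = N(0, 1) and pT = N(1, 0.5^2); true log R = log(ZT/Z1) = log 0.5.
def log_f(x, beta):
    return (1 - beta) * (-0.5 * x**2) + beta * (-0.5 * (x - 1.0)**2 / 0.25)

def ais_chain(betas, x0):
    # Generic AIS sweep over the given beta schedule; returns the total log weight.
    x, log_w = x0, 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += log_f(x, b) - log_f(x, b_prev)
        for _ in range(2):  # Metropolis steps leaving the current p_beta invariant
            prop = x + 0.5 * rng.normal()
            if np.log(rng.uniform()) < log_f(prop, b) - log_f(x, b):
                x = prop
    return log_w

betas = np.linspace(0.0, 1.0, 100)
# Forward: start from an exact p1 sample; stochastic lower bound on log R.
lower = np.mean([ais_chain(betas, rng.normal()) for _ in range(100)])
# Reverse: start from an exact pT sample; the negated weight is a stochastic
# upper bound on log R.
upper = np.mean([-ais_chain(betas[::-1], 1.0 + 0.5 * rng.normal()) for _ in range(100)])
```

With this schedule both averaged bounds land close to log 0.5 ≈ −0.69, with the lower bound below the upper one; lengthening the schedule tightens the sandwich, as the consistency argument below explains.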
(In fact, BDMC is a more general framework which allows a variety of partition function estimators, but we focus on AIS for pedagogical purposes.) More carefully, for t = 1, . . . , T, define p̃_t = p_{T-t+1} and T̃_t = T_{T-t+1}. Then p̃_1 corresponds to our original target distribution p_T and p̃_T corresponds to our original initial distribution p_1. As before, T̃_t leaves p̃_t invariant. Consider the estimate produced by AIS on the sequence of distributions p̃_1, . . . , p̃_T and corresponding MCMC transition operators T̃_1, . . . , T̃_T. (In this case, the forward chain of AIS corresponds to the reverse chain described in Section 2.1.) The resulting estimate R̂_rev is a nonnegative unbiased estimator of R_rev. It follows that log R̂_rev is a stochastic lower bound on log R_rev, and therefore log R̂_rev^{-1} is a stochastic upper bound on log R = log R_rev^{-1}. BDMC is simply the combination of this stochastic upper bound with the stochastic lower bound of Section 2.2. Because AIS is a consistent estimator of the partition function ratio under the assumption of ergodicity [Nea01], the two bounds converge as T → ∞; therefore, given enough computation, BDMC can sandwich log R to arbitrary precision.\n\nReturning to the setting of Bayesian inference, given some fixed observation y, we can apply BDMC provided we have exact samples from both the prior distribution p(θ, z) and the posterior distribution p(θ, z | y). In practice, the prior is typically easy to sample from, but it is typically infeasible to generate exact posterior samples. 
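For simulated observations, however, an exact posterior sample comes for free, as a conjugate check makes vivid (a hypothetical Beta-Bernoulli example, chosen because its posterior is available in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ancestral sampling: theta ~ Beta(2, 2), then y | theta ~ Bernoulli(theta).
# Each (theta, y) pair is a draw from the joint, so theta is an exact sample
# from p(theta | y) for the simulated y. Conjugacy gives p(theta | y=1) = Beta(3, 2),
# which has mean 3/5.
theta = rng.beta(2.0, 2.0, size=200_000)
y = rng.uniform(size=theta.size) < theta

# Empirical check: ancestral thetas paired with y = 1 match Beta(3, 2).
posterior_mean_est = theta[y].mean()
```

The empirical mean of the selected thetas agrees with the closed-form posterior mean 0.6 to three decimal places; the next paragraph states the general fact behind this.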
However, in models where we can tractably sample from the joint distribution p(θ, z, y), we can generate exact posterior samples for simulated observations using the elementary fact that\n\np(y) p(θ, z | y) = p(θ, z, y) = p(θ, z) p(y | θ, z).    (3)\n\nIn other words, if one ancestrally samples θ, z, and y, this is equivalent to first generating a dataset y and then sampling (θ, z) exactly from the posterior. Therefore, for simulated data, one has access to a single exact posterior sample; this is enough to obtain stochastic upper bounds on log R = log p(y).\n\n2.4 WebPPL and Stan\n\nWe focus on two particular probabilistic programming packages. First, we consider WebPPL [GS], a lightweight probabilistic programming language built on Javascript, and intended largely to illustrate some of the important ideas in probabilistic programming. Inference is based on Metropolis–Hastings (M–H) updates to a program's execution trace, i.e. a record of all stochastic decisions made by the program. WebPPL has a small and clean implementation, and the entire implementation is described in an online tutorial on probabilistic programming [GS].\n\nSecond, we consider Stan [CGHL+ p], a highly engineered automatic inference system which is widely used by statisticians and is intended to scale to large problems. Stan is based on the No-U-Turn Sampler (NUTS; [HG14]), a variant of Hamiltonian Monte Carlo (HMC; [Nea+11]) which chooses trajectory lengths adaptively. HMC can be significantly more efficient than M–H over execution traces because it uses gradient information to simultaneously update multiple parameters of a model, but is less general because it requires a differentiable likelihood. 
(In particular, this disallows discrete latent variables unless they are marginalized out analytically.)\n\n3 Methods\n\nThere are at least two criteria we would desire from a sampling-based approximate inference algorithm in order that its samples be representative of the true posterior distribution: we would like the approximate distribution q(θ, z; y) to cover all the high-probability regions of the posterior p(θ, z | y), and we would like it to avoid placing probability mass in low-probability regions of the posterior. The former criterion motivates measuring the KL divergence D_KL(p(θ, z | y) ‖ q(θ, z; y)), and the latter criterion motivates measuring D_KL(q(θ, z; y) ‖ p(θ, z | y)). If we desire both simultaneously, this motivates paying attention to the Jeffreys divergence, defined as D_J(q ‖ p) = D_KL(q ‖ p) + D_KL(p ‖ q).\n\nIn this section, we present Bounding Divergences with Reverse Annealing (BREAD), a technique for using BDMC to bound the Jeffreys divergence from the true posterior on simulated data, combined with a protocol for using this technique to analyze sampler accuracy on real-world data.\n\n3.1 Upper bounding the Jeffreys divergence in expectation\n\nWe now present our technique for bounding the Jeffreys divergence between the target distribution and the distribution of approximate samples produced by AIS. In describing the algorithm, we revert to the abstract state space formalism of Section 2.1, since the algorithm itself does not depend on any structure specific to posterior inference (except for the ability to obtain an exact sample). We first repeat the derivation from [GGA15] of the bias of the stochastic lower bound log R̂. Let v = (x_1, . . . , x_{T-1}) denote all of the variables sampled in AIS before the final stage; the final state x_T corresponds to the approximate sample produced by AIS. 
We can write the distributions over the forward and reverse AIS chains as:\n\nq_fwd(v, x_T) = q_fwd(v) q_fwd(x_T | v)    (4)\nq_rev(v, x_T) = p_T(x_T) q_rev(v | x_T).    (5)\n\nThe distribution of approximate samples q_fwd(x_T) is obtained by marginalizing out v. Note that sampling from q_rev requires sampling exactly from p_T, so strictly speaking, BREAD is limited to those cases where one has at least one exact sample from p_T, such as simulated data from a probabilistic model (see Section 2.3).\n\nThe expectation of the estimate log R̂ of the log partition function ratio is given by:\n\nE[log R̂] = E_{q_fwd(v, x_T)}[ log ( f_T(x_T) q_rev(v | x_T) / (Z_1 q_fwd(v, x_T)) ) ]    (6)\n= log Z_T − log Z_1 − D_KL(q_fwd(x_T) q_fwd(v | x_T) ‖ p_T(x_T) q_rev(v | x_T))    (7)\n≤ log Z_T − log Z_1 − D_KL(q_fwd(x_T) ‖ p_T(x_T)).    (8)\n\n(Note that q_fwd(v | x_T) is the conditional distribution of the forward chain, given that the final state is x_T.) The inequality follows because marginalizing out variables cannot increase the KL divergence.\n\nWe now go beyond the analysis in [GGA15], to bound the bias in the other direction. The expectation of the reverse estimate log R̂_rev is\n\nE[log R̂_rev] = E_{q_rev(x_T, v)}[ log ( Z_1 q_fwd(v, x_T) / (f_T(x_T) q_rev(v | x_T)) ) ]    (9)\n= log Z_1 − log Z_T − D_KL(p_T(x_T) q_rev(v | x_T) ‖ q_fwd(x_T) q_fwd(v | x_T))    (10)\n≤ log Z_1 − log Z_T − D_KL(p_T(x_T) ‖ q_fwd(x_T)).    (11)\n\nAs discussed above, log R̂ and log R̂_rev^{-1} can both be seen as estimators of log Z_T/Z_1, the former of which is a stochastic lower bound, and the latter of which is a stochastic upper bound. Consider the gap between these two bounds, B̂ ≜ log R̂_rev^{-1} − log R̂. It follows from Eqs. 
(8) and (11) that, in expectation, B̂ upper bounds the Jeffreys divergence\n\nJ ≜ D_J(p_T(x_T), q_fwd(x_T)) ≜ D_KL(p_T(x_T) ‖ q_fwd(x_T)) + D_KL(q_fwd(x_T) ‖ p_T(x_T))    (12)\n\nbetween the target distribution p_T and the distribution q_fwd(x_T) of approximate samples.\n\nAlternatively, if one happens to have some other lower bound L or upper bound U on log R, then one can bound either of the one-sided KL divergences by running only one direction of AIS. Specifically, from Eq. (8), E[U − log R̂] ≥ D_KL(q_fwd(x_T) ‖ p_T(x_T)), and from Eq. (11), E[log R̂_rev^{-1} − L] ≥ D_KL(p_T(x_T) ‖ q_fwd(x_T)).\n\nHow tight is the expectation B ≜ E[B̂] as an upper bound on J? We evaluated both B and J exactly on some toy distributions and found them to be a fairly good match. Details are given in Appendix B.\n\n3.2 Application to real-world data\n\nSo far, we have focused on the setting of simulated data, where it is possible to obtain an exact posterior sample, and then to rigorously bound the Jeffreys divergence using BDMC. However, we are more likely to be interested in evaluating the performance of inference on real-world data, so we would like to simulate data which resembles a real-world dataset of interest. One particular difficulty is that, in Bayesian analysis, hyperparameters are often assigned non-informative or weakly informative priors, in order to avoid biasing the inference. This poses a challenge for BREAD, as datasets generated from hyperparameters sampled from such priors (which are often very broad) can be very dissimilar to real datasets, and hence conclusions from the simulated data may not generalize.\n\nIn order to generate simulated datasets which better match a real-world dataset of interest, we adopt the following heuristic scheme: we first perform approximate posterior inference on the real-world dataset. Let η̂_real denote the estimated hyperparameters. 
We then simulate parameters and data from the forward model p(θ | η̂_real) p(D | η̂_real, θ). The forward AIS chain is run on D in the usual way. However, to initialize the reverse chain, we first start with (η̂_real, θ), and then run some number of MCMC transitions which preserve p(η, θ | D), yielding an approximate posterior sample (η*, θ*). In general, (η*, θ*) will not be an exact posterior sample, since η̂_real was not sampled from p(η | D). However, the true hyperparameters η̂_real which generated D ought to be in a region of high posterior mass unless the prior p(η) concentrates most of its mass away from η̂_real. Therefore, we expect even a small number of MCMC steps to produce a plausible posterior sample. This motivates our use of (η*, θ*) in place of an exact posterior sample. We validate this procedure in Section 5.1.2.\n\n4 Related work\n\nMuch work has been devoted to the diagnosis of Markov chain convergence (e.g. [CC96; GR92; BG98]). Diagnostics have been developed both for estimating the autocorrelation function of statistics of interest (which determines the number of effective samples from an MCMC chain) and for diagnosing whether Markov chains have reached equilibrium. In general, convergence diagnostics cannot confirm convergence; they can only identify particular forms of non-convergence. By contrast, BREAD can rigorously demonstrate convergence in the simulated data setting.\n\nThere has also been much interest in automatically configuring parameters of MCMC algorithms. Since it is hard to reliably summarize the performance of an MCMC algorithm, such automatic configuration methods typically rely on method-specific analyses. 
For instance, Roberts and Rosenthal [RR01] showed that the optimal acceptance rate of Metropolis–Hastings with an isotropic proposal distribution is 0.234 under fairly general conditions. M–H algorithms are sometimes tuned to achieve this acceptance rate, even in situations where the theoretical analysis doesn't hold. Rigorous convergence measures might enable more direct optimization of algorithmic hyperparameters.\n\nGorham and Mackey [GM15] presented a method for directly estimating the quality of a set of approximate samples, independently of how those samples were obtained. This method has strong guarantees under a strong convexity assumption. By contrast, BREAD makes no assumptions about the distribution itself, so its mathematical guarantees (for simulated data) are applicable even to multimodal or badly conditioned posteriors.\n\nIt has been observed that heating and cooling processes yield bounds on log-ratios of partition functions by way of finite difference approximations to thermodynamic integration. Neal [Nea96] used such an analysis to motivate tempered transitions, an MCMC algorithm based on heating and cooling a distribution. His analysis cannot be directly applied to measuring convergence, as it assumed equilibrium at each temperature. Jarzynski [Jar97] later gave a non-equilibrium analysis which is equivalent to that underlying AIS [Nea01].\n\nWe have recently learned of independent work [CTM16] which also builds on BDMC to evaluate the accuracy of posterior inference in a probabilistic programming language. In particular, Cusumano-Towner and Mansinghka [CTM16] define an unbiased estimator for a quantity called the subjective divergence. The estimator is equivalent to BDMC except that the reverse chain is initialized from an arbitrary reference distribution, rather than the true posterior. 
In [CTM16], the subjective divergence is shown to upper bound the Jeffreys divergence when the true posterior is used; this is equivalent to our analysis in Section 3.1. Much less is known about subjective divergence when the reference distribution is not taken to be the true posterior. (Our approximate sampling scheme for hyperparameters can be viewed as a kind of reference distribution.)\n\n5 Experiments\n\nIn order to experiment with BREAD, we extended both WebPPL and Stan to run forward and reverse AIS using the sequence of distributions defined in Eq. (2). The MCMC transition kernels were the standard ones provided by both platforms. Our first set of experiments was intended to validate that BREAD can be used to evaluate the accuracy of posterior inference in realistic settings. Next, we used BREAD to explore the tradeoffs between two different representations of a matrix factorization model in both WebPPL and Stan.\n\n5.1 Validation\n\nAs described above, BREAD returns rigorous bounds on the Jeffreys divergence only when the data are sampled from the model distribution. Here, we address three ways in which it could potentially give misleading results. First, the upper bound B may overestimate the true Jeffreys divergence J. Second, results on simulated data may not correspond to results on real-world data if the simulated data are not representative of the real-world data. Finally, the fitted hyperparameter procedure of Section 3.2 may not yield a sample sufficiently representative of the true posterior p(θ, η | D). The first issue, about the accuracy of the bound, is addressed in Appendix B.1.1; the bound appears to be fairly close to the true Jeffreys divergence on some toy distributions. We address the other two issues in this section. 
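That kind of exact check is easy to reproduce in miniature: on a two-state space with a short AIS chain, every path can be enumerated, so both the expected gap B and the true Jeffreys divergence J are computable exactly. The unnormalized distributions below are made up for the sketch; this is not the setup of Appendix B:

```python
import numpy as np
from itertools import product

# Unnormalized distributions on {0, 1}: f1 (Z1 = 2), a geometric
# intermediate (beta = 0.5), and fT (ZT = 10); true log R = log 5.
f = [np.array([1.0, 1.0]), np.array([3.0, 1.0]), np.array([9.0, 1.0])]

def mh_kernel(ft):
    # Metropolis kernel: propose the other state, accept w.p. min(1, ratio).
    K = np.zeros((2, 2))
    for x in (0, 1):
        a = min(1.0, ft[1 - x] / ft[x])
        K[x, 1 - x], K[x, x] = a, 1.0 - a
    return K

K = [mh_kernel(ft) for ft in f]
p1, pT = f[0] / f[0].sum(), f[2] / f[2].sum()

# Enumerate all forward AIS paths (x1, x2, x3) exactly.
E_fwd, q_marg = 0.0, np.zeros(2)
for x1, x2, x3 in product((0, 1), repeat=3):
    prob = p1[x1] * K[1][x1, x2] * K[2][x2, x3]
    log_w = np.log(f[1][x1] / f[0][x1]) + np.log(f[2][x2] / f[1][x2])
    E_fwd += prob * log_w
    q_marg[x3] += prob

# Reverse AIS paths: start from an exact pT sample, anneal back toward p1.
E_rev = 0.0
for x1, x2, x3 in product((0, 1), repeat=3):
    prob = pT[x1] * K[1][x1, x2] * K[0][x2, x3]
    log_w = np.log(f[1][x1] / f[2][x1]) + np.log(f[0][x2] / f[1][x2])
    E_rev += prob * log_w

B = -E_rev - E_fwd                      # expected gap between the two bounds
J = (pT * np.log(pT / q_marg)).sum() + (q_marg * np.log(q_marg / pT)).sum()
```

By construction E_fwd ≤ log 5 ≤ −E_rev, and the gap B upper bounds J; with these numbers the chain mixes well, so J is tiny while B also reflects the bias of the two log-weight estimates.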
In particular, we attempted to validate that the behavior of the method on simulated data is consistent with that on real data, and that the fitted-hyperparameter samples can be used as a proxy for samples from the posterior. All experiments in this section were performed using Stan.\n\nFigure 1: (a) Validation of the consistency of the behavior of forward AIS on real and simulated data for the logistic regression model. Since the log-ML values need not match between the real and simulated data, the y-axes for each curve are shifted based on the maximum log-ML lower bound obtained by forward AIS. (b) Same as (a), but for matrix factorization. The complete set of results on all datasets is given in Appendix D. (c) Validation of the fitted hyperparameter scheme on the logistic regression model (see Section 5.1.2 for details). Reverse AIS curves are shown as the number of Gibbs steps used to initialize the hyperparameters is varied.\n\n5.1.1 Validating consistency of inference behavior between real and simulated data\n\nTo validate BREAD in a realistic setting, we considered five models based on examples from the Stan manual [Sta], and chose a publicly available real-world dataset for each model. These models include: linear regression, logistic regression, matrix factorization, autoregressive time series modeling, and mixture-of-Gaussians clustering. See Appendix C for model details and Stan source code.\n\nIn order to validate the use of simulated data as a proxy for real data in the context of BREAD, we fit hyperparameters to the real-world datasets and simulated data from those hyperparameters, as described in Section 3.2. In Fig. 
1 and Appendix D, we show the distributions of forward and reverse AIS estimates on simulated data and forward AIS estimates on real-world data, based on 100 AIS chains for each condition.[2] Because the distributions of AIS estimates included many outliers, we visualize quartiles of the estimates rather than means.[3] The real and simulated data need not have the same marginal likelihood, so the AIS estimates for real and simulated data are shifted vertically based on the largest forward AIS estimate obtained for each model. For all five models under consideration, the forward AIS curves were nearly identical (up to a vertical shift), and the distributions of AIS estimates were very similar at each number of AIS steps. (An example where the forward AIS curves failed to match up due to model misspecification is given in Appendix D.) Since the inference behavior appears to match closely between the real and simulated data, we conclude that data simulated using fitted hyperparameters can be a useful proxy for real data when evaluating inference algorithms.\n\n5.1.2 Validating the approximate posterior over hyperparameters\n\nAs described in Section 3.2, when we simulate data from fitted hyperparameters, we use an approximate (rather than exact) posterior sample (η*, θ*) to initialize the reverse chain. Because of this, BREAD is not mathematically guaranteed to upper bound the Jeffreys divergence even on the simulated data. In order to determine the effect of this approximation in practice, we repeated the procedure of Section 5.1.1 for all five models, but varying S, the number of MCMC steps used to obtain (η*, θ*), with S ∈ {10, 100, 1000, 10000}. The reverse AIS estimates are shown in Fig. 1 and Appendix D. (We do not show the forward AIS estimates because these are unaffected by S.) In all five cases, the reverse AIS curves were statistically indistinguishable. 
This validates our use of fitted hyperparameters, as it suggests that the use of approximate samples of hyperparameters has little impact on the reverse AIS upper bounds.\n\n[2] The forward AIS chains are independent, while the reverse chains share an initial state.\n[3] Normally, such outliers are not a problem for AIS, because one averages the weights w_T, and this average is insensitive to extremely small values. Unfortunately, the analysis of Section 3.1 does not justify such averaging, so we report estimates corresponding to individual AIS chains.\n\nFigure 2: Comparison of Jeffreys divergence bounds for matrix factorization in Stan and WebPPL, using the collapsed and uncollapsed formulations. Given as a function of (a) number of MCMC steps, (b) running time.\n\n5.2 Scientific findings produced by BREAD\n\nHaving validated various aspects of BREAD, we applied it to investigate the choice of model representation in Stan and WebPPL. During our investigation, we also uncovered a bug in WebPPL, indicating the potential usefulness of BREAD as a means of testing the correctness of an implementation.\n\n5.2.1 Comparing model representations\n\nMany models can be written in more than one way, for example by introducing or collapsing latent variables. Performance of probabilistic programming languages can be sensitive to such choices of representation, and the representation which gives the best performance may vary from one language to another. We consider the matrix factorization model described above, which we now specify in more detail. We approximate an N × D matrix Y as a low rank matrix, the product of matrices U and V with dimensions N × K and K × D respectively (where K < min(N, D)). 
We use a spherical Gaussian observation model, and spherical Gaussian priors on U and V:

    u_ik ~ N(0, σ_u²)        v_kj ~ N(0, σ_v²)        y_ij | u_i, v_j ~ N(u_i^T v_j, σ²)

We can also collapse U to obtain the model v_kj ~ N(0, σ_v²) and y_i | V ~ N(0, σ_u² V^T V + σ² I). In general, collapsing variables can help MCMC samplers mix faster, at the expense of greater computational cost per update. The precise tradeoff can depend on the size of the model and dataset, the choice of MCMC algorithm, and the underlying implementation, so it would be useful to have a quantitative criterion to choose between the formulations.

We fixed the values of all hyperparameters to 1, and set N = 50, K = 5, and D = 25. We ran BREAD on both platforms (Stan and WebPPL) and for both formulations (collapsed and uncollapsed) (see Fig. 2). The simulated data and exact posterior sample were shared between all conditions in order to make the results directly comparable.

As predicted, the collapsed sampler resulted in slower updates but faster convergence (in terms of the number of steps). However, the per-iteration convergence benefit of collapsing was much larger in WebPPL than in Stan (perhaps because of the different underlying inference algorithms). Overall, the tradeoff between efficiency and convergence speed appears to favour the uncollapsed version in Stan, and the collapsed version in WebPPL (see Fig. 2(b)). (Note that this result holds only for our particular choice of problem size; the tradeoff may change given different model or dataset sizes.) Hence, BREAD can provide valuable guidance on the tricky question of which model representation to choose for faster convergence.

5.2.2 Debugging

Mathematically, the forward and reverse AIS chains yield lower and upper bounds on log p(y) with high probability; if this behavior is not observed, that indicates a bug.
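This sandwich property can be checked end to end on a toy model where log p(y) is available in closed form. The sketch below is our own illustration, not the BREAD implementation: the conjugate Gaussian model, the Metropolis-Hastings transition, the linear annealing schedule, and all variable names are assumptions chosen so the forward and reverse bounds can be compared against the exact marginal likelihood.

```python
import numpy as np

# Toy conjugate model, so log p(y) is tractable and the sandwich is checkable:
#   prior:      theta ~ N(0, 1)
#   likelihood: y | theta ~ N(theta, 1)
# Marginal likelihood: p(y) = N(y; 0, 2); posterior: N(y/2, 1/2).

rng = np.random.default_rng(0)
y = 1.3  # a single observation (hypothetical value for illustration)

def log_lik(theta):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - theta) ** 2

def log_prior(theta):
    return -0.5 * np.log(2 * np.pi) - 0.5 * theta ** 2

def mh_step(theta, beta, scale=0.8):
    """One Metropolis-Hastings step targeting p(theta) * p(y|theta)^beta."""
    prop = theta + scale * rng.standard_normal()
    log_alpha = (log_prior(prop) + beta * log_lik(prop)
                 - log_prior(theta) - beta * log_lik(theta))
    return prop if np.log(rng.random()) < log_alpha else theta

betas = np.linspace(0.0, 1.0, 201)  # annealing schedule from prior to posterior

def forward_ais():
    """AIS from prior to posterior: log weight stochastically lower-bounds log p(y)."""
    theta = rng.standard_normal()  # exact prior sample
    logw = 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        logw += (b - b_prev) * log_lik(theta)
        theta = mh_step(theta, b)
    return logw

def reverse_ais():
    """AIS from posterior to prior: negated log weight stochastically upper-bounds log p(y)."""
    theta = y / 2 + np.sqrt(0.5) * rng.standard_normal()  # exact posterior sample
    logw = 0.0
    rev = betas[::-1]
    for b_prev, b in zip(rev[:-1], rev[1:]):
        logw += (b - b_prev) * log_lik(theta)
        theta = mh_step(theta, b)
    return -logw

true_logZ = -0.5 * np.log(2 * np.pi * 2) - y ** 2 / 4  # log N(y; 0, 2)
lower = np.mean([forward_ais() for _ in range(100)])
upper = np.mean([reverse_ais() for _ in range(100)])
print(f"forward (lower) = {lower:.3f}, true = {true_logZ:.3f}, reverse (upper) = {upper:.3f}")
```

If the reverse estimate falls noticeably below the forward one, something is wrong: either the "exact" posterior sample is not exact or a transition operator is buggy. Note that the sketch averages per-chain log weights rather than the weights themselves, which preserves the bound interpretation in expectation, in keeping with the per-chain reporting described above.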
In our experimentation with WebPPL, we observed a case where the reverse AIS chain yielded estimates significantly lower than those produced by the forward chain, inconsistent with the theoretical guarantee. This led us to find a subtle bug in how WebPPL sampled from a multivariate Gaussian distribution (which had the effect that the exact posterior samples used to initialize the reverse chain were incorrect).[4] At a time when many new probabilistic programming languages are emerging and many are under active development, the debugging capability that BREAD provides can be especially useful.

[4] Issue: https://github.com/probmods/webppl/issues/473