{"title": "Pseudo-Extended Markov chain Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 4312, "page_last": 4322, "abstract": "Sampling from posterior distributions using Markov chain Monte Carlo (MCMC) methods can require an exhaustive number of iterations, particularly when the posterior is multi-modal as the MCMC sampler can become trapped in a local mode for a large number of iterations. In this paper, we introduce the pseudo-extended MCMC method as a simple approach for improving the mixing of the MCMC sampler for multi-modal posterior distributions. The pseudo-extended method augments the state-space of the posterior using pseudo-samples as auxiliary variables. On the extended space, the modes of the posterior are connected, which allows the MCMC sampler to easily move between well-separated posterior modes. We demonstrate that the pseudo-extended approach delivers improved MCMC sampling over the Hamiltonian Monte Carlo algorithm on multi-modal posteriors, including Boltzmann machines and models with sparsity-inducing priors.", "full_text": "Pseudo-Extended Markov Chain Monte Carlo\n\nChristopher Nemeth\nDepartment of Mathematics and Statistics\nLancaster University, United Kingdom\nc.nemeth@lancaster.ac.uk\n\nFredrik Lindsten\nDepartment of Computer and Information Science\nLinköping University, Sweden\nfredrik.lindsten@liu.se\n\nMaurizio Filippone\nDepartment of Data Science\nEURECOM, France\nmaurizio.filippone@eurecom.fr\n\nJames Hensman\nPROWLER.io\nCambridge, United Kingdom\njames@prowler.io\n\nAbstract\n\nSampling from posterior distributions using Markov chain Monte Carlo (MCMC) methods can require an exhaustive number of iterations, particularly when the posterior is multi-modal as the MCMC sampler can become trapped in a local mode for a large number of iterations. 
In this paper, we introduce the pseudo-\nextended MCMC method as a simple approach for improving the mixing of the\nMCMC sampler for multi-modal posterior distributions. The pseudo-extended\nmethod augments the state-space of the posterior using pseudo-samples as auxiliary\nvariables. On the extended space, the modes of the posterior are connected, which\nallows the MCMC sampler to easily move between well-separated posterior modes.\nWe demonstrate that the pseudo-extended approach delivers improved MCMC\nsampling over the Hamiltonian Monte Carlo algorithm on multi-modal posteriors,\nincluding Boltzmann machines and models with sparsity-inducing priors.\n\n1\n\nIntroduction\n\nMarkov chain Monte Carlo (MCMC) methods (see, e.g., Brooks et al. (2011)) are generally regarded\nas the gold standard approach for sampling from high-dimensional distributions. In particular,\nMCMC algorithms have been extensively applied within the \ufb01eld of Bayesian statistics to sample\nfrom posterior distributions when the posterior density can only be evaluated up to a constant of\nproportionality. Under mild conditions, it can be shown that asymptotically, the limiting distribution\nof the samples generated from the MCMC algorithm will converge to the posterior distribution of\ninterest. While theoretically elegant, one of the main drawbacks of MCMC methods is that running\nthe algorithm to stationarity can be prohibitively expensive if the posterior distribution is of a complex\nform, for example, contains multiple unknown modes. 
Notable examples of multi-modality include the posterior over model parameters in mixture models (McLachlan and Peel, 2000), deep neural networks (Neal, 2012), and differential equation models (Calderhead and Girolami, 2009).\nIn this paper, we present the pseudo-extended Markov chain Monte Carlo method as an approach for augmenting the state-space of the original posterior distribution to allow the MCMC sampler to easily move between areas of high posterior density. The pseudo-extended method introduces pseudo-samples on the extended space to improve the mixing of the Markov chain. To illustrate how this method works, in Figure 1 we plot a mixture of two univariate Gaussian distributions (left). The area of low probability density between the two Gaussians will make it difficult for an MCMC sampler to traverse between them. Using the pseudo-extended approach (as detailed in Section 2), we can extend the state-space to two dimensions (right), where on the extended space, the modes are now connected, allowing the MCMC sampler to easily mix between them.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Original target density π(x) (left) and extended target (right) with N = 2.\n\nThe pseudo-extended framework can be applied for general MCMC sampling; however, in this paper, we focus on using ideas from tempered MCMC (Jasra et al., 2007) to improve multi-modal posterior sampling. Unlike previous approaches which use MCMC to sample from multi-modal posteriors, i) we do not require a priori information regarding the number, or location, of modes, ii) nor do we need to specify a sequence of intermediary tempered distributions (Geyer, 1991).\nWe show that samples generated using the pseudo-extended method admit the correct posterior of interest as the limiting distribution. 
Furthermore, once weighted using a post-hoc correction step, it is possible to use all pseudo-samples for approximating the posterior distribution. The pseudo-extended method can be applied as an extension to many popular MCMC algorithms, including the random-walk Metropolis (Roberts et al., 1997) and the Metropolis-adjusted Langevin algorithm (Roberts and Tweedie, 1996). However, in this paper, we focus on applying the popular Hamiltonian Monte Carlo (HMC) algorithm (Neal, 2010) within the pseudo-extended framework and show that this leads to improved posterior exploration compared to standard HMC.\n\n2 The Pseudo-Extended Method\n\nLet π be a target probability density on R^d defined for all x ∈ X := R^d by\n\nπ(x) := γ(x)/Z = exp{−φ(x)}/Z,    (1)\n\nwhere φ : X → R is a continuously differentiable function and Z is the normalizing constant. Throughout, we will refer to π(x) as the target density. In the Bayesian setting, this would be the posterior, where for data y ∈ Y, the likelihood is denoted as p(y|x) with parameters x assigned a prior density π0(x). The posterior density of the parameters given the data is derived from Bayes' theorem, π(x) = p(y|x)π0(x)/p(y), where the marginal likelihood p(y) is the normalizing constant Z, which is typically not available analytically.\nWe extend the state-space of the original target distribution eq. (1) by introducing N pseudo-samples, x_{1:N} = {x_i}_{i=1}^N, where the extended-target distribution π_N(x_{1:N}) is defined on X^N. The pseudo-samples act as auxiliary variables, where for each x_i, we introduce an instrumental distribution q(x_i) ∝ exp{−δ(x_i)} with support covering that of π(x). 
In a similar vein to the pseudo-marginal MCMC algorithm (Beaumont, 2003; Andrieu and Roberts, 2009), our extended-target, including the auxiliary variables, is now of the form\n\nπ_N(x_{1:N}) := (1/N) ∑_{i=1}^N π(x_i) ∏_{j≠i} q(x_j) = (1/Z) { (1/N) ∑_{i=1}^N γ(x_i)/q(x_i) } × ∏_i q(x_i),    (2)\n\nwhere γ(·) and Z are defined in eq. (1). In pseudo-marginal MCMC, q(·) is an instrumental distribution used for importance sampling to compute unbiased estimates of the intractable normalizing constant (see Section 2.2 for details). However, with the pseudo-extended method we use q(·) to improve the mixing of the MCMC algorithm. Additionally, unlike pseudo-marginal MCMC, we do not require that q(·) can be sampled from; a fact that we will exploit in Section 3.\nIn the case where N = 1, our extended-target eq. (2) simplifies back to the original target π(x) = π_N(x_{1:N}) of eq. (1). For N > 1, the resulting marginal distribution of the ith pseudo-sample is a mixture between the target and the instrumental distribution,\n\nπ_N(x_i) = (1/N) π(x_i) + ((N − 1)/N) q(x_i).\n\nWe then use a post-hoc weighting step to convert the samples from the extended-target to samples from the original target of interest π(x). In Theorem 2.1, we show that samples from the extended target give unbiased expectations of arbitrary functions f under the target of interest π.\nTheorem 2.1. Let x_{1:N} be distributed according to the extended-target π_N. Weighting each sample with self-normalized weights proportional to γ(x_i)/q(x_i), for i = 1, . . . , N, gives samples from the target distribution π(x), in the sense that, for an arbitrary integrable f,\n\nE_{π_N}[ (∑_{i=1}^N f(x_i)γ(x_i)/q(x_i)) / (∑_{i=1}^N γ(x_i)/q(x_i)) ] = E_π[f(x)].    (3)\n\nThe proof follows from the invariance of particle Gibbs (Andrieu et al., 2010) and is given in Section A of the Supplementary Material.\n\n2.1 Pseudo-extended Hamiltonian Monte Carlo\n\nWe use an MCMC algorithm to sample from the pseudo-extended target eq. (2). In this paper, we use the HMC algorithm because of its impressive mixing times; however, a disadvantage of HMC, and other gradient-based MCMC algorithms, is that they tend to be mode-seeking and are more prone to getting trapped in local modes of the target. The pseudo-extended framework creates a target where the modes are connected on the extended space, which reduces the mode-seeking behavior of HMC and allows the sampler to move easily between regions of high density.\nRecalling that our parameters are x ∈ X := R^d, we introduce artificial momentum variables ρ ∈ R^d that are independent of x. The Hamiltonian H(x, ρ) represents the total energy of the system as the combination of the potential function φ(x), as defined in eq. (1), and the kinetic energy (1/2)ρ^⊤M^{−1}ρ,\n\nH(x, ρ) := φ(x) + (1/2)ρ^⊤M^{−1}ρ,\n\nwhere M is a mass matrix and is often set to the identity matrix. The Hamiltonian now augments our target distribution so that we are sampling (x, ρ) from the joint distribution π(x, ρ) ∝ exp{−H(x, ρ)} = π(x)N(ρ|0, M), which admits the target as the marginal. In the case of the pseudo-extended target eq. 
(2), the Hamiltonian is\n\nH^N(x_{1:N}, ρ) = −log[ ∑_{i=1}^N exp{−φ(x_i) + δ(x_i)} ] + ∑_{i=1}^N δ(x_i) + (1/2)ρ^⊤M^{−1}ρ,    (4)\n\nwhere now ρ ∈ R^{d×N}, and δ(x) is the potential function of the instrumental distribution in eq. (2), which is arbitrary but differentiable.\nAside from a few special cases, we generally cannot simulate from the Hamiltonian system eq. (4) exactly (Neal, 2010). Instead, we discretize time using small step-sizes ε and calculate the state at ε, 2ε, 3ε, etc. using a numerical integrator. Several numerical integrators are available which preserve the volume and reversibility of the Hamiltonian system (Girolami and Calderhead, 2011), the most popular being the leapfrog integrator, which takes L steps, each of size ε, through the Hamiltonian dynamics (pseudo-code is given in the Supplementary Material). After a fixed number of iterations T, the algorithm generates samples (x_{1:N}^{(t)}, ρ^{(t)}), t = 1, . . . , T, approximately distributed according to the joint distribution π(x_{1:N}, ρ), where after discarding the momentum variables ρ, our MCMC samples will be approximately distributed according to the target π_N(x_{1:N}). In this paper, we use the No-U-Turn Sampler (NUTS) introduced by Hoffman and Gelman (2014), as implemented in the STAN (Carpenter et al., 2017) software package, to automatically tune L and ε.\n\n2.2 Connections to pseudo-marginal MCMC\n\nThe pseudo-extended target eq. (2) can be viewed as a special case of the pseudo-marginal target of Andrieu and Roberts (2009). In the pseudo-marginal setting, it is (typically) assumed that the target density is of the form π(θ) = ∫_X π(θ, x) dx, where θ is some “top-level” parameter, and where x are latent variables that cannot be integrated out analytically. 
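Before developing the pseudo-marginal connection, the extended-target construction of eq. (2) and the post-hoc weighting of Theorem 2.1 can be sketched numerically. The following is a minimal, hedged illustration: the bimodal target, the instrumental density, and the sample sizes are all hypothetical, and i.i.d. draws stand in for MCMC output on the extended space so as to isolate the weighting step itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bimodal target: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2),
# available only up to its normalising constant Z, as in eq. (1).
def log_gamma(x):
    return np.logaddexp(-0.5 * ((x + 3.0) / 0.5) ** 2,
                        -0.5 * ((x - 3.0) / 0.5) ** 2)

# Broad instrumental density q(x) covering both modes (here N(0, 5^2)).
def log_q(x):
    return -0.5 * (x / 5.0) ** 2

# Stand-in for draws from the extended target pi_N with N = 2 pseudo-samples.
x = rng.normal(0.0, 5.0, size=(10_000, 2))

# Self-normalised weights proportional to gamma(x_i)/q(x_i) (Theorem 2.1),
# computed stably on the log scale.
log_w = log_gamma(x) - log_q(x)
w = np.exp(log_w - log_w.max())
w /= w.sum()

# Weighted estimates of E_pi[x] (= 0 by symmetry) and E_pi[x^2] (= 9.25).
est_mean = float(np.sum(w * x))
est_m2 = float(np.sum(w * x ** 2))
```

The key point is that every pseudo-sample contributes to the estimate, weighted by γ(x_i)/q(x_i), rather than only one "accepted" value per iteration.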
Using importance sampling, an unbiased Monte Carlo estimate π̃(θ) of the target is computed using latent variable samples x_1, x_2, . . . , x_N from an instrumental distribution with density q(x), approximating the integral as\n\nπ̃(θ) := (1/N) ∑_{i=1}^N π(θ, x_i)/q(x_i), where x_i ∼ q(·).\n\nThe pseudo-marginal target is then defined, analogously to the pseudo-extended target eq. (2), as\n\nπ̃_N(θ, x_{1:N}) := (1/N) ∑_{i=1}^N π(θ, x_i) ∏_{j≠i} q(x_j),    (5)\n\nwhich admits π(θ) as a marginal. In the original pseudo-marginal method, the extended-target is sampled from using MCMC, with an independent proposal for x (corresponding to importance sampling for these variables) and a standard MCMC proposal (e.g., random-walk) used for θ.\nThere are two key differences between pseudo-marginal MCMC and pseudo-extended MCMC. Firstly, we do not distinguish between latent variables and parameters, and simply view all unknown variables, or parameters, of interest as being part of x. Secondly, we do not use an importance-sampling-based proposal to sample x, but instead, we propose to simulate directly from the pseudo-extended target eq. (2) using HMC, as explained in Section 2.1. An important consequence of this is that we can use instrumental distributions q(·) without needing to sample from them. In Section 3 we exploit this fact to construct the instrumental distribution by tempering.\nIn summary, the pseudo-marginal framework is a powerful technique for sampling from models with intractable likelihoods. The pseudo-extended method, on the other hand, is designed for sampling from complex target distributions, where the landscape of the target is difficult for standard MCMC samplers to traverse without an exhaustive number of MCMC iterations. 
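The unbiasedness of the importance-sampling estimate underlying eq. (5) can be checked on a toy model where the marginal is known exactly. Everything in the sketch below is hypothetical: a one-dimensional Gaussian latent-variable model is chosen precisely so that π(θ) has a closed form to compare against.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model: pi(theta, x) = N(x | theta, 1) * N(y | x, 1) with one
# observation y, so the marginal pi(theta) = N(y | theta, 2) is known exactly.
y, theta = 0.2, 0.7

def joint(x):
    return (np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)
            * np.exp(-0.5 * (y - x) ** 2) / np.sqrt(2 * np.pi))

def q(x):
    # Instrumental density N(0, 2^2) for the latent variable
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

# Exact marginal: convolution of two unit-variance Gaussians.
exact = np.exp(-0.25 * (y - theta) ** 2) / np.sqrt(4 * np.pi)

# Unbiased pseudo-marginal estimate with N = 8 importance samples,
# replicated many times to check that it averages to pi(theta).
N = 8
x = rng.normal(0.0, 2.0, size=(50_000, N))
estimates = np.mean(joint(x) / q(x), axis=1)
avg = float(estimates.mean())
```

Each row of `estimates` is one noisy but non-negative and unbiased estimate of π(θ); pseudo-marginal MCMC plugs such estimates into the Metropolis-Hastings ratio in place of the intractable marginal.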
In particular, where the\ntarget distribution is multi-modal, we show that extending the state-space allows our MCMC sampler\nto more easily explore the modes of the target.\n\n3 Tempering targets with instrumental distributions\nIn the case of importance sampling, we would choose an instrumental distribution q(\u00b7) which closely\napproximates the target \u03c0(\u00b7). However, this would assume that we could \ufb01nd a tractable instrumental\ndistribution for q(\u00b7) which i) suf\ufb01ciently covers the support of the target and ii) captures its multi-\nmodality. Approximations, such as the Laplace approximation (Rue et al., 2009) and variational\nmethods (e.g., Bishop (2006), Chapter 10) could be used to choose q(\u00b7), however, such approximations\ntend to be unimodal and not appropriate for approximating a multi-modal target.\nA signi\ufb01cant advantage of the pseudo-extended framework eq. (2) is that it permits a wide range of\npotential instrumental distributions. Unlike standard importance sampling, we also do not require\nq(\u00b7) to be a distribution that we can sample from, only that it can be evaluated point-wise up to\nproportionality. This is a simpler condition to satisfy and allows us to \ufb01nd better instrumental\ndistributions for connecting the modes of the target. In this paper, we utilize a simple approach for\nchoosing the instrumental distribution which does not require a closed-form approximation of the\ntarget. Speci\ufb01cally, we create an instrumental distribution by tempering the target.\nTempering has previously been utilized in the MCMC literature to improve the sampling of multi-\nmodal targets. 
Here we use a technique inspired by Graham and Storkey (2017) (see Section 3), where we consider the family of approximating distributions\n\nΠ := { π_β(x) = γ_β(x)/Z(β) : β ∈ (0, 1] },    (6)\n\nwhere γ_β(x) = exp{−βφ(x)} can be evaluated point-wise and Z(β) is typically intractable.\nWe will construct an extended target distribution π_N(x_{1:N}, β_{1:N}) on X^N × (0, 1]^N with N pairs (x_i, β_i), for i = 1, . . . , N. This target distribution will be constructed in such a way that the marginal distribution of each x_i is a mixture, with components selected from Π. This will typically make the marginal distribution more diffuse than the target π itself, encouraging better mixing.\nIf we let q(x, β) = π_β(x)q(β) and choose q(β) = Z(β)g(β)/C, where g(β) can be evaluated point-wise and C is a normalizing constant, then we can cancel the intractable normalizing constants Z(β),\n\nq(x, β) = γ_β(x)g(β)/C.    (7)\n\nThe joint instrumental q(x, β) does not admit a closed-form expression and in general we cannot sample from it. However, we do not need to sample from it, as we instead use an MCMC algorithm on the extended-target which only requires that q(x, β) can be evaluated point-wise, up to a constant of proportionality. Under the instrumental proposal eq. (7), the pseudo-extended target eq. (2) is now\n\nπ_N(x_{1:N}, β_{1:N}) := (1/N) ∑_{i=1}^N π(x_i)π(β_i) ∏_{j≠i} q(x_j, β_j) = (1/(Z C^{N−1})) { (1/N) ∑_{i=1}^N γ(x_i)π(β_i) / (γ_{β_i}(x_i)g(β_i)) } ∏_{j=1}^N γ_{β_j}(x_j)g(β_j),    (8)\n\nwhere π(β) is some arbitrary user-chosen target distribution for β. 
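In code, the tempered construction above amounts to a few log-density computations. The sketch below assumes flat π(β) and g(β) (the choices used in the experiments of Section 4) and an illustrative double-well potential φ; all names are hypothetical.

```python
import numpy as np

# Illustrative double-well potential phi with modes at x = -2 and x = 2.
def phi(x):
    return (x ** 2 - 4.0) ** 2 / 4.0

# log q(x, beta) of eq. (7), up to the constant C, with flat g(beta).
def log_q(x, beta):
    return -beta * phi(x)

# log pi_N(x_{1:N}, beta_{1:N}) of eq. (8) up to an additive constant, with
# flat pi(beta): a stable logsumexp over the N "target" assignments, plus the
# sum of log-instrumental terms.
def log_pseudo_extended(x, beta):
    x, beta = np.asarray(x, float), np.asarray(beta, float)
    log_ratio = -phi(x) - log_q(x, beta)   # log gamma(x_i) - log q(x_i, beta_i)
    m = log_ratio.max()
    lse = m + np.log(np.mean(np.exp(log_ratio - m)))
    return lse + log_q(x, beta).sum()
```

With N = 1 and β = 1 this collapses to the original log-target −φ(x), as noted below eq. (2); a pseudo-sample sitting in the low-density barrier region is penalised far less when its temperature is small, which is what lets the sampler bridge the modes.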
Through our choice of q(x, \u03b2),\nthe normalizing constants for the target and instrumental distributions, Z and C respectively are not\ndependent on x or \u03b2 and so cancel in the Metropolis-Hastings ratio.\n\nRelated work on tempered MCMC\n\nTempered MCMC is the most popular approach to sampling from multi-modal target distributions\n(see Jasra et al. (2007) for a full review). The main idea behind tempered MCMC is to sample from a\nsequence of tempered targets,\n\n\u03c0k(x) \u221d exp{\u2212\u03b2k\u03c6(x)} ,\n\nk = 1, . . . , K,\n\nwhere \u03b2k is a tuning parameter referred to as the temperature that is associated with \u03c0k(x). A\nsequence of temperatures, commonly known as the ladder, is chosen a priori, where 0 = \u03b21 < \u03b22 <\n. . . < \u03b2K = 1. The intuition behind tempered MCMC is that when \u03b2k is small, the modes of the\ntarget are \ufb02attened out making it easier for the MCMC sampler to traverse through the regions of low\ndensity separating the modes. One of the most popular tempering algorithms is parallel tempering\n(PT) (Geyer, 1991), where in parallel, K separate MCMC algorithms are run with each sampling\nfrom one of the tempered targets \u03c0k(x). Samples from neighboring Markov chains are exchanged\n(i.e. sample from chain k exchanged with chain k \u2212 1 or k + 1) using a Metropolis-Hastings step.\nThese exchanges improve the convergence of the Markov chain to the target of interest \u03c0(x), however,\ninformation from low \u03b2k targets is often slow to traverse up the temperature ladder. 
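The parallel-tempering scheme just described can be sketched compactly. The potential, temperature ladder, random-walk step size and iteration count below are illustrative choices, not the settings used in any of the cited papers.

```python
import math
import random

random.seed(0)

# Double-well potential phi, giving tempered targets pi_k(x) ∝ exp{-beta_k * phi(x)}.
def phi(x):
    return (x * x - 4.0) ** 2 / 4.0

betas = [0.1, 0.4, 0.7, 1.0]        # temperature ladder; beta_K = 1 is the target
chains = [0.0] * len(betas)          # one random-walk chain per temperature

def mh_step(x, beta, step=1.0):
    # Random-walk Metropolis step on pi_beta; log-acceptance is -beta * (phi(y) - phi(x)).
    y = x + random.gauss(0.0, step)
    if math.log(random.random()) < -beta * (phi(y) - phi(x)):
        return y
    return x

samples = []
for it in range(20_000):
    chains = [mh_step(x, b) for x, b in zip(chains, betas)]
    # Propose swapping the states of a random pair of neighbouring temperatures.
    k = random.randrange(len(betas) - 1)
    log_a = (betas[k] - betas[k + 1]) * (phi(chains[k]) - phi(chains[k + 1]))
    if math.log(random.random()) < log_a:
        chains[k], chains[k + 1] = chains[k + 1], chains[k]
    samples.append(chains[-1])       # the beta = 1 chain targets pi itself
```

The hot chains cross the barrier freely and the swap moves ferry those states up the ladder, so the β = 1 chain visits both wells; as the surrounding text notes, tuning the ladder, step sizes and swap schedule is exactly what makes PT laborious in practice.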
There is also\na serial version of this algorithm, known as simulated tempering (ST) (Marinari and Parisi, 1992).\nAn alternative approach is annealed importance sampling (AIS) (Neal, 2001), which draws samples\nfrom a simple base distribution and then, via a sequence of intermediate transition densities, moves\nthe samples along the temperature ladder giving a weighted sample from the target distribution.\nGenerally speaking, these tempered approaches can be very dif\ufb01cult to apply in practice often\nrequiring extensive tuning. In the case of PT, the user needs to choose the number of parallel chains\nK, temperature schedule, step-size for each chain and the number of exchanges at each iteration.\nOur proposed tempering scheme is closely related to the continuously-tempered HMC algorithm\nof Graham and Storkey (2017). They propose to run HMC on a distribution similar to eq. (7) and\nthen apply an importance weighting as a post-correction to account for the different temperatures.\nIt thus has some resemblance with ST, in the sense that a single chain is used to explore the state\nspace for different temperature levels. On the contrary, for our proposed pseudo-extended method,\nthe distribution eq. (7) is not used as a target, but merely as an instrumental distribution to construct\nthe pseudo-extended target eq. (8). The resulting method, therefore, has some resemblance with PT,\nsince we propagate N pseudo-samples in parallel, all possibly exploring different temperature levels.\nFurthermore, by mixing in part of the actual target \u03c0 we ensure that the samples do not simultaneously\n\u201cdrift away\u201d from regions with high probability under \u03c0.\n\n5\n\n\fGraham and Storkey (2017) propose to use a variational approximation to the target, both when\nde\ufb01ning the family of distributions eq. (6) and for choosing the function g(\u03b2). 
This is also possible\nwith the pseudo-extended method, but we do not consider this possibility here for brevity. Finally, we\nnote that in the pseudo-extended method the temperature parameter \u03b2 can be estimated as part of the\nMCMC scheme, rather than pre-tuning it as a sequence of \ufb01xed temperatures. This is advantageous\nbecause using a coarse grid of temperatures can cause the sampler to miss modes of the target,\nwhereas a \ufb01ne grid of temperatures leads to a signi\ufb01cantly increased computational cost of running\nthe sampler.\n\n4 Experiments\n\nWe compare the pseudo-extended method on three test models. The \ufb01rst two (Sections 4.1 and 4.2)\nare chosen to show how the pseudo-extended method performs on simulated data when the target is\nmulti-modal. The third example (Section 4.3) is a sparsity-inducing logistic regression model, where\nmulti-modality occurs in the posterior from three real-world datasets. We compare against popular\ncompeting algorithms from the literature, including methods discussed in Section 3.\nAll simulations for the pseudo-extended method use the tempered instrumental distribution and thus\nthe pseudo-extended target is given by eq. (8). For each simulation study, we set \u03c0(\u03b2) \u221d 1, g(\u03b2) \u221d 1\nand use a logit transformation for \u03b2 to map the parameters onto the unconstrained space. Additionally,\nwe consider the special case of pseudo-extended HMC where \u03b2 is \ufb01xed along a temperature ladder\n(akin to parallel tempering). The pseudo-extended HMC method is implemented within STAN 1\n\n4.1 Mixture of Gaussians\n\nBackground: We consider a popular example from the literature (Kou et al., 2006; Tak et al., 2016),\nwhere the target is a mixture of 20 bivariate Gaussians,\n\nand where {\u00b51, \u00b52, . . . , \u00b520} are speci\ufb01ed in Kou et al. (2006). 
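This mixture target can be assembled in a few lines. The sketch below uses the distance-dependent weights and variances of Scenario (b) described in the Setup below; the actual mode locations {µ_1, . . . , µ_20} are specified in Kou et al. (2006), so random placeholders in [0, 10]^2 are used here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder mode locations standing in for the mu's of Kou et al. (2006).
mu = rng.uniform(0.0, 10.0, size=(20, 2))

# Scenario (b): modes far from (5, 5) get lower weight and larger variance.
dist = np.linalg.norm(mu - np.array([5.0, 5.0]), axis=1)
w = 1.0 / dist
w /= w.sum()              # normalise the mixture weights
sigma2 = dist / 20.0      # sigma_j^2 = ||mu_j - (5, 5)^T|| / 20

def log_pi(x):
    # Log-density of the 20-component bivariate Gaussian mixture, evaluated
    # stably via logsumexp over the components.
    d2 = np.sum((np.asarray(x) - mu) ** 2, axis=1)
    log_comp = np.log(w) - np.log(2.0 * np.pi * sigma2) - 0.5 * d2 / sigma2
    m = log_comp.max()
    return float(m + np.log(np.exp(log_comp - m).sum()))
```

Only `log_pi` (and its gradient, via automatic differentiation in practice) is needed by HMC or by the pseudo-extended construction, since both work with the target up to proportionality.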
We compare the pseudo-extended sampler against parallel tempering (PT) (Geyer, 1991), repelling-attracting Metropolis (RAM) (Tak et al., 2016) and the equi-energy (EE) MCMC sampler (Kou et al., 2006), all of which are designed for sampling from multi-modal distributions.\nSetup: We consider two simulation settings. In Scenario (a) each mixture component has weight w_j = 1/20 and variance σ_j^2 = 1/100, resulting in well-separated modes with most modes more than 15 standard deviations apart. In Scenario (b) the weights w_j = 1/||µ_j − (5, 5)^⊤|| and variances σ_j^2 = ||µ_j − (5, 5)^⊤||/20 are unequal, where the modes far from (5, 5) have a lower weight with larger variance, creating regions of higher density between distant modes (see Figure 2 with further discussion in the Supplementary Material). The target density is\n\nπ(x) = ∑_{j=1}^{20} (w_j/(2πσ_j^2)) exp{ −(1/(2σ_j^2)) (x − µ_j)^⊤(x − µ_j) }.\n\nFigure 2: 10,000 samples drawn from the target under scenario (a) (left) and scenario (b) (right) using the HMC and pseudo-extended HMC samplers.\n\nResults: Table 1 gives the root mean squared error (RMSE) of the Monte Carlo estimates, over 20 independent simulations, for the first and second moments. Each sampler was run for 50,000 iterations (after burn-in) and the specific tuning details for the temperature ladder of PT and the energy rings for EE are given in Kou et al. (2006). \n\n1https://github.com/chris-nemeth/pseudo-extended-mcmc-code\n\n
All the samplers perform worse under Scenario (a), where the modes are well-separated: the HMC sampler is only able to explore the modes locally clustered together, whereas the pseudo-extended HMC sampler is able to explore all of the modes with the same number of iterations (see Section C of the Supplementary Material for posterior plots). Under Scenario (b), there is a higher-density region separating the modes, making it easier for the HMC sampler to move between the mixture components. While not reported here, the HMC samplers produce Markov chains with significantly reduced auto-correlation compared to the EE and RAM samplers, which both rely on random-walk updates. We note from Table 1 that increasing the number of pseudo-samples leads to improved estimates, but at an increased computational cost. In the Supplementary Material we show that when accounting for computational cost, the optimal number of pseudo-samples is 2 ≤ N ≤ 5. Additionally, we can fix rather than estimate β, and Table 2 in the Supplementary Material shows that this can lead to a small improvement in RMSE if β is correctly tuned, but can also (and often does) lead to poorer RMSE if β is not well tuned. The conclusion therefore is that it is better to jointly estimate π_N(x_{1:N}, β_{1:N}) in the absence of a priori knowledge of an optimal β.\n\nTable 1: Root mean-squared error of moment estimates for two mixture scenarios. 
Results are calculated over 20 independent simulations and reported to two decimal places.\n\nMethod | Scenario (a): E[X1], E[X2], E[X1^2], E[X2^2] | Scenario (b): E[X1], E[X2], E[X1^2], E[X2^2]\nRAM | 0.09, 0.10, 0.90, 1.30 | 0.04, 0.04, 0.26, 0.34\nEE | 0.11, 0.14, 1.14, 1.48 | 0.07, 0.09, 0.75, 0.84\nPT | 0.18, 0.28, 1.82, 2.89 | 0.12, 0.13, 1.15, 1.22\nHMC | 2.69, 3.96, 24.69, 33.65 | 0.27, 0.51, 3.12, 4.80\nPE (N=2) | 0.11, 0.10, 1.11, 1.01 | 0.05, 0.08, 0.46, 0.86\nPE (N=5) | 0.04, 0.05, 0.37, 0.45 | 0.04, 0.02, 0.18, 0.36\nPE (N=10) | 0.03, 0.03, 0.28, 0.23 | 0.02, 0.02, 0.10, 0.32\nPE (N=20) | 0.02, 0.02, 0.15, 0.21 | 0.03, 0.01, 0.15, 0.23\n\n4.2 Boltzmann machine relaxations\n\nBackground: Sampling from a Boltzmann machine distribution (Jordan et al., 1999) is a challenging inference problem from the statistical physics literature. The probability mass function,\n\nP(s) = (1/Z_b) exp{ (1/2) s^⊤Ws + s^⊤b }, with Z_b = ∑_{s∈S} exp{ (1/2) s^⊤Ws + s^⊤b },    (9)\n\nis defined on the binary space s ∈ {−1, 1}^{d_b} := S, where W is a d_b × d_b real symmetric matrix and b ∈ R^{d_b} are the model parameters. Sampling from this distribution typically requires Gibbs steps (Geman and Geman, 1984), which tend to mix very poorly as the states can be strongly correlated when the Boltzmann machine has high levels of connectivity (Salakhutdinov, 2010). HMC methods have been shown to perform significantly better than Gibbs sampling when the states of the target distribution are highly correlated (Girolami and Calderhead, 2011). Unfortunately, HMC is generally restricted to sampling on continuous spaces. Using the Gaussian integral trick (Hertz et al., 1991), we introduce auxiliary variables x ∈ R^d and transform the problem to sampling from π(x) rather than eq. 
(9) (see Section D in the Supplementary Material for full details).\nSetup: We let b ∼ N(0, 0.1^2) and set W = R diag(e) R^⊤, with diagonal elements set to zero, and simulate a d_b × d_b random orthogonal matrix for R (Stewart, 1980). Here e is a vector of eigenvalues, with e_i = λ_1 tanh(λ_2 η_i) and η_i ∼ N(0, 1), for i = 1, 2, . . . , d_b. We set d_b = 28 (d = 27) and let (λ_1, λ_2) = (6, 2), as these settings have been shown to produce highly multi-modal distributions (see Figure 3 for an example). We compare the HMC and pseudo-extended (PE) HMC algorithms against annealed importance sampling (AIS), simulated tempering (ST), and the continuously-tempered HMC algorithm of Graham and Storkey (2017) (GS). Full set-up details are given in the Supplementary Material.\nResults: We can analytically derive the first two moments of the Boltzmann distribution (see Section D of the Supplementary Material for details), and in Figure 4 we give the RMSE of the moment approximations taken over 10 independent runs.\n\nFigure 3: Two-dimensional projection of 10,000 samples drawn from the target using each of the proposed methods, where the first plot gives the ground-truth sampled directly from the Boltzmann machine relaxation distribution. A temperature ladder of length 1,000 was used for both simulated tempering and annealed importance sampling.\n\nFigure 4: Root mean squared error (log scale) of the first and second moment of the target taken over 10 independent simulations and calculated for each of the proposed methods. Results labeled [0.1–0.9] correspond to pseudo-extended MCMC with fixed β = [0.1–0.9].\n\nThese results support the conclusion that better exploration of the target space leads to improved estimation of integrals of interest. Additionally, we note that fixing β can produce lower RMSE for PE as we reduce the number of parameters that need to be estimated. 
However, fixing β poorly (e.g. β = 0.1 in this case) can lead to an increase in RMSE, whereas estimating β as part of the inference procedure gives a balanced RMSE result. Further simulations are given in the Supplementary Material, which include plots of posterior samples and the effect of varying the number of pseudo-samples. When taking into account the computational cost, the RMSE is minimized when 2 ≤ N ≤ 5, which corroborates the conclusion from the mixture of Gaussians example (Section 4.1).\n\n4.3 Sparse logistic regression with horseshoe priors\n\nBackground: We apply the pseudo-extended approach to the problem of sparse Bayesian inference. This is a common problem in statistics and machine learning, where the number of parameters to be estimated is much larger than the number of data points used to fit the model. Taking a Bayesian approach, we can use shrinkage priors to shrink model parameters to zero and prevent the model from over-fitting to the data. There is a range of shrinkage priors presented in the literature (Griffin and Brown, 2013) and here we use the horseshoe prior (Carvalho et al., 2010), in particular, the regularized horseshoe as proposed by Piironen and Vehtari (2017). From a sampling perspective, sparse Bayesian inference can be challenging as the posterior distributions are naturally multi-modal, where there is a spike at zero (indicating that the variable is inactive) and some posterior mass centered away from zero.\n\nFigure 5: Log-predictive densities on held-out test data (random 20% of full data) for two cancer datasets (Colon and Leukemia) comparing the HMC and pseudo-extended HMC samplers, with N = 2 and N = 5. 
For the cases with fixed β ∈ {0.25, 0.5, 0.75}, the number of pseudo-samples is N = 2.

Setup and results: Following Piironen and Vehtari (2017), we apply the regularized horseshoe prior to a logistic regression model (see Section E of the Supplementary Material for full details). We apply this model to three real-world data sets using micro-array data for cancer classification (results for the prostate data are given in Section E of the Supplementary Material; see Piironen and Vehtari (2017) for further details regarding the data). We compare the pseudo-extended HMC algorithm against standard HMC and give the log-predictive density on a held-out test dataset in Figure 5. To ensure a fair comparison between HMC and pseudo-extended HMC, we run HMC for 10,000 iterations and reduce the number of iterations of the pseudo-extended algorithms (with N = 2 and N = 5) to give equal total computational cost. The results show an improvement from using the pseudo-extended method, but also a strong performance from standard HMC, which is not surprising in this setting as the posterior density plots (given in the Supplementary Material) show that the posterior modes are close together. As seen in Scenario (b) of Section 4.1, the HMC sampler can usually locate and traverse between modes that are close together. The performance of the pseudo-extended method can be improved using a fixed β, but as noted in the previous examples, β is not known a priori and fixing it incorrectly can lead to poorer results.

5 Discussion

We have introduced the pseudo-extended method as a simple approach for augmenting the target distribution for MCMC sampling. We have shown that the pseudo-extended method can be applied within any general MCMC framework to sample from multi-modal distributions, a challenging scenario for standard MCMC algorithms, and does not require prior knowledge of where, or how many, modes there are in the target.
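For concreteness, the regularized horseshoe prior used in Section 4.3 can be sketched as a prior sampler. This is an illustrative sketch following the formulation of Piironen and Vehtari (2017); the hyperparameter values tau0 (global scale) and c (slab width) are placeholder choices, not those used in the experiments.

```python
import numpy as np

def sample_regularized_horseshoe(p, tau0=0.1, c=2.0, rng=None):
    """Draw p regression coefficients from the regularized horseshoe prior
    of Piironen and Vehtari (2017). tau0 and c are illustrative defaults."""
    rng = np.random.default_rng(rng)
    # Half-Cauchy draws via |Cauchy(0, scale)|.
    lam = np.abs(rng.standard_cauchy(p))        # local scales, C+(0, 1)
    tau = tau0 * np.abs(rng.standard_cauchy())  # global scale, C+(0, tau0)
    # Regularized local scales: lam_tilde ~ lam for small coefficients,
    # but bounded so that tau * lam_tilde <= c (the "slab" part).
    lam_tilde = np.sqrt(c**2 * lam**2 / (c**2 + tau**2 * lam**2))
    # Coefficients beta_j ~ N(0, tau^2 * lam_tilde_j^2).
    beta = tau * lam_tilde * rng.standard_normal(p)
    return beta

beta = sample_regularized_horseshoe(p=100, rng=1)
```

The spike-and-slab-like behaviour described above is visible here: most draws sit near zero, while the heavy-tailed local scales allow occasional large coefficients, capped by the slab width c.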
We have shown that a natural instrumental distribution for q(·) is a tempered version of the target, which has the added benefit of automating the choice of instrumental distribution. Alternative instrumental distributions, and methods for estimating the temperature parameter β, are worthy of further investigation. For example, one could use mixture proposals q_i, where each pseudo-sample is associated with a different proposal. Alternatively, the proposal could be stratified so that the pseudo-samples, through their temperature parameters β_1:N, are encouraged to explore different regions of the parameter space. This could be achieved through the choice of the function g(·) in (7). If we let g(β_1:N) be a Gaussian distribution, then a valid N × N covariance matrix Σ could be chosen by letting Σ_ii = 1 and Σ_ij = −(N − 1)⁻¹, which would induce negative correlation between the pseudo-samples and force the temperatures to be roughly evenly spaced.

Acknowledgements

CN gratefully acknowledges the support of the UK Engineering and Physical Sciences Research Council grants EP/S00159X/1 and EP/R01860X/1. FL is financially supported by the Swedish Research Council (project 2016-04278), by the Swedish Foundation for Strategic Research (project ICA16-0015) and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. MF gratefully acknowledges support from the AXA Research Fund and the Agence Nationale de la Recherche (grant ANR-18-CE46-0002).

References

Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342.

Andrieu, C. and Roberts, G. O. (2009).
The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37:697–725.

Beaumont, M. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.

Calderhead, B. and Girolami, M. (2009). Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis, 53(12):4028–4045.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1–32.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.

Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 156–163. Fairfax.

Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.

Graham, M. M. and Storkey, A. J. (2017). Continuously tempered Hamiltonian Monte Carlo. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, pages 1–12.

Griffin, J. E. and Brown, P. J. (2013). Some priors for sparse regression modelling. Bayesian Analysis, 8(3):691–702.

Hertz, J. A., Krogh, A. S., and Palmer, R. G. (1991).
Introduction to the Theory of Neural Computation, volume 1. Basic Books.

Hoffman, M. and Gelman, A. (2014). The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623.

Jasra, A., Stephens, D. A., and Holmes, C. C. (2007). On population-based simulation for static inference. Statistics and Computing, 17(3):263–279.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.

Kou, S., Zhou, Q., and Wong, W. H. (2006). Equi-energy sampler with applications in statistical inference and statistical mechanics. Annals of Statistics, 34(4):1581–1619.

Marinari, E. and Parisi, G. (1992). Simulated tempering: A new Monte Carlo scheme. Europhysics Letters, 19(6):451–458.

McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. Wiley Series in Probability and Statistics, New York.

Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11:125–139.

Neal, R. M. (2010). MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo (Chapman & Hall/CRC Handbooks of Modern Statistical Methods), pages 113–162.

Neal, R. M. (2012). Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media.

Piironen, J. and Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018–5051.

Roberts, G. and Tweedie, R. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363.

Roberts, G. O., Gelman, A., and Gilks, W. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120.

Rue, H., Martino, S., and Chopin, N.
(2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392.

Salakhutdinov, R. (2010). Learning deep Boltzmann machines using adaptive MCMC. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), volume 10, pages 943–950.

Stewart, G. W. (1980). The efficient generation of random orthogonal matrices with an application to condition estimators. SIAM Journal on Numerical Analysis, 17(3):403–409.

Tak, H., Meng, X.-L., and van Dyk, D. A. (2016). A repulsive-attractive Metropolis algorithm for multimodality. arXiv preprint arXiv:1601.05633.

Zhang, Y., Sutton, C., Storkey, A., and Ghahramani, Z. (2012). Continuous relaxations for discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems 25, pages 3194–3202.