{"title": "Hamiltonian Variational Auto-Encoder", "book": "Advances in Neural Information Processing Systems", "page_first": 8167, "page_last": 8177, "abstract": "Variational Auto-Encoders (VAE) have become very popular techniques to perform\ninference and learning in latent variable models as they allow us to leverage the rich\nrepresentational power of neural networks to obtain flexible approximations of the\nposterior of latent variables as well as tight evidence lower bounds (ELBO). Com-\nbined with stochastic variational inference, this provides a methodology scaling to\nlarge datasets. However, for this methodology to be practically efficient, it is neces-\nsary to obtain low-variance unbiased estimators of the ELBO and its gradients with\nrespect to the parameters of interest. While the use of Markov chain Monte Carlo\n(MCMC) techniques such as Hamiltonian Monte Carlo (HMC) has been previously\nsuggested to achieve this [23, 26], the proposed methods require specifying reverse\nkernels which have a large impact on performance. Additionally, the resulting\nunbiased estimator of the ELBO for most MCMC kernels is typically not amenable\nto the reparameterization trick. We show here how to optimally select reverse\nkernels in this setting and, by building upon Hamiltonian Importance Sampling\n(HIS) [17], we obtain a scheme that provides low-variance unbiased estimators of\nthe ELBO and its gradients using the reparameterization trick. This allows us to\ndevelop a Hamiltonian Variational Auto-Encoder (HVAE). This method can be\nre-interpreted as a target-informed normalizing flow [20] which, within our context,\nonly requires a few evaluations of the gradient of the sampled likelihood and trivial\nJacobian calculations at each iteration.", "full_text": "Hamiltonian Variational Auto-Encoder\n\nAnthony L. 
Caterini¹, Arnaud Doucet¹,², Dino Sejdinovic¹,²

{anthony.caterini, doucet, dino.sejdinovic}@stats.ox.ac.uk

¹Department of Statistics, University of Oxford
²Alan Turing Institute for Data Science

Abstract

Variational Auto-Encoders (VAEs) have become very popular techniques to perform inference and learning in latent variable models: they allow us to leverage the rich representational power of neural networks to obtain flexible approximations of the posterior of latent variables as well as tight evidence lower bounds (ELBOs). Combined with stochastic variational inference, this provides a methodology scaling to large datasets. However, for this methodology to be practically efficient, it is necessary to obtain low-variance unbiased estimators of the ELBO and its gradients with respect to the parameters of interest. While the use of Markov chain Monte Carlo (MCMC) techniques such as Hamiltonian Monte Carlo (HMC) has been previously suggested to achieve this [25, 28], the proposed methods require specifying reverse kernels which have a large impact on performance. Additionally, the resulting unbiased estimator of the ELBO for most MCMC kernels is typically not amenable to the reparameterization trick. We show here how to optimally select reverse kernels in this setting and, by building upon Hamiltonian Importance Sampling (HIS) [19], we obtain a scheme that provides low-variance unbiased estimators of the ELBO and its gradients using the reparameterization trick. This allows us to develop a Hamiltonian Variational Auto-Encoder (HVAE).
This method can be re-interpreted as a target-informed normalizing flow [22] which, within our context, only requires a few evaluations of the gradient of the sampled likelihood and trivial Jacobian calculations at each iteration.

1 Introduction

Variational Auto-Encoders (VAEs), introduced by Kingma and Welling [15] and Rezende et al. [23], are popular techniques to carry out inference and learning in complex latent variable models. However, the standard mean-field parametrization of the approximate posterior distribution can limit its flexibility. Recent work has sought to augment the VAE approach by sampling from the VAE posterior approximation and transforming these samples through mappings with additional trainable parameters to achieve richer posterior approximations. The most popular application of this idea is the Normalizing Flows (NFs) approach [22], in which the samples are deterministically evolved through a series of parameterized invertible transformations called a flow. NFs have demonstrated success in various domains [2, 16], but the flows do not explicitly use information about the target posterior. Therefore, it is unclear whether the improved performance is caused by an accurate posterior approximation or is simply a result of overparametrization. The related Hamiltonian Variational Inference (HVI) [25] instead stochastically evolves the base samples according to Hamiltonian Monte Carlo (HMC) [20] and thus uses target information, but relies on defining reverse dynamics in the flow, which, as we will see, turns out to be unnecessary and suboptimal.

One of the key components in the formulation of VAEs is the maximization of the evidence lower bound (ELBO). The main idea put forward in Salimans et al. [25] is that it is possible to use K MCMC iterations to obtain an unbiased estimator of the ELBO and its gradients.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

This estimator is obtained using an importance sampling argument on an augmented space, with the importance distribution being the joint distribution of the K + 1 states of the 'forward' Markov chain, while the augmented target distribution is constructed using a sequence of 'reverse' Markov kernels such that it admits the original posterior distribution as a marginal. The performance of this estimator is strongly dependent on the selection of these forward and reverse kernels, but no clear guideline for selection has been provided therein. By linking this approach to earlier work by Del Moral et al. [6], we show how to select these components. We focus, in particular, on the use of time-inhomogeneous Hamiltonian dynamics, proposed originally in Neal [19]. This method uses reverse kernels which are optimal for reducing the variance of the likelihood estimators and allows for simple calculation of the approximate posteriors of the latent variables. Additionally, we can easily use the reparameterization trick to calculate unbiased gradients of the ELBO with respect to the parameters of interest. The resulting method, which we refer to as the Hamiltonian Variational Auto-Encoder (HVAE), can be thought of as a Normalizing Flow scheme in which the flow depends explicitly on the target distribution. This combines the best properties of HVI and NFs, resulting in target-informed and inhomogeneous deterministic Hamiltonian dynamics, while being scalable to large datasets and high dimensions.

2 Evidence Lower Bounds, MCMC and Hamiltonian Importance Sampling

2.1 Unbiased likelihood and evidence lower bound estimators

For data x ∈ X ⊆ ℝ^d and parameter θ ∈ Θ, consider the likelihood function

  p_θ(x) = ∫ p_θ(x, z) dz = ∫ p_θ(x|z) p_θ(z) dz,

where z ∈ Z are some latent variables.
If we assume we have access to a strictly positive unbiased estimate of p_θ(x), denoted p̂_θ(x), then

  ∫ p̂_θ(x) q_{θ,φ}(u|x) du = p_θ(x),   (1)

with u ∼ q_{θ,φ}(·), u ∈ U denoting all the random variables used to compute p̂_θ(x). Here, φ denotes additional parameters of the sampling distribution. We emphasize that p̂_θ(x) depends on both u and potentially φ, although this is not made explicit in the notation. By applying Jensen's inequality to (1), we thus obtain, for all θ and φ,

  L_ELBO(θ, φ; x) := ∫ log p̂_θ(x) q_{θ,φ}(u|x) du ≤ log p_θ(x) =: L(θ; x).   (2)

It can be shown that |L_ELBO(θ, φ; x) − L(θ; x)| decreases as the variance of p̂_θ(x) decreases; see, e.g., [3, 17]. The standard variational framework corresponds to U = Z and p̂_θ(x) = p_θ(x, z)/q_{θ,φ}(z|x), while the Importance Weighted Auto-Encoder (IWAE) [3] with L importance samples corresponds to U = Z^L, q_{θ,φ}(u|x) = ∏_{i=1}^{L} q_{θ,φ}(z_i|x) and p̂_θ(x) = (1/L) ∑_{i=1}^{L} p_θ(x, z_i)/q_{θ,φ}(z_i|x).

In the general case, we do not have an analytical expression for L_ELBO(θ, φ; x). When performing stochastic gradient ascent for variational inference, however, we only require an unbiased estimator of ∇_θ L_ELBO(θ, φ; x). This is given by ∇_θ log p̂_θ(x) if the reparameterization trick [8, 15] is used, i.e. q_{θ,φ}(u|x) = q(u), and p̂_θ(x) is a 'smooth enough' function of u. As a guiding principle, one should attempt to obtain a low-variance estimator of p_θ(x), which typically translates into a low-variance estimator of ∇_θ L_ELBO(θ, φ; x).
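As a numerical illustration of these identities (our own toy sketch, not from the paper: a hypothetical one-dimensional model with z ∼ N(0, 1) and x|z ∼ N(z, 1), so that p(x) = N(x; 0, 2) is available in closed form), one can check that the importance-weighted estimator is unbiased for p_θ(x) while its logarithm lower-bounds log p_θ(x) on average:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    # log N(x | mean, var)
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = 1.0                                   # observed datapoint
log_p_true = log_normal(x, 0.0, 2.0)      # z ~ N(0,1), x|z ~ N(z,1)  =>  p(x) = N(x; 0, 2)

n_reps, L = 20000, 10                     # n_reps independent IWAE estimators, L samples each
z = rng.normal(loc=x / 2, scale=1.0, size=(n_reps, L))   # proposal q(z|x) = N(x/2, 1)
log_w = (log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(x, z)
         - log_normal(z, x / 2, 1.0))                     # - log q(z|x)
p_hat = np.exp(log_w).mean(axis=1)        # one unbiased estimate of p(x) per row

elbo = np.log(p_hat).mean()               # Monte Carlo estimate of L_ELBO

print(np.exp(log_p_true), p_hat.mean())   # unbiasedness: the two means agree
print(log_p_true, elbo)                   # Jensen: elbo falls below log p(x)
```

Increasing L shrinks the variance of p̂ and hence the Jensen gap, which is the mechanism the guiding principle above appeals to.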
We can analogously optimize L_ELBO(θ, φ; x) with respect to φ through stochastic gradient ascent to obtain tighter bounds.

2.2 Unbiased likelihood estimator using time-inhomogeneous MCMC

Salimans et al. [25] propose to build an unbiased estimator of p_θ(x) by sampling a (potentially time-inhomogeneous) 'forward' Markov chain of length K + 1 using z_0 ∼ q⁰_{θ,φ}(·) and z_k ∼ q^k_{θ,φ}(·|z_{k−1}) for k = 1, ..., K. Using artificial 'reverse' Markov transition kernels r^k_{θ,φ}(z_k|z_{k+1}) for k = 0, ..., K − 1, it follows easily from an importance sampling argument that

  p̂_θ(x) = [ p_θ(x, z_K) ∏_{k=0}^{K−1} r^k_{θ,φ}(z_k|z_{k+1}) ] / [ q⁰_{θ,φ}(z_0) ∏_{k=1}^{K} q^k_{θ,φ}(z_k|z_{k−1}) ]   (3)

is an unbiased estimator of p_θ(x) as long as the ratio in (3) is well-defined. In the framework of the previous section, we have U = Z^{K+1} and q_{θ,φ}(u|x) is given by the denominator of (3). Although we did not use measure-theoretic notation, the kernels q^k_{θ,φ} are typically MCMC kernels which do not admit a density with respect to the Lebesgue measure (e.g. the Metropolis–Hastings kernel). This makes it difficult to define reverse kernels for which (3) is well-defined, as evidenced in Salimans et al. [25, Section 4] or Wolf et al. [28]. The estimator (3) was originally introduced in Del Moral et al.
[6], where generic recommendations are provided for this estimator to admit a low relative variance: select q^k_{θ,φ} as MCMC kernels which are invariant, or approximately invariant as in [9], with respect to p^k_θ(x, z_k), where

  p^k_{θ,φ}(z|x) ∝ [q⁰_{θ,φ}(z)]^{1−β_k} [p_θ(x, z)]^{β_k}

is a sequence of artificial densities bridging q⁰_{θ,φ}(z) to p_θ(z|x) smoothly using β_0 = 0 < β_1 < ··· < β_{K−1} < β_K = 1. It is also established in Del Moral et al. [6] that, given any sequence of kernels {q^k_{θ,φ}}_k, the sequence of reverse kernels minimizing the variance of p̂_θ(x) is given by r^{k,opt}_{θ,φ}(z_k|z_{k+1}) = q^k_{θ,φ}(z_k) q^{k+1}_{θ,φ}(z_{k+1}|z_k) / q^{k+1}_{θ,φ}(z_{k+1}), where q^k_{θ,φ}(z_k) denotes the marginal density of z_k under the forward dynamics, yielding

  p̂_θ(x) = p_θ(x, z_K) / q^K_{θ,φ}(z_K).   (4)

For stochastic forward transitions, it is typically not possible to compute r^{k,opt}_{θ,φ} and the corresponding estimator (4), as the marginal densities q^k_{θ,φ}(z_k) do not admit closed-form expressions. However, this suggests that r^k_{θ,φ} should be approximating r^{k,opt}_{θ,φ}, and various schemes are presented in [6]. As noticed by Del Moral et al. [6] and Salimans et al. [25], Annealed Importance Sampling (AIS) [18] – also known as the Jarzynski–Crooks identity ([4, 12]) in physics – is a special case of (3) using, for q^k_{θ,φ}, a p^k_θ(z|x)-invariant MCMC kernel and the reversal of this kernel as the reverse transition kernel r^{k−1}_{θ,φ}.¹ This choice of reverse kernels is suboptimal but leads to a simple expression for estimator (3). AIS provides state-of-the-art estimators of the marginal likelihood and has been widely used in machine learning.
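To make the AIS special case concrete, here is a minimal sketch (our own toy illustration, not the paper's code): estimating the normalizing constant of an unnormalized one-dimensional Gaussian target, with geometric bridging densities between a standard-normal prior and the target, and random-walk Metropolis transitions at each temperature.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalized target f(z) = exp(-(z - 1)^2 / (2 * 0.25)); true normalizer Z = sqrt(2*pi*0.25)
mu, var = 1.0, 0.25
log_f = lambda z: -0.5 * (z - mu) ** 2 / var
log_prior = lambda z: -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)   # N(0, 1), normalized

def log_gamma(z, beta):
    # bridging density gamma_k ∝ prior^(1 - beta_k) * f^beta_k
    return (1 - beta) * log_prior(z) + beta * log_f(z)

K, n = 60, 4000                       # temperatures and parallel chains
betas = np.linspace(0.0, 1.0, K + 1)
z = rng.normal(size=n)                # z_0 ~ prior (beta_0 = 0)
log_w = np.zeros(n)

for k in range(1, K + 1):
    # importance weight increment between consecutive bridging densities
    log_w += (betas[k] - betas[k - 1]) * (log_f(z) - log_prior(z))
    # a couple of random-walk Metropolis sweeps leaving gamma_k invariant
    for _ in range(2):
        prop = z + 0.5 * rng.normal(size=n)
        accept = np.log(rng.uniform(size=n)) < log_gamma(prop, betas[k]) - log_gamma(z, betas[k])
        z = np.where(accept, prop, z)

Z_hat = np.exp(log_w).mean()          # unbiased estimate of ∫ f(z) dz
Z_true = np.sqrt(2 * np.pi * var)
print(Z_hat, Z_true)
```

The weight increments are exactly the telescoping ratios of (3) under the AIS choice of reverse kernels; poor mixing of the MCMC sweeps inflates the variance of `Z_hat` but never biases it.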
Unfortunately, it typically cannot be used in conjunction with the reparameterization trick. Indeed, although it is very often possible to reparameterize the forward simulation of (z_1, ..., z_K) in terms of a deterministic transformation of some random variables u ∼ q independent of θ and φ, this mapping is not continuous because the MCMC kernels it uses typically include singular components. In this context, although (1) holds, ∇_θ log p̂_θ(x) is not an unbiased estimator of ∇_θ L_ELBO(θ, φ; x); see, e.g., Glasserman [8] for a careful discussion of these issues.

2.3 Using Hamiltonian dynamics

Given the empirical success of Hamiltonian Monte Carlo (HMC) [11, 20], various contributions have proposed to develop algorithms exploiting Hamiltonian dynamics to obtain unbiased estimates of the ELBO and its gradients when Z = ℝ^ℓ. This was proposed in Salimans et al. [25]. However, the algorithm suggested therein relies on a time-homogeneous leapfrog where momentum resampling is performed at each step and no Metropolis correction is used. It also relies on learned reverse kernels. To address the limitations of Salimans et al. [25], Wolf et al. [28] have proposed to include some Metropolis acceptance steps, but they still limit themselves to homogeneous dynamics and their estimator is not amenable to the reparameterization trick. Finally, in Hoffman [10], an alternative approach is used where the gradient of the true likelihood, ∇_θ L(θ; x), is directly approximated by using Fisher's identity and HMC to obtain approximate samples from p_θ(z|x).
However, the MCMC bias can be very significant when one has multimodal latent posteriors, and it is strongly dependent on both the initial distribution and θ.

Here, we follow an alternative approach where we use Hamiltonian dynamics that are time-inhomogeneous as in [6] and [18], and use optimal reverse Markov kernels to compute p̂_θ(x). This estimator can be used in conjunction with the reparameterization trick to obtain an unbiased estimator of ∇L_ELBO(θ, φ; x). This method is based on the Hamiltonian Importance Sampling (HIS) scheme proposed in Neal [19]; one can also find several instances of related ideas in physics [13, 26].

¹The reversal of a μ-invariant kernel K(z′|z) is given by K_rev(z′|z) = μ(z′)K(z|z′)/μ(z). If K is μ-reversible, then K_rev = K.

We work in an extended space (z, ρ) ∈ U := ℝ^ℓ × ℝ^ℓ, introducing momentum variables ρ to pair with the position variables z, with new target p̄_θ(x, z, ρ) := p_θ(x, z) N(ρ|0, I_ℓ). Essentially, the idea is to sample using deterministic transitions q^k_{θ,φ}((z_k, ρ_k)|(z_{k−1}, ρ_{k−1})) = δ_{Φ^k_{θ,φ}(z_{k−1},ρ_{k−1})}(z_k, ρ_k), so that (z_K, ρ_K) = H_{θ,φ}(z_0, ρ_0) := (Φ^K_{θ,φ} ∘ ··· ∘ Φ^1_{θ,φ})(z_0, ρ_0), where (z_0, ρ_0) ∼ q⁰_{θ,φ}(·,·) and the (Φ^k_{θ,φ})_{k≥1} define diffeomorphisms corresponding to a time-discretized and inhomogeneous Hamiltonian dynamics. In this case, it is easy to show that

  q^K_{θ,φ}(z_K, ρ_K) = q⁰_{θ,φ}(z_0, ρ_0) ∏_{k=1}^{K} |det ∇Φ^k_{θ,φ}(z_k, ρ_k)|^{−1}  and  p̂_θ(x) = p̄_θ(x, z_K, ρ_K) / q^K_{θ,φ}(z_K, ρ_K).   (5)

It can also be shown that this is nothing but a special case of (3) (on the extended position-momentum space) using the optimal reverse kernels² r^{k,opt}_{θ,φ}.
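The density identity in (5) is simply the change-of-variables formula for a deterministic flow; a quick numerical sanity check (our own illustration, using a hypothetical diagonal linear map Φ(z, ρ) = (az, bρ) with |det ∇Φ| = ab on a scalar extended space):

```python
import numpy as np

# Phi(z, rho) = (a*z, b*rho): diagonal linear map with |det grad Phi| = a * b
a, b = 2.0, 0.5

def log_q0(z, rho):
    # standard normal base density on the extended space (z, rho), both scalar
    return -0.5 * (z ** 2 + rho ** 2) - np.log(2 * np.pi)

def log_qK(zK, rhoK):
    # pushforward density of (zK, rhoK) = Phi(z0, rho0): invert the map and
    # apply q_K = q_0 * |det grad Phi|^{-1}, as in equation (5)
    return log_q0(zK / a, rhoK / b) - np.log(a * b)

def log_exact(zK, rhoK):
    # exact density of (a*Z, b*R) for independent standard normals Z, R
    return (-0.5 * (zK / a) ** 2 - 0.5 * np.log(2 * np.pi * a ** 2)
            - 0.5 * (rhoK / b) ** 2 - 0.5 * np.log(2 * np.pi * b ** 2))

pt = (0.7, -1.3)
print(log_qK(*pt), log_exact(*pt))  # the two values coincide
```

The same bookkeeping applies to the composed leapfrog-and-tempering maps below, whose Jacobians happen to be constants.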
This setup is similar to that of Normalizing Flows [22], except that here we use a flow informed by the target distribution. Salimans et al. [25] is in fact mentioned in Rezende and Mohamed [22], but the flow therein is homogeneous and yields a high-variance estimator of the normalizing constants even if r^{k,opt}_θ is used, as demonstrated in our simulations in section 4.

Under these dynamics, the estimator p̂_θ(x) defined in (5) can be rewritten as

  p̂_θ(x) = [ p̄_θ(x, H_{θ,φ}(z_0, ρ_0)) / q⁰_{θ,φ}(z_0, ρ_0) ] ∏_{k=1}^{K} |det ∇Φ^k_{θ,φ}(z_k, ρ_k)|.   (6)

Hence, if we can simulate (z_0, ρ_0) ∼ q⁰_{θ,φ}(·,·) using (z_0, ρ_0) = Ψ_{θ,φ}(u), where u ∼ q and Ψ_{θ,φ} is a smooth mapping, then we can use the reparameterization trick since the Φ^k_{θ,φ} are also smooth mappings.

In our case, the deterministic transformation Φ^k_{θ,φ} has two components: a leapfrog step, which discretizes the Hamiltonian dynamics, and a tempering step, which adds inhomogeneity to the dynamics and allows us to explore isolated modes of the target [19]. To describe the leapfrog step, we first define the potential energy of the system as U_θ(z|x) ≡ −log p_θ(x, z) for a single datapoint x ∈ X.
Leapfrog then takes the system from (z, ρ) to (z′, ρ′) via the following transformations:

  ρ̃ = ρ − (ε/2) ⊙ ∇U_θ(z|x),   (7)
  z′ = z + ε ⊙ ρ̃,   (8)
  ρ′ = ρ̃ − (ε/2) ⊙ ∇U_θ(z′|x),   (9)

where ε ∈ (ℝ⁺)^ℓ are the individual leapfrog step sizes per dimension, ⊙ denotes elementwise multiplication, and the gradient of U_θ(z|x) is taken with respect to z. The composition of equations (7)–(9) has unit Jacobian since each equation describes a shear transformation. For the tempering portion, we multiply the momentum output of each leapfrog step by α_k ∈ (0, 1) for k ∈ [K], where [K] ≡ {1, ..., K}. We consider two methods for setting the values α_k. First, fixed tempering involves allowing an inverse temperature β_0 ∈ (0, 1) to vary, and then setting α_k = √(β_{k−1}/β_k), where each β_k is a deterministic function of β_0 and 0 < β_0 < β_1 < ··· < β_K = 1. In the second method, known as free tempering, we allow each of the α_k values to be learned, and then set the initial inverse temperature to β_0 = ∏_{k=1}^{K} α_k². For both methods, the tempering operation has Jacobian α_k^ℓ. We obtain Φ^k_{θ,φ} by composing the leapfrog integrator with the cooling operation, which implies that the Jacobian is given by |det ∇Φ^k_{θ,φ}(z_k, ρ_k)| = α_k^ℓ = (β_{k−1}/β_k)^{ℓ/2}, which in turn implies

  ∏_{k=1}^{K} |det ∇Φ^k_{θ,φ}(z_k, ρ_k)| = ∏_{k=1}^{K} (β_{k−1}/β_k)^{ℓ/2} = β_0^{ℓ/2}.

The only remaining component to specify is the initial distribution.
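The shear structure of (7)–(9) makes the leapfrog map volume-preserving and exactly reversible under momentum negation, and the tempering Jacobians telescope to β_0^{ℓ/2}; both facts are easy to check numerically (our own sketch, assuming the toy potential U(z) = ‖z‖²/2, so ∇U(z) = z):

```python
import numpy as np

def grad_U(z):
    # potential U(z) = ||z||^2 / 2 for a standard normal target, so grad U(z) = z
    return z

def leapfrog(z, rho, eps):
    # equations (7)-(9): momentum half-step, position full step, momentum half-step
    rho_half = rho - 0.5 * eps * grad_U(z)
    z_new = z + eps * rho_half
    rho_new = rho_half - 0.5 * eps * grad_U(z_new)
    return z_new, rho_new

rng = np.random.default_rng(2)
z0, rho0 = rng.normal(size=3), rng.normal(size=3)
eps = 0.1 * np.ones(3)

z1, rho1 = leapfrog(z0, rho0, eps)
# reversibility: run the dynamics backwards by flipping the momentum
z_back, rho_back = leapfrog(z1, -rho1, eps)
print(np.allclose(z_back, z0), np.allclose(-rho_back, rho0))  # True True

# tempering Jacobians telescope: prod_k (beta_{k-1}/beta_k)^{l/2} = beta_0^{l/2}
betas = np.linspace(0.3, 1.0, 6)            # beta_0, ..., beta_K with beta_K = 1
alphas = np.sqrt(betas[:-1] / betas[1:])
ell = 3
print(np.isclose(np.prod(alphas ** ell), betas[0] ** (ell / 2)))  # True
```

This is why the Jacobian calculations in HVAE are trivial: no per-step determinant needs to be accumulated during training.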
We will set q⁰_{θ,φ}(z_0, ρ_0) = q⁰_{θ,φ}(z_0) · N(ρ_0|0, β_0^{−1} I_ℓ), where q⁰_{θ,φ}(z_0) will be referred to as the variational prior over the latent variables and N(ρ_0|0, β_0^{−1} I_ℓ) is the canonical momentum distribution at inverse temperature β_0.

²Since this is a deterministic flow, the density can be evaluated directly. However, direct evaluation corresponds to optimal reverse kernels in the deterministic case.

The full procedure to generate an unbiased estimate of the ELBO from (2) on the extended space U for a single point x ∈ X and fixed tempering is given in Algorithm 1. The set of variational parameters to optimize contains the flow parameters β_0 and ε, along with additional parameters of the variational prior.³ We can see from (6) that we will obtain unbiased gradients with respect to θ and φ from our estimate of the ELBO if we write (z_0, ρ_0) = (z_0, γ_0/√β_0), for z_0 ∼ q⁰_{θ,φ}(·) and γ_0 ∼ N(·|0, I_ℓ) ≡ N_ℓ(·), provided we are not also optimizing with respect to parameters of the variational prior. We will require additional reparameterization when we elect to optimize with respect to the parameters of the variational prior, but this is generally quite easy to implement on a problem-specific basis and is well-known in the literature; see, e.g.
[15, 22, 23] and section 4.

Algorithm 1 Hamiltonian ELBO, Fixed Tempering
Require: p_θ(x, ·) is the unnormalized posterior for x ∈ X and θ ∈ Θ
Require: q⁰_{θ,φ}(·) is the variational prior on ℝ^ℓ
function HIS(x, θ, K, β_0, ε)
    Sample z_0 ∼ q⁰_{θ,φ}(·), γ_0 ∼ N_ℓ(·)
    ρ_0 ← γ_0/√β_0                                        ▷ ρ_0 ∼ N(·|0, β_0^{−1} I_ℓ)
    for k ← 1 to K do                                     ▷ Run K steps of alternating leapfrog and tempering
        ρ̃ ← ρ_{k−1} − ε/2 ⊙ ∇U_θ(z_{k−1}|x)              ▷ Start of leapfrog; Equation (7)
        z_k ← z_{k−1} + ε ⊙ ρ̃                             ▷ Equation (8)
        ρ′ ← ρ̃ − ε/2 ⊙ ∇U_θ(z_k|x)                        ▷ Equation (9)
        √β_k ← ((1 − 1/√β_0) · k²/K² + 1/√β_0)^{−1}        ▷ Quadratic tempering scheme
        ρ_k ← √(β_{k−1}/β_k) · ρ′
    p̄ ← p_θ(x, z_K) N(ρ_K|0, I_ℓ)                         ▷ Equation (5), numerator
    q̄ ← q⁰_{θ,φ}(z_0) N(ρ_0|0, β_0^{−1} I_ℓ) β_0^{−ℓ/2}    ▷ Equation (5), denominator
    L̂ᴴ_ELBO(θ, φ; x) ← log p̄ − log q̄                      ▷ Can take unbiased gradients of this estimate wrt θ, φ
    return L̂ᴴ_ELBO(θ, φ; x)

3 Stochastic Variational Inference

We will now describe how to use Algorithm 1 within a stochastic variational inference procedure, moving to the setting where we have a dataset D = {x_1, ..., x_N} with x_i ∈ X for all i ∈ [N]. In this case, we are interested in finding

  θ* ∈ argmax_{θ∈Θ} E_{x∼ν_D(·)}[L(θ; x)],   (10)

where ν_D(·) ≡ (1/N) ∑_{i=1}^{N} δ_{x_i}(·) is the empirical measure of the data.
We must resort to variational methods since L(θ; x) cannot generally be calculated exactly, and instead maximize the surrogate ELBO objective function

  L_ELBO(θ, φ) ≡ E_{x∼ν_D(·)}[L_ELBO(θ, φ; x)]   (11)

for L_ELBO(θ, φ; x) defined as in (2). We can now turn to stochastic gradient ascent (or a variant thereof) to jointly maximize (11) with respect to θ and φ by approximating the expectation over ν_D(·) using minibatches of observed data.

For our specific problem, we can reduce the variance of the ELBO calculation by analytically evaluating some terms in the expectation (i.e. Rao-Blackwellization) as follows:

  Lᴴ_ELBO(θ, φ; x) = E_{(z_0,ρ_0)∼q⁰_{θ,φ}(·,·)}[ log( p̄_θ(x, z_K, ρ_K) β_0^{ℓ/2} / q⁰_{θ,φ}(z_0, ρ_0) ) ]
                   = E_{z_0∼q⁰_{θ,φ}(·), γ_0∼N_ℓ(·)}[ log p_θ(x, z_K) − (1/2) ρ_Kᵀρ_K − log q⁰_{θ,φ}(z_0) ] + ℓ/2,   (12)

³We avoid reference to a mass matrix M throughout this formulation because we can capture the same effect by optimizing individual leapfrog step sizes per dimension, as pointed out in [20, Section 4.2].

where we write (z_K, ρ_K) = H_{θ,φ}(z_0, γ_0/√β_0) under reparameterization. We can now consider the output of Algorithm 1 as taking a sample from the inner expectation for a given sample x from the outer expectation. Algorithm 2 provides a full procedure to stochastically optimize (12). In practice, we take the gradients of (12) using automatic differentiation packages.
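To make Algorithm 1 concrete, here is a minimal numpy sketch on a hypothetical tractable model (our own illustrative choices, not the paper's code: z ∼ N(0, 1), x|z ∼ N(z, 1), variational prior equal to the true prior, ℓ = 1), checking that exponentiating the returned estimate is unbiased for p(x) while its mean lower-bounds log p(x):

```python
import numpy as np

rng = np.random.default_rng(3)
x = 0.5                                                    # single observed datapoint
log_p_true = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4.0  # p(x) = N(x; 0, 2)

def grad_U(z):
    # U(z|x) = -log p(x, z) with z ~ N(0,1), x|z ~ N(z,1)
    return z + (z - x)

def his_elbo(K=10, beta0=0.5, eps=0.1):
    # Algorithm 1 with the quadratic fixed-tempering scheme; returns one log p-hat draw
    z = rng.normal()                                       # z_0 ~ variational prior N(0, 1)
    rho = rng.normal() / np.sqrt(beta0)                    # rho_0 ~ N(0, 1/beta0)
    log_q = (-0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)                       # q0(z_0)
             - 0.5 * beta0 * rho ** 2 - 0.5 * np.log(2 * np.pi / beta0)    # N(rho_0|0, 1/beta0)
             - 0.5 * np.log(beta0))                        # Jacobian factor beta0^{-l/2}, l = 1
    beta_prev = beta0
    for k in range(1, K + 1):
        rho -= 0.5 * eps * grad_U(z)                       # leapfrog, equations (7)-(9)
        z += eps * rho
        rho -= 0.5 * eps * grad_U(z)
        sqrt_beta_k = 1.0 / ((1 - 1 / np.sqrt(beta0)) * k ** 2 / K ** 2 + 1 / np.sqrt(beta0))
        beta_k = sqrt_beta_k ** 2                          # quadratic tempering scheme
        rho *= np.sqrt(beta_prev / beta_k)
        beta_prev = beta_k
    log_p_bar = (-0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi)   # N(x|z, 1)
                 - 0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)        # N(z|0, 1)
                 - 0.5 * rho ** 2 - 0.5 * np.log(2 * np.pi))     # N(rho_K|0, 1)
    return log_p_bar - log_q

draws = np.array([his_elbo() for _ in range(20000)])
print(np.log(np.mean(np.exp(draws))), log_p_true)   # unbiasedness of p-hat
print(draws.mean(), "<=", log_p_true)               # ELBO lower-bounds log p(x)
```

Replacing numpy with an automatic-differentiation framework (the paper uses TensorFlow) makes `draws.mean()` directly differentiable in the model and flow parameters.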
This is achieved by using TensorFlow [1] in our implementation.

Algorithm 2 Hamiltonian Variational Auto-Encoder
Require: p_θ(x, ·) is the unnormalized posterior for x ∈ X and θ ∈ Θ
function HVAE(D, K, n_B)                                  ▷ n_B is minibatch size
    Initialize θ, φ
    while θ, φ not converged do                           ▷ Stochastic optimization loop
        Sample {x_1, ..., x_{n_B}} ∼ ν_D(·) independently
        L̂ᴴ_ELBO(θ, φ) ← 0
        for i ← 1 to n_B do                               ▷ Average ELBO estimators over minibatch
            L̂ᴴ_ELBO(θ, φ) ← HIS(x_i, θ, K, β_0, ε) + L̂ᴴ_ELBO(θ, φ)
        L̂ᴴ_ELBO(θ, φ) ← L̂ᴴ_ELBO(θ, φ)/n_B
        θ ← UPDATETHETA(∇_θ L̂ᴴ_ELBO(θ, φ), θ)             ▷ Optimize the ELBO using gradient-based techniques such as RMSProp, ADAM, etc.
        φ ← UPDATEPHI(∇_φ L̂ᴴ_ELBO(θ, φ), φ)
    return θ, φ

4 Experiments

In this section, we discuss the experiments used to validate our method. We first test HVAE on an example with a tractable full log-likelihood (where no neural networks are needed), and then perform larger-scale tests on the MNIST dataset. Code is available online.⁴ All models were trained using TensorFlow [1].

4.1 Gaussian Model

The generative model that we consider first is a Gaussian likelihood with an offset and a Gaussian prior on the mean, given by

  z ∼ N(0, I_ℓ),   x_i | z ∼ N(z + Δ, Σ) independently, i ∈ [N],

where Σ is constrained to be diagonal. We again write D ≡ {x_1, ..., x_N} to denote an observed dataset under this model, where each x_i ∈ X ⊆ ℝ^d. In this example, we have ℓ = d.
The goal of the problem is to learn the model parameters θ ≡ {Σ, Δ}, where Σ = diag(σ_1², ..., σ_d²) and Δ ∈ ℝ^d. Here, we have only one latent variable generating the entire set of data. Thus, our variational lower bound is now given by

  L_ELBO(θ, φ; D) := E_{z∼q_{θ,φ}(·|D)}[log p_θ(D, z) − log q_{θ,φ}(z|D)] ≤ log p_θ(D),

for the variational posterior approximation q_{θ,φ}(·|D). We note that this is not exactly the same as the auto-encoder setting, in which an individual latent variable is associated with each observation; however, it provides a tractable framework in which to analyze the effectiveness of various variational inference methods. We also note that we can calculate the log-likelihood log p_θ(D) exactly in this case, but we use variational methods for the sake of comparison.

From the model, we see that the logarithm of the unnormalized target is given by

  log p_θ(D, z) = ∑_{i=1}^{N} log N(x_i|z + Δ, Σ) + log N(z|0, I_d).

⁴https://github.com/anthonycaterini/hvae-nips

Figure 1: Averages of ‖θ − θ̂‖² for several variational methods and choices of dimensionality d, where θ̂ is the estimated maximizer of the ELBO for each method and θ is the true parameter. (a) Comparison across all methods. (b) HVAE with and without tempering.

For this example, we will use an HVAE with variational prior equal to the true prior, i.e. q⁰ = N(0, I_ℓ), and fixed tempering. The potential, given by U_θ(z|D) = −log p_θ(D, z), has gradient

  ∇U_θ(z|D) = z + N Σ^{−1}(z + Δ − x̄),

where x̄ ≡ (1/N) ∑_{i=1}^{N} x_i. The set of variational parameters here is φ ≡ {ε, β_0}, where ε ∈ ℝ^d contains the per-dimension leapfrog stepsizes and β_0 ∈ (0, 1) is the initial inverse temperature.
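The gradient above is easy to verify by finite differences (our own sketch with hypothetical values of N, d, Σ, and Δ; the constant terms of the log density are dropped since they do not affect the gradient):

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 3, 100
Delta = np.array([-0.2, 0.0, 0.2])
sigma2 = np.array([1.0, 0.5, 1.0])                         # diagonal of Sigma
X = rng.normal(size=(N, d)) * np.sqrt(sigma2) + Delta      # synthetic data (z = 0 for simplicity)
x_bar = X.mean(axis=0)

def U(z):
    # U(z|D) = -log p(D, z), dropping z-independent normalizing constants
    return 0.5 * np.sum((X - z - Delta) ** 2 / sigma2) + 0.5 * z @ z

def grad_U(z):
    # closed form: z + N * Sigma^{-1} (z + Delta - x_bar)
    return z + N * (z + Delta - x_bar) / sigma2

z = rng.normal(size=d)
h = 1e-6
fd = np.array([(U(z + h * e) - U(z - h * e)) / (2 * h) for e in np.eye(d)])
print(np.allclose(fd, grad_U(z), rtol=1e-4))  # True
```

Only `grad_U` (not `U` itself) is needed inside the leapfrog updates, which is why HVAE requires just a few gradient evaluations of the sampled likelihood per iteration.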
We constrain each of the leapfrog step sizes such that ε_j ∈ (0, ξ) for some ξ > 0, for all j ∈ [d]; this is to prevent the leapfrog discretization from entering unstable regimes. Note that φ ∈ ℝ^{d+1} in this example; in particular, we do not optimize any parameters of the variational prior and thus require no further reparameterization.

We compare HVAE with a basic Variational Bayes (VB) scheme with mean-field approximate posterior q_{φ_V}(z|D) = N(z|μ_Z, Σ_Z), where Σ_Z is diagonal and φ_V ≡ {μ_Z, Σ_Z} denotes the set of learned variational parameters. We also include a planar normalizing flow of the form of equation (10) in Rezende and Mohamed [22], but with the same flow parameters across iterations to keep the number of variational parameters of the same order as in the other methods. The variational prior here is also set to the true prior, as in HVAE above. The log variational posterior log q_{φ_N}(z|D) is given by equation (13) of Rezende and Mohamed [22], where φ_N ≡ {u, v, b}⁵ ∈ ℝ^{2d+1}.

We set our true offset vector to be Δ = (−(d−1)/2, ..., (d−1)/2)/5, and our scale parameters to range quadratically from σ_1 = 1, reaching a minimum at σ_{(d+1)/2} = 0.1, and increasing back to σ_d = 1.⁶ All experiments have N = 10,000, and all training was done using RMSProp [27] with a learning rate of 10⁻³.

To compare the results across methods, we train each method ten times on different datasets. For each training run, we calculate ‖θ − θ̂‖², where θ̂ is the estimated value of θ given by the variational method on a particular run, and plot the average of this across the 10 runs for various dimensions in Figure 1a.
We note that, as the dimension increases, HVAE performs best in parameter estimation. The VB method suffers most on prediction of Δ as the dimension increases, whereas the NF method does poorly on predicting Σ.

We also compare HVAE with tempering to HVAE without tempering, i.e. where β_0 is fixed to 1 in training. This has the effect of making our Hamiltonian dynamics homogeneous in time. We perform the same comparison as above and present the results in Figure 1b. We can see that the tempered methods perform better than their non-tempered counterparts; this shows that time-inhomogeneous dynamics are a key ingredient in the effectiveness of the method.

⁵Boldface vectors are used to match the notation of Rezende and Mohamed [22].
⁶When d is even, σ_{(d+1)/2} does not exist, although we can still consider (d+1)/2 to be the location of the minimum of the parabola defining the true standard deviations.

4.2 Generative Model for MNIST

The next experiment we consider uses HVAE to improve upon a convolutional variational auto-encoder (VAE) for the binarized MNIST handwritten digit dataset. Again, our training data is D = {x_1, ..., x_N}, where each x_i ∈ X ⊆ {0, 1}^d for d = 28 × 28 = 784. The generative model is as follows:

  z_i ∼ N(0, I_ℓ),   x_i | z_i ∼ ∏_{j=1}^{d} Bernoulli((x_i)_j | π_θ(z_i)_j),

for i ∈ [N], where (x_i)_j is the jth component of x_i, z_i ∈ Z ≡ ℝ^ℓ is the latent variable associated with x_i, and π_θ : Z → X is a convolutional neural network (i.e. the generative network, or decoder) parametrized by the model parameters θ. This is the standard generative model used in VAEs, in which each pixel in the image x_i is conditionally independent given the latent variable.
The VAE approximate posterior – and the HVAE variational prior across the latent variables in this case – is given by qθ,φ(zi|xi) = N(zi|µφ(xi), Σφ(xi)), where µφ and Σφ are separate outputs of the same neural network (the inference network, or encoder) parametrized by φ, and Σφ is constrained to be diagonal.

We attempted to match the network structure of Salimans et al. [25]. The inference network consists of three convolutional layers, each with filters of size 5 × 5 and a stride of 2. The convolutional layers output 16, 32, and 32 feature maps, respectively. The output of the third layer is fed into a fully-connected layer with hidden dimension nh = 450, whose output is then fully connected to the output means and standard deviations, each of size ℓ. Softplus activation functions are used throughout the network except immediately before the outputted mean. The generative network mirrors this structure in reverse, replacing the stride with upsampling as in Dosovitskiy et al. [7] and replicated in Salimans et al. [25].

We apply HVAE on top of the base convolutional VAE. We evolve samples from the variational prior according to Algorithm 1 and optimize the new objective given in (12). We reparameterize z0|x ∼ N(µφ(x), Σφ(x)) as z0 = µφ(x) + Σφ^(1/2)(x) · ε, for ε ∼ N(0, Iℓ) and x ∈ X, to generate unbiased gradients of the ELBO with respect to φ. We select various values for K and set ℓ = 64. In contrast with normalizing flows, we do not need our flow parameters ε and β0 to be outputs of the inference network because our flow is guided by the target. This allows our method to have fewer overall parameters than normalizing flow schemes.
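Algorithm 1 evolves the reparameterized sample z0 through K deterministic leapfrog steps of the Hamiltonian flow. A minimal sketch of one such step follows; the final momentum rescaling is a simplified stand-in for the paper's tempering scheme, included only to show where time-inhomogeneity enters:

```python
import numpy as np

def leapfrog_step(z, rho, eps, grad_log_p, temper=1.0):
    """One leapfrog step of the deterministic Hamiltonian flow, followed by an
    optional momentum rescaling (tempering). grad_log_p(z) returns
    grad_z log p_theta(x, z); eps may be a scalar or per-dimension array."""
    rho = rho + 0.5 * eps * grad_log_p(z)   # half-step on momentum
    z = z + eps * rho                        # full step on position
    rho = rho + 0.5 * eps * grad_log_p(z)   # half-step on momentum
    rho = rho * temper                       # tempering: deterministic rescaling
    return z, rho

# Toy target: standard normal, so grad log p(z) = -z. The update is exactly
# invertible, and its Jacobian is trivial: only the tempering factor scales volume.
grad_log_p = lambda z: -z
rng = np.random.default_rng(1)
z, rho = rng.normal(size=64) * 3.0, rng.normal(size=64)
for _ in range(10):
    z, rho = leapfrog_step(z, rho, eps=0.1, grad_log_p=grad_log_p, temper=0.98)
```

With temper = 1 the dynamics are homogeneous in time (the untempered variant compared in Section 4.1); values below 1 cool the momentum across steps.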
We use the standard stochastic binarization of\nMNIST [24] as training data, and train using Adamax [14] with learning rate 10\u22123. We also employ\nearly stopping by halting the training procedure if there is no improvement in the loss on validation\ndata over 100 epochs.\nTo evaluate HVAE after training is complete, we estimate out-of-sample negative log likelihoods\n(NLLs) using 1000 importance samples from the HVAE approximate posterior. For each trained\nmodel, we estimate NLL three times, noting that the standard deviation over these three estimates\nis no larger than 0.12 nats. We report the average NLL values over either two or three different\ninitializations (in addition to the three NLL estimates for each trained model) for several choices of\ntempering and leapfrog steps in Table 1. A full accounting of the tests is given in the supplementary\nmaterial. We also consider an HVAE scheme in which we allow \u03b5 to vary across layers of the \ufb02ow\nand report the results.\nFrom Table 1, we notice that generally increasing the inhomogeneity in the dynamics improves the\ntest NLL values. For example, free tempering is the most successful tempering scheme, and varying\nthe leapfrog step size \u03b5 across layers also improves results. We also notice that increasing the number\nof leapfrog steps does not always improve the performance, as K = 15 provides the best results in\nfree tempering schemes. We believe that the improvement in HVAE over the base VAE scheme can\nbe attributed to a more expressive approximate posterior, as we can see that samples from the HVAE\napproximate posterior exhibit non-negligible covariance across dimensions. As in Salimans et al. [25],\nwe are also able to improve upon the base model by adding our time-inhomogeneous Hamiltonian\ndynamics on top, but in a simpli\ufb01ed regime without referring to learned reverse kernels. 
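The out-of-sample NLL evaluation described above is the standard importance-sampling estimator. A minimal sketch for a diagonal-Gaussian proposal follows; for HVAE proper, the samples would additionally be pushed through the Hamiltonian flow and the weights adjusted by the flow's volume change, a step omitted here:

```python
import numpy as np

def importance_sampled_nll(x, log_joint, mu, sigma, num_samples=1000, rng=None):
    """Estimate -log p_theta(x) by importance sampling from the diagonal
    Gaussian q(z|x) = N(mu, diag(sigma^2)):
    log p(x) ~= log-mean-exp_s [ log p(x, z_s) - log q(z_s | x) ]."""
    rng = rng or np.random.default_rng(0)
    ell = mu.shape[0]
    log_w = np.empty(num_samples)
    for s in range(num_samples):
        eps = rng.normal(size=ell)
        z = mu + sigma * eps
        log_q = -0.5 * (ell * np.log(2 * np.pi)
                        + np.sum(np.log(sigma ** 2)) + np.sum(eps ** 2))
        log_w[s] = log_joint(x, z) - log_q
    m = log_w.max()                      # stable log-mean-exp
    return -(m + np.log(np.mean(np.exp(log_w - m))))

# Toy check: 1-D model z ~ N(0,1), x|z ~ N(z,1), so marginally p(x) = N(x; 0, 2).
toy_lj = lambda x, z: (-0.5 * (np.log(2 * np.pi) + z[0] ** 2)
                       - 0.5 * (np.log(2 * np.pi) + (x - z[0]) ** 2))
nll = importance_sampled_nll(0.0, toy_lj, np.array([0.0]), np.array([1.0]))
```

On the toy model the estimate should land near the exact value 0.5·log(4π) ≈ 1.27 nats, mirroring the sub-0.12-nat spread reported across repeated estimates above.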
Rezende and Mohamed [22] report only lower bounds on the log-likelihood for NFs, which are indeed lower than our log-likelihood estimates, although they use a much larger number of variational parameters.

Table 1: Estimated NLL values for HVAE on MNIST. The base VAE achieves an NLL of 83.20. A more detailed version of this table is included in the supplementary material.

              ε fixed across layers            ε varied across layers
          T = Free   T = Fixed   T = None   T = Free   T = Fixed   T = None
K = 1       N/A        83.32       83.17      N/A        N/A         N/A
K = 5      83.09       83.26       83.68     83.01      82.94       83.35
K = 10     82.97       83.26       83.40     82.62      82.87       83.25
K = 15     82.78       83.56       83.82     82.62      83.09       82.94
K = 20     82.93       83.18       83.33     82.83      82.85       82.93

5 Conclusion and Discussion

We have proposed a principled way to exploit Hamiltonian dynamics within stochastic variational inference. Contrary to previous methods [25, 28], our algorithm does not rely on learned reverse Markov kernels and benefits from the use of tempering ideas. Additionally, we can use the reparameterization trick to obtain unbiased estimators of gradients of the ELBO. The resulting HVAE can be interpreted as a target-driven normalizing flow which requires the evaluation of a few gradients of the log-likelihood associated to a single data point at each stochastic gradient step. However, the Jacobian computations required for the ELBO are trivial. In our experiments, the robustness brought about by the use of target-informed dynamics can reduce the number of parameters that must be trained and improve generalizability.

We note that, although we have fewer parameters to optimize, the memory cost of using HVAE and target-informed dynamics could become prohibitively large if the memory required to store evaluations of ∇z log pθ(x, z) is already extremely large.
Evaluating these gradients is not a requirement of VAEs or standard normalizing flows. However, we have shown that in the case of a fairly large generative network we are still able to evaluate gradients and backpropagate through the layers of the flow. Further tests explicitly comparing HVAE with VAEs and normalizing flows in various memory regimes are required to determine in what cases one method should be used over the other.

There are numerous possible extensions of this work. Hamiltonian dynamics preserves the Hamiltonian and hence also the corresponding target distribution, but there exist other deterministic dynamics which leave the target distribution invariant but not the Hamiltonian. This includes the Nosé-Hoover thermostat. It is possible to directly use these dynamics instead of the Hamiltonian dynamics within the framework developed in subsection 2.3. In continuous time, related ideas have appeared in physics [5, 21, 26]. This comes at the cost of more complicated Jacobian calculations. The ideas presented here could also be coupled with the methodology proposed in [9] – we conjecture that this could reduce the variance of the estimator (3) by an order of magnitude.

Acknowledgments

Anthony L. Caterini is a Commonwealth Scholar, funded by the UK government.

References

[1] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

[2] Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

[3] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In The 4th International Conference on Learning Representations (ICLR), 2016.

[4] Gavin E Crooks.
Nonequilibrium measurements of free energy differences for microscopically reversible Markovian systems. Journal of Statistical Physics, 90(5-6):1481–1487, 1998.

[5] Michel A Cuendet. Statistical mechanical derivation of Jarzynski's identity for thermostated non-Hamiltonian dynamics. Physical Review Letters, 96(12):120602, 2006.

[6] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.

[7] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1538–1546. IEEE, 2015.

[8] Paul Glasserman. Gradient estimation via perturbation analysis, volume 116. Springer Science & Business Media, 1991.

[9] Jeremy Heng, Adrian N Bishop, George Deligiannidis, and Arnaud Doucet. Controlled sequential Monte Carlo. arXiv preprint arXiv:1708.08396, 2017.

[10] Matthew D Hoffman. Learning deep latent Gaussian models with Markov chain Monte Carlo. In International Conference on Machine Learning, pages 1510–1519, 2017.

[11] Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

[12] Christopher Jarzynski. Nonequilibrium equality for free energy differences. Physical Review Letters, 78(14):2690, 1997.

[13] Christopher Jarzynski. Hamiltonian derivation of a detailed fluctuation theorem. Journal of Statistical Physics, 98(1-2):77–102, 2000.

[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[15] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes.
In The 2nd International Conference on Learning Representations (ICLR), 2014.

[16] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

[17] Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, pages 6576–6586, 2017.

[18] Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

[19] Radford M Neal. Hamiltonian importance sampling. www.cs.toronto.edu/pub/radford/his-talk.ps, 2005. Talk presented at the Banff International Research Station (BIRS) workshop on Mathematical Issues in Molecular Dynamics.

[20] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

[21] Piero Procacci, Simone Marsili, Alessandro Barducci, Giorgio F Signorini, and Riccardo Chelli. Crooks equation for steered molecular dynamics using a Nosé-Hoover thermostat. The Journal of Chemical Physics, 125(16):164101, 2006.

[22] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.

[23] Danilo Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.

[24] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pages 872–879. ACM, 2008.

[25] Tim Salimans, Diederik P Kingma, and Max Welling.
Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226, 2015.

[26] E Schöll-Paschinger and Christoph Dellago. A proof of Jarzynski's nonequilibrium work theorem for dynamical systems that conserve the canonical distribution. The Journal of Chemical Physics, 125(5):054105, 2006.

[27] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

[28] Christopher Wolf, Maximilian Karl, and Patrick van der Smagt. Variational inference with Hamiltonian Monte Carlo. arXiv preprint arXiv:1609.08203, 2016.