{"title": "Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families", "book": "Advances in Neural Information Processing Systems", "page_first": 955, "page_last": 963, "abstract": "We propose Kernel Hamiltonian Monte Carlo (KMC), a gradient-free adaptive MCMC algorithm based on Hamiltonian Monte Carlo (HMC). On target densities where classical HMC is not an option due to intractable gradients, KMC adaptively learns the target's gradient structure by fitting an exponential family model in a Reproducing Kernel Hilbert Space. Computational costs are reduced by two novel efficient approximations to this gradient. While being asymptotically exact, KMC mimics HMC in terms of sampling efficiency, and offers substantial mixing improvements over state-of-the-art gradient free samplers. We support our claims with experimental studies on both toy and real-world applications, including Approximate Bayesian Computation and exact-approximate MCMC.", "full_text": "Gradient-free Hamiltonian Monte Carlo\nwith Ef\ufb01cient Kernel Exponential Families\n\nHeiko Strathmann\u2217 Dino Sejdinovic+ Samuel Livingstoneo Zoltan Szabo\u2217 Arthur Gretton\u2217\n\n\u2217Gatsby Unit\n\nUniversity College London\n\n+Department of Statistics\n\nUniversity of Oxford\n\noSchool of Mathematics\n\nUniversity of Bristol\n\nAbstract\n\nWe propose Kernel Hamiltonian Monte Carlo (KMC), a gradient-free adaptive\nMCMC algorithm based on Hamiltonian Monte Carlo (HMC). On target densities\nwhere classical HMC is not an option due to intractable gradients, KMC adap-\ntively learns the target\u2019s gradient structure by \ufb01tting an exponential family model\nin a Reproducing Kernel Hilbert Space. Computational costs are reduced by two\nnovel ef\ufb01cient approximations to this gradient. While being asymptotically exact,\nKMC mimics HMC in terms of sampling ef\ufb01ciency, and offers substantial mixing\nimprovements over state-of-the-art gradient free samplers. 
We support our claims\nwith experimental studies on both toy and real-world applications, including Ap-\nproximate Bayesian Computation and exact-approximate MCMC.\n\nIntroduction\n\n1\nEstimating expectations using Markov Chain Monte Carlo (MCMC) is a fundamental approximate\ninference technique in Bayesian statistics. MCMC itself can be computationally demanding, and\nthe expected estimation error depends directly on the correlation between successive points in the\nMarkov chain. Therefore, ef\ufb01ciency can be achieved by taking large steps with high probability.\nHamiltonian Monte Carlo [1] is an MCMC algorithm that improves ef\ufb01ciency by exploiting gra-\ndient information. It simulates particle movement along the contour lines of a dynamical system\nconstructed from the target density. Projections of these trajectories cover wide parts of the target\u2019s\nsupport, and the probability of accepting a move along a trajectory is often close to one. Remark-\nably, this property is mostly invariant to growing dimensionality, and HMC here often is superior to\nrandom walk methods, which need to decrease their step size at a much faster rate [1, Sec. 4.4].\nUnfortunately, for a large class of problems, gradient information is not available. For example, in\nPseudo-Marginal MCMC (PM-MCMC) [2, 3], the posterior does not have an analytic expression,\nbut can only be estimated at any given point, e.g. in Bayesian Gaussian Process classi\ufb01cation [4]. A\nrelated setting is MCMC for Approximate Bayesian Computation (ABC-MCMC), where the pos-\nterior is approximated through repeated simulation from a likelihood model [5, 6]. In both cases,\nHMC cannot be applied, leaving random walk methods as the only mature alternative. There have\nbeen efforts to mimic HMC\u2019s behaviour using stochastic gradients from mini-batches in Big Data\n[7], or stochastic \ufb01nite differences in ABC [8]. 
Stochastic gradient based HMC methods, however,\noften suffer from low acceptance rates or additional bias that is hard to quantify [9].\nRandom walk methods can be tuned by matching scaling of steps and target. For example, Adaptive\nMetropolis-Hastings (AMH) [10, 11] is based on learning the global scaling of the target from the\nhistory of the Markov chain. Yet, for densities with nonlinear support, this approach does not work\nvery well. Recently, [12] introduced a Kernel Adaptive Metropolis-Hastings (KAMH) algorithm\nwhose proposals are locally aligned to the target. By adaptively learning target covariance in a\nReproducing Kernel Hilbert Space (RKHS), KAMH achieves improved sampling ef\ufb01ciency.\n\n1\n\n\fIn this paper, we extend the idea of using kernel methods to learn ef\ufb01cient proposal distributions [12].\nRather than locally smoothing the target density, however, we estimate its gradients globally. More\nprecisely, we \ufb01t an in\ufb01nite dimensional exponential family model in an RKHS via score matching\n[13, 14]. This is a non-parametric method of modelling the log unnormalised target density as an\nRKHS function, and has been shown to approximate a rich class of density functions arbitrarily well.\nMore importantly, the method has been empirically observed to be relatively robust to increasing\ndimensionality \u2013 in sharp contrast to classical kernel density estimation [15, Sec. 6.5]. Gaussian\nProcesses (GP) were also used in [16] as an emulator of the target density in order to speed up\nHMC, however, this requires access to the target in closed form, to provide training points for the\nGP.\nWe require our adaptive KMC algorithm to be computationally ef\ufb01cient, as it deals with high-\ndimensional MCMC chains of growing length. We develop two novel approximations to the in\ufb01nite\ndimensional exponential family model. 
The \ufb01rst approximation, score matching lite, is based on\ncomputing the solution in terms of a lower dimensional, yet growing, subspace in the RKHS. KMC\nwith score matching lite (KMC lite) is geometrically ergodic on the same class of targets as stan-\ndard random walks. The second approximation uses a \ufb01nite dimensional feature space (KMC \ufb01nite),\ncombined with random Fourier features [17]. KMC \ufb01nite is an ef\ufb01cient online estimator that allows\nto use all of the Markov chain history, at the cost of decreased ef\ufb01ciency in unexplored regions. A\nchoice between KMC lite and KMC \ufb01nite ultimately depends on the ability to initialise the sampler\nwithin high-density regions of the target; alternatively, the two approaches could be combined.\nExperiments show that KMC inherits the ef\ufb01ciency of HMC, and therefore mixes signi\ufb01cantly better\nthan state-of-the-art gradient-free adaptive samplers on a number of target densities, including on\nsynthetic examples, and when used in PM-MCMC and ABC-MCMC. All code can be found at\nhttps://github.com/karlnapf/kernel_hmc\n\n2 Background and Previous Work\nLet the domain of interest X be a compact1 subset of Rd, and denote the unnormalised target den-\nsity on X by \u03c0. We are interested in constructing a Markov chain x1 \u2192 x2 \u2192 . . . such that\nlimt\u2192\u221e xt \u223c \u03c0. By running the Markov chain for a long time T , we can consistently approximate\nany expectation w.r.t. \u03c0. Markov chains are constructed using the Metropolis-Hastings algorithm,\nwhich at the current state xt draws a point from a proposal mechanism x\u2217 \u223c Q(\u00b7|xt), and sets\nxt+1 \u2190 x\u2217 with probability min(1, [\u03c0(x\u2217)Q(xt|x\u2217)]/[\u03c0(xt)Q(x\u2217|xt)]), and xt+1 \u2190 xt otherwise.\nWe assume that \u03c0 is intractable,2 i.e. that we can neither evaluate \u03c0(x) nor3 \u2207 log \u03c0(x) for any x,\nbut can only estimate it unbiasedly via \u02c6\u03c0(x). 
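To make the Metropolis-Hastings step with an estimated target concrete, here is a minimal sketch (not the paper's implementation; `pi_hat`, `q_sample` and `q_logpdf` are illustrative names). Recycling the stored estimate at the current state, rather than re-estimating it, is what keeps the pseudo-marginal chain asymptotically exact:

```python
import numpy as np

def pseudo_marginal_mh(pi_hat, q_sample, q_logpdf, x0, n_iters, rng):
    """Metropolis-Hastings where the target pi(x) is only available through
    an unbiased estimator pi_hat(x).  The estimate at the current state is
    stored and re-used across iterations."""
    x, pi_x = x0, pi_hat(x0)
    chain = [x0]
    for _ in range(n_iters):
        x_star = q_sample(x)               # draw proposal x* ~ Q(.|x_t)
        pi_star = pi_hat(x_star)           # fresh unbiased estimate at x*
        log_ratio = (np.log(pi_star) - np.log(pi_x)
                     + q_logpdf(x, x_star) - q_logpdf(x_star, x))
        if np.log(rng.uniform()) < log_ratio:
            x, pi_x = x_star, pi_star      # accept; recycle this estimate
        chain.append(x)
    return np.array(chain)
```

For a symmetric random walk the `q_logpdf` correction cancels, and plugging in an exact unnormalised density is simply the noise-free special case.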
Replacing π(x) with π̂(x) results in PM-MCMC [2, 3], which asymptotically remains exact (exact-approximate inference).\n\n(Kernel) Adaptive Metropolis-Hastings In the absence of ∇ log π, the usual choice of Q is a random walk, i.e. Q(·|x_t) = N(·|x_t, Σ_t). A popular choice of the scaling is Σ_t ∝ I. When the scale of the target density is not uniform across dimensions, or if there are strong correlations, the AMH algorithm [10, 11] improves mixing by adaptively learning the global covariance structure of π from the history of the Markov chain. For cases where the local scaling does not match the global covariance of π, i.e. the support of the target is nonlinear, KAMH [12] improves mixing by learning the target covariance in an RKHS. KAMH proposals are Gaussian with a covariance that matches the local covariance of π around the current state x_t, without requiring access to ∇ log π.\n\nHamiltonian Monte Carlo Hamiltonian Monte Carlo (HMC) uses deterministic, measure-preserving maps to generate efficient Markov transitions [1, 18]. Starting from the negative log target, referred to as the potential energy U(q) = −log π(q), we introduce an auxiliary momentum variable p ∼ exp(−K(p)) with p ∈ X. The joint distribution of (p, q) is then proportional to exp(−H(p, q)), where H(p, q) := K(p) + U(q) is called the Hamiltonian. H(p, q) defines a Hamiltonian flow, parametrised by a trajectory length t ∈ R, which is a map φ_t^H : (p, q) ↦ (p∗, q∗) for which H(p∗, q∗) = H(p, q). 
This allows constructing π-invariant Markov chains: for a chain at state q = x_t, repeatedly (i) re-sample p′ ∼ exp(−K(·)), and then (ii) apply the Hamiltonian flow for time t, giving (p∗, q∗) = φ_t^H(p′, q). The flow can be generated by the Hamiltonian operator\n\n(∂K/∂p)(∂/∂q) − (∂U/∂q)(∂/∂p). (1)\n\nIn practice, (1) is usually unavailable and we need to resort to approximations. Here, we limit ourselves to the leap-frog integrator; see [1] for details. To correct for discretisation error, a Metropolis acceptance procedure can be applied: starting from (p′, q), the end-point of the approximate trajectory is accepted with probability min[1, exp(−H(p∗, q∗) + H(p′, q))]. HMC is often able to propose distant, uncorrelated moves with a high acceptance probability.\n\nIntractable densities In many cases the gradient of log π(q) = −U(q) cannot be written in closed form, leaving random-walk based methods as the state-of-the-art [11, 12]. We aim to overcome random-walk behaviour, so as to obtain significantly more efficient sampling [1].\n\n1The compactness restriction is imposed to satisfy the assumptions in [13].\n2π is analytically intractable, as opposed to computationally expensive in the Big Data context.\n3Throughout the paper ∇ denotes the gradient operator w.r.t. x.\n\n3 Kernel Induced Hamiltonian Dynamics\nKMC replaces the potential energy in (1) by a kernel induced surrogate computed from the history of the Markov chain. This surrogate does not require gradients of the log-target density. The surrogate induces a kernel Hamiltonian flow, which can be numerically simulated using standard leap-frog integration. 
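For concreteness, a minimal sketch of the leap-frog integrator for the common quadratic kinetic energy K(p) = p⊤p/2 (so ∂K/∂p = p); `grad_U` stands for ∂U/∂q and is an assumption of this sketch, not part of the paper's code:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    """Leap-frog integration of Hamiltonian dynamics with K(p) = p.p/2:
    a momentum half-step, alternating full steps in position and momentum,
    and a final momentum half-step.  The map is volume-preserving and
    exactly time-reversible, which is what the Metropolis correction needs."""
    q, p = np.copy(q), np.copy(p)
    p = p - 0.5 * eps * grad_U(q)          # initial momentum half-step
    for step in range(n_steps):
        q = q + eps * p                    # full position step
        if step < n_steps - 1:
            p = p - eps * grad_U(q)        # full momentum step
    p = p - 0.5 * eps * grad_U(q)          # final momentum half-step
    return q, p
```

On a standard Gaussian target, U(q) = q⊤q/2, the Hamiltonian is conserved up to O(eps²) along the discretised trajectory, which is why the acceptance probability stays close to one.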
As with the discretisation error in HMC, any deviation of the kernel induced flow from the true flow is corrected via a Metropolis acceptance procedure. This acceptance step also contains the estimation noise from π̂ and re-uses previous values of π̂, c.f. [3, Table 1]. Consequently, the stationary distribution of the chain remains correct, given that we take care when adapting the surrogate.\n\nInfinite Dimensional Exponential Families in an RKHS We construct a kernel induced potential energy surrogate whose gradients approximate the gradients of the true potential energy U in (1), without accessing π or ∇π directly, but only using the history of the Markov chain. To that end, we model the (unnormalised) target density π(x) with an infinite dimensional exponential family model [13] of the form\n\nconst × π(x) ≈ exp(⟨f, k(x, ·)⟩_H − A(f)), (2)\n\nwhich in particular implies ∇f ≈ −∇U = ∇ log π. Here H is an RKHS of real valued functions on X. The RKHS has a uniquely associated symmetric, positive definite kernel k : X × X → R, which satisfies f(x) = ⟨f, k(x, ·)⟩_H for any f ∈ H [19]. The canonical feature map k(·, x) ∈ H here takes the role of the sufficient statistics while f ∈ H are the natural parameters, and A(f) := log ∫_X exp(⟨f, k(x, ·)⟩_H)dx is the cumulant generating function. Eq. (2) defines a broad class of densities: when universal kernels are used, the family is dense in the space of continuous densities on compact domains, with respect to e.g. Total Variation and KL [13, Section 3]. It is possible to consistently fit an unnormalised version of (2) by directly minimising the expected gradient mismatch between the model (2) and the true target density π (observed through the Markov chain history). 
This is achieved by generalising the score matching approach [14] to infinite dimensional parameter spaces. The technique avoids the problem of dealing with the intractable A(f), and reduces the problem to solving a linear system. More importantly, the approach is observed to be relatively robust to increasing dimensions. We return to estimation in Section 4, where we develop two efficient approximations. For now, assume access to an f̂ ∈ H such that ∇f̂(x) ≈ ∇ log π(x).\n\nKernel Induced Hamiltonian Flow We define a kernel induced Hamiltonian operator by replacing U in the potential energy part ∂U/∂q in (1) by our kernel surrogate U_k = −f. It is clear that, depending on U_k, the resulting kernel induced Hamiltonian flow differs from the original one. That said, any bias on the resulting Markov chain, in addition to discretisation error from the leap-frog integrator, is naturally corrected for in the Pseudo-Marginal Metropolis step. We accept an end-point φ_t^{H_k}(p′, q) of a trajectory starting at (p′, q) along the kernel induced flow with probability\n\nmin[1, exp(−H(φ_t^{H_k}(p′, q)) + H(p′, q))], (3)\n\nwhere H(φ_t^{H_k}(p′, q)) corresponds to the true Hamiltonian at φ_t^{H_k}(p′, q). Here, in the Pseudo-Marginal context, we replace both terms in the ratio in (3) by unbiased estimates, i.e., we replace\n\nFigure 1: Hamiltonian trajectories on a 2-dimensional standard Gaussian. End points of such trajectories (red stars to blue stars) form the proposal of HMC-like algorithms. Left: Plain Hamiltonian trajectories oscillate on a stable orbit, and acceptance probability is close to one. 
Right: Kernel induced trajectories and acceptance probabilities on an estimated energy function.\n\nπ(q) within H with an unbiased estimator π̂(q). Note that this also involves ‘recycling’ the estimates of H from previous iterations to ensure asymptotic correctness, c.f. [3, Table 1]. Any deviations of the kernel induced flow from the true flow result in a decreased acceptance probability (3). We therefore need to control the approximation quality of the kernel induced potential energy to maintain high acceptance probability in practice. See Figure 1 for an illustrative example.\n\n4 Two Efficient Estimators for Exponential Families in RKHS\nWe now address estimating the infinite dimensional exponential family model (2) from data. The original estimator in [13] has a large computational cost. This is problematic in the adaptive MCMC context, where the model has to be updated on a regular basis. We propose two efficient approximations, each with its strengths and weaknesses. Both are based on score matching.\n\n4.1 Score Matching\nFollowing [14], we model an unnormalised log probability density log π(x) with a parametric model\n\nlog π̃_Z(x; f) := log π̃(x; f) − log Z(f), (4)\n\nwhere f is a collection of parameters of yet unspecified dimension (c.f. natural parameters of (2)), and Z(f) is an unknown normalising constant. We aim to find f̂ from a set of n samples4 D := {x_i}_{i=1}^n ∼ π such that π(x) ≈ π̃(x; f̂) × const. From [14, Eq. 2], the criterion being optimised is the expected squared distance between gradients of the log density, the so-called score functions,\n\nJ(f) = (1/2) ∫_X π(x) ||∇ log π̃(x; f) − ∇ log π(x)||²₂ dx,\n\nwhere we note that the normalising constants vanish from taking the gradient ∇. As shown in [14, Theorem 1], it is possible to compute an empirical version without accessing π(x) or ∇ log π(x) other than through observed samples,\n\nĴ(f) = (1/n) Σ_{x∈D} Σ_{ℓ=1}^d [ ∂² log π̃(x; f)/∂x²_ℓ + (1/2)(∂ log π̃(x; f)/∂x_ℓ)² ]. (5)\n\nOur approximations of the original model (2) are based on minimising (5) using approximate scores.\n\n4.2 Infinite Dimensional Exponential Families Lite\nThe original estimator of f in (2) takes a dual form in an RKHS sub-space spanned by nd + 1 kernel derivatives [13, Thm. 4]. The update of the proposal at iteration t of MCMC requires inversion of a (td + 1) × (td + 1) matrix. This is clearly prohibitive if we are to run even a moderate number of iterations of a Markov chain. Following [12], we take a simple approach to avoid prohibitive computational costs in t: we form a proposal using a random sub-sample of fixed size n from the Markov chain history, z := {z_i}_{i=1}^n ⊆ {x_i}_{i=1}^t. In order to avoid excessive computation when d is large, we replace the full dual solution with a solution in terms of span({k(z_i, ·)}_{i=1}^n), which covers the support of the true density by construction, and grows with increasing n. That is, we assume that the model (4) takes the ‘light’ form\n\nf(x) = Σ_{i=1}^n α_i k(z_i, x), (6)\n\nwhere α ∈ Rⁿ are real valued parameters that are obtained by minimising the empirical score matching objective (5). \n\n4We assume a fixed sample set here but will use either the full chain history {x_i}_{i=1}^t or a sub-sample later.
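Because (5) is quadratic in α under this representation, fitting reduces to a linear solve. The following sketch minimises the empirical objective for the Gaussian kernel k(x, y) = exp(−||x − y||²/σ), using a plain ridge penalty on α in place of the paper's RKHS-norm regulariser; all names and constants are illustrative:

```python
import numpy as np

def fit_lite(X, Z, sigma, lam):
    """Fit alpha in f(x) = sum_i alpha_i k(z_i, x), k(x, y) = exp(-|x-y|^2/sigma),
    by minimising the empirical score matching objective.  The objective is
    b.alpha + 0.5 alpha^T G alpha + lam |alpha|^2, so the minimiser solves a
    linear system.  (Ridge penalty stands in for the RKHS norm.)"""
    n, d = X.shape
    m = Z.shape[0]
    sq = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / sigma)                              # K[i, j] = k(z_i, x_j)
    b = np.zeros(m)
    G = np.zeros((m, m))
    for l in range(d):
        diff = X[None, :, l] - Z[:, None, l]             # (m, n): x_l - z_il
        # second derivative of f w.r.t. x_l is linear in alpha
        b += ((-2.0 / sigma + 4.0 / sigma ** 2 * diff ** 2) * K).sum(1) / n
        # first derivative enters the objective quadratically
        g = (-2.0 / sigma) * diff * K
        G += g @ g.T / n
    return -np.linalg.solve(G + 2.0 * lam * np.eye(m), b)

def grad_f(x, alpha, Z, sigma):
    """Gradient of the fitted surrogate, approximating grad log pi(x)."""
    diff = x[None, :] - Z                                # (m, d)
    k = np.exp(-(diff ** 2).sum(1) / sigma)
    return (-2.0 / sigma) * (alpha * k) @ diff
```

On standard Gaussian samples the fitted gradient should roughly track the true score ∇ log π(x) = −x near the data, without ever evaluating π.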
This representation is of a form similar to [20, Section 4.1], the main differences being that the basis functions are chosen randomly, the basis set grows with n, and we will require an additional regularising term. The estimator is summarised in the following proposition, which is proved in Appendix A.\n\nProposition 1. Given a set of samples z = {z_i}_{i=1}^n and assuming f(x) = Σ_{i=1}^n α_i k(z_i, x) for the Gaussian kernel of the form k(x, y) = exp(−σ⁻¹||x − y||²₂), and λ > 0, the unique minimiser of the λ||f||²_H-regularised empirical score matching objective (5) is given by\n\nα̂_λ = −(σ/2)(C + λI)⁻¹b, (7)\n\nwhere b ∈ Rⁿ and C ∈ Rⁿˣⁿ are given by\n\nb = Σ_{ℓ=1}^d ( (2/σ)(K s_ℓ + D_{s_ℓ} K 1 − 2 D_{x_ℓ} K x_ℓ) − K 1 ) and C = Σ_{ℓ=1}^d [D_{x_ℓ} K − K D_{x_ℓ}][K D_{x_ℓ} − D_{x_ℓ} K],\n\nwith entry-wise products s_ℓ := x_ℓ ⊙ x_ℓ and D_x := diag(x).\n\nThe estimator costs O(n³ + dn²) computation (for computing C, b, and for inverting C) and O(n²) storage, for a fixed random chain history sub-sample size n. This can be further reduced via low-rank approximations to the kernel matrix and conjugate gradient methods, which are derived in Appendix A. Gradients of the model are given as ∇f(x) = Σ_{i=1}^n α_i ∇k(z_i, x), i.e. they simply require evaluating gradients of the kernel function. Evaluation and storage of ∇f(·) both cost O(dn).\n\n4.3 Exponential Families in Finite Feature Spaces\nInstead of fitting an infinite-dimensional model on a subset of the available data, the second estimator is based on fitting a finite dimensional approximation using all available data {x_i}_{i=1}^t, in primal form. 
As we will see, updating the estimator when a new data point arrives can be done online. Define an m-dimensional approximate feature space H_m = R^m, and denote by φ_x ∈ H_m the embedding of a point x ∈ X = R^d into H_m = R^m. Assume that the embedding approximates the kernel function as a finite rank expansion k(x, y) ≈ φ_x^⊤φ_y. The log unnormalised density of the infinite model (2) can be approximated by assuming the model in (4) takes the form\n\nf(x) = ⟨θ, φ_x⟩_{H_m} = θ^⊤φ_x. (8)\n\nTo fit θ ∈ R^m, we again minimise the score matching objective (5), as proved in Appendix B.\n\nProposition 2. Given a set of samples x = {x_i}_{i=1}^t and assuming f(x) = θ^⊤φ_x for a finite dimensional feature embedding x ↦ φ_x ∈ R^m, and λ > 0, the unique minimiser of the λ||θ||²₂-regularised empirical score matching objective (5) is given by\n\nθ̂_λ := (C + λI)⁻¹b, (9)\n\nwhere\n\nb := −(1/n) Σ_{i=1}^t Σ_{ℓ=1}^d φ̈_{x_i}^ℓ ∈ R^m and C := (1/n) Σ_{i=1}^t Σ_{ℓ=1}^d φ̇_{x_i}^ℓ (φ̇_{x_i}^ℓ)^⊤ ∈ R^{m×m},\n\nwith φ̇_x^ℓ := (∂/∂x_ℓ)φ_x and φ̈_x^ℓ := (∂²/∂x²_ℓ)φ_x.\n\nAn example feature embedding based on random Fourier features [17, 21] and a standard Gaussian kernel is φ_x = √(2/m)[cos(ω_1^⊤x + u_1), . . . , cos(ω_m^⊤x + u_m)], with ω_i drawn from the (Gaussian) spectral density of the kernel and u_i ∼ Uniform[0, 2π]. The estimator has a one-off cost of O(tdm² + m³) computation and O(m²) storage. 
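As a concrete illustration, the sketch below fits θ with the random Fourier feature embedding just described and evaluates ∇f(x) = [∇φ_x]⊤θ̂. Frequencies are drawn from a standard Gaussian (the spectral measure of a fixed-bandwidth Gaussian kernel); this is the one-off batch solve, not the online update, and names are illustrative:

```python
import numpy as np

def fit_finite(X, W, u, lam):
    """Fit theta in f(x) = theta^T phi_x for random Fourier features
    phi_x = sqrt(2/m) cos(W x + u), by minimising the score matching
    objective; the minimiser is again a regularised linear solve."""
    n, d = X.shape
    m = W.shape[0]
    C = np.zeros((m, m))
    b = np.zeros(m)
    c = np.sqrt(2.0 / m)
    for x in X:
        a = W @ x + u
        # minus the second derivatives of phi, summed over dimensions l
        b += c * (W ** 2).sum(1) * np.cos(a)
        sin_a = np.sin(a)
        for l in range(d):
            dphi = -c * sin_a * W[:, l]          # d phi / d x_l
            C += np.outer(dphi, dphi)
    b /= n
    C /= n
    return np.linalg.solve(C + lam * np.eye(m), b)

def grad_f_finite(x, theta, W, u):
    """grad f(x) = [grad phi_x]^T theta, estimating grad log pi(x)."""
    m = W.shape[0]
    sin_a = np.sin(W @ x + u)
    return -np.sqrt(2.0 / m) * (theta * sin_a) @ W
```

Because C and b are averages over the data, a new point only requires updating these running averages, which is what makes the online variant cheap.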
Given that we have computed a solution based on the Markov chain history {x_i}_{i=1}^t, however, it is straightforward to update C, b, and the solution θ̂_λ online, after a new point x_{t+1} arrives. This is achieved by storing running averages and performing low-rank updates of matrix inversions, and costs O(dm²) computation and O(m²) storage, independent of t. Further details are given in Appendix B. Gradients of the model are ∇f(x) = [∇φ_x]^⊤θ̂, i.e., they require the evaluation of the gradient of the feature space embedding, costing O(md) computation and O(m) storage.\n\nAlgorithm 1 Kernel Hamiltonian Monte Carlo – Pseudo-code\nInput: Target (possibly noisy estimator) π̂, adaptation schedule a_t, HMC parameters, size of basis m or sub-sample size n.\nAt iteration t + 1, current state x_t, history {x_i}_{i=1}^t, perform (1-4) with probability a_t.\nKMC lite:\n1. Update sub-sample z ⊆ {x_i}_{i=1}^t\n2. Re-compute C, b from Prop. 1\n3. Solve α̂_λ = −(σ/2)(C + λI)⁻¹b\n4. ∇f(x) ← Σ_{i=1}^n α_i ∇k(x, z_i)\nKMC finite:\n1. Update C, b from Prop. 2\n2. Perform rank-d update to C⁻¹\n3. Update θ̂_λ = (C + λI)⁻¹b\n4. ∇f(x) ← [∇φ_x]^⊤θ̂\nThen, for both variants:\n5. Propose (p∗, x∗) with kernel induced Hamiltonian flow, using ∇_x U = −∇_x f\n6. Perform Metropolis step using π̂: accept x_{t+1} ← x∗ w.p. (3) and reject x_{t+1} ← x_t otherwise\nIf π̂ is noisy and x∗ was accepted, store the above π̂(x∗) for evaluating (3) in the next iteration.\n\n5 Kernel Hamiltonian Monte Carlo\nConstructing a kernel induced Hamiltonian flow as in Section 3 from the gradients of the infinite dimensional exponential family model (2), and the approximate estimators (6), (8), we arrive at a gradient free, adaptive MCMC algorithm: Kernel Hamiltonian Monte Carlo (Algorithm 1).\n\nComputational Efficiency, Geometric Ergodicity, and Burn-in KMC finite using (8) allows for online updates using the full Markov chain history, and therefore is a more elegant solution than KMC lite, which has greater computational cost and requires sub-sampling the chain history. Due to the parametric nature of KMC finite, however, the tails of the estimator are not guaranteed to decay. For example, the random Fourier feature embedding described below Proposition 2 contains periodic cosine functions, and therefore oscillates in the tails of (8), resulting in a reduced acceptance probability. As we will demonstrate in the experiments, this problem does not appear when KMC finite is initialised in high-density regions, nor after burn-in. In situations where information about the target density support is unknown, and during burn-in, we suggest to use the lite estimator (7), whose gradients decay outside of the training data. As a result, KMC lite is guaranteed to fall back to a Random Walk Metropolis in unexplored regions, inheriting its convergence properties, and smoothly transitions to HMC-like proposals as the MCMC chain grows. A proof of the proposition below can be found in Appendix C.\nProposition 3. 
Assume d = 1, π(x) has log-concave tails, the regularity conditions of [22, Thm 2.2] (implying π-irreducibility and smallness of compact sets), that MCMC adaptation stops after a fixed time, and a fixed number L of ε-leapfrog steps. If lim sup_{||x||₂→∞} ||∇f(x)||₂ = 0, and ∃M : ∀x : ||∇f(x)||₂ ≤ M, then KMC lite is geometrically ergodic from π-almost any starting point.\n\nVanishing adaptation MCMC algorithms that use the history of the Markov chain for constructing proposals might not be asymptotically correct. We follow [12, Sec. 4.2] and the idea of ‘vanishing adaptation’ [11], to avoid such biases. Let {a_t}_{t=0}^∞ be a schedule of decaying probabilities such that lim_{t→∞} a_t = 0 and Σ_{t=0}^∞ a_t = ∞. We update the density gradient estimate according to this schedule in Algorithm 1. Intuitively, adaptation becomes less likely as the MCMC chain progresses, but never fully stops, while sharing asymptotic convergence with adaptation that stops at a fixed point [23, Theorem 1]. Note that Proposition 3 is a stronger statement about the convergence rate.\n\nFree Parameters KMC has two free parameters: the Gaussian kernel bandwidth σ, and the regularisation parameter λ. As KMC's performance depends on the quality of the approximate infinite dimensional exponential family model in (6) or (8), a principled approach is to use the score matching objective function in (5) to choose σ, λ pairs via cross-validation (using e.g. ‘hot-started’ black-box optimisation). Earlier adaptive kernel-based MCMC methods [12] did not address parameter choice.\n\n6 Experiments\nWe start by quantifying performance of KMC finite on synthetic targets. 
We emphasise that these results can be reproduced with the lite version.\n\nFigure 2: Hypothetical acceptance probability of KMC finite on a challenging target in growing dimensions. Left: As a function of n = m (x-axis) and d (y-axis). Middle/right: Slices through the left plot with error bars, for fixed n = m as a function of d (middle), and for fixed d as a function of n = m (right).\n\nFigure 3: Results for the 8-dimensional synthetic Banana. As the amount of observed data increases, KMC performance approaches HMC – outperforming KAMH and RW. 80% error bars over 30 runs.\n\nKMC Finite: Stability of Trajectories in High Dimensions In order to quantify efficiency in growing dimensions, we study hypothetical acceptance rates along trajectories on the kernel induced Hamiltonian flow (no MCMC yet) on a challenging Gaussian target: we sample the diagonal entries of the covariance matrix from a Gamma(1,1) distribution and rotate with a uniformly sampled random orthogonal matrix. The resulting target is challenging to estimate due to its ‘non-singular smoothness’, i.e., substantially differing length-scales across its principal components. As a single Gaussian kernel is not able to efficiently represent such scaling families, we use a rational quadratic kernel for the gradient estimation, whose random features are straightforward to compute. Figure 2 shows the average acceptance over 100 independent trials as a function of the number of (ground truth) samples and basis functions, which are set to be equal n = m, and of dimension d. In low to moderate dimensions, gradients of the finite estimator lead to acceptance rates comparable to plain HMC. On targets with more ‘regular’ smoothness, the estimator performs well in up to d ≈ 100, with less variance. 
See Appendix D.1 for details.\nKMC Finite: HMC-like Mixing on a Synthetic Example We next show that KMC\u2019s perfor-\nmance approaches that of HMC as it sees more data. We compare KMC, HMC, an isotropic random\nwalk (RW), and KAMH on the 8-dimensional nonlinear banana-shaped target; see Appendix D.2.\nWe here only quantify mixing after a suf\ufb01cient burn-in (burn-in speed is included in next example).\nWe quantify performance on estimating the target\u2019s mean, which is exactly 0. We tuned the scaling\nof KAMH and RW to achieve 23% acceptance. We set HMC parameters to achieve 80% acceptance\nand then used the same parameters for KMC. We ran all samplers for 2000+200 iterations from a\nrandom start point, discarded the burn-in and computed acceptance rates, the norm of the empirical\nmean (cid:107)\u02c6E[x](cid:107), and the minimum effective sample size (ESS) across dimensions. For KAMH and\nKMC, we repeated the experiment for an increasing number of burn-in samples and basis functions\nm = n. Figure 3 shows the results as a function of m = n. KMC clearly outperforms RW and\nKAMH, and eventually achieves performance close to HMC as n = m grows.\nKMC Lite: Pseudo-Marginal MCMC for GP Classi\ufb01cation on Real World Data We next\napply KMC to sample from the marginal posterior over hyper-parameters of a Gaussian Process\nClassi\ufb01cation (GPC) model on the UCI Glass dataset [24]. Classical HMC cannot be used for this\nproblem, due to the intractability of the marginal data likelihood. Our experimental protocol mostly\nfollows [12, Section 5.1], see Appendix D.3, but uses only 6000 MCMC iterations without discard-\ning a burn-in, i.e., we study how fast KMC initially explores the target. We compare convergence in\nterms of all mixed moments of order up to 3 to a set of benchmark samples (MMD [25], lower is bet-\nter). 
KMC randomly uses between 1 and 10 leapfrog steps of a size chosen uniformly in [0.01, 0.1], a standard Gaussian momentum, and a kernel tuned by cross-validation, see Appendix D.3.\n\nFigure 4: Left: Results for 9-dimensional marginal posterior over length scales of a GPC model applied to the UCI Glass dataset. The plot shows convergence (no burn-in discarded) of all mixed moments up to order 3 (lower MMD is better). Middle/right: ABC-MCMC auto-correlation and marginal θ₁ posterior for a 10-dimensional skew normal likelihood. While KMC mixes as well as HABC, it does not suffer from any bias (overlaps with RW, while HABC is significantly different) and requires fewer simulations per proposal.\n\nWe did not extensively tune the HMC parameters of KMC as the described settings were sufficient. Both KMC and KAMH used 1000 samples from the chain history. Figure 4 (left) shows that KMC's burn-in contains a short ‘exploration phase’ where produced estimates are bad, due to it falling back to a random walk in unexplored regions, c.f. Proposition 3. From around 500 iterations, however, KMC clearly outperforms both RW and the earlier state-of-the-art KAMH. These results are backed by the minimum ESS (not plotted), which is around 415 for KMC and around 35 and 25 for KAMH and RW, respectively. Note that all samplers effectively stop improving from 3000 iterations – indicating a burn-in bias. 
All samplers took about 1h, with most time spent estimating the marginal likelihood.

KMC Lite: Reduced Simulations and no Additional Bias in ABC We now apply KMC in the context of Approximate Bayesian Computation (ABC), which is often employed when the data likelihood is intractable but can be obtained by simulation, see e.g. [6]. ABC-MCMC [5] targets an approximate posterior by constructing an unbiased Monte Carlo estimator of the approximate likelihood. As each such evaluation requires expensive simulations from the likelihood, the goal of all ABC methods is to reduce the number of such simulations. Accordingly, Hamiltonian ABC was recently proposed [8], combining the synthetic likelihood approach [26] with gradients based on stochastic finite differences. We remark that this requires simulating from the likelihood in every leapfrog step, and that the additional bias from the Gaussian likelihood approximation can be problematic. In contrast, KMC does not require simulations to construct a proposal, but rather 'invests' simulations into an accept/reject step (3) that ensures convergence to the original ABC target. Figure 4 (right) compares the performance of RW, HABC (sticky random numbers and SPAS, [8, Sec. 4.3, 4.4]), and KMC on a 10-dimensional skew-normal distribution p(y|θ) = 2N(θ, I) Φ(⟨α, y⟩) with θ = α = 1 · 1₁₀ (the 10-dimensional all-ones vector). KMC mixes as well as HABC, but HABC suffers from a severe bias. KMC also reduces the number of simulations per proposal by a factor of 2L = 100. See Appendix D.4 for details.

7 Discussion
We have introduced KMC, a kernel-based gradient-free adaptive MCMC algorithm that mimics HMC's behaviour by estimating target gradients in an RKHS. In experiments, KMC outperforms random-walk-based sampling methods in up to d = 50 dimensions, including the recent kernel-based KAMH [12].
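The skew-normal likelihood in the ABC experiment above can be simulated without evaluating its density, via the standard stochastic representation: draw a Gaussian and flip its sign depending on a second Gaussian variate. This sketch uses the usual location-shifted convention Φ(⟨α, y − θ⟩), an assumption on our part, and is not the simulator used in the experiments.

```python
import numpy as np

def sample_skew_normal(theta, alpha, n, rng):
    """Draw n samples with density 2 N(y; theta, I) Phi(<alpha, y - theta>).

    Standard stochastic representation (assumed convention, illustration
    only): z ~ N(0, I) is kept if a N(0, 1) variate u satisfies
    u <= <alpha, z>, and sign-flipped otherwise.
    """
    d = len(theta)
    z = rng.normal(size=(n, d))
    u = rng.normal(size=n)
    flip = u > z @ alpha     # reject direction: mirror the sample
    z[flip] *= -1
    return theta + z

rng = np.random.default_rng(1)
theta = np.ones(10)
alpha = np.ones(10)
y = sample_skew_normal(theta, alpha, 20000, rng)
```

Each draw costs only d + 1 standard Gaussian variates, which is the kind of cheap likelihood simulation that ABC-MCMC repeatedly invokes.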
KMC is particularly useful when gradients of the target density are unavailable, as in PM-MCMC or ABC-MCMC, where classical HMC cannot be used. We have proposed two efficient empirical estimators for the target gradients, each with different strengths and weaknesses, and have given experimental evidence for the robustness of both.
Future work includes establishing theoretical consistency and uniform convergence rates for the empirical estimators, for example using recent analysis of random Fourier features with tight bounds [21], and a thorough experimental study in the ABC-MCMC context, where we see a lot of potential for KMC. It might also be possible to use KMC as a precomputing strategy to speed up classical HMC, as in [27]. For code, see https://github.com/karlnapf/kernel_hmc

References
[1] R.M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.
[2] M.A. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160, 2003.
[3] C. Andrieu and G.O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, April 2009.
[4] M. Filippone and M. Girolami. Pseudo-marginal Bayesian inference for Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[5] P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, 2003.
[6] S.A. Sisson and Y. Fan. Likelihood-free Markov chain Monte Carlo. Handbook of Markov Chain Monte Carlo, 2010.
[7] T. Chen, E. Fox, and C. Guestrin.
Stochastic Gradient Hamiltonian Monte Carlo. In ICML, pages 1683–1691, 2014.
[8] E. Meeds, R. Leenders, and M. Welling. Hamiltonian ABC. In UAI, 2015.
[9] M. Betancourt. The Fundamental Incompatibility of Hamiltonian Monte Carlo and Data Subsampling. arXiv preprint arXiv:1502.01510, 2015.
[10] H. Haario, E. Saksman, and J. Tamminen. Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375–395, 1999.
[11] C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, December 2008.
[12] D. Sejdinovic, H. Strathmann, M. Lomeli, C. Andrieu, and A. Gretton. Kernel Adaptive Metropolis-Hastings. In ICML, 2014.
[13] B. Sriperumbudur, K. Fukumizu, R. Kumar, A. Gretton, and A. Hyvärinen. Density Estimation in Infinite Dimensional Exponential Families. arXiv preprint arXiv:1312.3516, 2014.
[14] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. JMLR, 6:695–709, 2005.
[15] L. Wasserman. All of nonparametric statistics. Springer, 2006.
[16] C.E. Rasmussen. Gaussian Processes to Speed up Hybrid Monte Carlo for Expensive Bayesian Integrals. Bayesian Statistics 7, pages 651–659, 2003.
[17] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184, 2007.
[18] M. Betancourt, S. Byrne, and M. Girolami. Optimizing The Integrator Step Size for Hamiltonian Monte Carlo. arXiv preprint arXiv:1503.01916, 2015.
[19] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004.
[20] A. Hyvärinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51:2499–2512, 2007.
[21] B.K. Sriperumbudur and Z. Szabó. Optimal rates for random Fourier features. In NIPS, 2015.
[22] G.O. Roberts and R.L. Tweedie.
Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83(1):95–110, 1996.
[23] G.O. Roberts and J.S. Rosenthal. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2):458–475, 2007.
[24] K. Bache and M. Lichman. UCI Machine Learning Repository, 2013.
[25] A. Gretton, K. Borgwardt, B. Schölkopf, A.J. Smola, and M. Rasch. A kernel two-sample test. JMLR, 13:723–773, 2012.
[26] S.N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, 2010.
[27] C. Zhang, B. Shahbaba, and H. Zhao. Hamiltonian Monte Carlo Acceleration Using Neural Network Surrogate functions. arXiv preprint arXiv:1506.05555, 2015.
[28] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.
[29] Q. Le, T. Sarlós, and A. Smola. Fastfood – approximating kernel expansions in loglinear time. In ICML, 2013.
[30] K.L. Mengersen and R.L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24(1):101–121, 1996.