{"title": "Dimensionally Tight Bounds for Second-Order Hamiltonian Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 6027, "page_last": 6037, "abstract": "Hamiltonian Monte Carlo (HMC) is a widely deployed method to sample from high-dimensional distributions in Statistics and Machine learning. HMC is known to run very efficiently in practice and its popular second-order ``leapfrog\" implementation has long been conjectured to run in $d^{1/4}$ gradient evaluations. Here we show that this conjecture is true when sampling from strongly log-concave target distributions that satisfy a weak third-order regularity property associated with the input data. Our regularity condition is weaker than the Lipschitz Hessian property and allows us to show faster convergence bounds for a much larger class of distributions than would be possible with the usual Lipschitz Hessian constant alone. Important distributions that satisfy our regularity condition include posterior distributions used in Bayesian logistic regression for which the data satisfies an ``incoherence\" property. Our result compares favorably with the best available bounds for the class of strongly log-concave distributions, which grow like $d^{{1}/{2}}$ gradient evaluations with the dimension. Moreover, our simulations on synthetic data suggest that, when our regularity condition is satisfied, leapfrog HMC performs better than its competitors -- both in terms of accuracy and in terms of the number of gradient evaluations it requires.", "full_text": "Dimensionally Tight Bounds for Second-Order\n\nHamiltonian Monte Carlo\n\nOren Mangoubi\n\nEPFL\n\nomangoubi@gmail.com\n\nNisheeth K. Vishnoi\n\nEPFL\n\nnisheeth.vishnoi@gmail.com\n\nAbstract\n\nHamiltonian Monte Carlo (HMC) is a widely deployed method to sample from high-\ndimensional distributions in Statistics and Machine learning. 
HMC is known to run very efficiently in practice and its popular second-order "leapfrog" implementation has long been conjectured to run in d^{1/4} gradient evaluations. Here we show that this conjecture is true when sampling from strongly log-concave target distributions that satisfy a weak third-order regularity property associated with the input data. Our regularity condition is weaker than the Lipschitz Hessian property and allows us to show faster convergence bounds for a much larger class of distributions than would be possible with the usual Lipschitz Hessian constant alone. Important distributions that satisfy our regularity condition include posterior distributions used in Bayesian logistic regression for which the data satisfies an "incoherence" property. Our result compares favorably with the best available bounds for the class of strongly log-concave distributions, which grow like d^{1/2} gradient evaluations with the dimension. Moreover, our simulations on synthetic data suggest that, when our regularity condition is satisfied, leapfrog HMC performs better than its competitors -- both in terms of accuracy and in terms of the number of gradient evaluations it requires.

1 Introduction

Sampling problems are ubiquitous in a wide range of scientific and engineering disciplines and have received significant attention in Machine Learning and Statistics. In a typical sampling problem, one wants to generate samples from a given target distribution π(x) ∝ e^{−U(x)}, where one is given access to a function U: R^d → R and possibly its gradient ∇U. In many situations, such as when d is large, sampling problems are computationally difficult, and Markov chain Monte Carlo (MCMC) algorithms are used. MCMC algorithms generate samples by running a Markov chain which converges to the target distribution π. Unfortunately, many MCMC algorithms work by taking independent steps of short size η, meaning that they typically only travel a distance roughly proportional to √i × η in i steps, preventing the algorithm from quickly exploring the target distribution.

One MCMC algorithm that can take large steps is the Hamiltonian Monte Carlo (HMC) algorithm. Each step of the HMC Markov chain involves simulating the trajectory of a particle in the "potential well" U, with the trajectory determined by Hamilton's equations from classical mechanics [2, 38]. To ensure randomization, the momentum is refreshed after each step by independently sampling from a multivariate Gaussian. HMC is a natural approach to the sampling problem because Hamilton's equations preserve the target distribution π. This convenient property reduces the need for frequent Metropolis corrections, which slow down traditional MCMC algorithms, and allows HMC to take large steps. HMC was first discovered by physicists [14], was adopted soon afterwards with much success in Bayesian Statistics and Machine learning [36, 37], and is currently the main algorithm used in the popular software package Stan [5]. Despite its popularity and the widespread belief that HMC is faster than its competitor algorithms in a wide range of high-dimensional sampling problems [11, 1, 38, 3], its theoretical properties are not as well understood as those of its older competitor MCMC algorithms, such as the random walk Metropolis [35] or Langevin [16, 17, 13] algorithms. The lack of theoretical results makes it more difficult to tune the parameters of HMC, and prevents us from having a good understanding of when HMC is faster than its competitor algorithms.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
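The diffusive √i × η scaling mentioned above is easy to check numerically. Below is a minimal sketch (not from the paper); the symmetric one-dimensional ±η random walk is an illustrative stand-in for a small-step MCMC chain:

```python
import random

def rw_distance(i, eta, trials=2000, seed=0):
    """Mean absolute displacement of a symmetric 1-d random walk
    after i independent steps of size eta."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        pos = 0.0
        for _ in range(i):
            pos += eta if rng.random() < 0.5 else -eta
        total += abs(pos)
    return total / trials

# Distance grows like sqrt(i) * eta rather than linearly in i:
# quadrupling the number of steps roughly doubles the distance.
d1 = rw_distance(100, 0.1)
d2 = rw_distance(400, 0.1)
print(d1, d2, d2 / d1)  # ratio close to sqrt(400/100) = 2
```

This is why a chain with step size η needs on the order of (1/η)² steps to move a constant distance, while HMC's long trajectories move ballistically.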
Several recent papers have begun to bridge this gap, showing that HMC is geometrically ergodic for a large class of problems [30, 18] and proving quantitative bounds for the convergence rate of an idealized version of HMC on Gaussian target distributions [42]. Building on probabilistic coupling techniques developed in [42], [16] and [13], [33] later proved a bound of O^*(d^{1/2}) gradient evaluations for a first-order implementation of HMC when U is m-strongly convex with M-Lipschitz gradient (here the O^* notation only includes dependence on d and excludes regularity parameters such as M, m, accuracy parameters, and polylogarithmic factors of d). When the dimension is large, computing O^*(d^{1/2}) gradient evaluations can be prohibitively slow. For this reason, in practice it is much more common to use the second-order "leapfrog" implementation of HMC, which is conjectured to require O^*(d^{1/4}) gradient evaluations based on previous simulations [14] and asymptotic "optimal scaling" results [27, 39]. Very recently, [33] made some progress towards this conjecture by proving that O^*(d^{1/4}) gradient evaluations are required in the special case where U is separable into orthogonal O(1)-dimensional strongly convex components satisfying Lipschitz gradient, Lipschitz Hessian and fourth-order regularity conditions.

Our contributions. We introduce a new, and much weaker, regularity condition that allows us to show that, in many cases, HMC requires at most O^*(d^{1/4}) gradient evaluations. Roughly, our regularity condition allows the Hessian to change quickly in "bad" directions associated with the data, while at the same time guaranteeing that the Hessian changes slowly in the directions traveled by the HMC chain with high probability (Assumption 1).
The fact that our regularity condition need not hold for the "worst-case" directions allows us to show desired bounds on the number of gradient evaluations for a much larger class of distributions than would be possible with more conventional regularity conditions such as the Lipschitz Hessian property. Under our regularity condition we show bounds of O^*(d^{1/4}) gradient evaluations for the leapfrog implementation of HMC when sampling from a large class of strongly log-concave target distributions (Theorem 1). Next, we show that our regularity condition is satisfied by posterior distributions used in Bayesian logistic "ridge" regression. Computing these posterior distributions is important in Statistics and Machine learning applications [40, 22, 32, 44, 25], and quantitative convergence bounds give insight into which MCMC algorithm to use for a given application, and how to optimally tune the algorithm's parameters. Finally, we perform simulations to evaluate the performance of the HMC algorithm analyzed in this paper, and show that its performance is competitive in both accuracy and speed with the Metropolis-adjusted version of HMC despite the lack of a Metropolis filter, when performing Bayesian logistic regression on synthetic data.

Related work. Hamiltonian Monte Carlo. The earliest theoretical analyses of HMC were the asymptotic "optimal scaling" results of [27], for the special case when the target distribution is a multivariate Gaussian. Specifically, they showed that the Metropolis-adjusted implementation of HMC with leapfrog integrator requires a numerical stepsize of O^*(d^{−1/4}) to maintain an Ω(1) Metropolis acceptance probability in the limit as the dimension d → ∞. They then showed that for this choice of numerical stepsize the number of numerical steps HMC requires to obtain samples from Gaussian targets with a small autocorrelation is O^*(d^{1/4}) in the large-d limit. More recently, [39] have extended their asymptotic analysis of the acceptance probability to more general classes of separable distributions.

The earliest non-asymptotic analysis of an HMC Markov chain was provided in [42] for an idealized version of HMC based on continuous Hamiltonian dynamics, in the special case of Gaussian target distributions. [33] show that idealized HMC can sample from general m-strongly logconcave target distributions with M-Lipschitz gradient in Õ(κ²) steps, where κ := M/m (see also [4] for more recent work on idealized HMC). They also show that an unadjusted implementation of HMC with first-order discretization can sample with Wasserstein error ε > 0 in Õ(d^{1/2} κ^{6.5} ε̃^{−1}) gradient evaluations, where ε̃ := ε/√M. In addition, they show that a second-order discretization of HMC can sample from separable target distributions in Õ(d^{1/4} ε̃^{−1} f(m, M, B)) gradient evaluations, where f is an unknown (non-polynomial) function of m, M, B, if the operator norms of the first four Fréchet derivatives of the restriction of U to the coordinate directions are bounded by B. [28] use the conductance method to show that an idealized version of the Riemannian variant of HMC (RHMC) has mixing time with total variation (TV) error ε > 0 of roughly Õ((1/(ψ²T²)) R log(1/ε)), for any 0 ≤ T ≤ d^{−1/4}, where R is a regularity parameter for U and ψ is an isoperimetric constant for π.

Langevin Algorithms. [17] show that the unadjusted Langevin algorithm (ULA) can generate a sample from π with TV error ε > 0 in Õ(d κ² ε^{−2}) gradient evaluations. Using optimization-based techniques from [12], [7, 15] show bounds for ULA in KL divergence. [9] show that underdamped Langevin requires Õ(d^{1/2} κ² ε̃^{−1}) gradient evaluations for Wasserstein error ε > 0 (see also [8]). [19] show that the Metropolis-adjusted Langevin algorithm (MALA) requires Õ(max(dκ, d^{1/2} κ^{1.5}) log(1/ε)) gradient evaluations from a warm start in the TV metric.

Hit-and-run, ball walk, Random walk Metropolis (RWM). The Hit-and-run, ball walk, and RWM algorithms are all thought to have a step size of roughly Θ(1) on sufficiently regular target distributions [21]. Therefore, since most of the probability of a standard spherical Gaussian lies in a ball of radius √d, one would expect all three of these algorithms to take roughly (√d)² = d steps to explore a sufficiently regular target distribution. One should then be able to apply results such as [31] to show that, from a warm start, RWM requires Õ(d κ log(1/ε)) target function evaluations to sample from the target distribution with TV error ε. Interestingly, the only result [19] we are aware of specialized for the strongly log-concave case gives a bound of d² κ² log(1/ε) target function evaluations for RWM.

2 Hamilton's equations and the Hamiltonian Monte Carlo algorithm

In this section we present the background and HMC algorithm; see [38, 2] for a thorough treatment.

Hamiltonian Dynamics. A Hamiltonian of a simple system in R^d is H(q, p) = U(q) + (1/2)∥p∥₂², where q ∈ R^d represents the "position" of a particle in this system, p ∈ R^d the "momentum," U the "potential energy," and (1/2)∥p∥₂² the "kinetic energy." For fixed q, p ∈ R^d, we denote by {q_t(q, p)}_{t≥0}, {p_t(q, p)}_{t≥0} the solutions to Hamilton's equations:

    dq_t(q, p)/dt = p_t(q, p)   and   dp_t(q, p)/dt = −∇U(q_t(q, p)),     (1)

with initial conditions q_0(q, p) = q and p_0(q, p) = p. When the initial conditions (q, p) are clear from the context, we write q_t, p_t in place of q_t(q, p) and p_t(q, p). The gradient −∇U in the second Hamilton equation is thought of as a "force" which acts on the particle.

HMC. We first consider an idealized version of the HMC Markov chain X_0, X_1, . . . based on the continuous Hamiltonian dynamics, with update rule X_{i+1} = q_T(X_i, p_i), where p_1, p_2, . . . ∼ N(0, I_d) are iid. Since solutions to Hamilton's equations have invariant distribution ∝ e^{−H(q,p)} = e^{−U(q)} e^{−(1/2)∥p∥₂²}, idealized HMC has stationary distribution π(q) ∝ e^{−U(q)} equal to the target distribution, without needing a correction such as Metropolis adjustment. This allows HMC to take much larger steps, and hence mix faster, than would otherwise be possible. It is not possible to implement an HMC Markov chain with continuous trajectories, so one must discretize these trajectories using a numerical integrator, such as the popular second-order leapfrog integrator (Step 5 in the algorithm below). In this case, one obtains the following unadjusted HMC (UHMC) Markov chain. The number of gradient evaluations required by UHMC is the main object of study in this paper.

Algorithm 1 Unadjusted HMC
input: Initial point X†_0 ∈ R^d, oracle for gradient ∇U, T > 0, i_max ∈ N, discretization level η > 0
output: Samples X†_0, . . . , X†_{i_max} from the (following) UHMC Markov chain
1: for i = 0 to i_max − 1 do
2:   Sample p_i ∼ N(0, I_d)
3:   Set q_0 = X†_i and p_0 = p_i
4:   for j = 0 to ⌊T/η⌋ − 1 do
5:     Set q_{j+1} = q_j + η p_j − (1/2) η² ∇U(q_j),   p_{j+1} = p_j − (1/2) η ∇U(q_j) − (1/2) η ∇U(q_{j+1})
6:   end for
7:   Set X†_{i+1} = q_{⌊T/η⌋}
8: end for

Initialization: In this paper we prove gradient evaluation bounds for UHMC from both a warm start and a cold start, which we define as follows:

Definition 1. (Warm start) Let X_0, X_1, . . . be a Markov chain, and let π be our target distribution. We say that X has an (ω, δ̂)-warm start if there is a random variable Ỹ_0 ∼ π such that ∥X_0 − Ỹ_0∥₂ < ω/√M with probability 1 − δ̂ for some ω, δ̂ > 0.

Definition 2. (Cold start) We say that X has a cold start if X_0 = x^*, with x^* := argmin_{x∈R^d} U(x).

Since UHMC requires Θ(T/η) gradient evaluations to compute each Markov chain step i, the total number of gradient evaluations required by UHMC is Θ(i_max × T/η). Note that the parameters i_max, T, η are chosen by the user, and the optimal choice of these algorithm parameters may depend on the dimension d and the regularity parameters of U such as M and m.

Remark 1. The number of arithmetic operations required to compute the gradient depends on how the function U is provided to us in a given application. In the Bayesian logistic regression application analyzed at the end of Section 4 of this paper, the number of arithmetic operations required to compute the gradient ∇U is Θ(d), and is the same number of operations required to evaluate the target function U itself. However, in other applications it can take 2d times as many arithmetic operations to compute the gradient ∇U as it takes to compute the target function U.

3 Regularity conditions

In this section we explain the √d gradient evaluation bound barrier in prior approaches and present our regularity condition that overcomes it. Let H_x denote the Hessian of U at x ∈ R^d. We start by noting that if one attempts to bound the number of gradient evaluations required by HMC using a conventional Lipschitz bound on the Hessian,

    ∥(H_y − H_x) p∥₂ ≤ L₂ ∥y − x∥₂ × ∥p∥₂   ∀ x, y, p ∈ R^d,     (2)

that is defined with respect to the Euclidean norm, then the bounds that one obtains are no faster than √d gradient evaluations. The reason is that if we use the usual "Euclidean" Lipschitz Hessian condition to bound the numerical error, we obtain an error bound of roughly √d, since (from a warm start) the trajectories of HMC travel with momentum roughly N(0, I_d), implying that the momentum of these trajectories has Euclidean norm √d with high probability (w.h.p.). To bound the error of a second-order method such as the leapfrog method used by HMC, we must bound the change of the directional derivative of the gradient along the path taken by the trajectories of the Markov chain. In particular, when the leapfrog integrator (Step 5 of Algorithm 1) takes a numerical step from q_j to roughly q_{j+1} ≈ q_j + η p_j, one component of the error in computing the continuous Hamiltonian trajectory can be bounded by the quantity ∥(η² H_{q_j + η p_j} − η² H_{q_j}) p_j∥₂. This quantity in turn can be bounded using the Lipschitz Hessian constant by η² L₂ ∥η p_j∥₂ × ∥p_j∥₂.
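For concreteness, the leapfrog inner loop being analyzed (Step 5 of Algorithm 1) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code; the standard Gaussian potential U(q) = (1/2)∥q∥₂², so that ∇U(q) = q, is an assumption made only for the demo:

```python
import math
import random

def grad_U(q):
    # Illustrative potential U(q) = 0.5 * ||q||^2, hence grad U(q) = q.
    return q[:]

def leapfrog(q, p, eta, n_steps):
    """Leapfrog updates from Step 5 of Algorithm 1:
    q_{j+1} = q_j + eta * p_j - (eta^2 / 2) * grad U(q_j)
    p_{j+1} = p_j - (eta / 2) * (grad U(q_j) + grad U(q_{j+1}))."""
    q, p = q[:], p[:]
    for _ in range(n_steps):
        g = grad_U(q)
        q_new = [qi + eta * pi - 0.5 * eta * eta * gi
                 for qi, pi, gi in zip(q, p, g)]
        g_new = grad_U(q_new)
        p = [pi - 0.5 * eta * (gi + gni)
             for pi, gi, gni in zip(p, g, g_new)]
        q = q_new
    return q, p

def hamiltonian(q, p):
    return 0.5 * sum(qi * qi for qi in q) + 0.5 * sum(pi * pi for pi in p)

rng = random.Random(1)
d, T, eta = 10, 1.0, 0.01
q0 = [rng.gauss(0, 1) for _ in range(d)]
p0 = [rng.gauss(0, 1) for _ in range(d)]
qT, pT = leapfrog(q0, p0, eta, int(round(T / eta)))
# The leapfrog integrator nearly conserves H at small eta, which is
# what lets unadjusted HMC dispense with a Metropolis correction.
drift = abs(hamiltonian(qT, pT) - hamiltonian(q0, p0))
print(drift)
```

On this quadratic potential the trajectory is an exact rotation, so the discretization error of the leapfrog output can also be compared directly against the closed-form solution q(T) = q_0 cos T + p_0 sin T.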
Since pj is roughly N(0, Id)\nwe have\u0001pj\u00012\u2248\u221a\n\u2217(1). To obtain an error bound of\n\u03b72L2d for the error of computing an entire HMC trajectory if T= \u0398\nd). When computing a trajectory of length \u0398\n\u03b5 we therefore need \u03b7= O\n\u2217(1) with this stepsize\n\u2217(\u221a\n\u221a\nd) numerical steps. To overcome this\ngrow as quickly with the dimension as the Euclidean norm for a random N(0, Id) momentum vector.\nWe need a better way to bound the quantity\u0001(\u03b72Hqj+\u03b7pj\u2212 \u03b72Hqj)pj\u00012. One way to do so would\n\u0001(Hy\u2212Hx)v\u00012\u2264 L\u221e\u00d7\u0001y\u2212x\u0001\u221e\u0001v\u0001\u221e for some constant L\u221e> 0. For this norm,\u0001pj\u0001\u221e= O(log(d))\nwith high probability since, roughly, pj \u223c N(0, Id), implying that\u0001(\u03b72Hqj+\u03b7pj\u2212 \u03b72Hqj)pj\u00012 is\nbounded by roughly \u03b72L\u221e log(d) rather than \u03b72L2d.\nSince for many distributions of interest this condition does not hold for a small value of L\u221e, we\ngeneralize this condition, to obtain a smaller L\u221e constant for a wider class of distributions. Towards\nthis end, we de\ufb01ne the vector (semi)-norm\u0001\u22c5\u0001\u221e,u with respect to the collection of unit vectors\nu\u2236={u1, . . . , ur} by\u0001x\u0001\u221e,u\u2236= maxi\u2208{1,...,r}u\ni x. The usual in\ufb01nity norm is just a special case\n\u0016\nof this new norm if we set ui = ei to be the coordinate vectors. Under this more general norm,\nthe magnitude of a random N(0, Id) vector still grows only logarithmically with d, since each\ni x is a univariate standard normal. The associated matrix norm\u0001A\u0001\u221e,u is de\ufb01ned to be\n\u0016\nsup\u0001x\u0001\u221e,u\u22641\u0001Ax\u00012. Using this norm, and motivated by the discussion above, we arrive at our new\nvery quickly in r> 0 \u201cbad\" directions u1, . . . 
, ur, as long as it does not change quickly on average in\nAssumption 1 (In\ufb01nity-norm Lipschitz condition). There exist L\u221e> 0, r\u2208 N, and a collection of\nunit vectors u={u1, . . . , ur}\u2286 Sd, such that\u0001Hy\u2212 Hx\u0001\u221e,u\u2264 L\u221e\u221ar\u0001y\u2212 x\u0001\u221e,u for all x, y\u2208 Rd.\n\nd gradient evaluation\n\u03b7, we therefore need to compute O\nbarrier, we therefore need to control the change in the Hessian with respect to a norm which does not\n\nregularity condition. Roughly speaking, our new regularity condition allows the Hessian to change\n\nbe to replace the Euclidean Lipschitz Hessian condition with an in\ufb01nity-norm Lipschitz condition\n\nd w.h.p., which gives an error bound of \u03b73L2d for one leapfrog step and roughly\n\n\u2217(1~\u221a\n\ncomponent u\n\na random direction.\n\n4\n\n\fthe target functions used in logistic regression. This condition may also be of independent interest.\nAdditional conditions for cold starts. When proving bounds from a cold start, roughly speaking\nwe would still like to guarantee that the HMC trajectories travel with speed O\n\nWe expect this assumption to hold when the target function U is of the form U(x)=\u2211r\ni=1 fi(u\ni x)\n\u0016\nfor functions fi\u2236 R\u2192 R with uniformly bounded third derivatives. In particular, this class includes\n\u2217(1) in any of the \u201cbad\"\ndirections, so that\u0001pt\u0001\u221e,u= O\n\u2217(1). However, unlike from a warm start, we have no guarantee that\nthe momentum is roughly N(0, Id). To bound\u0001pt\u0001\u221e,u we therefore need another way to control\ni pt in each \u201cbad\" direction ui. 
To do so, we would like to guarantee\nthe growth of the quantityu\n\u0016\n\u0016\n\u0016\nthat the bounds on the \u201cforce\" acting on our Hamiltonian trajectory in each ui direction depend only\ni pt of the position and momentum in that direction, regardless of the\ni qt and u\non the components u\nAssumption 2 (Gaussian tail bound condition (for cold start only)). There exists a constant b> 0,\ncomponent of the momentum orthogonal to ui. Towards this end, we assume the following:\nand a collection of unit vectors u={u1, . . . , ur}\u2286 Sd, such that min{mu\n\u0016)}\u2212\n\u0016(x\u2212 x\n\u0016\u2207U(x)\u2264 max{mu\nb\u2264 u\n\n\u0016(x\u2212 x\n\u0016)}+ b for all x\u2208 Rd, u\u2208 u.\n\n\u0016(x\u2212 x\n\n\u0016), M u\n\n\u0016), M u\n\n\u0016(x\u2212 x\n\nsample X\u2020\nX\u2020\n\nimax\n\nfrom \u03c0 such that\u0001X\u2020\n\nm-strongly convex, M-gradient Lipschitz, and satis\ufb01es Assumption 1. Then there exist parameters\n\nof gradient evaluations for both a warm and cold start. Here we focus on the warm start result; see the\narXiv version [34] for bounds from a cold start and a more formal statement of our warm start result.\n\nAssumption 2 gaurantees that the component of the gradient in each \u201cbad\" direction ui is bounded\nsolely in terms of the component of the position in that same \u201cbad\" direction. 
This allows us to apply\narguments based on Gronwall\u2019s inequality on the projection of the trajectory in each bad direction\nui in order to bound the magnitude of the position and momentum at time t in the direction ui.\nUsing Grownwall\u2019s inequality, we bound the component of the initial position and momentum in the\ndirection ui, without assuming a warm start.\n4 Theoretical results\nOur main result is a bound on the number of gradient evaluations required by HMC with second-order\nleapfrog integrator under the in\ufb01nity-norm Lipschitz condition (Assumption 1), when sampling from\n\n\u03c0(x)\u221d e\n\u2212U(x) if U is m-strongly convex with M-Lipschitz gradient. We bound the required number\n\u2212U(x) where U\u2236 Rd\u2192 R is\nTheorem 1 (Bounds for second-order HMC, informal). Let \u03c0(x)\u221d e\nT, \u03b7, imax, such that from an(\u03c9, \u03b4)-warm start, Algorithm 1 generates an approximate independent\nimax\u2212 Y\u00012< \u03b5 for some Y \u223c \u03c0 independent of the initial point\n0 with probability at least 1\u2212 \u03b4. Moreover UHMC requires at most \u02dcO(d1~4\u03b5\nL\u221e log1~2( 1\n\u03b4))\ngradient evaluations whenever m, M, \u03c9= O(1), r= \u02dcO(d) and L\u221e= \u2126(1).\nMore generally, for arbitrary m, M and r, we show (from a warm start with \u03c9 = O(1)) that\n\u0001 \u02dcL\u221e\u03ba2.25\u0002 \u02dc\u03b5\n2( 1\nthe number of gradient evaluations is \u02dcO(max\u0002d1~4\u03ba2.75, r1~4\n\u03b4)), where\n\u22121~2 log\n\u02dcL\u221e\u2236= L\u221e\u221a\nand \u03ba\u2236= M\n\u02dcO(max\u0002d1~4\u03ba3.5, r1~4\nM and \u02dcb\u2236= b~\u221a\n\u22121~2), where \u02dc\u03b5\u2236= \u03b5~\u221a\nIf \u03ba= O(1), L\u221e = O(1) and r= O(d), then our bound on the number of gradient evaluations\n2) from a warm start. 
To the best of our knowledge, our bounds are an improvement\nis O(d 1\n\u2212 1\n\u221a\n\u2217(d 1\n4) bounds in the special case of\na statistical model is oftentimes very large [26, 22], and in many cases \u03ba and L\u221e do not grow, or only\nmT 2), and\n1To obtain bounds from a warm start, we run UHMC with parameters T=(6\n\u03b7= \u0398(min{d\n\u03b4)). From a cold start, we run UHMC with param-\n\u22120.75}\u02dc\u03b5\n2( 1\n\u22121~2\u221e \u03ba\n\u22121~4\u03ba\n\u2212 1\n\u22121.25, r\n\u221a\nmT 2), and \u03b7= \u0398(min{d\nM \u03ba)\u22121, imax= \u0398( 1\n\u22121~2).\neters T=(6\n\u22122, r\n\u2212 1\n\u22121~4 \u02dcL\n\nover all previous gradient evaluation bounds for sampling in this regime, which all have dimension\ndependence\nproduct distributions, unlike in [33] the condition number dependence in our bounds is polynomial.\nWe are especially interested in the regime where d is large since the number of predictor variables in\n\ngrow relatively slowly, with the dimension. We state some concrete examples from Bayesian logistic\nregression of regimes where our gradient evaluation bounds are an improvement on the previous best\nbounds in the discussion after Theorem 2.\n\nm is the \u201ccondition number\". From a cold start, under the additional Gaus-\nsian tail bound condition (Assumption 2), we show that the number of gradient evaluations is\n\n\u221a\nM \u03ba)\u22121, imax= \u0398( 1\n\u22121~2\u221e (\u03ba2.75+ \u02dcb\u03ba1.75)\u22121}\u02dc\u03b51~2M\n\n\u0001 \u02dcL\u221e\u0001\u03ba4.25+ \u02dcb\u03ba3.25\u0001\u0002 \u02dc\u03b5\n\nd or greater. Also note that while [33] obtains O\n\n\u2212 1\n\n1\n2 M\n\n2 log\n\nM\n\n4 \u02dc\u03b5\n\n\u22121~2\u221a\n\n1\n\nM. 
1\n\n\u22121~4 \u02dcL\n\n4 \u03ba\n\n5\n\n\f\u0016\n\nU(\u03b8)= 1~2\u03b8\n\nIn Bayesian logistic \u201cridge\" regression, one would like to\n\nApplications to logistic regression.\nsample from the target log-density\n\ni=1Yi log(F(\u03b8\n\nXi))+(1\u2212 Yi) log(F(\u2212\u03b8\n\u22121\u03b8\u2212\u2211r\n\u0016\nwhere the data vectors X1, . . . Xr \u2208 Rd are thought of as independent variables, the binary data\n\u2212s+ 1)\u22121 is the logistic function, and \u03a3 is\nY1, . . . , Yr\u2208{0, 1} are dependent variables, F(s)\u2236=(e\nj=1X\ninc(X1, . . . Xr)\u2236= maxi\u2208[r]\u2211r\ni Xj.\n\u0016\n\npositive de\ufb01nite. We de\ufb01ne the incoherence of the data as\n\nXi)),\n\n(3)\n\n\u0016\n\n\u03a3\n\n\u221a\n\n8 \u03b5\n\nd\u03b5\n\n.\n\n4 \u03b5\n\nIn all these\n\nsince C, M, m\ndistributed, we have\n\n\u0001 \u02dcL\u221e = \u02dcO(d 1\n\nand the angle between any two of the remaining vectors is greater than \u03c0\n\nspectively. To obtain bounds in the cold start setting, we show Assumption 2 is satis\ufb01ed with\n\nThe proof of Theorem 2 is given in the arXiv version [34]. In particular, when the incoherence is\n\nWe also provide bounds for m and M, showing that our Lipschitz gradient and strong con-\n\n\u0001Xi\u00012\u0002r\nC and \u201cbad\" directions u=\u0002 Xi\ni=1\n\nWe bound the value of the in\ufb01nity-Lipschitz constant in terms of the incoherence:\nTheorem 2 (Regularity bounds for logistic regression). Let U be the logistic regression target for\n\nr> 0 data vectors X1, . . . , Xr, and let inc(X1, . . . , Xr)\u2264 C for some C> 0. Then the in\ufb01nity-norm\nLipschitz assumption is satis\ufb01ed with L\u221e=\u221a\n\u02dcO(1), the constant L\u221e does not grow with dimension: This includes the separable case when the Xi\nwhere r= d and the Xi are unit vectors with the \ufb01rst\n2\u2212 1\nvectors are orthogonal and have unit magnitude. 
It also includes, for instance, the non-separable case\nthe number of gradient evaluations required by UHMC under a standard normal prior is \u02dcO(d 1\n2),\nd of the Xi vectors isotropically distributed,\n\u2212 1\nd. In both these examples\n\u22121, (and therefore \u03ba and \u02dcL\u221e) are all \u02dcO(1). When all r= d vectors are isotropically\n8) and require \u02dcO(d 3\n2) gradient evaluations.\n\u2212 1\nexamples we therefore obtain an improvement over the existing \u02dcO(\u221a\n\u22121) bounds of [9, 33].\nvexity assumptions are satis\ufb01ed for M = \u03bbmax\u0001\u03a3\nk\u0001 and m = \u03bbmin(\u03a3\n\u22121+\u2211r\n\u22121), re-\n\u0016\nk=1 XkX\nb= 2inc(X1~\u0001\u0001X1\u00012, . . . Xr~\u0001\u0001Xr\u00012) if \u03a3 is a multiple of the identity (see arXiv version [34] for proofs).\nalgorithm is given a warm start and where m, M = \u0398(1); the general case is proved in the arXiv\nin [33], it is enough for us to bound the approximation error\u0001X\u2020\ni \u2212 Xi\u00012< \u03b5 for all i\u2264I, where,\nroughly,I= \u0398(log 1\n\u03b5) is a bound on the mixing time of X, if each HMC trajectory is run for time\nT= \u0398(1) .\n\u2217(d 1\n4) gradient evaluation bounds, it is enough to show that an error\n4), since the HMC algorithm\ni\u2212 Xi\u00012< \u03b5 holds for a numerical timestep-size \u03b7= \u2126\n\u2217(d\nbound\u0001X\u2020\n\u2212 1\ncomputesI= O\n\u2217(d 1\n4) gradient\n\u2217(1) trajectories and for this choice of \u03b7 each trajectory takes T\n\u03b7 = O\ni\u2212 Xi\u00012 is bounded by \u03b5 for\nevaluations to compute. Our goal is therefore to show that the error\u0001X\u2020\n4).\n\u2217(d\nall i\u2264I whenever \u03b7= O\n\u2212 1\nstart) the momentum of the HMC trajectories is roughly N(0, Id) to show that the continuous HMC\nthe momentum of the HMC trajectories satisfy\u0001pt\u0001\u221e,u= O(log(d)) at every time t\u2208[0, T]. 
We\n\n5 Proof overview of Theorem 1\nFor simplicity of exposition, in this proof overview we consider the special case where the HMC\n\nversion [34]. Recall that Algorithm 1 generates a Markov chain X\u2020 which approximates the steps\ntaken by the idealized HMC chain X. Since the idealized HMC chain X was shown to mix quickly\n\ntrajectories of the idealized chain X are unlikely to travel quickly in any of the \u201cbad\" directions ui\nspeci\ufb01ed in Assumption 1 (Step 2). Speci\ufb01cally, we show that at every step i with high probability\n\nThe structure of our proof is as follows: We begin by bounding the local error of the leapfrog\nintegrator accumulated at each numerical step (Step 1). Then, we use the fact that (from a warm\n\nTo prove the conjectured O\n\nthen combine steps 1 and 2 to show that the numerical HMC chain also does not travel too quickly in\nany of the \u201cbad\" directions, and use this fact together with our bounds in Step 2 to bound the global\nerror of the numerical HMC trajectories (Step 3). Finally, we compute the value of \u03b7 needed to bound\nthe error \u03b5, and use this to bound the number of gradient evaluations.\nNote that when proving bounds from a cold start we use the additional Assumption 2 instead of the\ninvariant Gibbs distribution to control the behavior of the trajectories. Due to limited space, formal\nproofs are deferred to the arXiv version [34] of our paper.\n\n6\n\n\f(6)\n\n2\n\n\u2248 pj\u2212 \u03b7\u2207U(qj)\u2212 1\n\u221a\nr\u2248 tL\u221e\u0001pt\u00012\u221e,u\n\u221ar.\n\nStep 1: Error bounds for leapfrog integrator.\nIn this subsection we show how to use Assumption\n1 to bound the error of the leapfrog integrator. We are unaware of non-asymptotic second-order\nbounds for the leapfrog integrator, since the previous error bounds for leapfrog we are aware of\nonly hold in the limit as the numerical step size \u03b7 goes to zero [3, 33, 1, 29, 24]. 
For this reason, we prove new non-asymptotic polynomial-time bounds for leapfrog here. Key to our analysis is the observation that the position estimate q_{j+1} = q_j + η p_j − (1/2)η² ∇U(q_j) returned by the leapfrog integrator is exactly the second-order Taylor expansion for q_η(q_j, p_j), and that the momentum estimate p_{j+1} := p_j − (1/2)η ∇U(q_j) − (1/2)η ∇U(q_{j+1}) approximates (with third-order error) the second-order Taylor expansion for p_η(q_j, p_j) in the following way:

p_{j+1} = p_j − (η/2) ∇U(q_j) − (η/2) ∇U(q_{j+1}) ≈ p_j − η ∇U(q_j) − (1/2)η² H_{q_j} p_j.    (4)

The error in the Taylor expansion is due to the fact that the Hessian H_{q_j} is not constant over the trajectory. Roughly, we can use Assumption 1 to bound the error in the Hessian at each time 0 ≤ t ≤ η:

∥(H_{q_t} − H_{q_0}) p_t∥_2 ≤ L_∞ ∥p_t∥_{∞,u} × ∥q_t − q_0∥_{∞,u} √r ≈ t L_∞ ∥p_t∥²_{∞,u} √r.    (5)

Using Equations (4) and (5), we get an error bound of roughly

∥p_{j+1} − p_η(q_j, p_j)∥_2 ≤ η³ L_∞ sup_{t∈[0,η]} ∥p_t(q_j, p_j)∥²_{∞,u} √r.    (6)

Finally, we note that bounding the error for the position variable ∥q_{j+1} − q_η(q_j, p_j)∥_2 can be accomplished using standard techniques which do not require Assumption 1.

Step 2: Bounding ∥p_t∥_{∞,u} for the idealized HMC chain. Since the error bound of the leapfrog integrator depends crucially on ∥p_t∥_{∞,u}, our next task is to show that

∥p_t∥_{∞,u} ≤ O(polylog(d, 1/δ) + ω)    (7)

with high probability for the idealized HMC chain. To do so, we use the fact that (from a warm start) the distribution of the momentum at any point on an HMC trajectory is roughly N(0, I_d). To show this, we would like to use the fact that the position q_t and momentum p_t of HMC trajectories from an idealized HMC chain started at the stationary distribution π are jointly distributed according to the Gibbs distribution ∝ e^{−U(q_t)} e^{−(1/2)∥p_t∥²_2} at any fixed time t.

2a: Bounding ∥p_t∥_{∞,u} from a stationary start. We first consider a copy Ỹ_i of the idealized chain started at the stationary distribution Ỹ_0 ∼ π, and show that the momentum p_t of its trajectories satisfies ∥p_t∥_{∞,u} = O(log(d)) at every time t w.h.p. Since the Ỹ chain is started at the stationary distribution, it remains distributed according to π at every step, and the position q_t and momentum p_t of its trajectories have Gibbs distribution ∝ e^{−U(q_t)} e^{−(1/2)∥p_t∥²_2} at any given time t. Using the fact that for every bad direction u_i the component u_i^⊤ p_t is chi-distributed with 1 degree of freedom, we apply the Hanson-Wright inequality together with a union bound to show that ∥p_t∥_{∞,u} = O(log(d/ξ)) at any fixed time t with probability at least 1 − ξ, for any ξ > 0. However, our goal is to bound ∥p_t∥_{∞,u} simultaneously at every time t, not just at a fixed time. Unfortunately, the trajectories are continuous paths, so we cannot directly apply a union bound to obtain a bound at every t. To get around this problem, we consider J = poly(κ, d) equally spaced timepoints on the interval [0, T], and apply a union bound to show that ∥p_t∥_{∞,u} = polylog(d, 1/δ) at all of these timepoints with probability at least 1 − δ. We then use the "conservation of energy" property to bound the Euclidean norm of the momentum at every time on the trajectory, implying that the position and momentum do not change by more than O(1) inside each time interval of length 1/J. This in turn implies that ∥p_t∥_{∞,u} = polylog(d, 1/δ) at every time t ∈ [0, T].

2b: Bounding ∥p_t∥_{∞,u} from a warm start. Unfortunately, we cannot apply our results of Step 2a directly, since we are only assuming that X_0 has a warm start, not a stationary start. That is, we only assume that ∥X_0 − Ỹ_0∥_2 < ω for some ω > 0, where Ỹ_0 ∼ π is at the stationary distribution. To show that the trajectories of our warm-started chain also approximately satisfy this Gibbs distribution property, we couple the two copies X and Ỹ of the idealized HMC chain by defining the Ỹ chain using the update rule Ỹ_{i+1} = q_T(Ỹ_i, p_i), with the same sequence of initial momenta p_1, p_2, . . . that were used to define the X chain. Using the fact that the trajectories share the same initial momentum p_i at every step, we show that at every continuous time t ∈ [0, T] the Euclidean distance between the position and momentum of the trajectories of the two chains remains bounded by ω. We therefore have ∥p_t(X_i, p_i) − p_t(Ỹ_i, p_i)∥_{∞,u} ≤ ∥p_t(X_i, p_i) − p_t(Ỹ_i, p_i)∥_2 ≤ ω, which together with Step 2a gives Equation (7): ∥p_t∥_{∞,u} = polylog(d, 1/δ) + ω.

Step 3: Bounding the global error and the number of gradient evaluations. So far, we have shown that the trajectories of the idealized HMC chain X satisfy a bound on ∥p_t∥_{∞,u} (Equation (7)). If we can extend this bound to the numerical chain, we can apply it to Inequality (6) to show that the error at each step is O(η³ L_∞ √r) w.h.p. To bound the global error, we use roughly the following inductive argument: inductively assume that the errors ∥q_j − q_{jη}(X_i, p_i)∥_2 and ∥p_j − p_{jη}(X_i, p_i)∥_2 at numerical step j are bounded by roughly jη × ε. This implies that

∥p_j∥_{∞,u} ≤ ∥p_{jη}(X_i, p_i)∥_{∞,u} + jηε.    (8)

Then one can use similar "conservation of energy" arguments as in the previous section to show that ∥p_t(q_j, p_j)∥_{∞,u} = polylog(d, 1/δ) over the short time interval t ∈ [0, η]. Plugging this bound into Inequality (6) allows us to bound the error accumulated at step j by O(η³ L_∞ √r), implying that the inductive assumption also holds for step j + 1.

After T/η numerical steps, the global error of each trajectory is therefore bounded by T × η² L_∞ √r, and the error at step i is bounded by i × T η² L_∞ √r. Finally, we conclude that ∥X†_i − X_i∥_2 < ε for all i ≤ I = O(log(1/ε)) whenever η^{−1} = Θ̃(r^{1/4} (L̃_∞ ε^{−1})^{1/2} T). Since the algorithm uses a total of T/η numerical steps to compute X_i for each i ≤ I, for T = Θ(1) the number of gradient evaluations is roughly Õ(r^{1/4} (L̃_∞ ε^{−1})^{1/2}). When r = O(d), the number of gradient evaluations is roughly Õ(d^{1/4} (L̃_∞ ε^{−1})^{1/2}).

Remark 2. More generally, we can consider Hamiltonian trajectories with Hamiltonian H(q, p) = U(q) + (1/2) p^⊤ Ω p, where Ω is called the mass matrix. In practice, one tunes the algorithm parameters by using a mass matrix Ω = cI_d for some constant c > 0 and by choosing an appropriate integration time T (as well as choosing other parameters such as the numerical step size). Since using a mass matrix of the form Ω = cI_d for some constant c is equivalent to rescaling U by a constant factor and tuning the integration time T, we analyze the case Ω = I_d with M = 1 and κ = 1/m, and then "tune" our algorithm by setting T = √m/6 to determine the number of gradient evaluations. (A more general mass matrix Ω is equivalent to applying a pre-conditioner on U.)

6 Simulations

6.1 Accuracy and autocorrelation time of Unadjusted HMC

The purpose of our first set of simulations is to show that, in the practical situations analyzed in this paper, the unadjusted HMC algorithm (UHMC) is competitive with other popular sampling algorithms both in terms of accuracy and in terms of the number of gradient evaluations required. We compare UHMC to Metropolis-adjusted HMC (MHMC) [14], the Metropolis-adjusted Langevin algorithm (MALA) [41] and the unadjusted Langevin algorithm (ULA) [41]. All simulations were implemented in MATLAB (see the GitHub repository https://github.com/mangoubi/HMC for the MATLAB code used to implement these algorithms).

We consider the setting of Bayesian logistic regression with a standard normal prior, with synthetic "independent variable" data vectors generated as X_i = Z_i/∥Z_i∥_2 for Z_1, . . . , Z_r ∼ N(0, I_d) iid, for dimension d = 1000 and r = d. To generate the synthetic "dependent variable" binary data, a vector β = (β_1, . . . , β_d) of regression coefficients was first generated as β = W/∥W∥_2, where W ∼ N(0, I_d). The binary dependent variable synthetic data Y_1, . . . , Y_d were then generated as independent Bernoulli random variables, setting Y_i = 1 with probability 1/(1 + e^{−β^⊤ X_i}) and Y_i = 0 otherwise. Each Markov chain was initialized at a point X_0 chosen randomly as X_0 ∼ N(0, I_d).

Figure 1: Marginal accuracy (left) and autocorrelation time (middle) vs. numerical step size η for MALA, ULA, MHMC, and UHMC.
The log plot (right) compares an estimate of the quantity used to obtain second-order numerical error bounds using the Euclidean Lipschitz and infinity-norm Lipschitz constants, at different d.

To compare the accuracy, we computed the "marginal accuracy" (MA) of the samples generated by each chain over a fixed number (50,000) of numerical steps for different step sizes η in the interval [0.1, 0.6] (Figure 1, left). Among all four of the algorithms, we found that UHMC had the highest accuracy at the accuracy-optimizing step size (the accuracy-optimizing step size was η = 0.35 for UHMC). To compare the runtime, we computed the autocorrelation time of the samples for a test function f(x) = ∥x∥_1.² We found that the autocorrelation time of UHMC was shortest at the autocorrelation-time-optimizing step size (the autocorrelation-time-optimizing step size was η = 0.5 for UHMC) (Figure 1, middle). When running UHMC and MHMC, we used a trajectory time T equal to π/3, rounded down to the nearest multiple of η.

Remark 3. The marginal accuracy is used as a heuristic to compare the accuracy of samplers (see e.g. [18], [20] and [10]). The marginal accuracy between the measure μ of a sample and the target π is MA(μ, π) := 1 − (1/(2d)) ∑_{i=1}^d ∥μ_i − π_i∥_TV, where μ_i and π_i are the marginal distributions of μ and π for the coordinate x_i.
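The two diagnostics above can be sketched numerically. Below is a minimal Python illustration (our own helper names and binning choices, not the paper's MATLAB code) of the marginal accuracy of Remark 3, estimated per coordinate by histogramming two sample sets on a common grid, and of the autocorrelation-time estimate 1 + 2 ∑_{s=1}^{s_max} ρ_s used for the middle panel of Figure 1:

```python
import numpy as np

def marginal_accuracy(samples, ref_samples, n_bins=30):
    """MA(mu, pi) = 1 - (1/(2d)) * sum_i ||mu_i - pi_i||_TV (Remark 3),
    reading ||.||_TV as the L1 distance between the per-coordinate
    histograms of the two sample sets (one common reading of the formula).
    samples, ref_samples: arrays of shape (n_samples, d)."""
    d = samples.shape[1]
    l1_sum = 0.0
    for i in range(d):
        lo = min(samples[:, i].min(), ref_samples[:, i].min())
        hi = max(samples[:, i].max(), ref_samples[:, i].max())
        edges = np.linspace(lo, hi, n_bins + 1)
        mu, _ = np.histogram(samples[:, i], bins=edges)
        pi, _ = np.histogram(ref_samples[:, i], bins=edges)
        l1_sum += np.abs(mu / mu.sum() - pi / pi.sum()).sum()
    return 1.0 - l1_sum / (2 * d)

def autocorrelation_time(x, s_max=100):
    """Estimate 1 + 2 * sum_{s=1}^{s_max} rho_s for a scalar chain x,
    where rho_s is the empirical autocorrelation at lag s."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(x * x)
    n = len(x)
    rhos = [np.mean(x[: n - s] * x[s:]) / var for s in range(1, s_max + 1)]
    return 1.0 + 2.0 * float(np.sum(rhos))
```

In the paper's experiments the reference samples come from a very long MALA run (see below); here any sufficiently accurate reference sample set can play that role.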
Since MALA is known to sample from the correct stationary distribution and is geometrically ergodic for the class of distributions analyzed in this paper, we used the samples generated after running MALA for a very long time (10^6 steps) to obtain a more accurate approximation for π as a benchmark with which to compare the sampling accuracy of the four different algorithms when run for a much shorter amount of time (50,000 numerical steps).

6.2 Comparing Euclidean and infinity-norm Lipschitz conditions

The goal of our second set of simulations was to compare the optimal values of the usual Euclidean Lipschitz Hessian constant L_2 to the constant L_∞ from the infinity-norm Lipschitz condition of Assumption 1. We performed this comparison for the logistic regression example of the previous simulation, with synthetic data generated in the same way but for different values of d, with r = d. At each value of d we used MATLAB's "fminunc" function to search for the optimal values of L_2 and L_∞. The optimal values are L_2 = sup_{x,y,v∈R^d} (1/(∥y−x∥_2 ∥v∥_2)) ∥(H_y − H_x)v∥_2 and L_∞ = sup_{x,y,v∈R^d} (1/(√r ∥y−x∥_{∞,u} ∥v∥_{∞,u})) ∥(H_y − H_x)v∥_2, with "sup" taken over points where the function is defined. Recall that to bound the error of a numerical integrator with momentum p_t, one may use one of the two quantities √L_2 ∥p_t∥_2 and √L_∞ r^{1/4} ∥p_t∥_{∞,u}. We plot the median value of these quantities for random momenta p_t ∼ N(0, I_d) (Figure 1, right).
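One ingredient in the gap between these two quantities is the behavior of the momentum norms themselves: for p ∼ N(0, I_d), ∥p∥_2 concentrates around √d while ∥p∥_∞ grows only like √(2 log d). The small sketch below (ours, not the paper's fminunc-based estimation of L_2 and L_∞) illustrates this part of the gap:

```python
import numpy as np

rng = np.random.default_rng(0)

def median_momentum_norms(d, n_trials=200):
    """Median ||p||_2 and ||p||_inf over n_trials draws p ~ N(0, I_d).
    ||p||_2 concentrates near sqrt(d); ||p||_inf near sqrt(2 log d)."""
    p = rng.standard_normal((n_trials, d))
    return (float(np.median(np.linalg.norm(p, axis=1))),
            float(np.median(np.abs(p).max(axis=1))))

for d in [10, 100, 1000]:
    l2, linf = median_momentum_norms(d)
    print(f"d={d:4d}: median ||p||_2 = {l2:5.1f}, median ||p||_inf = {linf:4.2f}")
```

A bound whose momentum dependence enters through an infinity-type norm therefore picks up only a polylogarithmic factor in d from the momentum, consistent with the trend in the right panel of Figure 1 (the constants L_2 and L_∞ themselves contribute the rest of the gap).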
Our results show that the median of √L_2 ∥p_t∥_2 increases with d at a faster rate than the median of √L_∞ r^{1/4} ∥p_t∥_{∞,u} over the interval d ∈ [1, 1000]. This suggests that bounds based on our infinity-norm Lipschitz condition can be much tighter for distributions used in practice than bounds based on the usual Euclidean Lipschitz condition.

7 Conclusions and future directions

In this paper, we show that the conjecture of [11], which says that HMC requires O*(d^{1/4}) gradient evaluations, is true when sampling from strongly log-concave targets satisfying weak regularity properties associated with the input data. In doing so, we introduce a new regularity property for the Hessian (Assumption 1) that is much weaker than the Lipschitz Hessian property, and show that for a class of functions arising in statistics and machine learning this property holds under natural conditions on the data. One future direction is to further weaken Assumption 1, which says that the Hessian does not change too quickly in all but a few fixed bad directions, by instead allowing these directions to vary with the position x. Our simulations show that UHMC is competitive with MHMC on synthetic data that satisfies our regularity assumption. Further, we show that the constant in our regularity assumption grows much more slowly with the dimension than the Euclidean Lipschitz constant of the Hessian. It would also be interesting to extend our results to non-logconcave targets, and to kth-order numerical implementations of generalizations of HMC, such as RHMC.

Bounds for MHMC. Another open problem is to show tight gradient evaluation bounds for MHMC. Since the Metropolis-adjusted HMC Markov chain preserves the stationary distribution exactly, it should be possible to show that the number of gradient evaluations is polylogarithmic in ε^{−1}, improving on the number of gradient evaluations required by unadjusted HMC, which grows like ε^{−1/2}. Unfortunately, the probabilistic coupling approach used in our current paper is unlikely to work for MHMC, since Metropolis "accept/reject" steps tend to break the coupling of the two Markov chains if the chains have different acceptance probabilities, causing one chain to accept its proposal while the other rejects it. An alternative approach might be to use a proof based on the conductance method (see e.g. [43]). Unlike the coupling approach, the conductance method is compatible with Metropolis "accept/reject" steps.

² The autocorrelation time can be estimated as 1 + 2 ∑_{s=1}^{s_max} ρ_s, where ρ_s is the autocorrelation at lag s, for some large s_max [23, 6].

References

[1] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesus-Maria Sanz-Serna, and Andrew Stuart. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.

[2] Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.

[3] MJ Betancourt, Simon Byrne, and Mark Girolami. Optimizing the integrator step size for Hamiltonian Monte Carlo. arXiv preprint arXiv:1411.6669, 2014.

[4] Nawaf Bou-Rabee, Andreas Eberle, and Raphael Zimmer. Coupling and convergence for Hamiltonian Monte Carlo.
arXiv preprint arXiv:1805.00452, 2018.\n\n[5] Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A\nBrubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal\nof Statistical Software, 20:1\u201337, 2016.\n\n[6] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Interna-\n\ntional Conference on Machine Learning, pages 1683\u20131691, 2014.\n\n[7] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Proceedings of\n\nAlgorithmic Learning Theory, pages 186\u2013211, 2018.\n\n[8] Xiang Cheng, Niladri S Chatterji, Yasin Abbasi-Yadkori, Peter L Bartlett, and Michael I Jordan. Sharp\nconvergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648,\n2018.\n\n[9] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped Langevin\n\nMCMC: A non-asymptotic analysis. In Conference on Learning Theory, pages 300\u2013323, 2018.\n\n[10] Nicolas Chopin, James Ridgway, et al. Leave Pima indians alone: binary regression as a benchmark for\n\nBayesian computation. Statistical Science, 32(1):64\u201387, 2017.\n\n[11] Michael Creutz. Global Monte Carlo algorithms for many-Fermion systems. Physical Review D, 38(4):1228,\n\n1988.\n\n[12] Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo\n\nand gradient descent. In Conference on Learning Theory, pages 678\u2013689, 2017.\n\n[13] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave\ndensities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651\u2013676,\n2017.\n\n[14] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics\n\nletters B, 195(2):216\u2013222, 1987.\n\n[15] Alain Durmus, Szymon Majewski, and B\u0142a\u02d9zej Miasojedow. 
Analysis of Langevin Monte Carlo via convex optimization. arXiv preprint arXiv:1802.09188, 2018.

[16] Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.

[17] Alain Durmus and Eric Moulines. Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm. The Annals of Applied Probability, in press.

[18] Alain Durmus, Eric Moulines, and Eero Saksman. On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166, 2017.

[19] Raaz Dwivedi, Yuansi Chen, Martin J Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Conference on Learning Theory, pages 793–797, 2018.

[20] Christel Faes, John T Ormerod, and Matt P Wand. Variational Bayesian inference for parametric and nonparametric regression with missing data. Journal of the American Statistical Association, 106(495):959–971, 2011.

[21] Andrew Gelman, Walter R Gilks, and Gareth O Roberts. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120, 1997.

[22] Alexander Genkin, David D Lewis, and David Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3):291–304, 2007.

[23] Jonathan Goodman and Jonathan Weare. Ensemble samplers with affine invariance. Communications in applied mathematics and computational science, 5(1):65–80, 2010.

[24] Ernst Hairer, Christian Lubich, and Gerhard Wanner. Geometric numerical integration illustrated by the Störmer–Verlet method. Acta numerica, 12:399–450, 2003.

[25] Matthew D Hoffman and Andrew Gelman. The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

[26] Valen E Johnson.
On numerical aspects of Bayesian model selection in high and ultrahigh-dimensional\n\nsettings. Bayesian analysis (Online), 8(4):741, 2013.\n\n[27] AD Kennedy and Brian Pendleton. Acceptances and autocorrelations in hybrid Monte Carlo. Nuclear\n\nPhysics B-Proceedings Supplements, 20:118\u2013121, 1991.\n\n[28] Yin Tat Lee and Santosh S Vempala. Convergence rate of Riemannian Hamiltonian Monte Carlo and faster\npolytope volume computation. To appear in Proceedings of STOC 2018, arXiv preprint arXiv:1710.06261,\n2017.\n\n[29] Benedict Leimkuhler and Sebastian Reich. Simulating Hamiltonian dynamics, volume 14. Cambridge\n\nuniversity press, 2004.\n\n[30] Samuel Livingstone, Michael Betancourt, Simon Byrne, and Mark Girolami. On the geometric ergodicity\n\nof Hamiltonian Monte Carlo. arXiv preprint arXiv:1601.08057, 2016.\n\n[31] L\u00e1szl\u00f3 Lov\u00e1sz and Santosh Vempala. Hit-and-run is fast and fun. preprint, Microsoft Research, 2003.\n\n[32] David Madigan, Alexander Genkin, David D Lewis, and Dmitriy Fradkin. Bayesian multinomial logistic\nregression for author identi\ufb01cation. In AIP Conference Proceedings, volume 803, pages 509\u2013516. AIP,\n2005.\n\n[33] Oren Mangoubi and Aaron Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave\n\ndistributions. arXiv preprint arXiv:1708.07114, 2017.\n\n[34] Oren Mangoubi and Nisheeth K Vishnoi. Dimensionally tight bounds for second-order Hamiltonian Monte\n\nCarlo. arXiv preprint arXiv:1802.08898, 2018.\n\n[35] Jonathan C Mattingly, Natesh S Pillai, Andrew M Stuart, et al. Diffusion limits of the random walk\n\nMetropolis algorithm in high dimensions. The Annals of Applied Probability, 22(3):881\u2013930, 2012.\n\n[36] Radford M Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method.\nTechnical report, Technical Report CRG-TR-92-1, Dept. of Computer Science, University of Toronto,\n1992.\n\n[37] Radford M. Neal. Bayesian Learning for Neural Networks. 
Springer-Verlag, Berlin, 1996.

[38] Radford M Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162, 2011.

[39] Natesh S Pillai, Andrew M Stuart, and Alexandre H Thiéry. Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. The Annals of Applied Probability, 22(6):2320–2356, 2012.

[40] Nicholas G Polson, James G Scott, and Jesse Windle. Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349, 2013.

[41] Gareth O Roberts, Richard L Tweedie, et al. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.

[42] Christof Seiler, Simon Rubinstein-Salzedo, and Susan Holmes. Positive curvature and Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages 586–594, 2014.

[43] Santosh Vempala. Geometric random walks: a survey. Combinatorial and computational geometry, 52(573-612):2, 2005.

[44] Tong Zhang and Frank J Oles. Text categorization based on regularized linear classification methods. Information retrieval, 4(1):5–31, 2001.