{"title": "Thermostat-assisted continuously-tempered Hamiltonian Monte Carlo for Bayesian learning", "book": "Advances in Neural Information Processing Systems", "page_first": 10673, "page_last": 10682, "abstract": "In this paper, we propose a novel sampling method, the thermostat-assisted continuously-tempered Hamiltonian Monte Carlo, for the purpose of multimodal Bayesian learning. It simulates a noisy dynamical system that incorporates both a continuously-varying tempering variable and the Nosé-Hoover thermostats. A significant benefit is that it is not only able to efficiently generate i.i.d. samples when the underlying posterior distributions are multimodal, but is also capable of adaptively neutralising the noise arising from the use of mini-batches. While the properties of the approach have been studied using synthetic datasets, our experiments on three real datasets also show its performance gains over several strong baselines for Bayesian learning with various types of neural networks.", "full_text":

Thermostat-assisted continuously-tempered Hamiltonian Monte Carlo for Bayesian learning

Rui Luo1, Jianhong Wang*1, Yaodong Yang*1, Zhanxing Zhu2, and Jun Wang†1

1University College London, 2Peking University

Abstract

We propose a new sampling method, the thermostat-assisted continuously-tempered Hamiltonian Monte Carlo, for Bayesian learning on large datasets and multimodal distributions. It simulates the Nosé-Hoover dynamics of a continuously-tempered Hamiltonian system built on the distribution of interest. A significant advantage of this method is that it is not only able to efficiently draw representative i.i.d.
samples when the distribution contains multiple isolated modes, but is also capable of adaptively neutralising the noise arising from mini-batches and maintaining accurate sampling. While the properties of this method have been studied using synthetic distributions, experiments on three real datasets also demonstrate its performance gains over several strong baselines with various types of neural networks.

1 Introduction

Bayesian learning via Markov chain Monte Carlo (MCMC) methods is appealing for its inborn nature of characterising the uncertainty within the learnable parameters. However, when the distributions of interest contain multiple modes, rapid exploration of the corresponding multimodal landscapes w.r.t. the parameters becomes difficult using classic methods [7, 16]. In particular, given a large number of modes, some "distant" ones might be beyond reach from the others; this can potentially lead to so-called pseudo-convergence [1], where the guarantee of ergodicity for MCMC methods breaks down.

To make things worse, Bayesian learning on large datasets is typically conducted in an online setting: at each iteration, only a subset of the dataset, i.e. a mini-batch, is utilised to update the model parameters [24]. Although the computational complexity is substantially reduced, those mini-batches inevitably introduce noise into the system and therefore increase the uncertainty within the parameters, making it harder to properly sample multimodal distributions.

In this paper, we propose a new sampling method, referred to as the thermostat-assisted continuously-tempered Hamiltonian Monte Carlo, to address the aforementioned problems and to facilitate Bayesian learning on large datasets and multimodal posterior distributions. We extend the classic Hamiltonian Monte Carlo (HMC) with the scheme of continuous tempering stemming from recent advances in physics [8] and chemistry [15]. The extended dynamics governs the variation of the effective temperature for the distribution of interest in a continuous and systematic fashion, such that the sampling trajectory can readily overcome high energy barriers and rapidly explore the entire parameter space. In addition to tempering, we also introduce a set of Nosé-Hoover thermostats [18, 11] to handle the noise arising from the use of mini-batches. The thermostats are integrated into the tempered dynamics so that the mini-batch noise can be effectively recognised and automatically neutralised. In short, the proposed method leverages continuous tempering to enhance sampling efficiency, especially for multimodal distributions; it makes use of Nosé-Hoover thermostats to adaptively dissipate the instabilities caused by mini-batches so that the desired distributions can be recovered. Various experiments are conducted to demonstrate the effectiveness of the new method: it consistently outperforms several samplers and optimisers on the accuracy of image classification with different types of neural network.

*Equal contribution
†Correspondence to: j.wang@cs.ucl.ac.uk

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Preliminaries

We review HMC [6] and continuous tempering [8, 15], the two bases of our model: the former serves as a de facto standard for Bayesian sampling, and the latter is a state-of-the-art solution for accelerating molecular dynamics simulations of complex physical systems.

2.1 Hamiltonian Monte Carlo for posterior sampling

Bayesian posterior sampling aims at efficiently generating i.i.d. samples from the posterior ρ(θ|D) of the variable of interest θ given some dataset D.
Provided with the prior ρ(θ) and the likelihood L(θ; D), where the dataset D = {x_i} consists of |D| independent data points x_i, the target posterior to generate samples from can be formulated as

\[
\rho(\theta|\mathcal{D}) \propto \rho(\theta)\,\mathcal{L}(\theta;\mathcal{D}) = \rho(\theta)\prod_{i=1}^{|\mathcal{D}|}\ell(\theta;x_i),
\tag{1}
\]

with the likelihood per data point ℓ(θ; x_i).

In a typical HMC setting [16], a physical system is constructed and connected to the target posterior in Eq. (1) via the system's potential, which is defined as

\[
U(\theta) = -\log\rho(\theta|\mathcal{D}) = -\log\rho(\theta) - \sum_{i=1}^{|\mathcal{D}|}\log\ell(\theta;x_i) - \text{const}.
\tag{2}
\]

In this system, the variable of interest θ ∈ R^D, referred to as the system configuration, is interpreted as the joint position of all physical objects within that system. An auxiliary variable p_θ ∈ R^D is then introduced as the conjugate momentum w.r.t. θ to describe its rate of change. The tuple Γ = (θ, p_θ) represents the state of the physical system, which uniquely determines the characteristics of that system. A predefined constant matrix M_θ = diag[m_{θ_i}] specifies the masses of the objects associated with θ and can be leveraged for preconditioning.

The energy function H(Γ) of the physical system, referred to as the Hamiltonian, is essentially the sum of the potential in Eq. (2) and the conventional quadratic kinetic energy: H(Γ) = U(θ) + p_θ⊤ M_θ⁻¹ p_θ / 2. The Hamiltonian dynamics, i.e. Hamilton's equations of motion, can be derived by applying the Hamiltonian formalism [θ̇ = ∂_{p_θ}H, ṗ_θ = −∂_θH] to H(Γ), where θ̇ and ṗ_θ denote the time derivatives.

The Hamiltonian dynamics, on one hand, describes the time evolution of the system from a microscopic perspective. The principles of statistical physics, on the other hand, state in a macroscopic sense that, given a physical system in thermal equilibrium with a heat bath at a fixed temperature T, the states Γ of that system are distributed according to a particular distribution related to the system's Hamiltonian H(Γ):

\[
\pi(\Gamma) = \frac{1}{Z_\Gamma(T)}\,e^{-H(\Gamma)/T}, \quad\text{with the normalising constant } Z_\Gamma(T) = \int_\Gamma e^{-H(\Gamma)/T}.
\tag{3}
\]

This distribution is referred to as the canonical distribution. Note that by setting T = 1 and U(θ) as in Eq. (2), the canonical distribution in Eq. (3) marginalises to the posterior in Eq. (1).

2.2 Continuous tempering

In physical chemistry, continuous tempering [8, 15] is currently a state-of-the-art method to accelerate molecular dynamics simulations by means of continuously and systematically varying the temperature of a physical system. It extends the original system by coupling it with additional degrees of freedom, namely the tempering variable ξ ∈ R with mass m_ξ as well as its conjugate momentum p_ξ ∈ R, which control the effective temperature of the original system in a continuous fashion via the Hamiltonian dynamics of the extended system.
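To make the full-batch setting of Section 2.1 concrete, the following is a minimal sketch (not the authors' implementation) of HMC with a leapfrog integrator and a Metropolis correction, targeting the canonical distribution at T = 1 for a toy potential U(θ) = θ²/2, i.e. a standard normal posterior:

```python
import numpy as np

def leapfrog(theta, p, grad_U, eps, L):
    # Simulate Hamiltonian dynamics with the leapfrog integrator:
    # half-step momentum, alternating full steps, half-step momentum.
    p = p - 0.5 * eps * grad_U(theta)
    for _ in range(L - 1):
        theta = theta + eps * p          # unit mass, M_theta = I
        p = p - eps * grad_U(theta)
    theta = theta + eps * p
    p = p - 0.5 * eps * grad_U(theta)
    return theta, p

def hmc(U, grad_U, theta0, n_samples, eps=0.1, L=20, seed=0):
    rng = np.random.default_rng(seed)
    theta, samples = theta0, []
    for _ in range(n_samples):
        p = rng.standard_normal(theta.shape)      # resample momentum
        H_old = U(theta) + 0.5 * p @ p            # H = U + kinetic energy
        theta_new, p_new = leapfrog(theta, p, grad_U, eps, L)
        H_new = U(theta_new) + 0.5 * p_new @ p_new
        if rng.random() < np.exp(H_old - H_new):  # Metropolis correction
            theta = theta_new
        samples.append(theta.copy())
    return np.array(samples)

# Toy target: U(theta) = theta^2 / 2, a standard normal distribution.
samples = hmc(lambda t: 0.5 * t @ t, lambda t: t, np.zeros(1), n_samples=5000)
```

The Metropolis step corrects for discretisation error of the integrator; it is this step that becomes problematic with mini-batches, motivating the stochastic-gradient variants discussed later.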
With a suitable choice of coupling function λ(ξ) and a compatible confining potential W(ξ), the Hamiltonian of the extended system can be designed as

\[
H(\Gamma) = \lambda(\xi)U(\theta) + W(\xi) + p_\theta^\top M_\theta^{-1} p_\theta/2 + p_\xi^2/2m_\xi,
\tag{4}
\]

where Γ = (θ, ξ, p_θ, p_ξ) represents the state of the extended system, with the position of the tempering variable ξ and its momentum p_ξ appended to the state of the original system (θ, p_θ). λ(ξ) ∈ R₊ maps the tempering variable to a multiplier of temperature so that the effective temperature of the original system, T/λ(ξ), can vary; its domain dom λ(ξ) ⊂ R is a finite interval regulated by W(ξ).

3 Thermostat-assisted continuously-tempered Hamiltonian Monte Carlo

We propose a sampling method, called the thermostat-assisted continuously-tempered Hamiltonian Monte Carlo (TACT-HMC), for multimodal posterior sampling in the presence of unknown noise. TACT-HMC leverages the extended Hamiltonian in Eq. (4) to raise and vary the effective temperature continuously; it efficiently lowers the energy barriers between modes and hence accelerates sampling. Our method also incorporates the Nosé-Hoover thermostats to effectively recognise and automatically neutralise the noise arising from the use of mini-batches.

3.1 System dynamics with the Nosé-Hoover augmentation

In solving for the system dynamics, we apply the Hamiltonian formalism to the extended Hamiltonian in Eq. (4), which requires the potential U(θ) and its gradient ∇_θU(θ). We hereafter define the negative gradient of the potential U(θ) as the induced force f(θ) = −∇_θU(θ).
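The effect of the coupling in Eq. (4) can be illustrated numerically. The sketch below uses a hypothetical coupling function `lam` (the paper only requires λ(ξ) > 0 on a finite interval; this specific form is an assumption for illustration) and shows how the tempered potential λ(ξ)U(θ) shrinks the energy barrier of a double-well potential:

```python
import numpy as np

def U(theta):
    """Double-well potential with modes at theta = ±1 and a barrier at 0."""
    return (theta**2 - 1.0)**2 / 0.1

def lam(xi):
    # Hypothetical coupling function for this sketch: lambda = 1 at xi = 0
    # (unit effective temperature) and lambda = 0.2 at the ends of its domain,
    # i.e. an effective temperature of T / 0.2 = 5T when T = 1.
    return 1.0 - 0.8 * np.clip(np.abs(xi), 0.0, 1.0)

# Energy barrier of the *tempered* potential lambda(xi) * U(theta),
# evaluated at the barrier top theta = 0.
barrier_cold = lam(0.0) * U(0.0)   # xi = 0: the original barrier height
barrier_hot  = lam(1.0) * U(0.0)   # xi = 1: barrier scaled down five-fold
```

At high effective temperature the trajectory crosses between the two modes far more easily, which is the mechanism by which continuous tempering accelerates mixing.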
Because the calculation of either U(θ) or f(θ) involves the full dataset D = {x_i}, it is computationally expensive or even unaffordable to calculate the actual values for large |D|. Instead, we consider the mini-batch approximations

\[
\tilde{U}(\theta) = -\log\rho(\theta) - \frac{|\mathcal{D}|}{|\mathcal{S}|}\sum_{k=1}^{|\mathcal{S}|}\log\ell(\theta;x_{i_k})
\quad\text{and}\quad
\tilde{f}(\theta) = \nabla_\theta\log\rho(\theta) + \frac{|\mathcal{D}|}{|\mathcal{S}|}\sum_{k=1}^{|\mathcal{S}|}\nabla_\theta\log\ell(\theta;x_{i_k}),
\]

where x_{i_k} denotes a data point sampled into the mini-batch S = {x_{i_k}} ⊂ D of size |S| ≪ |D|. It is clear that Ũ(θ) and f̃(θ) are unbiased estimators of U(θ) and f(θ).

As we assume the x_{i_k} to be mutually independent, Ũ(θ) and f̃(θ) are sums of |S| i.i.d. random variables, to which the Central Limit Theorem (CLT) applies; the mini-batch approximations converge to Gaussian variables, i.e. Ũ(θ) → N(U(θ), v_U(θ)) and f̃(θ) → N(f(θ), V_f(θ)), with variances v_U(θ) and V_f(θ). As random variables, Ũ(θ) and f̃(θ) inevitably inject noise into the system dynamics. We incorporate a set of independent Nosé-Hoover thermostats [18, 11] – apparatuses originally devised for temperature stabilisation in molecular dynamics simulations – to adaptively cancel the effect of noise.
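The unbiasedness of the rescaled mini-batch estimators above, together with the CLT-induced noise they inject, can be checked empirically. The sketch below uses a toy Gaussian model with a flat prior (an assumption for illustration, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: log-likelihood per point l(theta; x) = -(x - theta)^2 / 2,
# so the full-batch force is f(theta) = sum_i (x_i - theta) under a flat prior.
data = rng.standard_normal(10_000) + 2.0          # |D| = 10000, true mean ~2
theta = 0.0

def full_force(theta):
    return np.sum(data - theta)

def minibatch_force(theta, batch_size=100):
    # Unbiased estimator: rescale the mini-batch sum by |D| / |S|.
    batch = rng.choice(data, size=batch_size, replace=False)
    return len(data) / batch_size * np.sum(batch - theta)

estimates = np.array([minibatch_force(theta) for _ in range(2_000)])
# The estimates centre on the full-batch force but fluctuate around it;
# by the CLT their distribution is approximately Gaussian.
```

It is exactly this Gaussian fluctuation, with an unknown variance, that the Nosé-Hoover thermostats are introduced to absorb.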
The system dynamics with the augmentation of thermostats – which we call the Nosé-Hoover dynamics – is formulated as

\[
\frac{d\theta}{dt} = M_\theta^{-1}p_\theta, \qquad
\frac{dp_\theta}{dt} = \lambda(\xi)\tilde{f}(\theta) - \lambda^2(\xi)S_\theta p_\theta,
\]
\[
\frac{d\xi}{dt} = \frac{p_\xi}{m_\xi}, \qquad
\frac{dp_\xi}{dt} = -\lambda'(\xi)\tilde{U}(\theta) - W'(\xi) - [\lambda'(\xi)]^2 s_\xi p_\xi,
\tag{5}
\]
\[
\frac{ds_\theta^{\langle i,j\rangle}}{dt} = \frac{\lambda^2(\xi)}{\kappa_\theta^{\langle i,j\rangle}}\bigg[\frac{p_{\theta_i}p_{\theta_j}}{m_{\theta_i}} - T\delta_{ij}\bigg], \qquad
\frac{ds_\xi}{dt} = \frac{[\lambda'(\xi)]^2}{\kappa_\xi}\bigg[\frac{p_\xi^2}{m_\xi} - T\bigg],
\]

where S_θ and s_ξ denote the Nosé-Hoover thermostats coupled with θ and ξ. Specifically, S_θ = [s_θ^⟨i,j⟩] is a D × D matrix whose (i, j)-th element s_θ^⟨i,j⟩ depends on the multiplicative term p_{θ_i}p_{θ_j}/m_{θ_i}; κ_θ^⟨i,j⟩ and κ_ξ are constants denoting the "thermal inertia" corresponding to s_θ^⟨i,j⟩ and s_ξ, respectively.

Intuitively, the thermostats S_θ and s_ξ act as negative feedback controllers on the momenta p_θ and p_ξ. Consider the dynamics of s_ξ in Eq. (5): when p_ξ²/m_ξ exceeds the reference T, the thermostat s_ξ will increase, leading to a greater friction −s_ξ p_ξ in updating p_ξ; the friction in turn reduces the magnitude of p_ξ, resulting in a decrease in the value of p_ξ²/m_ξ. The negative feedback loop is thus established. With the help of the thermostats, the noise injected into the system can be adaptively neutralised.

We define the diffusion coefficients b_U(θ) := v_U(θ) dt/2 and B_f(θ) = [b_f^⟨i,j⟩(θ)] := V_f(θ) dt/2, such that the variances v_U(θ) and V_f(θ) of the mini-batch approximations evaluated at each of the discrete iterations can be embedded in the Fokker-Planck equation (FPE) [20] established in continuous time. The FPE translates the microscopic motion of particles, formulated by SDEs, into the macroscopic time evolution of the state distribution in the form of PDEs. Leveraging the FPE, we establish the following theorem to characterise the invariant distribution:

Theorem 1. The system governed by the dynamics in Eq. (5) has the invariant distribution

\[
\pi(\Gamma,S_\theta,s_\xi) \propto \exp\Bigg\{-\frac{1}{T}\Bigg[H(\Gamma) + \sum_{i,j}\frac{\kappa_\theta^{\langle i,j\rangle}}{2}\bigg(s_\theta^{\langle i,j\rangle} - \frac{b_f^{\langle i,j\rangle}(\theta)}{m_{\theta_j}T}\bigg)^2 + \frac{\kappa_\xi}{2}\bigg(s_\xi - \frac{b_U(\theta)}{m_\xi T}\bigg)^2\Bigg]\Bigg\},
\tag{6}
\]

where Γ = (θ, ξ, p_θ, p_ξ) denotes the extended state as presented in Eq. (4).

Proof.
Recall the FPE in its vector form [20]:

\[
\frac{\partial}{\partial t}\pi(x,t) = -\frac{\partial}{\partial x}\cdot\big[\mu_x(x,t)\,\pi(x,t)\big] + \bigg[\frac{\partial}{\partial x}\frac{\partial^\top}{\partial x}\bigg]\cdot\big[B_x(x,t)\,\pi(x,t)\big],
\tag{7}
\]

where x = vec(Γ, S_θ, s_ξ) denotes the vectorisation of the collection of all variables defined in Eq. (6), μ_x and B_x represent the drift and diffusion terms associated with the dynamics in Eq. (5), respectively, and the dot operator · denotes summation after element-wise multiplication.

We substitute the corresponding elements of Eq. (5) into the drift and diffusion of the FPE in Eq. (7). As we presume that the introduced thermostats are mutually independent, the invariant distribution can be factorised into marginals as π(x) = π_Γ π_{s_ξ} ∏_{i,j} π_{s_θ^⟨i,j⟩}. It is straightforward to verify that the deterministic parts depending only on Γ cancel exactly with each other. The remnants are the stochastic parts as well as the deterministic parts that depend on the thermostats S_θ and s_ξ, which can be formulated as

\[
\frac{\partial}{\partial t}\pi(x,t) =
\frac{\partial}{\partial p_\theta}\cdot\big[\lambda^2(\xi)S_\theta p_\theta\,\pi\big]
+ \frac{\partial}{\partial p_\xi}\big[[\lambda'(\xi)]^2 s_\xi p_\xi\,\pi\big]
- \sum_{i,j}\frac{\partial}{\partial s_\theta^{\langle i,j\rangle}}\Bigg[\frac{\lambda^2(\xi)}{\kappa_\theta^{\langle i,j\rangle}}\bigg[\frac{p_{\theta_i}p_{\theta_j}}{m_{\theta_i}} - T\delta_{ij}\bigg]\pi\Bigg]
\]
\[
- \frac{\partial}{\partial s_\xi}\Bigg[\frac{[\lambda'(\xi)]^2}{\kappa_\xi}\bigg[\frac{p_\xi^2}{m_\xi} - T\bigg]\pi\Bigg]
+ \bigg[\frac{\partial}{\partial p_\theta}\frac{\partial^\top}{\partial p_\theta}\bigg]\cdot\big[\lambda^2(\xi)B_f(\theta)\,\pi\big]
+ \frac{\partial^2}{\partial p_\xi^2}\big[[\lambda'(\xi)]^2 b_U(\theta)\,\pi\big].
\tag{8}
\]

We solve for the invariant distribution π(x) by equating Eq. (8) to zero. The resulting formulae for the marginals π_{s_ξ} and π_{s_θ^⟨i,j⟩} are obtained under the assumption of factorisation, in the form of

\[
\frac{1}{\pi_{s_\xi}}\frac{\partial\pi_{s_\xi}}{\partial s_\xi} = -\frac{\kappa_\xi}{T}\bigg[s_\xi - \frac{b_U(\theta)}{m_\xi T}\bigg]
\quad\text{and}\quad
\frac{1}{\pi_{s_\theta^{\langle i,j\rangle}}}\frac{\partial\pi_{s_\theta^{\langle i,j\rangle}}}{\partial s_\theta^{\langle i,j\rangle}} = -\frac{\kappa_\theta^{\langle i,j\rangle}}{T}\bigg[s_\theta^{\langle i,j\rangle} - \frac{b_f^{\langle i,j\rangle}(\theta)}{m_{\theta_j}T}\bigg].
\tag{9}
\]

The solutions to Eq. (9) are clear: both π_{s_ξ} and π_{s_θ^⟨i,j⟩} are Gaussian distributions determined uniquely by the coefficients. The marginals π_{s_ξ} and π_{s_θ^⟨i,j⟩}, along with the canonical distribution π_Γ w.r.t. H(Γ), constitute the invariant distribution defined in Eq. (6). ∎

Theorem 1 states that, when the system reaches equilibrium, the system state is distributed as Eq. (6), and the mini-batch noise is absorbed from the system dynamics in Eq. (5) into the thermostats. Thus, we can marginalise out both S_θ and s_ξ to drop the noise, and then obtain the canonical distribution in Eq. (3). As we seek to recover the target posterior from the canonical distribution, we can assign a specific value to the tempering variable, ξ = ξ*, such that the effective temperature of the original system is held fixed at unity, T/λ(ξ*) = 1. Hence, the posterior ρ(θ|D) equals the marginal distribution w.r.t. θ given ξ* satisfying λ(ξ*) = T, which is obtained by marginalising p_θ and p_ξ over the canonical distribution as follows:

\[
\pi(\theta|\xi^*) = \int_{p_\theta,p_\xi}\pi(\Gamma|\xi^*)
= \frac{\int_{p_\theta,p_\xi} e^{-H(\Gamma|\xi^*)/T}}{\int_{\Gamma\setminus\xi} e^{-H(\Gamma|\xi^*)/T}}
= \frac{e^{-U(\theta)}}{\int_\theta e^{-U(\theta)}}
= \frac{1}{Z_\theta(T)}e^{-U(\theta)} = \rho(\theta|\mathcal{D}),
\]

where H(Γ|ξ*) = λ(ξ*)U(θ) + W(ξ*) + p_θ⊤M_θ⁻¹p_θ/2 + p_ξ²/2m_ξ represents the extended Hamiltonian conditioned on the tempering variable ξ = ξ*, when λ(ξ*) = T holds.

3.2 Tempering enhancement via adaptive biasing force

A necessary condition for the tempering scheme to function well is that the tempering variable ξ can properly explore the majority of the domain of the coupling function, dom λ(ξ); this ensures the expected variation of the effective temperature during sampling. For complex systems, however, it is often the case that the tempering variable is subject to a strong instantaneous force that prevents ξ from properly exploring dom λ(ξ) and therefore hinders the efficiency of tempering. The adaptive biasing force (ABF) algorithm [3] has emerged as a promising solution to this problem ever since its inception [4], where it was introduced to address the fast calculation of the free energy of complex chemical or biochemical systems. Intuitively, ABF maintains and updates an estimate of the average force, i.e. the average of the instantaneous force exerted on the target variable.
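This intuition can be conveyed with a minimal single-bin sketch (names and constants are hypothetical, and the binning over ξ is omitted here for brevity): a running mean of the instantaneous force is accumulated online and then subtracted, leaving only zero-mean fluctuations.

```python
import numpy as np

rng = np.random.default_rng(1)

mean_force = 3.0          # a strong systematic force on the tempering variable
n_steps = 10_000

abf_estimate = 0.0        # running estimate of the average force
residuals = []
for k in range(1, n_steps + 1):
    force = mean_force + rng.standard_normal()     # noisy instantaneous force
    # Online mean update, analogous to abf[j] <- (1 - 1/k) abf[j] + (1/k) force
    abf_estimate += (force - abf_estimate) / k
    # Applying the estimate in the opposite direction cancels the systematic
    # part, leaving approximately zero-mean fluctuations.
    residuals.append(force - abf_estimate)
```

After the counteracting force is applied, the variable is driven only by the residual fluctuations and therefore undergoes a random walk over its domain.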
It then applies the estimate to the target variable in the opposite direction to counteract the instantaneous force and reduce it to small zero-mean fluctuations, so that the variable undergoes a random walk.

Algorithm 1 Thermostat-assisted continuously-tempered Hamiltonian Monte Carlo
Input: stepsizes η_θ, η_ξ; thermal inertias γ_θ, γ_ξ; levels of injected noise c_θ, c_ξ; number of steps K per unit interval
 1: r_θ ∼ N(0, η_θ I) and r_ξ ∼ N(0, η_ξ); (z_θ, z_ξ) ← (c_θ, c_ξ)
 2: INITIALISE( θ, ξ, abf, samples )
 3: for k = 1, 2, 3, . . . do
 4:   λ ← LAMBDA( ξ ); δλ ← LAMBDADERIVATIVE( ξ )
 5:   S ← NEXTBATCH( D, k ); Ũ ← MODELFORWARD( θ, S ); f̃ ← MODELBACKWARD( θ, S )
 6:   δA ← abf[ ABFINDEXING( ξ ) ]; ABFUPDATE( abf, ξ, δλ, Ũ, k )
 7:   z_θ ← z_θ + λ² [ r_θ⊤r_θ / dim(r_θ) − η_θ ] / γ_θ
 8:   z_ξ ← z_ξ + δλ² [ r_ξ² − η_ξ ] / γ_ξ
 9:   r_θ ← r_θ + λ [ η_θ f̃ + N(0, 2c_θ η_θ I) ] − λ² z_θ r_θ
10:   r_ξ ← r_ξ − δλ [ η_ξ Ũ + N(0, 2c_ξ η_ξ) ] − δλ² z_ξ r_ξ + η_ξ δA
11:   θ ← θ + r_θ; ξ ← ξ + r_ξ
12:   if ISINSIDEWELL( ξ ) = false then    ▷ ξ is restricted by the well of infinite height.
13:     r_ξ ← −r_ξ; ξ ← ξ + r_ξ           ▷ ξ bounces back when hitting the wall.
14:   if k = 0 mod K and λ = 1 then
15:     APPEND( samples, θ )               ▷ θ is collected as a new sample in samples.
16:     r_θ ∼ N(0, η_θ I) and r_ξ ∼ N(0, η_ξ)    ▷ r_θ, r_ξ are optionally resampled.
17: function ABFUPDATE( abf, ξ, δλ, Ũ, k )
18:   j ← ABFINDEXING( ξ )                 ▷ ξ is mapped to the index j of the associated bin.
19:   abf[ j ] ← [1 − 1/k] abf[ j ] + [1/k] δλ · Ũ

Formally, the free energy w.r.t. ξ is defined by convention in the form of

\[
A(\xi) = -T\log\pi(\xi) + \text{const}, \quad\text{where } \pi(\xi) = \int_{\Gamma\setminus\xi}\pi(\Gamma),
\]

with the extended state Γ = (θ, ξ, p_θ, p_ξ). The equation of p_ξ in Eq.
(5) is then augmented with the derivative of A(ξ) such that

\[
\frac{dp_\xi}{dt} = -\lambda'(\xi)\tilde{U}(\theta) - W'(\xi) + A'(\xi) - [\lambda'(\xi)]^2 s_\xi p_\xi,
\tag{10}
\]

where A'(ξ) is referred to as the adaptive biasing force induced by the free energy:

\[
A'(\xi) = -\frac{T}{\pi(\xi)}\frac{d\pi}{d\xi}
= \frac{\int_{\Gamma\setminus\xi}\big[\frac{\partial H}{\partial\xi}\big]e^{-H(\Gamma)/T}}{\int_{\Gamma\setminus\xi}e^{-H(\Gamma)/T}}
= \bigg\langle\frac{\partial H}{\partial\xi}\,\bigg|\,\xi\bigg\rangle.
\tag{11}
\]

The brackets ⟨·|ξ⟩ denote the conditional average, i.e. the average over the canonical distribution π(Γ) with ξ held fixed. A'(ξ) is the average of the reversed instantaneous force on ξ. It is proved [14] that ABF converges to an equilibrium at which the free energy landscape of ξ is flattened, even though the augmentation in Eq. (10) alters the equations of motion originally defined in Eq. (5).

3.3 Implementation

As proved in Theorem 1, the dynamics in Eq. (5) is capable of preserving the correct distribution in the presence of noise. In principle, it requires the thermostat S_θ to be of size D² for the D-dimensional parameter θ; however, this storage is unaffordable for complex models in high dimensions. A plausible option to mitigate this issue is to assume homogeneous θ and isotropic Gaussian noise, such that the mass M_θ = m_θ I and the variance V_f(θ) = v_f(θ) I; this simplifies the high-dimensional S_θ to a scalar s_θ.

The confining potential W(ξ), which determines the range of the tempering variable ξ, is implemented as a well of infinite height. When colliding with the boundary of W(ξ), ξ bounces back elastically with its velocity reversed. Euler's method is then applied, such that dt → Δt.

In Eq. (11), the calculation of A'(ξ) involves the ensemble average ⟨∂H/∂ξ|ξ⟩ and is hence intractable. Here we instead calculate the time average (1/k)Σ_k ∂H/∂ξ|_{ξ_k}, which is equivalent to the ensemble average in the long-time limit under the assumption of ergodicity; it can be readily calculated in a recurrent form during sampling. To maintain the runtime estimates of A'(ξ), the range of ξ is divided uniformly into J bins of equal length, with memory initialised in each bin. At each time step k, ABF determines the index j of the bin in which the tempering variable ξ = ξ_k is currently located, and then updates the time average using the record in memory and the current force ∂H/∂ξ|_{ξ_k} evaluated at ξ_k.

With all components assembled, we establish the TACT-HMC algorithm as Algorithm 1, with

\[
r_\theta = \frac{p_\theta\,\Delta t}{m_\theta},\quad
r_\xi = \frac{p_\xi\,\Delta t}{m_\xi},\quad
z_\theta = s_\theta\,\Delta t,\quad
z_\xi = s_\xi\,\Delta t,\quad
\eta_\theta = \frac{\Delta t^2}{m_\theta},\quad
\eta_\xi = \frac{\Delta t^2}{m_\xi},\quad
\gamma_\theta = \frac{\kappa_\theta}{m_\theta D},\quad
\gamma_\xi = \frac{\kappa_\xi}{m_\xi}
\]

applied as the change of variables for convenience of implementation. Furthermore, we introduce additional Gaussian noises N(0, 2c_ξ η_ξ) and N(0, 2c_θ η_θ I) in the momenta updates to improve ergodicity.

Figure 1: Experiment on sampling a 1d synthetic distribution. (a) Histograms of samples generated by TACT-HMC and the ablated alternatives, with the target shown in blue. (b) I: Sampling trajectory of TACT-HMC, demonstrating a robust mixing property; II: Cumulative averages of the thermostats, indicating fast convergence to the theoretical reference values drawn as red lines; III: Histograms of the sampled thermostats, showing a good fit to the theoretical distributions (blue curves); IV: Autocorrelation plot of the samples; the decrease of the autocorrelation is comparably fast; V: (A snapshot of) the variation of the effective system temperature during simulation, with the standard reference of unit temperature marked by the red line.

4 Related work

Since the inception of stochastic gradient Langevin dynamics (SGLD) [24], algorithms
originating from stochastic approximation [21] have received increasing attention for tasks of Bayesian learning. By adding the right amount of noise to the updates of stochastic gradient descent (SGD), SGLD manages to properly sample the posterior in a random-walk fashion akin to the full-batch Metropolis-adjusted Langevin algorithm (MALA) [22]. To enable the Hamiltonian dynamics for efficient state space exploration, Chen et al. [2] extended the mechanism designed for SGLD to HMC, and proposed the stochastic gradient Hamiltonian Monte Carlo (SGHMC). As the stochastic gradient is shown to drive the Hamiltonian dynamics to deviate, SGHMC estimates the unknown noise of the stochastic gradient with the Fisher information matrix, and then compensates for the estimated noise by augmenting the Hamiltonian dynamics with an additive friction derived from the estimated Fisher matrix. It turns out that this friction can be linked to the momentum term within a class of accelerated gradient-based methods [19, 17, 23] in optimisation. Shortly after SGHMC, Ding et al. [5] came up with the idea of incorporating the Nosé-Hoover thermostat [18, 11] into the Hamiltonian dynamics in place of the constant friction in SGHMC, and hence developed the stochastic gradient Nosé-Hoover thermostat (SGNHT). The thermostat in SGNHT serves as an adaptive friction which adaptively neutralises the mini-batch noise that the stochastic gradient injects into the system [12].

Parallel to the aforementioned studies, recent advances in the development of continuous tempering [8, 15] as well as its applications in machine learning [25, 9] are of particular interest. Ye et al. [25] proposed the continuously tempered Langevin dynamics (CTLD), which leverages the mechanism of continuous tempering and embeds the tempering variable in an extended stochastic gradient second-order Langevin dynamics. CTLD facilitates exploration of rugged landscapes of objective functions, locating the "good" wide valleys on the landscape and preventing early trapping in the "bad" narrow local minima. Nevertheless, CTLD is designed as an initialiser for training deep neural networks; it serves as an enhancement of the subsequent gradient-based optimisers rather than a Bayesian solution. From the Bayesian perspective, Graham et al. [9] developed the continuously-tempered Hamiltonian Monte Carlo (CTHMC), which operates in a full-batch setting. CTHMC augments the Hamiltonian system with an extra continuously-varying control variate borrowed from the scheme of continuous tempering, which enables the extended Hamiltonian dynamics to bridge between sampling a complex multimodal target posterior and a simpler unimodal base distribution. Albeit beneficial for mixing, its dynamics lacks the ability to handle the mini-batch noise, and it thus fails to function properly with mini-batches.

5 Experiment

Two sets of experiments are carried out. We first conduct an ablation study with synthetic distributions, where we visualise the system dynamics and validate the efficacy of TACT-HMC. We then evaluate the performance of our method against several strong baselines on three real datasets.
Histograms of sampled thermostats-3-2-1012345s0.00.10.20.30.40.5Pr(s)051015202530Lag k0.000.250.500.751.00(k)IV. Autocorrelation plot050100150200250300Time 051015kBT()V. Variation of temperature\fFigure 2: Experiments on sampling two 2d synthetic distributions. Left: The distributions to sample;\nMid-left: Histograms sampled by TACT-HMC; Mid-right: Histograms by the well-tempered sampler\nwithout thermostatting; Right: Histograms by the thermostat-assisted sampler without tempering.\n\n5.1 Multimodal sampling of synthetic distributions\nWe run TACT-HMC on three 1d/2d synthetic distributions. In the meantime, two ablated alternatives\nare initiated in parallel with the same setting: one is equipped with thermostats but without tempering\nfor sampling acceleration, the other is well-tempered but without thermostatting against noise. The\ndistributions are synthesised to contain multiple distant modes; the calculation of gradient is perturbed\nby Gaussian noise that is unknown to all samplers.\nFigure 1 summarises the result of sampling a mixture of three 1d Gaussians. As the \ufb01gure indicates,\nonly TACT-HMC is capable of correctly sampling from the target. The sampler without thermostatting\nis heavily in\ufb02uenced by the noise in gradient, resulting in a spread histogram; while the one without\ntempering gets trapped by those energy barriers and hence fails to explore the entire space of system\ncon\ufb01gurations. The sampling trajectory and properties of TACT-HMC are illustrated in details in Fig.\n1b, which justi\ufb01es the correctness of TACT-HMC. The autocorrelation of samples \u03c1(k) is calculated\nand shown in Fig. 1b(IV), which decreases monotonically from \u03c1(0) = 1 down to \u03c1(\u221e) \u2192 0+. 
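The stabilising effect of thermostatting in this ablation can be reproduced in miniature. Below is a minimal SGNHT-style sketch in the spirit of Ding et al. [5] (not the full TACT-HMC dynamics, which additionally carries the tempering variable): the target is a toy N(0, 1) distribution, and the gradient is corrupted by noise of a magnitude unknown to the sampler. The step size, noise level, burn-in length, and the function `noisy_grad` are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(theta, noise_std=3.0):
    # gradient of U(theta) = theta^2 / 2 for a standard-normal target,
    # corrupted by Gaussian noise to mimic a mini-batch estimate
    return theta + noise_std * rng.standard_normal()

h, A = 0.01, 0.1            # step size and injected-noise level
theta, p, xi = 0.0, 0.0, A  # position, momentum, thermostat variable
samples = []
for t in range(200_000):
    p += -xi * p * h - noisy_grad(theta) * h \
         + np.sqrt(2.0 * A * h) * rng.standard_normal()
    theta += p * h
    xi += (p * p - 1.0) * h  # thermostat: drives mean kinetic energy to kT = 1
    if t >= 20_000:          # discard burn-in
        samples.append(theta)

samples = np.asarray(samples)
# despite the unknown gradient noise, the adaptive friction xi absorbs it,
# keeping the empirical moments close to those of N(0, 1)
print(samples.mean(), samples.var(), xi)
```

Freezing xi at a constant (i.e. disabling the thermostat update) would leave the extra gradient noise uncompensated and inflate the sample variance, mirroring the over-spread histogram of the non-thermostatted ablation.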
From these autocorrelations, the effective sample size (ESS) can be readily evaluated through the formula

    ESS = n / (1 + 2 ∑_{k=1}^{∞} ρ(k)),

with ρ(k) as the autocorrelation at lag k. The ESS for TACT-HMC in this 1d Gaussian mixture case is 2.1096 × 10⁴ out of n = 10⁵ samples, which is roughly 60.2% of the value for SGHMC and 50.9% of that for SGNHT. We believe that the non-linear interaction between the parameter of interest θ and the tempering variable ξ via the multiplicative term λ(ξ)U(θ) results in a longer autocorrelation time and hence a lower ESS value.
We also investigate the variation of the effective system temperature during sampling. A snapshot of the temperature trajectory is shown in Fig. 1b(V): the temperature constantly oscillates between higher and lower values, and occasionally returns to unity.
We further conduct two 2d sampling experiments, as shown in Fig. 2. Comparing the columns, we find that TACT-HMC recovers the multiple modes of both distributions while neutralising the influence of the gradient noise; the samples drawn by the ablated alternatives, however, are impaired either by the gradient noise or by the energy barriers, as in the 1d scenario.

5.2 Bayesian learning on real datasets
Stepping out of the study on synthetic cases, we move on to image classification on three real datasets: EMNIST³, Fashion-MNIST⁴ and CIFAR-10.
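The ESS formula above can be computed directly from a chain of samples. The sketch below truncates the infinite sum at the first non-positive autocorrelation, which is a common practical convention rather than the paper's stated procedure; the function name and the AR(1) test chain are ours.

```python
import numpy as np

def ess(x, max_lag=1000):
    """Effective sample size: ESS = n / (1 + 2 * sum_k rho(k)),
    truncating the sum at the first non-positive autocorrelation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    tau = 1.0
    for k in range(1, min(max_lag, n - 1)):
        rho_k = np.dot(xc[:-k], xc[k:]) / denom  # autocorrelation at lag k
        if rho_k <= 0.0:                         # truncate the noisy tail
            break
        tau += 2.0 * rho_k
    return n / tau

# i.i.d. draws: ESS should be close to n
iid = np.random.default_rng(0).standard_normal(10_000)
# a strongly correlated AR(1) chain: ESS should be much smaller than n
ar = np.empty(10_000)
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + np.sqrt(1.0 - 0.9**2) * iid[t]
print(ess(iid), ess(ar))
```

For the AR(1) chain with coefficient 0.9, the integrated autocorrelation time is (1 + 0.9)/(1 − 0.9) = 19, so the ESS should land near n/19.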
The performance is evaluated and compared in terms of classification accuracy on three types of neural networks: multilayer perceptrons (MLPs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs). Two recent samplers are chosen as part of the baselines, namely SGNHT [5] and SGHMC [2]; besides, two widely-used gradient-based optimisers, Adam [13] and momentum SGD [23], are compared. Each method runs for 1000 epochs of either sampling or training before evaluation and comparison. We further apply random permutation to a certain percentage (0%, 20%, and 30%) of the training labels at the beginning of each epoch to demonstrate the robustness of our method.

3 https://www.nist.gov/itl/iad/image-group/emnist-dataset
4 https://github.com/zalandoresearch/fashion-mnist

Table 1: Result of Bayesian learning experiments on real datasets

                       MLP on EMNIST               RNN on Fashion-MNIST        CNN on CIFAR-10
% permuted labels      0%       20%      30%       0%       20%      30%       0%       20%      30%
Adam [13]              83.39%   80.27%   80.63%    88.84%   88.35%   88.25%    69.53%   72.39%   71.05%
momentum SGD [23]      83.95%   82.64%   81.70%    88.66%   88.91%   88.34%    64.25%   65.09%   67.70%
SGHMC [2]              84.53%   82.62%   81.56%    90.25%   88.98%   88.49%    76.44%   73.87%   71.79%
SGNHT [5]              84.48%   82.63%   81.60%    90.18%   89.10%   88.58%    76.60%   73.86%   71.37%
TACT-HMC (Alg. 1)      84.85%   82.95%   81.77%    90.84%   89.61%   89.01%    78.93%   74.88%   73.22%

All four baselines are tuned to their best on each task; the setting of TACT-HMC is specified for each task in the corresponding subsection. For the baseline samplers, the classification accuracy is calculated via Monte Carlo integration over all sampled models; for the baseline optimisers, the performance is evaluated directly on the test sets after training. The results are summarised in Table 1.
EMNIST classification with MLP.
The MLP herein is a three-layer fully-connected neural network whose hidden layer consists of 100 neurons. EMNIST Balanced is selected as the dataset, where 47 categories of images are split into a training set of size 112,800 and a complementary test set of size 18,800. The batch size is fixed at 128 for all methods in both sampling and training. For readability, we introduce a 7-tuple [η_θ, η_ξ, c_θ, c_ξ, γ_θ, γ_ξ, K] as the specification to set up TACT-HMC (see Alg. 1). In this specification, [η_θ, c_θ, γ_θ] denote the step size, the level of the injected Gaussian noise and the thermal inertia, all w.r.t. the parameter of interest θ; similarly, [η_ξ, c_ξ, γ_ξ] represent the corresponding quantities for the tempering variable ξ; K defines the number of steps in simulating a unit interval. In this experiment, TACT-HMC is configured as [0.0015, 0.0015, 0.05, 0.05, 1.0, 1.0, 50].
Fashion-MNIST classification with RNN. The RNN contains an LSTM layer [10] as its first layer, with input/output dimensions of 28/128. It takes a 28 × 28 image as input, scanning it vertically one row at a time. After 28 steps of scanning, the LSTM feeds a representative vector of length 128 into a ReLU activation, which is followed by a dense layer of size 64 with ReLU activation. The prediction over 10 categories is generated through a softmax activation in the output layer. The batch size is fixed at 64 for all methods in comparison. The specification of TACT-HMC in this experiment is [0.0012, 0.0012, 0.15, 0.15, 1.0, 1.0, 50].
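The 7-tuple specification above can be captured in a small typed container; the class and field names below are our own, not taken from the paper's code, and the two instances reproduce the specifications quoted in this subsection.

```python
from typing import NamedTuple

class TactHmcSpec(NamedTuple):
    """Hypothetical container for the 7-tuple
    [eta_theta, eta_xi, c_theta, c_xi, gamma_theta, gamma_xi, K]."""
    eta_theta: float    # step size for the parameter of interest theta
    eta_xi: float       # step size for the tempering variable xi
    c_theta: float      # injected Gaussian noise level for theta
    c_xi: float         # injected Gaussian noise level for xi
    gamma_theta: float  # thermal inertia for theta
    gamma_xi: float     # thermal inertia for xi
    K: int              # number of steps per simulated unit interval

# the specifications quoted for the MLP and RNN experiments
emnist_mlp = TactHmcSpec(0.0015, 0.0015, 0.05, 0.05, 1.0, 1.0, 50)
fashion_rnn = TactHmcSpec(0.0012, 0.0012, 0.15, 0.15, 1.0, 1.0, 50)
print(emnist_mlp)
print(fashion_rnn)
```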
The CNN comprises four learnable layers: from bottom to top, a 2d convolutional layer with a kernel of size 3×3×3×16, another 2d convolutional layer with a kernel of size 3×3×16×16, and then two dense layers of size 100 and 10. ReLU activations are inserted between the learnable layers. For each convolutional layer, the stride is set to 1 × 1, and a pooling layer with 2 × 2 stride is appended after the ReLU activation. A softmax function generates the final prediction over 10 categories. The batch size is fixed at 64 for all methods. Here, TACT-HMC's specification is set as [0.0010, 0.0010, 0.10, 0.10, 1.0, 1.0, 50].
Discussion. As summarised in Table 1, TACT-HMC outperforms all four baselines in classification accuracy. Specifically, TACT-HMC demonstrates clear advantages on complicated tasks, e.g. the CIFAR-10 classification with CNN, where the model has relatively higher complexity and the dataset contains multiple channels. For the RNN task, our method outperforms the others by roughly 0.5% in accuracy. The performance gain on the MLP task is rather limited; we believe this is because the complexities of both the model and the dataset are moderate. When the random permutation is applied to a larger portion of the training labels, TACT-HMC still maintains robust classification accuracy, even though the landscape of the objective function becomes rougher and the system dynamics gathers more noise.

6 Conclusion
We developed a new sampling method, called the thermostat-assisted continuously-tempered Hamiltonian Monte Carlo, to facilitate Bayesian learning with large datasets and multimodal posterior distributions.
The method builds a well-tempered Hamiltonian system by incorporating the scheme of continuous tempering into the system for classic HMC, and then simulates the dynamics augmented by Nosé-Hoover thermostats. This sampler is designed for two substantial demands: first, to efficiently generate representative i.i.d. samples from complex multimodal distributions; second, to adaptively neutralise the noise arising from mini-batches. Extensive experiments have been carried out on both synthetic distributions and real-world applications. The results validated the efficacy of tempering and thermostatting, demonstrating the great potential of our sampler in accelerating deep Bayesian learning.

References
[1] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.
[2] Tianqi Chen, Emily B Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML, pages 1683–1691, 2014.
[3] Jeffrey Comer, James C Gumbart, Jérôme Hénin, Tony Lelièvre, Andrew Pohorille, and Christophe Chipot. The adaptive biasing force method: Everything you always wanted to know but were afraid to ask. The Journal of Physical Chemistry B, 119(3):1129–1151, 2014.
[4] Eric Darve and Andrew Pohorille. Calculating free energies using average force. The Journal of Chemical Physics, 115(20):9169–9183, 2001.
[5] Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, 2014.
[6] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
[7] Farhan Feroz and MP Hobson. Multimodal nested sampling: an efficient and robust alternative to Markov chain Monte Carlo methods for astronomical data analyses.
Monthly Notices of the Royal Astronomical Society, 384(2):449–463, 2008.
[8] Gianpaolo Gobbo and Benedict J Leimkuhler. Extended Hamiltonian approach to continuous tempering. Physical Review E, 91(6):061301, 2015.
[9] Matthew M. Graham and Amos J. Storkey. Continuously tempered Hamiltonian Monte Carlo. In Gal Elidan, Kristian Kersting, and Alexander T. Ihler, editors, Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017. AUAI Press, 2017.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] William G Hoover. Canonical dynamics: equilibrium phase-space distributions. Physical Review A, 31(3):1695, 1985.
[12] Andrew Jones and Ben Leimkuhler. Adaptive stochastic methods for sampling driven molecular systems. The Journal of Chemical Physics, 135(8):084125, 2011.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[14] Tony Lelièvre, Mathias Rousset, and Gabriel Stoltz. Long-time convergence of an adaptive biasing force method. Nonlinearity, 21(6):1155, 2008.
[15] Nicolas Lenner and Gerald Mathias. Continuous tempering molecular dynamics: A deterministic approach to simulated tempering. Journal of Chemical Theory and Computation, 12(2):486–498, 2016.
[16] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162, 2011.
[17] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
[18] Shuichi Nosé. A unified formulation of the constant temperature molecular dynamics methods. The Journal of Chemical Physics, 81(1):511–519, 1984.
[19] Boris T Polyak. Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[20] H. Risken and H. Haken. The Fokker-Planck Equation: Methods of Solution and Applications, second edition. Springer, 1989.
[21] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[22] Gareth O Roberts, Richard L Tweedie, et al. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
[23] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
[24] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
[25] Nanyang Ye, Zhanxing Zhu, and Rafal Mantiuk. Langevin dynamics with continuous tempering for training deep neural networks. In Advances in Neural Information Processing Systems, pages 618–626, 2017.