{"title": "Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 2490, "page_last": 2498, "abstract": "We present a new approach to sample from generic binary distributions, based on an exact Hamiltonian Monte Carlo algorithm applied to a piecewise continuous augmentation of the binary distribution of interest. An extension of this idea to distributions over mixtures of binary and continuous variables allows us to sample from posteriors of linear and probit regression models with spike-and-slab priors and truncated parameters. We illustrate the advantages of these algorithms in several examples in which they outperform the Metropolis or Gibbs samplers.", "full_text": "Auxiliary-variable Exact Hamiltonian Monte\n\nCarlo Samplers for Binary Distributions\n\nAri Pakman and Liam Paninski\n\nDepartment of Statistics\n\nCenter for Theoretical Neuroscience\n\nGrossman Center for the Statistics of Mind\n\nColumbia University\nNew York, NY, 10027\n\nAbstract\n\nWe present a new approach to sample from generic binary distributions, based\non an exact Hamiltonian Monte Carlo algorithm applied to a piecewise continu-\nous augmentation of the binary distribution of interest. An extension of this idea to\ndistributions over mixtures of binary and possibly-truncated Gaussian or exponen-\ntial variables allows us to sample from posteriors of linear and probit regression\nmodels with spike-and-slab priors and truncated parameters. We illustrate the ad-\nvantages of these algorithms in several examples in which they outperform the\nMetropolis or Gibbs samplers.\n\n1\n\nIntroduction\n\nMapping a problem involving discrete variables into continuous variables often results in a more\ntractable formulation. 
For the case of probabilistic inference, in this paper we present a new ap-\nproach to sample from distributions over binary variables, based on mapping the original discrete\ndistribution into a continuous one with a piecewise quadratic log-likelihood, from which we can\nsample ef\ufb01ciently using exact Hamiltonian Monte Carlo (HMC).\nThe HMC method is a Markov Chain Monte Carlo algorithm that usually has better performance\nover Metropolis or Gibbs samplers, because it manages to propose transitions in the Markov chain\nwhich lie far apart in the sampling space, while maintaining a reasonable acceptance rate for these\nproposals. But the implementations of HMC algorithms generally involve the non-trivial tuning of\nnumerical integration parameters to obtain such a reasonable acceptance rate (see [1] for a review).\nThe algorithms we present in this work are special because the Hamiltonian equations of motion\ncan be integrated exactly, so there is no need for tuning a step-size parameter and the Markov chain\nalways accepts the proposed moves. Similar ideas have been used recently to sample from trun-\ncated Gaussian multivariate distributions [2], allowing much faster sampling than other methods.\nIt should be emphasized that despite the apparent complexity of deriving the new algorithms, their\nimplementation is very simple.\nSince the method we present transforms a binary sampling problem into a continuous one, it is natu-\nral to extend it to distributions de\ufb01ned over mixtures of binary and Gaussian or exponential variables,\ntransforming them into purely continuous distributions. Such a mixed binary-continuous problem\narises in Bayesian model selection with a spike-and-slab prior and we illustrate our technique by\nfocusing on this case. 
In particular, we show how to sample from the posterior of linear and probit regression models with spike-and-slab priors, while also imposing truncations in the parameter space (e.g., positivity).\nThe method we use to map binary to continuous variables consists in simply identifying a binary variable with the sign of a continuous one. An alternative relaxation of binary to continuous variables, known in statistical physics as the \u201cGaussian integral trick\u201d [3], has been used recently to apply HMC methods to binary distributions [4], but the details of that method are different than ours. In particular, the HMC in that work is not \u2018exact\u2019 in the sense used above and the algorithm only works for Markov random fields with Gaussian potentials.\n\n2 Binary distributions\n\nWe are interested in sampling from a probability distribution p(s) defined over d-dimensional binary vectors s \u2208 {\u22121, +1}^d, and given in terms of a function f(s) as\n\np(s) = (1/Z) f(s) .   (1)\n\nHere Z is a normalization factor, whose value will not be needed. Let us augment the distribution p(s) with continuous variables y \u2208 R^d as\n\np(s, y) = p(s) p(y|s) ,   (2)\n\nwhere p(y|s) is non-zero only in the orthant defined by\n\ns_i = sign(y_i) ,   i = 1, . . . , d.   (3)\n\nThe essence of the proposed method is that we can sample from p(s) by sampling y from\n\np(y) = \u2211_{s\u2032} p(s\u2032) p(y|s\u2032)   (4)\n     = p(s) p(y|s) ,   (5)\n\nand reading out the values of s from (3). In the second line we have made explicit that for each y, only one term in the sum in (4) is non-zero, so that p(y) is piecewise defined in each orthant.\nIn order to sample from p(y) using the exact HMC method of [2], we require log p(y|s) to be a quadratic function of y on its support. 
The idea is to define a potential energy function\n\nU(y) = \u2212 log p(y|s) \u2212 log f(s) ,   (6)\n\nintroduce momentum variables q_i, and consider the piecewise continuous Hamiltonian\n\nH(y, q) = U(y) + q\u00b7q/2 ,   (7)\n\nwhose value is identified with the energy of a particle moving in a d-dimensional space. Suppose the particle has initial coordinates y(0). In each iteration of the sampling algorithm, we sample initial values q(0) for the momenta from a standard Gaussian distribution and let the particle move during a time T according to the equations of motion\n\ny\u0307(t) = \u2202H/\u2202q(t) ,   q\u0307(t) = \u2212\u2202H/\u2202y(t) .   (8)\n\nThe final coordinates, y(T), belong to a Markov chain with invariant distribution p(y), and are used as the initial coordinates of the next iteration. The detailed balance condition follows directly from the conservation of energy and (y, q)-volume along the trajectory dictated by (8); see [1, 2] for details.\nThe restriction to quadratic functions of y in log p(y|s) allows us to solve the differential equations (8) exactly in each orthant. As the particle moves, the potential energy U(y) and the kinetic energy q\u00b7q/2 change in tandem, keeping the value of the Hamiltonian (7) constant. But this smooth interchange gets interrupted when any coordinate reaches zero. Suppose this first happens at time t_j for coordinate y_j, and assume that y_j < 0 for t < t_j. Conservation of energy now imposes a jump on the momentum q_j as a result of the discontinuity in U(y). Let us call q_j(t_j^\u2212) and q_j(t_j^+) the values of the momentum q_j just before and after the coordinate hits y_j = 0. In order to enforce conservation of energy, we equate the Hamiltonian at both sides of the y_j = 0 wall, giving\n\nq_j^2(t_j^+)/2 = \u2206_j + q_j^2(t_j^\u2212)/2 ,   (9)\n\nwith\n\n\u2206_j = U(y_j = 0, s_j = \u22121) \u2212 U(y_j = 0, s_j = +1) .   (10)\n\nIf eq. (9) gives a positive value for q_j^2(t_j^+), the coordinate y_j crosses the boundary and continues its trajectory in the new orthant. On the other hand, if eq. (9) gives a negative value for q_j^2(t_j^+), the particle is reflected from the y_j = 0 wall and continues its trajectory with q_j(t_j^+) = \u2212q_j(t_j^\u2212). When \u2206_j < 0, the situation can be understood as the limit of a scenario in which the particle faces an upward hill in the potential energy, causing it to diminish its velocity until it either reaches the top of the hill with a lower velocity or stops and then reverses. In the limit in which the hill has finite height but infinite slope, the velocity change occurs discontinuously at one instant. Note that we used in (9) that the momenta q_i with i \u2260 j are continuous, since this sudden infinite slope hill is only seen by the y_j coordinate.\nRegardless of whether the particle bounces or crosses the y_j = 0 wall, the other coordinates move unperturbed until the next boundary hit, where a similar crossing or reflection occurs, and so on, until the final position y(T).\nThe framework we presented above is very general and in order to implement a particular sampler we need to select the distributions p(y|s). Below we consider in some detail two particularly simple choices that illustrate the diversity of options here.\n\n2.1 Gaussian augmentation\nLet us consider first for p(y|s) the truncated Gaussians\n\np(y|s) = (2/\u03c0)^{d/2} e^{\u2212y\u00b7y/2}   for sign(y_i) = s_i, i = 1, . . . , d,\np(y|s) = 0   otherwise.   (11)\n\nThe equations of motion (8) lead to y\u0308(t) = \u2212y(t), q\u0308(t) = \u2212q(t), and have a solution\n\ny_i(t) = y_i(0) cos(t) + q_i(0) sin(t)   (12)\n       = u_i sin(\u03c6_i + t) ,   (13)\nq_i(t) = \u2212y_i(0) sin(t) + q_i(0) cos(t)   (14)\n       = u_i cos(\u03c6_i + t) .   (15)\n\nThis setting is similar to the case studied in [2] and from \u03c6_i = tan^{\u22121}(y_i(0)/q_i(0)) the boundary hit times t_i are easily obtained. When a boundary is reached, say y_j = 0, the coordinate y_j changes its trajectory for t > t_j as\n\ny_j(t) = q_j(t_j^+) sin(t \u2212 t_j) ,   (16)\n\nwith the value of q_j(t_j^+) obtained as described above.\nChoosing an appropriate value for the travel time T is crucial when using HMC algorithms [5]. As is clear from (13), if we let the particle travel during a time T > \u03c0, each coordinate reaches zero at least once, and the hitting times can be ordered as\n\n0 < t_{j_1} \u2264 t_{j_2} \u2264 \u00b7\u00b7\u00b7 \u2264 t_{j_d} < \u03c0 .   (17)\n\nMoreover, regardless of whether a coordinate crosses zero or gets reflected, it follows from (16) that the successive hits occur at\n\nt_i + n\u03c0,   n = 1, 2, . . .   (18)\n\nand therefore the hitting times only need to be computed once for each coordinate in every iteration.\nIf we let the particle move during a time T = n\u03c0, each coordinate reaches zero n times, in the cyclical order (17), with a computational cost of O(nd) from wall hits. But choosing precisely T = n\u03c0 is not recommended for the following reason. As we just showed, between y_j(0) and y_j(\u03c0) the coordinate touches the boundary y_j = 0 once, and if y_j gets reflected off the boundary, it is easy to check that we have y_j(\u03c0) = y_j(0). If we take T = n\u03c0 and the particle gets reflected all the n times it hits the boundary, we get y_j(T) = y_j(0) and the coordinate y_j does not move at all. 
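As a rough illustration of the wall-hit dynamics described so far, the following Python sketch implements one iteration of the sampler for the Gaussian augmentation. It is a minimal sketch under stated assumptions, not the authors' implementation: the names first_hit_time, momentum_after_wall and hmc_step, the user-supplied function log_f returning log f(s), and the simple event loop are all illustrative choices. The energy jump at a wall is taken as log f(flipped s) minus log f(s), which is eq. (10) written for a crossing in either direction, since the Gaussian factor is continuous at y_j = 0.

```python
import math
import random


def first_hit_time(y0, q0):
    # y(t) = y0*cos(t) + q0*sin(t) = u*sin(phi + t) with phi = atan2(y0, q0),
    # so the first zero at t > 0 is pi - phi for y0 > 0 and -phi for y0 < 0.
    phi = math.atan2(y0, q0)
    return math.pi - phi if phi > 0.0 else -phi


def momentum_after_wall(q_minus, delta):
    # Energy conservation at the wall: q_plus^2 / 2 = delta + q_minus^2 / 2.
    q_plus_sq = q_minus * q_minus + 2.0 * delta
    if q_plus_sq > 0.0:
        # Enough kinetic energy: cross the wall, keeping the direction of motion.
        return math.copysign(math.sqrt(q_plus_sq), q_minus), True
    # Otherwise the coordinate is reflected.
    return -q_minus, False


def hmc_step(y, log_f, travel_time, rng):
    # One iteration: fresh Gaussian momenta, exact motion for travel_time.
    # log_f is an assumed user-supplied function returning log f(s).
    d = len(y)
    s = [1 if v > 0.0 else -1 for v in y]
    y_ref = list(y)                      # position at the reference time
    q_ref = [rng.gauss(0.0, 1.0) for _ in range(d)]
    t_ref = [0.0] * d                    # reference time of each coordinate
    hit = [first_hit_time(y_ref[i], q_ref[i]) for i in range(d)]
    while True:
        j = min(range(d), key=hit.__getitem__)
        t = hit[j]
        if t >= travel_time:
            break
        dt = t - t_ref[j]
        q_minus = -y_ref[j] * math.sin(dt) + q_ref[j] * math.cos(dt)
        flipped = list(s)
        flipped[j] = -s[j]
        delta = log_f(flipped) - log_f(s)
        q_plus, crossed = momentum_after_wall(q_minus, delta)
        if crossed:
            s = flipped
        # Restart coordinate j from the wall; its next hit is pi later.
        y_ref[j], q_ref[j], t_ref[j] = 0.0, q_plus, t
        hit[j] = t + math.pi
    return [y_ref[i] * math.cos(travel_time - t_ref[i])
            + q_ref[i] * math.sin(travel_time - t_ref[i]) for i in range(d)]
```

Calling hmc_step repeatedly and reading out s as the signs of y gives a chain whose empirical moments can be compared, on small test distributions, with the values obtained by enumerating all states.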
To avoid these singular situations, a good choice is T = (n + 1/2)\u03c0, which generalizes the recommended travel time T = \u03c0/2 for truncated Gaussians in [2]. The value of n should be chosen for each distribution, but we expect optimal values for n to grow with d.\nWith T = (n + 1/2)\u03c0, the total cost of each sample is O((n + 1/2)d) on average from wall hits, plus O(d) from the sampling of q(0) and from the d inverse trigonometric functions to obtain the hit times t_i. But in complex distributions, the computational cost is dominated by the evaluation of \u2206_i in (10) at each wall hit.\nInterestingly, the rate at which wall y_i = 0 is crossed coincides with the acceptance rate in a Metropolis algorithm that samples uniformly a value for i and makes a proposal of flipping the binary variable s_i. See the Appendix for details. Of course, this does not mean that the HMC algorithm is the same as Metropolis, because in HMC the order in which the walls are hit is fixed given the initial velocity, and the values of q_i^2 at successive hits of y_i = 0 within the same iteration are not independent.\n\n2.2 Exponential and other augmentations\n\nAnother distribution that allows an exact solution of the equations of motion (8) is\n\np(y|s) = e^{\u2212\u2211_{i=1}^d |y_i|}   for sign(y_i) = s_i, i = 1, . . . , d,\np(y|s) = 0   otherwise,   (19)\n\nwhich leads to the equations of motion y\u0308_i(t) = \u2212s_i, with solutions of the form\n\ny_i(t) = y_i(0) + q_i(0) t \u2212 s_i t^2/2 .   (20)\n\nIn this case, the initial hit time for every coordinate is the solution of the quadratic equation y_i(t) = 0. 
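For concreteness, the positive root of that quadratic can be written in closed form. The short Python sketch below is only an illustration: the helper names exp_first_hit_time, exp_position and exp_time_to_next_hit are hypothetical, and s_i is taken as the sign of the starting coordinate.

```python
import math


def exp_first_hit_time(y0, q0):
    # Positive root of y0 + q0*t - s*t^2/2 = 0, with s = sign(y0).
    s = 1.0 if y0 > 0.0 else -1.0
    return s * q0 + math.sqrt(q0 * q0 + 2.0 * abs(y0))


def exp_position(y0, q0, t):
    # Trajectory inside one orthant under the exponential augmentation.
    s = 1.0 if y0 > 0.0 else -1.0
    return y0 + q0 * t - s * t * t / 2.0


def exp_time_to_next_hit(q_plus):
    # Starting from the wall with momentum q_plus, the trajectory
    # q_plus*t - sign(q_plus)*t^2/2 returns to zero after 2*|q_plus|.
    return 2.0 * abs(q_plus)
```

For example, a coordinate starting at y0 = 1 with q0 = 0.5 reaches the wall at t = 2, and one starting at y0 = -1 with q0 = 0.5 reaches it at t = 1.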
But, unlike the case of the Gaussian augmentation, the order of successive hits is not fixed. Indeed, if coordinate y_j hits zero at time t_j, it continues its trajectory as\n\ny_j(t > t_j) = q_j(t_j^+)(t \u2212 t_j) \u2212 (s_j/2)(t \u2212 t_j)^2 ,   (21)\n\nso the next wall hit y_j = 0 will occur at a time t\u2032_j given by\n\n(t\u2032_j \u2212 t_j) = 2|q_j(t_j^+)| ,   (22)\n\nwhere we used s_j = sign(q_j(t_j^+)). So we see that the time between successive hits of the same coordinate depends only on its momentum after the last hit. Moreover, since the value of |q_j(t^+)| is smaller than |q_j(t^\u2212)| if the coordinate crosses to an orthant of lower probability, equation (22) implies that the particle moves away faster from areas of lower probability. This is unlike the Gaussian augmentation, where a coordinate \u2018waits in line\u2019 until all the other coordinates touch their wall before hitting its wall again.\nThe two augmentations we considered above have only scratched the surface of interesting possibilities. One could also define p(y|s) as a uniform distribution in a box such that the computation of the times for wall hits would become purely linear and we get a classical \u2018billiards\u2019 dynamics. More generally, one could consider different augmentations in different orthants and potentially tailor the algorithm to mix faster in complex and multimodal distributions.\n\n3 Spike-and-slab regression with truncated parameters\n\nThe subject of Bayesian sparse regression has seen a lot of work during the last decade. Along with priors such as the Bayesian Lasso [6] and the Horseshoe [7], the classic spike-and-slab prior [8, 9] still remains very competitive, as shown by its superior performance in the recent works [10, 11, 12]. But despite its successes, posterior inference remains a computational challenge for the spike-and-slab prior. 
In this section we will show how the HMC binary sampler can be extended to sample from the posterior of these models. The latter is a distribution over a set of binary and continuous variables, with the binary variables determining whether each coefficient should be included in the model or not. The idea is to map these indicator binary variables into continuous variables as we did above, obtaining a distribution from which we can sample again using exact HMC methods. Below we consider a regression model with Gaussian noise but the extension to exponential noise (or other scale-mixtures of Gaussians) is immediate.\n\n3.1 Linear regression\n\nConsider a regression problem with a log-likelihood that depends quadratically on its coefficients, such as\n\nlog p(D|w) = \u2212(1/2) w\u2032Mw + r\u00b7w + const.   (23)\n\nwhere D represents the observed data. In a linear regression model z = Xw + \u03b5, with \u03b5 \u223c N(0, \u03c3^2), we have M = X\u2032X/\u03c3^2 and r = z\u2032X/\u03c3^2. We are interested in a spike-and-slab prior for the coefficients w of the form\n\np(w, s|a, \u03c4^2) = \u220f_{i=1}^d p(w_i|s_i, \u03c4^2) p(s_i|a) .   (24)\n\nEach binary variable s_i = \u00b11 has a Bernoulli prior p(s_i|a) = a^{(1+s_i)/2} (1 \u2212 a)^{(1\u2212s_i)/2} and determines whether the coefficient w_i is included in the model. The prior for w_i, conditioned on s_i, is\n\np(w_i|s_i, \u03c4^2) = (1/\u221a(2\u03c0\u03c4^2)) e^{\u2212w_i^2/(2\u03c4^2)}   for s_i = +1,\np(w_i|s_i, \u03c4^2) = \u03b4(w_i)   for s_i = \u22121.   (25)\n\nWe are interested in sampling from the posterior, given by\n\np(w, s|D, a, \u03c4^2) \u221d p(D|w) p(w, s|a, \u03c4^2)   (26)\n\u221d e^{\u2212(1/2) w\u2032Mw + r\u00b7w} e^{\u2212(1/2) w_+\u2032 w_+ \u03c4^{\u22122}} (2\u03c0\u03c4^2)^{\u2212|s_+|/2} \u03b4(w_\u2212) a^{|s_+|} (1 \u2212 a)^{|s_\u2212|}   (27)\n\u221d e^{\u2212(1/2) w_+\u2032 (M_+ + \u03c4^{\u22122}) w_+ + r_+\u00b7w_+} (2\u03c0\u03c4^2)^{\u2212|s_+|/2} \u03b4(w_\u2212) a^{|s_+|} (1 \u2212 a)^{|s_\u2212|}   (28)\n\nwhere s_+ are the variables with s_i = +1 and s_\u2212 those with s_i = \u22121. The notation r_+, M_+ and w_+ indicates a restriction to the s_+ subspace and w_\u2212 indicates a restriction to the s_\u2212 space. In the passage from (27) to (28) we see that the spike-and-slab prior shrinks the dimension of the Gaussian likelihood from d to |s_+|. In principle we could integrate out the weights w and obtain a collapsed distribution for s, but we are interested in cases in which the space of w is truncated and therefore the integration is not feasible. An example would be when a non-negativity constraint w_i \u2265 0 is imposed.\nIn these cases, one possible approach is to sample from (28) with a block Gibbs sampler over the pairs {w_i, s_i}, as proposed in [10]. Here we will present an alternative method, extending the ideas of the previous section. For this, we consider a new distribution, obtained in two steps. 
Firstly, we replace the delta functions in (28) by a factor similar to the slab (25)\n\n\u03b4(w_i) \u2192 (1/\u221a(2\u03c0\u03c4^2)) e^{\u2212w_i^2/(2\u03c4^2)} ,   s_i = \u22121.   (29)\n\nThe introduction of a non-singular distribution for those w_i\u2019s that are excluded from the model in (28) creates a Reversible Jump sampler [13]: the Markov chain can now keep track of all the coefficients, whether they belong or not to the model in a given state of the chain, thus allowing them to join or leave the model along the chain in a reversible way.\nSecondly, we augment the distribution with y variables as in (2)-(5) and sum over s. Using the Gaussian augmentation (11), this gives a distribution\n\np(w, y|D, a, \u03c4^2) \u221d e^{\u2212(1/2) w_+\u2032 (M_+ + \u03c4^{\u22122}) w_+ + r_+\u00b7w_+} e^{\u2212w_\u2212\u00b7w_\u2212/(2\u03c4^2)} e^{\u2212y\u00b7y/2} a^{|s_+|} (1 \u2212 a)^{|s_\u2212|}   (30)\n\nwhere the values of s in the rhs are obtained from the signs of y. This is a piecewise Gaussian, different in each orthant of y, and possibly truncated in the w space. Note that the changes in p(w, y|D, a, \u03c4^2) across orthants of y come both from the factors a^{|s_+|} (1 \u2212 a)^{|s_\u2212|} and from the functional dependence on the w variables. Sampling from (30) gives us samples from the original distribution (28) using a simple rule: each pair (w_i, y_i) becomes (w_i, s_i = +1) if y_i \u2265 0 and (w_i = 0, s_i = \u22121) if y_i < 0. This undoes the steps we took to transform (28) into (30): the identification s_i = sign(y_i) takes us from p(w, y|D, a, \u03c4^2) to p(w, s|D, a, \u03c4^2), and setting w_i = 0 when s_i = \u22121 undoes the replacement in (29).\nSince (30) is a piecewise Gaussian distribution, we can sample from it again using the methods of [2]. As in that work, the possible truncations for w are given as g_n(w) \u2265 0 for n = 1, . . . , N, with g_n(w) any product of linear and quadratic functions of w. 
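The readout rule stated above, which maps a sample of (w, y) back to a sample of (w, s), is simple enough to write down directly. The Python sketch below is only an illustration; the function name read_out and the use of plain lists are assumptions made for the example.

```python
def read_out(w, y):
    # s_i = +1 when y_i >= 0 and s_i = -1 otherwise; coefficients whose
    # indicator is -1 are excluded from the model, so they are set to zero.
    s = [1 if yi >= 0.0 else -1 for yi in y]
    w_out = [wi if si == 1 else 0.0 for wi, si in zip(w, s)]
    return w_out, s
```

For instance, read_out([0.3, -1.2, 2.0], [0.5, -0.1, 0.0]) keeps the first and third coefficients and zeroes the second.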
The details are a simple extension of the purely binary case and are not very illuminating, so we leave them for the Appendix.\n\n3.2 Probit regression\nConsider a probit regression model in which binary variables b_i = \u00b11 are observed with probability\n\np(b_i|w, x_i) = (1/\u221a(2\u03c0)) \u222b_{z_i b_i \u2265 0} dz_i e^{\u2212(1/2)(z_i + x_i w)^2} .   (31)\n\nGiven a set of N pairs (b_i, x_i), we are interested in the posterior distribution of the weights w using the spike-and-slab prior (24). This posterior is the marginal over the z_i\u2019s of the distribution\n\np(z, w, s|x, a, \u03c4^2) \u221d \u220f_{i=1}^N e^{\u2212(1/2)(z_i + x_i w)^2} p(w, s|a, \u03c4^2) ,   z_i b_i \u2265 0 ,   (32)\n\nand we can use the same approach as above to transform this distribution into a truncated piecewise Gaussian, defined over the (N + 2d)-dimensional space of the vector (z, w, y). Each z_i is truncated according to the sign of b_i and we can also truncate the w space if we so desire. We omit the details of the HMC algorithm, since it is very similar to the linear regression case.\n\n4 Examples\n\nWe present here three examples that illustrate the advantages of the proposed HMC algorithms over Metropolis or Gibbs samplers.\n\n4.1 1D Ising model\n\nWe consider a 1D periodic Ising model, with p(s) \u221d e^{\u2212\u03b2E[s]}, where the energy is E[s] = \u2212\u2211_{i=1}^d s_i s_{i+1}, with s_{d+1} = s_1 and \u03b2 is the inverse temperature. Figure 1 shows the first 1000 iterations of both the Gaussian HMC and the Metropolis\u00b9 sampler on a model with d = 400 and \u03b2 = 0.42, initialized with all spins s_i = 1. In HMC we took a travel time T = 12.5\u03c0 and, for the sake of comparable computational costs, for the Metropolis sampler we recorded the value of s every d \u00d7 12.5 flip proposals. The plot shows clearly that the Markov chain mixes much faster with HMC than with Metropolis. A useful variable that summarizes the behavior of the Markov chain is the magnetization m = (1/d) \u2211_{i=1}^d s_i, whose expected value is \u27e8m\u27e9 = 0. The oscillations of both samplers around this value illustrate the superiority of the HMC sampler. In the Appendix we present a more detailed comparison of the HMC Gaussian and exponential and the Metropolis samplers, showing that the Gaussian HMC sampler is the most efficient among the three.\n\n4.2 2D Ising model\n\nWe consider now a 2D Ising model on a square lattice of size L \u00d7 L with periodic boundary conditions below the critical temperature. Starting from a completely disordered state, we compare the time it takes for the sampler to reach one of the two low energy states with magnetization m \u2243 \u00b11. Figure 2 shows the results of 20 simulations of such a model with L = 100 and inverse temperature \u03b2 = 0.5. We used a Gaussian HMC with T = 2.5\u03c0 and a Metropolis sampler recording values of s every 2.5L^2 flip proposals. In general we see that the HMC sampler reaches higher likelihood regions faster.\n\n\u00b9 As is well known (see e.g. [14]), for binary distributions, the Metropolis sampler that chooses a random spin and makes a proposal of flipping its value is more efficient than the Gibbs sampler.\n\nFigure 1: 1D Ising model. First 1000 iterations of Gaussian HMC and Metropolis samplers on a model with d = 400 and \u03b2 = 0.42, initialized with all spins s_i = 1 (black dots). For HMC the travel time was T = 12.5\u03c0 and in the Metropolis sampler we recorded the state of the Markov chain once every d \u00d7 12.5 flip proposals. The lower two panels show the state of s at every iteration for each sampler. The plots show clearly that the HMC sampler mixes faster than Metropolis in this model.\n\nFigure 2: 2D Ising model. 
First samples from 20 simulations in a 2D Ising model in a square lattice of side length L = 100 with periodic boundary conditions and inverse temperature \u03b2 = 0.5. The initial state is totally disordered. We do not show the first 4 samples in order to ease the visualization. For the Gaussian HMC we used T = 2.5\u03c0 and for Metropolis we recorded the state of the chain every 2.5L^2 flip proposals. The plots illustrate that in general HMC reaches equilibrium faster than Metropolis in this model.\n\nNote that these results of the 1D and 2D Ising models illustrate the advantage of the HMC method in relation to two different time constants relevant for Markov chains [15]. Figure 1 shows that the HMC sampler explores the sampled space faster once the chain has reached its equilibrium distribution, while Figure 2 shows that the HMC sampler is faster in reaching the equilibrium distribution.\n\n[Figure 1 panels: Magnetization and Energy versus Iteration (Metropolis, HMC); state of s versus Iteration for Metropolis and for HMC. Figure 2 panels: Log likelihood and Absolute Magnetization versus Iteration (HMC, Metropolis).]\n\nFigure 3: Spike-and-slab linear regression with constraints. Comparison of the proposed HMC method with the Gibbs sampler of [10] for the posterior of a linear regression model with spike-and-slab prior, with a positivity constraint on the coefficients. See the text for details of the synthetic data used. Above: log-likelihood as a function of the iteration. Middle: samples of the first coefficient. Below: ACF of the first coefficient. 
The plots show clearly that HMC mixes much faster than Gibbs and is more consistent in exploring areas of high probability.\n\n4.3 Spike-and-slab linear regression with positive coefficients\n\nWe consider a linear regression model z = Xw + \u03b5 with the following synthetic data. X has N = 700 rows, each sampled from a d = 150-dimensional Gaussian whose covariance matrix has 3 in the diagonal and 0.3 in the nondiagonal entries. The noise is \u03b5 \u223c N(0, \u03c3^2 = 100). The data z is generated with a coefficients vector w, with 10 non-zero entries with values between 1 and 10. The spike-and-slab hyperparameters are set to a = 0.1 and \u03c4 = 10. Figure 3 compares the results of the proposed HMC method versus the Gibbs sampler used in [10]. In both cases we impose a positivity constraint on the coefficients. For the HMC sampler we use a travel time T = \u03c0/2. This results in a number of wall hits (both for w and y variables) of \u2243 150, which makes the computational cost of every HMC and Gibbs sample similar. The advantage of the HMC method is clear, both in exploring regions of higher probability and in the mixing speed of the sampled coefficients. This impressive difference in the efficiency of HMC versus Gibbs is similar to the case of truncated multivariate Gaussians studied in [2].\n\n5 Conclusions and outlook\n\nWe have presented a novel approach to use exact HMC methods to sample from generic binary distributions and certain distributions over mixed binary and continuous variables.\nEven though the HMC algorithm is better than Metropolis or Gibbs in the examples we presented, this will clearly not be the case in many complex binary distributions for which specialized sampling algorithms have been developed, such as the Wolff or Swendsen-Wang algorithms for 2D Ising models near the critical temperature [14]. 
But in particularly difficult distributions, these HMC algorithms could be embedded as inner loops inside more powerful algorithms of Wang-Landau type [16]. We leave the exploration of these newly-opened realms for future projects.\n\nAcknowledgments\n\nThis work was supported by an NSF CAREER award and by the US Army Research Laboratory and the US Army Research Office under contract number W911NF-12-1-0594.\n\n[Figure 3 panels: Log likelihood versus Iteration; Samples of first coefficient versus Iteration; ACF of first coefficient versus Lag (HMC, Gibbs).]\n\nReferences\n[1] R Neal. MCMC Using Hamiltonian Dynamics. Handbook of Markov Chain Monte Carlo, pages 113\u2013162, 2011.\n\n[2] Ari Pakman and Liam Paninski. Exact Hamiltonian Monte Carlo for Truncated Multivariate Gaussians. Journal of Computational and Graphical Statistics, 2013, arXiv:1208.4118.\n\n[3] John A Hertz, Anders S Krogh, and Richard G Palmer. Introduction to the theory of neural computation, volume 1. Westview press, 1991.\n\n[4] Yichuan Zhang, Charles Sutton, Amos Storkey, and Zoubin Ghahramani. Continuous Relaxations for Discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems 25, pages 3203\u20133211, 2012.\n\n[5] M.D. Hoffman and A. Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Arxiv preprint arXiv:1111.4246, 2011.\n\n[6] T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681\u2013686, 2008.\n\n[7] C.M. Carvalho, N.G. Polson, and J.G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465\u2013480, 2010.\n\n[8] T.J. Mitchell and J.J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023\u20131032, 1988.\n\n[9] E.I. George and R.E. McCulloch. 
Variable selection via Gibbs sampling. Journal of the Amer-\n\nican Statistical Association, 88(423):881\u2013889, 1993.\n\n[10] S. Mohamed, K. Heller, and Z. Ghahramani. Bayesian and L1 approaches to sparse unsuper-\n\nvised learning. arXiv preprint arXiv:1106.1157, 2011.\n\n[11] I.J. Goodfellow, A. Courville, and Y. Bengio. Spike-and-slab sparse coding for unsupervised\n\nfeature discovery. arXiv preprint arXiv:1201.3382, 2012.\n\n[12] Yutian Chen and Max Welling. Bayesian structure learning for Markov random \ufb01elds with a\n\nspike and slab prior. arXiv preprint arXiv:1206.1088, 2012.\n\n[13] Peter J Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model\n\ndetermination. Biometrika, 82(4):711\u2013732, 1995.\n\n[14] Mark E.J. Newman and Gerard T. Barkema. Monte Carlo methods in statistical physics. Ox-\n\nford: Clarendon Press, 1999., 1, 1999.\n\n[15] Alan D Sokal. Monte Carlo methods in statistical mechanics: foundations and new algorithms,\n\n1989.\n\n[16] Fugao Wang and David P Landau. Ef\ufb01cient, multiple-range random walk algorithm to calculate\n\nthe density of states. Physical Review Letters, 86(10):2050\u20132053, 2001.\n\n9\n\n\f", "award": [], "sourceid": 1176, "authors": [{"given_name": "Ari", "family_name": "Pakman", "institution": "Columbia University"}, {"given_name": "Liam", "family_name": "Paninski", "institution": "Columbia University"}]}