{"title": "Continuous Relaxations for Discrete Hamiltonian Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 3194, "page_last": 3202, "abstract": "Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems.", "full_text": "Continuous Relaxations for Discrete\n\nHamiltonian Monte Carlo\n\nYichuan Zhang, Charles Sutton, Amos Storkey\n\nSchool of Informatics\nUniversity of Edinburgh\n\nUnited Kingdom\n\nY.Zhang-60@sms.ed.ac.uk,\n\ncsutton@inf.ed.ac.uk,\n\na.storkey@ed.ac.uk\n\nZoubin Ghahramani\n\nDepartment of Engineering\nUniversity of Cambridge\n\nUnited Kingdom\n\nzoubin@eng.cam.ac.uk\n\nAbstract\n\nContinuous relaxations play an important role in discrete optimization, but have\nnot seen much use in approximate probabilistic inference. Here we show that a\ngeneral form of the Gaussian Integral Trick makes it possible to transform a wide\nclass of discrete variable undirected models into fully continuous systems. The\ncontinuous representation allows the use of gradient-based Hamiltonian Monte\nCarlo for inference, results in new ways of estimating normalization constants\n(partition functions), and in general opens up a number of new avenues for in-\nference in dif\ufb01cult discrete systems. 
We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems.\n\n1 Introduction\n\nDiscrete undirected graphical models have seen wide use in natural language processing [11, 24] and computer vision [19]. Although sophisticated inference algorithms exist for these models, including both exact algorithms and variational approximations, it has proven more difficult to develop discrete Markov chain Monte Carlo (MCMC) methods. Despite much work and many recent advances [3], the most commonly used MCMC methods in practice for discrete models are based on Metropolis-Hastings, the effectiveness of which is strongly dependent on the choice of proposal distribution.\nAn appealing idea is to relax the constraint that the random variables of interest take integral values. This is inspired by optimization methods such as linear program relaxation. Continuous problems are appealing because the gradient is on your side: unlike discrete probability mass functions, in the continuous setting, densities have derivatives, contours, and curvature that can be used to inform sampling algorithms [6, 16, 18, 20, 27]. For this reason, continuous relaxations are widespread in combinatorial optimization, and likewise a major appeal of variational methods is that they convert discrete inference problems into continuous optimization problems. Comparatively speaking, relaxations in an MCMC setting have been generally overlooked.\nIn this paper we provide a method for relaxing a discrete model into a continuous one, using a technique from statistical physics that Hertz et al. [8] call the \u201cGaussian integral trick,\u201d and that we present in a more general form than is typical. This trick is also known as the Hubbard-Stratonovich transform [10]. Starting with a discrete Markov random field (MRF), the trick introduces an auxiliary Gaussian variable in such a way that the discrete dependencies cancel out. 
This allows the discrete variables to be summed away, leaving a continuous problem.\nThe continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, highlights an equivalence between Boltzmann machines and the Gaussian-Bernoulli harmonium model [25], and in general opens up a number of new avenues for inference in difficult discrete systems. On synthetic problems and a real world problem in text processing, we show that HMC in the continuous relaxation can be much more accurate than standard MCMC methods on the discrete distribution.\nThe only previous work of which we are aware that uses the Gaussian integral trick for inference in graphical models is Martens and Sutskever [12]. They use the trick to transform an arbitrary MRF into an equivalent restricted Boltzmann machine (RBM), on which they then do block Gibbs sampling. They show that this transformation is useful when each block Gibbs step can be performed in parallel. However, unlike the current work, they do not sum out the discrete variables, so they do not perform a full continuous relaxation.\n\n2 Background\n\nConsider an undirected graphical model over random vectors t = (t1, t2, . . . tM) where each ti \u2208 {0, 1, 2, . . . Ki \u2212 1}. We will employ a 1 of Ki representation for each non-binary ti and concatenate the resulting binary variables into the vector s = (s1 . . . sN). We will also focus on pairwise models over a graph G = (V, E) where V = {1, 2, . . . N}. Every discrete undirected model can be converted into a pairwise model at the cost of expanding the state space. The undirected pairwise graphical model can be written in the form\n\np(s) = (1/Z) \u220f_{(i,j)\u2208G} exp(\u2212Eij(si, sj))    (1)\n\nwhere Z is a normalisation term, given by a sum over all valid states of (s1, s2, . . . , sN) that comply with the 1 of Ki constraints. 
Equivalently we can set Eij(si, sj) to be very large when si and sj are derived from the same variable tk (for some k and i \u2260 j, expanding G to include (i, j)), making the resulting product exponentially small for any configuration that breaks the 1 of Ki constraints. Henceforth, without loss of generality, we can consider binary pairwise models, and assume E captures any additional constraints that might apply. Then this model takes the general form of a Boltzmann machine or binary MRF, and can be conveniently rewritten as\n\np(s) = (1/Z) exp{ aT s + (1/2) sT W s }    (2)\n\nwhere a \u2208 RN, and W, a real symmetric matrix, are the model parameters. The normalization function is\n\nZ = \u2211_s exp{ aT s + (1/2) sT W s }.    (3)\n\n3 Gaussian Integral Trick\n\nInference in Boltzmann machines (which is equivalent to inference in Ising models) has always been a challenging problem. Typically Markov chain Monte Carlo procedures such as Gibbs sampling have been used, but the high levels of connectivity in Boltzmann machines can cause trouble and result in slow mixing in many situations. Furthermore for frustrated systems, such models are highly multimodal [1], often with large potential barriers between the different modes.\nIn many situations, the Hamiltonian Monte Carlo method has provided a more efficient sampling method for highly coupled systems [17], but it is only appropriate for real valued problems. For this reason, we choose to work with a real valued augmentation of the Boltzmann machine using the Gaussian integral trick. The main idea is to introduce a real valued auxiliary vector x \u2208 RN in such a way that the sT W s term from (2) cancels out [8]. 
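To make (2) and (3) concrete: for a handful of variables the partition function can be computed by brute-force enumeration. The following sketch (all variable names are ours, not the paper's) builds a random symmetric W and checks that the resulting probabilities sum to one.

```python
import itertools

import numpy as np

def unnorm_log_p(s, a, W):
    """Unnormalized log-mass of the Boltzmann machine, eq. (2): a^T s + (1/2) s^T W s."""
    return a @ s + 0.5 * s @ W @ s

def partition_function(a, W):
    """Brute-force partition function Z of eq. (3), enumerating all 2^N binary states."""
    N = len(a)
    return sum(np.exp(unnorm_log_p(np.array(s, dtype=float), a, W))
               for s in itertools.product([0, 1], repeat=N))

rng = np.random.default_rng(0)
N = 4
a = rng.normal(size=N)
B = rng.normal(size=(N, N))
W = (B + B.T) / 2            # real symmetric couplings, as required by (2)
np.fill_diagonal(W, 0.0)     # no self-couplings

Z = partition_function(a, W)
total = sum(np.exp(unnorm_log_p(np.array(s, dtype=float), a, W)) / Z
            for s in itertools.product([0, 1], repeat=N))
```

Enumeration is O(2^N), which is exactly why the sampling methods below are needed at any realistic scale.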
We generalise the standard form of the Gaussian integral trick by using the following form for the conditional distribution of the auxiliary variable x:\n\np(x|s) = N(x; A(W + D)s, A(W + D)AT)    (4)\n\nfor any choice of invertible matrix A and any diagonal matrix D for which W + D is positive definite. N(x; m, \u03a3) denotes the Gaussian distribution in x with mean m and covariance \u03a3. The resulting joint distribution over x and s is\n\np(x, s) \u221d exp{ \u2212(1/2)(x \u2212 A(W + D)s)T (A\u22121)T (W + D)\u22121 A\u22121 (x \u2212 A(W + D)s) + (1/2) sT W s + aT s }.    (5)\n\nIf d denotes a vector containing the diagonal elements of D, this simplifies to\n\np(x, s) \u221d exp{ \u2212(1/2) xT (A\u22121)T (W + D)\u22121 A\u22121 x + sT A\u22121 x + (a \u2212 (1/2) d)T s }.    (6)\n\nThe key point is that the sT W s term has vanished. We can then marginalise out the s variables, as they are decoupled from one another in the energy function, and can be summed over independently. Define the vector \u03b1x = A\u22121x. Then the marginal density is\n\np(x) \u221d exp{ \u2212(1/2) xT (A\u22121)T (W + D)\u22121 A\u22121 x } \u220f_i ( 1 + exp{ \u03b1x;i + ai \u2212 di/2 } ).    (7)\n\nThe constant of proportionality in the above equation is Z\u22121 |2\u03c0 A(W + D)AT|\u22121/2. The distribution p(x) is a mixture of 2^N Gaussians, i.e., the Gaussians are p(x|s) with mixing proportion p(s) for each possible assignment s.\nWe have now converted the discrete distribution p(s) into a corresponding continuous distribution p(x). To understand the sense in which the two distributions \u201ccorrespond\u201d, consider reconstructing s using the conditional distribution p(s|x). First, all of the si are independent given x, because s appears only log-linearly in (6). 
Using the sigmoid \u03c3(z) = (1 + exp{\u2212z})\u22121, this is\n\np(si|x) = \u03c3(\u2212\u03b1x;i \u2212 ai + di/2)^(1\u2212si) \u03c3(\u03b1x;i + ai \u2212 di/2)^(si).    (8)\n\nTwo choices for A are of particular interest because they introduce additional independence relationships into the augmented model. First, if A = \u039b\u22121/2 VT for the eigendecomposition W + D = V \u039b VT, then the result is an undirected bipartite graphical model in the joint space of (x, s):\n\np(x, s) \u221d exp{ \u2212(1/2) xT x + sT V \u039b1/2 x + (a \u2212 (1/2) d)T s }.    (9)\n\nThis is a Gaussian-Bernoulli form of exponential family harmonium [25]. Hence we see that the Gaussian-Bernoulli harmonium is equivalent to a general Boltzmann machine over the discrete variables only. Second, if A = I we get\n\np(x, s) = Z\u22121 |2\u03c0(W + D)|\u22121/2 exp{ (a + x \u2212 (1/2) d)T s \u2212 (1/2) xT (W + D)\u22121 x },    (10)\n\nwhich is of particular interest in that the coupling between s and x is one-to-one. A given xi determines the Bernoulli probabilities for the variable si, independent of the states of any of the other variables. This yields a marginal density\n\np(x) = Z\u22121 |2\u03c0(W + D)|\u22121/2 exp{ \u2212(1/2) xT (W + D)\u22121 x } \u220f_i ( 1 + exp{ ai + xi \u2212 di/2 } )    (11)\n\nand a particularly nice set of Bernoulli conditional probabilities\n\np(si|x) = \u03c3(\u2212ai \u2212 xi + di/2)^(1\u2212si) \u03c3(ai + xi \u2212 di/2)^(si).    (12)\n\nIn this model, the marginal of p(x) is a mixture of Gaussian distributions. 
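Both identities behind the A = I case can be checked numerically on a tiny model: integrating x out of (10) recovers the original unnormalized mass (2), and summing s out of (10) at fixed x gives the product form (11). A sketch under our own variable names (not from the paper):

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
N = 3
a = rng.normal(size=N)
B = rng.normal(size=(N, N))
W = (B + B.T) / 2
np.fill_diagonal(W, 0.0)
# choose a diagonal D large enough that W + D is positive definite
d = np.full(N, abs(np.linalg.eigvalsh(W)).max() + 1.0)
D = np.diag(d)

# Check 1: the Gaussian integral over x contributes exp{(1/2) s^T (W+D) s},
# and since s_i^2 = s_i the -(1/2) d^T s term restores exp{a^T s + (1/2) s^T W s}.
for s in itertools.product([0, 1], repeat=N):
    s = np.array(s, dtype=float)
    relaxed = (a - d / 2) @ s + 0.5 * s @ (W + D) @ s
    original = a @ s + 0.5 * s @ W @ s
    assert np.isclose(relaxed, original)

# Check 2: summing the binary s out of (10) at a fixed x factorizes
# into the product over sites appearing in (11).
x = rng.normal(size=N)
summed = sum(np.exp((a + x - d / 2) @ np.array(s, dtype=float))
             for s in itertools.product([0, 1], repeat=N))
product_form = np.prod(1.0 + np.exp(a + x - d / 2))
assert np.isclose(summed, product_form)
```

Both checks are exact algebraic identities, so they hold to floating-point precision for any parameter draw.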
Then, conditioned on x, the log odds of si = 1 is a recentered version of xi, in particular, xi + ai \u2212 di/2.\nThe different versions of the Gaussian integral trick can be compactly summarized by the independence relations that they introduce. All versions of the Gaussian integral trick give us that all si and sj are independent given x. If we take A = \u039b\u22121/2 VT, we additionally get that all xi and xj are independent given s. Finally if we instead take A = I, we get that si and sj are independent given only xi and xj. These independence relations are presented graphically in Figure 1.\n\nFigure 1: Graphical depiction of the different versions of the Gaussian integral trick. In all of the models here si \u2208 {0, 1} while xi \u2208 R. Notice that when A = I the x have the same dependence structure as the s did in the original MRF.\n\n3.1 Convexity of Log Density\n\nBecause probabilistic inference is NP-hard, it is too much to expect that the continuous transformation will always help. Sometimes difficult discrete distributions will be converted into difficult continuous ones. Experimentally we have noticed that highly frustrated systems typically result in multimodal p(x).\nThe modes of p(x) are particularly easy to understand if A = \u039b\u22121/2 VT, because p(x|s) = N(x; \u039b1/2 VT s, I), that is, the covariance does not depend on W + D. Without loss of generality assume that the diagonal of W is 0. Then write (W + D) = W + cD\u2032. Interpreting p(x) as a mixture of Gaussians, one for each assignment s, as c \u2192 \u221e the Gaussians become farther apart and we get 2^N modes, one each at \u039b1/2 VT s for each assignment to the binary vector s. If we take a small c, however, we can sometimes get fewer modes, and as shown next, we can sometimes even get log p(x) convex. 
This is a motivation to make sure that the elements of D are not too large.\nIn the following proposition we characterize the conditions on p(s) under which the resultant p(x) is log-concave. For any N \u00d7 N matrix M, let \u03bb1(M) \u2265 . . . \u2265 \u03bbN(M) denote the eigenvalues of M. Recall that we have already required that D be chosen so that W + D is positive definite, i.e., \u03bbN(W + D) > 0. Then\nProposition 1. p(x) is log-concave if and only if W + D has a narrow spectrum, by which we mean \u03bb1(W + D) < 4.\n\nProof. The Hessian of log p(x) is easy to compute. It is\n\nHx := \u22072_x log p(x) = Cx \u2212 (W + D)\u22121    (13)\n\nwhere Cx is a diagonal matrix with elements cii = \u03c3(\u2212ai \u2212 xi + di/2)(1 \u2212 \u03c3(\u2212ai \u2212 xi + di/2)). We use the simple eigenvalue inequalities that \u03bb1(A) + \u03bbN(B) \u2264 \u03bb1(A + B) \u2264 \u03bb1(A) + \u03bb1(B). If \u03bb1(W + D) \u2264 4, then\n\n\u03bb1(Hx) \u2264 \u03bb1(Cx) \u2212 [\u03bb1(W + D)]\u22121 \u2264 0.25 \u2212 [\u03bb1(W + D)]\u22121 \u2264 0.\n\nSo p(x) is log-concave. Conversely suppose that p(x) is log-concave. Then\n\n0.25 \u2212 [\u03bb1(W + D)]\u22121 = sup_x { \u03bbN(Cx) \u2212 [\u03bb1(W + D)]\u22121 } \u2264 sup_x \u03bb1(Hx) \u2264 0.\n\nSo \u03bb1(W + D) \u2264 4.\n\n3.2 MCMC in the Continuous Relaxation\n\nNow we discuss how to perform inference in the augmented distribution resulting from the trick. One simple choice is to focus on the joint density p(x, s). It is straightforward to generate samples from the conditional distributions p(x|s) and p(s|x). Therefore one can sample the joint distribution p(x, s) in a block Gibbs style that switches sampling between p(x|s) and p(s|x). 
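For A = I, that alternation is short to write down: x | s is the Gaussian (4) with mean (W+D)s and covariance W+D, and s | x is the product of Bernoullis (12). A minimal sketch (our own code and names, not the authors'):

```python
import numpy as np

def block_gibbs(a, W, d, n_samples, rng):
    """Block Gibbs on the augmented model with A = I:
    alternate x ~ N((W+D)s, W+D) and s_i ~ Bernoulli(sigma(a_i + x_i - d_i/2))."""
    N = len(a)
    Sigma = W + np.diag(d)
    L = np.linalg.cholesky(Sigma)        # for drawing from N(mean, Sigma)
    s = rng.integers(0, 2, size=N).astype(float)
    xs = np.empty((n_samples, N))
    for t in range(n_samples):
        x = Sigma @ s + L @ rng.normal(size=N)        # x | s, eq. (4)
        prob1 = 1.0 / (1.0 + np.exp(-(a + x - d / 2)))  # log-odds from (12)
        s = (rng.random(N) < prob1).astype(float)       # s | x
        xs[t] = x
    return xs

# tiny demo: 2 nodes, one coupling; W + D is positive definite (eigenvalues 1 and 2)
rng = np.random.default_rng(2)
a = np.array([0.3, -0.2])
W = np.array([[0.0, 0.5], [0.5, 0.0]])
d = np.array([1.5, 1.5])
xs = block_gibbs(a, W, d, 20000, rng)
# Rao-Blackwellized marginal of s_1 from the x samples alone
p_s1 = (1.0 / (1.0 + np.exp(-(a[0] + xs[:, 0] - d[0] / 2)))).mean()
```

For this two-node model the exact marginal P(s1 = 1) can be checked by enumerating the four states of (2); the Rao-Blackwellized average should agree to sampling error.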
In spite of the simplicity of this method, it has the potential difficulty that it may generate highly correlated samples, due to the coupling between the discrete and continuous variables.\nTo overcome the drawbacks of block Gibbs sampling, we propose running MCMC directly on the marginal p(x). We can efficiently evaluate the density of p(x) from (11) up to a constant, so we can approximately sample from p(x) using MCMC. The derivatives of log p(x) have a simple form and can be computed at a cost linear in the dimension of x.\nThat suggests the use of Hamiltonian Monte Carlo, an advanced MCMC method that uses gradient information to traverse the continuous space efficiently. We refer to the use of HMC on p(x) as discrete Hamiltonian Monte Carlo (DHMC). An important benefit of HMC is that it is more likely than many other Metropolis-Hastings methods to accept a proposed sample that has a large change of log density compared with the current sample.\n\n3.3 Estimating Marginal Probabilities\n\nGiven a set of samples that are approximately distributed from p(x) we can estimate the marginal distribution over any subset Sq \u2286 S of the discrete variables. This is possible because all of the variables in Sq are independent given x. There is no need to generate samples of s because p(s|x) is easy to compute. The marginal probability p(sq) can be estimated as\n\np(sq) \u2248 (1/M) \u2211_{m=1}^{M} p(sq|x(m)) = (1/M) \u2211_{m=1}^{M} \u220f_{si \u2208 Sq} p(si|x(m)),    x(m) \u223c p(x).\n\nThis gives us a Rao-Blackwellized estimate of p(sq) without needing to sample s directly. 
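The two ingredients, HMC on the relaxed marginal p(x) and the Rao-Blackwellized estimate, can be sketched together for A = I. The paper gives no pseudocode; the step size, trajectory length, and all names below are our own choices, and grad/logp follow directly from (11): \u2207 log p(x) = \u2212(W+D)\u22121x + \u03c3(a + x \u2212 d/2).

```python
import numpy as np

def hmc_on_px(a, W, d, n_samples, eps, n_leap, rng):
    """Sketch of DHMC: standard leapfrog HMC on the relaxed marginal p(x) of (11), A = I."""
    P = np.linalg.inv(W + np.diag(d))        # precision of the Gaussian factor
    eta = lambda x: a + x - d / 2            # per-site log-odds
    def logp(x):                             # log of (11) up to a constant
        return -0.5 * x @ P @ x + np.logaddexp(0.0, eta(x)).sum()
    def grad(x):
        return -P @ x + 1.0 / (1.0 + np.exp(-eta(x)))
    x = np.zeros(len(a))
    out = np.empty((n_samples, len(a)))
    for t in range(n_samples):
        p = rng.normal(size=len(a))                      # fresh momentum
        x_new, p_new = x.copy(), p.copy()
        p_new += 0.5 * eps * grad(x_new)                 # leapfrog: half step
        for _ in range(n_leap):
            x_new += eps * p_new
            p_new += eps * grad(x_new)
        p_new -= 0.5 * eps * grad(x_new)                 # undo extra half step
        # Metropolis accept/reject on the Hamiltonian
        log_acc = logp(x_new) - logp(x) - 0.5 * (p_new @ p_new - p @ p)
        if np.log(rng.random()) < log_acc:
            x = x_new
        out[t] = x
    return out

rng = np.random.default_rng(3)
a = np.array([0.3, -0.2])
W = np.array([[0.0, 0.5], [0.5, 0.0]])
d = np.array([1.5, 1.5])
xs = hmc_on_px(a, W, d, 5000, eps=0.3, n_leap=5, rng=rng)
# Rao-Blackwellized marginals via p(s_i = 1 | x) = sigma(a_i + x_i - d_i/2), eq. (12)
marg = (1.0 / (1.0 + np.exp(-(a + xs - d / 2)))).mean(axis=0)
```

Here \u03bb1(W + D) = 2 < 4, so by Proposition 1 this toy p(x) is log-concave and HMC mixes easily; the estimated marginals should match brute-force enumeration of p(s) to sampling error.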
We will not typically be able to obtain exact samples from p(x), as in general it may be multimodal, but we can obtain approximate samples from an MCMC method.\n\n3.4 Normalizing Constants\n\nBecause the inverse normalizing constant Z\u22121 of the Boltzmann machine is equal to the probability p(s = 0), we can estimate it using the technique from the previous section:\n\nZ\u22121 = p(s = 0) \u2248 (1/M) \u2211_{m=1}^{M} p(s = 0|x(m)),    x(m) \u223c p(x).    (14)\n\nAlthough this estimator is unbiased if we can sample exactly from p(x), it suffers from two problems. First, because it is an estimator of Z\u22121, as [15] explains, such an estimator may underestimate Z and log Z in practice. Second, it can suffer a similar problem as the harmonic mean estimator. The non-Gaussian term in p(x) corresponds to p(s = 0|x)\u22121. In general it is difficult to approximate the expectation of a function f(x) with respect to a distribution q(x) if f is large precisely where q is not. This is potentially the situation with this estimator of Z.\nWe use an alternative estimator based on a \u201cmirrored variant\u201d of the importance trick. First, we introduce a distribution q(x). Define p\u2217(x) = Zp(x) to be an unnormalized version of p. Using the identity Z\u22121 = Z\u22121 \u222b dx q(x) we have\n\nZ\u22121 = Z\u22121 \u222b dx q(x) p\u2217(x)/p\u2217(x) = \u222b dx [q(x)/p\u2217(x)] p(x).\n\nA Monte Carlo estimate of this integral is \u02c6Z\u22121 = (1/M) \u2211_m q(x(m))/p\u2217(x(m)) for x(m) \u223c p(x). This estimator reduces to (14) if we take q(x) to be the Gaussian portion of p(x), i.e., q(x) = N(x; 0, A(W + D)AT). However in practice we have found other choices of q to be much better. Intuitively, the variance of \u02c6Z depends on the ratio q(x)/p\u2217(x). If p(x) = q(x), the variance of this estimator is asymptotically zero. 
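The two estimators can be compared on a toy model where exact samples from p(x) are available by enumerating p(s) and then drawing x | s. In this sketch (our own names throughout) q is a Gaussian fit to the x samples, the choice used in Section 5:

```python
import itertools

import numpy as np

rng = np.random.default_rng(4)
a = np.array([0.3, -0.2, 0.1])
W = np.zeros((3, 3)); W[0, 1] = W[1, 0] = 0.5; W[1, 2] = W[2, 1] = -0.4
d = np.full(3, 1.5)
Sigma = W + np.diag(d)

# brute-force ground truth Z from (3)
states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=3)]
weights = np.array([np.exp(a @ s + 0.5 * s @ W @ s) for s in states])
Z_true = weights.sum()

# exact samples from p(x): draw s ~ p(s), then x ~ p(x|s) = N(Sigma s, Sigma)
M = 20000
L = np.linalg.cholesky(Sigma)
idx = rng.choice(len(states), size=M, p=weights / Z_true)
xs = np.array([Sigma @ states[i] + L @ rng.normal(size=3) for i in idx])

# estimator (14): 1/Z as the average of p(s = 0 | x)
p_s0 = np.prod(1.0 / (1.0 + np.exp(a + xs - d / 2)), axis=1)
Z_hat_14 = 1.0 / p_s0.mean()

# mirrored importance estimator with Gaussian q fit to the samples
mu, C = xs.mean(axis=0), np.cov(xs.T)
Cinv, logdetC = np.linalg.inv(C), np.linalg.slogdet(C)[1]
def log_q(x):
    v = x - mu
    return -0.5 * (v @ Cinv @ v + logdetC + len(x) * np.log(2 * np.pi))
def log_pstar(x):   # unnormalized log p(x) from (11), keeping the Gaussian constant
    return (-0.5 * x @ np.linalg.inv(Sigma) @ x
            - 0.5 * np.linalg.slogdet(2 * np.pi * Sigma)[1]
            + np.logaddexp(0.0, a + x - d / 2).sum())
ratios = np.array([np.exp(log_q(x) - log_pstar(x)) for x in xs])
Z_hat_mirror = 1.0 / ratios.mean()
```

With exact samples both estimators should land within a few percent of the enumerated Z; with MCMC samples from a multimodal p(x) the gap between them is what Section 5 measures.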
This importance trick is well known in the statistics literature, e.g., [14].\n\n4 Related Work\n\nThe use of Hamiltonian mechanics in Boltzmann machines and related models (e.g., Ising models, stochastic Hopfield models) has an interesting history. The Ising model was studied as a model of physical spin systems, and so the dynamics used were typically representative of the physics, with Glauber dynamics [7] being a common model, e.g., [2]. In the context of stochastic neural models, though, there was the potential to examine other dynamics that did not match the standard physics of spin systems. Hamiltonian dynamics were considered as a neural interaction model [13, 21, 26], but were not applied directly to the Ising model itself, or used for inference. Hamiltonian dynamics were also considered for networks combining excitatory and inhibitory neurons [22]. All these approaches involved developing Hamiltonian neural models, rather than Hamiltonian auxiliary methods for existing models.\nThe Gaussian integral trick is also known as the Hubbard-Stratonovich transformation in physics [10]. In the area of neural modelling, the Gaussian integral trick was also common for theoretical reasons rather than as a practical augmentation strategy [8]. The Gaussian integral trick formed a critical part of replica-theoretical analysis [8] for phase analysis of spin systems, as it enabled ensemble averaging of the spin components, leading to saddle-point equations in the continuous domain. 
These theoretical analyses relied on the ensemble randomness of the interaction matrix of the stochastic Hopfield model, and so were not directly relevant in the context of a learnt Boltzmann machine, where the weight matrix has a specific structure.\nThe Gaussian integral trick relates general Boltzmann machines to exponential family harmoniums [25], which generalise restricted Boltzmann machines. The specific Gaussian-Bernoulli harmonium is in common use, but there the real valued variables are visible units and the binary variables are hidden variables [9]. This is quite distinct from the use here, where the visible and hidden units are all binary and the Gaussian variables are auxiliary variables.\nThe only work of which we are aware that uses the Gaussian integral trick for probabilistic inference is that of Martens and Sutskever [12]. This work also considers inference in MRFs, using the special case of the Gaussian integral trick in which A = \u039b\u22121/2 VT. However, they do not use the full version of the trick, as they do not integrate s out, so they never obtain a fully continuous problem. Instead they perform inference directly in the resulting harmonium, using block Gibbs sampling alternating between s and x. On serial computers, they do not find that this expanded representation offers much benefit over performing single-site Gibbs in the discrete space. Indeed they find that the sampler in the augmented model is actually slightly slower than the one in the original discrete space. This is in sharp contrast to our work, because we use a Rao-Blackwellized sampler on the x.\n\n5 Results\n\nIn this section we evaluate the accuracy of the relaxed sampling algorithms on both synthetic grids and a real-world task. 
We evaluate both node marginal estimation and normalisation factor estimation on two synthetic models.\nWe compare the accuracy of the discrete HMC sampler to Gibbs sampling in the original discrete model p(s) and to block Gibbs sampling in the augmented model p(x, s). We choose the number of MCMC samples so that the total computational time for each method is roughly the same. The Gibbs sampler resamples one node at a time from p(si|s\u2212i). The node marginal probability p(si) is estimated by the empirical probability of the samples. The normalizing constant is estimated Chib-style using the Gibbs transition kernel; for more details see [15].\nThe block Gibbs sampler over p(x, s) that we use is based on [12]. This comparison is designed to evaluate the benefits of summing away the discrete variables. To estimate the node marginals, we use the block Gibbs sampler to generate samples of x and then apply the Rao-Blackwellized estimators from Sections 3.3 and 3.4. We empirically choose the mirror distribution q as a Gaussian, with mean and covariance given by the empirical mean and covariance of the x samples from MCMC. The samples of s are simply discarded at the estimation stage.\nHMC can generate better samples when a large number of leapfrog steps is used, but this requires much more computation time. For a fixed computational budget, using more leapfrog steps causes fewer samples to be generated, which can also undermine the accuracy of the estimator. So, we empirically pick 5 leapfrog steps and tune the leapfrog step size so that the acceptance rate is around 90%. We use A = I for DHMC. To estimate the marginals p(si) and the partition function, we apply the Rao-Blackwellized estimators from Sections 3.3 and 3.4 in the same way as for block Gibbs.\nSynthetic Boltzmann Machines. We evaluate the performance of the samplers across different types of weight matrices by using synthetically generated models. 
The idea is to characterize what types of distributions are difficult for each sampler.\n\nFigure 2: Performance of samplers on synthetic grid-structured Boltzmann machines (columns: Gibbs on p(s), HMC on p(x), block Gibbs on p(x, s)). The axes show the standard deviations of the distributions used to select the synthetic models. The top row shows error in the normalization constant, while the bottom row shows average error in the single-node marginal distributions.\n\nWe randomly generate 10 \u00d7 10 grid models over binary variables. We use two different generating processes, a \u201cstandard\u201d one and a \u201cfrustrated\u201d one. In the standard case, for each node si, the biases are generated as ai \u223c c1N(0, 4). The weights are generated as wij \u223c c2N(0, 4). The parameters c1 and c2 define the scales of the biases and weights and determine how hard the problem is. In the frustrated case, we shift the weights to make the problem more difficult. We still generate the weights as wij \u223c c2N(0, 4) but now we generate the biases as ai \u223c c1N(\u2211_j wij, 4). This shift in the Gaussian distribution tends to encourage the multimodality of p(x).\nWe test all three samplers on 36 random graphs from each of the two generating processes, using different values of c1 and c2 for each random graph. Each MCMC method is given 10 runs of 10000 samples with 2000 burn-in samples. We report the MSE of the node marginal estimate and the log normalising constant estimate averaged over the 10 runs.\nThe results are shown in Figures 2 and 3. The axes show c1 and c2, which determine the difficulty of the problem. A higher value in the heat maps means a larger error. On the standard graphs (Figure 2), the DHMC method significantly outperforms both competitors. 
DHMC beats Gibbs on p(s) at the normalization constant and beats block Gibbs on p(x, s) at marginal estimation.\nThe frustrated graphs (Figure 3) are significantly more difficult for DHMC, as expected. All three samplers seem to have trouble in the same area of model space, although DHMC suffers somewhat worse than the other methods in marginal error, while still beating Chib\u2019s method for normalization constants. We note that in the worst scenarios, the weights of the model are very extreme. Examining a few representative graphs seems to indicate that in the regime with large weights, the HMC sampler becomes stuck in a bad mode. We observe that in both cases block Gibbs on p(x, s) performs roughly the same at marginal estimation as Gibbs on p(s). This is consistent with the results in [12].\nText data. We also evaluate the relaxed inference methods on a small text labelling data set. The data are a series of email messages that announce academic seminars [5]. We consider the binary problem of determining whether or not each word in the message is part of the name of the seminar\u2019s speaker, so we have one random variable for each token in the message. We use a \u201cskip-chain CRF\u201d [4, 23] model which contains edges between adjacent tokens and also additional edges between any pair of identical capitalized words.\nWe trained the CRF on a set of 485 messages using belief propagation. 
We evaluate the performance of different inference methods on inferring the probability distribution over labels on a held out set of messages. Our test set uses a subset of the messages which are small enough that we can run exact inference. The test set contained 75 messages whose lengths ranged from 50 words to 628 words. We evaluate whether the approximate inference methods match the solution from exact inference.\n\nFigure 3: Performance of samplers on a set of highly frustrated grid-structured Boltzmann machines (columns: Gibbs on p(s), HMC on p(x), block Gibbs on p(x, s)). The axes show the standard deviations of the distributions used to select the synthetic models. The top row shows error in the normalization constant, while the bottom row shows average error in the single-node marginal distributions.\n\nThe accuracy of the three approximate inference methods is shown in Table 1. We see that the HMC sampler is much more accurate than either of the other samplers at estimating single-node marginals. Chib\u2019s method and DHMC have roughly the same accuracy on the normalization constant. 
The block Gibbs sampler yields both worse estimates of the marginals and a significantly worse estimate of the normalization constant.\n\n                                   Gibbs p(s)   DHMC p(x)   Block Gibbs p(s, x)\nRMSE (node marginal)                  0.2346       0.1619        0.2251\nRMSE (log normalizing constant)       3.3041       3.3171       12.9685\n\nTable 1: Root mean squared error of single-site marginals and of the normalising constant against the ground truth computed by the variable elimination algorithm.\n\n6 Conclusion\n\nWe have provided a general strategy for approximate inference based on relaxing discrete distributions into continuous ones using the classical Gaussian integral trick. We described a continuum of different versions of the trick that have different properties. Although we illustrated the benefits of the continuous setting by using Hamiltonian Monte Carlo, in future work other inference methods such as elliptical slice sampling or more advanced HMC methods may prove superior. We hope that this work might open the door to a larger space of interesting relaxations for approximate inference.\n\nAcknowledgments\n\nWe thank Iain Murray for helpful comments, Max Welling for introducing ZG to the GIT and Peter Sollich for bringing the paper by Hubbard to our attention. This work was supported by the Engineering and Physical Sciences Research Council [grant numbers EP/I036575/1 and EP/J00104X/1].\n\nReferences\n[1] D. J. Amit. Modeling Brain Function. Cambridge University Press, 1989.\n[2] A. Coolen, S. Laughton, and D. Sherrington. Modern analytic techniques to solve the dynamics of recurrent neural networks. 
In Advances in Neural Information Processing Systems 8 (NIPS95), 1996.\n[3] S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman. Accelerated adaptive Markov chain for partition function computation. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2744\u20132752. 2011.\n[4] J. Finkel, T. Grenager, and C. D. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Annual Meeting of the Association for Computational Linguistics (ACL), 2005.\n[5] D. Freitag and A. McCallum. Information extraction with HMMs and shrinkage. In AAAI Workshop on Machine Learning for Information Extraction, 1999.\n[6] M. Girolami, B. Calderhead, and S. A. Chin. Riemannian manifold Hamiltonian Monte Carlo. Journal of the Royal Statistical Society, B, 73(2):1\u201337, 2011.\n[7] R. J. Glauber. Time-dependent statistics of the Ising model. J. Math. Phys., 4:294\u2013307, 1963.\n[8] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Perseus Books, 1991.\n[9] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504\u2013507, 2006.\n[10] J. Hubbard. Calculation of partition functions. Phys. Rev. Lett., 3:77\u201378, Jul 1959. doi: 10.1103/PhysRevLett.3.77. URL http://link.aps.org/doi/10.1103/PhysRevLett.3.77.\n[11] M. Johnson, T. Griffiths, and S. Goldwater. Bayesian inference for PCFGs via Markov chain Monte Carlo. In HLT/NAACL, 2007.\n[12] J. Martens and I. Sutskever. Parallelizable sampling of Markov random fields. In Conference on Artificial Intelligence and Statistics (AISTATS), 2010.\n[13] R. V. Mendes and J. T. Duarte. Vector fields and neural networks. Complex Systems, 6:21\u201330, 1992.\n[14] J. M\u00f8ller, A. N. Pettitt, R. Reeves, and K. K. Berthelsen. 
An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika, 93(2):451\u2013458, 2006.\n[15] I. Murray and R. Salakhutdinov. Evaluating probabilities under high-dimensional latent variable models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1137\u20131144. 2009.\n[16] I. Murray, R. P. Adams, and D. J. MacKay. Elliptical slice sampling. JMLR: W&CP, 9:541\u2013548, 2010.\n[17] R. Neal. Bayesian Learning for Neural Networks. PhD thesis, Computer Science, University of Toronto, 1995.\n[18] R. M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman & Hall / CRC Press, 2010.\n[19] S. Nowozin and C. H. Lampert. Structured prediction and learning in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3-4), 2011.\n[20] Y. Qi and T. P. Minka. Hessian-based Markov chain Monte-Carlo algorithms. In First Cape Cod Workshop on Monte Carlo Methods, September 2002.\n[21] U. Ramacher. The Hamiltonian approach to neural networks dynamics. In Proc. IEEE JCNN, volume 3, 1991.\n[22] H. S. Seung, T. J. Richardson, J. C. Lagarias, and J. J. Hopfield. Minimax and Hamiltonian dynamics of excitatory-inhibitory networks. In Advances in Neural Information Processing Systems 10 (NIPS97), 1998.\n[23] C. Sutton and A. McCallum. Collective segmentation and labeling of distant entities in information extraction. In ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields, 2004.\n[24] C. Sutton and A. McCallum. An Introduction to Conditional Random Fields. Foundations and Trends in Machine Learning, 4(4), 2012.\n[25] M. Welling, M. Rosen-Zvi, and G. Hinton. Exponential family harmoniums with an application to information retrieval. 
In Advances in Neural Information Processing Systems 17, 2004.\n[26] P. D. Wilde. Class of Hamiltonian neural networks. Physical Review E, 47(2):1392\u20131396.\n[27] Y. Zhang and C. Sutton. Quasi-Newton Markov chain Monte Carlo. In Advances in Neural Information Processing Systems (NIPS), 2011.\n\n", "award": [], "sourceid": 1464, "authors": [{"given_name": "Yichuan", "family_name": "Zhang", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Amos", "family_name": "Storkey", "institution": null}, {"given_name": "Charles", "family_name": "Sutton", "institution": null}]}