{"title": "Bayesian Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 512, "abstract": null, "full_text": "Bayesian Monte Carlo\n\nCarl Edward Rasmussen and Zoubin Ghahramani\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\n\n17 Queen Square, London WC1N 3AR, England\n\nedward,zoubin@gatsby.ucl.ac.uk\n\nhttp://www.gatsby.ucl.ac.uk\n\nAbstract\n\nWe investigate Bayesian alternatives to classical Monte Carlo methods\nfor evaluating integrals. Bayesian Monte Carlo (BMC) allows the in-\ncorporation of prior knowledge, such as smoothness of the integrand,\ninto the estimation. In a simple problem we show that this outperforms\nany classical importance sampling method. We also attempt more chal-\nlenging multidimensional integrals involved in computing marginal like-\nlihoods of statistical models (a.k.a. partition functions and model evi-\ndences). We \ufb01nd that Bayesian Monte Carlo outperformed Annealed\nImportance Sampling, although for very high dimensional problems or\nproblems with massive multimodality BMC may be less adequate. One\nadvantage of the Bayesian approach to Monte Carlo is that samples can\nbe drawn from any distribution. This allows for the possibility of active\ndesign of sample points so as to maximise information gain.\n\n1 Introduction\n\nInference in most interesting machine learning algorithms is not computationally tractable,\nand is solved using approximations. This is particularly true for Bayesian models which\nrequire evaluation of complex multidimensional integrals. Both analytical approximations,\nsuch as the Laplace approximation and variational methods, and Monte Carlo methods\nhave recently been used widely for Bayesian machine learning problems. It is interesting\nto note that Monte Carlo itself is a purely frequentist procedure [O\u2019Hagan, 1987; MacKay,\n1999]. 
This leads to several inconsistencies which we review below, outlined in a paper by O'Hagan [1987] with the title "Monte Carlo is Fundamentally Unsound". We then investigate Bayesian counterparts to the classical Monte Carlo.

Consider the evaluation of the integral:

\bar{f} = \int f(x)\, p(x)\, dx,    (1)

where p(x) is a probability (density) and f(x) is the function we wish to integrate. For example, p(x) could be the posterior distribution and f(x) the predictions made by a model with parameters x, or p(x) could be the parameter prior and f(x) = p(y|x) the likelihood, so that equation (1) evaluates the marginal likelihood (evidence) for a model. Classical Monte Carlo makes the approximation:

\bar{f} \simeq \frac{1}{T} \sum_{t=1}^{T} f(x^{(t)}),    (2)

where the x^{(t)} are random (not necessarily independent) draws from p(x), which converges to the right answer in the limit of large numbers of samples, T.
If sampling directly from p(x) is hard, or if high density regions in p(x) do not match up with areas where f(x) has large magnitude, it is also possible to draw samples from some importance sampling distribution q(x) to obtain the estimate:

\bar{f} \simeq \frac{1}{T} \sum_{t=1}^{T} \frac{f(x^{(t)})\, p(x^{(t)})}{q(x^{(t)})}.    (3)

As O'Hagan [1987] points out, there are two important objections to these procedures. First, the estimator not only depends on the values of f(x^{(t)}) p(x^{(t)}), but also on the entirely arbitrary choice of the sampling distribution q(x). Thus, if the same set of samples, conveying exactly the same information about \bar{f}, were obtained from two different sampling distributions, two different estimates of \bar{f} would be obtained. This dependence on irrelevant (ancillary) information is unreasonable and violates the Likelihood Principle. The second objection is that classical Monte Carlo procedures entirely ignore the values of the x^{(t)} when forming the estimate. Consider the simple example of three points that are sampled from q, where the third happens to fall on the same point as the second, conveying no extra information about the integrand. Simply averaging the integrand at these three points, which is the classical Monte Carlo estimate, is clearly inappropriate; it would make much more sense to average the first two (or the first and third). In practice points are unlikely to fall on top of each other in continuous spaces; however, a procedure that weights points equally regardless of their spatial distribution is ignoring relevant information.
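The two classical estimators, eqs. (2) and (3), can be sketched in a few lines. This is a minimal numerical illustration, assuming a hypothetical integrand f(x) = sin(x) + 1/2 and p(x) = N(0, 1) (our choices, not the paper's example); since sin is odd, the true value of the integral is 1/2:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical integrand; E[f(X)] = 0.5 for X ~ N(0, 1)
    return np.sin(x) + 0.5

def norm_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# classical Monte Carlo, eq. (2): average f at draws from p(x) = N(0, 1)
x_p = rng.normal(0.0, 1.0, size=100_000)
f_bar_smc = f(x_p).mean()

# importance sampling, eq. (3): draw from q(x) = N(0, 2^2), reweight by p/q
sigma_q = 2.0
x_q = rng.normal(0.0, sigma_q, size=100_000)
w = norm_pdf(x_q, 0.0, 1.0) / norm_pdf(x_q, 0.0, sigma_q)
f_bar_is = (w * f(x_q)).mean()
```

Both estimates converge to 1/2 at the familiar O(1/\sqrt{T}) rate in the error.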
To summarize the objections, classical Monte Carlo bases its estimate on irrelevant information and throws away relevant information.

We seek to turn the problem of evaluating the integral (1) into a Bayesian inference problem which, as we will see, avoids the inconsistencies of classical Monte Carlo and can result in better estimates. To do this, we think of the unknown desired quantity \bar{f} as being random. Although this interpretation is not the most usual one, it is entirely consistent with the Bayesian view that all forms of uncertainty are represented using probabilities: in this case uncertainty arises because we cannot afford to compute f(x) at every location. Since the desired \bar{f} is a function of f(x) (which is unknown until we evaluate it), we proceed by putting a prior on f, combining it with the observations to obtain the posterior over f, which in turn implies a distribution over the desired \bar{f}.

A very convenient way of putting priors over functions is through Gaussian Processes (GP). Under a GP prior the joint distribution of any (finite) number of function values (indexed by the inputs, x) is Gaussian:

p(f(x^{(1)}), \ldots, f(x^{(n)})) \sim \mathcal{N}(0, K),    (4)

where here we take the mean to be zero. The covariance matrix K is given by the covariance function, a convenient choice being:^1

K_{pq} = \mathrm{Cov}(f(x^{(p)}), f(x^{(q)})) = w_0 \exp\big[-\tfrac{1}{2} \sum_{d=1}^{D} (x_d^{(p)} - x_d^{(q)})^2 / w_d^2\big],    (5)

where the w parameters are hyperparameters.
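The covariance function of eq. (5) can be written directly as a function of two sets of inputs. A small sketch (the function and argument names are ours):

```python
import numpy as np

def sqexp_cov(X1, X2, w0, w):
    """Squared-exponential covariance, eq. (5):
    w0 * exp(-0.5 * sum_d (x_d - x'_d)^2 / w_d^2).
    X1: (n, D) array, X2: (m, D) array, w: (D,) length scales."""
    d = (X1[:, None, :] - X2[None, :, :]) / w   # pairwise scaled differences
    return w0 * np.exp(-0.5 * np.sum(d ** 2, axis=-1))

X = np.array([[0.0], [1.0], [2.0]])
K = sqexp_cov(X, X, w0=1.0, w=np.array([1.0]))
# K is symmetric, has w0 on the diagonal, and K[0, 1] = exp(-0.5)
```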
Gaussian processes, including optimization of hyperparameters, are discussed in detail in [Williams and Rasmussen, 1996].

^1 Although the function values obtained are assumed to be noise-free, we added a tiny constant to the diagonal of the covariance matrix to improve numerical conditioning.

2 The Bayesian Monte Carlo Method

The Bayesian Monte Carlo method starts with a prior over the function, p(f), and makes inferences about f from a set of samples \mathcal{D} = \{(x^{(t)}, f(x^{(t)})) \mid t = 1, \ldots, T\}, giving the posterior distribution p(f \mid \mathcal{D}). Under a GP prior the posterior is (an infinite dimensional joint) Gaussian; since the integral eq. (1) is just a linear projection (on the direction defined by p(x)), the posterior p(\bar{f} \mid \mathcal{D}) is also Gaussian, and fully characterized by its mean and variance. The average over functions of eq. (1) is the expectation of the average function:

E[\bar{f} \mid \mathcal{D}] = \int \!\!\! \int f(x)\, p(x)\, dx\; p(f \mid \mathcal{D})\, df = \int \bar{f}_{\mathcal{D}}(x)\, p(x)\, dx,    (6)

where \bar{f}_{\mathcal{D}}(x) = E[f(x) \mid \mathcal{D}] is the posterior mean function.
Similarly, for the variance:

V[\bar{f} \mid \mathcal{D}] = \int \big[\int f(x)\, p(x)\, dx - E[\bar{f} \mid \mathcal{D}]\big]^2 p(f \mid \mathcal{D})\, df = \int \!\!\! \int \mathrm{Cov}_{\mathcal{D}}(f(x), f(x'))\, p(x)\, p(x')\, dx\, dx',    (7)

where \mathrm{Cov}_{\mathcal{D}}(f(x), f(x')) is the posterior covariance. The standard results for the GP model for the posterior mean and covariance are:

\bar{f}_{\mathcal{D}}(x) = k(x, \mathbf{x})^{\top} K^{-1} \mathbf{f}, \qquad \mathrm{Cov}_{\mathcal{D}}(f(x), f(x')) = k(x, x') - k(x, \mathbf{x})^{\top} K^{-1} k(\mathbf{x}, x'),    (8)

where \mathbf{x} and \mathbf{f} are the observed inputs and function values respectively, K is the covariance matrix on the observed inputs, and k(x, \mathbf{x}) is the vector of covariances between x and the observed inputs. In general, combining eq. (8) with eqs. (6-7) may lead to expressions which are difficult to evaluate, but there are several interesting special cases.

If the density p(x) and the covariance function eq. (5) are both Gaussian, we obtain analytical results.
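The posterior mean and covariance of eq. (8) are standard GP regression formulas; a compact sketch, using the squared-exponential covariance of eq. (5) and the diagonal jitter mentioned in footnote 1 (function and variable names are ours):

```python
import numpy as np

def gp_posterior(X, fX, Xs, w0, w, jitter=1e-9):
    """Posterior mean and covariance of a zero-mean GP, eq. (8):
    mean = k(x*, X) K^{-1} f,  cov = k(x*, x*') - k(x*, X) K^{-1} k(X, x*')."""
    def k(A, B):
        d = (A[:, None, :] - B[None, :, :]) / w
        return w0 * np.exp(-0.5 * np.sum(d ** 2, axis=-1))
    K = k(X, X) + jitter * np.eye(len(X))   # jitter for conditioning (footnote 1)
    ks = k(Xs, X)
    mean = ks @ np.linalg.solve(K, fX)
    cov = k(Xs, Xs) - ks @ np.linalg.solve(K, ks.T)
    return mean, cov

X = np.array([[-1.0], [0.0], [1.0]])
fX = np.sin(X).ravel()
mean, cov = gp_posterior(X, fX, X, w0=1.0, w=np.array([1.0]))
# at the (noise-free) training inputs the posterior mean interpolates f
# and the posterior variance is essentially zero
```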
In detail, if p(x) = \mathcal{N}(b, B) and the Gaussian kernels on the data points are k(x, a^{(t)}) = w_0 \exp[-\tfrac{1}{2}(x - a^{(t)})^{\top} A^{-1} (x - a^{(t)})], where a^{(t)} = x^{(t)} and A = \mathrm{diag}(w_1^2, \ldots, w_D^2), then the expectation evaluates to:

E[\bar{f} \mid \mathcal{D}] = z^{\top} K^{-1} \mathbf{f}, \qquad z_t = w_0\, |A^{-1} B + I|^{-1/2} \exp\big[-\tfrac{1}{2}(a^{(t)} - b)^{\top} (A + B)^{-1} (a^{(t)} - b)\big],    (9)

a result which has previously been derived under the name of Bayes-Hermite Quadrature [O'Hagan, 1991]. For the variance, we get:

V[\bar{f} \mid \mathcal{D}] = w_0\, |2 A^{-1} B + I|^{-1/2} - z^{\top} K^{-1} z,    (10)

with z as defined in eq. (9). Other choices that lead to analytical results include polynomial kernels and mixtures of Gaussians for p(x).

2.1 A Simple Example

To illustrate the method we evaluated the integral of a one-dimensional function under a Gaussian density (figure 1, left). We generated samples independently from p(x), evaluated f(x) at those points, and optimised the hyperparameters of our Gaussian process fit to the function. Figure 1 (middle) compares the error in the Bayesian Monte Carlo (BMC) estimate of the integral (1) to the Simple Monte Carlo (SMC) estimate using the same samples. As we would expect, the squared error in the Simple Monte Carlo estimate decreases as 1/T, where T is the sample size. In contrast, for more than about 10 samples, the BMC estimate improves at a much higher rate.
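In one dimension, eqs. (9)-(10) reduce to scalar formulas. The sketch below uses fixed hyperparameters (the paper optimises them) and a hypothetical integrand f(x) = sin(x) + 1/2, not the paper's exact example:

```python
import numpy as np

rng = np.random.default_rng(1)

b, B = 0.0, 1.0     # p(x) = N(b, B)
w0, w = 1.0, 0.7    # signal variance and length scale (fixed here, not optimised)
A = w ** 2          # A = diag(w_1^2, ..., w_D^2) collapses to a scalar in 1-D

def f(x):
    return np.sin(x) + 0.5   # hypothetical integrand; true integral is 0.5

x = rng.normal(b, np.sqrt(B), size=16)   # samples drawn from p(x)
fx = f(x)

# covariance matrix on the samples, eq. (5), plus diagonal jitter (footnote 1)
K = w0 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / A) + 1e-8 * np.eye(len(x))

# eq. (9): z_t = w0 |A^{-1}B + I|^{-1/2} exp(-(a_t - b)^2 / (2(A + B)))
z = w0 * (B / A + 1.0) ** -0.5 * np.exp(-0.5 * (x - b) ** 2 / (A + B))
bmc_mean = z @ np.linalg.solve(K, fx)

# eq. (10): posterior variance of the estimate of the integral
bmc_var = w0 * (2.0 * B / A + 1.0) ** -0.5 - z @ np.linalg.solve(K, z)

smc_mean = fx.mean()   # eq. (2) on the same samples, for comparison
```

With only 16 samples the BMC estimate is typically much closer to the true value than the SMC average of the same function evaluations, because the GP mean interpolates between the sample points.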
This is achieved because the prior on f allows the method to interpolate between sample points. Moreover, whereas the SMC estimate is invariant to permutations of the values on the x axis, BMC makes use of the smoothness of the function. Therefore, a point in a sparse region is far more informative about the shape of the function for BMC than points in already densely sampled areas. In SMC, if two samples happen to fall close to each other, the function value there will be counted with double weight. This effect means that large numbers of samples are needed to adequately represent p(x). BMC circumvents this problem by analytically integrating its mean function w.r.t. p(x).

In figure 1, right, the negative log density of the true value of the integral under the predictive distributions is compared for BMC and SMC.
For not too small sample sizes, BMC outperforms SMC. Notice, however, that for very small sample sizes BMC occasionally has very bad performance. This is due to examples where the random draws of x lead to function values f(x^{(t)}) that are consistent with a much longer length scale than the true function; the mean prediction becomes somewhat inaccurate, but worse still, the inferred variance becomes very small (because a very slowly varying function is inferred), leading to very poor performance compared to SMC. This problem is to a large extent caused by the optimization of the length scale hyperparameters of the covariance function; we ought instead to have integrated over all possible length scales. This integration would effectively "blend in" distributions with much larger variance (since the data is also consistent with a shorter length scale), thus alleviating the problem, but unfortunately this is not possible in closed form. The problem disappears for sample sizes of around 16 or greater.

In the previous example, we chose p(x) to be Gaussian. If you wish to use BMC to integrate w.r.t. non-Gaussian densities, then an importance re-weighting trick becomes necessary:

\bar{f} = \int \frac{f(x)\, p(x)}{q(x)}\, q(x)\, dx,    (11)

where the Gaussian process models f(x) p(x) / q(x), q(x) is a Gaussian, and p(x) is an arbitrary density which can be evaluated.
See Kennedy [1998] for extensions to non-Gaussian q(x).

2.2 Optimal Importance Sampler

For the simple example discussed above, it is also interesting to ask whether the efficiency of SMC could be improved by generating independent samples from more cleverly designed distributions. As we have seen in equation (3), importance sampling gives an unbiased estimate of \bar{f} by sampling the x^{(t)} from a distribution q(x), non-zero wherever f(x) p(x) is non-zero, and computing:

\hat{\bar{f}} = \frac{1}{T} \sum_{t=1}^{T} \frac{f(x^{(t)})\, p(x^{(t)})}{q(x^{(t)})}.    (12)

The variance of this estimator is given by:

V_q = \frac{1}{T} \int \Big[\frac{f(x)\, p(x)}{q(x)} - \bar{f}\Big]^2 q(x)\, dx = \frac{1}{T} \Big[\int \frac{f(x)^2\, p(x)^2}{q(x)}\, dx - \bar{f}^2\Big].    (13)

Using calculus of variations it is simple to show that the optimal (minimum variance) importance sampling distribution is:

q^{*}(x) = \frac{|f(x)|\, p(x)}{\int |f(x)|\, p(x)\, dx},    (14)

which we can substitute into equation (13) to get the minimum variance, V^{*} = \frac{1}{T} \big[\big(\int |f(x)|\, p(x)\, dx\big)^2 - \bar{f}^2\big]. If f(x) is always non-negative or non-positive then V^{*} = 0, which is unsurprising given that we needed to know \bar{f} in advance to normalise q^{*}. For functions that take on both positive and negative values, V^{*} is a constant times the variance of a Bernoulli random variable (the sign of f(x)). The lower bound from this optimal importance sampler as a function of the number of samples is shown in figure 1, middle. As we can see, Bayesian Monte Carlo improves on the optimal importance sampler considerably. We stress that the optimal importance sampler is not practically achievable since it requires knowledge of the quantity we are trying to estimate.

Figure 1: Left: a simple one-dimensional function f (full) and Gaussian density p (dashed) with respect to which we wish to integrate f. Middle: average squared error for simple Monte Carlo sampling from p (dashed), the optimal achievable bound for importance sampling (dot-dashed), and the Bayesian Monte Carlo estimates; the values plotted are averages over up to 2048 repetitions. Right: minus the log of the Gaussian predictive density with mean eq. (6) and variance eq. (7), evaluated at the true value of the integral (found by numerical integration), 'x'; similarly for the Simple Monte Carlo procedure, where the mean and variance of the predictive distribution are computed from the samples, 'o'.

3 Computing Marginal Likelihoods

We now consider the problem of estimating the marginal likelihood of a statistical model. This problem is notoriously difficult and very important, since it allows for comparison of different models. In the physics literature it is known as free-energy estimation. Here we compare the Bayesian Monte Carlo method to two other techniques: Simple Monte Carlo sampling (SMC) and Annealed Importance Sampling (AIS).

Simple Monte Carlo, sampling from the prior, is generally considered inadequate for this problem, because the likelihood is typically sharply peaked and samples from the prior are unlikely to fall in these confined areas, leading to huge variance in the estimates (although they are unbiased).
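The zero-variance property of eq. (14) for single-signed integrands can be checked directly. Assuming (our choice, not the paper's example) f(x) = e^x and p(x) = N(0, 1), the product f(x) p(x) is proportional to a N(1, 1) density, so q* is N(1, 1) and every importance-weighted term collapses to the constant \bar{f} = e^{1/2}:

```python
import numpy as np

rng = np.random.default_rng(2)

def norm_pdf(x, mu=0.0):
    # standard-deviation-1 Gaussian density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

f = np.exp                           # a strictly positive integrand (hypothetical choice)
x = rng.normal(1.0, 1.0, size=1000)  # draws from the optimal q*(x) = N(1, 1), eq. (14)

# the terms f(x) p(x) / q*(x) of eq. (12); algebraically each equals e^{1/2}
terms = f(x) * norm_pdf(x) / norm_pdf(x, mu=1.0)

f_bar = np.exp(0.5)                  # the true integral E[e^X], X ~ N(0, 1)
# every term equals f_bar exactly, so this estimator has zero variance
```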
A family of promising "thermodynamic integration" techniques for computing marginal likelihoods is discussed under the names of Bridge and Path sampling in [Gelman and Meng, 1998] and Annealed Importance Sampling (AIS) in [Neal, 2001]. The central idea is to divide one difficult integral into a series of easier ones, parameterised by (inverse) temperature, \tau. In detail:

Z = \frac{Z_{\tau(n)}}{Z_{\tau(0)}} = \prod_{j=1}^{n} \frac{Z_{\tau(j)}}{Z_{\tau(j-1)}}, \qquad Z_{\tau} = \int p(y \mid x)^{\tau}\, p(x)\, dx,    (15)

where \tau(0) = 0 and \tau(n) = 1, so that Z_{\tau(0)} = 1 and Z_{\tau(n)} is the desired marginal likelihood. To compute each fraction we sample from equilibrium from the distribution proportional to p(y \mid x)^{\tau(j-1)}\, p(x), where \tau(j) is the j'th inverse temperature of the annealing schedule, and compute importance weights:

w^{(t)} = \frac{p(y \mid x^{(t)})^{\tau(j)}\, p(x^{(t)})}{p(y \mid x^{(t)})^{\tau(j-1)}\, p(x^{(t)})} = p(y \mid x^{(t)})^{\tau(j) - \tau(j-1)}.    (16)

In practice the number of samples drawn at each temperature can be set to 1, to allow very slow reduction in temperature. Each of the intermediate ratios is much easier to compute than the original ratio, since the likelihood function raised to the power of a small number is much better behaved than the likelihood itself. Often elaborate non-linear cooling schedules are used, but for simplicity we will just take a linear schedule for the inverse temperature.
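The recursion of eqs. (15)-(16), with a linear inverse-temperature schedule and a single Metropolis update per temperature, can be sketched on a toy model of our choosing (not the paper's GP regression model): prior p(x) = N(0, 1) and likelihood p(y|x) = N(y; x, 0.5^2) with observed y = 1, so the true marginal likelihood is N(1; 0, 1.25) in closed form:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_prior(x):
    # p(x) = N(0, 1), assumed toy prior
    return -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)

def log_lik(x):
    # p(y|x) = N(y; x, 0.5^2) with y = 1, assumed toy likelihood
    return -0.5 * ((1.0 - x) / 0.5) ** 2 - np.log(0.5) - 0.5 * np.log(2 * np.pi)

n_temps, n_runs = 100, 400
taus = np.linspace(0.0, 1.0, n_temps + 1)   # linear inverse-temperature schedule

log_w = np.zeros(n_runs)
for r in range(n_runs):
    x = rng.normal()                         # exact equilibrium sample at tau = 0 (the prior)
    for j in range(1, n_temps + 1):
        # eq. (16): accumulate log p(y|x)^(tau_j - tau_{j-1})
        log_w[r] += (taus[j] - taus[j - 1]) * log_lik(x)
        # single Metropolis proposal targeting p(y|x)^tau_j p(x)
        xp = x + 0.5 * rng.normal()
        log_a = (taus[j] * log_lik(xp) + log_prior(xp)
                 - taus[j] * log_lik(x) - log_prior(x))
        if np.log(rng.random()) < log_a:
            x = xp

Z_hat = np.exp(log_w).mean()                         # AIS estimate of eq. (15)
Z_true = np.exp(-0.5 / 1.25) / np.sqrt(2 * np.pi * 1.25)  # N(1; 0, 1.25)
```

The estimate is unbiased because each Metropolis transition leaves its intermediate distribution invariant, so the product of the per-temperature weights is a valid importance weight for the whole chain.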
The samples at each temperature are drawn using a single Metropolis proposal, where the proposal width is chosen to get a fairly high fraction of acceptances.

The model for which we attempt to compute the marginal likelihood was itself a Gaussian process regression fit to an artificial dataset suggested by [Friedman, 1988].^2 We had 5 length scale hyperparameters, a signal variance (w_0) and an explicit noise variance parameter. Thus the marginal likelihood is an integral over a 7 dimensional hyperparameter space. The logs of the hyperparameters are given Gaussian priors.

Figure 2 shows a comparison of the three methods. Perhaps surprisingly, AIS and SMC are seen to be very comparable, which can be due to several reasons: 1) whereas the SMC samples are drawn independently, the AIS samples have considerable auto-correlation because of the Metropolis generation mechanism, which hampers performance for low sample sizes; 2) the annealing schedule was not optimized, nor was the proposal width adjusted with temperature, which might possibly have sped up convergence. Further, the difference between AIS and SMC would be more dramatic in higher dimensions and for more highly peaked likelihood functions (i.e. more data).

The Bayesian Monte Carlo method was run on the same samples as were generated by the AIS procedure.
Note that BMC can use samples from any distribution, as long as p(x) can be evaluated. Another obvious choice for generating samples for BMC would be to use an MCMC method to draw samples from the posterior. Because BMC needs to model the integrand using a GP, we need to limit the number of samples, since the computation (for fitting hyperparameters and computing the z's) scales as O(n^3). Thus for larger sample sizes we limit the number of samples used by BMC, chosen equally spaced from the AIS Markov chain. Despite this thinning of the samples we see a generally superior performance of BMC, especially for smaller sample sizes. In fact, BMC seems to perform equally well for almost any of the investigated sample sizes. Even for this fairly large number of samples, the generation of points from the AIS still dominates compute time.

4 Discussion

An important aspect which we have not explored in this paper is the idea that the GP model used to fit the integrand gives errorbars (uncertainties) on the integrand. These error bars

^2 The data was 100 samples generated from the 5-dimensional function f(x_1, \ldots, x_5) = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 1/2)^2 + 10 x_4 + 5 x_5 + \epsilon, where \epsilon is zero mean unit variance Gaussian noise and the inputs are sampled independently from a uniform [0, 1] distribution.
Figure 2: Estimates of the marginal likelihood for different sample sizes using Simple Monte Carlo sampling (SMC; circles, dotted line), Annealed Importance Sampling (AIS; dashed line), and Bayesian Monte Carlo (BMC; triangles, solid line). The true value (solid straight line) is estimated from a single long run of AIS. For comparison, the maximum log likelihood, which is an upper bound on the true value, is also shown.

could be used to conduct an experimental design, i.e. active learning. A simple approach would be to evaluate the function at points x where the GP has large uncertainty \mathrm{Cov}_{\mathcal{D}}(f(x), f(x)) and p(x) is not too small, since these points make the largest expected contribution to the uncertainty in the estimate of the integral. For a fixed Gaussian Process covariance function these design points can often be pre-computed, see e.g. [Minka, 2000]. However, as we are adapting the covariance function depending on the observed function values, active learning would have to be an integral part of the procedure. Classical Monte Carlo approaches cannot make use of active learning since the samples need to be drawn from a given distribution.

When using BMC to compute marginal likelihoods, the Gaussian covariance function used here (equation 5) is not ideally suited to modeling the likelihood. Firstly, likelihoods are non-negative whereas the prior is not restricted in the values the function can take. Secondly, the likelihood tends to have some regions of high magnitude and variability and other regions which are low and flat; this is not well-modelled by a stationary covariance function. In practice this misfit between the GP prior and the function modelled has even occasionally led to negative values for the estimate of the marginal likelihood!
There could be several approaches to improving the appropriateness of the prior. An importance distribution, such as one computed from a Laplace approximation or a mixture of Gaussians, can be used to dampen the variability in the integrand [Kennedy, 1998]. The GP could be used to model the log of the likelihood [Rasmussen, 2003]; however, this makes integration more difficult.

The BMC method outlined in this paper can be extended in several ways. Although the choice of Gaussian process priors is computationally convenient in certain circumstances, in general other function approximation priors can be used to model the integrand. For discrete (or mixed) variables the GP model could still be used with an appropriate choice of covariance function; however, the resulting sum (analogous to equation 1) may be difficult to evaluate. For discrete f, GPs are not directly applicable.

Although BMC has proven successful on the problems presented here, there are several limitations to the approach. High dimensional integrands can prove difficult to model, and in such cases a large number of samples may be required to obtain good estimates of the function. Inference using a Gaussian Process prior is at present limited computationally to a few thousand samples. Further, models such as neural networks and mixture models exhibit an exponentially large number of symmetrical modes in the posterior; again, modelling this with a GP prior would typically be difficult. Finally, the BMC method requires that the distribution p(x) can be evaluated. This contrasts with classical MC, where many methods only require that samples can be drawn from some distribution q(x) for which the normalising constant is not necessarily known (such as in equation 16).
Unfortunately, this limitation makes it difficult, for example, to design a Bayesian analogue to Annealed Importance Sampling.

We believe that the problem of computing an integral using a limited number of function evaluations should be treated as an inference problem, and that all prior knowledge about the function being integrated should be incorporated into the inference. Despite the limitations outlined above, Bayesian Monte Carlo makes it possible to do this inference and can achieve performance equivalent to state-of-the-art classical methods using a fraction of the sample evaluations, even sometimes exceeding the theoretically optimal performance of some classical methods.

Acknowledgments

We would like to thank Radford Neal for inspiring discussions.

References

Friedman, J. (1988). Multivariate Adaptive Regression Splines. Technical Report No. 102, November 1988, Laboratory for Computational Statistics, Department of Statistics, Stanford University.

Kennedy, M. (1998). Bayesian quadrature with non-normal approximating functions, Statistics and Computing, 8, pp. 365-375.

MacKay, D. J. C. (1999). Introduction to Monte Carlo methods. In Learning in Graphical Models, M. I. Jordan (ed), MIT Press, 1999.

Gelman, A. and Meng, X.-L. (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling, Statistical Science, vol. 13, pp. 163-185.

Minka, T. P. (2000). Deriving quadrature rules from Gaussian processes, Technical Report, Statistics Department, Carnegie Mellon University.

Neal, R. M. (2001). Annealed Importance Sampling, Statistics and Computing, 11, pp. 125-139.

O'Hagan, A. (1987). Monte Carlo is fundamentally unsound, The Statistician, 36, pp. 247-249.

O'Hagan, A. (1991). Bayes-Hermite Quadrature, Journal of Statistical Planning and Inference, 29, pp. 245-260.

O'Hagan, A. (1992). Some Bayesian Numerical Analysis.
Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds), Oxford University Press, pp. 345-365 (with discussion).

Rasmussen, C. E. (2003). Gaussian Processes to Speed up Hybrid Monte Carlo for Expensive Bayesian Integrals, Bayesian Statistics 7 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West, eds), Oxford University Press.

Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian Processes for Regression, in D. S. Touretzky, M. C. Mozer and M. E. Hasselmo (editors), NIPS 8, MIT Press.
", "award": [], "sourceid": 2150, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Carl", "family_name": "Rasmussen", "institution": null}]}