{"title": "Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature", "book": "Advances in Neural Information Processing Systems", "page_first": 2789, "page_last": 2797, "abstract": "We propose a novel sampling framework for inference in probabilistic models: an active learning approach that converges more quickly (in wall-clock time) than Markov chain Monte Carlo (MCMC) benchmarks. The central challenge in probabilistic inference is numerical integration, to average over ensembles of models or unknown (hyper-)parameters (for example to compute marginal likelihood or a partition function). MCMC has provided approaches to numerical integration that deliver state-of-the-art inference, but can suffer from sample inefficiency and poor convergence diagnostics. Bayesian quadrature techniques offer a model-based solution to such problems, but their uptake has been hindered by prohibitive computation costs. We introduce a warped model for probabilistic integrands (likelihoods) that are known to be non-negative, permitting a cheap active learning scheme to optimally select sample locations. Our algorithm is demonstrated to offer faster convergence (in seconds) relative to simple Monte Carlo and annealed importance sampling on both synthetic and real-world examples.", "full_text": "Sampling for Inference in Probabilistic Models with\n\nFast Bayesian Quadrature\n\nTom Gunter, Michael A. Osborne\n\nEngineering Science\nUniversity of Oxford\n\n{tgunter,mosb}@robots.ox.ac.uk\n\nRoman Garnett\n\nKnowledge Discovery and Machine Learning\n\nUniversity of Bonn\n\nrgarnett@uni-bonn.de\n\nPhilipp Hennig\n\nMPI for Intelligent Systems\n\nT\u00a8ubingen, Germany\n\nphennig@tuebingen.mpg.de\n\nStephen J. 
Roberts
Engineering Science
University of Oxford

sjrob@robots.ox.ac.uk

Abstract

We propose a novel sampling framework for inference in probabilistic models: an active learning approach that converges more quickly (in wall-clock time) than Markov chain Monte Carlo (MCMC) benchmarks. The central challenge in probabilistic inference is numerical integration, to average over ensembles of models or unknown (hyper-)parameters (for example to compute the marginal likelihood or a partition function). MCMC has provided approaches to numerical integration that deliver state-of-the-art inference, but can suffer from sample inefficiency and poor convergence diagnostics. Bayesian quadrature techniques offer a model-based solution to such problems, but their uptake has been hindered by prohibitive computation costs. We introduce a warped model for probabilistic integrands (likelihoods) that are known to be non-negative, permitting a cheap active learning scheme to optimally select sample locations. Our algorithm is demonstrated to offer faster convergence (in seconds) relative to simple Monte Carlo and annealed importance sampling on both synthetic and real-world examples.

1 Introduction

Bayesian approaches to machine learning problems inevitably call for the frequent approximation of computationally intractable integrals of the form

Z = ⟨ℓ⟩ = ∫ ℓ(x) π(x) dx,   (1)

where both the likelihood ℓ(x) and prior π(x) are non-negative. Such integrals arise when marginalising over model parameters or variables, calculating predictive test likelihoods and computing model evidences. In all cases the function to be integrated—the integrand—is naturally constrained to be non-negative, as the functions being considered define probabilities.
In what follows we will primarily consider the computation of model evidence, Z.
In this case ℓ(x) defines the unnormalised likelihood over a D-dimensional parameter set, x1, ..., xD, and π(x) defines a prior density over x. Many techniques exist for estimating Z, such as annealed importance sampling (AIS) [1], nested sampling [2], and bridge sampling [3]. These approaches are based around a core Monte Carlo estimator for the integral, and make minimal effort to exploit prior information about the likelihood surface. Monte Carlo convergence diagnostics are also unreliable for partition function estimates [4, 5, 6]. More advanced methods—e.g., AIS—also require parameter tuning, and will yield poor estimates with misspecified parameters.
The Bayesian quadrature (BQ) [7, 8, 9, 10] approach to estimating model evidence is inherently model based. That is, it involves specifying a prior distribution over likelihood functions in the form of a Gaussian process (GP) [11]. This prior may be used to encode beliefs about the likelihood surface, such as smoothness or periodicity. Given a set of samples from ℓ(x), posteriors over both the integrand and the integral may in some cases be computed analytically (see below for discussion on other generalisations). Active sampling [12] can then be used to select function evaluations so as to maximise the reduction in entropy of either the integrand or integral. Such an approach has been demonstrated to improve sample efficiency, relative to naïve randomised sampling [12].
In a big-data setting, where likelihood function evaluations are prohibitively expensive, BQ is demonstrably better than Monte Carlo approaches [10, 12]. As the cost of the likelihood decreases, however, BQ no longer achieves a higher effective sample rate per second, because the computational cost of maintaining the GP model and active sampling becomes relevant, and many Monte Carlo samples may be generated for each new BQ sample.
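For reference, the core Monte Carlo estimator underlying these approaches is a prior-weighted average of likelihood evaluations. A minimal numpy sketch of this baseline, where the Gaussian prior and Gaussian-shaped likelihood are illustrative choices (not from the paper), chosen so that Z has a closed form to check against:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Gaussian prior pi(x) = N(x; 0, 1) and Gaussian-shaped
# likelihood l(x); their product integrates analytically.
def likelihood(x):
    return np.exp(-0.5 * x**2)

# Simple Monte Carlo: Z = int l(x) pi(x) dx ~= (1/N) sum_i l(x_i), x_i ~ pi.
x = rng.normal(0.0, 1.0, size=200_000)
z_mc = likelihood(x).mean()

# Analytic value: int exp(-x^2/2) N(x; 0, 1) dx = 1/sqrt(2).
z_true = 1.0 / np.sqrt(2.0)
```

Estimators such as AIS refine this baseline with annealed intermediate distributions; BQ instead replaces the estimator with an explicit model of ℓ.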
Our goal was to develop a cheap and\naccurate BQ model alongside an ef\ufb01cient active sampling scheme, such that even for low cost likeli-\nhoods BQ would be the scheme of choice. Our contributions extend existing work in two ways:\nSquare-root GP: Foundational work [7, 8, 9, 10] on BQ employed a GP prior directly on the likeli-\nhood function, making no attempt to enforce non-negativity a priori. [12] introduced an approximate\nmeans of modelling the logarithm of the integrand with a GP. This involved making a \ufb01rst-order ap-\nproximation to the exponential function, so as to maintain tractability of inference in the integrand\nmodel. In this work, we choose another classical transformation to preserve non-negativity\u2014the\nsquare-root. By placing a GP prior on the square-root of the integrand, we arrive at a model which\nboth goes some way towards dealing with the high dynamic range of most likelihoods, and enforces\nnon-negativity without the approximations resorted to in [12].\nFast Active Sampling: Whereas most approaches to BQ use either a randomised or \ufb01xed sampling\nscheme, [12] targeted the reduction in the expected variance of Z. Here, we sample where the\nexpected posterior variance of the integrand after the quadratic transform is at a maximum. This is\na cheap way of balancing exploitation of known probability mass and exploration of the space in\norder to approximately minimise the entropy of the integral.\nWe compare our approach, termed warped sequential active Bayesian integration (WSABI), to non-\nnegative integration with standard Monte Carlo techniques on simulated and real examples. 
Crucially, we make comparisons of error against ground truth given a fixed compute budget.

2 Bayesian Quadrature

Given a non-analytic integral ⟨ℓ⟩ := ∫ ℓ(x)π(x) dx on a domain X = R^D, Bayesian quadrature is a model-based approach to inferring both the functional form of the integrand and the value of the integral conditioned on a set of sample points. Typically the prior density is assumed to be a Gaussian, π(x) := N(x; ν, Λ); however, via the use of an importance re-weighting trick, q(x) = (q(x)/π(x)) π(x), any prior density q(x) may be integrated against. For clarity we will henceforth notationally consider only the X = R case, although all results trivially extend to X = R^D.
Typically a GP prior is chosen for ℓ(x), although it may also be directly specified on ℓ(x)π(x). This is parameterised by a mean µ(x) and scaled Gaussian covariance K(x, x′) := λ² exp(−(x − x′)²/(2σ²)). The output length-scale λ and input length-scale σ control the standard deviation of the output and the autocorrelation range of each function evaluation respectively, and will be jointly denoted as θ = {λ, σ}. Conditioned on samples xd = {x1, ..., xN} and associated function values ℓ(xd), the posterior mean is mD(x) := µ(x) + K(x, xd)K(xd, xd)⁻¹(ℓ(xd) − µ(xd)), and the posterior covariance is CD(x, x′) := K(x, x) − K(x, xd)K(xd, xd)⁻¹K(xd, x), where D := {xd, ℓ(xd), θ}.
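The posterior moments mD and CD above amount to a few lines of linear algebra; a minimal numpy sketch with a zero prior mean and the squared-exponential covariance K (the sample locations, hyperparameter values, and jitter term are illustrative choices):

```python
import numpy as np

def k(a, b, lam=1.0, sigma=0.5):
    # Scaled Gaussian covariance: lam^2 exp(-(a - b)^2 / (2 sigma^2)).
    return lam**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / sigma**2)

def gp_posterior(x_star, x_d, l_d, jitter=1e-10):
    # Zero-mean GP posterior: m_D(x) = K(x, x_d) K(x_d, x_d)^-1 l(x_d),
    # C_D(x, x') = K(x, x') - K(x, x_d) K(x_d, x_d)^-1 K(x_d, x').
    K_dd = k(x_d, x_d) + jitter * np.eye(len(x_d))
    K_sd = k(x_star, x_d)
    m = K_sd @ np.linalg.solve(K_dd, l_d)
    C = k(x_star, x_star) - K_sd @ np.linalg.solve(K_dd, K_sd.T)
    return m, C

x_d = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # sample locations
l_d = np.exp(-0.5 * x_d**2)                   # observed function values
m, C = gp_posterior(np.array([0.0, 3.0]), x_d, l_d)
```

The posterior mean interpolates the observations, and the posterior variance collapses at sampled locations while reverting towards the prior far away from them.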
For an extensive review of the GP literature and associated identities, see [11].
When a GP prior is placed directly on the integrand in this manner, the posterior mean and variance of the integral can be derived analytically through the use of Gaussian identities, as in [10]. This is because the integration is a linear projection of the function posterior onto π(x), and joint Gaussianity is preserved through any arbitrary affine transformation. The mean and variance estimate of the integral are given as follows: E_{ℓ|D}[⟨ℓ⟩] = ∫ mD(x) π(x) dx (2), and V_{ℓ|D}[⟨ℓ⟩] = ∫∫ CD(x, x′) π(x) dx π(x′) dx′ (3). Both mean and variance are analytic when π(x) is Gaussian, a mixture of Gaussians, or a polynomial (amongst other functional forms).
If the GP prior is placed directly on the likelihood in the style of traditional Bayes–Hermite quadrature, the optimal point to add a sample (from an information gain perspective) is dependent only on xd—the locations of the previously sampled points. This means that given a budget of N samples, the most informative set of function evaluations is a design that can be pre-computed, completely uninfluenced by any information gleaned from function values [13]. In [12], where the log-likelihood is modelled by a GP, a dependency is introduced between the uncertainty over the function at any point and the function value at that point. This means that the optimal sample placement is now directly influenced by the obtained function values.

(a) Traditional Bayes–Hermite quadrature. (b) Square-root moment-matched Bayesian quadrature.

Figure 1: Figure 1a depicts the integrand as modelled directly by a GP, conditioned on 15 samples selected on a grid over the domain.
Figure 1b shows the moment matched approximation—note the larger relative posterior variance in areas where the function is high. The linearised square-root GP performed identically on this example, and is not shown.

An illustration of Bayes–Hermite quadrature is given in Figure 1a. Conditioned on a grid of 15 samples, it is visible that any sample located equidistant from two others is equally informative in reducing our uncertainty about ℓ(x). As the dimensionality of the space increases, exploration can be increasingly difficult due to the curse of dimensionality. A better designed BQ strategy would create a dependency structure between function value and informativeness of sample, in such a way as to appropriately express prior bias towards exploitation of existing probability mass.

3 Square-Root Bayesian Quadrature

Crucially, likelihoods are non-negative, a fact neglected by traditional Bayes–Hermite quadrature. In [12] the logarithm of the likelihood was modelled, and the posterior of the integral approximated via a linearisation trick. We choose a different member of the power transform family—the square-root. The square-root transform halves the dynamic range of the function we model. This helps deal with the large variations in likelihood observed in a typical model, and has the added benefit of extending the autocorrelation range (or the input length-scale) of the GP, yielding improved predictive power when extrapolating away from existing sample points.
Let ˜ℓ(x) := √(2(ℓ(x) − α)), such that ℓ(x) = α + 1/2 ˜ℓ(x)², where α is a small positive scalar.¹ We then take a GP prior on ˜ℓ(x): ˜ℓ ∼ GP(0, K).
We can then write the posterior for ˜ℓ as

p(˜ℓ | D) = GP(˜ℓ; ˜mD(·), ˜CD(·,·));   (4)
˜mD(x) := K(x, xd)K(xd, xd)⁻¹ ˜ℓ(xd);   (5)
˜CD(x, x′) := K(x, x′) − K(x, xd)K(xd, xd)⁻¹K(xd, x′).   (6)

The square-root transformation renders analysis intractable with this GP: we arrive at a process whose marginal distribution for any ℓ(x) is a non-central χ² (with one degree of freedom). Given this process, the posterior for our integral is not closed-form. We now describe two alternative approximation schemes to resolve this problem.

¹α was taken as 0.8 × min ℓ(xd) in all experiments; our investigations found that performance was insensitive to the choice of this parameter.

3.1 Linearisation

We firstly consider a local linearisation of the transform f : ˜ℓ ↦ ℓ = α + 1/2 ˜ℓ². As GPs are closed under linear transformations, this linearisation will ensure that we arrive at a GP for ℓ given our existing GP on ˜ℓ. Generically, if we linearise around ˜ℓ₀, we have ℓ ≃ f(˜ℓ₀) + f′(˜ℓ₀)(˜ℓ − ˜ℓ₀). Note that f′(˜ℓ) = ˜ℓ: this simple gradient is a further motivation for our transform, as described further in Section 3.3. We choose ˜ℓ₀ = ˜mD; this represents the mode of p(˜ℓ | D).
Hence we arrive at

ℓ(x) ≃ (α + 1/2 ˜mD(x)²) + ˜mD(x)(˜ℓ(x) − ˜mD(x)) = α − 1/2 ˜mD(x)² + ˜mD(x) ˜ℓ(x).   (7)

Under this approximation, in which ℓ is a simple affine transformation of ˜ℓ, we have

p(ℓ | D) ≃ GP(ℓ; m^L_D(·), C^L_D(·,·));   (8)
m^L_D(x) := α + 1/2 ˜mD(x)²;   (9)
C^L_D(x, x′) := ˜mD(x) ˜CD(x, x′) ˜mD(x′).   (10)

3.2 Moment Matching

Alternatively, we consider a moment-matching approximation: p(ℓ | D) is approximated as a GP with mean and covariance equal to those of the true χ² (process) posterior. This gives p(ℓ | D) := GP(ℓ; m^M_D(·), C^M_D(·,·)), where

m^M_D(x) := α + 1/2 (˜mD(x)² + ˜CD(x, x));   (11)
C^M_D(x, x′) := 1/2 ˜CD(x, x′)² + ˜mD(x) ˜CD(x, x′) ˜mD(x′).   (12)

We will call these two approximations WSABI-L (for “linear”) and WSABI-M (for “moment matched”), respectively. Figure 2 shows a comparison of the approximations on synthetic data. The likelihood function, ℓ(x), was defined to be ℓ(x) = exp(−x²), and is plotted in red. We placed a GP prior on ˜ℓ, and conditioned this on seven observations spanning the interval [−2, 2]. We then drew 50 000 samples from the true χ² posterior on ˜ℓ along a dense grid on the interval [−5, 5] and used these to estimate the true density of ℓ(x), shown in blue shading. Finally, we plot the means and 95% confidence intervals for the approximate posterior.
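Both approximations are direct pointwise transformations of the moments of the GP on ˜ℓ; a sketch of equations (9)–(12), with arbitrary illustrative values standing in for ˜mD and ˜CD, plus a check that the warping ℓ = α + 1/2 ˜ℓ² inverts the square-root transform:

```python
import numpy as np

def wsabi_moments(m_tilde, C_tilde, alpha):
    # Linearised (WSABI-L), eqs. (9)-(10), and moment-matched (WSABI-M),
    # eqs. (11)-(12), posterior moments for l = alpha + 0.5 * l_tilde^2.
    m_L = alpha + 0.5 * m_tilde**2
    C_L = m_tilde[:, None] * C_tilde * m_tilde[None, :]
    m_M = alpha + 0.5 * (m_tilde**2 + np.diag(C_tilde))
    C_M = 0.5 * C_tilde**2 + C_L
    return m_L, C_L, m_M, C_M

m_tilde = np.array([0.0, 1.0, 2.0])   # illustrative posterior mean of the GP on l_tilde
C_tilde = np.diag([0.5, 0.2, 0.1])    # illustrative posterior covariance of that GP
m_L, C_L, m_M, C_M = wsabi_moments(m_tilde, C_tilde, alpha=0.01)

# Sanity check of the warping: l = alpha + 0.5 * l_tilde^2 inverts
# l_tilde = sqrt(2 (l - alpha)).
l = np.array([0.5, 1.0, 2.0])
l_tilde = np.sqrt(2.0 * (l - 0.01))
```

WSABI-M inflates the mean by half the warped variance, which is the non-zero expected mean visible in Figure 3.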
Notice that the moment matching results in a higher mean and variance far from observations, but otherwise the approximations largely agree with each other and the true density.

Figure 2: The χ² process, alongside moment matched (WSABI-M) and linearised approximations (WSABI-L). Notice that the WSABI-L mean is nearly identical to the ground truth.

3.3 Quadrature

˜mD and ˜CD are both mixtures of un-normalised Gaussians K. As such, the expressions for posterior mean and covariance under either the linearisation (m^L_D and C^L_D, respectively) or the moment-matching approximations (m^M_D and C^M_D, respectively) are also mixtures of un-normalised Gaussians. Substituting these expressions (under either approximation) into (2) and (3) yields closed-form expressions (omitted due to their length) for the mean and variance of the integral ⟨ℓ⟩. This result motivated our initial choice of transform: for linearisation, for example, it was only the fact that the gradient f′(˜ℓ) = ˜ℓ that rendered the covariance in (10) a mixture of un-normalised Gaussians. The discussion that follows is equally applicable to either approximation.
It is clear that the posterior variance of the likelihood model is now a function of both the expected value of the likelihood at that point, and the distance of that sample location from the rest of xd. This is visualised in Figure 1b.
Comparing Figures 1a and 1b we see that conditioned on an identical set of samples, WSABI both achieves a closer fit to the true underlying function, and associates minimal probability mass with negative function values. These are desirable properties when modelling likelihood functions—both arising from the use of the square-root transform.

4 Active Sampling

Given a full Bayesian model of the likelihood surface, it is natural to call on the framework of Bayesian decision theory, selecting the next function evaluation so as to optimally reduce our uncertainty about either the total integrand surface or the integral. Let us define this next sample location to be x∗, and the associated likelihood to be ℓ∗ := ℓ(x∗). Two utility functions immediately present themselves as natural choices, which we consider below. Both options are appropriate for either of the approximations to p(ℓ) described above.

4.1 Minimizing expected entropy

One possibility would be to follow [12] in minimising the expected entropy of the integral, by selecting x∗ = arg min_x ⟨V_{ℓ|D,ℓ(x)}[⟨ℓ⟩]⟩, where

⟨V_{ℓ|D,ℓ(x)}[⟨ℓ⟩]⟩ = ∫ V_{ℓ|D,ℓ(x)}[⟨ℓ⟩] N(ℓ(x); mD(x), CD(x, x)) dℓ(x).   (13)

4.2 Uncertainty sampling

Alternatively, we can target the reduction in entropy of the total integrand ℓ(x)π(x) instead, by targeting x∗ = arg max_x V_{ℓ|D}[ℓ(x)π(x)] (this is known as uncertainty sampling), where

V^M_{ℓ|D}[ℓ(x)π(x)] = π(x)CD(x, x)π(x) = π(x)² ˜CD(x, x)(1/2 ˜CD(x, x) + ˜mD(x)²),   (14)

in the case of our moment matched approximation, and, under the linearisation approximation,

V^L_{ℓ|D}[ℓ(x)π(x)] = π(x)² ˜CD(x, x) ˜mD(x)².   (15)

The uncertainty sampling option reduces the entropy of our GP approximation to p(ℓ) rather than the true (intractable) distribution. The computation of either (14) or (15) is considerably cheaper and more numerically stable than that of (13).
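Both uncertainty-sampling utilities are cheap pointwise expressions; a sketch evaluating (14) and (15) on a grid and taking the argmax, in which the Gaussian prior and the hand-crafted posterior moments stand in for a fitted model, and a grid search stands in for the CMA-ES optimiser used in the paper:

```python
import numpy as np

def uncertainty_sampling(x, m_tilde, v_tilde, nu=0.0, lam=1.0):
    # Gaussian prior density pi(x) = N(x; nu, lam^2).
    pi = np.exp(-0.5 * (x - nu)**2 / lam**2) / np.sqrt(2.0 * np.pi * lam**2)
    V_M = pi**2 * v_tilde * (0.5 * v_tilde + m_tilde**2)   # eq. (14), moment matched
    V_L = pi**2 * v_tilde * m_tilde**2                     # eq. (15), linearised
    return V_M, V_L

x = np.linspace(-3.0, 3.0, 601)
# Illustrative posterior moments of the GP on l_tilde: the variance dips
# near pseudo-sample locations at +/-1, and the mean peaks at the origin.
m_tilde = np.exp(-0.5 * x**2)
v_tilde = 1.0 - 0.9 * np.exp(-2.0 * (np.abs(x) - 1.0)**2)

V_M, V_L = uncertainty_sampling(x, m_tilde, v_tilde)
x_next_M = x[np.argmax(V_M)]
x_next_L = x[np.argmax(V_L)]
```

The moment-matched utility always dominates the linearised one by the π(x)² ˜CD(x, x)²/2 term, encouraging relatively more exploration, consistent with WSABI-M's behaviour reported in Section 5.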
Notice that as our model builds in greater uncertainty in the likelihood where it is high, it will naturally balance sampling in entirely unexplored regions against sampling in regions where the likelihood is expected to be high. Our model (the square-root transform) is more suited to the use of uncertainty sampling than the model taken in [12]. This is because the approximation to the posterior variance is typically poorer for the extreme log-transform than for the milder square-root transform. This means that, although the log-transform would achieve greater reduction in dynamic range than any power transform, it would also introduce the most error in approximating the posterior predictive variance of ℓ(x). Hence, on balance, we consider the square-root transform superior for our sampling scheme.
Figures 3–4 illustrate the result of square-root Bayesian quadrature, conditioned on 15 samples selected sequentially under utility functions (14) and (15) respectively. In both cases the posterior mean has not been scaled by the prior π(x) (but the variance has). This is intended to exaggerate the contributions to the mean made by WSABI-M.
A good posterior estimate of the integral has been achieved, and this set of samples is more informative than a grid under the utility function of minimising the integral error. In all active-learning examples a covariance matrix adaptive evolution strategy (CMA-ES) [14] global optimiser was used to explore the utility function surface before selecting the next sample.

Figure 3: Square-root Bayesian quadrature with active sampling according to utility function (14) and corresponding moment-matched model. Note the non-zero expected mean everywhere.

Figure 4: Square-root Bayesian quadrature with active sampling according to utility function (15) and corresponding linearised model. Note the zero expected mean away from samples.

5 Results

Given this new model and fast active sampling scheme for likelihood surfaces, we now test for speed against standard Monte Carlo techniques on a variety of problems.

5.1 Synthetic Likelihoods

We generated 16 likelihoods in four-dimensional space by selecting K normal distributions with K drawn uniformly at random over the integers 5–14. The means were drawn uniformly at random over the inner quarter of the domain (by area), and the covariances for each were produced by scaling each axis of an isotropic Gaussian by an integer drawn uniformly at random between 21 and 29. The overall likelihood surface was then given as a mixture of these distributions, with weights given by partitioning the unit interval into K segments drawn uniformly at random—‘stick-breaking’. This procedure was chosen in order to generate ‘lumpy’ surfaces. We budgeted 500 samples for our new method per likelihood, allocating the same amount of time to simple Monte Carlo (SMC).
Naturally the computational cost per evaluation of this likelihood is effectively zero, which afforded SMC just under 86 000 samples per likelihood on average. WSABI was on average faster to converge to 10⁻³ error (Figure 5), and it is visible in Figure 6 that the likelihood of the ground truth is larger under this model than with SMC. This concurs with the fact that a tighter bound was achieved.

5.2 Marginal Likelihood of GP Regression

As an initial exploration into the performance of our approach on real data, we fitted a Gaussian process regression model to the yacht hydrodynamics benchmark dataset [15].
This has a six-dimensional input space corresponding to different properties of a boat hull, and a one-dimensional output corresponding to drag coefficient. The dataset has 308 examples, and using a squared exponential ARD covariance function a single evaluation of the likelihood takes approximately 0.003 seconds.
Marginalising over the hyperparameters of this model is an eight-dimensional non-analytic integral. Specifically, the hyperparameters were: an output length-scale, six input length-scales, and an output noise variance. We used a zero-mean isotropic Gaussian prior over the hyperparameters in log space with variance of 4. We obtained ground truth through exhaustive SMC sampling, and budgeted 1 250 samples for WSABI. The same amount of compute-time was then afforded to SMC, AIS (which was implemented with a Metropolis–Hastings sampler), and Bayesian Monte Carlo (BMC). SMC achieved approximately 375 000 samples in the same amount of time. We ran AIS in 10 steps, spaced on a log-scale over the number of iterations, hence the AIS plot is more granular than the others (and does not begin at 0). The ‘hottest’ proposal distribution for AIS was a Gaussian centered on the prior mean, with variance tuned down from a maximum of the prior variance.

Figure 5: Time in seconds vs. average fractional error compared to the ground truth integral, as well as empirical standard error bounds, derived from the variance over the 16 runs. WSABI-M performed slightly better.

Figure 6: Time in seconds versus average likelihood of the ground truth integral over 16 runs.
WSABI-M has a significantly larger variance estimate for the integral as compared to WSABI-L.

Figure 7: Log-marginal likelihood of GP regression on the yacht hydrodynamics dataset.

Figure 7 shows the speed with which WSABI converges to a value very near ground truth compared to the rest. AIS performs rather disappointingly on this problem, despite our best attempts to tune the proposal distribution to achieve higher acceptance rates.
Although the first datapoint (after 10 000 samples) is the second best performer after WSABI, further compute budget did very little to improve the final AIS estimate. BMC is by far the worst performer. This is because it has relatively few samples compared to SMC, and those samples were selected completely at random over the domain. It also uses a GP prior directly on the likelihood, which due to the large dynamic range will have poor predictive performance.

5.3 Marginal Likelihood of GP Classification

We fitted a Gaussian process classification model to both a one-dimensional synthetic dataset and a real-world binary classification problem defined on the nodes of a citation network [16]. The latter had a four-dimensional input space and 500 examples. We use a probit likelihood model, inferring the function values using a Laplace approximation. Once again we marginalised out the hyperparameters.

5.4 Synthetic Binary Classification Problem

We generate 500 binary class samples using a 1D input space.
The GP classi\ufb01cation scheme im-\nplemented in Gaussian Processes for Machine Learning Matlab Toolbox (GPML) [17] is employed\nusing the inference and likelihood framework described above. We marginalised over the three-\ndimensional hyperparameter space of: an output length-scale, an input length-scale and a \u2018jitter\u2019\nparameter. We again tested against BMC, AIS, SMC and, additionally, Doubly-Bayesian Quadrature\n(BBQ) [12]. Ground truth was found through 100 000 SMC samples.\nThis time the acceptance rate for AIS was signi\ufb01cantly higher, and it is visibly converging to the\nground truth in Figure 8, albeit in a more noisy fashion than the rest. WSABI-L performed partic-\nularly well, almost immediately converging to the ground truth, and reaching a tighter bound than\nSMC in the long run. BMC performed well on this particular example, suggesting that the active sam-\npling approach did not buy many gains on this occasion. Despite this, the square-root approaches\nboth converged to a more accurate solution with lower variance than BMC. This suggests that the\nsquare-root transform model generates signi\ufb01cant added value, even without an active sampling\nscheme. The computational cost of selecting samples under BBQ prevents rapid convergence.\n\n5.5 Real Binary Classi\ufb01cation Problem\n\nFor our next experiment, we again used our method to calculate the model evidence of a GP model\nwith a probit likelihood, this time on a real dataset.\nThe dataset, \ufb01rst described in [16], was a graph from a subset of the CiteSeerx citation network.\nPapers in the database were grouped based on their venue of publication, and papers from the 48\nvenues with the most associated publications were retained. The graph was de\ufb01ned by having these\npapers as its nodes and undirected citation relations as its edges. We designated all papers appear-\ning in NIPS proceedings as positive observations. 
To generate Euclidean input vectors, the authors performed “graph principal component analysis” on this network [18]; here, we used the first four graph principal components as inputs to a GP classifier. The dataset was subsampled down to a set of 500 examples, half of which were positive, in order to generate a cheap likelihood.

Figure 8: Log-marginal likelihood for GP classification—synthetic dataset.

Figure 9: Log-marginal likelihood for GP classification—graph dataset.

Across all our results, it is noticeable that WSABI-M typically performs worse relative to WSABI-L as the dimensionality of the problem increases. This is due to an increased propensity for exploration as compared to WSABI-L. WSABI-L is the fastest method to converge on all test cases, apart from the synthetic mixture model surfaces where WSABI-M performed slightly better (although this was not shown in Figure 5). These results suggest that an active-sampling policy which aggressively exploits areas of probability mass before exploring further afield may be the most appropriate approach to Bayesian quadrature for real likelihoods.

6 Conclusions

We introduced the first fast Bayesian quadrature scheme, using a novel warped likelihood model and a novel active sampling scheme. Our method, WSABI, demonstrates faster convergence (in wall-clock time) for regression and classification benchmarks than the Monte Carlo state-of-the-art.

References

[1] R.M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
[2] J. Skilling. Nested sampling.
Bayesian inference and maximum entropy methods in science and engineering, 735:395–405, 2004.
[3] X. Meng and W. H. Wong. Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 6(4):831–860, 1996.
[4] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, University of Toronto, 1993.
[5] S.P. Brooks and G.O. Roberts. Convergence assessment techniques for Markov chain Monte Carlo. Statistics and Computing, 8(4):319–335, 1998.
[6] M.K. Cowles, G.O. Roberts, and J.S. Rosenthal. Possible biases induced by MCMC convergence diagnostics. Journal of Statistical Computation and Simulation, 64(1):87, 1999.
[7] P. Diaconis. Bayesian numerical analysis. In S. Gupta and J. Berger, editors, Statistical Decision Theory and Related Topics IV, volume 1, pages 163–175. Springer-Verlag, New York, 1988.
[8] A. O’Hagan. Bayes–Hermite quadrature. Journal of Statistical Planning and Inference, 29:245–260, 1991.
[9] M. Kennedy. Bayesian quadrature with non-normal approximating functions. Statistics and Computing, 8(4):365–375, 1998.
[10] C. E. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. MIT Press, Cambridge, MA, 2003.
[11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[12] M. A. Osborne, D. K. Duvenaud, R. Garnett, C. E. Rasmussen, S. J. Roberts, and Z. Ghahramani. Active learning of model evidence using Bayesian quadrature. In P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA, 2012.
[13] T. P. Minka. Deriving quadrature rules from Gaussian processes. Technical report, Statistics Department, Carnegie Mellon University, 2000.
[14] N. Hansen, S. D. Müller, and P. Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1):1–18, 2003.
[15] J. Gerritsma, R. Onnink, and A. Versluis. Geometry, resistance and stability of the Delft systematic yacht hull series. International Shipbuilding Progress, 28(328), 1981.
[16] R. Garnett, Y. Krishnamurthy, X. Xiong, J. Schneider, and R. P. Mann. Bayesian optimal active search and surveying. In J. Langford and J. Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML 2012). Omnipress, Madison, WI, USA, 2012.
[17] C. E. Rasmussen and H. Nickisch. Gaussian processes for machine learning (GPML) toolbox. The Journal of Machine Learning Research, 11:3011–3015, 2010.
[18] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 19(3):355–369, 2007.
", "award": [], "sourceid": 1448, "authors": [{"given_name": "Tom", "family_name": "Gunter", "institution": "University of Oxford"}, {"given_name": "Michael", "family_name": "Osborne", "institution": "University of Oxford"}, {"given_name": "Roman", "family_name": "Garnett", "institution": "University of Bonn"}, {"given_name": "Philipp", "family_name": "Hennig", "institution": "MPI T\u00fcbingen"}, {"given_name": "Stephen", "family_name": "Roberts", "institution": "University of Oxford"}]}