{"title": "The Gaussian Process Density Sampler", "book": "Advances in Neural Information Processing Systems", "page_first": 9, "page_last": 16, "abstract": "We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skull-reconstruction task.", "full_text": "The Gaussian Process Density Sampler\n\nRyan Prescott Adams\u2217\nCavendish Laboratory\nUniversity of Cambridge\nCambridge CB3 0HE, UK\nrpa23@cam.ac.uk\n\nIain Murray\nDept. of Computer Science\nUniversity of Toronto\nToronto, Ontario. M5S 3G4\nmurray@cs.toronto.edu\n\nDavid J.C. MacKay\nCavendish Laboratory\nUniversity of Cambridge\nCambridge CB3 0HE, UK\nmackay@mrao.cam.ac.uk\n\nAbstract\n\nWe present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. 
We compare this density modeling technique to several existing techniques on a toy problem and a skull-reconstruction task.\n\n1 Introduction\n\nWe present the Gaussian Process Density Sampler (GPDS), a generative model for probability density functions, based on a Gaussian process. We are able to draw exact and exchangeable data from a fixed density drawn from the prior. Given data, this generative prior allows us to perform inference of the unnormalized density. We perform this inference by expressing the generative process in terms of a latent history, then constructing a Markov chain Monte Carlo algorithm on that latent history. The central idea of the GPDS is to allow nonparametric Bayesian density estimation where the prior is specified via a Gaussian process covariance function that encodes the intuition that \u201csimilar data should have similar probabilities.\u201d\n\nOne way to perform Bayesian nonparametric density estimation is to use a Dirichlet process to define a distribution over the weights of the components in an infinite mixture model, using a simple parametric form for each component. Alternatively, Neal [1] generalizes the Dirichlet process itself, introducing a spatial component to achieve an exchangeable prior on discrete or continuous density functions with hierarchical characteristics. Another way to define a nonparametric density is to transform a simple latent distribution through a nonlinear map, as in the Density Network [2] and the Gaussian Process Latent Variable Model [3]. Here we use the Gaussian process to define a prior on the density function itself.\n\n2 The prior on densities\n\nWe consider densities on an input space X that we will call the data space. In this paper, we assume without loss of generality that X is the d-dimensional real space R^d. 
We first construct a Gaussian process prior with the data space X as its input and the one-dimensional real space R as its output. The Gaussian process defines a distribution over functions from X to R. We define a mean function m(\u00b7) : X \u2192 R and a positive definite covariance function K(\u00b7, \u00b7) : X \u00d7 X \u2192 R. We assume that these functions are together parameterized by a set of hyperparameters \u03b8. Given these two functions and their hyperparameters, for any finite subset of X with cardinality N there is a multivariate Gaussian distribution on R^N [4]. We will take the mean function to be zero.\n\n\u2217http://www.inference.phy.cam.ac.uk/rpa23/\n\nFigure 1: Four samples from the GPDS prior are shown, with 200 data samples. The contour lines show the approximate unnormalized densities. In each case the base measure is the zero-mean spherical Gaussian with unit variance. The covariance function was the squared exponential: $K(x, x') = \\alpha \\exp\\left(-\\frac{1}{2} \\sum_i \\ell_i^{-2} (x_i - x'_i)^2\\right)$, with parameters varied as labeled in each subplot. \u03a6(\u00b7) is the logistic function in these plots. Panels: (a) \u2113_x = 1, \u2113_y = 1, \u03b1 = 1; (b) \u2113_x = 1, \u2113_y = 1, \u03b1 = 10; (c) \u2113_x = 0.2, \u2113_y = 0.2, \u03b1 = 5; (d) \u2113_x = 0.1, \u2113_y = 2, \u03b1 = 5.\n\nProbability density functions must be everywhere nonnegative and must integrate to unity. 
We define a map from a function g(x) : X \u2192 R, x \u2208 X, to a proper density f(x) via\n\n$$f(x) = \\frac{1}{Z_\\pi[g]} \\, \\Phi(g(x)) \\, \\pi(x) \\tag{1}$$\n\nwhere \u03c0(x) is an arbitrary base probability measure on X, and \u03a6(\u00b7) : R \u2192 (0, 1) is a nonnegative function with upper bound 1. We take \u03a6(\u00b7) to be a sigmoid, e.g. the logistic function or cumulative normal distribution function. We use the bold notation g to refer to the function g(x) compactly as a vector of (infinite) length, versus its value at a particular x. The normalization constant is a functional of g(x):\n\n$$Z_\\pi[g] = \\int dx' \\, \\Phi(g(x')) \\, \\pi(x'). \\tag{2}$$\n\nThrough the map defined by Equation 1, a Gaussian process prior becomes a prior distribution over normalized probability density functions on X. Figure 1 shows several sample densities from this prior, along with sample data.\n\n3 Generating exact samples from the prior\n\nWe can use rejection sampling to generate samples from a common density drawn from the prior described in Section 2. A rejection sampler requires a proposal density that provides an upper bound for the unnormalized density of interest. In this case, the proposal density is \u03c0(x) and the unnormalized density of interest is \u03a6(g(x))\u03c0(x).\n\nIf g(x) were known, rejection sampling would proceed as follows: First generate proposals {x\u0303_q} from the base measure \u03c0(x). The proposal x\u0303_q would be accepted if a variate r_q drawn uniformly from (0, 1) was less than \u03a6(g(x\u0303_q)). These samples would be exact in the sense that they were not biased by the starting state of a finite Markov chain. However, in the GPDS, g(x) is not known: it is a random function drawn from a Gaussian process prior. 
We can nevertheless use rejection sampling by \u201cdiscovering\u201d g(x) as we proceed at just the places we need to know it, by sampling from the prior distribution of the latent function. As it is necessary only to know g(x) at the {x\u0303_q} to accept or reject these proposals, the samples are still exact. This retrospective sampling trick has been used in a variety of other MCMC algorithms for infinite-dimensional models [5, 6]. The generative procedure is shown graphically in Figure 2.\n\nIn practice, we generate the samples sequentially, as in Algorithm 1, so that we may be assured of having as many accepted samples as we require. In each loop, a proposal is drawn from the base measure \u03c0(x) and the function g(x) is sampled from the Gaussian process at this proposed coordinate, conditional on all the function values already sampled. We will call these data the conditioning set for the function g(x) and will denote the conditioning inputs X and the conditioning\n\nFigure 2: These figures show the procedure for generating samples from a single density drawn from the GP-based prior. (a): Draw Q samples {x\u0303_q}^Q from the base measure \u03c0(x), which in this case is uniform on [0, 1]. (b): Sample the function g(x) at the randomly chosen locations, generating the set {g\u0303_q = g(x\u0303_q)}^Q. The squashed function \u03a6(g(x)) is shown. (c): Draw a set of variates {r_q}^Q uniformly beneath the bound in the vertical coordinate. 
(d): Accept only the points whose uniform draws are beneath the squashed function value, i.e. r_q < \u03a6(g\u0303_q). (e): The accepted points (x\u0303_q, r_q) are uniformly drawn from the shaded area beneath the curve and the marginal distribution of the accepted x\u0303_q is proportional to \u03a6(g(x))\u03c0(x).\n\nfunction values G. After the function is sampled, a uniform variate is drawn from beneath the bound and compared to the \u03a6-squashed function at the proposal location.\n\nThe sequential procedure is exchangeable, which means that the probability of the data is identical under reordering. First, the base measure draws are i.i.d. Second, conditioned on the proposals from the base measure, the Gaussian process is a simple multivariate Gaussian distribution, which is exchangeable in its components. Finally, conditioned on the draw from the Gaussian process, the acceptance/rejection steps are independent Bernoulli samples, and the overall procedure is exchangeable. This property is important because it ensures that the sequential procedure generates data from the same distribution as the simultaneous procedure described above. 
More broadly, exchangeable priors are useful in Bayesian modeling because we may consider the data conditionally independent, given the latent density.\n\nAlgorithm 1 Generate P exact samples from the prior\nPurpose: Draw P exact samples from a common density on X drawn from the prior in Equation 1\nInputs: GP hyperparameters \u03b8, number of samples to generate P\nInitialize empty conditioning sets for the Gaussian process: X = \u2205 and G = \u2205\nrepeat\n    Draw a proposal from the base measure: x\u0303 \u223c \u03c0(x)\n    Sample the function from the Gaussian process at x\u0303: g\u0303 \u223c GP(g | X, G, x\u0303, \u03b8)\n    Draw a uniform variate on [0, 1]: r \u223c U(0, 1)\n    if r < \u03a6(g\u0303) (Acceptance rule) then\n        Accept x\u0303\n    else\n        Reject x\u0303\n    end if\n    Add x\u0303 and g\u0303 to the conditioning sets: X = X \u222a {x\u0303} and G = G \u222a {g\u0303}\nuntil P samples have been accepted\n\n4 Inference\n\nWe have N data D = {x_n}_{n=1}^N which we model as having been drawn independently from an unknown density f(x). We use the GPDS prior from Section 2 to specify our beliefs about f(x), and we wish to generate samples from the posterior distribution over the latent function g(x) corresponding to the unknown density. We may also wish to generate samples from the predictive distribution or perform hierarchical inference of the prior hyperparameters.\n\nBy using the GPDS prior to model the data, we are asserting that the data can be explained as the result of the procedure described in Section 3. We do not, however, know what rejections were made en route to accepting the observed data. These rejections are critical to defining the latent function g(x). One might think of defining a density as analogous to putting up a tent: pinning the canvas down with pegs is just as important as putting up poles. 
In density modeling, defining regions with little probability mass is just as important as defining the areas with significant mass.\n\nAlthough the rejections are not known, the generative procedure provides a probabilistic model that allows us to traverse the posterior distribution over possible latent histories that resulted in the data. If we define a Markov chain whose equilibrium distribution is the posterior distribution over latent histories, then we may simulate plausible explanations of every step taken to arrive at the data. Such samples capture all the information available about the unknown density, and with them we may ask additional questions about g(x) or run the generative procedure further to draw predictive samples. This approach is related to that described by Murray [7], who performed inference on an exactly-coalesced Markov chain [8], and by Beskos et al. [5].\n\nWe model the data as having been generated exactly as in Algorithm 1, with P = N, i.e. run until exactly N proposals were accepted. The state space of the Markov chain on latent histories in the GPDS consists of: 1) the values of the latent function g(x) at the data, denoted G_N = {g_n}_{n=1}^N, 2) the number of rejections M, 3) the locations of the M rejected proposals, denoted \u2133 = {x_m}_{m=1}^M, and 4) the values of the latent function g(x) at the M rejected proposals, denoted G_M = {g_m = g(x_m)}_{m=1}^M.\n\nWe perform Gibbs-like sampling of the latent history by alternating between modification of the number of rejections M and block updating of the rejection locations \u2133 and latent function values G_M and G_N. We will maintain an explicit ordering of the latent rejections for reasons of clarity, although this is not necessary due to exchangeability. We will also assume that \u03a6(\u00b7) is the logistic function, i.e. \u03a6(z) = (1 + exp{\u2212z})^{\u22121}.\n\n
We will address hyperparameter inference in Section 4.3.\n\n4.1 Modifying the number of latent rejections\n\nWe propose a new number of latent rejections M\u0302 by drawing it from a proposal distribution q(M\u0302 \u2190 M). If M\u0302 is greater than M, we must also propose new rejections to add to the latent state. We take advantage of the exchangeability of the process to generate the new rejections: we imagine these proposals were made after the last observed datum was accepted, and our proposal is to call them rejections and move them before the last datum. If M\u0302 is less than M, we do the opposite by proposing to move some rejections to after the last acceptance.\n\nWhen proposing additional rejections, we must also propose times for them among the current latent history. There are $\\binom{\\hat{M}+N-1}{\\hat{M}-M}$ such ways to insert these additional rejections into the existing latent history, such that the sampler terminates after the Nth acceptance. When removing rejections, we must choose which ones to place after the data, and there are $\\binom{M}{M-\\hat{M}}$ possible sets. Upon simplification, the proposal ratios for both addition and removal of rejections are identical:\n\n$$\\overbrace{\\frac{q(M \\leftarrow \\hat{M}) \\binom{\\hat{M}+N-1}{\\hat{M}-M}}{q(\\hat{M} \\leftarrow M) \\binom{\\hat{M}}{\\hat{M}-M}}}^{\\hat{M} > M} = \\overbrace{\\frac{q(M \\leftarrow \\hat{M}) \\binom{M}{M-\\hat{M}}}{q(\\hat{M} \\leftarrow M) \\binom{M+N-1}{M-\\hat{M}}}}^{\\hat{M} < M} = \\frac{q(M \\leftarrow \\hat{M}) \\, M! \\, (\\hat{M}+N-1)!}{q(\\hat{M} \\leftarrow M) \\, \\hat{M}! \\, (M+N-1)!}.$$\n\nWhen inserting rejections, we propose the locations of the additional proposals, denoted \u2133^+, and the corresponding values of the latent function, denoted G_M^+. We generate \u2133^+ by making M\u0302 \u2212 M independent draws from the base measure. 
We draw G_M^+ jointly from the Gaussian process prior, conditioned on all of the current latent state, i.e. (\u2133, G_M, D, G_N). The joint probability of this state is\n\n$$p(D, \\mathcal{M}, \\mathcal{M}^+, G_N, G_M, G_M^+) = \\left[\\prod_{n=1}^{N} \\pi(x_n) \\Phi(g_n)\\right] \\left[\\prod_{m=1}^{M} \\pi(x_m) (1 - \\Phi(g_m))\\right] \\left[\\prod_{m=M+1}^{\\hat{M}} \\pi(x_m)\\right] \\times \\mathcal{GP}(G_M, G_N, G_M^+ \\mid D, \\mathcal{M}, \\mathcal{M}^+). \\tag{3}$$\n\nThe joint in Equation 3 expresses the probability of all the base measure draws, the values of the function draws from the Gaussian process, and the acceptance or rejection probabilities of the proposals excluding the newly generated points. When we make an insertion proposal, exchangeability allows us to shuffle the ordering without changing the probability; the only change is that now we must account for labeling the new points as rejections. In the acceptance ratio, all terms except for the \u201clabeling probability\u201d cancel. The reverse proposal is similar; however, we denote the removed proposal locations as \u2133^\u2212 and the corresponding function values as G_M^\u2212. The overall acceptance ratios for insertions or removals are\n\n$$a = \\begin{cases} \\dfrac{q(M \\leftarrow \\hat{M}) \\, M! \\, (\\hat{M}+N-1)!}{q(\\hat{M} \\leftarrow M) \\, \\hat{M}! \\, (M+N-1)!} \\prod_{g \\in G_M^+} (1 - \\Phi(g)) & \\text{if } \\hat{M} > M \\\\[1ex] \\dfrac{q(M \\leftarrow \\hat{M}) \\, M! \\, (\\hat{M}+N-1)!}{q(\\hat{M} \\leftarrow M) \\, \\hat{M}! \\, (M+N-1)!} \\prod_{g \\in G_M^-} (1 - \\Phi(g))^{-1} & \\text{if } \\hat{M} < M. \\end{cases} \\tag{4}$$\n\n4.2 Modifying rejection locations and function values\n\nGiven the number of latent rejections M, we propose modifying their locations \u2133, their latent function values G_M, and the values of the latent function at the data G_N. We will denote these proposals as \u2133\u0302 = {x\u0302_m}_{m=1}^M, G\u0302_M = {g\u0302_m = g\u0302(x\u0302_m)}_{m=1}^M and G\u0302_N = {g\u0302_n = g\u0302(x_n)}_{n=1}^N, respectively. 
We make simple perturbative proposals of \u2133 via a proposal density q(\u2133\u0302 \u2190 \u2133). For the latent function values, however, perturbative proposals will be poor, as the Gaussian process typically defines a narrow mass. To avoid this, we propose modifications to the latent function that leave the prior invariant.\n\nWe make joint proposals of \u2133\u0302, G\u0302_M and G\u0302_N in three steps. First, we draw new rejection locations from q(\u2133\u0302 \u2190 \u2133). Second, we draw a set of M intermediate function values from the Gaussian process at \u2133\u0302, conditioned on the current rejection locations and their function values, as well as the function values at the data. Third, we propose new function values at \u2133\u0302 and the data D via an underrelaxation proposal of the form\n\n$$\\hat{g}(x) = \\alpha \\, g(x) + \\sqrt{1 - \\alpha^2} \\, h(x)$$\n\nwhere h(x) is a sample from the Gaussian process prior and \u03b1 is in [0, 1). This is a variant of the overrelaxed MCMC method discussed by Neal [9]. This procedure leaves the Gaussian process prior invariant, but makes conservative proposals if \u03b1 is near one. After making a proposal, we accept or reject via the ratio of the joint distributions:\n\n$$a = \\frac{q(\\mathcal{M} \\leftarrow \\hat{\\mathcal{M}}) \\left[\\prod_{m=1}^{M} \\pi(\\hat{x}_m) (1 - \\Phi(\\hat{g}_m))\\right] \\left[\\prod_{n=1}^{N} \\Phi(\\hat{g}_n)\\right]}{q(\\hat{\\mathcal{M}} \\leftarrow \\mathcal{M}) \\left[\\prod_{m=1}^{M} \\pi(x_m) (1 - \\Phi(g_m))\\right] \\left[\\prod_{n=1}^{N} \\Phi(g_n)\\right]}.$$\n\n4.3 Hyperparameter inference\n\nGiven a sample from the posterior on the latent history, we can also perform a Metropolis\u2013Hastings step in the space of hyperparameters. Parameters \u03b8 governing the covariance function and mean function of the Gaussian process provide common examples of hyperparameters, but we might also introduce parameters \u03c6 that control the behavior of the base measure \u03c0(x). We denote the proposal distributions for these parameters as q(\u03b8\u0302 \u2190 \u03b8) and q(\u03c6\u0302 \u2190 \u03c6), respectively. 
With priors p(\u03b8) and p(\u03c6), the acceptance ratio for a Metropolis\u2013Hastings step is\n\n$$a = \\frac{q(\\theta \\leftarrow \\hat{\\theta}) \\, q(\\phi \\leftarrow \\hat{\\phi}) \\, p(\\hat{\\theta}) \\, p(\\hat{\\phi}) \\, \\mathcal{N}(\\{G_M, G_N\\} \\mid \\mathcal{M}, D, \\hat{\\theta})}{q(\\hat{\\theta} \\leftarrow \\theta) \\, q(\\hat{\\phi} \\leftarrow \\phi) \\, p(\\theta) \\, p(\\phi) \\, \\mathcal{N}(\\{G_M, G_N\\} \\mid \\mathcal{M}, D, \\theta)} \\left[\\prod_{m=1}^{M} \\frac{\\pi(x_m \\mid \\hat{\\phi})}{\\pi(x_m \\mid \\phi)}\\right] \\left[\\prod_{n=1}^{N} \\frac{\\pi(x_n \\mid \\hat{\\phi})}{\\pi(x_n \\mid \\phi)}\\right].$$\n\n4.4 Prediction\n\nThe predictive distribution is the one that arises on the space X when the posterior on the latent function g(x) (and perhaps hyperparameters) is integrated out. It is the expected distribution of the next datum, given the ones we have seen and taking into account our uncertainty. In the GPDS we sample from the predictive distribution by running the generative process of Section 3, initialized to the current latent history sample from the Metropolis\u2013Hastings procedure described above.\n\nIt may also be desirable to estimate the actual value of the predictive density. We use the method of Chib and Jeliazkov [10], and observe by detailed balance of a Metropolis\u2013Hastings move:\n\n$$p(x \\mid g, \\theta, \\phi) \\, \\pi(x') \\min\\left(1, \\frac{\\Phi(g(x'))}{\\Phi(g(x))}\\right) = p(x' \\mid g, \\theta, \\phi) \\, \\pi(x) \\min\\left(1, \\frac{\\Phi(g(x))}{\\Phi(g(x'))}\\right).$$\n\nFigure 3: These figures show the sequence of proposing new rejection locations, new function values at those locations, and new function values at the data. (a): The current state, with rejections labeled \u2133 = {x_m} on the left, along with the values of the latent function G_M = {g_m}. On the right side are the data D = {x_n} and the corresponding values of the latent function G_N = {g_n}. 
(b): New rejections \u2133\u0302 = {x\u0302_m} are proposed via q(\u2133\u0302 \u2190 \u2133), and the latent function is sampled at these points. (c): The latent function is perturbed at the new rejection locations and at the data via an underrelaxed proposal.\n\nWe find the expectation of each side under the posterior of g and the hyperparameters \u03b8 and \u03c6:\n\n$$\\int d\\theta \\int d\\phi \\, p(\\theta, \\phi \\mid D) \\int dg \\, p(g \\mid \\theta, D) \\int dx' \\, p(x \\mid g, \\theta, \\phi) \\, \\pi(x') \\min\\left(1, \\frac{\\Phi(g(x'))}{\\Phi(g(x))}\\right) = \\int d\\theta \\int d\\phi \\, p(\\theta, \\phi \\mid D) \\int dg \\, p(g \\mid \\theta, D) \\int dx' \\, p(x' \\mid g, \\theta, \\phi) \\, \\pi(x) \\min\\left(1, \\frac{\\Phi(g(x))}{\\Phi(g(x'))}\\right).$$\n\nThis gives an expression for the predictive density:\n\n$$p(x \\mid D) = \\frac{\\int d\\theta \\int d\\phi \\int dg \\int dx' \\, p(\\theta, \\phi, g, x' \\mid D) \\, \\pi(x) \\min\\left(1, \\frac{\\Phi(g(x))}{\\Phi(g(x'))}\\right)}{\\int d\\theta \\int d\\phi \\int dg \\int dx' \\, p(\\theta, \\phi, g \\mid x, D) \\, \\pi(x') \\min\\left(1, \\frac{\\Phi(g(x'))}{\\Phi(g(x))}\\right)}. \\tag{5}$$\n\nBoth the numerator and the denominator in Equation 5 are expectations that can be estimated by averaging over the output from the GPDS Metropolis\u2013Hastings sampler. The denominator requires sampling from the posterior distribution with the data augmented by x.\n\n5 Results\n\nWe examined the GPDS prior and the latent history inference procedure on a toy data set and on a skull reconstruction task. We compared the approach described in this paper to a kernel density estimate (Parzen windows), an infinite mixture of Gaussians (iMoG), and Dirichlet diffusion trees (DFT). The kernel density estimator used a spherical Gaussian with the bandwidth set via ten-fold cross validation. 
Neal\u2019s Flexible Bayesian Modeling (FBM) Software [1] was used for the implementation of both iMoG and DFT.\n\nThe toy data problem consisted of 100 uniform draws from a two-dimensional ring with radius 1.5, and zero-mean Gaussian noise added with \u03c3 = 0.2. The test data were 50 additional samples, and comparison used mean log probability of the test set. Each of the three Bayesian methods improved on the Parzen window estimate by two or more nats, with the DFT approach being the most successful. A bar plot of these results is shown in Figure 5.\n\nWe also compared the methods on a real-data task. We modeled the joint density of ten measurements of linear distances between anatomical landmarks on 228 rhesus macaque (Macaca mulatta) skulls. These linear distances were generated from three-dimensional coordinate data of anatomical landmarks taken by a single observer from dried skulls using a digitizer [11]. Linear distances are commonly used in morphological studies as they are invariant under rotation and translation of the objects being compared [12]. Figure 4 shows a computed tomography (CT) scan reconstruction of a macaque skull, along with the ten linear distances used. Each skull was measured three times in different trials, and these were modeled separately. 200 randomly-selected skulls were used as a training set and 28 were used as a test set. To be as fair as possible, the data were logarithmically transformed and whitened as a preprocessing step, to have zero sample mean and spherical sample covariance. Each of the Bayesian approaches outperformed the Parzen window technique in mean log probability of the test set, with comparable results for each. This result is not surprising, as flexible nonparametric Bayesian models should have roughly similar expressive capabilities. 
These results are shown in Figure 5.\n\nFigure 4: The macaque skull data are linear distances calculated between three-dimensional coordinates of anatomical landmarks. These are superior and inferior views of a computed tomography (CT) scan of a male macaque skull, with the ten linear distances superimposed. The anatomical landmarks are based on biological relevance and repeatability across individuals.\n\nFigure 5: This bar plot shows the improvement of the GPDS, infinite mixture of Gaussians (iMoG), and Dirichlet diffusion trees (DFT) in mean log probability (base e) of the test set over cross-validated Parzen windows on the toy ring data and the macaque data. The baseline log probability of the Parzen method for the ring data was \u22122.253 and for the macaque data was \u221215.443, \u221215.742, and \u221215.254 for each of three trials. The vertical axis is the improvement over Parzen in nats; the bars are grouped as Ring, Mac T1, Mac T2 and Mac T3.\n\n6 Discussion\n\nValid MCMC algorithms for fully Bayesian kernel regression methods are well-established. This work introduces the first such prior that enables tractable density estimation, complementing alternatives such as Dirichlet Diffusion Trees [1] and infinite mixture models.\n\nAlthough the GPDS has similar motivation to the logistic Gaussian process [13, 14, 15, 16], it differs significantly in its applicability and practicality. All known treatments of the logistic GP require a finite-dimensional proxy distribution. This proxy distribution is necessary both for tractability of inference and for estimation of the normalization constant. Due to the complexity constraints of both the basis-function approach of Lenk [15] and the lattice-based approach of Tokdar and Ghosh [16], these have only been implemented on single-dimensional toy problems. 
The GPDS construction we have presented here not only avoids numerical estimation of the normalization constant, but allows infinite-dimensional inference both in theory and in practice.\n\n6.1 Computational complexity\n\nThe inference method for the GPDS prior is \u201cpractical\u201d in the sense that it can be implemented without approximations, but it has potentially steep computational costs. To compare two latent histories in a Metropolis\u2013Hastings step we must evaluate the marginal likelihood of the Gaussian process. This requires a matrix decomposition whose cost is O((N + M)^3). The model explicitly allows M to be any nonnegative integer and so this cost is unbounded. The expected cost of an M\u2013H step is determined by the expected number of rejections M. For a given g(x), the expected M is N(Z_\u03c0[g]^{\u22121} \u2212 1). This expression is derived from the observation that \u03c0(x) provides an upper bound on the function \u03a6(g(x))\u03c0(x) and the ratio of acceptances to rejections is determined by the proportion of the mass of \u03c0(x) contained by \u03a6(g(x))\u03c0(x).\n\nWe are optimistic that more sophisticated Markov chain Monte Carlo techniques may realize constant-factor performance gains over the basic Metropolis\u2013Hastings scheme presented here, without compromising the correctness of the equilibrium distribution. Sparse approaches to Gaussian process regression that improve the asymptotically cubic behavior may also be relevant to the GPDS, but it is unclear that these will be an improvement over other approximate GP-based schemes for density modeling.\n\n6.2 Alternative inference methods\n\nIn developing inference methods for the GPDS prior, we have also explored the use of exchange sampling [17, 7]. 
Exchange sampling is an MCMC technique explicitly developed for the situation where there is an intractable normalization constant that prevents exact likelihood evaluation, but exact samples may be generated for any particular parameter setting. Undirected graphical models such as the Ising and Potts models provide common examples of cases where exchange sampling is applicable via coupling from the past [8]. Using the exact sampling procedure of Section 3, it is applicable to the GPDS as well. Exchange sampling for the GPDS, however, requires more evaluations of the function g(x) than the latent history approach. In practice the latent history approach of Section 4 does perform better.\n\nAcknowledgements\n\nThe authors wish to thank Radford Neal and Zoubin Ghahramani for valuable comments. Ryan Adams\u2019 research is supported by the Gates Cambridge Trust. Iain Murray\u2019s research is supported by the government of Canada. The authors thank the Caribbean Primate Research Center, the University of Puerto Rico, Medical Sciences Campus, Laboratory of Primate Morphology and Genetics, and the National Institutes of Health (Grant RR03640 to CPRC) for support.\n\nReferences\n\n[1] R. M. Neal. Defining priors for distributions using Dirichlet diffusion trees. Technical Report 0104, Department of Statistics, University of Toronto, 2001.\n\n[2] D. J. C. MacKay. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, Section A, 354(1):73\u201380, 1995.\n\n[3] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783\u20131816, 2005.\n\n[4] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.\n\n[5] A. Beskos, O. Papaspiliopoulos, G. O. Roberts, and P. Fearnhead. 
Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). Journal of the Royal Statistical Society: Series B, 68:333\u2013382, 2006.\n\n[6] O. Papaspiliopoulos and G. O. Roberts. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 95(1):169\u2013186, 2008.\n\n[7] I. Murray. Advances in Markov chain Monte Carlo methods. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, London, 2007.\n\n[8] J. G. Propp and D. B. Wilson. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9(1&2):223\u2013252, 1996.\n\n[9] R. M. Neal. Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation, 1998.\n\n[10] S. Chib and I. Jeliazkov. Marginal likelihood from the Metropolis\u2013Hastings output. Journal of the American Statistical Association, 96(453):270\u2013281, 2001.\n\n[11] K. E. Willmore, C. P. Klingenberg, and B. Hallgrimsson. The relationship between fluctuating asymmetry and environmental variance in rhesus macaque skulls. Evolution, 59(4):898\u2013909, 2005.\n\n[12] S. R. Lele and J. T. Richtsmeier. An invariant approach to statistical analysis of shapes. Chapman and Hall/CRC Press, London, 2001.\n\n[13] T. Leonard. Density estimation, stochastic processes and prior information. Journal of the Royal Statistical Society, Series B, 40(2):113\u2013146, 1978.\n\n[14] D. Thorburn. A Bayesian approach to density estimation. Biometrika, 73(1):65\u201375, 1986.\n\n[15] P. J. Lenk. Towards a practicable Bayesian nonparametric density estimator. Biometrika, 78(3):531\u2013543, 1991.\n\n[16] S. T. Tokdar and J. K. Ghosh. Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137:34\u201342, 2007.\n\n[17] I. Murray, Z. Ghahramani, and D. J.C. 
MacKay. MCMC for doubly-intractable distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 359\u2013366, 2006.", "award": [], "sourceid": 240, "authors": [{"given_name": "Iain", "family_name": "Murray", "institution": null}, {"given_name": "David", "family_name": "MacKay", "institution": null}, {"given_name": "Ryan", "family_name": "Adams", "institution": null}]}