{"title": "Nonstationary Covariance Functions for Gaussian Process Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 273, "page_last": 280, "abstract": "", "full_text": "Nonstationary Covariance Functions for\n\nGaussian Process Regression\n\nChristopher J. Paciorek and Mark J. Schervish\n\nDepartment of Statistics\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\npaciorek@alumni.cmu.edu,mark@stat.cmu.edu\n\nAbstract\n\nWe introduce a class of nonstationary covariance functions for Gaussian\nprocess (GP) regression. Nonstationary covariance functions allow the\nmodel to adapt to functions whose smoothness varies with the inputs.\nThe class includes a nonstationary version of the Mat\u00e9rn stationary co-\nvariance, in which the differentiability of the regression function is con-\ntrolled by a parameter, freeing one from \ufb01xing the differentiability in\nadvance.\nIn experiments, the nonstationary GP regression model per-\nforms well when the input space is two or three dimensions, outperform-\ning a neural network model and Bayesian free-knot spline models, and\ncompetitive with a Bayesian neural network, but is outperformed in one\ndimension by a state-of-the-art Bayesian free-knot spline model. The\nmodel readily generalizes to non-Gaussian data. Use of computational\nmethods for speeding GP \ufb01tting may allow for implementation of the\nmethod on larger datasets.\n\n1\n\nIntroduction\n\nGaussian processes (GPs) have been used successfully for regression and classi\ufb01cation\ntasks. Standard GP models use a stationary covariance, in which the covariance between\nany two points is a function of Euclidean distance. However, stationary GPs fail to adapt\nto variable smoothness in the function of interest [1, 2]. This is of particular importance in\ngeophysical and other spatial datasets, in which domain knowledge suggests that the func-\ntion may vary more quickly in some parts of the input space than in others. 
For example, in\nmountainous areas, environmental variables are likely to be much less smooth than in \ufb02at\nregions. Spatial statistics researchers have made some progress in de\ufb01ning nonstationary\ncovariance structures for kriging, a form of GP regression. We extend the nonstationary\ncovariance structure of [3], of which [1] gives a special case, to a class of nonstationary\ncovariance functions. The class includes a Mat\u00e9rn form, which in contrast to most covari-\nance functions has the added \ufb02exibility of a parameter that controls the differentiability\nof sample functions drawn from the GP distribution. We use the nonstationary covariance\nstructure for one-, two-, and three-dimensional input spaces in a standard GP regression\nmodel, as done previously only for one-dimensional input spaces [1].\n\nThe problem of variable smoothness has been attacked in spatial statistics by mapping\nthe original input space to a new space in which stationarity is assumed, but research has\nfocused on multiple noisy replicates of the regression function, with neither development nor\nassessment of the method in the standard regression setting [4, 5]. The issue has been ad-\ndressed in regression spline models by choosing the knot locations during the \ufb01tting [6] and\nin smoothing splines by choosing an adaptive penalizer on the integrated squared derivative\n[7]. The general approach in spline and other models involves learning the underlying basis\nfunctions, either explicitly or implicitly, rather than \ufb01xing the functions in advance. One\nalternative to a nonstationary GP model is mixtures of stationary GPs [8, 9]. Such meth-\nods adapt to variable smoothness by using different stationary GPs in different parts of the\ninput space. The main dif\ufb01culty is that the class membership is a function of the inputs;\nthis involves additional unknown functions in the hierarchy of the model. 
One possibility\nis to use stationary GPs for these additional unknown functions [8], while [9] reduce com-\nputational complexity by using a local estimate of the class membership, but do not know\nif the resulting model is well-de\ufb01ned probabilistically. While the mixture approach is in-\ntriguing, neither [8] nor [9] compares the model to other methods. In our model, there are\nunknown functions in the hierarchy of the model that determine the nonstationary covari-\nance structure. We choose to fully model the functions as Gaussian processes themselves,\nbut recognize the computational cost and suggest that simpler representations are worth\ninvestigating.\n\n2 Covariance functions and sample function differentiability\n\nThe covariance function is crucial in GP regression because it controls how much the data\nare smoothed in estimating the unknown function. GP distributions are distributions over\nfunctions; the covariance function determines the properties of sample functions drawn\nfrom the distribution. The stochastic process literature gives conditions for determining\nsample function properties of GPs based on the covariance function of the process, sum-\nmarized in [10] for several common covariance functions. Stationary, isotropic covariance\nfunctions are functions only of Euclidean distance, $\\tau$. Of particular note, the squared expo-\nnential (also called the Gaussian) covariance function, $C(\\tau) = \\sigma^2 \\exp(-(\\tau/\\kappa)^2)$, where\n$\\sigma^2$ is the variance and $\\kappa$ is a correlation scale parameter, has sample functions with in-\n\ufb01nitely many derivatives. In contrast, spline regression models have sample functions that\nare typically only twice differentiable. Such high differentiability is of theoretical concern from an\nasymptotic perspective [11]; moreover, other covariance forms might better \ufb01t real data for which it is\nunlikely that the unknown function is so highly differentiable. 
In spatial statistics, the expo-\nnential covariance, $C(\\tau) = \\sigma^2 \\exp(-\\tau/\\kappa)$, is commonly used, but this form gives sample\nfunctions that, while continuous, are not differentiable. Recent work in spatial statistics has\nfocused on the Mat\u00e9rn form, $C(\\tau) = \\sigma^2 \\frac{1}{\\Gamma(\\nu) 2^{\\nu-1}} (2\\sqrt{\\nu}\\,\\tau/\\kappa)^{\\nu} K_{\\nu}(2\\sqrt{\\nu}\\,\\tau/\\kappa)$, where $K_{\\nu}(\\cdot)$\nis the modi\ufb01ed Bessel function of the second kind, whose order is the differentiability pa-\nrameter, $\\nu > 0$. This form has the desirable property that sample functions are $\\lfloor \\nu - 1 \\rfloor$\ntimes differentiable. As $\\nu \\to \\infty$, the Mat\u00e9rn approaches the squared exponential form,\nwhile for $\\nu = 0.5$, the Mat\u00e9rn takes the exponential form. Standard covariance functions\nrequire one to place all of one\u2019s prior probability on a particular degree of differentiability;\nuse of the Mat\u00e9rn allows one to more accurately, yet easily, express prior lack of knowledge\nabout sample function differentiability. One application for which this may be of particular\ninterest is geophysical data.\n\n[12] suggest using the squared exponential covariance but with anisotropic distance,\n$\\tau(x_i, x_j) = \\sqrt{(x_i - x_j)^T \\Delta^{-1} (x_i - x_j)}$, where $\\Delta$ is an arbitrary positive de\ufb01nite ma-\ntrix, rather than the standard diagonal matrix. This allows the GP model to more easily\nmodel interactions between the inputs. The nonstationary covariance function we intro-\nduce next builds on this more general form.\n\n3 Nonstationary covariance functions\n\nOne nonstationary covariance function, introduced by [3], is\n$C(x_i, x_j) = \\int_{\\Re^2} k_{x_i}(u) k_{x_j}(u) \\, du$, where $x_i$, $x_j$, and $u$ are locations in $\\Re^2$, and $k_x(\\cdot)$ is a ker-\nnel function centered at $x$. One can show directly that $C(x_i, x_j)$ is positive de\ufb01nite in\n$\\Re^p$, $p = 1, 2, \\ldots$ [10]. For Gaussian kernels, the covariance takes the simple form,\n\n$$C^{NS}(x_i, x_j) = \\sigma^2 |\\Sigma_i|^{1/4} |\\Sigma_j|^{1/4} |(\\Sigma_i + \\Sigma_j)/2|^{-1/2} \\exp(-Q_{ij}), \\tag{1}$$\n\nwith quadratic form\n\n$$Q_{ij} = (x_i - x_j)^T ((\\Sigma_i + \\Sigma_j)/2)^{-1} (x_i - x_j), \\tag{2}$$\n\nwhere $\\Sigma_i$, which we call the kernel matrix, is the covariance matrix of the Gaussian kernel\nat $x_i$. The form (1) is a squared exponential correlation function, but in place of a \ufb01xed\nmatrix, $\\Delta$, in the quadratic form, we average the kernel matrices for the two locations. The\nevolution of the kernel matrices in space produces nonstationary covariance, with kernels\nthat drop off quickly producing locally short correlation scales. Independently, [1] derived a\nspecial case in which the kernel matrices are diagonal. Unfortunately, so long as the kernel\nmatrices vary smoothly in the input space, sample functions from GPs with the covariance\n(1) are in\ufb01nitely differentiable [10], just as for the stationary squared exponential.\n\nTo generalize (1) and introduce functions for which sample path differentiability varies, we\nextend (1) as proven in [10]:\n\nTheorem 1 Let $Q_{ij}$ be de\ufb01ned as in (2). 
If a stationary correlation function, $R^S(\\tau)$, is\npositive de\ufb01nite on $\\Re^p$ for every $p = 1, 2, \\ldots$, then\n\n$$R^{NS}(x_i, x_j) = |\\Sigma_i|^{1/4} |\\Sigma_j|^{1/4} |(\\Sigma_i + \\Sigma_j)/2|^{-1/2} R^S(\\sqrt{Q_{ij}}) \\tag{3}$$\n\nis a nonstationary correlation function, positive de\ufb01nite on $\\Re^p$, $p = 1, 2, \\ldots$.\n\nOne example of nonstationary covariance functions constructed in this way is a nonstation-\nary version of the Mat\u00e9rn covariance,\n\n$$C^{NS}(x_i, x_j) = \\frac{\\sigma^2 |\\Sigma_i|^{1/4} |\\Sigma_j|^{1/4}}{\\Gamma(\\nu) 2^{\\nu-1}} \\left| \\frac{\\Sigma_i + \\Sigma_j}{2} \\right|^{-1/2} (2\\sqrt{\\nu Q_{ij}})^{\\nu} K_{\\nu}(2\\sqrt{\\nu Q_{ij}}). \\tag{4}$$\n\nProvided the kernel matrices vary smoothly in space, the sample function differentiabil-\nity of the nonstationary form follows that of the stationary form, so for the nonstationary\nMat\u00e9rn, the sample function differentiability increases with $\\nu$ [10].\n\n4 Bayesian regression model and implementation\n\nAssume independent observations, $Y_1, \\ldots, Y_n$, indexed by a vector of input or feature val-\nues, $x_i \\in \\Re^P$, with $Y_i \\sim N(f(x_i), \\eta^2)$, where $\\eta^2$ is the noise variance. Specify a Gaussian\nprocess prior, $f(\\cdot) \\sim GP(\\mu_f, C^{NS}_f(\\cdot, \\cdot))$, where $C^{NS}_f(\\cdot, \\cdot)$ is the nonstationary Mat\u00e9rn co-\nvariance function (4) constructed from a set of Gaussian kernels as described below. For\nthe differentiability parameter, we use the prior, $\\nu_f \\sim U(0.5, 30)$, which varies between\nnon-differentiability (0.5) and high differentiability. 
We use proper, but diffuse, priors for\n$\\mu_f$, $\\sigma^2_f$, and $\\eta^2$. The main challenge is to parameterize the kernel matrices, since their evo-\nlution determines how quickly the covariance structure changes in the input space and the\ndegree to which the model adapts to variable smoothness in the unknown function. In many\nproblems, it seems natural that the covariance structure would evolve smoothly; if so, the\ndifferentiability of the regression function will be determined by $\\nu_f$.\n\nWe put a prior distribution on the kernel matrices as follows. Any location in the input\nspace, $x_i$, has a Gaussian kernel with mean $x_i$ and covariance (kernel) matrix, $\\Sigma_i$. When\nthe input space is one-dimensional, each kernel \u2018matrix\u2019 is just a scalar, the variance of\nthe kernel, and we use a stationary Mat\u00e9rn GP prior on the log variance so that the vari-\nances evolve smoothly in the input space. Next consider multi-dimensional input spaces;\nsince there are (implicitly) kernel matrices at each location in the input space, we have\na multivariate process, the matrix-valued function, $\\Sigma(\\cdot)$. Parameterizing positive de\ufb01nite\nmatrices as a function of the input space is a dif\ufb01cult problem; see [7]. We use the spectral\ndecomposition of an individual covariance matrix, $\\Sigma_i$,\n\n$$\\Sigma_i = \\Gamma(\\gamma_1(x_i), \\ldots, \\gamma_Q(x_i)) \\, D(\\lambda_1(x_i), \\ldots, \\lambda_P(x_i)) \\, \\Gamma(\\gamma_1(x_i), \\ldots, \\gamma_Q(x_i))^T, \\tag{5}$$\n\nwhere $D$ is a diagonal matrix of eigenvalues and $\\Gamma$ is an eigenvector matrix constructed\nas described below. $\\lambda_p(\\cdot)$, $p = 1, \\ldots, P$, and $\\gamma_q(\\cdot)$, $q = 1, \\ldots, Q$, which are func-\ntions on the input space, construct $\\Sigma(\\cdot)$. We will refer to these as the eigenvalue\nand eigenvector processes, and to them collectively as the eigenprocesses. 
Let $\\phi(\\cdot) \\in \\{\\log(\\lambda_1(\\cdot)), \\ldots, \\log(\\lambda_P(\\cdot)), \\gamma_1(\\cdot), \\ldots, \\gamma_Q(\\cdot)\\}$ denote any one of these eigenprocesses. To\nhave the kernel matrices vary smoothly, we ensure that their eigenvalues and eigenvectors\nvary smoothly by taking each $\\phi(\\cdot)$ to have a GP prior with a single stationary, anisotropic\nMat\u00e9rn correlation function, common to all the processes and described later. Using a\nshared correlation function gives us smoothly-varying kernels, while limiting the number\nof parameters. We force the eigenprocesses to be very smooth by \ufb01xing $\\nu = 30$. We do\nnot let $\\nu$ vary, because it should have minimal impact on the regression estimate and is not\nwell-informed by the data.\n\nParameterizing the eigenvectors of the kernel matrices using Givens angles, with each an-\ngle a function on $\\Re^P$, the input space, is dif\ufb01cult, because the angle functions have range\n$[0, 2\\pi) \\equiv S^1$, which is not compatible with the range of a GP. To avoid this, we overparam-\neterize the eigenvectors, using $Q = P(P-1)/2 + P - 1$ Gaussian processes, $\\gamma_q(\\cdot)$, that\ndetermine the directions of a set of orthogonal vectors. Here, we demonstrate the construc-\ntion of the eigenvectors for $x_i \\in \\Re^2$ and $x_i \\in \\Re^3$; a similar approach, albeit with more\nparameters, applies to higher-dimensional spaces, but is probably infeasible in dimensions\nlarger than \ufb01ve or so. In $\\Re^3$, we construct an eigenvector matrix for an individual location\nas $\\Gamma = \\Gamma_3 \\Gamma_2$, where\n\n$$\\Gamma_3 = \\begin{pmatrix} \\frac{a}{l_{abc}} & \\frac{-b}{l_{ab}} & \\frac{-ac}{l_{ab} l_{abc}} \\\\ \\frac{b}{l_{abc}} & \\frac{a}{l_{ab}} & \\frac{-bc}{l_{ab} l_{abc}} \\\\ \\frac{c}{l_{abc}} & 0 & \\frac{l_{ab}}{l_{abc}} \\end{pmatrix}, \\qquad \\Gamma_2 = \\begin{pmatrix} 1 & 0 & 0 \\\\ 0 & \\frac{u}{l_{uv}} & \\frac{-v}{l_{uv}} \\\\ 0 & \\frac{v}{l_{uv}} & \\frac{u}{l_{uv}} \\end{pmatrix}.$$\n\nThe elements of $\\Gamma_3$ are functions of three random variables, $\\{A, B, C\\}$, where $l_{abc} =\n\\sqrt{a^2 + b^2 + c^2}$ and $l_{ab} = \\sqrt{a^2 + b^2}$. 
$(\\Gamma_3)_{32} = 0$ is a constraint that saves a degree of\nfreedom for the two-dimensional subspace orthogonal to $\\Gamma_3$. The elements of $\\Gamma_2$ are based\non two random variables, $U$ and $V$. To have the matrices, $\\Sigma(\\cdot)$, vary smoothly in space,\n$a$, $b$, $c$, $u$, and $v$ are the values of the processes, $\\gamma_1(\\cdot), \\ldots, \\gamma_5(\\cdot)$, at the input of interest.\n\nOne can integrate $f$, the function evaluated at the inputs, out of the GP model. In the\nstationary GP model, the marginal posterior contains a small number of hyperparameters\nto either optimize or sample via MCMC. In the nonstationary case, the presence of the\nadditional GPs for the kernel matrices (5) precludes straightforward optimization, leaving\nMCMC. For each of the eigenprocesses, we reparameterize the vector, $\\phi$, of values of the\nprocess at the input locations, $\\phi = \\mu_\\phi + \\sigma_\\phi L(\\Delta(\\theta)) \\omega_\\phi$, where $\\omega_\\phi \\sim N(0, I)$ a priori and\n$L$ is a matrix de\ufb01ned below. We sample $\\mu_\\phi$, $\\sigma_\\phi$, and $\\omega_\\phi$ via Metropolis-Hastings separately\nfor each eigenprocess. The parameter vector $\\theta$, involving $P$ correlation scale parameters\nand $P(P-1)/2$ Givens angles, is used to construct an anisotropic distance matrix, $\\Delta(\\theta)$,\nshared by the $\\phi$ vectors, creating a stationary, anisotropic correlation structure common to\nall the eigenprocesses. $\\theta$ is also sampled via Metropolis-Hastings. 
$L(\\Delta(\\theta))$ is a general-\nized Cholesky decomposition of the correlation matrix shared by the $\\phi$ vectors that deals\nwith numerically singular correlation matrices by setting the $i$th column of the matrix to\nall zeroes when $\\phi_i$ is numerically a linear combination of $\\phi_1, \\ldots, \\phi_{i-1}$ [13]. One never\ncalculates $L(\\Delta(\\theta))^{-1}$ or $|L(\\Delta(\\theta))|$, which are not de\ufb01ned, and does not need to introduce\njitter, and therefore discontinuity in $\\phi(\\cdot)$, into the covariance structure.\n\nFigure 1: On the left are the three test functions in one dimension, with one simulated set\nof observations (of the 50 used in the evaluation), while the right shows the test function\nwith two inputs.\n\n5 Experiments\n\nFor one-dimensional functions, we compare the nonstationary GP method to a station-\nary GP model (see footnote 1), two neural network implementations (see footnote 2), and Bayesian adaptive regression\nsplines (BARS), a Bayesian free-knot spline model that has been very successful in com-\nparisons in the statistical literature [6]. We use three test functions [6]: a smoothly-varying\nfunction, a spatially inhomogeneous function, and a function with a sharp jump (Figure\n1a). For each, we generate 50 sets of noisy data and compare the models using the means,\naveraged over the 50 sets, of the standardized MSE, $\\sum_i (\\hat{f}_i - f_i)^2 / \\sum_i (f_i - \\bar{f})^2$, where $\\hat{f}_i$\nis the posterior mean at $x_i$, and $\\bar{f}$ is the mean of the true values. 
In the non-Bayesian neural\nnetwork model, $\\hat{f}_i$ is the \ufb01tted value and, as a simpli\ufb01cation, we use a network with the op-\ntimal number of hidden units (3, 3, and 8 for the three functions), thereby giving an overly\noptimistic assessment of the performance. To avoid local minima, we used the network \ufb01t\nthat minimized the MSE (relative to the data, with $y_i$ in place of $f_i$ in the expression for\nMSE) over \ufb01ve \ufb01ts with different random seeds.\n\n1 We implement the stationary GP model by replacing $C^{NS}_f(\\cdot, \\cdot)$ with the Mat\u00e9rn stationary cor-\nrelation, still using a differentiability parameter, $\\nu_f$, that is allowed to vary.\n\n2 For a non-Bayesian model, we use the implementation in the statistical software R, which \ufb01ts\na multilayer perceptron with one hidden layer. For a Bayesian version, results from R. Neal\u2019s FBM\nsoftware were kindly provided by A. Vehtari.\n\nTable 1: Mean (over 50 data samples) and 95% con\ufb01dence interval for standardized MSE\nfor the \ufb01ve methods on the three test functions with one-dimensional input.\n\nMethod               Function 1             Function 2          Function 3\nStat. GP             .0083 (.0073,.0093)    .026 (.024,.029)    .071 (.067,.074)\nNonstat. GP          .0083 (.0073,.0093)    .015 (.013,.016)    .026 (.021,.030)\nBARS                 .0081 (.0071,.0092)    .012 (.011,.013)    .0050 (.0043,.0056)\nBayes. neural net.   .0082 (.0072,.0093)    .011 (.010,.014)    .015 (.014,.016)\nneural network       .0108 (.0095,.012)     .013 (.012,.015)    .0095 (.0086,.010)\n\nFor higher-dimensional inputs, we compare the nonstationary GP to the stationary GP, the\nneural network models, and two free-knot spline methods, Bayesian multivariate linear\nsplines (BMLS) [14] and Bayesian multivariate automatic regression splines (BMARS)\n[15], a Bayesian version of MARS [16]. We choose to compare to neural networks and\nsplines, because they are popular and these particular implementations have the ability\nto adapt to variable smoothness. BMLS uses piecewise, continuous linear splines, while\nBMARS uses tensor products of univariate splines; both are \ufb01t via reversible jump MCMC.\nWe use three datasets, the \ufb01rst a function with two inputs [14] (Figure 1b), for which we use\n225 training inputs and test on 225 inputs, for each of 50 simulated datasets. The second\nis a real dataset of air temperature as a function of latitude and longitude [17] that allows\nassessment on a spatial dataset with distinct variable smoothness. We use a 109 observation\nsubset of the original data, focusing on the Western hemisphere, 222.5\u00b0-322.5\u00b0 E and\n62.5\u00b0 S-82.5\u00b0 N, and \ufb01t the models on 54 splits with 107 training examples and two test\nexamples and one split with 108 training examples and one test example, thereby including\neach data point as a test point once. The third is a real dataset of 111 daily measurements\nof ozone [18] included in the S-plus statistical software. The goal is to predict the cube root\nof ozone based on three features: radiation, temperature, and wind speed. We do 55 splits\nwith 109 training examples and two test examples and one split of 110 training examples\nand one test example. 
For the non-Bayesian neural network, 10, 50, and 3 hidden units\nwere optimal for the three datasets, respectively.\n\nTable 1 shows that the nonstationary GP does as well as or better than the stationary GP,\nbut that BARS does as well as or better than the other methods on all three datasets with\none input. Part of the dif\ufb01culty for the nonstationary GP with the third function, which\nhas the sharp jump, is that our parameterization forces smoothly-varying kernel matrices,\nwhich prevents our particular implementation from picking up sharp jumps. A potential\nimprovement would be to parameterize kernel matrices that do not vary so smoothly. Table\n2 shows that for the known function on two dimensions, the GP models outperform both\nthe spline models and the non-Bayesian neural network, but not the Bayesian network. The\nstationary and nonstationary GPs are very similar, indicative of the relative homogeneity\nof the function. For the two real datasets, the nonstationary GP model outperforms the\nother methods, except the Bayesian network on the temperature dataset. Predictive density\ncalculations that assess the \ufb01ts of the functions drawn during the MCMC are similar to the\npoint estimate MSE calculations in terms of model comparison, although we do not have\npredictive density values for the non-Bayesian neural network implementation.\n\n6 Non-Gaussian data\n\nWe can model non-Gaussian data, using the usual extension from a linear model to a gen-\neralized linear model, for observations, $Y_i \\sim D(g(f(x_i)))$, where $D(\\cdot)$ ($g(\\cdot)$) is an appro-\npriate distribution (link) function, such as the Poisson (log) for count data or the binomial\n(logit) for binary data. Take $f(\\cdot)$ to have a nonstationary GP prior; it cannot be integrated\nout of the model because of the lack of conjugacy, which causes slow MCMC mixing. 
[10]\nimproves mixing, which remains slow, using a sampling scheme in which the hyperparam-\neters (including the kernel structure for the nonstationarity) are sampled jointly with the\nfunction values, $f$, in a way that makes use of information in the likelihood.\n\nTable 2: For test function with two inputs, mean (over 50 data samples) and 95% con\ufb01dence\ninterval for standardized MSE at 225 test locations, and for the temperature and ozone\ndatasets, cross-validated standardized MSE, for the six methods.\n\nMethod                    Function with 2 inputs   Temp. data   Ozone data\nStat. GP                  .024 (.021,.026)         .46          .33\nNonstat. GP               .023 (.020,.026)         .36          .29\nBayesian neural network   .020 (.019,.022)         .35          .32\nneural network            .040* (.033,.047)        .60          .34\nBMARS                     .076 (.065,.087)         .53          .33\nBMLS                      .033 (.029,.038)         .78          .33\n\n* [14] report a value of .07 for a neural network implementation\n\nWe \ufb01t the model to the Tokyo rainfall dataset [19]. The data are the presence of rainfall\ngreater than 1 mm for every calendar day in 1983 and 1984. Assuming independence\nbetween years [19], conditional on $f(\\cdot) = \\mathrm{logit}(p(\\cdot))$, the likelihood for a given calendar\nday, $x_i$, is binomial with two trials and unknown probability of rainfall, $p(x_i)$. Figure 2a\nshows that the estimated function reasonably follows the data and is quite variable because\nthe data in some areas are clustered. The model detects inhomogeneity in the function,\nwith more smoothness in the \ufb01rst few months and less smoothness later (Figure 2b).\n\nFigure 2. (a) Posterior mean estimate, from nonstationary GP model, of $p(\\cdot)$, the probability of\nrainfall as a function of calendar day, with 95% pointwise credible intervals. Dots are empirical\nprobabilities of rainfall based on the two binomial trials. (b) Posterior geometric mean kernel size\n(square root of geometric mean kernel eigenvalue). [Axes: Prob. of rainfall (a) and Kernel size (b) versus calendar day.]\n\n7 Discussion\n\nWe introduce a class of nonstationary covariance functions that can be used in GP regres-\nsion (and classi\ufb01cation) models and allow the model to adapt to variable smoothness in\nthe unknown function. The nonstationary GPs improve on stationary GP models on sev-\neral test datasets. In test functions on one-dimensional spaces, a state-of-the-art free-knot\nspline model outperforms the nonstationary GP, but in higher dimensions, the nonstation-\nary GP outperforms two free-knot spline approaches and a non-Bayesian neural network,\nwhile being competitive with a Bayesian neural network. The nonstationary GP may be\nof particular interest for data indexed by spatial coordinates, where the low dimensionality\nkeeps the parameter complexity manageable.\n\nUnfortunately, the nonstationary GP requires many more parameters than a stationary GP,\nparticularly as the dimension grows, losing the attractive simplicity of the stationary GP\nmodel. Use of GP priors in the hierarchy of the model to parameterize the nonstationary\ncovariance results in slow computation, limiting the feasibility of the model to approxi-\nmately $n < 1000$, because the Cholesky decomposition is $O(n^3)$. Our approach provides\na general framework; work is ongoing on simpler, more computationally ef\ufb01cient param-\neterizations of the kernel matrices. Also, approaches that use low-rank approximations to\nthe covariance matrix [20, 21] may speed \ufb01tting.\n\nReferences\n\n[1] M.N. Gibbs. Bayesian Gaussian Processes for Classi\ufb01cation and Regression. PhD thesis, Univ.\nof Cambridge, Cambridge, U.K., 1997.\n\n[2] D.J.C. MacKay. Introduction to Gaussian processes. Technical report, Univ. of Cambridge,\n1997.\n\n[3] D. Higdon, J. Swall, and J. Kern. Non-stationary spatial modeling. In J.M. 
Bernardo, J.O.\nBerger, A.P. Dawid, and A.F.M. Smith, editors, Bayesian Statistics 6, pages 761\u2013768, Oxford,\nU.K., 1999. Oxford University Press.\n\n[4] A.M. Schmidt and A. O\u2019Hagan. Bayesian inference for nonstationary spatial covariance struc-\nture via spatial deformations. Technical Report 498/00, University of Shef\ufb01eld, 2000.\n\n[5] D. Damian, P.D. Sampson, and P. Guttorp. Bayesian estimation of semi-parametric non-\nstationary spatial covariance structure. Environmetrics, 12:161\u2013178, 2001.\n\n[6] I. DiMatteo, C.R. Genovese, and R.E. Kass. Bayesian curve-\ufb01tting with free-knot splines.\nBiometrika, 88:1055\u20131071, 2002.\n\n[7] D. MacKay and R. Takeuchi. Interpolation models with multiple hyperparameters, 1995.\n\n[8] Volker Tresp. Mixtures of Gaussian processes. In Todd K. Leen, Thomas G. Dietterich, and\nVolker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 654\u2013660.\nMIT Press, 2001.\n\n[9] C.E. Rasmussen and Z. Ghahramani. In\ufb01nite mixtures of Gaussian process experts. In T. G.\nDietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing\nSystems 14, Cambridge, Massachusetts, 2002. MIT Press.\n\n[10] C.J. Paciorek. Nonstationary Gaussian Processes for Regression and Spatial Modelling. PhD\nthesis, Carnegie Mellon University, Pittsburgh, Pennsylvania, 2003.\n\n[11] M.L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, N.Y., 1999.\n\n[12] F. Vivarelli and C.K.I. Williams. Discovering hidden features with Gaussian processes regres-\nsion. In M.J. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information\nProcessing Systems 11, 1999.\n\n[13] J.R. Lockwood, M.J. Schervish, P.L. Gurian, and M.J. Small. Characterization of arsenic occur-\nrence in source waters of U.S. community water systems. J. Am. Stat. Assoc., 96:1184\u20131193,\n2001.\n\n[14] C.C. Holmes and B.K. Mallick. 
Bayesian regression with multivariate linear splines. Journal\nof the Royal Statistical Society, Series B, 63:3\u201317, 2001.\n\n[15] D.G.T. Denison, B.K. Mallick, and A.F.M. Smith. Bayesian MARS. Statistics and Computing,\n8:337\u2013346, 1998.\n\n[16] J.H. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19:1\u2013141, 1991.\n\n[17] S.A. Wood, W.X. Jiang, and M. Tanner. Bayesian mixture of splines for spatially adaptive\nnonparametric regression. Biometrika, 89:513\u2013528, 2002.\n\n[18] S.M. Bruntz, W.S. Cleveland, B. Kleiner, and J.L. Warner. The dependence of ambient ozone\non solar radiation, temperature, and mixing height. In American Meteorological Society, editor,\nSymposium on Atmospheric Diffusion and Air Pollution, pages 125\u2013128, 1974.\n\n[19] C. Biller. Adaptive Bayesian regression splines in semiparametric generalized linear models.\nJournal of Computational and Graphical Statistics, 9:122\u2013140, 2000.\n\n[20] A.J. Smola and P. Bartlett. Sparse greedy Gaussian process approximation. In T. Leen, T. Di-\netterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, Cam-\nbridge, Massachusetts, 2001. MIT Press.\n\n[21] M. Seeger and C. Williams. Fast forward selection to speed up sparse Gaussian process regres-\nsion. In Workshop on AI and Statistics 9, 2003.\n", "award": [], "sourceid": 2350, "authors": [{"given_name": "Christopher", "family_name": "Paciorek", "institution": null}, {"given_name": "Mark", "family_name": "Schervish", "institution": null}]}