Warped Gaussian Processes

Advances in Neural Information Processing Systems, pages 337-344

Edward Snelson*          Carl Edward Rasmussen†          Zoubin Ghahramani*

*Gatsby Computational Neuroscience Unit, University College London
17 Queen Square, London WC1N 3AR, UK
{snelson,zoubin}@gatsby.ucl.ac.uk

†Max Planck Institute for Biological Cybernetics
Spemannstraße 38, 72076 Tübingen, Germany
carl@tuebingen.mpg.de

Abstract

We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation.

1 Introduction

A Gaussian process (GP) is an extremely concise and simple way of placing a prior on functions. Once this is done, GPs can be used as the basis for nonlinear nonparametric regression and classification, showing excellent performance on a wide variety of datasets [1, 2, 3]. Importantly, they allow full Bayesian predictive distributions to be obtained, rather than merely point predictions.

However, in their simplest form GPs are limited by the nature of their simplicity: they assume the target data to be distributed as a multivariate Gaussian, with Gaussian noise on the individual points.
This simplicity enables predictions to be made easily using matrix manipulations, and of course the predictive distributions are Gaussian also.

Often it is unreasonable to assume that, in the form the data is obtained, the noise will be Gaussian and the data well modelled as a GP. For example, the observations may be positive quantities varying over many orders of magnitude, where it makes little sense to model these quantities directly assuming homoscedastic Gaussian noise. In these situations it is standard practice in the statistics literature to take the log of the data. Modelling then proceeds assuming that this transformed data has Gaussian noise and will be better modelled by the GP. The log is just one particular transformation that could be made; there is a continuum of transformations that could be applied to the observation space to bring the data into a form well modelled by a GP. Making such a transformation should really be a full part of the probabilistic modelling; it seems strange to first make an ad-hoc transformation, and then use a principled Bayesian probabilistic model.

In this paper we show how such a transformation, or 'warping', of the observation space can be made entirely automatically, fully encompassed in the probabilistic framework of the GP. The warped GP makes a transformation from a latent space to the observations, such that the data is best modelled by a GP in the latent space. It can also be viewed as a generalisation of the GP, since in observation space it is a non-Gaussian process, with non-Gaussian and asymmetric noise in general. It is not, however, just a GP with a non-Gaussian noise model; see section 6 for further discussion.

For an excellent review of Gaussian processes for regression and classification see [4]. We follow the notation there throughout this paper and present a brief summary of GP regression in section 2.
We show in sections 4 and 5, with both toy and real data, that the warped GP can significantly improve predictive performance over a variety of measures, especially with regard to the whole predictive distribution, rather than just a single point prediction such as the mean or median. The transformation found also gives insight into the properties of the data.

2 Nonlinear regression with Gaussian processes

Suppose we are given a dataset D consisting of N pairs of input vectors $X_N \equiv \{x^{(n)}\}_{n=1}^N$ and real-valued targets $t_N \equiv \{t_n\}_{n=1}^N$. We wish to predict the value of an observation $t_{N+1}$ given a new input vector $x^{(N+1)}$, or rather the distribution $P(t_{N+1} \mid x^{(N+1)}, D)$. We assume there is an underlying function $y(x)$ which we are trying to model, and that the observations lie noisily around it. A GP places a prior directly on the space of functions by assuming that any finite selection of points $X_N$ gives rise to a multivariate Gaussian distribution over the corresponding function values $y_N$. The covariance between the function values of $y$ at two points $x$ and $x'$ is modelled with a covariance function $C(x, x')$, which is usually assumed to have some simple parametric form. If the noise model is taken to be Gaussian, then the distribution over observations $t_N$ is also Gaussian, with the entries of the covariance matrix C given by

    C_{mn} = C(x^{(m)}, x^{(n)}; \Theta) + \delta_{mn}\, g(x^{(n)}; \Theta),    (1)

where $\Theta$ parameterises the covariance function, $g$ is the noise model, and $\delta_{mn}$ is the Kronecker delta.

Often the noise model is taken to be input-independent, and the covariance function is taken to be a Gaussian function of the difference in the input vectors (a stationary covariance function), although many other possibilities exist; see e.g. [5] for GPs with input-dependent noise.
In this paper we consider only this popular choice, in which case the entries of the covariance matrix are given by

    C_{mn} = v_1 \exp\left[ -\frac{1}{2} \sum_{d=1}^{D} \left( \frac{x_d^{(m)} - x_d^{(n)}}{r_d} \right)^2 \right] + v_0 \delta_{mn}.    (2)

Here $r_d$ is a width parameter expressing the scale over which typical functions vary in the $d$th dimension, $v_1$ is a size parameter expressing the typical size of the overall process in $y$-space, $v_0$ is the noise variance of the observations, and $\Theta = \{v_0, v_1, r_1, \ldots, r_D\}$.

It is simple to show that the predictive distribution for a new point given the observed data, $P(t_{N+1} \mid t_N, X_{N+1})$, is Gaussian. The calculation of the mean and variance of this distribution involves inverting the covariance matrix $C_N$ of the training inputs, which using standard exact methods incurs a computational cost of order $N^3$. Learning, or 'training', in a GP is usually achieved by finding a local maximum of the likelihood with respect to the hyperparameters $\Theta$ of the covariance matrix, using conjugate gradient methods. The negative log likelihood is given by

    L = -\log P(t_N \mid X_N, \Theta) = \frac{1}{2} \log\det C_N + \frac{1}{2} t_N^\top C_N^{-1} t_N + \frac{N}{2} \log 2\pi.    (3)

Once again, the evaluation of L, and of its gradients with respect to $\Theta$, involves computing the inverse covariance matrix, incurring an order $N^3$ cost. Rather than finding a ML estimate $\Theta_{\mathrm{ML}}$, a prior over $\Theta$ can be included to find a MAP estimate $\Theta_{\mathrm{MAP}}$, or better still $\Theta$ can be numerically integrated out when computing $P(t_{N+1} \mid x^{(N+1)}, D)$, using for example hybrid Monte Carlo methods [2, 6].

3 Warping the observation space

In this section we present a method of warping the observation space through a nonlinear monotonic function to a latent space, whilst retaining the full probabilistic framework so that learning and prediction take place consistently.
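The standard GP quantities of section 2 — the covariance matrix (2) and the negative log likelihood (3) — can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are ours, and a Cholesky factorisation stands in for the explicit matrix inverse.

```python
import numpy as np

def cov_matrix(X, v0, v1, r):
    """Squared-exponential covariance (2): v1*exp(-0.5*sum_d ((x_d - x'_d)/r_d)^2) + v0*I."""
    Z = X / r                              # scale each dimension by its width r_d
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return v1 * np.exp(-0.5 * sq) + v0 * np.eye(len(X))

def gp_nll(t, X, v0, v1, r):
    """Negative log likelihood (3): 0.5*logdet(C) + 0.5*t^T C^{-1} t + (N/2)*log(2*pi)."""
    C = cov_matrix(X, v0, v1, r)
    L = np.linalg.cholesky(C)              # stable log-determinant and solve
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return 0.5 * logdet + 0.5 * t @ alpha + 0.5 * len(t) * np.log(2 * np.pi)
```

In practice the gradients of (3) with respect to $\Theta$ would be derived analytically and fed to a conjugate gradient optimiser, as the paper describes.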
Let us consider a vector of latent targets $z_N$ and suppose that this vector is modelled by a GP,

    -\log P(z_N \mid X_N, \Theta) = \frac{1}{2} \log\det C_N + \frac{1}{2} z_N^\top C_N^{-1} z_N + \frac{N}{2} \log 2\pi.    (4)

Now we make a transformation from the true observation space to the latent space by mapping each observation through the same monotonic function f,

    z_n = f(t_n; \Psi) \quad \forall n,    (5)

where $\Psi$ parameterises the transformation. We require f to be monotonic and to map onto the whole of the real line; otherwise probability measure will not be conserved in the transformation, and we will not induce a valid distribution over the targets $t_N$. Including the Jacobian term that takes the transformation into account, the negative log likelihood, $-\log P(t_N \mid X_N, \Theta, \Psi)$, now becomes

    L = \frac{1}{2} \log\det C_N + \frac{1}{2} f(t_N)^\top C_N^{-1} f(t_N) - \sum_{n=1}^{N} \log \left. \frac{\partial f(t)}{\partial t} \right|_{t_n} + \frac{N}{2} \log 2\pi.    (6)

3.1 Training the warped GP

Learning in this extended model is achieved by simply taking derivatives of the negative log likelihood function (6) with respect to both the $\Theta$ and $\Psi$ parameter vectors, and using a conjugate gradient method to compute ML parameter values. In this way the forms of both the covariance matrix and the nonlinear transformation are learnt simultaneously under the same probabilistic framework. Since the computational limiter for a GP is inverting the covariance matrix, adding a few extra parameters into the likelihood is not really costing us anything. All we require is that the derivatives of f are easy to compute (both with respect to t and to $\Psi$), and that we don't introduce so many extra parameters that we have problems with over-fitting.
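The warped likelihood (6) differs from (3) only in that the targets are passed through f and a log-Jacobian term is subtracted. A self-contained sketch, for any monotonic f supplied together with its derivative (names and structure are ours, not the paper's implementation):

```python
import numpy as np

def se_cov(X, v0, v1, r):
    """Squared-exponential covariance matrix (2) for inputs X of shape (N, D)."""
    Z = X / r
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return v1 * np.exp(-0.5 * sq) + v0 * np.eye(len(X))

def warped_gp_nll(t, X, f, dfdt, v0, v1, r):
    """Negative log likelihood (6): the GP likelihood of z = f(t), eq. (4),
    minus the sum of log-derivatives of the warp (the Jacobian term)."""
    z = f(t)
    C = se_cov(X, v0, v1, r)
    Lc = np.linalg.cholesky(C)
    alpha = np.linalg.solve(Lc.T, np.linalg.solve(Lc, z))
    return (np.log(np.diag(Lc)).sum()          # = 0.5 * logdet(C)
            + 0.5 * z @ alpha
            - np.log(dfdt(t)).sum()            # Jacobian term
            + 0.5 * len(t) * np.log(2 * np.pi))
```

With the identity warp this reduces exactly to the ordinary GP likelihood (3), which is a useful correctness check.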
Of course a prior over both $\Theta$ and $\Psi$ may be included to compute a MAP estimate, or indeed the parameters may be integrated out using a hybrid Monte Carlo method.

3.2 Predictions with the warped GP

For a particular setting of the covariance function hyperparameters $\Theta$ (for example $\Theta_{\mathrm{ML}}$ or $\Theta_{\mathrm{MAP}}$), the predictive distribution at a new point in latent variable space is just as for a regular GP: a Gaussian whose mean and variance are calculated as mentioned in section 2,

    P(z_{N+1} \mid x^{(N+1)}, D, \Theta) = \mathcal{N}\left( \hat{z}_{N+1}(\Theta),\; \sigma^2_{N+1}(\Theta) \right).    (7)

To find the distribution in the observation space we pass this Gaussian through the nonlinear warping function, giving

    P(t_{N+1} \mid x^{(N+1)}, D, \Theta, \Psi) = \frac{f'(t_{N+1})}{\sqrt{2\pi \sigma^2_{N+1}}} \exp\left[ -\frac{1}{2} \left( \frac{f(t_{N+1}) - \hat{z}_{N+1}}{\sigma_{N+1}} \right)^2 \right].    (8)

The shape of this distribution depends on the form of the warping function f, but in general it may be asymmetric and multimodal.

If we require a point prediction to be made, rather than the whole distribution over $t_{N+1}$, then the value we will predict depends on our loss function. If our loss function is absolute error, then the median of the distribution should be predicted, whereas if our loss function is squared error, then it is the mean of the distribution. For a standard GP, where the predictive distribution is Gaussian, the median and mean lie at the same point. For the warped GP in general they are at different points. The median is particularly easy to calculate:

    t^{\mathrm{med}}_{N+1} = f^{-1}(\hat{z}_{N+1}).    (9)

Notice that we need to compute the inverse warping function. In general we are unlikely to have an analytical form for $f^{-1}$, because we have parameterised the function in the opposite direction.
However, since we have access to the derivatives of f, a few iterations of Newton-Raphson with a good enough starting point suffice.

It is often useful to give an indication of the shape and range of the distribution by giving the positions of various 'percentiles'. For example, we may want to know the positions of '2σ' either side of the median, so that we can say that approximately 95% of the density lies between these bounds. These points in observation space are calculated in exactly the same way as the median: simply pass the values through the inverse function,

    t^{\mathrm{med} \pm 2\sigma}_{N+1} = f^{-1}(\hat{z}_{N+1} \pm 2\sigma_{N+1}).    (10)

To calculate the mean, we need to integrate $t_{N+1}$ over the density (8). Rewriting this integral back in latent space we get

    E(t_{N+1}) = \int \mathrm{d}z\, f^{-1}(z)\, \mathcal{N}_z(\hat{z}_{N+1}, \sigma^2_{N+1}) = E(f^{-1}).    (11)

This is a simple one-dimensional integral under a Gaussian density, so Gauss-Hermite quadrature may be used to compute it accurately with a weighted sum of a small number of evaluations of the inverse function $f^{-1}$ at appropriate places.

3.3 Choosing a monotonic warping function

We wish to design a warping function that will allow for complex transformations, but we must constrain the function to be monotonic. There are various ways to do this, an obvious one being a neural-net style sum of tanh functions,

    f(t; \Psi) = \sum_{i=1}^{I} a_i \tanh\left( b_i (t + c_i) \right), \quad a_i, b_i \ge 0 \;\; \forall i,    (12)

where $\Psi = \{a, b, c\}$. This produces a series of smooth steps, with a controlling the size of the steps, b controlling their steepness, and c their positions.
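As a concrete sketch of the sum-of-tanh warp (12), together with the derivative with respect to t that the likelihood (6) requires (function names and parameter values are ours):

```python
import numpy as np

def warp(t, a, b, c):
    """Sum-of-tanh warp (12): f(t) = sum_i a_i * tanh(b_i * (t + c_i)), with a_i, b_i >= 0."""
    t = np.asarray(t, dtype=float)[..., None]   # broadcast t against the I components
    return (a * np.tanh(b * (t + c))).sum(-1)

def warp_grad_t(t, a, b, c):
    """df/dt = sum_i a_i * b_i * sech^2(b_i * (t + c_i)) > 0, so f is monotonic."""
    t = np.asarray(t, dtype=float)[..., None]
    return (a * b / np.cosh(b * (t + c)) ** 2).sum(-1)
```

The derivatives with respect to a, b and c are equally direct, which is all the conjugate gradient training needs.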
Of course the number of steps I needs to be set, and that will depend on how complex a function one wants. The derivatives of this function with respect to either t, or the warping parameters $\Psi$, are easy to compute. In the same spirit, sums of error functions, or sums of logistic functions, would produce a similar series of steps, and so these could be used instead.

Figure 1: A 1D regression task. The dotted lines show the true generating distribution, the dashed lines show a GP's predictions, and the solid lines show the warped GP's predictions. (a) The triplets of lines represent the median and 2σ percentiles in each case. (b) Predictive probability densities at x = -π/4, i.e. a cross-section through (a) at the solid grey line.

The problem with using (12) as it stands is that it is bounded: the inverse function $f^{-1}(z)$ does not exist for values of z outside the range of these bounds. As explained earlier, this will not lead to a proper density in t space, because the density in z space is Gaussian, which covers the whole of the real line. We can fix this up by using instead

    f(t; \Psi) = t + \sum_{i=1}^{I} a_i \tanh\left( b_i (t + c_i) \right), \quad a_i, b_i \ge 0 \;\; \forall i,    (13)

which has linear trends away from the tanh steps.
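Because a warp of the form (13) has unit-slope linear tails, it maps onto the whole real line and is easy to invert numerically, which is what the median (9), the percentiles (10) and the Gauss-Hermite mean (11) of section 3.2 need. A sketch using a single tanh term; the latent prediction (z_hat, sigma) is a made-up value for illustration:

```python
import numpy as np

def f(t):
    """Illustrative warp of the form (13) with one tanh term."""
    return t + np.tanh(t)

def fprime(t):
    return 1.0 + 1.0 / np.cosh(t) ** 2       # f' >= 1, so Newton steps are well behaved

def f_inverse(z, t0=0.0, iters=20):
    """Invert the warp by Newton-Raphson: t <- t - (f(t) - z) / f'(t)."""
    z = np.asarray(z, dtype=float)
    t = np.full_like(z, t0)
    for _ in range(iters):
        t = t - (f(t) - z) / fprime(t)
    return t

z_hat, sigma = 1.3, 0.4                      # latent predictive mean and s.d. (made up)
t_med = f_inverse(z_hat)                      # median, eq. (9)
t_lo = f_inverse(z_hat - 2 * sigma)           # '2 sigma' percentiles, eq. (10)
t_hi = f_inverse(z_hat + 2 * sigma)

# Predictive mean, eq. (11), by Gauss-Hermite quadrature:
# E[g(Z)] for Z ~ N(m, s^2) is (1/sqrt(pi)) * sum_j w_j * g(m + sqrt(2)*s*x_j).
xg, wg = np.polynomial.hermite.hermgauss(20)
t_mean = (wg * f_inverse(z_hat + np.sqrt(2) * sigma * xg)).sum() / np.sqrt(np.pi)
```

A handful of quadrature nodes is enough because the integrand is a smooth one-dimensional function under a Gaussian weight.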
In adding these linear trends in (13), we have restricted ourselves to warping functions with $f' \ge 1$, but because the size parameter $v_1$ of the covariance function is free to vary, the effective gradient can be made arbitrarily small by simply making the range of the data in the latent space arbitrarily big.

A more flexible system of linear trends may be made by including, in addition to the neural-net style function (12), some functions of the form

    \frac{1}{\beta} \log\left[ e^{\beta m_1 (t - d)} + e^{\beta m_2 (t - d)} \right], \quad m_1, m_2 \ge 0.

This function effectively splices two straight lines of gradients $m_1$ and $m_2$ smoothly together at position d, with a 'curvature' parameter $\beta$. The sign of $\beta$ determines whether the join is convex or concave.

4 A simple 1D regression task

A simple 1D regression task was created to show a situation where the warped GP should, and does, perform significantly better than the standard GP. 101 points, regularly spaced from $-\pi$ to $\pi$ on the x axis, were generated with Gaussian noise about a sine function. These points were then warped through the function $t = z^{1/3}$ to arrive at the dataset t, which is shown as the dots in Figure 1(a).

Figure 2: Warping functions learnt (latent z against observed t) for the four regression tasks carried out in this paper: (a) sine, (b) creep, (c) abalone, (d) ailerons. Each plot is made over the range of the observation data, from tmin to tmax.

A GP and a warped GP were trained independently on this dataset using a conjugate gradient minimisation procedure and randomly initialised parameters, to obtain maximum likelihood parameters. For the warped GP, the warping function (13) was used with just two tanh functions. For both models the covariance matrix (2) was used.
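The toy dataset of section 4 is easy to reproduce. A sketch, with the caveat that the noise standard deviation (0.1 here) is our assumption; the paper does not state it:

```python
import numpy as np

# Section 4 toy set: a noisy sine, warped through t = z^(1/3).
rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 101)             # 101 regularly spaced inputs
z = np.sin(x) + 0.1 * rng.standard_normal(101)  # latent targets: sine plus Gaussian noise
t = np.cbrt(z)                                  # observed targets: sign-preserving cube root
```

Fitting a warped GP to (x, t) should then recover a roughly cubic warping function, as Figure 2(a) shows.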
Hybrid Monte Carlo was also implemented to integrate over all the parameters, or over just the warping parameters (much faster, since no matrix inversion is required at each step), but with this dataset (and the real datasets of section 5) no significant differences from ML were found.

Predictions from the GP and warped GP were made, using the ML parameters, for 401 points regularly spaced over the range of x. The predictions made were the median and 2σ percentiles in each case, and these are plotted as triplets of lines in Figure 1(a). The predictions from the warped GP are found to be much closer to the true generating distribution than those of the standard GP, especially with regard to the 2σ lines. The mean line was also computed, and found to lie close to, but slightly skewed from, the median line.

Figure 1(b) emphasises the point that the warped GP captures the shape of the whole predictive distribution much better, not just the median or mean. In this plot, one particular point on the x axis is chosen, x = -π/4, and the predictive densities from the GP and warped GP are plotted alongside the true density (which can be written down analytically). Note that the standard GP must necessarily predict a symmetric Gaussian density, even when the density from which the points are generated is highly asymmetric, as in this case.

Figure 2(a) shows the warping function learnt for this regression task. The tanh functions have adjusted themselves so that they mimic a $t^3$ nonlinearity over the range of the observation space, thus inverting the $z^{1/3}$ transformation imposed when generating the data.

5 Results for some real datasets

It is not surprising that the method works well on the toy dataset of section 4, since it was generated from a known nonlinear warping of a smooth function with Gaussian noise.
To demonstrate that nonlinear transformations also help on real datasets, we have run the warped GP, comparing its predictions to an ordinary GP, on three regression problems. These datasets are summarised in the following table, which shows the range of the targets (tmin, tmax), the number of input dimensions (D), and the sizes of the training and test sets (Ntrain, Ntest) that we used.

    Dataset    D    tmin            tmax            Ntrain   Ntest
    creep      30   18 MPa          530 MPa         800      1266
    abalone    8    1 yr            29 yrs          1000     3177
    ailerons   40   -3.0 x 10^-3    -3.5 x 10^-4    1000     6154

    Dataset    Model       Absolute error   Squared error   -log P(t)
    creep      GP          16.4             654             4.46
               GP + log    15.6             587             4.24
               warped GP   15.0             554             4.19
    abalone    GP          1.53             4.79            2.19
               GP + log    1.48             4.62            2.01
               warped GP   1.47             4.63            1.96
    ailerons   GP          1.23 x 10^-4     3.05 x 10^-8    -7.31
               warped GP   1.18 x 10^-4     2.72 x 10^-8    -7.45

Table 1: Results of testing the GP, warped GP, and GP with log transform on three real datasets. The units for absolute error and squared error are as for the original data.

The dataset creep is a materials science set, with the objective of predicting creep rupture stress (in MPa) for steel given chemical composition and other inputs [7, 8]. With abalone the aim is to predict the age of abalone from various physical inputs [9]. ailerons is a simulated control problem, with the aim of predicting the control action on the ailerons of an F16 aircraft [10, 11].

For datasets creep and abalone, which consist of positive observations only, standard practice may be to model the log of the data with a GP. So for these datasets we have compared three models: a GP directly on the data, a GP on the fixed log-transformed data, and the warped GP directly on the data.
The predictive points and densities were always compared in the original data space, accounting for the Jacobian of both the log and the learnt warping transforms. The models were run as in the 1D task: ML parameter estimates only, covariance matrix (2), and warping function (13) with three tanh functions.

The results we obtain for the three datasets are shown in Table 1. We show three measures of performance over independent test sets: mean absolute error, mean squared error, and the mean negative log predictive density evaluated at the test points. This final measure was included to give some idea of how well the model predicts the entire density, not just point predictions.

On these three sets, the warped GP always performs significantly better than the standard GP. For creep and abalone, the fixed log transform clearly works well too, but, particularly in the case of creep, the warped GP learns a better transformation. Figure 2 shows the warping functions learnt, and indeed 2(b) and 2(c) are clearly log-like in character. On the other hand 2(d), for the ailerons set, is exponential-like. This shows that the warped GP is able to handle these different types of datasets flexibly. The shapes of the learnt warping functions were also found to be very robust to random initialisation of the parameters. Finally, the warped GP also does a better job of predicting the distributions, as shown by the difference in values of the negative log density.

6 Conclusions, extensions, and related work

We have shown that the warped GP is a useful extension of the standard GP for regression, capable of finding extra structure in the data through the transformations it learns. From another viewpoint, it allows standard preprocessing transforms, such as log, to be discovered automatically and improved on, rather than applied in an ad-hoc manner.
We have demonstrated an improvement in performance over the regular GP on several datasets.

Of course, some datasets are already well modelled by a GP, and applying the warped GP model simply results in a linear "warping" function. It has also been found that censored datasets, i.e. ones in which many observations at the edge of the range lie on a single point, cause the warped GP problems. The warping function attempts to model the censoring by pushing those points far away from the rest of the data, and performance suffers, especially under ML learning. To deal with this properly, a censorship model is required.

As a further extension, one might consider warping the input space in some nonlinear fashion. In the context of geostatistics this has actually been dealt with by O'Hagan [12], where a transformation is made from an input space which can have non-stationary and non-isotropic covariance structure to a latent space in which the usual conditions of stationarity and isotropy hold.

Gaussian process classifiers can also be thought of as warping the outputs of a GP, through a mapping onto the (0, 1) probability interval. However, the observations in classification are discrete, not points in this warped continuous space, and therefore the likelihood is different. Diggle et al. [13] consider various other fixed nonlinear transformations of GP outputs.

It should be emphasised that the presented method can be beneficial in situations where the noise variance depends on the output value. Gaussian processes where the noise variance depends on the inputs have been examined by e.g. [5]. Forms of non-Gaussianity which do not directly depend on the output values (such as heavy-tailed noise) are also not captured by the method proposed here. We propose that the current method be used in conjunction with methods targeted directly at these other issues.
The force of the method is that it is powerful, yet very easy and computationally cheap to apply.

Acknowledgements. Many thanks to David MacKay for useful discussions and for suggestions of warping functions and datasets to try. CER was supported by the German Research Council (DFG) through grant RA 1030/1.

References

[1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, 1996.

[2] C. E. Rasmussen. Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression. PhD thesis, University of Toronto, 1996.

[3] M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Cambridge University, 1997.

[4] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, NATO ASI Series, pages 133-166. Kluwer Academic Press, 1998.

[5] P. W. Goldberg, C. K. I. Williams, and C. M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In Advances in Neural Information Processing Systems 10. MIT Press, 1998.

[6] R. M. Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report 9702, University of Toronto, 1997.

[7] Materials algorithms project (MAP) program and data library. http://www.msm.cam.ac.uk/map/entry.html.

[8] D. Cole, C. Martin-Moran, A. G. Sheard, H. K. D. H. Bhadeshia, and D. J. C. MacKay. Modelling creep rupture strength of ferritic steel welds. Science and Technology of Welding and Joining, 5:81-90, 2000.

[9] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.

[10] L. Torgo. http://www.liacc.up.pt/~ltorgo/Regression/.

[11] R. Camacho.
Inducing models of human control skills. PhD thesis, University of Porto, 2000.

[12] A. O'Hagan and A. M. Schmidt. Bayesian inference for nonstationary spatial covariance structure via spatial deformations. Technical Report 498/00, University of Sheffield, 2000.

[13] P. J. Diggle, J. A. Tawn, and R. A. Moyeed. Model-based geostatistics. Applied Statistics, 1998.