{"title": "Gaussian Processes for Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 514, "page_last": 520, "abstract": null, "full_text": "Gaussian Processes for Regression \n\nChristopher K. I. Williams \nNeural Computing Research Group \nAston University \nBirmingham B4 7ET, UK \nc.k.i.williams@aston.ac.uk \n\nCarl Edward Rasmussen \nDepartment of Computer Science \nUniversity of Toronto \nToronto, ONT, M5S 1A4, Canada \ncarl@cs.toronto.edu \n\nAbstract \n\nThe Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior distribution over functions. In this paper we investigate the use of Gaussian process priors over functions, which permit the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters, have been tested on a number of challenging problems and have produced excellent results. \n\n1 INTRODUCTION \n\nIn the Bayesian approach to neural networks a prior distribution over the weights induces a prior distribution over functions. This prior is combined with a noise model, which specifies the probability of observing the targets t given function values y, to yield a posterior over functions which can then be used for predictions. For neural networks the prior over functions has a complex form, which means that implementations must either make approximations (e.g. MacKay, 1992) or use Monte Carlo approaches to evaluating integrals (Neal, 1993). \n\nAs Neal (1995) has argued, there is no reason to believe that, for real-world problems, neural network models should be limited to nets containing only a \"small\" number of hidden units. 
He has shown that it is sensible to consider a limit where the number of hidden units in a net tends to infinity, and that good predictions can be obtained from such models using the Bayesian machinery. He has also shown that a large class of neural network models will converge to a Gaussian process prior over functions in the limit of an infinite number of hidden units. \n\nIn this paper we use Gaussian processes specified parametrically for regression problems. The advantage of the Gaussian process formulation is that the combination of the prior and noise models can be carried out exactly using matrix operations. We also show how the hyperparameters which control the form of the Gaussian process can be estimated from the data, using either a maximum likelihood or Bayesian approach, and that this leads to a form of \"Automatic Relevance Determination\" (MacKay, 1993; Neal, 1995). \n\n2 PREDICTION WITH GAUSSIAN PROCESSES \n\nA stochastic process is a collection of random variables {Y(x) | x ∈ X} indexed by a set X. In our case X will be the input space with dimension d, the number of inputs. The stochastic process is specified by giving the probability distribution for every finite subset of variables Y(x(1)), ..., Y(x(k)) in a consistent manner. A Gaussian process is a stochastic process which can be fully specified by its mean function μ(x) = E[Y(x)] and its covariance function C(x, x') = E[(Y(x) - μ(x))(Y(x') - μ(x'))]; any finite set of points will have a joint multivariate Gaussian distribution. Below we consider Gaussian processes which have μ(x) ≡ 0. \n\nIn section 2.1 we will show how to parameterise covariances using hyperparameters; for now we consider the form of the covariance C as given. The training data consists of n pairs of inputs and targets {(x(i), t(i)), i = 1, ..., n}. 
The input vector for a test case is denoted x (with no superscript). The inputs are d-dimensional, x_1, ..., x_d, and the targets are scalar. \n\nThe predictive distribution for a test case x is obtained from the (n + 1)-dimensional joint Gaussian distribution for the outputs of the n training cases and the test case, by conditioning on the observed targets in the training set. This procedure is illustrated in Figure 1, for the case where there is one training point and one test point. In general, the predictive distribution is Gaussian with mean and variance \n\nk^T(x) K^{-1} t,   (1) \nC(x, x) - k^T(x) K^{-1} k(x),   (2) \n\nwhere k(x) = (C(x, x(1)), ..., C(x, x(n)))^T, K is the covariance matrix for the training cases, K_ij = C(x(i), x(j)), and t = (t(1), ..., t(n))^T. \n\nThe matrix inversion step in equations (1) and (2) implies that the algorithm has O(n^3) time complexity (if standard methods of matrix inversion are employed); for a few hundred data points this is certainly feasible on workstation computers, although for larger problems some iterative methods or approximations may be needed. \n\n2.1 PARAMETERIZING THE COVARIANCE FUNCTION \n\nThere are many choices of covariance function which may be reasonable. Formally, we are required to specify functions which will generate a non-negative definite covariance matrix for any set of points (x(1), ..., x(k)). From a modelling point of view we wish to specify covariances so that points with nearby inputs will give rise to similar predictions. We find that the following covariance function works well: \n\nC(x(i), x(j)) = v_0 exp{ -1/2 sum_{l=1}^d w_l (x_l^(i) - x_l^(j))^2 } + a_0 + a_1 sum_{l=1}^d x_l^(i) x_l^(j) + v_1 delta(i, j),   (3) \n\nFigure 1: An illustration of prediction using a Gaussian process. There is one training case (x(1), t(1)) and one test case for which we wish to predict y. 
The ellipse in the left-hand plot is the one standard deviation contour of the joint distribution of y_1 and y. The dotted line represents an observation y_1 = t(1). In the right-hand plot we see the distribution of the output for the test case, obtained by conditioning on the observed target. The y axes have the same scale in both plots. \n\nwhere θ = log(v_0, v_1, w_1, ..., w_d, a_0, a_1) plays the role of hyperparameters[1]. We define the hyperparameters to be the log of the variables in equation (3) since these are positive scale parameters. \n\nThe covariance function is made up of three parts: the first (local correlation) term, a linear regression term (involving a_0 and a_1), and a noise term v_1 delta(i, j). The first term expresses the idea that cases with nearby inputs will have highly correlated outputs; the w_l parameters allow a different distance measure for each input dimension. For irrelevant inputs, the corresponding w_l will become small, and the model will ignore that input. This is closely related to the Automatic Relevance Determination (ARD) idea of MacKay and Neal (MacKay, 1993; Neal, 1995). The v_0 variable gives the overall scale of the local correlations. This covariance function is valid for all input dimensionalities, in contrast to splines, where the integrated squared mth derivative is only a valid regularizer for 2m > d (see Wahba, 1990). a_0 and a_1 are variables controlling the scale of the bias and linear contributions to the covariance. The last term accounts for the noise on the data; v_1 is the variance of the noise. \n\nGiven a covariance function, the log likelihood of the training data is given by \n\nl = -1/2 log det K - 1/2 t^T K^{-1} t - n/2 log 2π.   (4) \n\nIn section 3 we will discuss how the hyperparameters in C can be adapted in response to the training data. 
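For concreteness, equations (1)-(4) can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used for the experiments, and the hyperparameter values in the usage example are arbitrary:

```python
import numpy as np

def cov_matrix(X, v0, w, a0, a1, v1):
    """K_ij = C(x(i), x(j)) from equation (3)."""
    sq = np.sum(w * (X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return v0 * np.exp(-0.5 * sq) + a0 + a1 * (X @ X.T) + v1 * np.eye(len(X))

def cov_vector(X, x, v0, w, a0, a1):
    """k(x) = (C(x, x(1)), ..., C(x, x(n)))^T; the delta term vanishes."""
    sq = np.sum(w * (X - x) ** 2, axis=-1)
    return v0 * np.exp(-0.5 * sq) + a0 + a1 * (X @ x)

def predict(X, t, x, v0, w, a0, a1, v1):
    """Predictive mean (1) and variance (2); C(x, x) here includes the noise."""
    K = cov_matrix(X, v0, w, a0, a1, v1)
    k = cov_vector(X, x, v0, w, a0, a1)
    mean = k @ np.linalg.solve(K, t)                 # equation (1)
    c_xx = v0 + a0 + a1 * (x @ x) + v1
    var = c_xx - k @ np.linalg.solve(K, k)           # equation (2)
    return mean, var

def log_likelihood(X, t, v0, w, a0, a1, v1):
    """l = -1/2 log det K - 1/2 t^T K^-1 t - n/2 log 2 pi, equation (4)."""
    K = cov_matrix(X, v0, w, a0, a1, v1)
    _, logdet = np.linalg.slogdet(K)
    return (-0.5 * logdet - 0.5 * t @ np.linalg.solve(K, t)
            - 0.5 * len(X) * np.log(2 * np.pi))

# One-dimensional toy data: predicting at a training input should nearly
# interpolate the target when the noise variance v1 is small.
X = np.array([[-1.0], [0.0], [1.0]])
t = np.array([0.5, 1.0, 0.5])
m, s2 = predict(X, t, np.array([0.0]),
                v0=1.0, w=np.array([1.0]), a0=0.1, a1=0.1, v1=0.01)
```

The O(n^3) cost noted above shows up in the `solve` and `slogdet` calls; the same `log_likelihood` value is what the training procedures of section 3 adapt the hyperparameters against.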
\n\n[1] We call θ the hyperparameters as they correspond closely to hyperparameters in neural networks; in effect the weights have been integrated out exactly. \n\n2.2 RELATIONSHIP TO PREVIOUS WORK \n\nThe Gaussian process view provides a unifying framework for many regression methods. ARMA models used in time series analysis and spline smoothing (e.g. Wahba, 1990, and earlier references therein) correspond to Gaussian process prediction with a particular choice of covariance function[2]. Gaussian processes have also been used in the geostatistics field (e.g. Cressie, 1993), where the technique is known as \"kriging\", but this literature has concentrated on the case where the input space is two or three dimensional, rather than considering more general input spaces. \n\nThis work is similar to Regularization Networks (Poggio and Girosi, 1990; Girosi, Jones and Poggio, 1995), except that their derivation uses a smoothness functional rather than the equivalent covariance function. Poggio et al. suggested that the hyperparameters be set by cross-validation. The main contributions of this paper are to emphasize that a maximum likelihood solution for θ is possible, to recognize the connections to ARD, and to use the Hybrid Monte Carlo method in the Bayesian treatment (see section 3). \n\n3 TRAINING A GAUSSIAN PROCESS \n\nThe partial derivatives of the log likelihood l of the training data with respect to all the hyperparameters can be computed using matrix operations, and take time O(n^3). In this section we present two methods which can be used to adapt the hyperparameters using these derivatives. \n\n3.1 MAXIMUM LIKELIHOOD \n\nIn a maximum likelihood framework, we adjust the hyperparameters so as to maximize the likelihood of the training data. 
We initialize the hyperparameters to random values (in a reasonable range) and then use an iterative method, for example conjugate gradients, to search for optimal values of the hyperparameters. Since there are only a small number of hyperparameters (d + 4), a relatively small number of iterations is usually sufficient for convergence. However, we have found that this approach is sometimes susceptible to local minima, so it is advisable to try a number of random starting positions in hyperparameter space. \n\n3.2 INTEGRATION VIA HYBRID MONTE CARLO \n\nAccording to the Bayesian formalism, we should start with a prior distribution P(θ) over the hyperparameters which is modified using the training data D to produce a posterior distribution P(θ|D). To make predictions we then integrate over the posterior; for example, the predicted mean y(x) for test input x is given by \n\ny(x) = ∫ y_θ(x) P(θ|D) dθ   (5) \n\nwhere y_θ(x) is the predicted mean (as given by equation 1) for a particular value of θ. It is not feasible to do this integration analytically, but the Markov chain Monte Carlo method of Hybrid Monte Carlo (HMC) (Duane et al., 1987) seems promising for this application. We assign broad Gaussian priors to the hyperparameters, and use Hybrid Monte Carlo to give us samples from the posterior. \n\nHMC works by creating a fictitious dynamical system in which the hyperparameters are regarded as position variables, and augmenting these with momentum variables p. The purpose of the dynamical system is to give the hyperparameters \"inertia\" so that random-walk behaviour in θ-space can be avoided. The total energy, H, of the system is the sum of the kinetic energy, K (a function of the momenta), and the potential energy, E. The potential energy is defined such that P(θ|D) ∝ exp(-E). 
\n[2] Technically splines require generalized covariance functions. \n\nWe sample from the joint distribution for θ and p given by P(θ, p) ∝ exp(-E - K); the marginal of this distribution for θ is the required posterior. A sample of hyperparameters from the posterior can therefore be obtained by simply ignoring the momenta. \n\nSampling from the joint distribution is achieved by two steps: (i) finding new points in phase space with near-identical energies H by simulating the dynamical system using a discretised approximation to Hamiltonian dynamics, and (ii) changing the energy H by doing Gibbs sampling for the momentum variables. \n\nHamiltonian Dynamics \n\nHamilton's first order differential equations for H are approximated by a discrete step (specifically using the leapfrog method). The derivatives of the likelihood (equation 4) enter through the derivative of the potential energy. This proposed state is then accepted or rejected using the Metropolis rule depending on the final energy H* (which is not necessarily equal to the initial energy H because of the discretization). The same step size ε is used for all hyperparameters, and should be as large as possible while keeping the rejection rate low. \n\nGibbs Sampling for Momentum Variables \n\nThe momentum variables are updated using a modified version of Gibbs sampling, thereby allowing the energy H to change. A \"persistence\" of 0.95 is used; the new value of the momentum is a weighted sum of the previous value (with weight 0.95) and the value obtained by Gibbs sampling (with weight (1 - 0.95^2)^{1/2}). With this form of persistence, the momenta change approximately twenty times more slowly, thus increasing the \"inertia\" of the hyperparameters, so as to further help in avoiding random walks. Larger values of the persistence will further increase the inertia, but reduce the rate of exploration of H. 
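The two steps above can be sketched as follows. This is a bare-bones illustration, not the authors' implementation: to keep it self-contained, the potential energy is E(θ) = θ²/2 (a standard Gaussian), standing in for the negative log posterior over the GP hyperparameters, and the momentum is reversed on rejection, as in persistent-momentum variants of HMC:

```python
import numpy as np

def E(theta):
    """Toy potential energy: standard Gaussian, so P(theta) ~ exp(-theta^2/2)."""
    return 0.5 * np.sum(theta ** 2)

def dE(theta):
    return theta

def hmc(n_samples, eps=0.05, n_leap=20, persistence=0.95, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(1)
    p = rng.standard_normal(1)
    samples = []
    for _ in range(n_samples):
        # (ii) Gibbs-style partial momentum refresh with persistence 0.95.
        p = persistence * p + np.sqrt(1.0 - persistence ** 2) * rng.standard_normal(1)
        # (i) Leapfrog: half momentum step, alternating full steps,
        # final half momentum step.
        th, mom = theta.copy(), p.copy()
        mom = mom - 0.5 * eps * dE(th)
        for _ in range(n_leap):
            th = th + eps * mom
            mom = mom - eps * dE(th)
        mom = mom + 0.5 * eps * dE(th)
        # Metropolis rule on the total energy H = E + K.
        h_old = E(theta) + 0.5 * np.sum(p ** 2)
        h_new = E(th) + 0.5 * np.sum(mom ** 2)
        if rng.random() < np.exp(min(0.0, h_old - h_new)):
            theta, p = th, mom
        else:
            p = -p  # reverse momentum on rejection to preserve detailed balance
        samples.append(theta[0])
    return np.array(samples)

samples = hmc(5000)
```

Because the leapfrog step only approximately conserves H, the Metropolis test corrects for the discretization error; with the small ε used here nearly every trajectory is accepted, mirroring the low rejection rates reported below.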
\n\nPractical Details \n\nThe priors over hyperparameters are set to be Gaussian with a mean of -3 and a standard deviation of 3. In all our simulations a step size ε = 0.05 produced a very low rejection rate (< 1%). The hyperparameters corresponding to v_1 and to the w_l's were initialised to -2 and the rest to 0. \n\nTo apply the method we first rescale the inputs and outputs so that they have a mean of zero and a variance of one on the training set. The sampling procedure is run for the desired amount of time, saving the values of the hyperparameters 200 times during the last two-thirds of the run. The first third of the run is discarded; this \"burn-in\" is intended to give the hyperparameters time to come close to their equilibrium distribution. The predictive distribution is then a mixture of 200 Gaussians. For a squared error loss, we use the mean of this distribution as a point estimate. The width of the predictive distribution tells us the uncertainty of the prediction. \n\n4 EXPERIMENTAL RESULTS \n\nWe report the results of prediction with Gaussian processes on (i) a modified version of MacKay's robot arm problem and (ii) five real-world data sets. \n\n4.1 THE ROBOT ARM PROBLEM \n\nWe consider a version of MacKay's robot arm problem introduced by Neal (1995). The standard robot arm problem is concerned with the mappings \n\ny_1 = r_1 cos(x_1) + r_2 cos(x_1 + x_2),   y_2 = r_1 sin(x_1) + r_2 sin(x_1 + x_2).   (6) \n\nMethod             No. of inputs   Sum squared test error \nGaussian process   2               1.126 \nGaussian process   6               1.138 \nMacKay             2               1.146 \nNeal               2               1.094 \nNeal               6               1.098 \n\nTable 1: Results on the robot arm task. The bottom three lines of data were obtained from Neal (1995). The MacKay result is the test error for the net with highest \"evidence\". 
\n\nThe data was generated by picking Xl uniformly from [-1.932, -0.453] and [0.453, \n1.932] and picking X2 uniformly from [0 .534, 3.142]. Neal added four further inputs, \ntwo of which were copies of Xl and X2 corrupted by additive Gaussian noise of \nstandard deviation 0.02, and two further irrelevant Gaussian-noise inputs with zero \nmean and unit variance. Independent zero-mean Gaussian noise of variance 0.0025 \nwas then added to the outputs YI and Y2 . We used the same datasets as Neal and \nMacKay, with 200 examples in the training set and 200 in the test set . \n\nThe theory described in section 2 deals only with the prediction of a scalar quantity \nY , so predictors were constructed for the two outputs separately, although a joint \nprediction is possible within the Gaussian process framework (see co-kriging, \u00a73.2.3 \nin Cressie, 1993). \n\nTwo experiments were conducted, the first using only the two \"true\" inputs, and \nthe second one using all six inputs. In this section we report results using max(cid:173)\nimum likelihood training; similar results were obtained with HMC . The log( v),s \nand loge w )'s were all initialized to values chosen uniformly from [-3.0, 0.0], and \nwere adapted separately for the prediction of YI and Y2 (in these early experiments \nthe linear regression terms in the covariance function involving aa and al were not \npresent) . The conjugate gradient search algorithm was allowed to run for 100 iter(cid:173)\nations, by which time the likelihood was changing very slowly. Results are reported \nfor the run which gave the highest likelihood of the training data, although in fact \nall runs performed very similarly. The results are shown in Table 1 and are encour(cid:173)\naging, as they indicate that the Gaussian process approach is giving very similar \nperformance to two well-respected techniques. All of the methods obtain a level of \nperformance which is quite close to the theoretical minimum error level of 1.0 . 
It is interesting to look at the values of the w's obtained after the optimization; for the y_2 task the values were 0.243, 0.237, 0.0639, 7.0 × 10^-4, 2.32 × 10^-6 and 1.70 × 10^-6, and v_0 and v_1 were 7.5278 and 0.0022 respectively. The w values show nicely that the first two inputs are the most important, followed by the corrupted inputs and then the irrelevant inputs. During training the irrelevant inputs are detected quite quickly, but the w's for the corrupted inputs shrink more slowly, implying that the input noise has relatively little effect on the likelihood. \n\n4.2 FIVE REAL-WORLD PROBLEMS \n\nGaussian processes as described above were compared to several other regression algorithms on five real-world data sets in (Rasmussen, 1996; in this volume). The data sets had between 80 and 256 training examples, and the input dimension ranged from 6 to 16. The length of the HMC sampling for the Gaussian processes was from 7.5 minutes for the smallest training set size up to 1 hour for the largest ones on an R4400 machine. The results rank the methods in the order (lowest error first): a full-blown Bayesian treatment of neural networks using HMC, Gaussian processes, ensembles of neural networks trained using cross validation and weight decay, the Evidence framework for neural networks (MacKay, 1992), and MARS. We are currently working on assessing the statistical significance of this ordering. \n\n5 DISCUSSION \n\nWe have presented the method of regression with Gaussian processes, and shown that it performs well on a suite of real-world problems. \n\nWe have also conducted some experiments on the approximation of neural nets (with a finite number of hidden units) by Gaussian processes, although space limitations do not allow these to be described here. 
Some other directions currently under investigation include (i) the use of Gaussian processes for classification problems by softmaxing the outputs of k regression surfaces (for a k-class classification problem), (ii) using non-stationary covariance functions, so that C(x, x') ≠ C(|x - x'|), and (iii) using a covariance function containing a sum of two or more terms of the form given by the first (exponential) term of equation 3. \n\nWe hope to make our code for Gaussian process prediction publicly available in the near future. Check http://www.cs.utoronto.ca/neuron/delve/delve.html for details. \n\nAcknowledgements \n\nWe thank Radford Neal for many useful discussions, David MacKay for generously providing the robot arm data used in this paper, and Chris Bishop, Peter Dayan, Radford Neal and Huaiyu Zhu for comments on earlier drafts. CW was partially supported by EPSRC grant GR/J75425. \n\nReferences \n\nCressie, N. A. C. (1993). Statistics for Spatial Data. Wiley. \n\nDuane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195:216-222. \n\nGirosi, F., Jones, M., and Poggio, T. (1995). Regularization Theory and Neural Networks Architectures. Neural Computation, 7(2):219-269. \n\nMacKay, D. J. C. (1992). A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 4(3):448-472. \n\nMacKay, D. J. C. (1993). Bayesian Methods for Backpropagation Networks. In van Hemmen, J. L., Domany, E., and Schulten, K., editors, Models of Neural Networks II. Springer. \n\nNeal, R. M. (1993). Bayesian Learning via Stochastic Dynamics. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Neural Information Processing Systems, Vol. 5, pages 475-482. Morgan Kaufmann, San Mateo, CA. \n\nNeal, R. M. (1995). Bayesian Learning for Neural Networks. PhD thesis, Dept. of Computer Science, University of Toronto. \n\nPoggio, T. and Girosi, F. (1990). 
Networks for approximation and learning. Proceedings of the IEEE, 78:1481-1497. \n\nRasmussen, C. E. (1996). A Practical Monte Carlo Implementation of Bayesian Learning. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8. MIT Press. \n\nWahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics. CBMS-NSF Regional Conference Series in Applied Mathematics. \n", "award": [], "sourceid": 1048, "authors": [{"given_name": "Christopher", "family_name": "Williams", "institution": null}, {"given_name": "Carl", "family_name": "Rasmussen", "institution": null}]}