Part of Advances in Neural Information Processing Systems 10 (NIPS 1997)
Paul Goldberg, Christopher Williams, Christopher Bishop
Gaussian processes provide natural non-parametric prior distributions over regression functions. In this paper we consider regression problems where there is noise on the output, and the variance of the noise depends on the inputs. If we assume that the noise is a smooth function of the inputs, then it is natural to model the noise variance using a second Gaussian process, in addition to the Gaussian process governing the noise-free output value. We show that prior uncertainty about the parameters controlling both processes can be handled and that the posterior distribution of the noise rate can be sampled from using Markov chain Monte Carlo methods. Our results on a synthetic data set give a posterior noise variance that well-approximates the true variance.
1 Background and Motivation
A very natural approach to regression problems is to place a prior on the kinds of function that we expect, and then after observing the data to obtain a posterior. The prior can be obtained by placing prior distributions on the weights in a neural
network, although we would argue that it is perhaps more natural to place priors directly over functions. One tractable way of doing this is to create a Gaussian process prior. This has the advantage that predictions can be made from the posterior using only matrix multiplication for fixed hyperparameters and a global noise level. In contrast, for neural networks (with fixed hyperparameters and a global noise level) it is necessary to use approximations or Markov chain Monte Carlo (MCMC) methods. Rasmussen (1996) has demonstrated that predictions obtained with Gaussian processes are as good as or better than other state-of-the-art predictors.
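To make this concrete, here is a minimal sketch (not code from the paper) of Gaussian process prediction with fixed hyperparameters and a single global noise variance: the posterior mean and variance at a test input follow from matrix computations involving the training covariance matrix. The squared-exponential kernel and all parameter values below are illustrative assumptions.

```python
# Minimal sketch of GP regression with a global noise level (illustrative only).
import numpy as np

def rbf_kernel(A, B, v=1.0, w=10.0):
    """Squared-exponential covariance between the rows of A and B (assumed form)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return v * np.exp(-0.5 * w * d2)

def gp_predict(X, y, X_star, sigma2=0.01):
    """Posterior mean and variance at X_star, using only matrix operations."""
    K = rbf_kernel(X, X) + sigma2 * np.eye(len(X))   # covariance of noisy targets
    K_star = rbf_kernel(X_star, X)                   # test/train cross-covariance
    alpha = np.linalg.solve(K, y)                    # K^{-1} y
    mean = K_star @ alpha
    v = np.linalg.solve(K, K_star.T)
    var = rbf_kernel(X_star, X_star).diagonal() - np.sum(K_star * v.T, axis=1)
    return mean, var

X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.randn(20)
mean, var = gp_predict(X, y, np.array([[0.25], [0.75]]))
```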
In much of the work on regression problems in the statistical and neural networks literatures, it is assumed that there is a global noise level, independent of the input vector x. The book by Bishop (1995) and the papers by Bishop (1994), MacKay (1995) and Bishop and Qazaz (1997) have examined the case of input-dependent noise for parametric models such as neural networks. (Such models are said to be heteroscedastic in the statistics literature.) In this paper we develop the treatment of an input-dependent noise model for Gaussian process regression, where the noise is assumed to be Gaussian but its variance depends on x. As the noise level is non-negative we place a Gaussian process prior on the log noise level. Thus there are two Gaussian processes involved in making predictions: the usual Gaussian process for predicting the function values (the y-process), and another one (the z-process) for predicting the log noise level. Below we present a Markov chain Monte Carlo method for carrying out inference with this model and demonstrate its performance on a test problem.
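The way the two processes interact can be sketched as follows; this is only an illustration under the assumption that the noise variance at input x_i is exp(z(x_i)), so that, conditioned on a sample z of the log noise levels at the training inputs, the covariance of the noisy targets is the y-process covariance plus a diagonal term. The function and variable names are hypothetical.

```python
# Illustrative only: covariance of the observed targets given a sample z of
# log noise levels at the training inputs (noise variance r_i = exp(z_i)).
import numpy as np

def noisy_target_covariance(K_y, z):
    """K_y: prior covariance of the noise-free y-process at the training inputs.
    z: sampled log noise levels at the same inputs (from the z-process)."""
    return K_y + np.diag(np.exp(z))

n = 5
K_y = np.eye(n) + 0.1 * np.ones((n, n))   # placeholder y-process covariance (positive definite)
z = np.linspace(-3.0, -1.0, n)            # log noise level grows with the input index here
K_t = noisy_target_covariance(K_y, z)
```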
1.1 Gaussian processes
A stochastic process is a collection of random variables {Y(x) : x ∈ X} indexed by a set X. Often X will be a space such as R^d for some dimension d, although it could be more general. The stochastic process is specified by giving the probability distribution for every finite subset of variables Y(x_1), ..., Y(x_k) in a consistent manner. A Gaussian process is a stochastic process which can be fully specified by its mean function μ(x) = E[Y(x)] and its covariance function C(x, x') = E[(Y(x) − μ(x))(Y(x') − μ(x'))]; any finite set of points will have a joint multivariate Gaussian distribution. Below we consider Gaussian processes which have μ(x) ≡ 0. This assumes that any known offset or trend in the data has been removed. A non-zero μ(x) is easily incorporated into the framework at the expense of extra notational complexity.
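As a small illustration of this finite-dimensional view, the sketch below (with an assumed squared-exponential covariance function and arbitrary parameter values) draws one sample of Y at a grid of inputs from the corresponding zero-mean multivariate Gaussian.

```python
# Sketch: a zero-mean GP restricted to a finite set of inputs is a multivariate
# Gaussian with covariance matrix C[i, j] = C(x_i, x_j). Covariance form assumed.
import numpy as np

def cov_fn(x, x_prime, v=1.0, w=10.0):
    return v * np.exp(-0.5 * w * (x - x_prime) ** 2)

xs = np.linspace(0.0, 1.0, 50)
C = np.array([[cov_fn(a, b) for b in xs] for a in xs])
# Small diagonal jitter keeps C numerically positive definite.
sample = np.random.multivariate_normal(np.zeros(len(xs)), C + 1e-8 * np.eye(len(xs)))
```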
A covariance function is used to define a Gaussian process; it is a parametrised function from pairs of x-values to their covariance. The form of the covariance function that we shall use for the prior over functions is given by
C_y(x^{(i)}, x^{(j)}) = v_y \exp\left( -\frac{1}{2} \sum_{l=1}^{d} w_{yl} \left( x_l^{(i)} - x_l^{(j)} \right)^2 \right) + J_y \, \delta(i, j)
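For concreteness, the covariance function above can be transcribed directly as below; the parameter values (v_y, the w_yl, and the jitter J_y) are placeholders rather than values used in the paper.

```python
# Direct (illustrative) transcription of the covariance function C_y above.
import numpy as np

def cov_y(x_i, x_j, i, j, v_y=1.0, w_y=None, J_y=1e-6):
    """v_y sets the overall scale, w_yl the inverse squared length scale per input
    dimension, and J_y * delta(i, j) adds jitter on the diagonal."""
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    if w_y is None:
        w_y = np.ones_like(x_i)
    expo = -0.5 * np.sum(w_y * (x_i - x_j) ** 2)
    return v_y * np.exp(expo) + (J_y if i == j else 0.0)

# Covariance matrix over a small set of 2-D inputs.
X = np.array([[0.0, 0.0], [0.5, 0.1], [1.0, 0.2]])
K = np.array([[cov_y(X[p], X[q], p, q) for q in range(len(X))] for p in range(len(X))])
```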