{"title": "Regression with Input-dependent Noise: A Gaussian Process Treatment", "book": "Advances in Neural Information Processing Systems", "page_first": 493, "page_last": 499, "abstract": null, "full_text": "Regression with Input-dependent Noise: \n\nA Gaussian Process Treatment \n\nPaul W. Goldberg \nDepartment of Computer Science \nUniversity of Warwick \nCoventry, CV4 7AL, UK \npvg@dcs.warwick.ac.uk \n\nChristopher K.I. Williams \nNeural Computing Research Group \nAston University \nBirmingham B4 7ET, UK \nc.k.i.williams@aston.ac.uk \n\nChristopher M. Bishop \nMicrosoft Research \nSt. George House \n1 Guildhall Street \nCambridge, CB2 3NH, UK \ncmbishop@microsoft.com \n\nAbstract \n\nGaussian processes provide natural non-parametric prior distributions over regression functions. In this paper we consider regression problems where there is noise on the output, and the variance of the noise depends on the inputs. If we assume that the noise is a smooth function of the inputs, then it is natural to model the noise variance using a second Gaussian process, in addition to the Gaussian process governing the noise-free output value. We show that prior uncertainty about the parameters controlling both processes can be handled and that the posterior distribution of the noise rate can be sampled from using Markov chain Monte Carlo methods. Our results on a synthetic data set give a posterior noise variance that well-approximates the true variance. \n\n1 Background and Motivation \n\nA very natural approach to regression problems is to place a prior on the kinds of function that we expect, and then after observing the data to obtain a posterior. The prior can be obtained by placing prior distributions on the weights in a neural network, although we would argue that it is perhaps more natural to place priors directly over functions. 
One tractable way of doing this is to use a Gaussian process prior. This has the advantage that, for fixed hyperparameters and a global noise level, predictions can be made from the posterior using only matrix computations. In contrast, for neural networks (with fixed hyperparameters and a global noise level) it is necessary to use approximations or Markov chain Monte Carlo (MCMC) methods. Rasmussen (1996) has demonstrated that predictions obtained with Gaussian processes are as good as or better than those of other state-of-the-art predictors. \n\nIn much of the work on regression problems in the statistical and neural networks literatures, it is assumed that there is a global noise level, independent of the input vector x. The book by Bishop (1995) and the papers by Bishop (1994), MacKay (1995) and Bishop and Qazaz (1997) have examined the case of input-dependent noise for parametric models such as neural networks. (Such models are said to be heteroscedastic in the statistics literature.) In this paper we develop the treatment of an input-dependent noise model for Gaussian process regression, where the noise is assumed to be Gaussian but its variance depends on x. As the noise level is non-negative we place a Gaussian process prior on the log noise level. Thus there are two Gaussian processes involved in making predictions: the usual Gaussian process for predicting the function values (the y-process), and another one (the z-process) for predicting the log noise level. Below we present a Markov chain Monte Carlo method for carrying out inference with this model and demonstrate its performance on a test problem. \n\n1.1 Gaussian processes \n\nA stochastic process is a collection of random variables {Y(x) | x in X} indexed by a set X. Often X will be a space such as R^d for some dimension d, although it could be more general. 
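Restricted to any finite set of inputs, a Gaussian process is just a multivariate Gaussian, so sample functions from the prior can be drawn with elementary linear algebra. The sketch below is our own illustration, not the paper's code: the squared-exponential covariance anticipates equation 1, and the hyperparameter values v = 1, w = 10 are assumed for display purposes only.

```python
import numpy as np

# Draw sample functions from a zero-mean Gaussian process prior.
# The squared-exponential covariance and the values v=1, w=10 are
# illustrative choices, not taken from the paper.
def se_cov(xa, xb, v=1.0, w=10.0):
    return v * np.exp(-0.5 * w * (xa[:, None] - xb[None, :]) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
K = se_cov(x, x) + 1e-6 * np.eye(50)   # small jitter keeps K well-conditioned
# Restricted to finitely many inputs, the process is multivariate Gaussian:
samples = rng.multivariate_normal(np.zeros(50), K, size=3)
```

Each row of samples is one random function from the prior evaluated on the grid; smaller w gives smoother, longer-wavelength draws.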
The stochastic process is specified by giving the probability distribution for every finite subset of variables Y(x_1), ..., Y(x_k) in a consistent manner. A Gaussian process is a stochastic process which can be fully specified by its mean function mu(x) = E[Y(x)] and its covariance function C(x, x') = E[(Y(x) - mu(x))(Y(x') - mu(x'))]; any finite set of points will have a joint multivariate Gaussian distribution. Below we consider Gaussian processes which have mu(x) = 0. This assumes that any known offset or trend in the data has been removed. A non-zero mu(x) is easily incorporated into the framework at the expense of extra notational complexity. \n\nA covariance function is used to define a Gaussian process; it is a parametrised function from pairs of x-values to their covariance. The form of the covariance function that we shall use for the prior over functions is given by \n\nC_y(x^(i), x^(j)) = v_y exp( -1/2 sum_{l=1}^d w_yl (x_l^(i) - x_l^(j))^2 ) + J_y delta(i, j)   (1) \n\nwhere v_y specifies the overall y-scale and w_yl^(-1/2) is the length-scale associated with the lth coordinate. J_y is a \"jitter\" term (as discussed by Neal, 1997), which is added to prevent ill-conditioning of the covariance matrix of the outputs. J_y is typically given a small value, e.g. 10^(-6). \n\nFor the prediction problem we are given n data points D = ((x_1, t_1), (x_2, t_2), ..., (x_n, t_n)), where t_i is the observed output value at x_i. The t's are assumed to have been generated from the true y-values by adding independent Gaussian noise whose variance is x-dependent. Let the noise variance at the n data points be r = (r(x_1), r(x_2), ..., r(x_n)). Given the assumption of a Gaussian process prior over functions, it is a standard result (e.g. 
Whittle, 1963) that the predictive distribution P(t*|x*) corresponding to a new input x* is t* ~ N(t_hat(x*), sigma^2(x*)), where \n\nt_hat(x*) = k_y^T(x*) (K_y + K_N)^(-1) t   (2) \nsigma^2(x*) = C_y(x*, x*) + r(x*) - k_y^T(x*) (K_y + K_N)^(-1) k_y(x*)   (3) \n\nwhere the noise-free covariance matrix K_y satisfies [K_y]_ij = C_y(x_i, x_j), k_y(x*) = (C_y(x*, x_1), ..., C_y(x*, x_n))^T, K_N = diag(r) and t = (t_1, ..., t_n)^T, and sqrt(sigma^2(x*)) gives the \"error bars\" or confidence interval of the prediction. \n\nIn this paper we do not specify a functional form for the noise level r(x) but we do place a prior over it. An independent Gaussian process (the z-process) is defined to be the log of the noise level. Its values at the training data points are denoted by z = (z_1, ..., z_n), so that r = (exp(z_1), ..., exp(z_n)). The prior for z has a covariance function C_z(x^(i), x^(j)) similar to that given in equation 1, although the parameters v_z and the w_zl's can be chosen to be different to those for the y-process. We also add the jitter term J_z delta(i, j) to the covariance function for z, where J_z is given the value 10^(-2). This value is larger than usual, for technical reasons discussed later. \n\nWe use a zero-mean process for z, which carries a prior assumption that the average noise rate is approximately 1 (being e to the power of components of z). This is suitable for the experiment described in section 3. In general it is easy to add an offset to the z-process to shift the prior noise rate. \n\n2 An input-dependent noise process \n\nWe discuss, in turn, sampling the noise rates and making predictions with fixed values of the parameters that control both processes, and sampling from the posterior on these parameters. \n\n2.1 Sampling the Noise Rates \n\nThe predictive distribution for t*, the output at a point x*, is P(t*|t) = ∫ P(t*|t, r(z)) P(z|t) dz. 
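For a fixed noise vector r, equations (2) and (3) reduce to linear algebra. The following is a minimal sketch, not the authors' code: the function names are ours, and an illustrative squared-exponential C_y with assumed hyperparameters stands in for the fitted covariance function.

```python
import numpy as np

def se_cov(xa, xb, v=1.0, w=10.0):
    # Illustrative noise-free covariance C_y of equation (1), without jitter
    return v * np.exp(-0.5 * w * (xa[:, None] - xb[None, :]) ** 2)

def gp_predict(x_train, t, r, x_star, r_star):
    # Equations (2) and (3): predictive mean and variance at inputs x_star,
    # given noise variances r at the training inputs and r_star at x_star.
    n = len(x_train)
    Ky = se_cov(x_train, x_train) + 1e-6 * np.eye(n)   # jitter J_y = 1e-6
    KN = np.diag(r)                                    # K_N = diag(r)
    k = se_cov(x_train, x_star)                        # columns are k_y(x*)
    A = np.linalg.solve(Ky + KN, np.column_stack([t, k]))
    mean = k.T @ A[:, 0]                               # k_y^T (K_y + K_N)^{-1} t
    var = (np.diag(se_cov(x_star, x_star)) + r_star
           - np.einsum('ij,ij->j', k, A[:, 1:]))       # equation (3)
    return mean, var
```

With near-zero noise variances the predictor interpolates the data; enlarging r at an input loosens the fit there, which is exactly the effect the z-process controls.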
Given a z vector, the prediction P(t*|t, r(z)) is Gaussian with mean and variance given by equations 2 and 3, but P(z|t) is difficult to handle analytically, so we use a Monte Carlo approximation to the integral. Given a representative sample {z_1, ..., z_k} of log noise rate vectors we can approximate the integral by the sum (1/k) sum_j P(t*|t, r(z_j)). \n\nWe wish to sample from the distribution P(z|t). As this is quite difficult, we sample instead from P(y, z|t); a sample for P(z|t) can then be obtained by ignoring the y values. This is a similar approach to that taken by Neal (1997) in the case of Gaussian processes used for classification or robust regression with t-distributed noise. We find that \n\nP(y, z|t) ∝ P(t|y, r(z)) P(y) P(z).   (4) \n\nWe use Gibbs sampling to sample from P(y, z|t) by alternately sampling from P(z|y, t) and P(y|z, t). Intuitively we are alternating the \"fitting\" of the curve (or y-process) with \"fitting\" the noise level (z-process). These two steps are discussed in turn. \n\n• Sampling from P(y|t, z) \n\nFor y we have that \n\nP(y|t, z) ∝ P(t|y, r(z)) P(y)   (5) \n\nwhere \n\nP(t|y, r(z)) = prod_{i=1}^n (2 pi r_i)^(-1/2) exp( -(t_i - y_i)^2 / (2 r_i) ).   (6) \n\nEquation (6) can also be written as t|y, r(z) ~ N(y, K_N). Thus P(y|t, z) is a multivariate Gaussian with mean (K_y^(-1) + K_N^(-1))^(-1) K_N^(-1) t and covariance matrix (K_y^(-1) + K_N^(-1))^(-1), which can be sampled by standard methods. \n\n• Sampling from P(z|t, y) \n\nFor fixed y and t we obtain \n\nP(z|y, t) ∝ P(t|y, z) P(z).   (7) \n\nThe form of equation 6 means that it is not easy to sample z as a vector. Instead we can sample its components separately, which is a standard Gibbs sampling algorithm. Let z_i denote the ith component of z and let z_{-i} denote the remaining components. 
Then \n\nP(z_i | z_{-i}, y, t) ∝ P(t_i | y_i, z_i) P(z_i | z_{-i})   (8) \n\nwhere P(z_i | z_{-i}) is the distribution of z_i conditioned on the values of z_{-i}. The computation of P(z_i | z_{-i}) is very similar to that described by equations (2) and (3), except that C_y(., .) is replaced by C_z(., .) and there is no noise, so that r(.) will be identically zero. \n\nWe sample from P(z_i | z_{-i}, y, t) using rejection sampling. We first sample from P(z_i | z_{-i}), and then accept or reject according to the term exp(-z_i/2 - (1/2)(t_i - y_i)^2 exp(-z_i)) (the likelihood of local noise rate z_i), which can be rescaled to have a maximum value of 1 over z_i. Note that it is not necessary to perform a separate matrix inversion for each i when computing the P(z_i | z_{-i}) terms; the required matrices can be computed efficiently from the inverse of K_z. We find that the average rejection rate is approximately two-thirds, which makes the method we currently use reasonably efficient. Note that it is also possible to incorporate the term exp(-z_i/2) from the likelihood into the mean of the Gaussian P(z_i | z_{-i}) to reduce the rejection rate. \n\nAs an alternative approach, it is possible to carry out Gibbs sampling for P(z_i | z_{-i}, t) without explicitly representing y, using the fact that log P(t|z) = -(1/2) log|K| - (1/2) t^T K^(-1) t + const, where K = K_y + K_N. We have implemented this and found similar results to those obtained using sampling of the y's. However, explicitly representing the y-process is useful when adapting the parameters, as described in section 2.3. \n\n2.2 Making predictions \n\nSo far we have explained how to obtain a sample from P(z|t). To make predictions we use \n\nP(t*|t) ≈ (1/k) sum_j P(t*|t, r(z_j)).   (9) \n\nHowever, P(t*|t, r(z_j)) is not immediately available, as z*, the noise level at x*, is unknown. In fact \n\nP(t*|t, r(z_j)) = ∫ P(t*|z*, t, r(z_j)) P(z*|z_j, t) dz*   (10) \n\nwhere P(z*|z_j, t) is simply a Gaussian distribution for z* conditioned on z_j, and is obtained in a similar way to P(z_i | z_{-i}). 
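The noise-free conditioning and the rejection step of section 2.1 can be sketched together as follows. This is our own illustration, not the authors' code: cond_gauss applies equations (2) and (3) with C_z in place of C_y and r identically zero (recomputing a solve per i for clarity, whereas the paper reuses the inverse of K_z), and sample_zi thins proposals from that conditional by the local likelihood, whose maximum is attained at z = log((t_i - y_i)^2).

```python
import numpy as np

def cond_gauss(Cz, z, i):
    # Noise-free conditional P(z_i | z_{-i}): equations (2) and (3) with
    # C_z in place of C_y and r identically zero. Cz is the full covariance
    # matrix of the z-process (including its jitter J_z).
    idx = [j for j in range(len(z)) if j != i]
    k = Cz[np.ix_(idx, [i])]                       # covariances with z_i
    a = np.linalg.solve(Cz[np.ix_(idx, idx)], k)   # one solve per i here;
    m = float(a[:, 0] @ z[idx])                    # in practice reuse inv(K_z)
    s2 = float(Cz[i, i] - k[:, 0] @ a[:, 0])
    return m, s2

def sample_zi(m, s2, d, rng):
    # Rejection sampling for z_i: propose from P(z_i | z_{-i}) = N(m, s2) and
    # accept by the likelihood term exp(-z/2 - 0.5*d**2*exp(-z)), d = t_i - y_i,
    # rescaled by its maximum, attained at z = log(d**2) (assumes d != 0).
    log_gmax = -0.5 * np.log(d * d) - 0.5
    while True:
        z = rng.normal(m, np.sqrt(s2))
        log_g = -0.5 * z - 0.5 * d * d * np.exp(-z)
        if rng.random() < np.exp(log_g - log_gmax):
            return z
```

Because log_gmax is the exact maximum of log_g, the acceptance probability never exceeds one, and the loop terminates quickly whenever the conditional prior places mass near the likelihood's peak.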
As P(t*|z*, t, r(z_j)) is a Gaussian distribution as given by equations (2) and (3), P(t*|t, r(z_j)) is an infinite mixture of Gaussians with weights P(z*|z_j). Note, however, that each of these components has the same mean t_hat(x*) as given by equation (2), but a different variance. \n\nWe approximate P(t*|t, r(z_j)) by taking s = 10 samples from P(z*|z_j) and thus obtain a mixture of s Gaussians as the approximating distribution. The approximation for P(t*|t) is then obtained by averaging these s-component mixtures over the k samples z_1, ..., z_k to obtain an sk-component mixture of Gaussians. \n\n2.3 Adapting the parameters \n\nAbove we have described how to obtain a sample from the posterior distribution P(z|t) and to use this to make predictions, based on the assumption that the parameters theta_y (i.e. v_y, J_y, w_y1, ..., w_yd) and theta_z (i.e. v_z, J_z, w_z1, ..., w_zd) have been set to the correct values. In practice we are unlikely to know what these settings should be, and so we introduce a hierarchical model, as shown in Figure 1. This graphical model shows that the joint probability distribution decomposes as P(theta_y, theta_z, y, z, t) = P(theta_y) P(theta_z) P(y|theta_y) P(z|theta_z) P(t|y, z). \n\nOur goal now becomes to obtain a sample from the posterior P(theta_y, theta_z, y, z|t), which can be used for making predictions as before. (Again, the y samples are not needed for making predictions, but they will turn out to be useful for sampling theta_y.) Sampling from the joint posterior can be achieved by interleaving updates of theta_y and theta_z with y and z updates. Gibbs sampling for theta_y and theta_z is not feasible as these parameters are buried deeply in the K_y and K_N matrices, so we use the Metropolis algorithm for their updates. As usual, we consider moving from our current state theta = (theta_y, theta_z) to a new state theta' using a proposal distribution J(theta, theta'). In practice we take J to be an isotropic Gaussian centered on theta. 
Denote the ratio of P(theta_y) P(theta_z) P(y|theta_y) P(z|theta_z) in states theta' and theta by r. Then the proposed state theta' is accepted with probability min{r, 1}. \n\nIt would also be possible to use more sophisticated MCMC algorithms such as the Hybrid Monte Carlo algorithm, which uses derivative information, as discussed in Neal (1997). \n\nFigure 1: The hierarchical model including parameters. \n\n3 Results \n\nWe have tested the method on a one-dimensional synthetic problem. 60 data points were generated from the function y = 2 sin(2 pi x) on [0, 1] by adding independent Gaussian noise. This noise has a standard deviation that increases linearly from 0.5 at x = 0 to 1.5 at x = 1. The function and the training data set are illustrated in Figure 2(a). \n\nAs the parameters are non-negative quantities, we actually compute with their log values. log v_y, log v_z, log w_y and log w_z were given N(0, 1) prior distributions. The jitter values were fixed at J_y = 10^(-6) and J_z = 10^(-2). The relatively large value for J_z assists the convergence of the Gibbs sampling, since it is responsible for most of the variance of the conditional distribution P(z_i | z_{-i}). The broadening of this distribution leads to samples whose likelihoods are more variable, allowing the likelihood term (used for rejection) to be more influential. \n\nIn our simulations, on each iteration we made three Metropolis updates for the parameters, along with sampling from all of the y and z variables. The Metropolis proposal distribution was an isotropic Gaussian with variance 0.01. We ran for 3000 iterations, and discarded the first one-third of iterations as \"burn-in\", after which plots of each of the parameters seemed to have settled down. The parameters and z values were stored every 100 iterations. 
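The Metropolis update used in these runs can be sketched as follows. This is our own illustration, not the authors' code: log_post stands for the log of P(theta_y) P(theta_z) P(y|theta_y) P(z|theta_z), which the caller must supply, and the state is the vector of log parameters.

```python
import math, random

def metropolis_step(log_theta, log_post, step_var=0.01, rng=random):
    # One Metropolis update of the log parameters: propose from an isotropic
    # Gaussian with variance step_var centred on the current state, and accept
    # with probability min{r, 1}, where r is the ratio of
    # P(theta_y) P(theta_z) P(y|theta_y) P(z|theta_z) in the two states.
    sd = math.sqrt(step_var)
    proposal = [v + rng.gauss(0.0, sd) for v in log_theta]
    log_r = log_post(proposal) - log_post(log_theta)
    if rng.random() < math.exp(min(log_r, 0.0)):   # accept with prob min{r, 1}
        return proposal
    return list(log_theta)
```

Working with log_r and capping it at zero avoids overflow in the ratio; a symmetric proposal such as this isotropic Gaussian needs no Hastings correction.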
In Figure 2(b) the average standard deviation of the inferred noise has been plotted, along with two standard deviation error-bars. Notice how the standard deviation increases from left to right, in close agreement with the data generator. \n\nStudying the posterior distributions of the parameters, we find that the y-length-scale lambda_y = (w_y)^(-1/2) is well localized around 0.22 ± 0.1, in good agreement with the wavelength of the sinusoidal generator. (For the covariance function in equation 1, the expected number of zero crossings per unit length is 1/(pi lambda_y).) (w_z)^(-1/2) is less tightly constrained, which makes sense as it corresponds to a longer wavelength process, and with only a short segment of data available there is still considerable posterior uncertainty. \n\nFigure 2: (a) shows the training set (crosses); the solid line depicts the x-dependent mean of the output. (b) The solid curve shows the average standard deviation of the noise process, with two standard deviation error bars plotted as dashed lines. The dotted line indicates the true standard deviation of the data generator. \n\n4 Conclusions \n\nWe have introduced a natural non-parametric prior on variable noise rates, and given an effective method of sampling the posterior distribution, using an MCMC method. When applied to a data set with varying noise, the posterior noise rates obtained are well-matched to the known structure. We are currently experimenting with the method on some more challenging real-world problems. \n\nAcknowledgements \n\nThis work was carried out at Aston University under EPSRC Grant Ref. GR/K 51792, Validation and Verification of Neural Network Systems. \n\nReferences \n\n[1] C.M. Bishop (1994). Mixture Density Networks. 
Technical Report NCRG/94/001, Neural Computing Research Group, Aston University, Birmingham, UK. \n\n[2] C.M. Bishop (1995). Neural Networks for Pattern Recognition. Oxford University Press. \n\n[3] C.M. Bishop and C. Qazaz (1997). Regression with Input-dependent Noise: A Bayesian Treatment. In M. C. Mozer, M. I. Jordan and T. Petsche (Eds), Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press. \n\n[4] D. J. C. MacKay (1995). Probabilistic networks: new models and new methods. In F. Fogelman-Soulie and P. Gallinari (Eds), Proceedings ICANN'95 International Conference on Neural Networks, pp. 331-337. Paris: EC2 & Cie. \n\n[5] R. Neal (1997). Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification. Technical Report 9702, Department of Statistics, University of Toronto. Available from http://www.cs.toronto.edu/~radford/. \n\n[6] C.E. Rasmussen (1996). Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. PhD thesis, Department of Computer Science, University of Toronto. Available from http://www.cs.utoronto.ca/~carl/. \n\n[7] C.K.I. Williams and C.E. Rasmussen (1996). Gaussian Processes for Regression. In D. S. Touretzky, M. C. Mozer and M. E. Hasselmo (Eds), Advances in Neural Information Processing Systems 8, pp. 514-520. Cambridge, MA: MIT Press. \n\n[8] P. Whittle (1963). Prediction and Regulation by Linear Least-Square Methods. English Universities Press. \n", "award": [], "sourceid": 1444, "authors": [{"given_name": "Paul", "family_name": "Goldberg", "institution": null}, {"given_name": "Christopher", "family_name": "Williams", "institution": null}, {"given_name": "Christopher", "family_name": "Bishop", "institution": null}]}