{"title": "Regression with Input-Dependent Noise: A Bayesian Treatment", "book": "Advances in Neural Information Processing Systems", "page_first": 347, "page_last": 353, "abstract": null, "full_text": "Regression with Input-Dependent Noise: A Bayesian Treatment

Christopher M. Bishop (C.M.Bishop@aston.ac.uk)
Cazhaow S. Qazaz (qazazcs@aston.ac.uk)
Neural Computing Research Group
Aston University, Birmingham, B4 7ET, U.K.
http://www.ncrg.aston.ac.uk/

Abstract

In most treatments of the regression problem it is assumed that the distribution of target data can be described by a deterministic function of the inputs, together with additive Gaussian noise having constant variance. The use of maximum likelihood to train such models then corresponds to the minimization of a sum-of-squares error function. In many applications a more realistic model would allow the noise variance itself to depend on the input variables. However, the use of maximum likelihood to train such models would give highly biased results. In this paper we show how a Bayesian treatment can allow for an input-dependent variance while overcoming the bias of maximum likelihood.

1 Introduction

In regression problems it is important not only to predict the output variables but also to have some estimate of the error bars associated with those predictions. An important contribution to the error bars arises from the intrinsic noise on the data. In most conventional treatments of regression, it is assumed that the noise can be modelled by a Gaussian distribution with a constant variance. However, in many applications it will be more realistic to allow the noise variance itself to depend on the input variables. A general framework for modelling the conditional probability density function of the target data, given the input vector, has been introduced in the form of mixture density networks by Bishop (1994, 1995).
This uses a feed-forward network to set the parameters of a mixture kernel distribution, following Jacobs et al. (1991). The special case of a single isotropic Gaussian kernel function was discussed by Nix and Weigend (1995), and its generalization to allow for an arbitrary covariance matrix was given by Williams (1996).

These approaches, however, are all based on the use of maximum likelihood, which can lead to the noise variance being systematically under-estimated. Here we adopt an approximate hierarchical Bayesian treatment (MacKay, 1991) to find the most probable interpolant and most probable input-dependent noise variance. We compare our results with maximum likelihood and show how this Bayesian approach leads to a significantly reduced bias.

In order to gain some insight into the limitations of the maximum likelihood approach, and to see how these limitations can be overcome in a Bayesian treatment, it is useful to consider first a much simpler problem involving a single random variable (Bishop, 1995). Suppose that a variable $z$ is known to have a Gaussian distribution, but with unknown mean $\mu$ and unknown variance $\sigma^2$. Given a sample $D = \{z_n\}$, $n = 1, \ldots, N$, drawn from that distribution, our goal is to infer values for the mean and variance. The likelihood function is given by

$$p(D|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (z_n - \mu)^2 \right\}. \tag{1}$$

A non-Bayesian approach to finding the mean and variance is to maximize the likelihood jointly over $\mu$ and $\sigma^2$, corresponding to the intuitive idea of finding the parameter values which are most likely to have given rise to the observed data set. This yields the standard result

$$\widehat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (z_n - \widehat{\mu})^2. \tag{2}$$

It is well known that the estimate $\widehat{\sigma}^2$ for the variance given in (2) is biased, since the expectation of this estimate is not equal to the true value

$$\mathcal{E}[\widehat{\sigma}^2] = \frac{N-1}{N} \sigma_0^2 \tag{3}$$

where $\sigma_0^2$ is the true variance of the distribution which generated the data, and $\mathcal{E}[\cdot]$ denotes an average over data sets of size $N$. For large $N$ this effect is small. However, in the case of regression problems there is generally a much larger number of degrees of freedom in relation to the number of available data points, in which case the effect of this bias can be very substantial.

The problem of bias can be regarded as a symptom of the maximum likelihood approach. Because the mean $\widehat{\mu}$ has been estimated from the data, it has fitted some of the noise on the data, and this leads to an under-estimate of the variance. If the true mean is used in the expression for $\widehat{\sigma}^2$ in (2) instead of the maximum likelihood expression, then the estimate is unbiased.

By adopting a Bayesian viewpoint this bias can be removed. The marginal likelihood of $\sigma^2$ should be computed by integrating over the mean $\mu$. Assuming a 'flat' prior $p(\mu)$ we obtain

$$p(D|\sigma^2) = \int p(D|\mu, \sigma^2)\, p(\mu)\, d\mu \tag{4}$$

$$\propto \frac{1}{(2\pi\sigma^2)^{(N-1)/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (z_n - \widehat{\mu})^2 \right\}. \tag{5}$$

Maximizing (5) with respect to $\sigma^2$ then gives

$$\widetilde{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} (z_n - \widehat{\mu})^2 \tag{6}$$

which is unbiased.

This result is illustrated in Figure 1, which shows contours of $p(D|\mu, \sigma^2)$ together with the marginal likelihood $p(D|\sigma^2)$ and the conditional likelihood $p(D|\widehat{\mu}, \sigma^2)$ evaluated at $\mu = \widehat{\mu}$.

[Figure 1: two panels; left panel axes are mean (horizontal) and variance (vertical), right panel axes are likelihood (horizontal) and variance (vertical).]

Figure 1: The left hand plot shows contours of the likelihood function $p(D|\mu, \sigma^2)$ given by (1) for 4 data points drawn from a Gaussian distribution having zero mean and unit variance.
The right hand plot shows the marginal likelihood function $p(D|\sigma^2)$ (dashed curve) and the conditional likelihood function $p(D|\widehat{\mu}, \sigma^2)$ (solid curve). It can be seen that the skewed contours result in a value of $\widehat{\sigma}^2$, which maximizes $p(D|\widehat{\mu}, \sigma^2)$, which is smaller than the value $\widetilde{\sigma}^2$ which maximizes $p(D|\sigma^2)$.

2 Bayesian Regression

Consider a regression problem involving the prediction of a noisy variable $t$ given the value of a vector $x$ of input variables.[1] Our goal is to predict both a regression function and an input-dependent noise variance. We shall therefore consider two networks. The first network takes the input vector $x$ and generates an output $y(x; w)$ which represents the regression function, and is governed by a vector of weight parameters $w$. The second network also takes the input vector $x$, and generates an output function $\beta(x; u)$ representing the inverse variance of the noise distribution, and is governed by a vector of weight parameters $u$. The conditional distribution of target data, given the input vector, is then modelled by a normal distribution $p(t|x, w, u) = \mathcal{N}(t|y, \beta^{-1})$. From this we obtain the likelihood function

$$p(D|w, u) = \frac{1}{Z_D} \exp\left\{ -\frac{1}{2} \sum_{n=1}^{N} \beta_n \left( t_n - y(x_n; w) \right)^2 \right\} \tag{7}$$

where $\beta_n = \beta(x_n; u)$,

$$Z_D = \prod_{n=1}^{N} \left( \frac{2\pi}{\beta_n} \right)^{1/2} \tag{8}$$

and $D = \{x_n, t_n\}$ is the data set.

[1] For simplicity we consider a single output variable. The extension of this work to multiple outputs is straightforward.

Some simplification of the subsequent analysis is obtained by taking the regression function, and $\ln \beta$, to be given by linear combinations of fixed basis functions, as in MacKay (1995), so that

$$y(x; w) = w^{\mathrm{T}}
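The bias described in Section 1, and its removal by dividing by $N-1$ rather than $N$, can be checked numerically. The following sketch (an illustration written for this discussion, not code from the paper) averages both variance estimators over many small data sets drawn from a zero-mean, unit-variance Gaussian, as in Figure 1:

```python
import numpy as np

# Average the two variance estimators over many data sets of size N
# drawn from a Gaussian with true variance sigma0_sq.
rng = np.random.default_rng(0)
N = 4                  # small sample size, as in Figure 1
sigma0_sq = 1.0        # true variance
trials = 200_000

samples = rng.normal(0.0, np.sqrt(sigma0_sq), size=(trials, N))
mu_hat = samples.mean(axis=1, keepdims=True)
sq_dev = (samples - mu_hat) ** 2

# Maximum-likelihood estimate, eq. (2): divide by N (biased)
var_ml = sq_dev.sum(axis=1) / N
# Estimate from the marginal likelihood, eq. (6): divide by N - 1 (unbiased)
var_marg = sq_dev.sum(axis=1) / (N - 1)

# Eq. (3) predicts E[var_ml] = (N-1)/N * sigma0_sq = 0.75 for N = 4
print(var_ml.mean())    # close to 0.75
print(var_marg.mean())  # close to 1.0
```

With $N = 4$ the maximum likelihood estimate under-estimates the variance by 25% on average, which is the magnitude of bias the paper argues becomes severe when the number of degrees of freedom is large relative to the number of data points.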
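The two-network model of Section 2 can be sketched in code. The following is a minimal illustration under the paper's simplifying assumption that both $y(x; w)$ and $\ln \beta(x; u)$ are linear in a set of fixed basis functions; the particular Gaussian basis, its centres and width, and the function names are our own choices, not specified by the paper:

```python
import numpy as np

def design_matrix(x, centres=np.linspace(0.0, 1.0, 5), width=0.2):
    """Fixed Gaussian basis functions plus a constant bias term
    (an assumed design; the paper only requires fixed basis functions)."""
    g = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), g])

def neg_log_likelihood(w, u, x, t):
    """Negative log of the likelihood (7)-(8):
    y(x; w) = w^T phi(x) is the regression function and
    ln beta(x; u) = u^T phi(x) is the log inverse noise variance."""
    P = design_matrix(x)
    y = P @ w                    # regression function y(x; w)
    log_beta = P @ u             # ln beta(x; u)
    beta = np.exp(log_beta)
    # -ln p(D|w,u) = sum_n [ beta_n (t_n - y_n)^2 / 2
    #                        - ln(beta_n)/2 + ln(2*pi)/2 ]
    return 0.5 * np.sum(beta * (t - y) ** 2 - log_beta + np.log(2 * np.pi))
```

One could minimize this jointly (or alternately) over $w$ and $u$, but that is exactly the maximum likelihood procedure whose bias the paper sets out to correct; the Bayesian treatment instead integrates over $w$ when inferring the noise model.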