{"title": "General Bounds on Bayes Errors for Regression with Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 302, "page_last": 308, "abstract": null, "full_text": "General Bounds on Bayes Errors for \nRegression with Gaussian Processes \n\nManfred Opper \n\nFrancesco Vivarelli \n\nNeural Computing Research Group \n\nCentro Ricerche Ambientali \n\nDept. of Electronic Engineering \n\nMontecatini, \n\nand Computer Science, \n\nAston University, \n\nBirmingham, B4 7ET \n\nUnited Kingdom \n\noppermGaston.ac.uk \n\nvia Ciro Menotti, 48 \n\n48023 Marina di Ravenna, \n\nItaly \n\nfvivarelliGcramont.it \n\nAbstract \n\nBased on a simple convexity lemma, we develop bounds for differ(cid:173)\nent types of Bayesian prediction errors for regression with Gaussian \nprocesses. The basic bounds are formulated for a fixed training set. \nSimpler expressions are obtained for sampling from an input distri(cid:173)\nbution which equals the weight function of the covariance kernel, \nyielding asymptotically tight results. The results are compared \nwith numerical experiments. \n\n1 \n\nIntroduction \n\nNonparametric Bayesian models which are based on Gaussian priors on function \nspaces are becoming increasingly popular in the Neural Computation Community \n(see e.g.[2, 3, 4, 7, 1]). Since the model classes considered in this approach are \ninfinite dimensional, the application of Vapnik - Chervonenkis type of methods to \ndetermine bounds for the learning curves is nontrivial and has not been performed \nso far (to our knowledge). In these methods, the target function to be learnt is \nfixed and input data are drawn independently at random from a fixed (unknown) \ndistribution. The approach of this paper is different. 
Here, we assume that the target is actually drawn at random from a known prior distribution, and we are interested in developing simple bounds on the average prediction performance (with respect to the prior) which hold for a fixed set of inputs. Only at a later stage is an average over the input distribution made. \n\n2 Regression with Gaussian processes \n\nTo explain the Gaussian process scenario for regression problems [4], we assume that observations $y \in \mathbb{R}$ at input points $x \in \mathbb{R}^D$ are corrupted values of a function $\theta(x)$ by an independent Gaussian noise with variance $\sigma^2$. The appropriate stochastic model is given by the likelihood \n\n$$p_\theta(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - \theta(x))^2}{2\sigma^2}} \qquad (1)$$ \n\nThe goal of a learner is to give an estimate of the function $\theta(x)$, based on a set of observed example data $D_t = ((x_1, y_1), \ldots, (x_t, y_t))$. As the prior information about the unknown function $\theta(x)$ we assume that $\theta$ is a realization of a Gaussian random field with zero mean and covariance \n\n$$C(x, x') = \mathbb{E}[\theta(x)\theta(x')]. \qquad (2)$$ \n\nIt is useful to expand the random functions as \n\n$$\theta(x) = \sum_{k=0}^{\infty} w_k \phi_k(x) \qquad (3)$$ \n\nin a complete set of deterministic functions $\phi_k(x)$ with random Gaussian coefficients $w_k$. As is well known, if the $\phi_k$ are chosen as orthonormal eigenfunctions of the integral equation \n\n$$\int C(x, x')\,\phi_k(x')\,p(x')\,dx' = \lambda_k \phi_k(x), \qquad (4)$$ \n\nwith eigenvalues $\lambda_k$ and a nonnegative weight function $p(x)$, the a priori statistics of the $w_k$ are simple: they are independent Gaussian variables which satisfy $\mathbb{E}[w_k w_l] = \lambda_k \delta_{kl}$. \n\n3 Prediction and Bayes error \n\nUsually, the posterior mean of $\theta(x)$ is chosen as the prediction $\hat{\theta}(x)$ at a new point $x$, based on a dataset $D_n = ((x_1, y_1), \ldots, (x_n, y_n))$. 
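The prior statistics above can be illustrated numerically: sampling $\theta(x) = \sum_k w_k \phi_k(x)$ with independent $w_k \sim N(0, \lambda_k)$ reproduces the covariance $C(x, x') = \sum_k \lambda_k \phi_k(x)\phi_k(x')$. A minimal sketch, using a hypothetical geometric spectrum and a cosine basis orthonormal under the uniform weight $p(x) = 1$ on $[0, 1]$ (both are our illustrative choices, not the kernel studied later in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 8                          # number of modes kept (illustrative truncation)
lam = 0.5 ** np.arange(K)      # hypothetical eigenvalues lambda_k

def phi(k, x):
    # Orthonormal basis w.r.t. the uniform density on [0, 1]
    if k == 0:
        return np.ones_like(x)
    return np.sqrt(2.0) * np.cos(k * np.pi * x)

x = np.array([0.2, 0.7])       # two test points
n_samples = 200_000

# Draw realizations theta(x) = sum_k w_k phi_k(x), w_k ~ N(0, lambda_k)
w = rng.standard_normal((n_samples, K)) * np.sqrt(lam)
Phi = np.stack([phi(k, x) for k in range(K)])     # shape (K, 2)
theta = w @ Phi                                    # shape (n_samples, 2)

emp_cov = np.mean(theta[:, 0] * theta[:, 1])       # Monte Carlo E[theta(x)theta(x')]
exact_cov = np.sum(lam * Phi[:, 0] * Phi[:, 1])    # sum_k lambda_k phi_k(x) phi_k(x')
print(emp_cov, exact_cov)
```

With enough samples the empirical and analytical covariances agree, confirming that the coefficient statistics $\mathbb{E}[w_k w_l] = \lambda_k \delta_{kl}$ encode the covariance kernel.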
Its explicit form can be easily derived by using the expansion $\theta(x) = \sum_k w_k \phi_k(x)$, and the fact that for Gaussian random variables the mean coincides with the most probable value. Maximizing the log posterior with respect to the $w_k$, one finds for the infinite dimensional vector $\hat{w} = (\hat{w}_k)_{k=0,\ldots,\infty}$ the result $\hat{w} = (\sigma^2 I + \Lambda V)^{-1} b$, where $V_{kl} = \sum_{i=1}^{n} \phi_k(x_i)\phi_l(x_i)$, $\Lambda_{kl} = \lambda_k \delta_{kl}$ and $b_k = \sum_{i=1}^{n} \lambda_k y_i \phi_k(x_i)$. Fixing the set of inputs $x^n$, the Bayesian prediction error at a point $x$ is given by \n\n$$\varepsilon(x|x^n) = \mathbb{E}\left(\hat{\theta}(x) - \theta(x)\right)^2 \qquad (5)$$ \n\nEvaluating (5) yields, after some work, the expression \n\n$$\varepsilon(x|x^n) = \sigma^2\, \mathrm{Tr}\left\{ (\sigma^2 I + \Lambda V)^{-1} \Lambda U(x) \right\} \qquad (6)$$ \n\nwith the matrix $U_{kl}(x) = \phi_k(x)\phi_l(x)$. $U$ has the properties that $\sum_{i=1}^{n} U(x_i) = V$ and $\int dx\, p(x)\, U(x) = I$. We define the Bayesian training error as the empirical average of the error (5) at the $n$ datapoints of the training set, and the Bayesian generalization error as the average error over all $x$ weighted by the function $p(x)$. We get \n\n$$\varepsilon_t = \frac{1}{n}\, \mathrm{Tr}\left\{ \Lambda V \left(I + \Lambda V/\sigma^2\right)^{-1} \right\} \qquad (7)$$ \n\n$$\varepsilon_g = \mathrm{Tr}\left\{ \Lambda \left(I + \Lambda V/\sigma^2\right)^{-1} \right\} \qquad (8)$$ \n\n4 Entropic error \n\nIn order to understand the next type of error [9], we assume that the data arrive sequentially, one after the other. The predictive distribution $\hat{p}$ after $t-1$ training data at the new input $x_t$ is the posterior expectation of the likelihood (1). Let $L_t$ be the Bayesian average of the relative entropy (or Kullback-Leibler divergence) between the predictive distribution and the true distribution $P_\theta$ from which the data were generated, i.e. $L_t = \mathbb{E}[D_{KL}(P_\theta || \hat{p})]$. It can also be shown that $L_t = \frac{1}{2} \ln\left(1 + \varepsilon(x_t|x^{t-1})/\sigma^2\right)$. 
Hence, when the prediction error is small, we will have \n\n$$L_t \approx \frac{\varepsilon(x_t|x^{t-1})}{2\sigma^2}. \qquad (9)$$ \n\nThe cumulative entropic error $E(x^n)$ is defined by summing up all the losses (which gives an integrated learning curve) from $t = 1$ up to time $n$, and one can show that \n\n$$E(x^n) = \sum_{t=1}^{n} L_t(x_t, D_{t-1}) = \mathbb{E}\, D_{KL}\left(P_\theta^n \,||\, P^n\right) = \frac{1}{2}\, \mathrm{Tr} \ln\left(I + \Lambda V/\sigma^2\right) \qquad (10)$$ \n\nwhere $P_\theta^n = \prod_{i=1}^{n} P_\theta(y_i|x_i)$ and $P^n = \mathbb{E}\left[\prod_{i=1}^{n} P_\theta(y_i|x_i)\right]$. The first equality may be found e.g. in [9], and the second follows from direct calculation. \n\n5 Bounds for fixed set of inputs \n\nIn order to get bounds on (7), (8) and (10), we use a lemma which has been used in Quantum Statistical Mechanics to get bounds on the free energy. The lemma (for the special function $f(x) = e^{-\beta x}$) was proved by Sir Rudolf Peierls in 1938 [10]. In order to keep the paper self contained, we have included the proof in the appendix. \n\nLemma 1 Let $H$ be a real symmetric matrix and $f$ a convex real function. Then $\mathrm{Tr}\, f(H) \geq \sum_k f(H_{kk})$. \n\nBy noting that for concave functions the bound goes in the other direction, we immediately get \n\n$$\varepsilon_t \leq \frac{\sigma^2}{n} \sum_k \frac{\lambda_k V_{kk}}{\sigma^2 + \lambda_k V_{kk}} \leq \sigma^2 \sum_k \frac{\lambda_k v_k}{\sigma^2 + n\lambda_k v_k} \qquad (11)$$ \n\n$$\varepsilon_g \geq \sum_k \frac{\sigma^2 \lambda_k}{\sigma^2 + \lambda_k V_{kk}} \geq \sum_k \frac{\sigma^2 \lambda_k}{\sigma^2 + n\lambda_k v_k} \qquad (12)$$ \n\n$$E(x^n) \leq \frac{1}{2} \sum_k \ln\left(1 + V_{kk}\lambda_k/\sigma^2\right) \leq \frac{1}{2} \sum_k \ln\left(1 + n v_k \lambda_k/\sigma^2\right) \qquad (13)$$ \n\nwhere in the rightmost inequalities we assume that all $n$ inputs are in a compact region $\mathcal{D}$, and we define $v_k = \sup_{x \in \mathcal{D}} \phi_k^2(x)$.¹ \n\n¹The entropic case may also be proved by Hadamard's inequality. \n\n6 Average case bounds \n\nNext, we assume that the input data are drawn at random, and denote by $\langle \cdots \rangle$ the expectations with respect to the distribution. We do not have to assume independence here, but only the fact that all marginal distributions for the $n$ inputs are identical! 
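The convexity lemma underlying the bounds above is straightforward to verify numerically; a minimal sketch for Peierls' original case $f(x) = e^{-\beta x}$ (the matrix size and the value of $\beta$ are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random real symmetric matrix H
A = rng.standard_normal((6, 6))
H = 0.5 * (A + A.T)

# Convex test function f(x) = exp(-beta * x), Peierls' case
beta = 1.3
f = lambda x: np.exp(-beta * x)

eigvals = np.linalg.eigvalsh(H)   # eigenvalues E_i of H
lhs = f(eigvals).sum()            # Tr f(H) = sum_i f(E_i)
rhs = f(np.diag(H)).sum()         # sum_k f(H_kk)
print(lhs, rhs)
```

For any symmetric `H` and convex `f`, `lhs >= rhs`, which is exactly the statement used to pull the traces in (7), (8) and (10) onto the diagonal.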
Using Jensen's inequality, \n\n$$\langle \varepsilon_t(x^n) \rangle \leq \sigma^2 \sum_k \frac{\lambda_k u_k}{\sigma^2 + n\lambda_k u_k} \qquad (14)$$ \n\n$$\langle \varepsilon_g(x^n) \rangle \geq \sum_k \frac{\sigma^2 \lambda_k}{\sigma^2 + n\lambda_k u_k} \qquad (15)$$ \n\n$$\langle E(x^n) \rangle \leq \frac{1}{2} \sum_k \ln\left(1 + n u_k \lambda_k/\sigma^2\right) \qquad (16)$$ \n\nwhere now $u_k = \langle \phi_k^2(x) \rangle$. This result is especially simple when the weighting function $p(x)$ is a probability density and the inputs have the marginal distribution $p(x)$: in this case, we simply have $u_k = 1$, and training and generalization error sandwich the bound \n\n$$\varepsilon_b = \sigma^2 \sum_k \frac{\lambda_k}{\sigma^2 + n\lambda_k}. \qquad (17)$$ \n\nWe expect that the bound $\varepsilon_b$ becomes asymptotically exact when $n \to \infty$. This should be intuitively clear, because training and generalization error approach each other asymptotically. It may also be understood from (9), which shows that, asymptotically, the cumulative entropic error equals the cumulative generalization error up to a factor of $1/(2\sigma^2)$. By integrating the lower bound (17) over $n$, we obtain precisely the upper bound (16) on $E$ up to this factor, showing that upper and lower bounds exhibit the same behaviour. \n\n7 Simulations \n\nWe have compared our bounds with simulations of the average training and generalization errors for the case that the data are drawn from $p(x)$. Results for the entropic error will be given elsewhere. We have specialized to the case where the covariance kernel is of the RBF form $C(x, x') = \exp[-(x - x')^2/\lambda^2]$ and $p(x) = (2\pi)^{-\frac{1}{2}} e^{-\frac{1}{2}x^2}$, for which, following Zhu et al. (1997), the $k$-th eigenvalue of the spectrum ($k = 0, \ldots, \infty$) can be written as $\lambda_k = a b^k$, where $a = \sqrt{c}$, $b = c/\lambda^2$, $c = 2\left(1 + 2/\lambda^2 + \sqrt{1 + 4/\lambda^2}\right)^{-1}$, and $\lambda$ is the lengthscale of the process. We estimated the average generalisation error for each training set based on the exact analytical expressions (8) and (7) over the distribution of the datasets by using a Monte Carlo approximation. To begin with, let us consider $x \in \mathbb{R}$. 
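With this geometric spectrum $\lambda_k = a b^k$, the sandwich bound (17) is cheap to evaluate numerically; a minimal sketch (the truncation level `K` is our choice, justified because $b < 1$):

```python
import numpy as np

def eps_b(n, lam_scale, sigma2, K=200):
    # Spectrum of the RBF kernel with Gaussian weight (Zhu et al. 1997):
    # lambda_k = a * b**k with a = sqrt(c), b = c / lam_scale**2
    c = 2.0 / (1.0 + 2.0 / lam_scale**2 + np.sqrt(1.0 + 4.0 / lam_scale**2))
    a, b = np.sqrt(c), c / lam_scale**2
    lam = a * b ** np.arange(K)              # truncated eigenvalue sequence
    # Bound (17): sigma^2 * sum_k lambda_k / (sigma^2 + n * lambda_k)
    return float(np.sum(sigma2 * lam / (sigma2 + n * lam)))

for n in (10, 100, 1000):
    print(n, eps_b(n, lam_scale=0.1, sigma2=0.1))
```

Each term decreases monotonically in $n$, so the bound decays with the training set size, consistent with the learning curves reported below.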
We sampled the 1-dimensional input space generating 100 training sets whose data points were normally distributed around zero with unit variance. For each generation, the expected training and generalisation errors for a GP have been evaluated using up to 1000 data points. We set the value of the lengthscale² $\lambda$ to 0.1, and we let the noise level $\sigma^2$ assume several values ($\sigma^2 = 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1$). Figure 1 shows the results we obtained when $\sigma^2 = 0.1$ (Figure 1(a)) and $\sigma^2 = 1$ (Figure 1(b)). \n\n²The value of the lengthscale $\lambda$ has the effect of stretching the training and learning curves; thus the results of the experiments performed with different $\lambda$ are qualitatively similar to those presented. \n\n[Figure 1: Training and learning curves with their bound $\varepsilon_b(n)$ obtained with $\lambda = 0.1$; the noise level is set to 0.1 in Figure 1(a) and to 1 in Figure 1(b). In all the graphs, $\varepsilon_t(n)$ and $\varepsilon_g(n)$ are drawn by the solid lines and their 95% confidence intervals by the dotted curves. The bound $\varepsilon_b(n)$ is drawn by the dash-dotted lines.] \n\nThe bound $\varepsilon_b(n)$ lies within the training and learning curves, being an upper bound for $\varepsilon_t(n)$ and a lower bound for $\varepsilon_g(n)$. This bound is tighter for the processes with higher noise level; in particular, for large datasets the error bars on the curves $\varepsilon_t(n)$ and $\varepsilon_g(n)$ overlap the bound $\varepsilon_b(n)$. The curves $\varepsilon_t(n)$, $\varepsilon_g(n)$ and $\varepsilon_b(n)$ approach zero as $O(\log(n)/n)$. \n\nOur bounds can also be applied to higher dimensions $D > 1$ using the covariance \n\n$$C(x, x') = \exp\left(-\|x - x'\|^2/\lambda^2\right) \qquad (18)$$ \n\nfor $x, x' \in \mathbb{R}^D$. 
Obviously the integral kernel $C$ is just a direct product of RBF kernels, one for each coordinate of $x$ and $x'$. The eigenvalue problem (4) can be immediately reduced to the one for a single variable. Eigenfunctions and eigenvalues are simply products of those for the single coordinate problems. Hence, using a bit of combinatorics, the bound $\varepsilon_b$ can be written as \n\n$$\varepsilon_b = \sum_{k=0}^{\infty} \binom{k + D - 1}{k} \frac{\sigma^2 a^D b^k}{\sigma^2 + n a^D b^k}, \qquad (19)$$ \n\nwhere $a$ and $b$ have been defined above. We performed experiments with $x \in \mathbb{R}^2$ and $x \in \mathbb{R}^5$. The correlation length along each direction of the input space has been set to 1, and the noise level was $\sigma^2 = 1.0$. The graphs of the curves, with their error bars, are reported in Figure 2(a) (for $x \in \mathbb{R}^2$) and in Figure 2(b) (for $x \in \mathbb{R}^5$). \n\n[Figure 2: Training and learning curves with their bound $\varepsilon_b(n)$ obtained with the squared exponential covariance function with $\lambda = 1$ and $\sigma^2 = 1$; the input space is $\mathbb{R}^2$ (Figure 2(a)) and $\mathbb{R}^5$ (Figure 2(b)). In all the Figures, $\varepsilon_t(n)$ and $\varepsilon_g(n)$ are drawn by the solid lines and their 95% confidence intervals by the dotted curves. The bound $\varepsilon_b(n)$ is drawn by the dash-dotted lines.] \n\n8 Discussion \n\nBased on the minimal requirements on training inputs and covariances, we conjecture that our bounds cannot be improved much without making more detailed assumptions on models and distributions. We can observe from the simulations that the tightness of the bound $\varepsilon_b(n)$ depends on the dimension of the input space. In particular, for large datasets $\varepsilon_b(n)$ is tighter for small dimension of the input space; Figure 2(a) shows this quite clearly, since $\varepsilon_b(n)$ overlaps the error bars of the training and learning curves for large $n$. 
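The dimension dependence just described can be reproduced directly from (19) by summing the degenerate eigenvalues $a^D b^k$ with multiplicity $\binom{k+D-1}{k}$; a minimal sketch (the truncation level `K` is our choice):

```python
import numpy as np
from math import comb

def eps_b_D(n, D, lam_scale, sigma2, K=80):
    # Multivariate bound (19): eigenvalue a^D * b^k occurs with
    # multiplicity C(k + D - 1, k), from the product of D 1-d spectra.
    c = 2.0 / (1.0 + 2.0 / lam_scale**2 + np.sqrt(1.0 + 4.0 / lam_scale**2))
    a, b = np.sqrt(c), c / lam_scale**2
    total = 0.0
    for k in range(K):
        lam_k = a**D * b**k
        total += comb(k + D - 1, k) * sigma2 * lam_k / (sigma2 + n * lam_k)
    return total

for D in (2, 5):
    print(D, eps_b_D(n=100, D=D, lam_scale=1.0, sigma2=1.0))
```

For a fixed sample size the bound is larger in higher dimension, because the prior variance is spread over many more small eigenvalues that the data cannot suppress; this mirrors the looser curves observed for $\mathbb{R}^5$.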
Numerical simulations performed using modified Bessel covariance functions of order $r$ (describing random processes $r - 1$ times mean square differentiable) have shown that the bound $\varepsilon_b(n)$ becomes tighter for smoother processes. \n\nAcknowledgement: We are grateful for many inspiring discussions with C. K. I. Williams. M. O. would like to thank Peter Sollich for his conjecture that (17) is an exact lower bound on the generalization error, which motivated part of this work. F. V. was supported by a studentship of British Aerospace. \n\n9 Appendix: Proof of Lemma 1 \n\nLet $\{\xi^{(i)}\}$ be a complete set of orthonormal eigenvectors and $\{E_i\}$ the corresponding set of eigenvalues of $H$, i.e. we have the properties $\sum_l H_{kl}\,\xi_l^{(i)} = E_i\, \xi_k^{(i)}$, $\sum_k \xi_k^{(i)}\xi_k^{(j)} = \delta_{ij}$, and $\sum_i \xi_k^{(i)}\xi_l^{(i)} = \delta_{kl}$. Then we get \n\n$$\mathrm{Tr}\, f(H) = \sum_i f(E_i) = \sum_k \sum_i \left(\xi_k^{(i)}\right)^2 f(E_i) \geq \sum_k f\left(\sum_i \left(\xi_k^{(i)}\right)^2 E_i\right) = \sum_k f\left(\sum_i \xi_k^{(i)} \sum_l H_{kl}\, \xi_l^{(i)}\right) = \sum_k f(H_{kk})$$ \n\nThe second equality follows from orthonormality, because $\sum_k (\xi_k^{(i)})^2 = 1$. The inequality uses the fact that, by completeness, for any $k$ we have $\sum_i (\xi_k^{(i)})^2 = 1$, so we may regard the $(\xi_k^{(i)})^2$ as probabilities, and by convexity Jensen's inequality can be used. After using the eigenvalue equation, the sum over $i$ was carried out with the help of the completeness relation, in order to obtain the last line. \n\nReferences \n\n[1] D. J. C. Mackay, Gaussian Processes, A Replacement for Neural Networks, NIPS tutorial 1997. May be obtained from http://wol.ra.phy.cam.ac.uk/pub/mackay/. \n\n[2] R. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics, Springer (1996). \n\n[3] C. K. I. Williams, Computing with Infinite Networks, in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 295-301, MIT Press (1997). \n\n[4] C. K. I. Williams and C. E. 
Rasmussen, Gaussian Processes for Regression, in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, eds., 514-520, MIT Press (1996). \n\n[5] R. M. Neal, Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification, Technical Report CRG-TR-97-2, Dept. of Computer Science, University of Toronto (1997). \n\n[6] M. N. Gibbs and D. J. C. Mackay, Variational Gaussian Process Classifiers, Preprint, Cambridge University (1997). \n\n[7] D. Barber and C. K. I. Williams, Gaussian Processes for Bayesian Classification via Hybrid Monte Carlo, in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 340-346, MIT Press (1997). \n\n[8] C. K. I. Williams and D. Barber, Bayesian Classification with Gaussian Processes, Preprint, Aston University (1997). \n\n[9] D. Haussler and M. Opper, Mutual Information, Metric Entropy and Cumulative Relative Entropy Risk, The Annals of Statistics, Vol. 25, No. 6, 2451 (1997). \n\n[10] R. Peierls, Phys. Rev. 54, 918 (1938). \n\n[11] H. Zhu, C. K. I. Williams, R. Rohwer and M. Morciniec, Gaussian Regression and Optimal Finite Dimensional Linear Models, Technical Report NCRG/97/011, Aston University (1997). \n", "award": [], "sourceid": 1622, "authors": [{"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Francesco", "family_name": "Vivarelli", "institution": null}]}