{"title": "Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations", "book": "Advances in Neural Information Processing Systems", "page_first": 273, "page_last": 279, "abstract": null, "full_text": "Learning curves for Gaussian processes \n\nregression: A framework for good \n\napproximations \n\nDorthe Malzahn \n\nManfred Opper \n\nNeural Computing Research Group \n\nSchool of Engineering and Applied Science \n\nAston University, Birmingham B4 7ET, United Kingdom. \n\n[malzahnd.opperm]~aston.ac.uk \n\nAbstract \n\nBased on a statistical mechanics approach, we develop a method \nfor approximately computing average case learning curves for Gaus(cid:173)\nsian process regression models. The approximation works well in \nthe large sample size limit and for arbitrary dimensionality of the \ninput space. We explain how the approximation can be systemati(cid:173)\ncally improved and argue that similar techniques can be applied to \ngeneral likelihood models. \n\n1 \n\nIntroduction \n\nGaussian process (GP) models have gained considerable interest in the Neural Com(cid:173)\nputation Community (see e.g.[I, 2, 3, 4] ) in recent years. Being non-parametric \nmodels by construction their theoretical understanding seems to be less well devel(cid:173)\noped compared to simpler parametric models like neural networks. We are especially \ninterested in developing theoretical approaches which will at least give good approx(cid:173)\nimations to generalization errors when the number of training data is sufficiently \nlarge. \n\nIn this paper we present a step in this direction which is based on a statistical me(cid:173)\nchanics approach. In contrast to most previous applications of statistical mechanics \nto learning theory we are not limited to the so called \"thermodynamic\" limit which \nwould require a high dimensional input space. \n\nOur work is very much motivated by recent papers of Peter Sollich (see e.g. 
[5]), who presented a nice approximate treatment of the Bayesian generalization error of GP regression which actually gives good results even in the case of a one-dimensional input space. His method is based on an exact recursion for the generalization error of the regression problem, together with approximations that decouple certain correlations of random variables. Unfortunately, the method seems to be limited in several ways. First, the exact recursion is an artifact of the Gaussianity of the regression model and is not available for other cases such as classification models. Second, it is not clear how to assess the quality of the approximations made and how one may systematically improve on them. Finally, the calculation is (so far) restricted to a full Bayesian scenario, where a prior average over the unknown data generating function simplifies the analysis.

Our approach has the advantage that it is more general and may also be applied to other likelihoods. It allows us to compute other quantities besides the generalization error. Finally, it is possible to compute the corrections to our approximations.

2 Regression with Gaussian processes

To explain the Gaussian process scenario for regression problems [2], we assume that we observe corrupted values y(x) ∈ R of an unknown function f(x) at input points x ∈ R^d. If the corruption is due to independent Gaussian noise with variance σ^2, the likelihood for a set of m example data D = (y(x_1), ..., y(x_m)) is given by

P(D|f) = \frac{\exp\left(-\sum_{i=1}^{m} \frac{(y_i - f(x_i))^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{m/2}}    (1)

where y_i ≡ y(x_i). The goal of a learner is to give an estimate of the function f(x). The available prior information is that f is a realization of a Gaussian process (random field) with zero mean and covariance C(x, x') = E[f(x)f(x')], where E denotes the expectation over the Gaussian process. 
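The GP regression setup just described can be sketched numerically. The following is a minimal illustration of the standard posterior-mean predictor, not this paper's replica calculation; the RBF covariance, the sine target and all parameter values are our own illustrative choices:

```python
import numpy as np

def rbf_cov(x1, x2, ell=0.1):
    # Covariance C(x, x') = exp(-(x - x')^2 / (2 l^2)) between two point sets
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell ** 2))

def posterior_mean(x_train, y_train, x_test, noise_var=0.01):
    # Posterior expectation of f at the test points given noisy data,
    # i.e. k(x)^T (K + sigma^2 I)^{-1} y for a zero-mean GP prior.
    K = rbf_cov(x_train, x_train) + noise_var * np.eye(len(x_train))
    return rbf_cov(x_test, x_train) @ np.linalg.solve(K, y_train)

# Noisy observations of an (assumed) target f*(x) = sin(2 pi x)
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=50)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=50)
x_test = np.linspace(0.0, 1.0, 101)
f_hat = posterior_mean(x_train, y_train, x_test)
```

Here f_hat plays the role of the posterior expectation that serves as the prediction in what follows.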
We assume that the prediction at a test point x is given by the posterior expectation of f(x), i.e. by

\hat{f}(x) = E\{f(x)|D\} = \frac{E[f(x)\, P(D|f)]}{Z}    (2)

where the partition function Z normalises the posterior. Calling the true data generating function f^* (in order to distinguish it from the functions over which we integrate in the expectations), we are interested in the learning curve, i.e. the generalization (mean square) error averaged over independent draws of example data, ε_g = [⟨(f^*(x) - \hat{f}(x))^2⟩]_D, as a function of m, the sample size. The brackets [...]_D denote averages over example data sets, where we assume that the inputs x_i are drawn independently at random from a density p(x). ⟨...⟩ denotes an average over test inputs drawn from the same density. Later, the same brackets will also be used for averages over several different test points and for joint averages over test inputs and test outputs.

3 The Partition Function

As typical of statistical mechanics approaches, we base our analysis on the averaged \"free energy\" [-ln Z]_D, where the partition function Z (see Eq. (2)) is

Z = E\, P(D|f).    (3)

[ln Z]_D serves as a generating function for suitable posterior averages. The concrete application to ε_g will be given in the next section. The computation of [ln Z]_D is based on the replica trick ln Z = \lim_{n\to 0} (Z^n - 1)/n, where we compute [Z^n]_D for integer n and perform the continuation at the end.

Introducing a set of auxiliary integration variables z_{ka} in order to decouple the squares, we get

[Z^n]_D = \int \prod_{k,a} \frac{dz_{ka}}{2\pi} \exp\left(-\frac{\sigma^2}{2}\sum_{k,a} z_{ka}^2\right) \left[E_n \exp\left(i\sum_{k,a} z_{ka}(f_a(x_k) - y_k)\right)\right]_D    (4)

where E_n denotes the expectation over the n-times replicated GP measure. In general, it seems impossible to perform the average over the data exactly. Using a cumulant expansion, an infinite series of terms would be created. 
However, one may be tempted to try the following heuristic approximation: if (for fixed functions f_a) the distribution of f_a(x_k) - y_k was a zero-mean Gaussian, we would simply end up with only the second cumulant and

[Z^n]_D ≈ \int \prod_{k,a} \frac{dz_{ka}}{2\pi} \exp\left(-\frac{\sigma^2}{2}\sum_{k,a} z_{ka}^2\right) E_n \exp\left(-\frac{1}{2}\sum_{a,b}\sum_{k} z_{ka} z_{kb}\, \langle(f_a(x) - y)(f_b(x) - y)\rangle\right).    (5)

Although such a reasoning may be justified in cases where the dimensionality of the inputs x is large, the assumption of approximate Gaussianity is typically (in the sense of the prior measure over functions f) completely wrong for small dimensions. Nevertheless, we will argue in the next section that the expression Eq. (5) (justified by a different reason) is a good approximation for large sample sizes and nonzero noise level. We will postpone the argument and proceed to evaluate Eq. (5) following a fairly standard recipe: the high dimensional integrals over z_{ka} are turned into low dimensional integrals by the introduction of \"order parameters\" η_{ab} = \sum_{k=1}^{m} z_{ka} z_{kb}, so that

[Z^n]_D ≈ \int \prod_{a\le b} d\eta_{ab} \exp\left(-\frac{\sigma^2}{2}\sum_{a} \eta_{aa} + G(\{\eta\})\right) \times E_n \exp\left(-\frac{1}{2}\sum_{a,b} \eta_{ab}\, \langle(f_a(x) - y)(f_b(x) - y)\rangle\right)    (6)

where e^{G(\{\eta\})} = \int \prod_{k,a} \frac{dz_{ka}}{2\pi} \prod_{a\le b} \delta\left(\sum_{k=1}^{m} z_{ka} z_{kb} - \eta_{ab}\right). We expect that in the limit of large sample size m, the integrals are well approximated by the saddle-point method. To perform the limit n → 0, we make the assumption that the saddle-point of the matrix η is replica symmetric, i.e. η_{ab} = η for a ≠ b and η_{aa} = η_0. After some calculations we arrive at

[\ln Z]_D = -\frac{\sigma^2 \eta_0}{2} + \frac{m}{2}\ln(\eta_0 - \eta) + \frac{m\eta}{2(\eta_0 - \eta)} - \frac{\eta}{2}\langle(E^0 f(x) - y)^2\rangle + \ln E^0 \exp\left[-\frac{\eta_0 - \eta}{2}\langle(f(x) - y)^2\rangle\right] - \frac{m}{2}(\ln(2\pi m) - 1)    (7)

into which we have to insert the values η and η_0 that make the right hand side an extremum. 
We have defined a new auxiliary (translated) Gaussian measure E^0 over functions by

(8)

where φ is a functional of f. For a given input distribution it is possible to compute the required expectations in terms of sums over eigenvalues and eigenfunctions of the covariance kernel C(x, x'). We will give the details as well as the explicit order parameter equations in a full version of the paper.

4 Generalization error

To relate the generalization error with the order parameters, note that in the replica framework (assuming the approximation Eq. (5)) we have

ε_g + σ^2 = -\lim_{n\to 0} \int \prod_{a\le b} d\eta_{ab} \exp\left[-\frac{\sigma^2}{2}\sum_{a}\eta_{aa} + G(\{\eta\})\right] \times \frac{\partial}{\partial \eta_{12}}\, E_n \exp\left(-\frac{1}{2}\sum_{a,b} \eta_{ab}\, \langle(f_a(x) - y)(f_b(x) - y)\rangle\right)

which by a partial integration and a subsequent saddle point integration yields

ε_g = -\frac{m\eta}{(\eta_0 - \eta)^2} - \sigma^2.    (9)

It is also possible to compute other error measures in terms of the order parameters, like the expected error on the (noisy) training data, defined by

(10)

The \"true\" training error, which compares the prediction with the data generating function f^*, is somewhat more complicated and will be given elsewhere.

5 Why (and when) the approximation works

Our intuition behind the approximation Eq. (5) is that for sufficiently large sample size, the partition function is dominated by regions in function space which are close to the data generating function f^*, such that terms like ⟨(f_a(x) - y)(f_b(x) - y)⟩ are typically small and higher order polynomials in f_a(x) - y generated by a cumulant expansion are less important. This intuition can be checked self-consistently by estimating the omitted terms perturbatively. 
We use the following modified partition function

[Z^n(\lambda)]_D = \int \prod_{k,a} \frac{dz_{ka}}{2\pi}\, e^{-\frac{\sigma^2}{2}\sum_{k,a} z_{ka}^2}\, E_n\left[\exp\left(i\lambda \sum_{k,a} z_{ka}(f_a(x_k) - y_k) - \frac{1-\lambda^2}{2}\sum_{a,b}\sum_{k} z_{ka} z_{kb}\, \langle(f_a(x) - y)(f_b(x) - y)\rangle\right)\right]_D    (11)

which for λ = 1 becomes the \"true\" partition function, whereas Eq. (5) is obtained for λ = 0. Expanding in powers of λ (the terms with odd powers vanish) is equivalent to generating the cumulant expansion and subsequently expanding the non-quadratic terms. Within the saddle-point approximation, the first nonzero correction to our approximation of [ln Z]_D is given by

\lambda^4 \left( \frac{(\eta_0 - \eta)^2}{2m} \Big( \sigma^2\langle C(x,x)\rangle + \langle C(x,x)F^2(x)\rangle - \langle C(x,x')F(x)F(x')\rangle + \eta\langle C(x,x')C(x,x'')C(x',x'')\rangle - \eta\langle C(x,x)C^2(x,x')\rangle \Big) + \left(-\frac{\eta}{2} + \frac{\eta^2}{4}\right) \Big( \langle C^2(x,x)\rangle - \langle C^2(x,x')\rangle \Big) \right).    (12)

Here C(x,x') = E^0\{f(x)f(x')\} denotes the covariance with respect to the auxiliary measure and F(x) ≡ f^*(x) - ⟨C(x,x')f^*(x')⟩. The significance of the individual terms as m → ∞ can be estimated from the following scaling: we find that η_0 - η = O(m) is a positive quantity, whereas η = O(m) is negative, and C(x,x') = O(1/m). Using these relations, we can show that Eq. (12) remains finite as m → ∞, whereas the leading approximation Eq. (7) diverges with m.

We have not (yet) computed the resulting correction to ε_g. However, we have studied the somewhat simpler error measure ε' ≡ (1/m) \sum_i [E\{(f^*(x_i) - f(x_i))^2 | D\}]_D, which can be obtained from a derivative of [ln Z]_D with respect to σ^2. It equals the error of a Gibbs algorithm (sampling from the posterior) on the training data. We can show that the correction to ε' is typically smaller than the leading term by a factor of O(1/m). However, our approximation becomes worse with decreasing noise variance σ^2. 
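The Gibbs error ε' just mentioned can also be estimated by direct simulation, since E{(f^*(x_i) - f(x_i))^2 | D} decomposes into the squared error of the posterior mean plus the posterior variance at x_i. Below is a sketch under illustrative assumptions of our own (RBF kernel, sine target, uniform inputs; none of these are fixed by the analysis here):

```python
import numpy as np

def rbf_cov(x1, x2, ell=0.1):
    # Covariance C(x, x') = exp(-(x - x')^2 / (2 l^2))
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell ** 2))

def gibbs_training_error(m, n_sets=200, noise_var=0.01, seed=0):
    # Monte Carlo average over data sets of
    #   eps' = (1/m) sum_i E{(f*(x_i) - f(x_i))^2 | D},
    # using E{(f* - f)^2 | D} = (f* - posterior mean)^2 + posterior variance.
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_sets):
        x = rng.uniform(0.0, 1.0, size=m)
        f_star = np.sin(2 * np.pi * x)          # illustrative target
        y = f_star + np.sqrt(noise_var) * rng.normal(size=m)
        K = rbf_cov(x, x)
        A = K + noise_var * np.eye(m)
        f_hat = K @ np.linalg.solve(A, y)       # posterior mean at the x_i
        post_var = np.diag(K - K @ np.linalg.solve(A, K))
        vals.append(np.mean((f_star - f_hat) ** 2 + post_var))
    return np.mean(vals)
```

Consistent with a decaying learning curve, the estimate shrinks as m grows.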
σ = 0 is a singular case for which (at least for some GPs with slowly decreasing eigenvalues) it can be shown that our approximation for ε_g decays to zero at the wrong rate. For small values of σ, σ → 0, we expect that higher order terms in the perturbation expansion will become relevant.

6 Results

We compare our analytical results for the error measures ε_g and ε_t with simulations of GP regression. For simplicity, we have chosen periodic processes of the form f(x) = \sqrt{2}\sum_n (a_n \cos(2\pi n x) + b_n \sin(2\pi n x)) for x ∈ [0,1], where the coefficients a_n, b_n are independent Gaussians with E{a_n^2} = E{b_n^2} = Λ_n. This choice is convenient for analytical calculations by the orthogonality of the trigonometric functions when we sample the x_i from a uniform density on [0,1]. The Λ_n and the translation invariant covariance kernel are related by c(x - y) ≡ C(x,y) = 2\sum_n Λ_n \cos(2\pi n(x - y)) and Λ_n = \int_0^1 c(x)\cos(2\pi n x)\,dx. We specialise to the (periodic) RBF kernel c(x) = \sum_{k=-\infty}^{\infty} \exp[-(x - k)^2/(2l^2)] with l = 0.1. For an illustration we generated learning curves for two target functions f^* as displayed in Fig. 1. One function is a sine wave f^*(x) = \sqrt{2Λ_1}\sin(2\pi x), while the other is a random realisation from the prior distribution. The symbols in the left panels of Fig. 1 represent example sets of fifty data points. The data points have been obtained by corrupting the target function with Gaussian noise of variance σ^2 = 0.01. The right panels of Fig. 1 show the data averaged generalization and training errors ε_g, ε_t as functions of the number m of example data. Solid curves display simulation results, while the results of our theory, Eqs. (9), (10), are given by dashed lines. The training error ε_t converges to the noise level σ^2. As one can see from the pictures, our theory is very accurate when the number m of example data is sufficiently large. 
While theory and simulation for the generalization error ε_g differ initially, the asymptotic decay is the same.

7 The Bayes error

We can also apply our method to the Bayesian generalization error (previously approximated by Peter Sollich [5]). The Bayes error is obtained by averaging the generalization error over \"true\" functions f^* drawn at random from the prior distribution. Within our approach this can be achieved by an average of Eq. (7) over f^*. The resulting order parameter equations and their relation to the Bayes error turn out to be identical to Sollich's result. Hence, we have managed to re-derive his approximation within a broader framework from which possible corrections can also be obtained.

Figure 1: The left panels show two data generating functions f^*(x) and example sets of 50 data points. The right panels display the corresponding averaged learning curves. Solid curves display simulation results for generalization and training errors ε_g, ε_t. The results of our theory, Eqs. (9), (10), are given by dashed lines.

8 Future work

At present, we extend our method in the following directions:

• The statistical mechanics framework presented in this paper is based on a partition function Z which can be used to generate a variety of other data averages for posterior expectations. An obvious interesting quantity is given by the sample fluctuations of the generalization error, which give confidence intervals on ε_g. 
• Obviously, our method is not restricted to a regression model (in this case, however, all resulting integrals are elementary) but can also be directly generalized to other likelihoods such as the classification case [4, 6]. A further application to Support Vector Machines should be possible.

• The saddle-point approximation neglects fluctuations of the order parameters. This may be well justified when m is sufficiently large. It is possible to improve on this by including the quadratic expansion around the saddle-point.

• Finally, one may criticise our method as being of minor relevance to practical applications, because our calculations require the knowledge of the unknown function f^* and the density of the inputs x. However, Eqs. (9) and (10) show that important error measures are expressed solely by the order parameters η and η_0. Hence, estimating some error measures and the posterior variance at the data points empirically would allow us to predict values for the order parameters. Those in turn could be used to make predictions for the unknown generalization error.

Acknowledgement

This work has been supported by EPSRC grant GR/M81601.

References

[1] D. J. C. Mackay, Gaussian Processes: A Replacement for Neural Networks, NIPS tutorial 1997. May be obtained from http://wol.ra.phy.cam.ac.uk/pub/mackay/.

[2] C. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Regression, in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, eds., 514-520, MIT Press (1996).

[3] C. K. I. Williams, Computing with Infinite Networks, in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 295-301, MIT Press (1997).

[4] D. Barber and C. K. I. 
Williams, Gaussian Processes for Bayesian Classification via Hybrid Monte Carlo, in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 340-346, MIT Press (1997).

[5] P. Sollich, Learning Curves for Gaussian Processes, in Neural Information Processing Systems 11, M. S. Kearns, S. A. Solla and D. A. Cohn, eds., 344-350, MIT Press (1999).

[6] L. Csató, E. Fokoué, M. Opper, B. Schottky and O. Winther, Efficient Approaches to Gaussian Process Classification, in Advances in Neural Information Processing Systems, volume 12, MIT Press (2000).
", "award": [], "sourceid": 1825, "authors": [{"given_name": "D\u00f6rthe", "family_name": "Malzahn", "institution": null}, {"given_name": "Manfred", "family_name": "Opper", "institution": null}]}