{"title": "Gaussian Process Regression with Mismatched Models", "book": "Advances in Neural Information Processing Systems", "page_first": 519, "page_last": 526, "abstract": null, "full_text": "Gaussian Process Regression with \n\nMismatched Models \n\nDepartment of Mathematics, King's College London \n\nStrand, London WC2R 2LS, U.K. Email peter.sollich@kcl.ac . uk \n\nPeter Sollich \n\nAbstract \n\nLearning curves for Gaussian process regression are well understood \nwhen the 'student' model happens to match the 'teacher' (true data \ngeneration process). I derive approximations to the learning curves \nfor the more generic case of mismatched models, and find very rich \nbehaviour: For large input space dimensionality, where the results \nbecome exact, there are universal (student-independent) plateaux \nin the learning curve, with transitions in between that can exhibit \narbitrarily many over-fitting maxima; over-fitting can occur even \nif the student estimates the teacher noise level correctly. In lower \ndimensions, plateaux also appear, and the learning curve remains \ndependent on the mismatch between student and teacher even in \nthe asymptotic limit of a large number of training examples. Learn(cid:173)\ning with excessively strong smoothness assumptions can be partic(cid:173)\nularly dangerous: For example, a student with a standard radial \nbasis function covariance function will learn a rougher teacher func(cid:173)\ntion only logarithmically slowly. All predictions are confirmed by \nsimulations. \n\n1 \n\nIntroduction \n\nThere has in the last few years been a good deal of excitement about the use \nof Gaussian processes (GPs) as an alternative to feedforward networks [1]. GPs \nmake prior assumptions about the problem to be learned very transparent, and \neven though they are non-parametric models, inference- at least in the case of \nregression considered below-\nis relatively straightforward. 
One crucial question for applications is then how 'fast' GPs learn, i.e. how many training examples are needed to achieve a certain level of generalization performance. The typical (as opposed to worst case) behaviour is captured in the learning curve, which gives the average generalization error ε as a function of the number of training examples n. Good bounds and approximations for ε(n) are now available [1, 2, 3, 4, 5], but these are mostly restricted to the case where the 'student' model exactly matches the true 'teacher' generating the data¹. In practice, such a match is unlikely, and so it is important to understand how GPs learn if there is some model mismatch. This is the aim of this paper.

¹The exception is the elegant work of Malzahn and Opper [2], which uses a statistical physics framework to derive approximate learning curves that also apply for any fixed target function. However, this framework has not yet to my knowledge been exploited to consider systematically the effects of having a mismatch between the teacher prior over target functions and the prior assumed by the student.

In its simplest form, the regression problem is this: We are trying to learn a function θ* which maps inputs x (real-valued vectors) to (real-valued scalar) outputs θ*(x). We are given a set of training data D, consisting of n input-output pairs (x^l, y^l); the training outputs y^l may differ from the 'clean' teacher outputs θ*(x^l) due to corruption by noise. Given a test input x, we are then asked to come up with a prediction θ̂(x), plus error bar, for the corresponding output θ*(x). In a Bayesian setting, we do this by specifying a prior P(θ) over hypothesis functions, and a likelihood P(D|θ) with which each θ could have generated the training data; from this we deduce the posterior distribution P(θ|D) ∝ P(D|θ)P(θ). For a GP, the prior is defined directly over input-output functions θ; this is simpler than for a Bayesian feedforward net since no weights are involved which would have to be integrated out.
Any θ is uniquely determined by its output values θ(x) for all x from the input domain, and for a GP, these are assumed to have a joint Gaussian distribution (hence the name). If we set the means to zero as is commonly done, this distribution is fully specified by the covariance function ⟨θ(x)θ(x')⟩_θ = C(x,x'). The latter transparently encodes prior assumptions about the function to be learned. Smoothness, for example, is controlled by the behaviour of C(x,x') for x' → x: The Ornstein-Uhlenbeck (OU) covariance function C(x,x') = exp(−|x−x'|/l) produces very rough (non-differentiable) functions, while functions sampled from the radial basis function (RBF) prior with C(x,x') = exp[−|x−x'|²/(2l²)] are infinitely differentiable. Here l is a lengthscale parameter, corresponding directly to the distance in input space over which we expect significant variation in the function values.

There are good reviews on how inference with GPs works [1, 6], so I only give a brief summary here. The student assumes that outputs y are generated from the 'clean' values of a hypothesis function θ(x) by adding Gaussian noise of x-independent variance σ². The joint distribution of a set of training outputs {y^l} and the function values θ(x) is then also Gaussian, with covariances given (under the student model) by

⟨y^l y^m⟩ = C(x^l, x^m) + σ² δ_lm = (K)_lm,    ⟨y^l θ(x)⟩ = C(x^l, x) = (k(x))_l

Here I have defined an n x n matrix K and an x-dependent n-component vector k(x). The posterior distribution P(θ|D) is then obtained by conditioning on the {y^l}; it is again Gaussian and has mean and variance

⟨θ(x)⟩_{θ|D} ≡ θ̂(x|D) = k(x)^T K^{-1} y    (1)
⟨[θ(x) − θ̂(x|D)]²⟩_{θ|D} = C(x,x) − k(x)^T K^{-1} k(x)    (2)

From the student's point of view, this solves the inference problem: The best prediction for θ(x) on the basis of the data D is θ̂(x|D), with a (squared) error bar given by (2).
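As a concrete illustration, the posterior mean (1) and squared error bar (2) can be evaluated with a few lines of linear algebra. This is only a sketch: the RBF covariance, lengthscale and toy data below are arbitrary choices, not anything prescribed by the paper.

```python
import numpy as np

def gp_posterior(X_train, y, x_test, cov, sigma2):
    """Posterior mean (eq. 1) and squared error bar (eq. 2) of GP regression."""
    n = len(X_train)
    # (K)_lm = C(x^l, x^m) + sigma^2 delta_lm
    K = np.array([[cov(a, b) for b in X_train] for a in X_train]) + sigma2 * np.eye(n)
    # (k(x))_l = C(x^l, x)
    k = np.array([cov(a, x_test) for a in X_train])
    mean = k @ np.linalg.solve(K, y)                       # eq. (1)
    var = cov(x_test, x_test) - k @ np.linalg.solve(K, k)  # eq. (2)
    return mean, var

# RBF covariance with lengthscale l = 0.5 (arbitrary example values)
rbf = lambda a, b: np.exp(-(a - b) ** 2 / (2 * 0.5 ** 2))
X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
mean, var = gp_posterior(X, y, 1.0, rbf, sigma2=1e-6)
```

At a training input and with a tiny assumed noise level, the posterior mean essentially reproduces the observed output, and the error bar shrinks towards the noise level, as expected from (1) and (2).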
The squared deviation between the prediction and the teacher is [θ̂(x|D) − θ*(x)]²; the average generalization error (which, as a function of n, defines the learning curve) is obtained by averaging this over the posterior distribution of teachers, all datasets, and the test input x:

E = ⟨⟨⟨[θ̂(x|D) − θ*(x)]²⟩_{θ*|D}⟩_D⟩_x    (3)

Now of course the student does not know the true posterior of the teacher; to estimate E, she must assume that it is identical to the student posterior, giving from (2)

ε = ⟨⟨⟨[θ̂(x|D) − θ(x)]²⟩_{θ|D}⟩_D⟩_x = ⟨⟨C(x,x) − k(x)^T K^{-1} k(x)⟩_{x^l}⟩_x    (4)

where in the last expression I have replaced the average over D by one over the training inputs, since the outputs no longer appear. If the student model matches the true teacher model, E and ε coincide and give the Bayes error, i.e. the best achievable (average) generalization performance for the given teacher.

I assume in what follows that the teacher is also a GP, but with a possibly different covariance function C*(x,x') and noise level σ*². This allows eq. (3) for E to be simplified, since by exact analogy with the argument for the student posterior

⟨θ*(x)⟩_{θ*|D} = k*(x)^T K*^{-1} y,    ⟨θ*²(x)⟩_{θ*|D} = ⟨θ*(x)⟩²_{θ*|D} + C*(x,x) − k*(x)^T K*^{-1} k*(x)

and thus, abbreviating a(x) = K^{-1} k(x) − K*^{-1} k*(x),
E = ⟨⟨a(x)^T y y^T a(x) + C*(x,x) − k*(x)^T K*^{-1} k*(x)⟩_D⟩_x

Conditional on the training inputs, the training outputs have a Gaussian distribution given by the true (teacher) model; hence ⟨y y^T⟩_{{y^l}|{x^l}} = K*, giving

E = ⟨⟨C*(x,x) − 2 k*(x)^T K^{-1} k(x) + k(x)^T K^{-1} K* K^{-1} k(x)⟩_{x^l}⟩_x    (5)

2 Calculating the learning curves

An exact calculation of the learning curve E(n) is difficult because of the joint average in (5) over the training inputs {x^l} and the test input x. A more convenient starting point is obtained if (using Mercer's theorem) we decompose the covariance function into its eigenfunctions φ_i(x) and eigenvalues λ_i, defined w.r.t. the input distribution so that ⟨C(x,x') φ_i(x')⟩_{x'} = λ_i φ_i(x), with the corresponding normalization ⟨φ_i(x) φ_j(x)⟩_x = δ_ij. Then

C(x,x') = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(x'),    C*(x,x') = Σ_{i=1}^∞ λ*_i φ_i(x) φ_i(x')    (6)

For simplicity I assume here that the student and teacher covariance functions have the same eigenfunctions (but different eigenvalues). This is not as restrictive as it may seem; several examples are given below. The averages over the test input x in (5) are now easily carried out: E.g. for the last term we need

⟨(k(x) k(x)^T)_lm⟩_x = Σ_ij λ_i λ_j φ_i(x^l) ⟨φ_i(x) φ_j(x)⟩_x φ_j(x^m) = Σ_i λ_i² φ_i(x^l) φ_i(x^m)

Introducing the diagonal eigenvalue matrix (Λ)_ij = λ_i δ_ij and the 'design matrix' (Φ)_li = φ_i(x^l), this reads ⟨k(x) k(x)^T⟩_x = Φ Λ² Φ^T. Similarly, for the second term in (5), ⟨k(x) k*(x)^T⟩_x = Φ Λ Λ* Φ^T, and ⟨C*(x,x)⟩_x = tr Λ*. This gives, dropping the training inputs subscript from the remaining average,

E = ⟨tr Λ* − 2 tr Φ Λ Λ* Φ^T K^{-1} + tr Φ Λ² Φ^T K^{-1} K* K^{-1}⟩

In this new representation we also have K = σ² I + Φ Λ Φ^T, and similarly for K*; for the inverse of K we can use the Woodbury formula to write K^{-1} = σ^{-2} [I − σ^{-2} Φ G Φ^T], where G = (Λ^{-1} + σ^{-2} Φ^T Φ)^{-1}.
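The Woodbury step just used is easy to check numerically. In the sketch below, the design matrix Φ and the eigenvalue matrix Λ are filled with arbitrary random test values; nothing here is tied to a particular covariance function.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 4          # n training points, p eigenfunctions retained
sigma2 = 0.3
Phi = rng.standard_normal((n, p))          # design matrix (Phi)_li = phi_i(x^l)
Lam = np.diag(rng.uniform(0.1, 1.0, p))    # eigenvalue matrix Lambda

K = sigma2 * np.eye(n) + Phi @ Lam @ Phi.T
# G = (Lambda^{-1} + sigma^{-2} Phi^T Phi)^{-1}
G = np.linalg.inv(np.linalg.inv(Lam) + Phi.T @ Phi / sigma2)
# Woodbury formula: K^{-1} = sigma^{-2} [I - sigma^{-2} Phi G Phi^T]
K_inv = (np.eye(n) - Phi @ G @ Phi.T / sigma2) / sigma2

print(np.allclose(K_inv @ K, np.eye(n)))  # True
```

The payoff of the identity is that the matrix inverted on the right-hand side is p x p (number of retained eigenfunctions) rather than n x n.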
Inserting these results, one finds after some algebra that

E = σ*² σ^{-2} [⟨tr G⟩ − ⟨tr G Λ^{-1} G⟩] + ⟨tr G Λ* Λ^{-2} G⟩    (7)

which for the matched case reduces to the known result for the Bayes error [4]

ε = ⟨tr G⟩    (8)

Eqs. (7,8) are still exact. We now need to tackle the remaining averages over training inputs. Two of these are of the form ⟨tr G M G⟩; if we generalize the definition of G to G = (Λ^{-1} + vI + wM + σ^{-2} Φ^T Φ)^{-1} and define g = ⟨tr G⟩, then they reduce to ⟨tr G M G⟩ = −∂g/∂w. (The derivative is taken at v = w = 0; the idea behind introducing v will become clear shortly.) So it is sufficient to calculate g. To do this, consider how G changes when a new example is added to the training set. One has

G(n+1) − G(n) = [G^{-1}(n) + σ^{-2} ψψ^T]^{-1} − G(n) = − G(n) ψψ^T G(n) / (σ² + ψ^T G(n) ψ)    (9)

in terms of the vector ψ with elements (ψ)_i = φ_i(x), where x is the new training input.

(which can easily be evaluated as an average over two binomially distributed variables, counting the number of +1's in x overall and among the x_a with a ∈ ρ). With the λ_s and λ*_s determined, it is then a simple matter to evaluate the predicted learning curve (11,12) numerically. First, though, focus on the limit of large d, where much more can be said. If we write C(x,x') = f(x·x'/d), the eigenvalues become, for d → ∞, λ_s = d^{-s} f^{(s)}(0), and the contribution to C(x,x) = f(1) from the s-th eigenvalue block is λ̄_s ≡ (d choose s) λ_s → f^{(s)}(0)/s!, consistent with f(1) = Σ_{s≥0} f^{(s)}(0)/s!. The λ_s, because of their scaling with d, become infinitely separated for d → ∞. For training sets of size n = O(d^L), we then see from (11) that eigenvalues with s > L contribute as if n = 0, since λ_s^{-1} ≫ n/(σ² + ε); they have effectively not yet been learned. On the other hand, eigenvalues with s < L are completely suppressed and have been learnt perfectly.
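Stepping back briefly, the one-example recursion (9) is just a Sherman-Morrison rank-one update, and can be verified directly against the definition of G. The dimensions, noise level and random values below are arbitrary test choices for this sketch (taken at v = w = 0):

```python
import numpy as np

rng = np.random.default_rng(1)
p, sigma2 = 5, 0.5
Lam_inv = np.diag(1.0 / rng.uniform(0.1, 1.0, p))   # Lambda^{-1}

Phi_n = rng.standard_normal((3, p))                 # n = 3 existing examples
G_n = np.linalg.inv(Lam_inv + Phi_n.T @ Phi_n / sigma2)

psi = rng.standard_normal(p)                        # (psi)_i = phi_i(x), new input x
Phi_n1 = np.vstack([Phi_n, psi])
G_direct = np.linalg.inv(Lam_inv + Phi_n1.T @ Phi_n1 / sigma2)

# eq. (9): rank-one update instead of a fresh matrix inversion
G_update = G_n - np.outer(G_n @ psi, psi @ G_n) / (sigma2 + psi @ G_n @ psi)

print(np.allclose(G_direct, G_update))  # True
```

The update form is what makes it possible to track how g = ⟨tr G⟩ evolves as examples are added one at a time.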
We thus have a hierarchical learning scenario, where different scalings of n with d, as defined by L, correspond to different 'learning stages'. Formally, we can analyse the stages separately by letting d → ∞ at a constant ratio α = n/(d choose L) of the number of examples to the number of parameters to be learned at stage L (note that (d choose L) = O(d^L) for large d). An independent (replica) calculation along the lines of Ref. [8] shows that our approximation for the learning curve actually becomes exact in this limit. The resulting α-dependence of E can be determined explicitly: Set ε_L = Σ_{s≥L} λ̄_s (so that ε_0 = f(1)) and similarly for ε*_L. Then for large α,

E = ε*_{L+1} + (ε*_{L+1} + σ*²) α^{-1} + O(α^{-2})    (13)

This implies that, during successive learning stages, (teacher) eigenvalues are learnt one by one and their contribution eliminated from the generalization error, giving plateaux in the learning curve at E = ε*_1, ε*_2, .... These plateaux, as well as the asymptotic decay (13) towards them, are universal [8], i.e. student-independent. The (non-universal) behaviour for smaller α can also be fully characterized: Consider first the simple case of linear perceptron learning (see e.g. [7]), which corresponds to both student and teacher having simple dot-product covariance functions C(x,x') = C*(x,x') = x·x'/d. In this case there is only a single learning stage (only λ̄_1 = λ̄*_1 = 1 are nonzero), and E = r(α) decays from r(0) = 1 to r(∞) = 0, with an over-fitting maximum around α = 1 if σ² is sufficiently small compared to σ*². In terms of this function r(α), the learning curve at stage L for general covariance functions is then exactly given by E = ε*_{L+1} + λ̄*_L r(α), if in the evaluation of r(α) the effective noise levels σ̃² = (ε_{L+1} + σ²)/λ̄_L and σ̃*² = (ε*_{L+1} + σ*²)/λ̄*_L are used.
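The over-fitting maximum of r(α) near α = 1 in the linear perceptron case can be reproduced by a direct Monte Carlo simulation: a teacher θ*(x) = w·x/√d with Gaussian w has exactly the dot-product covariance x·x'/d. The dimension, noise levels and trial count below are arbitrary sketch choices, not values from the paper.

```python
import numpy as np

def gen_error(n, d, sigma2, sigma2_star, trials=40, rng=None):
    """Monte Carlo estimate of E for C = C* = x.x'/d (linear perceptron case),
    teacher noise sigma2_star, student noise sigma2, Gaussian inputs."""
    rng = rng or np.random.default_rng(0)
    err = 0.0
    for _ in range(trials):
        w = rng.standard_normal(d)                  # teacher weights
        X = rng.standard_normal((n, d))
        y = X @ w / np.sqrt(d) + np.sqrt(sigma2_star) * rng.standard_normal(n)
        K = X @ X.T / d + sigma2 * np.eye(n)        # student's K
        x = rng.standard_normal(d)                  # test input
        k = X @ x / d                               # student's k(x)
        pred = k @ np.linalg.solve(K, y)            # eq. (1)
        err += (pred - x @ w / np.sqrt(d)) ** 2
    return err / trials

# student noise much smaller than teacher noise: over-fitting maximum near alpha = 1
d = 30
E = [gen_error(round(a * d), d, sigma2=1e-3, sigma2_star=0.5) for a in (0.2, 1.0, 3.0)]
```

With σ² ≪ σ*², the error at α = 1 rises well above both its small-α and large-α values, i.e. the learning curve is non-monotonic.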
Note how in σ̃*², the contribution ε*_{L+1} from the not-yet-learned eigenvalues acts as effective noise, and is normalized by the amount of 'signal' λ̄*_L = ε*_L − ε*_{L+1} available at learning stage L. The analogous definition of σ̃² implies that, for small σ² and depending on the choice of student covariance function, there can be arbitrarily many learning stages L where σ̃² ≪ σ̃*², and therefore arbitrarily many over-fitting maxima in the resulting learning curves. From the definitions of σ̃² and σ̃*² it is clear that this situation can occur even if the student knows the exact teacher noise level, i.e. even if σ² = σ*².

Fig. 1 (left) demonstrates that the above conclusions hold not just for d → ∞; even for the cases shown, with d = 10, up to three over-fitting maxima are apparent. Our theory provides a very good description of the numerically simulated learning curves even though, at such small d, the predictions are still significantly different from those for d → ∞ (see Fig. 1 (right)) and therefore not guaranteed to be exact.

Figure 1: Left: Learning curves for RBF student and teacher, with uniformly distributed, binary input vectors with d = 10 components. Noise levels: Teacher σ*² = 1, student σ² = 10^{-4}, 10^{-3}, ..., 1 (top to bottom). Length scales: Teacher l* = d^{1/2}, student l = 2d^{1/2}. Dashed: numerical simulations, solid: theoretical prediction. Right: Learning curves for σ² = 10^{-4} and increasing d (top to bottom: 10, 20, 30, 40, 60, 80, [bold] ∞). The x-axis shows α = n/(d choose L), for learning stages L = 1, 2, 3; the dashed lines are the universal asymptotes (13) for d → ∞.

In the second example scenario, I consider continuous-valued input vectors, uniformly distributed over the unit interval [0,1]; generalization to d dimensions (x ∈ [0,1]^d) is straightforward. For covariance functions which are stationary, i.e.
dependent on x and x' only through x − x', and assuming periodic boundary conditions (see [4] for details), one then again has covariance function-independent eigenfunctions. They are indexed by integers³ q, with φ_q(x) = e^{2πiqx}; the corresponding eigenvalues are λ_q = ∫dx C(0,x) e^{-2πiqx}. For the ('periodified') RBF covariance function C(x,x') = exp[−(x−x')²/(2l²)], for example, one has λ_q ∝ exp(−q̃²/2), where q̃ = 2πlq. The OU case C(x,x') = exp(−|x−x'|/l), on the other hand, gives λ_q ∝ (1+q̃²)^{-1}, thus λ_q ∝ q^{-2} for large q. I also consider below covariance functions which interpolate in smoothness between the OU and RBF limits: E.g. the MB2 (modified Bessel) covariance C(x,x') = e^{-a}(1+a), with a = |x−x'|/l, yields functions which are once differentiable [5]; its eigenvalues λ_q ∝ (1+q̃²)^{-2} show a faster asymptotic power law decay, λ_q ∝ q^{-4}, than those of the OU covariance function. To subsume all these cases I assume in the following analysis of the general shape of the learning curves that λ_q ∝ q^{-r} (and similarly λ*_q ∝ q^{-r*}). Here r = 2 for OU, r = 4 for MB2, and (due to the faster-than-power law decay of its eigenvalues) effectively r = ∞ for RBF.

From (11,12), it is clear that the n-dependence of the Bayes error ε has a strong effect on the true generalization error E. From previous work [4], we know that ε(n) has two regimes: For small n, where ε ≫ σ², ε is dominated by regions in input space which are too far from the training examples to have significant correlation with them, and one finds ε ∝ n^{-(r-1)}. For much larger n, learning is essentially against noise, and one has a slower decay ε ∝ (n/σ²)^{-(r-1)/r}. These power laws can be derived from (11) by approximating factors such as [λ_q^{-1} + n/(σ² + ε)]^{-1} as equal to either λ_q or to 0, depending on whether n/(σ² + ε) < or > λ_q^{-1}.
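The approximation just described can be made concrete in code. Since eq. (11) is not quoted in full above, the sketch below assumes the self-consistent form of the Bayes error from Ref. [4], ε = Σ_q [λ_q^{-1} + n/(σ² + ε)]^{-1}, and iterates it to a fixed point; the spectrum normalization, noise level and truncation at 10^6 modes are arbitrary implementation choices.

```python
import numpy as np

def bayes_error(n, lam, sigma2, iters=50):
    # fixed point of: eps = sum_q [lam_q^{-1} + n/(sigma^2 + eps)]^{-1}
    inv_lam = 1.0 / lam
    eps = lam.sum()                       # start from the n = 0 value <C(x,x)>
    for _ in range(iters):
        eps = np.sum(1.0 / (inv_lam + n / (sigma2 + eps)))
    return eps

q = np.arange(1, 10**6 + 1, dtype=float)
lam = 1.0 / q**2                          # OU-like spectrum, r = 2
lam /= lam.sum()                          # normalize so that <C(x,x)> = 1
sigma2 = 1e-3

# small-n regime (eps >> sigma^2): eps ~ n^{-(r-1)} = n^{-1}
e1, e2 = bayes_error(10, lam, sigma2), bayes_error(20, lam, sigma2)
# large-n regime (eps << sigma^2): eps ~ (n/sigma^2)^{-(r-1)/r} = n^{-1/2}
e3, e4 = bayes_error(10**6, lam, sigma2), bayes_error(2 * 10**6, lam, sigma2)
print(np.log2(e1 / e2), np.log2(e3 / e4))   # near 1 and near 0.5
```

Doubling n should reduce ε by a factor close to 2 in the first regime but only close to √2 in the second, reproducing the crossover between the two power laws.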
With the same technique, one can estimate the behaviour of E from (12). In the small n-regime, one finds E ≈ c_1 σ*² + c_2 n^{-(r*-1)}, with prefactors c_1, c_2 depending on the student.

³Since λ_q = λ_{-q}, one can assume q ≥ 0 if all λ_q for q > 0 are taken as doubly degenerate.

Figure 2: Learning curves for inputs x uniformly distributed over [0,1]. Teacher: MB2 covariance function, lengthscale l* = 0.1, noise level σ*² = 0.1; student lengthscale l = 0.1 throughout. Dashed: simulations, solid: theory. Left: OU student with σ² as shown. The predicted plateau appears as σ² decreases. Right: Students with σ² = 0.1 and covariance function as shown; for clarity, the RBF and OU results have been multiplied by √10 and 10, respectively. Dash-dotted lines show the predicted asymptotic power laws for MB2 and OU; the RBF data have a persistent upward curvature consistent with the predicted logarithmic decay. Inset: RBF student with σ² = 10^{-3}, showing the occurrence of over-fitting maxima.

Note that the contribution proportional to σ*² is automatically negligible in the matched case (since then E = ε ≫ σ² = σ*² for small n); if there is a model mismatch, however, and if the small-n regime extends far enough, it will become significant. This is the case for small σ²; indeed, for σ² → 0, the 'small n' criterion ε ≫ σ² is satisfied for any n. Our theory thus predicts the appearance of plateaux in the learning curves, becoming more pronounced as σ² decreases; Fig. 2 (left) confirms this⁴. Numerical evaluation also shows that for small σ², over-fitting maxima may occur before the plateau is reached, consistent with simulations; see inset in Fig. 2 (right). In the large n-regime (ε ≪ σ²), our theory predicts that the generalization error decays as a power law.
If the student assumes a rougher function than the teacher provides (r < r*), the asymptotic power law exponent E ∝ n^{-(r-1)/r} is determined by the student alone. In the converse case, the asymptotic decay is E ∝ n^{-(r*-1)/r} and can be very slow, actually becoming logarithmic for an RBF student (r → ∞). For r = r*, the fastest decay for given r* is obtained, as expected from the properties of the Bayes error. The simulation data in Fig. 2 are compatible with these predictions (though the simulations cover too small a range of n to allow exponents to be determined precisely). It should be stressed that the above results imply that there is no asymptotic regime of large training sets in which the learning curve assumes a universal form, in contrast to the case of parametric models, where the generalization error decays as E ∝ 1/n for sufficiently large n independently of model mismatch (as long as the problem is learnable at all). This conclusion may seem counter-intuitive, but becomes clear if one remembers that a GP covariance function with an infinite number of nonzero eigenvalues λ_i always has arbitrarily many eigenvalues that are arbitrarily close to zero (since the λ_i are positive and Σ_i λ_i = ⟨C(x,x)⟩ is finite). Whatever n, there are therefore many eigenvalues for which λ_i^{-1} ≫ n/σ², corresponding to degrees of freedom which are still mainly determined by the prior rather than the data (compare (11)). In other words, a regime where the data completely overwhelms the mismatched prior, and where the learning curve could therefore become independent of model mismatch, can never be reached.

⁴If σ² = 0 exactly, the plateau will extend to n → ∞. With hindsight, this is clear: a GP with an infinite number of nonzero eigenvalues has no limit on the number of its 'degrees of freedom' and can fit perfectly any amount of noisy training data, without ever learning the true teacher function.
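The eigenvalue decays that underpin this section (r = 2 for OU, r = 4 for MB2, faster than any power for RBF) can be checked numerically by evaluating λ_q = ∫dx C(0,x) e^{-2πiqx} with an FFT. Periodifying by summing over images, and the particular lengthscale and grid size, are implementation choices for this sketch:

```python
import numpy as np

l, N = 0.1, 2**12
x = np.arange(N) / N

def periodify(f, images=5):
    # periodic boundary conditions on [0,1]: C_per(0,x) = sum_m C(0, x + m)
    return sum(f(np.abs(x + m)) for m in range(-images, images + 1))

covs = {
    "OU":  periodify(lambda a: np.exp(-a / l)),
    "MB2": periodify(lambda a: np.exp(-a / l) * (1 + a / l)),
    "RBF": periodify(lambda a: np.exp(-(a**2) / (2 * l**2))),
}

# lambda_q = int_0^1 C(0,x) exp(-2 pi i q x) dx, evaluated by the FFT
lam = {name: np.real(np.fft.fft(c))[: N // 2] / N for name, c in covs.items()}

# for lambda_q ~ q^{-r}, the ratio lambda_q / lambda_{2q} tends to 2^r
for name in ("OU", "MB2"):
    print(name, lam[name][100] / lam[name][200])   # near 2^2 and 2^4
```

The RBF spectrum drops below the other two after only a few modes, consistent with its effectively infinite r.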
In summary, the above approximate theory makes a number of non-trivial predictions for GP learning with mismatched models, all borne out by simulations: for large input space dimensions, the occurrence of multiple over-fitting maxima; in lower dimensions, the generic presence of plateaux in the learning curve if the student assumes too small a noise level σ², and strong effects of model mismatch on the asymptotic learning curve decay. The behaviour is much richer than for the matched case, and could guide the choice of (student) priors in real-world applications of GP regression; RBF students, for example, run the risk of very slow logarithmic decay of the learning curve if the target (teacher) is less smooth than assumed.

An important issue for future work, some of which is in progress, is to analyse to which extent hyperparameter tuning (e.g. via evidence maximization) can make GP learning robust against some forms of model mismatch, e.g. a misspecified functional form of the covariance function. One would like to know, for example, whether a data-dependent adjustment of the lengthscale of an RBF covariance function would be sufficient to avoid the logarithmically slow learning of rough target functions.

References

[1] See e.g. D J C MacKay, Gaussian Processes, Tutorial at NIPS 10; recent papers by Csató et al. (NIPS 12), Goldberg/Williams/Bishop (NIPS 10), Williams and Barber/Williams (NIPS 9), Williams/Rasmussen (NIPS 8); and references below.
[2] D Malzahn and M Opper. In NIPS 13, pages 273-279; also in NIPS 14.
[3] C A Micchelli and G Wahba. In Z Ziegler, editor, Approximation Theory and Applications, pages 329-348. Academic Press, 1981; M Opper. In K-Y M Wong et al., editors, Theoretical Aspects of Neural Computation, pages 17-23. Springer, 1997.
[4] P Sollich. In NIPS 11, pages 344-350.
[5] C K I Williams and F Vivarelli. Mach. Learn., 40:77-102, 2000.
[6] C K I Williams. In M I Jordan, editor, Learning and Inference in Graphical Models, pages 599-621. Kluwer Academic, 1998.
[7] P Sollich. J. Phys. A, 27:7771-7784, 1994.
[8] M Opper and R Urbanczik. Phys. Rev. Lett., 86:4410-4413, 2001.
[9] R Dietrich, M Opper, and H Sompolinsky. Phys. Rev. Lett., 82:2975-2978, 1999.
", "award": [], "sourceid": 1987, "authors": [{"given_name": "Peter", "family_name": "Sollich", "institution": null}]}