{"title": "Learning Curves for Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 344, "page_last": 350, "abstract": null, "full_text": "Learning curves for Gaussian processes \n\nPeter Sollich * \n\nDepartment of Physics, University of Edinburgh \n\nEdinburgh EH9 3JZ, U.K. Email: P.SollichA2T. One then also has (C(x,x)}x = \ntr A, and the matrix K is expressed as K = a 21 + AT, 1 being the identity \nmatrix. Collecting these results, we have \n\n\u20ac(D) = tr A - tr (a21 + AT)-IA2T \n\nThis can be simplified using the Woodbury formula for matrix inverses (see e.g. [7]), \nwhich applied to our case gives (a2I+AT)-1 = a-2[I-(a21+ATTJ; \nafter a few lines of algebra, one then obtains the final result \n\nt= (t(D))D' \n\n\u20ac(D) =tra2A(a21+ATT T . It also includes as a special case the well-known result \nfor linear regression (see e.g. [8]); A-I and T can be interpreted as suitably \ngeneralized versions of the weight decay (matrix) and input correlation matrix. \nStarting from (6), one can now derive approximate expressions for the learning \n\n\fLearning Curves for Gaussian Processes \n\n347 \n\ncurve I:(n). The most naive approach is to entirely neglect the fluctuations in cJ>TcJ> \nover different data sets and replace it by its average, which is simply (( cJ> T cJ> )ij ) D = \nI:l (\u00a2i(Xt)\u00a2j(XI)) D = n8ij . This leads to the Naive approximation \n\nI:N(n) = tr (A -1 + O'- 2nI)-1 \n\n(7) \nwhich is not, in general, very good. It does however become exact in the large noise \nlimit 0'2 -t 00 at constant nlO'2 : The fluctuations of the elements of the matrix \nO'-2cJ>TcJ> then become vanishingly small (of order foO'- 2 = (nlO' 2 )/fo -t 0) and \nso replacing cJ> T cJ> by its average is justified. \nTo derive better approximations, it is useful to see how the matrix 9 = (A -1 + \nO'-2cJ>TcJ\u00bb-1 changes when a new example is added to the training set. 
One has \n\n𝒢(n+1) − 𝒢(n) = [𝒢^{-1}(n) + σ^{-2}ψψ^T]^{-1} − 𝒢(n) = − 𝒢(n)ψψ^T𝒢(n) / (σ² + ψ^T𝒢(n)ψ) (8) \n\nin terms of the vector ψ with elements (ψ)_i = φ_i(x_{n+1}); the second identity again uses the Woodbury formula. To get exact learning curves, one would have to average this update formula over both the new training input x_{n+1} and all previous ones. This is difficult, but progress can be made by again neglecting some fluctuations: the average over x_{n+1} is approximated by replacing ψψ^T by its average, which is simply the identity matrix; the average over the previous training inputs, by replacing 𝒢(n) by its average G(n) = ⟨𝒢(n)⟩_D. This yields the approximation \n\nG(n+1) − G(n) = − G²(n) / (σ² + tr G(n)). (9) \n\nIterating from G(n=0) = Λ, one sees that G(n) remains diagonal for all n, and so (9) is trivial to implement numerically. I call the resulting ε_D(n) = tr G(n) the discrete approximation to the learning curve, because it still correctly treats n as a variable with discrete, integer values. One can further approximate (9) by taking n as continuously varying, replacing the difference on the left-hand side by the derivative dG(n)/dn. The resulting differential equation for G(n) is readily solved; taking the trace, one obtains the generalization error \n\nε_UC(n) = tr (Λ^{-1} + σ^{-2}n'I)^{-1} (10) \n\nwith n' determined by the self-consistency equation n' + tr ln(I + σ^{-2}n'Λ) = n. By comparison with (7), n' can be thought of as an 'effective number of training examples'. The subscript UC in (10) stands for upper continuous approximation. As the name suggests, there is another, lower approximation also derived by treating n as continuous. It has the same form as (10), but a different self-consistency equation for n', and is derived as follows. 
Introduce an auxiliary offset parameter v (whose usefulness will become clear shortly) by 𝒢^{-1} = vI + Λ^{-1} + σ^{-2}Φ^TΦ; at the end of the calculation, v will be set to zero again. As before, start from (8), which also holds for nonzero v, and approximate ψψ^T and tr 𝒢 by their averages, but retain possible fluctuations of 𝒢 in the numerator. This gives G(n+1) − G(n) = − ⟨𝒢²(n)⟩ / [σ² + tr G(n)]. Taking the trace yields an update formula for the generalization error ε, where the extra parameter v lets us rewrite the average on the right-hand side as −tr ⟨𝒢²⟩ = (∂/∂v) tr ⟨𝒢⟩ = ∂ε/∂v. Treating n again as continuous, we thus arrive at the partial differential equation ∂ε/∂n = (∂ε/∂v)/(σ² + ε). This can be solved using the method of characteristics [8] and (for v = 0) gives the lower continuous approximation to the learning curve, \n\nε_LC(n) = tr (Λ^{-1} + σ^{-2}n'I)^{-1}, n' = nσ²/(σ² + ε_LC). (11) \n\nBy comparing derivatives w.r.t. n, it is easy to show that this is always lower than the UC approximation (10). One can also check that all three approximations that I have derived (D, LC and UC) converge to the exact result (7) in the large noise limit as defined above. \n\n3 COMPARISON WITH BOUNDS AND SIMULATIONS \n\nI now compare the D, LC and UC approximations with existing bounds, and with the 'true' learning curves as obtained by simulations. A lower bound on the generalization error was given by Michelli and Wahba [2] as \n\nε(n) ≥ ε_MW(n) = Σ_{i=n+1}^∞ λ_i. (12) \n\nThis is derived for the noiseless case by allowing 'generalized observations' (projections of θ*(x) along the first n eigenfunctions of C(x, x')), and so is unlikely to be tight for the case of 'real' observations at discrete input points. 
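For any given eigenvalue spectrum, (12) costs only a partial sum to evaluate. A minimal sketch (assuming numpy; the power-law spectrum here is a made-up stand-in for the true eigenvalues of C(x, x')):

```python
import numpy as np

# Hypothetical spectrum lambda_i = i^{-2}, already sorted in decreasing order;
# it stands in for the eigenvalues of the covariance function C(x, x').
lam = 1.0 / np.arange(1, 201) ** 2

def eps_mw(n):
    # Michelli-Wahba lower bound (12): sum of the eigenvalues beyond the first n
    return lam[n:].sum()

for n in (0, 10, 100):
    print(n, eps_mw(n))
```

At n = 0 the bound equals the prior uncertainty tr Λ, and it decays at a rate set purely by how fast the spectrum falls off, independently of the noise level.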
Based on information theoretic methods, a different lower bound was obtained by Opper [3]: \n\nε(n) ≥ ε_LO(n) = (1/4) tr (Λ^{-1} + 2σ^{-2}nI)^{-1} [I + (I + 2σ^{-2}nΛ)^{-1}]. \n\nThis is always lower than the naive approximation (7); both incorrectly suggest that ε decreases to zero for σ² → 0 at fixed n, which is clearly not the case (compare (12)). There is also an upper bound due to Opper [3], \n\nε̃(n) ≤ ε_UO(n) = (σ^{-2}n)^{-1} tr ln(I + σ^{-2}nΛ) + tr (Λ^{-1} + σ^{-2}nI)^{-1}. (13) \n\nHere ε̃ is a modified version of ε which (in the rescaled version that I am using) becomes identical to ε in the limit of small generalization errors (ε ≪ σ²), but never gets larger than 2σ²; for small n in particular, ε(n) can therefore actually be much larger than ε̃(n) and its bound (13). An upper bound on ε(n) itself was derived by Williams and Vivarelli [4] for one-dimensional inputs and stationary covariance functions (for which C(x, x') is a function of x − x' alone). They considered the generalization error at x that would be obtained from each individual training example, and then took the minimum over all n examples; the training set average of this 'lower envelope' can be evaluated explicitly in terms of integrals over the covariance function [4]. The resulting upper bound, ε_WV(n), never decays below σ² and therefore complements the range of applicability of the UO bound (13). \n\nIn the examples in Fig. 1, I consider a very simple input domain, x ∈ [0,1]^d, with a uniform input distribution. I also restrict myself to stationary covariance functions, and in fact I use what physicists call periodic boundary conditions. 
This is simply a trick that makes it easy to calculate the required eigenvalue spectra of the covariance function, but otherwise has little effect on the results as long as the length scale of the covariance function is smaller than the size of the input domain², l ≪ 1. To cover the two extremes of 'rough' and 'smooth' Gaussian priors, I consider the OU [C(x,x') = exp(−|x − x'|/l)] and SE [C(x,x') = exp(−|x − x'|²/(2l²))] covariance functions. The prior variance of the values of the function to be learned is simply C(x,x) = 1; one generically expects this 'prior ignorance' to be significantly larger than the noise on the training data, so I only consider values of σ² < 1. I also fix the covariance function length scale to l = 0.1; results for l = 0.01 are qualitatively similar. \n\n²In d = 1 dimension, for example, a 'periodically continued' stationary covariance function on [0,1] can be written as C(x,x') = Σ_{r=−∞}^{∞} c(x − x' + r). For l ≪ 1, only the r = 0 term makes a significant contribution, except when x and x' are within ≈ l of opposite ends of the input space. With this definition, the eigenvalues of C(x,x') are given by the Fourier transform ∫_{−∞}^{∞} dx c(x) exp(−2πiqx), for integer q. \n\n[Figure 1 here; panels: (a) OU, d=1, σ²=10^{-3}; (b) SE, d=1, σ²=10^{-3}; (c) OU, d=1, σ²=0.1; (d) SE, d=1, σ²=0.1; (e) OU, d=2, σ²=10^{-3}; (f) SE, d=2, σ²=10^{-3}; all with l=0.1.] \n\nFigure 1: Learning curves ε(n): comparison of simulation results (thick solid lines; the small fluctuations indicate the order of magnitude of error bars), approximations derived in this paper (thin solid lines; D = discrete, UC/LC = upper/lower continuous), and existing upper (dashed; UO = upper Opper, WV = Williams-Vivarelli) and lower (dot-dashed; LO = lower Opper, MW = Michelli-Wahba) bounds. The type of covariance function (Ornstein-Uhlenbeck/Squared Exponential), its length scale l, the dimension d of the input space, and the noise level σ² are as shown. Note the logarithmic y-axes. On the scale of the plots, D and UC coincide (except in (b)); the simulation results are essentially on top of the LC curve in (c-e). \n\nSeveral observations can be made from Figure 1. (1) The MW lower bound is not tight, as expected. (2) The bracket between Opper's lower and upper bounds (LO/UO) is rather wide (1-2 orders of magnitude); both give good representations of the overall shape of the learning curve only in the asymptotic regime (most clearly visible for the SE covariance function), i.e., once ε has dropped below σ². (3) The WV upper bound (available only in d = 1) works
well for the OU covariance function, but less so for the SE case. As expected, it is not useful in the asymptotic regime because it always remains above σ². (4) The discrete (D) and upper continuous (UC) approximations are very similar, and in fact indistinguishable on the scale of most plots. This makes the UC version preferable in practice, because it can be evaluated for any chosen n without having to step through all smaller values of n. (5) In all the examples, the true learning curve lies between the UC and LC curves. In fact I would conjecture that these two approximations provide upper and lower bounds on the learning curves, at least for stationary covariance functions. (6) Finally, the LC approximation comes out as the clear winner: for σ² = 0.1 (Fig. 1c,d), it is indistinguishable from the true learning curves. But even in the other cases it represents the overall shape of the learning curves very well, both for small n and in the asymptotic regime; the largest deviations occur in the crossover region between these two regimes. \n\nIn summary, I have derived an exact representation of the average generalization error ε of Gaussian processes used for regression, in terms of the eigenvalue decomposition of the covariance function. Starting from this, I have obtained three different approximations to the learning curve ε(n). All of them become exact in the large noise limit; in practice, one generically expects the opposite case (σ²/C(x,x) ≪ 1), but comparison with simulation results shows that even in this regime the new approximations perform well. The LC approximation in particular represents the overall shape of the learning curves very well, both for 'rough' (OU) and 'smooth' (SE) Gaussian priors, and for small as well as for large numbers of training examples n. 
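Since (11) defines n' only implicitly, evaluating the LC approximation requires a small self-consistent calculation; a simple fixed-point iteration suffices. The sketch below (assuming numpy, with a made-up power-law spectrum standing in for the true eigenvalues) is a minimal illustration, not the code used for the figures:

```python
import numpy as np

def eps_lc(n, lam, sigma2, tol=1e-12, max_iter=100000):
    """LC approximation (11): eps = sum_i (1/lambda_i + n'/sigma^2)^{-1},
    with the effective number of examples n' = n sigma^2 / (sigma^2 + eps)."""
    eps = lam.sum()                        # start from the n = 0 value, tr Lambda
    for _ in range(max_iter):
        # note n'/sigma^2 = n / (sigma^2 + eps)
        new = np.sum(1.0 / (1.0 / lam + n / (sigma2 + eps)))
        if abs(new - eps) < tol:
            break
        eps = new
    return eps

lam = 1.0 / np.arange(1, 201) ** 2         # hypothetical eigenvalue spectrum
for n in (0, 10, 100, 1000):
    print(n, eps_lc(n, lam, sigma2=0.1))
```

Starting from the n = 0 value tr Λ, the iterates decrease monotonically towards the fixed point, so the iteration converges here.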
It is not perfect, but does get substantially closer to the true learning curves than existing bounds. Future work will have to show how well the new approximations work for non-stationary covariance functions and/or non-uniform input distributions, and whether the treatment of fluctuations in the generalization error (due to the random selection of training sets) can be improved, by analogy with fluctuation corrections in linear perceptron learning [8]. \n\nAcknowledgements: I would like to thank Chris Williams and Manfred Opper for stimulating discussions, and for providing me with copies of their papers [3,4] prior to publication. I am grateful to the Royal Society for financial support through a Dorothy Hodgkin Research Fellowship. \n\n[1] See e.g. D J C MacKay, Gaussian Processes, tutorial at NIPS 10, and recent papers by Goldberg/Williams/Bishop (in NIPS 10), Williams and Barber/Williams (NIPS 9), Williams/Rasmussen (NIPS 8). \n\n[2] C A Michelli and G Wahba. Design problems for optimal surface interpolation. In Z Ziegler, editor, Approximation Theory and Applications, pages 329-348. Academic Press, 1981. \n\n[3] M Opper. Regression with Gaussian processes: Average case performance. In K-Y M Wong, I King, and D-Y Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective. Springer, 1997. \n\n[4] C K I Williams and F Vivarelli. An upper bound on the learning curve for Gaussian processes. Submitted for publication. \n\n[5] C K I Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M I Jordan, editor, Learning and Inference in Graphical Models. Kluwer Academic. In press. \n\n[6] E Wong. Stochastic Processes in Information and Dynamical Systems. McGraw-Hill, New York, 1971. \n\n[7] W H Press, S A Teukolsky, W T Vetterling, and B P Flannery. Numerical Recipes in C (2nd ed.). Cambridge University Press, Cambridge, 1992.
 \n\n[8] P Sollich. Finite-size effects in learning and generalization in linear perceptrons. Journal of Physics A, 27:7771-7784, 1994. \n", "award": [], "sourceid": 1501, "authors": [{"given_name": "Peter", "family_name": "Sollich", "institution": null}]}