{"title": "Gaussian Process Regression with Mismatched Models", "book": "Advances in Neural Information Processing Systems", "page_first": 519, "page_last": 526, "abstract": null, "full_text": "Gaussian  Process  Regression with \n\nMismatched  Models \n\nDepartment of Mathematics, King's  College London \n\nStrand, London WC2R 2LS, U.K.  Email peter.sollich@kcl.ac . uk \n\nPeter  Sollich \n\nAbstract \n\nLearning curves for Gaussian process regression are well understood \nwhen the 'student' model happens to match the 'teacher' (true data \ngeneration process).  I derive approximations to the learning curves \nfor  the more generic case of mismatched models, and find  very rich \nbehaviour:  For large input space dimensionality,  where the results \nbecome  exact,  there  are  universal  (student-independent)  plateaux \nin  the learning curve, with transitions in between  that can exhibit \narbitrarily  many  over-fitting  maxima;  over-fitting  can  occur  even \nif the student estimates  the teacher  noise level  correctly.  In lower \ndimensions,  plateaux also  appear,  and the learning curve remains \ndependent  on  the  mismatch  between  student  and teacher even  in \nthe asymptotic limit of a large number of training examples.  Learn(cid:173)\ning with excessively strong smoothness assumptions can be partic(cid:173)\nularly  dangerous:  For  example,  a  student  with  a  standard radial \nbasis function covariance function will learn a rougher teacher func(cid:173)\ntion only  logarithmically slowly.  All  predictions  are  confirmed  by \nsimulations. \n\n1 \n\nIntroduction \n\nThere  has  in  the  last  few  years  been  a  good  deal  of  excitement  about  the  use \nof  Gaussian  processes  (GPs)  as  an  alternative  to  feedforward  networks  [1].  GPs \nmake  prior  assumptions  about  the  problem  to  be  learned  very  transparent,  and \neven  though  they  are  non-parametric  models,  inference- at  least  in  the  case  of \nregression  considered  below-\nis  relatively  straightforward.  One  crucial  question \nfor  applications  is  then how  'fast'  GPs learn, i.e.  how  many training examples  are \nneeded  to  achieve  a  certain  level  of generalization  performance.  The  typical  (as \nopposed to worst case)  behaviour is  captured in the learning  curve, which gives the \naverage generalization error  t  as  a  function  of the number  of training examples  n. \nGood bounds and approximations for t(n) are now available [1,  2, 3, 4, 5],  but these \nare mostly restricted to the case where the 'student' model exactly matches the true \n'teacher'  generating the  datal.  In  practice,  such  a  match  is  unlikely,  and  so  it  is \n\nlThe exception is  the elegant  work  of Malzahn  and Opper  [2],  which  uses  a  statistical \nphysics  framework  to  derive  approximate  learning  curves  that  also  apply  for  any  fixed \ntarget function.  However,  this framework  has  not  yet to my knowledge  been  exploited to \n\n\fimportant to understand how  GPs learn if there is  some  model  mismatch.  This is \nthe aim of this paper. \nIn its simplest form, the regression problem is this:  We are trying to learn a function \nB*  which  maps inputs x  (real-valued vectors)  to  (real-valued  scalar)  outputs B*(x). \nWe  are  given  a  set  of training  data D , consisting of n  input-output  pairs  (xl, yl) ; \nthe  training  outputs  yl  may  differ  from  the  'clean'  teacher  outputs  B*(xl )  due  to \ncorruption  by  noise.  
Given a test input x, we are then asked to come up with a prediction θ̂(x), plus error bar, for the corresponding output θ*(x). In a Bayesian setting, we do this by specifying a prior P(θ) over hypothesis functions, and a likelihood P(D|θ) with which each θ could have generated the training data; from this we deduce the posterior distribution P(θ|D) ∝ P(D|θ)P(θ). For a GP, the prior is defined directly over input-output functions θ; this is simpler than for a Bayesian feedforward net since no weights are involved which would have to be integrated out. Any θ is uniquely determined by its output values θ(x) for all x from the input domain, and for a GP, these are assumed to have a joint Gaussian distribution (hence the name). If we set the means to zero as is commonly done, this distribution is fully specified by the covariance function ⟨θ(x)θ(x')⟩_θ = C(x, x'). The latter transparently encodes prior assumptions about the function to be learned. Smoothness, for example, is controlled by the behaviour of C(x, x') for x' → x: The Ornstein-Uhlenbeck (OU) covariance function C(x, x') = exp(−|x − x'|/l) produces very rough (non-differentiable) functions, while functions sampled from the radial basis function (RBF) prior with C(x, x') = exp[−|x − x'|²/(2l²)] are infinitely differentiable. Here l is a lengthscale parameter, corresponding directly to the distance in input space over which we expect significant variation in the function values.

There are good reviews on how inference with GPs works [1, 6], so I only give a brief summary here. The student assumes that outputs y are generated from the 'clean' values of a hypothesis function θ(x) by adding Gaussian noise of x-independent variance σ². The joint distribution of a set of training outputs {y_l} and the function values θ(x) is then also Gaussian, with covariances given (under the student model) by

⟨y_l y_m⟩ = C(x_l, x_m) + σ² δ_lm = (K)_lm,    ⟨y_l θ(x)⟩ = C(x_l, x) = (k(x))_l

Here I have defined an n × n matrix K and an x-dependent n-component vector k(x). The posterior distribution P(θ|D) is then obtained by conditioning on the {y_l}; it is again Gaussian and has mean and variance

⟨θ(x)⟩_{θ|D} ≡ θ̂(x|D) = k(x)^T K^{-1} y        (1)

⟨[θ(x) − θ̂(x|D)]²⟩_{θ|D} = C(x, x) − k(x)^T K^{-1} k(x)        (2)

From the student's point of view, this solves the inference problem: The best prediction for θ(x) on the basis of the data D is θ̂(x|D), with a (squared) error bar given by (2).
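For concreteness, a minimal numpy sketch of how (1,2) are evaluated follows; the RBF covariance, lengthscale, noise level and toy target used here are arbitrary illustrative choices, not the settings of the simulations reported below.

import numpy as np

# Toy RBF student covariance; lengthscale and noise level are arbitrary choices.
def rbf_cov(x1, x2, ell=0.3):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
n, sigma2 = 20, 0.01
X = rng.uniform(0.0, 1.0, n)                                           # training inputs x_l
y = np.sin(2 * np.pi * X) + np.sqrt(sigma2) * rng.standard_normal(n)   # noisy outputs y_l

K = rbf_cov(X, X) + sigma2 * np.eye(n)       # (K)_lm = C(x_l, x_m) + sigma^2 delta_lm
x_test = np.linspace(0.0, 1.0, 200)
k = rbf_cov(X, x_test)                       # (k(x))_l = C(x_l, x), one column per test point

mean = k.T @ np.linalg.solve(K, y)           # eq. (1): posterior mean k(x)^T K^{-1} y
var = 1.0 - np.einsum('ij,ij->j', k, np.linalg.solve(K, k))
                                             # eq. (2), with C(x,x) = 1 for this covariance
print(mean[:3], var[:3])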
The squared deviation between the prediction and the teacher is [θ̂(x|D) − θ*(x)]²; the average generalization error (which, as a function of n, defines the learning curve) is obtained by averaging this over the posterior distribution of teachers, all datasets, and the test input x:

ε = ⟨⟨⟨[θ̂(x|D) − θ*(x)]²⟩_{θ*|D}⟩_D⟩_x        (3)

Now of course the student does not know the true posterior of the teacher; to estimate ε, she must assume that it is identical to the student posterior, giving from (2)

ε̂ = ⟨⟨⟨[θ̂(x|D) − θ(x)]²⟩_{θ|D}⟩_D⟩_x = ⟨⟨C(x, x) − k(x)^T K^{-1} k(x)⟩_{{x_l}}⟩_x        (4)

where in the last expression I have replaced the average over D by one over the training inputs since the outputs no longer appear. If the student model matches the true teacher model, ε and ε̂ coincide and give the Bayes error, i.e. the best achievable (average) generalization performance for the given teacher.

I assume in what follows that the teacher is also a GP, but with a possibly different covariance function C*(x, x') and noise level σ*². This allows eq. (3) for ε to be simplified, since by exact analogy with the argument for the student posterior

⟨θ*(x)⟩_{θ*|D} = k*(x)^T K*^{-1} y,    ⟨θ*²(x)⟩_{θ*|D} = ⟨θ*(x)⟩²_{θ*|D} + C*(x, x) − k*(x)^T K*^{-1} k*(x)

and thus, abbreviating a(x) = K^{-1} k(x) − K*^{-1} k*(x),

ε = ⟨⟨a(x)^T y y^T a(x) + C*(x, x) − k*(x)^T K*^{-1} k*(x)⟩_D⟩_x

Conditional on the training inputs, the training outputs have a Gaussian distribution given by the true (teacher) model; hence ⟨y y^T⟩_{{y_l}|{x_l}} = K*, giving

ε = ⟨⟨C*(x, x) − 2 k*(x)^T K^{-1} k(x) + k(x)^T K^{-1} K* K^{-1} k(x)⟩_{{x_l}}⟩_x        (5)

2  Calculating the learning curves

An exact calculation of the learning curve ε(n) is difficult because of the joint average in (5) over the training inputs {x_l} and the test input x. A more convenient starting point is obtained if (using Mercer's theorem) we decompose the covariance function into its eigenfunctions φ_i(x) and eigenvalues λ_i, defined w.r.t. the input distribution so that ⟨C(x, x')φ_i(x')⟩_{x'} = λ_i φ_i(x) with the corresponding normalization ⟨φ_i(x)φ_j(x)⟩_x = δ_ij. Then

C(x, x') = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(x'),    C*(x, x') = Σ_{i=1}^∞ λ*_i φ_i(x) φ_i(x')

For simplicity I assume here that the student and teacher covariance functions have the same eigenfunctions (but different eigenvalues). This is not as restrictive as it may seem; several examples are given below.
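In practice the eigenvalues of a given covariance function w.r.t. a given input distribution can be estimated numerically. The following small sketch (with an arbitrary OU covariance, a uniform input distribution on [0,1] and an arbitrary sample size; it is an illustration, not the procedure used for the figures below) does this with a simple Nystrom-type estimate, and also checks the sum rule Σ_i λ_i = ⟨C(x, x)⟩_x used later on.

import numpy as np

# Nystrom-style estimate of the eigenvalues lambda_i of a covariance function
# w.r.t. the input distribution (uniform on [0,1] here); covariance choice and
# sample size are arbitrary.
rng = np.random.default_rng(1)
m = 1000
x = rng.uniform(0.0, 1.0, m)                        # sample from the input distribution

def ou_cov(a, b, ell=0.1):                          # Ornstein-Uhlenbeck covariance
    return np.exp(-np.abs(a[:, None] - b[None, :]) / ell)

lam = np.sort(np.linalg.eigvalsh(ou_cov(x, x) / m))[::-1]   # estimated lambda_i
print(lam[:5])                                      # leading eigenvalues
print(lam.sum())                                    # ~ <C(x,x)>_x = 1, i.e. sum_i lambda_i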
The averages over the test input x in (5) are now easily carried out: E.g. for the last term we need

⟨(k(x) k(x)^T)_lm⟩_x = Σ_ij λ_i λ_j φ_i(x_l) ⟨φ_i(x)φ_j(x)⟩_x φ_j(x_m) = Σ_i λ_i² φ_i(x_l) φ_i(x_m)

Introducing the diagonal eigenvalue matrix (Λ)_ij = λ_i δ_ij and the 'design matrix' (Φ)_li = φ_i(x_l), this reads ⟨k(x) k(x)^T⟩_x = ΦΛ²Φ^T. Similarly, for the second term in (5), ⟨k(x) k*(x)^T⟩_x = ΦΛΛ*Φ^T, and ⟨C*(x, x)⟩_x = tr Λ*. This gives, dropping the training inputs subscript from the remaining average,

ε = ⟨tr Λ* − 2 tr ΦΛΛ*Φ^T K^{-1} + tr ΦΛ²Φ^T K^{-1} K* K^{-1}⟩        (6)

In this new representation we also have K = σ²I + ΦΛΦ^T and similarly for K*; for the inverse of K we can use the Woodbury formula to write K^{-1} = σ^{-2}[I − σ^{-2}Φ𝒢Φ^T], where 𝒢 = (Λ^{-1} + σ^{-2}Φ^TΦ)^{-1}. Inserting these results, one finds after some algebra that

ε = σ*² σ^{-2} [⟨tr 𝒢⟩ − ⟨tr 𝒢Λ^{-1}𝒢⟩] + ⟨tr 𝒢Λ*Λ^{-2}𝒢⟩        (7)

which for the matched case reduces to the known result for the Bayes error [4]

ε̂ = ⟨tr 𝒢⟩        (8)

Eqs. (7,8) are still exact. We now need to tackle the remaining averages over training inputs. Two of these are of the form ⟨tr 𝒢M𝒢⟩; if we generalize the definition of 𝒢 to 𝒢 = (Λ^{-1} + vI + wM + σ^{-2}Φ^TΦ)^{-1} and define g = ⟨tr 𝒢⟩, then they reduce to ⟨tr 𝒢M𝒢⟩ = −∂g/∂w. (The derivative is taken at v = w = 0; the idea behind introducing v will become clear shortly.) So it is sufficient to calculate g. To do this, consider how 𝒢 changes when a new example is added to the training set. One has

𝒢(n+1) − 𝒢(n) = [𝒢^{-1}(n) + σ^{-2}ψψ^T]^{-1} − 𝒢(n) = − 𝒢(n)ψψ^T𝒢(n) / [σ² + ψ^T𝒢(n)ψ]        (9)

in terms of the vector ψ with elements (ψ)_i = φ_i(x_{n+1}), using again the Woodbury formula. To obtain the change in g we need the average of (9) over both the new training input x_{n+1} and all previous ones. This cannot be done exactly, but we can approximate by averaging numerator and denominator separately; taking the trace then gives g(n+1) − g(n) = −⟨tr 𝒢²(n)⟩/[σ² + g(n)]. Now, using our auxiliary parameter v, −⟨tr 𝒢²⟩ = ∂g/∂v; if we also approximate n as continuous, we get the simple partial differential equation ∂g/∂n = (∂g/∂v)/(σ² + g) with the initial condition g|_{n=0} = tr (Λ^{-1} + vI + wM)^{-1}. Solving this using the method of characteristics [7] gives a self-consistent equation for g,

g = tr [Λ^{-1} + (v + n/(σ² + g)) I + wM]^{-1}        (10)

The Bayes error (8) is ε̂ = g|_{v=w=0} and therefore obeys

ε̂ = tr G,    G^{-1} = Λ^{-1} + n/(σ² + ε̂) I        (11)

within our approximation (called 'LC' in [4]). To obtain ε, we differentiate both sides of (10) w.r.t. w, set v = w = 0 and rearrange to give

⟨tr 𝒢M𝒢⟩ = −∂g/∂w = tr MG² / [1 − n tr G²/(σ² + ε̂)²]

Using this result in (7), with M = Λ^{-1} and M = Λ^{-1}Λ*Λ^{-1}, we find after some further simplifications the final (approximate) result for the learning curve:

ε = ε̂ [σ*² tr G² + n^{-1}(σ² + ε̂)² tr Λ*Λ^{-2}G²] / [σ² tr G² + n^{-1}(σ² + ε̂)² tr Λ^{-1}G²]        (12)

which transparently shows how in the matched case ε and ε̂ become identical.
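A minimal numerical sketch of how (11,12) can be evaluated for diagonal spectra follows; the power-law student and teacher spectra, their truncation and the noise levels below are arbitrary illustrative choices, not values used in the paper.

import numpy as np

# Fixed-point evaluation of eqs. (11,12). Spectra and noise levels are arbitrary.
i = np.arange(1, 2001)
lam = i ** -2.0
lam /= lam.sum()                        # student eigenvalues, normalised to tr Lambda = 1
lam_t = i ** -4.0
lam_t /= lam_t.sum()                    # teacher eigenvalues
sigma2, sigma2_t = 1e-3, 1e-1           # student / teacher noise levels

def learning_curve(n):
    eps_hat = lam.sum()                 # starting guess for the Bayes error, eq. (11)
    for _ in range(200):                # iterate eps_hat = tr G to convergence
        G = 1.0 / (1.0 / lam + n / (sigma2 + eps_hat))
        eps_hat = G.sum()
    G2 = G ** 2
    num = sigma2_t * G2.sum() + (sigma2 + eps_hat) ** 2 / n * (lam_t / lam ** 2 * G2).sum()
    den = sigma2 * G2.sum() + (sigma2 + eps_hat) ** 2 / n * (G2 / lam).sum()
    return eps_hat, eps_hat * num / den  # (Bayes error, true error from eq. (12))

for n in [10, 100, 1000]:
    print(n, learning_curve(n))

For matched spectra and noise levels the two returned values coincide, as the matched-case remark above requires.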
3  Examples

I now apply the result for the learning curve (11,12) to some exemplary learning scenarios. First, consider inputs x which are binary vectors² with d components x_a ∈ {−1, 1}, and assume that the input distribution is uniform. We consider covariance functions for student and teacher which depend on the product x·x' only; this includes the standard choices (e.g. OU and RBF) which depend on the Euclidean distance |x − x'|, since |x − x'|² = 2d − 2x·x'. All these have the same eigenfunctions [9], so our above assumption is satisfied. The eigenfunctions are indexed by subsets ρ of {1, 2, ..., d} and given explicitly by φ_ρ(x) = ∏_{a∈ρ} x_a.

²This scenario may seem strange, but simplifies the determination of the eigenfunctions and eigenvalues. For large d, one expects other distributions with continuously varying x and the same first- and second-order statistics (⟨x_a⟩ = 0, ⟨x_a x_b⟩ = δ_ab) to give similar results [8].

The corresponding eigenvalues depend only on the size s = |ρ| of the subsets and are therefore (d choose s)-fold degenerate; letting e = (1, 1, ..., 1) be the 'all ones' input vector, they have the values λ_s = ⟨C(x, e)φ_ρ(x)⟩_x (which can easily be evaluated as an average over two binomially distributed variables, counting the number of +1's in x overall and among the x_a with a ∈ ρ). With the λ_s and λ*_s determined, it is then a simple matter to evaluate the predicted learning curve (11,12) numerically; a short sketch of this eigenvalue computation is given below. First, though, focus on the limit of large d, where much more can be said. If we write C(x, x') = f(x·x'/d), the eigenvalues become, for d → ∞, λ_s = d^{-s} f^{(s)}(0) and the contribution to C(x, x) = f(1) from the s-th eigenvalue block is Λ_s ≡ (d choose s) λ_s → f^{(s)}(0)/s!, consistent with f(1) = Σ_{s≥0} f^{(s)}(0)/s!. The λ_s, because of their scaling with d, become infinitely separated for d → ∞. For training sets of size n = O(d^L), we then see from (11) that eigenvalues with s > L contribute as if n = 0, since λ_s^{-1} ≫ n/(σ² + ε̂); they have effectively not yet been learned. On the other hand, eigenvalues with s < L are completely suppressed and have been learnt perfectly. We thus have a hierarchical learning scenario, where different scalings of n with d, as defined by L, correspond to different 'learning stages'. Formally, we can analyse the stages separately by letting d → ∞ at a constant ratio α = n/(d choose L) of the number of examples to the number of parameters to be learned at stage L (note (d choose L) = O(d^L) for large d). An independent (replica) calculation along the lines of Ref. [8] shows that our approximation for the learning curve actually becomes exact in this limit. The resulting α-dependence of ε can be determined explicitly: Set f_L = Σ_{s≥L} Λ_s (so that f_0 = f(1)) and similarly for f*_L. Then for large α,

ε = f*_{L+1} + (f*_{L+1} + σ*²) α^{-1} + O(α^{-2})        (13)

This implies that, during successive learning stages, (teacher) eigenvalues are learnt one by one and their contribution eliminated from the generalization error, giving plateaux in the learning curve at ε = f*_1, f*_2, .... These plateaux, as well as the asymptotic decay (13) towards them, are universal [8], i.e. student-independent.
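The following short sketch shows how the block eigenvalues Λ_s and the sums f_L can be computed numerically along the lines described above, via the average over two binomial variables; it is an illustration rather than the code used for the figures, with f chosen as the RBF form f(u) = exp(−(1 − u) d/l²), d = 10 and l = 2d^{1/2} (the student settings of Fig. 1). Applied to the teacher covariance, the same f_L values give the predicted plateau levels f*_L.

import numpy as np
from math import comb

# Binary-hypercube eigenvalues lambda_s = <C(x,e) phi_rho(x)>_x for C(x,x') = f(x.x'/d),
# evaluated as an average over two binomial variables. d and the lengthscale are
# arbitrary illustrative choices.
d = 10
ell2 = (2 * np.sqrt(d)) ** 2

def f(u):
    return np.exp(-(1.0 - u) * d / ell2)      # exp(-|x-x'|^2/(2 l^2)) with u = x.x'/d

def lambda_s(s):
    val = 0.0
    for k in range(s + 1):                    # number of +1's among components in rho
        for j in range(d - s + 1):            # number of +1's among the other components
            p = comb(s, k) * comb(d - s, j) / 2.0 ** d
            u = (2 * (k + j) - d) / d         # x.e / d
            val += p * (-1) ** (s - k) * f(u) # phi_rho(x) = (-1)^(s-k)
    return val

Lam = np.array([comb(d, s) * lambda_s(s) for s in range(d + 1)])   # block contributions Lambda_s
f_L = Lam[::-1].cumsum()[::-1]                # f_L = sum_{s>=L} Lambda_s
print(Lam.sum(), f(1.0))                      # check: sum of all blocks equals C(x,x) = f(1)
print(f_L[1:4])                               # for the teacher spectrum these would be the
                                              # plateau levels f*_1, f*_2, f*_3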
The (non-universal) behaviour for smaller α can also be fully characterized: Consider first the simple case of linear perceptron learning (see e.g. [7]), which corresponds to both student and teacher having simple dot-product covariance functions C(x, x') = C*(x, x') = x·x'/d. In this case there is only a single learning stage (only Λ_1 = Λ*_1 = 1 are nonzero), and ε = r(α) decays from r(0) = 1 to r(∞) = 0, with an over-fitting maximum around α = 1 if σ² is sufficiently small compared to σ*². In terms of this function r(α), the learning curve at stage L for general covariance functions is then exactly given by ε = f*_{L+1} + Λ*_L r(α) if in the evaluation of r(α) the effective noise levels σ̃² = (f_{L+1} + σ²)/Λ_L and σ̃*² = (f*_{L+1} + σ*²)/Λ*_L are used. Note how in σ̃*², the contribution f*_{L+1} from the not-yet-learned eigenvalues acts as effective noise, and is normalized by the amount of 'signal' Λ*_L = f*_L − f*_{L+1} available at learning stage L. The analogous definition of σ̃² implies that, for small σ² and depending on the choice of student covariance function, there can be arbitrarily many learning stages L where σ̃² ≪ σ̃*², and therefore arbitrarily many over-fitting maxima in the resulting learning curves. From the definitions of σ̃² and σ̃*² it is clear that this situation can occur even if the student knows the exact teacher noise level, i.e. even if σ² = σ*².

Fig. 1 (left) demonstrates that the above conclusions hold not just for d → ∞; even for the cases shown, with d = 10, up to three over-fitting maxima are apparent. Our theory provides a very good description of the numerically simulated learning curves even though, at such small d, the predictions are still significantly different from those for d → ∞ (see Fig. 1 (right)) and therefore not guaranteed to be exact.

Figure 1: Left: Learning curves for RBF student and teacher, with uniformly distributed, binary input vectors with d = 10 components. Noise levels: Teacher σ*² = 1, student σ² = 10^{-4}, 10^{-3}, ..., 1 (top to bottom). Length scales: Teacher l* = d^{1/2}, student l = 2d^{1/2}. Dashed: numerical simulations, solid: theoretical prediction. Right: Learning curves for σ² = 10^{-4} and increasing d (top to bottom: 10, 20, 30, 40, 60, 80, [bold] ∞). The x-axis shows α = n/(d choose L), for learning stages L = 1, 2, 3; the dashed lines are the universal asymptotes (13) for d → ∞.

In the second example scenario, I consider continuous-valued input vectors, uniformly distributed over the unit interval [0, 1]; generalization to d dimensions (x ∈ [0, 1]^d) is straightforward. For covariance functions which are stationary, i.e. dependent on x and x' only through x − x', and assuming periodic boundary conditions (see [4] for details), one then again has covariance function-independent eigenfunctions. They are indexed by integers³ q, with φ_q(x) = e^{2πiqx}; the corresponding eigenvalues are λ_q = ∫dx C(0, x) e^{−2πiqx}. For the ('periodified') RBF covariance function C(x, x') = exp[−(x − x')²/(2l²)], for example, one has λ_q ∝ exp(−q̃²/2), where q̃ = 2πlq. The OU case C(x, x') = exp(−|x − x'|/l), on the other hand, gives λ_q ∝ (1 + q̃²)^{-1}, thus λ_q ∝ q^{-2} for large q. I also consider below covariance functions which interpolate in smoothness between the OU and RBF limits: E.g. the MB2 (modified Bessel) covariance C(x, x') = e^{−a}(1 + a), with a = |x − x'|/l, yields functions which are once differentiable [5]; its eigenvalues λ_q ∝ (1 + q̃²)^{-2} show a faster asymptotic power law decay, λ_q ∝ q^{-4}, than those of the OU covariance function. To subsume all these cases I assume in the following analysis of the general shape of the learning curves that λ_q ∝ q^{-r} (and similarly λ*_q ∝ q^{-r*}). Here r = 2 for OU, r = 4 for MB2, and (due to the faster-than-power-law decay of its eigenvalues) effectively r = ∞ for RBF.

³Since λ_q = λ_{−q}, one can assume q ≥ 0 if all λ_q for q > 0 are taken as doubly degenerate.

From (11,12), it is clear that the n-dependence of the Bayes error ε̂ has a strong effect on the true generalization error ε. From previous work [4], we know that ε̂(n) has two regimes: For small n, where ε̂ ≫ σ², ε̂ is dominated by regions in input space which are too far from the training examples to have significant correlation with them, and one finds ε̂ ∝ n^{-(r-1)}. For much larger n, learning is essentially against noise, and one has a slower decay ε̂ ∝ (n/σ²)^{-(r-1)/r}. These power laws can be derived from (11) by approximating factors such as [λ_q^{-1} + n/(σ² + ε̂)]^{-1} as equal to either λ_q or to 0, depending on whether n/(σ² + ε̂) < or > λ_q^{-1}. With the same technique, one can estimate the behaviour of ε from (12). In the small-n regime, one finds ε ≈ c_1 σ*² + c_2 n^{-(r*-1)}, with prefactors c_1, c_2 depending on the student. Note that the contribution proportional to σ*² is automatically negligible in the matched case (since then ε = ε̂ ≫ σ² = σ*² for small n); if there is a model mismatch, however, and if the small-n regime extends far enough, it will become significant. This is the case for small σ²; indeed, for σ² → 0, the 'small n' criterion ε̂ ≫ σ² is satisfied for any n. Our theory thus predicts the appearance of plateaux in the learning curves, becoming more pronounced as σ² decreases; Fig. 2 (left) confirms this⁴. Numerical evaluation also shows that for small σ², over-fitting maxima may occur before the plateau is reached, consistent with simulations; see inset in Fig. 2 (right). In the large-n regime (ε̂ ≪ σ²), our theory predicts that the generalization error decays as a power law.

⁴If σ² = 0 exactly, the plateau will extend to n → ∞. With hindsight, this is clear: a GP with an infinite number of nonzero eigenvalues has no limit on the number of its 'degrees of freedom' and can fit perfectly any amount of noisy training data, without ever learning the true teacher function.

Figure 2: Learning curves for inputs x uniformly distributed over [0, 1]. Teacher: MB2 covariance function, lengthscale l* = 0.1, noise level σ*² = 0.1; student lengthscale l = 0.1 throughout. Dashed: simulations, solid: theory. Left: OU student with σ² as shown. The predicted plateau appears as σ² decreases. Right: Students with σ² = 0.1 and covariance function as shown; for clarity, the RBF and OU results have been multiplied by √10 and 10, respectively. Dash-dotted lines show the predicted asymptotic power laws for MB2 and OU; the RBF data have a persistent upward curvature consistent with the predicted logarithmic decay. Inset: RBF student with σ² = 10^{-3}, showing the occurrence of over-fitting maxima.
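The crossover between the two Bayes-error regimes described above can also be checked numerically. The following rough sketch (spectrum size, student noise level and the n values are arbitrary choices, not settings from the paper) iterates the self-consistent equation (11) for a power-law spectrum and prints the local log-log slopes of ε̂(n):

import numpy as np

# Two regimes of the Bayes error for lambda_q ~ q^{-r}: iterate eq. (11) and look
# at the local slopes d log(eps_hat) / d log(n). All parameter values are arbitrary.
r, sigma2 = 2, 1e-4                       # OU-like student, small assumed noise
q = np.arange(1.0, 1e5 + 1)
lam = q ** (-float(r))
lam /= lam.sum()

def eps_hat(n):
    e = lam.sum()
    for _ in range(300):                  # fixed point of eq. (11): e = tr G
        e = (1.0 / (1.0 / lam + n / (sigma2 + e))).sum()
    return e

ns = np.array([10.0, 30.0, 100.0, 3e4, 1e5, 3e5])
es = np.array([eps_hat(n) for n in ns])
print(np.diff(np.log(es)) / np.diff(np.log(ns)))
# expected: slope ~ -(r-1) = -1 while eps_hat >> sigma^2, moving towards
# -(r-1)/r = -0.5 once learning is essentially against noise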
If the student assumes a rougher function than the teacher provides (r < r*), the asymptotic power law exponent ε ∝ n^{-(r-1)/r} is determined by the student alone. In the converse case, the asymptotic decay is ε ∝ n^{-(r*-1)/r} and can be very slow, actually becoming logarithmic for an RBF student (r → ∞). For r = r*, the fastest decay for given r* is obtained, as expected from the properties of the Bayes error. The simulation data in Fig. 2 are compatible with these predictions (though the simulations cover too small a range of n to allow exponents to be determined precisely). It should be stressed that the above results imply that there is no asymptotic regime of large training sets in which the learning curve assumes a universal form, in contrast to the case of parametric models where the generalization error decays as ε ∝ 1/n for sufficiently large n independently of model mismatch (as long as the problem is learnable at all). This conclusion may seem counter-intuitive, but becomes clear if one remembers that a GP covariance function with an infinite number of nonzero eigenvalues λ_i always has arbitrarily many eigenvalues that are arbitrarily close to zero (since the λ_i are positive and Σ_i λ_i = ⟨C(x, x)⟩ is finite). Whatever n, there are therefore many eigenvalues for which λ_i^{-1} ≫ n/σ², corresponding to degrees of freedom which are still mainly determined by the prior rather than the data (compare (11)). In other words, a regime where the data completely overwhelms the mismatched prior (and where the learning curve could therefore become independent of model mismatch) can never be reached.

In summary, the above approximate theory makes a number of non-trivial predictions for GP learning with mismatched models, all borne out by simulations: for large input space dimensions, the occurrence of multiple over-fitting maxima; in lower dimensions, the generic presence of plateaux in the learning curve if the student assumes too small a noise level σ², and strong effects of model mismatch on the asymptotic learning curve decay. The behaviour is much richer than for the matched case, and could guide the choice of (student) priors in real-world applications of GP regression; RBF students, for example, run the risk of very slow logarithmic decay of the learning curve if the target (teacher) is less smooth than assumed.

An important issue for future work (some of which is in progress) is to analyse to which extent hyperparameter tuning (e.g. via evidence maximization) can make GP learning robust against some forms of model mismatch, e.g. a misspecified functional form of the covariance function. One would like to know, for example, whether a data-dependent adjustment of the lengthscale of an RBF covariance function would be sufficient to avoid the logarithmically slow learning of rough target functions.
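As a pointer to what such a data-dependent adjustment could look like in practice, here is a minimal sketch (a toy illustration, not an analysis from this paper) of maximizing the evidence, i.e. the log marginal likelihood −(1/2) y^T K^{-1} y − (1/2) log det K − (n/2) log 2π, over the lengthscale of an RBF student on a rough toy target; the target, the grid search and all parameter values are arbitrary choices.

import numpy as np

def rbf_cov(x1, x2, ell):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

def log_evidence(X, y, ell, sigma2):
    # log P(D) = -1/2 y^T K^{-1} y - 1/2 log det K - n/2 log(2 pi)
    K = rbf_cov(X, X, ell) + sigma2 * np.eye(len(X))
    sign, logdet = np.linalg.slogdet(K)
    return -0.5 * y @ np.linalg.solve(K, y) - 0.5 * logdet - 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(2)
n, sigma2 = 50, 0.1
X = rng.uniform(0.0, 1.0, n)
y = np.sign(np.sin(6 * np.pi * X)) + np.sqrt(sigma2) * rng.standard_normal(n)  # rough target

ells = np.logspace(-2, 0, 25)                      # candidate lengthscales
ev = np.array([log_evidence(X, y, l, sigma2) for l in ells])
print("evidence-maximizing lengthscale:", ells[np.argmax(ev)])

Whether such tuning is in fact sufficient to avoid the slow asymptotic decay discussed above is precisely the open question raised here.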
References

[1] See e.g. D. J. C. MacKay, Gaussian Processes, Tutorial at NIPS 10; recent papers by Csató et al. (NIPS 12), Goldberg/Williams/Bishop (NIPS 10), Williams and Barber/Williams (NIPS 9), Williams/Rasmussen (NIPS 8); and references below.

[2] D. Malzahn and M. Opper. In NIPS 13, pages 273-279; also in NIPS 14.

[3] C. A. Micchelli and G. Wahba. In Z. Ziegler, editor, Approximation Theory and Applications, pages 329-348. Academic Press, 1981; M. Opper. In K.-Y. M. Wong et al., editors, Theoretical Aspects of Neural Computation, pages 17-23. Springer, 1997.

[4] P. Sollich. In NIPS 11, pages 344-350.

[5] C. K. I. Williams and F. Vivarelli. Mach. Learn., 40:77-102, 2000.

[6] C. K. I. Williams. In M. I. Jordan, editor, Learning and Inference in Graphical Models, pages 599-621. Kluwer Academic, 1998.

[7] P. Sollich. J. Phys. A, 27:7771-7784, 1994.

[8] M. Opper and R. Urbanczik. Phys. Rev. Lett., 86:4410-4413, 2001.

[9] R. Dietrich, M. Opper, and H. Sompolinsky. Phys. Rev. Lett., 82:2975-2978, 1999.
", "award": [], "sourceid": 1987, "authors": [{"given_name": "Peter", "family_name": "Sollich", "institution": null}]}