{"title": "Mean Field Methods for Classification with Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 309, "page_last": 315, "abstract": null, "full_text": "Mean field methods for classification with Gaussian processes \n\nManfred Opper \n\nNeural Computing Research Group \n\nDivision of Electronic Engineering and Computer Science \n\nAston University, Birmingham B4 7ET, UK. \n\nopperm@aston.ac.uk \n\nOle Winther \n\nTheoretical Physics II, Lund University, S\u00f6lvegatan 14 A \n\nS-223 62 Lund, Sweden \n\nCONNECT, The Niels Bohr Institute, University of Copenhagen \n\nBlegdamsvej 17, 2100 Copenhagen \u00d8, Denmark \n\nwinther@thep.lu.se \n\nAbstract \n\nWe discuss the application of TAP mean field methods known from the Statistical Mechanics of disordered systems to Bayesian classification models with Gaussian processes. In contrast to previous approaches, no knowledge about the distribution of inputs is needed. Simulation results for the Sonar data set are given. \n\n1 Modeling with Gaussian Processes \n\nBayesian models which are based on Gaussian prior distributions on function spaces are promising non-parametric statistical tools. They have recently been introduced into the Neural Computation community (Neal 1996, Williams & Rasmussen 1996, Mackay 1997). To give their basic definition, we assume that the likelihood of the output or target variable $\\tau$ for a given input $s \\in R^N$ can be written in the form $p(\\tau | h(s))$, where $h : R^N \\to R$ is a priori assumed to be a Gaussian random field. If we assume fields with zero prior mean, the statistics of $h$ is entirely defined by the second order correlations $C(s, s') \\equiv E[h(s) h(s')]$, where $E$ denotes expectations with respect to the prior.  
Interesting examples are \n\n$C(s, s') = \\ldots$ (1) \n\n$C(s, s') = \\ldots$ (2) \n\nThe choice (1) can be motivated as a limit of a two-layered neural network with infinitely many hidden units with factorizable input-hidden weight priors (Williams 1997). The $W_i$ are hyperparameters determining the relevant prior lengthscales of $h(s)$. The simplest choice $C(s, s') = \\sum_i W_i s_i s'_i$ corresponds to a single layer perceptron with independent Gaussian weight priors. \n\nIn this Bayesian framework, one can make predictions on a novel input $s$ after having received a set $D_m$ of $m$ training examples $(\\tau^\\mu, s^\\mu)$, $\\mu = 1, \\ldots, m$, by using the posterior distribution of the field at the test point $s$, which is given by \n\n$$p(h(s) | D_m) = \\int p(h(s) | \\{h^\\nu\\}) \\, p(\\{h^\\nu\\} | D_m) \\prod_\\mu dh^\\mu . \\qquad (3)$$ \n\n$p(h(s) | \\{h^\\nu\\})$ is a conditional Gaussian distribution and \n\n$$p(\\{h^\\nu\\} | D_m) = \\frac{1}{Z} P(\\{h^\\nu\\}) \\prod_\\mu p(\\tau^\\mu | h^\\mu) \\qquad (4)$$ \n\nis the posterior distribution of the field variables at the training points. $Z$ is a normalizing partition function and \n\n$$P(\\{h^\\nu\\}) = \\frac{1}{\\sqrt{(2\\pi)^m \\det C}} \\exp\\left( -\\frac{1}{2} \\sum_{\\mu,\\nu} h^\\mu (C^{-1})_{\\mu\\nu} h^\\nu \\right) \\qquad (5)$$ \n\nis the prior distribution of the fields at the training points. Here, we have introduced the abbreviations $h^\\mu = h(s^\\mu)$ and $C_{\\mu\\nu} \\equiv C(s^\\mu, s^\\nu)$. \n\nThe major technical problem of this approach comes from the difficulty in performing the high dimensional integrations. Non-Gaussian likelihoods can only be treated by approximations, where e.g. Monte Carlo sampling (Neal 1997), Laplace integration (Barber & Williams 1997) or bounds on the likelihood (Gibbs & Mackay 1997) have been used so far. In this paper, we introduce a further approach, which is based on a mean field method known in the Statistical Physics of disordered systems (Mezard, Parisi & Virasoro 1987). 
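
A minimal sketch of the conditional Gaussian p(h(s) | {h^nu}) that enters the prediction integral, assuming the simplest covariance C(s, s') = sum_i W_i s_i s'_i named in the text. The function names, the small jitter term and the toy inputs are hypothetical and only serve as an illustration; the mean and variance follow from standard Gaussian conditioning on a zero-mean field.

```python
import numpy as np

def linear_covariance(S1, S2, W):
    # C(s, s') = sum_i W_i s_i s'_i for all pairs of rows of S1 and S2
    return (S1 * W) @ S2.T

def conditional_field(s_new, S_train, h_train, W, jitter=1e-10):
    # Mean and variance of h(s_new) given the training fields h_train,
    # for a zero-mean Gaussian prior with covariance C.
    # jitter is an assumed numerical safeguard, not part of the model.
    C = linear_covariance(S_train, S_train, W) + jitter * np.eye(len(S_train))
    k = linear_covariance(S_train, s_new[None, :], W)[:, 0]
    c_ss = linear_covariance(s_new[None, :], s_new[None, :], W)[0, 0]
    mean = k @ np.linalg.solve(C, h_train)
    var = c_ss - k @ np.linalg.solve(C, k)
    return mean, var

# Toy usage with two training inputs in three dimensions
S_train = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
h_train = np.array([1.0, -2.0])
W = np.ones(3)
mean, var = conditional_field(np.array([0.5, 0.5, 1.0]), S_train, h_train, W)
print(mean, var)
```

For non-Gaussian likelihoods the training fields are not known exactly, so in practice this conditional would be averaged over an approximation to their posterior.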
\n\nWe specialize to the case of a binary classification problem, where a binary class label $\\tau = \u00b11$ must be predicted using a training set corrupted by i.i.d. label noise. The likelihood for this problem is taken as \n\n$$p(\\tau | h) = \\kappa + (1 - 2\\kappa) \\, \\Theta(\\tau h) ,$$ \n\nwhere $\\kappa$ is the probability that the true classification label is corrupted, i.e. flipped, and the step function $\\Theta(x)$ is defined as $\\Theta(x) = 1$ for $x > 0$ and 0 otherwise. For such a case, we expect that (by the non-smoothness of the model) e.g. Laplace's method and the bounds introduced in (Gibbs & Mackay 1997) are not directly applicable. \n\n2 Exact posterior averages \n\nIn order to make a prediction on an input $s$, ideally the label with maximum posterior probability should be chosen, i.e. $\\tau^{Bayes} = \\mathrm{argmax}_\\tau \\, p(\\tau | D_m)$, where the predictive probability is given by $p(\\tau | D_m) = \\int dh \\, p(\\tau | h) \\, p(h | D_m)$. For the binary case the Bayes classifier becomes $\\tau^{Bayes} = \\mathrm{sign} \\langle \\mathrm{sign} \\, h(s) \\rangle$, where throughout the paper we let brackets $\\langle \\ldots \\rangle$ denote posterior averages. Here, we use a somewhat simpler approach by using the prediction \n\n$$\\tau = \\mathrm{sign}(\\langle h(s) \\rangle) .$$ \n\nThis would reduce to the ideal prediction when the posterior distribution of $h(s)$ is symmetric around its mean $\\langle h(s) \\rangle$. The goal of our mean field approach will be to provide a set of equations for approximately determining $\\langle h(s) \\rangle$. The starting point of our analysis is the partition function \n\n$$Z = \\int \\prod_\\mu \\frac{dx^\\mu \\, dh^\\mu}{2\\pi i} \\prod_\\mu p(\\tau^\\mu | h^\\mu) \\, e^{\\frac{1}{2} \\sum_{\\mu,\\nu} C_{\\mu\\nu} x^\\mu x^\\nu - \\sum_\\mu h^\\mu x^\\mu} , \\qquad (6)$$ \n\nwhere the new auxiliary variables $x^\\mu$ (integrated along the imaginary axis) have been introduced in order to get rid of $C^{-1}$ in (5). 
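
The label-noise likelihood and the simplified prediction rule tau = sign(<h(s)>) can be sketched as follows. This is an illustration under an assumption: the likelihood is taken to be kappa + (1 - 2 kappa) Theta(tau h), the form implied by the flip probability kappa and the step function Theta described in the text. The function names and the tie-breaking convention at zero are hypothetical.

```python
import numpy as np

def flip_noise_likelihood(tau, h, kappa):
    # Assumed form: p(tau | h) = kappa + (1 - 2 kappa) * Theta(tau * h),
    # with Theta(x) = 1 for x > 0 and 0 otherwise.
    step = np.where(tau * np.asarray(h, dtype=float) > 0, 1.0, 0.0)
    return kappa + (1.0 - 2.0 * kappa) * step

def predict_label(mean_field):
    # tau = sign(<h(s)>); a mean of exactly zero is mapped to +1,
    # which is an arbitrary convention chosen for this sketch.
    return np.where(np.asarray(mean_field) >= 0, 1, -1)

# With kappa = 0.25 the label that agrees with the sign of h has probability 0.75
print(flip_noise_likelihood(1, 0.5, 0.25), flip_noise_likelihood(-1, 0.5, 0.25))
print(predict_label(np.array([-0.3, 2.0])))
```

The two label probabilities sum to one for any h, and kappa = 0 recovers the noise-free step-function classifier.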
\n\nIt is not hard to show from (6) that the posterior averages of the fields at the $m$ training inputs and at a new test point $s$ are given by \n\n$$\\langle h^\\mu \\rangle = \\sum_\\nu C_{\\mu\\nu} \\langle x^\\nu \\rangle , \\qquad \\langle h(s) \\rangle = \\sum_\\nu C(s, s^\\nu) \\langle x^\\nu \\rangle . \\qquad (7)$$ \n\nWe have thus reduced our problem to the calculation of the \"microscopic order parameters\" $\\langle x^\\mu \\rangle$.\u00b9 Averages in Statistical Physics can be calculated from derivatives of $-\\ln Z$ with respect to small external fields, which are then set to zero. An equivalent formulation uses the Legendre transform of $-\\ln Z$ as a function of the expectations, which in our case is given by \n\n$$G(\\{\\langle x^\\mu \\rangle, \\langle (x^\\mu)^2 \\rangle\\}) = -\\ln Z(\\{\\gamma^\\mu, \\lambda^\\mu\\}) + \\sum_\\mu \\langle x^\\mu \\rangle \\gamma^\\mu + \\frac{1}{2} \\sum_\\mu \\lambda^\\mu \\langle (x^\\mu)^2 \\rangle , \\qquad (8)$$ \n\nwith \n\n$$Z(\\{\\gamma^\\mu, \\lambda^\\mu\\}) = \\int \\prod_\\mu \\frac{dx^\\mu \\, dh^\\mu}{2\\pi i} \\prod_\\mu p(\\tau^\\mu | h^\\mu) \\, e^{\\frac{1}{2} \\sum_{\\mu,\\nu} (\\lambda^\\mu \\delta_{\\mu\\nu} + C_{\\mu\\nu}) x^\\mu x^\\nu + \\sum_\\mu x^\\mu (\\gamma^\\mu - h^\\mu)} . \\qquad (9)$$ \n\nThe additional averages $\\langle (x^\\mu)^2 \\rangle$ have been introduced because the dynamical variables $x^\\mu$ (unlike Ising spins) do not have fixed length. The external fields $\\gamma^\\mu$, $\\lambda^\\mu$ must be eliminated from $\\partial G / \\partial \\gamma^\\mu = \\partial G / \\partial \\lambda^\\mu = 0$, and the true expectation values of $x^\\mu$ and $(x^\\mu)^2$ are those which satisfy $\\partial G / \\partial \\langle (x^\\mu)^2 \\rangle = \\partial G / \\partial \\langle x^\\mu \\rangle = 0$. \n\n3 Naive mean field theory \n\nSo far, this description does not give anything new. Usually $G$ cannot be calculated exactly for the non-Gaussian likelihood models of interest. Nevertheless, based on mean field theory (MFT) it is possible to guess an approximate form for $G$. \n\n\u00b9 Although the integrations are over the imaginary axis, these expectations come out positive. This is due to the fact that the integration \"measure\" is complex as well. \n\n\f312 \n\nM  Opper and 0.  
Winther \n\nMean field methods have found interesting applications in Neural Computing within the framework of ensemble learning, where the exact posterior distribution is approximated by a simpler one using product distributions in a variational treatment. Such a \"standard\" mean field method for the posterior of the $h^\\mu$ (for the case of Gaussian process classification) is in preparation and will be discussed elsewhere. In this paper, we suggest a different route, which introduces nontrivial corrections to a simple or \"naive\" MFT for the variables $x^\\mu$. Besides the variational method (which would be purely formal because the distribution of the $x^\\mu$ is complex and does not define a probability), there are other ways to define the simple MFT, e.g. by truncating a perturbation expansion with respect to the \"interactions\" $C_{\\mu\\nu}$ in $G$ after the first order (Plefka 1982). These approaches yield the result \n\n$$G \\approx G_{naive} = G_0 - \\frac{1}{2} \\sum_\\mu C_{\\mu\\mu} \\langle (x^\\mu)^2 \\rangle - \\frac{1}{2} \\sum_{\\mu, \\nu, \\nu \\neq \\mu} C_{\\mu\\nu} \\langle x^\\mu \\rangle \\langle x^\\nu \\rangle . \\qquad (10)$$ \n\n$G_0$ is the contribution to $G$ for a model without any interactions, i.e. when $C_{\\mu\\nu} = 0$ in (9); it is the Legendre transform of \n\n$$-\\ln Z_0 = -\\sum_\\mu \\ln \\left[ \\kappa + (1 - 2\\kappa) \\, \\Phi\\left( \\tau^\\mu \\frac{\\gamma^\\mu}{\\sqrt{\\lambda^\\mu}} \\right) \\right] ,$$ \n\nwhere $\\Phi(z) = \\int_{-\\infty}^{z} \\frac{dt}{\\sqrt{2\\pi}} e^{-t^2/2}$ is an error function. For simple models in Statistical Physics, where all interactions $C_{\\mu\\nu}$ are positive and equal, it is easy to show that $G_{naive}$ will become exact in the limit of an infinite number of variables $x^\\mu$. Hence, for systems with a large number of nonzero interactions having the same orders of magnitude, one may expect that the approximation is not too bad. \n\n4 The TAP approach \n\nNevertheless, when the interactions $C_{\\mu\\nu}$ can be both positive and negative (as one would expect e.g.  
when inputs have zero mean), even in the thermodynamic limit and for nice distributions of inputs, an additional contribution $\\Delta G$ must be added to the \"naive\" mean field theory (10). Such a correction (often called an Onsager reaction term) has been introduced for a spin glass model by (Thouless, Anderson & Palmer 1977) (TAP). It was later applied to the statistical mechanics of single layer perceptrons by (Mezard 1989) and then generalized to the Bayesian framework by (Opper & Winther 1996, 1997). For an application to multilayer networks, see (Wong 1995). In the thermodynamic limit of infinitely large dimension of the input space, and for nice input distributions, the results can be shown to coincide with the results of the replica framework. The drawback of the previous derivations of the TAP MFT for neural networks was the fact that special assumptions on the input distribution had been made and certain fluctuating terms had been replaced by their averages over the distribution of random data, which in practice would not be available. In this paper, we will use the approach of (Parisi & Potters 1995), which allows us to circumvent this problem. They concluded (applied to the case of a spin model with random interactions of a specific type) that the functional form of $\\Delta G$ should not depend on the type of the \"single particle\" contribution $G_0$. Hence, one may use any model in $G_0$ for which $G$ can be calculated exactly (e.g. the Gaussian regression model) and subtract the naive mean field contribution to obtain the desired $\\Delta G$. For the sake of simplicity, we have chosen the even simpler model $p(\\tau^\\mu | h^\\mu) \\sim \\delta(h^\\mu)$ without changing the final result.  
A lengthy but straightforward calculation for this problem leads to the result \n\n(11) \n\nwith $R_\\mu \\equiv \\langle (x^\\mu)^2 \\rangle - \\langle x^\\mu \\rangle^2$. The $\\lambda^\\mu$ must be eliminated using $\\partial \\Delta G / \\partial \\lambda^\\mu = 0$, which leads to the equation \n\n(12) \n\nNote that with this choice, the TAP mean field theory becomes exact for Gaussian likelihoods, i.e. for standard regression problems. \n\nFinally, setting the derivatives of $G_{TAP} = G_{naive} + \\Delta G$ with respect to the 4 variables $\\langle x^\\mu \\rangle$, $\\langle (x^\\mu)^2 \\rangle$, $\\gamma^\\mu$, $\\lambda^\\mu$ equal to zero, we obtain the equations \n\n(13) \n\nwhere $D(z) = e^{-z^2/2} / \\sqrt{2\\pi}$ is the Gaussian measure. These equations have to be solved numerically together with (12). In contrast, for the naive MFT, the simpler result $\\lambda^\\mu = C_{\\mu\\mu}$ is found. \n\n5 Simulations \n\nSolving the nonlinear system of equations (12, 13) by iteration turns out to be quite straightforward. For some data sets, to get convergence one has to add a diagonal term $v$ to the covariance matrix $C$: $C_{ij} \\to C_{ij} + \\delta_{ij} v$. It may be shown that this term corresponds to learning with Gaussian noise (with variance $v$) added to the Gaussian random field. \n\nHere, we present simulation results for a single data set, the Sonar - Mines versus Rocks, using the same training/test set split as in the original study by (Gorman & Sejnowski 1988). The input data were pre-processed by linear rescaling such that over the training set each input variable has zero mean and unit variance. In some cases the mean field equations failed to converge using the raw data. \n\nA further important feature of TAP MFT is the fact that the method also gives an approximate leave-one-out estimator for the generalization error, $\\epsilon_{loo}$, expressed in terms of the solution to the mean field equations (see (Opper & Winther 1996, 1997) for more details). 
It is also possible to derive a leave-one-out estimator for the naive MFT (Opper & Winther, to be published). \n\nSince we have not so far dealt with the problem of automatically estimating the hyperparameters, their number was drastically reduced by setting $W_i = \\sigma^2 / N$ in the covariances (1) and (2). The remaining hyperparameters $\\sigma^2$, $\\kappa$ and $v$ were chosen so as to minimize $\\epsilon_{loo}$. It turned out that the lowest $\\epsilon_{loo}$ was found from modeling without noise: $\\kappa = v = 0$. \n\nTable 1: The result for the Sonar data. \n\nAlgorithm | Covariance Function | $\\epsilon_{test}$ | $\\epsilon_{loo}$ | $\\epsilon_{loo}^{exact}$ \nTAP Mean Field | (1) | 0.183 | 0.260 | 0.260 \nTAP Mean Field | (2) | 0.077 | 0.212 | 0.212 \nNaive Mean Field | (1) | 0.154 | 0.269 | 0.269 \nNaive Mean Field | (2) | 0.077 | 0.221 | 0.221 \nBack-Prop | Simple Perceptron | 0.269 (\u00b10.048) | - | - \nBack-Prop | Best 2 layer - 12 Hidden | 0.096 (\u00b10.018) | - | - \n\nThe simulation results are shown in Table 1. The comparison for back-propagation is taken from (Gorman & Sejnowski 1988). The solution found by the algorithm turned out to be unique, i.e. different orders of presentation of the examples and different initial values for the $\\langle x^\\mu \\rangle$ converged to the same solution. \n\nIn Table 1, we have also compared the estimate given by the algorithm with the exact leave-one-out estimate $\\epsilon_{loo}^{exact}$, obtained by going through the training set, keeping one example out for testing and running the mean field algorithm on the rest. The estimate and exact value are in complete agreement. Comparing with the test error we see that the training set is 'hard' and the test set is 'easy'.  
The small difference in test error between the naive and full mean field algorithms also indicates that the mean field scheme is quite robust with respect to the choice of $\\lambda^\\mu$. \n\n6 Discussion \n\nMore work has to be done to make the TAP approach a practical tool for Bayesian modeling. One has to find better methods for solving the equations. A conversion into a direct minimization problem for a free energy may be helpful. To achieve this, one may probably work with the real field variables $h^\\mu$ instead of the imaginary $x^\\mu$. A further problem is the determination of the hyperparameters of the covariance functions. Two ways seem to be interesting here. First, one may use the approximate free energy $G$, which is essentially the negative logarithm of the Bayesian evidence, to estimate the most probable values of the hyperparameters. However, an estimate of the errors made in the TAP approach would be necessary. Second, one may use the built-in leave-one-out estimate to estimate the generalization error. Again, an estimate of the validity of the approximation is necessary. It will further be interesting to apply our way of deriving the TAP equations to other models (Boltzmann machines, belief nets, combinatorial optimization problems), for which standard mean field theories have been applied successfully. \n\nAcknowledgments \n\nThis research is supported by the Swedish Foundation for Strategic Research and by the Danish Research Councils for the Natural and Technical Sciences through the Danish Computational Neural Network Center (CONNECT). \n\nReferences \n\nD. Barber and C. K. I. Williams, Gaussian Processes for Bayesian Classification via Hybrid Monte Carlo, in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T.  
Petsche, eds., 340-346. MIT Press (1997). \n\nM. N. Gibbs and D. J. C. Mackay, Variational Gaussian Process Classifiers, Preprint Cambridge University (1997). \n\nR. P. Gorman and T. J. Sejnowski, Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets, Neural Networks 1, 75 (1988). \n\nD. J. C. Mackay, Gaussian Processes, A Replacement for Neural Networks, NIPS tutorial 1997. May be obtained from http://wol.ra.phy.cam.ac.uk/pub/mackay/. \n\nM. Mezard, The Space of Interactions in Neural Networks: Gardner's Computation with the Cavity Method, J. Phys. A 22, 2181 (1989). \n\nM. Mezard, G. Parisi and M. A. Virasoro, Spin Glass Theory and Beyond, Lecture Notes in Physics, 9, World Scientific (1987). \n\nR. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics, Springer (1996). \n\nR. M. Neal, Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification, Technical Report CRG-TR-97-2, Dept. of Computer Science, University of Toronto (1997). \n\nM. Opper and O. Winther, A Mean Field Approach to Bayes Learning in Feed-Forward Neural Networks, Phys. Rev. Lett. 76, 1964 (1996). \n\nM. Opper and O. Winther, A Mean Field Algorithm for Bayes Learning in Large Feed-Forward Neural Networks, in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 225-231. MIT Press (1997). \n\nG. Parisi and M. Potters, Mean-Field Equations for Spin Models with Orthogonal Interaction Matrices, J. Phys. A: Math. Gen. 28, 5267 (1995). \n\nT. Plefka, Convergence Condition of the TAP Equation for the Infinite-Range Ising Spin Glass, J. Phys. A 15, 1971 (1982). \n\nD. J. Thouless, P. W. Anderson and R. G. Palmer, Solution of a 'Solvable Model of a Spin Glass', Phil. Mag. 35, 593 (1977). 
\n\nC. K. I. Williams, Computing with Infinite Networks, in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 295-301. MIT Press (1997). \n\nC. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Regression, in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, eds., 514-520. MIT Press (1996). \n\nK. Y. M. Wong, Microscopic Equations and Stability Conditions in Optimal Neural Networks, Europhys. Lett. 30, 245 (1995). \n\n", "award": [], "sourceid": 1532, "authors": [{"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}]}