{"title": "Mean Field Methods for Classification with Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 309, "page_last": 315, "abstract": null, "full_text": "Mean field methods for classification with \n\nGaussian processes \n\nManfred Opper \n\nNeural Computing Research Group \n\nDivision of Electronic Engineering and Computer Science \n\nAston University, Birmingham B4 7ET, UK. \n\nopperm@aston.ac.uk \n\nOle Winther \n\nTheoretical Physics II, Lund University, Sölvegatan 14 A \n\nS-223 62 Lund, Sweden \n\nCONNECT, The Niels Bohr Institute, University of Copenhagen \n\nBlegdamsvej 17, 2100 Copenhagen Ø, Denmark \n\nwinther@thep.lu.se \n\nAbstract \n\nWe discuss the application of TAP mean field methods known from the Statistical Mechanics of disordered systems to Bayesian classification models with Gaussian processes. In contrast to previous approaches, no knowledge about the distribution of inputs is needed. Simulation results for the Sonar data set are given. \n\n1 Modeling with Gaussian Processes \n\nBayesian models which are based on Gaussian prior distributions on function spaces are promising non-parametric statistical tools. They have recently been introduced into the Neural Computation community (Neal 1996, Williams & Rasmussen 1996, Mackay 1997). To give their basic definition, we assume that the likelihood of the output or target variable T for a given input s ∈ R^N can be written in the form p(T|h(s)), where h : R^N → R is a priori assumed to be a Gaussian random field. If we assume fields with zero prior mean, the statistics of h is entirely defined by the second order correlations C(s, s') ≡ E[h(s)h(s')], where E denotes expectations with respect to the prior. \n\n\f310 \n\nM. Opper and O. Winther \n\n
Interesting examples are \n\nC(s, s') = exp(-(1/2) Σ_i w_i (s_i - s'_i)^2)  (1) \n\nC(s, s') = Σ_i w_i s_i s'_i  (2) \n\nThe choice (1) can be motivated as a limit of a two-layered neural network with infinitely many hidden units with factorizable input-hidden weight priors (Williams 1997). The w_i are hyperparameters determining the relevant prior lengthscales of h(s). The simplest choice (2) corresponds to a single layer perceptron with independent Gaussian weight priors. \n\nIn this Bayesian framework, one can make predictions on a novel input s after having received a set D_m of m training examples (T^μ, s^μ), μ = 1, ..., m, by using the posterior distribution of the field at the test point s, which is given by \n\np(h(s)|D_m) = ∫ p(h(s)|{h^ν}) p({h^ν}|D_m) Π_μ dh^μ.  (3) \n\np(h(s)|{h^ν}) is a conditional Gaussian distribution, and \n\np({h^ν}|D_m) = (1/Z) p({h^ν}) Π_μ p(T^μ|h^μ)  (4) \n\nis the posterior distribution of the field variables at the training points. Z is a normalizing partition function and \n\np({h^ν}) = (det(2πC))^(-1/2) exp(-(1/2) Σ_{μν} h^μ (C^{-1})_{μν} h^ν)  (5) \n\nis the prior distribution of the fields at the training points. Here, we have introduced the abbreviations h^μ = h(s^μ) and C_{μν} ≡ C(s^μ, s^ν). \n\nThe major technical problem of this approach comes from the difficulty in performing the high dimensional integrations. Non-Gaussian likelihoods can only be treated by approximations, where e.g. Monte Carlo sampling (Neal 1997), Laplace integration (Barber & Williams 1997) or bounds on the likelihood (Gibbs & Mackay 1997) have been used so far. In this paper, we introduce a further approach, which is based on a mean field method known in the Statistical Physics of disordered systems (Mezard, Parisi & Virasoro 1987). \n\nWe specialize to the case of a binary classification problem, where a binary class label T = ±1 must be predicted using a training set corrupted by i.i.d. label noise. 
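As an illustration (not part of the paper), the two kinds of covariance function just discussed can be sketched as follows; the squared-exponential form used for the lengthscale kernel and all function names are our own assumptions:

```python
import math

def rbf_cov(s, t, w):
    # Lengthscale-type covariance: exp(-(1/2) sum_i w_i (s_i - t_i)^2),
    # with per-dimension weights w_i setting the prior lengthscales
    # (an assumed concrete form, reconstructed from the surrounding text).
    return math.exp(-0.5 * sum(wi * (si - ti) ** 2 for wi, si, ti in zip(w, s, t)))

def linear_cov(s, t, w):
    # The simplest choice C(s, s') = sum_i w_i s_i s'_i (single layer perceptron).
    return sum(wi * si * ti for wi, si, ti in zip(w, s, t))

def gram_matrix(cov, inputs, w):
    # Gram matrix C_{mu nu} = C(s^mu, s^nu) over the m training inputs.
    return [[cov(a, b, w) for b in inputs] for a in inputs]
```

The Gram matrix built this way is exactly the C appearing in the prior and posterior formulas below.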
\nThe likelihood for this problem is taken as \n\np(T|h(s)) = κ + (1 - 2κ) Θ(T h(s)), \n\nwhere κ is the probability that the true classification label is corrupted, i.e. flipped, and the step function Θ(x) is defined as Θ(x) = 1 for x > 0 and 0 otherwise. For such a case, we expect that (by the non-smoothness of the model) e.g. Laplace's method and the bounds introduced in (Gibbs & Mackay 1997) are not directly applicable. \n\n2 Exact posterior averages \n\nIn order to make a prediction on an input s, ideally the label with maximum posterior probability should be chosen, i.e. T^Bayes = argmax_T p(T|D_m), where the predictive probability is given by p(T|D_m) = ∫ dh p(T|h) p(h|D_m). For the binary case the Bayes classifier becomes T^Bayes = sign(⟨sign h(s)⟩), where we throughout the paper let brackets ⟨...⟩ denote posterior averages. Here, we use a somewhat simpler approach by using the prediction \n\nT = sign(⟨h(s)⟩). \n\nThis would reduce to the ideal prediction when the posterior distribution of h(s) is symmetric around its mean ⟨h(s)⟩. The goal of our mean field approach will be to provide a set of equations for approximately determining ⟨h(s)⟩. The starting point of our analysis is the partition function \n\nZ = ∫ Π_μ (dx^μ dh^μ / 2π) Π_μ p(T^μ|h^μ) exp((1/2) Σ_{μν} C_{μν} x^μ x^ν - Σ_μ h^μ x^μ),  (6) \n\nwhere the new auxiliary variables x^μ (integrated along the imaginary axis) have been introduced in order to get rid of C^{-1} in (5). \n\nIt is not hard to show from (6) that the posterior averages of the fields at the m training inputs and at a new test point s are given by \n\n⟨h^μ⟩ = Σ_ν C_{μν} ⟨x^ν⟩,   ⟨h(s)⟩ = Σ_ν C(s, s^ν) ⟨x^ν⟩.  (7) \n\nWe have thus reduced our problem to the calculation of the \"microscopic order parameters\" ⟨x^μ⟩. 
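Once approximate values of the order parameters ⟨x^ν⟩ are available, the posterior mean of the field at any test point is just a covariance-weighted sum over the training points, and the label prediction is its sign. A minimal sketch (our own illustration; the function names and the pluggable covariance are assumptions):

```python
def posterior_mean_field(s, train_inputs, x_avg, cov):
    # Eq. (7): <h(s)> = sum_nu C(s, s^nu) <x^nu>
    return sum(cov(s, sv) * xv for sv, xv in zip(train_inputs, x_avg))

def predict_label(s, train_inputs, x_avg, cov):
    # Prediction T = sign(<h(s)>)
    h = posterior_mean_field(s, train_inputs, x_avg, cov)
    return 1 if h >= 0 else -1
```

Any covariance function, e.g. a dot-product kernel `cov = lambda a, b: sum(x * y for x, y in zip(a, b))`, can be passed in.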
Averages in Statistical Physics can be calculated from derivatives of -ln Z with respect to small external fields, which are then set to zero. An equivalent formulation uses the Legendre transform of -ln Z as a function of the expectations, which in our case is given by \n\nG({⟨x^μ⟩, ⟨(x^μ)^2⟩}) = -ln Z({γ_μ, Λ_μ}) + Σ_μ ⟨x^μ⟩ γ_μ + (1/2) Σ_μ Λ_μ ⟨(x^μ)^2⟩,  (8) \n\nwith \n\nZ({γ_μ, Λ_μ}) = ∫ Π_μ (dx^μ dh^μ / 2π) Π_μ p(T^μ|h^μ) exp((1/2) Σ_{μν} (Λ_μ δ_{μν} + C_{μν}) x^μ x^ν + Σ_μ x^μ (γ_μ - h^μ)).  (9) \n\nThe additional averages ⟨(x^μ)^2⟩ have been introduced because the dynamical variables x^μ (unlike Ising spins) do not have fixed length. The external fields γ_μ, Λ_μ must be eliminated from ∂G/∂γ_μ = ∂G/∂Λ_μ = 0, and the true expectation values of x^μ and (x^μ)^2 are those which satisfy ∂G/∂⟨x^μ⟩ = ∂G/∂⟨(x^μ)^2⟩ = 0. \n\n3 Naive mean field theory \n\nSo far, this description does not give anything new. Usually G cannot be calculated exactly for the non-Gaussian likelihood models of interest. Nevertheless, based on mean field theory (MFT) it is possible to guess an approximate form for G. \n\n1 Although the integrations are over the imaginary axis, these expectations come out positive. This is due to the fact that the integration \"measure\" is complex as well. \n\nMean field methods have found interesting applications in Neural Computing within the framework of ensemble learning, where the exact posterior distribution is approximated by a simpler one using product distributions in a variational treatment. Such a \"standard\" mean field method for the posterior of the h^μ (for the case of Gaussian process classification) is in preparation and will be discussed elsewhere. 
In this paper, we suggest a different route, which introduces nontrivial corrections to a simple or \"naive\" MFT for the variables x^μ. Besides the variational method (which would be purely formal, because the distribution of the x^μ is complex and does not define a probability), there are other ways to define the simple MFT, e.g. by truncating a perturbation expansion with respect to the \"interactions\" C_{μν} in G after the first order (Plefka 1982). These approaches yield the result \n\nG ≈ G_naive = G_0 - (1/2) Σ_μ C_{μμ} ⟨(x^μ)^2⟩ - (1/2) Σ_{μ,ν: ν≠μ} C_{μν} ⟨x^μ⟩⟨x^ν⟩.  (10) \n\nG_0 is the contribution to G for a model without any interactions, i.e. when C_{μν} = 0 in (9); it is the Legendre transform of \n\n-ln Z_0 = -Σ_μ ln [κ + (1 - 2κ) Φ(T^μ γ_μ / √Λ_μ)], \n\nwhere Φ(z) = ∫_{-∞}^z (dt/√(2π)) e^{-t^2/2} is an error function. For simple models in Statistical Physics, where all interactions C_{μν} are positive and equal, it is easy to show that G_naive becomes exact in the limit of an infinite number of variables x^μ. Hence, for systems with a large number of nonzero interactions having the same order of magnitude, one may expect that the approximation is not too bad. \n\n4 The TAP approach \n\nNevertheless, when the interactions C_{μν} can be both positive and negative (as one would expect e.g. when inputs have zero mean), even in the thermodynamic limit and for nice distributions of inputs, an additional contribution ΔG must be added to the \"naive\" mean field theory (10). Such a correction (often called an Onsager reaction term) was introduced for a spin glass model by (Thouless, Anderson & Palmer 1977) (TAP). It was later applied to the statistical mechanics of single layer perceptrons by (Mezard 1989) and then generalized to the Bayesian framework by (Opper & Winther 1996, 1997). For an application to multilayer networks, see (Wong 1995). 
In the thermodynamic limit of infinitely large dimension of the input space, and for nice input distributions, the results can be shown to coincide with the results of the replica framework. The drawback of the previous derivations of the TAP MFT for neural networks was the fact that special assumptions on the input distribution had been made and certain fluctuating terms were replaced by their averages over the distribution of random data, which in practice would not be available. In this paper, we use the approach of (Parisi & Potters 1995), which allows one to circumvent this problem. They concluded (applied to the case of a spin model with random interactions of a specific type) that the functional form of ΔG should not depend on the type of the \"single particle\" contribution G_0. Hence, one may use any model in G_0 for which G can be calculated exactly (e.g. the Gaussian regression model) and subtract the naive mean field contribution to obtain the desired ΔG. For the sake of simplicity, we have chosen the even simpler model p(T^μ|h^μ) ∝ δ(h^μ) without changing the final result. A lengthy but straightforward calculation for this problem leads to an explicit expression for ΔG  (11) \n\nin terms of the ⟨x^μ⟩ and R_μ ≡ ⟨(x^μ)^2⟩ - ⟨x^μ⟩^2. The Λ_μ must be eliminated using ∂G_TAP/∂Λ_μ = 0, which leads to an equation  (12) \n\ndetermining the Λ_μ. Note that with this choice, the TAP mean field theory becomes exact for Gaussian likelihoods, i.e. for standard regression problems. Finally, setting the derivatives of G_TAP = G_naive + ΔG with respect to the four variables ⟨x^μ⟩, ⟨(x^μ)^2⟩, γ_μ, Λ_μ equal to zero, we obtain the mean field equations  (13) \n\nin which the Gaussian measure D(z) = e^{-z^2/2}/√(2π) appears. These equations have to be solved numerically together with (12). In contrast, for the naive MFT, the simpler result Λ_μ = C_{μμ} is found. 
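To make the structure of such fixed-point equations concrete, here is a sketch of a damped iteration for the naive MFT (Λ_μ = C_{μμ}) in the noise-free case κ = 0. The precise update of eq. (13) is not recoverable from the scan, so the cavity-style update below, in which the self-interaction is subtracted from the local field, is our illustrative assumption, not the paper's exact algorithm:

```python
import math

def std_phi(z):
    # Standard normal CDF: Phi(z) = int_{-inf}^z dt exp(-t^2/2) / sqrt(2 pi)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_density(z):
    # Gaussian measure D(z) = exp(-z^2/2) / sqrt(2 pi)
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def naive_mean_field(C, T, damping=0.5, tol=1e-8, max_iter=500):
    # Damped fixed-point iteration for the order parameters <x^mu>.
    # Assumptions (ours): kappa = 0, Lambda_mu = C_mu_mu, and a cavity field
    # obtained by removing the self-term from sum_nu C_{mu nu} <x^nu>.
    m = len(T)
    x = [0.0] * m
    for _ in range(max_iter):
        max_step = 0.0
        for mu in range(m):
            lam = C[mu][mu]
            cavity = sum(C[mu][nu] * x[nu] for nu in range(m)) - lam * x[mu]
            z = T[mu] * cavity / math.sqrt(lam)
            new = T[mu] * std_density(z) / (math.sqrt(lam) * std_phi(z))
            old = x[mu]
            x[mu] = (1.0 - damping) * new + damping * old
            max_step = max(max_step, abs(x[mu] - old))
        if max_step < tol:
            break
    return x
```

By construction each ⟨x^μ⟩ carries the sign of its label T^μ, since D(z), Φ(z) and √Λ_μ are all positive.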
\n\n5 Simulations \n\nSolving the nonlinear system of equations (12, 13) by iteration turns out to be quite straightforward. For some data sets, to get convergence one has to add a diagonal term v to the covariance matrix C: C_{ij} → C_{ij} + δ_{ij} v. It may be shown that this term corresponds to learning with Gaussian noise (with variance v) added to the Gaussian random field. \n\nHere, we present simulation results for a single data set, the Sonar - Mines versus Rocks, using the same training/test set split as in the original study by (Gorman & Sejnowski 1988). The input data were pre-processed by linear rescaling such that over the training set each input variable has zero mean and unit variance. In some cases the mean field equations failed to converge using the raw data. \n\nA further important feature of the TAP MFT is the fact that the method also gives an approximate leave-one-out estimator for the generalization error, ε_loo, expressed in terms of the solution to the mean field equations (see (Opper & Winther 1996, 1997) for more details). It is also possible to derive a leave-one-out estimator for the naive MFT (Opper & Winther, to be published). \n\nSince we have so far not dealt with the problem of automatically estimating the hyperparameters, their number was drastically reduced by setting w_i = σ^2/N in the covariances (1) and (2). The remaining hyperparameters, σ^2, κ and v, were chosen so as to minimize ε_loo. It turned out that the lowest ε_loo was found from modeling without noise: κ = v = 0. \n\nTable 1: The result for the Sonar data. \n\nAlgorithm | Covariance Function | ε_test | ε_loo | ε_loo^exact \nTAP Mean Field | (1) | 0.183 | 0.260 | 0.260 \nTAP Mean Field | (2) | 0.077 | 0.212 | 0.212 \nNaive Mean Field | (1) | 0.154 | 0.269 | 0.269 \nNaive Mean Field | (2) | 0.077 | 0.221 | 0.221 \nBack-Prop | Simple Perceptron | 0.269 (±0.048) | - | - \nBack-Prop | Best 2-layer, 12 Hidden | 0.096 (±0.018) | - | - \n\n
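The two practical devices used in this section, linearly rescaling each input variable to zero mean and unit variance over the training set, and adding the diagonal noise term v to the covariance matrix, can be sketched as follows (illustrative code, not from the paper; function names are ours):

```python
def standardize_columns(X):
    # Linearly rescale so each input variable has zero mean and unit variance
    # over the training set (variance taken as the biased 1/m estimate).
    m, n = len(X), len(X[0])
    out = [row[:] for row in X]
    for j in range(n):
        col = [row[j] for row in X]
        mean = sum(col) / m
        var = sum((v - mean) ** 2 for v in col) / m
        scale = var ** 0.5 if var > 0 else 1.0
        for i in range(m):
            out[i][j] = (X[i][j] - mean) / scale
    return out

def add_diagonal_noise(C, v):
    # C_ij -> C_ij + delta_ij * v : Gaussian noise of variance v on the field.
    return [[C[i][j] + (v if i == j else 0.0) for j in range(len(C))]
            for i in range(len(C))]
```

The same rescaling parameters computed on the training set would also be applied to the test inputs before prediction.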
\nThe simulation results are shown in table 1. The comparison for back-propagation is taken from (Gorman & Sejnowski 1988). The solution found by the algorithm turned out to be unique, i.e. different orders of presentation of the examples and different initial values for the ⟨x^μ⟩ converged to the same solution. \n\nIn table 1, we have also compared the estimate given by the algorithm with the exact leave-one-out estimate ε_loo^exact obtained by going through the training set, keeping an example out for testing, and running the mean field algorithm on the rest. The estimate and the exact value are in complete agreement. Comparing with the test error, we see that the training set is 'hard' and the test set is 'easy'. The small difference in test error between the naive and full mean field algorithms also indicates that the mean field scheme is quite robust with respect to the choice of Λ_μ. \n\n6 Discussion \n\nMore work has to be done to make the TAP approach a practical tool for Bayesian modeling. One has to find better methods for solving the equations. A conversion into a direct minimization problem for a free energy may be helpful. To achieve this, one may probably work with the real field variables h^μ instead of the imaginary x^μ. A further problem is the determination of the hyperparameters of the covariance functions. Two ways seem interesting here. One may use the approximate free energy G, which is essentially the negative logarithm of the Bayesian evidence, to estimate the most probable values of the hyperparameters. However, an estimate of the errors made in the TAP approach would be necessary. Second, one may use the built-in leave-one-out estimate to estimate the generalization error. Again, an estimate of the validity of the approximation is necessary. 
It will further be interesting to apply our way of deriving the TAP equations to other models (Boltzmann machines, belief nets, combinatorial optimization problems), for which standard mean field theories have been applied successfully. \n\nAcknowledgments \n\nThis research is supported by the Swedish Foundation for Strategic Research and by the Danish Research Councils for the Natural and Technical Sciences through the Danish Computational Neural Network Center (CONNECT). \n\nReferences \n\nD. Barber and C. K. I. Williams, Gaussian Processes for Bayesian Classification via Hybrid Monte Carlo, in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 340-346, MIT Press (1997). \n\nM. N. Gibbs and D. J. C. Mackay, Variational Gaussian Process Classifiers, Preprint, Cambridge University (1997). \n\nR. P. Gorman and T. J. Sejnowski, Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets, Neural Networks 1, 75 (1988). \n\nD. J. C. Mackay, Gaussian Processes, A Replacement for Neural Networks, NIPS tutorial 1997. May be obtained from http://wol.ra.phy.cam.ac.uk/pub/mackay/. \n\nM. Mezard, The Space of Interactions in Neural Networks: Gardner's Computation with the Cavity Method, J. Phys. A 22, 2181 (1989). \n\nM. Mezard, G. Parisi and M. A. Virasoro, Spin Glass Theory and Beyond, Lecture Notes in Physics 9, World Scientific (1987). \n\nR. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics, Springer (1996). \n\nR. M. Neal, Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification, Technical Report CRG-TR-97-2, Dept. of Computer Science, University of Toronto (1997). \n\nM. Opper and O. Winther, A Mean Field Approach to Bayes Learning in Feed-Forward Neural Networks, Phys. Rev. Lett. 76, 1964 (1996). \n\nM. 
Opper and O. Winther, A Mean Field Algorithm for Bayes Learning in Large Feed-Forward Neural Networks, in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 225-231, MIT Press (1997). \n\nG. Parisi and M. Potters, Mean-Field Equations for Spin Models with Orthogonal Interaction Matrices, J. Phys. A: Math. Gen. 28, 5267 (1995). \n\nT. Plefka, Convergence Condition of the TAP Equation for the Infinite-Range Ising Spin Glass, J. Phys. A 15, 1971 (1982). \n\nD. J. Thouless, P. W. Anderson and R. G. Palmer, Solution of a 'Solvable Model of a Spin Glass', Phil. Mag. 35, 593 (1977). \n\nC. K. I. Williams, Computing with Infinite Networks, in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 295-301, MIT Press (1997). \n\nC. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Regression, in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, eds., 514-520, MIT Press (1996). \n\nK. Y. M. Wong, Microscopic Equations and Stability Conditions in Optimal Neural Networks, Europhys. Lett. 30, 245 (1995). \n\n\f", "award": [], "sourceid": 1532, "authors": [{"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}]}