{"title": "A Mean Field Algorithm for Bayes Learning in Large Feed-forward Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 225, "page_last": 231, "abstract": null, "full_text": "A mean field algorithm for Bayes learning \n\nin large feed-forward neural networks \n\nManfred Opper \n\nInstitut fur Theoretische Physik \n\nOle Winther \nCONNECT \n\nJulius-Maximilians-Universitat, Am Hubland \n\nThe Niels Bohr Institute \n\nD-97074 Wurzburg, Germany \n\nopperOphysik.Uni-Wuerzburg.de \n\nBlegdamsvej 17 \n\n2100 Copenhagen, Denmark \nwintherGconnect.nbi.dk \n\nAbstract \n\nWe present an algorithm which is expected to realise Bayes optimal \npredictions in large feed-forward networks. It is based on mean field \nmethods developed within statistical mechanics of disordered sys(cid:173)\ntems. We give a derivation for the single layer perceptron and show \nthat the algorithm also provides a leave-one-out cross-validation \ntest of the predictions. Simulations show excellent agreement with \ntheoretical results of statistical mechanics. \n\n1 \n\nINTRODUCTION \n\nBayes methods have become popular as a consistent framework for regularization \nand model selection in the field of neural networks (see e.g. [MacKay,1992]). In \nthe Bayes approach to statistical inference [Berger, 1985] one assumes that the prior \nuncertainty about parameters of an unknown data generating mechanism can be \nencoded in a probability distribution, the so called prior. Using the prior and \nthe likelihood of the data given the parameters, the posterior distribution of the \nparameters can be derived from Bayes rule. From this posterior, various estimates \nfor functions ofthe parameter, like predictions about unseen data, can be calculated. \nHowever, in general, those predictions cannot be realised by specific parameter \nvalues, but only by an ensemble average over parameters according to the posterior \nprobability. 
Hence, exact implementations of the Bayes method for neural networks require averages over network parameters, which in general can only be performed by time-consuming Monte Carlo procedures. There are, however, useful approximate approaches for calculating posterior averages which are based on the assumption of a Gaussian form of the posterior distribution [MacKay, 1992]. Under regularity conditions on the likelihood, this approximation becomes asymptotically exact when the number of data is large compared to the number of parameters. This Gaussian ansatz for the posterior may not be justified when the number of examples is small or comparable to the number of network weights. A second cause for its failure would be a situation where discrete classification labels are produced from a probability distribution which is a nonsmooth function of the parameters. This includes the case of a network with threshold units learning a noise-free binary classification problem.

In this contribution we present an alternative approximate realization of the Bayes method for neural networks which is not based on asymptotic posterior normality. The posterior averages are performed using mean field techniques known from the statistical mechanics of disordered systems. These are expected to become exact in the limit of a large number of network parameters under additional assumptions on the statistics of the input data. Our analysis follows the approach of [Thouless, Anderson & Palmer, 1977] (TAP) as adapted to the simple perceptron by [Mezard, 1989].

The basic setup of the Bayes method is as follows. We have a training set consisting of $m$ input-output pairs $D_m = \{(\mathbf{s}^\mu, \sigma^\mu),\ \mu = 1,\dots,m\}$, where the outputs are generated independently from a conditional probability distribution $P(\sigma^\mu|\mathbf{w}, \mathbf{s}^\mu)$.
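As an illustration of this data-generating setup, the following sketch draws such a training set from a teacher perceptron with the label-noise model used later in eq. (4). The teacher construction and all parameter values are hypothetical choices for illustration.

```python
import numpy as np

# Sketch of the data-generating setup: m input-output pairs D_m, inputs with
# uncorrelated normalized components, outputs from a teacher perceptron whose
# label is flipped with probability 1/(1 + exp(beta)) (the noise model of
# eq. (4)). N, m, beta and the teacher are hypothetical choices.
rng = np.random.default_rng(1)
N, m, beta = 50, 100, 2.0

w_teacher = rng.normal(size=N)
w_teacher *= np.sqrt(N) / np.linalg.norm(w_teacher)  # |w|^2 = N, as for the prior

s = rng.normal(size=(m, N))                # inputs s^mu
delta = s @ w_teacher / np.sqrt(N)         # internal fields Delta^mu
sigma = np.sign(delta)                     # noise-free labels
flip = rng.random(m) < 1.0 / (1.0 + np.exp(beta))
sigma[flip] *= -1.0                        # label noise
```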
This probability is assumed to describe the output $\sigma^\mu$ to an input $\mathbf{s}^\mu$ of a neural network with weights $\mathbf{w}$ subject to a suitable noise process. If we assume that the unknown parameters $\mathbf{w}$ are randomly distributed with a prior distribution $p(\mathbf{w})$, then according to Bayes' theorem our knowledge about $\mathbf{w}$ after seeing $m$ examples is expressed through the posterior distribution

$$p(\mathbf{w}|D_m) = Z^{-1} p(\mathbf{w}) \prod_{\mu=1}^m P(\sigma^\mu|\mathbf{w}, \mathbf{s}^\mu) \qquad (1)$$

where $Z = \int d\mathbf{w}\, p(\mathbf{w}) \prod_{\mu=1}^m P(\sigma^\mu|\mathbf{w},\mathbf{s}^\mu)$ is called the partition function in statistical mechanics and the evidence in Bayesian terminology. Taking the average with respect to the posterior (1), which in the following will be denoted by angle brackets, gives Bayes estimates for various quantities. For example, the optimal predictive probability for an output $\sigma$ to a new input $\mathbf{s}$ is given by $P^{\mathrm{Bayes}}(\sigma|\mathbf{s}) = \langle P(\sigma|\mathbf{w},\mathbf{s})\rangle$.

In section 2 exact equations for the posterior averaged weights $\langle\mathbf{w}\rangle$ are derived for arbitrary networks. In section 3 we specialize these equations to a perceptron, and in section 4 we develop a mean field ansatz. The resulting system of mean field equations is presented in section 5. In section 6 we consider Bayes optimal predictions and a leave-one-out estimator for the generalization error. We conclude in section 7 with a discussion of our results.

2 A RESULT FOR POSTERIOR AVERAGES FROM GAUSSIAN PRIORS

In this section we derive an interesting equation for the posterior mean of the weights for arbitrary networks when the prior is Gaussian. This average of the weights can be calculated for the distribution (1) by using the following simple and well-known result for averages over Gaussian distributions.

Let $v$ be a Gaussian random variable with zero mean. Then for any function $f(v)$ we have

$$\langle v f(v)\rangle_G = \langle v^2\rangle_G \left\langle \frac{df(v)}{dv} \right\rangle_G. \qquad (2)$$

Here $\langle\cdots\rangle_G$ denotes the average over the Gaussian distribution of $v$. The relation is easily proved by an integration by parts.

In the following we will specialize to an isotropic Gaussian prior $p(\mathbf{w}) = (2\pi)^{-N/2}\, e^{-\frac{1}{2}\mathbf{w}\cdot\mathbf{w}}$. In [Opper & Winther, 1996] anisotropic priors are treated as well. Applying (2) to each component of $\mathbf{w}$ and the function $\prod_{\mu=1}^m P(\sigma^\mu|\mathbf{w},\mathbf{s}^\mu)$, we get the following equations

$$\langle \mathbf{w}\rangle = Z^{-1}\int d\mathbf{w}\, \mathbf{w}\, p(\mathbf{w}) \prod_{\mu=1}^m P(\sigma^\mu|\mathbf{w},\mathbf{s}^\mu) = Z^{-1}\sum_{\mu=1}^m \int d\mathbf{w}\, p(\mathbf{w}) \prod_{\nu\neq\mu} P(\sigma^\nu|\mathbf{w},\mathbf{s}^\nu)\, \nabla_{\mathbf{w}} P(\sigma^\mu|\mathbf{w},\mathbf{s}^\mu). \qquad (3)$$

Here

$$\langle\cdots\rangle_\mu = \frac{\int d\mathbf{w}\, p(\mathbf{w}) \cdots \prod_{\nu\neq\mu} P(\sigma^\nu|\mathbf{w},\mathbf{s}^\nu)}{\int d\mathbf{w}\, p(\mathbf{w}) \prod_{\nu\neq\mu} P(\sigma^\nu|\mathbf{w},\mathbf{s}^\nu)}$$

is a reduced average over a posterior where the $\mu$-th example is kept out of the training set, and $\nabla_{\mathbf{w}}$ denotes the gradient with respect to $\mathbf{w}$.

3 THE PERCEPTRON

In the following we will utilize the fact that for neural networks the probability (1) depends only on the so-called internal fields $\Delta = \frac{1}{\sqrt{N}}\mathbf{w}\cdot\mathbf{s}$.

A simple but nontrivial example is the perceptron with $N$-dimensional input vector $\mathbf{s}$ and output $\sigma(\mathbf{w},\mathbf{s}) = \mathrm{sign}(\Delta)$. We will generalize the noise-free model by considering label noise in which the output is flipped, i.e. $\sigma\Delta < 0$, with probability $(1+e^\beta)^{-1}$. (For simplicity we will assume that $\beta$ is known, so that no prior on $\beta$ is needed.) The conditional probability may thus be written as

$$P(\sigma^\mu|\Delta^\mu) = P(\sigma^\mu|\mathbf{w},\mathbf{s}^\mu) = \frac{e^{-\beta\,\Theta(-\sigma^\mu \Delta^\mu)}}{1+e^{-\beta}}, \qquad (4)$$

where $\Theta(x) = 1$ for $x > 0$ and $0$ otherwise. Obviously this is a nonsmooth function of the weights $\mathbf{w}$, for which the posterior will not become Gaussian asymptotically.

For this case (3) reads

$$\langle\mathbf{w}\rangle = \frac{1}{\sqrt N}\sum_{\mu=1}^m \frac{\langle P'(\sigma^\mu\Delta^\mu)\rangle_\mu}{\langle P(\sigma^\mu\Delta^\mu)\rangle_\mu}\, \sigma^\mu \mathbf{s}^\mu = \frac{1}{\sqrt N}\sum_{\mu=1}^m \frac{\int d\Delta\, f_\mu(\Delta)\, P'(\sigma^\mu\Delta)}{\int d\Delta\, f_\mu(\Delta)\, P(\sigma^\mu\Delta)}\, \sigma^\mu \mathbf{s}^\mu. \qquad (5)$$

$f_\mu(\Delta)$ is the density of $\frac{1}{\sqrt N}\mathbf{w}\cdot\mathbf{s}^\mu$ when the weights $\mathbf{w}$ are randomly drawn from a posterior where example $(\mathbf{s}^\mu, \sigma^\mu)$ was kept out of the training set. This result states that the weights are linear combinations of the input vectors. It gives an example of the ability of the Bayes method to regularize a network model: the effective number of parameters will never exceed the number of data points.

4 MEAN FIELD APPROXIMATION

So far, no approximations have been made to obtain eqs. (3) and (5). In general $f_\mu(\Delta)$ depends on the entire set of data $D_m$ and cannot be calculated easily. Hence we look for a useful approximation to these densities.

We split the internal field into its average and fluctuating parts, i.e. we set $\Delta^\mu = \langle\Delta^\mu\rangle_\mu + v^\mu$, with $v^\mu = \frac{1}{\sqrt N}(\mathbf{w}-\langle\mathbf{w}\rangle_\mu)\cdot\mathbf{s}^\mu$. Our mean field approximation is based on the assumption of a central limit theorem for the fluctuating part of the internal field, $v^\mu$, which enters in the reduced average of eq. (5). This means we assume that the non-Gaussian fluctuations of $w_i$ around $\langle w_i\rangle_\mu$, when multiplied by $s_i^\mu$, will sum up to make $v^\mu$ a Gaussian random variable. The important point here is that for the reduced average, the $w_i$ are not correlated to the $s_i^\mu$!^1

We expect that this Gaussian approximation is reasonable when $N$, the number of network weights, is sufficiently large. Following ideas of [Mezard, Parisi & Virasoro, 1987] and [Mezard, 1989], who obtained mean field equations for a variety of disordered systems in statistical mechanics, one can argue that in many cases this assumption may be exactly fulfilled in the 'thermodynamic limit' $m, N \to \infty$ with $\alpha = m/N$ fixed. According to this ansatz, we get

$$f_\mu(\Delta) \approx \frac{1}{\sqrt{2\pi\Lambda^\mu}}\, e^{-(\Delta - \langle\Delta^\mu\rangle_\mu)^2/(2\Lambda^\mu)}$$

in terms of the second moment of $v^\mu$,

$$\Lambda^\mu := \frac{1}{N}\sum_{i,j} s_i^\mu s_j^\mu \left(\langle w_i w_j\rangle_\mu - \langle w_i\rangle_\mu \langle w_j\rangle_\mu\right).$$

To evaluate (5) we need to calculate the mean $\langle\Delta^\mu\rangle_\mu$ and the variance $\Lambda^\mu$. The first problem is treated easily within the Gaussian approximation.
$$\langle\Delta^\mu\rangle = \frac{\langle \Delta^\mu P(\sigma^\mu\Delta^\mu)\rangle_\mu}{\langle P(\sigma^\mu\Delta^\mu)\rangle_\mu} = \langle\Delta^\mu\rangle_\mu + \frac{\langle v^\mu P(\sigma^\mu\Delta^\mu)\rangle_\mu}{\langle P(\sigma^\mu\Delta^\mu)\rangle_\mu} = \langle\Delta^\mu\rangle_\mu + \Lambda^\mu\,\sigma^\mu\,\frac{\langle P'(\sigma^\mu\Delta^\mu)\rangle_\mu}{\langle P(\sigma^\mu\Delta^\mu)\rangle_\mu} \qquad (6)$$

In the third line (2) has been used again for the Gaussian random variable $v^\mu$.

So far, the calculation of the variance $\Lambda^\mu$ for general inputs is an open problem. However, we can make a further reasonable ansatz when the distribution of the inputs is known. The following approximation for $\Lambda^\mu$ is expected to become exact in the thermodynamic limit if the inputs of the training set are drawn independently from a distribution where all components $s_i$ are uncorrelated and normalized, i.e. $\overline{s_i} = 0$ and $\overline{s_i s_j} = \delta_{ij}$. The bars denote expectation over the distribution of inputs. For the generalization to a correlated input distribution see [Opper & Winther, 1996]. Our basic mean field assumption is that the fluctuations of the $\Lambda^\mu$ with the data set can be neglected, so that we can replace them by their averages $\overline{\Lambda^\mu}$. Since the reduced posterior averages are not correlated with the data $s_i^\mu$, we obtain $\Lambda^\mu \approx \frac{1}{N}\sum_i \left(\langle w_i^2\rangle_\mu - \langle w_i\rangle_\mu^2\right)$. Finally, we replace the reduced average by the expectation over the full posterior, neglecting terms of order $1/N$. Using $\sum_i \langle w_i^2\rangle = N$, which follows from our choice of the Gaussian prior, we get

$$\Lambda^\mu \approx \Lambda = 1 - \frac{1}{N}\sum_i \langle w_i\rangle^2.$$

This depends only on known quantities.

^1 Note that the fluctuations of the internal field with respect to the full posterior mean (which depends on the input $\mathbf{s}^\mu$) are non-Gaussian, because the different terms in the sum become slightly correlated.

5 MEAN FIELD EQUATIONS FOR THE PERCEPTRON

Equations (5) and (6) give a self-consistent set of equations for the variables $x^\mu \equiv \langle P'(\sigma^\mu\Delta^\mu)\rangle_\mu / \langle P(\sigma^\mu\Delta^\mu)\rangle_\mu$. We finally get

$$\langle\mathbf{w}\rangle = \frac{1}{\sqrt N}\sum_{\mu=1}^m \sigma^\mu x^\mu \mathbf{s}^\mu \qquad (7)$$

with

$$x^\mu = \frac{(1-e^{-\beta})\, \frac{1}{\sqrt{2\pi\Lambda}}\, e^{-\langle\Delta^\mu\rangle_\mu^2/(2\Lambda)}}{e^{-\beta} + (1-e^{-\beta})\, H\!\left(-\sigma^\mu\langle\Delta^\mu\rangle_\mu/\sqrt{\Lambda}\right)}, \qquad H(x)=\int_x^\infty \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2}, \qquad (8)$$

and

$$\langle\Delta^\mu\rangle_\mu = \frac{1}{\sqrt N}\,\langle\mathbf{w}\rangle\cdot\mathbf{s}^\mu - \Lambda\,\sigma^\mu x^\mu. \qquad (9)$$

These mean field equations can be solved by iteration. It is useful to start with a small number of data and then to increase the number of data in steps of 1-10. Numerical work shows that the algorithm works well even for small system sizes, $N \approx 15$.
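The iteration described above can be sketched as follows. Since the printed forms of eqs. (7)-(9) are partly corrupted in this copy, the update rules below are a reconstruction of the Gaussian-field (TAP-style) equations implied by (5) and (6); the damping factor, sweep count, and all parameter values are implementation choices, not from the paper.

```python
import numpy as np
from math import erfc, exp, sqrt, pi

def H(x):
    """Gaussian tail probability H(x) = int_x^inf dt e^{-t^2/2} / sqrt(2 pi)."""
    return 0.5 * erfc(x / sqrt(2.0))

def mean_field_iterate(s, sigma, beta, n_sweeps=300, damping=0.5):
    """Iterate the reconstructed mean field equations for the noisy perceptron.

    s: (m, N) inputs, sigma: (m,) labels in {-1, +1}, beta: known noise rate.
    Damping is an implementation choice to stabilize the fixed-point search.
    """
    m, N = s.shape
    x = np.zeros(m)                       # cavity variables x^mu
    w = np.zeros(N)                       # posterior mean <w>
    c = 1.0 - exp(-beta)
    for _ in range(n_sweeps):
        lam = max(1.0 - np.dot(w, w) / N, 1e-6)  # Lambda = 1 - (1/N) sum <w_i>^2
        h = s @ w / sqrt(N)                      # full-posterior fields <Delta^mu>
        d0 = h - lam * sigma * x                 # cavity fields <Delta^mu>_mu
        num = c * np.exp(-d0 ** 2 / (2.0 * lam)) / sqrt(2.0 * pi * lam)
        den = exp(-beta) + c * np.array(
            [H(-sigma[mu] * d0[mu] / sqrt(lam)) for mu in range(m)])
        x = damping * x + (1.0 - damping) * num / den
        w = s.T @ (sigma * x) / sqrt(N)          # <w> = (1/sqrt N) sum sigma^mu x^mu s^mu
    return w, x

# Demo on data from a noisy teacher perceptron (hypothetical parameters).
rng = np.random.default_rng(0)
N, m, beta = 30, 60, 2.0
w_teacher = rng.normal(size=N)
w_teacher *= sqrt(N) / np.linalg.norm(w_teacher)
s = rng.normal(size=(m, N))
sigma = np.sign(s @ w_teacher / sqrt(N))
sigma[rng.random(m) < 1.0 / (1.0 + exp(beta))] *= -1.0

w, x = mean_field_iterate(s, sigma, beta)
overlap = np.dot(w, w_teacher) / (np.linalg.norm(w) * np.linalg.norm(w_teacher))

# Leave-one-out error estimate (section 6): fraction of left-out examples
# whose cavity field disagrees with the label.
lam = 1.0 - np.dot(w, w) / N
eps_loo = np.mean(sigma * (s @ w / sqrt(N) - lam * sigma * x) < 0)
```

The cavity field d0 is exactly the reduced average the leave-one-out test needs, so the error estimate comes out of the same solve at no extra cost.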
6 BAYES PREDICTIONS AND LEAVE-ONE-OUT

After solving the mean field equations we can make optimal Bayesian classifications for new data $\mathbf{s}$ by choosing the output label with the largest predictive probability. In the case of output noise this reduces to $\sigma^{\mathrm{Bayes}}(\mathbf{s}) = \mathrm{sign}(\langle\sigma(\mathbf{w},\mathbf{s})\rangle)$. Since the posterior distribution is independent of the new input vector, we can apply the Gaussian assumption again to the internal field $\Delta$ and obtain $\sigma^{\mathrm{Bayes}}(\mathbf{s}) = \sigma(\langle\mathbf{w}\rangle,\mathbf{s})$, i.e. for the simple perceptron the averaged weights implement the Bayesian prediction. This will not be the case for multi-layer neural networks.

[Figure 1: Error vs. $\alpha = m/N$ for the simple perceptron with output noise $\beta = 0.5$ and $N = 50$, averaged over 200 runs. The full lines are the simulation results (the upper curve shows the prediction error, the lower curve the training error). The dashed line is the theoretical result for $N \to \infty$ obtained from statistical mechanics [Opper & Haussler, 1991]. The dotted line with larger error bars is the moving control estimate.]

We can also get an estimate for the generalization error which occurs on the prediction of new data. The generalization error for the Bayes prediction is defined by $\epsilon^{\mathrm{Bayes}} = \langle\Theta(-\sigma(\mathbf{s})\,\langle\sigma(\mathbf{w},\mathbf{s})\rangle)\rangle_{\mathbf{s}}$, where $\sigma(\mathbf{s})$ is the true output and $\langle\cdots\rangle_{\mathbf{s}}$ denotes the average over the input distribution. To obtain the leave-one-out estimator of $\epsilon$ one removes the $\mu$-th example from the training set and trains the network using only the remaining $m-1$ examples. The $\mu$-th example is used for testing.
Repeating this procedure for all $\mu$, an unbiased estimate for the Bayes generalization error with $m-1$ training data is obtained as the mean value

$$\epsilon^{\mathrm{Bayes}}_{\mathrm{loo}} = \frac{1}{m}\sum_\mu \Theta\!\left(-\sigma^\mu \langle\sigma(\mathbf{w},\mathbf{s}^\mu)\rangle_\mu\right),$$

which is exactly the type of reduced average that is calculated within our approach. Figure 1 shows a result of simulations of our algorithm when the inputs are uncorrelated and the outputs are generated from a teacher perceptron with fixed noise rate $\beta$.

7 CONCLUSION

In this paper we have presented a mean field algorithm which is expected to implement Bayes optimal classification well in the limit of large networks. We have explained the method for the single-layer perceptron. An extension to a simple multilayer network, the so-called committee machine with a tree architecture, is discussed in [Opper & Winther, 1996]. The algorithm is based on a Gaussian assumption for the distribution of the internal fields, which seems reasonable for large networks. The main problem so far is the restriction to ideal situations, such as a known distribution of inputs, which is not a realistic assumption for real-world data. However, this assumption only entered in the calculation of the variance of the Gaussian field. More theoretical work is necessary to find an approximation to the variance which is valid in more general cases. A promising approach is a derivation of the mean field equations directly from an approximation to the free energy $-\ln Z$. Besides a deeper understanding, this would also give us the possibility to use the method within the so-called evidence framework, where the partition function (evidence) can be used to estimate unknown (hyper-)parameters of the model class [Berger, 1985]. It will further be important to extend the algorithm to fully connected architectures.
In that case it might be necessary to make further approximations in the mean field method.

ACKNOWLEDGMENTS

This research is supported by a Heisenberg fellowship of the Deutsche Forschungsgemeinschaft and by the Danish Research Councils for the Natural and Technical Sciences through the Danish Computational Neural Network Center (CONNECT).

REFERENCES

Berger, J. O. (1985) Statistical Decision Theory and Bayesian Analysis, Springer-Verlag, New York.

MacKay, D. J. (1992) A practical Bayesian framework for backpropagation networks, Neural Comp. 4, 448.

Mezard, M., Parisi, G. & Virasoro, M. A. (1987) Spin Glass Theory and Beyond, Lecture Notes in Physics 9, World Scientific.

Mezard, M. (1989) The space of interactions in neural networks: Gardner's calculation with the cavity method, J. Phys. A 22, 2181.

Opper, M. & Haussler, D. (1991) in IVth Annual Workshop on Computational Learning Theory (COLT91), Morgan Kaufmann.

Opper, M. & Winther, O. (1996) A mean field approach to Bayes learning in feed-forward neural networks, Phys. Rev. Lett. 76, 1964.

Thouless, D. J., Anderson, P. W. & Palmer, R. G. (1977) Solution of 'Solvable model of a spin glass', Phil. Mag. 35, 593.
", "award": [], "sourceid": 1268, "authors": [{"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}]}