{"title": "Bayesian Model Comparison by Monte Carlo Chaining", "book": "Advances in Neural Information Processing Systems", "page_first": 333, "page_last": 339, "abstract": null, "full_text": "Bayesian Model Comparison by Monte Carlo Chaining \n\nDavid Barber \nD.Barber@aston.ac.uk \n\nChristopher M. Bishop \nC.M.Bishop@aston.ac.uk \n\nNeural Computing Research Group \nAston University, Birmingham, B4 7ET, U.K. \nhttp://www.ncrg.aston.ac.uk/ \n\nAbstract \n\nThe techniques of Bayesian inference have been applied with great success to many problems in neural computing, including evaluation of regression functions, determination of error bars on predictions, and the treatment of hyper-parameters. However, the problem of model comparison is a much more challenging one, for which current techniques have significant limitations. In this paper we show how an extended form of Markov chain Monte Carlo, called chaining, is able to provide effective estimates of the relative probabilities of different models. We present results from the robot arm problem and compare them with the corresponding results obtained using the standard Gaussian approximation framework. \n\n1 Bayesian Model Comparison \n\nIn a Bayesian treatment of statistical inference, our state of knowledge of the values of the parameters w in a model M is described in terms of a probability distribution function. Initially this is chosen to be some prior distribution p(w|M), which can be combined with a likelihood function p(D|w, M) using Bayes' theorem to give a posterior distribution p(w|D, M) in the form \n\np(w|D, M) = p(D|w, M) p(w|M) / p(D|M)     (1) \n\nwhere D is the data set. Predictions of the model are obtained by performing integrations weighted by the posterior distribution. \n\nThe comparison of different models Mi is based on their relative probabilities, which can be expressed, again using Bayes' theorem, in terms of prior probabilities P(Mi) to give \n\nP(Mi|D) / P(Mj|D) = p(D|Mi) P(Mi) / { p(D|Mj) P(Mj) }     (2) \n\nand so requires that we be able to evaluate the model evidence p(D|Mi), which corresponds to the denominator in (1). The relative probabilities of different models can be used to select the single most probable model, or to form a committee of models, weighted by their probabilities. \n\nIt is convenient to write the numerator of (1) in the form exp{-E(w)}, where E(w) is an error function. Normalization of the posterior distribution then requires that \n\np(D|M) = ∫ exp{-E(w)} dw.     (3) \n\nGenerally, it is straightforward to evaluate E(w) for a given value of w, although it is extremely difficult to evaluate the corresponding model evidence using (3), since the posterior distribution is typically very small except in narrow regions of the high-dimensional parameter space, which are unknown a priori. Standard numerical integration techniques are therefore inapplicable. \n\nOne approach is based on a local Gaussian approximation around a mode of the posterior (MacKay, 1992). Unfortunately, this approximation is expected to be accurate only when the number of data points is large in relation to the number of parameters in the model. In fact it is for relatively complex models, or problems for which data is scarce, that Bayesian methods have the most to offer. Indeed, Neal (1996) has argued that, from a Bayesian perspective, there is no reason to limit the number of parameters in a model, other than for computational reasons. We therefore consider an approach to the evaluation of model evidence which overcomes the limitations of the Gaussian framework. For additional techniques and references to Bayesian model comparison, see Gilks et al. (1995) and Kass and Raftery (1995). 
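The practical difficulty can be seen in a small numerical experiment. The sketch below (ours, not from the paper; all names and constants are illustrative) estimates a one-dimensional analogue of the evidence integral by simple Monte Carlo, once with a sampling distribution that misses the narrow region where exp{-E(w)} is non-negligible, and once with a closely matched one:

```python
import math
import random

# 1-D analogue of the evidence integral p(D|M) = int exp{-E(w)} dw,
# where exp{-E(w)} is a narrow Gaussian bump of width SIGMA at w = MODE.
# The exact value of the integral is SIGMA * sqrt(2*pi).
SIGMA = 0.05
MODE = 8.0
Z_TRUE = SIGMA * math.sqrt(2.0 * math.pi)

def unnorm_posterior(w):
    # exp{-E(w)} with E(w) = (w - MODE)^2 / (2 SIGMA^2)
    return math.exp(-0.5 * ((w - MODE) / SIGMA) ** 2)

def gauss_pdf(w, mu, s):
    return math.exp(-0.5 * ((w - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def evidence_estimate(mu, s, n, rng):
    # Simple Monte Carlo: Z ~ (1/L) sum_l exp{-E(w_l)} / q(w_l), with w_l ~ q
    total = 0.0
    for _ in range(n):
        w = rng.gauss(mu, s)
        total += unnorm_posterior(w) / gauss_pdf(w, mu, s)
    return total / n

rng = random.Random(0)
z_poor = evidence_estimate(0.0, 1.0, 10000, rng)   # sampler misses the bump
z_good = evidence_estimate(MODE, 0.07, 10000, rng) # sampler matches the bump
```

With the mismatched sampler the estimate collapses to essentially zero, while the closely matched sampler recovers the true value accurately; the chaining method of the next section works by arranging that every ratio it estimates is computed under exactly this favourable, closely matched condition.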
\n\n2 Chaining \n\nSuppose we have a simple model M0 for which we can evaluate the evidence analytically, and for which we can easily generate a sample w^l (where l = 1, ..., L) from the corresponding distribution p(w|D, M0). Then the evidence for some other model M can be expressed in the form \n\np(D|M) / p(D|M0) = ∫ exp{-E(w) + E0(w)} p(w|D, M0) dw ≈ (1/L) Σ_{l=1}^{L} exp{-E(w^l) + E0(w^l)}.     (4) \n\nUnfortunately, the Monte Carlo approximation in (4) will be poor if the two error functions are significantly different, since the exponent is dominated by regions where E is relatively small, for which there will be few samples unless E0 is also small in those regions. A simple Monte Carlo approach will therefore yield poor results. This problem is equivalent to the evaluation of free energies in statistical physics, which is known to be a challenging problem, and for which a number of approaches have been developed (Neal, 1993). \n\nHere we discuss one such approach to this problem, based on a chain of K successive models Mi which interpolate between M0 and M, so that the required evidence can be written as \n\np(D|M) = p(D|M0) [p(D|M1)/p(D|M0)] [p(D|M2)/p(D|M1)] ... [p(D|M)/p(D|MK)].     (5) \n\nEach of the ratios in (5) can be evaluated using (4). The goal is to devise a chain of models such that each successive pair of models has probability distributions which are reasonably close, so that each of the ratios in (5) can be evaluated accurately, while keeping the total number of links in the chain fairly small to limit the computational cost. \n\nWe have chosen the technique of hybrid Monte Carlo (Duane et al., 1987; Neal, 1993) to sample from the various distributions, since this has been shown to be effective for sampling from the complex distributions arising with neural network models (Neal, 1996). 
This involves introducing Hamiltonian equations of motion in which the parameters w are augmented by a set of fictitious 'momentum' variables, which are then integrated using the leapfrog method. At the end of each trajectory the new parameter vector is accepted with a probability governed by the Metropolis criterion, and the momenta are replaced using Gibbs sampling. As a check on our software implementation of chaining, we have evaluated the evidence for a mixture of two non-isotropic Gaussian distributions, and obtained a result which was within 10% of the analytical solution. \n\n3 Application to Neural Networks \n\nWe now consider the application of the chaining method to regression problems involving neural network models. The network corresponds to a function y(x, w), and the data set consists of N pairs of input vectors x_n and corresponding targets t_n, where n = 1, ..., N. Assuming Gaussian noise on the target data, the likelihood function takes the form \n\np(D|w, M) = (β/2π)^{N/2} exp{ -(β/2) Σ_{n=1}^{N} ||y(x_n; w) - t_n||² }     (6) \n\nwhere β is a hyper-parameter representing the inverse of the noise variance. We consider networks with a single hidden layer of 'tanh' units, and linear output units. Following Neal (1996) we use a diagonal Gaussian prior in which the weights are divided into groups w_k, where k = 1, ..., 4, corresponding to input-to-hidden weights, hidden-unit biases, hidden-to-output weights, and output biases. Each group is governed by a separate 'precision' hyper-parameter α_k, so that the prior takes the form \n\np(w|{α_k}) = (1/Z_w) exp{ -(1/2) Σ_k α_k w_k^T w_k }     (7) \n\nwhere Z_w is the normalization coefficient. The hyper-parameters {α_k} and β are themselves each governed by hyper-priors given by Gamma distributions of the form \n\np(α) ∝ α^{s/2-1} exp(-sα/2ω)     (8) \n\nin which the mean ω and variance 2ω²/s are chosen to give very broad hyper-priors, in reflection of our limited prior knowledge of the values of the hyper-parameters. We use the hybrid Monte Carlo algorithm to sample from the joint distribution of parameters and hyper-parameters. For the evaluation of evidence ratios, however, we consider only the parameter samples, and perform the integrals over the hyper-parameters analytically, using the fact that the Gamma distribution is conjugate to the Gaussian. \n\nIn order to apply chaining to this problem, we choose the prior as our reference distribution, and then define a set of intermediate distributions based on a parameter λ which governs the effective contribution from the data term, so that \n\nE(λ, w) = λ E_D(w) + E0(w)     (9) \n\nwhere E_D(w) arises from the likelihood term (6), while E0(w) corresponds to the prior (7). We select a set of 18 values of λ which interpolate between the reference distribution (λ = 0) and the desired model distribution (λ = 1). The evidence for the prior alone is easily evaluated analytically. \n\n4 Gaussian Approximation \n\nAs a comparison against the method of chaining, we consider the framework of MacKay (1992) based on a local Gaussian approximation to the posterior distribution. This approach makes use of the evidence approximation, in which the integration over hyper-parameters is approximated by setting them to specific values which are themselves determined by maximizing their evidence functions. \n\nThis leads to a hierarchical treatment as follows. At the lowest level, the maximum w of the posterior distribution over weights is found for fixed values of the hyper-parameters by minimizing the error function. Periodically the hyper-parameters are re-estimated by evidence maximization, where the evidence is obtained analytically using the Gaussian approximation. 
This gives the following re-estimation formulae \n\nα_k = γ_k / (w_k^T w_k),     1/β = (1/(N - γ)) Σ_{n=1}^{N} ||y(x_n; w) - t_n||²     (10) \n\nwhere γ_k = W_k - α_k Tr_k(A^{-1}), W_k is the total number of parameters in group k, A = ∇∇E(w), γ = Σ_k γ_k, and Tr_k(·) denotes the trace over the kth group of parameters. The weights are updated in an inner loop by minimizing the error function using a conjugate gradient optimizer, while the hyper-parameters are periodically re-estimated using (10)¹. \n\n¹ Note that we are assuming that the hyper-priors (8) are sufficiently broad that they have no effect on the location of the evidence maximum and can therefore be neglected. \n\nOnce training is complete, the model evidence is evaluated by making a Gaussian approximation around the converged values of the hyper-parameters, and integrating over this distribution analytically. This gives the model log evidence as \n\nln p(D|M) = -E(w) - (1/2) ln|A| + (1/2) Σ_k W_k ln α_k + (N/2) ln β + ln h! + 2 ln h + (1/2) Σ_k ln(2/γ_k) + (1/2) ln(2/(N - γ)).     (11) \n\nHere h is the number of hidden units, and the terms ln h! + 2 ln h take account of the many equivalent modes of the posterior distribution arising from sign-flip and hidden-unit interchange symmetries in the network model. A derivation of these results can be found in Bishop (1995; pages 434-436). \n\nThe result (11) corresponds to a single mode of the distribution. If we initialize the weight optimization algorithm with different random values we can find distinct solutions. In order to compute an overall evidence for the particular network model with a given number of hidden units, we make the assumption that we have found all of the distinct modes of the posterior distribution precisely once each, and then sum the evidences to arrive at the total model evidence. 
This neglects the possibility that some of the solutions found are related by symmetry transformations (and are therefore already taken into account), or that we have missed important modes. While some attempt could be made to detect degenerate solutions, it would be difficult to do much better than the above within the framework of the Gaussian approximation. \n\n5 Results: Robot Arm Problem \n\nAs an illustration of the evaluation of model evidence for a larger-scale problem, we consider the modelling of the forward kinematics of a two-link robot arm in a two-dimensional space, as introduced by MacKay (1992). This problem was chosen because MacKay reports good results in using the Gaussian approximation framework to evaluate the evidences, and it therefore provides a good opportunity for comparison with the chaining approach. The task is to learn the mapping (x1, x2) → (y1, y2) given by \n\ny1 = 2.0 cos(x1) + 1.3 cos(x1 + x2),     y2 = 2.0 sin(x1) + 1.3 sin(x1 + x2) \n\nwhere the data set consists of 200 input-output pairs, with outputs corrupted by zero-mean Gaussian noise of standard deviation σ = 0.05. We have used the original training data of MacKay, but generated our own test set of 1000 points using the same prescription. The evidence is evaluated using both chaining and the Gaussian approximation, for networks with various numbers of hidden units. \n\nIn the chaining method, the particular forms of the Gamma priors for the precision variables are as follows: for the input-to-hidden weights and hidden-unit biases, ω = 1, s = 0.2; for the hidden-to-output weights, ω = h, s = 0.2; for the output biases, ω = 0.2, s = 1. The noise-level hyper-parameter had ω = 400, s = 0.2. These settings follow closely those used by Neal (1996) for the same problem. The hidden-to-output precision scaling was chosen by Neal such that the limit of an infinite number of hidden units is well defined and corresponds to a Gaussian process prior. 
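The data-generation process just described can be sketched as follows (our illustration, not the authors' code; the link lengths 2.0 and 1.3 and the input ranges follow the standard formulation of MacKay's robot arm problem and should be treated as assumptions here):

```python
import math
import random

# Sketch (ours, not the authors' code) of generating data for the two-link
# robot arm task. The forward kinematics below, with assumed link lengths
# 2.0 and 1.3, and the assumed input ranges, follow the usual statement of
# MacKay's robot arm problem.

def forward_kinematics(x1, x2):
    y1 = 2.0 * math.cos(x1) + 1.3 * math.cos(x1 + x2)
    y2 = 2.0 * math.sin(x1) + 1.3 * math.sin(x1 + x2)
    return y1, y2

def make_dataset(n, noise_sd, rng):
    # n input-output pairs; targets corrupted by zero-mean Gaussian
    # noise of standard deviation noise_sd (0.05 in the paper).
    data = []
    for _ in range(n):
        x1 = rng.uniform(-1.932, -0.453)
        x2 = rng.uniform(0.534, 3.142)
        y1, y2 = forward_kinematics(x1, x2)
        data.append(((x1, x2),
                     (y1 + rng.gauss(0.0, noise_sd),
                      y2 + rng.gauss(0.0, noise_sd))))
    return data

rng = random.Random(0)
train = make_dataset(200, 0.05, rng)   # 200 pairs, as in the paper
```

Because the noiseless outputs lie at a radius between 0.7 and 3.3 from the origin, a quick range check on the generated targets is a useful sanity test of any such implementation.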
For each evidence ratio in the chain, the first 100 samples from the hybrid Monte Carlo run, obtained with a trajectory length of 50 leapfrog iterations, are omitted to give the algorithm a chance to reach the equilibrium distribution. The next 600 samples are obtained using a trajectory length of 300 and are used to evaluate the evidence ratio. \n\nIn Figure 1 (a) we show the error values of the sampling stage for 24 hidden units, where we see that the errors are largely uncorrelated, as required for effective Monte Carlo sampling. In Figure 1 (b), we plot the values of ln{p(D|M_i)/p(D|M_{i-1})} against λ_i for i = 1, ..., 18. Note that there is a large change in the evidence ratios at the beginning of the chain, where we sample close to the reference distribution. For this reason, we choose the λ_i to be dense close to λ = 0. We are currently researching more principled approaches to the selection of this partitioning. \n\nFigure 1: (a) Error E(λ = 0.6, w) for h = 24, plotted for 600 successive Monte Carlo samples. (b) Values of the ratio ln{p(D|M_i)/p(D|M_{i-1})} for i = 1, ..., 18, for h = 24. \n\nFigure 2 (a) shows the log model evidence against the number of hidden units. Note that the chaining approach is computationally expensive: for h = 24, a complete chain takes 48 hours in a Matlab implementation running on a Silicon Graphics Challenge L. \n\nWe see that there is no decline in the evidence as the number of hidden units grows. Correspondingly, in Figure 2 (b), we see that the test error performance does not degrade as the number of hidden units increases. This indicates that there is no over-fitting with increasing model complexity, in accordance with Bayesian expectations. \n\nThe corresponding results from the Gaussian approximation approach are shown in Figure 3. 
We see that there is a characteristic 'Occam hill', whereby the evidence shows a peak at around h = 12, with a strong decrease for smaller values of h and a slower decrease for larger values. The corresponding test set errors similarly show a minimum at around h = 12, indicating that the Gaussian approximation is becoming increasingly inaccurate for more complex models. \n\nFigure 2: (a) Plot of ln p(D|M) for different numbers of hidden units. (b) Test error against the number of hidden units. Here the theoretical minimum value is 1.0. For h = 64 the test error is 1.11. \n\nFigure 3: (a) Plot of the model evidence for the robot arm problem versus the number of hidden units, using the Gaussian approximation framework. This clearly shows the characteristic 'Occam hill' shape. Note that the evidence is computed up to an additive constant, and so the origin of the vertical axis has no significance. (b) Corresponding plot of the test set error versus the number of hidden units. Individual points correspond to particular modes of the posterior weight distribution, while the line shows the mean test set error for each value of h. \n\n6 Discussion \n\nWe have seen that the use of chaining allows the effective evaluation of model evidences for neural networks using Monte Carlo techniques. In particular, we find that there is no peak in the model evidence, or the corresponding test set error, as the number of hidden units is increased, and so there is no indication of over-fitting. This is in accord with the expectation that model complexity should not be limited by the size of the data set, and is in marked contrast to the conventional maximum likelihood viewpoint. It is also consistent with the result that, in the limit of an infinite number of hidden units, the prior over network weights leads to a well-defined Gaussian prior over functions (Williams, 1997). \n\nAn important advantage of being able to make accurate evaluations of the model evidence is the ability to compare quite distinct kinds of model, for example radial basis function networks and multi-layer perceptrons. This can be done either by chaining both models back to a common reference model, or by evaluating normalized model evidences explicitly. \n\nAcknowledgements \n\nWe would like to thank Chris Williams and Alastair Bruce for a number of useful discussions. This work was supported by EPSRC grant GR/J75425: Novel Developments in Learning Theory for Neural Networks. \n\nReferences \n\nBishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press. \n\nDuane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987). Hybrid Monte Carlo. Physics Letters B 195 (2), 216-222. \n\nGilks, W. R., S. Richardson, and D. J. Spiegelhalter (1995). Markov Chain Monte Carlo in Practice. Chapman and Hall. \n\nKass, R. E. and A. E. Raftery (1995). Bayes factors. J. Am. Statist. Ass. 90, 773-795. \n\nMacKay, D. J. C. (1992). A practical Bayesian framework for back-propagation networks. Neural Computation 4 (3), 448-472. \n\nNeal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, Canada. 
\n\nNeal, R. M. (1996). Bayesian Learning for Neural Networks. Springer. Lecture Notes in Statistics 118. \n\nWilliams, C. K. I. (1997). Computing with infinite networks. This volume. \n", "award": [], "sourceid": 1272, "authors": [{"given_name": "David", "family_name": "Barber", "institution": null}, {"given_name": "Christopher", "family_name": "Bishop", "institution": null}]}