{"title": "Bayesian Backprop in Action: Pruning, Committees, Error Bars and an Application to Spectroscopy", "book": "Advances in Neural Information Processing Systems", "page_first": 208, "page_last": 215, "abstract": null, "full_text": "Bayesian Backprop in Action: \n\nPruning, Committees, Error Bars \nand an Application to Spectroscopy \n\nHans Henrik Thodberg \n\nDanish Meat Research Institute \n\nMaglegaardsvej 2, DK-4000 Roskilde \n\nthodberg~nn.meatre.dk \n\nAbstract \n\nMacKay's Bayesian framework for backpropagation is conceptually \nappealing as well as practical. It automatically adjusts the weight \ndecay parameters during training, and computes the evidence for \neach trained network. The evidence is proportional to our belief \nin the model. The networks with highest evidence turn out to \ngeneralise well. In this paper, the framework is extended to pruned \nnets, leading to an Ockham Factor for \"tuning the architecture \nto the data\". A committee of networks, selected by their high \nevidence, is a natural Bayesian construction. The evidence of a \ncommittee is computed. The framework is illustrated on real-world \ndata from a near infrared spectrometer used to determine the fat \ncontent in minced meat. Error bars are computed, including the \ncontribution from the dissent of the committee members. \n\n1 THE OCKHAM FACTOR \n\nWilliam of Ockham's (1285-1349) principle of economy in explanations, can be \nformulated as follows: \n\nIf several theories account for a phenomenon we should prefer the \nsimplest which describes the data sufficiently well. \n\n208 \n\n\fBayesian Backprop in Action \n\n209 \n\nThe principle states that a model has two virtues: simplicity and goodness of fit. \nBut what is the meaning of \"sufficiently well\" - i.e. what is the optimal trade-off \nbetween the two virtues? With Bayesian model comparison we can deduce this \ntrade-off. 
\n\nWe express our belief in a model as its probability given the data, and use Bayes' \nformula: \n\nP(H I D) = P(D IH)P(H) \n\nP(D) \n\nWe assume that the prior belief P(H) is the same for all models, so we can compare \nmodels by comparing P(D IH) which is called the evidence for H, and acts as a \nquality measure in model comparison. \n\nAssume that the model has a single tunable parameter w with a prior range ~ Wprior \nso that pew IH) = 1/ ~Wprior. The most probable (or maximum posterior) value \nWMP of the parameter w is given by the maximum of \n\nP( ID H)= P(Dlw,H)P(wl1i) \n\nw , \n\nP(DIH) \n\n(2) \n\nThe width of this distribution is denoted ~Wpo8terior. The evidence P(D 11i) is \nobtained by integrating over the posterior w distribution and approximating the \nintegral: \n\nP(DIH) \n\nJ P(Dlw,H)P(wIH)dw \nP(D I WMP, H) ~wpo8terior \n~Wprior \n\n(1) \n\n(3) \n\n(4) \n\n(5) \n\nEvidence \n\nLikelihood x OckhamFactor \n\nThe evidence for the model is the product of two factors: \n\n\u2022 The best fit likelihood, i.e. the probability of the data given the model and \nthe tuned parameters. It measures how well the tuned model fits the data . \n\n\u2022 The integrated probability of the tuned model parameters with their un(cid:173)\n\ncertainties, i.e. the collapse of the available parameter space when the data \nis taken into account. This factor is small when the model has many pa(cid:173)\nrameters or when some parameters must be tuned very accurately to fit \nthe data. It is called the Ockham Factor since it is large when the model is \nsimple. \n\nBy optimizing the modelling through the evidence framework we can avoid the \noverfitting problem as well as the equally important \"underfitting\" problem. \n\n2 THE FOUR LEVELS OF INFERENCE \n\nIn 1991-92 MacKay presented a comprehensive and detailed framework for combi(cid:173)\nning backpropagation neural networks with Bayesian statistics (MacKay, 1992). 
He outlined four levels of inference which apply, for instance, to a regression problem where we have a training set and want to make predictions for new data:

Level 1 Make predictions including error bars for new input data.
Level 2 Estimate the weight parameters and their uncertainties.
Level 3 Estimate the scale parameters (the weight decay parameters and the noise scale parameter) and their uncertainties.
Level 4 Select the network architecture and for that architecture select one of the w-minima. Optionally select a committee to reflect the uncertainty on this level.

Level 1 is the typical goal in an application. But to make predictions we have to do some modelling, so at level 2 we pick a net and some weight decay parameters and train the net for a while. But the weight decay parameters were picked rather arbitrarily, so on level 3 we set them to their inferred maximum posterior (MP) value. We alternate between levels 2 and 3 until the network has converged. This is still not the end, because the network architecture was also picked rather arbitrarily. Hence levels 2 and 3 are repeated for other architectures, and the evidences of these are computed on level 4. (Pruning makes level 4 more complicated; see section 6.)

When we make inference on each of these levels, there are uncertainties which are described by the posterior distributions of the parameters which are inferred. The uncertainty on level 2 is described by the Hessian (the second derivative of the net cost function with respect to the weights). The uncertainty on level 3 is negligible if the number of weight decay parameters is small compared to the number of weights. The uncertainty on level 4 is described by the committee of networks with highest evidence within some margin (discussed below).

The uncertainties are used for two purposes. Firstly they give rise to error bars on the predictions on level 1.
And secondly the posterior uncertainty divided by the prior uncertainty (the Ockham Factor) enters the evidence.

MacKay's approach differs in two respects from other Bayesian approaches to neural nets:

• It assumes the Gaussian approximation to the posterior weight distribution. In contrast, the Monte Carlo approach of (Neal, 1992) does not suffer from this limitation.

• It determines maximum posterior values of the weight decay parameters, rather than integrating them out as done in (Buntine and Weigend, 1991).

It is difficult to justify these choices in general. The Gaussian approximation is believed to be good when there are at least 3 training examples per weight (MacKay, 1992). The use of MP weight decay parameters is the superior method when there are ill-defined parameters, as there usually are in neural networks, where some weights are typically poorly defined by the data (MacKay, 1993).

3 BAYESIAN NEURAL NETWORKS

The training set D consists of N cases of the form (x, t). We model t as a function of x, t = y(x) + ν, where ν is Gaussian noise and y(x) is computed by a neural network H with weights w. The noise scale is a free parameter β = 1/σ_ν². The probability of the data (the likelihood) is

P(D|w,β,H) ∝ exp(−β E_D)   (6)
E_D = (1/2) Σ (y − t)²   (7)

where the sum extends over the N cases.

In Bayesian modelling we must specify the prior distribution of the model parameters. The model contains k adjustable parameters w, called weights, which are in general split into several groups, for instance one per layer of the net. Here we consider the case with all weights in one group. The general case is described in (MacKay, 1992) and in more detail in (Thodberg, 1993). The prior of the weights w is

P(w|β,ξ,H) ∝ exp(−βξ E_W)   (8)
E_W = (1/2) Σ w²   (9)

β and ξ are called the scales of the model and are free parameters determined by the data.

The most probable value of the weights given the data, some values of the scales (to be determined later) and the model, is given by the maximum of

P(w|D,β,ξ,H) = P(D|w,β,ξ,H) P(w|β,ξ,H) / P(D|β,ξ,H)   (10)
             ∝ exp(−βC),  C = E_D + ξ E_W   (11)

So the maximum posterior weights according to the probabilistic interpretation are identical to the weights obtained by minimising the familiar cost function C with weight decay parameter ξ. This is the well-known Bayesian account of weight decay.

4 MACKAY'S FORMULAE

The single most useful result of MacKay's analysis is a simple formula for the MP value of the weight decay parameter

ξ_MP = (γ / (N − γ)) (E_D / E_W)   (12)

where γ is the number of well-determined parameters, which can be approximated by the actual number of parameters k, or computed more accurately from the eigenvalues λ_i of the Hessian ∇∇E_D:

γ = Σ_{i=1..k} λ_i / (λ_i + ξ_MP)   (13)

The MP value of the noise scale is β_MP = N/(2C).

The evidence for a neural network H is, as in section 1, obtained by integration over the posterior distribution of the inferred parameters, which gives rise to the Ockham Factors:

log Ev(H) = log P(D|H)
          = −(N − γ)/2 − (N/2) log(4πC/N)
            + log Ock(w) + log Ock(β) + log Ock(ξ)   (14)

log Ock(w) = (1/2) Σ_{i=1..k} log(ξ_MP / (ξ_MP + λ_i)) − γ/2 + log h! + h log 2   (15)

Ock(β) = √(4π/(N − γ)) / log Ω,  Ock(ξ) = √(4π/γ) / log Ω   (16)

The first line in (14) is the log likelihood. The Ockham Factor for the weights Ock(w) is small when the eigenvalues λ_i of the Hessian are large, corresponding to well-determined weights. Ω is the prior range of the scales and is set (subjectively) to 10^3.
\nThe expression (15) (valid for a network with a single hidden layer) contains a \nsymmetry factor h!2h . This is because the posterior volume must include all w \nconfigurations which are equivalent to the particular one. The hidden units can be \npermuted, giving a factor h! more posterior volume. And the sign of the weights to \nand from every hidden unit can be changed giving 2h times more posterior volume. \n\n5 COMMITTEES \n\nFor a given data set we usually train several networks with different numbers of \nhidden units and different initial weights. Several of these networks have evidence \nnear or at the maximal value, but the networks differ in their predictions. The \ndifferent solutions are interpreted as components of the posterior distribution and \nthe correct Bayesian answer is obtained by averaging the predictions over the solu(cid:173)\ntions, weighted by their posterior probabilities, i.e. their evidences. However, the \nevidence is not accurately determined, primarily due to the Gaussian approxima(cid:173)\ntion. This means that instead of weighting with Ev('Jf.) we should use the weight \nexp{log Ev / ~(log Ev\u00bb, where ~(log Ev) is the total uncertainty in the evaluation \nof log Ev. As an approximation to this, we define the committee as the models \nwith evidence larger than log Evrnax - ~ log Ev, where Evrnax is the largest evidence \nobtained, and all members enter with the same weight. \n\nTo compute the evidence Ev(C) of the committee, we assume for simplicity that all \nnetworks in the committee C share the same architecture. Let Nc be the number \nof truly different solutions in the committee. Of course, we count symmetric reali(cid:173)\nsations only once. The posterior volume i.e. the Ockham Factor for the weights is \nnow Nc times larger. This renders the committee more probable - it has a larger \nevidence: \n\nlog Ev(C) = log Nc + log Ev('Jf.) \n\n(17) \nwhere log Ev('Jf.) denotes the average log evidence of the members. 
Since the evidence is correlated with the generalisation error, we expect the committee to generalise better than the committee members.

6 PRUNING

We now extend the Bayesian framework to networks which are pruned to adjust the architecture to the particular problem. This extends the fourth level of inference. At first sight, the factor h! in the Ockham Factor for the weights in a sparsely connected network appears to be lost, since the network is (in general) not symmetric with respect to permutations of the hidden units. However, the symmetry reappears because for every sparsely connected network with tuned weights there are h! other equivalent network architectures obtained by permuting the hidden units. So the factor h! remains. If this argument is not found compelling, it can be viewed as an assumption.

If the data are used to select the architecture, which is the case in pruning designed to minimise the cost function, an additional Ockham Factor must be included. With one output unit, only the input-to-hidden layer is sparsely connected, so consider only these connections. Attach a binary pruning parameter to each of the m potential connections. A sparsely connected architecture is described by the values of the pruning parameters. The prior probability of a connection being present is described by a hyperparameter φ which is determined from the data, i.e. it is set to the fraction of connections remaining after pruning (notice the analogy between φ and a weight decay parameter). A non-pruned connection gives an Ockham Factor φ and a pruned one 1 − φ, assuming the data to be certain about the architecture.
The Ockham Factor for the pruning parameters is therefore

log Ock(pruning) = m(φ_MP log φ_MP + (1 − φ_MP) log(1 − φ_MP))   (18)

The tuning of the meta-parameter φ to the data gives an Ockham Factor Ock(φ) ≈ √(2/m), which is rather negligible.

From a minimum description length perspective, (18) reflects the extra information needed to describe the topology of a pruned net relative to a fully connected net. It acts like a barrier towards pruning. Pruning is favoured only if the negative contribution log Ock(pruning) is compensated by an increase in, for instance, log Ock(w).

7 APPLICATION TO SPECTROSCOPY

Bayesian Backprop is used in a real-life application from the meat industry. The data were recorded by a Tecator near-infrared spectrometer which measures the spectrum of light transmitted through samples of minced pork meat. The absorbance spectrum has 100 channels in the region 850-1050 nm. We want to calibrate the spectrometer to determine the fat content. The first 10 principal components of the spectra are used as input to a neural network.

Three weight decay parameters are used: one for the weights and biases of the hidden layer, one for the connections from the hidden to the output layer, and one for the direct connections from the inputs to the output as well as the output bias.

The relation between test error and log evidence is shown in figure 1. The test error is given as standard error of prediction (SEP), i.e. the root mean square error. The 12 networks with 3 hidden units and log evidence larger than −270 are selected for a committee.

Figure 1: The test error as a function of the log evidence for networks trained on the spectroscopic data (markers distinguish networks with 1, 2, 3, 4, 6 and 8 hidden units). High evidence implies low test error.

The committee average gives 6% lower SEP than the members do on average, and 21% lower SEP than a non-Bayesian analysis using early stopping (see Thodberg, 1993).

Pruning is applied to the networks with 6 hidden units. The evidence decreases slightly, i.e. Ock(pruning) dominates. Also the SEP is slightly worse. So the evidence correctly suggests that pruning is not useful for this problem. (For artificial data generated by a sparsely connected network the evidence correctly points to pruned nets as better models; see Thodberg, 1993.)

The Bayesian error bars are illustrated for the spectroscopic data in figure 2. We study the model predictions on the line through input space defined by the second principal component axis, i.e. the second input is varied while all other inputs are zero. The total prediction variance for a new datum x is

σ_total²(x) = σ_ν² + σ_wu²(x) + σ_cu²(x)   (19)

where σ_wu comes from the weight uncertainties (level 2) and σ_cu from the committee dissent (level 4).

Figure 2: Prediction of the fat content as a function of the second principal component P2 of the NIR spectrum. 95% of the training data have |P2| < 2. The total error bars are indicated by a "1 sigma" band with the dotted lines. The total standard error σ_total(x) and the standard errors of its contributions (σ_ν, σ_wu(x) and σ_cu(x)) are shown separately, multiplied by a factor of 10.

References

W.L. Buntine and A.S. Weigend, "Bayesian Back-Propagation", Complex Systems 5 (1991) 603-643.

R.M. Neal, "Bayesian Learning via Stochastic Dynamics", Neural Information Processing Systems, Vol. 5, ed. C.L. Giles, S.J. Hanson and J.D. Cowan (Morgan Kaufmann, San Mateo, 1993).

D.J.C. MacKay, "A Practical Bayesian Framework for Backpropagation Networks", Neural Comp. 4 (1992) 448-472.

D.J.C. MacKay, paper on Bayesian hyperparameters, in preparation 1993.
\n\nH.H.Thodberg, \"A Review of Bayesian Backprop with an Application to \nNear Infrared Spectroscopy\" and \"A Bayesian Approach to Pruning of Neu(cid:173)\nral Networks\", submitted to IEEE Transactions of Neural Networks 1993 (in \n/pub/neuroprose/thodberg.ace-of-bayes*.ps.Z on archive.cis.ohio-state.edu). \n\n\f", "award": [], "sourceid": 720, "authors": [{"given_name": "Hans", "family_name": "Thodberg", "institution": null}]}