{"title": "Ensemble Learning for Multi-Layer Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 395, "page_last": 401, "abstract": "", "full_text": "Ensemble Learning for Multi-Layer Networks\n\nDavid Barber\u00b7\n\nChristopher M. Bishop\u2020\n\nNeural Computing Research Group\n\nDepartment of Applied Mathematics and Computer Science\n\nAston University, Birmingham B4 7ET, U.K.\n\nhttp://www.ncrg.aston.ac.uk/\n\nAbstract\n\nBayesian treatments of learning in neural networks are typically based either on local Gaussian approximations to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. However, the derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and so was unable to capture the posterior correlations between parameters. In this paper, we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. Initial results from a standard benchmark problem are encouraging.\n\n1 Introduction\n\nBayesian techniques have been successfully applied to neural networks in the context of both regression and classification problems (MacKay 1992; Neal 1996). 
In contrast to the maximum likelihood approach, which finds only a single estimate for the regression parameters, the Bayesian approach yields a distribution of weight parameters, p(w|D), conditional on the training data D, and predictions are expressed in terms of expectations with respect to the posterior distribution (Bishop 1995). However, the corresponding integrals over weight space are analytically intractable. One well-established procedure for approximating these integrals, known as Laplace's method, is to approximate the posterior distribution by a Gaussian, centred at a mode of p(w|D), in which the covariance of the Gaussian is determined by the local curvature of the posterior distribution (MacKay 1995). The required integrations can then be performed analytically. More recent approaches involve Markov chain Monte Carlo simulations to generate samples from the posterior (Neal 1996). However, such techniques can be computationally expensive, and they also suffer from the lack of a suitable convergence criterion.\n\n\u00b7Present address: SNN, University of Nijmegen, Geert Grooteplein 21, Nijmegen, The Netherlands. http://www.mbfys.kun.nl/snn/ email: davidb@mbfys.kun.nl\n\n\u2020Present address: Microsoft Research Limited, St George House, Cambridge CB2 3NH, UK. http://www.research.microsoft.com email: cmbishop@microsoft.com\n\nA third approach, called ensemble learning, was introduced by Hinton and van Camp (1993) and again involves finding a simple, analytically tractable, approximation to the true posterior distribution. Unlike Laplace's method, however, the approximating distribution is fitted globally, rather than locally, by minimizing a Kullback-Leibler divergence. 
Hinton and van Camp (1993) showed that, in the case of a Gaussian approximating distribution with a diagonal covariance, a deterministic learning algorithm could be derived. Although the approximating distribution is no longer constrained to coincide with a mode of the posterior, the assumption of a diagonal covariance prevents the model from capturing the (often very strong) posterior correlations between the parameters. MacKay (1995) suggested a modification to the algorithm by including linear preprocessing of the inputs to achieve a somewhat richer class of approximating distributions, although this was not implemented. In this paper we show that the ensemble learning approach can be extended to allow a Gaussian approximating distribution with a general covariance matrix, while still leading to a tractable algorithm.\n\n1.1 The Network Model\n\nWe consider a two-layer feed-forward network having a single output whose value is given by\n\n$$f(x, w) = \\sum_{i=1}^{H} v_i \\sigma(u_i^T x) \\tag{1}$$\n\nwhere w is a k-dimensional vector representing all of the adaptive parameters in the model, x is the input vector, $\\{u_i\\}$, $i = 1, \\ldots, H$ are the input-to-hidden weights, and $\\{v_i\\}$, $i = 1, \\ldots, H$ are the hidden-to-output weights. The extension to multiple outputs is straightforward. For reasons of analytic tractability, we choose the sigmoidal hidden-unit activation function $\\sigma(a)$ to be given by the error function\n\n$$\\sigma(a) = \\sqrt{\\frac{2}{\\pi}} \\int_0^a \\exp(-s^2/2) \\, ds \\tag{2}$$\n\nwhich (when appropriately scaled) is quantitatively very similar to the standard logistic sigmoid. Hidden unit biases are accounted for by appending to the input vector a node that is always unity. In the current implementation there are no output biases (and the output data is shifted to give zero mean), although the formalism is easily extended to include adaptive output biases (Barber and Bishop 1997). 
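The network model of equations (1) and (2) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names and the toy weights are hypothetical, and it relies on the fact that the integral in (2) equals erf(a / sqrt(2)).

```python
import math

def sigma(a):
    # Eq. (2): sqrt(2/pi) * integral_0^a exp(-s^2/2) ds, which equals erf(a / sqrt(2)).
    return math.erf(a / math.sqrt(2.0))

def network_output(x, U, v):
    # Eq. (1): f(x, w) = sum_i v_i * sigma(u_i . x).
    # U is a list of H input-to-hidden weight vectors, v a list of H output weights.
    total = 0.0
    for u_i, v_i in zip(U, v):
        activation = sum(u_ij * x_j for u_ij, x_j in zip(u_i, x))
        total += v_i * sigma(activation)
    return total

# Hypothetical toy values: one real input plus a constant 1 that carries the hidden bias.
x = [0.5, 1.0]
U = [[1.0, -0.5], [-2.0, 0.3]]
v = [0.7, -1.2]
y = network_output(x, U, v)
```

The constant second component of x plays the role of the always-unity node, so the second entry of each hidden weight vector acts as that unit's bias.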
The data set consists of N pairs of input vectors and corresponding target output values $D = \\{x^\\mu, t^\\mu\\}$, $\\mu = 1, \\ldots, N$. We make the standard assumption of Gaussian noise on the target values, with variance $\\beta^{-1}$. The likelihood of the training data is then proportional to $\\exp(-\\beta E_D)$, where the training error $E_D$ is\n\n$$E_D(w) = \\frac{1}{2} \\sum_{\\mu} \\left( f(x^\\mu, w) - t^\\mu \\right)^2. \\tag{3}$$\n\nThe prior distribution over weights is chosen to be a Gaussian of the form\n\n$$p(w) \\propto \\exp(-E_W(w)) \\tag{4}$$\n\nwhere $E_W(w) = \\frac{1}{2} w^T A w$, and A is a matrix of hyperparameters. The treatment of $\\beta$ and A is described in Section 2.1. From Bayes' theorem, the posterior distribution over weights can then be written\n\n$$p(w|D) = \\frac{1}{Z} \\exp(-\\beta E_D(w) - E_W(w)) \\tag{5}$$\n\nwhere Z is a normalizing constant. Network predictions on a novel example are given by the posterior average of the network output\n\n$$\\langle f(x) \\rangle = \\int f(x, w) \\, p(w|D) \\, dw. \\tag{6}$$\n\nThis represents an integration over a high-dimensional space, weighted by a posterior distribution p(w|D) which is exponentially small except in narrow regions whose locations are unknown a priori. The accurate evaluation of such integrals is thus very difficult.\n\n2 Ensemble Learning\n\nIntegrals of the form (6) may be tackled by approximating p(w|D) by a simpler distribution Q(w). In this paper we choose this approximating distribution to be a Gaussian with mean $\\bar{w}$ and covariance C. We determine the values of $\\bar{w}$ and C by minimizing the Kullback-Leibler divergence between the network posterior and approximating Gaussian, given by\n\n$$F[Q] = \\int Q(w) \\ln \\left\\{ \\frac{Q(w)}{p(w|D)} \\right\\} dw \\tag{7}$$\n\n$$= \\int Q(w) \\ln Q(w) \\, dw - \\int Q(w) \\ln p(w|D) \\, dw. \\tag{8}$$\n\nThe first term in (8) is the negative entropy of a Gaussian distribution, and is easily evaluated to give $-\\frac{1}{2} \\ln \\det(C) + \\text{const}$. 
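As a rough numerical check of the entropy term, the sketch below compares the closed form for the negative entropy of a Gaussian, -0.5 ln det(C) - (k/2)(1 + ln 2 pi), with a Monte Carlo estimate of the average of ln Q(w) under Q. It is restricted to k = 2 for brevity; the function names, sample size and seed are assumptions made for illustration and are not taken from the paper.

```python
import math
import random

def det2(C):
    # Determinant of a 2x2 covariance matrix given as nested lists.
    return C[0][0] * C[1][1] - C[0][1] * C[1][0]

def neg_entropy_gaussian(C):
    # Closed form for k = 2: -0.5 * ln det(C) - (k/2) * (1 + ln(2*pi)).
    return -0.5 * math.log(det2(C)) - (1.0 + math.log(2.0 * math.pi))

def neg_entropy_monte_carlo(C, n_samples=100000, seed=0):
    # Monte Carlo estimate of the average of ln Q(w) under Q.
    # With w - mean = L z and C = L L^T, the quadratic form in ln Q equals z^T z,
    # so only the standard normal draws z are needed; the mean drops out entirely.
    rng = random.Random(seed)
    log_norm = -math.log(2.0 * math.pi) - 0.5 * math.log(det2(C))
    total = 0.0
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        total += log_norm - 0.5 * (z1 * z1 + z2 * z2)
    return total / n_samples

C = [[2.0, 0.6], [0.6, 1.0]]
exact = neg_entropy_gaussian(C)
estimate = neg_entropy_monte_carlo(C)
```

The two values should agree to within Monte Carlo error, and only the -0.5 ln det(C) part depends on the parameters being optimized.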
\nFrom (5) we see that the posterior dependent term in (8) contains two parts that depend on the prior and likelihood\n\n$$\\int Q(w) E_W(w) \\, dw + \\beta \\int Q(w) E_D(w) \\, dw. \\tag{9}$$\n\nNote that the normalization coefficient $Z^{-1}$ in (5) gives rise to a constant additive term in the KL divergence and so can be neglected. The prior term $E_W(w)$ is quadratic in w, and integrates to give $\\frac{1}{2} \\mathrm{Tr}(CA) + \\frac{1}{2} \\bar{w}^T A \\bar{w}$. This leaves the data dependent term in (9) which we write as\n\n$$L = \\beta \\int Q(w) E_D(w) \\, dw = \\frac{\\beta}{2} \\sum_{\\mu=1}^{N} l(x^\\mu, t^\\mu) \\tag{10}$$\n\nwhere\n\n$$l(x, t) = \\int Q(w) (f(x, w))^2 \\, dw - 2t \\int Q(w) f(x, w) \\, dw + t^2. \\tag{11}$$\n\nFor clarity, we concentrate only on the first term in (11), as the calculation of the term linear in f(x, w) is similar, though simpler. Writing the Gaussian integral over Q as an average, $\\langle \\cdot \\rangle$, the first term of (11) becomes\n\n$$\\langle (f(x, w))^2 \\rangle = \\sum_{i,j=1}^{H} \\langle v_i v_j \\sigma(u_i^T x) \\sigma(u_j^T x) \\rangle. \\tag{12}$$\n\nTo simplify the notation, we denote the set of input-to-hidden weights $(u_1, \\ldots, u_H)$ by u and the set of hidden-to-output weights $(v_1, \\ldots, v_H)$ by v. Similarly, we partition the covariance matrix C into blocks, $C_{uu}$, $C_{uv}$, $C_{vv}$, and $C_{vu} = C_{uv}^T$. As the components of v do not enter the non-linear sigmoid functions, we can directly integrate over v, so that each term in the summation (12) gives\n\n$$\\left\\langle \\left( \\Omega_{ij} + (u - \\bar{u})^T \\Psi_{ij} (u - \\bar{u}) + \\boldsymbol{\\Omega}_{ij}^T (u - \\bar{u}) \\right) \\sigma(u^T x_i) \\sigma(u^T x_j) \\right\\rangle \\tag{13}$$\n\nwhere\n\n$$\\Omega_{ij} = \\left( C_{vv} - C_{vu} C_{uu}^{-1} C_{uv} \\right)_{ij} + \\bar{v}_i \\bar{v}_j, \\tag{14}$$\n\n$$\\Psi_{ij} = C_{uu}^{-1} C_{u,v=i} C_{v=j,u} C_{uu}^{-1}, \\tag{15}$$\n\n$$\\boldsymbol{\\Omega}_{ij} = 2 C_{uu}^{-1} C_{u,v=j} \\bar{v}_i. \\tag{16}$$\n\nAlthough the remaining integration in (13) over u is not analytically tractable, we can make use of the following result to reduce it to a one-dimensional integration\n\n$$\\left\\langle \\sigma(z \\cdot a + a_0) \\, \\sigma(z \\cdot b + b_0) \\right\\rangle_z = \\left\\langle \\sigma(z |a| + a_0) \\, \\sigma\\left( \\frac{z a^T b + b_0 |a|}{\\sqrt{|a|^2 (1 + |b|^2) - (a^T b)^2}} \\right) \\right\\rangle_z \\tag{17}$$\n\nwhere a and b are vectors and $a_0$, $b_0$ are scalar offsets. 
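Identity (17) can be checked numerically. The sketch below estimates the left-hand side by Monte Carlo over a vector z with independent standard normal components, and evaluates the right-hand side as a one-dimensional integral over a scalar standard normal z. It uses a plain trapezoidal rule rather than the Gaussian quadrature used in the paper, and the function names, grid size, sample count and test vectors are all assumptions chosen for illustration.

```python
import math
import random

def sigma(a):
    # Activation of Eq. (2), equal to erf(a / sqrt(2)).
    return math.erf(a / math.sqrt(2.0))

def dot(p, q):
    return sum(p_k * q_k for p_k, q_k in zip(p, q))

def lhs_monte_carlo(a, b, a0, b0, n_samples=200000, seed=0):
    # Left side of (17): average over an isotropic Gaussian z in len(a) dimensions.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in a]
        total += sigma(dot(z, a) + a0) * sigma(dot(z, b) + b0)
    return total / n_samples

def rhs_one_dimensional(a, b, a0, b0, n_grid=4001, z_max=8.0):
    # Right side of (17): average over a scalar standard normal z,
    # approximated with the trapezoidal rule on [-z_max, z_max].
    norm_a = math.sqrt(dot(a, a))
    ab = dot(a, b)
    denom = math.sqrt(dot(a, a) * (1.0 + dot(b, b)) - ab * ab)
    h = 2.0 * z_max / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        z = -z_max + k * h
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        second = sigma((z * ab + b0 * norm_a) / denom)
        total += weight * density * sigma(z * norm_a + a0) * second
    return total * h

a = [0.8, -0.3, 0.5]
b = [0.2, 0.9, -0.4]
lhs = lhs_monte_carlo(a, b, 0.3, -0.2)
rhs = rhs_one_dimensional(a, b, 0.3, -0.2)
```

The two sides should agree to within the Monte Carlo error of the left-hand estimate, while the right-hand side needs only a single one-dimensional integration regardless of the dimension of z.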
The average on the left of (17) is over an isotropic multi-dimensional Gaussian, $p(z) \\propto \\exp(-z^T z/2)$, while the average on the right is over the one-dimensional Gaussian $p(z) \\propto \\exp(-z^2/2)$. This result follows from the fact that the vector z only occurs through the scalar product with a and b, and so we can choose a coordinate system in which the first two components of z lie in the plane spanned by a and b. All orthogonal components do not appear elsewhere in the integrand, and therefore integrate to unity.\n\nThe integral we desire, (13), is only a little more complicated than (17) and can be evaluated by first transforming the coordinate system to an isotropic basis z, and then differentiating with respect to elements of the covariance matrix to 'pull down' the required linear and quadratic terms in the $\\sigma$-independent pre-factor of (13). These derivatives can then be reduced to a form which requires only the numerical evaluation of (17). We have therefore succeeded in reducing the calculation of the KL divergence to analytic terms together with a single one-dimensional numerical integration of the form (17), which we compute using Gaussian quadrature\u00b9.\n\n\u00b9 Although (17) appears to depend on 4 parameters, it can be expressed in terms of 3 independent parameters. An alternative to performing quadrature during training would therefore be to compute a 3-dimensional look-up table in advance.\n\nSimilar techniques can be used to evaluate the derivatives of the KL divergence with respect to the mean and covariance matrix (Barber and Bishop 1997). Together with the KL divergence, these derivatives are then used in a scaled conjugate gradient optimizer to find the parameters $\\bar{w}$ and C that represent the best Gaussian fit.\n\nThe number of parameters in the covariance matrix scales quadratically with the number of weight parameters. We therefore have also implemented a version with 
a constrained covariance matrix\n\n$$C = \\mathrm{diag}(d_1^2, \\ldots, d_k^2) + \\sum_{i=1}^{s} s_i s_i^T \\tag{18}$$\n\nwhich is the form of covariance used in factor analysis (Bishop 1997). This reduces the number of free parameters in the covariance matrix from k(k + 1)/2 to k(s + 1) (representing k(s + 1) - s(s - 1)/2 independent degrees of freedom) which is now linear in k. Thus, the number of parameters can be controlled by changing s and, unlike a diagonal covariance matrix, this model can still capture the strongest of the posterior correlations. The value of s should be as large as possible, subject only to computational cost limitations. There is no 'over-fitting' as s is increased since more flexible distributions Q(w) simply better approximate the true posterior.\n\nWe illustrate the optimization of the KL divergence using a toy problem involving the posterior distribution for a two-parameter regression problem. Figure 1 shows the true posterior together with approximations obtained from Laplace's method, ensemble learning with a diagonal covariance Gaussian, and ensemble learning using an unconstrained Gaussian.\n\n[Figure 1 panels: Posterior; Laplace fit; Minimum KLD fit; Minimum KL fit]\n\nFigure 1: Laplace and minimum Kullback-Leibler Gaussian fits to the posterior. The Laplace method underestimates the local posterior mass by basing the covariance matrix on the mode alone, and has KL value 41. The minimum Kullback-Leibler Gaussian fit with a diagonal covariance matrix (KLD) gives a KL value of 4.6, while the minimum Kullback-Leibler Gaussian with full covariance matrix achieves a value of 3.9.\n\n2.1 Hyperparameter Adaptation\n\nSo far, we have treated the hyperparameters as fixed. We now extend the ensemble learning formalism to include hyperparameters within the Bayesian framework. 
For simplicity, we consider a standard isotropic prior covariance matrix of the form $A = \\alpha I$, and introduce hyperpriors given by Gamma distributions\n\n$$\\ln p(\\alpha) = \\ln \\left\\{ \\alpha^{a-1} \\exp\\left( -\\frac{\\alpha}{b} \\right) \\right\\} + \\text{const} \\tag{19}$$\n\n$$\\ln p(\\beta) = \\ln \\left\\{ \\beta^{c-1} \\exp\\left( -\\frac{\\beta}{d} \\right) \\right\\} + \\text{const} \\tag{20}$$\n\nwhere a, b, c, d are constants. The joint posterior distribution of the weights and hyperparameters is given by \n\nin which \n\np (w, a, ,BID)