David Barber, Christopher Bishop
Bayesian treatments of learning in neural networks are typically based either on local Gaussian approximations to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. However, the derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and so was unable to capture the posterior correlations between parameters. In this paper, we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. Initial results from a standard benchmark problem are encouraging.
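
The limitation of a diagonal covariance can be illustrated with a small numerical sketch. It is not the algorithm of this paper: the "posterior" here is a hypothetical two-dimensional Gaussian with correlation 0.9 rather than a neural network weight posterior, and the divergence is assumed to be KL(Q || P) with Q the approximating distribution, a direction the text above does not spell out. The function name `kl_gauss_2d` and all numerical values are illustrative only.

```python
import math

def kl_gauss_2d(mu_q, cov_q, mu_p, cov_p):
    """KL(Q || P) between two 2-D Gaussians, using the standard closed form."""
    (a, b), (c, d) = cov_p
    det_p = a * d - b * c
    # Precision matrix of P (inverse of its 2x2 covariance).
    prec_p = [[d / det_p, -b / det_p], [-c / det_p, a / det_p]]
    (qa, qb), (qc, qd) = cov_q
    det_q = qa * qd - qb * qc
    # trace(prec_p @ cov_q)
    trace = (prec_p[0][0] * qa + prec_p[0][1] * qc
             + prec_p[1][0] * qb + prec_p[1][1] * qd)
    # Mahalanobis term (mu_p - mu_q)^T prec_p (mu_p - mu_q)
    dx = [mu_p[0] - mu_q[0], mu_p[1] - mu_q[1]]
    maha = sum(dx[i] * prec_p[i][j] * dx[j]
               for i in range(2) for j in range(2))
    return 0.5 * (trace + maha - 2.0 + math.log(det_p / det_q))

# Hypothetical toy "posterior" with strongly correlated parameters.
rho = 0.9
mu_p = [0.0, 0.0]
cov_p = [[1.0, rho], [rho, 1.0]]

# Full-covariance approximation: free to match the target exactly.
kl_full = kl_gauss_2d(mu_p, cov_p, mu_p, cov_p)

# Best diagonal approximation under KL(Q || P): each variance equals the
# reciprocal of the matching diagonal entry of P's precision matrix.
det_p = 1.0 - rho ** 2
var_diag = det_p / 1.0  # 1 / (1 / det_p), since the precision diagonal is 1 / det_p
cov_diag = [[var_diag, 0.0], [0.0, var_diag]]
kl_diag = kl_gauss_2d(mu_p, cov_diag, mu_p, cov_p)

print("KL with full covariance:", kl_full)
print("KL with diagonal covariance:", kl_diag)
```

In this toy setting the full-covariance Gaussian drives the divergence to zero, while the best diagonal Gaussian retains a residual of -0.5 ln(1 - rho^2) and shrinks the marginal variances from 1 to 1 - rho^2. That residual is the cost of ignoring correlations between parameters.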