Part of Advances in Neural Information Processing Systems 10 (NIPS 1997)
David Barber, Christopher Bishop
Bayesian treatments of learning in neural networks are typically based either on local Gaussian approximations to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. However, the derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and so was unable to capture the posterior correlations between parameters. In this paper, we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. Initial results from a standard benchmark problem are encouraging.
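To make the underlying idea concrete, the sketch below fits a full-covariance Gaussian q(w) = N(m, S) to an unnormalized log-posterior by minimizing KL(q || p). Note that it uses a generic Monte Carlo (reparameterization) gradient estimate rather than the deterministic algorithm developed in the paper, and the toy target, variable names, and step sizes are all illustrative assumptions.

```python
# Minimal sketch of ensemble (variational) learning with a full-covariance
# Gaussian approximation.  Assumptions: a 2-D correlated Gaussian target
# posterior, stochastic gradients via the reparameterization trick, and
# plain gradient descent -- none of these details come from the paper.
import numpy as np

rng = np.random.default_rng(0)
D = 2

# Example target: log p(w) proportional to a correlated Gaussian posterior.
A = np.array([[2.0, 0.9], [0.9, 1.0]])   # precision matrix of the target

def log_p(w):
    return -0.5 * w @ A @ w

# Variational parameters: mean m and Cholesky factor L, so S = L @ L.T.
m = np.zeros(D)
L = np.eye(D)

def kl_and_grads(m, L, n_samples=256):
    """Monte Carlo estimate of KL(q||p) (up to a constant) and its gradients."""
    eps = rng.standard_normal((n_samples, D))
    W = m + eps @ L.T                          # samples w ~ q(w)
    # KL(q||p) = -H[q] - E_q[log p(w)]; the entropy of q is analytic.
    entropy = 0.5 * D * (1 + np.log(2 * np.pi)) + np.sum(np.log(np.abs(np.diag(L))))
    kl = -entropy - np.mean([log_p(w) for w in W])
    # Reparameterized gradients of E_q[-log p]: -d log p / dw = A w here.
    g = W @ A
    grad_m = g.mean(axis=0)
    grad_L = (g.T @ eps) / n_samples - np.diag(1.0 / np.diag(L))
    grad_L = np.tril(grad_L)                   # keep L lower-triangular
    return kl, grad_m, grad_L

lr = 0.05
for step in range(500):
    kl, gm, gL = kl_and_grads(m, L)
    m -= lr * gm
    L -= lr * gL

print("fitted covariance:\n", L @ L.T)
print("true covariance:\n", np.linalg.inv(A))
```

Because the approximating Gaussian carries a full covariance matrix, the fitted S recovers the off-diagonal posterior correlation that a diagonal (factorized) approximation would necessarily set to zero, which is the limitation of the original Hinton and van Camp scheme that the paper addresses.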