Part of Advances in Neural Information Processing Systems 13 (NIPS 2000)
Ilya Nemenman, William Bialek
Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field-theoretic prior numerically, test its efficacy, and show that the free parameter of the theory ('smoothness scale') can be determined self-consistently by the data; this forms an infinite-dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets.
One of the central problems in learning is to balance 'goodness of fit' criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space (the 'Occam factor') that discriminates against models with more parameters [1, 2]. This works remarkably well for systems with a finite number of parameters and creates a complexity 'razor' (after 'Occam's razor') that is almost equivalent to the celebrated Minimum Description Length (MDL) principle [3]. In addition, if the a priori distributions involved are strictly Gaussian, the ideas have also been proven to apply to some infinite-dimensional (nonparametric) problems [4]. It is not clear, however, what happens if we leave the finite-dimensional setting to consider nonparametric problems which are not Gaussian, such as the estimation of a smooth probability density.
A possible route to progress on the nonparametric problem was opened by noticing [5] that a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT). In particular, there are field-theoretic methods for computing the infinite-dimensional analog of the Occam factor, at least asymptotically for large numbers of examples. These observations have led to a number of papers [6, 7, 8, 9] exploring alternative formulations and their implications for the speed of learning. Here we return to the original formulation of Ref. [5] and use numerical methods to address some of the questions left open by the analytic work [10]: What is the result of balancing the infinite-dimensional Occam factor against the goodness of fit? Is the QFT inference optimal in using all of the information relevant for learning [11]?
What happens if our learning problem is strongly atypical of the prior distribution?
Following Ref. [5], if $N$ i.i.d. samples $\{x_i\}$, $i = 1 \ldots N$, are observed, then the probability that a particular density $Q(x)$ gave rise to these data is given by $P[Q(x)] \prod_{i=1}^{N} Q(x_i)$