Part of Advances in Neural Information Processing Systems 7 (NIPS 1994)
David Cohn, Zoubin Ghahramani, Michael Jordan
For many types of learners one can compute the statistically "optimal" way to select data. We review how these techniques have been used with feedforward neural networks [MacKay, 1992; Cohn, 1994]. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate.
1 ACTIVE LEARNING - BACKGROUND
An active learning problem is one where the learner has the ability or need to influence or select its own training data. Many problems of great practical interest allow active learning, and many even require it. We consider the problem of actively learning a mapping $X \to Y$ based on a set of training examples $\{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in X$ and $y_i \in Y$. The learner is allowed to iteratively select new inputs $x$ (possibly from a constrained set), observe the resulting output $y$, and incorporate the new examples $(x, y)$ into its training set. The primary question of active learning is how to choose which $x$ to try next. There are many heuristics for choosing $x$ based on intuition, including choosing places where we don't have data, where we perform poorly [Linden and Weber, 1993], where we have low confidence [Thrun and Möller, 1992], where we expect it
to change our model [Cohn et al., 1990], and where we previously found data that resulted in learning [Schmidhuber and Storck, 1993]. In this paper we consider how one may select $x$ "optimally" from a statistical viewpoint. We first review how the statistical approach can be applied to neural networks, as described in MacKay [1992] and Cohn [1994]. We then consider two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While optimal data selection for a neural network is computationally expensive and approximate, we find that optimal data selection for the two statistical models is efficient and accurate.
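To make the select-observe-incorporate protocol above concrete, the following sketch shows a generic active learning loop. The `learner`, `select_query`, and `oracle` objects are hypothetical placeholders of our own choosing, not part of the paper: any regression model, query-selection criterion, and data source could be plugged in.

```python
import numpy as np

def active_learning_loop(learner, select_query, oracle, X_init, y_init, n_queries=50):
    """Iteratively select a new input, observe its output, and retrain.

    A minimal sketch; `learner` is any model with a fit() method,
    `select_query` implements some selection heuristic or criterion,
    and `oracle` returns the observed y for a queried x.
    """
    X_train, y_train = list(X_init), list(y_init)
    for _ in range(n_queries):
        learner.fit(np.asarray(X_train), np.asarray(y_train))  # fit on current data
        x_new = select_query(learner)   # choose the next input x to try
        y_new = oracle(x_new)           # observe the resulting output y
        X_train.append(x_new)           # incorporate (x, y) into the training set
        y_train.append(y_new)
    learner.fit(np.asarray(X_train), np.asarray(y_train))
    return learner
```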
2 ACTIVE LEARNING - A STATISTICAL APPROACH
We denote the learner's output given input $x$ as $\hat{y}(x)$. The mean squared error of this output can be expressed as the sum of the learner's bias and variance. The variance $\sigma_{\hat{y}}^2(x)$ indicates the learner's uncertainty in its estimate at $x$. Our goal will be to select a new example $\tilde{x}$ such that when the resulting example $(\tilde{x}, \tilde{y})$ is added to the training set, the integrated variance IV is minimized:

$$\mathrm{IV} = \int \sigma_{\hat{y}}^2(x)\, P(x)\, dx. \qquad (1)$$
Here, $P(x)$ is the (known) distribution over $X$. In practice, we will compute a Monte Carlo approximation of this integral, evaluating $\sigma_{\hat{y}}^2$ at a number of random points drawn according to $P(x)$. Selecting $\tilde{x}$ so as to minimize IV requires computing $\tilde{\sigma}_{\hat{y}}^2$, the new variance at $x$ once $(\tilde{x}, \tilde{y})$ has been added to the training set. Until we actually commit to a query $\tilde{x}$, we do not know what corresponding $\tilde{y}$ we will see, so the minimization cannot be performed deterministically. Many learning architectures, however, provide an estimate of $P(\tilde{y}|\tilde{x})$ based on current data, so we can use this estimate to compute the expectation of $\tilde{\sigma}_{\hat{y}}^2$. Selecting $\tilde{x}$ to minimize the expected integrated variance provides a solid statistical basis for choosing new examples.
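As an illustration of this criterion, the sketch below estimates the expected integrated variance of a candidate query by Monte Carlo. It assumes a hypothetical model object exposing `sample_y` (draws from its estimate of $P(\tilde{y}|\tilde{x})$) and `variance_given` (the new variance $\tilde{\sigma}_{\hat{y}}^2(x)$ once a candidate example has been added); these method names are ours, not the paper's.

```python
import numpy as np

def expected_integrated_variance(model, x_query, reference_xs, n_y_samples=10):
    """Monte Carlo estimate of the expected integrated variance after
    querying at x_query.  `reference_xs` are points drawn from P(x).
    """
    total = 0.0
    # Outer loop: expectation over the unknown y, using the model's
    # estimate of P(y | x) at the candidate query point.
    for y_q in model.sample_y(x_query, n_y_samples):
        # Inner average: Monte Carlo approximation of the integral in
        # equation (1), using the new variance after adding (x_query, y_q).
        new_vars = [model.variance_given(x_query, y_q, x) for x in reference_xs]
        total += np.mean(new_vars)
    return total / n_y_samples

# The query actually chosen is the candidate minimizing this quantity, e.g.:
# x_next = min(candidates,
#              key=lambda xq: expected_integrated_variance(model, xq, refs))
```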
2.1 EXAMPLE: ACTIVE LEARNING WITH A NEURAL NETWORK