Part of Advances in Neural Information Processing Systems 9 (NIPS 1996)

*Tom Heskes*

We propose a new method to compute prediction intervals. Especially for small data sets, the width of a prediction interval depends not only on the variance of the target distribution, but also on the accuracy of our estimator of the mean of the target, i.e., on the width of the confidence interval. The confidence interval follows from the variation in an ensemble of neural networks, each of them trained and stopped on bootstrap replicates of the original data set. A second improvement is the use of the residuals on validation patterns instead of on training patterns for estimation of the variance of the target distribution. As illustrated on a synthetic example, our method is better than existing methods with regard to extrapolation and interpolation in data regimes with a limited amount of data, and yields prediction intervals whose actual confidence levels are closer to the desired confidence levels.

1 STATISTICAL INTERVALS

In this paper we will consider feedforward neural networks for regression tasks: estimating an underlying mathematical function between input and output variables based on a finite number of data points possibly corrupted by noise. We are given a set of $p_{\text{data}}$ pairs $\{\vec{x}^{\mu}, t^{\mu}\}$ which are assumed to be generated according to

$$t(\vec{x}) = f(\vec{x}) + \xi(\vec{x}), \tag{1}$$

where $\xi(\vec{x})$ denotes noise with zero mean. Straightforwardly trained on such a regression task, the output of a network $o(\vec{x})$ given a new input vector $\vec{x}$ can be interpreted as an estimate of the regression $f(\vec{x})$, i.e., of the mean of the target distribution given input $\vec{x}$. Sometimes this is all we are interested in: a reliable estimate of the regression $f(\vec{x})$. In many applications, however, it is important to quantify the accuracy of our statements. For regression problems we can distinguish two different aspects: the accuracy of our estimate of the true regression and the accuracy of our estimate with respect to the observed output. Confidence intervals deal with the first aspect, i.e., they consider the distribution of the quantity $f(\vec{x}) - o(\vec{x})$; prediction intervals deal with the latter, i.e., they treat the quantity $t(\vec{x}) - o(\vec{x})$. We see from

$$t(\vec{x}) - o(\vec{x}) = [f(\vec{x}) - o(\vec{x})] + \xi(\vec{x}), \tag{2}$$

that a prediction interval necessarily encloses the corresponding confidence interval. In [7] a method somewhat similar to ours is introduced to estimate both the mean and the variance of the target probability distribution. It is based on the assumption that there is a sufficiently large data set, i.e., that there is no risk of overfitting and that the neural network finds the correct regression. In practical applications with limited data sets such assumptions are too strict. In this paper we will propose a new method which estimates the inaccuracy of the estimator through bootstrap resampling and corrects for the tendency to overfit by considering the residuals on validation patterns rather than those on training patterns.
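The enclosure statement can be made quantitative by turning equation (2) into a statement about variances. The following is only a sketch under an added assumption that is not stated above: the noise on a new target is statistically independent of the network's estimation error. Here $\vec{x}$ denotes the input and $\xi$ the zero-mean noise term.

```latex
\begin{align*}
\operatorname{Var}\!\left[t(\vec{x}) - o(\vec{x})\right]
  &= \operatorname{Var}\!\left[f(\vec{x}) - o(\vec{x})\right]
   + \operatorname{Var}\!\left[\xi(\vec{x})\right] \\
  &\ge \operatorname{Var}\!\left[f(\vec{x}) - o(\vec{x})\right].
\end{align*}
```

Under this assumption the spread that governs the prediction interval is never smaller than the spread that governs the confidence interval, and the two coincide only when the targets are noise-free.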

2 BOOTSTRAPPING AND EARLY STOPPING

Bootstrapping [3] is based on the idea that the available data set is nothing but a particular realization of some unknown probability distribution. Instead of sampling over the "true" probability distribution, which is obviously impossible, one defines an empirical distribution. With so-called naive bootstrapping the empirical distribution is a sum of delta peaks on the available data points, each with probability content $1/p_{\text{data}}$. A bootstrap sample is a collection of $p_{\text{data}}$ patterns drawn with replacement from this empirical probability distribution. This bootstrap sample is nothing but our training set, and all patterns that do not occur in the training set are by definition part of the validation set. For large $p_{\text{data}}$, the probability that a pattern becomes part of the validation set is $(1 - 1/p_{\text{data}})^{p_{\text{data}}} \approx 1/e \approx 0.37$.
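The split into training and validation patterns described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `bootstrap_split`, the seed, and the sizes `p_data = 1000` and `n_run = 200` are assumptions chosen for the example. It draws naive bootstrap replicates of the pattern indices and compares the average fraction of validation patterns with $(1 - 1/p_{\text{data}})^{p_{\text{data}}}$.

```python
import math
import random


def bootstrap_split(p_data, rng):
    """Draw one naive bootstrap replicate of the pattern indices 0..p_data-1.

    Returns (train_idx, val_idx). train_idx holds p_data indices drawn with
    replacement (so it may contain repeats); val_idx holds every index that
    was never drawn.
    """
    train_idx = [rng.randrange(p_data) for _ in range(p_data)]
    drawn = set(train_idx)
    val_idx = [mu for mu in range(p_data) if mu not in drawn]
    return train_idx, val_idx


# Illustrative settings (assumptions, not values from the text).
rng = random.Random(0)
p_data = 1000
n_run = 200

# Fraction of patterns that end up in the validation set, per replicate.
val_fractions = [
    len(bootstrap_split(p_data, rng)[1]) / p_data for _ in range(n_run)
]
mean_val_fraction = sum(val_fractions) / n_run

# Probability that a given pattern is never drawn in p_data draws.
expected_fraction = (1 - 1 / p_data) ** p_data

print("mean validation fraction:", mean_val_fraction)
print("(1 - 1/p_data)^p_data:   ", expected_fraction)
print("1/e:                     ", 1 / math.e)
```

Because the training indices are drawn with replacement, some patterns appear several times in a training set while others are left out entirely; the left-out patterns are exactly the ones available for validation.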

When training a neural network on a particular bootstrap sample, the weights are adjusted in order to minimize the error on the training data. Training is stopped when the error on the validation data starts to increase. This so-called early stopping procedure is a popular strategy to prevent overfitting in neural networks and can be viewed as an alternative to regularization techniques such as weight decay. In this context bootstrapping is just a procedure to generate subdivisions into training and validation sets, similar to k-fold cross-validation or subsampling.
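The early stopping procedure on a bootstrap replicate might look roughly as follows. This is a hedged sketch, not the paper's implementation: a one-dimensional linear model trained by full-batch gradient descent stands in for the neural network, and the synthetic data, the learning rate `lr`, and the `patience` window (how many epochs without improvement are tolerated before stopping) are all assumptions made for illustration.

```python
import random


def mse(w, b, data):
    """Mean squared error of the line w*x + b on a list of (x, t) pairs."""
    return sum((w * x + b - t) ** 2 for x, t in data) / len(data)


def train_with_early_stopping(train, val, lr=0.05, max_epochs=2000, patience=10):
    """Minimize the training error, keep the weights with lowest validation error.

    Training stops once the validation error has not improved for `patience`
    consecutive epochs (a hypothetical stopping rule for this sketch).
    Returns (best_w, best_b, best_val_error).
    """
    w, b = 0.0, 0.0
    best_w, best_b, best_val = w, b, mse(w, b, val)
    epochs_since_best = 0
    for _ in range(max_epochs):
        # Full-batch gradient of the training MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - t) * x for x, t in train) / len(train)
        grad_b = sum(2 * (w * x + b - t) for x, t in train) / len(train)
        w -= lr * grad_w
        b -= lr * grad_b

        val_err = mse(w, b, val)
        if val_err < best_val:
            best_w, best_b, best_val = w, b, val_err
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break
    return best_w, best_b, best_val


# Synthetic data t = f(x) + noise with an assumed f(x) = 2x + 0.5.
rng = random.Random(1)
p_data = 50
data = []
for _ in range(p_data):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + 0.5 + rng.gauss(0.0, 0.3)))

# One bootstrap replicate: drawn patterns train, left-out patterns validate.
train_idx = [rng.randrange(p_data) for _ in range(p_data)]
drawn = set(train_idx)
train = [data[mu] for mu in train_idx]
val = [data[mu] for mu in range(p_data) if mu not in drawn]

initial_val = mse(0.0, 0.0, val)
w, b, best_val = train_with_early_stopping(train, val)
print("validation error before training:", initial_val)
print("validation error at stopping point:", best_val)
print("stopped weights:", w, b)
```

Returning the weights with the lowest validation error, rather than the weights at the final epoch, is one common way to realize "stopping when the validation error starts to increase"; other stopping rules are equally compatible with the description above.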

On each of the $n_{\text{run}}$ bootstrap replicates we train and stop a single neural network.

The output of network $i$ on input vector $\vec{x}^{\mu}$ is written $o_i(\vec{x}^{\mu}) \equiv o_i^{\mu}$. As "the" estimate of our ensemble of networks for the regression $f(\vec{x})$ we take the average output
