{"title": "Assessing and Improving Neural Network Predictions by the Bootstrap Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 196, "page_last": 203, "abstract": null, "full_text": "Assessing and Improving Neural Network \nPredictions by the Bootstrap Algorithm \n\nGerhard Paass \n\nGerman National Research Center for Computer Science (GMD) \n\nD-5205 Sankt Augustin, Germany \n\ne-mail: paass 3 \nif Xl + X2 + X3 + X4 > 3 and X5 + X6 + X7 + XS < 3 \nif Xl + X2 + X3 + X4 > 3 and X5 + X6 + X7 + Xs > 3 \n\nIn contrast to the simple xor model generalization is possible in this setup. We \ngenerated a training set X( n) of n = 100 inputs using the true model. \nWe used the pairwise bootstrap procedure described above and generated B = 30 \ndifferent bootstrap samples X;(n) by random selection from X(n) with replace(cid:173)\nment. This number of bootstrap samples is rather low and only will yield reliable \ninformation on the central tendency of the prediction. More sensitive parameters of \n\n\f200 \n\nPaass \n\nInput vectors \nY1 \nYg \n\nBootstrap Distribution of Prediction \n\n0.0 \n\n0.5 \n\n1.0 \n\n10110110 \n01110110 \n\n1 1 1 101 1 0 \n\n00001110 \n\n10001110 \n01001110 \n\n11001110 \n\n00101110 \n\n10101110 \n\n01101110 \n\n1 1 101 1 1 0 \n\n00011110 \n\n10011110 \n\n01011110 \n\n11011110 \n\n00111110 \n\nt---------tl ~ H \nr------mJ-; \n\n...---------.,14 I~ \n\n.... \n\n.A \n\nA.. \n\nl!:rt \nI\u00b7 ~ \nzill \n\nt------::zs:~~ \nI!H \nl{l}-i \n\n.A \n\n~ \n\n... \n.A \n\ntrue expected value \nvalue predicted by the original backprop model \n\n~------f--llnH percentiles of the \n10 \n\n50 75 90 bootstrap distribution \n\n25 \n\nFigure 1: Box-Plots of the Bootstrap Predictive Distribution for a Series \nof Different Input Vectors \n\nthe distribution like low percentiles and the standard deviation can be expected to \nexhibit larger fluctuations. 
We estimated 30 weight vectors ŵ_b from those samples by the backpropagation method with random initial weights. Subsequently, for each of the 256 possible input vectors y_i we determined the prediction g_ŵb(y_i), yielding a predictive distribution. For comparison purposes we also estimated the weights of the original backprop model with the full data set X(n) and the corresponding predictions.

Table 1: Mean Square Deviation from the True Prediction

                              MEAN SQUARE DIFFERENCE
INPUT TYPE      HIDDEN UNITS  BOOTSTRAP D_B   FULL DATA D_F
training             2            0.18            0.19
inputs               3            0.17            0.19
                     4            0.17            0.19
non-training         2            0.30            0.34
inputs               3            0.35            0.38
                     4            0.37            0.42

Table 2: Coverage Probabilities of the Bootstrap Confidence Interval for Prediction

HIDDEN UNITS    FRACTION OF CASES WITH TRUE PREDICTION IN
                 [q25, q75]      [q10, q90]
     2              0.47            0.77
     3              0.44            0.70
     4              0.43            0.70

For some of those input vectors the results are shown in figure 1. The distributions differ greatly in size and form for the different input vectors. Usually the spread of the predictive distribution is large if the median prediction differs substantially from the true value. This reflects the situation that the observed data do not contain much information on the specific input vector. Simply by inspecting the predictive distribution, the reliability of a prediction may be assessed in a heuristic way. This may be a great help in practical applications.

In table 1 the mean square difference D_B := ((1/N) Σ_{i=1}^{N} (z_i − q_{50,i})²)^{1/2} between the true prediction z_i and the median q_{50,i} of the bootstrap predictive distribution is compared to the mean square difference D_F := ((1/N) Σ_{i=1}^{N} (z_i − z_{i,F})²)^{1/2} between the true prediction and the value z_{i,F} estimated with the full-data backprop model; N is the number of inputs of the respective type.
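A minimal sketch of the two deviation measures, assuming the bootstrap predictions are collected in a matrix with one row per input vector and one column per bootstrap model (this layout and the function names are assumptions, not the paper's code):

```python
import numpy as np

def dev_bootstrap_median(z_true, boot_preds):
    """D_B: root mean square deviation between the true predictions z_i
    and the medians q50 of the bootstrap predictive distributions."""
    q50 = np.median(boot_preds, axis=1)  # per-input bootstrap median
    return float(np.sqrt(np.mean((np.asarray(z_true) - q50) ** 2)))

def dev_full_data(z_true, z_full):
    """D_F: root mean square deviation between the true predictions
    and the predictions of the single full-data model."""
    diff = np.asarray(z_true) - np.asarray(z_full)
    return float(np.sqrt(np.mean(diff ** 2)))
```

A table-1-style comparison would evaluate both measures separately over the training inputs and the non-training inputs.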
For the non-training inputs the bootstrap median has a lower mean deviation from the true value. This effect is a real practical advantage and occurs even for this simple bootstrap procedure. It may be caused in part by the variation of the initial weight values (cf. Pearlmutter, Rosenfeld 1991). The utilization of bootstrap procedures with higher order convergence has the potential to improve this effect.

Table 2 lists the fraction of cases in the full set of all 256 possible inputs where the true value is contained in the central 50% and 80% prediction interval. Note that the intervals are based on only 30 cases. For the correct model with 2 hidden units the difference is 0.03, which corresponds to just one case. Models with more hidden units exhibit larger fluctuations. To arrive at more reliable intervals the number of bootstrap samples has to be increased by an order of magnitude.

Table 3: Spread of the Predictive Distribution

HIDDEN UNITS    MEAN INTERQUARTILE RANGE FOR
                TRAINING INPUTS   NON-TRAINING INPUTS
     2               0.13               0.29
     3               0.11               0.35
     4               0.11               0.37

If we use a model with more than two hidden units the fit to the training sample cannot be improved but remains constant. For non-training inputs, however, the predictions of the model deteriorate. In table 1 we see that the mean square deviation from the true prediction increases. This is just a manifestation of 'Occam's razor', which states that unnecessarily complex models should not be preferred to simpler ones (MacKay 1992). Table 3 shows that the spread of the predictive distribution is increased for non-training inputs in the case of models with more than two hidden units. Therefore Occam's razor is supported by the bootstrap predictive distribution without knowing the correct prediction. This effect shows that bootstrap procedures may be utilized for model selection.
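The coverage fractions of table 2 and the interquartile spread of table 3 reduce to percentile computations on the same kind of prediction matrix; a sketch under the same assumed layout (one row per input, one column per bootstrap model; function names are illustrative):

```python
import numpy as np

def coverage(z_true, boot_preds, lo=25, hi=75):
    """Fraction of inputs whose true prediction lies in the central
    bootstrap percentile interval [q_lo, q_hi] (as in table 2)."""
    ql = np.percentile(boot_preds, lo, axis=1)
    qh = np.percentile(boot_preds, hi, axis=1)
    return float(np.mean((z_true >= ql) & (z_true <= qh)))

def mean_iqr(boot_preds):
    """Mean interquartile range q75 - q25 of the predictive
    distributions (as in table 3); a large spread flags inputs
    whose predictions are unreliable."""
    return float(np.mean(np.percentile(boot_preds, 75, axis=1)
                         - np.percentile(boot_preds, 25, axis=1)))
```

Comparing the mean interquartile range of competing architectures on non-training inputs gives the model-selection signal discussed above.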
Analogous to Liu (1993) we may use a cross-validation strategy to determine the prediction error of the bootstrap estimate ŵ_b on those sample elements of X(n) which are not contained in the bootstrap sample X*_b(n). In a similar way Efron (1982, p.52f) determines the error of the predictions g_ŵb(y) within the full sample X(n) and uses this as an indicator of the model performance.

4 SUMMARY

The bootstrap method offers a computation-intensive alternative for estimating the predictive distribution of a neural network even if the analytic derivation is intractable. The available asymptotic results show that it is valid for a large number of linear, nonlinear and even nonparametric regression problems. It has the potential to model the distribution of estimators to a higher precision than the usual normal asymptotics. It may even be valid if the normal asymptotics fail. However, the theoretical properties of bootstrap procedures for neural networks - especially nonlinear models - have to be investigated more comprehensively. In contrast to the Bayesian approach, no distributional assumptions (e.g. normal errors) have to be specified. The simulation experiments show that bootstrap methods offer practical advantages, as the performance of the model with respect to a new input may be readily assessed.

Acknowledgements

This research was supported in part by the German Federal Department of Research and Technology, grant ITW8900A7.

References

Beran, R. (1988): Prepivoting Test Statistics: A Bootstrap View of Asymptotic Refinements. Journal of the American Statistical Association, vol. 83, pp.687-697.
Beran, R. (1990): Calibrating Prediction Regions. Journal of the American Statistical Association, vol. 85, pp.715-723.
Bickel, P.J., Freedman, D.A.
(1981): Some Asymptotic Theory for the Bootstrap. The Annals of Statistics, vol. 9, pp.1196-1217.
Bickel, P.J., Freedman, D.A. (1983): Bootstrapping Regression Models with many Parameters. In P. Bickel, K. Doksum, J.L. Hodges (eds.): A Festschrift for Erich Lehmann. Wadsworth, Belmont, CA, pp.28-48.
DiCiccio, T.J., Romano, J.P. (1988): A Review of Bootstrap Confidence Intervals. Journal of the Royal Statistical Society, Ser. B, vol. 50, pp.338-354.
Efron, B. (1979): Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, vol. 7, pp.1-26.
Efron, B. (1982): The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia.
Efron, B., Gong, G. (1983): A leisurely look at the bootstrap, the jackknife and cross-validation. American Statistician, vol. 37, pp.36-48.
Efron, B., Tibshirani, R. (1986): Bootstrap Methods for Standard Errors, Confidence Intervals, and other Measures of Statistical Accuracy. Statistical Science, vol. 1, pp.54-77.
Freedman, D.A. (1981): Bootstrapping Regression Models. The Annals of Statistics, vol. 9, pp.1218-1228.
Härdle, W. (1990): Applied Nonparametric Regression. Cambridge University Press, Cambridge.
Härdle, W., Mammen, E. (1990): Bootstrap Methods in Nonparametric Regression. Preprint Nr. 593, Sonderforschungsbereich 123, University of Heidelberg.
Hall, P. (1988): Theoretical Comparison of Bootstrap Confidence Intervals. The Annals of Statistics, vol. 16, pp.927-985.
Hinkley, D.V. (1988): Bootstrap Methods. Journal of the Royal Statistical Society, Ser. B, vol. 50, pp.321-337.
Liu, R. (1988): Bootstrap Procedures under some non-i.i.d. Models. The Annals of Statistics, vol. 16, pp.1696-1708.
Liu, Y. (1993): Neural Network Model Selection Using Asymptotic Jackknife Estimator and Cross-Validation Method. This volume.
MacKay, D.J.C. (1992): Bayesian Model Comparison and Backprop Nets. In Moody, J.E., Hanson, S.J., Lippmann, R.P. (eds.):
Advances in Neural Information Processing Systems 4. Morgan Kaufmann, San Mateo, pp.839-846.
Mammen, E. (1991): When does Bootstrap Work: Asymptotic Results and Simulations. Preprint Nr. 623, Sonderforschungsbereich 123, University of Heidelberg.
Pearlmutter, B.A., Rosenfeld, R. (1991): Chaitin-Kolmogorov Complexity and Generalization in Neural Networks. In Lippmann, R.P. et al. (eds.): Advances in Neural Information Processing Systems 3. Morgan Kaufmann, pp.925-931.
Wu, C.F.J. (1986): Jackknife, Bootstrap and other Resampling Methods in Regression Analysis. The Annals of Statistics, vol. 14, pp.1261-1295.
", "award": [], "sourceid": 659, "authors": [{"given_name": "Gerhard", "family_name": "Paass", "institution": null}]}