{"title": "Cross-Validation Estimates IMSE", "book": "Advances in Neural Information Processing Systems", "page_first": 391, "page_last": 398, "abstract": null, "full_text": "Cross-Validation Estimates IMSE \n\nMark Plutowski t* \n\nShinichi Sakata t \n\nHalbert White t* \n\nt Department of Computer Science and Engineering \n\nt Department of Economics \n\n* Institute for Neural Computation \nUniversity of California, San Diego \n\nAbstract \n\nIntegrated Mean Squared Error (IMSE) is a version of the usual \nmean squared error criterion, averaged over all possible training \nIf it could be observed, it could be used \nsets of a given size. \nto determine optimal network complexity or optimal data sub(cid:173)\nsets for efficient training. We show that two common methods of \ncross-validating average squared error deliver unbiased estimates \nof IMSE, converging to IMSE with probability one. These esti(cid:173)\nmates thus make possible approximate IMSE-based choice of net(cid:173)\nwork complexity. We also show that two variants of cross validation \nmeasure provide unbiased IMSE-based estimates potentially useful \nfor selecting optimal data subsets. \n\n1 Summary \nTo begin, assume we are given a fixed network architecture. (We dispense with \nthis assumption later.) Let zN denote a given set of N training examples. Let \nQN(zN) denote the expected squared error (the expectation taken over all possible \nexamples) of the network after being trained on zN. This measures the quality of \nfit afforded by training on a given set of N examples. \nLet IMSEN denote the Integrated Mean Squared Error for training sets of size \nN. Given reasonable assumptions, it is straightforward to show that IMSEN = \nE[Q N(ZN)] - 0\"2, where the expectation is now over all training sets of size N, ZN \nis a random training set of size N, and 0\"2 is the noise variance. \nLet CN = CN(zN) denote the \"delete-one cross-validation\" squared error measure \nfor a network trained on zN. 
C_N is obtained by training networks on each of the N training sets of size N-1 obtained by deleting a single example; the measure follows by computing squared error for the corresponding deleted example and averaging the results. Let G_{N,M} = G_{N,M}(z^N, z^M) denote the \"generalization\" measure obtained by separating the available data of size N + M into a training set z^N of size N and a validation (\"test\") set z^M of size M; the measure follows by training on z^N and computing averaged squared error over z^M. \n\nWe show that C_N is an unbiased estimator of E[Q_{N-1}(Z^{N-1})], and hence estimates IMSE_{N-1} up to noise variance. Similarly, G_{N,M} is an unbiased estimator of E[Q_N(Z^N)]. Given reasonable conditions on the estimator and on the data generating process, we demonstrate convergence with probability 1 of G_{N,M} and C_N to E[Q_N(Z^N)] as N and M grow large. \n\nA direct consequence of these results is that when choice is restricted to a set of network architectures whose complexity is bounded above a priori, then choosing the architecture for which either C_N (or G_{N,M}) is minimized leads to choice of the network for which IMSE_N is nearly minimized for all N (respectively, N, M) sufficiently large. \n\nWe also provide results for training sets sampled at particular inputs. Conditional IMSE is an appealing criterion for evaluating a particular choice of training set in the presence of noise. These results demonstrate that delete-one cross-validation estimates average MSE (the average taken over the given set of inputs), and that hold-out set cross-validation gives an unbiased estimate of E[Q_N(Z^N) | X^N = x^N], given a set of N input values x^N for which corresponding (random) output values y^N are obtained. 
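As an illustrative aside (not part of the original paper), the two measures just summarized can be sketched in a few lines of Python, with an ordinary linear least-squares fit standing in for the trained network; the helpers `fit` and `predict` are hypothetical stand-ins, not the paper's estimator:

```python
import numpy as np

def fit(x, y):
    """Least-squares 'training': fit y = a*x + b to the dataset (x, y)."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict(w, x):
    return w[0] * x + w[1]

def delete_one_cv(x, y):
    """C_N: for each i, train on the N-1 remaining examples, score the
    squared error on the deleted example, and average the results."""
    n = len(x)
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        w = fit(x[keep], y[keep])
        errs.append((y[i] - predict(w, x[i])) ** 2)
    return float(np.mean(errs))

def hold_out_cv(x_train, y_train, x_val, y_val):
    """G_{N,M}: train once on the size-N training set, then average
    squared error over the size-M hold-out set."""
    w = fit(x_train, y_train)
    return float(np.mean((y_val - predict(w, x_val)) ** 2))
```

Averaged over training sets, the first quantity tracks IMSE_{N-1} + σ² and the second tracks IMSE_N + σ², which is what Propositions 1 and 2 below establish for the actual nonlinear least-squares setting.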
Either cross-validation measure can therefore be used to select a representative subset of the entire dataset that can be used for data compaction, or for more efficient training (as training can be faster on smaller datasets) [4]. \n\n2 Definitions \n\n2.1 Learning Task \n\nWe consider the learning task of determining the relationship between a random vector X and a random scalar Y, where X takes values in a subset X of ℝ^r, and Y takes values in a subset Y of ℝ (e.g., X = ℝ^r and Y = ℝ). We refer to X as the input space. The learning task is thus one of training a neural network with r inputs and one output. It is straightforward to extend the following analysis to networks with multiple targets. We make the following assumption on the observations to be used in the training of the networks. \n\nAssumption 1 X is a Borel subset of ℝ^r and Y is a Borel subset of ℝ. Let Z = X × Y and Ω = Z^∞ = ×_{i=1}^∞ Z. Let (Ω, F, P) be a probability space with F = B(Ω). The observations on Z = (X', Y)' to be used in the training of the network are a realization of an i.i.d. stochastic process {Z_i ≡ (X_i', Y_i)' : Ω → X × Y}. \n\nWhen ω ∈ Ω is fixed, we write z_i = Z_i(ω) for each i = 1, 2, .... Also write Z^N = (Z_1, ..., Z_N) and z^N = (z_1, ..., z_N). \n\nAssumption 1 allows uncertainty caused by measurement errors of observations as well as a probabilistic relationship between X and Y. It does not, however, prevent a deterministic relationship between X and Y such that Y = g(X) for some measurable mapping g : ℝ^r → ℝ. \n\nWe suppose interest attaches to the conditional expectation of Y given X, written g(x) = E(Y|X). The next assumption guarantees the existence of E(Y_i|X_i) and E(ε_i|X_i), ε_i = Y_i - E(Y_i|X_i). Next, for convenience, we assume homoscedasticity of the conditional variance of Y_i given X_i. \n\nAssumption 2 E(Y²) < ∞. \n\nAssumption 3 E(ε_i²|X_i) = σ², where σ² is a strictly positive constant. 
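As a hypothetical illustration (not from the paper), a data generating process satisfying Assumptions 1 through 3 — i.i.d. pairs with Y_i = g(X_i) + ε_i and constant conditional noise variance σ² — can be simulated as follows; the target g, the input distribution, and the helper name are illustrative choices:

```python
import numpy as np

def sample_training_set(n, g, sigma, rng):
    """Draw z^N = (z_1, ..., z_N) i.i.d.: inputs X_i from the input
    distribution mu (here uniform on [-1, 1]) and Y_i = g(X_i) + eps_i,
    with noise independent of X_i, so that E(eps_i^2 | X_i) = sigma^2."""
    x = rng.uniform(-1.0, 1.0, size=n)
    eps = rng.normal(0.0, sigma, size=n)
    return x, g(x) + eps

# g(x) = E(Y | X = x) is the conditional-mean target the network approximates.
rng = np.random.default_rng(0)
x, y = sample_training_set(1000, np.tanh, 0.1, rng)
```

The homoscedasticity of Assumption 3 shows up empirically as a residual variance near σ² regardless of where the inputs fall.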
\n\n2.2 Network Model \n\nLet f^p(·, ·) : X × W^p → Y be a network function with \"weight space\" W^p, where p denotes the dimension of the weight space (the number of weights). We impose some mild conditions on the network architecture. \n\nAssumption 4 For each p ∈ {1, 2, ..., p̄}, p̄ ∈ ℕ, W^p is a compact subset of ℝ^p, and f^p : X × W^p → ℝ satisfies the following conditions: \n\n1. f^p(·, w) : X → Y is measurable for each w ∈ W^p; \n\n2. f^p(x, ·) : W^p → Y is continuous for all x ∈ X. \n\nWe further make a joint assumption on the underlying data generating process and the network architecture to assure that the training dataset and the networks behave appropriately. \n\nAssumption 5 There exists a function D : X → ℝ₊ = [0, ∞) such that for each x ∈ X and w ∈ W^p, |f^p(x, w)| ≤ D(x), and E[(D(X))²] < ∞. \n\nHence, f^p is square integrable for each w^p ∈ W^p. We will measure network performance using mean squared error, which for weights w^p is given by λ(w^p; p) = E[(Y - f^p(X, w^p))²]. The optimal weights are the weights that minimize λ(w^p; p). The set of all optimal weights is given by W^{p*} = {w* ∈ W^p : λ(w*; p) ≤ λ(w; p) for any w ∈ W^p}. The index of the best network is p*, given by the smallest p minimizing min_{w^p ∈ W^p} λ(w^p; p), p ∈ {1, 2, ..., p̄}. \n\n2.3 Least-Squares Estimator \n\nWhen Assumptions 1 and 4 hold, the nonlinear least-squares estimator exists. Formally, we have \n\nLemma 1 Suppose that Assumptions 1 and 4 hold. Then 1. For each N ∈ ℕ, there exists a measurable function Ŵ_N(·; p) : Z^N → W^p such that Ŵ_N(z^N; p) solves the following problem with probability one: min_{w ∈ W^p} N⁻¹ Σ_{i=1}^N (y_i - f^p(x_i, w))². 2. λ(·; p) : W^p → ℝ is continuous on W^p, and W^{p*} is not empty. \n\nFor convenience, we also define Ŵ_N^p : Ω → W^p by Ŵ_N^p(ω) = Ŵ_N(Z^N(ω); p) for each ω ∈ Ω. Next let i_1, i_2, ..., i_N be distinct natural numbers and let Z^N = (Z_{i_1}, ..., Z_{i_N})'. Then Ŵ_N(Z^N) given above solves min_{w ∈ W^p} N⁻¹ Σ_{j=1}^N (Y_{i_j} - f^p(X_{i_j}, w))² with probability one. In particular, we will consider the estimate using the dataset z^N_{-i} made by deleting the ith observation from z^N. Let Z^N_{-i} be the random matrix made by deleting the ith row from Z^N. Thus, Ŵ_{N-1}(Z^N_{-i}; p) is a measurable least-squares estimator and we can consider its probabilistic behavior. \n\n3 Integrated Mean Squared Error \n\nIntegrated Mean Squared Error (IMSE) has been used to regulate network complexity [9]. Another (conditional) version of IMSE is used as a criterion for evaluating training examples [5, 6, 7, 8]. The first version depends only on the sample size, not the particular sample. The second (conditional) version depends additionally upon the observed location of the examples in the input space. \n\n3.1 Unconditional IMSE \n\nThe (unconditional) mean squared error (MSE) of the network output at a particular input value x is \n\nM_N(x; p) = E[{g(x) - f^p(x, Ŵ_N(Z^N; p))}²]. (1) \n\nIntegrating MSE over all possible inputs gives the unconditional IMSE: \n\nIMSE_N(p) = ∫ M_N(x; p) μ(dx) (2) \n\n= E[M_N(X; p)], (3) \n\nwhere μ is the input distribution. \n\n3.2 Conditional IMSE \n\nTo evaluate exemplars obtained at inputs x^N, we modify Equation (1) by conditioning on x^N, giving \n\nM_N(x | x^N; p) = E[{g(x) - f^p(x, Ŵ_N(Z^N; p))}² | X^N = x^N]. (4) \n\nThe conditional IMSE (given inputs x^N) is then \n\nIMSE_N(x^N; p) = ∫ M_N(x | x^N; p) μ(dx) = E[M_N(X | x^N; p)]. (5) \n\n4 Cross-Validation \n\nCross-validatory measures have been used successfully to assess the performance of a wide range of estimators [10, 11, 12, 13, 14, 15]. Cross-validatory measures have been derived for various performance criteria, including the Kullback-Leibler Information Criterion (KLIC) and the Integrated Squared Error (ISE, asymptotically equivalent to IMSE) [16]. 
Although provably inappropriate in certain applications [17, 18], optimality and consistency results for the cross-validatory measures have been obtained for several estimators, including linear regression, orthogonal series, splines, histograms, and kernel density estimators [16, 19, 20, 21, 22, 23, 24]. The authors are not aware of similar results applicable to neural networks, although two more general, but weaker, results do apply [26]. A general result applicable to neural networks shows asymptotic equivalence between cross-validation and Akaike's Criterion for network selection [25, 29], as well as between cross-validation and Moody's Criterion [30, 29]. \n\n4.1 Expected Network Error \n\nGiven our assumptions, we can relate cross-validation to IMSE. For clarity and notational convenience, we first introduce a measure of network error closely related to IMSE. For each weight w^p ∈ W^p, we have defined the mean squared error λ(w^p; p) in Section 2.2. We define Q_N to map each dataset to the mean squared error of the estimated network: \n\nQ_N(z^N; p) = λ(Ŵ_N(z^N; p); p). \n\nWhen Assumption 3 holds, we have \n\nλ(w^p; p) = E[(g(X) - f^p(X, w^p))²] + σ² \n\n= E[(g(X_{N+1}) - f^p(X_{N+1}, w^p))²] + σ², \n\nas is easily verified. We therefore have \n\nQ_N(z^N; p) = E[(g(X_{N+1}) - f^p(X_{N+1}, Ŵ_N(Z^N; p)))² | Z^N = z^N] + σ². \n\nThus, by using the law of iterated expectations, we have \n\nE[Q_N(Z^N; p)] = IMSE_N(p) + σ². \n\nLikewise, given x^N ∈ X^N, \n\nE[Q_N(Z^N; p) | X^N = x^N] = IMSE_N(x^N; p) + σ². (6) \n\n4.2 Cross-Validatory Estimation of Error \n\nIn practice we work with observable quantities only. In particular, we must estimate the error of network p over novel data (\"generalization\") from a finite set of examples. 
Such an estimate is given by the delete-one cross-validation measure: \n\nC_N(z^N; p) = (1/N) Σ_{i=1}^N (y_i - f^p(x_i, Ŵ_{N-1}(z^N_{-i}; p)))², (7) \n\nwhere z^N_{-i} denotes the training set obtained by deleting the ith example. Using z^N_{-i} instead of z^N avoids a downward bias due to testing upon examples used in training, as we show below (Theorem 3). Another version of cross-validation is commonly used for evaluating \"generalization\" when an abundant supply of novel data is available for use as a \"hold-out\" set: \n\nG_{N,M}(z^N, z̃^M; p) = (1/M) Σ_{i=1}^M (ỹ_i - f^p(x̃_i, Ŵ_N(z^N; p)))², (8) \n\nwhere z̃^M = (z_{N+1}, ..., z_{N+M})'. \n\n5 Expectation of the Cross-Validation Measures \n\nWe now consider the relation between the cross-validation measures and IMSE. We examine delete-one cross-validation first. \n\nProposition 1 (Unbiasedness of C_N) Let Assumptions 1 through 5 hold. Then for given N, C_N is an unbiased estimator of IMSE_{N-1}(p) + σ²: \n\nE[C_N(Z^N; p)] = IMSE_{N-1}(p) + σ². (9) \n\nWith hold-out set cross-validation, the validation set Z̃^M gives i.i.d. information regarding points outside of the training set Z^N. \n\nProposition 2 (Unbiasedness of G_{N,M}) Let Assumptions 1 through 5 hold. Let Z̃^M = (Z_{N+1}, ..., Z_{N+M})'. Then for given N and M, G_{N,M} is an unbiased estimator of IMSE_N(p) + σ²: \n\nE[G_{N,M}(Z^N, Z̃^M; p)] = IMSE_N(p) + σ². (10) \n\nThe latter result is appealing for large M, N. We expect delete-one cross-validation to be more appealing when training data is not abundant. \n\n6 Expectation of Cross-Validation when Sampling at Selected Inputs \n\nWe obtain analogous results for training sets obtained by sampling at a given set of inputs x^N. We first consider the result for delete-one cross-validation. \n\nProposition 3 (Expectation of C_N given x^N) Let Assumptions 1 through 5 hold. Then, given a particular set of inputs x^N, C_N is an unbiased estimator of average MSE_{N-1} + σ², the average taken over x^N: \n\nE[C_N(Z^N; p) | X^N = x^N] = (1/N) Σ_{i=1}^N M_{N-1}(x_i | x^N_{-i}; p) + σ², \n\nwhere x^N_{-i} is the matrix made by deleting the ith row of x^N. \n\nThis essentially gives an estimate of MSE_{N-1} limited to x ∈ x^N, losing a degree of freedom while providing no estimate of the MSE off of the training points. For this average to converge to IMSE_{N-1}, it will suffice for the empirical distribution of x^N, μ̂_N, to converge to μ, i.e., μ̂_N ⇒ μ. We obtain a stronger result for hold-out set cross-validation. The hold-out set gives independent information on MSE_N off of the training points, resulting in an estimate of IMSE_N for given x^N. \n\nProposition 4 (Expectation of G_{N,M} given x^N) Let Assumptions 1 through 5 hold. Let Z̃^M = (Z_{N+1}, ..., Z_{N+M})'. Then, given a particular set of inputs x^N, G_{N,M} is an unbiased estimator of IMSE_N(x^N; p) + σ²: \n\nE[G_{N,M}(Z^N, Z̃^M; p) | X^N = x^N] = IMSE_N(x^N; p) + σ². \n\n7 Strong Convergence of Hold-Out Set Cross-Validation \n\nOur conditions deliver not only unbiasedness, but also convergence of hold-out set cross-validation to IMSE_N, with probability 1. \n\nTheorem 1 (Convergence of Hold-Out Set w.p. 1) Let Assumptions 1 through 5 hold. Also let Z̃^M = (Z_{N+1}, ..., Z_{N+M})'. If for some A > 0 a sequence {M_N} of natural numbers satisfies M_N > AN for any N = 1, 2, ..., then \n\nG_{N,M_N}(Z^N, Z̃^{M_N}; p) - E[Q_N(Z^N; p)] → 0 a.s.-P as N → ∞. \n\n8 Strong Convergence of Delete-One Cross-Validation \n\nGiven an additional condition (uniqueness of optimal weights) we can show strong convergence for delete-one cross-validation. First we establish uniform convergence of the estimators Ŵ_{N-1}(Z^N_{-i}) to optimal weights (uniformly over 1 ≤ i ≤ N). \n\nTheorem 2 Let Assumptions 1 through 5 hold. Let Z^N_{-k} be the dataset made by deleting the kth observation from Z^N. 
Then \n\nmax_{1≤i≤N} d(Ŵ_{N-1}(Z^N_{-i}; p), W^{p*}) → 0 a.s.-P as N → ∞, (11) \n\nwhere d(w, W^{p*}) = inf_{w* ∈ W^{p*}} ‖w - w*‖. \n\nThis convergence result leads to the next result, that the delete-one cross-validation measure does not under-estimate the optimized MSE, namely inf_{w^p ∈ W^p} λ(w^p; p). \n\nTheorem 3 Let Assumptions 1 through 5 hold. Then \n\nliminf_{N→∞} C_N(Z^N; p) ≥ min_{w ∈ W^p} λ(w; p) a.s.-P. \n\nWhen the optimum weight is unique, we have a stronger result about convergence of the delete-one cross-validation measure. \n\nAssumption 6 W^{p*} is a singleton, i.e., W^{p*} has only one element. \n\nTheorem 4 Let Assumptions 1 through 6 hold. Then \n\nC_N(Z^N; p) - E[Q_N(Z^N; p)] → 0 a.s. as N → ∞. \n\n9 Conclusion \n\nOur results justify the intuition that cross-validation measures unbiasedly and consistently estimate the expected squared error of networks trained on finite training sets, therefore providing means of obtaining IMSE-approximate methods of selecting appropriate network architectures, or for evaluating a particular choice of training set. \n\nUse of these cross-validation measures therefore permits us to avoid underfitting the data, asymptotically. Note, however, that although we also thereby avoid overfitting asymptotically, this avoidance is not necessarily accomplished by choosing a minimally complex architecture. The possibility exists that IMSE_{N-1}(p) = IMSE_{N-1}(p+1). Because our cross-validated estimates of these quantities are random, we may by chance observe C_N(Z^N; p) > C_N(Z^N; p+1) and therefore select the more complex network, even though the less complex network is equally good. Of course, because the IMSEs are the same, no performance degradation (overfitting) will result from this selection. \n\nAcknowledgements \n\nThis work was supported by NSF grant IRI 92-03532. 
We thank David Wolpert, Jan Larsen, Jeff Racine, Vjachislav Krushkal, and Patrick Fitzsimmons for valuable discussions. \n\nReferences \n\n[1] White, H. 1989. \"Learning in Artificial Neural Networks: A Statistical Perspective.\" Neural Computation, 1, 4, pp. 425-464. MIT Press, Cambridge, MA. \n\n[2] Plutowski, Mark E., Shinichi Sakata, and Halbert White. 1993. \"Cross-validation delivers strongly consistent unbiased estimates of Integrated Mean Squared Error.\" To appear. \n\n[3] Plutowski, Mark E., and Halbert White. 1993. \"Selecting concise training sets from clean data.\" IEEE Transactions on Neural Networks. 4, 3, pp. 305-318. \n\n[4] Plutowski, Mark E., Garrison Cottrell, and Halbert White. 1992. \"Learning Mackey-Glass from 25 examples, Plus or Minus 2.\" (Presented at the 1992 Neural Information Processing Systems conference.) Jack D. Cowan, Gerald Tesauro, and Joshua Alspector (eds.), Advances in neural information processing systems 6, San Mateo, CA: Morgan Kaufmann Publishers. \n\n[5] Fedorov, V.V. 1972. Theory of Optimal Experiments. Academic Press, New York. \n\n[6] Box, G., and N. Draper. 1987. Empirical Model-Building and Response Surfaces. Wiley, New York. \n\n[7] Khuri, A.I., and J.A. Cornell. 1987. Response Surfaces (Designs and Analyses). Marcel Dekker, Inc., New York. \n\n[8] Faraway, Julian J. 1990. \"Sequential design for the nonparametric regression of curves and surfaces.\" Technical Report #177, Department of Statistics, The University of Michigan. \n\n[9] Geman, Stuart, Elie Bienenstock, and Rene Doursat. 1992. \"Neural networks and the bias/variance dilemma.\" Neural Computation. 4, 1, 1-58. \n\n[10] Stone, M. 1959. \"Application of a measure of information to the design and comparison of regression experiments.\" Annals Math. 
Stat. 30, 55-69. \n\n[11] Stone, M. 1974. \"Cross-validatory choice and assessment of statistical predictions.\" J.R. Statist. Soc. B. 36, 2, 111-147. \n\n[12] Bowman, Adrian W. 1984. \"An alternative method of cross-validation for the smoothing of density estimates.\" Biometrika, 71, 2, pp. 353-360. \n\n[13] Bowman, Adrian W., Peter Hall, and D.M. Titterington. 1984. \"Cross-validation in nonparametric estimation of probabilities and probability densities.\" Biometrika, 71, 2, pp. 341-351. \n\n[14] Marron, M. 1987. \"A comparison of cross-validation techniques in density estimation.\" The Annals of Statistics. 15, 1, 152-162. \n\n[15] Wahba, Grace. 1990. Spline Models for Observational Data. v. 59 in the CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM, Philadelphia, PA. \n\n[16] Hall, Peter. 1983. \"Large sample optimality of least squares cross-validation in density estimation.\" The Annals of Statistics. 11, 4, 1156-1174. \n\n[17] Schuster, E.F., and G.G. Gregory. 1981. \"On the nonconsistency of maximum likelihood nonparametric density estimators.\" Comp. Sci. & Statistics: Proc. 13th Symp. on the Interface. W.F. Eddy, ed. 295-298. Springer-Verlag. \n\n[18] Chow, Y.-S., S. Geman, and L.-D. Wu. 1983. \"Consistent cross-validated density estimation.\" The Annals of Statistics. 11, 1, 25-38. \n\n[19] Bowman, Adrian W. 1980. \"A note on consistency of the kernel method for the analysis of categorical data.\" Biometrika, 67, 3, pp. 682-684. \n\n[20] Stone, Charles J. 1984. \"An asymptotically optimal window selection rule for kernel density estimates.\" The Annals of Statistics. 12, 4, 1285-1297. \n\n[21] Li, Ker-Chau. 1986. 
\"Asymptotic optimality of \n\nC L and generalized cross-validation in ridge re(cid:173)\ngression with application to spline smoothing.\" \nThe Annals of Statistics. 14, 3, 1101-1112. \n\n(22] Li, Ker-Chau. 1987. \"Asymptotic optimality for \n\nC p , CL, cross-v&1idation, and generalized cross(cid:173)\nvalidation: di.crete index set.\" The Annals of \nStatistics. 15, 3, 958-975. \n\n(23] Utreras, Florencio I. 1987. \"On generalized cross(cid:173)\n\nvalidation \nfor multivariate smoothing spline \nfunction .... SIAM J . Sci. Stat. Comput. 8, i, July \n1987. \n\n(U] Andrews, Don&1d W.K. 1991. \"A.ymptotic opti(cid:173)\n\nmality of generalized C L, cross-validation, and \ngeneralized cross-validation in regression with \nheteroskedastic errors.\" Journal of Econometrics . \n47 (1991) 359-377. North-Holland. \n\n(25] Stone, M. 1977. \"An asymptotic equivalence of \nchoice of model by cross-v&1idation and Akaike's \ncriterion.\" J. Roy. Stat. Soc. Ser B, 39, I, ii-i7. \n\n(26] Stone, M. 1977. \"Asymptotics for and against \n\ncross-validation.\" Biometrika. 64, I, 29-35. \n\n(27] Billingsley, Patrick. 1986. Probability and \n\nMeaaure. Wiley, New York. \n\n(28] Jennrich, R. 1969. \"Asymptotic properties of \nnonlinear least squares estimators.\" Ann. Math. \nStat. 40, 633-6i3. \n\n(29] Liu, Yong. 1993. \"Neural network model selec(cid:173)\n\ntion using a.ymptotic jackknife estimator and \ncross-validation method.\" In Giles, C.L., Han(cid:173)\nson, S.J., and Cowan, J.D. (eds.), Advances in \nneural information processing systems 5, San \nMateo, CA: Morgan Kaufmann Publishers. \n\n(30] Moody, John E. 1992. \"The effective number \nof parameters, an analysis of generalization and \nregularization in nonlinear learning system.\" In \nMoody, J.E., Hanson, S. J., and Lippmann, R.P., \n(eds.), Advances in neural information process(cid:173)\ning systems i, San Mateo, CA: Morgan Kauf(cid:173)\nmann Publishers . \n\n(31] Bailey, Timothy L. and Charles Elk&n. 1993. 
\n\"Estimating the accuracy of learned concepts.\" \nTo appear in Proc. International Joint Confer(cid:173)\nence on Artificial Intelligence. \n\n(32] White, Halbert. 1993. Eatimation, Inference, \n\nand Specification Analyaia. Manu.cript. \n\n(33] White, Halbert . 198i. Aaymptotic Theory for \n\nEconometriciana. Academic Preu. \n\n(3i] Klein, Erwin and Anthony C. Thompson. 198i \n\nTheory of correapondencea : including ap(cid:173)\nplicationa to mathematical economica . Wi(cid:173)\nley. \n\n\f", "award": [], "sourceid": 762, "authors": [{"given_name": "Mark", "family_name": "Plutowski", "institution": null}, {"given_name": "Shinichi", "family_name": "Sakata", "institution": null}, {"given_name": "Halbert", "family_name": "White", "institution": null}]}