{"title": "Statistical Theory of Overtraining - Is Cross-Validation Asymptotically Effective?", "book": "Advances in Neural Information Processing Systems", "page_first": 176, "page_last": 182, "abstract": null, "full_text": "Statistical Theory of Overtraining - Is \n\nCross-Validation Asymptotically \n\nEffective? \n\ns. Amari, N. Murata, K.-R. Miiller* \n\nDept. of Math. Engineering and Inf. Physics, University of Tokyo \n\nHongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan \n\nM. Finke \n\nInst. f. Logik , University of Karlsruhe \n\n76128 Karlsruhe, Germany \n\nH. Yang \n\nLab . f. Inf. Representation, RIKEN, \n\nWakoshi, Saitama, 351-01, Japan \n\nAbstract \n\nA statistical theory for overtraining is proposed. The analysis \ntreats realizable stochastic neural networks, trained with Kullback(cid:173)\nLeibler loss in the asymptotic case. It is shown that the asymptotic \ngain in the generalization error is small if we perform early stop(cid:173)\nping, even if we have access to the optimal stopping time. Consider(cid:173)\ning cross-validation stopping we answer the question: In what ratio \nthe examples should be divided into training and testing sets in or(cid:173)\nder to obtain the optimum performance. In the non-asymptotic \nregion cross-validated early stopping always decreases the general(cid:173)\nization error. Our large scale simulations done on a CM5 are in \nnice agreement with our analytical findings. \n\n1 \n\nIntroduction \n\nTraining multilayer neural feed-forward networks, there is a folklore that the gen(cid:173)\neralization error decreases in an early period of training, reaches the minimum and \nthen increases as training goes on, while the training error monotonically decreases. \nTherefore, it is considered advantageous to stop training at an adequate time or to \nuse regularizers (Hecht-Nielsen [1989), Hassoun [1995), Wang et al. [1994)' Poggio \nand Girosi [1990), Moody [1992)' LeCun et al. [1990] and others). 
To avoid overtraining, the following stopping rule has been proposed, based on cross-validation: divide all the available examples into two disjoint sets. One set is used for training; the other set is used for testing, so that the behavior of the trained network is evaluated on the test examples and training is stopped at the point that minimizes the testing error.\n\n*Permanent address: GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany. E-mail: Klaus@first.gmd.de\n\nThe present paper gives a mathematical analysis of the so-called overtraining phenomenon to elucidate the folklore. We analyze the asymptotic case where the number t of examples is very large. Our analysis treats 1) a realizable stochastic machine, 2) Kullback-Leibler loss (the negative of the log-likelihood loss), and 3) asymptotic behavior where the number t of examples is sufficiently large (compared with the number m of parameters). We first show that asymptotically the gain in the generalization error is small even if we could find the optimal stopping time. We then answer the question: in what ratio should the examples be divided into training and testing sets in order to obtain the optimum performance? We give a definite answer to this problem. When the number m of network parameters is large, the best strategy is to use almost all t examples in the training set and only a fraction 1/\\sqrt{2m} of them in the testing set; e.g., when m = 100, this means that only 7% of the training patterns are to be used in the set determining the point for early stopping.\n\nOur analytic results were confirmed by large-scale computer simulations of three-layer continuous feedforward networks where the number of modifiable parameters is m = 100. 
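The cross-validation stopping rule described above can be sketched in code. This is a minimal illustrative sketch, not the networks or training procedure of the paper: the "network" here is a linear model trained by gradient descent on squared error, and the split ratio, learning rate and data are placeholders chosen only to exhibit the stopping logic.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, X, y):
    # empirical risk of a linear model (stand-in for R_train / R_test)
    return float(np.mean((X @ w - y) ** 2))

def train_step(w, X, y, eps=0.05):
    # one batch gradient-descent step on the training risk
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - eps * grad

# Divide the available examples into disjoint training and testing sets.
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=200)
X_tr, y_tr = X[:160], y[:160]          # training fraction r  = 0.8
X_te, y_te = X[160:], y[160:]          # testing fraction  r' = 0.2

w = np.zeros(5)
best_w, best_test = w.copy(), loss(w, X_te, y_te)
for n in range(500):
    w = train_step(w, X_tr, y_tr)
    test_err = loss(w, X_te, y_te)
    if test_err < best_test:           # stop at the minimum of the testing error
        best_test, best_w = test_err, w.copy()
```

In practice one returns `best_w`, the parameter vector at the point where the testing error was minimal, rather than the fully converged `w`.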
When t > 30m, the theory fits the simulations well, showing that cross-validation is not necessary, because the generalization error becomes worse when test examples are used to obtain an adequate stopping time. In the intermediate range t < 30m, overtraining surely occurs and cross-validation stopping strongly improves the generalization ability.\n\n2 Stochastic feedforward networks\n\nLet us consider a stochastic network which receives an input vector x and emits an output vector y. The network includes a modifiable vector parameter w = (w_1, ..., w_m) and is denoted by N(w). The input-output relation of the network N(w) is specified by the conditional probability p(y|x; w). We assume (a) that there exists a teacher network N(w_0) which generates training examples for the student N(w), and (b) that the Fisher information matrix\n\nG_{ij}(w) = E[ (\\partial/\\partial w_i) \\log p(x, y; w) (\\partial/\\partial w_j) \\log p(x, y; w) ]\n\nexists, is non-degenerate and is smooth in w, where E denotes the expectation with respect to p(x, y; w) = q(x) p(y|x; w). The training set D_t = {(x_1, y_1), ..., (x_t, y_t)} consists of t independent examples generated by the distribution p(x, y; w_0) of N(w_0). The maximum likelihood estimator (m.l.e.) \\hat{w} is the one that maximizes the likelihood of producing D_t, or equivalently minimizes the training error or empirical risk function\n\nR_{train}(w) = -(1/t) \\sum_{i=1}^{t} \\log p(x_i, y_i; w). (2.1)\n\nThe generalization error or risk function R(w) of network N(w) is the expectation with respect to the true distribution,\n\nR(w) = -E_0[\\log p(x, y; w)] = H_0 + D(w_0 || w) = H_0 + E_0[\\log (p(x, y; w_0) / p(x, y; w))], (2.2)\n\nwhere E_0 denotes the expectation with respect to p(x, y; w_0), H_0 is the entropy of the teacher network and D(w_0 || w) is the Kullback-Leibler divergence from the probability distribution p(x, y; w_0) to p(x, y; w), or the divergence of N(w) from N(w_0). Hence, minimizing R(w) is equivalent to minimizing D(w_0 || w), and the 
minimum is attained at w = w_0. The asymptotic theory of statistics shows that the m.l.e. \\hat{w}_t is asymptotically subject to the normal distribution with mean w_0 and covariance G^{-1}/t, where G^{-1} is the inverse of the Fisher information matrix G. We can expand, for example, the risk as\n\nR(w) = H_0 + (1/2)(w - w_0)^T G(w_0)(w - w_0) + O(|w - w_0|^3)\n\nto obtain\n\n<R_{gen}(\\hat{w})> = H_0 + m/(2t) + O(1/t^2), <R_{train}(\\hat{w})> = H_0 - m/(2t) + O(1/t^2), (2.3)\n\nas the asymptotic result for the training and generalization error (see Murata et al. [1993] and Amari and Murata [1990]). An extension of (2.3) including higher-order corrections was recently obtained by Müller et al. [1995].\n\nLet us consider the gradient descent learning rule (Amari [1967], Rumelhart et al. [1986], and many others), where the parameter w(n) at the n-th step is modified by\n\nw(n+1) = w(n) - \\epsilon \\partial R_{train}(w(n)) / \\partial w, (2.4)\n\nwhere \\epsilon is a small positive constant. This is batch learning, where all the training examples are used for each iteration of modifying w(n).^1 The batch process is deterministic and w(n) converges to \\hat{w}, provided the initial w(0) is included in its basin of attraction. For large n we can argue that w(n) approaches \\hat{w} isotropically and the learning trajectory follows a linear ray towards \\hat{w} (for details see Amari et al. [1995]).\n\n3 Virtual optimal stopping rule\n\nDuring learning, as the parameter w(n) approaches \\hat{w}, the generalization behavior of the network N(w(n)) is evaluated by the sequence R(n) = R(w(n)), n = 1, 2, .... The folklore says that R(n) decreases in an early period of learning but increases later. Therefore, there exists an optimal stopping time n at which R(n) is minimized. The stopping time n_{opt} is a random variable depending on \\hat{w} and the initial w(0). We now evaluate the ensemble average <R(n_{opt})>.\n\nThe true w_0 and the m.l.e. \\hat{w} are in general different, and they are apart by order 1/\\sqrt{t}. 
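The asymptotic relation (2.3) can be checked numerically for the simplest realizable model. The check below is not from the paper: it uses an m-dimensional Gaussian with unknown mean and unit covariance, for which the m.l.e. is the sample mean, the Kullback-Leibler risk is exactly H_0 + (1/2)|w - w_0|^2, and both <R_gen> - H_0 and H_0 - <R_train> should come out close to m/(2t).

```python
import numpy as np

rng = np.random.default_rng(1)
m, t, trials = 10, 400, 2000
H0 = 0.5 * m * np.log(2 * np.pi * np.e)   # entropy of the teacher N(0, I)

def neg_log_lik(w, X):
    # average negative log-likelihood of the model N(w, I) on the sample X
    return 0.5 * m * np.log(2 * np.pi) + 0.5 * np.mean(np.sum((X - w) ** 2, axis=1))

gen, train = [], []
for _ in range(trials):
    X = rng.normal(size=(t, m))            # teacher mean w0 = 0
    w_hat = X.mean(axis=0)                 # maximum likelihood estimator
    train.append(neg_log_lik(w_hat, X))    # training error on the same sample
    gen.append(H0 + 0.5 * np.sum(w_hat ** 2))  # exact risk R(w_hat) for this model

print(np.mean(gen) - H0, m / (2 * t))      # both should be close to 0.0125
print(H0 - np.mean(train), m / (2 * t))
```

The symmetric gap m/(2t) on both sides of H_0 is exactly the picture behind the overtraining folklore: the training error systematically underestimates the risk by m/t in total.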
Let us compose a sphere S whose center is at (1/2)(w_0 + \\hat{w}) and which passes through both w_0 and \\hat{w}, as shown in Fig. 1b. Its diameter is denoted by d, where d^2 = |\\hat{w} - w_0|^2 and, in a coordinate system in which G(w_0) is the identity,\n\nE_0[d^2] = E_0[(\\hat{w} - w_0)^T G (\\hat{w} - w_0)] = (1/t) tr(G G^{-1}) = m/t. (3.1)\n\nLet A be the ray, that is, the trajectory w(n) starting at a w(0) which is not in the neighborhood of w_0. The optimal stopping point w^* that minimizes\n\nR(n) = H_0 + (1/2)|w(n) - w_0|^2 (3.2)\n\nis given by the first intersection of the ray A and the sphere S. Since w^* is the point on A such that w_0 - w^* is orthogonal to A, it lies on the sphere S (Fig. 1b). When a ray A' approaches \\hat{w} from the side opposite to w_0 (the right-hand side in the figure), the first intersection point is \\hat{w} itself. In this case, optimal stopping never occurs before convergence to \\hat{w}.\n\nLet \\theta be the angle between the ray A and the diameter w_0 - \\hat{w} of the sphere S. We now calculate the distribution of \\theta when the rays are isotropically distributed.\n\n^1 We can alternatively use on-line learning, studied by Amari [1967], Heskes and Kappen [1991], and recently by Barkai et al. [1994] and Solla and Saad [1995].\n\nLemma 1. When the ray A approaches \\hat{w} from the side in which w_0 is included, the probability density of \\theta, 0 <= \\theta <= \\pi/2, is given by\n\nr(\\theta) = (1/I_{m-2}) sin^{m-2} \\theta, where I_m = \\int_0^{\\pi/2} sin^m \\theta d\\theta. (3.3)\n\nThe detailed proof of this lemma can be found in Amari et al. [1995]. Using the density of \\theta given by Eq. (3.3), we arrive at the following theorem.\n\nTheorem 1. The average generalization error at the optimal stopping point is given by\n\n<R(w^*)> = H_0 + (2m - 1)/(4t). (3.4)\n\nProof. When the ray A is at angle \\theta, 0 <= \\theta < \\pi/2, the optimal stopping point w^* is on the sphere S. It is easily shown that |w^* - w_0| = d sin \\theta. 
This is the case where A comes from the same side as w_0 (the left-hand side in Fig. 1b), which occurs with probability 0.5, and the average of (d sin \\theta)^2 is\n\nE_0[(d sin \\theta)^2] = E_0[d^2] (1/I_{m-2}) \\int_0^{\\pi/2} sin^2 \\theta sin^{m-2} \\theta d\\theta = (m/t)(I_m / I_{m-2}) = (m/t)(1 - 1/m).\n\nWhen \\pi/2 <= \\theta <= \\pi, that is, when A approaches \\hat{w} from the opposite side, learning does not stop until it reaches \\hat{w}, so that |w^* - w_0|^2 = |\\hat{w} - w_0|^2 = d^2. This also occurs with probability 0.5. Averaging the two cases proves the theorem.\n\nThe theorem shows that, even if we knew the optimal stopping time n_{opt} for each trajectory, the generalization error would decrease only by 1/(4t), which has the effect of decreasing the effective dimension by 1/2. This effect is negligible when m is large. The optimal stopping time is of the order log t. Moreover, it is impossible to know the optimal stopping time. If we stop learning at an estimated optimal time \\hat{n}_{opt}, we have a small gain when the ray A comes from the same side as w_0, but we suffer some loss when the ray A comes from the opposite direction. This shows that the gain is even smaller if we use a common stopping time \\bar{n}_{opt} independent of \\hat{w} and w(0), as proposed by Wang et al. [1994]. However, the point is that there is no direct means to estimate n_{opt} or \\bar{n}_{opt} other than, for example, cross-validation. Hence, we analyze cross-validation stopping in the following.\n\n4 Optimal stopping by cross-validation\n\nThe present section studies asymptotically two fundamental problems: 1) Given t examples, how many should be used in the training set and how many in the testing set? 2) How much gain can one expect from the above cross-validated stopping?\n\nLet us divide the t examples into rt examples of the training set and r't examples of the testing set, where r + r' = 1. Let \\hat{w} be the m.l.e. from the rt training examples, and let \\tilde{w} be the m.l.e. from the other r't testing examples. 
Since the training examples and the testing examples are independent, \\hat{w} and \\tilde{w} are subject to independent normal distributions with mean w_0 and covariance matrices G^{-1}/(rt) and G^{-1}/(r't), respectively.\n\nLet us compose the triangle with vertices w_0, \\hat{w} and \\tilde{w}. The trajectory A starting at w(0) enters \\hat{w} linearly in its neighborhood. The point w^* on the trajectory A which minimizes the testing error is the point on A that is closest to \\tilde{w}, since the testing error, defined by\n\nR_{test}(w) = (1/(r't)) \\sum_i {-\\log p(x_i, y_i; w)}, (4.1)\n\nwhere the summation is taken over the r't testing examples, can be expanded as\n\nR_{test}(w) = H_0 - (1/2)|\\tilde{w} - w_0|^2 + (1/2)|w - \\tilde{w}|^2. (4.2)\n\nLet S be the sphere centered at (\\hat{w} + \\tilde{w})/2 and passing through both \\hat{w} and \\tilde{w}. Its diameter is given by d = |\\hat{w} - \\tilde{w}|. Then, the optimal stopping point w^* is given by the intersection of the trajectory A and the sphere S. When the trajectory comes from the side opposite to \\tilde{w}, it does not intersect S until it converges to \\hat{w}, so that the optimal point is w^* = \\hat{w} in this case. Omitting the detailed proof, the generalization error of w^* is given by Eq. (3.2), so that we calculate the expectation\n\nE[|w^* - w_0|^2] = m/(rt) - (1/(2t))(1/r - 1/r').\n\nLemma 2. The average generalization error under optimal cross-validated stopping is\n\n<R(w^*, r)> = H_0 + (2m - 1)/(4rt) + 1/(4r't). (4.3)\n\nWe can then calculate the optimal division rate of the examples, which minimizes the generalization error:\n\nr_{opt} = 1 - (\\sqrt{2m - 1} - 1)/(2(m - 1)), and r_{opt} = 1 - 1/\\sqrt{2m} (large-m limit). (4.4)\n\nSo for large m, only (1/\\sqrt{2m}) x 100% of the examples should be used for testing and all others for training. For example, when m = 100, this shows that 93% of the examples are to be used for training and only 7% are to be kept for testing. 
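Eq. (4.4) is easy to evaluate directly; the short sketch below compares the exact optimal training fraction with its large-m approximation for m = 100 (the function names are ours, not the paper's).

```python
import math

def r_opt_exact(m):
    # exact minimizer of (2m-1)/(4rt) + 1/(4r't) over r, Eq. (4.4)
    return 1 - (math.sqrt(2 * m - 1) - 1) / (2 * (m - 1))

def r_opt_large_m(m):
    # large-m limit of Eq. (4.4)
    return 1 - 1 / math.sqrt(2 * m)

m = 100
print(r_opt_exact(m))      # about 0.934: train on roughly 93% of the examples
print(r_opt_large_m(m))    # about 0.929
```

Both forms agree to well under a percent already at m = 100, which is why the paper quotes the simpler 1 - 1/sqrt(2m) rule.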
From Eq. (4.4) we obtain the optimal generalization error for large m:\n\n<R(w^*, r_{opt})> = H_0 + (m/(2t))(1 + \\sqrt{2/m}). (4.5)\n\nThis shows that the generalization error asymptotically increases slightly under cross-validation compared with non-stopped learning that uses all the examples for training.\n\n5 Simulations\n\nWe use standard feed-forward classifier networks with N inputs, H sigmoid hidden units and M softmax outputs (classes). The output activity O_l of the l-th output unit is calculated via the softmax squashing function\n\np(y = C_l | x; w) = O_l = exp(h_l) / (1 + \\sum_k exp(h_k)), l = 1, ..., M,\n\nwhere h_l = \\sum_j w_{lj}^O s_j - \\theta_l^O is the local field potential. Each output O_l codes the a-posteriori probability of being in class C_l; O_0 denotes a zero class for normalization purposes. The m network parameters consist of biases \\theta and weights w. When x is input, the activity of the j-th hidden unit is\n\ns_j = [1 + exp(-(\\sum_{k=1}^{N} w_{jk}^H x_k - \\theta_j^H))]^{-1}, j = 1, ..., H.\n\nThe input layer is connected to the hidden layer via w^H, and the hidden layer is connected to the output layer via w^O, but no short-cut connections are present. Although the network is completely deterministic, it is constructed to approximate class conditional probabilities (Finke and Müller [1994]).\n\nThe examples {(x_1, y_1), ..., (x_t, y_t)} are produced randomly by drawing the x_i, i = 1, ..., t, independently from a uniform distribution and producing the labels y_i stochastically from the teacher classifier. Conjugate gradient learning with line search on the empirical risk function Eq. (2.1) is applied, starting from a random initial vector. The generalization ability is measured using Eq. (2.2) on a large test set (50000 patterns). 
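The forward pass of the classifier just described can be transcribed directly from the two equations above. Only those equations and the sizes N = 8, H = 8, M = 4 are from the paper; the weight values below are random placeholders, and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
N, H, M = 8, 8, 4
wH, thH = rng.normal(size=(H, N)), rng.normal(size=H)   # input -> hidden weights, biases
wO, thO = rng.normal(size=(M, H)), rng.normal(size=M)   # hidden -> output weights, biases

def forward(x):
    # hidden activities s_j = sigmoid(sum_k wH_jk x_k - thH_j)
    s = 1.0 / (1.0 + np.exp(-(wH @ x - thH)))
    # local fields h_l = sum_j wO_lj s_j - thO_l
    h = wO @ s - thO
    # softmax with implicit zero class: O_l = exp(h_l) / (1 + sum_k exp(h_k))
    e = np.exp(h)
    return e / (1.0 + e.sum())

O = forward(rng.normal(size=N))
# the remainder 1 - sum(O) is the probability of the zero class O_0
m = (N + 1) * H + (H + 1) * M   # 108 modifiable parameters, as in the simulations
```

Note that the outputs sum to less than one by construction; the leftover mass is the normalizing zero class.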
Note that we use Eq. (2.1) on the cross-validation set, because only the empirical risk is available on the cross-validation set in a practical situation. We compare the generalization error for three settings: exhaustive training (no stopping), early stopping (controlled by the cross-validation set) and optimal stopping (controlled by the large test set). The simulations were performed on a parallel computer (CM5). Every curve in the figures takes about 8 h of computing time on a 128 or 256 partition of the CM5, i.e., we perform 128-256 parallel trials. This setting enabled us to do extensive statistics (cf. Amari et al. [1995]).\n\nFig. 1a shows the results of the simulations, where N = 8, H = 8, M = 4, so that the number m of modifiable parameters is m = (N+1)H + (H+1)M = 108. We observe clearly that saturated learning without early stopping is best in the asymptotic range t > 30m, a range which, due to the limited size of typical data sets, is often inaccessible in practical applications. Cross-validated early stopping does not improve the generalization error here, so that no overtraining is observed on average in this range. In the asymptotic area (Figure 1) we observe that the smaller the percentage of the training set used to determine the point of early stopping, the better the generalization ability. When we use cross-validation, the optimal size of the test set is about 7% of all the examples, as the theory predicts.\n\nClearly, early stopping does improve the generalization ability to a large extent in the intermediate range t < 30m (see Müller et al. [1995]). Note that our theory also gives a good estimate of the optimal size of the early stopping set in this intermediate range.\n\n[Figure 1 appears here.] Figure 1: (a) R(w) plotted as a function of 1/t for different sizes r' of the early stopping set for an 8-8-4 classifier network; 'opt.' denotes the use of a very large cross-validation set (50000) and 'no stopping' addresses the case where 100% of the training set is used for exhaustive learning. (b) Geometrical picture to determine the optimal stopping point w^*.\n\n6 Conclusion\n\nWe proposed an asymptotic theory of overtraining. The analysis treats realizable stochastic neural networks, trained with Kullback-Leibler loss. It is demonstrated both theoretically and in simulations that asymptotically the gain in the generalization error is small if we perform early stopping, even if we have access to the optimal stopping time. For cross-validation stopping we showed, for large m, that optimally only a fraction r'_{opt} = 1/\\sqrt{2m} of the examples should be used to determine the point of early stopping in order to obtain the best performance. For example, if m = 100 this corresponds to using 93% of the t patterns for training and only 7% for deciding where to stop. Yet, even if we use r_{opt} for cross-validated stopping, the generalization error is always increased compared to exhaustive training. Nevertheless, note that this asymptotic range is, due to the limited size of typical data sets, often inaccessible in practical applications.\n\nIn the non-asymptotic region, simulations show that cross-validated early stopping always helps to enhance performance, since it decreases the generalization error. In this intermediate range our theory also gives a good estimate of the optimal size of the early stopping set. 
In future work we will consider higher-order correction terms to extend our theory and give a quantitative description of the non-asymptotic region as well.\n\nAcknowledgements: We would like to thank Y. LeCun, S. Bös and K. Schulten for valuable discussions. K.-R. M. thanks K. Schulten for warm hospitality during his stay at the Beckman Inst. in Urbana, Illinois. We acknowledge computing time on the CM5 in Urbana (NCSA) and in Bonn, supported by the National Institutes of Health (P41RR05969) and the EC S & T fellowship (FTJ3-004, K.-R. M.).\n\nReferences\n\nAmari, S. [1967], IEEE Trans., EC-16, 299-307.\nAmari, S., Murata, N. [1993], Neural Computation 5, 140.\nAmari, S., Murata, N., Müller, K.-R., Finke, M., Yang, H. [1995], Statistical Theory of Overtraining and Overfitting, Univ. of Tokyo Tech. Report 95-06, submitted.\nBarkai, N., Seung, H. S., Sompolinsky, H. [1994], On-line learning of dichotomies, NIPS'94.\nFinke, M., Müller, K.-R. [1994], in Proc. of the 1993 Connectionist Models Summer School, Mozer, M., Smolensky, P., Touretzky, D. S., Elman, J. L., Weigend, A. S. (Eds.), Hillsdale, NJ: Erlbaum Associates, 324.\nHassoun, M. H. [1995], Fundamentals of Artificial Neural Networks, MIT Press.\nHecht-Nielsen, R. [1989], Neurocomputing, Addison-Wesley.\nHeskes, T., Kappen, B. [1991], Physical Review, A44, 2718-2762.\nLeCun, Y., Denker, J. S., Solla, S. [1990], Optimal brain damage, NIPS'89.\nMoody, J. E. [1992], The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems, NIPS 4.\nMurata, N., Yoshizawa, S., Amari, S. [1994], IEEE Trans., NN5, 865-872.\nMüller, K.-R., Finke, M., Murata, N., Schulten, K., Amari, S. [1995], A numerical study on learning curves in stochastic multilayer feed-forward networks, Univ. of Tokyo Tech. Report METR 95-03 and Neural Computation, in press.\nPoggio, T. and Girosi, F. 
[1990], Science, 247, 978-982.\nRissanen, J. [1986], Ann. Statist., 14, 1080-1100.\nRumelhart, D., Hinton, G. E., Williams, R. J. [1986], in PDP, Vol. 1, MIT Press.\nSaad, D., Solla, S. A. [1995], PRL, 74, 4337 and Phys. Rev. E, 52, 4225.\nWang, Ch., Venkatesh, S. S., Judd, J. S. [1994], Optimal stopping and effective machine complexity in learning, to appear (revised and extended version of NIPS'93).\n", "award": [], "sourceid": 1060, "authors": [{"given_name": "Shun-ichi", "family_name": "Amari", "institution": null}, {"given_name": "Noboru", "family_name": "Murata", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}, {"given_name": "Michael", "family_name": "Finke", "institution": null}, {"given_name": "Howard", "family_name": "Yang", "institution": null}]}