{"title": "Learning Curves: Asymptotic Values and Rate of Convergence", "book": "Advances in Neural Information Processing Systems", "page_first": 327, "page_last": 334, "abstract": null, "full_text": "Learning Curves: Asymptotic Values and \n\nRate of Convergence \n\nCorinna Cortes, L. D. Jackel, Sara A. Solla, Vladimir Vapnik, \n\nand John S. Denker \nAT&T Bell Laboratories \n\nHolmdel, NJ 07733 \n\nAbstract \n\nTraining classifiers on large databases is computationally demand(cid:173)\ning. It is desirable to develop efficient procedures for a reliable \nprediction of a classifier's suitability for implementing a given task, \nso that resources can be assigned to the most promising candidates \nor freed for exploring new classifier candidates. We propose such \na practical and principled predictive method. Practical because it \navoids the costly procedure of training poor classifiers on the whole \ntraining set, and principled because of its theoretical foundation. \nThe effectiveness of the proposed procedure is demonstrated for \nboth single- and multi-layer networks. \n\n1 \n\nIntrod uction \n\nTraining classifiers on large data.bases is computationally demanding. It is desirable \nto develop efficient procedures for a reliable prediction of a classifier's suitability \nfor implementing a given task. Here we describe such a practical and principled \npredictive method. \n\nThe procedure applies to real-life situations with huge databases and limited re(cid:173)\nsources. Classifier selection poses a problem because training requires resources -\nespecially CPU-cycles, and because there is a combinatorical explosion of classifier \ncandidates. Training just a few of the many possible classifiers on the full database \nmight take up all the available resources, and finding a classifier particular suitable \nfor the task requires a search strategy. 
\n\nFigure 1: Test errors as a function of the size of the training set for three different classifiers. A classifier choice based on the best test error at training set size l_0 = 10,000 will result in an inferior classifier choice if the full database contains more than 15,000 patterns. \n\nThe naive solution to the resource dilemma is to reduce the size of the database to l = l_0, so that it is feasible to train all classifier candidates. The performance of the classifiers is estimated from an independently chosen test set after training. This makes up one point for each classifier in a plot of the test error as a function of the size l of the training set. The naive search strategy is to keep the best classifier at l_0, under the assumption that the relative ordering of the classifiers is unchanged when the test error is extrapolated from the reduced size l_0 to the full database size. Such an assumption is questionable and could easily result in an inferior classifier choice, as illustrated in Fig. 1. \n\nOur predictive method also utilizes extrapolation from medium to large sizes of the training set, but it is based on several data points obtained at various sizes of the training set in the intermediate size regime, where the computational cost of training is low. A change in the representation of the measured data points is used to gain confidence in the extrapolation. \n\n2 A Predictive Method \n\nOur predictive method is based on a simple modeling of the learning curves of a classifier. By learning curves we mean the expectation values of the test and training errors as a function of the training set size l. The expectation value is taken over all the possible ways of choosing a training set of a given size. 
\n\nA typical example of learning curves is shown in Fig. 2. \n\nFigure 2: Learning curves for a typical classifier. For all finite values of the training set size l the test error is larger than the training error. Asymptotically they converge to the same value a. \n\nThe test error is always larger than the training error, but asymptotically they reach a common value, a. We model the errors for large sizes of the training set as power-law decays to the asymptotic error value, a: \n\nE_test = a + b/l^α and E_train = a - c/l^β \n\nwhere l is the size of the training set, b and c are amplitudes, and α and β are positive exponents. From these two expressions the sum and difference are formed: \n\nE_test + E_train = 2a + b/l^α - c/l^β (1) \n\nE_test - E_train = b/l^α + c/l^β (2) \n\nIf we make the assumptions α = β and b = c, equations (1) and (2) reduce to \n\nE_test + E_train = 2a, E_test - E_train = 2b/l^α (3) \n\nThese expressions suggest a log-log representation of the sum and difference of the test and training errors as a function of the training set size l, resulting in two straight lines for large sizes of the training set: a constant ≈ log(2a) for the sum, and a straight line with slope -α and intercept log(b + c) ≈ log(2b) for the difference, as shown in Fig. 3. \n\nThe assumption of equal amplitudes b = c of the two convergent terms is a convenient but not crucial simplification of the model. We find experimentally that for classifiers where this approximation does not hold, the difference E_test - E_train still forms a straight line in a log-log plot. 
From this line the sum s = b + c can be extracted as the intercept, as indicated in Fig. 3. \n\nFigure 3: Within the validity of the power-law modeling of the test and training errors, the sum of and difference between the two errors as a function of training set size give two straight lines in a log-log plot: a constant ≈ log(2a) for the sum, and a straight line with slope -α and intercept log(b + c) ≈ log(2b) for the difference. \n\nThe weighted sum c·E_test + b·E_train will give a constant for an appropriate choice of b and c, with b + c = s. \n\nThe validity of the above model was tested on numerous boolean classifiers with linear decision surfaces. In all experiments we found good agreement with the model, and we were able to extract reliable estimates of the three parameters needed to model the learning curves: the asymptotic value a, the exponent α, and the amplitude b of the power-law decay. An example is shown in Fig. 4 (left). The considered task is the separation of the handwritten digits 0-4 from the digits 5-9. This problem is unrealizable with the given database and classifier. \n\nThe simple modeling of the test and training errors of equation (3) is only assumed to hold for large sizes of the training set, but it appears to be valid already at intermediate sizes, as seen in Fig. 4 (left). The predictive model suggested here is based on this observation, and it can be illustrated from Fig. 4 (left): with test and training errors measured for l ≤ 2560 it is possible to estimate the two straight lines, extract approximate values for the three parameters which characterize the learning curves, and use the resulting power-laws to extrapolate the learning curves to the full size of the database. 
\n\nThe algorithm for the predictive method is therefore as follows: \n\n1. Measure E_test and E_train for intermediate sizes of the training set. \n\n2. Plot log(E_test + E_train) and log(E_test - E_train) versus log l. \n\n3. Estimate the two straight lines and extract the asymptotic value a, the amplitude b, and the exponent α. \n\n4. Extrapolate the learning curves to the full size of the database. \n\nFigure 4: Left: Test of the model for a 256-dimensional boolean classifier trained by minimizing a mean squared error. The sum and difference of the test and training errors are shown as a function of the normalized training set size in a log-log plot (base 10). Each point is the mean with standard deviation for ten different choices of a training set of the given size. The straight line with α = 1, corresponding to a 1/l decay, is shown as a reference. Right: Prediction of learning curves for a 256-dimensional boolean classifier trained by minimizing a mean squared error. Measured errors for training set sizes l ≤ 2560 are used to fit the two proposed straight lines in a log-log plot. The three parameters which characterize the learning curves are extracted and used for extrapolation. \n\nA prediction for a boolean classifier with a linear decision surface is illustrated in Fig. 4 (right). 
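The four steps above can be sketched in a few lines of code. This is a minimal sketch assuming the symmetric model of equation (3) with b = c; the function names and the synthetic data (a = 0.08, b = 2, α = 0.75) are illustrative and not taken from the paper:

```python
# Sketch of the predictive method, assuming the symmetric power-law model
# E_test = a + b/l^alpha, E_train = a - b/l^alpha (equation (3)).
import math

def fit_line(xs, ys):
    # Ordinary least-squares fit y = slope * x + intercept.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def fit_learning_curves(sizes, e_test, e_train):
    # Step 2: log-log representation of sum and difference.
    logl = [math.log(l) for l in sizes]
    log_sum = [math.log(t + r) for t, r in zip(e_test, e_train)]
    log_diff = [math.log(t - r) for t, r in zip(e_test, e_train)]
    # Step 3: the sum is a constant ~ log(2a); the difference is a
    # straight line with slope -alpha and intercept ~ log(2b).
    a = math.exp(sum(log_sum) / len(log_sum)) / 2.0
    slope, intercept = fit_line(logl, log_diff)
    alpha = -slope
    b = math.exp(intercept) / 2.0
    return a, b, alpha

def predict_test_error(a, b, alpha, l):
    # Step 4: extrapolate E_test to a large training set size l.
    return a + b / l ** alpha

# Step 1 stand-in: synthetic errors that follow the model exactly.
a_true, b_true, alpha_true = 0.08, 2.0, 0.75
sizes = [320, 640, 1280, 2560]
e_test = [a_true + b_true / l ** alpha_true for l in sizes]
e_train = [a_true - b_true / l ** alpha_true for l in sizes]

a, b, alpha = fit_learning_curves(sizes, e_test, e_train)
print(round(a, 4), round(b, 4), round(alpha, 4))
print(round(predict_test_error(a, b, alpha, 60000), 4))
```

With errors that follow the model exactly, the fit recovers the three parameters; with measured errors, the same fit is applied to the empirical points at intermediate training set sizes.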
The prediction is excellent for this type of classifier because the sum and difference of the test and training errors converge quickly to two straight lines in a log-log plot. Unfortunately, linear decision surfaces are in general not adequate for many real-life applications. \n\nThe usefulness of the predictive method proposed here can be judged from its performance on real-life sophisticated multi-layer networks. Fig. 5 demonstrates the validity of the model even for a fully-connected multi-layer network operating in its non-linear regime to implement an unrealizable digit recognition task. Already for intermediate sizes of the training set the sum and difference between the test and training errors are again observed to follow straight lines. \n\nFigure 5: Test of the model for a fully-connected 100-10-10 network. The sum and the difference of the test and training errors are shown as a function of the normalized training set size in a log-log plot. Each point is the mean with standard deviation for 20 different choices of a training set of the given size. \n\nThe predictive method was finally tested on sparsely connected multi-layer networks. Fig. 6 (left) shows the test and training errors for two networks trained for the recognition of handwritten digits. The network termed \"old\" is commonly referred to as LeNet [LCBD+90]. The network termed \"new\" is a modification of LeNet with additional feature maps. The full size of the database is 60,000 patterns, a 50-50% mixture of the NIST1 training and test sets. \n\nAfter training on 12,000 patterns it becomes obvious that the new network will outperform the old network when trained on the full database, but we wish to quantify the expected improvement. 
If our predictive method gives a good quantitative estimate of the new network's test error at 60,000 patterns, we can decide whether three weeks of training should be devoted to the new architecture. \n\nA log-log plot based on the three data points from the new network results in values for the three parameters that determine the power-laws used to extrapolate the learning curves of the new network to the full size of the database, as illustrated in Fig. 6 (right). The predicted test error at the full size of the database, l = 60,000, is less than half of the test error for the old architecture, which strongly suggests performing the training on the full database. The result of the full training is also indicated in Fig. 6 (right). The good agreement between predicted and measured values illustrates the power and applicability of the predictive method proposed here to real-life applications. \n\n3 Theoretical Foundation \n\nThe proposed predictive method based on power-law modeling of the learning curves is not just heuristic. A fair amount of theoretical work has been done within the framework of statistical mechanics [SST92] to compute learning curves for simple classifiers implementing unrealizable rules with non-zero asymptotic error value. A key assumption of this theoretical approach is that the number of weights in the network is large. \n\n1 National Institute for Standards and Technology, Special Database 3. 
\n\nFigure 6: Left: Test (circles) and training (triangles) errors for two networks. The \"old\" network is what is commonly referred to as LeNet. The network termed \"new\" is a modification of LeNet with additional feature maps. The full size of the database is 60,000 patterns, and it is a 50-50% mixture of the NIST training and test sets. Right: Test (circles) and training (triangles) errors for the new network. The figure shows the predicted values of the learning curves in the range 20,000 to 60,000 training patterns for the \"new\" network, and the actually measured values at 60,000 patterns. \n\nThe statistical mechanical calculations support a symmetric power-law decay of the expected test and training errors to their common asymptotic value. The power-laws describe the behavior in the large-l regime, with an exponent α which falls in the interval 1/2 ≤ α ≤ 1. Our numerical observations and modeling of the test and training errors are in agreement with these theoretical predictions. \n\nWe have, moreover, observed a correlation between the exponent α and the asymptotic error value a not accounted for by any of the theoretical models considered so far. Fig. 7 shows a plot of the exponent α versus the asymptotic error a evaluated for three different tasks. It appears from this data that the more difficult the target rule, the smaller the exponent, or the slower the learning. A larger generalization error for intermediate training set sizes is in such cases due to the combined effect of a larger asymptotic error and a slower convergence. Numerical results for classifiers of both smaller and larger input dimension support the explanation that this correlation might be due to the finite size of the input dimension of the classifier (here 256). 
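The practical effect of the exponent on the rate of convergence can be made concrete from the model itself: to bring the power-law term b/l^α within ε of the asymptotic error a requires l ≥ (b/ε)^(1/α). A small numeric illustration; the parameter values are arbitrary and not taken from the paper:

```python
# Numeric illustration (not from the paper) of how the exponent alpha
# governs convergence in the model E_test = a + b/l^alpha: reaching
# within eps of the asymptotic error a requires b/l^alpha <= eps,
# i.e. l >= (b/eps)**(1/alpha).
def patterns_needed(b, alpha, eps):
    # Smallest training set size (as a float) with b / l**alpha <= eps.
    return (b / eps) ** (1.0 / alpha)

b, eps = 2.0, 0.01
fast = patterns_needed(b, 1.0, eps)  # alpha = 1, upper end of 1/2..1
slow = patterns_needed(b, 0.5, eps)  # alpha = 1/2, lower end
print(round(fast), round(slow))
```

Halving the exponent squares the required training set size, which is why a smaller α (observed here for harder target rules) means markedly slower learning.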
\n\n4 Summary \n\nIn this paper we propose a practical and principled method for predicting the suitability of classifiers trained on large databases. Such a procedure may eliminate poor classifiers at an early stage of the training procedure and allow for a more intelligent use of computational resources. \n\nFigure 7: Exponent α of the extracted power-law decay as a function of the asymptotic error for three different tasks. The unrealizability of the tasks, as characterized by the asymptotic error a, can be changed by tuning the strength of a weight-decay constraint on the norm of the weights of the classifier. \n\nThe method is based on a simple modeling of the expected training and test errors, expected to be valid for large sizes of the training set. In this model both error measures are assumed to follow power-law decays to their common asymptotic error value, with the same exponent and amplitude characterizing the power-law convergence. \n\nThe validity of the model has been tested on classifiers with linear as well as non-linear decision surfaces. The free parameters of the model are extracted from data points obtained at medium sizes of the training set, and an extrapolation gives good estimates of the test error at large sizes of the training set. \n\nOur numerical studies of learning curves have revealed a correlation between the exponent of the power-law decay and the asymptotic error rate. This correlation is not accounted for by any existing theoretical models, and is the subject of continuing research. \n\nReferences \n\n[LCBD+90] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. 
In Advances in Neural Information Processing Systems, volume 2, pages 396-404. Morgan Kaufmann, 1990. \n\n[SST92] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45:6056-6091, 1992. \n", "award": [], "sourceid": 803, "authors": [{"given_name": "Corinna", "family_name": "Cortes", "institution": null}, {"given_name": "L.", "family_name": "Jackel", "institution": null}, {"given_name": "Sara", "family_name": "Solla", "institution": null}, {"given_name": "Vladimir", "family_name": "Vapnik", "institution": null}, {"given_name": "John", "family_name": "Denker", "institution": null}]}