{"title": "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 847, "page_last": 854, "abstract": null, "full_text": "The Effective Number of Parameters: \n\nAn Analysis of Generalization and Regularization \n\nin Nonlinear Learning Systems \n\nJohn E. Moody \n\nDepartment of Computer Science, Yale University \n\nP.O. Box 2158 Yale Station, New Haven, CT 06520-2158 \n\nInternet: moody@cs.yale.edu, Phone: (203)432-1200 \n\nAbstract \n\nWe present an analysis of how the generalization performance (expected \ntest set error) relates to the expected training set error for nonlinear learn(cid:173)\ning systems, such as multilayer perceptrons and radial basis functions. The \nprincipal result is the following relationship (computed to second order) \nbetween the expected test set and tlaining set errors: \n\n(1) \nHere, n is the size of the training sample e, u;f f is the effective noise \nvariance in the response variable( s), ,x is a regularization or weight decay \nparameter, and Peff(,x) is the effective number of parameters in the non(cid:173)\nlinear model. The expectations ( ) of training set and test set errors are \ntaken over possible training sets e and training and test sets e' respec(cid:173)\ntively. The effective number of parameters Peff(,x) usually differs from the \ntrue number of model parameters P for nonlinear or regularized models; \nthis theoretical conclusion is supported by Monte Carlo experiments. In \naddition to the surprising result that Peff(,x) ;/; p, we propose an estimate \nof (1) called the generalized prediction error (GPE) which generalizes well \nestablished estimates of prediction risk such as Akaike's F P E and AI C, \nMallows Cp, and Barron's PSE to the nonlinear setting.! \n\nlCPE and Peff(>\") were previously introduced in Moody (1991). 

1 Background and Motivation

Many of the nonlinear learning systems of current interest for adaptive control, adaptive signal processing, and time series prediction are supervised learning systems of the regression type. Understanding the relationship between generalization performance and training error, and being able to estimate the generalization performance of such systems, is of crucial importance. We will take the prediction risk (expected test set error) as our measure of generalization performance.

2 Learning from Examples

Consider a set of n real-valued input/output data pairs ξ(n) = {ξ_i = (x_i, y_i); i = 1, ..., n} drawn from a stationary density Ξ(ξ). The observations can be viewed as being generated according to the signal plus noise model²

y_i = \mu(x_i) + \epsilon_i ,    (2)

where y_i is the observed response (dependent variable), the x_i are the independent variables sampled with input probability density Ω(x), ε_i is independent, identically-distributed (iid) noise sampled with density Φ(ε) having mean 0 and variance σ²,³ and μ(x) is the conditional mean, an unknown function. From the signal plus noise perspective, the density Ξ(ξ) = Ξ(x, y) can be represented as the product of two components, the conditional density Ψ(y|x) and the input density Ω(x):

\Xi(x, y) = \Psi(y|x) \, \Omega(x) = \Phi(y - \mu(x)) \, \Omega(x) .    (3)

The learning problem is then to find an estimate μ̂(x) of the conditional mean μ(x) on the basis of the training set ξ(n).

In many real world problems, few a priori assumptions can be made about the functional form of μ(x).
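As a concrete illustration of the signal plus noise model (2), the following sketch draws a training set ξ(n). The particular target μ(x) = sin(2πx), the uniform input density Ω(x), and the uniform noise density Φ(ε) are illustrative assumptions, not choices made in this paper:

```python
import numpy as np

def make_training_set(n, sigma, rng):
    """Draw a training set xi(n) = {(x_i, y_i)} from the signal plus noise
    model y_i = mu(x_i) + eps_i, with illustrative target mu(x) = sin(2*pi*x)
    and iid zero-mean noise of variance sigma**2."""
    x = rng.uniform(0.0, 1.0, size=n)        # inputs sampled from Omega(x)
    mu = np.sin(2.0 * np.pi * x)             # conditional mean mu(x)
    # A uniform density on [-a, a] has variance a**2 / 3, so choose a
    # to give the noise density Phi(eps) variance sigma**2.
    a = sigma * np.sqrt(3.0)
    eps = rng.uniform(-a, a, size=n)         # noise sampled from Phi(eps)
    return x, mu + eps

x, y = make_training_set(50, sigma=0.2, rng=np.random.default_rng(0))
```

Note that only the mean and variance of the noise enter the analysis below; the uniform shape here merely emphasizes that Φ(ε) need not be gaussian.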
Since a parametric function class is usually not known, one must resort to a nonparametric regression approach, whereby one constructs an estimate μ̂(x) = f(x) of μ(x) from a large class of functions F known to have good approximation properties (for example, F could be all possible radial basis function networks and multilayer perceptrons). The class of approximation functions is usually the union of a countable set of subclasses (specific network architectures)⁴ A ⊂ F for which the elements of each subclass f(w, x) ∈ A are continuously parametrized by a set of p = p(A) weights w = {w^α; α = 1, ..., p}. The task of finding the estimate f(x) thus consists of two problems: choosing the best architecture A and choosing the best set of weights w given the architecture. Note that in the nonparametric setting, there does not typically exist a function f(w*, x) ∈ F with a finite number of parameters such that f(w*, x) = μ(x) for arbitrary μ(x). For this reason, the estimators μ̂(x) = f(ŵ, x) will be biased estimators of μ(x).⁵

²The assumption of additive noise ε which is independent of x is a standard assumption and is not overly restrictive. Many other conceivable signal/noise models can be transformed into this form. For example, the multiplicative model y = μ(x)(1 + ε) becomes y' = μ'(x) + ε' for the transformed variable y' = log(y).

³Note that we have made only a minimal assumption about the noise ε: that it has finite variance σ² independent of x. Specifically, we do not need to assume that the noise density Φ(ε) is of known form (e.g. gaussian) for the following development.

⁴For example, a "fully connected two layer perceptron with five internal units".

The first problem (finding the architecture A) requires a search over possible architectures (e.g. network sizes and topologies), usually starting with small architectures and then considering larger ones. By necessity, the search is not usually exhaustive and must use heuristics to reduce search complexity. (A heuristic search procedure for two layer networks is presented in Moody and Utans (1992).)

The second problem (finding a good set of weights for f(w, x)) is accomplished by minimizing an objective function:

\hat{w}_A = \mathrm{argmin}_w \, U(A, w, \xi(n)) .    (4)

The objective function U consists of an error function plus a regularizer:

U(A, w, \xi(n)) = n E_{train}(w, \xi(n)) + \lambda S(w) .    (5)

Here, the error E_train(w, ξ(n)) measures the "distance" between the target response values y_i and the fitted values f(w, x_i):

E_{train}(w, \xi(n)) = \frac{1}{n} \sum_{i=1}^{n} E[y_i, f(w, x_i)] ,    (6)

and S(w) is a regularization or weight-decay function which biases the solution toward functions with a priori "desirable" characteristics, such as smoothness. The parameter λ ≥ 0 is the regularization or weight decay parameter and must itself be optimized.⁶

The most familiar example of an objective function uses the squared error⁷ E[y_i, f(w, x_i)] = [y_i - f(w, x_i)]² and a weight decay term:

U(\lambda, w, \xi(n)) = \sum_{i=1}^{n} [y_i - f(w, x_i)]^2 + \lambda \sum_{\alpha=1}^{p} g(w^\alpha) .    (7)

The first term is the sum of squared errors (SSE) of the model f(w, x) with respect to the training data, while the second term penalizes either small, medium, or large weights, depending on the form of g(w^α). Two common examples of weight decay functions are the ridge regression form g(w^α) = (w^α)² (which penalizes large weights) and the Rumelhart form g(w^α) = (w^α)² / [(w⁰)² + (w^α)²] (which penalizes weights of intermediate values near w⁰).

⁵By biased, we mean that the mean squared bias is nonzero: MSB = ∫ ρ(x) (⟨μ̂(x)⟩_ξ - μ(x))² dx > 0.
Here, ρ(x) is some positive weighting function on the input space and ⟨·⟩_ξ denotes an expected value taken over possible training sets ξ(n). For unbiasedness (MSB = 0) to occur, there must exist a set of weights w* such that f(w*, x) = μ(x), and the learned weights ŵ must be "close to" w*. For "near unbiasedness", we must have w* = argmin_w MSB(w) such that MSB(w*) ≈ 0 and ŵ "close to" w*.

⁶The optimization of λ will be discussed in Moody (1992).

⁷Other error functions, such as those used in generalized linear models (see for example McCullagh and Nelder 1983) or robust statistics (see for example Huber 1981), are more appropriate than the squared error if the noise is known to be non-gaussian or the data contains many outliers.

An example of a regularizer which is not explicitly a weight decay term is:

S(w) = \int dx \, \Omega(x) \, \| \partial_{xx} f(w, x) \|^2 .    (8)

This is a smoothing term which penalizes functional fits with high curvature.

3 Prediction Risk

With μ̂(x) = f(ŵ[ξ(n)], x) denoting an estimate of the true regression function μ(x) trained on a data set ξ(n), we wish to estimate the prediction risk P, which is the expected error of μ̂(x) in predicting future data. In principle, we can either define P for models μ̂(x) trained on arbitrary training sets of size n sampled from the unknown density Ψ(y|x)Ω(x), or for training sets of size n with input density equal to the empirical density defined by the single training set available:

\Omega'(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i) .    (9)

For such training sets, the n inputs x_i are held fixed, but the response variables y_i are sampled with the conditional densities Ψ(y|x_i). Since Ω'(x) is known, but Ω(x) is generally not known a priori, we adopt the latter approach.

For a large ensemble of such training sets, the expected training set error is⁸

\langle E_{train}(\lambda) \rangle_{\xi} = \Big\langle \frac{1}{n} \sum_{i=1}^{n} E[y_i, f(\hat{w}[\xi(n)], x_i)] \Big\rangle_{\xi} = \int \frac{1}{n} \sum_{i=1}^{n} E[y_i, f(\hat{w}[\xi(n)], x_i)] \Big\{ \prod_{j=1}^{n} \Psi(y_j|x_j) \, dy_j \Big\} .    (10)

For a future exemplar (x, z) sampled with density Ψ(z|x)Ω(x), the prediction risk P is defined as:

P = \int E[z, f(\hat{w}[\xi(n)], x)] \, \Psi(z|x) \, \Omega(x) \Big\{ \prod_{j=1}^{n} \Psi(y_j|x_j) \, dy_j \Big\} \, dz \, dx .    (11)

Again, however, we don't assume that Ω(x) is known, so computing (11) is not possible.

Following Akaike (1970), Barron (1984), and numerous other authors (see Eubank 1988), we can define the prediction risk P as the expected test set error for test sets of size n, ξ'(n) = {ξ'_i = (x_i, z_i); i = 1, ..., n}, having the empirical input density Ω'(x). This expected test set error has the form:

\langle E_{test}(\lambda) \rangle_{\xi \xi'} = \Big\langle \frac{1}{n} \sum_{i=1}^{n} E[z_i, f(\hat{w}[\xi(n)], x_i)] \Big\rangle_{\xi \xi'} = \int \frac{1}{n} \sum_{i=1}^{n} E[z_i, f(\hat{w}[\xi(n)], x_i)] \Big\{ \prod_{j=1}^{n} \Psi(y_j|x_j) \Psi(z_j|x_j) \, dy_j \, dz_j \Big\} .    (12)

⁸Following the physics convention, we use angled brackets ⟨·⟩ to denote expected values. The subscripts denote the random variables being integrated over.

We take (12) as a proxy for the true prediction risk P. In order to compute (12), it will not be necessary to know the precise functional form of the noise density Φ(ε). Knowing just the noise variance σ² will enable an exact calculation for linear models trained with the SSE error and an approximate calculation correct to second order for general nonlinear models. The results of these calculations are presented in the next two sections.

4 The Expected Test Set Error for Linear Models

The relationship between expected training set and expected test set errors for linear models trained using the SSE error function with no regularizer is well known in statistics (Akaike 1970, Barron 1984, Eubank 1988).
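That classical relation, ⟨E_test⟩ = ⟨E_train⟩ + 2σ²p/n, can be checked by a small Monte Carlo experiment for ordinary least squares, with the inputs held fixed and only the responses resampled, matching the empirical-density setup above. This is a sketch, not code from the paper; the Gaussian design and Gaussian noise are illustrative assumptions:

```python
import numpy as np

# Monte Carlo check of the classical linear-model relation
#   <E_test> = <E_train> + 2 * sigma**2 * p / n
# for unregularized least squares.  The inputs x_i are held fixed
# (empirical input density); responses are resampled on every trial.
rng = np.random.default_rng(0)
n, p, sigma = 50, 4, 0.3
X = rng.normal(size=(n, p))                  # fixed design matrix
w_true = rng.normal(size=p)
mu = X @ w_true                              # noise-free targets mu(x_i)

train_errs, test_errs = [], []
for _ in range(4000):
    y = mu + sigma * rng.normal(size=n)          # training responses
    z = mu + sigma * rng.normal(size=n)          # test responses, same x_i
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0] # OLS weight estimate
    f = X @ w_hat
    train_errs.append(np.mean((y - f) ** 2))
    test_errs.append(np.mean((z - f) ** 2))

gap = np.mean(test_errs) - np.mean(train_errs)
expected_gap = 2 * sigma ** 2 * p / n        # = 0.0144 for these settings
```

With these settings the averaged test error exceeds the averaged training error by very nearly 2σ²p/n, the optimism term that the criteria discussed below estimate.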
The exact relation for test and training sets with density (9) is:

\langle E_{test} \rangle_{\xi \xi'} = \langle E_{train} \rangle_{\xi} + 2 \sigma^2 \frac{p}{n} .    (13)

As pointed out by Barron (1984), (13) can also apply approximately to the case of a nonlinear model f(w, x) trained by minimizing the sum of squared errors SSE. This approximation can be arrived at in two ways. First, the model f(w, x) can be treated as locally linear in a neighborhood of ŵ. This approximation ignores the hessian and higher order shape of f(w, x) in parameter space. Alternatively, the model f(w, x) can be assumed to be locally quadratic in parameter space w and unbiased.

However, the extension of (13) as an approximate relation for nonlinear models breaks down if any of the following situations hold:

- The SSE error function is not used. (For example, one may use a robust error measure (Huber 1981) or log likelihood error measure instead.)
- A regularization term is included in the objective function. (This introduces bias.)
- The locally linear approximation for f(w, x) is not good.
- The unbiasedness assumption for f(w, x) is incorrect.

5 The Expected Test Set Error for Nonlinear Models

For neural network models, which are typically nonparametric (thus biased) and highly nonlinear, a new relationship is needed to replace (13). We have derived such a result correct to second order for nonlinear models:

\langle E_{test}(\lambda) \rangle_{\xi \xi'} = \langle E_{train}(\lambda) \rangle_{\xi} + 2 \sigma^2_{eff} \frac{p_{eff}(\lambda)}{n} .    (14)

This result differs from the classical result (13) by the appearance of p_eff(λ) (the effective number of parameters), σ²_eff (the effective noise variance in the response variable(s)), and a dependence on λ (the regularization or weight decay parameter).

A full derivation of (14) will be presented in a longer paper (Moody 1992). The result is obtained by considering the noise terms ε_i for both the training and test sets as perturbations to an idealized model fit to noise free data.
The perturbative expansion is computed out to second order in the ε_i subject to the constraint that the estimated weights ŵ minimize the perturbed objective function. Computing expectation values and comparing the expansions for expected test and training errors yields (14). It is important to re-emphasize that deriving (14) does not require knowing the precise form of the noise density Φ(ε). Only a knowledge of σ² is assumed.

The effective number of parameters p_eff(λ) usually differs from the true number of model parameters p and depends upon the amount of model bias, model nonlinearity, and on our prior model preferences (e.g. smoothness) as determined by the regularization parameter λ and the form of our regularizer. The precise form of p_eff(λ) is

p_{eff}(\lambda) = \mathrm{tr} \, G = \frac{1}{2} \sum_{i \alpha \beta} T_{i\alpha} \, U^{-1}_{\alpha\beta} \, T_{\beta i} ,    (15)

where G is the generalized influence matrix, which generalizes the standard influence or hat matrix of linear regression, T_{iα} is the n × p matrix of derivatives of the training error function

T_{i\alpha} = \frac{\partial}{\partial y_i} \frac{\partial}{\partial w^\alpha} \, n E_{train}(w, \xi(n)) ,    (16)

and U^{-1}_{αβ} is the inverse of the hessian of the total objective function

U_{\alpha\beta} = \frac{\partial}{\partial w^\alpha} \frac{\partial}{\partial w^\beta} \, U(\lambda, w, \xi(n)) .    (17)

In the general case that σ²(x) varies with location in the input space x, the effective noise variance σ²_eff is a weighted average of the noise variances σ²(x_i). For the uniform signal plus noise model we have described above, σ²_eff = σ².

6 The Effects of Regularization

In the neural network community, the most commonly used regularizers are weight decay functions. The use of weight decay is motivated by the intuitive notion that it removes unnecessary weights from the model. An analysis of p_eff(λ) with weight decay (λ > 0) confirms this intuitive notion. Furthermore, whenever σ² > 0 and n < ∞, there exists some λ_optimal > 0 such that the expected test set error (12) is minimized.
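The trace formula (15)-(17) can be evaluated directly in its simplest special case: a model linear in its weights, SSE error, and quadratic weight decay. The sketch below is an illustration under those assumptions, not code from the paper; for λ = 0 it recovers the true number of parameters p:

```python
import numpy as np

def p_eff_quadratic_decay(X, lam):
    """Evaluate p_eff(lambda) = (1/2) * tr(T U^-1 T^T) for a model that is
    linear in its weights, f(w, x) = w . x, with SSE error and quadratic
    weight decay g(w) = w**2.  Under these assumptions T = -2 X (the mixed
    derivative of n*E_train in y and w) and the hessian of the total
    objective is U = 2 * (X^T X + lam * I)."""
    n, p = X.shape
    T = -2.0 * X
    U = 2.0 * (X.T @ X + lam * np.eye(p))
    return 0.5 * np.trace(T @ np.linalg.solve(U, T.T))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
p_unreg = p_eff_quadratic_decay(X, 0.0)   # no weight decay: equals p = 5
p_reg = p_eff_quadratic_decay(X, 25.0)    # weight decay: strictly below 5
```

At λ = 0 the expression reduces to the trace of the ordinary hat matrix X (XᵀX)⁻¹ Xᵀ, recovering the classical parameter count; increasing λ shrinks the count, as discussed next.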
This is because weight decay methods yield models with lower model variance, even though they are biased.⁹ These effects will be discussed further in Moody (1992).

For models trained with squared error (SSE) and quadratic weight decay g(w^α) = (w^α)², the assumptions of unbiasedness or local linearizability lead to the following expression for p_eff(λ), which we call the linearized effective number of parameters p_lin(λ):

p_{lin}(\lambda) = \sum_{\alpha=1}^{p} \frac{\kappa^\alpha}{\kappa^\alpha + 4\lambda} .    (18)

Here, κ^α is the αth eigenvalue of the p × p matrix K = TᵀT, with T as defined in (16).

⁹Strictly speaking, a model with quadratic weight decay is unbiased only if the "true" weights are 0.

[Figure 1: Implied, linearized, and full p_eff versus the weight decay parameter λ.] Figure 1: The full p_eff(λ) (15) agrees with the implied p_imp(λ) (19) to within experimental error, whereas the linearized p_lin(λ) (18) does not (except for very large λ). These results verify the significance of (14) and (15) for nonlinear models.

The form of p_eff(λ) can be computed easily for other weight decay functions, such as the Rumelhart form g(w^α) = (w^α)² / [(w⁰)² + (w^α)²]. The basic result for all weight decay or regularization functions, however, is that p_eff(λ) is a decreasing function of λ with p_eff(0) = p and p_eff(∞) = 0, as is evident in the special case (18). If the model is nonlinear and biased, then p_eff(0) generally differs from p.

7 Testing the Theory

To test the result (14) in a nonlinear context, we computed the full p_eff(λ) (15), the linearized p_lin(λ) (18), and the implied number of parameters p_imp(λ) (19) for a nonlinear test problem.
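The qualitative behavior of the linearized count, decreasing from p at λ = 0 toward 0 as λ → ∞, is easy to see in the ridge regression special case. The sketch below is illustrative, written in terms of the eigenvalues of XᵀX (which differ from those of K = TᵀT only by an overall scale):

```python
import numpy as np

def p_lin_ridge(X, lam):
    """Linearized effective number of parameters for ridge regression,
    in terms of the eigenvalues mu_a of X^T X:
        p_lin(lam) = sum_a mu_a / (mu_a + lam).
    Each eigendirection contributes a shrinkage factor between 1 (lam = 0,
    direction fully used) and 0 (lam -> inf, direction fully suppressed)."""
    mu = np.linalg.eigvalsh(X.T @ X)
    return float(np.sum(mu / (mu + lam)))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
counts = [p_lin_ridge(X, lam) for lam in (0.0, 1.0, 10.0, 1e6)]
# counts decreases monotonically from p = 5 toward 0 as lam grows.
```

Eigendirections with large eigenvalues resist the penalty while weakly determined directions are suppressed first, which is the sense in which weight decay "removes unnecessary weights."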
The value of p_imp(λ) is obtained by computing the expected training and test errors for an ensemble of training sets of size n with known noise variance σ² and solving for p_eff(λ) in equation (14):

\hat{p}_{imp}(\lambda) = \frac{n}{2 \sigma^2} \left( \langle \hat{E}_{test}(\lambda) \rangle - \langle \hat{E}_{train}(\lambda) \rangle \right) .    (19)

The hats indicate Monte Carlo estimates based on computations using a finite ensemble (10 in our experiments) of training sets. The test problem was to fit training sets of size 50 generated as a sum of three sigmoids plus noise, with the noise sampled from the uniform density. The model architecture f(w, x) was also a sum of three sigmoids, and the weights w were estimated by minimizing (7) with quadratic weight decay. See figure 1.

8 GPE: An Estimate of Prediction Risk for Nonlinear Systems

A number of well established, closely related criteria for estimating the prediction risk for linear or linearizable models are available. These include Akaike's FPE (1970), Akaike's AIC (1973), Mallows' Cp (1973), and Barron's PSE (1984). (See also Akaike (1974) and Eubank (1988).) These estimates are all based on equation (13).

The generalized prediction error (GPE) generalizes the classical estimators FPE, AIC, Cp, and PSE to the nonlinear setting by estimating (14) as follows:

GPE(\lambda) = \hat{P}_{GPE} = \hat{E}_{train}(\lambda) + 2 \hat{\sigma}^2_{eff} \frac{\hat{p}_{eff}(\lambda)}{n} .    (20)

The estimation process and the quality of the resulting GPE estimates will be described in greater detail elsewhere.

Acknowledgements

The author wishes to thank Andrew Barron and Joseph Chang for helpful conversations. This research was supported by AFOSR grant 89-0478 and ONR grant N00014-89-J-1228.

References

H. Akaike. (1970). Statistical predictor identification. Ann. Inst. Stat. Math., 22:203.

H. Akaike. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd Intl. Symp. on Information Theory, Akademia Kiado, Budapest, 267.

H. Akaike. (1974). A new look at the statistical model identification. IEEE Trans. Auto. Control, 19:716-723.

A. Barron. (1984). Predicted squared error: a criterion for automatic model selection. In Self-Organizing Methods in Modeling, S. Farlow, ed., Marcel Dekker, New York.

R. Eubank. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York.

P. J. Huber. (1981). Robust Statistics. Wiley, New York.

C. L. Mallows. (1973). Some comments on Cp. Technometrics, 15:661-675.

P. McCullagh and J. A. Nelder. (1983). Generalized Linear Models. Chapman and Hall, New York.

J. Moody. (1991). Note on generalization, regularization, and architecture selection in nonlinear learning systems. In B. H. Juang, S. Y. Kung, and C. A. Kamm, editors, Neural Networks for Signal Processing, IEEE Press, Piscataway, NJ.

J. Moody. (1992). Long version of this paper, in preparation.

J. Moody and J. Utans. (1992). Principled architecture selection for neural networks: application to corporate bond rating prediction. In this volume.
", "award": [], "sourceid": 530, "authors": [{"given_name": "John", "family_name": "Moody", "institution": null}]}