{"title": "A Simple Weight Decay Can Improve Generalization", "book": "Advances in Neural Information Processing Systems", "page_first": 950, "page_last": 957, "abstract": null, "full_text": "A Simple Weight Decay Can Improve Generalization \n\nAnders Krogh\u00b7 \nCONNECT, The Niels Bohr Institute \nBlegdamsvej 17 \nDK-2100 Copenhagen, Denmark \nkrogh@cse.ucsc.edu \n\nJohn A. Hertz \nNordita \nBlegdamsvej 17 \nDK-2100 Copenhagen, Denmark \nhertz@nordita.dk \n\nAbstract \n\nIt has been observed in numerical simulations that a weight decay can improve generalization in a feed-forward neural network. This paper explains why. It is proven that a weight decay has two effects in a linear network. First, it suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. Second, if the size is chosen right, a weight decay can suppress some of the effects of static noise on the targets, which improves generalization quite a lot. It is then shown how to extend these results to networks with hidden layers and non-linear units. Finally the theory is confirmed by some numerical simulations using the data from NetTalk. \n\n1 INTRODUCTION \n\nMany recent studies have shown that the generalization ability of a neural network (or any other 'learning machine') depends on a balance between the information in the training examples and the complexity of the network, see for instance [1,2,3]. Bad generalization occurs if the information does not match the complexity, e.g. if the network is very complex and there is little information in the training set. In this last instance the network will be over-fitting the data, and the opposite situation corresponds to under-fitting. \n\n\u00b7Present address: Computer and Information Sciences, Univ. of California Santa Cruz, Santa Cruz, CA 95064. 
 \n\nOften the number of free parameters, i.e. the number of weights and thresholds, is used as a measure of the network complexity, and algorithms have been developed that minimize the number of weights while still keeping the error on the training examples small [4,5,6]. This minimization of the number of free parameters is not always what is needed. \n\nA different way to constrain a network, and thus decrease its complexity, is to limit the growth of the weights through some kind of weight decay. It should prevent the weights from growing too large unless it is really necessary. It can be realized by adding a term to the cost function that penalizes large weights, \n\n$$E(w) = E_0(w) + \\frac{1}{2} \\lambda \\sum_i w_i^2, \\qquad (1)$$ \n\nwhere $E_0$ is one's favorite error measure (usually the sum of squared errors), and $\\lambda$ is a parameter governing how strongly large weights are penalized. $w$ is a vector containing all free parameters of the network; it will be called the weight vector. If gradient descent is used for learning, the last term in the cost function leads to a new term $-\\lambda w_i$ in the weight update: \n\n$$\\dot{w}_i \\propto -\\frac{\\partial E_0}{\\partial w_i} - \\lambda w_i. \\qquad (2)$$ \n\nHere it is formulated in continuous time. If the gradient of $E_0$ (the 'force term') were not present this equation would lead to an exponential decay of the weights. \n\nObviously there are infinitely many possibilities for choosing other forms of the additional term in (1), but here we will concentrate on this simple form. \n\nIt has been known for a long time that a weight decay of this form can improve generalization [7], but until now this has not been very widely recognized. The aim of this paper is to analyze this effect both theoretically and experimentally. Weight decay as a special kind of regularization is also discussed in [8,9]. 
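 \n\nThe remark about exponential decay can be made concrete with a short worked example. This is only a sketch: it takes the constant of proportionality in (2) to be one, writes $\\lambda$ for the weight decay parameter of (1), and introduces a step size $\\epsilon$ that does not appear in the paper and is assumed here purely for illustration. Without the force term the weights decay exponentially, and an Euler discretization of (2) shrinks every weight by a constant factor at each step: \n\n$$\\dot{w}_i = -\\lambda w_i \\;\\Rightarrow\\; w_i(t) = w_i(0)\\, e^{-\\lambda t}, \\qquad w_i \\leftarrow w_i - \\epsilon \\left( \\frac{\\partial E_0}{\\partial w_i} + \\lambda w_i \\right) = (1 - \\epsilon \\lambda)\\, w_i - \\epsilon \\frac{\\partial E_0}{\\partial w_i}.$$ \n\nIn this discretized form the penalty acts as a multiplicative shrinkage $(1 - \\epsilon \\lambda)$ applied before the usual gradient step.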
 \n\n2 FEED-FORWARD NETWORKS \n\nA feed-forward neural network implements a function of the inputs that depends on the weight vector $w$; it is called $f_w$. For simplicity it is assumed that there is only one output unit. When the input is $\\xi$ the output is $f_w(\\xi)$. Note that the input vector is a vector in the $N$-dimensional input space, whereas the weight vector is a vector in the weight space, which has a different dimension $W$. \n\nThe aim of the learning is not only to learn the examples, but to learn the underlying function that produces the targets for the learning process. First, we assume that this target function can actually be implemented by the network. This means there exists a weight vector $u$ such that the target function is equal to $f_u$. The network with parameters $u$ is often called the teacher, because from input vectors it can produce the right targets. The sum of squared errors is \n\n$$E_0(w) = \\frac{1}{2} \\sum_{\\mu=1}^{p} [f_u(\\xi^\\mu) - f_w(\\xi^\\mu)]^2, \\qquad (3)$$ \n\nwhere $p$ is the number of training patterns. The learning equation (2) can then be written \n\n$$\\dot{w}_i \\propto \\sum_\\mu [f_u(\\xi^\\mu) - f_w(\\xi^\\mu)] \\frac{\\partial f_w(\\xi^\\mu)}{\\partial w_i} - \\lambda w_i. \\qquad (4)$$ \n\n[...] $\\lambda > 0$, the constant part in $v^\\perp$ will decay to zero asymptotically (as $e^{-\\lambda t}$, where $t$ is the time). An infinitesimal weight decay will therefore choose the solution with the smallest norm out of all the solutions in the valley described above. This solution can be shown to be the optimal one on average. \n\n4 LEARNING WITH AN UNRELIABLE TEACHER \n\nRandom errors made by the teacher can be modeled by adding a random term $\\eta$ to the targets: \n\n$$f_u(\\xi^\\mu) \\rightarrow f_u(\\xi^\\mu) + \\eta^\\mu. \\qquad (12)$$ \n\nThe variance of $\\eta$ is called $\\sigma^2$, and it is assumed to have zero mean. Note that these targets are not exactly realizable by the network (for $\\alpha > 0$), and therefore this is a simple model for studying learning of an unrealizable function. 
 \n\nWith this noise the learning equation (2) becomes \n\n$$\\dot{w}_i \\propto \\sum_\\mu \\Big( N^{-1} \\sum_j v_j \\xi_j^\\mu + N^{-1/2} \\eta^\\mu \\Big) \\xi_i^\\mu - \\lambda w_i. \\qquad (13)$$ \n\nTransforming it to the basis where $A$ is diagonal as before, \n\n$$\\dot{v}_r \\propto -(A_r + \\lambda) v_r + \\lambda u_r - N^{-1/2} \\sum_\\mu \\eta^\\mu \\xi_r^\\mu. \\qquad (14)$$ \n\nThe asymptotic solution to this equation is \n\n$$v_r = \\frac{\\lambda u_r - N^{-1/2} \\sum_\\mu \\eta^\\mu \\xi_r^\\mu}{\\lambda + A_r}. \\qquad (15)$$ \n\nThe contribution to the generalization error is the square of this summed over all $r$. If averaged over the noise (shown by the bar) it becomes for each $r$ \n\n$$\\overline{v_r^2} = \\frac{\\lambda^2 u_r^2 + \\sigma^2 A_r}{(\\lambda + A_r)^2}. \\qquad (16)$$ \n\nThis expression has a minimum in $\\lambda$, which can be found by putting the derivative with respect to $\\lambda$ equal to zero, $\\lambda_r^{\\mathrm{optimal}} = \\sigma^2 / u_r^2$. Remarkably it depends only on $u$ and the variance of the noise, and not on $A$. If it is assumed that $u$ is random, (16) can be averaged over $u$. This yields an optimal $\\lambda$ independent of $r$, \n\n$$\\lambda_{\\mathrm{optimal}} = \\frac{\\sigma^2}{\\overline{u^2}}, \\qquad (17)$$ \n\nwhere $\\overline{u^2}$ is the average of $N^{-1} |u|^2$. In this case the weight decay to some extent prevents the network from fitting the noise. \n\nFigure 1: Generalization error as a function of $\\alpha = p/N$. The full line is for $\\lambda = \\sigma^2 = 0.2$, and the dashed line for $\\lambda = 0$. The dotted line is the generalization error with no noise and $\\lambda = 0$. \n\nFrom equation (14) one can see that the noise is projected onto the pattern subspace. Therefore the contribution to the generalization error from $v^\\perp$ is the same as before, and this contribution is on average minimized by a weight decay of any size. \n\nEquation (17) was derived in [10] in the context of a particular eigenvalue spectrum. Fig. 1 shows the dramatic improvement in generalization error when the optimal weight decay is used in this case. The present treatment shows that (17) is independent of the spectrum of $A$. 
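 \n\nThe location of the optimum can be checked in a few lines. The following is a sketch under stated assumptions, not a quotation of the original derivation: it writes $\\sigma^2$ for the variance of the noise and assumes that the noise-averaged contribution of mode $r$ has the form $(\\lambda^2 u_r^2 + \\sigma^2 A_r)/(\\lambda + A_r)^2$, which is what squaring the asymptotic solution (15) and averaging over zero-mean, uncorrelated noise gives when $A_r = N^{-1} \\sum_\\mu (\\xi_r^\\mu)^2$. Differentiating with respect to $\\lambda$, \n\n$$\\frac{\\partial}{\\partial \\lambda} \\frac{\\lambda^2 u_r^2 + \\sigma^2 A_r}{(\\lambda + A_r)^2} = \\frac{2 A_r (\\lambda u_r^2 - \\sigma^2)}{(\\lambda + A_r)^3} = 0 \\;\\Rightarrow\\; \\lambda = \\frac{\\sigma^2}{u_r^2}.$$ \n\nFor $A_r > 0$ the derivative is negative below this value and positive above it, so the stationary point is a minimum, and $A_r$ drops out of its location.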
 \n\nWe conclude that a weight decay has two positive effects on generalization in a linear network: 1) It suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. 2) If the size is chosen right, it can suppress some of the effect of static noise on the targets. \n\n5 NON-LINEAR NETWORKS \n\nIt is not possible to analyze a general non-linear network exactly, as done above for the linear case. By a local linearization it is, however, possible to draw some interesting conclusions from the results in the previous section. \n\nAssume the function is realizable, $f = f_u$. Then learning corresponds to solving the $p$ equations \n\n$$f_w(\\xi^\\mu) = f_u(\\xi^\\mu), \\quad \\mu = 1, \\ldots, p, \\qquad (18)$$ \n\nin $W$ variables, where $W$ is the number of weights. For $p < W$ these equations define a manifold in weight space of dimension at least $W - p$. Any point $\\tilde{w}$ on this manifold gives a learning error of zero, and therefore (4) can be expanded around $\\tilde{w}$. Putting $v = \\tilde{w} - w$, expanding $f_w$ in $v$, and using it in (4) yields \n\n$$\\dot{v}_i \\propto -\\sum_{\\mu,j} \\frac{\\partial f_w(\\xi^\\mu)}{\\partial w_j} v_j \\frac{\\partial f_w(\\xi^\\mu)}{\\partial w_i} + \\lambda (\\tilde{w}_i - v_i) = -\\sum_j A_{ij}(\\tilde{w}) v_j - \\lambda v_i + \\lambda \\tilde{w}_i. \\qquad (19)$$ \n\n(The derivatives in this equation should be taken at $\\tilde{w}$.) The analogue of $A$ is defined as \n\n$$A_{ij}(\\tilde{w}) = \\sum_\\mu \\frac{\\partial f_w(\\xi^\\mu)}{\\partial w_i} \\frac{\\partial f_w(\\xi^\\mu)}{\\partial w_j}. \\qquad (20)$$ \n\nSince it is of outer product form (like $A$) its rank $R(\\tilde{w}) \\leq \\min\\{p, W\\}$. Thus when $p < W$, $A$ is never of full rank. The rank of $A$ is of course equal to $W$ minus the dimension of the manifold mentioned above. \n\nFrom these simple observations one can argue that good generalization should not be expected for $p < W$. This is in accordance with other results (cf. [3]), and with current 'folk-lore'. The difference from the linear case is that the 'rain gutter' need not be (and most probably is not) linear, but curved in this case. 
There may in fact be other valleys or rain gutters disconnected from the one containing $u$. One can also see that if $A$ has full rank, all points in the immediate neighborhood of $w = u$ give a learning error larger than 0, i.e. there is a simple minimum at $u$. \n\nAssume that the learning finds one of these valleys. A small weight decay will pick out the point in the valley with the smallest norm among all the points in the valley. In general it cannot be proven that picking that solution is the best strategy. But, at least from a philosophical point of view, it seems sensible, because it is (in a loose sense) the solution with the smallest complexity, the one that Ockham would probably have chosen. \n\nThe value of a weight decay is more evident if there are small errors in the targets. In that case one can go through exactly the same line of arguments as for the linear case to show that a weight decay can improve generalization, and even with the same optimal choice (17) of $\\lambda$. This is strictly true only for small errors (where the linear approximation is valid). \n\n6 NUMERICAL EXPERIMENTS \n\nA weight decay has been tested on the NetTalk problem [11]. In the simulations back-propagation derived from the 'entropic error measure' [12] with a momentum term fixed at 0.8 was used. The network had 7 x 26 input units, 40 hidden units and 26 output units, in all about 8400 weights. It was trained on 400 to 5000 random words from the data base of around 20,000 words, and tested on a different set of 1000 random words. The training set and test set were independent from run to run. \n\nFigure 2: The top full line corresponds to the generalization error after 300 epochs (300 cycles through the training set) without a weight decay. The lower full line is with a weight decay. The top dotted line is the lowest error seen during learning without a weight decay, and the lower dotted line is the same with a weight decay. The size of the weight decay was $\\lambda = 0.00008$. Inset: Same figure except that the error rate is shown instead of the squared error. The error rate is the fraction of wrong phonemes when the phoneme vector with the smallest angle to the actual output is chosen, see [11]. \n\nResults are shown in fig. 2. There is a clear improvement in generalization error when weight decay is used. There is also an improvement in error rate (inset of fig. 2), but it is less pronounced in terms of relative improvement. Results shown here are for a weight decay of $\\lambda = 0.00008$. The values 0.00005 and 0.0001 were also tried and gave basically the same curves. \n\n7 CONCLUSION \n\nIt was shown how a weight decay can improve generalization in two ways: 1) It suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. 2) If the size is chosen right, a weight decay can suppress some of the effect of static noise on the targets. Static noise on the targets can be viewed as a model of learning an unrealizable function. The analysis assumed that the network could be expanded around an optimal weight vector, and therefore it is strictly valid only in a small neighborhood around that vector. \n\nThe improvement from a weight decay was also tested by simulations. For the NetTalk data it was shown that a weight decay can decrease the generalization error (squared error) and also, although less significantly, the actual mistake rate of the network when the phoneme closest to the output is chosen. 
 \n\nAcknowledgements \n\nAK acknowledges support from the Danish Natural Science Council and the Danish Technical Research Council through the Computational Neural Network Center (CONNECT). \n\nReferences \n\n[1] D.B. Schwartz, V.K. Samalam, S.A. Solla, and J.S. Denker. Exhaustive learning. Neural Computation, 2:371-382, 1990. \n\n[2] N. Tishby, E. Levin, and S.A. Solla. Consistent inference of probabilities in layered networks: predictions and generalization. In International Joint Conference on Neural Networks, pages 403-410, (Washington 1989), IEEE, New York, 1989. \n\n[3] E.B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1:151-160, 1989. \n\n[4] Y. Le Cun, J.S. Denker, and S.A. Solla. Optimal brain damage. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems, pages 598-605, (Denver 1989), Morgan Kaufmann, San Mateo, 1990. \n\n[5] H.H. Thodberg. Improving generalization of neural networks through pruning. International Journal of Neural Systems, 1:317-326, 1990. \n\n[6] D.H. Weigend, D.E. Rumelhart, and B.A. Huberman. Generalization by weight-elimination with application to forecasting. In R.P. Lippmann et al., editors, Advances in Neural Information Processing Systems, pages 875-882, (Denver 1989), Morgan Kaufmann, San Mateo, 1991. \n\n[7] G.E. Hinton. Learning translation invariant recognition in a massively parallel network. In G. Goos and J. Hartmanis, editors, PARLE: Parallel Architectures and Languages Europe. Lecture Notes in Computer Science, pages 1-13, Springer-Verlag, Berlin, 1987. \n\n[8] J. Moody. Generalization, weight decay, and architecture selection for nonlinear learning systems. These proceedings. \n\n[9] D. MacKay. A practical bayesian framework for backprop networks. These proceedings. \n\n[10] A. Krogh and J.A. Hertz. Generalization in a Linear Perceptron in the Presence of Noise. To appear in Journal of Physics A 1992. \n\n[11] T.J. Sejnowski and C.R. Rosenberg. Parallel networks that learn to pronounce english text. Complex Systems, 1:145-168, 1987. \n\n[12] J.A. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991. \n", "award": [], "sourceid": 563, "authors": [{"given_name": "Anders", "family_name": "Krogh", "institution": null}, {"given_name": "John", "family_name": "Hertz", "institution": null}]}