{"title": "Dynamic Behavior of Constained Back-Propagation Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 642, "page_last": 649, "abstract": "", "full_text": "642 \n\nChauvin \n\nDynamic Behavior of Constrained \n\nBack-Propagation Networks \n\nYves Chauvin! \n\nThomson-CSF, Inc. \n\n630 Hansen Way, Suite 250 \n\nPalo Alto, CA. 94304 \n\nABSTRACT \n\nThe learning dynamics of the back-propagation algorithm are in(cid:173)\nvestigated when complexity constraints are added to the standard \nLeast Mean Square (LMS) cost function. It is shown that loss of \ngeneralization performance due to overtraining can be avoided \nwhen using such complexity constraints. Furthermore, \"energy,\" \nhidden representations and weight distributions are observed and \ncompared during learning. An attempt is made at explaining the \nresults in terms of linear and non-linear effects in relation to the \ngradient descent learning algorithm. \n\n1 INTRODUCTION \nIt is generally admitted that generalization performance of back-propagation net(cid:173)\nworks (Rumelhart, Hinton & Williams, 1986) will depend on the relative size ofthe \ntraining data and of the trained network. By analogy to curve-fitting and for theo(cid:173)\nretical considerations, the generalization performance of the network should de(cid:173)\ncrease as the size of the network and the associated number of degrees of freedom \nincrease (Rumelhart, 1987; Denker et al., 1987; Hanson & Pratt, 1989). \nThis paper examines the dynamics of the standard back-propagation algorithm \n(BP) and of a constrained back-propagation variation (CBP), designed to adapt \nthe size of the network to the training data base. The performance, learning \ndynamics and the representations resulting from the two algorithms are compared. \n\n1. Also in the Psychology Department, Stanford University, Stanford, CA. 
94305 \n\n2 GENERALIZATION PERFORMANCE \n\n2.1 STANDARD BACK-PROPAGATION \n\nIn Chauvin (In Press), the generalization performance of a back-propagation network was observed for a classification task from spectrograms into phonemic categories (single speaker, 9 phonemes, 10 ms x 16 frequencies spectrograms, 63 training patterns, 27 test patterns). This performance was examined as a function of the number of training cycles and of the number of (logistic) hidden units (see also Morgan & Bourlard, 1989). During early learning, the performance of the network appeared to be basically independent of the number of hidden units (provided a minimal size). However, after prolonged training, performance started to decrease with training at a rate that was a function of the size of the hidden layer. More precisely, from 500 to 10,000 cycles, the generalization performance (in terms of the percentage of correctly classified spectrograms) decreased from about 93% to 74% for a 5 hidden unit network and from about 95% to 62% for a 10 hidden unit network. These results confirmed the basic hypothesis proposed in the Introduction, but only with a sufficient number of training cycles (overtraining). \n\n2.2 CONSTRAINED BACK-PROPAGATION \n\nSeveral constraints have been proposed to \"adapt\" the size of the trained network to the training data. These constraints can act directly on the weights, or on the net input or activation of the hidden units (Rumelhart, 1987; Chauvin, 1987, 1989, In Press; Hanson & Pratt, 1989; Ji, Snapp & Psaltis, 1989; Ishikawa, 1989; Golden and Rumelhart, 1989). 
The complete cost function adopted in Chauvin (In Press) for the speech labeling task was the following: \n\nC = αEr + βEn + γW = α Σ_ip (t_ip - o_ip)^2 + β Σ_ip o_ip^2 / (1 + o_ip^2) + γ Σ_ij w_ij^2 / (1 + w_ij^2)   [1] \n\nEr is the usual LMS error computed at the output layer, En is a function of the squared activations of the hidden units, and W is a function of the squared weights throughout the network. This constrained back-propagation (CBP) algorithm basically eliminated the overtraining effect: the resulting generalization performance remained constant (about 95%) throughout the complete training period, independently of the original network size. \n\n3 ERROR AND ENERGY DYNAMICS \n\nUsing the same speech labeling task as in Chauvin (In Press), the dynamics of the global variables of the network defined in Equation 1 (Er, En, and W) were observed during training of a network with 5 hidden units. Figure 1 represents the error and energy dynamics for the standard (BP) and the constrained back-propagation algorithm (CBP). \n\nFigure 1. \"Energy\" (left) and generalization error - LMS averaged over the test patterns and output units - (right) when using the standard (BP) or the constrained (CBP) back-propagation algorithm during a typical run. \n\nFor BP and CBP, the error on the training patterns kept decreasing during the entire training period (more slowly for CBP). 
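The composite cost in Equation 1 can be written as a short numerical sketch. The following is a minimal NumPy version (not from the paper); the default values of α, β, and γ are illustrative placeholders, since the coefficients actually used are not reported here:

```python
import numpy as np

def cbp_cost(targets, outputs, hidden, weights, alpha=1.0, beta=0.1, gamma=0.1):
    """Constrained cost C = alpha*Er + beta*En + gamma*W (Equation 1).

    targets, outputs: arrays of shape (patterns, output units);
    hidden: array of shape (patterns, hidden units);
    weights: iterable of weight matrices for the whole network.
    alpha, beta, gamma are illustrative values, not the paper's.
    """
    # Er: usual LMS error at the output layer
    Er = np.sum((targets - outputs) ** 2)
    # En: "energy" of the hidden unit activations, o^2 / (1 + o^2)
    En = np.sum(hidden ** 2 / (1.0 + hidden ** 2))
    # W: saturating penalty on the squared weights, w^2 / (1 + w^2)
    W = sum(np.sum(w ** 2 / (1.0 + w ** 2)) for w in weights)
    return alpha * Er + beta * En + gamma * W
```

Note that the w^2/(1 + w^2) term saturates for large weights, so it penalizes many small weights more than a few large ones; this is consistent with the weight distributions reported in Section 4.2.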
The W dynamics over the entire network were similar for BP and CBP (but the distributions were different, see below). \n\n3.1 STANDARD BACK-PROPAGATION \n\nAs shown in Figure 1, the \"energy\" En (Equation 1) of the hidden layer slightly increases during the entire learning period, long after the minimum was reached for the test error (around 200 cycles). This \"energy\" reaches a plateau after long overtraining, around 10,000 cycles. The generalization error reaches a minimum and later increases as training continues, also slowly reaching a plateau around 10,000 cycles. \n\n3.2 CONSTRAINED BACK-PROPAGATION \n\nWith CBP, the \"energy\" decreases to a much lower level during early learning and remains about constant throughout the complete training period. The error quickly decreases during early learning and remains about constant during the rest of the training period, apparently stabilized by the energy and weight constraints given in Equation 1. \n\n4 REPRESENTATION \n\nThe hidden unit activations and weights of the networks were examined after learning, using BP or CBP. A hidden unit was considered \"dead\" when its contribution to any output unit (computed as the product of its activation times the corresponding outgoing weight) was at least 50 times smaller than the total contribution from all hidden units, over the entire set of input patterns. \n\n4.1 STANDARD BACK-PROPAGATION \n\nAs also observed by Hanson et al. (1989), standard back-propagation usually makes use of most or all hidden units: the representation of the input patterns is well distributed over the entire set of hidden units, even if the network is oversized for the task. The exact representation depends on the initial weights. \n\n4.2 CONSTRAINED BACK-PROPAGATION \n\nUsing the constraints described in Equation 1, 
the hidden layer was reduced to 2 or 3 hidden units for all the observed runs (2 hidden units correspond to the minimal size network necessary to solve the task). All the other units were actually \"killed\" during learning, independently of the size of the original network (from 4 to 11 units in the simulations). Both the constraints on the hidden unit activations (En) and on the weights (W) contribute to this reduction. \n\nFigure 2 represents an example of the resulting weights from the input layer to a remaining hidden unit. As we can see, a few weights ended up dominating the entire set: they actually \"picked up\" a characteristic of the input spectrograms that allows the distinction between two phoneme categories (this phenomenon was also predicted and observed by Rumelhart, 1989). In this case, the weights \"picked up\" the 10th and 14th frequency components of the spectrograms, both present during the 5th time interval. The characteristics of the spectrum make the corresponding hidden unit especially responsive to the [0] phoneme. The specific non-linear W constraint on the input-to-hidden weights used by CBP forced that hidden unit to acquire a very local receptive field. Note that this was not always observed in the simulations. Some hidden units acquired broad receptive fields with weights distributed over the entire spectrogram (as is always the case with standard BP). No statistical comparison was made to compute the relative ratio of local to distributed units, which probably depends on the exact form of the reduction constraint used in CBP. \n\n5 INTERPRETATION OF RESULTS \n\nWe observed that the occurrence of overfitting effects depends both on the size of the network and on the number of training cycles. At this point, a better theoretical understanding of the back-propagation learning dynamics would be useful to explain this dependency (Chauvin, 
In Preparation). This section presents an informal interpretation of the results in terms of linear and non-linear phenomena. \n\nFigure 2. Typical fan-in weights after learning from the input layer to a hidden unit using the constrained back-propagation algorithm. \n\n5.1 LINEAR PHENOMENA \n\nThese linear phenomena might be due to probable correlations between sample plus observation noise at the input level and the desired classification at the output level. The gradient descent learning rule should eventually make use of these correlations to decrease the LMS error. However, these correlations are specific to the training data set used and should have a negative impact on the performance of the network on a testing data set. Figure 3 represents the generalization performance of linear networks with 1 and 7 hidden units (averaged over 5 runs) for the speech labeling task described above. As predicted, we can see that overtraining effects are actually generated by linear networks (as they would be with a one-step algorithm; e.g., Vallet et al., 1989). Interestingly, they occur even when the size of the network is minimal. These effects should obviously decrease by increasing the size of the training data set (therefore reducing the effect of sample and observation noise). 
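The linear overtraining effect described above can be reproduced with a toy simulation. The sketch below (illustrative only, not the speech task) runs gradient descent on a noisy linear regression problem with an ill-conditioned input distribution and records the training and test LMS errors per cycle; in runs like this the training error decreases monotonically while the test error typically passes through a minimum and then climbs, as the small-scale, noise-dominated input directions are fitted last:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_linear_gd(steps=2000):
    # Toy problem: widely spread feature scales make the design ill-conditioned,
    # so gradient descent fits the large (signal-carrying) directions first
    # and the small (noise-dominated) directions only after long training.
    n_train, n_test, d = 20, 200, 10
    scales = np.logspace(0, -2, d)
    X = rng.normal(size=(n_train, d)) * scales
    Xt = rng.normal(size=(n_test, d)) * scales
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.3 * rng.normal(size=n_train)  # noisy training targets
    yt = Xt @ w_true                                 # clean test targets
    H = 2.0 * X.T @ X / n_train                      # Hessian of the mean squared error
    lr = 1.0 / np.linalg.eigvalsh(H).max()           # step size guaranteeing descent
    w = np.zeros(d)
    train_errs, test_errs = [], []
    for _ in range(steps):
        train_errs.append(np.mean((X @ w - y) ** 2))
        test_errs.append(np.mean((Xt @ w - yt) ** 2))
        w -= lr * 2.0 * X.T @ (X @ w - y) / n_train  # gradient step on the LMS cost
    return np.array(train_errs), np.array(test_errs)
```

Plotting the two curves returned by `run_linear_gd` gives the qualitative picture of Figure 3: a monotone training curve and a test curve whose gap to it widens with continued training.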
\n\nFigure 3. LMS error for the training and test data sets of a speech labeling task as a function of the number of training cycles. A one hidden and a 7 hidden unit linear network are considered. \n\n5.2 NON-LINEAR PHENOMENA \n\nThe second type of effect is non-linear. This is illustrated in Chauvin (In Press) with a curve-fitting problem. In the first problem, a non-linear back-propagation network (1 input unit, 1 output unit, 2 layers of 20 hidden units) is trained to fit a function composed of two linear segments separated by a discontinuity. The mapping realized by the network over the entire interval is observed as a function of the number of training cycles. It appears that the interpolated fit reaches a minimum and gets worse with the number of training cycles and with the size of the sample training set around the discontinuity. \n\nThis phenomenon is evocative of an effect in interpolation theory known as the Runge effect (Steffensen, 1950). In this case, a \"well-behaved\" bell-like function, f(x) = 1/(1 + x^2), uniformly sampled n+1 times over a [-D, +D] interval, is fitted with a polynomial of degree n. Runge showed that over the considered interval, the maximum distance between the fitted function and the fitting polynomial goes to infinity as n increases. Note that in theory, there is no overfitting since the number of degrees of freedom associated with the polynomial matches the number of data points. 
However, the interpolation \"overfitting effect\" actually increases with the sampling data set, that is, with the increased accuracy in the description of the fitted function. (Runge also showed that the effect may disappear by changing the size of the sampled interval or the distribution of the sampling data points.) \n\nWe can notice that in the piecewise linear example, a linear network would have computed a linear mapping using only two degrees of freedom (the problem is then equivalent to one-dimensional linear regression). With a non-linear network, simulations show that the network actually computes the desired mapping by slowly fitting higher and higher \"frequency components\" present in the desired mapping (reminiscent of the Gibbs phenomenon observed with successive Fourier series approximations of a square wave; e.g., Sommerfeld, 1949). The discontinuity, considered as a singular point with high frequency components, is fitted during later stages of learning. Increasing the number of sampling points around the discontinuity generates an effect similar to the Runge effect with overtraining. In this sense, the notion of degrees of freedom in non-linear neural networks is not only a function of the network architecture - the \"capacity\" of the network - and of the non-linearities of the fitted function but also of the learning algorithm (gradient descent), which gradually \"adjusts\" the \"capacity\" of the network to fit the non-linearities required by the desired function. \n\nA practical classification task might generate not only linear overtraining effects due to sample and observation noise but also non-linear effects if a continuous input variable (such as a frequency component in the speech example) has to be classified into two different bins. It is also easy to imagine that noise may generate non-linear effects. 
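The Runge effect mentioned above is easy to reproduce numerically. The sketch below (illustrative, not from the paper) interpolates f(x) = 1/(1 + x^2) at n+1 equispaced nodes on [-5, 5] using the barycentric form of the Lagrange interpolating polynomial, and measures the maximum deviation from f over the interval; the deviation grows with n even though the polynomial matches the data exactly at the nodes:

```python
import numpy as np

def runge_interp_error(n, D=5.0, m=2003):
    """Max |p_n(x) - f(x)| on [-D, D] for equispaced interpolation of f = 1/(1+x^2)."""
    xk = np.linspace(-D, D, n + 1)          # equispaced nodes
    fk = 1.0 / (1.0 + xk ** 2)              # samples of the Runge function
    # Barycentric weights: w_j = 1 / prod_{k != j} (x_j - x_k)
    w = np.array([1.0 / np.prod(xk[j] - np.delete(xk, j)) for j in range(n + 1)])
    xs = np.linspace(-D, D, m)
    p = np.empty_like(xs)
    for i, x in enumerate(xs):
        d = x - xk
        hit = np.argmin(np.abs(d))
        if abs(d[hit]) < 1e-12:
            p[i] = fk[hit]                  # evaluation point coincides with a node
        else:
            p[i] = np.sum(w * fk / d) / np.sum(w / d)
    return np.max(np.abs(p - 1.0 / (1.0 + xs ** 2)))
```

Comparing, say, `runge_interp_error(6)` with `runge_interp_error(18)` shows the maximum error growing by more than an order of magnitude as the degree increases, the interpolation analogue of overtraining discussed in the text.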
At this stage, the non-linear effects involved in back-propagation networks composed of logistic hidden units are poorly understood. In general, both effects will probably occur in non-linear networks and might be difficult to assess. However, because of the gradient descent procedure, both effects seem to depend on the amount of training relative to the capacity of the network. The use of constraints acting on the complexity of the network seems to constitute a promising solution to the overtraining problem in both the linear and non-linear cases. \n\nAcknowledgements \n\nI am grateful to Pierre Baldi, Fred Fisher, Matt Franklin, Richard Golden, Julie Holmes, Erik Marcade, Yoshiro Miyata, David Rumelhart and Charlie Schley for helpful comments. \n\nReferences \n\nChauvin, Y. (1987). Generalization as a function of the number of hidden units in back-propagation networks. Unpublished Manuscript. University of California, San Diego, CA. \n\nChauvin, Y. (1989). A back-propagation algorithm with optimal use of the hidden units. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems 1. Palo Alto, CA: Morgan Kaufman. \n\nChauvin, Y. (In Press). Generalization performance of back-propagation networks. Proceedings of the 1990 European Conference on Signal Processing (Eurasip). Springer-Verlag. \n\nChauvin, Y. (In Preparation). Generalization performance of LMS trained linear networks. \n\nDenker, J. 
S., Schwartz, D. B., Wittner, B. S., Solla, S. A., Howard, R. E., Jackel, L. D., & Hopfield, J. J. (1987). Automatic learning, rule extraction, and generalization. Complex Systems, 1, 877-922. \n\nGolden, R. M., & Rumelhart, D. E. (1989). Improving generalization in multi-layer networks through weight decay and derivative minimization. Unpublished Manuscript. Stanford University, Palo Alto, CA. \n\nHanson, S. J. & Pratt, L. P. (1989). Comparing biases for minimal network construction with back-propagation. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems 1. Palo Alto, CA: Morgan Kaufman. \n\nIshikawa, M. (1989). A structural learning algorithm with forgetting of link weights. Proceedings of the IJCNN International Joint Conference on Neural Networks, II, 626. Washington D.C., June 18-22, 1989. \n\nJi, C., Snapp, R. & Psaltis, D. (1989). Generalizing smoothness constraints from discrete samples. Unpublished Manuscript. Department of Electrical Engineering, California Institute of Technology, CA. \n\nMorgan, N. & Bourlard, H. (1989). Generalization and parameter estimation in feedforward nets: some experiments. Paper presented at the Snowbird Conference on Neural Networks, Utah. \n\nRumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructures of Cognition (Vol. I). Cambridge, MA: MIT Press. \n\nRumelhart, D. E. (1987). Talk given at Stanford University, CA. \n\nRumelhart, D. E. (1989). Personal Communication. \n\nSommerfeld, A. (1949). Partial Differential Equations in Physics (Vol. VI). New York, NY: Academic Press. \n\nSteffensen, J. F. (1950). Interpolation. New York, NY: Chelsea. \n\nVallet, F., Cailton, J.-G. & Refregier, P. (1989). 
Solving the problem of overfitting of the pseudo-inverse solution for classification learning. Proceedings of the IJCNN Conference, II, 443-450. Washington D.C., June 18-22, 1989. \n", "award": [], "sourceid": 197, "authors": [{"given_name": "Yves", "family_name": "Chauvin", "institution": null}]}