{"title": "A Smoothing Regularizer for Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 458, "page_last": 464, "abstract": null, "full_text": "A Smoothing Regularizer for Recurrent \n\nNeural Networks \n\nOregon Graduate Institute, Computer Science Dept., Portland, OR 97291-1000 \n\nLizhong Wu and John Moody \n\nAbstract \n\nWe derive a smoothing regularizer for recurrent network models by \nrequiring robustness in prediction performance to perturbations of \nthe training data. The regularizer can be viewed as a generaliza(cid:173)\ntion of the first order Tikhonov stabilizer to dynamic models. The \nclosed-form expression of the regularizer covers both time-lagged \nand simultaneous recurrent nets, with feedforward nets and one(cid:173)\nlayer linear nets as special cases. We have successfully tested this \nregularizer in a number of case studies and found that it performs \nbetter than standard quadratic weight decay. \n\n1 \n\nIntrod uction \n\nOne technique for preventing a neural network from overfitting noisy data is to add \na regularizer to the error function being minimized. Regularizers typically smooth \nthe fit to noisy data. Well-established techniques include ridge regression, see (Ho(cid:173)\nerl & Kennard 1970), and more generally spline smoothing functions or Tikhonov \nstabilizers that penalize the mth-order squared derivatives of the function being fit, \nas in (Tikhonov & Arsenin 1977), (Eubank 1988), (Hastie & Tibshirani 1990) and \n(Wahba 1990). 
These methods have recently been extended to networks of radial basis functions (Girosi, Jones & Poggio 1995), and several heuristic approaches have been developed for sigmoidal neural networks, for example, quadratic weight decay (Plaut, Nowlan & Hinton 1986), weight elimination (Scalettar & Zee 1988), (Chauvin 1990), (Weigend, Rumelhart & Huberman 1990) and soft weight sharing (Nowlan & Hinton 1992).[1] All previous studies on regularization have concentrated on feedforward neural networks. To our knowledge, recurrent learning with regularization has not been reported before.\n\n[1] Two additional papers related to ours, but dealing only with feedforward networks, came to our attention or were written after our work was completed. These are (Bishop 1995) and (Leen 1995). Also, Moody & Rognvaldsson (1995) have recently proposed several new classes of smoothing regularizers for feedforward nets.\n\nIn Section 2 of this paper, we develop a smoothing regularizer for general dynamic models which is derived by considering perturbations of the training data. We present a closed-form expression for our regularizer for two-layer feedforward and recurrent neural networks, with standard weight decay being a special case. In Section 3, we evaluate our regularizer's performance on predicting the U.S. Index of Industrial Production. The advantage of our regularizer is demonstrated by comparison to standard weight decay in both feedforward and recurrent modeling. Finally, we conclude the paper in Section 4.\n\n2 Smoothing Regularization\n\n2.1 Prediction Error for Perturbed Data Sets\n\nConsider a training data set {P : Z(t), X(t)}, where the targets Z(t) are assumed to be generated by an unknown dynamical system F*(I(t)) and an unobserved noise process:\n\nZ(t) = F*(I(t)) + ε*(t), with I(t) = {X(s), s = 1, 2, ..., t}.   (1)\n\nHere, I(t) is the information set containing both current and past inputs X(s), and the ε*(t) are independent random noise variables with zero mean and variance σ*^2. Consider next a dynamic network model Z(t) = F(Φ, I(t)) to be trained on data set P, where Φ represents a set of network parameters, and F( ) is a network transfer function which is assumed to be nonlinear and dynamic. We assume that F( ) has good approximation capabilities, such that F(Φ_P, I(t)) ≈ F*(I(t)) for learnable parameters Φ_P.\n\nOur goal is to derive a smoothing regularizer for a network trained on the actual data set P that in effect optimizes the expected network performance (prediction risk) on perturbed test data sets of the form {Q : Z~(t), X~(t)}, where the tilde denotes a perturbed quantity. The elements of Q are related to the elements of P via small random perturbations ε_z(t) and ε_x(t), so that\n\nZ~(t) = Z(t) + ε_z(t),   (2)\n\nX~(t) = X(t) + ε_x(t).   (3)\n\nThe ε_z(t) and ε_x(t) have zero mean and variances σ_z^2 and σ_x^2 respectively. The training and test errors for the data sets P and Q are\n\nD_P = (1/N) Σ_{t=1..N} [Z(t) - F(Φ_P, I(t))]^2,   (4)\n\nD_Q = (1/N) Σ_{t=1..N} [Z~(t) - F(Φ_P, I~(t))]^2,   (5)\n\nwhere Φ_P denotes the network parameters obtained by training on data set P, and I~(t) = {X~(s), s = 1, 2, ..., t} is the perturbed information set of Q. With this notation, our goal is to minimize the expected value of D_Q, while training on D_P. Consider the prediction error for the perturbed data point at time t:\n\nd(t) = [Z~(t) - F(Φ_P, I~(t))]^2.   (6)\n\nWith Eqn (2), we obtain\n\nd(t) = [Z(t) + ε_z(t) - F(Φ_P, I(t)) + F(Φ_P, I(t)) - F(Φ_P, I~(t))]^2\n= [Z(t) - F(Φ_P, I(t))]^2 + [F(Φ_P, I(t)) - F(Φ_P, I~(t))]^2 + [ε_z(t)]^2\n+ 2 [Z(t) - F(Φ_P, I(t))] [F(Φ_P, I(t)) - F(Φ_P, I~(t))]\n+ 2 ε_z(t) [Z(t) - F(Φ_P, I~(t))].   (7)\n\nAssuming that ε_z(t) is uncorrelated with [Z(t) - F(Φ_P, I~(t))] and averaging over the exemplars of data sets P and Q, Eqn (7) becomes\n\nD_Q = D_P + (1/N) Σ_{t=1..N} [F(Φ_P, I(t)) - F(Φ_P, I~(t))]^2 + (1/N) Σ_{t=1..N} [ε_z(t)]^2\n+ (2/N) Σ_{t=1..N} [Z(t) - F(Φ_P, I(t))] [F(Φ_P, I(t)) - F(Φ_P, I~(t))].   (8)\n\nThe third term, (1/N) Σ_{t=1..N} [ε_z(t)]^2, in Eqn (8) is independent of the weights, so it can be neglected during the learning process. The fourth term in Eqn (8) is the cross-covariance between [Z(t) - F(Φ_P, I(t))] and [F(Φ_P, I(t)) - F(Φ_P, I~(t))]. Using the inequality 2ab ≤ a^2 + b^2, we can see that minimizing the first term D_P and the second term (1/N) Σ_{t=1..N} [F(Φ_P, I(t)) - F(Φ_P, I~(t))]^2 in Eqn (8) during training will automatically decrease the effect of the cross-covariance term. Therefore, we exclude the cross-covariance term from the training criterion.\n\nThe above analysis shows that the expected test error D_Q can be minimized by minimizing the objective function D:\n\nD = (1/N) Σ_{t=1..N} [Z(t) - F(Φ, I(t))]^2 + (1/N) Σ_{t=1..N} [F(Φ_P, I(t)) - F(Φ_P, I~(t))]^2.   (9)\n\nIn Eqn (9), the second term is the time average of the squared disturbance of the trained network output due to the input perturbation ||I~(t) - I(t)||^2. Minimizing this term demands that small changes in the input variables yield correspondingly small changes in the output. This is the standard smoothness prior, namely that if nothing else is known about the function to be approximated, a good option is to assume a high degree of smoothness. Without knowing the correct functional form of the dynamical system F* or using such prior assumptions, the data fitting problem is ill-posed. In (Wu & Moody 1996), we have shown that the second term in Eqn (9) is a dynamic generalization of the first-order Tikhonov stabilizer.
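The expansion in Eqn (7) is a purely algebraic identity, so it can be sanity-checked numerically for any predictor. A minimal sketch (scalar case, made-up values; z stands for Z(t), f and f_pert for F(Φ_P, I(t)) and F(Φ_P, I~(t))):

```python
import random

random.seed(0)
for _ in range(100):
    z = random.uniform(-1, 1)          # target Z(t)
    f = random.uniform(-1, 1)          # F(Phi_P, I(t)): output on the clean input
    f_pert = random.uniform(-1, 1)     # F(Phi_P, I~(t)): output on the perturbed input
    eps_z = random.uniform(-0.1, 0.1)  # target perturbation eps_z(t)

    z_pert = z + eps_z                 # Z~(t), as in Eqn (2)
    d = (z_pert - f_pert) ** 2         # prediction error d(t), Eqn (6)

    # The five terms of the expansion in Eqn (7):
    expansion = ((z - f) ** 2 + (f - f_pert) ** 2 + eps_z ** 2
                 + 2 * (z - f) * (f - f_pert)
                 + 2 * eps_z * (z - f_pert))
    assert abs(d - expansion) < 1e-12
```

Averaging d(t) over t then yields Eqn (8), with the ε_z^2 term dropping out of the gradient and the cross term bounded via 2ab ≤ a^2 + b^2 as described above.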
\n\n2.2 Form of the Proposed Smoothing Regularizer\n\nConsider a general, two-layer, nonlinear, dynamic network with recurrent connections on the internal layer,[2] as described by\n\nY(t) = f(W Y(t - τ) + V X(t)), Z(t) = U Y(t),   (10)\n\nwhere X(t), Y(t) and Z(t) are respectively the network input vector, the hidden output vector and the network output; Φ = {U, V, W} collects the output, input and recurrent connection weights of the network; f( ) is the vector-valued nonlinear transfer function of the hidden units; and τ is a time delay in the feedback connections of the hidden layer, pre-defined by the user and not changed during learning. τ can be zero, a fraction, or an integer, but we are interested in the cases with a small τ.[3]\n\n[2] Our derivation can easily be extended to other network structures.\n\n[3] When the time delay τ exceeds some critical value, a recurrent network becomes unstable and lies in oscillatory modes. See, for example, (Marcus & Westervelt 1989).\n\nWhen τ = 1, our model is a recurrent network as described by (Elman 1990) and (Rumelhart, Hinton & Williams 1986) (see Figure 17 on page 355). When τ is equal to some fraction smaller than one, the network evolves 1/τ times within each input time interval. When τ decreases and approaches zero, our model is the same as the network studied by (Pineda 1989), and earlier, widely-studied additive networks. In (Pineda 1989), τ was referred to as the network relaxation time scale. (Werbos 1992) distinguished the recurrent networks with zero τ and non-zero τ by calling them simultaneous recurrent networks and time-lagged recurrent networks respectively.\n\nWe have found that minimizing the second term of Eqn (9) can be accomplished by smoothing the output response to an input perturbation at every time step. This yields, see (Wu & Moody 1996):\n\n||Z~(t) - Z(t)||^2 ≤ ρ_τ^2(Φ_P) ||X~(t) - X(t)||^2 for t = 1, 2, ..., N.   (11)\n\nWe call ρ_τ^2(Φ_P) the output sensitivity of the trained network Φ_P to an input perturbation. ρ_τ^2(Φ_P) is determined by the network parameters only and is independent of the time variable t.\n\nWe obtain our new regularizer by training directly on the expected prediction error for perturbed data sets Q. Based on the analysis leading to Eqns (9) and (11), the training criterion thus becomes\n\nD = (1/N) Σ_{t=1..N} [Z(t) - F(Φ, I(t))]^2 + λ ρ_τ^2(Φ).   (12)\n\nThe coefficient λ in Eqn (12) is a regularization parameter that measures the degree of input perturbation ||I~(t) - I(t)||^2. The algebraic form for ρ_τ(Φ) as derived in (Wu & Moody 1996) is\n\nρ_τ(Φ) = (γ ||U|| ||V|| / (1 - γ ||W||)) { 1 - exp[(γ ||W|| - 1) / τ] }   (13)\n\nfor time-lagged recurrent networks (τ > 0). Here, ||·|| denotes the Euclidean matrix norm. The factor γ depends upon the maximal value of the first derivatives of the activation functions of the hidden units and is given by\n\nγ = max_{t,j} |f'(o_j(t))|,   (14)\n\nwhere j is the index of hidden units and o_j(t) is the input to the jth unit. In general, γ ≤ 1.[4] To ensure stability and that the effects of small input perturbations are damped out, it is required, see (Wu & Moody 1996), that\n\nγ ||W|| < 1.   (15)\n\n[4] For instance, f'(x) = [1 - f(x)] f(x) if f(x) = 1/(1 + e^(-x)). Then γ = max |f'(x)| = 1/4.\n\nThe regularizer of Eqn (13) can be deduced for simultaneous recurrent networks in the limit τ → 0:\n\nρ_0(Φ) = lim_{τ→0} ρ_τ(Φ) = γ ||U|| ||V|| / (1 - γ ||W||).   (16)\n\nIf the network is feedforward, W = 0 and τ = 0, and Eqns (13) and (16) become\n\nρ(Φ) = γ ||U|| ||V||.   (17)\n\nMoreover, if there is no hidden layer and the inputs are directly connected to the outputs via U, the network is an ordinary linear model, and we obtain\n\nρ(Φ) = ||U||,   (18)\n\nwhich is standard quadratic weight decay (Plaut et al. 1986) as is used in ridge regression (Hoerl & Kennard 1970).\n\nThe regularizer (Eqn (17) for feedforward networks and Eqn (13) for recurrent networks) was obtained by requiring smoothness of the network output with respect to perturbations of the data. We therefore refer to it as a smoothing regularizer. Several approaches can be applied to estimate the regularization parameter λ, as in (Eubank 1988), (Hastie & Tibshirani 1990) and (Wahba 1990). We will not discuss this subject in this paper.\n\nIn the next section, we evaluate the new regularizer on the task of predicting the U.S. Index of Industrial Production. Additional empirical tests can be found in (Wu & Moody 1996).\n\n3 Predicting the U.S. Index of Industrial Production\n\nThe Index of Industrial Production (IP) is one of the key measures of economic activity. It is computed and published monthly. Our task is to predict the one-month rate of change of the index from January 1980 to December 1989 for models trained from January 1950 to December 1979. The exogenous inputs we have used include 8 time series, such as the index of leading indicators, housing starts, the money supply M2 and the S&P 500 Index. These 8 series are also recorded monthly. In previous studies by (Moody, Levin & Rehfuss 1993), with the same training and test data sets, the normalized prediction errors of the one-month rate of change were 0.81 with the neuz neural network simulator and 0.75 with the proj neural network simulator.\n\nWe have simulated feedforward and recurrent neural network models. Both models consist of two layers. There are 9 input units in the recurrent model, which receive the 8 exogenous series and the previous month's IP index change. We set the time-delay length in the recurrent connections to τ = 1. The feedforward model is constructed with 36 input units, which receive 4 time-delayed versions of each input series. 
The time-delay lengths are 1, 3, 6 and 12, respectively. The activation functions of the hidden units in both the feedforward and recurrent models are tanh functions. The number of hidden units varies from 2 to 6. Each model has one linear output unit.\n\nWe have divided the data from January 1950 to December 1979 into four non-overlapping sub-sets. One sub-set consists of 70% of the original data and each of the other three sub-sets consists of 10% of the original data. The larger sub-set is used as training data and the three smaller sub-sets are used as validation data. These three validation data sets are respectively used for determining early stopping of training, selecting the regularization parameter and selecting the number of hidden units.\n\nWe have formed 10 random training-validation partitions. For each training-validation partition, three networks with different initial weight parameters are trained. Therefore, our prediction committee is formed by 30 networks.\n\nThe committee error is the average of the errors of all committee members. All networks in the committee are trained simultaneously and stopped at the same time based on the committee error on a validation set. The value of the regularization parameter and the number of hidden units are determined by minimizing the committee error on separate validation sets.\n\nTable 1 compares the out-of-sample performance of recurrent networks and feedforward networks trained with our smoothing regularizer to that of networks trained with standard weight decay. The results are based on 30 networks.\n\nTable 1: Normalized prediction errors for the one-month rate of return on the U.S. Index of Industrial Production (Jan. 1980 - Dec. 1989). Each result is based on 30 networks.\n\nModel | Regularizer | Mean \u00b1 Std | Median | Max | Min | Committee\nRecurrent Networks | Smoothing | 0.646 \u00b1 0.008 | 0.647 | 0.657 | 0.632 | 0.639\nRecurrent Networks | Weight Decay | 0.734 \u00b1 0.018 | 0.737 | 0.767 | 0.704 | 0.734\nFeedforward Networks | Smoothing | 0.700 \u00b1 0.023 | 0.707 | 0.729 | 0.654 | 0.693\nFeedforward Networks | Weight Decay | 0.745 \u00b1 0.043 | 0.748 | 0.805 | 0.676 | 0.731\n\nAs shown, the smoothing regularizer outperforms standard weight decay with 95% confidence (by a t-test) for both recurrent networks and feedforward networks. We also list the median, maximal and minimal prediction errors over the 30 predictors. The last column gives the committee results, which are based on the simple average of the 30 network predictions. We see that the median, maximal and minimal values and the committee results obtained with the smoothing regularizer are all smaller than those obtained with standard weight decay, in both the recurrent and feedforward network models.\n\n4 Concluding Remarks\n\nRegularization in learning can prevent a network from overtraining. Several techniques have been developed in recent years, but all of these are specialized for feedforward networks. To the best of our knowledge, a regularizer for a recurrent network has not been reported previously.\n\nWe have developed a smoothing regularizer for recurrent neural networks that captures the dependencies of the input, output, and feedback weight values on each other. The regularizer covers both simultaneous and time-lagged recurrent networks, with feedforward networks and single-layer, linear networks as special cases. Our smoothing regularizer for linear networks has the same form as standard weight decay. 
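To illustrate how little is needed to compute the regularizer, here is a minimal pure-Python sketch of Eqns (13), (16) and (17); the weight matrices are hypothetical, and the Frobenius norm is used as one concrete reading of the Euclidean matrix norm:

```python
import math

def frob(M):
    # Frobenius norm of a matrix given as a list of rows
    return math.sqrt(sum(x * x for row in M for x in row))

def rho(U, V, W, gamma=1.0, tau=1.0):
    """Smoothing regularizer rho_tau(Phi) of Eqn (13); tau = 0 gives Eqn (16)."""
    u, v, w = frob(U), frob(V), frob(W)
    assert gamma * w < 1.0, "stability condition gamma*||W|| < 1, Eqn (15)"
    base = gamma * u * v / (1.0 - gamma * w)
    if tau == 0.0:                    # simultaneous recurrent net, Eqn (16)
        return base
    return base * (1.0 - math.exp((gamma * w - 1.0) / tau))  # time-lagged, Eqn (13)

# Hypothetical small weight matrices: 2 hidden units, 3 inputs, 1 output.
U = [[0.5, -0.3]]
V = [[0.2, 0.1, 0.0], [-0.1, 0.3, 0.2]]
W = [[0.1, 0.0], [0.0, 0.1]]

r_lagged = rho(U, V, W, gamma=0.25, tau=1.0)   # time-lagged case (tau = 1)
r_simult = rho(U, V, W, gamma=0.25, tau=0.0)   # simultaneous limit, Eqn (16)
W0 = [[0.0, 0.0], [0.0, 0.0]]
r_ff = rho(U, V, W0, gamma=0.25, tau=0.0)      # feedforward case, Eqn (17)

assert r_lagged < r_simult                     # the {1 - exp(...)} factor is < 1
assert math.isclose(r_ff, 0.25 * frob(U) * frob(V))
```

For a time-lagged network the exponential factor damps the penalty relative to the simultaneous limit, and with W = 0 the expression collapses to γ||U||||V|| as in Eqn (17).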
The regularizer developed depends only on the network parameters, and can easily be used. A more detailed description of this work appears in (Wu & Moody 1996).\n\nReferences\n\nBishop, C. (1995), 'Training with noise is equivalent to Tikhonov regularization', Neural Computation 7(1), 108-116.\n\nChauvin, Y. (1990), Dynamic behavior of constrained back-propagation networks, in D. Touretzky, ed., 'Advances in Neural Information Processing Systems 2', Morgan Kaufmann Publishers, San Francisco, CA, pp. 642-649.\n\nElman, J. (1990), 'Finding structure in time', Cognitive Science 14, 179-211.\n\nEubank, R. L. (1988), Spline Smoothing and Nonparametric Regression, Marcel Dekker, Inc.\n\nGirosi, F., Jones, M. & Poggio, T. (1995), 'Regularization theory and neural networks architectures', Neural Computation 7, 219-269.\n\nHastie, T. J. & Tibshirani, R. J. (1990), Generalized Additive Models, Vol. 43 of Monographs on Statistics and Applied Probability, Chapman and Hall.\n\nHoerl, A. & Kennard, R. (1970), 'Ridge regression: biased estimation for nonorthogonal problems', Technometrics 12, 55-67.\n\nLeen, T. (1995), 'From data distributions to regularization in invariant learning', Neural Computation 7(5), 974-981.\n\nMarcus, C. & Westervelt, R. (1989), Dynamics of analog neural networks with time delay, in D. Touretzky, ed., 'Advances in Neural Information Processing Systems 1', Morgan Kaufmann Publishers, San Francisco, CA.\n\nMoody, J. & Rognvaldsson, T. (1995), Smoothing regularizers for feed-forward neural networks, Oregon Graduate Institute Computer Science Dept. Technical Report, submitted for publication.\n\nMoody, J., Levin, U. & Rehfuss, S. (1993), 'Predicting the U.S. index of industrial production', in Proceedings of the 1993 Parallel Applications in Statistics and Economics Conference, Zeist, The Netherlands. 
Special issue of Neural Network World 3(6), 791-794.\n\nNowlan, S. & Hinton, G. (1992), 'Simplifying neural networks by soft weight-sharing', Neural Computation 4(4), 473-493.\n\nPineda, F. (1989), 'Recurrent backpropagation and the dynamical approach to adaptive neural computation', Neural Computation 1(2), 161-172.\n\nPlaut, D., Nowlan, S. & Hinton, G. (1986), Experiments on learning by back propagation, Technical Report CMU-CS-86-126, Carnegie-Mellon University.\n\nRumelhart, D., Hinton, G. & Williams, R. (1986), Learning internal representations by error propagation, in D. Rumelhart & J. McClelland, eds, 'Parallel Distributed Processing: Explorations in the Microstructure of Cognition', MIT Press, Cambridge, MA, chapter 8, pp. 319-362.\n\nScalettar, R. & Zee, A. (1988), Emergence of grandmother memory in feedforward networks: learning with noise and forgetfulness, in D. Waltz & J. Feldman, eds, 'Connectionist Models and Their Implications: Readings from Cognitive Science', Ablex Pub. Corp.\n\nTikhonov, A. N. & Arsenin, V. Y. (1977), Solutions of Ill-posed Problems, Winston; New York: distributed solely by Halsted Press. Scripta series in mathematics. Translation editor, Fritz John.\n\nWahba, G. (1990), Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied Mathematics.\n\nWeigend, A., Rumelhart, D. & Huberman, B. (1990), Back-propagation, weight-elimination and time series prediction, in T. Sejnowski, G. Hinton & D. Touretzky, eds, 'Proceedings of the Connectionist Models Summer School', Morgan Kaufmann Publishers, San Mateo, CA, pp. 105-116.\n\nWerbos, P. (1992), Neurocontrol and supervised learning: An overview and evaluation, in D. White & D. Sofge, eds, 'Handbook of Intelligent Control', Van Nostrand Reinhold, New York.\n\nWu, L. & Moody, J. 
(1996), 'A smoothing regularizer for feedforward and recurrent neural networks', Neural Computation 8(3), 463-491.\n", "award": [], "sourceid": 1119, "authors": [{"given_name": "Lizhong", "family_name": "Wu", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}