{"title": "A Smoothing Regularizer for Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 458, "page_last": 464, "abstract": null, "full_text": "A Smoothing Regularizer for Recurrent Neural Networks\n\nLizhong Wu and John Moody\n\nOregon Graduate Institute, Computer Science Dept., Portland, OR 97291-1000\n\nAbstract\n\nWe derive a smoothing regularizer for recurrent network models by requiring robustness in prediction performance to perturbations of the training data. The regularizer can be viewed as a generalization of the first order Tikhonov stabilizer to dynamic models. The closed-form expression of the regularizer covers both time-lagged and simultaneous recurrent nets, with feedforward nets and one-layer linear nets as special cases. We have successfully tested this regularizer in a number of case studies and found that it performs better than standard quadratic weight decay.\n\n1 Introduction\n\nOne technique for preventing a neural network from overfitting noisy data is to add a regularizer to the error function being minimized. Regularizers typically smooth the fit to noisy data. Well-established techniques include ridge regression, see (Hoerl & Kennard 1970), and more generally spline smoothing functions or Tikhonov stabilizers that penalize the mth-order squared derivatives of the function being fit, as in (Tikhonov & Arsenin 1977), (Eubank 1988), (Hastie & Tibshirani 1990) and (Wahba 1990). 
These methods have recently been extended to networks of radial basis functions (Girosi, Jones & Poggio 1995), and several heuristic approaches have been developed for sigmoidal neural networks, for example, quadratic weight decay (Plaut, Nowlan & Hinton 1986), weight elimination (Scalettar & Zee 1988), (Chauvin 1990), (Weigend, Rumelhart & Huberman 1990) and soft weight sharing (Nowlan & Hinton 1992).\u00b9 All previous studies on regularization have concentrated on feedforward neural networks. To our knowledge, recurrent learning with regularization has not been reported before.\n\n\u00b9 Two additional papers related to ours, but dealing only with feedforward networks, came to our attention or were written after our work was completed. These are (Bishop 1995) and (Leen 1995). Also, Moody & Rognvaldsson (1995) have recently proposed several new classes of smoothing regularizers for feedforward nets.\n\nIn Section 2 of this paper, we develop a smoothing regularizer for general dynamic models which is derived by considering perturbations of the training data. We present a closed-form expression for our regularizer for two-layer feedforward and recurrent neural networks, with standard weight decay being a special case. In Section 3, we evaluate our regularizer's performance on predicting the U.S. Index of Industrial Production. The advantage of our regularizer is demonstrated by comparing to standard weight decay in both feedforward and recurrent modeling. Finally, we conclude our paper in Section 4.\n\n2 Smoothing Regularization\n\n2.1 Prediction Error for Perturbed Data Sets\n\n$$Z(t) = F^*(I(t)) + \\epsilon^*(t) \\quad \\text{with} \\quad I(t) = \\{X(s),\\ s = 1, 2, \\cdots, t\\} .$$ (1) 
\n\nConsider a training data set $\\{P: Z(t), X(t)\\}$, where the targets $Z(t)$ are assumed to be generated, as in Eqn (1) above, by an unknown dynamical system $F^*(I(t))$ and an unobserved noise process. Here, $I(t)$ is the information set containing both current and past inputs $X(s)$, and the $\\epsilon^*(t)$ are independent random noise variables with zero mean and variance $\\sigma^{*2}$. Consider next a dynamic network model $\\hat{Z}(t) = F(\\Phi, I(t))$ to be trained on data set $P$, where $\\Phi$ represents a set of network parameters, and $F(\\cdot)$ is a network transfer function which is assumed to be nonlinear and dynamic. We assume that $F(\\cdot)$ has good approximation capabilities, such that $F(\\Phi_P, I(t)) \\approx F^*(I(t))$ for learnable parameters $\\Phi_P$.\n\nOur goal is to derive a smoothing regularizer for a network trained on the actual data set $P$ that in effect optimizes the expected network performance (prediction risk) on perturbed test data sets of form $\\{Q: \\tilde{Z}(t), \\tilde{X}(t)\\}$. The elements of $Q$ are related to the elements of $P$ via small random perturbations $\\epsilon_z(t)$ and $\\epsilon_x(t)$, so that\n\n$$\\tilde{Z}(t) = Z(t) + \\epsilon_z(t) ,$$ (2)\n\n$$\\tilde{X}(t) = X(t) + \\epsilon_x(t) .$$ (3)\n\nThe $\\epsilon_z(t)$ and $\\epsilon_x(t)$ have zero mean and variances $\\sigma_z^2$ and $\\sigma_x^2$ respectively. The training and test errors for the data sets $P$ and $Q$ are\n\n$$D_P = \\frac{1}{N} \\sum_{t=1}^{N} [Z(t) - F(\\Phi_P, I(t))]^2 ,$$ (4)\n\n$$D_Q = \\frac{1}{N} \\sum_{t=1}^{N} [\\tilde{Z}(t) - F(\\Phi_P, \\tilde{I}(t))]^2 ,$$ (5)\n\nwhere $\\Phi_P$ denotes the network parameters obtained by training on data set $P$, and $\\tilde{I}(t) = \\{\\tilde{X}(s),\\ s = 1, 2, \\cdots, t\\}$ is the perturbed information set of $Q$. With this notation, our goal is to minimize the expected value of $D_Q$, while training on $D_P$.\n\nConsider the prediction error for the perturbed data point at time $t$:\n\n$$d(t) = [\\tilde{Z}(t) - F(\\Phi_P, \\tilde{I}(t))]^2 .$$ (6)\n\nWith Eqn (2), we obtain
\n\n$$d(t) = [Z(t) + \\epsilon_z(t) - F(\\Phi_P, I(t)) + F(\\Phi_P, I(t)) - F(\\Phi_P, \\tilde{I}(t))]^2$$\n\n$$= [Z(t) - F(\\Phi_P, I(t))]^2 + [F(\\Phi_P, I(t)) - F(\\Phi_P, \\tilde{I}(t))]^2 + [\\epsilon_z(t)]^2 + 2 [Z(t) - F(\\Phi_P, I(t))] [F(\\Phi_P, I(t)) - F(\\Phi_P, \\tilde{I}(t))] + 2 \\epsilon_z(t) [Z(t) - F(\\Phi_P, \\tilde{I}(t))] .$$ (7)\n\nAssuming that $\\epsilon_z(t)$ is uncorrelated with $[Z(t) - F(\\Phi_P, \\tilde{I}(t))]$ and averaging over the exemplars of data sets $P$ and $Q$, Eqn (7) becomes\n\n$$D_Q = D_P + \\frac{1}{N} \\sum_{t=1}^{N} [F(\\Phi_P, I(t)) - F(\\Phi_P, \\tilde{I}(t))]^2 + \\frac{1}{N} \\sum_{t=1}^{N} [\\epsilon_z(t)]^2 + \\frac{2}{N} \\sum_{t=1}^{N} [Z(t) - F(\\Phi_P, I(t))] [F(\\Phi_P, I(t)) - F(\\Phi_P, \\tilde{I}(t))] .$$ (8)\n\nThe third term, $\\frac{1}{N} \\sum_{t=1}^{N} [\\epsilon_z(t)]^2$, in Eqn (8) is independent of the weights, so it can be neglected during the learning process. The fourth term in Eqn (8) is the cross-covariance between $[Z(t) - F(\\Phi_P, I(t))]$ and $[F(\\Phi_P, I(t)) - F(\\Phi_P, \\tilde{I}(t))]$. Using the inequality $2ab \\le a^2 + b^2$, we can see that minimizing the first term $D_P$ and the second term $\\frac{1}{N} \\sum_{t=1}^{N} [F(\\Phi_P, I(t)) - F(\\Phi_P, \\tilde{I}(t))]^2$ in Eqn (8) during training will automatically decrease the effect of the cross-covariance term. Therefore, we exclude the cross-covariance term from the training criterion.\n\nThe above analysis shows that the expected test error $D_Q$ can be minimized by minimizing the objective function $D$:\n\n$$D = \\frac{1}{N} \\sum_{t=1}^{N} [Z(t) - F(\\Phi, I(t))]^2 + \\frac{1}{N} \\sum_{t=1}^{N} [F(\\Phi_P, I(t)) - F(\\Phi_P, \\tilde{I}(t))]^2 .$$ (9)\n\nIn Eqn (9), the second term is the time average of the squared disturbance $\\|\\hat{\\tilde{Z}}(t) - \\hat{Z}(t)\\|^2$ of the trained network output due to the input perturbation $\\|\\tilde{I}(t) - I(t)\\|^2$. Minimizing this term demands that small changes in the input variables yield correspondingly small changes in the output. This is the standard smoothness prior, namely that if nothing else is known about the function to be approximated, a good option is to assume a high degree of smoothness. 
Without knowing the correct functional form of the dynamical system $F^*$ or using such prior assumptions, the data fitting problem is ill-posed. In (Wu & Moody 1996), we have shown that the second term in Eqn (9) is a dynamic generalization of the first order Tikhonov stabilizer.\n\n2.2 Form of the Proposed Smoothing Regularizer\n\nConsider a general, two-layer, nonlinear, dynamic network with recurrent connections on the internal layer\u00b2 as described by\n\n$$Y(t) = f(W Y(t - \\tau) + V X(t)) , \\quad \\hat{Z}(t) = U Y(t) ,$$ (10)\n\nwhere $X(t)$, $Y(t)$ and $\\hat{Z}(t)$ are respectively the network input vector, the hidden output vector and the network output; $\\Phi = \\{U, V, W\\}$ denotes the output, input and recurrent connections of the network; $f(\\cdot)$ is the vector-valued nonlinear transfer function of the hidden units; and $\\tau$ is a time delay in the feedback connections of the hidden layer which is pre-defined by the user and will not be changed during learning. $\\tau$ can be zero, a fraction, or an integer, but we are interested in the cases with a small $\\tau$.\u00b3\n\n\u00b2 Our derivation can easily be extended to other network structures.\n\n\u00b3 When the time delay $\\tau$ exceeds some critical value, a recurrent network becomes unstable and lies in oscillatory modes. See, for example, (Marcus & Westervelt 1989).\n\nWhen $\\tau = 1$, our model is a recurrent network as described by (Elman 1990) and (Rumelhart, Hinton & Williams 1986) (see Figure 17 on page 355). When $\\tau$ is equal to some fraction smaller than one, the network evolves $1/\\tau$ times within each input time interval. When $\\tau$ decreases and approaches zero, our model is the same as the network studied by (Pineda 1989), and earlier, widely-studied additive networks. In (Pineda 1989), $\\tau$ was referred to as the network relaxation time scale. 
(Werbos 1992) distinguished the recurrent networks with zero $\\tau$ and non-zero $\\tau$ by calling them simultaneous recurrent networks and time-lagged recurrent networks respectively.\n\nWe have found that minimizing the second term of Eqn (9) can be obtained by smoothing the output response to an input perturbation at every time step. This yields, see (Wu & Moody 1996):\n\n$$\\|\\hat{\\tilde{Z}}(t) - \\hat{Z}(t)\\|^2 \\le \\rho_\\tau^2(\\Phi_P) \\, \\|\\tilde{X}(t) - X(t)\\|^2 \\quad \\text{for} \\quad t = 1, 2, \\ldots, N .$$ (11)\n\nWe call $\\rho_\\tau^2(\\Phi_P)$ the output sensitivity of the trained network $\\Phi_P$ to an input perturbation. $\\rho_\\tau^2(\\Phi_P)$ is determined by the network parameters only and is independent of the time variable $t$.\n\nWe obtain our new regularizer by training directly on the expected prediction error for perturbed data sets $Q$. Based on the analysis leading to Eqns (9) and (11), the training criterion thus becomes\n\n$$D = \\frac{1}{N} \\sum_{t=1}^{N} [Z(t) - F(\\Phi, I(t))]^2 + \\lambda \\rho_\\tau^2(\\Phi) .$$ (12)\n\nThe coefficient $\\lambda$ in Eqn (12) is a regularization parameter that measures the degree of input perturbation $\\|\\tilde{I}(t) - I(t)\\|^2$. The algebraic form for $\\rho_\\tau(\\Phi)$ as derived in (Wu & Moody 1996) is:\n\n$$\\rho_\\tau(\\Phi) = \\frac{\\gamma \\|U\\| \\|V\\|}{1 - \\gamma \\|W\\|} \\left\\{ 1 - \\exp\\left( \\frac{\\gamma \\|W\\| - 1}{\\tau} \\right) \\right\\} ,$$ (13)\n\nfor time-lagged recurrent networks ($\\tau > 0$). Here, $\\| \\cdot \\|$ denotes the Euclidean matrix norm. The factor $\\gamma$ depends upon the maximal value of the first derivatives of the activation functions of the hidden units and is given by:\n\n$$\\gamma = \\max_{t, j} | f'(o_j(t)) | ,$$ (14)\n\nwhere $j$ is the index of hidden units and $o_j(t)$ is the input to the $j$th unit. In general, $\\gamma \\le 1$.\u2074 To ensure stability and that the effects of small input perturbations are damped out, it is required, see (Wu & Moody 1996), that\n\n$$\\gamma \\|W\\| < 1 .$$ (15)\n\nThe regularizer Eqn (13) can be deduced for the simultaneous recurrent networks in the limit $\\tau \\to 0$ by:\n\n$$\\rho(\\Phi) = \\rho_0(\\Phi) = \\frac{\\gamma \\|U\\| \\|V\\|}{1 - \\gamma \\|W\\|} .$$ 
(16)\n\nIf the network is feedforward, $W = 0$ and $\\tau = 0$, Eqns (13) and (16) become\n\n$$\\rho(\\Phi) = \\gamma \\|U\\| \\|V\\| .$$ (17)\n\nMoreover, if there is no hidden layer and the inputs are directly connected to the outputs via $U$, the network is an ordinary linear model, and we obtain\n\n$$\\rho(\\Phi) = \\|U\\| ,$$ (18)\n\nwhich is standard quadratic weight decay (Plaut et al. 1986) as is used in ridge regression (Hoerl & Kennard 1970).\n\n\u2074 For instance, $f'(x) = [1 - f(x)] f(x)$ if $f(x) = \\frac{1}{1 + e^{-x}}$. Then, $\\gamma = \\max | f'(x) | = \\frac{1}{4}$.\n\nThe regularizer (Eqn (17) for feedforward networks and Eqn (13) for recurrent networks) was obtained by requiring smoothness of the network output to perturbations of data. We therefore refer to it as a smoothing regularizer. Several approaches can be applied to estimate the regularization parameter $\\lambda$, as in (Eubank 1988), (Hastie & Tibshirani 1990) and (Wahba 1990). We will not discuss this subject in this paper.\n\nIn the next section, we evaluate the new regularizer for the task of predicting the U.S. Index of Industrial Production. Additional empirical tests can be found in (Wu & Moody 1996).\n\n3 Predicting the U.S. Index of Industrial Production\n\nThe Index of Industrial Production (IP) is one of the key measures of economic activity. It is computed and published monthly. Our task is to predict the one-month rate of change of the index from January 1980 to December 1989 for models trained from January 1950 to December 1979. The exogenous inputs we have used include 8 time series such as the index of leading indicators, housing starts, the money supply M2, and the S&P 500 Index. These 8 series are also recorded monthly. 
In previous studies by (Moody, Levin & Rehfuss 1993), with the same defined training and test data sets, the normalized prediction errors of the one-month rate of change were 0.81 with the neuz neural network simulator, and 0.75 with the proj neural network simulator.\n\nWe have simulated feedforward and recurrent neural network models. Both models consist of two layers. There are 9 input units in the recurrent model, which receive the 8 exogenous series and the previous month IP index change. We set the time-delayed length in the recurrent connections $\\tau = 1$. The feedforward model is constructed with 36 input units, which receive 4 time-delayed versions of each input series. The time-delay lengths are 1, 3, 6 and 12, respectively. The activation functions of hidden units in both feedforward and recurrent models are tanh functions. The number of hidden units varies from 2 to 6. Each model has one linear output unit.\n\nWe have divided the data from January 1950 to December 1979 into four non-overlapping sub-sets. One sub-set consists of 70% of the original data and each of the other three sub-sets consists of 10% of the original data. The larger sub-set is used as training data and the three smaller sub-sets are used as validation data. These three validation data sets are respectively used for determination of early stopped training, selecting the regularization parameter and selecting the number of hidden units.\n\nWe have formed 10 random training-validation partitions. For each training-validation partition, three networks with different initial weight parameters are trained. Therefore, our prediction committee is formed by 30 networks.\n\nThe committee error is the average of the errors of all committee members. 
All networks in the committee are trained simultaneously and stopped at the same time based on the committee error of a validation set. The value of the regularization parameter and the number of hidden units are determined by minimizing the committee error on separate validation sets.\n\nTable 1: Normalized prediction errors for the one-month rate of return on the U.S. Index of Industrial Production (Jan. 1980 - Dec. 1989). Each result is based on 30 networks.\n\nModel                 Regularizer    Mean \u00b1 Std      Median   Max     Min     Committee\nRecurrent Networks    Smoothing      0.646 \u00b1 0.008   0.647    0.657   0.632   0.639\nRecurrent Networks    Weight Decay   0.734 \u00b1 0.018   0.737    0.767   0.704   0.734\nFeedforward Networks  Smoothing      0.700 \u00b1 0.023   0.707    0.729   0.654   0.693\nFeedforward Networks  Weight Decay   0.745 \u00b1 0.043   0.748    0.805   0.676   0.731\n\nTable 1 compares the out-of-sample performance of recurrent networks and feedforward networks trained with our smoothing regularizer to that of networks trained with standard weight decay. The results are based on 30 networks. As shown, the smoothing regularizer again outperforms standard weight decay with 95% confidence (in t-distribution hypothesis) in both cases of recurrent networks and feedforward networks. We also list the median, maximal and minimal prediction errors over 30 predictors. The last column gives the committee results, which are based on the simple average of 30 network predictions. We see that the median, maximal and minimal values and the committee results obtained with the smoothing regularizer are all smaller than those obtained with standard weight decay, in both recurrent and feedforward network models.\n\n4 Concluding Remarks\n\nRegularization in learning can prevent a network from overtraining. 
Several techniques have been developed in recent years, but all these are specialized for feedforward networks. To the best of our knowledge, a regularizer for a recurrent network has not been reported previously.\n\nWe have developed a smoothing regularizer for recurrent neural networks that captures the dependencies of input, output, and feedback weight values on each other. The regularizer covers both simultaneous and time-lagged recurrent networks, with feedforward networks and single-layer, linear networks as special cases. Our smoothing regularizer for linear networks has the same form as standard weight decay. The regularizer developed depends only on the network parameters, and can easily be used. A more detailed description of this work appears in (Wu & Moody 1996).\n\nReferences\n\nBishop, C. (1995), 'Training with noise is equivalent to Tikhonov regularization', Neural Computation 7(1), 108-116.\n\nChauvin, Y. (1990), Dynamic behavior of constrained back-propagation networks, in D. Touretzky, ed., 'Advances in Neural Information Processing Systems 2', Morgan Kaufmann Publishers, San Francisco, CA, pp. 642-649.\n\nElman, J. (1990), 'Finding structure in time', Cognitive Science 14, 179-211.\n\nEubank, R. L. (1988), Spline Smoothing and Nonparametric Regression, Marcel Dekker, Inc.\n\nGirosi, F., Jones, M. & Poggio, T. (1995), 'Regularization theory and neural networks architectures', Neural Computation 7, 219-269.\n\nHastie, T. J. & Tibshirani, R. J. (1990), Generalized Additive Models, Vol. 43 of Monographs on Statistics and Applied Probability, Chapman and Hall.\n\nHoerl, A. & Kennard, R. (1970), 'Ridge regression: biased estimation for nonorthogonal problems', Technometrics 12, 55-67.\n\nLeen, T. 
(1995), 'From data distributions to regularization in invariant learning', Neural Computation 7(5), 974-981.\n\nMarcus, C. & Westervelt, R. (1989), Dynamics of analog neural networks with time delay, in D. Touretzky, ed., 'Advances in Neural Information Processing Systems 1', Morgan Kaufmann Publishers, San Francisco, CA.\n\nMoody, J. & Rognvaldsson, T. (1995), Smoothing regularizers for feed-forward neural networks, Oregon Graduate Institute Computer Science Dept. Technical Report, submitted for publication, 1995.\n\nMoody, J., Levin, U. & Rehfuss, S. (1993), 'Predicting the U.S. index of industrial production', in Proceedings of the 1993 Parallel Applications in Statistics and Economics Conference, Zeist, The Netherlands. Special issue of Neural Network World 3(6), 791-794.\n\nNowlan, S. & Hinton, G. (1992), 'Simplifying neural networks by soft weight-sharing', Neural Computation 4(4), 473-493.\n\nPineda, F. (1989), 'Recurrent backpropagation and the dynamical approach to adaptive neural computation', Neural Computation 1(2), 161-172.\n\nPlaut, D., Nowlan, S. & Hinton, G. (1986), Experiments on learning by back propagation, Technical Report CMU-CS-86-126, Carnegie-Mellon University.\n\nRumelhart, D., Hinton, G. & Williams, R. (1986), Learning internal representations by error propagation, in D. Rumelhart & J. McClelland, eds, 'Parallel Distributed Processing: Exploration in the microstructure of cognition', MIT Press, Cambridge, MA, chapter 8, pp. 319-362.\n\nScalettar, R. & Zee, A. (1988), Emergence of grandmother memory in feed forward networks: learning with noise and forgetfulness, in D. Waltz & J. Feldman, eds, 'Connectionist Models and Their Implications: Readings from Cognitive Science', Ablex Pub. Corp.\n\nTikhonov, A. N. & Arsenin, V. 
I. (1977), Solutions of Ill-posed Problems, Winston; New York: distributed solely by Halsted Press. Scripta series in mathematics. Translation editor, Fritz John.\n\nWahba, G. (1990), Spline models for observational data, CBMS-NSF Regional Conference Series in Applied Mathematics.\n\nWeigend, A., Rumelhart, D. & Huberman, B. (1990), Back-propagation, weight-elimination and time series prediction, in T. Sejnowski, G. Hinton & D. Touretzky, eds, 'Proceedings of the connectionist models summer school', Morgan Kaufmann Publishers, San Mateo, CA, pp. 105-116.\n\nWerbos, P. (1992), Neurocontrol and supervised learning: An overview and evaluation, in D. White & D. Sofge, eds, 'Handbook of Intelligent Control', Van Nostrand Reinhold, New York.\n\nWu, L. & Moody, J. (1996), 'A smoothing regularizer for feedforward and recurrent neural networks', Neural Computation 8(3), 463-491.", "award": [], "sourceid": 1119, "authors": [{"given_name": "Lizhong", "family_name": "Wu", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}