{"title": "Recurrent Networks: Second Order Properties and Pruning", "book": "Advances in Neural Information Processing Systems", "page_first": 673, "page_last": 680, "abstract": null, "full_text": "Recurrent Networks: \n\nSecond Order Properties and Pruning \n\nCONNECT, Electronics Institute \n\nTechnical University of Denmark B349 \n\nMorten With Pedersen and Lars Kai Hansen \n\nDK-2800 Lyngby, DENMARK \nemails:with.lkhansen@ei.dtu.dk \n\nAbstract \n\nSecond order properties of cost functions for recurrent networks \nare investigated. We analyze a layered fully recurrent architecture, \nthe virtue of this architecture is that it features the conventional \nfeedforward architecture as a special case. A detailed description of \nrecursive computation of the full Hessian of the network cost func(cid:173)\ntion is provided. We discuss the possibility of invoking simplifying \napproximations of the Hessian and show how weight decays iron the \ncost function and thereby greatly assist training. We present tenta(cid:173)\ntive pruning results, using Hassibi et al.'s Optimal Brain Surgeon, \ndemonstrating that recurrent networks can construct an efficient \ninternal memory. \n\n1 LEARNING IN RECURRENT NETWORKS \n\nTime series processing is an important application area for neural networks and \nnumerous architectures have been suggested, see e.g. (Weigend and Gershenfeld, 94). \nThe most general structure is a fully recurrent network and it may be adapted using \nReal Time Recurrent Learning (RTRL) suggested by (Williams and Zipser, 89). By \ninvoking a recurrent network, the length of the network memory can be adapted to \nthe given time series, while it is fixed for the conventional lag-space net (Weigend \net al., 90). \nIn forecasting, however, feedforward architectures remain the most \npopular structures; only few applications are reported based on the Williams&Zipser \napproach. 
The main difficulties experienced using RTRL are slow convergence and lack of generalization. Analogous problems in feedforward nets are solved using second order methods for training and pruning (LeCun et al., 90; Hassibi et al., 92; Svarer et al., 93). Also, regularization by weight decay significantly improves training and generalization. In this work we initiate the investigation of second order properties for RTRL; a detailed calculation scheme for the cost function Hessian is presented, the importance of weight decay is demonstrated, and preliminary pruning results using Hassibi et al.'s Optimal Brain Surgeon (OBS) are presented. We find that the recurrent network discards the available lag space and constructs its own efficient internal memory. \n\n1.1 REAL TIME RECURRENT LEARNING \n\nThe fully connected feedback nets studied by Williams&Zipser operate like a state machine, computing the outputs from the internal units according to a state vector z(t) containing previous external inputs and internal unit outputs. Let x(t) denote a vector containing the external inputs to the net at time t, and let y(t) denote a vector containing the outputs of the units in the net. We now arrange the indices on x and y so that the elements of z(t) can be defined as \n\nz_k(t) = x_k(t) for k ∈ I, z_k(t) = y_k(t-1) for k ∈ U, \n\nwhere I denotes the set of indices for which z_k is an input, and U denotes the set of indices for which z_k is the output of a unit in the net. Thresholds are implemented using an input permanently clamped to unity. The k'th unit in the net is now updated according to \n\ny_k(t) = f_k[s_k(t)] = f_k[ Σ_{j ∈ I∪U} w_kj z_j(t) ], \n\nwhere w_kj denotes the weight to unit k from input/unit j and f_k is the activation function of the k'th unit. \n\nWhen used for time series prediction, the input vector (excluding threshold) is usually defined as x(t) = [x(t), ..., x(t - L + 1)], where L denotes the dimension of the lag space. 
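As an illustration of the state-vector bookkeeping and unit update just described, a minimal NumPy sketch (not the authors' code; the function name `step` and the choice of tanh as activation are assumptions):

```python
import numpy as np

def step(W, x_t, y_prev):
    # One forward step of a fully connected recurrent (Williams&Zipser) net.
    # W      : weights, shape (n_units, n_inputs + n_units)
    # x_t    : external inputs at time t, with the threshold input clamped to 1.0
    # y_prev : unit outputs from the previous time step
    z = np.concatenate([x_t, y_prev])  # z(t): inputs (k in I) then unit outputs (k in U)
    s = W @ z                          # s_k(t) = sum_j w_kj z_j(t)
    return np.tanh(s)                  # y_k(t) = f_k[s_k(t)]; tanh is an assumed activation

# Example: 2 external inputs plus a clamped threshold, 4 units, y(0) = 0
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3 + 4))
y = np.zeros(4)
for x_t in ([0.5, -0.2, 1.0], [0.1, 0.3, 1.0]):
    y = step(W, np.asarray(x_t), y)
```

The same `step` is simply called once per time step, feeding each output vector back in as `y_prev`, which is what makes the memory length adaptive rather than fixed by a lag space.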
One of the units in the net is designated to be the output unit y_o, and its activation function f_o is often chosen to be linear in order to allow for an arbitrary dynamical range. The prediction of x(t+1) is x̂(t+1) = f_o[s_o(t)]. Also, if the first prediction is at t = 1, the first example is presented at t = 0 and we set y(0) = 0. \n\nWe analyse here a modification of the standard Williams&Zipser construction that is appropriate for forecasting purposes. The studied architecture is layered. Firstly, we remove the external inputs from the linear output unit in order to prevent the network from getting trapped in a linear mode. The output then reads \n\nx̂(t+1) = y_o(t+1) = Σ_{j∈U} w_oj y_j(t) + w_thres,o (1) \n\nSince y(0) = 0 we obtain a first prediction x̂(1) = w_thres,o, which is likely to be a poor prediction, thereby introducing a significant error that is fed back into the network and used in future predictions. Secondly, when pruning a fully recurrent feedback net we would like the net to be able to reduce to a simple two-layer feedforward net if necessary. Note that this is not possible with the conventional Williams&Zipser update rule, since it does not include a layered feedforward net as a special case. In a layered feedforward net the output unit is disconnected from the external inputs; in this case, cf. (1), we see that x̂(t+1) is based on the internal 'hidden' unit outputs y_k(t), which are calculated on the basis of z(t-1) and thereby x(t-1). Hence, besides the startup problems, we also get a two-step-ahead predictor using the standard architecture. \n\nIn order to avoid the problems with the conventional Williams&Zipser update scheme, we use a layered updating scheme inspired by traditional feedforward nets, in which we distinguish between hidden layer units and the output unit. 
At time t, the hidden units work from the input vector z^h(t), defined as \n\nz_k^h(t) = x_k(t) for k ∈ I, z_k^h(t) = y_k^h(t-1) for k ∈ U, z_k^h(t) = y^o(t-1) for k = o, \n\nwhere I denotes the input indices, U denotes the hidden layer units, and O the output unit. Further, we use superscripts h and o to distinguish between hidden units and the output unit. The activation of the hidden units is calculated according to \n\ny_k^h(t) = f_k^h[s_k^h(t)] = f_k^h[ Σ_{i ∈ I∪U∪O} w_ki z_i^h(t) ] , k ∈ U (2) \n\nThe hidden unit outputs are forwarded to the output unit, which then sees the input vector z^o(t), \n\nz_k^o(t) = y_k^h(t) for k ∈ U, z_k^o(t) = y^o(t-1) for k = o, \n\nand is updated according to \n\ny^o(t) = f^o[s^o(t)] = f^o[ Σ_{k ∈ U∪O} w_ok z_k^o(t) ] (3) \n\nThe cost function is defined as C = E + w^T R w, where R is a regularization matrix, w is the concatenated set of parameters, and the sum of squared errors is \n\nE = (1/2) Σ_{t=1}^{T} [e(t)]² , e(t) = x(t) - y^o(t), (4) \n\nwhere T is the size of the training set series. RTRL is based on gradient descent in the cost function; here we investigate accelerated training using Newton methods. For that we need to compute first and second derivatives of the cost function. The essential difficulty is to determine derivatives of the sum of squared errors: \n\n∂E/∂w_ij = - Σ_{t=1}^{T} e(t) ∂y^o(t)/∂w_ij (5) \n\nThe derivative of the output unit is computed as \n\n∂y^o(t)/∂w_ij = (∂f^o[s^o(t)]/∂s^o(t)) · (∂s^o(t)/∂w_ij) (6) \n\nwhere \n\n∂s^o(t)/∂w_ij = δ_oi z_j^o(t) + Σ_{j'∈U} w_oj' ∂y_j'^h(t)/∂w_ij + w_oo ∂y^o(t-1)/∂w_ij (7) \n\nand δ_jk is the Kronecker delta. This expression contains the derivative of the hidden units, \n\n∂y_k^h(t)/∂w_ij = (∂f_k^h[s_k^h(t)]/∂s_k^h(t)) · (∂s_k^h(t)/∂w_ij) (8) \n\nwhere \n\n∂s_k^h(t)/∂w_ij = δ_ki z_j^h(t) + Σ_{j'∈U} w_kj' ∂y_j'^h(t-1)/∂w_ij + w_ko ∂y^o(t-1)/∂w_ij (9) \n\nFigure 1: Cost function dependence of a weight connecting two hidden units for the sunspot benchmark series (both panels plot cost versus weight value). Left panel: Cost function with small weight decay; the (local) optimum chosen is marked by an asterisk. Right panel: The same slice through the cost function, but here retrained with higher weight decay. \n\nThe complexity of the training problem for the recurrent net using RTRL is demonstrated in figure 1. The important role of weight decay (we have used a simple weight decay R = αI) in controlling the complexity of the cost function is evident in the right panel of figure 1. The example studied is the sunspot benchmark problem (see e.g. (Weigend et al., 90) for a definition). First, we trained a network with the small weight decay and recorded the left panel result. Secondly, the network was retrained with increased weight decay and the particular weight connecting two hidden units was varied to produce the right panel result. In both cases all other weights remained fixed at their optimal values for the given weight decay. In addition to the complexity visible in these one-parameter slices of the cost function, the cost function is highly anisotropic in weight space and consequently the network Hessian is ill-conditioned. Hence, gradient descent is hampered by slow convergence. \n\n2 SECOND ORDER PROPERTIES OF THE COST FUNCTION \n\nTo improve training by use of Newton methods and for use in OBS pruning we compute the second derivative of the error functional: \n\n∂²E/∂w_ij∂w_pq = - Σ_{t=1}^{T} [ e(t) ∂²y^o(t)/∂w_ij∂w_pq - (∂y^o(t)/∂w_ij)(∂y^o(t)/∂w_pq) ] (10) \n\nThe second derivative of the output is \n\n∂²y^o(t)/∂w_ij∂w_pq = (∂²f^o[s^o(t)]/∂s^o(t)²) · (∂s^o(t)/∂w_ij)(∂s^o(t)/∂w_pq) + (∂f^o[s^o(t)]/∂s^o(t)) · ∂²s^o(t)/∂w_ij∂w_pq (11) \n\nwith \n\n∂²s^o(t)/∂w_ij∂w_pq = δ_oi ∂z_j^o(t)/∂w_pq + δ_op ∂z_q^o(t)/∂w_ij + Σ_{j'∈U} w_oj' ∂²y_j'^h(t)/∂w_ij∂w_pq + w_oo ∂²y^o(t-1)/∂w_ij∂w_pq (12) \n\nThis expression contains the second derivative of the hidden unit outputs, \n\n∂²y_k^h(t)/∂w_ij∂w_pq = (∂²f_k^h[s_k^h(t)]/∂s_k^h(t)²) · (∂s_k^h(t)/∂w_ij)(∂s_k^h(t)/∂w_pq) + (∂f_k^h[s_k^h(t)]/∂s_k^h(t)) · ∂²s_k^h(t)/∂w_ij∂w_pq (13) \n\nwith \n\n∂²s_k^h(t)/∂w_ij∂w_pq = δ_ki ∂z_j^h(t)/∂w_pq + δ_kp ∂z_q^h(t)/∂w_ij + Σ_{j'∈U} w_kj' ∂²y_j'^h(t-1)/∂w_ij∂w_pq + w_ko ∂²y^o(t-1)/∂w_ij∂w_pq (14) \n\nRecursion in the five-index quantity (14) imposes a significant computational burden; in fact the first term of the Hessian in (10), involving the second derivative, is often neglected for computational convenience (LeCun et al., 90). Here we start by analyzing the significance of this term during training. We train a layered architecture to predict the sunspot benchmark problem. In figure 2 the ratio between the largest eigenvalue of the second derivative term in (10) and the largest eigenvalue of the full Hessian is shown. The ratio is presented for two different magnitudes of weight decay. In line with our observations above, the second order properties of the \"ironed\" cost function are manageable, and we can simplify the Hessian calculation by neglecting the second derivative term in (10), i.e., apply the Gauss-Newton approximation. 
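The Gauss-Newton simplification lends itself to a compact implementation: once the second derivative term of (10) is dropped, the Hessian of E is just an accumulated outer product of output-gradient vectors. A minimal sketch (hypothetical helper name; the gradient recursion (6)-(9) is assumed to supply the per-time-step gradients):

```python
import numpy as np

def gauss_newton_hessian(output_grads):
    # H ~ sum_t g(t) g(t)^T with g(t) = dy_o(t)/dw, i.e. equation (10)
    # without the e(t) * d2y_o(t)/dw2 term.
    G = np.asarray(output_grads, dtype=float)  # shape (T, n_weights)
    return G.T @ G

# Example: two time steps, two weights
H = gauss_newton_hessian([[1.0, 0.0], [0.0, 2.0]])
```

By construction this approximation is symmetric and positive semi-definite, which is part of what makes damped Gauss-Newton steps well behaved on an otherwise ill-conditioned cost surface.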
\n\n3 PRUNING BY THE OPTIMAL BRAIN SURGEON \n\nPruning of recurrent networks has been pursued by (Giles and Omlin, 94) using a heuristic pruning technique, and significant improvement in generalization for a sequence recognition problem was demonstrated. Two pruning schemes are based on systematic estimation of weight saliency: the Optimal Brain Damage (OBD) scheme of (LeCun et al., 90) and OBS by (Hassibi et al., 93). OBD is based on the diagonal approximation of the Hessian and is very robust for forecasting (Svarer et al., 93). If an estimate of the full Hessian is available, OBS can be used for estimation of saliencies incorporating linear retraining. In (Hansen and With Pedersen, 94) OBS was generalized to incorporate weight decays; we use these modifications in our experiments. Note that OBS in its standard form only allows for one weight to be eliminated at a time. The result of a pruning session is a nested family of networks. In order to select the optimal network within the family it was suggested in (Svarer et al., 93) to use the estimated test error. In particular we use Akaike's Final Prediction Error (Akaike, 69) to estimate the network test error, E_test = ((T + N)/(T - N)) · 2E/T,¹ where N is the number of parameters in the network. \n\nFigure 2: Ratio between the largest magnitude eigenvalue of the second derivative term of the Hessian (cf. equation (10)) and the largest magnitude eigenvalue of the complete Hessian as they appeared during ten training sessions. The connected circles represent the average ratio. Left panel: Training with small weight decay. Right panel: Training with a high weight decay. 
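The quantities driving such a pruning session can be sketched as follows (an illustrative reading of the formulas above, not the authors' code; the weight-decay modifications of (Hansen and With Pedersen, 94) are omitted, and the function names are assumptions):

```python
import numpy as np

def obs_saliency(w, H_inv, q):
    # Hassibi et al.'s saliency of weight q: L_q = w_q^2 / (2 [H^-1]_qq)
    return w[q] ** 2 / (2.0 * H_inv[q, q])

def obs_update(w, H_inv, q):
    # Linear retraining: zero w_q and adjust the remaining weights to
    # first order, delta_w = -(w_q / [H^-1]_qq) * H^-1 e_q
    return w - (w[q] / H_inv[q, q]) * H_inv[:, q]

def fpe(E, T, N):
    # Akaike's Final Prediction Error estimate as used in the text:
    # E_test = (T + N) / (T - N) * 2E / T
    return (T + N) / (T - N) * 2.0 * E / T
```

A pruning session then repeatedly deletes the weight with the smallest saliency, applies `obs_update`, and records `fpe` for each member of the resulting nested family of networks.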
In figure 3 we show the results of such a pruning session on the sunspot data, starting from a (4-4-1) network architecture. The recurrent network was trained using a damped Gauss-Newton scheme. Note that the training error increases as weights are eliminated, while the test error and the estimated test error both pass through shallow minima, showing that generalization is slightly improved by pruning. In fact, by retraining the optimal architecture with reduced weight decay, both training and test errors are decreased, in line with the observations in (Svarer et al., 93). It is interesting to observe that the network, though starting with access to a lag space of four delay units, has lost three of the delayed inputs; hence, it relies solely on its internal memory, as seen in the right panel of figure 3. To further illustrate the memory properties of the optimal network, we show in figure 4 the network response to a unit impulse. It is interesting that the response of the network extends for approximately 12 time steps, corresponding to the \"period\" of the sunspot series. \n\n¹The use of Akaike's estimate is not well justified for a feedback net; test error estimates for feedback models are a topic of current research. \n\nFigure 3: Left panel: OBS pruning of a (4-4-1) recurrent network trained on the sunspot benchmark. Development of training error, test error, and Akaike estimated test error (FPE) versus the number of parameters. Right panel: Architecture of the FPE-optimal network. Note that the network discards the available lag space x(t-1), ..., x(t-4) and solely predicts from internal memory. \n\nFigure 4: Left panel: Output of the pruned network after a unit impulse input at t = 0. The internal memory is about 12 time units long, which is, in fact, roughly the period of the sunspot series. Right panel: Activity of the four hidden units in the pruned network after a unit impulse at time t = 0. \n\n4 CONCLUSION \n\nA layered recurrent architecture, which has a feedforward net as a special case, has been investigated. A scheme for recursive estimation of the Hessian of the fully recurrent neural net is devised. It has been shown that weight decay plays a decisive role when adapting recurrent networks. Further, it is shown that the second order information may be used to train and prune a recurrent network, and in this process the network may discard the available lag space. The network builds an efficient internal memory extending beyond the lag space that was originally available. \n\nAcknowledgments \n\nWe thank Jan Larsen, Sara Solla, and Claus Svarer for useful discussions, and Lee Giles for providing us with a preprint of (Giles and Omlin, 94). We thank the anonymous reviewers for valuable comments on the manuscript. This research is supported by the Danish Natural Science and Technical Research Councils through the Computational Neural Network Center (CONNECT). \n\nReferences \n\nH. Akaike: Fitting Autoregressive Models for Prediction. Ann. Inst. Stat. Mat. 21, 243-247, (1969). \n\nY. Le Cun, J.S. Denker, and S.A. Solla: Optimal Brain Damage. In Advances in Neural Information Processing Systems 2, (Ed. D.S. Touretzky) Morgan Kaufmann, 598-605, (1990). \n\nC.L. Giles and C.W. 
Omlin: Pruning of Recurrent Neural Networks for Improved Generalization Performance. IEEE Transactions on Neural Networks, to appear. Preprint, NEC Research Institute (1994). \n\nL.K. Hansen and M. With Pedersen: Controlled Growth of Cascade Correlation Nets. In International Conference on Artificial Neural Networks ICANN'94, Sorrento, (Eds. M. Marinaro and P.G. Morasso) Springer, 797-801, (1994). \n\nB. Hassibi, D.G. Stork, and G.J. Wolff: Optimal Brain Surgeon and General Network Pruning. In Proceedings of the 1993 IEEE International Conference on Neural Networks, San Francisco, (Eds. E.H. Ruspini et al.) IEEE, 293-299, (1993). \n\nC. Svarer, L.K. Hansen, and J. Larsen: On Design and Evaluation of Tapped Delay Line Networks. In Proceedings of the 1993 IEEE International Conference on Neural Networks, San Francisco, (Eds. E.H. Ruspini et al.) 46-51, (1993). \n\nA.S. Weigend, B.A. Huberman, and D.E. Rumelhart: Predicting the Future: A Connectionist Approach. Int. J. of Neural Systems 3, 193-209, (1990). \n\nA.S. Weigend and N.A. Gershenfeld, Eds.: Time Series Prediction: Forecasting the Future and Understanding the Past. Redwood City, CA: Addison-Wesley (1994). \n\nR.J. Williams and D. Zipser: A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation 1, 270-280, (1989). \n", "award": [], "sourceid": 987, "authors": [{"given_name": "Morten", "family_name": "Pedersen", "institution": null}, {"given_name": "Lars", "family_name": "Hansen", "institution": null}]}