Recurrent Networks: Second Order Properties and Pruning

Pedersen, Morten; Hansen, Lars

Recurrent Networks: Second Order Properties and Pruning

Morten With Pedersen, Lars Kai Hansen

Advances in Neural Information Processing Systems 7 (NIPS 1994)

Abstract

Second order properties of cost functions for recurrent networks are investigated. We analyze a layered fully recurrent architecture, the virtue of this architecture is that it features the conventional feedforward architecture as a special case. A detailed description of recursive computation of the full Hessian of the network cost func(cid:173) tion is provided. We discuss the possibility of invoking simplifying approximations of the Hessian and show how weight decays iron the cost function and thereby greatly assist training. We present tenta(cid:173) tive pruning results, using Hassibi et al.'s Optimal Brain Surgeon, demonstrating that recurrent networks can construct an efficient internal memory.

1 LEARNING IN RECURRENT NETWORKS

Time series processing is an important application area for neural networks and numerous architectures have been suggested, see e.g. (Weigend and Gershenfeld, 94). The most general structure is a fully recurrent network and it may be adapted using Real Time Recurrent Learning (RTRL) suggested by (Williams and Zipser, 89). By invoking a recurrent network, the length of the network memory can be adapted to the given time series, while it is fixed for the conventional lag-space net (Weigend et al., 90). In forecasting, however, feedforward architectures remain the most popular structures; only few applications are reported based on the Williams&Zipser approach. The main difficulties experienced using RTRL are slow convergence and

674

Morten With Pedersen, Lars Kai Hansen

lack of generalization. Analogous problems in feedforward nets are solved using second order methods for training and pruning (LeCun et al., 90; Hassibi et al., 92; Svarer et al., 93). Also, regularization by weight decay significantly improves training and generalization. In this work we initiate the investigation of second order properties for RTRL; a detailed calculation scheme for the cost function Hessian is presented, the importance of weight decay is demonstrated, and preliminary pruning results using Hassibi et al.'s Optimal Brain Surgeon (OBS) are presented. We find that the recurrent network discards the available lag space and constructs its own efficient internal memory.

1.1 REAL TIME RECURRENT LEARNING

The fully connected feedback nets studied by Williams&Zipser operate like a state machine, computing the outputs from the internal units according to a state vector z(t) containing previous external inputs and internal unit outputs. Let x(t) denote a vector containing the external inputs to the net at time t, and let y(t) denote a vector containing the outputs of the units in the net. We now arrange the indices on x and y so that the elements of z(t) can be defined as

, k E I , k E U

where I denotes the set of indices for which Zk is an input, and U denotes the set of indices for which Zk is the output of a unit in the net. Thresholds are implemented using an input permanently clamped to unity. The k'th unit in the net is now updated according to

where Wkj denotes the weight to unit k from input/unit j and "'0 is the activation function of the k'th unit. When used for time series prediction, the input vector (excluding threshold) is usually defined as x(t) = [x(t), . .. , x(t - L + 1)] where L denotes the dimension of the lag space. One of the units in the net is designated to be the output unit Yo, and its activating function 10 is often chosen to be linear in order to allow for arbitrary dynamical range. The prediction of x(t + 1) is x(t + 1) = lo[so(t»). Also, if the first prediction is at t = 1, the first example is presented at t = 0 and we 'set y(O) = O. We analyse here a modification of the standard Williams&Zipser construction that is appropriate for forecasting purposes. The studied architecture is layered. Firstly, we remove the external inputs from the linear output unit in order to prevent the network from getting trapped in a linear mode. The output then reads

x(t + 1) = Yo(t + 1) = L WojYj(t) + Wthres,o

Abstract

Name Change Policy