{"title": "Analysis of Short Term Memories for Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1011, "page_last": 1018, "abstract": null, "full_text": "Analysis of Short Term Memories for Neural \n\nNetworks \n\nJose C. Principe, Hui-H. Hsu and Jyh-Ming Kuo \n\nComputational NeuroEngineering Laboratory \n\nDepartment of Electrical Engineering \n\nUniversity of Florida, CSE 447 \n\nGainesville, FL 32611 \n\nprincipe@synapse.ee.ufi.edu \n\nAbstract \n\nShort term memory is indispensable for the processing of time \nvarying information with artificial neural networks. In this paper a \nmodel for linear memories is presented, and ways to include \nmemories in connectionist topologies are discussed. A comparison \nis drawn among different memory types, with indication of what is \nthe salient characteristic of each memory model. \n\n1 \n\nINTRODUCTION \n\nAn adaptive system that has to interact with the external world is faced with the \nproblem of coping with the time varying nature of real world signals. Time varying \nsignals, natural or man made, carry information in their time structure. The problem \nis then one of devising methods and topologies (in the case of interest here, neural \ntopologies) that explore information along time.This problem can be appropriately \ncalled temporal pattern recognition, as opposed to the more traditional case of static \npattern recognition. In static pattern recognition an input is represented by a point in \na space with dimensionality given by the number of signal features, while in temporal \npattern recognition the inputs are sequence of features. These sequence of features \ncan also be thought as a point but in a vector space of increasing dimensionality. \nFortunately the recent history of the input signal is the one that bears more \ninformation to the decision making, so the effective dimensionality is finite but very \nlarge and unspecified a priori. 
How to find the appropriate window of input data (memory depth) for a given application is a difficult problem. Likewise, how to combine the information in this time window to better meet the processing goal is also nontrivial. Since we are interested in adaptive systems, the goal is to let the system find these quantities adaptively using the output error information.\n\nThese abstract ideas can be framed more quantitatively in a geometric setting (vector space). Assume that the input is a vector [u(1), ..., u(n), ...] of growing size. The adaptive processor (a neural network in our case) has a fixed size to represent this information, which we assign to its state vector [x_1(n), ..., x_N(n)] of size N. The usefulness of x_k(n) depends on how well it spans the growing input space (defined by the vector u(n)), and how well it spans the decision space, which is normally associated with the minimization of the mean square error (Figure 1). Therefore, in principle, the procedure can be divided into a representational and a mapping problem.\n\nThe most general solution to this problem is to consider a nonlinear projection manifold which can be modified to meet both requirements. In terms of neural topologies, this translates to a fully recurrent system, where the weights are adapted such that the error criterion is minimized. Experience has shown that this is a rather difficult proposition. Instead, neural network researchers have worked with a wealth of methods that in some way constrain the neural topology.\n\nFigure 1. Projection of u(n) and the error for the task (for simplicity we represent only linear manifolds).\n\nThe solution that we have been studying is also constrained. We consider a linear manifold as the projection space, which we call the memory space. 
The projection of u(n) in this space is subsequently mapped by means of a feedforward neural network (multilayer perceptron) to a vector in decision space that minimizes the error criterion. This model gives rise to the focused topologies. The advantage of this constrained model is that it allows an analytical study of the memory structures, since they become linear filters. It is important to stress that the choice of the projection space is crucial for the ultimate performance of the system, because if the projected version of u(n) in the memory space discards valuable information about u(n), then the nonlinear mapping will always produce sub-optimal results.\n\n2 Projection in the memory space\n\nIf the projection space is linear, then the representational problem can be studied with linear system concepts. The projected vector u(n) becomes y(n)\n\ny(n) = Σ_{k=1}^{N} w_k x_k(n)   (1)\n\nwhere the x_k(n) are the memory traces. Notice that in this equation the coefficients w_k are independent of time, and their number is fixed to N. What is the most general linear structure that implements this projection operation? It is the generalized feedforward structure [Principe et al., 1992] (Figure 2), which in connectionist circles has been called the time lagged recursive network [Back and Tsoi, 1992]. One can show that the defining relation for generalized feedforward structures is\n\ng_k(n) = g(n) * g_{k-1}(n), k >= 1\n\nwhere * represents the convolution operation, and g_0(n) = δ(n). This relation means that the next state vector is constructed from the previous state vector by convolution with the same function g(n), yet unspecified. Different choices of g(n) will provide different choices for the projection space axes. 
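The construction above can be sketched numerically. The following Python sketch is illustrative only (the function names and the finite truncation window are our own choices, not from the paper): it builds the kernels g_k(n) by repeated convolution with the generating kernel g(n) and forms the projection y(n) of equation (1).

```python
import numpy as np

def tap_kernels(g, K):
    """Kernels of the generalized feedforward structure:
    g_0(n) = delta(n), g_k(n) = g(n) * g_{k-1}(n), truncated to the window."""
    g = np.asarray(g, dtype=float)
    kernels = [np.zeros_like(g)]
    kernels[0][0] = 1.0                      # g_0(n) = delta(n)
    for _ in range(K):
        # next kernel = convolution with the same g(n)
        kernels.append(np.convolve(g, kernels[-1])[:len(g)])
    return kernels

def project(u, w, g):
    """Equation (1): y(n) = sum_{k=1}^{N} w_k x_k(n), with x_k = u * g_k."""
    u = np.asarray(u, dtype=float)
    taps = [np.convolve(u, gk)[:len(u)] for gk in tap_kernels(g, len(w))[1:]]
    return sum(wk * xk for wk, xk in zip(w, taps))
```

With g(n) = δ(n - 1), each kernel g_k is a pure delay of k samples, so the structure degenerates into the familiar tap delay line and x_k(n) = u(n - k).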
When we apply the input u(n) to this structure, the axes of the projection space become the tap signals x_k(n), the convolutions of u(n) with the kernels g_k(n). The projection is obtained by linearly weighting the tap signals according to equation (1).\n\nFigure 2. The generalized feedforward structure.\n\nWe define a memory structure as a linear system whose generating kernel g(n) is causal (g(n) = 0 for n < 0) and normalized, i.e.\n\nΣ_{n=0}^{∞} |g(n)| = 1\n\nWe define memory depth D as the modified center of mass (first moment in time) of the last memory tap\n\nD = Σ_{n=0}^{∞} n g_K(n)\n\nAnd we define the memory resolution R as the number of taps per unit time, which becomes K/D. The purpose of the memory structure is to transform the search for an unconstrained number of coefficients (as necessary if we worked directly with u(n)) into one of seeking a fixed number of coefficients in a space with time varying axes.\n\n3 Review of connectionist memory structures\n\nThe gamma memory [deVries and Principe, 1992] contains as special cases the context unit [Jordan, 1986] and the tap delay line as used in the TDNN [Waibel et al., 1989]. However, the gamma memory is also a special case of the generalized feedforward filters, where g(n) = µ(1 - µ)^n, which leads to the gamma functions as the tap signals. Figure 3, adapted from [deVries and Principe, 1993], shows the most common connectionist memory structures and their characteristics.\n\nAs can be seen, when K = 1 the gamma memory defaults to the context unit, and when µ = 1 the gamma memory becomes the tap delay line. In vector spaces the context unit represents a line, and by changing µ we are finding the best projection of u(n) on this line. This representation is appropriate when one wants long memories but low resolution. 
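The gamma memory's tap signals can be generated with the leaky-cascade recursion x_k(n) = (1 - µ) x_k(n-1) + µ x_{k-1}(n-1), x_0(n) = u(n). A minimal sketch (the function name and array layout are our own, not code from the paper):

```python
import numpy as np

def gamma_memory(u, K, mu):
    """Tap signals x_0 ... x_K of a K-th order gamma memory:
    x_0(n) = u(n), x_k(n) = (1 - mu) x_k(n-1) + mu x_{k-1}(n-1)."""
    u = np.asarray(u, dtype=float)
    x = np.zeros((K + 1, len(u)))
    x[0] = u
    for n in range(1, len(u)):
        for k in range(1, K + 1):
            # leak the previous trace and feed the earlier tap
            x[k, n] = (1 - mu) * x[k, n - 1] + mu * x[k - 1, n - 1]
    return x
```

Setting mu = 1 removes the leakage and every tap becomes a pure delay (the tap delay line), while K = 1 with a small mu gives the slowly fading trace of the context unit; the depth of the last tap grows as K/µ, which is the depth-versus-resolution trade-off discussed above.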
\n\nLikewise, in the tap delay line, we are projecting u(n) in a memory space that is uniquely determined by the input signal, i.e. once the input signal u(n) is set, the axes become u(n-k) and the only degree of freedom is the memory order K. This memory structure has the highest resolution but lacks versatility, since one can only improve the input signal representation by increasing the order of the memory. In this respect, the simple context unit is better (or any memory with a recursive parameter), since the neural system can adapt the parameter µ to project the input signal for better performance.\n\nWe recently proved that the gamma memory structure in continuous time represents a memory space that is rigid [Principe et al., 1994]. When minimizing the output mean square error, the distance between the input signal and the projection space decreases. The recursive parameter in the feedforward structures changes the span of the memory space with respect to the input signal u(n) (which can be visualized as some type of complex rotation). In terms of time domain analysis, the recursive parameter is finding the length of the time window (the memory depth) containing the relevant information to decrease the output mean square error. The recursive parameter µ can be adapted by gradient descent learning [deVries and Principe, 1992], but the adaptation becomes nonlinear and multiple minima exist. Notice that the memory structure is stable for 0 < µ < 2.\n\n4 The Gamma II memory\n\nThe Gamma II memory [Silva et al., 1992] extends the gamma memory with a second recursive parameter ν > 0, which creates a pair of complex conjugate poles. These parameters can be adapted by gradient descent [Silva et al., 1992]. In terms of versatility, the Gamma II has a pair of free complex poles, the Gamma I has a pole restricted to the real line in the Z domain, and the tap delay line has the pole set at the origin of the Z domain (z = 0). A multilayer perceptron equipped with an input memory layer with the Gamma II memory structure implements a nonlinear mapping on an ARMA model of the input signal. 
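The versatility argument can be made concrete by locating the poles. Assuming the Gamma II delay operator reads G(z) = µ[z - (1 - µ)] / ([z - (1 - µ)]² + ν²µ²) (our reconstruction of the operator listed with the memory structures later in the paper), the denominator vanishes at z = (1 - µ) ± jνµ. A minimal sketch under that assumption:

```python
def gamma2_poles(mu, nu):
    """Poles of the assumed Gamma II delay operator
    G(z) = mu (z - (1 - mu)) / ((z - (1 - mu))**2 + (nu * mu)**2);
    the denominator vanishes at z = (1 - mu) +/- j * nu * mu."""
    return complex(1 - mu, nu * mu), complex(1 - mu, -nu * mu)

p_plus, p_minus = gamma2_poles(mu=0.5, nu=0.8)
stable = max(abs(p_plus), abs(p_minus)) < 1.0   # poles inside the unit circle
```

Note how the special cases fall out: ν = 0 collapses the pair onto the real line (the Gamma I case), and µ = 1 with ν = 0 puts the pole at the origin z = 0 (the tap delay line).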
\n\n5 How to use memory structures in connectionist networks\n\nAlthough we have presented this theory with the focused architectures (which correspond to a nonlinear moving average model (NMAX)), the memory structures can be placed anywhere in the neural topology. Any nonlinear processing element can feed one of these memory kernels as an extension of [Wan, 1990]. If the memory structures are used to store traces of the output of the net, we obtain a nonlinear autoregressive model (NARX). If they are used both at the input and output, they represent a nonlinear ARMAX model, which has been shown to be very powerful for system identification tasks. When the memory layer is placed in the hidden layers, there is no corresponding linear model.\n\nGamma II delay operator: G(z) = µ[z - (1 - µ)] / ([z - (1 - µ)]^2 + ν^2 µ^2)\n\nOne must realize that these types of memory structures are recursive (except the tap delay line), so their training will involve gradients that depend on time. In the focused topologies the network weights can still be trained with static backpropagation, but the recursive parameter must be trained with real time recurrent learning (RTRL) or backpropagation through time (BPTT). When memory structures are scattered throughout the topology, training can be easily accomplished with backpropagation through time, provided a systematic way is utilized to decompose the global dynamics into local dynamics, as suggested in [Lefebvre and Principe, 1993].\n\n6 Conclusions\n\nThe goal of this paper is to present a set of memory structures and show their relationship. The newly introduced Gamma II is the most general of the memories reviewed. By adaptively changing the two parameters ν, µ the memory can create complex poles at any location in the unit circle. This is probably the most general memory mechanism that needs to be considered. 
With it one can model poles and zeros of the system that created the signal (if it admits a linear model).\n\nIn this paper we addressed the general problem of extracting patterns in time. We have been studying this problem by pre-wiring the additive neural model, decomposing it into a linear part (the memory space), dedicated to the storage of past values of the input (output or internal states), and a nonlinear part, which is static. The memory space accepts local recursion, which creates a powerful representational structure in which stability can be easily enforced (a test on a single parameter). Recursive memories have the tremendous advantage of being able to trade memory depth for resolution. In vector spaces this means changing the relative position between the projection space and the input signal. However, the problem of finding the best resolution is still open (this means adaptively finding K, the memory order). Likewise, ways to adaptively find the optimal value of the memory depth need improvement, since the gradient procedures used up to now may be trapped in local minima. It is still necessary to modify the definition of memory depth such that it applies to both of these new memory structures. The method is to define it as the center of mass of the envelope of the last kernel.\n\nAcknowledgments: This work was partially supported by NSF grant ECS #920878.\n\n7 References\n\nBack, A. D. and A. C. Tsoi, \"An Adaptive Lattice Architecture for Dynamic Multilayer Perceptrons,\" Neural Computation, vol. 4, no. 6, pp. 922-931, November 1992.\nde Vries, B. and J. C. Principe, \"The gamma model - a new neural model for temporal processing,\" Neural Networks, vol. 5, no. 4, pp. 565-576, 1992.\nde Vries, B., J. C. Principe, and P. G. De Oliveira, \"Adaline with adaptive recursive memory,\" Proc. IEEE Workshop Neural Networks for Signal Processing, Princeton, NJ, 1991. 
\nJordan, M., \"Attractor dynamics and parallelism in a connectionist sequential \nmachine,\" Proc. 8th annual Conf. on Cognitive Science Society, pp. 531-546, 1986. \nLefebvre, C., and J.C. Principe, \"Object-oriented artificial neural network \nimplementations\", Proc. World Cong on Neural Nets, vol IV, pp436-439, 1993. \nPrincipe, J. deVries B., Oliveira P., \"Generalized feedforward structures: a new class \nof adaptive fitlers\", ICASSP92, vol IV, 244-248, San Francisco. \nPrincipe, J.C., and B. de Vries, \"Short term neural memories for time varying signal \nclassification,\" in Proc. 26th ASILOMAR Conf., pp. 766-770, 1992. \nPrincipe J. C., J.M. Kuo, and S. Celebi,\" An Analysis of Short Term Memory \nStructures in Dynamic Neural Networks\", accepted in the special issue of recurrent \nnetworks of IEEE Trans. on Neural Networks. \nPalkar M., and J.e. Principe, \"Echo cancellation with the gamma filter,\" to be \npresented at ICASSP, 1994. \nSilva, T.O., \"On the equivalence between gamma and Laguerre filters,\" to be presented \nat ICASSP, 1994. \nSilva, T.O., J.C. Principe, and B. de Vries, \"Generalized feedforward filters with \ncomplex poles,\" Proc. Second IEEE Conf. Neural Networks for Signal Processing, \npp.503-510, 1992. \nWaiber, A., \"Modular Construction of Time-Delay Neural Networks for Speech \nRecognition,\" Neural Computation I, pp39-46, 1989. \nWan, A. E., \"Temporal backpropagation: an efficient algorithm for finite impulse \nresponse neural networks,\" Connectionist Models, Proc. of the 1990 Summer School, \npp.131-137, 1990. \n\n\f", "award": [], "sourceid": 795, "authors": [{"given_name": "Jose", "family_name": "Principe", "institution": null}, {"given_name": "Hui-H.", "family_name": "Hsu", "institution": null}, {"given_name": "Jyh-Ming", "family_name": "Kuo", "institution": null}]}