{"title": "A Dynamical Approach to Temporal Pattern Processing", "book": "Neural Information Processing Systems", "page_first": 750, "page_last": 759, "abstract": null, "full_text": "A DYNAMICAL APPROACH TO TEMPORAL PATTERN PROCESSING\n\nW. Scott Stornetta\n\nStanford University, Physics Department, Stanford, Ca., 94305\n\nTad Hogg and B. A. Huberman\n\nXerox Palo Alto Research Center, Palo Alto, Ca. 94304\n\nABSTRACT\n\nRecognizing patterns with temporal context is important for such tasks as speech recognition, motion detection and signature verification. We propose an architecture in which time serves as its own representation, and temporal context is encoded in the state of the nodes. We contrast this with the approach of replicating portions of the architecture to represent time.\n\nAs one example of these ideas, we demonstrate an architecture with capacitive inputs serving as temporal feature detectors in an otherwise standard back propagation model. Experiments involving motion detection and word discrimination serve to illustrate novel features of the system. Finally, we discuss possible extensions of the architecture.\n\nINTRODUCTION\n\nRecent interest in connectionist, or \"neural\" networks has emphasized their ability to store, retrieve and process patterns1,2. For most applications, the patterns to be processed are static in the sense that they lack temporal context.\n\nAnother important class consists of those problems that require the processing of temporal patterns. In these the information to be learned or processed is not a particular pattern but a sequence of patterns. Such problems include speech processing, signature verification, motion detection, and predictive signal processing3-8.\n\nMore precisely, temporal pattern processing means that the desired output depends not only on the current input but also on those preceding or following it as well. 
This implies that two identical inputs at different time steps might yield different desired outputs depending on what patterns precede or follow them.\n\n\u00a9 American Institute of Physics 1988\n\nThere is another feature characteristic of much temporal pattern processing. Here an entire sequence of patterns is recognized as a single distinct category, generating a single output. A typical example of this would be the need to recognize words from a rapidly sampled acoustic signal. One should respond only once to the appearance of each word, even though the word consists of many samples. Thus, each input may not produce an output.\n\nWith these features in mind, there are at least three additional issues which networks that process temporal patterns must address, above and beyond those that work with static patterns. The first is how to represent temporal context in the state of the network. The second is how to train at intermediate time steps, before a temporal pattern is complete. The third issue is how to interpret the outputs during recognition, that is, how to tell when the sequence has been completed. Solutions to each of these issues require the construction of appropriate input and output representations. This paper is an attempt to address these issues, particularly the issue of representing temporal context in the state of the machine. We note in passing that the recognition of temporal sequences is distinct from the related problem of generating a sequence, given its first few members9,10,11.\n\nTEMPORAL CLASSIFICATION\n\nWith some exceptions10,12, in most previous work on temporal problems the systems record the temporal pattern by replicating part of the architecture for each time step. In some instances input nodes and their associated links are replicated3,4. In other cases only the weights or links are replicated, once for each of several time delays7,8. 
In either case, this amounts to mapping the temporal pattern into a spatial one of much higher dimension before processing.\n\nThese systems have generated significant and encouraging results. However, these approaches also have inherent drawbacks. First, by replicating portions of the architecture for each time step the amount of redundant computation is significantly increased. This problem becomes extreme when the signal is sampled very frequently4. Next, by relying on replications of the architecture for each time step, the system is quite inflexible to variations in the rate at which the data is presented or the size of the temporal window. Any variability in the rate of the input signal can generate an input pattern which bears little or no resemblance to the trained pattern. Such variability is an important issue, for example, in speech recognition. Moreover, having a temporal window of any fixed length makes it manifestly impossible to detect contextual effects on time scales longer than the window size. An additional difficulty is that a misaligned signal, in its spatial representation, may have very little resemblance to the correctly aligned training signal. That is, these systems typically suffer from not being translationally invariant in time.\n\nNetworks based on relaxation to equilibrium11,13,14 also have difficulties for use with temporal problems. Such an approach removes any dependence on initial conditions and hence is difficult to reconcile directly with temporal problems, which by their nature depend on inputs from earlier times. Also, if a temporal problem is to be handled in terms of relaxation to equilibrium, the equilibrium points themselves must be changing in time.\n\nA NON-REPLICATED, DYNAMIC ARCHITECTURE\n\nWe believe that many of the difficulties mentioned above are tied to the attempt to map an inherently dynamical problem into a static problem of higher dimension. 
As an alternative, we propose to represent the history of the inputs in the state of the nodes of a system, rather than by adding additional units. Such an approach to capturing temporal context shows some very immediate advantages over the systems mentioned above. First, it requires no replication of units for each distinct time step. Second, it does not fix in the architecture itself the window for temporal context or the presentation rate. These advantages are a direct result of the decision to let time serve as its own representation for temporal sequences, rather than creating additional spatial dimensions to represent time.\n\nIn addition to providing a solution to the above problems, this system lends itself naturally to interpretation as an evolving dynamical system. Our approach allows one to think of the process of mapping an evolving input into a discrete sequence of outputs (such as mapping continuous speech input into a sequence of words) as a dynamical system moving from one attractor to another15.\n\nAs a preliminary example of the application of these ideas, we introduce a system that captures the temporal context of input patterns without replicating units for each time step. We modify the conventional back propagation algorithm by making the input units capacitive. In contrast to the conventional architecture in which the input nodes are used simply to distribute the signal to the next layer, our system performs an additional computation. Specifically, let X_i be the value computed by an input node at time t_i, and I_i be the input signal to this node at the same time. Then the node computes successive values according to\n\nX_i = a I_i + d X_{i-1}    (1)\n\nwhere a is an input amplitude and d is a decay rate. Thus, the result computed by an input unit is the sum of the current input value multiplied by a, plus a fractional part, d, of the previously computed value of the input unit. 
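The update rule of Eq. (1) is a one-line leaky integrator. The following sketch is our own illustrative rendering, not code from the paper; the function names are ours, and the parameter helper follows the 1/e-decay and steady-state criteria the paper uses to pick d and a:

```python
import math


def capacitive_params(decay_steps, x_max, i_max=1.0):
    """Pick d so the activation falls to 1/e of its value over
    `decay_steps` time steps, and a so a sustained input of i_max
    saturates at x_max (steady state x = a*i/(1-d))."""
    d = math.exp(-1.0 / decay_steps)
    a = x_max * (1.0 - d) / i_max
    return a, d


def capacitive_node(inputs, a, d, x=0.0):
    """Eq. (1): X_i = a*I_i + d*X_{i-1}, applied over an input sequence."""
    history = []
    for i in inputs:
        x = a * i + d * x
        history.append(x)
    return history


# The motion-detection experiment's values: 4-step decay, maximum 2.0
a, d = capacitive_params(4, 2.0)   # d ~ 0.7788, a ~ 0.4424
trace = capacitive_node([1.0] * 100, a, d)
```

With no further input, the computed value simply shrinks by a factor d per step, which is the exponential decay the text describes.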
In the absence of further input, this produces an exponential decay in the activation of the input nodes. The value for d is chosen so that this decay reaches 1/e of its original value in a time t characteristic of the time scale for the particular problem, i.e., d = e^{-1/(tr)}, where r is the presentation rate. The value for a is chosen to produce a specified maximum value for X, given by X_max = a I_max/(1 - d). We note that Eq. (1) is equivalent to having a non-modifiable recurrent link with weight d on the input nodes, as illustrated in Fig. 1.\n\nFig. 1: Schematic architecture with capacitive inputs. The input nodes compute values according to Eq. (1). Hidden and output units are identical to standard back propagation nets.\n\nThe processing which takes place at the input node can also be thought of in terms of an infinite impulse response (IIR) digital filter. The infinite impulse response of the filter allows input from the arbitrarily distant past to influence the current output of the filter, in contrast to methods which employ fixed windows, which can be viewed in terms of finite impulse response (FIR) filters. The capacitive node of Fig. 1 is equivalent to pre-processing the signal with a filter with transfer function a/(1 - d z^{-1}).\n\nThis system has the unique feature that a simple transformation of the parameters a and d allows it to respond in a near-optimal way to a signal which differs from the training signal in its rate. Consider a system initially trained at rate r with decay rate d and amplitude a. To make use of these weights for a different presentation rate r', one simply adjusts the values a' and d' according to\n\nd' = d^{r/r'}    (2)\n\na' = a (1 - d')/(1 - d)    (3)\n\nThese equations can be derived by the following argument. The general idea is that the values computed by the input nodes at the new rate should be as close as possible to those computed at the original rate. 
Specifically, suppose one wishes to change the sampling rate from r to nr, where n is an integer. Suppose that at a time t_0 the computed value of the input node is X_0. If this node receives no additional input, then after m time steps, the computed value of the input node will be X_0 d^m. For the more rapid sampling rate, X_0 d^m should be the value obtained after nm time steps. Thus we require\n\nX_0 (d')^{nm} = X_0 d^m    (4)\n\nwhich leads to Eq. (2) because n = r'/r. Now suppose that an input I is presented m times in succession to an input node that is initially zero. After the mth presentation, the computed value of the input node is\n\nX = a I (1 - d^m)/(1 - d)    (5)\n\nRequiring this value to be equal to the corresponding value for the faster presentation rate after nm time steps leads to Eq. (3). These equations, then, make the computed values of the input nodes identical, independent of the presentation rate. Of course, this statement only holds exactly in the limit that the computed values of the input nodes change only infinitesimally from one time step to the next. Thus, in practice, one must ensure that the signal is sampled frequently enough that the computed value of the input nodes is slowly changing.\n\nThe point in weight space obtained after initial training at the rate r has two desirable properties. First, the system can be trained on a signal at one sampling rate and then the values of the weights arrived at can be used as a near-optimal starting point to further train the system on the same signal but at a different sampling rate. Alternatively, the system can respond to temporal patterns which differ in rate from the training signal, without any retraining of the weights. These factors are a result of the choice of input representation, which essentially presents the same pattern to the hidden and other layers, independent of sampling rate. 
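The rescaling argument can be checked numerically. The sketch below is our own illustration (all names are ours): it drives one capacitive node at the original rate and another at twice that rate with a' and d' from Eqs. (2) and (3), and verifies that the faster node reproduces the slower node's values:

```python
import math


def rescale(a, d, ratio):
    """Eqs. (2) and (3): d' = d**(r/r'), a' = a*(1-d')/(1-d),
    where ratio = r'/r is the new-to-old sampling-rate ratio."""
    d_new = d ** (1.0 / ratio)
    a_new = a * (1.0 - d_new) / (1.0 - d)
    return a_new, d_new


def run(inputs, a, d):
    """Apply Eq. (1) over an input sequence, starting from zero."""
    x, out = 0.0, []
    for i in inputs:
        x = a * i + d * x
        out.append(x)
    return out


d = math.exp(-0.25)                # 4-step decay at the original rate
a = 2.0 * (1.0 - d)
a2, d2 = rescale(a, d, 2)          # sample the same signal twice as often

slow = run([1.0] * 50, a, d)
fast = run([1.0] * 100, a2, d2)
# every second fast-rate value coincides with a slow-rate value
match = all(abs(fast[2 * m + 1] - slow[m]) < 1e-9 for m in range(50))
```

For a constant input the agreement is exact; for a slowly varying signal it holds approximately, which is the limit discussed in the text.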
These features highlight the fact that in this system the weights to some degree represent the temporal pattern independent of the rate of presentation. In contrast, in systems which use temporal windows, the weights obtained after training on a signal at one sampling rate would have little or no relation to the desired values of the weights for a different sampling rate or window size.\n\nEXPERIMENTS\n\nAs an illustration of this architecture and related algorithm, a three-layer, 15-30-2 system was trained to detect the leftward or rightward motion of a gaussian pulse moving across the field of input units with sudden changes in direction. The values of d and a were 0.7788 and 0.4424, respectively. These values were chosen to give a characteristic decay time of 4 time steps with a maximum value computed by the input nodes of 2.0. The pulse was of unit height with a half-width, \u03c3, of 1.3. Figure 2 shows the input pulse as well as the values computed by the input nodes for leftward or rightward motion. Once trained at a velocity of 0.1 unit per sampling time, the velocity was varied over a wide range, from a factor of 2 slower to a factor of 2 faster, as shown in Fig. 3. For small variations in velocity the system continued to correctly identify the type of motion. More impressive was its performance when the scaling relations given in Eqs. (2) and (3) were used to modify the amplitude and decay rate. In this case, acceptable performance was achieved over the entire range of velocities tested. This was without any additional retraining at the new rates. The difference in performance between the two curves also demonstrates that the excellent performance of the system is not an anomaly of the particular problem chosen, but characteristic of rescaling a and d according to Eqs. (2) and (3). 
We thus see that a simple use of capacitive links to store temporal context allows for motion detection at variable velocities.\n\nA second experiment involving speech data was performed to compare the system's performance to the time-delay neural network of Watrous and Shastri8. In their work, they trained a system to discriminate between suitably processed acoustic signals of the words \"no\" and \"go.\" Once trained on a single utterance, the system was able to correctly identify other samples of these words from the same speaker. One drawback of their approach was that the weights did not converge to a fixed point. We were therefore particularly interested in whether our system could converge smoothly and rapidly to a stable solution, using the same data, and yet generalize as well as theirs did. This experiment also provided an opportunity to test a solution to the intermediate step training problem.\n\nThe architecture was a 16-30-2 network. Each of the input nodes received an input signal corresponding to the energy (sampled every 2.5 milliseconds) as a function of time in one of 16 frequency channels. The input values were normalized to lie in the range 0.0 to 1.0. The values of d and a were 0.9944 and 0.022, respectively. These values were chosen to give a characteristic decay time comparable to the length of each word (they were nearly the same length), and a maximum value computed by the input nodes of 4.0. For an input signal that was part of the word \"no\", the training signal was (1.0, 0.0), while for the word \"go\" it was (0.0, 1.0). Thus the outputs that were compared to the training signal can be interpreted as evidence for one word or the other at each time step. The error shown in Fig. 
4 is the sum of the squares of the difference between the desired outputs and the computed outputs for each time step, for both words, after training up to the number of iterations indicated along the x-axis.\n\nFig. 2: a) Packet presented to input nodes. The x-axis represents the input nodes. b) Computed values from input nodes during rightward motion. c) Computed values during leftward motion.\n\nFig. 3: Performance of motion detection experiment for various velocities. Dashed curve is performance without scaling and solid curve is with the scaling given in Eqs. (2) and (3).\n\nFig. 4: Error in no/go discrimination as a function of the number of training iterations.\n\nEvidence for each word was obtained by summing the values of the respective nodes over time. This suggests a mechanism for signaling the completion of a sequence: when this sum crosses a certain threshold value, the sequence (in this case, the word) is considered recognized. Moreover, it may be possible to extend this mechanism to apply to the case of connected speech: after a word is recognized, the sums could be reset to zero, and the input nodes reinitialized. 
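The accumulate-threshold-reset mechanism described above can be sketched as follows. This is our own illustrative code, not from the paper, and the threshold value and output stream are arbitrary assumptions rather than figures from the experiments:

```python
def recognize(output_seq, threshold):
    """Accumulate per-word evidence from the two output nodes; when a
    running sum crosses `threshold`, report that word as recognized and
    reset both sums (the proposed extension to connected speech)."""
    sums = {"no": 0.0, "go": 0.0}
    words = []
    for no_out, go_out in output_seq:
        sums["no"] += no_out
        sums["go"] += go_out
        winner = max(sums, key=sums.get)
        if sums[winner] >= threshold:
            words.append(winner)
            sums = {"no": 0.0, "go": 0.0}   # reset for the next word
    return words


# a toy output stream: frames favoring "no", then frames favoring "go"
frames = [(0.9, 0.1)] * 30 + [(0.1, 0.9)] * 30
```

With a suitable threshold, each word in the stream is reported exactly once, addressing the requirement that many input samples produce a single output.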
\n\nOnce we had trained the system on a single utterance, we tested the performance of the resulting weights on additional utterances of the same speaker. Preliminary results indicate an ability to correctly discriminate between \"no\" and \"go.\" This suggests that the system has at least a limited ability to generalize in this task domain.\n\nDISCUSSION\n\nAt a more general level, this paper raises and addresses some issues of representation. By choosing input and output representations in a particular way, we are able to make a static optimizer work on a temporal problem while still allowing time to serve as its own representation. In this broader context, one realizes that the choice of capacitive inputs for the input nodes was only one among many possible temporal feature detectors.\n\nOther possibilities include refractory units, derivative units and delayed spike units. Refractory units would compute a value which was some fraction of the current input. The fraction would decrease the more frequently and recently the node had been \"on\" in the recent past. A derivative unit would have a larger output the more rapidly a signal changed from one time step to the next. A delayed spike unit might have a transfer function of the form t^n e^{-at}, where t is the time since the presentation of the signal. This is similar to the function used by Tank and Hopfield7, but here it could serve a different purpose. The maximum value that a given input generated would be delayed by a certain amount of time. By similarly delaying the training signal, the system could be trained to recognize a given input in the context of signals not only preceding but also following it. An important point to note is that the transfer functions of each of these proposed temporal feature detectors could be rescaled in a manner similar to the capacitive nodes. 
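As one concrete case, the delayed-spike transfer function t^n e^{-at} peaks at t = n/a rather than at t = 0, so the unit's strongest response arrives a fixed delay after the input itself. A small sketch (our own illustration; the parameter values are arbitrary assumptions):

```python
import math


def delayed_spike(t, n=2, a=1.0):
    """Transfer function t**n * exp(-a*t); its maximum is at t = n/a."""
    return (t ** n) * math.exp(-a * t)


# locate the peak numerically on a fine grid
ts = [k * 0.01 for k in range(1, 1001)]
peak_t = max(ts, key=delayed_spike)   # close to n/a = 2.0
```

Rescaling this response for a different presentation rate would presumably amount to rescaling a, in the same spirit as Eqs. (2) and (3) for the capacitive nodes.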
This would preserve the property of the system that the weights contain information about the temporal sequence to some degree independent of the sampling rate.\n\nAn even more ambitious possibility would be to have the system train the parameters, such as d in the capacitive node case. It may be feasible to do this in the same way that weights are trained, namely by taking the partial derivative of the computed error with respect to the parameter in question. Such a system may be able to determine the relevant time scales of a temporal signal and adapt accordingly.\n\nACKNOWLEDGEMENTS\n\nWe are grateful for fruitful discussions with Jeff Kephart and the help of Raymond Watrous in providing data from his own experiments. This work was partially supported by DARPA ISTO Contract # N00140-86-C-8996 and ONR Contract # N00014-82-0699.\n\n1. D. Rumelhart, ed., Parallel Distributed Processing, (MIT Press, Cambridge, 1986).\n\n2. J. Denker, ed., Neural Networks for Computing, AIP Conf. Proc., 151 (1986).\n\n3. T. J. Sejnowski and C. R. Rosenberg, NETtalk: A Parallel Network that Learns to Read Aloud, Johns Hopkins Univ. Report No. JHU/EECS-86/01 (1986).\n\n4. J. L. McClelland and J. L. Elman, in Parallel Distributed Processing, vol. II, p. 58.\n\n5. W. Keirstead and B. A. Huberman, Phys. Rev. Lett. 56, 1094 (1986).\n\n6. A. Lapedes and R. Farber, Nonlinear Signal Processing Using Neural Networks, Los Alamos preprint LA-UR-87-2662 (1987).\n\n7. D. Tank and J. Hopfield, Proc. Nat. Acad. Sci. 84, 1896 (1987).\n\n8. R. Watrous and L. Shastri, Proc. 9th Ann. Conf. Cog. Sci. Soc., (Lawrence Erlbaum, Hillsdale, 1987), p. 518.\n\n9. P. Kanerva, Self-Propagating Search: A Unified Theory of Memory, Stanford Univ. Report No. CSLI-84-7 (1984).\n\n10. M. I. Jordan, Proc. 8th Ann. Conf. Cog. Sci. Soc., (Lawrence Erlbaum, Hillsdale, 1986), p. 531.\n\n11. J. Hopfield, Proc. Nat. Acad. Sci. 79, 2554 (1982).\n\n12. 
S. Grossberg, The Adaptive Brain, vol. II, ch. 6, (North-Holland, Amsterdam, 1987).\n\n13. G. Hinton and T. J. Sejnowski, in Parallel Distributed Processing, vol. I, p. 282.\n\n14. B. Gold, in Neural Networks for Computing, p. 158.\n\n15. T. Hogg and B. A. Huberman, Phys. Rev. A 32, 2338 (1985).\n", "award": [], "sourceid": 76, "authors": [{"given_name": "W.", "family_name": "Stornetta", "institution": null}, {"given_name": "Tad", "family_name": "Hogg", "institution": null}, {"given_name": "Bernardo", "family_name": "Huberman", "institution": null}]}