{"title": "Time Warping Invariant Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 180, "page_last": 187, "abstract": null, "full_text": "Time Warping Invariant Neural Networks \n\nGuo-Zheng Sun, Hsing-Hen Chen and Yee-Chun Lee \n\nInstitute for Advanced Computer Studies \n\nand \n\nLaboratory for Plasma Research, \n\nUniversity of Maryland \nCollege Park, MD 20742 \n\nAbstract \n\nWe proposed a model of Time Warping Invariant Neural Networks (TWINN) \n\nto handle the time warped continuous signals. Although TWINN is a simple modifica(cid:173)\ntion of well known recurrent neural network, analysis has shown that TWINN com(cid:173)\npletely removes time warping and is able to handle difficult classification problem. It \nis also shown that TWINN has certain advantages over the current available sequential \nprocessing schemes: Dynamic Programming(DP)[I], Hidden Markov Model((cid:173)\nHMM)[2], Time Delayed Neural Networks(TDNN) [3] and Neural Network Finite \nAutomata(NNFA)[4]. \n\nWe also analyzed the time continuity employed in TWINN and pointed out that \n\nthis kind of structure can memorize longer input history compared with Neural Net(cid:173)\nwork Finite Automata (NNFA). This may help to understand the well accepted fact \nthat for learning grammatical reference with NNF A one had to start with very short \nstrings in training set. \n\nThe numerical example we used is a trajectory classification problem. This \n\nproblem, making a feature of variable sampling rates, having internal states, continu(cid:173)\nous dynamics, heavily time-warped data and deformed phase space trajectories, is \nshown to be difficult to other schemes. With TWINN this problem has been learned in \n100 iterations. For benchmark we also trained the exact same problem with TDNN and \ncompletely failed as expected. \n\nI. 
INTRODUCTION \n\nIn dealing with temporal pattern classification or recognition, time warping of input signals is one of the difficult problems we often encounter. Although there are a number of schemes available to handle time warping, e.g. Dynamic Programming (DP) and Hidden Markov Models (HMM), these schemes also have their own shortcomings in certain aspects. More depressing is that, as far as we know, there are no efficient neural network schemes to handle time warping. In this paper we propose a model of Time Warping Invariant Neural Networks (TWINN) as a solution. Although TWINN is only a simple modification of the well-known recurrent network structure, analysis shows that TWINN has the built-in ability to remove time warping completely. \n\nThe basic idea of TWINN is straightforward. If one plots the state trajectories of a continuous dynamical system in its phase space, these trajectory curves are independent of time warping, because time warping can only change the time duration of travel along these trajectories and does not affect their shapes and structures. Therefore, if we normalize the time dependence of the state variables with respect to any phase-space variable, say the length of the trajectory, the neural network dynamics becomes time warping invariant. \n\nTo illustrate the power of TWINN we tested it with a numerical example of trajectory classification. This problem, chosen as a typical problem that TWINN can handle, has the following properties: (1) The input signals obey continuous-time dynamics and are sampled at various sampling rates. (2) The dynamics of the de-warped signals has internal states. (3) The temporal patterns consist of severely time-warped signals. \n\nTo our knowledge there have not been any neural network schemes which can deal with this case effectively. 
We tested it with TDNN and it failed to learn. \n\nIn the next section we will introduce the TWINN and prove its time warping invariance. In Section III we analyze its features and identify its advantages over other schemes. The numerical example of trajectory classification with TWINN is presented in Section IV. \n\nII. TIME WARPING INVARIANT NEURAL NETWORKS (TWINN) \n\nTo process temporal signals, we consider a fully recurrent network, which consists of two groups of neurons: the state neurons (or recurrent units) represented by the vector S(t), and the input neurons that are clamped to the external input signals {I(t), t = 0, 1, 2, ..., T-1}. The Time Warping Invariant Neural Network (TWINN) is simply defined as: \n\nS(t+1) = S(t) + l(t) F(S(t), W, I(t)) \n\n(1) \nwhere W is the weight matrix and l(t) is the distance between two consecutive input vectors, defined by the norm \n\nl(t) = ||I(t+1) - I(t)|| \n\n(2) \nand the mapping function F is a nonlinear function usually referred to as the neural activity function. For example, for first-order networks it could take the form: \n\nF_i(S(t), W, I(t)) = Tanh(Σ_j W_ij (S(t) ⊕ I(t))_j) \n\n(3) \n\nwhere Tanh(x) is the hyperbolic tangent function and the symbol ⊕ stands for vector concatenation. \n\nFor the purpose of classification (or recognition), we assign a target final state S_k, (k = 1, 2, 3, ..., K), for each category of patterns. After we feed the whole sequence {I(0), I(1), I(2), ..., I(T-1)} into the TWINN, the state vector S(t) reaches the final state S(T). We then compare S(T) with the target final state S_k for each category k, (k = 1, 2, 3, ..., K), and calculate the error: \n\nE_k = (S(T) - S_k)^2 \n\n(4) \n\nThe category with minimal error is chosen as the classification. The ideal error is zero. \n\nFor the purpose of training, we are given a set of training examples for each category. We then minimize the error functions given by Eq. 
(4) using either the back-propagation[7] or the forward propagation algorithm[8]. The training process can be terminated when the total error reaches its minimum. \n\nThe formula of TWINN as shown in Eq. (1) does not look new. The subtle difference from widely used models is the introduction of the normalization factor l(t) in Eq. (1). The main advantage of doing this lies in the built-in time warping invariance, which can be seen directly from the continuous version. \n\nSince Eq. (1) is the discrete implementation of a continuous dynamics, we can easily convert it into a continuous version by replacing \"t+1\" by \"t+Δt\" and letting Δt → 0. By doing so, we get \n\nlim_{Δt→0} (S(t+Δt) - S(t)) / ||I(t+Δt) - I(t)|| = dS/dL \n\n(5) \n\nwhere L is the input trajectory length, which can be expressed as an integral \n\nL(t) = ∫_0^t ||dI/dτ|| dτ \n\n(6) \n\nor a summation (as in the discrete version) \n\nL(t) = Σ_{τ=0}^{t} ||I(τ+1) - I(τ)|| \n\n(7) \n\nFor deterministic dynamics, the distance L(t) is a single-valued function. Therefore, we can make a unique mapping from t to L, Π: t → L, and any function of t can be transformed into a function of L in terms of this mapping. For instance, the input trajectory I(t) and the state trajectory S(t) can be transformed into I(L) and S(L). By doing so, the discrete dynamics of Eq. (1) becomes, in the continuous limit, \n\ndS/dL = F(S(L), W, I(L)) \n\n(8) \nIt is obvious that there is no explicit time dependence in Eq. (8), and therefore the dynamics represented by Eq. (8) is time warping invariant. \n\nTo be more specific, if we draw the trajectory curves of I(t) and S(t) in their respective phase spaces, these two curves would not be deformed if we only changed the time duration of travel along the curves. 
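The update of Eqs. (1)-(3) and the invariance argument of Eq. (8) can be sketched numerically. In the sketch below the network sizes, the random weights and the circular test curve are illustrative assumptions, not the network trained in Section IV:

```python
import numpy as np

rng = np.random.default_rng(0)
N_S, N_I = 4, 2                  # state and input dimensions (illustrative)
W = rng.normal(scale=0.15, size=(N_S, N_S + N_I))

def twinn_run(inputs):
    # Eq. (1): S(t+1) = S(t) + l(t) F(S(t), W, I(t)), with
    # l(t) = ||I(t+1) - I(t)|| (Eq. (2)) and F = tanh(W (S ⊕ I)) (Eq. (3)).
    S = np.zeros(N_S)
    for t in range(len(inputs) - 1):
        l = np.linalg.norm(inputs[t + 1] - inputs[t])
        S = S + l * np.tanh(W @ np.concatenate([S, inputs[t]]))
    return S

# One circular input trajectory sampled on two different "clocks":
# uniform time steps versus a quadratically warped time axis.
curve = lambda t: np.stack([np.cos(t), np.sin(t)], axis=1)
S_a = twinn_run(curve(np.linspace(0.0, 2 * np.pi, 1500)))
S_b = twinn_run(curve(2 * np.pi * np.linspace(0.0, 1.0, 2500) ** 2))

# Eq. (8): both runs integrate dS/dL = F(S, W, I) along the same curve,
# so the final states agree up to discretization error.
print(np.abs(S_a - S_b).max())
```

In the exact continuous limit the two final states coincide; with finite sampling they differ only by the Euler discretization error of the arc-length integration.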
Therefore, if we generate several input sequences {I(t)} using different time warping functions and feed them into TWINN, represented by Eq. (8) or Eq. (1), the induced state dynamics of S(L) is the same. Meanwhile, the final state is the sole criterion for classification. Therefore, any time warped signals will be classified by the TWINN as the same. This is the so-called \"time warping invariance\". \n\nIII. ANALYSIS OF TWINN VS. OTHER SCHEMES \n\nWe emphasize two points in this section. First, we analyze the advantages of the TWINN over other neural network structures, like TDNN, and over mature and well-known algorithms for time warping, such as HMM and Dynamic Programming. Second, we analyze the memory capacity for input history of both the continuous dynamical network of Eq. (1) and its discrete companion, the Neural Network Finite Automaton used in grammatical inference by Liu [4], Sun [5] and Giles [6]. We will show by mathematical estimation that the continuity employed in TWINN increases the power of memorizing history compared with the NNFA. \n\nThe Time Delayed Neural Network (TDNN)[3] has been a useful neural network structure for processing temporal signals and has achieved success in several applications, e.g. speech recognition. Traditional neural network structures are either feedforward or recurrent; the TDNN is something in between. The power of TDNN is in its dynamic combination of spatial processing (as in a feedforward net) and sequential processing (as in a recurrent net with short-time memory). Therefore, the TDNN can detect the local features within each windowed frame and store their voting scores in the short-time memory neurons, and then make a final decision at the end of the input sequence. 
This technique is suitable for processing temporal patterns where the classification is decided by the integration of local features. But it cannot handle long-time correlations across time frames like a state machine, and it does not tolerate time warping effectively: each time-warped pattern will be treated as a new feature. Therefore, TDNN would not be able to handle the numerical example given in this paper, which has both severe time warping and internal states (long-time correlations). The benchmark test has been performed and it confirmed our prediction. Indeed, it can be seen later that in our examples, no matter which category they belong to, all windowed frames contain similar local features; the simple integration of local features does not contribute directly to the final classification. Rather, the whole signal history decides the classification. \n\nAs for Dynamic Programming, it is to date the most efficient way to cope with the time warping problem. The most impressive feature of dynamic programming is that it accomplishes a global search among all ~N^N possible paths using only ~O(N^2) operations, where N is the length of the input time series and, of course, one operation here represents all calculations involved in evaluating the \"score\" of one path. But, on the other hand, this is not ideal. If we can do the time warping using a recurrent network, the number of operations will be reduced to ~O(N). This is a dramatic saving. Another undesirable feature of the current dynamic warping scheme is that the recognition or classification result depends heavily on the pre-selected templates, and therefore one may need a large number of templates for a better classification rate. By adding one or two templates we actually double or triple the number of operations. 
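The O(N^2) table-filling search described above can be sketched with a minimal dynamic time warping routine. This is an illustrative textbook version, not the specific scheme the paper benchmarks against:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping cost between two 1-D series: an O(N*M) table
    fill whose each cell adds the local mismatch to the cheapest of the
    three predecessor cells, yielding the minimum over all warping paths."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

x = np.sin(np.linspace(0, 2 * np.pi, 50))                  # template
x_warped = np.sin(2 * np.pi * np.linspace(0, 1, 80) ** 2)  # warped copy
y = np.linspace(-1.0, 1.0, 50)                             # a genuinely different shape

d_same, d_diff = dtw_distance(x, x_warped), dtw_distance(x, y)
print(d_same, d_diff)
```

The warped copy of the sine aligns at low cost, while the ramp does not, which is exactly the template-matching behavior (and per-template cost) discussed above.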
Therefore, the search for a neural network time warping scheme is a pressing task. \n\nAnother available technique for time warping is the Hidden Markov Model (HMM), which has been successfully applied in speech recognition. The way HMM deals with time warping is through the statistical behavior of its hidden state transitions. Starting from one state q_i, HMM allows a certain probability a_ij of moving forward to another state q_j. Therefore, for any given HMM one can generate various state sequences, say q1q2q2q3q4q4q5, q1q2q2q2q3q3q4q4q5, etc., each with a certain occurrence probability. But these state sequences are \"hidden\"; the observed part is a set of speech data or symbols, represented by {s_k} for example. HMM also includes a set of observation probabilities B = {b_jk}, so that when it is in a certain state, say q_j, HMM allows each symbol from the set {s_k} to occur with probability b_jk. Therefore, for any state sequence one can generate various series of symbols. As an example, let us consider one simple way to generate symbols: in state q_j we generate symbol s_j (with probability b_jj). By doing so, the two state sequences mentioned above would correspond to two possible symbol sequences: s1s2s2s3s4s4s5 and s1s2s2s2s3s3s4s4s5. Examining the two strings closely, we find that the second one may be considered a time-warped version of the first, or vice versa. If we present these two strings to the HMM for testing, it will accept them with similar probabilities. This is the way HMM tolerates time warping. These state transition probabilities of HMM are learned from the statistics of the training set using re-estimation formulas. In this sense, HMM does not deal with time warping directly; instead, it learns the statistical distribution of a training set which contains time warped patterns. 
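The two symbol sequences above can be scored with the standard HMM forward algorithm. The left-to-right model below, with 0.5 stay/advance probabilities and deterministic emissions, is a toy stand-in for the paper's discussion, not a trained speech model:

```python
import numpy as np

# Toy left-to-right HMM: 5 hidden states, each staying put or advancing
# with probability 0.5; state q_i emits symbol s_i with certainty.
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i], A[i, i + 1] = 0.5, 0.5
A[n - 1, n - 1] = 1.0
B = np.eye(n)                   # b_ik: state q_i emits symbol s_k
pi = np.zeros(n); pi[0] = 1.0   # always start in q_1

def forward_prob(obs):
    """Standard HMM forward algorithm: P(observation sequence)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

seq = [0, 1, 1, 2, 3, 3, 4]               # s1 s2 s2 s3 s4 s4 s5
seq_warped = [0, 1, 1, 1, 2, 2, 3, 3, 4]  # s1 s2 s2 s2 s3 s3 s4 s4 s5
print(forward_prob(seq), forward_prob(seq_warped))
```

Both strings are accepted with nonzero probabilities of the same order (0.5^6 versus 0.5^8 here), which is exactly the statistical tolerance of time warping described above.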
Consequently, if one presents a test pattern with time warped signals which is far away from the statistical distribution of the training set, it is very unlikely for an HMM to recognize this pattern. \n\nOn the contrary, the model of TWINN proposed here has an intrinsic, built-in time warping invariance. Although the TWINN itself has internal states, these internal states are not used for tolerating time warping. Instead, they are used to learn the more complex behavior of the \"de-warped\" trajectories. In this sense, TWINN could be more powerful than HMM. \n\nAnother feature of TWINN that needs to be mentioned is its explicit expression of a continuous mapping from S(t) to S(t+1), as shown in Eq. (1). In our early work [4,5,6], to train an NNFA (Neural Network Finite Automaton), we used a discrete mapping \n\nS(t+1) = F(S(t), W, I(t)) \n\n(9) \nwhere F is a nonlinear function, say the Sigmoid function g(x) = 1/(1+e^{-x}). This model has been successfully applied to grammatical inference. The reason we call Eq. (1) a continuous mapping but Eq. (9) a discrete one, even though both of them are implemented in discrete time steps, is the explicit infinitesimal factor l(t) used in Eq. (1). Due to this factor the continuity of the state dynamics is guaranteed, by which we mean that the state variation S(t+1) - S(t) approaches zero if the input variation I(t+1) - I(t) does so. But, in general, the state variation S(t+1) - S(t) generated by Eq. (9) is of order one, regardless of what the input variations are. If one starts from random initial weights, Eq. (9) provides a discrete jump between different, randomly distributed states, which is far from any continuous dynamics. \n\nWe did a numerical test using the NNFA of Eq. (9) to learn the classification problem of continuous trajectories as shown in Section IV. 
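The contrast between the discrete map of Eq. (9) and the continuous map of Eq. (1) can be seen in a one-step numerical comparison. Random weights stand in for an untrained network; the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))
S = rng.uniform(size=4)
I0 = np.array([0.3, 0.7])
I1 = I0 + 1e-3                  # a tiny input step, so l(t) is tiny

g = lambda x: 1.0 / (1.0 + np.exp(-x))   # Sigmoid, as in Eq. (9)
z = np.concatenate([S, I0])

S_nnfa = g(W @ z)                        # Eq. (9): jump of order one
l = np.linalg.norm(I1 - I0)              # Eq. (2)
S_twinn = S + l * np.tanh(W @ z)         # Eq. (1): change scales with l(t)

print(np.linalg.norm(S_nnfa - S), np.linalg.norm(S_twinn - S))
```

For a nearly constant input the TWINN state change vanishes with l(t), while the Eq. (9) state jumps to a new, unrelated point, which is the discontinuity argued above.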
For simplicity we did not include time warping, but the NNFA still failed to learn. The reason is that when we tried to train an NNFA to learn the continuous dynamics, we were actually forcing the weights to generate an almost identical mapping F from S(t) to S(t+1). This is a very strong constraint on the weight parameters, such that it drives the diagonal terms to positive infinity and the off-diagonal terms to negative infinity (when the Sigmoid function is used). When this happens, learning gets stuck due to the saturation effect. \n\nThe failure of the NNFA may also come from its short history memory capacity compared to the continuous mapping of Eq. (1). It has been shown by many numerical experiments on grammatical inference [4,5,6] that to train an NNFA as in Eq. (9) effectively, one has to start with short training patterns (usually, sentence length ≤ 4). Otherwise, learning will fail or be very slow. This is exactly what happened when learning the trajectory classification using the NNFA, where the lengths of our training patterns are in general considerably longer (normally ~60). But TWINN learned it easily. To understand the NNFA's failure and TWINN's success, in the following we analyze how the history information enters the learning process. \n\nConsider the example of learning grammatical inference. Before training, since we have no a priori knowledge about the target values of the weights, we normally start with random initial values. On the other hand, during training the credit assignment (or the weight correction ΔW) can only be done at the end of each input sequence. Consequently, each ΔW should explicitly contain the information about all symbols contained in that string; otherwise the learning is meaningless. 
But, in numerical implementation, every variable, including both ΔW and W, has a finite precision, and any information beyond the precision range is lost. Therefore, to compare which model has the longer history memory we need to examine how the history information relates to the finite precisions of ΔW and W. \n\nLet us illustrate this point with a simple second-order connected fully recurrent network and write both Eq. (1) and Eq. (9) in a unified form \n\nS(t+1) = G_{t+1} \n\n(10) \n\nsuch that Eq. (1) is represented by \n\nG_{t+1} = S(t) + l(t) g(K(t)) \n\n(11) \n\nand Eq. (9) is just \n\nG_{t+1} = g(K(t)) \n\n(12) \n\nwhere K(t) is the weighted sum of the concatenation of the vectors S(t) and I(t) \n\nK_i(t) = Σ_j W_ij (S(t) ⊕ I(t))_j \n\n(13) \n\nFor a grammatical inference problem the error is calculated from the final state S(T) as \n\nE = (S(T) - S_target)^2 \n\n(14) \nLearning is to minimize this error function. According to the standard error back-propagation scheme, the recurrent net can be viewed as a multi-layered net with identical weights between neurons at adjacent time steps: w(t) = W, where w(t) is the \"t-th layer\" weight matrix connecting input S(t-1) to output S(t). The total weight correction is the summation of the weight corrections at each layer. Using the gradient descent scheme one immediately has \n\nΔW = Σ_{t=1}^{T} δw(t) = -η Σ_{t=1}^{T} ∂E/∂w(t) = -η Σ_{t=1}^{T} (∂E/∂G_t) · (∂G_t/∂w(t)) \n\n(15) \n\nIf we define new symbols: the vector u(t), the second-order tensor A(t) and the third-order tensor B(t) as \n\nu_i(t) ≡ ∂E/∂G_i(t),  A_ij(t) ≡ ∂G_i(t+1)/∂S_j(t),  B_ijk(t) ≡ ∂G_i(t)/∂w_jk(t) \n\n(16) \n\nthe weight correction can be simply written as \n\nΔW = -η Σ_{t=1}^{T} u(t) · B(t) \n\n(17) \n\nand the \"error rate\" u(t) can be back-propagated using the derivative chain rule \n\nu(t) = u(t+1) · A(t),  t = 1, 2, ..., T-1; \n\n(18) \n\nso that it is easy to have \n\nu(t) = u(T) · A(T-1) · A(T-2) · ... · A(t) ≡ u(T) Π_{τ=t}^{T-1} A(τ),  t = 1, 2, ..., T-1; \n\n(19) \n\nFirst, let us examine the model of NNFA in Eq. (9). Using Eqs. (12), (13) and (16), A_ij(t) and B_ijk(t) can be written as \n\nA_ij(t) = g'(K_i(t)) W_ij,  B_ijk(t) = δ_ij (S(t-1) ⊕ I(t-1))_k \n\n(20) \nwhere g'(x) ≡ dg/dx = g(1-g) is the derivative of the Sigmoid function and δ_ij is the Kronecker delta. If we substitute B_ijk(t) into Eq. (17), ΔW becomes a weighted sum of all input symbols {I(0), I(1), I(2), ..., I(T-1)}, each with a different weighting factor u(t). Therefore, to guarantee that ΔW contains the information of all input symbols {I(0), I(1), I(2), ..., I(T-1)}, the ratio |u(t)|_max / |u(t)|_min should be within the precision range of ΔW. This is the main point. \n\nAn exact mathematical analysis has not been done, but a rough estimate gives good understanding. From Eq. (19), u(t) is a matrix product of the A(t), and u(1), the coefficient of I(0), contains the highest-order product of the A(t). The key point is that the coefficient ratio between adjacent symbols, |u(t)|/|u(t+1)|, is of the order of |A_ij(t)|, which is a small value; therefore the earlier symbol information can be lost from ΔW due to its finite precision. It can be shown that g'(x) = g(x)(1-g(x)) ≤ 0.25 for any real value of x. Then we roughly have |A_ij(t)| = |g' W_ij| = |g(1-g) W_ij| < 0.25, if we assume the values of the weights W_ij to be of order 1. Thus, the ratio R = |u(t)|_max / |u(t)|_min is estimated as \n\nR ~ |u(1)| / |u(T)| ~ Π_{τ=1}^{T-1} |A(τ)| < 2^{-2(T-1)} \n\n(21) \n\nFrom Eq. (21) we see that if the input pattern length is T = 10, we need at least 2(T-1) = 18 bits of computer memory to store the weight variables (including u, W and ΔW). 
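The estimate of Eqs. (19)-(21) can be checked numerically by back-propagating a unit error through T steps of each Jacobian. The weights, pre-activations and step size l below are random illustrative stand-ins, not trained values:

```python
import numpy as np

# Compare the back-propagated error-rate ratio |u(1)|/|u(T)| for the NNFA
# Jacobian A(t) = g'(K(t)) W against the TWINN Jacobian A(t) = I + l g'(K(t)) W.
rng = np.random.default_rng(2)
N, T = 4, 60
W = rng.normal(size=(N, N))   # weights of order one, as assumed in Eq. (21)
l = 0.02                      # a small arc-length increment l(t)

def backprop_ratio(model):
    u = np.ones(N)            # u(T), the error rate at the final step
    u0 = np.linalg.norm(u)
    for _ in range(T - 1):
        K = rng.normal(size=N)            # random stand-in for K(t)
        if model == "nnfa":
            s = 1.0 / (1.0 + np.exp(-K))
            gp = s * (1.0 - s)            # Sigmoid derivative g' = g(1-g)
            u = u @ (gp[:, None] * W)     # A(t) = g'(K) W, Eq. (20)
        else:
            gp = 1.0 - np.tanh(K) ** 2    # tanh derivative
            u = u @ (np.eye(N) + l * gp[:, None] * W)  # Eq. (22)
    return np.linalg.norm(u) / u0         # |u(1)| / |u(T)|

ratios = {m: backprop_ratio(m) for m in ("nnfa", "twinn")}
print(ratios)
```

The NNFA ratio collapses geometrically with T, while the TWINN ratio stays of order one, matching the precision argument above.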
If T = 60, as in the trajectory classification problem, it requires at least 128-bit weight variables. This is why the NNFA of Eq. (9) could not work. \n\nSimilarly, for the dynamics of Eq. (1), we use Eqs. (11), (13) and (16), and obtain \n\nA_ij(t) = δ_ij + l(t) g'(K_i(t)) W_ij,  B_ijk(t) = l(t) δ_ij (S(t-1) ⊕ I(t-1))_k \n\n(22) \nFrom Eq. (22) we see that no matter how small the factor l(t) may be, |A_ij(t)| remains of order one. Therefore the ratio R = |u(t)|_max / |u(t)|_min, which is estimated as a product of the |A_ij(t)|, is of order one, in contrast to the discrete case of Eq. (21). Therefore, the contributions from all of {I(0), I(1), I(2), ..., I(T-1)} to the weight correction ΔW are of the same order. This prevents information loss during learning. \n\nIV. NUMERICAL SIMULATION \n\nWe demonstrate the power of TWINN with a trajectory classification problem. The three 2-D trajectory equations are artificially given by \n\nClass 1: x(t) = sin(t+β)|sin(t)|, y(t) = cos(t+β)|sin(t)|;  Class 2: x(t) = sin(0.5t+β)sin(1.5t), y(t) = cos(0.5t+β)sin(1.5t);  Class 3: x(t) = sin(t+β)sin(2t), y(t) = cos(t+β)sin(2t) \n\n(23) \n\nwhere β is a uniformly distributed random parameter. When β is changed, these trajectories are distorted accordingly. Some examples (three for each class) are shown in Fig. 1. \n\nFig. 1 PHASE SPACE TRAJECTORIES (Class 1, Class 2, Class 3). Three different shapes of 2-D trajectory, each shown in one column with three examples. Recurrent neural networks are trained to recognize the different shapes of trajectory. \n\nThe trajectory data are time series of two-dimensional coordinate pairs {x(t), y(t)} sampled along three different types of curves in the phase space. The neural net dynamics of TWINN is \n\nS(t+1) = S(t) + l(t) Tanh(W (S(t) ⊕ I(t))) \n\n(24) \n\nwhere we used 6 input neurons I = {1, x(t), y(t), x^2(t), y^2(t), x(t)y(t)} (normalized to norm = 1.0) and 4 (N=4) state neurons S = {S_1, S_2, S_3, S_4}. The neural network structure is shown in Fig. 2. \n\nFig. 2 Time Warping Invariant Neural Network for Trajectory Classification \n\nFig. 3 Time Delayed Neural Network for Trajectory Classification \n\nFor training, we assign the desired final outputs for the three trajectory classes to be (1,0,0), (0,1,0) and (0,0,1) respectively. For recognition, each trajectory data sequence is fed to the input neurons and the state neurons evolve according to the dynamics in Eq. (24). At the end of the input series we check the last three state neurons and classify the input trajectory according to the \"winner-take-all\" rule. \n\nIn each iteration of training we randomly picked 150 deformed trajectories, 50 for each of the three categories, by choosing different values of β within 0 ≤ β ≤ 2π. To simulate time warping we randomly sampled the data by choosing random time steps Δt = 2πr/T along each trajectory, where r is a random number between 0 and 2, with T = 60 for training patterns and T = 20 to 200 for testing patterns. Therefore, each training pattern is a time warped trajectory with average length 60. Using the RTRL algorithm[8] to minimize the error function, after 100 iterations of training it converged to a Mean Square Error of ≈ 0.03. \n\nWe tested the trained network with hundreds of randomly picked input sequences with different sampling rates (from 20/2π to 200/2π) and different warping functions (non-uniform step lengths). All input trajectories were classified correctly. If the sampling rates are too large (>200) or too small (<20), some classification errors occur. \n\nWe tested the same example with TDNN; see Fig. 3 for its parameters. The top layer contains three output neurons for the three classes of trajectories. 
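The data generation of Eq. (23) and the warped sampling scheme above can be sketched as follows. The class indices 0/1/2 stand for Classes 1/2/3; the random seed is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_pattern(cls, T=60):
    """One time-warped training trajectory per Eq. (23): random phase beta,
    random per-step time increments dt = 2*pi*r/T with r uniform in (0, 2),
    and 6-component inputs {1, x, y, x^2, y^2, xy} normalized to unit norm."""
    beta = rng.uniform(0, 2 * np.pi)
    t = np.cumsum(2 * np.pi * rng.uniform(0, 2, size=T) / T)  # warped clock
    if cls == 0:    # Class 1
        x = np.sin(t + beta) * np.abs(np.sin(t))
        y = np.cos(t + beta) * np.abs(np.sin(t))
    elif cls == 1:  # Class 2
        x = np.sin(0.5 * t + beta) * np.sin(1.5 * t)
        y = np.cos(0.5 * t + beta) * np.sin(1.5 * t)
    else:           # Class 3
        x = np.sin(t + beta) * np.sin(2 * t)
        y = np.cos(t + beta) * np.sin(2 * t)
    I = np.stack([np.ones(T), x, y, x ** 2, y ** 2, x * y], axis=1)
    return I / np.linalg.norm(I, axis=1, keepdims=True)

pattern = make_pattern(0)
print(pattern.shape)
```

Each call yields a differently phased, differently warped instance of the same phase-space curve, which is what makes the training set hard for schemes without built-in invariance.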
The classification rules, error function and training patterns are the same as those of TWINN. After three days of training on a DEC-3100 workstation, the training error (MSE) approached 0.5 and in testing the error rate was 70%. \n\nV. CONCLUSION \n\nWe have proposed a model of Time Warping Invariant Neural Networks to handle temporal pattern classification where severely time warped and deformed data may occur. This model is shown to have built-in time warping invariance. We have analyzed the properties of TWINN and shown that for trajectory classification it has several advantages over other schemes: HMM, DP, TDNN and NNFA. \n\nWe also numerically implemented the TWINN and easily trained a trajectory classifier. This problem is shown by analysis to be difficult for other schemes. It was also trained with TDNN, which failed. \n\nReferences \n\n[1] H. Sakoe and S. Chiba, \"Dynamic Programming Algorithm Optimization for Spoken Word Recognition\", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-26, pp. 43-49, Feb. 1978. \n\n[2] L.R. Rabiner and B.H. Juang, \"An Introduction to Hidden Markov Models\", IEEE ASSP Magazine, Vol. 3, No. 1, pp. 4-16, 1986. \n\n[3] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang, \"Phoneme Recognition Using Time-Delay Neural Networks\", IEEE Transactions on Acoustics, Speech and Signal Processing, March 1989. \n\n[4] Y.D. Liu, G.Z. Sun, H.H. Chen, C.L. Giles and Y.C. Lee, \"Grammatical Inference and Neural Network State Machines\", Proceedings of the International Joint Conference on Neural Networks, pp. I-285, Washington D.C., 1990. \n\n[5] G.Z. Sun, H.H. Chen, C.L. Giles, Y.C. Lee and D. Chen, \"Connectionist Pushdown Automata that Learn Context-Free Grammars\", Proceedings of the International Joint Conference on Neural Networks, pp. I-577, Washington D.C., 1990. \n\n[6] C.L. Giles, G.Z. Sun, H.H. Chen, Y.C. Lee and D. Chen, \"Higher Order Recurrent Networks & Grammatical Inference\", Advances in Neural Information Processing Systems 2, D.S. Touretzky (ed.), pp. 380-386, Morgan Kaufmann, San Mateo, CA, 1990. \n\n[7] D. Rumelhart, G. Hinton and R. Williams, \"Learning Internal Representations by Error Propagation\", in Parallel Distributed Processing, Vol. 1, MIT Press, 1986; P. Werbos, \"Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences\", Ph.D. thesis, Harvard University, 1974. \n\n[8] R. Williams and D. Zipser, \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks\", Neural Computation 1 (1989), pp. 270-280. \n\n\f", "award": [], "sourceid": 701, "authors": [{"given_name": "Guo-Zheng", "family_name": "Sun", "institution": null}, {"given_name": "Hsing-Hen", "family_name": "Chen", "institution": null}, {"given_name": "Yee-Chun", "family_name": "Lee", "institution": null}]}