{"title": "The Gamma MLP for Speech Phoneme Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 785, "page_last": 791, "abstract": null, "full_text": "The Gamma MLP for Speech Phoneme Recognition \n\nSteve Lawrence*, Ah Chung Tsoi, Andrew D. Back \n\n{lawrence,act,back}@elec.uq.edu.au \n\nDepartment of Electrical and Computer Engineering \nUniversity of Queensland \nSt. Lucia Qld 4072 Australia \n\nAbstract \n\nWe define a Gamma multi-layer perceptron (MLP) as an MLP with the usual synaptic weights replaced by gamma filters (as proposed by de Vries and Principe (de Vries and Principe, 1992)) and associated gain terms throughout all layers. We derive gradient descent update equations and apply the model to the recognition of speech phonemes. We find that both the inclusion of gamma filters in all layers, and the inclusion of synaptic gains, improves the performance of the Gamma MLP. We compare the Gamma MLP with TDNN, Back-Tsoi FIR MLP, and Back-Tsoi IIR MLP architectures, and a local approximation scheme. We find that the Gamma MLP results in a substantial reduction in error rates. \n\n1 INTRODUCTION \n\n1.1 THE GAMMA FILTER \n\nInfinite Impulse Response (IIR) filters have a significant advantage over Finite Impulse Response (FIR) filters in signal processing: the length of the impulse response is uncoupled from the number of filter parameters. The length of the impulse response is related to the memory depth of a system, and hence IIR filters allow a greater memory depth than FIR filters of the same order. However, IIR filters are not widely used in practical adaptive signal processing. \n\n*http://www.neci.nj.nec.com/homepages/lawrence 
This may be attributed to the fact that a) there could be instability during training, and b) the gradient descent training procedures are not guaranteed to locate the global optimum in the possibly non-convex error surface (Shynk, 1989). \n\nDe Vries and Principe proposed using gamma filters (de Vries and Principe, 1992), a special case of IIR filters, at the input to an otherwise standard MLP. The gamma filter is designed to retain the uncoupling of memory depth from the number of parameters provided by IIR filters, but to have simple stability conditions. \n\nThe output of a neuron in a multi-layer perceptron is computed using[1] y_k^l = f(Σ_{i=0}^{N_{l-1}} w_{ki}^l y_i^{l-1}). De Vries and Principe consider adding short term memory with delays: y_k^l(t) = f(Σ_{i=0}^{N_{l-1}} Σ_{j=0}^{K} g_{kij}^l(t - j) y_i^{l-1}(t - j)), where g_μ^j(t) = (μ^j / (j-1)!) t^{j-1} e^{-μt}, j = 1, ..., K. The depth of the memory is controlled by μ, and K is the order of the filter. For the discrete time case, we obtain the recurrence relation: z_0(t) = x(t) and z_j(t) = (1 - μ) z_j(t-1) + μ z_{j-1}(t-1) for j = 1, ..., K. In this form, the gamma filter can be interpreted as a cascaded series of filter modules, where each module is a first order IIR filter with the transfer function μ / (q - (1 - μ)), where q z_j(t) ≜ z_j(t+1). We have a filter with K poles, all located at 1 - μ. Thus, the gamma filter may be considered as a low pass filter for μ < 1. The value of μ can be fixed, or it can be adapted during training. \n\n2 NETWORK MODELS \n\nFigure 1: A gamma filter synapse with an associated gain term 'c'. \n\nWe have defined a gamma MLP as a multi-layer perceptron where every synapse contains a gamma filter and a gain term, as shown in figure 1. The motivation behind the inclusion of the gain term is discussed later. A separate μ parameter is used for each filter. 
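The discrete-time recurrence z_j(t) = (1 - μ) z_j(t-1) + μ z_{j-1}(t-1) is straightforward to implement. The sketch below is our own illustration (the function name and interface are not from the paper) of a K-stage gamma memory applied to a scalar input sequence:

```python
# Sketch of a discrete-time gamma memory (after de Vries and Principe).
# z[0] tracks the raw input; each later tap is a first order IIR stage:
#   z_j(t) = (1 - mu) * z_j(t-1) + mu * z_{j-1}(t-1)
def gamma_memory(x, K, mu):
    z = [0.0] * (K + 1)      # filter state at time t-1
    taps = []                # taps[t] = [z_0(t), ..., z_K(t)]
    for x_t in x:
        z_new = [x_t]        # z_0(t) = x(t)
        for j in range(1, K + 1):
            z_new.append((1.0 - mu) * z[j] + mu * z[j - 1])
        z = z_new
        taps.append(list(z))
    return taps

# A unit impulse spreads out as it moves down the cascade,
# illustrating the low pass behaviour for mu < 1.
taps = gamma_memory([1.0, 0.0, 0.0, 0.0], K=2, mu=0.5)
```

Adapting μ during training, as the paper does, additionally requires the gradient recurrences of Appendix A.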
Update equations are derived in a manner analogous to the standard MLP and can be found in Appendix A. The model is defined as follows. \n\n[1] where y_k^l is the output of neuron k in layer l, N_l is the number of neurons in layer l, w_{ki}^l is the weight connecting neuron k in layer l to neuron i in layer l-1, y_0^{l-1} = 1 (bias), and f is commonly a sigmoid function. \n\nDefinition 1 A Gamma MLP with L layers excluding the input layer (layers 0, 1, ..., L), gamma filters of order K, and N_0, N_1, ..., N_L neurons per layer, is defined as \n\ny_k^l(t) = f(x_k^l(t)) \nx_k^l(t) = Σ_{i=0}^{N_{l-1}} c_{ki}^l(t) Σ_{j=0}^{K} w_{kij}^l(t) z_{kij}^l(t)   (1) \nz_{kij}^l(t) = (1 - μ_{ki}^l(t)) z_{kij}^l(t-1) + μ_{ki}^l(t) z_{ki(j-1)}^l(t-1),  1 ≤ j ≤ K \nz_{kij}^l(t) = y_i^{l-1}(t),  j = 0 \n\nwhere y = neuron output, c_{ki}^l = synaptic gain, f(a) = (e^{a/2} - e^{-a/2}) / (e^{a/2} + e^{-a/2}), k = 1, 2, ..., N_l (neuron index), l = 0, 1, ..., L (layer), and z_{kij}^l |_{i=0} = 1, w_{kij}^l |_{i=0, j≠0} = 0, c_{ki}^l |_{i=0} = 1 (bias). \n\nFor comparison purposes, we have used the TDNN (Time Delay Neural Network) architecture[2], the Back-Tsoi FIR[3] and IIR MLP architectures (Back and Tsoi, 1991a), where every synapse contains an FIR or IIR filter and a gain term, and the local approximation algorithm used by Casdagli (k-NN LA) (Casdagli, 1991)[4]. The Gamma MLP is a special case of the IIR MLP. \n\n3 TASK \n\n3.1 MOTIVATION \n\nAccurate speech recognition requires models which can account for a high degree of variability in the data. Large amounts of data may be available, but it may be impractical to use all of the information in standard neural network models. \n\nHypothesis: As the complexity of a problem increases (higher dimensionality, greater variety of training data), the error surface of a neural network becomes more complex. It may contain a number of local minima[5], many of which may be much worse than the global minimum. 
The training (parameter estimation) algorithms become \"stuck\" in local minima which may be increasingly poor compared to the global optimum. The problem suffers from the so-called \"curse of dimensionality\" and the difficulty in optimizing a function with limited control over the nature of the error surface. \n\nWe can identify two main reasons why the application of the Gamma MLP may be superior to the standard TDNN for speech recognition: a) the gamma filtering operation allows consideration of the input data using different time resolutions and can account for more past history of the signal, which can only be accounted for in an FIR or TDNN system by increasing the dimensionality of the model, and b) the low pass filtering nature of the gamma filter may create a smoother function approximation task, and therefore a smoother error surface for gradient descent[6]. \n\n[2] We use TDNN to refer to an MLP with a time window of inputs, not the replicated architecture introduced by Lang (Lang et al., 1990). \n[3] We distinguish the Back-Tsoi FIR network from the Wan FIR network in that the Wan architecture has no synaptic gains, and the update algorithms are different. The Back-Tsoi update algorithm has provided better convergence in previous experiments. \n[4] Casdagli created an affine model of the following form for each test pattern: y^j = a_0 + Σ_{i=1}^{n} a_i x_i^j, where k is the number of neighbors, j = 1, ..., k, and n is the input dimension. The resulting model is used to find y for the test pattern. \n[5] We note that it can be difficult to distinguish a true local minimum from a long plateau in the standard backpropagation algorithm. \n\n3.2 TASK DETAILS \n\n[Figure 2: panels show the Model Input Window, the Target Function, Network Outputs 1 and 2, and the resulting Classification for an example sequence.] 
\n\nFigure 2: PLP input data format and the corresponding network target functions for the phoneme \"aa\" (the horizontal axis spans frames of RASTA data up to the sequence end). \n\nOur data consists of phonemes extracted from the TIMIT database and organized as a number of sequences as shown in figure 2 (example for the phoneme \"aa\"). One model is trained for each phoneme. Note that the phonemes are classified in context, with a number of different contexts, and that the surrounding phonemes are labelled only as not belonging to the target phoneme class. Raw speech data was pre-processed into a sequence of frames using the RASTA-PLP v2.0 software[7]. We used the default options for PLP analysis. The analysis window (frame) was 20 ms. Each succeeding frame overlaps with the preceding frame by 10 ms. 9 PLP coefficients together with the signal power are extracted and used as features describing each frame of data. Phonemes used in the current tests were the vowel \"aa\" and the fricative \"s\". The phonemes were extracted from speakers coming from the same demographic region in the TIMIT database. Multiple speakers were used, and the speakers used in the test set were not contained in the training set. The training set contained 4000 frames, where each phoneme is roughly 10 frames. The test set contained 2000 frames, and an additional validation set containing 2000 frames was used to control generalization. \n\n[6] If we consider a very simple network and derive the relationship of the smoothness of the required function approximation to the smoothness of the error surface, this statement appears to be valid. However, it is difficult to show a direct relationship for general networks. \n[7] Obtained from ftp://ftp.icsi.berkeley.edu/pub/speech/rasta2.0.tar.Z. 
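The pre-processing above segments the waveform into 20 ms analysis frames, each overlapping the preceding frame by 10 ms. A minimal sketch of such framing (the 16 kHz sample rate and the function name are our own assumptions for illustration; they are not part of the RASTA-PLP software):

```python
# Illustrative segmentation of raw samples into overlapping analysis
# frames: 20 ms windows, each new frame advancing by a 10 ms hop.
def frame_signal(samples, sample_rate, frame_ms=20, hop_ms=10):
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

# 100 ms of audio at an assumed 16 kHz yields 9 overlapping 20 ms frames.
frames = frame_signal([0.0] * 1600, sample_rate=16000)
```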
\n\n4 RESULTS \n\nTwo outputs were used in the neural networks, as shown by the target functions in figure 2, corresponding to the phoneme being present or not. A confidence criterion was used: y_max × (y_max - y_min) (for softmax outputs). The initial learning rate was 0.1, 10 hidden nodes were used, FIR and Gamma orders were 5 (6 taps), the TDNN and k-NN models had an input window of 6 steps in time, the tanh activation function was used, target outputs were scaled between -0.8 and 0.8, stochastic update was used, and initial weights were chosen from a set of candidates based on training set performance. The learning rate was varied over time according to the schedule η = η_0 / (n/2 + max(1, c_1 (n - c_2 N) / ((1 - c_2) N))), where η = learning rate, η_0 = initial learning rate, N = total epochs, n = current epoch, c_1 = 50, c_2 = 0.65. This is similar to the schedule proposed in (Darken and Moody, 1991), with an additional term to decrease the learning rate towards zero over the final epochs[8]. 
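As we read it, the schedule combines a Darken-Moody style decay with a term that accelerates the decay once the current epoch n passes c_2 N; the sketch below uses the constants from the text, but the exact functional form is our reconstruction:

```python
# Sketch of the learning rate schedule: a 1/n style decay plus a term
# that pushes the rate towards zero over the final (1 - c2)*N epochs.
# The exact functional form is our reconstruction from the text.
def learning_rate(n, N, eta0=0.1, c1=50.0, c2=0.65):
    final_term = max(1.0, c1 * (n - c2 * N) / ((1.0 - c2) * N))
    return eta0 / (n / 2.0 + final_term)

# Starts at eta0 and decreases monotonically; final_term only exceeds 1
# during the last (1 - c2) fraction of the N epochs.
rates = [learning_rate(n, N=100) for n in range(100)]
```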
\n\nmax 1,(cI-\n\n(I C2)N \n\n( \n\nI Train Error % I 2-NN I 5-NN I 1st layer \n0.43 \n0 .39 \n\nGamma MLP \n\nFIR MLP \n\n17.6 \n7.78 \n\nTDNN \n\nk-NN LA \n\n0 \n\n0 \n\nI Test Error % \nFIR MLP \n\nGamma MLP \n\nTDNN \n\nI 2-NN I 5-NN l i s t layer \n\n22.2 \n14 .7 \n\n0.97 \n0.16 \n\nk-NN LA \n\n31 \n\n28.4 \n\nI Test False +ve I 2-NN I 5-NN l i s t layer \n\nFIR MLP \n\nGamma MLP \n\nTDNN \n\nk-NN LA \n\n22.6 \n\n17.4 \n\n13.5 \n7 .94 \n\n0 .67 \n0.45 \n\nI Test False -ve I 2-NN I 5-NN l i s t layer \n2 .6 \n1.2 \n\nGamma MLP \n\nFIR MLP \n\n44.9 \n32 .2 \n\nTDNN \n\nk-NN LA \n\n53 \n\n56.8 \n\nI All layers \n1.5 \n0 .88 \n\n14.5 \n5.73 \n\nI All layers \n0 .61 \n0 .33 \n\n20.4 \n13.5 \n\nI All layers \n2.0 \n0.47 \n\n11.4 \n7 .01 \n\nI Gains 1st layer I Gains all layers \n\n, \n\n, \n\n27.2 \n6 .07 \n\n0 .59 \n0 .12 \n\n40.9 \n5.63 \n14.4 \n\n19.8 \n1.68 \n0.86 \n\nI Gams 1st layer I Gams all layers I \n\n, \n\n, \n\n29 \n12.8 \n\n0.14 \n1.0 \n\n41 \n12.7 \n24.5 \n\n21 \n0.50 \n0 .68 \n\nI Gams 1st layer I Gams all layers I \n\n, \n\n, \n\n4.5 \n6.83 \n\n0 .77 \n0 .34 \n\n31.3 \n8.05 \n13 \n\n49.0 \n1.8 \n0 .27 \n\nI All layers \n5 .6 \n2.2 \n\n44.1 \n30.4 \n\nI Gams 1st layer I Gams all layers I \n\n, \n\n, \n\n92.9 \n28.4 \n\n2.4 \n2 .8 \n\n66.4 \n24.7 \n54.6 \n\n53 \n4.4 \n1.8 \n\nTable 1: Results comparing the architectures and the use of filters in all layers and \nsynaptic gains for the FIR and Gamma MLP models. The NMSE is followed by the \nstandard deviation. The TDNN results are listed under an arbitrary column heading \n(gains and 1st layer/alilayers does not apply). \n\nThe results of the simulations are shown in table 19 . Each result represents an \naverage over four simulations with different random seeds - the standard deviation \nof the four individual results is also shown. The FIR and Gamma MLP networks \nhave been tested both with and without synaptic gains, and with and without \nfilters in the output layer synapses. 
These results are for the models trained on the \"s\" phoneme; results for the \"aa\" phoneme exhibit the same trend. \"Test false negative\" is probably the most important result here, and is shown graphically in figure 3. This is the percentage of times a true classification (i.e. the current phoneme is present) is incorrectly reported as false. From the table we can see that the Gamma MLP performs significantly better than the FIR MLP or standard TDNN models for this problem. Synaptic gains and gamma filters in all layers improve the performance of the Gamma MLP, while the inclusion of synaptic gains presented difficulty for the FIR MLP. Results for the IIR MLP are not shown - we have been unable to obtain significant convergence[10]. We investigated values of k not listed in the table for the k-NN LA model, but it performed poorly in all cases. \n\n[8] Without this term we have encountered considerable parameter fluctuation over the last epoch. \n[9] NMSE = Σ_{k=1}^{N} (d(k) - y(k))^2 / Σ_{k=1}^{N} (d(k) - d̄)^2, where d̄ = (1/N) Σ_{k=1}^{N} d(k). \n\nFigure 3: Percentage of false negative classifications on the test set. NG=No gains, G=Gains, 1L=filters in the first layer only, AL=filters in all layers. The error bars show plus and minus one standard deviation. The synaptic gains case for the FIR MLP is not shown as the poor performance compresses the remainder of the graph. Top to bottom, the lines correspond to: k-NN LA (left), TDNN, FIR MLP, and Gamma MLP. \n\n5 CONCLUSIONS \n\nWe have defined a Gamma MLP as an MLP with gamma filters and gain terms in every synapse. 
We have shown that the model performs significantly better on our speech phoneme recognition problem when compared to TDNN, Back-Tsoi FIR and IIR MLP architectures, and Casdagli's local approximation model. The percentage of times a phoneme is present but not recognized for the Gamma MLP was 44% lower than the closest competitor, the Back-Tsoi FIR MLP model. \n\nThe inclusion of gamma filters in all layers and the inclusion of synaptic gains improved the performance of the Gamma MLP. The improvement due to the inclusion of synaptic gains may be considered non-intuitive to many - we are adding degrees of freedom, but no additional representational power. The error surface will be different in each case, and the results indicate that the surface for the synaptic gains case is more amenable to gradient descent. One view of the situation is given by Back & Tsoi with their FIR and IIR MLP networks (Back and Tsoi, 1991b): from a signal processing perspective, the response of each synapse is determined by pole-zero positions. With no synaptic gains, the weights determine both the static gain and the pole-zero positions of the synapses. In an experimental analysis performed by Back & Tsoi, it was observed that some synapses devoted themselves to modelling the dynamics of the system in question, while others \"sacrificed\" themselves to provide the necessary static gains[11] to construct the required nonlinearity. \n\n[10] Theoretically, the IIR MLP model is the most powerful model used here. Though it is prone to stability problems, the stability of the model can be, and was, controlled in the simulations performed here (basically, by reflecting poles that move outside the unit circle back inside). The most obvious hypothesis for the difficulty in training the model is related to the error surface and the nature of gradient descent. We expect the error surface to be considerably more complex for the IIR MLP model, and for gradient descent update to experience increased difficulty optimizing the function. \n\nAPPENDIX A: GAMMA MLP UPDATE EQUATIONS \n\nΔw_{kij}^l(t) = -η ∂J(t)/∂w_{kij}^l(t) = η δ_k^l(t) c_{ki}^l(t) z_{kij}^l(t)   (2) \nΔc_{ki}^l(t) = η δ_k^l(t) Σ_{j=0}^{K} w_{kij}^l(t) z_{kij}^l(t)   (3) \nΔμ_{ki}^l(t) = η δ_k^l(t) c_{ki}^l(t) Σ_{j=0}^{K} w_{kij}^l(t) α_{kij}^l(t)   (4) \nα_{kij}^l(t) = ∂z_{kij}^l(t)/∂μ_{ki}^l = 0 for j = 0; α_{kij}^l(t) = (1 - μ_{ki}^l(t)) α_{kij}^l(t-1) + μ_{ki}^l(t) α_{ki(j-1)}^l(t-1) + z_{ki(j-1)}^l(t-1) - z_{kij}^l(t-1) for 1 ≤ j ≤ K   (5) \nβ_{kij}^l(t) = 1 for j = 0; β_{kij}^l(t) = (1 - μ_{ki}^l(t)) β_{kij}^l(t-1) + μ_{ki}^l(t) β_{ki(j-1)}^l(t-1) for 1 ≤ j ≤ K   (6) \nδ_k^l(t) = (d_k(t) - y_k^L(t)) f'(x_k^L(t)) for l = L; δ_k^l(t) = f'(x_k^l(t)) Σ_m δ_m^{l+1}(t) c_{mk}^{l+1}(t) Σ_{j=0}^{K} w_{mkj}^{l+1}(t) β_{mkj}^{l+1}(t) for l < L   (7) \n\nAcknowledgments \n\nThis work has been partially supported by the Australian Research Council (ACT and ADB) and the Australian Telecommunications and Electronics Research Board (SL). \n\nReferences \n\nBack, A. and Tsoi, A. (1991a). FIR and IIR synapses, a new neural network architecture for time series modelling. Neural Computation, 3(3):337-350. \nBack, A. D. and Tsoi, A. C. (1991b). Analysis of hidden layer weights in a dynamic locally recurrent network. In Simula, O., editor, Proceedings International Conference on Artificial Neural Networks, ICANN-91, volume 1, pages 967-976, Espoo, Finland. \nCasdagli, M. (1991). Chaos and deterministic versus stochastic non-linear modelling. J.R. Statistical Society B, 54(2):302-328. \nDarken, C. and Moody, J. (1991). Note on learning rate schedules for stochastic optimization. In Neural Information Processing Systems 3, pages 832-838. Morgan Kaufmann. \nde Vries, B. and Principe, J. (1992). The gamma model - a new neural network for temporal processing. Neural Networks, 5(4):565-576. \nLang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43. \nShynk, J. (1989). Adaptive IIR filtering. IEEE ASSP Magazine, pages 4-21. 
\n\n[11] The neurons were observed to have gone into saturation, providing a constant output.", "award": [], "sourceid": 1021, "authors": [{"given_name": "Steve", "family_name": "Lawrence", "institution": null}, {"given_name": "Ah", "family_name": "Tsoi", "institution": null}, {"given_name": "Andrew", "family_name": "Back", "institution": null}]}