{"title": "Hybrid NN/HMM-Based Speech Recognition with a Discriminant Neural Feature Extraction", "book": "Advances in Neural Information Processing Systems", "page_first": 763, "page_last": 769, "abstract": "", "full_text": "Hybrid NN/HMM-Based Speech Recognition with a Discriminant Neural Feature Extraction \n\nDaniel Willett, Gerhard Rigoll \n\nDepartment of Computer Science \nFaculty of Electrical Engineering \nGerhard-Mercator-University Duisburg, Germany \n{willett,rigoll}@fb9-ti.uni-duisburg.de \n\nAbstract \n\nIn this paper, we present a novel hybrid architecture for continuous speech recognition systems. It consists of a continuous HMM system extended by an arbitrary neural network that is used as a preprocessor taking several frames of the feature vector as input to produce more discriminative feature vectors with respect to the underlying HMM system. This hybrid system is an extension of a state-of-the-art continuous HMM system and is, in fact, the first hybrid system that is really capable of outperforming these standard systems with respect to recognition accuracy. Experimental results show a relative error reduction of about 10% achieved on a remarkably good recognition system based on continuous HMMs for the Resource Management 1000-word continuous speech recognition task. \n\n1 INTRODUCTION \n\nStandard state-of-the-art speech recognition systems utilize Hidden Markov Models (HMMs) to model the acoustic behavior of basic speech units like phones or words. Most commonly the probabilistic distribution functions are modeled as mixtures of Gaussian distributions. These mixture distributions can be regarded as output nodes of a Radial-Basis-Function (RBF) network that is embedded in the HMM system [1]. 
Contrary to neural training procedures, the parameters of the HMM system, including the RBF network, are usually estimated to maximize the training observations' likelihood. In order to combine the time-warping abilities of HMMs and the more discriminative power of neural networks, several hybrid approaches that combine HMM systems and neural networks have arisen during the past five years. The best known approach is the one proposed by Bourlard [2]. It replaces the HMMs' RBF net with a Multi-Layer Perceptron (MLP) which is trained to output each HMM state's posterior probability. At last year's NIPS our group presented a novel hybrid speech recognition approach that combines a discrete HMM speech recognition system and a neural quantizer [3]. By maximizing the mutual information between the VQ-labels and the assigned phoneme classes, this approach outperforms standard discrete recognition systems. We showed that this approach is capable of building up very accurate systems with an extremely fast likelihood computation that only consists of a quantization and a table lookup. This resulted in a hybrid system with recognition performance equivalent to the best continuous systems, but with a much faster decoding. Nevertheless, it has turned out that this hybrid approach is not really capable of substantially outperforming very good continuous systems with respect to the recognition accuracy. This observation is similar to experiences with Bourlard's MLP approach. \n\n[Figure 1: Architecture of the hybrid NN/HMM system. The feature extraction produces the frames x(t-P), ..., x(t), ..., x(t+F); a neural network (linear transformation, MLP or recurrent MLP) maps these frames to x'(t), which is fed into the HMM system (RBF network) to yield the state likelihoods p(x(t)|w_k).] \n\n
For the decoding procedure, this architecture offers a very efficient pruning technique (phone deactivation pruning [4]) that is much more efficient than pruning on likelihoods, but until today this approach did not outperform standard continuous HMM systems in recognition performance. \n\n2 HYBRID CONTINUOUS HMM/MLP APPROACH \n\nTherefore, we followed a different approach, namely the extension of a state-of-the-art continuous system that achieves extremely good recognition rates with a neural net that is trained with MMI methods related to those in [5]. The major difference in this approach is the fact that the acoustic processor is not replaced by a neural network, but that the Gaussian probability density component is retained and combined with a neural component in an appropriate manner. A similar approach was presented in [6] to improve a speech recognition system for the TIMIT database. We propose to regard the additional neural component as being part of the feature extraction, and to reuse it in recognition systems of higher complexity where discriminative training is extremely expensive. \n\n2.1 ARCHITECTURE \n\nThe basic architecture of this hybrid system is illustrated in Figure 1. The neural net functions as a feature transformation that takes several additional past and future feature vectors into account to produce an improved, more discriminant feature vector that is fed into the HMM system. This architecture allows (at least) three ways of interpretation: 1. as a hybrid system that combines neural nets and continuous HMMs, 2. as an LDA-like transformation that incorporates the HMM parameters into the calculation of the transformation matrix, and 3. as a feature extraction method that allows the extraction of features according to the underlying HMM system. 
The considered types of neural networks are linear transformations, MLPs and recurrent MLPs. A detailed description of the possible topologies is given in Section 3. \nWith this architecture, additional past and future feature vectors can be taken into account in the probability estimation process without increasing the dimensionality of the Gaussian mixture components. Instead of increasing the HMM system's number of parameters, the neural net is trained to produce more discriminant feature vectors with respect to the trained HMM system. Of course, adding some kind of neural net increases the number of parameters too, but the increase is much more moderate than it would be when increasing each Gaussian's dimensionality. \n\n2.2 TRAINING OBJECTIVE \n\nThe original purpose of this approach was the intention to transfer the hybrid approach presented in [3], based on MMI neural networks, to (semi-)continuous systems. This way, we hoped to achieve the same remarkable improvements that we obtained on discrete systems now on continuous systems, which are the much better and more flexible baseline systems. The most natural way to do this would be the re-estimation of the codebook of Gaussian mean vectors of a semi-continuous system using the neural MMI training algorithm presented in [3]. Unfortunately though, this won't work, as this codebook of a semi-continuous system does not determine a separation of the feature space, but is used as the means of Gaussian densities. The MMI principle can be retained, however, by leaving the original HMM system unmodified and instead extending it with a neural component, trained according to a frame-based MMI approach related to the one in [3]. 
The MMI criterion is usually formulated in the following way: \n\nλ_MMI = argmax_λ I_λ(X, W) = argmax_λ (H_λ(X) − H_λ(X|W)) = argmax_λ p_λ(X|W) / p_λ(X)   (1) \n\nThis means that following the MMI criterion the system's free parameters λ have to be estimated to maximize the quotient of the observations' likelihood p_λ(X|W) for the known transcription W and their overall likelihood p_λ(X). With X = (x(1), x(2), ... x(T)) denoting the training observations and W = (w(1), w(2), ... w(T)) denoting the HMM states assigned to the observation vectors in a Viterbi alignment, the frame-based MMI criterion becomes \n\nλ_MMI ≈ argmax_λ Σ_{i=1..T} I_λ(x(i), w(i)) = argmax_λ Π_{i=1..T} p_λ(x(i)|w(i)) / p_λ(x(i)) ≈ argmax_λ Π_{i=1..T} p_λ(x(i)|w(i)) / Σ_{k=1..S} p_λ(x(i)|w_k) p(w_k)   (2) \n\nwhere S is the total number of HMM states, (w_1, ..., w_S) denotes the HMM states and p(w_k) denotes each state's prior probability, which is estimated on the alignment of the training data or by an analysis of the language model. \nEq. 2 can be used to re-estimate the Gaussians of a continuous HMM system directly. In [7] we reported the slight improvements in recognition accuracy that we achieved with this parameter estimation. However, it turned out that only the incorporation of additional features in the probability calculation pipeline can provide more discriminative emission probabilities and a major advance in recognition accuracy. Thus, we found it more convenient to train an additional neural net in order to maximize Eq. 2. Besides, this approach offers the possibility of improving a recognition system by applying a trained feature extraction network taken from a different system. Section 5 will report our positive experiences with this procedure. 
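As an illustration of the frame-based criterion of Eq. 2, the following sketch computes its logarithm from per-frame state-conditional likelihoods and state priors. The function name, array shapes and toy numbers are our own assumptions for illustration, not material from the paper:

```python
import numpy as np

def frame_mmi_score(lik, priors, states):
    # lik:    (T, S) state-conditional likelihoods p(x(i)|w_k)
    # priors: (S,)   state priors p(w_k)
    # states: (T,)   Viterbi-aligned state indices w(i)
    # Returns sum_i [ log p(x(i)|w(i)) - log sum_k p(x(i)|w_k) p(w_k) ].
    num = np.log(lik[np.arange(len(states)), states])
    den = np.log(lik @ priors)
    return float(np.sum(num - den))

# Toy example: two frames, three HMM states.
lik = np.array([[0.6, 0.3, 0.1],
                [0.2, 0.7, 0.1]])
priors = np.array([0.5, 0.3, 0.2])
states = np.array([0, 1])
score = frame_mmi_score(lik, priors, states)
```

A gradient-based trainer would maximize this quantity with respect to the parameters of the feature extraction network.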
\nAt first, for the sake of simplicity, we will consider a linear network that takes P past feature vectors and F future feature vectors as additional input. With the linear net denoted as a ((P + F + 1)·N) × N matrix NET, each component x'(t)[c] of the network output x'(t) computes to \n\nx'(t)[c] = Σ_{i=0..P+F} Σ_{j=1..N} x(t − P + i)[j] · NET[i·N + j][c]   ∀c ∈ {1..N}   (3) \n\nso that the derivative with respect to a component of NET easily computes to \n\n∂x'(t)[c] / ∂NET[i·N + j][ĉ] = δ_{c,ĉ} · x(t − P + i)[j]   (4) \n\nIn a continuous HMM system with diagonal covariance matrices the pdf of each HMM state w is modeled by a mixture of Gaussian components like \n\np_λ(x|w) = Σ_{j=1..C_w} d_{wj} · (2π)^{−N/2} |Σ_j|^{−1/2} · exp(−(1/2) Σ_{l=1..N} (m_j[l] − x[l])² / σ_j²[l])   (5) \n\nA pdf's derivative with respect to a component x'[c] of the net's output becomes \n\n∂p_λ(x'|w) / ∂x'[c] = Σ_{j=1..C_w} d_{wj} · (2π)^{−N/2} |Σ_j|^{−1/2} · exp(−(1/2) Σ_{l=1..N} (m_j[l] − x'[l])² / σ_j²[l]) · (m_j[c] − x'[c]) / σ_j²[c]   (6) \n\nWith x(i) in Eq. 2 now replaced by the net output x'(i), the partial derivative of Eq. 2 with respect to a probabilistic distribution function p_λ(x'(i)|w_k) computes to \n\n∂I_λ(x'(i), w(i)) / ∂p_λ(x'(i)|w_k) = δ_{w(i),w_k} / p_λ(x'(i)|w_k) − p(w_k) / Σ_{l=1..S} p_λ(x'(i)|w_l) p(w_l)   (7) \n\nThus, using the chain rule, the derivative of the frame-based MMI criterion with respect to the net's parameters can be computed as displayed in Eq. 8 \n\n∂I_λ(X, W) / ∂NET[l][c] = Σ_{i=1..T} Σ_{k=1..S} (∂I_λ(x'(i), w(i)) / ∂p_λ(x'(i)|w_k)) · (∂p_λ(x'(i)|w_k) / ∂x'(i)[c]) · (∂x'(i)[c] / ∂NET[l][c])   (8) \n\nand a gradient descent procedure can be used to determine the optimal parameter estimates. 
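The linear case of Eq. 3 amounts to stacking the context window into one vector and multiplying by NET. A minimal sketch follows; the function name, shapes and the identity-copy example are our own illustrative assumptions, not the authors' code:

```python
import numpy as np

def transform(frames, NET, t, P, F):
    # Stack x(t-P), ..., x(t+F) and apply the linear net:
    # x'(t)[c] = sum_{i,j} x(t-P+i)[j] * NET[i*N+j][c]   (cf. Eq. 3)
    window = np.concatenate([frames[t - P + i] for i in range(P + F + 1)])
    return window @ NET

N, P, F, T = 4, 1, 1, 5
rng = np.random.default_rng(0)
frames = rng.normal(size=(T, N))
# A NET that simply copies the center frame: an identity block at row offset P*N.
NET = np.eye((P + F + 1) * N, N, k=-P * N)
x_prime = transform(frames, NET, 2, P, F)  # equals frames[2]
```

Initializing NET near such an identity block matches the remark that the transformation should only slightly modify the input vectors.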
\n\n2.3 ADVANTAGES OF THE PROPOSED APPROACH \n\nWhen using a linear network, the proposed approach strongly resembles the well-known Linear Discriminant Analysis (LDA) [8] in architecture and training objective. The main difference is the way the transformation is set up. In the proposed approach the transformation is computed by taking the HMM parameters directly into account, whereas the LDA only tries to separate the features according to some class assignment. With the incorporation of a trained continuous HMM system, the net's parameters are estimated to produce feature vectors that not only have a good separability in general, but also have a distribution that can be modeled very well with mixtures of Gaussians. Our experiments given at the end of this paper prove this advantage. Furthermore, contrary to LDA, which produces feature vectors that don't have much in common with the original vectors, the proposed approach only slightly modifies the input vectors. Thus, a well trained continuous system can be extended by the MMI-net approach in order to improve its recognition performance without the need for completely rebuilding it. In addition to that, the approach offers a fairly easy extension to nonlinear networks (MLP) and recurrent networks (recurrent MLP). This will be outlined in the following section. And, maybe as the major advantage, the approach allows keeping up the division of the input features into streams of features that are largely uncorrelated and which are modeled with separate pdfs. The case of multiple streams is discussed in detail in Section 4. Besides, the MMI approach offers the possibility of a unified training of the HMM system and the feature extraction network, or an iterative procedure of training each part alternately. 
\n\n3 NETWORK TOPOLOGIES \n\nSection 2 explained how to train a linear transformation with respect to the frame-based MMI criterion. However, to exploit all the advantages of the proposed hybrid approach, the network should be able to perform a nonlinear mapping, in order to produce features whose distribution is (closer to) a mixture of Gaussians although the original distribution is not. \n\n3.1 MLP \n\nWhen using a fully connected MLP as displayed in Figure 2, with one hidden layer of H nodes that perform the nonlinear function f, the activation of one of the output nodes x'(t)[c] becomes \n\nx'(t)[c] = Σ_{h=1..H} L2[h][c] · f(BIAS_h + Σ_{i=0..P+F} Σ_{j=1..N} x(t − P + i)[j] · L1[i·N + j][h])   (9) \n\n[Figure 2: Hybrid system with a nonlinear feature transformation. The original features (multiple frames x(t−1), x(t), x(t+1)) pass through the layers L1 and L2 to the transformed features x'(t), which feed the RBF network of a context-dependent Hidden Markov Model; dashed lines indicate the recurrent connections.] \n\nwhich is easily differentiable with respect to the nonlinear network's parameters. In our experiments we chose f to be the hyperbolic tangent f(x) := tanh(x) = 2(1 + e^{−2x})^{−1} − 1, so that the partial derivative with respect to, e.g., a weight L1[i·N + j][h] of the first layer computes to \n\n∂x'(t)[c] / ∂L1[i·N + j][h] = x(t − P + i)[j] · L2[h][c] · cosh(BIAS_h + Σ_{i'=0..P+F} Σ_{j'=1..N} x(t − P + i')[j'] · L1[i'·N + j'][h])^{−2}   (10) \n\nand the gradient can be set up according to Eq. 8. 
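To make the tanh-MLP derivative concrete, here is a small numerical sketch of the forward map of Eq. 9 together with a finite-difference check of the first-layer derivative of Eq. 10. Shapes, seeds and function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def mlp_out(window, L1, L2, bias):
    # x'(t)[c] = sum_h L2[h][c] * tanh(bias[h] + sum_m window[m] * L1[m][h])
    return L2.T @ np.tanh(bias + L1.T @ window)

def d_out_d_L1(window, L1, L2, bias, m, h, c):
    # Analytic derivative: window[m] * L2[h][c] / cosh(pre-activation[h])**2
    pre = bias + L1.T @ window
    return window[m] * L2[h, c] / np.cosh(pre[h]) ** 2

rng = np.random.default_rng(0)
M, H, C = 6, 3, 2          # stacked input size, hidden nodes, output size
window = rng.normal(size=M)
L1 = rng.normal(size=(M, H))
L2 = rng.normal(size=(H, C))
bias = rng.normal(size=H)

# Central finite difference on L1[1, 0] for output component 0.
eps = 1e-6
Lp, Lm = L1.copy(), L1.copy()
Lp[1, 0] += eps
Lm[1, 0] -= eps
numeric = (mlp_out(window, Lp, L2, bias)[0]
           - mlp_out(window, Lm, L2, bias)[0]) / (2 * eps)
analytic = d_out_d_L1(window, L1, L2, bias, 1, 0, 0)
```

The two values agree to finite-difference accuracy, confirming the 1/cosh² factor coming from the tanh nonlinearity.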
\n\n3.2 RECURRENT MLP \n\nWith the incorporation of several additional past feature vectors as explained in Section 2, more discriminant feature vectors can be generated. However, this method is not capable of modeling longer-term relations, as can be achieved by extending the network with some recurrent connections. For the sake of simplicity, in our experiments we simply extended the MLP as indicated with the dashed lines in Figure 2 by propagating the output x'(t) back to the input of the network (with a delay of one discrete time step). This type of recurrent neural net is often referred to as a 'Jordan' network. Certainly, the extension of the network with additional hidden nodes in order to model the recurrence more independently would be possible as well. \n\n4 MULTI-STREAM SYSTEMS \n\nIn HMM-based recognition systems the extracted features are often divided into streams that are modeled independently. This is the more useful, the less correlated the divided features are. In this case the overall likelihood of an observation computes to \n\np_λ(x|w) = Π_{s=1..M} p_{sλ}(x|w)^{w_s}   (11) \n\nwhere each of the stream pdfs p_{sλ}(x|w) only uses a subset of the features in x. The stream weights w_s are usually set to unity. \n\nTable 1: Word error rates achieved in the experiments \n\nA multi-stream system can be improved by a neural extraction for each stream and an independent training of these neural networks. However, it has to be considered that the subdivided features usually are not totally independent, and by considering multiple input frames as illustrated in Figure 1 this dependence often increases. It is a common practice, for instance, to model the features' first and second order delta coefficients in independent streams. So the streams certainly lose independence when considering multiple frames, as these coefficients are calculated using the additional frames. 
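The stream factorization of Eq. 11 is a simple weighted product of the per-stream pdf values. A toy sketch with four streams and unit weights (values and names are our own illustration):

```python
import numpy as np

def multi_stream_likelihood(stream_liks, weights):
    # p(x|w) = prod_s p_s(x|w) ** w_s   (cf. Eq. 11)
    return float(np.prod(np.asarray(stream_liks) ** np.asarray(weights)))

# e.g. cepstra, first and second order deltas, and an energy stream
lik = multi_stream_likelihood([0.5, 0.2, 0.4, 0.8], [1.0, 1.0, 1.0, 1.0])
```

With all weights at unity, this reduces to a plain product of the stream likelihoods.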
Nevertheless, we found it to give the best results to maintain this subdivision into streams, but to account for the stronger correlation by training each stream's net dependent on the other nets' outputs. A training criterion follows straight from Eq. 11 inserted in Eq. 2: \n\nλ_MMI = argmax_λ Π_{i=1..T} p_λ(x(i)|w(i)) / p_λ(x(i)) = argmax_λ Π_{i=1..T} Π_{s=1..M} (p_{sλ}(x(i)|w(i)) / p_{sλ}(x(i)))^{w_s}   (12) \n\nThe derivative of this equation with respect to the pdf p_{sλ}(x|w) of a specific stream s depends on the other streams' pdfs. With the w_s set to unity it is \n\n∂I_λ(x'(i), w(i)) / ∂p_{sλ}(x'(i)|w_k) = (Π_{s'≠s} p_{s'λ}(x'(i)|w(i)) / p_{s'λ}(x'(i))) · (δ_{w(i),w_k} / p_{sλ}(x'(i)|w_k) − p(w_k) / Σ_{l=1..S} p_{sλ}(x'(i)|w_l) p(w_l))   (13) \n\nNeglecting the correlation among the streams, the training of each stream's net can be done independently. However, the more the incorporation of additional features increases the streams' correlation, the more important it gets to train the nets in a unified training procedure according to Eq. 13. \n\n5 EXPERIMENTS AND RESULTS \n\nWe applied the proposed approach to improve a context-independent (monophones) and a context-dependent (triphones) continuous speech recognition system for the 1000-word Resource Management (RM) task. The systems used linear HMMs of three emitting states each. The tying of Gaussian mixture components was performed with an adaptive procedure according to [9]. The HMM states of the word-internal triphone system were clustered in a tree-based phonetic clustering procedure. Decoding was performed with a Viterbi decoder and the standard word-pair grammar of perplexity 60. Training of the MLP was performed with the RPROP algorithm. For training the weights of the recurrent connections we chose real-time recurrent learning. 
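In log form, the per-frame term of the multi-stream criterion of Eq. 12 is just the weighted sum of the stream quotients. A sketch under our own toy values (names, shapes and numbers are assumptions, not from the paper):

```python
import numpy as np

def multi_stream_frame_mmi(stream_liks, priors, state, weights):
    # log prod_s (p_s(x|w(i)) / p_s(x)) ** w_s
    # with p_s(x) = sum_k p_s(x|w_k) p(w_k)   (cf. Eq. 12)
    total = 0.0
    for lik, w in zip(stream_liks, weights):
        lik = np.asarray(lik)
        total += w * (np.log(lik[state]) - np.log(lik @ np.asarray(priors)))
    return total

# Two streams, two states, aligned state 0, unit stream weights.
term = multi_stream_frame_mmi([[0.6, 0.4], [0.7, 0.3]], [0.5, 0.5], 0,
                              [1.0, 1.0])
```

Summing this term over all frames and ascending its gradient with respect to each stream's net would implement the unified training procedure.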
The average error rates were computed using the test sets Feb89, Oct89, Feb91 and Sep92. \nThe table above shows the recognition results with single-stream systems in its first section. These systems simply use a 12-value cepstrum feature vector without the incorporation of delta coefficients. The systems with an input transformation use one additional past and one additional future feature vector as input. The proposed approach achieves the same performance as the LDA, but it is not capable of outperforming it. \nThe second section of the table lists the recognition results with four-stream systems that use the first and second order delta coefficients in additional streams, plus log energy and this value's delta coefficients in a fourth stream. The MLP system trained according to Eq. 12 slightly outperforms the other approaches. The incorporation of recurrent network connections does not improve the system's performance. \nThe third section of the table lists the recognition results with four-stream systems with a context-dependent acoustic modeling (triphones). The applied LDA and the MMI networks were taken from the monophone four-stream system. On the one hand, this was done to avoid the computational complexity that the MMI training objective causes on context-dependent systems. On the other hand, this demonstrates that the feature vectors produced by the trained networks have a good discrimination for continuous systems in general. Again, the MLP system outperforms the other approaches and achieves a very remarkable word error rate. It should be pointed out here that the structure of the continuous system as reported in [9] is already highly optimized and it is almost impossible to further reduce the error rate by means of any acoustic modeling method. 
This is reflected in the fact that even a standard LDA cannot improve this system. Only the new neural approach leads to a 10% reduction in error rate, which is a large improvement considering the fact that the error rate of the baseline system is among the best ever reported for the RM database. \n\n6 CONCLUSION \n\nThe paper has presented a novel approach to discriminant feature extraction. An MLP network has successfully been used to compute a feature transformation that outputs extremely suitable features for continuous HMM systems. The experimental results have proven that the proposed approach is an appropriate method for including several feature frames in the probability estimation process without increasing the dimensionality of the Gaussian mixture components in the HMM system. Furthermore, the results on the triphone speech recognition system proved that the approach provides discriminant features not only for the system that the mapping is computed on, but for HMM systems with a continuous modeling in general. The application of recurrent networks did not improve the recognition accuracy. The longer-range relations seem to be very weak, and they seem to be covered well by using the neighboring feature vectors and first and second order delta coefficients. The proposed unified training procedure for multiple nets in multi-stream systems allows keeping up the subdivision of features of weak correlations, and gave us the best gains in recognition accuracy. \n\nReferences \n\n[1] H. Ney, \"Speech Recognition in a Neural Network Framework: Discriminative Training of Gaussian Models and Mixture Densities as Radial Basis Functions\", Proc. IEEE ICASSP, 1991, pp. 573-576. \n\n[2] H. Bourlard, N. Morgan, \"Connectionist Speech Recognition - A Hybrid Approach\", Kluwer Academic Publishers, 1994. \n\n[3] G. Rigoll, C. 
Neukirchen, \"A New Approach to Hybrid HMM/ANN Speech Recognition Using Mutual Information Neural Networks\", Advances in Neural Information Processing Systems (NIPS-96), Denver, Dec. 1996, pp. 772-778. \n\n[4] M. M. Hochberg, G. D. Cook, S. J. Renals, A. J. Robinson, A. S. Schechtman, \"The 1994 ABBOT Hybrid Connectionist-HMM Large-Vocabulary Recognition System\", Proc. ARPA Spoken Language Systems Technology Workshop, 1995. \n\n[5] G. Rigoll, \"Maximum Mutual Information Neural Networks for Hybrid Connectionist-HMM Speech Recognition\", IEEE Trans. Speech Audio Processing, Vol. 2, No. 1, Jan. 1994, pp. 175-184. \n\n[6] Y. Bengio et al., \"Global Optimization of a Neural Network - Hidden Markov Model Hybrid\", IEEE Transactions on Neural Networks, Vol. 3, No. 2, 1992, pp. 252-259. \n\n[7] D. Willett, C. Neukirchen, R. Rottland, \"Dictionary-Based Discriminative HMM Parameter Estimation for Continuous Speech Recognition Systems\", Proc. IEEE ICASSP, 1997, pp. 1515-1518. \n\n[8] X. Aubert, R. Haeb-Umbach, H. Ney, \"Continuous Mixture Densities and Linear Discriminant Analysis for Improved Context-Dependent Acoustic Models\", Proc. IEEE ICASSP, 1993, pp. II-648-651. \n\n[9] D. Willett, G. Rigoll, \"A New Approach to Generalized Mixture Tying for Continuous HMM-Based Speech Recognition\", Proc. EUROSPEECH, Rhodes, 1997. \n", "award": [], "sourceid": 1405, "authors": [{"given_name": "Daniel", "family_name": "Willett", "institution": null}, {"given_name": "Gerhard", "family_name": "Rigoll", "institution": null}]}