{"title": "A New Approach to Hybrid HMM/ANN Speech Recognition using Mutual Information Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 772, "page_last": 778, "abstract": null, "full_text": "A New Approach to Hybrid HMM/ANN Speech Recognition Using Mutual Information Neural Networks\n\nG. Rigoll, C. Neukirchen\nGerhard-Mercator-University Duisburg\nFaculty of Electrical Engineering\nDepartment of Computer Science\nBismarckstr. 90, Duisburg, Germany\n\nABSTRACT\n\nThis paper presents a new approach to speech recognition with hybrid HMM/ANN technology. While the standard approach to hybrid HMM/ANN systems uses neural networks as posterior probability estimators, the new approach uses mutual information neural networks. These are trained with a special learning algorithm that maximizes the mutual information between the input classes of the network and its resulting sequence of firing output neurons during training. It is shown in this paper that such a neural network is an optimal neural vector quantizer for a discrete hidden Markov model system trained on Maximum Likelihood principles. One of the main advantages of this approach is that such neural networks can be easily combined with HMM's of any complexity with context-dependent capabilities. It is shown that the resulting hybrid system achieves very high recognition rates, which are already on the same level as the best conventional HMM systems with continuous parameters, and the capabilities of the mutual information neural networks are not yet entirely exploited.\n\n1 INTRODUCTION\n\nHybrid HMM/ANN systems deal with the optimal combination of artificial neural networks (ANN) and hidden Markov models (HMM). 
Especially in the area of automatic speech recognition, it has been shown that hybrid approaches can lead to very powerful and efficient systems, combining the discriminative capabilities of neural networks and the superior dynamic time warping abilities of HMM's. The most popular hybrid approach is described in (Hochberg, 1995) and replaces the component modeling the emission probabilities of the HMM by a neural net. This is possible because it is shown in (Bourlard, 1994) that neural networks can be trained so that the output of the m-th neuron approximates the posterior probability $p(\\Omega_m \\mid x)$. In this paper, an alternative method for constructing a hybrid system is presented. It is based on discrete HMM's which are combined with a neural vector quantizer (VQ) to form a hybrid system. Each speech feature vector is presented to the neural network, which generates a firing neuron in its output layer. This neuron is processed as a VQ label by the HMM's. The arguments for this alternative hybrid approach are as follows:\n\n\u2022 The neural vector quantizer has to be trained on a special information theory criterion, based on the mutual information between network input and resulting neuron firing sequence. It will be shown that such a network is the optimal acoustic processor for a discrete HMM system, resulting in a profound mathematical theory for this approach.\n\n\u2022 Resulting from this theory, a formula can be derived which jointly describes the behavior of the HMM and the neural acoustic processor. In that way, both systems can be described in a unified manner and both major components of the hybrid system can be trained using a unified learning criterion. 
\n\n\u2022 The above-mentioned theoretical background leads to the development of new neural network paradigms using novel training algorithms that have not been used before in other areas of neurocomputing, and therefore represent major challenges and issues in learning and training for neural systems.\n\n\u2022 The neural networks can be easily combined with any HMM system of arbitrary complexity. This leads to the combination of optimally trained neural networks with very powerful HMM's, having all features useful for speech recognition, e.g. triphones, function words, crossword triphones, etc. Context-dependency, which is very desirable but relatively difficult to realize with a pure neural approach, can be left to the HMM's.\n\n\u2022 The resulting hybrid system still has the basic structure of a discrete system, and therefore has all the effective features associated with discrete systems, e.g. quick and easy training as well as recognition procedures, real-time capabilities, etc.\n\n\u2022 The work presented in this paper has also been successfully implemented for a demanding speech recognition problem, the 1000 word speaker-independent continuous Resource Management speech recognition task. For this task, the hybrid system produces one of the best recognition results obtained by any speech recognition system.\n\nIn the following section, the theoretical foundations of the hybrid approach are briefly explained. A unified probabilistic model for the combined HMM/ANN system is derived, describing the interaction of the neural and the HMM component. Furthermore, it is shown that the optimal neural acoustic processor can be obtained from a special information theoretic network training algorithm.\n\n2 INFORMATION THEORY PRINCIPLES FOR NEURAL NETWORK TRAINING\n\nWe now consider a neural network of arbitrary topology used as neural vector quantizer for a discrete HMM system. 
If K patterns are presented to the hybrid system during training, the feature vectors resulting from these patterns using any feature extraction method can be denoted as $x(k)$, $k = 1 \\ldots K$. If these feature vectors are presented to the input layer of a neural network, the network will generate one firing neuron for each presentation. Hence, all K presentations will generate a stream of firing neurons of length K from the output layer of the neural net. This label stream is denoted as $Y = y(1) \\ldots y(K)$. The label stream Y will be presented to the HMM's, which calculate the probability that this stream has been observed while a pattern of a certain class has been presented to the system. It is assumed that M different classes $\\Omega_m$ are active in the system, e.g. the words or phonemes in speech recognition. Each feature vector $x(k)$ will belong to one of these classes. The class $\\Omega_m$ to which feature vector $x(k)$ belongs is denoted as $\\Omega(k)$. The major training issue for the neural network can now be formulated as follows: How should the weights of the network be trained so that the network produces a stream of firing neurons that can be used by the discrete HMM's in an optimal way? It is known that HMM's are usually trained with information theory methods which mostly rely on the Maximum Likelihood (ML) principle. If the parameters of the hybrid system (i.e. transition and emission probabilities and network weights) are summarized in the vector $\\theta$, the probability $p_{\\theta}(x(k) \\mid \\Omega(k))$ denotes the probability of the pattern $x$ at discrete time k, under the assumption that it has been generated by the model representing class $\\Omega(k)$, with parameter set $\\theta$. The ML principle will then try to maximize the joint probability of all presented training patterns $x(k)$, according to the following Maximum Likelihood function:\n\n$$\\theta^* = \\arg\\max_{\\theta} \\left\\{ \\frac{1}{K} \\sum_{k=1}^{K} \\log p_{\\theta}(x(k) \\mid \\Omega(k)) \\right\\} \\tag{1}$$\n\nwhere $\\theta^*$ is the optimal parameter vector maximizing this equation. Our goal is to feed the feature vector $x$ into a neural network and to present the neural network output to the Markov model. Therefore, one has to introduce the neural network output in a suitable manner into the above formula. If the vector $x$ is presented to the network input layer, and we assume that there is a chance that any neuron $y_n$, $n = 1 \\ldots N$ (with network output layer size N) can fire with a certain probability, then the output probability $p(x \\mid \\Omega)$ in (1) can be written as:\n\n$$p(x \\mid \\Omega) = \\sum_{n=1}^{N} p(x, y_n \\mid \\Omega) = \\sum_{n=1}^{N} p(y_n \\mid \\Omega) \\cdot p(x \\mid y_n, \\Omega) \\tag{2}$$\n\nNow, the combination of the neural component with the HMM can be made more obvious: In (2), the probability $p(y_n \\mid \\Omega)$ will typically be described by the Markov model, in terms of the emission probabilities of the HMM. For instance, in continuous parameter HMM's, these probabilities are interpreted as weights for Gaussian mixtures. In the case of semi-continuous systems or discrete HMM's, these probabilities will serve as discrete emission probabilities of the codebook labels. The probability $p(x \\mid y_n, \\Omega)$ describes the acoustic processor of the system and characterizes the relation between the vector $x$ as input to the acoustic processor and the label $y_n$, which can be considered as the n-th output component of the acoustic processor. This n-th output component may characterize e.g. the n-th Gaussian mixture component in continuous parameter HMM's, or the generation of the n-th label of a vector quantizer in a discrete system. This probability is often considered as independent of the class $\\Omega$ and can then be expressed as $p(x \\mid y_n)$. It is exactly this probability that can be modeled efficiently by our neural network. In this case, the vector $x$ serves as input to the neural network and $y_n$ characterizes the n-th neuron in the output layer of the network. 
Using Bayes law, this probability can be written as:\n\n$$p(x \\mid y_n) = \\frac{p(y_n \\mid x) \\cdot p(x)}{p(y_n)} \\tag{3}$$\n\nyielding for (2):\n\n$$p(x \\mid \\Omega) = \\sum_{n=1}^{N} p(y_n \\mid \\Omega) \\cdot \\frac{p(y_n \\mid x) \\cdot p(x)}{p(y_n)} \\tag{4}$$\n\nUsing again Bayes law to express\n\n$$p(y_n \\mid \\Omega) = \\frac{p(\\Omega \\mid y_n) \\cdot p(y_n)}{p(\\Omega)} \\tag{5}$$\n\none obtains from (4):\n\n$$p(x \\mid \\Omega) = \\frac{p(x)}{p(\\Omega)} \\cdot \\sum_{n=1}^{N} p(\\Omega \\mid y_n) \\cdot p(y_n \\mid x) \\tag{6}$$\n\nWe have now modified the class-dependent probability of the feature vector $x$ in a way that allows the incorporation of the probability $p(y_n \\mid x)$. This probability allows a better characterization of the behavior of the neural network, because it describes the probability of the various neurons $y_n$ if the vector $x$ is presented to the network input. Therefore, these probabilities give a good description of the input/output behavior of the neural network. Eq. (6) can therefore be considered as a probabilistic model for the hybrid system, where the neural acoustic processor is characterized by its input/output behavior. Two cases can now be distinguished: In the first case, the neural network is assumed to be a probabilistic paradigm, where each neuron fires with a certain probability if an input vector is presented. In this case all neurons contribute to the information forwarded to the HMM's. As already mentioned, in this paper the second possible case is considered, namely that only one neuron in the output layer fires and is fed as observed label to the HMM. In this case, we have a deterministic decision, and the probability $p(y_n \\mid x)$ describes which neuron $y_{n^*}$ fires if vector $x$ is presented to the input layer. Therefore, this probability reduces to\n\n$$p(y_n \\mid x) = \\begin{cases} 1 & \\text{if } n = n^* \\\\ 0 & \\text{otherwise} \\end{cases} \\tag{7}$$\n\nThen, (6) yields:\n\n$$p(x \\mid \\Omega) = \\frac{p(x)}{p(\\Omega)} \\cdot p(\\Omega \\mid y_{n^*}) \\tag{8}$$\n\nNow, the class-dependent probability $p(x \\mid \\Omega)$ is expressed through the probability $p(\\Omega \\mid y_{n^*})$, directly involving the firing neuron $y_{n^*}$ when feature vector $x$ is presented. 
One now has to turn back to (1), recalling that this equation expresses the fact that the Markov models are trained with the ML criterion. It should also be recalled that the entire sequence of feature vectors, $x(k)$, $k = 1 \\ldots K$, results in a label stream of firing neurons $y_{n^*}(k)$, $k = 1 \\ldots K$, where $y_{n^*}(k)$ is the firing neuron if the k-th vector $x(k)$ is presented to the neural network. Now, (8) can be substituted into (1) for each presentation k, yielding the modified ML criterion:\n\n$$\\theta^* = \\arg\\max_{\\theta} \\left\\{ \\frac{1}{K} \\sum_{k=1}^{K} \\log \\left[ \\frac{p(x(k))}{p(\\Omega(k))} \\cdot p(\\Omega(k) \\mid y_{n^*}(k)) \\right] \\right\\} = \\arg\\max_{\\theta} \\left\\{ \\frac{1}{K} \\sum_{k=1}^{K} \\log p(x(k)) - \\frac{1}{K} \\sum_{k=1}^{K} \\log p(\\Omega(k)) + \\frac{1}{K} \\sum_{k=1}^{K} \\log p(\\Omega(k) \\mid y_{n^*}(k)) \\right\\} \\tag{9}$$\n\nUsually, in a continuous parameter system, the probability $p(x)$ can be expressed as:\n\n$$p(x) = \\sum_{n=1}^{N} p(x \\mid y_n) \\cdot p(y_n) \\tag{10}$$\n\nand is therefore dependent on the parameter vector $\\theta$, because in this case $p(x \\mid y_n)$ can be interpreted as the probability provided by the Gaussian distributions, and the parameters of the Gaussians will depend on $\\theta$. As just mentioned, in a discrete system only one firing neuron $y_{n^*}$ survives, so that only the $n^*$-th member remains in the sum in (10). This would correspond to only one \"firing Gaussian\" in the continuous case, leading to the following expression for $p(x)$:\n\n$$p(x) = p(x \\mid y_{n^*}) \\cdot p(y_{n^*}) = p(x, y_{n^*}) = p(y_{n^*} \\mid x) \\cdot p(x) \\tag{11}$$\n\nConsidering now the fact that the acoustic processor is not represented by a Gaussian but instead by a vector quantizer, where the probability $p(y_{n^*} \\mid x)$ of the firing neuron is equal to 1, (11) reduces to $p(x) = p(x)$, and it becomes obvious that this probability is not affected by any distribution that depends on the parameter vector $\\theta$. This would be different if $p(y_{n^*} \\mid x)$ in (11) did not have binary characteristics as in (7), but were computed by a continuous function, which in this case would depend on the parameter vector $\\theta$. 
Thus, without consideration of $p(x)$, the remaining expression to be maximized in (9) reduces to:\n\n$$\\theta^* = \\arg\\max_{\\theta} \\left[ -\\frac{1}{K} \\sum_{k=1}^{K} \\log p(\\Omega(k)) + \\frac{1}{K} \\sum_{k=1}^{K} \\log p(\\Omega(k) \\mid y_{n^*}(k)) \\right] = \\arg\\max_{\\theta} \\left[ -E\\{\\log p(\\Omega)\\} + E\\{\\log p(\\Omega \\mid y_{n^*})\\} \\right] \\tag{12}$$\n\nThe negated expectations of these logarithmic probabilities are entropies, i.e. $H(\\Omega) = -E\\{\\log p(\\Omega)\\}$ and $H(\\Omega \\mid Y) = -E\\{\\log p(\\Omega \\mid y_{n^*})\\}$. Therefore, (9) can also be written as\n\n$$\\theta^* = \\arg\\max_{\\theta} \\{ H(\\Omega) - H(\\Omega \\mid Y) \\} \\tag{13}$$\n\nThis equation can be interpreted as follows: The term on the right side of (13) is also known as the mutual information $I(\\Omega, Y)$ between the probabilistic variables $\\Omega$ and $Y$, i.e.:\n\n$$I(\\Omega, Y) = H(\\Omega) - H(\\Omega \\mid Y) = H(Y) - H(Y \\mid \\Omega) \\tag{14}$$\n\nTherefore, the final information theory-based training criterion for the neural network can be formulated as follows: The synaptic weights of the neural network should be chosen so as to maximize the mutual information between the string representing the classes of the vectors presented to the network input layer during training and the string representing the resulting sequence of firing neurons in the output layer of the neural network. This can also be expressed as the Maximum Mutual Information (MMI) criterion for neural network training. This concludes the proof that MMI neural networks are indeed optimal acoustic processors for HMM's trained with maximum likelihood principles.\n\n3 REALIZATION OF MMI TRAINING ALGORITHMS FOR NEURAL NETWORKS\n\nTraining the synaptic weights of a neural network in order to achieve mutual information maximization is not easy. Two different algorithms have been developed for this task and can only be briefly outlined in this paper. A detailed description can be found in (Rigoll, 1994) and (Neukirchen, 1996). The first experiments used a single-layer neural network with Euclidean distance as propagation function. 
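\n\nAs a small worked illustration of the criterion in (14), consider a hypothetical toy case that is not taken from the experiments in this paper: M = 2 classes, N = 2 output neurons and K = 8 training frames, with assumed joint counts of 3 frames for $(\\Omega_1, y_1)$, 1 for $(\\Omega_1, y_2)$, 1 for $(\\Omega_2, y_1)$ and 3 for $(\\Omega_2, y_2)$. Using base-2 logarithms and relative frequencies as probability estimates, one obtains:\n\n```latex\n\\begin{align*}\nH(\\Omega) &= -\\tfrac{1}{2}\\log_2\\tfrac{1}{2} - \\tfrac{1}{2}\\log_2\\tfrac{1}{2} = 1 \\text{ bit} \\\\\nH(\\Omega \\mid Y) &= -2 \\left( \\tfrac{3}{8}\\log_2\\tfrac{3}{4} + \\tfrac{1}{8}\\log_2\\tfrac{1}{4} \\right) \\approx 0.811 \\text{ bit} \\\\\nI(\\Omega, Y) &= H(\\Omega) - H(\\Omega \\mid Y) \\approx 0.189 \\text{ bit}\n\\end{align*}\n```\n\nA quantizer whose firing neuron always identified the class would instead reach $H(\\Omega \\mid Y) = 0$ and $I(\\Omega, Y) = 1$ bit, the maximum for two equiprobable classes.\n\n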
The first implementation of the MMI training paradigm was realized in (Rigoll, 1994) and is based on a self-organizing procedure, starting with initial weights derived from k-means clustering of the training vectors, followed by an iterative procedure to modify the weights. The mutual information increases in a self-organizing way from a low value at the start to a much higher value after several iteration cycles. The second implementation has been realized recently and is described in detail in (Neukirchen, 1996). It is based on the idea of using gradient methods for finding the MMI value. This technique has not been used before, because the maximum search for finding the firing neuron in the output layer has prevented the calculation of derivatives. This maximum search can be approximated using the softmax function, denoted as $s_n$ for the n-th neuron. It can be computed from the activations $z_l$ of all neurons as:\n\n$$s_n = e^{z_n / T} \\Big/ \\sum_{l=1}^{N} e^{z_l / T} \\tag{15}$$\n\nwhere a small value for parameter T approximates a crisp maximum selection. Since the string $\\Omega$ in (14) is always fixed during training and independent of the parameters in $\\theta$, only the function $H(\\Omega \\mid Y)$ has to be minimized. This function can also be expressed as\n\n$$H(\\Omega \\mid Y) = -\\sum_{m=1}^{M} \\sum_{n=1}^{N} p(y_n, \\Omega_m) \\cdot \\log p(\\Omega_m \\mid y_n) \\tag{16}$$\n\nA derivative with respect to a weight $w_{lj}$ of the neural network yields:\n\n$$\\frac{\\partial H(\\Omega \\mid Y)}{\\partial w_{lj}} = \\ldots \\tag{17}$$\n\nAs shown in (Neukirchen, 1996), all the required terms in (17) can be computed effectively and it is possible to realize a gradient descent method in order to maximize the mutual information of the training data. 
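\n\nThe effect of the parameter T in (15) can be seen in a small numerical sketch. The activations below are assumed values chosen only for illustration (N = 2, $z_1 = 2$, $z_2 = 1$), and the last line is one plausible way to obtain differentiable estimates of the joint probabilities in (16) from the softmax outputs; it is an assumption made here, not necessarily the formulation used in (Neukirchen, 1996):\n\n```latex\n\\begin{align*}\nT = 1: \\quad s_1 &= \\frac{e^{2}}{e^{2} + e^{1}} = \\frac{1}{1 + e^{-1}} \\approx 0.731 \\\\\nT = 0.1: \\quad s_1 &= \\frac{e^{20}}{e^{20} + e^{10}} = \\frac{1}{1 + e^{-10}} \\approx 0.99995 \\\\\np(y_n, \\Omega_m) &\\approx \\frac{1}{K} \\sum_{k=1}^{K} s_n(x(k)) \\cdot \\delta(\\Omega(k), \\Omega_m)\n\\end{align*}\n```\n\nHere $\\delta(\\Omega(k), \\Omega_m)$ equals 1 if frame k belongs to class $\\Omega_m$ and 0 otherwise. As T decreases, $s_n$ approaches the binary firing decision of (7), while for T > 0 the estimate stays differentiable with respect to the network weights.\n\n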
The great advantage of this method is that it is now possible to generalize this algorithm for use in all popular neural network architectures, including multilayer and recurrent neural networks.\n\n4 RESULTS FOR THE HYBRID SYSTEM\n\nThe new hybrid system has been developed and extensively tested using the Resource Management 1000 word speaker-independent continuous speech recognition task. First, a baseline discrete HMM system was built with all well-known features of a context-dependent HMM system. The performance of that baseline system is shown in column 2 of Table 1. The 1st column shows the performance of the hybrid system with the neural vector quantizer. This network has some special features not mentioned in the previous sections, e.g. it uses multiple frame input and has been trained on context-dependent classes. That means that the mutual information between the stream of firing neurons and the corresponding input stream of triphones has been maximized. In this way, the firing behavior of the network becomes sensitive to context-dependent units. Therefore, this network may be the only existing context-dependent acoustic processor, carrying the principle of triphone modeling from the HMM structure to the acoustic front end. It can be seen that a substantially higher recognition performance is obtained with the hybrid system, which compares well with the leading continuous system (HTK, in column 3). It is expected that the system will be further improved in the near future through various additional features, including full exploitation of multilayer neural VQ's and several conventional HMM improvements, e.g. the use of crossword triphones. 
Recent results on the larger Wall Street Journal (WSJ) database have shown a 10.5% error rate for the hybrid system compared to a 13.4% error rate for a standard discrete system, using the 5k vocabulary test with a bigram language model of perplexity 110. This error rate can be further reduced to 8.9% using crossword triphones and to 6.6% with a trigram language model. This rate already compares quite favorably with the best continuous systems for the same task. It should be noted that this hybrid WSJ system is still in its initial stage and the neural component is not yet as sophisticated as in the RM system.\n\n5 CONCLUSION\n\nA new neural network paradigm and the resulting hybrid HMM/ANN speech recognition system have been presented in this paper. The new approach already performs very well and can still be improved. Its good performance results from the following facts: (1) The use of information theory-based training algorithms for the neural vector quantizer, which can be shown to be optimal for the hybrid approach. (2) The possibility of introducing context-dependency not only to the HMM's, but also to the neural quantizer. (3) The fact that this hybrid approach allows the combination of an optimal neural acoustic processor with the most advanced context-dependent HMM system. We will continue to implement various possible improvements for our hybrid speech recognition system.\n\nREFERENCES\n\nRigoll, G. (1994) Maximum Mutual Information Neural Networks for Hybrid Connectionist-HMM Speech Recognition Systems, IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 1, Special Issue on Neural Networks for Speech Processing, pp. 175-184\nNeukirchen, C. & Rigoll, G. (1996) Training of MMI Neural Networks as Vector Quantizers, Internal Report, Gerhard-Mercator-University Duisburg, Faculty of Electrical Engineering, available via http://www.fb9-tLuni-duisburg.de/veroeffentl.html\nBourlard, H. & Morgan, N. 
(1994) Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers\nHochberg, M., Renals, S., Robinson, A., Cook, G. (1995) Recent Improvements to the ABBOT Large Vocabulary CSR System, in Proc. IEEE-ICASSP, Detroit, pp. 69-72\nRigoll, G., Neukirchen, C., Rottland, J. (1996) A New Hybrid System Based on MMI-Neural Networks for the RM Speech Recognition Task, in Proc. IEEE-ICASSP, Atlanta\n\nTable 1: Comparison of recognition rates for different speech recognition systems\n\nRM SI word recognition rate with word pair grammar: correctness (accuracy)\n\ntest set | hybrid MMI-NN system | baseline k-means VQ system | continuous pdf system (HTK)\nFeb.'89 | 96.3 % (95.6 %) | 94.3 % (93.6 %) | 96.0 % (95.5 %)\nOct.'89 | 95.4 % (94.5 %) | 93.5 % (92.0 %) | 95.4 % (94.9 %)\nFeb.'91 | 96.7 % (95.9 %) | 94.4 % (93.5 %) | 96.6 % (96.0 %)\nSep.'92 | 93.9 % (92.5 %) | 90.7 % (88.9 %) | 93.6 % (92.6 %)\naverage | 95.6 % (94.6 %) | 93.2 % (92.0 %) | 95.4 % (94.7 %)\n", "award": [], "sourceid": 1193, "authors": [{"given_name": "Gerhard", "family_name": "Rigoll", "institution": null}, {"given_name": "Christoph", "family_name": "Neukirchen", "institution": null}]}