{"title": "Connectionist Approaches to the Use of Markov Models for Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 213, "page_last": 219, "abstract": null, "full_text": "Connectionist Approaches to the Use of Markov Models for Speech Recognition \n\nHervé Bourlard †,‡ \n† L & H Speechproducts \nKoning Albert I laan, 64 \n1780 Wemmel, BELGIUM \n\nNelson Morgan ‡ & Chuck Wooters ‡ \n‡ Intl. Comp. Sc. Institute \n1947 Center St., Suite 600 \nBerkeley, CA 94704, USA \n\nABSTRACT \n\nPrevious work has shown the ability of Multilayer Perceptrons (MLPs) to estimate emission probabilities for Hidden Markov Models (HMMs). The advantages of a speech recognition system incorporating both MLPs and HMMs are better discrimination and the ability to incorporate multiple sources of evidence (features, temporal context) without restrictive assumptions of distributions or statistical independence. This paper presents results on the speaker-dependent portion of DARPA's English language Resource Management database. Results support the previously reported utility of MLP probability estimation for continuous speech recognition. An additional approach we are pursuing is to use MLPs as nonlinear predictors for autoregressive HMMs. While this is shown to be more compatible with the HMM formalism, it still suffers from several limitations. This approach is generalized to take account of time correlation between successive observations, without any restrictive assumptions about the driving noise. \n\n1 INTRODUCTION \n\nWe have been working on continuous speech recognition using moderately large vocabularies (1000 words) [1,2]. While some of our research has been in speaker-independent recognition [3], we have primarily used a German speaker-dependent database called SPICOS [1,2]. 
In our previously reported work, we developed a hybrid MLP/HMM algorithm in which an MLP is trained to generate the output probabilities of an HMM [1,2]. Given speaker-dependent training, we have been able to recognize 50-60% of the words in the SPICOS test sentences. While this is not a state-of-the-art level of performance, it was accomplished with single-state phoneme models, no triphone or allophone representations, no function word modeling, etc., and so may be regarded as a \"baseline\" system. The main point of using such a simple system is to simplify comparison of the effectiveness of alternative probability estimation techniques. While we are working on extending our technique to more complex systems, the current paper describes the application of the baseline system (with a few changes, such as different VQ features) to the speaker-dependent portion of the English language Resource Management (RM) database (continuous utterances built up from a lexicon of roughly 1000 words) [4]. While this exercise was primarily intended to confirm that the previous result, which showed the utility of MLPs for the estimation of HMM output probabilities, was not restricted to the limited data set of our first experiments, it also shows how to further improve the initial scheme. \n\nHowever, potential problems remain. In order to improve local discrimination, the MLP is usually provided with contextual inputs [1,2,3] or recurrent links. Unfortunately, in these cases, the dynamic programming recurrences of the Viterbi algorithm \n\nerror is equivalent to estimation of p(x_n | q_k, X_{n-p}^{n-1}) (where q_k is the HMM state associated with x_n), which can be expressed as a Gaussian (with unit variance) whose exponent is the prediction error. Consequently, the prediction errors can be used as local distances in DP and are fully compatible with the recurrences of the Viterbi algorithm. 
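The correspondence between squared prediction error and a unit-variance Gaussian log-likelihood can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature vectors and the two candidate predictions (standing in for the outputs of two per-state predictor MLPs) are invented toy values.

```python
import numpy as np

def local_distance(x_n, x_pred):
    """Negative log-likelihood of frame x_n under a unit-variance Gaussian
    centered on a state's predicted frame x_pred. Up to a state-independent
    additive constant, this is just half the squared prediction error, so
    it can serve directly as a local distance in the Viterbi recurrences."""
    d = x_n.size
    err = np.sum((x_n - x_pred) ** 2)       # squared prediction error
    const = 0.5 * d * np.log(2.0 * np.pi)   # same for every state
    return 0.5 * err + const

# Toy check: with unit variance, ranking states by this distance is
# equivalent to ranking them by prediction error alone.
x_n = np.array([0.2, -0.1])
pred_a = np.array([0.2, -0.1])   # hypothetical state A: perfect prediction
pred_b = np.array([1.0, 1.0])    # hypothetical state B: poor prediction
assert local_distance(x_n, pred_a) < local_distance(x_n, pred_b)
```

In a full predictive HMM of this kind, x_pred would be produced by the MLP associated with state q_k applied to the p previous frames, and this distance would take the place of the emission term in the DP recurrences.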
However, although the MLP/HMM interface problem seems to be solved, we are now limited to Gaussian AR processes. Furthermore, each state must be associated with its own MLP [10]. An alternative approach, as proposed in [9], is to have a single MLP with additional \"control\" inputs coding the state being considered. However, in both cases, the discriminant character of the MLP is lost, since it is only used as a nonlinear predictor. In preliminary experiments on SPICOS we were unable to get significant results from these approaches compared with the method presented in the previous Section [1,2]. \n\nHowever, it is possible to generalize the former approach and to avoid the Gaussian hypothesis. It is indeed easy to prove (by using Bayes' rule with an additional conditional X_{n-p}^{n-1} everywhere, where X_{n-p}^{n-1} denotes the sequence of the p previous observations) that: \n\np(x_n | q_k, X_{n-p}^{n-1}) = p(q_k | X_{n-p}^{n}) p(x_n | X_{n-p}^{n-1}) / p(q_k | X_{n-p}^{n-1})   (1) \n\nAs p(x_n | X_{n-p}^{n-1}) in (1) is independent of the classes q_k, it can be overlooked in the DP recurrences. In this case, without any assumption about the mean and covariance of the driving noise, p(x_n | q_k, X_{n-p}^{n-1}) can be expressed as the ratio of the output values of two \"standard\" MLPs (as used in the previous Section and in [1,2]), respectively with X_{n-p}^{n} and X_{n-p}^{n-1} as input. In preliminary experiments, this approach led to better results than the former AR models, without however bearing comparison with the method used in the previous Section and in [1,2]. For example, on SPICOS and after tuning, we got a 46% recognition rate instead of 65% with our best method [2]. \n\n5 CONCLUSION \n\nDespite some theoretical nonidealities, the HMM/MLP hybrid approach can achieve significant improvement over comparable standard HMMs. This was observed using a simplified HMM system with single-state monophone models and no language model. 
However, the reported results also show that many of the tricks used to improve standard HMMs are also valid for our hybrid approach, which leaves the way open to all sorts of further developments. Now that we have confirmed the principle, we are beginning to develop a complete system, which will incorporate context-dependent sound units. In this framework, we are studying the possibility of modeling multi-state HMMs and triphones. On the other hand, in spite of disappointing preliminary performance (which seems to corroborate previous experiments done by others [13,14] with AR processes for speech recognition), MLPs as AR models are still worth considering further, given their attractive theoretical basis and better interface with the HMM formalism. \n\nReferences \n\n[1] Bourlard, H., Morgan, N., & Wellekens, C.J., \"Statistical Inference in Multilayer Perceptrons and Hidden Markov Models with Applications in Continuous Speech Recognition\", Neurocomputing, Eds. F. Fogelman & J. Herault, NATO ASI Series, vol. F68, pp. 217-226, 1990. \n\n[2] Morgan, N., & Bourlard, H., \"Continuous Speech Recognition using Multilayer Perceptrons with Hidden Markov Models\", Proc. IEEE Intl. Conf. on ASSP, pp. 413-416, Albuquerque, NM, April 1990. \n\n[3] Morgan, N., Hermansky, H., Bourlard, H., Kohn, P., & Wooters, C., \"Continuous Speech Recognition Using PLP Analysis with Multilayer Perceptrons\", accepted for Proc. IEEE Intl. Conf. on ASSP, Toronto, 1991. \n\n[4] Price, P., Fisher, W., Bernstein, J., & Pallett, D., \"The DARPA 1000-Word Resource Management Database for Continuous Speech Recognition\", Proc. IEEE Intl. Conf. on ASSP, pp. 651-654, New York, 1988. 
\n\n[5] Bourlard, H., & Wellekens, C.J., \"Links between Markov Models and Multilayer Perceptrons\", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 12, no. 12, pp. 1167-1178, December 1990. \n\n[6] Murveit, H., & Weintraub, M., \"1000-Word Speaker-Independent Continuous Speech Recognition Using Hidden Markov Models\", Proc. IEEE Intl. Conf. on ASSP, pp. 115-118, New York, 1988. \n\n[7] Morgan, N., & Bourlard, H., \"Generalization and Parameter Estimation in Feedforward Nets: Some Experiments\", Advances in Neural Information Processing Systems 2, Ed. D.S. Touretzky, San Mateo, CA: Morgan Kaufmann, pp. 630-637, 1990. \n\n[8] Juang, B.H., & Rabiner, L.R., \"Mixture Autoregressive Hidden Markov Models for Speech Signals\", IEEE Trans. on ASSP, vol. 33, no. 6, pp. 1404-1412, 1985. \n\n[9] Levin, E., \"Speech Recognition Using Hidden Control Neural Network Architecture\", Proc. IEEE Intl. Conf. on ASSP, Albuquerque, New Mexico, 1990. \n\n[10] Tebelskis, J., & Waibel, A., \"Large Vocabulary Recognition Using Linked Predictive Neural Networks\", Proc. IEEE Intl. Conf. on ASSP, Albuquerque, New Mexico, 1990. \n\n[11] Morgan, N., Wooters, C., Bourlard, H., & Cohen, M., \"Continuous Speech Recognition on the Resource Management Database Using Connectionist Probability Estimation\", Proc. Intl. Conf. on Spoken Language Processing, Kobe, Japan, 1990. \n\n[12] Bourlard, H., \"How Connectionist Models Could Improve Markov Models for Speech Recognition\", Advanced Neural Computers, Ed. R. Eckmiller, North-Holland, pp. 247-254, 1990. \n\n[13] de La Noue, P., Levinson, S., & Sondhi, M., \"Incorporating the Time Correlation Between Successive Observations in an Acoustic-Phonetic Hidden Markov Model for Continuous Speech Recognition\", AT&T Technical Memorandum No. 11226, 1989. \n\n[14] Wellekens, C.J., \"Explicit Time Correlation in Hidden Markov Models\", Proc. IEEE Intl. Conf. 
on ASSP, Dallas, Texas, 1987.", "award": [], "sourceid": 359, "authors": [{"given_name": "Herv\u00e9", "family_name": "Bourlard", "institution": null}, {"given_name": "Nelson", "family_name": "Morgan", "institution": null}, {"given_name": "Chuck", "family_name": "Wooters", "institution": null}]}