{"title": "Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System", "book": "Advances in Neural Information Processing Systems", "page_first": 750, "page_last": 756, "abstract": null, "full_text": "Context-Dependent Classes in a Hybrid \n\nRecurrent Network-HMM Speech \n\nRecognition System \n\nDan Kershaw \n\nTony Robinson Mike Hochberg \u2022 \n\nCambridge University Engineering Department, \n\nTrumpington Street, Cambridge CB2 1PZ, England. \n\nTel: [+44]1223332800, Fax: [+44]1223332662. \n\nEmail: djk.ajr@eng.cam.ac.uk \n\nAbstract \n\nA method for incorporating context-dependent phone classes in \na connectionist-HMM hybrid speech recognition system is intro(cid:173)\nduced. A modular approach is adopted, where single-layer networks \ndiscriminate between different context classes given the phone class \nand the acoustic data. The context networks are combined with a \ncontext-independent (CI) network to generate context-dependent \n(CD) phone probability estimates. Experiments show an average \nreduction in word error rate of 16% and 13% from the CI system \non ARPA 5,000 word and SQALE 20,000 word tasks respectively. \nDue to improved modelling, the decoding speed of the CD system \nis more than twice as fast as the CI system. \n\nINTRODUCTION \n\nThe ABBOT hybrid connectionist-HMM system performed competitively with many \nconventional hidden Markov model (HMM) systems in the 1994 ARPA evaluations \nof speech recognition systems (Hochberg, Cook, Renals, Robinson & Schechtman \n1995). This hybrid framework is attractive because it is compact, having far fewer \nparameters than conventional HMM systems, whilst also providing the discrimina(cid:173)\ntive powers of a connectionist architecture. \n\nIt is well established that particular phones vary acoustically when they occur in \ndifferent phonetic contexts. For example a vowel may become nasalized when fol(cid:173)\nlowing a nasal sound. 
*Mike Hochberg is now at Nuance Communications, 333 Ravenswood Avenue, Building 110, Menlo Park, CA 94025, USA. Tel: [+1] 415 614 8260. \n\nThe short-term contextual influence of co-articulation is handled in HMMs by creating a model for all sufficiently differing phonetic contexts with enough acoustic evidence. This modelling of phones in their particular phonetic contexts produces sharper probability density functions. This approach vastly improves HMM recognition accuracy over equivalent context-independent systems (Lee 1989). Although the recurrent neural network (RNN) models acoustic context internally (within the state vector), it does not model phonetic context. This paper presents an approach to improving the ABBOT system through phonetic context-dependent modelling. \n\nIn Cohen, Franco, Morgan, Rumelhart & Abrash (1992), separate sets of context-dependent output layers are used to model context effects in different states of HMM phone models. A set of networks discriminates between phones in 8 different broad-class left and right contexts. Training time is reduced by initialising from a CI multi-layer perceptron (MLP) and only changing the hidden-to-output weights during context-dependent training. This system performs well on the DARPA Resource Management task. The work presented in Zhao, Schwartz, Sroka & Makhoul (1995) followed similar lines to Cohen et al. (1992). A context-dependent mixture of experts (ME) system (Jordan & Jacobs 1994), based on the structure of the context-independent ME, was built. For each state, the whole training data was divided into 46 parts according to its left or right context. Then, a separate ME model was built for each context. \n\nAnother approach to phonetic context-dependent modelling with MLPs was proposed by Bourlard & Morgan (1993). 
It was based on factoring the conditional probability of a phone-in-context given the data in terms of the phone given the data, and its context given the data and the phone. The approach taken in this paper is a mixture of the above work. However, this work augments a recurrent network (rather than an MLP) and concentrates on building a more compact system, which is more suited to our requirements. As a result, the context training scheme is fast and is implemented on a workstation (rather than the parallel processing machine used for training the RNN). \n\nOVERVIEW OF THE ABBOT HYBRID SYSTEM \n\nThe basic framework of the ABBOT system is similar to the one described in Bourlard & Morgan (1994), except that a recurrent network is used as the acoustic model within the HMM framework. A more detailed description of the recurrent network for phone probability estimation is given in Robinson (1994). At each 16 ms time frame, the acoustic vector u(t) is mapped to an output vector y(t), which represents an estimate of the posterior probability of each of the phone classes, \n\ny_i(t) ≈ Pr(q_i(t) | u_1^t), (1) \n\nwhere q_i(t) is phone class i at time t, and u_1^t = {u(1), ..., u(t)} is the input from time 1 to t. Left (past) acoustic context is modelled internally by a 256-dimensional state vector x(t), which can be envisaged as \"storing\" the information that has been presented at the input. Right (future) acoustic context is given by delaying the posterior probability estimation until four frames of input have been seen by the network. The network is trained using a modified version of error back-propagation through time (Robinson 1994). \n\nDecoding with the hybrid connectionist-HMM approach is equivalent to conventional HMM decoding, with the difference being that the RNN models the state observations. 
Like typical HMM systems, the recognition process is expressed as finding the maximum a posteriori state sequence for the utterance. The decoding criterion specified above requires the computation of the likelihood of the acoustic data given a phone (state) sequence, \n\np(u(t) | q_i(t)) = Pr(q_i(t) | u(t)) p(u(t)) / Pr(q_i), (2) \n\nwhere p(u(t)) is the same for all phones, and hence drops out of the decoding process. Hence, the network outputs are mapped to scaled likelihoods by \n\np(u(t) | q_i(t)) ∝ y_i(t) / Pr(q_i), (3) \n\nwhere the priors Pr(q_i) are estimated from the training data. Decoding uses the NOWAY decoder (Renals & Hochberg 1995) to compute the utterance model that is most likely to have generated the observed speech signal. \n\nCONTEXT-DEPENDENT PROBABILITY ESTIMATION \n\nThe approach taken by this work is to augment the CI RNN, in a similar vein to Bourlard & Morgan (1993). The context-dependent likelihood, p(U_t | C_t, Q_t), can be factored as \n\np(U_t | C_t, Q_t) = Pr(C_t | U_t, Q_t) p(U_t | Q_t) / Pr(C_t | Q_t), (4) \n\nwhere C is a set of context classes and Q is a set of context-independent phones or monophones. Substituting for the context-independent probability density function, p(U_t | Q_t), using (2), this becomes \n\np(U_t | C_t, Q_t) = [Pr(C_t | U_t, Q_t) Pr(Q_t | U_t) / (Pr(C_t | Q_t) Pr(Q_t))] p(U_t). (5) \n\nThe term p(U_t) is constant for all frames, so this drops out of the decoding process and is ignored for all further purposes. This format is extremely appealing since Pr(C_t | Q_t) and Pr(Q_t) are estimated from the training data and the CI RNN estimates Pr(Q_t | U_t). All that is then needed is an estimate of Pr(C_t | U_t, Q_t). The approach taken in this paper uses a set of context experts or modules for each monophone class to augment the existing CI RNN. 
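As a toy illustration of the factoring in (5), the decoding score for a phone-in-context is simply the product of the two network outputs divided by the two training-data priors; the following sketch uses made-up probability values (the function name and all numbers are illustrative, not from the paper):

```python
def cd_scaled_likelihood(y_q, y_c_given_q, pr_c_given_q, pr_q):
    """Scaled CD likelihood from (5), dropping the constant p(U_t):
    Pr(C|U,Q) * Pr(Q|U) / (Pr(C|Q) * Pr(Q))."""
    return (y_c_given_q * y_q) / (pr_c_given_q * pr_q)

# e.g. a frame where the CI RNN gives Pr(Q|U) = 0.6, the context module
# gives Pr(C|U,Q) = 0.5, and the priors are Pr(Q) = 0.3, Pr(C|Q) = 0.25:
score = cd_scaled_likelihood(y_q=0.6, y_c_given_q=0.5,
                             pr_c_given_q=0.25, pr_q=0.3)
```

A score above 1 indicates that the acoustics support this phone-in-context more strongly than the priors alone would suggest.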
\n\nTRAINING ON THE STATE VECTOR \n\nAn estimate of Pr(C_t | U_t, Q_t) can be obtained by training a recurrent network to discriminate between contexts c_j(t) for phone class q_i(t), such that \n\ny_{j|i}(t) ≈ Pr(c_j(t) | u_1^{t+4}, q_i(t)), (6) \n\nwhere y_{j|i}(t) is an estimate of the posterior probability of context class j given phone class i. However, training recurrent neural networks in this format would be expensive and difficult. For a recurrent format, the network must contain no discontinuities in the frame-by-frame acoustic input vectors. This implies all recurrent networks for all the phone classes i must be \"shown\" all the data. Instead, since the state vector x = f(u), the assumption is made that x(t+4) is a good representation of u_1^{t+4}. Hence, a single-layer perceptron is trained on the state vectors corresponding to each monophone, q_i, to classify the different phonetic context classes. Finally, the likelihood estimates for the phonetic context class j for phone class i used in decoding are given by \n\nPr(q_i(t) | u_1^{t+4}) Pr(c_j(t) | x(t+4), q_i(t)) / [Pr(c_j(t) | q_i(t)) Pr(q_i(t))] = y_i(t) y_{j|i}(t) / [Pr(c_j | q_i) Pr(q_i)]. (7) \n\nEmbedded training is used to estimate the parameters of the CD networks and the training data is aligned using a Viterbi segmentation. Each context network is trained on a non-overlapping subset of the state vectors generated from all the Viterbi aligned training data. The context networks were trained using the RProp training procedure (Robinson 1994). \n\nFigure 1: The Phonetic Context-Dependent RNN Modular System. 
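The per-phone context classifiers are single-layer networks trained on the RNN state vectors. The paper trains them with RProp, but the idea can be sketched with plain gradient descent on a softmax cross-entropy loss; the dimensions and data below are toy stand-ins for the real 256-dimensional state vectors and tree-clustered context classes:

```python
import numpy as np

# Toy stand-ins: the real system uses 256-dim state vectors and context
# classes chosen by decision-tree clustering.
STATE_DIM, N_CONTEXTS = 8, 3
W = np.zeros((N_CONTEXTS, STATE_DIM))  # one single-layer net per monophone
b = np.zeros(N_CONTEXTS)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(x, j, lr=0.1):
    """One softmax cross-entropy gradient step towards context class j,
    given a state vector x (a stand-in for the paper's RProp updates)."""
    global W, b
    y = softmax(W @ x + b)  # y_{j|i}(t): estimate of Pr(c_j | x, q_i)
    err = y.copy()
    err[j] -= 1.0           # dL/dz for softmax + cross-entropy
    W -= lr * np.outer(err, x)
    b -= lr * err
    return y
```

Because each module only ever sees the state vectors Viterbi-aligned to its own monophone, the per-module training sets are small, which is why the whole bank of context networks trains in hours on a workstation.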
\n\nThe frame-by-frame phonetic context posterior probabilities are required as input to the NOWAY decoder, i.e. all the outputs from the context modules on the right-hand side of Figure 1. These posterior probabilities are calculated from the numerator of (7). The CI RNN stage operates in its normal fashion, generating frame-by-frame monophone posterior probabilities. At the same time, the CD modules take the state vector generated by the RNN as input, in order to classify into a context class. The RNN posterior probability outputs are multiplied by the module outputs to form context-dependent posterior probability estimates. \n\nRELATIONSHIP WITH MIXTURE OF EXPERTS \n\nThis architecture has similarities with the mixture of experts (Jordan & Jacobs 1994). During training, rather than making a \"soft\" split of the data as in the mixture of experts case, the Viterbi segmentation selects one expert at every exemplar. This means only one expert is responsible for each example in the data. This assumes that the Viterbi segmentation is a good approximation to the segmentation/selection process. Hence, each expert is trained on a small subset of the training data, avoiding the computationally expensive requirement for each expert to \"see\" all the data. During decoding, the RNN is treated as a gating network, smoothing the predictions of the experts, in an analogous manner to a standard mixture of experts gating network. For a further description of the system see Kershaw, Hochberg & Robinson (1995). \n\nCLUSTERING CONTEXT CLASSES \n\nOne of the problems faced by a context-dependent system is deciding which context classes are to be included in the CD system. A method for overcoming this problem is a decision-tree based approach to clustering the context classes. 
This guarantees full coverage of all phones in any context, with the context classes being chosen using the acoustic evidence available. The tree clustering framework also allows for the building of a small number of context-dependent phones, keeping the new context-dependent connectionist system architecture compact. The tree building algorithm was based on Young, Odell & Woodland (1994), and further details can be found in Kershaw et al. (1995). Once the trees were built, they were used to relabel the training data and the pronunciation lexicon. \n\nEVALUATION OF THE CONTEXT SYSTEM \n\nThe context-independent networks were trained on the ARPA Wall Street Journal SI84 corpus. The phonetic context-dependent classes were clustered on the acoustic data according to the decision tree algorithm. Running the data through a recurrent network in a feed-forward fashion to obtain three million frames with 256-dimensional state vectors took approximately 8 hours on an HP735 workstation. Training all the context-dependent networks on all the training data takes between 4 and 6 hours (in total) on an HP735 workstation. The context-dependent modules were cross-validated on a development set at the word level. \n\nResults for two context-dependent systems, compared with the context-independent baseline, are shown in Table 1, where the 1993 spoke 5 test is used for cross-validation and development purposes. \n\nThe context-dependent systems were also applied to larger tasks such as the recent 1995 SQALE (a European multi-language speech recognition evaluation) 20,000 word development and evaluation sets. The American English context-dependent system (CD527) was extended to include a set of modules trained backwards in time (which were log-merged with the forward context), to augment a four-way log-merged context-independent system (Hochberg, Cook, Renals & Robinson 1994). 
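The log-merging of forward- and backward-in-time estimates can be pictured as a geometric mean of the per-frame posteriors, renormalised to sum to one. Below is a minimal sketch of that combination rule; it assumes an unweighted merge, since the exact weighting used in the merged systems is not given here:

```python
import numpy as np

def log_merge(posteriors):
    """Combine per-frame posterior estimates from several models in the
    log domain (an unweighted geometric mean), then renormalise."""
    logp = np.mean([np.log(p) for p in posteriors], axis=0)
    merged = np.exp(logp)
    return merged / merged.sum()

# Two models that disagree symmetrically cancel to a uniform merged estimate:
forward  = np.array([0.9, 0.1])
backward = np.array([0.1, 0.9])
merged = log_merge([forward, backward])   # -> [0.5, 0.5]
```

Merging in the log domain penalises classes that any one model considers unlikely, which tends to sharpen the combined estimate relative to a simple arithmetic average.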
\n\nTable 1: Comparison of the CI system with the CD205 and CD527 systems, for 5,000 word, bigram language model tasks. \n\n1993 Test Sets | CI WER | CD205 WER | CD205 % Redn. | CD527 WER | CD527 % Redn. \nSpoke 5 | 16.0 | 14.0 | 12.7 | 13.6 | 14.9 \nSpoke 6 | 14.6 | 12.2 | 16.3 | 11.7 | 19.8 \nEval. | 15.7 | 14.3 | 8.4 | 13.7 | 12.6 \n\nTable 2: Comparison of the merged CI systems with the CD527US and CD465UK systems, for 20,000 word tasks. All tests use a trigram language model. The CD527US and CD465UK evaluation results have been officially adjudicated. \n\n1995 Test Sets | CI WER | CD WER | % Redn. \nUS English dev_test | 12.8 | 11.3 | 12.2 \nUS English eval_test | 14.5 | 12.9† | 9.8 \nUK English dev_test | 15.6 | 12.7 | 18.9 \nUK English eval_test | 16.4 | 13.8† | 15.7 \n\nTable 3: Comparison of average utterance decode speed of the CI systems with the CD527US and CD465UK systems on an HP735, for 20,000 word tasks. All tests use a trigram language model and the same pruning levels. \n\nTests | CI Av. Decode Speed (s) | CD Av. Decode Speed (s) | Speedup \nAmerican English | 67 | 31 | 2.16 \nBritish English | 131 | 48 | 2.73 \n\nTable 4: The number of parameters used for the CI systems as compared with the CD527US and CD465UK systems. \n\nSystem | # CI Parameters | # CD Parameters | % Increase in Parameters \nAmerican English | 341,000 | 612,000 | 79.0 \nBritish English | 331,000 | 570,000 | 72.2 \n\nA similar system was built for British English (CD465). Table 2 shows the improvement gained by using context models. The daggers indicate the official entries for the 1995 SQALE evaluation. These figures represent the lowest reported word error rate for both the US and UK English tasks. 
\nAs a result of improved phonetic modelling and class discrimination, the search space was reduced. This meant that the context-dependent system decoded more than twice as fast as the context-independent system (Table 3), even though there were roughly ten times as many context-dependent phones as monophones. \n\nThe increase in the number of parameters due to the introduction of the context models for the SQALE evaluation system is shown in Table 4. Although this seems a large increase in the number of system parameters, it is still an order of magnitude less than any equivalent HMM system built for this task. \n\nCONCLUSIONS \n\nThis paper has discussed a successful way of integrating phonetic context-dependent classes into the current ABBOT hybrid system. The architecture followed a modular approach which could be used to augment any current RNN-HMM hybrid system. Fast training of the context-dependent modules was achieved. Training on all of the SI84 corpus took between 4 and 6 hours. Utterance decoding was performed using the standard NOWAY decoder. The word error was significantly reduced, whilst the decoding speed of the context system was over twice as fast as the baseline system (for 20,000 word tasks). \n\nReferences \n\nBourlard, H. & Morgan, N. (1993), 'Continuous Speech Recognition by Connectionist Statistical Methods', IEEE Transactions on Neural Networks 4(6), 893-909. \nBourlard, H. & Morgan, N. (1994), Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers. \nCohen, M., Franco, H., Morgan, N., Rumelhart, D. & Abrash, V. (1992), Context-Dependent Multiple Distribution Phonetic Modeling with MLPs, in 'NIPS 5'. \nHochberg, M., Cook, G., Renals, S. & Robinson, A. (1994), Connectionist Model Combination for Large Vocabulary Speech Recognition, in 'Neural Networks for Signal Processing', Vol. IV, pp. 269-278. 
\nHochberg, M., Cook, G., Renals, S., Robinson, A. & Schechtman, R. (1995), The 1994 ABBOT Hybrid Connectionist-HMM Large-Vocabulary Recognition System, in 'Spoken Language Systems Technology Workshop', ARPA, pp. 170-6. \nJordan, M. & Jacobs, R. (1994), 'Hierarchical Mixtures of Experts and the EM Algorithm', Neural Computation 6, 181-214. \nKershaw, D., Hochberg, M. & Robinson, A. (1995), Incorporating Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System, F-INFENG TR217, Cambridge University Engineering Department. \nLee, K.-F. (1989), Automatic Speech Recognition: The Development of the SPHINX System, Kluwer Academic Publishers. \nRenals, S. & Hochberg, M. (1995), Efficient Search Using Posterior Phone Probability Estimates, in 'ICASSP', Vol. 1, pp. 596-9. \nRobinson, A. (1994), 'An Application of Recurrent Nets to Phone Probability Estimation', IEEE Transactions on Neural Networks 5(2), 298-305. \nYoung, S., Odell, J. & Woodland, P. (1994), 'Tree-Based State Tying for High Accuracy Acoustic Modelling', Spoken Language Systems Technology Workshop. \nZhao, Y., Schwartz, R., Sroka, J. & Makhoul, J. (1995), Hierarchical Mixtures of Experts Methodology Applied to Continuous Speech Recognition, in 'NIPS 7'. \n", "award": [], "sourceid": 1039, "authors": [{"given_name": "Dan", "family_name": "Kershaw", "institution": null}, {"given_name": "Anthony", "family_name": "Robinson", "institution": null}, {"given_name": "Mike", "family_name": "Hochberg", "institution": null}]}