{"title": "Connectionist Optimisation of Tied Mixture Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 167, "page_last": 174, "abstract": null, "full_text": "Connectionist Optimisation of Tied Mixture \n\nHidden Markov Models \n\nSteve Renals \nNelson Morgan \nICSI \nBerkeley CA 94704 \nUSA \n\nHerve Bourlard \n\nL&H Speech products \n\nleper B-9800 \n\nBelgium \n\nHoracio Franco \nMichael Cohen \nSRI International \nMenlo Park CA 94025 \nUSA \n\nAbstract \n\nIssues relating to the estimation of hidden Markov model (HMM) local \nprobabilities are discussed. In particular we note the isomorphism of ra(cid:173)\ndial basis functions (RBF) networks to tied mixture density modellingj \nadditionally we highlight the differences between these methods arising \nfrom the different training criteria employed. We present a method in \nwhich connectionist training can be modified to resolve these differences \nand discuss some preliminary experiments. Finally, we discuss some out(cid:173)\nstanding problems with discriminative training. \n\n1 \n\nINTRODUCTION \n\nIn a statistical approach to continuous speech recognition the desired quantity is \nthe posterior probability p(Wrlxf, 8) of a word sequence Wr = Wl ..... Ww given \nthe acoustic evidence X[ = Xl ..... XT and the parameters of the speech model used \n8. Typically a set of models is used, to separately model different units of speech. \nThis probability may be re-expressed using Bayes' rule: \n\n(1) \n\nP(WwIXT 8) = p(XflW r.8)p(WrI8) \n\n1 \n\nl' \n\np(Xn8) \n\np(Xflwr. 8)p(Wr 18) \n\n= Lw' p(XnW'. 8)P(W'18) . \n\np(XnWr.8)/p(Xn8) is the acoustic model. This is the ratio of the likelihood \nof the acoustic evidence given the sequence of word models, to the probability of \n167 \n\n\f168 \n\nRenaIs, Morgan, Bourlard, Franco, and Cohen \n\nthe acoustic data being generated by the complete set of models. 
p(X_1^T | θ) may be regarded as a normalising term that is constant (across models) at recognition time. However, at training time the parameters θ are being adapted, so p(X_1^T | θ) is no longer constant. The prior, P(W_1^w | θ), is obtained from a language model.

The basic unit of speech, typically smaller than a word (here we use phones), is modelled by a hidden Markov model (HMM). Word models consist of concatenations of phone HMMs (constrained by pronunciations stored in a lexicon), and sentence models consist of concatenations of word HMMs (constrained by a grammar). The lexicon and grammar together make up a language model, specifying prior probabilities for sentences, words and phones.

A HMM is a stochastic automaton defined by a set of states q_i, a topology specifying allowed state transitions and a set of local probability density functions (PDFs) p(x_t, q_i | q_j, X_1^{t-1}). Making the further assumptions that the output at time t is independent of previous outputs and depends only on the current state, we may separate the local probabilities into state transition probabilities P(q_i | q_j) and output PDFs p(x_t | q_i). A set of initial state probabilities must also be specified.

The parameters of a HMM are usually set via a maximum likelihood procedure that optimally estimates the joint density p(x, q | θ). The forward-backward algorithm, a provably convergent algorithm for this task, is extremely efficient in practice. However, in speech recognition we do not wish to make the best model of the data {x, q} given the model parameters; we want to make the optimal discrimination between classes at each time. This can be better achieved by computing a discriminant P(q | x, θ). Note that in this case we do not model the input density p(x | θ).

We may estimate P(q | x, θ) using a feed-forward network trained to an entropy criterion (Bourlard & Wellekens, 1989).
However, we require likelihoods of the form p(x | q, θ) as HMM output probabilities. We may convert posterior probabilities to scaled likelihoods p(x | q, θ) / p(x | θ) by dividing the network outputs by the relative frequencies of each class¹. Note that we are not using connectionist training to obtain density estimates here; we are obtaining a ratio and not modelling p(x | θ). This ratio is the quantity that we wish to maximise: this corresponds to maximising p(x | q_c, θ) and minimising p(x | q_i, θ), i ≠ c, where q_c is the correct class. We have used discriminatively trained networks to estimate the output PDFs (Bourlard & Morgan, 1991; Renals et al., 1991, 1992), and have obtained superior results to maximum likelihood training on continuous speech recognition tasks.

¹These are the estimates of P(q_i) implicitly used during classifier training.

In this paper, we are mainly concerned with radial basis function (RBF) networks. A RBF network generally has a single hidden layer, whose units may be regarded as computing local (or approximately local) densities, rather than global decision surfaces. The resultant posteriors are obtained by output units that combine these local densities. We are interested in using RBF networks for various reasons:

• A RBF network is isomorphic to a tied mixture density model, although the training criterion is typically different. The relationship between the two is explored in this paper.

• The locality of RBFs makes them suitable for situations in which the input distribution may change (e.g. speaker adaptation). Surplus RBFs in a region of the input space where data no longer occurs will not affect the final classification. This is not so for sigmoidal hidden units in a multi-layer perceptron (MLP), which have a global effect.
• RBFs are potentially more computationally efficient than MLPs at both training and recognition time.

2 TIED MIXTURE HMM

Tied mixtures of Gaussians have proven to be powerful PDF estimators in HMM speech recognition systems (Huang & Jack, 1989; Bellegarda & Nahamoo, 1990). The resulting systems are also known as semi-continuous HMMs. Tied mixture density estimation may be regarded as an interpolation between discrete and continuous density modelling. Essentially, tied mixture modelling has a single \"codebook\" of Gaussians shared by all output PDFs. Each of these PDFs has its own set of mixture coefficients used to combine the individual Gaussians. If f_k(x | q_k) is the output PDF of state q_k, and N_j(x | μ_j, Σ_j) are the component Gaussians, then:

    f_k(x | q_k, θ) = Σ_j a_kj N_j(x | μ_j, Σ_j),    Σ_j a_kj = 1,    (2)

where a_kj is an element of the matrix of mixture coefficients (which may be interpreted as the prior probability P(μ_j, Σ_j | q_k)) defining how much component density N_j(x | μ_j, Σ_j) contributes to output PDF f_k(x | q_k, θ). Alternatively this may be regarded as \"fuzzy\" vector quantisation.

3 RADIAL BASIS FUNCTIONS

The radial basis function (RBF) network was originally introduced as a means of function interpolation (Powell, 1985; Broomhead & Lowe, 1988). A set of K approximating functions f_k(x) is constructed from a set of J basis functions φ_j(x):

    f_k(x) = Σ_{j=1}^J a_kj φ_j(x).    (3)

This equation defines a RBF network with J RBFs (hidden units) and K outputs. The output units here are linear, with weights a_kj. The RBFs are typically Gaussians, with means μ_j and covariance matrices Σ_j:

    φ_j(x) = (1/R) exp(-(x - μ_j)^T Σ_j^{-1} (x - μ_j) / 2),    (4)

where R is a normalising constant. The covariance matrix is frequently assumed to be diagonal².

²This is often reasonable for speech applications, since mel or PLP cepstral coefficients are orthogonal.
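The shared-codebook computation of equations (2)-(4) can be sketched numerically. This is a minimal illustration, not the paper's implementation: the function names and array shapes are assumptions, and the covariances are taken to be diagonal as in footnote 2.

```python
import numpy as np

def rbf_activations(x, means, variances):
    """Diagonal-covariance Gaussian basis functions phi_j(x), as in eq. (4)."""
    # means, variances: (J, D) codebook parameters; x: (D,) acoustic frame.
    diff = x - means                                            # (J, D)
    log_norm = 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_phi = -0.5 * np.sum(diff ** 2 / variances, axis=1) - log_norm
    return np.exp(log_phi)                                      # (J,)

def tied_mixture_likelihoods(x, means, variances, mix):
    """Output PDFs f_k(x) = sum_j a_kj phi_j(x), as in eqs. (2)/(3).

    mix: (K, J) mixture coefficients; each row is non-negative and sums to 1.
    The codebook (means, variances) is shared across all K output PDFs."""
    phi = rbf_activations(x, means, variances)  # single shared codebook
    return mix @ phi                            # (K,) state likelihoods
```

The tying is visible in the code: `rbf_activations` is evaluated once per frame and reused by every output PDF, with only the mixture coefficient rows differing between states.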
Such a network has been used for HMM output probability estimation in continuous speech recognition (Renals et al., 1991) and an isomorphism to tied-mixture HMMs was noted. However, there is a mismatch between the posterior probabilities estimated by the network and the likelihoods required for HMM decoding. Previously this was resolved by dividing the outputs by the relative frequencies of each state. It would be desirable, though, to retain the isomorphism to tied mixtures: specifically we wish to interpret the hidden-to-output weights of an RBF network as the mixture coefficients of a tied mixture likelihood function. This can be achieved by defining the transfer functions of the output units to implement Bayes' rule, which relates the posterior g_k(x) to the likelihood f_k(x):

    g_k(x) = P(q_k) f_k(x) / Σ_l P(q_l) f_l(x).    (5)

Such a transfer function ensures the outputs sum to 1; if f_k(x) is guaranteed non-negative, then the outputs are formally probabilities. The output of such a network is a probability distribution and we are using '1-from-K' training: thus the relative entropy E is simply:

    E = -log g_c(x),    (6)

where q_c is the desired output class (HMM distribution). Bridle (1990) has demonstrated that minimising this error function is equivalent to maximising the mutual information between the acoustic evidence and the HMM state sequence.

If we wish to interpret the weights as mixture coefficients, then we must ensure that they are non-negative and sum to 1. This may be achieved using a normalised exponential (softmax) transformation:

    a_kj = exp(w_kj) / Σ_l exp(w_kl).    (7)

The mixture coefficients a_kj are used to compute the likelihood estimates, but it is the derived variables w_kj that are used in the unconstrained optimisation.

3.1 TRAINING

Steepest descent training specifies that:

    Δw_kj ∝ -∂E/∂w_kj.    (8)

Here E is the relative entropy objective function (6).
We may decompose the right hand side of this by a careful application of the chain rule of differentiation:

    ∂E/∂w_kj = [Σ_l (∂E/∂g_l(x)) (∂g_l(x)/∂f_k(x))] [Σ_m (∂f_k(x)/∂a_km) (∂a_km/∂w_kj)].    (9)

We may write down expressions for each of these partials (where δ_ab is the Kronecker delta and q_c is the desired state):

    ∂E/∂g_l(x) = -δ_lc / g_c(x),    (10)

    ∂g_l(x)/∂f_k(x) = g_k(x) (δ_lk - g_l(x)) / f_k(x),    (11)

    ∂f_k(x)/∂a_kl = φ_l(x),    (12)

    ∂a_kl/∂w_kj = a_kl (δ_lj - a_kj).    (13)

Substituting (10), (11), (12) and (13) into (9) we obtain:

    ∂E/∂w_kj = (1/f_k(x)) (g_k(x) - δ_kc) a_kj (φ_j(x) - f_k(x)).    (14)

Apart from the added terms due to the normalisation of the weights, the major difference in the gradient, compared with using a sigmoid or softmax transfer function, is the 1/f_k(x) factor. To some extent we may regard this as a dimensional term.

The required gradient is simpler if we construct the network to estimate log likelihoods, replacing f_k(x) with z_k(x) = log f_k(x):

    z_k(x) = Σ_j w_kj φ_j(x),    (15)

    g_k(x) = P(q_k) exp(z_k(x)) / Σ_l P(q_l) exp(z_l(x)).    (16)

Since this is in the log domain, no constraints on the weights are required. The new gradient we need is:

    ∂z_k(x)/∂w_kj = φ_j(x).    (17)

Thus the gradient of the error is:

    ∂E/∂w_kj = (g_k(x) - δ_kc) φ_j(x).    (18)

Since we are in the log domain, the 1/f_k(x) factor is additive and thus disappears from the gradient. This network is similar to Bridle's softmax, except that here uniform priors are not assumed; the gradient is of identical form, though. In this case the weights do not have a simple relationship with the mixture coefficients obtained in tied mixture density modelling.

We may also train the means and variances of the RBFs by back-propagation of error; the gradients are straightforward.
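The log-domain forward computation (15)-(16) and its gradient (18) can be sketched and checked against a finite-difference estimate of E = -log g_c(x). The function names and the small dimensions are illustrative assumptions only, not the paper's code.

```python
import numpy as np

def posteriors_log_domain(phi, W, priors):
    """g_k(x) from eq. (16): a priors-weighted softmax over z_k = W @ phi (eq. 15)."""
    z = W @ phi                      # (K,) log-likelihood estimates z_k(x)
    u = np.log(priors) + z
    u -= u.max()                     # stabilise the exponentials
    g = np.exp(u)
    return g / g.sum()               # posteriors summing to 1

def weight_gradient(phi, W, priors, c):
    """dE/dw_kj = (g_k - delta_kc) phi_j, eq. (18), for E = -log g_c(x)."""
    g = posteriors_log_domain(phi, W, priors)
    delta = np.zeros_like(g)
    delta[c] = 1.0                   # '1-from-K' target for the desired state
    return np.outer(g - delta, phi)  # (K, J) gradient matrix
```

A numerical gradient check of this kind is a cheap way to confirm that the 1/f_k(x) factor really has disappeared from the log-domain gradient.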
3.2 PRELIMINARY EXPERIMENTS

We have experimented with both the Bayes' rule transfer function (5) and the variant in the log domain (16). We used a phoneme classification task, with a database consisting of 160,000 frames of continuous speech. We typically computed the parameters of the RBFs by a k-means clustering process. We found that the gradient resulting from the first transfer function (14) had a tendency to numerical instability, due to the 1/f_k(x) term; thus most of our experiments have used the log domain transfer function.

In experiments using 1000 RBFs, we have obtained frame classification rates of 52%. This is somewhat poorer than the frame classification we obtain using a 512 hidden unit MLP (59%). We are investigating improvements to our procedure, including variations to the learning schedule, the use of the EM algorithm to set RBF parameters and the use of priors on the weight matrix.

4 PROBLEMS WITH DISCRIMINATIVE TRAINING

4.1 UNLABELLED DATA

A problem arises from the use of unlabelled or partially labelled data. When training a speech recogniser, we typically know the word sequence for an utterance, but we do not have a time-aligned phonetic transcription. This is a case of partially labelled data: a training set of data pairs {x_t, q_t} is unavailable, but we do not have purely unlabelled data {x_t}. Instead, we have the constraining information of the word sequence W. Thus P(q_i | x_t) may be decomposed as:

    (19)

We usually make the further approximation that the optimal state sequence is much more likely than any competing state sequence. Thus, P(q_c | x_t) = 1, and the probabilities of all other states at time t are 0. This most likely state sequence (which may be computed using a forced Viterbi alignment) is often used as the desired outputs for a discriminatively trained network.
Using this alignment implicitly assumes model correctness; however, we use discriminative training because we believe the HMMs are an inadequate speech model. Hence there is a mismatch between the maximum likelihood labelling and alignment, and the discriminative training used for the networks.

It may be that this mismatch is responsible for the lack of robustness of discriminative training (compared with pure maximum likelihood training) in vocabulary independent speech recognition tasks (Paul et al., 1991). The assumption of model correctness used to generate the labels may have the effect of further embedding specifics of the training data into the final models. A solution to this problem may be to use a probabilistic alignment, with a distribution over labels at each timestep. This could be computed using the forward-backward algorithm, rather than the Viterbi approximation. This maximum likelihood approach still assumes model correctness, of course. A discriminative approach to this problem would also attempt to infer distributions over labels. A basic goal might be to sharpen the distribution toward the maximum likelihood estimate. An example of such a method is the 'phantom targets' algorithm introduced by Bridle & Cox (1991).

These optimisations are local: the error is not propagated through time. Algorithms for globally optimising discriminative training have been proposed (e.g. Bengio et al., these proceedings), but are not without problems when used with a constraining language model. The problem is that to compute the posterior, the ratio of the probabilities of generating the correct utterance and generating all allowable utterances must be computed.
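The probabilistic alignment suggested above, a distribution over state labels at each timestep rather than a hard Viterbi assignment, can be sketched with a scaled forward-backward pass. This is a minimal illustration under assumed array shapes, not the paper's implementation.

```python
import numpy as np

def soft_targets(obs, trans, init):
    """State occupation probabilities gamma_t(i) via the scaled
    forward-backward algorithm, usable as soft network targets in
    place of a hard (one-hot) Viterbi alignment.

    obs: (T, S) observation likelihoods p(x_t | q_i)
    trans: (S, S) transition matrix; init: (S,) initial probabilities."""
    T, S = obs.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    scale = np.zeros(T)
    alpha[0] = init * obs[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):                      # scaled forward pass
        alpha[t] = (alpha[t - 1] @ trans) * obs[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):             # scaled backward pass
        beta[t] = (trans @ (obs[t + 1] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta                       # occupation probabilities
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Each row of the returned array is a distribution over states at one timestep; training on these rows, rather than on one-hot Viterbi labels, is the soft-labelling scheme discussed in the text.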
4.2 THE PRIORS

It has been shown, both theoretically and in practice, that the training and recognition procedures used with standard HMMs remain valid for posterior probabilities (Bourlard & Wellekens, 1989). Why then do we replace these posterior probabilities with likelihoods?

The answer lies in a mismatch between the prior probabilities given by the training data and those imposed by the topology of the HMMs. Choosing the HMM topology also amounts to fixing the priors. For instance, if the classes q_k represent phones, the prior probabilities P(q_k) are fixed when word models are defined as particular sequences of phone models. This discussion can be extended to different levels of processing: if q_k represents sub-phonemic states and recognition is constrained by a language model, the prior probabilities P(q_k) are fixed by (and can be calculated from) the phone models, word models and the language model. Ideally, the topologies of these models would be inferred directly from the training data, by using a discriminative criterion which implicitly contains the priors. Here, at least in theory, it would be possible to start from fully-connected models and to determine their topology according to the priors observed on the training data. Unfortunately this results in a huge number of parameters that would require an unrealistic amount of training data to estimate significantly. This problem has also been raised in the context of language modelling (Paul et al., 1991).

Since the ideal theoretical solution is not accessible in practice, it is usually better to dispose of the poor estimate of the priors obtained from the training data, replacing them with \"prior\" phonological or syntactic knowledge.
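At decode time, discarding the training-data priors amounts to dividing each network posterior g_k(x) by the relative frequency of class q_k, yielding a scaled likelihood proportional to p(x | q_k) / p(x); the HMM topology and language model then supply the priors. A minimal sketch; the function name and flooring constant are illustrative assumptions.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, class_priors, floor=1e-8):
    """Return log[g_k(x) / P(q_k)], the log scaled likelihood used as the
    HMM output score; the language model imposes its own priors during
    decoding instead of the training-set relative frequencies."""
    p = np.maximum(posteriors, floor)       # guard against log(0)
    pr = np.maximum(class_priors, floor)    # guard rarely seen classes
    return np.log(p) - np.log(pr)
```

Working in the log domain keeps the division numerically safe when posteriors are very small, which is common for the many inactive states at a given frame.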
5 CONCLUSION

Having discussed the similarities and differences between RBF networks and tied mixture density estimators, we presented a method that attempts to resolve a mismatch between discriminative training and density estimation. Some preliminary experiments relating to this approach were discussed; we are currently performing further speech recognition experiments using these methods. Finally, we raised some important issues pertaining to discriminative training.

Acknowledgement

This work was partially funded by DARPA contract MDA904-90-C-5253.

References

Bellegarda, J. R. & Nahamoo, D. (1990). Tied mixture continuous parameter modeling for speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 38, 2033-2045.

Bourlard, H. & Morgan, N. (1991). Connectionist approaches to the use of Markov models for continuous speech recognition. In Lippmann, R. P., Moody, J. E., & Touretzky, D. S. (Eds.), Advances in Neural Information Processing Systems, Vol. 3, pp. 213-219. Morgan Kaufmann, San Mateo CA.

Bourlard, H. & Wellekens, C. J. (1989). Links between Markov models and multilayer perceptrons. In Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems, Vol. 1, pp. 502-510. Morgan Kaufmann, San Mateo CA.

Bridle, J. S. & Cox, S. J. (1991). RecNorm: Simultaneous normalisation and classification applied to speech recognition. In Lippmann, R. P., Moody, J. E., & Touretzky, D. S. (Eds.), Advances in Neural Information Processing Systems, Vol. 3, pp. 234-240. Morgan Kaufmann, San Mateo CA.

Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems, Vol. 2, pp. 211-217.
Morgan Kaufmann, San Mateo CA.

Broomhead, D. S. & Lowe, D. (1988). Multi-variable functional interpolation and adaptive networks. Complex Systems, 2, 321-355.

Huang, X. D. & Jack, M. A. (1989). Semi-continuous hidden Markov models for speech signals. Computer Speech and Language, 3, 239-251.

Paul, D. B., Baker, J. K., & Baker, J. M. (1991). On the interaction between true source, training and testing language models. In Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 569-572, Toronto.

Powell, M. J. D. (1985). Radial basis functions for multi-variable interpolation: a review. Tech. rep. DAMTP/NA12, Dept. of Applied Mathematics and Theoretical Physics, University of Cambridge.

Renals, S., McKelvie, D., & McInnes, F. (1991). A comparative study of continuous speech recognition using neural networks and hidden Markov models. In Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 369-372, Toronto.

Renals, S., Morgan, N., Cohen, M., & Franco, H. (1992). Connectionist probability estimation in the DECIPHER speech recognition system. In Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, San Francisco. In press.
", "award": [], "sourceid": 443, "authors": [{"given_name": "Steve", "family_name": "Renals", "institution": null}, {"given_name": "Nelson", "family_name": "Morgan", "institution": null}, {"given_name": "Herv\u00e9", "family_name": "Bourlard", "institution": null}, {"given_name": "Horacio", "family_name": "Franco", "institution": null}, {"given_name": "Michael", "family_name": "Cohen", "institution": null}]}