{"title": "REMAP: Recursive Estimation and Maximization of A Posteriori Probabilities - Application to Transition-Based Connectionist Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 388, "page_last": 394, "abstract": null, "full_text": "REMAP: Recursive Estimation and \n\nMaximization of A Posteriori \nProbabilities - Application to \n\nTransition-Based Connectionist Speech \n\nRecognition \n\nYochai Konig, Herve Bourlard~ and Nelson Morgan \n\n{konig, bourlard,morgan }@icsi.berkeley.edu \nInternational Computer Science Institute \n\n1947 Center Street Berkeley, CA 94704, USA. \n\nAbstract \n\nIn this paper, we introduce REMAP, an approach for the training \nand estimation of posterior probabilities using a recursive algorithm \nthat is reminiscent of the EM-based Forward-Backward (Liporace \n1982) algorithm for the estimation of sequence likelihoods. Al(cid:173)\nthough very general, the method is developed in the context of a \nstatistical model for transition-based speech recognition using Ar(cid:173)\ntificial Neural Networks (ANN) to generate probabilities for Hid(cid:173)\nden Markov Models (HMMs). In the new approach, we use local \nconditional posterior probabilities of transitions to estimate global \nposterior probabilities of word sequences. Although we still use \nANNs to estimate posterior probabilities, the network is trained \nwith targets that are themselves estimates of local posterior proba(cid:173)\nbilities. An initial experimental result shows a significant decrease \nin error-rate in comparison to a baseline system. \n\n1 \n\nINTRODUCTION \n\nThe ultimate goal in speech recognition is to determine the sequence of words that \nhas been uttered. 
Classical pattern recognition theory shows that the best possible system (in the sense of minimum probability of error) is the one that chooses the word sequence with the maximum a posteriori probability (conditioned on the evidence). \n\n* Also affiliated with Faculté Polytechnique de Mons, Mons, Belgium \n\nIf word sequence i is represented by the statistical model M_i, and the evidence (which, for the application reported here, is acoustical) is represented by a sequence X = {x_1, ..., x_n, ..., x_N}, then we wish to choose the sequence that corresponds to the largest P(M_i | X). In (Bourlard & Morgan 1994), summarizing earlier work (such as (Bourlard & Wellekens 1989)), we showed that it was possible to compute the global a posteriori probability P(M | X) of a discriminant form of Hidden Markov Model (Discriminant HMM), M, given a sequence of acoustic vectors X. In Discriminant HMMs, the global a posteriori probability P(M | X) is computed as follows: if \Gamma represents all legal paths (state sequences q_1, q_2, ..., q_N) in M_i, N being the length of the sequence, then \n\nP(M_i | X) = \sum_\Gamma P(M_i, q_1, q_2, ..., q_N | X) \n\nin which q_n represents the specific state hypothesized at time n, from the set Q = {q^1, ..., q^l, ..., q^k, ..., q^K} of all possible HMM states making up all possible models M_i. We can further decompose this into: \n\nP(M_i, q_1, q_2, ..., q_N | X) = P(q_1, q_2, ..., q_N | X) P(M_i | q_1, q_2, ..., q_N, X) \n\nUnder the assumptions stated in (Bourlard & Morgan 1994) we can compute \n\nP(q_1, q_2, ..., q_N | X) = \prod_{n=1}^{N} p(q_n | q_{n-1}, x_n) \n\nThe Discriminant HMM is thus described in terms of conditional transition probabilities p(q_n^l | q_{n-1}^k, x_n), in which q_n^l stands for the specific state q^l of Q hypothesized at time n, and can be schematically represented as in Figure 1. 
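The decomposition above reduces the global posterior to a sum over state sequences of products of local conditional transition probabilities, which can be accumulated frame by frame. The following toy sketch (not the authors' implementation) illustrates the forward-style recursion; the table `local_post[n, k, l]` stands in for the MLP output p(q_n = l | q_{n-1} = k, x_n) and is filled with arbitrary random values:

```python
import numpy as np

# Toy sketch (not the paper's code) of the global posterior
# P(M | X) = sum over state sequences of prod_n p(q_n | q_{n-1}, x_n),
# accumulated with a forward-style recursion. local_post[n, k, l] stands
# in for the MLP output p(q_n = l | q_{n-1} = k, x_n); the values are
# arbitrary random numbers used purely for illustration.

rng = np.random.default_rng(0)
K, N = 3, 4                                           # states, frames
local_post = rng.random((N, K, K))
local_post /= local_post.sum(axis=2, keepdims=True)   # rows sum to 1 over l

def global_posterior(local_post, init, final_states):
    """Sum prod_n p(q_n | q_{n-1}, x_n) over all paths ending in final_states."""
    alpha = np.asarray(init, dtype=float)     # mass in each state before frame 1
    for n in range(local_post.shape[0]):
        alpha = alpha @ local_post[n]         # alpha_l = sum_k alpha_k p(l | k, x_n)
    return float(alpha[list(final_states)].sum())

init = np.full(K, 1.0 / K)
p_all = global_posterior(local_post, init, range(K))    # all paths together
p_model = global_posterior(local_post, init, [K - 1])   # paths ending in one "model" state
assert abs(p_all - 1.0) < 1e-9 and 0.0 < p_model < 1.0
```

Because each local distribution is already normalized over the next state, the total mass over all paths stays 1; restricting the admissible final states carves out the posterior mass of one model.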
\n\n[Figure 1 here: a Discriminant HMM with phone states /k/, /ae/, /t/, whose arcs carry the local conditional transition probabilities P(/k/|/k/, x), P(/ae/|/k/, x), P(/ae/|/ae/, x), P(/t/|/ae/, x), and P(/t/|/t/, x).] \n\nFigure 1: An example Discriminant HMM for the word \"cat\". The variable x refers to a specific acoustic observation x_n at time n. \n\nFinally, given a state sequence, we assume the following approximation: \n\nP(M_i | q_1, q_2, ..., q_N, X) ≈ P(M_i | q_1, q_2, ..., q_N) \n\nWe can estimate the right side of this last equation from a phonological model (in the case that a given state sequence can belong to two different models). All the required (local) conditional transition probabilities p(q_n^l | q_{n-1}^k, x_n) can be estimated by the Multi-Layer Perceptron (MLP) shown in Figure 2. \n\n[Figure 2 here: an MLP whose inputs are the acoustics and a binary (0/1) encoding of the previous state, and whose outputs estimate P(current_state | acoustics, previous_state).] \n\nFigure 2: An MLP that estimates local conditional transition probabilities. \n\nRecent work at ICSI has provided us with further insight into the Discriminant HMM, particularly in light of recent work on transition-based models (Konig & Morgan 1994; Morgan et al. 1994). This new perspective has motivated us to further develop the original Discriminant HMM theory. The new approach uses posterior probabilities at both local and global levels and is more discriminant in nature. In this paper, we introduce the Recursive Estimation and Maximization of A Posteriori Probabilities (REMAP) training algorithm for hybrid HMM/MLP systems. The proposed algorithm models a probability distribution over all possible transitions (from all possible states and for all possible time frames n) rather than picking a single time point as a transition target. Furthermore, the algorithm incrementally increases the posterior probability of the correct model, while reducing the posterior probabilities of all other models. 
Thus, it brings the overall system closer to the optimal Bayes classifier. \n\nA wide range of discriminant approaches to speech recognition have been studied (Katagiri et al. 1991; Bengio et al. 1992; Bourlard et al. 1994). A significant difficulty that has remained in applying these approaches to continuous speech recognition is the requirement to run computationally intensive algorithms on all of the rival sentences. Since this is not generally feasible, compromises must always be made in practice. For instance, estimates for all rival sentences can be derived from a list of the \"N-best\" utterance hypotheses, or by using a fully connected word model composed of all phonemes. \n\n2 REMAP TRAINING OF THE DISCRIMINANT HMM \n\n2.1 MOTIVATIONS \n\nThe Discriminant HMM/MLP theory described above uses transition-based probabilities as the key building block for acoustic recognition. However, it is well known that accurately estimating transitions is a difficult problem (Glass 1988). Due to the inertia of the articulators, the boundaries between phones are blurred and overlapped in continuous speech. In our previous hybrid HMM/MLP system, targets were typically obtained by using a standard forced Viterbi alignment (segmentation). For a transition-based system as defined above, this procedure would thus yield rigid transition targets, which is not realistic. \n\nAnother problem related to the Viterbi-based training of the MLP presented in Figure 2 and used in Discriminant HMMs is the lack of coverage of the input space during training. 
Indeed, during training (based on hard transitions), the MLP only processes inputs consisting of \"correct\" pairs of acoustic vectors and the correct previous state, while in recognition the net should generalize to all possible combinations of acoustic vectors and previous states, since all possible models and transitions will be hypothesized for each acoustic input. For example, some hypothesized inputs may correspond to an impossible condition that has thus never been observed, such as the acoustics of the temporal center of a vowel in combination with a previous state that corresponds to a plosive. It is unfortunately possible that the interpolative capabilities of the network may not be sufficient to give these \"impossible\" pairs a sufficiently low probability during recognition. \n\nOne possible solution to these problems is to use a full MAP algorithm to find transition probabilities at each frame for all possible transitions by a forward-backward-like algorithm (Liporace 1982), taking all possible paths into account. \n\n2.2 PROBLEM FORMULATION \n\nAs described above, global maximum a posteriori training of HMMs should find the optimal parameter set \Theta maximizing \n\n\prod_{j=1}^{J} P(M_j | X_j, \Theta)    (1) \n\nin which M_j represents the Markov model associated with each training utterance X_j, with j = 1, ..., J. \n\nAlthough in principle we could use a generalized back-propagation-like gradient procedure in \Theta to maximize (1) (Bengio et al. 1992), an EM-like algorithm should have better convergence properties, and could preserve the statistical interpretation of the ANN outputs. 
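To make the contrast with a direct gradient procedure concrete, the sketch below shows the kind of retraining step such an EM-like scheme relies on: given fixed probabilistic targets, the network is adjusted to reduce the relative entropy between its outputs and those targets, so the outputs retain their interpretation as probabilities. Everything here is an illustrative assumption, not the paper's system: a single softmax layer stands in for the MLP, and the inputs and targets are random values.

```python
import numpy as np

# Toy sketch of an EM-like retraining step (illustrative assumptions only):
# a single softmax layer stands in for the MLP, and we lower the relative
# entropy between its outputs and fixed probabilistic targets by plain
# gradient descent, keeping the outputs valid probability distributions.

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))        # stand-in input vectors
T = rng.random((50, 4))                  # stand-in probabilistic targets
T /= T.sum(axis=1, keepdims=True)        # each target row sums to 1
W = np.zeros((10, 4))                    # softmax-layer weights

def outputs(X, W):
    z = X @ W
    e = np.exp(z - z.max(axis=1, keepdims=True))   # stable softmax
    return e / e.sum(axis=1, keepdims=True)

def rel_entropy(T, P):
    """Relative entropy (KL divergence) between targets and outputs."""
    return float(np.sum(T * (np.log(T + 1e-12) - np.log(P + 1e-12))))

before = rel_entropy(T, outputs(X, W))
for _ in range(300):                      # gradient descent on the KL objective
    W -= 0.1 * (X.T @ (outputs(X, W) - T)) / len(X)
after = rel_entropy(T, outputs(X, W))
assert after < before                     # the relative entropy was reduced
```

The gradient of the relative entropy with respect to the softmax weights is simply X^T(outputs - targets), which is why probabilistic (soft) targets slot directly into ordinary network training.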
In this case, training of the Discriminant HMM by a global MAP criterion requires a solution to the following problem: given a trained MLP at iteration t providing a parameter set \Theta_t and, consequently, estimates of p(q_n^l | x_n, q_{n-1}^k, \Theta_t), how can we determine new MLP targets that: \n\n1. will be smooth estimates of the conditional transition probabilities q_{n-1}^k -> q_n^l, for all k, l in [1, K] and for all n in [1, N], and \n\n2. when training the MLP for iteration t+1, will lead to new estimates \Theta_{t+1} and p(q_n^l | x_n, q_{n-1}^k, \Theta_{t+1}) that are guaranteed to incrementally increase the global posterior probability P(M_i | X, \Theta)? \n\nIn (Bourlard et al. 1994), we prove that a re-estimate of MLP targets that guarantees convergence to a local maximum of (1) is given by(1): \n\np(q_n^l | x_n, q_{n-1}^k, \Theta_{t+1}) = P(q_n^l | X, q_{n-1}^k, M, \Theta_t)    (2) \n\nwhere we have estimated the left-hand side using a mapping from the previous state and the local acoustic data to the current state, thus making the estimator realizable by an MLP with a local acoustic window.(2) Thus, we will want to estimate the transition probability conditioned on the local data (as MLP targets) by using the transition probability conditioned on all of the data. \n\n(1) In most of the following, we consider only one particular training sequence X associated with one particular model M. It is, however, easy to see that all of our conclusions remain valid for the case of several training sequences X_j, j = 1, ..., J. A simple way to look at the problem is to consider all training sequences as a single training sequence obtained by concatenating all the X_j's with boundary conditions at every possible beginning and ending point. \n\n(2) Note that, as done in our previous hybrid HMM/MLP systems, all conditionals on x_n can be replaced by X_{n-c}^{n+d} = {x_{n-c}, ..., x_n, ..., x_{n+d}} to take some acoustic context into account. \n\nIn (Bourlard et al. 
1994), we further prove that alternating MLP target estimation (the \"estimation\" step) and MLP training (the \"maximization\" step) is guaranteed to incrementally increase (1) over t.(3) The remaining problem is to find an efficient algorithm to express P(q_n^l | X, q_{n-1}^k, M) in terms of p(q_n^l | x_n, q_{n-1}^k) so that the next-iteration targets can be found. We have developed several approaches to this estimation, some of which are described in (Bourlard et al. 1994). Currently, we are implementing this with an efficient recursion that estimates the sum of all possible paths in a model, for every possible transition at each possible time. From these values we can compute the desired targets (2) for network training by \n\nP(q_n^l | X, q_{n-1}^k, M) = P(M, q_n^l, q_{n-1}^k | X) / \sum_{j=1}^{K} P(M, q_n^j, q_{n-1}^k | X)    (3) \n\n2.3 REMAP TRAINING ALGORITHM \n\nThe general scheme of REMAP training of hybrid HMM/MLP systems can be summarized as follows: \n\n1. Start from some initial net providing p(q_n^l | x_n, q_{n-1}^k, \Theta_t), t = 0, for all possible (k, l) pairs.(4) \n\n2. Compute MLP targets P(q_n^l | X_j, q_{n-1}^k, \Theta_t, M_j) according to (3), for all training sentences X_j associated with HMM M_j, for all possible (k, l) state transition pairs in M_j, and for all x_n, n = 1, ..., N in X_j (see next point). \n\n3. For every x_n in the training database, train the MLP to minimize the relative entropy between the outputs and targets; see (Bourlard et al. 1994) for more details. This provides us with a new set of parameters \Theta_t, for t = t + 1. \n\n4. Iterate from 2 until convergence. \n\nThis procedure is thus composed of two steps: an Estimation (E) step, corresponding to step 2 above, and a Maximization (M) step, corresponding to step 3 above. In this regard, it is reminiscent of the Estimation-Maximization (EM) algorithm as discussed in (Dempster et al. 1977). 
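The E step above can be sketched with forward and backward passes over the local transition probabilities. The toy numpy sketch below is not the authors' code: it makes the stated simplifying assumptions that the model constraint is approximated by requiring paths to end in a designated set of final states, and that the local MLP outputs p(q_n = l | q_{n-1} = k, x_n) are random stand-in values.

```python
import numpy as np

def remap_targets(local_post, final_states):
    """Toy E step: smooth targets P(q_n = l | X, q_{n-1} = k, M) from local
    probabilities local_post[n, k, l] = p(q_n = l | q_{n-1} = k, x_n),
    with the model constraint approximated by a set of final states."""
    N, K, _ = local_post.shape
    alpha = np.zeros((N + 1, K))          # forward pass: path mass into each state
    alpha[0] = 1.0 / K
    for n in range(N):
        alpha[n + 1] = alpha[n] @ local_post[n]
    beta = np.zeros((N + 1, K))           # backward pass: mass of legal continuations
    beta[N, list(final_states)] = 1.0
    for n in range(N - 1, -1, -1):
        beta[n] = local_post[n] @ beta[n + 1]
    targets = np.empty((N, K, K))
    for n in range(N):
        # joint path mass through the transition k -> l at frame n,
        # normalized over the current state l, in the spirit of (3)
        joint = alpha[n][:, None] * local_post[n] * beta[n + 1][None, :]
        targets[n] = joint / joint.sum(axis=1, keepdims=True)
    return targets

rng = np.random.default_rng(0)
local_post = rng.random((5, 4, 4))
local_post /= local_post.sum(axis=2, keepdims=True)   # stand-in MLP outputs
targets = remap_targets(local_post, final_states=[3])
assert np.allclose(targets.sum(axis=2), 1.0)          # proper distributions over l
```

Each `targets[n, k]` row is a smooth distribution over current states rather than a hard 0/1 alignment; at the last frame all mass falls on the designated final states, reflecting the path constraint. Retraining the network toward these targets is then the M step.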
However, in the standard EM algorithm, the M step involves the actual maximization of the likelihood function. In a related approach, usually referred to as the Generalized EM (GEM) algorithm, the M step does not actually maximize the likelihood but simply increases it (by using, e.g., a gradient procedure). Similarly, REMAP increases the global posterior function during the M step (in the direction of targets that actually maximize that global function), rather than actually maximizing it. Recently, a similar approach was suggested for mapping input sequences to output sequences (Bengio & Frasconi 1995). \n\n(3) Note here that one \"iteration\" does not stand for one iteration of the MLP training but for one estimation-maximization iteration, for which a complete MLP training will be required. \n\n(4) This can be done, for instance, by training up such a net from a hand-labeled database like TIMIT, or from some initial forward-backward estimator of equivalent local probabilities (usually referred to as \"gamma\" probabilities in the Baum-Welch procedure). \n\nSystem               | Error Rate \nDHMM, pre-REMAP      | 14.9% \n1 REMAP iteration    | 13.6% \n2 REMAP iterations   | 13.2% \n\nTable 1: Training and testing on continuous numbers, no syntax, no durational models. \n\n3 EXPERIMENTS AND RESULTS \n\nFor testing our theory we chose the Numbers'93 corpus, a continuous speech database collected by CSLU at the Oregon Graduate Institute. It consists of numbers spoken naturally over telephone lines on the public-switched network (Cole et al. 1994). The Numbers'93 database consists of 2167 speech files of spoken numbers produced by 1132 callers. We used 877 of these utterances for training and 657 for cross-validation and testing (200 for cross-validation), saving the remaining utterances for final testing purposes. 
There are 36 words in the vocabulary, namely zero, oh, 1, 2, 3, ..., 20, 30, 40, 50, ..., 100, 1000, a, and, dash, hyphen, and double. All our nets have 214 inputs: 153 inputs for the acoustic features, and 61 to represent the previous state (one unit for every possible previous state, one state per phoneme in our case). The acoustic features are combined from 9 frames with 17 features each (RASTA-PLP8 + delta features + delta log gain), computed with an analysis window of 25 ms every 12.5 ms (overlapping windows) and with a sampling rate of 8 kHz. The nets have 200 hidden units and 61 outputs. \n\nOur results are summarized in Table 1. The row entitled \"DHMM, pre-REMAP\" corresponds to a Discriminant HMM using the same training approach, with hard targets determined by the first system, and additional inputs to represent the previous state. The improvement in the recognition rate as a result of the REMAP iterations is significant at p < 0.05. However, all the experiments were done using acoustic information alone. Using our (baseline) hybrid system under equal conditions, i.e., no duration information and no language information, we get 31.6% word error; adding the duration information back, we get 12.4% word error. We are currently experimenting with enforcing minimum duration constraints in our framework. \n\n4 CONCLUSIONS \n\nIn summary: \n\n\u2022 We have a method for MAP training and estimation of sequences. \n\n\u2022 This can be used in a new form of hybrid HMM/MLP. Note that recurrent nets or TDNNs could also be used. As with standard HMM/MLP hybrids, the network is used to estimate local posterior probabilities (though in this case they are conditional transition probabilities, that is, state probabilities conditioned on the acoustic data and the previous state). 
However, in the case of REMAP these nets are trained with probabilistic targets that are themselves estimates of local posterior probabilities. \n\n\u2022 Initial experiments demonstrate a significant reduction in error rate for this process. \n\nAcknowledgments \n\nWe would like to thank Kristine Ma and Su-Lin Wu for their help with the Numbers'93 database. We also thank OGI, and in particular Ron Cole, for providing the database. We gratefully acknowledge the support of the Office of Naval Research, URI No. N00014-92-J-1617 (via UCB), the European Commission via ESPRIT project 20077 (SPRACH), and ICSI and FPMs in general for supporting this work. \n\nReferences \n\nBENGIO, Y., & P. FRASCONI. 1995. An input output HMM architecture. In Advances in Neural Information Processing Systems 7, ed. by G. Tesauro, D. Touretzky, & T. Leen. Cambridge, MA: MIT Press. \n\nBENGIO, Y., R. DE MORI, G. FLAMMIA, & R. KOMPE. 1992. Global optimization of a neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks 3:252-258. \n\nBOURLARD, H., Y. KONIG, & N. MORGAN. 1994. REMAP: Recursive estimation and maximization of a posteriori probabilities, application to transition-based connectionist speech recognition. Technical Report TR-94-064, International Computer Science Institute, Berkeley, CA. \n\nBOURLARD, H., & N. MORGAN. 1994. Connectionist Speech Recognition - A Hybrid Approach. Kluwer Academic Publishers. \n\nBOURLARD, H., & C. J. WELLEKENS. 1989. Links between Markov models and multilayer perceptrons. In Advances in Neural Information Processing Systems 1, ed. by D. J. Touretzky, 502-510. San Mateo, CA: Morgan Kaufmann. \n\nCOLE, R. A., M. FANTY, & T. LANDER. 1994. Telephone speech corpus development at CSLU. In Proceedings Int'l Conference on Spoken Language Processing, Yokohama, Japan. \n\nDEMPSTER, A. P., N. M. LAIRD, & D. B. RUBIN. 1977. 
Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39:1-38. \n\nGLASS, J. R. 1988. Finding Acoustic Regularities in Speech: Applications to Phonetic Recognition. M.I.T. dissertation. \n\nKATAGIRI, S., C. H. LEE, & B. H. JUANG. 1991. New discriminative training algorithms based on the generalized probabilistic descent method. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, ed. by B. H. Juang, S. Y. Kung, & C. A. Kamm, 299-308. \n\nKONIG, Y., & N. MORGAN. 1994. Modeling dynamics in connectionist speech recognition - the time index model. In Proceedings Int'l Conference on Spoken Language Processing, 1523-1526, Yokohama, Japan. \n\nLIPORACE, L. A. 1982. Maximum likelihood estimation for multivariate observations of Markov sources. IEEE Transactions on Information Theory IT-28:729-734. \n\nMORGAN, N., H. BOURLARD, S. GREENBERG, & H. HERMANSKY. 1994. Stochastic perceptual auditory-event-based models for speech recognition. In Proceedings Int'l Conference on Spoken Language Processing, 1943-1946, Yokohama, Japan. \n", "award": [], "sourceid": 1027, "authors": [{"given_name": "Yochai", "family_name": "Konig", "institution": null}, {"given_name": "Herv\u00e9", "family_name": "Bourlard", "institution": null}, {"given_name": "Nelson", "family_name": "Morgan", "institution": null}]}