{"title": "Context-Dependent Multiple Distribution Phonetic Modeling with MLPs", "book": "Advances in Neural Information Processing Systems", "page_first": 649, "page_last": 657, "abstract": null, "full_text": "Context-Dependent Multiple \n\nDistribution Phonetic Modeling with \n\nMLPs \n\nMichael Cohen \nSRI International \n\nMenlo Park. CA 94025 \n\nHoracio Franco \nSRl International \n\nNelson Morgan \n\nIntI. Computer Science Inst. \n\nBerkeley, CA 94704 \n\nDavid Rumelhart \nStanford University \nStanford, CA 94305 \n\nVictor Abrash \nSRI International \n\nAbstract \n\nA number of hybrid multilayer perceptron (MLP)/hidden Markov \nmodel (HMM:) speech recognition systems have been developed in \nrecent years (Morgan and Bourlard. 1990). In this paper. we present \na new MLP architecture and training algorithm which allows the \nmodeling of context-dependent phonetic classes \nin a hybrid \nMLP/HMM: framework. The new training procedure smooths MLPs \ntrained at different degrees of context dependence in order to obtain \na robust estimate of the cootext-dependent probabilities. Tests with \nthe DARPA Resomce Management database have shown substantial \nadvantages of the context-dependent MLPs over earlier cootext(cid:173)\nindependent MLPs. and have shown substantial advantages of this \nhybrid approach over a pure HMM approach. \n\n1 INTRODUCTION \nBidden Markov models are used in most current state-of-the-art continuous-speech \nrecognition systems. A hidden Markov model (HMM) is a stochastic finite state \nmachine with two sets of probability distributions. Associated with each state is a \nprobability distribution over transitions to next states and a probability distribution \nover output symbols (often referred to as observation probabilities). When applied to \ncontinuous speech. 
the observation probabilities are typically used to model local speech features such as spectra, and the transition probabilities are used to model the displacement of these features through time. HMMs of individual phonetic segments (phones) can be concatenated to model words, and word models can be concatenated, according to a grammar, to model sentences, resulting in a finite state representation of acoustic-phonetic, phonological, and syntactic structure. \nThe HMM approach is limited by the need for strong statistical assumptions that are unlikely to be valid for speech. Previous work by Morgan and Bourlard (1990) has shown both theoretically and practically that some of these limitations can be overcome by using multilayer perceptrons (MLPs) to estimate the HMM state-dependent observation probabilities. In addition to relaxing the restrictive independence assumptions of traditional HMMs, this approach results in a reduction in the number of parameters needed for detailed phonetic modeling as a result of increased sharing of model parameters between phonetic classes. \nRecently, this approach was applied to the SRI-DECIPHER\u2122 system, a state-of-the-art continuous speech recognition system (Cohen et al., 1990), using an MLP to provide estimates of context-independent posterior probabilities of phone classes, which were then converted to HMM context-independent state observation likelihoods using Bayes' rule (Renals et al., 1992). In this paper, we describe refinements of the system to model phonetic classes with a sequence of context-dependent probabilities. \nContext-dependent modeling: The realization of individual phones in continuous speech is highly dependent upon phonetic context. For example, the sound of the vowel /ae/ in the words \"map\" and \"tap\" is different, due to the influence of the preceding phone. 
These context effects are referred to as \"coarticulation\". Experience with HMM technology has shown that using context-dependent phonetic models improves recognition accuracy significantly (Schwartz et al., 1985). This is so because acoustic correlates of coarticulatory effects are explicitly modeled, producing sharper and less overlapping probability density functions for the different phone classes. \nContext-dependent HMMs use different probability distributions for every phone in every different relevant context. This practice causes problems that are due to the reduced amount of data available to train phones in highly specific contexts, resulting in models that are not robust and generalize poorly. The solution to this problem used by many HMM systems is to train models at many different levels of context specificity, including biphone (conditioned only on the phone immediately to the left or right), generalized biphone (conditioned on the broad class of the phone to the left or right), triphone (conditioned on the phone to the left and the right), generalized triphone, and word-specific phone. Models conditioned by more specific contexts are linearly smoothed with more general models. The \"deleted interpolation\" algorithm (Jelinek and Mercer, 1980) provides linear weighting coefficients for the observation probabilities with different degrees of context dependence by maximizing the likelihood of the different models over new, unseen data. This approach cannot be directly extended to MLP-based systems because averaging the weights of two MLPs does not result in an MLP with the average performance. It would be possible to use this approach to average the probabilities that are output from different MLPs; however, since the MLP training algorithm is a discriminant procedure, it would be desirable to use a discriminant or error-based procedure to smooth the MLP probabilities together. 
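The linear smoothing of observation probabilities described above can be illustrated with a minimal sketch. The interpolation weights here are placeholders, not values from this paper; in a real system, deleted interpolation would estimate them by maximizing likelihood on held-out data:

```python
# Sketch of linearly smoothing observation probability estimates trained at
# different degrees of context specificity (specific context -> generalized
# context -> context-independent), as in HMM deleted interpolation.
# The default weights are illustrative only.
def smoothed_prob(p_specific, p_general, p_ci, weights=(0.6, 0.3, 0.1)):
    w1, w2, w3 = weights
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9  # interpolation weights sum to one
    return w1 * p_specific + w2 * p_general + w3 * p_ci
```

For example, with the illustrative weights, smoothed_prob(0.9, 0.6, 0.3) gives 0.75, pulling the sparse-data specific-context estimate toward the better-trained general estimates.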
\nAn earlier approach to context-dependent phonetic modeling with MLPs was proposed by Bourlard et al. (1992). It is based on factoring the context-dependent likelihood and uses a set of binary inputs to the network to specify context classes. The number of parameters and the computational load using this approach are not much greater than those for the original context-independent net. \nThe context-dependent modeling approach we present here uses a different factoring of the desired context-dependent likelihoods, a network architecture that shares the input-to-hidden layer among the context-dependent classes to reduce the number of parameters, and a training procedure that smooths networks with different degrees of context dependence in order to achieve robustness in probability estimates. \nMultidistribution modeling: Experience with HMM-based systems has shown the importance of modeling phonetic units with a sequence of distributions rather than a single distribution. This allows the model to capture some of the dynamics of phonetic segments. The SRI-DECIPHER\u2122 system models most phones with a sequence of three HMM states. Our initial hybrid system used only a single MLP output unit for each HMM phonetic class. This output unit supplied the probability for all the states of the associated phone model. \nOur initial attempt to extend the hybrid system to the modeling of a sequence of distributions for each phone involved increasing the number of output units from 69 (corresponding to phone classes) to 200 (corresponding to the states of the HMM phone models). This resulted in an increase in word-recognition error rate by almost 30%. Experiments at ICSI had a similar result (personal communication). The higher error rate seemed to be due to the discriminative nature of the MLP training algorithm. The new MLP, 
with 200 output units, was attempting to discriminate subphonetic classes corresponding to HMM states. As a result, the MLP was attempting to discriminate into separate classes acoustic vectors that corresponded to the same phone and, in many cases, were very similar but were aligned with different HMM states. There were likely to have been many cases in which almost identical acoustic training vectors were labeled as a positive example in one instance and a negative example in another for the same output class. The appropriate level at which to train discrimination is likely to be the level of the phone (or higher) rather than the subphonetic HMM-state level (to which these output units correspond). The new architecture presented here accomplishes this by training separate output layers for each of the three HMM states, resulting in a network trained to discriminate at the phone level, while allowing three distributions to model each phone. This approach is combined with the context-dependent modeling approach, described in Section 3. \n\n2 HYBRID MLP/HMM \nThe SRI-DECIPHER\u2122 system is a phone-based, speaker-independent, continuous-speech recognition system based on semicontinuous (tied Gaussian mixture) HMMs (Cohen et al., 1990). The system extracts four features from the input speech waveform, including 12th-order mel cepstrum, log energy, and their smoothed derivatives. The front end produces the 26 coefficients for these four features for each 10-ms frame of speech. \nTraining of the phonetic models is based on maximum-likelihood estimation using the forward-backward algorithm (Levinson et al., 1983). Recognition uses the Viterbi algorithm (Levinson et al., 1983) to find the HMM state sequence (corresponding to a sentence) with the highest probability of generating the observed acoustic sequence. 
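The Viterbi search used for recognition can be sketched as follows. This is a generic toy decoder over log probabilities (for numerical stability), not the DECIPHER\u2122 implementation; all names and interfaces are illustrative:

```python
import math

# Toy Viterbi decoder: finds the most likely HMM state sequence for an
# observation sequence. log_init[s] = log P(state s at t=0);
# log_trans[i][j] = log P(state j | state i);
# log_obs[s][t] = log P(y_t | state s).
def viterbi(log_init, log_trans, log_obs):
    n_states = len(log_init)
    n_frames = len(log_obs[0])
    delta = [log_init[s] + log_obs[s][0] for s in range(n_states)]
    backptr = []
    for t in range(1, n_frames):
        prev, delta, ptrs = delta, [], []
        for j in range(n_states):
            best = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            ptrs.append(best)
            delta.append(prev[best] + log_trans[best][j] + log_obs[j][t])
        backptr.append(ptrs)
    # Trace back the best state sequence from the best final state.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for ptrs in reversed(backptr):
        state = ptrs[state]
        path.append(state)
    return path[::-1]
```

For a two-state model whose first state matches the early frames and whose second state matches the last frame, the decoder recovers the expected left-to-right alignment.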
\nThe hybrid MLP/HMM DECIPHER\u2122 system substitutes (scaled) probability estimates computed with MLPs for the tied-mixture HMM state-dependent observation probability densities. No changes are made in the topology of the HMM system. \nThe initial hybrid system used an MLP to compute context-independent phonetic probabilities for the 69 phone classes in the DECIPHER\u2122 system. Separate probabilities were not computed for the different states of phone models. During the Viterbi recognition search, the probability of acoustic vector Yt given the phone class qj, P(Yt | qj), is required for each HMM state. Since MLPs can compute Bayesian posterior probabilities, we compute the required HMM probabilities using \n\nP(Yt | qj) = P(qj | Yt) P(Yt) / P(qj)  (1) \n\nThe factor P(qj | Yt) is the posterior probability of phone class qj given the input vector Yt at time t. This is computed by a backpropagation-trained (Rumelhart et al., 1986) three-layer feed-forward MLP. P(qj) is the prior probability of phone class qj and is estimated by counting class occurrences in the examples used to train the MLP. P(Yt) is common to all states for any given time frame, and can therefore be discarded in the Viterbi computation, since it will not change the optimal state sequence used to get the recognized string. \nThe MLP has an input layer of 234 units, spanning 9 frames (with 26 coefficients for each) of cepstra, delta-cepstra, log-energy, and delta-log-energy that are normalized to have zero mean and unit variance. The hidden layer has 1000 units, and the output layer has 69 units, one for each context-independent phonetic class in the DECIPHER\u2122 system. Both the hidden and output layers consist of sigmoidal units. The MLP is trained to estimate P(qj | Yt), 
where qj is the class associated with the middle frame of the input window. Stochastic gradient descent is used. The training signal is provided by the HMM DECIPHER\u2122 system previously trained by the forward-backward algorithm. Forced Viterbi alignments (alignments to the known word string) for every training sentence provide phone labels, among 69 classes, for every frame of speech. The target distribution is defined as 1 for the index corresponding to the phone class label and 0 for the other classes. A minimum relative entropy between posterior target distribution and posterior output distribution is used as a training criterion. With this training criterion and target distribution, assuming enough parameters in the MLP, enough training data, and that the training does not get stuck in a local minimum, the MLP outputs will approximate the posterior class probabilities P(qj | Yt) (Morgan and Bourlard, 1990). Frame classification on an independent cross-validation set is used to control the learning rate and to decide when to stop training, as in Renals et al. (1992). The initial learning rate is kept constant until cross-validation performance increases less than 0.5%, after which it is reduced as 1/2^n until performance increases no further. \n\n3 CONTEXT-DEPENDENCE \nOur initial implementation of context-dependent MLPs models generalized biphone phonetic categories. We chose a set of eight left and eight right generalized biphone phonetic-context classes, based principally on place of articulation and acoustic characteristics. The context-dependent architecture is shown in Figure 1. A separate output layer (consisting of 69 output units corresponding to 69 context-dependent phonetic classes) is trained for each context. The context-dependent MLP can be viewed as a set of MLPs, one for each context, 
which have the same input-to-hidden weights. Separate sets of context-dependent output layers are used to model context effects in different states of HMM phone models, thereby combining the modeling of multiple phonetic distributions and context dependence. During training and recognition, speech frames aligned with first states of HMM phones are associated with the appropriate left-context output layer, those aligned with last states of HMM phones are associated with the appropriate right-context output layer, and middle states of three-state models are associated with the context-independent output layer. As a result, since the training proceeds (as before) as if each output layer were part of an independent net, the system learns discrimination between the different phonetic classes within an output layer (which now corresponds to a specific context and HMM-state position), but does not learn discrimination between occurrences of the same phone in different contexts or between the different states of the same HMM phone. \n\nFigure 1: Context-Dependent MLP (context-specific output layers sharing 1,000 hidden units over 234 inputs) \n\n3.1 CONTEXT-DEPENDENT FACTORING \nIn a context-dependent HMM, every state is associated with a specific phone class and context. During the Viterbi recognition search, P(Yt | qj, ck) (the probability of acoustic vector Yt given the phone class qj in the context class ck) is required for each state. We compute the required HMM probabilities using \n\nP(Yt | qj, ck) = P(qj | Yt, ck) P(Yt | ck) / P(qj | ck)  (2) \n\nwhere P(Yt | ck) can be factored again as \n\nP(Yt | ck) = P(ck | Yt) P(Yt) / P(ck)  (3) \n\nThe factor P(qj | Yt, ck) is the posterior probability of phone class qj given the input vector Yt and the context class ck. To compute this factor, 
we consider the conditioning on ck in (2) as restricting the set of input vectors only to those produced in the context ck. If M is the number of context classes, this implementation uses a set of M MLPs (all sharing the same input-to-hidden layer) similar to those used in the context-independent case, except that each MLP is trained using only input-output examples obtained from the corresponding context ck. \nEvery context-specific net performs a simpler classification than in the context-independent case because within a context the acoustics corresponding to different phones have less overlap. \nP(ck | Yt) is computed by a second MLP. A three-layer feed-forward MLP is used which has 1000 hidden units and an output unit corresponding to each context class. P(qj | ck) and P(ck) are estimated by counting over the training examples. Finally, P(Yt) is common to all states for any given time frame, and can therefore be discarded in the Viterbi computation, since it will not change the optimal state sequence used to get the recognized string. \n\n3.2 CONTEXT-DEPENDENT TRAINING AND SMOOTHING \nWe use the following method to achieve robust training of context-specific nets: An initial context-independent MLP is trained, as described in Section 2, to estimate the context-independent posterior probabilities over the N phone classes. After the context-independent training converges, the resulting weights are used to initialize the weights going to the context-specific output layers. Context-dependent training proceeds by backpropagating error only from the appropriate output layer for each training example. Otherwise, the training procedure is similar to that for the context-independent net, using stochastic gradient descent and a relative-entropy training criterion. 
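The initialization-and-selective-backpropagation scheme described above can be sketched as follows. This is a minimal illustration, not the actual system: a softmax output and cross-entropy gradient stand in for the paper's sigmoidal outputs with a relative-entropy criterion, and the sizes and learning rate are illustrative:

```python
import numpy as np

# Sketch: each context-specific output layer starts as a copy of the
# context-independent (CI) output weights, and a training example updates
# only the output layer of its own context. The shared input-to-hidden
# weights W_ih stay frozen during context-dependent training.
rng = np.random.default_rng(0)
n_in, n_hid, n_phones, n_contexts = 234, 1000, 69, 8

W_ih = rng.standard_normal((n_hid, n_in)) * 0.01      # frozen after CI training
W_ci = rng.standard_normal((n_phones, n_hid)) * 0.01  # CI output weights
W_cd = [W_ci.copy() for _ in range(n_contexts)]       # one copy per context

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(x, phone_label, context, lr=0.01):
    h = 1.0 / (1.0 + np.exp(-W_ih @ x))   # shared sigmoid hidden layer
    y = softmax(W_cd[context] @ h)        # this context's output layer only
    target = np.zeros(n_phones)
    target[phone_label] = 1.0
    # Cross-entropy gradient; only the selected context's layer moves.
    W_cd[context] -= lr * np.outer(y - target, h)
    return y
```

Because every context layer begins at the CI solution and drifts away only while cross-validation performance improves, the scheme behaves as a nonlinear smoothing between CI and context-dependent parameters.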
Overall classification performance evaluated on an independent cross-validation set is used to determine the learning rate and stopping point. Only hidden-to-output weights are adjusted during context-dependent training. We can view the separate output layers as belonging to independent nets, each one trained on a non-overlapping subset of the original training data. \nEvery context-specific net would asymptotically converge to the context-conditioned posteriors P(qj | Yt, ck) given enough training data and training iterations. As a result of the initialization, the net starts estimating P(qj | Yt), and from that point it follows a trajectory in weight space, incrementally moving away from the context-independent parameters as long as classification performance on the cross-validation set improves. As a result, the net retains useful information from the context-independent initial conditions. In this way, we perform a type of nonlinear smoothing between the pure context-independent parameters and the pure context-dependent parameters. \n\n4 EVALUATION \nTraining and recognition experiments were conducted using the speaker-independent, continuous-speech, DARPA Resource Management database. The vocabulary size is 998 words. Tests were run both with a word-pair (perplexity 60) grammar and with no grammar. The training set for the HMM system and for the MLP consisted of the 3990 sentences that make up the standard DARPA speaker-independent training set for the Resource Management task. The 600 sentences making up the Resource Management February 89 and October 89 test sets were used for cross-validation during both the context-independent and context-dependent MLP training, and for tuning HMM system parameters (e.g., word transition weight). 
\nTable 1: Percent Word Error and Parameter Count with Word-Pair Grammar \n\n         CIMLP  CDMLP    HMM  MIXED \nFeb91      5.~    4.7    3.~    3.2 \nSep92a    10.9    7.6   10.1    7.7 \nSep92b     9.5    6.6    7.0    5.7 \n# Parms   300K  1400K  5500K  6100K \n\nTable 2: Percent Word Error with No Grammar \n\n         CIMLP  CDMLP    HMM  MIXED \nFeb91     24.7   18.4   19.3   15.9 \nSep92a    31.5   27.1   29.2   25.4 \nSep92b    30.9   24.9   26.6   21.5 \n\nTable 1 presents word recognition error and number of system parameters for four different versions of the system, for three different Resource Management test sets using the word-pair grammar. Table 2 presents word recognition error for the corresponding tests with no grammar (the numbers of system parameters are the same as those shown in Table 1). \nComparing the context-independent MLP (CIMLP) to the context-dependent MLP (CDMLP) shows improvements with the CDMLP in all six tests, ranging from a 15% to 30% reduction in word error. The CDMLP system combines multiple-distribution modeling with the context-dependent modeling technique. The CDMLP system performs better than the context-dependent HMM (CDHMM) system in five out of the six tests. \nThe MIXED system uses a weighted mixture of the logs of state observation likelihoods provided by the CIMLP and the CDHMM (Renals et al., 1992). This system shows the best recognition performance so far achieved with the DECIPHER\u2122 system on the Resource Management database. In all six tests, it performs significantly better than the pure CDHMM system. \n\n5 DISCUSSION \nThe results shown in Tables 1 and 2 suggest that MLP estimation of HMM observation likelihoods can improve the performance of standard HMMs. 
These results also suggest that systems that use MLP-based probability estimation make more efficient use of their parameters than standard HMM systems. In standard HMMs, most of the parameters in the system are in the observation distributions associated with the individual states of phone models. MLPs use representations that are more distributed in nature, allowing more sharing of representational resources and better allocation of representational resources based on training. In addition, since MLPs are trained to discriminate between classes, they focus on modeling boundaries between classes rather than class internals. \nOne should keep in mind that the reduction in memory needs that may be attained by replacing HMM distributions with MLP-based estimates must be traded off against increased computational load during both training and recognition. The MLP computations during training and recognition are much larger than the corresponding Gaussian mixture computations for HMM systems. \nThe results also show that the context-dependent modeling approach presented here substantially improves performance over the earlier context-independent MLP. In addition, the context-dependent MLP performed better than the context-dependent HMM in five out of the six tests, although the CDMLP is a far simpler system than the CDHMM, with approximately a factor of four fewer parameters and modeling of only generalized biphone phonetic contexts. The CDHMM uses a range of context-dependent models including generalized and specific biphone, triphone, and word-specific phone. The fact that context-dependent MLPs can perform as well as or better than context-dependent HMMs while using less specific models suggests that they may be more vocabulary-independent, which is useful when porting systems to new tasks. 
\nIn the near future we will test the CDMLP system on new vocabularies. \nThe MLP smoothing approach described here can be extended to the modeling of finer context classes. A hierarchy of context classes can be defined in which each context class at one level is included in a broader class at a higher level. The context-specific MLP at a given level in the hierarchy is initialized with the weights of a previously trained context-specific MLP at the next higher level, and then finer context training can proceed as described in Section 3.2. \nThe distributed representation used by MLPs is exploited in the context-dependent modeling approach by sharing the input-to-hidden layer weights between all context classes. This sharing substantially reduces the number of parameters to train and the amount of computation required during both training and recognition. In addition, we do not adjust the input-to-hidden weights during the context-dependent phase of training, assuming that the features provided by the hidden layer activations are relatively low level and are appropriate for context-dependent as well as context-independent modeling. The large decrease in cross-validation error observed going from context-independent to context-dependent MLPs (30.6% to 21.4%) suggests that the features learned by the hidden layer during the context-independent training phase, combined with the extra modeling power of the context-specific hidden-to-output layers, were adequate to capture the more detailed context-specific phone classes. \nThe best performance shown in Tables 1 and 2 is that of the MIXED system, which combines CIMLP and CDHMM probabilities. 
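The weighted mixture of log likelihoods used by the MIXED system can be sketched as follows. The weight value is illustrative, not from this paper; in practice it would be tuned on a development set:

```python
import math

# Sketch of combining state observation log likelihoods from two estimators
# (e.g., an MLP-based scaled likelihood and an HMM density) with a weighted
# mixture of logs. The weight w is illustrative and would be tuned on
# held-out data.
def mixed_log_likelihood(log_p_mlp, log_p_hmm, w=0.5):
    return w * log_p_mlp + (1.0 - w) * log_p_hmm
```

With w = 0.5 this is the log of the geometric mean of the two likelihoods, so neither estimator can dominate the combined score on its own.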
The CDMLP probabilities can also be combined with CDHMM probabilities; however, we hope that the planned extension of our CDMLP system to finer contexts will lead to a better system than the MIXED system without the need for such mixing, therefore resulting in a simpler system. \nThe context-dependent MLP shown here has more than 1,400,000 weights. We were able to robustly train such a large network by using a cross-validation set to determine when to stop training, sharing many of the weights between context classes, and smoothing context-dependent with context-independent MLPs using the approach described in Section 3.2. In addition, the Ring Array Processor (RAP) special-purpose hardware, developed at ICSI (Morgan et al., 1992), allowed rapid training of such large networks on large data sets. In order to reduce the number of weights in the MLP, we are currently exploring alternative architectures which apply the smoothing techniques described here to binary context inputs. \n\n6 CONCLUSIONS \nMLP-based probability estimation can be useful both for improving recognition accuracy and for reducing memory needs in HMM-based speech recognition systems. These benefits, however, must be weighed against increased computational requirements. \nWe have presented a new MLP architecture and training procedure for modeling context-dependent phonetic classes with a sequence of distributions. Tests using the DARPA Resource Management database have shown improvements in recognition performance using this new approach, modeling only generalized biphone context categories. 
These results suggest that sharing input-to-hidden weights between context categories (and not retraining them during the context-dependent training phase) results in a hidden layer representation which is adequate for context-dependent as well as context-independent modeling; that error-based smoothing of context-independent and context-dependent weights is effective for training a robust model; and that using separate output layers and hidden-to-output weights corresponding to different context classes and different states of HMM phone models is adequate to capture acoustic effects which change throughout the production of individual phonetic segments. \n\nAcknowledgements \nThe work reported here was partially supported by DARPA Contract MDA904-90-C-5253. Discussions with Herve Bourlard were very helpful. \n\nReferences \nH. Bourlard, N. Morgan, C. Wooters, and S. Renals (1992), \"CDNN: A Context Dependent Neural Network for Continuous Speech Recognition,\" ICASSP, pp. 349-352, San Francisco. \nM. Cohen, H. Murveit, J. Bernstein, P. Price, and M. Weintraub (1990), \"The DECIPHER Speech Recognition System,\" ICASSP, pp. 77-80, Albuquerque, New Mexico. \nF. Jelinek and R. Mercer (1980), \"Interpolated estimation of Markov source parameters from sparse data,\" in Pattern Recognition in Practice, E. Gelsema and L. Kanal, Eds. Amsterdam: North-Holland, pp. 381-397. \nS. Levinson, L. Rabiner, and M. Sondhi (1983), \"An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition,\" Bell Syst. Tech. Journal 62, pp. 1035-1074. \nN. Morgan and H. Bourlard (1990), \"Continuous Speech Recognition Using Multilayer Perceptrons with Hidden Markov Models,\" ICASSP, pp. 413-416, Albuquerque, New Mexico. \nN. Morgan, J. Beck, P. Kohn, J. Bilmes, E. Allman, and J. Beer (1992), 
\"The Ring \nArray Processor (RAP): A Multiprocessing Peripheral for Connectionist Applications.\" \nJournal of Parallel and Distributed Computing, pp. 248-259. \nS. Renals, N. Morgan, M. Cohen, and H. Franco (1992), \"Connectionist Probability \n&timation in the DECIPHER Speech Recognition System,\" ICASSP, pp. 601-604. \nSan Francisco. \nD. Rumelhart. G. Hinton. and R. Williams (1986), \"Learning Internal Representations \nby Error Propagation.\" \nin Parallel Distributed Processing: Explorations of the \nMicrostructure of Cognition, vol 1: Foundations. D. Rumelhart & 1. McOelland. Eds. \nCambridge: MIT Press. \nR. Schwartz. Y. Chow. O. Kimball, S. Roucos. M. Krasner. and 1. Makhoul (1985), \nrecognition of continuous \n\"Context-dependent modeling \nspeech.\" ICASSP, pp. 1205-1208. \n\nfor acoustic-phonetic \n\n\f", "award": [], "sourceid": 710, "authors": [{"given_name": "Michael", "family_name": "Cohen", "institution": null}, {"given_name": "Horacio", "family_name": "Franco", "institution": null}, {"given_name": "Nelson", "family_name": "Morgan", "institution": null}, {"given_name": "David", "family_name": "Rumelhart", "institution": null}, {"given_name": "Victor", "family_name": "Abrash", "institution": null}]}