{"title": "Time-Warping Network: A Hybrid Framework for Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 151, "page_last": 158, "abstract": null, "full_text": "Time-Warping Network: \n\nA Hybrid Framework for Speech Recognition \n\nEsther Levin \n\nEnrico Bocchieri \n\nRoberto Pieraccini \n\nAT&T Bell Laboratories \n\nSpeech Research Department \n\nMurray Hill, NJ 07974 USA \n\nABSTRACT \n\nRecently, much interest has been generated regarding speech recognition systems based on Hidden Markov Models (HMMs) and neural network (NN) hybrids. Such systems attempt to combine the best features of both models: the temporal structure of HMMs and the discriminative power of neural networks. In this work we define a time-warping (TW) neuron that extends the operation of the formal neuron of a back-propagation network by warping the input pattern to match it optimally to its weights. We show that a single-layer network of TW neurons is equivalent to a Gaussian density HMM-based recognition system, and we propose to improve the discriminative power of this system by using back-propagation discriminative training, and/or by generalizing the structure of the recognizer to a multi-layered net. The performance of the proposed network was evaluated on a highly confusable, isolated-word, multi-speaker recognition task. The results indicate that not only does the recognition performance improve, but the separation between classes is enhanced also, allowing us to set up a rejection criterion to improve the confidence of the system. \n\nI. INTRODUCTION \n\nSince their first application in speech recognition systems in the late seventies, hidden Markov models have been established as a most useful tool, mainly due to their ability to handle the sequential dynamical nature of the speech signal. With the revival of connectionism in the mid-eighties, 
considerable interest arose in applying artificial neural networks to speech recognition. This interest was based on the discriminative power of NNs and their ability to deal with non-explicit knowledge. These two paradigms, namely HMM and NN, inspired by different philosophies, were seen at first as different and competing tools. Recently, links have been established between these two paradigms, aiming at a hybrid framework in which the advantages of the two models can be combined. For example, Bourlard and Wellekens [1] showed that neural networks with proper architecture can be regarded as non-parametric models for computing \"discriminant probabilities\" related to HMM. Bridle [2] introduced \"Alpha-nets\", a recurrent neural architecture that implements the alpha computation of HMM, and found connections between back-propagation [3] training and discriminative HMM parameter estimation. Predictive neural nets were shown to have a statistical interpretation [4], generalizing the conventional hidden Markov model by assuming that the speech signal is generated by nonlinear dynamics contaminated by noise. \n\nIn this work we establish one more link between the two paradigms by introducing the time-warping network (TWN), which is a generalization of both an HMM-based recognizer and a back-propagation net. The basic element of such a network, a time-warping neuron, generalizes the function of a formal neuron by warping the input signal in order to maximize its activation. For a special case of network parameter values, a single-layered network of time-warping (TW) neurons is equivalent to a recognizer based on Gaussian HMMs. This equivalence of the HMM-based recognizer and single-layer TWN suggests ways of using discriminative neural tools to enhance the performance of the recognizer. 
For instance, a training algorithm, like back-propagation, that minimizes a quantity related to the recognition performance can be used to train the recognizer instead of the standard non-discriminative maximum likelihood training. Then, the architecture of the recognizer can be expanded to contain more than one layer of units, enabling the network to form discriminant feature detectors in the hidden layers. \n\nThis paper is organized as follows: in the first part of Section 2 we describe a simple HMM-based recognizer. Then we define the time-warping neuron and show that a single-layer network built with such neurons is equivalent to the HMM recognizer. In Section 3 two methods are proposed to improve the discriminative power of the recognizer, namely, adopting neural training algorithms and extending the structure of the recognizer to a multi-layer net. For special cases of such multi-layer architecture, the net can implement a conventional or weighted [5] HMM recognizer. Results of experiments using a TW network for recognition of the English E-set are presented in Section 4. The results indicate that not only does the recognition performance improve, but the separation between classes is enhanced also, allowing us to set up a rejection criterion to improve the confidence of the system. A summary and discussion of this work are included in Section 5. \n\nII. THE MODEL \n\nIn this section we first describe the basic HMM-based speech recognition system that is used in many applications, including isolated and connected word recognition [6] and large vocabulary subword-based recognition [7]. Though in this paper we treat the case of isolated word recognition, generalization to connected speech can be made as in [6,7]. 
In the second part of this section we define a single-layered time-warping network and show that it is equivalent to the HMM-based recognizer when certain conditions constraining the network parameter values apply. \n\nII.1 THE HIDDEN MARKOV MODEL-BASED RECOGNITION SYSTEM \n\nAn HMM-based recognition system consists of K N-state HMMs, where K is the vocabulary size (number of words or subword units in the defined task). The k-th HMM, O^k, is associated with the k-th word in the vocabulary and is characterized by a matrix A^k = {a^k_ij} of transition probabilities between states, \n\na^k_ij = Pr(s_t = j | s_{t-1} = i), 0 <= i <= N, 1 <= j <= N,   (1) \n\nwhere s_t denotes the active state at time t (s_0 = 0 is a dummy initial state), and by a set of emission probabilities (one per state): \n\nPr(X_t | s_t = i) = (1 / sqrt((2 pi)^d |Sigma^k_i|)) exp[ -(1/2) (X_t - mu^k_i)* (Sigma^k_i)^{-1} (X_t - mu^k_i) ], i = 1, ..., N,   (2) \n\nwhere X_t is the d-dimensional observation vector describing some parametric representation of the t-th frame of the spoken token, and (.)* denotes the transpose operation. \n\nFor the case discussed here, we concentrate on strictly left-to-right HMMs, where a^k_ij != 0 only if j = i or j = i + 1, and on a simplified case of (2) where all Sigma^k_i = I_d, the d-dimensional unit matrix. \n\nThe system recognizes a speech token of duration T, X = {X_1, X_2, ..., X_T}, by classifying the token into the class k_0 with the highest likelihood L^k(X), \n\nk_0 = argmax_{1 <= k <= K} L^k(X).   (3) \n\nThe likelihood L^k(X) is computed for the k-th HMM as \n\nL^k(X) = max_{i_1, ..., i_T} log[ Pr(X | O^k, s_1 = i_1, ..., s_T = i_T) ] = max_{i_1, ..., i_T} sum_{t=1}^{T} [ -(1/2) ||X_t - mu^k_{i_t}||^2 + log a^k_{i_{t-1} i_t} - (d/2) log 2 pi ].   (4) \n\nThe state sequence that maximizes (4) is found by using the Viterbi [8] algorithm. 
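The Viterbi search over this strictly left-to-right topology reduces to a simple dynamic program. The following is a minimal illustrative sketch, not the authors' implementation; it assumes the unit-covariance simplification from the text and drops the constant -(d/2) log 2 pi term of (4), which is identical for all models and does not affect the classification in (3).

```python
import numpy as np

def viterbi_log_likelihood(X, mu, log_stay, log_adv):
    """Likelihood (4) of token X under one strictly left-to-right HMM
    with unit-covariance Gaussian emissions (constant term omitted).

    X        : (T, d) observation vectors
    mu       : (N, d) state mean vectors
    log_stay : (N,)   log a_{i,i}
    log_adv  : (N-1,) log a_{i,i+1}
    """
    T, N = X.shape[0], mu.shape[0]
    # local score -(1/2) ||X_t - mu_i||^2 for every frame/state pair
    local = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    score = np.full(N, -np.inf)
    score[0] = local[0, 0]                     # path must start in state 1
    for t in range(1, T):
        stay = score + log_stay
        adv = np.concatenate(([-np.inf], score[:-1] + log_adv))
        score = np.maximum(stay, adv) + local[t]
    return score[-1]                           # path must end in state N

def recognize(X, models):
    """Pick the class k_0 with the highest likelihood, as in (3)."""
    scores = [viterbi_log_likelihood(X, *m) for m in models]
    return int(np.argmax(scores))
```

Each model is a `(mu, log_stay, log_adv)` triple; `recognize` then implements the argmax of (3) over the K models.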
\nII.2 THE EQUIVALENT SINGLE-LAYER TIME-WARPING NETWORK \n\nA single-layer TW network is composed of K TW neurons, one for each word in the vocabulary. The TW neuron is an extension of a formal neuron that can handle dynamic and temporally distorted patterns. The k-th TW neuron, associated with the k-th vocabulary word, is characterized by a bias w^k_0 and a set of weights W^k = {W^k_1, W^k_2, ..., W^k_N}, where W^k_j is a column vector of dimensionality d+2. Given an input speech token of duration T, X = {X_1, X_2, ..., X_T}, the output activation y^k of the k-th unit is computed as \n\ny^k = g( sum_{t=1}^{T} X'_t . W^k_{i_t} + w^k_0 ) = g( sum_{j=1}^{N} ( sum_{t: i_t = j} X'_t ) . W^k_j + w^k_0 ),   (5) \n\nwhere g(.) is a sigmoidal, smooth, strictly increasing nonlinearity, and X'_t = [X_t*, 1, 1] is a (d+2)-dimensional augmented input vector. The corresponding indices i_t, t = 1, ..., T, are determined by the following condition: \n\n{i_1, ..., i_T} = argmax sum_{t=1}^{T} X'_t . W^k_{i_t} + w^k_0.   (6) \n\nIn other words, a TW neuron warps the input pattern to match it optimally to its weights (6) and computes its output using this warped version of the input (5). The time-warping process of (6) is a distinguishing feature of this neural model, enabling it to deal with the dynamic nature of a speech signal and to handle temporal distortions. All TW neurons in this single-layer net recognizer receive the same input speech token X. Recognition is performed by selecting the word class corresponding to the neuron with the maximal output activation. \n\nIt is easy to show that when \n\nW^k_j = [ [mu^k_j]*, -(1/2) ||mu^k_j||^2, log a^k_{j,j} ],   (7a) \n\nand 
\nw^k_0 = sum_{j=1}^{N} [ log a^k_{j,j-1} - log a^k_{j,j} ],   (7b) \n\nthis network is equivalent to an HMM-based recognition system, with K N-state HMMs, as described above. (1) \n\nThis equivalent neural representation of an HMM-based system suggests ways of improving the discriminative power of the recognizer, while preserving the temporal structure of the HMM, thus allowing generalization to more complicated tasks (e.g., continuous speech, subword units, etc.). \n\nIII. IMPROVING DISCRIMINATION \n\nThere are two important differences between the HMM-based system and a neural net approach to speech recognition that contribute to the improved discrimination power of the latter, namely, training and structure. \n\nIII.1 DISCRIMINATIVE TRAINING \n\nThe HMM parameters are usually estimated by applying the maximum likelihood approach, using only the examples of the word represented by the model and disregarding the rival classes completely. This is a non-discriminative approach: the learning criterion is not directly connected to the improvement of recognition accuracy. Here we propose to enhance the discriminative power of the system by adopting a neural training approach. \n\nNN training algorithms are based on minimizing an error function E, which is related to the performance of the network on the training set of labeled examples, {X^l, Z^l}, l = 1, ..., L, where Z^l = [z^l_1, ..., z^l_K]* denotes the vector of target neural outputs for the l-th input token. Z^l has +1 only in the entry corresponding to the right word class, and -1 elsewhere. Then, \n\nE = sum_{l=1}^{L} E^l(Z^l, Y^l),   (8) \n\nwhere Y^l = [y^l_1, ..., y^l_K]* is a vector of neural output activations for the l-th input token, and E^l(Z^l, Y^l) measures the distortion between the two vectors. One choice for E^l(Z^l, Y^l) is a quadratic error measure, i.e., E^l(Z^l, Y^l) = ||Z^l - Y^l||^2. 
Other choices include the cross-entropy error [9] and the recently proposed discriminative error functions, which measure the misclassification rate more directly [10]. \n\nGradient-based training algorithms (such as back-propagation) modify the parameters of the network after presentation of each training token to minimize the error (8). The change in the j-th weight subvector of the k-th model after presentation of the l-th training token, Delta^l W^k_j, is proportional to the negative derivative of the error E^l with respect to this weight subvector, \n\nDelta^l W^k_j = -alpha dE^l/dW^k_j = -alpha sum_{m=1}^{K} (dE^l/dy^l_m) (dy^l_m/dW^k_j), 1 <= j <= N, 1 <= k <= K,   (9) \n\nwhere alpha > 0 is a step-size, resulting in an updated weight vector W^k_j + Delta^l W^k_j. To compute the terms dy^l_m/dW^k_j we have to consider (5) and (6), which define the operation of the neuron. Equation (6) expresses the dependence of the warping indices i_1, ..., i_T on W^k_j. In the proposed learning rule we compute the gradient for the quadratic error criterion using only (5): \n\nDelta^l W^k_j = alpha (z^l_k - y^l_k) g'(.) sum_{t: i_t = j} X'_t,   (10) \n\nwhere the values of i_t fulfill condition (6). Although the weights do not change according to the exact gradient descent rule (since (6) is not taken into account for back-propagation), we found experimentally that the error made by the network always decreases after the weight update. This fact can also be proved when certain conditions restricting the step-size alpha hold, and we conjecture that it is always true for alpha > 0. \n\n(1) With minor changes we can show equivalence to a general Gaussian HMM, where the covariance matrices are not restricted to be the unit matrix. 
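One training pass of this rule can be sketched as follows. This is a hypothetical illustration, not the original code: the warp (6) is found by the same left-to-right dynamic program used for Viterbi decoding, a logistic sigmoid is assumed for g, and, as in the text, the update (10) differentiates only (5) while holding the warp path fixed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tw_forward(X_aug, W, w0):
    """Eqs. (5)-(6): find the best left-to-right warp i_1..i_T of the
    augmented frames X_aug (T, d+2) onto the weights W (N, d+2),
    then return the activation y and the warp path."""
    T, N = X_aug.shape[0], W.shape[0]
    s = X_aug @ W.T                      # frame/state scores X'_t . W_j
    D = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    D[0, 0] = s[0, 0]                    # warp must start in the first state
    for t in range(1, T):
        for j in range(N):
            stay = j == 0 or D[t-1, j] >= D[t-1, j-1]
            D[t, j] = (D[t-1, j] if stay else D[t-1, j-1]) + s[t, j]
            back[t, j] = j if stay else j - 1
    path = np.zeros(T, dtype=int)
    path[-1] = N - 1                     # warp must end in the last state
    for t in range(T - 1, 0, -1):
        path[t-1] = back[t, path[t]]
    return sigmoid(D[-1, -1] + w0), path

def tw_update(X_aug, W, w0, z, alpha=0.1):
    """Approximate gradient step (10) for one neuron on one token:
    each state's weights move toward the frames warped onto it."""
    y, path = tw_forward(X_aug, W, w0)
    for j in range(W.shape[0]):
        frames = X_aug[path == j]        # frames with i_t = j
        if len(frames):
            W[j] += alpha * (z - y) * y * (1 - y) * frames.sum(0)
    return y
```

The update modifies `W` in place; in a full recognizer this step would be applied to all K neurons for every training token, with z = +1 for the correct class and -1 otherwise.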
\nIII.2 THE STRUCTURE OF THE RECOGNIZER \n\nWhen the equivalent neural representation of the HMM-based recognizer is used, there exists a natural way of adaptively increasing the complexity of the decision boundaries and developing discriminative feature detectors. This can be done by extending the structure of the recognizer to a multi-layered net. There are many possible architectures that result from such an extension by changing the number of hidden layers, as well as the number and the type (i.e., standard or TW) of neurons in the hidden layers. Moreover, the role of the TW neurons in the first hidden layer is different now: they are no longer class representatives, as in a single-layered net, but just abstract computing elements with built-in time scale normalization. In this work we investigate only a simple special case of such multi-layered architecture. The multi-layered network we use has a single hidden layer, with NxK TW neurons. Each hidden neuron corresponds to one state of one of the original HMMs, and is characterized by a weight vector W^k_j and a bias w^k_j. The output activation h^k_j of the neuron is given as \n\nh^k_j = g(u^k_j),   (11) \n\nwhere \n\nu^k_j = sum_{t: i_t = j} X'_t . W^k_j + w^k_j, \n\nand \n\n{i_1, ..., i_T} = argmax sum_{j=1}^{N} u^k_j. \n\nThe output layer is composed of K standard neurons. The activation of the output neurons y^k, k = 1, ..., K, is determined by the hidden layer neurons' activations as \n\ny^k = g(H* V^k + v^k),   (12) \n\nwhere V^k is an NxK dimensional weight vector, H is the vector of hidden neuron activations, and v^k is a bias term. In a special case of parameter values, when the W^k_j satisfy the conditions (7a,b) and \n\nw^k_j = log a^k_{j,j-1} - log a^k_{j,j},   (13) \n\nthe activation h^k_j corresponds to an accumulated j-th state likelihood of the k-th HMM, and the network implements a weighted [5] HMM recognizer where the connection weight vectors V^k determine the relative weights assigned to each state likelihood in the final classification. 
Such a network can learn to adapt these weights to enhance discrimination by giving large positive weights to states that contain information important for discrimination and ignoring (by forming zero or close-to-zero weights) those states that do not contribute to discrimination. A back-propagation algorithm can be used for training this net. \n\nIV. EXPERIMENTAL RESULTS \n\nTo evaluate the effectiveness of the proposed TWN, we conducted several experiments that involved recognition of the highly confusable English E-set (i.e., /b, c, d, e, g, p, t, v, z/). The utterances were collected from 100 speakers, 50 males and 50 females, each speaking every word in the E-set twice, once for training and once for testing. The signal was sampled at 6.67 kHz. We used 12 cepstral and 12 delta-cepstral LPC-derived [11] coefficients to represent each 45 msec frame of the sampled signal. \n\nWe used a baseline conventional HMM-based recognizer to initialize the TW network, and to get a benchmark performance. Each strictly left-to-right HMM in this system has five states, and the observation densities are modeled by four Gaussian mixture components. The recognition rates of this system are 61.7% on the test data, and 80.2% on the training data. \n\nExperiment with single-layer TWN: In this experiment the single-layer TW network was initialized according to (7), using the parameters of the baseline HMMs. The four mixture components of each state were treated as a fully connected set of four states, with transition probabilities that reflect the original transition probabilities and the relative weights of the mixtures. This corresponds to the case in which the local likelihood is computed using the dominant mixture component only. The network was trained using the suggested training algorithm (10), with quadratic error function. 
The recognition rate of the trained network increased to 69.4% on the test set and 93.6% on the training set. \n\nExperiment with multi-layer TWN: In this experiment we used the multi-layer network architecture described in the previous section. The recognition performance of this network after training was 74.4% on the test set and 91% on the training set. \n\nFigures 1, 2, and 3 show the recognition performance of a single-layer TWN initialized by a baseline HMM, the trained single-layer TWN, and the trained multi-layer TWN, respectively. In these figures the activation of the unit representing the correct class is plotted against the activation of the best wrong unit (i.e., the incorrect class with the highest score) for each input utterance. Therefore, the utterances that correspond to the marks above the diagonal line are correctly recognized, and those under it are misclassified. The most interesting observation that can be made from these plots is the striking difference between the multi-layer and the single-layer TWNs. The single-layer TWNs in Figures 1 and 2 (the baseline and the trained) exhibit the same typical behavior, where the utterances are concentrated around the diagonal line. For the multi-layer net, the utterances that were recognized correctly tend to concentrate in the upper part of the graph, having the correct unit activation close to 1.0. This property of a multi-layer net can be used for introducing error rejection criteria: utterances for which the difference between the highest activation and the second highest activation is less than a prescribed threshold are rejected. In Figure 4 we compare the test performance of the multi-layer net and the baseline system, both with such a rejection mechanism, for different values of the rejection threshold. As expected, the multi-layer net outperforms the baseline recognizer, showing a much smaller misclassification rate for the same number of rejections. 
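The rejection rule described above is simple to state in code. A minimal sketch (hypothetical names; the threshold value is task-dependent and would be swept as in Figure 4):

```python
import numpy as np

def classify_with_rejection(activations, threshold):
    """Reject an utterance when the margin between the highest and
    second-highest unit activations is below the prescribed threshold."""
    order = np.argsort(activations)[::-1]          # indices, best first
    margin = activations[order[0]] - activations[order[1]]
    if margin < threshold:
        return None                                # utterance rejected
    return int(order[0])                           # accepted: best class
```

For example, with activations `[0.9, 0.2, 0.1]` a threshold of 0.3 accepts class 0, while a threshold of 0.8 rejects the utterance.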
\nV. SUMMARY AND DISCUSSION \n\nIn this paper we established a hybrid framework for speech recognition, combining the characteristics of hidden Markov models and neural networks. We showed that an HMM-based recognizer has an equivalent representation as a single-layer network composed of time-warping neurons, and proposed to improve the discriminative power of the recognizer by using back-propagation training and by generalizing the structure of the recognizer to a multi-layer net. Several experiments were conducted for testing the performance of the proposed network on a highly confusable vocabulary (the English E-set). The recognition performance on the test set of a single-layer TW net improved from 61% (when initialized with the baseline HMMs) to 69% after training. Expanding the structure of the recognizer by one more layer of neurons, we obtained a further improvement of recognition accuracy, up to 74.4%. Scatter plots of the results indicate that in the multi-layer case there is a qualitative change in the performance of the recognizer, allowing us to set up a rejection criterion to improve the confidence of the system. \n\nReferences \n\n1. H. Bourlard, C.J. Wellekens, \"Links between Markov models and multilayer perceptrons,\" Advances in Neural Information Processing Systems, pp. 502-510, Morgan Kaufmann, 1989. \n2. J.S. Bridle, \"Alpha-nets: a recurrent 'neural' network architecture with a hidden Markov model interpretation,\" Speech Communication, April 1990. \n3. D.E. Rumelhart, G.E. Hinton and R.J. Williams, \"Learning internal representations by error propagation,\" Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, 1986. \n4. E. Levin, \"Word recognition using hidden control neural architecture,\" Proc. of ICASSP, Albuquerque, April 1990. \n5. K.-Y. Su, C.-H. Lee, 
\"Speech Recognition Using Weighted HMM and Subspace Projection Approaches,\" Proc. of ICASSP, Toronto, 1991. \n6. L.R. Rabiner, \"A tutorial on hidden Markov models and selected applications in speech recognition,\" Proc. of IEEE, vol. 77, no. 2, pp. 257-286, February 1989. \n7. C.-H. Lee, L.R. Rabiner, R. Pieraccini, J.G. Wilpon, \"Acoustic Modeling for Large Vocabulary Speech Recognition,\" Computer Speech and Language, no. 4, pp. 127-165, 1990. \n8. G.D. Forney, \"The Viterbi algorithm,\" Proc. IEEE, vol. 61, pp. 268-278, Mar. 1973. \n9. S.A. Solla, E. Levin, M. Fleisher, \"Improved targets for multilayer perceptron learning,\" Neural Networks Journal, 1988. \n10. B.-H. Juang, S. Katagiri, \"Discriminative Learning for Minimum Error Classification,\" IEEE Trans. on SP, to be published. \n11. B.S. Atal, \"Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,\" J. Acoust. Soc. Am., vol. 55, no. 6, pp. 1304-1312, June 1974. \n\nFigure 1: Scatter plot for baseline recognizer \n\nFigure 2: Scatter plot for trained single-layer TWN \n\nFigure 3: Scatter plot for multi-layer TWN \n\n
Figure 4: Rejection performance of baseline recognizer and the multi-layer TWN \n", "award": [], "sourceid": 449, "authors": [{"given_name": "Esther", "family_name": "Levin", "institution": null}, {"given_name": "Roberto", "family_name": "Pieraccini", "institution": null}, {"given_name": "Enrico", "family_name": "Bocchieri", "institution": null}]}