{"title": "Markov Processes on Curves for Automatic Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 751, "page_last": 760, "abstract": null, "full_text": "Markov processes on curves for \nautomatic speech recognition \n\nLawrence Saul and Mazin Rahim \n\nAT&T Labs - Research \n\nShannon Laboratory \n180 Park Ave E-171 \n\nFlorham Park, NJ 07932 \n\n{lsaul,rnazin}Gresearch.att.com \n\nAbstract \n\nWe investigate a probabilistic framework for automatic speech \nrecognition based on the intrinsic geometric properties of curves. \nIn particular, we analyze the setting in which two variables-one \ncontinuous (~), one discrete (s )-evolve jointly in time. We sup(cid:173)\npose that the vector ~ traces out a smooth multidimensional curve \nand that the variable s evolves stochastically as a function of the \narc length traversed along this curve. Since arc length does not \ndepend on the rate at which a curve is traversed, this gives rise \nto a family of Markov processes whose predictions, Pr[sl~]' are \ninvariant to nonlinear warpings of time. We describe the use of \nsuch models, known as Markov processes on curves (MPCs), for \nautomatic speech recognition, where ~ are acoustic feature trajec(cid:173)\ntories and s are phonetic transcriptions. On two tasks-recognizing \nNew Jersey town names and connected alpha-digits- we find that \nMPCs yield lower word error rates than comparably trained hidden \nMarkov models. \n\n1 \n\nIntroduction \n\nVariations in speaking rate currently present a serious challenge for automatic \nspeech recognition (ASR) (Siegler & Stern, 1995). It is widely observed, for example, \nthat fast speech is more prone to recognition errors than slow speech. A related ef(cid:173)\nfect, occurring at the phoneme level, is that consonants (l,re more frequently botched \nthan vowels. Generally speaking, consonants have short-lived, non-stationary acous(cid:173)\ntic signatures; vowels, just the opposite. 
Thus, at the phoneme level, we can view the increased confusability of consonants as a consequence of locally fast speech.

[Figure: a smooth curve x(t) running from a START point at t=0 to an END point at t=τ, with the first segment labeled s(t) = s_1.]

Figure 1: Two variables-one continuous (x), one discrete (s)-evolve jointly in time. The trace of s partitions the curve of x into different segments whose boundaries occur where s changes value.

In this paper, we investigate a probabilistic framework for ASR that models variations in speaking rate as arising from nonlinear warpings of time (Tishby, 1990). Our framework is based on the observation that acoustic feature vectors trace out continuous trajectories (Ostendorf et al, 1996). We view these trajectories as multidimensional curves whose intrinsic geometric properties (such as arc length or radius) do not depend on the rate at which they are traversed (do Carmo, 1976). We describe a probabilistic model whose predictions are based on these intrinsic geometric properties and-as such-are invariant to nonlinear warpings of time. The handling of this invariance distinguishes our methods from traditional hidden Markov models (HMMs) (Rabiner & Juang, 1993).

The probabilistic models studied in this paper are known as Markov processes on curves (MPCs). The theoretical framework for MPCs was introduced in an earlier paper (Saul, 1997), which also discussed the problems of decoding and parameter estimation. In the present work, we report the first experimental results for MPCs on two difficult benchmark problems in ASR. On these problems-recognizing New Jersey town names and connected alpha-digits-our results show that MPCs generally match or exceed the performance of comparably trained HMMs.

The organization of this paper is as follows. In section 2, we review the basic elements of MPCs and discuss important differences between MPCs and HMMs.
In section 3, we present our experimental results and evaluate their significance.

2 Markov processes on curves

Speech recognizers take a continuous acoustic signal as input and return a sequence of discrete labels representing phonemes, syllables, or words as output. Typically the short-time properties of the speech signal are summarized by acoustic feature vectors. Thus the abstract mathematical problem is to describe a multidimensional trajectory {x(t) | t ∈ [0, T]} by a sequence of discrete labels s_1 s_2 ... s_n. As shown in figure 1, this is done by specifying consecutive time intervals such that s(t) = s_k for t ∈ [t_{k-1}, t_k] and attaching the labels s_k to contiguous arcs along the trajectory. To formulate a probabilistic model of this process, we consider two variables-one continuous (x), one discrete (s)-that evolve jointly in time. Thus the vector x traces out a smooth multidimensional curve, to each point of which the variable s attaches a discrete label.

Markov processes on curves are based on the concept of arc length. After reviewing how to compute arc lengths along curves, we introduce a family of Markov processes whose predictions are invariant to nonlinear warpings of time. We then consider the ways in which these processes (and various generalizations) differ from HMMs.

2.1 Arc length

Let g(x) define a D x D matrix-valued function over x ∈ R^D. If g(x) is everywhere non-negative definite, then we can use it as a metric to compute distances along curves. In particular, consider two nearby points separated by the infinitesimal vector dx. We define the squared distance between these two points as:

dL^2 = dx^T g(x) dx.   (1)

Arc length along a curve is the non-decreasing function computed by integrating these local distances. Thus, for the trajectory x(t), the arc length between the points x(t_1)
and x(t_2) is given by:

ℓ = ∫_{t_1}^{t_2} dt [ẋ^T g(x) ẋ]^{1/2},   (2)

where ẋ = d/dt [x(t)] denotes the time derivative of x. Note that the arc length defined by eq. (2) is invariant under reparameterizations of the trajectory, x(t) → x(f(t)), where f(t) is any smooth monotonic function of time that maps the interval [t_1, t_2] into itself.

In the special case where g(x) is the identity matrix, eq. (2) reduces to the standard definition of arc length in Euclidean space. More generally, however, eq. (1) defines a non-Euclidean metric for computing arc lengths. Thus, for example, if the metric g(x) varies as a function of x, then eq. (2) can assign different arc lengths to the trajectories x(t) and x(t) + x_0, where x_0 is a constant displacement.

2.2 States and lifelengths

We now return to the problem of segmentation, as illustrated in figure 1. We refer to the possible values of s as states. MPCs are conditional random processes that evolve the state variable s stochastically as a function of the arc length traversed along the curve of x. In MPCs, the probability of remaining in a particular state decays exponentially with the cumulative arc length traversed in that state. The signature of a state is the particular way in which it computes arc length.

To formalize this idea, we associate with each state i the following quantities: (i) a feature-dependent matrix g_i(x) that can be used to compute arc lengths, as in eq. (2); (ii) a decay parameter λ_i that measures the probability per unit arc length that s makes a transition from state i to some other state; and (iii) a set of transition probabilities a_ij, where a_ij represents the probability that-having decayed out of state i-the variable s makes a transition to state j. Thus, a_ij defines a stochastic transition matrix with zero elements along the diagonal and rows that sum to one: a_ii = 0 and Σ_j a_ij = 1.
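The rate invariance of eq. (2) is easy to check numerically. The following is a minimal sketch (Python with NumPy; the function name and the midpoint evaluation of the metric are our own choices, not from the paper) that discretizes the arc length integral and compares a uniformly traversed curve against a nonlinearly time-warped traversal of the same curve:

```python
import numpy as np

def arc_length(x, g=None):
    """Discretize eq. (2): sum [dx^T g(x) dx]^(1/2) over consecutive samples."""
    total = 0.0
    for k in range(len(x) - 1):
        dx = x[k + 1] - x[k]
        # evaluate the metric at the segment midpoint (identity metric if g is None)
        G = np.eye(len(dx)) if g is None else g(0.5 * (x[k] + x[k + 1]))
        total += np.sqrt(dx @ G @ dx)
    return total

# a unit circle, sampled at a uniform rate and at a warped rate s = t^2
t = np.linspace(0.0, 1.0, 2001)
curve = lambda s: np.stack([np.cos(2 * np.pi * s), np.sin(2 * np.pi * s)], axis=1)
L_uniform = arc_length(curve(t))        # uniform traversal
L_warped = arc_length(curve(t ** 2))    # nonlinear time warp of the same curve
```

Both calls return approximately the circumference 2π, illustrating that arc length, unlike duration, does not depend on the rate at which the curve is traversed.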
A Markov process is defined by the set of differential equations:

dp_i/dt = -λ_i p_i [ẋ^T g_i(x) ẋ]^{1/2} + Σ_{j≠i} λ_j p_j a_ji [ẋ^T g_j(x) ẋ]^{1/2},   (3)

where p_i(t) denotes the (forward) probability that s is in state i at time t, based on its history up to that point in time. The right hand side of eq. (3) consists of two competing terms. The first term computes the probability that s decays out of state i; the second computes the probability that s decays into state i. Both terms are proportional to measures of arc length, making the evolution of p_i along the curve of x invariant to nonlinear warpings of time. The decay parameter, λ_i, controls the typical amount of arc length traversed in state i; it may be viewed as an inverse lifetime or-to be more precise-an inverse lifelength. The entire process is Markovian because the evolution of p_i depends only on quantities available at time t.

2.3 Decoding

Given a trajectory x(t), the Markov process in eq. (3) gives rise to a conditional probability distribution over possible segmentations, s(t). Consider the segmentation in which s(t) takes the value s_k between times t_{k-1} and t_k, and let

ℓ_{s_k} = ∫_{t_{k-1}}^{t_k} dt [ẋ^T g_{s_k}(x) ẋ]^{1/2}   (4)

denote the arc length traversed in state s_k. By integrating eq. (3), one can show that the probability of remaining in state s_k decays exponentially with the arc length ℓ_{s_k}. Thus, the conditional probability of the overall segmentation is given by:

Pr[s, ℓ | x] = Π_{k=1}^{n} λ_{s_k} e^{-λ_{s_k} ℓ_{s_k}} Π_{k=0}^{n} a_{s_k s_{k+1}},   (5)

where we have used s_0 and s_{n+1} to denote the START and END states of the Markov process. The first product in eq. (5) multiplies the probabilities that each segment traverses exactly its observed arc length.
The second product multiplies the probabilities for transitions between states s_k and s_{k+1}. The leading factors of λ_{s_k} are included to normalize each state's distribution over observed arc lengths.

There are many important quantities that can be computed from the distribution, Pr[s | x]. Of particular interest for ASR is the most probable segmentation: s*(x) = argmax_{s,ℓ} {ln Pr[s, ℓ | x]}. As described elsewhere (Saul, 1997), this maximization can be performed by discretizing the time axis and applying a dynamic programming procedure. The resulting algorithm is similar to the Viterbi procedure for maximum likelihood decoding (Rabiner & Juang, 1993).

2.4 Parameter estimation

The parameters {λ_i, a_ij, g_i(x)} in MPCs are estimated from training data to maximize the log-likelihood of target segmentations. In our preliminary experiments with MPCs, we estimated only the metric parameters, g_i(x); the others were assigned the default values λ_i = 1 and a_ij = 1/f_i, where f_i is the fan-out of state i. The metrics g_i(x) were assumed to have the parameterized form:

g_i(x) = σ_i Φ_i(x),   (6)

where σ_i is a positive definite matrix with unit determinant, and Φ_i(x) is a non-negative scalar-valued function of x. For the experiments in this paper, the form of Φ_i(x) was fixed so that the MPCs reduced to HMMs as a special case, as described in the next section. Thus the only learning problem was to estimate the matrix parameters σ_i. This was done using the reestimation formula:

σ_i ← C ∫ dt Φ_i(x(t)) (ẋ ẋ^T) / [ẋ^T σ_i^{-1} ẋ]^{1/2},   (7)

where the integral is over all speech segments belonging to state i, and the constant C is chosen to enforce the determinant constraint |σ_i| = 1. For fixed Φ_i(x), we have shown previously (Saul, 1997) that this iterative update leads to monotonic increases in the log-likelihood.
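As a concrete illustration, the update in eq. (7) can be discretized over sampled trajectories. The sketch below (Python with NumPy) is one reasonable discretization, not the authors' implementation; the function names, the midpoint evaluation of Φ_i, and the finite-difference tangent vectors are our own assumptions. It accumulates the integrand over all frames assigned to state i and then rescales the result to satisfy the unit-determinant constraint:

```python
import numpy as np

def reestimate_sigma(segments, phi_i, sigma_i):
    """One iteration of eq. (7), discretized: accumulate
    phi_i(x) * (dx dx^T) / [dx^T sigma_i^{-1} dx]^(1/2) over all segments
    assigned to state i, then rescale so the result has unit determinant."""
    D = sigma_i.shape[0]
    inv = np.linalg.inv(sigma_i)
    acc = np.zeros((D, D))
    for x in segments:              # x has shape (T, D): one segment's feature frames
        dx = np.diff(x, axis=0)     # finite-difference tangent vectors
        for k in range(len(dx)):
            midpoint = 0.5 * (x[k] + x[k + 1])
            denom = np.sqrt(dx[k] @ inv @ dx[k])
            if denom > 0.0:
                acc += phi_i(midpoint) * np.outer(dx[k], dx[k]) / denom
    # rescaling by the constant C in eq. (7) enforces |sigma_i| = 1
    return acc / np.linalg.det(acc) ** (1.0 / D)

# toy example: one 2-D segment, constant phi_i, sigma_i initialized to the identity
segment = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
sigma_new = reestimate_sigma([segment], lambda x: 1.0, np.eye(2))
```

The returned matrix is symmetric with determinant one by construction; iterating the update (recomputing the inverse each pass) mirrors the monotonic procedure cited from Saul (1997).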
\n\n\fMarkov Processes on Curves for Automatic Speech Recognition \n\n755 \n\n2.5 Relation to HMMs and previous work \n\nThere are several important differences between HMMs and MPCs. HMMs param(cid:173)\neterize joint distributions of the form: Pr[s, z] = Dt Pr[st+1lsd Pr[zt Isd. Thus, \nin HMMs, parameter estimation is directed at learning a synthesis model, Pr[zls]' \nwhile in MPCs, it is directed at learning a segmentation model, Pr[s,flz]. The \ndirection of conditioning on z is a crucial difference. MPCs do not attempt to learn \nanything as ambitious as a joint distribution over acoustic feature trajectories. \\ \n\nHMMs and MPCs also differ in how they weight the speech signal. In HMMs, each \nstate contributes an amount to the overall log-likelihood that grows in proportion \nto its duration in time. In MPCs, on the other hand, each state contributes an \namount that grows in proportion to its arc length. Naturally, the weighting by arc \nlength attaches a more important role to short-lived but non-stationary phonemes, \nsuch as consonants. It also guarantees the invariance to nonlinear warpings of time \n(to which the predictions of HMMs are quite sensitive). \n\nIn terms of previous work,\\mr motivation for MPCs resembles that of Tishby (1990), \nwho several years ago proposed a dynamical systems approach to speech processing. \nBecause MPCs exploit the continuity of acoustic feature trajectories, they also bear \nsome resemblance to so-called segmental HMMs (Ostendorf et aI, 1996). MPCs \nnevertheless differ from segmental HMMs in two important respects: the invariance \nto nonlinear warpings of time , and the emphasis on learning a segmentation model \nPr[s, flz], as opposed to a synthesis model, Pr[xls]. \n\nFinally, we note that admitting a slight generalization in the concept of arc length, \nwe can essentially realize HMMs as a special case of MPCs. 
This is done by computing arc lengths along the spacetime trajectories z(t) = {x(t), t}-that is to say, replacing eq. (1) by dL^2 = [ż^T g(z) ż] dt^2, where ż = {ẋ, 1} and g(z) is a spacetime metric. This relaxes the invariance to nonlinear warpings of time and incorporates both movement in acoustic feature space and duration in time as measures of phonemic evolution. Moreover, in this setting, one can mimic the predictions of HMMs by setting the σ_i matrices to have only one non-zero element (namely, the diagonal element for delta-time contributions to the arc length) and by defining the functions Φ_i(x) in terms of HMM emission probabilities P(x|i) as:

Φ_i(x) = -ln [ P(x|i) / Σ_k P(x|k) ].   (8)

This relation is important because it allows us to initialize the parameters of an MPC by those of a continuous-density HMM. This initialization was used in all the experiments reported below.

3 Automatic speech recognition

Both HMMs and MPCs were used to build connected speech recognizers. Training and test data came from speaker-independent databases of telephone speech. All data was digitized at the caller's local switch and transmitted in this form to the receiver. For feature extraction, input telephone signals (sampled at 8 kHz and band-limited between 100-3800 Hz) were pre-emphasized and blocked into 30ms frames with a frame shift of 10ms. Each frame was Hamming windowed, autocorrelated, and processed by LPC cepstral analysis to produce a vector of 12 liftered cepstral coefficients (Rabiner & Juang, 1993). The feature vector was then augmented by its normalized log energy value, as well as temporal derivatives of first and second order. Overall, each frame of speech was described by 39 features. These features were used differently by HMMs and MPCs, as described below.

Table 1: Word error rates for HMMs (dashed) and MPCs (solid) on the task of recognizing NJ town names. The table shows the error rates versus the number of mixture components; the graph, versus the number of parameters per hidden state.

Mixtures | HMM (%) | MPC (%)
2        | 22.3    | 20.9
4        | 18.9    | 17.5
8        | 16.5    | 15.1
16       | 14.6    | 13.3
32       | 13.5    | 12.3
64       | 11.7    | 11.4

Recognizers were evaluated on two tasks. The first task was recognizing New Jersey town names (e.g., Newark). The training data for this task (Sachs et al, 1994) consisted of 12100 short phrases, spoken in the seven major dialects of American English. These phrases, ranging from two to four words in length, were selected to provide maximum phonetic coverage. The test data consisted of 2426 isolated utterances of 1219 New Jersey town names and was collected from nearly 100 speakers. Note that the training and test data for this task have non-overlapping vocabularies.

Baseline recognizers were built using 43 left-to-right continuous-density HMMs, each corresponding to a context-independent English phone. Phones were modeled by three-state HMMs, with the exception of background noise, which was modeled by a single state. State emission probabilities were computed by mixtures of Gaussians with diagonal covariance matrices. Different sized models were trained using M = 2, 4, 8, 16, 32, and 64 mixture components per hidden state; for a particular model, the number of mixture components was the same across all states. Parameter estimation was handled by a Viterbi implementation of the Baum-Welch algorithm.

MPC recognizers were built using the same overall grammar. Each hidden state in the MPCs was assigned a metric g_i(x) = σ_i Φ_i(x).
The functions Φ_i(x) were initialized (and fixed) by the state emission probabilities of the HMMs, as given by eq. (8). The matrices σ_i were estimated by iterating eq. (7). We computed arc lengths along the 14 dimensional spacetime trajectories through cepstra, log-energy, and time. Thus each σ_i was a 14 x 14 symmetric matrix applied to tangent vectors consisting of delta-cepstra, delta-log-energy, and delta-time.

The table in figure 1 shows the results of these experiments comparing MPCs to HMMs. For various model sizes (as measured by the number of mixture components), we found the MPCs to yield consistently lower error rates than the HMMs. The graph in figure 1 plots these word error rates versus the number of modeling parameters per hidden state. This graph shows that the MPCs are not outperforming the HMMs merely because they have extra modeling parameters (i.e., the σ_i matrices). The beam widths for the decoding procedures in these experiments were chosen so that corresponding recognizers activated roughly equal numbers of arcs.

The second task in our experiments involved the recognition of connected alpha-digits (e.g., N Z 3 V J 4 E 3 U 2). The training and test data consisted of 14622 and 7255 utterances, respectively. Recognizers were built from 285 sub-word HMMs/MPCs, each corresponding to a context-dependent English phone. The recognizers were trained and evaluated in the same way as the previous task. Results are shown in figure 2.

Figure 2: Word error rates for HMMs and MPCs on the task of recognizing connected alpha-digits. The table shows the error rates versus the number of mixture components; the graph, versus the number of parameters per hidden state.

Mixtures | HMM (%) | MPC (%)
2        | 12.5    | 10.0
4        | 10.7    | 8.8
8        | 10.0    | 8.2

While these results demonstrate the viability of MPCs for automatic speech recognition, several issues require further attention. The most important issues are feature selection-how to define meaningful acoustic trajectories from the raw speech signal-and learning-how to parameterize and estimate the hidden state metrics g_i(x) from sampled trajectories {x(t)}. These issues and others will be studied in future work.

References

M. P. do Carmo (1976). Differential Geometry of Curves and Surfaces. Prentice Hall.

M. Ostendorf, V. Digalakis, and O. Kimball (1996). From HMMs to segment models: a unified view of stochastic modeling for speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 4:360-378.

L. Rabiner and B. Juang (1993). Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ.

R. Sachs, M. Tikijian, and E. Roskos (1994). United States English subword speech data. AT&T unpublished report.

L. Saul (1998). Automatic segmentation of continuous trajectories with invariance to nonlinear warpings of time. In Proceedings of the Fifteenth International Conference on Machine Learning, 506-514.

M. A. Siegler and R. M. Stern (1995). On the effects of speech rate in large vocabulary speech recognition systems. In Proceedings of the 1995 IEEE International Conference on Acoustics, Speech, and Signal Processing, 612-615.

N. Tishby (1990). A dynamical system approach to speech processing. In Proceedings of the 1990 IEEE International Conference on Acoustics, Speech, and Signal Processing, 365-368.
\n\n\f\fPART VII \n\nVISUAL PROCESSING \n\n\f\f", "award": [], "sourceid": 1508, "authors": [{"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Mazin", "family_name": "Rahim", "institution": null}]}*