{"title": "Modeling Acoustic Correlations by Factor Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 749, "page_last": 755, "abstract": null, "full_text": "Modeling acoustic correlations by \n\nfactor analysis \n\nLawrence Saul and Mazin Rahim \n{lsaul.mazin}~research.att.com \n\nAT&T Labs - Research \n\n180 Park Ave, D-130 \n\nFlorham Park, NJ 07932 \n\nAbstract \n\nHidden Markov models (HMMs) for automatic speech recognition \nrely on high dimensional feature vectors to summarize the short(cid:173)\ntime properties of speech. Correlations between features can arise \nwhen the speech signal is non-stationary or corrupted by noise. We \ninvestigate how to model these correlations using factor analysis, \na statistical method for dimensionality reduction . Factor analysis \nuses a small number of parameters to model the covariance struc(cid:173)\nture of high dimensional data. These parameters are estimated \nby an Expectation-Maximization (EM) algorithm that can be em(cid:173)\nbedded in the training procedures for HMMs. We evaluate the \ncombined use of mixture densities and factor analysis in HMMs \nthat recognize alphanumeric strings. Holding the total number of \nparameters fixed, we find that these methods, properly combined, \nyield better models than either method on its own. \n\n1 \n\nIntroduction \n\nHidden Markov models (HMMs) for automatic speech recognition[l] rely on high \ndimensional feature vectors to summarize the short-time, acoustic properties of \nspeech. Though front-ends vary from recognizer to recognizer, the spectral infor(cid:173)\nmation in each frame of speech is typically codified in a feature vector with thirty \nor more dimensions. In most systems, these vectors are conditionally modeled by \nmixtures of Gaussian probability density functions (PDFs). 
In this case, the correlations between different features are represented in two ways[2]: implicitly by the use of two or more mixture components, and explicitly by the non-diagonal elements in each covariance matrix. Naturally, these strategies for modeling correlations, implicit versus explicit, involve tradeoffs in accuracy, speed, and memory. This paper examines these tradeoffs using the statistical method of factor analysis.

The present work is motivated by the following observation. Currently, most HMM-based recognizers do not include any explicit modeling of correlations; that is to say, conditioned on the hidden states, acoustic features are modeled by mixtures of Gaussian PDFs with diagonal covariance matrices. The reasons for this practice are well known. The use of full covariance matrices imposes a heavy computational burden, making it difficult to achieve real-time recognition. Moreover, one rarely has enough data to reliably estimate full covariance matrices. Some of these disadvantages can be overcome by parameter-tying[3], e.g., sharing the covariance matrices across different states or models. But parameter-tying has its own drawbacks: it considerably complicates the training procedure, and it requires some artistry to know which states should and should not be tied.

Unconstrained and diagonal covariance matrices clearly represent two extreme choices for the hidden Markov modeling of speech. The statistical method of factor analysis[4, 5] represents a compromise between these two extremes. The idea behind factor analysis is to map systematic variations of the data into a lower dimensional subspace. This enables one to represent, in a very compact way, the covariance matrices for high dimensional data.
These matrices are expressed in terms of a small number of parameters that model the most significant correlations without incurring much overhead in time or memory. Maximum likelihood estimates of these parameters are obtained by an Expectation-Maximization (EM) algorithm that can be embedded in the training procedures for HMMs.

In this paper we investigate the use of factor analysis in continuous density HMMs. Applying factor analysis at the state and mixture component level[6, 7] results in a powerful form of dimensionality reduction, one tailored to the local properties of speech. Briefly, the organization of this paper is as follows. In section 2, we review the method of factor analysis and describe what makes it attractive for large problems in speech recognition. In section 3, we report experiments on the speaker-independent recognition of connected alpha-digits. Finally, in section 4, we present our conclusions as well as ideas for future research.

2 Factor analysis

Factor analysis is a linear method for dimensionality reduction of Gaussian random variables[4, 5]. Many forms of dimensionality reduction (including those implemented as neural networks) can be understood as variants of factor analysis. There are particularly close ties to methods based on principal components analysis (PCA) and the notion of tangent distance[8]. The combined use of mixture densities and factor analysis, resulting in a non-linear form of dimensionality reduction, was first applied by Hinton et al[6] to the modeling of handwritten digits. The EM procedure for mixtures of factor analyzers was subsequently derived by Ghahramani et al[7]. Below we describe the method of factor analysis for Gaussian random variables, then show how it can be applied to the hidden Markov modeling of speech.

2.1 Gaussian model

Let $x \in \mathcal{R}^D$ denote a high dimensional Gaussian random variable.
For simplicity, we will assume that x has zero mean. If the number of dimensions, D, is very large, it may be prohibitively expensive to estimate, store, multiply, or invert a full covariance matrix. The idea behind factor analysis is to find a subspace of much lower dimension, $f \ll D$, that captures most of the variations in x. To this end, let $z \in \mathcal{R}^f$ denote a low dimensional Gaussian random variable with zero mean and identity covariance matrix:

$$P(z) = (2\pi)^{-f/2}\, e^{-\frac{1}{2} z^T z} \qquad (1)$$

We now imagine that the variable x is generated by a random process in which z is a latent (or hidden) variable; the elements of z are known as the factors. Let $\Lambda$ denote an arbitrary $D \times f$ matrix, and let $\Psi$ denote a diagonal, positive-definite $D \times D$ matrix. We imagine that x is generated by sampling z from eq. (1), computing the D-dimensional vector $\Lambda z$, then adding independent Gaussian noise (with variances $\Psi_{ii}$) to each component of this vector. The matrix $\Lambda$ is known as the factor loading matrix. The relation between x and z is captured by the conditional distribution:

$$P(x|z) = \frac{|\Psi|^{-1/2}}{(2\pi)^{D/2}}\, e^{-\frac{1}{2}(x - \Lambda z)^T \Psi^{-1} (x - \Lambda z)} \qquad (2)$$

The marginal distribution for x is found by integrating out the hidden variable z. The calculation is straightforward because both P(z) and P(x|z) are Gaussian:

$$P(x) = \int dz\, P(x|z)\, P(z) \qquad (3)$$

$$P(x) = \frac{|\Psi + \Lambda\Lambda^T|^{-1/2}}{(2\pi)^{D/2}}\, e^{-\frac{1}{2} x^T (\Psi + \Lambda\Lambda^T)^{-1} x} \qquad (4)$$

From eq. (4), we see that x is normally distributed with mean zero and covariance matrix $\Psi + \Lambda\Lambda^T$. It follows that when the diagonal elements of $\Psi$ are small, most of the variation in x occurs in the subspace spanned by the columns of $\Lambda$. The variances $\Psi_{ii}$ measure the typical size of componentwise fluctuations outside this subspace.

Covariance matrices of the form $\Psi + \Lambda\Lambda^T$ have a number of useful properties.
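As a quick numerical check of eqs. (1)-(4), the sketch below (NumPy; the sizes D, f, and all parameter values are arbitrary illustrative choices, not taken from the paper) draws samples $x = \Lambda z + \text{noise}$ and confirms that their sample covariance approaches $\Psi + \Lambda\Lambda^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, f, N = 10, 2, 200_000             # feature dim, number of factors, samples

Lam = rng.normal(size=(D, f))        # factor loading matrix, D x f
psi = rng.uniform(0.1, 0.5, size=D)  # diagonal of the noise covariance

z = rng.normal(size=(N, f))                             # z ~ N(0, I), eq. (1)
x = z @ Lam.T + rng.normal(size=(N, D)) * np.sqrt(psi)  # add diagonal noise

# By eq. (4), x ~ N(0, Psi + Lam Lam^T); compare with the sample covariance.
emp_cov = (x.T @ x) / N
model_cov = np.diag(psi) + Lam @ Lam.T
print(np.max(np.abs(emp_cov - model_cov)))  # shrinks as N grows
```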
Most importantly, they are expressed in terms of a small number of parameters, namely the $D(f+1)$ non-zero elements of $\Lambda$ and $\Psi$. If $f \ll D$, then storing $\Lambda$ and $\Psi$ requires much less memory than storing a full covariance matrix. Likewise, estimating $\Lambda$ and $\Psi$ also requires much less data than estimating a full covariance matrix. Covariance matrices of this form can be efficiently inverted using the matrix inversion lemma[9],

$$(\Psi + \Lambda\Lambda^T)^{-1} = \Psi^{-1} - \Psi^{-1} \Lambda \big(I + \Lambda^T \Psi^{-1} \Lambda\big)^{-1} \Lambda^T \Psi^{-1} \qquad (5)$$

where I is the $f \times f$ identity matrix. This decomposition also allows one to compute the probability P(x) with only O(fD) multiplies, as opposed to the O(D^2) multiplies that are normally required when the covariance matrix is non-diagonal.

Maximum likelihood estimates of the parameters $\Lambda$ and $\Psi$ are obtained by an EM procedure[4]. Let $\{x_t\}$ denote a sample of data points (with mean zero). The EM procedure is an iterative procedure for maximizing the log-likelihood, $\sum_t \ln P(x_t)$, with $P(x_t)$ given by eq. (4). The E-step of this procedure is to compute:

$$Q(\Lambda', \Psi'; \Lambda, \Psi) = \sum_t \int dz\, P(z|x_t, \Lambda, \Psi) \ln P(z, x_t|\Lambda', \Psi') \qquad (6)$$

The right hand side of eq. (6) depends on $\Lambda$ and $\Psi$ through the statistics[7]:

$$E[z|x_t] = \big[I + \Lambda^T \Psi^{-1} \Lambda\big]^{-1} \Lambda^T \Psi^{-1} x_t \qquad (7)$$

$$E[zz^T|x_t] = \big[I + \Lambda^T \Psi^{-1} \Lambda\big]^{-1} + E[z|x_t]\, E[z^T|x_t] \qquad (8)$$

Here, $E[\cdot|x_t]$ denotes an average with respect to the posterior distribution, $P(z|x_t, \Lambda, \Psi)$. The M-step of the EM algorithm is to maximize the right hand side of eq. (6) with respect to $\Psi'$ and $\Lambda'$. This leads to the iterative updates[7]:

$$\Lambda' = \Big(\sum_t x_t\, E[z^T|x_t]\Big) \Big(\sum_t E[zz^T|x_t]\Big)^{-1} \qquad (9)$$

$$\Psi' = \mathrm{diag}\Big\{\frac{1}{N} \sum_t \big[x_t x_t^T - \Lambda'\, E[z|x_t]\, x_t^T\big]\Big\} \qquad (10)$$

where N is the number of data points, and $\Psi'$ is constrained to be purely diagonal. These updates are guaranteed to converge monotonically to a (possibly local) maximum of the log-likelihood.
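A minimal NumPy sketch of these updates (for a single zero-mean Gaussian on synthetic data; all names and sizes are illustrative, not the authors' code) follows eqs. (7)-(10), inverting only the small f x f matrix as in eq. (5):

```python
import numpy as np

def em_step(X, Lam, psi):
    """One EM iteration for factor analysis.
    X: (N, D) zero-mean data; Lam: (D, f) loading matrix;
    psi: (D,) diagonal of the noise covariance."""
    N, f = X.shape[0], Lam.shape[1]
    # E-step, eqs. (7)-(8): G = (I + Lam^T Psi^-1 Lam)^-1 is the
    # posterior covariance of z; only an f x f matrix is inverted.
    PsiInvLam = Lam / psi[:, None]                  # Psi^-1 Lam (Psi diagonal)
    G = np.linalg.inv(np.eye(f) + Lam.T @ PsiInvLam)
    Ez = X @ PsiInvLam @ G                          # rows: E[z|x_t]
    sum_Ezz = N * G + Ez.T @ Ez                     # sum_t E[z z^T|x_t]
    # M-step, eqs. (9)-(10).
    Lam_new = (X.T @ Ez) @ np.linalg.inv(sum_Ezz)
    psi_new = np.mean(X * X - (Ez @ Lam_new.T) * X, axis=0)
    return Lam_new, np.maximum(psi_new, 1e-8)       # guard against underflow

def log_likelihood(X, Lam, psi):
    """sum_t ln P(x_t) under the marginal of eq. (4)."""
    N, D = X.shape
    C = np.diag(psi) + Lam @ Lam.T
    _, logdet = np.linalg.slogdet(C)
    mahal = np.einsum('nd,de,ne->', X, np.linalg.inv(C), X)
    return -0.5 * (N * (D * np.log(2 * np.pi) + logdet) + mahal)

# Synthetic demo: the log-likelihood increases monotonically under EM.
rng = np.random.default_rng(1)
D, f, N = 8, 2, 5000
true_Lam = rng.normal(size=(D, f))
true_psi = rng.uniform(0.2, 0.6, size=D)
X = rng.normal(size=(N, f)) @ true_Lam.T \
    + rng.normal(size=(N, D)) * np.sqrt(true_psi)

Lam, psi = rng.normal(size=(D, f)), np.ones(D)
lls = []
for _ in range(50):
    lls.append(log_likelihood(X, Lam, psi))
    Lam, psi = em_step(X, Lam, psi)
lls.append(log_likelihood(X, Lam, psi))
print(lls[0], lls[-1])   # likelihood improves across iterations
```

Note that nothing larger than an f x f system is ever solved, which is what makes the method attractive when D is in the dozens and f is small.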
2.2 Hidden Markov modeling of speech

Consider a continuous density HMM whose feature vectors, conditioned on the hidden states, are modeled by mixtures of Gaussian PDFs. If the dimensionality of the feature space is very large, we can make use of the parameterization in eq. (4). Each mixture component thus obtains its own means, variances, and factor loading matrix. Taken together, these amount to a total of $C(f+2)D$ parameters per mixture model, where C is the number of mixture components, f the number of factors, and D the dimensionality of the feature space. Note that these models capture feature correlations in two ways: implicitly, by using two or more mixture components, and explicitly, by using one or more factors. Intuitively, one expects the mixture components to model discrete types of variability (e.g., whether the speaker is male or female), and the factors to model continuous types of variability (e.g., due to coarticulation or noise). Both types of variability are important for building accurate models of speech.

It is straightforward to integrate the EM algorithm for factor analysis into the training of HMMs. Suppose that $S = \{x_t\}$ represents a sequence of acoustic vectors. The forward-backward procedure enables one to compute the posterior probability, $\gamma_t^{sc} = P(s_t = s, c_t = c\,|\,S)$, that the HMM used state s and mixture component c at time t. The updates for the matrices $\Lambda^{sc}$ and $\Psi^{sc}$ (within each state and mixture component) have essentially the same form as eqs. (9-10), except that now each observation $x_t$ is weighted by the posterior probability, $\gamma_t^{sc}$. Additionally, one must take into account that the mixture components have non-zero means[7]. A complete derivation of these updates (along with many additional details) will be given in a longer version of this paper.

Clearly, an important consideration when applying factor analysis to speech is the choice of acoustic features.
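The posterior-weighted form of the updates described above can be sketched as follows (a hypothetical NumPy helper, not the authors' code: the weights gamma come from the forward-backward procedure, and the component mean mu is held fixed here, whereas the full algorithm would re-estimate it as well):

```python
import numpy as np

def weighted_fa_update(X, gamma, mu, Lam, psi):
    """Posterior-weighted EM update for one state/mixture component.
    X: (T, D) acoustic vectors; gamma: (T,) posterior weights;
    mu: (D,) component mean; Lam: (D, f); psi: (D,) diagonal noise.
    Each statistic in eqs. (9)-(10) is weighted by gamma."""
    Xc = X - mu                               # center on the component mean
    f = Lam.shape[1]
    PsiInvLam = Lam / psi[:, None]
    G = np.linalg.inv(np.eye(f) + Lam.T @ PsiInvLam)
    Ez = Xc @ PsiInvLam @ G                   # E[z|x_t] for each frame
    W = gamma[:, None]                        # per-frame posterior weights
    sum_Ezz = gamma.sum() * G + (W * Ez).T @ Ez
    Lam_new = ((W * Xc).T @ Ez) @ np.linalg.inv(sum_Ezz)
    psi_new = (np.sum(W * (Xc * Xc - (Ez @ Lam_new.T) * Xc), axis=0)
               / gamma.sum())
    return Lam_new, np.maximum(psi_new, 1e-8)

# Toy usage with random frames standing in for acoustic vectors.
rng = np.random.default_rng(2)
T, D, f = 1000, 6, 2
X = rng.normal(size=(T, D))
gamma = rng.uniform(0.1, 1.0, size=T)
Lam, psi = weighted_fa_update(X, gamma, X.mean(axis=0),
                              rng.normal(size=(D, f)), np.ones(D))
```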
A standard choice, and the one we use in our experiments, is a thirty-nine dimensional feature vector that consists of twelve cepstral coefficients (with first and second derivatives) and the normalized log-energy (with first and second derivatives). There are known to be correlations[2] between these features, especially between the different types of coefficients (e.g., cepstrum and delta-cepstrum). While these correlations have motivated our use of factor analysis, it is worth emphasizing that the method applies to arbitrary feature vectors. Indeed, whatever features are used to summarize the short-time properties of speech, one expects correlations to arise from coarticulation, background noise, speaker idiosyncrasies, etc.

3 Experiments

Continuous density HMMs with diagonal and factored covariance matrices were trained to recognize alphanumeric strings (e.g., N Z 3 V J 4 E 3 U 2). Highly confusable letters such as B/V, C/Z, and M/N make this a challenging problem in speech recognition. The training and test data were recorded over a telephone network and consisted of 14622 and 7255 utterances, respectively.

Figure 1: Plots of log-likelihood scores and word error rates on the test set versus the number of parameters per mixture model (divided by the number of features). The stars indicate models with diagonal covariance matrices; the circles indicate models with factor analysis. The dashed lines connect the recognizers in table 2.
Recognizers were built from 285 left-to-right HMMs trained by maximum likelihood estimation; each HMM modeled a context-dependent sub-word unit. Testing was done with a free grammar network (i.e., no grammar constraints). We ran several experiments, varying both the number of mixture components and the number of factors. The goal was to determine the best model of acoustic feature correlations.

Table 1 summarizes the results of these experiments. The columns from left to right show the number of mixture components, the number of factors, the number of parameters per mixture model (divided by the feature dimension), the word error rates (including insertion, deletion, and substitution errors) on the test set, the average log-likelihood per frame of speech on the test set, and the CPU time to recognize twenty test utterances (on an SGI R4000). Not surprisingly, the word accuracies and likelihood scores increase with the number of modeling parameters; likewise, so do the CPU times. The most interesting comparisons are between models with the same number of parameters, e.g., four mixture components with no factors versus two mixture components with two factors. The left graph in figure 1 shows a plot of the average log-likelihood versus the number of parameters per mixture model; the stars and circles in this plot indicate models with diagonal and with factored covariance matrices, respectively. One sees quite clearly from this plot that given a fixed number of parameters, models with non-diagonal (factored) covariance matrices tend to have higher likelihoods. The right graph in figure 1 shows a similar plot of the word error rates versus the number of parameters. Here one does not see much difference; presumably, because HMMs are such poor models of speech to begin with, higher likelihoods do not necessarily translate into lower error rates. We will return to this point later.
It is worth noting that the above experiments used a fixed number of factors per mixture component. In fact, because the variability of speech is highly context-dependent, it makes sense to vary the number of factors, even across states within the same HMM. A simple heuristic is to adjust the number of factors depending on the amount of training data for each state (as determined by an initial segmentation of the training utterances). We found that this heuristic led to more pronounced

C    f    C(f+2)   word error (%)   log-likelihood   CPU time (sec)
1    0    2        16.2             32.9             25
1    1    3        14.6             34.2             30
1    2    4        13.7             34.9             30
1    3    5        13.0             35.3             38
1    4    6        12.5             35.8             39
2    0    4        13.4             34.0             30
2    1    6        12.0             35.1             44
2    2    8        11.4             35.8             48
2    3    10       10.9             36.2             61
2    4    12       10.8             36.6             67
4    0    8        11.5             34.9             46
4    1    12       10.4             35.9             80
4    2    16       10.1             36.5             93
4    3    20       10.0             36.9             132
4    4    24       9.8              37.3             153
8    0    16       10.2             35.6             93
8    1    24       9.7              36.5             179
8    2    32       9.6              37.0             226
16   0    32       9.5              36.2             222

Table 1: Results for different recognizers. The columns indicate the number of mixture components, the number of factors, the number of parameters per mixture model (divided by the number of features), the word error rates and average log-likelihood scores on the test set, and the CPU time to recognize twenty utterances.

C    f    C(f+2)   word error (%)   log-likelihood   CPU time (sec)
1    2    4        12.3
2    2    8        10.5
4    2    16       9.6