{"title": "Factored Semi-Tied Covariance Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 779, "page_last": 785, "abstract": null, "full_text": "Factored Semi-Tied Covariance Matrices

M.J.F. Gales

Cambridge University Engineering Department
Trumpington Street, Cambridge, CB2 1PZ
United Kingdom
mjfg@eng.cam.ac.uk

Abstract

A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. It is an extension of an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. This paper presents the use of a factored decorrelating transform. Factoring effectively increases the number of possible transforms without increasing the number of free parameters. Maximum likelihood estimation schemes are presented for all the model parameters, including the component/transform assignment, the transforms and the component parameters. The new model form is evaluated on a large vocabulary speech recognition task, where it is shown to reduce the word error rate.

1 Introduction

A standard problem in machine learning is how to efficiently model correlations in multi-dimensional data. Solutions should be efficient both in terms of the number of model parameters and the cost of the likelihood calculation. For speech recognition this is particularly important due to the large number of Gaussian components used, typically tens of thousands, and the relatively high dimensionality of the data, typically 30-60.
The following generative model has been used in speech recognition(1)

    x(tau) = w                                                            (1)
    o(tau) = F [ x(tau) ; v ]                                             (2)

where x(tau) is the underlying speech signal, F is the observation transformation matrix, w is generated by a hidden Markov model (HMM) with a diagonal covariance matrix Gaussian mixture model (GMM) modelling each state(2), and v is usually assumed to be generated by a GMM which is common to all HMMs. This differs from the static linear Gaussian models presented in [7] in two important ways. First, w is generated by either an HMM or a GMM, rather than a simple Gaussian distribution. Second, the "noise" v is now restricted to the null space of the signal x(tau). This type of system can be considered to have two streams. The first stream, the n1 dimensions associated with x(tau), is the set of discriminating, useful, dimensions. The second stream, the n2 dimensions associated with v, is the set of non-discriminating, nuisance, dimensions. Linear discriminant analysis (LDA) and heteroscedastic LDA (HLDA) [5] are both based on this form of generative model. When the dimensionality of the nuisance dimensions is reduced to zero this generative model becomes equivalent to a semi-tied covariance matrix system [3] with a single, global, semi-tied class.

(1) This describes the static version of the generative model. The more general version is obtained by replacing equation 1 by x(tau) = C x(tau - 1) + w.

This generative model has a clear advantage during recognition compared to the standard linear Gaussian models [2]: a reduction in the computational cost of the likelihood calculation. The likelihood for component m may be computed as(3)

    p(o(tau); mu(m), Sigma_diag(m), F) = (1/|det(F)|) l(tau) N([F^-1 o(tau)]_[1]; mu(m), Sigma_diag(m))    (3)

where mu(m) is the n1-dimensional mean and Sigma_diag(m) the diagonal covariance matrix of Gaussian component m. l(tau) is the nuisance-dimension likelihood, which is independent of the component being considered and only needs to be computed once for each time instance. The initial normalisation term is only required during recognition when multiple transforms are used. The dominant cost is a diagonal Gaussian computation for each component, O(n1) per component. In contrast, a scheme such as factor analysis (a covariance modelling scheme from the linear Gaussian model in [7]) has a cost of O(n1^2) per component (assuming there are n1 factors). The disadvantage of this form of generative model is that there is no simple expectation-maximisation (EM) [1] scheme for estimating the model parameters. However, a simple iterative scheme is available [3].

For some tasks, such as speech recognition where there are many different "sounds" to be recognised, it is unlikely that a single transform is sufficient to model the data well. To reflect this there has been some work on using multiple feature-spaces [3, 2]. The standard approach for using multiple transforms is to assign each component, m, to a particular transform, F^(r_m). To simplify the description of the new scheme only modifications to the semi-tied covariance matrix scheme, where the nuisance dimension is zero, are considered. The generative model is modified to be o(tau) = F^(r_m) x(tau), where r_m is the transform class associated with the generating component, m, at time instance tau. The assignment variable, r_m, may either be determined by an "expert", for example using phonetic context information, or it may be assigned in a maximum likelihood (ML) fashion [3].
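The efficient likelihood calculation of equation 3 can be illustrated concretely. The sketch below is my own illustrative code, not from the paper; it uses a single Gaussian for both the signal and nuisance parts, and checks that one shared projection by F^-1 followed by a cheap diagonal Gaussian evaluation, scaled by 1/|det(F)| and the nuisance term l(tau), matches the direct full-covariance density of o(tau).

```python
import numpy as np

def log_gauss_diag(x, mu, var):
    # log N(x; mu, diag(var))
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def log_gauss_full(x, mu, cov):
    # log N(x; mu, cov) for a full covariance matrix
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

rng = np.random.default_rng(0)
n1, n2 = 3, 2                      # signal and nuisance dimensionalities (illustrative)
F = rng.standard_normal((n1 + n2, n1 + n2))
mu = rng.standard_normal(n1)       # component mean mu(m)
var = rng.random(n1) + 0.5         # component diagonal variances
mu_v = rng.standard_normal(n2)     # nuisance mean
var_v = rng.random(n2) + 0.5       # nuisance variances
o = rng.standard_normal(n1 + n2)   # an observation o(tau)

# Equation 3: project once with F^-1, then a cheap O(n1) diagonal evaluation.
x = np.linalg.solve(F, o)
_, logdetF = np.linalg.slogdet(F)          # log |det(F)|
log_l = log_gauss_diag(x[n1:], mu_v, var_v)  # l(tau): shared by all components
log_p_eq3 = -logdetF + log_l + log_gauss_diag(x[:n1], mu, var)

# Direct evaluation: o ~ N(F [mu; mu_v], F diag([var; var_v]) F^T).
cov = F @ np.diag(np.concatenate([var, var_v])) @ F.T
log_p_direct = log_gauss_full(o, F @ np.concatenate([mu, mu_v]), cov)

assert np.isclose(log_p_eq3, log_p_direct)
```

The diagonal evaluation per component is what makes tens of thousands of components affordable; only the projection F^-1 o(tau) and l(tau) are shared work per time instance.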
(2) Although it is not strictly necessary to use diagonal covariance matrices, these currently dominate applications in speech recognition. w could also be generated by a simple GMM.

(3) This paper uses the following convention: capital bold letters refer to matrices, e.g. A, bold letters refer to vectors, e.g. b, and scalars are not bold, e.g. c. When referring to elements of a matrix or vector, subscripts are used: a_i is the ith row of matrix A, a_ij is the element of row i, column j of matrix A, and b_i is element i of vector b. Diagonal matrices are indicated by A_diag. Where multiple streams are used this is indicated, for example, by A_[s]; this is an n_s x n matrix (n is the dimensionality of the feature vector and n_s is the size of stream s). Where subsets of the diagonal matrices are specified the matrices are square, e.g. A_diag[s] is an n_s x n_s square diagonal matrix. A^T is the transpose of the matrix and det(A) is the determinant of the matrix.

Simply increasing the number of transforms increases the number of model parameters to be estimated, hence reducing the robustness of the estimates. There is a corresponding increase in the computational cost during recognition. In the limit there is a single transform per component, the standard full-covariance matrix case. The approach adopted in this paper is to factor the transform into multiple streams. Each component can then use a different transform for each stream. Hence, instead of an assignment variable, an assignment vector is used. In order to maintain the efficient likelihood computation of equation 3, F^(r)-1, rather than F^(r), must be factored into rows. This is a partitioning of the feature space into a set of observation streams.
In common with other factoring schemes this dramatically increases the effective number of transforms from which each component may select, without increasing the number of transform parameters. Though this paper only considers factoring semi-tied covariance matrices, the extension to the "projection" schemes presented in [2] is straightforward.

This paper describes how to estimate the set of transforms and determine which subspaces a particular component should use. The next section describes how to assign components to transforms and, given this assignment, how to estimate the appropriate transforms. Some initial experiments on a large vocabulary speech recognition task are presented in the following section.

2 Factored Semi-Tied Covariance Matrices

In order to factor semi-tied covariance matrices the inverse of the observation transformation for a component is broken into multiple streams. The feature space of each stream is then determined by selecting from an inventory of possible transforms. Consider the case where there are S streams. The effective full covariance matrix of component m, Sigma(m), may be written as Sigma(m) = F^(z(m)) Sigma_diag(m) F^(z(m))T, where the form of F^(z(m)) is restricted so that(4)

    F^(z(m))-1 = [ A_[1]^(z_1(m)) ; A_[2]^(z_2(m)) ; ... ; A_[S]^(z_S(m)) ]    (4)

and z(m) is the S-dimensional assignment vector for component m. The complete set of model parameters, M, consists of the standard model parameters, the component means, variances and weights, and additionally the set of transforms { A_[s]^(1), ..., A_[s]^(Rs) } for each stream s (Rs is the number of transforms associated with stream s) and the assignment vector z(m) for each component. Note that the semi-tied covariance matrix scheme is the case where S = 1. The likelihood is efficiently estimated by storing transformed observations for each stream transform, i.e. A_[s]^(r) o(tau).
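The row factoring of equation 4 can be sketched in a few lines. The code below is illustrative (stream sizes, transform counts and variable names are mine): each stream s holds an inventory of Rs row-blocks A_[s]^(r), a component's inverse transform is assembled by stacking the blocks named by its assignment vector z(m), and with R transforms per stream a component selects among R^S effective transforms while only S*R blocks of parameters are stored.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6                              # feature dimensionality
stream_sizes = [2, 3, 1]           # n_s for each of S = 3 streams (illustrative)
R = 4                              # transforms per stream (Rs = R here)

# Inventory: for stream s, R row-blocks A_[s]^(r), each of shape (n_s, n).
inventory = [[rng.standard_normal((ns, n)) for _ in range(R)]
             for ns in stream_sizes]

def inverse_transform(z):
    """Assemble F^(z)-1 by stacking A_[s]^(z_s) over the streams (equation 4)."""
    return np.vstack([inventory[s][z_s] for s, z_s in enumerate(z)])

z_m = (1, 3, 0)                    # assignment vector z(m) for some component m
F_inv = inverse_transform(z_m)
assert F_inv.shape == (n, n)
assert np.allclose(F_inv[:2], inventory[0][1])  # stream-1 rows come from A_[1]^(1)

# Effective number of selectable transforms versus stored parameter blocks:
effective = R ** len(stream_sizes)   # 4^3 = 64 distinct composed transforms ...
stored = R * len(stream_sizes)       # ... from only 12 stored row-blocks
assert (effective, stored) == (64, 12)
```

This is the sense in which factoring increases the number of possible transforms without increasing the number of free parameters.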
The model parameters are estimated using ML training on a labelled set of training data O = {o(1), ..., o(T)}. The likelihood of the training data may be written as

    p(O|M) = sum_Theta prod_tau ( p(q(tau)|q(tau-1)) sum_{m in theta(tau)} w(m) p(o(tau); mu(m), Sigma_diag(m), A^(z(m))) )    (5)

where Theta is the set of all valid state sequences according to the transcription for the data, q(tau) is the state at time tau of the current path, theta(tau) is the set of Gaussian components belonging to state q(tau), and w(m) is the prior of component m. Directly optimising equation 5 is a very large optimisation task, as there are typically millions of model parameters. Alternatively, as is common in standard HMM training, an EM-based approach is used. The posterior probability of a particular component, m, generating the observation at a given time instance is denoted gamma_m(tau). This may be simply found using the forward-backward algorithm [6] and the old set of model parameters M. The new set of model parameters will be denoted M-hat. The component priors and HMM transition matrices are estimated in the standard fashion [6]. Directly optimising the auxiliary function for the model parameters is computationally expensive [3] and does not allow the embedding of the assignment process. Instead, a simple iterative optimisation scheme is used as follows:

(4) A similar factorisation has also been proposed in [4].

1. Estimate the within-class covariance matrix for each Gaussian component in the system, W(m), using the values of gamma_m(tau). Initialise the set of assignment vectors, {z} = {z(1), ..., z(M)}, and the set of transforms for each stream, {A} = {A_[1]^(1), ..., A_[1]^(R1), ..., A_[S]^(1), ..., A_[S]^(RS)}.

2.
Using the current estimates of the transforms and assignment vectors, obtain the ML estimate of the set of component-specific diagonal covariance matrices, incorporating the appropriate parameter tying as required. This set of parameters will be denoted {Sigma} = {Sigma_diag(1), ..., Sigma_diag(M)}.

3. Estimate the new set of transforms, {A-hat}, using the current set of component covariance matrices {Sigma} and assignment vectors {z}. The auxiliary function at this stage will be written as Q(M, M-hat; {Sigma}, {z}).

4. Update the set of assignment vectors for each component, {z-hat}, given the current set of model transforms, {A-hat}.

5. Goto (2) until convergence, or until an appropriate stopping criterion is satisfied. Otherwise update {Sigma} and the component means using the latest transforms and assignment vectors.

There are three distinct optimisation problems within this task. First, the ML estimate of the set of component-specific diagonal covariance matrices is required. Second, the new set of transforms must be estimated. Finally, the new set of assignment vectors is required. The ML estimation of the component-specific variances (and means) under a transformation is a standard problem, e.g. for the semi-tied case see [3], and is not described further. The ML estimation of the transforms and assignment vectors is described below.

The transforms are estimated in an iterative fashion. The proposed scheme is derived by modifying the standard semi-tied covariance optimisation in [3]. A row-by-row optimisation is used. Considering row i of stream p of transform r, a_[p]i^(r), the auxiliary function may be written as (ignoring constant scalings and elements independent of a_[p]i^(r))

    Q(M, M-hat; {Sigma}, {z}) = sum_m beta(m) log( (c_[p]i^(z(m)) a_[p]i^(z(m))T)^2 ) - sum_{s,r,j} a_[s]j^(r) K^(srj) a_[s]j^(r)T    (6)

where

    K^(srj) = sum_{m: z_s(m)=r} ( sum_tau gamma_m(tau) / sigma_diag[s]j(m)2 ) W(m)

beta(m) = sum_tau gamma_m(tau) is the component occupancy, and c_[p]i^(z(m)) is the cofactor of row i of stream p of transform A^(z(m)). The gradient f_[p]i^(r), differentiating the auxiliary function with respect to a_[p]i^(r), is given by(5)

    f_[p]i^(r) = sum_{m: z_p(m)=r} { 2 beta(m) c_[p]i^(z(m)) / ( c_[p]i^(z(m)) a_[p]i^(r)T ) } - 2 a_[p]i^(r) K^(pri)    (8)

The main cost in computing the gradient is calculating the cofactors for each component. Having computed the gradient, the Hessian may also be simply calculated as

    H_[p]i^(r) = sum_{m: z_p(m)=r} { -2 beta(m) c_[p]i^(z(m))T c_[p]i^(z(m)) / ( c_[p]i^(z(m)) a_[p]i^(r)T )^2 } - 2 K^(pri)    (9)

The Hessian is guaranteed to be negative definite, so the Newton direction must head towards a maximum. At the (t+1)th iteration

    a_[p]i^(r)(t+1) = a_[p]i^(r)(t) - f_[p]i^(r) H_[p]i^(r)-1    (10)

where the gradient and Hessian are based on the tth parameter estimates. In practice this estimation scheme was highly stable.

The assignment for stream s of component m is found using a greedy search technique based on ML estimation. Stream s of component m is assigned using

    z_s(m) = arg max_{r in {1,...,Rs}} { |det(A^(u(srm)))|^2 / |det( diag( A^(u(srm)) W(m) A^(u(srm))T ) )| }    (11)

where the hypothesised assignment vector of factor stream s, u(srm), is given by

    u_j(srm) = r (j = s);  u_j(srm) = z_j(m) (otherwise)    (12)

(5) When the standard semi-tied system is used (i.e. S = 1) the estimation of row i has the closed-form solution

    a_[1]i^(r) = c_[1]i^(r) K^(1ri)-1 sqrt( ( sum_{m: z_1(m)=r} beta(m) ) / ( c_[1]i^(r) K^(1ri)-1 c_[1]i^(r)T ) )    (7)

As the assignment is dependent on the cofactors, which are themselves dependent on the other stream assignments for that component, an iterative scheme is required. In practice this was found to converge rapidly.

3 Results and Discussion

An initial investigation of the use of factored semi-tied covariance matrices was carried out on a large-vocabulary speaker-independent continuous-speech recognition task. The recognition experiments were performed on the 1994 ARPA Hub 1 data (the H1 task), an unlimited vocabulary task. The results were averaged over the development and evaluation data. Note that no tuning on the "development" data was performed. The baseline system used for the recognition task was a gender-independent cross-word-triphone mixture-Gaussian tied-state HMM system. For details of the system see [8]. The total number of phones (counting silence as a separate phone) was 46, from which 6399 distinct context states were formed. The speech was parameterised into a 39-dimensional feature vector. The set of baseline experiments with semi-tied covariance matrices (S = 1) used "expert" knowledge to determine the transform classes. Two sets were used. The first was based on phone-level transforms, where all components of all states from the same phone shared the same class (phone classes). The second used an individual transform per state (state classes). In addition a global transform (global class) and a full-covariance matrix system (comp class) were tested. Two systems were examined: a four-Gaussian-components-per-state system and a twelve-component system. The twelve-component system is the standard system described in [8].
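Returning briefly to the transform estimation of section 2: the closed-form row update of equation 7 for the S = 1 semi-tied case can be checked numerically. The sketch below is my own illustrative code, not from the paper; a random symmetric positive-definite matrix stands in for the accumulated statistics K^(1ri), and the cofactor row is held fixed, as it is during a single row update. It verifies that the closed-form row is a stationary point of the row auxiliary function beta log((c a^T)^2) - a K a^T, i.e. that the gradient of equation 8 vanishes there and a Newton step (equation 10) does not move it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
beta = 25.0                        # total occupancy sum over beta(m) (illustrative)
M_ = rng.standard_normal((n, n))
K = M_ @ M_.T + n * np.eye(n)      # stand-in for K^(1ri): symmetric positive definite
c = rng.standard_normal(n)         # stand-in for the cofactor row c_[1]i^(r)

# Closed-form row update (equation 7): a = c K^-1 sqrt(beta / (c K^-1 c^T)).
cKinv = np.linalg.solve(K, c)
a = cKinv * np.sqrt(beta / (c @ cKinv))

def grad(a):
    # Gradient of equation 8 for fixed cofactors: 2 beta c / (c a^T) - 2 a K.
    return 2 * beta * c / (c @ a) - 2 * a @ K

# The closed form is a stationary point of the row auxiliary function.
assert np.allclose(grad(a), 0.0, atol=1e-8)

# Hessian (equation 9) and one Newton step (equation 10) from the solution:
H = -2 * beta * np.outer(c, c) / (c @ a) ** 2 - 2 * K
a_next = a - grad(a) @ np.linalg.inv(H)
assert np.allclose(a_next, a, atol=1e-6)
```

For S > 1 no such closed form is available because each row's cofactors couple it to the other streams' choices, which is why the paper falls back on the iterated Newton updates.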
In both cases a diagonal covariance matrix system (labelled none) was generated in the standard HTK fashion [9]. These systems were then used to generate the initial alignments to build the semi-tied systems. An additional iteration of Baum-Welch estimation was then performed.

Three forms of assignment training were compared: the previously described expert system and two ML-based schemes, standard and factored. The standard scheme used a single stream (S = 1), which is similar to the scheme described in [3]. The factored scheme used the new approach described in this paper with a separate stream for each of the elements of the feature vector (S = 39).

Table 1: System performance, word error rate (%), on the 1994 ARPA H1 task

    Scheme   Assignment   4-comp   12-comp
    none     -            10.34     9.71
    global   -             9.98     8.84
    phone    expert       10.04     8.87
    state    expert        9.22     8.86
    comp     -             9.20     -
    phone    standard      9.73     8.62
    phone    factored      9.48     8.42

The results of the baseline semi-tied covariance matrix systems are shown in table 1. For the four-component system the full covariance matrix system achieved approximately the same performance as the expert state semi-tied system. Both systems significantly (at the 95% level) outperformed the standard 12-component system (9.71%). The expert phone system shows around a 9% degradation in performance compared to the state system, but used less than a hundredth of the number of transforms (46 versus 6399). Using the standard ML assignment scheme with initial phone classes, S = 1, reduced the error rate of the phone system by around 3% over the expert system. The factored scheme, S = 39, achieved further reductions in error rate. A 5% reduction in word error rate was achieved over the expert system, which is significant at the 95% level.

Table 1 also shows the performance of the twelve-component system.
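The greedy ML assignment of equation 11, which both the standard and factored schemes above rely on, can be sketched as follows. This is illustrative code with my own variable names on a two-stream toy problem: each hypothesised composed transform is scored in the log domain by |det(A)|^2 / |det(diag(A W A^T))|, and streams are revisited until no single-stream change improves the score, mirroring the iterative scheme described in section 2.

```python
import numpy as np

rng = np.random.default_rng(3)
n, ns, R = 4, 2, 3                 # dimensionality, rows per stream, transforms per stream
S = n // ns                        # two streams (illustrative sizes)

# Inventory of row-blocks A_[s]^(r) for each stream.
inventory = [[rng.standard_normal((ns, n)) for _ in range(R)] for _ in range(S)]

def compose(u):
    # Stack the selected row-blocks into a full transform (equation 4).
    return np.vstack([inventory[s][u_s] for s, u_s in enumerate(u)])

def score(A, W):
    # Log of the criterion in equation 11: |det A|^2 / |det diag(A W A^T)|.
    _, logdetA = np.linalg.slogdet(A)
    return 2 * logdetA - np.sum(np.log(np.diag(A @ W @ A.T)))

M_ = rng.standard_normal((n, n))
W = M_ @ M_.T + np.eye(n)          # within-class covariance W(m) of one component

z = [0] * S                        # initial assignment vector z(m)
changed = True
while changed:                     # iterate: scores depend on the other streams' choices
    changed = False
    for s in range(S):
        # Equation 12: hypothesise u = z with stream s replaced by r.
        best = max(range(R),
                   key=lambda r: score(compose([r if j == s else z[j]
                                                for j in range(S)]), W))
        if best != z[s]:
            z[s], changed = best, True

# At convergence no single-stream change can improve the score.
best_score = score(compose(z), W)
for s in range(S):
    for r in range(R):
        u = [r if j == s else z[j] for j in range(S)]
        assert score(compose(u), W) <= best_score + 1e-12
```

In the real system this selection is run per component, weighted by the occupancy counts, and interleaved with the transform re-estimation of steps 3 and 4.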
The use of a global semi-tied transform significantly reduced the error rate, by around 9% relative. Increasing the number of transforms using the expert assignment showed no reduction in error rate. Again, using the phone-level system and training the component-transform assignments, with either the standard or the factored scheme, reduced the word error rate. Using the factored semi-tied transforms (S = 39) significantly reduced the error rate, by around 5%, compared to the expert systems.

4 Conclusions

This paper has presented a new form of semi-tied covariance matrix, the factored semi-tied covariance matrix. The theory for estimating these transforms has been developed and implemented on a large vocabulary speech recognition task. On this task the use of these factored transforms was found to decrease the word error rate by around 5% compared to using a single transform, or multiple transforms, where the assignments are expertly determined. The improvement was significant at the 95% level. In future work the problems of determining the required number of transforms for each of the streams, and of how to determine the appropriate dimensions, will be investigated.

References

[1] A P Dempster, N M Laird, and D B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1-38, 1977.

[2] M J F Gales. Maximum likelihood multiple projection schemes for hidden Markov models. Technical Report CUED/F-INFENG/TR365, Cambridge University, 1999. Available via anonymous ftp from: svr-ftp.eng.cam.ac.uk.

[3] M J F Gales. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Speech and Audio Processing, 7:272-281, 1999.

[4] N K Goel and R Gopinath. Multiple linear transforms. In Proceedings ICASSP, 2001. To appear.

[5] N Kumar.
Investigation of Silicon-Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition. PhD thesis, Johns Hopkins University, 1997.

[6] L R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, February 1989.

[7] S Roweis and Z Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305-345, 1999.

[8] P C Woodland, J J Odell, V Valtchev, and S J Young. The development of the 1994 HTK large vocabulary speech recognition system. In Proceedings ARPA Workshop on Spoken Language Systems Technology, pages 104-109, 1995.

[9] S J Young, J Jansen, J Odell, D Ollason, and P Woodland. The HTK Book (for HTK Version 2.0). Cambridge University, 1996.
", "award": [], "sourceid": 1871, "authors": [{"given_name": "Mark", "family_name": "Gales", "institution": null}]}