{"title": "Neural Network - Gaussian Mixture Hybrid for Speech Recognition or Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 175, "page_last": 182, "abstract": null, "full_text": "Neural Network - Gaussian Mixture Hybrid for Speech Recognition or Density Estimation \n\nYoshua Bengio \nDept. Brain and Cognitive Sciences \nMassachusetts Institute of Technology \nCambridge, MA 02139 \n\nRenato De Mori \nSchool of Computer Science \nMcGill University \nCanada \n\nGiovanni Flammia \nSpeech Technology Center, \nAalborg University, Denmark \n\nRalf Kompe \nErlangen University, Computer Science \nErlangen, Germany \n\nAbstract \n\nThe subject of this paper is the integration of multi-layered Artificial Neural Networks (ANN) with probability density functions such as the Gaussian mixtures found in continuous density Hidden Markov Models (HMM). In the first part of this paper we present an ANN/HMM hybrid in which all the parameters of the system are simultaneously optimized with respect to a single criterion. In the second part of this paper, we study the relationship between the density of the inputs of the network and the density of the outputs of the network. A few experiments are presented to explore how to perform density estimation with ANNs. \n\n1 INTRODUCTION \n\nThis paper studies the integration of Artificial Neural Networks (ANN) with probability density functions (pdf) such as the Gaussian mixtures often used in continuous density Hidden Markov Models. The ANNs considered here are multi-layered or recurrent networks with hyperbolic tangent hidden units. Raw or preprocessed data is fed to the ANN, and the outputs of the ANN are used as observations for a parametric probability density function such as a Gaussian mixture. 
One may view either the ANN as an adaptive preprocessor for the Gaussian mixture, or the Gaussian mixture as a statistical postprocessor for the ANN. A useful role for the ANN would be to transform the input data so that it can be more efficiently modeled by a Gaussian mixture. An interesting situation is one in which most of the input data points can be described in a lower dimensional space. In this case, it is desired that the ANN learn the possibly non-linear transformation to a more compact representation. \n\nIn the first part of this paper, we briefly describe a hybrid of ANNs and Hidden Markov Models (HMM) for continuous speech recognition. More details on this system can be found in (Bengio 91). In this hybrid, all the free parameters are simultaneously optimized with respect to a single criterion. In recent years, many related combinations have been studied (e.g., Levin 90, Bridle 90, Bourlard & Wellekens 90). These approaches are often motivated by observed advantages and disadvantages of ANNs and HMMs in speech recognition (Bourlard & Wellekens 89, Bridle 90). Experiments of phoneme recognition on the TIMIT database with the proposed ANN/HMM hybrid are reported. The task under study is the recognition (or spotting) of plosive sounds in continuous speech. Comparative results on this task show that the hybrid performs better than the ANN alone, better than the ANN followed by a dynamic programming based postprocessor using duration constraints, and better than the HMM alone. Furthermore, a global optimization of all the parameters of the system also yielded better performance than a separate optimization. \n\nIn the second part of this paper, we attempt to extend some of the findings of the first part, in order to use the same basic architecture (ANNs followed by Gaussian mixtures) to perform density estimation. 
We establish the relationship between the network input and output densities, and we then describe a few experiments exploring how to perform density estimation with this system. \n\n2 ANN/HMM HYBRID \n\nIn a HMM, the likelihood of the observations, given the model, depends in a simple continuous way on the observations. It is therefore possible to compute the derivative of an optimization criterion C with respect to the observations of the HMM. For example, one may use the criterion of the Maximum Likelihood (ML) of the observations, or of the Maximum Mutual Information (MMI) between the observations and the correct sequence. If the observation at each instant is the vector output, Yi, of an ANN, then one can use this gradient, ∂C/∂Yi, to optimize the parameters of the ANN with back-propagation. See (Bridle 90, Bottou 91, Bengio 91, Bengio et al 92) on ways to compute this gradient. \n\n2.1 EXPERIMENTS \n\nA preliminary experiment has been performed using a prototype system based on the integration of ANNs with HMMs. The ANN was initially trained based on a prior task decomposition. The task is the recognition of plosive phonemes pronounced by a large speaker population. The 1988 version of the TIMIT continuous speech database has been used for this purpose. SI and SX sentences from regions 2, 3 and 6 were used, with 1080 training sentences and 224 test sentences, 135 training speakers and 28 test speakers. The following 8 classes have been considered: /p/, /t/, /k/, /b/, /d/, /g/, /dx/, /all other phones/. Speaker-independent recognition of plosive phonemes in continuous speech is a particularly difficult task because these phonemes are made of short and non-stationary events that are often confused with other acoustically similar consonants or may be merged with other unit segments by a recognition system. 
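\n\nThe coupling at the heart of this hybrid, differentiating the likelihood of the HMM observations with respect to the ANN outputs and feeding that gradient into back-propagation, can be sketched for a single observation frame. The following is a minimal illustration (with hypothetical mixture parameters, not those of the system described here): the gradient of a diagonal-Gaussian-mixture log-density with respect to its input observation, checked against finite differences.

```python
import numpy as np

def mixture_logpdf_and_grad(y, means, variances, priors):
    """Log-density of a diagonal-Gaussian mixture at observation y,
    and its gradient with respect to y (the quantity that would be
    back-propagated into the ANN producing y)."""
    diff = y - means                                   # (K, d)
    log_comp = (np.log(priors)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff ** 2 / variances, axis=1))
    log_p = np.logaddexp.reduce(log_comp)              # log sum_k p_k N_k(y)
    resp = np.exp(log_comp - log_p)                    # posterior responsibilities
    # d log p / d y = sum_k resp_k * (mu_k - y) / var_k
    grad = np.sum(resp[:, None] * (-diff / variances), axis=0)
    return log_p, grad

# hypothetical 2-component mixture in 2 dimensions
means = np.array([[0.0, 0.0], [1.0, -1.0]])
variances = np.array([[0.5, 0.5], [0.3, 0.8]])
priors = np.array([0.6, 0.4])
y = np.array([0.3, -0.2])

logp, grad = mixture_logpdf_and_grad(y, means, variances, priors)
# finite-difference check of the analytic gradient
eps = 1e-6
for i in range(len(y)):
    e = np.zeros_like(y); e[i] = eps
    num = (mixture_logpdf_and_grad(y + e, means, variances, priors)[0]
           - mixture_logpdf_and_grad(y - e, means, variances, priors)[0]) / (2 * eps)
    assert abs(num - grad[i]) < 1e-5
```

In a hybrid of this kind, such a per-frame gradient (weighted by the HMM state occupancies) would play the role of the supervised error signal at the ANN's output layer.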
\n\nFigure 1: Architecture of the ANN/HMM Hybrid for the Experiments. (Level 1: specialized networks, initially trained to recognize broad phonetic classes or to perform some specialized task, e.g. plosive discrimination, on preprocessed speech; Level 2: initially trained to compute principal components of the lower level outputs; Level 3: the HMMs, from which the gradient is propagated back.) \n\nThe ANNs were trained with back-propagation and on-line weight update. As discussed in (Bengio 91), speech knowledge is used to design the input, output, and architecture of the system and of each one of the networks. The experimental system is based on the scheme shown in Figure 1. The architecture is built on three levels. The approach that we have taken is to select different input parameters and different ANN architectures depending on the phonetic features to be recognized. At level 1, two ANNs are initially trained to perform respectively plosive recognition (ANN3) and broad classification of phonemes (ANN2). ANN3 has delays and recurrent connections and is trained to recognize static articulatory features of plosives in a way that depends on the place of articulation of the right context phoneme. ANN2 has delays but no recurrent connections. The design of ANN2 and ANN3 is described in more detail in (Bengio 91). At level 2, ANN1 acts as an integrator of parameters generated by the specialized ANNs of level 1. ANN1 is a linear network that initially computes the 8 principal components of the concatenated output vectors of the lower level networks (ANN2 and ANN3). In the experiment described below, the combined network (ANN1+ANN2+ANN3) has 23578 weights. 
Level 3 contains the HMMs, in which each distribution is modeled by a Gaussian mixture with 5 densities. See (Bengio et al 92) for more details on the topology of the HMM. The covariance matrix is assumed to be diagonal since the observations are initially principal components, and this assumption significantly reduces the number of parameters to be estimated. After one iteration of ML re-estimation of the HMM parameters only, all the parameters of the hybrid system were simultaneously tuned to maximize the ML criterion for the next 2 iterations. Because of the simplicity of the implementation of the hybrid trained with ML, this criterion was used in these experiments. Although such an optimization may theoretically worsen performance^1, we observed a marked improvement in performance after the final global tuning. This may be explained by the fact that a nearby local maximum of the likelihood is attained from the initial starting point based on prior and separate training of the ANN and the HMM. \n\n^1 In section 3, we consider maximization of the likelihood of the inputs of the network, not the outputs of the network. \n\nTable 1: Comparative Recognition Results. % recognized = 100 - % substitutions - % deletions. % accuracy = 100 - % substitutions - % deletions - % insertions. \n\n                         % rec   % ins   % del   % subs   % acc \nANNs alone                  85      32    0.04       15      53 \nHMMs alone                  76     6.3     2.2     22.3      69 \nANNs+DP                     88      16    0.01       11      72 \nANNs+HMM                    87     6.8     0.9       12      81 \nANNs+HMM+global opt.        90     3.8     1.4      9.0      86 \n\nIn order to assess the value of the proposed approach as well as the improvements brought by the HMM as a post-processor for time alignment, the performance of the hybrid system was evaluated and compared with that of a simple post-processor applied to the outputs of the ANNs and with that of a standard dynamic programming postprocessor that models duration probabilities for each phoneme. The simple post-processor assigns a symbol to each output frame of the ANNs by comparing the target output vectors with actual output vectors. It then smooths the resulting string to remove very short segments and merges consecutive segments that have the same symbol. The dynamic programming (DP) postprocessor finds the sequence of phones that minimizes a cost that imposes durational constraints for each phoneme. In the HMM alone system, the observations are the cepstrum and the energy of the signal, as well as their derivatives. Comparative results for the three systems are summarized in Table 1. \n\n3 DENSITY ESTIMATION WITH AN ANN \n\nIn this section, we consider an extension of the system of the previous section. The objective is to perform density estimation of the inputs of the ANN. Instead of maximizing a criterion that depends on the density of the outputs of an ANN, we maximize the likelihood of the inputs of the ANN. Hence the ANN is more than a preprocessor for the Gaussian mixtures; it is part of the probability density function that is to be estimated. Instead of representing a pdf only with a set of spatially local functions or kernels such as Gaussians (Silverman 86), we explore how to use a global transformation such as the one performed by an ANN in order to represent a pdf. Let us first define some notation: f_X(x) := p(X = x), f_Y(y) := p(Y = y), and f_X|Y(x)(x) := p(X = x | Y = y(x)). 
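\n\nBefore stating the input/output density relationship formally, a quick numerical sanity check in one dimension may help: if Y = y(X) is a smooth bijection and f_Y is a proper density, then f_X(x) = f_Y(y(x)) |y'(x)| must integrate to one. The sketch below uses a hypothetical transformation y(x) = x^3 + x (not from the paper) with a standard Gaussian f_Y:

```python
import numpy as np

# Hypothetical bijection R -> R and its derivative
def y_of_x(x):
    return x ** 3 + x

def dy_dx(x):
    return 3 * x ** 2 + 1

# Output density f_Y: standard Gaussian
def f_Y(y):
    return np.exp(-0.5 * y ** 2) / np.sqrt(2 * np.pi)

# Change-of-variables density of the input, f_X(x) = f_Y(y(x)) |y'(x)|
x = np.linspace(-10.0, 10.0, 200001)
f_X = f_Y(y_of_x(x)) * np.abs(dy_dx(x))

# Trapezoidal integration: f_X should integrate to (numerically) one
total = float(np.sum(0.5 * (f_X[1:] + f_X[:-1]) * np.diff(x)))
assert abs(total - 1.0) < 1e-4
```

This is the n_x = n_y case; the theorem below extends it to networks with fewer outputs than inputs.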
\n\n3.1 RELATION BETWEEN INPUT PDF AND OUTPUT PDF \n\nTheorem. Suppose a random variable Y (e.g., the outputs of an ANN) is a deterministic parametric function y(X) of a random variable X (here, the inputs of the ANN), where y and x are vectors of dimension n_y and n_x. Let J = ∂(y_1, y_2, ..., y_ny) / ∂(x_1, x_2, ..., x_nx) be the Jacobian of the transformation from X to Y, let J = U D V^t be a singular value decomposition of J, and let s(x) = | ∏_{i=1..ny} D_ii | be the product of the singular values. Suppose Y is modeled by a probability density function f_Y(y). Then, for n_x >= n_y and s(x) > 0, \n\n   f_X(x) = f_Y(y(x)) f_X|Y(x)(x) s(x)     (1) \n\nProof. In the case in which n_x = n_y, by the change of variable y -> x in the integral \n\n   ∫_Ω_y f_Y(y) dy = 1     (2) \n\nwe obtain the following result^2: \n\n   f_X(x) = f_Y(y(x)) | Determinant(J) |     (3) \n\nLet us now consider the case n_y < n_x, i.e., the network has fewer outputs than inputs. In order to do so we will introduce an intermediate transformation to a space Z of dimension n_x in which some dimensions directly correspond to Y. Define Z such that ∂(z_1, z_2, ..., z_nx) / ∂(x_1, x_2, ..., x_nx) = V^t. Decompose Z into Z' and Z'': \n\n   z' = (z_1, ..., z_ny),   z'' = (z_ny+1, ..., z_nx)     (4) \n\nThere is a one-to-one mapping y_z(z') between Z' and Y, and its Jacobian is U D', where D' is the matrix composed of the first n_y columns of D. 
Perform the change of variables y -> z' in the integral of equation 2: \n\n   ∫_Ω_z' f_Y(y_z(z')) s dz' = 1     (5) \n\nSince p(z'' | z') is a conditional pdf, \n\n   ∫_Ω_z'' p(z'' | z') dz'' = 1     (6) \n\nIn order to make a change of variable to the variable x, we have to specify the conditional pdf f_X|Y(x)(x) and the corresponding pdf p(z'' | z') = p(z'', z' | z') =^3 p(z | y) =^4 f_X|Y(x)(x). Hence, multiplying the two integrals in equations 5 and 6, we obtain the following: \n\n   1 = ∫_Ω_z'' p(z'' | z') dz'' ∫_Ω_z' f_Y(y_z(z')) s dz' = ∫_Ω_z f_Y(y_z(z')) p(z'' | z') s dz     (7) \n\nand substituting z -> V^t x: \n\n   ∫_Ω_x f_Y(y(x)) f_X|Y(x)(x) s(x) dx = 1     (8) \n\nwhich yields the general result of equation 1. □ \n\nUnfortunately, it is not clear how to efficiently evaluate f_X|Y(x)(x) and then compute its derivative with respect to the network weights. In the experiments described in the next section we first study empirically the simpler case in which n_x = n_y. \n\n^2 In that case, | Determinant(J) | = s and f_X|Y(x)(x) = 1. \n^3 Knowing z' is equivalent to knowing y. \n^4 Because z = V^t x and Determinant(V) = 1. \n\nFigure 2: First Series of Experiments on Density Estimation with an ANN, for data generated on a non-linear input curve. From left to right: input samples; density of the input, X, estimated with ANN+Gaussian; ANN that maps X to Y; density of the output, Y, as estimated by a Gaussian. \n\n3.2 ESTIMATION OF THE PARAMETERS \n\nWhen estimating a pdf, one can approximate the functions f_Y(y) and y(x) by parameterized functions. For example, we consider for the output pdf the class of densities f_Y(y; θ) modeled by a Gaussian mixture of a certain number of components, where θ is a set of means, variances and mixing proportions. For the non-linear transformation y(x; w) from X to Y, we choose an ANN, defined by its architecture and the values of its weights w. 
In order to choose values for the Gaussian and ANN parameters, one can maximize the a-posteriori (MAP) probability of these parameters given the data or, if no prior is known or assumed, maximize the likelihood (ML) of the input data given the parameters. In the preliminary experiments described here, the logarithm of the likelihood of the data was maximized, i.e., the optimal parameters are defined as follows: \n\n   (θ, w) = argmax_{θ,w} Σ_{x ∈ Ξ} log f_X(x)     (9) \n\nwhere Ξ is the set of input samples. \n\nIn order to estimate a density with the above described system, one computes the derivative of p(X = x | θ, w) with respect to w. If the output pdf is a Gaussian mixture, we re-estimate its parameters θ with the EM algorithm (only f_Y(y) depends on θ in the expression for f_X(x) in equations 3 or 1). Differentiating equation 3 with respect to w yields: \n\n   ∂/∂w (log f_X(x)) = ∂/∂w (log f_Y(y(x; w); θ)) + Σ_{i,j} ∂/∂J_ij (log(Determinant(J))) ∂J_ij/∂w     (10) \n\nThe derivative of the logarithm of the determinant can be computed simply as follows (Bottou 91): \n\n   ∂/∂J_ij (log(Determinant(J))) = (J^-1)_ji     (11) \n\nsince, for all A, Determinant(A) = Σ_j A_ij Cofactor_ij(A), and (A^-1)_ij = Cofactor_ji(A) / Determinant(A). \n\nFigure 3: Second Series of Experiments on Density Estimation with an ANN. From left to right: input samples, density with non-linear net + Gaussian, output samples after network transformation. \n\n3.3 EXPERIMENTS \n\nThe first series of experiments verified that a transformation of the inputs with an ANN could improve the likelihood of the inputs and that gradient ascent in the ML criterion could find a good solution. 
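\n\nFor a linear network with a fixed unit-variance Gaussian output density, the whole estimation loop of equations 9 to 11 fits in a few lines. The sketch below (hypothetical synthetic data and learning rate, not the paper's experiments) ascends the average log likelihood log f_Y(Wx) + log |Determinant(W)|, using the (W^-1)^t rule of equation 11 for the determinant term:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data concentrated near a diagonal line (not the
# paper's speech data): large variance along the line, small across it.
n = 500
t = rng.normal(size=n)
noise = 0.05 * rng.normal(size=n)
X = np.stack([t + noise, t - noise], axis=1)          # (n, 2)

def avg_loglik(W, X):
    # f_X(x) = N(Wx; 0, I) * |det W|  (linear net, fixed unit Gaussian)
    Y = X @ W.T
    log_gauss = -0.5 * np.sum(Y ** 2, axis=1) - np.log(2 * np.pi)
    return np.mean(log_gauss) + np.log(abs(np.linalg.det(W)))

W = np.eye(2)
lr = 0.05
start = avg_loglik(W, X)
for _ in range(200):
    Y = X @ W.T
    # gradient of the Gaussian term: mean of -y x^T; of log|det W|: (W^-1)^T
    grad = -(Y.T @ X) / n + np.linalg.inv(W).T
    W = W + lr * grad                                 # gradient ascent
end = avg_loglik(W, X)
assert end > start
```

Consistent with the linear experiment described below, the learned W acquires a large gain in the direction in which the data barely varies, which is what drives the likelihood up.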
In these experiments, we attempt to model some two-dimensional data extracted from a speech database. The 1691 training data points are shown in the left of Figure 2. In the first experiment, a diagonal Gaussian is used, with no ANN. In the second experiment, a linear network and a diagonal Gaussian are used. In the third experiment, a non-linear network with 4 hidden units and a diagonal Gaussian are used. The average log likelihoods obtained on a test set of 617 points were -3.00, -2.95 and -2.39 respectively for the three experiments. The estimated input and output pdfs for the last experiment are depicted in Figure 2, with white indicating high density and black low density. \n\nThe second series of experiments addresses the following question: if we use a Gaussian mixture with diagonal covariance matrix and most of the data is on a non-linear hypersurface Φ of dimension less than n_x, can the ANN's outputs separate the dimensions in which the data varies greatly (along Φ) from those in which it almost doesn't (orthogonal to Φ)? Intuitively, it appears that this will be the case, because the variance of outputs which don't vary with the data will be close to zero, while the determinant of the Jacobian is non-zero. The likelihood will correspondingly tend to infinity. The first experiment in this series verified that this was the case for linear networks. For data generated on a diagonal line in 2-dimensional space, the resulting network separated the "variant" dimension from the "invariant" dimension, with one of the output dimensions having near zero variance, and the transformed data lying on a line parallel to the other output dimension. \n\nExperiments with non-linear networks suggest that with such networks, a solution that separates the variant dimensions from the invariant ones is not easily found by gradient ascent. 
However, it was possible to show that such a solution was at a maximum (possibly local) of the likelihood. A last experiment was designed to demonstrate this. The input data, shown in Figure 3, was artificially generated to make sure that a solution existed. The network had 2 inputs, 3 hidden units and 2 outputs. The input samples and the input density corresponding to the weights in a maximum of the likelihood are displayed in Figure 3, along with the transformed input data for those weights. The points are projected by the ANN to a line parallel to the first output dimension. Any variation of the weights from that solution, in the direction of the gradient, even with a learning rate as small as 10^-14, yielded either no perceptible improvement or a decrease in likelihood. \n\n4 CONCLUSION \n\nThis paper has studied an architecture in which an ANN performs a non-linear transformation of the data to be analyzed, and the output of the ANN is modeled by a Gaussian mixture. The design of the ANN can incorporate prior knowledge about the problem, for example to modularize the task and perform an initial training of the sub-networks. In phoneme recognition experiments, an ANN/HMM hybrid based on this architecture performed better than the ANN alone or the HMM alone. In the second part of the paper, we have shown how the pdf of the inputs of the network relates to the pdf of the outputs of the network. The objective of this work is to perform density estimation with a non-local non-linear transformation of the data. Preliminary experiments showed that such estimation was possible and that it did improve the likelihood of the resulting pdf with respect to using only a Gaussian pdf. We also studied how this system could perform a non-linear analogue to principal components analysis. \n\nReferences \n\nBengio Y. 1991. 
Artificial Neural Networks and their Application to Sequence Recognition. PhD Thesis, School of Computer Science, McGill University, Montreal, Canada. \n\nBengio Y., De Mori R., Flammia G., and Kompe R. 1992. Phonetically motivated acoustic parameters for continuous speech recognition using artificial neural networks. To appear in Speech Communication. \n\nBottou L. 1991. Une approche théorique à l'apprentissage connexionniste; applications à la reconnaissance de la parole. Doctoral Thesis, Université de Paris-Sud, France. \n\nBourlard H. and Wellekens C.J. 1989. Speech pattern discrimination and multilayer perceptrons. Computer, Speech and Language, vol. 3, pp. 1-19. \n\nBridle J.S. 1990. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. Advances in Neural Information Processing Systems 2, (ed. D.S. Touretzky) Morgan Kaufmann Publ., pp. 211-217. \n\nLevin E. 1990. Word recognition using hidden control neural architecture. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Albuquerque, NM, April 90, pp. 433-436. \n\nSilverman B.W. 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York, NY. \n", "award": [], "sourceid": 521, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Renato", "family_name": "De Mori", "institution": null}, {"given_name": "Giovanni", "family_name": "Flammia", "institution": null}, {"given_name": "Ralf", "family_name": "Kompe", "institution": null}]}