{"title": "ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1165, "page_last": 1171, "abstract": null, "full_text": "ALGONQUIN - Learning dynamic noise models from noisy speech for robust speech recognition \n\nBrendan J. Frey1, Trausti T. Kristjansson1, Li Deng2, Alex Acero2 \n\n1 Probabilistic and Statistical Inference Group, University of Toronto \nhttp://www.psi.toronto.edu \n2 Speech Technology Group, Microsoft Research \n\nAbstract \n\nA challenging, unsolved problem in the speech recognition community is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recognition is to automatically remove the noise from the cepstrum sequence before feeding it into a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise-free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for probabilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech. Noise statistics may change significantly during an utterance, so that speech-free frames are not sufficient for estimating the noise model. In this paper, we show how the noise model can be learned even when the data contains speech. In particular, the noise model can be learned from the test utterance and then used to denoise the test utterance. The approximate inference technique is used as an approximate E step in a generalized EM algorithm that learns the parameters of the noise model from a test utterance. 
For both Wall Street Journal data with added noise samples and the Aurora benchmark, we show that the new noise adaptive technique performs as well as or significantly better than the non-adaptive algorithm, without the need for a separate training set of noise examples. \n\n1 Introduction \n\nTwo main approaches to robust speech recognition include \"recognizer domain approaches\" (Varga and Moore 1990; Gales and Young 1996), where the acoustic recognition model is modified or retrained to recognize noisy, distorted speech, and \"feature domain approaches\" (Boll 1979; Deng et al. 2000; Attias et al. 2001; Frey et al. 2001), where the features of noisy, distorted speech are first denoised and then fed into a speech recognition system whose acoustic recognition model is trained on clean speech. \n\nOne advantage of the feature domain approach over the recognizer domain approach is that the speech modeling part of the denoising model can have much lower complexity than the full acoustic recognition model. This can lead to a much faster overall system, since the denoising process uses probabilistic inference in a much smaller model. Also, since the complexity of the denoising model is much lower than the complexity of the recognizer, the denoising model can be adapted to new environments more easily, or a variety of denoising models can be stored and applied as needed. \n\nWe model the log-spectra of clean speech, noise, and channel impulse response function using mixtures of Gaussians. (In contrast, Attias et al. (2001) model autoregressive coefficients.) The relationship between these log-spectra and the log-spectrum of the noisy speech is nonlinear, leading to a posterior distribution over the clean speech that is a mixture of non-Gaussian distributions. 
We show how a variational technique that makes use of an iterative 2nd order vector Taylor series approximation can be used to infer the clean speech and compute sufficient statistics for a generalized EM algorithm that can learn the noise model from noisy speech. \n\nOur method, called ALGONQUIN, improves on previous work using the vector Taylor series approximation (Moreno 1996) by modeling the variance of the noise and channel instead of using point estimates, by modeling the noise and channel as a mixture model instead of a single component model, by iterating Laplace's method to track the clean speech instead of applying it once at the model centers, by accounting for the error in the nonlinear relationship between the log-spectra, and by learning the noise model from noisy speech. \n\n2 ALGONQUIN's Probability Model \n\nFor clarity, we present a version of ALGONQUIN that treats frames of log-spectra independently. The extension of the version presented here to HMM models of speech, noise and channel distortion is analogous to the extension of a mixture of Gaussians to an HMM with Gaussian outputs. \n\nFollowing (Moreno 1996), we derive an approximate relationship between the log-spectra of the clean speech, noise, channel and noisy speech. Assuming additive noise and linear channel distortion, the windowed FFT Y(f) for a particular frame (25 ms duration, spaced at 10 ms intervals) of noisy speech is related to the FFTs of the channel H(f), clean speech S(f) and additive noise N(f) by \n\nY(f) = H(f)S(f) + N(f). (1) \n\nWe use a mel-frequency scale, in which case this relationship is only approximate. However, it is quite accurate if the channel frequency response is roughly constant across each mel-frequency filter band. \n\nFor brevity, we will assume H(f) = 1 in the remainder of this paper. Assuming there is no channel distortion simplifies the description of the algorithm. 
To see how channel distortion can be accounted for in a nonadaptive way, see (Frey et al. 2001). The technique described in this paper for adapting the noise model can be extended to adapting the channel model. \n\nAssuming H(f) = 1, the energy spectrum is obtained as follows: \n\n|Y(f)|^2 = Y(f)*Y(f) = S(f)*S(f) + N(f)*N(f) + 2Re(N(f)*S(f)) \n= |S(f)|^2 + |N(f)|^2 + 2Re(N(f)*S(f)), \n\nwhere \"*\" denotes complex conjugation. If the phases of the noise and the speech are uncorrelated, the last term in the above expression is small and we can approximate the energy spectrum as follows: \n\n|Y(f)|^2 ≈ |S(f)|^2 + |N(f)|^2. (2) \n\nAlthough we could model these spectra directly, they are constrained to be nonnegative. To make density modeling easier, we model the log-spectrum instead. An additional benefit of this approach is that channel distortion is an additive effect in the log-spectrum domain. \n\nLetting y be the vector containing the log-spectrum log|Y(f)|^2, and similarly for s and n, we can rewrite (2) as \n\nexp(y) ≈ exp(s) + exp(n) = exp(s) ∘ (1 + exp(n - s)), (3) \n\nwhere the exp() function operates in an element-wise fashion on its vector argument and the \"∘\" symbol indicates element-wise product. \n\nTaking the logarithm, we obtain a function g() that is an approximate mapping of s and n to y (see (Moreno 1996) for more details): \n\ny ≈ g([s n]^T) = s + ln(1 + exp(n - s)). (4) \n\n\"T\" indicates matrix transpose and ln() and exp() operate on the individual elements of their vector arguments. \n\nAssuming the errors in the above approximation are Gaussian, the observation likelihood is \n\np(y|s,n) = N(y; g([s n]^T), Ψ), (5) \n\nwhere Ψ is the diagonal covariance matrix of the errors. A more precise approximation to the observation likelihood can be obtained by writing Ψ as a function of s and n, but we assume Ψ is constant for clarity. 
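As a sanity check on the mapping in (4), note that exponentiating g([s n]^T) recovers the energy-domain relation (2) exactly. A minimal sketch in Python (an illustration only, not the paper's implementation; the array values are arbitrary):

```python
import numpy as np

def g(s, n):
    """Approximate mapping from clean-speech and noise log-spectra (s, n)
    to the noisy-speech log-spectrum y: y ~ s + ln(1 + exp(n - s))."""
    # log1p is used for numerical stability when exp(n - s) is small.
    return s + np.log1p(np.exp(n - s))

# Illustrative log-spectrum vectors for one frame.
s = np.array([2.0, 0.5, -1.0])   # clean speech
n = np.array([1.0, 1.5, -0.5])   # noise
y = g(s, n)

# Exponentiating recovers the additive energy-domain relation of (2):
# exp(y) = exp(s) * (1 + exp(n - s)) = exp(s) + exp(n).
assert np.allclose(np.exp(y), np.exp(s) + np.exp(n))
```

This also makes the familiar behavior of the mismatch function visible: when the noise dominates (n >> s), g(s, n) approaches n, and when the speech dominates, it approaches s.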
\nUsing a prior p(s, n), the goal of denoising is to infer the log-spectrum of the clean speech s, given the log-spectrum of the noisy speech y. The minimum squared error estimate of s is ŝ = ∫_s s p(s|y), where p(s|y) ∝ ∫_n p(y|s,n)p(s,n). This inference is made difficult by the fact that the nonlinearity g([s n]^T) in (5) makes the posterior non-Gaussian even if the prior is Gaussian. In the next section, we show how an iterative variational method that uses a 2nd order vector Taylor series approximation can be used for approximate inference and learning. \n\nWe assume that a priori the speech and noise are independent - p(s, n) = p(s)p(n) - and we model each using a separate mixture of Gaussians. c^s = 1, ..., N^s is the class index for the clean speech and c^n = 1, ..., N^n is the class index for the noise. The mixing proportions and Gaussian components are parameterized as follows: \n\np(s) = Σ_{c^s} p(c^s)p(s|c^s), p(c^s) = π^s_{c^s}, p(s|c^s) = N(s; μ^s_{c^s}, Σ^s_{c^s}), \np(n) = Σ_{c^n} p(c^n)p(n|c^n), p(c^n) = π^n_{c^n}, p(n|c^n) = N(n; μ^n_{c^n}, Σ^n_{c^n}). (6) \n\nWe assume the covariance matrices Σ^s_{c^s} and Σ^n_{c^n} are diagonal. \n\nCombining (5) and (6), the joint distribution over the noisy speech, clean speech class, clean speech vector, noise class and noise vector is \n\np(y, s, c^s, n, c^n) = N(y; g([s n]^T), Ψ) π^s_{c^s} N(s; μ^s_{c^s}, Σ^s_{c^s}) π^n_{c^n} N(n; μ^n_{c^n}, Σ^n_{c^n}). (7) \n\nUnder this joint distribution, the posterior p(s, n|y) is a mixture of non-Gaussian distributions. In fact, for a given speech class and noise class, the posterior p(s, n|c^s, c^n, y) may have multiple modes. So, exact computation of ŝ is intractable and we use an approximation. \n\n3 Approximating the Posterior \n\nFor the current frame of noisy speech y, ALGONQUIN approximates the posterior using a simpler, parameterized distribution, q: \n\np(s, c^s, n, c^n|y) ≈ q(s, c^s, n, c^n). (8)
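Because all covariances in (7) are diagonal, the log of the joint factors into per-dimension Gaussian terms and is cheap to evaluate. The sketch below shows this evaluation under our own naming conventions (prior_s, prior_n, psi and the dictionary layout are illustrative assumptions, not the paper's code):

```python
import numpy as np

def diag_gauss_logpdf(x, mu, var):
    """Log-density of a diagonal-covariance Gaussian N(x; mu, diag(var))."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

def log_joint(y, s, n, cs, cn, prior_s, prior_n, psi):
    """Log of the joint in (7) for one frame and one (cs, cn) class pair.
    prior_s / prior_n are dicts holding per-class 'pi', 'mu', 'var' arrays
    (hypothetical layout); psi is the diagonal of the error covariance."""
    g = s + np.log1p(np.exp(n - s))              # mismatch function of (4)
    lp = diag_gauss_logpdf(y, g, psi)            # observation likelihood (5)
    lp += np.log(prior_s["pi"][cs]) + diag_gauss_logpdf(
        s, prior_s["mu"][cs], prior_s["var"][cs])
    lp += np.log(prior_n["pi"][cn]) + diag_gauss_logpdf(
        n, prior_n["mu"][cn], prior_n["var"][cn])
    return lp
```

Summing exp(log_joint) over the N^s * N^n class pairs and integrating over s and n would give p(y); it is exactly this integral that the nonlinearity makes intractable, motivating the variational approximation of Section 3.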
\n\nThe \"variational parameters\" of q are adjusted to make this approximation accurate, and then q is used as a surrogate for the true posterior when computing ŝ and learning the noise model (c.f. (Jordan et al. 1998)). \n\nFor each c^s and c^n, we approximate p(s, n|c^s, c^n, y) by a Gaussian, \n\n(9) \n\nwhere η^s_{c^s c^n} and η^n_{c^s c^n} are the approximate posterior means of the speech and noise for classes c^s and c^n, and